### Installations
I use `pip` to install the modules I need that are not yet in my Python environment.

In [1]:
# !pip install transformers
# !pip install requests
# !pip install torch torchvision
# !pip install openai

### Import Libraries

In [2]:
# Import libraries

import requests
import time
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm
import string

### Scraping Using Requests

In [3]:
# Subreddit to scrape
thread = 'Crypto_com' # replace with any other subreddit you want to track

In [4]:
# Scraper 

header = {'User-agent': 'ep 0.1.3'}

# Set empty list to store posts
posts = []

# Set param as none for first iteration
after = None

# Iterate through 5 pages of 25 posts 
for i in tqdm(range(5)):
    if after is None:
        param = {}
    else:
        param = {'after': after}
    url = 'https://www.reddit.com/r/'+thread+'/.json'
    results = requests.get(url, params=param, headers=header)
    if results.status_code == 200: # Check if request successful
        res_json = results.json()
        posts.extend(res_json['data']['children'])
        after = res_json['data']['after']
    else:
        print(results.status_code)
        break
    #  Rest time in seconds
    time.sleep(1)


posts = pd.DataFrame(posts)
lst = {}
lst['post_title'] =[]
lst['content'] =[]
lst['name'] =[]
lst['upvotecount'] =[]
lst['num_comments'] =[]
lst['comment_url'] =[]
for i in posts['data']:
    if i['name'] not in lst['name']: # Records each post only once ('name' is the post's unique ID, e.g. t3_16yma4v)
        lst['post_title'].append(i['title'])
        lst['content'].append(i['selftext'])
        lst['name'].append(i['name'])
        lst['upvotecount'].append(i['ups'])
        lst['num_comments'].append(i['num_comments'])
        lst['comment_url'].append(i['permalink'])

print('Successfully scraped {} unique posts'.format(len(lst['post_title'])))
scrapped_thread = pd.DataFrame(lst)

Successfully scraped 126 unique posts
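
The scraper above pages through the listing by passing the previous response's `after` value back as a parameter. Below is a minimal sketch of that cursor logic on its own, with a made-up `fake_fetch` function standing in for `requests.get(...).json()` (the function, the page contents and the cursor values are all hypothetical). One difference from the cell above: the sketch stops as soon as `after` comes back empty, which I believe is how Reddit signals the last page.

```python
from typing import Callable, Optional


def collect_pages(fetch_page: Callable[[Optional[str]], dict], max_pages: int = 5) -> list:
    """Follow a listing's `after` cursor for at most `max_pages` pages."""
    items = []
    after = None  # no cursor on the first request
    for _ in range(max_pages):
        page = fetch_page(after)
        items.extend(page['data']['children'])
        after = page['data']['after']
        if after is None:  # last page reached, nothing more to request
            break
    return items


# Hypothetical stand-in for the real HTTP call: three fake pages keyed by cursor.
FAKE_PAGES = {
    None: {'data': {'children': ['a', 'b'], 'after': 't3_b'}},
    't3_b': {'data': {'children': ['c', 'd'], 'after': 't3_d'}},
    't3_d': {'data': {'children': ['e'], 'after': None}},
}


def fake_fetch(after):
    return FAKE_PAGES[after]


all_items = collect_pages(fake_fetch)
print(all_items)
```

Stopping on an empty cursor avoids requesting the first page again, which would otherwise add duplicate posts that then have to be filtered out.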


When running this code, I encountered posts long enough to exceed the OpenAI API token limit. I therefore set such posts aside and process them separately.

In [5]:
# # remove problem thread (in this case, lengthy)
# prob_thread = "/r/Crypto_com/comments/15wolm7/does_fortune_favor_the_brave_a_lengthy_opinion/"
# scrapped_thread = scrapped_thread[scrapped_thread.comment_url != prob_thread]
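
Instead of hard-coding the URL of a problem post, lengthy posts could be flagged automatically. The sketch below is only a rough illustration: the ratio of about 4 characters per token and the 3000-token cut-off are assumptions of mine, not values from the OpenAI documentation or from this notebook, and the two posts are made up.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; the characters-per-token ratio is an assumption."""
    return int(len(text) / chars_per_token) + 1


def split_by_length(posts, max_tokens=3000):
    """Separate posts into (short, long) using the rough token estimate."""
    short, long_ = [], []
    for post in posts:
        text = post['post_title'] + ' ' + post['content']
        if estimate_tokens(text) > max_tokens:
            long_.append(post)
        else:
            short.append(post)
    return short, long_


# Made-up posts using the same column names as the scraped dataset
posts = [
    {'post_title': 'Short post', 'content': 'A few words.'},
    {'post_title': 'Long post', 'content': 'word ' * 5000},
]
short_posts, long_posts = split_by_length(posts)
print(len(short_posts), len(long_posts))
```

For an exact count, a real tokenizer would be needed; the estimate is only meant to catch obvious outliers before they reach the API.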

In [6]:
# Save dataset 
file_name='./'+thread+'_thread.csv'
scrapped_thread.to_csv(file_name, index=False)

### Filtering posts by "Viral" definition

In [7]:
# Import data
df_thread = pd.read_csv(file_name)

#Define percentile to filter viral threads 
pct = 0.8

#Define thresholds
num_com_threshold = df_thread.num_comments.quantile(pct)
upvote_threshold = df_thread.upvotecount.quantile(pct)
print(
    
    'Number of Comments Stats - min: '+str(df_thread.num_comments.min())+', max: '+str(df_thread.num_comments.max())+
    '\nUpvotes Stats - min: '+str(df_thread.upvotecount.min())+', max: '+str(df_thread.upvotecount.max())+
    '\nPercentile used for identifying viral posts: '+str(pct*100)+'%'+
    '\n\nViral = posts with more than '+str(int(num_com_threshold))+' comments and '+str(int(upvote_threshold))+' upvotes'
    )

#Filter by comments and upvote count thresholds
df_thread = df_thread[(df_thread.num_comments>=num_com_threshold)&(df_thread.upvotecount>=upvote_threshold)]
print('\nNumber of viral posts identified: '+str(df_thread.shape[0]))

#Combine title and content
df_thread = df_thread.replace(np.nan, '', regex=True)
df_thread['title_content'] = df_thread.post_title + ' ' + df_thread.content

# Data processing 
# Remove rows with attachments
# df_title_content = df_title_content[
#                                         (df_title_content.title_content.str.contains("png")==False)&
#                                         (df_title_content.title_content.str.contains("ampx200b\n\nhttpsredditcom")==False)&
#                                         (df_title_content.title_content.str.contains("&amp;#x200B")==False)
#                                     ]
# df_title_content = df_title_content.reset_index(drop=True)
df_thread

Number of Comments Stats - min: 0, max: 209
Upvotes Stats - min: 0, max: 138
Percentile used for identifying viral posts: 80.0%

Viral = posts with more than 17 comments and 18 upvotes

Number of viral posts identified: 16


Unnamed: 0,post_title,content,name,upvotecount,num_comments,comment_url,title_content
0,Heads up #CROFam — the Crypto.com Shop has arr...,,t3_16yma4v,34,38,/r/Crypto_com/comments/16yma4v/heads_up_crofam...,Heads up #CROFam — the Crypto.com Shop has arr...
13,CRO Defi estimated annual rewards...,Hi there\n\nI haven't been paying attention to...,t3_17kz6zx,18,31,/r/Crypto_com/comments/17kz6zx/cro_defi_estima...,CRO Defi estimated annual rewards... Hi there\...
37,CRO value restabilised? Maybe a good time to b...,It seems that CRO is stable again and it's bet...,t3_17idwf4,57,209,/r/Crypto_com/comments/17idwf4/cro_value_resta...,CRO value restabilised? Maybe a good time to b...
38,CRO pumping? Calm down boys? Zoom out,"So there's some excitement with CRO ""pumping"" ...",t3_17ie79l,31,54,/r/Crypto_com/comments/17ie79l/cro_pumping_cal...,CRO pumping? Calm down boys? Zoom out So there...
54,Cronos blockchain - when will CDC do something...,"Back at its launch, the Cronos blockchain was ...",t3_17hc7aa,20,19,/r/Crypto_com/comments/17hc7aa/cronos_blockcha...,Cronos blockchain - when will CDC do something...
55,This current CRO pump is just this tiny speck ...,See that half millimeter uptick? CRO .05 TO .0...,t3_17gsbwm,138,88,/r/Crypto_com/comments/17gsbwm/this_current_cr...,This current CRO pump is just this tiny speck ...
70,This is cool 🤣,,t3_17g0z4k,22,17,/r/Crypto_com/comments/17g0z4k/this_is_cool/,This is cool 🤣
81,Keep accumulating before a new breakthrough 🚀📈,Yesterday I bought another 20K CRO on top of t...,t3_17f7j3x,47,54,/r/Crypto_com/comments/17f7j3x/keep_accumulati...,Keep accumulating before a new breakthrough 🚀📈...
84,"With BTC price surging, CEXes experienced a su...",,t3_17ew1uc,104,43,/r/Crypto_com/comments/17ew1uc/with_btc_price_...,"With BTC price surging, CEXes experienced a su..."
90,Can’t Log into Exchange App,So BTC price goes up and now all of a sudden w...,t3_17exuw8,20,56,/r/Crypto_com/comments/17exuw8/cant_log_into_e...,Can’t Log into Exchange App So BTC price goes ...
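
To make the "viral" rule concrete, here is a small standalone sketch of the same idea on made-up numbers: compute the 80th percentile of comments and of upvotes, then keep the posts that reach both thresholds. The `quantile` helper is my own plain-Python version using linear interpolation, which I believe matches the default behaviour of pandas `quantile`. Note that the filter is inclusive (`>=`), so a post sitting exactly on both thresholds counts as viral.

```python
import math


def quantile(values, q):
    """Quantile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q
    lower = math.floor(pos)
    upper = math.ceil(pos)
    fraction = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Made-up posts for illustration only
posts = [
    {'title': 'a', 'num_comments': 0, 'ups': 1},
    {'title': 'b', 'num_comments': 5, 'ups': 3},
    {'title': 'c', 'num_comments': 10, 'ups': 50},
    {'title': 'd', 'num_comments': 20, 'ups': 8},
    {'title': 'e', 'num_comments': 40, 'ups': 60},
    {'title': 'f', 'num_comments': 100, 'ups': 90},
]

pct = 0.8
comment_cut = quantile([p['num_comments'] for p in posts], pct)
upvote_cut = quantile([p['ups'] for p in posts], pct)

# A post must reach BOTH thresholds to count as viral
viral = [p['title'] for p in posts
         if p['num_comments'] >= comment_cut and p['ups'] >= upvote_cut]
print(comment_cut, upvote_cut, viral)
```

Because both conditions must hold, fewer than 20% of the posts usually pass: post `c` has many upvotes but too few comments and is dropped.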


### Scraping comments of the viral posts
After filtering for posts that have garnered some attention, I also scrape their comments to capture the discussions.

In [8]:
# Scrape comments of the viral posts

header = {'User-agent': 'ep 0.1.2'}
posts1=[]
for i in df_thread.comment_url:
    thread_comment = str(i)
    url = 'https://www.reddit.com'+thread_comment+'.json' # the permalink already ends with '/'
    results = requests.get(url, headers=header)
    if results.status_code == 200: # Check if request successful
        res_json1 = results.json()
        posts1.extend(res_json1[1]['data']['children'])
    else:
        print(results.status_code)
        break
    #  Rest time in seconds
    time.sleep(1)

    
posts1 = pd.DataFrame(posts1)
lst = {}
lst['comment'] =[]
lst['comment_url'] =[]
lst['upvotecount'] =[]
for i in posts1['data']:
    try:
        lst['comment'].append(i['body'])
        lst['comment_url'].append(i['permalink'])
        lst['upvotecount'].append(i['ups'])
    except KeyError:
        # Skip entries without a comment body (e.g. "load more comments" placeholders)
        pass
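
As far as I can tell, the cell above keeps only the top-level comments of each post, because replies are nested inside each comment's `replies` field. The sketch below shows one way the nested replies could be walked as well. The sample payload is hand-made to mimic the shape I understand Reddit's JSON to have (`kind` of `t1` for comments, `more` for placeholders, an empty string when there are no replies); treat that shape as an assumption.

```python
def flatten_comments(children):
    """Yield (body, ups) for every comment in a listing, including nested replies."""
    for child in children:
        if child.get('kind') != 't1':  # skip 'more' placeholders
            continue
        data = child['data']
        yield data['body'], data['ups']
        replies = data.get('replies')
        # An empty string is used when a comment has no replies
        if isinstance(replies, dict):
            yield from flatten_comments(replies['data']['children'])


# Hand-made sample: one comment with a reply and a placeholder, one without replies
sample_children = [
    {'kind': 't1', 'data': {'body': 'top', 'ups': 5, 'replies': {
        'data': {'children': [
            {'kind': 't1', 'data': {'body': 'reply', 'ups': 2, 'replies': ''}},
            {'kind': 'more', 'data': {'count': 3}},
        ]}}}},
    {'kind': 't1', 'data': {'body': 'second', 'ups': 1, 'replies': ''}},
]

flat = list(flatten_comments(sample_children))
print(flat)
```

In the notebook this would be called on `res_json1[1]['data']['children']`; whether the extra replies improve the summaries is something I have not tested.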


In [9]:
# Save dataset to file
comment_thread = pd.DataFrame(lst)
comment_thread.to_csv('reddit_viralpost_comments.csv', index=False)
comment_thread.shape
# comment_thread = pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in lst.items() ]))

(250, 3)

### Data processing of the comments

In [10]:
#Import dataset
comment_thread_df = pd.read_csv('reddit_viralpost_comments.csv')


# #Define thresholds
# def extremities(array, pct):
#     return np.percentile(array, [100 - pct, pct])

# arr = comment_thread_df.upvotecount.sort_values().array
# upvote_threshold1 = extremities(arr, 90)[0]
# upvote_threshold2 = extremities(arr, 90)[1]

# print(
#     'Comments upvote thresholds: \nLower : '+"{:.2f}".format(upvote_threshold1)+', \nUpper : '+"{:.2f}".format(upvote_threshold2)
#     )

# #Filter by comments and upvote count thresholds
# comment_thread_df = comment_thread_df[(comment_thread_df.upvotecount<=upvote_threshold1)|(comment_thread_df.upvotecount>=upvote_threshold2)]


# Remove rows where comment is '[deleted]'
comment_thread_df = comment_thread_df[comment_thread_df.comment != '[deleted]'].reset_index(drop=True)
print('\nNumber of rows: '+str(comment_thread_df.shape[0]))

# Strip the trailing comment id and slash (8 characters, assuming 7-character ids)
# so that comment_url matches the permalink of the parent post
comment_thread_df.comment_url = comment_thread_df.comment_url.str[:-8]
# combined = comment_thread_df.merge(df_thread[['title_content','comment_url']], on='comment_url', how='left')

comment_thread_df



Number of rows: 249


Unnamed: 0,comment,comment_url,upvotecount
0,Finally! You dont need to be private to get th...,/r/Crypto_com/comments/16yma4v/heads_up_crofam...,18
1,As a Private member that has received the Welc...,/r/Crypto_com/comments/16yma4v/heads_up_crofam...,10
2,Before they open this stupid shop they should ...,/r/Crypto_com/comments/16yma4v/heads_up_crofam...,5
3,I don’t understand why you’d want to walk arou...,/r/Crypto_com/comments/16yma4v/heads_up_crofam...,9
4,Why would anyone pay money for this?,/r/Crypto_com/comments/16yma4v/heads_up_crofam...,5
...,...,...,...
244,Coinbase only does a $30 charge if you buy the...,/r/Crypto_com/comments/17bk0cj/best_place_to_s...,1
245,I found that if you make smaller transactions ...,/r/Crypto_com/comments/17bk0cj/best_place_to_s...,2
246,Coinbase 1:1 with no wire fees to cash out to ...,/r/Crypto_com/comments/17bk0cj/best_place_to_s...,1
247,May be a little bit of topic but basically the...,/r/Crypto_com/comments/17bk0cj/best_place_to_s...,1
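
The URL clean-up above removes a fixed number of characters, which works as long as every comment id has the same length. A sketch of an alternative that drops the last path segment instead is shown below; the example permalinks and their comment ids are made up.

```python
def post_permalink(comment_permalink: str) -> str:
    """Drop the last path segment (the comment id) and keep the trailing slash."""
    trimmed = comment_permalink.rstrip('/')
    parent, _, _comment_id = trimmed.rpartition('/')
    return parent + '/'


# Made-up comment permalink with a 7-character id
example = '/r/Crypto_com/comments/16yma4v/heads_up/abc1234/'
print(post_permalink(example))
```

For 7-character ids this gives the same result as slicing off the last 8 characters, and it keeps working if the id length ever differs.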


### Sentiment analysis

Here I import a sentiment model from Hugging Face, trained on Twitter data, to label each text as positive, negative or neutral.
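
To make the labelling step concrete: the model returns one raw score (logit) per class, the scores are turned into probabilities with a softmax, and the class with the highest probability becomes the label. The sketch below does this in plain Python on made-up logits. The label order (negative, neutral, positive) is what I understand the model's mapping file to contain, so treat it as an assumption.

```python
import math

LABELS = ['negative', 'neutral', 'positive']  # assumed order of the model outputs


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def top_label(logits):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


label, prob = top_label([-1.2, 0.3, 2.1])  # made-up logits
print(label, round(prob, 3))
```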

In [12]:
# Use sentiment model from hugging face: https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment
# Import libraries 
from transformers import AutoModelForSequenceClassification
from transformers import TFAutoModelForSequenceClassification
from transformers import AutoTokenizer
import numpy as np
from scipy.special import softmax
import csv
import urllib.request
from datetime import datetime
import ssl

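# Workaround: make urllib skip SSL certificate verification
# (this affects the label-mapping download below)
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context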


# Tasks:
# emoji, emotion, hate, irony, offensive, sentiment
# stance/abortion, stance/atheism, stance/climate, stance/feminist, stance/hillary

# now = datetime.now().strftime("%d%m%Y_%H%M%S")
task='sentiment'
MODEL = f"cardiffnlp/twitter-roberta-base-{task}"

tokenizer = AutoTokenizer.from_pretrained(MODEL)

# download label mapping
labels=[]
mapping_link = f"https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/{task}/mapping.txt"
with urllib.request.urlopen(mapping_link) as f:
    html = f.read().decode('utf-8').split("\n")
    csvreader = csv.reader(html, delimiter='\t')
labels = [row[1] for row in csvreader if len(row) > 1]

# PT
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.save_pretrained(MODEL)


In [13]:

# Preprocess text (username and link placeholders)
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)
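
A quick check of what `preprocess` does, on a made-up comment: user mentions become `@user` and links become `http`, which I understand to be the placeholders the Twitter-trained model expects. The function is repeated here so the snippet runs on its own.

```python
def preprocess(text):
    """Replace user mentions and links with the placeholders '@user' and 'http'."""
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)


cleaned = preprocess('@bob see https://example.com now')  # made-up comment
print(cleaned)
```

A lone `@` is left unchanged because of the `len(t) > 1` check.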


In [14]:
# check number of comments scraped for each viral post
comment_thread_df.comment_url.value_counts()

/r/Crypto_com/comments/17exuw8/cant_log_into_exchange_app/                           32
/r/Crypto_com/comments/17idwf4/cro_value_restabilised_maybe_a_good_time_to_buy/      32
/r/Crypto_com/comments/17gsbwm/this_current_cro_pump_is_just_this_tiny_speck_at/     25
/r/Crypto_com/comments/17ie79l/cro_pumping_calm_down_boys_zoom_out/                  21
/r/Crypto_com/comments/17ew1uc/with_btc_price_surging_cexes_experienced_a_sudden/    19
/r/Crypto_com/comments/17f7j3x/keep_accumulating_before_a_new_breakthrough/          14
/r/Crypto_com/comments/16yma4v/heads_up_crofam_the_cryptocom_shop_has_arrived/       14
/r/Crypto_com/comments/17bts50/is_there_any_news_on_the_future_of_crocronos/         14
/r/Crypto_com/comments/17bk0cj/best_place_to_swap_usdc/                              14
/r/Crypto_com/comments/17g0z4k/this_is_cool/                                         13
/r/Crypto_com/comments/17hc7aa/cronos_blockchain_when_will_cdc_do_something_with/    11
/r/Crypto_com/comments/17kz6zx/c

In [15]:
# # identify problem thread
# df_thread.to_csv('check.csv')
# # prob_thread = "/r/Crypto_com/comments/15wolm7/does_fortune_favor_the_brave_a_lengthy_opinion/"

Here I consolidate each post's content and comments into one corpus, run it through the sentiment analysis model, and use OpenAI to summarise the discussion.

In [16]:
# Get comments for viral posts and do sentiment analysis and summary
# Import library
import openai

def summarize_corpus(corpus):
    # Set up OpenAI API credentials
    openai.api_key = 'xx' #replace with own key

    # Provide the prompt and settings for the API call
    # prompt = 'Act as a market analyst, summarise the following text and detail what users want : ' + corpus
    prompt = 'Summarise the following text: ' + corpus
    max_tokens = 500  # Maximum number of tokens for the summary

    # Call the OpenAI API to generate the summary
    # Note: openai.Completion is the legacy interface of openai<1.0
    response = openai.Completion.create(
        engine='text-davinci-003',
        prompt=prompt,
        max_tokens=max_tokens,
        temperature=0.3,
        n=1,
        stop=None
    )

    # Extract the generated summary from the API response
    summary = response.choices[0].text.strip()

    return summary

sums = [] #list to store summaries
for i in df_thread.comment_url.unique().tolist():
# for i in df_thread[df_thread.comment_url != prob_thread].comment_url.unique().tolist():
    # Post title and content first, followed by the comments scraped for that post
    texts = comment_thread_df[comment_thread_df.comment_url == i].comment.to_list()
    texts.insert(0,df_thread[df_thread.comment_url == i].title_content.iloc[0])
    df = pd.DataFrame(texts, columns=['text'])
    sentiment=[]
    for k in texts:
        text = preprocess(k)
        encoded_input = tokenizer(text, truncation=True, max_length=511, return_tensors='pt')
        output = model(**encoded_input)
        scores = output[0][0].detach().numpy()
        scores = softmax(scores)

        # Keep the label with the highest probability
        ranking = np.argsort(scores)
        ranking = ranking[::-1]
        sentiment.append(labels[ranking[0]])

    df['sentiment']=sentiment

    # For long threads (more than 40 rows), keep only the first 31 rows
    if df.shape[0] > 40:
        df = df.iloc[:31]

    # Use openai API to summarise the dataset
    corpus = df.text
    summary = summarize_corpus(corpus.to_json())
    sums.append(summary)
    # i[31:-1] drops the '/r/Crypto_com/comments/<post id>/' prefix and the trailing slash
    print('Successful for '+str(i[31:-1]))




Successful for heads_up_crofam_the_cryptocom_shop_has_arrived
Successful for cro_defi_estimated_annual_rewards
Successful for cro_value_restabilised_maybe_a_good_time_to_buy
Successful for cro_pumping_calm_down_boys_zoom_out
Successful for cronos_blockchain_when_will_cdc_do_something_with
Successful for this_current_cro_pump_is_just_this_tiny_speck_at
Successful for this_is_cool
Successful for keep_accumulating_before_a_new_breakthrough
Successful for with_btc_price_surging_cexes_experienced_a_sudden
Successful for cant_log_into_exchange_app
Successful for weve_got_something_special_for_new_cryptocom_visa
Successful for ups_downs_is_so_whack
Successful for community_burn_pool_update
Successful for gm_its_really_that_simple
Successful for is_there_any_news_on_the_future_of_crocronos
Successful for best_place_to_swap_usdc
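
The corpus sent to the API is the post text followed by its comments, serialised to JSON. The sketch below shows one way such a corpus could be assembled with a size cap, dropping trailing comments until the payload fits. This is a swapped-in technique, not what the cell above does (it truncates by row count), and the `max_chars` budget is an arbitrary assumption. The index-keyed JSON object mirrors, as far as I know, what `Series.to_json()` produces.

```python
import json


def to_payload(texts):
    """Serialise texts as an index-keyed JSON object, e.g. {"0": "...", "1": "..."}."""
    return json.dumps({str(i): t for i, t in enumerate(texts)})


def build_corpus(post_text, comments, max_chars=8000):
    """Return the JSON corpus, dropping trailing comments until it fits max_chars."""
    texts = [post_text] + list(comments)
    while len(texts) > 1 and len(to_payload(texts)) > max_chars:
        texts.pop()  # drop the last comment and try again
    return to_payload(texts)


# Made-up post and comments, with a tiny budget to force truncation
corpus = build_corpus('Title body', ['first comment', 'second comment'], max_chars=50)
print(corpus)
```

The post text is always kept, so a single very long post can still exceed the budget and would need separate handling.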


In [17]:
# # add back manual summary
# # prob_thread = "/r/Crypto_com/comments/165382g/finding_your_feelings_about_cdc_cro_based_off_of/"
# df_out = pd.DataFrame(comment_thread_df[comment_thread_df.comment_url == prob_thread].comment.to_list(), columns=['text'])
# corpus_out = df_out.text
# summary_out = summarize_corpus(corpus_out.to_json())
# summary_out
# sums.append(summary_out)


In [18]:
# summary_out

I also generate a second summary with a different prompt to get a sense of what users may be asking for in the discussions.

In [19]:
# Use openai API to summarise the dataset
# Import library
import openai

def summarize_corpus(corpus):
    # Set up OpenAI API credentials
    openai.api_key = 'xx' #replace with own key

    # Provide the prompt and settings for the API call
    prompt = 'Act as a market analyst, summarise the following text and detail what users want : ' + corpus
    max_tokens = 500  # Maximum number of tokens for the summary

    # Call the OpenAI API to generate the summary
    # Note: openai.Completion is the legacy interface of openai<1.0
    response = openai.Completion.create(
        engine='text-davinci-003',
        prompt=prompt,
        max_tokens=max_tokens,
        temperature=0.3,
        n=1,
        stop=None
    )

    # Extract the generated summary from the API response
    summary = response.choices[0].text.strip()

    return summary

sums1 = [] #list to store summaries
for i in df_thread.comment_url.unique().tolist():
# for i in df_thread[df_thread.comment_url != prob_thread].comment_url.unique().tolist():
    # Post title and content first, followed by the comments scraped for that post
    texts = comment_thread_df[comment_thread_df.comment_url == i].comment.to_list()
    texts.insert(0,df_thread[df_thread.comment_url == i].title_content.iloc[0])
    df = pd.DataFrame(texts, columns=['text'])
    sentiment=[]
    for k in texts:
        text = preprocess(k)
        encoded_input = tokenizer(text, truncation=True, max_length=511, return_tensors='pt')
        output = model(**encoded_input)
        scores = output[0][0].detach().numpy()
        scores = softmax(scores)

        # Keep the label with the highest probability
        ranking = np.argsort(scores)
        ranking = ranking[::-1]
        sentiment.append(labels[ranking[0]])

    df['sentiment']=sentiment

    # For long threads (more than 40 rows), keep only the first 31 rows
    if df.shape[0] > 40:
        df = df.iloc[:31]

    corpus = df.text
    summary = summarize_corpus(corpus.to_json())
    sums1.append(summary)
    # i[31:-1] drops the '/r/Crypto_com/comments/<post id>/' prefix and the trailing slash
    print('Successful for '+str(i[31:-1]))

Successful for heads_up_crofam_the_cryptocom_shop_has_arrived
Successful for cro_defi_estimated_annual_rewards
Successful for cro_value_restabilised_maybe_a_good_time_to_buy
Successful for cro_pumping_calm_down_boys_zoom_out
Successful for cronos_blockchain_when_will_cdc_do_something_with
Successful for this_current_cro_pump_is_just_this_tiny_speck_at
Successful for this_is_cool
Successful for keep_accumulating_before_a_new_breakthrough
Successful for with_btc_price_surging_cexes_experienced_a_sudden
Successful for cant_log_into_exchange_app
Successful for weve_got_something_special_for_new_cryptocom_visa
Successful for ups_downs_is_so_whack
Successful for community_burn_pool_update
Successful for gm_its_really_that_simple
Successful for is_there_any_news_on_the_future_of_crocronos
Successful for best_place_to_swap_usdc
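
Since the two summary cells differ only in their prompt, the prompts could be kept in one place and applied to the same corpus in a single pass. Below is a minimal sketch of that idea. `fake_complete` is a hypothetical stand-in for the OpenAI call so the snippet runs without an API key, and `summarise_all` is a helper name I made up.

```python
PROMPTS = {
    'summary': 'Summarise the following text: ',
    'summary1': 'Act as a market analyst, summarise the following text and detail what users want : ',
}


def summarise_all(corpus, complete, prompts=PROMPTS):
    """Run every prompt over the same corpus; `complete` maps a prompt string to text."""
    return {name: complete(prefix + corpus) for name, prefix in prompts.items()}


def fake_complete(prompt):
    """Hypothetical stand-in for the API call: just reports the prompt length."""
    return '[summary of ' + str(len(prompt)) + ' chars]'


row = summarise_all('some corpus', fake_complete)
print(row)
```

Each returned dictionary has one entry per prompt, so a list of them can be passed straight to `pd.DataFrame` to get the `summary` and `summary1` columns.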


In [20]:
# # add back manual summary

# corpus_out1 = df_out.text
# summary_out1 = summarize_corpus(corpus_out1.to_json())
# summary_out1
# sums1.append(summary_out1)


In [22]:
df_fin = pd.DataFrame({'summary':sums,'summary1':sums1,'comment_url':df_thread.comment_url.unique().tolist()})

In [29]:
df_fin

Unnamed: 0,summary,summary1,comment_url
0,The Crypto.com Shop has arrived and is in beta...,Users want to be able to purchase Crypto.com m...,/r/Crypto_com/comments/16yma4v/heads_up_crofam...
1,CRO's DeFi returns have seen a decrease in the...,Users want to know why CRO Defi returns have f...,/r/Crypto_com/comments/17kz6zx/cro_defi_estima...
2,CRO value has restabilised between $0.04 and $...,Users want to buy CRO because it has restabili...,/r/Crypto_com/comments/17idwf4/cro_value_resta...
3,"There is some excitement with CRO ""pumping"" bu...","Users want to see CRO ""pumping"" and are excite...",/r/Crypto_com/comments/17ie79l/cro_pumping_cal...
4,"At its launch, the Cronos blockchain was well ...",Users of the Cronos blockchain are disappointe...,/r/Crypto_com/comments/17hc7aa/cronos_blockcha...
5,CRO has seen a half millimeter uptick from .05...,"Users want to buy crypto, particularly CRO, as...",/r/Crypto_com/comments/17gsbwm/this_current_cr...
6,This text is about a game with meme coins and ...,Users want to know more about a game with meme...,/r/Crypto_com/comments/17g0z4k/this_is_cool/
7,The text discusses the current market activity...,"Users want to accumulate CRO, a cryptocurrency...",/r/Crypto_com/comments/17f7j3x/keep_accumulati...
8,Crypto.com recorded over 1 billion spot volume...,Users want to see the crypto market continue t...,/r/Crypto_com/comments/17ew1uc/with_btc_price_...
9,Many users are experiencing difficulty logging...,Users are experiencing issues with logging int...,/r/Crypto_com/comments/17exuw8/cant_log_into_e...


In [24]:
df_fin.to_csv(thread+'_viralposts_url_summary.csv')