# Reddit Classification Project: Analyzing r/bitcoin and r/wallstreetbets

### Problem: Can we accurately predict a subreddit from a Reddit post's title using classification modeling?

#### Data Collection/Generation
We will use the Pushshift API to obtain data from reddit.com. Below is a test request.

In [1]:
import requests #imports
import pandas as pd

In [2]:
base_url = 'https://api.pushshift.io'
submission_endpt = '/reddit/submission/search' #we will use submission titles
comment_endpt = '/reddit/comment/search'

In [3]:
params = {
    'subreddit': 'bitcoin',
    'size': 500
}
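For reference, parameters like these are sent to the API as a query string appended to the endpoint. The following is a minimal sketch of that encoding using only the standard library; `full_url` is a name introduced here for illustration, and `requests` is assumed to produce an equivalent URL when it is given the same dictionary.

```python
from urllib.parse import urlencode

base_url = 'https://api.pushshift.io'
submission_endpt = '/reddit/submission/search'
params = {'subreddit': 'bitcoin', 'size': 500}

# join the endpoint and the encoded parameters into the full request URL
full_url = base_url + submission_endpt + '?' + urlencode(params)
print(full_url)
```

Printing the URL this way is a quick check that the subreddit name and page size are what we intended before sending any request.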

In [4]:
res = requests.get(base_url + submission_endpt, params)

In [5]:
res.status_code #success

200

In [6]:
data = res.json()

In [7]:
posts = data['data']

In [8]:
posts[:1] #it works!

[{'all_awardings': [],
  'allow_live_comments': False,
  'author': 'ElectronJonSA',
  'author_flair_css_class': None,
  'author_flair_richtext': [],
  'author_flair_text': None,
  'author_flair_type': 'text',
  'author_fullname': 't2_eil99',
  'author_patreon_flair': False,
  'author_premium': False,
  'awarders': [],
  'can_mod_post': False,
  'contest_mode': False,
  'created_utc': 1620070234,
  'domain': 'self.Bitcoin',
  'full_link': 'https://www.reddit.com/r/Bitcoin/comments/n451ip/blockchain_funds_dissapeared/',
  'gildings': {},
  'id': 'n451ip',
  'is_crosspostable': True,
  'is_meta': False,
  'is_original_content': False,
  'is_reddit_media_domain': False,
  'is_robot_indexable': True,
  'is_self': True,
  'is_video': False,
  'link_flair_background_color': '',
  'link_flair_richtext': [],
  'link_flair_text_color': 'dark',
  'link_flair_type': 'text',
  'locked': False,
  'media_only': False,
  'no_follow': True,
  'num_comments': 0,
  'num_crossposts': 0,
  'over_18': False

In [9]:
len(posts) #we asked for 500, but only 100 posts come back per request

100

In [10]:
posts[-1]['created_utc']

1620056031

In [11]:
df = pd.DataFrame(posts)

In [12]:
df.columns #check all columns

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_patreon_flair',
       'author_premium', 'awarders', 'can_mod_post', 'contest_mode',
       'created_utc', 'domain', 'full_link', 'gildings', 'id',
       'is_crosspostable', 'is_meta', 'is_original_content',
       'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video',
       'link_flair_background_color', 'link_flair_richtext',
       'link_flair_text_color', 'link_flair_type', 'locked', 'media_only',
       'no_follow', 'num_comments', 'num_crossposts', 'over_18',
       'parent_whitelist_status', 'permalink', 'pinned', 'pwls',
       'retrieved_on', 'score', 'selftext', 'send_replies', 'spoiler',
       'stickied', 'subreddit', 'subreddit_id', 'subreddit_subscribers',
       'subreddit_type', 'thumbnail', 'title', 'total_awards_received',
       'treatment_tags', 'upvote_ratio',

In [13]:
df[['subreddit', 'author', 'title']]

| | subreddit | author | title |
|---|---|---|---|
| 0 | Bitcoin | ElectronJonSA | Blockchain funds dissapeared? |
| 1 | Bitcoin | Azntigerlion | I've been investing in Bitcoin since 2012, but... |
| 2 | Bitcoin | alanalanal | HELP! Accidentally sent BTC to USDT account in... |
| 3 | Bitcoin | kadudu888 | 9 people out of 10 I talk to are still not in.... |
| 4 | Bitcoin | kadudu888 | BTFD is as equally important as HODL. I am lik... |
| ... | ... | ... | ... |
| 95 | Bitcoin | bitcointothemoon_ | eBay Still Looking at Crypto Payments, Mulls NFTs |
| 96 | Bitcoin | Internet-profit | A very important way to profit from the Internet |
| 97 | Bitcoin | Courtneyanders22 | u/rBitcoinMod |
| 98 | Bitcoin | viramarket | Qubit.life главные новости апреля. |


In [14]:
df['created_utc'].min() #oldest post in this test batch; the full pull starts from a fixed cutoff instead: 1619467783 (Monday, April 26, 2021 8:09:43 PM UTC)

1620056031
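Epoch values are hard to read at a glance, so it can help to convert them to dates. The sketch below uses the standard library; `epoch_to_utc` is a helper name introduced here for illustration, and it assumes the timestamps are in seconds and should be read as UTC.

```python
from datetime import datetime, timezone

def epoch_to_utc(epoch_seconds):
    """Convert a Unix epoch timestamp (in seconds) to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)

oldest_in_test_batch = epoch_to_utc(1620056031)  # value returned by the cell above
cutoff_for_full_pull = epoch_to_utc(1619467783)  # cutoff mentioned in the comment above

print(oldest_in_test_batch.isoformat())
print(cutoff_for_full_pull.isoformat())
```

The second value corresponds to April 26, 2021 at 20:09:43 UTC, which matches the date given in the comment.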

In [15]:
import time #requests must be spaced out, so our function will pause 2.5 seconds between them

#### Subreddit Posts Requesting Function
We create a function that requests 100 posts at a time for the desired number of loops.

In [16]:
def get_posts(subreddit, num_loops, epoch_time):
    subreddit_list = []

    for i in range(num_loops): #each loop requests up to 100 submissions
        res = requests.get(
            base_url + submission_endpt, #we are looking at submissions (titles)
            params = {
                'subreddit': subreddit,
                'size': 100, #maxed out at 100 posts per request
                'before': epoch_time #only return posts created before this epoch time
            }
        )
        res.raise_for_status() #stop on a failed request instead of parsing an error response

        df = pd.DataFrame(res.json()['data']) #put this page of posts into a dataframe
        if df.empty: #no older posts are left, so stop early
            break
        subreddit_list.append(df) #collect the page in our list
        epoch_time = int(df['created_utc'].min()) #the oldest post on this page becomes the next cutoff
        time.sleep(2.5) #pause between requests; anything faster got interrupted

    #return an empty dataframe if nothing was collected
    return pd.concat(subreddit_list, axis = 0) if subreddit_list else pd.DataFrame()
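The paging logic can be tried without touching the network. The sketch below replaces the API with a hypothetical in-memory stand-in (`fake_fetch` and `FAKE_POSTS` are invented for illustration) and assumes that `before` is exclusive, meaning only posts strictly older than the cutoff are returned. `paginate` is likewise an illustrative name, not part of the notebook.

```python
from typing import Callable, Dict, List

Post = Dict[str, int]

def paginate(fetch_page: Callable[[int], List[Post]], num_loops: int, epoch_time: int) -> List[Post]:
    """Collect pages of posts, moving the `before` cutoff back after each page."""
    collected: List[Post] = []
    for _ in range(num_loops):
        page = fetch_page(epoch_time)
        if not page:  # nothing older is left, so stop early
            break
        collected.extend(page)
        # the oldest post on this page becomes the next cutoff
        epoch_time = min(post['created_utc'] for post in page)
    return collected

# Hypothetical stand-in for the API: 250 fake posts with timestamps 1..250
FAKE_POSTS = [{'created_utc': t} for t in range(1, 251)]

def fake_fetch(before: int, size: int = 100) -> List[Post]:
    """Return up to `size` fake posts strictly older than `before`, newest first."""
    older = [p for p in FAKE_POSTS if p['created_utc'] < before]
    return sorted(older, key=lambda p: p['created_utc'], reverse=True)[:size]

posts = paginate(fake_fetch, num_loops=5, epoch_time=251)
unique_timestamps = {p['created_utc'] for p in posts}
print(len(posts), len(unique_timestamps))
```

With 250 fake posts and a page size of 100, the loop collects three pages and stops on the fourth, empty one. If the real API treated `before` as inclusive, the boundary post would show up on two consecutive pages and would need to be de-duplicated.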

In [121]:
btc = get_posts('bitcoin', 100, 1619467783) #10,000 posts from each subreddit seems fair

In [122]:
wsb = get_posts('wallstreetbets', 100, 1619467783)  # another 10000 
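The loop count follows from the page size: 10,000 posts at 100 posts per request means 100 requests per subreddit. A small sketch of that arithmetic is below, including the minimum time spent sleeping. `loops_needed` is a helper introduced here for illustration, and the estimate ignores the time each request itself takes.

```python
import math

PAGE_SIZE = 100       # posts returned per request
SLEEP_SECONDS = 2.5   # pause between requests

def loops_needed(target_posts, page_size=PAGE_SIZE):
    """Number of requests required to collect at least `target_posts` posts."""
    return math.ceil(target_posts / page_size)

loops = loops_needed(10_000)
min_wait_seconds = loops * SLEEP_SECONDS  # time spent sleeping only, per subreddit

print(loops, min_wait_seconds)
```

So each subreddit pull spends at least a little over four minutes just waiting between requests.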

In [123]:
btc.to_csv('../data/btc.csv', index = False) #save as a csv, using the same path we read from later

In [124]:
wsb.to_csv('../data/wsb.csv', index = False)

In [17]:
btc = pd.read_csv('../data/btc.csv') 
wsb = pd.read_csv('../data/wsb.csv')


In [18]:
btc #our bitcoin dataframe

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,banned_by,gallery_data,is_gallery,distinguished,suggested_sort,author_cakeday,link_flair_template_id,crosspost_parent,crosspost_parent_list,edited
0,[],False,Scream1e,,[],,text,t2_h86qf,False,False,...,,,,,,,,,,
1,[],False,bel-svoboda,,[],,text,t2_brdd7tig,False,False,...,,,,,,,,,,
2,[],False,heist95,,[],,text,t2_f4d3e,False,False,...,,,,,,,,,,
3,[],False,Manic_Miner2,noob,"[{'e': 'text', 't': 'redditor for 3 days'}]",redditor for 3 days,richtext,t2_as7o6s78,False,False,...,,,,,,,,,,
4,[],False,Crafty_Supermarket15,,[],,text,t2_8dg4924u,False,False,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,[],False,Knarson,,[],,text,t2_4hm8vwfq,False,False,...,,,,,,,,,,
9996,[],False,Felefix98,noob,"[{'e': 'text', 't': 'redditor for 3 months'}]",redditor for 3 months,richtext,t2_a0v0rnlt,False,False,...,,,,,,,,,,
9997,[],False,Fun-Recognition-5830,noob,"[{'e': 'text', 't': 'redditor for 2 weeks'}]",redditor for 2 weeks,richtext,t2_b0yccbrd,False,False,...,,,,,,,,,,
9998,[],False,[deleted],,,,,,,,...,,,,,,,,,,


In [19]:
wsb #our wallstreetbets dataframe

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,media_embed,secure_media,secure_media_embed,gallery_data,is_gallery,media_metadata,author_flair_template_id,author_cakeday,banned_by,edited
0,[],False,DSDUDE2,,[],,text,t2_bmetrda7,False,False,...,,,,,,,,,,
1,[],False,geo_mvp_2nite,,[],,text,t2_8esgmn9k,False,False,...,,,,,,,,,,
2,[],False,IN-B4-404,,[],,text,t2_eweag,False,False,...,,,,,,,,,,
3,[],False,Dhaimoran,,[],,text,t2_f9j6l,False,False,...,,,,,,,,,,
4,[],False,Honestcapshonest,,[],,text,t2_a1cusiag,False,False,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,[],False,fk232323,,[],,text,t2_a155pl45,False,False,...,,,,,,,,,,
9996,[],False,StockWizard_,,[],,text,t2_a2v3udue,False,False,...,,,,,,,,,,
9997,[],False,z00tsuitnboogie,,[],,text,t2_a7gl7vpj,False,False,...,,,,,,,,,,
9998,[],False,DistinguishedJB,,[],,text,t2_7ftpfqo3,False,False,...,,,,,,,,,,


In [20]:
subreddit_df = pd.concat([btc, wsb], ignore_index = True) #combine them for analysis
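A quick sanity check after combining is to confirm that the index was rebuilt and that both classes are present in the expected amounts. The sketch below uses two tiny made-up frames (`btc_demo` and `wsb_demo` are illustrative, not the real data) to show what `ignore_index = True` does and how `value_counts` reports the class balance.

```python
import pandas as pd

# tiny illustrative frames standing in for the real btc and wsb data
btc_demo = pd.DataFrame({'subreddit': ['Bitcoin'] * 3, 'title': ['a', 'b', 'c']})
wsb_demo = pd.DataFrame({'subreddit': ['wallstreetbets'] * 2, 'title': ['d', 'e']})

# ignore_index=True discards the original row labels and renumbers from 0
combined = pd.concat([btc_demo, wsb_demo], ignore_index=True)

# count how many rows belong to each subreddit (our target classes)
counts = combined['subreddit'].value_counts()

print(list(combined.index))
print(counts.to_dict())
```

Without `ignore_index = True`, the combined frame would keep both original sets of row labels, so labels such as 0 and 1 would appear twice.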

In [21]:
subreddit_df.to_csv('../data/compiled_subreddit_data.csv', index = False) #save

In [22]:
subreddit_df

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,gallery_data,is_gallery,distinguished,suggested_sort,author_cakeday,link_flair_template_id,crosspost_parent,crosspost_parent_list,edited,link_flair_css_class
0,[],False,Scream1e,,[],,text,t2_h86qf,False,False,...,,,,,,,,,,
1,[],False,bel-svoboda,,[],,text,t2_brdd7tig,False,False,...,,,,,,,,,,
2,[],False,heist95,,[],,text,t2_f4d3e,False,False,...,,,,,,,,,,
3,[],False,Manic_Miner2,noob,"[{'e': 'text', 't': 'redditor for 3 days'}]",redditor for 3 days,richtext,t2_as7o6s78,False,False,...,,,,,,,,,,
4,[],False,Crafty_Supermarket15,,[],,text,t2_8dg4924u,False,False,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19995,[],False,fk232323,,[],,text,t2_a155pl45,False,False,...,,,,confidence,,da18a43a-83c5-11e8-9b6c-0e287561ddb8,,,,yolo
19996,[],False,StockWizard_,,[],,text,t2_a2v3udue,False,False,...,,,,confidence,,96f6c79e-b853-11e5-a4cb-0ebdf030e05d,,,,question
19997,[],False,z00tsuitnboogie,,[],,text,t2_a7gl7vpj,False,False,...,,,,confidence,,96f6c79e-b853-11e5-a4cb-0ebdf030e05d,,,,question
19998,[],False,DistinguishedJB,,[],,text,t2_7ftpfqo3,False,False,...,,,,confidence,,0513bea8-4f64-11e9-886d-0e2b4fe7300c,,,,meme


## We have our desired DataFrame! Now it's time to clean it.