<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Subreddit Classifier with Webscraping, NLP and ML
## Part 1: Webscraping and Data Gathering

## Contents:
1. [Create functions](#Create-functions)
2. [Get posts from r/bicycling](#Get-posts-from-r/bicycling)
3. [Get posts from r/motorcycles](#Get-posts-from-r/motorcycles)
4. [Run loops](#Run-loops)
5. [Initial cleaning](#Initial-cleaning)
6. [Export to csv](#Export-to-csv)

In [1]:
# import libraries
import numpy as np
import pandas as pd
import requests
import datetime
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set(font_scale=1.5)
pd.set_option('display.max_columns',100)

In [2]:
base_url = 'https://api.pushshift.io/reddit/search/submission'

## Create functions
- We define helper functions that build the request parameters for a subreddit, fetch the matching posts, and convert them into a dataframe
- Each request returns at most 100 posts, so we repeat the process, each time asking for posts older than the last one already collected

*Back to [Contents](#Contents:)*
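
The pagination idea in the bullets above can be sketched without touching the network. In this minimal sketch, `fake_fetch` and `ALL_POSTS` are hypothetical stand-ins for the API and its data (they are not part of the notebook), and each post is a plain dict that carries only an `id` and a `created_utc`. The oldest timestamp on each page becomes the `before` value of the next request.

```python
# Hypothetical in-memory "subreddit": 250 posts, newest first, one per minute.
ALL_POSTS = [{'id': i, 'created_utc': 1660018569 - 60 * i} for i in range(250)]


def fake_fetch(size=100, before=None):
    """Stand-in for the API call: return up to `size` posts older than `before`."""
    older = [p for p in ALL_POSTS if before is None or p['created_utc'] < before]
    return older[:size]


def fetch_all(max_pages=5, size=100):
    """Walk backwards in time, using the oldest timestamp seen as the next cursor."""
    collected = []
    before = None
    for _ in range(max_pages):
        page = fake_fetch(size=size, before=before)
        if not page:                       # nothing older is left, so stop early
            break
        collected.extend(page)
        before = page[-1]['created_utc']   # oldest post on this page
    return collected


posts = fetch_all()
print(len(posts))
```

Because the cursor is strictly "older than", no post is fetched twice, and the loop ends on its own once a page comes back empty.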

In [3]:
# build the request parameters for the next page of a subreddit
def get_params(df, subreddit):
    params = {
        'subreddit': subreddit,
        'size': 100,
        # only return posts created before the last row of the main df
        'before': df['created_utc'].iloc[-1]
    }
    return params
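
To make the role of these parameters concrete, the sketch below builds the query string they would produce. It uses only `urllib.parse.urlencode` from the standard library; `requests` performs an equivalent encoding when it is given a `params` dict. The `before` value is the `created_utc` of the first bicycling post shown later in this notebook.

```python
from urllib.parse import urlencode

base_url = 'https://api.pushshift.io/reddit/search/submission'
params = {'subreddit': 'bicycling', 'size': 100, 'before': 1660018569}

# Build the URL by hand to show what the request looks like.
url = base_url + '?' + urlencode(params)
print(url)
```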

In [4]:
# get posts for subreddit
def get_posts(params, base_url='https://api.pushshift.io/reddit/search/submission'):
    res = requests.get(base_url, params=params)
    if res.status_code != 200:
        print('Error:', res.status_code)
        # return an empty list so callers do not hit an undefined variable
        return []
    # the posts are stored under the 'data' key of the JSON response
    return res.json()['data']
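
When a request fails, `get_posts` only prints the status code. One possible extension, sketched here under assumptions, is to retry a few times before giving up. `fetch_with_retry` is a hypothetical helper, not part of the notebook; it takes any zero-argument callable that returns a `(status_code, posts)` pair, and the `sleep` argument is injectable so the example runs instantly.

```python
import time


def fetch_with_retry(fetch, retries=3, wait=1.0, sleep=time.sleep):
    """Call `fetch()` until it returns a 200 status or `retries` attempts are used.

    `fetch` is any zero-argument callable returning (status_code, posts).
    """
    for attempt in range(retries):
        status, posts = fetch()
        if status == 200:
            return posts
        sleep(wait * (attempt + 1))   # wait a little longer after each failure
    return []                         # give up: the caller gets an empty list


# Hypothetical flaky endpoint: fails twice with 429, then succeeds.
responses = iter([(429, None), (429, None), (200, [{'id': 'wjtonf'}])])
posts = fetch_with_retry(lambda: next(responses), sleep=lambda seconds: None)
print(posts)
```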

In [5]:
# transform list of posts into dataframe
def create_df(posts):
    return pd.DataFrame(posts)

In [6]:
# each request returns at most 100 posts,
# so fetch the next page and append it to the main df
def add_posts_to_df(main_df, subreddit):
    params = get_params(main_df, subreddit)
    posts = get_posts(params, base_url)
    next_df = create_df(posts)
    main_df = pd.concat([main_df, next_df], axis=0, ignore_index=True, sort=True)
    return main_df

## Get posts from r/bicycling
*Back to [Contents](#Contents:)*

In [7]:
# initial request parameters for the subreddit
params_bicycling = {
    'subreddit': 'bicycling',
    'size': 100
}

In [8]:
# getting a list of posts from subreddit
posts_bicycling = get_posts(params_bicycling)

In [9]:
# an example of what a post looks like
posts_bicycling[0]

{'all_awardings': [],
 'allow_live_comments': False,
 'author': 'Left_on_Burnside',
 'author_flair_css_class': None,
 'author_flair_richtext': [],
 'author_flair_text': None,
 'author_flair_type': 'text',
 'author_fullname': 't2_6jam74xc',
 'author_is_blocked': False,
 'author_patreon_flair': False,
 'author_premium': False,
 'awarders': [],
 'can_mod_post': False,
 'contest_mode': False,
 'created_utc': 1660018569,
 'domain': 'i.redd.it',
 'full_link': 'https://www.reddit.com/r/bicycling/comments/wjtonf/10_off_any_roadid/',
 'gildings': {},
 'id': 'wjtonf',
 'is_created_from_ads_ui': False,
 'is_crosspostable': True,
 'is_meta': False,
 'is_original_content': False,
 'is_reddit_media_domain': True,
 'is_robot_indexable': True,
 'is_self': False,
 'is_video': False,
 'link_flair_background_color': '',
 'link_flair_richtext': [],
 'link_flair_text_color': 'dark',
 'link_flair_type': 'text',
 'locked': False,
 'media_only': False,
 'no_follow': True,
 'num_comments': 0,
 'num_crossposts'
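
`created_utc` is a Unix timestamp in seconds, which is what makes it usable as the `before` cursor. As a small aside, it can be turned into a readable date with the standard library:

```python
from datetime import datetime, timezone

created_utc = 1660018569  # value from the post above

# Interpret the integer as seconds since the Unix epoch, in UTC.
posted_at = datetime.fromtimestamp(created_utc, tz=timezone.utc)
print(posted_at.isoformat())  # 2022-08-09T04:16:09+00:00
```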

In [10]:
# transforming the list into a dataframe
df_bicycling = create_df(posts_bicycling)
df_bicycling.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,author_premium,awarders,can_mod_post,contest_mode,created_utc,domain,full_link,gildings,id,is_created_from_ads_ui,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,link_flair_background_color,link_flair_richtext,link_flair_text_color,link_flair_type,locked,media_only,no_follow,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,post_hint,preview,pwls,retrieved_on,score,selftext,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_subscribers,subreddit_type,thumbnail,thumbnail_height,thumbnail_width,title,total_awards_received,treatment_tags,upvote_ratio,url,url_overridden_by_dest,whitelist_status,wls,gallery_data,is_gallery,media_metadata,author_flair_template_id,author_flair_text_color,crosspost_parent,crosspost_parent_list,author_cakeday,media,media_embed,secure_media,secure_media_embed,removed_by_category
0,[],False,Left_on_Burnside,,[],,text,t2_6jam74xc,False,False,False,[],False,False,1660018569,i.redd.it,https://www.reddit.com/r/bicycling/comments/wj...,{},wjtonf,False,True,False,False,True,True,False,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/bicycling/comments/wjtonf/10_off_any_roadid/,False,image,"{'enabled': True, 'images': [{'id': 'rX40eCAj6...",6,1660018579,1,,True,False,False,bicycling,t5_2qi0s,1062638,public,https://b.thumbs.redditmedia.com/2lWzj9dhRm3Jc...,140.0,140.0,$10 off any RoadID,0,[],1.0,https://i.redd.it/1lqlauuw5mg91.jpg,https://i.redd.it/1lqlauuw5mg91.jpg,all_ads,6,,,,,,,,,,,,,
1,[],False,megaplaywknd,,[],,text,t2_qetyr,False,False,False,[],False,False,1660017570,i.redd.it,https://www.reddit.com/r/bicycling/comments/wj...,{},wjtd48,False,True,False,False,True,True,False,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/bicycling/comments/wjtd48/is_this_a_good_de...,False,image,"{'enabled': True, 'images': [{'id': 'sO689QnyD...",6,1660017581,1,,True,False,False,bicycling,t5_2qi0s,1062636,public,https://a.thumbs.redditmedia.com/_x_slsoh3NqNC...,140.0,140.0,Is this a good deal on a bike? It’s a Fuji Thr...,0,[],1.0,https://i.redd.it/2pq7z3vz2mg91.jpg,https://i.redd.it/2pq7z3vz2mg91.jpg,all_ads,6,,,,,,,,,,,,,
2,[],False,Kevin2566,,[],,text,t2_6xwoi7bf,False,False,False,[],False,False,1660016616,self.bicycling,https://www.reddit.com/r/bicycling/comments/wj...,{},wjt2fs,False,True,False,False,False,True,True,False,,[],dark,text,False,False,False,0,0,False,all_ads,/r/bicycling/comments/wjt2fs/preventing_bike_t...,False,,,6,1660016626,1,"Hi, so there's been a string of bike theft in ...",True,False,False,bicycling,t5_2qi0s,1062634,public,self,,,Preventing Bike Theft,0,[],1.0,https://www.reddit.com/r/bicycling/comments/wj...,,all_ads,6,,,,,,,,,,,,,
3,[],False,HistoryExplainsALot,,[],,text,t2_x300f,False,False,False,[],False,False,1660016483,reddit.com,https://www.reddit.com/r/bicycling/comments/wj...,{},wjt0x1,False,True,False,False,False,True,False,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/bicycling/comments/wjt0x1/specialized_hardr...,False,,,6,1660016494,1,,True,False,False,bicycling,t5_2qi0s,1062634,public,https://b.thumbs.redditmedia.com/thdhEAD6s3CUM...,105.0,140.0,Specialized Hardrock. Winnipeg Red River singl...,0,[],1.0,https://www.reddit.com/gallery/wjt0x1,https://www.reddit.com/gallery/wjt0x1,all_ads,6,"{'items': [{'id': 173543397, 'media_id': '2cli...",True,"{'2cli5earzlg91': {'e': 'Image', 'id': '2cli5e...",,,,,,,,,,
4,[],False,HistoryExplainsALot,,[],,text,t2_x300f,False,False,False,[],False,False,1660016030,self.bicycling,https://www.reddit.com/r/bicycling/comments/wj...,{},wjsvly,False,True,False,False,False,True,True,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/bicycling/comments/wjsvly/specialized_hardr...,False,,,6,1660016040,1,,True,False,False,bicycling,t5_2qi0s,1062631,public,self,,,Specialized Hardrock. Winnipeg Red river Canop...,0,[],1.0,https://www.reddit.com/r/bicycling/comments/wj...,,all_ads,6,,,,,,,,,,,,,


In [11]:
df_bicycling.shape

(100, 77)

In [12]:
# list the columns of the dataframe (the fields of each post)
df_bicycling.columns

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_is_blocked',
       'author_patreon_flair', 'author_premium', 'awarders', 'can_mod_post',
       'contest_mode', 'created_utc', 'domain', 'full_link', 'gildings', 'id',
       'is_created_from_ads_ui', 'is_crosspostable', 'is_meta',
       'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable',
       'is_self', 'is_video', 'link_flair_background_color',
       'link_flair_richtext', 'link_flair_text_color', 'link_flair_type',
       'locked', 'media_only', 'no_follow', 'num_comments', 'num_crossposts',
       'over_18', 'parent_whitelist_status', 'permalink', 'pinned',
       'post_hint', 'preview', 'pwls', 'retrieved_on', 'score', 'selftext',
       'send_replies', 'spoiler', 'stickied', 'subreddit', 'subreddit_id',
       'subreddit_subscribers', 'subreddit_type', 'thumbnail',
   

In [13]:
# titles of the extracted posts
df_bicycling['title']

0                                    $10 off any RoadID
1     Is this a good deal on a bike? It’s a Fuji Thr...
2                                 Preventing Bike Theft
3     Specialized Hardrock. Winnipeg Red River singl...
4     Specialized Hardrock. Winnipeg Red river Canop...
                            ...                        
95    Thought any fellow Italians might appreciate t...
96                                       Amazon Rollers
97             Shoes slightly narrower than lake shoes?
98                                       Haro V3 (2006)
99                               need your guys opinion
Name: title, Length: 100, dtype: object

In [14]:
df_bicycling[['subreddit', 'selftext', 'title']].head()

Unnamed: 0,subreddit,selftext,title
0,bicycling,,$10 off any RoadID
1,bicycling,,Is this a good deal on a bike? It’s a Fuji Thr...
2,bicycling,"Hi, so there's been a string of bike theft in ...",Preventing Bike Theft
3,bicycling,,Specialized Hardrock. Winnipeg Red River singl...
4,bicycling,,Specialized Hardrock. Winnipeg Red river Canop...


## Get posts from r/motorcycles
*Back to [Contents](#Contents:)*

In [15]:
# initial request parameters for the subreddit
params_motorcycles = {
    'subreddit': 'motorcycles',
    'size': 100
}

In [16]:
# getting a list of posts from subreddit
posts_motorcycles = get_posts(params_motorcycles)

In [17]:
# an example of what a post looks like
posts_motorcycles[0]

{'all_awardings': [],
 'allow_live_comments': False,
 'author': 'AUTOT3K',
 'author_flair_css_class': None,
 'author_flair_richtext': [],
 'author_flair_text': None,
 'author_flair_type': 'text',
 'author_fullname': 't2_4fusrcl0',
 'author_is_blocked': False,
 'author_patreon_flair': False,
 'author_premium': False,
 'awarders': [],
 'can_mod_post': False,
 'contest_mode': False,
 'created_utc': 1660015269,
 'domain': 'self.motorcycles',
 'full_link': 'https://www.reddit.com/r/motorcycles/comments/wjsmhc/anyone_running_an_xl_hornet_x2_w_13mm_crown_pad/',
 'gildings': {},
 'id': 'wjsmhc',
 'is_created_from_ads_ui': False,
 'is_crosspostable': True,
 'is_meta': False,
 'is_original_content': False,
 'is_reddit_media_domain': False,
 'is_robot_indexable': True,
 'is_self': True,
 'is_video': False,
 'link_flair_background_color': '',
 'link_flair_richtext': [],
 'link_flair_text_color': 'dark',
 'link_flair_type': 'text',
 'locked': False,
 'media_only': False,
 'no_follow': True,
 'num_c

In [18]:
# transforming the list into a dataframe
df_motorcycles = create_df(posts_motorcycles)
df_motorcycles.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,author_premium,awarders,can_mod_post,contest_mode,created_utc,domain,full_link,gildings,id,is_created_from_ads_ui,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,link_flair_background_color,link_flair_richtext,link_flair_text_color,link_flair_type,locked,media_only,no_follow,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,pwls,retrieved_on,score,selftext,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_subscribers,subreddit_type,thumbnail,title,total_awards_received,treatment_tags,upvote_ratio,url,whitelist_status,wls,post_hint,preview,removed_by_category,url_overridden_by_dest,author_flair_template_id,author_flair_text_color,thumbnail_height,thumbnail_width,author_flair_background_color,media,media_embed,secure_media,secure_media_embed,suggested_sort
0,[],False,AUTOT3K,,[],,text,t2_4fusrcl0,False,False,False,[],False,False,1660015269,self.motorcycles,https://www.reddit.com/r/motorcycles/comments/...,{},wjsmhc,False,True,False,False,False,True,True,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/motorcycles/comments/wjsmhc/anyone_running_...,False,6,1660015280,1,Hey everyone. \nThe time has come to replace t...,True,False,False,motorcycles,t5_2qi6d,1464560,public,self,Anyone Running An XL Hornet X2 /w 13mm Crown Pad?,0,[],1.0,https://www.reddit.com/r/motorcycles/comments/...,all_ads,6,,,,,,,,,,,,,,
1,[],False,Wise-Promise22,,[],,text,t2_ikfi07h3,False,False,False,[],False,False,1660013413,self.motorcycles,https://www.reddit.com/r/motorcycles/comments/...,{},wjrzkp,False,True,False,False,False,True,True,False,,[],dark,text,False,False,False,0,0,False,all_ads,/r/motorcycles/comments/wjrzkp/breaking_in_a_r7/,False,6,1660013423,1,I just recently bought a 2022 R7 anniversary e...,True,False,False,motorcycles,t5_2qi6d,1464516,public,self,Breaking in a R7,0,[],1.0,https://www.reddit.com/r/motorcycles/comments/...,all_ads,6,,,,,,,,,,,,,,
2,[],False,ceeballama,,[],,text,t2_ly63n,False,False,False,[],False,False,1660013058,self.motorcycles,https://www.reddit.com/r/motorcycles/comments/...,{},wjrv4r,False,True,False,False,False,True,True,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/motorcycles/comments/wjrv4r/where_does_this...,False,6,1660013068,1,"When I dropped my bike last time, this tube be...",True,False,False,motorcycles,t5_2qi6d,1464508,public,self,Where does this line connect to? 2004 Ninja ZX636,0,[],1.0,https://www.reddit.com/r/motorcycles/comments/...,all_ads,6,self,"{'enabled': False, 'images': [{'id': 'tmUhesfk...",,,,,,,,,,,,
3,[],False,Offered_Object_23,,[],,text,t2_8av7uk7i,False,False,False,[],False,False,1660012965,self.motorcycles,https://www.reddit.com/r/motorcycles/comments/...,{},wjru1a,False,True,False,False,False,True,True,False,,[],dark,text,False,False,False,0,0,False,all_ads,/r/motorcycles/comments/wjru1a/do_you_never_te...,False,6,1660012976,1,I want to tell my boyfriend I’m worried about ...,True,False,False,motorcycles,t5_2qi6d,1464505,public,self,Do you never tell a biker you’re worried about...,0,[],1.0,https://www.reddit.com/r/motorcycles/comments/...,all_ads,6,,,,,,,,,,,,,,
4,[],False,louiesalads69,,[],,text,t2_37g078zb,False,False,False,[],False,False,1660012385,self.motorcycles,https://www.reddit.com/r/motorcycles/comments/...,{},wjrmrl,False,False,False,False,False,False,True,False,,[],dark,text,False,False,True,1,0,False,all_ads,/r/motorcycles/comments/wjrmrl/milwaukee_tuners/,False,6,1660012395,1,[removed],True,False,False,motorcycles,t5_2qi6d,1464494,public,self,Milwaukee tuners,0,[],1.0,https://www.reddit.com/r/motorcycles/comments/...,all_ads,6,,,moderator,,,,,,,,,,,


In [19]:
df_motorcycles.shape

(100, 73)

In [20]:
# list the columns of the dataframe (the fields of each post)
df_motorcycles.columns

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_is_blocked',
       'author_patreon_flair', 'author_premium', 'awarders', 'can_mod_post',
       'contest_mode', 'created_utc', 'domain', 'full_link', 'gildings', 'id',
       'is_created_from_ads_ui', 'is_crosspostable', 'is_meta',
       'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable',
       'is_self', 'is_video', 'link_flair_background_color',
       'link_flair_richtext', 'link_flair_text_color', 'link_flair_type',
       'locked', 'media_only', 'no_follow', 'num_comments', 'num_crossposts',
       'over_18', 'parent_whitelist_status', 'permalink', 'pinned', 'pwls',
       'retrieved_on', 'score', 'selftext', 'send_replies', 'spoiler',
       'stickied', 'subreddit', 'subreddit_id', 'subreddit_subscribers',
       'subreddit_type', 'thumbnail', 'title', 'total_awards_rece

In [21]:
# titles of the extracted posts
df_motorcycles['title']

0     Anyone Running An XL Hornet X2 /w 13mm Crown Pad?
1                                      Breaking in a R7
2     Where does this line connect to? 2004 Ninja ZX636
3     Do you never tell a biker you’re worried about...
4                                      Milwaukee tuners
                            ...                        
95        Are saddle bags interchangeable acres brands?
96                             Throttle Rocker opinions
97                                    Throttle rockers?
98        Recommended tool kits and emergency tire kits
99                      Applying decals on a fresh coat
Name: title, Length: 100, dtype: object

In [22]:
df_motorcycles[['subreddit', 'selftext', 'title']].head()

Unnamed: 0,subreddit,selftext,title
0,motorcycles,Hey everyone. \nThe time has come to replace t...,Anyone Running An XL Hornet X2 /w 13mm Crown Pad?
1,motorcycles,I just recently bought a 2022 R7 anniversary e...,Breaking in a R7
2,motorcycles,"When I dropped my bike last time, this tube be...",Where does this line connect to? 2004 Ninja ZX636
3,motorcycles,I want to tell my boyfriend I’m worried about ...,Do you never tell a biker you’re worried about...
4,motorcycles,[removed],Milwaukee tuners


## Run loops
- Each loop makes 29 further requests of up to 100 posts, which together with the initial 100 targets 3000 posts per subreddit

*Back to [Contents](#Contents:)*

In [23]:
for i in range(29):
    df_bicycling = add_posts_to_df(df_bicycling, 'bicycling')

In [24]:
df_bicycling.shape

(2997, 82)
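
The 2997 rows above fall just short of 3000, which suggests that at least one request returned fewer than 100 posts. A fixed `range(29)` cannot make up for that. One alternative, sketched here with a hypothetical `fake_fetch` in which every page is one post short, is to keep requesting until the target is reached. Neither `fake_fetch` nor `collect` is part of the notebook.

```python
def fake_fetch(page_number, size=100):
    """Hypothetical stand-in for one API request; every page is one post short."""
    start = page_number * size
    return list(range(start, start + size - 1))


def collect(target=3000, size=100, max_requests=50):
    """Keep requesting pages until `target` posts are held or requests run out."""
    posts = []
    requests_made = 0
    while len(posts) < target and requests_made < max_requests:
        page = fake_fetch(requests_made, size)
        if not page:              # an empty page means there is nothing left
            break
        posts.extend(page)
        requests_made += 1
    return posts[:target], requests_made


posts, requests_made = collect()
print(len(posts), requests_made)
```

The `max_requests` cap keeps the loop from running indefinitely if the source never yields enough posts.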

In [25]:
df_bicycling[['subreddit', 'selftext', 'title']].head()

Unnamed: 0,subreddit,selftext,title
0,bicycling,,$10 off any RoadID
1,bicycling,,Is this a good deal on a bike? It’s a Fuji Thr...
2,bicycling,"Hi, so there's been a string of bike theft in ...",Preventing Bike Theft
3,bicycling,,Specialized Hardrock. Winnipeg Red River singl...
4,bicycling,,Specialized Hardrock. Winnipeg Red river Canop...


In [26]:
for i in range(29):
    df_motorcycles = add_posts_to_df(df_motorcycles, 'motorcycles')

In [27]:
df_motorcycles.shape

(3000, 75)

In [28]:
df_motorcycles[['subreddit', 'selftext', 'title']].head()

Unnamed: 0,subreddit,selftext,title
0,motorcycles,Hey everyone. \nThe time has come to replace t...,Anyone Running An XL Hornet X2 /w 13mm Crown Pad?
1,motorcycles,I just recently bought a 2022 R7 anniversary e...,Breaking in a R7
2,motorcycles,"When I dropped my bike last time, this tube be...",Where does this line connect to? 2004 Ninja ZX636
3,motorcycles,I want to tell my boyfriend I’m worried about ...,Do you never tell a biker you’re worried about...
4,motorcycles,[removed],Milwaukee tuners


## Initial cleaning
*Back to [Contents](#Contents:)*

In [29]:
def preprocess(df):
    df_copy = df.copy()

    # replace null values with empty strings
    df_copy['selftext'] = df_copy['selftext'].fillna('')

    # drop posts whose body was deleted or removed
    df_copy = df_copy[~df_copy['selftext'].isin(['[deleted]', '[removed]'])].copy()

    # combine 'selftext' and 'title', separated by a space so words do not fuse
    df_copy['text'] = (df_copy['selftext'] + ' ' + df_copy['title']).str.strip()

    # drop duplicate posts
    df_copy = df_copy.drop_duplicates(subset=['text'])

    cols_to_keep = ['subreddit', 'selftext', 'title', 'text']

    return df_copy[cols_to_keep].reset_index(drop=True)
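
To make the effect of each cleaning step visible, here is a comparable sequence applied to a few made-up records in plain Python. The records and the `clean` helper are illustrative only and are not part of the notebook's pipeline. The sketch joins body and title with a space so that the last word of one does not fuse with the first word of the other.

```python
records = [
    {'selftext': None, 'title': 'Amazon Rollers'},                # no body text
    {'selftext': '[removed]', 'title': 'Milwaukee tuners'},       # removed post
    {'selftext': 'Any tips?', 'title': 'Preventing Bike Theft'},
    {'selftext': 'Any tips?', 'title': 'Preventing Bike Theft'},  # exact repost
]


def clean(rows):
    """Apply the four cleaning steps to a list of post dicts."""
    seen = set()
    cleaned = []
    for row in rows:
        body = row['selftext'] or ''                # 1. missing body -> empty string
        if body in ('[deleted]', '[removed]'):      # 2. skip deleted/removed posts
            continue
        text = (body + ' ' + row['title']).strip()  # 3. combine body and title
        if text in seen:                            # 4. skip duplicates
            continue
        seen.add(text)
        cleaned.append({'title': row['title'], 'selftext': body, 'text': text})
    return cleaned


for row in clean(records):
    print(row['text'])
```

Of the four records, only two survive: the post without a body keeps its title as `text`, the removed post is dropped, and the repost is collapsed into a single row.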

In [30]:
df_bicycling = preprocess(df_bicycling)
df_bicycling.head()

Unnamed: 0,subreddit,selftext,title,text
0,bicycling,,$10 off any RoadID,$10 off any RoadID
1,bicycling,,Is this a good deal on a bike? It’s a Fuji Thr...,Is this a good deal on a bike? It’s a Fuji Thr...
2,bicycling,"Hi, so there's been a string of bike theft in ...",Preventing Bike Theft,"Hi, so there's been a string of bike theft in ..."
3,bicycling,,Specialized Hardrock. Winnipeg Red River singl...,Specialized Hardrock. Winnipeg Red River singl...
4,bicycling,,Specialized Hardrock. Winnipeg Red river Canop...,Specialized Hardrock. Winnipeg Red river Canop...


In [31]:
df_motorcycles = preprocess(df_motorcycles)
df_motorcycles.head()

Unnamed: 0,subreddit,selftext,title,text
0,motorcycles,Hey everyone. \nThe time has come to replace t...,Anyone Running An XL Hornet X2 /w 13mm Crown Pad?,Hey everyone. \nThe time has come to replace t...
1,motorcycles,I just recently bought a 2022 R7 anniversary e...,Breaking in a R7,I just recently bought a 2022 R7 anniversary e...
2,motorcycles,"When I dropped my bike last time, this tube be...",Where does this line connect to? 2004 Ninja ZX636,"When I dropped my bike last time, this tube be..."
3,motorcycles,I want to tell my boyfriend I’m worried about ...,Do you never tell a biker you’re worried about...,I want to tell my boyfriend I’m worried about ...
4,motorcycles,I just recently got my 2021 R3 and have been l...,R3 Exhaust,I just recently got my 2021 R3 and have been l...


## Export to csv
*Back to [Contents](#Contents:)*

In [32]:
df_bicycling.to_csv('./data/bicycling_subreddit.csv', index=False)
df_motorcycles.to_csv('./data/motorcycles_subreddit.csv', index=False)
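
`to_csv` does not create missing directories, so the export above assumes that `./data` already exists. A small guard is to create the folder first with `pathlib`. The sketch below runs against a temporary directory so it is safe to execute anywhere; in the notebook the equivalent call would target `./data`.

```python
import tempfile
from pathlib import Path

# Work inside a throwaway directory instead of the real project folder.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / 'data'

    # exist_ok=True makes the second call a no-op instead of an error.
    data_dir.mkdir(parents=True, exist_ok=True)
    data_dir.mkdir(parents=True, exist_ok=True)

    created = data_dir.is_dir()

print(created)  # True
```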