<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Web APIs & NLP on Subreddit (CatAdvice & DogAdvice)

--- 
# Part 1

Part 1 covers data collection: posts are pulled from each subreddit through the Pushshift API.

Data was extracted on 25 Nov 2022:
- Extracted posts from r/CatAdvice by setting the 'subreddit' parameter to 'CatAdvice', saved as 'cat_adv.csv'
- Extracted posts from r/DogAdvice by setting the 'subreddit' parameter to 'DogAdvice', saved as 'dog_adv.csv'

---

In [1]:
# Import libraries

import pandas as pd
import requests
import time

In [2]:
# Set display settings of dataframe

pd.set_option('display.max_columns',90)
pd.set_option('display.max_rows',50)

#### Extract sample to be familiar with workflow

In [3]:
# Store the base URL of the Pushshift submission search endpoint
url = 'https://api.pushshift.io/reddit/search/submission'

In [4]:
# Set parameters for get request
params = {
    'subreddit' : 'DogAdvice',
    'size' : 500
}


In [5]:
# Submit request
response_sample = requests.get(url, params=params)

In [6]:
# Check if request is okay
response_sample.status_code

200

In [7]:
# Parse the JSON response body into a Python dictionary
data_sample = response_sample.json()

In [8]:
# Extract the list of posts stored under the 'data' key and keep it in 'posts_sample'
posts_sample = data_sample['data']

In [9]:
# Check the number of posts returned: only 250 came back even though 'size' was set to 500
len(posts_sample)

250
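
Because each request returned 250 posts here, a larger extraction has to be split across several requests. The sketch below is only a rough estimate: it assumes every request keeps returning 250 posts, and it uses a target of 4000 posts and a 5-second pause between requests, the values used later in this notebook. The variable names are illustrative.

```python
import math

posts_per_request = 250   # observed in the sample request above
target_posts = 4000       # assumption: target used later in this notebook
pause_seconds = 5         # assumption: pause between requests used later

# Number of requests needed if every request returns a full batch
requests_needed = math.ceil(target_posts / posts_per_request)

# Time spent only on pauses, ignoring network latency
min_wait_seconds = requests_needed * pause_seconds

print(requests_needed, min_wait_seconds)
```

The real number of requests can be higher, since a batch may contain fewer than 250 posts.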

In [10]:
# Create dataframe for posts and show top 3 rows
df_sample = pd.DataFrame(posts_sample)
df_sample.head(3)

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,author_premium,awarders,can_mod_post,contest_mode,created_utc,domain,full_link,gildings,id,is_created_from_ads_ui,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,link_flair_background_color,link_flair_richtext,link_flair_text_color,link_flair_type,locked,media_only,no_follow,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,post_hint,preview,pwls,retrieved_on,score,selftext,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_subscribers,subreddit_type,suggested_sort,thumbnail,thumbnail_height,thumbnail_width,title,total_awards_received,treatment_tags,upvote_ratio,url,url_overridden_by_dest,whitelist_status,wls,link_flair_css_class,link_flair_template_id,link_flair_text,media,media_embed,secure_media,secure_media_embed,crosspost_parent,crosspost_parent_list,removed_by_category,media_metadata
0,[],False,Yandxxl,,[],,text,t2_g4ggm8yq,False,False,False,[],False,False,1669251312,i.redd.it,https://www.reddit.com/r/DogAdvice/comments/z3...,{},z35x4y,False,True,False,False,True,True,False,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/DogAdvice/comments/z35x4y/urgent_need_info_...,False,image,"{'enabled': True, 'images': [{'id': 'DMRN-aFwQ...",6,1669251323,1,Can anybody identify what this could possibly ...,True,False,False,DogAdvice,t5_367ex,65695,public,confidence,https://a.thumbs.redditmedia.com/ByDcm7ZutukEQ...,140.0,140.0,"Urgent, need info asap",0,[],1.0,https://i.redd.it/6q3fjj469u1a1.jpg,https://i.redd.it/6q3fjj469u1a1.jpg,all_ads,6,,,,,,,,,,,
1,[],False,beeflomix,,[],,text,t2_565uzskv,False,False,False,[],False,False,1669247552,self.DogAdvice,https://www.reddit.com/r/DogAdvice/comments/z3...,{},z34k6y,False,True,False,False,False,True,True,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/DogAdvice/comments/z34k6y/behavior_problems...,False,,,6,1669247564,1,"Hello all, I've been having serious behavioral...",True,False,False,DogAdvice,t5_367ex,65693,public,confidence,self,,,Behavior problems with corgi,0,[],1.0,https://www.reddit.com/r/DogAdvice/comments/z3...,,all_ads,6,purple,7a1e7f14-7eb8-11e6-8a8d-0e4f4005713f,Advice,,,,,,,,
2,[],False,nipdeep,,[],,text,t2_x6nqo,False,False,False,[],False,False,1669246378,i.redd.it,https://www.reddit.com/r/DogAdvice/comments/z3...,{},z34484,False,True,False,False,True,True,False,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/DogAdvice/comments/z34484/from_a_video_of_m...,False,image,"{'enabled': True, 'images': [{'id': 'sxYYqASlJ...",6,1669246389,1,,True,False,False,DogAdvice,t5_367ex,65692,public,confidence,https://b.thumbs.redditmedia.com/jRWXMJdx5qPPD...,140.0,140.0,"From a video of my dog yawning, how old would ...",0,[],1.0,https://i.redd.it/rugc2bmhut1a1.jpg,https://i.redd.it/rugc2bmhut1a1.jpg,all_ads,6,green,745f263c-7eb8-11e6-98c3-0ee3e4abe4ef,Discussion,,,,,,,,


In [11]:
# Display the number of rows and columns in df 
df_sample.shape

(250, 76)

In [12]:
# Display the 'created_utc' timestamp of the last (oldest) row
df_sample.created_utc.iloc[-1]

1668791062
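
The value above is a Unix epoch timestamp in seconds. As a minimal sketch, it can be converted to a readable UTC date with the standard library to see how far back the sample reaches (the variable names here are illustrative, not part of the original notebook):

```python
from datetime import datetime, timezone

# Timestamp of the last row in the sample, copied from the output above
last_created_utc = 1668791062

# Interpret the epoch seconds as a timezone-aware UTC datetime
last_dt = datetime.fromtimestamp(last_created_utc, tz=timezone.utc)

print(last_dt.isoformat())  # 2022-11-18T17:04:22+00:00
```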

#### Create function for extraction and perform data extraction

In [13]:
# Create a function to extract posts from a subreddit
def webscrape_subreddit(subreddit, expected_post_len):
    # Create empty dataframe to store posts
    df = pd.DataFrame()
    # Base URL of the Pushshift submission search endpoint
    url = 'https://api.pushshift.io/reddit/search/submission'
    # Set parameters
    params = {
        'subreddit': subreddit,
        'size': 500
    }
    # Loop until the expected number of posts has been extracted
    # - fewer posts are returned if the subreddit has fewer posts than expected
    while len(df) < expected_post_len:
        # After the first batch, only request posts created before the oldest post collected so far
        if not df.empty:
            params['before'] = int(df.created_utc.iloc[-1])
        # Submit request
        response = requests.get(url, params=params)
        # Parse the JSON response; stop if the body is not valid JSON (e.g. an error page)
        try:
            data_add = response.json()
        except ValueError:
            break
        # Extract the list of posts; stop when there are no older posts left,
        # otherwise the loop would repeat the same request forever
        posts_add = data_add.get('data', [])
        if not posts_add:
            break
        # Store the new posts in a dataframe
        df_add = pd.DataFrame(posts_add)
        # Append the new posts to the collected dataframe
        df = pd.concat([df, df_add], axis=0, ignore_index=True)
        # Pause for 5 seconds between requests to avoid being blocked by the API
        time.sleep(5)
    return df
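
The function pages backwards through the subreddit by passing the oldest timestamp collected so far as the 'before' parameter. The sketch below illustrates that cursor logic without any network access. `FAKE_POSTS`, `fake_fetch` and `collect` are hypothetical stand-ins written for this illustration, not part of Pushshift or of the function above, and the batch size of 5 is arbitrary.

```python
# Stand-in for the remote archive: 12 hypothetical posts, newest first
FAKE_POSTS = [{"id": f"p{i}", "created_utc": 1000 - i} for i in range(12)]

def fake_fetch(before=None, size=5):
    """Hypothetical stub: return up to `size` posts strictly older than `before`."""
    older = [p for p in FAKE_POSTS if before is None or p["created_utc"] < before]
    return older[:size]

def collect(target, fetch=fake_fetch):
    """Page backwards with a 'before' cursor until at least `target` posts are collected."""
    posts = []
    before = None
    while len(posts) < target:
        batch = fetch(before=before)
        if not batch:
            # Nothing older is left, so stop instead of looping forever
            break
        posts.extend(batch)
        # Move the cursor to the oldest post seen so far
        before = batch[-1]["created_utc"]
    return posts

collected = collect(target=8)      # stops after two full batches
exhausted = collect(target=100)    # stops when the archive runs out
print(len(collected), len(exhausted))
```

Two properties carry over to the real function: the result can overshoot the target because whole batches are appended, and the loop needs an explicit stop for the case where no older posts remain.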

In [14]:
# Extract posts from the 'CatAdvice' subreddit - at least 4000 posts expected
cat_adv_df = webscrape_subreddit('CatAdvice',4000)
# Display number of rows and columns in dataframe
print(cat_adv_df.shape)
# Save the dataframe as csv
cat_adv_df.to_csv('../data/cat_adv.csv', index = False)

(4247, 74)


In [15]:
# Display last 5 rows of the cat_adv_df
cat_adv_df.tail()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,author_premium,awarders,can_mod_post,contest_mode,created_utc,domain,full_link,gildings,id,is_created_from_ads_ui,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,link_flair_background_color,link_flair_richtext,link_flair_template_id,link_flair_text,link_flair_text_color,link_flair_type,locked,media_only,no_follow,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,pwls,retrieved_on,score,selftext,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_subscribers,subreddit_type,thumbnail,title,total_awards_received,treatment_tags,upvote_ratio,url,whitelist_status,wls,removed_by_category,crosspost_parent,crosspost_parent_list,url_overridden_by_dest,author_flair_background_color,author_flair_template_id,author_flair_text_color,author_cakeday,post_hint,preview,banned_by,thumbnail_height,thumbnail_width
4242,[],False,Ezihp,,[],,text,t2_lmztu,False,False,False,[],False,False,1665606221,self.CatAdvice,https://www.reddit.com/r/CatAdvice/comments/y2...,{},y2efj1,False,True,False,False,False,True,True,False,#80d323,"[{'e': 'text', 't': 'General'}]",b8d4f196-fb57-11ea-a3c7-0e460223f06d,General,light,richtext,False,False,True,0,0,False,all_ads,/r/CatAdvice/comments/y2efj1/household_items_f...,False,6,1665606231,1,Are there any alternatives for enzyme cleaners...,True,False,False,CatAdvice,t5_2sn56,113160,public,self,Household items for removing cat urine?,0,[],1.0,https://www.reddit.com/r/CatAdvice/comments/y2...,all_ads,6,,,,,,,,,,,,,
4243,[],False,Spiritpanda33,,[],,text,t2_r0s2p23h,False,False,False,[],False,False,1665604893,self.CatAdvice,https://www.reddit.com/r/CatAdvice/comments/y2...,{},y2duv5,False,True,False,False,False,True,True,False,#0dd3bb,"[{'e': 'text', 't': 'Behavioral'}]",6a7e639e-5faf-11eb-a259-0ec8e7ceabb1,Behavioral,dark,richtext,False,False,True,0,0,False,all_ads,/r/CatAdvice/comments/y2duv5/bin_kitty/,False,6,1665604904,1,We went away for 1 night and had relatives cam...,True,False,False,CatAdvice,t5_2sn56,113159,public,self,bin kitty,0,[],1.0,https://www.reddit.com/r/CatAdvice/comments/y2...,all_ads,6,,,,,,,,,,,,,
4244,[],False,Ginger_of_the_north,,[],,text,t2_4vznz4eq,False,False,False,[],False,False,1665603740,self.CatAdvice,https://www.reddit.com/r/CatAdvice/comments/y2...,{},y2dd02,False,True,False,False,False,True,True,False,#0dd3bb,"[{'e': 'text', 't': 'Nutrition/Water'}]",9f7eae6e-a761-11ec-860b-eef3d41bb389,Nutrition/Water,dark,richtext,False,False,False,0,0,False,all_ads,/r/CatAdvice/comments/y2dd02/she_wont_stop_puk...,False,6,1665603751,1,I adopted a ~4 year old black/ orange tortoise...,True,False,False,CatAdvice,t5_2sn56,113158,public,self,She won’t stop puking…what can I change??,0,[],1.0,https://www.reddit.com/r/CatAdvice/comments/y2...,all_ads,6,,,,,,,,,,,,,
4245,[],False,AdRepresentative7095,,[],,text,t2_a0xnmvs4,False,False,False,[],False,False,1665602253,self.CatAdvice,https://www.reddit.com/r/CatAdvice/comments/y2...,{},y2cqct,False,False,False,False,False,False,True,False,#0dd3bb,"[{'e': 'text', 't': 'Nutrition/Water'}]",9f7eae6e-a761-11ec-860b-eef3d41bb389,Nutrition/Water,dark,richtext,False,False,True,0,0,False,all_ads,/r/CatAdvice/comments/y2cqct/older_cat_calorie...,False,6,1665602264,1,[removed],True,False,False,CatAdvice,t5_2sn56,113155,public,self,Older Cat Calories Needed,0,[],1.0,https://www.reddit.com/r/CatAdvice/comments/y2...,all_ads,6,automod_filtered,,,,,,,,,,,,
4246,[],False,alxmg,,[],,text,t2_71l4vb6q,False,False,False,[],False,False,1665602147,self.CatAdvice,https://www.reddit.com/r/CatAdvice/comments/y2...,{},y2cosj,False,True,False,False,False,True,True,False,#0dd3bb,"[{'e': 'text', 't': 'Litterbox'}]",a259f8d0-634e-11eb-9c7b-0e213d85d8f7,Litterbox,dark,richtext,False,False,False,0,0,False,all_ads,/r/CatAdvice/comments/y2cosj/cat_peeing_everyw...,False,6,1665602158,1,Disclaimer so I don't violate the rules: I hav...,True,False,False,CatAdvice,t5_2sn56,113155,public,self,Cat peeing everywhere and I'm at my limit,0,[],1.0,https://www.reddit.com/r/CatAdvice/comments/y2...,all_ads,6,,,,,,,,,,,,,


In [16]:
# Extract posts from the 'DogAdvice' subreddit - at least 4000 posts expected
dog_adv_df = webscrape_subreddit('DogAdvice',4000)
# Display number of rows and columns in dataframe
print(dog_adv_df.shape)
# Save the dataframe as csv
dog_adv_df.to_csv('../data/dog_adv.csv', index = False)

(4000, 80)


In [17]:
# Display last 5 rows of the dog_adv_df
dog_adv_df.tail()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,author_premium,awarders,can_mod_post,contest_mode,created_utc,domain,full_link,gildings,id,is_created_from_ads_ui,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,link_flair_background_color,link_flair_richtext,link_flair_text_color,link_flair_type,locked,media_only,no_follow,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,post_hint,preview,pwls,retrieved_on,score,selftext,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_subscribers,subreddit_type,suggested_sort,thumbnail,thumbnail_height,thumbnail_width,title,total_awards_received,treatment_tags,upvote_ratio,url,url_overridden_by_dest,whitelist_status,wls,link_flair_css_class,link_flair_template_id,link_flair_text,media,media_embed,secure_media,secure_media_embed,crosspost_parent,crosspost_parent_list,removed_by_category,media_metadata,author_flair_background_color,author_flair_text_color,author_cakeday,poll_data
3995,[],False,SpiritedTerm03,,[],,text,t2_nmbv1c4z,False,False,False,[],False,False,1661402281,self.DogAdvice,https://www.reddit.com/r/DogAdvice/comments/wx...,{},wx4pya,False,True,False,False,False,True,True,False,,[],dark,text,False,False,False,0,0,False,all_ads,/r/DogAdvice/comments/wx4pya/how_can_i_rehome_...,False,,,6,1661402291,1,I brought in a small boy that a homeless woman...,True,False,False,DogAdvice,t5_367ex,56303,public,confidence,self,,,How can I rehome a dog?,0,[],1.0,https://www.reddit.com/r/DogAdvice/comments/wx...,,all_ads,6,purple,7a1e7f14-7eb8-11e6-8a8d-0e4f4005713f,Advice,,,,,,,,,,,,
3996,[],False,CandyCornMushroom,,[],,text,t2_ro01mbxu,False,False,False,[],False,False,1661401963,self.DogAdvice,https://www.reddit.com/r/DogAdvice/comments/wx...,{},wx4mi0,False,True,False,False,False,True,True,False,,[],dark,text,False,False,True,1,0,False,all_ads,/r/DogAdvice/comments/wx4mi0/dog_wont_stop_whi...,False,,,6,1661401973,1,Our older cat has passed away unfortunately it...,True,False,False,DogAdvice,t5_367ex,56301,public,confidence,self,,,Dog won’t stop whining with new kitten,0,[],1.0,https://www.reddit.com/r/DogAdvice/comments/wx...,,all_ads,6,red,730c7708-7eb8-11e6-84f0-0e4f47e9f151,Question,,,,,,,,,,,,
3997,[],False,Waffle-Raccoon,,[],,text,t2_i3bq62h8,False,False,False,[],False,False,1661401328,self.DogAdvice,https://www.reddit.com/r/DogAdvice/comments/wx...,{},wx4fd0,False,True,False,False,False,True,True,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/DogAdvice/comments/wx4fd0/dog_is_hiding_und...,False,,,6,1661401339,1,I recently got a new dog(bulldog) who is about...,True,False,False,DogAdvice,t5_367ex,56300,public,confidence,self,,,Dog is hiding under bed since new dog,0,[],1.0,https://www.reddit.com/r/DogAdvice/comments/wx...,,all_ads,6,red,730c7708-7eb8-11e6-84f0-0e4f47e9f151,Question,,,,,,,,,,,,
3998,[],False,A12354,,[],,text,t2_41t3aa99,False,False,False,[],False,False,1661400587,i.redd.it,https://www.reddit.com/r/DogAdvice/comments/wx...,{},wx46ka,False,True,False,False,True,True,False,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/DogAdvice/comments/wx46ka/i_took_my_dog_to_...,False,image,"{'enabled': True, 'images': [{'id': 'Pa0iTyQqc...",6,1661400598,1,,True,False,False,DogAdvice,t5_367ex,56298,public,confidence,https://a.thumbs.redditmedia.com/Qb6m2DMqHQOJo...,105.0,140.0,I took my dog to doggie daycare. Since then he...,0,[],1.0,https://i.redd.it/enaka8xebsj91.jpg,https://i.redd.it/enaka8xebsj91.jpg,all_ads,6,red,730c7708-7eb8-11e6-84f0-0e4f47e9f151,Question,,,,,,,,,,,,
3999,[],False,PassengerSimilar7124,,[],,text,t2_jtxpugqa,False,False,False,[],False,False,1661397358,i.redd.it,https://www.reddit.com/r/DogAdvice/comments/wx...,{},wx343r,False,True,False,False,True,True,False,False,,[],dark,text,False,False,True,0,0,True,all_ads,/r/DogAdvice/comments/wx343r/help_dog_has_weir...,False,image,"{'enabled': True, 'images': [{'id': 'lh4WvX_Ru...",6,1661397369,1,,True,False,False,DogAdvice,t5_367ex,56298,public,confidence,nsfw,140.0,140.0,Help!!! Dog has weird growth on her paw. what ...,0,[],1.0,https://i.redd.it/jm86qp7t1sj91.jpg,https://i.redd.it/jm86qp7t1sj91.jpg,promo_adult_nsfw,3,,,,,,,,,,,,,,,
