# Project 3: Web APIs & Classification


## Problem Statement

This project builds a prototype classification model that predicts which subreddit a Reddit post belongs to, using the post's title and self text (the post body) combined. The aim is to see whether such a model could help Reddit detect more quickly whether a post belongs in its subreddit or should be moved elsewhere. If the prototype works, the same approach could later be extended to classify posts across other subreddits.

### Table of Contents
- [Create headers and url](#Create-headers-and-url)
- [Personal finance subreddit scrape](#Personal-finance-subreddit-scrape)
- [Student Loans subreddit scrape](#Student-Loans-subreddit-scrape)

In [2]:
#Import the necessary libraries
import requests
import pandas as pd
import time
import random
from bs4 import BeautifulSoup

pd.set_option('display.max_columns', None)

## Create headers and url

In [3]:
#set a custom User-agent header so Reddit does not see the default Python requests agent
headers = {'User-agent': 'Geoff Inc 8.0'}

In [17]:
pfurl = 'https://www.reddit.com/r/personalfinance/new.json'

In [7]:
loanurl = 'https://www.reddit.com/r/StudentLoans/.json'

## Personal finance subreddit scrape

In [18]:
#scraping for personal finance subreddit posts
pf_posts = []
after = None

for a in range(40):
    if after is None:
        current_url = pfurl
    else:
        current_url = pfurl + '?after=' + after
    print(current_url)
    res = requests.get(current_url, headers=headers)
    
    if res.status_code != 200:
        print('Status error', res.status_code)
        break
    
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    pf_posts.extend(current_posts)
    after = current_dict['data']['after']
    
    
    if a > 0:
        prev_posts = pd.read_csv('./datasets/pf.csv')
        current_df = pd.DataFrame(current_posts)
        final = pd.concat([prev_posts,current_df])
        final.to_csv('./datasets/pf.csv',index=False)
        
    else:
        pd.DataFrame(current_posts).to_csv('./datasets/pf.csv', index = False)
    
    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2,6)
    print(sleep_duration)
    time.sleep(sleep_duration)

https://www.reddit.com/r/personalfinance/new.json
4
https://www.reddit.com/r/personalfinance/new.json?after=t3_gjmecc
2
https://www.reddit.com/r/personalfinance/new.json?after=t3_gjkdic
4
https://www.reddit.com/r/personalfinance/new.json?after=t3_gjgsid
4
https://www.reddit.com/r/personalfinance/new.json?after=t3_gjf6lf
4
https://www.reddit.com/r/personalfinance/new.json?after=t3_gjdnuf
4
https://www.reddit.com/r/personalfinance/new.json?after=t3_gjc8zs
2
https://www.reddit.com/r/personalfinance/new.json?after=t3_gjb3tx
4
https://www.reddit.com/r/personalfinance/new.json?after=t3_gja5ii
3
https://www.reddit.com/r/personalfinance/new.json?after=t3_gj8l9s
6
https://www.reddit.com/r/personalfinance/new.json?after=t3_gj7b70
5
https://www.reddit.com/r/personalfinance/new.json?after=t3_gj68kc
3
https://www.reddit.com/r/personalfinance/new.json?after=t3_gj54da
5
https://www.reddit.com/r/personalfinance/new.json?after=t3_gj41st
2
https://www.reddit.com/r/personalfinance/new.json?after=t3_gj2q7
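
The loop above pages through the listing by passing the previous response's `after` token back to Reddit. A minimal sketch of the same idea as a reusable function follows; `build_page_url`, `collect_posts` and the `fetch_page` callable are hypothetical names introduced here, not part of the original notebook. One difference is an assumption worth flagging: the sketch stops as soon as `after` comes back empty, because otherwise the next request would go back to the first page and re-download posts already collected, which may be one source of the duplicates dropped later.

```python
import random
import time


def build_page_url(base_url, after=None):
    """Return the listing URL for the page that follows the `after` token."""
    if after is None:
        return base_url
    return base_url + '?after=' + after


def collect_posts(fetch_page, base_url, max_pages, pause=(0, 0)):
    """Follow `after` tokens for up to `max_pages` pages and return all posts.

    `fetch_page` is any callable that takes a URL and returns the parsed
    JSON listing as a dict (hypothetical helper, supplied by the caller).
    """
    posts = []
    after = None
    for _ in range(max_pages):
        listing = fetch_page(build_page_url(base_url, after))
        posts.extend(child['data'] for child in listing['data']['children'])
        after = listing['data']['after']
        if after is None:
            # No further pages: stop instead of starting over from page one.
            break
        # Random pause between requests, as in the loop above.
        time.sleep(random.randint(*pause))
    return posts
```

With `requests`, `fetch_page` could be `lambda url: requests.get(url, headers=headers).json()`, and `pause=(2, 6)` would reproduce the sleep range used above.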

In [19]:
#read personal finance csv
pfdf = pd.read_csv('./datasets/pf.csv')

In [20]:
pfdf.shape

(995, 104)

In [21]:
#check that pfdf opens successfully
pfdf.head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,downs,hide_score,name,quarantine,link_flair_text_color,upvote_ratio,author_flair_background_color,subreddit_type,ups,total_awards_received,media_embed,author_flair_template_id,is_original_content,user_reports,secure_media,is_reddit_media_domain,is_meta,category,secure_media_embed,link_flair_text,can_mod_post,score,approved_by,author_premium,thumbnail,edited,author_flair_css_class,author_flair_richtext,gildings,content_categories,is_self,mod_note,created,link_flair_type,wls,removed_by_category,banned_by,author_flair_type,domain,allow_live_comments,selftext_html,likes,suggested_sort,banned_at_utc,view_count,archived,no_follow,is_crosspostable,pinned,over_18,all_awardings,awarders,media_only,can_gild,spoiler,locked,author_flair_text,treatment_tags,visited,removed_by,num_reports,distinguished,subreddit_id,mod_reason_by,removal_reason,link_flair_background_color,id,is_robot_indexable,report_reasons,author,discussion_type,num_comments,send_replies,whitelist_status,contest_mode,mod_reports,author_patreon_flair,author_flair_text_color,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,link_flair_template_id,crosspost_parent_list,crosspost_parent
0,,personalfinance,"So, the overview: I'm an out of work cook with...",t2_hdl1y,False,,0,False,Overwhelmed By My Finances &amp; Getting Force...,[],r/personalfinance,False,6,Unset,0,True,t3_gjnn7e,False,dark,1.0,,public,1,0,{},,False,[],,False,False,,{},Other,False,1,,False,,False,,[],{},,True,,1589496000.0,text,6,,,text,self.personalfinance,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,,,,False,True,False,False,False,[],[],False,False,False,False,​,[],False,,,,t5_2qstm,,,,gjnn7e,True,,iigaijinne,,0,True,all_ads,False,[],False,dark,/r/personalfinance/comments/gjnn7e/overwhelmed...,all_ads,False,https://www.reddit.com/r/personalfinance/comme...,14166231,1589467000.0,0,,False,,,
1,,personalfinance,According to [tax.service.gov.uk](https://tax....,t2_6u5exw,False,,0,False,National Insurance - Year is Not Full,[],r/personalfinance,False,6,Insurance,0,True,t3_gjnmwp,False,dark,1.0,,public,1,0,{},,False,[],,False,False,,{},Insurance,False,1,,False,,False,,[],{},,True,,1589496000.0,text,6,,,text,self.personalfinance,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,,,,False,True,False,False,False,[],[],False,False,False,False,​,[],False,,,,t5_2qstm,,,,gjnmwp,True,,custardy_cream,,0,True,all_ads,False,[],False,dark,/r/personalfinance/comments/gjnmwp/national_in...,all_ads,False,https://www.reddit.com/r/personalfinance/comme...,14166231,1589467000.0,0,,False,,,
2,,personalfinance,I’m in Wyoming. And I’m pursuing a refinance t...,t2_ejrdb,False,,0,False,Refinancing Process,[],r/personalfinance,False,6,Housing,0,True,t3_gjnlbs,False,light,1.0,,public,1,0,{},,False,[],,False,False,,{},Housing,False,1,,False,,False,,[],{},,True,,1589496000.0,text,6,,,text,self.personalfinance,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,,,,False,True,False,False,False,[],[],False,False,False,False,​,[],False,,,,t5_2qstm,,,#c313d3,gjnlbs,True,,Dr3s4ng,,1,True,all_ads,False,[],False,dark,/r/personalfinance/comments/gjnlbs/refinancing...,all_ads,False,https://www.reddit.com/r/personalfinance/comme...,14166231,1589467000.0,0,,False,1033dbd0-c078-11e4-b0f1-22000b3d8247,,
3,,personalfinance,"Like the title says, I am now financially able...",t2_22u3z24s,False,,0,False,Finally maxing out my SEP IRA (as an employee)...,[],r/personalfinance,False,6,Retirement,0,True,t3_gjned9,False,light,1.0,,public,1,0,{},,False,[],,False,False,,{},Retirement,False,1,,False,,False,,[],{},,True,,1589495000.0,text,6,,,text,self.personalfinance,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,,,,False,True,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2qstm,,,#aa8c10,gjned9,True,,OliveandTina,,0,True,all_ads,False,[],False,,/r/personalfinance/comments/gjned9/finally_max...,all_ads,False,https://www.reddit.com/r/personalfinance/comme...,14166231,1589467000.0,0,,False,2361ec24-c078-11e4-acc1-22000b290247,,
4,,personalfinance,My fiance's father is going through a pretty r...,t2_1m5q9osg,False,,0,False,Advice for limiting damage from joint &amp; au...,[],r/personalfinance,False,6,Credit,0,True,t3_gjneqq,False,dark,1.0,,public,1,0,{},,False,[],,False,False,,{},Credit,False,1,,False,,False,,[],{},,True,,1589495000.0,text,6,,,text,self.personalfinance,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,,,,False,True,False,False,False,[],[],False,False,False,False,​,[],False,,,,t5_2qstm,,,,gjneqq,True,,Pioneeress,,9,True,all_ads,False,[],False,dark,/r/personalfinance/comments/gjneqq/advice_for_...,all_ads,False,https://www.reddit.com/r/personalfinance/comme...,14166231,1589467000.0,0,,False,,,


In [22]:
#retain 'name','subreddit','title','selftext' columns
pfdf = pfdf[['name','subreddit','title','selftext']]

In [23]:
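#count rows whose selftext is not an empty string
#note: read_csv loads empty fields as NaN, and NaN != '' is True, so every row is counted
len(pfdf[pfdf['selftext']!=''])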

995
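
The count above equals the full row count, which is expected: `read_csv` loads empty text fields as `NaN`, and `NaN != ''` evaluates to `True`. A small sketch of a stricter check follows, on made-up data; the `has_text` helper and the treatment of `[removed]` and `[deleted]` as empty bodies are assumptions for illustration, not something the original notebook does.

```python
import io

import pandas as pd

# Made-up sample standing in for pf.csv; the second row has an empty selftext.
sample_csv = io.StringIO(
    "name,selftext\n"
    "t3_a,Some real post body\n"
    "t3_b,\n"
    "t3_c,[removed]\n"
)
sample = pd.read_csv(sample_csv)


def has_text(selftext, placeholders=('[removed]', '[deleted]')):
    """Boolean mask: True where selftext holds actual text."""
    cleaned = selftext.fillna('').str.strip()
    return cleaned.ne('') & ~cleaned.isin(placeholders)


naive_count = len(sample[sample['selftext'] != ''])   # NaN rows still counted
strict_count = int(has_text(sample['selftext']).sum())
```

Applied to the real data, `has_text(pfdf['selftext']).sum()` would give the number of posts that actually carry body text.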

In [24]:
#drop every column except name, subreddit, title and selftext
pfdf.drop(columns=pfdf.columns.difference(['name','subreddit','title','selftext']),inplace=True)

In [29]:
#drop rows with duplicate titles in pfdf
pfdf.drop_duplicates(subset='title',keep='first',inplace=True,ignore_index=True)

In [30]:
pfdf.drop_duplicates(subset='selftext',keep='first',inplace=True,ignore_index=True)

In [31]:
pfdf.shape

(994, 4)
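
The two `drop_duplicates` calls above treat posts as duplicates when their titles or bodies match. An alternative, sketched below as an assumption rather than the notebook's method, is to deduplicate on `name`, Reddit's per-post identifier (for example `t3_gjnn7e`), so that distinct posts that happen to share a title are kept. `dedupe_by_name` is a hypothetical helper that works on the raw list of post dictionaries.

```python
def dedupe_by_name(posts):
    """Keep the first occurrence of each post, identified by its `name` field."""
    seen = set()
    unique = []
    for post in posts:
        if post['name'] in seen:
            continue
        seen.add(post['name'])
        unique.append(post)
    return unique


# Made-up posts: the first entry was scraped twice, and two different
# posts share the same title.
example_posts = [
    {'name': 't3_aaa', 'title': 'Budget help'},
    {'name': 't3_aaa', 'title': 'Budget help'},
    {'name': 't3_bbb', 'title': 'Budget help'},
]
unique_posts = dedupe_by_name(example_posts)
```

On a DataFrame, the equivalent would be `drop_duplicates(subset='name')`.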

In [32]:
#save pfdf to csv
pfdf.to_csv('./datasets/ddup_personalfinance.csv')

## Student Loans subreddit scrape

In [9]:
#scraping for student loans subreddit posts
loan_posts = []
after = None

for a in range(100):
    if after is None:
        current_url = loanurl
    else:
        current_url = loanurl + '?after=' + after
    print(current_url)
    loanres = requests.get(current_url, headers=headers)
    
    if loanres.status_code != 200:
        print('Status error', loanres.status_code)
        break
    
    
    current_dict = loanres.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    loan_posts.extend(current_posts)
    after = current_dict['data']['after']
    
    
    if a > 0:
        prev_posts = pd.read_csv('./datasets/loan.csv')
        current_df = pd.DataFrame(current_posts)
        finalloan = pd.concat([prev_posts,current_df])
        finalloan.to_csv('./datasets/loan.csv',index=False)
        
    else:
        pd.DataFrame(current_posts).to_csv('./datasets/loan.csv', index = False)
    
    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2,6)
    print(sleep_duration)
    time.sleep(sleep_duration)

https://www.reddit.com/r/StudentLoans/.json
5
https://www.reddit.com/r/StudentLoans/.json?after=t3_ghwirh
5
https://www.reddit.com/r/StudentLoans/.json?after=t3_gh1msb
5
https://www.reddit.com/r/StudentLoans/.json?after=t3_ggfrcb
4
https://www.reddit.com/r/StudentLoans/.json?after=t3_gfn4gj
6
https://www.reddit.com/r/StudentLoans/.json?after=t3_ges7xz
6
https://www.reddit.com/r/StudentLoans/.json?after=t3_gdtia0
3
https://www.reddit.com/r/StudentLoans/.json?after=t3_gdcsex
6
https://www.reddit.com/r/StudentLoans/.json?after=t3_gc6jp5
2
https://www.reddit.com/r/StudentLoans/.json?after=t3_gbu91f
6
https://www.reddit.com/r/StudentLoans/.json?after=t3_gbhdg1
4
https://www.reddit.com/r/StudentLoans/.json?after=t3_gadqpb
3
https://www.reddit.com/r/StudentLoans/.json?after=t3_g9xgke
6
https://www.reddit.com/r/StudentLoans/.json?after=t3_g91vbm
6
https://www.reddit.com/r/StudentLoans/.json?after=t3_g8q4bo
5
https://www.reddit.com/r/StudentLoans/.json?after=t3_g7mg1i
6
https://www.reddit.com/r

In [15]:
#read loan csv file
loandf = pd.read_csv('./datasets/loan.csv')

In [16]:
loandf.shape

(2468, 104)

In [33]:
#check that loandf opens successfully
loandf.head()

Unnamed: 0,name,subreddit,title,selftext
0,t3_9w474g,StudentLoans,How to Identify a Student Loan Scam,It seems it's time to sticky another post abou...
1,t3_ghp77u,StudentLoans,Update on credit bureau reporting for COVID wa...,Hi there. This weekend many of you reported t...
2,t3_ghxdmi,StudentLoans,"""Average"" Person Paying Loans? Not a doctor/la...",Hey! Long-time lurker...\n\n Not sure if this ...
3,t3_gi5zt3,StudentLoans,Why would my student loan payment go down?,"So every month, I pay roughly about 100/month ..."
4,t3_ghz2d7,StudentLoans,Conflicting advice for mountain of debt (high ...,"Hello all! I'm extremely happy with my career,..."


In [17]:
loandf = loandf[['name','subreddit','title','selftext']]

In [18]:
len(loandf[loandf['selftext']!=''])

2468

In [22]:
#drop every column except name, subreddit, title and selftext
loandf.drop(columns=loandf.columns.difference(['name','subreddit','title','selftext']),inplace=True)

In [23]:
#drop duplicate rows in loandf
loandf.drop_duplicates(subset='title',keep='first',inplace=True,ignore_index=True)

In [30]:
loandf.drop_duplicates(subset='selftext',keep='first',inplace=True,ignore_index=True)

In [27]:
loandf.shape

(947, 4)
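
The drop from 2468 scraped rows to 947 suggests that many posts were downloaded more than once. A quick way to check this, sketched here with made-up identifiers, is to count how often each `name` occurs; `duplicate_summary` is a hypothetical helper and not part of the original notebook.

```python
from collections import Counter


def duplicate_summary(names):
    """Return (total rows, unique names, names seen more than once)."""
    counts = Counter(names)
    repeated = {name: n for name, n in counts.items() if n > 1}
    return len(names), len(counts), repeated


# Made-up identifiers standing in for the scraped post names.
names = ['t3_x1', 't3_x2', 't3_x1', 't3_x3', 't3_x1']
total, unique, repeated = duplicate_summary(names)
```

With the scraped data, `names` could be `[p['name'] for p in loan_posts]`.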

In [32]:
#save loandf to csv
loandf.to_csv('./datasets/ddup_loan.csv')