# Web APIs & Classification


## Project Challenge Statement

### Goal: 
#### 1. Using Reddit's API, collect posts from three subreddits: AskWomen, AskMen, and Relationship_Advice.
#### 2. Use NLP to train a classifier that predicts which subreddit a given post came from. This is a binary classification problem.


### Things included in the dataset 

1. the title of the thread
2. the length of time the thread has been on Reddit
3. the number of comments on the thread
4. the self text
5. the subreddit the thread belongs to


## Table of Contents 

This notebook is divided into sections for analysis. The links below jump to each section for easy navigation.

### Contents:
- [Subreddits Data Collection: AskWomen](#Subreddits-Data-Collection:-AskWomen)
- [Subreddits Data Collection: AskMen](#Subreddits-Data-Collection:-AskMen)
- [Subreddits Data Collection: Relationship_Advice](#Subreddits-Data-Collection:-Relationship_Advice)
- [Function to_dataframe](#Function-to_dataframe)

##### Libraries 

In [1]:
import requests 
import pandas as pd 
import time 

### Subreddits Data Collection: AskWomen

In [9]:
# Reference code from Riely's video

headers = {'User-agent': "Evelyn Li"}
women_posts = []
after = None  # pagination cursor returned by the previous response
for i in range(40):
    if i % 10 == 0:
        print(i)  # progress indicator
    # The first request has no cursor; later requests pass the `after` token
    if after is None:
        params = {}
    else:
        params = {'after': after}

    url = 'https://www.reddit.com/r/AskWomen/.json'
    res = requests.get(url, params=params, headers=headers)

    if res.status_code == 200:
        reddit_json = res.json()
        women_posts.extend(reddit_json['data']['children'])
        after = reddit_json['data']['after']
    else:
        print(res.status_code)
        break
    time.sleep(1)  # pause between requests

0
10
20
30
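
The collection loops in this notebook differ only in the subreddit name and the number of pages, so the paging logic could be factored into one helper. The sketch below is one possible way to do that, not the notebook's original code; `collect_posts` and `make_reddit_fetcher` are hypothetical names. It also stops as soon as the response carries no `after` cursor. In the loop above, an empty cursor makes the next request start again from the first page, which is a likely source of the duplicate posts counted in the next cell.

```python
import time


def collect_posts(fetch_page, max_pages=40, pause=1.0):
    """Collect posts by following the listing's `after` cursor.

    `fetch_page(after)` must return the parsed JSON listing as a dict,
    or None if the request failed.
    """
    posts = []
    after = None
    for _ in range(max_pages):
        listing = fetch_page(after)
        if listing is None:
            break  # request failed; keep what was collected so far
        posts.extend(listing['data']['children'])
        after = listing['data']['after']
        if after is None:
            break  # listing exhausted; another request would restart at page 1
        time.sleep(pause)
    return posts


def make_reddit_fetcher(subreddit, user_agent="Evelyn Li"):
    """Build a fetch_page function for one subreddit (hypothetical helper)."""
    import requests  # imported here so collect_posts itself has no dependencies

    url = f'https://www.reddit.com/r/{subreddit}/.json'
    headers = {'User-agent': user_agent}

    def fetch_page(after):
        params = {} if after is None else {'after': after}
        res = requests.get(url, params=params, headers=headers)
        if res.status_code != 200:
            print(res.status_code)
            return None
        return res.json()

    return fetch_page
```

Under these assumptions, `women_posts = collect_posts(make_reddit_fetcher('AskWomen'))` would replace the loop above, and the same call with `'AskMen'` or `'relationship_advice'` would cover the other two subreddits.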


In [10]:
#check for unique posts
print('Post length', len(women_posts))
print('Post with Unique ID', len(set([post['data']['name'] for post in women_posts])))

Post length 999
Post with Unique ID 723
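
The gap between 999 collected posts and 723 unique IDs means some posts were fetched more than once. Duplicates can also be dropped from the raw list before any DataFrame is built. The sketch below keeps the first occurrence of each `name`; `dedupe_posts` is an illustrative name and the sample data is made up.

```python
def dedupe_posts(posts):
    """Return posts with duplicate `name` IDs removed, keeping first occurrences."""
    seen = set()
    unique = []
    for post in posts:
        name = post['data']['name']
        if name not in seen:
            seen.add(name)
            unique.append(post)
    return unique


# Tiny made-up sample in the same shape as Reddit's `children` entries
sample = [
    {'data': {'name': 't3_aaa'}},
    {'data': {'name': 't3_bbb'}},
    {'data': {'name': 't3_aaa'}},
]
print(len(sample), len(dedupe_posts(sample)))  # 3 2
```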


### Subreddits Data Collection: AskMen

In [14]:
# Reference code from Riely's video
headers = {'User-agent': "Evelyn Li"}
men_posts = []
after = None  # pagination cursor returned by the previous response
for i in range(40):
    if i % 10 == 0:
        print(i)  # progress indicator
    # The first request has no cursor; later requests pass the `after` token
    if after is None:
        params = {}
    else:
        params = {'after': after}

    url = 'https://www.reddit.com/r/AskMen/.json'
    res = requests.get(url, params=params, headers=headers)

    if res.status_code == 200:
        reddit_json = res.json()
        men_posts.extend(reddit_json['data']['children'])
        after = reddit_json['data']['after']
    else:
        print(res.status_code)
        break
    time.sleep(1)  # pause between requests

0
10
20
30


In [16]:
#check for unique posts 
print('Post length', len(men_posts))
print('Post with Unique ID', len(set([post['data']['name'] for post in men_posts])))

Post length 1001
Post with Unique ID 525


### Subreddits Data Collection: Relationship_Advice

In [17]:
# Reference code from Riely's video
headers = {'User-agent': "Evelyn Li"}
relationship_posts = []
after = None  # pagination cursor returned by the previous response
for i in range(50):
    if i % 10 == 0:
        print(i)  # progress indicator
    # The first request has no cursor; later requests pass the `after` token
    if after is None:
        params = {}
    else:
        params = {'after': after}

    url = 'https://www.reddit.com/r/relationship_advice/.json'
    res = requests.get(url, params=params, headers=headers)

    if res.status_code == 200:
        reddit_json = res.json()
        relationship_posts.extend(reddit_json['data']['children'])
        after = reddit_json['data']['after']
    else:
        print(res.status_code)
        break
    time.sleep(1)  # pause between requests

0
10
20
30
40


In [18]:
#check for unique posts 
print('Post length', len(relationship_posts))
print('Post with Unique ID', len(set([post['data']['name'] for post in relationship_posts])))

Post length 1236
Post with Unique ID 984


### Notes on Subreddits: 
1. AskWomen has 723 unique posts
2. AskMen has 525 unique posts

Since only this many unique posts were available from these subreddits, these are the posts I will build my model on.
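
For the binary AskWomen vs. AskMen problem, the class sizes set a floor that any classifier should beat: the accuracy of always predicting the larger class. The sketch below computes that floor from the unique-post counts printed by the duplicate checks above (723 and 525); `majority_baseline` is a hypothetical helper name.

```python
def majority_baseline(class_counts):
    """Accuracy obtained by always predicting the most common class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total


# Unique-post counts taken from the duplicate checks earlier in the notebook
counts = {'AskWomen': 723, 'AskMen': 525}
print(round(majority_baseline(counts), 3))  # 0.579
```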

## Function to_dataframe

### Building Function to Extract Necessary Information From API to DataFrame

In [19]:
# Function to build a DataFrame from the raw API posts

def to_dataframe(posts):
    """Extract the fields of interest from each post into a DataFrame."""
    post_list = []

    for pst in posts:
        post_dic = {
            'ID': pst['data']['name'],
            'Title': pst['data']['title'],
            # created_utc is the post's creation time as a Unix timestamp
            'Length_of_time': pst['data']['created_utc'],
            'Number_of_comment': pst['data']['num_comments'],
            'Content': pst['data']['selftext'],
            'Subreddit': pst['data']['subreddit']
        }

        post_list.append(post_dic)

    return pd.DataFrame(post_list)
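
Note that the `Length_of_time` column stores `created_utc`, a Unix timestamp in seconds, rather than a duration. If the age of a thread is wanted, it can be derived by subtracting the creation time from the collection time. The sketch below assumes the collection time is known; `post_age_days` and the example collection date are illustrative, since the notebook does not record when the data was pulled.

```python
from datetime import datetime, timezone


def post_age_days(created_utc, collected_at):
    """Days between a post's creation (Unix seconds, UTC) and collection time."""
    created = datetime.fromtimestamp(created_utc, tz=timezone.utc)
    return (collected_at - created).total_seconds() / 86400


# Hypothetical collection time; the notebook does not record the real one
collected_at = datetime(2019, 3, 30, tzinfo=timezone.utc)
print(round(post_age_days(1553858000.0, collected_at), 2))
```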

#### Save AskWomen to DataFrame

In [15]:
#convert women_post to dataframe 
women_df = to_dataframe(women_posts)
women_df.head()

| | Content | ID | Length_of_time | Number_of_comment | Subreddit | Title |
|---|---|---|---|---|---|---|
| 0 | `**Welcome to AskWomen!**\n\nIn honor of the ch...` | t3_b3r260 | 1553178000.0 | 0 | AskWomen | Welcome to a new season! Spring/Fall AskWomen ... |
| 1 | | t3_b6vwfn | 1553858000.0 | 45 | AskWomen | What was a time you had to let a dream (i.e. c... |
| 2 | | t3_b6ndvr | 1553802000.0 | 636 | AskWomen | What's the lamest thing you ever did to get a ... |
| 3 | | t3_b6xgfr | 1553867000.0 | 53 | AskWomen | Global check in: how is everyone with anxiety/... |
| 4 | `&amp;#x200B;\n\nHave you ever had a 'falling o...` | t3_b6wh0j | 1553861000.0 | 86 | AskWomen | What's the dumbest reason someone decided not ... |


In [39]:
# Keep the first occurrence of each post ID
women_df = women_df.drop_duplicates(subset='ID')

In [40]:
#dataframe without duplicates 
women_df.shape

(723, 6)

In [41]:
#Save as csv
women_df.to_csv('./data/AskWomen.csv')
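
By default, `to_csv` writes the DataFrame index as an unnamed first column, so reading the file back produces an extra `Unnamed: 0` column. The sketch below shows two ways around that, using an in-memory buffer and a small made-up frame in place of the real files. Whether to drop or keep the index depends on whether later notebooks rely on it.

```python
import io

import pandas as pd

# Made-up two-row frame standing in for women_df
df = pd.DataFrame({'ID': ['t3_aaa', 't3_bbb'], 'Title': ['first', 'second']})

# Default behaviour: the index is written as an unnamed first column
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf)
print(list(with_index.columns))  # ['Unnamed: 0', 'ID', 'Title']

# Option 1: do not write the index at all
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
no_index = pd.read_csv(buf)
print(list(no_index.columns))  # ['ID', 'Title']

# Option 2: keep the file as written and restore the index when reading
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
restored = pd.read_csv(buf, index_col=0)
print(list(restored.columns))  # ['ID', 'Title']
```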

#### Save AskMen to DataFrame

In [42]:
#convert men_post to dataframe 
men_df = to_dataframe(men_posts)
men_df.head()

| | Content | ID | Length_of_time | Number_of_comment | Subreddit | Title |
|---|---|---|---|---|---|---|
| 0 | Hello and welcome to the final discussion thre... | t3_b68n3g | 1553715000.0 | 4 | AskMen | The AskMen Book Club: "The Picture of Dorian G... |
| 1 | My dads health has been declining for while, n... | t3_b6q6yr | 1553817000.0 | 354 | AskMen | I am starting to realise my dad wont live fore... |
| 2 | Constructive criticism is good so I will start... | t3_b6vx6s | 1553858000.0 | 301 | AskMen | What do you see on women's dating profiles tha... |
| 3 | Because constructive criticism is helpful \n\n... | t3_b6xrue | 1553869000.0 | 20 | AskMen | What could women put in their dating profiles ... |
| 4 | | t3_b6wej9 | 1553861000.0 | 38 | AskMen | What are some things on your mind that you can... |


In [43]:
men_df.shape

(1001, 6)

In [44]:
# Keep the first occurrence of each post ID
men_df = men_df.drop_duplicates(subset='ID')

In [45]:
#dataframe without duplicates 
print(men_df.shape)
men_df.head()

(525, 6)


| | Content | ID | Length_of_time | Number_of_comment | Subreddit | Title |
|---|---|---|---|---|---|---|
| 0 | Hello and welcome to the final discussion thre... | t3_b68n3g | 1553715000.0 | 4 | AskMen | The AskMen Book Club: "The Picture of Dorian G... |
| 1 | My dads health has been declining for while, n... | t3_b6q6yr | 1553817000.0 | 354 | AskMen | I am starting to realise my dad wont live fore... |
| 2 | Constructive criticism is good so I will start... | t3_b6vx6s | 1553858000.0 | 301 | AskMen | What do you see on women's dating profiles tha... |
| 3 | Because constructive criticism is helpful \n\n... | t3_b6xrue | 1553869000.0 | 20 | AskMen | What could women put in their dating profiles ... |
| 4 | | t3_b6wej9 | 1553861000.0 | 38 | AskMen | What are some things on your mind that you can... |


In [46]:
#Save as csv
men_df.to_csv('./data/AskMen.csv')

#### Save Relationship_Advice to DataFrame

In [48]:
relationship_df = to_dataframe(relationship_posts)
relationship_df.head()

| | Content | ID | Length_of_time | Number_of_comment | Subreddit | Title |
|---|---|---|---|---|---|---|
| 0 | `###Applications are open.\n\nApplications may ...` | t3_b11tx5 | 1552578000.0 | 23 | relationship_advice | [Meta] Mod Applications |
| 1 | Since two or three times a week we end up remo... | t3_b2nc2f | 1552939000.0 | 52 | relationship_advice | [meta] Think of the comments as an inverted Ub... |
| 2 | So... my girlfriend tends to be sometimes jeal... | t3_b6wkec | 1553862000.0 | 284 | relationship_advice | My girlfriend (20F) secretly took my (22M) fac... |
| 3 | I’ll get right to it\n\nBF and I have been tog... | t3_b6nhjf | 1553803000.0 | 10685 | relationship_advice | My (24F) boyfriend (25M) had a bizarre reactio... |
| 4 | I don't want anyone having the slightest clue ... | t3_b6wak4 | 1553860000.0 | 163 | relationship_advice | I (19f) dumped and blocked my boyfriend (19m) ... |


In [49]:
relationship_df.shape

(1236, 6)

In [50]:
# Keep the first occurrence of each post ID
relationship_df = relationship_df.drop_duplicates(subset='ID')

In [52]:
#dataframe without duplicates 
relationship_df.shape

(984, 6)

In [53]:
relationship_df.to_csv('./data/relationship_advice.csv')