# Web APIs & Classification

## Project Challenge Statement

### Goal: 
#### 1. Using Reddit's API, collect posts from three subreddits: AskWomen, AskMen, and Relationship_Advice. 
#### 2. Use NLP to train a classifier that predicts which subreddit a given post came from. Each pair of subreddits is treated as a binary classification problem.



### Data Cleaning Steps 

#### 1. Cleaning Length_of_time Variable 
- Record the current time as a timestamp and subtract each post's creation timestamp from it to get how long the post has been up, in days.

#### 2. Cleaning Title 
- Replace punctuation, digits and other non-letter characters with spaces; keep text only. 

#### 3. Cleaning Content 
- Replace punctuation, digits and other non-letter characters with spaces; keep text only. 

#### 4. Combining DataFrames 
- AskMen vs AskWomen 
- AskWomen vs Relationship_Advice 
- AskMen vs Relationship_Advice 

## Table of Contents 

This notebook is broken down into sections for analysis purposes. The following links point to the different sections within the notebook for simple navigation. 

### Contents:
- [Cleaning Length_of_time Variable](#Cleaning-Length_of_time-Variable)
- [Cleaning Title](#Cleaning-Title)
- [Cleaning Content](#Cleaning-Content)
- [Combining DataFrames](#Combining-DataFrames)

In [1]:
#Import libraries 
import requests 
import pandas as pd 
import datetime
import time 
import re

In [2]:
#import data 
men_df = pd.read_csv("../data/AskMen.csv")
women_df = pd.read_csv("../data/AskWomen.csv")
relationship_df = pd.read_csv("../data/relationship_advice.csv")

## Cleaning Length_of_time Variable

In [3]:
#Current time as a Unix timestamp (seconds); this value changes on every run, the output below is from the recorded run 
dt = datetime.datetime.now()
now  = time.mktime(dt.timetuple())
now 

1554436322.0
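
The number printed above is a Unix timestamp (seconds since 1970-01-01 UTC), which is hard to read at a glance. A minimal sketch, using only the standard library, for turning the recorded value into a calendar date; the variable names are illustrative and not part of the original notebook:

```python
from datetime import datetime, timezone

# Value recorded in the notebook output above
recorded_now = 1554436322.0

# Interpret the timestamp in UTC so the result does not depend on the local time zone
recorded_dt = datetime.fromtimestamp(recorded_now, tz=timezone.utc)
print(recorded_dt.isoformat())  # 2019-04-05T03:52:02+00:00
```

Because `now` is taken at run time, re-running the cell shifts every `Length_of_time_days` value. Storing one reference timestamp alongside the data would keep the feature reproducible.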

In [4]:
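#Build a function that converts Unix timestamps to post age in days 

def time_stamp_to_day(time_column):
    post_length = []
    for created in time_column:  #renamed from `time` so the time module is not shadowed 
        length = now - created   #seconds elapsed since the post was made 
        length = length/86400    #86400 seconds in a day, so this is the number of days since the post was made 
        day = round(length, 2)
        post_length.append(day)
    return post_length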
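
The same conversion can also be written without an explicit loop. The sketch below assumes the column holds Unix timestamps in seconds and takes the reference time as an argument rather than reading a global; `timestamps_to_days` and the sample values are hypothetical.

```python
import pandas as pd

SECONDS_PER_DAY = 86400

def timestamps_to_days(time_column, reference_time):
    """Return the age of each post in days, rounded to two decimals."""
    # Series arithmetic is applied element-wise, so no loop is needed
    return ((reference_time - time_column) / SECONDS_PER_DAY).round(2)

# Hypothetical sample: posts created 1 day and 2.5 days before the reference time
reference_time = 1554436322.0
sample = pd.Series([reference_time - 86400, reference_time - 216000])
print(timestamps_to_days(sample, reference_time).tolist())  # [1.0, 2.5]
```

Passing the reference time explicitly makes the function give the same result no matter when the notebook is run.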

In [5]:
#Engineering days for Men df 
men_df['Length_of_time_days'] = time_stamp_to_day(men_df['Length_of_time'])
men_df.head(2)

Unnamed: 0.1,Unnamed: 0,Content,ID,Length_of_time,Number_of_comment,Subreddit,Title,Length_of_time_days
0,0,Hello and welcome to the final discussion thre...,t3_b68n3g,1553715000.0,4,AskMen,"The AskMen Book Club: ""The Picture of Dorian G...",8.34
1,1,"My dads health has been declining for while, n...",t3_b6q6yr,1553817000.0,354,AskMen,I am starting to realise my dad wont live fore...,7.17


In [6]:
#Engineering days for women df 
women_df['Length_of_time_days'] = time_stamp_to_day(women_df['Length_of_time'])
women_df.head(2)

Unnamed: 0.1,Unnamed: 0,Content,ID,Length_of_time,Number_of_comment,Subreddit,Title,Length_of_time_days
0,0,**Welcome to AskWomen!**\n\nIn honor of the ch...,t3_b3r260,1553178000.0,0,AskWomen,Welcome to a new season! Spring/Fall AskWomen ...,14.57
1,1,,t3_b6vwfn,1553858000.0,45,AskWomen,What was a time you had to let a dream (i.e. c...,6.7


In [7]:
#Engineering days for relationship df 
relationship_df['Length_of_time_days'] = time_stamp_to_day(relationship_df['Length_of_time'])
relationship_df.head(2)

Unnamed: 0.1,Unnamed: 0,Content,ID,Length_of_time,Number_of_comment,Subreddit,Title,Length_of_time_days
0,0,###Applications are open.\n\nApplications may ...,t3_b11tx5,1552578000.0,23,relationship_advice,[Meta] Mod Applications,21.51
1,1,Since two or three times a week we end up remo...,t3_b2nc2f,1552939000.0,52,relationship_advice,[meta] Think of the comments as an inverted Ub...,17.32


## Cleaning Title

In [8]:
#Cleaning title function: replace every non-letter character with a space 
def clean_title(title_columns): 
    cleaned = []  #separate name so the list does not shadow the function 
    for m in title_columns: 
        col = re.sub("[^a-zA-Z]", " ", m)
        cleaned.append(col)
    return cleaned
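
To make the effect of the pattern concrete: `[^a-zA-Z]` matches every character that is not an ASCII letter, so digits, punctuation and accented letters are each replaced by one space. A small sketch on made-up titles, alongside a pandas `str.replace` equivalent (assuming the column contains only strings):

```python
import re
import pandas as pd

titles = pd.Series(["[Meta] Mod Applications", "My girlfriend (20F) said hi!"])

# Loop form, as in the notebook: one space per non-letter character
looped = [re.sub("[^a-zA-Z]", " ", t) for t in titles]

# Vectorized form; regex=True is needed for the pattern to be treated as a regex
vectorized = titles.str.replace("[^a-zA-Z]", " ", regex=True)

print(repr(looped[0]))                 # ' Meta  Mod Applications'
print(vectorized.tolist() == looped)   # True
```

Runs of spaces are left in place by both forms; `" ".join(text.split())` collapses them if that matters downstream.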

In [9]:
#Cleaning title columns for Men df
men_df["Clean_Title"] = clean_title(men_df['Title'])
men_df.head(2)

Unnamed: 0.1,Unnamed: 0,Content,ID,Length_of_time,Number_of_comment,Subreddit,Title,Length_of_time_days,Clean_Title
0,0,Hello and welcome to the final discussion thre...,t3_b68n3g,1553715000.0,4,AskMen,"The AskMen Book Club: ""The Picture of Dorian G...",8.34,The AskMen Book Club The Picture of Dorian G...
1,1,"My dads health has been declining for while, n...",t3_b6q6yr,1553817000.0,354,AskMen,I am starting to realise my dad wont live fore...,7.17,I am starting to realise my dad wont live fore...


In [10]:
#Cleaning title columns for women df
women_df["Clean_Title"] = clean_title(women_df['Title'])
women_df.head(2)

Unnamed: 0.1,Unnamed: 0,Content,ID,Length_of_time,Number_of_comment,Subreddit,Title,Length_of_time_days,Clean_Title
0,0,**Welcome to AskWomen!**\n\nIn honor of the ch...,t3_b3r260,1553178000.0,0,AskWomen,Welcome to a new season! Spring/Fall AskWomen ...,14.57,Welcome to a new season Spring Fall AskWomen ...
1,1,,t3_b6vwfn,1553858000.0,45,AskWomen,What was a time you had to let a dream (i.e. c...,6.7,What was a time you had to let a dream i e c...


In [11]:
#Cleaning title columns for relationship df
relationship_df["Clean_Title"] = clean_title(relationship_df['Title'])
relationship_df.head(2)

Unnamed: 0.1,Unnamed: 0,Content,ID,Length_of_time,Number_of_comment,Subreddit,Title,Length_of_time_days,Clean_Title
0,0,###Applications are open.\n\nApplications may ...,t3_b11tx5,1552578000.0,23,relationship_advice,[Meta] Mod Applications,21.51,Meta Mod Applications
1,1,Since two or three times a week we end up remo...,t3_b2nc2f,1552939000.0,52,relationship_advice,[meta] Think of the comments as an inverted Ub...,17.32,meta Think of the comments as an inverted Ub...


## Cleaning Content

#### Notes: 
Some Reddit posts have no content, either because the post is an image/gif or because the author put everything in the title. To handle this, I combine the title and the content into a single cell. 
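
A minimal sketch of that combination step on a made-up frame, written as one helper that could be reused for all three DataFrames. The helper name `combine_title_content` and the sample rows are hypothetical. It joins the two fields with a space (an assumption, so that the last title word and the first content word stay separate), treats missing content as an empty string, and trims leading and trailing spaces, which the notebook cells do not do.

```python
import pandas as pd

def combine_title_content(df, title_col="Clean_Title", content_col="Content"):
    """Return title + content as one Series; rows without content keep the title only."""
    content = df[content_col].fillna("")  # missing content becomes an empty string
    combined = df[title_col].str.cat(content, sep=" ")
    # Replace non-letters with spaces, then trim the ends
    return combined.str.replace("[^a-zA-Z]", " ", regex=True).str.strip()

sample = pd.DataFrame({
    "Clean_Title": ["Mod Applications", "What is on your mind"],
    "Content": ["Applications are open.", None],
})
sample["Title_Content"] = combine_title_content(sample)
print(sample["Title_Content"].tolist())
```

The second row has no content, so its `Title_Content` is just the title.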

In [12]:
# Men df missing values 
men_df['Content'].isnull().sum()

229

In [13]:
# Women df missing values 
women_df['Content'].isnull().sum()

512

In [14]:
# relationship df missing values 
relationship_df['Content'].isnull().sum()

5

#### Combining Title Column with Content Column For men_df

In [15]:
#Combining columns for men_df 
title_content = []

for i in list(men_df.index): 
    if pd.isnull(men_df['Content'][i]):
        #No content (e.g. image/gif post): keep the title only 
        title_content.append(str(men_df['Clean_Title'][i]))
        
    else: 
        #Join with a space so the last title word does not fuse with the first content word 
        title_content.append(str(men_df['Clean_Title'][i]) + " " + str(men_df['Content'][i]))
        
#cleaning the column 
men_df['Title_Content'] = title_content
men_df["Title_Content"] = clean_title(men_df['Title_Content'])

In [16]:
title_content[4]

'What are some things on your mind that you can t talk about without being told  be a man  '

In [17]:
men_df['Title_Content'].isnull().sum()

0

In [18]:
men_df.head(5)

Unnamed: 0.1,Unnamed: 0,Content,ID,Length_of_time,Number_of_comment,Subreddit,Title,Length_of_time_days,Clean_Title,Title_Content
0,0,Hello and welcome to the final discussion thre...,t3_b68n3g,1553715000.0,4,AskMen,"The AskMen Book Club: ""The Picture of Dorian G...",8.34,The AskMen Book Club The Picture of Dorian G...,The AskMen Book Club The Picture of Dorian G...
1,1,"My dads health has been declining for while, n...",t3_b6q6yr,1553817000.0,354,AskMen,I am starting to realise my dad wont live fore...,7.17,I am starting to realise my dad wont live fore...,I am starting to realise my dad wont live fore...
2,2,Constructive criticism is good so I will start...,t3_b6vx6s,1553858000.0,301,AskMen,What do you see on women's dating profiles tha...,6.7,What do you see on women s dating profiles tha...,What do you see on women s dating profiles tha...
3,3,Because constructive criticism is helpful \n\n...,t3_b6xrue,1553869000.0,20,AskMen,What could women put in their dating profiles ...,6.57,What could women put in their dating profiles ...,What could women put in their dating profiles ...
4,4,,t3_b6wej9,1553861000.0,38,AskMen,What are some things on your mind that you can...,6.66,What are some things on your mind that you can...,What are some things on your mind that you can...


#### Combining Title Column with Content Column For women_df

In [19]:
#Combining columns for women_df 
title_content = []

for i in list(women_df.index): 
    if pd.isnull(women_df['Content'][i]):
        #No content (e.g. image/gif post): keep the title only 
        title_content.append(str(women_df['Clean_Title'][i]))
        
    else: 
        #Join with a space so the last title word does not fuse with the first content word 
        title_content.append(str(women_df['Clean_Title'][i]) + " " + str(women_df['Content'][i]))
        
#cleaning the column 
women_df['Title_Content'] = title_content
women_df["Title_Content"] = clean_title(women_df['Title_Content'])

In [20]:
title_content[1]

'What was a time you had to let a dream  i e  career  hopes of being with someone  etc  go and how did you bounce back '

In [21]:
women_df['Title_Content'].isnull().sum()

0

In [22]:
women_df.head(5)

Unnamed: 0.1,Unnamed: 0,Content,ID,Length_of_time,Number_of_comment,Subreddit,Title,Length_of_time_days,Clean_Title,Title_Content
0,0,**Welcome to AskWomen!**\n\nIn honor of the ch...,t3_b3r260,1553178000.0,0,AskWomen,Welcome to a new season! Spring/Fall AskWomen ...,14.57,Welcome to a new season Spring Fall AskWomen ...,Welcome to a new season Spring Fall AskWomen ...
1,1,,t3_b6vwfn,1553858000.0,45,AskWomen,What was a time you had to let a dream (i.e. c...,6.7,What was a time you had to let a dream i e c...,What was a time you had to let a dream i e c...
2,2,,t3_b6ndvr,1553802000.0,636,AskWomen,What's the lamest thing you ever did to get a ...,7.34,What s the lamest thing you ever did to get a ...,What s the lamest thing you ever did to get a ...
3,3,,t3_b6xgfr,1553867000.0,53,AskWomen,Global check in: how is everyone with anxiety/...,6.59,Global check in how is everyone with anxiety ...,Global check in how is everyone with anxiety ...
4,4,&amp;#x200B;\n\nHave you ever had a 'falling o...,t3_b6wh0j,1553861000.0,86,AskWomen,What's the dumbest reason someone decided not ...,6.65,What s the dumbest reason someone decided not ...,What s the dumbest reason someone decided not ...


#### Combining Title Column with Content Column For relationship_df

In [23]:
#Combining columns for relationship_df 
title_content = []

for i in list(relationship_df.index): 
    if pd.isnull(relationship_df['Content'][i]):
        #No content (e.g. image/gif post): keep the title only 
        title_content.append(str(relationship_df['Clean_Title'][i]))
        
    else: 
        #Join with a space so the last title word does not fuse with the first content word 
        title_content.append(str(relationship_df['Clean_Title'][i]) + " " + str(relationship_df['Content'][i]))
        
#cleaning the column 
relationship_df['Title_Content'] = title_content
relationship_df["Title_Content"] = clean_title(relationship_df['Title_Content'])

In [24]:
relationship_df['Title_Content'].isnull().sum()

0

In [25]:
relationship_df.head(5)

Unnamed: 0.1,Unnamed: 0,Content,ID,Length_of_time,Number_of_comment,Subreddit,Title,Length_of_time_days,Clean_Title,Title_Content
0,0,###Applications are open.\n\nApplications may ...,t3_b11tx5,1552578000.0,23,relationship_advice,[Meta] Mod Applications,21.51,Meta Mod Applications,Meta Mod Applications Applications are ope...
1,1,Since two or three times a week we end up remo...,t3_b2nc2f,1552939000.0,52,relationship_advice,[meta] Think of the comments as an inverted Ub...,17.32,meta Think of the comments as an inverted Ub...,meta Think of the comments as an inverted Ub...
2,2,So... my girlfriend tends to be sometimes jeal...,t3_b6wkec,1553862000.0,284,relationship_advice,My girlfriend (20F) secretly took my (22M) fac...,6.65,My girlfriend F secretly took my M fac...,My girlfriend F secretly took my M fac...
3,3,I’ll get right to it\n\nBF and I have been tog...,t3_b6nhjf,1553803000.0,10685,relationship_advice,My (24F) boyfriend (25M) had a bizarre reactio...,7.33,My F boyfriend M had a bizarre reactio...,My F boyfriend M had a bizarre reactio...
4,4,I don't want anyone having the slightest clue ...,t3_b6wak4,1553860000.0,163,relationship_advice,I (19f) dumped and blocked my boyfriend (19m) ...,6.67,I f dumped and blocked my boyfriend m ...,I f dumped and blocked my boyfriend m ...


## Combining DataFrames 

In [26]:
askmen_df = men_df[['Subreddit', 'Title_Content']]
askwomen_df = women_df[['Subreddit', 'Title_Content']]
relationshipadvice_df = relationship_df[['Subreddit', 'Title_Content']]

#### Dataframe 1: AskMen vs AskWomen

In [27]:
men_women_df = pd.concat([askmen_df, askwomen_df])
men_women_df.head()

Unnamed: 0,Subreddit,Title_Content
0,AskMen,The AskMen Book Club The Picture of Dorian G...
1,AskMen,I am starting to realise my dad wont live fore...
2,AskMen,What do you see on women s dating profiles tha...
3,AskMen,What could women put in their dating profiles ...
4,AskMen,What are some things on your mind that you can...


In [28]:
#Changing the Subreddit Key 
dic = {
    "AskMen": 0, 
    "AskWomen": 1
}

In [29]:
men_women_df['Subreddit'] = men_women_df['Subreddit'].map(dic)
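
`Series.map` with a dictionary returns `NaN` for any value that is not a key, so a misspelled subreddit name would silently produce missing labels. A small sketch of the mapping with a check for that case; the sample frame and the name `label_map` are made up:

```python
import pandas as pd

sample = pd.DataFrame({
    "Subreddit": ["AskMen", "AskWomen", "AskMen"],
    "Title_Content": ["first post", "second post", "third post"],
})

label_map = {"AskMen": 0, "AskWomen": 1}
sample["Subreddit"] = sample["Subreddit"].map(label_map)

# Any label missing from label_map would show up here as NaN
assert sample["Subreddit"].notnull().all(), "unmapped subreddit label found"
print(sample["Subreddit"].tolist())  # [0, 1, 0]
```

For the same reason, running the mapping cell a second time on the same frame would turn every label into `NaN`, because `0` and `1` are not keys of the dictionary.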

In [30]:
men_women_df.head()

Unnamed: 0,Subreddit,Title_Content
0,0,The AskMen Book Club The Picture of Dorian G...
1,0,I am starting to realise my dad wont live fore...
2,0,What do you see on women s dating profiles tha...
3,0,What could women put in their dating profiles ...
4,0,What are some things on your mind that you can...


In [31]:
#save as csv 
men_women_df.to_csv('../data/AskMenAskWomen.csv')
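
The `Unnamed: 0` columns visible in the tables above are most likely what pandas produces when a CSV that was saved with its index is read back without `index_col`. A hedged sketch of one way to avoid carrying them forward, using an in-memory buffer instead of a file; `ignore_index=True` and `index=False` are suggestions here, not what the notebook does:

```python
import io
import pandas as pd

a = pd.DataFrame({"Subreddit": [0, 0], "Title_Content": ["x", "y"]})
b = pd.DataFrame({"Subreddit": [1, 1], "Title_Content": ["z", "w"]})

# ignore_index=True renumbers rows 0..n-1 instead of repeating each frame's own index
combined = pd.concat([a, b], ignore_index=True)

# index=False leaves the row index out of the written CSV
buffer = io.StringIO()
combined.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
print(reloaded.columns.tolist())  # ['Subreddit', 'Title_Content']
```

Without `ignore_index=True`, the concatenated frame keeps two rows labelled `0`, two labelled `1`, and so on, which can be confusing when selecting rows by label later.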

#### Dataframe 2: AskMen vs Relationship Advice 

In [32]:
men_relation_df = pd.concat([askmen_df, relationshipadvice_df])
men_relation_df.head()

Unnamed: 0,Subreddit,Title_Content
0,AskMen,The AskMen Book Club The Picture of Dorian G...
1,AskMen,I am starting to realise my dad wont live fore...
2,AskMen,What do you see on women s dating profiles tha...
3,AskMen,What could women put in their dating profiles ...
4,AskMen,What are some things on your mind that you can...


In [33]:
#Changing the Subreddit Key 
dic = {
    "AskMen": 0, 
    "relationship_advice": 1
}

In [34]:
men_relation_df['Subreddit'] = men_relation_df['Subreddit'].map(dic)

In [35]:
men_relation_df.head()

Unnamed: 0,Subreddit,Title_Content
0,0,The AskMen Book Club The Picture of Dorian G...
1,0,I am starting to realise my dad wont live fore...
2,0,What do you see on women s dating profiles tha...
3,0,What could women put in their dating profiles ...
4,0,What are some things on your mind that you can...


In [36]:
men_relation_df.to_csv('../data/AskMen_Relationship.csv')

#### Dataframe 3: AskWomen vs Relationship Advice 

In [37]:
women_relation_df = pd.concat([askwomen_df, relationshipadvice_df])
women_relation_df.head()

Unnamed: 0,Subreddit,Title_Content
0,AskWomen,Welcome to a new season Spring Fall AskWomen ...
1,AskWomen,What was a time you had to let a dream i e c...
2,AskWomen,What s the lamest thing you ever did to get a ...
3,AskWomen,Global check in how is everyone with anxiety ...
4,AskWomen,What s the dumbest reason someone decided not ...


In [39]:
#Changing the Subreddit Key 
dic = {
    "AskWomen": 1, 
    "relationship_advice": 0
}

In [40]:
women_relation_df['Subreddit'] = women_relation_df['Subreddit'].map(dic)

In [41]:
women_relation_df.head()

Unnamed: 0,Subreddit,Title_Content
0,1,Welcome to a new season Spring Fall AskWomen ...
1,1,What was a time you had to let a dream i e c...
2,1,What s the lamest thing you ever did to get a ...
3,1,Global check in how is everyone with anxiety ...
4,1,What s the dumbest reason someone decided not ...


In [42]:
women_relation_df.to_csv('../data/AskWomen_Relationship.csv')