# Web APIs & Classification

## Project Challenge Statement

### Goal: 
#### 1. Using Reddit's API, collect posts from three subreddits: AskWomen, AskMen, Relationship_Advice. 
#### 2. Use NLP to train a classifier that predicts which subreddit a given post came from. Each dataset pairs two subreddits, so this is a binary classification problem.

---

### Datasets: 

1. AskMen:0, AskWomen:1, WordCounts 
2. RelationshipAdvice:0, AskWomen:1, WordCounts


---

### Notebook Walkthrough 

1. Models Walkthrough
2. Models Analysis 
3. Key Takeaways  


---

## Table of Contents 

This notebook is broken down into sections for analysis. The following links point to the different sections of the notebook for simple navigation. 

---

### Contents:
- [Models Walkthrough](#Models-Walkthrough)
- [Models Analysis](#Models-Analysis)
    - [Hypothesis 1, large train test gap => large overlap of common words](#Hypothesis-1,-large-train-test-gap-=>-large-overlap-of-common-words)
    - [Hypothesis 2, small train test gap => small overlap of common words](#Hypothesis-2,-small-train-test-gap-=>-small-overlap-of-common-words)
- [Key Takeaways](#Key-Takeaways)

---

## Models Walkthrough

---

### 1. The Data

The goal of this project is to use natural language processing techniques to train a classifier that identifies which subreddit a post came from. The data comes from Reddit, and three subreddits are used to build the models: AskWomen, AskMen, and Relationship Advice. Two data frames are constructed from these three subreddits. The first contains the topics and contents of posts from AskMen and AskWomen. Here the target variable is the AskMen subreddit, and the model tries to tell AskMen posts apart from AskWomen posts using NLP techniques. To understand how the model works and which keywords it uses for classification, I also flip the target and make AskWomen posts the target class. The second data frame contains the topics and contents of posts from Relationship Advice and AskWomen, with Relationship Advice as the target. The natural language processing model is tweaked slightly because of the difference in content, but the concept is the same for both models.
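
The target flip described above can be sketched as follows. This is a minimal illustration rather than the notebook's actual code: the `posts` list and the `make_labels` helper are hypothetical stand-ins for the scraped Reddit data.

```python
# Toy stand-ins for scraped posts (hypothetical; the notebook uses real Reddit data).
posts = [
    {"subreddit": "AskMen", "text": "Men of reddit, what do you do on weekends?"},
    {"subreddit": "AskWomen", "text": "Ladies, what is your favorite book?"},
    {"subreddit": "AskMen", "text": "Guys, how did you meet your best friend?"},
]


def make_labels(posts, target):
    """Return 1 for posts from the target subreddit and 0 otherwise."""
    return [1 if post["subreddit"] == target else 0 for post in posts]


y_askmen = make_labels(posts, target="AskMen")      # AskMen is the target class
y_askwomen = make_labels(posts, target="AskWomen")  # same posts, target flipped

print(y_askmen)    # [1, 0, 1]
print(y_askwomen)  # [0, 1, 0]
```

Flipping the target only swaps the 0 and 1 labels; the posts themselves are unchanged.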


---

## 2. Models Analysis

---

### 2.1 The Models 

The Models : 
Many different models can be used for classification, and without trying several of them there is no way to know which works better. I therefore used seven models that I consider appropriate for this task: 

1. CountVectorizer Model With Logistic Regression
2. CountVectorizer Model With Multinomial NB
3. TFIDF Model With Logistic Regression
4. TFIDF Model With Multinomial NB
5. Random Forest Model Feature Extraction
6. Extra Tree Model Feature Extraction
7. AdaBoost classifier Model Feature Extraction

After fitting all the models, I noticed that some perform better than others. For example, the multinomial naive Bayes model works well in general, while the AdaBoost classifier overfits on both datasets. Based on this observation, I decided to combine all seven models into one ensemble model. The following are the train and test scores of the ensemble model for each dataset and target. 
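
The notebook does not say how the seven models are combined. One common option is hard (majority) voting, sketched below in plain Python. The `majority_vote` helper and the prediction lists are hypothetical and only illustrate the idea.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label per post by hard voting."""
    combined = []
    for votes in zip(*predictions_per_model):
        # most_common(1) gives the label that received the most votes
        label, _count = Counter(votes).most_common(1)[0]
        combined.append(label)
    return combined


# Hypothetical 0/1 predictions from three models on four posts
model_predictions = [
    [1, 0, 1, 1],
    [1, 0, 0, 1],
    [0, 0, 1, 1],
]

ensemble = majority_vote(model_predictions)
print(ensemble)  # [1, 0, 1, 1]
```

With an odd number of models, such as the seven used here, a binary vote can never tie.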



For AskMen and AskWomen Dataset with Target = AskWomen

```
train score 0.92
test score 0.73

```

For AskMen and AskWomen Dataset with Target = AskMen

```
train score 0.87
test score 0.74

```
For AskWomen and Relationship Advice Dataset with Target = AskWomen 

```
train score 0.999
test score 0.976
```
For AskWomen and Relationship Advice Dataset with Target = Relationship Advice
```
train score 0.993
test score 0.982

```
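
To make the comparison concrete, the train/test gap for each run can be computed directly from the scores reported above. This is a small arithmetic sketch; the dictionary keys are just descriptive labels.

```python
# (train score, test score) pairs copied from the results above
scores = {
    "AskMen/AskWomen, target AskWomen": (0.92, 0.73),
    "AskMen/AskWomen, target AskMen": (0.87, 0.74),
    "AskWomen/Relationship Advice, target AskWomen": (0.999, 0.976),
    "AskWomen/Relationship Advice, target Relationship Advice": (0.993, 0.982),
}

# Gap = train score minus test score, rounded to hide floating point noise
gaps = {name: round(train - test, 3) for name, (train, test) in scores.items()}

for name, gap in gaps.items():
    print(f"{name}: gap = {gap}")
```

The AskMen/AskWomen gaps (0.19 and 0.13) are roughly ten times larger than the AskWomen/Relationship Advice gaps (0.023 and 0.011).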


### 2.2 Analysis 

#### So what is happening? 

With the same baseline content from the AskWomen subreddit, the classifier tries to distinguish AskWomen posts from AskMen posts and from Relationship Advice posts. However, as one can see, the train and test scores differ sharply from one dataset to the other: the train/test gap is 0.13 to 0.19 for AskMen/AskWomen, but only about 0.01 to 0.02 for AskWomen/Relationship Advice. I tried many models and grid searched over different parameter settings (keywords) to improve the train and test scores, but tuning the model did not change the scores much. Therefore, I am going to look at the most common words in each subreddit and find the percentage of overlapping words. 

### **Hypothesis** 
- Large train test gap => more overlap in top features
- Small train test gap => less overlap in top features

**My hypothesis 1 :**

For the model that has a large gap between training and testing scores, the percentage of overlap for top features will be higher. 

The process to confirm the hypothesis: 
1. Look into the top 100 most common words in AskMen and AskWomen. 
2. Look into the top 100 features after fitting the model for AskMen and AskWomen.

**My hypothesis 2 :**

For the model that has a small gap between training and testing scores, the percentage of overlap for top features will be lower. 

The process to confirm the hypothesis: 
1. Look into the top 100 most common words in AskWomen and Relationship Advice. 
2. Look into the top 100 features after fitting the model for AskWomen and Relationship Advice.
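
Both checks come down to the same measurement: take two top-k word lists and compute the share of words they have in common. A minimal sketch is shown here, using the two top-5 lists that appear in the cell outputs further down. The `overlap_fraction` helper is a hypothetical name and is not part of the notebook.

```python
def overlap_fraction(words_a, words_b):
    """Share of a top-k list that also appears in the other top-k list.

    Assumes both lists have the same length k and contain no duplicates.
    """
    return len(set(words_a) & set(words_b)) / len(words_a)


# Top 5 words by count for each subreddit, as printed by the notebook
askwomen_top5 = ['you', 'what', 'to', 'do', 'your']
askmen_top5 = ['you', 'what', 'to', 'the', 'your']

print(overlap_fraction(askwomen_top5, askmen_top5))  # 0.8
```

Four of the five words are shared, so the overlap is 0.8. The cells below apply the same idea at k = 100, 300, and 500.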


----

## Hypothesis 1, large train test gap => large overlap of common words

---

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
#import datasets 
women1_count = pd.read_csv('../data/AskMen0AskWomen1wordcount.csv')
women1_coeff = pd.read_csv('../data/AskMen0AskWomen1wordcoeff.csv')
men1_count = pd.read_csv('../data/AskMen1AskWomen0wordcount.csv')
men1_coeff = pd.read_csv('../data/AskMen1AskWomen0wordcoeff.csv')

In [3]:
#sort by number of occurrences, target = AskWomen
women1_count.sort_values(by = 'AskWomen', ascending = False).head(5)

Unnamed: 0.1,Unnamed: 0,AskMen,AskWomen
1422,you,40.694791,28.624296
1314,what,31.534464,20.296435
1178,to,29.553963,18.756743
264,do,23.625515,16.424676
1480,your,26.823695,16.341694


In [4]:
#top 300 word list for AskWomen (the variable name says 100, but head(300) is what runs)
women1_top100word = list(women1_count.sort_values(by = 'AskWomen', ascending = False).head(300)['Unnamed: 0'])
women1_top100word[:5]

['you', 'what', 'to', 'do', 'your']

In [5]:
men1_count.sort_values(by = 'AskMen', ascending = False).head(5)

Unnamed: 0.1,Unnamed: 0,AskMen,AskWomen
1422,you,40.694791,28.624296
1314,what,31.534464,20.296435
1178,to,29.553963,18.756743
1098,the,29.16444,15.120878
1480,your,26.823695,16.341694


In [6]:
#top 300 word list for AskMen (the variable name says 100, but head(300) is what runs)
men1_top100word = list(men1_count.sort_values(by = 'AskMen', ascending = False).head(300)['Unnamed: 0'])
men1_top100word[:5]

['you', 'what', 'to', 'the', 'your']

In [7]:
print(len(set(women1_top100word) & set(men1_top100word)))

218


In [8]:
len(women1_count.sort_values(by = 'AskWomen', ascending = False)['Unnamed: 0'].head(10000))

1500

In [9]:
#Top Word Counts for MenWomen DataSets Without fitting Models 
print("AskMen AskWomen Most Common Words Overlap Percentage")
for i in [100, 300, 500]: 
    women1_topword = list(women1_count.sort_values(by = 'AskWomen', ascending = False).head(i)['Unnamed: 0'])
    men1_topword = list(men1_count.sort_values(by = 'AskMen', ascending = False).head(i)['Unnamed: 0'])
    print(f'within top {i} word, the overlap is {len(set(women1_topword).intersection(set(men1_topword)))/i}')

AskMen AskWomen Most Common Words Overlap Percentage
within top 100 word, the overlap is 0.81
within top 300 word, the overlap is 0.7266666666666667
within top 500 word, the overlap is 0.676


In [10]:
#Top feature by sorting logistic regression coefficient 
men1_coeff.sort_values(by = "logit_coef_switched", ascending = False).head(10)

Unnamed: 0.1,Unnamed: 0,nb_coef_switched,logit_coef_switched
734,men,-5.973712,2.316345
452,guys,-6.145557,2.09047
764,my,-5.529801,1.924988
1180,to,-4.810618,1.653101
41,and,-4.994186,1.596805
982,she,-6.118351,1.190787
498,her,-6.13862,1.107499
859,out,-6.014033,1.055357
735,men of,-6.776649,1.03237
1084,that,-5.40198,0.946948


In [11]:
#Lowest-ranked features by sorting naive Bayes coefficient in ascending order
women1_coeff.sort_values(by = "nb_coef", ascending = True).head(10)

Unnamed: 0.1,Unnamed: 0,nb_coef,logit_coef
449,guys who,-8.179097,-0.790965
382,for it,-8.179097,-0.088054
98,ask her,-8.179097,-0.059461
196,class,-8.179097,-0.131403
486,he is,-8.179097,-0.1021
1348,whenever,-8.179097,-0.266105
424,gf,-8.179097,-0.51924
488,head,-8.179097,-0.496253
736,men of,-8.179097,-0.885945
353,father,-8.179097,-0.13184


In [12]:
#Top Word Features for MenWomen DataSets After fitting Logistic Regression Models
model1_logit_overlap = []
print('AskMen AskWomen Logistic Regression Top Features Overlap')
for i in [100, 200, 300, 400, 500]: 
    women1_topfeature = list(women1_coeff.sort_values(by = "logit_coef", 
                                                      ascending = False).head(i)['Unnamed: 0'])
    men1_topfeature = list(men1_coeff.sort_values(by = "logit_coef_switched", 
                                                  ascending = False).head(i)['Unnamed: 0'])
    percentage = len(set(women1_topfeature).intersection(set(men1_topfeature)))/i
    model1_logit_overlap.append(len(set(women1_topfeature).intersection(set(men1_topfeature))))
    print(f'within top {i} features, the overlap of top features is {percentage}')

AskMen AskWomen Logistic Regression Top Features Overlap
within top 100 features, the overlap of top features is 0.0
within top 200 features, the overlap of top features is 0.0
within top 300 features, the overlap of top features is 0.02
within top 400 features, the overlap of top features is 0.0375
within top 500 features, the overlap of top features is 0.062


In [42]:
#Top Word Features for MenWomen DataSets After fitting NB Models
model1_nb_overlap = []
print('AskMen AskWomen Naive Bayes Top Features Overlap')
for i in [100, 200, 300, 400, 500]: 
    women1_topfeature = list(women1_coeff.sort_values(by = "nb_coef", ascending = False).head(i)['Unnamed: 0'])
    men1_topfeature = list(men1_coeff.sort_values(by = "nb_coef_switched", ascending = False).head(i)['Unnamed: 0'])
    percentage = len(set(women1_topfeature).intersection(set(men1_topfeature)))/i
    model1_nb_overlap.append(percentage)
    print(f'within top {i} features, the overlap of top features is {percentage}')

AskMen AskWomen Naive Bayes Top Features Overlap
within top 100 features, the overlap of top features is 0.71
within top 200 features, the overlap of top features is 0.685
within top 300 features, the overlap of top features is 0.6466666666666666
within top 400 features, the overlap of top features is 0.64
within top 500 features, the overlap of top features is 0.624


----
### Conclusion for Hypothesis 1: 

---

Within the top 100 and up to the top 500 most common words in the AskMen and AskWomen subreddits, the share of overlapping common words is large: 

| Top Common Words |  Common Words Overlap  |
|:----------------:|:----------------------:|
|        100       |          0.81          |
|        300       |          0.727         |
|        500       |          0.676         |

However, after we fit the models and pull out up to the 500 most significant features from both the logistic regression model and the naive Bayes model, the picture changes. The logistic regression top features show almost no overlap (0.0 through the top 200, reaching only 0.062 at the top 500), while the naive Bayes top features still overlap by 62% to 71%. 

| # Of Top Features | Log Reg Overlap | NB Overlap |
|:-----------------:|:---------------:|:----------:|
|        100        |      0.0        |   0.71     |
|        200        |      0.0        |   0.685    |
|        300        |      0.02       |   0.647    |
|        400        |      0.0375     |   0.64     |
|        500        |      0.062      |   0.624    |
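
One possible reading of the near-zero logistic regression overlap in the table above: in a binary logistic regression, swapping which class is the target flips the sign of every coefficient, provided the features and settings are the same. The top features of one fit are then the bottom features of the other. The sketch below uses made-up coefficients to show the effect. The notebook's two fits are not exact mirror images (for example, `men of` is 1.03 in one fit and -0.89 in the other), so this only approximates its behavior.

```python
# Hypothetical coefficients for a model with target = AskWomen
coef_women = {
    "ladies": 1.8, "women": 1.2, "you": 0.3,
    "to": -0.4, "guys": -1.5, "men": -2.1,
}

# Assumption: refitting with the target flipped negates every coefficient
coef_men = {word: -value for word, value in coef_women.items()}


def top_k(coefs, k):
    """Words with the k largest coefficients."""
    return sorted(coefs, key=coefs.get, reverse=True)[:k]


overlaps = {}
for k in (2, 3, 4):
    shared = set(top_k(coef_women, k)) & set(top_k(coef_men, k))
    overlaps[k] = len(shared) / k
    print(f"top {k}: overlap = {overlaps[k]}")
```

In this toy vocabulary of six words, the overlap stays at zero until k exceeds half the vocabulary, at which point the two top-k lists are forced to share words.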


---

## Hypothesis 2, small train test gap => small overlap of common words

---

In [14]:
import pandas as pd
import matplotlib.pyplot as plt

In [15]:
#import datasets 
women2_count = pd.read_csv('../data/Relatioinship0AskWomen1wordcount.csv')
women2_coeff = pd.read_csv('../data/Relatioinship0AskWomen1wordcoeff.csv')
relationship1_count = pd.read_csv('../data/AskWomen0RelationshipAdvice1wordcount.csv')
relationship1_coeff = pd.read_csv('../data/AskWomen0RelationshipAdvice1wordcoeff.csv')

In [16]:
women2_count.shape, women2_coeff.shape, relationship1_count.shape, relationship1_coeff.shape

((1500, 3), (1500, 3), (1500, 3), (1500, 3))

In [17]:
#sort by number of occurrences, target = AskWomen
women2_count.sort_values(by = 'AskWomen', ascending = False).head(5)

Unnamed: 0.1,Unnamed: 0,AskWomen,RelationshipAdvice
49,and,53.32989,87.391961
1264,to,52.960465,85.770449
1181,the,39.056518,57.914291
1490,you,36.010009,56.96082
513,he,33.075412,53.77786


In [18]:
relationship1_count.sort_values(by = 'RelationshipAdvice', ascending = False).head(5)

Unnamed: 0.1,Unnamed: 0,RelationshipAdvice,AskWomen
48,and,85.465545,55.908178
1259,to,82.468045,56.510292
1178,the,58.674563,37.868873
1491,you,53.887508,37.520461
517,he,53.275884,33.657295


In [26]:
#Top Word Counts for Women and Relationship Advice DataSets Without fitting Models 
print("Women Relationship Advice Most Common Words Overlap Percentage")
for i in [100, 200, 500]: 
    women2_topword = list(women2_count.sort_values(by = 'AskWomen', ascending = False).head(i)['Unnamed: 0'])
    relationship_topword = list(relationship1_count.sort_values(by = 'RelationshipAdvice', ascending = False).head(i)['Unnamed: 0'])
    print(f'within top {i} word, the overlap is {len(set(women2_topword).intersection(set(relationship_topword)))/i}')

Women Relationship Advice Most Common Words Overlap Percentage
within top 100 word, the overlap is 0.92
within top 200 word, the overlap is 0.905
within top 500 word, the overlap is 0.878


In [20]:
#Sort by significance of coefficients for the women (target) and relationship advice dataframe
women2_coeff.sort_values(by = 'logit_coef', ascending= False).head(20)

Unnamed: 0.1,Unnamed: 0,nb_coef,logit_coef
1490,you,-3.75128,3.54697
1416,what,-4.148849,2.811666
1499,your,-4.070718,2.248796
317,do you,-4.51823,1.729328
1465,women,-5.194354,1.404916
1420,what is,-5.112346,1.289274
1417,what are,-5.113371,1.055633
302,did you,-5.312946,0.990316
692,ladies,-5.574215,0.968368
594,how did,-5.663323,0.818149


In [27]:
#Sort by significance of coefficients for the women and relationship (target) dataframe 
relationship1_coeff.sort_values(by = 'nb_coef_switched', ascending= False).head(10)

Unnamed: 0.1,Unnamed: 0,nb_coef_switched,logit_coef_switched
48,and,-4.253006,3.018729
1259,to,-4.334256,2.313939
517,he,-4.61462,2.804925
1041,she,-4.690114,2.61181
1178,the,-4.786272,0.803423
550,her,-4.857493,2.157683
765,me,-4.888376,2.441805
825,my,-4.910471,2.511056
651,it,-5.009908,1.373701
1156,that,-5.026045,0.826481


In [22]:
relationship1_coeff.sort_values(by = 'logit_coef_switched', ascending= False).head(20)

Unnamed: 0.1,Unnamed: 0,nb_coef_switched,logit_coef_switched
48,and,-4.253006,3.018729
517,he,-4.61462,2.804925
1041,she,-4.690114,2.61181
825,my,-4.910471,2.511056
765,me,-4.888376,2.441805
1259,to,-4.334256,2.313939
550,her,-4.857493,2.157683
1384,we,-5.10424,1.914438
568,him,-5.267039,1.575202
1231,this,-5.392376,1.566193


In [23]:
#Top Word Features for Women and Relationship Advice DataSets After fitting Logistic Regression Models
model2_logit_overlap = []
print('Women Relationship Advice Logistic Regression Top Features Overlap')
for i in [100, 200, 300, 400, 500]: 
    women2_topfeature = list(women2_coeff.sort_values(by = 'logit_coef', 
                                                      ascending= False).head(i)['Unnamed: 0'])
    relationship_topfeature = list(relationship1_coeff.sort_values(by = 'logit_coef_switched',
                                                           ascending= False).head(i)['Unnamed: 0'])
    percentage = len(set(women2_topfeature).intersection(set(relationship_topfeature)))/i
    model2_logit_overlap.append(len(set(women2_topfeature).intersection(set(relationship_topfeature))))
    print(f'within top {i} features, the overlap of top features is {percentage}')

Women Relationship Advice Logistic Regression Top Features Overlap
within top 100 features, the overlap of top features is 0.0
within top 200 features, the overlap of top features is 0.0
within top 300 features, the overlap of top features is 0.013333333333333334
within top 400 features, the overlap of top features is 0.035
within top 500 features, the overlap of top features is 0.046


In [28]:
model2_nb_overlap = []
print('Women Relationship Advice NB Top Features Overlap')
for i in [100, 200, 300, 400, 500]: 
    women2_topfeature = list(women2_coeff.sort_values(by = 'nb_coef', 
                                                      ascending= False ).head(i)['Unnamed: 0'])
    relationship_topfeature = list(relationship1_coeff.sort_values(by = 'nb_coef_switched',
                                                           ascending= False).head(i)['Unnamed: 0'])
    percentage = len(set(women2_topfeature).intersection(set(relationship_topfeature)))/i
    model2_nb_overlap.append(len(set(women2_topfeature).intersection(set(relationship_topfeature))))
    print(f'within top {i} features, the overlap of top features is {percentage}')

Women Relationship Advice NB Top Features Overlap
within top 100 features, the overlap of top features is 0.43
within top 200 features, the overlap of top features is 0.5
within top 300 features, the overlap of top features is 0.5033333333333333
within top 400 features, the overlap of top features is 0.53
within top 500 features, the overlap of top features is 0.566


In [41]:
set(women2_coeff.sort_values(by = 'logit_coef', ascending= False ).head(100)['Unnamed: 0']).intersection(set(relationship1_coeff.sort_values(by = 'logit_coef_switched',ascending= False).head(100)['Unnamed: 0']))

set()

---
### Conclusion for Hypothesis 2: 

---

Our hypothesis says that if the model is doing well, the overlap in top features will be small. Before fitting any model, the most common words of the two subreddits overlap heavily, so raw word counts alone do not separate them: 

| Top Common Words |  Common Words Overlap  |
|:----------------:|:----------------------:|
|        100       |          0.92          |
|        200       |          0.905         |
|        500       |          0.878         |

After we fit the models and pull out up to the 500 most significant features from both the logistic regression model and the naive Bayes model, we see less overlap in top features than for the AskMen/AskWomen models, which had a large gap between train and test scores. This is consistent with the hypothesis: 

| # Of Top Features | Log Reg Overlap | NB Overlap |
|:-----------------:|:---------------:|:----------:|
|        100        |      0.00       |   0.43     |
|        200        |      0.00       |   0.5      |
|        300        |      0.013      |   0.503    |
|        400        |      0.035      |   0.53     |
|        500        |      0.046      |   0.566    |
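
To compare the two dataset pairs side by side, the overlap values printed by the cells above can be lined up per top-k cutoff. The numbers are copied from the cell outputs and rounded to three decimals. The variable names are only illustrative.

```python
top_k = [100, 200, 300, 400, 500]

# Naive Bayes top-feature overlap, from the cell outputs
nb_men_women = [0.71, 0.685, 0.647, 0.64, 0.624]
nb_women_relationship = [0.43, 0.5, 0.503, 0.53, 0.566]

# Logistic regression top-feature overlap, from the cell outputs
logit_men_women = [0.0, 0.0, 0.02, 0.0375, 0.062]
logit_women_relationship = [0.0, 0.0, 0.013, 0.035, 0.046]

for k, nb_a, nb_b, lr_a, lr_b in zip(
    top_k, nb_men_women, nb_women_relationship,
    logit_men_women, logit_women_relationship,
):
    print(f"top {k}: NB {nb_a:.3f} vs {nb_b:.3f} | LogReg {lr_a:.3f} vs {lr_b:.3f}")

# Check whether the small-gap pair has lower overlap at every cutoff
nb_lower = all(b < a for a, b in zip(nb_men_women, nb_women_relationship))
logit_not_higher = all(b <= a for a, b in zip(logit_men_women, logit_women_relationship))
print(nb_lower, logit_not_higher)  # True True
```

At every cutoff, the AskWomen/Relationship Advice models show lower naive Bayes overlap and equal or lower logistic regression overlap than the AskMen/AskWomen models.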

---

## Key Takeaways

---

In this notebook, I focus on understanding why the same model, applied to two different datasets, performs well on one but not on the other. To understand the model, I made a hypothesis about the content of the subreddits: if two subreddits have a lot in common, the model will perform poorly at distinguishing one from the other. On the other hand, if two subreddits have different content and keywords, the model is better at picking up the keywords tied to each topic, and therefore better at telling the subreddits apart. 

Although this concept seems like common sense to us human beings, to fully understand how the model works, I decided to jump into this rabbit hole and find out whether the model actually works the way we assume it does. 

After a long process of teasing out the keywords and coefficients, I have come to the conclusion that the model works better when there are fewer overlapping keywords between subreddits. I also noticed a couple of other interesting things: 
1. The most common words in a subreddit are not necessarily the ones the model picks up as key identifiers. This may be because we use TFIDF to tease out the most impactful words instead of the most common ones. 
2. Although the train/test score gap is large when the content is similar, the model is still quite good at picking up the difference between the two subreddits. We can see this from the overlap of the top features: up to the top 300 features, the overlap between the similar subreddits is only 2% for the logistic regression model.
3. As a model becomes more complex, its interpretation also becomes more complex and harder to understand. 
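
The first takeaway can be illustrated with a small TF-IDF calculation. The sketch below uses the textbook formula, term frequency times `log(N / df)`, on three made-up post titles. scikit-learn's `TfidfVectorizer` uses a smoothed variant by default, so its numbers would differ, but the effect is the same: a word that appears in every post carries little or no weight.

```python
import math

# Hypothetical post titles
docs = [
    "what do you think about dating",
    "what do you do for fun",
    "what are you reading ladies",
]
tokenized = [doc.split() for doc in docs]


def idf(term):
    """Plain inverse document frequency: log(N / df)."""
    df = sum(term in tokens for tokens in tokenized)
    return math.log(len(tokenized) / df)


def tfidf(term, tokens):
    """Term frequency within one post, scaled by the term's idf."""
    return tokens.count(term) / len(tokens) * idf(term)


last_post = tokenized[2]
print("you   :", tfidf("you", last_post))     # appears in every post
print("ladies:", tfidf("ladies", last_post))  # appears in one post only
```

Here `you` occurs in all three posts, so its idf is `log(3 / 3) = 0` and its TF-IDF weight vanishes, while the rarer `ladies` keeps a positive weight.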