# Web APIs & Classification

## Project Challenge Statement

### Goal: 
#### 1. Using Reddit's API, collect posts from three subreddits: AskWomen, AskMen, and Relationship_Advice. 
#### 2. Use NLP to train a classifier that predicts which subreddit a given post came from. Each pairing of two subreddits is a binary classification problem.
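
The notebook does not show the collection step, so the following is only a minimal sketch of how step 1 could look, assuming posts are pulled from Reddit's public JSON listing endpoint. The function names and the User-Agent string are hypothetical; only the `title`, `selftext`, `subreddit`, and `after` fields of the listing response are relied on.

```python
import json
import urllib.parse
import urllib.request


def listing_url(subreddit, after=None, limit=25):
    """Build the URL of a subreddit's public JSON listing."""
    params = {"limit": limit}
    if after:
        params["after"] = after  # pagination token from the previous page
    return f"https://www.reddit.com/r/{subreddit}.json?{urllib.parse.urlencode(params)}"


def parse_listing(payload):
    """Extract (rows, next_after) from one page of listing JSON."""
    rows = [
        {
            "subreddit": child["data"]["subreddit"],
            "title": child["data"]["title"],
            "selftext": child["data"].get("selftext", ""),
        }
        for child in payload["data"]["children"]
    ]
    return rows, payload["data"].get("after")


def fetch_page(subreddit, after=None):
    """Download and parse one page of posts (performs a network request)."""
    request = urllib.request.Request(
        listing_url(subreddit, after),
        headers={"User-Agent": "subreddit-classifier-sketch/0.1"},  # hypothetical agent string
    )
    with urllib.request.urlopen(request) as response:
        return parse_listing(json.load(response))
```

Calling `fetch_page` repeatedly and passing the returned `after` token back in walks through successive pages; pausing between requests keeps the scraper polite.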

### Datasets: 
1. AskMen vs AskWomen (0, 1)
2. AskMen vs Relationship Advice (0, 1)
3. AskWomen vs Relationship Advice (0, 1)
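
The (0, 1) pairs above are the binary target labels for each dataset. A minimal sketch of that encoding for dataset 1, assuming the raw subreddit names are stored as strings in a `Subreddit` column (the toy rows and the `label_map` name are illustrative, not taken from the project's data):

```python
import pandas as pd

# Hypothetical toy rows standing in for scraped posts.
posts = pd.DataFrame({
    "Title_Content": ["what do guys think about this", "ladies, how do you handle this"],
    "Subreddit": ["AskMen", "AskWomen"],
})

# First-listed subreddit -> 0, second -> 1, as in dataset 1.
label_map = {"AskMen": 0, "AskWomen": 1}
posts["target"] = posts["Subreddit"].map(label_map)
```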

### Model Improvement
Use all baseline models to build an ensemble model. 

#### 1. Ensemble Model 
- with CountVectorizer 
- with TFIDF Model
- with Logistic Regression 
- with Multinomial NB

## Table of Contents 

This notebook is divided into sections for analysis. The links below jump to the corresponding sections for easy navigation. 

### Contents:
- [Ensemble Model 1 menwomen_df With Best Parameters](#Ensamble-Model-1-menwomen_df-With-Best-Parameters)
    - [Ensemble Model 1](#Ensamble-Model-1)
    - [Extracting Coefficients](#Extracting-Coefficients)
    - [Model With Target Switched](#Model-With-Target-Switched)

In [1]:
import pandas as pd 
import numpy as np 
import re
from sklearn.pipeline import Pipeline 
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier

In [2]:
menwomen_df = pd.read_csv('../data/AskMenAskWomen.csv')
menrelationship_df = pd.read_csv('../data/AskMen_Relationship.csv')
womenrelationship_df = pd.read_csv('../data/AskWomen_Relationship.csv')

## Ensemble Model With Best Parameters 

## Ensemble Model 1: With menwomen_df

In [3]:
# Vectorizers and classifiers with the best parameters from the previous notebook
cvec = CountVectorizer(max_df=0.95,
                       max_features=2000,
                       min_df=5,
                       ngram_range=(1, 1))
tfidf1 = TfidfVectorizer(max_df=0.95,
                         max_features=1000,
                         min_df=5,
                         ngram_range=(1, 1))
tfidf2 = TfidfVectorizer(max_df=0.95,
                         max_features=1500,
                         min_df=2,
                         ngram_range=(1, 2))
mnb = MultinomialNB()
logit = LogisticRegression()

In [10]:
# Build the VotingClassifier: a hard (majority) vote between the two classifiers.
# The same setup is applied to the rest of the models.
vc = VotingClassifier(estimators=[('mnb', mnb),
                                  ('logit', logit)],
                      voting='hard')
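
With `voting='hard'` and only two estimators, the classifiers can disagree and produce a tie. scikit-learn documents that ties are resolved by the ascending sort order of the class labels, so a split vote goes to class 0. The sketch below illustrates that rule with plain NumPy; the `hard_vote` helper and the prediction arrays are hypothetical, and this is not a copy of scikit-learn's internals.

```python
import numpy as np


def hard_vote(predictions):
    """Majority vote over class-label predictions (one row per estimator)."""
    predictions = np.asarray(predictions)
    # For each sample (column), count the votes per class and take the argmax.
    # np.argmax returns the first maximum, so a tie goes to the lowest label.
    return np.array([np.bincount(col).argmax() for col in predictions.T])


mnb_pred = np.array([0, 1, 1, 0])    # hypothetical MultinomialNB predictions
logit_pred = np.array([0, 1, 0, 1])  # hypothetical LogisticRegression predictions
votes = hard_vote([mnb_pred, logit_pred])
```

Under this rule a two-estimator ensemble predicts class 1 only when both models agree on it. Adding a third estimator or switching to `voting='soft'` are two common ways to avoid ties.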

In [14]:
# Build the pipeline: one vectorizer followed by the voting classifier.
# The commented-out vectorizers are the alternatives that were tried.
pipe1 = Pipeline([
#     ('cvec', CountVectorizer(max_df=0.95,
#                              max_features=2000,
#                              min_df=5,
#                              ngram_range=(1, 1))),
    # Kept: this vectorizer gives the highest train and test scores.
    ('tfidf1', TfidfVectorizer(max_df=0.95,
                               max_features=1000,
                               min_df=5,
                               ngram_range=(1, 1))),
#     ('tfidf2', TfidfVectorizer(max_df=0.95,
#                                max_features=1500,
#                                min_df=2,
#                                ngram_range=(1, 2))),
    ('vc', VotingClassifier(estimators=[('mnb', mnb),
                                        ('logit', logit)],
                            voting='hard'))
])

In [15]:
# Features (post text) and target (subreddit label)
X = menwomen_df['Title_Content']
y = menwomen_df['Subreddit']

# 70/30 train-test split, stratified so both sets keep the same class balance
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
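
The `stratify` argument keeps the class proportions of the full dataset in both splits. A small sketch on made-up labels shows the effect; the 70/30 class balance and the `_toy` names are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 posts of class 0 and 30 posts of class 1.
y_toy = np.array([0] * 70 + [1] * 30)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# Share of class 1 in each split; with stratification both stay close to 0.30.
train_share = y_tr.mean()
test_share = y_te.mean()
```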

In [16]:
pipe1.fit(X_train, y_train)



Pipeline(memory=None,
     steps=[('tfidf1', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.95, max_features=1000, min_df=5,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=Tru...0, warm_start=False))],
         flatten_transform=None, n_jobs=None, voting='hard', weights=None))])

In [17]:
print('train score', pipe1.score(X_train, y_train))
print('test score', pipe1.score(X_test, y_test))

train score 0.8224513172966781
test score 0.7466666666666667


### Extracting Coefficients 

In [30]:
pipe1.named_steps['vc'].estimators_

[MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
 LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=100, multi_class='warn',
           n_jobs=None, penalty='l2', random_state=None, solver='warn',
           tol=0.0001, verbose=0, warm_start=False)]

In [29]:
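# Saving the Naive Bayes coefficients (per-feature log-probabilities for class 1).
# Note: newer scikit-learn releases removed MultinomialNB.coef_; feature_log_prob_[1] is the equivalent there.
mnb_coef = pipe1.named_steps['vc'].estimators_[0].coef_[0]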

In [31]:
# Saving the logistic regression coefficients
logit_coef = pipe1.named_steps['vc'].estimators_[1].coef_[0]

In [32]:
# Get the column names from the vectorizer already fitted inside the pipeline
# (newer scikit-learn releases name this method get_feature_names_out())
col_name = pipe1.named_steps['tfidf1'].get_feature_names()
len(col_name)

727

In [33]:
# One row per word, one column per model's coefficient
menwomen_word = pd.DataFrame(data=[mnb_coef, logit_coef],
                             columns=col_name,
                             index=['nb_coef', 'logit_coef']).T
menwomen_word.head()

| | nb_coef | logit_coef |
|---|---|---|
| able | -7.2232 | -0.197191 |
| about | -5.244258 | 0.033025 |
| above | -7.676415 | -0.06847 |
| absolutely | -7.64757 | -0.292504 |
| accept | -7.099458 | 0.035477 |


In [34]:
menwomen_word.sort_values(by = 'nb_coef', ascending = True).head(20)

| | nb_coef | logit_coef |
|---|---|---|
| bar | -7.819936 | -0.155624 |
| himself | -7.819936 | -0.159051 |
| kinda | -7.819936 | -0.360041 |
| anymore | -7.819936 | -0.21841 |
| three | -7.819936 | -0.108125 |
| small | -7.819936 | -0.61933 |
| came | -7.819936 | -0.334269 |
| group | -7.819936 | -0.110682 |
| half | -7.819936 | -0.548768 |
| late | -7.819936 | -0.166031 |
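
For a binary `MultinomialNB` in the scikit-learn version used here, `coef_` holds the feature log-probabilities for class 1 only. Sorting it in ascending order therefore surfaces words that are rare in class 1, not necessarily words that are characteristic of class 0; the identical values of -7.819936 are consistent with words that received only the smoothing count in class 1. One possible alternative, sketched below on a hypothetical toy corpus, is to rank words by the difference between the two classes' log-probabilities.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus; labels 0 and 1 stand in for the two subreddits.
docs = ["beard gym beard", "gym beard car", "makeup dress gym", "dress makeup makeup"]
labels = [0, 0, 1, 1]

vec = TfidfVectorizer()
X_toy = vec.fit_transform(docs)
nb = MultinomialNB().fit(X_toy, labels)

# Log-ratio of per-class word likelihoods:
# positive values lean toward class 1, negative values toward class 0.
log_ratio = nb.feature_log_prob_[1] - nb.feature_log_prob_[0]

vocab = vec.vocabulary_  # maps each word to its column index
most_class0 = min(vocab, key=lambda w: log_ratio[vocab[w]])
most_class1 = max(vocab, key=lambda w: log_ratio[vocab[w]])
```

Unlike the single-class log-probabilities, the log-ratio is signed, which makes it directly comparable in direction to the logistic regression coefficients.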


In [35]:
menwomen_word.sort_values(by = 'logit_coef', ascending = True).head()

| | nb_coef | logit_coef |
|---|---|---|
| men | -6.961656 | -3.155234 |
| guys | -7.150168 | -2.76365 |
| my | -6.242374 | -2.192249 |
| to | -4.458594 | -1.558698 |
| and | -4.713635 | -1.471389 |
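
Logistic regression coefficients are on the log-odds scale, so a negative value pushes a post toward class 0 (AskMen, if the encoding listed under Datasets holds). Exponentiating a coefficient gives an odds ratio, which can be easier to read. The sketch below reuses the five values from the table above; because the features are TF-IDF weights, a "unit" here is one unit of TF-IDF weight, not one occurrence of the word.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the table above.
top_negative = pd.Series({
    "men": -3.155234,
    "guys": -2.763650,
    "my": -2.192249,
    "to": -1.558698,
    "and": -1.471389,
})

# exp(coef) is the multiplicative change in the odds of class 1 per unit
# increase in that word's TF-IDF weight, holding the other features fixed.
odds_ratio = np.exp(top_negative)
```

Every odds ratio here is below 1, meaning each of these words lowers the odds of class 1, with "men" having the strongest effect.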
