# Web APIs & Classification

## Project Challenge Statement

### Goal: 
#### 1. Using Reddit's API, collect posts from three subreddits: AskWomen, AskMen, and Relationship_Advice. 
#### 2. Use NLP to train a classifier that predicts which subreddit a given post came from. Each pairing of two subreddits is a binary classification problem.
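
As a rough sketch of the collection step: Reddit serves listings as JSON (for example `https://www.reddit.com/r/<subreddit>/new.json`), with posts nested under `data.children` and an `after` token for paging. The helper below only parses a listing that has already been downloaded. The function name `parse_listing` and the sample payload are illustrative assumptions, not the code used to build the CSVs in this project.

```python
import json


def parse_listing(payload):
    """Extract title/selftext/subreddit rows and the paging token from a listing dict."""
    data = payload.get("data", {})
    rows = []
    for child in data.get("children", []):
        post = child.get("data", {})
        rows.append({
            "title": post.get("title", ""),
            "selftext": post.get("selftext", ""),
            "subreddit": post.get("subreddit", ""),
        })
    # `after` is None on the last page, otherwise it is passed to the next request.
    return rows, data.get("after")


# Hand-written stand-in for one page of a listing response (not real data).
sample = json.loads("""
{"kind": "Listing",
 "data": {"after": "t3_example",
          "children": [
            {"kind": "t3",
             "data": {"title": "A question", "selftext": "Body text", "subreddit": "AskWomen"}}
          ]}}
""")

rows, after = parse_listing(sample)
print(rows, after)
```

In a real collection loop, the returned `after` value would be sent back as a query parameter to fetch the next page until it comes back empty.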

### Datasets: 
1. AskMen vs AskWomen (0, 1)
2. AskMen vs Relationship Advice (0, 1)
3. AskWomen vs Relationship Advice (0, 1)
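
The `(0, 1)` pairs above describe the binary target for each dataset. A minimal sketch of that encoding for dataset 3 follows, assuming the raw `Subreddit` column holds subreddit names; the toy rows and the exact name strings are made up for illustration, and the CSVs loaded below may already be encoded.

```python
import pandas as pd

# Toy frame standing in for a combined scrape; column names mirror the notebook,
# the rows are invented for illustration.
posts = pd.DataFrame({
    "Title_Content": ["post a", "post b", "post c"],
    "Subreddit": ["AskWomen", "relationship_advice", "AskWomen"],
})

# Dataset 3 encoding: AskWomen -> 0, Relationship Advice -> 1.
label_map = {"AskWomen": 0, "relationship_advice": 1}
posts["Subreddit"] = posts["Subreddit"].map(label_map)

print(posts["Subreddit"].tolist())
```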

### Model Improvement
Combine the baseline models into an ensemble model. 

#### 1. Ensemble Model 
- with CountVectorizer 
- with TFIDF Model
- with Logistic Regression 
- with Multinomial NB

## Table of Contents 

This notebook is broken down into sections for analysis. The links below jump to the corresponding sections of the notebook for simple navigation. 

### Contents:
- [Ensemble Model 1 menwomen_df With Best Parameters](#Ensamble-Model-1-menwomen_df-With-Best-Parameters)
    - [Ensemble Model 1](#Ensamble-Model-1)
    - [Extracting Coefficients](#Extracting-Coefficients)
    - [Model With Target Switched](#Model-With-Target-Switched)

In [2]:
import pandas as pd 
import numpy as np 
import re
from sklearn.pipeline import Pipeline 
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier

In [3]:
womenrelationship_df = pd.read_csv('../data/AskWomen_Relationship.csv')

## Ensemble Model 3: With womenrelationship_df

In [5]:
# Vectorizers and classifiers with the best parameters from the previous notebook
cvec = CountVectorizer(max_df=0.95,
                       max_features=2000,
                       min_df=5,
                       ngram_range=(1, 1))

tfidf1 = TfidfVectorizer(max_df=0.95,
                         max_features=1000,
                         min_df=5,
                         ngram_range=(1, 1))

tfidf2 = TfidfVectorizer(max_df=0.95,
                         max_features=1500,
                         min_df=2,
                         ngram_range=(1, 2))
mnb = MultinomialNB()
logit = LogisticRegression()

In [25]:
# Build the pipeline: a vectorizer followed by a hard-voting ensemble
pipe3 = Pipeline([
    # CountVectorizer is kept because it gave the highest train and test scores
    ('cvec', CountVectorizer(max_df=0.95,
                             max_features=2000,
                             min_df=5,
                             ngram_range=(1, 1))),

    # Alternative vectorizers, left disabled:
    # ('tfidf1', TfidfVectorizer(max_df=0.95,
    #                            max_features=1000,
    #                            min_df=5,
    #                            ngram_range=(1, 1))),

    # ('tfidf2', TfidfVectorizer(max_df=0.95,
    #                            max_features=1500,
    #                            min_df=2,
    #                            ngram_range=(1, 2))),

    # Each estimator casts one vote per post
    ('vc', VotingClassifier(estimators=[('mnb', mnb),
                                        ('logit', logit)],
                            voting='hard'))
])
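
With `voting='hard'`, each estimator in the `VotingClassifier` casts one vote per post and the most common label wins. Because `pipe3` has only two voters, every disagreement is a tie, and scikit-learn documents ties as resolved in ascending class order. The sketch below is a simplified illustration of that rule, not scikit-learn's implementation; `hard_vote` is a hypothetical helper name.

```python
import numpy as np


def hard_vote(predictions):
    """Majority vote over integer class labels predicted for a single sample."""
    counts = np.bincount(np.asarray(predictions))
    # argmax returns the first maximum, so ties resolve to the smallest label.
    return int(np.argmax(counts))


print(hard_vote([1, 1]))  # both models agree on class 1
print(hard_vote([0, 1]))  # models disagree, so the tie goes to class 0
```

Under this rule a two-model hard vote predicts class 1 only when both models agree on it. Adding a third estimator or switching to soft voting would be ways to avoid systematic ties.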

In [26]:
X = womenrelationship_df["Title_Content"]
y = womenrelationship_df['Subreddit']

#Train Test Split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify = y)
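
The `stratify=y` argument keeps the class proportions the same in the train and test sets. A small sketch on synthetic labels (the 80/20 balance here is invented for illustration and says nothing about the real dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 80 samples of class 0 and 20 of class 1.
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# The share of class 1 is preserved in both splits.
print(y_tr.mean(), y_te.mean())
```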

In [27]:
pipe3.fit(X_train, y_train)



Pipeline(memory=None,
     steps=[('cvec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.95, max_features=2000, min_df=5,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        stri...0, warm_start=False))],
         flatten_transform=None, n_jobs=None, voting='hard', weights=None))])

In [28]:
print('train score', pipe3.score(X_train, y_train))
print('test score', pipe3.score(X_test, y_test))

train score 0.9840871021775545
test score 0.9824561403508771


### Extracting Coefficients 

In [38]:
pipe3.named_steps['vc'].estimators_

[MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
 LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=100, multi_class='warn',
           n_jobs=None, penalty='l2', random_state=None, solver='warn',
           tol=0.0001, verbose=0, warm_start=False)]

In [36]:
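# Extract per-feature weights from the two fitted estimators
# MultinomialNB: log P(word | class 1). This equals the older coef_[0],
# which newer scikit-learn releases no longer provide.
mnb3_coef = pipe3.named_steps['vc'].estimators_[0].feature_log_prob_[1]

# LogisticRegression: one coefficient per vocabulary term
logit3_coef = pipe3.named_steps['vc'].estimators_[1].coef_[0]
len(mnb3_coef)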

2000

In [39]:
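# Get the column (vocabulary) names for the women/relationship data
cvec = CountVectorizer(max_df=0.95,
                       max_features=2000,
                       min_df=5,
                       ngram_range=(1, 1))

# Refit with the same settings and training data as the pipeline's vectorizer
cvec.fit_transform(X_train)

# Save the column names
# (on newer scikit-learn releases, use cvec.get_feature_names_out() instead)
col_name3 = cvec.get_feature_names()
len(col_name3)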

2000
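
An alternative to refitting a separate vectorizer is to read the vocabulary from the one already fitted inside the pipeline, which guarantees that names and weights line up. The sketch below uses a tiny made-up corpus and a hypothetical helper `feature_names` that works with both the older `get_feature_names()` and the newer `get_feature_names_out()` methods.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def feature_names(vectorizer):
    """Return vocabulary terms in column order across scikit-learn versions."""
    if hasattr(vectorizer, "get_feature_names_out"):
        return list(vectorizer.get_feature_names_out())
    return vectorizer.get_feature_names()


# Tiny invented corpus, only to make the sketch runnable.
docs = [
    "my partner and i",
    "my friends and i",
    "partner advice please",
    "friends advice please",
]
labels = [1, 0, 1, 0]

demo_pipe = Pipeline([("cvec", CountVectorizer()), ("mnb", MultinomialNB())])
demo_pipe.fit(docs, labels)

# Names come from the same fitted step that produced the model's input columns.
names = feature_names(demo_pipe.named_steps["cvec"])
log_probs = demo_pipe.named_steps["mnb"].feature_log_prob_[1]
print(len(names) == len(log_probs))
```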

In [40]:
womenrelationship_word = pd.DataFrame(data= [mnb3_coef, logit3_coef], columns= col_name3, index= ['nb_coef', 'logit_coef'])
womenrelationship_word = womenrelationship_word.T
womenrelationship_word.head()

| | nb_coef | logit_coef |
|---|---|---|
| able | -8.034415 | -0.229925 |
| abortion | -9.538492 | 0.000209 |
| about | -4.942111 | 0.011605 |
| above | -9.538492 | -0.026743 |
| absolute | -10.113857 | 0.000077 |
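
The two columns are on different scales: `nb_coef` holds log probabilities of each word given class 1, while `logit_coef` holds log-odds weights, so they should be ranked separately rather than compared directly. A minimal sketch of ranking words by the logistic regression weight, using only the five rows shown above; the `odds_ratio` column is an added illustration, and class 1 is assumed to be Relationship Advice per the Datasets list.

```python
import numpy as np
import pandas as pd

# The five rows shown in the output above.
words = pd.DataFrame(
    {
        "nb_coef": [-8.034415, -9.538492, -4.942111, -9.538492, -10.113857],
        "logit_coef": [-0.229925, 0.000209, 0.011605, -0.026743, 7.7e-05],
    },
    index=["able", "abortion", "about", "above", "absolute"],
)

# exp(coefficient) is the multiplicative change in the odds of class 1
# for each additional occurrence of the word.
words["odds_ratio"] = np.exp(words["logit_coef"])

# Most positive weights lean toward class 1, most negative toward class 0.
ranked = words.sort_values("logit_coef", ascending=False)
print(ranked)
```

Applied to the full `womenrelationship_word` frame, `ranked.head()` and `ranked.tail()` would list the words most indicative of each subreddit.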
