# Web APIs & Classification

## Project Challenge Statement

### Goal: 
#### 1. Using Reddit's API, collect posts from three subreddits: AskWomen, AskMen, and Relationship_Advice.
#### 2. Use NLP to train a classifier that predicts which subreddit a given post came from. Each pairing of two subreddits is a binary classification problem.

### Datasets: 
1. AskMen vs AskWomen (0, 1)
2. AskMen vs Relationship Advice (0, 1)
3. AskWomen vs Relationship Advice (0, 1)
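
One way to read the `(0, 1)` pairs above is that, within each dataset, the first subreddit is labeled 0 and the second is labeled 1. Below is a minimal sketch of that labeling, assuming posts are available as `(subreddit, text)` tuples. The function name `make_binary_dataset` and the sample posts are hypothetical and are not taken from the project data.

```python
def make_binary_dataset(posts, negative, positive):
    """Keep only posts from the two subreddits and label them 0/1."""
    label_of = {negative: 0, positive: 1}
    texts, labels = [], []
    for subreddit, text in posts:
        if subreddit in label_of:  # drop posts from any other subreddit
            texts.append(text)
            labels.append(label_of[subreddit])
    return texts, labels


# Hypothetical posts, for illustration only.
posts = [
    ("AskMen", "What do you wish you had learned earlier?"),
    ("AskWomen", "How do you unwind after work?"),
    ("relationship_advice", "My partner and I disagree about moving."),
]

texts, labels = make_binary_dataset(posts, "AskMen", "relationship_advice")
print(labels)
```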

### Model Improvement
Combine the baseline models into an ensemble model.

#### 1. Ensemble Model 
- with CountVectorizer 
- with TF-IDF (TfidfVectorizer)
- with Logistic Regression 
- with Multinomial NB

## Table of Contents 

This notebook is divided into sections for analysis. The links below jump to each section within the notebook.

### Contents:
- [Ensemble Model 2 menrelationship_df With Best Parameters](#Ensamble-Model-1-menwomen_df-With-Best-Parameters)
    - [Ensemble Model 2](#Ensamble-Model-1)
    - [Extracting Coefficients](#Extracting-Coefficients)
    - [Model With Target Switched](#Model-With-Target-Switched)

In [1]:
import pandas as pd 
import numpy as np 
import re
from sklearn.pipeline import Pipeline 
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier

In [2]:
menrelationship_df = pd.read_csv('../data/AskMen_Relationship.csv')

## Ensemble Model 2: With menrelationship_df

In [155]:
# Vectorizers and classifiers, using the best parameters found in the previous notebook
cvec1 = CountVectorizer(max_df= 0.95, 
                       max_features= 2500, 
                       min_df= 2, 
                       ngram_range= (1,1))
cvec2 = CountVectorizer(max_df= 0.95, 
                       max_features= 2000, 
                       min_df= 2, 
                       ngram_range= (1,1))
tfidf1 = TfidfVectorizer(max_df= 0.95, 
                       max_features= 2000, 
                       min_df= 5, 
                       ngram_range= (1,1))

tfidf2 = TfidfVectorizer(max_df= 0.95, 
                       max_features= 1500, 
                       min_df= 5, 
                       ngram_range= (1,2))
mnb = MultinomialNB()
logit = LogisticRegression()
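
To see what `ngram_range=(1,2)` changes relative to `(1,1)`, the sketch below fits two vectorizers on a single made-up sentence and compares their vocabularies. The sentence is an assumption for illustration only. `min_df` and `max_df` are left at their defaults here, because a one-document corpus cannot satisfy `min_df=2` or `min_df=5`.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["able to help"]  # made-up sentence, not project data

unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)
uni_bi = CountVectorizer(ngram_range=(1, 2)).fit(docs)

print(sorted(unigrams.vocabulary_))  # single words only
print(sorted(uni_bi.vocabulary_))    # single words plus adjacent word pairs
```

`TfidfVectorizer` builds its vocabulary the same way and then reweights the counts, so the same n-gram behavior applies to `tfidf1` and `tfidf2`.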

In [65]:
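# Build the pipeline: one vectorizer followed by a hard-voting ensemble.
# The commented-out vectorizers are the alternatives defined in the previous cell.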
pipe2 = Pipeline([
#     ('cvec1', CountVectorizer(max_df= 0.95, 
#                        max_features= 2500, 
#                        min_df= 2, 
#                        ngram_range= (1,1))),  
     
#      ('cvec2', CountVectorizer(max_df= 0.95, 
#                        max_features= 2000, 
#                        min_df= 2, 
#                        ngram_range= (1,1))),
     
#      ('tfidf1', TfidfVectorizer(max_df= 0.95, 
#                        max_features= 2000, 
#                        min_df= 5, 
#                        ngram_range= (1,1))),
     
     ('tfidf2',  TfidfVectorizer(max_df= 0.95, # kept: this vectorizer gave the highest train and test scores
                       max_features= 1500, 
                       min_df= 5, 
                       ngram_range= (1,2))),
     
    ('vc', VotingClassifier(estimators= [('mnb', mnb),
                                   ('logit', logit)], 
                     voting = 'hard'))
])
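
With `voting='hard'` and only two estimators, there is no majority whenever the two disagree. scikit-learn's documentation states that ties are resolved by the ascending sort order of the class labels, so a disagreement between `mnb` and `logit` would go to the lower label. The function below is a plain-Python illustration of that rule, not scikit-learn's implementation, and `hard_vote` is a hypothetical name.

```python
from collections import Counter


def hard_vote(votes):
    """Return the most common label; break ties toward the smallest label."""
    counts = Counter(votes)
    top = max(counts.values())
    return min(label for label, n in counts.items() if n == top)


print(hard_vote([1, 1]))  # both estimators agree
print(hard_vote([0, 1]))  # disagreement: the tie goes to the lower label
```

Adding a third estimator or switching to `voting='soft'`, which averages predicted probabilities, are two common ways to avoid such ties.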

In [66]:
X = menrelationship_df["Title_Content"]
y = menrelationship_df['Subreddit']

#Train Test Split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify = y)
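
The `stratify=y` argument asks `train_test_split` to keep the class proportions of `y` roughly equal in both splits. Below is a small check on made-up labels. The 60/40 balance is an assumption, not the project's actual class balance.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Made-up data: 60 posts of class 0 and 40 of class 1.
X_demo = [f"post {i}" for i in range(100)]
y_demo = [0] * 60 + [1] * 40

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(Counter(y_tr))  # class counts in the training split
print(Counter(y_te))  # class counts in the test split
```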

In [67]:
pipe2.fit(X_train, y_train)



Pipeline(memory=None,
     steps=[('tfidf2', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.95, max_features=1500, min_df=5,
        ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf=Tru...0, warm_start=False))],
         flatten_transform=None, n_jobs=None, voting='hard', weights=None))])

In [68]:
print('train score', pipe2.score(X_train, y_train))
print('test score', pipe2.score(X_test, y_test))

train score 0.9488636363636364
test score 0.9359823399558499


### Extracting Coefficients 

In [69]:
pipe2.named_steps['vc'].estimators_[0].coef_[0]

array([-7.71977346, -7.72375633, -5.68410029, ..., -8.34486223,
       -8.10852961, -7.26970552])
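
The values are all negative, which is consistent with log probabilities: in the scikit-learn version that produced this output, `MultinomialNB.coef_` exposes the smoothed log probability of each feature given the class (also stored in `feature_log_prob_`). Below is a sketch of that quantity for a single class with additive smoothing. The feature totals are hypothetical, and `smoothed_log_probs` is an illustrative name rather than a library function.

```python
import math


def smoothed_log_probs(feature_totals, alpha=1.0):
    """log P(feature | class) with additive (Laplace/Lidstone) smoothing."""
    denom = sum(feature_totals) + alpha * len(feature_totals)
    return [math.log((total + alpha) / denom) for total in feature_totals]


# Hypothetical per-feature totals for one class.
log_probs = smoothed_log_probs([3.0, 0.0, 1.0])
print(log_probs)
```

With TF-IDF features, the totals are sums of TF-IDF weights rather than raw counts. Because these values are log probabilities for one class rather than weights contrasting two classes, they are not directly comparable to the logistic regression coefficients.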

In [70]:
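# Extract the per-feature values from each fitted estimator in the ensemble.
# estimators_[0] is the MultinomialNB and estimators_[1] is the LogisticRegression.
# Note: MultinomialNB.coef_ was removed in newer scikit-learn releases;
# for a binary problem, feature_log_prob_[1] holds the same values.
mnb2_coef = pipe2.named_steps['vc'].estimators_[0].coef_[0]

logit2_coef = pipe2.named_steps['vc'].estimators_[1].coef_[0]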

In [71]:
# Get the column (feature) names for the men/relationship model from the
# vectorizer already fitted inside the pipeline, so the names are guaranteed
# to line up with the coefficients.
# On newer scikit-learn releases, use get_feature_names_out() instead.
col_name2 = pipe2.named_steps['tfidf2'].get_feature_names()
len(col_name2)

1500

In [72]:
menrelationship_word = pd.DataFrame(data= [mnb2_coef, logit2_coef], columns= col_name2, index= ['nb_coef', 'logit_coef'])
menrelationship_word = menrelationship_word.T
menrelationship_word.head()

| | nb_coef | logit_coef |
|---|---|---|
| able | -7.719773 | -0.126985 |
| able to | -7.723756 | -0.097881 |
| about | -5.6841 | 0.531166 |
| about her | -7.660617 | 0.125422 |
| about him | -8.116777 | 0.081896 |
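
To read the table, it can help to sort by the logistic regression coefficient: larger positive values push a post toward the class labeled 1, and negative values push it toward the class labeled 0. Below is a minimal sketch using only the five rows shown above. `word_demo` is a hypothetical name, and the full `menrelationship_word` frame could be sorted the same way.

```python
import pandas as pd

# The five rows displayed above, re-entered by hand for illustration.
word_demo = pd.DataFrame(
    {
        "nb_coef": [-7.719773, -7.723756, -5.684100, -7.660617, -8.116777],
        "logit_coef": [-0.126985, -0.097881, 0.531166, 0.125422, 0.081896],
    },
    index=["able", "able to", "about", "about her", "about him"],
)

# Largest logistic regression coefficients first.
ranked = word_demo.sort_values("logit_coef", ascending=False)
print(ranked.index.tolist())
```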
