# Web APIs & Classification

## Project Challenge Statement

### Goal: 
#### 1. Using Reddit's API, collect posts from three subreddits: AskWomen, AskMen, and Relationship_Advice.
#### 2. Use NLP to train a classifier that predicts which subreddit a given post came from. Each pair of subreddits is treated as a separate binary classification problem.
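
The collection step is not shown in this notebook. As a minimal sketch of how one page of a Reddit listing response could be flattened into the `Subreddit` / `Title_Content` layout used here, the helper below works on an already-downloaded JSON payload. The function name `extract_posts`, the sample payload, and the choice to join title and body with a space are assumptions for illustration, not the code that produced the CSV files.

```python
from typing import Any, Dict, List, Optional, Tuple


def extract_posts(listing: Dict[str, Any], label: int) -> Tuple[List[Dict[str, Any]], Optional[str]]:
    """Flatten one listing page into rows with a class label and combined text."""
    rows = []
    for child in listing["data"]["children"]:
        post = child["data"]
        # Join title and body; link-only posts have an empty selftext.
        text = f"{post.get('title', '')} {post.get('selftext', '')}".strip()
        rows.append({"Subreddit": label, "Title_Content": text})
    # "after" is the pagination cursor; None means there are no further pages.
    return rows, listing["data"].get("after")


# Hypothetical payload shaped like a listing response (not real data).
sample_page = {
    "data": {
        "after": "t3_abc123",
        "children": [
            {"data": {"title": "First toy title", "selftext": "toy body"}},
            {"data": {"title": "Second toy title", "selftext": ""}},
        ],
    }
}

rows, cursor = extract_posts(sample_page, label=0)
print(rows)
print(cursor)
```

Passing the returned cursor back as the `after` query parameter of the next request is how listing pages are usually chained.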

### Datasets: 
1. AskMen vs AskWomen (0, 1)
2. AskMen vs Relationship Advice (0, 1)
3. AskWomen vs Relationship Advice (0, 1)
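
The `(0, 1)` after each pair is read here as the label encoding: the first subreddit is class 0 and the second is class 1. A minimal sketch of building such a two-class frame is given below; `build_binary_dataset` and the toy rows are hypothetical, and the real CSV files may have been assembled differently.

```python
import pandas as pd


def build_binary_dataset(df_zero: pd.DataFrame, df_one: pd.DataFrame) -> pd.DataFrame:
    """Stack two per-subreddit frames and label them 0 and 1."""
    combined = pd.concat(
        [df_zero.assign(Subreddit=0), df_one.assign(Subreddit=1)],
        ignore_index=True,
    )
    return combined[["Subreddit", "Title_Content"]]


# Toy stand-ins for the scraped posts (hypothetical data).
askmen = pd.DataFrame({"Title_Content": ["toy askmen post one", "toy askmen post two"]})
askwomen = pd.DataFrame({"Title_Content": ["toy askwomen post"]})

toy_df = build_binary_dataset(askmen, askwomen)
print(toy_df)
```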

### Baseline Model Building 
Baseline models are built without custom tokenizers:
#### 1. CountVectorizer Model 
- with Logistic Regression 
- with Multinomial NB

#### 2. TF-IDF Model
- with Logistic Regression 
- with Multinomial NB

## Table of Contents 

This notebook is divided into sections for analysis. The links below jump to the corresponding section within the notebook.

### Contents:
- [CountVectorizer Model With Logistic Regression](#CountVectorizer-Model-With-Logistic-Regression)
    1. AskMen vs AskWomen (0, 1)
    2. AskMen vs Relationship Advice (0, 1)
    3. AskWomen vs Relationship Advice (0, 1)
    
    
- [CountVectorizer Model With Multinomial NB](#CountVectorizer-Model-With-Multinomial-NB)
    1. AskMen vs AskWomen (0, 1)
    2. AskMen vs Relationship Advice (0, 1)
    3. AskWomen vs Relationship Advice (0, 1)
    
    
- [TFIDF Model With Logistic Regression](#TFIDF-Model-With-Logistic-Regression)
    1. AskMen vs AskWomen (0, 1)
    2. AskMen vs Relationship Advice (0, 1)
    3. AskWomen vs Relationship Advice (0, 1)
    
    
- [TFIDF Model With Multinomial NB](#TFIDF-Model-With-Multinomial-NB)
    1. AskMen vs AskWomen (0, 1)
    2. AskMen vs Relationship Advice (0, 1)
    3. AskWomen vs Relationship Advice (0, 1)



##### Libraries 

In [94]:
import pandas as pd 
import numpy as np 
import re
from sklearn.pipeline import Pipeline 
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

In [95]:
menwomen_df = pd.read_csv('../data/AskMenAskWomen.csv')
menrelationship_df = pd.read_csv('../data/AskMen_Relationship.csv')
womenrelationship_df = pd.read_csv('../data/AskWomen_Relationship.csv')
multi_df = pd.read_csv('../data/AskMenWomen_Relationship.csv')

In [96]:
menwomen_df.head()

| Unnamed: 0.1 | Unnamed: 0 | Subreddit | Title_Content |
|---|---|---|---|
| 0 | 0 | 0 | The AskMen Book Club The Picture of Dorian G... |
| 1 | 1 | 0 | I am starting to realise my dad wont live fore... |
| 2 | 2 | 0 | What do you see on women s dating profiles tha... |
| 3 | 3 | 0 | What could women put in their dating profiles ... |
| 4 | 4 | 0 | What are some things on your mind that you can... |


### Tokenizer Function

In [102]:
#build a custom text normalizer with a lemmatizer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

def lemma(content):
    #split the lowercased text into word tokens
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(content.lower())
    #reduce each token to its WordNet lemma
    lemmatizer = WordNetLemmatizer()
    tokens_lem = [lemmatizer.lemmatize(i) for i in tokens]
    #returns one space-joined string, not a list of tokens
    return " ".join(tokens_lem)

In [103]:
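#build a custom text normalizer with a stemmer
#(RegexpTokenizer is imported in the previous cell)
from nltk.stem.porter import PorterStemmer

def stemmer(content):
    #split the lowercased text into word tokens
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(content.lower())
    #reduce each token to its Porter stem
    p_stemmer = PorterStemmer()
    stem_tokens = [p_stemmer.stem(i) for i in tokens]
    #returns one space-joined string, not a list of tokens
    return " ".join(stem_tokens)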
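
The two helpers above return a single joined string. In scikit-learn's vectorizers that shape matches the `preprocessor` hook (string in, string out), whereas the `tokenizer` hook expects a callable that returns a list of tokens. The sketch below shows both contracts side by side. To stay free of NLTK data downloads it uses `toy_normalize`, a hypothetical stand-in for a lemmatizer or stemmer that only strips a trailing "s"; the toy documents are also assumptions.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer


def toy_normalize(token: str) -> str:
    # Hypothetical stand-in for lemmatizing/stemming: drop a trailing "s".
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def as_preprocessor(content: str) -> str:
    # String in, string out: the vectorizer still tokenizes the result itself.
    return " ".join(toy_normalize(t) for t in re.findall(r"\w+", content.lower()))


def as_tokenizer(content: str) -> list:
    # String in, list of tokens out: replaces the vectorizer's own tokenization.
    return [toy_normalize(t) for t in re.findall(r"\w+", content.lower())]


docs = ["Dating profiles", "dating profile tips"]

vec_pre = CountVectorizer(preprocessor=as_preprocessor).fit(docs)
# token_pattern=None avoids the warning that the pattern is unused with a tokenizer.
vec_tok = CountVectorizer(tokenizer=as_tokenizer, token_pattern=None).fit(docs)

print(sorted(vec_pre.vocabulary_))
print(sorted(vec_tok.vocabulary_))
```

Under this reading, a grid entry such as `'cvec__preprocessor': [None, lemma, stemmer]` would fit the string-returning helpers, while `'cvec__tokenizer'` would need list-returning variants.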

## CountVectorizer Model With Logistic Regression

In [125]:
#building pipeline and gridsearch model 
#pipeline 
loggit_pipe = Pipeline([
    ('cvec', CountVectorizer()), 
    ('logreg', LogisticRegression())
])

#parameters for grid search 
params_1 = {
    'cvec__stop_words':[None, 'english'],
    'cvec__max_features': [1000, 1500, 2000, 2500],
    'cvec__min_df': [2, 5],
    'cvec__max_df': [.95, 0.99],
    'cvec__ngram_range': [(1,1), (1,2)],
#     'cvec__tokenizer':[None, lemma, stemmer]
}

#gridsearch 
gs1 = GridSearchCV(loggit_pipe, param_grid = params_1 , cv = 3)
gs2 = GridSearchCV(loggit_pipe, param_grid = params_1 , cv = 3)
gs3 = GridSearchCV(loggit_pipe, param_grid = params_1 , cv = 3)
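
To get a feel for the cost of each search, the number of candidate settings is the product of the list lengths in the grid, and each candidate is fitted once per cross-validation fold. A small sketch of that arithmetic, copying the grid from the cell above:

```python
from math import prod

# Same grid as params_1 above (the commented-out tokenizer entry is left out).
params_1 = {
    'cvec__stop_words': [None, 'english'],
    'cvec__max_features': [1000, 1500, 2000, 2500],
    'cvec__min_df': [2, 5],
    'cvec__max_df': [.95, 0.99],
    'cvec__ngram_range': [(1, 1), (1, 2)],
}
cv_folds = 3

n_candidates = prod(len(values) for values in params_1.values())
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)  # 64 192
```

With the default `refit=True`, `GridSearchCV` adds one more fit of the best setting on the full training split. Enabling a three-valued tokenizer entry would triple the candidate count.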

#### Model 1 with menwomen_df

In [126]:
X = menwomen_df["Title_Content"]
y = menwomen_df['Subreddit']

#Train Test Split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify = y)

In [127]:
gs1.fit(X_train, y_train)
gs1.best_params_



{'cvec__max_df': 0.95,
 'cvec__max_features': 1500,
 'cvec__min_df': 2,
 'cvec__ngram_range': (1, 2),
 'cvec__stop_words': 'english'}

In [128]:
print('train score', gs1.score(X_train, y_train))
print('test score', gs1.score(X_test, y_test))

train score 0.9289805269186713
test score 0.7013333333333334


#### Model 1 with menrelationship_df

In [129]:
X = menrelationship_df["Title_Content"]
y = menrelationship_df['Subreddit']

#Train Test Split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify = y)

In [130]:
gs2.fit(X_train, y_train)
gs2.best_params_



{'cvec__max_df': 0.95,
 'cvec__max_features': 2500,
 'cvec__min_df': 2,
 'cvec__ngram_range': (1, 1),
 'cvec__stop_words': None}

In [131]:
print('train score', gs2.score(X_train, y_train))
print('test score', gs2.score(X_test, y_test))

train score 0.9990530303030303
test score 0.9315673289183223


#### Model 1 with womenrelationship_df

In [132]:
X = womenrelationship_df["Title_Content"]
y = womenrelationship_df['Subreddit']

#Train Test Split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify = y)

In [133]:
gs3.fit(X_train, y_train)
gs3.best_params_



{'cvec__max_df': 0.95,
 'cvec__max_features': 2000,
 'cvec__min_df': 5,
 'cvec__ngram_range': (1, 1),
 'cvec__stop_words': None}

In [118]:
print('train score', gs3.score(X_train, y_train))
print('test score', gs3.score(X_test, y_test))

train score 0.998324958123953
test score 0.98635477582846
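
Because every dataset reuses the names `X`, `y`, `X_train`, and so on, and each one needs its own grid-search object, the split/fit/score steps are easy to mix up between cells. One possible way to keep them together is a small helper that creates a fresh `GridSearchCV` per dataset. The sketch below is an illustration only: `fit_and_score`, the toy frame, and the reduced grid are hypothetical, while the split settings mirror the cells above.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline


def fit_and_score(df, pipe, param_grid, cv=3):
    """Split one dataset, grid-search the pipeline, and report its scores."""
    X_train, X_test, y_train, y_test = train_test_split(
        df["Title_Content"], df["Subreddit"],
        test_size=0.3, random_state=42, stratify=df["Subreddit"],
    )
    # A fresh search object per call, so no fitted search is shared between datasets.
    gs = GridSearchCV(pipe, param_grid=param_grid, cv=cv)
    gs.fit(X_train, y_train)
    return {
        "best_params": gs.best_params_,
        "train_score": gs.score(X_train, y_train),
        "test_score": gs.score(X_test, y_test),
    }


# Toy stand-in for the CSV-backed frames (hypothetical data).
toy_df = pd.DataFrame({
    "Title_Content": [f"football stadium question number {i}" for i in range(10)]
                     + [f"garden flower question number {i}" for i in range(10)],
    "Subreddit": [0] * 10 + [1] * 10,
})
toy_pipe = Pipeline([("cvec", CountVectorizer()), ("logreg", LogisticRegression())])
toy_params = {"cvec__ngram_range": [(1, 1), (1, 2)]}

results = {name: fit_and_score(df, toy_pipe, toy_params)
           for name, df in {"toy_pair": toy_df}.items()}
for name, res in results.items():
    print(name, res["best_params"], res["train_score"], res["test_score"])
```

In the notebook, the dictionary passed to the loop would hold `menwomen_df`, `menrelationship_df`, and `womenrelationship_df`. Sharing one pipeline object is safe here because `GridSearchCV` clones the estimator before fitting.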


## CountVectorizer Model With Multinomial NB

In [134]:
#building pipeline and gridsearch model 
#pipeline 
nb_pipe = Pipeline([
    ('cvec', CountVectorizer()), 
    ('mnb', MultinomialNB())
])

#parameters for grid search 
params_2 = {
    'cvec__stop_words':[None, 'english'],
    'cvec__max_features': [1000, 1500, 2000, 2500],
    'cvec__min_df': [2, 5],
    'cvec__max_df': [.95, 0.99],
    'cvec__ngram_range': [(1,1), (1,2)]
}

#gridsearch 
gs4 = GridSearchCV(nb_pipe, param_grid = params_2 , cv = 3)
gs5 = GridSearchCV(nb_pipe, param_grid = params_2 , cv = 3)
gs6 = GridSearchCV(nb_pipe, param_grid = params_2 , cv = 3)

#### Model 2 with menwomen_df

In [135]:
X = menwomen_df["Title_Content"]
y = menwomen_df['Subreddit']

#Train Test Split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify = y)

In [136]:
gs4.fit(X_train, y_train)
gs4.best_params_

{'cvec__max_df': 0.95,
 'cvec__max_features': 1000,
 'cvec__min_df': 5,
 'cvec__ngram_range': (1, 1),
 'cvec__stop_words': 'english'}

In [137]:
print('train score', gs4.score(X_train, y_train))
print('test score', gs4.score(X_test, y_test))

train score 0.8029782359679267
test score 0.6986666666666667


#### Model 2 with menrelationship_df

In [138]:
X = menrelationship_df["Title_Content"]
y = menrelationship_df['Subreddit']

#Train Test Split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify = y)

In [139]:
gs5.fit(X_train, y_train)
gs5.best_params_

{'cvec__max_df': 0.95,
 'cvec__max_features': 2000,
 'cvec__min_df': 2,
 'cvec__ngram_range': (1, 1),
 'cvec__stop_words': None}

In [140]:
print('train score', gs5.score(X_train, y_train))
print('test score', gs5.score(X_test, y_test))

train score 0.9517045454545454
test score 0.9205298013245033


#### Model 2 with womenrelationship_df

In [141]:
X = womenrelationship_df["Title_Content"]
y = womenrelationship_df['Subreddit']

#Train Test Split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify = y)

In [142]:
gs6.fit(X_train, y_train)
gs6.best_params_

{'cvec__max_df': 0.95,
 'cvec__max_features': 2000,
 'cvec__min_df': 5,
 'cvec__ngram_range': (1, 1),
 'cvec__stop_words': None}

In [143]:
print('train score', gs6.score(X_train, y_train))
print('test score', gs6.score(X_test, y_test))

train score 0.981574539363484
test score 0.9766081871345029


## TFIDF Model With Logistic Regression

In [145]:
#building pipeline and gridsearch model 
#pipeline 
loggit_pipe2 = Pipeline([
    ('tfidf', TfidfVectorizer()), 
    ('logreg', LogisticRegression())
])

#parameters for grid search 
params_3 = {
    'tfidf__stop_words':[None, 'english'],
    'tfidf__max_features': [1000, 1500, 2000, 2500],
    'tfidf__min_df': [2, 5],
    'tfidf__max_df': [.95, 0.99],
    'tfidf__ngram_range': [(1,1), (1,2)]
}

#gridsearch 
gs7 = GridSearchCV(loggit_pipe2, param_grid = params_3 , cv = 3)
gs8 = GridSearchCV(loggit_pipe2, param_grid = params_3 , cv = 3)
gs9 = GridSearchCV(loggit_pipe2, param_grid = params_3 , cv = 3)

#### Model 3 with menwomen_df

In [146]:
X = menwomen_df["Title_Content"]
y = menwomen_df['Subreddit']

#Train Test Split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify = y)

In [147]:
gs7.fit(X_train, y_train)
gs7.best_params_



{'tfidf__max_df': 0.95,
 'tfidf__max_features': 1000,
 'tfidf__min_df': 2,
 'tfidf__ngram_range': (1, 1),
 'tfidf__stop_words': None}

In [148]:
print('train score', gs7.score(X_train, y_train))
print('test score', gs7.score(X_test, y_test))

train score 0.8304696449026346
test score 0.736


#### Model 3 with menrelationship_df

In [151]:
X = menrelationship_df["Title_Content"]
y = menrelationship_df['Subreddit']

#Train Test Split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify = y)

In [152]:
gs8.fit(X_train, y_train)
gs8.best_params_



{'tfidf__max_df': 0.95,
 'tfidf__max_features': 2000,
 'tfidf__min_df': 5,
 'tfidf__ngram_range': (1, 1),
 'tfidf__stop_words': None}

In [153]:
print('train score', gs8.score(X_train, y_train))
print('test score', gs8.score(X_test, y_test))

train score 0.9517045454545454
test score 0.9227373068432672


#### Model 3 with womenrelationship_df

In [161]:
X = womenrelationship_df["Title_Content"]
y = womenrelationship_df['Subreddit']

#Train Test Split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify = y)

In [162]:
gs9.fit(X_train, y_train)
gs9.best_params_



{'tfidf__max_df': 0.95,
 'tfidf__max_features': 1000,
 'tfidf__min_df': 5,
 'tfidf__ngram_range': (1, 1),
 'tfidf__stop_words': None}

In [163]:
print('train score', gs9.score(X_train, y_train))
print('test score', gs9.score(X_test, y_test))

train score 0.9807370184254607
test score 0.9805068226120858


## TFIDF Model With Multinomial NB

In [168]:
#building pipeline and gridsearch model 
#pipeline 
nb_pipe2 = Pipeline([
    ('tfidf', TfidfVectorizer()), 
    ('mnb', MultinomialNB())
])

#parameters for grid search 
params_4 = {
    'tfidf__stop_words':[None, 'english'],
    'tfidf__max_features': [1000, 1500, 2000, 2500],
    'tfidf__min_df': [2, 5],
    'tfidf__max_df': [.95, 0.99],
    'tfidf__ngram_range': [(1,1), (1,2)],
#     'tfidf__tokenizer':[None, lemma, stemmer]
}

#gridsearch 
gs10 = GridSearchCV(nb_pipe2, param_grid = params_4 , cv = 3)
gs11 = GridSearchCV(nb_pipe2, param_grid = params_4 , cv = 3)
gs12 = GridSearchCV(nb_pipe2, param_grid = params_4 , cv = 3)

#### Model 4 with menwomen_df

In [169]:
X = menwomen_df["Title_Content"]
y = menwomen_df['Subreddit']

#Train Test Split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify = y)

In [170]:
gs10.fit(X_train, y_train)
gs10.best_params_

{'tfidf__max_df': 0.95,
 'tfidf__max_features': 1000,
 'tfidf__min_df': 2,
 'tfidf__ngram_range': (1, 2),
 'tfidf__stop_words': None}

In [171]:
print('train score', gs10.score(X_train, y_train))
print('test score', gs10.score(X_test, y_test))

train score 0.7972508591065293
test score 0.7146666666666667


#### Model 4 with menrelationship_df

In [172]:
X = menrelationship_df["Title_Content"]
y = menrelationship_df['Subreddit']

#Train Test Split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify = y)

In [173]:
gs11.fit(X_train, y_train)
gs11.best_params_

{'tfidf__max_df': 0.95,
 'tfidf__max_features': 1500,
 'tfidf__min_df': 5,
 'tfidf__ngram_range': (1, 2),
 'tfidf__stop_words': None}

In [174]:
print('train score', gs11.score(X_train, y_train))
print('test score', gs11.score(X_test, y_test))

train score 0.928030303030303
test score 0.9183222958057395


#### Model 4 with womenrelationship_df

In [175]:
X = womenrelationship_df["Title_Content"]
y = womenrelationship_df['Subreddit']

#Train Test Split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify = y)

In [176]:
gs12.fit(X_train, y_train)
gs12.best_params_

{'tfidf__max_df': 0.95,
 'tfidf__max_features': 1500,
 'tfidf__min_df': 2,
 'tfidf__ngram_range': (1, 2),
 'tfidf__stop_words': None}

In [177]:
print('train score', gs12.score(X_train, y_train))
print('test score', gs12.score(X_test, y_test))

train score 0.9824120603015075
test score 0.9766081871345029
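
The twelve train/test scores printed above can be compared in one place. The sketch below copies them (rounded to four decimals) into a dictionary and picks, for each dataset, the model with the highest test score, along with the largest train-test gap as a rough overfitting signal. The dictionary layout and variable names are illustrative choices, and the CountVectorizer + Logistic Regression entry for `womenrelationship_df` comes from a cell numbered `In [118]`, so it may predate the fit shown above it.

```python
# (train, test) accuracy as printed in the cells above, rounded to 4 decimals.
reported = {
    ("CountVectorizer", "LogisticRegression"): {
        "menwomen_df": (0.9290, 0.7013),
        "menrelationship_df": (0.9991, 0.9316),
        "womenrelationship_df": (0.9983, 0.9864),
    },
    ("CountVectorizer", "MultinomialNB"): {
        "menwomen_df": (0.8030, 0.6987),
        "menrelationship_df": (0.9517, 0.9205),
        "womenrelationship_df": (0.9816, 0.9766),
    },
    ("TfidfVectorizer", "LogisticRegression"): {
        "menwomen_df": (0.8305, 0.7360),
        "menrelationship_df": (0.9517, 0.9227),
        "womenrelationship_df": (0.9807, 0.9805),
    },
    ("TfidfVectorizer", "MultinomialNB"): {
        "menwomen_df": (0.7973, 0.7147),
        "menrelationship_df": (0.9280, 0.9183),
        "womenrelationship_df": (0.9824, 0.9766),
    },
}

best_by_dataset = {}  # dataset -> (model, test score)
largest_gap = None    # (model, dataset, train - test)

for model, per_dataset in reported.items():
    for dataset, (train, test) in per_dataset.items():
        if dataset not in best_by_dataset or test > best_by_dataset[dataset][1]:
            best_by_dataset[dataset] = (model, test)
        gap = round(train - test, 4)
        if largest_gap is None or gap > largest_gap[2]:
            largest_gap = (model, dataset, gap)

for dataset, (model, test) in best_by_dataset.items():
    print(f"{dataset}: best test score {test:.4f} with {model[0]} + {model[1]}")
print("largest train-test gap:", largest_gap)
```

On these numbers, AskMen vs AskWomen is the hardest pair for every model, with test accuracy between roughly 0.70 and 0.74, while both pairs involving Relationship Advice score above 0.91.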
