# 3. Model Evaluation
<br>
This is the final notebook of four in the Reddit API scrape and classification project.


#### Actions performed:
   - [Preprocessing: determine which filtering methods to apply](#preprocessing)
   - [Score logistic regression model; analyze some metrics and incorrect predictions](#metrics)
   - [Gridsearch optimization:](#gridsearch)
       1. Logistic Regression
       2. MultinomialNB
       3. Random Forest
       4. Extra Trees Classifier
       5. Bagging Classifier
       6. AdaBoost
       
 

---

Import modules

In [2]:
# cleaning and processing tools
import pandas as pd
import numpy as np
import regex as re
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from bs4 import BeautifulSoup         
import nltk
from nltk.corpus import stopwords

In [3]:
# modeling tools
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import stop_words
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.ensemble import BaggingClassifier



In [4]:
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier

### Import cleaned dataframe

In [5]:
df = pd.read_csv('../datasets/dad_five.csv')

In [6]:
df.head(1)

Unnamed: 0,title,selftext,subreddit,post
0,"Today, my son asked ""Can I have a book mark?"" ...",,0,"Today, my son asked ""Can I have a book mark?"" ..."


### Baseline Accuracy

In [7]:
df['subreddit'].value_counts(normalize = True)

0    0.533292
1    0.466708
Name: subreddit, dtype: float64

- The majority class makes up 53% of the posts, so my model needs more than 53% accuracy to beat the baseline.
- Since the two classes are close to balanced, I shouldn't need to stratify y during the train/test split.
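
The baseline is simply the accuracy of always predicting the most common class. A minimal sketch of that calculation, using a small made-up label list rather than the project's data (the `labels` values and the `majority_baseline` helper are hypothetical):

```python
from collections import Counter

def majority_baseline(labels):
    """Return (majority_label, accuracy) for always predicting the most common label."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# hypothetical toy labels: 0 = dad joke, 1 = ELI5
labels = [0, 0, 0, 0, 1, 1, 1]
majority_label, baseline = majority_baseline(labels)
print(majority_label, round(baseline, 3))
```

Any model worth keeping should score above this number on held-out data.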

 <a id='preprocessing'></a>
# Preprocessing and Cleaning
Before testing all models, I want to determine which preprocessing and filtering methods should be applied. To do this, I will compare the accuracy scores of a logistic regression model with a different combination of filters applied each time. The following cells are re-run once per test, and the scores are compiled in a table (see README.md).

### Question #1: What filtering methods should be applied?

Test 1: all filtering methods used (HTML/punctuation/digits/stopwords removed, lemmatizer used)

Test 2/3: stopwords included

Test 4/5: no lemmatizer

Test 6/7/8: HTML included

Test 9/10: NO filters (punctuation/digits included)

Test 11/12: HTML removed but punctuation/digits included

In [8]:
lem = WordNetLemmatizer()

# # Test 1 = all feature filters applied
# def clean_review(raw_review):
#     no_html = BeautifulSoup(raw_review).get_text()
#     letters_only = re.sub('[^a-zA-Z]', " ", no_html)
#     lower_list = letters_only.lower().split()
#     lem_list = [lem.lemmatize(i) for i in lower_list]
#     stops = set(stopwords.words('english'))
#     clean_words = [w for w in lem_list if w not in stops]
#     return(" ".join(clean_words))

# Test 2,3: stopwords left in, max feature limit or not
# def clean_review(raw_review):
#     no_html = BeautifulSoup(raw_review).get_text()
#     letters_only = re.sub('[^a-zA-Z]', " ", no_html)
#     lower_list = letters_only.lower().split()
#     lem_list = [lem.lemmatize(i) for i in lower_list]
#     return(" ".join(lem_list))

# Test 4,5: no lemmatizer, max features or not
def clean_review(raw_review):
    no_html = BeautifulSoup(raw_review).get_text()
    letters_only = re.sub('[^a-zA-Z]', " ", no_html)
    lower_list = letters_only.lower().split()
    return(" ".join(lower_list))

# Test 6,7, 8: html left in, max features or not
# def clean_review(raw_review):
#     letters_only = re.sub('[^a-zA-Z]', " ", raw_review)
#     lower_list = letters_only.lower().split()
#     return(" ".join(lower_list))


# # Test 9, 10: no filters
# def clean_review(raw_review):
#     lower_list = raw_review.lower().split()
#     return(" ".join(lower_list))

# Test 11, 12: html removed but punct left in
# def clean_review(raw_review):
#     no_html = BeautifulSoup(raw_review).get_text()
#     lower_list = no_html.lower().split()
#     return(" ".join(lower_list))

i. Compare effect of filtering on non-filtered review

In [9]:
print(df['post'][3021])
print("---------------------------------------------")
print(clean_review(df['post'][3021]))

 How does the Portrait mode on recent phones like the Pixel 3 determine what's in and out of focus? 
---------------------------------------------
how does the portrait mode on recent phones like the pixel determine what s in and out of focus


ii. Clean posts, append to list

In [10]:
# clean every post and collect the results in clean_corpus
X = df['post']
clean_corpus = []
print("Cleaning Reddit Corpus")
for j, document in enumerate(X, start=1):
    clean_corpus.append(clean_review(document))
    # report progress every 500 documents
    if j % 500 == 0:
        print(f'Documents ={j}')
print(f'Total Clean Documents={len(clean_corpus)}')

Cleaning Reddit Corpus
Documents =500
Documents =1000
Documents =1500
Documents =2000
Documents =2500
Documents =3000
Total Clean Documents=3259


iii. Train/test/split the cleaned data and countvectorize

In [11]:
# train/test/split and countvectorize
X = clean_corpus
y = df['subreddit']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, test_size = 0.33)

vect = CountVectorizer(analyzer = "word",   
                             tokenizer = None,    
                             preprocessor = None, 
                             stop_words = None,
                            ngram_range=(1, 2),
                             max_features = None)

train_data = vect.fit_transform(X_train)
test_data = vect.transform(X_test)
train_data = train_data.toarray()

print(test_data.shape)
print(y_test.shape)
print(train_data.shape)
print(y_train.shape)

(1076, 42934)
(1076,)
(2183, 42934)
(2183,)


iv. Score Log Reg Model

In [12]:
lr = LogisticRegression()

lr.fit(train_data, y_train)
print(lr.score(train_data, y_train))
print(lr.score(test_data, y_test))

0.9990838295923041
0.9321561338289963


- Not surprisingly, when n-grams are used the model performs worse if stopwords are removed.

In [13]:
# features we could look at to examine most important words
print(lr.intercept_)
print(lr.coef_)

[-2.19959767]
[[-3.56375673e-03  4.03588992e-02  4.03588992e-02 ... -9.58924646e-06
  -2.39233215e-02 -4.21819544e-03]]
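
To turn these coefficients into a list of the most influential words, each coefficient can be paired with its vocabulary entry and sorted. The sketch below uses made-up names and values; in this notebook the names would presumably come from the fitted vectorizer (`vect.get_feature_names_out()` in recent scikit-learn, `vect.get_feature_names()` in older releases) and the values from `lr.coef_[0]`. The `top_features` helper is a hypothetical name.

```python
def top_features(names, coefs, k=2):
    """Return the k most negative and the k most positive (name, coef) pairs."""
    ranked = sorted(zip(names, coefs), key=lambda pair: pair[1])
    return ranked[:k], ranked[-k:][::-1]

# hypothetical vocabulary and coefficients, for illustration only
names = ["dad", "explain", "joke", "why", "the"]
coefs = [-1.2, 1.5, -0.9, 0.8, 0.01]

toward_class_0, toward_class_1 = top_features(names, coefs)
print("pushes toward class 0:", toward_class_0)
print("pushes toward class 1:", toward_class_1)
```

Negative coefficients push a prediction toward class 0 and positive ones toward class 1, so the two ends of the sorted list are the most informative.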


**Summary Table of Results from Filtering Tests:** see README.md

---

    

 <a id='metrics'></a>
# Score Log Reg Model and analyze metrics/predictions

- No preprocessing filters applied (stopwords left in, no lemmatizer)
- Punctuation is dropped automatically by CountVectorizer's default token pattern; digits are kept unless they stand alone as a single character
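
To see which characters survive vectorization, the default `token_pattern` of `CountVectorizer` (`(?u)\b\w\w+\b`, as printed in the estimator repr further down) can be applied by hand. This is a rough sketch using only the standard-library `re` module on a made-up sentence; it mimics lowercasing plus the default pattern and is not the vectorizer itself.

```python
import re

# default CountVectorizer token pattern: runs of two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def default_tokens(text):
    """Lowercase the text and extract tokens the way the default pattern would."""
    return TOKEN_PATTERN.findall(text.lower())

# hypothetical example sentence
tokens = default_tokens("Pixel 3 vs iPhone 11: what's better?")
print(tokens)
```

Punctuation and single-character tokens (the lone `3`, the `s` from `what's`) are dropped, while multi-digit numbers such as `11` are kept.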

In [14]:
X = df['post']
y = df['subreddit']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, test_size = 0.33)


vect = CountVectorizer(analyzer = "word",   
                             tokenizer = None,    
                             preprocessor = None, 
                             stop_words = None,
                            ngram_range=(1, 2),
                             max_features = None)

train_data = vect.fit_transform(X_train)
test_data = vect.transform(X_test)
train_data = train_data.toarray()

print(test_data.shape)
print(y_test.shape)
print(train_data.shape)
print(y_train.shape)


(1076, 43481)
(1076,)
(2183, 43481)
(2183,)


In [15]:
lr = LogisticRegression()

lr.fit(train_data, y_train)
print(lr.score(train_data, y_train))
print(lr.score(test_data, y_test))

0.9990838295923041
0.9330855018587361


**Let's examine some incorrect predictions**

In [16]:
pred = lr.predict(test_data)
pred

array([0, 0, 0, ..., 1, 0, 1])

In [17]:
lr.coef_.shape

(1, 43481)

In [18]:
fun = pd.DataFrame(X_test, columns =['post'])

In [19]:
# create a new dataframe of test posts with actual and predicted labels
# score = actual + predicted: 0 or 2 means the prediction is correct
# (both 0 or both 1), 1 means the prediction is incorrect
fun = pd.DataFrame(X_test, columns =['post'])

fun.loc[:, 'actual'] = y_test
fun.loc[:, 'predicted'] = pred
fun.loc[:, 'score'] = fun['actual'] + fun['predicted']
fun.head()

Unnamed: 0,post,actual,predicted,score
1569,I just can't handle it. [This stuff always mak...,0,0,0
134,What's the cutest season? Awwwtumn.,0,0,0
844,A cheese shop asked me to rate thier service. ...,0,0,0
2247,Why is it so hard for old people to figure ou...,1,1,2
3030,What are the benefits of using metric system ...,1,1,2


In [20]:
fun['score'].value_counts()

0    558
2    446
1     72
Name: score, dtype: int64

**This is a list of all incorrect predictions**

In [21]:
fun[fun['score'] ==1]

Unnamed: 0,post,actual,predicted,score
2183,What causes you to overdose on cold medication?,1,0,1
2059,What chemical interactions happen inside huma...,1,0,1
1627,How are you gonna drive to the fruit and veg s...,0,1,1
257,What is heavy forward but not backward Ton,0,1,1
1670,What happens when Chanukah and Christmas get t...,0,1,1
3151,Lenz's and Farday's Laws,1,0,1
1053,What's the difference between a depression and...,0,1,1
239,Why do cows lie on each other in the rain To k...,0,1,1
1922,If Gambler's fallacy it's a thing... What's t...,1,0,1
1506,Why are they called apartments....? When they ...,0,1,1


Some observations about these incorrect predictions:
 - Several incorrectly predicted dad joke posts contain words that scored high in the TF-IDF vectorizer for ELI5 (when stopwords were removed). For example, "does", "like", and "know" appear in several of these dad jokes and could be the reason the model predicted them to be ELI5.
 - Many incorrectly predicted ELI5 posts are written as statements instead of questions (so they are missing question words like "why", "how", etc.). Also, several incorrect predictions contain the word "did", which scored high with the TF-IDF vectorizer for dad jokes.

**Build a confusion matrix**

In [22]:
cm = confusion_matrix(y_test, pred)
cm_df = pd.DataFrame(cm, columns = ['pred -', 'pred +'], index = ['actual - ', 'actual +'])
cm_df

Unnamed: 0,pred -,pred +
actual -,558,36
actual +,36,446


In [23]:
accuracy = (558 + 446) / (558+36+36+446)
accuracy

0.9330855018587361

In [24]:
missclass = (36 + 36)  /  (558+36+36+446)
missclass

0.06691449814126393

In [25]:
# true positive rate
sens = 446 / (36 + 446)
sens

0.9253112033195021

In [26]:
# true neg rate
spec = 558 / (558 + 36)
spec

0.9393939393939394

So the model is slightly better at classifying dad jokes (specificity, 93.9%) than it is at classifying ELI5 questions (sensitivity, 92.5%).
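
The rates above can also be computed in one place from the confusion-matrix counts. A small sketch using the counts reported above; the `rates` helper is a hypothetical name. In scikit-learn's layout (rows are actual, columns are predicted), `cm.ravel()` yields the counts in the order tn, fp, fn, tp.

```python
def rates(tn, fp, fn, tp):
    """Compute summary rates from binary confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "misclassification": (fp + fn) / total,
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "precision": tp / (tp + fp),
    }

# counts taken from the confusion matrix above
metrics = rates(tn=558, fp=36, fn=36, tp=446)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because the false positives and false negatives happen to be equal here (36 each), precision and sensitivity come out identical.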

 <a id='gridsearch'></a>
# Gridsearch comparison of all models

### #1. Gridsearch score of logistic regression

a) default parameters

In [199]:
X = df['post']
y = df['subreddit']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, test_size = 0.33)

In [200]:
cv = CountVectorizer()
lr = LogisticRegression()

pipe = Pipeline([
    ('cv', cv),
    ('lr', lr)
])

gs = GridSearchCV(pipe, param_grid = {}, return_train_score=True, cv=5)
gs.fit(X_train, y_train)

print(gs.best_score_)
print(gs.best_params_)
print(gs.score(X_train, y_train))
gs.score(X_test, y_test)

0.9147961520842877
{}
0.9931287219422813


0.9172862453531598

b) hyperparameter optimization

In [204]:
params = {
    'cv__stop_words': [None],
    'cv__max_features': [None, 7000, 10000],
    'cv__ngram_range' : [(1,1),(1,2)],
    'lr__penalty' : ['l2'],
    'lr__C': [1.0, 2, 5, 10, 20]
    }

gs = GridSearchCV(pipe, param_grid=params, return_train_score=True, cv=5)
gs.fit(X_train, y_train);
print(gs.best_score_)
gs.best_params_

0.923957856161246


{'cv__max_features': None,
 'cv__ngram_range': (1, 2),
 'cv__stop_words': None,
 'lr__C': 5,
 'lr__penalty': 'l2'}
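
The cost of a grid search grows with the product of the option counts. As a rough sketch, the number of candidate combinations and cross-validation fits for the grid above can be counted with `itertools.product` (the `count_fits` helper is a hypothetical name):

```python
from itertools import product

def count_fits(param_grid, n_folds):
    """Return (n_candidates, n_cv_fits) for an exhaustive grid search."""
    n_candidates = len(list(product(*param_grid.values())))
    return n_candidates, n_candidates * n_folds

# same grid as the logistic regression search above
params = {
    'cv__stop_words': [None],
    'cv__max_features': [None, 7000, 10000],
    'cv__ngram_range': [(1, 1), (1, 2)],
    'lr__penalty': ['l2'],
    'lr__C': [1.0, 2, 5, 10, 20],
}

n_candidates, n_cv_fits = count_fits(params, n_folds=5)
print(n_candidates, n_cv_fits)
```

With `refit=True` (the `GridSearchCV` default), one more fit on the full training set follows the cross-validation fits.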

In [205]:
print(gs.score(X_train, y_train))
gs.score(X_test, y_test)

0.9990838295923041


0.9321561338289963

In [41]:
# cv_results shows the score for each combination of parameters
scores = pd.DataFrame(gs.cv_results_).T
scores

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
mean_fit_time,0.0622721,0.0500613,0.170448,0.106797,0.309287,0.168168,0.0636616,0.0540187,0.188671,0.121428,0.327894,0.166386,0.066666,0.0557305,0.171852,0.122301,0.319904,0.178055
std_fit_time,0.004733,0.00126503,0.00396971,0.00236417,0.00567067,0.00303674,0.000811197,0.00283974,0.0131352,0.00579958,0.00306169,0.00140978,0.00495108,0.00438728,0.00362407,0.00181668,0.0128388,0.00799314
mean_score_time,0.0105358,0.00898776,0.018247,0.0129671,0.0247729,0.0161896,0.0107127,0.0101295,0.0226887,0.0159694,0.030999,0.0171419,0.0113708,0.010931,0.0196231,0.0154451,0.0283714,0.0181802
std_score_time,0.00045542,0.000253977,0.000644924,0.000310558,0.000701213,0.000445582,0.000105093,0.000580573,0.0027231,0.00101544,0.0041373,0.00058467,0.000602411,0.0019012,0.000579007,0.00113149,0.00220907,0.00117669
param_cv__max_features,3000,3000,3000,3000,3000,3000,7000,7000,7000,7000,7000,7000,10000,10000,10000,10000,10000,10000
param_cv__ngram_range,"(1, 1)","(1, 1)","(1, 2)","(1, 2)","(1, 3)","(1, 3)","(1, 1)","(1, 1)","(1, 2)","(1, 2)","(1, 3)","(1, 3)","(1, 1)","(1, 1)","(1, 2)","(1, 2)","(1, 3)","(1, 3)"
param_cv__stop_words,,english,,english,,english,,english,,english,,english,,english,,english,,english
params,"{'cv__max_features': 3000, 'cv__ngram_range': ...","{'cv__max_features': 3000, 'cv__ngram_range': ...","{'cv__max_features': 3000, 'cv__ngram_range': ...","{'cv__max_features': 3000, 'cv__ngram_range': ...","{'cv__max_features': 3000, 'cv__ngram_range': ...","{'cv__max_features': 3000, 'cv__ngram_range': ...","{'cv__max_features': 7000, 'cv__ngram_range': ...","{'cv__max_features': 7000, 'cv__ngram_range': ...","{'cv__max_features': 7000, 'cv__ngram_range': ...","{'cv__max_features': 7000, 'cv__ngram_range': ...","{'cv__max_features': 7000, 'cv__ngram_range': ...","{'cv__max_features': 7000, 'cv__ngram_range': ...","{'cv__max_features': 10000, 'cv__ngram_range':...","{'cv__max_features': 10000, 'cv__ngram_range':...","{'cv__max_features': 10000, 'cv__ngram_range':...","{'cv__max_features': 10000, 'cv__ngram_range':...","{'cv__max_features': 10000, 'cv__ngram_range':...","{'cv__max_features': 10000, 'cv__ngram_range':..."
split0_test_score,0.915332,0.837529,0.933638,0.83524,0.935927,0.839817,0.913043,0.828375,0.933638,0.837529,0.935927,0.828375,0.913043,0.828375,0.935927,0.837529,0.935927,0.839817
split1_test_score,0.919908,0.82151,0.929062,0.82151,0.926773,0.826087,0.919908,0.828375,0.93135,0.819222,0.922197,0.82151,0.919908,0.828375,0.93135,0.810069,0.93135,0.812357


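Not surprisingly, stop words are VERY important! In the split scores above, every configuration that removes English stop words scores roughly 8 to 12 percentage points lower than its counterpart that keeps them.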

### #2. MultinomialNB
<br> a) default score

In [206]:
# run gridsearch without tuning parameters

# X = clean_corpus
X = df['post']
y = df['subreddit']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, test_size = 0.33)

cv = CountVectorizer()
model = MultinomialNB()

pipe = Pipeline([
    ('cv', cv),
    ('model', model)
])

gs = GridSearchCV(pipe, param_grid = {}, return_train_score=True, cv=5)
gs.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('cv', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)), ('model', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]),
       fit_params=None, iid=True, n_jobs=1, param_grid={},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [207]:
print(gs.best_score_)
print(gs.best_params_)
# print(gs.score(X_train, y_train))
# gs.score(X_test, y_test)

0.8983050847457628
{}


In [208]:
print(gs.score(X_train, y_train))
gs.score(X_test, y_test)

0.9780119102153001


0.8689591078066915

b) hyperparameter optimization 

In [209]:
params = {
    'cv__stop_words': [None, 'english'],
    'cv__max_features': [3000, 4000, 7000, None],
    'cv__ngram_range' : [(1,1), (1, 2)], 
    "model__alpha": [1.0, 5.0, 0.1]
}
gs = GridSearchCV(pipe, param_grid=params, return_train_score=True, cv=5)
gs.fit(X_train, y_train);
print(gs.best_score_)
gs.best_params_

0.9097572148419606


{'cv__max_features': None,
 'cv__ngram_range': (1, 2),
 'cv__stop_words': None,
 'model__alpha': 0.1}

In [211]:
print(gs.score(X_train, y_train))
gs.score(X_test, y_test)

1.0


0.8782527881040892

In [60]:
scores = pd.DataFrame(gs.cv_results_).T
scores

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
mean_fit_time,0.0490338,0.0454088,0.155101,0.102563,0.297713,0.158341,0.0469511,0.043715,0.154607,0.101934,0.287844,0.157591,0.0465664,0.0414627,0.148071,0.0954224,0.284867,0.159023
std_fit_time,0.00216932,0.00226354,0.00600786,0.0045666,0.010399,0.0091292,0.00037444,0.00182711,0.00234074,0.00562203,0.0095528,0.00543634,0.000309221,0.000393396,0.00377595,0.00111186,0.00252848,0.0052392
mean_score_time,0.00997195,0.00919323,0.018606,0.0130199,0.0249485,0.0156729,0.00964084,0.0089499,0.0181919,0.0129216,0.025477,0.0158454,0.00981035,0.00882893,0.0179274,0.0124278,0.024458,0.0162475
std_score_time,0.000853971,0.00056106,0.000559472,0.000687887,0.00112558,0.000400221,0.000514078,0.000366111,0.000984063,0.000648542,0.00137369,0.000599271,0.000438918,0.000299481,0.000655879,0.000274518,0.000598556,0.000446304
param_cv__max_features,3000,3000,3000,3000,3000,3000,4000,4000,4000,4000,4000,4000,7000,7000,7000,7000,7000,7000
param_cv__ngram_range,"(1, 1)","(1, 1)","(1, 2)","(1, 2)","(1, 3)","(1, 3)","(1, 1)","(1, 1)","(1, 2)","(1, 2)","(1, 3)","(1, 3)","(1, 1)","(1, 1)","(1, 2)","(1, 2)","(1, 3)","(1, 3)"
param_cv__stop_words,,english,,english,,english,,english,,english,,english,,english,,english,,english
params,"{'cv__max_features': 3000, 'cv__ngram_range': ...","{'cv__max_features': 3000, 'cv__ngram_range': ...","{'cv__max_features': 3000, 'cv__ngram_range': ...","{'cv__max_features': 3000, 'cv__ngram_range': ...","{'cv__max_features': 3000, 'cv__ngram_range': ...","{'cv__max_features': 3000, 'cv__ngram_range': ...","{'cv__max_features': 4000, 'cv__ngram_range': ...","{'cv__max_features': 4000, 'cv__ngram_range': ...","{'cv__max_features': 4000, 'cv__ngram_range': ...","{'cv__max_features': 4000, 'cv__ngram_range': ...","{'cv__max_features': 4000, 'cv__ngram_range': ...","{'cv__max_features': 4000, 'cv__ngram_range': ...","{'cv__max_features': 7000, 'cv__ngram_range': ...","{'cv__max_features': 7000, 'cv__ngram_range': ...","{'cv__max_features': 7000, 'cv__ngram_range': ...","{'cv__max_features': 7000, 'cv__ngram_range': ...","{'cv__max_features': 7000, 'cv__ngram_range': ...","{'cv__max_features': 7000, 'cv__ngram_range': ..."
split0_test_score,0.90389,0.869565,0.90389,0.8627,0.908467,0.8627,0.897025,0.874142,0.913043,0.885584,0.913043,0.883295,0.908467,0.869565,0.91762,0.881007,0.924485,0.878719
split1_test_score,0.908467,0.846682,0.894737,0.84897,0.894737,0.855835,0.906178,0.84897,0.894737,0.84897,0.892449,0.851259,0.897025,0.864989,0.901602,0.853547,0.901602,0.853547


### KNN

In [132]:
# X = clean_corpus
X = df['post']
y = df['subreddit']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, test_size = 0.33)


# gridsearch w/ no params 


knn_pipe = Pipeline([
    ('cv', CountVectorizer()),
    ('ss', StandardScaler()),
    ('knn', KNeighborsClassifier())
    ])





rf_params = {}

gs = GridSearchCV(knn_pipe, param_grid=rf_params, cv=5, return_train_score=True,)
gs.fit(X_train, y_train)
print(gs.best_score_)
gs.best_params_

ValueError: could not convert string to float: 'What do you call a dictionary on drugs? High definition. '

I could not get KNN to work, so I am leaving it out of my evaluation.
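
One possible way to get a KNN pipeline running is sketched below. The traceback suggests that raw strings reached a numeric step; separately, even with the vectorizer in place, `StandardScaler` with its default `with_mean=True` cannot center a sparse matrix. This sketch assumes the second issue matters and sets `with_mean=False`. The toy corpus and labels are made up, and this has not been run against the project's data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# hypothetical toy corpus: 0 = dad joke, 1 = ELI5
docs = [
    "what do you call a fake noodle an impasta",
    "why did the scarecrow win an award",
    "how does a microwave heat food",
    "why is the sky blue",
]
labels = [0, 0, 1, 1]

knn_pipe = Pipeline([
    ('cv', CountVectorizer()),
    # with_mean=False scales without centering, which keeps the matrix sparse
    ('ss', StandardScaler(with_mean=False)),
    ('knn', KNeighborsClassifier(n_neighbors=1)),
])

knn_pipe.fit(docs, labels)
preds = knn_pipe.predict(["why is the ocean salty"])
print(preds)
```

Passing the raw text Series to the pipeline (not a pre-vectorized array) lets `CountVectorizer` handle the string-to-number conversion inside each cross-validation fold.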

###  #3. Random Forest

a.) default 

In [153]:
# X = clean_corpus
X = df['post']
y = df['subreddit']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, test_size = 0.33)


# gridsearch w/ no params 
cv = CountVectorizer()
rf = RandomForestClassifier(random_state=42)

pipe = Pipeline([
    ('cv', cv),
    ('rf', rf)])

In [154]:
rf_params = {}

gs = GridSearchCV(pipe, param_grid=rf_params, cv=5, return_train_score=True,)
gs.fit(X_train, y_train)
print(gs.best_score_)
gs.best_params_

0.864406779661017


{}

b) Optimize hyperparameters

In [156]:
rf_params = { 
    'cv__max_features': [None, 4000, 7000],
    'cv__max_df': [1.0, 0.7],
    'rf__n_estimators': [10, 50, 100],
#     'rf__max_depth': [None, 1, 2, 3, 4],
    'rf__max_features': ['auto', 1.0, 0.5],
#     'rf__criterion': ["gini", "entropy"],
    }

gs = GridSearchCV(pipe, param_grid=rf_params, cv=5, return_train_score=True,)
gs.fit(X_train, y_train)
print(gs.best_score_)
gs.best_params_

0.9111314704535044


{'cv__max_df': 1.0,
 'cv__max_features': 7000,
 'rf__max_features': 'auto',
 'rf__n_estimators': 100}

In [157]:
gs.score(X_train, y_train)

1.0

In [158]:
gs.score(X_test, y_test)

0.9172862453531598

### #4. Extra Trees
a) default

In [178]:
et = ExtraTreesClassifier()

X = df['post']
y = df['subreddit']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, test_size = 0.33)


# gridsearch w/ no params 
cv = CountVectorizer()
et = ExtraTreesClassifier()

pipe = Pipeline([
    ('cv', cv),
    ('et', et)])

rf_params = {}

gs = GridSearchCV(pipe, param_grid=rf_params, cv=5, return_train_score=True,)
gs.fit(X_train, y_train)
print(gs.best_score_)
gs.best_params_

0.8754008245533669


{}

b) hyperparameter optimization

In [180]:
rf_params = { 
    'cv__max_features': [None],
    'cv__max_df': [1.0, 0.7],
#     'cv__min_df': [1.0, 0.1],
    'et__n_estimators': [500, 1000, 5000],
#     'rf__max_depth': [None, 1, 2, 3, 4],
    'et__max_features': ['auto'],
     'et__criterion': ["entropy"],
    }

gs = GridSearchCV(pipe, param_grid=rf_params, cv=5, return_train_score=True,)
gs.fit(X_train, y_train)
print(gs.best_score_)
gs.best_params_

0.9294548786074209


{'cv__max_df': 1.0,
 'cv__max_features': None,
 'et__criterion': 'entropy',
 'et__max_features': 'auto',
 'et__n_estimators': 1000}

In [181]:
print(gs.score(X_train, y_train))
print(gs.score(X_test, y_test))

1.0
0.921003717472119


### (tangent - try Voting Classifier)
compare 4 models, using default parameters

In [149]:
# X = clean_corpus
X = df['post']
y = df['subreddit']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, test_size = 0.33)


# gridsearch w/ no params 
cv = CountVectorizer()
rf = RandomForestClassifier()
ada = AdaBoostClassifier()
gb = GradientBoostingClassifier()
bag = BaggingClassifier()


In [150]:
knn_pipe = Pipeline([
    ('cv', CountVectorizer()),
    ('ss', StandardScaler()),
    ('knn', KNeighborsClassifier())])


# vote w/ no params
rf_pipe = Pipeline([
    ('cv', cv),
    ('rf', rf)])

ada_pipe = Pipeline([
    ('cv', cv),
    ('ada', ada)])
     
gb_pipe = Pipeline([
    ('cv', cv),
    ('gb', gb)])

bag_pipe = Pipeline([
    ('cv', cv),
    ('bag', bag)])

In [151]:
vote = VotingClassifier([
#     ('knn', knn_pipe), 
    ('rf', rf_pipe),
    ('ada', ada_pipe), 
    ('gb', gb_pipe),
    ('bag', bag_pipe)
])

vote.fit(X_train, y_train)

VotingClassifier(estimators=[('rf', Pipeline(memory=None,
     steps=[('cv', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), prepr...imators=10, n_jobs=1, oob_score=False, random_state=None,
         verbose=0, warm_start=False))]))],
         flatten_transform=None, n_jobs=1, voting='hard', weights=None)

In [152]:
print(vote.score(X_train, y_train))
print(vote.score(X_test, y_test))

0.9583142464498396
0.8912639405204461




>So out of these 4 models (Random Forest, AdaBoost, Gradient Boosting, and Bagging, all with default params), Random Forest scores the highest.
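
With `voting='hard'` (as shown in the repr above), the ensemble predicts the label that most of its members predict. A minimal pure-Python sketch of that rule on made-up predictions follows; the `hard_vote` helper is hypothetical, and breaking ties toward the smallest label reflects my understanding of how scikit-learn resolves them.

```python
from collections import Counter

def hard_vote(predictions):
    """Majority vote over one sample's predictions; ties go to the smallest label."""
    counts = Counter(predictions)
    best = max(counts.values())
    return min(label for label, n in counts.items() if n == best)

# hypothetical per-model predictions (rf, ada, gb, bag) for three posts
votes = [
    [0, 0, 1, 0],  # clear majority for 0
    [1, 1, 1, 0],  # clear majority for 1
    [0, 1, 1, 0],  # 2-2 tie, resolved to 0
]
ensemble = [hard_vote(v) for v in votes]
print(ensemble)
```

With four voters, 2-2 ties like the last row are possible, which is one reason an odd number of estimators is often preferred.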



### #5. Bagging classifier

a) default

In [129]:
# X = clean_corpus
X = df['post']
y = df['subreddit']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, test_size = 0.33)


# gridsearch w/ no params 
cv = CountVectorizer()
bag = BaggingClassifier()

pipe = Pipeline([
    ('cv', cv),
    ('bag', bag)])

rf_params = {}

gs = GridSearchCV(pipe, param_grid=rf_params, cv=5, return_train_score=True,)
gs.fit(X_train, y_train)
print(gs.best_score_)
gs.best_params_

0.8767750801649107


{}

b) hyperparameter optimization

In [134]:
rf_params = { 
    'bag__n_estimators': [50, 100, 150],
    'bag__max_samples': [1.0, 0.75],
    'bag__max_features': [1.0],
    'bag__bootstrap': [False, True],
    'bag__bootstrap_features': [True, False],
    }

gs = GridSearchCV(pipe, param_grid=rf_params, cv=5, return_train_score=True,)
gs.fit(X_train, y_train)
print(gs.best_score_)
gs.best_params_

0.9106733852496565


{'bag__bootstrap': False,
 'bag__bootstrap_features': True,
 'bag__max_features': 1.0,
 'bag__max_samples': 0.75,
 'bag__n_estimators': 100}

In [135]:
print(gs.score(X_train, y_train))
print(gs.score(X_test, y_test))

1.0
0.9061338289962825


### #6. AdaBoost
a) default

In [125]:
# X = clean_corpus
X = df['post']
y = df['subreddit']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, test_size = 0.33)


# gridsearch w/ no params 
cv = CountVectorizer()
ada = AdaBoostClassifier(base_estimator=DecisionTreeClassifier())

pipe = Pipeline([
    ('cv', cv),
    ('ada', ada)])

rf_params = {}

gs = GridSearchCV(pipe, param_grid=rf_params, cv=5, return_train_score=True,)
gs.fit(X_train, y_train)
print(gs.best_score_)
gs.best_params_

0.858451672010994


{}

b) hyperparameter optimization

In [127]:
# the default adaboost is 50 decision stumps
# if you want to change the decision stump, do that in the base estimator
# can throw in any kind of classifier w/ sample weights

rf_params = {
    'ada__base_estimator__max_depth' : [1, 2, 3],
    'ada__n_estimators': [40, 50, 60]
}
gs = GridSearchCV(pipe, param_grid=rf_params, cv=5, return_train_score=True,)
gs.fit(X_train, y_train)
print(gs.best_score_)
gs.best_params_

# remember, a trailing underscore means the attribute is only available to you AFTER you fit

0.8946404031149794


{'ada__base_estimator__max_depth': 1, 'ada__n_estimators': 60}

In [128]:
print(gs.score(X_train, y_train))
print(gs.score(X_test, y_test))

0.9198350893266147
0.8921933085501859


## Evaluation of Results
_The results of these models have been compiled in a table and can be viewed in the README.md file_

- While logistic regression performed the best overall, parameter optimization improved it only slightly (test accuracy rose from 0.917 to 0.932).
- All models except AdaBoost suffered from overfitting (train accuracy at or near 1.0) and could be optimized further with a focus on reducing variance.
- The second best score was produced by the Extra Trees Classifier, which introduces more randomness than the Bagging Classifier and Random Forest Classifier, but still managed to produce less bias.
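
The overfitting observation can be checked by comparing train and test accuracy for each tuned model. The sketch below copies the scores printed earlier in this notebook and sorts the models by their train-test gap (the `scores` dict is my own arrangement of those numbers):

```python
# (train accuracy, test accuracy) of each tuned model, as printed above
scores = {
    "Logistic Regression": (0.9990838295923041, 0.9321561338289963),
    "MultinomialNB": (1.0, 0.8782527881040892),
    "Random Forest": (1.0, 0.9172862453531598),
    "Extra Trees": (1.0, 0.921003717472119),
    "Bagging": (1.0, 0.9061338289962825),
    "AdaBoost": (0.9198350893266147, 0.8921933085501859),
}

# gap = train - test; a large gap suggests high variance (overfitting)
gaps = {name: train - test for name, (train, test) in scores.items()}
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name:<20} gap = {gap:.3f}")
```

A small gap alone does not make a model better; it has to be read together with the test accuracy itself.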


Overall, the logistic regression model does a pretty decent job of distinguishing between a dad joke and an ELI5 question.