# Modeling

In this notebook, I apply what I learned during EDA to several classification models: Logistic Regression, Naive Bayes, Random Forests, Extra Trees, and Support Vector Machines.

In [61]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
import seaborn as sns

In [16]:
df_clean = pd.read_csv('./data/clean_token_titles.csv')

In [17]:
X = df_clean['title']
y = df_clean['is_evolution']

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=42,
                                                    stratify=y)

In [18]:
cv = CountVectorizer(min_df=5)
cv.fit(X_train)
X_train_cv = cv.transform(X_train)
X_test_cv = cv.transform(X_test)

In [19]:
len(cv.get_feature_names())

408
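The effect of `min_df` is easiest to see on a tiny corpus. The sketch below uses invented titles and `min_df=2` (rather than the `min_df=5` used above) so that the filtering is visible; `toy_titles` and `toy_cv` are illustrative names, not part of the analysis.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy titles, invented purely for illustration (not taken from the dataset).
toy_titles = [
    "evolution of whales",
    "whales and evolution",
    "young earth questions",
    "evolution questions",
]

# min_df=2 keeps only terms that appear in at least two documents.
toy_cv = CountVectorizer(min_df=2)
toy_cv.fit(toy_titles)

# vocabulary_ maps each kept term to its column index.
kept_terms = sorted(toy_cv.vocabulary_)
print(kept_terms)  # -> ['evolution', 'questions', 'whales']
```

Terms that occur in a single title ("of", "and", "young", "earth") are dropped, which is how the full vocabulary shrinks to the 408 features reported above.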

In [20]:
X_train_cv = pd.DataFrame(X_train_cv.toarray(), columns=cv.get_feature_names())
X_test_cv = pd.DataFrame(X_test_cv.toarray(), columns=cv.get_feature_names())
X_train_cv.head()

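|  | 000 | 10 | 100 | 2018 | 50 | ability | abiogenesis | accept | actually | adam | ... | way | well | whale | work | world | would | year | years | yec | young |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |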


## Baseline Accuracy

In [21]:
y_train.value_counts(normalize=True)

0    0.504743
1    0.495257
Name: is_evolution, dtype: float64

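The baseline accuracy is 50.5%, the share of the majority class (0) in the training set. A model has to beat this figure to be better than always predicting class 0.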
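The baseline is simply the accuracy of always predicting the most common class. A minimal sketch of that calculation, using a hypothetical helper `majority_baseline` and invented labels:

```python
from collections import Counter


def majority_baseline(labels):
    """Return (majority_label, accuracy of always predicting that label)."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# Toy labels, invented for illustration: five 0s and four 1s.
toy_labels = [0, 1, 0, 1, 0, 0, 1, 1, 0]
label, accuracy = majority_baseline(toy_labels)
print(label, round(accuracy, 3))  # -> 0 0.556
```

In the notebook, the same number is the largest value returned by `y_train.value_counts(normalize=True)`.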

## Logistic Regression

In [60]:
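# Newer scikit-learn releases default to the lbfgs solver, which does not
# support the L1 penalty, so liblinear is requested explicitly.
lr = LogisticRegression(penalty='l1', solver='liblinear')

lr.fit(X_train_cv, y_train)
lr.score(X_train_cv, y_train), lr.score(X_test_cv, y_test)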

(0.8346883468834688, 0.7723577235772358)

In [59]:
lr = LogisticRegression()

lr.fit(X_train_cv, y_train)
lr.score(X_train_cv, y_train), lr.score(X_test_cv, y_test)

(0.8570460704607046, 0.7764227642276422)

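Logistic regression with the default L2 penalty achieved an accuracy of 85.7% on the training set and 77.6% on the test set. The L1-penalized variant scored slightly lower on both (83.5% and 77.2%).

The model is fairly overfit, with a gap of about 8 percentage points between training and test accuracy, but it is a large improvement over the 50.5% baseline.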

## Naive Bayes

In [23]:
from sklearn.naive_bayes import MultinomialNB

In [35]:
nb = MultinomialNB(alpha=15)
model = nb.fit(X_train_cv, y_train)
nb.score(X_train_cv, y_train), nb.score(X_test_cv, y_test)

(0.809620596205962, 0.7804878048780488)

This is likely the best model for the data. With 81.0% accuracy on the training set and 78.0% on the test set, it is only slightly overfit, and its test misclassification rate is 22%.
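The smoothing value `alpha=15` is fixed in the cell above. One way to choose it more systematically would be a cross-validated search, sketched below. The corpus, the labels, and the candidate values in `alpha_grid` are all invented for illustration. Wrapping the vectorizer and the classifier in a `Pipeline` means the vocabulary is relearned inside each fold rather than from the whole training set.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus and labels, invented for illustration only.
toy_titles = [
    "natural selection in finches", "fossil record of whales",
    "mutation rates in bacteria", "speciation in island birds",
    "genesis flood geology", "young earth and the flood",
    "design in the cell", "created kinds and the ark",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Both steps are refit on the training part of every CV fold.
pipe = Pipeline([
    ("vec", CountVectorizer()),
    ("nb", MultinomialNB()),
])

# Candidate smoothing values (assumed; not tuned for the real data).
alpha_grid = {"nb__alpha": [0.1, 1.0, 5.0, 15.0]}

search = GridSearchCV(pipe, param_grid=alpha_grid, cv=2)
search.fit(toy_titles, toy_labels)

print(search.best_params_, search.best_score_)
```

On the real data this would be fit on `X_train` (the raw titles) and `y_train` instead of the toy lists, with `min_df` set on the vectorizer step as needed.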

## Random Forests and Extra Trees

In [37]:
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

In [38]:
rf = RandomForestClassifier(n_estimators=10)

et = ExtraTreesClassifier(n_estimators=10)

In [42]:
cross_val_score(rf, X_train_cv, y_train, cv=5).mean()

0.7167911131470454

In [44]:
cross_val_score(et, X_train_cv, y_train, cv=5).mean()

0.7201923957856161

In [46]:
rf = RandomForestClassifier(random_state=42)

rf_params = {
    'n_estimators': [10, 20, 50, 100],
    'max_depth': [None, 3, 4, 5],
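    # 'auto' meant sqrt(n_features) for classifiers and has been removed from
    # newer scikit-learn releases, so 'sqrt' is used here.
    'max_features': ['sqrt', 100, 200, 300, 4, 5, 50]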
}

gs = GridSearchCV(rf, param_grid=rf_params, cv=5)
gs.fit(X_train_cv, y_train)
print(gs.best_score_)
gs.best_params_

0.7398373983739838


{'max_depth': None, 'max_features': 4, 'n_estimators': 100}
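To get a sense of how much work this grid search does, the number of candidate settings and model fits can be counted directly. The sketch below copies the shape of the grid above; `"sqrt"` stands in for the default feature-subset option (what `'auto'` meant for classifiers), which does not affect the count.

```python
from itertools import product

# Same shape as the rf_params / et_params grids in this notebook.
param_grid = {
    "n_estimators": [10, 20, 50, 100],
    "max_depth": [None, 3, 4, 5],
    "max_features": ["sqrt", 100, 200, 300, 4, 5, 50],
}
n_folds = 5

# Every combination of one value per parameter is one candidate model.
candidates = list(product(*param_grid.values()))
n_candidates = len(candidates)

# Each candidate is fit once per fold (the final refit is not counted).
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)  # -> 112 560
```

With 408 features, the square-root default would consider about 20 features per split, so the selected `max_features` of 4 is considerably more restrictive than the default.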

In [None]:
gs.score(X_test_cv, y_test)

In [47]:
et = ExtraTreesClassifier(random_state=42)

et_params = {
    'n_estimators': [10, 20, 50, 100],
    'max_depth': [None, 3, 4, 5],
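    # 'auto' meant sqrt(n_features) for classifiers and has been removed from
    # newer scikit-learn releases, so 'sqrt' is used here.
    'max_features': ['sqrt', 100, 200, 300, 4, 5, 50]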
}

gs = GridSearchCV(et, param_grid=et_params, cv=5)
gs.fit(X_train_cv, y_train)
print(gs.best_score_)
gs.best_params_

0.7378048780487805


{'max_depth': None, 'max_features': 5, 'n_estimators': 100}

In [48]:
gs.score(X_test_cv, y_test)

0.741869918699187

## Support Vector Machines

In [54]:
from sklearn import svm

In [64]:
svc = svm.SVC()

svc_params = {
    'kernel': ['rbf','linear','poly','sigmoid'],
    'C': [1.0, 0.5, 2.0, 5.0]
}

gs = GridSearchCV(svc, param_grid=svc_params, cv=5)
gs.fit(X_train_cv, y_train)
print(gs.best_score_)

gs.score(X_train_cv, y_train), gs.score(X_test_cv, y_test)

0.7350948509485095


(0.8543360433604336, 0.7764227642276422)

In [65]:
gs.best_params_

{'C': 0.5, 'kernel': 'linear'}

In [68]:
gs.best_estimator_

SVC(C=0.5, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='linear', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

## Conclusions and Recommendations

Initially, I believed that classifying posts from these two subreddits would be fairly straightforward. I expected their topics to be distinct enough that several strong, easily identifiable indicators would yield high classification accuracy. After performing this analysis, however, it seems that the content of the two subreddits is more blended than I thought. Both subreddits deal with the same overarching topic of origins, yet their approaches and beliefs could hardly be more different. The vocabulary still overlaps, and based on the models and EDA there are a few reasons for this.
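One way to check for strong indicator words is to look at the largest coefficients of a linear model. The sketch below does this on an invented corpus in which "fossil" and "flood" are deliberately perfect indicators. The names `vec`, `clf`, and `terms` are illustrative, and the real data would not separate this cleanly.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus, invented for illustration: "fossil" marks class 1, "flood" class 0.
toy_titles = [
    "fossil whales", "fossil birds", "fossil finches", "fossil record",
    "flood geology", "flood ark", "flood kinds", "flood design",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

vec = CountVectorizer()
X_toy = vec.fit_transform(toy_titles)
clf = LogisticRegression().fit(X_toy, toy_labels)

# Order the terms by column index so they line up with clf.coef_[0].
terms = np.array(sorted(vec.vocabulary_, key=vec.vocabulary_.get))
order = np.argsort(clf.coef_[0])  # most negative first, most positive last

print("strongest class 0 terms:", terms[order[:3]])
print("strongest class 1 terms:", terms[order[-3:]])
```

In this notebook the equivalent inputs would be `lr.coef_[0]` and `X_train_cv.columns`. If no word stands out the way "fossil" and "flood" do here, that supports the observation that the two vocabularies are blended.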

Primarily, r/Evolution stays focused within its own domain, discussing topics directly related to evolution, whereas r/Creation focuses much more on comparison with its antagonist. r/Creation used many keywords at extremely high rates that one would expect to see primarily in r/Evolution. This blending made the posts more difficult to classify. Removing words such as 'evolution', 'dna', and 'study' (which one might assume to be strong identifiers of r/Evolution) from the features may lead to more distinct categories, improving performance and lowering the misclassification rate of the models.
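Dropping such shared words could be done at vectorization time by passing them as custom stop words. A minimal sketch, assuming the three words named above as the list (`shared_words` is a hypothetical name) and using invented titles:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Words named in the text as common to both subreddits; extend as needed.
shared_words = ["evolution", "dna", "study"]

# Toy titles, invented for illustration only.
toy_titles = [
    "new dna study on whale evolution",
    "does this dna study disprove evolution",
]

# Terms in stop_words are removed before the vocabulary is built.
filtered_cv = CountVectorizer(stop_words=shared_words)
filtered_cv.fit(toy_titles)

remaining = sorted(filtered_cv.vocabulary_)
print(remaining)  # -> ['disprove', 'does', 'new', 'on', 'this', 'whale']
```

Passing a list replaces any built-in stop word list rather than adding to it, so common English words such as "on" and "this" remain unless they are included in the list as well.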

I also believed that, due to the subject matter, tense might be a strong identifying feature for my models, especially for r/Evolution. However, the analysis does not suggest that this is the case. My next step would be to copy the 3rd and 4th notebooks and rerun everything with stemmed or lemmatized words instead. Because stemming and lemmatization collapse the tense variants of a word into a single token, comparing those results with the current ones would confirm (or refute) that tense is not an indicator for the models.

All of the models performed far better than the 50.5% baseline at classifying the titles. Of the models, Multinomial Naive Bayes performed best, with the highest test accuracy (78.0%) and the smallest gap between training and test accuracy. Other models, such as the SVM and logistic regression, had higher accuracy on the training set (about 85%) but slightly lower test accuracy (77.6%); they were more overfit and would thus be weaker choices. The tuned tree ensembles scored lower still, with ExtraTrees reaching 74.2% on the test set.
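The comparison can be made concrete by tabulating the train and test accuracies reported in the cells above and using their difference as a rough indicator of overfitting. Only models with both scores reported are included. The gap is a heuristic, not a formal test.

```python
# (train accuracy, test accuracy) as reported earlier in this notebook.
scores = {
    "Logistic Regression (L1)": (0.8346883468834688, 0.7723577235772358),
    "Logistic Regression (L2)": (0.8570460704607046, 0.7764227642276422),
    "Multinomial Naive Bayes": (0.809620596205962, 0.7804878048780488),
    "SVM (linear, C=0.5)": (0.8543360433604336, 0.7764227642276422),
}

# Train-minus-test accuracy: larger values suggest more overfitting.
gaps = {name: train - test for name, (train, test) in scores.items()}

for name, (train, test) in scores.items():
    print(f"{name:<26} train={train:.3f} test={test:.3f} gap={gaps[name]:.3f}")

best_test = max(scores, key=lambda name: scores[name][1])
smallest_gap = min(gaps, key=gaps.get)
print(best_test, "|", smallest_gap)
```

Both criteria point to Multinomial Naive Bayes, whose gap of about 3 percentage points is less than half that of the other three models.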

Given more time, I would like to follow up on the suggestions above, but more importantly, I would like to gather more data over time. This dataset was just shy of 2000 posts, all imported on the same day. The dataset is therefore merely a snapshot of the content from the preceding several days (depending on the activity level of each subreddit). Gathering posts at different points over several weeks would likely produce a more representative sample of each subreddit, improving the potential performance of the models. Additionally, the posts were imported from the "Hot" section of each subreddit. Because only posts with high rates of interaction show up in this section, there may be some level of confirmation bias that is not being accounted for, given that similar posts and topics are likely to be upvoted and discussed in the subreddit. Gathering from either "Top" or "New" may help to give more representative samples.