# Modeling - Subreddit

## Training a model to identify between the news and uplifting news subreddits

After cleaning the data and vectorizing it, I want to create and train a model with a high accuracy rate in order to use for predicting. I will want to test using multiple classification models (specifically naive bayes, logistic regression, and random forest). I will be using the two datasets - SVD and TF-IDF.

In [1]:
import pandas as pd
import nltk
import pickle
import numpy as np
import json
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.grid_search import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

%matplotlib inline



# Load in the cleaned posts 

In [2]:
with open('../pickle/svd_df.pkl', 'rb') as f:
    svd_df = pickle.load(f)
with open('../pickle/target_sub.pkl', 'rb') as f:
    target_sub = pickle.load(f)
# with open('../pickle/target_sent.pkl', 'rb') as f:
#     target_sent = pickle.load(f)
with open('../pickle/term_df.pkl', 'rb') as f:
    term_df = pickle.load(f)


SVD values are always non-negative, so I'll have to get the absolute values of the whole dataset; otherwise, I'll encounter an error down the line.

In [3]:
svd_df = abs(svd_df)

## Train Test Split

I will be testing three models to compare against each other and use the best scoring model for my predictor. The models are:

- Multinomial Naive Bayes
- Logistic Regression
- Random Forest

In [4]:
X_train, X_test, y_train, y_test = train_test_split(svd_df, target_sub, stratify=target_sub, random_state=42)
X_train_term, X_test_term, y_train_term, y_test_term = train_test_split(term_df, target_sub, stratify=target_sub, random_state=42)

## Multinomial Naive Bayes

Naive Bayes Classifiers require the assumption that all features are independent which may not make this model the best fit for sentences or contextual words. However, cleaning the text (stop words) and adding weights (TF-IDF) will support this assumption. 

This model is a fairly simple one, so the only parameter to be tuned for will be the alpha value.

In [5]:
from sklearn.naive_bayes import MultinomialNB

nb_gs = GridSearchCV(MultinomialNB(), {'alpha': (1.0, 0.95, .9)})
nb_term_gs = GridSearchCV(MultinomialNB(), {'alpha': (1.0, 0.95, .9)})

In [6]:
nb_gs.fit(X_train, y_train)
nb_term_gs.fit(X_train_term, y_train_term)

GridSearchCV(cv=None, error_score='raise',
       estimator=MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'alpha': (1.0, 0.95, 0.9)}, pre_dispatch='2*n_jobs',
       refit=True, scoring=None, verbose=0)

In [7]:
nb_gs.best_score_, nb_term_gs.best_score_

(0.574591351127664, 0.776536312849162)

In [8]:
nb_gs.best_params_, nb_term_gs.best_params_

({'alpha': 1.0}, {'alpha': 1.0})

In [9]:
nb_gs.best_estimator_, nb_term_gs.best_estimator_

(MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))

In [10]:
nb_gs.grid_scores_, nb_term_gs.grid_scores_

([mean: 0.57459, std: 0.00192, params: {'alpha': 1.0},
  mean: 0.57438, std: 0.00163, params: {'alpha': 0.95},
  mean: 0.57438, std: 0.00163, params: {'alpha': 0.9}],
 [mean: 0.77654, std: 0.00383, params: {'alpha': 1.0},
  mean: 0.77612, std: 0.00460, params: {'alpha': 0.95},
  mean: 0.77633, std: 0.00482, params: {'alpha': 0.9}])

In [11]:
nb_gs.score(X_train, y_train), nb_gs.score(X_test, y_test)

(0.5774881026277674, 0.5698324022346368)

In [12]:
nb_term_gs.score(X_train_term, y_train_term), nb_term_gs.score(X_test_term, y_test_term)

(0.8713014690668321, 0.7827436374922409)

In [47]:
nb_gs.predict_proba(X_test)

array([[0.5890489 , 0.4109511 ],
       [0.54587878, 0.45412122],
       [0.56220715, 0.43779285],
       ...,
       [0.53131735, 0.46868265],
       [0.57786938, 0.42213062],
       [0.46068086, 0.53931914]])

## Evaluation 

The model scored terribly using the SVD data, and did a bit better using the TF-IDF data. Both models are overfitting based on the training data, though. The classification decisions being made by the model also appear to be relatively vague, with the probabilities being close to 50/50 as opposed to clear-cut. This is likely due to the large overlap in words used between the two subreddits.

Next step is to try Logistic Regression to see if it fares any better.

## Logistic Regression

Logistic Regression is useful for binary target datasets, as it models the probability of the default class (in this case, for the news subreddit). 

I will be tuning the model based on penalty, the C value, and the tolerance. I want to test the model with both L1 and L2 regularization, as well as the regularization strength (represented as C, the inverse).

In [5]:
lr_gs = GridSearchCV(LogisticRegression(), {'penalty': ['l1', 'l2'],
                                            'C': np.logspace(.01, 1, 15), 
                                            'tol': (0.0001, 0.001, 0.01)})
lr_term_gs = GridSearchCV(LogisticRegression(), {'penalty': ['l1', 'l2'],
                                                 'C': np.logspace(.01, 1, 15), 
                                                 'tol': (0.0001, 0.001, 0.01)})                                

In [6]:
lr_gs.fit(X_train, y_train), 
lr_term_gs.fit(X_train_term, y_train_term)

GridSearchCV(cv=None, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'penalty': ['l1', 'l2'], 'C': array([ 1.02329,  1.20424,  1.41719,  1.6678 ,  1.96271,  2.30978,
        2.71823,  3.1989 ,  3.76456,  4.43025,  5.21366,  6.1356 ,
        7.22057,  8.49739, 10.     ]), 'tol': (0.0001, 0.001, 0.01)},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

In [7]:
lr_gs.best_score_, lr_term_gs.best_score_

(0.7001862197392924, 0.776743223670598)

In [8]:
lr_gs.best_params_, lr_term_gs.best_params_

({'C': 3.198895109691398, 'penalty': 'l2', 'tol': 0.0001},
 {'C': 2.3097843187477425, 'penalty': 'l2', 'tol': 0.0001})

In [9]:
lr_gs.best_estimator_, lr_term_gs.best_estimator_

(LogisticRegression(C=3.198895109691398, class_weight=None, dual=False,
           fit_intercept=True, intercept_scaling=1, max_iter=100,
           multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
           solver='liblinear', tol=0.0001, verbose=0, warm_start=False),
 LogisticRegression(C=2.3097843187477425, class_weight=None, dual=False,
           fit_intercept=True, intercept_scaling=1, max_iter=100,
           multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
           solver='liblinear', tol=0.0001, verbose=0, warm_start=False))

In [10]:
lr_grid = lr_gs.grid_scores_
lr_term_grid = lr_term_gs.grid_scores_

lr_grid[:5], lr_term_grid[:5]

([mean: 0.66418, std: 0.01098, params: {'C': 1.023292992280754, 'penalty': 'l1', 'tol': 0.0001},
  mean: 0.66418, std: 0.00967, params: {'C': 1.023292992280754, 'penalty': 'l1', 'tol': 0.001},
  mean: 0.66605, std: 0.00830, params: {'C': 1.023292992280754, 'penalty': 'l1', 'tol': 0.01},
  mean: 0.68384, std: 0.00747, params: {'C': 1.023292992280754, 'penalty': 'l2', 'tol': 0.0001},
  mean: 0.68384, std: 0.00747, params: {'C': 1.023292992280754, 'penalty': 'l2', 'tol': 0.001}],
 [mean: 0.73495, std: 0.00483, params: {'C': 1.023292992280754, 'penalty': 'l1', 'tol': 0.0001},
  mean: 0.73474, std: 0.00468, params: {'C': 1.023292992280754, 'penalty': 'l1', 'tol': 0.001},
  mean: 0.73433, std: 0.00439, params: {'C': 1.023292992280754, 'penalty': 'l1', 'tol': 0.01},
  mean: 0.77447, std: 0.00688, params: {'C': 1.023292992280754, 'penalty': 'l2', 'tol': 0.0001},
  mean: 0.77447, std: 0.00688, params: {'C': 1.023292992280754, 'penalty': 'l2', 'tol': 0.001}])

In [11]:
lr_gs.score(X_train, y_train), lr_gs.score(X_test, y_test)

(0.7808814400993171, 0.675356921166977)

In [12]:
lr_term_gs.score(X_train_term, y_train_term), lr_term_gs.score(X_test_term, y_test_term)

(0.904407200496586, 0.7846058348851644)

In [13]:
lr_term_gs.predict_proba(X_test_term)

array([[0.67229373, 0.32770627],
       [0.75040028, 0.24959972],
       [0.37863694, 0.62136306],
       ...,
       [0.53893128, 0.46106872],
       [0.614055  , 0.385945  ],
       [0.2919485 , 0.7080515 ]])

In [14]:
y_pred = lr_term_gs.predict(X_test_term)

In [15]:
from sklearn.metrics import confusion_matrix, classification_report

print(confusion_matrix(y_test, y_pred))
print('\n')
print(classification_report(y_test, y_pred))

[[736 139]
 [208 528]]


             precision    recall  f1-score   support

          0       0.78      0.84      0.81       875
          1       0.79      0.72      0.75       736

avg / total       0.79      0.78      0.78      1611



In [16]:
from sklearn.metrics import accuracy_score, f1_score

print("Accuracy: {:.2f}%".format(accuracy_score(y_test, y_pred) * 100))
print("\nF1 Score: {:.2f}".format(f1_score(y_test, y_pred) * 100))
print("\nCOnfusion Matrix:\n", confusion_matrix(y_test, y_pred))

Accuracy: 78.46%

F1 Score: 75.27

COnfusion Matrix:
 [[736 139]
 [208 528]]


### Evaluation:

Again, both models are very prone to overfitting. However, both models scored marginally better than the Multinomial model. The model works best using ridge regularization and a lower tolerance.

I will want to test this data in an ensemble method next - Random Forest Classifier.

## Random Forest

Random Forest is a tree-based model, so it will be making a series of splits/decisions on features "behind the scenes" in order to output the best score. This model will vote on the best model after creating trees using a random sample of features and random sample of variables. 

This model has a lot of hyperparameters for tuning; I will be using the following:

- n_estimators - decide how many trees should be tested for in the forest. 
- max_depth - having the default max depth is likely to contribute to overfitting.
- criterion - decide whether the splits should be made based on gini or entropy values.

In [56]:
rfc_gs = GridSearchCV(RandomForestClassifier(random_state=42), {'n_estimators': [150, 170, 190],
                                                                'max_depth': (10, 50, 100),
                                                                'criterion': ['gini', 'entropy']})
rfc_term_gs = GridSearchCV(RandomForestClassifier(random_state=42), {'n_estimators': [150, 170, 190],
                                                                     'max_depth': (10, 50, 100),
                                                                     'criterion': ['gini', 'entropy']})

In [57]:
rfc_gs.fit(X_train, y_train)
rfc_term_gs.fit(X_train_term, y_train_term)

GridSearchCV(cv=None, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=42, verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_estimators': [150, 170, 190], 'max_depth': (10, 50, 100), 'criterion': ['gini', 'entropy']},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

In [58]:
rfc_gs.best_score_, rfc_term_gs.best_score_

(0.6654252017380509, 0.7392923649906891)

In [59]:
rfc_gs.best_params_, rfc_term_gs.best_params_

({'criterion': 'gini', 'max_depth': 50, 'n_estimators': 190},
 {'criterion': 'gini', 'max_depth': 100, 'n_estimators': 150})

In [60]:
rfc_gs.best_estimator_, rfc_term_gs.best_estimator_

(RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
             max_depth=50, max_features='auto', max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=190, n_jobs=1,
             oob_score=False, random_state=42, verbose=0, warm_start=False),
 RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
             max_depth=100, max_features='auto', max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=150, n_jobs=1,
             oob_score=False, random_state=42, verbose=0, warm_start=False))

In [61]:
rfc_gs.grid_scores_, rfc_term_gs.grid_scores_

([mean: 0.64577, std: 0.01307, params: {'criterion': 'gini', 'max_depth': 10, 'n_estimators': 150},
  mean: 0.64701, std: 0.01106, params: {'criterion': 'gini', 'max_depth': 10, 'n_estimators': 170},
  mean: 0.65136, std: 0.01064, params: {'criterion': 'gini', 'max_depth': 10, 'n_estimators': 190},
  mean: 0.66211, std: 0.00768, params: {'criterion': 'gini', 'max_depth': 50, 'n_estimators': 150},
  mean: 0.66274, std: 0.00413, params: {'criterion': 'gini', 'max_depth': 50, 'n_estimators': 170},
  mean: 0.66543, std: 0.00134, params: {'criterion': 'gini', 'max_depth': 50, 'n_estimators': 190},
  mean: 0.66211, std: 0.00768, params: {'criterion': 'gini', 'max_depth': 100, 'n_estimators': 150},
  mean: 0.66274, std: 0.00413, params: {'criterion': 'gini', 'max_depth': 100, 'n_estimators': 170},
  mean: 0.66543, std: 0.00134, params: {'criterion': 'gini', 'max_depth': 100, 'n_estimators': 190},
  mean: 0.65508, std: 0.01322, params: {'criterion': 'entropy', 'max_depth': 10, 'n_estimators': 

In [62]:
rfc_gs.score(X_train, y_train), rfc_gs.score(X_test, y_test)

(0.9952410511069729, 0.6641837368094351)

In [63]:
rfc_term_gs.score(X_train_term, y_train_term), rfc_term_gs.score(X_test_term, y_test_term)

(0.8698530933167805, 0.7523277467411545)

### Evaluation:

As predicted, this model grossly overfits. I may be able to achieve a higher score with further tuning (i.e. higher tree count), but that the time for fitting and testing will only get longer and longer. 

As it stands, the current best model is Logistic Regression using the TF-IDF'd dataset. This outcome makes sense considering the lack of cleanup on the feature selection (which the Logistic Regression model has included as a paramter), and the lack of context to the words; though this may be helped by including a higher n-gram. 

# Testing the model with new data

I will now want to test the model against new data to see if it can predict correctly.

In [33]:
with open('../pickle/tfidf.pkl', 'rb') as f:
    tfidf = pickle.load(f)

In [34]:
new_news = ["OxyContin creator being sued for \'significant role in causing opioid epidemic'",
            "Serena Williams fined $17,000 for U.S. Open violations",
            "Girl wins homecoming queen, then goes on to kick extra point in overtime to win football game.",
            "Hundreds are still trapped from Florence\'s flooding, and \'the worst is still yet to come",
            "Layoffs hit, prices lag as tariff pinches lobster industry"]

new_upnews = ["The Sniping Scientists Whose Work Saved Millions of Lives",
            "NICU volunteer donates a million dollars to local baby unit",
            "He spent 27 years wrongly convicted of murder. He wants to spend the rest encouraging inmates to read",
            "She heard their cries and couldn’t walk away, so she helped save 18 dogs in Kinston",
            "San Antonio police credit four good Samaritans with pulling a driver out of his burning truck on U.S. 281 and he walked away from the crash unscathed."]

In [35]:
term_news = tfidf.transform(new_news)
term_upnews = tfidf.transform(new_upnews)

In [36]:
lr_term_gs.predict_proba(term_news)

array([[0.64352352, 0.35647648],
       [0.82392992, 0.17607008],
       [0.49874612, 0.50125388],
       [0.55841746, 0.44158254],
       [0.78503572, 0.21496428]])

In [37]:
lr_term_gs.predict_proba(term_upnews)

array([[0.05706687, 0.94293313],
       [0.06150192, 0.93849808],
       [0.59157234, 0.40842766],
       [0.08219673, 0.91780327],
       [0.43900884, 0.56099116]])

In [38]:
lr_term_gs.predict(term_news)

array([0, 0, 1, 0, 0])

In [39]:
lr_term_gs.predict(term_upnews)

array([1, 1, 0, 1, 1])

### Evaluation

The model was able to predict with 80% accuracy, which is better than I was anticipating considering the lower score and the performance of the sentiment model. This outcome also surprises me considering that the news subreddit was not as dominantly 'negative' in sentiment as I had expected, which is where I would have figured the main differences in words for subreddit would have come from. Though, I do still believe that a significant factor in this model being able to predict with some level of accuracy does come from the bias for negative news in r/news over r/upliftingnews.

In terms of improvements to this model, they are mostly the same as with the sentiment model - better cleanup and vectorizing in the previous step, and utilizing more processing time/power to optimize the random forest classifier.

Overall, I was not able to prove my hypothesis that the news subreddit had a significantly more negative bias, lending to the demoralization of the reading public. With this exploration, however, I was still able to create models with ~80% accuracy in identifying both sentiment and subreddit - an interesting next step would be to try and consolidate these models (this would have been easier had news been more negatively biased and upliftingnews positive).