# Modeling - Sentiment

## Training a model to distinguish between the news and uplifting news subreddits

After cleaning and vectorizing the data, I want to train a model accurate enough to use for prediction. I will test several classification models (specifically Naive Bayes, Logistic Regression, and Random Forest) on two versions of the data: the SVD-reduced dataset and the TF-IDF dataset.

In [3]:
import pandas as pd
import nltk
import pickle
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
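# sklearn.grid_search was removed in scikit-learn 0.20; on newer versions use
# `from sklearn.model_selection import GridSearchCV` (and `cv_results_` instead of `grid_scores_`).
from sklearn.grid_search import GridSearchCV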
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier

%matplotlib inline



## Load in the cleaned posts 

In [4]:
with open('../pickle/svd_df.pkl', 'rb') as f:
    svd_df = pickle.load(f)
# with open('../pickle/target_sub.pkl', 'rb') as f:
#     target_sub = pickle.load(f)
with open('../pickle/target_sent.pkl', 'rb') as f:
    target_sent = pickle.load(f)
with open('../pickle/term_df.pkl', 'rb') as f:
    term_df = pickle.load(f)


The SVD-reduced features can take negative values, but Multinomial Naive Bayes only accepts non-negative input. I'll take the absolute values of the whole dataset; otherwise, I'll encounter an error down the line.

In [5]:
svd_df = abs(svd_df)
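
The constraint comes from `MultinomialNB`, not from SVD itself. A minimal sketch on made-up count data (the toy matrix and the `MinMaxScaler` alternative are assumptions, not part of the original pipeline) shows that `TruncatedSVD` output contains negative values, that `MultinomialNB` rejects them, and two ways to make the features non-negative:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler

# Toy term-count matrix standing in for the real vectorized posts (assumption).
rng = np.random.RandomState(42)
counts = rng.poisson(1.0, size=(40, 12)).astype(float)
labels = rng.choice([-1, 1], size=40)

reduced = TruncatedSVD(n_components=4, random_state=42).fit_transform(counts)
has_negatives = bool((reduced < 0).any())

# MultinomialNB refuses negative feature values.
try:
    MultinomialNB().fit(reduced, labels)
    raised = False
except ValueError:
    raised = True

# Option 1 (used in this notebook): absolute values.
nb_abs = MultinomialNB().fit(np.abs(reduced), labels)

# Option 2 (alternative): rescale each component to [0, 1], which preserves ordering.
scaled = MinMaxScaler().fit_transform(reduced)
nb_scaled = MultinomialNB().fit(scaled, labels)
```

Taking absolute values discards the sign of each component, so the rescaling option may be worth comparing if the SVD scores stay low.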

## Train Test Split

I will be testing three models against each other and using the best-scoring one as my predictor. The models are:

- Multinomial Naive Bayes
- Logistic Regression
- Random Forest

In [6]:
X_train, X_test, y_train, y_test = train_test_split(svd_df, target_sent, stratify=target_sent, random_state=42)
X_train_term, X_test_term, y_train_term, y_test_term = train_test_split(term_df, target_sent, stratify=target_sent, random_state=42)
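
Because both calls use the same target, the same `stratify` argument, and the same `random_state`, they should select the same rows, so `y_test` and `y_test_term` hold the same labels. A small sketch with hypothetical stand-in arrays (not the real `svd_df` and `term_df`) illustrates this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for svd_df, term_df and target_sent (assumptions).
rng = np.random.RandomState(0)
features_a = rng.rand(100, 5)
features_b = rng.rand(100, 20)
target = np.array([1] * 55 + [-1] * 45)

_, _, y_train_a, y_test_a = train_test_split(
    features_a, target, stratify=target, random_state=42)
_, _, y_train_b, y_test_b = train_test_split(
    features_b, target, stratify=target, random_state=42)

# Same target and same seed: both splits pick the same rows.
same_rows = np.array_equal(y_test_a, y_test_b)

# Stratification keeps the class balance close to the full dataset's 55/45.
test_positive_share = (y_test_a == 1).mean()
```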

## Multinomial Naive Bayes

Naive Bayes classifiers assume that all features are independent of one another given the class, which may not make this model the best fit for sentences or context-dependent words. However, cleaning the text (removing stop words) and adding weights (TF-IDF) may make this assumption less harmful.

This model is a fairly simple one, so the only hyperparameter I will tune is alpha.

In [5]:
nb_gs = GridSearchCV(MultinomialNB(), {'alpha': (1.0, 0.95, .9)})
nb_term_gs = GridSearchCV(MultinomialNB(), {'alpha': (1.0, 0.95, .9)})

In [6]:
nb_gs.fit(X_train, y_train)
nb_term_gs.fit(X_train_term, y_train_term)

GridSearchCV(cv=None, error_score='raise',
       estimator=MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'alpha': (1.0, 0.95, 0.9)}, pre_dispatch='2*n_jobs',
       refit=True, scoring=None, verbose=0)

In [7]:
nb_gs.best_score_, nb_term_gs.best_score_

(0.563831988412994, 0.8547486033519553)

In [8]:
nb_gs.best_params_, nb_term_gs.best_params_

({'alpha': 0.9}, {'alpha': 1.0})

In [9]:
nb_gs.best_estimator_, nb_term_gs.best_estimator_

(MultinomialNB(alpha=0.9, class_prior=None, fit_prior=True),
 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))

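It is interesting that the SVD model scored best with an alpha of 0.9, while the TF-IDF model preferred 1.0. Alpha is an additive smoothing term: it is added to every term count so that a term never seen in a class does not get a probability of zero. Perhaps the SVD features can be classified with more certainty and so need less smoothing, although the grid scores below differ by less than one standard deviation across the three alpha values, so the choice may simply be noise.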
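
To make the role of alpha concrete, here is a small sketch on toy counts (the matrix is an assumption for illustration). It recomputes the smoothed per-class term probabilities by hand and compares them with what `MultinomialNB` stores in `feature_log_prob_`:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy counts: 3 documents x 3 terms (assumption, for illustration only).
X = np.array([[2, 0, 1],
              [3, 0, 0],
              [0, 4, 1]])
y = np.array([1, 1, -1])
alpha = 0.9

nb = MultinomialNB(alpha=alpha).fit(X, y)

# Smoothed estimate per class:
# (count of term in class + alpha) / (total count in class + alpha * n_terms)
manual = []
for cls in nb.classes_:
    term_counts = X[y == cls].sum(axis=0)
    manual.append((term_counts + alpha) / (term_counts.sum() + alpha * X.shape[1]))
manual = np.array(manual)

fitted = np.exp(nb.feature_log_prob_)
print(np.round(fitted, 3))
```

Even the term that never appears in a class ends up with a small non-zero probability, which is the point of the smoothing.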

In [10]:
nb_gs.grid_scores_, nb_term_gs.grid_scores_

([mean: 0.56342, std: 0.00260, params: {'alpha': 1.0},
  mean: 0.56363, std: 0.00232, params: {'alpha': 0.95},
  mean: 0.56383, std: 0.00240, params: {'alpha': 0.9}],
 [mean: 0.85475, std: 0.00415, params: {'alpha': 1.0},
  mean: 0.85454, std: 0.00437, params: {'alpha': 0.95},
  mean: 0.85371, std: 0.00558, params: {'alpha': 0.9}])
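
On newer scikit-learn versions `grid_scores_` no longer exists; the same information is available in `cv_results_`. A hedged sketch of the equivalent lookup, using synthetic non-negative data in place of the real features (an assumption):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Synthetic non-negative data standing in for the TF-IDF matrix (assumption).
X, y = make_classification(n_samples=200, n_features=10, random_state=42)
X = abs(X)

search = GridSearchCV(MultinomialNB(), {'alpha': [1.0, 0.95, 0.9]}, cv=3)
search.fit(X, y)

# cv_results_ is a dict of arrays; a DataFrame makes it easier to read.
results = pd.DataFrame(search.cv_results_)[
    ['param_alpha', 'mean_test_score', 'std_test_score', 'rank_test_score']]
print(results)
```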

In [11]:
nb_gs.score(X_train, y_train), nb_gs.score(X_test, y_test)

(0.5663149182702255, 0.5592799503414029)

In [12]:
nb_term_gs.score(X_train_term, y_train_term), nb_term_gs.score(X_test_term, y_test_term)

(0.9265466583902338, 0.8497827436374923)

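### Evaluation:

On the SVD data the model scored poorly: roughly 56% accuracy on both the training and test sets, not far above chance for a two-class problem, which points to underfitting rather than overfitting. It did significantly better on the TF-IDF data, but there the gap between training (0.93) and test (0.85) accuracy suggests some overfitting. The next step is to try Logistic Regression to see if it fares any better.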

In [13]:
nb_gs.predict_proba(X_test)

array([[0.41853322, 0.58146678],
       [0.49963369, 0.50036631],
       [0.47300985, 0.52699015],
       ...,
       [0.40231758, 0.59768242],
       [0.41484672, 0.58515328],
       [0.32321828, 0.67678172]])

## Logistic Regression

Logistic Regression is well suited to a binary target, as it models the probability of the positive class (in this case, positive sentiment). 

I will be tuning the model on the penalty, the C value, and the tolerance. I want to test both L1 and L2 regularization, as well as the regularization strength (C is its inverse).

In [7]:
lr_gs = GridSearchCV(LogisticRegression(), {'penalty': ['l1', 'l2'],
                                            'C': np.logspace(.01, 1, 15), 
                                            'tol': (0.0001, 0.001, 0.01)})
lr_term_gs = GridSearchCV(LogisticRegression(), {'penalty': ['l1', 'l2'],
                                                 'C': np.logspace(.01, 1, 15), 
                                                 'tol': (0.0001, 0.001, 0.01)})                                
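
Note that `np.logspace` takes exponents, so this grid only covers C from roughly 1.02 to 10. The sketch below shows that range next to a hypothetical wider grid that also tries stronger regularization (C below 1). In recent scikit-learn versions the default solver is `lbfgs`, which does not support the L1 penalty, so the sketch sets `solver='liblinear'` explicitly (the solver reported in this notebook's output):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# The grid used above: exponents 0.01 .. 1, i.e. C from 10**0.01 to 10**1.
c_grid = np.logspace(.01, 1, 15)
print(c_grid.min(), c_grid.max())

# A wider grid (hypothetical) that also includes stronger regularization, C < 1.
wide_c_grid = np.logspace(-2, 1, 15)
print(wide_c_grid.min(), wide_c_grid.max())

# liblinear supports both 'l1' and 'l2' penalties.
estimator = LogisticRegression(solver='liblinear')
param_grid = {'penalty': ['l1', 'l2'],
              'C': wide_c_grid,
              'tol': (0.0001, 0.001, 0.01)}
```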

In [8]:
lr_gs.fit(X_train, y_train)
lr_term_gs.fit(X_train_term, y_train_term)

GridSearchCV(cv=None, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'penalty': ['l1', 'l2'], 'C': array([ 1.02329,  1.20424,  1.41719,  1.6678 ,  1.96271,  2.30978,
        2.71823,  3.1989 ,  3.76456,  4.43025,  5.21366,  6.1356 ,
        7.22057,  8.49739, 10.     ]), 'tol': (0.0001, 0.001, 0.01)},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

In [9]:
lr_gs.best_score_, lr_term_gs.best_score_

(0.7330850403476101, 0.8704738257810883)

In [10]:
lr_gs.best_params_, lr_term_gs.best_params_

({'C': 5.21366181472805, 'penalty': 'l2', 'tol': 0.0001},
 {'C': 4.430253439574549, 'penalty': 'l2', 'tol': 0.01})

In [11]:
lr_gs.best_estimator_, lr_term_gs.best_estimator_

(LogisticRegression(C=5.21366181472805, class_weight=None, dual=False,
           fit_intercept=True, intercept_scaling=1, max_iter=100,
           multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
           solver='liblinear', tol=0.0001, verbose=0, warm_start=False),
 LogisticRegression(C=4.430253439574549, class_weight=None, dual=False,
           fit_intercept=True, intercept_scaling=1, max_iter=100,
           multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
           solver='liblinear', tol=0.01, verbose=0, warm_start=False))

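Both models selected L2 (ridge) regularization. The selected C is slightly lower on the TF-IDF data (about 4.43 versus 5.21); because C is the inverse of the regularization strength, that corresponds to slightly stronger regularization. The tol value (0.01 on the TF-IDF data) is only the solver's stopping tolerance and does not change the amount of regularization.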
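
The difference between C and tol can be illustrated with a short sketch on synthetic data (an assumption; the helper `coef_norm` is hypothetical). Lowering C shrinks the coefficients, whereas tol only controls when the solver stops iterating:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for the real features (assumption).
X, y = make_classification(n_samples=300, n_features=20, random_state=42)

def coef_norm(C, tol=1e-4):
    """Fit an L2-penalized model and return the size of its coefficient vector."""
    model = LogisticRegression(C=C, tol=tol, penalty='l2', solver='liblinear')
    return np.linalg.norm(model.fit(X, y).coef_)

# Smaller C = stronger L2 penalty = smaller coefficients.
strong = coef_norm(C=0.01)
weak = coef_norm(C=100)

# tol changes how precisely the optimum is approximated, not the penalty itself.
loose = coef_norm(C=1.0, tol=1e-2)
tight = coef_norm(C=1.0, tol=1e-6)
print(strong, weak, loose, tight)
```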

In [12]:
lr_grid = lr_gs.grid_scores_
lr_term_grid = lr_term_gs.grid_scores_

lr_grid[:5], lr_term_grid[:5]

([mean: 0.70764, std: 0.01351, params: {'C': 1.023292992280754, 'penalty': 'l1', 'tol': 0.0001},
  mean: 0.70805, std: 0.01390, params: {'C': 1.023292992280754, 'penalty': 'l1', 'tol': 0.001},
  mean: 0.70908, std: 0.01258, params: {'C': 1.023292992280754, 'penalty': 'l1', 'tol': 0.01},
  mean: 0.72357, std: 0.00600, params: {'C': 1.023292992280754, 'penalty': 'l2', 'tol': 0.0001},
  mean: 0.72357, std: 0.00600, params: {'C': 1.023292992280754, 'penalty': 'l2', 'tol': 0.001}],
 [mean: 0.81730, std: 0.00844, params: {'C': 1.023292992280754, 'penalty': 'l1', 'tol': 0.0001},
  mean: 0.81730, std: 0.00844, params: {'C': 1.023292992280754, 'penalty': 'l1', 'tol': 0.001},
  mean: 0.81730, std: 0.00844, params: {'C': 1.023292992280754, 'penalty': 'l1', 'tol': 0.01},
  mean: 0.86427, std: 0.00600, params: {'C': 1.023292992280754, 'penalty': 'l2', 'tol': 0.0001},
  mean: 0.86427, std: 0.00600, params: {'C': 1.023292992280754, 'penalty': 'l2', 'tol': 0.001}])

In [13]:
lr_gs.score(X_train, y_train), lr_gs.score(X_test, y_test)

(0.8268156424581006, 0.7461204220980757)

In [14]:
lr_term_gs.score(X_train_term, y_train_term), lr_term_gs.score(X_test_term, y_test_term)

(0.9770328988206083, 0.8758535071384234)

In [17]:
y_pred = lr_term_gs.predict(X_test_term)

In [18]:
from sklearn.metrics import confusion_matrix, classification_report

# y_pred comes from the TF-IDF model, so compare against the TF-IDF split's labels
print(confusion_matrix(y_test_term, y_pred))
print('\n')
print(classification_report(y_test_term, y_pred))

[[618 104]
 [ 96 793]]


             precision    recall  f1-score   support

         -1       0.87      0.86      0.86       722
          1       0.88      0.89      0.89       889

avg / total       0.88      0.88      0.88      1611



In [20]:
from sklearn.metrics import accuracy_score, f1_score

print("Accuracy: {:.2f}%".format(accuracy_score(y_test_term, y_pred) * 100))
print("\nF1 Score: {:.2f}".format(f1_score(y_test_term, y_pred) * 100))
print("\nConfusion Matrix:\n", confusion_matrix(y_test_term, y_pred))

Accuracy: 87.59%

F1 Score: 88.80

Confusion Matrix:
 [[618 104]
 [ 96 793]]
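
As a sanity check, the reported metrics can be recomputed by hand from the confusion matrix. This sketch assumes scikit-learn's convention that rows are true classes and columns are predicted classes, in the order (-1, 1):

```python
import numpy as np

# Confusion matrix reported above: rows = true class (-1, 1), columns = predicted class.
cm = np.array([[618, 104],
               [ 96, 793]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of the posts predicted positive, how many were positive
recall = tp / (tp + fn)      # of the truly positive posts, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```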


### Evaluation:

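Again, both models appear prone to overfitting: training accuracy is roughly 8 to 10 points higher than test accuracy in each case. However, both scored better than their Multinomial Naive Bayes counterparts, with the largest gain on the SVD data (test accuracy of about 0.75 versus 0.56).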

I will want to test this data in an ensemble method next - Random Forest Classifier.

## Random Forest

Random Forest is an ensemble of decision trees. Each tree is trained on a bootstrap sample of the rows and considers only a random subset of the features at each split, making a series of splits "behind the scenes". The forest then combines the predictions of all its trees to classify each observation.

This model has a lot of hyperparameters for tuning; I will be using the following:

- n_estimators - how many trees to build in the forest.
- max_depth - how deep each tree may grow; the default (no limit) is likely to contribute to overfitting.
- criterion - whether split quality is measured with gini impurity or entropy.
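
To show how the trees are combined, the sketch below fits a small forest on synthetic data (an assumption) and averages the per-tree class probabilities by hand. In scikit-learn the forest's prediction is based on this average rather than on a count of hard votes:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the real features (assumption).
X, y = make_classification(n_samples=300, n_features=20, random_state=42)

forest = RandomForestClassifier(n_estimators=25, max_depth=5,
                                criterion='entropy', random_state=42).fit(X, y)

# Average the class probabilities of the individual trees by hand...
tree_probas = np.mean([tree.predict_proba(X) for tree in forest.estimators_], axis=0)

# ...and compare with the probabilities the forest itself reports.
matches = np.allclose(tree_probas, forest.predict_proba(X))
print(len(forest.estimators_), matches)
```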

In [57]:
rfc_gs = GridSearchCV(RandomForestClassifier(random_state=42), {'n_estimators': [10, 50, 100], 
                                                                'max_depth': (None, 2, 5), 
                                                                'criterion': ['gini', 'entropy']})
rfc_term_gs = GridSearchCV(RandomForestClassifier(random_state=42), {'n_estimators': [10, 50, 100], 
                                                                     'max_depth': (None, 2, 5), 
                                                                     'criterion': ['gini', 'entropy']})

In [58]:
rfc_gs.fit(X_train, y_train)
rfc_term_gs.fit(X_train_term, y_train_term)

GridSearchCV(cv=None, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=3, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=42, verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_estimators': [10, 50, 100], 'max_depth': (None, 2, 5), 'criterion': ['gini', 'entropy']},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

In [59]:
rfc_gs.best_score_, rfc_term_gs.best_score_

(0.6828057107386716, 0.8384026484585144)

In [60]:
rfc_gs.best_params_, rfc_term_gs.best_params_

({'criterion': 'entropy', 'max_depth': None, 'n_estimators': 100},
 {'criterion': 'entropy', 'max_depth': None, 'n_estimators': 100})

In [61]:
rfc_gs.best_estimator_, rfc_term_gs.best_estimator_

(RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
             max_depth=None, max_features='auto', max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=3, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
             oob_score=False, random_state=42, verbose=0, warm_start=False),
 RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
             max_depth=None, max_features='auto', max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=3, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
             oob_score=False, random_state=42, verbose=0, warm_start=False))

Unsurprisingly, the best parameters were no max depth limit and the largest number of estimators - this model will likely be overfit.

In [62]:
rfc_grid = rfc_gs.grid_scores_
rfc_term_grid = rfc_term_gs.grid_scores_

rfc_grid[:5], rfc_term_grid[:5]

([mean: 0.60687, std: 0.01118, params: {'criterion': 'gini', 'max_depth': None, 'n_estimators': 10},
  mean: 0.65777, std: 0.00345, params: {'criterion': 'gini', 'max_depth': None, 'n_estimators': 50},
  mean: 0.67577, std: 0.00833, params: {'criterion': 'gini', 'max_depth': None, 'n_estimators': 100},
  mean: 0.57294, std: 0.00686, params: {'criterion': 'gini', 'max_depth': 2, 'n_estimators': 10},
  mean: 0.55390, std: 0.00155, params: {'criterion': 'gini', 'max_depth': 2, 'n_estimators': 50}],
 [mean: 0.81233, std: 0.00310, params: {'criterion': 'gini', 'max_depth': None, 'n_estimators': 10},
  mean: 0.82351, std: 0.00205, params: {'criterion': 'gini', 'max_depth': None, 'n_estimators': 50},
  mean: 0.82516, std: 0.00326, params: {'criterion': 'gini', 'max_depth': None, 'n_estimators': 100},
  mean: 0.58783, std: 0.00442, params: {'criterion': 'gini', 'max_depth': 2, 'n_estimators': 10},
  mean: 0.55514, std: 0.00205, params: {'criterion': 'gini', 'max_depth': 2, 'n_estimators': 50}])

In [63]:
rfc_gs.score(X_train, y_train), rfc_gs.score(X_test, y_test)

(0.9977239809642044, 0.6778398510242085)

In [64]:
rfc_term_gs.score(X_train_term, y_train_term), rfc_term_gs.score(X_test_term, y_test_term)

(0.9393751293192634, 0.8336436995654872)

### Evaluation:

As predicted, this model grossly overfits, particularly on the SVD data (0.998 training accuracy versus 0.678 on the test set). I may be able to achieve a higher score with further tuning (e.g. a higher tree count), but the time for fitting and testing will only get longer and longer. 

As it stands, the current best model is Logistic Regression using the TF-IDF'd dataset. This outcome makes sense considering the lack of cleanup in the feature selection (which Logistic Regression partly compensates for through its regularization parameter) and the lack of context around the words, though the latter may be helped by including higher-order n-grams. 
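
The comparison can be summarized in one table. The sketch below copies the train and test accuracies from the cell outputs above (rounded to four decimals) and adds the train-test gap as a rough overfitting indicator:

```python
import pandas as pd

# Train/test accuracies copied from the cell outputs above.
scores = pd.DataFrame(
    [('Multinomial NB', 'SVD', 0.5663, 0.5593),
     ('Multinomial NB', 'TF-IDF', 0.9265, 0.8498),
     ('Logistic Regression', 'SVD', 0.8268, 0.7461),
     ('Logistic Regression', 'TF-IDF', 0.9770, 0.8759),
     ('Random Forest', 'SVD', 0.9977, 0.6778),
     ('Random Forest', 'TF-IDF', 0.9394, 0.8336)],
    columns=['model', 'features', 'train', 'test'])

# The train-test gap is a rough indicator of overfitting.
scores['gap'] = scores['train'] - scores['test']
print(scores.sort_values('test', ascending=False).to_string(index=False))
```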

# Testing the model with new data

I now want to test the model against new data to see if it can predict correctly.

In [69]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

In [88]:
with open('../pickle/tfidf.pkl', 'rb') as f:
    tfidf = pickle.load(f)

In [105]:
new_news = ["OxyContin creator being sued for 'significant role in causing opioid epidemic'",
            "Serena Williams fined $17,000 for U.S. Open violations",
            "Girl wins homecoming queen, then goes on to kick extra point in overtime to win football game.",
            "Hundreds are still trapped from Florence's flooding, and 'the worst is still yet to come",
            "Layoffs hit, prices lag as tariff pinches lobster industry"]

new_upnews = ["The Sniping Scientists Whose Work Saved Millions of Lives",
              "NICU volunteer donates a million dollars to local baby unit",
              "He spent 27 years wrongly convicted of murder. He wants to spend the rest encouraging inmates to read",
              "She heard their cries and couldn’t walk away, so she helped save 18 dogs in Kinston",
              "San Antonio police credit four good Samaritans with pulling a driver out of his burning truck on U.S. 281 and he walked away from the crash unscathed."]

In [106]:
term_news = tfidf.transform(new_news)
term_upnews = tfidf.transform(new_upnews)
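
The new headlines go through `transform` on the already-fitted vectorizer (not `fit_transform`), so they are mapped onto the vocabulary the model was trained on. A minimal sketch with a tiny stand-in corpus (an assumption, not the real posts) shows the effect:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny stand-in corpus; the real vectorizer was fitted on the cleaned posts (assumption).
train_titles = ["police rescue dog from river",
                "city council raises taxes",
                "volunteer donates to hospital"]
vectorizer = TfidfVectorizer().fit(train_titles)

new_titles = ["volunteer rescues kitten from river"]
new_matrix = vectorizer.transform(new_titles)

# Same number of columns as the training vocabulary; unseen words are dropped.
n_columns = new_matrix.shape[1]
vocab_size = len(vectorizer.vocabulary_)
unseen = [w for w in ["rescues", "kitten"] if w not in vectorizer.vocabulary_]
print(n_columns, vocab_size, unseen)
```

Words the vectorizer never saw contribute nothing to the prediction, which is one reason short, novel headlines can be hard to classify.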

In [107]:
lr_term_gs.predict_proba(term_news)

array([[0.26532039, 0.73467961],
       [0.2267944 , 0.7732056 ],
       [0.10926029, 0.89073971],
       [0.94882055, 0.05117945],
       [0.46478371, 0.53521629]])

In [108]:
lr_term_gs.predict_proba(term_upnews)

array([[0.23437802, 0.76562198],
       [0.19105892, 0.80894108],
       [0.76819741, 0.23180259],
       [0.12129401, 0.87870599],
       [0.74987998, 0.25012002]])

In [109]:
lr_term_gs.predict(term_news)

array([ 1,  1,  1, -1,  1])

In [110]:
lr_term_gs.predict(term_upnews)

array([ 1,  1, -1,  1, -1])
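
The predicted labels follow directly from the probabilities: each column corresponds to one entry of the estimator's `classes_`, which scikit-learn keeps sorted (here assumed to be `[-1, 1]`). The sketch below reproduces the news predictions from the probabilities printed above and flags close calls; the 0.6 cutoff is an arbitrary choice for illustration:

```python
import numpy as np

# Probabilities reported above for the five news headlines.
news_proba = np.array([[0.26532039, 0.73467961],
                       [0.2267944 , 0.7732056 ],
                       [0.10926029, 0.89073971],
                       [0.94882055, 0.05117945],
                       [0.46478371, 0.53521629]])

# Column order follows classes_, assumed here to be [-1, 1].
classes = np.array([-1, 1])
labels = classes[news_proba.argmax(axis=1)]

# Headlines whose winning probability is under 0.6 are close calls (arbitrary cutoff).
uncertain = news_proba.max(axis=1) < 0.6
print(labels, uncertain)
```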

### Evaluation

Looking at the headlines and the sentiment ascribed to them, the model does not appear very accurate: four of the five news headlines, including the opioid lawsuit and the Serena Williams fine, are classified as positive, while two of the five uplifting headlines are classified as negative. 

For this particular purpose of sorting by sentiment, I would not want to rely on the current model in production. I believe the larger improvements have to come from the sentiment labels rather than from the model itself: because the VADER sentiment analyzer used to create the target was not tuned, the model is accurately reproducing labels that are themselves often wrong. 

For the largest improvements, I would spend more time on preprocessing to select more relevant features, as well as improve the sentiment analysis (perhaps by increasing the n-gram range). 
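
One way to read the n-gram suggestion is to let the vectorizer build features from word pairs as well as single words. A minimal sketch, using two made-up titles (an assumption), shows that a negated phrase only becomes its own feature once bigrams are included:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up titles for illustration (assumption).
titles = ["this is not good news", "this is good news"]

unigrams = TfidfVectorizer(ngram_range=(1, 1)).fit(titles)
bigrams = TfidfVectorizer(ngram_range=(1, 2)).fit(titles)

# Only the bigram vocabulary contains the negated phrase as a single feature.
has_negation_unigram = "not good" in unigrams.vocabulary_
has_negation_bigram = "not good" in bigrams.vocabulary_
print(has_negation_unigram, has_negation_bigram)
```

The trade-off is a larger vocabulary, which would lengthen the grid searches above.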