# Modeling

In this workbook I build, tune, and evaluate several classification models. The goal is a model that, given a political post from a user of a social media platform such as Reddit, can accurately identify whether the poster is a Democrat or a Republican.

My main evaluation metric is accuracy. However, I will also examine false positives and false negatives, along with sensitivity and specificity scores, to ensure the model can adapt to other needs should they arise over the course of the campaign.

Overview of Modeling Process:
1. Load clean data from previous workbook
2. Identify Baseline Metrics of Null Model
3. Build & Tune Various Classification Models
4. Evaluate Model Metrics (Accuracy, Sensitivity, Specificity)
5. Conclusion & Recommendations

In [1]:
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier, AdaBoostClassifier, StackingClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix  # plot_confusion_matrix was unused here and is removed in scikit-learn >= 1.2
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

In [2]:
%store -r demdata
%store -r repdata

In [3]:
#Join both datasets
data= pd.concat([repdata, demdata])

In [4]:
#Encode the target variable as numeric: 1 for Republican, 0 for Democrat
data['subreddit']= data['subreddit'].map({'Republican': 1, 'democrats':0})

In [5]:
data.head()

| | subreddit | title | author | num_comments | title_word_count |
|---|---|---|---|---|---|
| 0 | 1 | 'extremely concerning': elon musk on child por... | BroSteveWinter | 1 | 10 |
| 1 | 1 | florida gov. desantis says majority of post-hu... | tbburns2017 | 1 | 12 |
| 2 | 1 | louden county virginia man arrested on child s... | tbburns2017 | 1 | 13 |
| 3 | 1 | twitter shares soar on report elon musk agrees... | PinkClouds20 | 1 | 13 |
| 4 | 1 | hollywood turns on kamala harris, ends vp’s po... | Patriots-United | 1 | 12 |


In [6]:
data.shape

(7581, 5)

In [7]:
#Define X and y
X= data['title']
y= data['subreddit']

#train test split
X_train, X_test, y_train, y_test= train_test_split(X, y, random_state= 42, stratify= y)

## Baseline Metrics
As seen below, the majority class is the Republican subreddit, with a 55% share of the posts in the dataset. A null model that always predicts the majority class would therefore be correct 55% of the time, so 55% is the baseline accuracy against which we should evaluate our models.

In [8]:
y_train.value_counts(normalize = True)

1    0.550396
0    0.449604
Name: subreddit, dtype: float64

In [9]:
y_test.value_counts(normalize = True)

1    0.550633
0    0.449367
Name: subreddit, dtype: float64
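
One way to make the null model explicit is scikit-learn's `DummyClassifier`, which always predicts the most frequent class. The sketch below uses made-up labels with a 55/45 split rather than the real `y_train`; `X_demo`, `y_demo`, and `null_model` are illustrative names, not part of the original workbook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels with roughly the same 55/45 class split as the real data
y_demo = np.array([1] * 55 + [0] * 45)
# The dummy model ignores the features, so a placeholder column is enough
X_demo = np.zeros((len(y_demo), 1))

# The null model always predicts the majority class
null_model = DummyClassifier(strategy="most_frequent")
null_model.fit(X_demo, y_demo)

baseline_accuracy = null_model.score(X_demo, y_demo)
print(baseline_accuracy)
```

The resulting accuracy equals the majority-class share, which is the same figure that `value_counts(normalize=True)` reports.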

## Building & Tuning the Models

Below I instantiate, fit, and tune each model, employing Pipelines and GridSearch to optimize the hyperparameters. I've decided to run five models: LogisticRegression, RandomForestClassifier, AdaBoost, Gradient Boost, and a stacked combination of a select group of classification models.

### Logistic Regression

This model is similar to linear regression in that we fit a linear combination of the features in order to make a prediction. The difference is that the linear output is passed through the logistic (sigmoid) function, the inverse of the logit link, which "bends" the line of best fit into an S-shaped curve bounded between 0 and 1. We use this curve to predict the probability that a data point is in one class versus the other.
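
As a rough illustration of that "bending" step, the sketch below applies the sigmoid to a few hypothetical linear scores and shows that the logit undoes it. The function names and the score values are chosen for this example only.

```python
import numpy as np

def sigmoid(z):
    """Map a linear score to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Inverse of the sigmoid: map a probability back to log-odds."""
    return np.log(p / (1.0 - p))

# Hypothetical linear scores (intercept + coefficients * word counts)
scores = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])

probs = sigmoid(scores)
# With the default 0.5 threshold, a score of 0 or more maps to class 1
predicted_class = (probs >= 0.5).astype(int)

print(probs)
print(predicted_class)
```

A score of exactly 0 corresponds to a probability of 0.5, which is the decision boundary between the two classes.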

In [10]:
#Logistic Regression
pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('lr', LogisticRegression())
])

pipe_params = {
    'cvec__max_features': [2000, 3000, 4000],
    'cvec__stop_words': [None, 'english'],
    'cvec__min_df': [1, 2, 4],
    'cvec__max_df': [1.0, .5],
    'lr__C': [1.0, 0.1],
    'lr__penalty': ['l2', 'none']
}

gs = GridSearchCV(pipe,
                 param_grid=pipe_params,
                 n_jobs = -1)

In [11]:
gs.fit(X_train, y_train)

In [12]:
print(gs.best_score_)
gs.best_params_

0.7160949868073879


{'cvec__max_df': 1.0,
 'cvec__max_features': 4000,
 'cvec__min_df': 2,
 'cvec__stop_words': 'english',
 'lr__C': 1.0,
 'lr__penalty': 'l2'}

In [13]:
#Saving predictions and accuracy scores
log_preds = gs.predict(X_test)

log_train = gs.score(X_train, y_train)

log_test = gs.score(X_test, y_test)

### Random Forest + Gridsearch

A random forest is essentially a combination of multiple decision trees, each of which considers only a random subset of the dataset's features. By restricting each node split to a random subset of the features, a random forest decorrelates its trees and can reduce the overfitting typically seen in a single decision tree. Below I use a grid search to tune the hyperparameters and optimize my model for the best set of max features, tree depth, and number of trees.
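
As an alternative sketch, the vectorizer and the forest can be wrapped in a single `Pipeline`, as was done for logistic regression, so that the grid search refits the vectorizer inside every cross-validation fold. The titles and labels below are invented purely for illustration, and the grid values are deliberately small; none of this is the workbook's actual data or tuning grid.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy titles and arbitrary labels, made up for this example
titles = [
    "senate passes new budget bill",
    "governor signs tax relief package",
    "house debates border funding plan",
    "court rules on state election maps",
    "committee schedules hearing on energy prices",
    "mayor announces infrastructure spending",
    "voters approve school bond measure",
    "lawmakers propose health care changes",
    "candidate releases campaign finance report",
    "agency publishes new climate rules",
    "state expands early voting hours",
    "council votes on housing ordinance",
]
labels = [1, 0] * 6

rf_pipe = Pipeline([
    ("cvec", CountVectorizer(stop_words="english")),
    ("rf", RandomForestClassifier(random_state=42)),
])

# Step names prefix the parameter names: <step>__<parameter>
rf_pipe_params = {
    "rf__n_estimators": [10, 20],
    "rf__max_depth": [None, 5],
    "rf__max_features": [1, 3],
}

rf_search = GridSearchCV(rf_pipe, param_grid=rf_pipe_params, cv=3)
rf_search.fit(titles, labels)

print(rf_search.best_params_)
```

Because the pipeline accepts raw text, the same `rf_search` object can score unvectorized titles directly, which avoids keeping separate vectorized copies of the train and test sets.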

In [16]:
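# Instantiate the CountVectorizer
cvec = CountVectorizer(stop_words = 'english')

X_train_rf = X_train
X_test_rf = X_test

# Fit the vectorizer on the training set only
cvec.fit(X_train_rf)

# Transform the training set
X_train_rf = cvec.transform(X_train_rf)

# Transform the test set
# Note: X_test now holds the vectorized matrix used by the models below
X_test = cvec.transform(X_test_rf)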

In [17]:
#Instantiate the Model and Tune the Parameters
rf= RandomForestClassifier(random_state = 42)
rf_params = {
    'n_estimators': [100, 150, 200],
    'max_depth': [None, 1, 5],
    'max_features' : [1, 3, 5],
    'n_jobs': [-1]
}

gs1 = GridSearchCV(rf, param_grid= rf_params, cv = 3, n_jobs = -1)

gs1.fit(X_train_rf, y_train)
print(gs1.best_score_)
gs1.best_params_

0.7298153034300792


{'max_depth': None, 'max_features': 5, 'n_estimators': 200, 'n_jobs': -1}

In [18]:
rf_train = gs1.score(X_train_rf, y_train)

In [19]:
rf_test = gs1.score(X_test, y_test)

## Boosting Models

Boosting is a type of ensemble modeling in which we fit a sequence of weak models, each one concentrating on the mistakes of the models before it, with the goal of reducing bias and improving accuracy. Below I employ two boosting models, AdaBoost and Gradient Boost. They differ in that AdaBoost increases the weights of the misclassified observations before fitting the next model, while Gradient Boost fits the next model to the residuals of the prior model.
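
The residual-fitting idea behind Gradient Boost can be sketched by hand. For simplicity this example uses a regression target with squared-error loss, where the residual is simply `y - prediction`; scikit-learn's `GradientBoostingClassifier` applies the same loop to the gradient of a classification loss instead. The data, the learning rate, and all variable names are assumptions made for this illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical one-feature regression problem
rng = np.random.default_rng(42)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0])

learning_rate = 0.5
# Start from a constant prediction: the mean of the target
prediction = np.full_like(y_demo, y_demo.mean())
errors = [np.mean((y_demo - prediction) ** 2)]

for _ in range(20):
    # Each new stump is fit to what the current ensemble still gets wrong
    residuals = y_demo - prediction
    stump = DecisionTreeRegressor(max_depth=1, random_state=42)
    stump.fit(X_demo, residuals)
    # Add a shrunken version of the stump's correction to the ensemble
    prediction += learning_rate * stump.predict(X_demo)
    errors.append(np.mean((y_demo - prediction) ** 2))

print(errors[0], errors[-1])
```

The training error shrinks with each round, which is also why boosting can overfit if the number of estimators or the learning rate is set too high.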

In [21]:
#Ada Boost
ada= AdaBoostClassifier(base_estimator = DecisionTreeClassifier(random_state = 42))

ada_params= {
    'n_estimators': [50, 100, 150],
    'base_estimator__max_depth':[1,2],
    'learning_rate':[0.6, 1.0]
}

gs2 = GridSearchCV(ada, param_grid = ada_params, cv = 3)
gs2.fit(X_train_rf, y_train)

In [32]:
ada_train = gs2.score(X_train_rf, y_train)

ada_test = gs2.score(X_test, y_test)

In [24]:
#Gradient Boost
gboost= GradientBoostingClassifier(random_state = 42)

gboost_params = {
    'max_depth': [2, 3, 4],
    'n_estimators': [100, 125, 150],
    'learning_rate': [0.8, 1.0, 0.1]
}

gb_gs = GridSearchCV(gboost, param_grid= gboost_params, cv =3)
gb_gs.fit(X_train_rf, y_train)
print(gb_gs.best_score_)
gb_gs.score(X_test, y_test)

0.6858399296394019


0.6988396624472574

In [34]:
gb_train = gb_gs.score(X_train_rf, y_train)
gb_test = gb_gs.score(X_test, y_test)

### Stacking Models

Lastly, I deploy a stacking model, another ensemble technique. In stacking, predictions are made by multiple models, and those predictions are then used as features for another model. Here I use a random forest, a logistic regression, and an AdaBoost classifier to make the first-level predictions, which are then fed into a final logistic regression model.
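
To make the two levels concrete, here is a hand-rolled sketch of the idea on synthetic data: out-of-fold probabilities from the base models become the features of the final model. It is meant only to illustrate the mechanics, not to replace `StackingClassifier`; the synthetic dataset and every variable name are assumptions for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

# Synthetic stand-in for the vectorized titles
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=42, stratify=y_demo
)

base_models = [
    RandomForestClassifier(n_estimators=25, random_state=42),
    LogisticRegression(max_iter=1000),
]

# Level 1 (train): out-of-fold probabilities, so the final model never
# sees a prediction made on a row the base model was trained on
train_meta = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Level 1 (test): refit each base model on all training rows, then predict
test_meta = np.column_stack([
    m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in base_models
])

# Level 2: the final estimator learns how to combine the base predictions
final_model = LogisticRegression()
final_model.fit(train_meta, y_tr)
stack_accuracy = final_model.score(test_meta, y_te)
print(stack_accuracy)
```

Each base model contributes one column to the meta-feature matrix, so the final logistic regression has only as many inputs as there are base models.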

In [26]:
level1_estimators = [
    ('random_forest', RandomForestClassifier()),
    ('lr', LogisticRegression()),
    ('boost', AdaBoostClassifier())
]

stacked_model = StackingClassifier(estimators=level1_estimators,
                                 final_estimator = LogisticRegression())

In [27]:
# Fit
stacked_model.fit(X_train_rf, y_train)

In [35]:
# Train score
stack_train= stacked_model.score(X_train_rf, y_train)

# Test score
stack_test = stacked_model.score(X_test, y_test)

### Model Evaluation

The main metric I use to evaluate each model is the accuracy score, which represents the percentage of the training and testing sets that the model classified correctly. To take things further, I will also review the sensitivity and specificity scores. Because Republican is encoded as the positive class (1), sensitivity measures how well a model identifies Republican posters and specificity measures how well it identifies Democrat posters. Hypothetically, I may care more about correctly identifying my base voter, so depending on whether my base is Republican or Democrat I can use sensitivity or specificity, respectively, to evaluate each model.

In [52]:
train_scores = [log_train, rf_train, ada_train, gb_train, stack_train]
eval_df = pd.DataFrame(train_scores, columns = ['train_scores'])

eval_df['test_scores'] = [log_test, rf_test, ada_test, gb_test, stack_test]
eval_df.index = ['logreg', 'rf', 'ada', 'gb', 'stack']
eval_df['diff'] = eval_df['train_scores'] - eval_df['test_scores']
eval_df['baseline'] = .55

eval_df

| | train_scores | test_scores | diff | baseline |
|---|---|---|---|---|
| logreg | 0.919789 | 0.716245 | 0.203544 | 0.55 |
| rf | 0.998065 | 0.747363 | 0.250702 | 0.55 |
| ada | 0.777485 | 0.685127 | 0.092358 | 0.55 |
| gb | 0.875462 | 0.69884 | 0.176622 | 0.55 |
| stack | 0.992612 | 0.733122 | 0.25949 | 0.55 |


In [73]:
#True/False Positive and Negative counts from each model's confusion matrix
log_tn, log_fp, log_fn, log_tp = confusion_matrix(y_test, log_preds).ravel()
rf_tn, rf_fp, rf_fn, rf_tp = confusion_matrix(y_test, gs1.predict(X_test)).ravel()
ada_tn, ada_fp, ada_fn, ada_tp = confusion_matrix(y_test, gs2.predict(X_test)).ravel()
gb_tn, gb_fp, gb_fn, gb_tp = confusion_matrix(y_test, gb_gs.predict(X_test)).ravel()
st_tn, st_fp, st_fn, st_tp = confusion_matrix(y_test, stacked_model.predict(X_test)).ravel()

In [78]:
#Calculate Sensitivity Scores
log_sens = log_tp / (log_tp + log_fn)
rf_sens = rf_tp / (rf_tp + rf_fn)
ada_sens = ada_tp / (ada_tp + ada_fn)
gb_sens = gb_tp / (gb_tp + gb_fn)
st_sens = st_tp / (st_tp + st_fn)

In [76]:
#Calculate Specificity Scores
log_spec = log_tn / (log_tn + log_fp)
rf_spec = rf_tn / (rf_tn + rf_fp)
ada_spec = ada_tn / (ada_tn + ada_fp)
gb_spec = gb_tn / (gb_tn + gb_fp)
st_spec = st_tn / (st_tn + st_fp)
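
The per-model arithmetic above could be collapsed into one helper. The sketch below is a possible refactor under that assumption, demonstrated on made-up labels; `sensitivity_specificity`, `y_true_demo`, and `y_pred_demo` are hypothetical names, not part of the original workbook.

```python
from sklearn.metrics import confusion_matrix, recall_score

def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded 0/1."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn)  # share of actual 1s predicted as 1
    specificity = tn / (tn + fp)  # share of actual 0s predicted as 0
    return sensitivity, specificity

# Hypothetical labels: 1 = Republican, 0 = Democrat
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 1, 0, 0]
y_pred_demo = [1, 1, 0, 1, 0, 1, 1, 1, 0, 0]

# 4 of 5 Republicans and 3 of 5 Democrats are classified correctly
sens, spec = sensitivity_specificity(y_true_demo, y_pred_demo)
print(sens, spec)

# Specificity is the recall of the negative class, which gives a cross-check
spec_check = recall_score(y_true_demo, y_pred_demo, pos_label=0)
```

Calling the helper once per model's test predictions would produce the same numbers as the separate sensitivity and specificity cells.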

In [82]:
#Generate a Dataframe showing specificity and sensitivity scores
sens_scores = [log_sens, rf_sens, ada_sens, gb_sens, st_sens]
metrics_df = pd.DataFrame(sens_scores, columns = ['sensitivity'])

metrics_df['specificity'] = [log_spec, rf_spec, ada_spec, gb_spec, st_spec]
metrics_df.index = ['logreg', 'rf', 'ada', 'gb', 'stack']

metrics_df

| | sensitivity | specificity |
|---|---|---|
| logreg | 0.743295 | 0.683099 |
| rf | 0.792146 | 0.692488 |
| ada | 0.768199 | 0.583333 |
| gb | 0.774904 | 0.605634 |
| stack | 0.771073 | 0.68662 |


### Results & Conclusions

As you can see, every model overfits the training data: each shows a higher accuracy in training than in testing, with gaps ranging from about 9 percentage points (AdaBoost) to about 26 (stacking). Ultimately, the Random Forest classifier was the best model, with the highest testing accuracy of approximately 75%, which is about 20 percentage points above the 55% baseline accuracy of our null model. Given the significant overlap in the words used by the Democrat and Republican subreddits, I am satisfied with an accuracy score of 75%. Additionally, the Random Forest had the highest sensitivity and specificity scores, so regardless of whether I am more concerned with correctly predicting Democrats or Republicans as my base, I should choose the Random Forest to minimize false positives and false negatives.