# Project 3: Reddit Post Classification

<i>Pulling information and classifying posts via Pushshift's API</i>

**Author: Brendan McDonnell**

## Step 3: Modeling

Creating a model that predicts which subreddit a post belongs to.

## Relative Links
- [Importing Libraries and Datasets Needed](#Importing-Libraries-and-Datasets-Needed)
- [Deciding Between TFIDF & CVEC](#Deciding-Between-TFIDF-&-CVEC)
- [Checking Sentiment Analysis Features](#Checking-Sentiment-Analysis-Features)
- [Modeling Predictions](#Modeling-Predictions)

## Importing Libraries and Datasets Needed

In [53]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

In [54]:
df = pd.read_csv('./data/final_cleaned.csv')

In [55]:
df.head()

Unnamed: 0,title,body,is_the_donald,vad_title_neg,vad_title_neu,vad_title_pos,vad_title_compound,vad_body_neg,vad_body_neu,vad_body_pos,vad_body_compound,polarity_tit,subjectivity_tit,polarity_bod,subjectivity_bod
0,Need help Costas family fun part is discrimina...,_,0,0.0,0.558,0.442,0.8271,0.0,0.0,0.0,0.0,0.3,0.2,0.0,0.0
1,So what will the voters say when Texas turns b...,_,0,0.0,0.848,0.152,0.3612,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0
2,When liberals generalize our entire party sayi...,_,0,0.107,0.631,0.262,0.4939,0.0,0.0,0.0,0.0,0.0,0.6125,0.0,0.0
3,SmythTV! 7/3/19 #IndependenceDay #Happy4thOfJuly,_,0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Conditions in Migrant Detention Centers Almost...,_,0,0.323,0.677,0.0,-0.6929,0.0,0.0,0.0,0.0,-0.7,0.666667,0.0,0.0


In [56]:
# baseline
df.is_the_donald.value_counts(normalize=True)

0    0.520104
1    0.479896
Name: is_the_donald, dtype: float64
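The majority-class share above (about 0.52) is the accuracy any model has to beat. As a minimal sketch of the same check, scikit-learn's `DummyClassifier` can stand in for the baseline. The labels below are made up for illustration and are not the project's data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels (not the project's data): 52 zeros and 48 ones.
y_demo = np.array([0] * 52 + [1] * 48)
X_demo = np.zeros((len(y_demo), 1))  # the dummy model ignores the features

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
baseline_acc = baseline.score(X_demo, y_demo)

print(baseline_acc)
```

The score equals the share of the larger class, which is what `value_counts(normalize=True)` reports directly.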

## Deciding Between TFIDF & CVEC

Each vectorizer takes a single text series as input, so I will run each one in a pipeline with a basic logistic regression, separately for the 'title' and 'body' columns. I will decide between the methods after seeing which works best with this dataset.
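To make the difference between the two vectorizers concrete, here is a small sketch on a made-up corpus (the titles are invented for illustration). `CountVectorizer` keeps raw term counts, while `TfidfVectorizer` downweights terms that appear in many documents.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up titles for illustration only.
toy_titles = ["build the wall", "the debate tonight", "the wall the wall"]

cvec_demo = CountVectorizer()
counts = cvec_demo.fit_transform(toy_titles).toarray()   # raw term counts

tvec_demo = TfidfVectorizer()
weights = tvec_demo.fit_transform(toy_titles).toarray()  # counts reweighted by rarity

# In the third title, "the" and "wall" both occur twice.
count_the = counts[2, cvec_demo.vocabulary_["the"]]
count_wall = counts[2, cvec_demo.vocabulary_["wall"]]

# "the" appears in every title, so TF-IDF gives it less weight than "wall".
weight_the = weights[2, tvec_demo.vocabulary_["the"]]
weight_wall = weights[2, tvec_demo.vocabulary_["wall"]]

print(count_the, count_wall)
print(weight_the < weight_wall)
```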

In [57]:
X_t = df['title'] # X title series
X_b = df['body'] # X body series
X_sent = df[['vad_title_neg',
             'vad_title_neu',
             'vad_title_pos', 
             'vad_title_compound', 
             'vad_body_neg',
             'vad_body_neu',
             'vad_body_pos',
             'vad_body_compound',
             'polarity_tit',
             'subjectivity_tit',
             'polarity_bod',
             'subjectivity_bod']]
y = df['is_the_donald']

# train test split the two X's y's for titles and bodies
X_train_t, X_test_t, y_train_t, y_test_t = train_test_split(X_t, y, stratify=y, random_state=4)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(X_b, y, stratify=y, random_state=4)
X_train_sent, X_test_sent, y_train, y_test = train_test_split(X_sent, y, stratify=y, random_state=4)

# two pipelines
pipe_cvec = Pipeline([('cvec', CountVectorizer()),
             ('lr', LogisticRegression())])
pipe_tvec = Pipeline([('tvec', TfidfVectorizer()),
             ('lr', LogisticRegression())])

In [58]:
X_train_t.shape, X_test_t.shape

((21320,), (7107,))

In [59]:
X_train_b.shape, X_test_b.shape

((21320,), (7107,))

In [60]:
X_train_sent.shape, X_test_sent.shape

((21320, 12), (7107, 12))

In [61]:
# pipe params for CountVectorizer
pipe_cvec_params = {
    'cvec__max_features': [2500, 3000, 3500],
    'cvec__min_df': [2, 3],
    'cvec__max_df': [.9, .95],
    'cvec__stop_words': [None, 'english'],
    'cvec__ngram_range': [(1,1), (1,2), (1,3)]
}

# grid search over title training data for CVEC
gs_cvec_t = GridSearchCV(pipe_cvec, param_grid=pipe_cvec_params, cv=3, verbose=1)
gs_cvec_t.fit(X_train_t, y_train_t)
print(gs_cvec_t.best_score_)
gs_cvec_t.best_params_

Fitting 3 folds for each of 72 candidates, totalling 216 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.










[Parallel(n_jobs=1)]: Done 216 out of 216 | elapsed:  5.6min finished


0.7392120075046904


{'cvec__max_df': 0.9,
 'cvec__max_features': 3500,
 'cvec__min_df': 2,
 'cvec__ngram_range': (1, 1),
 'cvec__stop_words': None}

In [62]:
# pipe params for CountVectorizer
pipe_cvec_params = {
    'cvec__max_features': [3500],
    'cvec__min_df': [2],
    'cvec__max_df': [.9],
    'cvec__stop_words': ['english'],
    'cvec__ngram_range': [(1,2)]
}

# grid search over body training data for CVEC
gs_cvec_b = GridSearchCV(pipe_cvec, param_grid=pipe_cvec_params, cv=3, verbose=1)
gs_cvec_b.fit(X_train_b, y_train_b)
print(gs_cvec_b.best_score_)
gs_cvec_b.best_params_

Fitting 3 folds for each of 1 candidates, totalling 3 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    1.4s finished


0.5447467166979362




{'cvec__max_df': 0.9,
 'cvec__max_features': 3500,
 'cvec__min_df': 2,
 'cvec__ngram_range': (1, 2),
 'cvec__stop_words': 'english'}

In [63]:
# pipe params for TFIDF Vectorizer
pipe_tvec_params = {
    'tvec__max_features': [10000, 9500],
    'tvec__min_df': [3],
    'tvec__max_df': [.9],
    'tvec__stop_words': [None],
    'tvec__ngram_range': [(1,1)]
}

# grid search over title training data for TFIDF
gs_tvec_t = GridSearchCV(pipe_tvec, param_grid=pipe_tvec_params, cv=3, verbose=1)
gs_tvec_t.fit(X_train_t, y_train_t)
print(gs_tvec_t.best_score_)
gs_tvec_t.best_params_

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 3 folds for each of 2 candidates, totalling 6 fits


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:    3.7s finished


0.7535178236397748




{'tvec__max_df': 0.9,
 'tvec__max_features': 10000,
 'tvec__min_df': 3,
 'tvec__ngram_range': (1, 1),
 'tvec__stop_words': None}

In [64]:
# pipe params for TFIDF Vectorizer
pipe_tvec_params = {
    'tvec__max_features': [2500, 3000, 3500],
    'tvec__min_df': [2, 3],
    'tvec__max_df': [.9, .95],
    'tvec__stop_words': [None, 'english'],
    'tvec__ngram_range': [(1,1), (1,2), (1,3)]
}

# grid search over body training data for TFIDF
gs_tvec_b = GridSearchCV(pipe_tvec, param_grid=pipe_tvec_params, cv=3, verbose=1)
gs_tvec_b.fit(X_train_b, y_train_b)
print(gs_tvec_b.best_score_)
gs_tvec_b.best_params_

Fitting 3 folds for each of 72 candidates, totalling 216 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.










[Parallel(n_jobs=1)]: Done 216 out of 216 | elapsed:  1.8min finished


0.5434333958724202




{'tvec__max_df': 0.9,
 'tvec__max_features': 2500,
 'tvec__min_df': 3,
 'tvec__ngram_range': (1, 1),
 'tvec__stop_words': 'english'}

In [65]:
# test accuracy for each vectorizer / column combination
(gs_cvec_t.score(X_test_t, y_test_t),
 gs_cvec_b.score(X_test_b, y_test_b),
 gs_tvec_t.score(X_test_t, y_test_t),
 gs_tvec_b.score(X_test_b, y_test_b))

(0.7505276487969608,
 0.5455185028844801,
 0.7665681722245673,
 0.5442521457717743)

TFIDF Vectorizer is the marginally better option for vectorizing the titles (0.767 vs. 0.751 test accuracy). For the bodies, CVEC with stop words removed scores slightly higher (0.546 vs. 0.544), though both are barely above the 0.52 baseline. That isn't surprising: predictions from the bodies will never be very good, given that only about 1/3 of the posts contain any body text at all.
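If both choices were to be used in a single model, one option is scikit-learn's `ColumnTransformer`, which applies a different vectorizer to each text column. This is only a sketch on made-up rows, not something the notebook does; the names `demo`, `features`, and `combined` are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up rows for illustration; '_' marks an empty body, as in the dataset preview.
demo = pd.DataFrame({
    "title": ["maga rally tonight", "healthcare bill vote", "build the wall now",
              "climate policy debate", "rally for the wall", "vote on climate bill"],
    "body": ["_", "long post about the senate vote", "_",
             "_", "see you at the rally", "_"],
    "label": [1, 0, 1, 0, 1, 0],
})

# Passing the column name as a plain string hands each vectorizer a 1-D series of text.
features = ColumnTransformer([
    ("title_tvec", TfidfVectorizer(), "title"),
    ("body_cvec", CountVectorizer(stop_words="english"), "body"),
])

combined = Pipeline([("features", features), ("lr", LogisticRegression())])
combined.fit(demo[["title", "body"]], demo["label"])

# The combined matrix has one block of columns per vectorizer.
fitted = combined.named_steps["features"]
n_title = len(fitted.named_transformers_["title_tvec"].vocabulary_)
n_body = len(fitted.named_transformers_["body_cvec"].vocabulary_)
n_features = fitted.transform(demo[["title", "body"]]).shape[1]
print(n_title, n_body, n_features)
```

Whether the body block adds anything on the real data is an open question, given how few posts have body text.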

In [66]:
# parameters chosen after a bit of testing: TFIDF for titles, CVEC for bodies
gs_tvec_t.best_params_, gs_cvec_b.best_params_

({'tvec__max_df': 0.9,
  'tvec__max_features': 10000,
  'tvec__min_df': 3,
  'tvec__ngram_range': (1, 1),
  'tvec__stop_words': None},
 {'cvec__max_df': 0.9,
  'cvec__max_features': 3500,
  'cvec__min_df': 2,
  'cvec__ngram_range': (1, 2),
  'cvec__stop_words': 'english'})

## Checking Sentiment Analysis Features

**Are these features worth including in the model? Their low computational cost was a big part of why including them was feasible at all.**

In [67]:
X_sent = df[['vad_title_neg', 'vad_title_neu',
       'vad_title_pos', 'vad_title_compound', 'vad_body_neg', 'vad_body_neu',
       'vad_body_pos', 'vad_body_compound', 'polarity_tit', 'subjectivity_tit',
       'polarity_bod', 'subjectivity_bod']]
y = df['is_the_donald']

X_train, X_test, y_train, y_test = train_test_split(X_sent, y, stratify=y, random_state=4)

In [68]:
logreg = LogisticRegression()

logreg.fit(X_train, y_train)
logreg.score(X_train, y_train)



0.5611163227016885

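**Answer: No.** At 0.561 training accuracy, the sentiment features alone sit only about 4 points above the 0.520 baseline.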
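The score above is measured on the training split. A stricter version of the same check compares held-out accuracy against a majority-class baseline. The sketch below uses synthetic noise as a stand-in for the 12 sentiment columns, so the numbers it prints say nothing about the real features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
# Synthetic stand-in for the 12 sentiment columns: pure noise, unrelated to the label.
X_noise = rng.normal(size=(400, 12))
y_noise = rng.integers(0, 2, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_noise, y_noise, stratify=y_noise, random_state=4
)

model = LogisticRegression().fit(X_tr, y_tr)
dummy = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)

train_acc = model.score(X_tr, y_tr)   # optimistic: the model has seen these rows
test_acc = model.score(X_te, y_te)    # held-out accuracy
base_acc = dummy.score(X_te, y_te)    # majority-class baseline on the same rows
print(train_acc, test_acc, base_acc)
```

If the held-out accuracy is not clearly above the baseline, the features are adding little.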

## Modeling Predictions

**Fitting a Gaussian Naive Bayes model results in an overfit model that performs about .10 above baseline; in fact, just using TFIDF and logistic regression on the title text results in a much better model, as shown above.**
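The Naive Bayes run itself is not shown in this notebook. As a rough sketch of how such a check could look, the snippet below fits `GaussianNB` on dense TF-IDF features and compares training and test accuracy; a large gap between the two is the usual sign of overfitting. The texts and labels are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB

# Made-up titles and labels for illustration only.
train_text = ["build the wall", "maga rally tonight", "wall funding approved",
              "healthcare bill vote", "climate policy debate", "senate vote on climate"]
train_y = [1, 1, 1, 0, 0, 0]
test_text = ["rally for the wall", "debate on the healthcare bill"]
test_y = [1, 0]

tvec_nb = TfidfVectorizer()
# GaussianNB needs dense input, so convert the sparse TF-IDF matrices.
X_tr_nb = tvec_nb.fit_transform(train_text).toarray()
X_te_nb = tvec_nb.transform(test_text).toarray()

nb = GaussianNB().fit(X_tr_nb, train_y)
nb_train_acc = nb.score(X_tr_nb, train_y)
nb_test_acc = nb.score(X_te_nb, test_y)
print(nb_train_acc, nb_test_acc)
```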

### Logistic Regression

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

In [None]:
X = df['title'] + ' ' + df['body']
y = df['is_the_donald']

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=4)

In [None]:
# baseline
1 - y.mean()

In [None]:
# logistic regression
params = [{'tvec__max_df': [0.9],
           'tvec__max_features': [10_000],
           'tvec__min_df': [3],
           'tvec__ngram_range': [(1, 1)],
           'tvec__stop_words': [None],
           'lr__penalty': ['l2'],
           'lr__C': [1.5]
          }]
pipe_lr = Pipeline([('tvec', TfidfVectorizer()),
                     ('lr', LogisticRegression())
                    ])
grid = GridSearchCV(param_grid=params, estimator=pipe_lr, cv=3, verbose=1, n_jobs=2)
grid.fit(X_train, y_train)
print(grid.best_score_)
grid.best_params_

In [46]:
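# vectorize with the best TFIDF parameters found above
tvec = TfidfVectorizer(max_df=0.9, max_features=10_000, min_df=3, ngram_range=(1, 1), stop_words=None)
# NOTE: this overwrites the raw-text X_train / X_test with dense TF-IDF DataFrames
# get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names()
X_train = pd.DataFrame(tvec.fit_transform(X_train).toarray(),
                       columns=tvec.get_feature_names_out())
X_test = pd.DataFrame(tvec.transform(X_test).toarray(),
                      columns=tvec.get_feature_names_out())
lr = LogisticRegression(C=1.5, penalty='l2')
lr.fit(X_train, y_train)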



LogisticRegression(C=1.5, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [47]:
df_beta = pd.DataFrame(np.exp(lr.coef_), columns=X_train.columns)

In [48]:
df_beta.shape

(1, 8722)

In [49]:
# exponentiated coefficients (odds ratios), sorted to show which words most raise the odds of is_the_donald = 1
df_beta.T.sort_values(by=0, ascending=False).head()

Unnamed: 0,0
antifa,523.263257
quarantine,458.470694
t_d,106.15753
debates,98.11496
geotus,88.773115


In [50]:
# X_test now holds the vectorized DataFrame, so score the standalone model;
# `grid` is a pipeline that expects the raw text series instead
lr.score(X_test, y_test)
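Because `grid` wraps a pipeline that vectorizes raw text itself, it cannot be scored on an already-vectorized DataFrame. One way to sidestep the problem is to leave the raw text split untouched and read the vocabulary and coefficients out of the fitted pipeline. The sketch below uses a plain `Pipeline` on made-up titles; with a fitted `GridSearchCV`, the same steps should be reachable through `grid.best_estimator_.named_steps`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up titles and labels for illustration only.
texts = ["build the wall", "maga rally tonight", "wall funding approved",
         "rally for the wall", "healthcare bill vote", "climate policy debate",
         "senate vote on climate", "debate on the healthcare bill"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe_demo = Pipeline([("tvec", TfidfVectorizer()),
                      ("lr", LogisticRegression(C=1.5))])
pipe_demo.fit(texts, labels)  # the pipeline takes raw strings

tvec_step = pipe_demo.named_steps["tvec"]
lr_step = pipe_demo.named_steps["lr"]

# Order the terms by column index; this works regardless of scikit-learn version.
terms = sorted(tvec_step.vocabulary_, key=tvec_step.vocabulary_.get)

# Exponentiated coefficients are odds ratios for the positive class.
odds = pd.Series(np.exp(lr_step.coef_[0]), index=terms).sort_values(ascending=False)
print(odds.head())

# Scoring also takes raw strings, so nothing needs to be overwritten.
pipe_acc = pipe_demo.score(texts, labels)
print(pipe_acc)
```

Keeping the raw split intact means the same `X_test` can be passed to both the pipeline and any later model comparison.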