In this notebook, I will start my modeling process with a simple logistic regression model. Logistic regression is a good place to start because it is one of the most widely used classification models and is easy to interpret.

In [151]:
import pandas as pd
import numpy as np
import time
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import stop_words

### Loading my train/test split dataframes

In [152]:
reddit = pd.read_csv('../Data/reddit.csv', index_col=0)
X_train = pd.read_csv('../Data/X_train.csv', header = None, index_col=0)
X_test = pd.read_csv('../Data/X_test.csv', header = None, index_col=0)
y_train = pd.read_csv('../Data/y_train.csv', header = None, index_col=0)
y_test = pd.read_csv('../Data/y_test.csv', header = None, index_col=0)
X_train_sen = pd.read_csv('../Data/X_train_sen.csv', index_col=0)
X_test_sen = pd.read_csv('../Data/X_test_sen.csv', index_col=0)
y_train_sen = pd.read_csv('../Data/y_train_sen.csv', header = None, index_col=0)
y_test_sen = pd.read_csv('../Data/y_test_sen.csv', header = None, index_col=0)
reddit.fillna('', inplace=True)

Creating a custom stopwords list that includes some numbers and terms found in URLs

In [153]:
custom_stopwords = list(stop_words.ENGLISH_STOP_WORDS)
custom_stopwords.extend(['10', '12', '13', '14', '15', '18', '25','200', '000','https', 'com', 'youtube', 'www'])

### Logistic Regression on a TFIDF dataframe

I'm gonna start my modeling by tokenizing my data with a TF-IDF vectorizer, which weights each word by how often it appears in a document relative to how common it is across all documents, as opposed to giving a raw count. I'll then fit my logistic regression model on this tokenized data.
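To make the difference between a count and a TF-IDF weight concrete, here is a minimal sketch on a tiny corpus. The three documents are made up for illustration and are not from the Reddit data; the vectorizers use scikit-learn's default settings rather than the tuned ones.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, made up for illustration (not taken from the Reddit data)
docs = [
    "the senate vote",
    "the house vote",
    "the president spoke",
]

count_vec = CountVectorizer().fit(docs)
tfidf_vec = TfidfVectorizer().fit(docs)  # defaults: smooth_idf=True, norm='l2'

# Vectorize the first document both ways
count_row = count_vec.transform(docs[:1]).toarray()[0]
tfidf_row = tfidf_vec.transform(docs[:1]).toarray()[0]

# Every word in "the senate vote" has a raw count of 1, but TF-IDF gives
# a lower weight to words that appear in more of the documents.
for term in ["the", "vote", "senate"]:
    count = count_row[count_vec.vocabulary_[term]]
    weight = tfidf_row[tfidf_vec.vocabulary_[term]]
    print(f"{term:>7}  count={count}  tfidf={weight:.3f}")
```

In this toy case "the" appears in every document, so it gets the smallest TF-IDF weight, while "senate" appears in only one and gets the largest.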

In [154]:
pipe = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=custom_stopwords)),
    ('logreg', LogisticRegression(solver='liblinear'))  # liblinear handles both the l1 and l2 penalties
])

After extensive grid searching, I found these parameters to be the best.

tfidf__min_df: 1, so a term is kept as long as it appears in at least 1 document (no rare terms are dropped)

tfidf__max_df: 0.7, so terms that appear in over 70% of the documents are ignored

tfidf__norm: l2, so each document vector is scaled to unit l2 norm

logreg__C: 2, the inverse of the regularization strength (larger values mean weaker regularization)

logreg__penalty: l1, so the coefficients are regularized with the l1 norm
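As a rough illustration of what the document-frequency cutoffs do, the sketch below fits a vectorizer with and without them. The four documents are invented, and `min_df=2` is used only so the effect is visible on such a small corpus; it is not the value the grid search selected.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up for illustration
docs = ["tax vote senate", "tax vote house", "tax rally", "tax debate"]

# No pruning: every term is kept
full = CountVectorizer().fit(docs)

# min_df=2: drop terms found in fewer than 2 documents
# max_df=0.7: drop terms found in more than 70% of the documents
pruned = CountVectorizer(min_df=2, max_df=0.7).fit(docs)

print(sorted(full.vocabulary_))
print(sorted(pruned.vocabulary_))
```

Here "tax" is in every document and falls to `max_df`, the terms that occur once fall to `min_df`, and only "vote" survives both cutoffs.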

In [167]:
params = {
    'tfidf__min_df': [1,3,5,7,10],
    'tfidf__max_df': [.7,.75,.8],
    'tfidf__norm': ['l1','l2'],
    'logreg__C': [1,2,4],
    'logreg__penalty': ['l1', 'l2']
}

In [168]:
gs = GridSearchCV(pipe, params)

In [169]:
gs.fit(X_train[1], y_train[1])

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
 ...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'tfidf__min_df': [1, 3, 5, 7, 10], 'tfidf__max_df': [0.7, 0.75, 0.8], 'tfidf__norm': ['l1', 'l2'], 'logreg__C': [1, 2, 4], 'logreg__penalty': ['l1', 'l2']},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

### Scoring my model

In [170]:
gs.score(X_train[1],y_train[1])

0.7427230046948357

In [171]:
gs.score(X_test[1],y_test[1])

0.6309859154929578

My model did not perform too well. With a train accuracy of about 0.74 and a test accuracy of about 0.63, it appears to be somewhat overfit, but not too badly.

In [172]:
gs.best_params_

{'logreg__C': 2,
 'logreg__penalty': 'l1',
 'tfidf__max_df': 0.7,
 'tfidf__min_df': 1,
 'tfidf__norm': 'l2'}

### Logistic Regression on a count vectorized dataframe

Count vectorized data is simply a count of how many times each word appears in the document

In [173]:
pipe = Pipeline([
    ('cvec', CountVectorizer(stop_words=custom_stopwords)),
    ('logreg', LogisticRegression(solver='liblinear')),  # liblinear handles both the l1 and l2 penalties
])

After extensive grid searching, I found these parameters to be the best.

cvec__min_df: 3, so terms that appear in fewer than 3 documents are ignored

cvec__max_df: 0.85, so terms that appear in over 85% of the documents are ignored

cvec__ngram_range: (1, 1), so the model only considers each word by itself as opposed to pairs of consecutive words

logreg__C: 0.1, the inverse of the regularization strength (smaller values mean stronger regularization)

logreg__penalty: l2, so the coefficients are regularized with the l2 norm
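The two n-gram settings in the search can be compared with a small sketch like the one below. The example sentence is made up; it only shows which tokens each `ngram_range` produces.

```python
from sklearn.feature_extraction.text import CountVectorizer

# One made-up document, just to show which tokens each setting creates
doc = ["supreme court nominee"]

unigrams = CountVectorizer(ngram_range=(1, 1)).fit(doc)
uni_and_bigrams = CountVectorizer(ngram_range=(1, 2)).fit(doc)

print(sorted(unigrams.vocabulary_))         # single words only
print(sorted(uni_and_bigrams.vocabulary_))  # single words plus consecutive pairs
```

With `(1, 2)` the vocabulary also includes "supreme court" and "court nominee", so the feature space grows quickly on a real corpus, which may be one reason the simpler `(1, 1)` setting came out ahead here.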

In [175]:
params = {
    'cvec__min_df': [1,3,5,7,10],
    'cvec__max_df': [.85,.9],
    'cvec__ngram_range': [(1,1),(1,2)],
    'logreg__C': [1,2,4,8,(1/10),(1/5)],
    'logreg__penalty': ['l1', 'l2']
}

In [176]:
gs = GridSearchCV(pipe, params)

In [177]:
gs.fit(X_train[1], y_train[1])

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('cvec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None,
        stop_words=['again', '...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'cvec__min_df': [1, 3, 5, 7, 10], 'cvec__max_df': [0.85, 0.9, 0.95, 1.0], 'cvec__ngram_range': [(1, 1), (1, 2)], 'logreg__C': [1, 2, 4, 8, 0.1, 0.2], 'logreg__penalty': ['l1', 'l2']},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

### Scoring my model

In [178]:
gs.score(X_train[1],y_train[1])

0.8065727699530516

In [179]:
gs.score(X_test[1],y_test[1])

0.6366197183098592

My model did not perform too well. With a train accuracy of about 0.81 and a test accuracy of about 0.64, it seems more overfit than my initial model.

In [180]:
gs.best_params_

{'cvec__max_df': 0.85,
 'cvec__min_df': 3,
 'cvec__ngram_range': (1, 1),
 'logreg__C': 0.1,
 'logreg__penalty': 'l2'}

### Logistic Regression on count vectorized data with sentiment analysis

Let's see if adding features for sentiment analysis will help my logistic regression model

In [198]:
params = {
    'C': [1,2,4,8,(1/10),(1/5)],
    'penalty': ['l1', 'l2']
}

In [199]:
gs = GridSearchCV(LogisticRegression(solver='liblinear'), params)  # liblinear handles both the l1 and l2 penalties

The data was tokenized via count vectorizing in the pre-processing notebook

In [200]:
gs.fit(X_train_sen,y_train_sen[1])

GridSearchCV(cv=None, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'C': [1, 2, 4, 8, 0.1, 0.2], 'penalty': ['l1', 'l2']},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

In [201]:
gs.score(X_train_sen,y_train_sen[1])

0.8093896713615023

In [202]:
gs.score(X_test_sen,y_test_sen[1])

0.6422535211267606

Unfortunately, adding sentiment analysis did not help my model much (test accuracy of about 0.64), but it is the best of the three. Let's examine the top coefficients.

In [203]:
weights = pd.DataFrame(data = gs.best_estimator_.coef_, columns = X_train_sen.columns)

In [204]:
weights_abs = weights.T.abs()

In [206]:
weights_abs.sort_values(ascending=False, by = 0).head()

| term | absolute coefficient |
|---|---|
| clinton | 0.822318 |
| hillary | 0.809109 |
| walkaway | 0.759293 |
| kavanaugh | 0.74451 |
| president | 0.66657 |


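The most important features for differentiating between the subreddits are clinton, hillary, walkaway, and kavanaugh. This makes sense because we found in our EDA that Clinton is a top word in the democratic subreddit but was not found often on the republican subreddit. Note that these are absolute values, so they show how strongly a word separates the subreddits but not which subreddit it points to.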
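Ranking by absolute value hides the sign of each coefficient. A minimal sketch of ranking by magnitude while keeping the sign is below. The four documents and their labels are made up, and treating label 1 as one subreddit and 0 as the other is an assumption for illustration, not the encoding used in this notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data, made up for illustration; label 1 vs 0 stands in for the two subreddits
docs = [
    "clinton email story",
    "clinton campaign rally",
    "walkaway movement story",
    "walkaway campaign video",
]
labels = [1, 1, 0, 0]

vec = CountVectorizer()
X = vec.fit_transform(docs)
model = LogisticRegression().fit(X, labels)

# Recover the column order of the document-term matrix from the vocabulary
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
coefs = pd.Series(model.coef_[0], index=terms)

# Sort by magnitude but keep the sign: positive pushes toward label 1,
# negative pushes toward label 0
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```

Applied to the fitted model above, the same idea would mean sorting `weights.T` by its absolute values while displaying the original signed column.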

### Conclusion

All of my models scored about $63 \%$ to $64 \%$ accuracy on the test set. This is only about $13$ percentage points better than a coin flip, assuming the two subreddits are roughly balanced in the data. Since my models are having a hard time differentiating between the two subreddits, this is consistent with the hypothesis that democrats and republicans discuss the same subjects. We will try another modeling technique known as K-nearest neighbors next, to see if a non-parametric model will have a higher accuracy score.
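The comparison between the three models can be summarized with a little arithmetic on the scores reported above. The model names are just labels chosen for this sketch, and the 0.5 baseline assumes balanced classes, which this notebook does not verify.

```python
# (train accuracy, test accuracy) as reported by gs.score in this notebook
scores = {
    "tfidf": (0.7427230046948357, 0.6309859154929578),
    "cvec": (0.8065727699530516, 0.6366197183098592),
    "cvec + sentiment": (0.8093896713615023, 0.6422535211267606),
}

# Assumption: the two classes are balanced, so random guessing scores 0.5
baseline = 0.5

summary = {}
for name, (train_acc, test_acc) in scores.items():
    summary[name] = {
        "gap": train_acc - test_acc,  # larger gap suggests more overfitting
        "lift": test_acc - baseline,  # improvement over guessing
    }
    print(f"{name:<17} gap={summary[name]['gap']:.3f}  lift={summary[name]['lift']:.3f}")
```

If the classes turn out not to be balanced, the majority-class share would be a fairer baseline than 0.5.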