## Modeling Reddit Data
This notebook contains the machine learning models to see if we can the models to identify the subreddit of each title. We will be optimizing the following classifications models and measuring their optimal performance:
    - LogisticRegression
    - RandomForest
    - AdaBoostClassifier


In [30]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import stop_words, text
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, auc

### Data Preprocessing    
    We will import the data, preprocess the data via CountVectorizer, remove additional StopWords, and train/test split the data.

In [3]:
df = pd.read_csv('reddit_data_learned.csv')

df.head(3)

Unnamed: 0,title,is_data_beautiful
0,[Battle] DataViz Battle for the month of Decem...,1
1,[Topic][Open] Open Discussion Monday — Anybody...,1
2,Submitted my thesis today. Here's what the chu...,1


In [4]:
df.isnull().sum()

title                0
is_data_beautiful    0
dtype: int64

In [5]:
X = df['title']
y = df['is_data_beautiful']


X_train, X_test, y_train, y_test = train_test_split(X,y,
                                                    random_state=42,
                                                    stratify=y)

We will remove key words that identifier that is specific to each subreddit to add more challenge and to be able to get better data for EDA. This also ensures that we are buildiong out a process that can be scaled in other situations.

In [6]:
stop_words = text.ENGLISH_STOP_WORDS.union(['oc', 'til', 'OC', 'TIL'])

In [7]:
cv = CountVectorizer(stop_words=stop_words, ngram_range = (1,1))

In [8]:
cv.fit(X_train)
X_train_cv = cv.transform(X_train)
X_test_cv = cv.transform(X_test)

The baseline accuracy for the model is 50.8%, anything above that will be an improvement.

In [9]:
y_test.value_counts(normalize=True)

1    0.508526
0    0.491474
Name: is_data_beautiful, dtype: float64

### Modeling
    We will be running a GridSearch for LogisticRegression, RandomForest, and AdaBoostClassifier.


#### Logistic Regression

    GridSearching the best paramenters for LogisticRegression - Ridge

In [10]:
lr = LogisticRegressionCV()
gr_params_lr = {
    'Cs': [1,5,8,10,12,14,15,16,17,20,25,30]
}
gs_lr = GridSearchCV(lr, param_grid=gr_params_lr)
gs_lr.fit(X_train_cv, y_train)
print(gs_lr.best_score_)
gs_lr.best_params_

0.9588491134158581


{'Cs': 15}

    GridSearching the best paramenters for LogisticRegression - Lasso

In [12]:
lr = LogisticRegressionCV()
gr_params_lr = {
    'penalty': ['l1'],
    'solver':['liblinear'],
    'Cs': [6,8,10,12,14,16,18,20]
}
gs_lr = GridSearchCV(lr, param_grid=gr_params_lr)
gs_lr.fit(X_train_cv, y_train)
print(gs_lr.best_score_)
gs_lr.best_params_

0.9625292740046838


{'Cs': 8, 'penalty': 'l1', 'solver': 'liblinear'}

    Since the Lasso GridSearch gave us the best results for LogisticRegression, we will be using those parameters on a basic LogisticRegression to be able to extract all the features and use them in the EDA later.

In [10]:
lr = LogisticRegressionCV(Cs=8, penalty='l1', solver= 'liblinear')
lr.fit(X_train_cv, y_train)
lr.score(X_train_cv, y_train)

1.0

In [11]:
lr.score(X_test_cv, y_test)

0.9849548645937813

In [21]:
predictions = lr.predict(X_test_cv)
cm = confusion_matrix(y_test, predictions)
cm_df = pd.DataFrame(cm, 
                     columns = ['predicted neg', 'predic pos'],
                    index = ['actual neg', 'actual pos'])
cm_df

Unnamed: 0,predicted neg,predic pos
actual neg,475,15
actual pos,0,507


#### Random Forest

    We will run a GridSearch on the RandomForest to see how well it will perform.

In [22]:
rf = RandomForestClassifier()
gr_params = {
    'n_estimators': [3,5,10,30,40,50,100,150],
    'criterion': ['gini', 'entropy'],
    'max_depth':[None, 3,5,7,10]
    
}
gs = GridSearchCV(rf, gr_params)
gs.fit(X_train_cv, y_train)
print(gs.best_score_)
gs.best_params_

0.9474740715958515


{'criterion': 'entropy', 'max_depth': None, 'n_estimators': 40}

In [23]:
gs.score(X_train_cv, y_train)

0.9996654399464704

In [24]:
gs.score(X_test_cv, y_test)

0.9889669007021064

In [25]:
predictions = gs.predict(X_test_cv)
cm = confusion_matrix(y_test, predictions)
cm_df = pd.DataFrame(cm, 
                     columns = ['predicted neg', 'predic pos'],
                    index = ['actual neg', 'actual pos'])
cm_df

Unnamed: 0,predicted neg,predic pos
actual neg,481,9
actual pos,2,505


#### AdaBoostClassifier

We are going to run grid search through AdaBoostClassifier to see if we can improve our performance and to see if LogisticRegression or DecisionTreeClassifier would make a better weak learner.

In [26]:
ada = AdaBoostClassifier()
gr_params = {
    
    'base_estimator': [DecisionTreeClassifier(max_depth=1), LogisticRegressionCV()],
    'n_estimators': [1,10,25,50,100],
    'learning_rate': [.1,.50, .75, 1.0, 1.25, 1.50]
}

gs_ada = GridSearchCV(ada, param_grid=gr_params)

gs_ada.fit(X_test_cv, y_test)
print(gs_ada.best_score_)
gs_ada.best_params_

0.8324974924774323


{'base_estimator': LogisticRegressionCV(Cs=10, class_weight=None, cv=None, dual=False,
            fit_intercept=True, intercept_scaling=1.0, max_iter=100,
            multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
            refit=True, scoring=None, solver='lbfgs', tol=0.0001, verbose=0),
 'learning_rate': 0.1,
 'n_estimators': 1}

In [28]:
gs_ada.score(X_train_cv, y_train)

0.892271662763466

In [27]:
gs_ada.score(X_test_cv, y_test)

1.0

In [29]:
predictions = gs_ada.predict(X_test_cv)
cm = confusion_matrix(y_test, predictions)
cm_df = pd.DataFrame(cm, 
                     columns = ['predicted neg', 'predic pos'],
                    index = ['actual neg', 'actual pos'])
cm_df

Unnamed: 0,predicted neg,predic pos
actual neg,490,0
actual pos,0,507


### Model Interpretation

- LogisticRegression c_val score was 0.9632 and the test data was 0.9819
- RandomForest c_val score was 0.9491 and the test data set was 0.9880
- AdaBoostClassifier c_val score was 0.8324 and the test data was 0.8923

Both LR and RF were able to to identify the subreddit highly accurately and with ease. The LR having the highest c_val score would tell me that it is the best model, slightly beating the RF. AdaBoost model's performance was surprisingly the worst and given more time was provided I would take a deeper dive to see if we can improve its performance, but with high performance from the other models and limited time we will put the AdaBoost on the bench.  

### Exportation of data for further EDA

Exporting the coefficients of LogisticRegression for further exploration

In [21]:
coef_df = pd.DataFrame({
    'coef':cv.get_feature_names(),
    'val':lr.coef_[0]}).to_csv('Data/coef_exploration.csv', index = False)

Exporting the model input for further exploration

In [22]:
model_inputs = pd.DataFrame(X_train_cv.toarray(), columns=cv.get_feature_names())
model_inputs['is_data_beautiful'] = y_train
model_inputs.to_csv('Data/model_input.csv', index = False)