# Modelling Notebook
## Notebook Goal
With all the data prepared, it's time to start modelling. That means vectorizing the text, grid searching over a few models and their hyperparameters, and settling on a final predictive model.

This is what the workflow is going to look like: 

1) Give each of the four training datasets its own section.
2) Transform each dataset using either CountVectorizer or TfidfVectorizer.
3) Try the following models on the data: Logistic Regression, Naive Bayes, and a Recurrent Neural Network.
4) Choose the best model for each dataset and combine their outputs into a single prediction.
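
Step 4 leaves open how the per-dataset predictions get combined. One simple possibility is a majority vote over the binary predictions. The sketch below assumes every model outputs 0/1 labels for the same test rows; the function name `majority_vote` and the toy arrays are hypothetical and not taken from this notebook.

```python
import numpy as np

def majority_vote(predictions):
    """Combine binary (0/1) predictions from several models by majority vote.

    `predictions` is a list of equal-length 1-D arrays, one per model.
    Ties (possible with an even number of models) resolve to 1 here,
    which is an arbitrary choice made for this sketch.
    """
    stacked = np.vstack(predictions)      # shape: (n_models, n_samples)
    vote_share = stacked.mean(axis=0)     # fraction of models voting 1
    return (vote_share >= 0.5).astype(int)

# Hypothetical predictions from three models on five test rows
preds_a = np.array([1, 0, 1, 1, 0])
preds_b = np.array([1, 1, 0, 1, 0])
preds_c = np.array([0, 0, 1, 1, 0])

combined = majority_vote([preds_a, preds_b, preds_c])
print(combined)
```

With four training datasets, ties are possible, so the tie-breaking rule would need a deliberate choice (or a weighted vote) rather than the arbitrary one used here.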

In [30]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
import gensim
from nltk.corpus import stopwords
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.naive_bayes import MultinomialNB
import warnings
warnings.filterwarnings("ignore") # shhhhhhhh

In [8]:
# Read in all the necessary data, starting with the training data
Emotion_short = pd.read_csv('../data/Training_Data/2_Cleaned_Training_Data/Cleaned_Emotion_Analyzer.csv')
Emotion_long = pd.read_csv('../data/Training_Data/2_Cleaned_Training_Data/Other_Cleaned_Emotion_Analyzer.csv')
Pos_neg = pd.read_csv('../data/Training_Data/2_Cleaned_Training_Data/Cleaned_Pos_Neg_Sentences.csv')
Word_Classifier = pd.read_csv('../data/Training_Data/1_Uncleaned_Training_Data/Andbrain_DataSet.csv')
# Now the testing data
Tester = pd.read_csv('../data/Testing_Data/4_Cleaned_Testing_Data/Final_Testing_Data.csv')

In [60]:
# Grid search over a CountVectorizer + Logistic Regression pipeline
def log_cvec(pipe_param):
    # Relies on the notebook-level X_train, y_train, X_test and y_test.
    # liblinear supports both the 'l1' and 'l2' penalties searched in this notebook.
    pipe = Pipeline([('cvec', CountVectorizer()),
                     ('lr', LogisticRegression(solver='liblinear'))])

    gs = GridSearchCV(pipe, param_grid=pipe_param, cv=5)
    gs.fit(X_train, y_train)
    train_predictions = gs.predict(X_train)
    test_predictions = gs.predict(X_test)
    print(f' The best score was: {gs.best_score_}')
    # accuracy_score expects (y_true, y_pred)
    print(f'The accuracy score for your training data was: {accuracy_score(y_train, train_predictions)}')
    print(f'The accuracy score for your testing data was: {accuracy_score(y_test, test_predictions)}')
    print(f'The best parameters were: {gs.best_params_}')
    # confusion_matrix sorts the labels, so row/column 0 is class 0 and row/column 1 is class 1
    return pd.DataFrame(confusion_matrix(y_test, test_predictions),
                        index=['Actual Irrational', 'Actual Rational'],
                        columns=['Predicted Irrational', 'Predicted Rational'])

# Grid search over a CountVectorizer + Multinomial Naive Bayes pipeline
def nae_vec(pipe_param):
    # Relies on the notebook-level X_train, y_train, X_test and y_test.
    pipe = Pipeline([('cvec', CountVectorizer()),
                     ('nb', MultinomialNB())])

    gs = GridSearchCV(pipe, param_grid=pipe_param, cv=5)
    gs.fit(X_train, y_train)
    train_predictions = gs.predict(X_train)
    test_predictions = gs.predict(X_test)
    print(f'The best score for the grid search was: {gs.best_score_}')
    # accuracy_score expects (y_true, y_pred)
    print(f'The accuracy score for your training data was: {accuracy_score(y_train, train_predictions)}')
    print(f'The accuracy score for your testing data was: {accuracy_score(y_test, test_predictions)}')
    print(f'The best parameters were: {gs.best_params_}')
    # confusion_matrix sorts the labels, so row/column 0 is class 0 and row/column 1 is class 1
    return pd.DataFrame(confusion_matrix(y_test, test_predictions),
                        index=['Actual Irrational', 'Actual Rational'],
                        columns=['Predicted Irrational', 'Predicted Rational'])
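
Both helpers hard-code CountVectorizer, while the workflow also mentions TF-IDF. One way to compare the two in a single search is to treat the vectorizer step itself as a hyperparameter. The sketch below is only an illustration of that idea: the step name `vec`, the toy corpus and its labels are made up, and `cv=2` is used purely because the toy data is tiny.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up corpus, only here so the sketch runs end to end
texts = ["i feel lost and sad", "what a lovely sunny day",
         "everything is hopeless", "i am so happy today",
         "i feel awful and alone", "this is wonderful news",
         "nothing ever goes right", "we had a great time"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

pipe = Pipeline([("vec", CountVectorizer()),
                 ("lr", LogisticRegression(solver="liblinear"))])

# The whole "vec" step is searched over, so raw counts and TF-IDF
# weights are compared within one grid search.
param_grid = {"vec": [CountVectorizer(), TfidfVectorizer()],
              "vec__ngram_range": [(1, 1), (1, 2)]}

gs = GridSearchCV(pipe, param_grid=param_grid, cv=2)
gs.fit(texts, labels)

best_vectorizer = type(gs.best_params_["vec"]).__name__
print(best_vectorizer, gs.best_params_["vec__ngram_range"])
```

The same `param_grid` pattern should carry over to the real `X_train` and `y_train`, with `cv=5` as in the helpers above.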

In [11]:
Tester.head()

| | Unnamed: 0 | Text | Irrational |
|---|---|---|---|
| 0 | 0 | oh of course | 0 |
| 1 | 1 | lately i ve been having these attack that are ... | 0 |
| 2 | 2 | well it becomes a total preoccupation i can t ... | 1 |
| 3 | 3 | patrick that s my husband he wa late he lost h... | 1 |
| 4 | 4 | well somehow i finally got myself together and... | 1 |


In [12]:
Tester.drop('Unnamed: 0', axis=1, inplace=True)

In [56]:
Tester.Irrational.value_counts()

1    259
0    223
Name: Irrational, dtype: int64

The test set is reasonably balanced (259 rows labelled 1, 223 labelled 0), which matters when judging predictions. Any model has to beat the majority-class baseline of 259/482, or about 54%.
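
The baseline arithmetic can be checked in a couple of lines. This is a small sketch that copies the class counts from the `value_counts()` output above rather than reading them from `Tester`.

```python
from collections import Counter

# Class counts copied from the value_counts() output above
counts = Counter({1: 259, 0: 223})

# A model that always predicts the most common class scores this accuracy
majority_class, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / sum(counts.values())

print(majority_class, round(baseline_accuracy, 4))  # 1 0.5373
```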

In [13]:
X_test = Tester['Text']
y_test = Tester['Irrational']  # 1-D Series, the same shape as y_train

## Short Emotion Data
I call this one "short" because its original emotion listing had only 6 emotions, compared with 12 in the other emotion classifier.

In [10]:
Emotion_short.head()

| | Sentences | Negativity |
|---|---|---|
| 0 | i just feel really helpless and heavy hearted | 1 |
| 1 | ive enjoyed being able to slouch about relax a... | 1 |
| 2 | i gave up my internship with the dmrg and am f... | 1 |
| 3 | i dont know i feel so lost | 1 |
| 4 | i am a kindergarten teacher and i am thoroughl... | 1 |


In [9]:
Emotion_short.drop('Unnamed: 0', axis=1, inplace=True)

In [34]:
Emotion_short.shape

(10000, 2)

## Logistic Regression

In [52]:
X_train = Emotion_short['Sentences']
y_train = Emotion_short['Negativity']
pipe_param = {'cvec__stop_words': ['english'],
              'cvec__min_df': [0],
              'cvec__max_df': [.999],
              'cvec__ngram_range': [(1, 3), (2, 2), (1, 15)],
              'lr__C': [1],
              'lr__penalty': ['l1', 'l2']}

In [61]:
log_cvec(pipe_param)

 The best score was: 0.9497
The accuracy score for your training data was: 0.9993
The accuracy score for your testing data was: 0.495850622406639
The best parameters were: {'cvec__max_df': 0.999, 'cvec__min_df': 0, 'cvec__ngram_range': (1, 3), 'cvec__stop_words': 'english'}


| | Predicted Irrational | Predicted Rational |
|---|---|---|
| Actual Irrational | 30 | 193 |
| Actual Rational | 50 | 209 |
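
When reading this table, keep in mind that `confusion_matrix` orders rows and columns by sorted label value unless `labels=` is passed. With 0/1 targets the first row therefore holds the rows where `Irrational == 0`, which is consistent with the first row here summing to 223. The sketch below shows one way to tie the display names to the label values explicitly. It assumes that 1 means irrational, as the column name suggests; the toy labels and the `class_names` mapping are made up for illustration.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration; 1 is assumed to mean "irrational"
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 0, 1]

class_names = {0: "Rational", 1: "Irrational"}  # assumed encoding
order = [0, 1]                                  # explicit row/column order

# Passing labels= pins the order, and the names are derived from that same list
cm = confusion_matrix(y_true, y_pred, labels=order)
cm_df = pd.DataFrame(cm,
                     index=[f"Actual {class_names[k]}" for k in order],
                     columns=[f"Predicted {class_names[k]}" for k in order])
print(cm_df)
```

Deriving both the matrix order and the display names from one list keeps the two from drifting apart if the label encoding ever changes.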


## Naive Bayes 

In [54]:
pipe_param = {'cvec__stop_words': ['english'],
              'cvec__min_df': [0],
              'cvec__max_df': [.999],
              'cvec__ngram_range': [(1, 3), (2, 3), (1, 5)]}

In [62]:
nae_vec(pipe_param)

The best score for the grid search was: 0.9307
The accuracy score for your training data was: 0.9983
The accuracy score for your testing data was: 0.529045643153527
The best parameters were: {'cvec__max_df': 0.999, 'cvec__min_df': 0, 'cvec__ngram_range': (1, 3), 'cvec__stop_words': 'english'}


| | Predicted Irrational | Predicted Rational |
|---|---|---|
| Actual Irrational | 61 | 162 |
| Actual Rational | 65 | 194 |
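
The reported test accuracies can be cross-checked against the two confusion matrices, and compared with the 259/482 majority-class baseline from the test set. The sketch below copies the counts from the tables in this section; the dictionary keys are just labels chosen for the printout.

```python
# Confusion-matrix counts copied from the two tables in this section
# (rows = actual class, columns = predicted class)
matrices = {
    "logistic regression": [[30, 193], [50, 209]],
    "naive bayes": [[61, 162], [65, 194]],
}
baseline = 259 / 482  # majority-class share of the test set

accuracies = {}
for name, cm in matrices.items():
    correct = cm[0][0] + cm[1][1]            # diagonal = correct predictions
    total = sum(sum(row) for row in cm)      # all test rows
    accuracies[name] = correct / total
    print(f"{name}: accuracy={accuracies[name]:.4f}, "
          f"beats baseline: {accuracies[name] > baseline}")
```

Both values match the printed test accuracies (239/482 and 255/482), and both fall below the baseline, so neither model trained on the short emotion data beats always predicting the majority class on this test set.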
