# Modelling Notebook
## Notebook Goal
With all the data prepared and ready to go, it's time to start modelling. That means vectorizing the text, grid searching over several models and their hyperparameters, and settling on a final predictive model. The parameter choices will focus heavily on preventing overfitting so that the model generalizes well, because overfitting is the biggest risk here: an overfit model can be very good at predicting documents that closely resemble its training data and much worse on data it hasn't seen before. Put another way, an overfit model loses some of its ability to capture negativity because it gives certain non-negative words more weight than it should.

## Workflow

**1)** Create a baseline model to compare the other models against. <br>
**2)** Set up a separate section for each of the four training datasets. <br>
**3)** Transform each dataset using either CountVectorizer or TfidfVectorizer. <br>
**4)** Try the following models on the data: Logistic Regression, Naive Bayes, and a Recurrent Neural Network. <br>
**5)** Choose the best model from each dataset and create a combined prediction from all of them. <br>
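
The last step of the workflow does not say how the per-dataset predictions are combined. One simple option is a majority vote. The sketch below assumes each model outputs 0/1 labels for the same test documents; `majority_vote` and the toy arrays are hypothetical names and values used only for illustration, not part of the notebook.

```python
import numpy as np

def majority_vote(prediction_sets):
    """Combine 0/1 predictions from several models by simple majority.

    prediction_sets: list of equal-length 1-D arrays of 0/1 labels,
    one array per model. Ties go to class 1 (an arbitrary choice).
    """
    stacked = np.vstack(prediction_sets)   # shape: (n_models, n_samples)
    share_positive = stacked.mean(axis=0)  # fraction of models voting 1
    return (share_positive >= 0.5).astype(int)

# Toy predictions from three hypothetical models on four documents
model_a = np.array([1, 0, 1, 0])
model_b = np.array([1, 1, 0, 0])
model_c = np.array([0, 1, 1, 0])

combined = majority_vote([model_a, model_b, model_c])
print(combined)
```

With an odd number of models there are no ties, so the tie-breaking rule only matters if an even number of models is combined.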

In [22]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk

from nltk.corpus import stopwords
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from keras.models import Sequential
from keras.layers import Dense, Dropout, GRU
from keras.optimizers import Adam
from keras.preprocessing.sequence import TimeseriesGenerator
import warnings
warnings.filterwarnings("ignore") # shhhhhhhh

I'm going to define some modelling functions that will help make the process of grid searching smoother.

In [None]:
# from Useful_Functions import *

In [6]:
# Read in all the necessary data, starting with the training data
Emotion_short = pd.read_csv('../data/Training_Data/2_Cleaned_Training_Data/Cleaned_Emotion_Analyzer.csv')
Emotion_long = pd.read_csv('../data/Training_Data/2_Cleaned_Training_Data/Other_Cleaned_Emotion_Analyzer.csv')
Pos_neg = pd.read_csv('../data/Training_Data/2_Cleaned_Training_Data/Cleaned_Pos_Neg_Sentences.csv')
Word_Classifier = pd.read_csv('../data/Training_Data/1_Uncleaned_Training_Data/Andbrain_DataSet.csv')
# Now the testing data
Tester = pd.read_csv('../data/Testing_Data/4_Cleaned_Testing_Data/Final_Testing_Data.csv')

What I'll do next might seem a little confusing, but its purpose is to get the data into the form that some of the later models expect: I'm going to change the value `0` to `-1` in the target column of each of the data sets.
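
The helper `zero_to_neg_one` is not defined anywhere in this notebook (it presumably lives in the commented-out `Useful_Functions` import), so its real definition is unknown. A minimal sketch of what it might look like, assuming it simply maps `0` to `-1` and leaves other labels alone:

```python
import pandas as pd

def zero_to_neg_one(value):
    """Map the label 0 to -1 and leave every other value unchanged.

    Assumed behaviour; the original helper is not shown in the notebook.
    """
    return -1 if value == 0 else value

# Toy target column standing in for a 'Negativity' column
labels = pd.Series([0, 1, 1, 0], name='Negativity')
converted = labels.apply(zero_to_neg_one)
print(converted.tolist())
```

If the training targets are recoded this way, any labels they are compared against would need the same recoding.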

In [7]:
Emotion_short['Negativity'] = Emotion_short['Negativity'].apply(zero_to_neg_one)
Emotion_long['Negativity'] = Emotion_long['Negativity'].apply(zero_to_neg_one)
Pos_neg['Negativity'] = Pos_neg['Negativity'].apply(zero_to_neg_one)

NameError: name 'zero_to_neg_one' is not defined

In [None]:
# Grid search using Logistic Regression on count-vectorized text
def log_cvec(pipe_param, X_train, y_train, X_test, y_test):
    pipe = Pipeline([('cvec', CountVectorizer()),
                     ('lr', LogisticRegression())])

    gs = GridSearchCV(pipe, param_grid=pipe_param, cv=5)
    gs.fit(X_train, y_train)
    train_predictions = gs.predict(X_train)
    test_predictions = gs.predict(X_test)
    print(f'The best score was: {gs.best_score_}')
    print(f'The accuracy score for your training data was: {accuracy_score(y_train, train_predictions)}')
    print(f'The accuracy score for your testing data was: {accuracy_score(y_test, test_predictions)}')
    print(f'The best parameters were: {gs.best_params_}')
    # Rows are the actual classes, columns the predicted classes
    return pd.DataFrame(confusion_matrix(y_test, test_predictions),
                        index=['Actual Rational', 'Actual Irrational'],
                        columns=['Predicted Rational', 'Predicted Irrational'])

# Grid search using Naive Bayes on count-vectorized text
# (takes the same arguments as log_cvec so both can be called the same way)
def nae_vec(pipe_param, X_train, y_train, X_test, y_test):
    pipe = Pipeline([('cvec', CountVectorizer()),
                     ('nb', MultinomialNB())])

    gs = GridSearchCV(pipe, param_grid=pipe_param, cv=5)
    gs.fit(X_train, y_train)
    train_predictions = gs.predict(X_train)
    test_predictions = gs.predict(X_test)
    print(f'The best score for the grid search was: {gs.best_score_}')
    print(f'The accuracy score for your training data was: {accuracy_score(y_train, train_predictions)}')
    print(f'The accuracy score for your testing data was: {accuracy_score(y_test, test_predictions)}')
    print(f'The best parameters were: {gs.best_params_}')
    return pd.DataFrame(confusion_matrix(y_test, test_predictions),
                        index=['Actual Rational', 'Actual Irrational'],
                        columns=['Predicted Rational', 'Predicted Irrational'])

# TODO: rnn_cvec() for the recurrent neural network has not been written yet

In [8]:
Tester.head()

| | Unnamed: 0 | Text | Irrational |
|---|---|---|---|
| 0 | 0 | oh of course | 0 |
| 1 | 1 | lately i ve been having these attack that are ... | 0 |
| 2 | 2 | well it becomes a total preoccupation i can t ... | 1 |
| 3 | 3 | patrick that s my husband he wa late he lost h... | 1 |
| 4 | 4 | well somehow i finally got myself together and... | 1 |


In [9]:
Tester.drop('Unnamed: 0', axis=1, inplace=True)

In [10]:
Tester.Irrational.value_counts()

1    259
0    223
Name: Irrational, dtype: int64

The test set is fairly balanced, which matters when judging predictions. It also sets the bar: any model has to beat the majority-class baseline of 259/482, or 53.7%.
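
The baseline figure can be reproduced directly from the class counts. This is a small sketch that hard-codes the counts from the `value_counts()` output above rather than reading the CSV; the variable names are illustrative.

```python
import pandas as pd

# Class counts copied from the value_counts() output for the 'Irrational' column
counts = pd.Series({1: 259, 0: 223}, name='Irrational')

# A majority-class baseline always predicts the most frequent label
majority_class = counts.idxmax()
baseline_accuracy = counts.max() / counts.sum()

print(f'Majority class: {majority_class}')
print(f'Baseline accuracy: {baseline_accuracy:.3f}')
```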

In [11]:
X_test = Tester['Text']
y_test = Tester['Irrational']

## Short Emotion Data
I called it "short" because the original emotion listing had only 6 emotions, compared to 12 in the other emotion classifier. Let's do some quick summary stats to get a feel for what will happen in the modelling stage.

In [12]:
Emotion_short.head()

| | Unnamed: 0 | Sentences | Negativity |
|---|---|---|---|
| 0 | 0 | i just feel really helpless and heavy hearted | 1 |
| 1 | 1 | ive enjoyed being able to slouch about relax a... | 1 |
| 2 | 2 | i gave up my internship with the dmrg and am f... | 1 |
| 3 | 3 | i dont know i feel so lost | 1 |
| 4 | 4 | i am a kindergarten teacher and i am thoroughl... | 1 |


In [13]:
Emotion_short.drop('Unnamed: 0', axis=1, inplace=True)

In [14]:
Emotion_short.shape

(10000, 2)

In [15]:
Emotion_short.Negativity.value_counts()

1    5451
0    4549
Name: Negativity, dtype: int64

Okay, there's a decent amount of data here and a good number of examples from both classes. So let's move on to modelling.

## Count Vectorized Short Emotion Data

In [17]:
X_train = Emotion_short['Sentences']
y_train = Emotion_short['Negativity']
pipe_param =  { 'cvec__stop_words':['english'],
                'cvec__min_df': [0, .001],
                'cvec__max_df': [.999],
                'cvec__ngram_range': [(1, 3),(1, 6)],
                'lr__C': [.01],
                'lr__penalty': ['l2']}
pipe = Pipeline([('cvec', CountVectorizer()),
                 ('lr', LogisticRegression())])

gs = GridSearchCV(pipe, param_grid=pipe_param, cv=5)
gs.fit(X_train, y_train)
train_predictions = gs.predict(X_train)
test_predictions = gs.predict(X_test)
print(train_predictions.shape)
print(test_predictions.shape)
print(y_train.shape)
print(y_test.shape)
print(f'The best score was: {gs.best_score_}')
print(f'The accuracy score for your training data was: {accuracy_score(train_predictions, y_train)}')
print(f'The accuracy score for your testing data was: {accuracy_score(test_predictions, y_test)}')
print(f'The best parameters were: {gs.best_params_}')

(10000,)
(482,)
(10000,)
(482,)
The best score was: 0.7715
The accuracy score for your training data was: 0.8933
The accuracy score for your testing data was: 0.5394190871369294
The best parameters were: {'cvec__max_df': 0.999, 'cvec__min_df': 0, 'cvec__ngram_range': (1, 3), 'cvec__stop_words': 'english', 'lr__C': 0.01, 'lr__penalty': 'l2'}


In [21]:
pd.DataFrame(confusion_matrix(pd.Series(y_test), pd.Series(test_predictions)),
                                     index=['Actual Rational', 'Actual Irrational'], 
                                     columns=['Predicted Rational', 'Predicted Irrational'])

| | Predicted Rational | Predicted Irrational |
|---|---|---|
| Actual Rational | 22 | 201 |
| Actual Irrational | 21 | 238 |
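
The confusion matrix says more than the single accuracy number. As a rough sketch, the counts above can be turned into overall accuracy, per-class recall, and the share of predictions going to each class. The counts are copied from the matrix; the variable names are illustrative.

```python
import numpy as np

# Counts copied from the confusion matrix above
# rows = actual (Rational, Irrational), columns = predicted (Rational, Irrational)
cm = np.array([[22, 201],
               [21, 238]])

accuracy = np.trace(cm) / cm.sum()                 # correct predictions / all predictions
recall_per_class = cm.diagonal() / cm.sum(axis=1)  # correct / actual, per class
predicted_share = cm.sum(axis=0) / cm.sum()        # how often each class is predicted

print(f'Accuracy: {accuracy:.3f}')
print(f'Recall (Rational, Irrational): {recall_per_class.round(3)}')
print(f'Share of predictions (Rational, Irrational): {predicted_share.round(3)}')
```

Per-class recall is worth checking alongside accuracy here, since a model that predicts one class for nearly every document can still land close to the majority-class baseline.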


### Logistic Regression

In [None]:
X_train = Emotion_short['Sentences']
y_train = Emotion_short['Negativity']
pipe_param =  { 'cvec__stop_words':['english'],
                'cvec__min_df': [0, .001],
                'cvec__max_df': [.999],
                'cvec__ngram_range': [(1, 3),(1, 6)],
                'lr__C': [.01],
                'lr__penalty': ['l2']}

In [None]:
log_cvec(pipe_param, X_train, y_train, X_test, y_test)

### Naive Bayes 

In [None]:
pipe_param = {'cvec__stop_words': ['english'],
              'cvec__min_df': [0],
              'cvec__max_df': [.98],
              'cvec__ngram_range': [(1, 15), (1, 10)],
              'nb__alpha': [2]}

In [None]:
nae_vec(pipe_param, X_train, y_train, X_test, y_test)

### Neural Network

## TfidfVectorizer Short Emotion Data
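
The workflow names TfidfVectorizer as the alternative to CountVectorizer. As a hedged sketch of how the same grid search setup might look with that swap, the example below runs on a tiny made-up corpus rather than the emotion data. The step name `tvec`, the parameter grid, and the toy sentences are all assumptions for illustration, not the notebook's settings or results.

```python
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus: 1 = negative, 0 = not negative
toy_texts = ['i feel lost and helpless',
             'i feel hopeless and sad',
             'everything is ruined forever',
             'i am so anxious and sad',
             'what a lovely sunny day',
             'i enjoyed a relaxing walk',
             'the meeting went well today',
             'dinner with friends was lovely']
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Same pipeline shape as the count-vectorized version, with the vectorizer swapped
tfidf_pipe = Pipeline([('tvec', TfidfVectorizer()),
                       ('lr', LogisticRegression())])

# Grid keys use the step name as a prefix: 'tvec__...' instead of 'cvec__...'
tfidf_params = {'tvec__ngram_range': [(1, 1), (1, 2)],
                'lr__C': [0.1, 1.0]}

# cv=2 only because the toy corpus is so small
tfidf_gs = GridSearchCV(tfidf_pipe, param_grid=tfidf_params, cv=2)
tfidf_gs.fit(toy_texts, toy_labels)

toy_predictions = tfidf_gs.predict(['i feel sad and lost', 'a lovely walk today'])
print(tfidf_gs.best_params_)
print(toy_predictions)
```

Only the vectorizer step and the parameter prefixes change, so the rest of the grid search code can stay as it is.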

### Logistic Regression

### Naive Bayes 

### Neural Network

## Count Vectorized Long Emotion Data

### Logistic Regression

### Naive Bayes 

### Neural Network

## TfidfVectorizer Long Emotion Data

### Logistic Regression

### Naive Bayes 

### Neural Network

## Count Vectorized Positive and Negative Sentences

### Logistic Regression

### Naive Bayes 

### Neural Network

## TfidfVectorizer Positive and Negative Sentences

### Logistic Regression

### Naive Bayes 

### Neural Network