# Modelling Notebook
## Notebook Goal
With the data prepared, it's time to start modelling. That means vectorizing the text, grid searching over several models and their parameters, and settling on a final predictive model. The parameter choices focus heavily on preventing overfitting so that the model generalizes well, because overfitting is the biggest risk here. An overly specific model will be very good at predicting documents similar to the ones it was trained on, but the testing data is labelled for irrationality rather than negativity, so I expect a lot of false positives.

## Workflow

**1)** Create a Baseline model to compare our models to. <br>
**2)** Set aside a separate section for each of the four training datasets. <br>
**3)** Transform each dataset using either CountVectorizer or TfidfVectorizer. <br>
**4)** Try the following models on the data: Logistic Regression, Naive Bayes, Random Forest, Extra Trees and AdaBoost. <br>
**5)** Choose the best model from each data set and create a combined prediction from all of them. <br>

## Table of Contents
1) **Establish Baseline Model**
- [Baseline Model](#Establish-Baseline-Model)

2) **Short Emotion Data**
- [Count Vectorized](#Count-Vectorized-Short-Emotion-Data)
- [Tfidf Vectorized](#TfidfVectorizer-Short-Emotion-Data)
    
3) **Long Emotion Data**
- [Count Vectorized](#Count-Vectorized-Long-Emotion-Data)
- [Tfidf Vectorized](#TfidfVectorizer-Long-Emotion-Data)
    
4) **Positive and Negative Sentences**
- [Count Vectorized](#Count-Vectorized-Positive-and-Negative-Sentences)
- [Tfidf Vectorized](#TfidfVectorizer-Positive-and-Negative-Sentences)
    
5) **Word Predictor**
- [Model in and of itself](#Word-Emotion-Data-Set)

6) **Create Final Model**
- [Final Model](#Final-Model)
- [Pickle](#Saving-our-models)

I've defined a lot of modelling functions in a separate .py file to keep this notebook clean, so we import them here. That file also contains all the other imports we need, so the wildcard import brings those in too.

In [251]:
from Useful_Functions import *
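
The helpers used below (`log_cvec`, `nae_vec`, `rand_for_cvec`, and so on) live in `Useful_Functions` and are not shown in this notebook. As a rough idea of what such a helper might do, here is a minimal sketch, assuming each one wraps a vectorizer and a classifier in a `Pipeline`, grid searches it, and reports scores. The name `fit_and_report`, its return format, and the toy sentences are my own assumptions, not the actual implementation.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


def fit_and_report(pipe, pipe_param, X_train, X_test, y_train, y_test, cv=5):
    """Hypothetical helper: grid search a pipeline and summarise the result."""
    gs = GridSearchCV(pipe, param_grid=pipe_param, cv=cv)
    gs.fit(X_train, y_train)
    return {
        'best_score': gs.best_score_,                  # mean cross-validated accuracy
        'train_accuracy': gs.score(X_train, y_train),  # accuracy on the training data
        'test_accuracy': gs.score(X_test, y_test),     # accuracy on the held-out data
        'best_params': gs.best_params_,
        # rows are actual labels, columns are predicted labels, ordered -1 then 1
        'confusion': confusion_matrix(y_test, gs.predict(X_test), labels=[-1, 1]),
    }


# Toy data (made up for illustration): 1 = negative, -1 = not negative
X_train = ['i love this sunny day', 'what a wonderful happy morning',
           'great food and lovely friends', 'i enjoy a happy walk',
           'lovely sunny weather today', 'wonderful friends make me happy',
           'i hate this awful day', 'what a terrible sad morning',
           'awful food and rude people', 'i feel sad and angry',
           'terrible rainy weather today', 'rude people make me angry']
y_train = [-1] * 6 + [1] * 6
X_test = ['happy sunny day', 'sad awful day']
y_test = [-1, 1]

pipe = Pipeline([('cvec', CountVectorizer()), ('lr', LogisticRegression())])
pipe_param = {'cvec__stop_words': ['english'],
              'cvec__ngram_range': [(1, 3)],
              'lr__C': [0.005, 1.0]}

report = fit_and_report(pipe, pipe_param, X_train, X_test, y_train, y_test, cv=3)
print(report['best_params'])
```

The `step__parameter` keys in `pipe_param` route each value to the named pipeline step, which is the same convention the parameter grids in this notebook use.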

In [197]:
# Read in all the necessary data, starting with the training data
Emotion_short = pd.read_csv('../data/Training_Data/2_Cleaned_Training_Data/Cleaned_Emotion_Analyzer.csv', index_col=0)
Emotion_long = pd.read_csv('../data/Training_Data/2_Cleaned_Training_Data/Other_Cleaned_Emotion_Analyzer.csv', index_col=0)
Pos_neg = pd.read_csv('../data/Training_Data/2_Cleaned_Training_Data/Cleaned_Pos_Neg_Sentences.csv', index_col=0)
Word_Classifier = pd.read_csv('../data/Training_Data/1_Uncleaned_Training_Data/Andbrain_DataSet.csv', index_col=0)
# Now the testing data
Tester = pd.read_csv('../data/Testing_Data/4_Cleaned_Testing_Data/Final_Testing_Data.csv', index_col=0)

What I'll do next might seem a little confusing, but its purpose is to make the labels fit some of the later models. I'm going to change the value `0` to `-1` in the target column of each dataset. The final model takes the sign of a weighted sum of predictions, which needs labels of `-1` and `1`.

In [198]:
Emotion_short['Negativity'] = Emotion_short['Negativity'].apply(zero_to_neg_one)
Emotion_long['Negativity'] = Emotion_long['Negativity'].apply(zero_to_neg_one)
Pos_neg['Negativity'] = Pos_neg['Negativity'].apply(zero_to_neg_one)
Tester['Irrational'] = Tester['Irrational'].apply(zero_to_neg_one)
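
`zero_to_neg_one` comes from `Useful_Functions` and its definition is not shown here. A minimal sketch of what it presumably does, assuming it maps `0` to `-1` and leaves every other label untouched (the real definition may differ):

```python
def zero_to_neg_one(value):
    """Assumed behaviour: relabel 0 as -1 and pass other labels through."""
    return -1 if value == 0 else value


labels = [0, 1, 1, 0]
relabelled = [zero_to_neg_one(v) for v in labels]
print(relabelled)
```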

In [199]:
Emotion_long.dropna(inplace=True)

In [200]:
Pos_neg.dropna(inplace=True)

## Establish Baseline Model 

In [201]:
Tester.Irrational.value_counts()

 1    259
-1    223
Name: Irrational, dtype: int64

#### Our baseline accuracy is 259/482, or 53.7%: what we would score by always predicting the majority class (irrational). <br>
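
As a quick check of that arithmetic, the baseline can be computed straight from the class counts printed above. This is only a sketch; the dictionary is typed in by hand rather than read from `Tester`.

```python
# Class counts copied from Tester.Irrational.value_counts()
counts = {1: 259, -1: 223}

majority_label = max(counts, key=counts.get)  # the most common class
baseline_accuracy = counts[majority_label] / sum(counts.values())
print(majority_label, round(baseline_accuracy, 4))
```

Any model that scores close to this figure on the testing data is doing no better than always guessing "irrational".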

In [202]:
X_test = Tester['Text']
y_test = Tester['Irrational']

## Short Emotion Data
I called it "short" because the original emotion listing had only 6 emotions, compared to the other emotion classifier with 12. Let's do some quick summary stats to get a feel for what will happen in the modelling stage.

In [203]:
Emotion_short.head()

Unnamed: 0,Sentences,Negativity
0,i just feel really helpless and heavy hearted,1
1,ive enjoyed being able to slouch about relax a...,1
2,i gave up my internship with the dmrg and am f...,1
3,i dont know i feel so lost,1
4,i am a kindergarten teacher and i am thoroughl...,1


In [9]:
Emotion_short.shape

(10000, 2)

In [10]:
Emotion_short.Negativity.value_counts()

 1    5451
-1    4549
Name: Negativity, dtype: int64

Okay, there's a decent amount of data here and a good number of examples from both classes, so let's move on to modelling.
The modelling process is largely an iterative experiment, so it might be worthwhile to skip to the results. I repeated the section labels to help anyone reading through keep track of where they are.

## Count Vectorized Short Emotion Data

### Logistic Regression Short Emotion

In [80]:
X_train = Emotion_short['Sentences']
y_train = Emotion_short['Negativity']
pipe_param =  { 'cvec__stop_words':['english'],
                'cvec__min_df': [1],
                'cvec__max_df': [.95],
                'cvec__ngram_range': [(1, 3)],
                'lr__C': [.005],
                'lr__penalty': ['l2']}

In [81]:
log_cvec(pipe_param, X_train, X_test, y_train, y_test)

The best score was: 0.663
The accuracy score for your training data was: 0.7789
The accuracy score for your testing data was: 0.533195020746888
The best parameters were: {'cvec__max_df': 0.95, 'cvec__min_df': 1, 'cvec__ngram_range': (1, 3), 'cvec__stop_words': 'english', 'lr__C': 0.005, 'lr__penalty': 'l2'}


Unnamed: 0,Predicted Rational,Predicted Irrational
Actual Rational,14,209
Actual Irrational,16,243
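
To read a confusion matrix like the one above, it helps to recompute the accuracy and the per-class recall from its four cells. The sketch below hard-codes the counts from this logistic regression run; the variable names are mine.

```python
# Cells copied from the table above (rows = actual, columns = predicted)
rational_as_rational, rational_as_irrational = 14, 209
irrational_as_rational, irrational_as_irrational = 16, 243

total = (rational_as_rational + rational_as_irrational
         + irrational_as_rational + irrational_as_irrational)

accuracy = (rational_as_rational + irrational_as_irrational) / total
rational_recall = rational_as_rational / (rational_as_rational + rational_as_irrational)
irrational_recall = irrational_as_irrational / (irrational_as_rational + irrational_as_irrational)

print(accuracy, rational_recall, irrational_recall)
```

The accuracy matches the reported testing score, and the two recalls show why it sits so near the baseline: the model labels almost every document irrational.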


### Naive Bayes Short Emotion

In [82]:
pipe_param =  {'cvec__stop_words':['english'],
                'cvec__min_df': [1],
                'cvec__max_df': [.90],
                'cvec__ngram_range': [(1, 3)],
              'nb__alpha': [2]}

In [83]:
nae_vec(pipe_param, X_train, X_test, y_train, y_test)

The best score for the grid search was: 0.9262
The accuracy score for your training data was: 0.9976
The accuracy score for your testing data was: 0.5311203319502075
The best parameters were: {'cvec__max_df': 0.9, 'cvec__min_df': 1, 'cvec__ngram_range': (1, 3), 'cvec__stop_words': 'english', 'nb__alpha': 2}


Unnamed: 0,Predicted Rational,Predicted Irrational
Actual Rational,54,169
Actual Irrational,57,202


### Random Forest Short Emotion

In [84]:
pipe_param =  { 'cvec__stop_words':['english'],
                'cvec__min_df': [1],
                'cvec__max_df': [.95],
                'cvec__ngram_range': [(1, 3)],
                'rf__n_estimators': [50],
                'rf__max_depth': [12, 13],}

In [85]:
rand_for_cvec(pipe_param, X_train, X_test, y_train, y_test)

The best score was: 0.5748
The accuracy score for your training data was: 0.5834
The accuracy score for your testing data was: 0.5373443983402489
The best parameters were: {'cvec__max_df': 0.95, 'cvec__min_df': 1, 'cvec__ngram_range': (1, 3), 'cvec__stop_words': 'english', 'rf__max_depth': 13, 'rf__n_estimators': 50}


Unnamed: 0,Predicted Rational,Predicted Irrational
Actual Rational,0,223
Actual Irrational,0,259


### Extra Trees Short Emotion

In [86]:
pipe_param =  { 'cvec__min_df': [1],
                'cvec__max_df': [.90],
                'cvec__ngram_range': [(1,3)],
                'et__n_estimators': [50],
                'et__max_depth': [12]}

In [87]:
extra_tree_cvec(pipe_param, X_train, X_test, y_train, y_test)

The best score was: 0.5621
The accuracy score for your training data was: 0.5635
The accuracy score for your testing data was: 0.5394190871369294
The best parameters were: {'cvec__max_df': 0.9, 'cvec__min_df': 1, 'cvec__ngram_range': (1, 3), 'et__max_depth': 12, 'et__n_estimators': 50}


Unnamed: 0,Predicted Rational,Predicted Irrational
Actual Rational,1,222
Actual Irrational,0,259


### AdaBoost Classifier Short Emotion

In [88]:
AdaBoostClassifier()

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1.0,
                   n_estimators=50, random_state=None)

In [89]:
pipe_param = {  'cvec__min_df': [1],
                'cvec__max_df': [.95],
                'cvec__ngram_range': [(1,3)],
                'ada__n_estimators': [100],
             'ada__learning_rate': [.5]}

In [90]:
adaboost_cvec(pipe_param, X_train, X_test, y_train, y_test)

The best score was: 0.8287
The accuracy score for your training data was: 0.8463
The accuracy score for your testing data was: 0.5373443983402489
The best parameters were: {'ada__learning_rate': 0.5, 'ada__n_estimators': 100, 'cvec__max_df': 0.95, 'cvec__min_df': 1, 'cvec__ngram_range': (1, 3)}


Unnamed: 0,Predicted Rational,Predicted Irrational
Actual Rational,13,210
Actual Irrational,13,246


## TfidfVectorizer Short Emotion Data

### Logistic Regression Short Emotion

In [91]:
pipe_param =  { 'tvec__min_df': [0],
                'tvec__max_df': [.95],
                'tvec__ngram_range': [(1,3)],
                'lr__C': [.005],
                'lr__penalty': ['l2']}

In [92]:
log_tfidf(pipe_param, X_train, X_test, y_train, y_test)

The best score was: 0.5451
The accuracy score for your training data was: 0.5451
The accuracy score for your testing data was: 0.5373443983402489
The best parameters were: {'lr__C': 0.005, 'lr__penalty': 'l2', 'tvec__max_df': 0.95, 'tvec__min_df': 0, 'tvec__ngram_range': (1, 3)}


Unnamed: 0,Predicted Rational,Predicted Irrational
Actual Rational,0,223
Actual Irrational,0,259


### Naive Bayes Short Emotion

In [93]:
pipe_param =  {'tvec__stop_words':['english'],
                'tvec__min_df': [0],
                'tvec__max_df': [.80],
                'tvec__ngram_range': [(1, 3)],
              'nb__alpha': [2]}

In [94]:
nae_tfidf(pipe_param, X_train, X_test, y_train, y_test)

The best score for the grid search was: 0.8809
The accuracy score for your training data was: 0.9897
The accuracy score for your testing data was: 0.5020746887966805
The best parameters were: {'nb__alpha': 2, 'tvec__max_df': 0.8, 'tvec__min_df': 0, 'tvec__ngram_range': (1, 3), 'tvec__stop_words': 'english'}


Unnamed: 0,Predicted Rational,Predicted Irrational
Actual Rational,18,205
Actual Irrational,35,224


### Random Forest Short Emotion

In [95]:
pipe_param =  { 'tvec__stop_words':['english'],
                'tvec__min_df': [1],
                'tvec__max_df': [.95],
                'tvec__ngram_range': [(1, 3)],
                'rf__n_estimators': [150, 200],
                'rf__max_depth': [12]}

In [96]:
rand_for_tfidf(pipe_param, X_train, X_test, y_train, y_test)

The best score was: 0.5576
The accuracy score for your training data was: 0.5567
The accuracy score for your testing data was: 0.5352697095435685
The best parameters were: {'rf__max_depth': 12, 'rf__n_estimators': 150, 'tvec__max_df': 0.95, 'tvec__min_df': 1, 'tvec__ngram_range': (1, 3), 'tvec__stop_words': 'english'}


Unnamed: 0,Predicted Rational,Predicted Irrational
Actual Rational,0,223
Actual Irrational,1,258


### Extra Trees Short Emotion

In [97]:
pipe_param =  { 'tvec__min_df': [1],
                'tvec__max_df': [.95],
                'tvec__ngram_range': [(1,3)],
                'et__n_estimators': [50],
                'et__max_depth': [9]}

In [98]:
extra_tree_tvec(pipe_param, X_train, X_test, y_train, y_test)

The best score was: 0.5477
The accuracy score for your training data was: 0.5459
The accuracy score for your testing data was: 0.5373443983402489
The best parameters were: {'et__max_depth': 9, 'et__n_estimators': 50, 'tvec__max_df': 0.95, 'tvec__min_df': 1, 'tvec__ngram_range': (1, 3)}


Unnamed: 0,Predicted Rational,Predicted Irrational
Actual Rational,0,223
Actual Irrational,0,259


### AdaBoost Classifier Short Emotion

In [99]:
pipe_param = {  'tvec__min_df': [1],
                'tvec__max_df': [.999, .95],
                'tvec__ngram_range': [(3,4), (1,3)],
                'ada__n_estimators': [100]}

In [100]:
adaboost_tvec(pipe_param, X_train, X_test, y_train, y_test)

The best score was: 0.8311
The accuracy score for your training data was: 0.8497
The accuracy score for your testing data was: 0.5311203319502075
The best parameters were: {'ada__n_estimators': 100, 'tvec__max_df': 0.999, 'tvec__min_df': 1, 'tvec__ngram_range': (1, 3)}


Unnamed: 0,Predicted Rational,Predicted Irrational
Actual Rational,13,210
Actual Irrational,16,243


## Count Vectorized Long Emotion Data

### Logistic Regression Long Emotion

In [157]:
X_train = Emotion_long['Sentences']
y_train = Emotion_long['Negativity']
pipe_param =  { 'cvec__stop_words':['english'],
                'cvec__min_df': [1],
                'cvec__max_df': [.95],
                'cvec__ngram_range': [(1, 3)],
                'lr__C': [.01],
                'lr__penalty': ['l2']}

In [158]:
log_cvec(pipe_param, X_train, X_test, y_train, y_test)

The best score was: 0.6861857252494244
The accuracy score for your training data was: 0.7246354566385265
The accuracy score for your testing data was: 0.5
The best parameters were: {'cvec__max_df': 0.95, 'cvec__min_df': 1, 'cvec__ngram_range': (1, 3), 'cvec__stop_words': 'english', 'lr__C': 0.01, 'lr__penalty': 'l2'}


Unnamed: 0,Predicted Rational,Predicted Irrational
Actual Rational,184,39
Actual Irrational,202,57


### Naive Bayes Long Emotion

In [103]:
pipe_param =  {'cvec__stop_words':['english'],
                'cvec__min_df': [1],
                'cvec__max_df': [.98],
                'cvec__ngram_range': [(1, 3)],
              'nb__alpha': [2]}

In [104]:
nae_vec(pipe_param, X_train, X_test, y_train, y_test)

The best score for the grid search was: 0.689357892044001
The accuracy score for your training data was: 0.9554361729342543
The accuracy score for your testing data was: 0.5954356846473029
The best parameters were: {'cvec__max_df': 0.98, 'cvec__min_df': 0, 'cvec__ngram_range': (1, 3), 'cvec__stop_words': 'english', 'nb__alpha': 2}


Unnamed: 0,Predicted Rational,Predicted Irrational
Actual Rational,189,34
Actual Irrational,161,98


### Random Forest Long Emotion

In [159]:
RandomForestClassifier()
pipe_param =  { 'cvec__stop_words':['english'],
                'cvec__min_df': [1],
                'cvec__max_df': [.95],
                'cvec__ngram_range': [(1, 3)],
                'rf__n_estimators': [100],
                'rf__max_depth': [30],}

In [160]:
rand_for_cvec(pipe_param, X_train, X_test, y_train, y_test)

The best score was: 0.6153747761575851
The accuracy score for your training data was: 0.6155026861089793
The accuracy score for your testing data was: 0.46473029045643155
The best parameters were: {'cvec__max_df': 0.95, 'cvec__min_df': 1, 'cvec__ngram_range': (1, 3), 'cvec__stop_words': 'english', 'rf__max_depth': 30, 'rf__n_estimators': 100}


Unnamed: 0,Predicted Rational,Predicted Irrational
Actual Rational,223,0
Actual Irrational,258,1


### Extra Trees Long Emotion

In [107]:
pipe_param =  { 'cvec__min_df': [1],
                'cvec__max_df': [.95],
                'cvec__ngram_range': [(1,3)],
                'et__n_estimators': [100, 50],
                'et__max_depth': [9, 13]}

In [108]:
extra_tree_cvec(pipe_param, X_train, X_test, y_train, y_test)

The best score was: 0.6150166282936812
The accuracy score for your training data was: 0.6150166282936812
The accuracy score for your testing data was: 0.46265560165975106
The best parameters were: {'cvec__max_df': 0.95, 'cvec__min_df': 1, 'cvec__ngram_range': (1, 3), 'et__max_depth': 9, 'et__n_estimators': 100}


Unnamed: 0,Predicted Rational,Predicted Irrational
Actual Rational,223,0
Actual Irrational,259,0


### AdaBoost Classifier Long Emotion

In [109]:
pipe_param = {  'cvec__min_df': [1],
                'cvec__max_df': [.999, .95],
                'cvec__ngram_range': [(3,4), (1,3)],
                'ada__n_estimators': [100]}

In [110]:
adaboost_cvec(pipe_param, X_train, X_test, y_train, y_test)

The best score was: 0.7028140189306729
The accuracy score for your training data was: 0.7179073931951906
The accuracy score for your testing data was: 0.5311203319502075
The best parameters were: {'ada__n_estimators': 100, 'cvec__max_df': 0.999, 'cvec__min_df': 1, 'cvec__ngram_range': (1, 3)}


Unnamed: 0,Predicted Rational,Predicted Irrational
Actual Rational,177,46
Actual Irrational,180,79


## TfidfVectorizer Long Emotion Data

### Logistic Regression Long Emotion

In [161]:
pipe_param =  { 'tvec__min_df': [1],
                'tvec__max_df': [.95],
                'tvec__ngram_range': [(1,3)],
                'lr__C': [.01],
                'lr__penalty': ['l2']}

In [162]:
log_tfidf(pipe_param, X_train, X_test, y_train, y_test)

The best score was: 0.6150166282936812
The accuracy score for your training data was: 0.6150166282936812
The accuracy score for your testing data was: 0.46265560165975106
The best parameters were: {'lr__C': 0.01, 'lr__penalty': 'l2', 'tvec__max_df': 0.95, 'tvec__min_df': 1, 'tvec__ngram_range': (1, 3)}


Unnamed: 0,Predicted Rational,Predicted Irrational
Actual Rational,223,0
Actual Irrational,259,0


### Naive Bayes Long Emotion

In [113]:
pipe_param =  {'tvec__stop_words':['english'],
                'tvec__min_df': [1],
                'tvec__max_df': [.95],
                'tvec__ngram_range': [(1, 3)],
              'nb__alpha': [2]}

In [114]:
nae_tfidf(pipe_param, X_train, X_test, y_train, y_test)

The best score for the grid search was: 0.6504476848298798
The accuracy score for your training data was: 0.7581478639038117
The accuracy score for your testing data was: 0.5124481327800829
The best parameters were: {'nb__alpha': 2, 'tvec__max_df': 0.95, 'tvec__min_df': 1, 'tvec__ngram_range': (1, 3), 'tvec__stop_words': 'english'}


Unnamed: 0,Predicted Rational,Predicted Irrational
Actual Rational,217,6
Actual Irrational,229,30


### Random Forest Long Emotion

In [115]:
pipe_param =  { 'tvec__stop_words':['english'],
                'tvec__min_df': [1],
                'tvec__max_df': [.95],
                'tvec__ngram_range': [(1, 3)],
                'rf__n_estimators': [150, 200],
                'rf__max_depth': [12]}

In [116]:
rand_for_tfidf(pipe_param, X_train, X_test, y_train, y_test)

The best score was: 0.6150166282936812
The accuracy score for your training data was: 0.6150166282936812
The accuracy score for your testing data was: 0.46265560165975106
The best parameters were: {'rf__max_depth': 12, 'rf__n_estimators': 150, 'tvec__max_df': 0.95, 'tvec__min_df': 1, 'tvec__ngram_range': (1, 3), 'tvec__stop_words': 'english'}


Unnamed: 0,Predicted Rational,Predicted Irrational
Actual Rational,223,0
Actual Irrational,259,0


### Extra Trees Long Emotion

In [163]:
pipe_param =  { 'tvec__min_df': [1],
                'tvec__max_df': [.95],
                'tvec__ngram_range': [(1,3)],
                'et__n_estimators': [100],
                'et__max_depth': [9]}

In [164]:
extra_tree_tvec(pipe_param, X_train, X_test, y_train, y_test)

The best score was: 0.6150166282936812
The accuracy score for your training data was: 0.6150166282936812
The accuracy score for your testing data was: 0.46265560165975106
The best parameters were: {'et__max_depth': 9, 'et__n_estimators': 100, 'tvec__max_df': 0.95, 'tvec__min_df': 1, 'tvec__ngram_range': (1, 3)}


Unnamed: 0,Predicted Rational,Predicted Irrational
Actual Rational,223,0
Actual Irrational,259,0


### AdaBoost Classifier Long Emotion

In [165]:
pipe_param = {  'tvec__min_df': [1],
                'tvec__max_df': [.95],
                'tvec__ngram_range': [(1,3)],
                'ada__n_estimators': [100]}

In [166]:
adaboost_tvec(pipe_param, X_train, X_test, y_train, y_test)

The best score was: 0.6989511383985674
The accuracy score for your training data was: 0.7177539012535176
The accuracy score for your testing data was: 0.5311203319502075
The best parameters were: {'ada__n_estimators': 100, 'tvec__max_df': 0.95, 'tvec__min_df': 1, 'tvec__ngram_range': (1, 3)}


Unnamed: 0,Predicted Rational,Predicted Irrational
Actual Rational,178,45
Actual Irrational,181,78


## Count Vectorized Positive and Negative Sentences

### Logistic Regression Positive and Negative Sentences

In [121]:
X_train = Pos_neg['Sentences']
y_train = Pos_neg['Negativity']
pipe_param =  { 'cvec__stop_words':['english'],
                'cvec__min_df': [1],
                'cvec__max_df': [.9],
                'cvec__ngram_range': [(1, 3)],
                'lr__C': [.005],
                'lr__penalty': ['l2']}

In [122]:
log_cvec(pipe_param, X_train, X_test, y_train, y_test)

The best score was: 0.6800967225435415
The accuracy score for your training data was: 0.7632585040682286
The accuracy score for your testing data was: 0.5062240663900415
The best parameters were: {'cvec__max_df': 0.9, 'cvec__min_df': 1, 'cvec__ngram_range': (1, 3), 'cvec__stop_words': 'english', 'lr__C': 0.005, 'lr__penalty': 'l2'}


Unnamed: 0,Predicted Rational,Predicted Irrational
Actual Rational,79,144
Actual Irrational,94,165


### Naive Bayes Positive and Negative Sentences

In [123]:
pipe_param =  {'cvec__stop_words':['english'],
                'cvec__min_df': [1],
                'cvec__max_df': [.95],
                'cvec__ngram_range': [(1, 3)],
              'nb__alpha': [2]}

In [124]:
nae_vec(pipe_param, X_train, X_test, y_train, y_test)

The best score for the grid search was: 0.7232951017874064
The accuracy score for your training data was: 0.9783027807731268
The accuracy score for your testing data was: 0.5912863070539419
The best parameters were: {'cvec__max_df': 0.95, 'cvec__min_df': 1, 'cvec__ngram_range': (1, 3), 'cvec__stop_words': 'english', 'nb__alpha': 2}


Unnamed: 0,Predicted Rational,Predicted Irrational
Actual Rational,95,128
Actual Irrational,69,190


### Random Forest Positive and Negative Sentences

In [125]:
pipe_param =  { 'cvec__stop_words':['english'],
                'cvec__min_df': [1],
                'cvec__max_df': [.95],
                'cvec__ngram_range': [(1, 3)],
                'rf__n_estimators': [50],
                'rf__max_depth': [8, 12],}

In [126]:
rand_for_cvec(pipe_param, X_train, X_test, y_train, y_test)

The best score was: 0.6173577753814985
The accuracy score for your training data was: 0.6701303793745711
The accuracy score for your testing data was: 0.5124481327800829
The best parameters were: {'cvec__max_df': 0.95, 'cvec__min_df': 1, 'cvec__ngram_range': (1, 3), 'cvec__stop_words': 'english', 'rf__max_depth': 12, 'rf__n_estimators': 50}


Unnamed: 0,Predicted Rational,Predicted Irrational
Actual Rational,139,84
Actual Irrational,151,108


### Extra Trees Positive and Negative Sentences

In [127]:
pipe_param =  { 'cvec__min_df': [1],
                'cvec__max_df': [.95],
                'cvec__ngram_range': [(1,3)],
                'et__n_estimators': [50],
                'et__max_depth': [9]}

In [128]:
extra_tree_cvec(pipe_param, X_train, X_test, y_train, y_test)

The best score was: 0.5926543149364442
The accuracy score for your training data was: 0.6280756788550143
The accuracy score for your testing data was: 0.4896265560165975
The best parameters were: {'cvec__max_df': 0.95, 'cvec__min_df': 1, 'cvec__ngram_range': (1, 3), 'et__max_depth': 9, 'et__n_estimators': 50}


Unnamed: 0,Predicted Rational,Predicted Irrational
Actual Rational,185,38
Actual Irrational,208,51


### AdaBoost Classifier Positive and Negative Sentences

In [129]:
pipe_param = {  'cvec__min_df': [1],
                'cvec__max_df': [.95],
                'cvec__ngram_range': [(1,3)],
                'ada__n_estimators': [50]}

In [130]:
adaboost_cvec(pipe_param, X_train, X_test, y_train, y_test)

The best score was: 0.6029800999901971
The accuracy score for your training data was: 0.6147436525830801
The accuracy score for your testing data was: 0.5
The best parameters were: {'ada__n_estimators': 50, 'cvec__max_df': 0.95, 'cvec__min_df': 1, 'cvec__ngram_range': (1, 3)}


Unnamed: 0,Predicted Rational,Predicted Irrational
Actual Rational,174,49
Actual Irrational,192,67


## TfidfVectorizer Positive and Negative Sentences

### Logistic Regression Positive and Negative Sentences

In [131]:
pipe_param =  { 'tvec__min_df': [1],
                'tvec__max_df': [.95],
                'tvec__ngram_range': [(1,3)],
                'lr__C': [.005],
                'lr__penalty': ['l1', 'l2']}

In [132]:
log_tfidf(pipe_param, X_train, X_test, y_train, y_test)

The best score was: 0.6681371107407771
The accuracy score for your training data was: 0.724798222396497
The accuracy score for your testing data was: 0.5726141078838174
The best parameters were: {'lr__C': 0.005, 'lr__penalty': 'l2', 'tvec__max_df': 0.95, 'tvec__min_df': 1, 'tvec__ngram_range': (1, 3)}


Unnamed: 0,Predicted Rational,Predicted Irrational
Actual Rational,85,138
Actual Irrational,68,191


### Naive Bayes Positive and Negative Sentences

In [133]:
pipe_param =  {'tvec__stop_words':['english'],
                'tvec__min_df': [1],
                'tvec__max_df': [.95],
                'tvec__ngram_range': [(1,3)],
              'nb__alpha': [2]}

In [134]:
nae_tfidf(pipe_param, X_train, X_test, y_train, y_test)

The best score for the grid search was: 0.7248308989314773
The accuracy score for your training data was: 0.9742835669705584
The accuracy score for your testing data was: 0.5933609958506224
The best parameters were: {'nb__alpha': 2, 'tvec__max_df': 0.95, 'tvec__min_df': 1, 'tvec__ngram_range': (1, 3), 'tvec__stop_words': 'english'}


Unnamed: 0,Predicted Rational,Predicted Irrational
Actual Rational,96,127
Actual Irrational,69,190


### Random Forest Positive and Negative Sentences

In [135]:
pipe_param =  { 'tvec__stop_words':['english'],
                'tvec__min_df': [0],
                'tvec__max_df': [.999],
                'tvec__ngram_range': [(1, 3)],
                'rf__n_estimators': [150, 200],
                'rf__max_depth': [9, 10]}

In [136]:
rand_for_tfidf(pipe_param, X_train, X_test, y_train, y_test)

The best score was: 0.6578766787569846
The accuracy score for your training data was: 0.7307126752279188
The accuracy score for your testing data was: 0.5394190871369294
The best parameters were: {'rf__max_depth': 9, 'rf__n_estimators': 200, 'tvec__max_df': 0.999, 'tvec__min_df': 0, 'tvec__ngram_range': (1, 3), 'tvec__stop_words': 'english'}


Unnamed: 0,Predicted Rational,Predicted Irrational
Actual Rational,138,85
Actual Irrational,137,122


### Extra Trees Positive and Negative Sentences

In [137]:
pipe_param =  { 'tvec__min_df': [1],
                'tvec__max_df': [.95],
                'tvec__ngram_range': [(1,3)],
                'et__n_estimators': [50],
                'et__max_depth': [9]}

In [138]:
extra_tree_tvec(pipe_param, X_train, X_test, y_train, y_test)

The best score was: 0.5808254092736006
The accuracy score for your training data was: 0.6025226285004738
The accuracy score for your testing data was: 0.45850622406639
The best parameters were: {'et__max_depth': 9, 'et__n_estimators': 50, 'tvec__max_df': 0.95, 'tvec__min_df': 1, 'tvec__ngram_range': (1, 3)}


Unnamed: 0,Predicted Rational,Predicted Irrational
Actual Rational,175,48
Actual Irrational,213,46


### AdaBoost Classifier Positive and Negative Sentences

In [139]:
pipe_param = {  'tvec__min_df': [1],
                'tvec__max_df': [.95],
                'tvec__ngram_range': [(1,3)],
                'ada__n_estimators': [100]}

In [140]:
adaboost_tvec(pipe_param, X_train, X_test, y_train, y_test)

The best score was: 0.6296768290690455
The accuracy score for your training data was: 0.6530732281148907
The accuracy score for your testing data was: 0.508298755186722
The best parameters were: {'ada__n_estimators': 100, 'tvec__max_df': 0.95, 'tvec__min_df': 1, 'tvec__ngram_range': (1, 3)}


Unnamed: 0,Predicted Rational,Predicted Irrational
Actual Rational,142,81
Actual Irrational,156,103


## Word Emotion Data Set

This dataset gives each word a weight based on how likely it is to occur in a sentence expressing a certain emotion. I'm going to search each sentence in the testing dataset for these words, add up their weights, and determine whether the sentence has a negative emotion or not.

In [141]:
Word_Classifier.head()

Unnamed: 0_level_0,disgust,surprise,neutral,anger,sad,happy,fear
word,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
ability,0.004464,0.047832,0.000638,0.023597,0.013393,0.015944,0.040179
able,1.7e-05,0.000182,0.000409,0.000176,0.000219,0.000244,0.000186
abuse,0.000532,0.000177,0.000177,0.137363,0.001241,0.001595,0.002659
academy,0.007143,0.021429,0.007143,0.007143,0.007143,0.092857,0.035714
accept,0.008271,0.006767,0.000752,0.048872,0.018797,0.024812,0.038346


This next function is a model in and of itself. It can be used on any word-based data to help determine whether it's overall positive or negative.

In [205]:
def word_class_model(df, df_sentences, word_df_name):
    """Label each sentence 1 (negative) or -1 (not negative) from summed word weights.

    df: DataFrame holding the sentences; df_sentences: name of its text column;
    word_df_name: word-weight DataFrame indexed by word.
    """
    pos_emo_list = []  # summed non-negative weights, one entry per sentence
    neg_emo_list = []  # summed negative weights, one entry per sentence
    for sentence in df[df_sentences]:
        pos_emo = 0
        neg_emo = 0
        for word in sentence.split():
            key = word + ' '  # the words in the index carry a trailing white space
            if key in word_df_name.index:
                weights = word_df_name.loc[key]  # look up this word's row once
                neg_emo += weights['disgust']
                neg_emo += weights['anger']
                neg_emo += weights['sad']
                neg_emo += weights['fear']
                pos_emo += weights['happy']
                pos_emo += weights['surprise']
                pos_emo += weights['neutral']
        pos_emo_list.append(pos_emo)
        neg_emo_list.append(neg_emo)
    # Compare the totals: 1 where the negative weight is larger, otherwise -1 (ties included)
    Final_Predictions = np.where(np.array(pos_emo_list) < np.array(neg_emo_list), 1, -1)
    return Final_Predictions
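
To make the scoring rule concrete, here is a toy version that uses a plain dictionary in place of the `Word_Classifier` DataFrame. The two rows are copied from the `head()` output above; the function name and the example sentences are made up for illustration.

```python
# Two rows copied from Word_Classifier.head(); a dict stands in for the DataFrame
toy_weights = {
    'abuse': {'disgust': 0.000532, 'surprise': 0.000177, 'neutral': 0.000177,
              'anger': 0.137363, 'sad': 0.001241, 'happy': 0.001595, 'fear': 0.002659},
    'academy': {'disgust': 0.007143, 'surprise': 0.021429, 'neutral': 0.007143,
                'anger': 0.007143, 'sad': 0.007143, 'happy': 0.092857, 'fear': 0.035714},
}
NEGATIVE = ('disgust', 'anger', 'sad', 'fear')
NON_NEGATIVE = ('happy', 'surprise', 'neutral')


def toy_sentence_label(sentence, weights=toy_weights):
    """Return 1 if the negative weights outweigh the rest, otherwise -1."""
    neg = pos = 0.0
    for word in sentence.split():
        row = weights.get(word)
        if row is None:  # words missing from the table contribute nothing
            continue
        neg += sum(row[name] for name in NEGATIVE)
        pos += sum(row[name] for name in NON_NEGATIVE)
    return 1 if pos < neg else -1


print(toy_sentence_label('they abuse power'))
print(toy_sentence_label('the academy opened'))
```

A sentence with no known words scores zero on both sides and falls through to `-1`, which is the same tie behaviour as the comparison in `word_class_model`.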

In [206]:
final_column = word_class_model(Tester, 'Text', Word_Classifier)

In [207]:
Tester['Word Predictions'] = final_column

In [208]:
# confirm that it works
Tester.head()

Unnamed: 0,Text,Irrational,Word Predictions
0,oh of course,-1,-1
1,lately i ve been having these attack that are ...,-1,1
2,well it becomes a total preoccupation i can t ...,1,1
3,patrick that s my husband he wa late he lost h...,1,1
4,well somehow i finally got myself together and...,1,1


In [209]:
print(f'The accuracy score for your testing data was: {accuracy_score(final_column, y_test)}' )

The accuracy score for your testing data was: 0.43568464730290457


In [210]:
pd.DataFrame(confusion_matrix(y_test, final_column),
                                         index=['Actual Rational', 'Actual Irrational'],
                                         columns=['Predicted Rational', 'Predicted Irrational'])

Unnamed: 0,Predicted Rational,Predicted Irrational
Actual Rational,115,108
Actual Irrational,164,95


Wow, that's a pretty bad model: at 43.6% accuracy it does worse than the 53.7% baseline. I'm going to need its misclassification rate for the final model later, so I'll compute that here.

In [230]:
word_miscl = 1 - (accuracy_score(final_column, y_test))
word_weights = (1/2)*np.log((1-word_miscl)/word_miscl)
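
One consequence of this weight is worth spelling out. Because the word model is right less than half the time, its weight comes out negative, so its vote counts against whatever it predicts. The sketch below redoes the calculation with only the standard library, using the counts from the confusion matrix above; `vote_weight` is a name I made up.

```python
import math


def vote_weight(misclassification_rate):
    """Half the log-odds of being right; undefined for rates of exactly 0 or 1."""
    return 0.5 * math.log((1 - misclassification_rate) / misclassification_rate)


correct = 115 + 95  # diagonal of the confusion matrix above
total = 482
word_miscl = 1 - correct / total
word_weight = vote_weight(word_miscl)
print(round(word_weight, 3))
```

A model with exactly 50% accuracy would get a weight of zero and drop out of the vote entirely.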

## Final Model

For our final model we're going to combine the best model from each dataset and determine a final prediction based on how accurate each one was. It's interesting to note that Naive Bayes was the best model in general. The best dataset was the long emotion dataset, with the positive and negative sentences coming in a close second. I'm going to create a weight for each model's predictions, which reduces the influence of the worse models.<br>
<br>
Final_Predictions = $\text{sign}\left(\sum_t w_t \hat{y}_t\right)$ <br>
                    $w_t = \frac{1}{2}\log\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$ <br>
                    $\epsilon_t = $ misclassification rate of model $t$
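
The weighting scheme can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the error rates and predictions are invented toy values, the helper names `model_weight` and `weighted_vote` are hypothetical, and a score of exactly zero is mapped to -1 (strictly, $\text{sign}(0)$ is 0).

```python
import math

def model_weight(error_rate):
    """Half the log-odds of a correct prediction: 0.5 * log((1 - e) / e)."""
    return 0.5 * math.log((1 - error_rate) / error_rate)

def weighted_vote(predictions, weights):
    """Combine per-model predictions in {-1, 1} into one label per row."""
    combined = []
    # zip(*predictions) yields, for each row, the vote of every model
    for row_votes in zip(*predictions):
        score = sum(w * y for w, y in zip(weights, row_votes))
        # Assumption: a score of exactly 0 is treated as -1
        combined.append(1 if score > 0 else -1)
    return combined

# Made-up error rates for three hypothetical models
toy_errors = [0.30, 0.40, 0.55]
toy_weights = [model_weight(e) for e in toy_errors]

# Made-up predictions: one inner list per model, one entry per row
toy_predictions = [
    [1, -1, 1],   # model 1 (lowest error, largest weight)
    [1, 1, -1],   # model 2
    [-1, 1, -1],  # model 3 (error above 0.5, so its weight is negative)
]

print(weighted_vote(toy_predictions, toy_weights))  # [1, -1, 1]
```

In this toy case the combined vote follows model 1 on every row, because its weight is larger in magnitude than the other two put together.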
                    
<br>
First I'm going to put together the final models from each data set, which I chose based on their ability to complement each other.

### Emotion Short Final Model

In [231]:
X_train = Emotion_short['Sentences']
y_train = Emotion_short['Negativity']
pipe = Pipeline([('cvec', CountVectorizer(stop_words = 'english')),
                 ('ada', AdaBoostClassifier(random_state=42))])
pipe_param = {'cvec__min_df': [1],
              'cvec__max_df': [.95],
              'cvec__ngram_range': [(1, 3)],
              'ada__n_estimators': [100],
              'ada__learning_rate': [.5]}
emot_short_gs = GridSearchCV(pipe, param_grid=pipe_param, cv=5)
emot_short_gs.fit(X_train, y_train)
# For our final predictions
emot_short_preds = emot_short_gs.predict(X_test)
# Get misclassification rate
emot_short_miscl = 1 - (accuracy_score(emot_short_preds, y_test))
# Get the appropriate weights
emot_short_weight = (1/2)*np.log((1-emot_short_miscl)/emot_short_miscl)

### Emotion Long Final Model

In [232]:
X_train = Emotion_long['Sentences']
y_train = Emotion_long['Negativity']
pipe = Pipeline([('cvec', CountVectorizer()),
                 ('nb', MultinomialNB())])
pipe_param = {'cvec__stop_words': ['english'],
              'cvec__min_df': [1],
              'cvec__max_df': [.98],
              'cvec__ngram_range': [(1, 3)],
              'nb__alpha': [2]}
emot_long_gs = GridSearchCV(pipe, param_grid=pipe_param, cv=5)
emot_long_gs.fit(X_train, y_train)
# For our final predictions
emot_long_preds = emot_long_gs.predict(X_test)
# Get misclassification rate
emot_long_miscl = 1 - (accuracy_score(emot_long_preds, y_test))
# Get the appropriate weights
emot_long_weight = (1/2)*np.log((1-emot_long_miscl)/emot_long_miscl)

### Positive and Negative Sentences Final Model

In [233]:
X_train = Pos_neg['Sentences']
y_train = Pos_neg['Negativity']
pipe = Pipeline([('tvec', TfidfVectorizer()),
                 ('nb', MultinomialNB())])
pipe_param = {'tvec__stop_words': ['english'],
              'tvec__min_df': [1],
              'tvec__max_df': [.95],
              'tvec__ngram_range': [(1, 3)],
              'nb__alpha': [2]}
pos_neg_gs = GridSearchCV(pipe, param_grid=pipe_param, cv=5)
pos_neg_gs.fit(X_train, y_train)
# For our final predictions
pos_neg_preds = pos_neg_gs.predict(X_test)
# Get misclassification rate
pos_neg_miscl = 1 - (accuracy_score(pos_neg_preds, y_test))
# Get the appropriate weights
pos_neg_weight = (1/2)*np.log((1-pos_neg_miscl)/pos_neg_miscl)

In [234]:
Tester['Emotion_Long_Preds'] = emot_long_preds
Tester['Pos_Neg_Preds'] = pos_neg_preds
Tester['Emotion_Short_Preds'] = emot_short_preds

I've already put together the final model for the word classifier so we'll implement the equation here.

In [247]:
# Weighted sum of each model's -1/1 predictions
Tester['Final Prediction'] = ((Tester['Emotion_Long_Preds'] * emot_long_weight) +
                              (Tester['Pos_Neg_Preds'] * pos_neg_weight) +
                              (Tester['Emotion_Short_Preds'] * emot_short_weight) +
                              (Tester['Word Predictions'] * word_weights))

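Create the final predictions from the weighted sums: a positive sum becomes 1 (irrational) and anything else, including an exact zero, becomes -1 (rational).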

In [248]:
def adjust(value):
    if value > 0:
        return 1
    else:
        return -1

Get final test score

In [249]:
Tester['Final Prediction'] = Tester['Final Prediction'].apply(adjust)
final_preds = list(Tester['Final Prediction'])
print(accuracy_score(final_preds, y_test))

0.6327800829875518


In [250]:
pd.DataFrame(confusion_matrix(y_test, final_preds),
             index=['Actual Rational', 'Actual Irrational'],
             columns=['Predicted Rational', 'Predicted Irrational'])

Unnamed: 0,Predicted Rational,Predicted Irrational
Actual Rational,146,77
Actual Irrational,100,159
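
Since false positives are the main worry for this project, it may help to read a few more numbers off the confusion matrix above. The sketch below treats "irrational" as the positive class, which is an assumption about how the labels are meant to be read. The variable names are made up for this illustration.

```python
# Counts copied from the confusion matrix above
true_negatives = 146   # actual rational, predicted rational
false_positives = 77   # actual rational, predicted irrational
false_negatives = 100  # actual irrational, predicted rational
true_positives = 159   # actual irrational, predicted irrational

total = true_negatives + false_positives + false_negatives + true_positives

accuracy = (true_positives + true_negatives) / total
precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
false_positive_rate = false_positives / (false_positives + true_negatives)

print(f"accuracy:            {accuracy:.3f}")             # 0.633
print(f"precision:           {precision:.3f}")            # 0.674
print(f"recall:              {recall:.3f}")               # 0.614
print(f"false positive rate: {false_positive_rate:.3f}")  # 0.345
```

The accuracy matches the score printed earlier, and roughly a third of the rational sentences are still flagged as irrational.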


I want to create a final function that allows easy use of the combined model.

In [272]:
# create a list of model variables.
final_model_list = [emot_long_gs, pos_neg_gs, emot_short_gs]
# list of weights for each model
final_weights_list =[emot_long_weight, pos_neg_weight,emot_short_weight, word_weights] 
# final model function
def combined_model_predictor(df_to_predict, feature, word_df_name, final_model_list, weight_list):
        # Incorporates the word class function so the word-based model can add its predictions
        def word_class_model(df, df_sentences, word_df_name):
            pos_emo_list = [] # positive emotion list, I'm going to put the non-negative weights in here
            neg_emo_list = [] # negative emotion list, I'm going to put the negative weights in here
            temp_df = df.copy()
            
            # function to determine weight per sentence/row.
            def find_weights(sentence): 
                pos_emo = 0
                neg_emo = 0
                for word in sentence.split():
                    if (word + " ") in word_df_name.index: # there was a trailing white space for the words in the index
                        neg_emo += word_df_name.loc[(word + ' ')]['disgust'] # Grab the weight
                        neg_emo += word_df_name.loc[(word + ' ')]['anger']
                        neg_emo += word_df_name.loc[(word + ' ')]['sad']
                        neg_emo += word_df_name.loc[(word + ' ')]['fear']
                        pos_emo += word_df_name.loc[(word + ' ')]['happy']
                        pos_emo += word_df_name.loc[(word + ' ')]['surprise']
                        pos_emo += word_df_name.loc[(word + ' ')]['neutral']
                pos_emo_list.append(pos_emo)
                neg_emo_list.append(neg_emo)
                
            temp_df[df_sentences].apply(find_weights) # applying the function should create two lists the same length as our data
            # Incorporate lists into dataframe
            temp_df['Positive_Word_Weight'] = pos_emo_list
            temp_df['Negative_Word_Weight'] = neg_emo_list
            # Compare the columns return 1s for Negative, -1s for Positive
            Final_Predictions = np.where(temp_df['Positive_Word_Weight'] < temp_df['Negative_Word_Weight'], 1, -1)    
            return Final_Predictions
        predictions = []
        for model in final_model_list:
            predictions.append(model.predict(df_to_predict[feature]))
        word_model = word_class_model(df_to_predict, feature, word_df_name) 
        predictions.append(word_model)
        predicted_weights = ((np.array(predictions[0])*weight_list[0])+
                                (np.array(predictions[1])*weight_list[1]) +
                                (np.array(predictions[2])*weight_list[2]) +
                              (np.array(predictions[3])*weight_list[3]))        
        final_predictions = []
        for pred in predicted_weights:
                if pred > 0:
                    final_predictions.append(1)
                else:
                    final_predictions.append(-1)
        return final_predictions

In [275]:
# Test to confirm that it worked
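# The function takes five arguments, so the list of weights has to be passed as well
final_prediction_list = combined_model_predictor(Tester, 'Text', Word_Classifier, final_model_list, final_weights_list)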
print(len(final_prediction_list))

482


## Saving our models

1) Save the first three models using pickle. <br>
2) Save the word based model and the final model that combines everything together into their own .py file. For this I will simply copy and paste it.

In [277]:
import pickle

In [278]:
final_model_list = [emot_long_gs, pos_neg_gs, emot_short_gs]
final_weights_list =[emot_long_weight, pos_neg_weight,emot_short_weight, word_weights] 
final_model_dictionary ={'Weights': final_weights_list, 'Models': final_model_list}
# Use a context manager so the file is closed even if pickling fails
with open('../Final_Models/Three_Models', 'wb') as outfile:
    pickle.dump(final_model_dictionary, outfile)
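
To use the saved dictionary elsewhere, it has to be read back in binary mode. The round trip below is a minimal sketch, not the notebook's own loading code: it writes a stand-in dictionary of plain values to a temporary file instead of touching `../Final_Models/Three_Models`, and the names `stand_in_dictionary`, `temp_path`, `loaded_weights` and `loaded_models` are hypothetical.

```python
import os
import pickle
import tempfile

# Stand-in for final_model_dictionary: plain values instead of fitted models
stand_in_dictionary = {'Weights': [0.4, 0.2, 0.3, -0.1],
                       'Models': ['model_a', 'model_b', 'model_c']}

# Write to a temporary location rather than the notebook's real output path
temp_path = os.path.join(tempfile.mkdtemp(), 'Three_Models_demo')
with open(temp_path, 'wb') as outfile:
    pickle.dump(stand_in_dictionary, outfile)

# Read it back with the matching binary read mode
with open(temp_path, 'rb') as infile:
    loaded = pickle.load(infile)

loaded_weights = loaded['Weights']
loaded_models = loaded['Models']

print(loaded == stand_in_dictionary)  # True
```

With the real file, the loading environment needs a compatible version of scikit-learn installed, because the pickled objects are fitted `GridSearchCV` instances. The word-based model is not in the pickle, so its function and the word data frame have to be supplied separately.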