# Capstone Spring 2019: Is this AI?
by: bobby broz

__Abstract/Motivation:__ <br>

This capstone is aimed at professionals looking to improve their interpersonal communication skills
and at companies looking to improve their transaction success rates with customers. It seeks to create a
machine learning approach designed to identify and record any questions asked during a
conversation. NLP is used to analyze all text queries and return them to the user (or digital
assistant) to aid in preparing an appropriate response.



### NLP

Dataset: https://www.kaggle.com/bryanpark/the-world-english-bible-speech-dataset <br>

This dataset is composed of the following: <br>
- README.md
- wav files sampled at 12,000 Hz (12 kHz)
- transcript.txt.
<br>

`transcript.txt` is in a tab-delimited format. <br>

The first column is the name of the `book` and audio file paths. <br> 
The second one is the `script`. <br>
The rightmost column is the `duration` of the audio file. <br>

This is a supervised learning classification problem. <br>

The goal here is to load the text into a DataFrame and then parse it to recognize what is a question based on its wording. <br>

eg. who? what? where? why? how? <br>

Going deeper, you could look at Does..? Will..? Am I..? Which is..? <br>
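
As a point of comparison for the classifiers, the question words listed above can be turned into a simple rule-based check. The sketch below is only an illustration: the helper name `looks_like_question` and both word lists are assumptions, not part of the original notebook, and the lists are far from exhaustive.

```python
# Hypothetical helper: flags a sentence as a likely question from its first word.
WH_WORDS = {"who", "what", "where", "when", "why", "how", "which", "whose", "whom"}
AUXILIARIES = {"do", "does", "did", "will", "would", "can", "could", "shall",
               "should", "is", "are", "am", "was", "were", "have", "has", "had"}

def looks_like_question(sentence: str) -> bool:
    """Return True if the first word is an interrogative or an auxiliary verb."""
    words = sentence.lower().strip().split()
    if not words:
        return False
    first = words[0].strip(".,;:!?\"'")
    return first in WH_WORDS or first in AUXILIARIES

examples = [
    "Where did you get that from",
    "Am I my brother's keeper",
    "In the beginning God created the heavens",
]
for text in examples:
    print(f"{text!r}: {looks_like_question(text)}")
```

A rule like this will misfire on exclamations such as "What a day" and will miss questions that start with other words, which is one motivation for learning the pattern from data instead.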

In [None]:
# Packages necessary for storing, manipulating and displaying text data statistics

from tqdm import tqdm # progress bar for tracking loop execution
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


%matplotlib inline
plt.rc('figure', figsize=(16, 4))

In [None]:
# load and display tab delimited values, assign column names
transcript_txt_df = pd.read_csv('transcript.txt',sep='\t',header=None)
transcript_txt_df.columns=['book','script','duration']
transcript_txt_df.head()

In [None]:
# check for NaN values
transcript_txt_df.isna().sum()

In [None]:
# locate NaN values
display(transcript_txt_df.loc[transcript_txt_df['script'].isnull()])

__Note:__ Both the audio and script data for the two verses above are missing. Upon further investigation, it was determined that these two verses are routinely omitted from modern Bibles as they could be misinterpreted. A decision was made to fill in the placeholder word 'BRAINSTATION', as one unknown 21st-century word will not affect the accuracy of the corpus.

In [None]:
# use .loc to avoid chained assignment
transcript_txt_df.loc[25047, 'script'] = 'BRAINSTATION'
transcript_txt_df.loc[25565, 'script'] = 'BRAINSTATION'

In [None]:
# confirm NaN values were replaced
transcript_txt_df.isna().sum()

In [None]:
# check to see if all 31,083 lines were correctly read in
print(len(transcript_txt_df), 'lines in DataFrame')

__Note:__ `transcript.txt` has 31083 lines. <br>

28825/31083 = 0.927355789338223 <br>
Approximately 7% of the lines were not read in as separate rows. <br>

__Data Cleaning required.__

In [None]:
# count number of aggregated and mis-read rows in DataFrame (rows whose script contains a newline)
misread_count = transcript_txt_df['script'].str.contains("\n", regex=False)
print(misread_count.sum(),"rows incorrectly read in")

### Some more exploratory data analysis

A List of all the books, chapters and verses in the bible:
https://www.blueletterbible.org/study/misc/66books.cfm <br>

Ruth should have: <br> 
- 4 chapters <br> 
- 85 verses <br>

and the DataFrame matches this expectation. <br>

In [None]:
display(transcript_txt_df.loc[transcript_txt_df['book'].str.contains('Ruth')])

Leviticus should have: <br> 
- 27 chapters <br> 
- 859 verses <br>

and the DataFrame __does not__ match this expectation. <br>
The data was likely not read in correctly. Further EDA required. <br>
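
The chapter and verse checks above can also be done programmatically rather than by eye. The following is a minimal sketch on toy data, assuming every path follows the `Book/Book_chapter-verse` pattern seen in `Leviticus/Leviticus_2-1`; the helper name `chapter_verse_counts` and the toy DataFrame are hypothetical.

```python
import pandas as pd

# Toy stand-in for transcript_txt_df['book']; the real column holds paths such as
# 'Leviticus/Leviticus_2-1' (assumed pattern: book/book_chapter-verse).
toy_df = pd.DataFrame({"book": [
    "Ruth/Ruth_1-1", "Ruth/Ruth_1-2", "Ruth/Ruth_2-1",
    "Leviticus/Leviticus_2-1", "Leviticus/Leviticus_2-5",
]})

def chapter_verse_counts(df: pd.DataFrame) -> pd.DataFrame:
    """Count distinct chapters and total verses per book from the path column."""
    pattern = r"^(?P<name>[^/]+)/.*_(?P<chapter>\d+)-(?P<verse>\d+)$"
    parts = df["book"].str.extract(pattern).dropna()  # drop paths that do not match
    parts["chapter"] = parts["chapter"].astype(int)
    return parts.groupby("name").agg(chapters=("chapter", "nunique"),
                                     verses=("verse", "size"))

counts = chapter_verse_counts(toy_df)
print(counts)
```

Comparing such counts against a reference list of chapters and verses would point directly at the books where rows went missing.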

In [None]:
display(transcript_txt_df.loc[transcript_txt_df['book'].str.contains('Leviticus')])

__EDA__ <br>
From above we see that `Leviticus/Leviticus_2-1` jumps straight to `Leviticus/Leviticus_2-5`, skipping 3 verses <br>
However, their respective row indices increase by only one: `2495` -> `2496` <br>
Need to investigate further by displaying the book title and script. <br>

In [None]:
# display a known row that was read in incorrectly
display(transcript_txt_df['book'].iloc[2495])

In [None]:
transcript_txt_df['script'].iloc[2495]

__EDA__ <br>
We can see that this contains several `\t` and `\n` delimiters, separating the book titles, scripts and durations for the following verses: <br>
- `Leviticus/Leviticus_2-2` <br>
- `Leviticus/Leviticus_2-3` <br>
- `Leviticus/Leviticus_2-4` <br>
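
One plausible cause of merged rows like the one above, which is an assumption and not something confirmed in this notebook, is an unbalanced double quote in the script text: by default `pd.read_csv` treats everything between two `"` characters as a single field, even across tabs and newlines. The toy example below sketches that effect and shows how `quoting=csv.QUOTE_NONE` changes it; the sample text is invented for illustration.

```python
import csv
import io
import pandas as pd

# Toy tab-delimited text: the second field of line 1 opens a double quote
# that is only closed on line 2 (invented pattern, for illustration only).
raw = 'a_1-1\t"Hello\t1.0\na_1-2\tWorld"\t2.0\na_1-3\tFoo\t3.0\n'
names = ["book", "script", "duration"]

# Default quoting: the parser treats everything between the quotes as one field.
merged = pd.read_csv(io.StringIO(raw), sep="\t", header=None, names=names)

# QUOTE_NONE: quote characters are ordinary text, so every line becomes one row.
plain = pd.read_csv(io.StringIO(raw), sep="\t", header=None, names=names,
                    quoting=csv.QUOTE_NONE)

print(len(merged), "rows with default quoting")
print(len(plain), "rows with quoting=csv.QUOTE_NONE")
```

If the same mechanism applies to `transcript.txt`, reading it with `quoting=csv.QUOTE_NONE` might avoid the merged rows altogether, but that would need to be verified against the 31083 expected lines.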

__Pre-processing Pseudocode__ <br>

1. Create `book[]`,`script[]`,`duration[]` master lists <br>
__Main Loop: Steps 2-7__ through `transcript_txt_df` DataFrame <br>
2. Join the three column values of a single row with `\t` and store the result in `string`. Check `string` for `\n`. <br>
3. If no `\n` is found, the row was read correctly: append its values from the respective `transcript_txt_df` columns to the `book[]`,`script[]`,`duration[]` master lists <br>
4. If a `\n` is found, split() `string` by the `\n` delimiter and store the pieces in `substring[]` <br>
__Sub Loop: Steps 5-7__ <br>
5. Sub-loop through the `substring[]` list, split() each element by the `\t` delimiter and store the 3 outputs in `book_temp`,`script_temp`,`duration_temp` <br>
6. Append `book_temp`,`script_temp`,`duration_temp` to the `book[]`,`script[]`,`duration[]` master lists <br>
7. Once the sub-loop has scanned the entire length of `substring[]`, it finishes and control returns to the main loop <br>

This continues until the entire `transcript_txt_df` DataFrame is scanned. The result is 3 separate master lists: 

- `book[]`
- `script[]`
- `duration[]`

These master lists should each have 31083 entries and will be combined into a `transcript_clean_df` DataFrame, with each list as its respectively named column.
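
The steps above can be sketched compactly on toy data. This is a simplified illustration, not the notebook's implementation: the function name `split_merged_rows` and the toy rows are hypothetical, and the two branches are folded into one because splitting a newline-free string on `\n` simply returns a single piece.

```python
def split_merged_rows(rows):
    """Expand rows whose script field swallowed extra tab/newline-delimited records.

    `rows` is an iterable of (book, script, duration) tuples; returns three lists.
    """
    book, script, duration = [], [], []
    for b, s, d in rows:
        line = f"{b}\t{s}\t{d}"
        # a clean row yields one piece; a merged row yields one piece per record
        for piece in line.split("\n"):
            book_temp, script_temp, duration_temp = piece.split("\t")
            book.append(book_temp)
            script.append(script_temp)
            duration.append(duration_temp)
    return book, script, duration

# Toy rows: the first one has swallowed two further records.
toy_rows = [
    ("Lev_2-1", "first verse\t3.1\nLev_2-2\tsecond verse\t2.7\nLev_2-3\tthird verse", 4.0),
    ("Lev_2-4", "fourth verse", 5.2),
]
book, script, duration = split_merged_rows(toy_rows)
print(book)
print(duration)
```

Note that the duration of the merged row belongs to the last swallowed record, which is why the three fields are re-joined before splitting.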


In [None]:
book = []
script = []
duration = []

tab = "\t"
newline = "\n"

num_correct_lines = 0
num_fixed_lines = 0

# print(f'Transcript Lines read in: {len(transcript_txt_df)}')

for x in tqdm( range(0,len(transcript_txt_df)) ):

    string = transcript_txt_df['book'].iloc[x] +"\t"+ transcript_txt_df['script'].iloc[x]+"\t"+str(transcript_txt_df['duration'].iloc[x])
    # print(f"transcript_df['book'].iloc[{x}] + transcript_df['script'].iloc[{x}] + str(transcript_df['duration'].iloc[{x}] = ")
    # display(string)

    if (string.find("\n") < 0):

        book.append(transcript_txt_df['book'].iloc[x])
        script.append(transcript_txt_df['script'].iloc[x])
        duration.append(str(transcript_txt_df['duration'].iloc[x]))
        num_correct_lines += 1

    else:

        substring = string.split(newline)

        # print('substring = ')
        # display(substring)

        for y in range(0,len(substring)):
    #         print(y)
    #         display(substring[y].split(tab))
            book_temp,script_temp,duration_temp = substring[y].split(tab)
            book.append(book_temp)
            script.append(script_temp)
            duration.append(duration_temp) 
            num_fixed_lines += 1
            
#     if (x%5000==0): print('Line',x,'of',len(transcript_txt_df))

print(f'Scanned {x + 1} lines!')  # x is the last zero-based index
print(f'Number of initially scanned lines = {num_correct_lines}')
print(f'Number of fixed lines =              {num_fixed_lines}')

print(f"Length of book list     =           {len(book)}")
print(f"Length of script list   =           {len(script)}")
print(f"Length of duration list =           {len(duration)}")

print(f'Total lines =                       {num_correct_lines+num_fixed_lines}')


In [None]:
# create DataFrame, load each respective list into column and name it
transcript_clean_df = pd.DataFrame()
transcript_clean_df['book'] = book
transcript_clean_df['script'] = script
transcript_clean_df['duration'] = duration
transcript_clean_df.shape

Identify which lines in `transcript_clean_df` DataFrame have questions and store it in `target_df` DataFrame  <br>

In [None]:
# create target variable (question identifier)
question = []
for z in range(0,len(transcript_clean_df)):
    question.append(transcript_clean_df['script'][z].count('?'))
    
print(f'Scanned {z + 1} lines!')  # z is the last zero-based index

target_df = pd.DataFrame(question)
target_df.columns = ['question']
len(target_df)
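
The same question count can be obtained without an explicit loop. The sketch below uses a toy Series in place of `transcript_clean_df['script']`; the sample sentences and variable names are placeholders.

```python
import pandas as pd

toy_scripts = pd.Series([
    "Where are you?",
    "Who told you that you were naked? Have you eaten from the tree?",
    "In the beginning God created the heavens and the earth.",
])

# '?' is a regex metacharacter, so it is escaped for Series.str.count
question_counts = toy_scripts.str.count(r"\?")
# binary label: 1 if the verse contains at least one question mark
has_question = (question_counts > 0).astype(int)

print(question_counts.tolist())  # [1, 2, 0]
print(has_question.tolist())     # [1, 1, 0]
```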

In [None]:
# check if line 66 has two questions
display(target_df.head())
display(f"line 66 has {target_df['question'].iloc[66]} questions")

In [None]:
# count total number of questions
target_df.sum()

In [None]:
# show the ten largest question counts within the first 300 rows
target_df['question'][:300].nlargest(10)

In [None]:
# verify row 85 from above, result should show 2 questions in line
transcript_clean_df['script'].iloc[85]

In [None]:
transcript_clean_df['script'].iloc[21401]

Create `feature_df` by copying `transcript_clean_df`. This feature DataFrame will have all punctuation removed from the script and be brought down to lower case. 

In [None]:
feature_df = transcript_clean_df.copy()
feature_df.shape

In [None]:
# remove all punctuation and convert each string into lower case for all 31,083 verses

import string

print(string.punctuation)

def remove_punctuation(text):
    """Return text with every ASCII punctuation character removed."""
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text

feature_df['script'] = feature_df['script'].apply(remove_punctuation)
display(feature_df['script'].head())

# lower-case the whole column at once (avoids chained assignment inside a Python loop)
feature_df['script'] = feature_df['script'].astype(str).str.lower()

In [None]:
feature_df['script'].head(20)

In [None]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# number of stop words in scikit-learn's built-in English list
print(len(ENGLISH_STOP_WORDS))

https://en.wikipedia.org/wiki/Tf-idf
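
To make the weighting concrete, here is a small sketch of what `TfidfVectorizer` computes on a toy corpus. The three sentences are invented; the formula in the comment is the smoothed inverse document frequency that scikit-learn uses with its default `smooth_idf=True`, and each row is L2-normalised by default.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = [
    "where are the sheep",
    "the sheep are in the field",
    "where is the field",
]

vectorizer = TfidfVectorizer()  # default settings; no stop-word removal in this toy case
matrix = vectorizer.fit_transform(toy_corpus)

vocab = vectorizer.vocabulary_  # term -> column index
idf = vectorizer.idf_

# With smooth_idf=True: idf(t) = ln((1 + n) / (1 + df(t))) + 1
n_docs = len(toy_corpus)
expected_idf_in = np.log((1 + n_docs) / (1 + 1)) + 1  # 'in' occurs in one document

row_norms = np.linalg.norm(matrix.toarray(), axis=1)

print("idf('the') =", idf[vocab["the"]])  # 'the' occurs in every document
print("idf('in')  =", idf[vocab["in"]])
print("row norms  =", row_norms)
```

Terms that appear in every document get the minimum weight, while rarer terms are weighted up, which is why common words contribute little to the features.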

In [None]:
# combine feature and target into one DataFrame
# create a train and test set for dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

web_df = feature_df.copy()
# binary target: 1 if the verse contains at least one question mark, else 0
web_df['target'] = (target_df['question'] > 0).astype(int)

X = web_df['script']
y = web_df['target']

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,stratify=y, random_state=1)

# vectorize all words using term frequency, inverse document frequency
# Create document-term matrix using TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english')
tfidf.fit(X_train)
X_train = tfidf.transform(X_train)
X_test = tfidf.transform(X_test)


__Note:__ Given that there are only 3241 questions out of 31083 verses, we need to balance the two classes (question/non-question) so that the algorithms are not trained on an imbalanced/biased dataset. This will be done using the Synthetic Minority Over-sampling TEchnique (SMOTE).
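
The core idea of SMOTE is to create new minority-class points by interpolating between an existing minority sample and one of its nearest minority neighbours. The NumPy sketch below illustrates that idea on toy data; it is a simplified stand-in, not the `imblearn` implementation, and the function name `smote_like_oversample` and its parameters are hypothetical.

```python
import numpy as np

def smote_like_oversample(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)
    # pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_minority[:, None, :] - X_minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)         # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)   # sample to start from
    pick = rng.integers(0, k, size=n_new)   # which of its k neighbours to use
    gap = rng.random((n_new, 1))            # position along the connecting segment
    partner = X_minority[neighbours[base, pick]]
    return X_minority[base] + gap * (partner - X_minority[base])

# Toy minority class: the four corners of the unit square.
toy_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_oversample(toy_minority, n_new=6, k=2)
print(synthetic.shape)
```

Because the synthetic points lie on segments between real minority samples, oversampling should only ever be applied to the training split, never to the test split.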

In [None]:
# using the SMOTE package, we balance the question/non-question classes using a 1:1 ratio
# (requires imbalanced-learn >= 0.4 for `sampling_strategy` and `fit_resample`)
from imblearn.over_sampling import SMOTE

# instantiate
smote = SMOTE(sampling_strategy=1.0)
# fit
X_train_smote, y_train_smote = smote.fit_resample(X_train,y_train)

## Classifiers

- logistic regression
- naive bayes
- random forest
- support vector machine
- xgboost
- neural network

#### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

# instantiate
logit_smote = LogisticRegression(solver='lbfgs')
logit = LogisticRegression(solver='lbfgs')
logit_weight = LogisticRegression(solver='lbfgs',class_weight='balanced')

# fit
logit_smote.fit(X_train_smote,y_train_smote)
logit.fit(X_train,y_train)
logit_weight.fit(X_train,y_train)

# predict
y_pred_smote = logit_smote.predict(X_test)
y_pred = logit.predict(X_test)
y_pred_weight = logit_weight.predict(X_test)


In [None]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

print('Logit Train Score:      ',logit.score(X_train,y_train))
print('Logit Test Score:       ',logit.score(X_test,y_test))
print('Logit Train Smote Score:',logit_smote.score(X_train_smote,y_train_smote))
print('Logit Test Smote Score: ',logit_smote.score(X_test,y_test))
print('Logit Balanced Train Score:      ',logit_weight.score(X_train,y_train))
print('Logit Balanced Test Score:       ',logit_weight.score(X_test,y_test))

print('\n')

print('y_pred_smote:\n', classification_report(y_test, y_pred_smote))
print('y_pred_smote confusion matrix:\n',confusion_matrix(y_test, y_pred_smote),'\n')
print('y_pred:\n', classification_report(y_test, y_pred))
print('y_pred confusion matrix:\n',confusion_matrix(y_test, y_pred),'\n')
print('y_pred_weight:\n', classification_report(y_test, y_pred_weight))
print('y_pred_weight confusion matrix:\n',confusion_matrix(y_test, y_pred_weight),'\n')

#### Naive Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB

# convert from sparse matrix to dense array (GaussianNB does not accept sparse input)

X_train_gnb = X_train.toarray()
X_train_gnb_smote = X_train_smote.toarray()

# instantiate
gnb = GaussianNB()
gnb_smote = GaussianNB()

# fit
gnb.fit(X_train_gnb,y_train)
gnb_smote.fit(X_train_gnb_smote,y_train_smote)

# predict
y_pred_gnb = gnb.predict(X_test.toarray())
y_pred_gnb_smote = gnb_smote.predict(X_test.toarray())

In [None]:
X_train_gnb.shape

In [None]:
print('Naive Bayes Train Score:      ',gnb.score(X_train_gnb,y_train))
print('Naive Bayes Test Score:       ',gnb.score(X_test.toarray(),y_test))
print('Naive Bayes Smote Train Score:',gnb_smote.score(X_train_gnb_smote,y_train_smote))
print('Naive Bayes Smote Test Score: ',gnb_smote.score(X_test.toarray(),y_test))
print('\n')
print('y_pred_gnb:\n', classification_report(y_test, y_pred_gnb))
print('y_pred_gnb confusion matrix:\n',confusion_matrix(y_test,y_pred_gnb),'\n')
print('y_pred_gnb_smote:\n', classification_report(y_test, y_pred_gnb_smote))
print('y_pred_gnb smote confusion matrix:\n',confusion_matrix(y_test,y_pred_gnb_smote),'\n')

#### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

# instantiate
rf = RandomForestClassifier(n_estimators=100,n_jobs=-1)
rf_smote = RandomForestClassifier(n_estimators=100,n_jobs=-1)

# fit
rf.fit(X_train,y_train)
rf_smote.fit(X_train_smote,y_train_smote)

# predict
y_pred_rf = rf.predict(X_test)
y_pred_rf_smote = rf_smote.predict(X_test)

In [None]:
print('Random Forest Train Score:      ',rf.score(X_train,y_train))
print('Random Forest Test Score:       ',rf.score(X_test,y_test))
print('Random Forest Smote Train Score:',rf_smote.score(X_train_smote,y_train_smote))
print('Random Forest Smote Test Score: ',rf_smote.score(X_test,y_test))
print('\n')

print('y_pred_rf:\n', classification_report(y_test, y_pred_rf))
print('Random Forest confusion matrix:\n',confusion_matrix(y_test, y_pred_rf),'\n')
print('y_pred_rf_smote:\n', classification_report(y_test, y_pred_rf_smote))
print('Random Forest Smote confusion matrix:\n',confusion_matrix(y_test, y_pred_rf_smote),'\n')

#### Support Vector Machine

In [None]:
from sklearn.svm import LinearSVC

# instantiate
svm = LinearSVC()
svm_smote = LinearSVC()

# fit
svm.fit(X_train,y_train)
svm_smote.fit(X_train_smote,y_train_smote)

# predict
y_pred_svm = svm.predict(X_test)
y_pred_svm_smote = svm_smote.predict(X_test)

In [None]:
print('SVM Train Score:      ',svm.score(X_train,y_train))
print('SVM Test Score:       ',svm.score(X_test,y_test))
print('SVM Smote Train Score:',svm_smote.score(X_train_smote,y_train_smote))
print('SVM Smote Test Score: ',svm_smote.score(X_test,y_test))
print('\n')
print('y_pred_svm:\n', classification_report(y_test, y_pred_svm))
print('SVM confusion matrix:\n',confusion_matrix(y_test, y_pred_svm),'\n')
print('y_pred_svm_smote:\n', classification_report(y_test, y_pred_svm_smote))
print('SVM Smote confusion matrix:\n',confusion_matrix(y_test, y_pred_svm_smote),'\n')

#### XGBoost

In [None]:
from xgboost import XGBClassifier

# instantiate
xgb = XGBClassifier()
xgb_smote = XGBClassifier()

# fit
xgb.fit(X_train,y_train)
xgb_smote.fit(X_train_smote,y_train_smote)

# predict
y_pred_xgb = xgb.predict(X_test)
y_pred_xgb_smote = xgb_smote.predict(X_test)


In [None]:
print('XGBoost Train Score:         ',xgb.score(X_train,y_train))
print('XGBoost Test Score:          ',xgb.score(X_test,y_test))
print('XGBoost Smote Train Score:   ',xgb_smote.score(X_train_smote,y_train_smote))
print('XGBoost Smote Test Score:    ',xgb_smote.score(X_test,y_test))
print('\n')
print('y_pred_xgb:\n', classification_report(y_test, y_pred_xgb))
print('XGBoost confusion matrix:\n',confusion_matrix(y_test, y_pred_xgb),'\n')
print('y_pred_xgb_smote:\n', classification_report(y_test, y_pred_xgb_smote))
print('XGBoost Smote confusion matrix:\n',confusion_matrix(y_test, y_pred_xgb_smote),'\n')


In [None]:
# input_sentence = "Where did you get that from?"
input_sentence = "Are you going?"

print('Your input:   ',input_sentence)
input_sentence = remove_punctuation(input_sentence)
input_sentence = input_sentence.lower()
input_sentence = [input_sentence]
print('Pre-processed:',input_sentence[0])

# Create document-term matrix using TfidfVectorizer
input_sentence = tfidf.transform(input_sentence)

# predict
y_input_smote = logit_smote.predict(input_sentence)
y_input = logit.predict(input_sentence)
y_input_weight = logit_weight.predict(input_sentence)

In [None]:
# each prediction is a one-element array, so index it before converting to bool
print('Logit Smote: ',bool(y_input_smote[0]))
print('Logit: ',bool(y_input[0]))
print('Logit Weighted: ',bool(y_input_weight[0]))

### NLTK

Remove all English stop words and stem all verses in the dataset using NLTK's Porter stemmer. Then use logistic regression to see whether the results differ from those above.

In [None]:
import nltk
stemmer = nltk.stem.PorterStemmer()

def stop_punct_tokenizer(sentence):

    listofwords = sentence.split(' ')
    listofstemmed_words = []
    
    for word in listofwords:
        # Remove stopwords
        if word in ENGLISH_STOP_WORDS: continue
        # Stem words
        stemmed_word = stemmer.stem(word)
        for punctuation_mark in string.punctuation:
            # Remove punctuation and set to lower case
            stemmed_word = stemmed_word.replace(punctuation_mark, '').lower()
        listofstemmed_words.append(stemmed_word)

    return listofstemmed_words

def punct_tokenizer(sentence):

    listofwords = sentence.split(' ')
    punct_words = []
    
    for word in listofwords:
        # Remove stopwords
        if word in ENGLISH_STOP_WORDS: continue

        # Set to lower case, then strip every punctuation mark in turn
        punct_word = word.lower()
        for punctuation_mark in string.punctuation:
            punct_word = punct_word.replace(punctuation_mark, '')
        punct_words.append(punct_word)

    return punct_words

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
# import spacy

from nltk.corpus import stopwords

web_nltk_df = web_df.copy()

X_nltk = web_nltk_df['script']
y_nltk = web_nltk_df['target']

# create a new train test split
X_train_nltk, X_test_nltk, y_train_nltk, y_test_nltk = train_test_split(X_nltk,y_nltk,test_size=0.3,stratify=y_nltk,random_state=1)

# remove stop words and stem all words in the corpus 
vect = CountVectorizer(min_df=5, tokenizer = stop_punct_tokenizer,ngram_range=(1,3)).fit(X_train_nltk)
# vect_punct = CountVectorizer(min_df=5, tokenizer = punct_tokenizer).fit(X_train_nltk)

X_train_nltk = vect.transform(X_train_nltk)
# X_train_nltk = vect_punct.transform(X_train_nltk)

X_test_nltk = vect.transform(X_test_nltk)
# X_test_nltk = vect_punct.transform(X_test_nltk)


In [None]:
# count features after stemming
print('# Features With Stemming:   ',np.shape(vect.get_feature_names_out()))
# print('# Features Without Stemming:',np.shape(vect_punct.get_feature_names()))

In [None]:
# similarly to above, we balance the question/non-question classes in the NLTK train set using a 1:1 ratio
# (requires imbalanced-learn >= 0.4 for `sampling_strategy` and `fit_resample`)
from imblearn.over_sampling import SMOTE

# instantiate
smote = SMOTE(sampling_strategy=1.0)
# fit
X_train_nltk_smote, y_train_nltk_smote = smote.fit_resample(X_train_nltk,y_train_nltk)

In [None]:
from sklearn.linear_model import LogisticRegression

# instantiate
logit_nltk_smote = LogisticRegression(solver='lbfgs')
logit_nltk = LogisticRegression(solver='lbfgs')
logit_nltk_weight = LogisticRegression(solver='lbfgs',class_weight='balanced')

# fit
logit_nltk_smote.fit(X_train_nltk_smote,y_train_nltk_smote)
logit_nltk.fit(X_train_nltk,y_train_nltk)
logit_nltk_weight.fit(X_train_nltk,y_train_nltk)

# predict
y_pred_nltk_smote = logit_nltk_smote.predict(X_test_nltk)
y_pred_nltk = logit_nltk.predict(X_test_nltk)
y_pred_nltk_weight = logit_nltk_weight.predict(X_test_nltk)


In [None]:
from sklearn.metrics import classification_report

print('NLTK Logit Train Score:         ',logit_nltk.score(X_train_nltk,y_train_nltk))
print('NLTK Logit Test Score:          ',logit_nltk.score(X_test_nltk,y_test_nltk))
print('NLTK Logit Train Smote Score:   ',logit_nltk_smote.score(X_train_nltk_smote,y_train_nltk_smote))
print('NLTK Logit Test Smote Score:    ',logit_nltk_smote.score(X_test_nltk,y_test_nltk))
print('NLTK Logit Balanced Train Score:',logit_nltk_weight.score(X_train_nltk,y_train_nltk))
print('NLTK Logit Balanced Test Score: ',logit_nltk_weight.score(X_test_nltk,y_test_nltk))

print('\n')

# report on the NLTK predictions and labels (not the earlier TF-IDF ones)
print('NLTK y_pred_smote:\n', classification_report(y_test_nltk, y_pred_nltk_smote))
print('NLTK y_pred_smote confusion matrix: \n',confusion_matrix(y_test_nltk, y_pred_nltk_smote),'\n')
print('NLTK y_pred:\n', classification_report(y_test_nltk, y_pred_nltk))
print('NLTK y_pred confusion matrix: \n',confusion_matrix(y_test_nltk, y_pred_nltk),'\n')
print('NLTK y_pred_weight:\n', classification_report(y_test_nltk, y_pred_nltk_weight))
print('NLTK y_pred_weight confusion matrix: \n',confusion_matrix(y_test_nltk, y_pred_nltk_weight),'\n')

__Conclusion:__ The scores are comparable to those of the logistic regression above, which did not use stemming. There is no need to pursue this method any further.

#### Neural Network

In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
import time

#Transform data
# scaler = StandardScaler()
# scaler.fit(X_train)
# X_train_nn = scaler.transform(X_train)
# X_test_nn = scaler.transform(X_test)

start = time.time()
NN_model = MLPClassifier(hidden_layer_sizes=(500,500),solver='lbfgs')
NN_model.fit(X_train,y_train)
y_pred_NN = NN_model.predict(X_test)
end = time.time()

print('Execution Time:',end-start)

In [None]:
display(NN_model)
print('Train Score:', NN_model.score(X_train,y_train))
print('Test Score: ', NN_model.score(X_test,y_test))
print('Loss Score: ', NN_model.loss_)
print('\n')
print('NN y_pred:\n', classification_report(y_test, y_pred_NN))
print('NN y_pred confusion matrix: \n',confusion_matrix(y_test, y_pred_NN),'\n')

__Note:__ The following finds the optimal number of epochs (`max_iter`) for training the Neural Network.

In [None]:
import time

# for storing accuracies and score
epoch_train_accuracies = []
epoch_test_accuracies = []
num_epoch_times = []
epoch_loss_scores = []

# controls the number of epochs and step sizes
num_epochs = 101
epoch_step_size = 1

# loop through several neural network models
for i in np.arange(1,num_epochs,epoch_step_size):
    # for timing the execution
    start = time.time()
    # instantiate and fit neural network, controlling the number of epochs using max_iter=i
    # where i is controlled by the epoch_step_size up to a max of num_epochs
    NN_model = MLPClassifier(hidden_layer_sizes=500,solver='lbfgs',max_iter=i)
    NN_model.fit(X_train,y_train)
    
    # capture local accuracy and scores for each model
    train_accuracy = NN_model.score(X_train,y_train)
    test_accuracy = NN_model.score(X_test,y_test)
    epoch_loss_scores.append(NN_model.loss_)
    
    # store local accuracy and scores for each model in global variable
    epoch_train_accuracies.append(train_accuracy)
    epoch_test_accuracies.append(test_accuracy)
        
    end = time.time()
    num_epoch_times.append(end-start)
    print('Epochs =',i,'Execution Time: ', end-start, 'seconds')
    print('Train Accuracy: ', train_accuracy)
    print('Test Accuracy:  ', test_accuracy)
    print('Loss Score:     ', NN_model.loss_)

In [None]:
sum(num_epoch_times)

__Note:__ The following finds the optimal hidden layer size for training the Neural Network, sweeping sizes up to `train_range` in steps of `step_size`.

In [None]:
import time

# for storing accuracies globally
neuron_train_accuracies = []
neuron_test_accuracies = []
neuron_size_hidden_layer_times = []
neuron_loss_scores = []

train_range = 501
step_size = 100

# loop through several neural network models
for j in np.arange(1,train_range,step_size):
    start = time.time()
    # instantiate and fit neural network, controlling the hidden layer size using hidden_layer_sizes=j
    # where j is controlled by the step_size up to a max of train_range
    NN_model = MLPClassifier(hidden_layer_sizes=j,solver='lbfgs')
    NN_model.fit(X_train,y_train)

    # capture local accuracy and loss scores
    neuron_train_accuracy = NN_model.score(X_train,y_train)
    neuron_test_accuracy = NN_model.score(X_test,y_test)
    neuron_loss_score = NN_model.loss_
    
    # store local accuracy and loss scores in global variable
    neuron_train_accuracies.append(neuron_train_accuracy)
    neuron_test_accuracies.append(neuron_test_accuracy)
    neuron_loss_scores.append(neuron_loss_score)
        
    end = time.time()
    neuron_size_hidden_layer_times.append(end-start)
    print('Hidden Layer = (',j,'), Execution Time: ', end-start, 'seconds')
    print('Train Accuracy: ', neuron_train_accuracy)
    print('Test Accuracy:  ', neuron_test_accuracy)
    print('Loss Score:     ', neuron_loss_score)

In [None]:
sum(neuron_size_hidden_layer_times)
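
The two sweeps above collect accuracies in lists but do not select a winner. A minimal sketch of that last step follows; the helper `best_setting` is hypothetical, and the accuracy values are made up purely for illustration and are not results from this notebook.

```python
import numpy as np

def best_setting(settings, test_scores):
    """Return the (setting, score) pair with the highest test score."""
    best = int(np.argmax(test_scores))
    return settings[best], test_scores[best]

# Hidden layer sizes produced by np.arange(1, 501, 100) in the sweep above.
toy_hidden_sizes = [1, 101, 201, 301, 401]
# Made-up numbers, NOT results from this notebook.
toy_test_accuracies = [0.90, 0.92, 0.91, 0.93, 0.92]

size, score = best_setting(toy_hidden_sizes, toy_test_accuracies)
print(f"best hidden layer size: {size} (test accuracy {score})")
```

Selecting on the test set risks an optimistic estimate, so a separate validation split would be a safer basis for this choice, and recall or F1-score could be substituted for accuracy given the class imbalance.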

# Conclusion

The best classifier was the Logistic Regression, with the following metrics/scores:

Logit Train Score:       0.9274289916352606 <br>
Logit Test Score:        0.9204289544235925 <br>
Logit Train Smote Score: 0.9209275565339203 <br>
Logit Test Smote Score:  0.8441823056300268 <br>
Logit Balanced Train Score:       0.87990624138248 <br>
Logit Balanced Test Score:        0.828739946380697 <br>


__y_pred_smote:__<br>

                  precision    recall  f1-score   support

               0       0.95      0.87      0.91      8567
               1       0.27      0.52      0.35       758

        accuracy                           0.84      9325
       macro avg       0.61      0.70      0.63      9325
    weighted avg       0.90      0.84      0.87      9325

__y_pred_smote confusion matrix:__<br>
 [[7480 1087] <br>
 [ 366  392]] <br>

__y_pred_weight:__ <br>

                  precision    recall  f1-score   support

               0       0.96      0.85      0.90      8567
               1       0.26      0.58      0.36       758

        accuracy                           0.83      9325
       macro avg       0.61      0.72      0.63      9325
    weighted avg       0.90      0.83      0.86      9325

__y_pred_weight confusion matrix:__ <br>
 [[7287 1280] <br>
 [ 317  441]] <br>

Although the accuracy is high, it is not the best indicator of a classifier's success in this particular case. The primary metrics used to determine the best classifier were precision, recall and F1-score. <br> 

Precision is defined as: 
\begin{equation*}
\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}}
\end{equation*}

Recall is defined as:
\begin{equation*}
\text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}
\end{equation*}

F1-Score is defined as:
\begin{equation*}
F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
\end{equation*}

Recall was prioritized because we are primarily interested in correctly identifying questions. That being said, the F1-score is also a good measure because we seek a balance between precision and recall and because the class distribution is uneven (a larger number of actual negatives), as is apparent in this dataset. The confusion matrix is also used as a grid summary of predicted versus actual labels: true negatives, false positives, false negatives and true positives.
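
As a worked check of these definitions, the metrics for the question class can be recomputed from the SMOTE logistic regression confusion matrix reported above. The sketch assumes scikit-learn's layout, where rows are actual classes and columns are predicted classes.

```python
import numpy as np

# SMOTE logistic regression confusion matrix: rows = actual, columns = predicted
cm = np.array([[7480, 1087],
               [366, 392]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
# precision=0.27 recall=0.52 f1=0.35 accuracy=0.84
```

The accuracy of 0.84 is driven almost entirely by the 7480 true negatives, which is why it says little about how well questions themselves are detected.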

# End Conclusion