## Contents<a id='Contents'></a>
* [Introduction](#Introduction)
  * [Imports](#Imports)
  * [Loading the Data](#Loading%20the%20Data)
  * [Predicting](#Predicting)
      * [Multinomial Naive Bayes](#Multinomial%20Naive%20Bayes)
      * [Random Forest](#Random%20Forest)
      * [Multinomial Naive Bayes for Resumes](#Multinomial%20Naive%20Bayes%20for%20Resumes)
      * [Random Forest for Resumes](#Random%20Forest%20for%20Resumes)
  * [Preprocessing](#Preprocessing)
  * [More Preprocessing](#More%20Preprocessing)
  * [Tensorflow Deep Learning](#Tensorflow%20Deeplearning)
  * [Summary](#Summary)

## Introduction<a id='Introduction'></a>

Preprocessing is an often overlooked step in NLP problems.

To preprocess your text means to bring your text into a form that is predictable and analyzable for your task. A task here is a combination of approach and domain. 

Task = approach + domain

Text preprocessing is not directly transferable from task to task! Like all data science projects, a well-defined goal must be declared at the beginning of the project. This goal informs all of the steps in the data science method, including preprocessing.

Many of the usual NLP preprocessing techniques (lowercasing, lemmatization, stemming, stop word removal, etc.) were performed in the wrangling/EDA step of this project. These techniques will also be applied in this notebook.

## Imports<a id='Imports'></a>

In [1]:
import numpy as np
import os 
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
import matplotlib as plty
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
%matplotlib inline
from wordcloud import WordCloud, STOPWORDS
from sklearn import metrics, preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

stop = set(stopwords.words('english'))    # English stop words from NLTK
Encode = preprocessing.LabelEncoder()     # label encoder instance
vect = CountVectorizer()                  # bag-of-words vectorizer (imported above)
NB = MultinomialNB()                      # multinomial Naive Bayes classifier

In [3]:
os.chdir(r'C:\Users\daenj\OneDrive\Desktop\Datasets\Capstone 2 2021')

## Loading the Data<a id='Loading the Data'></a>

Due to the size and well-written nature of both datasets, little to no preprocessing is required. A domain-specific dataset with sparse data (e.g., tweets about a product) would require noise removal, lowercasing, stemming, etc. In that case many preprocessing layers would be needed; this situation is not one of those cases.

In [4]:
data = pd.read_csv('Sheet_1.csv', encoding='latin-1')

data.drop(["Unnamed: 3","Unnamed: 4","Unnamed: 5","Unnamed: 6","Unnamed: 7"], axis = 1, inplace =True)

data = pd.concat([data["class"],data["response_text"]], axis = 1)

data.dropna(axis=0, inplace =True)

In [5]:
data2 = pd.read_csv('Sheet_2.csv', encoding='latin-1')

data2.dropna(axis=0, inplace =True)

In [6]:
data["class"] = [1 if each == "flagged" else 0 for each in data["class"]]

In [7]:
data2["class"] = [1 if each == "flagged" else 0 for each in data2["class"]]

The above code changes the 'class' column of both datasets into binary values, 1 for 'flagged', 0 for 'not flagged'.
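
As a minimal sketch of the same mapping, the snippet below uses a small made-up DataFrame (`toy` is a hypothetical name, not part of the notebook) and a vectorized comparison instead of a list comprehension. Note that any value other than `"flagged"` ends up as 0.

```python
import pandas as pd

# Hypothetical labels, mirroring the two values found in the 'class' column
toy = pd.DataFrame({"class": ["flagged", "not_flagged", "flagged"]})

# Vectorized equivalent of the list comprehension: True -> 1, False -> 0
toy["class"] = (toy["class"] == "flagged").astype(int)

print(toy["class"].tolist())
```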

## Predicting<a id='Predicting'></a>

Scikit-learn's CountVectorizer converts a collection of text documents into a matrix of token counts, with one row per document and one column per term. It can also preprocess the text (for example, lowercasing and tokenizing it) before generating the counts. CountVectorizer is the only preprocessing applied to the text data in this section.
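
To make the count representation concrete, here is a small sketch on two made-up sentences (the documents and the names `docs`, `toy_vect`, `dtm` are illustrative assumptions). With default settings, CountVectorizer lowercases the text and drops single-character tokens such as "I".

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents, used only for illustration
docs = ["I feel fine today", "today I feel anxious, very anxious"]

toy_vect = CountVectorizer()
dtm = toy_vect.fit_transform(docs)      # sparse document-term matrix

vocab = sorted(toy_vect.vocabulary_)    # learned vocabulary, in column order
counts = dtm.toarray()                  # dense counts, one row per document

print(vocab)   # ['anxious', 'feel', 'fine', 'today', 'very']
print(counts)  # [[0 1 1 1 0], [2 1 0 1 1]]
```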

The following cells use a 75/25 train/test split on both datasets. The first model is a multinomial Naive Bayes classifier, which is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts, which the count vectorizer provides. The second model is a random forest classifier. A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control over-fitting. Let's look at various metrics.
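
The workflow described above can be sketched end to end on a handful of made-up texts (the texts, labels, and variable names here are illustrative assumptions, not the notebook's data). The key point is that the vectorizer learns its vocabulary from the training text only and is then reused, unchanged, on the test text.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Made-up texts and labels (1 = flagged, 0 = not flagged), for illustration only
texts = ["help me please", "need help now", "nice sunny day", "lovely day out",
         "please help now", "sunny and lovely", "help needed please", "a nice day"]
labels = [1, 1, 0, 0, 1, 0, 1, 0]

# 75/25 split, stratified so both classes appear in the test set
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels)

toy_vect = CountVectorizer()
X_train_dtm = toy_vect.fit_transform(X_train)  # learn the vocabulary on training text
X_test_dtm = toy_vect.transform(X_test)        # reuse that vocabulary on test text

clf = MultinomialNB().fit(X_train_dtm, y_train)
predictions = clf.predict(X_test_dtm)
print(list(predictions), y_test)
```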

## Multinomial Naive Bayes<a id='Multinomial Naive Bayes'></a>

In [8]:
x = data.response_text
y = data['class']
x_train,x_test,y_train,y_test = train_test_split(x,y, test_size = 0.25, random_state=42)
x_train_dtm = vect.fit_transform(x_train)
x_test_dtm = vect.transform(x_test)
NB.fit(x_train_dtm,y_train)
y_predict = NB.predict(x_test_dtm)
metrics.accuracy_score(y_test,y_predict)

0.65

In [9]:
metrics.precision_score(y_test,y_predict)

0.3333333333333333

In [10]:
metrics.f1_score(y_test,y_predict)

0.46153846153846156
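
To show how these metrics relate to each other, the sketch below computes precision, recall, and F1 by hand from true/false positive counts and checks them against scikit-learn. The label lists are hypothetical and are not the notebook's predictions.

```python
from sklearn import metrics

# Hypothetical true and predicted labels (1 = flagged)
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_hat = [1, 1, 0, 1, 1, 0, 0, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_hat))  # flagged, predicted flagged
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_hat))  # not flagged, predicted flagged
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_hat))  # flagged, predicted not flagged

precision = tp / (tp + fp)                          # share of predicted flags that are right
recall = tp / (tp + fn)                             # share of real flags that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, recall, f1)
```

Because F1 is the harmonic mean of precision and recall, a low precision pulls it down even when accuracy looks reasonable.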

## Random Forest<a id='Random Forest'></a>

In [11]:
rf = RandomForestClassifier(max_depth=10,max_features=10)
rf.fit(x_train_dtm,y_train)
rf_predict = rf.predict(x_test_dtm)
metrics.accuracy_score(y_test,rf_predict)

0.75

## Multinomial Naive Bayes for Resumes<a id='Multinomial Naive Bayes for Resumes'></a>

In [12]:
x = data2.resume_text
y = data2['class']
x_train,x_test,y_train,y_test = train_test_split(x,y, test_size = 0.25, random_state=42)
x_train_dtm = vect.fit_transform(x_train)
x_test_dtm = vect.transform(x_test)
NB.fit(x_train_dtm,y_train)
y_predict = NB.predict(x_test_dtm)
metrics.accuracy_score(y_test,y_predict)

0.71875

In [13]:
metrics.precision_score(y_test,y_predict)

0.75

In [14]:
metrics.f1_score(y_test,y_predict)

0.39999999999999997

## Random Forest for Resumes<a id='Random Forest for Resumes'></a>

In [15]:
rf = RandomForestClassifier(max_depth=10,max_features=10)
rf.fit(x_train_dtm,y_train)
rf_predict = rf.predict(x_test_dtm)
metrics.accuracy_score(y_test,rf_predict)

0.65625

The precision and F1 scores for the random forest models were 0, meaning the models produced no true positives for the flagged class, so these metrics are not reported.
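
A score of 0 on both metrics alongside a reasonable accuracy is possible when no flagged example is predicted correctly. One common way this happens, assumed here for illustration since the random forest predictions are not shown, is a model that predicts only the majority class. The labels below are made up.

```python
from sklearn import metrics

# Hypothetical labels: the model never predicts the positive (flagged) class
y_true = [1, 0, 0, 1, 0, 0]
y_hat = [0, 0, 0, 0, 0, 0]

acc = metrics.accuracy_score(y_true, y_hat)
# zero_division=0 returns 0 (without a warning) when there are no positive predictions
prec = metrics.precision_score(y_true, y_hat, zero_division=0)
f1 = metrics.f1_score(y_true, y_hat, zero_division=0)

print(acc, prec, f1)
```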

## Preprocessing<a id='Preprocessing'></a>

In this section, several layers of preprocessing are applied to the data: lowercasing, tokenization, stop word removal, lemmatization, and count vectorization. Note that the loop below, and the resulting count matrix used in the train/test split, only works for the chatbot response data. A Gaussian Naive Bayes classifier is used.

In [16]:
import nltk as nlp
nltk.download("stopwords")    
from nltk.stem.wordnet import WordNetLemmatizer

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\daenj\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [17]:
stop_words = set(stopwords.words("english"))  # build the stop word set once
lemmatizer = WordNetLemmatizer()

description_list = []
for description in data.response_text:
    # keep letters only, then lowercase
    description = re.sub("[^a-zA-Z]", " ", description)
    description = description.lower()

    # tokenize and drop stop words
    tokens = nltk.word_tokenize(description)
    tokens = [word for word in tokens if word not in stop_words]

    # lemmatize each token as a noun, then as a verb, then as an adjective
    tokens = [
        lemmatizer.lemmatize(lemmatizer.lemmatize(lemmatizer.lemmatize(word, "n"), pos="v"), pos="a")
        for word in tokens
    ]

    description_list.append(" ".join(tokens))

In [18]:
max_features = 100

In [19]:
count_vectorizer = CountVectorizer(max_features=max_features)

In [20]:
sparce_matrix = count_vectorizer.fit_transform(description_list).toarray()

In [21]:
y = data['class']
x = sparce_matrix

In [22]:
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.25, random_state = 42)

In [23]:
nb = GaussianNB()
nb.fit(x_train,y_train)

GaussianNB()

In [24]:
y_pred = nb.predict(x_test)
# accuracy compares the true test labels with the predictions made from x_test
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.75


The GaussianNB classifier reaches the same accuracy as the Random Forest classifier (0.75) on the chatbot response data, and it performs better than the MultinomialNB classifier (0.65).

## More Preprocessing<a id='More Preprocessing'></a>

The following section is not used in a predictive model. Its purpose is to show some of the NLP preprocessing techniques that are available.

In [25]:
from nltk.stem import PorterStemmer

# init stemmer
porter_stemmer=PorterStemmer()

In [26]:
# stem each whitespace-separated token of the raw (uncleaned) responses
raw_words=data['response_text']
stemmed_words=[" ".join(porter_stemmer.stem(token) for token in text.split()) for text in raw_words]

In [27]:
# stem each whitespace-separated token of the raw (uncleaned) resumes
raw_words2=data2['resume_text']
stemmed_words2=[" ".join(porter_stemmer.stem(token) for token in text.split()) for text in raw_words2]
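
To make the effect of stemming concrete, here is a small sketch on a made-up phrase. `stem_text` is a hypothetical helper, and splitting on whitespace is a simplifying assumption; a tokenizer such as `nltk.word_tokenize` could be used instead. The Porter stemmer works on one word at a time, so the text is split into tokens first.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_text(text):
    """Stem each whitespace-separated token of `text` (hypothetical helper)."""
    return " ".join(stemmer.stem(token) for token in text.split())

sample = "friends running connected"
print(stem_text(sample))  # friend run connect
```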

In [28]:
def scrub_words(text):
    """Basic cleaning of texts."""
    
    # remove html markup
    text=re.sub(r"(<.*?>)","",text)
    
    # replace non-word characters (punctuation, symbols) and digits with a space
    text=re.sub(r"(\W|\d)"," ",text)
    
    # strip leading and trailing whitespace
    text=text.strip()
    return text

In [29]:
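# clean each response, then pass the cleaned text to the stemmer
# (stem() receives each full text as one word, so the visible effect in the table is lowercasing)
cleaned_words1=[scrub_words(w) for w in raw_words]
cleaned_stemmed_words1=[porter_stemmer.stem(word=word) for word in cleaned_words1]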
stemdf= pd.DataFrame({'raw_word': raw_words,'cleaned_word':cleaned_words1,'stemmed_word': cleaned_stemmed_words1})
stemdf=stemdf[['raw_word','cleaned_word','stemmed_word']]
stemdf

| | raw_word | cleaned_word | stemmed_word |
|---|---|---|---|
| 0 | I try and avoid this sort of conflict | I try and avoid this sort of conflict | i try and avoid this sort of conflict |
| 1 | Had a friend open up to me about his mental ad... | Had a friend open up to me about his mental ad... | had a friend open up to me about his mental ad... |
| 2 | I saved a girl from suicide once. She was goin... | I saved a girl from suicide once She was goin... | i saved a girl from suicide once she was goin... |
| 3 | i cant think of one really...i think i may hav... | i cant think of one really i think i may hav... | i cant think of one really i think i may hav... |
| 4 | Only really one friend who doesn't fit into th... | Only really one friend who doesn t fit into th... | only really one friend who doesn t fit into th... |
| ... | ... | ... | ... |
| 75 | Now that I've been through it, although i'm no... | Now that I ve been through it although i m no... | now that i ve been through it although i m no... |
| 76 | when my best friends mom past away from od'ing... | when my best friends mom past away from od ing... | when my best friends mom past away from od ing... |
| 77 | As a camp counselor I provide stability in kid... | As a camp counselor I provide stability in kid... | as a camp counselor i provide stability in kid... |
| 78 | My now girlfriend used to have serious addicti... | My now girlfriend used to have serious addicti... | my now girlfriend used to have serious addicti... |


In [30]:
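# clean each resume, then pass the cleaned text to the stemmer
# (stem() receives each full text as one word, so the visible effect in the table is lowercasing)
cleaned_words2=[scrub_words(w) for w in raw_words2]
cleaned_stemmed_words2=[porter_stemmer.stem(word=word) for word in cleaned_words2]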
stemdf= pd.DataFrame({'raw_word': raw_words2,'cleaned_word':cleaned_words2,'stemmed_word': cleaned_stemmed_words2})
stemdf=stemdf[['raw_word','cleaned_word','stemmed_word']]
stemdf

| | raw_word | cleaned_word | stemmed_word |
|---|---|---|---|
| 0 | \rCustomer Service Supervisor/Tier - Isabella ... | Customer Service Supervisor Tier Isabella Ca... | customer service supervisor tier isabella ca... |
| 1 | \rEngineer / Scientist - IBM Microelectronics ... | Engineer Scientist IBM Microelectronics Di... | engineer scientist ibm microelectronics di... |
| 2 | \rLTS Software Engineer Computational Lithogra... | LTS Software Engineer Computational Lithograph... | lts software engineer computational lithograph... |
| 3 | TUTOR\rWilliston VT - Email me on Indeed: ind... | TUTOR Williston VT Email me on Indeed indee... | tutor williston vt email me on indeed indee... |
| 4 | \rIndependent Consultant - Self-employed\rBurl... | Independent Consultant Self employed Burling... | independent consultant self employed burling... |
| ... | ... | ... | ... |
| 120 | \rBrattleboro VT - Email me on Indeed: indeed.... | Brattleboro VT Email me on Indeed indeed co... | brattleboro vt email me on indeed indeed co... |
| 121 | \rResearch and Teaching Assistant - University... | Research and Teaching Assistant University o... | research and teaching assistant university o... |
| 122 | \rMedical Coder - Highly Skilled - Entry Level... | Medical Coder Highly Skilled Entry Level S... | medical coder highly skilled entry level s... |
| 123 | \rWaterbury VT - Email me on Indeed: indeed.co... | Waterbury VT Email me on Indeed indeed com ... | waterbury vt email me on indeed indeed com ... |


The resume dataset benefits greatly from the `scrub_words` function: the carriage returns, slashes, dashes, and other punctuation in the raw resume text are all removed.
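
The cleaned text in the tables above still contains runs of repeated spaces where characters were replaced. As a sketch of one possible refinement, the hypothetical variant below (`scrub_and_collapse` is not part of the notebook) applies the same two substitutions and then collapses consecutive whitespace. The sample string is made up.

```python
import re

def scrub_and_collapse(text):
    """Hypothetical variant of scrub_words that also collapses repeated spaces."""
    text = re.sub(r"<.*?>", "", text)          # drop HTML tags
    text = re.sub(r"\W|\d", " ", text)         # non-word characters and digits -> space
    return re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace

sample = "\rEngineer / Scientist - IBM <b>Microelectronics</b> 2020"
print(scrub_and_collapse(sample))  # Engineer Scientist IBM Microelectronics
```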

## TensorFlow Deep Learning<a id='Tensorflow Deeplearning'></a>

Deep learning is a category of machine learning that uses multi-layer neural networks. A neural network is a collection of layers, each of which transforms its input in some way, with the final layer producing the output.
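
The idea of layers transforming an input can be sketched with plain NumPy. This is an illustrative forward pass with random, untrained weights, not the TensorFlow model built below; the layer sizes (16, then 6, then 1) are chosen only to echo that model, and `dense` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, weights, bias, activation):
    """One fully connected layer: affine transform followed by an activation."""
    return activation(x @ weights + bias)

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = rng.normal(size=(4, 16))                    # 4 examples, 16 features each
w1, b1 = rng.normal(size=(16, 6)), np.zeros(6)  # random, untrained weights
w2, b2 = rng.normal(size=(6, 1)), np.zeros(1)

hidden = dense(x, w1, b1, relu)                 # shape (4, 6)
output = dense(hidden, w2, b2, sigmoid)         # shape (4, 1), values between 0 and 1

print(hidden.shape, output.shape)
```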

In [31]:
df = pd.read_csv('Sheet_1.csv', encoding='latin-1')
df.drop(["Unnamed: 3","Unnamed: 4","Unnamed: 5","Unnamed: 6","Unnamed: 7"], axis = 1, inplace =True)
df = df.rename(columns={"class":"Sentiment", "response_text":"Lables"})
df.dropna(axis=0, inplace =True)

In [32]:
df2 = pd.read_csv('Sheet_2.csv', encoding='latin-1')
df2 = df2.drop(["resume_id"], axis = 1)
df2 = df2.rename(columns={"class":"Sentiment", "resume_text":"Lables"})
df2.dropna(axis=0, inplace =True)

In [33]:
combined  = pd.concat([df,df2])
combined.replace(('flagged','not_flagged'),(1,0),inplace=True)

In [34]:
combined.head()

| | response_id | Sentiment | Lables |
|---|---|---|---|
| 0 | response_1 | 0 | I try and avoid this sort of conflict |
| 1 | response_2 | 1 | Had a friend open up to me about his mental ad... |
| 2 | response_3 | 1 | I saved a girl from suicide once. She was goin... |
| 3 | response_4 | 0 | i cant think of one really...i think i may hav... |
| 4 | response_5 | 0 | Only really one friend who doesn't fit into th... |


The two datasets are combined, with their columns renamed: the text column becomes 'Lables' (spelled as in the code) and the 'class' column becomes 'Sentiment'. Sentiment holds the same binary values assigned to the 'class' column earlier: 1 is 'flagged', 0 is 'not flagged'.

In [35]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences=combined['Lables'].tolist()
labels=combined['Sentiment'].tolist()

In [36]:
# Separate out the sentences and labels into training and test sets
training_size = int(len(sentences) * 0.8)

training_sentences = sentences[0:training_size]
testing_sentences = sentences[training_size:]
training_labels = labels[0:training_size]
testing_labels = labels[training_size:]

# Make labels into numpy arrays for use with the network later
training_labels_final = np.array(training_labels)
testing_labels_final = np.array(testing_labels)
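
Because `combined` stacks the chatbot responses ahead of the resumes, slicing off the last 20% keeps the rows in that order, so the held-out rows come from the tail of the list. One alternative, sketched here on made-up data rather than the notebook's, is a shuffled, stratified split with `train_test_split`. All names and values below are illustrative assumptions.

```python
from sklearn.model_selection import train_test_split

# Made-up stand-ins for the combined data: 6 "responses" followed by 4 "resumes"
toy_sentences = [f"response {i}" for i in range(6)] + [f"resume {i}" for i in range(4)]
toy_labels = [1, 0, 0, 1, 0, 0, 1, 0, 0, 1]

# Slice-based split: the test set is simply the tail of the list
cut = int(len(toy_sentences) * 0.8)
tail_test = toy_sentences[cut:]

# Shuffled, stratified split: test rows are drawn from the whole list
train_s, test_s, train_l, test_l = train_test_split(
    toy_sentences, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels)

print(tail_test)
print(test_s)
```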

In [37]:
vocab_size = 300
embedding_dim = 16
max_length = 25
trunc_type='post'
padding_type='post'
oov_tok = "<OOV>"

tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index
training_sequences = tokenizer.texts_to_sequences(training_sentences)
training_padded = pad_sequences(training_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 25, 16)            4800      
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0         
_________________________________________________________________
dense (Dense)                (None, 6)                 102       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 7         
Total params: 4,909
Trainable params: 4,909
Non-trainable params: 0
_________________________________________________________________
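
The parameter counts in the summary can be checked by hand. The arithmetic below uses the `vocab_size` and `embedding_dim` values set in the cell above, plus the layer widths from the model definition.

```python
vocab_size, embedding_dim = 300, 16

embedding_params = vocab_size * embedding_dim   # one 16-dimensional vector per token id
dense_params = embedding_dim * 6 + 6            # weights + biases of Dense(6)
output_params = 6 * 1 + 1                       # weights + bias of Dense(1)
total = embedding_params + dense_params + output_params

print(embedding_params, dense_params, output_params, total)  # 4800 102 7 4909
```

The global average pooling layer has no parameters: it only averages the 25 embedding vectors of each padded sequence into a single 16-dimensional vector.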


In [38]:
num_epochs = 30
history = model.fit(training_padded, training_labels_final, epochs=num_epochs, validation_data=(testing_padded, testing_labels_final))

Epoch 1/30
...
Epoch 30/30


## Summary<a id='Summary'></a>

The nature of these data (well written, abundant) meant that minimal preprocessing was required when building a predictive model. Multinomial Naive Bayes and Random Forest classifiers were trained on both datasets after the text was vectorized with scikit-learn's CountVectorizer. A Gaussian Naive Bayes classifier was used to predict on the chatbot response data after additional preprocessing. Finally, a TensorFlow neural network was trained on a combined dataset of the chatbot responses and resumes. Unlike the ensemble of decision trees in a random forest, this model predicts through a stack of layers: an embedding, global average pooling, and two dense layers.