<a href="https://colab.research.google.com/github/eliaswalyba/digital-genius/blob/master/models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning Engineer / Conversational AI Engineer - Technical
In this test you will build a system that helps to classify simple conversations. This test can be treated as an opportunity to show skills and knowledge, as a learning exercise and as a prompt for further interviews in your process with DigitalGenius.

## Install required librairies
1.   Gensim for the Word2Vec and Doc2Vec models
2.   NLTK for text manipulation
3.   BeautifulSoup4 for markup tags pre-processing

In [59]:
!pip install gensim
!pip install nltk
!pip install beautifulsoup4



## Import all the librairies we need for this project

When you excute this cell, it will prompt an input box for setting up the NLTK library. Just type ***q*** in that input field and hit ***enter***.

In [80]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

from sklearn import utils
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.preprocessing import LabelBinarizer, LabelEncoder
from sklearn.model_selection import train_test_split as tts
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

import gensim
from gensim.models import Word2Vec, Doc2Vec
from gensim.models.doc2vec import TaggedDocument

import nltk
nltk.download()
from nltk.corpus import stopwords


import tensorflow as tf
from tensorflow import keras
from keras.preprocessing import text, sequence
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras import utils

from bs4 import BeautifulSoup

from itertools import islice
import itertools
import os
import re

from tqdm import tqdm
tqdm.pandas(desc="progress-bar")

from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> q


## Load the data set from Github

In [0]:
df = pd.read_csv("https://raw.githubusercontent.com/eliaswalyba/digital-genius/master/tech_test_data.csv")
df = pd.concat([df.message, df.case_type], axis=1)
df = df.sample(frac=1).reset_index(drop=True)
x_train, x_test, y_train, y_test = tts(df.message, df.case_type)

## Text Pre-processing
For this particular data set, our text cleaning step includes HTML decoding, remove stop words, change text to lower case, remove punctuation, remove bad characters, and so on.

In [0]:
def print_plot(index):
    example = df[df.index == index][['message', 'case_type']].values[0]
    if len(example) > 0:
        print(example[0])
        print('Case Type:', example[1])

In [0]:
REPLACE_BY_SPACE_RE = re.compile('[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

def clean_text(text):
    text = BeautifulSoup(text, "lxml").text # HTML decoding
    text = text.lower()
    text = REPLACE_BY_SPACE_RE.sub(' ', text)
    text = BAD_SYMBOLS_RE.sub('', text)
    text = ''.join(i for i in text if ord(i) < 128)
    text = ' '.join(word for word in text.split() if word not in STOPWORDS)
    return text

In [16]:
df.message = df.message.apply(clean_text)
print_plot(10)

course let assist please share account number order id ill check
Case Type: order_status


## Simple Statistical Learning Models
In this section we are going to try out 3 simple statistical learning models(Multinomial Naive Bayes, Linear Support Vector Machines and Logistic Regression) and compare their accuracies on this dataset.

To avoid code duplication we will create a model builder and use it for the rest of this part.

### Model Builder

In [0]:
class Model():
    
    def __init__(self, 
        classifier, 
        logreg_jobs=1, 
        logreg_c=1e5, 
        lsvm_loss='hinge', 
        lsvm_penalty='l2', 
        lsvm_alpha=1e-3, 
        lsvm_random_state=42, 
        lsvm_max_iter=5, 
        lsvm_tol=None
    ):
        if classifier == 'naive bayes':
            self.classifier = MultinomialNB()
        elif classifier == 'logistic regression':
            self.classifier = LogisticRegression(n_jobs=logreg_jobs, C=logreg_c)
        elif classifier == 'lsvm':
            self.classifier = SGDClassifier(
                loss=lsvm_loss, 
                penalty=lsvm_penalty, 
                alpha=lsvm_alpha, 
                random_state=lsvm_random_state, 
                max_iter=lsvm_max_iter, 
                tol=lsvm_tol
            )
        else:
            self.classifier = None
            
            
    def train(self, feature, label):
        self.model = Pipeline([
            ('vectorizer', CountVectorizer()),
            ('transformer', TfidfTransformer()),
            ('classifier', self.classifier),
        ]).fit(feature, label)
        
        return self
    
    def test(self, test):
        return self.model.predict(test)

In [0]:
def print_score_report(score, report):
    print(f'Accuracy\n------------------------------------------------------\n{score}\n')
    print(f'Report\n------------------------------------------------------\n{report}\n')

### Multinomial Naive Bayes Classifier
Let's train a Naive Bayes classifier, which provides a nice baseline, to try to predict the **case type** of a **message**.

In [51]:
y_pred = Model('naive bayes').train(x_train, y_train).test(x_test)
score  = accuracy_score(y_pred, y_test)
report = classification_report(y_test, y_pred, target_names=df['case_type'].unique())
print_score_report(score, report)

Accuracy
------------------------------------------------------
0.7272727272727273

Report
------------------------------------------------------
              precision    recall  f1-score   support

order_status       0.90      0.64      0.75        14
cancel_order       0.58      0.88      0.70         8

    accuracy                           0.73        22
   macro avg       0.74      0.76      0.73        22
weighted avg       0.78      0.73      0.73        22




### Linear Support Vector Machine
Linear Support Vector Machine is widely regarded as one of the best text classification algorithms. Let's build one and try it out !

In [52]:
y_pred = Model('lsvm').train(x_train, y_train).test(x_test)
score  = accuracy_score(y_pred, y_test)
report = classification_report(y_test, y_pred, target_names=df['case_type'].unique())
print_score_report(score, report)

Accuracy
------------------------------------------------------
0.6818181818181818

Report
------------------------------------------------------
              precision    recall  f1-score   support

order_status       0.82      0.64      0.72        14
cancel_order       0.55      0.75      0.63         8

    accuracy                           0.68        22
   macro avg       0.68      0.70      0.68        22
weighted avg       0.72      0.68      0.69        22




### Logistic Regression
Logistic regression is a simple and easy to understand classification algorithm, and Logistic regression can be easily generalized to multiple classes.

In [54]:
y_pred = Model('logistic regression').train(x_train, y_train).test(x_test)
score  = accuracy_score(y_pred, y_test)
report = classification_report(y_test, y_pred, target_names=df['case_type'].unique())
print_score_report(score, report)

Accuracy
------------------------------------------------------
0.6363636363636364

Report
------------------------------------------------------
              precision    recall  f1-score   support

order_status       0.80      0.57      0.67        14
cancel_order       0.50      0.75      0.60         8

    accuracy                           0.64        22
   macro avg       0.65      0.66      0.63        22
weighted avg       0.69      0.64      0.64        22




## Word embedings and Neural Networks
As you can see, following some very basic steps and using a simple linear model, we were able to reach as high as an 72% accuracy on this multi-class text classification data set.
Using the same data set, we are going to try some advanced techniques such as word embedding and neural networks.

### Word2vec and Logistic Regression
Word2vec, like doc2vec, belongs to the text preprocessing phase. Specifically, to the part that transforms a text into a row of numbers. Word2vec is a type of mapping that allows words with similar meaning to have similar vector representation.
The idea behind Word2vec is rather simple: we want to use the surrounding words to represent the target words with a Neural Network whose hidden layer encodes the word representation.
First we load a word2vec model. It has been pre-trained by Google on a 100 billion word Google News corpus.

In [55]:
wv = gensim.models.KeyedVectors.load_word2vec_format("https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz", binary=True)
wv.init_sims(replace=True)

  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL


In [74]:
list(islice(wv.vocab, 13030, 13050))

['Memorial_Hospital',
 'Seniors',
 'memorandum',
 'elephant',
 'Trump',
 'Census',
 'pilgrims',
 'De',
 'Dogs',
 '###-####_ext',
 'chaotic',
 'forgive',
 'scholar',
 'Lottery',
 'decreasing',
 'Supervisor',
 'fundamentally',
 'Fitness',
 'abundance',
 'Hold']

BOW based approaches that includes averaging, summation, weighted addition. The common way is to average the two word vectors. Therefore, we will follow the most common way.

In [0]:
def word_averaging(wv, words):
    all_words, mean = set(), []
    
    for word in words:
        if isinstance(word, np.ndarray):
            mean.append(word)
        elif word in wv.vocab:
            mean.append(wv.syn0norm[wv.vocab[word].index])
            all_words.add(wv.vocab[word].index)

    if not mean:
        logging.warning("cannot compute similarity with no input %s", words)
        return np.zeros(wv.vector_size,)

    mean = gensim.matutils.unitvec(np.array(mean).mean(axis=0)).astype(np.float32)
    return mean

def  word_averaging_list(wv, text_list):
    return np.vstack([word_averaging(wv, post) for post in text_list ])

We will tokenize the text and apply the tokenization to “post” column, and apply word vector averaging to tokenized text.

In [0]:
def w2v_tokenize_text(text):
    tokens = []
    for sent in nltk.sent_tokenize(text, language='english'):
        for word in nltk.word_tokenize(sent, language='english'):
            if len(word) < 2:
                continue
            tokens.append(word)
    return tokens

In [64]:
train, test = tts(df, test_size=0.2, random_state = 42)

test_tokenized = test.apply(lambda r: w2v_tokenize_text(r['message']), axis=1).values
train_tokenized = train.apply(lambda r: w2v_tokenize_text(r['message']), axis=1).values

X_train_word_average = word_averaging_list(wv,train_tokenized)
X_test_word_average = word_averaging_list(wv,test_tokenized)

  


Its time to see how logistic regression classifiers performs on these word-averaging document features.

In [65]:
logreg = LogisticRegression(n_jobs=1, C=1e5)
y_pred = logreg.fit(X_train_word_average, train['message']).predict(X_test_word_average)
print('accuracy %s' % accuracy_score(y_pred, test.message))

accuracy 0.5


It was disappointing, worst we have seen so far.

### Doc2vec and Logistic Regression
The same idea of word2vec can be extended to documents where instead of learning feature representations for words, we learn it for sentences or documents.

Doc2Vec extends the idea of word2vec, however words can only capture so much, there are times when we need relationships between documents and not just words.

The way to train doc2vec model for our dataset is very similar with when we train with Doc2vec and Logistic Regression.

First, we label the sentences. Gensim’s Doc2Vec implementation requires each document/paragraph to have a label associated with it. and we do this by using the TaggedDocument method. The format will be “TRAIN_i” or “TEST_i” where “i” is a dummy index of the post.

In [0]:
def label_sentences(corpus, label_type):
    labeled = []
    for i, v in enumerate(corpus):
        label = label_type + '_' + str(i)
        labeled.append(TaggedDocument(v.split(), [label]))
    return labeled

In [0]:
x_train, x_test, y_train, y_test = tts(df.message, df.case_type, random_state=0, test_size=0.3)
x_train = label_sentences(x_train, 'Train')
x_test = label_sentences(x_test, 'Test')
all_data = x_train + x_test

According to Gensim doc2vec tutorial, its doc2vec class was trained on the entire data, and we will do the same. Let’s have a look what the tagged document looks like:

In [69]:
all_data[:5]

[TaggedDocument(words=['Hey,', 'do', 'you', 'know', 'where', 'my', 'order', 'is?'], tags=['Train_0']),
 TaggedDocument(words=['No', 'worries,', 'my', 'order', 'ID', 'is', 'BEDSW912,', 'let', 'me', 'check', 'the', 'account', 'number'], tags=['Train_1']),
 TaggedDocument(words=['Of', 'course!', 'Let', 'me', 'assist.', 'Please', 'share', 'your', 'account', 'number', 'and', 'order', 'ID', 'and', 'I’ll', 'see', 'what', 'the', 'options', 'are.'], tags=['Train_2']),
 TaggedDocument(words=['Of', 'course!', 'Let', 'me', 'assist.', 'Please', 'share', 'your', 'account', 'number', 'and', 'order', 'ID', 'and', 'I’ll', 'see', 'what', 'the', 'options', 'are.'], tags=['Train_3']),
 TaggedDocument(words=['account', 'number', '01928340'], tags=['Train_4'])]

Let's train a Doc2Vec model by varyin some it's parameters

In [70]:
model_dbow = Doc2Vec(dm=0, vector_size=300, negative=5, min_count=1, alpha=0.065, min_alpha=0.065)
model_dbow.build_vocab([x for x in tqdm(all_data)])

for epoch in range(30):
    model_dbow.train(utils.shuffle([x for x in tqdm(all_data)]), total_examples=len(all_data), epochs=1)
    model_dbow.alpha -= 0.002
    model_dbow.min_alpha = model_dbow.alpha

100%|██████████| 87/87 [00:00<00:00, 7412.39it/s]
100%|██████████| 87/87 [00:00<00:00, 35562.27it/s]
100%|██████████| 87/87 [00:00<00:00, 167695.06it/s]
100%|██████████| 87/87 [00:00<00:00, 248572.51it/s]
100%|██████████| 87/87 [00:00<00:00, 158516.27it/s]
100%|██████████| 87/87 [00:00<00:00, 271708.45it/s]
100%|██████████| 87/87 [00:00<00:00, 26010.72it/s]
100%|██████████| 87/87 [00:00<00:00, 221557.04it/s]
100%|██████████| 87/87 [00:00<00:00, 246058.29it/s]
100%|██████████| 87/87 [00:00<00:00, 275191.89it/s]
100%|██████████| 87/87 [00:00<00:00, 370085.65it/s]
100%|██████████| 87/87 [00:00<00:00, 280695.73it/s]
100%|██████████| 87/87 [00:00<00:00, 249763.48it/s]
100%|██████████| 87/87 [00:00<00:00, 258430.91it/s]
100%|██████████| 87/87 [00:00<00:00, 261580.25it/s]
100%|██████████| 87/87 [00:00<00:00, 205579.97it/s]
100%|██████████| 87/87 [00:00<00:00, 26854.90it/s]
100%|██████████| 87/87 [00:00<00:00, 366002.46it/s]
100%|██████████| 87/87 [00:00<00:00, 350869.66it/s]
100%|██████████| 

Next, we get vectors from trained doc2vec model.

In [0]:
def get_vectors(model, corpus_size, vectors_size, vectors_type):
    vectors = np.zeros((corpus_size, vectors_size))
    for i in range(0, corpus_size):
        prefix = vectors_type + '_' + str(i)
        vectors[i] = model.docvecs[prefix]
    return vectors

In [0]:
train_vectors_dbow = get_vectors(model_dbow, len(x_train), 300, 'Train')
test_vectors_dbow = get_vectors(model_dbow, len(x_test), 300, 'Test')

Finally, we get a logistic regression model trained by the doc2vec features.

In [76]:
y_pred = LogisticRegression(n_jobs=1, C=1e5).fit(train_vectors_dbow, y_train).predict(test_vectors_dbow)
score = accuracy_score(y_pred, y_test)
classification_report(y_test, y_pred,target_names=df['case_type'].unique())
print_score_report(score, report)

Accuracy
------------------------------------------------------
0.6666666666666666

Report
------------------------------------------------------
              precision    recall  f1-score   support

order_status       0.80      0.57      0.67        14
cancel_order       0.50      0.75      0.60         8

    accuracy                           0.64        22
   macro avg       0.65      0.66      0.63        22
weighted avg       0.69      0.64      0.64        22




We achieve an accuracy score of 80% which is 1% higher than SVM.

### Neural Network Using Keras

**Prepare the model**

In [0]:
train_size = int(len(df) * .7)
train_messages = df['message'][:train_size]
train_case_types = df['case_type'][:train_size]

test_messages = df['message'][train_size:]
test_case_types = df['case_type'][train_size:]

max_words = 1000
tokenize = text.Tokenizer(num_words=max_words, char_level=False)
tokenize.fit_on_texts(train_messages)

x_train = tokenize.texts_to_matrix(train_messages)
x_test = tokenize.texts_to_matrix(test_messages)

encoder = LabelEncoder()
encoder.fit(train_case_types)
y_train = encoder.transform(train_case_types)
y_test = encoder.transform(test_case_types)

num_classes = np.max(y_train) + 1
y_train = utils.to_categorical(y_train, num_classes)
y_test = utils.to_categorical(y_test, num_classes)

batch_size = 32
epochs = 2

**Build the model**

In [85]:
model = Sequential()
model.add(Dense(512, input_shape=(max_words,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes))
model.add(Activation('softmax'))

model.compile(
    loss='categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)
              
history = model.fit(
    x_train, 
    y_train,
    batch_size=batch_size,
    epochs=epochs,
    verbose=1,
    validation_split=0.1
)

Train on 54 samples, validate on 6 samples
Epoch 1/2
Epoch 2/2
