# Resume Chatbot

## Web Analytics Final project

This notebook is part of the final project for the 620 Web Analytics course. The objective is to create a resume chatbot for Shyam BV that can provide most of the information about me.

## Introduction

In this notebook, we will build a deep neural network that functions as part of an end-to-end question-answering pipeline, together with text-processing libraries. A question is asked and the bot provides an answer. We will also look at the bot's ability to respond to new questions.

- **Dataset Creation** - Create a dataset from various text inputs.
- **Preprocess** - Convert the text to sequences of integers after several preprocessing steps.
- **Models** - Create a deep neural network model. By using a word embedding, the model will be able to handle new words.
- **Prediction** - Run the model and generate responses from the bot.

## Dataset

To build this deep learning chatbot, we need to train the model on a lot of data. Unfortunately, my resume is not in a question-answer (QA) format, so we need to generate a lot of data from some existing data.
 
Here is some sample data which [I found](https://raw.githubusercontent.com/bvshyam/make_yourself_a_bot/master/workspace-watson.json). This will be my base for creating the dataset.

Over time, as the bot has more interactions, we will gather more data and the bot will get better.

## Dataset creation

As mentioned earlier, we need to create the initial dataset. That is done in data_creation.ipynb.

## Neural Network

Because the model is a deep neural network, we will use the Keras library with the TensorFlow backend.

In [1]:
from keras.layers import GRU, Input, Dense, TimeDistributed, SimpleRNN, LSTM
from keras.layers import Activation, Dropout, Bidirectional, RepeatVector
from keras.layers.embeddings import Embedding
from keras.models import Model, Sequential, load_model
from keras.optimizers import Adam
from keras.losses import sparse_categorical_crossentropy
from keras.callbacks import ModelCheckpoint, EarlyStopping, TensorBoard
from keras.preprocessing.text import Tokenizer, text_to_word_sequence
from keras.preprocessing.sequence import pad_sequences

import collections
import time

import numpy as np
import pandas as pd
import nltk
from autocorrect import spell
from scipy.spatial import distance
from nltk.translate.bleu_score import corpus_bleu

Using TensorFlow backend.


A dataset has been created with dataset_creation.ipynb, plus some manual effort to correct it. Now we will load it and perform further preprocessing.

In [2]:
#Create a pandas dataframe for the input data

df = pd.read_csv('./data/final_qa_data.csv',sep=',')
df.head()

Unnamed: 0,Question,Answer
0,Tell me 5 positive things about you,Tricky question detecting. Waiting for next qu...
1,Tell me your strengths,Tricky question detecting. Waiting for next qu...
2,Tell us Unique Selling Points,Tricky question detecting. Waiting for next qu...
3,What are you good at ?,Tricky question detecting. Waiting for next qu...
4,What are your professional strengths ?,Tricky question detecting. Waiting for next qu...


In [3]:
# Loading dataset to numpy array

question_sentences = df.Question.values
answer_sentences = df.Answer.values

print('Dataset Loaded')

Dataset Loaded


### Files

In each row, the first part is the question and the next part is the answer. Let's look at a sample.

In [4]:
for sample_i in range(2):
    print('Question Line {}:  {}'.format(sample_i + 1, question_sentences[sample_i]))
    print('Answer Line {}:  {}'.format(sample_i + 1, answer_sentences[sample_i]))

Question Line 1:  Tell me 5 positive things about you
Answer Line 1:  Tricky question detecting. Waiting for next question...
Question Line 2:  Tell me your strengths
Answer Line 2:  Tricky question detecting. Waiting for next question...


Looking at the sentences, we need to perform some preprocessing:

- Convert the complete string to lower case. The parent vocabulary has only lower-case characters.
- Separate symbols from text. Otherwise a word with an attached symbol is treated as a unique word.
- Preserve the line ending.
- Replace some of the junk characters.
- Expand contractions such as won't and wouldn't into full words.
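
The cleaning steps listed above can be sketched in plain Python. This is only an illustration: the `normalize_text` helper and the small `CONTRACTIONS` table are hypothetical and are not part of the notebook, which relies on `nltk.tokenize.word_tokenize` and a longer replacement list later on.

```python
import re

# Hypothetical contraction table; extend it to match the dataset.
CONTRACTIONS = {
    "won't": "will not",
    "wouldn't": "would not",
    "can't": "can not",
    "n't": " not",
    "'m": " am",
    "'re": " are",
    "'ve": " have",
    "'ll": " will",
}


def normalize_text(text):
    """Lowercase, expand contractions, and split symbols from words."""
    # Unify curly apostrophes so one table covers both spellings.
    text = text.lower().replace("\u2019", "'")
    # Replace longer patterns first so "won't" is not caught by "n't".
    for short in sorted(CONTRACTIONS, key=len, reverse=True):
        text = text.replace(short, CONTRACTIONS[short])
    # Put spaces around punctuation so each symbol becomes its own token.
    text = re.sub(r"([?.!,;:])", r" \1 ", text)
    # Collapse repeated whitespace.
    return " ".join(text.split())


print(normalize_text("What are you good at?"))
print(normalize_text("I won't stop, I'm fine!"))
```

Replacing the longest patterns first matters: if `n't` were handled before `won't`, the result would be `wo not` instead of `will not`.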


### Vocabulary

We also need to build a word vocabulary for the questions and answers. This will be our knowledge base. All new questions will be stored and the vocabulary will be updated.

In [5]:
def preprocess_cleaning(texts):
    
    text_cleaned = []
    
    text_cleaned = [' '.join(nltk.tokenize.word_tokenize(word.lower(),preserve_line=True)) for word in texts]

    return text_cleaned
    

In [6]:
#Clean and store in the same variables.

question_sentences = preprocess_cleaning(question_sentences)
answer_sentences = preprocess_cleaning(answer_sentences)


Next we will analyze the overall dataset and draw some inferences from it.

In [7]:

question_words_counter = collections.Counter([word for sentence in question_sentences for word in sentence.split()])
answer_words_counter = collections.Counter([word for sentence in answer_sentences for word in sentence.split()])

print('{} words in questions.'.format(len([word for sentence in question_sentences for word in sentence.split()])))
print('{} unique English words.'.format(len(question_words_counter)))
print('10 Most common words in the questions dataset:')
print('"' + '" "'.join(list(zip(*question_words_counter.most_common(10)))[0]) + '"')
print()
print('{} words in answers.'.format(len([word for sentence in answer_sentences for word in sentence.split()])))
print('{} unique English words.'.format(len(answer_words_counter)))
print('10 Most common words in the answers dataset:')
print('"' + '" "'.join(list(zip(*answer_words_counter.most_common(10)))[0]) + '"')

1124 words in questions.
288 unique English words.
10 Most common words in the questions dataset:
"you" "?" "what" "do" "are" "your" "how" "i" "'s" "tell"

3361 words in answers.
292 unique English words.
10 Most common words in the answers dataset:
"i" "you" ":" "to" "," "my" "..." "a" "and" "."


In [8]:
#Top 20 most common words.

question_words_counter.most_common(20)

[('you', 110),
 ('?', 102),
 ('what', 66),
 ('do', 38),
 ('are', 36),
 ('your', 31),
 ('how', 27),
 ('i', 19),
 ("'s", 19),
 ('tell', 17),
 ('me', 17),
 ('about', 17),
 ('to', 16),
 ('a', 15),
 ('is', 15),
 ('can', 13),
 ('work', 13),
 ('the', 13),
 ('have', 12),
 ('where', 10)]

## Preprocess

As a next step we will perform the following:
1. Tokenize the words into ids
2. Add padding to make all the sequences the same length.


### Tokenize (IMPLEMENTATION)
For a neural network to predict on text data, the text first has to be turned into data it can understand.

A word-level model uses word ids to generate a prediction for each word. We will also use the pre-trained word embedding called [GloVe](https://nlp.stanford.edu/projects/glove/).

Turn each sentence into a sequence of word ids using Keras's [`Tokenizer`](https://keras.io/preprocessing/text/#tokenizer) class. We will use it to tokenize `question_sentences` and `answer_sentences` in the cell below.


In [9]:
def tokenize(x):
    """
    Tokenize x
    param x: List of sentences/strings to be tokenized
    return: Tuple of (tokenized x data, tokenizer used to tokenize x)
    """
    # convert to nltk tokenizer to preserve the symbols and tokenize
    x = [' '.join(nltk.tokenize.word_tokenize(word.lower(),preserve_line=True)) for word in x]

    #Use tokenizer from Keras
    tokenizer = Tokenizer(num_words=None, filters="", lower=True, split=" ")
    tokenizer.fit_on_texts(x)
    return tokenizer.texts_to_sequences(x), tokenizer


# Tokenize sample output

text_sentences = question_sentences[:3]
text_tokenized, text_tokenizer = tokenize(text_sentences)
print(text_sentences,"\n")
print(text_tokenizer.word_index)
print()

for sample_i, (sent, token_sent) in enumerate(zip(text_sentences, text_tokenized)):
    print('Sequence {} in x'.format(sample_i + 1))
    print('  Input:  {}'.format(sent))
    print('  Output: {}'.format(token_sent))

['tell me 5 positive things about you', 'tell me your strengths', 'tell us unique selling points'] 

{'tell': 1, 'me': 2, '5': 3, 'positive': 4, 'things': 5, 'about': 6, 'you': 7, 'your': 8, 'strengths': 9, 'us': 10, 'unique': 11, 'selling': 12, 'points': 13}

Sequence 1 in x
  Input:  tell me 5 positive things about you
  Output: [1, 2, 3, 4, 5, 6, 7]
Sequence 2 in x
  Input:  tell me your strengths
  Output: [1, 2, 8, 9]
Sequence 3 in x
  Input:  tell us unique selling points
  Output: [1, 10, 11, 12, 13]


### Padding

When batching sequences of word ids together, each sequence needs to be the same length. Since sentences vary in length, we can add padding to the end of the sequences to make them the same length.

Make sure all the question sequences have the same length and all the answer sequences have the same length by adding padding to the **end** of each sequence using Keras's [`pad_sequences`](https://keras.io/preprocessing/sequence/#pad_sequences) function.

In [10]:
def pad(x, length=None):
    """
    Pad x
    param x: List of sequences.
    param length: Length to pad the sequence to.  If None, use length of longest sequence in x.
    return: Padded numpy array of sequences
    """
    if length is None:
        # Default to the length of the longest sequence in x
        length = max((len(sequence) for sequence in x), default=0)

    return pad_sequences(sequences=x, maxlen=length, dtype='int32', padding='post', truncating='post', value=0)


# Pad Tokenized output
test_pad = pad(text_tokenized)
for sample_i, (token_sent, pad_sent) in enumerate(zip(text_tokenized, test_pad)):
    print('Sequence {} in x'.format(sample_i + 1))
    print('  Input:  {}'.format(np.array(token_sent)))
    print('  Output: {}'.format(pad_sent))

Sequence 1 in x
  Input:  [1 2 3 4 5 6 7]
  Output: [1 2 3 4 5 6 7]
Sequence 2 in x
  Input:  [1 2 8 9]
  Output: [1 2 8 9 0 0 0]
Sequence 3 in x
  Input:  [ 1 10 11 12 13]
  Output: [ 1 10 11 12 13  0  0]


### Preprocess Pipeline

As a next step we will call the above functions on the questions and answers and store the tokenizers and outputs for later use.

In [11]:
def preprocess(x, y):
    """
    Preprocess x and y
    :param x: Feature List of sentences
    :param y: Feature output List of sentences
    :return: Tuple of (Preprocessed x, Preprocessed y, x tokenizer, y tokenizer)
    """
    preprocess_x, x_tk = tokenize(x)
    preprocess_y, y_tk = tokenize(y)

    preprocess_x = pad(preprocess_x)
    preprocess_y = pad(preprocess_y)
        
    # Keras's sparse_categorical_crossentropy function requires the labels to be in 3 dimensions
    preprocess_y = preprocess_y.reshape(*preprocess_y.shape, 1)

    return preprocess_x, preprocess_y, x_tk, y_tk

preproc_question_sentences, preproc_answer_sentences, question_tokenizer, answer_tokenizer =\
    preprocess(question_sentences, answer_sentences)

print('Data Preprocessed')


Data Preprocessed


## GloVe Embedding

As a next step we will use the GloVe word embedding. Here we want to convert the preprocessed, padded sentences into word vectors of a fixed dimension. These vectors capture the relationships between words.

For this project, we use a simple 100-dimensional word embedding, meaning each word is converted to a 100-dimensional vector. This will be one of the inputs to the Embedding layer.

We will assign a word embedding vector to each word in the questions vocabulary.

In [12]:
#word embeddings from glove

embeddings_index = dict()

with open('./data/glove.6B.100d.txt',encoding='utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word]= coefs

print('loaded %s word vectors' % len(embeddings_index))

loaded 400000 word vectors


In [13]:
#Create an encoder matrix with all the vocabulary in questions.

num_question_tokens = len(question_tokenizer.word_index)+1

#Create a dummy matrix to store all the word embeddings
encoder_embedding_matrix = np.zeros((num_question_tokens, 100))

# List to find nearest words for unknown words in current vocab list
unknown_helper_list = []

for word,i in question_tokenizer.word_index.items():
    
    embedding_vector = embeddings_index.get(word)
    unknown_helper_list.append(word)

    if embedding_vector is not None:
        encoder_embedding_matrix[i] = embedding_vector
        
#Final matrix is the collection of vectors for the complete questions vocab
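
Words that have no GloVe vector keep their all-zero row in `encoder_embedding_matrix`, so it can be useful to check how much of the vocabulary is actually covered. The sketch below uses toy stand-ins for `question_tokenizer.word_index` and `embeddings_index`; the `embedding_coverage` helper is hypothetical and not part of the notebook.

```python
def embedding_coverage(word_index, embeddings_index):
    """Return (covered_fraction, missing_words) for a tokenizer vocabulary."""
    missing = [word for word in word_index if word not in embeddings_index]
    if not word_index:
        return 0.0, missing
    covered = 1.0 - len(missing) / len(word_index)
    return covered, missing


# Toy stand-ins for question_tokenizer.word_index and embeddings_index.
toy_word_index = {"tell": 1, "me": 2, "shyam": 3, "?": 4}
toy_embeddings = {"tell": [0.1], "me": [0.2], "?": [0.3]}

coverage, missing_words = embedding_coverage(toy_word_index, toy_embeddings)
print("coverage: {:.0%}, missing: {}".format(coverage, missing_words))
```

In the notebook the same call would take `question_tokenizer.word_index` and `embeddings_index` as arguments.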

## Model Implementation

As a next step, we will build a deep neural network model as a series of layers. We will specify some hyperparameters and then use them to train the model.

The network is built and trained in the following steps:

1. Create an Embedding layer, which acts as the input layer and gets its input from the embedding vectors.
2. Create a Bidirectional LSTM layer.
3. Add a dropout layer that randomly drops neurons. Repeat steps 2 and 3.
4. Create an output layer with a sigmoid activation function.
5. Compile the model with an optimizer and the accuracy metric.
6. Fit the model on the training questions and answers.

In [14]:
#Learning rate for optimizer
learning_rate = 0.01
# Number of hidden units in each LSTM layer
hidden_dim = 256
# Word embedding vector length
embedding_vector_length = 100
# Number of epochs for complete training
epochs =1000


In [15]:

# Reshaping the questions input to work with a basic RNN with output shape
tmp_x = pad(preproc_question_sentences, preproc_answer_sentences.shape[1])
tmp_x = tmp_x.reshape((-1, preproc_answer_sentences.shape[-2]))


In [16]:
#Assign to the variables
input_shape = tmp_x.shape
output_sequence_length = preproc_answer_sentences.shape[1]
question_vocab_size = len(question_tokenizer.word_index)+1
answer_vocab_size= len(answer_tokenizer.word_index)+1


### Ids Back to Text

The neural network translates the input into word ids, which isn't the final form we want. We want the answer to the question. The functions below bridge the gap between the logits from the neural network and the answer text.
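
As a rough illustration of that bridge, the sketch below picks the highest-scoring word id at each time step and maps it back to a word. The `logits_to_text` name and the toy data are assumptions made for this example; the notebook does the equivalent with `np.argmax` inside `final_predictions`.

```python
def logits_to_text(logits, id_to_word, pad_token="<PAD>"):
    """Turn per-timestep scores into a sentence by taking the best id at each step."""
    words = []
    for step_scores in logits:
        # argmax without numpy: index of the largest score at this time step
        best_id = max(range(len(step_scores)), key=lambda i: step_scores[i])
        word = id_to_word[best_id]
        # Padding positions carry no text, so skip them
        if word != pad_token:
            words.append(word)
    return " ".join(words)


# Toy vocabulary and scores: one row per time step, one column per word id.
toy_id_to_word = {0: "<PAD>", 1: "i", 2: "am", 3: "fine"}
toy_logits = [
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.6, 0.2],
    [0.0, 0.1, 0.2, 0.7],
    [0.9, 0.0, 0.1, 0.0],
]

print(logits_to_text(toy_logits, toy_id_to_word))
```

The same function works on a single prediction from the model, since iterating over a 2-D array yields one row of scores per time step.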


In [20]:
# Load models
model = load_model('./saved_models/resume_chatbot_save_3000.hdf5')

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 39, 100)           28900     
_________________________________________________________________
bidirectional_4 (Bidirection (None, 39, 512)           731136    
_________________________________________________________________
dropout_5 (Dropout)          (None, 39, 512)           0         
_________________________________________________________________
bidirectional_5 (Bidirection (None, 39, 512)           1574912   
_________________________________________________________________
dropout_6 (Dropout)          (None, 39, 512)           0         
_________________________________________________________________
bidirectional_6 (Bidirection (None, 39, 512)           1574912   
_________________________________________________________________
dropout_7 (Dropout)          (None, 39, 512)           0         
__________
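
The parameter counts in the summary can be checked by hand. This is a minimal sketch, assuming a standard LSTM cell (four gates, each with input weights, recurrent weights, and a bias) and a vocabulary of 289 question tokens (the 288 unique question words plus the padding id). The helper names are hypothetical.

```python
def embedding_params(vocab_size, embedding_dim):
    """One embedding_dim-sized vector per vocabulary entry."""
    return vocab_size * embedding_dim


def bidirectional_lstm_params(input_dim, hidden_dim):
    """Parameters of a forward and a backward LSTM of the same size."""
    # 4 gates, each with input weights, recurrent weights and a bias
    one_direction = 4 * (hidden_dim * (input_dim + hidden_dim) + hidden_dim)
    return 2 * one_direction


print(embedding_params(289, 100))            # 28900
print(bidirectional_lstm_params(100, 256))   # 731136
print(bidirectional_lstm_params(512, 256))   # 1574912
```

The later Bidirectional layers are larger than the first because their input is the 512-wide output of the previous layer (256 units per direction) rather than the 100-dimensional embedding.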

In [21]:
def similar_word(unknown_word_dim):
    """
    Find a similar word when a typed word is not available in the questions vocab.
    The nearest vocab word is found using the euclidean distance between
    embedding vectors and is returned as the approximate replacement.
    :param unknown_word_dim: embedding vector of the unknown word entered in the text
    :return: the closest word from the questions vocab
    """
    all_distance = []

    # Row 0 of encoder_embedding_matrix is the padding row (word ids start at 1),
    # so skip it to keep the rows aligned with unknown_helper_list.
    for known_word in encoder_embedding_matrix[1:]:
        all_distance.append(distance.euclidean(unknown_word_dim,known_word))
    
    #Get the index of the minimum distance
    return unknown_helper_list[int(np.argmin(all_distance))]

In [22]:
def preprocess_test(raw_word, question_tokenizer):
    """
    Preprocess the text which is entered by user. We need to remove and clean 
    the text before we predict the answer.
    :param raw_word: Raw sentence entered by the user.
    :param question_tokenizer: Question tokenizer for vocab
    """
    
    # Cleaning the text
    l1 = ['won’t','won\'t','wouldn’t','wouldn\'t','’m', '’re', '’ve', '’ll', '’s','’d', 'n’t', '\'m', '\'re', '\'ve', '\'ll', '\'s', '\'d', 'can\'t', 'n\'t', 'B: ', 'A: ', ',', ';', '.', '?', '!', ':', '. ?', ',   .', '. ,', 'EOS', 'BOS', 'eos', 'bos']
    l2 = ['will not','will not','would not','would not',' am', ' are', ' have', ' will', ' is', ' had', ' not', ' am', ' are', ' have', ' will', ' is', ' had', 'can not', ' not', '', '', ' ,', ' ;', ' .', ' ?', ' !', ' :', '? ', '.', ',', '', '', '', '']

    raw_word = raw_word.lower()

    for j, term in enumerate(l1):
        raw_word = raw_word.replace(term,l2[j])
       
    for j in range(30):
        raw_word = raw_word.replace('. .', '')
        raw_word = raw_word.replace('.  .', '')
        raw_word = raw_word.replace('..', '')
        raw_word = raw_word.replace('...', '')
        
    for j in range(5):
        raw_word = raw_word.replace('  ', ' ')
 
    #Spell checker and call similar words function

    final_corrected_words  = []
    
    for text in nltk.tokenize.word_tokenize(raw_word.lower(),preserve_line=True):
        text = text.lower()
        #Spell checker changes the symbols. So passing only strings.
        if text not in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n':
            #Spell checker
            text = spell(text).lower()
            
            # Finding unknown similar words from the vocab
            if text not in question_tokenizer.word_index.keys():
                
                try:
                    final_corrected_words.append(similar_word(embeddings_index[text]))
                except KeyError:
                    # No GloVe vector for this word either, so fall back to '?'
                    final_corrected_words.append('?')
            else:
                final_corrected_words.append(text)
        else:
           
            if text not in question_tokenizer.word_index.keys():
                final_corrected_words.append(similar_word(embeddings_index[text]))
            else:
                final_corrected_words.append(text)
            
    #print(' '.join(final_corrected_words))
    return ' '.join(final_corrected_words)


In [26]:


que = ''
last_query  = ' '
last_last_query = ''
text = ' '
last_text = ''
name_of_computer = 'Jammy AI'

# Open files to save the conversation for further training:

name = time.strftime("%Y%m%d-%H%M%S")
qf = open('./session_data/'+name+'.txt', 'w')


def final_predictions(model, x, y, x_tk, y_tk):
    """
    Gets predictions using the final model
    :param model: Trained Keras model
    :param x: Preprocessed question data
    :param y: Preprocessed answer data
    :param x_tk: Questions tokenizer
    :param y_tk: Answers tokenizer
    """
    
    ## Create an answer dictionary
    y_id_to_word = {value: key for key, value in y_tk.word_index.items()}
    y_id_to_word[0] = '<PAD>'
    

    print('Chatbot: Hi ! please type your name.\n')
    name = input('user: ')
    print('Chatbot: hi , ' + name +' ! My name is ' + name_of_computer + '.\n') 

    
    while(True):
        
        que = input()
        
        qf.write("Question typed:" + que + '\n')
        
        if que =='exit':
            break
        else:
            #Preprocess the text
            que = preprocess_test(que,x_tk)
            #print(que)
        sentence =que
        
        qf.write("Question interpreted:" + que + '\n')
        
        #sentence = [x_tk.word_index[word] for word in text_to_word_sequence(sentence,filters='')]
        
        sentence = [x_tk.word_index[word] for word in nltk.tokenize.word_tokenize(sentence.lower())]
        
        #Convert to padded sequence
        sentence = pad_sequences([sentence], maxlen=x.shape[-1], padding='post')
        sentences = np.array([sentence[0], x[0]])
        
        #print(sentences.shape)
        
        tmp_sentences = pad(sentences, y.shape[1])
        tmp_sentences = tmp_sentences.reshape((-1, y.shape[-2]))

        #print(tmp_sentences)
        predictions = model.predict(tmp_sentences, len(tmp_sentences))

        #print('Sample 1:')
        prediction_text = ' '.join([y_id_to_word[np.argmax(x)] for x in predictions[0]]).replace('<PAD>','')
        
        qf.write("Answer by bot:" + prediction_text + '\n')
        print(prediction_text+'\n')

    qf.close()

final_predictions(model,preproc_question_sentences, preproc_answer_sentences, question_tokenizer, answer_tokenizer)

Chatbot: Hi ! please type your name.

user: Shyam
Chatbot: hi , Shyam ! My name is Jammy AI.

How are you?
i 'm perfectly fine. the servers are paid and in case of network failure i 'll automatically be available from a different region .               

what can you do?
think of me as shyam at an interview. ask me `` interview '' related questions. what books i read , what have i been doing lately or to tell you a joke .      

What have you worked?
think open source projects. you can see them on github : https : //github.com/bvshyam/. this bot is also an achievement : )                 

connect with shyam
i have no words for old : )                               

how old are you?
... i 'm 31 years old : )                               

exit


# Summary

### Learnings

1. We successfully created a dataset from the existing data sources with a mix of automated and manual effort.
2. Preprocessed the dataset: tokenization, padding, cleaning symbols, and creating a word dictionary.
3. Loaded a pretrained word embedding matrix (GloVe) and used it to convert the vocabulary into a 100-dimensional word embedding matrix.
4. Created a deep neural network model and trained it on the questions and answers.
5. As training takes a long time, it was run on a machine with an 8 GB+ GPU, 8 cores, and 30 GB of RAM. It took ~2 hours to run 2500 epochs.
6. The next step is predicting answers for newly typed sentences. A lot of preprocessing needs to be performed on user-entered text.
7. Clean any symbols and tokenize the user input text, then convert it to vector format.
8. Finally, the input text, the interpreted text, and the response text are all stored in a text file with a timestamp for future training.

### Flask App:

1. There is also a Flask app that was created for this interaction with the user. It is shown in the video and code.

### Potential exploration and development:

Although this chatbot provides some answers, there is a lot of room for improvement:

1. Make the chatbot respond in grammatically correct sentences.
2. Perform more preprocessing on the input text and find better matching words for unknown words.
3. Create a sequence-to-sequence encoder-decoder model using bidirectional RNNs, which should provide better results. I have tried to implement it, but it requires more time for deeper exploration.
4. Build a better UI for the chatbot. The current UI is very simple and has some minor issues.