## A intenção do projeto é criar um chatbot baseado em reviews de filmes para que se possa fazer perguntas e manter uma conversa livre sobre este tema

- link do banco de dados https://www.kaggle.com/Cornell-University/movie-dialog-corpus?select=movie_lines.tsv
- referências
>- https://shanebarker.com/blog/deep-learning-chatbot/
> -https://towardsdatascience.com/how-to-create-a-chatbot-with-python-deep-learning-in-less-than-an-hour-56a063bdfc44

In [None]:
import string
import nltk
from nltk.corpus import stopwords
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
import pandas as pd
import re
import gensim
from keras.layers import Dense, Activation, Dropout
from keras.optimizers import SGD
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from keras.models import Sequential
from keras.preprocessing.text import Tokenizer
from scipy.spatial import distance
from sklearn.model_selection import train_test_split
import math
import random
import bz2

### Opening movie reviews

In [2]:
messages = pd.read_csv('./chatdata/movie_lines_pre_processed.tsv', delimiter="\t", quoting=3, encoding='ISO-8859-2')

In [3]:
messages.head()

Unnamed: 0,msg_line,user_id,movie_id,msg,msg_2,msg_pre_processed,target
0,L49,u0,m0,Did you change your hair?,No.,did you change your hair,1
1,L50,u3,m0,No.,You might wanna think about it,no,0
2,L51,u0,m0,You might wanna think about it,maybe...,you might wanna think about it,0
3,L59,u9,m0,I missed you.,It says here you exposed yourself to a group o...,i missed you,0
4,L60,u8,m0,It says here you exposed yourself to a group o...,It was a bratwurst. I was eating lunch.,it say here you exposed yourself to a group of...,0


### Processing for deep learning

In [55]:
#setting the sample data for tests
i = 0
n = 20000

In [57]:
X_train, X_test, y_train, y_test = train_test_split(messages['msg_pre_processed'][i:n].astype(str), messages['target'][i:n].astype(str), test_size=0.33, stratify=messages['target'][i:n], random_state=42)

In [58]:
#dataframe with sample X and y
df_small = pd.DataFrame()

In [59]:
df_small['msg'] = X_train

In [60]:
df_small['target'] = y_train

In [61]:
df_small.head()

Unnamed: 0,msg,target
1486,take the car get outta here tommy,0
159,i can't date her sister until that one get a b...,0
3047,mr christian tom welles here,0
16358,he's out of prison,1
17913,maybe we better stay in under the tree till da...,0


In [62]:
df_small.shape

(13400, 2)

In [63]:
tokenizer = Tokenizer()
# fit the tokenizer on the documents
tokenizer.fit_on_texts(X_train)

In [64]:
X_train

1486                     take the car get outta here tommy
159      i can't date her sister until that one get a b...
3047                          mr christian tom welles here
16358                                   he's out of prison
17913    maybe we better stay in under the tree till da...
                               ...                        
337                                             no i m not
5492                          do you know who put it there
9354                                         eve don't cry
6042                                        yeah grotesque
16662                                  superior number kid
Name: msg_pre_processed, Length: 13400, dtype: object

In [65]:
y_train

1486     0
159      0
3047     0
16358    1
17913    0
        ..
337      0
5492     1
9354     0
6042     0
16662    0
Name: target, Length: 13400, dtype: object

In [66]:
# encode training data set
X_train_token = tokenizer.texts_to_matrix(X_train)

In [67]:
X_train_token

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [68]:
X_train_token.shape

(13400, 9820)

In [69]:
#set the number of rows of X_train
num_rows, num_cols = X_train_token.shape

In [70]:
classes = set(df_small['target'])
classes

{'0', '1'}

In [71]:
df_small['target'] = df_small['target'].astype('int')

In [72]:
df_small.head()

Unnamed: 0,msg,target
1486,take the car get outta here tommy,0
159,i can't date her sister until that one get a b...,0
3047,mr christian tom welles here,0
16358,he's out of prison,1
17913,maybe we better stay in under the tree till da...,0


### Training the model

In [73]:
# Create model - 3 layers. First layer 128 neurons, second layer 64 neurons and 3rd output layer contains number of neurons
# equal to number of intents to predict output intent with softmax
model = Sequential()
model.add(Dense(20, input_dim=num_cols, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))

model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_3 (Dense)              (None, 20)                196420    
_________________________________________________________________
dropout_2 (Dropout)          (None, 20)                0         
_________________________________________________________________
dense_4 (Dense)              (None, 10)                210       
_________________________________________________________________
dropout_3 (Dropout)          (None, 10)                0         
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 11        
Total params: 196,641
Trainable params: 196,641
Non-trainable params: 0
_________________________________________________________________


In [74]:
%%time
# Compile model. Stochastic gradient descent with Nesterov accelerated gradient gives good results for this model
sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=False)
model.compile(loss='BinaryCrossentropy', optimizer=sgd, metrics=['accuracy'])

#fitting and saving the model
hist = model.fit(X_train_token, df_small['target'], epochs=10, validation_split=0.3 ,batch_size=20, verbose=1)
model.save('chatbot_model.h5', hist)

print("model created")

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
model created
CPU times: user 33.5 s, sys: 4.72 s, total: 38.2 s
Wall time: 18.7 s


### Testing the prototipe

In [75]:
lemmatizer = WordNetLemmatizer()
def pre_processing_text(corpus):   
    #remove duplicated spaces
    corpus = re.sub(r' +', ' ', corpus)
    
    #capitalization
    corpus = corpus.lower()
    
    #tokenization
    corpus = re.findall(r"\w+(?:'\w+)?|[^\w\s]", corpus)
    
    #lammatization
    corpus = [lemmatizer.lemmatize(c) for c in corpus]
    
    #remove punctuation
    corpus = [t for t in corpus if t not in string.punctuation]
    
    #remove stopwords
    #it makes the model worst
    #stopwords_ = stopwords.words("english")
    #corpus = [t for t in corpus if t not in stopwords_]
    
    corpus = ' '.join(corpus)

    return corpus

In [76]:
msg_raw = 'I heard you are a good guy. Is it right?'

In [77]:
msg = pre_processing_text(msg_raw)

In [78]:
p = tokenizer.texts_to_matrix([msg])

In [79]:
p.shape

(1, 9820)

In [80]:
res = model.predict(p)

In [81]:
res

array([[0.4959879]], dtype=float32)

### Defining the list of questions and answers

In [82]:
questions = set(df_small[df_small['target'] == 1]['msg'])

In [83]:
answers = set(df_small[df_small['target'] == 0]['msg'])

In [84]:
answers

{"good i'm getting you clear too let's just keep the line open",
 "ah i see yes paul's disappearance yes",
 'meet me in the moonpool move fast',
 "here are the key to my apartment i'm going to park you in my place while i take carol home",
 'yes well i hope the crew got back safely',
 'i take it she read well',
 "i had no idea you'd be this good",
 'oh no the honour would be all mine',
 'oh you were talking to jesse',
 "she's coming",
 'everyone regret something',
 'i have some very good memory there',
 'this answer satisfies me a hundred percent',
 'there are reason for the way we do thing here',
 "promise you'll be nice to the neighbor",
 "i'm about to mr president",
 'get on the expressway',
 "it's good what you're doing",
 "you're nut man katka's right",
 "now we'll both call him",
 'fuck you fuck cleveland and fuck your contract',
 'oh ten dollarsĂ¤ Â\x82 Ă˘ Â\x82 Ä\x83 Â\x82 Ă˘ Â\x97',
 "wow you're right u poor dumb ol boy might've had to think for ourselves coulda been a disaste

## Returning the conversation for the message using Jaccard Similarity

In [85]:
def jaccard_similarity(f1, f2):
    f1 = set(f1)
    f2 = set(f2)
    
    intersecao = f1.intersection(f2)
    uniao = f1.union(f2)
    
    return len(intersecao) / len(uniao)

In [86]:
def return_conversation_by_jaccard(msg, res, questions, answers):
    """
    Return a dictionary of message and similarity sorted by highter similarity
    """
    if res >= 0.5:
        msg_list = questions
        similarity = [jaccard_similarity(msg, m) for m in questions]     
    else:
        similarity = [jaccard_similarity(msg, m) for m in answers]
        msg_list = answers
    
    result = {} 
    for key in msg_list: 
        for value in similarity: 
            result[key] = value
            similarity.remove(value) 
            break 
    
    return {k: v for k, v in sorted(result.items(), key=lambda item: item[1], reverse=True)}
    

In [87]:
conversations = return_conversation_by_jaccard(msg, res, questions, answers)
conversations

{'repairing the antenna is a pretty dangerous operation': 0.8666666666666667,
 'oh right right with those thing running around no way': 0.8666666666666667,
 'sure monday and thursday and monday again and thursday again': 0.8666666666666667,
 'our last game is this saturday': 0.8666666666666667,
 'you heard the ant dig': 0.8571428571428571,
 'i guess this is a first for you': 0.8571428571428571,
 'but you said the other': 0.8571428571428571,
 'this here is the younger gang': 0.8571428571428571,
 'your father is dead': 0.8571428571428571,
 'there go your ride': 0.8461538461538461,
 'i had to see you': 0.8461538461538461,
 'your daughter': 0.8461538461538461,
 "it's good what you're doing": 0.8125,
 "that's not what you said the u other u u night u": 0.8125,
 'second thing that come to your mind': 0.8125,
 'i wa there i saw you and heard you through the dressing room door': 0.8125,
 'this ha nothing to do with your fire': 0.8125,
 'yeah but he insisted on u bringing him to the station': 0

In [88]:
#get the first item in the dict
def get_the_next_conversation(conversations):
    keys_view = conversations.keys()
    keys_iterator = iter(keys_view)
    conversation = next(keys_iterator)
    return conversation

In [89]:
conversation = get_the_next_conversation(conversations)
conversation

'repairing the antenna is a pretty dangerous operation'

### The returned message

In [90]:
print('>>> '+msg_raw)
msg2 = list(messages[messages['msg_pre_processed'] == conversation]['msg_2'])[0]
print('<<< '+msg2)

>>> I heard you are a good guy. Is it right?
<<< It doesn't have to be Hal. It's more dangerous to be out of touch with Earth. Let me have manual control please.


## Return the result using the Cossine Similarity

In [91]:
bow = CountVectorizer()

In [92]:
def return_conversation_by_cossine(msg, res, questions, answers, bow):
    """
    Return a dictionary of message and similarity sorted by highter similarity
    """
    if res >= 0.5:
        msg_list = questions    
    else:
        msg_list = answers
       
    similarity = []
    for m in msg_list:
        new_msg_list = [msg, m]
        vector_bow = bow.fit_transform(new_msg_list)
        msg_bow = vector_bow.todense()[0]
        m_bow   = vector_bow.todense()[1]
        similarity.append(distance.cosine(msg_bow, m_bow))
    
    result = {} 
    for key in msg_list: 
        for value in similarity: 
            result[key] = value
            similarity.remove(value) 
            break 
    
    return {k: v for k, v in sorted(result.items(), key=lambda item: item[1], reverse=False)}
    

In [93]:
conversations = return_conversation_by_cossine(msg, res, questions, answers, bow)
conversations

  dist = 1.0 - uv / np.sqrt(uu * vv)


{'excellency you are right': 0.4696699141100894,
 "yeah it's you all right": 0.5256583509747431,
 "this is mole it's good": 0.5256583509747431,
 "you were right you were right z it's u beautiful u": 0.527544408738466,
 'this is who you were exactly who you are is up to you': 0.5576741315353086,
 "it's good what you're doing": 0.5669872981077806,
 "it's it's all right": 0.5669872981077807,
 "life is funny isn't it you find the right girl and then you lose her": 0.5712535371437278,
 "what you are saying is you don't know what this thing is": 0.5833333333333334,
 'you did it': 0.591751709536137,
 "it's a good thought": 0.591751709536137,
 "no it's all right i'm not very good at controlling it anyway": 0.6077677297236319,
 'mozart it wa good of you to come': 0.625,
 "when you start with that attitude it's like i don't know who you are": 0.6348516283298893,
 "so he sent you gave you the money his errand boy and if you refused it wasn't like you could tell anyone your pervert bos just asked 

In [94]:
conversation = get_the_next_conversation(conversations)
conversation

'excellency you are right'

### Return result

In [95]:
print('>>> '+msg_raw)
msg2 = list(messages[messages['msg_pre_processed'] == conversation]['msg_2'])[0]
print('<<< '+msg2)

>>> I heard you are a good guy. Is it right?
<<< can you explain it better?


## Get result using Cossine Similarity with Embedding

In [96]:
def download_embedding(get_it):
    if get_it:
        !gdown https://drive.google.com/uc?id=1zI8pGfbUHuU_0wY_FV4tD6w6ZCUJTQbh
    print('Download finished')

In [97]:
#The embedding is already downloaded
#Change to True to download
download_embedding(False)

Download finished


In [98]:
%%time
#get the embedding
newfilepath = "embedding_wiki_100d_pt.txt"
filepath = "ptwiki_20180420_100d.txt.bz2"
with open(newfilepath, 'wb') as new_file, bz2.BZ2File(filepath, 'rb') as file:
    for data in iter(lambda : file.read(100 * 1024), b''):
        new_file.write(data)

CPU times: user 56.6 s, sys: 1.14 s, total: 57.7 s
Wall time: 58 s


In [99]:
%%time
word_vectors = gensim.models.KeyedVectors.load_word2vec_format(filepath, binary=False)

CPU times: user 3min 43s, sys: 6.7 s, total: 3min 49s
Wall time: 3min 50s


In [100]:
word_vectors

<gensim.models.keyedvectors.Word2VecKeyedVectors at 0x7f277deb4b50>

In [101]:
def calculate_embedding(phrase):
    """
    Return the mean of embeddings of a phrase
    """
    
    arr = np.array([word_vectors[word] for word in phrase if word in word_vectors.vocab])
    
    sum = np.zeros(len(arr[0]))
    for a in arr:
        sum = sum + a
        
    arr_mean = sum / len(arr) 
    
    return arr_mean

In [102]:
def return_conversation_by_cossine_embedding(msg, res, questions, answers, word_vectors):
    """
    Return a dictionary of message and similarity sorted by highter similarity
    """
    if res >= 0.5:
        msg_list = questions    
    else:
        msg_list = answers       
    
    msg = msg.split(' ')
    
    similarity = []
    for m in msg_list:        
        m = m.split(' ')
        
        try:
            msg_vector_embedding = calculate_embedding(msg)
            m_vector_embedding   = calculate_embedding(m)
        
            similarity.append(distance.cosine(msg_vector_embedding, m_vector_embedding))
        except:
            print("An exception occurred")
            print('> '+ ' '.join(m))
    
    result = {} 
    for key in msg_list: 
        for value in similarity: 
            result[key] = value
            similarity.remove(value) 
            break 
    
    return {k: v for k, v in sorted(result.items(), key=lambda item: item[1], reverse=False)}
    

In [103]:
%%time
conversations = return_conversation_by_cossine_embedding(msg, res, questions, answers, word_vectors)
conversations

An exception occurred
> ahhhhhhh
An exception occurred
> coffee's perked
An exception occurred
> sobering
An exception occurred
> sssssh
An exception occurred
> stanzi
An exception occurred
> skywire
An exception occurred
> dinoine chagantakat
An exception occurred
> wholesome
An exception occurred
> it's insured
An exception occurred
> shurrup
An exception occurred
> m'm
An exception occurred
> twombley
An exception occurred
> dorsia
An exception occurred
> don't dramatise
An exception occurred
> compensate
An exception occurred
> kablooey
An exception occurred
> don't
An exception occurred
> he's convulsing
An exception occurred
> shhhhh i'm concentrating
An exception occurred
> sshh sshh
An exception occurred
> tumescent
An exception occurred
> jeeeeeeeeesus
An exception occurred
> wellĂ¤ Â Ă˘ Â Ä Â Ă˘ Â
An exception occurred
> yeeeeeeeaaaawwwwww
An exception occurred
> portchnik
An exception occurred
> coninued
An exception occurred
> m'mm
An exception occurred
> hearsay
An ex

{'we got a lot to talk about': 0.024004499528219703,
 'not what who': 0.02543383816936584,
 'anything you say mammacitta': 0.026040793329683676,
 'i remember you now the so called art dealer': 0.026379794137182833,
 'check': 0.02794685742042813,
 'i know i know': 0.028047823243650405,
 "no i'm the best that's ever threatened you": 0.02835924646792487,
 "i'm playing white remember you can't tell me which piece to move it doesn't work that way": 0.02919562172205059,
 'i reckoned': 0.03036615248273311,
 'no they will not i know how thing work in this city': 0.03049336824004112,
 'so you u are u a man after all': 0.030518745069989084,
 'the little creep hate it that eric actually doe what the company hired him to do': 0.030716052904914304,
 "i may be an asshole but at least i'm a real detective not some outer shit space thing": 0.030865834031351236,
 "well we might a well start filling it in now a long a you leave enough room around that end of the pipe so i can get to it we're set use the

In [104]:
conversation = get_the_next_conversation(conversations)
conversation

'we got a lot to talk about'

### Return result

In [105]:
print('>>> '+msg_raw)
msg2 = list(messages[messages['msg_pre_processed'] == conversation]['msg_2'])[0]
print('<<< '+msg2)

>>> I heard you are a good guy. Is it right?
<<< Yeah old times.
