## A intenção do projeto é criar um chatbot baseado em reviews de filmes para que se possa fazer perguntas e manter uma conversa livre sobre este tema

- link do banco de dados https://www.kaggle.com/Cornell-University/movie-dialog-corpus?select=movie_lines.tsv
- referências
>- https://shanebarker.com/blog/deep-learning-chatbot/
> -https://towardsdatascience.com/how-to-create-a-chatbot-with-python-deep-learning-in-less-than-an-hour-56a063bdfc44

In [366]:
import string
import nltk
from nltk.corpus import stopwords
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
import pandas as pd
import re
import gensim
from keras.layers import Dense, Activation, Dropout
from keras.optimizers import SGD
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from keras.models import Sequential
from keras.preprocessing.text import Tokenizer
from scipy.spatial import distance
from sklearn.model_selection import train_test_split
import math
import random
import bz2
import itertools
from keras.callbacks import ModelCheckpoint, EarlyStopping

[nltk_data] Downloading package wordnet to /home/douglas/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [367]:
#expand jupyter cells
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

### Opening movie reviews

In [368]:
messages = pd.read_csv('./chatdata/movie_lines_pre_processed.tsv', delimiter="\t", quoting=3, encoding='ISO-8859-2')

In [369]:
messages.head()

Unnamed: 0,msg_line,user_id,movie_id,msg,msg_2,msg_pre_processed,target
0,L49,u0,m0,Did you change your hair?,No.,did you change your hair,1
1,L50,u3,m0,No.,You might wanna think about it,no,0
2,L51,u0,m0,You might wanna think about it,I need to think more about it,you might wanna think about it,0
3,L59,u9,m0,I missed you.,It says here you exposed yourself to a group o...,i missed you,0
4,L60,u8,m0,It says here you exposed yourself to a group o...,It was a bratwurst. I was eating lunch.,it say here you exposed yourself to a group of...,0


### Processing for deep learning

In [370]:
#setting the sample data for tests
i = 0
n = 20000

In [371]:
X_train, X_test, y_train, y_test = train_test_split(messages['msg_pre_processed'][i:n].astype(str), messages['target'][i:n].astype(str), test_size=0.33, stratify=messages['target'][i:n], random_state=42)

In [372]:
#dataframe with sample X and y
df_small = pd.DataFrame()

In [373]:
df_small['msg_pre_processed'] = X_train

In [374]:
df_small['target'] = y_train

In [375]:
df_small.head()

Unnamed: 0,msg_pre_processed,target
1486,look you cant shoot him in cold blood,0
159,i cant date her sister until that one get a bo...,0
3049,ive been ordered into bed the doctor say ive g...,0
16363,he wasnt acting,1
17912,he didnt look like hed take that sitting down,0


In [376]:
df_small.shape

(13400, 2)

In [377]:
tokenizer = Tokenizer()
# fit the tokenizer on the documents
tokenizer.fit_on_texts(X_train)

In [378]:
X_train

1486                 look you cant shoot him in cold blood
159      i cant date her sister until that one get a bo...
3049     ive been ordered into bed the doctor say ive g...
16363                                      he wasnt acting
17912        he didnt look like hed take that sitting down
                               ...                        
337                                             no i m not
5492                          do you know who put it there
9350                              let not get overdramatic
6042                                        yeah grotesque
16659    our platoon ha the best assignment of all were...
Name: msg_pre_processed, Length: 13400, dtype: object

In [379]:
y_train

1486     0
159      0
3049     0
16363    1
17912    0
        ..
337      0
5492     1
9350     0
6042     0
16659    0
Name: target, Length: 13400, dtype: object

In [380]:
# encode training data set
X_train_token = tokenizer.texts_to_matrix(X_train)

In [381]:
X_train_token

array([[0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])

In [382]:
X_train_token.shape

(13400, 10222)

In [383]:
#set the number of rows of X_train
num_rows, num_cols = X_train_token.shape

In [384]:
classes = set(df_small['target'])
classes

{'0', '1'}

In [385]:
df_small['target'] = df_small['target'].astype('int')

In [386]:
df_small.head()

Unnamed: 0,msg_pre_processed,target
1486,look you cant shoot him in cold blood,0
159,i cant date her sister until that one get a bo...,0
3049,ive been ordered into bed the doctor say ive g...,0
16363,he wasnt acting,1
17912,he didnt look like hed take that sitting down,0


### Search for the best parameters

In [387]:
def create_model(X, y, activation='relu', momentum=0.9, learn_rate=0.01, decay=1e-6,
                 dropout_rate=0.5, weight_constraint=1, neurons=20, init='uniform',
                 optimizer='SGD', nesterov=False, num_cols=10, pos_fix='',
                 epochs=10, validation_split=0.3, batch_size=20):
        
    model = Sequential()
    model.add(Dense(neurons, input_dim=num_cols, activation=activation))
    model.add(Dropout(dropout_rate))
    model.add(Dense(neurons/2, activation=activation))
    model.add(Dropout(dropout_rate))
    model.add(Dense(1, activation='sigmoid'))
    
    #model.summary()
    
    # Compile model. Stochastic gradient descent with Nesterov accelerated gradient gives good results for this model
    sgd = SGD(lr=learn_rate, decay=decay, momentum=momentum, nesterov=nesterov)
    model.compile(loss='BinaryCrossentropy', optimizer=sgd, metrics=['accuracy'])
    
    callbacks = [EarlyStopping(monitor='val_accuracy', patience=3, verbose=0),
                tf.keras.callbacks.ModelCheckpoint(filepath='model.{epoch:02d}-{val_loss:.2f}.h5'),]
    
    hist = model.fit(X, y, epochs=epochs, validation_split=validation_split, batch_size=batch_size, verbose=1, callbacks=callbacks)
    
    model_name = './models/chatbot_model_'+ pos_fix +'_.h5'
    #model.save(model_name, hist)

    print('model '+ model_name +' created')
    
    return model

### Training the model with fixed parameters

In [388]:
# Create model - 3 layers. First layer 128 neurons, second layer 64 neurons and 3rd output layer contains number of neurons
# equal to number of intents to predict output intent with softmax
model = Sequential()
model.add(Dense(20, input_dim=num_cols, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))

model.summary()

Model: "sequential_10"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_30 (Dense)             (None, 20)                204460    
_________________________________________________________________
dropout_20 (Dropout)         (None, 20)                0         
_________________________________________________________________
dense_31 (Dense)             (None, 10)                210       
_________________________________________________________________
dropout_21 (Dropout)         (None, 10)                0         
_________________________________________________________________
dense_32 (Dense)             (None, 1)                 11        
Total params: 204,681
Trainable params: 204,681
Non-trainable params: 0
_________________________________________________________________


In [389]:
%%time
# Compile model. Stochastic gradient descent with Nesterov accelerated gradient gives good results for this model
sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=False)
model.compile(loss='BinaryCrossentropy', optimizer=sgd, metrics=['accuracy'])

CPU times: user 15.6 ms, sys: 0 ns, total: 15.6 ms
Wall time: 17.2 ms


In [390]:
%%time

callbacks = [EarlyStopping(monitor='val_accuracy', patience=10, verbose=0),
                ModelCheckpoint(filepath='model.{epoch:02d}-{val_accuracy:.2f}.h5'),
            ]

#fitting and saving the model
hist = model.fit(X_train_token, df_small['target'], epochs=500, validation_split=0.3, batch_size=20, verbose=1, callbacks=callbacks)
model.save('chatbot_model.h5', hist)

print("model created")

Epoch 1/500
Epoch 2/500
Epoch 3/500
Epoch 4/500
Epoch 5/500
Epoch 6/500
Epoch 7/500
Epoch 8/500
Epoch 9/500
Epoch 10/500
Epoch 11/500
Epoch 12/500
Epoch 13/500
Epoch 14/500
Epoch 15/500
Epoch 16/500
Epoch 17/500
Epoch 18/500
model created
CPU times: user 1min 9s, sys: 9.03 s, total: 1min 18s
Wall time: 56.7 s


### Testing the prototipe

In [334]:
from tensorflow.keras.models import load_model

In [335]:
model = load_model('model.01-0.80.h5')

In [336]:
lemmatizer = WordNetLemmatizer()
def pre_processing_text(corpus):
    #remove html tags
    corpus = re.sub(r'<.*?>', '', str(corpus))
    
    #remove non-alphanumeric characters
    corpus = re.sub(r'[^a-z A-Z 0-9 \s]', '', str(corpus))
    
    #remove duplicated spaces
    corpus = re.sub(r' +', ' ', str(corpus))
    
    #capitalization
    corpus = corpus.lower()
    
    #tokenization
    corpus = re.findall(r"\w+(?:'\w+)?|[^\w\s]", corpus)
    
    #lammatization
    corpus = [lemmatizer.lemmatize(c) for c in corpus]
    
    #remove punctuation
    corpus = [t for t in corpus if t not in string.punctuation]
    
    #remove stopwords
    #it makes the model worst
    #stopwords_ = stopwords.words("english")
    #corpus = [t for t in corpus if t not in stopwords_]
    
    corpus = ' '.join(corpus)

    return corpus

In [337]:
msg_raw = 'I heard you are a good guy. Is it right?'

In [338]:
msg = pre_processing_text(msg_raw)

In [339]:
p = tokenizer.texts_to_matrix([msg])

In [340]:
p.shape

(1, 9257)

In [341]:
res = model.predict(p)



In [342]:
res

array([[0.82286173]], dtype=float32)

### Defining the list of questions and answers

In [343]:
questions = set(df_small[df_small['target'] == 1]['msg_pre_processed'])

In [344]:
answers = set(df_small[df_small['target'] == 0]['msg_pre_processed'])

In [345]:
answers

{'im sorry to hear that',
 'i know you did but i assure you there wa an impending failure',
 'i wasnt finished',
 'whatever you say',
 'we have to prepare you for an audience with sophie',
 'im going to see what else i can find out about mr',
 'no thank you i take it black like my men',
 'we are launching a major offensive to expand our foraging territory',
 'she wa european',
 'i feel like shit',
 'actually only one cell survived',
 'once in church dude',
 'ettore',
 'i cant listen to this',
 'two minute and already youre a dead man let passion overwhelm you colon',
 'of course it would be very easy for to find out now',
 'and here is our illustrious herr salieri',
 'at the school crossing in his bmw hurt some kid im gonna bust his as',
 'he busy being dead',
 'i know how you feel but there are already two of staying',
 'and be right back',
 'shes been trying to reach you for the last twenty minute they want you up stair',
 'of course i do very good of course now and then just now and

## Returning the conversation for the message using Jaccard Similarity

In [346]:
def jaccard_similarity(f1, f2):
    f1 = set(f1)
    f2 = set(f2)
    
    intersecao = f1.intersection(f2)
    uniao = f1.union(f2)
    
    return len(intersecao) / len(uniao)

In [347]:
def return_conversation_by_jaccard(msg, res, questions, answers):
    """
    Return a dictionary of message and similarity sorted by highter similarity
    """
    if res >= 0.5:
        msg_list = questions
        similarity = [jaccard_similarity(msg, m) for m in questions]     
    else:
        similarity = [jaccard_similarity(msg, m) for m in answers]
        msg_list = answers
    
    result = {} 
    for key in msg_list: 
        for value in similarity: 
            result[key] = value
            similarity.remove(value) 
            break 
    
    return {k: v for k, v in sorted(result.items(), key=lambda item: item[1], reverse=True)}
    

In [348]:
conversations = return_conversation_by_jaccard(msg, res, questions, answers)
conversations

{'whered you get all these rider': 0.8666666666666667,
 'how did she hide it during the day': 0.8666666666666667,
 'figured that out for yourself did you': 0.8666666666666667,
 'is there anything you need money': 0.8666666666666667,
 'is that all that interest you gold': 0.8666666666666667,
 'do you feel all right sir': 0.8666666666666667,
 'thats right you see that mr get it': 0.8571428571428571,
 'you hear what i said': 0.8571428571428571,
 'hey did you hear what he said': 0.8571428571428571,
 'whered you get this': 0.8571428571428571,
 'got his deer yet': 0.8461538461538461,
 'what were you doing at this morning': 0.8125,
 'right see ready for the quiz': 0.8125,
 'distracting enough for you': 0.8125,
 'then why the hell are you sitting around here': 0.8125,
 'hey dont you think a hair stylist got any interest in gettin it on': 0.8125,
 'why didnt you do anything to stop her': 0.8125,
 'so did you do shoot the photograph in there or what': 0.8125,
 'you got what you wanted you going 

In [349]:
#get the first item in the dict
def get_the_next_conversation(conversations):
    keys_view = conversations.keys()
    keys_iterator = iter(keys_view)
    conversation = next(keys_iterator)
    return conversation

In [350]:
conversation = get_the_next_conversation(conversations)
conversation

'whered you get all these rider'

### The returned message

In [351]:
print('>>> '+msg_raw)
msg2 = list(messages[messages['msg_pre_processed'] == conversation]['msg_2'])[0]
print('<<< '+msg2)

>>> I heard you are a good guy. Is it right?
<<< We didn't. Zerelda did. Turns out your wife makes a hell of an outlaw.


## Return the result using the Cossine Similarity

In [352]:
bow = CountVectorizer()

In [353]:
def return_conversation_by_cossine(msg, res, questions, answers, bow):
    """
    Return a dictionary of message and similarity sorted by highter similarity
    """
    if res >= 0.5:
        msg_list = questions    
    else:
        msg_list = answers
       
    similarity = []
    for m in msg_list:
        new_msg_list = [msg, m]
        vector_bow = bow.fit_transform(new_msg_list)
        msg_bow = vector_bow.todense()[0]
        m_bow   = vector_bow.todense()[1]
        similarity.append(distance.cosine(msg_bow, m_bow))
    
    result = {} 
    for key in msg_list: 
        for value in similarity: 
            result[key] = value
            similarity.remove(value) 
            break 
    
    return {k: v for k, v in sorted(result.items(), key=lambda item: item[1], reverse=False)}
    

In [354]:
conversations = return_conversation_by_cossine(msg, res, questions, answers, bow)
conversations

  dist = 1.0 - uv / np.sqrt(uu * vv)


{'are you all right': 0.4696699141100894,
 'are you': 0.5,
 'is it': 0.5,
 'is it is it really': 0.5285954792089683,
 'no he absolutely right absolutely right ryan talk must be banned canine conversation are completely discouraged it really good of you to join u can i get you a drink': 0.5691797815723354,
 'anyway whats it matter to you if we think it funny right you care': 0.5833333333333333,
 'are you afraid': 0.591751709536137,
 'of course it him who do you think it is': 0.591751709536137,
 'is it true': 0.591751709536137,
 'are you hit': 0.591751709536137,
 'no you idiot i said is it a receptacle tip not is a despicable twit is it a receptacle tip get off me': 0.6189996189994285,
 'thats right you see that mr get it': 0.625,
 'it is not your place to ask question is it true': 0.6348516283298893,
 'is everyone all right': 0.6464466094067263,
 'you up for it': 0.6464466094067263,
 'is it still dripping': 0.6464466094067263,
 'no are you proposing': 0.6464466094067263,
 'you hear it a

In [355]:
conversation = get_the_next_conversation(conversations)
conversation

'are you all right'

### Return result

In [356]:
print('>>> '+msg_raw)
msg2 = list(messages[messages['msg_pre_processed'] == conversation]['msg_2'])[0]
print('<<< '+msg2)

>>> I heard you are a good guy. Is it right?
<<< Sure. What's that funny smell?


## Get result using Cossine Similarity with Embedding

In [357]:
def download_embedding(get_it):
    if get_it:
        !gdown https://drive.google.com/uc?id=1zI8pGfbUHuU_0wY_FV4tD6w6ZCUJTQbh
    print('Download finished')

In [358]:
#The embedding is already downloaded
#Change to True to download
download_embedding(False)

Download finished


In [359]:
%%time
#get the embedding
newfilepath = "embedding_wiki_100d_pt.txt"
filepath = "ptwiki_20180420_100d.txt.bz2"
with open(newfilepath, 'wb') as new_file, bz2.BZ2File(filepath, 'rb') as file:
    for data in iter(lambda : file.read(100 * 1024), b''):
        new_file.write(data)

CPU times: user 1min 1s, sys: 1.2 s, total: 1min 2s
Wall time: 1min 5s


In [None]:
%%time
word_vectors = gensim.models.KeyedVectors.load_word2vec_format(filepath, binary=False)

In [360]:
word_vectors

<gensim.models.keyedvectors.Word2VecKeyedVectors at 0x7f3848c10760>

In [361]:
def calculate_embedding(phrase):
    """
    Return the mean of embeddings of a phrase
    """
    
    arr = np.array([word_vectors[word] for word in phrase if word in word_vectors.vocab])
    
    sum = np.zeros(len(arr[0]))
    for a in arr:
        sum = sum + a
        
    arr_mean = sum / len(arr) 
    
    return arr_mean

In [362]:
def return_conversation_by_cossine_embedding(msg, res, questions, answers, word_vectors):
    """
    Return a dictionary of message and similarity sorted by highter similarity
    """
    if res >= 0.5:
        msg_list = questions    
    else:
        msg_list = answers       
    
    msg = msg.split(' ')
    
    similarity = []
    for m in msg_list:        
        m = m.split(' ')
        
        try:
            msg_vector_embedding = calculate_embedding(msg)
            m_vector_embedding   = calculate_embedding(m)
        
            similarity.append(distance.cosine(msg_vector_embedding, m_vector_embedding))
        except:
            print("An exception occurred")
            print('> '+ ' '.join(m))
    
    result = {} 
    for key in msg_list: 
        for value in similarity: 
            result[key] = value
            similarity.remove(value) 
            break 
    
    return {k: v for k, v in sorted(result.items(), key=lambda item: item[1], reverse=False)}
    

In [363]:
%%time
conversations = return_conversation_by_cossine_embedding(msg, res, questions, answers, word_vectors)
conversations

An exception occurred
> fanfuckingtastic
An exception occurred
> cornilius
An exception occurred
> ywhat
An exception occurred
> airsick
An exception occurred
> uhuhwhy
An exception occurred
> mcdairmo
An exception occurred
> imbusy
An exception occurred
> andand
An exception occurred
> magdelen
An exception occurred
> notbusy
An exception occurred
> cspan
An exception occurred
> whatisit
An exception occurred
> grotty
An exception occurred
> autoshop
An exception occurred
> wheres irth
An exception occurred
> divinely
An exception occurred
> mommymommy
An exception occurred
> didnt
CPU times: user 938 ms, sys: 31.2 ms, total: 969 ms
Wall time: 973 ms


{'would you leave alone right now': 0.0277536050874172,
 'well whatta you mean i mean it perfectly fine out here i mean tonys very nice and uh well i meet people and i go to party and and we play tennis i mean thats thats a very big step for me you know i mean im able to enjoy people more': 0.027874825816351256,
 'why not now that theyre kicking me upstairs': 0.02798507136473971,
 'sssh what is it tell me': 0.02993621271078506,
 'why cant i ever fall in love with nice like you': 0.031356231883946206,
 'in government': 0.031382567448420495,
 'and tell corrado too that im here if he want me you can also tell him that my tiny little heart is beating like mad and that at this moment it the only thing that interest me is that clear': 0.03160381381780497,
 'you like it there dont you': 0.03206581654738938,
 'a better idea': 0.03230562653486446,
 'well look at it this way i mean when you come right down to it that girl shes a bit of a scrubber isnt she': 0.032685532525543515,
 'always count o

In [364]:
conversation = get_the_next_conversation(conversations)
conversation

'would you leave alone right now'

### Return result

In [365]:
print('>>> '+msg_raw)
msg2 = list(messages[messages['msg_pre_processed'] == conversation]['msg_2'])[0]
print('<<< '+msg2)

>>> I heard you are a good guy. Is it right?
<<< I love her too Joe.
