## A intenção do projeto é criar um chatbot baseado em reviews de filmes para que se possa fazer perguntas e manter uma conversa livre sobre este tema

- link do banco de dados https://www.kaggle.com/Cornell-University/movie-dialog-corpus?select=movie_lines.tsv
- referências
>- https://shanebarker.com/blog/deep-learning-chatbot/
> -https://towardsdatascience.com/how-to-create-a-chatbot-with-python-deep-learning-in-less-than-an-hour-56a063bdfc44

In [53]:
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import pandas as pd
import re
import gensim
import numpy as np
from keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import load_model
from scipy.spatial import distance
import math
import random
import bz2
import itertools
import pickle
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import wordnet

In [54]:
#expand jupyter cells
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

### Opening movie reviews

In [55]:
messages = pd.read_csv('./chatdata/movie_lines_pre_processed_for_test.tsv', delimiter="\t", quoting=3, encoding='ISO-8859-2')

In [56]:
messages.head()

Unnamed: 0,msg_line,user_id,movie_id,msg,msg_2,msg_pre_processed,target
0,L49,u0,m0,Did you change your hair?,No.,change hair,1
1,L50,u3,m0,No.,You might wanna think about it,no,0
2,L51,u0,m0,You might wanna think about it,can you explain it better?,might wanna think,0
3,L59,u9,m0,I missed you.,It says here you exposed yourself to a group o...,miss,0
4,L60,u8,m0,It says here you exposed yourself to a group o...,It was a bratwurst. I was eating lunch.,say expose group freshman girl,0


### Defining the list of questions and answers

In [57]:
questions = set(messages[messages['target'] == 1]['msg_pre_processed'])

In [58]:
answers = set(messages[messages['target'] == 0]['msg_pre_processed'])

In [59]:
len(answers)

2093

In [60]:
len(questions)

907

In [61]:
def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

In [62]:
lemmatizer = WordNetLemmatizer()
def pre_processing_text(corpus):
    #remove html tags
    corpus = re.sub(r'<.*?>', '', str(corpus))
    
    #remove non-alphanumeric characters
    corpus = re.sub(r'[^a-z A-Z 0-9 \s]', '', str(corpus))
    
    #remove duplicated spaces
    corpus = re.sub(r' +', ' ', str(corpus))
    
    #capitalization
    corpus = corpus.lower()
    
    #tokenization
    corpus = re.findall(r"\w+(?:'\w+)?|[^\w\s]", corpus)
    
    #lammatization
    corpus = [lemmatizer.lemmatize(c, get_wordnet_pos(c)) for c in corpus]
    
    #remove punctuation
    corpus = [t for t in corpus if t not in string.punctuation]
    
    #remove stopwords
    #it makes the model worst
    #stopwords_ = stopwords.words("english")
    #corpus = [t for t in corpus if t not in stopwords_]
    
    corpus = ' '.join(corpus)

    return corpus

In [63]:
msg_raw = 'I heard you are a good guy. Is it right?'
#msg_raw = 'yes i heard you all right 20000000 thats quite a lot isnt it'
msg = pre_processing_text(msg_raw)
print(msg)

with open('./chatdata/tokenizer.pickle', 'rb') as handle:
    tokenizer = pickle.load(handle)
p = tokenizer.texts_to_matrix([msg])

model = load_model('./chatdata/chatbot_model.h5')
res = model.predict(p)

print(res)

i heard you be a good guy be it right
[[0.61740196]]


## Returning the conversation for the message using Jaccard Similarity

In [64]:
def jaccard_similarity(f1, f2):    
    f1 = set(str(f1).split(' '))
    f2 = set(str(f2).split(' '))
    
    intersecao = f1.intersection(f2)
    uniao = f1.union(f2)
    
    return len(intersecao) / len(uniao)

In [65]:
def return_conversation_by_jaccard(msg, res, questions, answers, threshold=None):
    """
    Return a dictionary of message and similarity sorted by highter similarity
    """
    if res >= 0.5:
        msg_list = questions
        similarity = [jaccard_similarity(msg, str(m)) for m in questions]        
    else:
        similarity = [jaccard_similarity(msg, str(m)) for m in answers]
        msg_list = answers
    
    result = {} 
    for key in msg_list: 
        for value in similarity:
            result[key] = value
            similarity.remove(value) 
            break 
    
    return {k: v for k, v in sorted(result.items(), key=lambda item: item[1], reverse=True)}    

In [66]:
%%time
conversations = return_conversation_by_jaccard(msg, res, questions, answers)
conversations

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 9.26 ms


{'hey high good right': 0.18181818181818182,
 'know ive ever heard say shed dip date guy smoke': 0.11764705882352941,
 'new guy': 0.1,
 'macbeth right': 0.1,
 'knew right': 0.1,
 'know right': 0.1,
 'good time': 0.1,
 'good stuff': 0.1,
 'heard song': 0.1,
 'want know difference make im guy bed last three month make feel good make feel good hell else want guy': 0.09523809523809523,
 'guy ganz hotel': 0.09090909090909091,
 'heard hell let': 0.09090909090909091,
 'one guy checked': 0.09090909090909091,
 'movie right book right': 0.09090909090909091,
 'whatever fuck right': 0.09090909090909091,
 'good morning hows go': 0.08333333333333333,
 'think police good job': 0.08333333333333333,
 'yeah right want try': 0.08333333333333333,
 'shakespeare maybe youve heard': 0.08333333333333333,
 'girl like always like guy like': 0.08333333333333333,
 'paulie youve get kid right': 0.07692307692307693,
 'ill minute dont move right': 0.07692307692307693,
 'right see youre ready quiz': 0.076923076923076

In [67]:
#get the first item in the dict
def get_the_next_conversation(conversations):
    keys_view = conversations.keys()
    keys_iterator = iter(keys_view)
    conversation = next(keys_iterator)
    return conversation

In [68]:
conversation = get_the_next_conversation(conversations)
conversation

'hey high good right'

### The returned message

In [69]:
print('>>> '+msg_raw)
msg2 = list(messages[messages['msg_pre_processed'] == conversation]['msg_2'])[0]
print('<<< '+msg2)

>>> I heard you are a good guy. Is it right?
<<< talk more about it


## Calculate PageRank
- create a bi-directional graph of messages using similarity bigger then a threshould

In [70]:
"""
Classe para criação de um nó (página) do grafo.
O cosntrutuor recebe recebe o nome do nó, a lista de nós de entrada e saída.
"""
class Node(object):
    def __init__(self, node_name: str, inlinks: list, outlinks: list):
        self.node_name = node_name
        self.inlinks = inlinks
        self.outlinks = outlinks

"""
Classe para criação do grafo de páginas.
O construtor inicializa um dicionário vazio de lista de adjacência. 
"""
class Graph(object):
    def __init__(self):
        self.adj_list = dict()

    #Adiciona um nó ao grafo com seus nós de entrada e sapida
    def add_node(self, node_name: str, inlinks: list, outlinks: list):
        node = Node(node_name = node_name, inlinks=inlinks, outlinks=outlinks)
        self.adj_list[node_name] = node

    #Imprime os dados do grafo criado
    def print_graph(self):
        for key in self.adj_list:
            print(f"{key}:")
            print(f"\tEntrada: {self.adj_list[key].inlinks}")
            print(f"\tSaída: {self.adj_list[key].outlinks}")

In [71]:
import numpy as np

"""
A classe PageRank possui métodos para computar o Page Rank de cada página dado um número de iterações.
"""
class PageRank(object):
    #Construtor da classe que recebe um objeto Graph, inicializa um dicionário vazio de scores
    def __init__(self, graph: Graph):
        self.graph = graph
        self.scores = dict()
        self.__initialize_scores()

    #Inicializa os scores do Page Rank com o valor inicial 1/n, onde n é o número de nós (páginas) do grafo
    def __initialize_scores(self):
        n = len(self.graph.adj_list)
        for key in self.graph.adj_list:
            self.scores[key] = 1/n

    #Calcula o Page Rank para cada página dado o número de iterações. Ainda não utiliza a convergence_rate no cálculos
    def compute(self, iterations: int = 10, convergence_rate: float = 0.01):
        new_scores = dict()
        for i in range(iterations):
            for  node in self.graph.adj_list:
                in_to_node = np.asarray([
                    self.scores[x] for x in self.graph.adj_list[node].inlinks
                ])
                out_to_node = self.graph.adj_list[node].inlinks
                amount_out_to_node = np.asarray([
                    len(self.graph.adj_list[x].outlinks) for x in out_to_node
                ])
                score = np.sum(in_to_node / amount_out_to_node)
                new_scores[node] = score
            #print(self.scores)
            self.scores = new_scores.copy()
        return self.scores

    def power_method(self, iterations: int = 10):
        #Gera a matrix de probabilidades de navegação a cada nó
        lenght = len(self.graph.adj_list)
        matrix = np.zeros((lenght, lenght))
        i = j = 0
        for node in self.graph.adj_list:
            for link in self.graph.adj_list[node].outlinks:
                for row in self.graph.adj_list:
                    if link == row:
                        matrix[i][j] = 1/len(self.graph.adj_list[node].outlinks)
                    i = i + 1
                i = 0
            j = j + 1
        print("Matriz de probabilidades ")
        print(matrix)
        #muliplica a matrix pelo score inicial
        print("Page Rank das iterações")
        scores_arr = np.asarray([
                    self.scores[key] for key in self.graph.adj_list
                ])
        itn = np.dot(matrix, scores_arr)
        print(itn)
        for i in range(iterations - 1):
            itn = np.dot(matrix, itn)
            print(itn)

In [72]:
def make_in_links(conversations, threshold=0.3):
    li = list()
    if threshold is None:
        for c in conversations.keys():
            li.append(c)
    else:
        for c in conversations.keys():
            if conversations[c] >= threshold:
                li.append(c)
    return li

In [73]:
def get_conversations(msg, res):
    return return_conversation_by_jaccard(msg, res, questions, answers)

In [74]:
def make_graph(qea, res, threshold=None):
    g = Graph()
    lenght = len(qea)
    i = 1
    
    for k in qea:
        conversations = get_conversations(k, res)
        if conversations is not None:        
            #in_links = {k: v for k, v in conversations.items() if v >= threshold and v != 1}
            in_links = make_in_links(conversations, threshold=threshold)
            in_links.remove(k)
            #if in_links:
            g.add_node(k, in_links, in_links)
            #print(g.print_graph())
        if (i % 100) == 0:
            print('Processed '+ str(i) +' of '+ str(lenght))    
        i += 1
    return g

In [75]:
def save_page_compute(qea, res, file_name, threshold=None, iterations=3):

    g = make_graph(qea=qea, res=res, threshold=threshold)

    p = PageRank(graph=g)
    pc = p.compute(iterations=iterations)
    pc = {k: v for k, v in sorted(pc.items(), key=lambda item: item[1], reverse=True)}

    f = open( './chatdata/' + file_name + '.txt', 'w' )
    f.write( repr(pc) )
    f.close()
    return pc

In [78]:
%%time
pc_q = save_page_compute(qea=questions, res=1, threshold=0.001, file_name='page_rank_questions')

Processed 100 of 907
Processed 200 of 907
Processed 300 of 907
Processed 400 of 907
Processed 500 of 907
Processed 600 of 907
Processed 700 of 907
Processed 800 of 907
Processed 900 of 907
CPU times: user 3.16 s, sys: 31.2 ms, total: 3.19 s
Wall time: 3.16 s


In [77]:
%%time
pc_a = save_page_compute(qea=answers, res=0, threshold=0.001, file_name='page_rank_answers')

Processed 100 of 2093
Processed 200 of 2093
Processed 300 of 2093
Processed 400 of 2093
Processed 500 of 2093
Processed 600 of 2093
Processed 700 of 2093
Processed 800 of 2093
Processed 900 of 2093
Processed 1000 of 2093
Processed 1100 of 2093
Processed 1200 of 2093
Processed 1300 of 2093
Processed 1400 of 2093
Processed 1500 of 2093
Processed 1600 of 2093
Processed 1700 of 2093
Processed 1800 of 2093
Processed 1900 of 2093
Processed 2000 of 2093
CPU times: user 17.9 s, sys: 0 ns, total: 17.9 s
Wall time: 18 s


In [52]:
#checking page rank tends to 1
s = 0
for p in pc_q:
    s += pc_q[p]
    
s

0.9195148842337374

## Similary of Jaccard based on Page Rank

In [346]:
def return_conversation_by_page_rank(msg, conversations, page_compute):
    """
    Return a dictionary of message and similarity sorted by highter similarity
    """  
    similarity = {k: v for k, v in conversations.items()}
    
    result = dict()
    for k, v in similarity.items():        
        result[k] = page_compute[k] * v
    
    result = {k: v for k, v in sorted(result.items(), key=lambda item: item[1], reverse=True)}
    return next(iter(result))

In [347]:
conversation = return_conversation_by_page_rank(msg, conversations, page_compute=pc_q)
print('Conversation: '+ conversation)
print('Page compute: '+ str(pc_q[conversation]))
print('Similarity: '+ str(conversations[conversation]))

Conversation: i heard you scream be it a bad one
Page compute: 0.0013261765629984257
Similarity: 0.5


In [348]:
print('Original: '+ msg)
print('Most similar: '+conversation)

Original: i heard you be a good guy be it right
Most similar: i heard you scream be it a bad one


In [349]:
print('>>> '+msg_raw)
msg2 = list(messages[messages['msg_pre_processed'] == conversation]['msg_2'])[0]
print('<<< '+msg2)

>>> I heard you are a good guy. Is it right?
<<< It was bad.


## Return the result using the Cossine Similarity

In [350]:
bow = CountVectorizer()

In [351]:
def return_conversation_by_cossine(msg, res, questions, answers, bow):
    """
    Return a dictionary of message and similarity sorted by highter similarity
    """
    if res >= 0.5:
        msg_list = questions    
    else:
        msg_list = answers
       
    similarity = []
    for m in msg_list:
        new_msg_list = [msg, m]
        vector_bow = bow.fit_transform(new_msg_list)
        msg_bow = vector_bow.todense()[0]
        m_bow   = vector_bow.todense()[1]
        similarity.append(distance.cosine(msg_bow, m_bow))
    
    result = {} 
    for key in msg_list: 
        for value in similarity: 
            result[key] = value
            similarity.remove(value) 
            break 
    
    return {k: v for k, v in sorted(result.items(), key=lambda item: item[1], reverse=False)}
    

In [352]:
conversations = return_conversation_by_cossine(msg, res, questions, answers, bow)
conversations

{'be that it be this right': 0.3291796067500631,
 'be it': 0.3291796067500632,
 'i be': 0.36754446796632423,
 'i heard you scream be it a bad one': 0.4023856953328032,
 'be you joking': 0.4522774424948339,
 'be i disturb you': 0.4522774424948339,
 'be you lose': 0.4522774424948339,
 'who be you': 0.4522774424948339,
 'how be it': 0.4522774424948339,
 'what be you hiding why be you afraid': 0.4522774424948339,
 'be you religious': 0.4522774424948339,
 'be you hit': 0.4522774424948339,
 'whereve you be': 0.4522774424948339,
 'be that right': 0.4522774424948339,
 'be you married': 0.4522774424948339,
 'what be it': 0.4522774424948339,
 'what be you': 0.4522774424948339,
 'how be you': 0.4522774424948339,
 'be he with you be you travel together': 0.4522774424948339,
 'be you alright': 0.4522774424948339,
 'when you be go last year where be you': 0.4737651884157825,
 'you know what that be right': 0.4836022205056777,
 'how can you be so certain the ocean be say to be infinite': 0.4921666249

In [353]:
conversation = get_the_next_conversation(conversations)
conversation

'be that it be this right'

### Return result

In [354]:
print('>>> '+msg_raw)
msg2 = list(messages[messages['msg_pre_processed'] == conversation]['msg_2'])[0]
print('<<< '+msg2)

>>> I heard you are a good guy. Is it right?
<<< Yeah! I mean I don't know... it looks right.


In [355]:
conversation = return_conversation_by_page_rank(msg, conversations, page_compute=pc_q)
print('Conversation: '+ conversation)
print('Page compute: '+ str(pc_q[conversation]))
print('Similarity: '+ str(conversations[conversation]))

Conversation: listen patrizia the marshal say there a current that pass by here and end up at another island i dont know which he want to send one of his men over to have a look one never know do you mind if i ask raimondo to go with him
Page compute: 0.0014705432439456848
Similarity: 0.9573598567288779


In [356]:
print('>>> '+msg_raw)
msg2 = list(messages[messages['msg_pre_processed'] == conversation]['msg_2'])[0]
print('<<< '+msg2)

>>> I heard you are a good guy. Is it right?
<<< I don't see why I should mind.


## Get result using Cossine Similarity with Embedding

In [357]:
def download_embedding(get_it):
    if get_it:
        !gdown https://drive.google.com/uc?id=1zI8pGfbUHuU_0wY_FV4tD6w6ZCUJTQbh
    print('Download finished')

In [358]:
#The embedding is already downloaded
#Change to True to download
download_embedding(False)

Download finished


In [359]:
%%time
#get the embedding
newfilepath = "embedding_wiki_100d_pt.txt"
filepath = "ptwiki_20180420_100d.txt.bz2"
with open(newfilepath, 'wb') as new_file, bz2.BZ2File(filepath, 'rb') as file:
    for data in iter(lambda : file.read(100 * 1024), b''):
        new_file.write(data)

CPU times: user 1min 20s, sys: 1.34 s, total: 1min 22s
Wall time: 1min 38s


In [360]:
%%time
word_vectors = gensim.models.KeyedVectors.load_word2vec_format(filepath, binary=False)

CPU times: user 5min 33s, sys: 7.47 s, total: 5min 40s
Wall time: 5min 54s


In [361]:
word_vectors

<gensim.models.keyedvectors.Word2VecKeyedVectors at 0x7fdec1b313a0>

In [362]:
def calculate_embedding(phrase):
    """
    Return the mean of embeddings of a phrase
    """
    
    arr = np.array([word_vectors[word] for word in phrase if word in word_vectors.vocab])
    
    sum = np.zeros(len(arr[0]))
    for a in arr:
        sum = sum + a
        
    arr_mean = sum / len(arr) 
    
    return arr_mean

In [363]:
def return_conversation_by_cossine_embedding(msg, res, questions, answers, word_vectors):
    """
    Return a dictionary of message and similarity sorted by highter similarity
    """
    if res >= 0.5:
        msg_list = questions    
    else:
        msg_list = answers       
    
    msg = msg.split(' ')
    
    similarity = []
    for m in msg_list:        
        m = m.split(' ')
        
        try:
            msg_vector_embedding = calculate_embedding(msg)
            m_vector_embedding   = calculate_embedding(m)
        
            similarity.append(distance.cosine(msg_vector_embedding, m_vector_embedding))
        except:
            print("An exception occurred")
            print('> '+ ' '.join(m))
    
    result = {} 
    for key in msg_list: 
        for value in similarity: 
            result[key] = value
            similarity.remove(value) 
            break 
    
    return {k: v for k, v in sorted(result.items(), key=lambda item: item[1], reverse=False)}
    

In [364]:
%%time
conversations = return_conversation_by_cossine_embedding(msg, res, questions, answers, word_vectors)
conversations

An exception occurred
> cornilius
An exception occurred
> tempestuous
CPU times: user 297 ms, sys: 0 ns, total: 297 ms
Wall time: 291 ms


{'whatd he say': 0.029875306388547407,
 'be i suppose to feel well like right now or do i have some time to think about it': 0.034059318672903904,
 'so tell me about this dance be it fun': 0.03424502650369621,
 'listen goddamn it if you think im happy about it youre nut i just gotta take care of a few thing okay': 0.034806584619113456,
 'look at this you have blood on your shirt whose be it': 0.03513713541704733,
 'where the christ do you think youre go': 0.035582504481475374,
 'how could i be the mainland have be found exactly a i say it would': 0.03602647778306178,
 'be you a fireman that how you knew how to rig the apartment': 0.037026938617928384,
 'now tell the truth arent you a bit disappointed but i already told you': 0.03720380541103385,
 'which one pull the trigger': 0.0372386205366092,
 'dr floyd at the risk of press you on a point you seem reticent to discus may i ask you a straightforward question': 0.037644315278638096,
 'what do you want u to do i dont know myself but wel

In [365]:
conversation = get_the_next_conversation(conversations)
conversation

'whatd he say'

### Return result

In [366]:
print('>>> '+msg_raw)
msg2 = list(messages[messages['msg_pre_processed'] == conversation]['msg_2'])[0]
print('<<< '+msg2)

>>> I heard you are a good guy. Is it right?
<<< Who cares?


In [367]:
conversation = return_conversation_by_page_rank(msg, conversations, page_compute=pc_q)
print('Conversation: '+ conversation)
print('Page compute: '+ str(pc_q[conversation]))
print('Similarity: '+ str(conversations[conversation]))

Conversation: what be you do here
Page compute: 0.0013133376922688927
Similarity: 0.6336503543731205


In [368]:
print('>>> '+msg_raw)
msg2 = list(messages[messages['msg_pre_processed'] == conversation]['msg_2'])[0]
print('<<< '+msg2)

>>> I heard you are a good guy. Is it right?
<<< I heard there was a poetry reading.
