## A intenção do projeto é criar um chatbot baseado em reviews de filmes para que se possa fazer perguntas e manter uma conversa livre sobre este tema

- link do banco de dados https://www.kaggle.com/Cornell-University/movie-dialog-corpus?select=movie_lines.tsv
- referências
>- https://shanebarker.com/blog/deep-learning-chatbot/
> -https://towardsdatascience.com/how-to-create-a-chatbot-with-python-deep-learning-in-less-than-an-hour-56a063bdfc44

In [1]:
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import pandas as pd
import re
import gensim
import numpy as np
from keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import load_model
from scipy.spatial import distance
import math
import random
import bz2
import itertools
import pickle
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import wordnet

In [2]:
#expand jupyter cells
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

### Opening movie reviews

In [96]:
messages = pd.read_csv('./chatdata/movie_lines_pre_processed.tsv', delimiter="\t", quoting=3, encoding='ISO-8859-2')

In [97]:
messages.head()

Unnamed: 0,msg_line,user_id,movie_id,msg,msg_2,msg_pre_processed,target
0,L49,u0,m0,Did you change your hair?,No.,change hair,1
1,L50,u3,m0,No.,You might wanna think about it,no,0
2,L51,u0,m0,You might wanna think about it,can you explain it better?,might wanna think,0
3,L59,u9,m0,I missed you.,It says here you exposed yourself to a group o...,miss,0
4,L60,u8,m0,It says here you exposed yourself to a group o...,It was a bratwurst. I was eating lunch.,say expose group freshman girl,0


### Defining the list of questions and answers

In [98]:
questions = set(messages[messages['target'] == 1]['msg_pre_processed'])

In [99]:
answers = set(messages[messages['target'] == 0]['msg_pre_processed'])

In [100]:
len(answers)

162997

In [57]:
len(questions)

66111

In [6]:
def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

In [7]:
lemmatizer = WordNetLemmatizer()
def pre_processing_text(corpus):
    #remove html tags
    corpus = re.sub(r'<.*?>', '', str(corpus))
    
    #remove non-alphanumeric characters
    corpus = re.sub(r'[^a-z A-Z 0-9 \s]', '', str(corpus))
    
    #remove duplicated spaces
    corpus = re.sub(r' +', ' ', str(corpus))
    
    #capitalization
    corpus = corpus.lower()
    
    #tokenization
    corpus = re.findall(r"\w+(?:'\w+)?|[^\w\s]", corpus)
    
    #lammatization
    corpus = [lemmatizer.lemmatize(c, get_wordnet_pos(c)) for c in corpus]
    
    #remove punctuation
    corpus = [t for t in corpus if t not in string.punctuation]
    
    #remove stopwords
    #it makes the model worst
    #stopwords_ = stopwords.words("english")
    #corpus = [t for t in corpus if t not in stopwords_]
    
    corpus = ' '.join(corpus)

    return corpus

In [8]:
msg_raw = 'I heard you are a good guy. Is it right?'
#msg_raw = 'yes i heard you all right 20000000 thats quite a lot isnt it'
msg = pre_processing_text(msg_raw)
print(msg)

with open('./chatdata/tokenizer.pickle', 'rb') as handle:
    tokenizer = pickle.load(handle)
p = tokenizer.texts_to_matrix([msg])

model = load_model('./chatdata/chatbot_model.h5')
res = model.predict(p)

print(res)

i heard you be a good guy be it right
[[0.61740196]]


## Returning the conversation for the message using Jaccard Similarity

In [23]:
def jaccard_similarity(f1, f2):    
    f1 = set(str(f1).split(' '))
    f2 = set(str(f2).split(' '))
    
    intersecao = f1.intersection(f2)
    uniao = f1.union(f2)
    
    return len(intersecao) / len(uniao)

In [91]:
def return_conversation_by_jaccard(msg, res, questions, answers, threshold=None):
    """
    Return a dictionary of message and similarity sorted by highter similarity
    """
    if res >= 0.5:
        msg_list = questions
        similarity = [jaccard_similarity(msg, str(m)) for m in questions]        
    else:
        similarity = [jaccard_similarity(msg, str(m)) for m in answers]
        msg_list = answers
    
    result = dict(zip(msg_list, similarity))
    
    return {k: v for k, v in sorted(result.items(), key=lambda item: item[1], reverse=True)}    

In [101]:
%%time
conversations = return_conversation_by_jaccard(msg, res, questions, answers)
conversations

CPU times: user 312 ms, sys: 31.2 ms, total: 344 ms
Wall time: 336 ms


{'right good': 0.2222222222222222,
 'good right': 0.2222222222222222,
 'guy right': 0.2222222222222222,
 'right awhile guy good look rich witty god': 0.21428571428571427,
 'good man right': 0.2,
 'guy treatin right': 0.2,
 'yeah good guy': 0.2,
 'thats good right': 0.2,
 'guy terminator right like': 0.18181818181818182,
 'hey high good right': 0.18181818181818182,
 'heard good bunch killer': 0.18181818181818182,
 'guy british museum right': 0.18181818181818182,
 'guy right get london': 0.18181818181818182,
 'youve heard story right': 0.18181818181818182,
 'check hang guy right': 0.18181818181818182,
 'people good people people right place': 0.18181818181818182,
 'say youre good guy': 0.18181818181818182,
 'luke pretty good guy wasnt': 0.16666666666666666,
 'good beautiful intelligent knew right': 0.16666666666666666,
 'theyre good guy see holly': 0.16666666666666666,
 'hey guy thing ok right': 0.16666666666666666,
 'see guy right middle road': 0.16666666666666666,
 'look dont know good

In [64]:
#get the first item in the dict
def get_the_next_conversation(conversations):
    keys_view = conversations.keys()
    keys_iterator = iter(keys_view)
    conversation = next(keys_iterator)
    return conversation

In [65]:
conversation = get_the_next_conversation(conversations)
conversation

'good right'

### The returned message

In [66]:
print('>>> '+msg_raw)
msg2 = list(messages[messages['msg_pre_processed'] == conversation]['msg_2'])[0]
print('<<< '+msg2)

>>> I heard you are a good guy. Is it right?
<<< can you explain it better?


## Calculate PageRank
- create a bi-directional graph of messages using similarity bigger then a threshould

In [26]:
"""
Classe para criação de um nó (página) do grafo.
O cosntrutuor recebe recebe o nome do nó, a lista de nós de entrada e saída.
"""
class Node(object):
    def __init__(self, node_name: str, inlinks: list, outlinks: list):
        self.node_name = node_name
        self.inlinks = inlinks
        self.outlinks = outlinks

"""
Classe para criação do grafo de páginas.
O construtor inicializa um dicionário vazio de lista de adjacência. 
"""
class Graph(object):
    def __init__(self):
        self.adj_list = dict()

    #Adiciona um nó ao grafo com seus nós de entrada e sapida
    def add_node(self, node_name: str, inlinks: list, outlinks: list):
        node = Node(node_name = node_name, inlinks=inlinks, outlinks=outlinks)
        self.adj_list[node_name] = node

    #Imprime os dados do grafo criado
    def print_graph(self):
        for key in self.adj_list:
            print(f"{key}:")
            print(f"\tEntrada: {self.adj_list[key].inlinks}")
            print(f"\tSaída: {self.adj_list[key].outlinks}")

In [27]:
import numpy as np

"""
A classe PageRank possui métodos para computar o Page Rank de cada página dado um número de iterações.
"""
class PageRank(object):
    #Construtor da classe que recebe um objeto Graph, inicializa um dicionário vazio de scores
    def __init__(self, graph: Graph):
        self.graph = graph
        self.scores = dict()
        self.__initialize_scores()

    #Inicializa os scores do Page Rank com o valor inicial 1/n, onde n é o número de nós (páginas) do grafo
    def __initialize_scores(self):
        n = len(self.graph.adj_list)
        for key in self.graph.adj_list:
            self.scores[key] = 1/n

    #Calcula o Page Rank para cada página dado o número de iterações. Ainda não utiliza a convergence_rate no cálculos
    def compute(self, iterations: int = 10, convergence_rate: float = 0.01):
        new_scores = dict()
        for i in range(iterations):
            for  node in self.graph.adj_list:
                in_to_node = np.asarray([
                    self.scores[x] for x in self.graph.adj_list[node].inlinks
                ])
                out_to_node = self.graph.adj_list[node].inlinks
                amount_out_to_node = np.asarray([
                    len(self.graph.adj_list[x].outlinks) for x in out_to_node
                ])
                score = np.sum(in_to_node / amount_out_to_node)
                new_scores[node] = score
            #print(self.scores)
            self.scores = new_scores.copy()
        return self.scores

    def power_method(self, iterations: int = 10):
        #Gera a matrix de probabilidades de navegação a cada nó
        lenght = len(self.graph.adj_list)
        matrix = np.zeros((lenght, lenght))
        i = j = 0
        for node in self.graph.adj_list:
            for link in self.graph.adj_list[node].outlinks:
                for row in self.graph.adj_list:
                    if link == row:
                        matrix[i][j] = 1/len(self.graph.adj_list[node].outlinks)
                    i = i + 1
                i = 0
            j = j + 1
        print("Matriz de probabilidades ")
        print(matrix)
        #muliplica a matrix pelo score inicial
        print("Page Rank das iterações")
        scores_arr = np.asarray([
                    self.scores[key] for key in self.graph.adj_list
                ])
        itn = np.dot(matrix, scores_arr)
        print(itn)
        for i in range(iterations - 1):
            itn = np.dot(matrix, itn)
            print(itn)

In [None]:
def make_in_links(conversations, threshold=0.3):
    li = list()
    if threshold is None:
        li = conversations.keys()
    else:
        dic = {k: v for k, v in conversations.items() if v >= threshold}
        li = [c for c in dic.keys()]
        
    return li

In [14]:
def get_conversations(msg, res):
    return return_conversation_by_jaccard(msg, res, questions, answers)

In [44]:
def make_graph(qea, res, threshold=None):
    g = Graph()
    lenght = len(qea)
    i = 1
    
    for k in qea:
        conversations = get_conversations(k, res)
        if conversations is not None:
            in_links = make_in_links(conversations, threshold=threshold)
            in_links.remove(k)
            g.add_node(k, in_links, in_links)
        if (i % 100) == 0:
            print('Processed '+ str(i) +' of '+ str(lenght))    
        i += 1
    return g

In [16]:
def save_page_compute(qea, res, file_name, threshold=None, iterations=3):

    g = make_graph(qea=qea, res=res, threshold=threshold)

    p = PageRank(graph=g)
    pc = p.compute(iterations=iterations)
    pc = {k: v for k, v in sorted(pc.items(), key=lambda item: item[1], reverse=True)}

    f = open( './chatdata/' + file_name + '.txt', 'w' )
    f.write( repr(pc) )
    f.close()
    return pc

In [17]:
threshold=0.01

In [73]:
%%time
pc_q = save_page_compute(qea=questions, res=1, threshold=threshold, file_name='page_rank_questions')

Processed 100 of 66111
Processed 200 of 66111
Processed 300 of 66111
Processed 400 of 66111
Processed 500 of 66111
Processed 600 of 66111
Processed 700 of 66111
Processed 800 of 66111
Processed 900 of 66111
Processed 1000 of 66111
Processed 1100 of 66111
Processed 1200 of 66111
Processed 1300 of 66111
Processed 1400 of 66111
Processed 1500 of 66111
Processed 1600 of 66111
Processed 1700 of 66111
Processed 1800 of 66111
Processed 1900 of 66111
Processed 2000 of 66111
Processed 2100 of 66111
Processed 2200 of 66111
Processed 2300 of 66111
Processed 2400 of 66111
Processed 2500 of 66111
Processed 2600 of 66111
Processed 2700 of 66111
Processed 2800 of 66111
Processed 2900 of 66111
Processed 3000 of 66111
Processed 3100 of 66111
Processed 3200 of 66111
Processed 3300 of 66111
Processed 3400 of 66111
Processed 3500 of 66111
Processed 3600 of 66111
Processed 3700 of 66111
Processed 3800 of 66111
Processed 3900 of 66111
Processed 4000 of 66111
Processed 4100 of 66111
Processed 4200 of 66111
P

Processed 33400 of 66111
Processed 33500 of 66111
Processed 33600 of 66111
Processed 33700 of 66111
Processed 33800 of 66111
Processed 33900 of 66111
Processed 34000 of 66111
Processed 34100 of 66111
Processed 34200 of 66111
Processed 34300 of 66111
Processed 34400 of 66111
Processed 34500 of 66111
Processed 34600 of 66111
Processed 34700 of 66111
Processed 34800 of 66111
Processed 34900 of 66111
Processed 35000 of 66111
Processed 35100 of 66111
Processed 35200 of 66111
Processed 35300 of 66111
Processed 35400 of 66111
Processed 35500 of 66111
Processed 35600 of 66111
Processed 35700 of 66111
Processed 35800 of 66111
Processed 35900 of 66111
Processed 36000 of 66111
Processed 36100 of 66111
Processed 36200 of 66111
Processed 36300 of 66111
Processed 36400 of 66111
Processed 36500 of 66111
Processed 36600 of 66111
Processed 36700 of 66111
Processed 36800 of 66111
Processed 36900 of 66111
Processed 37000 of 66111
Processed 37100 of 66111
Processed 37200 of 66111
Processed 37300 of 66111


CPU times: user 20h 51min 23s, sys: 4min 25s, total: 20h 55min 49s
Wall time: 20h 57min 59s


In [102]:
%%time
pc_a = save_page_compute(qea=answers, res=0, threshold=threshold, file_name='page_rank_answers')

Processed 100 of 162997
Processed 200 of 162997
Processed 300 of 162997
Processed 400 of 162997
Processed 500 of 162997
Processed 600 of 162997
Processed 700 of 162997
Processed 800 of 162997
Processed 900 of 162997
Processed 1000 of 162997
Processed 1100 of 162997
Processed 1200 of 162997
Processed 1300 of 162997
Processed 1400 of 162997
Processed 1500 of 162997
Processed 1600 of 162997
Processed 1700 of 162997
Processed 1800 of 162997
Processed 1900 of 162997
Processed 2000 of 162997
Processed 2100 of 162997
Processed 2200 of 162997
Processed 2300 of 162997
Processed 2400 of 162997
Processed 2500 of 162997
Processed 2600 of 162997
Processed 2700 of 162997
Processed 2800 of 162997
Processed 2900 of 162997
Processed 3000 of 162997
Processed 3100 of 162997
Processed 3200 of 162997
Processed 3300 of 162997
Processed 3400 of 162997
Processed 3500 of 162997
Processed 3600 of 162997
Processed 3700 of 162997
Processed 3800 of 162997
Processed 3900 of 162997
Processed 4000 of 162997
Processed

Processed 32100 of 162997
Processed 32200 of 162997
Processed 32300 of 162997
Processed 32400 of 162997
Processed 32500 of 162997
Processed 32600 of 162997
Processed 32700 of 162997
Processed 32800 of 162997
Processed 32900 of 162997
Processed 33000 of 162997
Processed 33100 of 162997
Processed 33200 of 162997
Processed 33300 of 162997
Processed 33400 of 162997
Processed 33500 of 162997
Processed 33600 of 162997
Processed 33700 of 162997
Processed 33800 of 162997
Processed 33900 of 162997
Processed 34000 of 162997
Processed 34100 of 162997
Processed 34200 of 162997
Processed 34300 of 162997
Processed 34400 of 162997
Processed 34500 of 162997
Processed 34600 of 162997
Processed 34700 of 162997
Processed 34800 of 162997
Processed 34900 of 162997
Processed 35000 of 162997
Processed 35100 of 162997
Processed 35200 of 162997
Processed 35300 of 162997
Processed 35400 of 162997
Processed 35500 of 162997
Processed 35600 of 162997
Processed 35700 of 162997
Processed 35800 of 162997
Processed 35

Processed 63700 of 162997
Processed 63800 of 162997
Processed 63900 of 162997
Processed 64000 of 162997
Processed 64100 of 162997
Processed 64200 of 162997
Processed 64300 of 162997
Processed 64400 of 162997
Processed 64500 of 162997
Processed 64600 of 162997
Processed 64700 of 162997
Processed 64800 of 162997
Processed 64900 of 162997
Processed 65000 of 162997
Processed 65100 of 162997
Processed 65200 of 162997
Processed 65300 of 162997
Processed 65400 of 162997
Processed 65500 of 162997
Processed 65600 of 162997
Processed 65700 of 162997
Processed 65800 of 162997
Processed 65900 of 162997
Processed 66000 of 162997
Processed 66100 of 162997
Processed 66200 of 162997
Processed 66300 of 162997
Processed 66400 of 162997
Processed 66500 of 162997
Processed 66600 of 162997
Processed 66700 of 162997
Processed 66800 of 162997
Processed 66900 of 162997
Processed 67000 of 162997
Processed 67100 of 162997
Processed 67200 of 162997
Processed 67300 of 162997
Processed 67400 of 162997
Processed 67

Processed 95300 of 162997
Processed 95400 of 162997
Processed 95500 of 162997
Processed 95600 of 162997
Processed 95700 of 162997
Processed 95800 of 162997
Processed 95900 of 162997
Processed 96000 of 162997
Processed 96100 of 162997
Processed 96200 of 162997
Processed 96300 of 162997
Processed 96400 of 162997
Processed 96500 of 162997
Processed 96600 of 162997
Processed 96700 of 162997
Processed 96800 of 162997
Processed 96900 of 162997
Processed 97000 of 162997
Processed 97100 of 162997
Processed 97200 of 162997
Processed 97300 of 162997
Processed 97400 of 162997
Processed 97500 of 162997
Processed 97600 of 162997
Processed 97700 of 162997
Processed 97800 of 162997
Processed 97900 of 162997
Processed 98000 of 162997
Processed 98100 of 162997
Processed 98200 of 162997
Processed 98300 of 162997
Processed 98400 of 162997
Processed 98500 of 162997
Processed 98600 of 162997
Processed 98700 of 162997
Processed 98800 of 162997
Processed 98900 of 162997
Processed 99000 of 162997
Processed 99

Processed 125900 of 162997
Processed 126000 of 162997
Processed 126100 of 162997
Processed 126200 of 162997
Processed 126300 of 162997
Processed 126400 of 162997
Processed 126500 of 162997
Processed 126600 of 162997
Processed 126700 of 162997
Processed 126800 of 162997
Processed 126900 of 162997
Processed 127000 of 162997
Processed 127100 of 162997
Processed 127200 of 162997
Processed 127300 of 162997
Processed 127400 of 162997
Processed 127500 of 162997
Processed 127600 of 162997
Processed 127700 of 162997
Processed 127800 of 162997
Processed 127900 of 162997
Processed 128000 of 162997
Processed 128100 of 162997
Processed 128200 of 162997
Processed 128300 of 162997
Processed 128400 of 162997
Processed 128500 of 162997
Processed 128600 of 162997
Processed 128700 of 162997
Processed 128800 of 162997
Processed 128900 of 162997
Processed 129000 of 162997
Processed 129100 of 162997
Processed 129200 of 162997
Processed 129300 of 162997
Processed 129400 of 162997
Processed 129500 of 162997
P

Processed 156300 of 162997
Processed 156400 of 162997
Processed 156500 of 162997
Processed 156600 of 162997
Processed 156700 of 162997
Processed 156800 of 162997
Processed 156900 of 162997
Processed 157000 of 162997
Processed 157100 of 162997
Processed 157200 of 162997
Processed 157300 of 162997
Processed 157400 of 162997
Processed 157500 of 162997
Processed 157600 of 162997
Processed 157700 of 162997
Processed 157800 of 162997
Processed 157900 of 162997
Processed 158000 of 162997
Processed 158100 of 162997
Processed 158200 of 162997
Processed 158300 of 162997
Processed 158400 of 162997
Processed 158500 of 162997
Processed 158600 of 162997
Processed 158700 of 162997
Processed 158800 of 162997
Processed 158900 of 162997
Processed 159000 of 162997
Processed 159100 of 162997
Processed 159200 of 162997
Processed 159300 of 162997
Processed 159400 of 162997
Processed 159500 of 162997
Processed 159600 of 162997
Processed 159700 of 162997
Processed 159800 of 162997
Processed 159900 of 162997
P

In [95]:
#checking page rank tends to 1
s = 0
#pc_q = pc_a
for p in pc_q:
    s += pc_q[p]
    
s

0.9660774008600099

## Similary of Jaccard based on Page Rank

In [346]:
def return_conversation_by_page_rank(msg, conversations, page_compute):
    """
    Return a dictionary of message and similarity sorted by highter similarity
    """  
    similarity = {k: v for k, v in conversations.items()}
    
    result = dict()
    for k, v in similarity.items():        
        result[k] = page_compute[k] * v
    
    result = {k: v for k, v in sorted(result.items(), key=lambda item: item[1], reverse=True)}
    return next(iter(result))

In [347]:
conversation = return_conversation_by_page_rank(msg, conversations, page_compute=pc_q)
print('Conversation: '+ conversation)
print('Page compute: '+ str(pc_q[conversation]))
print('Similarity: '+ str(conversations[conversation]))

Conversation: i heard you scream be it a bad one
Page compute: 0.0013261765629984257
Similarity: 0.5


In [348]:
print('Original: '+ msg)
print('Most similar: '+conversation)

Original: i heard you be a good guy be it right
Most similar: i heard you scream be it a bad one


In [349]:
print('>>> '+msg_raw)
msg2 = list(messages[messages['msg_pre_processed'] == conversation]['msg_2'])[0]
print('<<< '+msg2)

>>> I heard you are a good guy. Is it right?
<<< It was bad.


## Return the result using the Cossine Similarity

In [350]:
bow = CountVectorizer()

In [351]:
def return_conversation_by_cossine(msg, res, questions, answers, bow):
    """
    Return a dictionary of message and similarity sorted by highter similarity
    """
    if res >= 0.5:
        msg_list = questions    
    else:
        msg_list = answers
       
    similarity = []
    for m in msg_list:
        new_msg_list = [msg, m]
        vector_bow = bow.fit_transform(new_msg_list)
        msg_bow = vector_bow.todense()[0]
        m_bow   = vector_bow.todense()[1]
        similarity.append(distance.cosine(msg_bow, m_bow))
    
    result = {} 
    for key in msg_list: 
        for value in similarity: 
            result[key] = value
            similarity.remove(value) 
            break 
    
    return {k: v for k, v in sorted(result.items(), key=lambda item: item[1], reverse=False)}
    

In [352]:
conversations = return_conversation_by_cossine(msg, res, questions, answers, bow)
conversations

{'be that it be this right': 0.3291796067500631,
 'be it': 0.3291796067500632,
 'i be': 0.36754446796632423,
 'i heard you scream be it a bad one': 0.4023856953328032,
 'be you joking': 0.4522774424948339,
 'be i disturb you': 0.4522774424948339,
 'be you lose': 0.4522774424948339,
 'who be you': 0.4522774424948339,
 'how be it': 0.4522774424948339,
 'what be you hiding why be you afraid': 0.4522774424948339,
 'be you religious': 0.4522774424948339,
 'be you hit': 0.4522774424948339,
 'whereve you be': 0.4522774424948339,
 'be that right': 0.4522774424948339,
 'be you married': 0.4522774424948339,
 'what be it': 0.4522774424948339,
 'what be you': 0.4522774424948339,
 'how be you': 0.4522774424948339,
 'be he with you be you travel together': 0.4522774424948339,
 'be you alright': 0.4522774424948339,
 'when you be go last year where be you': 0.4737651884157825,
 'you know what that be right': 0.4836022205056777,
 'how can you be so certain the ocean be say to be infinite': 0.4921666249

In [353]:
conversation = get_the_next_conversation(conversations)
conversation

'be that it be this right'

### Return result

In [354]:
print('>>> '+msg_raw)
msg2 = list(messages[messages['msg_pre_processed'] == conversation]['msg_2'])[0]
print('<<< '+msg2)

>>> I heard you are a good guy. Is it right?
<<< Yeah! I mean I don't know... it looks right.


In [355]:
conversation = return_conversation_by_page_rank(msg, conversations, page_compute=pc_q)
print('Conversation: '+ conversation)
print('Page compute: '+ str(pc_q[conversation]))
print('Similarity: '+ str(conversations[conversation]))

Conversation: listen patrizia the marshal say there a current that pass by here and end up at another island i dont know which he want to send one of his men over to have a look one never know do you mind if i ask raimondo to go with him
Page compute: 0.0014705432439456848
Similarity: 0.9573598567288779


In [356]:
print('>>> '+msg_raw)
msg2 = list(messages[messages['msg_pre_processed'] == conversation]['msg_2'])[0]
print('<<< '+msg2)

>>> I heard you are a good guy. Is it right?
<<< I don't see why I should mind.


## Get result using Cossine Similarity with Embedding

In [357]:
def download_embedding(get_it):
    if get_it:
        !gdown https://drive.google.com/uc?id=1zI8pGfbUHuU_0wY_FV4tD6w6ZCUJTQbh
    print('Download finished')

In [358]:
#The embedding is already downloaded
#Change to True to download
download_embedding(False)

Download finished


In [359]:
%%time
#get the embedding
newfilepath = "embedding_wiki_100d_pt.txt"
filepath = "ptwiki_20180420_100d.txt.bz2"
with open(newfilepath, 'wb') as new_file, bz2.BZ2File(filepath, 'rb') as file:
    for data in iter(lambda : file.read(100 * 1024), b''):
        new_file.write(data)

CPU times: user 1min 20s, sys: 1.34 s, total: 1min 22s
Wall time: 1min 38s


In [360]:
%%time
word_vectors = gensim.models.KeyedVectors.load_word2vec_format(filepath, binary=False)

CPU times: user 5min 33s, sys: 7.47 s, total: 5min 40s
Wall time: 5min 54s


In [361]:
word_vectors

<gensim.models.keyedvectors.Word2VecKeyedVectors at 0x7fdec1b313a0>

In [362]:
def calculate_embedding(phrase):
    """
    Return the mean of embeddings of a phrase
    """
    
    arr = np.array([word_vectors[word] for word in phrase if word in word_vectors.vocab])
    
    sum = np.zeros(len(arr[0]))
    for a in arr:
        sum = sum + a
        
    arr_mean = sum / len(arr) 
    
    return arr_mean

In [363]:
def return_conversation_by_cossine_embedding(msg, res, questions, answers, word_vectors):
    """
    Return a dictionary of message and similarity sorted by highter similarity
    """
    if res >= 0.5:
        msg_list = questions    
    else:
        msg_list = answers       
    
    msg = msg.split(' ')
    
    similarity = []
    for m in msg_list:        
        m = m.split(' ')
        
        try:
            msg_vector_embedding = calculate_embedding(msg)
            m_vector_embedding   = calculate_embedding(m)
        
            similarity.append(distance.cosine(msg_vector_embedding, m_vector_embedding))
        except:
            print("An exception occurred")
            print('> '+ ' '.join(m))
    
    result = {} 
    for key in msg_list: 
        for value in similarity: 
            result[key] = value
            similarity.remove(value) 
            break 
    
    return {k: v for k, v in sorted(result.items(), key=lambda item: item[1], reverse=False)}
    

In [364]:
%%time
conversations = return_conversation_by_cossine_embedding(msg, res, questions, answers, word_vectors)
conversations

An exception occurred
> cornilius
An exception occurred
> tempestuous
CPU times: user 297 ms, sys: 0 ns, total: 297 ms
Wall time: 291 ms


{'whatd he say': 0.029875306388547407,
 'be i suppose to feel well like right now or do i have some time to think about it': 0.034059318672903904,
 'so tell me about this dance be it fun': 0.03424502650369621,
 'listen goddamn it if you think im happy about it youre nut i just gotta take care of a few thing okay': 0.034806584619113456,
 'look at this you have blood on your shirt whose be it': 0.03513713541704733,
 'where the christ do you think youre go': 0.035582504481475374,
 'how could i be the mainland have be found exactly a i say it would': 0.03602647778306178,
 'be you a fireman that how you knew how to rig the apartment': 0.037026938617928384,
 'now tell the truth arent you a bit disappointed but i already told you': 0.03720380541103385,
 'which one pull the trigger': 0.0372386205366092,
 'dr floyd at the risk of press you on a point you seem reticent to discus may i ask you a straightforward question': 0.037644315278638096,
 'what do you want u to do i dont know myself but wel

In [365]:
conversation = get_the_next_conversation(conversations)
conversation

'whatd he say'

### Return result

In [366]:
print('>>> '+msg_raw)
msg2 = list(messages[messages['msg_pre_processed'] == conversation]['msg_2'])[0]
print('<<< '+msg2)

>>> I heard you are a good guy. Is it right?
<<< Who cares?


In [367]:
conversation = return_conversation_by_page_rank(msg, conversations, page_compute=pc_q)
print('Conversation: '+ conversation)
print('Page compute: '+ str(pc_q[conversation]))
print('Similarity: '+ str(conversations[conversation]))

Conversation: what be you do here
Page compute: 0.0013133376922688927
Similarity: 0.6336503543731205


In [368]:
print('>>> '+msg_raw)
msg2 = list(messages[messages['msg_pre_processed'] == conversation]['msg_2'])[0]
print('<<< '+msg2)

>>> I heard you are a good guy. Is it right?
<<< I heard there was a poetry reading.
