# Chandler was the best! 

<br>
Ask Chandler whatever you like and he will give you a straight, Chandler-style answer! <br>

***Disclaimer**: Based on the Chandler character from the '90s sitcom Friends. Learn more here:* [Friends on Wikipedia](https://en.wikipedia.org/wiki/Friends)

<br>
The pipeline for this example goes as follows: 


1. Preprocessing 

2. Word2Vec (not using embeddings)

3. Do a few more things to set up a dataset

4. Train a model

5. Deploy an interface and track Chandler's answers! 

In [1]:
import os
import re
import spacy as spy
from spacy import displacy
import pandas as pd



## 1. Preprocessing

In [2]:
docs =r"D:\Archivos Santiago\_Especializacion Machine learning\_Elessanti\Non-touchable interfaces\NLP\chandler\dataset"

def get_filename(path):
    return [i.path for i in os.scandir(path) if i.is_file()]

files=get_filename(docs)

corpus_original = []

for filepath in files:
    with open(filepath, 'r', encoding='utf8') as file_to_read:
        text = file_to_read.read()
        corpus_original.append(text.split('\n'))

In [3]:
def corpus_pre_processing(corpus_original):
    # Build a new corpus instead of editing the input in place:
    # popping from a list while iterating over it skips elements,
    # so consecutive empty lines would survive.
    corpus_temp = []

    for chapter in corpus_original:
        # clear empty lines and lowercase the remaining ones
        corpus_temp.append([line.lower() for line in chapter if len(line) > 0])

    return corpus_temp

          

def get_chandler_lines(listarray):
   
    chandler_corpus = []

    for cidx, chapter in enumerate(listarray):
        for lidx, line in enumerate(chapter):
            if "chandler: " in line: 
                
                chandler_corpus.append(line)

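    # delete the speaker tag
    for oidx in range(len(chandler_corpus)):
        chandler_corpus[oidx] = chandler_corpus[oidx].replace('chandler:', '')

    # delete stage directions written in (...) or [...]
    # (raw string, so the backslashes are not treated as invalid escapes)
    for oidx in range(len(chandler_corpus)):
        chandler_corpus[oidx] = re.sub(r"[\(\[].*?[\)\]]", "", chandler_corpus[oidx])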


    return chandler_corpus



In [4]:
preprocessed_corpus = corpus_pre_processing(corpus_original=corpus_original)
full_chandler_lines = get_chandler_lines(listarray=preprocessed_corpus)
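
As a quick sanity check of the two cleaning steps in `get_chandler_lines`, here is a minimal sketch on a single made-up script line. The sample line and the helper name `clean_line` are illustrative assumptions, not part of the dataset or the notebook.

```python
import re

def clean_line(line):
    """Hypothetical helper: the same two cleaning steps, applied to one line."""
    # remove the speaker tag
    line = line.replace('chandler:', '')
    # remove stage directions written in (...) or [...]
    return re.sub(r"[\(\[].*?[\)\]]", "", line)

# made-up line in the same shape as the lowercased scripts
sample = "chandler: (to joey) could that report be any later? [sighs]"
cleaned = clean_line(sample)
print(repr(cleaned))
```

Note that the cleaned line keeps its surrounding whitespace, which is why the printed lines later in the notebook start with a space.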


## 2. Tokenization and lemmatization using spaCy

In [5]:
# tokenization: each line becomes a spaCy Doc
nlp = spy.load("en_core_web_sm")
tokens = [nlp(line) for line in full_chandler_lines]

In [6]:
print(tokens[0])

 all right joey, be nice.  so does he have a hump? a hump and a hairpiece?


In [19]:
displacy.serve(tokens[0], style="dep")




Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


In [8]:
tempy = 10
print("One cherrypicked token ----------> Original word: ", tokens[0][tempy].text,"    &       Word's Lemma: ", tokens[0][tempy].lemma_)

One cherrypicked token ----------> Original word:  does     &       Word's Lemma:  do


## 3. Do a few more things to set up a dataset

Remember that after lemmatizing we're still working with words, and the Chandler bot needs to be fed numbers. 

As far as I can tell, the small `en_core_web_sm` model does not ship with word2vec-style word vectors, so in this case I'll use scikit-learn to obtain a TF-IDF representation of each document (each Chandler line).

To make the matching more robust, we're going to use the lemmas instead of the raw tokens, so that different forms of the same word (for example "does" and "do") count as the same term.
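
To make "TF-IDF representation" a bit more concrete, here is a small pure-Python sketch of the idea. It is a simplification under stated assumptions: whitespace tokenization, no stop-word removal, raw term counts multiplied by a smoothed IDF and then L2-normalized, which is my understanding of scikit-learn's default weighting. The toy documents and the function name `tfidf` are made up for illustration.

```python
import math
from collections import Counter

# Toy "documents" (made up for illustration, already lemmatized)
docs = [
    "I be not great at the advice",
    "can I interest you in a sarcastic comment",
    "I be hopeless and awkward",
]

def tfidf(docs):
    """Simplified TF-IDF: term count * smoothed IDF, then L2 normalization."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

vectors = tfidf(docs)
# A term that appears in a single document outweighs one that appears in all of them
print(vectors[2]["hopeless"] > vectors[2]["i"])  # True
```

The point of the weighting is that rare, distinctive words dominate the comparison between lines, while words that appear everywhere contribute little.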

In [9]:
print(type(tokens), type(tokens[0]))

<class 'list'> <class 'spacy.tokens.doc.Doc'>


In [10]:
# this is done only for using sklearn later.

chandler_lemma_corpus = []

for doc in tokens:
    doc_lemma_list = [token.lemma_ for token in doc]
    lemma_string = ' '.join(doc_lemma_list)
    chandler_lemma_corpus.append(lemma_string)

In [11]:
print("original document: ",tokens[8].text)
print("Lemmatized string ",chandler_lemma_corpus[8])


original document:   sometimes i wish i was a lesbian...  did i say that out loud?
Lemmatized string    sometimes I wish I be a lesbian ...   do I say that out loud ?


## 4. Train a model

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def generate_response(user_input, corpus_lemma, corpus_original):
    response = ''
    # Lemmatize the user's question the same way as the corpus
    user_tokens = nlp(user_input)
    user_lemma_list = [token.lemma_ for token in user_tokens]
    user_lemma_strings = ' '.join(user_lemma_list)

    # Temporarily append the question to the corpus so that it is vectorized
    # together with the other documents/sentences
    corpus_lemma.append(user_lemma_strings)

    # Create a TF-IDF vectorizer that removes English stop words
    # (the input is already lemmatized, so the default tokenizer is enough)
    word_vectorizer = TfidfVectorizer(stop_words='english')

    # Build the vectors from the corpus
    all_word_vectors = word_vectorizer.fit_transform(corpus_lemma)

    # Cosine similarity between the question (the last row, "-1") and every
    # other document; "[:-1]" leaves the question itself out, so an exact
    # duplicate of a corpus line cannot make us pick the question's own index
    # NOTE: with word embeddings we will look at this similarity matrix in more detail
    similar_vector_values = cosine_similarity(all_word_vectors[-1], all_word_vectors[:-1]).flatten()

    # Index and score of the vector closest to our sentence
    similar_sentence_number = similar_vector_values.argmax()
    vector_matched = similar_vector_values[similar_sentence_number]

    if vector_matched == 0:  # cosine similarity was zero (no terms in common)
        response = "I am sorry, I could not understand you"
    else:
        response = corpus_original[similar_sentence_number]  # most similar document in the corpus
    corpus_lemma.pop()
    return response

In [17]:
generate_response("hump?", corpus_lemma=chandler_lemma_corpus, corpus_original=tokens)

 all right joey, be nice.  so does he have a hump? a hump and a hairpiece?
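
The retrieval step can also be sketched without any libraries. The version below is a simplified stand-in, not the notebook's method: it compares plain term counts instead of TF-IDF weights, and the names `cosine`, `best_match` and `toy_corpus` are hypothetical. It shows the two behaviours that matter here: picking the line with the highest cosine similarity, and falling back when nothing overlaps.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def best_match(query, corpus):
    """Return the corpus line most similar to the query, or None if no term overlaps."""
    query_vec = Counter(query.lower().split())
    scores = [cosine(query_vec, Counter(line.lower().split())) for line in corpus]
    best = max(range(len(corpus)), key=scores.__getitem__)
    return corpus[best] if scores[best] > 0 else None

# made-up, already lemmatized lines
toy_corpus = ["it be a sock bunny", "we be take a break", "hey be monica here"]
print(best_match("sock", toy_corpus))       # it be a sock bunny
print(best_match("armadillo", toy_corpus))  # None
```

Because the score only depends on shared terms, a one-word question such as "sock" is enough to retrieve a line, and a question with no known words yields no match at all.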

## 5. Deploy an interface and track Chandler's answers! 

In [14]:
# Gradio will be used to try out the bot
# A powerful tool for quickly building interfaces to test models
# https://gradio.app/
import sys
!{sys.executable} -m pip install gradio --quiet


In [18]:
import gradio as gr

def bot_response(human_text):
    print("Q:", human_text)    
    resp = generate_response(human_text.lower(), corpus_lemma=chandler_lemma_corpus, corpus_original=tokens)
    print("A:", resp)
    return resp

iface = gr.Interface(
    fn=bot_response,
    inputs=["textbox"],
    outputs="text",
    layout="vertical")

iface.launch(debug=True)



Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.


Q: were is ross?
A:  ross, don’t. ross!
Q: went to the party
A:  y’know, that party wasn’t bad.
Q: they were on a break
A:  we're taking a break!
Q: hey monica
A:  hey-hey, is monica here?
Q: sock
A:  it’s a sock bunny.
Q: does she has a house?
A:  we're getting the house.  we're getting the house.
Q: were's your dad?
A:  you're not a dad. you're not a dad.
Q: moody
A:  you know, i once dated a miss crankypants. lovely girl, kinda moody.
Q: well, that's enough
A: I am sorry, I could not understand you
Q: moon
A:  yeah, i guess i could use that. i could say that your love sends me to the moon.
Q: last time i've checked
A:  what check thing?
Q: armadillo
A:  because, if santa and the holiday…armadillo?  ...are ever in the same room for too long the universe will implode. merry christmas!
Keyboard interruption in main thread... closing server.


(<gradio.routes.App at 0x290dc950c40>, 'http://127.0.0.1:7860/', None)