<img src="https://github.com/hernancontigiani/ceia_memorias_especializacion/raw/master/Figures/logoFIUBA.jpg" width="500" align="center">


# Natural language processing
## Bot with NLTK using a Wikipedia corpus


In [30]:
import json
import string
import random
import re
import urllib.request

import numpy as np

# To read and parse the Wikipedia HTML text
import bs4 as bs

import nltk
# Download the tokenizer models and the WordNet data
nltk.download("punkt")
nltk.download("wordnet")
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\zerba\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\zerba\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\zerba\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

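### 1 - Data
The data is an English-language Harry Potter text. The commented-out cell below fetches the Wikipedia article on Harry Potter; the cells after it load the text from a local file instead.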

In [31]:
# raw_html = urllib.request.urlopen('https://en.wikipedia.org/wiki/Harry_Potter')
# raw_html = raw_html.read()
# article_html = bs.BeautifulSoup(raw_html, 'lxml')
# article_paragraphs = article_html.find_all('p')

# article_text = ''

# for para in article_paragraphs:
#     article_text += para.text

# article_text = article_text.lower()

In [32]:
article_file = open(r"C:\Users\zerba\Downloads\Harry Potter.txt",encoding="utf8")

In [33]:
article_text = article_file.read()
article_file.close()  # release the file handle once the text is in memory

In [34]:
# Let's take a look
article_text



In [35]:
print("Number of characters in the article:", len(article_text))

Number of characters in the article: 446153


### 2 - Preprocessing
- Remove special characters (bracketed numeric markers such as `[12]`)
- Collapse runs of spaces and line breaks into a single space

In [36]:
text = re.sub(r'\[[0-9]*\]', ' ', article_text)
text = re.sub(r'\s+', ' ', text)
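The effect of the two substitutions is easier to see on a short string. The snippet below is a minimal sketch that applies the same two patterns to a made-up sample (`sample` is hypothetical and not taken from the corpus).

```python
import re

# Hypothetical sample, not taken from the corpus
sample = "Tennis is a racket sport[1] played\n\nworldwide.[23]   It is popular."

# Replace bracketed numeric markers such as [1] or [23] with a space
no_refs = re.sub(r'\[[0-9]*\]', ' ', sample)
# Collapse every run of whitespace (spaces, tabs, newlines) into one space
clean = re.sub(r'\s+', ' ', no_refs)

print(clean)
# Tennis is a racket sport played worldwide. It is popular.
```

The order matters: removing the markers first leaves extra spaces behind, and the second substitution cleans those up as well.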

In [37]:
# Let's take a look
text



In [38]:
print("Number of characters in the text:", len(text))

Number of characters in the text: 441442


### 3 - Split the text into sentences and words

In [39]:
corpus = nltk.sent_tokenize(text)
words = nltk.word_tokenize(text)

In [40]:
# Let's take a look
corpus[:10]

['They were the last people you’d expect to be involved in anything strange or mysterious, because they just didn’t hold with such nonsense.',
 'Mr Dursley was the director of a firm called Grunnings, which made drills.',
 'He was a big, beefy man with hardly any neck, although he did have a very large moustache.',
 'Mrs Dursley was thin and blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbours.',
 'The Dursleys had a small son called Dudley and in their opinion there was no finer boy anywhere.',
 'The Dursleys had everything they wanted, but they also had a secret, and their greatest fear was that somebody would discover it.',
 'They didn’t think they could bear it if anyone found out about the Potters.',
 'Mrs Potter was Mrs Dursley’s sister, but they hadn’t met for several years; in fact, Mrs Dursley pretended she didn’t have a sister, because her sister and her good-for-n

In [41]:
# Let's take a look
words[:20]

['They',
 'were',
 'the',
 'last',
 'people',
 'you',
 '’',
 'd',
 'expect',
 'to',
 'be',
 'involved',
 'in',
 'anything',
 'strange',
 'or',
 'mysterious',
 ',',
 'because',
 'they']

In [42]:
print("Number of tokens:", len(words))

Number of tokens: 102004
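Note that `len(words)` counts every token occurrence, repetitions and punctuation included, so the figure above is a token count and not the number of distinct words. A minimal sketch of how distinct words could be counted, using the first 20 tokens shown earlier (hard-coded here as `sample_words` so the snippet runs on its own; the lowercase and alphabetic-only filter is one possible definition of "word", assumed for illustration):

```python
# First tokens of the `words` list shown above, hard-coded so the snippet runs on its own
sample_words = ['They', 'were', 'the', 'last', 'people', 'you', '’', 'd',
                'expect', 'to', 'be', 'involved', 'in', 'anything', 'strange',
                'or', 'mysterious', ',', 'because', 'they']

# Distinct word types: lowercase everything and keep alphabetic tokens only
vocabulary = {token.lower() for token in sample_words if token.isalpha()}

print("Tokens:", len(sample_words))       # Tokens: 20
print("Distinct words:", len(vocabulary))  # Distinct words: 17
```

Applied to the full list, the same expression would be `len({w.lower() for w in words if w.isalpha()})`.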


### 4 - Helper functions to clean and process the user input
- Lemmatize the tokens of the sentence
- Remove punctuation symbols

In [43]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

def perform_lemmatization(tokens):
    return [lemmatizer.lemmatize(token) for token in tokens]

# Translation table that maps every ASCII punctuation character to None (deletes it)
punctuation_removal = {ord(punctuation): None for punctuation in string.punctuation}

def get_processed_text(document):
    # 1 - lowercase the text
    # 2 - remove punctuation symbols
    # 3 - tokenize
    # 4 - lemmatize
    return perform_lemmatization(nltk.word_tokenize(document.lower().translate(punctuation_removal)))
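To see what the lowercasing and punctuation steps do on their own, the sketch below reuses the same translation table but swaps `nltk.word_tokenize` for a plain `str.split` and skips lemmatization, so it needs no NLTK data. `normalize` is a hypothetical helper introduced only for this illustration.

```python
import string

# Same translation table as in the cell above: every ASCII punctuation mark maps to None
punctuation_removal = {ord(ch): None for ch in string.punctuation}

def normalize(document):
    # Hypothetical helper: lowercase, strip ASCII punctuation, split on whitespace
    # (str.split stands in for nltk.word_tokenize; no lemmatization here)
    return document.lower().translate(punctuation_removal).split()

print(normalize("What is Quidditch?"))  # ['what', 'is', 'quidditch']
print(normalize("You-Know-Who"))        # ['youknowwho']
print(normalize("you’d"))               # ['you’d']
```

Two details are worth noticing. Hyphens are removed, so `You-Know-Who` collapses into the single token `youknowwho`. The typographic apostrophe `’` is not part of `string.punctuation`, so it survives this step.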

### 5 - Use TF-IDF vectors and cosine similarity built from the corpus

In [44]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def generate_response(user_input, corpus):
    response = ''
    # Temporarily add the user's question to the corpus so that it is
    # vectorized together with the other documents/sentences
    corpus.append(user_input)

    # Create a TF-IDF vectorizer that removes English stop words and uses
    # our "get_processed_text" function to obtain the lemmatized tokens
    word_vectorizer = TfidfVectorizer(tokenizer=get_processed_text, stop_words='english')

    # Build the vectors from the corpus
    all_word_vectors = word_vectorizer.fit_transform(corpus)

    # Cosine similarity between the question (the last row, "-1") and every document
    # NOTE: this similarity matrix is covered in more detail with word embeddings
    similar_vector_values = cosine_similarity(all_word_vectors[-1], all_word_vectors).flatten()

    # Index of the vector closest to our sentence
    # --> the top match is the question itself, so take the second best ("-2")
    similar_sentence_number = similar_vector_values.argsort()[-2]
    vector_matched = similar_vector_values[similar_sentence_number]

    if vector_matched == 0:
        response = "I am sorry, I could not understand you"
    else:
        response = corpus[similar_sentence_number]

    # pop() drops exactly the entry appended above; remove() would delete the
    # first equal sentence, which may be an original corpus entry
    corpus.pop()
    return response
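The retrieval idea in `generate_response` can be followed by hand on a toy corpus. The sketch below is a pure-Python illustration, not the scikit-learn implementation: it uses the plain `tf * log(N / df)` weighting, whereas `TfidfVectorizer` applies a smoothed idf and L2 normalization by default, so the actual scores differ. The corpus `docs` and the helpers `tfidf_vectors` and `cosine` are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    # Plain tf * log(N / df) weighting (a simplification of what scikit-learn does)
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return [
        {term: count * math.log(n_docs / doc_freq[term])
         for term, count in Counter(doc).items()}
        for doc in tokenized_docs
    ]

def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical toy corpus; the last entry plays the role of the user input
docs = [
    "harry plays quidditch",
    "the train leaves from the platform",
    "hagrid keeps a dragon",
    "which platform for the train",
]
vectors = tfidf_vectors([doc.split() for doc in docs])
query = vectors[-1]
scores = [cosine(query, vec) for vec in vectors]

# The query matches itself best, so the answer is the second-best index
ranked = sorted(range(len(scores)), key=scores.__getitem__)
best = ranked[-2]
print(docs[best])  # the train leaves from the platform
```

Sentences that share no term with the query score exactly 0, which is the case the function above answers with "I am sorry, I could not understand you".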

### 6 - Test the system



In [72]:
user_input_list = ['you-know-who', 'Harry bedroom', 'what is the train platform', 'what is quidditch']

for user_input in user_input_list:
    response = generate_response(user_input, corpus)
    print('-------------------------------')
    print('User input: {}'.format(user_input))
    print('Bot response: {}'.format(response))

-------------------------------
User input: you-know-who
Bot response: I mean, You-Know-Who ‘Call him Voldemort, Harry.
-------------------------------
User input: Harry bedroom
Bot response: ‘Go to your cupboard - I mean, your bedroom,’ he wheezed at Harry.
-------------------------------
User input: what is the train platform
Bot response: Platform nine - platform ten.
-------------------------------
User input: what is quidditch
Bot response: ‘Play Quidditch at all?’ ‘No,’ Harry said again, wondering what on earth Quidditch could be.
