<img src="https://github.com/hernancontigiani/ceia_memorias_especializacion/raw/master/Figures/logoFIUBA.jpg" width="500" align="center">


# Natural language processing
## Bot with NLTK using a Wikipedia corpus


In [2]:
import json
import string
import random
import re
import urllib.request

import numpy as np

import nltk
# Download the tokenizer models and the WordNet dictionary
nltk.download("punkt")
nltk.download("wordnet")
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

### Data
The exercise is written for the English Wikipedia article on "climate change"; the cells below instead load an English text from a local file, `bible.txt`.

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
%cd /content/drive/MyDrive/EIA-UBA/Bimestre3/NLP

/content/drive/MyDrive/EIA-UBA/Bimestre3/NLP


In [12]:
# Use a context manager so the file is closed after reading
with open("bible.txt", "r") as f:
    article_text = f.read()

In [13]:
# Take a look
article_text


In [14]:
print("Number of characters in the article:", len(article_text))

Number of characters in the article: 4351186


### 2 - Preprocessing
- Remove bracketed numeric references such as `[12]`
- Collapse repeated spaces and line breaks into a single space

In [15]:
text = re.sub(r'\[[0-9]*\]', ' ', article_text)
text = re.sub(r'\s+', ' ', text)
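
To see what the two substitutions do, here is a minimal sketch on a short made-up string. The `sample` text is hypothetical and only serves to illustrate the patterns; the regular expressions are the same as in the cell above.

```python
import re

# Hypothetical sample text with bracketed references and irregular whitespace
sample = "Climate change [1] affects\n\nthe whole   planet.[23]\tIt is measurable."

# Same two substitutions as in the preprocessing cell
cleaned = re.sub(r'\[[0-9]*\]', ' ', sample)  # replace "[1]", "[23]" with a space
cleaned = re.sub(r'\s+', ' ', cleaned)         # collapse whitespace runs into one space

print(cleaned)
```

Replacing each reference with a space rather than an empty string avoids gluing two words together; the second substitution then removes any doubled spaces this leaves behind.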

In [16]:
# Take a look
text


In [17]:
print("Number of characters in the text:", len(text))

Number of characters in the text: 4324027


### 3 - Split the text into sentences and words

In [18]:
corpus = nltk.sent_tokenize(text)
words = nltk.word_tokenize(text)

In [19]:
# Take a look
corpus[:10]

['1:1 In the beginning God created the heaven and the earth.',
 '1:2 And the earth was without form, and void; and darkness was upon the face of the deep.',
 'And the Spirit of God moved upon the face of the waters.',
 '1:3 And God said, Let there be light: and there was light.',
 '1:4 And God saw the light, that it was good: and God divided the light from the darkness.',
 '1:5 And God called the light Day, and the darkness he called Night.',
 'And the evening and the morning were the first day.',
 '1:6 And God said, Let there be a firmament in the midst of the waters, and let it divide the waters from the waters.',
 '1:7 And God made the firmament, and divided the waters which were under the firmament from the waters which were above the firmament: and it was so.',
 '1:8 And God called the firmament Heaven.']

In [20]:
# Take a look
words[:20]

['1:1',
 'In',
 'the',
 'beginning',
 'God',
 'created',
 'the',
 'heaven',
 'and',
 'the',
 'earth',
 '.',
 '1:2',
 'And',
 'the',
 'earth',
 'was',
 'without',
 'form',
 ',']

In [21]:
# len(words) counts every token occurrence, not the number of distinct words
print("Number of tokens:", len(words))

Number of tokens: 950295
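
Note that `len(words)` counts every token occurrence, whereas a vocabulary is usually understood as the set of distinct tokens. The difference can be sketched on a small hypothetical token list (`sample_words` is made up here to mimic the output of `nltk.word_tokenize`):

```python
# Hypothetical token list, mimicking the output of nltk.word_tokenize
sample_words = ['In', 'the', 'beginning', 'God', 'created',
                'the', 'heaven', 'and', 'the', 'earth', '.']

n_tokens = len(sample_words)                        # every occurrence counts
vocabulary = set(w.lower() for w in sample_words)   # distinct tokens, case-folded
n_vocab = len(vocabulary)

print("Tokens:", n_tokens)
print("Distinct tokens:", n_vocab)
```

Whether to fold case or drop punctuation before counting is a design choice; the sketch lowercases the tokens and keeps punctuation.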


### 4 - Helper functions to clean and process the user's input
- Lemmatize the tokens of the sentence
- Remove punctuation symbols

In [22]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

def perform_lemmatization(tokens):
    return [lemmatizer.lemmatize(token) for token in tokens]

punctuation_removal = dict((ord(punctuation), None) for punctuation in string.punctuation)

def get_processed_text(document):
    # 1 - convert the text to lowercase
    # 2 - remove punctuation symbols
    # 3 - tokenize
    # 4 - lemmatize
    return perform_lemmatization(nltk.word_tokenize(document.lower().translate(punctuation_removal)))
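
The punctuation step relies on `str.translate` with a table that maps each punctuation code point to `None`. A minimal sketch of just that step follows, using only the standard library; the `user_input` string is hypothetical, and a plain whitespace split stands in for `nltk.word_tokenize` so the snippet runs without downloading NLTK data.

```python
import string

# Same translation table as in the cell above: punctuation code point -> None
punctuation_removal = dict((ord(punctuation), None) for punctuation in string.punctuation)

# Hypothetical user input
user_input = "Who created the Heaven, and the Earth?!"

# Lowercase first, then drop punctuation
normalized = user_input.lower().translate(punctuation_removal)

# Whitespace split used here as a stand-in for nltk.word_tokenize
tokens = normalized.split()

print(tokens)
```

Lemmatization is left out of this sketch because `WordNetLemmatizer` needs the WordNet data downloaded at the top of the notebook.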

### 5 - Use TF-IDF vectors and cosine similarity built from the corpus

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def generate_response(user_input, corpus):
    response = ''
    # Append the user's question to the corpus so that its similarity
    # to the other documents/sentences can be computed
    corpus.append(user_input)

    # Create a TF-IDF vectorizer that removes English stop words and uses
    # our "get_processed_text" function to obtain the lemmatized tokens
    word_vectorizer = TfidfVectorizer(tokenizer=get_processed_text, stop_words='english')

    # Build the vectors from the corpus
    all_word_vectors = word_vectorizer.fit_transform(corpus)

    # Compute the cosine similarity between the appended question (the last row, "-1")
    # and every document, including the question itself
    # NOTE: this similarity matrix is covered in more detail with word embeddings
    similar_vector_values = cosine_similarity(all_word_vectors[-1], all_word_vectors)

    # Get the index of the vector closest to our sentence
    # --> "-2" skips the top match, which is the question compared with itself
    similar_sentence_number = similar_vector_values.argsort()[0][-2]
    matched_vector = similar_vector_values.flatten()
    matched_vector.sort()
    vector_matched = matched_vector[-2]

    if vector_matched == 0:
        response = "I am sorry, I could not understand you"
    else:
        response = corpus[similar_sentence_number]

    # Remove the question appended above (always the last element);
    # pop() avoids deleting an identical sentence earlier in the corpus
    corpus.pop()
    return response
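
To make the ranking step concrete, the following is a minimal pure-Python sketch of TF-IDF weighting followed by cosine similarity on a hypothetical three-sentence corpus. It is not the scikit-learn implementation used above: `tfidf_vectors`, `cosine`, `toy_corpus`, and `query` are names introduced here for illustration, tokenization is a plain whitespace split, and no stop words are removed.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (simplified TF-IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF, so a term present in every document keeps a positive weight
        vectors.append({t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
                        for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical corpus and query
toy_corpus = [
    "god created the heaven and the earth",
    "the evening and the morning were the first day",
    "god called the light day",
]
query = "who created the earth"

# Vectorize corpus and query together, as generate_response does
vectors = tfidf_vectors(toy_corpus + [query])
scores = [cosine(vectors[-1], v) for v in vectors[:-1]]
best = max(range(len(scores)), key=scores.__getitem__)

print(toy_corpus[best])
```

Unlike `generate_response`, the sketch scores the query only against the corpus rows, so there is no self-match to skip and the best index can be taken directly.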

### 6 - Test the system
The system tries to find the sentence in the corpus that is most closely related to the input text.

In [24]:
# Gradio is used to try out the bot
# It is a powerful tool for quickly building interfaces to test models
# https://gradio.app/
import sys
!{sys.executable} -m pip install gradio --quiet


In [25]:
import gradio as gr

def bot_response(human_text):
    print(human_text)
    return generate_response(human_text.lower(), corpus)

iface = gr.Interface(
    fn=bot_response,
    inputs=["textbox"],
    outputs="text",
    layout="vertical")

iface.launch(debug=True)



Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Running on public URL: https://39082.gradio.app

This share link expires in 72 hours. For free permanent hosting, check out Spaces (https://huggingface.co/spaces)


god
jesus
apocalypsis
apocalypse
judea
king
son
earth
lord

Keyboard interruption in main thread... closing server.


(<gradio.routes.App at 0x7ff49161eb90>,
 'http://127.0.0.1:7860/',
 'https://39082.gradio.app')