<img src="https://github.com/hernancontigiani/ceia_memorias_especializacion/raw/master/Figures/logoFIUBA.jpg" width="500" align="right">


# Natural Language Processing


## TP3

### Student: Emmanuel Cardozo

This notebook uses Gensim to train our own word embeddings on a corpus extracted from the book "The Hitchhiker's Guide to the Galaxy".

In [32]:
%load_ext lab_black

from gensim.models import Word2Vec
from gensim.models.callbacks import CallbackAny2Vec
from keras.preprocessing.text import text_to_word_sequence
from sklearn.manifold import TSNE
import bs4 as bs
import html
import numpy as np
import plotly.express as px
import re
import urllib.request



As in the class 2 exercise, the corpus is the text of the book "The Hitchhiker's Guide to the Galaxy". It can be found at the following [link](https://archive.org/stream/TheultimateHitchhikersGuide/The%20Hitchhiker%27s%20Guide%20To%20The%20Galaxy_djvu.txt).

The part of the page that contains the text to process is then extracted, and bracketed numeric markers, redundant whitespace, and HTML entities are cleaned up.

In [2]:
def get_html_article(url):
    raw_html = urllib.request.urlopen(url)
    raw_html = raw_html.read()
    return bs.BeautifulSoup(raw_html, "html.parser")


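def extract_text_from_html(soup, startDelimiter, endDelimiter):
    # `soup` is the parsed page; the name avoids shadowing the `html` module
    txt = str(soup)
    # Keep only the text between the two delimiters
    start = txt.find(startDelimiter) + len(startDelimiter)
    end = txt.find(endDelimiter)
    return txt[start:end]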


def preprocess_article(article):
    # Remove bracketed numeric markers such as [12]
    filtered_string = re.sub(r"\[[0-9]*\]", " ", article)
    # Collapse runs of whitespace into a single space
    filtered_string = re.sub(r"\s+", " ", filtered_string)
    # Decode HTML entities. Example: &amp; -> &
    filtered_string = html.unescape(filtered_string)
    return filtered_string.lower()


html_article = get_html_article(
    "https://archive.org/stream/TheultimateHitchhikersGuide/The%20Hitchhiker%27s%20Guide%20To%20The%20Galaxy_djvu.txt"
)
article = extract_text_from_html(html_article, "<pre>", "</pre>")
article = preprocess_article(article)
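As a quick sanity check, the two text helpers can be exercised on a small hand-written string instead of the downloaded page. The sample input below is made up for illustration (it is not part of the real corpus), and the functions are repeated, with the first parameter renamed to `page`, so the snippet runs on its own.

```python
import html
import re


def extract_text_from_html(page, start_delimiter, end_delimiter):
    # Same slicing logic as above, applied to any string-like page
    txt = str(page)
    start = txt.find(start_delimiter) + len(start_delimiter)
    end = txt.find(end_delimiter)
    return txt[start:end]


def preprocess_article(article):
    filtered_string = re.sub(r"\[[0-9]*\]", " ", article)
    filtered_string = re.sub(r"\s+", " ", filtered_string)
    filtered_string = html.unescape(filtered_string)
    return filtered_string.lower()


# Hypothetical sample page, not taken from the real corpus
sample_page = "<html><pre>The  Guide[1] says:\n Don&#39;t &amp; Panic</pre></html>"
sample_text = extract_text_from_html(sample_page, "<pre>", "</pre>")
print(preprocess_article(sample_text))
# the guide says: don't & panic
```

Note that `str.find` returns -1 when a delimiter is missing, so on a page without `<pre>` tags the slice would silently return the wrong span.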

The text is split on '.' to obtain individual sentences, and each sentence is tokenized with the *text_to_word_sequence* function from Keras.

In [3]:
sentence_tokens = [text_to_word_sequence(row) for row in article.split(".")]
sentence_tokens[5]

['we', 'are', 'talking', 'of', 'a', 'mild', 'inability', 'to', 'stand', 'up']
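To make the tokenization step easier to follow without Keras installed, here is a rough standard-library stand-in. The name `simple_word_sequence` and the `FILTERS` constant are assumptions of this sketch; they are meant to resemble the default behavior of *text_to_word_sequence* (lowercasing and stripping punctuation other than the apostrophe), not to reproduce it exactly.

```python
import string

# Characters stripped by this sketch: all ASCII punctuation except the
# apostrophe, plus tabs and newlines (an assumption of this example)
FILTERS = string.punctuation.replace("'", "") + "\t\n"


def simple_word_sequence(text):
    # Lowercase, replace filtered characters with spaces, split on whitespace
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()


sample = "We are talking of a mild inability to stand up. Don't panic, Arthur!"
sentences = [simple_word_sequence(s) for s in sample.split(".")]
print(sentences[0])
print(sentences[1])
```

Because the apostrophe is kept, possessives such as "hitchhiker's" and "planet's" survive as single tokens, which is why they show up in the similarity lists. Splitting on '.' is a simple heuristic: it also cuts at abbreviations and decimal points.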

We define a callback class to report the loss after each epoch.

In [4]:
class callback(CallbackAny2Vec):
    """
    Callback to print loss after each epoch
    """

    def __init__(self):
        self.epoch = 0
        self.loss_previous_step = 0

    def on_epoch_end(self, model):
        loss = model.get_latest_training_loss()
        print(f"Loss after epoch {self.epoch}: {loss - self.loss_previous_step}")
        self.epoch += 1
        self.loss_previous_step = loss
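The subtraction in the callback is there because the value it reads is treated as a running total over the whole training run rather than a per-epoch figure. A minimal sketch of that bookkeeping, using made-up totals and a hypothetical helper name `per_epoch_losses`:

```python
def per_epoch_losses(cumulative_losses):
    # Turn a running total into per-epoch increments
    previous = 0.0
    deltas = []
    for total in cumulative_losses:
        deltas.append(total - previous)
        previous = total
    return deltas


# Hypothetical running totals as reported after epochs 0, 1 and 2
running_totals = [400.0, 650.0, 850.0]
print(per_epoch_losses(running_totals))
# [400.0, 250.0, 200.0]
```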

The model is created with *Word2Vec* and the vocabulary is then built. Several hyperparameter combinations were tried until finding a set that gives good results.

In [5]:
skipgram_model = Word2Vec(
    min_count=8,
    window=2,
    size=300,  # embedding dimension (renamed to vector_size in gensim 4.0)
    negative=20,
    workers=6,
    sg=1,  # skipgram
)

skipgram_model.build_vocab(sentence_tokens)

print("Number of docs in the corpus:", skipgram_model.corpus_count)
print("Number of distinct words in the corpus:", len(skipgram_model.wv.vocab))

Number of docs in the corpus: 20723
Number of distinct words in the corpus: 2966


The model is trained for 300 epochs.

In [6]:
skipgram_model.train(
    sentence_tokens,
    total_examples=skipgram_model.corpus_count,
    epochs=300,
    compute_loss=True,
    callbacks=[callback()],
)

Loss after epoch 0: 395961.09375
Loss after epoch 1: 248198.34375
Loss after epoch 2: 303334.0
Loss after epoch 3: 279829.8125
Loss after epoch 4: 265694.0
Loss after epoch 5: 261703.25
Loss after epoch 6: 260186.25
Loss after epoch 7: 244003.25
Loss after epoch 8: 232466.75
Loss after epoch 9: 188749.25
Loss after epoch 10: 232034.75
Loss after epoch 11: 228277.5
Loss after epoch 12: 229440.75
Loss after epoch 13: 225494.75
Loss after epoch 14: 224475.75
Loss after epoch 15: 224934.0
Loss after epoch 16: 219369.5
Loss after epoch 17: 207388.0
Loss after epoch 18: 209248.5
Loss after epoch 19: 204721.5
Loss after epoch 20: 202648.5
Loss after epoch 21: 200992.5
Loss after epoch 22: 205671.5
Loss after epoch 23: 198852.0
Loss after epoch 24: 200381.5
Loss after epoch 25: 198017.5
Loss after epoch 26: 200366.0
Loss after epoch 27: 203223.0
Loss after epoch 28: 200675.0
Loss after epoch 29: 203076.5
Loss after epoch 30: 199016.0
Loss after epoch 31: 198526.5
Loss after epoch 32: 200482.0


(51445203, 81200400)

We inspect the results for a few test words.

In [7]:
def get_similars(model, words, topn):
    return [model.wv.most_similar(positive=[word], topn=topn) for word in words]


test_words = ["space", "guide", "human", "planet"]

skipgram_similars = get_similars(skipgram_model, test_words, 10)

In [8]:
print(test_words[0])
skipgram_similars[0]

space


[('prostetnic', 0.2842663824558258),
 ('fabric', 0.268716424703598),
 ('depths', 0.25458142161369324),
 ('easy', 0.25336119532585144),
 ('void', 0.25129246711730957),
 ('string', 0.2506865859031677),
 ('mess', 0.23902928829193115),
 ('blackness', 0.23811087012290955),
 ('appalling', 0.23720991611480713),
 ('dragons', 0.2355571985244751)]

In [9]:
print(test_words[1])
skipgram_similars[1]

guide


[("hitchhiker's", 0.46618199348449707),
 ('douglas', 0.3374624252319336),
 ('edition', 0.32471340894699097),
 ('s', 0.2906302213668823),
 ('5', 0.2782394289970398),
 ('hiker', 0.2729685306549072),
 ('sirius', 0.27272969484329224),
 ('remarkable', 0.2675398290157318),
 ('17', 0.2620775103569031),
 ('galaxy', 0.2612919211387634)]

In [10]:
print(test_words[2])
skipgram_similars[2]

human


[('beings', 0.4466100335121155),
 ('foolish', 0.26765668392181396),
 ('alert', 0.2674822211265564),
 ('video', 0.26567816734313965),
 ('race', 0.2636412978172302),
 ('smiling', 0.2620318531990051),
 ('remarkable', 0.2590775191783905),
 ('embarrassed', 0.2589922249317169),
 ('imagination', 0.2579191029071808),
 ('names', 0.25697314739227295)]

In [11]:
print(test_words[3])
skipgram_similars[3]

planet


[("planet's", 0.27888333797454834),
 ('section', 0.27513107657432556),
 ('daughter', 0.2724984288215637),
 ('race', 0.2540271282196045),
 ('yesterday', 0.25305086374282837),
 ('matches', 0.24764896929264069),
 ('land', 0.24588504433631897),
 ('surface', 0.2442721128463745),
 ('secret', 0.23974773287773132),
 ('wrecked', 0.23947396874427795)]

Some of the words, though not all, are meaningfully related to the input word.
The cosine similarity scores are also low: none exceeds 0.5.

The result is not optimal, but it is acceptable.
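The scores returned by `most_similar` are cosine similarities between word vectors. As a small illustration of what that number measures, here is a plain-Python sketch on made-up 3-dimensional vectors (the real embeddings above have 300 dimensions; the vectors and the helper name `cosine_similarity` are assumptions of this example).

```python
import math


def cosine_similarity(u, v):
    # Dot product divided by the product of the vector norms
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional vectors, far smaller than the 300 used above
space = [1.0, 2.0, 0.0]
void = [2.0, 4.0, 0.0]
tea = [0.0, 0.0, 3.0]

print(cosine_similarity(space, void))  # vectors pointing the same way
print(cosine_similarity(space, tea))  # orthogonal vectors
```

Values close to 1 mean the vectors point in nearly the same direction, and values near 0 mean they are nearly orthogonal, so scores in the 0.2 to 0.5 range indicate only a weak alignment.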

We now build another model, this time using the CBOW method.

In [12]:
cbow_model = Word2Vec(
    min_count=8,
    window=4,
    size=300,
    negative=20,
    workers=6,
    sg=0,  # CBOW
)

cbow_model.build_vocab(sentence_tokens)

print("Number of docs in the corpus:", cbow_model.corpus_count)
print("Number of distinct words in the corpus:", len(cbow_model.wv.vocab))

Number of docs in the corpus: 20723
Number of distinct words in the corpus: 2966
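The `sg` flag switches between the two architectures. One rough way to see the difference is to count the training examples each one derives from a sentence: skip-gram predicts every context word from the center word, while CBOW predicts the center word from its whole context. The sketch below assumes a fixed window and ignores negative sampling and any window-size randomization done by the real implementation; the helper names are hypothetical.

```python
def skipgram_pairs(tokens, window):
    # One (center, context) pair per neighbor inside the window
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window):
    # One (context words, center) example per position
    examples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples


tokens = ["we", "are", "talking", "of", "a", "mild", "inability", "to", "stand", "up"]
print(len(skipgram_pairs(tokens, window=2)))  # 34
print(len(cbow_examples(tokens, window=2)))  # 10
print(cbow_examples(tokens, window=2)[2])
# (['we', 'are', 'of', 'a'], 'talking')
```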


In [13]:
cbow_model.train(
    sentence_tokens,
    total_examples=cbow_model.corpus_count,
    epochs=300,
    compute_loss=True,
    callbacks=[callback()],
)

Loss after epoch 0: 114552.1875
Loss after epoch 1: 103085.046875
Loss after epoch 2: 123013.953125
Loss after epoch 3: 120203.78125
Loss after epoch 4: 116334.03125
Loss after epoch 5: 91057.0
Loss after epoch 6: 110308.875
Loss after epoch 7: 109234.125
Loss after epoch 8: 106210.0
Loss after epoch 9: 100816.75
Loss after epoch 10: 97616.625
Loss after epoch 11: 77264.75
Loss after epoch 12: 76891.25
Loss after epoch 13: 93923.25
Loss after epoch 14: 93038.625
Loss after epoch 15: 91477.875
Loss after epoch 16: 91712.125
Loss after epoch 17: 90794.5
Loss after epoch 18: 89881.375
Loss after epoch 19: 87944.75
Loss after epoch 20: 87599.75
Loss after epoch 21: 83386.375
Loss after epoch 22: 81275.5
Loss after epoch 23: 80901.0
Loss after epoch 24: 80839.5
Loss after epoch 25: 79695.0
Loss after epoch 26: 79580.25
Loss after epoch 27: 79429.5
Loss after epoch 28: 80022.5
Loss after epoch 29: 79358.25
Loss after epoch 30: 79824.5
Loss after epoch 31: 79471.25
Loss after epoch 32: 78894.

(51439895, 81200400)

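Compared with skipgram, the reported loss for CBOW is much lower. The two figures are not directly comparable, though, since skipgram accumulates loss over every (center, context) pair while CBOW makes one prediction per center word, and the two models also use different window sizes. Let us try the same examples.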

In [14]:
cbow_similars = get_similars(cbow_model, test_words, 10)

In [15]:
print(test_words[0])
cbow_similars[0]

space


[('void', 0.313744455575943),
 ('air', 0.26463520526885986),
 ('future', 0.2248440980911255),
 ('doorway', 0.2174740433692932),
 ('emerged', 0.21690744161605835),
 ('interstellar', 0.21497587859630585),
 ('bush', 0.21397343277931213),
 ('half', 0.20871347188949585),
 ('blackness', 0.20778340101242065),
 ('ether', 0.20614898204803467)]

In [16]:
print(test_words[1])
cbow_similars[1]

guide


[("hitchhiker's", 0.30212295055389404),
 ('edition', 0.25454458594322205),
 ('hiker', 0.22720062732696533),
 ('smashed', 0.22524258494377136),
 ('department', 0.21082064509391785),
 ('test', 0.20610305666923523),
 ('share', 0.20533165335655212),
 ('re', 0.19851428270339966),
 ('form', 0.19268806278705597),
 ('douglas', 0.19128835201263428)]

In [17]:
print(test_words[2])
cbow_similars[2]

human


[('beings', 0.32553258538246155),
 ('properly', 0.20916011929512024),
 ('alert', 0.20582205057144165),
 ('intelligence', 0.2016240954399109),
 ('embarrassed', 0.20036527514457703),
 ('tea', 0.19660496711730957),
 ("waiter's", 0.19651682674884796),
 ('race', 0.18841463327407837),
 ('capable', 0.1881083846092224),
 ('merely', 0.18637681007385254)]

In [18]:
print(test_words[3])
cbow_similars[3]

planet


[("planet's", 0.32553794980049133),
 ('world', 0.2671772837638855),
 ('spacecraft', 0.2428017109632492),
 ('matches', 0.2262277603149414),
 ('ship', 0.22292262315750122),
 ('section', 0.21838359534740448),
 ('theory', 0.2171759009361267),
 ('race', 0.20825618505477905),
 ('galaxy', 0.2041124701499939),
 ('broadcast', 0.2007841169834137)]

Despite the lower loss value, the outcome is similar to the skipgram case: some words are related to the input while others are not.

Let us compare the words produced by both methods.

In [19]:
def print_comparison(idx):
    print(f"Word: {test_words[idx]}")
    print("SKIPGRAM - CBOW")
    print("---------------")
    for skipgram, cbow in zip(skipgram_similars[idx], cbow_similars[idx]):
        print(f"{skipgram[0]} - {cbow[0]}")


print_comparison(0)

Word: space
SKIPGRAM - CBOW
---------------
prostetnic - void
fabric - air
depths - future
easy - doorway
void - emerged
string - interstellar
mess - bush
blackness - half
appalling - blackness
dragons - ether


In [20]:
print_comparison(1)

Word: guide
SKIPGRAM - CBOW
---------------
hitchhiker's - hitchhiker's
douglas - edition
edition - hiker
s - smashed
5 - department
hiker - test
sirius - share
remarkable - re
17 - form
galaxy - douglas


In [21]:
print_comparison(2)

Word: human
SKIPGRAM - CBOW
---------------
beings - beings
foolish - properly
alert - alert
video - intelligence
race - embarrassed
smiling - tea
remarkable - waiter's
embarrassed - race
imagination - capable
names - merely


In [22]:
print_comparison(3)

Word: planet
SKIPGRAM - CBOW
---------------
planet's - planet's
section - world
daughter - spacecraft
race - matches
yesterday - ship
matches - section
land - theory
surface - race
secret - galaxy
wrecked - broadcast


At first glance, neither method performs clearly better than the other. Both yield acceptable results.

We plot the embeddings to see how the words are distributed in space.
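One simple way to put a number on the side-by-side comparison is to count how many neighbors the two models share for a given word. The sketch below uses the neighbor lists for "space" copied from the outputs above; the helper name `shared_neighbors` is an assumption of this example.

```python
def shared_neighbors(neighbors_a, neighbors_b):
    # Words that appear in both neighbor lists, in alphabetical order
    return sorted(set(neighbors_a) & set(neighbors_b))


# Neighbor lists for "space", copied from the outputs above
skipgram_space = ["prostetnic", "fabric", "depths", "easy", "void",
                  "string", "mess", "blackness", "appalling", "dragons"]
cbow_space = ["void", "air", "future", "doorway", "emerged",
              "interstellar", "bush", "half", "blackness", "ether"]

common = shared_neighbors(skipgram_space, cbow_space)
print(common, len(common) / len(skipgram_space))
# ['blackness', 'void'] 0.2
```

With the notebook's own variables, the same lists can be obtained as `[w for w, _ in skipgram_similars[0]]` and `[w for w, _ in cbow_similars[0]]`.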

In [25]:
def reduce_dimensions(model):
    num_dimensions = 2

    vectors = np.asarray(model.wv.vectors)
    labels = np.asarray(model.wv.index2word)

    tsne = TSNE(n_components=num_dimensions, random_state=0)
    vectors = tsne.fit_transform(vectors)

    x_vals = [v[0] for v in vectors]
    y_vals = [v[1] for v in vectors]
    return x_vals, y_vals, labels

For the Skipgram model:

In [33]:
x_vals, y_vals, labels = reduce_dimensions(skipgram_model)

MAX_WORDS = 200
fig = px.scatter(x=x_vals[:MAX_WORDS], y=y_vals[:MAX_WORDS], text=labels[:MAX_WORDS])
fig.show()

<img src="./plots/skipgram.png" width="900" align="center">

Zoom

<img src="./plots/skipgram-zoom.png" width="900" align="center">

It is hard to find related words because they all appear very close together, but a few groupings of words that make sense together are visible. For example, the range x=[0, 1], y=[1, 2] contains words such as eyes, looking, looked, and head, which are related to each other. Likewise, the range x=[-1.5, -0.5], y=[-1, 0] groups words such as planet, space, and galaxy.

For the CBOW model:
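Rather than reading a region off the plot by eye, the words inside a coordinate range can be listed programmatically. The following is a minimal sketch with made-up coordinates (not the real t-SNE output); `words_in_box` is a hypothetical helper that takes the same three sequences returned by `reduce_dimensions`.

```python
def words_in_box(x_vals, y_vals, labels, x_range, y_range):
    # Keep the labels whose 2-D coordinates fall inside the rectangle
    x_min, x_max = x_range
    y_min, y_max = y_range
    return [
        label
        for x, y, label in zip(x_vals, y_vals, labels)
        if x_min <= x <= x_max and y_min <= y <= y_max
    ]


# Hypothetical coordinates, not the real t-SNE output
x_demo = [0.2, 0.8, -1.0, 0.5]
y_demo = [1.5, 1.1, -0.5, 3.0]
labels_demo = ["eyes", "looking", "planet", "tea"]

print(words_in_box(x_demo, y_demo, labels_demo, (0, 1), (1, 2)))
# ['eyes', 'looking']
```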

In [36]:
x_vals, y_vals, labels = reduce_dimensions(cbow_model)

MAX_WORDS = 200
fig = px.scatter(x=x_vals[:MAX_WORDS], y=y_vals[:MAX_WORDS], text=labels[:MAX_WORDS])
fig.show()

<img src="./plots/cbow.png" width="900" align="center">

Zoom

<img src="./plots/cbow.png" width="900" align="center">

In this case the words are somewhat more spread out and form two main clusters. However, it is still hard to find groups of closely related words. One interesting region is x=[0, 3], y=[2, 4], which contains a group of past-tense verbs: turned, went, put, left, looked. This suggests that the algorithm learned to relate not only words with similar meaning but also words of the same grammatical type, such as verbs in the same tense.