<a href="https://colab.research.google.com/github/esgoty/Elessanti/blob/main/Custom_embedding_with_Gensim.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chandler and Monica had different ways of speaking

This example builds separate embedding representations from Chandler's and Monica's lines.

The analysis lets us verify two interesting things.

1. For starters, it gives us a numeric representation of words that is **context dependent** (it is learned from each character's own lines), which can then be used to train NLP models.

2. Additionally, it lets us analyze the differences in how each character narrates the world. This is quite relevant nowadays: we are exposed to hundreds of different takes on reality every day, and sometimes it helps to dissect a discourse and analyze it to get a better grasp on things.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import multiprocessing
from gensim.models import Word2Vec

# 1- Chandler lines

### Data
We will use Chandler's lines of dialogue as the dataset, one line per row.

In [3]:
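# Build the dataset, using line breaks to separate the sentences/docs.
# Note: sep='/n' is a literal two-character string, not a newline. It is not expected
# to appear in the text, so each line is kept whole (multi-char separators need engine='python').
df_chandler = pd.read_csv('/content/drive/MyDrive/NLP/chandler_lines.txt', sep='/n', header=None, engine='python')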
df_chandler.head()



Unnamed: 0,0
0,"all right joey, be nice. so does he have a hu..."
1,sounds like a date to me.
2,"alright, so i'm back in high school, i'm stand..."
3,"then i look down, and i realize there's a phon..."
4,that's right.


In [4]:
print("Number of documents:", df_chandler.shape[0])

Number of documents: 8410


### 1 - Preprocessing

In [5]:
from keras.preprocessing.text import text_to_word_sequence

sentence_tokens = []
# Go through every row and turn each sentence into
# a sequence of words (this could also be done with NLTK or spaCy)
for _, row in df_chandler.iterrows():
    sentence_tokens.append(text_to_word_sequence(row[0]))

In [6]:
# Let's take a look
sentence_tokens[:2]

[['all',
  'right',
  'joey',
  'be',
  'nice',
  'so',
  'does',
  'he',
  'have',
  'a',
  'hump',
  'a',
  'hump',
  'and',
  'a',
  'hairpiece'],
 ['sounds', 'like', 'a', 'date', 'to', 'me']]
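
To make the tokenization step less of a black box, here is a rough standard-library approximation of what `text_to_word_sequence` does with its default settings: lowercase the text, replace punctuation with spaces, and split. The function name `simple_word_sequence` is made up for this sketch, and it is not the Keras implementation itself.

```python
# Default filter characters used by Keras' text_to_word_sequence.
# The apostrophe is not in the list, so "i'm" stays a single token.
KERAS_DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_word_sequence(text, filters=KERAS_DEFAULT_FILTERS):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in filters})
    return text.lower().translate(table).split()

print(simple_word_sequence("Sounds like a date to me."))
```

This also explains why contractions such as "there's" or "it's" show up as single tokens in the output.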

### 2 - Create the vectors (word2vec)

In [7]:
from gensim.models.callbacks import CallbackAny2Vec
# By default gensim does not report the loss after each epoch during training,
# so we override the callback hook to print it
class callback(CallbackAny2Vec):
    """
    Callback to print loss after each epoch
    """
    def __init__(self):
        self.epoch = 0

    def on_epoch_end(self, model):
        # get_latest_training_loss() returns a running total, so the loss of
        # a single epoch is the difference with the previous total
        loss = model.get_latest_training_loss()
        if self.epoch == 0:
            print('Loss after epoch {}: {}'.format(self.epoch, loss))
        else:
            print('Loss after epoch {}: {}'.format(self.epoch, loss - self.loss_previous_step))
        self.epoch += 1
        self.loss_previous_step = loss
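
The subtraction inside the callback is easier to see in isolation. The sketch below assumes, as the callback does, that the reported loss is a running total over the whole `train()` call. The helper name `per_epoch_losses` and the numbers are made up for illustration.

```python
def per_epoch_losses(cumulative_losses):
    """Convert running-total losses into the loss added by each epoch."""
    deltas = []
    previous = 0.0
    for total in cumulative_losses:
        deltas.append(total - previous)
        previous = total
    return deltas

# Hypothetical running totals after epochs 0, 1 and 2
print(per_epoch_losses([100.0, 180.0, 240.0]))  # [100.0, 80.0, 60.0]
```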

In [8]:
# Create the model that generates the vectors
# In this case we use the skip-gram architecture
w2v_model_chandler = Word2Vec(min_count=5,    # minimum word frequency to be included in the vocabulary
                     window=2,       # number of words before and after the target word
                     size=300,       # vector dimensionality (renamed to vector_size in gensim 4)
                     negative=20,    # number of negative samples (0 disables negative sampling)
                     workers=1,      # increase this if more cores are available
                     sg=1)           # architecture, 0: CBOW, 1: skip-gram
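
To get a feel for what `window=2` means with skip-gram, the sketch below lists the (target, context) pairs that a fixed window of 2 produces for one sentence. It is only an illustration of the pairing idea: gensim itself shrinks the window at random for each target and adds negative samples, and the helper name `skipgram_pairs` is made up.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs using a fixed symmetric window."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = ["sounds", "like", "a", "date", "to", "me"]
for pair in skipgram_pairs(sentence)[:5]:
    print(pair)
```

With such a narrow window, each word is only trained against its immediate neighbors, which suits short lines of dialogue.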

In [9]:
# Build the vocabulary from the tokens
w2v_model_chandler.build_vocab(sentence_tokens)

In [10]:
# Number of rows/docs found in the corpus
print("Number of docs in the corpus:", w2v_model_chandler.corpus_count)

Number of docs in the corpus: 8410


In [11]:
# Number of distinct words kept in the vocabulary (those appearing at least min_count times)
print("Number of distinct words in the corpus:", len(w2v_model_chandler.wv.vocab))

Number of distinct words in the corpus: 1387
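
The vocabulary is much smaller than the number of distinct tokens in the lines because of `min_count=5`. A minimal sketch of that pruning, using a made-up toy corpus and a hypothetical helper `prune_vocab` (gensim does this internally in `build_vocab`):

```python
from collections import Counter

def prune_vocab(tokenized_sentences, min_count=5):
    """Keep only the words that appear at least min_count times."""
    counts = Counter(word for sentence in tokenized_sentences for word in sentence)
    return {word for word, freq in counts.items() if freq >= min_count}

# Toy corpus (made up for illustration)
toy = [["could", "this", "be"], ["could", "be"], ["be", "nice"]]
print(sorted(prune_vocab(toy, min_count=2)))  # ['be', 'could']
```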


### 3 - Train the model

In [12]:
# Train the model that generates the vectors
# We use our callback to print the loss
w2v_model_chandler.train(sentence_tokens,
                 total_examples=w2v_model_chandler.corpus_count,
                 epochs=200,
                 compute_loss = True,
                 callbacks=[callback()]
                 )

Loss after epoch 0: 678863.4375
Loss after epoch 1: 511033.5625
Loss after epoch 2: 462405.875
Loss after epoch 3: 454317.625
Loss after epoch 4: 392148.5
Loss after epoch 5: 393440.25
Loss after epoch 6: 389980.0
Loss after epoch 7: 387951.25
Loss after epoch 8: 384967.0
Loss after epoch 9: 367235.0
Loss after epoch 10: 354563.5
Loss after epoch 11: 354876.5
Loss after epoch 12: 350866.5
Loss after epoch 13: 350095.5
Loss after epoch 14: 347371.0
Loss after epoch 15: 345511.0
Loss after epoch 16: 342778.5
Loss after epoch 17: 341133.5
Loss after epoch 18: 341465.0
Loss after epoch 19: 336949.5
Loss after epoch 20: 338422.0
Loss after epoch 21: 326763.0
Loss after epoch 22: 315408.0
Loss after epoch 23: 314601.0
Loss after epoch 24: 314962.0
Loss after epoch 25: 310780.0
Loss after epoch 26: 311309.0
Loss after epoch 27: 311249.0
Loss after epoch 28: 309831.0
Loss after epoch 29: 307484.0
Loss after epoch 30: 307929.0
Loss after epoch 31: 307145.0
Loss after epoch 32: 304429.0
Loss aft

(10715062, 17262600)
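
The tuple printed at the end is the return value of `train()`. As far as I know, gensim reports it as (effective words trained, raw words seen), both summed over all epochs, where "effective" excludes out-of-vocabulary words and frequent words dropped by downsampling. Under that assumption, a quick back-of-the-envelope check looks like this:

```python
# Values copied from the training output above
effective_words, raw_words = 10715062, 17262600
epochs = 200

tokens_per_epoch = raw_words / epochs
kept_fraction = effective_words / raw_words

print(f"tokens seen per epoch: {tokens_per_epoch:.0f}")
print(f"fraction actually used for training: {kept_fraction:.2f}")
```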

### 4 - Try it out

In [13]:
# Words MOST related to...:
w2v_model_chandler.wv.most_similar(positive=["work"], topn=10)

[('ehh', 0.332955926656723),
 ('dinner', 0.31776440143585205),
 ('bowl', 0.2836850881576538),
 ('catch', 0.2779606282711029),
 ('steal', 0.27760791778564453),
 ('screaming', 0.26454854011535645),
 ('helen', 0.26332029700279236),
 ('ugly', 0.2630704343318939),
 ('control', 0.26277899742126465),
 ('10', 0.26168137788772583)]

In [14]:
# Words LEAST related to...:
w2v_model_chandler.wv.most_similar(negative=["love"], topn=10)

[('lights', 0.04228873550891876),
 ('nobody', 0.041226353496313095),
 ('plane', 0.039347097277641296),
 ('special', 0.039189413189888),
 ('straight', 0.039126597344875336),
 ('plan', 0.028084486722946167),
 ('hug', 0.02706683985888958),
 ('wonder', 0.026550067588686943),
 ('picture', 0.025340469554066658),
 ('stage', 0.025141429156064987)]
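
The scores returned by `most_similar` are cosine similarities. When a word is passed as `negative`, the query is the negated vector, so every score is simply the sign-flipped cosine with the original word. The sketch below shows that relationship with made-up 3-dimensional vectors (the real ones have 300 dimensions).

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors, only for illustration
love = [1.0, 2.0, 0.5]
other = [0.8, 1.5, 1.0]
negated_love = [-x for x in love]

print(cosine(love, other))          # similarity to "love"
print(cosine(negated_love, other))  # same magnitude, opposite sign
```

Read this way, the small positive scores in the "least related" list mean those words are close to orthogonal to "love" rather than strongly opposed to it.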

In [15]:
# Words MOST related to...:
w2v_model_chandler.wv.most_similar(positive=["money"], topn=10)

[('keys', 0.34803348779678345),
 ('borrow', 0.3285517692565918),
 ('days', 0.32529890537261963),
 ('foosball', 0.306495726108551),
 ('hurt', 0.295035719871521),
 ('shorts', 0.2921173572540283),
 ('birds', 0.2906684875488281),
 ('thoughts', 0.28386127948760986),
 ('known', 0.28331488370895386),
 ('mike', 0.2814350128173828)]

In [16]:
# Words MOST related to...:
w2v_model_chandler.wv.most_similar(positive=["duck"], topn=5)

[('glasses', 0.37916386127471924),
 ('stripper', 0.35650432109832764),
 ('watched', 0.3530530333518982),
 ('rabbit', 0.3492811918258667),
 ('college', 0.3489832580089569)]

### 5 - Visualize vector clusters

In [17]:
from sklearn.manifold import TSNE
import numpy as np

def reduce_dimensions(model):
    """Project the word vectors to 2D with t-SNE and return x, y and labels."""
    num_dimensions = 2

    vectors = np.asarray(model.wv.vectors)
    labels = np.asarray(model.wv.index2word)  # index_to_key in gensim 4

    tsne = TSNE(n_components=num_dimensions, random_state=0)
    vectors = tsne.fit_transform(vectors)

    x_vals = [v[0] for v in vectors]
    y_vals = [v[1] for v in vectors]
    return x_vals, y_vals, labels

# 2- Monica lines

### Data
We will use Monica's lines of dialogue as the dataset, one line per row.

In [18]:
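# Build the dataset, using line breaks to separate the sentences/docs.
# Note: sep='/n' is a literal two-character string, not a newline. It is not expected
# to appear in the text, so each line is kept whole (multi-char separators need engine='python').
df_monica = pd.read_csv('/content/drive/MyDrive/NLP/monica_lines.txt', sep='/n', header=None, engine='python', encoding='latin1')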
df_monica.head()

Unnamed: 0,0
0,there's nothing to tell! he's just some guy i ...
1,"okay, everybody relax. this is not even a date..."
2,and they weren't looking at you before?!
3,"are you okay, sweetie?"
4,carol moved her stuff out today.


In [19]:
print("Number of documents:", df_monica.shape[0])

Number of documents: 8383


### 1 - Preprocessing

In [20]:
from keras.preprocessing.text import text_to_word_sequence

sentence_tokens = []
# Go through every row and turn each sentence into
# a sequence of words (this could also be done with NLTK or spaCy)
for _, row in df_monica.iterrows():
    sentence_tokens.append(text_to_word_sequence(row[0]))

In [21]:
# Let's take a look
sentence_tokens[:2]

[["there's",
  'nothing',
  'to',
  'tell',
  "he's",
  'just',
  'some',
  'guy',
  'i',
  'work',
  'with'],
 ['okay',
  'everybody',
  'relax',
  'this',
  'is',
  'not',
  'even',
  'a',
  'date',
  "it's",
  'just',
  'two',
  'people',
  'going',
  'out',
  'to',
  'dinner',
  'and',
  'not',
  'having',
  'sex']]

### 2 - Create the vectors (word2vec)

In [22]:
from gensim.models.callbacks import CallbackAny2Vec
# By default gensim does not report the loss after each epoch during training,
# so we override the callback hook to print it
class callback(CallbackAny2Vec):
    """
    Callback to print loss after each epoch
    """
    def __init__(self):
        self.epoch = 0

    def on_epoch_end(self, model):
        # get_latest_training_loss() returns a running total, so the loss of
        # a single epoch is the difference with the previous total
        loss = model.get_latest_training_loss()
        if self.epoch == 0:
            print('Loss after epoch {}: {}'.format(self.epoch, loss))
        else:
            print('Loss after epoch {}: {}'.format(self.epoch, loss - self.loss_previous_step))
        self.epoch += 1
        self.loss_previous_step = loss

In [23]:
# Create the model that generates the vectors
# In this case we use the skip-gram architecture
w2v_model_monica = Word2Vec(min_count=5,    # minimum word frequency to be included in the vocabulary
                     window=2,       # number of words before and after the target word
                     size=300,       # vector dimensionality (renamed to vector_size in gensim 4)
                     negative=20,    # number of negative samples (0 disables negative sampling)
                     workers=1,      # increase this if more cores are available
                     sg=1)           # architecture, 0: CBOW, 1: skip-gram

In [24]:
# Build the vocabulary from the tokens
w2v_model_monica.build_vocab(sentence_tokens)

In [25]:
# Number of rows/docs found in the corpus
print("Number of docs in the corpus:", w2v_model_monica.corpus_count)

Number of docs in the corpus: 8383


In [26]:
# Number of distinct words kept in the vocabulary (those appearing at least min_count times)
print("Number of distinct words in the corpus:", len(w2v_model_monica.wv.vocab))

Number of distinct words in the corpus: 1267


### 3 - Train the model

In [27]:
# Train the model that generates the vectors
# We use our callback to print the loss
w2v_model_monica.train(sentence_tokens,
                 total_examples=w2v_model_monica.corpus_count,
                 epochs=200,
                 compute_loss = True,
                 callbacks=[callback()]
                 )

Loss after epoch 0: 648995.8125
Loss after epoch 1: 497137.3125
Loss after epoch 2: 442592.375
Loss after epoch 3: 436279.25
Loss after epoch 4: 385107.0
Loss after epoch 5: 375773.5
Loss after epoch 6: 370131.25
Loss after epoch 7: 371749.75
Loss after epoch 8: 368518.5
Loss after epoch 9: 362432.25
Loss after epoch 10: 340832.5
Loss after epoch 11: 340091.0
Loss after epoch 12: 336866.0
Loss after epoch 13: 337625.0
Loss after epoch 14: 335365.5
Loss after epoch 15: 332665.0
Loss after epoch 16: 330636.5
Loss after epoch 17: 328696.5
Loss after epoch 18: 327106.5
Loss after epoch 19: 326918.0
Loss after epoch 20: 325221.5
Loss after epoch 21: 324405.0
Loss after epoch 22: 311759.0
Loss after epoch 23: 302875.0
Loss after epoch 24: 303908.0
Loss after epoch 25: 300556.0
Loss after epoch 26: 301433.0
Loss after epoch 27: 300424.0
Loss after epoch 28: 299931.0
Loss after epoch 29: 298948.0
Loss after epoch 30: 297471.0
Loss after epoch 31: 297955.0
Loss after epoch 32: 296128.0
Loss aft

(10339145, 16586600)

### 4 - Try it out

In [28]:
# Words MOST related to...:
w2v_model_monica.wv.most_similar(positive=["work"], topn=10)

[('clear', 0.31303489208221436),
 ('hospital', 0.30162501335144043),
 ('geoffrey', 0.3004313111305237),
 ('pizza', 0.29335469007492065),
 ('bank', 0.2863214910030365),
 ('van', 0.28570854663848877),
 ('nope', 0.28195327520370483),
 ('dinner', 0.2776287794113159),
 ('die', 0.2746597230434418),
 ('business', 0.2741873562335968)]

In [29]:
# Words LEAST related to...:
w2v_model_monica.wv.most_similar(negative=["love"], topn=10)

[('apparently', 0.04466196149587631),
 ('took', 0.04278826713562012),
 ('ago', 0.04198911413550377),
 ('does', 0.041572995483875275),
 ("we've", 0.04048974812030792),
 ('point', 0.035859186202287674),
 ('hundred', 0.025472812354564667),
 ('breasts', 0.023149803280830383),
 ('christmas', 0.020376067608594894),
 ('first', 0.019811401143670082)]

In [30]:
# Words MOST related to...:
w2v_model_monica.wv.most_similar(positive=["money"], topn=10)

[('porn', 0.3444075882434845),
 ('soul', 0.33752816915512085),
 ('cookies', 0.3362638056278229),
 ('memories', 0.3276340663433075),
 ('lives', 0.32325267791748047),
 ('decision', 0.31753724813461304),
 ('van', 0.31297868490219116),
 ('children', 0.308643102645874),
 ('stole', 0.3045380115509033),
 ('candles', 0.3041505813598633)]

In [31]:
# Words MOST related to...:
w2v_model_monica.wv.most_similar(positive=["clean"], topn=5)

[('realized', 0.3678734302520752),
 ('opened', 0.3592709004878998),
 ('places', 0.33507323265075684),
 ('wondering', 0.33321452140808105),
 ('keeping', 0.33035358786582947)]
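
One simple way to put a number on how differently the two characters use a word is to compare the neighbor lists directly. The sketch below takes the top-10 neighbors of "work" from both outputs above and measures their overlap with the Jaccard index. This metric is my own choice for illustration and is not part of the original analysis.

```python
# Top-10 neighbors of "work", copied from the outputs above
chandler_work = {"ehh", "dinner", "bowl", "catch", "steal",
                 "screaming", "helen", "ugly", "control", "10"}
monica_work = {"clear", "hospital", "geoffrey", "pizza", "bank",
               "van", "nope", "dinner", "die", "business"}

shared = chandler_work & monica_work
jaccard = len(shared) / len(chandler_work | monica_work)

print("shared neighbors:", shared)
print("Jaccard overlap:", round(jaccard, 3))
```

Only "dinner" appears in both lists, which fits the idea that each corpus gives "work" a different neighborhood.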

### 5 - Visualize vector clusters

In [32]:
from sklearn.manifold import TSNE
import numpy as np

def reduce_dimensions(model):
    """Project the word vectors to 2D with t-SNE and return x, y and labels."""
    num_dimensions = 2

    vectors = np.asarray(model.wv.vectors)
    labels = np.asarray(model.wv.index2word)  # index_to_key in gensim 4

    tsne = TSNE(n_components=num_dimensions, random_state=0)
    vectors = tsne.fit_transform(vectors)

    x_vals = [v[0] for v in vectors]
    y_vals = [v[1] for v in vectors]
    return x_vals, y_vals, labels

# Plotting embeddings in 2D

# Monica embeddings

In [33]:
# Plot the embeddings in 2D
import plotly.graph_objects as go
import plotly.express as px

x_vals, y_vals, labels = reduce_dimensions(w2v_model_monica)

MAX_WORDS=200
fig = px.scatter(x=x_vals[:MAX_WORDS], y=y_vals[:MAX_WORDS], text=labels[:MAX_WORDS])
fig.show(renderer="colab") # renderer needed for plotly in Colab



# Chandler embeddings

In [34]:
# Plot the embeddings in 2D
import plotly.graph_objects as go
import plotly.express as px

x_vals, y_vals, labels = reduce_dimensions(w2v_model_chandler)

MAX_WORDS=200
fig = px.scatter(x=x_vals[:MAX_WORDS], y=y_vals[:MAX_WORDS], text=labels[:MAX_WORDS])
fig.show(renderer="colab") # renderer needed for plotly in Colab


The default initialization in TSNE will change from 'random' to 'pca' in 1.2.


The default learning rate in TSNE will change from 200.0 to 'auto' in 1.2.



# Quick takes

There are a few things to note from the experiments above.

1. Different corpora produce different embedding representations.

 Words do not cluster in the same way for both characters. For example, "work" for Monica is closely related to food and restaurant terms such as clear, pizza, dinner, and geoffrey (the maître d' at her restaurant). For Chandler, the same word is related to more random terms. That is probably because we, as viewers, never get to know what Chandler does for a living.

2. On both 2D diagrams, we mostly see a heavy clustering of stop words.
 
 Words such as "by", "put", "out", and "in" are used a lot throughout the series because of the casual tone of the dialogue.
 A needed improvement would be to discard such words to get deeper insight into the concepts and common terms each character uses.

3. Interestingly enough, gensim tends to clump terms together on the 2D diagram when more training epochs are used.
 
 The terms (their embedding representations, actually) are more evenly distributed when fewer epochs are used, presumably because the model's weights are initialized uniformly at the beginning of the training process.

 Additionally, when we use more training epochs, the embeddings tend to better represent the context of their words, and the similar terms make more sense.
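
The stop-word improvement from point 2 could be sketched as a filtering step applied to the tokenized lines before building the vocabulary. The stop-word list here is a tiny, hand-picked assumption for illustration (NLTK and spaCy ship much more complete lists), and `drop_stop_words` is a hypothetical helper, not something used in the experiments above.

```python
# A tiny, hand-picked stop-word list (an assumption for illustration)
STOP_WORDS = {"a", "and", "by", "in", "out", "put", "so", "the", "to"}

def drop_stop_words(tokenized_sentences, stop_words=STOP_WORDS):
    """Remove stop words and discard sentences that end up empty."""
    cleaned = []
    for sentence in tokenized_sentences:
        kept = [word for word in sentence if word not in stop_words]
        if kept:
            cleaned.append(kept)
    return cleaned

sample = [["sounds", "like", "a", "date", "to", "me"], ["so", "a"]]
print(drop_stop_words(sample))  # [['sounds', 'like', 'date', 'me']]
```

The filtered list would then replace `sentence_tokens` in the `build_vocab` and `train` calls. Note that removing words also changes which words fall inside the `window=2` context, so the results are likely to shift.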


