<a href="https://colab.research.google.com/github/cmtorresjimenez/curso_nlp/blob/main/Word2vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Word2vec with Gensim

In this Jupyter notebook you will use the [Gensim](https://radimrehurek.com/gensim/index.html) library to experiment with word2vec. The notebook focuses on the intuition behind the concepts rather than on implementation details, and it is inspired by this [guide](https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html).

## 1. Installing Gensim and loading the model

In [None]:
!pip install --upgrade gensim

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting gensim
  Downloading gensim-4.2.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (24.1 MB)
Installing collected packages: gensim
  Attempting uninstall: gensim
    Found existing installation: gensim 3.6.0
    Uninstalling gensim-3.6.0:
      Successfully uninstalled gensim-3.6.0
Successfully installed gensim-4.2.0


In [None]:
import gensim.downloader as api

In [None]:
model = api.load('word2vec-google-news-300')



## 2. Word similarity

In this section we will see how to compute the similarity between two words using a pretrained word embedding.

In [None]:
model.similarity("king", "queen")

0.6510957

In [None]:
model.similarity("king", "man")

0.22942673

In [None]:
model.similarity("king", "potato")

0.09978465

In [None]:
model.similarity("king", "king")

1.0
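The scores above are cosine similarities between word vectors, which is why a word compared with itself gives 1.0. The following is a minimal sketch of that computation in plain Python. The three-dimensional `toy_vectors` are invented for illustration and are not taken from the Google News model, whose vectors have 300 dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors, invented for this example only
toy_vectors = {
    "king":   [0.9, 0.8, 0.1],
    "queen":  [0.8, 0.9, 0.2],
    "potato": [0.1, 0.0, 0.9],
}

sim_king_queen = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_king_potato = cosine_similarity(toy_vectors["king"], toy_vectors["potato"])
sim_king_king = cosine_similarity(toy_vectors["king"], toy_vectors["king"])

print("king vs queen :", sim_king_queen)
print("king vs potato:", sim_king_potato)
print("king vs king  :", sim_king_king)
```

Because the measure depends only on the angle between the vectors, their lengths do not affect the score.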

Next, we will see how to find the words most similar to a given set of words.

In [None]:
model.most_similar(["king", "queen"], topn=5)

[('monarch', 0.7042067050933838),
 ('kings', 0.6780861616134644),
 ('princess', 0.6731551885604858),
 ('queens', 0.6679497957229614),
 ('prince', 0.6435247659683228)]

In [None]:
model.most_similar(["tomato", "carrot"], topn=5)

[('carrots', 0.7536594867706299),
 ('tomatoes', 0.7129638195037842),
 ('celery', 0.7025030851364136),
 ('broccoli', 0.6796350479125977),
 ('cherry_tomatoes', 0.662927508354187)]
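When several words are passed to `most_similar`, Gensim (to the best of our understanding) combines their unit-length vectors into a single mean vector and ranks the rest of the vocabulary by cosine similarity to it, leaving out the query words themselves. The sketch below imitates that idea in plain Python. `most_similar_sketch` and the `toy_vectors` are hypothetical names and values made up for this example, not part of Gensim.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    """Cosine similarity, computed as the dot product of unit vectors."""
    return sum(a * b for a, b in zip(unit(u), unit(v)))

def most_similar_sketch(vectors, positive, topn=5):
    """Rank words by cosine similarity to the mean of the query words."""
    normalized = [unit(vectors[w]) for w in positive]
    # Component-wise mean of the unit-length query vectors
    mean = [sum(col) / len(normalized) for col in zip(*normalized)]
    # Skip the query words, as Gensim does in its results
    candidates = [(w, cosine(mean, v)) for w, v in vectors.items()
                  if w not in positive]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical vectors, invented for this example only
toy_vectors = {
    "king":    [0.9, 0.8, 0.1],
    "queen":   [0.8, 0.9, 0.2],
    "monarch": [0.85, 0.85, 0.1],
    "tomato":  [0.1, 0.2, 0.9],
    "carrot":  [0.2, 0.1, 0.9],
}

royal_neighbors = most_similar_sketch(toy_vectors, ["king", "queen"], topn=2)
print(royal_neighbors)
```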

You can also do other interesting things, such as finding which word does not belong in a list.

In [None]:
model.doesnt_match(["summer", "fall", "spring", "air"])

'air'
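One way to picture what `doesnt_match` does: average the unit-length vectors of all the words in the list and return the word whose vector is least similar to that average. The sketch below follows this idea, which we believe is close to Gensim's approach. `doesnt_match_sketch` and the `toy_vectors` are hypothetical and exist only for illustration.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def doesnt_match_sketch(vectors, words):
    """Return the word least similar to the mean of the group."""
    normalized = {w: unit(vectors[w]) for w in words}
    # Unit-length mean of all the word vectors in the list
    mean = unit([sum(col) / len(words) for col in zip(*normalized.values())])
    # Both vectors have length 1, so the dot product is the cosine similarity
    def similarity_to_mean(word):
        return sum(a * b for a, b in zip(normalized[word], mean))
    return min(words, key=similarity_to_mean)

# Hypothetical vectors, invented for this example only
toy_vectors = {
    "summer": [0.9, 0.1, 0.1],
    "fall":   [0.8, 0.2, 0.1],
    "spring": [0.85, 0.15, 0.05],
    "air":    [0.1, 0.1, 0.9],
}

odd_one_out = doesnt_match_sketch(toy_vectors, ["summer", "fall", "spring", "air"])
print(odd_one_out)
```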

## Exercises

1. Use the word2vec model to rank the following 15 words by their similarity to the words "man" and "woman". For each pair, print its similarity.

In [None]:
# Words to rank against "man" and "woman"
words = [
    "wife", "husband", "child", "queen", "king",
    "man", "woman", "birth", "doctor", "nurse",
    "teacher", "professor", "engineer", "scientist", "president",
]

# Collect (reference, word, similarity) triples for each reference word
res1 = [("man", w, model.similarity("man", w)) for w in words]
res2 = [("woman", w, model.similarity("woman", w)) for w in words]

# Sort each ranking from most to least similar
res1.sort(key=lambda tup: tup[2], reverse=True)
res2.sort(key=lambda tup: tup[2], reverse=True)

# Print one "reference | word | similarity" row per pair
for row in res1 + res2:
    print(*row, sep=' | ')

man | man | 1.0
man | woman | 0.76640123
man | husband | 0.34499747
man | wife | 0.32920915
man | child | 0.31633338
man | doctor | 0.31448963
man | nurse | 0.2547229
man | teacher | 0.25000125
man | king | 0.22942673
man | queen | 0.16658202
man | scientist | 0.15824963
man | engineer | 0.15128928
man | birth | 0.11078789
man | professor | 0.09415862
man | president | 0.028424604
woman | woman | 1.0
woman | man | 0.76640123
woman | husband | 0.49281383
woman | child | 0.47500372
woman | wife | 0.444824
woman | nurse | 0.44135594
woman | doctor | 0.37945858
woman | queen | 0.31618136
woman | teacher | 0.31357846
woman | birth | 0.21471293
woman | scientist | 0.15486898
woman | professor | 0.13077852
woman | king | 0.12847973
woman | engineer | 0.09435377
woman | president | 0.062676705


**2. Complete the following analogies on your own (without using the model)**

a. king is to throne as judge is to _

b. giant is to dwarf as genius is to _

c. French is to France as Spaniard is to _

d. bad is to good as sad is to _

e. nurse is to hospital as teacher is to _

f. universe is to planet as house is to _

In [None]:
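# Note: this query computes king + judge - throne, which is not the analogy
# "king is to throne as judge is to _" (that would be throne + judge - king).
model.most_similar(positive=["king", "judge"], negative=["throne"], topn=5)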

[('appeals_court', 0.5385673642158508),
 ('Judge', 0.5238491296768188),
 ('magistrate', 0.49228599667549133),
 ('court', 0.48597487807273865),
 ('appellate_court', 0.47618812322616577)]

**3. Now complete the analogies using a word2vec model**

Here is an example of how to do it. You can solve analogies such as "A is to B as C is to _" by computing B - A + C: the words B and C go in `positive` and the word A goes in `negative`.

In [None]:
# man is to woman as king is to ___?
model.most_similar(positive=["king", "woman"], negative=["man"], topn=1)

[('queen', 0.7118193507194519)]

In [None]:
# USA is to burger as Mexico is to ___?
model.most_similar(positive=["Mexico", "burger"], negative=["USA"], topn=1)

[('taco', 0.6266060471534729)]
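The two queries above follow the same recipe: add the vectors of the `positive` words, subtract the vector of the `negative` word, and look for the nearest remaining word. Here is a minimal sketch of that vector arithmetic in plain Python, assuming the vectors are normalized to unit length before being combined. `analogy_sketch` and the `toy_vectors` are hypothetical and were chosen so that the expected answer is easy to see.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    """Cosine similarity, computed as the dot product of unit vectors."""
    return sum(p * q for p, q in zip(unit(u), unit(v)))

def analogy_sketch(vectors, a, b, c, topn=1):
    """Solve "a is to b as c is to _" by ranking words against b - a + c."""
    va, vb, vc = (unit(vectors[w]) for w in (a, b, c))
    target = [y - x + z for x, y, z in zip(va, vb, vc)]
    # The three input words are excluded from the candidates
    candidates = [(w, cosine(target, v)) for w, v in vectors.items()
                  if w not in (a, b, c)]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical vectors, invented for this example only
toy_vectors = {
    "man":    [1.0, 0.0, 0.0],
    "woman":  [1.0, 0.0, 1.0],
    "king":   [1.0, 1.0, 0.0],
    "queen":  [1.0, 1.0, 1.0],
    "prince": [1.0, 0.8, 0.0],
    "potato": [-1.0, 0.2, 0.1],
}

# man is to woman as king is to ___?
analogy_result = analogy_sketch(toy_vectors, "man", "woman", "king")
print(analogy_result)
```

Excluding the input words matters: without that filter, the nearest word to the target vector is often one of the query words itself.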

In [None]:
# Each triple (x, y, z) encodes the analogy "x is to y as z is to _"
dataset = [
    ("king", "throne", "judge"),
    ("giant", "dwarf", "genius"),
    ("French", "France", "Spaniard"),
    ("bad", "good", "sad"),
    ("nurse", "hospital", "teacher"),
    ("universe", "planet", "house"),
]

# Solve each analogy as y - x + z and show the five closest words
for x, y, z in dataset:
    print('(',x,'-',y,') -> (',z,'-', model.most_similar(positive=[z, y], negative=[x], topn=5))

( king - throne ) -> ( judge - [('appellate_court', 0.5845253467559814), ('appeals_court', 0.5540135502815247), ('Judge', 0.529381513595581), ('presiding_judge', 0.5287210941314697), ('court', 0.5261871814727783)]
( giant - dwarf ) -> ( genius - [('savant', 0.44152510166168213), ('brilliance', 0.4409405589103699), ('genious', 0.43906503915786743), ('prose_stylist', 0.4382379949092865), ('mathematical_genius', 0.43737608194351196)]
( French - France ) -> ( Spaniard - [('rider_Dani_Pedrosa', 0.5646752119064331), ('Northern_Irishman', 0.5608561635017395), ('Ulsterman', 0.5575432181358337), ('Alberto_Contador_Astana', 0.5544100999832153), ('Frenchman', 0.5451046228408813)]
( bad - good ) -> ( sad - [('wonderful', 0.6414927840232849), ('happy', 0.6154337525367737), ('great', 0.5803679823875427), ('nice', 0.5683972835540771), ('saddening', 0.5588892698287964)]
( nurse - hospital ) -> ( teacher - [('school', 0.60170978307724), ('elementary', 0.5366939902305603), ('School', 0.516487181186676),