<a href="https://colab.research.google.com/github/anyfish/Transformers/blob/main/%5BSpainAI_01%5DWord2vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Word2vec with Gensim

In this Jupyter notebook you will use the [Gensim](https://radimrehurek.com/gensim/index.html) library to experiment with word2vec. The notebook focuses on the intuition behind the concepts rather than on implementation details, and it is inspired by this [guide](https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html).

## 1. Installing Gensim and loading the model

In [None]:
!pip install --upgrade gensim

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting gensim
  Downloading gensim-4.2.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (24.1 MB)
Installing collected packages: gensim
  Attempting uninstall: gensim
    Found existing installation: gensim 3.6.0
    Uninstalling gensim-3.6.0:
      Successfully uninstalled gensim-3.6.0
Successfully installed gensim-4.2.0


In [None]:
import gensim.downloader as api

In [None]:
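# Load the pretrained word2vec vectors (300 dimensions, trained on Google News).
# The first call downloads the model, which is large, so it can take a while.
model = api.load('word2vec-google-news-300')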



## 2. Word similarity

In this section we will see how to compute the similarity between two words using a pretrained word embedding.

In [None]:
model.similarity("king", "queen")

0.6510957

In [None]:
model.similarity("king", "man")

0.22942673

In [None]:
model.similarity("king", "potato")

0.09978465

In [None]:
model.similarity("king", "king")

1.0

In [None]:
model.similarity("dog", "cat")

0.76094574

In [None]:
model.similarity("face", "bug")

0.03785802

In [None]:
model.similarity("spider", "bug")

0.47293577
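The scores above are cosine similarities: `similarity` compares the direction of the two word vectors, so a word compared with itself scores 1.0 and unrelated words score close to 0. The following is a minimal sketch of that computation in plain Python. The three-dimensional `toy_vectors` are invented for illustration and are not values from the Google News model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings, invented for illustration only.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.2],
    "potato": [0.1, 0.0, 0.9],
}

sim_king_queen = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_king_potato = cosine_similarity(toy_vectors["king"], toy_vectors["potato"])
sim_king_king = cosine_similarity(toy_vectors["king"], toy_vectors["king"])
```

With these toy values, "king" is much closer to "queen" than to "potato", which mirrors the ordering the real model produced above.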

Next, we will see how to find the words most similar to a given set of words.

In [None]:
model.most_similar(["king", "queen"], topn=5)

[('monarch', 0.7042067050933838),
 ('kings', 0.6780861616134644),
 ('princess', 0.6731551885604858),
 ('queens', 0.6679497957229614),
 ('prince', 0.6435247659683228)]

In [None]:
model.most_similar(["tomato", "carrot"], topn=5)

[('carrots', 0.7536594867706299),
 ('tomatoes', 0.7129638195037842),
 ('celery', 0.7025030851364136),
 ('broccoli', 0.6796350479125977),
 ('cherry_tomatoes', 0.662927508354187)]

In [None]:
model.most_similar(["TV", "actor"], topn=4)

[('television', 0.7352086901664734),
 ('actress', 0.6457569003105164),
 ('Baywatch_Nights', 0.6409478187561035),
 ('Everyone_Loves_Raymond', 0.6347001194953918)]

In [None]:
model.most_similar(["red", "pepper"])

[('pepper_thyme', 0.6538592576980591),
 ('pepper_cayenne', 0.6425741910934448),
 ('yellow', 0.6284185647964478),
 ('participant_LOGIN', 0.6212106347084045),
 ('pepper_garlic', 0.6171141862869263),
 ('pepper_strips', 0.6166490316390991),
 ('yellow_bell_peppers', 0.600123941898346),
 ('brown', 0.5945205092430115),
 ('chilies_garlic', 0.594168484210968),
 ('pepper_onion', 0.5797962546348572)]
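When `most_similar` receives several words, Gensim (as far as its source code suggests) averages the unit-normalized query vectors and ranks the rest of the vocabulary by cosine similarity to that average, leaving the query words out of the results. The sketch below imitates that behavior under those assumptions. `most_similar_sketch` and the two-dimensional `toy_vectors` are hypothetical names and values, not part of Gensim.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def most_similar_sketch(vectors, positive, topn=3):
    """Rank words by cosine similarity to the mean of the query vectors."""
    query_units = [unit(vectors[w]) for w in positive]
    # Average the unit vectors component by component, then renormalize.
    mean = unit([sum(col) / len(query_units) for col in zip(*query_units)])
    scores = []
    for word, vec in vectors.items():
        if word in positive:  # query words are excluded from the results
            continue
        scores.append((word, sum(a * b for a, b in zip(unit(vec), mean))))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]

# Hypothetical 2-dimensional embeddings, invented for illustration only.
toy_vectors = {
    "tomato": [0.9, 0.1],
    "carrot": [0.8, 0.3],
    "celery": [0.85, 0.2],
    "actor": [0.1, 0.9],
}

ranking = most_similar_sketch(toy_vectors, ["tomato", "carrot"], topn=2)
```

Because the query is an average, a word can rank highly by being moderately close to every query word, which may explain mixed results such as the "red" and "pepper" query above.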

You can also do more interesting things, such as finding the word that does not belong in a list.

In [None]:
model.doesnt_match(["summer", "fall", "spring", "air"])

'air'

In [None]:
model.doesnt_match(["water", "fire", "ground", "steel", "fairy", "bug", "cosmic", "grass", "ice", "dragon", "rock"])

'steel'
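A plausible reading of `doesnt_match`, based on Gensim's source, is that it averages the unit-normalized vectors of the listed words and returns the word whose vector is least aligned with that average. The sketch below follows that idea. `doesnt_match_sketch` and `toy_vectors` are hypothetical and exist only to make the example runnable without the full model.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def doesnt_match_sketch(vectors, words):
    """Return the word least aligned with the mean direction of the list."""
    units = {w: unit(vectors[w]) for w in words}
    mean = unit([sum(col) / len(words) for col in zip(*units.values())])
    # The lowest dot product with the mean marks the least central word.
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

# Hypothetical 2-dimensional embeddings, invented for illustration only.
toy_vectors = {
    "summer": [0.9, 0.1],
    "fall": [0.8, 0.2],
    "spring": [0.85, 0.15],
    "air": [0.1, 0.9],
}

odd_one = doesnt_match_sketch(toy_vectors, ["summer", "fall", "spring", "air"])
```

Under this reading the method always returns some word, even when every word in the list fits, so the answer for a loosely related list is only a weak signal.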

## Exercises

**1. Use the word2vec model to rank the following 15 words by their similarity to the words "man" and "woman". For each pair, print its similarity.**

In [None]:
words = [
"wife",
"husband",
"child",
"queen",
"king",
"man",
"woman",
"birth",
"doctor",
"nurse",
"teacher",
"professor",
"engineer",
"scientist",
"president"]

In [None]:
# Similarity of each word to "man", sorted from most to least similar
man_words = {}
for word in words:
    man_words[word] = model.similarity("man", word)

sorted(man_words.items(), key=lambda x: x[1], reverse=True)

[('man', 1.0),
 ('woman', 0.76640123),
 ('husband', 0.34499747),
 ('wife', 0.32920915),
 ('child', 0.31633338),
 ('doctor', 0.31448963),
 ('nurse', 0.2547229),
 ('teacher', 0.25000125),
 ('king', 0.22942673),
 ('queen', 0.16658202),
 ('scientist', 0.15824963),
 ('engineer', 0.15128928),
 ('birth', 0.11078789),
 ('professor', 0.09415862),
 ('president', 0.028424604)]

In [None]:
# Similarity of each word to "woman", sorted from most to least similar
woman_words = {}
for word in words:
    woman_words[word] = model.similarity("woman", word)

sorted(woman_words.items(), key=lambda x: x[1], reverse=True)

[('woman', 1.0),
 ('man', 0.76640123),
 ('husband', 0.49281383),
 ('child', 0.47500372),
 ('wife', 0.444824),
 ('nurse', 0.44135594),
 ('doctor', 0.37945858),
 ('queen', 0.31618136),
 ('teacher', 0.31357846),
 ('birth', 0.21471293),
 ('scientist', 0.15486898),
 ('professor', 0.13077852),
 ('king', 0.12847973),
 ('engineer', 0.09435377),
 ('president', 0.062676705)]
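The two ranking cells above differ only in the target word, so the loop can be factored into one helper. This is a sketch rather than part of the original exercise: `rank_by_similarity` is a hypothetical name, and `shared_letters` is a stand-in similarity function so that the example runs without loading the full model.

```python
def rank_by_similarity(similarity, target, words):
    """Return (word, score) pairs sorted from most to least similar to target.

    `similarity` is any callable that takes two words and returns a number,
    for example `model.similarity` from the cells above.
    """
    scores = {word: similarity(target, word) for word in words}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

def shared_letters(a, b):
    """Hypothetical stand-in: fraction of distinct letters the words share."""
    return len(set(a) & set(b)) / len(set(a) | set(b))

ranking = rank_by_similarity(shared_letters, "man", ["king", "woman", "man"])
```

With the real model, the same helper would be called as `rank_by_similarity(model.similarity, "man", words)` and again with `"woman"` as the target.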

**2. Complete the following analogies on your own (without using the model)**

a. king is to throne as judge is to chair

b. giant is to dwarf as genius is to fool

c. French is to France as Spaniard is to Spain

d. bad is to good as sad is to happy

e. nurse is to hospital as teacher is to school

f. universe is to planet as house is to room

**3. Now complete the analogies using a word2vec model**

Here is an example of how to do it. You can solve analogies of the form "A is to B as C is to _" by computing B - A + C: pass B and C as `positive` and A as `negative`.

In [None]:
# man is to woman as king is to ___?
model.most_similar(positive=["king", "woman"], negative=["man"], topn=1)

[('queen', 0.7118193507194519)]

In [None]:
# USA is to burger as Mexico is to ___?
model.most_similar(positive=["Mexico", "burger"], negative=["USA"], topn=1)

[('taco', 0.6266060471534729)]
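Both example queries place B and C in `positive` and A in `negative`, which corresponds to the vector offset B - A + C. The sketch below reproduces that arithmetic on hand-made vectors. It is a simplification: it uses raw vectors, whereas Gensim appears to normalize them first, and `analogy_sketch` and the two-dimensional `toy_vectors` (axes loosely meaning royalty and gender) are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy_sketch(vectors, a, b, c):
    """Solve 'a is to b as c is to ?' with the offset b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded, as in the real most_similar results.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hypothetical embeddings: first axis ~ royalty, second axis ~ gender.
toy_vectors = {
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "prince": [0.8, 1.0],
}

answer = analogy_sketch(toy_vectors, "man", "woman", "king")
```

Here woman - man + king lands exactly on the toy vector for "queen". Real embeddings are noisier, so the nearest word is only the most plausible candidate.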

In [None]:
# a. king is to throne as judge is to ___?
# Query: throne - king + judge
model.most_similar(positive=["throne", "judge"], negative=["king"], topn=1)

In [None]:
# b. giant is to dwarf as genius is to ___?
# Query: dwarf - giant + genius
model.most_similar(positive=["dwarf", "genius"], negative=["giant"], topn=1)

In [None]:
# c. Rearranged as: France is to French as Spain is to ___?
model.most_similar(positive=["French", "Spain"], negative=["France"], topn=1)

[('Spanish', 0.8269087076187134)]

In [None]:
# d. Rearranged as: good is to bad as happy is to ___?
model.most_similar(positive=["bad", "happy"], negative=["good"], topn=1)

[('unhappy', 0.5959899425506592)]

In [None]:
# e. nurse is to hospital as teacher is to ___?
# Query: hospital - nurse + teacher
model.most_similar(positive=["hospital", "teacher"], negative=["nurse"], topn=1)

In [None]:
# f. universe is to planet as house is to ___?
# Query: planet - universe + house
model.most_similar(positive=["planet", "house"], negative=["universe"], topn=1)
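It is easy to swap the words when translating "A is to B as C is to _" into `positive` and `negative` lists. One way to reduce that risk is a small helper that builds the arguments from the analogy in reading order. `analogy_args` is a hypothetical helper, not a Gensim function, and it assumes the B - A + C convention used in the "king" and "queen" example.

```python
def analogy_args(a, b, c):
    """Keyword arguments for 'a is to b as c is to ?', i.e. b - a + c."""
    return {"positive": [b, c], "negative": [a]}

args = analogy_args("man", "woman", "king")

# With the model loaded earlier in this notebook, the query would be:
# model.most_similar(**args, topn=1)
```

Writing the words in the order they are read keeps each query aligned with its comment.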