<a href="https://colab.research.google.com/github/fcarrillo051/SomosNLP/blob/main/1_word_embeddings/word2vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Word2vec with Gensim

In this Jupyter notebook you will use the [Gensim](https://radimrehurek.com/gensim/index.html) library to experiment with word2vec. The notebook focuses on the intuition behind the concepts rather than on implementation details, and it is inspired by this [guide](https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html).

## 1. Installing Gensim and loading the model

In [2]:
!pip install --upgrade gensim



In [3]:
import gensim.downloader as api

In [4]:
model = api.load('word2vec-google-news-300')



## 2. Word similarity

In this section we will see how to compute the similarity between two words using a pretrained word embedding.

In [5]:
model.similarity("king", "queen")

0.6510957

In [6]:
model.similarity("king", "man")

0.22942673

In [7]:
model.similarity("king", "potato")

0.09978465

In [8]:
model.similarity("king", "king")

1.0
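The scores returned by `model.similarity` are cosine similarities between the two word vectors. As a minimal sketch of that computation, the block below uses made-up 3-dimensional toy vectors (an assumption for illustration, not the model's real 300-dimensional embeddings) and a hypothetical helper named `cosine_similarity`.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors, for illustration only (not taken from the model).
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.2],
    "potato": [0.1, 0.0, 0.9],
}

sim_king_queen = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_king_potato = cosine_similarity(toy_vectors["king"], toy_vectors["potato"])
sim_king_king = cosine_similarity(toy_vectors["king"], toy_vectors["king"])

print(sim_king_queen, sim_king_potato, sim_king_king)
```

As with the real model, a word compared with itself scores 1.0 (up to floating-point rounding), and vectors pointing in similar directions score higher than unrelated ones.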

Next we will see how to find the words most similar to a given set of words.

In [9]:
model.most_similar(["king", "queen"], topn=5)

[('monarch', 0.7042067050933838),
 ('kings', 0.6780861616134644),
 ('princess', 0.6731551885604858),
 ('queens', 0.6679497957229614),
 ('prince', 0.6435247659683228)]

In [10]:
model.most_similar(["tomato", "carrot"], topn=5)

[('carrots', 0.7536594867706299),
 ('tomatoes', 0.7129638195037842),
 ('celery', 0.7025030851364136),
 ('broccoli', 0.6796350479125977),
 ('cherry_tomatoes', 0.662927508354187)]
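Conceptually, passing several words to `most_similar` ranks the vocabulary by cosine similarity to an average of the input vectors, leaving the input words themselves out of the results. The sketch below illustrates that idea with made-up toy vectors and a hypothetical function `toy_most_similar`. It is an approximation of the idea, not Gensim's actual implementation.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def toy_most_similar(vectors, positive, topn=5):
    """Rank words by cosine similarity to the mean of the `positive` vectors."""
    units = {word: unit(vec) for word, vec in vectors.items()}
    dim = len(next(iter(units.values())))
    mean = unit([sum(units[w][i] for w in positive) / len(positive) for i in range(dim)])
    # Exclude the query words themselves from the ranking.
    scores = [(w, dot(units[w], mean)) for w in units if w not in positive]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up toy vectors, for illustration only (not taken from the model).
toy_vectors = {
    "tomato": [0.9, 0.1, 0.0],
    "carrot": [0.8, 0.3, 0.0],
    "celery": [0.7, 0.4, 0.1],
    "king": [0.0, 0.2, 0.9],
}

neighbors = toy_most_similar(toy_vectors, ["tomato", "carrot"], topn=2)
print(neighbors)
```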

You can also do other interesting things, such as finding which word does not belong in a list.

In [11]:
model.doesnt_match(["summer", "fall", "spring", "air"])

'air'
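One way to think about `doesnt_match` is that it picks the word whose vector is least similar to the average of all the vectors in the list. The sketch below shows that idea with made-up 2-dimensional toy vectors and a hypothetical function `toy_doesnt_match`. Gensim's own implementation may differ in its details.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def toy_doesnt_match(vectors, words):
    """Return the word least similar (by cosine) to the mean of all the words."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(vectors[words[0]])
    mean = unit([sum(units[w][i] for w in words) / len(words) for i in range(dim)])
    return min(words, key=lambda w: dot(units[w], mean))

# Made-up toy vectors, for illustration only (not taken from the model).
toy_vectors = {
    "summer": [0.9, 0.1],
    "fall": [0.8, 0.2],
    "spring": [0.85, 0.15],
    "air": [0.1, 0.9],
}

odd_one = toy_doesnt_match(toy_vectors, ["summer", "fall", "spring", "air"])
print(odd_one)
```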

## Exercises

1. Use the word2vec model to rank the following 15 words by their similarity to the words "man" and "woman". For each pair, print its similarity.

In [17]:
words = [
"wife",
"husband",
"child",
"queen",
"king",
"man",
"woman",
"birth",
"doctor",
"nurse",
"teacher",
"professor",
"engineer",
"scientist",
"president"]

def rank_words(words, model, target_words):
  """Print every word-target similarity and return the words sorted by mean similarity to the targets."""
  similarities = {}
  for word in words:
    for target in target_words:
      similarity = model.similarity(word, target)
      similarities[(word, target)] = similarity
      print(f"Similarity between {word} and {target}: {similarity}")

  def mean_similarity(word):
    # Average similarity of `word` across all target words.
    return sum(similarities[(word, target)] for target in target_words) / len(target_words)

  return sorted(words, key=mean_similarity, reverse=True)


target_words = ["man", "woman"]
ranking = rank_words(words, model, target_words)

# Print the words from most to least similar to the targets.
for word in ranking:
  print(word)

Similarity between wife and man: 0.3292091488838196
Similarity between wife and woman: 0.4448240101337433
Similarity between husband and man: 0.34499746561050415
Similarity between husband and woman: 0.4928138256072998
Similarity between child and man: 0.3163333833217621
Similarity between child and woman: 0.475003719329834
Similarity between queen and man: 0.16658201813697815
Similarity between queen and woman: 0.31618136167526245
Similarity between king and man: 0.22942672669887543
Similarity between king and woman: 0.1284797340631485
Similarity between man and man: 1.0
Similarity between man and woman: 0.7664012312889099
Similarity between woman and man: 0.7664012312889099
Similarity between woman and woman: 1.0
Similarity between birth and man: 0.11078789085149765
Similarity between birth and woman: 0.2147129327058792
Similarity between doctor and man: 0.3144896328449249
Similarity between doctor and woman: 0.37945857644081116
Similarity between nurse and man: 0.2547228932380676
...

**2. Complete the following analogies on your own (without using the model)**

a. king is to throne as judge is to _

b. giant is to dwarf as genius is to _

c. French is to France as Spaniard is to _

d. bad is to good as sad is to _

e. nurse is to hospital as teacher is to _

f. universe is to planet as house is to _

**3. Now complete the analogies using a word2vec model**

Here is an example of how to do it. You can solve analogies such as "A is to B as C is to _" by computing B + C - A: the words B and C go in `positive` and the word A goes in `negative`.

In [None]:
# man is to woman as king is to ___?
model.most_similar(positive=["king", "woman"], negative=["man"], topn=1)

In [None]:
# USA is to burger as Mexico is to ___?
model.most_similar(positive=["Mexico", "burger"], negative=["USA"], topn=1)
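The `positive` and `negative` arguments in the examples above amount to vector arithmetic: add the `positive` vectors, subtract the `negative` ones, and look for the nearest remaining word. The sketch below works through "man is to woman as king is to _" with made-up 2-dimensional toy vectors and a hypothetical function `toy_analogy`. Gensim's implementation differs in details such as vector normalization, so treat this as an illustration of the idea only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def toy_analogy(vectors, a, b, c):
    """Solve 'a is to b as c is to ?' by finding the word nearest to b + c - a."""
    target = [vb + vc - va for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded from the candidates.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Made-up toy vectors: the first axis loosely stands for "royalty" and the
# second for "gender". They are for illustration only, not from the model.
toy_vectors = {
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "potato": [-1.0, 0.1],
}

answer = toy_analogy(toy_vectors, "man", "woman", "king")
print(answer)
```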

In [19]:
# king is to throne as judge is to __?
model.most_similar(positive=["judge", "throne"], negative=["king"], topn=1)

[('appellate_court', 0.5845253467559814)]

In [22]:
# giant is to dwarf as genius is to _?
model.most_similar(positive=["genius","dwarf"], negative=["giant"], topn=5)

[('savant', 0.44152510166168213),
 ('brilliance', 0.4409405589103699),
 ('genious', 0.43906503915786743),
 ('prose_stylist', 0.4382379949092865),
 ('mathematical_genius', 0.43737608194351196)]

In [26]:
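# French is to France as Spaniard is to _?
# Note: the vocabulary is case-sensitive, so the lowercase "spaniard" and "french" used here
# are different tokens from "Spaniard" and "French", which may explain the noisy output below.
model.most_similar(positive=["spaniard", "France"], negative=["french"], topn=5)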

[('Stade_De', 0.5246504545211792),
 ('Christophe_Legout', 0.5201050639152527),
 ('Albert_Montañés', 0.5125465989112854),
 ('Adrien_Mattenet', 0.5125324726104736),
 ('Paul_Henri_Matthieu', 0.508332371711731)]

In [28]:
# bad is to good as sad is to _?
model.most_similar(positive=["sad","good"], negative=["bad"], topn=5)

[('wonderful', 0.6414927840232849),
 ('happy', 0.6154337525367737),
 ('great', 0.5803679823875427),
 ('nice', 0.5683972835540771),
 ('saddening', 0.5588892698287964)]