This notebook uses the gensim.downloader module to download and load a pre-trained word embedding model, 'glove-wiki-gigaword-100'. The model represents each word as a numerical vector that captures aspects of its meaning. After loading the model, the code retrieves the embedding (vector) for the word 'example' and prints its dimension, data type, and shape.

In [4]:
!pip install gensim

Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
Installing collected packages: gensim
Successfully installed gensim-4.4.0


For a word embedding, the following terms describe the numerical vector that represents the word:

**Dimension (100)**: This is the length of the vector. The 'glove-wiki-gigaword-100' model provides 100-dimensional vectors, so each word is represented by 100 numbers.

**Type (<class 'numpy.ndarray'>)**: This indicates that the embedding is a NumPy array. NumPy is a popular Python library for numerical computing, especially with arrays and matrices.

**Shape ((100,))**: This describes the structure of the array. (100,) means it is a one-dimensional array (a vector) with 100 elements. A 2D array (like a table) would instead have a shape of the form (rows, columns).
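
To make these three properties concrete without downloading the model, here is a minimal sketch that uses a placeholder NumPy array of zeros in place of a real embedding. The zeros and the three-row matrix are illustrative assumptions, not values from the GloVe model.

```python
import numpy as np

# Placeholder standing in for a 100-dimensional embedding (not real GloVe values).
vector = np.zeros(100, dtype=np.float32)

print("Dimension:", len(vector))  # number of elements in the vector
print("Type:", type(vector))      # the NumPy array class
print("Shape:", vector.shape)     # a one-element tuple for a 1D array

# Stacking three such vectors gives a 2D array whose shape is (rows, columns).
matrix = np.stack([vector, vector, vector])
print("Matrix shape:", matrix.shape)
```

The one-element tuple for the vector and the two-element tuple for the matrix are what distinguish a single embedding from a table of embeddings.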

In [13]:
import gensim.downloader as api

# Download (on first use) and load the pre-trained GloVe model.
# Other sizes are available under the names glove-wiki-gigaword-50, -200, and -300.
model = api.load("glove-wiki-gigaword-100")

# Example usage: get the embedding for a word
word = 'example'
embedding = model[word]

print(f"Embedding for '{word}':\n{embedding}")
print("Dimension:", len(embedding))
print("Type:", type(embedding))
print("Shape:", embedding.shape)

Embedding for 'example':
[-0.12617    0.61724    0.22581    0.39868    0.16111    0.1523
 -0.14715   -0.29447   -0.27348   -0.13753   -0.20898   -0.73436
  0.14144    0.15048    0.09179    0.018613   0.22539    0.15979
 -0.16935    0.42716    0.042284  -0.3477    -0.11413    0.12222
 -0.025027  -0.20805   -0.067264  -0.2956    -0.30807   -0.32903
  0.19059    0.77141   -0.19332   -0.31069    0.26745    0.32231
  0.2065     0.10497    0.49425   -0.38322   -0.12802   -0.069906
 -0.14828    0.085369  -0.18141    0.14688    0.60968   -0.21131
 -0.29148   -0.52773    0.59508    0.017369   0.15342    0.81925
 -0.20643   -2.0378    -0.11884   -0.16826    1.5288     0.15756
 -0.4994     0.39305    0.12672   -0.10968    1.3671    -0.21006
  0.15684    0.0063801  0.43836   -0.18765   -0.29088    0.18619
  0.085402   0.13985    0.40794   -0.14811    0.26702   -0.19142
 -0.6189     0.0091217  0.34971   -0.24079   -0.52476   -0.25071
 -1.5681     0.22101    0.046796  -0.62616   -0.043358  -0.42865
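
Indexing the model with a word that is not in its vocabulary raises a KeyError, so it can help to check membership first. The sketch below shows one way to do that. `lookup_or_none` is a hypothetical helper, and a plain dict of made-up 3-dimensional vectors stands in for the loaded model so the example runs without a download; the KeyedVectors object returned by api.load supports the same `in` and `[]` operations.

```python
import numpy as np

def lookup_or_none(vectors, word):
    """Return the vector for `word`, or None if it is not in the vocabulary."""
    # Membership testing avoids the KeyError that plain indexing would raise.
    if word in vectors:
        return vectors[word]
    return None

# Toy stand-in for the loaded model: made-up values, not real GloVe vectors.
toy_vectors = {"example": np.array([0.1, 0.2, 0.3], dtype=np.float32)}

found = lookup_or_none(toy_vectors, "example")
missing = lookup_or_none(toy_vectors, "not-a-real-word")

print(found.shape)
print(missing)
```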


The code below evaluates the analogy 'king' - 'man' + 'woman' and looks for the words whose vectors are closest to the result. It then prints the top 5 most similar words along with their similarity scores.

model.most_similar(...): This is a method of your loaded word embedding model (the GloVe model in this case). Its purpose is to find words in the vocabulary that are most semantically similar to a target concept, which is derived from the positive and negative lists you provide.

positive=['king', 'woman']: This argument takes a list of words whose vectors you want to add together. In an analogy, these words contribute to the desired semantic direction. Here, you're starting with the concept of 'king' and adding the concept of 'woman'.

negative=['man']: This argument takes a list of words whose vectors you want to subtract. In the analogy 'king' - 'man' + 'woman', you are subtracting the 'man-ness' from 'king'. The combined effect is to shift 'king' along a 'gender' direction in the vector space, towards 'woman'.

topn=5: This argument asks for the 5 words most similar to the vector obtained by combining the positive and negative lists. Similarity is measured as cosine similarity to that combined vector, and the query words themselves ('king', 'woman', 'man') are excluded from the results.
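
The mechanics described above can be sketched with NumPy on a toy vocabulary. This is a simplified illustration and not gensim's actual code: `most_similar_sketch`, `unit`, and the 3-dimensional vectors in `toy` are all hypothetical and chosen only so that the example runs without the real model.

```python
import numpy as np

def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)

def most_similar_sketch(vectors, positive, negative, topn=5):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    # Normalise each input so that no word dominates through vector length.
    target = sum(unit(vectors[w]) for w in positive)
    target = target - sum(unit(vectors[w]) for w in negative)
    target = unit(target)

    inputs = set(positive) | set(negative)
    scores = []
    for word, vec in vectors.items():
        if word in inputs:  # the query words themselves are skipped
            continue
        # Dot product of two unit vectors equals their cosine similarity.
        scores.append((word, float(np.dot(unit(vec), target))))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]

# Made-up vectors for illustration only (not GloVe values).
toy = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.0, 0.2, 0.2]),
}

result = most_similar_sketch(toy, positive=["king", "woman"], negative=["man"], topn=2)
for word, similarity in result:
    print(f"  {word}: {similarity:.4f}")
```

With these toy values, 'queen' ranks first because its vector points in nearly the same direction as the combined query vector.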

Consider vector('king') - vector('man') + vector('woman') step by step:

vector('king'): This represents the concept of a male monarch.

Subtracting vector('man'): This removes, roughly, the 'maleness' and other attributes associated with 'man' from 'king'. What is left (conceptually) is the 'royalty' or 'leadership' aspect without the gender.

Adding vector('woman'): This adds the 'femaleness' and attributes associated with 'woman' to the remaining 'royalty/leadership' concept. The goal is to find a word that embodies 'royalty/leadership' plus 'woman-ness'.

So subtracting 'man' is what isolates the semantic component (here, gender) that is then replaced with another word's component.
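
The arithmetic can be worked through by hand with hypothetical 2-dimensional vectors in which the first axis stands for 'royalty' and the second for 'maleness'. Real GloVe vectors have 100 dimensions with no such clean interpretation, so this is only an idealised sketch of the idea.

```python
import numpy as np

# Hypothetical embeddings: axis 0 = "royalty", axis 1 = "maleness".
king = np.array([1.0, 1.0])
man = np.array([0.0, 1.0])
woman = np.array([0.0, -1.0])
queen = np.array([1.0, -1.0])

royalty_only = king - man      # removes the gender component, leaving royalty
result = royalty_only + woman  # adds the opposite gender component

print("king - man:", royalty_only)
print("king - man + woman:", result)
print("Equals queen:", np.array_equal(result, queen))
```

In this idealised setting the result lands exactly on 'queen'. With real embeddings the result is only close to 'queen', which is why the model returns a ranked list of nearest words.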

In [7]:
result = model.most_similar(positive=['king', 'woman'], negative=['man'], topn=5)
print("Analogy 'king' - 'man' + 'woman':")
for word, similarity in result:
    print(f"  {word}: {similarity:.4f}")

Analogy 'king' - 'man' + 'woman':
  queen: 0.7699
  monarch: 0.6843
  throne: 0.6756
  daughter: 0.6595
  princess: 0.6521


In [11]:
# Find the words most similar to 'woman'
result = model.most_similar(positive=['woman'], topn=5)

for word, similarity in result:
    print(f"  {word}: {similarity:.4f}")

  girl: 0.8473
  man: 0.8323
  mother: 0.8276
  boy: 0.7721
  she: 0.7632


In [9]:
similar_words = model.most_similar(positive=['king'], topn=10)
print("Words similar to 'king':")
for word, similarity in similar_words:
    print(f"  {word}: {similarity:.4f}")

Words similar to 'king':
  prince: 0.7682
  queen: 0.7508
  son: 0.7021
  brother: 0.6986
  monarch: 0.6978
  throne: 0.6920
  kingdom: 0.6811
  father: 0.6802
  emperor: 0.6713
  ii: 0.6676
