# Word Embeddings in Python with gensim

Word2Vec is a group of related methods for learning word embeddings with shallow neural networks. Given a large text corpus, these techniques learn a vector representation for each word such that words sharing common contexts in the corpus lie close to each other in the vector space.

### Skip-Gram Model

Skip-gram is one of the two word2vec methods for building word embeddings (the other is continuous bag-of-words, CBOW). We want words that appear in the same contexts to have similar representations, so each training instance is a given word, and its label is a word that appears in its context.
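
To make the idea of training instances concrete, here is a minimal sketch of how (word, context word) pairs could be generated from one tokenized sentence. The function name `skipgram_pairs` and the fixed symmetric window are assumptions made for illustration; real implementations add details such as randomly shrinking the window and subsampling frequent words.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) training pairs for one tokenized sentence."""
    pairs = []
    for i, center in enumerate(tokens):
        # the context is up to `window` tokens on each side of the center word
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


pairs = skipgram_pairs(["life", "is", "a", "journey"], window=1)
print(pairs)
# [('life', 'is'), ('is', 'life'), ('is', 'a'), ('a', 'is'), ('a', 'journey'), ('journey', 'a')]
```

Each pair is one training example: the first element is the input word and the second is the label the network should predict.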

The model has the following architecture:

![Model architecture](https://miro.medium.com/max/3138/0*FTfdlZ7yDBoQ8c9W.png)

The weights of the dense layers will be updated during training, and we will use the weights of the first layer as word embeddings.
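
Why the first-layer weights can serve as embeddings: the input is a one-hot vector, so multiplying it by the first weight matrix simply selects one row of that matrix. The sketch below checks this with a toy vocabulary; the names `W_in`, `one_hot`, and `hidden_layer` are illustrative, and the weights are random rather than trained.

```python
import random

vocab = ["be", "learn", "live"]
word2idx = {w: i for i, w in enumerate(vocab)}
embedding_dim = 4

random.seed(0)
# First-layer weights: one row per vocabulary word (random here, learned in practice)
W_in = [[random.uniform(-0.5, 0.5) for _ in range(embedding_dim)] for _ in vocab]


def one_hot(word):
    """Vector of zeros with a single 1.0 at the word's index."""
    vec = [0.0] * len(vocab)
    vec[word2idx[word]] = 1.0
    return vec


def hidden_layer(word):
    """Multiply the one-hot input (1 x V) by W_in (V x d)."""
    x = one_hot(word)
    return [sum(x[i] * W_in[i][k] for i in range(len(vocab)))
            for k in range(embedding_dim)]


# The product is exactly the row of W_in that belongs to the word
print(hidden_layer("learn") == W_in[word2idx["learn"]])  # True
```

This is why libraries store the embeddings as a lookup table and never build the one-hot vectors explicitly.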


In [None]:
!pip install wget # to download data
!pip install spacy
!python -m spacy download en_core_web_sm

Collecting wget
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9655 sha256=7833e875e601a7926be2c26c703aa23efe5c41f09c5acf99c49aa9eaa33cbd79
  Stored in directory: /root/.cache/pip/wheels/8b/f1/7f/5c94f0a7a505ca1c81cd1d9208ae2064675d97582078e6c769
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2

## Load Libraries

We're using a new library called `gensim`.  It's a great library for modeling text and comes with pre-trained models that you can easily use in other contexts.

In [None]:
%matplotlib inline
import numpy as np
import gensim
import matplotlib.pyplot as plt
import seaborn as sns
import wget
import spacy
import scipy.stats

from tqdm import tqdm
from nltk.corpus import stopwords
import nltk
import re
from collections import defaultdict



In [None]:
from google.colab import drive
drive.mount('/content/drive')

root_folder = '/content/drive/My Drive/nlp_data/' # to save checkpoints

MessageError: ignored

### Simple preprocessing example

Preprocessing is a crucial step when using Word2Vec or other word embedding techniques. It prepares the text data in a way that enhances the quality of learned embeddings, reduces noise, and improves the efficiency of training and the overall performance of natural language processing tasks such as text classification, sentiment analysis, and document retrieval.

1. **Convert to lowercase:** This step changes all characters in the text to their lowercase counterparts, so that "The" and "the" are treated as the same token.

2. **Remove punctuation, symbols, and numbers (and, optionally, stopwords):** In this stage, you strip punctuation marks, symbols, and numbers from the text. Whether to also remove stopwords depends on your text processing goals. Stopwords are very common words (e.g., "the," "and," "in") that are often removed to focus on more meaningful content words.

3. **Normalize the words (lemmatize or stem the words):** Normalizing words typically uses one of two processes:

   - **Lemmatization:** This is the process of reducing words to their base or dictionary form, known as the lemma. For example, "running" becomes "run," "better" becomes "good," and "mice" becomes "mouse." Lemmatization helps reduce words to their core meaning.

   - **Stemming:** Stemming involves removing prefixes or suffixes from words to obtain their root form (stem). For instance, "jumping" becomes "jump," "swimming" becomes "swim," and "flies" (as a verb) becomes "fli." Stemming is a more aggressive reduction of words compared to lemmatization and may result in words that are not valid English words but can be useful for information retrieval tasks.

4. **Remove infrequent words:** Remove words that appear only once in the dataset.

Combining all these steps can help in cleaning and preparing text for various natural language processing (NLP) tasks, such as text classification, information retrieval, or sentiment analysis. The choice between lemmatization and stemming depends on the specific requirements of your NLP application.
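
As a rough illustration of steps 1, 2, and 4, here is a dependency-free sketch. It is a simplification: it skips lemmatization and stemming entirely, uses a tiny made-up stop word list (`TOY_STOP_WORDS`), and splits on non-letters with a regular expression instead of using a real tokenizer. The helper names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption; NLTK ships a much longer one)
TOY_STOP_WORDS = {"is", "the", "a", "of", "not"}


def simple_preprocess(sentence, stop_words):
    """Lowercase, keep runs of ASCII letters only, and drop stop words."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [t for t in tokens if t not in stop_words]


def drop_rare(texts, min_freq=2):
    """Remove tokens that occur fewer than `min_freq` times in the whole corpus."""
    counts = Counter(token for text in texts for token in text)
    return [[t for t in text if counts[t] >= min_freq] for text in texts]


toy_docs = [
    "Learning is a journey.",
    "The journey of learning!",
    "Life is not a destination.",
]
toy_texts = drop_rare([simple_preprocess(d, TOY_STOP_WORDS) for d in toy_docs])
print(toy_texts)
# [['learning', 'journey'], ['journey', 'learning'], []]
```

Note that the last document ends up empty: on a tiny corpus, frequency filtering removes most of the vocabulary.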

In [None]:
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
stop_words

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [None]:
nlp = spacy.load("en_core_web_sm")
regexp_alphbetic = re.compile('[^a-zA-Z]+')

def preprocess_text(sentence, stopwords, lemmatize=True):
  """Tokenize with spaCy, optionally lemmatize, lowercase, then filter tokens."""
  doc = nlp(sentence)
  sentence_tokens = []
  for token in doc:
    token_text = token.lemma_ if lemmatize else token.text
    token_text = token_text.lower()
    # skip stopwords and tokens containing any non-alphabetic character
    if token_text in stopwords or regexp_alphbetic.search(token_text):
      continue
    sentence_tokens.append(token_text)
  return sentence_tokens


In [None]:
documents = [
    "Perfect is the enemy of good.",
    "I'm still learning.",
    "Life is a journey, not a destination.",
    "Learning is not attained by chance, it must be sought for with ardor and attended to with diligence.",
    "Yesterday I was clever, so I changed the world. Today I am wise, so I am changing myself.",
    "Be curious, not judgmental.",
    "You don't have to be great to start, but you have to start to be great.",
    "Be stubborn about your goals and flexible about your methods.",
    "Nothing will work unless you do.",
    "Never give up on a dream just because of the time it will take to accomplish it. The time will pass anyway.",
    "Anyone who stops learning is old, whether at twenty or eighty.",
    "Tell me and I forget. Teach me and I remember. Involve me and I learn.",
    "Change is the end result of all true learning.",
    "Live as if you were to die tomorrow. Learn as if you were to live forever.",
    "A learning curve is essential to growth.",
]

# remove common words and tokenize
texts = [preprocess_text(sentence, stop_words, lemmatize=True) for sentence in documents]

## remove words that appear only once
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [[token for token in text if frequency[token] > 1]
         for text in texts]
texts

[['be', 'the', 'of'],
 ['i', 'be', 'learn'],
 ['be', 'a', 'not', 'a'],
 ['learning', 'be', 'not', 'it', 'be', 'with', 'and', 'to', 'with'],
 ['i', 'be', 'so', 'i', 'change', 'the', 'i', 'be', 'so', 'i', 'be', 'change'],
 ['be', 'not'],
 ['you',
  'do',
  'not',
  'have',
  'to',
  'be',
  'great',
  'to',
  'start',
  'you',
  'have',
  'to',
  'start',
  'to',
  'be',
  'great',
  'about',
  'your',
  'and',
  'about',
  'your'],
 ['will', 'you', 'do'],
 ['a', 'of', 'the', 'time', 'it', 'will', 'to', 'it', 'the', 'time', 'will'],
 ['learn', 'be'],
 ['i', 'and', 'i', 'i', 'and', 'i', 'i', 'and', 'i', 'learn'],
 ['change', 'be', 'the', 'of', 'learning'],
 ['live',
  'as',
  'if',
  'you',
  'be',
  'to',
  'learn',
  'as',
  'if',
  'you',
  'be',
  'to',
  'live'],
 ['a', 'learning', 'be', 'to']]

### Creating our Word2Vec Model

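`gensim` makes it easy to train a Word2Vec model: all training requires is passing in the tokenized corpus. Note that gensim trains the CBOW variant by default (`sg=0`); pass `sg=1` to train skip-gram instead.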

In [None]:
model = gensim.models.Word2Vec(texts, vector_size=10, window=2, min_count=1)
model

<gensim.models.word2vec.Word2Vec at 0x7bbe442080d0>

In [None]:
model.wv["live"]

array([-0.0560662 ,  0.01728904, -0.00880932,  0.06781744,  0.03992345,
        0.04525375,  0.0145028 , -0.02689059, -0.04391827, -0.01024627],
      dtype=float32)

We can also find the most similar words. Our dataset is far too small, though, so we should not expect anything meaningful.

In [None]:
model.wv.most_similar('live')

[('you', 0.46276724338531494),
 ('will', 0.43382689356803894),
 ('learning', 0.42391228675842285),
 ('start', 0.39573609828948975),
 ('not', 0.3559991419315338),
 ('of', 0.21979232132434845),
 ('as', 0.21175067126750946),
 ('about', 0.17295706272125244),
 ('have', 0.12536422908306122),
 ('i', 0.11380045861005783)]
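
The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below reimplements that ranking in plain Python on hand-made two-dimensional vectors; `toy_vectors` and the helper functions are illustrative and are not part of gensim.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=2):
    """Rank all other words by cosine similarity to `word`, highest first."""
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hand-made vectors, chosen so that "road" and "street" point in similar directions
toy_vectors = {
    "road": [1.0, 0.0],
    "street": [0.9, 0.1],
    "learn": [0.0, 1.0],
}
ranking = most_similar("road", toy_vectors)
print(ranking)
```

Because cosine similarity depends only on direction, scaling a vector by a positive constant does not change its similarity to other vectors.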

### Loading an existing corpus

We can load some existing text and train a model on it. In this case, we're going to use `wiki.10K.txt`, which is a small subset of Wikipedia.

In [None]:
# wiki_10k.txt for short runs

url_data = 'https://github.com/dbamman/anlp19/raw/master/data/wiki.10K.txt'
train_data_path = wget.download(url_data)




In [None]:
texts =  []

# we limit the number of documents because of time constraints
MAX_DOCUMENTS = 1000

with open(train_data_path) as fr:
  count = 0
  for line in tqdm(fr):
    texts.append(preprocess_text(line, stop_words, lemmatize=True))
    count += 1
    if count > MAX_DOCUMENTS:
      break

print(len(texts))

1000it [02:01,  8.23it/s]

1001





In [None]:

## remove words that appear only once
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [[token for token in text if frequency[token] > 1]
         for text in texts]

In [None]:
# Using small numbers here, probably want to use a bigger corpus, bigger dimensions, and more iterations.
model = gensim.models.Word2Vec(texts, vector_size=10, window=4, epochs=20, min_count=1)
model

<gensim.models.word2vec.Word2Vec at 0x7bbe490f2cb0>

We get slightly better results, but we really should be using a much bigger corpus.

In [None]:
model.wv.most_similar("road")

[('route', 0.9823903441429138),
 ('street', 0.9532665014266968),
 ('cross', 0.9493829011917114),
 ('omaha', 0.9481728076934814),
 ('spruce', 0.9461341500282288),
 ('junction', 0.9428761601448059),
 ('divisions', 0.9417228698730469),
 ('freeway', 0.9403142333030701),
 ('west', 0.9383276104927063),
 ('extend', 0.9378702640533447)]

In [None]:
model.wv['road']

array([ 0.25712892, -3.496962  ,  4.9211335 ,  0.5291255 , -1.7015144 ,
       -1.9528776 ,  1.0085207 ,  4.5755672 , -2.9167597 ,  0.04388363],
      dtype=float32)

### Loading a pre-trained model

We can also use gensim to automatically download and load a pre-trained model, or alternatively load one from disk. Here we load 50-dimensional GloVe vectors trained on Wikipedia and Gigaword. Since the pre-trained model was trained on much more data, its vectors encode more semantic meaning.

In [None]:
import gensim.downloader as api


In [None]:
model_pretrained = api.load("glove-wiki-gigaword-50")
model_pretrained

<gensim.models.keyedvectors.KeyedVectors at 0x7bbe490f2aa0>

In [None]:
model_pretrained.most_similar('road')

[('bridge', 0.8527506589889526),
 ('highway', 0.8253951072692871),
 ('route', 0.8184633255004883),
 ('lane', 0.8131610751152039),
 ('junction', 0.8032940030097961),
 ('roads', 0.7938767075538635),
 ('along', 0.779608428478241),
 ('west', 0.7775814533233643),
 ('intersection', 0.772043764591217),
 ('park', 0.7659005522727966)]

In [None]:
model_pretrained['road']

array([ 0.10042 ,  1.06    ,  0.24829 ,  0.014362, -0.783   , -0.12697 ,
       -0.85894 , -0.16042 ,  0.59427 , -1.069   , -1.2221  , -0.61181 ,
       -1.1446  , -1.3356  , -0.93968 ,  0.37353 ,  0.75405 ,  0.37777 ,
       -0.52882 ,  0.024955,  0.31032 ,  0.083344, -0.59232 ,  0.83623 ,
        0.65468 , -1.1154  ,  0.47597 ,  0.77803 ,  0.84934 , -0.82595 ,
        2.8725  , -0.1032  ,  0.25725 ,  0.074587,  0.95345 , -0.027788,
        0.60115 , -0.15205 , -0.50584 ,  0.58003 , -0.58731 , -0.72368 ,
       -0.057061, -0.28228 , -0.42823 ,  0.21001 ,  0.22496 , -1.2876  ,
        0.87487 , -0.6231  ], dtype=float32)

### SimLex999 evaluation

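As we saw in the slides, we can visualize the distance between words using t-SNE. Here we evaluate the embeddings quantitatively instead, using SimLex-999, a dataset of 999 word pairs with human similarity ratings on a scale from 0 to 10.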

In [None]:
simlex_data = wget.download("https://fh295.github.io/SimLex-999.zip")

Extract zip file

In [None]:
import zipfile

simple_simlex_path = "simlex999.txt"

simplex_pairs = dict()
# read the pairs straight from the zip archive
with zipfile.ZipFile(simlex_data, 'r') as zf:
  with zf.open('SimLex-999/SimLex-999.txt') as myfile:
    next(myfile)  # skip the header row
    for line in myfile:
      # the archive member is opened in binary mode, so decode each line first
      w1, w2, pos, score, *_ = line.decode('utf-8').strip().split()
      simplex_pairs[(w1, w2)] = float(score)

In [None]:
simplex_pairs

{('old', 'new'): 1.58,
 ('smart', 'intelligent'): 9.2,
 ('hard', 'difficult'): 8.77,
 ('happy', 'cheerful'): 9.55,
 ('hard', 'easy'): 0.95,
 ('fast', 'rapid'): 8.75,
 ('happy', 'glad'): 9.17,
 ('short', 'long'): 1.23,
 ('stupid', 'dumb'): 9.58,
 ('weird', 'strange'): 8.93,
 ('wide', 'narrow'): 1.03,
 ('bad', 'awful'): 8.42,
 ('easy', 'difficult'): 0.58,
 ('bad', 'terrible'): 7.78,
 ('hard', 'simple'): 1.38,
 ('smart', 'dumb'): 0.55,
 ('insane', 'crazy'): 9.57,
 ('happy', 'mad'): 0.95,
 ('large', 'huge'): 9.47,
 ('hard', 'tough'): 8.05,
 ('new', 'fresh'): 6.83,
 ('sharp', 'dull'): 0.6,
 ('quick', 'rapid'): 9.7,
 ('dumb', 'foolish'): 6.67,
 ('wonderful', 'terrific'): 8.63,
 ('strange', 'odd'): 9.02,
 ('happy', 'angry'): 1.28,
 ('narrow', 'broad'): 1.18,
 ('simple', 'easy'): 9.4,
 ('old', 'fresh'): 0.87,
 ('apparent', 'obvious'): 8.47,
 ('inexpensive', 'cheap'): 8.72,
 ('nice', 'generous'): 5.0,
 ('weird', 'normal'): 0.72,
 ('weird', 'odd'): 9.2,
 ('bad', 'immoral'): 7.62,
 ('sad', 'funny

## Compute correlation between human scores and word2vec similarities

In natural language processing and computational linguistics, it is often essential to evaluate the performance of word embeddings, such as Word2Vec, by comparing their semantic similarity scores with human-generated similarity scores. This evaluation helps us understand how well the model captures the relationships between words as perceived by humans.

To perform this evaluation, we calculate the correlation between the human-assigned similarity scores and the similarity scores generated by the Word2Vec model. This correlation analysis provides insights into how closely the model's output aligns with human judgments.
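
Two correlation measures are commonly used for this comparison: Pearson's r, which measures linear agreement between the raw scores, and Spearman's rho, which is Pearson's r computed on the ranks of the scores. The sketch below implements both in plain Python to show the relationship. The four human scores are taken from the SimLex-999 pairs printed above, while the `system` scores are made up for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson's r applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


human = [9.2, 8.77, 1.58, 0.95]    # SimLex-999 scores shown above
system = [0.80, 0.90, 0.30, 0.10]  # hypothetical model similarities
print(pearson(human, system), spearman(human, system))
```

Spearman's rho only looks at the ordering of the pairs, so it is unaffected by the different scales of the human scores (0 to 10) and cosine similarities (-1 to 1).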

In [None]:
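def compute_correlation_score(model, word_pair2score, print_warning=True):
  """Correlate human similarity scores with the model's similarities.

  Returns (Pearson's r, Spearman's rho) over all pairs in word_pair2score.
  """
  human_scores = []
  system_scores = []
  for (w1, w2), score in word_pair2score.items():
    if (w1 not in model) or (w2 not in model):
      # out-of-vocabulary pair: assign the lowest possible similarity (-1)
      system_scores.append(-1)
      human_scores.append(score)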
      if print_warning:
        print(f"WARNING ({w1} and {w2}) are not present in the embedding model!!" )
      continue
    system_similarity = model.similarity(w1, w2)
    human_scores.append(score)
    system_scores.append(system_similarity)
  human_scores = np.array(human_scores)
  system_scores = np.array(system_scores)
  pearson_r, _ = scipy.stats.pearsonr(human_scores, system_scores)    # Pearson's r
  spearman_rho = scipy.stats.spearmanr(human_scores, system_scores).statistic   # Spearman's rho
  return pearson_r, spearman_rho







### Correlation score of our model versus pretrained model!

In [None]:
compute_correlation_score(model.wv, simplex_pairs, print_warning=False)

(-0.014310638953798107, 0.03950377578426888)

In [None]:
compute_correlation_score(model_pretrained, simplex_pairs)

(0.2941386830730656, 0.2645792192990813)

## Semantic similarity

We provide an adapted version of the semantic correlation score that can be used to assess the performance of your sense embedding models.

In [None]:


## IMPLEMENT THIS TO LOAD senses2score
# The returned dictionary should be similar to the previous word_pair2score, but instead of
# words we use the senses that the dataset associates with those words.
def load_semantic_simplex(path):
  senses2score = dict()
  with open(path) as fr:
    next(fr)  # skip the header row
    for line in fr:
      chunks = line.strip().split()
      w1 = chunks[0]
      w2 = chunks[1]
      sim_lex_score = float(chunks[3])  # convert to a number for the correlation
      senses_w1 = chunks[10].split(",")
      senses_w2 = chunks[11].split(",")
      # key: a pair of sense tuples (tuples are hashable, lists are not)
      senses2score[(tuple(senses_w1), tuple(senses_w2))] = sim_lex_score

  return senses2score

### Adapting `compute_correlation_score`

To tailor the `compute_correlation_score` function to our needs, we have produced an adapted version called `compute_semantic_correlation_score`. This modified function compares the semantic similarity between pairs of senses with the corresponding human-assigned scores.


The "compute_semantic_correlation_score" function begins by collecting pairs of senses and their associated human scores.
For those pairs the function computes the semantic similarity between each possible pair of senses. It calculates the system similarity score by averaging these individual sense-to-sense similarities. The human scores and system scores are then collected for subsequent correlation analysis.
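
The averaging step can be sketched in isolation as follows. The sense identifiers and two-dimensional vectors are invented for illustration, and `average_sense_similarity` is a hypothetical helper, not part of any library. It returns `None` when one of the words has no sense in the model, leaving the handling of that case to the caller.

```python
import math
from itertools import product


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def average_sense_similarity(senses_1, senses_2, vectors):
    """Mean cosine similarity over all sense pairs that have a vector."""
    known_1 = [s for s in senses_1 if s in vectors]
    known_2 = [s for s in senses_2 if s in vectors]
    if not known_1 or not known_2:
        return None  # at least one word has no sense in the model
    sims = [cosine(vectors[a], vectors[b]) for a, b in product(known_1, known_2)]
    return sum(sims) / len(sims)


# Made-up sense identifiers and vectors
toy_sense_vectors = {
    "bank_finance": [1.0, 0.0],
    "bank_river": [0.0, 1.0],
    "money_cash": [1.0, 0.0],
}
score = average_sense_similarity(
    ("bank_finance", "bank_river"), ("money_cash",), toy_sense_vectors
)
print(score)  # 0.5: the mean of similarities 1.0 and 0.0
```

Averaging treats every sense as equally likely; taking the maximum over sense pairs is a common alternative when only the closest senses should count.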






In [None]:


def compute_semantic_correlation_score(model, senses2score,  print_warning=True):
  human_scores = []
  system_scores = []
  for (senses_1, senses_2), score in senses2score.items():
    senses_1_in_model = [s for s in senses_1 if s in model]
    senses_2_in_model = [s for s in senses_2 if s in model]

    if len(senses_1_in_model) == 0 or len(senses_2_in_model) == 0:
      # none of the senses of one of the words is present in the model
      s1_str = " ".join(senses_1)
      s2_str = " ".join(senses_2)
      if print_warning:
        print(f"WARNING ({s1_str} and {s2_str}) are not present in the embedding model!!" )
      system_scores.append(-1)
      human_scores.append(score)
      continue  # skip the averaging below, which would divide by zero
    # Calculate semantic similarities between all pairs of senses
    all_similarities = []
    for s1 in senses_1_in_model:
      for s2 in senses_2_in_model:
        all_similarities.append(model.similarity(s1, s2))

    system_similarity = sum(all_similarities) / len(all_similarities)
    human_scores.append(score)
    system_scores.append(system_similarity)
  human_scores = np.array(human_scores)
  system_scores = np.array(system_scores)
  # Calculate Pearson's r (Pearson correlation coefficient) and Spearman's rho (Spearman rank correlation coefficient)
  pearson_r, _ = scipy.stats.pearsonr(human_scores, system_scores)    # Pearson's r
  spearman_rho = scipy.stats.spearmanr(human_scores, system_scores).statistic   # Spearman's rho
  return pearson_r, spearman_rho