<a href="https://colab.research.google.com/github/Z4HRA-S/NLP_Course_Spring2023/blob/main/Word_Embedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Word2vec
[Link to paper](https://arxiv.org/abs/1301.3781) | [Link to tutorial](https://towardsdatascience.com/word2vec-research-paper-explained-205cb7eecc30) | [Link to tutorial 2](https://machinelearninginterview.com/topics/natural-language-processing/what-is-the-difference-between-word2vec-and-glove/)

In the cell below we download the Word2vec model and load it.

In [None]:
import gensim.downloader as api

# Download (on first use) and load the pretrained Google News vectors.
v2w_model = api.load('word2vec-google-news-300')
# sample_word2vec_embedding = v2w_model['computer']

We define a cosine similarity function to measure how similar two vectors are.

In [None]:
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
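
As a quick sanity check of the function above, here is a minimal sketch on hand-picked toy vectors (the vectors are illustrative assumptions, not taken from any model). Vectors pointing in the same direction score 1, orthogonal vectors score 0, and opposite vectors score -1, regardless of their lengths.

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy vectors chosen by hand for illustration only.
x = np.array([1.0, 2.0, 3.0])
same_direction = 2.0 * x                  # longer, but same direction as x
orthogonal = np.array([3.0, 0.0, -1.0])   # dot product with x is 0
opposite = -x

sim_same = cosine_similarity(x, same_direction)
sim_orth = cosine_similarity(x, orthogonal)
sim_opp = cosine_similarity(x, opposite)
print(sim_same, sim_orth, sim_opp)
```

Because the score depends only on direction, it is a convenient way to compare embeddings whose lengths differ.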

Now let's have a look at the vector for "apple".

In [None]:
v2w_model['apple'].shape

(300,)

In [None]:
v2w_model['apple']

This pretrained Word2vec model covers 3,000,000 English words and phrases.

In [None]:
len(v2w_model)

3000000

In [None]:
print("The similarity between apple and lemon:", cosine_similarity(v2w_model['apple'], v2w_model['lemon']))
print("The similarity between boy and man:", cosine_similarity(v2w_model['boy'], v2w_model['man']))
print("The similarity between king and man:", cosine_similarity(v2w_model['king'], v2w_model['man']))
print("The similarity between king and queen:", cosine_similarity(v2w_model['king'], v2w_model['queen']))
print("The similarity between queen and woman:", cosine_similarity(v2w_model['queen'], v2w_model['woman']))


## GloVe
GloVe stands for "Global Vectors for Word Representation".

[paper link](https://nlp.stanford.edu/pubs/glove.pdf) | [tutorial link](https://medium.com/analytics-vidhya/word-vectorization-using-glove-76919685ee0b)

[Code sources](https://keras.io/examples/nlp/pretrained_word_embeddings/)

First we download the GloVe model weights.

In [None]:
!wget http://nlp.stanford.edu/data/glove.6B.zip

--2023-05-15 08:52:09--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2023-05-15 08:52:09--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2023-05-15 08:52:09--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip.1’



Now we unzip it.

In [None]:
!unzip -q glove.6B.zip

In [None]:
!ls

glove.6B.100d.txt  glove.6B.300d.txt  glove.6B.zip
glove.6B.200d.txt  glove.6B.50d.txt   sample_data


In the cell below we load the 100-dimensional GloVe vectors into a dictionary that maps each word to one fixed vector.

In [None]:
import numpy as np

embeddings_index = {}
# Each line holds a word followed by its space-separated vector components.
with open("glove.6B.100d.txt", "r", encoding="utf-8") as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs
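
To make the file format concrete, the following minimal sketch parses two made-up lines in the same layout (a word followed by space-separated numbers). The sample words and values are assumptions for illustration, not real GloVe entries, and the list comprehension with `float` stands in for the `np.fromstring` call above.

```python
import io
import numpy as np

# Two made-up lines in the GloVe text layout: word, then vector components.
sample_file = io.StringIO(
    "hello 0.1 -0.2 0.3\n"
    "world 0.4 0.5 -0.6\n"
)

toy_index = {}
for line in sample_file:
    # Split off the word; the remainder holds the numbers.
    word, coefs = line.split(maxsplit=1)
    values = [float(x) for x in coefs.split()]
    toy_index[word] = np.array(values, dtype="float32")

print(toy_index["hello"])   # a float32 vector with 3 components
print(len(toy_index))       # number of words parsed
```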

Let's see the vector for the word "hello".

In [None]:
embeddings_index["hello"]

array([ 0.26688  ,  0.39632  ,  0.6169   , -0.77451  , -0.1039   ,
        0.26697  ,  0.2788   ,  0.30992  ,  0.0054685, -0.085256 ,
        0.73602  , -0.098432 ,  0.5479   , -0.030305 ,  0.33479  ,
        0.14094  , -0.0070003,  0.32569  ,  0.22902  ,  0.46557  ,
       -0.19531  ,  0.37491  , -0.7139   , -0.51775  ,  0.77039  ,
        1.0881   , -0.66011  , -0.16234  ,  0.9119   ,  0.21046  ,
        0.047494 ,  1.0019   ,  1.1133   ,  0.70094  , -0.08696  ,
        0.47571  ,  0.1636   , -0.44469  ,  0.4469   , -0.93817  ,
        0.013101 ,  0.085964 , -0.67456  ,  0.49662  , -0.037827 ,
       -0.11038  , -0.28612  ,  0.074606 , -0.31527  , -0.093774 ,
       -0.57069  ,  0.66865  ,  0.45307  , -0.34154  , -0.7166   ,
       -0.75273  ,  0.075212 ,  0.57903  , -0.1191   , -0.11379  ,
       -0.10026  ,  0.71341  , -1.1574   , -0.74026  ,  0.40452  ,
        0.18023  ,  0.21449  ,  0.37638  ,  0.11239  , -0.53639  ,
       -0.025092 ,  0.31886  , -0.25013  , -0.63283  , -0.0118

In [None]:
embeddings_index["hello"].shape

(100,)

In [None]:
len(embeddings_index)

400000

This GloVe model has a vocabulary of 400,000 words.

In [None]:
cosine_similarity(embeddings_index["woman"],embeddings_index["queen"])

0.50951535

In [None]:
cosine_similarity(embeddings_index["man"],embeddings_index["queen"])

0.47403228

In [None]:
cosine_similarity(embeddings_index["woman"],embeddings_index["girl"])

0.84726715

In [None]:
cosine_similarity(embeddings_index["tehran"],embeddings_index["iran"])

0.8502589

In [None]:
cosine_similarity(embeddings_index["king"],embeddings_index["queen"])

0.750769

Does adding and subtracting vectors make sense here? Let's see.
We subtract the "man" vector from the "king" vector and then add the "woman" vector. The similarity between "woman" and "queen" was 0.51 initially, but the combined vector has a similarity of 0.783 with "queen".

What do you think? Do vectors encode the meanings correctly?

In [None]:
cosine_similarity(embeddings_index["king"]-embeddings_index["man"]+embeddings_index["woman"]
                  ,embeddings_index["queen"])

0.7834413
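
A single similarity score does not tell us whether "queen" is actually the closest word to the combined vector. One way to check is a nearest-neighbour search over the vocabulary. The sketch below does this on a tiny hand-made vocabulary; the 2-dimensional vectors and the helper name `nearest_words` are assumptions for illustration, not real GloVe data.

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def nearest_words(query, index, exclude=(), topn=3):
    """Rank the words in `index` by cosine similarity to `query`."""
    scores = [
        (word, float(cosine_similarity(query, vec)))
        for word, vec in index.items()
        if word not in exclude
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hand-made 2-d vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
toy_index = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.1,  1.0]),
    "woman": np.array([0.1, -1.0]),
    "apple": np.array([-1.0, 0.2]),
}

query = toy_index["king"] - toy_index["man"] + toy_index["woman"]
# The input words are excluded, as they are often close to the query.
result = nearest_words(query, toy_index, exclude={"king", "man", "woman"})
print(result)
```

With the real model you could pass `embeddings_index` instead of `toy_index`, though a plain Python loop over 400,000 words will be slow.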

Can we increase the similarity? We try adding the "elizabeth" vector to the first vector. The similarity increases, which suggests that the word "Elizabeth" is related to the "queen" context.

In [None]:
cosine_similarity(embeddings_index["king"]-embeddings_index["man"]+embeddings_index["woman"]
                  +embeddings_index["elizabeth"]
                  ,embeddings_index["queen"])

0.8362787

But what about "Elizabeth" on its own and "queen"? That similarity is lower than the one in the cell above. What do you think?

In [None]:
cosine_similarity(embeddings_index["elizabeth"]
                  ,embeddings_index["queen"])

0.735571
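
One thing to keep in mind when thinking about this question: the cosine of a sum with a target can be higher than the cosine of each summand with that target, because the sum can point in a direction that neither summand reaches alone. The toy 2-d vectors below are an assumption chosen to make the effect easy to see; they are not GloVe vectors.

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy vectors: each summand captures only part of the target direction.
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
target = np.array([1.0, 1.0])

sim_a = cosine_similarity(a, target)        # about 0.707
sim_b = cosine_similarity(b, target)        # about 0.707
sim_sum = cosine_similarity(a + b, target)  # a + b points exactly at target
print(sim_a, sim_b, sim_sum)
```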

In [None]:
cosine_similarity(embeddings_index["is"],embeddings_index["she"])

0.65344477

In [None]:
cosine_similarity(embeddings_index["is"],embeddings_index["he"])

0.71428835

## BERT

Bidirectional Encoder Representations from Transformers

[Paper link](https://arxiv.org/abs/1810.04805) 

[A visual guide to using bert for the first time](http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/)

[A very useful tutorial](https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/#3-extracting-embeddings)

[Another good tutorial](https://towardsdatascience.com/mastering-word-embeddings-in-10-minutes-with-tensorflow-41e25da6aa54)


First we need to install the transformers package.

In [1]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.29.1-py3-none-any.whl (7.1 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transformers-4.29.1


We will use a pre-trained model. The tokenizer and the embedding model should come from the same checkpoint, such as "bert-base-uncased", so that the token IDs match the model's vocabulary.

In [2]:
from transformers import AutoTokenizer, TFBertModel

import tensorflow as tf

# Load the tokenizer and the model from the same checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = TFBertModel.from_pretrained("bert-base-uncased")

def embedd(text):
    # Tokenize the text and return TensorFlow tensors.
    inputs = tokenizer(text, return_tensors="tf")
    outputs = model(inputs)
    # One contextual vector per token: shape (1, num_tokens, 768).
    return outputs.last_hidden_state

Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


Word2vec and GloVe produce the same vector for a word in every sentence, whereas BERT takes the surrounding context as input and produces a vector for each token based on that context. For example, the output vector for the word "king" differs between the two contexts below.

In [3]:
text1="""
The current monarch is King Charles III,
 who ascended the throne on 8 September 2022,
  upon the death of his mother, Queen Elizabeth II.
   The monarch and their immediate family undertake various official,
    ceremonial, diplomatic and representational duties.
"""

text2 = """
The Lion King is a 1994 American animated musical drama film
 produced by Walt Disney Feature Animation and released by Walt Disney Pictures. 
"""

In [4]:
emb1 = embedd(text1)
emb2 = embedd(text2)

In [None]:
emb1.shape

(1, 49, 768)

In [None]:
emb1

tensor([[[ 0.0017,  0.4633,  0.1297,  ..., -0.4788,  0.6391,  0.7906],
         [ 0.0863,  0.3106, -0.1846,  ..., -0.5353,  0.9764,  0.3122],
         [ 0.0835,  0.4554,  0.9469,  ..., -1.1852,  0.2155,  0.2122],
         ...,
         [ 1.0985,  0.4861,  0.3829,  ..., -0.6052, -0.0388,  0.2371],
         [ 0.6943,  0.2025, -0.0849,  ...,  0.2950, -0.7087, -0.2378],
         [ 0.0068,  0.2121,  0.6084,  ..., -0.2196, -0.2067,  0.2549]]],
       grad_fn=<NativeLayerNormBackward0>)

In [None]:
cosine_similarity(emb1[0][5],emb2[0][3])
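
The indices 5 and 3 above pick out the "king" token in each text, counting the `[CLS]` token that the tokenizer prepends at position 0. The sketch below shows the lookup on hand-written token lists; these lists are an assumption about how the first words of each text are tokenized, and the random array only stands in for a real BERT output of the same shape.

```python
import numpy as np

# Hand-written token lists (assumed tokenization of the first words only).
tokens1 = ["[CLS]", "the", "current", "monarch", "is", "king", "charles"]
tokens2 = ["[CLS]", "the", "lion", "king", "is", "a", "1994"]

idx1 = tokens1.index("king")
idx2 = tokens2.index("king")
print(idx1, idx2)  # positions of "king" in each list

# Stand-in for a BERT output of shape (batch, tokens, hidden size).
rng = np.random.default_rng(0)
fake_emb1 = rng.normal(size=(1, 49, 768))

# Selecting batch item 0 and one token position yields a single 768-d vector.
king_vector = fake_emb1[0][idx1]
print(king_vector.shape)
```

With the real tokenizer, `tokenizer.tokenize(text)` returns the word pieces without the special tokens, so a position in that list needs to be shifted by one to account for `[CLS]`.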

In [None]:
emb1 = embedd("I went to Turkey last year")
emb2 = embedd("I order Turkey for food")

8
7


In [None]:
cosine_similarity(emb1[0][4], emb2[0][3])

0.55290526

# Homework
1. There are more powerful language models out there! Research them and name 5 language models released after BERT. For each one, give the training data volume, the number of parameters, and a short description.

*If you are a fast-food person and use ChatGPT or other chatbots, please check the validity of the answers, because in my experience they sometimes lie :)*

2. Use the embeddings from the 3 models in this notebook (Word2vec, GloVe, and BERT) for text classification with an SVM. Submit the code together with a report of the results. Use the links below to load a simple dataset and to set up the SVM.

For data, use 2 different categories of news: [Link to data](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#loading-the-20-newsgroups-dataset)

For the SVM, use scikit-learn: [link](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)
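
For the Word2vec and GloVe part of exercise 2, each document needs to be turned into one fixed-length vector before it can be passed to an SVM. One common option (an assumption here, not a requirement of the exercise) is to average the vectors of the words that appear in the vocabulary. The sketch below uses a tiny made-up embedding table, and the helper name `document_vector` is hypothetical.

```python
import numpy as np

def document_vector(text, index, dim):
    """Average the vectors of all known words; zeros if none are known."""
    vectors = [index[w] for w in text.lower().split() if w in index]
    if not vectors:
        return np.zeros(dim, dtype="float32")
    return np.mean(vectors, axis=0)

# Made-up 3-d embeddings for illustration only.
toy_index = {
    "good": np.array([1.0, 0.0, 0.0], dtype="float32"),
    "movie": np.array([0.0, 1.0, 0.0], dtype="float32"),
}

doc_vec = document_vector("Good movie overall", toy_index, dim=3)
empty_vec = document_vector("unknown words only", toy_index, dim=3)
print(doc_vec, empty_vec)
```

Stacking one such vector per document gives a feature matrix that can be passed to the `fit` method of scikit-learn's `SVC`.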
