<a href="https://colab.research.google.com/github/StrategicalIT/PipedPiperAI/blob/main/Lab02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LAB2: Word Embeddings with Word2Vec
In this lab we are going to explore word embeddings using one of the most popular tools, Word2Vec.

This tool was created by Google in 2013. It uses a shallow neural network to learn word embeddings that capture semantic relationships between words.


## Install dependencies

The first step is to install the necessary libraries. In this case we will install the [gensim](https://pypi.org/project/gensim/) Python library. This library is popular for NLP (Natural Language Processing) use cases as it provides capabilities like topic modelling, document indexing and similarity retrieval.

If you receive an **error message** about previous runtime versions (in Google Colab), **click restart** in the error message that pops up.

In [None]:
!pip install gensim

The gensim library comes packaged with multiple [models](https://radimrehurek.com/gensim/apiref.html). We are interested in Word2Vec.

Let's start by importing it. **Before that**, click on the "Runtime" menu above and select **"Restart session"**, unless you got an error message in the previous step and have already restarted the session.

In [None]:
#did you restart the session before running this code block?
import gensim
from gensim.models import Word2Vec

## Preparing the data

Word2Vec can work with large amounts of data, but in our example we will work with a small corpus of 3 sentences.

In [None]:
sentences = ["this is sentence number one", "Here it is another one","This makes it the third one"]

To create the Word2Vec model we need to pass our documents/sentences as a list where each item is itself a list of the words in one document. We are going to call this structure "corpus".

In [None]:
corpus = []
for s in sentences:
    newlist = []
    for w in s.split():
        newlist.append(w.lower())
    corpus.append(newlist)

print(corpus)

## Creating and exploring the model

It is time to create the model. Notice below how we are:
- considering every word that appears at least once, i.e. with frequency greater than or equal to 1, by using min_count=1
- requesting a vector size with 100 dimensions
- setting the size of the sliding window to 5
- using CBOW. sg=0 means CBOW (Continuous Bag of Words), whereas sg=1 means Skip-gram

In [None]:
model = Word2Vec(corpus, min_count = 1, vector_size = 100, window = 5, sg=0)
print(model)
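
To make the window setting more concrete, here is a minimal sketch in plain Python of how a sliding window pairs each target word with its surrounding context words. This is an illustration only, not gensim's actual implementation: the helper name `context_pairs` is hypothetical, and a window of 2 is used instead of 5 to keep the output short. CBOW learns to predict the target word from its context words, while Skip-gram learns to predict the context words from the target.

```python
def context_pairs(tokens, window):
    """Return a (target, context_words) pair for every position in tokens."""
    pairs = []
    for i, target in enumerate(tokens):
        # Up to `window` words on each side of the target
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((target, left + right))
    return pairs


tokens = ["this", "is", "sentence", "number", "one"]
for target, context in context_pairs(tokens, window=2):
    print(target, "<-", context)
```

Words near the start or end of a sentence simply get fewer context words.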

When we print the "model" we can see the size of the vocabulary.

When we created the "corpus" structure earlier, we used the ".lower()" function to turn the words to lower case. If we removed that function, the size of the vocabulary would increase from 11 to 12 because "this" and "This" would no longer be considered the same word.
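
The effect of lower-casing on the vocabulary size can be checked without training a model at all. The sketch below counts the distinct words in our three sentences with and without ".lower()"; the variable names `vocab_lower` and `vocab_cased` are just illustrative.

```python
sentences = [
    "this is sentence number one",
    "Here it is another one",
    "This makes it the third one",
]

# Distinct words when every word is lower-cased (as in our corpus)
vocab_lower = {w.lower() for s in sentences for w in s.split()}

# Distinct words when the original casing is kept
vocab_cased = {w for s in sentences for w in s.split()}

print(len(vocab_lower))  # 11
print(len(vocab_cased))  # 12
```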

Next, let's display the vocabulary

In [None]:
print(list(model.wv.key_to_index))

We can query the vocabulary to see which position a certain word occupies. Bear in mind that positions start at 0, just like the indexes of a Python list.

In [None]:
print(model.wv.key_to_index["is"])

## Calculating similarities

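We can use the model to compute the similarity between the vectors of two words. We would expect these 2 words to be reasonably similar, although with a corpus this tiny the exact values are not very meaningful. The closer the similarity is to "1", the more semantically related the model considers them.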

In [None]:
print(model.wv.similarity('it', 'this'))


On the other hand, these 2 words should produce a smaller similarity (a number close to zero). A negative number would indicate that the two word vectors point in roughly opposite directions.

In [None]:
print(model.wv.similarity('one', 'makes'))
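
The similarity values above are cosine similarities between word vectors. As a minimal sketch of what that number means, the function below computes the cosine similarity of two small hand-made vectors in plain Python. The function name `cosine_similarity` and the example vectors are made up for illustration; real word vectors from our model have 100 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (always between -1 and 1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))    # same direction: close to 1
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # perpendicular: 0
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))  # opposite direction: close to -1
```

Only the direction of the vectors matters, not their length, which is why the first pair scores close to 1 even though one vector is twice as long as the other.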

## How to explore further

Try adding your own sentences, loading them into the model and calculating the similarity between specific words. Do your results make sense?

### End of Lab 2