# 1. Introduction
## 1.1 Definition
**Word embeddings** are *vector representations of words* that capture semantic information and the relationships between words. The idea is to map each word to a numerical vector so that words with similar meanings or contexts get similar vectors. This lets machine learning models work with language numerically: words are placed in a high-dimensional space where the distance between two vectors indicates how similar the words are.
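
To make "distance indicates similarity" concrete, here is a minimal sketch of cosine similarity, a common way to compare two word vectors. The three-dimensional vectors below are made up for illustration and do not come from any trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional vectors, invented for this example only
king = [0.9, 0.8, 0.1]
queen = [0.85, 0.75, 0.2]
banana = [0.1, 0.05, 0.9]

print(cosine_similarity(king, queen))   # close to 1: vectors point the same way
print(cosine_similarity(king, banana))  # much smaller: vectors point apart
```

A value near 1 means the vectors point in almost the same direction, a value near 0 means they are unrelated, and negative values mean they point in opposing directions.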


## 1.2 Examples of word embedding techniques
1. **Word2Vec:** Developed at *Google*, this model learns word embeddings with a shallow neural network. It has two main training approaches: **CBOW (Continuous Bag of Words)** and **Skip-Gram**.
  * **CBOW** *predicts a word from its surrounding context.*
  * **Skip-Gram** *predicts the surrounding context from a given word.*
2. **GloVe (Global Vectors for Word Representation):** Developed at *Stanford*, GloVe captures global statistical information from a corpus by training on word co-occurrence statistics.
3. **FastText:** An extension of **Word2Vec** developed at *Facebook*, **FastText** represents each word as a set of **character n-grams**, which makes it effective for morphologically rich languages (*languages where a word takes many different forms*) and for out-of-vocabulary words.
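
The difference between CBOW and Skip-Gram is easiest to see in the training examples each one builds from a sentence. The sketch below is a simplification: the helper names (`context_windows`, `skipgram_pairs`, `cbow_pairs`) are hypothetical, and real implementations add details such as negative sampling that are left out here.

```python
def context_windows(tokens, window):
    """Yield (center_word, context_words) for every position in a sentence."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right

def skipgram_pairs(tokens, window):
    """Skip-Gram: one (input=center, target=context word) pair per context word."""
    return [(center, ctx)
            for center, context in context_windows(tokens, window)
            for ctx in context]

def cbow_pairs(tokens, window):
    """CBOW: one (input=all context words, target=center) example per position."""
    return [(context, center)
            for center, context in context_windows(tokens, window)]

sentence = ['i', 'love', 'natural', 'language', 'processing']

for pair in skipgram_pairs(sentence, window=1):
    print('skip-gram:', pair)
for example in cbow_pairs(sentence, window=1):
    print('cbow:', example)
```

With `window=1`, Skip-Gram turns the word `love` into two separate examples, `('love', 'i')` and `('love', 'natural')`, while CBOW builds a single example, `(['i', 'natural'], 'love')`.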

# 2. Import libraries
This cell imports the `Word2Vec` class from the `gensim` library.
* `gensim` is a Python library for training word embeddings and performing other **NLP** tasks.
* The `Word2Vec` class implements the **Word2Vec** model, which converts words into vector representations.

In [1]:
from gensim.models import Word2Vec
# from nltk.tokenize import word_tokenize

# 3. Prepare the text dataset
* Define `sentences`, a list of lists.
* Each inner list is one sentence, already split into individual words. The **Word2Vec** model treats these words as **tokens**.
* **Word2Vec** uses the sentences to learn relationships between words: words that appear in similar contexts receive similar vector representations.
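
If the data starts out as raw strings rather than pre-split lists, it has to be tokenized first. The following is a minimal sketch using a regular expression. The `simple_tokenize` helper and the raw strings are assumptions made for illustration, and a dedicated tokenizer (for example from `nltk`) would handle punctuation and edge cases more carefully.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

# Hypothetical raw input, chosen to match the sample sentences in this notebook
raw_sentences = [
    "Hello world",
    "I love natural language processing",
    "Hello from the other side",
]

sentences = [simple_tokenize(s) for s in raw_sentences]
print(sentences)
```

The result has the list-of-lists shape that `Word2Vec` expects as its `sentences` argument.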


In [2]:
# Sample sentences
sentences = [
    ['hello', 'world'],
    ['i', 'love', 'natural', 'language', 'processing'],
    ['hello', 'from', 'the', 'other', 'side']
]

# 4. Initialize and Train Word2Vec Model
#### Key parameters
- **sentences:** *the text corpus, an iterable of sentences, where each sentence is a list of words.*
- **vector_size:** *the dimensionality of the word vectors.*
- **window:** *the context window size.*
- **min_count:** *the minimum number of occurrences a word needs to be kept in the vocabulary.*
- **sg:** *the training algorithm: `sg=1` selects skip-gram, `sg=0` selects CBOW (Continuous Bag of Words).*
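
To illustrate what `min_count` does, the sketch below counts word frequencies and drops rare words. It imitates the filtering idea only. `build_vocab` is a hypothetical helper, not part of `gensim`.

```python
from collections import Counter

def build_vocab(sentences, min_count):
    """Count every word and keep those seen at least `min_count` times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}

sentences = [
    ['hello', 'world'],
    ['i', 'love', 'natural', 'language', 'processing'],
    ['hello', 'from', 'the', 'other', 'side'],
]

print(len(build_vocab(sentences, min_count=1)))  # 11 distinct words are kept
print(build_vocab(sentences, min_count=2))       # {'hello': 2}
```

On a corpus this small, anything above `min_count=1` removes almost the entire vocabulary, which is why the notebook keeps every word.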




In [3]:
# vector_size=50: dimensionality of the word vectors (embedding size).
# window=3: maximum distance between the current and predicted word within a sentence,
#   i.e. up to three words before and after the target word are used as context.
# min_count=1: ignore words whose total frequency is lower than this; here every word is kept.
# sg=1: use the skip-gram algorithm, which predicts surrounding words from a center word;
#   sg=0 would use CBOW, which predicts the center word from its surrounding context words.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)
model

<gensim.models.word2vec.Word2Vec at 0x7a10d70ccca0>

# 5. Accessing Word Vectors
- **model.wv** provides access to the trained word vectors.
- Each word is represented by a *50-dimensional vector* (set by `vector_size=50`) that captures information about the words it appears with in the training data.

In [4]:
word_vector = model.wv['hello']  # Retrieve the vector for the word 'hello'
# Prints the vector for the word 'hello'. The output will be a 50-dimensional array of floating-point numbers that represents the word in semantic space.
print(word_vector)

[-1.0724545e-03  4.7286271e-04  1.0206699e-02  1.8018546e-02
 -1.8605899e-02 -1.4233618e-02  1.2917745e-02  1.7945977e-02
 -1.0030856e-02 -7.5267432e-03  1.4761009e-02 -3.0669428e-03
 -9.0732267e-03  1.3108104e-02 -9.7203208e-03 -3.6320353e-03
  5.7531595e-03  1.9837476e-03 -1.6570430e-02 -1.8897636e-02
  1.4623532e-02  1.0140524e-02  1.3515387e-02  1.5257311e-03
  1.2701781e-02 -6.8107317e-03 -1.8928028e-03  1.1537147e-02
 -1.5043275e-02 -7.8722071e-03 -1.5023164e-02 -1.8600845e-03
  1.9076237e-02 -1.4638334e-02 -4.6675373e-03 -3.8754821e-03
  1.6154874e-02 -1.1861792e-02  9.0324880e-05 -9.5074680e-03
 -1.9207101e-02  1.0014586e-02 -1.7519170e-02 -8.7836506e-03
 -7.0199967e-05 -5.9236289e-04 -1.5322480e-02  1.9229487e-02
  9.9641159e-03  1.8466286e-02]


In [5]:
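# Find the words most similar to "hello"; the result is a list of (word, cosine similarity) pairs.
# With only three training sentences the scores stay close to zero and are not meaningful.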
similar_words = model.wv.most_similar('hello')
# print(similar_words)
similar_words

[('natural', 0.13204392790794373),
 ('other', 0.1267007291316986),
 ('i', 0.09984847903251648),
 ('side', 0.042373016476631165),
 ('processing', 0.012442179024219513),
 ('world', -0.012591075152158737),
 ('the', -0.01447527389973402),
 ('love', -0.0560765340924263),
 ('language', -0.05974648892879486),
 ('from', -0.11821284145116806)]