## Exercise: Developing a Word2Vec Algorithm

Objective: Develop a simplified version of the Word2Vec algorithm to understand its underlying concepts and principles.

Dataset Preparation:
* Select a text corpus for training your Word2Vec model. You can choose any text corpus, such as news articles, books, or Wikipedia articles.
* Preprocess the text by tokenizing it into sentences and then into words.
* Remove unwanted characters and punctuation, and convert the text to lowercase.
* Split the preprocessed text into a list of sentences, where each sentence is a list of words.

In [None]:
import nltk
from nltk.corpus import shakespeare
from nltk.tokenize import sent_tokenize, word_tokenize

# Download the sentence tokenizer models and the example corpus (Shakespeare plays)
nltk.download('punkt')
nltk.download('shakespeare')

# Combine the raw text of all the Shakespeare plays into one string.
# Note: this corpus is stored as XML, so the raw text still contains markup.
corpus = ' '.join([shakespeare.raw(fileid) for fileid in shakespeare.fileids()])

# Split into sentences, then lowercase and tokenize each sentence into words.
# Punctuation is not removed here, so tokens such as 'lord.' stay in the vocabulary.
sentences = sent_tokenize(corpus)
tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package shakespeare to /root/nltk_data...
[nltk_data]   Unzipping corpora/shakespeare.zip.
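
The cell above lowercases the text but leaves punctuation and markup fragments in place. One possible cleaning step is sketched below. The helper name `clean_tokens`, the set of stripped characters, and the rule of keeping only alphabetic tokens are assumptions made for illustration, not part of the original exercise.

```python
import re

# Hypothetical helper: strip surrounding punctuation, then keep only
# alphabetic tokens (apostrophes are allowed inside a word, e.g. "o'er").
def clean_tokens(tokens):
    cleaned = []
    for token in tokens:
        token = token.lower().strip(".,;:!?\"'()[]")
        if re.fullmatch(r"[a-z]+(?:'[a-z]+)*", token):
            cleaned.append(token)
    return cleaned

sample = ["KING", "lord.", "/scene", ",", "Claudius", "o'er", "<", "42"]
print(clean_tokens(sample))  # ['king', 'lord', 'claudius', "o'er"]
```

Applied to every sentence, for example `[clean_tokens(s) for s in tokenized_sentences]`, this would merge `lord.` into `lord` and drop tokens such as `/scene`.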


Vocabulary Creation:
* Build a vocabulary by creating a set of unique words present in the corpus.
* This will be used to initialize the word vectors.
* Assign a unique index to each word in the vocabulary. You can use a dictionary to map words to their corresponding indices.

In [None]:
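# Build the vocabulary: the set of unique tokens in the corpus
vocab = {word for sentence in tokenized_sentences for word in sentence}
# Map each word to a row index. Set iteration order is arbitrary, so the indices can differ between runs.
word2index = {word: index for index, word in enumerate(vocab)}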
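
Because a set has no fixed order, the word-to-index mapping can change from run to run. A minimal sketch of a reproducible alternative follows. The function `build_vocab`, its `min_count` parameter, and the toy sentences are hypothetical names and data introduced here for illustration.

```python
from collections import Counter

# Hypothetical helper: count words, optionally drop rare ones, and sort
# the survivors so that every run assigns the same indices.
def build_vocab(tokenized_sentences, min_count=1):
    counts = Counter(word for sentence in tokenized_sentences for word in sentence)
    index2word = sorted(word for word, count in counts.items() if count >= min_count)
    word2index = {word: index for index, word in enumerate(index2word)}
    return word2index, index2word, counts

toy_sentences = [["the", "king", "speaks"], ["the", "queen", "listens"]]
toy_word2index, toy_index2word, toy_counts = build_vocab(toy_sentences)
print(toy_word2index)
# {'king': 0, 'listens': 1, 'queen': 2, 'speaks': 3, 'the': 4}
```

The `index2word` list gives the reverse lookup (row index to word), which is convenient later when translating similarity results back into words.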


Initialize Word Vectors:
* Initialize the word vectors randomly for each word in the vocabulary.
* Each word vector should have a fixed dimensionality, such as 100 or 300.

In [None]:
import numpy as np

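# Initialize word vectors: one row per vocabulary word.
# np.random.rand draws uniform values in [0, 1), so every component starts non-negative.
vector_size = 100
word_vectors = np.random.rand(len(vocab), vector_size)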
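
As a variation, the vectors can be drawn from a small range centered on zero with a seeded generator, so that runs are repeatable. This is only a sketch: the function name `init_word_vectors`, the seed, and the bound `0.5 / vector_size` are assumptions chosen for illustration, not values taken from the exercise.

```python
import numpy as np

# Hypothetical helper: small, zero-centered, reproducible initial vectors.
def init_word_vectors(vocab_size, vector_size, seed=0):
    rng = np.random.default_rng(seed)       # seeded generator for repeatable runs
    bound = 0.5 / vector_size               # assumed range, scaled by dimensionality
    return rng.uniform(-bound, bound, size=(vocab_size, vector_size))

demo_vectors = init_word_vectors(vocab_size=5, vector_size=100)
print(demo_vectors.shape)  # (5, 100)
```

Zero-centered starting values avoid the situation where all vectors begin in the same positive region of the space, which would make their initial cosine similarities uniformly high.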


Training the Model:
* Define the context window size, which determines how many words to consider on each side of the target word.
* Loop through each sentence in the corpus and each word in the sentence.
* For each target word, select the context words within the specified window.
* Calculate the predicted word vector by taking the average of the context word vectors.
* Update the target word vector by adjusting it to be more similar to the predicted word vector using a learning rate.
* Repeat this process for a specified number of iterations or until convergence.
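
To make the context-selection step concrete, the sketch below extracts the window for each position of a toy sentence. The helper name `context_window` and the example sentence are illustrative assumptions. The slicing matches the description above: up to `window_size` words on each side, with fewer at the sentence boundaries.

```python
# Hypothetical helper: return the words within window_size of position i,
# excluding the target word itself.
def context_window(sentence, i, window_size):
    left = sentence[max(i - window_size, 0):i]
    right = sentence[i + 1:i + window_size + 1]
    return left + right

toy_sentence = ["to", "be", "or", "not", "to", "be"]
for i, target in enumerate(toy_sentence):
    print(target, "->", context_window(toy_sentence, i, window_size=2))
# The first line printed is: to -> ['be', 'or']
```

A sentence with a single word yields an empty context, so a training loop needs to skip that case before averaging.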

In [None]:
window_size = 2       # number of context words on each side of the target
learning_rate = 0.01
num_iterations = 10   # increase for better accuracy and richer word relationships

# Training loop
for _ in range(num_iterations):
    for sentence in tokenized_sentences:
        for i, target_word in enumerate(sentence):
            # Context: up to window_size words before and after the target
            context_words = sentence[max(i - window_size, 0):i] + sentence[i + 1:i + window_size + 1]
            if not context_words:
                # A one-word sentence has no context; averaging nothing would give NaN
                continue
            context_vectors = word_vectors[[word2index[word] for word in context_words]]
            predicted_vector = np.mean(context_vectors, axis=0)

            # Move the target vector a small step toward the context average
            target_index = word2index[target_word]
            target_vector = word_vectors[target_index]
            word_vectors[target_index] += learning_rate * (predicted_vector - target_vector)
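
The steps mention training "until convergence". One way to approximate that, sketched here on toy data, is to track the mean size of the updates in each pass and stop once it falls below a threshold. The function `train_epoch`, the toy sentences, the tolerance, and the epoch cap are all assumptions made for illustration.

```python
import numpy as np

def train_epoch(vectors, sentences, word2index, window_size, learning_rate):
    """Run one pass of the averaging update in place; return the mean update norm."""
    total_change, num_updates = 0.0, 0
    for sentence in sentences:
        for i, target_word in enumerate(sentence):
            context = sentence[max(i - window_size, 0):i] + sentence[i + 1:i + window_size + 1]
            if not context:  # nothing to average for a one-word sentence
                continue
            predicted = vectors[[word2index[w] for w in context]].mean(axis=0)
            target_index = word2index[target_word]
            step = learning_rate * (predicted - vectors[target_index])
            vectors[target_index] += step
            total_change += float(np.linalg.norm(step))
            num_updates += 1
    return total_change / max(num_updates, 1)

# Toy data (hypothetical), small enough to converge quickly
toy_sentences = [["the", "king", "rules"], ["the", "queen", "rules"]]
toy_word2index = {w: i for i, w in enumerate(sorted({w for s in toy_sentences for w in s}))}
toy_vectors = np.random.default_rng(0).uniform(-0.5, 0.5, size=(len(toy_word2index), 8))

tolerance = 1e-4    # assumed stopping threshold
max_epochs = 1000   # safety cap on the number of passes
epochs_run = 0
mean_change = float("inf")
while epochs_run < max_epochs and mean_change >= tolerance:
    mean_change = train_epoch(toy_vectors, toy_sentences, toy_word2index,
                              window_size=2, learning_rate=0.1)
    epochs_run += 1

print(f"stopped after {epochs_run} epochs, mean update norm {mean_change:.2e}")
```

Note that this averaging rule only ever pulls vectors toward their neighbors, so given enough passes the vectors tend to drift toward one another. A small update norm therefore signals that training has settled, not necessarily that the vectors have become more informative.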


Evaluation:
* After training the Word2Vec model, choose a few target words from the vocabulary.
* Find the most similar words to each target word based on the cosine similarity between their word vectors.
* Compare the results with the expected similarities and evaluate the performance of your Word2Vec model qualitatively.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

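# Example evaluation
# Reverse lookup from row index to word (dict keys keep their insertion order)
index2word = list(word2index)

target_words = ["cat", "dog", "king", "queen"]  # each word must exist in the vocabulary
for target_word in target_words:
    target_index = word2index[target_word]
    target_vector = word_vectors[target_index]
    # Cosine similarity between the target and every word vector, shape (1, vocab_size)
    similarities = cosine_similarity(target_vector.reshape(1, -1), word_vectors)
    # Indices of the five highest similarities; the target itself comes first
    similar_indices = np.argsort(-similarities)[0][:5]
    similar_words = [index2word[index] for index in similar_indices]
    print(f"Words similar to '{target_word}': {similar_words}")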


Words similar to 'cat': ['cat', 'maid', 'slave', 'common', 'breast']
Words similar to 'dog': ['dog', 'boy', 'nor', 'told', 'moor']
Words similar to 'king': ['king', '/scene', 'lord.', 'claudius', 'exit']
Words similar to 'queen': ['queen', 'gertrude', 'lord.', '/scene', 'exit']
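
In the output above, each target word appears as its own nearest neighbor. A small variation, sketched below with NumPy only, excludes the target from its results and returns an empty list for words missing from the vocabulary. The function `most_similar` and the two-dimensional toy vectors are hypothetical and exist only to make the behavior easy to check by hand. The sketch assumes no vector is all zeros.

```python
import numpy as np

# Hypothetical helper: top_n nearest words by cosine similarity, target excluded.
def most_similar(word, word2index, index2word, vectors, top_n=3):
    if word not in word2index:
        return []
    # Normalize rows so that a dot product equals cosine similarity
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    similarities = unit @ unit[word2index[word]]
    order = np.argsort(-similarities)
    return [index2word[i] for i in order if index2word[i] != word][:top_n]

toy_index2word = ["king", "queen", "dog", "cat"]
toy_word2index = {w: i for i, w in enumerate(toy_index2word)}
toy_vectors = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])

print(most_similar("king", toy_word2index, toy_index2word, toy_vectors, top_n=2))
# ['queen', 'cat']
```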
