## Word2Vec
Word2Vec is a widely used technique in natural language processing (NLP) that maps words to dense numerical vectors so that models can work with text. The vectors are learned so that words appearing in similar contexts end up with similar vectors, which lets them capture aspects of word meaning.
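
"Similar vectors" is usually measured with cosine similarity. The sketch below illustrates the idea on three hand-made 3-dimensional vectors. The numbers and the `cosine_similarity` helper are assumptions made up for illustration, not output from a trained model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, chosen by hand for illustration only.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
cat_car = cosine_similarity(toy_vectors["cat"], toy_vectors["car"])
print(f"cat vs dog: {cat_dog:.3f}")
print(f"cat vs car: {cat_car:.3f}")
```

In this toy setup "cat" and "dog" point in nearly the same direction, so their score is close to 1, while "cat" and "car" score much lower.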

### How does Word2Vec work?
Word2Vec trains a shallow neural network on a large amount of text and learns each word's vector from the contexts in which that word appears. It can be trained with one of two architectures: CBOW and Skip-gram.

### CBOW (Continuous Bag of Words)
In CBOW, the model tries to predict a target word from the words that come before and after it within a fixed window (the context). For example, given the sentence "the cat sits on the mat", if the context is ["the", "cat", "on", "the", "mat"], the model tries to predict the word "sits".
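
The pairing described above can be sketched in a few lines of plain Python. This is only an illustration of how (context, target) examples could be formed; the `cbow_pairs` helper and its `window` argument are hypothetical names, and the sketch does no actual training.

```python
def cbow_pairs(tokens, window=2):
    """Build (context_words, target_word) pairs for a CBOW-style objective."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]    # up to `window` words before the target
        right = tokens[i + 1:i + 1 + window]   # up to `window` words after the target
        pairs.append((left + right, target))
    return pairs

tokens = "the cat sits on the mat".split()
pairs = cbow_pairs(tokens, window=5)

context, target = pairs[2]
print(context, "->", target)
# ['the', 'cat', 'on', 'the', 'mat'] -> sits
```

With a window of 5 the whole sentence fits in the context, which reproduces the example from the text; a smaller window would keep only the nearest neighbours.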

### Skip-gram
In Skip-gram, the model does the opposite: it takes a single word and tries to predict the words around it (the context). Using the same sentence, if the target word is "sits", the model tries to predict ["the", "cat", "on", "the", "mat"].
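
As a rough sketch, Skip-gram training examples can be thought of as one (target, context word) pair per neighbour. The `skipgram_pairs` helper below is a hypothetical name used only to illustrate that pairing; it does not train anything.

```python
def skipgram_pairs(tokens, window=2):
    """Build (target_word, context_word) pairs for a Skip-gram-style objective."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the cat sits on the mat".split()
all_pairs = skipgram_pairs(tokens, window=5)

sits_pairs = [pair for pair in all_pairs if pair[0] == "sits"]
print(sits_pairs)
# [('sits', 'the'), ('sits', 'cat'), ('sits', 'on'), ('sits', 'the'), ('sits', 'mat')]
```

Each context word becomes its own prediction target, so a single position in the sentence yields several training pairs.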

In summary, CBOW predicts a word from its context, while Skip-gram predicts the context from a word.


In [3]:
!pip install gensim nltk



In [5]:
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import nltk
# Tokenizer data used by word_tokenize; recent NLTK releases may need 'punkt_tab' instead
nltk.download('punkt')

sentences = [
    "I love natural language processing",
    "Word embeddings are amazing",
    "I enjoy learning machine learning concepts",
    "Deep learning uses neural networks"
]

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Zainab\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [8]:
tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]
tokenized_sentences

[['i', 'love', 'natural', 'language', 'processing'],
 ['word', 'embeddings', 'are', 'amazing'],
 ['i', 'enjoy', 'learning', 'machine', 'learning', 'concepts'],
 ['deep', 'learning', 'uses', 'neural', 'networks']]

In [9]:
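# sg is left at its default (0), so this trains a CBOW model
model = Word2Vec(
    sentences=tokenized_sentences,
    vector_size=100,  # Dimension of the word vectors
    window=5,         # Maximum distance between the target word and a context word
    min_count=1,      # Keep every word; words seen fewer than min_count times are dropped
    workers=4         # Number of worker threads
)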

In [16]:
# Print the vocabulary (ordered from most to least frequent)
vocab = model.wv.index_to_key
vocab

['learning',
 'i',
 'are',
 'love',
 'natural',
 'language',
 'processing',
 'word',
 'embeddings',
 'networks',
 'neural',
 'enjoy',
 'machine',
 'concepts',
 'deep',
 'uses',
 'amazing']

In [12]:
# Get vector for a word
vector = model.wv['learning']  # Returns 100-dim vector
print(f"Vector for 'learning':\n{vector[:5]}...")  # Show first 5 dimensions

Vector for 'learning':
[-0.00053683  0.00023688  0.00510326  0.00900885 -0.00930313]...


In [13]:
# Find the most similar words by cosine similarity
# (with only four training sentences, these scores are mostly noise)
similar_words = model.wv.most_similar("learning", topn=3)
print("\nWords similar to 'learning':")
for word, score in similar_words:
    print(f"{word}: {score:.3f}")


Words similar to 'learning':
amazing: 0.219
networks: 0.216
machine: 0.093


In [None]:
import gensim.downloader as api

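# Download the pre-trained Google News vectors (~1.6GB);
# api.load returns a KeyedVectors object, so most_similar is called on it directly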
google_model = api.load('word2vec-google-news-300')

# Example usage
similar_words = google_model.most_similar("computer", topn=3)
print("\nGoogle News model results for 'computer':")
print(similar_words)

In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Get words and vectors
words = ["learning", "deep", "neural", "machine", "language", "processing"]
vectors = [model.wv[w] for w in words]

# Reduce to 2D
pca = PCA(n_components=2)
vectors_2d = pca.fit_transform(vectors)

# Plot
plt.figure(figsize=(10,6))
for i, word in enumerate(words):
    plt.scatter(vectors_2d[i,0], vectors_2d[i,1])
    plt.annotate(word, (vectors_2d[i,0], vectors_2d[i,1]))
plt.title("Word2Vec Visualization")
plt.show()

In [None]:
import numpy as np

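def document_vector(model, doc):
    """Return the mean word vector of the in-vocabulary tokens in `doc`."""
    # Remove out-of-vocabulary words
    words = [w for w in doc if w in model.wv]
    if not words:
        # No known words: fall back to a zero vector
        return np.zeros(model.vector_size)
    return np.mean(model.wv[words], axis=0)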

# Convert all documents to vectors
doc_vectors = [document_vector(model, doc) for doc in tokenized_sentences]
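
Averaged document vectors like these are typically compared with cosine similarity. The sketch below shows the idea with a plain dictionary of hand-made 2-dimensional vectors standing in for a trained model, so it runs without gensim. The `toy_vectors` values and the `average_vector` and `cosine` helpers are hypothetical names introduced for illustration.

```python
import numpy as np

# Hypothetical stand-in for trained word vectors.
toy_vectors = {
    "deep": np.array([1.0, 0.0]),
    "learning": np.array([0.0, 1.0]),
    "language": np.array([1.0, 1.0]),
}

def average_vector(vectors, tokens, dim=2):
    """Mean of the vectors of known tokens; zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

doc_a = average_vector(toy_vectors, ["deep", "learning"])      # both tokens known
doc_b = average_vector(toy_vectors, ["language", "models"])    # "models" is skipped
doc_c = average_vector(toy_vectors, ["completely", "unknown"]) # nothing known

print(cosine(doc_a, doc_b))
print(cosine(doc_a, doc_c))
```

Guarding against a zero denominator matters here because a document made up entirely of out-of-vocabulary words averages to the zero vector, for which cosine similarity is otherwise undefined.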