<a href="https://colab.research.google.com/github/dinakeshvari/NLP_Exercise_ShokrzadCourse/blob/main/Assignment02_DS04_S02_WordEmbedding_RezaShokrzad.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📌 NLP Assignment: Word Embeddings & Vectorization


## 🎯 Objective:
This assignment will deepen your understanding of word vectorization techniques: TF-IDF, Word2Vec, FastText, and GloVe.
Your goal is to fill in the blanks and complete the implementation.



## 1️⃣ TF-IDF: Identifying Fake vs. Real News Based on Keywords
**Task:**
* Use TfidfVectorizer from scikit-learn to score the words in each headline.
* Identify the top 5 keywords in fake and real news by average TF-IDF score.


In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample dataset (Fake & Real News Headlines)
data = {
    "headline": [
        "Breaking: Celebrity Caught in Secret Scandal! Fans are shocked as leaked footage surfaces online.",
        "Scientists Discover New Planet With Signs of Life! Astronomers say it could have a habitable atmosphere.",
        "Government Hiding Truth About UFOs, Says Insider! Documents reveal classified reports on alien encounters.",
        "New Study Shows Coffee Can Extend Your Lifespan. Researchers find evidence linking caffeine to longevity.",
        "Shocking: Politician Involved in Money Laundering Scheme! Investigation uncovers offshore bank accounts.",
        "NASA Confirms Water on Mars, A Big Step for Space Exploration. Experts believe this could lead to human settlement."
    ],
    "label": ["fake", "real", "fake", "real", "fake", "real"]  # Labels: "fake" or "real"
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer(stop_words="english")

# Fit and transform headlines
tfidf_matrix = vectorizer.fit_transform(df["headline"])

# Convert to DataFrame
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())

# Compute average TF-IDF score per category (fake vs. real)
df_tfidf = pd.concat([df, tfidf_df], axis=1)
fake_avg = df_tfidf[df_tfidf["label"] == "fake"].iloc[:, 2:].mean().sort_values(ascending=False)
real_avg = df_tfidf[df_tfidf["label"] == "real"].iloc[:, 2:].mean().sort_values(ascending=False)

# Display results
print("\n🔹 Top Keywords in **Fake News**:")
print(fake_avg.head(5))

print("\n🔹 Top Keywords in **Real News**:")
print(real_avg.head(5))



🔹 Top Keywords in **Fake News**:
accounts    0.100504
shocking    0.100504
offshore    0.100504
money       0.100504
scandal     0.100504
dtype: float64

🔹 Top Keywords in **Real News**:
new          0.171558
life         0.107179
signs        0.107179
habitable    0.107179
planet       0.107179
dtype: float64
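
To make the scores above easier to interpret, here is a minimal sketch of the weighting that `TfidfVectorizer` applies with its default settings, as far as I understand them: raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The function name `tfidf` and the toy documents are made up for illustration, and tokenization and stop-word removal are assumed to have happened already.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) using smoothed IDF and L2 row normalization."""
    n_docs = len(docs)
    vocab = sorted({term for doc in docs for term in doc})
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocab]
        # L2-normalize the row so every document vector has unit length.
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0
        rows.append([x / norm for x in raw])
    return vocab, rows

# Hypothetical, already-tokenized documents (illustration only).
toy_docs = [
    ["coffee", "study", "new"],
    ["planet", "new", "life"],
    ["scandal", "footage"],
]
vocab, rows = tfidf(toy_docs)
for term, weight in zip(vocab, rows[0]):
    if weight > 0:
        print(f"{term}: {weight:.3f}")
```

In the toy data, "new" appears in two documents, so it gets a lower weight than "coffee" within the first document. The same arithmetic is consistent with the tie in the fake-news output: a word that occurs once in a headline with 11 distinct remaining terms gets 1/√11 ≈ 0.3015, and averaging over the three fake headlines gives ≈ 0.1005.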


## 2️⃣ Word2Vec: Training Word Embeddings from Text
**Task:**
* Train a Word2Vec model using the Gensim library.
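* Extract and display the 5 most similar words to a query word. The code below queries "models"; tokens are lowercased during preprocessing, so "NLP" would have to be looked up as "nlp".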

In [2]:
!pip install numpy==1.26.4 scipy==1.11.4 gensim==4.3.2



In [3]:
import nltk
import gensim
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

# Download necessary resources
nltk.download('punkt_tab')
nltk.download('stopwords')

# Sample corpus (Multiple Sentences for Better Training)
text_data = [
    "Natural Language Processing helps machines understand human language.",
    "Machine learning models improve NLP tasks significantly.",
    "Word embeddings like Word2Vec capture word meanings in large datasets.",
    "Deep learning methods such as transformers are revolutionizing NLP.",
    "TF-IDF is used for ranking important words in documents."
]

# Preprocessing function (lowercase, tokenize, remove stopwords & punctuation)
STOP_WORDS = set(stopwords.words('english'))  # Build the set once instead of once per token

def preprocess(text):
    tokens = word_tokenize(text.lower())  # Lowercase & tokenize
    # Keep alphanumeric tokens that are not stopwords
    return [word for word in tokens if word.isalnum() and word not in STOP_WORDS]

# Preprocess all sentences
tokenized_corpus = [preprocess(sentence) for sentence in text_data]

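# Train a Word2Vec model: 100-dimensional vectors, context window of 5, 4 worker threads
# Note: with workers > 1 the training order is not deterministic, so similarity scores can vary between runs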
model = Word2Vec(sentences=tokenized_corpus, vector_size=100, window=5, min_count=1, workers=4)

# Find most similar words to "models"
similar_words = model.wv.most_similar("models", topn=5)

print("Top 5 words similar to 'models':", similar_words)

Top 5 words similar to 'models': [('natural', 0.18190345168113708), ('learning', 0.1726931780576706), ('processing', 0.16684965789318085), ('word2vec', 0.13273073732852936), ('meanings', 0.1122976541519165)]


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
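
The `window` parameter controls which neighbouring words count as context for each word during training. The sketch below enumerates those (center, context) pairs for one preprocessed sentence. It is a simplification: the helper name `context_pairs` is hypothetical, and details of gensim's actual training loop (such as subsampling or randomly shrinking the window) are ignored.

```python
def context_pairs(tokens, window=2):
    """Return (center, context) pairs for words within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

# Tokens resembling the preprocessed second sentence of the corpus.
tokens = ["machine", "learning", "models", "improve", "nlp", "tasks", "significantly"]
for center, context in context_pairs(tokens, window=2):
    if center == "models":
        print(center, "->", context)
```

With only five short sentences, the model sees very few such pairs, which is one plausible reason the similarity scores above are low and close to each other.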


## 3️⃣ FastText: Handling Rare Words in Embeddings
**Task:**
* Train a FastText model on a simple dataset.
* Test it on an out-of-vocabulary (OOV) word to see how it performs compared to Word2Vec.

In [10]:

# FastText: Handling Rare Words in Embeddings
from gensim.models import FastText

# Sample corpus
text_data = [
    "Deep learning powers NLP applications.",
    "FastText can generate word vectors for unseen words.",
    "Word embeddings are useful in semantic search.",
]

# Tokenize text
tokenized_corpus = [word_tokenize(sentence.lower()) for sentence in text_data]

# Train FastText model
fasttext_model = FastText(sentences=tokenized_corpus, vector_size=100, window=5, min_count=1, workers=4)

# Test on an OOV word
# "Word" (capitalized) is out of vocabulary because the corpus was lowercased before training
oov_word = "Word"
vector_representation = fasttext_model.wv[oov_word]

print(f"Vector for '{oov_word}':", vector_representation[:10])  # Show first 10 dimensions


Vector for 'Word': [ 3.5652798e-03  2.3734148e-03  1.3581426e-04 -1.3525041e-03
 -2.7393273e-03 -1.4782941e-03  8.5610896e-05  3.7341891e-04
  7.5276012e-06 -1.7403113e-04]
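
FastText can return a vector for an unseen word because it represents each word through its character n-grams. The sketch below lists the n-grams for "word" and "Word", assuming boundary markers `<` and `>` and n-gram lengths 3 to 6 (gensim's default `min_n` and `max_n`, to my knowledge). The helper `char_ngrams` is hypothetical; the real implementation also hashes n-grams into buckets, which is omitted here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the set of character n-grams of `word` wrapped in < and >."""
    wrapped = f"<{word}>"
    grams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    return grams

seen = char_ngrams("word")    # in the training vocabulary
unseen = char_ngrams("Word")  # not in the vocabulary (different case)
print("shared n-grams:", sorted(seen & unseen))
```

The two spellings share only the n-grams that do not touch the first letter, so the vector for "Word" is built from a small overlap with trained n-grams. A plain Word2Vec model has no such fallback and would raise a `KeyError` for the same lookup.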


## 4️⃣ GloVe: Using Pre-Trained Embeddings
**Task:**
* Load pre-trained GloVe embeddings.
* Find the cosine similarity between "king" and "queen".

In [11]:

# Download the pre-trained GloVe embeddings
import urllib.request
import os
import zipfile

url = 'https://huggingface.co/stanfordnlp/glove/resolve/main/glove.6B.zip'
output = os.path.join(os.getcwd(), 'glove.6B.zip')  # Save to the current working directory

urllib.request.urlretrieve(url, output)

# Unzip the file
with zipfile.ZipFile('./glove.6B.zip', 'r') as zip_ref:
    zip_ref.extractall('./glove')



In [15]:

# GloVe: Using Pre-Trained Embeddings
# (cosine similarity is implemented manually below, so no scikit-learn import is needed)
import numpy as np

# Load GloVe embeddings
glove_path = "./glove/glove.6B.100d.txt"  # Extracted by the download cell above (/content/glove on Colab)

# Read the file and store embeddings
embeddings_dict = {}
with open(glove_path, "r", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], dtype="float32")
        embeddings_dict[word] = vector

# Compute cosine similarity
def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# Test similarity between "king" and "queen"
vector_king = embeddings_dict["king"]
vector_queen = embeddings_dict["queen"]
similarity_score = cosine_similarity(vector_king, vector_queen)

print("Cosine similarity between 'king' and 'queen':", similarity_score)


Cosine similarity between 'king' and 'queen': 0.750769
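
The same cosine formula can be used to rank nearest neighbours instead of comparing a single pair. The sketch below does this on made-up 3-dimensional vectors so that it runs without the GloVe file; the names `cosine`, `nearest`, and `toy` are hypothetical, and the numbers are not real GloVe values.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word, embeddings, topn=3):
    """Return the `topn` other words ranked by cosine similarity to `word`."""
    query = embeddings[word]
    scores = [
        (other, cosine(query, vec))
        for other, vec in embeddings.items()
        if other != word
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:topn]

# Hypothetical 3-dimensional vectors, made up for illustration only.
toy = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.2],
    "apple": [0.1, 0.2, 0.9],
}
print(nearest("king", toy, topn=2))
```

Applied to `embeddings_dict`, this loop would visit every word in the vocabulary for each query, so a vectorized NumPy version is likely a better fit for the full GloVe file.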
