
# 📌 NLP Assignment: Word Embeddings & Vectorization


## 🎯 Objective:
This assignment will deepen your understanding of word vectorization techniques: TF-IDF, Word2Vec, FastText, and GloVe.
Your goal is to fill in the blanks and complete the implementation.



## 1️⃣ TF-IDF: Identifying Fake vs. Real News Based on Keywords
**Task:**
* Use TfidfVectorizer from scikit-learn to extract important words from a given document.
* Identify the top 5 keywords in fake and real news.
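
To make the scores easier to interpret, here is a minimal pure-Python sketch of how a TF-IDF weight can be computed. It assumes the smoothed IDF formula that scikit-learn uses by default, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization. The three toy documents and the helper `tfidf` are hypothetical, and tokenization is simplified to whitespace splitting with no stop-word removal.

```python
import math
from collections import Counter

# Toy documents (illustrative only, not the assignment data)
docs = [
    "coffee study shows coffee benefits",
    "study confirms water on mars",
    "insider leaks secret documents",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for doc in tokenized for term in set(doc))

def tfidf(doc):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    term_freq = Counter(doc)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in term_freq.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

scores = tfidf(tokenized[0])
top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print(top)
```

In this sketch, "coffee" ranks first because it appears twice in its document and in no other, while "study" is down-weighted because it also occurs in the second document.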


In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample dataset (Fake & Real News Headlines)
data = {
    "headline": [
        "Breaking: Celebrity Caught in Secret Scandal! Fans are shocked as leaked footage surfaces online.",
        "Scientists Discover New Planet With Signs of Life! Astronomers say it could have a habitable atmosphere.",
        "Government Hiding Truth About UFOs, Says Insider! Documents reveal classified reports on alien encounters.",
        "New Study Shows Coffee Can Extend Your Lifespan. Researchers find evidence linking caffeine to longevity.",
        "Shocking: Politician Involved in Money Laundering Scheme! Investigation uncovers offshore bank accounts.",
        "NASA Confirms Water on Mars, A Big Step for Space Exploration. Experts believe this could lead to human settlement."
    ],
    "label": ["fake", "real", "fake", "real", "fake", "real"]  # Labels: "fake" or "real"
}

# Convert to DataFrame
df = pd.__________(data)

# Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer(stop_words="__________")

# Fit and transform headlines
tfidf_matrix = __________.__________(df["headline"])

# Convert to DataFrame
tfidf_df = pd.DataFrame(__________.toarray(), columns=__________.get_feature_names_out())

# Compute average TF-IDF score per category (fake vs. real)
df_tfidf = pd.concat([df, tfidf_df], axis=__________)
fake_avg = df_tfidf[df_tfidf["label"] == "fake"].iloc[:, 2:].__________().sort_values(ascending=False)
real_avg = df_tfidf[df_tfidf["label"] == "real"].iloc[:, 2:].mean().__________(ascending=False)

# Display results
print("\n🔹 Top Keywords in **Fake News**:")
print(fake_avg.head(__________))

print("\n🔹 Top Keywords in **Real News**:")
print(real_avg.head(__________))


## 2️⃣ Word2Vec: Training Word Embeddings from Text
**Task:**
* Train a Word2Vec model using the Gensim library.
* Extract and display the 5 most similar words to "NLP".
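
As background for the `window` parameter, the sketch below shows how a sliding context window turns a token list into (center, context) pairs, which is the kind of training signal Word2Vec learns from. The helper `skipgram_pairs` is hypothetical and is not part of Gensim. Note also that Gensim's `Word2Vec` defaults to CBOW, which uses the same windows but predicts the center word from its context, and that it samples a reduced window size during training, so this is an illustration rather than a reproduction of the library's internals.

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

# Tokens resembling one preprocessed sentence from the sample corpus
tokens = ["machine", "learning", "models", "improve", "nlp", "tasks"]
pairs = skipgram_pairs(tokens, window=2)

for center, context in pairs[:6]:
    print(center, "->", context)
```

A larger window produces more pairs per word and tends to capture broader topical similarity, while a smaller one emphasizes immediate neighbors.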

In [None]:
import nltk
import gensim
from __________.models import Word2Vec
from nltk.__________ import word_tokenize
from nltk.__________ import stopwords
import string

# Download necessary resources
nltk.download('punkt')
nltk.download('punkt_tab')  # Required by newer NLTK releases for word_tokenize
nltk.download('stopwords')

# Sample corpus (Multiple Sentences for Better Training)
text_data = [
    "Natural Language Processing helps machines understand human language.",
    "Machine learning models improve NLP tasks significantly.",
    "Word embeddings like Word2Vec capture word meanings in large datasets.",
    "Deep learning methods such as transformers are revolutionizing NLP.",
    "TF-IDF is used for ranking important words in documents."
]

# Preprocessing Function (Remove Stopwords & Punctuation)
def preprocess(text):
    tokens = __________(text.__________())  # Tokenize & Lowercase
    tokens = [__________ for word in __________ if word.isalnum() and word not in stopwords.__________('__________')]  # Remove stopwords & punctuation
    return tokens

# Preprocess all sentences
tokenized_corpus = [preprocess(sentence) for __________ in text_data]

# Train a Word2Vec model with a vector size of 100, a context window of 5, and 4 workers
model = Word2Vec(sentences=__________, vector_size=__________, window=__________, min_count=1, __________=4)

# Find the words most similar to "NLP" (tokens were lowercased during preprocessing)
similar_words = model.__________.most_similar("nlp", __________=5)

print("Top 5 words similar to 'NLP':", similar_words)


## 3️⃣ FastText: Handling Rare Words in Embeddings
**Task:**
* Train a FastText model on a simple dataset.
* Test it on an out-of-vocabulary (OOV) word to see how it performs compared to Word2Vec.
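
The reason FastText can return a vector for an unseen word is that it represents each word through its character n-grams. The sketch below illustrates that decomposition only. The helper `char_ngrams` is hypothetical, the `<` and `>` boundary markers and the 3 to 6 n-gram range follow the FastText defaults, and the real model additionally hashes n-grams into buckets and learns a vector for each, which is not shown here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

# "nlptech" is out of vocabulary, but "nlp" appears in the lowercased corpus
oov_grams = set(char_ngrams("nlptech"))
known_grams = set(char_ngrams("nlp"))

shared = oov_grams & known_grams
print(sorted(shared))
```

Because the OOV word shares subword units with a word seen during training, its vector can be assembled from n-gram vectors the model already learned, whereas Word2Vec would raise a `KeyError` for the same lookup.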

In [None]:

# FastText: Handling Rare Words in Embeddings
from gensim.__________ import FastText

# Sample corpus
text_data = [
    "Deep learning powers NLP applications.",
    "FastText can generate word vectors for unseen words.",
    "Word embeddings are useful in semantic search.",
]

# Tokenize text
tokenized_corpus = [__________(sentence.__________()) for sentence in __________]

# Train FastText model
fasttext_model = FastText(sentences=tokenized_corpus, __________=100, __________=5, __________=1, __________=4)

# Test on an OOV word
oov_word = "nlptech"
vector_representation = __________.wv[oov_word]

print(f"Vector for '{oov_word}':", vector_representation[:10])  # Show first 10 dimensions


## 4️⃣ GloVe: Using Pre-Trained Embeddings
**Task:**
* Load pre-trained GloVe embeddings.
* Find the cosine similarity between "king" and "queen".
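
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures direction rather than magnitude. Here is a small worked sketch using only the standard library. The three toy vectors are made up for illustration and are not GloVe embeddings, and the function name `cosine` is an assumption.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a, twice the length
c = [3.0, 0.0, -1.0]  # orthogonal to a (dot product is 0)

print("a vs b:", cosine(a, b))
print("a vs c:", cosine(a, c))
```

Vectors pointing the same way score close to 1 regardless of length, and orthogonal vectors score 0, which is why the measure suits comparing word embeddings of differing norms.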

In [None]:

# Download GloVe embeddings
import urllib.request
import os
import zipfile

url = 'https://huggingface.co/stanfordnlp/glove/resolve/main/glove.6B.zip'
output = os.__________.join(__________.getcwd(), 'glove.6B.zip')  # Save to the current working directory

urllib.__________.urlretrieve(__________, __________)

# Unzip the file
with zipfile.__________('./glove.6B.zip', 'r') as __________:
    zip_ref.__________('./glove')



In [None]:

# GloVe: Using Pre-Trained Embeddings
import __________ as __________

# Load GloVe embeddings
glove_path = "./__________/glove.6B.100d.txt"  # Ensure you have this file

# Read the file and store embeddings
embeddings_dict = {}
__________ open(glove_path, "r", __________="utf-8") as f:
    for line in f:
        values = __________.split()
        word = values[0]
        vector = np.__________(values[1:], __________="float32")
        embeddings_dict[word] = vector

# Compute cosine similarity
def cosine_similarity(vec1, vec2):
    return np.__________(vec1, vec2) / (np.__________.norm(vec1) * np.linalg.__________(vec2))

# Test similarity between "king" and "queen"
vector_king = embeddings_dict["king"]
vector_queen = embeddings_dict["queen"]
similarity_score = __________(vector_king, vector_queen)

print("Cosine similarity between 'king' and 'queen':", __________)
