# 🧪 Method Comparison on Sentiment Detection

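We compare how each vectorization method performs on sentiment detection using a small labeled corpus of four sentences. The models are scored on the same sentences they are trained on, so the accuracies below show the workflow rather than how well each method generalizes.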

In [None]:
# Sample sentiment-labeled corpus
sent_corpus = [
    "I love this movie",
    "This film was fantastic",
    "I hated this movie",
    "This film was terrible"
]
sent_labels = [1, 1, 0, 0]  # 1 = Positive, 0 = Negative

### 🎯 Bag of Words and TF-IDF Performance

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# BoW: each sentence becomes a vector of raw term counts
cv_bow = CountVectorizer()
X_bow = cv_bow.fit_transform(sent_corpus)
clf_bow = LogisticRegression().fit(X_bow, sent_labels)
# Predictions are made on the training sentences, so this is training accuracy
bow_preds = clf_bow.predict(X_bow)
print("BoW Accuracy:", accuracy_score(sent_labels, bow_preds))

# TF-IDF: counts are reweighted so terms found in fewer sentences count more
tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(sent_corpus)
clf_tfidf = LogisticRegression().fit(X_tfidf, sent_labels)
# Again scored on the training sentences
tfidf_preds = clf_tfidf.predict(X_tfidf)
print("TF-IDF Accuracy:", accuracy_score(sent_labels, tfidf_preds))
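
Because the scores above are computed on the training sentences, they say little about unseen text. One way to get a rough held-out estimate on a corpus this small is leave-one-out cross-validation. The sketch below assumes scikit-learn's `LeaveOneOut`, `cross_val_score`, and `make_pipeline`; the helper name `loo_scores` is hypothetical and not part of the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline

sent_corpus = [
    "I love this movie",
    "This film was fantastic",
    "I hated this movie",
    "This film was terrible",
]
sent_labels = [1, 1, 0, 0]  # 1 = Positive, 0 = Negative

def loo_scores(vectorizer):
    """Return one 0/1 accuracy score per held-out sentence (hypothetical helper)."""
    # The pipeline refits the vectorizer inside each fold, so the held-out
    # sentence never contributes to the vocabulary.
    model = make_pipeline(vectorizer, LogisticRegression())
    return cross_val_score(model, sent_corpus, sent_labels, cv=LeaveOneOut())

bow_scores = loo_scores(CountVectorizer())
tfidf_scores = loo_scores(TfidfVectorizer())
print("BoW leave-one-out accuracy:", bow_scores.mean())
print("TF-IDF leave-one-out accuracy:", tfidf_scores.mean())
```

With only four sentences each fold trains on three examples, so these numbers are noisy and can be much lower than the training accuracy. They are best read as a sanity check, not a benchmark.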

### 🧠 Word2Vec Performance (Averaged Word Vectors)

In [None]:
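# Requires gensim (pip install gensim); reuses LogisticRegression and
# accuracy_score imported in the previous cell
from gensim.models import Word2Vec
import numpy as np

# Train a small Word2Vec model on the lowercased, whitespace-tokenized sentences
tokenized_sent = [s.lower().split() for s in sent_corpus]
w2v_model = Word2Vec(sentences=tokenized_sent, vector_size=50, window=2, min_count=1, seed=1)

def average_vector(sentence, model):
    """Represent a sentence as the mean of its in-vocabulary word vectors."""
    tokens = sentence.lower().split()
    vectors = [model.wv[w] for w in tokens if w in model.wv]
    # Fall back to a zero vector when no token is in the vocabulary
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)

X_w2v = np.array([average_vector(s, w2v_model) for s in sent_corpus])
clf_w2v = LogisticRegression().fit(X_w2v, sent_labels)
# Scored on the training sentences, like the BoW and TF-IDF models
w2v_preds = clf_w2v.predict(X_w2v)
print("Word2Vec Accuracy:", accuracy_score(sent_labels, w2v_preds))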

## ✅ Observations

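- **BoW** works well when training and test texts share exact terms, but a word it has never seen contributes nothing to the feature vector.
- **TF-IDF** down-weights terms that appear in most sentences (such as "this") so that rarer, more informative words (such as "love") carry more weight.
- **Word2Vec** can capture semantic relationships (e.g., "love" and "fantastic" are both positive), which helps it generalize to unseen wording. Learning such relationships takes far more text than the four sentences used here, so in practice the vectors would come from a large corpus or a pretrained model.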
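
The vocabulary dependence of BoW can be seen by transforming sentences the vectorizer was not fitted on. In this minimal sketch the two new sentences are hypothetical examples, not part of the original corpus. One reuses training words and the other expresses a similar sentiment with entirely different words.

```python
from sklearn.feature_extraction.text import CountVectorizer

sent_corpus = [
    "I love this movie",
    "This film was fantastic",
    "I hated this movie",
    "This film was terrible",
]

cv = CountVectorizer().fit(sent_corpus)

# Hypothetical unseen reviews (assumed for illustration only)
new_sentences = ["I love this film", "An awesome picture"]
X_new = cv.transform(new_sentences)

# Number of in-vocabulary tokens counted for each new sentence
known_terms = X_new.toarray().sum(axis=1)
for sentence, n_known in zip(new_sentences, known_terms):
    print(f"{sentence!r}: {n_known} known terms")
```

The second sentence maps to an all-zero vector because none of its words are in the fitted vocabulary, so a classifier trained on these features has nothing to work with.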

This illustrates that **dense embeddings like Word2Vec**, which are learned from the contexts words appear in, tend to be more robust to variation in wording, while **BoW/TF-IDF** are literal and vocabulary-dependent.
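
To make the TF-IDF observation concrete, the IDF part of the weighting can be computed by hand for the sample corpus. This is a sketch, assuming the smoothed formula that `TfidfVectorizer` uses by default, `ln((1 + n) / (1 + df)) + 1`, and a regex that approximates its default tokenization. The `tokenize` helper is hypothetical, and the sketch omits the term-frequency factor and the per-sentence L2 normalization that `TfidfVectorizer` also applies.

```python
import math
import re
from collections import Counter

sent_corpus = [
    "I love this movie",
    "This film was fantastic",
    "I hated this movie",
    "This film was terrible",
]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters, which
    # approximates the default tokenization of scikit-learn's vectorizers.
    return re.findall(r"\b\w\w+\b", text.lower())

n_docs = len(sent_corpus)

# Document frequency: number of sentences that contain each term
doc_freq = Counter(term for s in sent_corpus for term in set(tokenize(s)))

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

# Print terms from lowest to highest weight
for term, weight in sorted(idf.items(), key=lambda kv: kv[1]):
    print(f"{term:10s} df={doc_freq[term]} idf={weight:.3f}")
```

Here "this" occurs in every sentence and gets the lowest weight, while words that occur in a single sentence, such as "love" and "terrible", get the highest.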