# Sentiment Analysis Tutorial Using TF-IDF Vectorization

This notebook shows how to perform sentiment analysis on text data using classical machine learning techniques with TF-IDF vectorization.  
We will cover:

- What TF-IDF is and why it’s useful  
- Preprocessing text using **stemming** and **lemmatization**  
- Vectorizing text with TF-IDF using unigrams and bigrams  
- Training a classifier on the TF-IDF features (the code below uses Logistic Regression)  
- Comparing accuracy scores for different preprocessing and vectorization choices

---

## What is TF-IDF?

**TF-IDF** stands for **Term Frequency - Inverse Document Frequency**. It is a numerical statistic that reflects how important a word is to a document in a collection (corpus).

- **Term Frequency (TF):** Measures how often a term appears in a document.  
- **Inverse Document Frequency (IDF):** Measures how unique or rare a term is across all documents. Words common across many documents get lower weight.

Multiplying TF and IDF gives the TF-IDF score, which is high for words that are frequent in a document but rare in the rest of the corpus.
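As a rough illustration of the definition, the sketch below computes the textbook form of TF-IDF on a toy corpus. The corpus and the helper names `tf`, `idf`, and `tfidf` are made up for this example, and library implementations usually apply extra smoothing and normalization on top of this basic formula.

```python
import math

# Toy corpus for illustration only (not the dataset used in this notebook).
corpus = [
    ["the", "movie", "was", "great"],
    ["the", "movie", "was", "boring"],
    ["the", "plot", "was", "great"],
]

def tf(term, doc):
    # Term frequency: share of the document's tokens that are `term`.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Inverse document frequency: log of (number of docs / docs containing `term`).
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tfidf("the", corpus[1], corpus))     # 0.0, because "the" occurs in every document
print(tfidf("boring", corpus[1], corpus))  # about 0.2747, because "boring" occurs in one document only
```

A word that appears in every document gets an IDF of `log(1) = 0`, so its score vanishes no matter how often it occurs.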

**Why use TF-IDF?**

- Unlike simple counts (Bag of Words), TF-IDF reduces the impact of very common words (like "the", "is") which carry less meaning.  
- It better reflects the importance of words for classification or retrieval tasks.

---

## The Dataset

We use a small toy dataset of 10 movie-review sentences in the style of IMDB reviews, each labeled as positive (1) or negative (0) sentiment.

---

In [1]:
import pandas as pd
import numpy as np
import nltk
import string

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ARITRA\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ARITRA\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ARITRA\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
# Cell 2 - Sample dataset
data = {
    "text": [
        "I loved the movie, it was fantastic!",
        "What a terrible film. I hated it.",
        "An excellent movie with a great story.",
        "Worst movie I have ever seen.",
        "Absolutely wonderful experience, highly recommended!",
        "Horrible acting and bad direction.",
        "The plot was dull and boring.",
        "A masterpiece. Beautifully made and touching.",
        "Terrible! Do not waste your time.",
        "Brilliant performance and amazing visuals."
    ],
    "label": [1, 0, 1, 0, 1, 0, 0, 1, 0, 1]  # 1=Positive, 0=Negative
}

df = pd.DataFrame(data)


## Text Preprocessing

We need to clean and normalize the text before feeding it to our model.

Steps include:

- Tokenizing text into words  
- Lowercasing  
- Removing punctuation and stop words (common words like "the", "is" which add little meaning)  
- Applying **stemming** (stripping suffixes to reach a root form that need not be a real word, e.g., "loved" → "love", "fantastic" → "fantast") or **lemmatization** (reducing words to their dictionary base form)

We print the tokens produced by each variant (stemming and lemmatization) so the two can be compared.
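To make the individual steps concrete, here is a dependency-free sketch that prints the intermediate result of each one for the first sentence. It is only an approximation of the NLTK-based cell that follows: the regular-expression tokenizer and the tiny `STOP_WORDS` set are assumptions made for this example, not NLTK's tokenizer or stop word list.

```python
import re

# Hypothetical mini stop list; the notebook itself uses NLTK's English stop words.
STOP_WORDS = {"i", "the", "it", "was", "a", "an", "and"}

sentence = "I loved the movie, it was fantastic!"

# Step 1: tokenize. Words and punctuation marks become separate tokens.
tokens = re.findall(r"\w+|[^\w\s]", sentence)
print("tokens:      ", tokens)

# Step 2: lowercase every token.
lowered = [t.lower() for t in tokens]
print("lowercased:  ", lowered)

# Step 3: keep alphabetic tokens only, which drops punctuation.
no_punct = [t for t in lowered if t.isalpha()]
print("no punct:    ", no_punct)

# Step 4: drop stop words.
content = [t for t in no_punct if t not in STOP_WORDS]
print("content only:", content)  # ['loved', 'movie', 'fantastic']
```

Stemming or lemmatization would then be applied to the remaining content words.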


In [3]:
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

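def preprocess_text(text, method='stem'):
    """Tokenize, lowercase, and filter `text`, then stem ('stem') or lemmatize ('lemm') the tokens."""
    tokens = word_tokenize(text.lower())  # lowercase, then tokenize
    # keep alphabetic tokens only (drops punctuation and numbers) and remove stop words
    tokens = [word for word in tokens if word.isalpha() and word not in stop_words]

    if method == 'stem':
        tokens = [stemmer.stem(word) for word in tokens]
    elif method == 'lemm':
        tokens = [lemmatizer.lemmatize(word) for word in tokens]
    # any other value of `method` leaves the filtered tokens unchanged

    return tokens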


In [4]:
print("Output after Stemming:")
stemmed_texts = df['text'].apply(lambda x: preprocess_text(x, method='stem'))
for i, tokens in enumerate(stemmed_texts):
    print(f"Sentence {i+1}:", tokens)

# Explanation:
# Stemming cuts words down to their root form, often chopping off suffixes.
# For example, 'loved' becomes 'love', 'fantastic' becomes 'fantast'.


Output after Stemming:
Sentence 1: ['love', 'movi', 'fantast']
Sentence 2: ['terribl', 'film', 'hate']
Sentence 3: ['excel', 'movi', 'great', 'stori']
Sentence 4: ['worst', 'movi', 'ever', 'seen']
Sentence 5: ['absolut', 'wonder', 'experi', 'highli', 'recommend']
Sentence 6: ['horribl', 'act', 'bad', 'direct']
Sentence 7: ['plot', 'dull', 'bore']
Sentence 8: ['masterpiec', 'beauti', 'made', 'touch']
Sentence 9: ['terribl', 'wast', 'time']
Sentence 10: ['brilliant', 'perform', 'amaz', 'visual']


In [5]:
print("\nOutput after Lemmatization:")
lemm_texts = df['text'].apply(lambda x: preprocess_text(x, method='lemm'))
for i, tokens in enumerate(lemm_texts):
    print(f"Sentence {i+1}:", tokens)

# Explanation:
# Lemmatization reduces words to their dictionary base form (lemma).
# It is usually more accurate than stemming but needs part-of-speech information to work well.
# E.g., 'loved' remains 'loved' here because WordNetLemmatizer.lemmatize() treats every word
# as a noun unless a pos argument is given; lemmatizer.lemmatize('loved', pos='v') returns 'love'.


Output after Lemmatization:
Sentence 1: ['loved', 'movie', 'fantastic']
Sentence 2: ['terrible', 'film', 'hated']
Sentence 3: ['excellent', 'movie', 'great', 'story']
Sentence 4: ['worst', 'movie', 'ever', 'seen']
Sentence 5: ['absolutely', 'wonderful', 'experience', 'highly', 'recommended']
Sentence 6: ['horrible', 'acting', 'bad', 'direction']
Sentence 7: ['plot', 'dull', 'boring']
Sentence 8: ['masterpiece', 'beautifully', 'made', 'touching']
Sentence 9: ['terrible', 'waste', 'time']
Sentence 10: ['brilliant', 'performance', 'amazing', 'visuals']


## TF-IDF Vectorization

We now convert the processed text into numeric feature vectors using TF-IDF.

We will compare:

- **Unigram TF-IDF:** Only single words (1-grams) as features  
- **Bigram TF-IDF:** Single words and pairs of consecutive words (1-grams + 2-grams)
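The difference between the two settings can be sketched in a few lines of plain Python. The helper `ngrams` is a hypothetical name used only for this illustration. It mimics how word n-grams are formed from consecutive tokens, and is not the vectorizer's internal code.

```python
def ngrams(tokens, n):
    # Join every run of n consecutive tokens into one space-separated feature.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Stemmed tokens of sentence 3 from the stemming output above.
tokens = ["excel", "movi", "great", "stori"]

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

print(unigrams)  # ['excel', 'movi', 'great', 'stori']
print(bigrams)   # ['excel movi', 'movi great', 'great stori']

# With unigrams and bigrams together, this sentence contributes 7 candidate features.
features = unigrams + bigrams
print(len(features))  # 7
```

Adding bigrams captures some word order, at the cost of a larger and sparser vocabulary, which matters on a dataset this small.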

For each case, we will:

- Print the learned vocabulary (all features)  
- Print the TF-IDF vector for the first sentence  
- Explain what the vector values mean  
- Train and test a Logistic Regression classifier  
- Report accuracy
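To see where the vector values come from, the sketch below recomputes the unigram TF-IDF weights of the first stemmed sentence by hand. It assumes scikit-learn's documented defaults for `TfidfVectorizer`: raw counts as TF, the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The helper names `smoothed_idf` and `tfidf_row` are made up for this example.

```python
import math
from collections import Counter

# Stemmed tokens copied from the "Output after Stemming" cell above.
docs = [
    ["love", "movi", "fantast"],
    ["terribl", "film", "hate"],
    ["excel", "movi", "great", "stori"],
    ["worst", "movi", "ever", "seen"],
    ["absolut", "wonder", "experi", "highli", "recommend"],
    ["horribl", "act", "bad", "direct"],
    ["plot", "dull", "bore"],
    ["masterpiec", "beauti", "made", "touch"],
    ["terribl", "wast", "time"],
    ["brilliant", "perform", "amaz", "visual"],
]

n_docs = len(docs)
# Document frequency: number of documents in which each term occurs.
doc_freq = Counter(term for doc in docs for term in set(doc))

def smoothed_idf(term):
    # Smoothed IDF, assuming scikit-learn's default smooth_idf=True.
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_row(doc):
    # Raw count times IDF, then scale the row to unit Euclidean length.
    raw = {t: c * smoothed_idf(t) for t, c in Counter(doc).items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

first = tfidf_row(docs[0])
for term, weight in sorted(first.items()):
    print(f"{term}: {weight:.4f}")
# fantast: 0.6258
# love: 0.6258
# movi: 0.4655
```

"movi" occurs in three sentences, so it gets a lower weight than "love" and "fantast", which each occur in only one. These numbers agree with the non-zero entries of the TF-IDF vector printed for the first sentence in the unigram run.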

In [6]:
stemmed_texts_str = stemmed_texts.apply(lambda x: ' '.join(x))
lemm_texts_str = lemm_texts.apply(lambda x: ' '.join(x))


In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedShuffleSplit
import numpy as np

def train_and_evaluate_logreg(texts, labels, ngram_range=(1,1)):
    vectorizer = TfidfVectorizer(ngram_range=ngram_range)
    X = vectorizer.fit_transform(texts)
    y = labels
    
    print(f"\nVocabulary (ngram_range={ngram_range}):\n", vectorizer.get_feature_names_out())
    print(f"\nTF-IDF vector for first sentence:\n", X.toarray()[0])
    print("\nExplanation: Each value is the TF-IDF score for that feature (word or n-gram) in the first sentence.\n"
          "Higher values mean the term is more important in this document relative to the corpus.")
    
    # Single stratified 70/30 split, so both classes appear in train and test
    sss = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
    train_index, test_index = next(sss.split(X, y))
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    print("Train labels distribution:", np.bincount(y_train))
    print("Test labels distribution:", np.bincount(y_test))
    
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    
    print(f"Accuracy on test set: {acc:.3f}")
    return acc

# Run all four combinations: stemmed or lemmatized text, with unigram or unigram+bigram features
acc1 = train_and_evaluate_logreg(stemmed_texts_str, df['label'], ngram_range=(1,1))
acc2 = train_and_evaluate_logreg(stemmed_texts_str, df['label'], ngram_range=(1,2))
acc3 = train_and_evaluate_logreg(lemm_texts_str, df['label'], ngram_range=(1,1))
acc4 = train_and_evaluate_logreg(lemm_texts_str, df['label'], ngram_range=(1,2))



Vocabulary (ngram_range=(1, 1)):
 ['absolut' 'act' 'amaz' 'bad' 'beauti' 'bore' 'brilliant' 'direct' 'dull'
 'ever' 'excel' 'experi' 'fantast' 'film' 'great' 'hate' 'highli'
 'horribl' 'love' 'made' 'masterpiec' 'movi' 'perform' 'plot' 'recommend'
 'seen' 'stori' 'terribl' 'time' 'touch' 'visual' 'wast' 'wonder' 'worst']

TF-IDF vector for first sentence:
 [0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.62583988 0.         0.         0.         0.         0.
 0.62583988 0.         0.         0.46545557 0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.        ]

Explanation: Each value is the TF-IDF score for that feature (word or n-gram) in the first sentence.
Higher values mean the term is more important in this document relative to the corpus.
Train labels distribution: [3 4]
Test labels distribution: [2 1]
Accuracy on test set: 0.333

Vocabulary (ngra