<a href="https://colab.research.google.com/github/damiangohrh123/ml_projects/blob/main/classification/TF_IDF_sentiment_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🧠 IMDb Sentiment Classifier (TF-IDF + Logistic Regression)

In this project, we use the IMDb movie reviews dataset (50,000 labeled reviews, split evenly between training and test sets) to build a sentiment classifier using traditional machine learning techniques:

* Text preprocessing
* TF-IDF vectorization
* Logistic Regression classifier
* Evaluation metrics + predictions

This model serves as a strong baseline before moving to neural models (LSTM, CNN, DistilBERT).

In [None]:
# !pip install scikit-learn datasets matplotlib --quiet

import matplotlib.pyplot as plt
import numpy as np
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

## Step 1: Load & Inspect IMDb Dataset

In this section, we load the IMDb movie review dataset using the Hugging Face `datasets` library.  
The IMDb dataset is commonly used for sentiment analysis (positive vs. negative reviews).

### What this code does:
1. **Load the IMDb dataset** using `load_dataset("imdb")`.
2. **Select the training and test splits** from the dataset.
3. **Print the number of samples** in each split for reference.
4. **Extract the raw text and labels**:
   - `text` → the actual movie review text  
   - `label` → sentiment (0 = negative, 1 = positive)

After this step, we have:
- `X_train_raw` and `X_test_raw` containing the review texts  
- `y_train` and `y_test` containing the sentiment labels  

These variables will be used for tokenization, vectorization, and model training in the next steps.


In [None]:
# Load IMDb dataset from HuggingFace
dataset = load_dataset("imdb")

train_data = dataset["train"]
test_data = dataset["test"]

print("Training samples:", len(train_data))
print("Test samples:", len(test_data))

# Extract raw text + labels
X_train_raw = train_data["text"]
y_train = train_data["label"]

X_test_raw = test_data["text"]
y_test = test_data["label"]


Training samples: 25000
Test samples: 25000


## Step 2: Inspect a Sample Training Review
Before training, it's helpful to read a sample review and its label to understand what the data looks like.

In [None]:
print("Sample Review:\n")
print(X_train_raw[0])
print("\nLabel (0=neg, 1=pos):", y_train[0])

Sample Review:

I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and 

## Step 3: TF-IDF Vectorization

TF-IDF converts text into numeric vectors based on:

* Word frequency
* How unique a word is across documents
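
To make the two factors above concrete, here is a minimal pure-Python sketch of TF-IDF weighting. It assumes the defaults that scikit-learn documents for `TfidfVectorizer` (raw term counts, smoothed IDF, L2 normalization). The toy corpus, the whitespace tokenization, and the helper names `idf` and `tfidf` are illustrative assumptions, not part of this notebook's pipeline.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only).
docs = [
    "great movie great cast",
    "boring movie",
    "great fun",
]

# Simplification: split on whitespace instead of using a real tokenizer.
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
df = Counter(term for tokens in tokenized for term in set(tokens))


def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1


def tfidf(tokens):
    """Return an L2-normalized {term: weight} mapping for one document."""
    counts = Counter(tokens)
    weights = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


vec = tfidf(tokenized[1])  # the document "boring movie"
print(vec)
```

Both terms occur once in that document, but "boring" appears in only one of the three documents while "movie" appears in two, so "boring" receives the larger weight.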

Settings:

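* `max_features=50000`: keeps at most the 50,000 most frequent terms
* `ngram_range=(1, 2)` → unigrams + bigrams
* `stop_words="english"`: removes English stop words
* `min_df=5`, `max_df=0.8`: drops terms that appear in fewer than 5 documents or in more than 80% of documents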
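
As a rough illustration of what the unigram + bigram setting produces, the sketch below expands a token list into all contiguous 1-grams and 2-grams. The helper `ngrams` and the example tokens are hypothetical, and the tokens are assumed to have already had stop words removed.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous n-grams with n_min <= n <= n_max, joined by spaces."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out


# Hypothetical tokens, assumed to be what remains after stop-word removal.
tokens = ["complete", "waste", "time"]
grams = ngrams(tokens)
print(grams)
# ['complete', 'waste', 'time', 'complete waste', 'waste time']
```

Bigrams such as "waste time" let a linear model pick up short phrases that the individual words alone would not capture.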

In [None]:
vectorizer = TfidfVectorizer(
    max_features=50000,
    ngram_range=(1, 2),
    stop_words="english",
    min_df=5,
    max_df=0.8
)

# Fit to training data
X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)

print("TF-IDF Matrix Shape:", X_train.shape)

TF-IDF Matrix Shape: (25000, 50000)


## Step 4: Define the Model (Logistic Regression)

This is a strong baseline for text classification.

* Fast training
* Performs well with TF-IDF
* Easy to interpret
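
The interpretability comes from the model's form: a weighted sum of the features passed through a sigmoid. The sketch below shows that computation for a single review. The weights, the bias, and the function name `predict_proba_positive` are invented for illustration; a fitted model learns one weight per TF-IDF feature.

```python
import math

# Hypothetical per-term weights; real values come from training.
weights = {"great": 2.0, "boring": -2.5, "plot": 0.1}
bias = 0.0


def predict_proba_positive(features):
    """Map {term: tfidf_value} to P(label = 1) via a sigmoid over the linear score."""
    score = bias + sum(weights.get(term, 0.0) * value for term, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))


p = predict_proba_positive({"great": 0.8, "plot": 0.6})
label = 1 if p >= 0.5 else 0  # threshold at 0.5 for a binary decision
print(f"P(positive) = {p:.3f}, predicted label = {label}")
```

Terms with positive weights raise the score and terms with negative weights lower it, which is why the coefficients can be read directly as evidence for each class.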

In [None]:
model = LogisticRegression(
    max_iter=200,
    solver="lbfgs"
)

print(model)

LogisticRegression(max_iter=200)


## Step 5: Train the Model

In [None]:
model.fit(X_train, y_train)

## 📈 Step 6: Evaluate the Model

Compute:
* Accuracy
* Precision
* Recall
* F1 scores
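
For reference, these four metrics can be computed by hand from the prediction counts. The sketch below uses the standard definitions for the positive class; the function name `binary_metrics` and the toy labels are assumptions for illustration and do not come from the IMDb data.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels: 2 true positives, 1 false positive, 1 false negative, 2 true negatives.
toy_true = [1, 1, 1, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0]
metrics = binary_metrics(toy_true, toy_pred)
print(metrics)
```

`classification_report` reports precision, recall, and F1 once per class, treating each class in turn as the positive label.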

In [None]:
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {accuracy * 100:.2f}%\n")

print("Classification Report:")
print(classification_report(y_test, y_pred))

Test Accuracy: 88.28%

Classification Report:
              precision    recall  f1-score   support

           0       0.88      0.88      0.88     12500
           1       0.88      0.88      0.88     12500

    accuracy                           0.88     25000
   macro avg       0.88      0.88      0.88     25000
weighted avg       0.88      0.88      0.88     25000



## Step 7: Make Predictions

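We define a helper function that predicts the sentiment of any review by vectorizing it with the fitted TF-IDF vectorizer and passing the result to the trained model.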

In [None]:
def predict_review(text):
    vec = vectorizer.transform([text])
    pred = model.predict(vec)[0]
    return "Positive" if pred == 1 else "Negative"

sample_text = X_test_raw[10]

print("Review:\n", sample_text)
print("\nPrediction:", predict_review(sample_text))
print("Actual:", "Positive" if y_test[10] == 1 else "Negative")

Review:
 This flick is a waste of time.I expect from an action movie to have more than 2 explosions and some shooting.Van Damme's acting is awful. He never was much of an actor, but here it is worse.He was definitely better in his earlier movies. His screenplay part for the whole movie was probably not more than one page of stupid nonsense one liners.The whole dialog in the film is a disaster, same as the plot.The title "The Shepherd" makes no sense. Why didn't they just call it "Border patrol"? The fighting scenes could have been better, but either they weren't able to afford it, or the fighting choreographer was suffering from lack of ideas.This is a cheap low type of action cinema.

Prediction: Negative
Actual: Negative


## Step 8: Explore Important Words (Optional)

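Because the model is linear, every n-gram has a single coefficient. Large positive coefficients push a review toward the positive class and large negative ones toward the negative class, so sorting the coefficients reveals the most influential n-grams. `np.argsort` sorts in ascending order, so the strongest positive term appears last in its list and the strongest negative term appears first.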


In [None]:
feature_names = np.array(vectorizer.get_feature_names_out())
coeffs = model.coef_[0]

# Top 20 positive and negative n-grams
top_pos = feature_names[np.argsort(coeffs)[-20:]]
top_neg = feature_names[np.argsort(coeffs)[:20]]

print("Top Positive Words:\n", top_pos)
print("\nTop Negative Words:\n", top_neg)


Top Positive Words:
 ['bit' 'enjoy' 'definitely' 'fantastic' 'highly' 'superb' 'beautiful'
 'enjoyed' 'brilliant' 'today' 'fun' 'loved' 'love' 'favorite' 'amazing'
 'perfect' 'wonderful' 'best' 'excellent' 'great']

Top Negative Words:
 ['worst' 'bad' 'awful' 'boring' 'waste' 'poor' 'worse' 'terrible'
 'horrible' 'dull' 'poorly' 'unfortunately' 'script' 'stupid' 'supposed'
 'instead' 'annoying' 'disappointment' 'ridiculous' 'minutes']


## Step 9: Conclusion

In this notebook, we:

- Loaded the IMDb dataset  
- Vectorized the text using TF-IDF  
- Trained a Logistic Regression classifier  
- Achieved 88.28% test accuracy  
- Inspected a sample prediction and explored influential words  

This baseline will serve as a comparison for more advanced models:

1. LSTM / GRU  
2. CNN for text  
3. DistilBERT fine-tuning (GPU)
