# IMDB Movie Review Sentiment Classification

**Goal:** Train a model to predict whether a movie review is positive (1) or negative (0)

**Approach:** Logistic Regression with TF-IDF vectorization

This notebook follows a similar structure to the SMS spam classification project.

## Section 1: Loading and Exploring the Dataset

In [None]:
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

print("✓ Libraries imported successfully")

In [None]:
# Load the training dataset
with open('train.json', 'r', encoding='utf-8') as f:
    train_data = json.load(f)

# Load the test dataset
with open('test.json', 'r', encoding='utf-8') as f:
    test_data = json.load(f)

print(f"✓ Training samples loaded: {len(train_data)}")
print(f"✓ Test samples loaded: {len(test_data)}")

In [None]:
# Convert to DataFrame
train_df = pd.DataFrame(train_data)
test_df = pd.DataFrame(test_data)

print("=" * 70)
print("DATASET INFORMATION")
print("=" * 70)

In [None]:
# Display basic information about the dataset
print("\n-------------------- HEAD --------------------")
print(train_df.head())

print("\n-------------------- DESCRIBE --------------------")
print(train_df.describe())

print("\n-------------------- INFO --------------------")
print(train_df.info())

In [None]:
# Check for missing values
print("\nMissing values:")
print(train_df.isnull().sum())

# Check for duplicates
print(f"\nDuplicate entries: {train_df.duplicated().sum()}")

# Remove duplicates if any (count first: after dropping, the count is always 0)
n_duplicates = train_df.duplicated().sum()
if n_duplicates > 0:
    train_df = train_df.drop_duplicates()
    print(f"✓ Removed {n_duplicates} duplicates")

In [None]:
# Label distribution
print("\nLabel Distribution:")
print(train_df['label'].value_counts())

# Visualize label distribution
plt.figure(figsize=(8, 5))
train_df['label'].value_counts().plot(kind='bar', color=['salmon', 'lightblue'])
plt.title('Distribution of Positive vs Negative Reviews')
plt.xlabel('Label (0=Negative, 1=Positive)')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

In [None]:
# Analyze text lengths
train_df['text_length'] = train_df['text'].apply(len)
train_df['word_count'] = train_df['text'].apply(lambda x: len(x.split()))

print("\nText Statistics:")
print(train_df[['text_length', 'word_count']].describe())

# Average words per review by sentiment
print("\nAverage word count by label:")
print(train_df.groupby('label')['word_count'].mean())

## Section 2: Preprocessing the Dataset

Similar to the SMS spam preprocessing, we will:
1. Lowercase the text
2. Remove HTML tags and unwanted characters
3. Tokenize
4. Remove stop words
5. Apply stemming

### Lowercasing the Text

Converting all text to lowercase ensures consistency and shrinks the vocabulary, since variants such as "Great" and "great" map to the same token.

In [None]:
print("=" * 70)
print("PREPROCESSING PIPELINE")
print("=" * 70)

print("\n=== BEFORE PREPROCESSING ===")
print(train_df['text'].head(2).values)

In [None]:
# Lowercase the text
train_df['text'] = train_df['text'].str.lower()
test_df['text'] = test_df['text'].str.lower()

print("\n=== AFTER LOWERCASING ===")
print(train_df['text'].head(2).values)

### Removing HTML Tags and Special Characters

Movie reviews often contain HTML tags like `<br />`. We remove these, then drop every character that is not a lowercase letter or whitespace, which removes digits and all punctuation.
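
To make the effect of this cleaning concrete, here is a minimal sketch that applies the same kinds of substitutions to a single made-up review. The function name `strip_markup` and the sample string are illustrative assumptions, not part of the notebook.

```python
import re

def strip_markup(text):
    """Illustrative helper: drop HTML tags, then everything except a-z and whitespace."""
    text = text.lower()
    text = re.sub(r'<br\s*/?>', ' ', text)    # line-break tags become spaces
    text = re.sub(r'<[^>]+>', '', text)       # any other tag is removed outright
    text = re.sub(r'[^a-z\s]', '', text)      # digits, punctuation, apostrophes are dropped
    return re.sub(r'\s+', ' ', text).strip()  # collapse runs of whitespace

# Made-up sample review
sample = "Don't miss it!<br /><br />I gave it 10/10 <b>twice</b>."
print(strip_markup(sample))  # dont miss it i gave it twice
```

Note that `don't` becomes `dont` and the rating `10/10` disappears entirely. A contraction stripped of its apostrophe will not match a stop-word entry spelled with one, so some contractions may survive the later stop word filter.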

In [None]:
import re

def clean_text(text):
    """Remove HTML tags and special characters"""
    # Remove HTML tags
    text = re.sub(r'<br\s*/?>', ' ', text)
    text = re.sub(r'<[^>]+>', '', text)
    # Keep only lowercase letters and whitespace (assumes text is already lowercased)
    text = re.sub(r'[^a-z\s]', '', text)
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

train_df['text'] = train_df['text'].apply(clean_text)
test_df['text'] = test_df['text'].apply(clean_text)

print("\n=== AFTER REMOVING HTML & SPECIAL CHARACTERS ===")
print(train_df['text'].head(2).values)

### Tokenization, Stop Word Removal, and Stemming

These steps normalize the text by:
- Breaking text into individual words (tokens)
- Removing common words that don't add meaning (stop words)
- Reducing words to their root form (stemming)

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Download necessary NLTK data
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)

print("✓ NLTK resources downloaded")

In [None]:
# Initialize stop words and stemmer
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess_text(text):
    """Complete preprocessing pipeline"""
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stop words and apply stemming
    tokens = [stemmer.stem(word) for word in tokens if word not in stop_words]
    # Join back into string
    return ' '.join(tokens)

print("Applying full preprocessing pipeline...")
train_df['processed_text'] = train_df['text'].apply(preprocess_text)
test_df['processed_text'] = test_df['text'].apply(preprocess_text)

print("\n=== AFTER FULL PREPROCESSING ===")
print("Original:", train_df['text'].iloc[0][:200])
print("Processed:", train_df['processed_text'].iloc[0][:200])
print("\n✓ Preprocessing completed")

## Section 3: Feature Extraction with TF-IDF

We use TF-IDF (Term Frequency-Inverse Document Frequency) vectorization, which:
- Weights each term by its frequency within a review, discounted by how many reviews contain it
- Includes unigrams and bigrams
- Filters out very common and very rare terms

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

print("=" * 70)
print("FEATURE EXTRACTION")
print("=" * 70)

# Split training data into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    train_df['processed_text'],
    train_df['label'],
    test_size=0.2,
    random_state=42,
    stratify=train_df['label']
)

print(f"\n✓ Training set: {len(X_train)} samples")
print(f"✓ Validation set: {len(X_val)} samples")
print(f"✓ Test set: {len(test_df)} samples")

In [None]:
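# Initialize a stand-alone TF-IDF vectorizer to inspect the feature matrix.
# The pipeline in Section 4 builds and fits its own vectorizer with the same settings.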
vectorizer = TfidfVectorizer(
    max_features=10000,  # Keep top 10k features
    ngram_range=(1, 2),  # Unigrams and bigrams
    min_df=5,            # Ignore terms appearing in < 5 documents
    max_df=0.8,          # Ignore terms appearing in > 80% of documents
)

print("\nFitting TF-IDF vectorizer...")
X_train_tfidf = vectorizer.fit_transform(X_train)
X_val_tfidf = vectorizer.transform(X_val)

print(f"\n✓ TF-IDF matrix shape: {X_train_tfidf.shape}")
print(f"✓ Number of features: {len(vectorizer.get_feature_names_out())}")

## Section 4: Model Training with Hyperparameter Tuning

Similar to the SMS spam classification, we use GridSearchCV to find the best hyperparameters.
We'll use Logistic Regression and tune the regularization parameter C.
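
Because the vectorizer is a step inside the pipeline, `GridSearchCV` refits it on the training portion of each fold, so the vocabulary and IDF weights are never computed from the held-out fold. The sketch below shows the same pattern on a tiny made-up dataset and prints the per-candidate scores from `cv_results_`. The data and `toy_` names are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up toy data standing in for preprocessed reviews
toy_texts = [
    "great film love", "love act great", "great stori love", "love great cast",
    "bad film bore", "bore act bad", "bad stori bore", "bore bad cast",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

toy_search = GridSearchCV(
    toy_pipeline,
    {"classifier__C": [0.1, 1.0, 10.0]},  # step name + "__" + parameter name
    cv=2,                                 # only 2 folds because the data is tiny
    scoring="f1",
)
toy_search.fit(toy_texts, toy_labels)

# One mean score per candidate value of C
for C, score in zip(toy_search.cv_results_["param_classifier__C"],
                    toy_search.cv_results_["mean_test_score"]):
    print(f"C={C}: mean CV F1 = {score:.3f}")
print("Best:", toy_search.best_params_)
```

In scikit-learn, `C` is the inverse of the regularization strength, so smaller values regularize more heavily.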

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

print("=" * 70)
print("MODEL TRAINING WITH HYPERPARAMETER TUNING")
print("=" * 70)

In [None]:
# Create pipeline
pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer(
        max_features=10000,
        ngram_range=(1, 2),
        min_df=5,
        max_df=0.8
    )),
    ('classifier', LogisticRegression(max_iter=1000, random_state=42))
])

print("✓ Pipeline created")

In [None]:
# Define parameter grid for hyperparameter tuning
param_grid = {
    'classifier__C': [0.1, 0.5, 1.0, 2.0, 5.0, 10.0],  # Inverse regularization strength (smaller = stronger)
    'classifier__penalty': ['l2']  # Regularization type
}

print("\nParameter grid for tuning:")
print(param_grid)

In [None]:
# Perform grid search with 5-fold cross-validation
print("\nPerforming GridSearchCV (this may take a few minutes)...")

grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring='f1',
    verbose=1,
    n_jobs=-1
)

# Fit on training data
grid_search.fit(X_train, y_train)

# Extract best model
best_model = grid_search.best_estimator_

print("\n" + "=" * 70)
print("BEST MODEL PARAMETERS")
print("=" * 70)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best F1 score (CV): {grid_search.best_score_:.4f}")

## Section 5: Model Evaluation

Evaluate the tuned model on the held-out validation split and on the test set. The test-set metrics assume `test.json` includes a `label` field.
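
For reference, the reported metrics can all be derived from the four cells of the confusion matrix. The sketch below works this through on a small made-up set of labels and predictions (not results from this model).

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up labels and predictions, for illustration only
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

# For binary labels 0/1, ravel() yields: true neg, false pos, false neg, true pos
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of the reviews predicted positive, how many were positive
recall = tp / (tp + fn)     # of the truly positive reviews, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(tn, fp, fn, tp)         # 3 1 1 3
print(precision, recall, f1)  # 0.75 0.75 0.75
```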

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report, confusion_matrix

print("=" * 70)
print("MODEL EVALUATION")
print("=" * 70)

In [None]:
# Predictions on validation set
y_val_pred = best_model.predict(X_val)
y_val_proba = best_model.predict_proba(X_val)

# Calculate metrics
val_accuracy = accuracy_score(y_val, y_val_pred)
val_precision = precision_score(y_val, y_val_pred)
val_recall = recall_score(y_val, y_val_pred)
val_f1 = f1_score(y_val, y_val_pred)

print("\nVALIDATION SET RESULTS")
print("=" * 70)
print(f"Accuracy:  {val_accuracy:.4f}")
print(f"Precision: {val_precision:.4f}")
print(f"Recall:    {val_recall:.4f}")
print(f"F1-Score:  {val_f1:.4f}")

print("\nClassification Report:")
print(classification_report(y_val, y_val_pred, target_names=['Negative', 'Positive']))

In [None]:
# Confusion matrix for validation set
cm_val = confusion_matrix(y_val, y_val_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(cm_val, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Negative', 'Positive'],
            yticklabels=['Negative', 'Positive'])
plt.title('Validation Set - Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.tight_layout()
plt.savefig('confusion_matrix_validation.png', dpi=150)
plt.show()

print("✓ Confusion matrix saved")

In [None]:
# Predictions on test set
y_test_pred = best_model.predict(test_df['processed_text'])
y_test_proba = best_model.predict_proba(test_df['processed_text'])

# Calculate metrics
test_accuracy = accuracy_score(test_df['label'], y_test_pred)
test_precision = precision_score(test_df['label'], y_test_pred)
test_recall = recall_score(test_df['label'], y_test_pred)
test_f1 = f1_score(test_df['label'], y_test_pred)

print("\nTEST SET RESULTS")
print("=" * 70)
print(f"Accuracy:  {test_accuracy:.4f}")
print(f"Precision: {test_precision:.4f}")
print(f"Recall:    {test_recall:.4f}")
print(f"F1-Score:  {test_f1:.4f}")

print("\nClassification Report:")
print(classification_report(test_df['label'], y_test_pred, target_names=['Negative', 'Positive']))

In [None]:
# Confusion matrix for test set
cm_test = confusion_matrix(test_df['label'], y_test_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(cm_test, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Negative', 'Positive'],
            yticklabels=['Negative', 'Positive'])
plt.title('Test Set - Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.tight_layout()
plt.savefig('confusion_matrix_test.png', dpi=150)
plt.show()

print("✓ Confusion matrix saved")

## Section 6: Feature Importance Analysis

Analyze which words are most indicative of positive vs negative sentiment.
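
As a rough guide to reading the coefficients: in logistic regression, a positive coefficient raises the log-odds of the positive class and a negative one lowers it, and `exp(coef)` gives the corresponding odds multiplier. The sketch below uses hypothetical feature names and coefficient values, chosen only to illustrate the ranking.

```python
import numpy as np

# Hypothetical feature names and coefficients, for illustration only
toy_features = np.array(["excel", "bore", "great", "wast", "film"])
toy_coefs = np.array([2.1, -1.8, 1.5, -2.4, 0.05])

order = np.argsort(toy_coefs)                  # ascending: most negative first
most_negative = toy_features[order[:2]]        # strongest negative indicators
most_positive = toy_features[order[::-1][:2]]  # strongest positive indicators

# exp(coef) is the multiplicative change in the odds of the positive class
# per one-unit increase in that feature's value, with other features held fixed
odds_ratios = np.exp(toy_coefs)

print(most_positive.tolist())  # ['excel', 'great']
print(most_negative.tolist())  # ['wast', 'bore']
```

Since TF-IDF values in an L2-normalized row are at most 1, the coefficients are best read as a relative ranking of terms rather than as literal per-word effects.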

In [None]:
print("=" * 70)
print("FEATURE IMPORTANCE ANALYSIS")
print("=" * 70)

# Get feature names and coefficients
vectorizer_from_pipeline = best_model.named_steps['vectorizer']
classifier_from_pipeline = best_model.named_steps['classifier']

feature_names = vectorizer_from_pipeline.get_feature_names_out()
coefficients = classifier_from_pipeline.coef_[0]

# Top positive features
top_positive_idx = np.argsort(coefficients)[-20:]
top_positive = [(feature_names[i], coefficients[i]) for i in top_positive_idx]

# Top negative features
top_negative_idx = np.argsort(coefficients)[:20]
top_negative = [(feature_names[i], coefficients[i]) for i in top_negative_idx]

print("\nTop 20 words indicating POSITIVE sentiment:")
for word, coef in reversed(top_positive):
    print(f"  {word:25s} {coef:8.4f}")

print("\nTop 20 words indicating NEGATIVE sentiment:")
for word, coef in top_negative:
    print(f"  {word:25s} {coef:8.4f}")

## Section 7: Testing on New Examples

Test the model with custom movie review examples.

In [None]:
print("=" * 70)
print("TESTING ON CUSTOM EXAMPLES")
print("=" * 70)

# Example movie reviews
example_reviews = [
    "This movie was absolutely fantastic! I loved every minute of it. Best film I've seen this year!",
    "Terrible movie. Complete waste of time and money. The acting was horrible and the plot made no sense.",
    "It was okay, nothing special but not terrible either. Just an average movie.",
    "A masterpiece! The direction, acting, and cinematography were all perfect. Highly recommended!",
    "Boring and predictable. I fell asleep halfway through. Don't waste your time on this garbage."
]

In [None]:
# Preprocess and predict
def predict_review(review_text):
    """Predict sentiment for a single review"""
    # Apply same preprocessing
    cleaned = clean_text(review_text.lower())
    processed = preprocess_text(cleaned)
    
    # Predict
    prediction = best_model.predict([processed])[0]
    probabilities = best_model.predict_proba([processed])[0]
    
    return prediction, probabilities

# Test each example
for i, review in enumerate(example_reviews, 1):
    pred, proba = predict_review(review)
    sentiment = "Positive" if pred == 1 else "Negative"
    # Indexing by the predicted label works because the classes are 0 and 1
    confidence = proba[pred] * 100
    
    print(f"\nExample {i}:")
    print(f"Review: {review}")
    print(f"Prediction: {sentiment}")
    print(f"Confidence: {confidence:.2f}%")
    print(f"Probabilities: Negative={proba[0]:.2f}, Positive={proba[1]:.2f}")
    print("-" * 70)

## Section 8: Saving the Model

Save the trained model using joblib for future use.
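
What gets saved is the fitted pipeline object (vectorizer plus classifier); Python functions used for text cleaning are not part of it. The sketch below shows a dump and load round trip on a tiny made-up pipeline, writing to a temporary directory. The data, file name, and `toy_` names are illustrative assumptions.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up toy data standing in for preprocessed reviews
toy_texts = ["great film love", "love act great", "bad film bore", "bore act bad"]
toy_labels = [1, 1, 0, 0]

toy_model = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", LogisticRegression(max_iter=1000)),
]).fit(toy_texts, toy_labels)

# Round trip through a temporary file so nothing is left on disk
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.joblib")
    joblib.dump(toy_model, path)
    restored = joblib.load(path)

new_texts = ["great love", "bore bad"]
same = bool((toy_model.predict(new_texts) == restored.predict(new_texts)).all())
print("Predictions match after reload:", same)
```

A joblib file is generally best loaded with the same scikit-learn version that created it.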

In [None]:
import joblib

print("=" * 70)
print("SAVING MODEL")
print("=" * 70)

# Save the model
model_filename = 'skills_assessment.joblib'
joblib.dump(best_model, model_filename)

print(f"\n✓ Model saved to: {model_filename}")

# Get file size
import os
file_size = os.path.getsize(model_filename)
print(f"✓ File size: {file_size / 1024:.2f} KB ({file_size:,} bytes)")

In [None]:
# Save results to JSON
results = {
    'model_name': 'IMDB Sentiment Classification',
    'algorithm': 'Logistic Regression with TF-IDF',
    'best_params': grid_search.best_params_,
    'validation': {
        'accuracy': float(val_accuracy),
        'precision': float(val_precision),
        'recall': float(val_recall),
        'f1_score': float(val_f1)
    },
    'test': {
        'accuracy': float(test_accuracy),
        'precision': float(test_precision),
        'recall': float(test_recall),
        'f1_score': float(test_f1)
    }
}

with open('skills_assessment_results.json', 'w') as f:
    json.dump(results, f, indent=2)

print("\n✓ Results saved to: skills_assessment_results.json")

## Section 9: Loading and Testing the Saved Model

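Demonstrate how to load and use the saved model. The saved pipeline contains only the vectorizer and classifier, so new text must still pass through `clean_text` and `preprocess_text` before prediction.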

In [None]:
print("=" * 70)
print("LOADING SAVED MODEL")
print("=" * 70)

# Load the saved model
loaded_model = joblib.load(model_filename)

print(f"\n✓ Model loaded from: {model_filename}")
print(f"✓ Model type: {type(loaded_model).__name__}")

In [None]:
# Test the loaded model
test_review = "This film is amazing and wonderful! A true masterpiece of cinema."

# Preprocess
cleaned = clean_text(test_review.lower())
processed = preprocess_text(cleaned)

# Predict with loaded model
prediction = loaded_model.predict([processed])[0]
probabilities = loaded_model.predict_proba([processed])[0]

print("\nTest with loaded model:")
print(f"Review: {test_review}")
print(f"Prediction: {'Positive' if prediction == 1 else 'Negative'}")
print(f"Confidence: {probabilities[prediction] * 100:.2f}%")

print("\n" + "=" * 70)
print("✓ TRAINING PIPELINE COMPLETED SUCCESSFULLY!")
print("=" * 70)

## Summary

This notebook implements an IMDB sentiment classification model following the same structure as the SMS spam classification project:

1. **Data Loading & Exploration**: Loaded 25,000 train + 25,000 test reviews
2. **Preprocessing**: Lowercasing, HTML removal, tokenization, stop word removal, stemming
3. **Feature Extraction**: TF-IDF with unigrams and bigrams (capped at 10,000 features)
4. **Model Training**: Logistic Regression with GridSearchCV tuning of `C` (5-fold CV, F1 scoring)
5. **Evaluation**: Accuracy, precision, recall, F1, and confusion matrices on validation and test sets
6. **Feature Analysis**: Identified most important positive/negative indicators
7. **Model Saving**: Saved as `skills_assessment.joblib` for deployment

Validation and test metrics are printed in Section 5 and written to `skills_assessment_results.json`. The saved pipeline expects input preprocessed as in Section 2.