# Sentiment Analysis of Movie Reviews - EDA and Preprocessing

**Project**: IMDb Movie Reviews Sentiment Classification  
**Author**: Aayushman Singh Chandel  
**Date**: December 24, 2025

---

## 📋 Notebook Overview

This notebook covers the initial phases of the sentiment analysis project:
1. **Data Loading**: Loading the IMDb dataset
2. **Exploratory Data Analysis**: Understanding data distribution and characteristics
3. **Text Preprocessing**: Cleaning and preparing text for modeling
4. **Feature Analysis**: Analyzing word patterns and frequencies
5. **Data Saving**: Saving processed data for modeling

**Expected Outcomes**:
- Clean, preprocessed dataset ready for ML modeling
- Comprehensive visualizations of data characteristics
- Insights into positive vs. negative review patterns

## 1. Import Required Libraries

In [None]:
# Data Manipulation
import pandas as pd
import numpy as np
import os
import re
import warnings
warnings.filterwarnings('ignore')

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
import plotly.express as px
import plotly.graph_objects as go

# NLP Libraries
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer

# Dataset
from datasets import load_dataset

# Utilities
from collections import Counter
from tqdm import tqdm
tqdm.pandas()

# Set random seed for reproducibility
np.random.seed(42)

# Set plot style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

import sys  # used only to report the interpreter version actually running

print("✅ All libraries imported successfully!")
print(f"Python version: {sys.version.split()[0]}")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

### Download NLTK Resources

In [None]:
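# Download required NLTK data.
# 'punkt_tab' is what word_tokenize looks for in recent NLTK releases; older
# releases use 'punkt', so one of the two may be reported as unavailable.
nltk_resources = ['stopwords', 'punkt', 'punkt_tab', 'wordnet', 'omw-1.4', 'averaged_perceptron_tagger']

failed = []
for resource in nltk_resources:
    try:
        # nltk.download returns False on failure, so check the result
        ok = nltk.download(resource, quiet=True)
    except Exception as e:
        ok = False
        print(f"⚠️ Error while downloading {resource}: {e}")
    if ok:
        print(f"✅ Downloaded: {resource}")
    else:
        failed.append(resource)

if failed:
    print(f"\n⚠️ Could not download: {failed}")
else:
    print("\n✅ NLTK resources downloaded successfully!")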

## 2. Load the IMDb Dataset

We'll use the HuggingFace `datasets` library to load the standard IMDb movie reviews dataset.

**Dataset Info**:
- **Size**: 50,000 reviews (25k train, 25k test)
- **Balance**: 50% positive, 50% negative
- **Task**: Binary sentiment classification

In [None]:
# Load IMDb dataset from HuggingFace
print("Loading IMDb dataset...")
dataset = load_dataset("imdb")

# Display dataset structure
print("\n📊 Dataset Structure:")
print(dataset)

# Convert to pandas DataFrames for easier manipulation
train_df = pd.DataFrame(dataset['train'])
test_df = pd.DataFrame(dataset['test'])

print(f"\n✅ Training set size: {len(train_df):,} reviews")
print(f"✅ Test set size: {len(test_df):,} reviews")
print(f"✅ Total dataset size: {len(train_df) + len(test_df):,} reviews")

### 2.1 Initial Data Inspection

In [None]:
# Display first few rows
print("📋 First 5 training samples:\n")
print(train_df.head())

print("\n" + "="*80)
print("\n📊 Dataset Info:")
train_df.info()  # info() prints directly and returns None, so it is not wrapped in print()

print("\n" + "="*80)
print("\n📈 Basic Statistics:")
print(train_df.describe())

# Check for missing values
print("\n" + "="*80)
print("\n❓ Missing Values:")
print(f"Training set: {train_df.isnull().sum().sum()} missing values")
print(f"Test set: {test_df.isnull().sum().sum()} missing values")
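
Beyond missing values, exact duplicate reviews and texts shared between the train and test splits can also distort later evaluation. The following is a minimal, stdlib-only sketch of such a check; `duplicate_report` and the toy lists are hypothetical names used for illustration, and in the notebook the inputs would presumably be `train_df['text']` and `test_df['text']`.

```python
from collections import Counter

def duplicate_report(train_texts, test_texts):
    """Count repeated rows in the training split and texts present in both splits."""
    train_counts = Counter(train_texts)
    # Every occurrence beyond the first counts as one duplicate row
    dup_in_train = sum(c - 1 for c in train_counts.values() if c > 1)
    # Exact-match overlap only; near-duplicates are not detected
    overlap = set(train_texts) & set(test_texts)
    return {"duplicate_rows_in_train": dup_in_train,
            "texts_in_both_splits": len(overlap)}

# Toy data, not taken from the IMDb dataset
train = ["great film", "awful", "great film", "fine"]
test = ["awful", "new review"]
report = duplicate_report(train, test)
print(report)
```

This only catches byte-identical strings; whether duplicates should be dropped is a modeling decision left open here.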

### 2.2 Sample Reviews

In [None]:
# Display sample positive and negative reviews
print("🎬 POSITIVE REVIEW EXAMPLE:\n")
print(train_df[train_df['label'] == 1].iloc[0]['text'][:500] + "...")
print("\n" + "="*80)
print("\n👎 NEGATIVE REVIEW EXAMPLE:\n")
print(train_df[train_df['label'] == 0].iloc[0]['text'][:500] + "...")

## 3. Exploratory Data Analysis (EDA)

### 3.1 Sentiment Distribution

In [None]:
# Visualize sentiment distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Training set
sentiment_counts_train = train_df['label'].value_counts().sort_index()
labels = ['Negative (0)', 'Positive (1)']
colors = ['#ff6b6b', '#51cf66']

axes[0].bar(labels, sentiment_counts_train.values, color=colors, alpha=0.7, edgecolor='black', linewidth=2)
axes[0].set_ylabel('Count', fontsize=12, fontweight='bold')
axes[0].set_title('Training Set - Sentiment Distribution', fontsize=14, fontweight='bold')
axes[0].grid(axis='y', alpha=0.3)
for i, v in enumerate(sentiment_counts_train.values):
    axes[0].text(i, v + 200, str(v), ha='center', va='bottom', fontweight='bold', fontsize=11)

# Test set
sentiment_counts_test = test_df['label'].value_counts().sort_index()
axes[1].bar(labels, sentiment_counts_test.values, color=colors, alpha=0.7, edgecolor='black', linewidth=2)
axes[1].set_ylabel('Count', fontsize=12, fontweight='bold')
axes[1].set_title('Test Set - Sentiment Distribution', fontsize=14, fontweight='bold')
axes[1].grid(axis='y', alpha=0.3)
for i, v in enumerate(sentiment_counts_test.values):
    axes[1].text(i, v + 200, str(v), ha='center', va='bottom', fontweight='bold', fontsize=11)

plt.tight_layout()
plt.show()

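# Report balance from the actual counts instead of asserting it unconditionally
if sentiment_counts_train.nunique() == 1 and sentiment_counts_test.nunique() == 1:
    print("\n✅ Both splits are perfectly balanced!")
else:
    print("\n⚠️ Class counts differ; see the bar charts above.")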

### 3.2 Review Length Analysis

In [None]:
# Calculate review lengths
train_df['review_length'] = train_df['text'].apply(lambda x: len(str(x).split()))
test_df['review_length'] = test_df['text'].apply(lambda x: len(str(x).split()))

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Overall distribution
axes[0].hist(train_df['review_length'], bins=50, color='steelblue', alpha=0.7, edgecolor='black')
axes[0].set_xlabel('Number of Words', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Frequency', fontsize=12, fontweight='bold')
axes[0].set_title('Review Length Distribution (Training Set)', fontsize=14, fontweight='bold')
axes[0].axvline(train_df['review_length'].mean(), color='red', linestyle='--', linewidth=2,
                label=f'Mean: {train_df["review_length"].mean():.1f}')
axes[0].axvline(train_df['review_length'].median(), color='green', linestyle='--', linewidth=2,
                label=f'Median: {train_df["review_length"].median():.1f}')
axes[0].legend(fontsize=10)
axes[0].grid(alpha=0.3)

# By sentiment
for label, color, name in zip([0, 1], ['#ff6b6b', '#51cf66'], ['Negative', 'Positive']):
    subset = train_df[train_df['label'] == label]['review_length']
    axes[1].hist(subset, bins=50, alpha=0.6, label=name, color=color, edgecolor='black')

axes[1].set_xlabel('Number of Words', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Frequency', fontsize=12, fontweight='bold')
axes[1].set_title('Review Length by Sentiment', fontsize=14, fontweight='bold')
axes[1].legend(fontsize=11)
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

# Statistics
print("\n📊 Review Length Statistics:")
print(f"Mean: {train_df['review_length'].mean():.2f} words")
print(f"Median: {train_df['review_length'].median():.2f} words")
print(f"Min: {train_df['review_length'].min()} words")
print(f"Max: {train_df['review_length'].max()} words")
print(f"Std Dev: {train_df['review_length'].std():.2f} words")
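
Review lengths tend to be right-skewed, so the mean and maximum alone say little about a sensible length cutoff for later models. As a rough sketch (stdlib only, with the hypothetical helper `length_summary` and synthetic texts), upper percentiles can be reported alongside the mean and median:

```python
import statistics

def length_summary(texts):
    """Summarize whitespace-token counts, including upper percentiles."""
    lengths = sorted(len(t.split()) for t in texts)
    # 99 cut points; index 89 is the 90th percentile, index 98 the 99th
    cuts = statistics.quantiles(lengths, n=100, method="inclusive")
    return {
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "p90": cuts[89],
        "p99": cuts[98],
        "max": lengths[-1],
    }

# Synthetic example: nine short reviews and one very long one
texts = ["word " * 10] * 9 + ["word " * 110]
summary = length_summary(texts)
print(summary)
```

In this toy case the single long review pulls the mean well above the median, which is the pattern a gap between the red and green lines in the histogram would indicate.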

## 4. Text Preprocessing

We'll create a comprehensive preprocessing pipeline to clean the raw text data.
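
Before the full class, the regex cleaning stage can be tried in isolation. The sketch below is a stdlib-only illustration, not the notebook's `TextPreprocessor`; `basic_clean` and the sample string are hypothetical. It assumes one design choice: tags and URLs are replaced with a space so that words on either side of a `<br />` do not get glued together.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")                 # HTML tags such as <br />
URL_RE = re.compile(r"https?://\S+|www\.\S+")   # bare and schemed URLs
NON_LETTER_RE = re.compile(r"[^a-z\s]")         # digits and punctuation

def basic_clean(text: str) -> str:
    """Lowercase, strip tags and URLs, drop non-letters, collapse whitespace."""
    text = text.lower()
    text = TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    text = NON_LETTER_RE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()

sample = "Loved it!<br /><br />See www.example.com for more. 10/10"
print(basic_clean(sample))  # loved it see for more
```

Order matters here: tags and URLs are removed before punctuation, since stripping `<`, `>`, `/` and `.` first would leave fragments such as `br` or `wwwexamplecom` behind as ordinary tokens.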

In [None]:
# Text preprocessing pipeline

class TextPreprocessor:
    """Text preprocessing pipeline for sentiment analysis."""
    
    def __init__(self):
        self.stop_words = set(stopwords.words('english'))
        self.lemmatizer = WordNetLemmatizer()
    
    def clean_html(self, text):
        """Replace HTML tags (e.g. <br />) with a space so adjacent words stay separated."""
        return re.sub(r'<[^>]+>', ' ', text)
    
    def clean_text(self, text):
        """Remove URLs, special characters, and extra whitespace."""
        # Remove URLs (the dot after "www" is escaped so it matches a literal dot)
        text = re.sub(r'http\S+|www\.\S+', '', text)
        # Remove special characters and digits
        text = re.sub(r'[^a-zA-Z\s]', '', text)
        # Collapse runs of whitespace into a single space
        text = re.sub(r'\s+', ' ', text)
        return text.strip()
    
    def preprocess(self, text):
        """Apply full preprocessing pipeline."""
        if pd.isna(text):
            return ""
        
        # Convert to string and lowercase
        text = str(text).lower()
        
        # Remove HTML tags
        text = self.clean_html(text)
        
        # Clean text
        text = self.clean_text(text)
        
        # Tokenize
        tokens = word_tokenize(text)
        
        # Remove stopwords and lemmatize
        tokens = [self.lemmatizer.lemmatize(word) for word in tokens 
                 if word not in self.stop_words and len(word) > 2]
        
        return ' '.join(tokens)

# Initialize preprocessor
preprocessor = TextPreprocessor()

print("✅ Text preprocessor created!")
print("\nExample transformation:")
sample_text = train_df.iloc[0]['text'][:200]
print(f"\nOriginal: {sample_text}...")
print(f"\nCleaned: {preprocessor.preprocess(sample_text)}")
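
One caveat with stop-word removal for sentiment tasks: NLTK's English stop-word list includes negations such as "not", "no" and "nor", so "not good" and "good" become identical after filtering. The sketch below shows one possible variant that keeps negations. The tiny `STOP_WORDS` set and the `remove_stopwords` helper are hypothetical stand-ins for illustration, not part of the notebook's pipeline.

```python
# Hypothetical mini stop-word list; NLTK's English list is much longer
STOP_WORDS = {"the", "is", "was", "a", "not", "no", "this", "it"}
NEGATIONS = {"not", "no", "nor", "never"}

def remove_stopwords(tokens, keep_negations=False):
    """Filter stop words, optionally leaving negation words in place."""
    blocked = STOP_WORDS - NEGATIONS if keep_negations else STOP_WORDS
    return [t for t in tokens if t not in blocked]

tokens = "this movie was not good".split()
print(remove_stopwords(tokens))                       # ['movie', 'good']
print(remove_stopwords(tokens, keep_negations=True))  # ['movie', 'not', 'good']
```

Whether keeping negations helps depends on the downstream model; note also that the `len(word) > 2` filter in `preprocess` would still drop "no" even if it were removed from the stop-word set.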

### Apply Preprocessing to Full Dataset

⚠️ **Note**: This may take 5-10 minutes. We'll use tqdm for progress tracking.
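
If the runtime becomes a problem, one option is to memoize per-word work, since the same words recur across many reviews. The sketch below illustrates the idea with `functools.lru_cache`; `slow_lemmatize` is a hypothetical stand-in for an expensive call such as `WordNetLemmatizer().lemmatize`, and the actual speedup in this notebook has not been measured.

```python
from functools import lru_cache

def slow_lemmatize(word):
    """Hypothetical stand-in for an expensive lemmatizer call."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

@lru_cache(maxsize=None)
def cached_lemmatize(word):
    # Each distinct word is lemmatized once; repeats are served from the cache
    return slow_lemmatize(word)

tokens = "movies films movies actors films movies".split()
lemmas = [cached_lemmatize(t) for t in tokens]
info = cached_lemmatize.cache_info()
print(lemmas)
print(f"hits={info.hits}, misses={info.misses}")
```

With six tokens and three distinct words, half of the calls are cache hits; on a real corpus the share of repeated words is typically far higher.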

In [None]:
# Enable progress bar for pandas
tqdm.pandas()

# Process training data
print("Processing training data...")
train_df['cleaned_text'] = train_df['text'].progress_apply(preprocessor.preprocess)
train_df['word_count'] = train_df['cleaned_text'].apply(lambda x: len(x.split()))
train_df['char_count'] = train_df['cleaned_text'].apply(len)

# Process test data
print("\nProcessing test data...")
test_df['cleaned_text'] = test_df['text'].progress_apply(preprocessor.preprocess)
test_df['word_count'] = test_df['cleaned_text'].apply(lambda x: len(x.split()))
test_df['char_count'] = test_df['cleaned_text'].apply(len)

print("\n✅ Preprocessing complete!")
print(f"\nTraining set shape: {train_df.shape}")
print(f"Test set shape: {test_df.shape}")

### 4.1 Word Clouds - Positive vs Negative Reviews

In [None]:
# Generate word clouds
positive_text = ' '.join(train_df[train_df['label'] == 1]['cleaned_text'].head(5000))
negative_text = ' '.join(train_df[train_df['label'] == 0]['cleaned_text'].head(5000))

fig, axes = plt.subplots(1, 2, figsize=(18, 7))

# Positive word cloud
wc_pos = WordCloud(width=800, height=400, background_color='white',
                   max_words=100, colormap='Greens', contour_width=2,
                   contour_color='darkgreen').generate(positive_text)
axes[0].imshow(wc_pos, interpolation='bilinear')
axes[0].axis('off')
axes[0].set_title('Positive Reviews - Word Cloud', fontsize=16, fontweight='bold', pad=15)

# Negative word cloud
wc_neg = WordCloud(width=800, height=400, background_color='white',
                   max_words=100, colormap='Reds', contour_width=2,
                   contour_color='darkred').generate(negative_text)
axes[1].imshow(wc_neg, interpolation='bilinear')
axes[1].axis('off')
axes[1].set_title('Negative Reviews - Word Cloud', fontsize=16, fontweight='bold', pad=15)

plt.tight_layout()
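# Make sure the output directory exists; it is otherwise only created in Section 5
os.makedirs('../results/figures', exist_ok=True)
plt.savefig('../results/figures/wordclouds.png', dpi=300, bbox_inches='tight')
plt.show()

print("✅ Word clouds generated and saved!")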

### 4.2 Top Words Comparison

In [None]:
# Get word frequencies
from collections import Counter

positive_words = Counter(' '.join(train_df[train_df['label'] == 1]['cleaned_text']).split())
negative_words = Counter(' '.join(train_df[train_df['label'] == 0]['cleaned_text']).split())

# Plot comparison
fig, axes = plt.subplots(1, 2, figsize=(16, 8))

# Positive words
top_pos = positive_words.most_common(20)
words_pos, counts_pos = zip(*top_pos)
axes[0].barh(range(len(words_pos)), counts_pos, color='#51cf66', alpha=0.8, edgecolor='black')
axes[0].set_yticks(range(len(words_pos)))
axes[0].set_yticklabels(words_pos, fontsize=11)
axes[0].set_xlabel('Frequency', fontsize=12, fontweight='bold')
axes[0].set_title('Top 20 Words in Positive Reviews', fontsize=14, fontweight='bold')
axes[0].invert_yaxis()
axes[0].grid(axis='x', alpha=0.3)

# Negative words
top_neg = negative_words.most_common(20)
words_neg, counts_neg = zip(*top_neg)
axes[1].barh(range(len(words_neg)), counts_neg, color='#ff6b6b', alpha=0.8, edgecolor='black')
axes[1].set_yticks(range(len(words_neg)))
axes[1].set_yticklabels(words_neg, fontsize=11)
axes[1].set_xlabel('Frequency', fontsize=12, fontweight='bold')
axes[1].set_title('Top 20 Words in Negative Reviews', fontsize=14, fontweight='bold')
axes[1].invert_yaxis()
axes[1].grid(axis='x', alpha=0.3)

plt.tight_layout()
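# Make sure the output directory exists; it is otherwise only created in Section 5
os.makedirs('../results/figures', exist_ok=True)
plt.savefig('../results/figures/top_words_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

print("✅ Top words comparison generated!")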
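
Raw top-20 lists for the two classes usually share many words (for example "movie" and "film"), which makes them hard to compare. A complementary view ranks words by how much more often they occur in one class than the other. The sketch below uses a smoothed log-ratio of relative frequencies; `distinctive_words`, its parameters, and the toy counters are hypothetical, and in the notebook the inputs would presumably be `positive_words` and `negative_words`.

```python
import math
from collections import Counter

def distinctive_words(pos_counts: Counter, neg_counts: Counter, top_k=10, min_count=2):
    """Return (most positive-leaning, most negative-leaning) words by smoothed log-ratio."""
    pos_total = sum(pos_counts.values())
    neg_total = sum(neg_counts.values())
    # Ignore rare words, whose ratios are unstable
    vocab = {w for w in set(pos_counts) | set(neg_counts)
             if pos_counts[w] + neg_counts[w] >= min_count}
    scores = {}
    for w in vocab:
        # Add-one smoothing avoids division by zero for one-sided words
        p = (pos_counts[w] + 1) / (pos_total + len(vocab))
        n = (neg_counts[w] + 1) / (neg_total + len(vocab))
        scores[w] = math.log(p / n)
    ranked = sorted(scores, key=lambda w: (scores[w], w))
    return ranked[-top_k:][::-1], ranked[:top_k]

# Toy counters, not derived from the IMDb data
pos = Counter("great movie great film wonderful movie".split())
neg = Counter("bad movie awful film bad movie".split())
top_positive, top_negative = distinctive_words(pos, neg, top_k=1)
print(top_positive, top_negative)  # ['great'] ['bad']
```

Words common to both classes score near zero and drop out of either list, so the ranking highlights sentiment-bearing vocabulary rather than genre vocabulary.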

## 5. Save Processed Data

Save the cleaned datasets for use in modeling notebooks.

In [None]:
# Create directories if they don't exist
import os
os.makedirs('../data/processed', exist_ok=True)
os.makedirs('../results/figures', exist_ok=True)

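# Save processed data.
# Note: pd.read_csv reads empty strings back as NaN by default, so any review whose
# cleaned_text is empty may need .fillna('') when the file is reloaded.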
train_df.to_csv('../data/processed/train_processed.csv', index=False)
test_df.to_csv('../data/processed/test_processed.csv', index=False)

print("✅ Processed data saved successfully!")
print(f"\n📁 Files saved:")
print(f"  - ../data/processed/train_processed.csv ({len(train_df):,} rows)")
print(f"  - ../data/processed/test_processed.csv ({len(test_df):,} rows)")

# Display sample of processed data
print("\n📊 Sample of processed data:")
print(train_df[['text', 'cleaned_text', 'label', 'word_count']].head(3))

## Notebook 1 Summary

- Loaded 50,000 IMDb movie reviews
- Performed comprehensive EDA
- Analyzed sentiment distribution and review lengths
- Preprocessed all text data (cleaning, tokenization, lemmatization)
- Generated visualizations (word clouds, frequency plots)
- Saved processed data for modeling

**Next Steps**: Proceed to Notebook 2 for classical ML modeling.