# Exploratory Data Analysis (EDA) - IMDB Sentiment Analysis
**Team:** [Your Team Name]

**Date:** October 20, 2025

**Objective:** Understand the IMDB dataset before building models

---

## 1. Setup and Imports

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Settings
# NOTE: this hides ALL warnings, including useful ones; comment it out when debugging
warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✅ Libraries imported successfully!")

## 2. Load the Dataset

**TODO:** Update the path below to match where YOU saved the IMDB dataset

In [None]:
# Load IMDB dataset
# UPDATE THIS PATH!
data_path = '/content/drive/MyDrive/sentiment_analysis_project/data/IMDB Dataset.csv'

df = pd.read_csv(data_path)

print("✅ Dataset loaded successfully!")
print(f"Shape: {df.shape}")
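
Before exploring, it can help to confirm that the file has the columns the rest of this notebook expects. The sketch below assumes the two column names used in the hints (`review` and `sentiment`) and reads a made-up two-row CSV from memory, so it runs without Drive; `csv_text` and `sample_df` are hypothetical names, not part of the project.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for the real CSV, so the sketch runs without Drive.
csv_text = (
    'review,sentiment\n'
    '"Great film, loved it.",positive\n'
    '"Dull and far too long.",negative\n'
)
sample_df = pd.read_csv(io.StringIO(csv_text))

# Assumed schema: one text column and one label column.
expected_columns = {"review", "sentiment"}
missing_columns = expected_columns - set(sample_df.columns)
if missing_columns:
    raise ValueError(f"Missing expected columns: {sorted(missing_columns)}")

print(f"Columns OK: {list(sample_df.columns)}, shape: {sample_df.shape}")
```

On the real data, the same check would be run on `df` right after `pd.read_csv(data_path)`.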

## 3. Initial Data Exploration

**Goal:** Get a first look at what we're working with

In [None]:
# Display first few rows
# TODO: Display the first 5 rows of the dataset


In [None]:
# Get basic information
# TODO: Use .info() to check data types and missing values


In [None]:
# Check for missing values
# TODO: Check how many null values are in each column


In [None]:
# Check for duplicates
# TODO: How many duplicate rows exist?
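
One possible way to approach the two checks above, shown on a made-up four-row frame rather than the real data. `toy_df` is a hypothetical name; on the real dataset the same calls would go on `df`.

```python
import pandas as pd

# Made-up data: one missing review and one exact duplicate row.
toy_df = pd.DataFrame({
    "review": ["Loved it", "Hated it", "Loved it", None],
    "sentiment": ["positive", "negative", "positive", "negative"],
})

# Null values per column.
null_counts = toy_df.isnull().sum()

# duplicated() marks every repeat of an earlier row, so the sum is the
# number of rows that would be dropped by drop_duplicates().
duplicate_count = toy_df.duplicated().sum()

print(null_counts)
print(f"Duplicate rows: {duplicate_count}")
```

Note that `duplicated()` compares whole rows by default; two reviews with identical text but different labels would not be counted unless you pass `subset=['review']`.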


## 4. Target Variable Analysis

**Goal:** Understand the distribution of sentiments (positive vs negative)

In [None]:
# Count sentiment distribution
# TODO: Use value_counts() to see how many positive vs negative reviews


In [None]:
# Visualize sentiment distribution
# TODO: Create a bar plot or count plot
# HINT: Use plt.bar() or sns.countplot()

plt.title('Sentiment Distribution')
plt.xlabel('Sentiment')
plt.ylabel('Count')
plt.show()

**Question:** Is the dataset balanced? Why does this matter?
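
A minimal sketch of why balance matters, using made-up labels (not the real IMDB counts). A model that always predicts the most frequent class scores an accuracy equal to that class's share, so accuracy is only meaningful when compared against this baseline. `toy_labels` is a hypothetical name.

```python
import pandas as pd

# Made-up labels: 4 positive, 2 negative (deliberately imbalanced).
toy_labels = pd.Series(
    ["positive", "negative", "positive", "negative", "positive", "positive"],
    name="sentiment",
)

# normalize=True returns proportions instead of raw counts.
class_shares = toy_labels.value_counts(normalize=True)

# Accuracy of a "model" that always predicts the majority class.
majority_baseline = class_shares.max()

print(class_shares)
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```

If the real dataset turns out to be close to 50/50, the baseline is about 0.5 and plain accuracy is a reasonable first metric.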

## 5. Text Length Analysis

**Goal:** See how long the reviews are (character count and word count)

In [None]:
# Create new columns for length metrics
# TODO: Add a column called 'char_count' that counts characters in each review
# HINT: df['review'].apply(len)

# TODO: Add a column called 'word_count' that counts words
# HINT: df['review'].apply(lambda x: len(x.split()))
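
As an alternative to the `apply` hints above, here is a sketch using pandas' vectorized string methods on a made-up frame. The results match on clean text; one difference is that `.str.len()` returns `NaN` for a missing review, whereas `apply(len)` would raise an error. `toy_df` is a hypothetical name.

```python
import pandas as pd

# Made-up reviews; the third one contains a double space on purpose.
toy_df = pd.DataFrame({"review": ["A great movie", "Bad", "Not  my kind of film"]})

# Characters per review, counting spaces and punctuation.
toy_df["char_count"] = toy_df["review"].str.len()

# Words per review: split() with no argument splits on any run of
# whitespace, so the double space does not create an empty "word".
toy_df["word_count"] = toy_df["review"].str.split().str.len()

print(toy_df)
```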


In [None]:
# Get statistics on review lengths
# TODO: Use .describe() on char_count and word_count


In [None]:
# Visualize word count distribution
# TODO: Create a histogram of word counts
# HINT: plt.hist() or df['word_count'].hist()

plt.title('Distribution of Review Lengths (Word Count)')
plt.xlabel('Number of Words')
plt.ylabel('Frequency')
plt.show()

In [None]:
# Compare lengths between positive and negative reviews
# TODO: Create a box plot comparing word counts by sentiment
# HINT: sns.boxplot(data=df, x='sentiment', y='word_count')

plt.title('Word Count by Sentiment')
plt.show()

**Question:** Do positive or negative reviews tend to be longer? Why might this be?
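
A box plot shows the comparison visually; a grouped summary gives you numbers to quote. The sketch below uses made-up word counts (they say nothing about the real IMDB data) and assumes the `word_count` column name from the previous section.

```python
import pandas as pd

# Made-up word counts, including one very long negative review.
toy_df = pd.DataFrame({
    "sentiment": ["positive", "positive", "negative", "negative", "negative"],
    "word_count": [120, 300, 90, 150, 600],
})

# Mean, median and max word count for each sentiment class.
length_summary = (
    toy_df.groupby("sentiment")["word_count"].agg(["mean", "median", "max"])
)

print(length_summary)
```

In this toy example the single 600-word review pulls the negative mean well above the negative median, which is why it is worth checking both before drawing a conclusion.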

## 6. Sample Reviews

**Goal:** Read actual reviews to get intuition about the data

In [None]:
# Display 3 random positive reviews
print("=== POSITIVE REVIEWS ===")
# TODO: Sample 3 positive reviews and display them
# HINT: df[df['sentiment'] == 'positive'].sample(3)


In [None]:
# Display 3 random negative reviews
print("=== NEGATIVE REVIEWS ===")
# TODO: Sample 3 negative reviews and display them
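
One way to sample reviews so that every teammate sees the same ones is to pass `random_state` to `sample()`. The sketch below uses a made-up frame and prints each review in a loop, since a DataFrame display truncates long strings by default. `toy_df` and the seed value 42 are illustrative choices.

```python
import pandas as pd

# Made-up reviews with four negative examples.
toy_df = pd.DataFrame({
    "review": ["Loved it", "Boring", "A classic", "Waste of time",
               "Awful acting", "Fell asleep"],
    "sentiment": ["positive", "negative", "positive", "negative",
                  "negative", "negative"],
})

# Keep only negative rows, then draw 3 of them reproducibly.
negative_reviews = toy_df[toy_df["sentiment"] == "negative"]
sampled = negative_reviews.sample(n=3, random_state=42)

# Print the full text of each sampled review.
for text in sampled["review"]:
    print(text)
    print("-" * 40)
```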


## 7. Text Patterns (OPTIONAL - Advanced)

If you finish early, try these explorations:

In [None]:
# Check for HTML tags (IMDB reviews sometimes have <br /> tags)
# TODO: See if any reviews contain '<br />'
# HINT: df['review'].str.contains('<br />', regex=False)
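
A small sketch of the tag check on made-up reviews. Passing `regex=False` makes `str.contains` treat the tag as a literal string, and `str.count` shows how many tags each review has. `toy_reviews` is a hypothetical name.

```python
import pandas as pd

# Made-up reviews: two contain <br /> tags, one does not.
toy_reviews = pd.Series([
    "Great start.<br /><br />Weak ending.",
    "No markup here.",
    "One break<br />only.",
])

# True/False per review: does it contain the literal tag?
has_br = toy_reviews.str.contains("<br />", regex=False)

# Number of tags in each review (the pattern has no special regex characters).
br_per_review = toy_reviews.str.count("<br />")

print(f"Reviews with <br />: {has_br.sum()} of {len(toy_reviews)}")
print(br_per_review.tolist())
```

Knowing how common the tags are tells you whether stripping them belongs in the Week 2 cleaning step.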


In [None]:
# Most common words (simple version)
from collections import Counter

# Combine all reviews into one big string
all_text = ' '.join(df['review']).lower()
words = all_text.split()

# Get 20 most common words
word_counts = Counter(words).most_common(20)
print("20 Most Common Words:")
for word, count in word_counts:
    print(f"{word}: {count}")
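
The simple version above will likely be dominated by function words such as "the", and `split()` leaves punctuation attached, so "great," and "great!" are counted as different words. A slightly more careful sketch, on made-up reviews, tokenizes with a regular expression and drops a few stopwords. The `stopwords` set here is a tiny hypothetical list for illustration, not a standard one.

```python
import re
from collections import Counter

# Made-up reviews for illustration.
toy_reviews = [
    "The plot was great, and the acting was great!",
    "The pacing was slow; the ending was weak.",
]

# Hypothetical, deliberately tiny stopword list.
stopwords = {"the", "was", "and"}

# Update the counter one review at a time instead of building one huge string.
token_counts = Counter()
for review in toy_reviews:
    # Keep runs of letters and apostrophes; punctuation is discarded.
    tokens = re.findall(r"[a-z']+", review.lower())
    token_counts.update(t for t in tokens if t not in stopwords)

top_words = token_counts.most_common(3)
for word, count in top_words:
    print(f"{word}: {count}")
```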

## 8. Your Insights

**TODO:** Based on your exploration, write down 3 interesting observations about the dataset:

1. 

2. 

3. 

---

**Next Steps:**
- Week 2: Clean the text data and build baseline models (Logistic Regression, Naive Bayes)
- Save any important findings to share with the team!