# 📊 StreamSage Notebook 1: Data Exploration

**Goal**: Download and explore all datasets for the Movie Discovery Assistant.

**Datasets**:
1. MovieLens 25M - User ratings and preferences
2. TMDb 5000 - Movie metadata (plots, cast, keywords)
3. IMDb Reviews - Sentiment analysis data

**Outcome**: Understand data structure, quality, and merge strategy.

In [None]:
# Install required libraries
!pip install pandas numpy matplotlib seaborn kaggle

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import json

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

print("✅ Libraries loaded!")

## 1️⃣ Dataset 1: MovieLens 25M

**What it contains**: 25 million ratings from 162,000 users on 62,000 movies.

**Why we need it**: User preference patterns for personalized recommendations.

In [None]:
# Download MovieLens 25M
!wget https://files.grouplens.org/datasets/movielens/ml-25m.zip
!unzip -q ml-25m.zip

print("✅ MovieLens 25M downloaded!")

In [None]:
# Load MovieLens data
ml_ratings = pd.read_csv('ml-25m/ratings.csv')
ml_movies = pd.read_csv('ml-25m/movies.csv')
ml_tags = pd.read_csv('ml-25m/tags.csv')

print(f"Ratings shape: {ml_ratings.shape}")
print(f"Movies shape: {ml_movies.shape}")
print(f"Tags shape: {ml_tags.shape}")

print("\n--- Ratings Sample ---")
display(ml_ratings.head())

print("\n--- Movies Sample ---")
display(ml_movies.head())
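
With 25 million rows, `ratings.csv` takes a noticeable amount of memory under pandas' default 64-bit dtypes. The following is a minimal sketch, assuming the file's columns are `userId`, `movieId`, `rating`, and `timestamp`, of passing narrower dtypes to `pd.read_csv`. The in-memory CSV and the `ratings_dtypes` mapping are illustrative stand-ins, not part of the original notebook.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for ml-25m/ratings.csv (assumed column names).
sample_csv = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,296,5.0,1147880044\n"
    "1,306,3.5,1147868817\n"
    "2,296,4.0,1141415820\n"
)

# Narrower dtypes than the int64/float64 defaults pandas would infer.
# Ratings move in 0.5 steps, so float32 represents them exactly.
ratings_dtypes = {
    "userId": "int32",
    "movieId": "int32",
    "rating": "float32",
    "timestamp": "int64",
}

ratings_small = pd.read_csv(sample_csv, dtype=ratings_dtypes)
print(ratings_small.dtypes)
print(f"Memory: {ratings_small.memory_usage(deep=True).sum()} bytes")
```

To try this on the real file, the same `dtype=ratings_dtypes` argument could be passed when reading `'ml-25m/ratings.csv'`.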

In [None]:
# MovieLens Statistics
print("📊 MovieLens Statistics")
print(f"Total ratings: {len(ml_ratings):,}")
print(f"Unique users: {ml_ratings['userId'].nunique():,}")
print(f"Unique movies: {ml_ratings['movieId'].nunique():,}")
print(f"Average rating: {ml_ratings['rating'].mean():.2f}")
print(f"Rating range: {ml_ratings['rating'].min()} - {ml_ratings['rating'].max()}")

# Visualize rating distribution
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
ml_ratings['rating'].hist(bins=10, color='skyblue', edgecolor='black')
plt.title('Rating Distribution')
plt.xlabel('Rating')
plt.ylabel('Count')

# Ratings per movie
plt.subplot(1, 2, 2)
ratings_per_movie = ml_ratings.groupby('movieId').size()
ratings_per_movie.hist(bins=50, color='coral', edgecolor='black')
plt.title('Ratings per Movie')
plt.xlabel('Number of Ratings')
plt.ylabel('Number of Movies')
plt.yscale('log')
plt.tight_layout()
plt.show()
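
The counts above can also be combined into a single sparsity figure for the user-movie matrix, which is a common sanity check before building a recommender. This is a small sketch on toy data. The `toy_ratings` frame is made up for illustration and only mirrors the `userId`, `movieId`, and `rating` columns used above.

```python
import pandas as pd

# Toy ratings table with the same column names as ml_ratings.
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2, 3, 3, 3],
    "movieId": [10, 20, 10, 10, 20, 30],
    "rating": [4.0, 3.5, 5.0, 2.0, 4.5, 3.0],
})

n_users = toy_ratings["userId"].nunique()
n_movies = toy_ratings["movieId"].nunique()

# Density: share of (user, movie) pairs that actually have a rating.
density = len(toy_ratings) / (n_users * n_movies)
sparsity = 1.0 - density

# Median is more informative than the mean for long-tailed counts.
toy_ratings_per_movie = toy_ratings.groupby("movieId").size()
median_ratings = toy_ratings_per_movie.median()

print(f"Users: {n_users}, movies: {n_movies}")
print(f"Sparsity: {sparsity:.1%}")
print(f"Median ratings per movie: {median_ratings}")
```

Swapping `toy_ratings` for `ml_ratings` would give the same figures for the full dataset.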

## 2️⃣ Dataset 2: TMDb 5000 Movies

**What it contains**: Rich metadata for 5,000 movies (plots, cast, keywords).

**Why we need it**: Content-based recommendations using plot similarity.

In [None]:
# Download TMDb dataset from Kaggle
# Note: You need to upload your kaggle.json API key first
# Go to: https://www.kaggle.com/settings/account → Create New API Token
# Upload kaggle.json to Colab

from google.colab import files
print("📤 Please upload your kaggle.json file:")
uploaded = files.upload()

# Setup Kaggle
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

# Download dataset
!kaggle datasets download -d tmdb/tmdb-movie-metadata
!unzip -q tmdb-movie-metadata.zip

print("✅ TMDb dataset downloaded!")

In [None]:
# Load TMDb data
tmdb_movies = pd.read_csv('tmdb_5000_movies.csv')
tmdb_credits = pd.read_csv('tmdb_5000_credits.csv')

print(f"TMDb Movies shape: {tmdb_movies.shape}")
print(f"TMDb Credits shape: {tmdb_credits.shape}")

print("\n--- TMDb Movies Sample ---")
display(tmdb_movies.head())

print("\n--- Column Names ---")
print(tmdb_movies.columns.tolist())

In [None]:
# TMDb Statistics
print("📊 TMDb Statistics")
print(f"Total movies: {len(tmdb_movies):,}")
print(f"Movies with overview: {tmdb_movies['overview'].notna().sum():,}")
print(f"Missing overviews: {tmdb_movies['overview'].isna().sum()}")
print(f"Average overview length: {tmdb_movies['overview'].str.len().mean():.0f} characters")

# Check a sample overview
print("\n--- Sample Movie Overview ---")
sample = tmdb_movies.iloc[0]
print(f"Title: {sample['title']}")
print(f"Overview: {sample['overview']}")
print(f"Genres: {sample['genres']}")
print(f"Keywords: {sample['keywords']}")

In [None]:
# Parse JSON columns (genres, keywords)
def parse_json_column(df, column):
    """Parse JSON string column and extract names"""
    def extract_names(x):
        if pd.isna(x):
            return []
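        try:
            data = json.loads(x)
            return [item['name'] for item in data]
        except (json.JSONDecodeError, TypeError, KeyError):
            # Malformed JSON or unexpected structure: treat as no entries
            return []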
    
    return df[column].apply(extract_names)

# Parse genres and keywords
tmdb_movies['genres_list'] = parse_json_column(tmdb_movies, 'genres')
tmdb_movies['keywords_list'] = parse_json_column(tmdb_movies, 'keywords')

print("✅ JSON columns parsed!")
print("\n--- Sample Parsed Data ---")
print(f"Title: {tmdb_movies.iloc[0]['title']}")
print(f"Genres: {tmdb_movies.iloc[0]['genres_list']}")
print(f"Keywords: {tmdb_movies.iloc[0]['keywords_list'][:5]}...")  # First 5 keywords
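
Once the JSON strings are turned into Python lists, they can be flattened for simple frequency counts. The sketch below is self-contained and uses a made-up `toy` frame whose `genres` strings imitate the TMDb format. It shows one possible way to count genres with `explode` and `value_counts`, not necessarily how later notebooks do it.

```python
import json

import pandas as pd

# Toy column imitating TMDb's JSON-encoded genres, including bad inputs.
toy = pd.DataFrame({
    "genres": [
        '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
        "[]",
        None,
        "not valid json",
    ]
})


def extract_names(raw):
    """Return the 'name' field of each JSON object, or [] on bad input."""
    if pd.isna(raw):
        return []
    try:
        return [item["name"] for item in json.loads(raw)]
    except (json.JSONDecodeError, TypeError, KeyError):
        return []


toy["genres_list"] = toy["genres"].apply(extract_names)

# explode() gives one row per genre; empty lists become NaN,
# which value_counts() drops by default.
genre_counts = toy["genres_list"].explode().value_counts()
print(genre_counts)
```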

## 3️⃣ Dataset 3: IMDb Reviews

**What it contains**: 50,000 movie reviews with sentiment labels.

**Why we need it**: Train sentiment analyzer for review insights.

In [None]:
# Download IMDb reviews
!wget https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xzf aclImdb_v1.tar.gz

print("✅ IMDb reviews downloaded!")

In [None]:
# Load IMDb reviews (sample)
import os

def load_reviews(directory, label):
    """Load reviews from directory"""
    reviews = []
    for filename in sorted(os.listdir(directory))[:100]:  # First 100 files in sorted order (os.listdir order is arbitrary)
        if filename.endswith('.txt'):
            with open(os.path.join(directory, filename), 'r', encoding='utf-8') as f:
                reviews.append({'text': f.read(), 'sentiment': label})
    return reviews

# Load positive and negative reviews
pos_reviews = load_reviews('aclImdb/train/pos', 'positive')
neg_reviews = load_reviews('aclImdb/train/neg', 'negative')

# Create DataFrame
imdb_reviews = pd.DataFrame(pos_reviews + neg_reviews)

print(f"IMDb Reviews shape: {imdb_reviews.shape}")
print(f"Positive: {(imdb_reviews['sentiment'] == 'positive').sum()}")
print(f"Negative: {(imdb_reviews['sentiment'] == 'negative').sum()}")

print("\n--- Sample Review ---")
print(imdb_reviews.iloc[0]['text'][:300] + "...")
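
The review files in this dataset are named `[id]_[rating].txt`, where the rating is the star score on a 1-10 scale, so the filename carries more detail than the positive/negative folder label. Below is a minimal sketch of recovering that rating. The helper `parse_review_filename` and the example filenames are hypothetical, introduced only for illustration.

```python
import re

# Matches names such as "0_9.txt": review id, underscore, star rating.
FILENAME_PATTERN = re.compile(r"^(\d+)_(\d+)\.txt$")


def parse_review_filename(filename):
    """Return {'review_id', 'rating'} for an id_rating.txt name, else None."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        return None
    review_id, rating = match.groups()
    return {"review_id": int(review_id), "rating": int(rating)}


# Illustrative filenames; the last one does not follow the convention.
examples = ["0_9.txt", "127_3.txt", "notes.txt"]
parsed = [parse_review_filename(name) for name in examples]
print(parsed)
```

Inside `load_reviews`, the parsed rating could be stored next to `text` and `sentiment` if a finer-grained target is wanted later.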

## 🔗 Merge Strategy Analysis

**Challenge**: MovieLens and TMDb use different IDs.

**Solution**: Match by title + year.

In [None]:
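# Extract year from MovieLens title (cast to nullable Int64 so it matches the TMDb year dtype for merging)
ml_movies['year'] = pd.to_numeric(
    ml_movies['title'].str.extract(r'\((\d{4})\)')[0], errors='coerce'
).astype('Int64')
ml_movies['clean_title'] = ml_movies['title'].str.replace(r'\s*\(\d{4}\)', '', regex=True).str.strip()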

# Extract year from TMDb
tmdb_movies['year'] = pd.to_datetime(tmdb_movies['release_date'], errors='coerce').dt.year.astype('Int64')

print("✅ Years extracted!")
print("\n--- MovieLens Sample ---")
display(ml_movies[['movieId', 'title', 'clean_title', 'year']].head())

print("\n--- TMDb Sample ---")
display(tmdb_movies[['id', 'title', 'year']].head())
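
Exact title matching is sensitive to small formatting differences. MovieLens, for example, moves leading articles to the end ("American President, The"), which will not match a title written in natural order. The sketch below shows one possible normalization step. `normalize_title` is a hypothetical helper, and the set of articles it handles is an assumption that covers English titles only.

```python
import re

# Titles of the form "<name>, The" / "<name>, A" / "<name>, An".
TRAILING_ARTICLE = re.compile(r"^(.*), (The|A|An)$")


def normalize_title(title):
    """Move a trailing article to the front and casefold for comparison."""
    title = title.strip()
    match = TRAILING_ARTICLE.match(title)
    if match:
        title = f"{match.group(2)} {match.group(1)}"
    return title.casefold()


for raw in ["American President, The", "Toy Story", "The Dark Knight "]:
    print(repr(raw), "->", repr(normalize_title(raw)))
```

Applying the same function to both `clean_title` and the TMDb `title` before merging would make the comparison case-insensitive. Titles that carry an alternate name in parentheses are not handled by this sketch.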

In [None]:
# Test merge on exact title + year match
merged_test = ml_movies.merge(
    tmdb_movies,
    left_on=['clean_title', 'year'],
    right_on=['title', 'year'],
    how='inner',
    suffixes=('_ml', '_tmdb')
)

print("📊 Merge Results (Exact Match)")
print(f"MovieLens movies: {len(ml_movies):,}")
print(f"TMDb movies: {len(tmdb_movies):,}")
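print(f"Matched rows: {len(merged_test):,}")
# Count unique TMDb ids so duplicate title + year pairs do not inflate the rate
print(f"Match rate (share of TMDb movies): {merged_test['id'].nunique() / len(tmdb_movies) * 100:.1f}%")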

print("\n--- Sample Merged Data ---")
display(merged_test[['movieId', 'title_ml', 'title_tmdb', 'year', 'overview']].head())
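
As a cross-check on title matching, the MovieLens 25M archive also includes a `links.csv` file with `movieId`, `imdbId`, and `tmdbId` columns, which allows an ID-based join. The sketch below uses small toy frames in place of the real files. The values are illustrative, and the missing `tmdbId` is an assumption about the kind of gap that may appear in the real data.

```python
import pandas as pd

# Toy stand-in for ml-25m/links.csv; one tmdbId is deliberately missing.
toy_links = pd.DataFrame({
    "movieId": [1, 2, 3],
    "imdbId": [114709, 113497, 113228],
    "tmdbId": [862.0, 8844.0, None],
})

# Toy stand-in for the TMDb movies table (its key column is 'id').
toy_tmdb = pd.DataFrame({
    "id": [862, 8844, 19995],
    "title": ["Toy Story", "Jumanji", "Avatar"],
})

# Use the same nullable integer dtype on both sides of the join.
toy_links["tmdbId"] = toy_links["tmdbId"].astype("Int64")
toy_tmdb["id"] = toy_tmdb["id"].astype("Int64")

linked = toy_links.dropna(subset=["tmdbId"]).merge(
    toy_tmdb, left_on="tmdbId", right_on="id", how="inner"
)
match_rate = len(linked) / len(toy_tmdb)

print(linked[["movieId", "tmdbId", "title"]])
print(f"Match rate: {match_rate:.1%}")
```

Comparing this ID-based match count with the title + year count is one way to estimate how many movies the exact-match approach misses.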

## 📋 Summary & Next Steps

### What We Learned:
1. ✅ MovieLens has 25M ratings on 62K movies
2. ✅ TMDb has rich metadata for 5K movies
3. ✅ We can match ~3-4K movies by title + year
4. ✅ IMDb reviews are ready for sentiment training

### Next Notebook: Data Cleaning
- Remove low-quality data
- Handle missing values
- Standardize formats

In [None]:
# Save exploration results
print("💾 Saving exploration results...")

# Save basic stats
stats = {
    'movielens_ratings': len(ml_ratings),
    'movielens_movies': len(ml_movies),
    'tmdb_movies': len(tmdb_movies),
    'matched_movies': len(merged_test),
    'imdb_reviews': len(imdb_reviews)
}

with open('exploration_stats.json', 'w') as f:
    json.dump(stats, f, indent=2)

print("✅ Stats saved to exploration_stats.json")
print("\n🎉 Exploration complete! Ready for Notebook 2: Data Cleaning")