# Ranking: Data Exploration

**Phase 1: Exploration**

This notebook explores the MovieLens 1M dataset to understand:
- Data structure and schemas
- Data quality and completeness
- Distribution of users, movies, and ratings
- Potential features for a ranking model

Goal: Build intuition about the data before feature engineering and model training.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)


## 1. Load Data

MovieLens 1M uses `::` as the delimiter. The datasets are:
- **movies.dat**: MovieID::Title::Genres
- **ratings.dat**: UserID::MovieID::Rating::Timestamp
- **users.dat**: UserID::Gender::Age::Occupation::Zip-code
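
Because `::` is longer than one character, pandas interprets it as a regular expression, which the default C parser does not support; that is why the loading cells pass `engine='python'`. A minimal sketch on an in-memory sample, assuming two illustrative rows written in the `movies.dat` layout (nothing is read from disk, and the names `sample` and `sample_movies` are used only here):

```python
import io
import pandas as pd

# Two illustrative rows in the movies.dat layout (MovieID::Title::Genres).
sample = io.StringIO(
    "1::Toy Story (1995)::Animation|Children's|Comedy\n"
    "2::Jumanji (1995)::Adventure|Children's|Fantasy\n"
)

sample_movies = pd.read_csv(
    sample,
    sep='::',            # multi-character separator, interpreted as a regex
    engine='python',     # the C engine does not support regex separators
    names=['movie_id', 'title', 'genres'],
    header=None,
)
print(sample_movies)
```

The genre field keeps its `|` separators as plain text, since only `::` is treated as the column delimiter.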

In [None]:
# Define data path
DATA_DIR = Path('../data/ml-1m')

# Load movies
movies = pd.read_csv(
    DATA_DIR / 'movies.dat',
    sep='::',
    engine='python',
    encoding='latin-1',
    names=['movie_id', 'title', 'genres'],
    header=None
)

print(f"Movies shape: {movies.shape}")
movies.head()

In [None]:
# Load ratings
ratings = pd.read_csv(
    DATA_DIR / 'ratings.dat',
    sep='::',
    engine='python',
    names=['user_id', 'movie_id', 'rating', 'timestamp'],
    header=None
)

print(f"Ratings shape: {ratings.shape}")
ratings.head()

In [None]:
# Load users
users = pd.read_csv(
    DATA_DIR / 'users.dat',
    sep='::',
    engine='python',
    names=['user_id', 'gender', 'age', 'occupation', 'zip_code'],
    header=None
)

print(f"Users shape: {users.shape}")
users.head()

## 2. Basic Data Quality Checks

In [None]:
# Check for missing values
print("Missing values in movies:")
print(movies.isnull().sum())
print("\nMissing values in ratings:")
print(ratings.isnull().sum())
print("\nMissing values in users:")
print(users.isnull().sum())
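
Missing values are only one aspect of data quality. The sketch below adds a few consistency checks that may be worth running as well: repeated (user, movie) pairs, ratings outside the 1-5 star scale, and ratings that reference an unknown user or movie. The helper name `check_ratings` and the toy frames are hypothetical and exist only to show each check firing once.

```python
import pandas as pd

def check_ratings(ratings, users, movies):
    """Return a dict of simple integrity counts for the ratings table."""
    return {
        # The same user rating the same movie more than once
        'duplicate_pairs': int(ratings.duplicated(['user_id', 'movie_id']).sum()),
        # Ratings outside the 1-5 star scale
        'out_of_range': int((~ratings['rating'].between(1, 5)).sum()),
        # Ratings that point at a user or movie missing from its table
        'unknown_users': int((~ratings['user_id'].isin(users['user_id'])).sum()),
        'unknown_movies': int((~ratings['movie_id'].isin(movies['movie_id'])).sum()),
    }

# Toy frames (hypothetical values) with one problem of each kind
toy_users = pd.DataFrame({'user_id': [1, 2]})
toy_movies = pd.DataFrame({'movie_id': [10, 20]})
toy_ratings = pd.DataFrame({
    'user_id':  [1, 1, 2, 3],
    'movie_id': [10, 10, 20, 30],
    'rating':   [5, 4, 6, 3],
})
report = check_ratings(toy_ratings, toy_users, toy_movies)
print(report)
```

On the loaded data the call would be `check_ratings(ratings, users, movies)`; all four counts being zero would suggest the tables are mutually consistent.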

In [None]:
# Basic statistics
print("Dataset Overview:")
print(f"Total users: {users['user_id'].nunique():,}")
print(f"Total movies: {movies['movie_id'].nunique():,}")
print(f"Total ratings: {len(ratings):,}")
print(f"\nSparsity: {1 - len(ratings) / (users['user_id'].nunique() * movies['movie_id'].nunique()):.4%}")
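
The sparsity printed above is 1 - R / (U × M), with M taken from the movie catalog. If some catalog movies never received a rating, counting only rated movies gives a smaller matrix and therefore a lower sparsity. A small sketch of that difference, using made-up counts chosen only to keep the arithmetic easy to follow (the function name `sparsity` is hypothetical):

```python
def sparsity(n_ratings, n_users, n_movies):
    """Fraction of the user x movie matrix that has no rating."""
    return 1 - n_ratings / (n_users * n_movies)

# Hypothetical counts, not taken from the dataset
n_ratings, n_users = 200, 10
catalog_movies = 50   # every movie listed in the catalog
rated_movies = 40     # movies with at least one rating

print(f"Catalog denominator: {sparsity(n_ratings, n_users, catalog_movies):.2%}")
print(f"Rated-only denominator: {sparsity(n_ratings, n_users, rated_movies):.2%}")
```

On the real tables, the rated-only movie count would be `ratings['movie_id'].nunique()`.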

## 3. Rating Distribution

In [None]:
# Rating distribution
print("Rating statistics:")
print(ratings['rating'].describe())

# Plot rating distribution
plt.figure(figsize=(10, 5))
ratings['rating'].value_counts().sort_index().plot(kind='bar')
plt.title('Distribution of Ratings')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.show()
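
Raw counts can be hard to compare across datasets or subsets, so it may help to look at proportions too. A minimal sketch using `value_counts(normalize=True)` on a hypothetical handful of ratings:

```python
import pandas as pd

# Hypothetical ratings, only to illustrate the call
toy = pd.Series([5, 4, 4, 3, 5, 4, 1, 4], name='rating')

# Share of each rating value, ordered by rating
shares = toy.value_counts(normalize=True).sort_index()
print(shares)
```

The same call on `ratings['rating']` gives the share of each star level in the full dataset.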

## 4. User Behavior Analysis

In [None]:
# Ratings per user
user_rating_counts = ratings.groupby('user_id').size()

print("User rating statistics:")
print(user_rating_counts.describe())

plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
user_rating_counts.hist(bins=50)
plt.title('Distribution of Ratings per User')
plt.xlabel('Number of Ratings')
plt.ylabel('Number of Users')

plt.subplot(1, 2, 2)
user_rating_counts.plot(kind='box')
plt.title('Box Plot: Ratings per User')
plt.ylabel('Number of Ratings')
plt.tight_layout()
plt.show()

## 5. Movie Popularity Analysis

In [None]:
# Ratings per movie
movie_rating_counts = ratings.groupby('movie_id').size()

print("Movie rating statistics:")
print(movie_rating_counts.describe())

plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
movie_rating_counts.hist(bins=50)
plt.title('Distribution of Ratings per Movie')
plt.xlabel('Number of Ratings')
plt.ylabel('Number of Movies')

plt.subplot(1, 2, 2)
movie_rating_counts.plot(kind='box')
plt.title('Box Plot: Ratings per Movie')
plt.ylabel('Number of Ratings')
plt.tight_layout()
plt.show()
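
A histogram shows that popularity is skewed, but a single concentration figure is often easier to track. The sketch below computes the share of all ratings held by the most-rated fraction of movies. The helper `top_share` and the toy counts are hypothetical.

```python
import pandas as pd

def top_share(counts, fraction):
    """Share of all ratings that goes to the most-rated `fraction` of items."""
    ordered = counts.sort_values(ascending=False)
    k = max(1, int(len(ordered) * fraction))   # keep at least one item
    return ordered.iloc[:k].sum() / ordered.sum()

# Hypothetical per-movie rating counts with a heavy head
toy_counts = pd.Series([500, 200, 100, 50, 50, 40, 30, 20, 5, 5])
print(f"Top 20% of movies hold {top_share(toy_counts, 0.2):.0%} of ratings")
```

With the series from the cell above, the call would be `top_share(movie_rating_counts, 0.2)`.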

In [None]:
# Most-rated movies (by number of ratings, not by average score)
top_movies = ratings.groupby('movie_id').size().sort_values(ascending=False).head(10)
top_movies_with_titles = top_movies.to_frame('rating_count').join(movies.set_index('movie_id'))

print("Top 10 most rated movies:")
print(top_movies_with_titles[['title', 'rating_count']])

## 6. User Demographics

In [None]:
# Gender distribution
print("Gender distribution:")
print(users['gender'].value_counts())

plt.figure(figsize=(12, 4))
plt.subplot(1, 3, 1)
users['gender'].value_counts().plot(kind='bar')
plt.title('Gender Distribution')
plt.xlabel('Gender')
plt.ylabel('Count')

# Age distribution
plt.subplot(1, 3, 2)
users['age'].value_counts().sort_index().plot(kind='bar')
plt.title('Age Distribution')
plt.xlabel('Age Group')
plt.ylabel('Count')

# Occupation distribution
plt.subplot(1, 3, 3)
users['occupation'].value_counts().sort_index().plot(kind='bar')
plt.title('Occupation Distribution')
plt.xlabel('Occupation Code')
plt.ylabel('Count')

plt.tight_layout()
plt.show()
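
The `age` column in MovieLens 1M holds bucket codes rather than ages in years, so the bar labels above are codes. A small sketch that maps the codes to readable labels, following the ranges given in the dataset README (the constant name `AGE_LABELS` and the toy rows are assumptions for illustration):

```python
import pandas as pd

# Age codes and their ranges as documented in the MovieLens 1M README
AGE_LABELS = {
    1: 'Under 18', 18: '18-24', 25: '25-34', 35: '35-44',
    45: '45-49', 50: '50-55', 56: '56+',
}

# Hypothetical rows, only to show the mapping
toy_users = pd.DataFrame({'user_id': [1, 2, 3], 'age': [1, 56, 25]})
toy_users['age_group'] = toy_users['age'].map(AGE_LABELS)
print(toy_users)
```

Any code not present in the dictionary would map to `NaN`, which doubles as a quick check for unexpected values.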

## 7. Genre Analysis

In [None]:
# Extract and count genres
# Genres are pipe-separated, e.g., "Action|Adventure|Sci-Fi"
all_genres = movies['genres'].str.split('|').explode()
genre_counts = all_genres.value_counts()

print("Genre distribution:")
print(genre_counts)

plt.figure(figsize=(12, 6))
genre_counts.plot(kind='barh')
plt.title('Movie Count by Genre')
plt.xlabel('Count')
plt.ylabel('Genre')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
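
For later feature work, the pipe-separated genre string can also be turned into one indicator column per genre. A minimal sketch with `Series.str.get_dummies`, using two hypothetical rows:

```python
import pandas as pd

# Hypothetical rows in the same pipe-separated format
toy_movies = pd.DataFrame({
    'movie_id': [1, 2],
    'genres': ['Action|Sci-Fi', 'Comedy'],
})

# One column per genre, 1 where the movie carries that genre
genre_flags = toy_movies['genres'].str.get_dummies(sep='|')
print(genre_flags)
```

The resulting frame shares its index with `toy_movies`, so it can be joined back with `toy_movies.join(genre_flags)`.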

## 8. Temporal Analysis

In [None]:
# Convert timestamp to datetime
ratings['datetime'] = pd.to_datetime(ratings['timestamp'], unit='s')

print("Rating time range:")
print(f"First rating: {ratings['datetime'].min()}")
print(f"Last rating: {ratings['datetime'].max()}")
print(f"Time span: {(ratings['datetime'].max() - ratings['datetime'].min()).days} days")

# Ratings over time
ratings_by_date = ratings.groupby(ratings['datetime'].dt.date).size()

plt.figure(figsize=(14, 5))
ratings_by_date.plot()
plt.title('Ratings Over Time')
plt.xlabel('Date')
plt.ylabel('Number of Ratings')
plt.tight_layout()
plt.show()
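
Daily counts can be noisy over a multi-year span, so a coarser grouping may be easier to read. The sketch below groups by calendar month with `dt.to_period('M')`. The three timestamps are hypothetical and fall on 1 January, 2 January, and 1 February 2001 (UTC).

```python
import pandas as pd

# Hypothetical Unix timestamps (seconds) spanning two months
toy = pd.DataFrame({'timestamp': [978307200, 978393600, 980985600]})
toy['datetime'] = pd.to_datetime(toy['timestamp'], unit='s')

# Group by calendar month instead of by day
ratings_by_month = toy.groupby(toy['datetime'].dt.to_period('M')).size()
print(ratings_by_month)
```

Applied to the `ratings` frame, the same grouping yields a monthly series that can be plotted with `ratings_by_month.plot()`.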

## 9. Summary & Next Steps

Key observations to record from this exploration:
- Dataset characteristics (size, sparsity, distributions)
- Data quality (completeness, consistency)
- User behavior patterns
- Movie popularity patterns
- Available features for ranking

Next steps:
1. Feature engineering for the ranking model
2. Train/test split strategy
3. Baseline model development
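
Since the ratings carry timestamps, one candidate for the train/test split listed above is a chronological cut, which keeps later ratings out of training. This is a sketch of one option under that assumption, not a decision made in this notebook. The helper `temporal_split` and the toy rows are hypothetical.

```python
import pandas as pd

def temporal_split(ratings, test_fraction=0.2):
    """Hold out the most recent `test_fraction` of ratings as the test set."""
    ordered = ratings.sort_values('timestamp', kind='stable')
    n_test = int(round(len(ordered) * test_fraction))
    cut = len(ordered) - n_test
    return ordered.iloc[:cut], ordered.iloc[cut:]

# Hypothetical rows; timestamps are small integers for readability
toy = pd.DataFrame({
    'user_id':   [1, 2, 1, 3, 2],
    'movie_id':  [10, 10, 20, 30, 20],
    'timestamp': [50, 10, 40, 20, 30],
})
train, test = temporal_split(toy, test_fraction=0.2)
print(len(train), len(test))
```

A global cut like this can leave some users with no training ratings; a per-user chronological split is an alternative worth comparing.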