# SmartRecs

## Load and Merge IMDb Datasets

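- We start by loading the `title.basics.tsv` and `title.ratings.tsv` files from the IMDb datasets.
- `title.basics.tsv` describes every title (type, name, year, genres); `title.ratings.tsv` holds each title's average rating and vote count.
- We keep only rows whose `titleType` is `movie` and merge the two datasets on the shared `tconst` identifier.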

In [None]:
# Import necessary libraries
import pandas as pd

# Load datasets
basics = pd.read_csv('../data/title.basics.tsv', sep='\t', na_values='\\N', low_memory=False)
ratings = pd.read_csv('../data/title.ratings.tsv', sep='\t', na_values='\\N')

# Filter only movies (not shorts, TV shows, etc.)
movies = basics[basics['titleType'] == 'movie']

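# Drop rows with missing essential values; .copy() gives an independent frame
# so the column assignment below does not raise SettingWithCopyWarning
movies = movies.dropna(subset=['primaryTitle', 'startYear', 'genres']).copy()

# Convert year to integer (safe now that rows with a missing year are gone)
movies['startYear'] = movies['startYear'].astype(int)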

# Merge with ratings (inner join: movies without a ratings row are dropped)
movies_merged = pd.merge(movies, ratings, on='tconst', how='inner')

# Show a preview
movies_merged.head()
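
The load, filter, and merge steps above can be checked on a small scale before running them on the full files. The sketch below is only an illustration: the two in-memory TSV strings and every name ending in `_toy` are made up, and the rows are not real IMDb data. It shows how `na_values='\\N'` turns IMDb's `\N` placeholder into `NaN`, and uses `indicator=True` on a left join to list the movies that an inner join would silently drop.

```python
import io
import pandas as pd

# Toy stand-ins for the two IMDb files (values are made up for illustration)
basics_tsv = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\tgenres\n"
    "tt001\tmovie\tAlpha\t1999\tDrama\n"
    "tt002\tshort\tBeta\t2001\tComedy\n"
    "tt003\tmovie\tGamma\t\\N\tAction\n"
    "tt004\tmovie\tDelta\t2010\tAction,Drama\n"
)
ratings_tsv = (
    "tconst\taverageRating\tnumVotes\n"
    "tt001\t7.5\t1200\n"
    "tt002\t6.0\t50\n"
)

# Same read options as the notebook: tab-separated, '\N' treated as missing
basics_toy = pd.read_csv(io.StringIO(basics_tsv), sep="\t", na_values="\\N")
ratings_toy = pd.read_csv(io.StringIO(ratings_tsv), sep="\t", na_values="\\N")

# Keep movies, drop rows with missing essentials, convert the year to int
movies_toy = basics_toy[basics_toy["titleType"] == "movie"]
movies_toy = movies_toy.dropna(subset=["primaryTitle", "startYear", "genres"]).copy()
movies_toy["startYear"] = movies_toy["startYear"].astype(int)

# A left join with indicator=True marks movies that have no ratings row
merged_toy = pd.merge(movies_toy, ratings_toy, on="tconst", how="left", indicator=True)
unrated = merged_toy[merged_toy["_merge"] == "left_only"]
print("Movies without a ratings row:", unrated["tconst"].tolist())

# The default inner join keeps only movies present in both tables
inner_toy = pd.merge(movies_toy, ratings_toy, on="tconst")
print("Rows kept by the inner join:", len(inner_toy))
```

Running the same `indicator=True` check on the real `movies` and `ratings` frames would show how many movies have no rating at all.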


## Data Preprocessing

We clean the dataset by:
- Removing movies released before 1950
- Removing movies with fewer than 1,000 votes, so that only widely rated titles remain
- Resetting the index so rows are numbered consecutively from 0

In [None]:
# Filter: Keep movies released from 1950 onwards
movies_filtered = movies_merged[movies_merged['startYear'] >= 1950]

# Filter: Keep only movies with at least 1000 votes
movies_filtered = movies_filtered[movies_filtered['numVotes'] >= 1000]

# Reset index
movies_filtered = movies_filtered.reset_index(drop=True)

# Quick check
print("Filtered dataset shape:", movies_filtered.shape)
movies_filtered[['primaryTitle', 'startYear', 'genres', 'averageRating', 'numVotes']].head()
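
To see how much each filter contributes, the two conditions can be built as boolean masks and counted before they are applied. This is a minimal sketch on a made-up four-row frame; `toy`, `MIN_YEAR`, and `MIN_VOTES` are hypothetical names, and only the two thresholds come from the text above.

```python
import pandas as pd

# Made-up rows standing in for movies_merged
toy = pd.DataFrame({
    "primaryTitle": ["A", "B", "C", "D"],
    "startYear": [1940, 1975, 1990, 2005],
    "numVotes": [5000, 800, 1000, 250000],
})

MIN_YEAR = 1950   # hypothetical constant names; the thresholds are from the text
MIN_VOTES = 1000

# One boolean mask per rule
recent = toy["startYear"] >= MIN_YEAR
popular = toy["numVotes"] >= MIN_VOTES

# Count what each rule removes (the vote filter is counted after the year filter)
print("Dropped by year filter:", int((~recent).sum()))
print("Dropped by vote filter:", int((recent & ~popular).sum()))

# Apply both rules at once and renumber the rows from 0
toy_filtered = toy[recent & popular].reset_index(drop=True)
print(toy_filtered)
```

Combining the masks with `&` gives the same rows as filtering in two steps, and keeping the thresholds in named constants makes them easy to change later.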

## Exploratory Data Analysis (EDA)

Here we explore the dataset with visualizations to understand:
- Distribution of movie ratings
- Number of votes received
- Most common genres

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Set style (set_theme is the current name; sns.set is an alias for it)
sns.set_theme(style="darkgrid")

# Rating distribution
plt.figure(figsize=(8, 5))
sns.histplot(movies_filtered['averageRating'], bins=20, kde=True)
plt.title('Distribution of Average Ratings')
plt.xlabel('Average Rating')
plt.ylabel('Number of Movies')
plt.show()

# Vote count distribution (log scale to reduce skew)
plt.figure(figsize=(8, 5))
sns.histplot(movies_filtered['numVotes'], bins=30, log_scale=True)
plt.title('Distribution of Vote Counts (log scale)')
plt.xlabel('Number of Votes')
plt.ylabel('Number of Movies')
plt.show()

# Most common genres
from collections import Counter

# Flatten genre list
genre_list = []
for genres in movies_filtered['genres'].dropna():
    genre_list.extend(genres.split(','))

genre_counts = Counter(genre_list)
top_genres = pd.DataFrame(genre_counts.most_common(10), columns=['Genre', 'Count'])

plt.figure(figsize=(8, 5))
sns.barplot(data=top_genres, x='Genre', y='Count')
plt.title('Top 10 Genres')
plt.xlabel('Genre')
plt.ylabel('Number of Movies')
plt.xticks(rotation=45)
plt.show()
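
The genre count above uses a Python loop and `Counter`. As an alternative, the same tally can be done inside pandas with `str.split`, `explode`, and `value_counts`. The sketch below assumes a made-up `toy` frame with a comma-separated `genres` column in the same format as the IMDb data.

```python
import pandas as pd

# Made-up genre strings in the same comma-separated format as IMDb
toy = pd.DataFrame({"genres": ["Action,Drama", "Drama", "Comedy,Drama", "Action"]})

# Split each string into a list, emit one row per genre, then count
genre_counts_toy = (
    toy["genres"]
    .str.split(",")
    .explode()
    .value_counts()
)

# Shape the result like top_genres: one 'Genre' column and one 'Count' column
top_genres_toy = (
    genre_counts_toy.head(10)
    .rename_axis("Genre")
    .reset_index(name="Count")
)
print(top_genres_toy)
```

The resulting frame has the same `Genre` and `Count` columns as `top_genres`, so it can be passed to `sns.barplot` unchanged.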


## Model Building - Collaborative Filtering

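We'll use the `Surprise` library to implement three collaborative filtering (CF) approaches:
- User-based CF: recommend what similar users rated highly
- Item-based CF: recommend movies similar to ones the user already rated highly
- SVD (matrix factorization): learn latent factors for users and movies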
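
Collaborative filtering needs ratings from individual users, while `title.ratings.tsv` stores only one aggregate rating per title, so the sketch below assumes a separate, hypothetical user-item rating matrix `R`. It is written in plain NumPy rather than with Surprise, purely to illustrate the idea behind item-based CF: compute cosine similarity between movie columns, then predict a missing rating as a similarity-weighted average of the user's other ratings. Treating `0` as "not rated" inside the cosine computation is a simplification, and the function names are made up for this example.

```python
import numpy as np

# Hypothetical ratings: rows = users, columns = movies, 0 = not rated
R = np.array([
    [9.0, 8.0, 0.0, 2.0],
    [8.0, 9.0, 3.0, 0.0],
    [2.0, 0.0, 9.0, 8.0],
    [0.0, 3.0, 8.0, 9.0],
])

def item_cosine_similarity(ratings):
    """Cosine similarity between item (column) vectors."""
    norms = np.linalg.norm(ratings, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for unrated items
    normalized = ratings / norms
    return normalized.T @ normalized

def predict_item_based(ratings, sim, user, item):
    """Similarity-weighted average of the user's ratings on other items."""
    rated = ratings[user] > 0   # items this user has rated
    rated[item] = False         # never use the target item itself
    weights = sim[item, rated]
    if weights.sum() == 0:
        return float("nan")     # no overlapping information to predict from
    return float(weights @ ratings[user, rated] / weights.sum())

sim = item_cosine_similarity(R)
pred = predict_item_based(R, sim, user=0, item=2)
print(f"Predicted rating of user 0 for movie 2: {pred:.2f}")
```

User-based CF follows the same pattern with similarities computed between rows (users) instead of columns.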