# Data Preparation for TMDB and Netflix Titles

This notebook processes datasets from Netflix and TMDB to identify matching titles, explore statistics, and generate cleaned outputs. Each step is documented to ensure clarity and reproducibility.

## Pre- and Post-COVID Netflix Titles with TMDB
### 1. Loading and Inspecting Datasets

In [None]:
import pandas as pd

# Load datasets
pre_netflix_df = pd.read_csv('netflix_titles_1.csv')
tmdb_df = pd.read_csv('TMDB_movie_dataset_v11.csv')
post_netflix_df = pd.read_csv('titles.csv')

# Display dataset statistics
print('Pre-COVID Netflix dataset shape:', pre_netflix_df.shape)
print('TMDB dataset shape:', tmdb_df.shape)
print('Post-COVID Netflix dataset shape:', post_netflix_df.shape)

# Display a few rows from each dataset
pre_netflix_df.head(), tmdb_df.head(), post_netflix_df.head()
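
The later cells rely on a `title` column in every dataset and a `release_year` column in the post-COVID Netflix file. A minimal sketch of an early check is shown below; `require_columns` is a hypothetical helper, not part of pandas, and the toy frame only stands in for the real CSV.

```python
import pandas as pd

def require_columns(df, required, name):
    """Hypothetical helper: raise a clear error if a required column is missing."""
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise KeyError(f"{name} is missing columns: {missing}")
    return df

# Toy stand-in for the post-COVID Netflix file (illustrative data only)
toy_post = pd.DataFrame({
    'title': ['Example A', 'Example B'],
    'release_year': [2019, 2021],
})

# Passes silently when every required column is present
require_columns(toy_post, ['title', 'release_year'], 'post_netflix_df')
```

Running a check like this right after loading can surface a renamed or missing column before the merge steps fail with a less obvious error.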

### 2. Data Cleaning and Transformation

In [None]:
# Standardize title case to lowercase for comparison
pre_netflix_df['title_lower'] = pre_netflix_df['title'].str.lower()
tmdb_df['title_lower'] = tmdb_df['title'].str.lower()
post_netflix_df['title_lower'] = post_netflix_df['title'].str.lower()

# Drop duplicates in datasets
pre_netflix_df = pre_netflix_df.drop_duplicates(subset=['title_lower'])
tmdb_df = tmdb_df.drop_duplicates(subset=['title_lower'])
post_netflix_df = post_netflix_df.drop_duplicates(subset=['title_lower'])
print('Duplicates removed from datasets.')
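
Lowercasing alone does not remove stray whitespace, and dropping duplicates on the title key keeps only the first row for each title. The sketch below illustrates both effects on toy data (the titles and years are made up for illustration); adding `.str.strip()` is an assumption about what the cleaning might need, not something the cell above does.

```python
import pandas as pd

# Illustrative data only: two spellings of one title plus a distinct title
toy = pd.DataFrame({
    'title': ['Example Show', 'example show ', 'Another Title'],
    'release_year': [2005, 2001, 2017],
})

# Lowercasing alone leaves the trailing space, so the first two keys differ
toy['title_lower'] = toy['title'].str.lower()
n_lower_only = toy['title_lower'].nunique()

# Stripping whitespace as well collapses them into a single key
toy['title_lower'] = toy['title'].str.strip().str.lower()
n_stripped = toy['title_lower'].nunique()

# drop_duplicates keeps the first occurrence by default (keep='first')
deduped = toy.drop_duplicates(subset=['title_lower'])

print(n_lower_only, n_stripped, len(deduped))  # prints: 3 2 2
```

Note that deduplicating on the title alone also collapses distinct works that share a name (for example, a remake and its original), so the row that survives depends on file order.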

### 3. Merging Datasets

In [None]:
# Merge datasets on standardized title
pre_merged_df = pd.merge(pre_netflix_df, tmdb_df, on='title_lower', how='inner')

# Filter post-COVID titles (release year >= 2020), then merge the filtered rows with TMDB
post_covid_df = post_netflix_df[post_netflix_df['release_year'] >= 2020]
post_merged_df = pd.merge(post_covid_df, tmdb_df, on='title_lower', how='inner')

# Save pre-COVID merged dataset
pre_merged_df.to_csv('filtered_titles.csv', index=False)
# Save post-COVID merged dataset
post_merged_df.to_csv('post_matching_titles.csv', index=False)

print(f'Pre-COVID merged dataset shape: {pre_merged_df.shape}')
print(f'Post-COVID merged dataset shape: {post_merged_df.shape}')
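
An inner join silently discards titles with no TMDB match. One way to see how many rows are lost is a left join with `indicator=True`, sketched here on toy frames; the `score` column is a hypothetical placeholder and does not refer to a real TMDB field.

```python
import pandas as pd

# Toy frames standing in for the deduplicated Netflix and TMDB data
netflix = pd.DataFrame({'title_lower': ['alpha', 'beta', 'gamma']})
tmdb = pd.DataFrame({
    'title_lower': ['beta', 'gamma', 'delta'],
    'score': [6.1, 7.4, 5.0],  # hypothetical column for illustration
})

# how='left' keeps every Netflix row; indicator=True adds a `_merge` column.
# validate='one_to_one' raises if either side still has duplicate keys.
checked = pd.merge(netflix, tmdb, on='title_lower', how='left',
                   indicator=True, validate='one_to_one')

matched = checked[checked['_merge'] == 'both']
unmatched = checked[checked['_merge'] == 'left_only']

print(len(matched), len(unmatched))  # prints: 2 1
```

The `matched` rows correspond to what the inner join would return, while `unmatched` lists the Netflix titles that found no TMDB counterpart.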

### 4. Statistics and Identifying Common Titles

In [None]:
# Load pre- and post-COVID datasets
filtered_df = pd.read_csv('filtered_titles.csv')
post_matching_df = pd.read_csv('post_matching_titles.csv')

# Find common titles between datasets
common_titles = pd.merge(filtered_df, post_matching_df, on='title_lower', how='inner')

# Save common and remaining titles
common_titles.to_csv('common_titles.csv', index=False)
filtered_remaining = filtered_df[~filtered_df['title_lower'].isin(common_titles['title_lower'])]
post_matching_remaining = post_matching_df[~post_matching_df['title_lower'].isin(common_titles['title_lower'])]

filtered_remaining.to_csv('filtered_remaining.csv', index=False)
post_matching_remaining.to_csv('post_matching_remaining.csv', index=False)

print(f'Number of common titles: {common_titles.shape[0]}')
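
When two frames that share column names are merged, pandas appends `_x` and `_y` to the overlapping columns by default. The sketch below uses toy data with a hypothetical `rating` column to show how explicit suffixes keep the common-titles output readable; the suffix names `_pre` and `_post` are an assumption, not something the notebook defines.

```python
import pandas as pd

# Toy frames standing in for the pre- and post-COVID merged files
pre = pd.DataFrame({'title_lower': ['alpha', 'beta'], 'rating': ['PG', 'R']})
post = pd.DataFrame({'title_lower': ['beta', 'gamma'], 'rating': ['R', 'PG-13']})

# Explicit suffixes label which period each overlapping column came from
common = pd.merge(pre, post, on='title_lower', how='inner',
                  suffixes=('_pre', '_post'))

# Rows whose title does not appear in the common set
pre_remaining = pre[~pre['title_lower'].isin(common['title_lower'])]
post_remaining = post[~post['title_lower'].isin(common['title_lower'])]

print(list(common.columns))  # prints: ['title_lower', 'rating_pre', 'rating_post']
```

Here `pre_remaining` holds only `alpha` and `post_remaining` holds only `gamma`, mirroring the split between common and remaining titles above.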

## Summary

This notebook prepared three datasets (a pre-COVID Netflix catalog, a post-COVID Netflix catalog, and TMDB) to match Netflix titles against TMDB and to identify titles common to both periods. The key steps were:

- Data Loading and Inspection: Loaded the three CSV files and reviewed their shapes and first rows.
- Data Cleaning and Transformation: Lowercased titles into a `title_lower` key and removed rows with duplicate keys.
- Merging and Filtering: Inner-joined each Netflix dataset with TMDB on `title_lower`, restricting the post-COVID set to release years of 2020 or later.
- Analysis and Output: Counted the titles shared by the pre- and post-COVID merges and exported the common and remaining titles to CSV.

The exported files provide a cleaned starting point for further analysis, such as comparing pre- and post-COVID release patterns.