# Data Preparation:

To enrich the analysis, I incorporated data from the TMDB dataset, which provides extensive information about movies, such as genres, popularity, and release years. By connecting Netflix's pre- and post-COVID datasets with TMDB, I aim to identify patterns, shifts, and trends in Netflix's content strategy.

### Structure of the Project:
**Data Preparation for TMDB and Netflix Titles:**

The first step involved processing datasets from Netflix and TMDB to ensure consistency and quality. Titles were standardized, duplicates were removed, and pre- and post-COVID Netflix datasets were merged with the TMDB dataset through an inner join to include only matching titles. This phase focused on producing cleaned datasets that would serve as a foundation for reliable analysis.

**Analysis and Visualization:**

The second phase involved analyzing the prepared datasets to uncover insights about Netflix's content strategy before and after COVID-19. Using visualizations, I explored trends such as the distribution of movies by genre, release year, and popularity, comparing pre- and post-COVID periods. The findings highlight shifts in Netflix's catalog and provide a deeper understanding of its response to changing consumer needs during and after the pandemic.

This project combines my passion for movies and data analysis to uncover meaningful insights into Netflix's evolving content strategy, with a focus on how the pandemic influenced its movie library.

## Data Preparation for TMDB and Netflix Titles

This notebook processes datasets from Netflix and TMDB to identify matching titles, explore statistics, and generate cleaned outputs. Each step is documented to ensure clarity and reproducibility.

## Pre Covid Netflix titles with TMDB
### 1. Loading and Inspecting Datasets

In [109]:
import pandas as pd

# Load datasets
pre_netflix_df = pd.read_csv('PreCovid_titles.csv')
tmdb_df = pd.read_csv('TMDB_movie_dataset_v11.csv')
post_netflix_df = pd.read_csv('PostCovid_titles_May2022.csv')

I have displayed the shapes and a few titles from each dataset below. First one is pre-COVID dataset.

In [111]:
print('Netflix Pre-COVID dataset shape:', pre_netflix_df.shape)
pre_netflix_df.head()

Netflix Pre-COVID dataset shape: (6234, 12)


Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,81145628,Movie,Norm of the North: King Sized Adventure,"Richard Finn, Tim Maltby","Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China","September 9, 2019",2019,TV-PG,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...
1,80117401,Movie,Jandino: Whatever it Takes,,Jandino Asporaat,United Kingdom,"September 9, 2016",2016,TV-MA,94 min,Stand-Up Comedy,Jandino Asporaat riffs on the challenges of ra...
2,70234439,TV Show,Transformers Prime,,"Peter Cullen, Sumalee Montano, Frank Welker, J...",United States,"September 8, 2018",2013,TV-Y7-FV,1 Season,Kids' TV,"With the help of three human allies, the Autob..."
3,80058654,TV Show,Transformers: Robots in Disguise,,"Will Friedle, Darren Criss, Constance Zimmer, ...",United States,"September 8, 2018",2016,TV-Y7,1 Season,Kids' TV,When a prison ship crash unleashes hundreds of...
4,80125979,Movie,#realityhigh,Fernando Lebrija,"Nesta Cooper, Kate Walsh, John Michael Higgins...",United States,"September 8, 2017",2017,TV-14,99 min,Comedies,When nerdy high schooler Dani finally attracts...


Now, the TMDB dataset.

In [113]:
print('TMDB dataset shape:', tmdb_df.shape)
tmdb_df.head()

TMDB dataset shape: (1147541, 24)


Unnamed: 0,id,title,vote_average,vote_count,status,release_date,revenue,runtime,adult,backdrop_path,...,original_title,overview,popularity,poster_path,tagline,genres,production_companies,production_countries,spoken_languages,keywords
0,27205,Inception,8.364,34495,Released,2010-07-15,825532764,148,False,/8ZTVqvKDQ8emSGUEMjsS4yHAwrp.jpg,...,Inception,"Cobb, a skilled thief who commits corporate es...",83.952,/oYuLEt3zVCKq57qu2F8dT7NIa6f.jpg,Your mind is the scene of the crime.,"Action, Science Fiction, Adventure","Legendary Pictures, Syncopy, Warner Bros. Pict...","United Kingdom, United States of America","English, French, Japanese, Swahili","rescue, mission, dream, airplane, paris, franc..."
1,157336,Interstellar,8.417,32571,Released,2014-11-05,701729206,169,False,/pbrkL804c8yAv3zBZR4QPEafpAR.jpg,...,Interstellar,The adventures of a group of explorers who mak...,140.241,/gEU2QniE6E77NI6lCU6MxlNBvIx.jpg,Mankind was born on Earth. It was never meant ...,"Adventure, Drama, Science Fiction","Legendary Pictures, Syncopy, Lynda Obst Produc...","United Kingdom, United States of America",English,"rescue, future, spacecraft, race against time,..."
2,155,The Dark Knight,8.512,30619,Released,2008-07-16,1004558444,152,False,/nMKdUUepR0i5zn0y1T4CsSB5chy.jpg,...,The Dark Knight,Batman raises the stakes in his war on crime. ...,130.643,/qJ2tW6WMUDux911r6m7haRef0WH.jpg,Welcome to a world without rules.,"Drama, Action, Crime, Thriller","DC Comics, Legendary Pictures, Syncopy, Isobel...","United Kingdom, United States of America","English, Mandarin","joker, sadism, chaos, secret identity, crime f..."
3,19995,Avatar,7.573,29815,Released,2009-12-15,2923706026,162,False,/vL5LR6WdxWPjLPFRLe133jXWsh5.jpg,...,Avatar,"In the 22nd century, a paraplegic Marine is di...",79.932,/kyeqWdyUXW608qlYkRqosgbbJyK.jpg,Enter the world of Pandora.,"Action, Adventure, Fantasy, Science Fiction","Dune Entertainment, Lightstorm Entertainment, ...","United States of America, United Kingdom","English, Spanish","future, society, culture clash, space travel, ..."
4,24428,The Avengers,7.71,29166,Released,2012-04-25,1518815515,143,False,/9BBTo63ANSmhC4e6r62OJFuK2GL.jpg,...,The Avengers,When an unexpected enemy emerges and threatens...,98.082,/RYMX2wcKCBAr24UyPD7xwmjaTn.jpg,Some assembly required.,"Science Fiction, Action, Adventure",Marvel Studios,United States of America,"English, Hindi, Russian","new york city, superhero, shield, based on com..."


Finally, the post-COVID dataset.

In [115]:
print('Netflix post-COVID dataset shape:', post_netflix_df.shape)
post_netflix_df.head()

Netflix post-COVID dataset shape: (5806, 15)


Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,ts300399,Five Came Back: The Reference Films,SHOW,This collection includes 12 World War II-era p...,1945,TV-MA,48,['documentation'],['US'],1.0,,,,0.6,
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,113,"['crime', 'drama']",['US'],,tt0075314,8.3,795222.0,27.612,8.2
2,tm127384,Monty Python and the Holy Grail,MOVIE,"King Arthur, accompanied by his squire, recrui...",1975,PG,91,"['comedy', 'fantasy']",['GB'],,tt0071853,8.2,530877.0,18.216,7.8
3,tm70993,Life of Brian,MOVIE,"Brian Cohen is an average young Jewish man, bu...",1979,R,94,['comedy'],['GB'],,tt0079470,8.0,392419.0,17.505,7.8
4,tm190788,The Exorcist,MOVIE,12-year-old Regan MacNeil begins to adapt an e...,1973,R,133,['horror'],['US'],,tt0070047,8.1,391942.0,95.337,7.7


### 2. Data Cleaning and Transformation

The data cleaning and transformation steps for the source datasets are standardizing title case to lowercase and dropping duplicates in datasets.

Titles in the datasets may have case differences (e.g., "Inception" vs. "inception"), which would otherwise be treated as distinct entries. Converting titles to lowercase standardizes them for consistent comparison and duplicate removal.

In [119]:
pre_netflix_df['title_lower'] = pre_netflix_df['title'].str.lower()
tmdb_df['title_lower'] = tmdb_df['title'].str.lower()
post_netflix_df['title_lower'] = post_netflix_df['title'].str.lower()

Duplicate titles can distort analysis and inflate metrics. Removing duplicates ensures the data reflects unique entries, leading to cleaner and more accurate insights.

In [121]:
pre_netflix_df = pre_netflix_df.drop_duplicates(subset=['title_lower'])
print('Netflix Pre-COVID dataset shape:', pre_netflix_df.shape)
pre_netflix_df.head()

Netflix Pre-COVID dataset shape: (6169, 13)


Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,title_lower
0,81145628,Movie,Norm of the North: King Sized Adventure,"Richard Finn, Tim Maltby","Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China","September 9, 2019",2019,TV-PG,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...,norm of the north: king sized adventure
1,80117401,Movie,Jandino: Whatever it Takes,,Jandino Asporaat,United Kingdom,"September 9, 2016",2016,TV-MA,94 min,Stand-Up Comedy,Jandino Asporaat riffs on the challenges of ra...,jandino: whatever it takes
2,70234439,TV Show,Transformers Prime,,"Peter Cullen, Sumalee Montano, Frank Welker, J...",United States,"September 8, 2018",2013,TV-Y7-FV,1 Season,Kids' TV,"With the help of three human allies, the Autob...",transformers prime
3,80058654,TV Show,Transformers: Robots in Disguise,,"Will Friedle, Darren Criss, Constance Zimmer, ...",United States,"September 8, 2018",2016,TV-Y7,1 Season,Kids' TV,When a prison ship crash unleashes hundreds of...,transformers: robots in disguise
4,80125979,Movie,#realityhigh,Fernando Lebrija,"Nesta Cooper, Kate Walsh, John Michael Higgins...",United States,"September 8, 2017",2017,TV-14,99 min,Comedies,When nerdy high schooler Dani finally attracts...,#realityhigh


In [122]:
tmdb_df = tmdb_df.drop_duplicates(subset=['title_lower'])
print('TMDB dataset shape:', tmdb_df.shape)
tmdb_df.head()

TMDB dataset shape: (973912, 25)


Unnamed: 0,id,title,vote_average,vote_count,status,release_date,revenue,runtime,adult,backdrop_path,...,overview,popularity,poster_path,tagline,genres,production_companies,production_countries,spoken_languages,keywords,title_lower
0,27205,Inception,8.364,34495,Released,2010-07-15,825532764,148,False,/8ZTVqvKDQ8emSGUEMjsS4yHAwrp.jpg,...,"Cobb, a skilled thief who commits corporate es...",83.952,/oYuLEt3zVCKq57qu2F8dT7NIa6f.jpg,Your mind is the scene of the crime.,"Action, Science Fiction, Adventure","Legendary Pictures, Syncopy, Warner Bros. Pict...","United Kingdom, United States of America","English, French, Japanese, Swahili","rescue, mission, dream, airplane, paris, franc...",inception
1,157336,Interstellar,8.417,32571,Released,2014-11-05,701729206,169,False,/pbrkL804c8yAv3zBZR4QPEafpAR.jpg,...,The adventures of a group of explorers who mak...,140.241,/gEU2QniE6E77NI6lCU6MxlNBvIx.jpg,Mankind was born on Earth. It was never meant ...,"Adventure, Drama, Science Fiction","Legendary Pictures, Syncopy, Lynda Obst Produc...","United Kingdom, United States of America",English,"rescue, future, spacecraft, race against time,...",interstellar
2,155,The Dark Knight,8.512,30619,Released,2008-07-16,1004558444,152,False,/nMKdUUepR0i5zn0y1T4CsSB5chy.jpg,...,Batman raises the stakes in his war on crime. ...,130.643,/qJ2tW6WMUDux911r6m7haRef0WH.jpg,Welcome to a world without rules.,"Drama, Action, Crime, Thriller","DC Comics, Legendary Pictures, Syncopy, Isobel...","United Kingdom, United States of America","English, Mandarin","joker, sadism, chaos, secret identity, crime f...",the dark knight
3,19995,Avatar,7.573,29815,Released,2009-12-15,2923706026,162,False,/vL5LR6WdxWPjLPFRLe133jXWsh5.jpg,...,"In the 22nd century, a paraplegic Marine is di...",79.932,/kyeqWdyUXW608qlYkRqosgbbJyK.jpg,Enter the world of Pandora.,"Action, Adventure, Fantasy, Science Fiction","Dune Entertainment, Lightstorm Entertainment, ...","United States of America, United Kingdom","English, Spanish","future, society, culture clash, space travel, ...",avatar
4,24428,The Avengers,7.71,29166,Released,2012-04-25,1518815515,143,False,/9BBTo63ANSmhC4e6r62OJFuK2GL.jpg,...,When an unexpected enemy emerges and threatens...,98.082,/RYMX2wcKCBAr24UyPD7xwmjaTn.jpg,Some assembly required.,"Science Fiction, Action, Adventure",Marvel Studios,United States of America,"English, Hindi, Russian","new york city, superhero, shield, based on com...",the avengers


In [123]:
post_netflix_df = post_netflix_df.drop_duplicates(subset=['title_lower'])
print('Netflix Post-COVID dataset shape:', post_netflix_df.shape)
post_netflix_df.head()

Netflix Post-COVID dataset shape: (5750, 16)


Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score,title_lower
0,ts300399,Five Came Back: The Reference Films,SHOW,This collection includes 12 World War II-era p...,1945,TV-MA,48,['documentation'],['US'],1.0,,,,0.6,,five came back: the reference films
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,113,"['crime', 'drama']",['US'],,tt0075314,8.3,795222.0,27.612,8.2,taxi driver
2,tm127384,Monty Python and the Holy Grail,MOVIE,"King Arthur, accompanied by his squire, recrui...",1975,PG,91,"['comedy', 'fantasy']",['GB'],,tt0071853,8.2,530877.0,18.216,7.8,monty python and the holy grail
3,tm70993,Life of Brian,MOVIE,"Brian Cohen is an average young Jewish man, bu...",1979,R,94,['comedy'],['GB'],,tt0079470,8.0,392419.0,17.505,7.8,life of brian
4,tm190788,The Exorcist,MOVIE,12-year-old Regan MacNeil begins to adapt an e...,1973,R,133,['horror'],['US'],,tt0070047,8.1,391942.0,95.337,7.7,the exorcist


The focus of my analysis is on movies. Removing TV shows ensures the dataset aligns with my analysis goal. Also, the number of TV shows is significantly lower compared to movies in both the datasets, further justifying the focus on movies.

In [125]:
# Drop TV shows from pre_netflix_df 
pre_netflix_df = pre_netflix_df[pre_netflix_df['type'] != "TV Show"]
pre_netflix_df.groupby('type').count().reset_index()

Unnamed: 0,type,show_id,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,title_lower
0,Movie,4230,4230,4103,3873,4036,4230,4230,4223,4230,4230,4230,4230


In [126]:
# Drop SHOW from post_netflix_df
post_netflix_df = post_netflix_df[post_netflix_df['type'] != "SHOW"]
post_netflix_df.groupby('type').count().reset_index()

Unnamed: 0,type,id,title,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score,title_lower
0,MOVIE,3729,3728,3721,3729,1395,3729,3729,3729,0,3422,3378,3362,3671,3546,3728


### 3. Merging Datasets

After cleaning the datasets by standardizing title cases to lowercase and removing duplicates, I merged the cleaned pre-COVID Netflix dataset and post-COVID Netflix dataset with the TMDB dataset. The merge was performed using an inner join on the standardized title column (title_lower). This approach ensures that only Netflix titles present in the TMDB dataset are included in the resulting merged datasets. The TMDB dataset contains additional information about titles, making it valuable for deeper analysis.


In [129]:
# Merge datasets on standardized title
pre_merged_df = pd.merge(pre_netflix_df, tmdb_df, on='title_lower', how='inner')

# Save pre merged dataset
pre_merged_df.to_csv('pre_covid_merged_titles.csv', index=False)

print(f'Pre-COVID Merged dataset shape: {pre_merged_df.shape}')

Pre-COVID Merged dataset shape: (3650, 37)


For the post-COVID Netflix dataset, an additional filter was applied to include only titles with a release year greater than or equal to 2020, ensuring that only relevant post-COVID titles are retained. After filtering, the dataset was merged with the TMDB dataset similarly using an inner join.

In [131]:
# Filter post-COVID titles (release year >= 2020)
post_merged_df = post_netflix_df[post_netflix_df['release_year'] >= 2020]
post_merged_df = pd.merge(post_netflix_df, tmdb_df, on='title_lower', how='inner')

# Save post-COVID merged dataset
post_merged_df.to_csv('post_covid_merged_titles.csv', index=False)

print(f'Post-COVID merged dataset shape: {post_merged_df.shape}')

Post-COVID merged dataset shape: (3610, 40)


Finally, the resulting merged datasets were saved as separate CSV files:

- *pre_covid_merged_titles.csv* contains the merged pre-COVID titles.
- *post_covid_merged_titles.csv* contains the merged post-COVID titles.
  
This process highlights the focus on aligning Netflix titles with TMDB data to enhance analysis capabilities. The shapes of the resulting datasets were printed to validate the number of titles retained after merging.

### 4. Statistics and Identifying Common Titles

Now, I need to isolate titles from each merged dataset for detailed comparisons or trend analysis, such as studying the impact of COVID on Netflix's content library. Also, this is required so that no duplicates are analyzed when analyzing pre-COVID and post-COVID data seperately.

In [135]:
# Load pre-COVID dataset
filtered_df = pd.read_csv('pre_covid_merged_titles.csv')

In [136]:
# Load post-COVID dataset
post_matching_df = pd.read_csv('post_covid_merged_titles.csv')

In [137]:
# Find common titles between datasets
common_titles = pd.merge(filtered_df, post_matching_df, on='title_lower', how='inner')
# Save common and remaining titles
common_titles.to_csv('common_titles.csv', index=False)
print(f'Number of common titles: {common_titles.shape[0]}')

Number of common titles: 1484


In [138]:
#Filtering out the remaining titles for pre-COVID
pre_covid_remaining = filtered_df[~filtered_df['title_lower'].isin(common_titles['title_lower'])]
pre_covid_remaining.to_csv('pre_covid_remaining.csv', index=False)
print(f'Number of pre-COVID titles: {pre_covid_remaining.shape[0]}')

Number of pre-COVID titles: 2166


In [139]:
#Filtering out the remaining titles for post-COVID
post_covid_remaining = post_matching_df[~post_matching_df['title_lower'].isin(common_titles['title_lower'])]
post_covid_remaining.to_csv('post_covid_remaining.csv', index=False)
print(f'Number of post-COVID titles: {post_covid_remaining.shape[0]}')

Number of post-COVID titles: 2126


## Conclusion:

In this notebook, we successfully prepared and processed two datasets-Netflix and TMDB-to identify common movie titles and explore post-COVID trends. The key steps involved were:

- Data Loading and Inspection: Reviewed and summarized the datasets, providing an overview of their structure and content.
- Data Cleaning and Transformation: Standardized titles, removed duplicates, and ensured consistency for reliable comparisons.
- Merging and Filtering: Merged datasets based on movie titles and applied filters to focus on specific timeframes, such as post-COVID releases.
- Statistics and Identifying Common Titles Generated key statistics, identified overlaps between datasets, and exported the cleaned and filtered datasets for further analysis.
This process not only streamlined the data for easier use but also highlighted important patterns, such as trends in movie releases and shared titles between platforms.