# Dataset

In [78]:
import pandas as pd

In [79]:
# Load the cleaned datasets
netflix_df = pd.read_csv('https://raw.githubusercontent.com/anthonybrown0528/csc-442-course-project/refs/heads/main/dataset/clean/netflix_film_data.csv')
imdb_df = pd.read_csv('https://raw.githubusercontent.com/anthonybrown0528/csc-442-course-project/refs/heads/main/dataset/clean/netflix_imdb_scores.csv')

## 1. Data Merging

### 1.1 Identify Common Columns

In [80]:
netflix_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7293 entries, 0 to 7292
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Unnamed: 0     7293 non-null   int64 
 1   show_id        7293 non-null   object
 2   type           7293 non-null   object
 3   title          7293 non-null   object
 4   director       7293 non-null   object
 5   cast           7293 non-null   object
 6   country        7293 non-null   object
 7   date_added     7293 non-null   object
 8   release_year   7293 non-null   int64 
 9   rating         7290 non-null   object
 10  duration       7293 non-null   object
 11  listed_in      7293 non-null   object
 12  description    7293 non-null   object
 13  num_listed_in  7293 non-null   int64 
 14  first_cast     7293 non-null   object
 15  num_genre      7293 non-null   int64 
dtypes: int64(4), object(12)
memory usage: 911.8+ KB


In [81]:
imdb_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5278 entries, 0 to 5277
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         5278 non-null   int64  
 1   title              5278 non-null   object 
 2   type               5278 non-null   object 
 3   description        5278 non-null   object 
 4   release_year       5278 non-null   int64  
 5   age_certification  2996 non-null   object 
 6   runtime            5278 non-null   int64  
 7   imdb_id            5278 non-null   object 
 8   imdb_score         5278 non-null   float64
 9   imdb_votes         5278 non-null   int64  
dtypes: float64(1), int64(4), object(5)
memory usage: 412.5+ KB
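
The shared column names can also be listed programmatically instead of being read off the two `info()` summaries. The sketch below is only an illustration: it builds two empty stand-in frames (`netflix_cols_df` and `imdb_cols_df` are hypothetical names) that carry the column labels shown above, then keeps the labels present in both.

```python
import pandas as pd

# Empty stand-in frames that only carry the column labels from the info() output above
netflix_cols_df = pd.DataFrame(columns=[
    "Unnamed: 0", "show_id", "type", "title", "director", "cast", "country",
    "date_added", "release_year", "rating", "duration", "listed_in",
    "description", "num_listed_in", "first_cast", "num_genre",
])
imdb_cols_df = pd.DataFrame(columns=[
    "Unnamed: 0", "title", "type", "description", "release_year",
    "age_certification", "runtime", "imdb_id", "imdb_score", "imdb_votes",
])

# Keep the labels found in both frames, preserving the Netflix column order
imdb_labels = set(imdb_cols_df.columns)
shared_columns = [col for col in netflix_cols_df.columns if col in imdb_labels]

print(shared_columns)
```

A name match does not catch columns that hold the same information under different labels, such as `rating` and `age_certification`, so those still need to be compared by value.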


In [82]:
common_titles = list(set(imdb_df["title"]) & set(netflix_df["title"]))
print('Common Titles:', len(common_titles))

common_release_year = list(set(imdb_df["release_year"]) & set(netflix_df["release_year"]))
print('Common Release Years:', len(common_release_year))

common_age_certifications = list(set(imdb_df["age_certification"]) & set(netflix_df["rating"]))
print('Common Age Certifications:', len(common_age_certifications))

Common Titles: 3177
Common Release Years: 64
Common Age Certifications: 12


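The datasets share the `title` and `release_year` columns, and the `rating` column of the Netflix data corresponds to the `age_certification` column of the IMDb data. Each of these pairs contains common values. Both datasets also have `type` and `description` columns, but these are not used as join keys. Together, the title and release year are treated as sufficient to identify a unique film in this dataset.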
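
Whether title and release year really form a unique key can be checked before merging. The following is a minimal sketch on toy data, not on the real datasets: the frames `left` and `right`, their values, and the helper `count_duplicate_keys` are all hypothetical. It also shows why counting common titles alone can overstate how many rows will match on the combined key.

```python
import pandas as pd

# Toy stand-ins for the two datasets (hypothetical values, not the real data)
left = pd.DataFrame({
    "title": ["Alpha", "Alpha", "Beta", "Gamma"],
    "release_year": [1990, 2020, 2005, 2010],
})
right = pd.DataFrame({
    "title": ["Alpha", "Beta", "Beta", "Gamma"],
    "release_year": [2020, 2005, 2005, 2011],
})

key = ["title", "release_year"]


def count_duplicate_keys(df, key):
    """Return the number of rows whose key repeats an earlier row."""
    return int(df.duplicated(subset=key).sum())


left_dupes = count_duplicate_keys(left, key)
right_dupes = count_duplicate_keys(right, key)

# Titles present in both frames, ignoring the release year
common_titles = set(left["title"]) & set(right["title"])

# (title, release_year) pairs present in both frames
left_pairs = set(zip(left["title"], left["release_year"]))
right_pairs = set(zip(right["title"], right["release_year"]))
common_pairs = left_pairs & right_pairs

print("Duplicate keys:", left_dupes, right_dupes)
print("Common titles:", len(common_titles))
print("Common (title, release_year) pairs:", len(common_pairs))
```

In this toy example the same title appears with different years, so fewer pairs match than titles. A nonzero duplicate count would mean the key does not identify a unique row in that frame.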

### 1.2 Broad Merge

In [83]:
# Rename the rating column to age_certification to be consistent with the
# column label in the other dataset
netflix_df = netflix_df.rename(columns={'rating': 'age_certification'})

In [84]:
netflix_df.head()

Unnamed: 0.1,Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,age_certification,duration,listed_in,description,num_listed_in,first_cast,num_genre
0,1,s2,SHOW,Blood & Water,Unknown,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",3,Ama Qamata,3
1,4,s5,SHOW,Kota Factory,Unknown,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...,3,Mayur More,3
2,7,s8,MOVIE,Sankofa,Haile Gerima,"Kofi Ghanaba, Oyafunmike Ogunlano, Alexandra D...","United States, Ghana, Burkina Faso, United Kin...",2021-09-24,1993,TV-MA,125 min,"Dramas, Independent Movies, International Movies","On a photo shoot in Ghana, an American model s...",3,Kofi Ghanaba,3
3,8,s9,SHOW,The Great British Baking Show,Andy Devonshire,"Mel Giedroyc, Sue Perkins, Mary Berry, Paul Ho...",United Kingdom,2021-09-24,2021,TV-14,9 Seasons,"British TV Shows, Reality TV",A talented batch of amateur bakers face off in...,2,Mel Giedroyc,2
4,9,s10,MOVIE,The Starling,Theodore Melfi,"Melissa McCarthy, Chris O'Dowd, Kevin Kline, T...",United States,2021-09-24,2021,PG-13,104 min,"Comedies, Dramas",A woman adjusting to life after a loss contend...,2,Melissa McCarthy,2


In [85]:
# Perform an inner join on film title and release year
netflix_film_imdb_data = pd.merge(imdb_df, netflix_df, how='inner', on=['title', 'release_year'])

# Both inputs contain type and age_certification columns, so the merge
# suffixes them with _x (from imdb_df) and _y (from netflix_df).
# Drop the imdb_df copies, since age_certification_x has missing values,
# and restore the original names on the netflix_df copies.
netflix_film_imdb_data = netflix_film_imdb_data \
            .drop(columns=['type_x', 'age_certification_x']) \
            .rename(columns={'type_y': 'type', 'age_certification_y': 'age_certification'})

netflix_film_imdb_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2477 entries, 0 to 2476
Data columns (total 22 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0_x       2477 non-null   int64  
 1   title              2477 non-null   object 
 2   description_x      2477 non-null   object 
 3   release_year       2477 non-null   int64  
 4   runtime            2477 non-null   int64  
 5   imdb_id            2477 non-null   object 
 6   imdb_score         2477 non-null   float64
 7   imdb_votes         2477 non-null   int64  
 8   Unnamed: 0_y       2477 non-null   int64  
 9   show_id            2477 non-null   object 
 10  type               2477 non-null   object 
 11  director           2477 non-null   object 
 12  cast               2477 non-null   object 
 13  country            2477 non-null   object 
 14  date_added         2477 non-null   object 
 15  age_certification  2477 non-null   object 
 16  duration           2477 
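
The `_x` and `_y` suffixes in the output above are the pandas defaults for overlapping column names. As a hedged alternative, the sketch below passes explicit `suffixes` so the origin of each column stays readable, and uses `validate` to make the merge fail if the key is not unique on either side. The frames `imdb` and `netflix` and their values are toy data invented for illustration.

```python
import pandas as pd

# Toy stand-ins for the two datasets (hypothetical values)
imdb = pd.DataFrame({
    "title": ["Alpha", "Beta"],
    "release_year": [2020, 2005],
    "type": ["MOVIE", "SHOW"],
    "description": ["imdb text A", "imdb text B"],
    "imdb_score": [7.1, 6.4],
})
netflix = pd.DataFrame({
    "title": ["Alpha", "Gamma"],
    "release_year": [2020, 2010],
    "type": ["MOVIE", "MOVIE"],
    "description": ["netflix text A", "netflix text G"],
})

merged = pd.merge(
    imdb,
    netflix,
    how="inner",
    on=["title", "release_year"],
    suffixes=("_imdb", "_netflix"),  # label overlapping columns by source
    validate="one_to_one",           # raise MergeError if a key repeats on either side
)

# Keep one copy of type and restore its original name
merged = merged.drop(columns=["type_imdb"]).rename(columns={"type_netflix": "type"})

print(merged.columns.tolist())
```

Here both `description` columns are kept under source-specific names; whether one of them should be dropped is a separate decision.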

In [86]:
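# Save the merged data. With the default index=True, the row index is also
# written as an unnamed first column.
netflix_film_imdb_data.to_csv('netflix_film_imdb_data.csv')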
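
The `Unnamed: 0` columns in the `info()` summaries above are consistent with CSV files that were written with their index and read back without `index_col`. The sketch below reproduces that round trip on a small hypothetical frame, using in-memory buffers instead of files, and compares it with writing the CSV using `index=False`.

```python
import io

import pandas as pd

# Small hypothetical frame standing in for the merged data
df = pd.DataFrame({"title": ["Alpha", "Beta"], "imdb_score": [7.1, 6.4]})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(reloaded_with_index.columns.tolist())
print(reloaded_without_index.columns.tolist())
```

If the saved index is wanted, reading the file back with `pd.read_csv(..., index_col=0)` restores it as the index instead of as a data column.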

# Credit

This notebook contains contributions from Anthony Brown.

The organization of this notebook is inspired by a notebook provided as a workshop for CSC 442 at NC State University. This workshop was created by Aditi Mallavarapu, Claire Cahoon and Walt Gurley, adapted from previous workshop materials by Scott Bailey and Simon Wiles, of Stanford Libraries.