Kaggle Dataset Link : https://www.kaggle.com/datasets/shivamb/netflix-shows

# Questions

1. Group by country and count how many movies/TV shows are released in each country.
2. Find the average duration of shows grouped by type (Movie vs TV Show).
3. Group by rating and calculate the minimum, maximum, and mean release year for each.
4. Group by country and type to count the number of entries in each combination.
5. For each director, find the total number of movies and total number of TV shows.
6. Use .agg() to compute multiple statistics (count, unique, first) for cast grouped by country.
7. Which country has the highest number of Netflix entries? (Sort groupby results).
8. Within each type (Movie/TV Show), rank the countries by number of releases.
9. Filter to only keep countries that have more than 100 movies.
10. For each rating, filter directors who have directed at least 5 movies.
11. Create a new column showing how a movie’s release year compares to the average release year of its country.
12. Assign each country a normalized score of number of shows, scaled by the maximum number of shows across all countries.
13. Group by rating and calculate the IQR (interquartile range) of release years.
14. For each country, find the director who directed the most shows using apply.
15. Group by country and fill missing values in the director column with the most common director for that country.


In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('netflix_titles.csv')
df

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
...,...,...,...,...,...,...,...,...,...,...,...,...
8802,s8803,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,"November 20, 2019",2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a..."
8803,s8804,TV Show,Zombie Dumb,,,,"July 1, 2019",2018,TV-Y7,2 Seasons,"Kids' TV, Korean TV Shows, TV Comedies","While living alone in a spooky town, a young g..."
8804,s8805,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,"November 1, 2019",2009,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...
8805,s8806,Movie,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,"January 11, 2020",2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero..."


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


In [4]:
# 1. Group by country and count how many movies/TV shows are released in each country.
df.groupby('country')['type'].count().sort_values(ascending=False)

country
United States                                                                          2818
India                                                                                   972
United Kingdom                                                                          419
Japan                                                                                   245
South Korea                                                                             199
                                                                                       ... 
Ireland, Canada, Luxembourg, United States, United Kingdom, Philippines, India            1
Ireland, Canada, United Kingdom, United States                                            1
Ireland, Canada, United States, United Kingdom                                            1
Ireland, France, Iceland, United States, Mexico, Belgium, United Kingdom, Hong Kong       1
Zimbabwe                                                                

In [5]:
# 2. Find the average duration of shows grouped by type (Movie vs TV Show).
a = df.copy()
a['duration'] = a['duration'].str.split().str[0]
a['duration'] = a['duration'].astype('float')
a.groupby('type')['duration'].mean()

# movies in mins
# TV shows in seasons

type
Movie      99.577187
TV Show     1.764948
Name: duration, dtype: float64

In [6]:
# 3. Group by rating and calculate the minimum, maximum, and mean release year for each.

df.groupby('rating').agg(
    {
        'release_year' : ['min','max','mean']
    }
)

Unnamed: 0_level_0,release_year,release_year,release_year
Unnamed: 0_level_1,min,max,mean
rating,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
66 min,2015,2015,2015.0
74 min,2017,2017,2017.0
84 min,2010,2010,2010.0
G,1956,2020,1997.804878
NC-17,2013,2018,2015.0
NR,1958,2018,2010.9125
PG,1973,2021,2008.428571
PG-13,1955,2021,2009.314286
R,1962,2021,2010.47184
TV-14,1925,2021,2013.655556


In [7]:
# 4. Group by country and type to count the number of entries in each combination.
df.groupby(['country','type']).size()

country                                              type   
, France, Algeria                                    Movie       1
, South Korea                                        TV Show     1
Argentina                                            Movie      38
                                                     TV Show    18
Argentina, Brazil, France, Poland, Germany, Denmark  Movie       1
                                                                ..
Venezuela                                            Movie       1
Venezuela, Colombia                                  Movie       1
Vietnam                                              Movie       7
West Germany                                         Movie       1
Zimbabwe                                             Movie       1
Length: 847, dtype: int64

In [8]:
# 5. For each director, find the total number of movies and total number of TV shows.
df.groupby(['director','type']).size().reset_index(name='Total')

# OR with different column for each type & fills given value wherever no value using unstack
df.groupby(['director', 'type']).size().unstack(fill_value=0)

type,Movie,TV Show
director,Unnamed: 1_level_1,Unnamed: 2_level_1
A. L. Vijay,2,0
A. Raajdheep,1,0
A. Salaam,1,0
A.R. Murugadoss,2,0
Aadish Keluskar,1,0
...,...,...
Çagan Irmak,1,0
Ísold Uggadóttir,1,0
Óskar Thór Axelsson,1,0
Ömer Faruk Sorak,2,0


In [9]:
# 6. Use .agg() to compute multiple statistics (count, unique, first) for cast grouped by country.
df.groupby('country').agg(
    {
        'cast' : ['count', 'unique', 'first']
    }
)

Unnamed: 0_level_0,cast,cast,cast
Unnamed: 0_level_1,count,unique,first
country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
", France, Algeria",1,"[Khaled Abol El Naga, Souad Massi, Suhail Hadd...","Khaled Abol El Naga, Souad Massi, Suhail Hadda..."
", South Korea",1,"[Jung Hae-in, Koo Kyo-hwan, Kim Sung-kyun, Son...","Jung Hae-in, Koo Kyo-hwan, Kim Sung-kyun, Son ..."
Argentina,51,"[Chino Darín, Nancy Dupláa, Joaquín Furriel, P...","Chino Darín, Nancy Dupláa, Joaquín Furriel, Pe..."
"Argentina, Brazil, France, Poland, Germany, Denmark",1,"[Bárbara Lennie, Daniel Aráoz, Claudio Tolcach...","Bárbara Lennie, Daniel Aráoz, Claudio Tolcachi..."
"Argentina, Chile",2,"[Benjamín Vicuña, Gastón Pauls, Alfredo Castro...","Benjamín Vicuña, Gastón Pauls, Alfredo Castro,..."
...,...,...,...
Venezuela,1,[Paco Ignacio Taibo II],Paco Ignacio Taibo II
"Venezuela, Colombia",0,[nan],
Vietnam,7,"[Tran Nghia, Truc Anh, Tran Phong, Khanh Van, ...","Tran Nghia, Truc Anh, Tran Phong, Khanh Van, N..."
West Germany,0,[nan],


In [10]:
# 7. Which country has the highest number of Netflix entries? (Sort groupby results).
df.groupby('country').size().sort_values(ascending=False)

country
United States                                                                          2818
India                                                                                   972
United Kingdom                                                                          419
Japan                                                                                   245
South Korea                                                                             199
                                                                                       ... 
Ireland, Canada, Luxembourg, United States, United Kingdom, Philippines, India            1
Ireland, Canada, United Kingdom, United States                                            1
Ireland, Canada, United States, United Kingdom                                            1
Ireland, France, Iceland, United States, Mexico, Belgium, United Kingdom, Hong Kong       1
Zimbabwe                                                                

In [11]:
# 8. Within each type (Movie/TV Show), rank the countries by number of releases.
df.groupby(['type','country'])['show_id'].count().groupby(level=0).rank(ascending=False)

# level=0 refers to the first index in the MultiIndex, which is 'type'.
# groupby(level=0) means: “group by each type and do the next operation (rank) separately within that group.”

type     country                                            
Movie    , France, Algeria                                      405.5
         Argentina                                               21.5
         Argentina, Brazil, France, Poland, Germany, Denmark    405.5
         Argentina, Chile                                       132.0
         Argentina, Chile, Peru                                 405.5
                                                                ...  
TV Show  United States, South Korea, China                       65.0
         United States, Sweden                                  135.5
         United States, United Kingdom                           27.5
         United States, United Kingdom, Australia               135.5
         Uruguay, Germany                                       135.5
Name: show_id, Length: 847, dtype: float64

In [12]:
# 9. Filter to only keep countries that have more than 100 movies.
a = df[df['type'] == 'Movie']
a = a.groupby('country')['type'].count().reset_index(name='Total')
a[a['Total'] > 100]

Unnamed: 0,country,Total
50,Canada,122
218,India,893
440,United Kingdom,206
525,United States,2058


In [13]:
# 10. For each rating, filter directors who have directed at least 5 movies.
a = df[df['type'] == 'Movie']
a = a.groupby(['rating','director']).size().reset_index(name='Total')
a[a['Total'] > 5]

Unnamed: 0,rating,director,Total
316,PG,Robert Rodriguez,7
1146,R,Martin Scorsese,8
1245,R,Quentin Tarantino,7
1575,TV-14,Cathy Garcia-Molina,7
1636,TV-14,David Dhawan,6
1927,TV-14,Kunle Afolayan,7
2040,TV-14,Milan Luthria,6
3297,TV-MA,Jay Chapman,10
3298,TV-MA,Jay Karas,10
3509,TV-MA,Lance Bangs,7


In [14]:
# 11. Create a new column showing how a movie’s release year compares to the average release year of its country.
filter_df = df[df['type'] == 'Movie']
avg_yr = filter_df.groupby('country')['release_year'].mean().reset_index(name='average_release_year')
filter_df = filter_df.merge(avg_yr,how='left',on='country')
filter_df['year_diff'] = filter_df['release_year'] - filter_df['average_release_year']
filter_df

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,average_release_year,year_diff
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",2012.058795,7.941205
1,s7,Movie,My Little Pony: A New Generation,"Robert Cullen, José Luis Ucha","Vanessa Hudgens, Kimiko Glenn, James Marsden, ...",,"September 24, 2021",2021,PG,91 min,Children & Family Movies,Equestria's divided. But a bright-eyed hero be...,,
2,s8,Movie,Sankofa,Haile Gerima,"Kofi Ghanaba, Oyafunmike Ogunlano, Alexandra D...","United States, Ghana, Burkina Faso, United Kin...","September 24, 2021",1993,TV-MA,125 min,"Dramas, Independent Movies, International Movies","On a photo shoot in Ghana, an American model s...",1993.000000,0.000000
3,s10,Movie,The Starling,Theodore Melfi,"Melissa McCarthy, Chris O'Dowd, Kevin Kline, T...",United States,"September 24, 2021",2021,PG-13,104 min,"Comedies, Dramas",A woman adjusting to life after a loss contend...,2012.058795,8.941205
4,s13,Movie,Je Suis Karl,Christian Schwochow,"Luna Wedler, Jannis Niewöhner, Milan Peschel, ...","Germany, Czech Republic","September 23, 2021",2021,TV-MA,127 min,"Dramas, International Movies",After most of her family is murdered in a terr...,2021.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6126,s8802,Movie,Zinzana,Majid Al Ansari,"Ali Suliman, Saleh Bakri, Yasa, Ali Al-Jabri, ...","United Arab Emirates, Jordan","March 9, 2016",2015,TV-MA,96 min,"Dramas, International Movies, Thrillers",Recovering alcoholic Talal wakes up inside a s...,2015.000000,0.000000
6127,s8803,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,"November 20, 2019",2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a...",2012.058795,-5.058795
6128,s8805,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,"November 1, 2019",2009,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...,2012.058795,-3.058795
6129,s8806,Movie,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,"January 11, 2020",2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero...",2012.058795,-6.058795


In [15]:
# 12. Assign each country a normalized score of number of shows, scaled by the maximum number of shows across all countries.

x = df.copy()

# count shows per country
x['country_show_count'] = x.groupby('country')['show_id'].transform('count')  

# for normalized score divide it by max
x['normalized_score'] = x['country_show_count'] / x['country_show_count'].max()
x

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,country_show_count,normalized_score
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",2818.0,1.000000
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",30.0,0.010646
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,,
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo...",,
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...,972.0,0.344925
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8802,s8803,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,"November 20, 2019",2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a...",2818.0,1.000000
8803,s8804,TV Show,Zombie Dumb,,,,"July 1, 2019",2018,TV-Y7,2 Seasons,"Kids' TV, Korean TV Shows, TV Comedies","While living alone in a spooky town, a young g...",,
8804,s8805,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,"November 1, 2019",2009,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...,2818.0,1.000000
8805,s8806,Movie,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,"January 11, 2020",2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero...",2818.0,1.000000


In [16]:
# 13. Group by rating and calculate the IQR (interquartile range) of release years.

df.groupby('rating')['release_year'].apply(lambda x: x.quantile(0.75) - x.quantile(0.25))

rating
66 min       0.00
74 min       0.00
84 min       0.00
G           22.00
NC-17        2.50
NR           4.00
PG          13.00
PG-13       11.00
R           10.00
TV-14        6.00
TV-G         5.00
TV-MA        3.00
TV-PG        6.00
TV-Y         4.00
TV-Y7        6.00
TV-Y7-FV     2.75
UR          21.00
Name: release_year, dtype: float64

In [17]:
# 14. For each country, find the director who directed the most shows using apply.
co_dir = df.groupby(['country','director'])['show_id'].count().reset_index(name='count')

result = co_dir.groupby('country').apply(
    lambda x: x.loc[x['count'].idxmax()]
)
result

  result = co_dir.groupby('country').apply(


Unnamed: 0_level_0,country,director,count
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
", France, Algeria",", France, Algeria",Najwa Najjar,1
Argentina,Argentina,"Raúl Campos, Jan Suter",5
"Argentina, Brazil, France, Poland, Germany, Denmark","Argentina, Brazil, France, Poland, Germany, De...",Diego Lerman,1
"Argentina, Chile","Argentina, Chile","Cecilia Atán, Valeria Pivato",1
"Argentina, Chile, Peru","Argentina, Chile, Peru",Ticoy Rodriguez,1
...,...,...,...
Venezuela,Venezuela,Matías Gueilburt,1
"Venezuela, Colombia","Venezuela, Colombia",Jorge Granier,1
Vietnam,Vietnam,"Bao Nhan, Namcito",1
West Germany,West Germany,"Joachim Fest, Christian Herrendoerfer",1


In [18]:
# 15. Group by country and fill missing values in the director column with the most common director for that country.

# null values before filling in director column
df['director'].isnull().sum()

2634

In [19]:
df['director'] = df.groupby('country')['director'].transform(
    lambda x: x.fillna(x.mode()[0] if not x.mode().empty else 'Unknown')
)

df

# Each x in the lambda is one Series containing directors for that country.
# x.mode() computes the most frequent value in that group.
# x.mode()[0] if not x.mode().empty else 'Unknown' picks:
    # the first mode if it exists,
    # 'Unknown' if the group has all NaNs.
# transform()
    # Takes the filled group Series and maps it back to the original index in df['director'].
    # This preserves the row order, so each original row gets the correct filled value.

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,Adze Ugah,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,David Dhawan,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
...,...,...,...,...,...,...,...,...,...,...,...,...
8802,s8803,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,"November 20, 2019",2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a..."
8803,s8804,TV Show,Zombie Dumb,,,,"July 1, 2019",2018,TV-Y7,2 Seasons,"Kids' TV, Korean TV Shows, TV Comedies","While living alone in a spooky town, a young g..."
8804,s8805,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,"November 1, 2019",2009,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...
8805,s8806,Movie,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,"January 11, 2020",2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero..."


In [20]:
# null values after filling in director column
df['director'].isnull().sum()

831