# Dataset: Netflix Shows
> ~9000 Netflix English shows with full details

<img src="https://images.unsplash.com/photo-1574375927938-d5a98e8ffe85?fm=jpg&q=60&w=3000&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxzZWFyY2h8NHx8bmV0ZmxpeHxlbnwwfHwwfHx8MA%3D%3D" width="1000" height="500">


---

### 👨‍💻 **Author: Abdul Samad**

- 🔗 **LinkedIn:**  
  [https://www.linkedin.com/in/abdulsamad577/](https://www.linkedin.com/in/abdulsamad577/)
  
- 🧠 **Kaggle Profile:**  
  [https://www.kaggle.com/samad0015](https://www.kaggle.com/samad0015)

---




## About Dataset
The raw data was web-scraped with Selenium. It contains unlabelled text data for around 9,000 Netflix shows and movies, with details such as cast, release year, rating, and description.

# Exploratory Data Analysis

> Steps followed to explore the data:

## 1. Import Libraries

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

## 2. Load data

In [2]:
file_path = 'C:/Users/Sami/Desktop/Skills/1_month_practice/week1/D5/netflix/netflix_titles.csv'
df=pd.read_csv(file_path)

## 3. Basic exploration

In [3]:
df.head()

|  | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | s1 | Movie | Dick Johnson Is Dead | Kirsten Johnson | NaN | United States | September 25, 2021 | 2020 | PG-13 | 90 min | Documentaries | As her father nears the end of his life, filmm... |
| 1 | s2 | TV Show | Blood & Water | NaN | Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... | South Africa | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, TV Dramas, TV Mysteries | After crossing paths at a party, a Cape Town t... |
| 2 | s3 | TV Show | Ganglands | Julien Leclercq | Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... | NaN | September 24, 2021 | 2021 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, TV Act... | To protect his family from a powerful drug lor... |
| 3 | s4 | TV Show | Jailbirds New Orleans | NaN | NaN | NaN | September 24, 2021 | 2021 | TV-MA | 1 Season | Docuseries, Reality TV | Feuds, flirtations and toilet talk go down amo... |
| 4 | s5 | TV Show | Kota Factory | NaN | Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... | India | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, Romantic TV Shows, TV ... | In a city of coaching centers known to train I... |


In [4]:
df.tail()

|  | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8802 | s8803 | Movie | Zodiac | David Fincher | Mark Ruffalo, Jake Gyllenhaal, Robert Downey J... | United States | November 20, 2019 | 2007 | R | 158 min | Cult Movies, Dramas, Thrillers | A political cartoonist, a crime reporter and a... |
| 8803 | s8804 | TV Show | Zombie Dumb | NaN | NaN | NaN | July 1, 2019 | 2018 | TV-Y7 | 2 Seasons | Kids' TV, Korean TV Shows, TV Comedies | While living alone in a spooky town, a young g... |
| 8804 | s8805 | Movie | Zombieland | Ruben Fleischer | Jesse Eisenberg, Woody Harrelson, Emma Stone, ... | United States | November 1, 2019 | 2009 | R | 88 min | Comedies, Horror Movies | Looking to survive in a world taken over by zo... |
| 8805 | s8806 | Movie | Zoom | Peter Hewitt | Tim Allen, Courteney Cox, Chevy Chase, Kate Ma... | United States | January 11, 2020 | 2006 | PG | 88 min | Children & Family Movies, Comedies | Dragged from civilian life, a former superhero... |
| 8806 | s8807 | Movie | Zubaan | Mozez Singh | Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanan... | India | March 2, 2019 | 2015 | TV-14 | 111 min | Dramas, International Movies, Music & Musicals | A scrappy but poor boy worms his way into a ty... |


In [5]:
df.describe()

|  | release_year |
|---|---|
| count | 8807.0 |
| mean | 2014.180198 |
| std | 8.819312 |
| min | 1925.0 |
| 25% | 2013.0 |
| 50% | 2017.0 |
| 75% | 2019.0 |
| max | 2021.0 |


#### Check the percentage of missing values in the dataset

In [6]:
df.isnull().sum()/len(df)*100

show_id          0.000000
type             0.000000
title            0.000000
director        29.908028
cast             9.367549
country          9.435676
date_added       0.113546
release_year     0.000000
rating           0.045418
duration         0.034064
listed_in        0.000000
description      0.000000
dtype: float64
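
The same percentages can also be computed with `isna().mean()`, because the mean of a boolean mask is the fraction of `True` values. A minimal sketch on a made-up frame (the `toy` data is illustrative and not taken from the dataset), sorted so the worst columns come first:

```python
import numpy as np
import pandas as pd

# Illustrative frame; the values are made up for this example.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", np.nan, np.nan, "Y"],
    "country": ["India", "Spain", np.nan, "India"],
})

# Mean of the boolean mask = share of missing values per column.
missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

Sorting makes it easier to see which columns need attention first, as `director` does in the output above.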

## 4. Impute the missing values of each column

### Column 'type'

In [7]:
df['type'].unique()

array(['Movie', 'TV Show'], dtype=object)

In [8]:
df['type'].value_counts()

type
Movie      6131
TV Show    2676
Name: count, dtype: int64

In [9]:
df['type'].isnull().sum()

np.int64(0)

> This column has no missing values, so no imputation is needed.

### Column 'director'

In [10]:
df['director'].isnull().sum()

np.int64(2634)

In [11]:
df['director'].value_counts()

director
Rajiv Chilaka              19
Raúl Campos, Jan Suter     18
Suhas Kadav                16
Marcus Raboy               16
Jay Karas                  14
                           ..
James Brown                 1
Ivona Juka                  1
Mu Chu                      1
Chandra Prakash Dwivedi     1
Majid Al Ansari             1
Name: count, Length: 4528, dtype: int64

In [12]:
df['director'].nunique()

4528

> There are 4,528 distinct director values, so filling the gaps with the most frequent director name would not be meaningful. We will skip it for now.

### Column 'cast'

In [13]:
df['cast'].isnull().sum()

np.int64(825)

In [14]:
df['cast'].nunique()

7692

> The cast also differs from show to show (7,692 distinct values), so it cannot be imputed from the existing data.

#### So we decided to remove the rows where the cast or director is missing

In [15]:
df=df.dropna(subset=['cast','director'])

In [16]:
df.isnull().sum()/len(df)*100

show_id         0.000000
type            0.000000
title           0.000000
director        0.000000
cast            0.000000
country         6.385965
date_added      0.000000
release_year    0.000000
rating          0.017544
duration        0.052632
listed_in       0.000000
description     0.000000
dtype: float64
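
Dropping these rows shrinks the dataset from 8,807 to 5,700 entries, roughly 35% of the titles. If that loss matters for a later analysis, one alternative is to keep every row and mark the gaps with a placeholder. The sketch below compares both options on a made-up frame; the `"Unknown"` label is an arbitrary choice for illustration and is not what this notebook does.

```python
import numpy as np
import pandas as pd

# Illustrative frame; the values are made up for this example.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", np.nan, "Y", np.nan],
    "cast": ["P, Q", "R", np.nan, np.nan],
})

# Approach used in this notebook: drop rows missing either field.
dropped = toy.dropna(subset=["cast", "director"])

# Alternative: keep every row and mark the gaps explicitly.
# "Unknown" is a hypothetical placeholder chosen for this sketch.
filled = toy.fillna({"director": "Unknown", "cast": "Unknown"})

print(len(toy), len(dropped), len(filled))
```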

### Column 'country'

In [17]:
df['country'].value_counts().nlargest(5)

country
United States     1849
India              875
United Kingdom     183
Canada             107
Spain               91
Name: count, dtype: int64

In [18]:
df['country'].nunique()

604

> About 6% of the values are null. United States is the most frequent value, with more than 1,800 titles,
> so we impute the missing values with this most frequent value.

In [19]:
df['country'] = df['country'].fillna(df['country'].mode()[0])  # assign back instead of inplace on a column selection
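
Mode imputation is easiest to reason about when the filled column is assigned back explicitly; recent pandas releases warn that `fillna(..., inplace=True)` on a selected column may stop updating the parent frame. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Illustrative frame; the values are made up for this example.
toy = pd.DataFrame({
    "country": ["United States", "India", np.nan, "United States", np.nan],
})

# mode() returns a Series (ties are possible), so take the first entry.
most_common = toy["country"].mode()[0]

# Assign the filled column back to the frame.
toy["country"] = toy["country"].fillna(most_common)
print(toy["country"].tolist())
```

Keep in mind that this labels every title of unknown origin as coming from the most common country, which will inflate that country's count in any later per-country analysis.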

### Column 'date_added'

In [20]:
df['date_added'].isnull().sum()

np.int64(0)

In [21]:
df['date_added'].value_counts()

date_added
January 1, 2020       97
November 1, 2019      70
March 1, 2018         67
December 31, 2019     67
October 1, 2018       62
                      ..
September 26, 2018     1
June 17, 2017          1
October 21, 2018       1
November 18, 2017      1
September 6, 2021      1
Name: count, Length: 1481, dtype: int64

In [22]:
df['date_added'] = df['date_added'].str.strip()  # Remove leading/trailing spaces
df['date_added'] = pd.to_datetime(df['date_added'], format='mixed')


In [23]:
df['date_added'].value_counts()

date_added
2020-01-01    97
2019-11-01    71
2018-03-01    67
2019-12-31    67
2018-10-01    62
              ..
2017-12-22     1
2017-12-30     1
2018-01-22     1
2017-03-23     1
2018-09-27     1
Name: count, Length: 1478, dtype: int64
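
The number of distinct values drops from 1481 to 1478 after parsing, which is consistent with a few strings that differed only by stray whitespace. Note that `format='mixed'` needs pandas 2.0 or later. As a sketch of a stricter alternative, assuming every value has the `Month D, YYYY` shape seen in the counts above and English month names, an explicit format makes unexpected strings raise an error instead of being guessed:

```python
import pandas as pd

# Illustrative strings in the same shape, one with stray spaces.
raw = pd.Series(["September 25, 2021", " March 2, 2019 ", "March 2, 2019"])

# Strip whitespace first, then parse with an explicit format.
parsed = pd.to_datetime(raw.str.strip(), format="%B %d, %Y")

# Distinct values before and after cleaning.
print(raw.nunique(), parsed.nunique())
```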

### Column 'release_year'

In [24]:
df['release_year'].isnull().sum()

np.int64(0)

In [25]:
df.release_year.dtypes

dtype('int64')

### Column 'rating'

In [26]:
df['rating'].nunique()
df['rating'].value_counts()

rating
TV-MA       1939
TV-14       1329
R            789
PG-13        477
TV-PG        456
PG           279
TV-Y7        123
TV-Y         102
TV-G          96
NR            58
G             40
TV-Y7-FV       3
UR             3
NC-17          2
74 min         1
84 min         1
66 min         1
Name: count, dtype: int64

In [27]:
df['rating'] = df['rating'].fillna(df['rating'].mode()[0])  # assign back instead of inplace on a column selection

In [28]:
df['rating'].isnull().sum()

np.int64(0)

### Column 'duration'

In [29]:
df['duration'].isnull().sum()

np.int64(3)

In [30]:
df['duration'].value_counts()

duration
94 min      140
1 Season    137
97 min      136
93 min      135
95 min      129
           ... 
228 min       1
18 min        1
205 min       1
201 min       1
191 min       1
Name: count, Length: 205, dtype: int64
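
The counts mix two units: minutes for movies and seasons for TV shows, so the column cannot be compared numerically as it stands. A hedged sketch of splitting it into a number and a unit with `str.extract`; the column names `value` and `unit` are hypothetical, and the pattern assumes only the `N min`, `N Season` and `N Seasons` shapes seen above:

```python
import pandas as pd

# Illustrative values in the shapes shown in the counts above.
duration = pd.Series(["94 min", "1 Season", "2 Seasons", "158 min"])

# Capture the leading number and the unit that follows it.
parts = duration.str.extract(r"^(?P<value>\d+)\s+(?P<unit>min|Seasons?)$")
parts["value"] = parts["value"].astype(int)

# Normalise "Seasons" to "Season" so each unit has one spelling.
parts["unit"] = parts["unit"].str.replace("Seasons", "Season", regex=False)
print(parts)
```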

In [31]:
df[df['duration'].isnull()][['title', 'type', 'duration']]


|  | title | type | duration |
|---|---|---|---|
| 5541 | Louis C.K. 2017 | Movie | NaN |
| 5794 | Louis C.K.: Hilarious | Movie | NaN |
| 5813 | Louis C.K.: Live at the Comedy Store | Movie | NaN |


In [32]:
df['duration'] = df['duration'].fillna(df['duration'].mode()[0])  # assign back instead of inplace on a column selection
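
The rating counts earlier include three values (`74 min`, `84 min`, `66 min`) that look like durations, and exactly three rows have a missing duration. Assuming these are the same rows, which this notebook does not verify, the values could be moved across instead of being filled with the most frequent duration (`94 min` in the counts above). A sketch of that idea on made-up rows:

```python
import numpy as np
import pandas as pd

# Illustrative frame; titles and values are made up for this example.
toy = pd.DataFrame({
    "title": ["Show A", "Show B", "Show C"],
    "rating": ["74 min", "TV-MA", "84 min"],
    "duration": [np.nan, "90 min", np.nan],
})

# Rows where rating looks like a duration and duration itself is missing.
misplaced = toy["rating"].str.fullmatch(r"\d+ min", na=False) & toy["duration"].isna()

# Move the value across, then mark the rating as missing.
toy.loc[misplaced, "duration"] = toy.loc[misplaced, "rating"]
toy.loc[misplaced, "rating"] = np.nan
print(toy)
```

The rating of those rows is left missing here so that it can be handled by whichever rating imputation is chosen.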

### Column 'listed_in'

In [33]:
df['listed_in'].isnull().sum()
df['listed_in'].value_counts()

listed_in
Dramas, International Movies                             361
Stand-Up Comedy                                          309
Comedies, Dramas, International Movies                   271
Dramas, Independent Movies, International Movies         252
Children & Family Movies, Comedies                       193
                                                        ... 
Action & Adventure, Romantic Movies, Sci-Fi & Fantasy      1
British TV Shows, Classic & Cult TV, TV Comedies           1
Action & Adventure, Comedies, Horror Movies                1
Action & Adventure, Documentaries, Sports Movies           1
Cult Movies, Dramas, Thrillers                             1
Name: count, Length: 346, dtype: int64
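
Each `listed_in` value is a comma-separated combination of genres, which is why there are 346 distinct strings. To count individual genres instead of combinations, the strings can be split and exploded. A minimal sketch on three values taken from the output above, assuming `", "` is always the separator:

```python
import pandas as pd

listed_in = pd.Series([
    "Dramas, International Movies",
    "Comedies, Dramas, International Movies",
    "Stand-Up Comedy",
])

# Split each string into a list, then give every genre its own row.
genres = listed_in.str.split(", ").explode()
genre_counts = genres.value_counts()
print(genre_counts)
```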

### Column 'description'

In [34]:
df['description'].isnull().sum()

np.int64(0)

In [35]:
df['description'].value_counts()

description
Paranormal activity at a lush, abandoned property alarms a group eager to redevelop the site, but the eerie events may not be as unearthly as they think.    4
A surly septuagenarian gets another chance at her 20s after having her photo snapped at a studio that magically takes 50 years off her life.                 3
With their biggest foe seemingly defeated, InuYasha and his friends return to everyday life. But the peace is soon shattered by an emerging new enemy.       2
On India's Independence Day, a zany mishap in a Mumbai chawl disrupts a young love story while compelling the residents to unite in aid of a little boy.     2
A scheming matriarch plots to cut off her disabled stepson and his wife from the family fortune, creating a division within the clan.                        2
                                                                                                                                                            ..
Recovering alcoholic Talal wakes u

## 5. Final dataset check

In [36]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5700 entries, 2 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   show_id       5700 non-null   object        
 1   type          5700 non-null   object        
 2   title         5700 non-null   object        
 3   director      5700 non-null   object        
 4   cast          5700 non-null   object        
 5   country       5700 non-null   object        
 6   date_added    5700 non-null   datetime64[ns]
 7   release_year  5700 non-null   int64         
 8   rating        5700 non-null   object        
 9   duration      5700 non-null   object        
 10  listed_in     5700 non-null   object        
 11  description   5700 non-null   object        
dtypes: datetime64[ns](1), int64(1), object(10)
memory usage: 578.9+ KB


In [37]:
df.duplicated().sum()

np.int64(0)
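
`duplicated()` without arguments compares whole rows, and `show_id` appears to differ on every row, so a result of 0 does not rule out repeated titles. The description counts earlier show some texts occurring up to four times. A sketch of checking a subset of columns on made-up rows (which columns define a duplicate is a judgment call, and the choice here is only illustrative):

```python
import pandas as pd

# Illustrative frame; the values are made up for this example.
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "title": ["Same Film", "Same Film", "Other Film"],
    "description": ["A plot.", "A plot.", "Another plot."],
})

# Whole-row comparison: show_id differs, so nothing is flagged.
full_row_dupes = toy.duplicated().sum()

# Compare only selected columns; keep=False marks every member of a group.
subset_mask = toy.duplicated(subset=["title", "description"], keep=False)
print(full_row_dupes, subset_mask.tolist())
```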

In [38]:
df.shape

(5700, 12)

In [39]:
df.describe()

|  | date_added | release_year |
|---|---|---|
| count | 5700 | 5700.0 |
| mean | 2019-05-28 02:41:56.210526208 | 2012.979474 |
| min | 2008-01-01 00:00:00 | 1942.0 |
| 25% | 2018-05-04 00:00:00 | 2012.0 |
| 50% | 2019-07-21 12:00:00 | 2016.0 |
| 75% | 2020-08-13 00:00:00 | 2018.0 |
| max | 2021-09-24 00:00:00 | 2021.0 |
| std | NaN | 9.562133 |


In [40]:
df.isnull().sum()

show_id         0
type            0
title           0
director        0
cast            0
country         0
date_added      0
release_year    0
rating          0
duration        0
listed_in       0
description     0
dtype: int64

In [41]:
df.sample(10)

|  | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6569 | s6570 | Movie | Darra | Parveen Kumar | Gurpreet Ghuggi, Happy Raikoti, Kartar Cheema,... | India | 2017-10-15 | 2016 | TV-14 | 121 min | Dramas, International Movies | After returning from school and getting marrie... |
| 3922 | s3923 | Movie | Dabbe 5: Zehr-i Cin | Hasan Karacadağ | Nil Günal, Ümit Bülent Dinçer, Sultan Köroğlu ... | Turkey | 2019-04-12 | 2014 | TV-MA | 133 min | Horror Movies, International Movies | When Dilek becomes haunted by evil spirits and... |
| 4049 | s4050 | Movie | Sarkar | A.R. Murugadoss | Vijay, Varalakshmi Sarathkumar, Keerthi Suresh... | United States | 2019-03-02 | 2018 | TV-MA | 162 min | Action & Adventure, Dramas, International Movies | A ruthless businessman’s mission to expose ele... |
| 135 | s136 | Movie | Cliffhanger | Renny Harlin | Sylvester Stallone, John Lithgow, Michael Rook... | United States, Italy, France, Japan | 2021-09-01 | 1993 | R | 113 min | Action & Adventure | Ranger Gabe Walker and his partner are called ... |
| 8301 | s8302 | Movie | The First Wives Club | Hugh WIlson | Bette Midler, Goldie Hawn, Diane Keaton, Maggi... | United States | 2019-11-20 | 1996 | PG | 103 min | Comedies | Following a friend's suicide after her husband... |
| 6801 | s6802 | Movie | French Dirty | Wade Allain-Marcus, Jesse Allain-Marcus | Wade Allain-Marcus, Arjun Gupta, Melina Lizett... | United States | 2016-02-04 | 2015 | TV-MA | 72 min | Dramas, Independent Movies | An aimless, unemployed millennial hangs out wi... |
| 4854 | s4855 | Movie | Tig Notaro Happy To Be Here | Tig Notaro | Tig Notaro | United States | 2018-05-22 | 2018 | TV-14 | 58 min | Stand-Up Comedy | Comedian Tig Notaro unleashes her inner pranks... |
| 7086 | s7087 | Movie | Internet Famous | Michael Gallagher | Shane Dawson, Steve Greene, Amanda Cerny, Chri... | United States | 2016-07-21 | 2016 | TV-14 | 87 min | Comedies | Five viral Internet celebrities travel to a co... |
| 2557 | s2558 | TV Show | Scissor Seven | He Xiaofeng | He Xiaofeng, Jiang Guangtao, Duan Yixuan, Zhu ... | China | 2020-05-07 | 2020 | TV-MA | 2 Seasons | International TV Shows, TV Action & Adventure,... | Seeking to recover his memory, a scissor-wield... |
| 8259 | s8260 | Movie | The Crow | Alex Proyas | Brandon Lee, Rochelle Davis, Ernie Hudson, Mic... | United States | 2019-01-01 | 1994 | R | 102 min | Action & Adventure, Cult Movies, Sci-Fi & Fantasy | One year after Eric Draven and his fiancée are... |


> ### ✅ The data is now free of missing and duplicate values.

## 6. Save the Cleaned Dataset

In [42]:
df.to_csv('Netflix.csv')
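
By default `to_csv` also writes the index as an unnamed first column, and after dropping rows the index is no longer contiguous (it runs from 2 to 8806 in the `df.info()` output). If the index carries no meaning, passing `index=False` keeps it out of the file. A small sketch using in-memory buffers in place of a real file, on a made-up frame:

```python
import io
import pandas as pd

# Illustrative frame with a non-contiguous index, as left by dropna.
toy = pd.DataFrame({"show_id": ["s3", "s7"], "title": ["A", "B"]}, index=[2, 6])

# Default: the index is written as an extra, unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)

# index=False writes only the data columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)

# Reading the second version back yields just the original columns.
reloaded = pd.read_csv(io.StringIO(without_index.getvalue()))
print(reloaded.columns.tolist())
```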