# **Exploratory Data Analysis of Netflix Shows**

![](https://images.ctfassets.net/y2ske730sjqp/821Wg4N9hJD8vs5FBcCGg/9eaf66123397cc61be14e40174123c40/Vector__3_.svg?w=940)
## **Introduction**

Welcome to the exploratory data analysis (EDA) of Netflix shows! In this notebook, we will dive into the dataset containing information about various shows available on Netflix. By performing EDA, we aim to gain insights, discover patterns, and uncover interesting trends within the data.

Netflix has become one of the leading streaming platforms, offering a vast library of TV shows and movies across different genres. Whether you are a Netflix user or simply interested in the entertainment industry, this EDA will give you a better understanding of the shows available on the platform.

## **Data Loading and Preparation**

In this section, we will import the necessary libraries for data manipulation and visualization, load the dataset into a Pandas DataFrame, and perform initial data cleaning and preprocessing steps. The dataset used for this analysis is titled "*Netflix Shows*".

### Import the required libraries for data manipulation and visualization

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns



### Load the "Netflix Shows" dataset

In [2]:
df = pd.read_csv('/kaggle/input/netflix-shows/netflix_titles.csv')

### Data Understanding

In [3]:
df.columns

Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')

In [4]:
df.head()

| | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | s1 | Movie | Dick Johnson Is Dead | Kirsten Johnson | NaN | United States | September 25, 2021 | 2020 | PG-13 | 90 min | Documentaries | As her father nears the end of his life, filmm... |
| 1 | s2 | TV Show | Blood & Water | NaN | Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... | South Africa | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, TV Dramas, TV Mysteries | After crossing paths at a party, a Cape Town t... |
| 2 | s3 | TV Show | Ganglands | Julien Leclercq | Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... | NaN | September 24, 2021 | 2021 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, TV Act... | To protect his family from a powerful drug lor... |
| 3 | s4 | TV Show | Jailbirds New Orleans | NaN | NaN | NaN | September 24, 2021 | 2021 | TV-MA | 1 Season | Docuseries, Reality TV | Feuds, flirtations and toilet talk go down amo... |
| 4 | s5 | TV Show | Kota Factory | NaN | Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... | India | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, Romantic TV Shows, TV ... | In a city of coaching centers known to train I... |


In [5]:
df.dtypes

show_id         object
type            object
title           object
director        object
cast            object
country         object
date_added      object
release_year     int64
rating          object
duration        object
listed_in       object
description     object
dtype: object

### Data Cleaning

To calculate the total number of missing values in each column of the DataFrame, we can use the following code:

In [6]:
df.isna().sum()

show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64

To handle these missing values, the following plan will be implemented:

1. The '**director**' column has **2634** missing values, far more than any other column, so we will drop the entire column.

2. For the columns '**cast**', '**country**', '**date_added**', '**rating**', and '**duration**', which have **825**, **831**, **10**, **4**, and **3** missing values respectively, we will delete the rows that contain missing values in these columns, so that the analysis is based on complete records. A row missing several of these fields is removed only once, so the total number of rows dropped can be smaller than the sum of these counts.
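
Before dropping anything, it can help to look at the share of missing values per column rather than the raw counts. The following is a minimal sketch of the two-step plan on a small made-up DataFrame; `toy` and all of its values are hypothetical and only borrow a few column names from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Netflix DataFrame (values are invented).
toy = pd.DataFrame({
    'title':    ['A', 'B', 'C', 'D'],
    'director': [np.nan, 'X', np.nan, np.nan],
    'cast':     ['p, q', np.nan, 'r', 's'],
    'rating':   ['PG', 'TV-MA', 'TV-14', 'PG-13'],
})

# Share of missing values per column, in percent.
missing_pct = toy.isna().mean().mul(100).round(1)
print(missing_pct)

# Step 1: drop the mostly empty column.
# Step 2: drop rows that are missing a value in any of the remaining columns of interest.
cleaned = toy.drop(columns=['director']).dropna(subset=['cast', 'rating'])
print(f"rows before: {len(toy)}, rows after: {len(cleaned)}")
```

Comparing the row count before and after makes the cost of the row-wise deletion explicit.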

#### Handling Missing Values
Removing Director column

In [7]:
df = df.drop(columns=['director'])
df.isna().sum()

show_id           0
type              0
title             0
cast            825
country         831
date_added       10
release_year      0
rating            4
duration          3
listed_in         0
description       0
dtype: int64

Removing rows with null values in Duration, Rating, Date Added, Country and Cast

In [8]:
# Drop rows that are missing a value in any of these columns (single pass)
df = df.dropna(subset=['duration', 'rating', 'date_added', 'country', 'cast'])

df.isna().sum()

show_id         0
type            0
title           0
cast            0
country         0
date_added      0
release_year    0
rating          0
duration        0
listed_in       0
description     0
dtype: int64

#### Converting 'date_added' Column

To make the 'date_added' column consistent and easier to read, we convert it from the "MMMM DD, YYYY" format (for example, "September 25, 2021") to "YYYY-MM-DD". Surrounding whitespace is stripped before parsing so that every value matches the expected format.

In [9]:
df['date_added'] = pd.to_datetime(df['date_added'].str.strip(), format='%B %d, %Y').dt.strftime('%Y-%m-%d')
df.head()

| | show_id | type | title | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | s2 | TV Show | Blood & Water | Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... | South Africa | 2021-09-24 | 2021 | TV-MA | 2 Seasons | International TV Shows, TV Dramas, TV Mysteries | After crossing paths at a party, a Cape Town t... |
| 4 | s5 | TV Show | Kota Factory | Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... | India | 2021-09-24 | 2021 | TV-MA | 2 Seasons | International TV Shows, Romantic TV Shows, TV ... | In a city of coaching centers known to train I... |
| 7 | s8 | Movie | Sankofa | Kofi Ghanaba, Oyafunmike Ogunlano, Alexandra D... | United States, Ghana, Burkina Faso, United Kin... | 2021-09-24 | 1993 | TV-MA | 125 min | Dramas, Independent Movies, International Movies | On a photo shoot in Ghana, an American model s... |
| 8 | s9 | TV Show | The Great British Baking Show | Mel Giedroyc, Sue Perkins, Mary Berry, Paul Ho... | United Kingdom | 2021-09-24 | 2021 | TV-14 | 9 Seasons | British TV Shows, Reality TV | A talented batch of amateur bakers face off in... |
| 9 | s10 | Movie | The Starling | Melissa McCarthy, Chris O'Dowd, Kevin Kline, T... | United States | 2021-09-24 | 2021 | PG-13 | 104 min | Comedies, Dramas | A woman adjusting to life after a loss contend... |
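
Because `.dt.strftime` returns strings, the converted column still has the `object` dtype. If later steps need date arithmetic or grouping by year, one option is to keep the parsed datetime values and format them only for display. A minimal sketch on made-up sample strings (the `sample` Series is hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical sample in the dataset's "Month D, YYYY" style, including a stray leading space.
sample = pd.Series(['September 25, 2021', ' August 4, 2017', 'January 1, 2020'])

# Parse to datetime values and keep that dtype instead of formatting back to strings.
parsed = pd.to_datetime(sample.str.strip(), format='%B %d, %Y')
print(parsed.dtype)

# Date parts are then available through the .dt accessor.
print(parsed.dt.year.tolist())

# Formatting to "YYYY-MM-DD" strings can still be done on demand.
print(parsed.dt.strftime('%Y-%m-%d').tolist())
```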


## **Exploratory Data Analysis**



### Basic Data Exploration

#### Total number of rows and columns in the dataset

In [10]:
df.shape

(7290, 11)

#### Summary statistics of numerical variables (e.g., release year)

In [11]:
df.describe()

| | release_year |
|---|---|
| count | 7290.0 |
| mean | 2013.698903 |
| std | 8.862822 |
| min | 1942.0 |
| 25% | 2013.0 |
| 50% | 2017.0 |
| 75% | 2019.0 |
| max | 2021.0 |


#### Count of unique values in categorical variables (e.g., show types, countries)

In [12]:
show_types_count = df['type'].value_counts()
countries_count = df['country'].value_counts()

print("Count of unique values in 'type' column:")
print(show_types_count)
print("\nCount of unique values in 'country' column:")
print(countries_count)

Count of unique values in 'type' column:
Movie      5277
TV Show    2013
Name: type, dtype: int64

Count of unique values in 'country' column:
United States                                2479
India                                         940
United Kingdom                                350
Japan                                         238
South Korea                                   196
                                             ... 
Uruguay, Argentina, Germany, Spain              1
Taiwan, Malaysia                                1
France, South Korea, Japan, United States       1
Kenya, United States                            1
United Arab Emirates, Jordan                    1
Name: country, Length: 689, dtype: int64
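
Raw counts are often easier to compare as shares. The sketch below recomputes percentages from the two `type` counts printed above (5277 movies and 2013 TV shows); the variable names are illustrative only. On the full DataFrame, `df['type'].value_counts(normalize=True)` would give the same proportions directly.

```python
import pandas as pd

# Counts copied from the 'type' output above.
type_counts = pd.Series({'Movie': 5277, 'TV Show': 2013})

# Convert counts to percentage shares of all titles.
shares = (type_counts / type_counts.sum() * 100).round(1)
print(shares)
```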


***In the dataset, some rows in the 'country' column contain multiple countries separated by commas. To ensure consistency and facilitate analysis, we will modify these rows to include only the first country from the list.***

In [13]:
# Keep only the first listed country for each title (vectorized, no row loop)
df['country'] = df['country'].str.split(',').str[0]
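
Keeping only the first country is a simplification: a co-production is credited entirely to whichever country happens to be listed first. The sketch below, on hypothetical values, shows that reduction next to an alternative that counts every listed country by giving each one its own row with `explode`. The extra `.str.strip()` is an addition here to remove the space that follows each comma.

```python
import pandas as pd

# Hypothetical 'country' values in the dataset's comma-separated style.
country = pd.Series([
    'United States',
    'United States, Ghana',
    'France, South Korea, Japan',
])

# First-country reduction: same idea as the cell above, with whitespace stripped.
first_country = country.str.split(',').str[0].str.strip()
print(first_country.value_counts())

# Alternative: one row per listed country, so co-producing countries are counted too.
all_countries = country.str.split(',').explode().str.strip()
print(all_countries.value_counts())
```

Which variant fits better depends on whether the question is about a title's primary country or about every country involved in it.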

## **Key Insights and Findings**

## **Conclusion**

## **Credits**

We would like to acknowledge and express our gratitude to the following team members for their contributions to this EDA:

* **[Yahya Lazrek](https://github.com/uuinc)**
* **[Yousra Elbarraq](https://github.com/yousraeb)**
* **[Ouassima Aboukhair](https://github.com/OuassimaAboukhair)**

We would also like to thank **Dr. Hamza Es-samaali** for providing guidance and support throughout the project.

## **References**

The dataset used for this EDA can be found at: [**Netflix Shows Dataset**](https://www.kaggle.com/datasets/poojasomavanshi/netflix-shows)