# Netflix Exploratory Data Analysis

This notebook presents an exploratory data analysis (EDA) of the **Netflix data set** from **Kaggle**.

--**STEPS**--
1. Basic Data Exploration
   1. Import Packages and Load Data
   2. Feature Exploration
   3. Summary Statistics
2. Data Cleaning
   1. Null Value Analysis
   2. Outlier Analysis
3. Exploratory Data Analysis (answering the questions we have about the data)

## Basic Data Exploration
1. Import Packages and Load Data
2. Feature Exploration
3. Summary Statistics

#### Importing Libraries and Loading the Dataset

In [43]:
# Import Relevant Packages
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import missingno as msno

In [None]:
# Set some Options
# ensure that all columns will be displayed in their entirety when printing a DataFrame.
pd.set_option("display.max_columns", None)
# set the display width to 500 characters
pd.set_option("display.width", 500)

In [None]:
# Load Data set
df = pd.read_csv('data_sets/netflix_titles.csv')

#### Feature Exploration

In [None]:
# First 3 rows
df.head(3)

In [None]:
df.columns

In [None]:
df.shape

In [None]:
# Data types in columns
df.info()

We have:
* 11 categorical features
* 1 numeric feature (`release_year`)
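
The split above can also be counted programmatically rather than read off `df.info()`. The following is a minimal sketch on a made-up three-row frame (`toy` and its values are hypothetical stand-ins, not the real Netflix file); running the same two `select_dtypes` calls on `df` should reproduce the counts listed above.

```python
import pandas as pd

# Hypothetical stand-in for the Netflix data (not the real file)
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "type": ["Movie", "TV Show", "Movie"],
    "release_year": [2019, 2020, 2021],
})

# Numeric columns, and everything else treated as categorical
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(f"{len(categorical_cols)} categorical: {categorical_cols}")
print(f"{len(numeric_cols)} numeric: {numeric_cols}")
```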

#### Summary Statistics

In [None]:
# Summary statistics for numerical features
numerical_features = df.select_dtypes(include='number')
# We have only 'release_year' as a numeric feature
numerical_features.describe().T

In [42]:
# Summary statistics for categorical features
categorical_features = df.select_dtypes(include='object')
categorical_features.describe().T

| feature | count | unique | top | freq |
|---|---|---|---|---|
| show_id | 8807 | 8807 | s1 | 1 |
| type | 8807 | 2 | Movie | 6131 |
| title | 8807 | 8807 | Dick Johnson Is Dead | 1 |
| director | 6173 | 4528 | Rajiv Chilaka | 19 |
| cast | 7982 | 7692 | David Attenborough | 19 |
| country | 7976 | 748 | United States | 2818 |
| date_added | 8797 | 1767 | January 1, 2020 | 109 |
| rating | 8803 | 17 | TV-MA | 3207 |
| duration | 8804 | 220 | 1 Season | 1793 |
| listed_in | 8807 | 514 | Dramas, International Movies | 362 |
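
The `count` and `freq` columns in the summary above are easier to interpret as shares. As a small worked example, the snippet below uses only numbers copied from that output; the variable names are illustrative.

```python
# Figures copied from the describe() output above
total_titles = 8807      # count for `type` (no missing values)
movie_count = 6131       # freq of the top `type` value, "Movie"
director_count = 6173    # non-null count for `director`

# Share of titles that are movies
movie_share = movie_count / total_titles

# Number and share of titles with no director listed
missing_director = total_titles - director_count
missing_director_share = missing_director / total_titles

print(f"Movies: {movie_share:.1%}")
print(f"Missing director: {missing_director} ({missing_director_share:.1%})")
```

So roughly seven in ten titles are movies, and close to three in ten rows have no director recorded.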


## Data Cleaning
1. Null Value Analysis
2. Outlier Analysis

#### Null Value Analysis

In [None]:
# Are there any null values in the dataset?
df.isnull().values.any()

In [None]:
# Which features have null values, and how many?
msno.matrix(df)

In [None]:
def missing_value_table(df, get_null_columns=False):
    # Find columns that contain at least one null value
    null_columns = [col for col in df.columns
                    if df[col].isnull().sum() > 0]
    
    # Null Value counts
    null_counts = df[null_columns].isnull().sum().sort_values(ascending=True)

    # Null Value Percentage
    null_value_rates = (df[null_columns].isnull().sum() / df.shape[0] * 100).sort_values(ascending=True)
    formatted_null_value_rates = null_value_rates.apply(lambda value: f"% {str(np.round(value, 2))}")

    # Null Value Table
    null_df = pd.concat([null_counts, formatted_null_value_rates],
                        axis=1, keys=["Null Value Count", "Null Value Rates"])
    print(null_df, end="\n")

    if get_null_columns:
        return null_columns
    

missing_value_table(df)

* Six features have null values, and all of them are ***categorical features***.
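
The observation above can be checked in code by intersecting the columns that contain nulls with the numeric columns. This is a minimal sketch on a made-up frame (`toy` and its values are hypothetical); applied to `df`, an empty `numeric_with_nulls` list would confirm that every column with missing values is categorical.

```python
import pandas as pd

# Hypothetical stand-in: two text columns with gaps, one complete numeric column
toy = pd.DataFrame({
    "director": ["X", None, "Z"],
    "rating": ["TV-MA", "PG", None],
    "release_year": [2019, 2020, 2021],
})

# Columns that contain at least one null value
null_cols = toy.columns[toy.isnull().any()].tolist()

# Numeric columns among them (expected to be empty here)
numeric_cols = toy.select_dtypes(include="number").columns
numeric_with_nulls = [col for col in null_cols if col in numeric_cols]

print("Columns with nulls:", null_cols)
print("Numeric columns with nulls:", numeric_with_nulls)
```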

In [None]:
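# Replace null values with the string 'missing'
columns_to_fill = missing_value_table(df, get_null_columns=True)
# fillna returns a new object, so assign the result back to the DataFrame
df[columns_to_fill] = df[columns_to_fill].fillna('missing')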
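
One thing to keep in mind with this step: `fillna` returns a new object by default and leaves the original frame untouched, so the fill only takes effect once the result is assigned back. A minimal sketch on a made-up frame (`toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in with one null in each column
toy = pd.DataFrame({"director": ["X", None], "country": [None, "India"]})
cols = ["director", "country"]

# Calling fillna without assignment leaves `toy` unchanged
filled = toy[cols].fillna("missing")
nulls_before_assignment = int(toy.isnull().sum().sum())

# Assigning the result back applies the fill
toy[cols] = toy[cols].fillna("missing")
nulls_after_assignment = int(toy.isnull().sum().sum())

print(nulls_before_assignment, nulls_after_assignment)
```

After the assignment, rerunning `df.isnull().values.any()` on the real data is a quick way to confirm that no nulls remain.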

#### Outlier Analysis