# Table of Contents
* [Reading and Inspecting Data](#reading)
* [Data Cleaning](#data_cleaning)
    * [Handling Missing values](#handling_missing_values)
    * [Standardization](#standardization)
* [Univariate Analysis](#univariate_analysis)
* [Bivariate Analysis](#bivariate_analysis)
* [Summary](#summary)

<a id = 'reading'></a>
# Reading and Inspecting Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
sns.set()

In [None]:
df = pd.read_csv('/kaggle/input/exploratory-data-analysis-on-netflix-data/netflix_titles_2021.csv')
df.head()

The dataset consists of the following columns:
* **show_id:** Identifier of the show.
* **type:** Type of the title; one of two values, `Movie` or `TV Show`.
* **title:** Title of the Movie or TV Show.
* **director:** Director of the Movie or TV Show.
* **cast:** Cast members who appear in the Movie or TV Show.
* **country:** Name of the country associated with the Movie or TV Show.
* **date_added:** Date on which the Movie or TV Show was added to Netflix.
* **release_year:** Year in which the Movie or TV Show was originally released.
* **rating:** Audience rating of the Movie or TV Show, i.e. the audience it is intended for (for example `TV-14` or `TV-MA`).
* **duration:** Duration of the Movie or TV Show.
* **listed_in:** Genre(s) of the Movie or TV Show.
* **description:** Short description of the Movie or TV Show.

In [None]:
print(f'Dataset has {df.shape[0]} rows and {df.shape[1]} columns')

In [None]:
df.info()

In [None]:
df['date_added'].head(20)

In [None]:
df['duration']

In [None]:
df['rating']

<a id = 'data_cleaning'></a>
# Data Cleaning

<a id = 'handling_missing_values'></a>
## Handling Missing Values

In [None]:
# check % of missing values
df.isnull().mean()*100


### Handling Missing Values in 'director' feature

In [None]:
df[df['director'].isnull()]

In [None]:
# replacing NaN with 'unknown' in the director column
df['director'] = df['director'].fillna('unknown')
df['director'].isnull().sum()

### Handling Missing Values in 'cast' feature

In [None]:
df[df['cast'].isnull()]

In [None]:
# replacing NaN with 'unknown' in the cast column
df['cast'] = df['cast'].fillna('unknown')
df['cast'].isnull().sum()

### Handling Missing Values in 'country' feature

In [None]:
df[df['country'].isnull()]

In [None]:
df['country'] = df['country'].fillna('unknown')
df['country'].isnull().sum()

* **The `date_added`, `rating`, and `duration` features have only a few NaN values, so those rows can be dropped directly.**

In [None]:
df.dropna(inplace = True)
df.isnull().mean()*100
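
Calling `dropna` without arguments removes a row if any column is missing. As a minimal sketch on made-up toy data (the frame `toy` and its values are assumptions for illustration, not the Netflix data), the `subset` parameter restricts the check to the three columns named above:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Netflix data (values are made up for illustration).
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "date_added": ["July 1, 2020", np.nan, "May 3, 2019", "June 9, 2018"],
    "rating": ["TV-MA", "TV-14", np.nan, "PG"],
    "duration": ["90 min", "2 Seasons", "100 min", "1 Season"],
    "description": ["x", "y", "z", np.nan],  # NaN in a column we do not filter on
})

# Drop a row only when one of the three named columns is missing.
cols = ["date_added", "rating", "duration"]
cleaned = toy.dropna(subset=cols)

print(cleaned["title"].tolist())
```

In this notebook the other columns were already filled with 'unknown', so both forms should drop the same rows; the explicit `subset` mainly documents the intent.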

<a id ='standardization'></a>
## Standardization

In [None]:
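# Capitalizing the 'unknown' placeholder in the 'country' feature.
# str.capitalize() is not applied to the whole column because it lowercases
# every letter after the first ('United States' -> 'United states').
df['country'] = df['country'].replace('unknown', 'Unknown')
df.country.head()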
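
String case methods behave differently on multi-word names, which matters when standardizing country names. The following is a small sketch on a hypothetical series `countries` (toy values, not read from the dataset) comparing `str.capitalize()` with a targeted replacement of the placeholder:

```python
import pandas as pd

# Hypothetical sample of country values, including the 'unknown' placeholder.
countries = pd.Series(["United States", "India", "unknown", "United Kingdom"])

# capitalize() upper-cases the first character and lower-cases all the rest.
capitalized = countries.str.capitalize()

# Replacing only the placeholder leaves real country names untouched.
placeholder_only = countries.replace("unknown", "Unknown")

print(capitalized.tolist())
print(placeholder_only.tolist())
```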

In [None]:
df[['date_added','release_year']]

In [None]:
# extracting month and year from date_added (text before and after the comma)
df['added_month'] = df['date_added'].apply(lambda x: x.split(',')[0].split()[0])
df['added_year'] = df['date_added'].apply(lambda x: x.split(',')[1].strip())
df[['added_month','added_year']]

In [None]:
# dropping column 'date_added'
df.drop('date_added', axis=1, inplace=True)

In [None]:
df['added_year'] = df['added_year'].astype(int)
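
As an alternative to manual string splitting, the month and year can be derived from a parsed datetime. This is only a sketch: it assumes the values look like 'September 25, 2021', possibly with stray surrounding whitespace, and the series `dates` below is made-up sample data.

```python
import pandas as pd

# Made-up sample values in the assumed 'Month D, YYYY' format.
dates = pd.Series([" August 4, 2017", "September 25, 2021", "January 1, 2020"])

# Strip whitespace, then parse with an explicit format.
parsed = pd.to_datetime(dates.str.strip(), format="%B %d, %Y")

added_month = parsed.dt.month_name()  # e.g. 'August'
added_year = parsed.dt.year           # integer year, no astype(int) needed

print(added_month.tolist())
print(added_year.tolist())
```

Parsing to a datetime yields an integer year directly and fails loudly on values that do not match the expected format.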

In [None]:
df[df['added_year']<df['release_year']]

In [None]:
# added_year can't be less than release_year, so dropping those rows.
df.drop(df[df['added_year']<df['release_year']].index,inplace = True)
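
The same consistency filter can also be written with a boolean mask instead of dropping by index. A minimal sketch on made-up rows (the frame `toy` is an assumption for illustration):

```python
import pandas as pd

# Made-up rows; title 'B' was supposedly added before it was released.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "added_year": [2019, 2018, 2021],
    "release_year": [2018, 2020, 2021],
})

# True where the year added precedes the release year.
inconsistent = toy["added_year"] < toy["release_year"]

# Keep only the consistent rows and renumber the index.
kept = toy[~inconsistent].reset_index(drop=True)

print(kept["title"].tolist())
```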

In [None]:
df.head()

<a id ='univariate_analysis'></a>
# Univariate Analysis

## Analysing 'Type' feature

In [None]:
df['type'].nunique()

In [None]:
df['type'].value_counts(normalize=True)

In [None]:
df['type'].value_counts(normalize=True).plot.bar()

**Inference**
* About 69% of the titles on Netflix are Movies and 31% are TV Shows.

## Analysing 'director' feature

In [None]:
df['director'].nunique()

In [None]:
df['director'].value_counts()

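**Inference**
* Among named directors, "Rajiv Chilaka" has the highest number of titles. The 'unknown' placeholder introduced during cleaning also appears in these counts and should be ignored.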
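
Because missing directors were filled with 'unknown', that placeholder also shows up in `value_counts()`. A minimal sketch of excluding it before ranking, using a made-up series `directors` (the names are hypothetical):

```python
import pandas as pd

# Made-up director values; 'unknown' is the placeholder used for missing entries.
directors = pd.Series(["unknown", "unknown", "unknown", "Dir A", "Dir A", "Dir B"])

# Counting everything ranks the placeholder first.
counts = directors.value_counts()
top_overall = counts.index[0]

# Filtering the placeholder out gives the most frequent named director.
named_counts = directors[directors != "unknown"].value_counts()
top_named = named_counts.index[0]

print(top_overall, top_named)
```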

## Analysing 'country' Feature

In [None]:
df['country'].nunique()

In [None]:
# Top 3 countries by number of titles
df['country'].value_counts(normalize=True)[:3]

In [None]:
df['country'].value_counts(normalize=True)[:3].plot.bar()

**Inference**
* Most titles on Netflix come from the USA (31%), followed by India (11%).

## Analysing 'added_year' feature


In [None]:
# histogram of the year added with a KDE overlay (sns.distplot is deprecated)
sns.histplot(df['added_year'], kde=True)

In [None]:
plt.figure(figsize=(12,8))
sns.countplot(x = df['added_year'])

**Inference**
* The highest number of titles was added in 2019.

In [None]:
df['added_month'].value_counts(normalize=True).plot.bar()

**Inference**
* July, December, and September are the months in which the most titles were added.
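
`value_counts()` sorts by frequency, so the bar chart does not follow the calendar. As a sketch on made-up month names (the series `months` is an assumption for illustration), the shares can be reindexed into calendar order before plotting:

```python
import calendar
import pandas as pd

# Made-up month names standing in for the 'added_month' column.
months = pd.Series(["July", "December", "July", "January",
                    "September", "December", "July"])

# English month names in calendar order (index 0 of month_name is an empty string).
month_order = list(calendar.month_name)[1:]

# Share of titles per month, reordered; months with no titles get 0.
share = months.value_counts(normalize=True).reindex(month_order, fill_value=0)

print(share)
```

Calling `share.plot.bar()` on the result would then show the months from January to December, which makes seasonal patterns easier to read.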

## Analysing 'rating' feature

In [None]:
df.rating.nunique()

In [None]:
df.rating.value_counts(normalize=True)[:5].plot.bar()

**Inference**
* The most common ratings are 'TV-MA', 'TV-14', and 'TV-PG'.

In [None]:
# Analysing 'listed_in' feature
genre_list = df['listed_in'].str.split(',')

In [None]:
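# first, second and third listed genre (NaN when a title lists fewer);
# strip() removes the space left after each comma
df['genre_1'] = genre_list.str.get(0).str.strip()
df['genre_2'] = genre_list.str.get(1).str.strip()
df['genre_3'] = genre_list.str.get(2).str.strip()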


In [None]:
# dropping column 'listed_in'
df.drop('listed_in', axis=1, inplace=True)

In [None]:
df.head()

In [None]:
df['genre_1'].value_counts(normalize=True)[:5].plot.bar()

**Inference**
* 'Drama' is the most common first-listed genre (about 17% of titles), followed by 'Comedies'.
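
The figure above counts only the first genre listed for each title. To count every genre a title is listed in, one option is `explode`. The sketch below uses made-up `listed_in` strings and assumes genres are separated by a comma and a space; in the notebook it would have to run before the `listed_in` column is dropped.

```python
import pandas as pd

# Made-up genre strings in the assumed 'Genre A, Genre B' format.
listed_in = pd.Series([
    "Dramas, International Movies",
    "Comedies, Dramas",
    "Documentaries",
])

# One row per (title, genre) pair, then count each genre.
all_genres = listed_in.str.split(", ").explode()
genre_counts = all_genres.value_counts()

print(genre_counts)
```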

In [None]:
df

<a id = 'bivariate_analysis'></a>
# Bivariate Analysis

In [None]:
temp_df = df[df['country'].isin(df['country'].value_counts()[:5].index)]

In [None]:
sns.countplot(data=temp_df,x='country',hue='type')

**Inference**
* In the USA and India, Movies outnumber TV Shows; in the United Kingdom the two types are about equally common; in Japan, TV Shows outnumber Movies.
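
Raw counts make countries with large catalogues dominate the plot. As a sketch on made-up rows (the frame `toy` is an assumption for illustration), a row-normalized crosstab gives the share of each type within a country:

```python
import pandas as pd

# Made-up rows: 3 movies and 1 show for one country, 1 movie and 2 shows for another.
toy = pd.DataFrame({
    "country": ["United States"] * 4 + ["Japan"] * 3,
    "type": ["Movie", "Movie", "Movie", "TV Show",
             "TV Show", "TV Show", "Movie"],
})

# normalize="index" makes each country's row sum to 1.
shares = pd.crosstab(toy["country"], toy["type"], normalize="index")

print(shares)
```

Applied to `temp_df`, `shares.plot.bar(stacked=True)` would show the Movie and TV Show split per country on a common 0 to 1 scale.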

In [None]:
list(df['country'].value_counts()[:5].index)

In [None]:
sns.countplot(data=temp_df,x='country',hue='rating')

**Inference**
* In the United States the most common rating is 'TV-MA', while in India it is 'TV-14'.
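
The most common rating per country can also be read off numerically instead of from the plot. A minimal sketch on made-up rows (the frame `toy` is an assumption for illustration); when two ratings tie, this simply takes whichever `value_counts()` lists first:

```python
import pandas as pd

# Made-up rows standing in for the country and rating columns.
toy = pd.DataFrame({
    "country": ["United States", "United States", "United States",
                "India", "India", "India"],
    "rating": ["TV-MA", "TV-MA", "PG", "TV-14", "TV-14", "TV-MA"],
})

# For each country, take the rating with the highest count.
top_rating = toy.groupby("country")["rating"].agg(
    lambda s: s.value_counts().index[0]
)

print(top_rating)
```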

In [None]:
sns.countplot(data=temp_df,x='country',hue='genre_1')

<a id ='summary'></a>
# Summary
* About 69% of the titles on Netflix are Movies and 31% are TV Shows.
* Among named directors, "Rajiv Chilaka" has the highest number of titles.
* Most titles on Netflix come from the USA (31%), followed by India (11%).
* The highest number of titles was added in 2019.
* July, December, and September are the months in which the most titles were added.
* The most common ratings are 'TV-MA', 'TV-14', and 'TV-PG'.
* 'Drama' is the most common first-listed genre (about 17% of titles), followed by 'Comedies'.
* In the USA and India, Movies outnumber TV Shows; in the United Kingdom the two types are about equally common; in Japan, TV Shows outnumber Movies.
* In the United States the most common rating is 'TV-MA', while in India it is 'TV-14'.