<a href="https://colab.research.google.com/github/diyaaistwal/Netflix-Analysis/blob/main/netflix_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Netflix Data Analysis project**
"""
## Overview

Welcome to the Netflix Data Analysis project! In this project you will explore and analyze a Netflix titles dataset
using the Pandas library in Python. It is aimed at intermediate-level programmers who have some prior experience in
Python and want to draw insights from real-world entertainment data.

## Project Description

### TASKS


Task-1 **Data Acquisition:** Learn how to load and read datasets into your Python environment.

Task-2, Task-3 **Data Preprocessing:** Learn to clean and transform the raw data to make it suitable for analysis.

Task-4, Task-5 **Exploratory Data Analysis (EDA):** Explore the dataset using statistical summaries, visualizations, and other techniques to uncover patterns, trends, and relationships within the data.

Task-6, Task-7 **Statistical Analysis:** Learn to perform statistical tests and calculations to quantify relationships between variables, identify outliers, and assess the significance of findings.


### Tools and Libraries

- Python
- Pandas

## Pre-requisites

1. Python Fundamentals
2. Pandas
"""


## **TASK 1**
"""
*  Load the dataset into a dataframe using the *'pandas'* library.
*  The path to the data is provided as the *data_path* parameter.
*  A base function (load_dataset) is provided below; return the first 5 rows of the dataframe. """
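
As a minimal sketch of the function the task describes, the snippet below wraps `pd.read_csv` in `load_dataset(data_path)`. The in-memory CSV and its rows are hypothetical stand-ins for the real file, and the function body is an assumption about how the base function could be filled in, not the notebook's own solution.

```python
import io

import pandas as pd


def load_dataset(data_path):
    """Read a CSV from a path or file-like object and return its first 5 rows."""
    df = pd.read_csv(data_path)
    return df.head()


# Hypothetical in-memory CSV standing in for the real data file.
sample_csv = io.StringIO(
    "show_id,type,title\n"
    "s1,Movie,Title A\n"
    "s2,TV Show,Title B\n"
    "s3,Movie,Title C\n"
    "s4,Movie,Title D\n"
    "s5,TV Show,Title E\n"
    "s6,Movie,Title F\n"
)

preview = load_dataset(sample_csv)
print(preview)
```

Passing a file path string instead of the `StringIO` object works the same way, since `pd.read_csv` accepts both.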

In [20]:
import pandas as pd
df = pd.read_csv('dataset_netflix.csv', delimiter=',', skiprows=1)

print(df.head())


  show_id     type                             title         director  \
0      s1    Movie              Dick Johnson Is Dead  Kirsten Johnson   
1      s3  TV Show                         Ganglands  Julien Leclercq   
2      s6  TV Show                     Midnight Mass    Mike Flanagan   
3     s14    Movie  Confessions of an Invisible Girl    Bruno Garotti   
4      s8    Movie                           Sankofa     Haile Gerima   

         country date_added  release_year rating  duration  \
0  United States  9/25/2021          2020  PG-13        90   
1         France  9/24/2021          2021  TV-MA  1 Season   
2  United States  9/24/2021          2021  TV-MA  1 Season   
3         Brazil  9/22/2021          2021  TV-PG        91   
4  United States  9/24/2021          1993  TV-MA       125   

                                           listed_in  
0                                      Documentaries  
1  Crime TV Shows, International TV Shows, TV Act...  
2                 TV Dr

## **TASK 2**

* "Not Given" values may indicate missing or incomplete data. By removing them, you can improve the overall quality of the dataset and ensure that it accurately represents the phenomena under study.
* Drop the rows of the dataframe whose *director* or *country* value is "Not Given" and return the filtered dataframe.
* The dataframe is provided as the *df* parameter.
* A base function (data_preprocessing) is provided below; return the first 5 rows of the dataframe after processing.
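
The filtering rule can also be expressed as a boolean mask. The sketch below uses a small hypothetical frame (the titles, directors, and countries are invented for illustration) and leaves the original frame untouched, which is one possible design choice rather than a requirement of the task.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the real dataset.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["Jane Doe", "Not Given", "John Roe", "Ann Poe"],
    "country": ["France", "Brazil", "Not Given", "India"],
})

# Keep a row only if neither column holds the "Not Given" placeholder.
keep = (toy["director"] != "Not Given") & (toy["country"] != "Not Given")
filtered = toy[keep].reset_index(drop=True)
print(filtered)
```

Rows "B" and "C" are dropped because one of the two columns contains the placeholder.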

In [21]:
column_names = ['show_id', 'type', 'title', 'director', 'country', 'date_added', 'release_year', 'rating', 'duration', 'listed_in']
df.columns = column_names
print(df.head())



  show_id     type                             title         director  \
0      s1    Movie              Dick Johnson Is Dead  Kirsten Johnson   
1      s3  TV Show                         Ganglands  Julien Leclercq   
2      s6  TV Show                     Midnight Mass    Mike Flanagan   
3     s14    Movie  Confessions of an Invisible Girl    Bruno Garotti   
4      s8    Movie                           Sankofa     Haile Gerima   

         country date_added  release_year rating  duration  \
0  United States  9/25/2021          2020  PG-13        90   
1         France  9/24/2021          2021  TV-MA  1 Season   
2  United States  9/24/2021          2021  TV-MA  1 Season   
3         Brazil  9/22/2021          2021  TV-PG        91   
4  United States  9/24/2021          1993  TV-MA       125   

                                           listed_in  
0                                      Documentaries  
1  Crime TV Shows, International TV Shows, TV Act...  
2                 TV Dr

In [22]:
def data_preprocessing(df):
  # Assign the result back instead of calling replace(..., inplace=True) on a
  # column selection, which pandas warns will stop working in pandas 3.0.
  df['director'] = df['director'].replace('Not Given', pd.NA)
  df['country'] = df['country'].replace('Not Given', pd.NA)
  df.dropna(subset=['director', 'country'], inplace=True)
  return df

filtered_df=data_preprocessing(df)
print(filtered_df.head())



  show_id     type                             title         director  \
0      s1    Movie              Dick Johnson Is Dead  Kirsten Johnson   
1      s3  TV Show                         Ganglands  Julien Leclercq   
2      s6  TV Show                     Midnight Mass    Mike Flanagan   
3     s14    Movie  Confessions of an Invisible Girl    Bruno Garotti   
4      s8    Movie                           Sankofa     Haile Gerima   

         country date_added  release_year rating  duration  \
0  United States  9/25/2021          2020  PG-13        90   
1         France  9/24/2021          2021  TV-MA  1 Season   
2  United States  9/24/2021          2021  TV-MA  1 Season   
3         Brazil  9/22/2021          2021  TV-PG        91   
4  United States  9/24/2021          1993  TV-MA       125   

                                           listed_in  
0                                      Documentaries  
1  Crime TV Shows, International TV Shows, TV Act...  
2                 TV Dr


## **TASK 3**

* Breaking a timestamp into separate components like month, year, and date can be advantageous for various reasons:

  *   Granular Analysis: For a Netflix dataset, breaking timestamps into separate components like month, year, and date allows analysts to examine at a fine level when titles were added.

  *   Facilitates Filtering and Grouping: Separate components make it easy to filter and group rows, for example counting titles added per month or per year, or finding the most popular genres by month.


* Add the columns 'month_added', 'year_added', and 'day_added'.
* Ensure that the 'month_added' column contains full month names (like 'September') and the 'day_added' column contains full day names (like 'Saturday').
* The dataframe is provided as the *df* parameter.
* A base function (date_preprocessing) is provided below; return the date_added value for Ganglands in the dataframe as a string (str).
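
The date handling can be sketched on a tiny frame. The two real titles and dates come from the output shown earlier; the "Bad Row" entry is hypothetical and is there only to show what `errors="coerce"` does with an unparseable value. The explicit `format` string is an assumption based on the `9/24/2021` style seen in the data.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Ganglands", "Dick Johnson Is Dead", "Bad Row"],
    "date_added": ["9/24/2021", "9/25/2021", "not a date"],
})

# Unparseable values become NaT instead of raising an error.
toy["date_added"] = pd.to_datetime(
    toy["date_added"], format="%m/%d/%Y", errors="coerce"
)
toy["month_added"] = toy["date_added"].dt.month_name()  # e.g. 'September'
toy["day_added"] = toy["date_added"].dt.day_name()      # e.g. 'Saturday'
toy["year_added"] = toy["date_added"].dt.year

# Pull out a single value and convert it to a string, as the task asks.
ganglands_date = str(toy.loc[toy["title"] == "Ganglands", "date_added"].iloc[0])
print(toy)
print(ganglands_date)
```

Selecting with `.iloc[0]` returns one timestamp rather than a one-row Series, so `str()` yields just the date text.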

In [26]:
def date_preprocessing(df):
  df['date_added']=pd.to_datetime(df['date_added'],errors='coerce')
  df['month_added']=df['date_added'].dt.month_name()
  df['day_added']=df['date_added'].dt.day_name()
  df['year_added']=df['date_added'].dt.year
  res=df[df['title']=='Ganglands']
  return res['date_added']

print(date_preprocessing(df))


1   2021-09-24
Name: date_added, dtype: datetime64[ns]


## **TASK 4**
* We're tasked with assisting Netflix in determining the total count of TV shows available on their platform. Let's find out how many TV shows are in the dataframe.
* A TV show is a row whose 'type' column value is 'TV Show' in the given dataframe.
* The dataframe is provided as the *df* parameter.
* A base function (exploratory_data_analysis1) is provided below; return the number of TV shows.
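
Counting rows that match a condition can be done without building an intermediate dataframe. The sketch below, on hypothetical toy data, sums a boolean mask and also shows `value_counts()` as a way to see every category at once.

```python
import pandas as pd

# Hypothetical toy data with the same 'type' labels as the dataset.
toy = pd.DataFrame({"type": ["Movie", "TV Show", "TV Show", "Movie", "TV Show"]})

# True counts as 1 when summed, so this is the number of matching rows.
tv_show_count = int((toy["type"] == "TV Show").sum())

# Counts for every distinct value in the column.
type_counts = toy["type"].value_counts()

print(tv_show_count)
print(type_counts)
```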

In [27]:
def exploratory_data_analysis1(df):
  tv_shows_df=df[df['type']=='TV Show']
  tv_show_count=tv_shows_df.shape[0]
  return tv_show_count

result=exploratory_data_analysis1(df)
print(result)


219


## **TASK 5**
* Netflix is interested in finding out how many movies were released during their high-demand month of December. Let's help them by identifying the number of movies released in December from the dataset.
* Count the *Movie* titles released in *December* (the notebook uses the month of *date_added*, since the dataset records only the release year).
* The dataframe is provided as the *df* parameter.
* A base function (exploratory_data_analysis2) is provided below; return the number of such movies.
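
One way the base function could be filled in is sketched below. The function name comes from the task text, but the body is an assumption: it expects `date_added` to already be a datetime column (as it is after Task 3) and combines two boolean conditions. The toy rows are hypothetical.

```python
import pandas as pd


def exploratory_data_analysis2(df):
    """Count rows that are movies and whose date_added falls in December."""
    is_movie = df["type"] == "Movie"
    in_december = df["date_added"].dt.month == 12
    return int((is_movie & in_december).sum())


# Hypothetical toy data for illustration only.
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie"],
    "date_added": pd.to_datetime(
        ["2020-12-01", "2021-09-24", "2019-12-15", "2018-12-31"]
    ),
})

december_movie_count = exploratory_data_analysis2(toy)
print(december_movie_count)
```

The December TV show is excluded by the first condition and the September movie by the second, leaving two rows.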




In [28]:
movies_df = df[df['type'] == 'Movie']
december_movies = movies_df[movies_df['date_added'].dt.month_name() == 'December']
print(december_movies.shape[0])

309


## **TASK 6**
* Netflix wants to ensure that its typical movie length is not excessively long, as overly long movies may bore customers.
* To assist them, let's calculate the median movie length on their platform.
* Find the median duration of the movies.
* The dataframe is provided as the *df* parameter.
* A base function (statistical_analysis1) is provided below; return the median duration from the given dataframe.
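
Because the `duration` column mixes movie minutes with values such as `1 Season`, the numeric conversion needs some care. The sketch below uses hypothetical toy rows and assumes movie durations are either plain numbers or numbers followed by ` min`; anything else becomes `NaN` through `errors="coerce"` and is ignored by `median()`.

```python
import pandas as pd

# Hypothetical toy data mixing the duration styles seen in the dataset.
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "duration": ["90", "1 Season", "91 min", "125"],
})

movies = toy[toy["type"] == "Movie"]

# Work on a separate Series so no column is set on a slice of the frame.
minutes = pd.to_numeric(
    movies["duration"].str.replace(" min", "", regex=False),
    errors="coerce",
)
median_minutes = minutes.median()
print(median_minutes)
```

The median is used rather than the mean because a few very long movies would pull the mean upward.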



In [29]:
# Rebind movies_df via assign() rather than setting a column on a slice of df,
# which raises SettingWithCopyWarning.
movies_df = movies_df.assign(duration=movies_df['duration'].str.replace(' min', '').astype(float))
print(movies_df['duration'].median())


99.0


## **TASK 7**

* Find the country with the highest combined number of TV shows and movies.
* The dataframe is provided as the *df* parameter.
* A base function (statistical_analysis2) is provided below; return the name of that country from the given dataframe.
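
Finding the most frequent country amounts to counting rows per country and taking the largest count. The sketch below does this with `value_counts()` and `idxmax()` on hypothetical toy data, which also exposes the count itself; `mode()` is an equivalent shortcut when only the name is needed, though with ties the two approaches may pick different countries.

```python
import pandas as pd

# Hypothetical toy data; each row stands for one title.
toy = pd.DataFrame({
    "country": [
        "United States", "France", "United States",
        "Brazil", "France", "United States",
    ]
})

counts = toy["country"].value_counts()  # titles per country
top_country = counts.idxmax()           # label with the largest count
top_count = int(counts.max())

print(top_country, top_count)
```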



In [30]:
def statistical_analysis(df):
    max_country = df['country'].mode()[0]
    return max_country

resultant_country = statistical_analysis(df)
print("Country with maximum TV shows and movies:", resultant_country)


Country with maximum TV shows and movies: United States
