# **Business Case: Netflix - Data Exploration & Visualisation**

***Business Problem*** : Analyze the data and generate insights that could help Netflix decide which types of shows/movies to produce and how to grow the business in different countries.

In [None]:
# Import the required libraries
import numpy as np
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
import warnings  # suppress library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

In [None]:
#Loading of dataset
df = pd.read_csv("netflix.csv")
df.head()


*   The "director", "cast", "country" and "listed_in" columns hold comma-separated values and need to be unnested to make our analysis more accurate.
*   The "duration" column holds minutes for movies and the number of seasons for TV shows.





***Attributes information:***

*Show_id*: Unique ID for every Movie / Tv Show

*Type:* Identifier - A Movie or TV Show

*Title:* Title of the Movie / Tv Show

*Director*: Director of the Movie

*Cast*: Actors involved in the movie/show

*Country*: Country where the movie/show was produced

*Date_added:* Date it was added on Netflix

*Release_year*: Actual Release year of the movie/show

*Rating*: TV Rating of the movie/show

*Duration*: Total Duration - in minutes or number of seasons

*Listed_in*: Genre

*Description*: The summary description


In [None]:
df.shape #checking the count of no. of rows and columns of dataset

*The dataset has 8807 rows and 12 attributes.*

In [None]:
df.info() #to check the data types of all columns and count of values in particular column.


*   The "rating" and "date_added" columns are stored as "object"; they should be categorical and datetime, respectively.
*   The "cast" and "director" columns have a large number of missing values.






# ***Statistical summary***

In [None]:
df.describe()  #to check statistical summary of numerical type data



*   25% of the titles were released in 2019-2021.
*   25% of the titles were released in 1925-2013.

***Insight*** --> Netflix should keep adding the latest movies and TV shows to attract more customers.

In [None]:
df.describe(include = object) #to check statistical summary of categorical type data

**Conclusion :-**

*   "show_id" and "title" are unique for every row.
*   The "type" and "rating" columns need to be converted to categorical data.
*   "United States" is the country with the most content available.







# ***Missing value detection***

In [None]:
df.isnull().sum() #checking count of null values per column.


*  The director, cast and country columns have many more missing values than the other columns.




In [None]:
for col in df:
  null_count = df[col].isnull().sum() / len(df) *100
  print(col , "-->" ,null_count)

As we can see, about 30% of the values in the director column are missing, so we cannot drop this much data. We will fill the missing values in the director, cast and country columns with "Unknown".
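The per-column percentage computed in the loop above can also be obtained in one vectorised step. The sketch below uses a small made-up frame (`toy` and its values are hypothetical, not the Netflix data) to show the idea.

```python
import pandas as pd

# Hypothetical toy frame standing in for the Netflix data
toy = pd.DataFrame({
    "director": ["A", None, None, "B"],
    "cast": ["X", "Y", None, "Z"],
    "title": ["t1", "t2", "t3", "t4"],
})

# isnull() yields booleans; the column mean of booleans is the missing fraction
missing_pct = toy.isnull().mean().mul(100).round(2)

# Show the columns with the largest share of missing values first
print(missing_pct.sort_values(ascending=False))
```

Sorting the result makes it easy to see which columns are candidates for filling rather than dropping.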

In [None]:
df[["director","cast","country"]] = df[["director","cast","country"]].fillna("Unknown")  # fill the missing values

In [None]:
df.isnull().sum()

We will drop the rows where date_added is missing when we do the analysis related to date_added.

In [None]:
df["rating"].value_counts() #checking unique values in rating columns.

As we can clearly see, the last three values of rating ("74 min", "84 min" and "66 min") belong in the duration column.

***Shifting of data to the right columns***

In [None]:
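# Rows where a duration such as "74 min" was entered in the rating column
misplaced = df["rating"].isin(["74 min", "84 min", "66 min"])
df.loc[misplaced, "duration"] = df.loc[misplaced, "rating"]  # move the value to the right column
df.loc[misplaced, "rating"] = np.nan  # mark the rating as missing instead of storing a "Nan" string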
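As a minimal sketch of this kind of repair, the example below moves duration-like strings out of a rating column using a boolean mask and `.loc`, which avoids chained assignment. The frame `toy` and the regular expression are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: two rows have a duration typed into "rating"
toy = pd.DataFrame({
    "rating": ["TV-MA", "74 min", "PG", "84 min"],
    "duration": ["2 Seasons", None, "90 min", None],
})

# Flag ratings that look like "<number> min" (assumed pattern)
misplaced = toy["rating"].str.contains(r"^\d+ min$", na=False)

# Copy the value to "duration", then blank out the rating
toy.loc[misplaced, "duration"] = toy.loc[misplaced, "rating"]
toy.loc[misplaced, "rating"] = np.nan

print(toy)
```

Selecting by pattern rather than by hard-coded row numbers keeps the fix valid if the row order of the file changes.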

In [None]:
df["rating"].value_counts() #checking the count of each category. 

In [None]:
# Convert "date_added" to datetime and "type" / "rating" to category
df["date_added"] = pd.to_datetime(df["date_added"].str.strip())  # strip stray whitespace before parsing
df = df.astype({"type" : "category", "rating" : "category"})
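The date conversion and the year, month and weekday columns used later can be sketched on a few sample strings. This assumes dates written like "September 25, 2021", sometimes with stray leading spaces; the series `raw` is made up for illustration.

```python
import pandas as pd

# Hypothetical sample of "date_added" values, including a leading space and a gap
raw = pd.Series(["September 25, 2021", " August 4, 2017", None], name="date_added")

# Strip whitespace, parse with an explicit (assumed) format, turn bad values into NaT
parsed = pd.to_datetime(raw.str.strip(), format="%B %d, %Y", errors="coerce")

# Derive the calendar parts used for the month-wise and day-wise plots
parts = pd.DataFrame({
    "year": parsed.dt.year,
    "month": parsed.dt.month_name(),
    "day": parsed.dt.day_name(),
})
print(parts)
```

Rows that fail to parse become `NaT`, so they can be dropped only for the date-related analysis.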

# ***Univariate Analysis***

In [None]:
df_datetime = df.copy()
df_datetime['Year'] = df.date_added.dt.year  #adding new columns to the dataframe --> year , month , weekday
df_datetime['month'] = df.date_added.dt.month
df_datetime['day'] = df.date_added.dt.day_name()

In [None]:
sns.countplot(x = "type" , data = df_datetime) #countplot to count the no of movies and tv shows available.
plt.title("No of movies and TV series")
plt.show()

Movies far outnumber TV shows.

In [None]:
plt.figure(figsize=(20,8))
# "duration" is a string such as "90 min": keep the movie rows, extract the number and convert it to int
duration_df = df.loc[df["duration"].str.contains("min", na=False), "duration"].apply(lambda x: x.split()[0]).astype(int)
plt.subplot(1,2,1)  # subplots placed side by side for easy comparison
sns.boxplot(x=duration_df , color = "maroon")
plt.title("Distribution of duration of movies")
# same extraction for TV shows, whose duration is a number of seasons
duration_season_df = df.loc[df["duration"].str.contains("Season", na=False), "duration"].apply(lambda x: x.split()[0]).astype(int)
plt.subplot(1,2,2)
sns.boxplot(x=duration_season_df , color = "maroon")
plt.title("Distribution of no of seasons in TV show")
plt.show()

**Conclusion** - 
*  The typical movie duration is around 100 min.
*  Most TV shows have only 1 or 2 seasons.
*  Movie durations show many more outliers than TV show season counts.
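The split between minutes and seasons can also be done in one pass with a regular expression. This is a sketch on a made-up series (`duration` below is hypothetical), assuming values look like "90 min" or "2 Seasons".

```python
import pandas as pd

# Hypothetical sample of the "duration" column
duration = pd.Series(["90 min", "2 Seasons", "1 Season", None, "125 min"])

# Capture the number and the unit; "Season" also matches "Seasons"
parts = duration.str.extract(r"(?P<value>\d+)\s+(?P<unit>min|Season)")
parts["value"] = pd.to_numeric(parts["value"])

# Separate numeric series for movies (minutes) and TV shows (seasons)
movie_minutes = parts.loc[parts["unit"] == "min", "value"].astype(int)
tv_seasons = parts.loc[parts["unit"] == "Season", "value"].astype(int)

print(movie_minutes.median(), tv_seasons.median())
```

Printing the medians gives a numeric check on what the boxplots show visually.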





In [None]:
# Ten most common season counts among TV shows
df_TV_season = df.loc[df["duration"].str.contains("Season", na=False), "duration"].value_counts().head(10)
df_TV_season = df_TV_season.rename_axis("No_of_seasons").reset_index(name = "Count")  # name both columns explicitly
plt.figure(figsize=(20,8))
sns.barplot(y = "No_of_seasons" , x = "Count" , data = df_TV_season)
plt.title("Count of TV shows with their no of season")
plt.show()

Most TV shows have only one season.
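The column names produced by `value_counts().reset_index()` differ between pandas versions, so renaming via `{"index": ...}` can silently fail. A small sketch of a version-independent way to build the count table, on a made-up series:

```python
import pandas as pd

# Hypothetical sample of the "duration" column
duration = pd.Series(
    ["1 Season", "2 Seasons", "1 Season", "90 min", "1 Season"],
    name="duration",
)

season_counts = (
    duration[duration.str.contains("Season", na=False)]
    .value_counts()                  # counts, sorted in descending order
    .rename_axis("No_of_seasons")    # name of the label column
    .reset_index(name="Count")       # name of the count column
)
print(season_counts)
```

Naming both columns explicitly means the later `sns.barplot(y="No_of_seasons", x="Count", ...)` call does not depend on pandas defaults.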

# ***Bivariate Analysis***

In [None]:
df_datetime = df.copy()  # work on a copy so the original dataframe is left untouched
df_datetime['Year'] = df.date_added.dt.year
df_datetime['month'] = df.date_added.dt.month
df_datetime['day'] = df.date_added.dt.day_name()
df_datetime_month = df_datetime.sort_values(by ="month")  # calendar order for the month-wise plot
df_datetime_month['month_name'] = df_datetime_month.date_added.dt.month_name()

**Analysis of the amount of content added on Netflix over time**

In [None]:
plt.figure(figsize=(20,8))  # set the figure size for the plot
sns.countplot(x = "month_name" , data = df_datetime_month , hue = "type")
plt.title("No of movies and TV series added monthwise") #title name of the plot
plt.legend(loc=(1.01,0.5))
plt.show()

***Conclusion :-***
*  July and December are the months in which the most content was added; the number of TV shows added peaks in these two months.
*  The number of movies added per month is greater than the number of TV shows added per month.



In [None]:
plt.figure(figsize=(20,8))
df_year = df.loc[df['release_year']>2000]  # boolean mask: keep movies and TV shows released after 2000
sns.countplot(x='release_year', data = df_year, hue='type')
plt.title("No of movies and TV series by release year")
plt.show()

***Conclusion :-***


*   By release year, 2020 has the highest number of TV shows, followed by 2019 and 2021.
*   Movie counts are higher for release years after 2015.
*   The count of movies released in 2021 drops significantly, maybe due to the COVID pandemic.





In [None]:
plt.figure(figsize=(15,8))
sns.countplot(x = "day" , data = df_datetime , hue = "type" ,  order=["Monday" , "Tuesday" , "Wednesday", "Thursday", "Friday", "Saturday" ,"Sunday"])
plt.title("No of movies and TV series added daywise")
plt.show()

***Conclusion :-***  Most content is added to Netflix on Friday, followed by Thursday, possibly because the weekend follows these days.





In [None]:
print('PG-13 -----> Parental Guidance with Adult Themes[Parental Guidance]',
'TV-MA -----> Mature Audience[Only for Adults]',
'PG -----> Parental Guidance without Adult Themes[Parental Guidance]',
'TV-14 -----> Contents with Parents strongly cautioned.',
'TV-PG -----> Parental guide suggested[Parental Guidance]',
'TV-Y -----> Children suited content[General Audience & Kids]',
'TV-Y7 -----> Children of age 7 and older[General Audience & Kids]',
'R -----> Strictly for Adults[Only for Adults]',
'TV-G -----> Suitable for all audiences[General Audience & Kids]',
'G -----> General Audience films[General Audience & Kids]',
'NC-17 -----> No one seventeen and under admitted[Only for Adults]',
'NR -----> Not rated movies[Not Rated]',
'TV-Y7-FV -----> Children of age 7 and older with fantasy violence[General Audience & Kids]',
'UR -----> recut version of rated movie[Not Rated]', sep = '\n')

df_rating = df[df["rating"].notna()].reset_index(drop = True)  # keep only the rows that have a rating
plt.figure(figsize=(20,8))
sns.countplot(x ="rating" , data = df_rating , hue = "type")
plt.show()

# Conclusion :-
*  Most TV shows and movies are rated TV-MA or TV-14.
*  Most content available on Netflix is aimed at adults and teenagers.



In [None]:
# Count titles per release year and type (Movie / TV Show)
df_yearwise_trend = df.groupby(["release_year", df["type"].astype(str)]).size().reset_index(name = "count")
# Wide table: one row per release year, one column per type
df_content_count = df_yearwise_trend.pivot(index = "release_year",
                        columns = "type",
                        values = "count")
df_content_count.reset_index(inplace = True)
plt.figure(figsize=(14,6))
sns.lineplot(x = "release_year" , y = "count" , data = df_yearwise_trend , hue = "type")
plt.xticks(np.arange(1940,2025,10))
plt.title("Distribution of content by release year")
plt.show()

# **Conclusion :-**



*   By release year, 2020 has the highest number of TV shows, followed by 2019 and 2021.
*   Movie counts are higher for release years after 2015.
*   The count of movies released in 2021 drops significantly, maybe due to the COVID pandemic.










In [None]:
plt.figure(figsize=(14,6))
movies_ratingwise = df.loc[df["type"] == "Movie" , ["type" , "rating"]]
sns.countplot( y="rating" , data =movies_ratingwise,  palette="Blues_d" )
plt.title("Movies distribution rating wise")
plt.show()

**Conclusion** : Most movies are rated TV-MA or TV-14.

In [None]:
plt.figure(figsize=(14,6))
tv_ratingwise = df.loc[df["type"] == "TV Show" , ["type" , "rating"]]
sns.countplot( y="rating" , data =tv_ratingwise,  palette="Blues_d" )
plt.title("TV Shows distribution rating wise")
plt.show()

**Conclusion** :- Most TV shows are rated TV-MA or TV-14.

In [None]:
director = df["director"].apply(lambda x : str(x).split(", ")).tolist()  # split the comma-separated directors of each title into a list
df_director = pd.DataFrame(director, index = df["title"])
df_director= df_director.stack()
df_director = df_director.reset_index()
df_director.drop(columns ="level_1" , inplace = True)  # drop the helper column created by stack()
df_director.columns = ["title" , "director"]  # rename the columns
df_fav_director = df.merge(df_director , on = "title" )  # merge the unnested directors back onto the original dataframe
df_fav_director.head(4)

In [None]:
#exploding country column
country = df["country"].apply(lambda x: str(x).split(", ")).tolist() #exploding the country column
df_country = pd.DataFrame(country, index = df["title"])
df_country = df_country.stack()
df_country = df_country.reset_index()
df_country.drop(columns = "level_1" , inplace = True)
df_country.columns = ["title" , "country"]
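The stack-based unnesting used for the director and country columns can also be written with pandas' `str.split` and `explode`. The following is a minimal sketch on a made-up frame (`toy` is hypothetical); it is an alternative formulation, not the method used in this notebook.

```python
import pandas as pd

# Hypothetical toy frame with a comma-separated "country" column
toy = pd.DataFrame({
    "title": ["t1", "t2", "t3"],
    "country": ["United States, India", "Japan", "Unknown"],
})

# Split each cell into a list, then give every list element its own row
exploded = (
    toy.assign(country=toy["country"].str.split(", "))
    .explode("country")
    .reset_index(drop=True)
)

# Count titles per country, ignoring the "Unknown" placeholder
counts = exploded.loc[exploded["country"] != "Unknown", "country"].value_counts()
print(counts)
```

Because the title column is carried along, no merge back onto the original frame is needed for a simple count.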

In [None]:
Country_wise_trend = df.merge(df_country , on = "title")  # new dataframe made by merging df_country with the original dataframe
Country_wise_trend.drop(columns = "country_x" , inplace = True)
Country_wise_trend.rename(columns = {"country_y" : "country"}, inplace = True)
Country_wise_trend = Country_wise_trend.loc[Country_wise_trend["country"] != "Unknown"]
# Ten countries with the most titles; both columns are named explicitly so the result does not depend on the pandas version
top10_country = Country_wise_trend["country"].value_counts().head(10).rename_axis("country").reset_index(name = "count")
Country_wise_trend = Country_wise_trend.merge(top10_country, how = "inner" , on = "country")
plt.figure(figsize = (15,8))
sns.countplot(x ="country" , data =Country_wise_trend , hue = "type" )
plt.title("Count of movies and TV shows countrywise")
plt.show()

# **Conclusion :-** 


*   Netflix should target adding more movies than TV series for the United States and India.
*   Netflix should target adding more TV shows for Japan and South Korea.





In [None]:
#exploding listed_in column
listed_in = df["listed_in"].apply(lambda x: str(x).split(", ")).tolist()
df_genre = pd.DataFrame(listed_in, index = df["title"])
df_genre = df_genre.stack()
df_genre = df_genre.reset_index()
df_genre.drop(columns = "level_1" , inplace = True)
df_genre.columns = ["title" , "genre"]
df_genre.head()

In [None]:
plt.figure(figsize = (18,10))
sns.countplot(y = "genre" , data =df_genre )
plt.title("Distribution of content genre-wise")
plt.show()

The most frequent genres among Netflix movies and TV shows are:-
*   International Movies
*   Dramas
*   Comedies
*   International TV Shows





# ***Non-Graphical Analysis***

In [None]:
director_countrywise= df_fav_director.merge(df_country , on = "title")
director_countrywise= director_countrywise.drop(columns = ["director_x" , "country_x" ])
director_countrywise.rename(columns = {"director_y": "director" , "country_y" : "country"}, inplace = True)
director_countrywise = director_countrywise.loc[director_countrywise["director"] != "Unknown"]
director_countrywise.reset_index(inplace= True)
director_countrywise.head()

In [None]:
country = director_countrywise['country'].value_counts()[:6].index.tolist()
print(' Top 2 Directors of Top 5 Countries')
print('\n')
for val in country:
  if val != 'Unknown':
    print(f'**{val}**')
    print(director_countrywise.loc[director_countrywise['country']==val, 'director'].value_counts()[:2])
    print('\n')

Conclusion :
*   Anurag Kashyap and David Dhawan are the directors with the most titles for India.
*   Jay Karas and Marcus Raboy are the directors with the most titles for the United States.
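The "top 2 per country" listing printed by the loop can also be collected into a single table. A sketch on made-up data follows; the frame `toy` and the column name `titles` are hypothetical.

```python
import pandas as pd

# Hypothetical unnested (country, director) pairs, one row per title
toy = pd.DataFrame({
    "country": ["India", "India", "India", "India", "Japan", "Japan", "Japan"],
    "director": ["A", "A", "B", "C", "D", "D", "E"],
})

# Number of titles per (country, director) pair
counts = toy.groupby(["country", "director"]).size().reset_index(name="titles")

# Sort within each country by count (ties broken alphabetically), keep the first two
top2 = (
    counts.sort_values(["country", "titles", "director"], ascending=[True, False, True])
    .groupby("country")
    .head(2)
    .reset_index(drop=True)
)
print(top2)
```

A table like this is easier to filter or plot than printed output, and the explicit tie-breaking makes the result reproducible.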



In [None]:
director_countrywise["director"].value_counts().head(3)

Conclusion : "Rajiv Chilaka" is the director with the most titles overall, followed by Jan Suter.

In [None]:
#exploding cast column
cast = df["cast"].apply(lambda x : str(x).split(", ")).tolist()
df_cast = pd.DataFrame(cast,  index = df["title"])
df_cast = df_cast.stack()
df_cast = df_cast.reset_index()
df_cast.drop(columns = "level_1" , inplace = True)
df_cast.columns = ["title" , "cast"]
df_fav_cast = df.merge(df_cast , on = "title" )

In [None]:
cast_countrywise= df_fav_cast.merge(df_country , on = "title")
cast_countrywise= cast_countrywise.drop(columns = ["cast_x" , "country_x"])
cast_countrywise = cast_countrywise.rename(columns = {"cast_y" : "cast" , "country_y" : "country"})
cast_countrywise = cast_countrywise.loc[cast_countrywise["cast"] != "Unknown"].reset_index()  # drop the rows whose cast is unknown, then reset the index
cast_countrywise.head()

In [None]:
country_actor = cast_countrywise['country'].value_counts()[:6].index.tolist()
print(' Top 2 Actors of Top 5 Countries')
print('\n')
for val in country_actor:
  if val != 'Unknown':
    print(f'--{val}--')
    print(cast_countrywise.loc[cast_countrywise['country']==val, 'cast'].value_counts()[:2])
    print('\n')

**Conclusion :-**

*   These are the two most frequent cast members for each of these countries.
*   For India, the most frequent cast members are Anupam Kher and Shah Rukh Khan.



In [None]:
cast_countrywise["cast"].value_counts().head(5) #value_counts of the cast columns to get the most famous actors

These are the five actors who appear most often, and the most frequent actor is from India.

# ***Heatmap***

In [None]:
df_trend_country = df.merge(df_country , on = "title")
df_trend_country.drop(columns = "country_x" , inplace = True)
df_trend_country.rename(columns = {"country_y":"country"}, inplace = True)

In [None]:
# Eleven most frequent countries (11 so that ten remain once "Unknown" is dropped below)
country_list = df_trend_country['country'].value_counts()[:11].index.tolist()
df_top10country = df_trend_country.loc[df_trend_country['country'].isin(country_list)]
df_top10country = df_top10country.loc[df_top10country["country"]!="Unknown"] #dropping of rows whose value is unknown.

In [None]:
# Country x rating table of title counts (0 where a combination does not occur)
heat_rating = pd.crosstab(df_top10country["country"], df_top10country["rating"])
plt.figure(figsize = (12,8))
sns.heatmap(heat_rating, annot = True,  cmap="Blues", fmt = "d")
plt.title("Distribution of content available in different countries rating wise")
plt.show()


**Conclusion :-**

* In the top 10 countries, most content belongs to TV-MA (adults category).
* India and the United States have a large amount of content in the TV-14 category.
* The United Kingdom and the United States have a large amount of content in the R category.
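A country-by-rating count table of the kind shown in the heatmap can be built directly with `pd.crosstab`. The sketch below uses a made-up frame (`toy` is hypothetical) to show the shape of the result.

```python
import pandas as pd

# Hypothetical (country, rating) pairs, one row per title
toy = pd.DataFrame({
    "country": ["India", "India", "Japan", "Japan", "Japan"],
    "rating": ["TV-14", "TV-MA", "TV-MA", "TV-MA", "TV-14"],
})

# Rows are countries, columns are ratings, cells are integer title counts
table = pd.crosstab(toy["country"], toy["rating"])
print(table)
```

Because the cells are integers with zeros for absent combinations, a table like this can be passed to `sns.heatmap(table, annot=True, fmt="d")` without missing-value problems.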

In [None]:
genre_country_df= df_trend_country.merge(df_genre , on= "title")
genre_country_df.head(5)

In [None]:
genre_list = genre_country_df['genre'].value_counts()[:10].index.tolist()  # ten most frequent genres
df_top10_genre = genre_country_df.loc[genre_country_df['genre'].isin(genre_list)]
df_top10_genre.head()

In [None]:
df_top10_genre = df_top10_genre.loc[df_top10_genre["country"] != "Unknown"]
df_top10_genre["country"].value_counts()[:10]

country_list = df_top10_genre["country"].value_counts()[:10].index.tolist()  # ten countries with the most titles
df_top10_genre_countrywise = df_top10_genre.loc[df_top10_genre['country'].isin(country_list)]
df_top10_genre_countrywise.head()

# Genre x country table of title counts (0 where a combination does not occur)
heat_genre_final = pd.crosstab(df_top10_genre_countrywise["genre"], df_top10_genre_countrywise["country"])
plt.figure(figsize = (12,8))
sns.heatmap(heat_genre_final , annot = True,  cmap="Blues", fmt = "d")
plt.title("Top 10 genres in 10 different countries")
plt.show()

**Conclusion** :-
*  For India, Netflix should add more content in the genres International Movies, Comedies and Dramas.
* For the United States, Netflix should add more content in the genres Dramas and Comedies.
* For Canada, Netflix should add more content in the genres Dramas and Children & Family Movies.


# **Summary :-**


*   Netflix has added more movies than TV shows.
*   The United States has the most content on Netflix compared to other countries.
*   Netflix content is mostly aimed at adults and teenagers (TV-MA and TV-14).
*   The most frequent genres are International Movies, Dramas, Comedies, International TV Shows and Action & Adventure.
*   In 2021, there is a significant drop in content, possibly due to the COVID pandemic.
*   Most content comes from the United States, followed by India and the United Kingdom.

**Movies:-**
* The United States, India and the United Kingdom contribute more movies than other countries.
* Almost the same number of movies is added to Netflix every month.
* Most movies are around 100 min long.
* The most frequent cast members in movies are from India.
* "Rajiv Chilaka" is the director with the most titles overall.

**TV Shows :-**
* Most TV shows have only 1 or 2 seasons.
* For Japan and South Korea, Netflix should focus more on TV shows than on movies.


**Recommendations** : 

*Movies* :- 
* The preferred movie duration is between 90 and 100 minutes.
* Netflix should add more movies for the United States and India in the International Movies and Comedies genres.
* Netflix should add more movies for the United States and India with a rating of TV-MA or TV-14.
* The top three countries for movies added are the United States, India and the United Kingdom.
* Netflix should add new movies on Friday rather than on any other weekday.

*TV Show*:- 
* The preferred TV show length is 1-2 seasons.
* Netflix should focus on TV shows in countries like Japan, South Korea and France, as TV shows outnumber movies there.
* Netflix should add TV shows on Friday rather than on any other weekday.
* As per the 2021 data, the count of TV shows is higher than that of movies; this may mean people want more web series for their leisure time, possibly due to the work-from-home scenario.
