# Business Case: Netflix - Data Exploration and Visualisation

#### Problem Statement:

Analyze the data and generate insights that could help Netflix decide which types of shows/movies to produce and how to grow the business in different countries.

In [None]:
import numpy as np
import pandas as pd
import os
import seaborn as sns
import matplotlib.pyplot as plt

print(os.listdir("/kaggle/input/netflix"))

In [None]:
path = "/kaggle/input/netflix/netflix.csv"

df = pd.read_csv(path)
df.head()

In [None]:
df.tail()

In [None]:
# total rows and columns
print(f"Count of rows: {df.shape[0]}  columns: {df.shape[1]}")

In [None]:
df.info()

The following columns have missing values:
1. director
2. cast
3. country
4. duration
5. date_added
6. rating
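
The list above was read off the `df.info()` output. As a minimal sketch, the same list can be derived programmatically. The frame `toy_df` below is a hypothetical stand-in with made-up values, not the Netflix data; on the real notebook the same two lines would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Netflix data (values are made up)
toy_df = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "director": ["A", np.nan, "B"],
    "rating": ["PG", "TV-MA", np.nan],
    "title": ["x", "y", "z"],
})

# Count missing values per column, then keep only the columns with at least one
missing_counts = toy_df.isnull().sum()
cols_with_missing = missing_counts[missing_counts > 0].index.tolist()
print(cols_with_missing)
```

Deriving the list this way keeps it in sync with the data if the file is updated.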

In [None]:
# % of missing values in each column - descending order
df.isnull().sum().sort_values(ascending=False) / len(df) * 100

In [None]:
# plotting column wise missing values
plt.figure(figsize=(10, 6))
plt.title("Column wise missing values")
sns.heatmap(df.isnull())
plt.axis()
plt.show()

In [None]:
df.describe(include="all")

In [None]:
# looking at the data where director has missing value
df[df['director'].isnull()]

In [None]:
# drop description column
df = df.drop(columns=['description'])

In [None]:
# rename listed_in to genre
df = df.rename(columns={"listed_in": "genre"})

#### Unnesting the columns below
- director
- cast
- country
- genre
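
Each of these columns stores several comma-separated values in one cell, so "unnesting" means producing one row per (`show_id`, value) pair. A minimal sketch of the idea, assuming pandas' `str.split` and `DataFrame.explode` (an alternative to the stack-based approach used in the cells that follow, not a replacement for it). The helper `unnest` and the frame `toy_df` are hypothetical names introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (made-up values) with a comma-separated column
toy_df = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "country": ["United States, India", np.nan, "Japan"],
})

def unnest(frame, column):
    """Return one row per (show_id, value) pair for a comma-separated column."""
    out = frame[["show_id", column]].copy()
    out[column] = out[column].str.split(",")   # missing values stay missing
    out = out.explode(column)                  # one row per list element
    out[column] = out[column].str.strip()      # remove spaces left by the split
    return out.reset_index(drop=True)

country_long = unnest(toy_df, "country")
print(country_long)
```

Unlike `str(x).split(',')`, this keeps missing values as real NaN instead of turning them into the string `"nan"`, so no later replacement step is needed.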

In [None]:
director_df = df[['show_id', 'director']]
director_df = pd.DataFrame(director_df['director'].apply(lambda x: str(x).split(',')).tolist(), index=df['show_id']).stack().reset_index()
director_df = director_df.drop(columns=['level_1'])
director_df.columns = ['show_id', 'director']
director_df.head(3)

In [None]:
cast_df = df[['show_id', 'cast']]
cast_df = pd.DataFrame(cast_df['cast'].apply(lambda x: str(x).split(',')).tolist(), index=df['show_id']).stack().reset_index()
cast_df = cast_df.drop(columns=['level_1'])
cast_df.columns = ['show_id', 'cast']
cast_df.head(3)

In [None]:
country_df = df[['show_id', 'country']]
country_df = pd.DataFrame(country_df['country'].apply(lambda x: str(x).split(',')).tolist(), index=df['show_id']).stack().reset_index()
country_df = country_df.drop(columns=['level_1'])
country_df.columns = ['show_id', 'country']
country_df.head(3)

In [None]:
genre_df = df[['show_id', 'genre']]
genre_df = pd.DataFrame(genre_df['genre'].apply(lambda x: str(x).split(',')).tolist(), index=df['show_id']).stack().reset_index()
genre_df = genre_df.drop(columns=['level_1'])
genre_df.columns = ['show_id', 'genre']
genre_df.head(3)

Drop the following columns from the df:
- director
- cast
- country
- genre

In [None]:
df = df.drop(columns=['director', 'cast', 'country', 'genre'])
df.head(2)

Merge **df** with the unnested dataframes: **director_df**, **cast_df**, **country_df**, and **genre_df**.
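
Because every unnested frame holds one row per (`show_id`, value) pair, chaining inner merges on `show_id` produces every combination of director, cast member, country and genre for a title. The sketch below illustrates this row multiplication on made-up data; `base`, `directors` and `cast` are hypothetical frames, not the notebook's.

```python
import pandas as pd

# Hypothetical single title with 2 directors and 3 cast members
base = pd.DataFrame({"show_id": ["s1"], "title": ["Example"]})
directors = pd.DataFrame({"show_id": ["s1", "s1"], "director": ["d1", "d2"]})
cast = pd.DataFrame({"show_id": ["s1", "s1", "s1"], "cast": ["a1", "a2", "a3"]})

merged = (
    base.merge(directors, on="show_id", how="inner")
        .merge(cast, on="show_id", how="inner")
)
n_rows = len(merged)                      # 2 directors x 3 cast members = 6 rows
n_titles = merged["show_id"].nunique()    # still a single title
print(n_rows, n_titles)
```

This is why the later cells deduplicate on `show_id` plus the column of interest before counting: counting rows of the merged frame directly would overweight titles with long cast or genre lists.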

In [None]:
merged_df = df.merge(director_df, on="show_id", how="inner")
merged_df = merged_df.merge(cast_df, on="show_id", how="inner")
merged_df = merged_df.merge(country_df, on="show_id", how="inner")
merged_df = merged_df.merge(genre_df, on="show_id", how="inner")

# lowercase & trim country, cast, director, genre values
merged_df["director"] = merged_df["director"].apply(lambda x: str(x).lower().strip())
merged_df["country"] = merged_df["country"].apply(lambda x: str(x).lower().strip())
merged_df["cast"] = merged_df["cast"].apply(lambda x: str(x).lower().strip())
merged_df["genre"] = merged_df["genre"].apply(lambda x: str(x).lower().strip())

# str() turned missing values into the string "nan"; convert them back to np.nan
merged_df = merged_df.replace("nan", np.nan)
merged_df

**Question:** In which countries has Netflix released the most movies/shows?

In [None]:
tmp_df = pd.DataFrame(merged_df[["show_id", "country"]].groupby(["show_id", "country"]).size()).reset_index().drop(columns=[0])
tmp_df

In [None]:
# top 10 countries where most of the movies/shows are released by Netflix
count_df = tmp_df["country"].value_counts()[:10]
count_df

In [None]:
count_df.index

In [None]:
plt.figure(figsize=(10, 7))
sns.barplot(x=count_df.values, y=count_df.index, alpha=0.8)
plt.title("Top 10 countries where most of the movies/shows are released by Netflix", fontsize=12, pad=20)
plt.ylabel("Country", fontsize=12)
plt.xlabel("Movies/Shows count", fontsize=12)
plt.show()

In [None]:
# 10 countries with the fewest movies/shows released by Netflix
tmp_df["country"].value_counts(ascending=True)[:10]

**Question:** Which is the most popular genre all over the world?

In [None]:
tmp_df = pd.DataFrame(merged_df.groupby(["show_id", "country", "genre"]).size()).reset_index().drop(columns=[0])
tmp_df

In [None]:
print(f"Total genres: {len(tmp_df['genre'].unique())}")
tmp_df['genre'].unique()

In [None]:
# top 10 most popular genres all over the world
count_df = tmp_df["genre"].value_counts()[:10]
count_df

In [None]:
count_df.index

In [None]:
plt.figure(figsize=(10, 7))
sns.barplot(x=count_df.values, y=count_df.index, alpha=0.8)
plt.title("Top 10 most popular genres all over the world", fontsize=12, pad=15)
plt.ylabel("Genre", fontsize=12)
plt.show()

In [None]:
top_10_countries = ['united states', 'india', 'united kingdom', 'canada', 'france', 'japan', 'spain', 'south korea', 'germany', 'mexico']

In [None]:
# top 10 popular genres in united states
count_df = tmp_df[tmp_df['country']=='united states']["genre"].value_counts()[:10]

plt.figure(figsize=(10, 7))
sns.barplot(x=count_df.values, y=count_df.index, alpha=0.8)
plt.title("United States - Top 10 most popular genres", fontsize=12, pad=15)
plt.ylabel("Genre", fontsize=12)
plt.show()

In [None]:
# top 10 popular genres in india
count_df = tmp_df[tmp_df['country']=='india']["genre"].value_counts()[:10]

plt.figure(figsize=(10, 7))
sns.barplot(x=count_df.values, y=count_df.index, alpha=0.8)
plt.title("India - Top 10 most popular genres", fontsize=12, pad=15)
plt.ylabel("Genre", fontsize=12)
plt.show()

#### Most popular actors

In [None]:
tmp_df = merged_df.groupby(["show_id","country", "cast"])[["title"]].size().reset_index().drop(columns=[0])
tmp_df

In [None]:
# top 10 most popular actors all over the world
count_df = tmp_df["cast"].value_counts()[:10]
count_df


In [None]:
plt.figure(figsize=(10, 7))
sns.barplot(x=count_df.values, y=count_df.index, alpha=0.8)
plt.title("Top 10 most popular actors all over the world", fontsize=12, pad=15)
plt.ylabel("Actor", fontsize=12)
plt.show()

In [None]:
# country wise popular actors
fam_act_df = tmp_df.groupby(["country", "cast"]).count().sort_values(by="show_id").groupby(level=0).tail(1).reset_index().sort_values(by="show_id", ascending=False)
fam_act_df.set_index(["country"], inplace=True)
fam_act_df

In [None]:
# top_10_countries is the list defined in an earlier cell
fam_act_df.loc[top_10_countries]

#### Genre wise most popular actors
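
Picking the top actor per genre comes down to three steps: deduplicate (`show_id`, `genre`, `cast`) triples, count titles per (`genre`, `cast`) pair, and keep the pair with the highest count in each genre. A minimal sketch on made-up data, assuming `groupby(...).size()` and `idxmax` in place of the sort-then-`tail(1)` pattern used in the cells below; `toy` and the actor names are hypothetical.

```python
import pandas as pd

# Hypothetical rows as they might look after merging (values are made up)
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s3", "s3", "s4"],
    "genre":   ["dramas", "dramas", "dramas", "comedies", "comedies"],
    "cast":    ["actor a", "actor a", "actor b", "actor b", "actor b"],
})

# 1. one row per (show_id, genre, cast) triple
pairs = toy.drop_duplicates(["show_id", "genre", "cast"])

# 2. number of titles per (genre, cast) pair
counts = pairs.groupby(["genre", "cast"]).size().rename("titles").reset_index()

# 3. row with the highest title count within each genre
top = counts.loc[counts.groupby("genre")["titles"].idxmax()].set_index("genre")
print(top)
```

When two actors are tied within a genre, `idxmax` keeps only the first one it encounters, so ties are silently broken; the same caveat applies to `tail(1)`.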

In [None]:
tmp_df = merged_df.groupby(["show_id","genre", "cast"])[["title"]].size().reset_index().drop(columns=[0])
tmp_df

In [None]:
fam_act_df = tmp_df.groupby(["genre", "cast"]).count().sort_values(by="show_id").groupby(level=0).tail(1).reset_index().sort_values(by="show_id", ascending=False)
fam_act_df.set_index(["genre"], inplace=True)

In [None]:
fam_act_df.loc[['dramas', 'international movies', 'comedies', 'international tv shows','action & adventure', 'tv dramas', 'independent movies','children & family movies', 'romantic movies', 'thrillers']]

### Country-wise: which is more popular, Movie or TV Show?
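
One way to read this comparison is as a country-by-type table of title counts, from which the dominant type per country can be picked. The sketch below assumes `pd.crosstab` for the table and uses made-up data; `toy` is a hypothetical frame, not the notebook's `merged_df`.

```python
import pandas as pd

# Hypothetical merged rows (made-up values); s1 appears twice to mimic duplication
toy = pd.DataFrame({
    "show_id": ["s1", "s1", "s2", "s3", "s4", "s5"],
    "country": ["india", "india", "india", "japan", "japan", "india"],
    "type":    ["Movie", "Movie", "Movie", "TV Show", "TV Show", "TV Show"],
})

# Count each title once per country and type
pairs = toy.drop_duplicates(["show_id", "country", "type"])

# Rows: country, columns: type, cells: number of titles
counts = pd.crosstab(pairs["country"], pairs["type"])

# Type with the highest count in each country (ties go to the first column)
dominant = counts.idxmax(axis=1)
print(counts)
print(dominant)
```

Keeping the full table alongside the dominant type also shows how close the two counts are, which the top-1 view alone hides.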

In [None]:
tmp_df = merged_df.groupby(["show_id","country", "type"])[["title"]].size().reset_index().drop(columns=[0])

new_df = tmp_df.groupby(["country", "type"]).count().sort_values(by="show_id").groupby(level=0).tail(1).reset_index().sort_values(by="show_id", ascending=False)
new_df.set_index("country")

# Insights & Recommendations

**Insight - 1:** Top 10 countries where Netflix released the most movies/TV shows
1. United States    
2. India            
3. United Kingdom   
4. Canada           
5. France           
6. Japan            
7. Spain            
8. South Korea      
9. Germany          
10. Mexico      

**Recommendation:** Netflix should focus on creating more movies/TV shows in these countries.

**Insight - 2:** Top 10 most popular genres all over the world
1. international movies        
2. dramas                      
3. comedies                    
4. international tv shows      
5. action & adventure          
6. documentaries              
7. independent movies          
8. thrillers                    
9. tv dramas                  
10. children & family movies          

**Recommendation:** Netflix should focus on creating more movies/TV shows in these genres.

**Insight - 3:** Top 10 most popular actors all over the world

1. Anupam Kher          
2. David Attenborough   
3. John Cleese           
4. Tara Strong           
5. Shah Rukh Khan        
6. Liam Neeson           
7. James Franco          
8. Vincent Tong          
9. Om Puri               
10. Alfred Molina                  

**Recommendation:** Netflix should collaborate with these actors to create more Movies/TV shows

**Insight - 4:** The most popular actor (by number of titles) in each of the countries where the most movies/TV shows are released


| Country      | Famous Actor |
| ----------- | ----------- |
| united states      | tara strong        |
| india   | anupam kher        |
| united kingdom   | david attenborough        |
| canada   | robb wells        |
| france   | benoît magimel        |
| japan   | takahiro sakurai        |
| spain   | mario casas       |
| south korea   | sung dong-il        |
| germany   | daniel brühl        |
| mexico   | cassandra ciangherotti       |

**Recommendation:** Netflix should focus on creating movies/tv shows with these actors in these countries

**Insight - 5:** The most popular actor (by number of titles) in each of the most popular genres all over the world


| Genre      | Famous Actor |
| ----------- | ----------- |
| dramas      | naseeruddin shah        |
| international movies   | anupam kher        |
| comedies   | anupam kher        |
| international tv shows   | takahiro sakurai        |
| action & adventure   | bruce willis        |
| tv dramas   | tay ping hui        |
| independent movies   | naseeruddin shah       |
| children & family movies   | julie tejwani        |
| romantic movies   | akshay kumar        |
| thrillers   | nicolas cage       |

**Recommendation:** Netflix should focus on creating movies/tv shows with these actors in these genres
