<a href="https://colab.research.google.com/github/doruktopcu/GlobalAI-Hub-Python-Bootcamp-2022/blob/main/GlobalAiHubProject2_MoviesAnalysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 0) Libraries & Data 

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
#The dataset was first uploaded to Colab.
#Load it into a pandas DataFrame.
movies_df = pd.read_csv('NetflixOriginals.csv', sep=',', encoding='latin-1')

In [None]:
#movies_df is ready to analyse.
movies_df

Unnamed: 0,Title,Genre,Premiere,Runtime,IMDB Score,Language
0,1922,Horror/Crime drama,20/10/2017,102,6.3,English
1,44764,Drama,10/10/2018,144,6.8,English
2,44788,Comedy-drama,29/03/2019,124,5.8,Marathi
3,#REALITYHIGH,Comedy,09/08/2017,99,5.2,English
4,13th,Documentary,10/07/2016,100,8.2,English
...,...,...,...,...,...,...
579,XOXO,Drama,26/08/2016,92,5.3,English
580,Yeh Ballet,Drama,21/02/2020,117,7.6,Hindi
581,Yes Day,Comedy,03/12/2021,86,5.7,English
582,You've Got This,Romantic comedy,10/02/2020,111,5.8,Spanish


In [None]:
movies_df.isnull().sum()

Title         0
Genre         0
Premiere      0
Runtime       0
IMDB Score    0
Language      0
dtype: int64

No null values in the dataset.

In [None]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 584 entries, 0 to 583
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Title       584 non-null    object 
 1   Genre       584 non-null    object 
 2   Premiere    584 non-null    object 
 3   Runtime     584 non-null    int64  
 4   IMDB Score  584 non-null    float64
 5   Language    584 non-null    object 
dtypes: float64(1), int64(1), object(4)
memory usage: 27.5+ KB


In [None]:
#Converting Premiere column to date type.
movies_df['Premiere'] = pd.to_datetime(movies_df.Premiere)

In [None]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 584 entries, 0 to 583
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   Title       584 non-null    object        
 1   Genre       584 non-null    object        
 2   Premiere    584 non-null    datetime64[ns]
 3   Runtime     584 non-null    int64         
 4   IMDB Score  584 non-null    float64       
 5   Language    584 non-null    object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(3)
memory usage: 27.5+ KB


The Premiere column is now of type datetime64[ns].

In [None]:
movies_df['Year'] = movies_df['Premiere'].dt.year

In [None]:
movies_df.head()

Unnamed: 0,Title,Genre,Premiere,Runtime,IMDB Score,Language,Year
0,1922,Horror/Crime drama,2017-10-20,102,6.3,English,2017
1,44764,Drama,2018-10-10,144,6.8,English,2018
2,44788,Comedy-drama,2019-03-29,124,5.8,Marathi,2019
3,#REALITYHIGH,Comedy,2017-09-08,99,5.2,English,2017
4,13th,Documentary,2016-10-07,100,8.2,English,2016


We now have a Year column, which will make it easier to answer some questions about the dataset.

## 1) Genre Analysis

In [None]:
#Count how many movies belong to each genre in the dataset.
genre_list_df = pd.DataFrame(movies_df.groupby('Genre').Genre.count())

In [None]:
genre_list_df

Unnamed: 0_level_0,Genre
Genre,Unnamed: 1_level_1
Action,7
Action comedy,5
Action thriller,1
Action-adventure,1
Action-thriller,3
...,...
War,2
War drama,2
War-Comedy,1
Western,3


genre_list_df shows how many distinct genres there are (115) and how many movies belong to each of them.

In [None]:
genre_list_df = genre_list_df.rename(columns = {'Genre' : 'Count' })

In [None]:
genre_list_df = genre_list_df.sort_values(by = 'Count', ascending = False)

In [None]:
genre_list_df.head(10)

Unnamed: 0_level_0,Count
Genre,Unnamed: 1_level_1
Documentary,159
Drama,77
Comedy,49
Romantic comedy,39
Thriller,33
Comedy-drama,14
Crime drama,11
Biopic,9
Horror,9
Action,7


These are the 10 genres with the most movies in the dataset. Documentary dominates the list with 159 movies, more than twice as many as Drama (77).

In [None]:
genre_list_df.tail(10)

Unnamed: 0_level_0,Count
Genre,Unnamed: 1_level_1
Comedy horror,1
Drama-Comedy,1
Drama / Short,1
Dance comedy,1
Crime thriller,1
Coming-of-age comedy-drama,1
Comedy/Horror,1
Comedy/Fantasy/Family,1
Comedy mystery,1
Zombie/Heist,1


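This table shows the 10 least common genres, each with a single movie. This raises a question: can these genres be merged into broader ones? Several are very specific, and some look like spelling variants of the same genre (for example, Comedy horror and Comedy/Horror).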
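
One possible first step towards merging genres is to normalise how the labels are written, so that variants differing only in case or separator collapse into one. The sketch below is an assumption about how this could be done, not part of the original analysis; `normalize_genre` is a hypothetical helper name.

```python
import re

import pandas as pd


def normalize_genre(label: str) -> str:
    """Lower-case a genre label and treat '/', '-' and whitespace as one separator.

    Hypothetical helper for illustration only.
    """
    parts = re.split(r"[\s/\-]+", label.strip().lower())
    return " ".join(p for p in parts if p)


# Illustrative labels copied from the genre tables above.
sample = pd.Series(["Comedy horror", "Comedy/Horror", "Drama-Comedy", "Drama / Short"])
print(sample.map(normalize_genre).value_counts())
```

On the notebook's data this could be applied as `movies_df['Genre'].map(normalize_genre).nunique()` to see how many distinct genres remain. Word order is kept as-is, so labels such as Comedy-drama and Drama-Comedy would still count as different genres.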

## 2) Language Analysis

In [None]:
movies_language_df = pd.DataFrame(movies_df.groupby('Language').Language.count())

In [None]:
movies_language_df = movies_language_df.rename(columns = {'Language' : 'Count' })

In [None]:
#We will sort the list in descending order.
movies_language_df = movies_language_df.sort_values(by = 'Count', ascending = False)

In [None]:
#Top 10 most used languages in movies in this dataset.
movies_language_df.head(10)

Unnamed: 0_level_0,Count
Language,Unnamed: 1_level_1
English,401
Hindi,33
Spanish,31
French,20
Italian,14
Portuguese,12
Indonesian,9
Korean,6
Japanese,6
German,5


English clearly dominates the dataset: 401 of the 584 movies (about 69%) are in English.

In [None]:
#We can create a pie chart to see each language's share of the dataset.
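
A minimal sketch of such a pie chart, assuming the `Language` column shown above. The helper names `language_shares` and `plot_language_pie` are hypothetical, and lumping the smaller languages into an "Other" slice is a design choice made here to keep the chart readable.

```python
import matplotlib.pyplot as plt
import pandas as pd


def language_shares(df: pd.DataFrame, top_n: int = 5) -> pd.Series:
    """Count movies per language, keeping the top_n and summing the rest as 'Other'."""
    counts = df["Language"].value_counts()
    top = counts.head(top_n)
    other = counts.iloc[top_n:].sum()
    if other > 0:
        top = pd.concat([top, pd.Series({"Other": other})])
    return top


def plot_language_pie(df: pd.DataFrame, top_n: int = 5) -> None:
    """Draw the language shares as a pie chart with percentage labels."""
    shares = language_shares(df, top_n)
    fig, ax = plt.subplots(figsize=(6, 6))
    ax.pie(shares.values, labels=shares.index, autopct="%1.1f%%", startangle=90)
    ax.set_title("Share of movies by language")
    plt.show()
```

In the notebook this would be called as `plot_language_pie(movies_df)`.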

## 3) Runtime Analysis

In [None]:
#Select the movies with a runtime longer than 120 minutes and save them as a DataFrame.
long_runtime_movies = movies_df[movies_df['Runtime'] > 120]

In [None]:
long_runtime_movies.describe()


Unnamed: 0,Runtime,IMDB Score
count,68.0,68.0
mean,133.382353,6.592647
std,13.393058,0.997507
min,121.0,3.5
25%,124.0,6.075
50%,131.0,6.7
75%,139.0,7.2
max,209.0,8.5


## 4) IMDB Score Analysis

In [None]:
movies_df['IMDB Score'].describe()

count    584.000000
mean       6.271747
std        0.979256
min        2.500000
25%        5.700000
50%        6.350000
75%        7.000000
max        9.000000
Name: IMDB Score, dtype: float64

In [None]:
#Top 10 by IMDB scores.
top_ten_by_imdb = movies_df.sort_values(by = 'IMDB Score', ascending = False).head(10)

In [None]:
top_ten_by_imdb

Unnamed: 0,Title,Genre,Premiere,Runtime,IMDB Score,Language,Year
121,David Attenborough: A Life on Our Planet,Documentary,2020-10-04,83,9.0,English,2020
145,Emicida: AmarElo - It's All For Yesterday,Documentary,2020-12-08,89,8.6,Portuguese,2020
412,Springsteen on Broadway,One-man show,2018-12-16,153,8.5,English,2018
67,Ben Platt: Live from Radio City Music Hall,Concert Film,2020-05-20,85,8.4,English,2020
577,Winter on Fire: Ukraine's Fight for Freedom,Documentary,2015-10-09,91,8.4,English/Ukranian/Russian,2015
427,Taylor Swift: Reputation Stadium Tour,Concert Film,2018-12-31,125,8.4,English,2018
114,Cuba and the Cameraman,Documentary,2017-11-24,114,8.3,English,2017
118,Dancing with the Birds,Documentary,2019-10-23,51,8.3,English,2019
523,The Three Deaths of Marisela Escobedo,Documentary,2020-10-14,109,8.2,Spanish,2020
383,Seaspiracy,Documentary,2021-03-24,89,8.2,English,2021


In [None]:
#IMDB scores of documentaries that premiered between January 2019 and June 2020.
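
One way this cell could be filled in, assuming `Premiere` has already been converted to datetime as done earlier. `documentaries_between` is a hypothetical helper, and treating both end dates as inclusive is an assumption.

```python
import pandas as pd


def documentaries_between(df: pd.DataFrame,
                          start: str = "2019-01-01",
                          end: str = "2020-06-30") -> pd.DataFrame:
    """Return documentaries whose Premiere lies in [start, end], both ends inclusive."""
    in_range = df["Premiere"].between(pd.Timestamp(start), pd.Timestamp(end))
    is_documentary = df["Genre"] == "Documentary"
    columns = ["Title", "Premiere", "IMDB Score"]
    return df.loc[is_documentary & in_range, columns].sort_values("Premiere")
```

In the notebook, `documentaries_between(movies_df)['IMDB Score'].describe()` would then summarise the scores for that period.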


In [None]:
#Top 10 genres ranked by the IMDB score of their highest-rated movie.
top_movies_by_genre = pd.DataFrame(movies_df.groupby(['Genre'])['IMDB Score'].max().nlargest(10))

In [None]:
top_movies_by_genre

Unnamed: 0_level_0,IMDB Score
Genre,Unnamed: 1_level_1
Documentary,9.0
One-man show,8.5
Concert Film,8.4
Animation/Christmas/Comedy/Adventure,8.2
Drama,7.9
Animation / Short,7.8
Crime drama,7.8
Making-of,7.7
Musical / Short,7.7
War drama,7.7


In [None]:
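#Movies in English that have the highest IMDB scores.
#Note: DataFrame.filter(like=...) matches column labels, not cell values, so we use a boolean mask instead.
english_movies = movies_df[movies_df['Language'] == 'English']
top_english_movies = english_movies.sort_values(by = 'IMDB Score', ascending = False).head(10)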


## 5) Correlation Between Runtime and IMDB Score

In [None]:
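#Correlation between runtime and IMDB score.
#The numeric columns are selected explicitly, because corr() raises an error on text columns in pandas 2.0 and later.
movies_df[['Runtime', 'IMDB Score']].corr()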

Unnamed: 0,Runtime,IMDB Score
Runtime,1.0,-0.040896
IMDB Score,-0.040896,1.0


## 6) Year analysis

In [None]:
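#Count the movies per year. Naming the column here gives the same result across pandas versions,
#since the default column name produced by value_counts() changed in pandas 2.0.
year_movies_df = movies_df['Year'].value_counts().to_frame(name = 'Count')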

In [None]:
year_movies_df

Unnamed: 0,Count
2020,183
2019,125
2018,99
2021,71
2017,66
2016,30
2015,9
2014,1


## TASKS

In [None]:
#Avg. runtime of Hindi movies.
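
A possible solution sketch for this task, assuming the `Language` and `Runtime` columns shown earlier. `mean_runtime_by_language` is a hypothetical helper name.

```python
import pandas as pd


def mean_runtime_by_language(df: pd.DataFrame, language: str) -> float:
    """Average runtime in minutes of the movies whose Language equals `language` exactly."""
    runtimes = df.loc[df["Language"] == language, "Runtime"]
    return float(runtimes.mean())
```

In the notebook this would be called as `mean_runtime_by_language(movies_df, 'Hindi')`. The exact match means that multi-language entries such as English/Hindi, if present, are not counted.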

In [None]:
#How many categories does Genre have?
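
A short sketch for this task, assuming each distinct string in the `Genre` column counts as one category. `count_genres` is a hypothetical helper name.

```python
import pandas as pd


def count_genres(df: pd.DataFrame) -> int:
    """Number of distinct values in the Genre column."""
    return int(df["Genre"].nunique())
```

Calling `count_genres(movies_df)` should agree with the number of rows in `genre_list_df` from the genre analysis.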

In [None]:
#Which year had the most movie releases?
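
A sketch for this task, assuming the `Year` column created earlier from `Premiere`. `busiest_year` is a hypothetical helper name.

```python
import pandas as pd


def busiest_year(df: pd.DataFrame) -> tuple:
    """Return (year, number_of_movies) for the year with the most releases."""
    counts = df["Year"].value_counts()
    year = counts.idxmax()
    return int(year), int(counts.loc[year])
```

In the notebook, `busiest_year(movies_df)` can be checked against the first row of `year_movies_df` in the year analysis.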

In [None]:
#Which languages have the lowest IMDB scores?
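
The task can be read in more than one way. The sketch below assumes that "lowest" refers to the mean IMDB score per language; `lowest_scoring_languages` is a hypothetical helper name.

```python
import pandas as pd


def lowest_scoring_languages(df: pd.DataFrame, n: int = 5) -> pd.Series:
    """Mean IMDB score per language, returning the n languages with the lowest means."""
    mean_scores = df.groupby("Language")["IMDB Score"].mean()
    return mean_scores.nsmallest(n)
```

Languages represented by only one or two movies can dominate this ranking, so it may be worth looking at the group sizes (`df.groupby('Language').size()`) alongside the means.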

In [None]:
#Which year has the highest total runtime?
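
A sketch for this task, assuming the `Year` and `Runtime` columns from earlier cells. `year_with_most_runtime` is a hypothetical helper name.

```python
import pandas as pd


def year_with_most_runtime(df: pd.DataFrame) -> tuple:
    """Return (year, total_minutes) for the year whose movies have the largest summed runtime."""
    totals = df.groupby("Year")["Runtime"].sum()
    year = totals.idxmax()
    return int(year), int(totals.loc[year])
```

In the notebook this would be called as `year_with_most_runtime(movies_df)`.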

In [None]:
#Are there outliers in the data set?
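
There is no single definition of an outlier. The sketch below uses one common convention, Tukey's rule, which flags values more than 1.5 times the interquartile range (IQR) beyond the quartiles. Both the choice of rule and the helper name `iqr_outliers` are assumptions made here.

```python
import pandas as pd


def iqr_outliers(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Rows whose value in `column` lies more than k * IQR below Q1 or above Q3."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[(df[column] < lower) | (df[column] > upper)]
```

In the notebook, `iqr_outliers(movies_df, 'Runtime')` and `iqr_outliers(movies_df, 'IMDB Score')` would list the candidate outliers for each numeric column. A box plot such as `sns.boxplot(x=movies_df['Runtime'])` shows the same rule visually.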

In [None]:
#Which language is used most in each genre?
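
A sketch for this task, assuming it asks for the single most frequent language within each genre. `top_language_by_genre` is a hypothetical helper name, and no particular order is guaranteed when two languages are tied within a genre.

```python
import pandas as pd


def top_language_by_genre(df: pd.DataFrame) -> pd.Series:
    """For each genre, return the language that appears most often."""
    return df.groupby("Genre")["Language"].agg(lambda s: s.value_counts().idxmax())
```

In the notebook this would be called as `top_language_by_genre(movies_df)`. For the full breakdown instead of just the winner, `pd.crosstab(movies_df['Genre'], movies_df['Language'])` gives the count of every genre and language pair.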