# Importing the Required Libraries

In [175]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
import plotly.subplots as sp
import plotly.figure_factory as ff
from itertools import cycle
import warnings
# Suppress all warnings to keep the notebook output clean (this also hides deprecation notices)
warnings.filterwarnings('ignore')

# Reading the "Credits" Dataset

In [176]:
credits = pd.read_csv('Dataset/credits.csv')  # forward slash avoids the invalid '\c' escape and works on every OS
credits.head()

| | person_id | id | name | character | role |
|---|---|---|---|---|---|
| 0 | 59401 | ts20945 | Joe Besser | Joe | ACTOR |
| 1 | 31460 | ts20945 | Moe Howard | Moe | ACTOR |
| 2 | 31461 | ts20945 | Larry Fine | Larry | ACTOR |
| 3 | 21174 | tm19248 | Buster Keaton | Johnny Gray | ACTOR |
| 4 | 28713 | tm19248 | Marion Mack | Annabelle Lee | ACTOR |


### Analysing the size of the 'Credits' dataset:

In [177]:
credits.shape

(124235, 5)

### Analysing the columns:

In [178]:
credits.columns

Index(['person_id', 'id', 'name', 'character', 'role'], dtype='object')

After reading credits.csv, we find that the dataset contains 124,235 rows and 5 columns.
The column names are as follows:
* person_id
* id
* name
* character
* role

The columns show that this dataset records the people credited on each title, their role (for example, ACTOR), and the character they played.

Next, the dataset is checked for null values.

### Analysing for null values

Null value analysis is important because null values can lead to erroneous or misleading conclusions about the dataset.
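As a small illustration of how nulls can mislead, the sketch below uses a toy DataFrame (hypothetical values, not taken from credits.csv). It shows that `count()` and `value_counts()` silently skip NaN, so totals can look smaller than the actual number of rows.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not from credits.csv)
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "character": ["Joe", np.nan, "Moe", np.nan],
})

# count() skips NaN, so it reports fewer entries than there are rows
non_null_characters = toy["character"].count()
total_rows = len(toy)

# value_counts() also drops NaN unless dropna=False is passed
counts_default = toy["character"].value_counts()
counts_with_nan = toy["character"].value_counts(dropna=False)

print(non_null_characters, "non-null values out of", total_rows, "rows")
print(len(counts_default), "categories by default,",
      len(counts_with_nan), "when NaN is kept")
```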

In [179]:
credits.isnull().sum().sort_values(ascending=False)

character    16287
person_id        0
id               0
name             0
role             0
dtype: int64

In [180]:
round(100*(credits.isnull().sum()/len(credits.index)),2).sort_values(ascending=False)

character    13.11
person_id     0.00
id            0.00
name          0.00
role          0.00
dtype: float64

This shows that the 'character' feature has about 13% null values (16,287 of 124,235 rows). This is a fairly low proportion, so the nulls can be handled without dropping the column.
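The count and the percentage computed above can also be viewed side by side. The helper below is a minimal sketch; the name `null_summary` and the toy DataFrame are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def null_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return null counts and null percentages per column, largest first."""
    counts = df.isnull().sum()
    percent = (100 * counts / len(df)).round(2)
    summary = pd.DataFrame({"null_count": counts, "null_percent": percent})
    return summary.sort_values("null_count", ascending=False)

# Toy data (hypothetical values) standing in for the real dataset
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "character": ["Joe", np.nan, "Moe", np.nan],
})

summary = null_summary(toy)
print(summary)
```

In the notebook, the same function could be called as `null_summary(credits)`.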

### Handling null values:

In [181]:
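# Assign the result back; inplace=True on a selected column is unreliable in recent pandas versions
credits['character'] = credits['character'].fillna("No value")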

In [182]:
credits.head()

| | person_id | id | name | character | role |
|---|---|---|---|---|---|
| 0 | 59401 | ts20945 | Joe Besser | Joe | ACTOR |
| 1 | 31460 | ts20945 | Moe Howard | Moe | ACTOR |
| 2 | 31461 | ts20945 | Larry Fine | Larry | ACTOR |
| 3 | 21174 | tm19248 | Buster Keaton | Johnny Gray | ACTOR |
| 4 | 28713 | tm19248 | Marion Mack | Annabelle Lee | ACTOR |


The null values in 'character' have now been replaced with the placeholder "No value".
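The first five rows shown by `head()` had no missing characters to begin with, so they do not confirm the replacement on their own. A more direct check is to count the nulls before and after. The sketch below does this on a toy DataFrame (hypothetical values); the same pattern should apply to `credits`.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one missing character
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "character": ["Joe", np.nan, "Moe"],
})

nulls_before = int(toy["character"].isnull().sum())

# Replace missing values with a placeholder and assign the result back
toy["character"] = toy["character"].fillna("No value")

nulls_after = int(toy["character"].isnull().sum())
placeholder_rows = int((toy["character"] == "No value").sum())

print("nulls before:", nulls_before)
print("nulls after:", nulls_after)
print("rows holding the placeholder:", placeholder_rows)
```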

# Reading the "Titles" Dataset

In [183]:
title = pd.read_csv("Dataset/titles.csv")
title.head()

| | id | title | type | description | release_year | age_certification | runtime | genres | production_countries | seasons | imdb_id | imdb_score | imdb_votes | tmdb_popularity | tmdb_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ts20945 | The Three Stooges | SHOW | The Three Stooges were an American vaudeville ... | 1934 | TV-PG | 19 | ['comedy', 'family', 'animation', 'action', 'f... | ['US'] | 26.0 | tt0850645 | 8.6 | 1092.0 | 15.424 | 7.6 |
| 1 | tm19248 | The General | MOVIE | During America’s Civil War, Union spies steal ... | 1926 | NaN | 78 | ['action', 'drama', 'war', 'western', 'comedy'... | ['US'] | NaN | tt0017925 | 8.2 | 89766.0 | 8.647 | 8.0 |
| 2 | tm82253 | The Best Years of Our Lives | MOVIE | It's the hope that sustains the spirit of ever... | 1946 | NaN | 171 | ['romance', 'war', 'drama'] | ['US'] | NaN | tt0036868 | 8.1 | 63026.0 | 8.435 | 7.8 |
| 3 | tm83884 | His Girl Friday | MOVIE | Hildy, the journalist former wife of newspaper... | 1940 | NaN | 92 | ['comedy', 'drama', 'romance'] | ['US'] | NaN | tt0032599 | 7.8 | 57835.0 | 11.27 | 7.4 |
| 4 | tm56584 | In a Lonely Place | MOVIE | An aspiring actress begins to suspect that her... | 1950 | NaN | 94 | ['thriller', 'drama', 'romance'] | ['US'] | NaN | tt0042593 | 7.9 | 30924.0 | 8.273 | 7.6 |


### Analysing the size of the 'Titles' dataset:

In [184]:
title.shape

(9871, 15)

### Analysing the columns:

In [185]:
title.columns

Index(['id', 'title', 'type', 'description', 'release_year',
       'age_certification', 'runtime', 'genres', 'production_countries',
       'seasons', 'imdb_id', 'imdb_score', 'imdb_votes', 'tmdb_popularity',
       'tmdb_score'],
      dtype='object')

After reading titles.csv, we find that the dataset contains 9,871 rows and 15 columns. The column names are as follows:
* id
* title
* type
* description
* release_year
* age_certification
* runtime
* genres
* production_countries
* seasons
* imdb_id
* imdb_score
* imdb_votes
* tmdb_popularity
* tmdb_score

The columns show that this dataset describes the movies and web series available on Amazon Prime: their type, release year, runtime, genres, and production countries, along with their IMDb (Internet Movie Database) and TMDB (The Movie Database) scores and popularity.

Next, the dataset is checked for null values.

### Analysing the null values:

As with the 'Credits' dataset, null values are checked first because they can lead to erroneous or misleading conclusions.

In [186]:
title.isnull().sum().sort_values(ascending=False)

seasons                 8514
age_certification       6487
tmdb_score              2082
imdb_votes              1031
imdb_score              1021
imdb_id                  667
tmdb_popularity          547
description              119
id                         0
title                      0
type                       0
release_year               0
runtime                    0
genres                     0
production_countries       0
dtype: int64

In [187]:
round(100*(title.isnull().sum()/len(title.index)),2).sort_values(ascending=False)

seasons                 86.25
age_certification       65.72
tmdb_score              21.09
imdb_votes              10.44
imdb_score              10.34
imdb_id                  6.76
tmdb_popularity          5.54
description              1.21
id                       0.00
title                    0.00
type                     0.00
release_year             0.00
runtime                  0.00
genres                   0.00
production_countries     0.00
dtype: float64

This analysis shows that the 'seasons' and 'age_certification' features each have more than 50% null values (86.25% and 65.72% respectively). With this much data missing, simply replacing the nulls would not be reliable, so it is better to drop these features.

### Dropping the columns with more than 50% null values:

In [188]:
title = title.drop(columns=['seasons','age_certification'])
round(100*(title.isnull().sum()/len(title.index)),2).sort_values(ascending=False)

tmdb_score              21.09
imdb_votes              10.44
imdb_score              10.34
imdb_id                  6.76
tmdb_popularity          5.54
description              1.21
id                       0.00
title                    0.00
type                     0.00
release_year             0.00
runtime                  0.00
genres                   0.00
production_countries     0.00
dtype: float64
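The two columns above were named by hand. As a hedged sketch, the same 50% rule can be applied programmatically so that the list of dropped columns follows from the data. The helper name `drop_sparse_columns`, its `threshold` parameter, and the toy DataFrame are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df: pd.DataFrame, threshold: float = 0.5):
    """Drop columns whose fraction of nulls is strictly above `threshold`.

    Returns the trimmed DataFrame and the list of dropped column names.
    """
    null_fraction = df.isnull().mean()  # fraction of nulls per column
    to_drop = null_fraction[null_fraction > threshold].index.tolist()
    return df.drop(columns=to_drop), to_drop

# Toy data (hypothetical values) mimicking a sparse 'seasons' column
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "seasons": [26.0, np.nan, np.nan, np.nan],  # 75% null
    "imdb_score": [8.6, 8.2, np.nan, 7.8],      # 25% null
})

trimmed, dropped = drop_sparse_columns(toy)
print("dropped:", dropped)
print("remaining:", list(trimmed.columns))
```

Returning the dropped names alongside the trimmed frame keeps the decision visible, which is useful when the threshold is later changed.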