# Prediction of Netflix movie ratings (users' reviews)

# Part-I

# Data wrangling 

In [1]:
# Import the NumPy and pandas packages
import numpy as np
import pandas as pd

In [2]:
# Import the data 
#url = 'https://www.kaggle.com/shivamb/netflix-shows?select=netflix_titles.csv'
mydata = pd.read_csv('netflix_titles.csv', header = 0)
mydata.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,TV Show,3%,,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,14-Aug-20,2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...
1,s2,Movie,7:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,23-Dec-16,2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,s3,Movie,23:59,Gilbert Chan,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,20-Dec-18,2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."
3,s4,Movie,9,Shane Acker,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,16-Nov-17,2009,PG-13,80 min,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi..."
4,s5,Movie,21,Robert Luketic,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,1-Jan-20,2008,PG-13,123 min,Dramas,A brilliant group of students become card-coun...


In [3]:
# List all the column names
mydata.columns

Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')

In [4]:
mydata.shape

(7787, 12)

In [5]:
# Find out the total number of null values in each column 
mydata.isnull().sum()

show_id            0
type               0
title              0
director        2389
cast             718
country          507
date_added        10
release_year       0
rating             7
duration           0
listed_in          0
description        0
dtype: int64

In [6]:
# Find out the data type of each column
mydata.dtypes

show_id         object
type            object
title           object
director        object
cast            object
country         object
date_added      object
release_year     int64
rating          object
duration        object
listed_in       object
description     object
dtype: object

In [7]:
mydata.tail()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
7782,s7783,Movie,Zozo,Josef Fares,"Imad Creidi, Antoinette Turk, Elias Gergi, Car...","Sweden, Czech Republic, United Kingdom, Denmar...",19-Oct-20,2005,TV-MA,99 min,"Dramas, International Movies",When Lebanon's Civil War deprives Zozo of his ...
7783,s7784,Movie,Zubaan,Mozez Singh,"Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanan...",India,2-Mar-19,2015,TV-14,111 min,"Dramas, International Movies, Music & Musicals",A scrappy but poor boy worms his way into a ty...
7784,s7785,Movie,Zulu Man in Japan,,Nasty C,,25-Sep-20,2019,TV-MA,44 min,"Documentaries, International Movies, Music & M...","In this documentary, South African rapper Nast..."
7785,s7786,TV Show,Zumbo's Just Desserts,,"Adriano Zumbo, Rachel Khoo",Australia,31-Oct-20,2019,TV-PG,1 Season,"International TV Shows, Reality TV",Dessert wizard Adriano Zumbo looks for the nex...
7786,s7787,Movie,ZZ TOP: THAT LITTLE OL' BAND FROM TEXAS,Sam Dunn,,"United Kingdom, Canada, United States",1-Mar-20,2019,TV-MA,90 min,"Documentaries, Music & Musicals",This documentary delves into the mystique behi...


In [8]:
# List the different countries in the data set (first 10 unique values)
mydata['country'].unique()[:10]

array(['Brazil', 'Mexico', 'Singapore', 'United States', 'Turkey',
       'Egypt', 'India', 'Poland, United States', 'Thailand', 'Nigeria'],
      dtype=object)

In [9]:
# List countries and the number of times they appear in the data set, display top 10 countries  
mydata['country'].value_counts()[:10]

United States     2555
India              923
United Kingdom     397
Japan              226
South Korea        183
Canada             177
Spain              134
France             115
Egypt              101
Mexico             100
Name: country, dtype: int64
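Note that some entries in the 'country' column list several countries in one string (for example 'Poland, United States'), so `value_counts()` treats each combination as its own category. A minimal sketch of counting each country separately is given below; it uses a small hypothetical toy frame rather than the real Netflix file, and the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical toy rows standing in for the 'country' column
toy = pd.DataFrame({
    "country": ["United States", "Poland, United States", "India", None],
})

per_country = (
    toy["country"]
    .dropna()            # ignore rows with no country
    .str.split(",")      # 'Poland, United States' -> ['Poland', ' United States']
    .explode()           # one row per country
    .str.strip()         # remove the leading space left by the split
    .value_counts()
)
print(per_country)
```

With this approach a co-production is counted once for every country it lists, so the totals can exceed the number of titles.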

In [10]:
# List the types of show listed in the data set
mydata['type'].value_counts()

Movie      5377
TV Show    2410
Name: type, dtype: int64

In [11]:
# Count how often each release year appears in the data set and display the top 10 years
mydata['release_year'].value_counts()[:10]

2018    1121
2017    1012
2019     996
2016     882
2020     868
2015     541
2014     334
2013     267
2012     219
2010     173
Name: release_year, dtype: int64

In [12]:
# Count the movies/TV shows released in each year
year_show_count = pd.DataFrame(mydata['release_year'].value_counts())
year_show_count.head()

Unnamed: 0,release_year
2018,1121
2017,1012
2019,996
2016,882
2020,868


In [13]:
# Many values are missing in the 'director' column, so drop it from the dataframe
mydata2 = mydata.drop(columns=['director'])
mydata2.head()

Unnamed: 0,show_id,type,title,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,TV Show,3%,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,14-Aug-20,2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...
1,s2,Movie,7:19,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,23-Dec-16,2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,s3,Movie,23:59,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,20-Dec-18,2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."
3,s4,Movie,9,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,16-Nov-17,2009,PG-13,80 min,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi..."
4,s5,Movie,21,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,1-Jan-20,2008,PG-13,123 min,Dramas,A brilliant group of students become card-coun...


In [14]:
mydata2['date_added'] = pd.to_datetime(mydata2['date_added'])
mydata2.head()

Unnamed: 0,show_id,type,title,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,TV Show,3%,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,2020-08-14,2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...
1,s2,Movie,7:19,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,2016-12-23,2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,s3,Movie,23:59,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,2018-12-20,2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."
3,s4,Movie,9,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,2017-11-16,2009,PG-13,80 min,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi..."
4,s5,Movie,21,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,2020-01-01,2008,PG-13,123 min,Dramas,A brilliant group of students become card-coun...


In [15]:
# After converting to date, the year was extracted. 
mydata2['date_added'] = mydata2['date_added'].dt.year.fillna(0).astype(int)
#mydata2['date_added'] = mydata2['date_added'].map(lambda x: x.year)
mydata2.head(3)

Unnamed: 0,show_id,type,title,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,TV Show,3%,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,2020,2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...
1,s2,Movie,7:19,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,2016,2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,s3,Movie,23:59,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,2018,2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."


In [16]:
mydata2.date_added.isnull().sum()

0

In [17]:
mydata2.head()

Unnamed: 0,show_id,type,title,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,TV Show,3%,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,2020,2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...
1,s2,Movie,7:19,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,2016,2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,s3,Movie,23:59,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,2018,2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."
3,s4,Movie,9,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,2017,2009,PG-13,80 min,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi..."
4,s5,Movie,21,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,2020,2008,PG-13,123 min,Dramas,A brilliant group of students become card-coun...


In [20]:
#Check the null values in each column of the dataset
mydata2.isnull().sum()

show_id           0
type              0
title             0
cast            718
country         507
date_added        0
release_year      0
rating            7
duration          0
listed_in         0
description       0
dtype: int64

In [21]:
mydata2.dropna(inplace = True)
mydata2.head()

Unnamed: 0,show_id,type,title,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,TV Show,3%,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,2020,2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...
1,s2,Movie,7:19,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,2016,2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,s3,Movie,23:59,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,2018,2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."
3,s4,Movie,9,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,2017,2009,PG-13,80 min,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi..."
4,s5,Movie,21,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,2020,2008,PG-13,123 min,Dramas,A brilliant group of students become card-coun...


In [22]:
mydata2.isnull().sum()

show_id         0
type            0
title           0
cast            0
country         0
date_added      0
release_year    0
rating          0
duration        0
listed_in       0
description     0
dtype: int64

In [23]:
mydata2['rating'].value_counts()

TV-MA       2452
TV-14       1639
R            655
TV-PG        629
PG-13        375
PG           238
TV-Y7        209
TV-Y         192
TV-G         146
NR            67
G             38
UR             5
TV-Y7-FV       4
NC-17          3
Name: rating, dtype: int64

In [24]:
mydata2.shape

(6652, 11)

In [25]:
mydata2.columns

Index(['show_id', 'type', 'title', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')

In [26]:
mydata2['type'].value_counts()

Movie      4761
TV Show    1891
Name: type, dtype: int64

In [27]:
mydata2['duration'].value_counts()

1 Season     1184
2 Seasons     317
3 Seasons     161
93 min        121
94 min        120
             ... 
181 min         1
167 min         1
189 min         1
237 min         1
15 min          1
Name: duration, Length: 203, dtype: int64

In [28]:
# Make a dataframe containing only TV shows
mydata2_TV_Shows_only = mydata2[mydata2['type'] == 'TV Show']
mydata2_TV_Shows_only.head()                               

Unnamed: 0,show_id,type,title,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,TV Show,3%,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,2020,2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...
5,s6,TV Show,46,"Erdal Beşikçioğlu, Yasemin Allen, Melis Birkan...",Turkey,2017,2016,TV-MA,1 Season,"International TV Shows, TV Dramas, TV Mysteries",A genetics professor experiments with a treatm...
11,s12,TV Show,1983,"Robert Więckiewicz, Maciej Musiał, Michalina O...","Poland, United States",2018,2018,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Dramas","In this dark alt-history thriller, a naïve law..."
24,s25,TV Show,​SAINT SEIYA: Knights of the Zodiac,"Bryson Baugus, Emily Neves, Blake Shepard, Pat...",Japan,2020,2020,TV-14,2 Seasons,"Anime Series, International TV Shows",Seiya and the Knights of the Zodiac rise again...
29,s30,TV Show,#blackAF,"Kenya Barris, Rashida Jones, Iman Benson, Genn...",United States,2020,2020,TV-MA,1 Season,TV Comedies,Kenya Barris and his family navigate relations...


In [29]:
mydata2_TV_Shows_only.shape

(1891, 11)

In [30]:
# Make a dataframe of Movies only
mydata2_movies_only = mydata2[mydata2['type'] == 'Movie']
mydata2_movies_only.head()

Unnamed: 0,show_id,type,title,cast,country,date_added,release_year,rating,duration,listed_in,description
1,s2,Movie,7:19,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,2016,2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,s3,Movie,23:59,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,2018,2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."
3,s4,Movie,9,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,2017,2009,PG-13,80 min,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi..."
4,s5,Movie,21,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,2020,2008,PG-13,123 min,Dramas,A brilliant group of students become card-coun...
6,s7,Movie,122,"Amina Khalil, Ahmed Dawood, Tarek Lotfy, Ahmed...",Egypt,2020,2019,TV-MA,95 min,"Horror Movies, International Movies","After an awful accident, a couple admitted to ..."


In [31]:
mydata2_movies_only.shape

(4761, 11)
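A subset obtained by boolean filtering can trigger pandas' `SettingWithCopyWarning` when it is later modified in place. A minimal sketch of one common way to avoid this, taking an explicit `.copy()` of the subset, is shown below. The toy frame and the `is_long` column are hypothetical and only illustrate the pattern.

```python
import pandas as pd

# Hypothetical toy frame with the same 'type' and 'duration' columns
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie"],
    "duration": ["93 min", "4 Seasons", "78 min"],
})

# .copy() makes the subset an independent DataFrame,
# so later in-place changes do not refer back to 'toy'
movies = toy[toy["type"] == "Movie"].copy()
movies.rename(columns={"duration": "duration_min"}, inplace=True)
movies["is_long"] = False  # hypothetical new column, added without a warning

print(movies.columns.tolist())
print(toy.columns.tolist())  # the original frame is unchanged
```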

In [32]:
# Rename the 'duration' column to 'duration_min' (returning a new dataframe avoids the SettingWithCopyWarning)
mydata2_movies_only = mydata2_movies_only.rename(columns={'duration': 'duration_min'})
mydata2_movies_only.head(2)

Unnamed: 0,show_id,type,title,cast,country,date_added,release_year,rating,duration_min,listed_in,description
1,s2,Movie,7:19,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,2016,2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,s3,Movie,23:59,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,2018,2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."


In [33]:
# Strip the 'min' unit from the 'duration_min' column and keep only the number of minutes.
# To do this, first split the column on the space; the number ends up in column 0.
mydata2_movies_only2 = mydata2_movies_only['duration_min'].str.split(" ", expand = True)
mydata2_movies_only2.head(2)

Unnamed: 0,0,1
1,93,min
2,78,min


In [34]:
mydata2_movies_only = mydata2_movies_only.assign(duration_min=mydata2_movies_only2[0])
mydata2_movies_only.head()

Unnamed: 0,show_id,type,title,cast,country,date_added,release_year,rating,duration_min,listed_in,description
1,s2,Movie,7:19,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,2016,2016,TV-MA,93,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,s3,Movie,23:59,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,2018,2011,R,78,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."
3,s4,Movie,9,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,2017,2009,PG-13,80,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi..."
4,s5,Movie,21,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,2020,2008,PG-13,123,Dramas,A brilliant group of students become card-coun...
6,s7,Movie,122,"Amina Khalil, Ahmed Dawood, Tarek Lotfy, Ahmed...",Egypt,2020,2019,TV-MA,95,"Horror Movies, International Movies","After an awful accident, a couple admitted to ..."
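After the split, the values are still stored as strings. A minimal sketch of an alternative that pulls out the digits and converts them to integers in one step is given below, assuming every entry starts with a whole number as in the output above. The toy series are hypothetical.

```python
import pandas as pd

# Hypothetical sample values in the same format as the 'duration' column
movie_durations = pd.Series(["93 min", "78 min", "123 min"])
show_durations = pd.Series(["4 Seasons", "1 Season", "2 Seasons"])

# Capture the leading digits and convert them to integers
minutes = movie_durations.str.extract(r"(\d+)", expand=False).astype(int)
seasons = show_durations.str.extract(r"(\d+)", expand=False).astype(int)

print(minutes.tolist())  # [93, 78, 123]
print(seasons.tolist())  # [4, 1, 2]
```

With integer values, numeric operations such as `minutes.mean()` work directly.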


In [35]:
# Apply the same procedure to the TV-show dataframe: strip 'Season'/'Seasons' from the 'duration' column
# and keep only the number. First, rename 'duration' in mydata2_TV_Shows_only to 'Number_of_seasons'.
mydata2_TV_Shows_only = mydata2_TV_Shows_only.rename(columns={'duration': 'Number_of_seasons'})
mydata2_TV_Shows_only.head()

Unnamed: 0,show_id,type,title,cast,country,date_added,release_year,rating,Number_of_seasons,listed_in,description
0,s1,TV Show,3%,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,2020,2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...
5,s6,TV Show,46,"Erdal Beşikçioğlu, Yasemin Allen, Melis Birkan...",Turkey,2017,2016,TV-MA,1 Season,"International TV Shows, TV Dramas, TV Mysteries",A genetics professor experiments with a treatm...
11,s12,TV Show,1983,"Robert Więckiewicz, Maciej Musiał, Michalina O...","Poland, United States",2018,2018,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Dramas","In this dark alt-history thriller, a naïve law..."
24,s25,TV Show,​SAINT SEIYA: Knights of the Zodiac,"Bryson Baugus, Emily Neves, Blake Shepard, Pat...",Japan,2020,2020,TV-14,2 Seasons,"Anime Series, International TV Shows",Seiya and the Knights of the Zodiac rise again...
29,s30,TV Show,#blackAF,"Kenya Barris, Rashida Jones, Iman Benson, Genn...",United States,2020,2020,TV-MA,1 Season,TV Comedies,Kenya Barris and his family navigate relations...


In [36]:
mydata2_TV_Shows_only.shape

(1891, 11)

In [37]:
# Find out how many distinct season counts occur among the TV shows
print(mydata2_TV_Shows_only.Number_of_seasons.nunique())

15


In [38]:
# Show the number of TV shows for each number of seasons
mydata2_TV_Shows_only.Number_of_seasons.value_counts()

1 Season      1184
2 Seasons      317
3 Seasons      161
4 Seasons       85
5 Seasons       58
6 Seasons       29
7 Seasons       18
8 Seasons       16
9 Seasons        7
10 Seasons       6
11 Seasons       3
12 Seasons       2
15 Seasons       2
13 Seasons       2
16 Seasons       1
Name: Number_of_seasons, dtype: int64

In [39]:
# Now carry out the splitting on the Number_of_seasons column
TV_shows_split_season = mydata2_TV_Shows_only.Number_of_seasons.str.split(' ', expand = True)
TV_shows_split_season.head()

Unnamed: 0,0,1
0,4,Seasons
5,1,Season
11,1,Season
24,2,Seasons
29,1,Season


In [40]:
# Take only the first column of TV_shows_split_season and assign it to the 'Number_of_seasons' column
mydata2_TV_Shows_only = mydata2_TV_Shows_only.assign(Number_of_seasons=TV_shows_split_season[0])
mydata2_TV_Shows_only.head()

Unnamed: 0,show_id,type,title,cast,country,date_added,release_year,rating,Number_of_seasons,listed_in,description
0,s1,TV Show,3%,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,2020,2020,TV-MA,4,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...
5,s6,TV Show,46,"Erdal Beşikçioğlu, Yasemin Allen, Melis Birkan...",Turkey,2017,2016,TV-MA,1,"International TV Shows, TV Dramas, TV Mysteries",A genetics professor experiments with a treatm...
11,s12,TV Show,1983,"Robert Więckiewicz, Maciej Musiał, Michalina O...","Poland, United States",2018,2018,TV-MA,1,"Crime TV Shows, International TV Shows, TV Dramas","In this dark alt-history thriller, a naïve law..."
24,s25,TV Show,​SAINT SEIYA: Knights of the Zodiac,"Bryson Baugus, Emily Neves, Blake Shepard, Pat...",Japan,2020,2020,TV-14,2,"Anime Series, International TV Shows",Seiya and the Knights of the Zodiac rise again...
29,s30,TV Show,#blackAF,"Kenya Barris, Rashida Jones, Iman Benson, Genn...",United States,2020,2020,TV-MA,1,TV Comedies,Kenya Barris and his family navigate relations...


# Part-II

In the second part of this project, the IMDb datasets are imported and wrangled for further analysis.

In [41]:
# Different methods to import datafile into jupyter notebook environment
# Import datasets, IMDb movies
# If the data file is in the same directory
#IMDB_Movies = pd.read_csv('IMDb movies.csv', header = 0)
#IMDB_Movies.head()

#If data file is in a sub directory of the current directory
IMDB_Movies = pd.read_csv('IMDB_Internet_Movie_Database/IMDb movies.csv', header = 0) 
IMDB_Movies.head()

# Import from url
#url = 'https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset?select=IMDb+movies.csv'
#IMDB_Movies3 = pd.read_csv('https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset?select=IMDb+movies.csv')
#IMDB_Movies3.head()

Unnamed: 0,imdb_title_id,title,original_title,year,date_published,genre,duration,country,language,director,...,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics
0,tt0000009,Miss Jerry,Miss Jerry,1894,1894-10-09,Romance,45,USA,,Alexander Black,...,"Blanche Bayliss, William Courtenay, Chauncey D...",The adventures of a female reporter in the 1890s.,5.9,154,,,,,1.0,2.0
1,tt0000574,The Story of the Kelly Gang,The Story of the Kelly Gang,1906,12/26/1906,"Biography, Crime, Drama",70,Australia,,Charles Tait,...,"Elizabeth Tait, John Tait, Norman Campbell, Be...",True story of notorious Australian outlaw Ned ...,6.1,589,"$2,250",,,,7.0,7.0
2,tt0001892,Den sorte drøm,Den sorte drøm,1911,8/19/1911,Drama,53,"Germany, Denmark",,Urban Gad,...,"Asta Nielsen, Valdemar Psilander, Gunnar Helse...",Two men of high rank are both wooing the beaut...,5.8,188,,,,,5.0,2.0
3,tt0002101,Cleopatra,Cleopatra,1912,11/13/1912,"Drama, History",100,USA,English,Charles L. Gaskill,...,"Helen Gardner, Pearl Sindelar, Miss Fielding, ...",The fabled queen of Egypt's affair with Roman ...,5.2,446,"$45,000",,,,25.0,3.0
4,tt0002130,L'Inferno,L'Inferno,1911,3/6/1911,"Adventure, Drama, Fantasy",68,Italy,Italian,"Francesco Bertolini, Adolfo Padovan",...,"Salvatore Papa, Arturo Pirovano, Giuseppe de L...",Loosely adapted from Dante's Divine Comedy and...,7.0,2237,,,,,31.0,14.0


In [42]:
# Import the rating dataset
IMDB_rating = pd.read_csv('IMDB_Internet_Movie_Database/IMDb ratings.csv', header = 0) 
IMDB_rating.head()

Unnamed: 0,imdb_title_id,weighted_average_vote,total_votes,mean_vote,median_vote,votes_10,votes_9,votes_8,votes_7,votes_6,...,females_30age_avg_vote,females_30age_votes,females_45age_avg_vote,females_45age_votes,top1000_voters_rating,top1000_voters_votes,us_voters_rating,us_voters_votes,non_us_voters_rating,non_us_voters_votes
0,tt0000009,5.9,154,5.9,6.0,12,4,10,43,28,...,5.7,13.0,4.5,4.0,5.7,34.0,6.4,51.0,6.0,70.0
1,tt0000574,6.1,589,6.3,6.0,57,18,58,137,139,...,6.2,23.0,6.6,14.0,6.4,66.0,6.0,96.0,6.2,331.0
2,tt0001892,5.8,188,6.0,6.0,6,6,17,44,52,...,5.8,4.0,6.8,7.0,5.4,32.0,6.2,31.0,5.9,123.0
3,tt0002101,5.2,446,5.3,5.0,15,8,16,62,98,...,5.5,14.0,6.1,21.0,4.9,57.0,5.5,207.0,4.7,105.0
4,tt0002130,7.0,2237,6.9,7.0,210,225,436,641,344,...,7.3,82.0,7.4,77.0,6.9,139.0,7.0,488.0,7.0,1166.0


In [43]:
# 'avg_vote' in the IMDB_Movies dataframe and 'weighted_average_vote' in IMDB_rating carry the same information.
# Hence the IMDB_Movies dataframe was chosen to be wrangled for further analysis along with the Netflix dataset.
IMDB_Movies.shape

(85855, 22)

In [44]:
IMDB_Movies.columns

Index(['imdb_title_id', 'title', 'original_title', 'year', 'date_published',
       'genre', 'duration', 'country', 'language', 'director', 'writer',
       'production_company', 'actors', 'description', 'avg_vote', 'votes',
       'budget', 'usa_gross_income', 'worlwide_gross_income', 'metascore',
       'reviews_from_users', 'reviews_from_critics'],
      dtype='object')

In [45]:
IMDB_Movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85855 entries, 0 to 85854
Data columns (total 22 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   imdb_title_id          85855 non-null  object 
 1   title                  85855 non-null  object 
 2   original_title         85855 non-null  object 
 3   year                   85855 non-null  object 
 4   date_published         85855 non-null  object 
 5   genre                  85855 non-null  object 
 6   duration               85855 non-null  int64  
 7   country                85791 non-null  object 
 8   language               85022 non-null  object 
 9   director               85768 non-null  object 
 10  writer                 84283 non-null  object 
 11  production_company     81400 non-null  object 
 12  actors                 85786 non-null  object 
 13  description            83740 non-null  object 
 14  avg_vote               85855 non-null  float64
 15  vo

In [46]:
IMDB_Movies.isnull().sum()

imdb_title_id                0
title                        0
original_title               0
year                         0
date_published               0
genre                        0
duration                     0
country                     64
language                   833
director                    87
writer                    1572
production_company        4455
actors                      69
description               2115
avg_vote                     0
votes                        0
budget                   62145
usa_gross_income         70529
worlwide_gross_income    54839
metascore                72550
reviews_from_users        7597
reviews_from_critics     11797
dtype: int64

In [47]:
IMDB_Movies.head()

Unnamed: 0,imdb_title_id,title,original_title,year,date_published,genre,duration,country,language,director,...,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics
0,tt0000009,Miss Jerry,Miss Jerry,1894,1894-10-09,Romance,45,USA,,Alexander Black,...,"Blanche Bayliss, William Courtenay, Chauncey D...",The adventures of a female reporter in the 1890s.,5.9,154,,,,,1.0,2.0
1,tt0000574,The Story of the Kelly Gang,The Story of the Kelly Gang,1906,12/26/1906,"Biography, Crime, Drama",70,Australia,,Charles Tait,...,"Elizabeth Tait, John Tait, Norman Campbell, Be...",True story of notorious Australian outlaw Ned ...,6.1,589,"$2,250",,,,7.0,7.0
2,tt0001892,Den sorte drøm,Den sorte drøm,1911,8/19/1911,Drama,53,"Germany, Denmark",,Urban Gad,...,"Asta Nielsen, Valdemar Psilander, Gunnar Helse...",Two men of high rank are both wooing the beaut...,5.8,188,,,,,5.0,2.0
3,tt0002101,Cleopatra,Cleopatra,1912,11/13/1912,"Drama, History",100,USA,English,Charles L. Gaskill,...,"Helen Gardner, Pearl Sindelar, Miss Fielding, ...",The fabled queen of Egypt's affair with Roman ...,5.2,446,"$45,000",,,,25.0,3.0
4,tt0002130,L'Inferno,L'Inferno,1911,3/6/1911,"Adventure, Drama, Fantasy",68,Italy,Italian,"Francesco Bertolini, Adolfo Padovan",...,"Salvatore Papa, Arturo Pirovano, Giuseppe de L...",Loosely adapted from Dante's Divine Comedy and...,7.0,2237,,,,,31.0,14.0


#### The 'title' and 'original_title' columns look alike. To decide which one to keep, both columns are explored further in the next few cells:

In [48]:
IMDB_Movies['title2'] = np.where(IMDB_Movies['title'] == IMDB_Movies['original_title'], True, False )

In [49]:
IMDB_Movies['title2'].value_counts()

True     62483
False    23372
Name: title2, dtype: int64

This shows that the values in the 'title' and 'original_title' columns are not always the same: about 73% of the rows match (62483 of 85855), and the rest do not.
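The comparison can also be written without `np.where`, because comparing two columns already yields a boolean Series. The sketch below is an illustration on a hypothetical four-row toy frame, not the author's code. It also reproduces the share of matching rows from the counts reported above.

```python
import pandas as pd

# Hypothetical toy rows in the style of the IMDb movies table
toy = pd.DataFrame({
    "title": ["Miss Jerry", "Il calvario di una madre", "Cleopatra", "Amore di madre"],
    "original_title": ["Miss Jerry", "Ingeborg Holm", "Cleopatra", "Home, Sweet Home"],
})

same = toy["title"] == toy["original_title"]   # boolean Series
share_same = same.mean()                        # fraction of True values
mismatches = toy.loc[~same, ["original_title", "title"]]

print(share_same)
print(mismatches)

# Share of matching rows from the counts shown in the notebook output
print(round(62483 / 85855 * 100, 1))  # 72.8
```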

In [50]:
only_mismatch = IMDB_Movies.loc[~IMDB_Movies['title2'], ['original_title', 'title', 'title2']]
only_mismatch.head()

Unnamed: 0,original_title,title,title2
12,Ingeborg Holm,Il calvario di una madre,False
17,"Home, Sweet Home",Amore di madre,False
18,Der Student von Prag,Lo studente di Praga,False
38,Tess of the Storm Country,La madonnina del porto,False
47,The Birth of a Nation,Nascita di una nazione,False


In [51]:
only_mismatch.isnull().sum()

original_title    0
title             0
title2            0
dtype: int64

Neither 'original_title' nor 'title' contains null values in the mismatched rows. After some consideration, it was decided to keep 'original_title' and drop 'title'.

Because the publication year is already listed in the 'year' column, the 'date_published' column adds little, so it is dropped as well.

About 85% of the values in the 'metascore' column are null, so this column is dropped. The 'usa_gross_income', 'worlwide_gross_income' and 'budget' columns are each more than 50% null, so they are dropped too. The 'reviews_from_critics' column has more than 11,000 null entries (11797), so it is also dropped.
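The percentages quoted above can be checked with a few lines of pandas. The sketch below rebuilds them from the null counts and the row count (85855) reported in the earlier outputs. The 50% threshold is the one used in the text, and the variable names are illustrative.

```python
import pandas as pd

n_rows = 85855  # rows in IMDB_Movies, from IMDB_Movies.shape

# Null counts copied from the IMDB_Movies.isnull().sum() output
null_counts = pd.Series({
    "budget": 62145,
    "usa_gross_income": 70529,
    "worlwide_gross_income": 54839,
    "metascore": 72550,
    "reviews_from_users": 7597,
    "reviews_from_critics": 11797,
})

null_pct = (null_counts / n_rows * 100).round(1)
mostly_null = null_pct[null_pct > 50].index.tolist()

print(null_pct.sort_values(ascending=False))
print(mostly_null)
```

On the real dataframe, `IMDB_Movies.isnull().mean() * 100` would give the same percentages directly.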

In [52]:
IMDB_Movies.drop(columns =['title', 'title2', 'date_published', 'metascore','usa_gross_income', 'worlwide_gross_income', 'budget', 'reviews_from_critics'], inplace = True)

In [53]:
IMDB_Movies.head(2)

Unnamed: 0,imdb_title_id,original_title,year,genre,duration,country,language,director,writer,production_company,actors,description,avg_vote,votes,reviews_from_users
0,tt0000009,Miss Jerry,1894,Romance,45,USA,,Alexander Black,Alexander Black,Alexander Black Photoplays,"Blanche Bayliss, William Courtenay, Chauncey D...",The adventures of a female reporter in the 1890s.,5.9,154,1.0
1,tt0000574,The Story of the Kelly Gang,1906,"Biography, Crime, Drama",70,Australia,,Charles Tait,Charles Tait,J. and N. Tait,"Elizabeth Tait, John Tait, Norman Campbell, Be...",True story of notorious Australian outlaw Ned ...,6.1,589,7.0


In [54]:
IMDB_Movies.columns

Index(['imdb_title_id', 'original_title', 'year', 'genre', 'duration',
       'country', 'language', 'director', 'writer', 'production_company',
       'actors', 'description', 'avg_vote', 'votes', 'reviews_from_users'],
      dtype='object')

In [55]:
IMDB_Movies.isnull().sum()

imdb_title_id            0
original_title           0
year                     0
genre                    0
duration                 0
country                 64
language               833
director                87
writer                1572
production_company    4455
actors                  69
description           2115
avg_vote                 0
votes                    0
reviews_from_users    7597
dtype: int64

In [56]:
IMDB_Movies2 = IMDB_Movies.dropna()

In [57]:
IMDB_Movies2.shape

(72043, 15)

In [58]:
IMDB_Movies2.isnull().sum().sum() # No null values in the whole dataset 

0

## Save two clean data files for Exploratory Data Analysis  

In [59]:
# Save the netflix data file
mydata2_movies_only.to_csv('clean_datafiles/mydata_movies_only.csv')

In [60]:
#import os
#from library.sb_utils import save_file
#data_saving_path = '../clean_datafiles'
IMDB_Movies2.to_csv('clean_datafiles/IMDB_Movies_cleaned.csv')
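By default `to_csv` also writes the row index as an extra unnamed column, and it fails if the target directory does not exist. A minimal sketch of a more defensive save is shown below. It writes a hypothetical toy frame to a temporary directory so that it can be run anywhere.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Hypothetical toy frame standing in for the cleaned data
toy = pd.DataFrame({"original_title": ["Miss Jerry"], "avg_vote": [5.9]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "clean_datafiles"
    out_dir.mkdir(parents=True, exist_ok=True)   # create the folder if it is missing
    out_file = out_dir / "IMDB_Movies_cleaned.csv"
    toy.to_csv(out_file, index=False)            # do not write the row index
    reloaded = pd.read_csv(out_file)

print(reloaded.columns.tolist())
```

Dropping the index keeps an 'Unnamed: 0' column from appearing when the file is read back for the exploratory analysis.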

# Summary

## Part-I 

* The Netflix dataset used in this project was taken from Kaggle. The raw dataset comprises 7787 rows and 12 columns and contains many null values. Because the 'director' column has 2389 null values (about 31% of all entries), this column was dropped from the dataset.<br><br>

* The dataset lists two types of titles: "TV Show" and "Movie". Hence the dataset was divided into two dataframes, one with TV shows only and the other with movies only.<br><br>
* **Major problem encountered while wrangling the dataset:** <br><br>
After the date_added column was converted to datetime, the year was extracted. The result was float instead of integer. Further investigation showed that this happens because of the null values in the column. To handle this, the null values were first filled with 0 using the fillna() method and the column was then converted to integer, so rows with a missing date (10 in the raw data) carry the placeholder year 0.
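The float result described above can be reproduced on a tiny example. The sketch below uses three hypothetical date strings in the same day-month-year format. It shows the `fillna(0)` approach used in this notebook and, as an alternative not used here, pandas' nullable `Int64` dtype, which keeps missing years as `<NA>` instead of 0.

```python
import pandas as pd

# Hypothetical sample in the same format as 'date_added', with one missing value
dates = pd.Series(["14-Aug-20", None, "1-Jan-20"])
parsed = pd.to_datetime(dates, format="%d-%b-%y")

years_raw = parsed.dt.year                     # missing date -> NaN, so not a plain int column
years_filled = years_raw.fillna(0).astype(int) # approach used in this notebook
years_nullable = years_raw.astype("Int64")     # alternative: keep the gap as <NA>

print(years_raw.dtype)
print(years_filled.tolist())
print(years_nullable)
```

The nullable dtype avoids introducing a placeholder year, at the cost of having to handle `<NA>` in later steps.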

## Part-II

The IMDb dataset was taken from Kaggle. The raw dataset had 85855 rows and 22 columns.

* The 'original_title' and 'title' columns looked alike, so their equality was checked with numpy.where() and the matching and non-matching values were counted. Of 85855 values, 62483 are identical and the remaining 23372 are not. Finally, 'original_title' was kept for further analysis.

* A few columns were dropped for the following reasons:
    * 'date_published' was dropped because the 'year' column already gives the publication year.
    * About 85% of the entries in 'metascore' are null, so it was dropped from the dataframe.
    * 'usa_gross_income', 'worlwide_gross_income' and 'budget' are each more than 50% null, so these columns were also dropped.
    * 'reviews_from_critics' has more than 11,000 null entries, so it was also dropped.
    