# Factors that influence the IMDb rating of a movie
## by Atharva J. Yeolekar

## Investigation Overview

This project attempts to identify the characteristics that influence the IMDb rating of a movie, with a focus on features such as the _Genre_, _Year of production_ and _Runtime_ of the movie.

## Dataset Overview

The dataset for this project was obtained from Kaggle. It initially contained 16744 movies; however, 502 rows and several columns had to be dropped because of data inconsistencies and missing data, leaving 16242 movies. The dataset includes features such as _Title of the Movie, Year of Production, Genres_ and _IMDb ratings_.
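
The cleaning step described above could look roughly like the sketch below. The column names (`Unused` in particular) and the toy values are assumptions made for illustration; the actual wrangling code lives in the Jupyter Notebook and may differ.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw Kaggle file; column names and values are illustrative assumptions.
raw = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Year": [2017, 2015, 1999, 2020],
    "IMDb": [5.9, np.nan, 7.3, 6.1],
    "Runtime": [100.0, 95.0, np.nan, 120.0],
    "Unused": [np.nan, np.nan, np.nan, np.nan],
})

# Drop a column that carries no usable data, then rows missing a rating or a runtime.
cleaned = raw.drop(columns=["Unused"]).dropna(subset=["IMDb", "Runtime"])

print(f"dropped {len(raw) - len(cleaned)} of {len(raw)} rows")
```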

In [1]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# suppress warnings from final output
import warnings
warnings.simplefilter("ignore")

In [2]:
# load the wrangled dataset into a pandas dataframe
movies = pd.read_csv('movies.csv')

##### All of the following visualizations have been imported directly; the code for them can be found in the Jupyter Notebook.

## Distribution of the Year column

The __Year__ column contains the year in which the movie was produced. The following plot shows that the distribution of this variable is left-skewed, which suggests that the OTT platforms in our dataset prefer to showcase relatively newer movies. The highest number of movies in our dataset were produced in 2017. The movies in our dataset were produced between __1902__ and __2020__.

![viz%201.png](attachment:viz%201.png)
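
As a numeric check on the shape of such a distribution, the sketch below computes the sample skewness and the most common year with pandas. The years listed are made up for illustration and are not taken from the dataset; a negative skewness corresponds to a long tail on the left, which is what "left-skewed" means.

```python
import pandas as pd

# Illustrative years only: most titles are recent, a few are very old.
years = pd.Series([1902, 1950, 1995, 2010, 2014, 2016, 2017, 2017, 2017, 2018, 2019, 2020])

skewness = years.skew()                  # negative value => long tail on the left
most_common_year = years.mode().iloc[0]  # the year with the most titles

print(round(skewness, 2), most_common_year)
```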

## Distribution of the IMDb ratings
The IMDb ratings of the movies in our dataset appear to be roughly _normally distributed_, with a peak at __5.9__. More than 700 movies in our dataset have an IMDb rating of 5.9. The majority of the movies are clustered between the ratings __5.3__ and __7.3__. Very few movies have a rating greater than __9__ or less than __1.5__.

![viz%202.png](attachment:viz%202.png)

## IMDb rating vs Year of Production
Multiple plots (a heatmap and a correlation heatmap) can be seen in the following slides. The first heatmap shows the __relationship between the two numeric variables__ and the second shows the __correlation__ between them. The white spots on the first heatmap indicate that very few or no movies fall in the corresponding combination of IMDb rating and Year of Production. The most populous cluster of films lies in the period 2015-2019 with ratings between 5.75 and 7.5. No obvious trend is visible in the heatmap. The correlation heatmap shows a __very weak negative correlation__ between the IMDb and Year variables (__-0.024__). This supports our observation that the IMDb rating of a movie is __barely, if at all,__ affected by its year of production.

![viz%203.png](attachment:viz%203.png)

![viz%204.png](attachment:viz%204.png)
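
The two quantities behind plots of this kind can be sketched as follows: a grid of counts per (year bin, rating bin), which is what a 2D heatmap colours, and the Pearson correlation coefficient. The data here is synthetic, with ratings drawn independently of year, and the column names `Year` and `IMDb` are assumptions; this is not the notebook's actual plotting code.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: ratings are drawn independently of the year.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Year": rng.integers(1902, 2021, size=2000),
    "IMDb": rng.normal(5.9, 1.3, size=2000).clip(0, 10),
})

# Counts per (year bin, rating bin): empty cells would show up as white spots.
counts, year_edges, rating_edges = np.histogram2d(
    demo["Year"], demo["IMDb"], bins=[12, 10]
)

# Pearson correlation coefficient between the two columns.
r = demo["Year"].corr(demo["IMDb"])
print(counts.shape, round(r, 3))
```

Passing `counts` to a heatmap function such as `sns.heatmap` would draw the grid; a coefficient close to zero, as in this synthetic case, indicates no linear relationship.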

## IMDb rating vs runtime of the movie
Multiple plots (a heatmap and a correlation heatmap) can be seen in the following slides. The first heatmap shows the __relationship between the two numeric variables__ and the second shows the __correlation__ between them. The white spots on the first heatmap indicate that very few or no movies fall in the corresponding combination of IMDb rating and runtime. The majority of the movies are clustered in the region with a runtime of around 100 minutes and an IMDb rating between 5.5 and 7.5. There are no movies with a runtime of more than 200 minutes and an IMDb rating of less than 4. There appears to be no general trend between the runtime of a movie and its IMDb rating. The correlation heatmap shows a weak positive correlation between the IMDb rating of a movie and its runtime (__0.09__). This supports our observation from the first heatmap that the IMDb rating of a movie is __barely, if at all,__ affected by its runtime.

![viz%205.png](attachment:viz%205.png)

![viz%206.png](attachment:viz%206.png)
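
One way to read coefficients this small is to square them: for a straight-line fit on a single feature, r squared is the share of the variance in the rating that the feature would explain. The short calculation below uses only the two coefficients quoted so far; the dictionary name `reported` is just a label chosen for this sketch.

```python
# Correlation coefficients quoted in the text above.
reported = {"Year": -0.024, "Runtime": 0.09}

# r squared: share of rating variance a straight-line fit on that feature would explain.
explained = {name: r ** 2 for name, r in reported.items()}

for name, share in explained.items():
    print(f"{name}: r^2 = {share:.4f} ({share:.2%} of variance)")
```

Both values come out below one percent, which is consistent with the reading that neither feature has a meaningful linear relationship with the rating.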

## IMDb rating vs Various OTT platforms
The following slide has a correlation heatmap that shows the __correlation__ between the various __OTT__ platforms and the __IMDb__ rating of a movie, as well as the correlation among the __OTT__ platforms themselves. There is a __weak positive correlation__ between Netflix and IMDb rating (__0.14__); that is, movies offered by Netflix tend to have slightly higher IMDb ratings. There is a __weak negative correlation__ between Prime Video and IMDb rating (__-0.16__); that is, movies offered by Prime Video tend to have slightly lower IMDb ratings. Both Hulu and Disney+ have a very __weak positive correlation__ with the IMDb rating of the movies they offer (__0.04__ and __0.08__), so availability on these platforms has a negligible association with a movie's IMDb rating. There is a __strong negative correlation__ between Netflix and Prime Video (-0.75); that is, if a movie is offered by one of the two platforms, it is __highly unlikely__ that the other platform offers the same movie. To generalize, all the OTT platforms have a __negative correlation__ with each other (weak or strong).

![viz%208.png](attachment:viz%208.png)
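
A correlation with a platform can be computed because availability is a 0/1 indicator: the sign of the coefficient matches the sign of the difference in mean rating between titles on and off the platform. The sketch below illustrates this on toy values; the column names `Netflix` and `Prime Video` and all numbers are assumptions, not rows from the dataset.

```python
import pandas as pd

# Toy data: 1 means the title is available on that platform. Values are illustrative only.
demo = pd.DataFrame({
    "IMDb":        [7.5, 6.8, 7.1, 5.2, 4.9, 6.0, 5.5, 6.4],
    "Netflix":     [1,   1,   1,   0,   0,   0,   0,   1],
    "Prime Video": [0,   0,   0,   1,   1,   1,   1,   1],
})

# Pairwise Pearson correlations, the matrix a correlation heatmap displays.
corr = demo.corr()

# Mean rating of titles on Netflix minus mean rating of titles not on Netflix.
means = demo.groupby("Netflix")["IMDb"].mean()
netflix_gap = means.loc[1] - means.loc[0]

print(corr.round(2))
print(round(netflix_gap, 2))
```

In this toy frame most titles sit on exactly one of the two platforms, so the two indicator columns are negatively correlated with each other, which mirrors the pattern described above.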

## IMDb vs Year vs Country
Only the movies produced in the __U.S.A__, the __U.K__ and __India__ have been considered for the following plot. These plots have been obtained using __faceting__. We can see that the __majority__ of the movies have been produced in the __United States__. The number of movies produced in __India__ has __significantly increased__ in the __21st century__. There are very few movies from the __United Kingdom__ with a low rating (__<2__), and even fewer from __India__. Despite separating the movies by their country of production, no concrete trend could be found between the IMDb rating of a movie and its Year of Production. This further supports our previous observation that the __IMDb__ rating is __not significantly affected__ by the year of production.

![viz%209.jpg](attachment:viz%209.jpg)
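
Faceting splits the data into one panel per category. A numeric counterpart, sketched below, is to compute the Year/IMDb correlation separately for each country. The frame is a toy example with assumed column names (`Country`, `Year`, `IMDb`) and invented values, so the coefficients it prints say nothing about the real dataset.

```python
import pandas as pd

# Toy data with four titles per country; all values are illustrative only.
demo = pd.DataFrame({
    "Country": ["United States"] * 4 + ["United Kingdom"] * 4 + ["India"] * 4,
    "Year":    [1990, 2005, 2015, 2019, 1985, 2000, 2012, 2018, 2003, 2010, 2016, 2019],
    "IMDb":    [6.1, 5.4, 7.0, 5.9, 6.5, 6.6, 5.8, 6.9, 6.2, 7.1, 5.7, 6.4],
})

# One correlation coefficient per country, mirroring one facet per country.
per_country = {
    country: group["Year"].corr(group["IMDb"])
    for country, group in demo.groupby("Country")
}

for country, r in per_country.items():
    print(f"{country}: r = {r:.2f}")
```

The same grouping pattern would apply to runtime or to genres by swapping the grouping column or the first column of the correlation.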

## IMDb vs Genres vs Year
Only a few of the genres have been considered for the following plots. These plots have been obtained using __faceting__. We can see that all the plots follow a similar pattern: the __majority__ of the films across all the genres shown have been produced in the __21st__ century. There is __no general trend__ between the __IMDb rating__ of a movie and its __Year of production__, even when the movies are separated by genre. This further supports our previous observation that the __IMDb__ rating is __not significantly affected__ by the year of production.

![viz%201.jpg](attachment:viz%201.jpg)

![viz%202.jpg](attachment:viz%202.jpg)

![viz%203.jpg](attachment:viz%203.jpg)

## IMDb vs Runtime vs Country
Only the movies produced in the __U.S.A__, the __U.K__ and __India__ have been considered for the following plot. These plots have been obtained using __faceting__. We can observe that movies from India generally have a __greater average runtime__ than movies from the __U.S__ and the __U.K__. Movies from the UK generally have a __runtime of less than 200 minutes__. Despite __separating__ the movies by their __country of production__, no concrete trend could be found between the Runtime and IMDb features of our dataset. This further supports our previous observation that the __IMDb__ rating is __not significantly affected__ by the runtime.

![viz%204.jpg](attachment:viz%204.jpg)

## IMDb vs Runtime vs Genres
Only a few of the genres have been considered for the following plots. These plots have been obtained using __faceting__. We can observe that the __majority__ of the movies have a runtime of __less than 200 minutes__. There are some outliers (in the action genre) with a __runtime of more than 500 minutes__. Despite __separating__ the movies by their __genres__, no concrete trend could be found between the Runtime and IMDb features of our dataset. This further supports our previous observation that the __IMDb__ rating is __not significantly affected__ by the runtime.

![viz%205.jpg](attachment:viz%205.jpg)

![viz%206.jpg](attachment:viz%206.jpg)

![viz%207.jpg](attachment:viz%207.jpg)

## Conclusion:
 - __No feature in our dataset significantly affects the IMDb rating of a movie.__
 - __Netflix has the strongest positive correlation (0.14) and Prime Video the strongest negative correlation (-0.16) with the IMDb column.__
 - __There is a strong negative correlation between the availability of a movie on Netflix and on Prime Video.__