
## Movie Data Set

### <b>Use case:</b> Clean and analyze the movie dataset using EDA to
    - Gain insights
    - Determine relationships in the data
    - Check assumptions
    - Represent the analysis using plots
    

### The analysis addresses the following questions/relationships:<br>
1) What is the relation between Year and Movies Produced/Registered?<br>
2) Is there any relationship between duration and rating?<br>
3) Which Year produced most hit movies (liked by audience)?<br>
4) Which Year generated maximum revenue?<br>
5) What is the relationship between Votes and Year?<br>
6) Which category of movies receives most of the good ratings?<br>
7) Which genre of movies was the most voted category?<br>
8) Which genre generated highest revenue?<br>
9) How does the director play a role in revenues and votes?<br>
10) Who are the top performing actors?<br>


In [None]:
import numpy as np                                                 # Implements multi-dimensional arrays and matrices
import pandas as pd                                                # For data manipulation and analysis
import pandas_profiling                                            # Generates an HTML profiling report for a DataFrame
import matplotlib.pyplot as plt                                    # Plotting library for Python and its numerical mathematics extension NumPy
import seaborn as sns                                              # Provides a high-level interface for drawing attractive and informative statistical graphics
%matplotlib inline
sns.set()

In [None]:
movie_data = pd.read_csv("https://raw.githubusercontent.com/insaid2018/Term-1/master/Data/Projects/1000%20movies%20data.csv")

In [None]:
#movie_data=pd.read_csv("C:/Users/brahmishreem/Desktop/1000_movies_data.csv")

In [None]:
movie_data.head()

In [None]:
movie_data.shape

In [None]:
movie_data.info()

In [None]:
movie_data.describe()

In [None]:
movie_data.isnull().sum()

### 1. Data Pre-Profiling

In [None]:
movieprofile = pandas_profiling.ProfileReport(movie_data)
movieprofile.to_file(outputfile="movie_before_preprocessing.html")

### 2. Data Preprocessing


In [None]:
movie_data['Metascore'] = movie_data['Metascore'].fillna(movie_data['Metascore'].median())   #impute with the median, as the Metascore data is negatively skewed

In [None]:
movie_data['Metascore'].median()
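
The choice of the median for a skewed column can be checked with a small sketch. The values below are hypothetical stand-ins for `Metascore`, not taken from the data set. `Series.skew()` is negative when the long tail is on the left, and in that case the mean is pulled towards the tail while the median is not.

```python
import pandas as pd

# Hypothetical toy column standing in for Metascore (None = missing value)
scores = pd.Series([20.0, 55.0, 60.0, 65.0, 70.0, 72.0, None, 75.0, None, 80.0],
                   name="Metascore")

skewness = scores.skew()        # negative value => long tail on the left
mean_value = scores.mean()      # pulled down by the low outlier
median_value = scores.median()  # robust to the outlier

# Fill the missing entries with the median, as in the cell above
filled = scores.fillna(median_value)
print(skewness, mean_value, median_value)
```

If the skewness were close to zero, mean imputation would give nearly the same result.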

In [None]:
movie_data.info()

In [None]:
movie_data[movie_data['Revenue (Millions)']<400]['Revenue (Millions)'].plot.hist()

In [None]:
movie_data.columns

In [None]:
movie_data.rename(columns={'Revenue (Millions)': 'RevenueInM'}, inplace=True)
movie_data['RevenueInM'] = movie_data['RevenueInM'].fillna(movie_data['RevenueInM'].median())   #assign back instead of chained inplace, which newer pandas warns about

### 1) What is the relation between Year and Movies Produced/Registered?

In [None]:
movie_data['Year'].value_counts().sort_index().plot.bar()
plt.title('Count of movies produced year wise')

#### The number of movies produced/registered has increased almost five-fold from 2006 to 2016

### 2) Is there any relationship between duration and rating?

In [None]:
movie_data[['Runtime (Minutes)', 'Rating']].plot.scatter(x='Runtime (Minutes)', y='Rating')
plt.title('Relation between Rating and Duration of a movie using Scatter Plot')

In [None]:
#As the scatter plot has too many overlapping points, use a hexbin plot (a 2-D density heatmap) for better insight

movie_data.plot.hexbin(x='Runtime (Minutes)', y='Rating', gridsize=20)
plt.title('Relation between Rating and Duration of a movie using HexPlot')

#### Most average-rated (6-7) movies have a duration of 90-130 minutes. The graph above does not give much insight into whether duration affects the rating of a movie, but we can conclude that most movies run between 90 and 135 minutes.
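
To put a number on what the plots show, a correlation coefficient can be computed alongside them. The sketch below uses hypothetical toy values, not the movie data. It computes the Pearson correlation and a rank-based (Spearman-style) correlation obtained by ranking both columns first.

```python
import pandas as pd

# Hypothetical runtimes and ratings, for illustration only
toy = pd.DataFrame({
    "Runtime (Minutes)": [90, 100, 110, 120, 150],
    "Rating": [6.0, 6.5, 6.4, 7.0, 7.8],
})

# Pearson measures linear association
pearson = toy["Runtime (Minutes)"].corr(toy["Rating"])

# Correlating the ranks measures monotonic association
spearman = toy["Runtime (Minutes)"].rank().corr(toy["Rating"].rank())

print(pearson, spearman)
```

On the real data the same two calls would apply to `movie_data`. A value near zero would support the reading that duration tells us little about rating.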

In [None]:
#Function for mapping a numeric rating (0-10 scale) to a categorical class
def categorizeRating(x):
    if x >= 8.5:
        return "Excellent"
    elif x >= 7:
        return "Good"
    elif x >= 6.0:
        return "Average"
    elif x >= 4.5:
        return "Poor"
    else:
        return "Worst"          #also reached for NaN, as every comparison with NaN is False
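
The same thresholds as in `categorizeRating` can also be expressed with `pd.cut`, which bins a whole column at once. This is a sketch on hypothetical ratings chosen to sit on the class boundaries. One difference to keep in mind: `pd.cut` leaves missing values as NaN, whereas `categorizeRating` returns "Worst" for them.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings on a 0-10 scale, placed on and around the boundaries
ratings = pd.Series([3.0, 4.5, 6.0, 6.9, 7.0, 8.4, 8.5])

bins = [-np.inf, 4.5, 6.0, 7.0, 8.5, np.inf]
labels = ["Worst", "Poor", "Average", "Good", "Excellent"]

# right=False makes every bin closed on the left, e.g. [7.0, 8.5) -> "Good"
rating_class = pd.cut(ratings, bins=bins, labels=labels, right=False)
print(rating_class.tolist())
```

The result is an ordered categorical, so plots and sorts follow the order Worst to Excellent without passing an explicit `order` list.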

In [None]:
movie_data['MetacriticRating']= (movie_data['Metascore']/10)            #Rescale Metascore (0-100) to the 0-10 scale used by Rating

In [None]:
movie_data['RatingClass'] = movie_data['Rating'].apply(categorizeRating)         #Classifying Ratings into categories
movie_data['MetaScoreClass'] = movie_data['MetacriticRating'].apply(categorizeRating)         #Classifying MetacriticRating into categories

In [None]:
as_fig = sns.FacetGrid(movie_data,hue='RatingClass',aspect=5)
as_fig.map(sns.kdeplot,'Runtime (Minutes)',shade=True)
max_runtime = movie_data['Runtime (Minutes)'].max()               #longest runtime sets the upper x-axis limit
as_fig.set(xlim=(0,max_runtime))
as_fig.add_legend()
plt.title('Rating distribution against duration')

#### Excellent-rated movies show high variance in duration (the curve is platykurtic), while the worst-rated movies mostly range between 85-105 minutes.
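
The spread that the density curves suggest can be read off directly from per-class summary statistics. The sketch below uses hypothetical runtimes, not the movie data, and compares the standard deviation of runtime between two rating classes.

```python
import pandas as pd

# Hypothetical runtimes for two rating classes, for illustration only
toy = pd.DataFrame({
    "RatingClass": ["Excellent"] * 3 + ["Worst"] * 3,
    "Runtime (Minutes)": [90, 130, 170, 88, 95, 100],
})

# A larger std means the runtimes of that class are more spread out
spread = toy.groupby("RatingClass")["Runtime (Minutes)"].agg(["mean", "std", "min", "max"])
print(spread)
```

A wide, flat density curve corresponds to a large `std` in this table.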

In [None]:
as_fig = sns.FacetGrid(movie_data,hue='MetaScoreClass',aspect=5)
as_fig.map(sns.kdeplot,'Runtime (Minutes)',shade=True)
max_runtime = movie_data['Runtime (Minutes)'].max()               #longest runtime sets the upper x-axis limit
as_fig.set(xlim=(0,max_runtime))
as_fig.add_legend()
plt.title('MetaScoreClass distribution against duration')

#### The Metacritic rating distribution against duration is almost the same for each category, concentrated between 85-105 minutes, with the excellent movies having a slightly greater duration

### We cannot conclude much about whether duration plays any role in the success (ratings) of a movie

<p/>

In [None]:
sns.boxplot(x='Year',y='Runtime (Minutes)',data=movie_data)       #keyword arguments work across seaborn versions
plt.title("Duration and Year wise plot")

### The typical (median) duration of movies each year ranges between 100-120 minutes.

<p/>

In [None]:
movie_data['RatingClass'] = movie_data['Rating'].apply(categorizeRating)                      #Classifying Ratings into categories

In [None]:
movie_data['MetaScoreClass'] = movie_data['MetacriticRating'].apply(categorizeRating)         #Classifying MetacriticRating into categories

In [None]:
movie_data.head()

### 3) Which Year produced most hit movies (liked by audience)

In [None]:
good_movies = movie_data[movie_data['RatingClass'].isin(['Good', 'Excellent'])]    # Keep only the high rated movies

In [None]:
excellent_movies = movie_data[(movie_data['RatingClass']=='Excellent'  )] 

In [None]:
sns.countplot(x='Year',data=good_movies)
plt.title('Number of hit movies produced per year')

As per the graph, 2016 produced the largest number of hit movies. However, the counts are not normalized: the number of movies produced varies from year to year, so no conclusion can be drawn from raw counts alone.
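
The normalization can be sketched on toy data as the share of hits per year, that is, hits divided by all movies of that year. The values below are hypothetical. Taking the mean of a boolean column gives that share directly, and a year without any hit comes out as 0 rather than as a missing value.

```python
import pandas as pd

# Hypothetical ratings for two years, for illustration only
toy = pd.DataFrame({
    "Year":   [2006, 2006, 2007, 2007, 2007, 2007, 2007],
    "Rating": [8.6,  6.0,  7.5,  7.1,  5.0,  6.5,  7.2],
})

toy["is_hit"] = toy["Rating"] >= 7                       # same cut-off as the "Good" class

hit_count = toy.groupby("Year")["is_hit"].sum()          # raw number of hits per year
hit_share = toy.groupby("Year")["is_hit"].mean() * 100   # hits as a percentage of that year's movies
print(hit_count, hit_share, sep="\n")
```

Here 2007 has three times as many hits as 2006, but its share is only somewhat higher, which is the distinction the raw count plot hides.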

In [None]:
movie_data.groupby(['Year'])['Year'].count()

In [None]:
((good_movies.groupby(['Year'])['Year'].count()/movie_data.groupby(['Year'])['Year'].count())*100).plot.bar()
plt.title("Share of hit movies (rating 7 or above) produced each year, in %")

#### 1) 2007 had the highest share of hit movies, while 2006 had the highest share of top-rated (excellent) movies
#### 2) Although 2016 produced the highest quantity of movies, quality-wise it scored the lowest.

In [None]:
((excellent_movies.groupby(['Year'])['Year'].count()/movie_data.groupby(['Year'])['Year'].count())*100).plot.bar()
plt.title("Share of superhit movies (rating 8.5 or above) produced each year, in %")

In [None]:
f,ax = plt.subplots(1,2,figsize=(15,7))

((good_movies.groupby(['Year'])['Year'].count()/movie_data.groupby(['Year'])['Year'].count())*100).plot.pie(autopct='%1.1f%%', ax=ax[0])
((excellent_movies.groupby(['Year'])['Year'].count()/movie_data.groupby(['Year'])['Year'].count())*100).plot.pie(autopct='%1.1f%%', ax=ax[1])
ax[0].set_title('Year wise contribution in hit movies')
ax[1].set_title('Year wise contribution of only super hit movies')

#### <b>2007</b> produced the largest share of hit movies, followed by <b>2006 and 2009</b>
If we consider only the super hits, <b>2006 was the golden year in the film industry</b> and alone constituted a <b>37.5%</b> share of excellent rated movies<br>
2009, 2013 and 2015 had no superhit movies

<p>

### 4) Which Year generated maximum revenue?


In [None]:
movie_data.groupby("Year").RevenueInM.sum().reset_index()

In [None]:
movie_data.groupby("Year").RevenueInM.sum()/movie_data.groupby(['Year'])['Year'].count()

In [None]:
sns.barplot(x="Year", y="RevenueInM", data=movie_data, palette="summer")    #barplot shows the mean revenue per movie for each year
plt.title("Year wise average revenue per movie")

 1) 2016 generated a total revenue of <b>15626.270M</b><p>
 2) However, considering the number of movies released per year, <b>2009</b> and <b>2012</b> had the highest revenue per movie, while 2016 had the least in comparison</p> 
 3) In 2016, the share of hit and superhit movies released was the lowest, which may have directly impacted the revenue.
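
The difference between total revenue and revenue per movie can be sketched with a single grouped aggregation. The figures below are hypothetical and only illustrate how a year can lead on the total while trailing on the per-movie average.

```python
import pandas as pd

# Hypothetical revenues in millions, for illustration only
toy = pd.DataFrame({
    "Year":       [2009, 2009, 2016, 2016, 2016, 2016],
    "RevenueInM": [300.0, 100.0, 150.0, 50.0, 100.0, 200.0],
})

# Named aggregation: one row per year with total, per-movie mean and movie count
summary = toy.groupby("Year")["RevenueInM"].agg(total="sum", per_movie="mean", movies="count")
print(summary)
```

In this toy table 2016 has the larger total simply because it has more movies, while 2009 earns more per movie.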

In [None]:
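movie_data.select_dtypes(include='number').corr()    #correlation between the numeric columns only; text columns would raise an error in newer pandas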

<p>

### 5) What is the relationship between Votes and Year?

In [None]:
sns.barplot(x="Year", y="Votes", data=movie_data, palette="Set3")
plt.title("Relationship between Votes and Year")

##### 1) From the plot, 2012 has the highest average number of votes per movie.
##### 2) After 2012, the number of votes has been declining.
#### Could it be that other entertainment sources (Netflix/Amazon Originals, YouTube web series/channels, etc.) have gained more popularity and offer better content in recent years? If so, people's attention is spread more thinly, leading to lower votes and revenue.


In [None]:
movie_data.groupby("Year").Votes.sum()/movie_data.groupby(['Year'])['Year'].count()

In [None]:
### How do user ratings and critic ratings relate to the votes?

In [None]:
fig, ax =plt.subplots(1,2, figsize=(10,5))
a=sns.barplot(x="MetaScoreClass", y="Votes", data=movie_data, palette="pastel",  ax=ax[0], order=["Worst", "Poor","Average","Good", "Excellent"])
b=sns.barplot(x="RatingClass", y="Votes", data=movie_data, palette="pastel",  ax=ax[1], order=["Worst", "Poor","Average","Good", "Excellent"])
a.title.set_text('Votes and Metacritic Classification')    #left panel plots MetaScoreClass
b.title.set_text('Votes and Rating Classification')        #right panel plots RatingClass
plt.show()


#### For both critic scores and user ratings, the higher rated classes receive more votes: <b>the more the votes, the merrier the ratings</b>

<p/>

### 6) Which genre of movies is the most commonly produced?

##### Number of movies produced in each genre

In [None]:
#movie_data.groupby("PrimeGenre").PrimeGenre.count()
#Genre is a comma-separated string; build one copy of the data for each of the first three genre positions
movies1 = movie_data.assign(PrimeGenre1=movie_data['Genre'].str.split(',').str[0])
movies2 = movie_data.assign(PrimeGenre1=movie_data['Genre'].str.split(',').str[1])
movies3 = movie_data.assign(PrimeGenre1=movie_data['Genre'].str.split(',').str[2])

#remove leading white spaces
movies2['PrimeGenre1']=movies2['PrimeGenre1'].str.lstrip()
movies3['PrimeGenre1']=movies3['PrimeGenre1'].str.lstrip()

#stack the copies; rows without a genre at that position hold NaN and are skipped by groupby
movies=pd.concat([movies1,movies2,movies3], axis=0)
movies.groupby("PrimeGenre1").PrimeGenre1.count().sort_values()
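
The three-way split above can also be sketched with `str.split` followed by `DataFrame.explode`, which produces one row per listed genre without fixing the number of positions in advance. The titles and genres below are hypothetical. The same pattern would apply to the comma-separated `Actors` column.

```python
import pandas as pd

# Hypothetical movies with one to three comma-separated genres
toy = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Genre": ["Action,Adventure,Sci-Fi", "Drama", "Action, Drama"],
})

# Split into lists, then give every list element its own row
genres = toy.assign(PrimeGenre1=toy["Genre"].str.split(",")).explode("PrimeGenre1")

# Strip whitespace on both sides so "Drama" and " Drama" are counted together
genres["PrimeGenre1"] = genres["PrimeGenre1"].str.strip()

counts = genres["PrimeGenre1"].value_counts()
print(counts)
```

Unlike the concatenation approach, this creates no NaN placeholder rows for movies that list fewer genres.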

In [None]:
#del movies

In [None]:
fig, ax =plt.subplots( figsize=(20,7))
#sns.countplot('PrimeGenre',data=movie_data, palette="GnBu_d", order=movie_data['PrimeGenre'].value_counts().index)
sns.countplot(x='PrimeGenre1',data=movies, palette="GnBu_d", order=movies['PrimeGenre1'].value_counts().index)
plt.title("Movies produced in each genre")

##### 1) <b>Drama</b> is the clear winner with a total of <b>513</b> movies, followed by Action with 303 movies categorized under it <br/> 2) Western and Musical were the <b>least</b> produced genres.
The dominance of drama as a genre is perhaps not surprising when we consider the following:
* Drama is the cheapest genre to produce as movies don’t necessarily require special sets, costumes, locations, props, special/visual effects, etc.
* Drama has the broadest definition of all genres – everything that happens anywhere ever is a drama. Conversely, other genres have a higher bar for classification, such as the need for high-octane events for a movie to be classed as Action, scary events to be Horror, funny elements to be a Comedy, etc.

<p>
    <p>

### 7) Which genres received the highest positive reviews?

In [None]:
#df.nlargest(10, ['Weight']) 
(movies.groupby("PrimeGenre1").Metascore.sum()).sort_values()


In [None]:
fig, ax =plt.subplots( figsize=(20,5))
#genRating=((movies.groupby("PrimeGenre1").Metascore.sum()).sort_values(ascending=False)).plot.bar()
genRating=sns.countplot(x="PrimeGenre1", data=movies, palette="ch:.25",order = movies['PrimeGenre1'].value_counts().index)
for item in genRating.get_xticklabels():
    item.set_rotation(30)
plt.title('Count of movies in each Genre')       #countplot shows how many movies fall under each genre, not their ratings

#### Drama, Action, Comedy and Adventure are the top genres by total Metascore. As this is a sum over movies, it largely reflects how many movies each genre has.
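
Because a summed score grows with the number of movies, it can help to look at the mean score per genre next to the sum. The sketch below uses hypothetical scores in the same long format as `movies` (one row per movie and genre) to show how the two rankings can disagree.

```python
import pandas as pd

# Hypothetical long-format data: one row per (movie, genre) pair
toy = pd.DataFrame({
    "PrimeGenre1": ["Drama", "Drama", "Drama", "Animation"],
    "Metascore":   [60.0, 50.0, 55.0, 85.0],
})

stats = toy.groupby("PrimeGenre1")["Metascore"].agg(total="sum", mean="mean", movies="count")
print(stats.sort_values("mean", ascending=False))
```

In this toy table Drama leads on the total only because it has three movies, while Animation has the higher mean score.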

<p>

### 8) Which genre generated highest revenue?

In [None]:
movies.groupby("PrimeGenre1").RevenueInM.sum().sort_values().plot.barh(color=(0.0, 0.5, 0.0, 0.8))
plt.xlabel('Revenue', fontsize=5)                #barh draws the values along the x-axis
plt.ylabel('Genre', fontsize=5)                  #and the categories along the y-axis
plt.title('Market Share for Each Genre 2006-2016')
plt.show()

#### Adventure, Action, Drama & Comedy were the genres generating the highest revenues.

<p>

### 9) How does the director play a role in revenues and votes?

As the Director column has high cardinality, we consider only the directors who have directed at least 4 hit movies

Capture top directors

In [None]:
#Keep only the directors who have directed at least 4 hit movies
good_dir=good_movies.groupby("Director").filter(lambda x: len(x) > 3)
#good_dir.head()

In [None]:
good_dir.groupby(['Director'])['Director'].count()
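
The filter-then-summarize step can be sketched on toy data. The directors, ratings and revenues below are hypothetical, and the threshold `min_movies` is set to 2 only so that the small example has something to filter; the notebook uses at least 4.

```python
import pandas as pd

# Hypothetical hit movies, for illustration only
toy = pd.DataFrame({
    "Director":   ["A", "A", "A", "B", "B", "C"],
    "Rating":     [8.0, 7.5, 7.0, 7.2, 7.4, 9.0],
    "RevenueInM": [100.0, 200.0, 300.0, 50.0, 70.0, 500.0],
})

min_movies = 2  # hypothetical threshold for this toy example

# Number of movies of each row's director, aligned with the original rows
movies_per_director = toy.groupby("Director")["Director"].transform("count")
frequent = toy[movies_per_director >= min_movies]

# One row per remaining director with several summary statistics
summary = (
    frequent.groupby("Director")
    .agg(movies=("Rating", "size"),
         median_rating=("Rating", "median"),
         total_revenue=("RevenueInM", "sum"))
    .sort_values("median_rating", ascending=False)
)
print(summary)
```

Director C has the single highest rating but is dropped by the threshold, which is the trade-off of restricting the analysis to frequent directors.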

In [None]:
fig, ax =plt.subplots(1,2, figsize=(20,5))
sDV=sns.barplot(x="Director", y="Rating", data=good_dir.iloc[0:10,:], palette="summer",  ax=ax[0])        #iloc[0:10] takes the first 10 rows (movies), not 10 directors
sDR=sns.barplot(x="Director", y="RevenueInM", data=good_dir.iloc[0:10,:], palette="Blues",  ax=ax[1])
sDV.title.set_text('Director and Rating')
sDR.title.set_text('Director and Revenue')
for item in sDV.get_xticklabels():
    item.set_rotation(90)
for item in sDR.get_xticklabels():
    item.set_rotation(90)
plt.show()

#### The data suggests that Nolan's movies are the highest voted, while J.J. Abrams' movies earn more revenue

In [None]:
result=good_dir.groupby(["Director"])['Rating'].median().reset_index().sort_values('Rating', ascending=False)
result.iloc[0:4,:]  
#result

In [None]:
#del good_dir

In [None]:
sDR=sns.barplot(x="Director", y="Rating", data=result.iloc[0:10,:] , color="Orange")
plt.title("Top 10 Directors based on highest median ratings")
for item in sDR.get_xticklabels():
    item.set_rotation(90)

Christopher Nolan tops the scoreboard with the highest number of top-rated movies

In [None]:
result=good_dir.groupby(["Director"])['Metascore'].mean().reset_index().sort_values('Metascore', ascending=False)
result.iloc[0:4,:] 

In [None]:
sDR=sns.barplot(x="Director", y="Metascore", data=result.iloc[0:10,:] , color="Yellow")
plt.title("Top Directors with movies of highest positive reviews")
for item in sDR.get_xticklabels():
    item.set_rotation(90)

In [None]:
result=good_dir.groupby(["Director"])['RevenueInM'].sum().reset_index().sort_values('RevenueInM', ascending=False)
result.iloc[0:4,:]  

In [None]:
sDR=sns.barplot(x="Director", y="RevenueInM", data=result.iloc[0:10,:] , color="Blue")
plt.title("Top Directors with movies of highest revenues")
for item in sDR.get_xticklabels():
    item.set_rotation(90)

<p/>

In [None]:
fig, ax =plt.subplots(1,2, figsize=(20,5))
sDV=sns.barplot(x="Director", y="Votes", data=good_dir, palette="summer",  ax=ax[0])
sDR=sns.barplot(x="Director", y="RevenueInM", data=good_dir, palette="Blues",  ax=ax[1])
sDV.title.set_text('Director and Votes')
sDR.title.set_text('Director and Revenue')
for item in sDV.get_xticklabels():
    item.set_rotation(90)
for item in sDR.get_xticklabels():
    item.set_rotation(90)
plt.show()

### 10) Who are the top performing actors?

In [None]:
#dist1= good_movies['Actors'].str.split(',').str[0]
dist1 = good_movies.assign(Actor1=good_movies['Actors'].str.split(',').str[0])
dist2 = good_movies.assign(Actor1=good_movies['Actors'].str.split(',').str[1])
dist3 = good_movies.assign(Actor1=good_movies['Actors'].str.split(',').str[2])
dist4 = good_movies.assign(Actor1=good_movies['Actors'].str.split(',').str[3])
dist5 = good_movies.assign(Actor1=good_movies['Actors'].str.split(',').str[4])


In [None]:
dist1.count()

In [None]:
#remove leading white spaces
dist2['Actor1']=dist2['Actor1'].str.lstrip()
dist3['Actor1']=dist3['Actor1'].str.lstrip()

In [None]:
#concat all 3 datasets
actorPreProcess=pd.concat([dist1, dist2,dist3 ], axis=0)   

In [None]:
topActors_MovieData=actorPreProcess.groupby('Actor1').filter(lambda x: len(x) > 4)         #keep actors appearing in at least 5 hit movies

In [None]:
topactors=topActors_MovieData.groupby('Actor1')['Actor1'].count().sort_values(ascending=False)
topactors[topactors>6]

In [None]:
fig, ax =plt.subplots( figsize=(20,5))
actors=sns.countplot(x='Actor1',data=topActors_MovieData, order = topActors_MovieData['Actor1'].value_counts().index)
for item in actors.get_xticklabels():
    item.set_rotation(90)
plt.title("Top rated actors as per hit movies")

#### List of top actors with maximum hit movies<br/>
Robert Downey Jr.       <b>12</b> hit movies <br/>
Tom Hardy & Leonardo DiCaprio <b>10</b> hit movies <br/>
Brad Pitt               <b>9</b> hit movies <br/>
Ryan Gosling, Jake Gyllenhaal, Michael Fassbender, Amy Adams, Christian Bale   <b>8</b> hit movies <br/>
<p>
    <b>Amy Adams, Scarlett Johansson & Jennifer Lawrence</b> were some of the top actresses, having starred in the most hit movies

In [None]:
result=topActors_MovieData.groupby(["Actor1"])['RevenueInM'].median().reset_index().sort_values('RevenueInM', ascending=False)


In [None]:
fig, ax =plt.subplots( figsize=(20,5))
rev=sns.barplot(x="Actor1", y="RevenueInM", data=result, palette="pastel",order = result["Actor1"])
for item in rev.get_xticklabels():
    item.set_rotation(90)
plt.title("Median revenue of hit movies for the top actors")

##### Top 6 actors by median revenue (in millions) of their hit movies
Chris Evans	333.915<br>
Daniel Radcliffe	294.980<br>
Emma Watson	293.490<br>
Robert Downey Jr.	260.540<br>
Chris Pratt	257.760<br>
Martin Freeman	255.110<br>

In [None]:
m1=movie_data[movie_data['Actors'].str.contains('Chris Evans')]
m1.RevenueInM.sum()