# Project: Investigate a Dataset (The Movie Database)

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

In [None]:
<a id='intro'></a>
## Introduction
This data set contains information about roughly 10,000 movies collected from The Movie Database (TMDb), 
including user ratings and revenue.

Certain columns, like ‘cast’ and ‘genres’, contain multiple values separated by pipe (|) characters.
There are some odd characters in the ‘cast’ column; they are left as is in this analysis.
The final two columns, ending with “_adj”, show the budget and revenue of the associated movie in terms of 2010 dollars, 
accounting for inflation over time.

In [18]:
# Load the libraries used in this analysis.
import pandas as pd
import numpy as np
import operator 
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib

sns.set()

Using matplotlib backend: Qt5Agg


In [None]:
<a id='wrangling'></a>
## Data Wrangling

This section loads the data used in the analysis and cleans it.


In [3]:
# Load the data set.

df = pd.read_csv("D:/dev/DA/tmdb-movies.csv")

# Print the dimensions (rows, columns).

print(df.shape)

The TMDb dataset consists of 10,866 rows and 21 columns.

(10866, 21)


In [4]:
# Inspecting the first 10 rows from tmdb data

df.head(10)


Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,Twenty-two years after the events of Jurassic ...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,137999900.0,1392446000.0
1,76341,tt1392190,28.419936,150000000,378436354,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,http://www.madmaxmovie.com/,George Miller,What a Lovely Day.,...,An apocalyptic story set in the furthest reach...,120,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,5/13/15,6185,7.1,2015,137999900.0,348161300.0
2,262500,tt2908446,13.112507,110000000,295238201,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,http://www.thedivergentseries.movie/#insurgent,Robert Schwentke,One Choice Can Destroy You,...,Beatrice Prior must confront her inner demons ...,119,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,3/18/15,2480,6.3,2015,101200000.0,271619000.0
3,140607,tt2488496,11.173104,200000000,2068178225,Star Wars: The Force Awakens,Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...,http://www.starwars.com/films/star-wars-episod...,J.J. Abrams,Every generation has a story.,...,Thirty years after defeating the Galactic Empi...,136,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,12/15/15,5292,7.5,2015,183999900.0,1902723000.0
4,168259,tt2820852,9.335014,190000000,1506249360,Furious 7,Vin Diesel|Paul Walker|Jason Statham|Michelle ...,http://www.furious7.com/,James Wan,Vengeance Hits Home,...,Deckard Shaw seeks revenge against Dominic Tor...,137,Action|Crime|Thriller,Universal Pictures|Original Film|Media Rights ...,4/1/15,2947,7.3,2015,174799900.0,1385749000.0
5,281957,tt1663202,9.1107,135000000,532950503,The Revenant,Leonardo DiCaprio|Tom Hardy|Will Poulter|Domhn...,http://www.foxmovies.com/movies/the-revenant,Alejandro GonzÃ¡lez IÃ±Ã¡rritu,"(n. One who has returned, as if from the dead.)",...,"In the 1820s, a frontiersman, Hugh Glass, sets...",156,Western|Drama|Adventure|Thriller,Regency Enterprises|Appian Way|CatchPlay|Anony...,12/25/15,3929,7.2,2015,124199900.0,490314200.0
6,87101,tt1340138,8.654359,155000000,440603537,Terminator Genisys,Arnold Schwarzenegger|Jason Clarke|Emilia Clar...,http://www.terminatormovie.com/,Alan Taylor,Reset the future,...,"The year is 2029. John Connor, leader of the r...",125,Science Fiction|Action|Thriller|Adventure,Paramount Pictures|Skydance Productions,6/23/15,2598,5.8,2015,142599900.0,405355100.0
7,286217,tt3659388,7.6674,108000000,595380321,The Martian,Matt Damon|Jessica Chastain|Kristen Wiig|Jeff ...,http://www.foxmovies.com/movies/the-martian,Ridley Scott,Bring Him Home,...,"During a manned mission to Mars, Astronaut Mar...",141,Drama|Adventure|Science Fiction,Twentieth Century Fox Film Corporation|Scott F...,9/30/15,4572,7.6,2015,99359960.0,547749700.0
8,211672,tt2293640,7.404165,74000000,1156730962,Minions,Sandra Bullock|Jon Hamm|Michael Keaton|Allison...,http://www.minionsmovie.com/,Kyle Balda|Pierre Coffin,"Before Gru, they had a history of bad bosses",...,"Minions Stuart, Kevin and Bob are recruited by...",91,Family|Animation|Adventure|Comedy,Universal Pictures|Illumination Entertainment,6/17/15,2893,6.5,2015,68079970.0,1064192000.0
9,150540,tt2096673,6.326804,175000000,853708609,Inside Out,Amy Poehler|Phyllis Smith|Richard Kind|Bill Ha...,http://movies.disney.com/inside-out,Pete Docter,Meet the little voices inside your head.,...,"Growing up can be a bumpy road, and it's no ex...",94,Comedy|Animation|Family,Walt Disney Pictures|Pixar Animation Studios|W...,6/9/15,3935,8.0,2015,160999900.0,785411600.0


In [5]:
# Inspecting the last 10 rows from tmdb data
df.tail(10)

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
10856,20277,tt0061135,0.140934,0,0,The Ugly Dachshund,Dean Jones|Suzanne Pleshette|Charles Ruggles|K...,,Norman Tokar,A HAPPY HONEYMOON GOES TO THE DOGS!...When a G...,...,The Garrisons (Dean Jones and Suzanne Pleshett...,93,Comedy|Drama|Family,Walt Disney Pictures,2/16/66,14,5.7,1966,0.0,0.0
10857,5921,tt0060748,0.131378,0,0,Nevada Smith,Steve McQueen|Karl Malden|Brian Keith|Arthur K...,,Henry Hathaway,Some called him savage- and some called him sa...,...,Nevada Smith is the young son of an Indian mot...,128,Action|Western,Paramount Pictures|Solar Productions|Embassy P...,6/10/66,10,5.9,1966,0.0,0.0
10858,31918,tt0060921,0.317824,0,0,"The Russians Are Coming, The Russians Are Coming",Carl Reiner|Eva Marie Saint|Alan Arkin|Brian K...,,Norman Jewison,IT'S A PLOT! ...to make the world die laughing!!,...,"Without hostile intent, a Soviet sub runs agro...",126,Comedy|War,The Mirisch Corporation,5/25/66,11,5.5,1966,0.0,0.0
10859,20620,tt0060955,0.089072,0,0,Seconds,Rock Hudson|Salome Jens|John Randolph|Will Gee...,,John Frankenheimer,,...,A secret organisation offers wealthy people a ...,100,Mystery|Science Fiction|Thriller|Drama,Gibraltar Productions|Joel Productions|John Fr...,10/5/66,22,6.6,1966,0.0,0.0
10860,5060,tt0060214,0.087034,0,0,Carry On Screaming!,Kenneth Williams|Jim Dale|Harry H. Corbett|Joa...,,Gerald Thomas,Carry On Screaming with the Hilarious CARRY ON...,...,The sinister Dr Watt has an evil scheme going....,87,Comedy,Peter Rogers Productions|Anglo-Amalgamated Fil...,5/20/66,13,7.0,1966,0.0,0.0
10861,21,tt0060371,0.080598,0,0,The Endless Summer,Michael Hynson|Robert August|Lord 'Tally Ho' B...,,Bruce Brown,,...,"The Endless Summer, by Bruce Brown, is one of ...",95,Documentary,Bruce Brown Films,6/15/66,11,7.4,1966,0.0,0.0
10862,20379,tt0060472,0.065543,0,0,Grand Prix,James Garner|Eva Marie Saint|Yves Montand|Tosh...,,John Frankenheimer,Cinerama sweeps YOU into a drama of speed and ...,...,Grand Prix driver Pete Aron is fired by his te...,176,Action|Adventure|Drama,Cherokee Productions|Joel Productions|Douglas ...,12/21/66,20,5.7,1966,0.0,0.0
10863,39768,tt0060161,0.065141,0,0,Beregis Avtomobilya,Innokentiy Smoktunovskiy|Oleg Efremov|Georgi Z...,,Eldar Ryazanov,,...,An insurance agent who moonlights as a carthie...,94,Mystery|Comedy,Mosfilm,1/1/66,11,6.5,1966,0.0,0.0
10864,21449,tt0061177,0.064317,0,0,"What's Up, Tiger Lily?",Tatsuya Mihashi|Akiko Wakabayashi|Mie Hama|Joh...,,Woody Allen,WOODY ALLEN STRIKES BACK!,...,"In comic Woody Allen's film debut, he took the...",80,Action|Comedy,Benedict Pictures Corp.,11/2/66,22,5.4,1966,0.0,0.0
10865,22293,tt0060666,0.035919,19000,0,Manos: The Hands of Fate,Harold P. Warren|Tom Neyman|John Reynolds|Dian...,,Harold P. Warren,It's Shocking! It's Beyond Your Imagination!,...,A family gets lost on the road and stumbles up...,74,Horror,Norm-Iris,11/15/66,15,1.5,1966,127642.279154,0.0


In [None]:
## Data Cleaning 

## Dropping rows with null values in the cast and genres columns, keeping only movies that list a cast and a genre
## Removing rows where revenue_adj is zero
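The two cleaning rules above can be tried on a tiny frame before touching the real data. This is only a sketch: the `toy` frame and its values are made up for illustration and are not TMDb records. It uses `dropna` with a `subset` for the null check and a boolean mask for the zero-revenue check.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration only; not real TMDb records.
toy = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "cast": ["X|Y", np.nan, "Z", "W"],
    "genres": ["Drama", "Comedy", np.nan, "Action"],
    "revenue_adj": [1.5e6, 2.0e6, 3.0e6, 0.0],
})

# Keep rows that have both a cast and a genre.
cleaned = toy.dropna(subset=["cast", "genres"])

# Keep rows with a non-zero inflation-adjusted revenue.
cleaned = cleaned[cleaned["revenue_adj"] != 0]

print(cleaned["original_title"].tolist())
```

Only movie "A" passes all three checks here: "B" has no cast, "C" has no genre, and "D" has zero adjusted revenue.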

In [8]:
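## Data cleaning

# Keep only movies that have a cast and at least one genre.
df = df[df["cast"].notnull()]
df = df[df["genres"].notnull()]

# Remove movies whose inflation-adjusted revenue is zero.
df = df[df.revenue_adj != 0]

## Describing the data 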



In [10]:
df.describe()

Unnamed: 0,id,popularity,budget,revenue,runtime,vote_count,vote_average,release_year,budget_adj,revenue_adj
count,4845.0,4845.0,4845.0,4845.0,4845.0,4845.0,4845.0,4845.0,4845.0,4845.0
mean,44545.609288,1.046037,29598560.0,89305400.0,107.975645,436.639009,6.148751,2000.917853,35195110.0,115189100.0
std,72364.883496,1.357018,40524350.0,162130000.0,21.11423,806.724317,0.798296,11.570711,43766370.0,198913700.0
min,5.0,0.001117,0.0,2.0,15.0,10.0,2.1,1960.0,0.0,2.370705
25%,8279.0,0.388036,1700000.0,7770731.0,95.0,47.0,5.6,1994.0,2340083.0,10477490.0
50%,12151.0,0.68078,15000000.0,31899000.0,104.0,147.0,6.2,2004.0,20328010.0,44017460.0
75%,43949.0,1.210531,40000000.0,100000000.0,117.0,435.0,6.7,2010.0,49735160.0,131665000.0
max,417859.0,32.985763,425000000.0,2781506000.0,705.0,9767.0,8.4,2015.0,425000000.0,2827124000.0


In [25]:
## Looking at the column types and non-null counts
df.info()  

## The id and imdb_id columns both identify a movie, so one of them can be dropped.
## The popularity, budget, and revenue columns will be very useful for our analysis.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4845 entries, 0 to 10848
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    4845 non-null   int64  
 1   imdb_id               4845 non-null   object 
 2   popularity            4845 non-null   float64
 3   budget                4845 non-null   int64  
 4   revenue               4845 non-null   int64  
 5   original_title        4845 non-null   object 
 6   cast                  4845 non-null   object 
 7   homepage              1663 non-null   object 
 8   director              4844 non-null   object 
 9   tagline               4384 non-null   object 
 10  keywords              4612 non-null   object 
 11  overview              4845 non-null   object 
 12  runtime               4845 non-null   int64  
 13  genres                4845 non-null   object 
 14  production_companies  4752 non-null   object 
 15  release_date        

In [26]:
## Let's find out if the data set contains duplicates

df.duplicated().sum()

## We can conclude that we have one

1

In [27]:
## Since we now know that there is a duplicate, let's show where it is 

df[df.duplicated()]

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
2090,42194,tt0411951,0.59643,30000000,967000,TEKKEN,Jon Foo|Kelly Overton|Cary-Hiroyuki Tagawa|Ian...,,Dwight H. Little,Survival is no game,...,"In the year of 2039, after World Wars destroy ...",92,Crime|Drama|Action|Thriller|Science Fiction,Namco|Light Song Films,3/20/10,110,5.0,2010,30000000.0,967000.0


In [28]:
## I am going to drop that duplicate row 

df.drop([2090], axis = 0, inplace = True)


In [29]:
## Confirm whether the duplicate row is dropped 

df.duplicated().sum()

0
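Dropping the duplicate by its index label works here because the label is known. A hedged alternative, sketched on a made-up frame (the `toy` rows are illustrative; only the TEKKEN id and title come from the output above), is to let pandas find and remove repeated rows itself with `drop_duplicates`.

```python
import pandas as pd

# Hypothetical rows for illustration; the second and third rows are identical.
toy = pd.DataFrame({
    "id": [1, 42194, 42194],
    "original_title": ["Some Movie", "TEKKEN", "TEKKEN"],
})

# Number of rows that repeat an earlier row.
print(toy.duplicated().sum())

# Drop repeated rows, keeping the first occurrence, without naming an index label.
deduped = toy.drop_duplicates(keep="first")
print(deduped.duplicated().sum())
```

This form does not need to be edited if the row labels change after earlier filtering steps.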

In [30]:
## Checking null values 

df.isnull().sum()

id                         0
imdb_id                    0
popularity                 0
budget                     0
revenue                    0
original_title             0
cast                       0
homepage                3181
director                   1
tagline                  461
keywords                 233
overview                   0
runtime                    0
genres                     0
production_companies      93
release_date               0
vote_count                 0
vote_average               0
release_year               0
budget_adj                 0
revenue_adj                0
dtype: int64

In [31]:
## Dropping columns that contain many null values or are not needed for this analysis

df.drop(['imdb_id', 'homepage','tagline', 'keywords', 'overview', 'budget_adj','revenue_adj'], axis = 1, inplace = True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4844 entries, 0 to 10848
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    4844 non-null   int64  
 1   popularity            4844 non-null   float64
 2   budget                4844 non-null   int64  
 3   revenue               4844 non-null   int64  
 4   original_title        4844 non-null   object 
 5   cast                  4844 non-null   object 
 6   director              4843 non-null   object 
 7   runtime               4844 non-null   int64  
 8   genres                4844 non-null   object 
 9   production_companies  4751 non-null   object 
 10  release_date          4844 non-null   object 
 11  vote_count            4844 non-null   int64  
 12  vote_average          4844 non-null   float64
 13  release_year          4844 non-null   int64  
dtypes: float64(2), int64(6), object(6)
memory usage: 567.7+ KB


In [32]:
df.isnull().sum()

id                       0
popularity               0
budget                   0
revenue                  0
original_title           0
cast                     0
director                 1
runtime                  0
genres                   0
production_companies    93
release_date             0
vote_count               0
vote_average             0
release_year             0
dtype: int64

In [34]:
df.dropna(how = 'any', inplace = True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4750 entries, 0 to 10848
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    4750 non-null   int64  
 1   popularity            4750 non-null   float64
 2   budget                4750 non-null   int64  
 3   revenue               4750 non-null   int64  
 4   original_title        4750 non-null   object 
 5   cast                  4750 non-null   object 
 6   director              4750 non-null   object 
 7   runtime               4750 non-null   int64  
 8   genres                4750 non-null   object 
 9   production_companies  4750 non-null   object 
 10  release_date          4750 non-null   object 
 11  vote_count            4750 non-null   int64  
 12  vote_average          4750 non-null   float64
 13  release_year          4750 non-null   int64  
dtypes: float64(2), int64(6), object(6)
memory usage: 556.6+ KB


In [35]:
df.isnull().sum()

## Good to go now

id                      0
popularity              0
budget                  0
revenue                 0
original_title          0
cast                    0
director                0
runtime                 0
genres                  0
production_companies    0
release_date            0
vote_count              0
vote_average            0
release_year            0
dtype: int64

In [36]:
## Converting the release date column from string to datetime

df['release_date'] = pd.to_datetime(df['release_date'])
df['release_date'].head()

0   2015-06-09
1   2015-05-13
2   2015-03-18
3   2015-12-15
4   2015-04-01
Name: release_date, dtype: datetime64[ns]
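One thing worth checking after this conversion: the dates are stored with two-digit years (for example `2/16/66`), and the `%y` directive reads 00 to 68 as 2000 to 2068, so releases from the 1960s can come out a century late. Below is a minimal sketch of one way to guard against that, assuming the `release_year` column is trustworthy. The `sample` frame copies two date and year pairs from the tables above; the variable names are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    "release_date": ["6/9/15", "2/16/66"],
    "release_year": [2015, 1966],
})

# Parse with an explicit format; two-digit years 00-68 are read as 20xx.
parsed = pd.to_datetime(sample["release_date"], format="%m/%d/%y")

# Rebuild the date from the four-digit year plus the parsed month and day.
fixed = pd.to_datetime(pd.DataFrame({
    "year": sample["release_year"],
    "month": parsed.dt.month,
    "day": parsed.dt.day,
}))

print(parsed.dt.year.tolist())
print(fixed.dt.year.tolist())
```

Comparing `parsed.dt.year` with `release_year` on the full frame is a quick way to see how many rows are affected.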

<a id='eda'></a>
## Exploratory Data Analysis

### Research Question 1 (Actors with the most appearances)

In [75]:
# Count how many times each actor appears in a movie cast.

act_dict = {}
actors = df["cast"]
actors = actors.str.split("|")
actors = np.array(actors)

for actrList in actors:
    for actor in actrList:
        actor = actor.strip()  # trim surrounding whitespace
        if actor not in act_dict:
            act_dict[actor] = 1
        else:
            act_dict[actor] += 1
            
            
actor_dict_sorted = sorted(act_dict.items(), key = operator.itemgetter(1), reverse = True)

x_axis = list()
y_axis = list()

for item in actor_dict_sorted[0:20]:
    x_axis.append(item[0])
    y_axis.append(item[1])
    
sns.set(rc={'figure.figsize': (12,10)}, font_scale=1.4)
ax = sns.barplot(x=x_axis, y=y_axis, palette="Set3")

#x-axis rotation

for item in ax.get_xticklabels():
    item.set_rotation(85)
    

ax.set(xlabel='actor names', ylabel='number of appearances', title = 'Top 20 actors based on appearances in movies')
plt.show()
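The counting loop above can also be written with pandas string methods alone. This is a sketch, not a replacement for the cell: the `cast` series below is a small made-up sample in the same pipe-separated layout as the `cast` column, and `appearances` is an illustrative name.

```python
import pandas as pd

# Illustrative cast strings in the same pipe-separated layout as the `cast` column.
cast = pd.Series([
    "Tom Hardy|Charlize Theron",
    "Leonardo DiCaprio|Tom Hardy",
    "Charlize Theron| Tom Hardy",
])

appearances = (
    cast.str.split("|")   # each row becomes a list of names
        .explode()        # one row per (movie, actor) pair
        .str.strip()      # trim stray whitespace around names
        .value_counts()   # count appearances, most frequent first
)

# Keep the 20 most frequent names, as in the bar chart above.
top_actors = appearances.head(20)
print(top_actors)
```

The resulting series can be passed to the plotting call as `x=top_actors.index` and `y=top_actors.values`.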



### Research Question 2 (Is there a relationship between revenue, popularity, and rating?)

In [46]:
def plot_box(feature):
    df.boxplot(feature, vert=False, showfliers=False)
    

In [47]:
#Visualizing rating distribution
plot_box('vote_average')

In [48]:
plot_box('popularity')



In [49]:
# distribution of revenue

plot_box('revenue')


In [50]:
# Scatter plot of Popularity vs Revenue

df.plot(x='popularity', y='revenue', kind='scatter')
plt.title('Popularity Vs Revenue')
plt.xlabel('Popularity')
plt.ylabel('Revenue')

*c* argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with *x* & *y*.  Please use the *color* keyword-argument or provide a 2D array with a single row if you intend to specify the same RGB or RGBA value for all points.


Text(0, 0.5, 'Revenue')

In [51]:
## Popularity versus Rating using scatter plot

df.plot(x='vote_average', y='popularity', kind='scatter')
plt.title('Popularity Vs Rating')
plt.xlabel('Vote Average / Rating')
plt.ylabel('Popularity')

*c* argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with *x* & *y*.  Please use the *color* keyword-argument or provide a 2D array with a single row if you intend to specify the same RGB or RGBA value for all points.


Text(0, 0.5, 'Popularity')

In [52]:
## Revenue vs Rating using scatter plot

df.plot(x='vote_average', y='revenue', kind='scatter')
plt.title('Rating Vs Revenue')
plt.xlabel('Vote Average / Rating')
plt.ylabel('Revenue')

*c* argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with *x* & *y*.  Please use the *color* keyword-argument or provide a 2D array with a single row if you intend to specify the same RGB or RGBA value for all points.


Text(0, 0.5, 'Revenue')

In [None]:
## We can observe that the Rating vs Revenue and Popularity vs Rating plots are both skewed to the right
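The visual impression from the scatter plots can be backed by a number. The sketch below computes Pearson correlations on the first five rows shown in the `df.head(10)` output, so the values only illustrate the call and say nothing about the full data set. On the real frame, the same idea would be `df[['popularity', 'revenue', 'vote_average']].corr()`.

```python
import pandas as pd

# First five rows of the df.head(10) output above; far too few for real conclusions.
sample = pd.DataFrame({
    "popularity": [32.985763, 28.419936, 13.112507, 11.173104, 9.335014],
    "revenue": [1513528810, 378436354, 295238201, 2068178225, 1506249360],
    "vote_average": [6.5, 7.1, 6.3, 7.5, 7.3],
})

# Pairwise Pearson correlation between the three numeric columns.
corr = sample.corr(method="pearson")
print(corr)
```

Each coefficient lies between -1 and 1, and values near 0 mean little linear relationship.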


In [56]:
# Q3: What are the top ten most profitable movies?

# Profit is revenue minus budget; store the money columns as 64-bit integers.
df['profit'] = df['revenue'] - df['budget']
df['profit'] = df['profit'].astype(np.int64)
df['budget'] = df['budget'].astype(np.int64)
df['revenue'] = df['revenue'].astype(np.int64)


def top_10(col_name, size=10):
    """Plot the all-time top `size` movies for the given column."""
    # Sort the column in descending order and keep the top `size` rows.
    df_sorted = pd.DataFrame(df[col_name].sort_values(ascending=False))[:size]
    # Titles are matched to the sorted rows by index.
    df_sorted['original_title'] = df['original_title']
    plt.figure(figsize=(12, 6))
    # Mean of the column over all movies, drawn as a reference line.
    avg = np.mean(df[col_name])
    sns.barplot(x=col_name, y='original_title', data=df_sorted, label=col_name)
    plt.axvline(avg, color='k', linestyle='--', label='mean')
    if col_name in ('profit', 'budget', 'revenue'):
        plt.xlabel(col_name.capitalize() + ' (US dollars)')
    else:
        plt.xlabel(col_name.capitalize())
    plt.ylabel('')
    plt.title('Top ' + str(size) + ' Movies in: ' + col_name.capitalize())
    plt.legend()


top_10('profit')
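For a quick table instead of a chart, the same ranking can be sketched with `nlargest`. The frame below copies the budget and revenue of three movies from the `df.head(10)` output above; the names `movies` and `top` are illustrative.

```python
import pandas as pd

# Budget and revenue copied from three rows of the df.head(10) output above.
movies = pd.DataFrame({
    "original_title": [
        "Jurassic World",
        "Mad Max: Fury Road",
        "Star Wars: The Force Awakens",
    ],
    "budget": [150000000, 150000000, 200000000],
    "revenue": [1513528810, 378436354, 2068178225],
})
movies["profit"] = movies["revenue"] - movies["budget"]

# nlargest sorts by the given column and keeps the top rows with all their columns.
top = movies.nlargest(2, "profit")[["original_title", "profit"]]
print(top)
```

Unlike sorting a single column and re-attaching the titles by index, `nlargest` keeps every column of the selected rows together.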

<a id='conclusions'></a>
### Research Question 4 (Most common genres)



In [58]:
from matplotlib import gridspec


def split_count_data(col_name, size=15):
    """Plot the `size` most frequent values of a pipe-separated column."""
    # Join all rows into one string, then split it into individual values.
    data = df[col_name].str.cat(sep='|')
    data = pd.Series(data.split('|'))

    # Count how often each value occurs and keep the most frequent ones.
    count = data.value_counts(ascending=False)
    count_size = count.head(size)

    # Build a readable axis name, e.g. 'production_companies' -> 'Production Companies'.
    axis_name = ' '.join(part.capitalize() for part in col_name.split('_'))

    fig = plt.figure(figsize=(14, 6))

    # One row with two equal-width panels.
    gs = gridspec.GridSpec(1, 2, width_ratios=[2, 2])

    # Left panel: horizontal bar chart of the counts.
    ax0 = plt.subplot(gs[0])
    count_size.plot.barh(ax=ax0)
    plt.xlabel('Number of Movies')
    plt.ylabel(axis_name)
    plt.title('Top ' + str(size) + ' ' + axis_name + ' by Number of Movies')

    # Right panel: pie chart of the same counts.
    ax1 = plt.subplot(gs[1])

    # Increase the explode offset slice by slice; one entry per plotted value.
    explode = [0.015 * (i + 1) for i in range(len(count_size))]
    count_size.plot.pie(ax=ax1, autopct='%1.2f%%', shadow=True, startangle=0,
                        pctdistance=0.9, explode=explode)
    plt.title('Top ' + str(size) + ' ' + axis_name + ' in a Pie Chart')
    plt.xlabel('')
    plt.ylabel('')
    plt.axis('equal')
    plt.legend(loc=9, bbox_to_anchor=(1.4, 1))

In [60]:
split_count_data("genres")

In [62]:
## Which director made the best-rated movies?
## Only movies released in 2015 (the latest year in the data set) are considered.

df.loc[df['release_year'].idxmax()]

id                                                                 135397
popularity                                                      32.985763
budget                                                          150000000
revenue                                                        1513528810
original_title                                             Jurassic World
cast                    Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...
director                                                  Colin Trevorrow
runtime                                                               124
genres                          Action|Adventure|Science Fiction|Thriller
production_companies    Universal Studios|Amblin Entertainment|Legenda...
release_date                                          2015-06-09 00:00:00
vote_count                                                           5562
vote_average                                                          6.5
release_year                          

In [67]:
df_dir = df[df['release_year'] == 2015]


In [68]:
dire_data = df_dir.groupby('director')['vote_average'].mean()



In [69]:
dire_data

director
Aaron Moorhead|Justin Benson    6.5
Adam McKay                      7.3
Afonso Poyart                   6.2
Alan Taylor                     5.8
Alejandro AmenÃ¡bar             5.2
                               ... 
Wes Ball                        6.4
Will Canon                      5.0
Wim Wenders                     5.1
Woody Allen                     6.1
Yorgos Lanthimos                6.6
Name: vote_average, Length: 211, dtype: float64

In [71]:
## sorting them based on the ratings

sort_dire = dire_data.sort_values(ascending=False)

In [73]:
sort_dire.head()

director
Pete Docter                                     8.0
Lenny Abrahamson                                8.0
Tom McCarthy                                    7.8
Jared P. Scott|Peter D. Hutchison|Kelly Nyks    7.8
F. Gary Gray                                    7.7
Name: vote_average, dtype: float64

In [74]:
## Visualizing the top five directors by mean rating

plt.subplots(figsize = (10,6))
plt.bar(sort_dire.index[:5], sort_dire[:5])
plt.title('Top 5 Directors of 2015')
plt.xlabel('Directors')
plt.ylabel('Average rating for all movies released in 2015');
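A mean rating computed from a single film is a fragile basis for ranking directors. The sketch below, on made-up rows (the director names and ratings are hypothetical), reports the number of films next to the mean so that one-film averages are easy to spot.

```python
import pandas as pd

# Hypothetical directors and ratings, for illustration only.
toy = pd.DataFrame({
    "director": ["Director A", "Director A", "Director B", "Director C"],
    "vote_average": [7.0, 8.0, 8.0, 6.0],
})

# Mean rating and number of movies per director, best mean first.
summary = (
    toy.groupby("director")["vote_average"]
       .agg(mean_rating="mean", n_movies="count")
       .sort_values("mean_rating", ascending=False)
)
print(summary)
```

Here "Director B" ranks first on the strength of one film, which is the kind of case the `n_movies` column makes visible.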

## Conclusion 


The scatter plots above suggest that revenue tends to increase as movie popularity increases, which points to a positive relationship between revenue and popularity. The Revenue vs. Rating and Popularity vs. Rating plots are both skewed to the right.
 
There is a lot of information in this dataset, and the exploration could go on and on. One limitation is that some features, such as the revenue and budget columns, contain null and zero values. Rows with these values hinder the analysis and had to be removed, which may have affected the ranking of top actors. 

From the bar chart and the pie chart above, Drama is the most common genre; Drama, Comedy, Thriller, and Action are the four most frequently made genres. For movies released in 2015, the top five directors by mean rating are Pete Docter (8.0), Lenny Abrahamson (8.0), Tom McCarthy (7.8), the co-directors Jared P. Scott, Peter D. Hutchison, and Kelly Nyks (7.8), and F. Gary Gray (7.7).


