
# Project: Investigate TMDB Movie Dataset

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction


In this report, I present my investigation of the TMDB movie dataset and my findings, organised around a set of questions.

### Assumptions
1. According to the [TMDB Movie Data metadata page](https://www.kaggle.com/tmdb/tmdb-movie-metadata), actors are listed in descending order of billing. I therefore use the first name listed in the `cast` column as the leading actor of each movie when analysing the correlation between popularity and actor, and between revenue and actor.

2. The majority of movies have a single director. For movies with multiple directors, I consider all of them, rather than only the first one listed, when analysing the correlation between popularity and director. The same applies when analysing the correlation between revenue and director.

3. I will use `imdb_id` as the primary key of each movie. Rows with an empty or duplicated `imdb_id` are therefore treated as bad data and removed in the <a href="#cleaning">data cleaning</a> process.

4. I will treat values of zero in the `revenue` and `popularity` columns as missing. In the <a href="#cleaning">Data Cleaning</a> section I will therefore remove these rows from my datasets.
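
As a minimal sketch of assumption 4, the snippet below counts zeros per column and then filters them out. It runs on a small made-up frame named `toy` (a hypothetical stand-in, not the real dataset), so the numbers it produces say nothing about the TMDB data.

```python
import pandas as pd

# Hypothetical toy data; ids and values are made up for illustration only.
toy = pd.DataFrame({
    "imdb_id": ["tt1", "tt2", "tt3", "tt4"],
    "revenue": [100, 0, 250, 0],
    "popularity": [1.5, 0.2, 0.0, 3.1],
})

# Count how many zeros each column of interest contains.
zero_counts = (toy[["revenue", "popularity"]] == 0).sum()
print(zero_counts)

# Treat zero as "missing": keep only the rows with a non-zero revenue.
revenue_rows = toy[toy["revenue"] != 0]
print(revenue_rows)
```

Filtering each column separately, rather than dropping a row as soon as any column is zero, keeps as many rows as possible for each individual question.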

### Questions to be Answered
Here is the list of questions I am going to answer:

1. Who acted in the most popular movies?
2. Who acted in the movies with the most revenue?
3. Who directed the most popular movies?
4. Who directed the movies with the most revenue?
5. For an individual actor, is there any difference between popularity and revenue?
6. Does popularity have a positive correlation with vote_average?
7. In general, how are popularity and revenue related? (double line chart)

### DataFrame Instances to be Generated
To answer the above questions, I am going to create a set of new DataFrame instances:
1. `movies_df_original`: the original DataFrame instance into which all data in the CSV file is loaded.
2. `movies_df_imdb_id_cleaned`: the DataFrame instance without rows where `imdb_id` is `NaN` or duplicated.
3. `movies_df_revenue`: the DataFrame instance generated from `movies_df_imdb_id_cleaned` without rows where the `revenue` field is 0.
4. `movies_df_pop`: the DataFrame instance generated from `movies_df_imdb_id_cleaned` without rows where the `popularity` field is 0.
5. `movies_df_pop_vote`: the DataFrame instance generated from `movies_df_imdb_id_cleaned` without rows where the `popularity` or `vote_average` field is 0.
6. `movies_df_actor`: the DataFrame instance generated from `movies_df_imdb_id_cleaned` without rows where the `cast` field is `NaN`.
7. `movies_df_director`: the DataFrame instance generated from `movies_df_imdb_id_cleaned` without rows where the `director` field is `NaN`.

These DataFrame instances will be used in the <a href="#eda">Exploratory Data Analysis</a> section.




In [1]:
# Import the packages used in this notebook.
import sys

import pandas as pd
pd.options.display.max_columns = None  # show every column when displaying a DataFrame

import numpy as np
# threshold=np.nan raises a ValueError in recent NumPy releases;
# sys.maxsize also makes NumPy print arrays without truncation.
np.set_printoptions(threshold=sys.maxsize)

import matplotlib as mp
import unicodecsv

### to make sure all outputs will be displayed rather than only the output for the last expression.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

%matplotlib inline

<a id='wrangling'></a>
## Data Wrangling


### General Properties

In the cell below, I load the CSV file and display the first 5 rows and the data type of each column:

In [2]:
# Load the data, display the first few rows, and inspect the data type of each column
#   to look for missing or possibly errant data.
movies_df_original = pd.read_csv('tmdb-movies.csv')
movies_df_original.head()
movies_df_original.dtypes


Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,keywords,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,monster|dna|tyrannosaurus rex|velociraptor|island,Twenty-two years after the events of Jurassic ...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,137999900.0,1392446000.0
1,76341,tt1392190,28.419936,150000000,378436354,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,http://www.madmaxmovie.com/,George Miller,What a Lovely Day.,future|chase|post-apocalyptic|dystopia|australia,An apocalyptic story set in the furthest reach...,120,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,5/13/15,6185,7.1,2015,137999900.0,348161300.0
2,262500,tt2908446,13.112507,110000000,295238201,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,http://www.thedivergentseries.movie/#insurgent,Robert Schwentke,One Choice Can Destroy You,based on novel|revolution|dystopia|sequel|dyst...,Beatrice Prior must confront her inner demons ...,119,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,3/18/15,2480,6.3,2015,101200000.0,271619000.0
3,140607,tt2488496,11.173104,200000000,2068178225,Star Wars: The Force Awakens,Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...,http://www.starwars.com/films/star-wars-episod...,J.J. Abrams,Every generation has a story.,android|spaceship|jedi|space opera|3d,Thirty years after defeating the Galactic Empi...,136,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,12/15/15,5292,7.5,2015,183999900.0,1902723000.0
4,168259,tt2820852,9.335014,190000000,1506249360,Furious 7,Vin Diesel|Paul Walker|Jason Statham|Michelle ...,http://www.furious7.com/,James Wan,Vengeance Hits Home,car race|speed|revenge|suspense|car,Deckard Shaw seeks revenge against Dominic Tor...,137,Action|Crime|Thriller,Universal Pictures|Original Film|Media Rights ...,4/1/15,2947,7.3,2015,174799900.0,1385749000.0


id                        int64
imdb_id                  object
popularity              float64
budget                    int64
revenue                   int64
original_title           object
cast                     object
homepage                 object
director                 object
tagline                  object
keywords                 object
overview                 object
runtime                   int64
genres                   object
production_companies     object
release_date             object
vote_count                int64
vote_average            float64
release_year              int64
budget_adj              float64
revenue_adj             float64
dtype: object

In [3]:
### shape[0] is the total number of rows in the dataset
print ('Number of rows in the original movie dataset: %s' % movies_df_original.shape[0])

### Next I want to see the count of non-NaN rows of each column
print ('Number of non-NaN rows in each column:')
movies_df_original.count()

Number of rows in the original movie dataset: 10866
Number of non-NaN rows in each column:


id                      10866
imdb_id                 10856
popularity              10866
budget                  10866
revenue                 10866
original_title          10866
cast                    10790
homepage                 2936
director                10822
tagline                  8042
keywords                 9373
overview                10862
runtime                 10866
genres                  10843
production_companies     9836
release_date            10866
vote_count              10866
vote_average            10866
release_year            10866
budget_adj              10866
revenue_adj             10866
dtype: int64

The cell above reports the total number of rows, as well as the number of non-`NaN` cells in each column.

The output shows that `imdb_id` is not `NaN` in 10856 rows, whereas there are 10866 rows in total. This means that there are 10 rows whose `imdb_id` is `NaN`.

Later on, when analysing the correlation between popularity/revenue and actors, I will use `imdb_id` as the unique key to join datasets, so I do not want `imdb_id` to be `NaN`. I will therefore remove these rows and return a new DataFrame object `movies_df_imdb_id_cleaned` in the <a href="#cleaning">Data Cleaning</a> section.
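
The subtraction described above (total rows minus non-`NaN` rows) can be cross-checked against `isna()`. The following is only a sketch on a made-up frame called `toy`, which is a hypothetical stand-in for the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing keys.
toy = pd.DataFrame({
    "imdb_id": ["tt1", np.nan, "tt3", None],
    "title": ["A", "B", "C", "D"],
})

total_rows = len(toy)                          # every row, missing key or not
non_nan_rows = toy["imdb_id"].count()          # count() skips NaN/None
missing_by_subtraction = total_rows - non_nan_rows
missing_by_isna = toy["imdb_id"].isna().sum()  # direct count of missing keys

print(missing_by_subtraction, missing_by_isna)
```

Both numbers should always agree, which makes this a cheap sanity check before dropping rows.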

Next, I want to check whether there are any duplicate `imdb_id` values, ignoring the `NaN` ones:

In [4]:
movies_df_original.groupby('imdb_id').filter(lambda x: len(x) > 1)

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,keywords,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
2089,42194,tt0411951,0.59643,30000000,967000,TEKKEN,Jon Foo|Kelly Overton|Cary-Hiroyuki Tagawa|Ian...,,Dwight H. Little,Survival is no game,martial arts|dystopia|based on video game|mart...,"In the year of 2039, after World Wars destroy ...",92,Crime|Drama|Action|Thriller|Science Fiction,Namco|Light Song Films,3/20/10,110,5.0,2010,30000000.0,967000.0
2090,42194,tt0411951,0.59643,30000000,967000,TEKKEN,Jon Foo|Kelly Overton|Cary-Hiroyuki Tagawa|Ian...,,Dwight H. Little,Survival is no game,martial arts|dystopia|based on video game|mart...,"In the year of 2039, after World Wars destroy ...",92,Crime|Drama|Action|Thriller|Science Fiction,Namco|Light Song Films,3/20/10,110,5.0,2010,30000000.0,967000.0


The output shows that `imdb_id` tt0411951 appears in two rows. I assume that, since the two rows have the same `imdb_id`, they are identical in every column. In the <a href="#cleaning">Data Cleaning</a> section I will therefore remove the duplicated row from `movies_df_imdb_id_cleaned`.
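
The assumption that rows sharing an `imdb_id` are identical in every column can be tested rather than taken on trust. A possible sketch, using `DataFrame.duplicated` on a made-up frame `toy` (hypothetical data, where one repeated id deliberately differs in its title):

```python
import pandas as pd

# Hypothetical toy data: tt2 is a true duplicate, tt3 repeats with a different title.
toy = pd.DataFrame({
    "imdb_id": ["tt1", "tt2", "tt2", "tt3", "tt3"],
    "title": ["A", "B", "B", "C", "C (recut)"],
})

# All rows whose imdb_id occurs more than once (every copy is kept).
same_id = toy[toy.duplicated(subset="imdb_id", keep=False)]

# All rows that are identical to another row in every column.
fully_identical = toy[toy.duplicated(keep=False)]

# Ids that repeat although their rows differ somewhere.
conflicting_ids = sorted(set(same_id["imdb_id"]) - set(fully_identical["imdb_id"]))
print(conflicting_ids)
```

An empty `conflicting_ids` list would confirm that dropping the later copies loses no information.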

I am interested in the correlation between revenue and who acts in a movie, between popularity and who acts in a movie, and between popularity and vote_average. I will therefore look at the `revenue`, `popularity`, `cast`, `director` and `vote_average` columns to find out whether any data needs to be cleaned:

In [5]:

len(movies_df_original[movies_df_original['revenue'] == 0])
len(movies_df_original[movies_df_original['popularity'] == 0])
len(movies_df_original[movies_df_original['cast'].isnull()])
len(movies_df_original[movies_df_original['director'].isnull()])
len(movies_df_original[movies_df_original['vote_average'] == 0])



6016

0

76

44

0

The outputs show that the `revenue` column contains 6016 zeros, the `cast` column contains 76 `NaN` values and the `director` column contains 44 `NaN` values, while the `popularity` and `vote_average` columns contain no zeros. In the <a href="#cleaning">Data Cleaning</a> section, I will therefore remove these rows and generate new DataFrame instances without them.

<a id='cleaning'></a>
### Data Cleaning

In summary, in this section I will demonstrate the process of:
1. Removing rows from the original dataset `movies_df_original` where `imdb_id` is `NaN`, generating the new dataset `movies_df_imdb_id_cleaned`.
2. Removing rows from `movies_df_imdb_id_cleaned` where `imdb_id` is duplicated.
3. Creating a new dataset `movies_df_revenue` from the rows of `movies_df_imdb_id_cleaned` whose `revenue` field is not zero. This new dataset will be used for analysing the correlation between revenue and actor. Columns in the dataset are: `imdb_id`, `revenue`.
4. Creating a new dataset `movies_df_pop` from the rows of `movies_df_imdb_id_cleaned` whose `popularity` field is not zero. This new dataset will be used for analysing the correlation between popularity and actor. Columns in the dataset are: `imdb_id`, `popularity`.
5. Creating a new dataset `movies_df_pop_vote` from the rows of `movies_df_imdb_id_cleaned` whose `popularity` and `vote_average` fields are both not zero. This new dataset will be used for analysing the correlation between popularity and vote_average. Columns in the dataset are: `imdb_id`, `popularity` and `vote_average`.
6. Creating a new dataset `movies_df_actor` from the rows of `movies_df_imdb_id_cleaned` whose `cast` field is not `NaN`. Note that only the first actor listed will be selected. Columns in the dataset are: `imdb_id`, `cast` (holding the first listed actor).
7. Creating a new dataset `movies_df_director` from the rows of `movies_df_imdb_id_cleaned` whose `director` field is not `NaN`. Columns in the dataset are: `imdb_id`, `director`.


In [6]:
#1. Removing rows from the original dataset movies_df_original where imdb_id is NaN and generate new dataset movies_df_imdb_id_cleaned.
#Firstly returning all rows where `imdb_id` = `NaN`:
movies_df_original[movies_df_original['imdb_id'].isnull()]

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,keywords,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
548,355131,,0.108072,0,0,Sense8: Creating the World,Tuppence Middleton|Bae Doona |Brian J. Smith|A...,,,,sexuality|superhuman|superpower|making of|soci...,,25,Documentary|Science Fiction,Netflix,8/10/15,12,7.5,2015,0.0,0.0
997,287663,,0.330431,0,0,Star Wars Rebels: Spark of Rebellion,Freddie Prinze Jr.|Vanessa Marshall|Steve Blum...,,Steward Lee|Steven G. Lee,,,"A Long Time Ago In A Galaxy Far, Far Away… A...",44,,,10/3/14,13,6.8,2014,0.0,0.0
1528,15257,,0.607851,0,0,Hulk vs. Wolverine,Fred Tatasciore|Bryce Johnson|Steve Blum|Nolan...,,Frank Paur,,marvel comic|superhero|wolverine|hulk|norse my...,Department H sends in Wolverine to track down ...,38,Animation|Action|Science Fiction,Marvel Studios,1/27/09,38,6.9,2009,0.0,0.0
1750,101907,,0.256975,0,0,Hulk vs. Thor,Graham McTavish|Fred Tatasciore|Matthew Wolf|J...,,Sam Liu,A Battle Between God and Monster,marvel comic|superhero|hulk|norse mythology|su...,"For ages, Odin has protected his kingdom of As...",41,Action|Animation|Fantasy|Science Fiction,Marvel Studios,1/27/09,38,6.4,2009,0.0,0.0
2401,45644,,0.067753,0,0,Opeth: In Live Concert At The Royal Albert Hall,"Mikael Åkerfeldt|Martin ""Axe"" Axenrot|Martin ...",http://www.opeth.com,,"The Loyal Disharmonic Orchestra, Conducted By ...",,As part of the ongoing celebration of their 20...,163,Music,,9/21/10,10,8.6,2010,0.0,0.0
4797,369145,,0.167501,0,0,Doctor Who: The Snowmen,Matt Smith|Jenna Coleman|Richard E. Grant|Ian ...,,,,,"Christmas Eve, 1892, and the falling snow is t...",60,,BBC Television UK,12/25/12,10,7.8,2012,0.0,0.0
4872,269177,,0.090552,0,0,Party Bercy,Florence Foresti,,,,,Florence Foresti is offered Bercy tribute to a...,120,Comedy,TF1 Vidéo,9/23/12,15,6.4,2012,0.0,0.0
6071,279954,,0.004323,500,0,Portal: Survive!,Monique Blanchard|Bradley Mixon,https://www.kickstarter.com/projects/colinandc...,Connor McGuire|Colin McGuire,The Cake is a Lie,portal|aperture,"A short, live action fan film by Collin and Co...",7,Action|Science Fiction,,10/8/13,11,7.5,2013,468.016676,0.0
7527,50127,,0.570337,0,0,Fallen: The Journey,Paul Wesley|Fernanda Andrade|Tom Skerritt|Rick...,,Mikael Salomon,,,"A year later, Aaron is still traveling around ...",80,Action|Adventure|Drama|Fantasy|Family,,1/1/07,11,7.3,2007,0.0,0.0
7809,50128,,0.060795,0,0,Fallen: The Destiny,Paul Wesley|Fernanda Andrade|Tom Skerritt|Rick...,,Mikael Salomon,,,"Aaron and Azazel defeat the Powers, and force ...",80,Adventure|Fantasy|Drama|Action|Science Fiction,,1/1/07,13,7.0,2007,0.0,0.0


In [7]:
#Now drop all rows whose imdb_id is NaN (10 rows) and return a new DataFrame instance movies_df_imdb_id_cleaned:
movies_df_imdb_id_cleaned = movies_df_original.dropna(subset=['imdb_id'])

#Check if rows are dropped. The output should return nothing:
movies_df_imdb_id_cleaned[movies_df_imdb_id_cleaned['imdb_id'].isnull()]

#Check count of non-NaN rows of each column of movies_df_imdb_id_cleaned and confirm NaN rows of imdb_id column are deleted:
movies_df_imdb_id_cleaned.count()

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,keywords,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj


id                      10856
imdb_id                 10856
popularity              10856
budget                  10856
revenue                 10856
original_title          10856
cast                    10780
homepage                 2934
director                10816
tagline                  8039
keywords                 9369
overview                10853
runtime                 10856
genres                  10835
production_companies     9831
release_date            10856
vote_count              10856
vote_average            10856
release_year            10856
budget_adj              10856
revenue_adj             10856
dtype: int64

In [8]:
#2. Remove rows from movies_df_imdb_id_cleaned where imdb_id is duplicated.
# First, check whether any imdb_id in movies_df_imdb_id_cleaned is duplicated
(movies_df_imdb_id_cleaned.groupby('imdb_id').filter(lambda x: len(x) > 1))
# Then I will drop the second row as duplicate and keep the top row
movies_df_imdb_id_cleaned = movies_df_imdb_id_cleaned.drop_duplicates(subset='imdb_id', keep='first')

# Double-check that there is no duplicate imdb_id anymore. This should return nothing:
(movies_df_imdb_id_cleaned.groupby('imdb_id').filter(lambda x: len(x) > 1))

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,keywords,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
2089,42194,tt0411951,0.59643,30000000,967000,TEKKEN,Jon Foo|Kelly Overton|Cary-Hiroyuki Tagawa|Ian...,,Dwight H. Little,Survival is no game,martial arts|dystopia|based on video game|mart...,"In the year of 2039, after World Wars destroy ...",92,Crime|Drama|Action|Thriller|Science Fiction,Namco|Light Song Films,3/20/10,110,5.0,2010,30000000.0,967000.0
2090,42194,tt0411951,0.59643,30000000,967000,TEKKEN,Jon Foo|Kelly Overton|Cary-Hiroyuki Tagawa|Ian...,,Dwight H. Little,Survival is no game,martial arts|dystopia|based on video game|mart...,"In the year of 2039, after World Wars destroy ...",92,Crime|Drama|Action|Thriller|Science Fiction,Namco|Light Song Films,3/20/10,110,5.0,2010,30000000.0,967000.0


Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,keywords,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
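
An empty result from the `groupby`/`filter` check shows that no duplicates remain. A lighter-weight alternative, sketched here on a made-up Series `ids` (hypothetical values), is the `is_unique` property, which returns a single boolean that can be used in an `assert`:

```python
import pandas as pd

# Hypothetical key column with one repeated id.
ids = pd.Series(["tt1", "tt2", "tt2", "tt3"], name="imdb_id")

# Keep the first occurrence of each id, mirroring keep='first' above.
deduped = ids.drop_duplicates(keep="first")

print(ids.is_unique)      # False: tt2 appears twice
print(deduped.is_unique)  # True: each id now appears once
```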


In [9]:
# Helper functions used to extract data:

# return_cleaned_data_for_num_column: returns a new DataFrame instance containing only imdb_id and
# one numerical column of interest, keeping the rows whose value in that column differs from the
# passed-in numerical parameter compared_value.

def return_cleaned_data_for_num_column(column_interested, compared_value):
    df_column_interested = movies_df_imdb_id_cleaned[['imdb_id', column_interested]].copy()
    return df_column_interested[df_column_interested[column_interested] != compared_value]

# return_cleaned_data_for_char_column: returns a new DataFrame instance containing only imdb_id and
# one non-numerical column of interest, keeping the rows whose value in that column is not null.
def return_cleaned_data_for_char_column(column_interested):
    df_column_interested = movies_df_imdb_id_cleaned[['imdb_id', column_interested]].copy()
    return df_column_interested[df_column_interested[column_interested].notnull()]

# get_first_actor: returns the first actor listed in a '|'-separated cast string.
def get_first_actor(actors):
    return actors.split('|')[0]


In [10]:

#3. Create a new dataset movies_df_revenue from the rows of movies_df_imdb_id_cleaned whose revenue field is not zero.
#   This new dataset will be used for analysing the correlation between revenue and actor.
#   Columns in the dataset are: imdb_id, revenue.
movies_df_revenue = return_cleaned_data_for_num_column('revenue', 0)
movies_df_revenue.head()

Unnamed: 0,imdb_id,revenue
0,tt0369610,1513528810
1,tt1392190,378436354
2,tt2908446,295238201
3,tt2488496,2068178225
4,tt2820852,1506249360


In [11]:
# 4. Create a new dataset movies_df_pop from the rows of movies_df_imdb_id_cleaned whose popularity field is not zero.
#    This new dataset will be used for analysing the correlation between popularity and actor.
#    Columns in the dataset are: imdb_id, popularity.
movies_df_pop = return_cleaned_data_for_num_column('popularity', 0)
movies_df_pop.head()

Unnamed: 0,imdb_id,popularity
0,tt0369610,32.985763
1,tt1392190,28.419936
2,tt2908446,13.112507
3,tt2488496,11.173104
4,tt2820852,9.335014


In [37]:
# 5. Create a new dataset movies_df_pop_vote from the rows of movies_df_imdb_id_cleaned whose popularity and vote_average fields are both not zero.
#    This new dataset will be used for analysing the correlation between popularity and vote_average.
#    Columns in the dataset are: imdb_id, popularity and vote_average.
movies_df_pop_vote = movies_df_pop.merge(return_cleaned_data_for_num_column('vote_average',0), on=['imdb_id'], how='inner')
movies_df_pop_vote.head()

Unnamed: 0,imdb_id,popularity,vote_average
0,tt0369610,32.985763,6.5
1,tt1392190,28.419936,7.1
2,tt2908446,13.112507,6.3
3,tt2488496,11.173104,7.5
4,tt2820852,9.335014,7.3
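
Because `imdb_id` is meant to be unique on both sides of this join, the merge can be asked to enforce that. The sketch below uses the `validate` argument of `DataFrame.merge` on two made-up frames, `pop` and `vote` (hypothetical names and values); it assumes a pandas version that supports `validate`.

```python
import pandas as pd

# Hypothetical inputs: each imdb_id appears once per frame.
pop = pd.DataFrame({"imdb_id": ["tt1", "tt2", "tt3"], "popularity": [1.0, 2.0, 3.0]})
vote = pd.DataFrame({"imdb_id": ["tt1", "tt3", "tt4"], "vote_average": [6.5, 7.1, 5.0]})

# Inner join keeps only ids present in both frames; validate checks key uniqueness.
merged = pop.merge(vote, on="imdb_id", how="inner", validate="one_to_one")
print(merged)

# With a repeated key on one side, the same merge refuses to run.
vote_with_dup = pd.concat([vote, vote.iloc[[0]]], ignore_index=True)
raised = False
try:
    pop.merge(vote_with_dup, on="imdb_id", how="inner", validate="one_to_one")
except ValueError:  # pandas' MergeError is a subclass of ValueError
    raised = True
print(raised)
```

Failing loudly here is preferable to a join that silently multiplies rows and skews later averages.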


In [12]:
# 6. Create a new dataset `movies_df_actor` from the rows of `movies_df_imdb_id_cleaned` whose `cast` field is not `NaN`.
#    Note that only the first actor listed is selected, so there should be no duplicate imdb_id in this dataset.
#    Columns in the dataset are: `imdb_id`, `cast` (holding the first listed actor).
movies_df_actor = return_cleaned_data_for_char_column('cast')
movies_df_actor['cast'] = movies_df_actor['cast'].apply(get_first_actor)
movies_df_actor.head()

Unnamed: 0,imdb_id,cast
0,tt0369610,Chris Pratt
1,tt1392190,Tom Hardy
2,tt2908446,Shailene Woodley
3,tt2488496,Harrison Ford
4,tt2820852,Vin Diesel
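
The same first-actor extraction can also be written with pandas' vectorised string methods instead of `apply`. This is only an alternative sketch; the Series `cast` below is a small stand-in built from three cast strings that appear in the outputs above.

```python
import pandas as pd

# Stand-in for the cast column: '|'-separated names, or a single name.
cast = pd.Series([
    "Chris Pratt|Bryce Dallas Howard",
    "Tom Hardy|Charlize Theron",
    "Florence Foresti",
])

# Split every string on '|' and take element 0 of each resulting list.
first_actor = cast.str.split("|").str[0]
print(first_actor.tolist())
```

Unlike a plain Python function passed to `apply`, the `.str` accessor propagates `NaN` instead of raising, so it would also work before the `NaN` rows are removed.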


In [13]:
# 7. Create a new dataset movies_df_director from the rows of movies_df_imdb_id_cleaned whose director field is not NaN.
#    Columns in the dataset are: imdb_id, director.
#    Note that all directors are selected, so there can be multiple rows with the same imdb_id.
#    References: https://stackoverflow.com/questions/12680754/split-explode-pandas-dataframe-string-entry-to-separate-rows
#                http://nikgrozev.com/2015/07/01/reshaping-in-pandas-pivot-pivot-table-stack-and-unstack-explained-with-pictures/

# DataFrame in which the director list has not been split yet
movies_df_director_temp = return_cleaned_data_for_char_column('director')

# What the line below does:
# a. Splits each director string into a list and builds a DataFrame indexed by imdb_id, whose
#    columns are numbered from 0 upwards (one column per director position).
# b. Uses stack() to pivot the DataFrame so that the column labels become the innermost row index.
#    The result is a Series, since imdb_id and the former column label together form the new index.
#    After reset_index(), its values end up in a single column labelled 0.
director_series = pd.DataFrame(movies_df_director_temp['director'].str.split('|').tolist(),index=movies_df_director_temp.imdb_id).stack()

# Reset the index so that imdb_id becomes an ordinary column again, then select
# the imdb_id column and the split director column to build a new DataFrame instance
movies_df_director = director_series.reset_index()[['imdb_id', 0]]

# Rename the columns to 'imdb_id' and 'director'
movies_df_director.columns = ['imdb_id', 'director']
movies_df_director.head()

Unnamed: 0,imdb_id,director
0,tt0369610,Colin Trevorrow
1,tt1392190,George Miller
2,tt2908446,Robert Schwentke
3,tt2488496,J.J. Abrams
4,tt2820852,James Wan
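
Newer pandas releases (0.25 and later) provide `DataFrame.explode`, which turns each element of a list-valued cell into its own row. The sketch below shows how the split-and-stack step could be written with it, assuming such a version is installed. The frame `directors` is a made-up stand-in with hypothetical ids.

```python
import pandas as pd

# Hypothetical input: one movie with a single director, one with two.
directors = pd.DataFrame({
    "imdb_id": ["tt_a", "tt_b"],
    "director": ["Colin Trevorrow", "Connor McGuire|Colin McGuire"],
})

exploded = (
    directors
    .assign(director=directors["director"].str.split("|"))  # string -> list of names
    .explode("director")                                     # one row per list element
    .reset_index(drop=True)                                  # discard the repeated index
)
print(exploded)
```

The result has the same `imdb_id`/`director` layout as `movies_df_director`, with `tt_b` appearing once per director.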


<a id='eda'></a>
## Exploratory Data Analysis

I now have a set of datasets that were derived from the original dataset and cleaned:
1. movies_df_revenue
2. movies_df_pop
3. movies_df_pop_vote
4. movies_df_actor
5. movies_df_director

Each dataset contains a subset of columns of the original dataset. I will use these datasets for my data exploration. 
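
As a rough illustration of how these per-column datasets could be combined through `imdb_id`, the sketch below joins an actor table to a popularity table and ranks actors. Everything in it is an assumption made for illustration: the frames `actor` and `pop` hold made-up values, and ranking by mean popularity is just one possible metric, not necessarily the one used in the analysis.

```python
import pandas as pd

# Hypothetical stand-ins for movies_df_actor and movies_df_pop.
actor = pd.DataFrame({
    "imdb_id": ["tt1", "tt2", "tt3"],
    "cast": ["Actor A", "Actor B", "Actor A"],
})
pop = pd.DataFrame({
    "imdb_id": ["tt1", "tt2", "tt3"],
    "popularity": [4.0, 1.0, 2.0],
})

# Join on the shared key, then average popularity per leading actor.
joined = actor.merge(pop, on="imdb_id", how="inner")
mean_pop = (
    joined.groupby("cast")["popularity"]
    .mean()
    .sort_values(ascending=False)
)
print(mean_pop)
```

An inner join drops movies that are missing from either table, so the ranking only covers movies that have both a listed actor and a non-zero popularity.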

### Research Question 1: Who were acting in the most popular movies?

In [14]:
# Use this, and more code cells, to explore your data. Don't forget to add
#   Markdown cells to document your observations and findings.


### Research Question 2: Who were acting in the movies with the most revenue?

In [15]:
# Continue to explore the data to address your additional research
#   questions. Add more headers as needed if you have more questions to
#   investigate.


<a id='conclusions'></a>
## Conclusions
