# Project: Investigate The Movie Database (TMDb) Dataset

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

> **Tip**: In this section of the report, provide a brief introduction to the dataset you've selected for analysis. At the end of this section, describe the questions that you plan on exploring over the course of the report. Try to build your report around the analysis of at least one dependent variable and three independent variables.
>
> If you haven't yet selected and downloaded your data, make sure you do that first before coming back here. If you're not sure what questions to ask right now, then make sure you familiarize yourself with the variables and the dataset context for ideas of what to explore.

In [1]:
import pandas as pd
import datetime as dt
import locale
locale.setlocale(locale.LC_ALL, '')     # set Python's locale to the system default (UK in my case)
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

<a id='wrangling'></a>
## Data Wrangling

> **Tip**: In this section of the report, you will load in the data, check for cleanliness, and then trim and clean your dataset for analysis. Make sure that you document your steps carefully and justify your cleaning decisions.

### General Properties

In [2]:
# Load your data and print out a few lines. Perform operations to inspect data
#   types and look for instances of missing or possibly errant data.
tmdb = pd.read_csv("tmdb-movies.csv")       #load data
print(tmdb.iloc[0])                         # print column names and first row (0)
tmdb.head                                   # no parentheses: echoes the bound method, whose repr shows the start and end of the data

id                                                                 135397
imdb_id                                                         tt0369610
popularity                                                        32.9858
budget                                                          150000000
revenue                                                        1513528810
original_title                                             Jurassic World
cast                    Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...
homepage                                    http://www.jurassicworld.com/
director                                                  Colin Trevorrow
tagline                                                 The park is open.
keywords                monster|dna|tyrannosaurus rex|velociraptor|island
overview                Twenty-two years after the events of Jurassic ...
runtime                                                               124
genres                          Action

<bound method DataFrame.head of            id    imdb_id  popularity     budget     revenue  \
0      135397  tt0369610   32.985763  150000000  1513528810   
1       76341  tt1392190   28.419936  150000000   378436354   
2      262500  tt2908446   13.112507  110000000   295238201   
3      140607  tt2488496   11.173104  200000000  2068178225   
4      168259  tt2820852    9.335014  190000000  1506249360   
5      281957  tt1663202    9.110700  135000000   532950503   
6       87101  tt1340138    8.654359  155000000   440603537   
7      286217  tt3659388    7.667400  108000000   595380321   
8      211672  tt2293640    7.404165   74000000  1156730962   
9      150540  tt2096673    6.326804  175000000   853708609   
10     206647  tt2379713    6.200282  245000000   880674609   
11      76757  tt1617661    6.189369  176000003   183987723   
12     264660  tt0470752    6.118847   15000000    36869414   
13     257344  tt2120120    5.984995   88000000   243637091   
14      99861  tt239542

>The '.head' output shows that budget and revenue data are missing for older films, but as these columns may still be useful, I am keeping them.
>The homepage, tagline, keywords and overview columns hold long text that is unlikely to provide usable data, so I plan to drop them to make the table easier to read. I am also unlikely to use the imdb_id column, so it will be dropped too.
>
>Note: Only run this cell once, otherwise you will get an error as it tries to delete the columns again.

In [3]:
del tmdb['homepage']
del tmdb['tagline']
del tmdb['keywords']
del tmdb['overview']
del tmdb['imdb_id']
print(tmdb.iloc[0])                         # print column names and first row (0)

id                                                                 135397
popularity                                                        32.9858
budget                                                          150000000
revenue                                                        1513528810
original_title                                             Jurassic World
cast                    Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...
director                                                  Colin Trevorrow
runtime                                                               124
genres                          Action|Adventure|Science Fiction|Thriller
production_companies    Universal Studios|Amblin Entertainment|Legenda...
release_date                                                       6/9/15
vote_count                                                           5562
vote_average                                                          6.5
release_year                          
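
The "only run this cell once" caveat can be avoided by dropping columns in a way that tolerates labels that are already gone. The sketch below is one possible approach, not the method used in this notebook: it works on a small made-up frame (`demo` and its values are hypothetical) and relies on `DataFrame.drop` with `errors="ignore"`.

```python
import pandas as pd

# Toy frame standing in for the TMDb data (hypothetical values)
demo = pd.DataFrame({
    "id": [1, 2],
    "homepage": ["http://a.example", None],
    "tagline": ["x", "y"],
    "budget": [100, 0],
})

unwanted = ["homepage", "tagline", "keywords", "overview", "imdb_id"]

# errors="ignore" skips labels that are not present, so the call can be repeated
demo = demo.drop(columns=unwanted, errors="ignore")
demo = demo.drop(columns=unwanted, errors="ignore")  # second run changes nothing

print(sorted(demo.columns))
```

Because missing labels are skipped silently, a typo in a column name would also go unnoticed, so it is worth checking the remaining columns afterwards.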

>The cast, genres and production_companies columns look interesting to probe, so I am going to add the number of items in each of these columns as 3 new columns.
>First I need to check for any empty entries, because I will be adding 1 to the count of '|' characters, and a separator is only present when there is more than one item.


In [4]:
# len() gives the total number of rows in each column
print(len(tmdb['cast']))
print(len(tmdb['genres']))
print(len(tmdb['production_companies']))

10866
10866
10866
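
Note that `len` reports the total number of rows, so it returns the same value whether or not some entries are empty. The `.describe` output later in this section lists fewer than 10866 values for the three count columns, which suggests that some entries are missing. A minimal sketch of counting missing entries directly, using a made-up frame (`demo` is hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing entry per column (hypothetical values)
demo = pd.DataFrame({
    "cast": ["A|B|C", np.nan, "D"],
    "genres": ["Action|Drama", "Comedy", np.nan],
})

total_rows = len(demo["cast"])                      # every row, missing or not
missing = demo[["cast", "genres"]].isnull().sum()   # missing entries per column

print(total_rows)
print(missing)
```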


>Then count the separators and add one to get the number of items in each entry.
>
>Note - Only run this cell once, otherwise you will get an error as it tries to create the columns again.

In [5]:
# Raw strings keep the backslash that escapes '|' in the regular expression
tmdb['cast_count'] = (tmdb['cast'].str.count(r'\|')) + 1
tmdb['genre_count'] = (tmdb['genres'].str.count(r'\|')) + 1
tmdb['companies_count'] = (tmdb['production_companies'].str.count(r'\|')) + 1
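
To illustrate how the separator count behaves, here is a small sketch on a made-up Series (the third entry is deliberately missing). A missing entry stays missing rather than becoming 1, and the result is stored as floats, which would be consistent with the count columns appearing as `float64` in the dtype listing further down.

```python
import numpy as np
import pandas as pd

# Three hypothetical cast entries: three names, one name, and a missing value
cast = pd.Series(["Chris Pratt|Bryce Dallas Howard|Irrfan Khan", "Tom Hardy", np.nan])

# Number of separators plus one gives the number of names
item_count = cast.str.count(r"\|") + 1

print(item_count.tolist())
```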

>Again print the first row and the '.describe' statistics for the dataset.

In [7]:
print(tmdb.iloc[0])
print(tmdb['release_date'].iloc[0])
tmdb.describe()
print(tmdb.dtypes)


id                                                                 135397
popularity                                                        32.9858
budget                                                          150000000
revenue                                                        1513528810
original_title                                             Jurassic World
cast                    Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...
director                                                  Colin Trevorrow
runtime                                                               124
genres                          Action|Adventure|Science Fiction|Thriller
production_companies    Universal Studios|Amblin Entertainment|Legenda...
release_date                                                       6/9/15
vote_count                                                           5562
vote_average                                                          6.5
release_year                          

Now convert the remaining columns to more convenient types: the inflation-adjusted budget and revenue to integers, and the release date from a string to a date.


In [6]:
def parse_maybe_int(i):
    # Return None for an empty string, otherwise convert the value with int()
    if i == '':
        return None
    else:
        return int(i)

tmdb['budget_adj'] = tmdb['budget_adj'].apply(parse_maybe_int)
tmdb['revenue_adj'] = tmdb['revenue_adj'].apply(parse_maybe_int)

print(tmdb['budget_adj'].head)
print(tmdb['revenue_adj'].head)

<bound method Series.head of 0        137999939
1        137999939
2        101199955
3        183999919
4        174799923
5        124199945
6        142599937
7         99359956
8         68079970
9        160999929
10       225399900
11       161919931
12        13799993
13        80959964
14       257599886
15        40479982
16        44159980
17       119599947
18        87399961
19       147199935
20       174799923
21        27599987
22       101199955
23        36799983
24        25759988
25       137999939
26        62559972
27        74519967
28        18399991
29        56119975
           ...    
10836            0
10837            0
10838            0
10839            0
10840            0
10841       503851
10842            0
10843            0
10844            0
10845            0
10846            0
10847            0
10848     34362645
10849            0
10850            0
10851            0
10852            0
10853            0
10854            0
10855      4702610
10
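
If the adjusted columns are already numeric when loaded (an assumption; the dtypes before conversion are not shown here), the same conversion could be done without a row-by-row helper. A minimal sketch on made-up values, using `Series.astype`:

```python
import pandas as pd

# Hypothetical float values standing in for an adjusted-budget column
budget_adj = pd.Series([137999939.28, 0.0, 503851.08])

# astype drops the fractional part, like int(); it raises if the column contains NaN
as_int = budget_adj.astype("int64")

print(as_int.tolist())
```

Unlike `parse_maybe_int`, this has no special case for empty strings, so it only suits columns that contain nothing but numbers.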

Note that the dates are in US format (month/day/year) with 2-digit years.
By default, 2-digit years 69 to 99 are read as 1969 to 1999 and 00 to 68 as 2000 to 2068, so release dates before 1969 are parsed as, for example, 2066, 2067 and 2068 instead of 1966, 1967 and 1968. As the latest release year in the data is 2015, any parsed date after 2015 should in fact begin with 19xx, so I will subtract 100 years from every date with a year greater than the current one (2018).

In [7]:
def parse_date(date):
    # Parse a US-format date string (month/day/2-digit year); empty strings become None
    if date == '':
        return None
    else:
        return dt.datetime.strptime(date, '%m/%d/%y')

def parse_pre69(date1):
    # Years parsed into the future (after 2018) belong to the 1900s
    if date1.year > 2018:
        return date1.replace(year=date1.year-100)
    else:
        return date1

tmdb['release_date'] = tmdb['release_date'].apply(parse_date)
tmdb['release_date'] = tmdb['release_date'].apply(parse_pre69)

print(tmdb['release_date'].head)
print(tmdb['release_year'].head)


<bound method Series.head of 0       2015-06-09
1       2015-05-13
2       2015-03-18
3       2015-12-15
4       2015-04-01
5       2015-12-25
6       2015-06-23
7       2015-09-30
8       2015-06-17
9       2015-06-09
10      2015-10-26
11      2015-02-04
12      2015-01-21
13      2015-07-16
14      2015-04-22
15      2015-12-25
16      2015-01-01
17      2015-07-14
18      2015-03-12
19      2015-11-18
20      2015-05-19
21      2015-06-15
22      2015-05-27
23      2015-02-11
24      2015-12-11
25      2015-07-23
26      2015-06-25
27      2015-01-24
28      2015-11-06
29      2015-09-09
           ...    
10836   1966-01-01
10837   1966-06-21
10838   1966-11-01
10839   1966-10-27
10840   1966-12-22
10841   1966-10-23
10842   1966-01-01
10843   1966-06-09
10844   1966-01-16
10845   1966-03-01
10846   1966-01-09
10847   1966-06-20
10848   1966-08-24
10849   1966-12-16
10850   1966-02-23
10851   1966-06-22
10852   1966-05-31
10853   1966-03-29
10854   1966-02-17
10855   1966-01-20
10
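
The century correction can also be sketched with vectorised pandas calls instead of two `apply` passes. This is an alternative under the same assumption as above (no genuine release date falls after 2018), shown on a few made-up date strings:

```python
import pandas as pd

# Hypothetical US-format dates with 2-digit years on both sides of the 1969 pivot
raw = pd.Series(["6/9/15", "11/1/66", "12/31/68", "1/1/69"])

parsed = pd.to_datetime(raw, format="%m/%d/%y")

# Keep dates up to 2018 as parsed; move anything later back by a century
shifted = parsed - pd.DateOffset(years=100)
fixed = parsed.where(parsed.dt.year <= 2018, shifted)

print(fixed.dt.year.tolist())
```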

In [8]:
print(tmdb.dtypes)
tmdb.describe()

id                               int64
popularity                     float64
budget                           int64
revenue                          int64
original_title                  object
cast                            object
director                        object
runtime                          int64
genres                          object
production_companies            object
release_date            datetime64[ns]
vote_count                       int64
vote_average                   float64
release_year                     int64
budget_adj                       int64
revenue_adj                      int64
cast_count                     float64
genre_count                    float64
companies_count                float64
dtype: object


| | id | popularity | budget | revenue | runtime | vote_count | vote_average | release_year | budget_adj | revenue_adj | cast_count | genre_count | companies_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10866.0 | 10866.0 | 10866.0 | 10866.0 | 10866.0 | 10866.0 | 10866.0 | 10866.0 | 10866.0 | 10866.0 | 10790.0 | 10843.0 | 9836.0 |
| mean | 66064.177434 | 0.646441 | 14625700.0 | 39823320.0 | 102.070863 | 217.389748 | 5.974922 | 2001.322658 | 17551040.0 | 51364360.0 | 4.872382 | 2.486397 | 2.361427 |
| std | 92130.136561 | 1.000185 | 30913210.0 | 117003500.0 | 31.381405 | 575.619058 | 0.935142 | 12.812941 | 34306160.0 | 144632500.0 | 0.584604 | 1.115649 | 1.343804 |
| min | 5.0 | 6.5e-05 | 0.0 | 0.0 | 0.0 | 10.0 | 1.5 | 1960.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 25% | 10596.25 | 0.207583 | 0.0 | 0.0 | 90.0 | 17.0 | 5.4 | 1995.0 | 0.0 | 0.0 | 5.0 | 2.0 | 1.0 |
| 50% | 20669.0 | 0.383856 | 0.0 | 0.0 | 99.0 | 38.0 | 6.0 | 2006.0 | 0.0 | 0.0 | 5.0 | 2.0 | 2.0 |
| 75% | 75610.0 | 0.713817 | 15000000.0 | 24000000.0 | 111.0 | 145.75 | 6.6 | 2011.0 | 20853250.0 | 33697100.0 | 5.0 | 3.0 | 3.0 |
| max | 417859.0 | 32.985763 | 425000000.0 | 2781506000.0 | 900.0 | 9767.0 | 9.2 | 2015.0 | 425000000.0 | 2827124000.0 | 5.0 | 5.0 | 5.0 |






> **Tip**: You should _not_ perform too many operations in each cell. Create cells freely to explore your data. One option that you can take with this project is to do a lot of explorations in an initial notebook. These don't have to be organized, but make sure you use enough comments to understand the purpose of each code cell. Then, after you're done with your analysis, create a duplicate notebook where you will trim the excess and organize your steps so that you have a flowing, cohesive report.

> **Tip**: Make sure that you keep your reader informed on the steps that you are taking in your investigation. Follow every code cell, or every set of related code cells, with a markdown cell to describe to the reader what was found in the preceding cell(s). Try to make it so that the reader can then understand what they will be seeing in the following cell(s).

### Data Cleaning (Replace this with more specific notes!)

In [None]:
# After discussing the structure of the data and any problems that need to be
#   cleaned, perform those cleaning steps in the second part of this section.


<a id='eda'></a>
## Exploratory Data Analysis

> **Tip**: Now that you've trimmed and cleaned your data, you're ready to move on to exploration. Compute statistics and create visualizations with the goal of addressing the research questions that you posed in the Introduction section. It is recommended that you be systematic with your approach. Look at one variable at a time, and then follow it up by looking at relationships between variables.

As movie stars progress in their careers, they normally take on more major parts in big films. I plan to see whether film budgets increase over the top stars' careers. With the number of actors listed for each film restricted to 5, parts of an actor's career will not show in the statistics. I will need to look first at the inflation-adjusted budget of each film, and then also at its takings, as a high takings-to-budget ratio may show when an actor has made the film more money, which then enabled them to claim a higher wage.

First of all I will separate all of the actors out into a separate data frame, to see who the top 20 actors are.

### Research Question 1: Do film budgets increase over the careers of the top actors?

In [46]:
# Use this, and more code cells, to explore your data. Don't forget to add
#   Markdown cells to document your observations and findings.
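# Split the cast on '|' into one column per position; the rename gives every column the same name, "actor"
actors = tmdb['cast'].str.split('|', expand=True).rename(columns = lambda x:"actor")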
# print tmdb.iloc[0]
# actors.name = ['Actors']
#tmdb.join(actors)

ActorsDF = pd.DataFrame(actors)
#print ActorsDF.iloc[0]
#Actors1 = pd.DataFrame(ActorsDF['actor1'])
#Actors2 = pd.DataFrame(ActorsDF['actor2'])
#Actors3 = pd.DataFrame(ActorsDF['actor3'])
#Actors4 = pd.DataFrame(ActorsDF['actor4'])
#Actors5 = pd.DataFrame(ActorsDF['actor5'])

#Actors1.name = 'actor'


Actors0 = pd.DataFrame(ActorsDF['actor'])

#print ActorsDF.head
#print Actors2.head
#Actors0 = pd.concat([Actors1,Actors2,Actors3,Actors4,Actors5], axis=0)
print(Actors0.head)


<bound method DataFrame.head of                           actor                actor                   actor  \
0                   Chris Pratt  Bryce Dallas Howard             Irrfan Khan   
1                     Tom Hardy      Charlize Theron        Hugh Keays-Byrne   
2              Shailene Woodley           Theo James            Kate Winslet   
3                 Harrison Ford          Mark Hamill           Carrie Fisher   
4                    Vin Diesel          Paul Walker           Jason Statham   
5             Leonardo DiCaprio            Tom Hardy            Will Poulter   
6         Arnold Schwarzenegger         Jason Clarke           Emilia Clarke   
7                    Matt Damon     Jessica Chastain            Kristen Wiig   
8                Sandra Bullock             Jon Hamm          Michael Keaton   
9                   Amy Poehler        Phyllis Smith            Richard Kind   
10                 Daniel Craig      Christoph Waltz             Léa Seydoux   
11      
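
The output above still has one column per cast position, all named "actor". One possible way to get from there to a ranking of the most frequent actors is to stack the columns into a single long Series and count the names. The sketch below uses a made-up frame (`demo` and its titles are hypothetical) and is only one way of doing this:

```python
import pandas as pd

# Toy stand-in for the TMDb frame (hypothetical rows; one film has no cast)
demo = pd.DataFrame({
    "original_title": ["Film A", "Film B", "Film C"],
    "cast": ["Tom Hardy|Charlize Theron", "Leonardo DiCaprio|Tom Hardy", None],
})

actors = (
    demo["cast"]
    .str.split("|", expand=True)   # one column per cast position
    .stack()                       # long format: one row per (film, position)
    .dropna()                      # discard empty cast slots
    .rename("actor")
)

# Number of films per actor, most frequent first, limited to 20 names
top_actors = actors.value_counts().head(20)
print(top_actors)
```

The stacked Series keeps the original row index as its first index level, so each name can still be joined back to its film's budget and release date.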

### Research Question 2  (Replace this header name!)

In [None]:
# Continue to explore the data to address your additional research
#   questions. Add more headers as needed if you have more questions to
#   investigate.


<a id='conclusions'></a>
## Conclusions

> **Tip**: Finally, summarize your findings and the results that have been performed. Make sure that you are clear with regards to the limitations of your exploration. If you haven't done any statistical tests, do not imply any statistical conclusions. And make sure you avoid implying causation from correlation!

> **Tip**: Once you are satisfied with your work, you should save a copy of the report in HTML or PDF form via the **File** > **Download as** submenu. Before exporting your report, check over it to make sure that the flow of the report is complete. You should probably remove all of the "Tip" quotes like this one so that the presentation is as tidy as possible. Congratulations!