# Let's Play "A Movie Presentation"

## Overview

Our team, consisting of Whitlee, Yiyi and Jacob, was asked to present three business recommendations as our company establishes its very own movie studio.

## Business Understanding

As fun as it sounds to make movies, Tom Cruise (and others) have taught us that the business of making movies can be a... wait for it... Risky Business. Anyway, our company had no real knowledge of what makes a good movie or how to make a movie profitable, so our team was tasked with exploring data and running tests upon that data in order to provide that information and inform our business recommendations. 

In [90]:
import pandas as pd
import sqlite3
import re
import numpy as np

## Data Understanding

We were provided five (5) data files, from various sources, and our first task was to decide which of those data files were viable for our purposes and which were, for lack of a better word, useless. Of those five (5) available choices, "bom.movie_gross.csv.gz" contained financial information and only that, so we discarded it as we could find that information elsewhere. Additionally, "rt.movie_info.tsv.gz" and "rt.reviews.tsv.gz" were synopses and reviews respectively, and full of strings that would be quite difficult to quantify and proved unnecessary.

Therefore, we settled on using "im.db", a SQLite database of IMDB data. IMDB (the Internet Movie Database) is a very popular movie information and rating website. The IMDB dataset contained such fare as ratings, directors, and actors. This data spanned from 2010 to 2027, as it included movies still being made and scheduled for release.

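Additionally, we chose to use "tmdb.movies.csv.gz". This data came from "The Movie Database" (TMDB). The rows we inspected carry release dates running from roughly 2010 through 2018, with the occasional older title such as Toy Story (1995). We chose this dataset because it contained a rating system, as well as title information.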

Finally, as our third data source, we chose "tn.movie_budgets.csv.gz". This data came from The Numbers, and features such information as budget (what a movie cost to make), domestic gross and worldwide gross. These figures are very important for mathematical calculations, and answering the question of how much a movie is worth in a quantifiable amount.

In [91]:
tmdb_df = pd.read_csv("Data/tmdb.movies.csv.gz")

In [92]:
tmdb_df.head(10)

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186
5,5,"[12, 14, 10751]",32657,en,Percy Jackson & the Olympians: The Lightning T...,26.691,2010-02-11,Percy Jackson & the Olympians: The Lightning T...,6.1,4229
6,6,"[28, 12, 14, 878]",19995,en,Avatar,26.526,2009-12-18,Avatar,7.4,18676
7,7,"[16, 10751, 35]",10193,en,Toy Story 3,24.445,2010-06-17,Toy Story 3,7.7,8340
8,8,"[16, 10751, 35]",20352,en,Despicable Me,23.673,2010-07-09,Despicable Me,7.2,10057
9,9,"[16, 28, 35, 10751, 878]",38055,en,Megamind,22.855,2010-11-04,Megamind,6.8,3635


In [93]:
tmdb_df.tail(10)

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
26507,26507,[99],545555,ar,Dreamaway,0.6,2018-10-14,Dream Away,0.0,2
26508,26508,[16],514492,en,Jaws,0.6,2018-05-29,Jaws,0.0,1
26509,26509,[27],502255,en,Closing Time,0.6,2018-02-24,Closing Time,0.0,1
26510,26510,[99],495045,en,Fail State,0.6,2018-10-19,Fail State,0.0,1
26511,26511,[99],492837,en,Making Filmmakers,0.6,2018-04-07,Making Filmmakers,0.0,1
26512,26512,"[27, 18]",488143,en,Laboratory Conditions,0.6,2018-10-13,Laboratory Conditions,0.0,1
26513,26513,"[18, 53]",485975,en,_EXHIBIT_84xxx_,0.6,2018-05-01,_EXHIBIT_84xxx_,0.0,1
26514,26514,"[14, 28, 12]",381231,en,The Last One,0.6,2018-10-01,The Last One,0.0,1
26515,26515,"[10751, 12, 28]",366854,en,Trailer Made,0.6,2018-06-22,Trailer Made,0.0,1
26516,26516,"[53, 27]",309885,en,The Church,0.6,2018-10-05,The Church,0.0,1


Our first step was to load the data and explore it. We began with the TMDB data, and a quick look at the first and last 10 entries shows us some interesting things. Columns of note for later work include "genre_ids", "original_title" and "title", "release_date", and the two columns of voting information.

In [94]:
tmdb_df.describe()

Unnamed: 0.1,Unnamed: 0,id,popularity,vote_average,vote_count
count,26517.0,26517.0,26517.0,26517.0,26517.0
mean,13258.0,295050.15326,3.130912,5.991281,194.224837
std,7654.94288,153661.615648,4.355229,1.852946,960.961095
min,0.0,27.0,0.6,0.0,1.0
25%,6629.0,157851.0,0.6,5.0,2.0
50%,13258.0,309581.0,1.374,6.0,5.0
75%,19887.0,419542.0,3.694,7.0,28.0
max,26516.0,608444.0,80.773,10.0,22186.0


In [95]:
tmdb_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26517 entries, 0 to 26516
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         26517 non-null  int64  
 1   genre_ids          26517 non-null  object 
 2   id                 26517 non-null  int64  
 3   original_language  26517 non-null  object 
 4   original_title     26517 non-null  object 
 5   popularity         26517 non-null  float64
 6   release_date       26517 non-null  object 
 7   title              26517 non-null  object 
 8   vote_average       26517 non-null  float64
 9   vote_count         26517 non-null  int64  
dtypes: float64(2), int64(3), object(5)
memory usage: 2.0+ MB


Above, we've explored the TMDB data a little further, and we see that there are no nulls in the data. Very helpful! Furthermore, we can see the count for each column, as well as statistical data such as the mean, standard deviation, minimum and maximum for the numeric columns. We can also see that there are over 26,000 entries in this dataframe.

Now, we must do the same for the other two datasets, and the process is much the same, at least for the budgets data.

In [96]:
df = pd.read_csv("Data/tn.movie_budgets.csv.gz")

In [97]:
df.head(10)

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"
5,6,"Dec 18, 2015",Star Wars Ep. VII: The Force Awakens,"$306,000,000","$936,662,225","$2,053,311,220"
6,7,"Apr 27, 2018",Avengers: Infinity War,"$300,000,000","$678,815,482","$2,048,134,200"
7,8,"May 24, 2007",Pirates of the Caribbean: At World's End,"$300,000,000","$309,420,425","$963,420,425"
8,9,"Nov 17, 2017",Justice League,"$300,000,000","$229,024,295","$655,945,209"
9,10,"Nov 6, 2015",Spectre,"$300,000,000","$200,074,175","$879,620,923"


In [98]:
df.tail(10)

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
5772,73,"Jan 13, 2012",Newlyweds,"$9,000","$4,584","$4,584"
5773,74,"Feb 26, 1993",El Mariachi,"$7,000","$2,040,920","$2,041,928"
5774,75,"Oct 8, 2004",Primer,"$7,000","$424,760","$841,926"
5775,76,"May 26, 2006",Cavite,"$7,000","$70,071","$71,644"
5776,77,"Dec 31, 2004",The Mongol King,"$7,000",$900,$900
5777,78,"Dec 31, 2018",Red 11,"$7,000",$0,$0
5778,79,"Apr 2, 1999",Following,"$6,000","$48,482","$240,495"
5779,80,"Jul 13, 2005",Return to the Land of Wonders,"$5,000","$1,338","$1,338"
5780,81,"Sep 29, 2015",A Plague So Pleasant,"$1,400",$0,$0
5781,82,"Aug 5, 2005",My Date With Drew,"$1,100","$181,041","$181,041"


From a quick exploration of The Numbers data, we can see that the columns include "release_date", "movie" (the title), "production_budget", and both the domestic and worldwide gross. Since it is indisputable that money makes the world go 'round, the budgetary and gross information here will be absolutely critical.

In [99]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   id                 5782 non-null   int64 
 1   release_date       5782 non-null   object
 2   movie              5782 non-null   object
 3   production_budget  5782 non-null   object
 4   domestic_gross     5782 non-null   object
 5   worldwide_gross    5782 non-null   object
dtypes: int64(1), object(5)
memory usage: 271.2+ KB


In [100]:
df.describe()

Unnamed: 0,id
count,5782.0
mean,50.372363
std,28.821076
min,1.0
25%,25.0
50%,50.0
75%,75.0
max,100.0


Now, here is where things get interesting for The Numbers' data. We can see that there are over 5,700 entries, and no nulls. However, the main monetary columns are stored as objects (strings) rather than integers or floats, which will make the math operations we need later a real pain. We will revisit .describe() on this dataframe after we've done some cleaning, at which point we can see the mean, standard deviation, minimum and maximum for the monetary columns.

Finally, in terms of our basic data understanding, we must load in the IMDB SQL database and attempt to discover roughly the same information, though the schema is very different, and so is the information contained within.

In [101]:
conn = sqlite3.connect('Data/im.db')
query = "SELECT name FROM sqlite_master WHERE type='table';"
tables = pd.read_sql_query(query, conn)
print(tables)

            name
0   movie_basics
1      directors
2      known_for
3     movie_akas
4  movie_ratings
5        persons
6     principals
7        writers


As we can see, SQL works a little (a lot) differently from Pandas, and so what we are able to ascertain thus far is the list of tables contained within. These include "movie_basics", "directors", "persons", and so on. They will be useful in determining whether a certain writer, actor, or director has an influence on the popularity or revenue of a movie. We did, however, need to do a little more exploration to ascertain what each table contained and its relevance.
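One lightweight way to see what each table holds, without pulling every row into memory, is SQLite's `PRAGMA table_info`, which lists a table's columns. The sketch below runs against a throwaway in-memory database so it is self-contained. The two `CREATE TABLE` statements only imitate table and column names seen in "im.db" and are our assumption, not the real schema definition. `demo_conn` and `table_columns` are hypothetical names; pointing the same helper at the real `conn` should behave the same way.

```python
import sqlite3

# In-memory stand-in for Data/im.db (assumed schema, for illustration only)
demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
    CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT, start_year INTEGER);
    CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL, numvotes INTEGER);
""")


def table_columns(connection, table_name):
    """Return the column names of one table using PRAGMA table_info."""
    rows = connection.execute(f"PRAGMA table_info({table_name});").fetchall()
    # Each row is (cid, name, type, notnull, dflt_value, pk); keep the name
    return [row[1] for row in rows]


# List every table, then look up the columns of each one
table_names = [
    row[0]
    for row in demo_conn.execute("SELECT name FROM sqlite_master WHERE type='table';")
]
schema = {name: table_columns(demo_conn, name) for name in table_names}

for name, columns in schema.items():
    print(name, columns)
```

This only reads metadata, so it stays fast even for the larger tables.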

In [102]:
q_1 = """
SELECT * FROM movie_akas
"""
movie_akas = pd.read_sql_query(q_1, conn)
movie_akas.head()

Unnamed: 0,movie_id,ordering,title,region,language,types,attributes,is_original_title
0,tt0369610,10,Джурасик свят,BG,bg,,,0.0
1,tt0369610,11,Jurashikku warudo,JP,,imdbDisplay,,0.0
2,tt0369610,12,Jurassic World: O Mundo dos Dinossauros,BR,,imdbDisplay,,0.0
3,tt0369610,13,O Mundo dos Dinossauros,BR,,,short title,0.0
4,tt0369610,14,Jurassic World,FR,,imdbDisplay,,0.0


In [103]:
q_2 = """
SELECT * FROM movie_basics
"""
movie_basics = pd.read_sql_query(q_2, conn)
movie_basics.head()

Unnamed: 0,movie_id,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"


From these first two queries to the database, we have pulled the results from "movie_akas" and "movie_basics", respectively, to get an idea of what each table contains and the relevance of that information. "movie_akas" contains such things as "title" and "region", but sadly is not particularly of use outside of that, featuring no ratings, budgetary information, or revenue information. "movie_basics" contains notable things such as "genres" and "runtime_minutes".

In [104]:
q_3 = """
SELECT * FROM directors
"""
directors = pd.read_sql_query(q_3, conn)
directors.head()

Unnamed: 0,movie_id,person_id
0,tt0285252,nm0899854
1,tt0462036,nm1940585
2,tt0835418,nm0151540
3,tt0835418,nm0151540
4,tt0878654,nm0089502


In [105]:
q_4 = """
SELECT * FROM known_for
"""
known_for = pd.read_sql_query(q_4, conn)
known_for.head()

Unnamed: 0,person_id,movie_id
0,nm0061671,tt0837562
1,nm0061671,tt2398241
2,nm0061671,tt0844471
3,nm0061671,tt0118553
4,nm0061865,tt0896534


These two queries were interesting in terms of discovering what each table contained, but both tables hold only pairs of movie and person IDs, so on their own they were of little use to us.

In [106]:
q_5 = """
SELECT * FROM movie_ratings
"""
movie_ratings = pd.read_sql_query(q_5, conn)
movie_ratings.head()

Unnamed: 0,movie_id,averagerating,numvotes
0,tt10356526,8.3,31
1,tt10384606,8.9,559
2,tt1042974,6.4,20
3,tt1043726,4.2,50352
4,tt1060240,6.5,21


In [107]:
q_6 = """
SELECT * FROM persons
"""
persons = pd.read_sql_query(q_6, conn)
persons.head()

Unnamed: 0,person_id,primary_name,birth_year,death_year,primary_profession
0,nm0061671,Mary Ellen Bauder,,,"miscellaneous,production_manager,producer"
1,nm0061865,Joseph Bauer,,,"composer,music_department,sound_department"
2,nm0062070,Bruce Baum,,,"miscellaneous,actor,writer"
3,nm0062195,Axel Baumann,,,"camera_department,cinematographer,art_department"
4,nm0062798,Pete Baxter,,,"production_designer,art_department,set_decorator"


In [108]:
q_7 = """
SELECT * FROM principals
"""
principals = pd.read_sql_query(q_7, conn)
principals.head()

Unnamed: 0,movie_id,ordering,person_id,category,job,characters
0,tt0111414,1,nm0246005,actor,,"[""The Man""]"
1,tt0111414,2,nm0398271,director,,
2,tt0111414,3,nm3739909,producer,producer,
3,tt0323808,10,nm0059247,editor,,
4,tt0323808,1,nm3579312,actress,,"[""Beth Boothby""]"


These next few queries proved to be much more useful, as the table "movie_ratings" is incredibly relevant, containing information on the average rating of each movie and the number of votes it received to arrive at that number. The second table, "persons", contains the names and professions of actors, writers, directors, and so on. Dovetailing with the second table, the third, "principals", further informs us about the actors, producers, directors, and those involved in the making of the movie, and may be very relevant to our third recommendation.
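Since "movie_ratings" and "movie_basics" share a "movie_id" column, they can be combined in a single query rather than loaded separately. The following is only a sketch of such a join, run on an in-memory database with made-up rows (the "Example Film" titles and IDs are invented; the column names follow the outputs above). `demo_conn`, `join_query`, and `rated_movies` are hypothetical names, not part of our notebook.

```python
import sqlite3
import pandas as pd

# In-memory stand-in for Data/im.db with a few invented rows
demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
    CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT, genres TEXT);
    CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL, numvotes INTEGER);
    INSERT INTO movie_basics VALUES
        ('tt0000001', 'Example Film A', 'Drama'),
        ('tt0000002', 'Example Film B', 'Comedy,Drama'),
        ('tt0000003', 'Example Film C', 'Action');
    INSERT INTO movie_ratings VALUES
        ('tt0000001', 8.3, 31),
        ('tt0000002', 4.2, 50352);
""")

# Inner join: only movies that have a ratings row survive
join_query = """
SELECT b.movie_id, b.primary_title, b.genres, r.averagerating, r.numvotes
FROM movie_basics AS b
JOIN movie_ratings AS r
    ON b.movie_id = r.movie_id
ORDER BY r.numvotes DESC;
"""
rated_movies = pd.read_sql_query(join_query, demo_conn)
print(rated_movies)
```

Note that the unrated "Example Film C" drops out of the result; a `LEFT JOIN` would keep it with empty rating columns instead.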

## Data Preparation

Now, we must move on to cleaning up our data, removing nulls and ensuring that things are standardized. We begin with The Movie Database dataframe, henceforth referred to as "tmdb_df" both in code and markdown. It is briefly shown below.

In [109]:
tmdb_df.head()

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186


The first thing that steamed our broccoli about this data was that the "genre_ids" column was in number form rather than understandable terms like "action" or "drama". Blessedly, Yiyi came right to the rescue and did a little outside research, discovering that these numbers are the standard genre IDs that TMDB uses. She was therefore able to write code to change each integer value to the proper genre name, as we needed.

In [110]:
genre_mapping = {
    28: "Action", 12: "Adventure", 16: "Animation", 35: "Comedy", 80: "Crime", 
    99: "Documentary", 18: "Drama", 10751: "Family", 14: "Fantasy", 36: "History", 
    27: "Horror", 10402: "Music", 9648: "Mystery", 10749: "Romance", 878: "Science Fiction", 
    10770: "TV Movie", 53: "Thriller", 10752: "War", 37: "Western"
}

In [111]:
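def convert_genre_ids_to_names(ids):
    # 'ids' is a string such as "[12, 14, 10751]"; pull out each numeric ID
    genre_ids = re.findall(r'\d+', ids)
    # Map each ID to its genre name, falling back to "Unknown" for unmapped IDs
    return ', '.join([genre_mapping.get(int(genre_id), "Unknown") for genre_id in genre_ids])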

tmdb_df['genre_ids'] = tmdb_df['genre_ids'].apply(convert_genre_ids_to_names)

In [112]:
tmdb_df.head()

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,0,"Adventure, Fantasy, Family",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,1,"Fantasy, Adventure, Animation, Family",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,2,"Adventure, Action, Science Fiction",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,3,"Animation, Comedy, Family",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,4,"Action, Science Fiction, Adventure",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186


Lovely! Now we have the genres in an understandable format that we can easily compare to the genres seen in our other data. Our next "issue" was whether the movies in the dataset were in English. Another brief look showed us that roughly 23,000 of the 26,500 entries in the dataframe were in English, so we felt comfortable removing those that were not.
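The "brief look" at languages is not shown in the notebook; one plausible way to do it is with `value_counts`, as sketched here on a made-up series. The values in `languages` are invented for illustration and only stand in for `tmdb_df['original_language']`.

```python
import pandas as pd

# Toy stand-in for tmdb_df['original_language'] (made-up values)
languages = pd.Series(
    ["en", "en", "en", "fr", "en", "ja", "en", "en"],
    name="original_language",
)

# Raw counts per language, and the same thing as a fraction of all rows
counts = languages.value_counts()
shares = languages.value_counts(normalize=True)

english_share = shares["en"]
print(counts)
print(f"English share: {english_share:.1%}")
```

Looking at the share as well as the raw count makes it easier to judge how much data a language filter throws away.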

In [113]:
tmdb_df = tmdb_df.loc[tmdb_df['original_language'] == "en"]

In [114]:
tmdb_df['original_language'].value_counts()

en    23291
Name: original_language, dtype: int64

As we can see, we still have upwards of 23,000 results to work with. Now, we had some concerns that there might be duplicates within the "original_title" and "title" columns, so we set about exploring and fixing that.

In [115]:
tmdb_df['original_title'].value_counts()

Eden                                   7
Home                                   6
Truth or Dare                          5
Legend                                 5
Lucky                                  5
                                      ..
Sleeping with Other People             1
President Trump: Can He Really Win?    1
Dark Circles                           1
A Sinner in Mecca                      1
Public Sex, Private Lives              1
Name: original_title, Length: 21781, dtype: int64

In [116]:
tmdb_df = tmdb_df.drop_duplicates(subset=['original_title'])

In [117]:
tmdb_df['original_title'].value_counts()

The Current War                        1
Lavalantula                            1
Cartels                                1
A Mouse Tale                           1
Shades                                 1
                                      ..
Richard Linklater: Dream Is Destiny    1
Art Show Bingo                         1
Innsmouth                              1
Passfire                               1
Public Sex, Private Lives              1
Name: original_title, Length: 21781, dtype: int64

In [118]:
tmdb_df['title'].value_counts()

August                                      2
Wings                                       2
Rage                                        2
Do Not Disturb                              2
The Gift                                    2
                                           ..
The Milky Way                               1
Disgruntled Employee                        1
Heaven's Door                               1
Aliens and Astronauts: UFO's on the Moon    1
Public Sex, Private Lives                   1
Name: title, Length: 21767, dtype: int64

In [119]:
tmdb_df = tmdb_df.drop_duplicates(subset=['title'])

In [120]:
tmdb_df['title'].value_counts()

The Current War                             1
Me2                                         1
What We Did on Our Holiday                  1
Reflection                                  1
WWE: The Kliq Rules                         1
                                           ..
Heaven's Door                               1
Aliens and Astronauts: UFO's on the Moon    1
Lucky Christmas                             1
Season's Greetings                          1
Public Sex, Private Lives                   1
Name: title, Length: 21767, dtype: int64

In [121]:
tmdb_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21767 entries, 0 to 26516
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         21767 non-null  int64  
 1   genre_ids          21767 non-null  object 
 2   id                 21767 non-null  int64  
 3   original_language  21767 non-null  object 
 4   original_title     21767 non-null  object 
 5   popularity         21767 non-null  float64
 6   release_date       21767 non-null  object 
 7   title              21767 non-null  object 
 8   vote_average       21767 non-null  float64
 9   vote_count         21767 non-null  int64  
dtypes: float64(2), int64(3), object(5)
memory usage: 1.8+ MB


As we can see, there were some duplicates within both columns, and so we dropped any duplicate rows and further trimmed our data down to nearly 22,000 entries. Still a very respectable dataset! 
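One caveat worth checking before dropping by title alone: two rows with the same title could be different films rather than true repeats. The sketch below, on made-up rows (the release dates are invented), shows how `duplicated(keep=False)` can surface every member of a duplicate group for inspection, and how a stricter rule on title plus release date behaves differently. This is an alternative to consider, not what our notebook does.

```python
import pandas as pd

# Made-up rows: two different films sharing a title, plus one exact repeat
toy = pd.DataFrame({
    "title": ["Eden", "Eden", "Home", "Home", "Lucky"],
    "release_date": ["2012-01-01", "2015-06-01", "2015-03-27", "2015-03-27", "2017-09-29"],
})

# keep=False flags every member of a duplicate group, not just the later ones
same_title = toy[toy.duplicated(subset=["title"], keep=False)]
print(same_title)

# Title-only rule (as in the notebook) versus a stricter title + date rule
deduped_title_only = toy.drop_duplicates(subset=["title"])
deduped_strict = toy.drop_duplicates(subset=["title", "release_date"])
print(len(deduped_title_only), len(deduped_strict))
```

With the stricter rule, the two "Eden" rows both survive because their dates differ, while the repeated "Home" row is still removed.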

Contained within that data was a voting average based on a vote count, which together approximate how popular a movie was. These are very important pieces of information, considering that popularity is a huge part of what makes a successful movie; people don't spend money on things they don't like. It could be said that when few people vote for something, they're uninterested in it, so we opted to remove the bottom "chunk" of entries by vote count. In other words, not enough people saw those movies for them to be relevant.

In [122]:
tmdb_df.describe()

Unnamed: 0.1,Unnamed: 0,id,popularity,vote_average,vote_count
count,21767.0,21767.0,21767.0,21767.0,21767.0
mean,13063.441953,297996.04562,2.963221,5.946226,187.565397
std,7609.083387,153589.915884,4.323317,1.912034,949.350173
min,0.0,27.0,0.6,0.0,1.0
25%,6535.5,160783.0,0.6,5.0,1.0
50%,12972.0,311093.0,1.182,6.0,4.0
75%,19602.5,422517.0,3.11,7.0,22.0
max,26516.0,608444.0,80.773,10.0,22186.0


The vote_count column is heavily skewed: the median is only 4 votes, while a handful of very popular movies pull the mean up to roughly 188. Entries with few votes were seen by few people and aren't particularly relevant to our analysis, so we used the mean vote count, 188, as a cutoff, which is a roundabout way of filtering by popularity.
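To make the effect of a mean-based cutoff concrete, here is a small sketch on invented vote counts that are skewed in the same way (many tiny values, one very large one). The numbers are made up; only the pattern matters. Computing the cutoff from the data, rather than typing it in, is our suggestion and not what the notebook does.

```python
import pandas as pd

# Made-up vote counts: mostly tiny values plus one blockbuster
vote_count = pd.Series([1, 1, 2, 3, 4, 5, 8, 20, 150, 1806], name="vote_count")

mean_votes = vote_count.mean()      # pulled upward by the largest value
median_votes = vote_count.median()  # the "typical" movie

# Use the mean as the cutoff and see how many rows survive
cutoff = mean_votes
kept = vote_count[vote_count >= cutoff]
kept_share = len(kept) / len(vote_count)

print(f"mean={mean_votes}, median={median_votes}")
print(f"rows kept: {len(kept)} of {len(vote_count)} ({kept_share:.0%})")
```

Because the mean sits far above the median in skewed data, a mean cutoff keeps only a small slice of the rows, which matches what we see in our own dataframe after filtering.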

In [123]:
tmdb_df = tmdb_df[tmdb_df['vote_count'] >= 188]

In [124]:
tmdb_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2143 entries, 0 to 24472
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         2143 non-null   int64  
 1   genre_ids          2143 non-null   object 
 2   id                 2143 non-null   int64  
 3   original_language  2143 non-null   object 
 4   original_title     2143 non-null   object 
 5   popularity         2143 non-null   float64
 6   release_date       2143 non-null   object 
 7   title              2143 non-null   object 
 8   vote_average       2143 non-null   float64
 9   vote_count         2143 non-null   int64  
dtypes: float64(2), int64(3), object(5)
memory usage: 184.2+ KB


This has still left us with a very respectable 2,143 entries. At this point, we have ensured our entries are all in English and that neither "original_title" nor "title" contains duplicates. We therefore decided to drop "original_title" and "original_language", as well as "id", which we didn't feel was informative or necessary as a connective point. "Unnamed: 0" merely repeats the row index, so we dropped that as well.
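As an aside, a column called "Unnamed: 0" is usually a sign that the CSV was saved with its index, which pandas then reads back as an ordinary column with an empty header. We assume that is what happened with our file. The sketch below uses a tiny made-up CSV string to show the effect and how `index_col=0` avoids it at load time; `csv_text` and the two dataframe names are hypothetical.

```python
import io
import pandas as pd

# A tiny CSV whose first column has an empty header, as happens when a
# dataframe is written to CSV together with its index
csv_text = ",title,vote_count\n0,Inception,22186\n1,Toy Story,10174\n"

# Default read: the index column comes back as a regular "Unnamed: 0" column
default_read = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column is used as the index instead
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_read.columns.tolist())
print(indexed_read.columns.tolist())
```

Either approach ends in the same place; dropping the column afterwards, as we do, works just as well.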

In [125]:
tmdb_df = tmdb_df.drop(['id', 'original_language', 'original_title', 'Unnamed: 0'], axis=1)

In [126]:
tmdb_df

Unnamed: 0,genre_ids,popularity,release_date,title,vote_average,vote_count
0,"Adventure, Fantasy, Family",33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,"Fantasy, Adventure, Animation, Family",28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,"Adventure, Action, Science Fiction",28.515,2010-05-07,Iron Man 2,6.8,12368
3,"Animation, Comedy, Family",28.005,1995-11-22,Toy Story,7.9,10174
4,"Action, Science Fiction, Adventure",27.920,2010-07-16,Inception,8.3,22186
...,...,...,...,...,...,...
24271,Comedy,8.669,2018-04-27,The Week Of,5.1,344
24275,"Drama, Fantasy, Horror, Thriller",8.631,2018-01-05,Before I Wake,6.4,941
24287,"Romance, Drama, History",8.402,2018-08-10,The Guernsey Literary & Potato Peel Pie Society,7.7,594
24338,Comedy,7.897,2018-01-19,Step Sisters,6.4,285


In [127]:
tmdb_df.to_csv('Data/tmdb_df_redone.csv', index=False)

Et voilà! We have our cleaned dataset, ready to be joined with other dataframes and experimented on, for lack of a better word. Now, we move on to the second dataset and more or less repeat the process. For this portion, we will be working with The Numbers data, which is simply called "df" in the code. The data is displayed briefly below for a refresher.

In [128]:
df

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"
...,...,...,...,...,...,...
5777,78,"Dec 31, 2018",Red 11,"$7,000",$0,$0
5778,79,"Apr 2, 1999",Following,"$6,000","$48,482","$240,495"
5779,80,"Jul 13, 2005",Return to the Land of Wonders,"$5,000","$1,338","$1,338"
5780,81,"Sep 29, 2015",A Plague So Pleasant,"$1,400",$0,$0


In [129]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   id                 5782 non-null   int64 
 1   release_date       5782 non-null   object
 2   movie              5782 non-null   object
 3   production_budget  5782 non-null   object
 4   domestic_gross     5782 non-null   object
 5   worldwide_gross    5782 non-null   object
dtypes: int64(1), object(5)
memory usage: 271.2+ KB


The first "issue" noted was that the "$" and the comma separators in the monetary columns needed to be removed, both for cleanliness and for later math calculations. There was no need to "muddy the waters" with additional symbols. Additionally, the monetary columns are listed as "objects" rather than integers or floats, and that will need to be fixed for math calculations as well.

In [130]:
# Remove "$" and "," from the column, then convert to integers
df['production_budget'] = df['production_budget'].replace(r'[\$,]', '', regex=True).astype(int)

# Check the result
print(df['production_budget'])

0       425000000
1       410600000
2       350000000
3       330600000
4       317000000
          ...    
5777         7000
5778         6000
5779         5000
5780         1400
5781         1100
Name: production_budget, Length: 5782, dtype: int32


In [131]:
# Remove "$" and "," from the column, then convert to integers
df['domestic_gross'] = df['domestic_gross'].replace(r'[\$,]', '', regex=True).astype(int)

# Check the result
print(df['domestic_gross'])

0       760507625
1       241063875
2        42762350
3       459005868
4       620181382
          ...    
5777            0
5778        48482
5779         1338
5780            0
5781       181041
Name: domestic_gross, Length: 5782, dtype: int32


In [132]:
# Remove "$" and "," from the column; worldwide totals need 64-bit integers
df['worldwide_gross'] = df['worldwide_gross'].replace(r'[\$,]', '', regex=True).astype(np.int64)


# Check the result
print(df['worldwide_gross'])

0       2776345279
1       1045663875
2        149762350
3       1403013963
4       1316721747
           ...    
5777             0
5778        240495
5779          1338
5780             0
5781        181041
Name: worldwide_gross, Length: 5782, dtype: int64


In [133]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   id                 5782 non-null   int64 
 1   release_date       5782 non-null   object
 2   movie              5782 non-null   object
 3   production_budget  5782 non-null   int32 
 4   domestic_gross     5782 non-null   int32 
 5   worldwide_gross    5782 non-null   int64 
dtypes: int32(2), int64(2), object(2)
memory usage: 226.0+ KB


Now, the unsightly dollar signs have been removed, and the monetary columns are listed as integers, meaning that math can be performed upon them as necessary. It also means we can get the standard deviation, minimum, maximum, and other information for those columns now.
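The three cleaning cells above repeat the same steps, so they could be folded into one small helper. The sketch below is one way to do that, shown on two rows copied from the table above (Avatar and Red 11). `money_to_int` and `toy_budgets` are hypothetical names. It converts every column to 64-bit integers on purpose: the largest worldwide gross (2,776,345,279) does not fit in a 32-bit integer, whose maximum is 2,147,483,647, and plain `int` maps to 32 bits on some platforms, as the int32 output above suggests.

```python
import pandas as pd


def money_to_int(series):
    """Strip '$' and ',' from a string column and convert it to 64-bit integers."""
    cleaned = series.str.replace(r"[\$,]", "", regex=True)
    return cleaned.astype("int64")


# Two rows taken from The Numbers data shown earlier
toy_budgets = pd.DataFrame({
    "production_budget": ["$425,000,000", "$7,000"],
    "domestic_gross": ["$760,507,625", "$0"],
    "worldwide_gross": ["$2,776,345,279", "$0"],
})

# Apply the same cleaning to every monetary column in one loop
money_columns = ["production_budget", "domestic_gross", "worldwide_gross"]
for column in money_columns:
    toy_budgets[column] = money_to_int(toy_budgets[column])

print(toy_budgets.dtypes)
print(toy_budgets)
```

Using one helper also guarantees that all three columns end up with the same dtype.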

In [134]:
df.describe()

Unnamed: 0,id,production_budget,domestic_gross,worldwide_gross
count,5782.0,5782.0,5782.0,5782.0
mean,50.372363,31587760.0,41873330.0,91487460.0
std,28.821076,41812080.0,68240600.0,174720000.0
min,1.0,1100.0,0.0,0.0
25%,25.0,5000000.0,1429534.0,4125415.0
50%,50.0,17000000.0,17225940.0,27984450.0
75%,75.0,40000000.0,52348660.0,97645840.0
max,100.0,425000000.0,936662200.0,2776345000.0


Obviously, in the context of budget discussions, we care about which movies made the most money, and so we've sorted the data by "worldwide_gross".

In [135]:
df = df.sort_values(by='worldwide_gross', ascending=False)

In [136]:
df

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,425000000,760507625,2776345279
42,43,"Dec 19, 1997",Titanic,200000000,659363944,2208208395
5,6,"Dec 18, 2015",Star Wars Ep. VII: The Force Awakens,306000000,936662225,2053311220
6,7,"Apr 27, 2018",Avengers: Infinity War,300000000,678815482,2048134200
33,34,"Jun 12, 2015",Jurassic World,215000000,652270625,1648854864
...,...,...,...,...,...,...
5474,75,"Dec 31, 2005",Insomnia Manica,500000,0,0
5473,74,"Jul 17, 2012",Girls Gone Dead,500000,0,0
5472,73,"Apr 3, 2012",Enter Nowhere,500000,0,0
5471,72,"Dec 31, 2010",Drones,500000,0,0
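Sorting by gross shows which movies took in the most money, but not which ones were the best investments. As a sketch of how the cleaned integer columns can be put to work, the block below computes a simple profit (worldwide gross minus production budget) and return on investment (profit divided by budget) for three rows copied from the table above. These two definitions are our illustrative assumptions; they ignore marketing and other costs, and the `profit` and `roi` columns are not part of the notebook.

```python
import pandas as pd

# Three rows copied from the budget table above
toy = pd.DataFrame({
    "movie": ["Avatar", "Dark Phoenix", "El Mariachi"],
    "production_budget": [425_000_000, 350_000_000, 7_000],
    "worldwide_gross": [2_776_345_279, 149_762_350, 2_041_928],
})

# Assumed definitions: profit = gross - budget, ROI = profit / budget
toy["profit"] = toy["worldwide_gross"] - toy["production_budget"]
toy["roi"] = toy["profit"] / toy["production_budget"]

# Rank by ROI rather than by raw gross
ranked = toy.sort_values(by="roi", ascending=False)
print(ranked[["movie", "profit", "roi"]])
```

Ranked this way, the tiny-budget El Mariachi comes out ahead of Avatar, while Dark Phoenix shows a loss despite a nine-figure gross.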


Lastly, we wanted to ensure that the dataframe had no null values, and that process can be seen here.

In [137]:
#making sure there are no null values
null_values = df.isnull().sum()

# Display the count of null values for each column
print(null_values)

id                   0
release_date         0
movie                0
production_budget    0
domestic_gross       0
worldwide_gross      0
dtype: int64
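With both dataframes cleaned, the natural next step is joining them. The notebook does not show that join yet, so the following is only a sketch of one way it might look, matching TMDB's "title" against The Numbers' "movie" column. The rows are taken from the outputs above, and the names `tmdb_toy`, `budgets_toy`, `merged`, and `matched` are hypothetical. An outer merge with `indicator=True` is used here so that unmatched titles stay visible.

```python
import pandas as pd

# Two rows from the TMDB output and two from The Numbers output shown earlier
tmdb_toy = pd.DataFrame({
    "title": ["Avatar", "Inception"],
    "vote_average": [7.4, 8.3],
})
budgets_toy = pd.DataFrame({
    "movie": ["Avatar", "Dark Phoenix"],
    "production_budget": [425_000_000, 350_000_000],
})

# Outer merge keeps every row; the _merge column records where each came from
merged = tmdb_toy.merge(
    budgets_toy,
    left_on="title",
    right_on="movie",
    how="outer",
    indicator=True,
)

# Rows present in both sources are the ones usable for combined analysis
matched = merged[merged["_merge"] == "both"]
print(merged)
print(matched[["title", "vote_average", "production_budget"]])
```

Exact title matching is fragile (spelling, punctuation, and remakes with the same name can all cause misses or false matches), so checking the unmatched rows before switching to an inner join seems prudent.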


## Analysis and Results

### Business Recommendation 1

### Business Recommendation 2

### Business Recommendation 3

## Conclusion

### Next Steps