# ITEC 4210 - TMDB Movies Dataset 2025

**Notebook overview:** This notebook tracks the work done by team members John Brannigan & Caleb Cedeno to access, clean, and manipulate the dataset acquired for the digital analytics story project.

**Problem Statement:** Which production companies make the most profit on their movies?

**Approach:** The dataset is a comprehensive collection of almost 1 million movies curated from The Movie Database (TMDB). Because of its size, and for the purposes of this class, we may work with a smaller subset for simplicity. The original dataset includes titles, release info, genres, ratings, popularity, production details, and cast & crew. The project explores which production companies have the highest profit and revenue. Do we see the expected producers at the top? Are there any surprising insights?
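
As a rough illustration of the subsetting idea, the sketch below draws a reproducible random sample from a toy DataFrame. The names `movies`, `SAMPLE_SIZE`, and `SEED` are hypothetical and chosen only for this example; the notebook itself may subset the data in a different way.

```python
import pandas as pd

# Toy stand-in for the full dataset (hypothetical values, not the real TMDB data).
movies = pd.DataFrame({
    "id": range(1, 101),
    "revenue": [i * 1000.0 for i in range(100)],
})

# Hypothetical sample size and seed, chosen only for illustration.
SAMPLE_SIZE = 10
SEED = 42

# random_state makes the sample reproducible between runs.
subset = movies.sample(n=SAMPLE_SIZE, random_state=SEED)
print(subset.shape)
```

Fixing the seed means every team member works with the same rows when re-running the cell.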

# Data Acquisition and Cleanup

Importing basic libraries for use with datasets.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Loading the initial dataset from our data source as a DataFrame and showing a quick statistical summary.

In [None]:
# Make sure the dataset is loaded into the session files before running this code.
# It should be saved under the same file name used below.
dfOriginal = pd.read_csv('/content/TMDB_all_movies.csv')
dfOriginal.describe()

Unnamed: 0,id,vote_average,vote_count,revenue,runtime,budget,popularity,imdb_rating,imdb_votes
count,70158.0,70158.0,70158.0,70158.0,70158.0,70158.0,70158.0,63952.0,63952.0
mean,51605.092605,5.343108,203.731634,6678561.0,87.593988,2600369.0,3.224481,6.134509,12712.23
std,27977.354467,2.105521,1103.402384,46914320.0,41.129123,13081830.0,4.05011,1.262134,72373.3
min,2.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,5.0
25%,28609.25,4.8,4.0,0.0,79.0,0.0,0.955,5.4,222.0
50%,51749.5,5.9,14.0,0.0,92.0,0.0,2.052,6.3,759.0
75%,74816.0,6.6,50.0,0.0,104.0,0.0,3.944,7.0,2913.0
max,103923.0,10.0,37302.0,2923706000.0,1800.0,460000000.0,166.041,10.0,3029801.0
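
The summary above shows a median of 0 for both `revenue` and `budget`, which suggests that many zeros may stand in for missing values rather than real amounts. The sketch below, built on a few toy rows that echo the `head()` values, shows one possible way to measure the share of zeros and treat them as missing. Treating zeros as missing is an assumption made for this example, not a step the notebook takes.

```python
import pandas as pd

# A few toy rows echoing the revenue/budget values seen in head().
sample = pd.DataFrame({
    "revenue": [0.0, 4257354.0, 0.0, 12136938.0],
    "budget": [0.0, 4000000.0, 42000.0, 21000000.0],
})

# Share of rows where each column is exactly zero.
zero_share = (sample == 0).mean()
print(zero_share)

# Assumption for this sketch: a zero means "unknown", so replace it with NaN.
cleaned = sample.mask(sample == 0)
print(cleaned)
```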


In [None]:
# Run .head() to get a feel for the fields we are working with.
# We want to keep only the fields relevant to what we might visualize.
dfOriginal.head()

Unnamed: 0,id,title,vote_average,vote_count,status,release_date,revenue,runtime,budget,imdb_id,...,spoken_languages,cast,director,director_of_photography,writers,producers,music_composer,imdb_rating,imdb_votes,poster_path
0,2,Ariel,7.1,340.0,Released,1988-10-21,0.0,73.0,0.0,tt0094675,...,suomi,"Eetu Hilkamo, Turo Pajala, Jorma Markkula, Han...",Aki Kaurismäki,Timo Salminen,Aki Kaurismäki,Aki Kaurismäki,,7.4,9107.0,/ojDg0PGvs6R9xYFodRct2kdI6wC.jpg
1,3,Shadows in Paradise,7.3,403.0,Released,1986-10-17,0.0,74.0,0.0,tt0092149,...,"suomi, English, svenska","Haije Alanoja, Aki Kaurismäki, Jukka-Pekka Pal...",Aki Kaurismäki,Timo Salminen,Aki Kaurismäki,Mika Kaurismäki,,7.4,7922.0,/nj01hspawPof0mJmlgfjuLyJuRN.jpg
2,5,Four Rooms,5.9,2681.0,Released,1995-12-09,4257354.0,98.0,4000000.0,tt0113101,...,English,"Antonio Banderas, Sammi Davis, Kimberly Blair,...","Allison Anders, Quentin Tarantino, Robert Rodr...","Andrzej Sekula, Phil Parmet, Guillermo Navarro...","Allison Anders, Quentin Tarantino, Robert Rodr...","Alexandre Rockwell, Lawrence Bender, Quentin T...",Combustible Edison,6.7,113889.0,/pyCk5JgtRZwRxnXwfrvyzukaKue.jpg
3,6,Judgment Night,6.458,349.0,Released,1993-10-15,12136938.0,109.0,21000000.0,tt0107286,...,English,"Stephen Dorff, Deirdre Kelly, Emilio Estevez, ...",Stephen Hopkins,Peter Levy,"Lewis Colick, Jere Cunningham","Marilyn Vance, Gene Levy, Lloyd Segan",Alan Silvestri,6.6,19826.0,/3rvvpS9YPM5HB2f4HYiNiJVtdam.jpg
4,8,Life in Loops (A Megacities RMX),7.5,27.0,Released,2006-01-01,0.0,80.0,42000.0,tt0825671,...,"English, हिन्दी, 日本語, Pусский, Español",,Timo Novotny,Wolfgang Thaler,"Michael Glawogger, Timo Novotny","Ulrich Gehmacher, Timo Novotny",,8.2,284.0,/7ln81BRnPR2wqxuITZxEciCe1lc.jpg


In [None]:
df1 = dfOriginal.copy()
df1.head()

Unnamed: 0,id,title,vote_average,vote_count,status,release_date,revenue,runtime,budget,imdb_id,...,spoken_languages,cast,director,director_of_photography,writers,producers,music_composer,imdb_rating,imdb_votes,poster_path
0,2,Ariel,7.1,340.0,Released,1988-10-21,0.0,73.0,0.0,tt0094675,...,suomi,"Eetu Hilkamo, Turo Pajala, Jorma Markkula, Han...",Aki Kaurismäki,Timo Salminen,Aki Kaurismäki,Aki Kaurismäki,,7.4,9107.0,/ojDg0PGvs6R9xYFodRct2kdI6wC.jpg
1,3,Shadows in Paradise,7.3,403.0,Released,1986-10-17,0.0,74.0,0.0,tt0092149,...,"suomi, English, svenska","Haije Alanoja, Aki Kaurismäki, Jukka-Pekka Pal...",Aki Kaurismäki,Timo Salminen,Aki Kaurismäki,Mika Kaurismäki,,7.4,7922.0,/nj01hspawPof0mJmlgfjuLyJuRN.jpg
2,5,Four Rooms,5.9,2681.0,Released,1995-12-09,4257354.0,98.0,4000000.0,tt0113101,...,English,"Antonio Banderas, Sammi Davis, Kimberly Blair,...","Allison Anders, Quentin Tarantino, Robert Rodr...","Andrzej Sekula, Phil Parmet, Guillermo Navarro...","Allison Anders, Quentin Tarantino, Robert Rodr...","Alexandre Rockwell, Lawrence Bender, Quentin T...",Combustible Edison,6.7,113889.0,/pyCk5JgtRZwRxnXwfrvyzukaKue.jpg
3,6,Judgment Night,6.458,349.0,Released,1993-10-15,12136938.0,109.0,21000000.0,tt0107286,...,English,"Stephen Dorff, Deirdre Kelly, Emilio Estevez, ...",Stephen Hopkins,Peter Levy,"Lewis Colick, Jere Cunningham","Marilyn Vance, Gene Levy, Lloyd Segan",Alan Silvestri,6.6,19826.0,/3rvvpS9YPM5HB2f4HYiNiJVtdam.jpg
4,8,Life in Loops (A Megacities RMX),7.5,27.0,Released,2006-01-01,0.0,80.0,42000.0,tt0825671,...,"English, हिन्दी, 日本語, Pусский, Español",,Timo Novotny,Wolfgang Thaler,"Michael Glawogger, Timo Novotny","Ulrich Gehmacher, Timo Novotny",,8.2,284.0,/7ln81BRnPR2wqxuITZxEciCe1lc.jpg


### Splitting the release date into year, month, and day

In [None]:
df1["release_date"] = df1["release_date"].str.split("-")
df1.head()

Unnamed: 0,id,title,vote_average,vote_count,status,release_date,revenue,runtime,budget,imdb_id,...,spoken_languages,cast,director,director_of_photography,writers,producers,music_composer,imdb_rating,imdb_votes,poster_path
0,2,Ariel,7.1,340.0,Released,"[1988, 10, 21]",0.0,73.0,0.0,tt0094675,...,suomi,"Eetu Hilkamo, Turo Pajala, Jorma Markkula, Han...",Aki Kaurismäki,Timo Salminen,Aki Kaurismäki,Aki Kaurismäki,,7.4,9107.0,/ojDg0PGvs6R9xYFodRct2kdI6wC.jpg
1,3,Shadows in Paradise,7.3,403.0,Released,"[1986, 10, 17]",0.0,74.0,0.0,tt0092149,...,"suomi, English, svenska","Haije Alanoja, Aki Kaurismäki, Jukka-Pekka Pal...",Aki Kaurismäki,Timo Salminen,Aki Kaurismäki,Mika Kaurismäki,,7.4,7922.0,/nj01hspawPof0mJmlgfjuLyJuRN.jpg
2,5,Four Rooms,5.9,2681.0,Released,"[1995, 12, 09]",4257354.0,98.0,4000000.0,tt0113101,...,English,"Antonio Banderas, Sammi Davis, Kimberly Blair,...","Allison Anders, Quentin Tarantino, Robert Rodr...","Andrzej Sekula, Phil Parmet, Guillermo Navarro...","Allison Anders, Quentin Tarantino, Robert Rodr...","Alexandre Rockwell, Lawrence Bender, Quentin T...",Combustible Edison,6.7,113889.0,/pyCk5JgtRZwRxnXwfrvyzukaKue.jpg
3,6,Judgment Night,6.458,349.0,Released,"[1993, 10, 15]",12136938.0,109.0,21000000.0,tt0107286,...,English,"Stephen Dorff, Deirdre Kelly, Emilio Estevez, ...",Stephen Hopkins,Peter Levy,"Lewis Colick, Jere Cunningham","Marilyn Vance, Gene Levy, Lloyd Segan",Alan Silvestri,6.6,19826.0,/3rvvpS9YPM5HB2f4HYiNiJVtdam.jpg
4,8,Life in Loops (A Megacities RMX),7.5,27.0,Released,"[2006, 01, 01]",0.0,80.0,42000.0,tt0825671,...,"English, हिन्दी, 日本語, Pусский, Español",,Timo Novotny,Wolfgang Thaler,"Michael Glawogger, Timo Novotny","Ulrich Gehmacher, Timo Novotny",,8.2,284.0,/7ln81BRnPR2wqxuITZxEciCe1lc.jpg


In [None]:
def get_year(list_):
    return list_[0] if isinstance(list_, list) and list_ else None

### Keeping only the release year and renaming the column to reflect that

In [None]:
df1["release_date"] = df1["release_date"].apply(get_year)
df1.head()

Unnamed: 0,id,title,vote_average,vote_count,status,release_date,revenue,runtime,budget,imdb_id,...,spoken_languages,cast,director,director_of_photography,writers,producers,music_composer,imdb_rating,imdb_votes,poster_path
0,2,Ariel,7.1,340.0,Released,1988,0.0,73.0,0.0,tt0094675,...,suomi,"Eetu Hilkamo, Turo Pajala, Jorma Markkula, Han...",Aki Kaurismäki,Timo Salminen,Aki Kaurismäki,Aki Kaurismäki,,7.4,9107.0,/ojDg0PGvs6R9xYFodRct2kdI6wC.jpg
1,3,Shadows in Paradise,7.3,403.0,Released,1986,0.0,74.0,0.0,tt0092149,...,"suomi, English, svenska","Haije Alanoja, Aki Kaurismäki, Jukka-Pekka Pal...",Aki Kaurismäki,Timo Salminen,Aki Kaurismäki,Mika Kaurismäki,,7.4,7922.0,/nj01hspawPof0mJmlgfjuLyJuRN.jpg
2,5,Four Rooms,5.9,2681.0,Released,1995,4257354.0,98.0,4000000.0,tt0113101,...,English,"Antonio Banderas, Sammi Davis, Kimberly Blair,...","Allison Anders, Quentin Tarantino, Robert Rodr...","Andrzej Sekula, Phil Parmet, Guillermo Navarro...","Allison Anders, Quentin Tarantino, Robert Rodr...","Alexandre Rockwell, Lawrence Bender, Quentin T...",Combustible Edison,6.7,113889.0,/pyCk5JgtRZwRxnXwfrvyzukaKue.jpg
3,6,Judgment Night,6.458,349.0,Released,1993,12136938.0,109.0,21000000.0,tt0107286,...,English,"Stephen Dorff, Deirdre Kelly, Emilio Estevez, ...",Stephen Hopkins,Peter Levy,"Lewis Colick, Jere Cunningham","Marilyn Vance, Gene Levy, Lloyd Segan",Alan Silvestri,6.6,19826.0,/3rvvpS9YPM5HB2f4HYiNiJVtdam.jpg
4,8,Life in Loops (A Megacities RMX),7.5,27.0,Released,2006,0.0,80.0,42000.0,tt0825671,...,"English, हिन्दी, 日本語, Pусский, Español",,Timo Novotny,Wolfgang Thaler,"Michael Glawogger, Timo Novotny","Ulrich Gehmacher, Timo Novotny",,8.2,284.0,/7ln81BRnPR2wqxuITZxEciCe1lc.jpg


In [None]:
df1.rename(columns={"release_date" : "release_year"}, inplace=True)
df1.head()

Unnamed: 0,id,title,vote_average,vote_count,status,release_year,revenue,runtime,budget,imdb_id,...,spoken_languages,cast,director,director_of_photography,writers,producers,music_composer,imdb_rating,imdb_votes,poster_path
0,2,Ariel,7.1,340.0,Released,1988,0.0,73.0,0.0,tt0094675,...,suomi,"Eetu Hilkamo, Turo Pajala, Jorma Markkula, Han...",Aki Kaurismäki,Timo Salminen,Aki Kaurismäki,Aki Kaurismäki,,7.4,9107.0,/ojDg0PGvs6R9xYFodRct2kdI6wC.jpg
1,3,Shadows in Paradise,7.3,403.0,Released,1986,0.0,74.0,0.0,tt0092149,...,"suomi, English, svenska","Haije Alanoja, Aki Kaurismäki, Jukka-Pekka Pal...",Aki Kaurismäki,Timo Salminen,Aki Kaurismäki,Mika Kaurismäki,,7.4,7922.0,/nj01hspawPof0mJmlgfjuLyJuRN.jpg
2,5,Four Rooms,5.9,2681.0,Released,1995,4257354.0,98.0,4000000.0,tt0113101,...,English,"Antonio Banderas, Sammi Davis, Kimberly Blair,...","Allison Anders, Quentin Tarantino, Robert Rodr...","Andrzej Sekula, Phil Parmet, Guillermo Navarro...","Allison Anders, Quentin Tarantino, Robert Rodr...","Alexandre Rockwell, Lawrence Bender, Quentin T...",Combustible Edison,6.7,113889.0,/pyCk5JgtRZwRxnXwfrvyzukaKue.jpg
3,6,Judgment Night,6.458,349.0,Released,1993,12136938.0,109.0,21000000.0,tt0107286,...,English,"Stephen Dorff, Deirdre Kelly, Emilio Estevez, ...",Stephen Hopkins,Peter Levy,"Lewis Colick, Jere Cunningham","Marilyn Vance, Gene Levy, Lloyd Segan",Alan Silvestri,6.6,19826.0,/3rvvpS9YPM5HB2f4HYiNiJVtdam.jpg
4,8,Life in Loops (A Megacities RMX),7.5,27.0,Released,2006,0.0,80.0,42000.0,tt0825671,...,"English, हिन्दी, 日本語, Pусский, Español",,Timo Novotny,Wolfgang Thaler,"Michael Glawogger, Timo Novotny","Ulrich Gehmacher, Timo Novotny",,8.2,284.0,/7ln81BRnPR2wqxuITZxEciCe1lc.jpg
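
The split-and-pick steps above could also be expressed with pandas' datetime tools. The sketch below is an alternative under stated assumptions: the toy `dates` Series and the `%Y-%m-%d` format are assumptions based on the values shown in `head()`, and this is not the approach the notebook uses.

```python
import pandas as pd

# Toy dates in the same YYYY-MM-DD shape as release_date, plus a missing and a malformed value.
dates = pd.Series(["1988-10-21", "1995-12-09", None, "not a date"])

# errors="coerce" turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")

# Nullable integer dtype keeps the years as whole numbers even with missing values.
release_year = parsed.dt.year.astype("Int64")
print(release_year)
```

With this variant the year is numeric, so later comparisons such as "1990 or later" need no string handling.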


### Creating a column to show if a movie was released in or after 1990

In [None]:
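# Compare years as numbers rather than as strings; missing years evaluate to False.
df1["contemporary"] = pd.to_numeric(df1["release_year"], errors="coerce") >= 1990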
df1.head()

Unnamed: 0,id,title,vote_average,vote_count,status,release_year,revenue,runtime,budget,imdb_id,...,cast,director,director_of_photography,writers,producers,music_composer,imdb_rating,imdb_votes,poster_path,contemporary
0,2,Ariel,7.1,340.0,Released,1988,0.0,73.0,0.0,tt0094675,...,"Eetu Hilkamo, Turo Pajala, Jorma Markkula, Han...",Aki Kaurismäki,Timo Salminen,Aki Kaurismäki,Aki Kaurismäki,,7.4,9107.0,/ojDg0PGvs6R9xYFodRct2kdI6wC.jpg,False
1,3,Shadows in Paradise,7.3,403.0,Released,1986,0.0,74.0,0.0,tt0092149,...,"Haije Alanoja, Aki Kaurismäki, Jukka-Pekka Pal...",Aki Kaurismäki,Timo Salminen,Aki Kaurismäki,Mika Kaurismäki,,7.4,7922.0,/nj01hspawPof0mJmlgfjuLyJuRN.jpg,False
2,5,Four Rooms,5.9,2681.0,Released,1995,4257354.0,98.0,4000000.0,tt0113101,...,"Antonio Banderas, Sammi Davis, Kimberly Blair,...","Allison Anders, Quentin Tarantino, Robert Rodr...","Andrzej Sekula, Phil Parmet, Guillermo Navarro...","Allison Anders, Quentin Tarantino, Robert Rodr...","Alexandre Rockwell, Lawrence Bender, Quentin T...",Combustible Edison,6.7,113889.0,/pyCk5JgtRZwRxnXwfrvyzukaKue.jpg,True
3,6,Judgment Night,6.458,349.0,Released,1993,12136938.0,109.0,21000000.0,tt0107286,...,"Stephen Dorff, Deirdre Kelly, Emilio Estevez, ...",Stephen Hopkins,Peter Levy,"Lewis Colick, Jere Cunningham","Marilyn Vance, Gene Levy, Lloyd Segan",Alan Silvestri,6.6,19826.0,/3rvvpS9YPM5HB2f4HYiNiJVtdam.jpg,True
4,8,Life in Loops (A Megacities RMX),7.5,27.0,Released,2006,0.0,80.0,42000.0,tt0825671,...,,Timo Novotny,Wolfgang Thaler,"Michael Glawogger, Timo Novotny","Ulrich Gehmacher, Timo Novotny",,8.2,284.0,/7ln81BRnPR2wqxuITZxEciCe1lc.jpg,True
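
Because the year values here are strings, comparing them against a string is a lexicographic comparison. That gives the expected answer for four-digit years but can mislead for anything else. The toy Series below (hypothetical values) contrasts a string comparison with a numeric one.

```python
import pandas as pd

# Toy year strings, including a three-digit value and a missing one.
years = pd.Series(["1988", "1995", "999", None])

# String comparison is character by character, so "999" sorts after "1990".
as_text = years.dropna() >= "1990"

# Numeric comparison after conversion; missing values become NaN and compare as False.
as_number = pd.to_numeric(years, errors="coerce") >= 1990

print(as_text.tolist())
print(as_number.tolist())
```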


### Creating a new dataset containing movies that are qualified for analysis:
#### - 20,000+ votes on IMDb
#### - Released in or after 1990
#### - Has a runtime of 60 minutes or more (feature films)
#### - Is not primarily a documentary

In [None]:
# Note: the genres check only excludes rows whose genres value is exactly "Documentary".
mask = (df1["imdb_votes"] >= 20000) & df1["contemporary"] & (df1["runtime"] >= 60) & (df1["genres"] != "Documentary")
qualData = df1[mask]
qualData.head()

Unnamed: 0,id,title,vote_average,vote_count,status,release_year,revenue,runtime,budget,imdb_id,...,cast,director,director_of_photography,writers,producers,music_composer,imdb_rating,imdb_votes,poster_path,contemporary
2,5,Four Rooms,5.9,2681.0,Released,1995,4257354.0,98.0,4000000.0,tt0113101,...,"Antonio Banderas, Sammi Davis, Kimberly Blair,...","Allison Anders, Quentin Tarantino, Robert Rodr...","Andrzej Sekula, Phil Parmet, Guillermo Navarro...","Allison Anders, Quentin Tarantino, Robert Rodr...","Alexandre Rockwell, Lawrence Bender, Quentin T...",Combustible Edison,6.7,113889.0,/pyCk5JgtRZwRxnXwfrvyzukaKue.jpg,True
7,12,Finding Nemo,7.817,19556.0,Released,2003,940335536.0,100.0,94000000.0,tt0266543,...,"Albert Brooks, Bob Bergen, Bobby Block, Ellen ...",Andrew Stanton,"Sharon Calahan, Jeremy Lasky","Blake Tucker, Bob Peterson, David Reynolds, Ad...","John Lasseter, Graham Walters",Thomas Newman,8.2,1155485.0,/eHuGQ10FUzK1mdOY69wF5pGgEf5.jpg,True
8,13,Forrest Gump,8.467,28060.0,Released,1994,677387716.0,142.0,55000000.0,tt0109830,...,"Bonnie Ann Burgess, Bob Hope, Aloysius Gigl, S...",Robert Zemeckis,Don Burgess,"Winston Groom, Eric Roth","Steve Tisch, Steve Starkey, Wendy Finerman",Alan Silvestri,8.8,2367925.0,/arw2vcBveWOVZr6pxd9XTd1TdQa.jpg,True
9,14,American Beauty,8.008,12277.0,Released,1999,356296601.0,122.0,15000000.0,tt0169547,...,"Kevin Spacey, Tom Miller, Reshma Gajjar, Lisa ...",Sam Mendes,Conrad L. Hall,Alan Ball,"Dan Jinks, Bruce Cohen",Thomas Newman,8.3,1245768.0,/s5PXkDqS8W3K4wCPNZBzf10zycw.jpg,True
11,16,Dancer in the Dark,7.9,1832.0,Released,2000,40061153.0,140.0,12500000.0,tt0168629,...,"Marianne Bengtsson, Britt Bendixen, David Mors...",Lars von Trier,Robby Müller,"Lars von Trier, Sjón","Vibeke Windeløv, Leo Pescarolo, Peter Aalbæk J...",Björk,7.9,120263.0,/9rsivF4sWfmBzrNr4LPu6TNJhXX.jpg,True
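
The genres column holds comma-separated strings, so an equality test against "Documentary" only matches films with that single genre. The sketch below shows one possible reading of "primarily a documentary", namely that the first listed genre is Documentary. That interpretation, and the toy `films` data, are assumptions made for illustration.

```python
import pandas as pd

# Toy films with genres stored as comma-separated strings (hypothetical rows).
films = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genres": ["Documentary", "Documentary, Music", "Drama, Documentary", None],
})

# Exact match: only catches films whose sole genre is Documentary.
exact_match = films["genres"] == "Documentary"

# Assumed reading of "primarily": the first listed genre is Documentary.
first_genre = films["genres"].str.split(", ").str[0]
primarily_documentary = first_genre == "Documentary"

kept = films[~primarily_documentary]
print(kept["title"].tolist())
```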


### Dropping unneeded columns

In [None]:
# Reassign the result instead of dropping in place, so pandas does not warn about modifying a slice of df1.
qualData = qualData.drop(columns=['id', 'vote_average', 'vote_count', 'status', 'imdb_id', 'original_language', 'original_title', 'overview', 'popularity', 'tagline', 'spoken_languages', 'cast', 'director_of_photography', 'writers', 'music_composer', 'poster_path'])
qualData.head()

Unnamed: 0,title,release_year,revenue,runtime,budget,genres,production_companies,production_countries,director,producers,imdb_rating,imdb_votes,contemporary
2,Four Rooms,1995,4257354.0,98.0,4000000.0,"Comedy, Crime","Miramax, A Band Apart",United States of America,"Allison Anders, Quentin Tarantino, Robert Rodr...","Alexandre Rockwell, Lawrence Bender, Quentin T...",6.7,113889.0,True
7,Finding Nemo,2003,940335536.0,100.0,94000000.0,"Animation, Family",Pixar,United States of America,Andrew Stanton,"John Lasseter, Graham Walters",8.2,1155485.0,True
8,Forrest Gump,1994,677387716.0,142.0,55000000.0,"Comedy, Drama, Romance","Paramount Pictures, The Steve Tisch Company, W...",United States of America,Robert Zemeckis,"Steve Tisch, Steve Starkey, Wendy Finerman",8.8,2367925.0,True
9,American Beauty,1999,356296601.0,122.0,15000000.0,Drama,"DreamWorks Pictures, Jinks/Cohen Company",United States of America,Sam Mendes,"Dan Jinks, Bruce Cohen",8.3,1245768.0,True
11,Dancer in the Dark,2000,40061153.0,140.0,12500000.0,"Drama, Crime","Zentropa Entertainments, DR, SVT Drama, ARTE, ...","Denmark, Finland, France, Germany, Iceland, Ne...",Lars von Trier,"Vibeke Windeløv, Leo Pescarolo, Peter Aalbæk J...",7.9,120263.0,True
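
Filtering a DataFrame with a boolean mask and then modifying the result in place is a common trigger for pandas' `SettingWithCopyWarning`. A minimal sketch of one way to avoid it, using a toy `movies` frame with hypothetical values: take an explicit copy after filtering and reassign the result of `drop`.

```python
import pandas as pd

# Toy frame standing in for the filtered movie data (hypothetical values).
movies = pd.DataFrame({
    "title": ["A", "B", "C"],
    "runtime": [45.0, 98.0, 120.0],
    "poster_path": ["/a.jpg", "/b.jpg", "/c.jpg"],
})

# .copy() makes the filtered frame independent of the original.
features = movies[movies["runtime"] >= 60].copy()

# drop returns a new frame; reassigning avoids inplace=True on a slice.
features = features.drop(columns=["poster_path"])

# Adding a column now modifies only `features`, never `movies`.
features["long"] = features["runtime"] >= 100
print(features)
```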


In [None]:
qualData

Unnamed: 0,title,release_year,revenue,runtime,budget,genres,production_companies,production_countries,director,producers,imdb_rating,imdb_votes,contemporary
2,Four Rooms,1995,4257354.0,98.0,4000000.0,"Comedy, Crime","Miramax, A Band Apart",United States of America,"Allison Anders, Quentin Tarantino, Robert Rodr...","Alexandre Rockwell, Lawrence Bender, Quentin T...",6.7,113889.0,True
7,Finding Nemo,2003,940335536.0,100.0,94000000.0,"Animation, Family",Pixar,United States of America,Andrew Stanton,"John Lasseter, Graham Walters",8.2,1155485.0,True
8,Forrest Gump,1994,677387716.0,142.0,55000000.0,"Comedy, Drama, Romance","Paramount Pictures, The Steve Tisch Company, W...",United States of America,Robert Zemeckis,"Steve Tisch, Steve Starkey, Wendy Finerman",8.8,2367925.0,True
9,American Beauty,1999,356296601.0,122.0,15000000.0,Drama,"DreamWorks Pictures, Jinks/Cohen Company",United States of America,Sam Mendes,"Dan Jinks, Bruce Cohen",8.3,1245768.0,True
11,Dancer in the Dark,2000,40061153.0,140.0,12500000.0,"Drama, Crime","Zentropa Entertainments, DR, SVT Drama, ARTE, ...","Denmark, Finland, France, Germany, Iceland, Ne...",Lars von Trier,"Vibeke Windeløv, Leo Pescarolo, Peter Aalbæk J...",7.9,120263.0,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...
69856,Ruby Sparks,2012,9368803.0,104.0,8000000.0,"Comedy, Romance, Fantasy, Drama","Fox Searchlight Pictures, DragonCove Studios, ...",United States of America,"Jonathan Dayton, Valerie Faris","Zoe Kazan, Paul Dano, Ron Yerxa, Albert Berger...",7.2,107016.0,True
69982,Maniac,2012,2631275.0,89.0,6000000.0,"Horror, Thriller","La Petite Reine, Studio 37",France,Franck Khalfoun,"Alix Taylor, Alexandre Aja, Pavlina Hatoupis, ...",6.1,41569.0,True
70014,The Hunt,2012,18300000.0,116.0,3800000.0,Drama,"Zentropa International Sweden, Zentropa Entert...","Denmark, Sweden",Thomas Vinterberg,"Thomas Vinterberg, Sisse Graum Jørgensen, Mort...",8.3,386077.0,True
70055,Mud,2013,32613173.0,130.0,10000000.0,Drama,"Everest Entertainment, FilmNation Entertainmen...",United States of America,Jeff Nichols,"Lisa Maria Falcone, Sarah Green, Glen Basner, ...",7.4,194124.0,True


### Splitting the 'genres' column into lists, dropping null values, keeping only the first three genres, and writing the result back to the dataframe

In [None]:
genreList = qualData["genres"].str.split(", ")
genreList.head()

Unnamed: 0,genres
2,"[Comedy, Crime]"
7,"[Animation, Family]"
8,"[Comedy, Drama, Romance]"
9,[Drama]
11,"[Drama, Crime]"


In [None]:
def get_first_three(item):
    if isinstance(item, list):
        return item[:3]
    return None

In [None]:
genreList.apply(type).value_counts()

Unnamed: 0_level_0,count
genres,Unnamed: 1_level_1
<class 'list'>,3830


In [None]:
genreList = genreList[genreList.apply(type) != float]

In [None]:
genreList.apply(type).value_counts()

Unnamed: 0_level_0,count
genres,Unnamed: 1_level_1
<class 'list'>,3830


In [None]:
genreList = genreList.apply(get_first_three)
genreList.head()

Unnamed: 0,genres
2,"[Comedy, Crime]"
7,"[Animation, Family]"
8,"[Comedy, Drama, Romance]"
9,[Drama]
11,"[Drama, Crime]"


In [None]:
# assign returns a new DataFrame, which avoids the warning about setting values on a slice.
qualData = qualData.assign(genres=genreList)
qualData.head()

Unnamed: 0,title,release_year,revenue,runtime,budget,genres,production_companies,production_countries,director,producers,imdb_rating,imdb_votes,contemporary
2,Four Rooms,1995,4257354.0,98.0,4000000.0,"[Comedy, Crime]","Miramax, A Band Apart",United States of America,"Allison Anders, Quentin Tarantino, Robert Rodr...","Alexandre Rockwell, Lawrence Bender, Quentin T...",6.7,113889.0,True
7,Finding Nemo,2003,940335536.0,100.0,94000000.0,"[Animation, Family]",Pixar,United States of America,Andrew Stanton,"John Lasseter, Graham Walters",8.2,1155485.0,True
8,Forrest Gump,1994,677387716.0,142.0,55000000.0,"[Comedy, Drama, Romance]","Paramount Pictures, The Steve Tisch Company, W...",United States of America,Robert Zemeckis,"Steve Tisch, Steve Starkey, Wendy Finerman",8.8,2367925.0,True
9,American Beauty,1999,356296601.0,122.0,15000000.0,[Drama],"DreamWorks Pictures, Jinks/Cohen Company",United States of America,Sam Mendes,"Dan Jinks, Bruce Cohen",8.3,1245768.0,True
11,Dancer in the Dark,2000,40061153.0,140.0,12500000.0,"[Drama, Crime]","Zentropa Entertainments, DR, SVT Drama, ARTE, ...","Denmark, Finland, France, Germany, Iceland, Ne...",Lars von Trier,"Vibeke Windeløv, Leo Pescarolo, Peter Aalbæk J...",7.9,120263.0,True


In [None]:
qualData = qualData[qualData['genres'].apply(type) != float]
qualData.head()

Unnamed: 0,title,release_year,revenue,runtime,budget,genres,production_companies,production_countries,director,producers,imdb_rating,imdb_votes,contemporary
2,Four Rooms,1995,4257354.0,98.0,4000000.0,"[Comedy, Crime]","Miramax, A Band Apart",United States of America,"Allison Anders, Quentin Tarantino, Robert Rodr...","Alexandre Rockwell, Lawrence Bender, Quentin T...",6.7,113889.0,True
7,Finding Nemo,2003,940335536.0,100.0,94000000.0,"[Animation, Family]",Pixar,United States of America,Andrew Stanton,"John Lasseter, Graham Walters",8.2,1155485.0,True
8,Forrest Gump,1994,677387716.0,142.0,55000000.0,"[Comedy, Drama, Romance]","Paramount Pictures, The Steve Tisch Company, W...",United States of America,Robert Zemeckis,"Steve Tisch, Steve Starkey, Wendy Finerman",8.8,2367925.0,True
9,American Beauty,1999,356296601.0,122.0,15000000.0,[Drama],"DreamWorks Pictures, Jinks/Cohen Company",United States of America,Sam Mendes,"Dan Jinks, Bruce Cohen",8.3,1245768.0,True
11,Dancer in the Dark,2000,40061153.0,140.0,12500000.0,"[Drama, Crime]","Zentropa Entertainments, DR, SVT Drama, ARTE, ...","Denmark, Finland, France, Germany, Iceland, Ne...",Lars von Trier,"Vibeke Windeløv, Leo Pescarolo, Peter Aalbæk J...",7.9,120263.0,True


In [None]:
qualData["genres"] = qualData["genres"].apply(get_first_three)
qualData.head()

Unnamed: 0,title,release_year,revenue,runtime,budget,genres,production_companies,production_countries,director,producers,imdb_rating,imdb_votes,contemporary
2,Four Rooms,1995,4257354.0,98.0,4000000.0,"[Comedy, Crime]","Miramax, A Band Apart",United States of America,"Allison Anders, Quentin Tarantino, Robert Rodr...","Alexandre Rockwell, Lawrence Bender, Quentin T...",6.7,113889.0,True
7,Finding Nemo,2003,940335536.0,100.0,94000000.0,"[Animation, Family]",Pixar,United States of America,Andrew Stanton,"John Lasseter, Graham Walters",8.2,1155485.0,True
8,Forrest Gump,1994,677387716.0,142.0,55000000.0,"[Comedy, Drama, Romance]","Paramount Pictures, The Steve Tisch Company, W...",United States of America,Robert Zemeckis,"Steve Tisch, Steve Starkey, Wendy Finerman",8.8,2367925.0,True
9,American Beauty,1999,356296601.0,122.0,15000000.0,[Drama],"DreamWorks Pictures, Jinks/Cohen Company",United States of America,Sam Mendes,"Dan Jinks, Bruce Cohen",8.3,1245768.0,True
11,Dancer in the Dark,2000,40061153.0,140.0,12500000.0,"[Drama, Crime]","Zentropa Entertainments, DR, SVT Drama, ARTE, ...","Denmark, Finland, France, Germany, Iceland, Ne...",Lars von Trier,"Vibeke Windeløv, Leo Pescarolo, Peter Aalbæk J...",7.9,120263.0,True
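
The split, filter, and "first three" steps above can also be chained with the `.str` accessor, which slices list elements as well as strings. This is a compact alternative on a toy Series (hypothetical values), not a replacement for the cells above.

```python
import pandas as pd

# Toy genres column with a long list and a missing value.
genres = pd.Series(["Comedy, Crime", "Comedy, Romance, Fantasy, Drama", None])

# Split into lists, then keep at most the first three items; missing values stay missing.
top_three = genres.str.split(", ").str[:3]
print(top_three)
```

Because missing values pass through untouched, the null rows can still be dropped in a separate, explicit step.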


### Repeating for production companies

In [None]:
prodCompanies = qualData["production_companies"].str.split(", ")
prodCompanies.head()

Unnamed: 0,production_companies
2,"[Miramax, A Band Apart]"
7,[Pixar]
8,"[Paramount Pictures, The Steve Tisch Company, ..."
9,"[DreamWorks Pictures, Jinks/Cohen Company]"
11,"[Zentropa Entertainments, DR, SVT Drama, ARTE,..."


In [None]:
prodCompanies.apply(type).value_counts()

Unnamed: 0_level_0,count
production_companies,Unnamed: 1_level_1
<class 'list'>,3824
<class 'float'>,6


In [None]:
prodCompanies = prodCompanies[prodCompanies.apply(type) != float]
prodCompanies.apply(type).value_counts()

Unnamed: 0_level_0,count
production_companies,Unnamed: 1_level_1
<class 'list'>,3824


In [None]:
prodCompanies = prodCompanies.apply(get_first_three)
prodCompanies.head()

Unnamed: 0,production_companies
2,"[Miramax, A Band Apart]"
7,[Pixar]
8,"[Paramount Pictures, The Steve Tisch Company, ..."
9,"[DreamWorks Pictures, Jinks/Cohen Company]"
11,"[Zentropa Entertainments, DR, SVT Drama]"


In [None]:
qualData["production_companies"] = prodCompanies
qualData.head()

Unnamed: 0,title,release_year,revenue,runtime,budget,genres,production_companies,production_countries,director,producers,imdb_rating,imdb_votes,contemporary
2,Four Rooms,1995,4257354.0,98.0,4000000.0,"[Comedy, Crime]","[Miramax, A Band Apart]",United States of America,"Allison Anders, Quentin Tarantino, Robert Rodr...","Alexandre Rockwell, Lawrence Bender, Quentin T...",6.7,113889.0,True
7,Finding Nemo,2003,940335536.0,100.0,94000000.0,"[Animation, Family]",[Pixar],United States of America,Andrew Stanton,"John Lasseter, Graham Walters",8.2,1155485.0,True
8,Forrest Gump,1994,677387716.0,142.0,55000000.0,"[Comedy, Drama, Romance]","[Paramount Pictures, The Steve Tisch Company, ...",United States of America,Robert Zemeckis,"Steve Tisch, Steve Starkey, Wendy Finerman",8.8,2367925.0,True
9,American Beauty,1999,356296601.0,122.0,15000000.0,[Drama],"[DreamWorks Pictures, Jinks/Cohen Company]",United States of America,Sam Mendes,"Dan Jinks, Bruce Cohen",8.3,1245768.0,True
11,Dancer in the Dark,2000,40061153.0,140.0,12500000.0,"[Drama, Crime]","[Zentropa Entertainments, DR, SVT Drama]","Denmark, Finland, France, Germany, Iceland, Ne...",Lars von Trier,"Vibeke Windeløv, Leo Pescarolo, Peter Aalbæk J...",7.9,120263.0,True


In [None]:
qualData = qualData[qualData['production_companies'].apply(type) != float]
qualData.head()

Unnamed: 0,title,release_year,revenue,runtime,budget,genres,production_companies,production_countries,director,producers,imdb_rating,imdb_votes,contemporary
2,Four Rooms,1995,4257354.0,98.0,4000000.0,"[Comedy, Crime]","[Miramax, A Band Apart]",United States of America,"Allison Anders, Quentin Tarantino, Robert Rodr...","Alexandre Rockwell, Lawrence Bender, Quentin T...",6.7,113889.0,True
7,Finding Nemo,2003,940335536.0,100.0,94000000.0,"[Animation, Family]",[Pixar],United States of America,Andrew Stanton,"John Lasseter, Graham Walters",8.2,1155485.0,True
8,Forrest Gump,1994,677387716.0,142.0,55000000.0,"[Comedy, Drama, Romance]","[Paramount Pictures, The Steve Tisch Company, ...",United States of America,Robert Zemeckis,"Steve Tisch, Steve Starkey, Wendy Finerman",8.8,2367925.0,True
9,American Beauty,1999,356296601.0,122.0,15000000.0,[Drama],"[DreamWorks Pictures, Jinks/Cohen Company]",United States of America,Sam Mendes,"Dan Jinks, Bruce Cohen",8.3,1245768.0,True
11,Dancer in the Dark,2000,40061153.0,140.0,12500000.0,"[Drama, Crime]","[Zentropa Entertainments, DR, SVT Drama]","Denmark, Finland, France, Germany, Iceland, Ne...",Lars von Trier,"Vibeke Windeløv, Leo Pescarolo, Peter Aalbæk J...",7.9,120263.0,True
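
The type check above removes the rows whose production companies are missing. A sketch of the same idea using `isna` and `dropna`, which state the intent more directly; the toy `films` frame and its values are hypothetical.

```python
import pandas as pd

# Toy frame with one missing production_companies value (hypothetical rows).
films = pd.DataFrame({
    "title": ["A", "B", "C"],
    "production_companies": ["Miramax, A Band Apart", None, "Pixar"],
})

# Count the missing values before removing them.
missing = int(films["production_companies"].isna().sum())

# Keep only rows that have a production_companies value.
kept = films.dropna(subset=["production_companies"])

# Split the remaining strings into lists of company names.
company_lists = kept["production_companies"].str.split(", ")
print(missing, company_lists.tolist())
```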


### Repeating for production countries

In [None]:
countries = qualData["production_countries"].str.split(", ")
countries.head()

Unnamed: 0,production_countries
2,[United States of America]
7,[United States of America]
8,[United States of America]
9,[United States of America]
11,"[Denmark, Finland, France, Germany, Iceland, N..."


In [None]:
countries.apply(type).value_counts()

Unnamed: 0_level_0,count
production_countries,Unnamed: 1_level_1
<class 'list'>,3824


In [None]:
countries = countries[countries.apply(type) != float]
countries.apply(type).value_counts()

Unnamed: 0_level_0,count
production_countries,Unnamed: 1_level_1
<class 'list'>,3824


In [None]:
countries = countries.apply(get_first_three)
countries.head()

Unnamed: 0,production_countries
2,[United States of America]
7,[United States of America]
8,[United States of America]
9,[United States of America]
11,"[Denmark, Finland, France]"


In [None]:
qualData["production_countries"] = countries
qualData.head()

Unnamed: 0,title,release_year,revenue,runtime,budget,genres,production_companies,production_countries,director,producers,imdb_rating,imdb_votes,contemporary
2,Four Rooms,1995,4257354.0,98.0,4000000.0,"[Comedy, Crime]","[Miramax, A Band Apart]",[United States of America],"Allison Anders, Quentin Tarantino, Robert Rodr...","Alexandre Rockwell, Lawrence Bender, Quentin T...",6.7,113889.0,True
7,Finding Nemo,2003,940335536.0,100.0,94000000.0,"[Animation, Family]",[Pixar],[United States of America],Andrew Stanton,"John Lasseter, Graham Walters",8.2,1155485.0,True
8,Forrest Gump,1994,677387716.0,142.0,55000000.0,"[Comedy, Drama, Romance]","[Paramount Pictures, The Steve Tisch Company, ...",[United States of America],Robert Zemeckis,"Steve Tisch, Steve Starkey, Wendy Finerman",8.8,2367925.0,True
9,American Beauty,1999,356296601.0,122.0,15000000.0,[Drama],"[DreamWorks Pictures, Jinks/Cohen Company]",[United States of America],Sam Mendes,"Dan Jinks, Bruce Cohen",8.3,1245768.0,True
11,Dancer in the Dark,2000,40061153.0,140.0,12500000.0,"[Drama, Crime]","[Zentropa Entertainments, DR, SVT Drama]","[Denmark, Finland, France]",Lars von Trier,"Vibeke Windeløv, Leo Pescarolo, Peter Aalbæk J...",7.9,120263.0,True


In [None]:
qualData = qualData[qualData['production_countries'].apply(type) != float]
qualData.head()

Unnamed: 0,title,release_year,revenue,runtime,budget,genres,production_companies,production_countries,director,producers,imdb_rating,imdb_votes,contemporary
2,Four Rooms,1995,4257354.0,98.0,4000000.0,"[Comedy, Crime]","[Miramax, A Band Apart]",[United States of America],"Allison Anders, Quentin Tarantino, Robert Rodr...","Alexandre Rockwell, Lawrence Bender, Quentin T...",6.7,113889.0,True
7,Finding Nemo,2003,940335536.0,100.0,94000000.0,"[Animation, Family]",[Pixar],[United States of America],Andrew Stanton,"John Lasseter, Graham Walters",8.2,1155485.0,True
8,Forrest Gump,1994,677387716.0,142.0,55000000.0,"[Comedy, Drama, Romance]","[Paramount Pictures, The Steve Tisch Company, ...",[United States of America],Robert Zemeckis,"Steve Tisch, Steve Starkey, Wendy Finerman",8.8,2367925.0,True
9,American Beauty,1999,356296601.0,122.0,15000000.0,[Drama],"[DreamWorks Pictures, Jinks/Cohen Company]",[United States of America],Sam Mendes,"Dan Jinks, Bruce Cohen",8.3,1245768.0,True
11,Dancer in the Dark,2000,40061153.0,140.0,12500000.0,"[Drama, Crime]","[Zentropa Entertainments, DR, SVT Drama]","[Denmark, Finland, France]",Lars von Trier,"Vibeke Windeløv, Leo Pescarolo, Peter Aalbæk J...",7.9,120263.0,True


### Creating a "profit" column

In [None]:
# Profit is defined here as revenue minus budget.
qualData["profit"] = qualData["revenue"] - qualData["budget"]
qualData.head()

Unnamed: 0,title,release_year,revenue,runtime,budget,genres,production_companies,production_countries,director,producers,imdb_rating,imdb_votes,contemporary,profit
2,Four Rooms,1995,4257354.0,98.0,4000000.0,"[Comedy, Crime]","[Miramax, A Band Apart]",[United States of America],"Allison Anders, Quentin Tarantino, Robert Rodr...","Alexandre Rockwell, Lawrence Bender, Quentin T...",6.7,113889.0,True,257354.0
7,Finding Nemo,2003,940335536.0,100.0,94000000.0,"[Animation, Family]",[Pixar],[United States of America],Andrew Stanton,"John Lasseter, Graham Walters",8.2,1155485.0,True,846335536.0
8,Forrest Gump,1994,677387716.0,142.0,55000000.0,"[Comedy, Drama, Romance]","[Paramount Pictures, The Steve Tisch Company, ...",[United States of America],Robert Zemeckis,"Steve Tisch, Steve Starkey, Wendy Finerman",8.8,2367925.0,True,622387716.0
9,American Beauty,1999,356296601.0,122.0,15000000.0,[Drama],"[DreamWorks Pictures, Jinks/Cohen Company]",[United States of America],Sam Mendes,"Dan Jinks, Bruce Cohen",8.3,1245768.0,True,341296601.0
11,Dancer in the Dark,2000,40061153.0,140.0,12500000.0,"[Drama, Crime]","[Zentropa Entertainments, DR, SVT Drama]","[Denmark, Finland, France]",Lars von Trier,"Vibeke Windeløv, Leo Pescarolo, Peter Aalbæk J...",7.9,120263.0,True,27561153.0
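
As a quick sanity check, the subtraction can be reproduced on two rows taken from the output above. The explicit `.copy()` is an assumption about how one might avoid pandas' `SettingWithCopyWarning` when adding a column to a frame that was produced by filtering; it is not part of the original cell.

```python
import pandas as pd

# Two rows copied from the output above.
sample = pd.DataFrame({
    "title": ["Four Rooms", "Finding Nemo"],
    "revenue": [4257354.0, 940335536.0],
    "budget": [4000000.0, 94000000.0],
})

# Work on an explicit copy so the new column is not assigned to a view of another frame.
sample = sample.copy()
sample["profit"] = sample["revenue"] - sample["budget"]

print(sample[["title", "profit"]])
```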


### Rearranging columns and dropping the ones no longer needed

In [None]:
qualData = qualData[['title', 'release_year', 'runtime', 'director', 'producers', 'revenue', 'budget', 'profit', 'genres',
                     'production_companies', 'production_countries', 'imdb_rating']]
qualData.head()

Unnamed: 0,title,release_year,runtime,director,producers,revenue,budget,profit,genres,production_companies,production_countries,imdb_rating
2,Four Rooms,1995,98.0,"Allison Anders, Quentin Tarantino, Robert Rodr...","Alexandre Rockwell, Lawrence Bender, Quentin T...",4257354.0,4000000.0,257354.0,"[Comedy, Crime]","[Miramax, A Band Apart]",[United States of America],6.7
7,Finding Nemo,2003,100.0,Andrew Stanton,"John Lasseter, Graham Walters",940335536.0,94000000.0,846335536.0,"[Animation, Family]",[Pixar],[United States of America],8.2
8,Forrest Gump,1994,142.0,Robert Zemeckis,"Steve Tisch, Steve Starkey, Wendy Finerman",677387716.0,55000000.0,622387716.0,"[Comedy, Drama, Romance]","[Paramount Pictures, The Steve Tisch Company, ...",[United States of America],8.8
9,American Beauty,1999,122.0,Sam Mendes,"Dan Jinks, Bruce Cohen",356296601.0,15000000.0,341296601.0,[Drama],"[DreamWorks Pictures, Jinks/Cohen Company]",[United States of America],8.3
11,Dancer in the Dark,2000,140.0,Lars von Trier,"Vibeke Windeløv, Leo Pescarolo, Peter Aalbæk J...",40061153.0,12500000.0,27561153.0,"[Drama, Crime]","[Zentropa Entertainments, DR, SVT Drama]","[Denmark, Finland, France]",7.9


### Making sure movies with "Documentary" as a secondary or tertiary genre do not escape filtering

In [None]:
def isDocumentary(data):
    if isinstance(data, list):
        return "Documentary" in data
    return False

In [None]:
qualData = qualData[~qualData['genres'].apply(isDocumentary)]
qualData.head()

Unnamed: 0,title,release_year,runtime,director,producers,revenue,budget,profit,genres,production_companies,production_countries,imdb_rating
2,Four Rooms,1995,98.0,"Allison Anders, Quentin Tarantino, Robert Rodr...","Alexandre Rockwell, Lawrence Bender, Quentin T...",4257354.0,4000000.0,257354.0,"[Comedy, Crime]","[Miramax, A Band Apart]",[United States of America],6.7
7,Finding Nemo,2003,100.0,Andrew Stanton,"John Lasseter, Graham Walters",940335536.0,94000000.0,846335536.0,"[Animation, Family]",[Pixar],[United States of America],8.2
8,Forrest Gump,1994,142.0,Robert Zemeckis,"Steve Tisch, Steve Starkey, Wendy Finerman",677387716.0,55000000.0,622387716.0,"[Comedy, Drama, Romance]","[Paramount Pictures, The Steve Tisch Company, ...",[United States of America],8.8
9,American Beauty,1999,122.0,Sam Mendes,"Dan Jinks, Bruce Cohen",356296601.0,15000000.0,341296601.0,[Drama],"[DreamWorks Pictures, Jinks/Cohen Company]",[United States of America],8.3
11,Dancer in the Dark,2000,140.0,Lars von Trier,"Vibeke Windeløv, Leo Pescarolo, Peter Aalbæk J...",40061153.0,12500000.0,27561153.0,"[Drama, Crime]","[Zentropa Entertainments, DR, SVT Drama]","[Denmark, Finland, France]",7.9
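
To illustrate what the check catches, here is a small sketch on made-up rows (the frame `toy` is hypothetical). "Documentary" appears as the only genre, as a secondary genre, and not at all, and one row has no genre list.

```python
import numpy as np
import pandas as pd

def isDocumentary(data):
    # Only list-valued cells can contain a genre; anything else (e.g. NaN) is treated as not a documentary.
    if isinstance(data, list):
        return "Documentary" in data
    return False

# Hypothetical rows covering each case.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genres": [["Documentary"], ["Music", "Documentary"], ["Drama", "Crime"], np.nan],
})

# Keep the rows where the check is False.
kept = toy[~toy["genres"].apply(isDocumentary)]
print(kept["title"].tolist())
```

Note that a row with a missing genre list is kept by this filter, since the function returns `False` for anything that is not a list.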


### Exploding the 'genres' column in order to analyze profit with genres

In [None]:
df_genres = qualData.explode('genres')
df_genres.head()

Unnamed: 0,title,release_year,runtime,director,producers,revenue,budget,profit,genres,production_companies,production_countries,imdb_rating
2,Four Rooms,1995,98.0,"Allison Anders, Quentin Tarantino, Robert Rodr...","Alexandre Rockwell, Lawrence Bender, Quentin T...",4257354.0,4000000.0,257354.0,Comedy,"[Miramax, A Band Apart]",[United States of America],6.7
2,Four Rooms,1995,98.0,"Allison Anders, Quentin Tarantino, Robert Rodr...","Alexandre Rockwell, Lawrence Bender, Quentin T...",4257354.0,4000000.0,257354.0,Crime,"[Miramax, A Band Apart]",[United States of America],6.7
7,Finding Nemo,2003,100.0,Andrew Stanton,"John Lasseter, Graham Walters",940335536.0,94000000.0,846335536.0,Animation,[Pixar],[United States of America],8.2
7,Finding Nemo,2003,100.0,Andrew Stanton,"John Lasseter, Graham Walters",940335536.0,94000000.0,846335536.0,Family,[Pixar],[United States of America],8.2
8,Forrest Gump,1994,142.0,Robert Zemeckis,"Steve Tisch, Steve Starkey, Wendy Finerman",677387716.0,55000000.0,622387716.0,Comedy,"[Paramount Pictures, The Steve Tisch Company, ...",[United States of America],8.8


### Exploding the 'production_companies' column in order to analyze profit with companies

In [None]:
df_companies = qualData.explode('production_companies')
df_companies.head()

Unnamed: 0,title,release_year,runtime,director,producers,revenue,budget,profit,genres,production_companies,production_countries,imdb_rating
2,Four Rooms,1995,98.0,"Allison Anders, Quentin Tarantino, Robert Rodr...","Alexandre Rockwell, Lawrence Bender, Quentin T...",4257354.0,4000000.0,257354.0,"[Comedy, Crime]",Miramax,[United States of America],6.7
2,Four Rooms,1995,98.0,"Allison Anders, Quentin Tarantino, Robert Rodr...","Alexandre Rockwell, Lawrence Bender, Quentin T...",4257354.0,4000000.0,257354.0,"[Comedy, Crime]",A Band Apart,[United States of America],6.7
7,Finding Nemo,2003,100.0,Andrew Stanton,"John Lasseter, Graham Walters",940335536.0,94000000.0,846335536.0,"[Animation, Family]",Pixar,[United States of America],8.2
8,Forrest Gump,1994,142.0,Robert Zemeckis,"Steve Tisch, Steve Starkey, Wendy Finerman",677387716.0,55000000.0,622387716.0,"[Comedy, Drama, Romance]",Paramount Pictures,[United States of America],8.8
8,Forrest Gump,1994,142.0,Robert Zemeckis,"Steve Tisch, Steve Starkey, Wendy Finerman",677387716.0,55000000.0,622387716.0,"[Comedy, Drama, Romance]",The Steve Tisch Company,[United States of America],8.8
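
Once each company has its own row, profit can be aggregated per company. The following is a minimal sketch using two rows from the output above; the aggregation and the names `movies`, `exploded`, and `by_company` are illustrative assumptions, not necessarily what the Power BI dashboard computes.

```python
import pandas as pd

# Rows taken from the notebook output above, before exploding.
movies = pd.DataFrame({
    "title": ["Four Rooms", "Finding Nemo"],
    "profit": [257354.0, 846335536.0],
    "production_companies": [["Miramax", "A Band Apart"], ["Pixar"]],
})

# One row per (movie, company) pair.
exploded = movies.explode("production_companies")

# Total profit and film count per company, highest total first.
by_company = (
    exploded.groupby("production_companies")["profit"]
    .agg(total_profit="sum", films="count")
    .sort_values("total_profit", ascending=False)
)
print(by_company)
```

One caveat with this design: a co-produced film's full profit is credited to every company listed on it, so summing `total_profit` across companies counts such films more than once.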


### Exploding the 'production_countries' column in order to analyze profit with countries

In [None]:
df_countries = qualData.explode('production_countries')
df_countries.head()

Unnamed: 0,title,release_year,runtime,director,producers,revenue,budget,profit,genres,production_companies,production_countries,imdb_rating
2,Four Rooms,1995,98.0,"Allison Anders, Quentin Tarantino, Robert Rodr...","Alexandre Rockwell, Lawrence Bender, Quentin T...",4257354.0,4000000.0,257354.0,"[Comedy, Crime]","[Miramax, A Band Apart]",United States of America,6.7
7,Finding Nemo,2003,100.0,Andrew Stanton,"John Lasseter, Graham Walters",940335536.0,94000000.0,846335536.0,"[Animation, Family]",[Pixar],United States of America,8.2
8,Forrest Gump,1994,142.0,Robert Zemeckis,"Steve Tisch, Steve Starkey, Wendy Finerman",677387716.0,55000000.0,622387716.0,"[Comedy, Drama, Romance]","[Paramount Pictures, The Steve Tisch Company, ...",United States of America,8.8
9,American Beauty,1999,122.0,Sam Mendes,"Dan Jinks, Bruce Cohen",356296601.0,15000000.0,341296601.0,[Drama],"[DreamWorks Pictures, Jinks/Cohen Company]",United States of America,8.3
11,Dancer in the Dark,2000,140.0,Lars von Trier,"Vibeke Windeløv, Leo Pescarolo, Peter Aalbæk J...",40061153.0,12500000.0,27561153.0,"[Drama, Crime]","[Zentropa Entertainments, DR, SVT Drama]",Denmark,7.9


In [None]:
qualData.to_csv("movies_cleaned.csv", index=False)

In [None]:
df_genres.to_csv("genre_analysis.csv", index=False)

In [None]:
df_companies.to_csv("company_analysis.csv", index=False)

In [None]:
df_countries.to_csv("country_analysis.csv", index=False)
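
One thing to keep in mind when saving: CSV has no list type, so a column of Python lists is written out as text. The sketch below assumes the list columns in `qualData` hold real Python lists, as the displayed output suggests, and shows how such a column could be parsed back with `ast.literal_eval` after reloading. The one-row frame `df` is made up for illustration, and an in-memory buffer stands in for the file on disk.

```python
import ast
import io

import pandas as pd

df = pd.DataFrame({"title": ["Four Rooms"], "genres": [["Comedy", "Crime"]]})

# Write to an in-memory buffer instead of disk to keep the sketch self-contained.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# After the round trip the list comes back as a plain string.
reloaded = pd.read_csv(buffer)
print(type(reloaded.loc[0, "genres"]))

# Parse the string back into a Python list.
reloaded["genres"] = reloaded["genres"].apply(ast.literal_eval)
print(reloaded.loc[0, "genres"])
```

This parsing step would fail on missing values, so any NaN cells would need to be handled first.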

# Dataset Analysis

Questions to be considered:

1. What is the sample size of your data set? The source dataset has over 1 million records, which we cut down for this analysis.
  - Source: https://www.kaggle.com/datasets/alanvourch/tmdb-movies-daily-updates
  - After cleaning the data to keep only the records and attributes relevant to this analysis, we ended up with a little over 6,500 records.
2. What are the key figures or measures present in your data?
  - After cleaning and manipulating the data to produce movies_cleaned.csv, the numeric fields that can be aggregated, averaged, or analyzed are: release_year, runtime, revenue, budget, profit, imdb_rating
3. What are the dimensions present in your data?
  - As with the previous question, the dimensions in movies_cleaned.csv that allow for grouping or filtering are: title, director, producers, genres, production_companies, production_countries
4. What are the properties of the dimensions? E.g., are there hierarchical data, geolocation-based data, time dimensions, and so on?
  - title: Categorical, no hierarchies
  - director: Categorical, multivalued
  - producers: Categorical, multivalued
  - genres: Categorical, multivalued; can be arranged into hierarchies if desired
  - production_companies: Categorical, multivalued
  - production_countries: Geographical, can be mapped or grouped

# Benefits/Implications

In a hypothetical real-life scenario, a movie director hires us to run some analysis on movies and their producers, and we want to gather insights for him: for example, the producers that generate the most revenue on their movies, the genres of the highest-grossing movies, etc. This would give the director better direction on where he should aim in regard to movie production.

# Visualizations

After producing movies_cleaned.csv, which contains only the data relevant to our project, we load the clean CSV into Power BI to build our visualizations from the dataset. Please refer to the Power BI dashboard from here.