# Matthew's Data Manipulation and Analysis

In this notebook I examine data from the "IMDB" and "The Numbers" databases. Using the data from these two sources, I will answer the following question:

**Which genre(s) give the most lift to profit? ROI? Profit margin?**

I will clean, investigate, and visualize data in relation to the question; I will then create a specific recommendation for the director of a new movie studio based on my findings.

In [49]:
# Import relevant Python packages.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [50]:
# Load the "IMDB" and "The Numbers" csv files into DataFrames.
imdb = pd.read_csv("../Data/imdb_data")
tn = pd.read_csv("../Data/tn_data")

In [51]:
# Visually confirm IMDB data loaded as expected
imdb.head()

Unnamed: 0,movie_id,primary_title,runtime_minutes,genres,averagerating,numvotes
0,sunghursh2013,Sunghursh,175.0,"Action,Crime,Drama",7.0,77.0
1,one day before the rainy season2019,One Day Before the Rainy Season,114.0,"Biography,Drama",7.2,43.0
2,the other side of the wind2018,The Other Side of the Wind,122.0,Drama,6.9,4517.0
3,sabse bada sukh2018,Sabse Bada Sukh,,"Comedy,Drama",6.1,13.0
4,the wandering soap opera2017,The Wandering Soap Opera,80.0,"Comedy,Drama,Fantasy",6.5,119.0


In [184]:
# Visually confirm TN data loaded as expected
tn.head()

Unnamed: 0,movie_id,id,release_date,movie,production_budget,domestic_gross,worldwide_gross,total profit,ROI,domestic profit,domestic profit margin,total profit margin
0,avatar2009,1,2009-12-18,Avatar,425000000,760507625,2776345279,2351345279,5.532577,335507625,0.441163,0.846921
1,pirates of the caribbean: on stranger tides2011,2,2011-05-20,Pirates of the Caribbean: On Stranger Tides,410600000,241063875,1045663875,635063875,1.546673,-169536125,-0.703283,0.607331
2,dark phoenix2019,3,2019-06-07,Dark Phoenix,350000000,42762350,149762350,-200237650,-0.572108,-307237650,-7.18477,-1.337036
3,avengers: age of ultron2015,4,2015-05-01,Avengers: Age of Ultron,330600000,459005868,1403013963,1072413963,3.243841,128405868,0.279748,0.764364
4,star wars ep. viii: the last jedi2017,5,2017-12-15,Star Wars Ep. VIII: The Last Jedi,317000000,620181382,1316721747,999721747,3.153696,303181382,0.488859,0.759251
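The derived columns in this table appear to follow simple formulas. The sketch below rebuilds them for two rows copied from the output above. The formulas are my assumption, inferred from the displayed values rather than taken from the code that originally produced the file, and `sample` is a hypothetical name.

```python
import pandas as pd

# Two rows copied from the tn.head() output above.
sample = pd.DataFrame({
    "movie": ["Avatar", "Dark Phoenix"],
    "production_budget": [425000000, 350000000],
    "domestic_gross": [760507625, 42762350],
    "worldwide_gross": [2776345279, 149762350],
})

# Assumed formulas, inferred from the displayed values.
sample["total profit"] = sample["worldwide_gross"] - sample["production_budget"]
sample["ROI"] = sample["total profit"] / sample["production_budget"]
sample["domestic profit"] = sample["domestic_gross"] - sample["production_budget"]
sample["domestic profit margin"] = sample["domestic profit"] / sample["domestic_gross"]
sample["total profit margin"] = sample["total profit"] / sample["worldwide_gross"]

# Transposed so each movie reads top to bottom.
print(sample.round(6).T)
```

Under these assumptions ROI is profit relative to budget, while the profit margins are profit relative to gross, which is why a film with a small gross can show a margin far below -1.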


## Data Preparation

### IMDB
The data requires a bit more cleaning before we begin.

In [53]:
# Our group decided to drop the IMDB rows where 'numvotes' is missing (NaN), as these entries are usually errors.
# Note: dropna only removes NaN values; rows with numvotes == 0 would need a separate filter.
imdb.dropna(subset=['numvotes'], inplace=True)


In [54]:
# Confirm rows were dropped by checking how many rows there are.
imdb.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 73856 entries, 0 to 146134
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   movie_id         73856 non-null  object 
 1   primary_title    73856 non-null  object 
 2   runtime_minutes  66236 non-null  float64
 3   genres           73052 non-null  object 
 4   averagerating    73856 non-null  float64
 5   numvotes         73856 non-null  float64
dtypes: float64(3), object(3)
memory usage: 3.9+ MB
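One detail worth checking here: `dropna` removes missing values only, so a vote count of exactly 0 would survive it. The sketch below shows the difference on hypothetical toy data (the ids and vote counts are made up for illustration and are not from the IMDB file).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing vote count and one zero vote count.
toy = pd.DataFrame({
    "movie_id": ["a2010", "b2011", "c2012"],
    "numvotes": [120.0, np.nan, 0.0],
})

# dropna removes the NaN row but keeps the zero-vote row.
only_nan_dropped = toy.dropna(subset=["numvotes"])

# An explicit comparison is needed if zero-vote rows should go as well.
nan_and_zero_dropped = only_nan_dropped[only_nan_dropped["numvotes"] > 0]

print(len(toy), len(only_nan_dropped), len(nan_and_zero_dropped))  # prints: 3 2 1
```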


In [148]:
# Remove the duplicate values in 'movie_id' and keep the entry with the most votes
imdb = imdb.sort_values('numvotes', ascending=False).drop_duplicates(subset='movie_id')
# I sorted the table by 'numvotes', descending, and then kept the first occurrence of each entry.

# Confirm the duplicates were deleted.
imdb.duplicated('movie_id').value_counts()

False    73264
dtype: int64
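To make the sort-then-deduplicate step easier to follow, here is a minimal sketch on hypothetical toy data. It also shows a `groupby`/`idxmax` variant that should select the same rows; that variant is an alternative I am suggesting, not something used in this notebook.

```python
import pandas as pd

# Hypothetical toy data: "x2015" appears twice with different vote counts.
toy = pd.DataFrame({
    "movie_id": ["x2015", "y2016", "x2015"],
    "numvotes": [10.0, 50.0, 300.0],
})

# Same pattern as the cell above: sort descending, keep the first row per id.
deduped = toy.sort_values("numvotes", ascending=False).drop_duplicates(subset="movie_id")

# Alternative: look up the row label of the maximum vote count per movie_id.
best_rows = toy.loc[toy.groupby("movie_id")["numvotes"].idxmax()]

print(deduped)
print(best_rows)
```

Both results keep the 300-vote row for `x2015`; they differ only in row order.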

In [149]:
# Display the table indexed by movie_id (the result is not assigned, so imdb itself keeps its current index)
imdb.set_index('movie_id')

movie_id,primary_title,runtime_minutes,genres,averagerating,numvotes
inception2010,Inception,148.0,"Action,Adventure,Sci-Fi",8.8,1841066.0
the dark knight rises2012,The Dark Knight Rises,164.0,"Action,Thriller",8.4,1387769.0
interstellar2014,Interstellar,169.0,"Adventure,Drama,Sci-Fi",8.6,1299334.0
django unchained2012,Django Unchained,165.0,"Drama,Western",8.4,1211405.0
the avengers2012,The Avengers,143.0,"Action,Adventure,Sci-Fi",8.1,1183655.0
...,...,...,...,...,...
ann2016,Ann,79.0,Drama,8.6,5.0
the whale caller2016,The Whale Caller,94.0,Drama,7.4,5.0
choctaw code talkers2010,Choctaw Code Talkers,56.0,Documentary,8.8,5.0
crime2010,Crime,70.0,Thriller,4.2,5.0
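Note that `set_index` returns a new DataFrame and leaves the original untouched unless the result is assigned back. A small sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data standing in for the imdb table.
toy = pd.DataFrame({"movie_id": ["x2015", "y2016"], "numvotes": [300.0, 50.0]})

# Calling set_index without assigning only produces a preview.
toy.set_index("movie_id")
print("movie_id" in toy.columns)  # the original still has movie_id as a column

# Assigning the result keeps the new index.
indexed = toy.set_index("movie_id")
print(indexed.index.name)
```

Whether to assign depends on what comes next: keeping `movie_id` as a regular column makes a later `merge(on='movie_id')` straightforward, while an index is convenient for `.loc` lookups.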


In [None]:
# For data comparis

### The Numbers

In [155]:
# Check for duplicates
tn[tn.duplicated(subset=['movie_id'], keep=False)]

Unnamed: 0,movie_id,id,release_date,movie,production_budget,domestic_gross,worldwide_gross,total profit,ROI,domestic profit,domestic profit margin,total profit margin
3455,home2009,56,2009-06-05,Home,12000000,0,0,-12000000,-1.0,-12000000,-inf,-inf
5459,home2009,60,2009-04-23,Home,500000,15433,44793168,44293168,88.586336,-484567,-31.398108,0.988838


These are two different movies with the same name and release year, so they share a "movie_id". Normally it would be fine to keep both, but because one has a "worldwide_gross" of 0, we will be removing it anyway.
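A zero gross also explains the `-inf` margin shown above. Assuming the total profit margin is total profit divided by "worldwide_gross" (my inference from the values in the table, not confirmed by the original code), dividing a negative profit by zero gives negative infinity in pandas:

```python
import numpy as np
import pandas as pd

# The two "Home" rows, with values copied from the output above.
home = pd.DataFrame({
    "production_budget": [12000000, 500000],
    "worldwide_gross": [0, 44793168],
})

profit = home["worldwide_gross"] - home["production_budget"]

# Assumed formula: total profit margin = total profit / worldwide gross.
margin = profit / home["worldwide_gross"]

print(margin)
print(np.isneginf(margin.iloc[0]))
```

Infinite values like this would distort any mean or plot of profit margin, which is another reason to drop the zero-gross rows.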

In [173]:
# My group agreed to remove any row with a 'worldwide_gross' of 0 since it is probably a placeholder value.
tn['worldwide_gross'].value_counts().head(1)

0    367
Name: worldwide_gross, dtype: int64

In [185]:
# Remove rows where "worldwide_gross" is 0
tn.drop(tn[tn['worldwide_gross'] == 0].index, inplace=True)


In [196]:
# Visually confirm the expected result: no rows with a 'worldwide_gross' of 0 remain
tn[tn['worldwide_gross'] == 0]

Unnamed: 0,movie_id,id,release_date,movie,production_budget,domestic_gross,worldwide_gross,total profit,ROI,domestic profit,domestic profit margin,total profit margin
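As a complement to the visual check, the same cleanup can be verified with assertions. This is a minimal sketch on toy data built from values shown earlier; the boolean-mask filter is an alternative to `drop(..., inplace=True)` that I am suggesting, and `filtered` is a hypothetical name.

```python
import pandas as pd

# Toy data using values shown earlier: two "home2009" rows and Avatar.
toy = pd.DataFrame({
    "movie_id": ["home2009", "home2009", "avatar2009"],
    "worldwide_gross": [0, 44793168, 2776345279],
})

# Keep only rows with a non-zero worldwide gross.
filtered = toy[toy["worldwide_gross"] != 0].copy()

# These checks fail loudly if a zero-gross row or a duplicate id remains.
assert (filtered["worldwide_gross"] != 0).all()
assert not filtered.duplicated(subset=["movie_id"]).any()

print(len(toy) - len(filtered), "row(s) removed")  # prints: 1 row(s) removed
```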
