# **Analysis of Film Industry ROIs**
## Malcolm Katzenbach, Lauren Phipps, Dan Valenzuela

***



## **Overview** <a id="Overview"></a>

[**1. Business Problem**](#Business-Problem)

[**2. Data Understanding**](#Data-Understanding)

[**3. Data Preparation**](#Data-Preparation)

[**4. Data Analysis**](#Data-Analysis)

[**5. Evaluation**](#Evaluation)

[**6. Conclusion**](#Conclusion)
***

## **Business Problem** <a id="Business-Problem"></a>
[*↑ Back to overview*](#Overview)

[*↑ Back to overview*](#Overview)
***

## **Data Understanding** <a id="Data-Understanding"></a>
[*↑ Back to overview*](#Overview)

### Datasets
For this analysis we focused primarily on data from the Internet Movie Database (IMDB) and The-Numbers.com (TN), two sources that cover the film industry. Specifically, we used one dataset containing title, release year, and genre data and another containing title, release date, production budget, and box office figures. Below is a summary of the data pertinent to our analysis, broken down by file.

| imdb.title.basics.csv | tn.movie_budgets.csv |
| --- | --- |
| primary_title | movie |
| start_year | release_date |
| genres |  |
|  | production_budget |
|  | domestic_gross |
|  | worldwide_gross |


This table shows how columns in the two datasets "match": paired columns provide the same kind of data but may be in different formats. For example, `start_year` in `imdb.title.basics.csv` is formatted as `YYYY`, whereas `release_date` in `tn.movie_budgets.csv` is formatted as `MMM DD, YYYY`. The examples below show this discrepancy.

```python
import CustomLibrary as cl
from CustomLibrary import df_title_basics, df_movie_budgets

df_title_basics.head(3)
```

|  | tconst | primary_title | original_title | start_year | runtime_minutes | genres |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | tt0063540 | Sunghursh | Sunghursh | 2013 | 175.0 | Action,Crime,Drama |
| 1 | tt0066787 | One Day Before the Rainy Season | Ashad Ka Ek Din | 2019 | 114.0 | Biography,Drama |
| 2 | tt0069049 | The Other Side of the Wind | The Other Side of the Wind | 2018 | 122.0 | Drama |


```python
df_movie_budgets.head(3)
```

|  | id | release_date | movie | production_budget | domestic_gross | worldwide_gross |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | Dec 18, 2009 | Avatar | $425,000,000 | $760,507,625 | $2,776,345,279 |
| 1 | 2 | May 20, 2011 | Pirates of the Caribbean: On Stranger Tides | $410,600,000 | $241,063,875 | $1,045,663,875 |
| 2 | 3 | Jun 7, 2019 | Dark Phoenix | $350,000,000 | $42,762,350 | $149,762,350 |


Further, the data types make some columns hard to use directly. The previously discussed `release_date` and `start_year` columns are stored as `object` and `int64`, respectively, and the budget and box office columns are `object` strings that cannot be added or subtracted. How we dealt with them can be seen in the [data preparation](#Data-Preparation) section.

```python
cl.df_info()
```

```text
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146144 entries, 0 to 146143
Data columns (total 6 columns):
tconst             146144 non-null object
primary_title      146144 non-null object
original_title     146123 non-null object
start_year         146144 non-null int64
runtime_minutes    114405 non-null float64
genres             140736 non-null object
dtypes: float64(1), int64(1), object(4)
memory usage: 6.7+ MB
imdb.title.basics.csv
 None 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
id                   5782 non-null int64
release_date         5782 non-null object
movie                5782 non-null object
production_budget    5782 non-null object
domestic_gross       5782 non-null object
worldwide_gross      5782 non-null object
dtypes: int64(1), object(5)
memory usage: 271.2+ KB
tn._movie_budgets.csv
 None
```
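To illustrate the data type issue, here is a minimal sketch of one way the `object` columns could be converted into usable types. It is not necessarily how `CustomLibrary` handles it. The `sample` dataframe copies a few values from the rows shown earlier, and the `currency_to_int` helper and `release_year` column are hypothetical names introduced only for this example.

```python
import pandas as pd

# Small sample copying values from the tn.movie_budgets.csv rows shown earlier.
sample = pd.DataFrame({
    "release_date": ["Dec 18, 2009", "May 20, 2011", "Jun 7, 2019"],
    "movie": ["Avatar", "Pirates of the Caribbean: On Stranger Tides", "Dark Phoenix"],
    "production_budget": ["$425,000,000", "$410,600,000", "$350,000,000"],
    "worldwide_gross": ["$2,776,345,279", "$1,045,663,875", "$149,762,350"],
})


def currency_to_int(series):
    """Hypothetical helper: strip '$' and ',' from strings like '$425,000,000'."""
    return series.str.replace(r"[$,]", "", regex=True).astype("int64")


# Convert the money columns so they can be added and subtracted.
for col in ["production_budget", "worldwide_gross"]:
    sample[col] = currency_to_int(sample[col])

# Parse 'MMM DD, YYYY' strings into datetimes, then extract the year.
sample["release_date"] = pd.to_datetime(sample["release_date"], format="%b %d, %Y")
sample["release_year"] = sample["release_date"].dt.year  # hypothetical column name

print(sample.dtypes)
```

After this conversion the money columns support arithmetic, and `release_year` can be compared directly against an integer year such as `start_year`.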


### Key Data for Merging Datasets and Analysis

Another key issue is identifying which columns we can use to merge the datasets. Although `imdb.title.basics.csv` contains a unique identifier column, `tconst`, the `tn.movie_budgets.csv` dataset contains no identifier that links to it. We worked around that issue by using the title columns (`primary_title` and `movie`) and the release date columns (`start_year` and `release_date`) as our keys. As mentioned before, these columns do not share a format, so we extracted, for example, the year from the `release_date` string to match it against `start_year`.
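The title-plus-year matching described above can be sketched as follows. This is an illustrative example on made-up rows, not the project's actual merge code. The film titles, `tconst` values, and the `release_year` column name are placeholders.

```python
import pandas as pd

# Made-up rows shaped like imdb.title.basics.csv (placeholder IDs and titles).
imdb = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003"],
    "primary_title": ["Film A", "Film B", "Film A"],
    "start_year": [2011, 2015, 2018],
    "genres": ["Action,Adventure", "Drama", "Horror"],
})

# Made-up rows shaped like tn.movie_budgets.csv.
tn = pd.DataFrame({
    "release_date": ["May 20, 2011", "Jun 7, 2018"],
    "movie": ["Film A", "Film A"],
    "production_budget": ["$10,000,000", "$2,000,000"],
})

# The year is the last four characters of a 'MMM DD, YYYY' string.
tn["release_year"] = tn["release_date"].str[-4:].astype(int)

# Match on title AND year so that same-titled films from different years stay distinct.
merged = imdb.merge(
    tn,
    left_on=["primary_title", "start_year"],
    right_on=["movie", "release_year"],
    how="inner",
)

print(merged[["tconst", "primary_title", "start_year", "production_budget"]])
```

Using both keys matters here: the two rows titled "Film A" would be cross-matched if the merge used the title alone.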

Although the datasets contain key variables such as `genres`, `production_budget`, and `worldwide_gross`, they lack the total cost of producing each movie (including costs like marketing) and more granular genre data. As seen in [data preparation](#Data-Preparation), we made key assumptions, calculations, and manipulations to prepare the dataframe for our analysis of ROI, genre, and budgets.

[*↑ Back to overview*](#Overview)
***

## **Data Preparation** <a id="Data-Preparation"></a>
[*↑ Back to overview*](#Overview)

[*↑ Back to overview*](#Overview)
***

## **Data Analysis** <a id="Data-Analysis"></a>
[*↑ Back to overview*](#Overview)

### Question 1<a id="Question-1"></a> - What is the relationship between a film's budget and its return on investment?

The first step in the analysis was to determine whether there is a correlation between the amount of money invested in a movie and the return on investment. This helps decide whether it is more beneficial to invest in larger blockbuster movies or whether smaller, low-budget films offer more return. A scatterplot was created with the film's budget (production and advertisement costs) on the x-axis and the return on investment as a percentage on the y-axis. This was then broken down further by the low, mid, and high budget classifications to show the divide between the budgets more clearly.

![graph1](./images/ALLBudget_vs_ROI.png)

![graph2](./images/lowbudgetROIscatter.png)  

![graph3](./images/midbudgetROIscatter.png)

![graph4](./images/highbudgetROIscatter.png)


These plots show no strong correlation between a film's budget and its return on investment. The correlation coefficient is -0.05 across movies of all budgets. Broken out by budget tier, the low and mid-range budgets had a correlation coefficient of -0.1, while the high budget tier had a correlation coefficient of 0.05. Looking only at a movie's budget, there is no meaningful relationship between the size of the investment and the return on investment.
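As a rough illustration of how such coefficients can be computed, the sketch below uses made-up numbers and several assumptions not stated in this document: ROI is taken as `(worldwide_gross - budget) / budget * 100`, the tier cutoffs of $5M and $50M are hypothetical, and the coefficient is Pearson's (the pandas default). The values it produces are unrelated to the results reported above.

```python
import pandas as pd

# Made-up budgets and grosses, for illustration only.
films = pd.DataFrame({
    "budget": [1_000_000, 4_000_000, 20_000_000, 45_000_000, 120_000_000, 250_000_000],
    "worldwide_gross": [3_000_000, 2_000_000, 30_000_000, 90_000_000, 300_000_000, 500_000_000],
})

# Assumed ROI definition: net return as a percentage of the budget.
films["roi_pct"] = (films["worldwide_gross"] - films["budget"]) / films["budget"] * 100

# Hypothetical tier cutoffs; the actual low/mid/high boundaries are not given here.
films["tier"] = pd.cut(
    films["budget"],
    bins=[0, 5_000_000, 50_000_000, float("inf")],
    labels=["low", "mid", "high"],
)

# Pearson correlation between budget and ROI, overall and within each tier.
overall_r = films["budget"].corr(films["roi_pct"])
tier_r = {
    tier: group["budget"].corr(group["roi_pct"])
    for tier, group in films.groupby("tier", observed=True)
}

print(f"overall: {overall_r:.2f}")
for tier, r in tier_r.items():
    print(f"{tier}: {r:.2f}")
```

With only two films per tier in this toy data, each within-tier coefficient is trivially +1 or -1; a real dataset with many films per tier is needed for the coefficients to be informative.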



### Question 2 - What is the distribution of ROI by budget tiers? <a id="Question-2"></a>

Next, we looked more closely at the distribution of ROI within each of the budget tiers.

![graph5](./images/budgettierboxplot.png)

### Question 3<a id="Question-3"></a> - What is the distribution of the returns on investment for each movie genre?

When investing in movies, it is important to know whether certain genres lead to a higher return on investment than others. To answer this question, the data was broken into seven genres, the most common genres in the data set. Some movies were categorized as more than one genre, and those movies are included in each of their listed genres. For example, if a movie is considered Action and Adventure, its data is included in both genres because it is representative of both. The final category is "other," for any movies that did not fall into one of the seven listed genres.
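The multi-genre handling described above can be sketched with pandas as follows. The rows are made up, and the `tracked` set is a hypothetical three-genre stand-in for the seven genres used in the analysis.

```python
import pandas as pd

# Made-up films with comma-separated genre strings, as in the IMDB data.
films = pd.DataFrame({
    "primary_title": ["Film A", "Film B", "Film C", "Film D"],
    "genres": ["Action,Adventure", "Drama", "Musical", "Action,Musical"],
    "roi_pct": [120.0, 35.0, -10.0, 40.0],
})

# Hypothetical stand-in for the seven most common genres.
tracked = {"Action", "Adventure", "Drama"}

# One row per (film, genre) pair; explode keeps the original row index.
exploded = films.assign(genre=films["genres"].str.split(",")).explode("genre")

# Keep the pairs whose genre is tracked; a multi-genre film appears once per genre.
in_tracked = exploded[exploded["genre"].isin(tracked)]

# Films with no tracked genre at all go into a single "Other" bucket.
no_match = films[~films.index.isin(in_tracked.index)].assign(genre="Other")

by_genre = pd.concat([in_tracked, no_match])
median_roi = by_genre.groupby("genre")["roi_pct"].median()

print(median_roi)
```

Because a film such as "Film A" contributes a row to both Action and Adventure, the per-genre groups overlap, which is the intended behavior when each genre is summarized on its own.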

Because there is a wide range of values for returns on investment, it is beneficial to use a boxplot to get a more complete look at the data for each genre. Outliers have been excluded from this plot but are still part of the data. Their impact can be seen from the range of the whiskers. The median is represented by the red line. Because of the significant outliers, the median is a better representation of the data than the mean.

![graph8](./images/genreROIboxplot.png)

With the exception of documentaries, all genres have a median return on investment between 0% and 50%, with adventure and thriller movies having the highest. For most genres, the interquartile range (the 25th to 75th percentile, covering the middle half of the data points) falls between 0% and 100% return. However, thriller movies have the most significant outliers and the highest 75th percentile (with the exception of the "other" category). This means that more thriller movies have a return on investment above 100%, with some ranging upwards of 400%.

### Question 4<a id="Question-4"></a> - 

### Question 5<a id="Question-5"></a> - 

[*↑ Back to overview*](#Overview)
***

## **Evaluation**<a id="Evaluation"></a>
[*↑ Back to overview*](#Overview)

[*↑ Back to overview*](#Overview)
***

## **Conclusion**<a id="Conclusion"></a>
[*↑ Back to overview*](#Overview)

[*↑ Back to overview*](#Overview)
***