## TECHNICAL REPORT - MOVIE REVENUE REGRESSION

### PROBLEM STATEMENT


The film industry is still a rapidly expanding beast. As of 2018, the global box office is worth at least 41.7 billion dollars, with home entertainment bringing that up to a massive 136 billion. Thus, every movie designed to be released for a wide audience has serious expectations on its shoulders. However, sometimes movies don't live up to the hype. For example, "Solo: A Star Wars Story" made roughly 400 million dollars worldwide, but cost over 450 million to create and market. On the other hand, the film "Napoleon Dynamite" cost only 400,000 dollars to create and market, but made 46.1 million worldwide.

With this in mind, my goal for this project was to build a regression model that could predict how much a proposed movie would make, given its cast, writers, director, and other crew. This would give studios a way to predict how well a proposed movie would do, and to decide whether or not it should be financed.

### DATA COLLECTION

I obtained the data directly from Kaggle, from the TMDB Box Office Prediction competition. The dataset can be found [here](https://www.kaggle.com/c/tmdb-box-office-prediction/).

Data was obtained in .csv format and stored on my local machine. 

### EXPLORATORY DATA ANALYSIS

The dataset originally contained 3000 movies. After cleaning the data, imputing some null values and removing rows with others, I pared it down to 2775 movies.
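The report does not say which columns were imputed and which rows were dropped. As a minimal sketch only, assuming hypothetical column names (`runtime`, `budget`, `revenue`) and median imputation, a cleaning step of this kind might look like the following in plain Python:

```python
from statistics import median

def clean_rows(rows, impute_cols=("runtime",), required_cols=("revenue", "budget")):
    """Drop rows missing a required field, then median-impute optional fields.

    Column names and the median strategy are assumptions for illustration,
    not the choices actually made in this project.
    """
    # Copy each kept row so the caller's data is not modified.
    kept = [dict(r) for r in rows if all(r.get(c) is not None for c in required_cols)]
    for col in impute_cols:
        observed = [r[col] for r in kept if r.get(col) is not None]
        if not observed:
            continue  # nothing to impute from
        fill = median(observed)
        for r in kept:
            if r.get(col) is None:
                r[col] = fill
    return kept

# Toy rows (made-up values, not taken from the dataset).
rows = [
    {"title": "A", "runtime": 100, "budget": 10, "revenue": 50},
    {"title": "B", "runtime": None, "budget": 20, "revenue": 80},
    {"title": "C", "runtime": 120, "budget": 5, "revenue": None},
    {"title": "D", "runtime": 90, "budget": 7, "revenue": 30},
]
cleaned = clean_rows(rows)
print(len(rows), "->", len(cleaned))
```

Dropping rows with a missing target while imputing a missing feature is one common split; other splits are equally plausible here.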

The distribution of the number of genres per movie is shown in the histogram below. <img src="number_of_genres_vs_number_of_movies.png"> 
Most movies have between 1 and 5 genres, with only a few having 6 or more.


Additionally, the distribution of revenue is shown in the histogram below. <img src="revenue_vs_number_of_movies.png"> 
The distribution of revenue is heavily skewed to the right.
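The skew seen in the histogram can also be put into a number. The sketch below computes the sample skewness (third standardized moment) on made-up revenue values, then again after a `log1p` transform. The log transform is a common remedy for right-skewed targets, but it is an assumption here: the report does not state that it was applied.

```python
import math

def sample_skewness(values):
    """Population-style skewness: m3 / m2**1.5 (positive means a right tail)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Made-up revenues in dollars, chosen only to mimic a long right tail.
revenues = [1e5, 5e5, 2e6, 8e6, 4e7, 9e8]
log_revenues = [math.log1p(v) for v in revenues]

print("raw skewness:", round(sample_skewness(revenues), 2))
print("log skewness:", round(sample_skewness(log_revenues), 2))
```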

The distribution of runtime is shown in the histogram below. <img src="Runtime_vs_number_of_movies.png"> 
Most movies are between 1 and 4 hours long, with only a few being shorter than an hour or longer than 4 hours.

According to the scatterplot of revenue and budget below, there is a rough positive correlation between the revenue and the budget of a movie. <img src="Revenue_vs_Budget.png"> 
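One way to quantify what the scatterplot shows is the Pearson correlation coefficient. The sketch below implements it directly and applies it to made-up budget and revenue pairs. The numbers are illustrative only, and the report does not give the actual coefficient.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Made-up (budget, revenue) values in millions of dollars.
budgets = [1, 5, 20, 50, 120, 200]
revenues = [3, 2, 60, 90, 400, 350]

r = pearson_r(budgets, revenues)
print(f"Pearson r = {r:.2f}")
```

A value near 1 would indicate a strong linear relationship; a "rough" positive correlation corresponds to a value clearly above 0 but well short of 1.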

The 20 most frequently appearing actors in the dataset, and the number of movies each has been in, are [('Samuel L. Jackson', 29),
 ('Robert De Niro', 29),
 ('Morgan Freeman', 27),
 ('Bruce Willis', 25),
 ('Liam Neeson', 25),
 ('J.K. Simmons', 23),
 ('Willem Dafoe', 23),
 ('Bruce McGill', 23),
 ('Susan Sarandon', 23),
 ('John Turturro', 23),
 ('Forest Whitaker', 22),
 ('Bill Murray', 22),
 ('Owen Wilson', 22),
 ('Sylvester Stallone', 21),
 ('Jason Statham', 21),
 ('John Goodman', 21),
 ('Mel Gibson', 21),
 ('Sigourney Weaver', 21),
 ('Nicolas Cage', 21),
  ('Keith David', 20)].

The 20 most frequently appearing crew members in the dataset, and the number of movies each has been in, are [('Avy Kaufman', 49),
 ('Robert Rodriguez', 44),
 ('Deborah Aquila', 40),
 ('James Newton Howard', 38),
 ('Mary Vernieu', 38),
 ('Steven Spielberg', 37),
 ('Luc Besson', 37),
 ('Jerry Goldsmith', 37),
 ('Francine Maisler', 35),
 ('Tricia Wood', 35),
 ('James Horner', 33),
 ('Kerry Barden', 30),
 ('Bob Weinstein', 30),
 ('Harvey Weinstein', 30),
 ('Janet Hirshenson', 30),
 ('Jane Jenkins', 29),
 ('John Debney', 28),
 ('John Papsidera', 28),
 ('Francis Ford Coppola', 28),
 ('Hans Zimmer', 27)]
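Frequency tables like the two above can be produced with `collections.Counter`. The sketch below assumes that each cast or crew cell is a string holding a Python-style list of dicts with a `name` key, which would need to be checked against the actual CSV. The names in the toy data are placeholders. It counts each person once per movie, which may differ from how the counts above were computed.

```python
import ast
from collections import Counter

def count_names(raw_column):
    """Count how many movies each person appears in."""
    counts = Counter()
    for raw in raw_column:
        if not raw:
            continue  # skip empty or missing cells
        people = ast.literal_eval(raw)  # safely parse the stringified list
        # A set ensures someone with several credits on one movie counts once.
        counts.update({person["name"] for person in people})
    return counts

# Placeholder cells in the assumed format (not real rows from the dataset).
cast_column = [
    "[{'name': 'Actor A', 'character': 'X'}, {'name': 'Actor B', 'character': 'Y'}]",
    "[{'name': 'Actor A', 'character': 'Z'}]",
    "",
]

top = count_names(cast_column).most_common(1)
print(top)
```

Calling `.most_common(20)` on the resulting counter would give a top-20 list in the same `(name, count)` shape as the lists above.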

All of my statistical analysis was performed on my local machine.  

### RESULTS

For my naive baseline, I used the mean revenue of my dataset as the prediction for every movie and obtained a root mean squared error (RMSE) of 158,616,484.34 dollars. I then used XGBoost, LinearRegression, Lasso, Ridge, and RandomForestRegressor models to try to predict movie revenue. Of the five, my XGBoost model performed best, lowering the RMSE by about 23 percent relative to the naive baseline, to 121,703,294.48 dollars. I am therefore choosing my XGBoost model to productionize. Further hyperparameter tuning yielded no noticeable decrease in RMSE.
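To make the comparison concrete, the sketch below shows how a mean-prediction baseline RMSE can be computed and how the relative reduction follows from the two RMSE values reported above. The function names are illustrative. Scoring the baseline on the same values used to compute the mean is an assumption, since the report does not describe the train/test split.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mean_baseline_rmse(y):
    """RMSE of predicting the mean of y for every observation."""
    mean_prediction = sum(y) / len(y)
    return rmse(y, [mean_prediction] * len(y))

def relative_reduction(baseline_rmse, model_rmse):
    """Fraction by which the model lowers RMSE relative to the baseline."""
    return (baseline_rmse - model_rmse) / baseline_rmse

# RMSE values as reported in this section.
naive_rmse = 158616484.34
xgboost_rmse = 121703294.48

reduction = relative_reduction(naive_rmse, xgboost_rmse)
print(f"Relative RMSE reduction: {reduction:.1%}")
```

Because RMSE is in dollars, the relative reduction is often easier to interpret than the raw difference when comparing candidate models against the same baseline.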

## FUTURE STEPS

#### ACQUIRE MORE DATA

Although my model improved on the naive baseline when predicting revenues, my initial dataset was only 3000 movies, which is relatively small. I would like to acquire more data, specifically on non-English movies such as films from Bollywood or the Chinese film market.

#### IMPROVE MODEL BY INCLUDING PCA

Given the time, in the future I would like to use principal component analysis to analyze movies' taglines and synopses to find out if there is any correlation between certain popular key phrases and movie revenue.