
EDA and model fitting work for the TMDB Box Office Prediction Kaggle competition. The goal of the competition is to predict the box office revenue of a movie. The best model would place in roughly the 58th percentile (top 42%) of the competition if it were still open.


dteuscher1/Movie-Revenue

 
 


TMDB Box Office Revenue Prediction Project

This repository includes exploratory analysis and predictive modeling for the TMDB Box Office Prediction Kaggle competition (https://www.kaggle.com/c/tmdb-box-office-prediction/overview).

The analysis was done in an .Rmd file, which is available in this repository. An HTML document, Movie-EDA.html, contains the same code together with its output. A notebook was also created on Kaggle for this competition and can be found here.

The goal of this competition is to predict the box office revenue of different movies. Every movie aims to maximize revenue, especially those with large production budgets. Movies are often judged a success or a failure by their box office performance, so accurately predicting box office revenue from known factors could help production companies decide how to approach future projects.

The data include 7398 movies from The Movie Database (TMDB), split into a training set of 3000 movies and a test set of 4398 movies. Each movie has a unique id and includes information such as cast, crew, plot keywords, budget, and release date, among other variables. The revenue column contains the box office revenue, which is the value to be predicted. The test set includes all of the same variables as the training set except revenue.
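As an illustration of what working with these variables can involve, here is a minimal Python sketch (the original analysis was done in R, so this is not the code used in the .Rmd file). It assumes that nested columns such as genres or cast arrive as strings holding a list of dictionaries with a name key; that layout and the helper name extract_names are assumptions made for the example.

```python
import ast


def extract_names(raw):
    """Return the 'name' values from a stringified list of dicts.

    Assumes cells look like "[{'id': 35, 'name': 'Comedy'}]".
    Missing or unparseable cells yield an empty list.
    """
    if not isinstance(raw, str) or not raw.strip():
        return []
    try:
        items = ast.literal_eval(raw)  # safely parse the Python-style literal
    except (ValueError, SyntaxError):
        return []
    if not isinstance(items, list):
        return []
    return [
        item["name"]
        for item in items
        if isinstance(item, dict) and "name" in item
    ]


# Hypothetical cell value, for illustration only.
example = "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}]"
genres = extract_names(example)
print(genres)
```

Returning an empty list for missing cells keeps later steps, such as counting genres per movie, free of special cases.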

The best model fit on this data was a gbm (gradient boosting) model with an RMSLE of 2.062, which would place at approximately the 58th percentile (top 42%) of the competition if it were still open. Much of the work went into extracting information about the genres, spoken languages, production companies, and production countries, and into visualizing the data. I see three possible areas for improvement. First, I believe additional information could be obtained from the genres, production companies, production countries, and spoken languages through further data manipulation. Second, I imputed a number of values using the median, and I think using a model to impute those values might work better. Third, I think improvement could be found by working with the tuning parameters and by trying other types of models.
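For readers unfamiliar with the metric, RMSLE (root mean squared logarithmic error) is the square root of the mean squared difference between log(1 + predicted) and log(1 + actual). The Python sketch below uses that standard definition; it is an illustration rather than the code used in the original R analysis, and the function name rmsle and the sample values are made up for the example.

```python
import math


def rmsle(actual, predicted):
    """Root mean squared logarithmic error for non-negative values."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_log_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2  # log1p(x) = log(1 + x)
        for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Hypothetical revenues and predictions, for illustration only.
actual_revenue = [1_000_000.0, 50_000_000.0, 250_000.0]
predicted_revenue = [2_000_000.0, 30_000_000.0, 250_000.0]
print(rmsle(actual_revenue, predicted_revenue))
```

Because the error is measured on the log scale, it reflects the ratio between predicted and actual revenue rather than the absolute difference. Read loosely, an RMSLE of 2.062 corresponds to predictions that are typically off by a factor on the order of exp(2.062), or roughly 7.9.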

If there are any questions or comments about the analysis and work done, feel free to email me at david.teuscher.96@gmail.com
