This repository includes exploratory analysis and predictive modeling for the TMDB Box Office Prediction Kaggle competition (https://www.kaggle.com/c/tmdb-box-office-prediction/overview).
The analysis was done in an .Rmd file, which is included in this repository. An HTML document, Movie-EDA.html, contains the code together with its output. A notebook was also created on Kaggle for this competition and can be found here.
The goal of this competition is to predict the box office revenue of movies. Every movie is trying to maximize revenue, especially those with large production budgets. Movies are often judged a success or failure by their box office performance, so accurately predicting revenue from a movie's characteristics could help production companies evaluate future projects.
The data include 7,398 movies from The Movie Database (TMDB), split into a training set of 3,000 movies and a test set of 4,398 movies. Each movie has a unique id and comes with information such as cast, crew, plot keywords, budget, and release date, among other variables. The revenue column in the training set is the target to be predicted. The test set contains all of the same variables as the training set except revenue.
The best model fit on this data was a gbm model with an RMSLE of 2.062, which would place at approximately the 58th percentile if the competition were still open. Much of the work went into extracting information about the genres, spoken languages, production companies, and production countries, and into visualizing the data.

I see three possible areas for improvement. First, additional data manipulation could likely extract more information from the genres, production companies, production countries, and spoken languages. Second, I imputed a number of missing values using the median, and using a model to impute those values might work better. Third, further improvement could come from adjusting the tuning parameters and trying other types of models.
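
For reference, the RMSLE (root mean squared logarithmic error) quoted above is usually defined as shown below. This is the standard form of the metric, not a formula copied from the analysis. The symbols are notation introduced here: $n$ is the number of movies scored, $\hat{y}_i$ is the predicted revenue for movie $i$, and $y_i$ is its actual revenue.

$$
\mathrm{RMSLE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \bigl( \log(\hat{y}_i + 1) - \log(y_i + 1) \bigr)^2 }
$$

Because the error is measured on the log scale, a prediction that is off by a given ratio is penalized about the same for a small release as for a blockbuster, so the largest movies do not dominate the score.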
If you have any questions or comments about the analysis, feel free to email me at david.teuscher.96@gmail.com.