# Minty Pandas' Workflow

In this notebook, we present a chronological summary of our workflow throughout the Datatonic data challenge. It consists of four parts: 

(1) Data pre-processing

(2) Feature engineering

(3) Exploratory data analysis (EDA), data visualisation, and ordinary least squares (OLS) regression

(4) Machine learning (ML) models

# (1) Data pre-processing

## Summary of "removing_outliers_and_null_values.ipynb"

Two datasets describing 5000 TMDB movies were imported from Kaggle: one with movie details and another with movie credits.

After reading in the two datasets as pandas dataframes, we first removed null values and outliers from the movie details dataset. In a nutshell, this step removes columns that contain a large number of null values and then removes any remaining rows that contain null values. The rationale is that such columns and rows are unreliable sources of information.
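
The null-handling step can be sketched as follows. This is a minimal illustration on toy data, not the notebook's actual code: the column names, the values, and the 50% cut-off (`MAX_NULL_FRACTION`) are assumptions, since the text only says "a large number of null values".

```python
import numpy as np
import pandas as pd

# Toy stand-in for the movie details dataframe (values are illustrative only).
movies = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "homepage": [None, None, None, "http://example.com"],  # mostly null
    "runtime": [120.0, np.nan, 95.0, 101.0],
    "revenue": [1.0e8, 2.5e7, 4.0e7, 9.0e6],
})

MAX_NULL_FRACTION = 0.5  # assumed cut-off; not taken from the original notebook

# 1) Drop columns whose fraction of null values exceeds the cut-off.
null_fraction = movies.isna().mean()
cleaned = movies.loc[:, null_fraction <= MAX_NULL_FRACTION]

# 2) Drop any remaining rows that still contain a null value.
cleaned = cleaned.dropna(axis=0, how="any")

print(cleaned)
```

Dropping sparse columns first matters: if rows were dropped first, a single mostly-empty column would discard nearly every movie.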

Secondly, outliers were discarded: all values that deviated from the mean by more than three standard deviations were removed from the dataset.
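
A minimal sketch of the three-standard-deviation rule, assuming it is applied per numeric column and that a row is dropped if any of its values is flagged. The column name and data are made up for illustration.

```python
import pandas as pd

# Toy data: twenty ordinary runtimes plus one planted outlier.
movies = pd.DataFrame({"runtime": list(range(90, 110)) + [500]})

numeric_cols = ["runtime"]  # hypothetical selection of numeric columns

# z-score of every value relative to its own column's mean and standard deviation.
z_scores = (movies[numeric_cols] - movies[numeric_cols].mean()) / movies[numeric_cols].std()

# Keep only rows where every numeric value lies within three standard deviations.
within_three_sigma = (z_scores.abs() <= 3).all(axis=1)
filtered = movies[within_three_sigma]

print(len(movies), "->", len(filtered))
```

Note that extreme values inflate the standard deviation itself, so on very small samples this rule can fail to flag anything.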

Finally, the updated movie details dataframe was joined with the movie credits dataframe to produce one master dataframe, which was exported as a pickle file so that it was readily available for subsequent analysis.
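
The join and export might look like the sketch below. The key columns (`id` and `movie_id`), the inner join, and the file name are assumptions for illustration; the toy frames stand in for the real datasets.

```python
import os
import tempfile

import pandas as pd

# Toy stand-ins for the cleaned movie details and the movie credits.
details = pd.DataFrame({"id": [1, 2, 3], "revenue": [10, 20, 30]})
credits = pd.DataFrame({"movie_id": [1, 2, 4], "cast": ["[]", "[]", "[]"]})

# Inner join keeps only movies present in both dataframes (assumed join type).
master = details.merge(credits, left_on="id", right_on="movie_id", how="inner")
master = master.drop(columns="movie_id")  # redundant after the join

# Export as a pickle and read it back, as a later notebook would.
pickle_path = os.path.join(tempfile.mkdtemp(), "master.pkl")  # hypothetical path
master.to_pickle(pickle_path)
reloaded = pd.read_pickle(pickle_path)
```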

## Summary of "strings_to_dicts.ipynb"

In the original dataset, features such as genres, keywords, and production companies were stored as raw JSON strings, each encoding a list of dictionaries. To make these features more accessible, we converted each string of dictionaries into a single dictionary of lists. For example, if a movie had the entry `[{"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 28, "name": "Action"}, {"id": 10749, "name": "Romance"}]` under genres, this would be converted to `{'id': [12, 14, 28, 10749], 'name': ['Adventure', 'Fantasy', 'Action', 'Romance']}` in the updated dataframe. This was done for every column whose entries were stored in JSON format, resulting in a dataframe that was easier to work with. However, the cast and crew columns were skipped as their entries contained more than two keys. Given the time constraints, we decided that this added complexity was not worth pursuing further.
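
The conversion described above can be sketched as a small helper. The function name `records_to_dict_of_lists` is hypothetical; the input and output reproduce the genres example from the text.

```python
import json


def records_to_dict_of_lists(raw):
    """Turn a JSON string holding a list of dicts into one dict of lists."""
    records = json.loads(raw)
    merged = {}
    for record in records:
        for key, value in record.items():
            # Collect each key's values in the order the records appear.
            merged.setdefault(key, []).append(value)
    return merged


raw_genres = (
    '[{"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, '
    '{"id": 28, "name": "Action"}, {"id": 10749, "name": "Romance"}]'
)
genres = records_to_dict_of_lists(raw_genres)
print(genres)
```

On a dataframe, a helper like this could be applied per column, for example with `df["genres"].apply(records_to_dict_of_lists)`.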

Next, we created id maps that link a given id to its corresponding name. Using the above example, the id 10749 would be mapped to the genre of Romance. These id maps were crucial for the subsequent one-hot encoding of a range of variables, which in turn was instrumental for determining important features in our ML models. We discuss these below.
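
One way to build such an id map from the converted entries is sketched below. The helper name `build_id_map` and the toy column are assumptions; each entry is taken to be a dictionary with parallel `id` and `name` lists.

```python
def build_id_map(column):
    """Map every id seen in a column of {'id': [...], 'name': [...]} dicts to its name."""
    id_map = {}
    for entry in column:
        ids = entry.get("id", [])
        names = entry.get("name", [])
        for id_, name in zip(ids, names):
            id_map[id_] = name
    return id_map


# Toy genres column; the last movie has no genres at all.
genres_column = [
    {"id": [12, 10749], "name": ["Adventure", "Romance"]},
    {"id": [28, 12], "name": ["Action", "Adventure"]},
    {},
]
genre_id_map = build_id_map(genres_column)
print(genre_id_map[10749])
```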

# (2) Feature engineering

## Summary of "one-hot-encoding.ipynb"

As a large fraction of movies were missing keywords, we first sought to supplement these by processing each movie's overview. We wrote a function that breaks up the overview into individual words and compares these to the existing keywords from our id maps. If there was a match, the word was added to that movie's keywords dictionary. This enriched the existing keywords for many movies and added entries for movies that previously had none.
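
A minimal sketch of this keyword supplementation, assuming simple lowercase word matching. The function name, the tokenising regular expression, and the toy keyword ids are all hypothetical.

```python
import re


def supplement_keywords(overview, movie_keywords, keyword_id_map):
    """Return a copy of movie_keywords extended with known keywords found in the overview."""
    name_to_id = {name.lower(): id_ for id_, name in keyword_id_map.items()}
    words = set(re.findall(r"[a-z0-9']+", overview.lower()))
    updated = {
        "id": list(movie_keywords.get("id", [])),
        "name": list(movie_keywords.get("name", [])),
    }
    for word in sorted(words):  # sorted only to make the output order deterministic
        keyword_id = name_to_id.get(word)
        if keyword_id is not None and keyword_id not in updated["id"]:
            updated["id"].append(keyword_id)
            updated["name"].append(keyword_id_map[keyword_id])
    return updated


keyword_id_map = {1: "robot", 2: "space", 3: "time travel"}  # toy id map
existing = {"id": [2], "name": ["space"]}
enriched = supplement_keywords("A lonely robot drifts through space.", existing, keyword_id_map)
print(enriched)
```

Because the overview is split into individual words, only single-word keywords can match under this approach; a multi-word keyword such as "time travel" is never picked up.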

We decided to use one-hot encoding to convert the following categorical features into ones suitable for our machine learning algorithms: genres, keywords, production companies, production countries, spoken languages, crew, and cast. In doing so, we also collected information regarding the gender of the first and second lead cast members.

To reduce the number of these categorical features in our dataset, we kept only the 500 most frequently occurring keywords and crew members. For cast members, however, we instead kept only actors who appeared in more than 10 movies.
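
The encoding and the two filtering rules can be illustrated on toy data as follows. The constants are scaled down (`TOP_N = 2` in place of 500, `MIN_APPEARANCES = 1` in place of 10) and the column prefix is an assumption.

```python
from collections import Counter

import pandas as pd

# Toy keyword lists, one per movie.
keywords_per_movie = pd.Series([
    ["robot", "space"],
    ["space", "love"],
    ["space", "robot", "war"],
])

TOP_N = 2            # the text uses 500 for keywords and crew
MIN_APPEARANCES = 1  # the text uses 10 for cast members

counts = Counter(k for kws in keywords_per_movie for k in kws)

# Rule 1: keep the TOP_N most frequent values.
kept = [name for name, _ in counts.most_common(TOP_N)]

# Rule 2 (used for cast): keep values appearing in more than MIN_APPEARANCES movies.
frequent = [name for name, count in counts.items() if count > MIN_APPEARANCES]

# One binary column per kept value: 1 if the movie has it, 0 otherwise.
one_hot = pd.DataFrame(
    [[int(name in kws) for name in kept] for kws in keywords_per_movie],
    columns=[f"keyword_{name}" for name in kept],
    index=keywords_per_movie.index,
)
print(one_hot)
```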

As we were interested in examining movies based on books, we wanted to create a binary feature which would mark whether a movie was based on a book (1) or not (0). From the keywords, we found that films could be based on a range of literary material: "based on novel", "based on young adult novel", and "based on comic book". In the interest of increasing our N for this feature, we pooled these three keywords under our book feature.
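
A short sketch of the pooled book feature. The three keyword strings come from the text above; the column names `keywords` and `based_on_book` and the toy rows are assumptions.

```python
import pandas as pd

# The three keywords pooled under the book feature.
BOOK_KEYWORDS = {"based on novel", "based on young adult novel", "based on comic book"}

movies = pd.DataFrame({
    "keywords": [
        ["based on novel", "love"],
        ["space"],
        ["based on comic book"],
        [],
    ]
})

# 1 if the movie carries any of the pooled keywords, 0 otherwise.
movies["based_on_book"] = movies["keywords"].apply(
    lambda kws: int(bool(BOOK_KEYWORDS.intersection(kws)))
)
print(movies)
```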

## Summary of "adjusting-revenue-variable.ipynb"

As we aimed to build a model which would predict a movie's success to help guide a production company's decision to greenlight production, we decided that the measure of success should be financial. The features most relevant to this were revenue and budget. Ideally, we would calculate a measure similar to profit (e.g. revenue-budget), or a value normalised to budget (e.g. revenue/budget). However, there were many missing budget values (recorded as 0), likely because the actual values were unknown or not reported, which meant these metrics could not be calculated. As removing movies with a budget value of 0 would significantly reduce the size of our dataset, we ultimately decided to use revenue as our measure of a movie's success. Budget was used as one of our predictors, and the missing values were replaced by the median budget value calculated from all movies.
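
The budget imputation might be sketched as below. One detail is an assumption: here the median is taken over the reported (non-zero) budgets only, whereas the text does not state whether the zero entries were included in the original calculation.

```python
import numpy as np
import pandas as pd

# Toy budgets; 0 marks a missing or unreported value.
movies = pd.DataFrame({"budget": [0, 10_000_000, 30_000_000, 0, 50_000_000]})

# Treat zeros as missing so they do not drag the median down (assumption).
budget = movies["budget"].replace(0, np.nan)
median_budget = budget.median()

# Fill the missing budgets with the median of the reported ones.
movies["budget_filled"] = budget.fillna(median_budget)
print(movies)
```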

The release date feature was adapted such that the release year was ignored. This resulted in a cyclical date which would allow us to retain information about seasonality. We believed that this could offer valuable insight into how the timing of a movie's release might affect its success, e.g. due to awards seasons or seasonal holidays.
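
One possible way to drop the year and keep seasonality is to use the day of the year. The sine/cosine encoding in the second half is an addition for illustration, not something the text confirms was used; it places 31 December next to 1 January. Column names are assumptions.

```python
import numpy as np
import pandas as pd

movies = pd.DataFrame({"release_date": ["2009-12-10", "2012-07-16", "1999-01-01"]})

# Ignore the year: keep only the position of the date within its year.
dates = pd.to_datetime(movies["release_date"])
movies["release_day_of_year"] = dates.dt.dayofyear

# Optional (assumed) cyclical encoding: map the day onto a circle.
angle = 2 * np.pi * (movies["release_day_of_year"] - 1) / 365.25
movies["release_sin"] = np.sin(angle)
movies["release_cos"] = np.cos(angle)
print(movies)
```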

# (3) EDA, OLS Regression, and data visualisation

## EDA: Summary of "Category_plots.ipynb"
This notebook provides a preliminary examination of how the release of movies based on books, sequels, and other movies has changed over time. We find that there has been a gradual increase in these types of movies over the past few decades.
<img src="files/plots/categoryplots_releasedate_books-sequels-other.png">

In addition, the revenue distributions for movies based on books or sequels show a similar range to that achieved by other movies.
<img src="files/plots/categoryplots_revenue-boxplot_books-sequels-other.png">

The plots below show estimated profit (calculated as revenue-budget), revenue, budget, and vote average for films originally identified as based on a novel, graphic novel, or young adult novel.
<img src="files/plots/category_plots_profit-budget-revenue-voteavg_boxplot_different-novel-types.png">


## EDA: Summary of "some_aggregation_and_trends.ipynb"

To understand the distribution of original languages, we plotted a bar chart of this feature. We found that English is by far the most common original language in this dataset. Due to this dominance of a single value, we decided to remove original language as a predictor.
<img src="files/plots/Barchart_original_language.png">

Similarly, we found that the majority of movies in this dataset had popularity scores in the "extremely unpopular" range. It is also important to note that the documentation did not describe in detail how popularity was calculated.
<img src="files/plots/Barchart_popularity.png">

We considered the correlation between revenue, our measure of movie success, and other prospective measures of success such as popularity and vote average. We found that revenue correlated with popularity, but did not correlate well with vote average.
<img src="files/plots/Scatterplot_revenue_popularity.png">

<img src="files/plots/Scatterplot_revenue_vote_average.png">
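
Correlations of this kind can be computed with pandas as sketched below. The numbers are invented purely to show the mechanics and do not reproduce the results reported above; Pearson correlation is assumed.

```python
import pandas as pd

# Made-up values for illustration only; not taken from the TMDB dataset.
movies = pd.DataFrame({
    "revenue": [1, 2, 3, 4, 5],
    "popularity": [2, 4, 5, 8, 10],
    "vote_average": [7, 5, 6, 7, 5],
})

# Pearson correlation of every candidate measure with revenue.
corr_with_revenue = movies.corr(method="pearson")["revenue"].drop("revenue")
print(corr_with_revenue)
```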



## EDA: Summary of "production_company_analysis.ipynb"
We looked to gain insight into the production houses involved with different movies.

We had two variables relating to production houses, production company and production country, and wanted to determine whether the two features conveyed the same information. We correlated the two and found that production companies and production countries were poorly correlated with one another. This led us to include both variables in our machine learning models.

<img src="files/plots/corrmatrix_productionhouses-vs-productioncountries.png">


## OLS: Summary of "OLS_regression_and_correlations.ipynb"
As we have a dataset with a large number of predictors, we conducted OLS regression to help identify features which would be particularly useful to bring forward to our machine learning model.

<img src="files/plots/multivariate_linear_regression_plot.png">
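
For readers unfamiliar with OLS, the fit itself can be sketched with NumPy's least-squares solver. This is a stand-in rather than the notebook's actual code, which may have used a dedicated statistics library; the synthetic predictors and coefficients are invented so that the recovered values can be checked.

```python
import numpy as np

# Synthetic, noiseless data: y = 2 + 3*x1 - 1*x2 (coefficients chosen for illustration).
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 2))
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1]

# Add a column of ones so the model can fit an intercept.
design = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: minimise ||design @ coef - y||^2.
coef, residuals, rank, singular_values = np.linalg.lstsq(design, y, rcond=None)
print(coef)  # intercept followed by one coefficient per predictor
```

With real data, standardising the predictors before fitting makes the coefficient magnitudes comparable across features, which helps when using them to shortlist predictors.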

To guide our consideration of suitable success measures for use in future analyses, we also correlated candidate success measures such as budget, revenue, runtime, vote average, and vote count.
<img src="files/plots/correlation_matrix_plot.png">


