# TMDB Box Office Prediction

In this dataset, you are provided with 7398 movies and a variety of metadata obtained from The Movie Database (TMDB). Data points include cast, crew, plot keywords, budget, posters, release dates, languages, production companies, and countries.

You are predicting the worldwide revenue for the movies.

### Data
- Train (labelled): https://raw.githubusercontent.com/horoshenkih/harbour-space-ds210/master/datasets/tmdb_box_office_prediction/train.csv
- Test (unlabelled): https://raw.githubusercontent.com/horoshenkih/harbour-space-ds210/master/datasets/tmdb_box_office_prediction/test.csv

Link to the Kaggle competition: https://www.kaggle.com/c/tmdb-box-office-prediction/data

### Your task
1. Train a `RandomForestRegressor` that predicts the value in the `revenue` column.
Select the best hyperparameters using cross-validation and evaluate the error on a held-out 20% of the labelled data.
2. Predict revenues on the test file and submit your solution to Kaggle. Follow the link https://www.kaggle.com/c/tmdb-box-office-prediction/overview for further instructions.
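
A minimal sketch of step 1, assuming the features have already been turned into a numeric matrix. The data here is synthetic and the parameter grid is only illustrative; neither comes from the competition.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for a numeric feature matrix and a revenue-like target (assumption).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 3))
revenue = np.exp(10 + 3 * X[:, 0] + rng.normal(0, 0.1, size=300))
y = np.log1p(revenue)  # train on the log-transformed target

# Hold out 20% of the labelled data for the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Small illustrative grid; 3-fold cross-validation runs on the training part only.
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=3,
)
search.fit(X_train, y_train)

# GridSearchCV refits the best model on the whole training part by default.
holdout_rmse = np.sqrt(mean_squared_error(y_test, search.predict(X_test)))
print("best params:", search.best_params_)
print("hold-out RMSE on log1p(revenue):", holdout_rmse)
```

Keeping the 20% split out of the grid search means the reported error is not influenced by the hyperparameter selection.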

### Scoring metric
In this competition, RMSLE (Root Mean Squared Logarithmic Error) is used.
To optimize this metric, you need to:
1. Before training, convert the target `y_from_data` using a log transformation:
```
import numpy as np
y_for_training = np.log1p(y_from_data)
```
2. Before submission, convert your predictions `predicted_y` back using the inverse transformation:
```
import numpy as np
predictions_for_submission = np.expm1(predicted_y)
```
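
The reason the two steps work can be checked numerically: RMSE measured on the `log1p`-transformed target is the same number as RMSLE measured on the original scale. The sketch below uses made-up revenues and a made-up model output, purely for illustration.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error on the original (revenue) scale."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Hypothetical revenues (not taken from the dataset).
y_from_data = np.array([1_000_000.0, 50_000_000.0, 0.0])
y_for_training = np.log1p(y_from_data)

# Pretend model output in log space: the true values plus some error.
predicted_y = y_for_training + np.array([0.5, -0.2, 1.0])

# Plain RMSE in log space ...
rmse_log_space = np.sqrt(np.mean((predicted_y - y_for_training) ** 2))

# ... matches RMSLE after converting the predictions back.
predictions_for_submission = np.expm1(predicted_y)
rmsle_original_scale = rmsle(y_from_data, predictions_for_submission)

print(rmse_log_space, rmsle_original_scale)
```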

### Additional information
- API to get images: https://image.tmdb.org/t/p/w500/iEhb00TGPucF0b4joM1ieyY026U.jpg

In this competition, some data is stored as strings in a JSON-like format.
The code below shows how to handle such data in Pandas.

In [None]:
# short demo of Pandas datetime and JSON
# note that this is not a valid JSON - it uses single quotes instead of double quotes
input_csv = \
"""int_field,float_field,cat_field,json_field,datetime_field
42,3.14,Some Category,"[{'string_key': 'string value', 'int_key': 1}, {'string_key': 'string value 2', 'int_key': 2}]",05/14/19
15,2.72,Other Category,"[{'string_key': 'string value 2', THIS JSON IS BROKEN!",05/14/1991
"""

import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO(input_csv), sep=",")

print(df['int_field'])
print(df['float_field'])
print(df['cat_field'])
print(df['json_field'])
print(df['datetime_field'])
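# parse each value on its own: the two rows use different year formats (yy vs. yyyy),
# and a single format inferred from the first row can fail on the second in recent pandas versions
df['parsed_datetime_field'] = df['datetime_field'].apply(pd.to_datetime)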
print(df['parsed_datetime_field'])

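import json
def load_and_fix_json(s):
  # 0. missing values (NaN) are not strings and cannot be parsed
  if not isinstance(s, str):
    return None
  # 1. fix JSON syntax: ' -> "
  # (note: this also rewrites apostrophes inside string values)
  s = s.replace("'", '"')
  # 2. handle broken JSON
  try:
    return json.loads(s)
  except json.JSONDecodeError:
    return None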

# you can store arbitrary objects in Series
df['parsed_json_field'] = df['json_field'].apply(load_and_fix_json)
print(df['parsed_json_field'])

# feature engineering using df.apply
def extract_features_from_parsed_json(d):
  feature_names = ['has_int_value_greater_than_1', 'max_num_words_in_string_value']
  if d:
    features = [
        int(any([dd['int_key'] > 1 for dd in d])),
        max([len(dd['string_key'].split()) for dd in d])
    ]
  else:
    # empty or broken JSON
    features = [-1, -1]
    
  # return Series with column names
  return pd.Series(features, index=feature_names)

feature_engineered_df = df['parsed_json_field'].apply(extract_features_from_parsed_json)
pd.concat([df, feature_engineered_df], axis=1)

0    42
1    15
Name: int_field, dtype: int64
0    3.14
1    2.72
Name: float_field, dtype: float64
0     Some Category
1    Other Category
Name: cat_field, dtype: object
0    [{'string_key': 'string value', 'int_key': 1},...
1    [{'string_key': 'string value 2', THIS JSON IS...
Name: json_field, dtype: object
0      05/14/19
1    05/14/1991
Name: datetime_field, dtype: object
0   2019-05-14
1   1991-05-14
Name: parsed_datetime_field, dtype: datetime64[ns]
0    [{'string_key': 'string value', 'int_key': 1},...
1                                                 None
Name: parsed_json_field, dtype: object


Unnamed: 0,int_field,float_field,cat_field,json_field,datetime_field,parsed_datetime_field,parsed_json_field,has_int_value_greater_than_1,max_num_words_in_string_value
0,42,3.14,Some Category,"[{'string_key': 'string value', 'int_key': 1},...",05/14/19,2019-05-14,"[{'string_key': 'string value', 'int_key': 1},...",1,3
1,15,2.72,Other Category,"[{'string_key': 'string value 2', THIS JSON IS...",05/14/1991,1991-05-14,,-1,-1
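
Replacing every `'` with `"` breaks values that contain an apostrophe inside a string. The sample rows look like Python literals (single-quoted dicts in a list), so one alternative is `ast.literal_eval` from the standard library. This is a sketch under that assumption. The helper name `parse_python_literal` and the apostrophe example are hypothetical and not part of the dataset.

```python
import ast
import json

def parse_python_literal(s):
    """Parse a string such as "[{'id': 1, 'name': 'x'}]"; return None if missing or broken."""
    if not isinstance(s, str):
        return None  # e.g. NaN for a missing value
    try:
        return ast.literal_eval(s)
    except (ValueError, SyntaxError):
        return None

# Hypothetical value with an apostrophe inside a string.
with_apostrophe = "[{'character': \"Lou's friend\", 'cast_id': 4}]"
broken = "[{'string_key': 'string value 2', THIS JSON IS BROKEN!"

print(parse_python_literal(with_apostrophe))
print(parse_python_literal(broken))
print(parse_python_literal(float("nan")))

# For comparison: the quote-replacement approach fails on the apostrophe.
try:
    json.loads(with_apostrophe.replace("'", '"'))
    quote_replacement_ok = True
except json.JSONDecodeError:
    quote_replacement_ok = False
print("quote replacement parsed it:", quote_replacement_ok)
```

`ast.literal_eval` only evaluates literals (strings, numbers, lists, dicts and similar), so it does not execute arbitrary code from the input.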


In [None]:
import pandas as pd
import numpy as np

train_df = pd.read_csv('https://raw.githubusercontent.com/horoshenkih/harbour-space-ds210/master/datasets/tmdb_box_office_prediction/train.csv')
test_df = pd.read_csv('https://raw.githubusercontent.com/horoshenkih/harbour-space-ds210/master/datasets/tmdb_box_office_prediction/test.csv')
y_from_data = np.array(train_df['revenue'])
y = np.log1p(y_from_data)
train_df.head()

Unnamed: 0,id,belongs_to_collection,budget,genres,homepage,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,runtime,spoken_languages,status,tagline,title,Keywords,cast,crew,revenue
0,1,"[{'id': 313576, 'name': 'Hot Tub Time Machine ...",14000000,"[{'id': 35, 'name': 'Comedy'}]",,tt2637294,en,Hot Tub Time Machine 2,"When Lou, who has become the ""father of the In...",6.575393,/tQtWuwvMf0hCc2QR2tkolwl7c3c.jpg,"[{'name': 'Paramount Pictures', 'id': 4}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",2/20/15,93.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,The Laws of Space and Time are About to be Vio...,Hot Tub Time Machine 2,"[{'id': 4379, 'name': 'time travel'}, {'id': 9...","[{'cast_id': 4, 'character': 'Lou', 'credit_id...","[{'credit_id': '59ac067c92514107af02c8c8', 'de...",12314651
1,2,"[{'id': 107674, 'name': 'The Princess Diaries ...",40000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,tt0368933,en,The Princess Diaries 2: Royal Engagement,Mia Thermopolis is now a college graduate and ...,8.248895,/w9Z7A0GHEhIp7etpj0vyKOeU1Wx.jpg,"[{'name': 'Walt Disney Pictures', 'id': 2}]","[{'iso_3166_1': 'US', 'name': 'United States o...",8/6/04,113.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,It can take a lifetime to find true love; she'...,The Princess Diaries 2: Royal Engagement,"[{'id': 2505, 'name': 'coronation'}, {'id': 42...","[{'cast_id': 1, 'character': 'Mia Thermopolis'...","[{'credit_id': '52fe43fe9251416c7502563d', 'de...",95149435
2,3,,3300000,"[{'id': 18, 'name': 'Drama'}]",http://sonyclassics.com/whiplash/,tt2582802,en,Whiplash,"Under the direction of a ruthless instructor, ...",64.29999,/lIv1QinFqz4dlp5U4lQ6HaiskOZ.jpg,"[{'name': 'Bold Films', 'id': 2266}, {'name': ...","[{'iso_3166_1': 'US', 'name': 'United States o...",10/10/14,105.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,The road to greatness can take you to the edge.,Whiplash,"[{'id': 1416, 'name': 'jazz'}, {'id': 1523, 'n...","[{'cast_id': 5, 'character': 'Andrew Neimann',...","[{'credit_id': '54d5356ec3a3683ba0000039', 'de...",13092000
3,4,,1200000,"[{'id': 53, 'name': 'Thriller'}, {'id': 18, 'n...",http://kahaanithefilm.com/,tt1821480,hi,Kahaani,Vidya Bagchi (Vidya Balan) arrives in Kolkata ...,3.174936,/aTXRaPrWSinhcmCrcfJK17urp3F.jpg,,"[{'iso_3166_1': 'IN', 'name': 'India'}]",3/9/12,122.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,,Kahaani,"[{'id': 10092, 'name': 'mystery'}, {'id': 1054...","[{'cast_id': 1, 'character': 'Vidya Bagchi', '...","[{'credit_id': '52fe48779251416c9108d6eb', 'de...",16000000
4,5,,0,"[{'id': 28, 'name': 'Action'}, {'id': 53, 'nam...",,tt1380152,ko,마린보이,Marine Boy is the story of a former national s...,1.14807,/m22s7zvkVFDU9ir56PiiqIEWFdT.jpg,,"[{'iso_3166_1': 'KR', 'name': 'South Korea'}]",2/5/09,118.0,"[{'iso_639_1': 'ko', 'name': '한국어/조선말'}]",Released,,Marine Boy,,"[{'cast_id': 3, 'character': 'Chun-soo', 'cred...","[{'credit_id': '52fe464b9251416c75073b43', 'de...",3923970
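
The `release_date` column uses two-digit years (for example `2/20/15`). With the `%y` format, two-digit years below 69 are read as 20xx, so a date such as `5/19/58` would land in 2058. A minimal sketch of one way to handle this is shown below. The helper name `parse_release_date` and the cutoff `latest_year=2019` are assumptions for illustration.

```python
import pandas as pd

def parse_release_date(series, latest_year=2019):
    """Parse 'm/d/yy' strings; move dates after `latest_year` back by 100 years."""
    # Unparseable or missing values become NaT instead of raising.
    parsed = pd.to_datetime(series, format="%m/%d/%y", errors="coerce")
    in_future = parsed.dt.year > latest_year
    return parsed.mask(in_future, parsed - pd.DateOffset(years=100))

# A few values in the same format as the dataset, plus a missing one.
release_dates = pd.Series(["2/20/15", "5/19/58", "8/6/04", None])
parsed_dates = parse_release_date(release_dates)
print(parsed_dates)

# Simple numeric features derived from the parsed dates.
date_features = pd.DataFrame({
    "release_year": parsed_dates.dt.year,
    "release_month": parsed_dates.dt.month,
    "release_weekday": parsed_dates.dt.dayofweek,
})
print(date_features)
```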


In [None]:

del train_df['revenue']

# replace missing values with the sentinel -1 in every column that has gaps
train_cols_with_missing = [
    'homepage', 'production_companies', 'belongs_to_collection',
    'production_countries', 'Keywords', 'tagline', 'spoken_languages',
    'crew', 'cast', 'overview', 'genres', 'poster_path', 'runtime',
]
for col in train_cols_with_missing:
  train_df[col] = train_df[col].fillna(-1)

# string-valued columns
features1 = ['belongs_to_collection', 'release_date', 'imdb_id', 'genres', 'original_language', 'homepage', 'original_title', 'overview', 'poster_path', 'production_companies', 'production_countries', 'spoken_languages', 'status','tagline','title', 'Keywords', 'cast', 'crew']
# the cells below pass `features` to CatBoost as `cat_features`;
# every string-valued column has to be declared categorical, so the full list is used
features = features1

In [None]:
# same sentinel for the test set, which has gaps in three more columns
test_cols_with_missing = [
    'homepage', 'production_companies', 'belongs_to_collection',
    'production_countries', 'Keywords', 'tagline', 'spoken_languages',
    'crew', 'cast', 'overview', 'genres', 'poster_path', 'runtime',
    'title', 'release_date', 'status',
]
for col in test_cols_with_missing:
  test_df[col] = test_df[col].fillna(-1)

In [None]:
test_df[test_df.isnull().any(axis=1)].head()

Unnamed: 0,id,belongs_to_collection,budget,genres,homepage,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,runtime,spoken_languages,status,tagline,title,Keywords,cast,crew


In [None]:
from sklearn.model_selection import train_test_split

df_cat_train, df_cat_test, y_cat_train, y_cat_test = train_test_split(train_df, y, test_size=0.2, random_state=42)

In [None]:
!pip install catboost



In [None]:
import catboost as cb

train_pool = cb.Pool(df_cat_train, y_cat_train, cat_features=features)
test_pool = cb.Pool(df_cat_test, y_cat_test, cat_features=features)

In [None]:
from sklearn.metrics import mean_squared_error

clf = cb.CatBoostRegressor(n_estimators=4000, learning_rate=0.05)
clf.fit(train_pool, verbose=False)

# the target is log1p(revenue), so RMSE here corresponds to RMSLE on the original scale;
# "test" refers to the 20% hold-out split, not the Kaggle test file
print("train RMSE:", np.sqrt(mean_squared_error(train_pool.get_label(), clf.predict(train_pool))))
print("test RMSE:", np.sqrt(mean_squared_error(test_pool.get_label(), clf.predict(test_pool))))

train RMSE: 1.826524213982597
test RMSE: 2.108615886268768


In [None]:
for fname, fstr in sorted(
    zip(
        train_df,
        clf.get_feature_importance()
    ),
    key=lambda x: -x[1]
):
  print(fname, fstr)

budget 33.70818670881145
popularity 19.57688249424994
original_language 11.219266079278277
runtime 7.083876354225955
id 6.045581282388144
genres 3.8984772779184977
production_countries 3.885659838152616
spoken_languages 3.4646680627240785
belongs_to_collection 3.0437754625329383
production_companies 2.0439923580303376
homepage 1.7869584130742255
tagline 1.4684131595542471
Keywords 1.284751774791662
release_date 1.0115850446107155
original_title 0.27102852629407687
overview 0.10389683452181002
status 0.07487992664893219
cast 0.028120402192268294
imdb_id 0.0
poster_path 0.0
title 0.0
crew 0.0


In [None]:
test_df['revenue'] = np.expm1(clf.predict(cb.Pool(test_df, cat_features=features)))
test_df.head()

Unnamed: 0,id,belongs_to_collection,budget,genres,homepage,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,runtime,spoken_languages,status,tagline,title,Keywords,cast,crew,revenue
0,3001,"[{'id': 34055, 'name': 'Pokémon Collection', '...",0,"[{'id': 12, 'name': 'Adventure'}, {'id': 16, '...",http://www.pokemon.com/us/movies/movie-pokemon...,tt1226251,ja,ディアルガVSパルキアVSダークライ,Ash and friends (this time accompanied by newc...,3.851534,/tnftmLMemPLduW6MRyZE0ZUD19z.jpg,-1,"[{'iso_3166_1': 'JP', 'name': 'Japan'}, {'iso_...",7/14/07,90.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Somewhere Between Time & Space... A Legend Is ...,Pokémon: The Rise of Darkrai,"[{'id': 11451, 'name': 'pok√©mon'}, {'id': 115...","[{'cast_id': 3, 'character': 'Tonio', 'credit_...","[{'credit_id': '52fe44e7c3a368484e03d683', 'de...",4445745.0
1,3002,-1,88000,"[{'id': 27, 'name': 'Horror'}, {'id': 878, 'na...",-1,tt0051380,en,Attack of the 50 Foot Woman,When an abused wife grows to giant size becaus...,3.559789,/9MgBNBqlH1sG4yG2u4XkwI5CoJa.jpg,"[{'name': 'Woolner Brothers Pictures Inc.', 'i...","[{'iso_3166_1': 'US', 'name': 'United States o...",5/19/58,65.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,A titanic beauty spreads a macabre wave of hor...,Attack of the 50 Foot Woman,"[{'id': 9748, 'name': 'revenge'}, {'id': 9951,...","[{'cast_id': 2, 'character': 'Nancy Fowler Arc...","[{'credit_id': '55807805c3a3685b1300060b', 'de...",2404347.0
2,3003,-1,0,"[{'id': 35, 'name': 'Comedy'}, {'id': 10749, '...",-1,tt0118556,en,Addicted to Love,Good-natured astronomer Sam is devastated when...,8.085194,/ed6nD7h9sbojSWY2qrnDcSvDFko.jpg,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",5/23/97,100.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,A Comedy About Lost Loves And Last Laughs,Addicted to Love,"[{'id': 931, 'name': 'jealousy'}, {'id': 9673,...","[{'cast_id': 11, 'character': 'Maggie', 'credi...","[{'credit_id': '52fe4330c3a36847f8041367', 'de...",4760961.0
3,3004,-1,6800000,"[{'id': 18, 'name': 'Drama'}, {'id': 10752, 'n...",http://www.sonyclassics.com/incendies/,tt1255953,fr,Incendies,A mother's last wishes send twins Jeanne and S...,8.596012,/sEUG3qjxwHjxkzuO7plrRHhOZUH.jpg,"[{'name': 'TS Productions', 'id': 313}, {'name...","[{'iso_3166_1': 'CA', 'name': 'Canada'}, {'iso...",9/4/10,130.0,"[{'iso_639_1': 'fr', 'name': 'Français'}, {'is...",Released,The search began at the opening of their mothe...,Incendies,"[{'id': 378, 'name': 'prison'}, {'id': 539, 'n...","[{'cast_id': 6, 'character': 'Nawal', 'credit_...","[{'credit_id': '56478092c3a36826140043af', 'de...",12598200.0
4,3005,-1,2000000,"[{'id': 36, 'name': 'History'}, {'id': 99, 'na...",-1,tt0418753,en,Inside Deep Throat,"In 1972, a seemingly typical shoestring budget...",3.21768,/n4WC3zbelz6SG7rhkWbf8m9pMHB.jpg,-1,"[{'iso_3166_1': 'US', 'name': 'United States o...",2/11/05,92.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,It was filmed in 6 days for 25 thousand dollar...,Inside Deep Throat,"[{'id': 279, 'name': 'usa'}, {'id': 1228, 'nam...","[{'cast_id': 1, 'character': 'Narrator (voice)...","[{'credit_id': '52fe44ce9251416c75041967', 'de...",1359858.0


In [None]:
test_df[['id','revenue']].to_csv('t.csv', index=False)
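
Before uploading, it can help to sanity-check the submission frame. The sketch below assumes the expected layout is exactly the two columns written above (`id`, `revenue`). The helper `check_submission` is hypothetical and not part of any Kaggle tooling.

```python
import pandas as pd

def check_submission(submission, expected_ids):
    """Return a list of problems found in a submission frame (empty list: none found)."""
    problems = []
    if list(submission.columns) != ["id", "revenue"]:
        problems.append("columns should be exactly ['id', 'revenue']")
        return problems
    if submission["revenue"].isna().any():
        problems.append("revenue contains missing values")
    # expm1 of a negative log-space prediction is negative
    if (submission["revenue"] < 0).any():
        problems.append("revenue contains negative values")
    if sorted(submission["id"]) != sorted(expected_ids):
        problems.append("ids do not match the test set")
    return problems

# Two rows in the shape of the predictions shown above.
good = pd.DataFrame({"id": [3001, 3002], "revenue": [4445745.0, 2404347.0]})
# A deliberately broken frame for comparison.
bad = pd.DataFrame({"id": [3001, 3002], "revenue": [-5.0, float("nan")]})

print(check_submission(good, [3001, 3002]))
print(check_submission(bad, [3001, 3002]))
```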