# TV shows Popularity Predictor (39%)

The goal of this challenge is to create a model that predicts the `popularity` of a movie or TV show

<img src="image.jpg" width=300 />




The dataset contains a list of movies and TV shows with the following characteristics:
- `title`: title of the movie in english
- `original_title`: original title of the movie 
- `duration_min`: duration of the movie in minutes
- `popularity`: popularity of the movie in terms of review scores
- `release_date`: release date
- `description`: short summary of the movie
- `budget`: budget spent to produce the movie in USD
- `revenue`: movie revenue in USD 
- `original_language`: original language 
- `status`: is the movie already released or not
- `number_of_awards_won`: number of awards won for the movie
- `number_of_nominations`: number of nominations
- `has_collection`: if the movie is part of a sequel or not
- `all_genres`: genres that described the movie (can be zero, one or many!) 
- `top_countries`: countries where the movie was produced (can be zero, one or many!) 
- `number_of_top_productions`: number of top production companies that produced the film if any. 
Top production companies includes: Warner Bros, Universal Pictures, Paramount Pictures, Canal+, etc...
- `available_in_english`: whether the movie is available in english or not

## Imports

Run the following cell to load the basic packages:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from nbresult import ChallengeResult

## Data collection

📝 **Load the `movie_popularity.csv` dataset from the provided this [URL](https://wagon-public-datasets.s3.amazonaws.com/certification_france_2021_q2/tv_movies_popularity.csv)**
- First, check and remove the rows that may be complete duplicate from one another (we never know!)
- Then, drop the columns that have too much missing values
- Finally, drop the few remaining rows that have missing values
- Store the result in a `DataFrame` named `data`

In [2]:
# YOUR CODE HERE
import io
import requests
url="https://wagon-public-datasets.s3.amazonaws.com/certification_france_2021_q2/tv_movies_popularity.csv"
s=requests.get(url).content
data=pd.read_csv(io.StringIO(s.decode('utf-8')))

In [38]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6864 entries, 0 to 6863
Data columns (total 16 columns):
original_title               6864 non-null object
title                        6863 non-null object
release_date                 6864 non-null object
duration_min                 6864 non-null float64
description                  6864 non-null object
budget                       6864 non-null int64
revenue                      2778 non-null float64
original_language            6864 non-null object
status                       6864 non-null object
number_of_awards_won         6864 non-null int64
number_of_nominations        6864 non-null int64
has_collection               6864 non-null int64
all_genres                   6864 non-null object
top_countries                6864 non-null object
number_of_top_productions    6864 non-null int64
available_in_english         6864 non-null bool
dtypes: bool(1), float64(2), int64(5), object(8)
memory usage: 811.2+ KB


### 🧪 Run the following cell to save your results

In [4]:
from nbresult import ChallengeResult

result = ChallengeResult(
    "data_cleaning",
    columns=data.columns,
    cleaning=sum(data.isnull().sum()),
    shape=data.shape)
result.write()

## Baseline model

### The metric

📝 **We want to predict `popularity`: Start by plotting a histogram of the target to visualize it**

📝 **Which sklearn's scoring [metric](https://scikit-learn.org/stable/modules/model_evaluation.html) should we use if we want it to:**

- Be better when greater (i.e. metric_good_model > metric_bad_model)
- Penalize **more** an error between 10 and 20 compared with an error between 110 and 120
- Said otherwise, what matter should be the **relative error ratio**, more than the absolute error difference

Hint: the histogram plotted above should give you some intuition about it

👉 Store its exact [sklearn scoring name](https://scikit-learn.org/stable/modules/model_evaluation.html) as `string` in the variable `scoring` below.

🚨 You must use this metric for the rest of the challenge

In [93]:
scoring = 'r2'

In [92]:
# YOUR CODE HERE
#plt.hist(data['popularity'],bins = 500)
#plt.show()
sns.histplot(data=data, x="popularity")

ValueError: Could not interpret value `popularity` for parameter `x`

<details>
    <summary>💡 Hint</summary>
It is around here!
<img src="scores.jpg" width=200 height=400 />
</details>

### X,y

**📝 Define `X` as the features Dataframe (keep all features) and `y` as the target Series.**

In [18]:
# YOUR CODE HERE
y= data['popularity']
data = data.drop(['popularity'], axis=1)
X = data
X

Unnamed: 0,original_title,title,release_date,duration_min,description,budget,revenue,original_language,status,number_of_awards_won,number_of_nominations,has_collection,all_genres,top_countries,number_of_top_productions,available_in_english
0,Hot Tub Time Machine 2,Hot Tub Time Machine 2,2015-02-20,93.0,"When Lou's shot in the groin, Nick and Jacob d...",14000000,12314651.0,en,Released,0,2,1,Comedy,United States of America,3,True
1,The Princess Diaries 2: Royal Engagement,The Princess Diaries 2: Royal Engagement,2004-08-06,113.0,"Now settled in Genovia, Princess Mia faces a n...",40000000,95149435.0,en,Released,1,2,1,"Comedy, Drama, Family, Romance",United States of America,1,True
2,Whiplash,Whiplash,2014-10-10,105.0,A promising young drummer enrolls at a cut-thr...,3300000,13092000.0,en,Released,97,145,0,Drama,United States of America,0,True
3,Kahaani,Kahaani,2012-03-09,122.0,A pregnant woman's search for her missing husb...,1200000,16000000.0,hi,Released,23,18,0,"Drama, Thriller",India,0,True
4,The Possession,The Possession,2012-08-30,92.0,A young girl buys an antique box at a yard sal...,14000000,85446075.0,en,Released,0,6,0,"Horror, Thriller","Canada, United States of America",0,True
5,Muppet Treasure Island,Muppet Treasure Island,1996-02-16,100.0,The Muppets' twist on the classic tale.,0,34327391.0,en,Released,0,5,1,"Action, Adventure, Comedy, Family, Music",United States of America,1,True
6,A Mighty Wind,A Mighty Wind,2003-04-16,91.0,Mockumentary captures the reunion of 1960s fol...,6000000,18750246.0,en,Released,14,28,0,"Comedy, Music",United States of America,0,True
7,Rocky,Rocky,1976-11-21,119.0,A small-time boxer gets a supremely rare chanc...,1000000,117235147.0,en,Released,20,21,1,Drama,United States of America,1,True
8,Revenge of the Nerds II: Nerds in Paradise,Revenge of the Nerds II: Nerds in Paradise,1987-07-10,98.0,The rising college nerds set out to a conventi...,0,22642033.0,en,Released,0,0,1,Comedy,United States of America,1,True
9,American Beauty,American Beauty,1999-09-15,122.0,A sexually frustrated suburban father has a mi...,15000000,356296601.0,en,Released,111,102,0,Drama,United States of America,1,True


### Basic pipeline

📝 **Check unique values per features**

In [20]:
# YOUR CODE HERE
for i in X.columns:
    print(X[i].unique())

['Hot Tub Time Machine 2' 'The Princess Diaries 2: Royal Engagement'
 'Whiplash' ... 'The Verdict' 'It Follows'
 'Vivre sa vie: film en douze tableaux']
['Hot Tub Time Machine 2' 'The Princess Diaries 2: Royal Engagement'
 'Whiplash' ... 'The Verdict' 'It Follows' 'Vivre Sa Vie']
['2015-02-20' '2004-08-06' '2014-10-10' ... '2006-01-30' '2015-06-23'
 '1962-09-20']
[ 93. 113. 105. 122.  92. 100.  91. 119.  98. 118. 145.  97.  85. 111.
  96.  87. 130.  95. 116. 110.  84.  89. 112. 117. 106. 125.  94. 127.
 123. 126. 108.  88. 167. 102. 160. 107. 144. 124. 115. 133. 129. 104.
 103. 157. 109. 135. 147. 120. 121. 177. 189. 178.  86.  90.  79. 101.
  72.  99. 141. 136. 143. 132. 139. 114. 140.  83. 148. 137. 168. 156.
 154.  77.  76. 163. 155. 150. 149. 170.  80. 128. 131. 220. 181. 179.
  81.  82. 161. 134. 193. 158. 188. 212. 142. 146. 151. 171. 165. 162.
 185. 138.  78.  75. 153. 175. 219. 183. 186. 152. 199.  68. 214. 248.
 159. 180. 187.  73. 197. 164. 169. 172.  66.  63. 213. 174.  69. 

In this baseline, let's forget about the columns below that are difficult to process

In [50]:
text = ['description', 'original_title', 'title']
dates = ['release_date'] 

We will simply scale the numerical features and one-hot-encode the categorical ones remaining

📝 **Prepare 2 `list`s of features names as `str`**:
- `numerical` which contains **only** numerical features
- `categorical` which contains **only** categorical features (exept text and dates above)

In [23]:
# YOUR CODE HERE
numerical = ['duration_min','budget','revenue','number_of_awards_won', 'number_of_nominations','number_of_top_productions']
categorical = ['original_language','status','has_collection','all_genres', 'top_countries','available_in_english']
X.columns

Index(['original_title', 'title', 'release_date', 'duration_min',
       'description', 'budget', 'revenue', 'original_language', 'status',
       'number_of_awards_won', 'number_of_nominations', 'has_collection',
       'all_genres', 'top_countries', 'number_of_top_productions',
       'available_in_english'],
      dtype='object')

In [51]:
X = X.drop(text, axis=1)
X = X.drop(dates, axis=1)
X

Unnamed: 0,duration_min,budget,revenue,original_language,status,number_of_awards_won,number_of_nominations,has_collection,all_genres,top_countries,number_of_top_productions,available_in_english
0,93.0,14000000,12314651.0,en,Released,0,2,1,Comedy,United States of America,3,True
1,113.0,40000000,95149435.0,en,Released,1,2,1,"Comedy, Drama, Family, Romance",United States of America,1,True
2,105.0,3300000,13092000.0,en,Released,97,145,0,Drama,United States of America,0,True
3,122.0,1200000,16000000.0,hi,Released,23,18,0,"Drama, Thriller",India,0,True
4,92.0,14000000,85446075.0,en,Released,0,6,0,"Horror, Thriller","Canada, United States of America",0,True
5,100.0,0,34327391.0,en,Released,0,5,1,"Action, Adventure, Comedy, Family, Music",United States of America,1,True
6,91.0,6000000,18750246.0,en,Released,14,28,0,"Comedy, Music",United States of America,0,True
7,119.0,1000000,117235147.0,en,Released,20,21,1,Drama,United States of America,1,True
8,98.0,0,22642033.0,en,Released,0,0,1,Comedy,United States of America,1,True
9,122.0,15000000,356296601.0,en,Released,111,102,0,Drama,United States of America,1,True


### Pipelining

You are going to build a basic pipeline made of a basic preprocessing and a trees-based model of your choice.

#### Preprocessing pipeline

**📝 Create a basic preprocessing pipeline for the 2 types of features above:**
- It should scale the `numerical` features
- one-hot-encode the `categorical` and `boolean` features
- drop the others
- Store your pipeline in a `basic_preprocessing` variable

In [24]:
# Execute this cell to enable a nice display for your pipelines
from sklearn import set_config; set_config(display='diagram')

In [77]:
# YOUR CODE HERE
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector

# Impute then Scale for numerical variables: 
num_transformer = Pipeline([
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler())])


# Encode categorical variables
cat_transformer = OneHotEncoder(handle_unknown='ignore',sparse=False)



# Paralellize "num_transformer" and "One hot encoder"
preprocessor = ColumnTransformer([
    ('num_transformer', num_transformer, numerical),
    ('cat_transformer', cat_transformer, categorical)]
)
    



preprocessor

In [95]:
final_pipe = Pipeline([
    ('preprocessing', preprocessor)])

basic_preprocessing = final_pipe

In [76]:

pd.DataFrame(final_pipe.fit_transform(X))

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1110,1111,1112,1113,1114,1115,1116,1117,1118,1119
0,-0.796276,-0.302503,-6.887537e-01,-0.421280,-0.425453,3.547566,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
1,0.182492,0.376600,2.191277e-01,-0.358858,-0.425453,0.532212,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
2,-0.209015,-0.581980,-6.802338e-01,5.633718,5.242737,-0.975465,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
3,0.622938,-0.636831,-6.483617e-01,1.014441,0.208750,-0.975465,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,-0.845214,-0.302503,1.127775e-01,-0.421280,-0.266902,-0.975465,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
5,-0.453707,-0.668174,-4.474908e-01,-0.421280,-0.306540,0.532212,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
6,-0.894153,-0.511458,-6.182186e-01,0.452637,0.605127,-0.975465,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
7,0.476123,-0.642055,4.611904e-01,0.827173,0.327664,0.532212,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
8,-0.551584,-0.668174,-5.755641e-01,-0.421280,-0.504728,0.532212,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
9,0.622938,-0.276384,3.081339e+00,6.507635,3.538317,0.532212,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0


In [28]:
X

Unnamed: 0,original_title,title,release_date,duration_min,description,budget,revenue,original_language,status,number_of_awards_won,number_of_nominations,has_collection,all_genres,top_countries,number_of_top_productions,available_in_english
0,Hot Tub Time Machine 2,Hot Tub Time Machine 2,2015-02-20,93.0,"When Lou's shot in the groin, Nick and Jacob d...",14000000,12314651.0,en,Released,0,2,1,Comedy,United States of America,3,True
1,The Princess Diaries 2: Royal Engagement,The Princess Diaries 2: Royal Engagement,2004-08-06,113.0,"Now settled in Genovia, Princess Mia faces a n...",40000000,95149435.0,en,Released,1,2,1,"Comedy, Drama, Family, Romance",United States of America,1,True
2,Whiplash,Whiplash,2014-10-10,105.0,A promising young drummer enrolls at a cut-thr...,3300000,13092000.0,en,Released,97,145,0,Drama,United States of America,0,True
3,Kahaani,Kahaani,2012-03-09,122.0,A pregnant woman's search for her missing husb...,1200000,16000000.0,hi,Released,23,18,0,"Drama, Thriller",India,0,True
4,The Possession,The Possession,2012-08-30,92.0,A young girl buys an antique box at a yard sal...,14000000,85446075.0,en,Released,0,6,0,"Horror, Thriller","Canada, United States of America",0,True
5,Muppet Treasure Island,Muppet Treasure Island,1996-02-16,100.0,The Muppets' twist on the classic tale.,0,34327391.0,en,Released,0,5,1,"Action, Adventure, Comedy, Family, Music",United States of America,1,True
6,A Mighty Wind,A Mighty Wind,2003-04-16,91.0,Mockumentary captures the reunion of 1960s fol...,6000000,18750246.0,en,Released,14,28,0,"Comedy, Music",United States of America,0,True
7,Rocky,Rocky,1976-11-21,119.0,A small-time boxer gets a supremely rare chanc...,1000000,117235147.0,en,Released,20,21,1,Drama,United States of America,1,True
8,Revenge of the Nerds II: Nerds in Paradise,Revenge of the Nerds II: Nerds in Paradise,1987-07-10,98.0,The rising college nerds set out to a conventi...,0,22642033.0,en,Released,0,0,1,Comedy,United States of America,1,True
9,American Beauty,American Beauty,1999-09-15,122.0,A sexually frustrated suburban father has a mi...,15000000,356296601.0,en,Released,111,102,0,Drama,United States of America,1,True


**📝 Encode the features and store the result in the variable `X_basic_preprocessing`.**

In [78]:
# YOUR CODE HERE
X_basic_preprocessing = pd.DataFrame(final_pipe.fit_transform(X))

**❓ How many features has been generated by the preprocessing? What do you think about this number?**

In [84]:
X_basic_preprocessing.shape

(6864, 1120)

> YOUR ANSWER HERE

 1120 feature have been generated cause of the one hot encoding of feature with a lot a unique value. This number is a little high

#### Modeling pipeline

Let's add a model to our pipe. With so many features one-hot-encoded, we **need a model which can act as a feature selector**

👉 A linear model regularized with L1 penalty is a good starting point.


**📝 Create a `basic_pipeline` which encapsulate the `basic_preprocessing` pipeline + a linear model with a L1 penalty**

- store the resulting pipeline as `basic_pipeline`
- don't fine-tune it


<details>
    <summary>Hints</summary>

Choose your model from the list [here](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model)

</details>

In [97]:
# YOUR CODE HERE

from sklearn.linear_model import Ridge

final_pipe = Pipeline([
    ('preprocessing', preprocessor),
    ('linear_regression', Ridge())])
basic_pipeline = final_pipe
final_pipe

### Cross-validated baseline

**📝 Perform a cross-validated evaluation of your baseline model using the metric you defined above. Store the results of this evaluation as an `array` of floating scores in the `basic_scores` variable.**

In [90]:
# YOUR CODE HERE
from sklearn.model_selection import cross_val_score

# Cross validate pipeline
basic_scores = cross_val_score(final_pipe,X , y, cv=5, scoring='r2')

### 🧪 Save your results

Run the following cell to save your results

In [98]:
ChallengeResult(
    'baseline',
    metric=scoring,
    features=[categorical,numerical],
    preproc=basic_preprocessing,
    preproc_shape=X_basic_preprocessing.shape,
    pipe=basic_pipeline,
    scores=basic_scores
).write()

## Feature engineering

### Time Features


👉 Let's try to improve performance using the feature `release_date`, and especially its `month` and `year`.

ℹ️ If you want to skip this section, you can move directly to the next one: _Advanced categorical features_.

**📝 Complete the custom transformer `TimeFeaturesExtractor` below**

Running
```python
TimeFeaturesExtractor().fit_transform(X[['release_date']])
``` 
should return something like

|    |   month |   year |
|---:|--------:|-------:|
|  0 |       2 |   2015 |
|  1 |       8 |   2004 |
|  2 |      10 |   2014 |
|  3 |       3 |   2012 |
|  4 |       8 |   2012 |


In [110]:
daate = data['release_date'].iloc[0]
int(daate[:4])

2015

In [144]:
X=data['release_date']
def slicend(x):
    return x[5:7]
def slicebegin(x):
    return x[:4]

month = X.apply(sliced)


In [174]:
from sklearn.base import BaseEstimator, TransformerMixin

class TimeFeaturesExtractor(BaseEstimator, TransformerMixin):
    """Extract the 2 time features from a date"""
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        """
        Params:
        X: DataFrame
        y: Series
        
        Returns a DataFrame with 2 columns containing the time features as integers extracted from the release_date.
        """
        arrm=[]
        arry=[]
        for i,d in X.iterrows():
            #print(d)
            
            year = d[:4]
            val = year[0]
            month = int(val[5:7])
            year = int(val[:4])
            arrm.append(month)
            arry.append(year)
        #print(month,year)
        d = {'month': arrm, 'year': arry}

        
        return pd.DataFrame(data=d)

In [175]:
# Try your transformer and save your new features here
X_time_features = TimeFeaturesExtractor().fit_transform(data[['release_date']])
X_time_features.head()

Unnamed: 0,month,year
0,2,2015
1,8,2004
2,10,2014
3,3,2012
4,8,2012


We still have 2 problems to solve
- `month` is cyclical: 12 should be a close to 1 as to 11, right? 
- `year` is not scaled

**📝 Build a final custom transformer `CyclicalEncoder` so that**

Running
```python
CyclicalEncoder().fit_transform(X_time_features)
``` 
should return something like this

|    |    month_cos |   month_sin |      year |
|---:|-------------:|------------:|----------:|
|  0 |  0.5         |    0.866025 | 0.0466039 |
|  1 | -0.5         |   -0.866025 | 0.0411502 |
|  2 |  0.5         |   -0.866025 | 0.0461081 |
|  3 |  6.12323e-17 |    1        | 0.0451165 |
|  4 | -0.5         |   -0.866025 | 0.0451165 |

With the cyclical encoding is done as below
- `month_cos = 2 * math.pi / 12 * X[['month']] `
- `month_sin = 2 * math.pi / 12 * X[['month']] `

And the `year` begin min-max scaled

In [219]:
from sklearn.base import BaseEstimator, TransformerMixin
import math

class CyclicalEncoder(BaseEstimator, TransformerMixin):
    """
    Encode a cyclical feature
    """
    
    def __init__(self):
        pass

    def fit(self, X, y=None):
        """
        Compute here what you need for the transform phase and store it as instance variable
        """
        self.min = X['year'].min()
        self.max = X['year'].max()
        #self.year = (X – self.min) / (self.max – self.min)
        return self

    def transform(self, X, y=None):
        """
        Compute and returns the final DataFrame
        """
        
        #print(self.min)
        month_cos = 2 * math.pi / 12 * X[['month']]
        month_sin = 2 * math.pi / 12 * X[['month']]
        #print(month_cos,month_sin)
        year = (X[['year']] - self.min) / (self.max - self.min)
        #year = X[['year']]
        d = {'month_cos': month_cos,'month_sin' :month_sin, 'year': year}
        return  pd.DataFrame(data=d)


In [218]:
# Try your transformer and save your new features here
X_time_cyclical = CyclicalEncoder().fit_transform(X_time_features)
X_time_cyclical.head()

ValueError: If using all scalar values, you must pass an index

In [33]:
# Check that this form a circle with 12 points
plt.scatter(X_time_cyclical['month_cos'],
            X_time_cyclical['month_sin'])
plt.xlabel("month_cos"); plt.ylabel("month_sin");

**📝 Enhance your `basic_pipeline` with a new preprocessing including both `TimeFeaturesExtractor` and `CyclicalFeatureExtractor`:**

- Just use `TimeFeatureExtractor` if you haven't had time to do the `Cyclical` one
- Store this new pipeline as `time_pipeline`
- Keep same estimator for now

In [220]:
# YOUR CODE HERE
time_pipeline = Pipeline([
    
    ('TimeScaler', TimeFeaturesExtractor())])

### Advanced categorical encoder to reduce the number of features

ℹ️ Most of it has already been coded for you and it shouldn't take long. Still if you want to skip it and move to the next section: _Model Tuning_

👉 We need to reduce the number of features to one-hot-encode, which arise from the high cardinality of `all_genres` and `top_countries`

In [222]:
X = data

In [223]:
X[['all_genres', 'top_countries']].nunique()

all_genres       745
top_countries    320
dtype: int64

👇 Both share a common pattern: there can be more than 1 country and more than 1 genre per movie.

In [224]:
X[['all_genres', 'top_countries']].tail()

Unnamed: 0,all_genres,top_countries
6859,"Animation, Drama, Family",United States of America
6860,"Comedy, Drama, Romance",United States of America
6861,"Comedy, Drama",United States of America
6862,"Adventure, Mystery, Science Fiction","United Kingdom, United States of America"
6863,"Horror, Mystery, Science Fiction","United Kingdom, United States of America"


👉 Run the cell below where we have coded for you a custom transformer `CustomGenreAndCountryEncoder` which: 
- Select the 10 most frequent genres and the 5 most frequent countries
- Encode `all_genres` into 10 One Hot Encoded features
- Encode `top_countries` into 5 One Hot Encoded features

In [225]:
from collections import Counter
from sklearn.base import BaseEstimator, TransformerMixin

class CustomGenreAndCountryEncoder(BaseEstimator, TransformerMixin):
    """
    Encoding the all_genres and top_companies features which are multi-categorical :
    a movie has several possible genres and countries of productions!
    """

    def __init__(self):
        pass

    def fit(self, X, y=None):
        """
        compute top genres and top countries of productions from all_genres and top_countries features
        """

        # compute top 10 genres       
        list_of_genres = list(X['all_genres'].apply(lambda x: [i.strip() for i in x.split(",")] if x != [''] else []).values)
        top_genres = [m[0] for m in Counter([i for j in list_of_genres for i in j]).most_common(10)]

        # save top_genres in dedicated instance variable
        self.top_genres = top_genres
        
         # compute top 5 countries       
        list_of_countries = list(X['top_countries'].apply(lambda x: [i.strip() for i in x.split(",")] if x != [''] else []).values)
        top_countries = [m[0] for m in Counter([i for j in list_of_countries for i in j]).most_common(5)]

        # save top_countries in dedicated instance variable
        self.top_countries = top_countries

        return self

    def transform(self, X, y=None):
        """
        encoding genre and country
        """
        X_new = X.copy()
        for genre in self.top_genres:
            X_new['genre_' + genre] = X_new['all_genres'].apply(lambda x: 1 if genre in x else 0)
        X_new = X_new.drop(columns=["all_genres"])
        for country in self.top_countries:
            X_new['country_' + country] = X_new['top_countries'].apply(lambda x: 1 if country in x else 0)
        X_new = X_new.drop(columns=["top_countries"])
        return X_new

In [226]:
# Check it out
X_custom = CustomGenreAndCountryEncoder().fit_transform(X[['all_genres', 'top_countries']])
print(X_custom.shape)
X_custom.head()

(6864, 15)


Unnamed: 0,genre_Drama,genre_Comedy,genre_Thriller,genre_Action,genre_Romance,genre_Adventure,genre_Crime,genre_Horror,genre_Science Fiction,genre_Family,country_United States of America,country_United Kingdom,country_France,country_Germany,country_Canada
0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0
1,1,1,0,0,1,0,0,0,0,1,1,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0
3,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,1,0,0,0,0,1,0,0,1,0,0,0,1


**📝 Compute your `final_pipeline` by integrating all these transformers** (or all those you have coded)

- `CustomGenreAndCountryEncoder`
- `TimeFeaturesExtractor`
- `CyclicalFeatureExtractor`

In [231]:
# YOUR CODE HERE
categoricales = ['original_language','status','has_collection','available_in_english']
preprocessor = ColumnTransformer([
    ('num_transformer', num_transformer, numerical),
    ('TimeScaler', TimeFeaturesExtractor(),dates),
    ('custom_encode',CustomGenreAndCountryEncoder(),['all_genres', 'top_countries']),
    ('cat_transformer', cat_transformer, categoricales)]
)
final_pipe = Pipeline([
    ('preprocessing', preprocessor),
    
    ('linear_regression', Ridge())])
final_pipe

In [232]:
final_scores = cross_val_score(final_pipe,X , y, cv=5, scoring='r2')

In [234]:
final_scores

array([0.18959227, 0.17660596, 0.22923365, 0.1693959 , 0.07797017])

📝 **Compute and store its cross validated scores as `final_scores` array of floats**

- It does not necessarily improve the performance before we can try-out doing model tuning
- However, with a now limited number of features, we will be able to train more complex models in next section (ensemble...)

### 🧪 Save your result

Run the following cell to save your results.

In [235]:
ChallengeResult(
    'feature_engineering',
    X_time_features=X_time_features,
    X_time_cyclical= X_time_cyclical,
    time_pipeline=time_pipeline,
    final_pipeline=final_pipeline,
    final_scores=final_scores
).write()

# Hint: Try restarting your notebook if you obtain an error about saving a custom encoder

NameError: name 'X_time_cyclical' is not defined

## Model tuning

### Random Forest

📝 **Change the estimator of your `final_pipeline` by a Random Forest and checkout your new cross-validated score**

In [237]:
# YOUR CODE HERE
from sklearn.ensemble import RandomForestRegressor
final_pipeline = Pipeline([
    ('preprocessing', preprocessor),
    
    ('raandom_forest', RandomForestRegressor())])
final_pipeline
cross_val_score(final_pipeline,X , y, cv=5, scoring='r2')

array([0.47309495, 0.33061254, 0.25677895, 0.16772204, 0.25863439])

### Best hyperparameters quest



**📝 Fine tune your model to try to get the best performance in the minimum amount of time!**

- Store the result of your search inside the `search` variable.
- Store your 5 cross-validated scores inside `best_scores` array of floats

In [47]:
# YOUR CODE HERE
from sklearn.model_selection import GridSearchCV

# Instanciate grid search
grid_search = GridSearchCV(
    final_pipeline, 
    param_grid={
        # Access any component of the pipeline, as far back as you want
        'preprocessing__num_transformer__imputer__strategy': ['mean', 'median'],
        'preprocessing__children_scaler__shrink_factor': [1,2,3,4],
        'linear_regression__alpha': [0.1, 0.5, 1, 5, 10]},
    cv=5,
    scoring="r2")

grid_search.fit(X_train, y_train)
grid_search.best_params_

**📝 Re-train your best pipeline on the whole (X,y) dataset**
- Store the trained pipeline inside the `best_pipeline` variable

In [53]:
# YOUR CODE HERE

### Prediction

Now you have your model tuned with the best hyperparameters, you are ready for a prediction.

Here is a famous TV show released in 2017:

```python
dict(
        original_title=str("La Casa de Papel"),
        title=str("Money Heist"), 
        release_date= pd.to_datetime(["2017-05-02"]), 
        duration_min=float(50),
        description=str("An unusual group of robbers attempt to carry out the most perfect robbery"), 
        budget=float(3_000_000), 
        original_language =str("es"), 
        status=str("Released"),
        number_of_awards_won =int(2), 
        number_of_nominations=int(5), 
        has_collection=int(1),
        all_genres=str("Action, Crime, Mystery"), 
        top_countries=str("Spain, France, United States of America"), 
        number_of_top_productions=int('1'),
        available_in_english=bool('True') 
)
```

**📝 Compute the predicted popularity of this TV show and store it into the `popularity` variable as a floating number.**

In [54]:
# YOUR CODE HERE

In [55]:
# YOUR CODE HERE

### 🧪 Save your results

Run the following cell to save your results.

In [56]:
ChallengeResult(
    "model_tuning",
    search=search,
    best_pipeline=best_pipeline,
    best_scores = best_scores,
    popularity=popularity
).write()

## API 

Time to put a pipeline in production!

👉 Go to https://github.com/lewagon/data-certification-api and follow instructions

**This final part is independent from the above notebook**