<table><tr>
<td> <img src="https://upload.wikimedia.org/wikipedia/fr/thumb/e/e5/Logo_%C3%A9cole_des_ponts_paristech.svg/676px-Logo_%C3%A9cole_des_ponts_paristech.svg.png" width="200"  height="200" hspace="200"/> </td>
<td> <img src="https://pbs.twimg.com/profile_images/1156541928193896448/5ihYIbCQ_200x200.png" width="200" height="200" /> </td>
</tr></table>

<br/>

<h1><center>Session 11 - Models lifecycle and production deployment</center></h1>



<font size="3">This session is divided into **2** parts:
- **Code packaging**
- **Testing**

In each of these parts, some **guidelines** and **hints** are given for each task. 
Do not hesitate to check the links to documentation to understand the functions you use. 
    
The goal of this session is to **package the code** you wrote during the development phase and to implement a few tests on the input data before running inference on new data.
</font>

In [1]:
import os
ROOT_DIRPATH = '/Users/yaguethiam/Ponts/french-box-office/notebooks/session11/solution'

In [2]:
from pathlib import Path
import sys
sys.path.append(ROOT_DIRPATH)# set the path to where you put the folder of your package

In [3]:
#this is called a magic in a Jupyter notebook; it is helpful when you edit your scripts
#while using this notebook to test them, because you won't need to restart the kernel every time you change a function
%load_ext autoreload
%autoreload 2

In [None]:
#install loguru if you don't already have it

In [38]:
!pip install loguru

Collecting loguru
  Downloading loguru-0.5.3-py3-none-any.whl (57 kB)
Installing collected packages: loguru
Successfully installed loguru-0.5.3


### Functions

In [15]:
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np
import pandas as pd
import lightgbm as lgb
from lightgbm import LGBMRegressor
from loguru import logger

#preprocessing
def clean_data(data, drop_2020=True):
    """Drop incomplete rows and the title column, then index the rows by release date."""
    print("cleaning data..")
    data = data.dropna()
    # non-inplace drop: same result, without mutating an intermediate frame
    data = data.drop(columns=['title'])
    if drop_2020:
        data = data.query("year != 2020")
    data = data.sort_values(by='release_date')
    data.release_date = pd.to_datetime(data.release_date)
    data.index = data.release_date
    data = data.drop(columns=['index', 'release_date', 'year'], errors='ignore')
    return data


def train_test_split_by_date(df: pd.DataFrame, split_date_val: str, split_date_test: str):
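    """Split a date-indexed dataset at two dates given in "YYYY-MM-DD" format
    - train: [:split_date_val]
    - validation: [split_date_val:split_date_test]
    - test: [split_date_test:]
    Label slicing with .loc includes both bounds, so rows dated exactly on a
    split date appear in both neighbouring subsets.
    """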
    train = df.loc[:split_date_val].copy()
    validation = df.loc[split_date_val:split_date_test].copy()
    test = df.loc[split_date_test:].copy()
    return train, validation, test


def get_x_y(dataset):
    target = dataset.sales
    target = target.astype(float)
    features = dataset.drop(columns = ['sales'], errors='ignore')
    return features, target


def transform_target(target, forward=True):
    """Apply log (forward=True) or exp (forward=False) to every target value."""
    if forward:
        target_tf = [np.log(x) for x in target]
    else:
        target_tf = [np.exp(x) for x in target]
    return target_tf

#training
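def train(lr, features, target, transformer = None):
    print(f"start fitting a {lr.__class__}...")
    # fit on the transformed target when a transformer is given, on the raw target otherwise
    fit_target = transformer(target, forward = True) if transformer else target
    lr = lr.fit(features, fit_target)
    predicted_target = lr.predict(features)
    if transformer:
        # bring the predictions back to the original scale before evaluating
        predicted_target = transformer(predicted_target, forward= False)
    print(get_evaluation_metrics(target, predicted_target))
    return lr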


def predict(model, features, transformer=None):
    print("Predicting on new data..")
    predicted_target = model.predict(features)
    if transformer:
        predicted_target = transformer(predicted_target, forward=False)
    return predicted_target


def save_model(model: LGBMRegressor, filepath: str):
    model.booster_.save_model(filepath, num_iteration=model.best_iteration_)
    print(f'Model saved to {filepath}')


#evaluate
def get_evaluation_metrics(y_test, y_pred, y_train=None) -> dict:
    metrics = {
        'mape': mean_absolute_percentage_error(y_test, y_pred),
        # square root of the MSE: same value as squared=False, which is deprecated in recent scikit-learn versions
        'rmse': np.sqrt(mean_squared_error(y_test, y_pred)),
        'mae': mean_absolute_error(y_test, y_pred),
    }
    return metrics


def mean_absolute_percentage_error(y_true, y_pred):
    """in percent"""
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred)/y_true)) * 100


def prettify_metrics(metrics: dict) -> str:
    output = [f"Evaluation:\n{'-'*10}"]
    for name, metric in metrics.items():
        output.append((f'- {name.upper()}: {round(metric, 2)}'))
    return '\n'.join(output) +'\n'


def evaluate(lr, features, target, transformer=None, ret=False):
    predicted_target = lr.predict(features)
    if transformer:
        predicted_target = transformer(predicted_target, forward=False)

    # compute the metrics once, print them, and return them only on request
    metrics = get_evaluation_metrics(target, predicted_target)
    print(metrics)
    if ret:
        return metrics

#utils
def load_dataset(path: str) -> pd.DataFrame:
    print(f"loading raw data {path}...")
    data = pd.read_csv(path)
    return data


def get_features_list(data):
    print("getting features list...")
    return set(data.columns)


def save_results(predictions_df, path):
    print(f"Saving predictions in {path} ...")
    predictions_df.to_csv(path)

### Code packaging: the training workflow

The goal of this part is to build a training workflow that is easy to run whenever needed. Running it should be as simple as:

`from bin.train import training_workflow
training_workflow()`

The functions above are all you need to build the training workflow. We have provided a folder with a package skeleton, so that once your code is in place you can make the call above.

Copy each function into the module where you think it belongs, and import it wherever you need it.
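As a rough illustration of what such a workflow function chains together, here is a minimal sketch. It is not the provided solution: `MeanRegressor` is a stand-in estimator used so the example runs without LightGBM, the `training_workflow` signature shown here is hypothetical, and the toy data is invented. In the real package the steps would be imported from your own modules.

```python
import numpy as np
import pandas as pd


class MeanRegressor:
    """Stand-in estimator (assumption): always predicts the mean training target."""

    def fit(self, features, target):
        self.mean_ = float(np.mean(target))
        return self

    def predict(self, features):
        return np.full(len(features), self.mean_)


def training_workflow(data: pd.DataFrame, estimator, split_date: str):
    """Split by date, fit on log-sales, and report the validation MAE."""
    dates = pd.to_datetime(data["release_date"])
    is_train = dates < pd.Timestamp(split_date)
    train, validation = data[is_train], data[~is_train]

    feature_columns = [c for c in data.columns if c not in ("sales", "release_date")]
    # fit on the log of the target, as the notebook does with transform_target
    estimator.fit(train[feature_columns], np.log(train["sales"].astype(float)))

    # bring predictions back to the original scale before measuring the error
    predicted = np.exp(estimator.predict(validation[feature_columns]))
    mae = float(np.mean(np.abs(validation["sales"].to_numpy() - predicted)))
    return estimator, {"mae": mae}


# invented toy data, only to show the call
toy_data = pd.DataFrame({
    "release_date": ["2017-03-01", "2017-09-13", "2018-02-07"],
    "runtime": [95, 110, 102],
    "sales": [100, 10000, 1500],
})
model, metrics = training_workflow(toy_data, MeanRegressor(), "2018-01-01")
```

Keeping the whole chain behind one function is what makes the one-line call possible: the notebook, a script, or a scheduler can all trigger training the same way.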

- TIP:
When you package your code, use a logger instead of `print` to display messages. 
Example:

`from loguru import logger
logger.info("print something")`
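To make the tip concrete, here is a small sketch of a function that logs instead of printing. The function `summarize_split` and its arguments are hypothetical, and the fallback to the standard `logging` module is an assumption added only so the snippet still runs where loguru is not installed.

```python
try:
    from loguru import logger  # the logger recommended in this session
except ImportError:
    # Assumption: fall back to the standard library so the sketch still runs
    import logging
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("box_office")


def summarize_split(n_train: int, n_validation: int) -> str:
    """Log the size of each subset and return the logged message."""
    message = f"train rows: {n_train}, validation rows: {n_validation}"
    logger.info(message)
    return message


summary = summarize_split(800, 200)
```

Unlike `print`, a logger attaches a timestamp and a level to each message, which helps when the workflow runs outside the notebook.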

#### configs

In [16]:
TRAINING_DATASET_FILEPATH = os.path.join(ROOT_DIRPATH, 'data', "processed_dataset.csv")
LGBM_MODEL_FILEPATH = os.path.join(ROOT_DIRPATH, "models", "light_gbm_model.txt")
START_VALIDATION_DATE = '2018-01-01'
START_TEST_DATE = '2020-01-01'

FEATURE_IMPORTANCE = [
    "runtime",
    "mean_5_popularity",
    "mean_3_popularity",
    "budget",
    "actor_1_sales",
    "mean_sales_actor",
    "max_sales_actor",
    "actor_3_sales",
    "actor_2_sales",
    "month",
    "cos_month",
    "Comédie",
    "Drame",
    "is_part_of_collection",
    "rolling_sales_collection",
    "prod_FR",
    "Action",
    "prod_OTHER",
    "available_lang_fr",
    "original_lang_fr",
    "holiday",
    "Romance",
    "original_lang_en",
    "prod_US",
    "Familial",
    "nb_movie_collection",
    "Horreur",
    "available_lang_other",
    "prod_GB",
    "Other",
    "original_lang_other",
    "available_lang_it",
    "Fantastique",
    "available_lang_en",
    "vacances_zone_c",
    "vacances_zone_a",
    "available_lang_es",
    "vacances_zone_b",
    "prod_BE",
    "available_lang_de",
    "original_lang_ja",
    "prod_CA",
    "original_lang_it",
    "prod_DE",
    "available_lang_ja",
    "jour_ferie",
    "original_lang_es",
]
BEST_K_FEATURES = 36  # K best features sorted by feature importance

# LightGBM hyperparameters
LGBM_BEST_PARAMS = {
    "max_depth": 70,
    "n_estimators": 80,
    "num_leaves": 31,
} 

In [17]:
#load data and clean dataset
raw_data = load_dataset(TRAINING_DATASET_FILEPATH)
data = clean_data(raw_data, drop_2020=False)
train_data, validation_data, _ = train_test_split_by_date(data,
                                                          START_VALIDATION_DATE,
                                                          START_TEST_DATE)

loading raw data /Users/yaguethiam/Ponts/french-box-office/notebooks/session11/solution/data/processed_dataset.csv...
cleaning data..


In [18]:
#get train and validation data and select only the best k features we selected during dev phase
train_x, train_y = get_x_y(train_data)
validation_x, validation_y = get_x_y(validation_data)

train_x = train_x[FEATURE_IMPORTANCE[:BEST_K_FEATURES]]
validation_x = validation_x[FEATURE_IMPORTANCE[:BEST_K_FEATURES]]

In [19]:
#run a LGBM regressor
lgbm = LGBMRegressor(**LGBM_BEST_PARAMS)
print(f"Training LightGBM using hyper-parameters: {LGBM_BEST_PARAMS}")
lgbm = train(lgbm, train_x, train_y, transformer=transform_target)

Training LightGBM using hyper-parameters: {'max_depth': 70, 'n_estimators': 80, 'num_leaves': 31}
start fitting a <class 'lightgbm.sklearn.LGBMRegressor'>...
{'mape': 187.99827186360915, 'rmse': 234171.0957984909, 'mae': 102878.43197478513}


In [20]:
#validate
print("Evaluate on validation set ...")
evaluate(lgbm, validation_x, validation_y, transformer=transform_target)

Evaluate on validation set ...
{'mape': 376.8063413834246, 'rmse': 238484.95219472746, 'mae': 103569.43339518228}


In [21]:
#save the model
save_model(lgbm, LGBM_MODEL_FILEPATH)

Model saved to /Users/yaguethiam/Ponts/french-box-office/notebooks/session11/solution/models/light_gbm_model.txt


### Test input data

The goal of this part is to write simple tests for the input data in live mode:
- 1: build a test that checks whether all the expected features are present in the input data
- 2: build a test that checks whether the column `runtime` only contains positive numbers

Example of a test

`def test_sum():
    assert sum([1, 2, 3]) == 6, "Should be 6"`

- `assert`: the Python keyword that checks a condition and raises an `AssertionError` when it is false
- `sum([1, 2, 3]) == 6`: the condition you want to check
- `"Should be 6"`: a message displayed only when the test fails

You can also pass input data to your test function (for example, the dataset you want to check).

We have provided 3 datasets for inference. Run the tests on each of these datasets and comment on your results.
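One possible shape for the two tests is sketched below. The names `check_features_present`, `check_runtime_positive`, and `run_check` are hypothetical, `EXPECTED_FEATURES` is a shortened stand-in for the real feature list, and the small data frames are invented for illustration.

```python
import pandas as pd

# Assumption: a shortened stand-in for the features the model expects
EXPECTED_FEATURES = ["runtime", "budget", "month"]


def check_features_present(data: pd.DataFrame, expected_features) -> None:
    """Fail when an expected feature is missing from the input data."""
    missing = set(expected_features) - set(data.columns)
    assert not missing, f"Missing features: {sorted(missing)}"


def check_runtime_positive(data: pd.DataFrame) -> None:
    """Fail when runtime holds a zero, negative, or missing value."""
    assert (data["runtime"] > 0).all(), "runtime should only contain positive numbers"


def run_check(check, *args) -> str:
    """Run one check and report the outcome instead of stopping the notebook."""
    try:
        check(*args)
        return "passed"
    except AssertionError as error:
        return f"failed: {error}"


# invented examples: one valid frame and two broken variants
valid = pd.DataFrame({"runtime": [95, 120], "budget": [1e6, 5e6], "month": [1, 7]})
bad_runtime = valid.assign(runtime=[95, -1])
missing_column = valid.drop(columns=["budget"])

for name, frame in [("valid", valid), ("bad_runtime", bad_runtime), ("missing_column", missing_column)]:
    print(name, run_check(check_features_present, frame, EXPECTED_FEATURES), run_check(check_runtime_positive, frame))
```

Wrapping each check in `run_check` lets you see the result for every dataset in one pass, rather than halting at the first failed assertion.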

In [6]:
#build the test

In [None]:
#run tests on infer_data_1.csv, infer_data_2.csv and infer_data_3.csv

### Infer on new data

The goal of this part is to load the saved model and run inference on new data. 
Based on the test results above, select a dataset that passes them and run the model on it. Then save the results in a CSV file containing the title and the predicted sales.
Running the inference workflow should be as easy as:

`from bin.inference import inference_workflow
inference_workflow(str(Path(ROOT_DIRPATH,'data/test_data/test_valid.csv')))`
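The core of such a workflow can be sketched as follows. This is an illustration under stated assumptions, not the provided solution: `inference_on_frame` is a hypothetical helper that takes a data frame rather than a file path, `ConstantModel` stands in for the saved LightGBM model, and the movie rows are invented. The point it shows is that predictions are attached to the cleaned rows, so each title stays next to its own prediction.

```python
import numpy as np
import pandas as pd


class ConstantModel:
    """Stand-in for the trained model (assumption): predicts log-sales of 1000."""

    def predict(self, features):
        return np.full(len(features), np.log(1000.0))


def inference_on_frame(raw_data: pd.DataFrame, model, feature_names) -> pd.DataFrame:
    """Return a title / predicted_sales table for the usable rows of raw_data."""
    # same cleaning order as training: drop incomplete rows, then sort by date
    rows = raw_data.dropna().sort_values(by="release_date")
    log_sales = model.predict(rows[feature_names])
    return pd.DataFrame({
        "title": rows["title"].to_numpy(),
        # the model was trained on log-sales, so invert the transform
        "predicted_sales": np.exp(np.asarray(log_sales, dtype=float)),
    })


# invented rows: movie "C" has a missing runtime and is dropped
new_movies = pd.DataFrame({
    "title": ["B", "A", "C"],
    "release_date": ["2021-06-02", "2021-01-06", "2021-03-10"],
    "runtime": [100, 90, None],
})
predictions = inference_on_frame(new_movies, ConstantModel(), ["runtime"])
```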

In [22]:
path_data = Path(ROOT_DIRPATH, 'data/test_data/infer_data_3.csv')

In [23]:
raw_data = load_dataset(path_data)
data = clean_data(raw_data, drop_2020=False)
test_x, _ = get_x_y(data)
test_x = test_x[FEATURE_IMPORTANCE[:BEST_K_FEATURES]]

loading raw data /Users/yaguethiam/Ponts/french-box-office/notebooks/session11/solution/data/test_data/infer_data_3.csv...
cleaning data..


In [24]:
#load model
model = lgb.Booster(model_file= str(Path(ROOT_DIRPATH, 'models/light_gbm_model.txt')))
predicted_target = predict(model, test_x, transformer=transform_target)

Predicting on new data..


In [25]:
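#save results
# clean_data drops incomplete rows and sorts by release_date, so apply the same
# steps to raw_data before attaching the predictions row by row
results = raw_data.dropna().sort_values(by='release_date').copy()
results['predicted_sales'] = predicted_target
print(predicted_target[:10])
save_results(results[['title', 'predicted_sales']], Path(path_data.parent, 'prediction_' + path_data.name))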

[82799.60235992783, 102150.97954882764, 19370.508751992387, 62662.726585521865, 229001.89858818284, 40805.87188563293, 32789.463494470954, 140873.31504568606, 32720.16701623868, 233246.48634825862]
Saving predictions in /Users/yaguethiam/Ponts/french-box-office/notebooks/session11/solution/data/test_data/prediction_infer_data_3.csv ...
