<div style="text-align: center">
  <img alt="AIcrowd" src="https://gitlab.aicrowd.com/jyotish/pricing-game-notebook-scripts/raw/master/pricing-game-banner.png">
</div>

# How to use this notebook 📝

1. **Copy the notebook**. This is a shared template and any edits you make here will not be saved. _You should copy it into your own drive folder._ For this, click the "File" menu (top-left), then "Save a Copy in Drive". You can edit your copy however you like.
2. **Link it to your AICrowd account**. In order to submit your code to AICrowd, you need to provide your account's API key (see [_"Configure static variables"_](#static-var) for details).
3. **Stick to the function definitions**. The submission to AICrowd will look for the pre-defined function names:
  - `fit_model`
  - `save_model`
  - `load_model`
  - `predict_expected_claim`
  - `predict_premium`

    Anything else you write outside of these functions will not be part of the final submission (including constants and utility functions), so make sure everything is defined within them, except for:
4. **Define your preprocessing**. In addition to the functions above, anything in the cell labelled [_"Define your data preprocessing"_](#data-preprocessing) will also be imported into your final submission. 
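
As a rough illustration of how the five functions fit together, here is a minimal sketch built around a constant-mean "model". The toy data, the 10% loading, and the simplified signatures (for example, `save_model` receiving the model explicitly) are assumptions made for this example only; they are not the template's baseline nor the model defined later in this notebook.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd


def fit_model(X_raw, y_raw):
    # Toy "model": the average claim over all training contracts.
    return {"mean_claim": float(np.mean(y_raw))}


def save_model(model, model_path):
    # Hypothetical signature: the model is passed in explicitly.
    with open(model_path, "wb") as f:
        pickle.dump(model, f)


def load_model(model_path):
    with open(model_path, "rb") as f:
        return pickle.load(f)


def predict_expected_claim(model, X_raw):
    # Same expected claim for every contract.
    return np.full(len(X_raw), model["mean_claim"])


def predict_premium(model, X_raw):
    # Expected claim plus an assumed 10% loading.
    return predict_expected_claim(model, X_raw) * 1.1


# Toy data (made up for illustration).
X_toy = pd.DataFrame({"drv_age1": [30, 45, 60, 25]})
y_toy = np.array([0.0, 0.0, 400.0, 0.0])

model = fit_model(X_toy, y_toy)
path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")
save_model(model, path)
reloaded = load_model(path)
premiums = predict_premium(reloaded, X_toy)
```

The point of the sketch is the flow: whatever `fit_model` returns must survive a save/load round trip and be accepted by both prediction functions.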

# Your pricing model 🕵️

In this notebook, you can explore the data, then define and train your pricing model. You can then submit it directly to AIcrowd using the magic command at the end.

### Baseline logistic regression 💪
You can also play with a baseline logistic regression model [implemented here](https://colab.research.google.com/drive/1iDgDgWUw9QzOkbTYjeyY3i3DGuCoghs3?usp=sharing). 

# Setup the notebook 🛠

In [1]:
!bash <(curl -sL https://gitlab.aicrowd.com/jyotish/pricing-game-notebook-scripts/raw/master/python/setup.sh)
from aicrowd_helpers import *

⚙️ Installing AIcrowd utilities...
  Running command git clone -q https://gitlab.aicrowd.com/yoogottamk/aicrowd-cli /tmp/pip-req-build-bkhrtuon
✅ Installed AIcrowd utilities


# Configure static variables 📎
<a name="static-var"></a>

To submit from this notebook, visit https://aicrowd.com/participants/me and copy your API key.

Then set `AICROWD_API_KEY` in the `Config` class below to that key.

In [2]:
import sklearn

class Config:
  TRAINING_DATA_PATH = 'training.csv'
  MODEL_OUTPUT_PATH = 'model.pkl'
  AICROWD_API_KEY = 'eaab81e0ad4d64a6b0e7ec99a89205f6'  # You can get the key from https://aicrowd.com/participants/me
  ADDITIONAL_PACKAGES = [
    'numpy',  # you can pin versions as well, e.g. numpy==1.19.2
    'pandas',
    'scikit-learn==' + sklearn.__version__,
    "tqdm",
    "xgboost",
    ""
  ]

# Download dataset files 💾

In [3]:
# Make sure to officially join the challenge and accept the challenge rules! Otherwise you will not be able to download the data
%download_aicrowd_dataset

💾 Downloading dataset...
Verifying API Key...
API Key valid
Saved API Key successfully!
✅ Downloaded dataset


# Packages 🗃

<a name="packages"></a>

Import here all the packages you need to define your model. **You will need to include all of these packages in `Config.ADDITIONAL_PACKAGES` for your code to run properly once submitted.**
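
If you want the server to install the same versions you trained with, one option is to build the requirement strings from what is installed locally. The helper name `pin_packages` is hypothetical, and the sketch assumes Python 3.8+ (for `importlib.metadata`). Note that it expects distribution names (`scikit-learn`), not import names (`sklearn`).

```python
from importlib.metadata import PackageNotFoundError, version


def pin_packages(names):
    """Return requirement strings, pinned to the installed version when known."""
    pinned = []
    for name in names:
        try:
            pinned.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            # Not installed here: leave it unpinned rather than guess a version.
            pinned.append(name)
    return pinned


requirements = pin_packages(["numpy", "pandas", "xgboost", "tqdm"])
```

The resulting list could then be used as the value of `Config.ADDITIONAL_PACKAGES`.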

In [4]:
%%track_imports

import numpy as np
import pandas as pd
import pickle
from tqdm import tqdm
import itertools
import json

import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate


In [5]:
import importlib
import global_imports
importlib.reload(global_imports)
from global_imports import *  # do not change this

# Loading the data 📲

In [6]:
df = pd.read_csv(Config.TRAINING_DATA_PATH)
X_train = df.drop(columns=['claim_amount'])
y_train = df['claim_amount']

## What does the data look like? 🔍

In [7]:
X_train.sample(n=4)

Unnamed: 0,id_policy,year,pol_no_claims_discount,pol_coverage,pol_duration,pol_sit_duration,pol_pay_freq,pol_payd,pol_usage,drv_sex1,drv_age1,drv_age_lic1,drv_drv2,drv_sex2,drv_age2,drv_age_lic2,vh_make_model,vh_age,vh_fuel,vh_type,vh_speed,vh_value,vh_weight,population,town_surface_area
134547,PL056308,3.0,0.036,Max,8,3,Yearly,No,WorkPrivate,M,61.0,40.0,Yes,F,57.0,15.0,kvcddisqpkysmvvo,4.0,Gasoline,Tourism,149.0,17233.0,1012.0,420.0,338.4
22576,PL029650,1.0,0.0,Max,5,2,Monthly,No,WorkPrivate,F,47.0,1.0,No,0,,,zoypfizhpbtpjwpv,11.0,Gasoline,Tourism,145.0,10896.0,660.0,720.0,220.1
161086,PL031286,3.0,0.0,Max,26,3,Yearly,No,Retired,F,85.0,42.0,No,0,,,nilvygybpajtnxnr,3.0,Gasoline,Tourism,170.0,11405.0,1197.0,420.0,125.8
117096,PL090945,3.0,0.0,Max,11,5,Biannual,No,WorkPrivate,M,53.0,29.0,No,0,,,nilvygybpajtnxnr,4.0,Diesel,Tourism,170.0,11405.0,1197.0,230.0,46.6


In [8]:
y_train.sample(n=4)

171272    0.0
165027    0.0
146346    0.0
40376     0.0
Name: claim_amount, dtype: float64

# Training the model 🚀

Start by defining the first function, `fit_model`. It takes the training data as arguments and returns a "model" object, which you can define however you wish. For instance, this could be an array of parameter values.

## Define your data preprocessing

<a name="data-preprocessing"></a>

You can add any class or function in this cell for preprocessing. Just make sure that you use the functions here in the `fit_model`, `predict_expected_claim` and `predict_premium` functions if necessary.
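
One thing to watch when the same preprocessing runs at both training and prediction time: `pd.get_dummies` only creates columns for the categories present in the data it receives, so a prediction batch can end up with fewer (or differently ordered) columns than the training matrix. A minimal sketch of one way to guard against this is to remember the training columns and reindex new data onto them. The class name `DummyAligner` and the toy data are hypothetical.

```python
import pandas as pd


class DummyAligner:
    """Hypothetical helper: one-hot encode and keep the training column layout."""

    def __init__(self, columns):
        self.columns = columns
        self.fitted_columns_ = None

    def fit(self, x_train):
        encoded = pd.get_dummies(x_train, columns=self.columns, dtype="int8")
        self.fitted_columns_ = list(encoded.columns)
        return self

    def transform(self, x_raw):
        encoded = pd.get_dummies(x_raw, columns=self.columns, dtype="int8")
        # Add training columns missing from this batch as 0, drop unseen ones.
        return encoded.reindex(columns=self.fitted_columns_, fill_value=0)


train = pd.DataFrame({"vh_fuel": ["Gasoline", "Diesel", "Hybrid"], "vh_age": [4, 11, 3]})
new = pd.DataFrame({"vh_fuel": ["Diesel", "Electric"], "vh_age": [2, 7]})

aligner = DummyAligner(columns=["vh_fuel"]).fit(train)
encoded_new = aligner.transform(new)
```

Here the unseen category `Electric` is dropped and the missing `Gasoline` and `Hybrid` columns are filled with zeros, so the model always sees the same feature layout.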

In [9]:
%%aicrowd_include

# This magical command saves all code in this cell to a utils module.
# include your preprocessing functions and classes here.

from sklearn.preprocessing import LabelBinarizer


class NormalizeData:
    '''
    Class used to normalize a dataset according to a standard normal 
    distribution.

    Methods
    -------
    fit : Use the training dataset to calculate the mean and standard deviation
        used for the normalisation of new data.

    transform : Use the parameters calculated in the fit method to normalize 
        new data.
    '''
    def __init__(self):
        self.x_means = 0
        self.x_std = 0

    def fit(self, x_train):
        x_float = x_train.select_dtypes(include=['float', 'int']).drop(
            columns=['year', 'pol_no_claims_discount'])
        self.x_means = x_float.mean()
        self.x_std = x_float.std()
        return self

    def transform(self, x_raw):
        for idx in x_raw:
            if idx in self.x_means.index:
                x_raw[idx] = (x_raw[idx] - self.x_means[idx]) / self.x_std[idx]
        return x_raw


class Compress_vh_make_model:
    '''
    Class used to group the labels with low frequency from the feature 
    vh_make_model.

    Methods
    -------
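    fit : Use the training dataset to count the occurrences of each 
        vh_make_model label and identify the labels seen fewer than 
        n_occurences times.

    transform : Add the vh_model_counts feature (the training count of each 
        label, 1 for unseen labels) and replace the low-frequency labels 
        with 'other_models'.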
    '''
    def __init__(self, n_occurences=30):
        self.n_occ = n_occurences

    def fit(self, x_train):
        self.models_counts = x_train.vh_make_model.value_counts()
        self.models_to_group = self.models_counts[
                self.models_counts < self.n_occ].keys()
        return self
    
    def transform(self, x_raw):
        # Add a new feature according to the model count
        df_counts = pd.DataFrame(
            list(zip(self.models_counts.index, self.models_counts)),
            columns =['vh_make_model', 'vh_model_counts']
            )
        x_raw = x_raw.merge(
            right = df_counts, 
            on = 'vh_make_model', 
            how ='left'
            )
        x_raw['vh_model_counts'] = x_raw['vh_model_counts'].fillna(value=1)

        # Compress the vh_make_model column by grouping low-frequency models.
        mask_model_to_group = x_raw.vh_make_model.isin(self.models_to_group)
        x_raw.loc[mask_model_to_group, 'vh_make_model'] = 'other_models'

        return x_raw


# class ImputeVhInformations:
#     '''
#     Imputation class for missing values
#     '''
#     def __init__(self):
#         self.speed_mean = 0
#         self.weight_mean = 0
#         self.age_mean = 0

#     def prefit(self, x_train):
#         self.speed_mean = x_train['vh_speed'].mean()
#         self.weight_mean = x_train['vh_weight'].mean()
#         self.age_mean = x_train['vh_age'].mean()
#         return self

#     def pretransform(self, x_raw):
#         x = x_raw.copy()
#         x.loc[x.vh_speed.isnull(), 'vh_speed'] = self.speed_mean
#         x.loc[x.vh_weight.isnull(), 'vh_weight'] = self.weight_mean
#         x.loc[x.vh_age.isnull(), 'vh_age'] = self.age_mean
#         return x

#     def fit(self, x_train):
#         # Starting to impute speed and weight according to the model mean
#         self.prefit(x_train)
#         x = self.pretransform(x_train)

#         # Training a mixed linear regression to impute the vh_value
#         vh_columns = ['vh_age', 'vh_fuel', 'vh_type']
#         x_imputation = x.loc[x.vh_value.notnull(), vh_columns]
#         le = LabelEncoder()
#         vh_make_model = le.fit_transform(x_train['vh_make_model'])

#         self.vh_value_imputer = MixedLM(
#             endog = np.array(x['vh_value']),
#             exog = np.array(x_imputation),
#             groups = np.array(vh_make_model),
#             exog_re = np.array(x['vh_age'])
#             )
#         self.vh_value_imputer.fit()

#         return self

#     def transform(self, x_raw):
#         # Imputing speed and weight according to the model mean
#         x = self.pretransform(x_raw)

#         # Predict the missing vh_values
#         vh_columns = ['vh_age', 'vh_fuel', 'vh_type']
#         rows_to_predict = x.vh_value.isnull()

#         x_to_predict = x.loc[rows_to_predict, vh_columns]
#         x.loc[rows_to_predict, 'vh_value'] = self.vh_value_imputer.predict(x_to_predict)

#         return x


class Preprocess_X_data:
    """
    Class to preprocess the features of the dataset

    Methods
    -------
    add_new_features : Method to include new features

    impute_missing_values : Method to deal with missing values

    fit : Use the training data set to specify the parameters of the 
        preprocessing.

    transform : Use the parameters from the fit method to preprocess new data.

    """
    def __init__(self, n_occurences_vh_make_model=30, drop_id=False):
        self.normalizer = NormalizeData()
        self.compress_models = Compress_vh_make_model(
            n_occurences=n_occurences_vh_make_model
            )
        self.cols_to_binarize = ['pol_payd', 'drv_sex1', 'drv_drv2']
        self.cols_to_one_hot_encode = [
            'pol_coverage', 'pol_usage', 'drv_sex2', 
            'vh_make_model', 'vh_fuel', 'vh_type'
            ]
        self.drop_id = drop_id

    def add_new_features(self, x_raw):
        x = x_raw.copy()
        # Adding new features
        x.insert(
            loc=len(x.columns),
            column='pop_density',
            value = x.population / x.town_surface_area
            )
        x.insert(
            loc=len(x.columns),
            column='vh_speed_drv_age_ratio',
            value = x.vh_speed / x.drv_age1
            )
        x.insert(
            loc=len(x.columns),
            column='potential_force_impact',
            value = x.vh_speed * x.vh_weight
            )

        # Dropping unnecessary variables
        x = x.drop(columns='pol_pay_freq')
        if self.drop_id:
            x = x.drop(columns='id_policy')
        return x

    def impute_missing_values(self, x_raw):
        x = x_raw.copy()
        # Adding missing indicators
        x['vh_age_NA'] = x['vh_age'].isnull()
        x['vh_value_NA'] = x['vh_value'].isnull()

        # Impute missing values
        x = x.fillna(0)
        return x

    def fit(self, x_train):
        # Adding new features
        x_train = self.add_new_features(x_train)

        # Compressing the vh_make_model column
        self.compress_models.fit(x_train)
        x_train = self.compress_models.transform(x_train)

        # Normalization
        self.normalizer.fit(x_train)

        return self

    def transform(self, x_raw):
        # Adding new features
        x_prep = self.add_new_features(x_raw)

        # Compressing the vh_make_model column
        x_prep = self.compress_models.transform(x_prep)

        # Normalization
        colnames = x_prep.columns
        x_prep = self.normalizer.transform(x_prep)
        x_prep = pd.DataFrame(x_prep, columns=colnames)

        # Impute missing values
        x_prep = self.impute_missing_values(x_prep)

        # Binarize columns with only two categories
        lb = LabelBinarizer()
        for col in self.cols_to_binarize:
            x_prep[col] = lb.fit_transform(x_prep[col])

        # One-Hot-Encode the other categorical columns
        x_prep = pd.get_dummies(
            data=x_prep,
            prefix = self.cols_to_one_hot_encode,
            columns = self.cols_to_one_hot_encode,
            drop_first=True,
            dtype='int8'
            )

        return x_prep

    def fit_transform(self, x_raw):
        return self.fit(x_raw).transform(x_raw)


In [10]:
import importlib
import utils
importlib.reload(utils)
from utils import *  # do not change this

## Define the training logic

In [11]:
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials


def model_score(x_train, x_test, y_train, y_test, xgb_params):
    """
    Compute the evaluation score used by the hyperparameter optimisation.
    """
    # Convert the feature dataframes into DMatrix objects so they can be used
    # to train an XGBoost model
    dmatrix_train = xgb.DMatrix(x_train)
    dmatrix_train.set_label(y_train)
    dmatrix_valid = xgb.DMatrix(x_test)
    dmatrix_valid.set_label(y_test)

    # Train the XGBoost model
    results_dict = {}
    model = xgb.train(
        xgb_params,
        dtrain=dmatrix_train,
        num_boost_round=4000,
        early_stopping_rounds=50,
        evals=[(dmatrix_train, "train"), (dmatrix_valid, "eval")],
        evals_result=results_dict
    )

    return float(list(results_dict["eval"].values())[0][-1])


def hyperparameters_optimization(X_raw, y_raw):

    xgb_params_space = {
        # Definition of the model to train
        "objective": "reg:tweedie",
        "tweedie_variance_power": hp.uniform("tweedie_variance_power", 1, 2),
        "booster": 'gbtree',
        # Evaluation metric
        "eval_metric": "rmse",
        # Parameters for gbtree booster
        "learning_rate": hp.uniform("learning_rate", 0.001, 0.2),
        "n_estimators": hp.choice("n_estimators", range(1, 1000)),
        'gamma' : hp.uniform("gamma", 0, 0.4),
        'lambda': hp.choice("lambda", range(1, 1000)),
        "alpha": hp.choice("alpha", range(1, 1000)),
        "min_child_weight": hp.choice("min_child_weight", range(1, 12)),
        "max_depth": hp.choice("max_depth", range(1, 10)),
        "scale_pos_weight": hp.choice("scale_pos_weight", range(1, 1000)),
        'tree_method': 'gpu_hist',
        # Additional parameters for the training function
        "colsample_bytree": hp.uniform("colsample_bytree", 0.5, 0.9),
        "subsample": hp.uniform("subsample", 0.5, 0.9),
    }

    # Preprocessing
    print('preprocessing')
    preprocessing = Preprocess_X_data(
        n_occurences_vh_make_model=50,
        drop_id=True
        )
    x = preprocessing.fit_transform(X_raw)
    x = np.array(x)
    
    X_train, X_test, y_train, y_test = train_test_split(
        x, y_raw,
        test_size=0.33,
        shuffle=True,
        random_state=4000
    )


    def calculate_rmse(para):
        rmse = model_score(X_train, X_test, y_train, y_test, para)
        return {'loss': rmse, 'status': STATUS_OK}


    trials = Trials()
    print('Start')
    best = fmin(
        fn=calculate_rmse, 
        space=xgb_params_space, 
        algo=tpe.suggest, 
        max_evals=100, 
        trials=trials
        )
    print('best:')
    print(best)
    with open('xgb_best_params.json', 'w') as outfile:
        json.dump(best, outfile)


optim_parameters = False
if optim_parameters:
    hyperparameters_optimization(X_train, y_train)


In [12]:
def fit_model(X_raw, y_raw):
    """Model training function: given training data (X_raw, y_raw), train this pricing model.

    Parameters
    ----------
    X_raw : Pandas dataframe, with the columns described in the data dictionary.
        Each row is a different contract. This data has not been processed.
    y_raw : a Numpy array, with the value of the claims, in the same order as contracts in X_raw.
        A one dimensional array, with values either 0 (most entries) or >0.

    Returns
    -------
    self: this instance of the fitted model. This can be anything, as long as it is compatible
        with your prediction methods.

    """    
    xgb_params = {
        # Definition of the model to train
        "objective": ["reg:tweedie"],
        "tweedie_variance_power" : [1.13],
        "booster" : ['gbtree'],
        # Evaluation metric
        "eval_metric": ["rmse"],
        # Parameters for gbtree booster
        'colsample_bytree': [0.85],
        'gamma': [0.2],
        'learning_rate': [0.18],
        'max_depth': [8],
        'min_child_weight': [2],
        'subsample': [0.6],
        'lambda':[1],
        'alpha':[0],
        'tree_method':['gpu_hist'],
        # Additional parameters for the training function
        "early_stopping_rounds": [25],
        "num_boost_round": [4000]
    }

    # Preprocessing
    preprocessing = Preprocess_X_data(
        n_occurences_vh_make_model=50,
        drop_id=False
        )
    x = preprocessing.fit_transform(X_raw)
    x = x.drop(columns='id_policy', errors='ignore')

    # Randomly hold out 10% of the data as a validation set
    x_train, x_valid, y_train, y_valid = train_test_split(
        x, y_raw, 
        test_size=0.10,
        shuffle=True,
        random_state=2020
        )
    
    # Convert the feature dataframes into DMatrix objects so they can be used
    # to train an XGBoost model
    dmatrix_train = xgb.DMatrix(x_train.values)
    dmatrix_train.set_label(y_train)
    dmatrix_valid = xgb.DMatrix(x_valid.values)
    dmatrix_valid.set_label(y_valid)

    # Expand xgb_params into the list of every combination of parameters to
    # try during the grid search.
    keys, values = zip(*xgb_params.items())
    param_list = [dict(zip(keys, v)) for v in itertools.product(*values)]
    print(f"Number of combinations of parameters to try: {len(param_list)}")

    # Train the XGBoost model
    results_list = list()
    for i, params_dict in enumerate(param_list):
        results_dict = {}
        model = xgb.train(
            params_dict,
            dtrain = dmatrix_train,
            num_boost_round = params_dict["num_boost_round"],
            early_stopping_rounds = params_dict["early_stopping_rounds"],
            evals = [(dmatrix_train, "train"), (dmatrix_valid, "eval")],
            evals_result = results_dict
        )
        results_list.append({
            "eval": float(list(results_dict["eval"].values())[0][-1]),
            "train": float(list(results_dict["train"].values())[0][-1]),
            "params": params_dict
        })
        print(f"Trained model #{i + 1} out of {len(param_list)}")

    print(results_list)

    return model, preprocessing


## Train your model

In [14]:
trained_model, preprocessing = fit_model(X_train, y_train)

Number of combinations of parameters to try: 1
[0]	train-rmse:754.638	eval-rmse:508.727
Multiple eval metrics have been passed: 'eval-rmse' will be used for early stopping.

Will train until eval-rmse hasn't improved in 25 rounds.
[1]	train-rmse:753.953	eval-rmse:507.751
[2]	train-rmse:752.373	eval-rmse:505.627
[3]	train-rmse:749.997	eval-rmse:502.772
[4]	train-rmse:747.374	eval-rmse:500.231
[5]	train-rmse:744.363	eval-rmse:498.096
[6]	train-rmse:740.17	eval-rmse:496.701
[7]	train-rmse:730.127	eval-rmse:495.771
[8]	train-rmse:718.546	eval-rmse:495.325
[9]	train-rmse:702.631	eval-rmse:494.901
[10]	train-rmse:696.813	eval-rmse:494.724
[11]	train-rmse:675.76	eval-rmse:494.71
[12]	train-rmse:661.48	eval-rmse:494.721
[13]	train-rmse:650.406	eval-rmse:494.841
[14]	train-rmse:637.966	eval-rmse:494.867
[15]	train-rmse:619.508	eval-rmse:494.926
[16]	train-rmse:603.739	eval-rmse:495.006
[17]	train-rmse:585.9	eval-rmse:495.079
[18]	train-rmse:573.147	eval-rmse:495.213
[19]	train-rmse:564.477	eval

**Important note**: your training code should be able to run in under 10 minutes (since this notebook is re-run entirely on the server side). 

If you run into an issue here we recommend using the *zip file submission* (see the [challenge page](https://www.aicrowd.com/challenges/insurance-pricing-game/#how-to%20submit)). In short, you can simply do this by copy-pasting your `fit_model`, `predict_expected_claim` and `predict_premium` functions to the `model.py` file.

Note that if you want to perform extensive cross-validation/hyper-parameter selection, it is better to do them offline, in a separate notebook.
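
To check the time budget before submitting, you can time the training call locally. This is only a rough guide, since the server hardware may differ from yours. The helper `timed` is hypothetical, and the stand-in workload below should be replaced by your own `fit_model(X_train, y_train)` call.

```python
import time
from contextlib import contextmanager

TIME_BUDGET_SECONDS = 10 * 60  # the 10-minute limit mentioned above


@contextmanager
def timed(label, budget=TIME_BUDGET_SECONDS):
    """Print the elapsed time of the enclosed block and flag budget overruns."""
    result = {}
    start = time.perf_counter()
    try:
        yield result
    finally:
        result["seconds"] = time.perf_counter() - start
        status = "OK" if result["seconds"] <= budget else "OVER BUDGET"
        print(f"{label}: {result['seconds']:.1f}s ({status})")


# Stand-in for `fit_model(X_train, y_train)`.
with timed("training") as timing:
    total = sum(i * i for i in range(100_000))
```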

## Saving your model

You can save your model to a file here, so you don't need to retrain it every time.

In [15]:
# Some models, such as XGBoost or Keras models, don't pickle very reliably.
# Please use the saving functions provided by the package instead.
def save_model(model_path):
    trained_model.save_model(model_path)

In [16]:
save_model(Config.MODEL_OUTPUT_PATH)
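
Note that `save_model` as written stores only the booster; the fitted `preprocessing` object returned by `fit_model` is not written to the file. If you need both after a restart, one possible approach is to bundle the serialized model bytes and the preprocessing object in a single pickle. The helpers `save_bundle` and `load_bundle` are hypothetical, and the sketch uses stand-in values so it runs without XGBoost. To my understanding, with XGBoost the bytes would come from `booster.save_raw()` and be restored by creating `booster = xgb.Booster()` and calling `booster.load_model(bytearray(model_bytes))`.

```python
import os
import pickle
import tempfile


def save_bundle(path, model_bytes, preprocessing):
    """Write the serialized model and the fitted preprocessing to one file."""
    with open(path, "wb") as f:
        pickle.dump(
            {"model_bytes": bytes(model_bytes), "preprocessing": preprocessing}, f
        )


def load_bundle(path):
    """Return (model_bytes, preprocessing) as stored by save_bundle."""
    with open(path, "rb") as f:
        bundle = pickle.load(f)
    return bundle["model_bytes"], bundle["preprocessing"]


# Stand-ins for a serialized booster and a fitted preprocessing object.
fake_model_bytes = b"\x00serialized-booster\x01"
fake_preprocessing = {"x_means": {"vh_age": 9.5}, "x_std": {"vh_age": 4.2}}

bundle_path = os.path.join(tempfile.mkdtemp(), "bundle.pkl")
save_bundle(bundle_path, fake_model_bytes, fake_preprocessing)
restored_bytes, restored_prep = load_bundle(bundle_path)
```

Unpickling a real preprocessing object requires its class to be importable, which the `utils` module generated by `%%aicrowd_include` should provide.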

If you need to load it from file, you can use this code:

In [17]:
# Some models, such as XGBoost or Keras models, don't pickle very reliably.
# Please use the loading functions provided by the package instead.
def load_model(model_path):
    model = xgb.Booster()
    model.load_model(model_path)  # loads in place and returns None
    return model

In [18]:
# The file only contains the booster; `preprocessing` from `fit_model` is kept as is.
trained_model = load_model(Config.MODEL_OUTPUT_PATH)

# Predicting the claims 💵

The second function, `predict_expected_claim`, takes your trained model and a dataframe of contracts, and outputs a prediction for the (expected) claim incurred by each contract. This expected claim can be seen as the probability of an accident multiplied by the cost of that accident.

This is the function used to compute the _RMSE_ leaderboard, where the model best able to predict claims wins.
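
As a small worked example of "probability of an accident multiplied by the cost of that accident", and of how the RMSE is computed from it, consider three contracts. All numbers are made up for illustration.

```python
import numpy as np

# Made-up numbers for three contracts.
claim_probability = np.array([0.05, 0.10, 0.20])     # chance of a claim in the year
claim_severity = np.array([2000.0, 1500.0, 1000.0])  # average cost if a claim occurs

# Expected claim per contract: 100, 150 and 200.
expected_claim = claim_probability * claim_severity

# RMSE against what actually happened (most contracts have no claim).
observed = np.array([0.0, 0.0, 1000.0])
rmse = np.sqrt(np.mean((expected_claim - observed) ** 2))
```

Even with sensible expected claims the RMSE is large (about 473 here), because a single realized claim dwarfs the predictions. This is why the RMSE values in this notebook are in the hundreds.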

In [None]:
def predict_expected_claim(model, X_raw, preprocessing):
    """Model prediction function: predicts the expected claim based on the pricing model.

    This function estimates the expected claim made by a contract (typically, as the product
    of the probability of having a claim multiplied by the expected cost of a claim if it occurs),
    for each contract in the dataset X_raw.

    This is the function used in the RMSE leaderboard, and hence the output should be as close
    as possible to the expected cost of a contract.

    Parameters
    ----------
    model: a Python object that describes your model. This can be anything, as long
        as it is consistent with what `fit_model` outputs.
    X_raw : Pandas dataframe, with the columns described in the data dictionary.
        Each row is a different contract. This data has not been processed.
    preprocessing : the fitted `Preprocess_X_data` instance returned by `fit_model`.

    Returns
    -------
    expected_claim: a Pandas dataframe with one row per contract in X_raw (in same order)
        and two columns, `policy_id` and `expected_claim`. These expected claims must be
        POSITIVE (>0).
    """
    # Preprocessing
    x_clean = preprocessing.transform(X_raw)
    policy_id = x_clean.pop('id_policy')
    x_clean = xgb.DMatrix(x_clean.values)

    # predictions
    expected_claim = model.predict(x_clean)
    expected_claim = pd.DataFrame({
        "policy_id": policy_id.values, 
        "expected_claim": expected_claim
        })

    return expected_claim

To test your function, run it on your training data:

In [None]:
expected_claims = predict_expected_claim(trained_model, X_train, preprocessing)
expected_claims['claim_amounts'] = y_train
print(expected_claims)
sse = (expected_claims.expected_claim - expected_claims.claim_amounts)**2
print("\nRMSE: {:,.2f}".format(np.mean(sse)**0.5))

       policy_id  expected_claim  claim_amounts
0       PL000000       67.574654            0.0
1       PL042495       64.129890            0.0
2       PL042496      147.388367            0.0
3       PL042497       47.504368            0.0
4       PL042498       79.004623            0.0
...          ...             ...            ...
228211  PL008818      121.844688            0.0
228212  PL055033      117.793633            0.0
228213  PL061619      168.159470            0.0
228214  PL060903       27.699331            0.0
228215  PL052240      119.040894            0.0

[228216 rows x 3 columns]

RMSE: 519.50


In [None]:
# Leaderboard predictions
# Year 1
x = X_train[X_train.year == 1]
y = y_train[X_train.year == 1]
trained_model, preprocessing = fit_model(x, y)
expected_claims = predict_expected_claim(trained_model, x, preprocessing)
df_expectations = expected_claims.rename(columns={'expected_claim':'year 1'})

# subsequent years
for year in [2,3,4]:
    x = X_train[X_train.year <= year]
    y = y_train[X_train.year <= year]
    trained_model, preprocessing = fit_model(x, y)

    new_x = X_train[X_train.year == year]
    expected_claims = predict_expected_claim(trained_model, new_x, preprocessing)
    df_expectations = df_expectations.merge(
        expected_claims.rename(columns={'expected_claim':'year '+ str(year)}), 
        on='policy_id', 
        how='outer'
        )


In [None]:
print(df_expectations)
sse = np.array([])
for i, col in enumerate(df_expectations.columns[1:]):
    y = np.array(y_train[X_train.year == i+1])
    y_predicted = np.array(df_expectations[col].values)
    sse = np.append(sse, (y_predicted - y)**2)

print("\nRMSE: {:,.2f}".format(sse.mean() ** 0.5))

      policy_id      year 1      year 2      year 3      year 4
0      PL000000  138.475601  206.211472   53.509453   59.929661
1      PL042495   42.412300   63.934818   42.294338   46.201229
2      PL042496   31.804327   58.094189   29.981810   14.776361
3      PL042497   80.391930   78.044586   76.000435   57.384850
4      PL042498  115.126030   37.049774   30.784359   11.189279
...         ...         ...         ...         ...         ...
57049  PL002373  131.588135   86.257858   87.813972   98.504311
57050  PL004062    8.538408   46.636429   60.274456   30.683107
57051  PL006847  175.723526  126.055153  172.869980  105.745323
57052  PL012984   17.307497    5.385138    6.551529    2.292538
57053  PL008560   40.298351   40.566414   43.254398   33.405273

[57054 rows x 5 columns]

RMSE: 708.70


# Pricing contracts 💰💰

The third and final function, `predict_premium`, takes your trained model and a dataframe of contracts, and outputs a _price_ for each of these contracts. **You are free to set these prices however you want!** These prices will then be used in competition with other models: contracts will choose the model offering the lowest price, and this model will have to pay the cost if an accident occurs.

This is the function used to compute the _profit_ leaderboard: your model will participate in many markets of size 10, populated by other participants' models, and we compute the average profit of your model over all the markets it participated in.
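
To build intuition for these market mechanics, here is a toy simulation: each contract goes to the cheapest offer, and the winning insurer collects its premium and pays the claim. The three insurers, their prices, the claims, and the tie-breaking rule (the first insurer wins, via `argmin`) are all assumptions; the real competition may differ in its details.

```python
import numpy as np

# Rows = contracts, columns = insurers (made-up prices).
prices = np.array([
    [120.0, 110.0, 130.0],
    [90.0, 95.0, 85.0],
    [200.0, 210.0, 205.0],
    [60.0, 55.0, 70.0],
])
claims = np.array([0.0, 0.0, 500.0, 0.0])  # realized claim per contract

winner = prices.argmin(axis=1)                      # cheapest insurer per contract
won_price = prices[np.arange(len(claims)), winner]  # premium collected by the winner

n_insurers = prices.shape[1]
profit = np.zeros(n_insurers)
for insurer in range(n_insurers):
    mask = winner == insurer
    # Profit = premiums collected on won contracts minus the claims paid on them.
    profit[insurer] = won_price[mask].sum() - claims[mask].sum()
```

In this toy market the first insurer wins only the contract it underpriced and ends with a loss of 300, while the other two make 165 and 85. Offering the lowest price is only good when that price still covers the risk.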

In [None]:
def predict_premium(model, X_raw, preprocessing):
    """Model prediction function: predicts premiums based on the pricing model.

    This function outputs the prices that will be offered to the contracts in X_raw.
    The premium will typically depend on the expected claim predicted in 
    `predict_expected_claim`, and will add some pricing strategy on top.

    This is the function used in the average profit leaderboard. Prices output here will
    be used in competition with other models, so feel free to use a pricing strategy.

    Parameters
    ----------
    model: a Python object that describes your model. This can be anything, as long
        as it is consistent with what `fit_model` outputs.
    X_raw : Pandas dataframe, with the columns described in the data dictionary.
        Each row is a different contract. This data has not been processed.
    preprocessing : the fitted `Preprocess_X_data` instance returned by `fit_model`.

    Returns
    -------
    prices: a one-dimensional Pandas series of the same length as X_raw, with one
        price per contract (in same order). These prices must be POSITIVE (>0).
    """
    expected_claims = predict_expected_claim(model, X_raw, preprocessing)
    
    return  expected_claims.expected_claim * 1.2

To test your function, run it on your training data.

In [None]:
prices = predict_premium(trained_model, X_train, preprocessing)

#### Profit on training data

In order for your model to be considered in the profit competition, it needs to make nonnegative profit over its training set. You can check that your model satisfies this condition below:

In [None]:
print('Income:', prices.sum())
print('Losses:', y_train.sum())

if prices.sum() < y_train.sum():
    print('Your model loses money on the training data! It does not satisfy market rule 1: Non-negative training profit.')
    print('This model will be disqualified from the weekly profit leaderboard, but can be submitted for educational purposes to the RMSE leaderboard.')
else:
    print('Your model passes the non-negative training profit test!')

Income: 29998799.62003931
Losses: 26057988.080000006
Your model passes the non-negative training profit test!
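
If this check fails, one simple remedy is to scale the premiums so that total income covers total training losses. The sketch below computes the smallest such multiplier; the helper `minimum_loading` and the numbers are hypothetical, and this is plain arithmetic rather than a recommended pricing strategy.

```python
import numpy as np


def minimum_loading(expected_claims, observed_claims):
    """Smallest multiplier m such that sum(m * expected) >= sum(observed)."""
    return float(np.sum(observed_claims) / np.sum(expected_claims))


expected = np.array([80.0, 120.0, 100.0, 100.0])  # made-up predictions
observed = np.array([0.0, 0.0, 500.0, 0.0])       # made-up claims

loading = minimum_loading(expected, observed)      # 500 / 400 = 1.25
prices = expected * loading
```

Any loading above this value leaves a positive margin on the training set. The `predict_premium` function in this notebook uses a fixed factor of 1.2 instead.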


# Ready? Submit to AIcrowd 🚀

If you are satisfied with your code, run the code below to send your code to the AICrowd servers for evaluation! This requires the variable `trained_model` to be defined by your previous code.

**Make sure you have included all packages needed to run your code in the [_"Packages"_](#packages) section.**

**NOTE**: If you submit the baseline RMSE model without any change whatsoever, your model will not be entered into the market. 

In [None]:
%aicrowd_submit

🚀 Preparing to submit...
⚙️ Collecting the submission code...
💾 Preparing the submission zip file...
adding: utils.py (deflated 73%)
adding: predict_premium.py (deflated 54%)
adding: predict.py (deflated 52%)
adding: fit_model.py (deflated 59%)
adding: config.json (deflated 8%)
adding: save_model.py (deflated 37%)
adding: predict_expected_claim.py (deflated 54%)
adding: global_imports.py (deflated 48%)
adding: model.pkl (deflated 60%)
adding: load_model.py (deflated 35%)
adding: requirements.txt (stored 0%)
Verifying API Key...
API Key valid
Saved API Key successfully!
╭─────────────────────────╮
│ Successfully submitted! │
╰─────────────────────────╯
Important links
┌──────────────────┬───────────────────────────────────────────────────────────────────────────────────────────┐
│  This submission │ https://www.aicrowd.com/challenges/insurance-pricing-game/submissions/116135              │
│                  │                                                                              