## The project: Predict travel insurance claims

We use the "Travel Insurance" dataset published on Kaggle by Zahier Nasrudin. It comes from a third-party insurance servicing company based in Singapore and describes travel insurance policyholders: some attributes of each holder and some attributes of the insurance product purchased. The target is a binary variable indicating whether the policyholder filed a claim against the insurance company. <br>
Link to data: https://www.kaggle.com/datasets/mhdzahier/travel-insurance

In [None]:
#!pip install sagemaker

In [None]:
# The .info() calls below need an older version of numpy
!pip install numpy==1.18.1

In [None]:
import sagemaker
import pandas as pd
import numpy as np
from platform import python_version
import zipfile
from sklearn.model_selection import train_test_split
import os
import boto3

In [None]:
python_version(), np.__version__

### (a) Download data from Kaggle into Jupyter NB instance folder, load data into Jupyter NB environment

(1) Download the authentication JSON file ('kaggle.json') from Kaggle and upload it to the notebook file directory <br>
(2) Run the following commands in a bash terminal to download the travel insurance dataset from Kaggle

In [None]:
# pip install kaggle
# mkdir ~/.kaggle
# cp kaggle.json ~/.kaggle/
# chmod 600 ~/.kaggle/kaggle.json
# cd ml_eng_capstone
# kaggle datasets download -d mhdzahier/travel-insurance

Load the data stored on the notebook instance into the notebook environment:

In [None]:
with zipfile.ZipFile('travel-insurance.zip', 'r') as zip_ref:
    zip_ref.extractall()
travel_insurance_df = pd.read_csv('travel insurance.csv')

### (b) Describe & clean data

In [None]:
travel_insurance_df.head()

In [None]:
travel_insurance_df.info()

Describe numerical values:

In [None]:
print('Duration:')
print(travel_insurance_df['Duration'].describe())
print()
print('Commision (in value):')
print(travel_insurance_df['Commision (in value)'].describe())
print()
print('Age:')
print(travel_insurance_df['Age'].describe())

Drop rows with negative duration:

In [None]:
len(travel_insurance_df[travel_insurance_df['Duration']<0])

In [None]:
index_neg_duration = travel_insurance_df[travel_insurance_df['Duration']<0].index
travel_insurance_df.drop(index_neg_duration, inplace=True)
travel_insurance_df = travel_insurance_df.reset_index(drop=True)

Replace NAs (present only in the Gender column) with the string 'UNKNOWN':

In [None]:
travel_insurance_df.fillna('UNKNOWN',inplace=True)
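Calling `fillna` on the whole frame would also overwrite gaps in any other column, should they exist. A more targeted variant first counts missing values per column and then fills only the affected column. The following is a minimal sketch on a toy frame; the values are made up for illustration, and only the column name `Gender` comes from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) standing in for travel_insurance_df
toy_df = pd.DataFrame({
    'Gender': ['F', np.nan, 'M', np.nan],
    'Age': [34, 51, 28, 45],
})

# Count missing values per column to confirm where the gaps are
na_counts = toy_df.isna().sum()
print(na_counts)

# Fill only the column that is known to contain NAs
toy_df['Gender'] = toy_df['Gender'].fillna('UNKNOWN')
print(toy_df)
```

Restricting the fill to one column keeps numeric columns numeric if a gap ever shows up there, instead of silently mixing strings into them.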

In [None]:
##Remove rows with missing data:
#travel_insurance_df = travel_insurance_df.dropna()
#travel_insurance_df = travel_insurance_df.reset_index().drop(labels='index', axis=1)

Overview of the data:

In [None]:
no_instances = travel_insurance_df.shape[0]
no_features = len(travel_insurance_df.columns) - 1
target_shares = round(travel_insurance_df['Claim'].value_counts()/len(travel_insurance_df),3)
print("No. of instances: " + f"{no_instances:,}")
print("No. of features: " + str(no_features))
print("Share of targets: \n" + str(target_shares))
travel_insurance_df.head()

### (c) Prep data, save on Jupyter NB instance, upload to S3

Recode target ('Claim') into numerical variable:

In [None]:
dict_label = {'Yes' : 1, 'No' : 0}
travel_insurance_df['Claim'] = travel_insurance_df['Claim'].replace(dict_label)

Replace categorical features through one-hot encoding:

In [None]:
def one_hot(df):
    # One-hot encode all features of dtype object (string).
    # The first dummy column of each categorical is dropped (drop_first=True) to avoid perfect collinearity.
    # NOTE: Categorical features already encoded as integers are NOT identified by this function!
    categ_list = list(df.select_dtypes(include='object').columns)
    for feat in categ_list:
        dummies = pd.get_dummies(df[feat], prefix=feat, drop_first=True)
        df = df.join(dummies).drop(columns=feat)
    return df

In [None]:
travel_insurance_df = one_hot(travel_insurance_df)
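To make the effect of `drop_first=True` concrete, here is a small sketch on a toy frame. The column `Plan` and its levels are hypothetical and not taken from the dataset; the point is only to show which dummy columns remain after encoding.

```python
import pandas as pd

# Toy frame with one categorical feature (hypothetical column and levels)
toy_df = pd.DataFrame({
    'Plan': ['Basic', 'Gold', 'Silver', 'Gold'],
    'Duration': [10, 3, 25, 7],
})

# drop_first=True drops the first level in sorted order ('Basic'),
# so a three-level feature is represented by two indicator columns
dummies = pd.get_dummies(toy_df['Plan'], prefix='Plan', drop_first=True)
encoded_df = toy_df.join(dummies).drop(columns='Plan')
print(encoded_df)
```

A row whose indicators are all zero (or False) belongs to the dropped reference level, here `Basic`.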

In [None]:
travel_insurance_df.info()

Train-test split <br>
(Note: the label column is dropped from the test data)

In [None]:
travel_insurance_df_train, travel_insurance_df_test = train_test_split(travel_insurance_df, test_size = 0.2, 
                                                                 stratify = travel_insurance_df['Claim'], 
                                                                 shuffle = True, 
                                                                 random_state = 1)
travel_insurance_df_test = travel_insurance_df_test.drop(labels='Claim', axis = 1)
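Stratifying on `Claim` matters because the positive class is rare: without it, a random split could leave the test set with a noticeably different claim rate. The sketch below illustrates this on a toy frame with a 10% positive rate; the frame and its values are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with 10 positives out of 100 rows (hypothetical values)
toy_df = pd.DataFrame({
    'feature': range(100),
    'Claim': [1] * 10 + [0] * 90,
})

toy_train, toy_test = train_test_split(
    toy_df, test_size=0.2, stratify=toy_df['Claim'], shuffle=True, random_state=1
)

# Compare the share of positives in both partitions
print(toy_train['Claim'].mean(), toy_test['Claim'].mean())
```

With stratification, both partitions keep the same share of positives as the full toy frame.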

In [None]:
travel_insurance_df_train.shape, travel_insurance_df_test.shape

In [None]:
travel_insurance_train = np.array(travel_insurance_df_train)
travel_insurance_test = np.array(travel_insurance_df_test)

Save train and test data to S3

In [None]:
sm_session = sagemaker.Session()
sm_role = sagemaker.get_execution_role()
bucket = sm_session.default_bucket()

In [None]:
sm_session, sm_role, bucket

In [None]:
data_dir = '../ml_eng_capstone/data'
os.makedirs(data_dir, exist_ok=True)

In [None]:
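travel_insurance_df_train.to_csv(os.path.join(data_dir, 'train.csv'), header = False, index = False)
# Assumption: test.csv should have the same headerless, index-free layout as train.csv (minus the label)
travel_insurance_df_test.to_csv(os.path.join(data_dir, 'test.csv'), header = False, index = False)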

In [None]:
prefix = 'travel_insurance_claim_data'
train_path_s3 = sm_session.upload_data(os.path.join(data_dir, 'train.csv'), key_prefix=prefix)
test_path_s3 = sm_session.upload_data(os.path.join(data_dir, 'test.csv'), key_prefix=prefix)

In [None]:
bucket_list = list(boto3.resource('s3').Bucket(bucket).objects.all())
bucket_list

In [None]:
# Delete data files in s3://sagemaker-us-east-1-786251868139/travel_insurance_claim_data/
# boto3.resource('s3').Bucket(bucket).objects.all().delete()

### (d) Train Random Forest w custom scikit-learn estimator (baseline A)

In [None]:
!pygmentize source/train.py

In [None]:
from sagemaker.sklearn.estimator import SKLearn

In [None]:
est_rf_base = SKLearn(entry_point = 'train.py',
                       source_dir = 'source',
                       role = sm_role,
                       framework_version = '0.23-1',
                       py_version = 'py3',
                       instance_count = 1,
                       instance_type = 'ml.m4.xlarge',
                       output_path = 's3://{}/{}/output'.format(bucket, prefix),
                       sagemaker_session = sm_session
                       #hyperparameters = {'n_estimators':100, 'min_samples_split':2, 'min_samples_leaf':1, 'max_depth':None, 'max_leaf_nodes':None}
                     )

In [None]:
est_rf_base.fit({'train' : train_path_s3})
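The contents of `source/train.py` are not reproduced in this section. As rough orientation only: a SageMaker scikit-learn entry-point script typically reads the training channel and model directories from the environment variables `SM_CHANNEL_TRAIN` and `SM_MODEL_DIR`, fits a model, saves it with `joblib`, and defines a `model_fn` for loading it at inference time. The sketch below is hypothetical and may differ from the actual script. In particular, the helper name `train`, the file name `model.joblib`, and the assumption that the label sits in the first column of a headerless `train.csv` are illustrative.

```python
import argparse
import os

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def train(train_dir, model_dir, n_estimators=100):
    """Fit a random forest on train.csv and save it to model_dir."""
    # Assumption: headerless CSV with the label in the first column
    train_data = pd.read_csv(os.path.join(train_dir, 'train.csv'), header=None)
    y = train_data.iloc[:, 0]
    X = train_data.iloc[:, 1:]

    model = RandomForestClassifier(n_estimators=n_estimators, random_state=1)
    model.fit(X, y)

    os.makedirs(model_dir, exist_ok=True)
    joblib.dump(model, os.path.join(model_dir, 'model.joblib'))
    return model


def model_fn(model_dir):
    """Load the saved model; SageMaker calls this when serving predictions."""
    return joblib.load(os.path.join(model_dir, 'model.joblib'))


# Only run when the SageMaker training environment variables are present
if __name__ == '__main__' and 'SM_CHANNEL_TRAIN' in os.environ:
    parser = argparse.ArgumentParser()
    parser.add_argument('--n_estimators', type=int, default=100)
    parser.add_argument('--model-dir', type=str,
                        default=os.environ.get('SM_MODEL_DIR', '/opt/ml/model'))
    parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
    args, _ = parser.parse_known_args()
    train(args.train, args.model_dir, args.n_estimators)
```

Keeping the fitting logic in a plain function makes it possible to run the same code locally on a small sample before starting a training job.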

### (e) Train SVM w custom scikit-learn estimator (baseline B)

### (f) Test baseline models with batch transform

### (g) Train Random Forest w re-sampled training data (SMOTE-Tomek)

### (h) Train SVM w re-sampled training data (SMOTE-Tomek)

### (i) Test models with re-sampled training data with batch transform

### (j) Train Random Forest w re-sampled training data + hyperparameter tuning

### (k) Train SVM w re-sampled training data + hyperparameter tuning

### (l) Deploy models from (j), (k) behind multi-model endpoint

### (m) Run A/B Test with multi-model endpoint