## The project: Predict travel insurance claims

We use the "Travel Insurance" dataset by Zahier Nasrudin, published on Kaggle. It comes from a third-party travel insurance servicing company based in Singapore and describes travel insurance policyholders: some attributes of the holders and some attributes of the insurance products they purchased. The target is a binary variable indicating whether a policyholder filed a claim against the insurance company. <br>
Link to data: https://www.kaggle.com/datasets/mhdzahier/travel-insurance

In [None]:
#!pip install sagemaker

In [1]:
# For the .info() method to run below, an older version of numpy is needed
!pip install numpy==1.18.1


In [2]:
import sagemaker
import pandas as pd
import numpy as np
from platform import python_version
import zipfile
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score
import os
import boto3
from sagemaker.sklearn.estimator import SKLearn

In [3]:
python_version(), np.__version__

('3.7.10', '1.18.1')

### (a) Download data from Kaggle into Jupyter NB instance folder, load data into Jupyter NB environment

(1) Download the authentication JSON file ('kaggle.json') from Kaggle and upload it to the notebook file directory <br>
(2) Run the following commands in a bash terminal to download the travel insurance dataset from Kaggle

In [None]:
# pip install kaggle
# mkdir ~/.kaggle
# cp kaggle.json ~/.kaggle/
# chmod 600 ~/.kaggle/kaggle.json
# cd ml_eng_capstone
# kaggle datasets download -d mhdzahier/travel-insurance

Load data persisted on Jupyter notebook instance into Jupyter notebook environment

In [None]:
with zipfile.ZipFile('travel-insurance.zip', 'r') as zip_ref:
    zip_ref.extractall()
travel_insurance_df = pd.read_csv('travel insurance.csv')

### (b) Inspect & clean data

In [None]:
travel_insurance_df.head()

In [None]:
travel_insurance_df.info()

In [None]:
travel_insurance_df.isnull().sum()

Describe numerical values:

In [None]:
print('Duration:')
print(travel_insurance_df['Duration'].describe())
print()
print('Commision (in value):')
print(travel_insurance_df['Commision (in value)'].describe())
print()
print('Age:')
print(travel_insurance_df['Age'].describe())

In [None]:
travel_insurance_sub = travel_insurance_df[['Duration', 'Net Sales', 'Commision (in value)', 'Age']]
for i, col in enumerate(travel_insurance_sub):
    plt.figure(i)
    sns.distplot(travel_insurance_sub[col])

Duration: Drop rows with negative values

In [None]:
len(travel_insurance_df[travel_insurance_df['Duration']<0])

In [None]:
index_neg_duration = travel_insurance_df[travel_insurance_df['Duration']<0].index
travel_insurance_df.drop(index_neg_duration, inplace=True)
travel_insurance_df = travel_insurance_df.reset_index(drop=True)
travel_insurance_df.shape

Duration: Drop rows with extremely high values (upward outliers)

In [None]:
pd.set_option('display.max_rows', 50)
travel_insurance_df['Duration'].value_counts().sort_index(ascending = False).head(40)

In [None]:
index_high_duration = travel_insurance_df[travel_insurance_df['Duration']>500].index
travel_insurance_df.drop(index_high_duration, inplace=True)
travel_insurance_df = travel_insurance_df.reset_index(drop=True)

In [None]:
travel_insurance_df.shape
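
The two Duration filters above can also be written as a single boolean mask. The following is a minimal sketch on made-up toy values (not taken from the dataset); the bounds 0 and 500 mirror the thresholds used above, and the `toy_` names are only for illustration.

```python
import pandas as pd

# Toy data for illustration only; these values are not from the Kaggle dataset
toy_df = pd.DataFrame({'Duration': [-1, 0, 30, 500, 4881]})

# Keep rows with 0 <= Duration <= 500 (between() is inclusive on both ends by default)
toy_mask = toy_df['Duration'].between(0, 500)
toy_clean = toy_df[toy_mask].reset_index(drop=True)

print(toy_clean['Duration'].tolist())
```

Filtering with a mask avoids the separate index lookup and in-place drop, while keeping exactly the rows that the two drop steps keep.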

In [None]:
travel_insurance_df['Age'].value_counts().sort_index(ascending = False).head(10)

The upward outlier in age (118) is replaced by the next highest realistic value (88), which effectively caps age at 88

In [None]:
travel_insurance_df['Age'] = np.where(travel_insurance_df['Age'] == 118, 88, travel_insurance_df['Age'])

Replace NAs (which occur only in the Gender column) with the string 'UNKNOWN'

In [None]:
travel_insurance_df.fillna('UNKNOWN',inplace=True)
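
As a hedged illustration of the age cap and the missing-value replacement, the sketch below applies both to a made-up toy frame. It swaps in `clip(upper=88)` for the `np.where` replacement, which gives the same result only if 118 is the sole value above 88, and it restricts `fillna` to the Gender column instead of the whole frame.

```python
import pandas as pd

# Toy data for illustration only; values are made up
toy_df = pd.DataFrame({'Age': [25, 118, 88, 61],
                       'Gender': ['F', None, 'M', None]})

# Cap age at 88 (assumption: no legitimate ages above 88 exist in the data)
toy_df['Age'] = toy_df['Age'].clip(upper=88)

# Fill missing values in the Gender column only, leaving other columns untouched
toy_df['Gender'] = toy_df['Gender'].fillna('UNKNOWN')

print(toy_df)
```

Scoping the fill to one column guards against silently writing a string into a numeric column should NAs ever appear elsewhere.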

In [None]:
##Remove rows with missing data:
#travel_insurance_df = travel_insurance_df.dropna()
#travel_insurance_df = travel_insurance_df.reset_index().drop(labels='index', axis=1)

Overview of the data:

In [None]:
no_instances = travel_insurance_df.shape[0]
no_features = len(travel_insurance_df.columns) - 1
target_shares = round(travel_insurance_df['Claim'].value_counts()/len(travel_insurance_df),3)
print("No. of instances: " + f"{no_instances:,}")
print("No. of features: " + str(no_features))
print("Share of targets: \n" + str(target_shares))
travel_insurance_df.head()

In [None]:
travel_insurance_df.columns

In [None]:
feat_list = ['Agency', 'Agency Type', 'Distribution Channel', 'Product Name', 'Destination']
for feat in feat_list:
    print('Value count for feature: ' + feat)
    print(travel_insurance_df[feat].value_counts().head(50))
    print()

In [None]:
#'Agency', 'Agency Type', 'Distribution Channel', 'Product Name', 'Destination'
travel_insurance_df['Destination'].value_counts().head(20)

### (c) Prep data, save on Jupyter NB instance, upload to S3

Recode target ('Claim') into numerical variable:

In [None]:
dict_label = {'Yes' : 1, 'No' : 0}
travel_insurance_df['Claim'] = travel_insurance_df['Claim'].replace(dict_label)

Correlation analysis of categorical features

In [None]:
#pd.crosstab(travel_insurance_df['Agency'], travel_insurance_df['Agency Type'])

Replace categorical features through one-hot encoding:

In [None]:
travel_insurance_df.columns

Categorical features are transformed into dummy variables. Given the non-ordinal nature of the categorical features ('Agency', 'Agency Type', 'Distribution Channel', 'Product Name', 'Destination', 'Gender'), we use one-hot encoding instead of label encoding. The first dummy column of each categorical feature is dropped (`drop_first=True`) to avoid perfect collinearity.

In [None]:
def one_hot(df):
    # Performs one-hot encoding on features of datatype object (string)
    # The first dummy column of each categorical feature is dropped (drop_first=True) to avoid perfect collinearity
    # NOTE: Categorical features already encoded as integers are NOT identified by this function!
    categ_list = list(df.select_dtypes(include='object').columns)
    for feat in categ_list:
        # Build the dummy columns for this feature and swap them in for the original column
        dummies = pd.get_dummies(df[feat], prefix=feat, drop_first=True)
        df = df.join(dummies)
        df = df.drop(feat, axis=1)
    return df
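
To make the effect of `drop_first=True` concrete, here is a small sketch on a made-up toy frame. The column values are invented, and the explicit `columns=` list is an assumption chosen for the illustration. For each encoded feature, pandas drops the dummy of the first level in sorted order.

```python
import pandas as pd

# Toy data for illustration only; values are made up
toy_df = pd.DataFrame({'Agency': ['EPX', 'CWT', 'EPX', 'JZI'],
                       'Age': [30, 41, 25, 58],
                       'Gender': ['F', 'M', 'UNKNOWN', 'F']})

# Encode the two string columns; the first level of each ('CWT', 'F') gets no dummy
toy_encoded = pd.get_dummies(toy_df, columns=['Agency', 'Gender'], drop_first=True)

print(list(toy_encoded.columns))
```

A row with all dummies of a feature equal to zero therefore belongs to the dropped reference level.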

In [None]:
travel_insurance_df = one_hot(travel_insurance_df)

In [None]:
travel_insurance_df.info()

Train-test split <br>
(Note: the test features are saved without the label; the labels are kept separately for evaluation)

In [None]:
travel_insurance_df_train, travel_insurance_df_test = train_test_split(travel_insurance_df, test_size = 0.2, 
                                                                 stratify = travel_insurance_df['Claim'], 
                                                                 shuffle = True, 
                                                                 random_state = 1)
travel_insurance_df_test_x = travel_insurance_df_test.drop(labels='Claim', axis = 1)
travel_insurance_df_test_y = travel_insurance_df_test['Claim']

In [None]:
travel_insurance_df_train.shape, travel_insurance_df_test_x.shape
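
The `stratify` argument keeps the share of the target classes the same in both splits. The sketch below shows this on a made-up toy frame with a 10% positive class; the data and the `toy_` names are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data for illustration only: 100 rows, 10 of them with Claim == 1
toy_df = pd.DataFrame({'x': range(100),
                       'Claim': [1] * 10 + [0] * 90})

toy_train, toy_test = train_test_split(toy_df, test_size=0.2,
                                       stratify=toy_df['Claim'],
                                       shuffle=True, random_state=1)

# Compare the share of positives in both splits
print(toy_train['Claim'].mean(), toy_test['Claim'].mean())
```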

Save train and test data to S3

In [4]:
sm_session = sagemaker.Session()
sm_role = sagemaker.get_execution_role()
bucket = sm_session.default_bucket()

In [5]:
sm_session, sm_role, bucket

(<sagemaker.session.Session at 0x7f454f75d890>,
 'arn:aws:iam::786251868139:role/c20300a265023u1382356t1w7-SageMakerNotebookInstanc-OA3L97SSKD0B',
 'sagemaker-us-east-1-786251868139')

In [58]:
data_dir = '../ml_eng_capstone/data'
os.makedirs(data_dir, exist_ok=True)

In [None]:
travel_insurance_df_train.to_csv(data_dir + '/' + 'train.csv', header = False, index = False)
travel_insurance_df_test_x.to_csv(data_dir + '/' + 'test.csv', header = False, index = False)

In [59]:
prefix = 'travel_insurance_claim_data'
train_path_s3 = sm_session.upload_data(os.path.join(data_dir, 'train.csv'), key_prefix=prefix)
test_path_s3 = sm_session.upload_data(os.path.join(data_dir, 'test.csv'), key_prefix=prefix)
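
For a single file, `upload_data` returns the S3 URI of the uploaded object. As a rough sketch, assuming the default behaviour of placing the file under `key_prefix` with its base name, the expected URI can be composed as follows. `expected_s3_uri` is a hypothetical helper (not part of the SageMaker SDK), and the bucket name is a placeholder.

```python
import os

def expected_s3_uri(bucket_name, key_prefix, local_path):
    # Hypothetical helper: composes the URI under which a single uploaded
    # file is expected to be found (bucket / key prefix / file name)
    return 's3://{}/{}/{}'.format(bucket_name, key_prefix, os.path.basename(local_path))

# Placeholder bucket name, used for illustration only
toy_uri = expected_s3_uri('my-example-bucket', 'travel_insurance_claim_data', 'data/train.csv')
print(toy_uri)
```

Comparing this string with `train_path_s3` is a quick sanity check that the upload landed where the training job will look for it.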

In [None]:
bucket_list = []
for i in boto3.resource('s3').Bucket(bucket).objects.all():
    bucket_list.append(i)
bucket_list

In [None]:
#Delete data files in s3://sagemaker-us-east-1-786251868139/travel_insurance_claim_data/
#boto3.resource('s3').Bucket(bucket).objects.all().delete()

### (d) Train Random Forest w custom scikit-learn estimator (baseline A)

In [None]:
!pygmentize source/train_rf.py

In [None]:
est_rf_base = SKLearn(entry_point = 'train_rf.py',
                       source_dir = 'source',
                       role = sm_role,
                       framework_version = '0.23-1',
                       py_version = 'py3',
                       instance_count = 1,
                       instance_type = 'ml.m4.xlarge',
                       output_path = 's3://{}/{}/output'.format(bucket, prefix),
                       sagemaker_session = sm_session
                       #hyperparameters = {'n_estimators':100, 'min_samples_split':2, 'min_samples_leaf':1, 'max_depth':None, 'max_leaf_nodes':None}
                     )

In [None]:
est_rf_base.fit({'train' : train_path_s3})

### (e) Train SVM w custom scikit-learn estimator (baseline B)

In [None]:
!pygmentize source/train_svm.py

In [None]:
est_svm_base = SKLearn(entry_point = 'train_svm.py',
                       source_dir = 'source',
                       role = sm_role,
                       framework_version = '0.23-1',
                       py_version = 'py3',
                       instance_count = 1,
                       instance_type = 'ml.m4.xlarge',
                       output_path = 's3://{}/{}/output'.format(bucket, prefix),
                       sagemaker_session = sm_session
                       #hyperparameters = {'C':1, 'gamma':0.01}
                     )

In [None]:
est_svm_base.fit({'train' : train_path_s3})

### (f) Test baseline models with batch transform

#### (f-1) Test RF model (baseline)

##### Batch transform with SageMaker SDK (requires the estimator object in the notebook environment)

In [None]:
transform_obj_rf = est_rf_base.transformer(instance_count = 1, 
                                           instance_type = 'ml.m4.xlarge')
transform_obj_rf.transform(test_path_s3, content_type = 'text/csv', split_type = 'Line')

In [None]:
transform_obj_rf.output_path, data_dir

In [None]:
# Copy predictions from batch transform job to Jupyter NB instance folder & rename file
#!aws s3 cp --recursive $transform_obj_rf.output_path $data_dir
#!mv data/test.csv.out data/base_rf_test.csv.out

In [None]:
predictions_base_rf = pd.read_csv(os.path.join(data_dir, 'test.csv.out'), header=None)
print('Accuracy is: ' + str(round(accuracy_score(travel_insurance_df_test_y, predictions_base_rf.transpose()),4)))
print('Recall is: ' + str(round(recall_score(travel_insurance_df_test_y, predictions_base_rf.transpose()),4)))
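
Accuracy and recall can diverge sharply when the positive class is rare, which is why both are reported. The sketch below uses made-up toy labels (5 claims among 100 policies, an assumed ratio that is not taken from the dataset) and a trivial predictor that never predicts a claim.

```python
from sklearn.metrics import accuracy_score, recall_score

# Toy labels for illustration only: 5 claims among 100 policies
toy_y_true = [1] * 5 + [0] * 95
# A trivial "model" that never predicts a claim
toy_y_pred = [0] * 100

toy_acc = accuracy_score(toy_y_true, toy_y_pred)
toy_rec = recall_score(toy_y_true, toy_y_pred)
print('Accuracy:', toy_acc, 'Recall:', toy_rec)
```

The trivial predictor scores high on accuracy while finding none of the claims, so recall is the more informative metric for the claim class.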

##### Batch transform with AWS Python SDK Boto3

In [None]:
# Training job name: sagemaker-scikit-learn-2022-05-15-15-09-22-292

In [69]:
training_job_info = sm_session.sagemaker_client.describe_training_job(TrainingJobName='sagemaker-scikit-learn-2022-05-15-15-09-22-292')
model_artifacts_paths3 = training_job_info['ModelArtifacts']['S3ModelArtifacts']
training_image = training_job_info['AlgorithmSpecification']['TrainingImage']

In [73]:
primary_container = {"Image" : training_image, 
                     "ModelDataUrl" : model_artifacts_paths3}
model_name= training_job_info['TrainingJobName'] + '-model'
model_info = sm_session.sagemaker_client.create_model(ModelName = model_name,
                                                      ExecutionRoleArn = sm_role,
                                                      PrimaryContainer = primary_container)

In [75]:
transform_job_name = training_job_info['TrainingJobName'] + '-transform-job'
transform_output_path = "s3://{}/{}/batch-transform/".format(sm_session.default_bucket(),prefix)

In [77]:
transform_request = {
    "TransformJobName" : transform_job_name,
    "ModelName" : model_name,
    "MaxConcurrentTransforms": 1,
    "MaxPayloadInMB" : 6,
    "BatchStrategy" : "MultiRecord",
    "TransformOutput" : {
        "S3OutputPath" : transform_output_path
    },
    "TransformInput": {
        "ContentType": "text/csv",
        "SplitType": "Line",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": test_path_s3,
            }
        }
    },
    "TransformResources": {
        "InstanceType": "ml.m4.xlarge",
        "InstanceCount": 1
    }
}

In [None]:
transform_response = sm_session.sagemaker_client.create_transform_job(**transform_request)
transform_desc = sm_session.wait_for_transform_job(transform_job_name)

.............................................
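
Because the same request dictionary is needed for every model, it could be produced by a small function. The following is only a sketch: `build_transform_request` is a hypothetical helper, and the example names and URIs are placeholders. The keys and values mirror the request shown above.

```python
def build_transform_request(job_name, model_name, input_s3_uri, output_s3_uri,
                            instance_type='ml.m4.xlarge', instance_count=1):
    # Hypothetical helper: returns the keyword arguments for create_transform_job,
    # using the same settings as the request dictionary above
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        "MaxConcurrentTransforms": 1,
        "MaxPayloadInMB": 6,
        "BatchStrategy": "MultiRecord",
        "TransformOutput": {"S3OutputPath": output_s3_uri},
        "TransformInput": {
            "ContentType": "text/csv",
            "SplitType": "Line",
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": input_s3_uri}
            },
        },
        "TransformResources": {"InstanceType": instance_type,
                               "InstanceCount": instance_count},
    }

# Placeholder names and URIs, used for illustration only
toy_request = build_transform_request('example-transform-job', 'example-model',
                                      's3://my-example-bucket/prefix/test.csv',
                                      's3://my-example-bucket/prefix/batch-transform/')
print(toy_request["TransformJobName"])
```

The result would be passed on as `sm_session.sagemaker_client.create_transform_job(**toy_request)`, exactly like `transform_request` above.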

#### (f-2) Test SVM model (baseline)

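Because the kernel shut down, the estimator object is no longer available in the notebook environment. The batch transform job is therefore created from the earlier training job, whose model artifacts are stored on S3.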

In [13]:
training_job_info = sm_session.sagemaker_client.describe_training_job(TrainingJobName='sagemaker-scikit-learn-2022-05-15-15-30-29-785')
model_artifacts_paths3 = training_job_info['ModelArtifacts']['S3ModelArtifacts']
training_image = training_job_info['AlgorithmSpecification']['TrainingImage']

In [28]:
primary_container = {"Image" : training_image, 
                     "ModelDataUrl" : model_artifacts_paths3}
model_name= training_job_info['TrainingJobName'] + '-model'
model_info = sm_session.sagemaker_client.create_model(ModelName = model_name,
                                                      ExecutionRoleArn = sm_role,
                                                      PrimaryContainer = primary_container)

In [63]:
transform_job_name = training_job_info['TrainingJobName'] + '-transform-job'
transform_output_path = "s3://{}/{}/batch-transform/".format(sm_session.default_bucket(),prefix)

In [64]:
transform_request = {
    "TransformJobName" : transform_job_name,
    "ModelName" : model_name,
    "MaxConcurrentTransforms": 1,
    "MaxPayloadInMB" : 6,
    "BatchStrategy" : "MultiRecord",
    "TransformOutput" : {
        "S3OutputPath" : transform_output_path
    },
    "TransformInput": {
        "ContentType": "text/csv",
        "SplitType": "Line",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": test_path_s3,
            }
        }
    },
    "TransformResources": {
        "InstanceType": "ml.m4.xlarge",
        "InstanceCount": 1
    }
}

In [65]:
transform_response = sm_session.sagemaker_client.create_transform_job(**transform_request)
transform_desc = sm_session.wait_for_transform_job(transform_job_name)

In [67]:
# Copy predictions from batch transform job to Jupyter NB instance folder
# (the job was created with Boto3, so the output path is transform_output_path; transform_obj_svm does not exist in this session)
!aws s3 cp --recursive $transform_output_path $data_dir


In [None]:
predictions_base_svm = pd.read_csv(os.path.join(data_dir, 'test.csv.out'), header=None)

In [None]:
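# Only needed when the job is started through a SageMaker SDK transformer object (transform_obj_svm);
# the Boto3 path above already waits via sm_session.wait_for_transform_job
#transform_obj_svm.wait()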

### (g) Train Random Forest w re-sampled training data (SMOTE-Tomek)

### (h) Train SVM w re-sampled training data (SMOTE-Tomek)

### (i) Test models with re-sampled training data with batch transform

#### (i-1) Test RF model (re-sampled)

#### (i-2) Test SVM model (re-sampled)

### (j) Train Random Forest w re-sampled training data + hyperparameter tuning

### (k) Train SVM w re-sampled training data + hyperparameter tuning

### (l) Deploy models from (j), (k) behind multi-model endpoint

### (m) Run A/B Test with multi-model endpoint