# Insurance Claim Features and Model Clarify

In this notebook:
- Load data from the Feature Store
- Build a training set from the Feature Store
- Build a pure premium model using the SageMaker XGBoost algorithm
- Run a SageMaker Clarify processing job

What can we predict in this dataset?
1. __Claim Amount:__ total claims amount per policy holder.
1. __Claim Frequency:__ Number of claims per policy holder per exposure unit `Claim Frequency = Claim Count / Exposure`.
1. __Claim Severity:__ the average claim amount per claim for each policy holder `Claim Severity = Claim Amount / Claim Count`.
1. __Avg Claim amount:__ `Avg Claim amount = Claim Amount / Claim Count`
1. __Loss Cost:__ `Loss Cost = Claim Frequency x Claim Severity`
1. __Pure Premium:__ the mean of the total claim amount per exposure unit (the average loss per exposure) `PurePremium  = Claim Amount / Exposure`.
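
As a quick check of the ratios above, the following sketch computes frequency, average claim amount, and pure premium for a few made-up policies. The records and the helper name `claim_targets` are illustrative assumptions, not part of the dataset or of any library. Note that frequency multiplied by average claim amount recovers the pure premium.

```python
# Toy policy records; the numbers are made up for illustration only.
policies = [
    {"claim_count": 0, "claim_amount": 0.0, "exposure": 1.0},
    {"claim_count": 1, "claim_amount": 69.41, "exposure": 1.0},
    {"claim_count": 2, "claim_amount": 1200.0, "exposure": 0.5},
]


def claim_targets(claim_count, claim_amount, exposure):
    """Return frequency, average claim amount, and pure premium for one policy."""
    frequency = claim_count / exposure
    # A policy without claims has no defined average; 0.0 is used here by convention.
    avg_claim_amount = claim_amount / claim_count if claim_count > 0 else 0.0
    pure_premium = claim_amount / exposure
    return {
        "frequency": frequency,
        "avg_claim_amount": avg_claim_amount,
        "pure_premium": pure_premium,
    }


targets = [claim_targets(**policy) for policy in policies]
for row in targets:
    print(row)
```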

In [136]:
# !conda update scikit-learn -y
!pip install -U scikit-learn

Requirement already up-to-date: scikit-learn in /opt/conda/lib/python3.7/site-packages (0.24.1)


In [178]:
import sklearn
sklearn.__version__ 

'0.24.1'

In [179]:
print(__doc__)

from functools import partial

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.datasets import fetch_openml
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import PoissonRegressor
from sklearn.linear_model import GammaRegressor
from sklearn.linear_model import TweedieRegressor
from sklearn.metrics import mean_tweedie_deviance
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder
from sklearn.preprocessing import StandardScaler, KBinsDiscretizer

from sklearn.metrics import mean_absolute_error, mean_squared_error, auc

Automatically created module for IPython interactive environment


In [180]:
import boto3
import sagemaker
from sagemaker.session import Session


region = boto3.Session().region_name

boto_session = boto3.Session(region_name=region)

sagemaker_client = boto_session.client(service_name='sagemaker', region_name=region)
featurestore_runtime = boto_session.client(service_name='sagemaker-featurestore-runtime', region_name=region)

feature_store_session = Session(
    boto_session=boto_session,
    sagemaker_client=sagemaker_client,
    sagemaker_featurestore_runtime_client=featurestore_runtime
)

#### S3 Bucket Setup For The OfflineStore

SageMaker FeatureStore writes the data in the OfflineStore of a FeatureGroup to an S3 bucket that you own. To write to your S3 bucket, SageMaker FeatureStore assumes an IAM role that has access to it. You own this role as well.
Note that the same bucket can be reused across FeatureGroups. Data in the bucket is partitioned by FeatureGroup.

Set the default S3 bucket name here; it is referenced throughout the notebook.

In [181]:
feature_store_session.default_bucket()

'sagemaker-ca-central-1-314997521033'

In [182]:
# You can modify the following to use a bucket of your choosing
default_s3_bucket_name = feature_store_session.default_bucket()
prefix = 'sagemaker-featurestore-insurance'

print(default_s3_bucket_name)

sagemaker-ca-central-1-314997521033


In [183]:
from sagemaker import get_execution_role

# You can modify the following to use a role of your choosing. See the documentation for how to create this.
role = get_execution_role()
print(role)

arn:aws:iam::314997521033:role/service-role/AmazonSageMaker-ExecutionRole-20201209T151511


In [184]:
!aws s3 ls s3://sagemaker-ca-central-1-314997521033/sagemaker-featurestore-insurance/079329190341/sagemaker/us-east-1/offline-store/insurance-policy-feature-group-06-20-23-32/

Now let's wait for the data to appear in our offline store before moving forward to creating a dataset. This will take approximately 5 minutes.

In [185]:
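# Pick the last feature group returned by the listing; set the name explicitly if your account has several.
list_of_FG = sagemaker_client.list_feature_groups()
insurance_policy_feature_group_name = list_of_FG['FeatureGroupSummaries'][-1]['FeatureGroupName']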
insurance_policy_feature_group_name

'insurance-policy-feature-group-12-02-28-15'

In [186]:
from sagemaker.feature_store.feature_group import FeatureGroup

insurance_policy_feature_group = FeatureGroup(name=insurance_policy_feature_group_name, sagemaker_session=feature_store_session)

In [187]:
account_id = boto3.client('sts').get_caller_identity()["Account"]
print(account_id)

insurance_policy_feature_group_s3_prefix = prefix + '/' + account_id + '/sagemaker/' + region + '/offline-store/' + insurance_policy_feature_group_name + '/data'
print(insurance_policy_feature_group_s3_prefix)

314997521033
sagemaker-featurestore-insurance/314997521033/sagemaker/ca-central-1/offline-store/insurance-policy-feature-group-12-02-28-15/data
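
The wait for offline store data mentioned earlier could be automated with a small polling loop. This is a minimal sketch, assuming an S3 client created with `boto3.client('s3')`; the function name `wait_for_offline_store` and its timeout defaults are hypothetical, and only the `list_objects_v2` call is part of boto3.

```python
import time


def wait_for_offline_store(s3_client, bucket, prefix, timeout_s=600, poll_s=30, sleep=time.sleep):
    """Poll S3 until at least one object exists under prefix.

    Returns True once data is found, or False if timeout_s elapses first.
    """
    waited = 0
    while True:
        response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
        if response.get("KeyCount", 0) > 0:
            return True
        if waited >= timeout_s:
            return False
        sleep(poll_s)
        waited += poll_s
```

In this notebook it could be called with `default_s3_bucket_name` and `insurance_policy_feature_group_s3_prefix` before running the Athena query.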


## Build Training Dataset

SageMaker FeatureStore automatically builds the Glue Data Catalog for FeatureGroups (you can optionally turn this on or off when creating the FeatureGroup). In this example, we create the training dataset from the FeatureValues of the insurance policy FeatureGroup by using the auto-built catalog. We run an Athena query that reads the data stored in the offline store in S3 for that FeatureGroup.

In [188]:
insurance_policy_query = insurance_policy_feature_group.athena_query()

insurance_policy_table = insurance_policy_query.table_name

query_string = 'SELECT * FROM "' + insurance_policy_table + '"'
print('Running ' + query_string)

# Run the Athena query and load the output into a pandas DataFrame.
insurance_policy_query.run(query_string=query_string, output_location='s3://'+default_s3_bucket_name+'/'+prefix+'/query_results/')
insurance_policy_query.wait()
dataset = insurance_policy_query.as_dataframe()

dataset

Running SELECT * FROM "insurance-policy-feature-group-12-02-28-15-1618194501"


Unnamed: 0,vehage_bin0_0_4_0,vehage_bin4_0_10_0,vehage_bin10_0_100_0,drivage_bin18_0_36_0,drivage_bin36_0_50_0,drivage_bin50_0_99_0,vehbrand_b1,vehbrand_b10,vehbrand_b11,vehbrand_b12,...,exposure,bonusmalus,claimamount,purepremium,frequency,avgclaimamount,eventtime,write_time,api_invocation_time,is_deleted
0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,1.00,50.0,0.00,0.00,0.0,0.00,1.618194e+09,2021-04-12 02:36:43.409,2021-04-12 02:31:43.000,False
1,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,...,1.00,50.0,69.41,69.41,1.0,69.41,1.618194e+09,2021-04-12 02:36:43.409,2021-04-12 02:31:43.000,False
2,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.15,90.0,0.00,0.00,0.0,0.00,1.618194e+09,2021-04-12 02:36:43.409,2021-04-12 02:31:43.000,False
3,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.43,50.0,0.00,0.00,0.0,0.00,1.618194e+09,2021-04-12 02:36:43.409,2021-04-12 02:31:43.000,False
4,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.83,90.0,0.00,0.00,0.0,0.00,1.618194e+09,2021-04-12 02:36:43.409,2021-04-12 02:31:43.000,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59995,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,1.00,50.0,0.00,0.00,0.0,0.00,1.618194e+09,2021-04-12 02:31:41.211,2021-04-12 02:31:40.000,False
59996,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.75,50.0,0.00,0.00,0.0,0.00,1.618194e+09,2021-04-12 02:31:41.211,2021-04-12 02:31:40.000,False
59997,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.52,50.0,0.00,0.00,0.0,0.00,1.618194e+09,2021-04-12 02:31:41.211,2021-04-12 02:31:40.000,False
59998,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.03,60.0,0.00,0.00,0.0,0.00,1.618194e+09,2021-04-12 02:31:41.211,2021-04-12 02:31:40.000,False
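
The `SELECT *` query also returns the Feature Store metadata columns visible at the end of the table (`eventtime`, `write_time`, `api_invocation_time`, `is_deleted`). A narrower query string could be built as in the sketch below. The helper `build_feature_query` is a hypothetical name, and filtering on `is_deleted` is an assumption about what a training set should contain, not something this notebook does.

```python
# Metadata columns that appear in the query output alongside the features.
METADATA_COLUMNS = ("eventtime", "write_time", "api_invocation_time", "is_deleted")


def build_feature_query(table_name, columns, drop_deleted=True):
    """Build an Athena SELECT that keeps only non-metadata columns."""
    kept = [c for c in columns if c not in METADATA_COLUMNS]
    select_list = ", ".join(f'"{c}"' for c in kept)
    query = f'SELECT {select_list} FROM "{table_name}"'
    if drop_deleted:
        # Skip records that were marked as deleted in the offline store.
        query += " WHERE NOT is_deleted"
    return query


print(build_feature_query("my-table", ["purepremium", "density", "write_time"]))
```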


In [189]:
# Prepare query results for training.
query_execution = insurance_policy_query.get_query_execution()
query_result = 's3://'+default_s3_bucket_name+'/'+prefix+'/query_results/'+query_execution['QueryExecution']['QueryExecutionId']+'.csv'
print(query_result)

s3://sagemaker-ca-central-1-314997521033/sagemaker-featurestore-insurance/query_results/95b17086-0447-4028-ac54-432196437fba.csv


In [190]:
#!aws s3 ls s3://sagemaker-us-east-1-079329190341/sagemaker-featurestore-insurance/query_results/

In [191]:
df_features = pd.read_csv(query_result)

In [192]:
df_features.head()

Unnamed: 0,vehage_bin0_0_4_0,vehage_bin4_0_10_0,vehage_bin10_0_100_0,drivage_bin18_0_36_0,drivage_bin36_0_50_0,drivage_bin50_0_99_0,vehbrand_b1,vehbrand_b10,vehbrand_b11,vehbrand_b12,...,exposure,bonusmalus,claimamount,purepremium,frequency,avgclaimamount,eventtime,write_time,api_invocation_time,is_deleted
0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,1.0,50.0,0.0,0.0,0.0,0.0,1618194000.0,2021-04-12 02:36:43.409,2021-04-12 02:31:43.000,False
1,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,...,1.0,50.0,69.41,69.41,1.0,69.41,1618194000.0,2021-04-12 02:36:43.409,2021-04-12 02:31:43.000,False
2,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.15,90.0,0.0,0.0,0.0,0.0,1618194000.0,2021-04-12 02:36:43.409,2021-04-12 02:31:43.000,False
3,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.43,50.0,0.0,0.0,0.0,0.0,1618194000.0,2021-04-12 02:36:43.409,2021-04-12 02:31:43.000,False
4,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.83,90.0,0.0,0.0,0.0,0.0,1618194000.0,2021-04-12 02:36:43.409,2021-04-12 02:31:43.000,False


In [193]:
s3_client = boto3.client('s3', region_name=region)


In [194]:
#df_features.columns = feature_names +['PurePremium','Frequency','AvgClaimAmount','eventtime','write_time','api_invocation_time','is_deleted']

In [195]:
# Select useful columns for training with target column as the first.
dataset = df_features.iloc[:,np.r_[df_features.columns.get_loc('purepremium'), 0:60]]

# Write to csv in S3 without headers and index column.
dataset.to_csv('dataset.csv', header=False, index=False)
s3_client.upload_file('dataset.csv', default_s3_bucket_name, prefix+'/training_input/dataset.csv')
dataset_uri_prefix = 's3://'+default_s3_bucket_name+'/'+prefix+'/training_input/'

dataset

Unnamed: 0,purepremium,vehage_bin0_0_4_0,vehage_bin4_0_10_0,vehage_bin10_0_100_0,drivage_bin18_0_36_0,drivage_bin36_0_50_0,drivage_bin50_0_99_0,vehbrand_b1,vehbrand_b10,vehbrand_b11,...,region_r91,region_r93,region_r94,area_a,area_b,area_c,area_d,area_e,area_f,density
0,0.00,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,-1.270902
1,69.41,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,-0.098575
2,0.00,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.179327
3,0.00,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.745227
4,0.00,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,-0.240934
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59995,0.00,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.996113
59996,0.00,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,-0.045355
59997,0.00,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.234368
59998,0.00,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,-1.113961
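
The built-in SageMaker XGBoost algorithm expects CSV training data with the target in the first column and no header row, which is why `purepremium` is moved to the front. The index trick with `np.r_` can be seen on a small example; the column names below are made up for illustration.

```python
import numpy as np

# Hypothetical column layout: features first, then extra columns and the target.
columns = ["feat_a", "feat_b", "feat_c", "exposure", "target", "write_time"]

target_pos = columns.index("target")
n_features = 3

# np.r_ concatenates a scalar index and a slice into one index array.
order = np.r_[target_pos, 0:n_features]
selected = [columns[i] for i in order]
print(selected)
```

With a DataFrame, the same `order` array can be passed to `.iloc[:, order]`, as the cell above does with 60 feature columns.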


In [203]:
dataset.head()

Unnamed: 0,purepremium,vehage_bin0_0_4_0,vehage_bin4_0_10_0,vehage_bin10_0_100_0,drivage_bin18_0_36_0,drivage_bin36_0_50_0,drivage_bin50_0_99_0,vehbrand_b1,vehbrand_b10,vehbrand_b11,...,region_r91,region_r93,region_r94,area_a,area_b,area_c,area_d,area_e,area_f,density
0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,-1.270902
1,69.41,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,-0.098575
2,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.179327
3,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.745227
4,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,-0.240934


# Pure Premium Modeling

#### Pure Premium Modeling using xgboost

In [197]:
training_image=sagemaker.image_uris.retrieve("xgboost", region, "1.0-1")
training_image

'341280168497.dkr.ecr.ca-central-1.amazonaws.com/sagemaker-xgboost:1.0-1-cpu-py3'

In [198]:
#!conda install -c conda-forge xgboost -y

In [199]:
training_output_path='s3://' + default_s3_bucket_name+'/'+prefix + '/training_output'

from sagemaker.estimator import Estimator
training_model = Estimator(training_image,
                           role, 
                           instance_count=1, 
                           instance_type='ml.m5.2xlarge',
                           volume_size = 5,
                           max_run = 3600,
                           input_mode= 'File',
                           output_path=training_output_path,
                           sagemaker_session=feature_store_session)

In [200]:
training_model.set_hyperparameters(objective = "reg:tweedie",
                                   num_round = 50)
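
The `reg:tweedie` objective fits a Tweedie model, which suits targets such as pure premium that are zero for most policies and positive and skewed otherwise. XGBoost's default variance power for this objective is 1.5. As a rough illustration only (this is not the code XGBoost runs), the mean Tweedie deviance for a power between 1 and 2 can be sketched in plain Python; the function name `tweedie_deviance` is made up for this example.

```python
def tweedie_deviance(y_true, y_pred, power=1.5):
    """Mean Tweedie deviance for 1 < power < 2.

    Requires y_true >= 0 and y_pred > 0. Lower is better; 0 means a perfect fit.
    """
    if not 1 < power < 2:
        raise ValueError("this sketch only covers 1 < power < 2")
    total = 0.0
    for y, mu in zip(y_true, y_pred):
        if y < 0 or mu <= 0:
            raise ValueError("need y_true >= 0 and y_pred > 0")
        total += 2 * (
            y ** (2 - power) / ((1 - power) * (2 - power))
            - y * mu ** (1 - power) / (1 - power)
            + mu ** (2 - power) / (2 - power)
        )
    return total / len(y_true)


# A zero claim is penalised less when the predicted premium is small.
print(tweedie_deviance([0.0], [4.0]), tweedie_deviance([0.0], [1.0]))
```

The notebook already imports `mean_tweedie_deviance` from scikit-learn, which could be used to evaluate predictions the same way.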

In [201]:
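# data_channels is not defined earlier in the notebook; build the training channel from the uploaded CSV.
train_data = sagemaker.inputs.TrainingInput(dataset_uri_prefix, content_type='text/csv')
data_channels = {'train': train_data}
training_model.fit(inputs=data_channels, logs=True)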

2021-04-25 22:39:49 Starting - Starting the training job...
2021-04-25 22:39:51 Starting - Launching requested ML instancesProfilerReport-1619390389: InProgress
......
2021-04-25 22:41:13 Starting - Preparing the instances for training......
2021-04-25 22:42:14 Downloading - Downloading input data...
2021-04-25 22:42:42 Training - Training image download completed. Training in progress.
2021-04-25 22:42:42 Uploading - Uploading generated training model
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter objective value reg:tweedie to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
[22:42:39] 6

In [202]:
predictor = training_model.deploy(model_name='xgboost-insurance-model', initial_instance_count = 1, instance_type = 'ml.m5.xlarge')

-------------!
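
Once the endpoint is up, it can be queried with CSV rows that contain the features only (no target column). The sketch below assumes a client created with `boto3.client('sagemaker-runtime')` and an endpoint name such as `predictor.endpoint_name`; the helper `predict_pure_premium` is a hypothetical name, and the response parsing assumes the container returns plain numeric text.

```python
def predict_pure_premium(runtime_client, endpoint_name, feature_rows):
    """Send feature rows as CSV and parse the numeric predictions."""
    body = "\n".join(",".join(str(v) for v in row) for row in feature_rows)
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=body,
    )
    raw = response["Body"].read().decode("utf-8")
    # Assumption: predictions come back separated by commas or newlines.
    return [float(x) for x in raw.replace("\n", ",").split(",") if x.strip()]
```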

# Detect Post-training Data and Model Bias

In [204]:
sagemaker_session = sagemaker.Session()
clarify_prefix = 'sagemaker_clarify'

In [205]:
from sagemaker.processing import ProcessingInput, ProcessingOutput

dataset_path = f"s3://{default_s3_bucket_name}/{prefix}/training_input/dataset.csv"
analysis_config_path = f"s3://{default_s3_bucket_name}/{clarify_prefix}/analysis_config.json"
analysis_result_path = f"s3://{default_s3_bucket_name}/{clarify_prefix}/output"

config_input = ProcessingInput(
                    input_name="analysis_config",
                    source=analysis_config_path,
                    destination="/opt/ml/processing/input/config")
data_input = ProcessingInput(
                    input_name="dataset",
                    source=dataset_path,
                    destination="/opt/ml/processing/input/data")
result_output = ProcessingOutput(
                    source="/opt/ml/processing/output",
                    destination=analysis_result_path,
                    output_name="analysis_result")  

In [206]:
from sagemaker import clarify
clarify_processor = clarify.SageMakerClarifyProcessor(role=role,
                                                      instance_count=1,
                                                      instance_type='ml.c4.xlarge',
                                                      sagemaker_session=sagemaker_session)

In [207]:
dataset_path = f"s3://{default_s3_bucket_name}/sagemaker-featurestore-insurance/training_input/dataset.csv"
analysis_config_path = f"s3://{default_s3_bucket_name}/{clarify_prefix}/analysis_config.json"
analysis_result_path = f"s3://{default_s3_bucket_name}/{clarify_prefix}/output"

bias_data_config = clarify.DataConfig(s3_data_input_path=dataset_path,
                                      s3_output_path=analysis_result_path,
                                      label='purepremium',
                                      headers=dataset.columns.to_list(),
                                      dataset_type='text/csv')

model_config = clarify.ModelConfig(model_name="xgboost-insurance-model",
                                   instance_type='ml.c5.xlarge',
                                   instance_count=1,
                                   accept_type='text/csv')

predictions_config = clarify.ModelPredictedLabelConfig(probability_threshold=0.8)

In [208]:
bias_config = clarify.BiasConfig(label_values_or_threshold=[1],
                                facet_name='density',
                                facet_values_or_threshold=[0])
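
To make the thresholds concrete, the sketch below binarizes a continuous label and a continuous facet and computes two of the pre-training metrics Clarify reports: class imbalance, `CI = (n_a - n_d) / (n_a + n_d)`, and difference in proportions of labels, `DPL = q_a - q_d`. It assumes that values above each threshold form the positive label and the sensitive group `d` respectively; the function name and the sample numbers are made up, and this is not Clarify's own implementation.

```python
def class_imbalance_and_dpl(labels, facet_values, label_threshold=1.0, facet_threshold=0.0):
    """Return (CI, DPL) after thresholding a continuous label and facet."""
    in_d = [f > facet_threshold for f in facet_values]   # sensitive group d
    positive = [y > label_threshold for y in labels]     # positive outcome
    n_d = sum(in_d)
    n_a = len(in_d) - n_d
    if n_a == 0 or n_d == 0:
        raise ValueError("both facet groups must be non-empty")
    # Proportion of positive labels within each group.
    q_a = sum(p for p, d in zip(positive, in_d) if not d) / n_a
    q_d = sum(p for p, d in zip(positive, in_d) if d) / n_d
    ci = (n_a - n_d) / (n_a + n_d)
    return ci, q_a - q_d


labels = [0.0, 69.41, 0.0, 0.0, 0.0, 500.0]          # pure premium values
density = [-1.27, -0.10, 0.18, 0.75, -0.24, -0.50]   # standardized density
print(class_imbalance_and_dpl(labels, density))
```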

In [209]:
clarify_processor.run_bias(data_config=bias_data_config,
                           bias_config=bias_config,
                           model_config=model_config,
                           model_predicted_label_config=predictions_config,
                           pre_training_methods='all',
                           post_training_methods='all')


Job Name:  Clarify-Bias-2021-04-25-22-58-19-188
Inputs:  [{'InputName': 'dataset', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-ca-central-1-314997521033/sagemaker-featurestore-insurance/training_input/dataset.csv', 'LocalPath': '/opt/ml/processing/input/data', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'analysis_config', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-ca-central-1-314997521033/Clarify-Bias-2021-04-25-22-58-19-188/input/analysis_config/analysis_config.json', 'LocalPath': '/opt/ml/processing/input/config', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'analysis_result', 'AppManaged': False, 'S3Output': {'S3Uri': 's3://sagemaker-ca-central-1-314997521033/sagemaker_clarify/output', 'LocalPath': '/opt/ml/processing/output', 'S3UploadMode': 'EndOfJob'}}

In [210]:
# boto3 client methods only accept keyword arguments.
sagemaker_client.delete_endpoint(EndpointName=predictor.endpoint_name)
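
boto3 client methods take keyword arguments only, so the endpoint name has to be passed as `EndpointName=...`. A slightly fuller cleanup, which also removes the endpoint configuration and optionally the model, might look like the sketch below. The helper name `delete_endpoint_resources` is hypothetical; the four client calls are standard boto3 SageMaker operations.

```python
def delete_endpoint_resources(sm_client, endpoint_name, model_name=None):
    """Delete an endpoint, its endpoint config, and optionally its model."""
    # Look up the config name before the endpoint is gone.
    description = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = description["EndpointConfigName"]
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    if model_name is not None:
        sm_client.delete_model(ModelName=model_name)
    return config_name
```

In this notebook it could be called as `delete_endpoint_resources(sagemaker_client, predictor.endpoint_name, 'xgboost-insurance-model')`.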