# Auto Insurance Fraud Detection - Part 1 : Data Prep to Feature Store 

## ML Life-cycle Detailed View
In this end-to-end example, we will build a model to predict fraudulent insurance claims and deploy it with SageMaker so it can be accessed to provide real-time predictions. While a deployed model is the end product of this guide, its purpose is to walk you through the entire machine learning (ML) lifecycle using SageMaker and AWS.

![title](images/ml-lifecycle-detailed.png)

The goal of this end-to-end example is to show how SageMaker services and features can be used to support each task in the ML lifecycle. We will show you how all the various pieces fit together as one cohesive workflow.

### Exploratory Data Science and Scalable MLOps

Note that there are typically two workflows. The first is the more *exploratory, manual data science workflow*, where experiments are conducted and various techniques and strategies are tested. Once you have established your data prep transformations, featurizations, and training algorithms, and have even tested various hyperparameters for model tuning, you can start the second workflow, where you *rely on MLOps or the ML Engineering part of your team* to streamline the process and make it more repeatable and scalable by putting it into an automated pipeline.

### Car Insurance Claims : Data Sets and Problem Domain

The inputs for building our model and workflow are two tables of insurance data: a claims table and a customers table. This data was synthetically generated and is provided to you in its raw state for pre-processing with SageMaker Data Wrangler. However, completing the Data Wrangler step is not required to continue with the rest of this notebook. If you wish, you may use the `claims_preprocessed.csv` and `customers_preprocessed.csv` files in the `data` directory, as they are exact copies of what Data Wrangler would output.




<a id='all-up-overview'></a>

## [Overview](./0-AutoClaimFraudDetection.ipynb)
* ### [Notebook 0 : Architecture](./0-AutoClaimFraudDetection.ipynb)
* ### [Notebook 1: Data Prep, Process, Store Features](./1-data-prep-e2e.ipynb)
  * #### [Architecture](#arch)
  * #### [Getting started](#aud-getting-started)
  * #### [DataSets](#aud-datasets)
  * #### [SageMaker Feature Store](#aud-feature-store)
  * #### [Create train and test datasets](#aud-dataset)
* ### [Notebook 2](./2-lineage-train-assess-bias-tune-registry-e2e.ipynb)
  * #### [Train a model using XGBoost](#aud-train-model)
  * #### [Model lineage with artifacts and associations](#model-lineage)
  * #### [Evaluate the model for bias with Clarify](#check-bias)
  * #### [Deposit Model and Lineage in SageMaker Model Registry](#model-registry)
* ### [Notebook 3](./3-mitigate-bias-train-model2-registry-e2e.ipynb)
  * #### Train a version 2.0 model
* ### [Notebook 4](./4-deploy-run-inference-e2e.ipynb)
  * #### Deploy an approved model and make prediction
* ### [Notebook 5](./5-pipeline-e2e.ipynb)
  * #### SageMaker Pipeline
  * #### Cleanup

<a id='arch'> </a>
### Architecture for Data Prep, Process and Store Features
![Data Prep and Store](./images/e2e-1-pipeline-v3b.png)

### Loading stored variables
If you ran this notebook before, you may want to re-use the resources you already created with AWS. Run the cell below to load any previously created variables. You should see a print-out of the existing variables. If you don't see anything printed, then it's probably the first time you are running the notebook!

In [2]:
%store -r
%store

Stored variables and their in-db values:
bucket                          -> 'sagemaker-us-east-2-738335684114'
claims_fg_name                  -> 'fraud-detect-demo-claims'
claims_table                    -> 'fraud-detect-demo-claims-1610061189'
col_order                       -> ['fraud', 'total_claim_amount', 'incident_month', 
customers_fg_name               -> 'fraud-detect-demo-customers'
customers_table                 -> 'fraud-detect-demo-customers-1610061192'
database_name                   -> 'sagemaker_featurestore'
hyperparameters                 -> {'max_depth': '3', 'eta': '0.2', 'objective': 'bin
model_1_name                    -> 'fraud-detect-demo-xgboost-pre-smote'
mpg_name                        -> 'fraud-detect-demo'
prefix                          -> 'fraud-detect-demo'
test_data_uri                   -> 's3://sagemaker-us-east-2-738335684114/fraud-detec
train_data_uri                  -> 's3://sagemaker-us-east-2-738335684114/fraud-detec
training_job_1_name       

### Install and/or update required third-party libraries

In [2]:
!python -m pip install -Uq pip
!python -m pip install -q awswrangler==2.2.0 imbalanced-learn==0.7.0 sagemaker==2.23.1 boto3==1.16.48

### Import libraries

In [3]:
import json
import time
import boto3
import string
import sagemaker
import pandas as pd
import awswrangler as wr

from sagemaker.feature_store.feature_group import FeatureGroup

<a id='aud-getting-started'></a>
## Getting started: Creating Resources

[overview](#all-up-overview)
___
In order to successfully run this notebook you will need to create some AWS resources. 
First, an S3 bucket will be created to store all the data for this tutorial. 
Once the bucket is created, you will need to create an AWS Glue role using the IAM console and then attach a policy to the S3 bucket to allow FeatureStore access to this notebook. If you've already run this notebook and are picking up where you left off, running the cells below should pick up the resources you already created without creating any additional resources.

#### Add FeatureStore policy to Studio's execution role

![title](images/iam-policies.png)


1. In a separate browser tab, go to the IAM section of the AWS Console
2. Navigate to the Roles section and select the execution role you're using for your SageMaker Studio user
    * If you're not sure what role you're using, run the cell below to print it out
3. Attach the <font color='yellow'> AmazonSageMakerFeatureStoreAccess </font> policy to this role. Once attached, the changes take effect immediately.

In [4]:
print('SageMaker Role:', sagemaker.get_execution_role().split('/')[-1])

SageMaker Role: AmazonSageMaker-ExecutionRole-20210113T201603


### Set region, boto3 and SageMaker SDK variables

In [5]:
#You can change this to a region of your choice
region = "us-east-2"

In [6]:
boto3.setup_default_session(region_name=region)

boto_session = boto3.Session(region_name=region)

s3_client = boto3.client('s3', region_name=region)

sagemaker_boto_client = boto_session.client('sagemaker')

sagemaker_session = sagemaker.session.Session(
    boto_session=boto_session,
    sagemaker_client=sagemaker_boto_client)

sagemaker_role = sagemaker.get_execution_role()

account_id = boto3.client('sts').get_caller_identity()["Account"]

### Create a directory in the SageMaker default bucket for this tutorial

In [7]:
if 'bucket' not in locals():
    bucket = sagemaker_session.default_bucket()
    prefix = 'fraud-detect-demo'
    %store bucket
    %store prefix
    print(f'Creating bucket: {bucket}...')

try:
    s3_client.create_bucket(Bucket=bucket, ACL='private', CreateBucketConfiguration={'LocationConstraint': region})
    print('Create S3 bucket: SUCCESS')
    
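except Exception as e:
    # Only botocore client errors carry a `response`; anything else is re-raised unchanged
    code = getattr(e, 'response', {}).get('Error', {}).get('Code')
    if code == 'BucketAlreadyOwnedByYou':
        print(f'Using existing bucket: {bucket}/{prefix}')
    else:
        raise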

Using existing bucket: sagemaker-us-east-2-738335684114/fraud-detect-demo
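
The `create_bucket` call above always passes a `LocationConstraint`. That works for the `us-east-2` region used here, but S3 rejects an explicit constraint for `us-east-1`, so changing `region` to that value would make the cell fail. A minimal sketch of one way to handle both cases follows; the helper name `create_bucket_kwargs` is hypothetical and not part of the original notebook.

```python
def create_bucket_kwargs(bucket_name, region_name):
    """Build keyword arguments for s3_client.create_bucket (hypothetical helper)."""
    kwargs = {"Bucket": bucket_name, "ACL": "private"}
    # us-east-1 is the S3 default location and does not accept an explicit constraint
    if region_name != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region_name}
    return kwargs


print(create_bucket_kwargs("example-bucket", "us-east-2"))
print(create_bucket_kwargs("example-bucket", "us-east-1"))
```

With this helper, the call in the cell above could be written as `s3_client.create_bucket(**create_bucket_kwargs(bucket, region))`.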


In [8]:
#======> Tons of output_paths
traing_job_output_path = f's3://{bucket}/{prefix}/training_jobs'
bias_report_1_output_path = f's3://{bucket}/{prefix}/clarify-bias-1'
bias_report_2_output_path = f's3://{bucket}/{prefix}/clarify-bias-2'
explainability_output_path = f's3://{bucket}/{prefix}/clarify-explainability'

train_data_uri = f's3://{bucket}/{prefix}/data/train/train.csv'
test_data_uri = f's3://{bucket}/{prefix}/data/test/test.csv'

#=======> variables used for parameterizing the notebook run
train_instance_count = 1
train_instance_type = "ml.m4.xlarge"

claify_instance_count = 1
clairfy_instance_type = 'ml.c5.xlarge'

predictor_instance_count = 1
predictor_instance_type = "ml.c5.xlarge"

### Upload raw data to S3
Before you can preprocess the raw data with Data Wrangler, it must exist in S3.

In [9]:
s3_client.upload_file(Filename='data/claims.csv', Bucket=bucket, Key=f'{prefix}/data/raw/claims.csv')
s3_client.upload_file(Filename='data/customers.csv', Bucket=bucket, Key=f'{prefix}/data/raw/customers.csv')

### Update attributes within the  `.flow` file 
Data Wrangler generates a `.flow` file that contains a reference to the S3 bucket used during the wrangling. This bucket may differ from the default bucket in this notebook. For example, if the wrangling was done by someone else, you will probably not have access to their bucket, so you need to point the `.flow` file to your own S3 bucket before you can load it into Data Wrangler or access the data.

After running the cell below you can open the `claims.flow` and `customers.flow` files and export the data to S3 or you can continue the guide using the provided `data/claims_preprocessed.csv` and `data/customers_preprocessed.csv` files.

In [10]:
claims_flow_template_file = "claims_flow_template"

with open(claims_flow_template_file, 'r') as f:
    variables   = {'bucket': bucket, 'prefix': prefix}
    template    = string.Template(f.read())
    claims_flow = template.substitute(variables)
    claims_flow = json.loads(claims_flow)

with open('claims.flow', 'w') as f:
    json.dump(claims_flow, f)
    
customers_flow_template_file = "customers_flow_template"

with open(customers_flow_template_file, 'r') as f:
    variables      = {'bucket': bucket, 'prefix': prefix}
    template       = string.Template(f.read())
    customers_flow = template.substitute(variables)
    customers_flow = json.loads(customers_flow)
    
with open('customers.flow', 'w') as f:
    json.dump(customers_flow, f)
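
To make the substitution step concrete, here is a small self-contained sketch of the same `string.Template` technique. The one-node template text is a hypothetical stand-in (the real template files are much larger), and it assumes the placeholders are written as `$bucket` and `$prefix`, which is what the `variables` dictionary above suggests.

```python
import json
import string

# Hypothetical miniature flow template; not the content of the real template files
template_text = '{"nodes": [{"s3Uri": "s3://$bucket/$prefix/data/raw/claims.csv"}]}'


def render_flow(template_text, bucket, prefix):
    """Fill in the placeholders, then parse the result to confirm it is valid JSON."""
    rendered = string.Template(template_text).substitute(bucket=bucket, prefix=prefix)
    return json.loads(rendered)


flow = render_flow(template_text, "my-example-bucket", "fraud-detect-demo")
print(flow["nodes"][0]["s3Uri"])
```

Note that `substitute` raises a `KeyError` if the template contains a placeholder that is missing from the supplied variables, which is a useful early signal that a template and the notebook variables are out of sync.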

### Load preprocessed data from Data Wrangler job
If you ran the Data Wrangler jobs from  `claims.flow` and `customers.flow`, you can load your preprocessed data here. If you did not run the Data Wrangler job, you can still get started by loading the pre-made data sets from the `/data` directory of this example.



<a id='aud-datasets'></a>
#### DataSets and Feature Types

[overview](#all-up-overview)

In [11]:
claims_dtypes = {'policy_id': int,
 'incident_severity': int,
 'num_vehicles_involved': int,
 'num_injuries': int,
 'num_witnesses': int,
 'police_report_available': int,
 'injury_claim': float,
 'vehicle_claim': float,
 'total_claim_amount': float,
 'incident_month': int,
 'incident_day': int,
 'incident_dow': int,
 'incident_hour': int,
 'fraud': int,
 'driver_relationship_self': int,
 'driver_relationship_na': int,
 'driver_relationship_spouse': int,
 'driver_relationship_child': int,
 'driver_relationship_other': int,
 'incident_type_collision': int,
 'incident_type_breakin': int,
 'incident_type_theft': int,
 'collision_type_front': int,
 'collision_type_rear': int,
 'collision_type_side': int,
 'collision_type_na': int,
 'authorities_contacted_police': int,
 'authorities_contacted_none': int,
 'authorities_contacted_fire': int,
 'authorities_contacted_ambulance': int,
 'event_time': float}

customers_dtypes = {'policy_id': int,
 'customer_age': int,
 'customer_education': int,
 'months_as_customer': int,
 'policy_deductable': int,
 'policy_annual_premium': int,
 'policy_liability': int,
 'auto_year': int,
 'num_claims_past_year': int,
 'num_insurers_past_5_years': int,
 'customer_gender_male': int,
 'customer_gender_female': int,
 'policy_state_ca': int,
 'policy_state_wa': int,
 'policy_state_az': int,
 'policy_state_or': int,
 'policy_state_nv': int,
 'policy_state_id': int,
 'event_time': float}

In [12]:
#======> S3 output path of your Data Wrangler flows, if you decide to redo the Data Wrangler work on your own
flow_output_path = 'YOUR_PATH_HERE'

try:
    # this will try to load the exported dataframes from the claims and customers .flow files
    claims_s3_path = f'{flow_output_path}/claims_output'
    customers_s3_path = f'{flow_output_path}/customers_output'
    
    claims_preprocessed = wr.s3.read_csv(
        path=claims_s3_path, 
        dataset=True, 
        index_col=0, 
        dtype=claims_dtypes)
    
    customers_preprocessed = wr.s3.read_csv(
        path=customers_s3_path, 
        dataset=True, 
        index_col=0, 
        dtype=customers_dtypes)

except Exception:
    # if the Data Wrangler job was not run, the claims and customers dataframes will be loaded from local copies
    timestamp = pd.to_datetime('now').timestamp()
    print('Unable to load Data Wrangler output. Loading pre-made dataframes...')
    
    claims_preprocessed = pd.read_csv(
        filepath_or_buffer='data/claims_preprocessed.csv', 
        dtype=claims_dtypes)
    
    # a timestamp column is required by the feature store, so one is added with a current timestamp
    claims_preprocessed['event_time'] = timestamp
    
    customers_preprocessed = pd.read_csv(
        filepath_or_buffer='data/customers_preprocessed.csv', 
        dtype=customers_dtypes)
    
    customers_preprocessed['event_time'] = timestamp
    
    print('Complete')

Unable to load Data Wrangler output. Loading pre-made dataframes...
Complete


We now have a set of Pandas DataFrames that contain the customer and claim data, with the correct data types. When Data Wrangler one-hot encodes a feature, the resulting columns default to float data types (one feature --> many columns for the one-hot encoding).

<font color ='red'> Note: </font> the reason for explicitly converting the data types of the categorical features generated by Data Wrangler is to ensure they are integers, so that Clarify will treat them as categorical variables.

<a id='aud-feature-store'></a>
## SageMaker Feature Store

[overview](#all-up-overview)
___
Amazon SageMaker Feature Store is a purpose-built repository where you can store and access features so it’s much easier to name, organize, and reuse them across teams. SageMaker Feature Store provides a unified store for features during training and real-time inference without the need to write additional code or create manual processes to keep features consistent. SageMaker Feature Store keeps track of the metadata of stored features (e.g. feature name or version number) so that you can query the features for the right attributes in batches or in real time using Amazon Athena, an interactive query service. SageMaker Feature Store also keeps features updated, because as new data is generated during inference, the single repository is updated so new features are always available for models to use during training and inference.

A feature store consists of an offline component stored in S3 and an online component stored in a low-latency database. The online database is optional, but very useful if you need supplemental features to be available at inference. In this section, we will create feature groups for our Claims and Customers datasets. After inserting the claims and customer data into their respective feature groups, you need to query the offline store with Athena to build the training dataset.

You can reference the [SageMaker Developer Guide](https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store.html) for more information about the SageMaker Feature Store.


In [15]:
featurestore_runtime = boto_session.client(
    service_name='sagemaker-featurestore-runtime', 
    region_name=region
)

feature_store_session = sagemaker.Session(
    boto_session=boto_session,
    sagemaker_client=sagemaker_boto_client,
    sagemaker_featurestore_runtime_client=featurestore_runtime
)

### Configure the feature groups
The data type for each feature is set by passing a dataframe and inferring the proper data type. Feature data types can also be set via a config variable, but they will have to match the corresponding Python data types in the Pandas dataframe when it's ingested to the Feature Group.

In [16]:
claims_fg_name = f'{prefix}-claims'
customers_fg_name = f'{prefix}-customers'
%store claims_fg_name 
%store customers_fg_name

claims_feature_group = FeatureGroup(
    name=claims_fg_name, 
    sagemaker_session=feature_store_session)

customers_feature_group = FeatureGroup(
    name=customers_fg_name, 
    sagemaker_session=feature_store_session)

claims_feature_group.load_feature_definitions(data_frame=claims_preprocessed);
customers_feature_group.load_feature_definitions(data_frame=customers_preprocessed);

Stored 'claims_fg_name' (str)
Stored 'customers_fg_name' (str)


### Create the feature groups
You must tell the Feature Group which columns in the dataframe correspond to the required record identifier and event time features.

In [17]:
record_identifier_feature_name = 'policy_id'
event_time_feature_name = 'event_time'

try:
    claims_feature_group.create(
        s3_uri=f"s3://{bucket}/{prefix}",
        record_identifier_name=record_identifier_feature_name,
        event_time_feature_name=event_time_feature_name,
        role_arn=sagemaker_role,
        enable_online_store=True
    )
    print(f'Create "claims" feature group: SUCCESS')
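except Exception as e:
    # Only botocore client errors carry a `response`; anything else is re-raised unchanged
    code = getattr(e, 'response', {}).get('Error', {}).get('Code')
    if code == 'ResourceInUse':
        print(f'Using existing feature group: {claims_fg_name}')
    else:
        raise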

try:
    customers_feature_group.create(
        s3_uri=f"s3://{bucket}/{prefix}",
        record_identifier_name=record_identifier_feature_name,
        event_time_feature_name=event_time_feature_name,
        role_arn=sagemaker_role,
        enable_online_store=True
    )
    print(f'Create "customers" feature group: SUCCESS')
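except Exception as e:
    # Only botocore client errors carry a `response`; anything else is re-raised unchanged
    code = getattr(e, 'response', {}).get('Error', {}).get('Code')
    if code == 'ResourceInUse':
        print(f'Using existing feature group: {customers_fg_name}')
    else:
        raise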

Using existing feature group: fraud-detect-demo-claims
Using existing feature group: fraud-detect-demo-customers


### Wait until feature group creation has fully completed

In [18]:
def wait_for_feature_group_creation_complete(feature_group):
    status = feature_group.describe().get("FeatureGroupStatus")
    while status == "Creating":
        print("Waiting for Feature Group Creation")
        time.sleep(5)
        status = feature_group.describe().get("FeatureGroupStatus")
    if status != "Created":
        raise RuntimeError(f"Failed to create feature group {feature_group.name}")
    print(f"FeatureGroup {feature_group.name} successfully created.")
    
wait_for_feature_group_creation_complete(feature_group=claims_feature_group)
wait_for_feature_group_creation_complete(feature_group=customers_feature_group)

FeatureGroup fraud-detect-demo-claims successfully created.
FeatureGroup fraud-detect-demo-customers successfully created.
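
The wait loop above keeps polling for as long as the status stays `Creating`. If you prefer a bounded wait, the same polling pattern can be given a timeout. The sketch below is an illustration only: `wait_for_status` and the fake `describe` callable are hypothetical names, and the fake stands in for `feature_group.describe` so the example runs without AWS access.

```python
import time


def wait_for_status(describe, pending="Creating", done="Created",
                    poll_seconds=5, timeout_seconds=600):
    """Poll describe() until the status leaves `pending` or the timeout passes."""
    deadline = time.monotonic() + timeout_seconds
    status = describe().get("FeatureGroupStatus")
    while status == pending:
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still {pending} after {timeout_seconds} seconds")
        time.sleep(poll_seconds)
        status = describe().get("FeatureGroupStatus")
    if status != done:
        raise RuntimeError(f"Unexpected feature group status: {status}")
    return status


# Stand-in for feature_group.describe: reports "Creating" twice, then "Created"
responses = iter(["Creating", "Creating", "Created"])
print(wait_for_status(lambda: {"FeatureGroupStatus": next(responses)}, poll_seconds=0))
```

In the notebook you would pass `claims_feature_group.describe` or `customers_feature_group.describe` as the `describe` argument.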


### Ingest records into the Feature Groups
After the Feature Groups have been created, we can put data into each store by using the PutRecord API. This API can handle high TPS (transactions per second) and is designed to be called by different streams. The data from all of these Put requests is buffered and written to S3 in chunks. The files will be written to the offline store within a few minutes of ingestion.

In [19]:
if 'claims_table' in locals():
    print("You may have already ingested the data into your Feature Groups. If you'd like to do this again, you can run the ingest methods outside of the 'if/else' statement.")

else:
    claims_feature_group.ingest(
        data_frame=claims_preprocessed, max_workers=3, wait=True
    );

    customers_feature_group.ingest(
        data_frame=customers_preprocessed, max_workers=3, wait=True
    );

You may have already ingested the data into your Feature Groups. If you'd like to do this again, you can run the ingest methods outside of the 'if/else' statement.
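
For context on what a single PutRecord request looks like, each record is a list of feature name and string value pairs. The sketch below shows how one dataframe row could be turned into that shape. The helper `row_to_record` and the sample values are hypothetical, and this is not a description of how `ingest` is implemented internally.

```python
def row_to_record(row):
    """Convert a {feature_name: value} mapping into a PutRecord-style record list."""
    return [
        {"FeatureName": name, "ValueAsString": str(value)}
        for name, value in row.items()
    ]


# Hypothetical row; the real rows come from claims_preprocessed
example_row = {
    "policy_id": 1,
    "fraud": 0,
    "total_claim_amount": 8913.5,
    "event_time": 1610061189.0,
}
record = row_to_record(example_row)
print(record[0])
```

A record built this way could then be sent with `featurestore_runtime.put_record(FeatureGroupName=claims_fg_name, Record=record)`, using the runtime client created earlier in this notebook.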


### Wait for offline store data to become available
This usually takes 5-8 minutes.

In [20]:
claims_feature_group_s3_prefix = f'{prefix}/{account_id}/sagemaker/{region}/offline-store/{claims_fg_name}/data'
customers_feature_group_s3_prefix = f'{prefix}/{account_id}/sagemaker/{region}/offline-store/{customers_fg_name}/data'

offline_store_contents = None
while (offline_store_contents is None):
    objects_in_bucket = s3_client.list_objects(Bucket=bucket, Prefix=customers_feature_group_s3_prefix)
    if ('Contents' in objects_in_bucket and len(objects_in_bucket['Contents']) > 1):
        offline_store_contents = objects_in_bucket['Contents']
    else:
        print('Waiting for data in offline store...')
        time.sleep(60)
    
print('\nData available.')


Data available.
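
Note that the cell above defines an S3 prefix for both feature groups but only polls the customers prefix. If you want to confirm that both offline stores contain data, a check along the following lines may help. This is a sketch under assumptions: `offline_store_ready` is a hypothetical helper, and the fake listing function stands in for `s3_client.list_objects` so the example runs without AWS access.

```python
def offline_store_ready(list_objects, bucket, prefixes, min_objects=2):
    """Return True only when every prefix holds at least `min_objects` objects."""
    for prefix in prefixes:
        response = list_objects(Bucket=bucket, Prefix=prefix)
        if len(response.get("Contents", [])) < min_objects:
            return False
    return True


# Hypothetical listing: claims has two objects, customers has only one so far
fake_listing = {
    "demo/claims/data": [{"Key": "a"}, {"Key": "b"}],
    "demo/customers/data": [{"Key": "c"}],
}


def fake_list_objects(Bucket, Prefix):
    """Mimic the shape of an S3 listing response; 'Contents' is absent when empty."""
    contents = fake_listing.get(Prefix, [])
    return {"Contents": contents} if contents else {}


print(offline_store_ready(fake_list_objects, "example-bucket", list(fake_listing)))
```

In the notebook, the call would take `s3_client.list_objects`, `bucket`, and the two prefix variables defined in the cell above.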


<pre>

</pre>

<a id='aud-dataset'></a>
## Create train and test datasets

[overview](#all-up-overview)
___
Once the data is available in the offline store, it will automatically be cataloged and loaded into an Athena table (this is done by default, but can be turned off). In order to build our training and test datasets, you will submit a SQL query to join the Claims and Customers tables created in Athena.

In [21]:
claims_query = claims_feature_group.athena_query()
customers_query = customers_feature_group.athena_query()

claims_table = claims_query.table_name
customers_table = customers_query.table_name
database_name = customers_query.database
%store claims_table
%store customers_table
%store database_name

feature_columns = list( set(claims_preprocessed.columns) ^ set(customers_preprocessed.columns) )
feature_columns_string = ", ".join(f'\"{c}\"' for c in feature_columns)
feature_columns_string = f'"{claims_table}".policy_id as policy_id, ' + feature_columns_string

query_string = f"""
SELECT DISTINCT {feature_columns_string}
FROM "{claims_table}" LEFT JOIN "{customers_table}" 
ON "{claims_table}".policy_id = "{customers_table}".policy_id
"""

Stored 'claims_table' (str)
Stored 'customers_table' (str)
Stored 'database_name' (str)
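
The `^` operator in the cell above is a set symmetric difference: it keeps the columns that appear in exactly one of the two tables, which drops the shared `policy_id` and `event_time` columns. `policy_id` is then added back explicitly, qualified with the claims table name so the join is unambiguous. A small sketch with hypothetical, shortened column lists and a hypothetical table name:

```python
# Hypothetical, shortened column lists for illustration
claims_columns = ["policy_id", "fraud", "total_claim_amount", "event_time"]
customers_columns = ["policy_id", "customer_age", "event_time"]

# Symmetric difference keeps columns found in exactly one of the two tables.
# Sorting is added here only to make the output order deterministic.
feature_columns = sorted(set(claims_columns) ^ set(customers_columns))
print(feature_columns)

# Quote each column and add the join key back, qualified by a hypothetical table name
select_list = ", ".join(f'"{c}"' for c in feature_columns)
select_list = '"claims".policy_id as policy_id, ' + select_list
print(select_list)
```

Because Python sets are unordered, the column order in the notebook's query is not guaranteed; this is why `col_order` is built and stored in a later cell.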


In [22]:
claims_query.run(query_string=query_string, output_location=f's3://{bucket}/{prefix}/query_results')
claims_query.wait()
dataset = claims_query.as_dataframe()

In [28]:
dataset.to_csv("./data/claims_customer.csv")

In [23]:
col_order = ['fraud'] + list(dataset.drop(['fraud', 'policy_id'], axis=1).columns)
%store col_order

train = dataset.sample(frac=.80, random_state=0)[col_order]
test = dataset.drop(train.index)[col_order]

Stored 'col_order' (list)
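
The split above samples 80% of the rows for training and uses the remaining rows, found by dropping the sampled index labels, for testing, so the two sets never overlap. The plain-Python sketch below illustrates the same idea on row indices. It is an analogue only: `split_indices` is a hypothetical helper and it does not reproduce the exact rows that pandas selects with `random_state=0`.

```python
import random


def split_indices(n_rows, frac=0.8, seed=0):
    """Sample a fraction of row indices for training; the rest become the test set."""
    rng = random.Random(seed)
    train_idx = rng.sample(range(n_rows), k=int(round(frac * n_rows)))
    train_set = set(train_idx)
    test_idx = [i for i in range(n_rows) if i not in train_set]
    return train_idx, test_idx


train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))
```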


### Write train, test data to S3

In [24]:
train.to_csv('data/train.csv', index=False)
test.to_csv('data/test.csv', index=False)
dataset.to_csv('data/dataset.csv', index=True)

In [25]:
s3_client.upload_file(Filename='data/train.csv', Bucket=bucket, Key=f'{prefix}/data/train/train.csv')
s3_client.upload_file(Filename='data/test.csv', Bucket=bucket, Key=f'{prefix}/data/test/test.csv')
%store train_data_uri
%store test_data_uri

Stored 'train_data_uri' (str)
Stored 'test_data_uri' (str)


<pre>

</pre>

In [26]:
train.head(5)

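|  | fraud | incident_type_theft | policy_state_ca | policy_deductable | num_witnesses | policy_state_or | incident_month | customer_gender_female | num_insurers_past_5_years | customer_gender_male | ... | driver_relationship_other | policy_state_id | incident_hour | vehicle_claim | incident_type_collision | policy_annual_premium | policy_state_az | policy_state_wa | collision_type_rear | collision_type_front |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19134 | 0 | 0 | 1 | 750 | 2 | 0 | 5 | 1 | 1 | 0 | ... | 0 | 0 | 18 | 27774.0 | 1 | 3000 | 0 | 0 | 1 | 0 |
| 4981 | 0 | 0 | 1 | 750 | 2 | 0 | 6 | 1 | 1 | 0 | ... | 0 | 0 | 20 | 24000.0 | 1 | 3000 | 0 | 0 | 1 | 0 |
| 16643 | 0 | 0 | 1 | 750 | 2 | 0 | 6 | 0 | 1 | 1 | ... | 0 | 0 | 9 | 31452.0 | 1 | 2850 | 0 | 0 | 1 | 0 |
| 19117 | 0 | 0 | 1 | 750 | 2 | 0 | 11 | 1 | 1 | 0 | ... | 0 | 0 | 21 | 14153.0 | 1 | 2900 | 0 | 0 | 0 | 1 |
| 5306 | 0 | 0 | 0 | 750 | 0 | 0 | 8 | 0 | 1 | 0 | ... | 0 | 0 | 10 | 6000.0 | 1 | 3000 | 0 | 0 | 0 | 1 |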


In [27]:
test.head(5)

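|  | fraud | incident_type_theft | policy_state_ca | policy_deductable | num_witnesses | policy_state_or | incident_month | customer_gender_female | num_insurers_past_5_years | customer_gender_male | ... | driver_relationship_other | policy_state_id | incident_hour | vehicle_claim | incident_type_collision | policy_annual_premium | policy_state_az | policy_state_wa | collision_type_rear | collision_type_front |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 0 | 0 | 1 | 750 | 1 | 0 | 9 | 1 | 1 | 0 | ... | 0 | 0 | 1 | 37000.0 | 1 | 3000 | 0 | 0 | 0 | 1 |
| 19 | 0 | 0 | 0 | 750 | 1 | 0 | 5 | 0 | 1 | 1 | ... | 0 | 0 | 18 | 34500.0 | 1 | 2700 | 0 | 1 | 1 | 0 |
| 28 | 0 | 0 | 1 | 750 | 0 | 0 | 4 | 0 | 1 | 1 | ... | 0 | 0 | 18 | 12000.0 | 0 | 3000 | 0 | 0 | 0 | 0 |
| 40 | 0 | 0 | 0 | 750 | 5 | 0 | 10 | 1 | 5 | 0 | ... | 0 | 0 | 15 | 14000.0 | 1 | 2750 | 1 | 0 | 0 | 1 |
| 43 | 0 | 0 | 0 | 750 | 1 | 0 | 8 | 1 | 1 | 0 | ... | 0 | 0 | 3 | 35000.0 | 1 | 3000 | 0 | 1 | 1 | 0 |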
