# Fiddler examples have moved! [Deprecation Notice]

Dear user, thank you for using Fiddler products; we appreciate your time! We have moved the examples to a new GitHub repo, located at the following link:


***
# [New fiddler-examples repo](https://github.com/fiddler-labs/fiddler-examples)
***

# Fraud Detection Use Case Walkthrough

# Fraud detection management with Fiddler

Machine-learning-based fraud detection models have proven to be more effective than humans at detecting fraud. However, if left unattended, the performance of fraud detection models can degrade over time, leading to big losses for the company and dissatisfied customers.
The Fiddler MPM platform provides a variety of tools which can be used to monitor, explain, analyze, and improve the performance of your fraud detection model.


## Step 1: Model Setup on the Fiddler Platform

Please refer to our Quick Start Guide for a detailed walkthrough of how to set up the Fiddler platform with your data.

Please refer to our API documentation for advanced functionality and for access to these features through the API.

### 0. Imports

In [None]:
!pip install -q fiddler-client==1.0.2;

import numpy as np
import pandas as pd
import fiddler as fdl

print(f"Running client version {fdl.__version__}")

### 1. Connect to Fiddler

Before you can add your model to Fiddler, you'll need to connect using our API client.


---


**We need a few pieces of information to get started.**
1. The URL you're using to connect to Fiddler
2. Your organization ID
3. Your authorization token

The organization ID and authorization token can be obtained from the Fiddler platform under the 'Settings' section.

In [None]:
URL = ''
ORG_ID = ''
AUTH_TOKEN = ''

In [None]:
client = fdl.FiddlerApi(
    url=URL,
    org_id=ORG_ID,
    auth_token=AUTH_TOKEN
)
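
Hardcoding the authorization token in a notebook makes it easy to leak. Below is a minimal sketch of loading the three connection values from environment variables instead. The variable names `FIDDLER_URL`, `FIDDLER_ORG_ID`, and `FIDDLER_AUTH_TOKEN` and the helper `load_fiddler_settings` are illustrative assumptions, not something the Fiddler client defines.

```python
import os


def load_fiddler_settings(env=None):
    """Read connection settings from environment variables.

    The variable names below are hypothetical; pick whatever fits your setup.
    """
    env = os.environ if env is None else env
    names = {
        "url": "FIDDLER_URL",
        "org_id": "FIDDLER_ORG_ID",
        "auth_token": "FIDDLER_AUTH_TOKEN",
    }
    # Strip whitespace so that a value of only spaces counts as missing.
    settings = {key: env.get(var, "").strip() for key, var in names.items()}
    missing = [names[key] for key, value in settings.items() if not value]
    if missing:
        raise ValueError("Missing connection settings: " + ", ".join(missing))
    return settings
```

The returned dictionary uses the same keys as the arguments in the cell above, so it could be passed along as `fdl.FiddlerApi(**load_fiddler_settings())`.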

### 2. Upload a baseline dataset

#### Create Project

In [None]:
PROJECT_ID = 'fraud_detection'
MODEL_ID = 'fraud_detection_model'
DATASET_ID = 'fraud_detection_data'

In [None]:
client.create_project(PROJECT_ID)

#### Baseline Data

In [None]:
PATH_TO_BASELINE_CSV = 'https://raw.githubusercontent.com/fiddler-labs/fiddler-samples/master/content_root/tutorial/business-use-cases/fraud-detection/baseline_data.csv'

baseline_df = pd.read_csv(PATH_TO_BASELINE_CSV)
baseline_df = baseline_df.head(10000)
baseline_df

Construct a `DatasetInfo` object to be used as a schema for keeping track of the **data types**, **data ranges**, and **unique values** of each column.

In [None]:
dataset_info = fdl.DatasetInfo.from_dataframe(baseline_df, max_inferred_cardinality=100)
dataset_info
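
To make concrete what kind of information such a schema holds, here is a small standard-library sketch that summarizes rows into a type plus either a numeric range or a set of unique values per column. It only illustrates the idea and is not how `fdl.DatasetInfo.from_dataframe` is implemented. The helper `summarize_columns` and the sample rows are made up, and `max_cardinality` merely mimics the role that `max_inferred_cardinality` appears to play.

```python
def summarize_columns(rows, max_cardinality=100):
    """Summarize a list of row dicts into a simple per-column schema."""
    schema = {}
    columns = rows[0].keys() if rows else []
    for col in columns:
        values = [row[col] for row in rows if row[col] is not None]
        is_numeric = bool(values) and all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )
        if is_numeric:
            # Numeric column: record the observed range.
            kind = "int" if all(isinstance(v, int) for v in values) else "float"
            schema[col] = {"type": kind, "min": min(values), "max": max(values)}
        else:
            unique = sorted({str(v) for v in values})
            if len(unique) <= max_cardinality:
                # Few distinct values: treat the column as categorical.
                schema[col] = {"type": "category", "values": unique}
            else:
                schema[col] = {"type": "string"}
    return schema


# Invented sample rows, for illustration only.
sample_rows = [
    {"category": "grocery_pos", "amt": 12.5, "age": 41},
    {"category": "travel", "amt": 230.0, "age": 29},
    {"category": "grocery_pos", "amt": 8.75, "age": 63},
]
schema = summarize_columns(sample_rows)
```

A schema like this is what later allows production values to be checked against the baseline: a number outside the recorded range or a category outside the recorded set can be flagged.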

#### Upload Data

In [None]:
client.upload_dataset(
    project_id=PROJECT_ID,
    dataset_id=DATASET_ID,
    dataset={
        'baseline': baseline_df
    },
    info=dataset_info
)

#### Create Model Info

In [None]:
# Specify task
model_task = 'binary'

if model_task == 'regression':
    model_task = fdl.ModelTask.REGRESSION
    
elif model_task == 'binary':
    model_task = fdl.ModelTask.BINARY_CLASSIFICATION

elif model_task == 'multiclass':
    model_task = fdl.ModelTask.MULTICLASS_CLASSIFICATION

    
# Specify column types
target = 'is_fraud'
outputs = ['pred_is_fraud']
decision_cols = None
features = ['category', 
            'amt', 
            'gender', 
            'city_pop', 
            'trans_num', 
            'total_cc_amt', 
            'uniq_cat_card', 
            'time_diff_days', 
            'time_since_last_trx', 
            'uniq_merchant_card', 
            'age']

    
# Generate ModelInfo
model_info = fdl.ModelInfo.from_dataset_info(
    dataset_info=dataset_info,
    dataset_id=DATASET_ID,
    model_task=model_task,
    target=target,
    outputs=outputs,
    decision_cols=decision_cols,
    features=features
)
model_info
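
A mismatch between these column lists and the dataset is easy to introduce by hand. The sketch below, built around a hypothetical helper `check_model_columns`, confirms beforehand that the target, outputs, and features all exist in the baseline data and that no column is used twice.

```python
def check_model_columns(available, target, outputs, features):
    """Return a list of problems; an empty list means the columns look consistent."""
    available = set(available)
    requested = [target] + list(outputs) + list(features)
    problems = []
    # Every requested column must exist in the dataset.
    for col in requested:
        if col not in available:
            problems.append(f"column not in dataset: {col}")
    # No column should play two roles (e.g. target and feature).
    seen = set()
    for col in requested:
        if col in seen:
            problems.append(f"column used more than once: {col}")
        seen.add(col)
    return problems
```

With the variables from the cell above, the check could be run as `check_model_columns(baseline_df.columns, target, outputs, features)`.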

#### Add Model
This adds the model schema to Fiddler. It unlocks Monitoring and some Analyze tools in the platform.

In [None]:
client.add_model(
    project_id=PROJECT_ID,
    dataset_id=DATASET_ID,
    model_id=MODEL_ID,
    model_info=model_info
)

#### Add surrogate model
This creates a surrogate model on the platform and unlocks the XAI tools in Fiddler. It uses the model schema specified in the `add_model` call.

In [None]:
client.add_model_surrogate(
    project_id=PROJECT_ID,
    model_id=MODEL_ID,
)

#### Publish Production Events

In [None]:
path_to_batch = 'https://raw.githubusercontent.com/fiddler-labs/fiddler-samples/master/content_root/tutorial/business-use-cases/fraud-detection/production_data.csv'

prod_df = pd.read_csv(path_to_batch)

client.publish_events_batch(
    project_id=PROJECT_ID,
    model_id=MODEL_ID,
    batch_source=prod_df,
    timestamp_field='trans_date_trans_time'
)

### Check Data on Fiddler Platform

Once the model has been added, we can access it on the platform. The model landing page shows the model details we specified in the `ModelInfo` object, including:


1.   Model Task
2.   Model Description
3.   Data Info like Target column, Output column, Decision Columns etc.








<img src="images/gif/DatasetReady2.gif" width=900 height=600 />

## Troubleshoot your Fraud Detection Model with Fiddler

In the example below, we show how you can monitor the performance of your fraud detection model and, in case of a performance drop, take steps to mitigate it. We then suggest steps you can take to make sure similar issues do not impact your ML model going forward.

Overall, we will take the following troubleshooting steps:

1.   Monitor Drift for various features
2.   Monitor performance metrics associated with Fraud Detection like Recall, False-Positive Rate 
3.   Monitor Data Integrity Issues like range violations
4.   Provide point explanations for the misclassified points
5.   Get to the root cause of the issues










### Check Monitoring Panel on Fiddler Platform

#### Data Drift

Once the production events are published, we can monitor drift in the model output, `pred_is_fraud` (the predicted probability that a case is fraud), in the ‘Data Drift’ tab. Here we can see that the value of `pred_is_fraud` increased from February 15 to February 16.


<img src="images/gif/MonitorDrift2.gif" width=900 height=600 />
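
Drift of this kind is typically quantified by comparing the distribution of production scores with the baseline distribution. The sketch below uses Jensen-Shannon divergence over equal-width histogram bins as one common choice. That choice is an assumption for illustration, not a statement of which metric the platform computes, and the scores are made up.

```python
import math


def histogram(scores, bins=10):
    """Bin probabilities in [0, 1] into equal-width bins; return relative frequencies."""
    if not scores:
        raise ValueError("scores must not be empty")
    counts = [0] * bins
    for s in scores:
        index = min(int(s * bins), bins - 1)  # s == 1.0 falls into the last bin
        counts[index] += 1
    return [c / len(scores) for c in counts]


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two discrete distributions."""
    def kl(a, b):
        # Terms with a == 0 contribute nothing; b is the mixture, so b > 0 whenever a > 0.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    mixture = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)


# Invented prediction scores, for illustration only.
baseline_scores = [0.02, 0.05, 0.07, 0.11, 0.03, 0.92, 0.04, 0.08]
production_scores = [0.45, 0.62, 0.07, 0.71, 0.55, 0.93, 0.66, 0.58]

drift = js_divergence(histogram(baseline_scores), histogram(production_scores))
```

With base-2 logarithms the value lies between 0 (identical distributions) and 1 (no overlap), which makes it convenient to track per time bin.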

### Monitor Performance Metrics

Next, to check whether performance has degraded, we can look at the performance metrics in the ‘Performance’ tab. Here we monitor the model's ‘Recall’ and ‘FPR’ (false-positive rate). We can see that recall has gone down and FPR has gone up over the same period.

<img src="images/gif/ModelPerformance2.gif" width=900 height=600 />

<img src="images/png/ModelPerformance1.png" width=900 height=600 />
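
For reference, both metrics can be computed directly from ground-truth labels and predicted probabilities. The following is a minimal sketch, assuming a decision threshold of 0.5 and using made-up labels and scores.

```python
def recall_and_fpr(labels, scores, threshold=0.5):
    """Compute recall (true-positive rate) and false-positive rate for binary labels."""
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted = score >= threshold  # assumption: scores at the threshold count as fraud
        if label and predicted:
            tp += 1
        elif label and not predicted:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    return recall, fpr


# Invented labels (1 = fraud) and predicted probabilities.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.91, 0.40, 0.77, 0.10, 0.65, 0.05, 0.20, 0.55]
recall, fpr = recall_and_fpr(labels, scores)
```

With these toy values, two of the three fraud cases are caught (recall of about 0.67) and two of the five legitimate cases are flagged (FPR of 0.4). A falling recall means more fraud slips through, while a rising FPR means more legitimate transactions are flagged.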

### Data Integrity

The performance drop could be due to a change in the quality of the data. To check this, we can go to the ‘Data Integrity’ tab and look for Missing Value Violations, Type Violations, Range Violations, etc. We can see that the ‘Category’ column suffers from range violations. Since this is a categorical column, it is likely that it contains a new value which the model did not encounter during training.

<img src="images/gif/DataIntegrity2.gif" width=900 height=600 />

<img src="images/png/DataIntegrity1.png" width=900 height=600 />
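
For a categorical column, a range violation amounts to a value outside the set seen in the baseline. The sketch below uses a hypothetical helper `range_violation_rate` and toy values to show how the share of violating rows could be measured.

```python
def range_violation_rate(baseline_values, production_values):
    """Return the fraction of production values unseen in the baseline, plus those values."""
    allowed = set(baseline_values)
    violations = [v for v in production_values if v not in allowed]
    rate = len(violations) / len(production_values) if production_values else 0.0
    return rate, sorted(set(violations))


# Toy category values, for illustration only.
baseline_categories = ["travel", "home", "food_dining", "travel", "home"]
production_categories = ["travel", "insurance", "home", "insurance", "food_dining"]

rate, unseen = range_violation_rate(baseline_categories, production_categories)
```

Here two of the five production rows carry a value that is absent from the baseline, so the violation rate is 0.4 and `unseen` lists the offending value.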

### Check the impact of drift

We can go back to the ‘Data Drift’ tab to measure how much the data integrity issue has impacted the prediction. We can select the bin in which the drift increased. The table below shows the Feature Impact, Feature Drift, and Prediction Drift Impact values for the selected bin. We can see that even though the Feature Impact of ‘Category’ is lower than that of ‘Amt’ (Amount), its Prediction Drift Impact is higher because of the drift.


<img src="images/gif/DriftImpact2.gif" width=900 height=600 />

<img src="images/png/DriftImpact1.png" width=900 height=600 />
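
The interplay described above can be sketched numerically. The example assumes, purely for illustration, that Prediction Drift Impact grows with the product of Feature Impact and Feature Drift. The helper `drift_impact_shares` and all numbers are hypothetical and are not read from the screenshots.

```python
def drift_impact_shares(feature_impact, feature_drift):
    """Normalize impact * drift per feature so that the shares sum to 1."""
    raw = {name: feature_impact[name] * feature_drift[name] for name in feature_impact}
    total = sum(raw.values())
    if total == 0:
        return {name: 0.0 for name in raw}
    return {name: value / total for name, value in raw.items()}


# Hypothetical numbers, not taken from the walkthrough's screenshots.
feature_impact = {"amt": 0.40, "category": 0.15, "age": 0.05}
feature_drift = {"amt": 0.01, "category": 0.30, "age": 0.02}

shares = drift_impact_shares(feature_impact, feature_drift)
most_responsible = max(shares, key=shares.get)
```

Under this assumption, a feature with a modest impact can still dominate the prediction drift when its own distribution has shifted far more than that of the high-impact features.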

We will now check the difference between the production and baseline data for this bin. To do so, we can click on ‘Export bin and feature to Analyze’, which takes us to the Analyze tab.

### Root Cause Analysis in the ‘Analyze’ tab
The Analyze tab pre-populates the left side of the tab with a query based on our selection. We can also write custom queries to slice the data for analysis.


<img src="images/gif/RCA2.gif" width=900 height=600 />

<img src="images/png/RCA3.png" width=900 height=600 />

On the right-hand side of the tab, we can build charts on the tabular data returned by our custom query. For this RCA, we will build a ‘Feature Distribution’ chart on the ‘Category’ column to check its distinct values and the percentage of each value. We can see there are 15 distinct values along with their percentages.

<img src="images/png/RCA4.png" width=600 height=600 />

Next, we will compare the Feature Distribution chart for the production data with the one for the baseline data to track down the data integrity violation. We can modify the query to select the baseline data and produce a ‘Feature Distribution’ chart for it as well.

<img src="images/png/RCA5.png" width=600 height=600 />
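
Outside the UI, the same comparison can be sketched with the standard library: compute the percentage share of each value in both datasets and rank the values by how much their share changed. The helper names and the toy values are assumptions for illustration.

```python
from collections import Counter


def value_percentages(values):
    """Map each distinct value to its percentage share of the column."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: 100.0 * count / total for value, count in counts.items()}


def compare_distributions(baseline_values, production_values):
    """Return (value, baseline %, production %, change) rows, largest change first."""
    base = value_percentages(baseline_values)
    prod = value_percentages(production_values)
    rows = []
    for value in sorted(set(base) | set(prod)):
        b = base.get(value, 0.0)  # 0.0 when the value never occurs in the baseline
        p = prod.get(value, 0.0)
        rows.append((value, b, p, p - b))
    rows.sort(key=lambda row: abs(row[3]), reverse=True)
    return rows


# Toy category values, for illustration only.
baseline_categories = ["travel", "travel", "home", "home"]
production_categories = ["travel", "home", "insurance", "insurance"]

comparison = compare_distributions(baseline_categories, production_categories)
```

A value whose baseline share is exactly 0 never occurred in the baseline at all, which is the signature of a new category appearing in production.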

We can see that the baseline data has just 14 unique values and that ‘insurance’ is not among them. This ‘Category’ value wasn’t present in the training data and crept into the production data, likely causing the performance degradation.
Next, we can perform a ‘point explanation’ for one such case, where the ‘Category’ value was ‘insurance’ and the prediction was incorrect, to measure how much the ‘Category’ column contributed to the prediction by looking at its SHAP value.


<img src="images/png/RCA6.png" width=900 height=600 />

We can click on the bulb icon beside the row to produce a point explanation. If we look at example <number>, we can see that the output probability value was <val> (predicted as fraud according to the threshold of 0.5) but the actual value was ‘not fraud’.

The bulb icon takes us to the ‘Explain’ tab. Here we can see that the ‘category’ value contributed to the model predicting the case as ‘fraud’.


<img src="images/png/RCA7.png" width=900 height=600 />
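
To illustrate what a SHAP value expresses, the sketch below computes exact Shapley values for a toy scoring function. It averages each feature's marginal contribution over all subsets of the other features, with "absent" features replaced by a reference value. The model `toy_fraud_score`, its weights, the reference row, and the explained row are all invented. The platform's explainer works on the actual or surrogate model and may use a different algorithm.

```python
from itertools import combinations
from math import factorial


def toy_fraud_score(row):
    """Invented additive scoring function; NOT the model from this walkthrough."""
    score = 0.05
    score += 0.40 if row["category"] == "insurance" else 0.0
    score += 0.001 * row["amt"]
    score += 0.002 * row["age"]
    return score


def shapley_values(model, instance, reference):
    """Exact Shapley values of `instance` relative to `reference` (small feature counts only)."""
    features = list(instance)
    n = len(features)

    def value(subset):
        # Features in `subset` take the instance's value, the rest the reference value.
        row = {f: (instance[f] if f in subset else reference[f]) for f in features}
        return model(row)

    contributions = {}
    for feature in features:
        others = [f for f in features if f != feature]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                gain = value(set(subset) | {feature}) - value(set(subset))
                total += weight * gain
        contributions[feature] = total
    return contributions


# Invented rows: a typical reference transaction and the transaction being explained.
reference = {"category": "travel", "amt": 50.0, "age": 40}
instance = {"category": "insurance", "amt": 120.0, "age": 40}

phi = shapley_values(toy_fraud_score, instance, reference)
```

The contributions sum to the gap between the instance's score and the reference score, and in this toy setup ‘category’ accounts for most of that gap. This mirrors how a single unexpected feature value can push a prediction over the 0.5 threshold.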

### Actions
We discovered that the prediction drift and performance drop were due to the introduction of a new value in the ‘Category’ column. We can take steps to identify this kind of issue in the future before it has a business impact.

#### Setting up Alerts
In the ‘Analyze’ tab, we can set up alerts to notify us as soon as a certain data issue occurs. For example, for the case we discussed, we can set up an alert, as shown below, that notifies us when range violations increase beyond a certain threshold (e.g., 5%).


These alerts can also inform retraining of the ML model: we can retrain the model on data that includes the new values, so that the newly trained model has seen the ‘insurance’ category value. This should result in improved performance.


<img src="images/gif/Alert2.gif" width=900 height=600 />
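
The threshold logic behind such an alert can be sketched as follows. The helper `days_breaching_threshold`, the day labels, and the event counts are invented for illustration. Only the 5% threshold comes from the example above.

```python
def days_breaching_threshold(daily_counts, threshold=0.05):
    """Return the days on which the share of violating events exceeds the threshold.

    daily_counts maps a day label to a (violating_events, total_events) pair.
    """
    breaches = []
    for day, (violations, total) in sorted(daily_counts.items()):
        # Skip days without events to avoid dividing by zero.
        if total and violations / total > threshold:
            breaches.append(day)
    return breaches


# Invented counts, for illustration only.
daily_counts = {
    "02-14": (0, 1200),
    "02-15": (12, 1150),
    "02-16": (140, 1300),
}
alerts = days_breaching_threshold(daily_counts)
```

Only the last day exceeds the 5% threshold here. A lower threshold would also flag the smaller rise on the day before, at the cost of noisier alerts.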

## Conclusion
Undetected fraud cases can lead to losses for the company and its customers, not to mention damage to its reputation and customer relationships. Fiddler’s MPM platform can be used to identify the pitfalls in your ML model and mitigate them before they have an impact on your business.
In this walkthrough, we investigated one such issue with a fraud detection model, where a data integrity issue caused the performance of the ML model to drop.
Fiddler can be used to keep your fraud detection model healthy by:

1.   Monitoring the drift of the performance metric
2.   Monitoring various performance metrics associated with the model
3.   Monitoring data integrity issues which could harm the model performance
4.   Investigating the features which have drifted or been compromised and analyzing them to mitigate the issue
5.   Performing a Root Cause Analysis to identify the exact cause and fixing it
6.   Diving into point explanations to identify how much the issue has an impact on a particular data point
7.   Setting up alerts to make sure the issue does not happen again

We discovered an issue with the ‘Category’ column, wherein a new value appeared in the production data. This led to a drop in model performance, likely due to the range violation. We suggest two steps to mitigate this issue:

1.   Setting up alerts to identify similar data integrity issues
2.   Retraining the ML model after including the new data (with the ground truth labels) to teach the model the new values




