# Fiddler Quickstart notebook for a Class Imbalance Example

Many ML use cases, such as fraud detection and facial recognition, suffer from what is known as the class imbalance problem. This problem arises when the vast majority of the inferences seen by the model belong to a single class, known as the majority class. This makes detecting drift in the minority class very difficult, because the "signal" is completely outweighed by the sheer number of inferences in the majority class. This notebook showcases how Fiddler uses a class weighting parameter to deal with this problem. It onboards two identical models -- one without class imbalance weighting and one with class imbalance weighting -- to illustrate how drift signals in the minority class are easier to detect once they are amplified by Fiddler's class weighting approach.

1. Connect to Fiddler
2. Upload a baseline dataset for a fraud detection use case
3. Onboard two fraud models to Fiddler -- one with class weighting and one without
4. Publish production events to both models with synthetic drift in the minority class
5. Get Insights -- compare the two onboarding approaches in Fiddler

## 0. Imports

In [None]:
!pip install -q fiddler-client;

import numpy as np
import pandas as pd
import fiddler as fdl
import sklearn.utils.class_weight  # load the submodule used later to compute class weights
import datetime
import time

print(f"Running client version {fdl.__version__}")

RANDOM_STATE = 42

## 1. Connect to Fiddler

In [None]:
URL = ''  # Make sure to include the full URL (including https://).
TOKEN = ''

In [None]:
fdl.init(
    url=URL,
    token=TOKEN
)

In [None]:
PROJECT_NAME = 'imbalance_cc_fraud'

project = fdl.Project(
    name=PROJECT_NAME
)

project.create()

## 2. Upload a baseline dataset for a fraud detection use case

In [None]:
PATH_TO_SAMPLE_CSV = 'https://raw.githubusercontent.com/fiddler-labs/fiddler-examples/main/quickstart/data/v3/imbalance_data_sample.csv'

sample_df = pd.read_csv(PATH_TO_SAMPLE_CSV)
sample_df.head()

In [None]:
print(sample_df['Class'].value_counts())
print('Percentage of minority class: {}%'.format(round(sample_df['Class'].value_counts()[1]*100/sample_df.shape[0], 4)))

## 3. Onboard two fraud models to Fiddler -- one with class weighting and one without

Now, we will add two models:
1. Without class weight parameters
2. With class weight parameters

Below, we first create a `ModelSpec` object and then onboard (add) the two models to Fiddler -- the first with class weights left undefined, the second with class weights defined.

In [None]:
model_spec = fdl.ModelSpec(
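    # Build the input list in column order, excluding target, output, and metadata columns
    inputs=[col for col in sample_df.columns if col not in ('Class', 'prediction_score', 'timestamp')],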
    outputs=['prediction_score'],
    targets=['Class'],
    metadata=['timestamp']
)

In [None]:
timestamp_column = 'timestamp'

In [None]:
model_task = fdl.ModelTask.BINARY_CLASSIFICATION

task_params_weighted = fdl.ModelTaskParams(
    target_class_order=[0, 1],
    binary_classification_threshold=0.4,
    class_weights=sklearn.utils.class_weight.compute_class_weight(class_weight='balanced',
        classes=np.unique(sample_df['Class']),
        y=sample_df['Class']).tolist()
)

task_params_unweighted = fdl.ModelTaskParams(
    target_class_order=[0, 1],
    binary_classification_threshold=0.4,
)
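
For intuition, the `'balanced'` option used above follows the formula documented by scikit-learn: each class receives the weight `n_samples / (n_classes * count_of_that_class)`. The sketch below reproduces that formula with NumPy on a made-up label array. The 990/10 split is an assumption chosen for illustration and is not the sample dataset.

```python
import numpy as np

# Hypothetical labels: 990 majority (0) and 10 minority (1) examples.
labels = np.array([0] * 990 + [1] * 10)

# 'balanced' weighting: n_samples / (n_classes * per-class count).
class_counts = np.bincount(labels)
balanced_weights = len(labels) / (len(class_counts) * class_counts)

print(balanced_weights.tolist())
```

The rarer class receives the larger weight, so each minority event counts for more wherever the weights are applied.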

In [None]:
MODEL_NAMES = ['imbalance_cc_fraud', 'imbalance_cc_fraud_weighted']

for model_name in MODEL_NAMES:
    model = fdl.Model.from_data(
        name=model_name,
        project_id=project.id,
        source=sample_df,
        spec=model_spec,
        task=model_task,
        task_params=task_params_unweighted if model_name == 'imbalance_cc_fraud' else task_params_weighted,
        event_ts_col=timestamp_column
    )

    model.create()

## 4. Publish production events to both models with synthetic drift in the minority class

In [None]:
PATH_TO_EVENTS_CSV = 'https://raw.githubusercontent.com/fiddler-labs/fiddler-examples/main/quickstart/data/v3/imbalance_production_data.csv'

production_df = pd.read_csv(PATH_TO_EVENTS_CSV)

# Shift the timestamps of the production events to be as recent as today
production_df['timestamp'] = production_df['timestamp'] + (int(time.time() * 1000) - production_df['timestamp'].max())
production_df.head()
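
The timestamp shift above moves every event by the same offset, so the most recent event lands at the current time while the spacing between events is preserved. A minimal sketch on toy data follows. It assumes the `timestamp` column holds epoch milliseconds, which the `time.time() * 1000` expression suggests; the helper name `shift_to_now` and the toy values are hypothetical.

```python
import time

import pandas as pd


def shift_to_now(timestamps_ms: pd.Series, now_ms: int) -> pd.Series:
    """Shift epoch-millisecond timestamps so the latest one equals now_ms."""
    return timestamps_ms + (now_ms - timestamps_ms.max())


# Hypothetical event times, one hour apart.
toy_ts = pd.Series([1_600_000_000_000, 1_600_003_600_000, 1_600_007_200_000])
now_ms = int(time.time() * 1000)
shifted = shift_to_now(toy_ts, now_ms)

print(shifted.max() == now_ms)           # latest event is now "current"
print(shifted.diff().dropna().tolist())  # gaps between events are unchanged
```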

In [None]:
print('Percentage of minority class: {}%'.format(round(production_df['Class'].value_counts()[1]*100/production_df.shape[0], 4)))

We see that the share of the minority class in the production data is more than 3 times that of the baseline data. This should produce substantial drift in the predictions.
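
To make the comparison between baseline and production explicit, the two minority shares can be divided directly. The sketch below uses a hypothetical helper, `minority_rate`, and made-up label columns rather than the notebook's actual data.

```python
import pandas as pd


def minority_rate(labels: pd.Series, minority_label: int = 1) -> float:
    """Fraction of rows whose label equals minority_label."""
    return float((labels == minority_label).mean())


# Hypothetical label columns standing in for the baseline and production data.
toy_baseline = pd.Series([0] * 998 + [1] * 2)
toy_production = pd.Series([0] * 993 + [1] * 7)

ratio = minority_rate(toy_production) / minority_rate(toy_baseline)
print(f"Minority share grew by a factor of {ratio:.1f}")
```

Comparing against the label with a boolean mask also returns 0 rather than raising an error when the minority class is absent from a sample.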

We will now publish the same production/event data for both of the models -- the one with class weights and the one without class weights.

In [None]:
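for model_name in MODEL_NAMES:
    # Fetch each model by name so events go to both models, not just the last one created.
    model = fdl.Model.from_name(name=model_name, project_id=project.id)
    model.publish(production_df)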

## 5. Get Insights -- compare the two onboarding approaches in Fiddler

**You're all done!**


In the Fiddler UI, we can see that for the model without class weights defined, the output/prediction drift in the minority class is very hard to detect (`<=0.05`) because it is swamped by the overwhelming volume of events in the majority class. If we declare class weights, we see a higher drift, which is a more accurate representation of the production data, where the share of the minority class is roughly 3x that of the baseline.

<table>
    <tr>
        <td>
            <img src="https://raw.githubusercontent.com/fiddler-labs/fiddler-examples/main/quickstart/images/imabalance_data_1.png" />
        </td>
    </tr>
</table>
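
The amplification effect can be illustrated outside of Fiddler with a small sketch. The notebook does not show how Fiddler computes drift internally, so what follows is an assumption-laden illustration rather than Fiddler's implementation: it measures Jensen-Shannon divergence between the class distributions of made-up baseline and production labels, once with raw counts and once with counts scaled by `'balanced'` class weights derived from the baseline. All names and data here are hypothetical.

```python
import numpy as np


def class_distribution(labels, weights=None):
    """Normalized (optionally class-weighted) share of each class."""
    counts = np.bincount(labels, minlength=2).astype(float)
    if weights is not None:
        counts = counts * np.asarray(weights, dtype=float)
    return counts / counts.sum()


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence in bits between two discrete distributions."""
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical data: the minority share triples from 0.2% to 0.6%.
baseline = np.array([0] * 9980 + [1] * 20)
production = np.array([0] * 9940 + [1] * 60)

# 'balanced' class weights derived from the baseline labels.
counts = np.bincount(baseline)
weights = len(baseline) / (len(counts) * counts)

unweighted = jensen_shannon_divergence(
    class_distribution(baseline), class_distribution(production)
)
weighted = jensen_shannon_divergence(
    class_distribution(baseline, weights), class_distribution(production, weights)
)
print(f"unweighted JSD: {unweighted:.4f}, weighted JSD: {weighted:.4f}")
```

In this toy setting the same tripling of the minority share yields a far larger divergence once the weights are applied, which mirrors the qualitative difference between the two models in the UI.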



---


**Questions?**  
  
Check out [our docs](https://docs.fiddler.ai/) for a more detailed explanation of what Fiddler has to offer.

If you're still looking for answers, fill out a ticket on [our support page](https://fiddlerlabs.zendesk.com/) and we'll get back to you shortly.