# Fiddler Quick Start Guide

This guide walks you through the basic onboarding steps required to use Fiddler for production model monitoring and explainability. API documentation can be found [here](https://docs.fiddler.ai/api-reference/python-package/).

# Step Zero: Packages and Imports

To avoid missing imports later on, most package imports are collected in this section.

In [None]:
import fiddler as fdl
import pandas as pd
import pathlib
import shutil
import yaml

# Step One: Client Setup

First, we need to initialize the client object by specifying:
- The `url`: the Fiddler URL you have been given access to, usually of the form ‘XXXXX.fiddler.ai’. Contact us if you don’t have one.
- The `org_id`: an identifier for your organization's account. See Fiddler_URL/settings/general to find this ID (listed as "Organization ID").
<img src="images/org_id.png" width=800 height=800 />
- The `auth_token`: the token used to authenticate access. See Fiddler_URL/settings/credentials to find, create, or change this token.
<img src="images/auth_token.png" width=800 height=800 />

You can also save this config as a file called `fiddler.ini` in the same folder as the notebook/script. That saves you from specifying the parameters in every notebook and script.
<img src="images/fiddler_ini.png" width=800 height=800 />
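
Before constructing the client, it can help to confirm that the config file parses and contains all three settings. The following is a minimal sketch using only the standard library; the helper name `load_fiddler_config` and the sample values are hypothetical, and the check is not part of the Fiddler client itself.

```python
import configparser

REQUIRED_KEYS = ("url", "org_id", "auth_token")

def load_fiddler_config(text):
    """Parse fiddler.ini-style text and return the [FIDDLER] section as a dict.

    Hypothetical helper for illustration; raises ValueError when the
    section or one of the required keys is missing or empty.
    """
    # interpolation=None so that a '%' inside a token is read literally
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    if not parser.has_section("FIDDLER"):
        raise ValueError("missing [FIDDLER] section")
    section = dict(parser["FIDDLER"])
    missing = [key for key in REQUIRED_KEYS if not section.get(key)]
    if missing:
        raise ValueError(f"missing or empty keys: {missing}")
    return section

# Placeholder values, not real credentials
sample = """
[FIDDLER]
url = https://your-org.fiddler.ai
org_id = my_org
auth_token = abc123
"""

config = load_fiddler_config(sample)
print(sorted(config))
```

To check a file on disk instead, the same function can be called with the file's contents, for example `load_fiddler_config(pathlib.Path('fiddler.ini').read_text())`.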


In [None]:
!pip install fiddler-client==0.6.18;

In [None]:
%%writefile fiddler.ini

[FIDDLER]
url = https://your-org.fiddler.ai
org_id = [YOUR ORG ID]
auth_token = [YOUR ORG TOKEN]

In [None]:
import fiddler as fdl

# client = fdl.FiddlerApi(url=url, org_id=org_id, auth_token=auth_token)
client = fdl.FiddlerApi()

# Step Two: Create Project

Here we will create a project, a convenient container for housing the models and datasets associated with a given ML use case.

For this quick start, it is best to give `project_id` a unique name so that you can easily track your progress.

In [None]:
project_id = 'quickstart'

In [None]:
# Creating our project using project_id
if project_id not in client.list_projects():
    client.create_project(project_id)

# Step Three: Upload Baseline Data

Here we will upload the datasets that will serve as baselines for various product capabilities, including monitoring of model performance, prediction & feature drift, and data errors; generating prediction-level (point) and model-level (global) explanations; and calculating various bias metrics.

We recommend using the model's training set for the most faithful and actionable metrics. In addition to the model's features and labels, Fiddler requires a few additional attributes to unlock its full suite of capabilities:

*   Model predictions (Mandatory: serves as a baseline for prediction drift)
*   Model decisions (Optional: used to monitor model decisions over time, e.g. loan approved vs. denied. The data uploaded initially can be random)
*   Model metadata (Any additional fields relevant for model analysis. If you intend to use Fiddler to detect model bias, include any relevant protected attributes here, e.g. gender, race, age)

## Load dataset

Load the data you are going to use for training your model. For this tutorial, we will be using an auto insurance dataset that can be found [here](https://www.kaggle.com/somjee/auto-insurance-customerlifetimevalue?select=data.csv). 

**Note**: We are also adding a `high_value` field to act as our decision column.

In [None]:
# https://www.kaggle.com/somjee/auto-insurance-customerlifetimevalue?select=data.csv
df = pd.read_csv('/app/fiddler_samples/samples/datasets/auto_insurance/data.csv')
df = df.rename(columns={"State": "Location State"})
df.columns = [x.lower().replace(' ', '_') for x in df.columns]

# Adding a decision column to our data. In this case, we deem a 'high_value' customer as
# one with customer_lifetime_value >= 5000
df = df.assign(high_value=['Yes' if x >= 5000 else 'No' for x in df['customer_lifetime_value']])

df.head()
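
The column renaming and the decision rule above can be tried out without the dataset. The sketch below repeats the same two transformations in plain Python; the helper names are hypothetical, and the raw column names are illustrative examples of the kind of headers the CSV is assumed to contain.

```python
def normalize_column(name):
    """Lowercase a column name and replace spaces with underscores."""
    return name.lower().replace(' ', '_')

def high_value_label(customer_lifetime_value, threshold=5000):
    """Return 'Yes' for customers at or above the threshold, else 'No'."""
    return 'Yes' if customer_lifetime_value >= threshold else 'No'

# Illustrative raw headers (assumed, not read from the actual file)
raw_columns = ['Location State', 'Customer Lifetime Value', 'EmploymentStatus']
print([normalize_column(c) for c in raw_columns])

# The threshold is inclusive: exactly 5000 counts as high value
print(high_value_label(5000), high_value_label(4999.99))
```

Note that a header without spaces, such as `EmploymentStatus`, is only lowercased, which is why the feature list later refers to `employmentstatus` rather than `employment_status`.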

## Split Dataset into Train/Test

Now we will split our dataset into a train/test set to be used in training our model.

In [None]:
df_train = df.sample(frac=0.8,random_state=200)
df_test = df.drop(df_train.index)

## Upload dataset

In [None]:
dataset_id = 'auto_insurance'
dataset_id

Now, we will create a schema for our dataset, and upload the dataset to Fiddler. 

If the `dataset_id` was uploaded previously, we can fetch and use the schema from there.

In [None]:
# Retrieve dataset if already uploaded
if dataset_id in client.list_datasets(project_id):
    df_schema = client.get_dataset_info(project_id, dataset_id)
else:
    df_schema = fdl.DatasetInfo.from_dataframe(df, max_inferred_cardinality=1000)
    upload_result = client.upload_dataset(
        project_id=project_id,
        dataset={'train': df_train,
                 'test': df_test},
        dataset_id=dataset_id,
        info=df_schema)


df_schema

# Step Four: Register Model

## Create Model Schema

As you may have noticed, the dataset upload step did not ask for the model’s features and targets, or any other model-specific information. That’s because multiple models can be linked to a given dataset schema. We therefore need a separate model schema step, which tells Fiddler which features are relevant to the model and what the model task is. Here you can specify the input features, the target column, decision columns, and metadata columns, as well as the type of model.
- The model task can be inferred from the target column, or it can be set explicitly. Currently we support three model types:
    - Regression
    - Binary Classification
    - Multi-class Classification

In [None]:
model_id = 'ltv_regressor'
target = 'customer_lifetime_value'
continuous_features = ['income', 'monthly_premium_auto', 'months_since_last_claim', 'months_since_policy_inception',
                        'number_of_open_complaints', 'number_of_policies', 'total_claim_amount']
categorical_features = ['location_state', 'employmentstatus', 'policy_type', 'policy', 'vehicle_class','vehicle_size']

feature_columns = list(continuous_features + categorical_features)
metadata_cols = ['gender']
decision_cols = ['high_value']
outputs = ['predicted_customer_lifetime_value']

model_info = fdl.ModelInfo.from_dataset_info(
    dataset_info=client.get_dataset_info(project_id, dataset_id),
    target=target, 
    features=feature_columns,
    metadata_cols=metadata_cols,
    decision_cols=decision_cols,
    outputs=outputs,
    input_type=fdl.ModelInputType.TABULAR,
    model_task=fdl.ModelTask.REGRESSION,
    display_name='Gradient Boosting Regressor',
    description='this is a GradientBoostingRegressor model from the tutorial',
)

model_info
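
Because each column is assigned a role by hand, it is easy to list a column twice or to misspell one. The following sketch shows one possible sanity check in plain Python; `find_schema_problems` is a hypothetical helper, not a Fiddler API, and the assumption that every listed column should exist in the dataset is ours rather than a documented requirement.

```python
def find_schema_problems(dataset_columns, roles):
    """Return a list of problems; an empty list means the roles look consistent.

    dataset_columns: iterable of column names present in the dataset.
    roles: dict mapping a role name (e.g. 'features') to a list of columns.
    """
    problems = []
    first_role = {}
    available = set(dataset_columns)
    for role, columns in roles.items():
        for column in columns:
            if column in first_role:
                problems.append(
                    f"{column!r} appears under both {first_role[column]!r} and {role!r}")
            else:
                first_role[column] = role
            if column not in available:
                problems.append(f"{column!r} ({role}) is not a dataset column")
    return problems

# A shortened, illustrative version of the schema used in this tutorial
dataset_columns = ['income', 'gender', 'high_value',
                   'customer_lifetime_value', 'predicted_customer_lifetime_value']
roles = {
    'features': ['income'],
    'target': ['customer_lifetime_value'],
    'outputs': ['predicted_customer_lifetime_value'],
    'metadata': ['gender'],
    'decisions': ['high_value'],
}
problems = find_schema_problems(dataset_columns, roles)

# Deliberately broken variant: 'income' is reused and 'age' does not exist
bad_roles = dict(roles, metadata=['gender', 'income', 'age'])
bad_problems = find_schema_problems(dataset_columns, bad_roles)
print(problems, bad_problems, sep='\n')
```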

## Register model

In [None]:
# delete model if it is already available
client.delete_model(project_id, model_id, delete_prod=True)

# register model
client.register_model(project_id, model_id, dataset_id, model_info)

# Step Five: Simulate Monitoring Traffic

## Streaming data example

In this step, we will simulate traffic for model monitoring by using [publish_event](https://docs.fiddler.ai/api-reference/python-package/#publish-event). This is equivalent to running the model separately on some data and either sending the results to Fiddler immediately or saving them to a log and sending them at a later point.

For this demonstration, we will be going with a streaming approach. We will utilize a log containing rows with fields corresponding to:

- inputs 
- predictions
- labels (targets)
- decisions

We can find the fields that will be utilized by consulting our `ModelInfo` object:

```
ModelInfo:
      display_name: Gradient Boosting Regressor \
      description: this is a GradientBoostingRegressor model from the tutorial
      input_type: ModelInputType.TABULAR
      model_task: ModelTask.REGRESSION
-->   inputs:
                                   column     dtype count(possible_values)  \
        0                  location_state  CATEGORY                      5   
        1                employmentstatus  CATEGORY                      5   
        2                          income   INTEGER                          
        3            monthly_premium_auto   INTEGER                          
        4         months_since_last_claim   INTEGER                          
        5   months_since_policy_inception   INTEGER                          
        6       number_of_open_complaints   INTEGER                          
        7              number_of_policies   INTEGER                          
        8                     policy_type  CATEGORY                      3   
        9                          policy  CATEGORY                      9   
        10             total_claim_amount     FLOAT                          
        11                  vehicle_class  CATEGORY                      6   
        12                   vehicle_size  CATEGORY                      3                    
-->   outputs
                                      column  dtype count(possible_values)  \
        0  predicted_customer_lifetime_value  FLOAT                          

          is_nullable value_range  
        0       False       * - *  
      metadata:
           column     dtype  count(possible_values) is_nullable value_range
        0  gender  CATEGORY                       2       False            
-->   decisions:
               column     dtype  count(possible_values) is_nullable value_range
        0  high_value  CATEGORY                       2       False            
-->   targets: [Column(name="customer_lifetime_value", data_type=DataType.FLOAT, possible_values=None, 
                is_nullable=False, value_range_min=1898.007675, value_range_max=83325.38119)]
                  misc:{}
    
```



In [None]:
event_log = pd.read_csv('/app/fiddler_samples/samples/datasets/auto_insurance/event_log.csv')
event_log.head()

Now we will publish these rows as events. To most accurately simulate this as a time-series event, we will also be calling a function to generate a timestamp in the last 2 weeks. Real data will ideally have a timestamp related to when the event took place; otherwise, the current time will be used.

**Note**: The timestamp must be in UTC milliseconds. See [here](https://docs.fiddler.ai/api-reference/python-package/#publish-event) for more details
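
If your events carry `datetime` values rather than raw milliseconds, a conversion along the following lines may be useful. This is a minimal sketch using only the standard library; the helper names `to_epoch_ms` and `from_epoch_ms` are hypothetical. It uses integer arithmetic on `timedelta` objects to avoid floating-point rounding, and it rejects naive datetimes so that local time is never mistaken for UTC.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_epoch_ms(dt):
    """Convert a timezone-aware datetime to integer milliseconds since the epoch."""
    if dt.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return (dt - EPOCH) // timedelta(milliseconds=1)

def from_epoch_ms(ms):
    """Convert integer milliseconds since the epoch to an aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)

event_time = datetime(2021, 1, 31, 3, 32, 53, 142000, tzinfo=timezone.utc)
ms = to_epoch_ms(event_time)
print(ms, from_epoch_ms(ms).isoformat())
```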

In [None]:
import datetime
import time
from IPython.display import clear_output
from random import sample, randint
NUM_EVENTS_TO_SEND = 500

def getTimestampsFromPastTwoWeeks():
    """
    Generate a list of timestamps from the past two weeks. Timestamp is in 
    milliseconds since epoch in UTC.
    """
    TWO_WEEKS_MS = 604800 * 2 * 1000
    current_time_in_ms = round(time.time() * 1000)
    lower = current_time_in_ms - TWO_WEEKS_MS
    upper = current_time_in_ms
    length = NUM_EVENTS_TO_SEND
    timestamps = [lower + x*(upper-lower)/length for x in range(length)]

    return timestamps

        
# Convert this dataframe into a list of dictionary events, where each event is its own dictionary
event_list_dict = event_log.sample(n=NUM_EVENTS_TO_SEND).to_dict(orient='records') 
event_ms_time_stamps = getTimestampsFromPastTwoWeeks()

for ind, event_dict in enumerate(event_list_dict):
    event_ms_time_stamp = event_ms_time_stamps[ind]
    result = client.publish_event(project_id, model_id, event_dict, event_time_stamp=event_ms_time_stamp)
    
    clear_output(wait = True)
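    # Convert explicitly to UTC; fromtimestamp() without tz returns local time
    readable_timestamp = datetime.datetime.fromtimestamp(event_ms_time_stamp/1000.0, tz=datetime.timezone.utc)
    
    print(f'Sending {ind+1} / {NUM_EVENTS_TO_SEND} \n{readable_timestamp:%Y-%m-%d %H:%M:%S.%f} UTC: \n{event_dict}')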
    time.sleep(0.01)
    

**Note**: If labels are ingested at a later point, an event can be updated by calling:

- `res = client.publish_event(project_id, model_id, event, event_id=customer, update_event=True, event_time_stamp=row['__occurred_at'])`

When the `update_event` flag is set to `True`, the event identified by `event_id` is updated with whatever additional information you pass in through `event`, including a target label. Here, `customer` and `row['__occurred_at']` stand in for your own event identifier and event timestamp. See [here](https://docs.fiddler.ai/api-reference/python-package/#publish-event) for more details.

## Log Batch data example

Another option is to publish a batch of logs. Currently, we support batch publishing a **pandas DataFrame** or a **Parquet file hosted on an S3 bucket**. For this example, we will go with the second option.

This Parquet file contains rows with fields corresponding to:

- inputs 
- predictions
- labels (targets)
- decisions

We can find the fields that will be utilized by consulting our `ModelInfo` object (more info [here](#Streaming-data-example)).

For the purposes of this tutorial, we have a Parquet file uploaded to a public S3 bucket. While normally you would need to pass in an `auth_context` to access a private bucket, we do not require this step for the public S3 bucket. Commented code is left in to show how you would access a private S3 bucket.

Note: the Parquet file also contains a **timestamp** column that we explicitly pass to the function so that it accurately reflects the time at which our events occurred. This **timestamp** can take either of the following forms:
- `TIMESTAMP %Y-%m-%d %H:%M:%S.%f` (e.g. `2021-01-31 03:32:53.142000`)
- `EPOCH_TIME` (e.g. `1613087108`)
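
To see how the two forms relate, the sketch below parses each example value into a timezone-aware `datetime`. The helper `parse_event_time` is hypothetical and relies on two assumptions that the text above does not state: that the epoch value is in seconds (the ten-digit example suggests this) and that the string form is interpreted as UTC.

```python
from datetime import datetime, timezone

TIMESTAMP_FORMAT = '%Y-%m-%d %H:%M:%S.%f'

def parse_event_time(value):
    """Parse either timestamp form into an aware UTC datetime.

    Assumes numeric values are epoch seconds and strings are in UTC.
    """
    if isinstance(value, (int, float)):
        return datetime.fromtimestamp(value, tz=timezone.utc)
    return datetime.strptime(value, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)

from_string = parse_event_time('2021-01-31 03:32:53.142000')
from_epoch = parse_event_time(1613087108)
print(from_string.isoformat())
print(from_epoch.isoformat())
```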

**The Parquet file we will be using has pre-configured timestamps that cover the entirety of January 2021.**

In [None]:
"""
# For this tutorial, we will be accessing a public S3 bucket. To read private S3 buckets, the following
#  credentials dictionary will be needed
auth_context = {'aws_access_key_id': '___',
            'aws_secret_access_key': '___',
            'aws_session_token': '___'
            }
"""

In [None]:
# We'll only publish one file for this tutorial, but the list can be expanded to publish as many parquet
# files as desired
# This Parquet file has pre-configured timestamps that cover the entirety of January 2021.
par_files = ['s3://fiddler-ai-public/datasets/quick_start_events.parquet']

for par_file in par_files:
    client.publish_events_batch(
            project_id,
            model_id,
            par_file,
            # auth_context=auth_context, # If using a private S3 bucket, uncomment this line and complete above cell
            timestamp_field='timestamp'
            )

## Seeing Monitoring Traffic
We can now consult our Fiddler instance to visualize our monitoring results. We can see our newly created project within the Projects Overview section:

<img src="images/qs_projects.png" width=1000 height=1000 />

Within our project, we can click `gradient_boosting_regressor` to see the model we created. From there, we can see the traffic that reflects the events we sent by going to the Monitor section at the top:

<img src="images/qs_monitoring.png" width=1000 height=1000 />

For a walkthrough of the product, please consult our [Product Tour](https://docs.fiddler.ai/product-tour/).