# Fiddler Quick Start Guide

This guide walks you through the basic onboarding steps required to use Fiddler for production model monitoring and explainability. The API documentation is available [here](https://docs.fiddler.ai/api-reference/python-package/).

<img src="images/QS_flowchart.png" width=100 height=100 />

In [1]:
# Importing all the required packages at the beginning
import pandas as pd
import pathlib
import shutil
import yaml
import datetime
import time
from IPython.display import clear_output
from random import sample, randint

import fiddler as fdl

### Step Zero: Client Setup (Connecting to the URL)

First, we need to initialize the client object by specifying:
- The `url`: the URL of your Fiddler instance, usually of the form ‘XXXXX.fiddler.ai’. Contact us if you don’t have one.
- The `org_id`: the organization ID that identifies your account. See Fiddler_URL/settings/general to find this ID (listed as "Organization ID").
<img src="images/org_id.png" width=800 height=800 />
- The `auth_token`: this token is used to authenticate access. See Fiddler_URL/settings/credentials to find, create, or change this token
<img src="images/auth_token.png" width=800 height=800 />

You can also save this config as a file called `fiddler.ini` in the same folder as the notebook/script. That saves you from specifying the parameters in every notebook and script.


In [None]:
%%writefile fiddler.ini

[FIDDLER]
url = https://trial.fiddler.ai
org_id = company_name
auth_token = 43_character_length_token

In [2]:
#client = fdl.FiddlerApi(url='https://trial.fiddler.ai', org_id='your_org_id', auth_token="your_auth_token")
client = fdl.FiddlerApi()
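
If the client cannot pick up its settings, it can help to confirm that the config file parses and contains the three expected keys. The following is a minimal sketch using only the standard library; `check_fiddler_config` is a hypothetical helper, not part of the `fiddler` package, and it parses a string so that it runs without a file on disk.

```python
import configparser

REQUIRED_KEYS = ("url", "org_id", "auth_token")


def check_fiddler_config(text, section="FIDDLER"):
    """Return the required keys that are missing or empty in the given INI text.

    Hypothetical helper for illustration; not part of the fiddler package.
    """
    parser = configparser.ConfigParser()
    parser.read_string(text)
    if not parser.has_section(section):
        return list(REQUIRED_KEYS)
    return [
        key
        for key in REQUIRED_KEYS
        if not parser.get(section, key, fallback="").strip()
    ]


example = """
[FIDDLER]
url = https://trial.fiddler.ai
org_id = company_name
auth_token = 43_character_length_token
"""
missing = check_fiddler_config(example)
print(missing)
```

To check the real file, one could pass `pathlib.Path('fiddler.ini').read_text()` instead of the example string. An empty list means all three keys are present.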

### Step One: Create Project

Here we will create a project, a convenient container for housing the models and datasets associated with a given ML use case.

In [3]:
project_id = 'quickstart' #This must only contain lowercase letters, numbers or underscore
client.create_project(project_id)

{'project_name': 'quickstart'}
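
The naming rule in the comment above (lowercase letters, numbers, or underscore) can be checked locally before calling the API. This is a small sketch of such a check; `is_valid_id` is a hypothetical helper, and the exact validation Fiddler performs on the server may differ.

```python
import re

# Lowercase letters, digits, and underscores only, at least one character.
_ID_PATTERN = re.compile(r"[a-z0-9_]+")


def is_valid_id(identifier):
    """Return True if the identifier follows the naming rule stated in this guide."""
    return isinstance(identifier, str) and _ID_PATTERN.fullmatch(identifier) is not None


for candidate in ("quickstart", "auto_insurance", "Quick-Start", ""):
    print(repr(candidate), is_valid_id(candidate))
```

The same rule is stated for `dataset_id` and `model_id` later in this guide, so one helper can cover all three.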

### Step Two: Upload Baseline Data

Here we will upload the dataset (training data or a representative sample of it) that will serve as the baseline for various product capabilities, including monitoring and explainability of the models. For this tutorial, we will be using a cleaned version of an auto insurance dataset that can be found [here](https://www.kaggle.com/somjee/auto-insurance-customerlifetimevalue?select=data.csv). We are predicting whether or not a customer will be high value.


In [4]:
df = pd.read_csv("../samples/datasets/auto_insurance/data_cleaned.csv")
df.head()

Unnamed: 0,location_state,response,coverage,education,effective_to_date,employmentstatus,gender,income,location_code,marital_status,...,policy_type,policy,renew_offer_type,sales_channel,total_claim_amount,vehicle_class,vehicle_size,high_value,Campaign_A,probability_high_value
0,Washington,No,Basic,Bachelor,2/24/11,Employed,F,56274,Suburban,Married,...,Corporate Auto,Corporate L3,Offer1,Agent,384.811147,Two-Door Car,Medsize,0,No,0.00871
1,Arizona,No,Extended,Bachelor,1/31/11,Unemployed,F,0,Suburban,Single,...,Personal Auto,Personal L3,Offer3,Agent,1131.464935,Four-Door Car,Medsize,1,Yes,0.992777
2,Nevada,No,Premium,Bachelor,2/19/11,Employed,F,48767,Suburban,Married,...,Personal Auto,Personal L3,Offer1,Agent,566.472247,Two-Door Car,Medsize,1,Yes,0.996348
3,California,No,Basic,Bachelor,1/20/11,Unemployed,M,0,Suburban,Married,...,Corporate Auto,Corporate L2,Offer1,Call Center,529.881344,SUV,Medsize,1,Yes,0.993976
4,Washington,No,Basic,Bachelor,2/3/11,Employed,M,43836,Rural,Single,...,Personal Auto,Personal L1,Offer1,Agent,138.130879,Four-Door Car,Medsize,0,No,0.008389


In [5]:
# Uploading the dataset
dataset_id = 'auto_insurance' #This must only contain lowercase letters, numbers or underscore
client.upload_dataset(
    project_id=project_id,
    dataset_id=dataset_id,
    dataset={'data': df},
    info=fdl.DatasetInfo.from_dataframe(df, max_inferred_cardinality=1000),
)

Uploading the dataset auto_insurance ...


{'row_count': 9134,
 'col_count': 25,
 'log': ['Importing dataset auto_insurance',
  'Creating table for auto_insurance',
  'Importing data file: data.csv']}
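
The upload call above passes `max_inferred_cardinality=1000` when inferring the dataset schema. As a rough illustration of the idea, and assuming the parameter caps how many distinct values a non-numeric column may have and still be treated as categorical, the sketch below classifies columns by counting distinct values. The helper, the rule, and the `customer_note` column are illustrative assumptions, not Fiddler's implementation or part of the dataset.

```python
def infer_column_kind(values, max_cardinality=1000):
    """Classify a column as 'numeric', 'category', or 'string'.

    Illustrative rule only: numeric if every non-null value is a number,
    otherwise categorical when the distinct-value count is within the cap.
    """
    non_null = [v for v in values if v is not None]
    is_number = lambda v: isinstance(v, (int, float)) and not isinstance(v, bool)
    if non_null and all(is_number(v) for v in non_null):
        return "numeric"
    distinct = {str(v) for v in non_null}
    return "category" if len(distinct) <= max_cardinality else "string"


columns = {
    "gender": ["F", "F", "M", "M"],
    "income": [56274, 0, 48767, 0],
    "customer_note": [f"note {i}" for i in range(5)],  # hypothetical free-text column
}
kinds = {name: infer_column_kind(vals, max_cardinality=3) for name, vals in columns.items()}
print(kinds)
```

A small cap is used here only so that the toy free-text column exceeds it.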

### Step Three: Register Model

#### Create Model Schema

As you may have noticed, the dataset upload step did not ask for the model’s features and targets, or any other model-specific information. That’s because multiple models can be linked to a given dataset schema. We therefore need a separate step that creates the model schema, which tells Fiddler which features are relevant to the model and what the model task is. Here you can specify the input features, the target column, decision columns and metadata columns, and also the type of model.
- The model task can be inferred from the target column, or it can be set explicitly. Currently we support three model types:
    - Regression
    - Binary Classification
    - Multi-class Classification
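
To make the idea of inferring a task from the target column concrete, here is a toy heuristic. It is a hypothetical sketch, not the logic Fiddler uses; the function name and the `max_classes` threshold are assumptions made for illustration.

```python
def guess_model_task(target_values, max_classes=20):
    """Guess a model task from target values (illustrative heuristic only)."""
    distinct = set(target_values)
    if len(distinct) < 2:
        raise ValueError("target needs at least two distinct values")
    if len(distinct) == 2:
        return "binary_classification"
    all_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in distinct
    )
    # Many distinct numeric values suggest a continuous target.
    if all_numeric and len(distinct) > max_classes:
        return "regression"
    return "multiclass_classification"


print(guess_model_task([0, 1, 1, 0]))
print(guess_model_task([x * 0.5 for x in range(100)]))
print(guess_model_task(["low", "medium", "high"]))
```

Heuristics like this can guess wrong (for example, an integer rating from 1 to 5), which is one reason to set the task explicitly.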

In [6]:
model_id = 'high_value_clf' #This must only contain lowercase letters, numbers or underscore


outputs = ['probability_high_value'] # Output of our model
target = 'high_value' # we're predicting whether the customer is high value (1) or not (0)
decision_cols = ['Campaign_A'] # Based on the predicted high_value - should we send the customer this campaign
input_features = df.drop(['probability_high_value', 'high_value','Campaign_A'], axis = 1).columns

model_info = fdl.ModelInfo.from_dataset_info(
    dataset_info=client.get_dataset_info(project_id, dataset_id),
    features = input_features,
    target=target, 
    decision_cols=decision_cols,
    outputs=outputs,
    input_type=fdl.ModelInputType.TABULAR,
    model_task=fdl.ModelTask.BINARY_CLASSIFICATION,
    display_name='High value prediction model',
    description='This is a Binary Classification model from the tutorial',
)

model_info

ModelInfo:
  display_name: High value prediction model
  description: This is a Binary Classification model from the tutorial
  input_type: ModelInputType.TABULAR
  model_task: ModelTask.BINARY_CLASSIFICATION
  preferred_explanation: None
  custom_explanation_names: []
  inputs:
                               column     dtype count(possible_values)  \
    0                  location_state  CATEGORY                      5   
    1                        response  CATEGORY                      2   
    2                        coverage  CATEGORY                      3   
    3                       education  CATEGORY                      5   
    4               effective_to_date  CATEGORY                     59   
    5                employmentstatus  CATEGORY                      5   
    6                          gender  CATEGORY                      2   
    7                          income   INTEGER                          
    8                   location_code  CATEGORY       

#### Register model

In [7]:
# register model
client.register_model(project_id, model_id, dataset_id, model_info)

Loading dataset info ...
Validating model info ...
Generating model ...
Running tests ...
All tests passed ..
Model output provided


'Model successfully registered on Fiddler. \n Visit https://trial.fiddler.ai/projects/quickstart '

### Step Four: Simulate Monitoring Traffic

#### Streaming data example

In this step, we will simulate traffic and monitor it using [publish_event](https://docs.fiddler.ai/api-reference/python-package/#publish-event). This is equivalent to running our model separately on some data and either sending the results to Fiddler immediately (streaming approach) or saving them to a log and sending them at a later point (batch upload of the logs).
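
The log-then-send pattern can be sketched without any Fiddler dependency. In the example below, events are appended to a JSON Lines log and later replayed through a `publish` callback. The callback, the helper names, and the `ts`/`event` record layout are assumptions made for illustration, and an in-memory buffer stands in for a log file.

```python
import io
import json


def append_event(log_file, event, timestamp_ms):
    """Append one event and its timestamp to a JSON Lines log."""
    log_file.write(json.dumps({"ts": timestamp_ms, "event": event}) + "\n")


def replay_events(log_file, publish):
    """Send every logged event through `publish` and return how many were sent."""
    count = 0
    for line in log_file:
        if not line.strip():
            continue
        record = json.loads(line)
        publish(record["event"], record["ts"])
        count += 1
    return count


log = io.StringIO()  # stands in for a file opened in append mode
append_event(log, {"income": 56274, "probability_high_value": 0.00871}, 1620333609912)
append_event(log, {"income": 0, "probability_high_value": 0.992777}, 1620333909912)

log.seek(0)
sent = []
replayed = replay_events(log, lambda event, ts: sent.append((ts, event)))
print(replayed, sent[0][0])
```

With a real client, `publish` could be a thin wrapper around the `client.publish_event` call used later in this step.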

For this demonstration, we will be going with a streaming approach. We will utilize a log containing rows with fields corresponding to:

- inputs / features
- outputs / predictions
- targets / ground truth / labels
- decisions
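
Before publishing, it can be useful to confirm that each event carries the columns from every group listed above. The following sketch reports missing or null fields; `missing_fields` is a hypothetical helper, and the column list is a shortened example rather than the full schema.

```python
def missing_fields(event, required_columns):
    """Return the required columns that are absent or None in the event."""
    return sorted(
        col for col in required_columns if col not in event or event[col] is None
    )


# One representative column per group: input, output, target, decision.
required = ["income", "probability_high_value", "high_value", "Campaign_A"]
event = {"income": 41662, "probability_high_value": 0.49, "Campaign_A": "Yes"}
print(missing_fields(event, required))
```

A missing target is not necessarily an error, since labels may only become available after the prediction is made.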

In [8]:
event_log = pd.read_csv('../samples/datasets/auto_insurance/event_log.csv')
event_log.head()

Unnamed: 0,location_state,customer_lifetime_value,response,coverage,education,effective_to_date,employmentstatus,gender,income,location_code,...,policy_type,policy,renew_offer_type,sales_channel,total_claim_amount,vehicle_class,vehicle_size,high_value,probability_high_value,Campaign_A
0,California,8429.491575,No,Basic,Master,2/1/11,Employed,M,98796,Suburban,...,Personal Auto,Personal L3,Offer4,Agent,499.2,SUV,Medsize,1,0.515639,Yes
1,Oregon,2709.539796,No,Basic,College,1/17/11,Employed,F,58828,Suburban,...,Personal Auto,Personal L3,Offer1,Agent,446.751243,Four-Door Car,Small,0,0.0,No
2,Washington,19644.76296,No,Basic,College,2/12/11,Employed,F,67326,Suburban,...,Corporate Auto,Corporate L1,Offer1,Agent,331.2,Four-Door Car,Medsize,1,0.515524,Yes
3,Arizona,2710.918992,No,Extended,College,2/10/11,Unemployed,M,0,Suburban,...,Corporate Auto,Corporate L3,Offer4,Call Center,547.2,Two-Door Car,Medsize,0,0.0,No
4,California,19337.90103,No,Extended,College,2/8/11,Employed,M,26488,Suburban,...,Personal Auto,Personal L2,Offer1,Web,1321.584957,Luxury SUV,Medsize,1,0.517045,Yes


Now we will publish these rows as events. To simulate a time series, we will attach a generated timestamp to each row as we send it. Real data will ideally carry a timestamp for when the event took place; otherwise, the current time will be used.

**Note**: The timestamp must be in UTC milliseconds. See [here](https://docs.fiddler.ai/api-reference/python-package/#publish-event) for more details.
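
As a sketch of what "UTC milliseconds" means in practice, the helpers below convert a timezone-aware `datetime` to milliseconds since the Unix epoch and back, using only the standard library. The function names are illustrative; rejecting naive datetimes is a design choice made here to avoid silently treating local time as UTC.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_utc_ms(dt):
    """Convert a timezone-aware datetime to integer milliseconds since the epoch."""
    if dt.tzinfo is None:
        raise ValueError("naive datetime: attach a timezone before converting")
    # Integer division of timedeltas avoids floating-point rounding.
    return (dt - EPOCH) // timedelta(milliseconds=1)


def from_utc_ms(ms):
    """Convert milliseconds since the epoch to a UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)


ms = to_utc_ms(datetime(2021, 5, 6, 20, 40, 9, 912000, tzinfo=timezone.utc))
print(ms)
print(from_utc_ms(ms).isoformat())
```

Because the conversion works on aware datetimes, the same instant expressed in another timezone yields the same millisecond value.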

In [9]:

NUM_EVENTS_TO_SEND = 288*5 # one event every 5 minutes (288 per day) for 5 days
FIVE_MINUTES_MS = 300000
ONE_DAY_MS = 8.64e+7
start_date = round(time.time() * 1000) - (ONE_DAY_MS * 5)
        
# Convert this dataframe into a list of dictionary events, where each event is its own dictionary
event_list_dict = event_log.sample(n=NUM_EVENTS_TO_SEND).to_dict(orient='records') 

for ind, event_dict in enumerate(event_list_dict):
    event_ms_time_stamp = start_date + ind * FIVE_MINUTES_MS
    client.publish_event(project_id, model_id, event_dict, event_time_stamp=event_ms_time_stamp)

    clear_output(wait=True)
    # Convert explicitly to UTC; a bare fromtimestamp() call would return local time
    readable_timestamp = datetime.datetime.fromtimestamp(
        event_ms_time_stamp / 1000.0, tz=datetime.timezone.utc
    )

    print(f'Sending {ind+1} / {NUM_EVENTS_TO_SEND} \n{readable_timestamp:%Y-%m-%d %H:%M:%S.%f} UTC: \n{event_dict}')
    time.sleep(0.01)

Sending 1440 / 1440 
2021-05-06 20:40:09.912000 UTC: 
{'location_state': 'Oregon', 'customer_lifetime_value': 5391.970996, 'response': 'No', 'coverage': 'Basic', 'education': 'College', 'effective_to_date': '2/21/11', 'employmentstatus': 'Employed', 'gender': 'F', 'income': 41662, 'location_code': 'Urban', 'marital_status': 'Divorced', 'monthly_premium_auto': 69, 'months_since_last_claim': 27, 'months_since_policy_inception': 73, 'number_of_open_complaints': 0, 'number_of_policies': 6, 'policy_type': 'Corporate Auto', 'policy': 'Corporate L3', 'renew_offer_type': 'Offer1', 'sales_channel': 'Agent', 'total_claim_amount': 217.973168, 'vehicle_class': 'Four-Door Car', 'vehicle_size': 'Medsize', 'high_value': 1, 'probability_high_value': 0.4912975570321921, 'Campaign_A': 'Yes', '__event_type': 'execution_event', '__occurred_at': 1620333609912.0}
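
The timestamp arithmetic in the loop above can be checked in isolation. This sketch, using integer milliseconds and a fixed `now_ms` chosen purely for illustration, builds the schedule of evenly spaced timestamps and confirms that 5 days at 5-minute spacing gives 1440 events.

```python
FIVE_MINUTES_MS = 5 * 60 * 1000
ONE_DAY_MS = 24 * 60 * 60 * 1000


def event_schedule(now_ms, days, step_ms=FIVE_MINUTES_MS):
    """Return evenly spaced timestamps covering the `days` before `now_ms`."""
    start_ms = now_ms - days * ONE_DAY_MS
    count = days * ONE_DAY_MS // step_ms
    return [start_ms + i * step_ms for i in range(count)]


now_ms = 1_620_000_000_000  # arbitrary fixed value for a reproducible example
schedule = event_schedule(now_ms, days=5)
print(len(schedule))
print(now_ms - schedule[-1])
```

The last timestamp falls one step before `now_ms`, so none of the simulated events are dated in the future.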


[**Note**: If labels are ingested at a later point, an event can be updated by calling publish_event with update_event = True. See [here](https://docs.fiddler.ai/api-reference/python-package/#publish-event) for more details.]

## Seeing Monitoring Traffic
We can now consult our Fiddler instance to visualize our monitoring results. We can see our newly created project within the Projects Overview section:

<img src="images/qs_projects_list.png" width=1000 height=1000 />

Within our project, we can click the model we registered (`high_value_clf`) to open it. From there, we can see the traffic that reflects the events we sent by going to the Monitor section at the top:

<img src="images/qs_monitoring_look.png" width=1000 height=1000 />

For a walkthrough of how to navigate the product, please consult our [Product Tour](https://docs.fiddler.ai/product-tour/).