# Fiddler examples have moved! [Deprecation Notice]

Thank you for using Fiddler. We have moved the examples to a new GitHub repository, linked below.


***
# [New fiddler-examples repo](https://github.com/fiddler-labs/fiddler-examples)
***

# Fiddler Quick Start Guide

This guide walks you through the basic onboarding steps required to use Fiddler for production model monitoring and explainability. See the [API documentation](https://docs.fiddler.ai/api-reference/python-package/) for details on each call.

# Step Zero: Packages and Imports

To avoid missing imports later, most package imports are collected in this section.

For the purposes of this tutorial, we will be training and utilizing a model that uses `scikit-learn` version `0.21.2`. We have provided a cell that can be used to verify the current version of `scikit-learn` to ensure a smooth tutorial experience.

In [None]:
# !pip install scikit-learn==0.21.2

In [None]:
import fiddler as fdl
import pandas as pd
import pathlib
import pickle
import sklearn
import shutil
import yaml

In [None]:
# Check to see if `scikit-learn` version is `0.21.2`
import sklearn

if sklearn.__version__ != '0.21.2':
    raise Exception('Please use sklearn version 0.21.2')

# Step One: Client Setup

Here we will install the [Fiddler Python package](https://pypi.org/project/fiddler-client/) and establish an API connection to our Fiddler instance.

This Python client is a powerful way to:
- Upload the dataset and model to Fiddler
- Ingest production events to Fiddler

This can be done from a Jupyter Notebook or any Python editor that you use to load data and build models.

<img src="images/qs_d1.png" width=700 height=700 />

First, we need to initialize the client object by specifying:
- The `url`: the URL of your Fiddler instance, usually of the form ‘XXXXX.fiddler.ai’. Contact us if you don’t have it.
- The `org_id`: organization id is an identifier for the account. See Fiddler_URL/settings/general to find this id (listed as "Organization ID")
<img src="images/org_id.png" width=800 height=800 />
- The `auth_token`: this token is used to authenticate access. See Fiddler_URL/settings/credentials to find, create, or change this token
<img src="images/auth_token.png" width=800 height=800 />

You can also save this config as a file called `fiddler.ini` in the same folder as the notebook/script. That saves you from specifying the parameters in every notebook and script.
<img src="images/fiddler_ini.png" width=800 height=800 />
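
Before constructing the client, it can help to confirm that the config file parses and contains the three values listed above. The following is a minimal sketch using only the standard library; the helper name `read_fiddler_ini` is hypothetical and is not part of the Fiddler client, and the `[FIDDLER]` section name is taken from the example file in this guide.

```python
import configparser

# Keys the client setup in this guide expects to find.
REQUIRED_KEYS = ("url", "org_id", "auth_token")


def read_fiddler_ini(text):
    """Parse fiddler.ini content and return the required values as a dict.

    Hypothetical helper for illustration; it only validates the file layout.
    """
    parser = configparser.ConfigParser()
    parser.read_string(text)
    if not parser.has_section("FIDDLER"):
        raise ValueError("missing [FIDDLER] section")
    section = parser["FIDDLER"]
    missing = [key for key in REQUIRED_KEYS if not section.get(key)]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return {key: section[key] for key in REQUIRED_KEYS}


sample = """
[FIDDLER]
url = http://host.docker.internal:4100
org_id = onebox
auth_token = YOUR_TOKEN_HERE
"""

config = read_fiddler_ini(sample)
print(config["url"])  # http://host.docker.internal:4100
```

To check a file on disk, pass `pathlib.Path('fiddler.ini').read_text()` to the helper.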


In [None]:
!pip install fiddler-client==0.6.8;

In [None]:
%%writefile fiddler.ini

[FIDDLER]
url = http://host.docker.internal:4100
org_id = onebox
auth_token = YOUR_TOKEN_HERE

In [None]:
import fiddler as fdl

# client = fdl.FiddlerApi(url=url, org_id=org_id, auth_token=auth_token)
client = fdl.FiddlerApi()

Fiddler has three primary constructs, namely projects, datasets and models. This diagram illustrates the relationship between the three.
<img src="images/qs_d2.png" width=600 height=600 />

The Fiddler client provides a number of methods. API documentation can be found [here](https://docs.fiddler.ai/api-reference/python-package/)


# Step Two: Create Project

Here we will create a project, a convenient container for housing the models and datasets associated with a given ML use case.

For the purposes of a full quick start, it is best to create a `project_id` with a unique name to best track your progress.
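
One way to get a unique name is to append a short random suffix to a readable prefix. This is only a sketch: `unique_project_id` is a hypothetical helper, and the assumption that lowercase letters, digits, and underscores are acceptable in a project ID is ours, not something this guide states.

```python
import uuid


def unique_project_id(prefix="quickstart"):
    """Return the prefix plus an 8-character random hex suffix.

    Hypothetical helper; check your Fiddler instance for its naming rules.
    """
    return f"{prefix}_{uuid.uuid4().hex[:8]}"


candidate_id = unique_project_id()
print(candidate_id)
```

The cell below keeps the fixed name `quickstart`; swap in a generated ID if several people share one instance.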

In [None]:
project_id = 'quickstart'

In [None]:
# Creating our project using project_id
if project_id not in client.list_projects():
    client.create_project(project_id)

# Step Three: Upload Baseline Data

Here we will upload the datasets that will serve as baselines for various product capabilities, including monitoring of model performance, prediction & feature drift, and data errors; generating prediction-level (point) and model-level (global) explanations; and calculating various bias metrics.

We recommend using the model's training and test set for the most faithful and actionable metrics. In addition to the model's features and labels, Fiddler requires a few additional attributes to unlock its full suite of capabilities:

*   Model predictions (Mandatory: serves as a baseline for prediction drift)
*   Model decisions (Optional: used to monitor model decisions over time, e.g. loan approved vs denied. The data uploaded initially can be random)
*   Model metadata (Optional: any additional fields relevant for model analysis. If you intend to use Fiddler to detect model bias, include any relevant protected attributes here, e.g. gender, race, age)

## Load dataset

Load the data you are going to use for training your model. For this tutorial, we will be using an auto insurance dataset that can be found [here](https://www.kaggle.com/somjee/auto-insurance-customerlifetimevalue?select=data.csv). 

**Note**: In the next cell, we will be making some modifications to the field names to better fit model training in later steps. We will also be adding a decisions column to our dataset. In total, we will be:
- Renaming field `State` to `Location State` (State is a reserved word for models, so this change was required for model training)
- Adding a new `high_value` field to act as our decision column
- Renaming columns to be a more library-friendly schema (lowercase and replacing spaces with underscores)
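
The renaming and labeling rules can be sketched without pandas, which makes them easy to check in isolation. The helper names are hypothetical, the sample header list is a small subset chosen for illustration, and the 5000 threshold is the one used in the next cell.

```python
def normalize_columns(columns):
    """Rename `State`, then lowercase names and replace spaces with underscores."""
    renamed = ["Location State" if name == "State" else name for name in columns]
    return [name.lower().replace(" ", "_") for name in renamed]


def high_value_label(customer_lifetime_value, threshold=5000):
    """Label a customer 'Yes' when lifetime value meets the threshold."""
    return "Yes" if customer_lifetime_value >= threshold else "No"


raw_columns = ["Customer", "State", "Customer Lifetime Value", "EmploymentStatus"]
print(normalize_columns(raw_columns))
# ['customer', 'location_state', 'customer_lifetime_value', 'employmentstatus']

print(high_value_label(5000), high_value_label(4999.99))  # Yes No
```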

In [None]:
# https://www.kaggle.com/somjee/auto-insurance-customerlifetimevalue?select=data.csv
df = pd.read_csv('/app/fiddler_samples/samples/datasets/auto_insurance/data.csv')
df = df.rename(columns={"State": "Location State"})
df.columns = [x.lower().replace(' ', '_') for x in df.columns]

# Adding a decision column to our data. In this case, we deem a 'high_value' customer as
# one with customer_lifetime_value >= 5000
df = df.assign(high_value=['Yes' if x >= 5000 else 'No' for x in df['customer_lifetime_value']])

df.head()

## Split Dataset into Train/Test

Now we will split our dataset into a train/test set to be used in training our model.

In [None]:
df_train = df.sample(frac=0.8,random_state=200)
df_test = df.drop(df_train.index)
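
The split above samples 80% of the rows for training and uses the remaining rows for testing, so the two sets never overlap. The same idea can be sketched on plain row indices. This uses Python's `random` module rather than pandas, so the rows it selects will differ from `df.sample`; only the structure of the split is the same.

```python
import random


def split_indices(n_rows, frac=0.8, seed=200):
    """Return (train, test) index lists that are disjoint and cover all rows."""
    rng = random.Random(seed)
    n_train = round(n_rows * frac)
    train = set(rng.sample(range(n_rows), n_train))
    test = [i for i in range(n_rows) if i not in train]
    return sorted(train), test


train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))  # 8 2
```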

## Upload dataset

Before you can upload a model, you need to upload a sample of the model’s inputs, targets, and any additional metadata that might be useful for model analysis. This data sample helps us (among other things) infer the model schema, along with the data type and value range of each feature.
- This sample has to be a flat table that can be loaded as a pandas DF (```upload_dataset()```).
- This input data sample is used for many downstream functions in Fiddler, including: shapley value methods, what-if (ICE) plots, PDP plots, drift, outliers, and data integrity.
- We suggest uploading a sample of the model’s training data, as it is the most meaningful for the tasks listed above. For example, model outliers should ideally be based on the training data, since that is the data the model has seen.
- You can upload multiple datasets with string identifiers, but we currently do not ascribe any meaning to those. For example: ```dataset={'data': df}``` or ```dataset={'train': train_df, 'test': test_df}```.
- Currently we support two input types:
    - Tabular
    - Single string text, meaning text data in a single column

In [None]:
dataset_id = 'auto_insurance'
dataset_id

Now, we will create a schema for our dataset, and upload the dataset to Fiddler. 

If a dataset with this `dataset_id` has already been uploaded, we can fetch its schema and reuse it. If you want to delete and re-upload a dataset (e.g. different fields, different `max_inferred_cardinality`, etc.), we've included some commented code that shows how to do so:

In [None]:
# Retrieve dataset if already uploaded
if dataset_id in client.list_datasets():
    df_schema = client.get_dataset_info(dataset_id)
else:
    df_schema = fdl.DatasetInfo.from_dataframe(df, max_inferred_cardinality=1000)
    upload_result = client.upload_dataset(
        dataset={'train': df_train,
                 'test': df_test},
        dataset_id=dataset_id,
        info=df_schema)

"""
# Delete dataset (if present) and reupload fresh
if dataset_id in client.list_datasets():
    client.delete_dataset(dataset_id)

df_schema = fdl.DatasetInfo.from_dataframe(df, max_inferred_cardinality=1000)
upload_result = client.upload_dataset(
    dataset={'train': df_train,
             'test': df_test},
    dataset_id=dataset_id,
    info=df_schema)
"""

df_schema

# Step Four: Create and Train a Model

## Create Model Schema

As you may have noticed, the dataset upload step did not ask for the model’s features and targets, or any other model-specific information. That’s because multiple models can be linked to a given dataset schema. We therefore need a separate model schema step, which tells Fiddler which features are relevant to the model and what the model task is. Here you can specify the input features, the target column, decision columns and metadata columns, and also the type of model.
- We can infer the model task from the target column, or it can be set explicitly. Currently we support three model types:
    - Regression
    - Binary Classification
    - Multi-class Classification
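
To make the idea of inferring a task from the target column concrete, here is an illustrative heuristic. It is an assumption for teaching purposes only and is not Fiddler's actual inference rule; the function name and the returned labels are hypothetical.

```python
def guess_model_task(target_values):
    """Guess a model task from target values (illustrative heuristic only).

    - any float value -> regression
    - exactly two distinct values -> binary classification
    - more than two distinct values -> multi-class classification
    """
    distinct = set(target_values)
    if any(isinstance(value, float) for value in distinct):
        return "regression"
    if len(distinct) < 2:
        raise ValueError("need at least two distinct target values")
    if len(distinct) == 2:
        return "binary_classification"
    return "multiclass_classification"


print(guess_model_task([2763.52, 6979.54, 12887.43]))  # regression
print(guess_model_task(["Yes", "No", "Yes"]))          # binary_classification
```

In this tutorial the target `customer_lifetime_value` is a float column, which is why the model is treated as a regression model.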

In [None]:
target = 'customer_lifetime_value'
continuous_features = ['income', 'monthly_premium_auto', 'months_since_last_claim', 'months_since_policy_inception',
                        'number_of_open_complaints', 'number_of_policies', 'total_claim_amount']
categorical_features = ['location_state', 'employmentstatus', 'policy_type', 'policy', 'vehicle_class','vehicle_size']

feature_columns = list(continuous_features + categorical_features)
metadata_cols = ['gender']
decision_cols = ['high_value']

model_info = fdl.ModelInfo.from_dataset_info(
    dataset_info=client.get_dataset_info(dataset_id),
    target=target, 
    features=feature_columns,
    metadata_cols=metadata_cols,
    decision_cols=decision_cols,
    display_name='Gradient Boosting Regressor',
    description='this is a GradientBoostingRegressor model from the tutorial',
)

model_info

## Train model

Build and train your model. For this model, we will be creating a Pipeline that will transform the data passed in, and then run that data through a gradient boosting regressor.

In [None]:
# references
# https://scikit-learn.org/stable/modules/preprocessing.html
# https://songxia-sophia.medium.com/two-machine-learning-algorithms-to-predict-xgboost-neural-network-with-entity-embedding-caac68717dea
# https://machinelearningmastery.com/data-preparation-gradient-boosting-xgboost-python/

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import GradientBoostingRegressor
import sklearn.pipeline
from sklearn.compose import ColumnTransformer

target = 'customer_lifetime_value'
continuous_features = ['income', 'monthly_premium_auto', 'months_since_last_claim', 'months_since_policy_inception',
                        'number_of_open_complaints', 'number_of_policies', 'total_claim_amount']
categorical_features = ['location_state', 'employmentstatus', 'policy_type', 'policy', 'vehicle_class','vehicle_size']

category_transformer = sklearn.pipeline.Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(transformers=[('cat', category_transformer, categorical_features),
                                               ('cont', StandardScaler(), continuous_features)])

model = GradientBoostingRegressor(learning_rate=0.1,
                                  n_estimators=100,
                                  max_depth=7)
model_pipeline = sklearn.pipeline.Pipeline(steps=[('preprocessor', preprocessor),
                                                  ('model', model)])

model_pipeline.fit(df_train.loc[:, df_train.columns !=  target], df_train[target])
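
To see what the two preprocessing steps do to a single column, here is a dependency-free sketch of one-hot encoding that ignores unknown categories (the behavior requested by `handle_unknown='ignore'`) and of standard scaling with the population standard deviation. The helper names are hypothetical and the category values are made up; the pipeline above uses the scikit-learn implementations, not this code.

```python
import math


def fit_one_hot(values):
    """Learn sorted categories; unseen values encode to all zeros."""
    categories = sorted(set(values))

    def encode(value):
        return [1.0 if value == category else 0.0 for category in categories]

    return categories, encode


def fit_standard_scaler(values):
    """Learn mean and population standard deviation, return a scaling function."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    std = std or 1.0  # avoid dividing by zero for constant columns

    def scale(value):
        return (value - mean) / std

    return scale


categories, encode = fit_one_hot(["Basic", "Premium", "Basic", "Extended"])
print(categories)         # ['Basic', 'Extended', 'Premium']
print(encode("Premium"))  # [0.0, 0.0, 1.0]
print(encode("Unknown"))  # [0.0, 0.0, 0.0]

scale = fit_standard_scaler([2.0, 4.0, 6.0])
print(scale(4.0))         # 0.0
```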

# Step Five: Create Model Directory and Upload Model

Before uploading our model, we need to save the model along with any preprocessing steps applied to the input features (for example, a categorical encoder or tokenizer).  
We currently support the following stored model formats:
- For scikit-learn API-based models: pickled models, or any storage format that you can load in `package.py` (details below).
- For TensorFlow: TF SavedModel and Keras `.h5`.

Note:
- Keras models have to have their input tensor differentiable if Integrated Gradients support is desired
- We also need to save the data preprocessing pipeline code, if any. This will be accessed in the package.py

In total, we will have a `model.yaml`, a `model.pkl`, and a `package.py` file within our model directory.
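
Once the directory is populated by the cells that follow, a quick check like the one below can confirm that all three files are present before uploading. This is a sketch using only the standard library; `missing_model_files` is a hypothetical helper, and the demo runs against a temporary directory rather than the real model directory.

```python
import pathlib
import tempfile

# File names taken from the sentence above.
REQUIRED_FILES = ("model.yaml", "model.pkl", "package.py")


def missing_model_files(model_dir):
    """Return the required file names that are absent from model_dir."""
    model_dir = pathlib.Path(model_dir)
    return [name for name in REQUIRED_FILES if not (model_dir / name).is_file()]


with tempfile.TemporaryDirectory() as tmp:
    demo_dir = pathlib.Path(tmp) / "gradient_boosting_regressor"
    demo_dir.mkdir()
    (demo_dir / "model.yaml").write_text("model: {}\n")
    print(missing_model_files(demo_dir))  # ['model.pkl', 'package.py']
```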

In [None]:
# Create model directory
import pathlib
import shutil

model_id = 'gradient_boosting_regressor'
model_dir = pathlib.Path(model_id)
shutil.rmtree(model_dir, ignore_errors=True)
model_dir.mkdir()

## Save model and schema (`model.pkl` and `model.yaml`)

In [None]:
# Write model schema file to model directory
import yaml
import pickle

# Saving model
with open(model_dir / 'model.pkl', 'wb') as pkl_file:
    pickle.dump(model_pipeline, pkl_file)

# Saving schema
with open(model_dir / 'model.yaml', 'w') as yaml_file:
    yaml.dump({'model': model_info.to_dict()}, yaml_file)

## Create `package.py`

`package.py` is a Python module that

- Facilitates model loading
- Implements interfaces necessary for the Fiddler platform to interact with models.

This gives Fiddler the flexibility to support a wide variety of complex models. More information can be found [here](https://docs.fiddler.ai/api-reference/package-py/).

In [None]:
%%writefile gradient_boosting_regressor/package.py

import pandas as pd
from pathlib import Path
import pickle as pkl
import os

PACKAGE_PATH = Path(__file__).parent
TARGET = 'customer_lifetime_value'
PREDICTION = 'predicted_customer_lifetime_value'

class GBRegressor:
    """A Gradient Boosting Regressor predictor for auto_insurance data.
       This loads the predictor once and runs for each call to predict.
    """

    def __init__(self, model_path, output_column=None):
        """
        :param model_path: The directory where the model is saved.
        :param output_column: list of column name(s) for the output.
        """
        self.model_path = model_path
        self.output_column = output_column

        file_path = os.path.join(self.model_path, 'model.pkl')
        with open(file_path, 'rb') as file:
            self.model = pkl.load(file)

    def predict(self, input_df):
        return pd.DataFrame(
            self.model.predict(input_df.loc[:, input_df.columns != TARGET]), 
            columns=self.output_column
        )

def get_model():
    return GBRegressor(model_path=PACKAGE_PATH, output_column=[PREDICTION])

## Validate Model Package
 
This step finds issues with the `package.py` composed above to enable easy debugging.
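
Independently of the validator, a lightweight local check can confirm that `package.py` imports cleanly and exposes a callable `get_model`. The sketch below uses only `importlib` from the standard library; the helper names are hypothetical, it does not replicate what `PackageValidator` checks, and the demo writes a stand-in `package.py` to a temporary directory.

```python
import importlib.util
import pathlib
import tempfile


def load_package_module(model_dir):
    """Import package.py from model_dir as a standalone module."""
    path = pathlib.Path(model_dir) / "package.py"
    spec = importlib.util.spec_from_file_location("package", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def has_get_model(model_dir):
    """Return True when package.py defines a callable named get_model."""
    module = load_package_module(model_dir)
    return callable(getattr(module, "get_model", None))


with tempfile.TemporaryDirectory() as tmp:
    stub = "def get_model():\n    return object()\n"
    (pathlib.Path(tmp) / "package.py").write_text(stub)
    print(has_get_model(tmp))  # True
```

Running `has_get_model(model_dir)` against the real directory executes the module's imports, so an import error in `package.py` surfaces here before the upload.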

In [None]:
from fiddler import PackageValidator

validator = PackageValidator(model_info, df_schema, model_dir)
passed, errors = validator.run_chain()

## Upload Model

Now that we have all the parts that we need, we can go ahead and upload the model to the Fiddler platform. You can use [upload_model_package](https://docs.fiddler.ai/api-reference/python-package/#upload-model-package) to upload the entire directory in one call. Uploading a model requires:
- The `path` to the directory
- The `project_id` to which the model belongs
- The `model_id`, which is the name you want to give the model. You can access it in Fiddler henceforth via this ID
- The `dataset` which the model is linked to (optional)  

In [None]:
# Let's first delete the model if it already exists in the project
if model_id in client.list_models(project_id):
    client.delete_model(project_id, model_id)
    print('Model deleted')
    
client.upload_model_package(artifact_path=model_dir, project_id=project_id, model_id=model_id)
f"Project '{project_id}' now contains model '{model_id}'"

Let's trigger the model to run its predictions by calling the following function:

In [None]:
client.trigger_model_predictions(project_id, model_id, dataset_id)

## Run model

Now, let's test out our model by interfacing with the client and calling [run model](https://docs.fiddler.ai/api-reference/python-package/#run-model).

In [None]:
y_test_true = df_test['customer_lifetime_value']
y_test_pred = client.run_model(project_id, model_id, df_test, log_events=True)

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

# Use the variables defined in the previous cell
test_mse = mean_squared_error(y_test_true, y_test_pred)
test_r2 = r2_score(y_test_true, y_test_pred)

print(f'MSE: {test_mse}\nR2: {test_r2}')
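
For reference, the two metrics reduce to a few lines of arithmetic. The sketch below computes them in plain Python on a tiny made-up example; the function names are hypothetical and the numbers are not results from the tutorial model.

```python
def mean_squared_error_py(actual, predicted):
    """Average of squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r2_score_py(actual, predicted):
    """1 minus the ratio of residual sum of squares to total sum of squares."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 8.0]
print(mean_squared_error_py(actual, predicted))  # 2/3, about 0.667
print(r2_score_py(actual, predicted))            # 0.75
```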

# Step Six: Simulate Monitoring Traffic

In this step, we simulate production traffic for model monitoring using [publish_event](https://docs.fiddler.ai/api-reference/python-package/#publish-event). This is equivalent to running our model separately on some data and either sending the results to Fiddler immediately or saving them to a log and sending them later.

For this demonstration, we take the log-based approach. Each row of the log contains:

- inputs 
- predictions
- labels (targets)
- decisions

We can find the fields that will be utilized by consulting our `ModelInfo` object:

```
ModelInfo:
      display_name: Gradient Boosting Regressor \
      description: this is a GradientBoostingRegressor model from the tutorial
      input_type: ModelInputType.TABULAR
      model_task: ModelTask.REGRESSION
-->   inputs:
                                   column     dtype count(possible_values)  \
        0                  location_state  CATEGORY                      5   
        1                employmentstatus  CATEGORY                      5   
        2                          income   INTEGER                          
        3            monthly_premium_auto   INTEGER                          
        4         months_since_last_claim   INTEGER                          
        5   months_since_policy_inception   INTEGER                          
        6       number_of_open_complaints   INTEGER                          
        7              number_of_policies   INTEGER                          
        8                     policy_type  CATEGORY                      3   
        9                          policy  CATEGORY                      9   
        10             total_claim_amount     FLOAT                          
        11                  vehicle_class  CATEGORY                      6   
        12                   vehicle_size  CATEGORY                      3                    
-->   outputs
                                      column  dtype count(possible_values)  \
        0  predicted_customer_lifetime_value  FLOAT                          

          is_nullable value_range  
        0       False       * - *  
      metadata:
           column     dtype  count(possible_values) is_nullable value_range
        0  gender  CATEGORY                       2       False            
-->   decisions:
               column     dtype  count(possible_values) is_nullable value_range
        0  high_value  CATEGORY                       2       False            
-->   targets: [Column(name="customer_lifetime_value", data_type=DataType.FLOAT, possible_values=None, 
                is_nullable=False, value_range_min=1898.007675, value_range_max=83325.38119)]
                  misc:{}
    
```



In [None]:
event_log = pd.read_csv('/app/fiddler_samples/samples/datasets/auto_insurance/event_log.csv')
event_log.head()

Each row of the log is sent to Fiddler as one event. To simulate a time series, we assign every event a generated timestamp, spacing events five minutes apart. Real data should carry the timestamp of when the event took place; if no timestamp is given, the current time is used.

We can send inputs, outputs, targets, and decision variables.

**Note**: The timestamp must be in UTC milliseconds. See 
[here](https://docs.fiddler.ai/api-reference/python-package/#publish-event) for more details
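
A minimal sketch of converting between timezone-aware datetimes and UTC milliseconds, using only the standard library. The helper names are hypothetical, and the fixed start date is chosen only so the output is reproducible.

```python
import datetime

FIVE_MINUTES_MS = 5 * 60 * 1000


def to_utc_ms(dt):
    """Convert a timezone-aware datetime to integer milliseconds since the epoch."""
    if dt.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return round(dt.timestamp() * 1000)


def from_utc_ms(ms):
    """Convert milliseconds since the epoch back to a UTC datetime."""
    return datetime.datetime.fromtimestamp(ms / 1000, tz=datetime.timezone.utc)


start = datetime.datetime(2021, 1, 1, tzinfo=datetime.timezone.utc)
start_ms = to_utc_ms(start)
print(start_ms)  # 1609459200000

# Three events spaced five minutes apart
event_times = [start_ms + i * FIVE_MINUTES_MS for i in range(3)]
print(from_utc_ms(event_times[-1]))  # 2021-01-01 00:10:00+00:00
```

Requiring an aware datetime avoids silently interpreting a naive local time as UTC.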

In [None]:
import datetime
import time
from IPython.display import clear_output

NUM_EVENTS_TO_SEND = 500

FIVE_MINUTES_MS = 300000
FIFTEEN_MINUTES_MS = FIVE_MINUTES_MS * 3
ONE_DAY_MS = 86_400_000  # integer, so timestamps stay whole milliseconds
start_date = round(time.time() * 1000) - (ONE_DAY_MS * 8)
print(datetime.datetime.fromtimestamp(start_date/1000.0))

In [None]:
# Convert this dataframe into a list of dictionary events, where each event is its own dictionary
event_list_dict = event_log.sample(n=NUM_EVENTS_TO_SEND, random_state=42).to_dict(orient='records') 

for ind, event_dict in enumerate(event_list_dict):
    event_time = start_date + ind * FIVE_MINUTES_MS
    result = client.publish_event(project_id,
                                  model_id,
                                  event_dict,
                                  event_time_stamp=event_time,
                                  event_id=str(ind + 100),
                                  update_event=False)
    
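    # Convert with an explicit UTC timezone so the printed "UTC" label is accurate
    readable_timestamp = datetime.datetime.fromtimestamp(event_time / 1000.0, tz=datetime.timezone.utc)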
    clear_output(wait = True)
    
    print(f'Sending {ind+1} / {NUM_EVENTS_TO_SEND} \n{readable_timestamp} UTC: \n{event_dict}')
    time.sleep(0.01)

**Note**: If labels arrive at a later point, an existing event can be updated by calling:

- `res = client.publish_event(project_id, model_id, event, event_id=customer_id, update_event=True, event_time_stamp=row['__occurred_at'])`

When the `update_event` flag is set to true, the event identified by `event_id` is updated with whatever additional information you pass in through `event`, including a target label. See [here](https://docs.fiddler.ai/api-reference/python-package/#publish-event) for more details.

## Seeing Monitoring Traffic
We can now consult our Fiddler instance to visualize our monitoring results. We can see our newly created project within the Projects Overview section:

<img src="images/qs_projects.png" width=1000 height=1000 />

Within our project, we can click `gradient_boosting_regressor` to see the model we created. From there, we can see the traffic that reflects the events we sent by going to the Monitor section at the top:

<img src="images/qs_monitoring.png" width=1000 height=1000 />

To learn more about navigating the product, see our [Product Tour](https://docs.fiddler.ai/product-tour/).