# Fiddler Ranking Model Quick Start Guide

Fiddler lets your teams observe your ranking models to understand their performance and catch issues like data drift before they affect your applications.

## Supporting Ranking models


### Monitoring

#### Data Drift
We calculate drift in your model's production events by comparing them against the baseline datasets you provide. We offer the following data drift metrics in our platform:

- **JSD** - Jensen-Shannon Divergence
- **PSI** - Population Stability Index
 
You can read more about this in our [Data Drift](https://docs.fiddler.ai/docs/data-drift-platform) docs.
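
As a rough illustration of what these two metrics measure, the sketch below computes JSD and PSI from pre-binned counts using their textbook definitions. The function names, the `eps` smoothing, and the base-2 logarithm for JSD are assumptions made for this example; Fiddler's own binning and implementation may differ.

```python
import math

def _normalize(counts, eps=1e-6):
    """Turn raw bin counts into probabilities, flooring at eps to avoid log(0)."""
    total = sum(counts)
    probs = [max(c / total, eps) for c in counts]
    norm = sum(probs)
    return [p / norm for p in probs]

def _kl_divergence_bits(a, b):
    """Kullback-Leibler divergence KL(a || b) in bits."""
    return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b))

def jensen_shannon_divergence(baseline_counts, production_counts):
    """JSD between two binned distributions (0 = identical, 1 = disjoint)."""
    p = _normalize(baseline_counts)
    q = _normalize(production_counts)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl_divergence_bits(p, m) + 0.5 * _kl_divergence_bits(q, m)

def population_stability_index(baseline_counts, production_counts):
    """PSI: sum over bins of (q - p) * ln(q / p)."""
    p = _normalize(baseline_counts)
    q = _normalize(production_counts)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))

# Hypothetical histograms of one feature over the same three bins
baseline_hist = [50, 30, 20]
production_hist = [20, 30, 50]
print(jensen_shannon_divergence(baseline_hist, production_hist))
print(population_stability_index(baseline_hist, production_hist))
```

Both metrics are zero when the production histogram matches the baseline and grow as the two distributions diverge.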

#### Performance

Fiddler provides two ranking-specific performance metrics:
- **MAP** (Mean Average Precision) at k: available only for binary relevance ranking models.
- **NDCG** (Normalized Discounted Cumulative Gain) at k: available for both binary and graded relevance ranking models.  
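
To make the two metrics concrete, here is a minimal sketch of MAP at k and NDCG at k for lists of relevance labels that are already ordered by the model's ranking. The function names are hypothetical, and the conventions used (average precision normalized by the number of relevant items found in the top k, and the `2**rel - 1` gain for NDCG) are assumptions for this example; Fiddler's exact formulas may differ.

```python
import math

def average_precision_at_k(relevances, k):
    """relevances: binary labels (0/1) in ranked order, best-ranked item first."""
    hits = 0
    precision_sum = 0.0
    for position, rel in enumerate(relevances[:k], start=1):
        if rel:
            hits += 1
            precision_sum += hits / position
    return precision_sum / hits if hits else 0.0

def dcg_at_k(relevances, k):
    """Discounted cumulative gain with exponential gain 2**rel - 1."""
    return sum((2 ** rel - 1) / math.log2(position + 1)
               for position, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the ideal (descending relevance) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

def mean_over_queries(metric, ranked_lists, k):
    """Average a per-query metric over all queries (this is the 'M' in MAP)."""
    return sum(metric(r, k) for r in ranked_lists) / len(ranked_lists)

# Two hypothetical queries with binary click labels in ranked order
queries = [[1, 0, 1, 0], [0, 1, 0, 0]]
print(mean_over_queries(average_precision_at_k, queries, k=3))
print(mean_over_queries(ndcg_at_k, queries, k=3))
```

Because NDCG works on numeric gains, the same function handles graded labels such as `[3, 2, 0, 1]`, whereas average precision only makes sense for binary relevance.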

#### Data Integrity 
Finally you can track Data Integrity and Service Metrics like:
- **Missing value violations** — The percentage of missing value violations over all features for a given period of time.
- **Type violations** — The percentage of data type mismatch violations over all features for a given period of time.
- **Range violations** — The percentage of range mismatch violations over all features for a given period of time.
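
The sketch below shows one way such percentages could be computed for a batch of events. The schema format, the function name, and the choice of denominator (number of events times number of features) are assumptions for illustration only, not Fiddler's implementation.

```python
def integrity_violation_rates(events, schema):
    """
    events: list of dicts mapping feature name -> value.
    schema: {feature: (expected_type, min_value, max_value)} (hypothetical format).
    Returns the percentage of (event, feature) cells with each kind of violation.
    """
    counts = {"missing": 0, "type": 0, "range": 0}
    for event in events:
        for feature, (expected_type, low, high) in schema.items():
            value = event.get(feature)
            if value is None:
                counts["missing"] += 1
            elif not isinstance(value, expected_type):
                counts["type"] += 1
            elif not (low <= value <= high):
                counts["range"] += 1
    total = len(events) * len(schema)
    return {name: (100.0 * n / total if total else 0.0) for name, n in counts.items()}

# Hypothetical schema and events
schema = {"price_usd": (float, 0.0, 10000.0), "prop_starrating": (int, 0, 5)}
events = [
    {"price_usd": 120.0, "prop_starrating": 4},
    {"price_usd": None, "prop_starrating": "four"},   # one missing value, one type mismatch
    {"price_usd": -5.0, "prop_starrating": 3},        # one range violation
]
print(integrity_violation_rates(events, schema))
```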



### Explainability

#### Global Explanations 
You can get global (model-level) impact and importance for each feature that your model uses. This helps you understand what the model focuses on when making its predictions/rankings.

#### Point Explanations
The SHAP algorithms (Fiddler-SHAP and traditional Kernel SHAP) have been modified to explain an item with respect to the rest of the query result. For example, suppose query ID 'xyz' has 150 results and we want to understand why a particular item has been ranked 3rd. The SHAP algorithms are run with a background dataset formed by the 150 results of query ID 'xyz'.
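
The idea of using the query's own results as the background can be illustrated with a brute-force Shapley computation. This is neither Fiddler-SHAP nor Kernel SHAP: it is an exact enumeration over feature subsets that is only practical for a handful of features, and the scoring function and data are made up for the example.

```python
from itertools import combinations
from math import factorial

def shapley_values(score_fn, item, background):
    """
    Exact Shapley values for one item's score.
    background: feature rows of the results returned for the same query.
    Features outside a coalition are filled in from each background row.
    """
    n = len(item)

    def value(subset):
        # Expected score when only the features in `subset` come from the item
        total = 0.0
        for row in background:
            mixed = [item[j] if j in subset else row[j] for j in range(n)]
            total += score_fn(mixed)
        return total / len(background)

    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi += weight * (value(s | {i}) - value(s))
        phis.append(phi)
    return phis

# Hypothetical linear ranking score over two features
score = lambda x: 2 * x[0] - 1 * x[1]
query_results = [[1, 1], [3, 5]]   # background: results of the same query
item = [4, 2]                      # the result we want to explain
print(shapley_values(score, item, query_results))
```

The attributions sum to the difference between the item's score and the average score over the query's results, so they explain the item's score relative to its competitors in the same query rather than relative to the whole dataset.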

#### Dependence plots
Dependence plots (ICE plots and PDP plots) can both be generated from the Fiddler platform to understand the model's inner workings.

--------

# Quickstart: Expedia Search Ranking
The following dataset comes from Expedia. It includes shopping and purchase data as well as information on price competitiveness. The data are organized around a set of “search result impressions”, or the ordered list of hotels that the user sees after they search for a hotel on the Expedia website. In addition to impressions from the existing algorithm, the data contain impressions where the hotels were randomly sorted, to avoid the position bias of the existing algorithm. The user response is provided as a click on a hotel. Source: https://www.kaggle.com/c/expedia-personalized-sort/overview

# 0. Imports

In [None]:
import pandas as pd
import lightgbm as lgb
import numpy as np
import time
import datetime

# 1. Connect to Fiddler and Create a Project
First we install and import the Fiddler Python client.

In [None]:
!pip install -q fiddler-client
import fiddler as fdl
print(f"Running client version {fdl.__version__}")

Before you can add information about your model with Fiddler, you'll need to connect using our API client.

---

**We need a few pieces of information to get started.**
1. The URL you're using to connect to Fiddler
2. Your organization ID
3. Your authorization token

The latter two of these can be found by pointing your browser to your Fiddler URL and navigating to the **Settings** page.

In [None]:
URL = '' # Make sure to include the full URL (including https://). For example, https://abc.xyz.ai
ORG_ID = ''
AUTH_TOKEN = ''

Next we run the following code block to connect to the Fiddler API.

In [None]:
client = fdl.FiddlerApi(url=URL, org_id=ORG_ID, auth_token=AUTH_TOKEN)

Once you connect, you can create a new project by specifying a unique project ID in the client's `create_project` function.

In [None]:
PROJECT_ID = 'search_ranking'

if PROJECT_ID not in client.list_projects():
    print(f'Creating project: {PROJECT_ID}')
    client.create_project(PROJECT_ID)
else:
    print(f'Project: {PROJECT_ID} already exists')

# 2. Upload the Baseline Dataset

Now we retrieve the Expedia Dataset as a baseline for this model.

In [None]:
df = pd.read_csv("https://media.githubusercontent.com/media/fiddler-labs/fiddler-examples/main/quickstart/data/expedia_baseline_data.csv")
df.head()

Fiddler uses this baseline dataset to keep track of important information about your data.
  
This includes **data types**, **data ranges**, and **unique values** for categorical variables.

---

You can construct a `DatasetInfo` object to be used as **a schema for keeping track of this information** by running the following code block.

In [None]:
dataset_info = fdl.DatasetInfo.from_dataframe(df=df, max_inferred_cardinality=100)
dataset_info

Then use the client's [upload_dataset](https://docs.fiddler.ai/reference/clientupload_dataset) function to send this information to Fiddler!
  
*Just include:*
1. A unique dataset ID
2. The baseline dataset as a pandas DataFrame
3. The [DatasetInfo](https://docs.fiddler.ai/reference/fdldatasetinfo) object you just created

In [None]:
DATASET_ID = 'expedia_data'
client.upload_dataset(project_id=PROJECT_ID,
                      dataset={'baseline': df},
                      dataset_id=DATASET_ID,
                      info=dataset_info)

# 3. Creating the Model

To explain a model's inner workings, we need to upload the model artifacts. Let's train this ranking model with the sample data from Expedia that we just downloaded.
The following model is trained with **lightgbm 2.3.0**.

### 3.a Data Preparation

In [None]:
# Create training and validation splits by search ID:
# searches below the 94th percentile of srch_id go to training, the rest to validation
cutoff_id = df["srch_id"].quantile(0.94)

X_train = df.loc[df.srch_id < cutoff_id].drop(["click_bool", 'score'], axis=1)
X_eval = df.loc[df.srch_id >= cutoff_id].drop(["click_bool", 'score'], axis=1)
y_train = df.loc[df.srch_id < cutoff_id]["click_bool"]
y_eval = df.loc[df.srch_id >= cutoff_id]["click_bool"]

### 3.b Training 

In [None]:
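gbm = lgb.LGBMRanker()
# LightGBM needs the number of rows in each query group, in the order the groups appear in X_train.
# np.unique returns counts in sorted srch_id order, so this assumes the rows are sorted by srch_id.
groups = np.unique(X_train.srch_id, return_counts=True)
groups_number = list(groups[1])
gbm.fit(X_train, y_train, group=groups_number)
gbm.predict(X_eval)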
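
LightGBM's `group` argument lists the size of each query in row order, so the rows of one query must be contiguous and the sizes must sum to the number of training rows. The sketch below is a hypothetical helper (not part of LightGBM or Fiddler) that derives the sizes from the row order itself and fails loudly if a query's rows are scattered.

```python
from itertools import groupby

def contiguous_group_sizes(query_ids):
    """Return the size of each run of identical query IDs, in row order."""
    sizes = []
    seen = set()
    for query_id, rows in groupby(query_ids):
        if query_id in seen:
            raise ValueError(
                f"rows for query {query_id!r} are not contiguous; sort by query ID first"
            )
        seen.add(query_id)
        sizes.append(sum(1 for _ in rows))
    return sizes

# Hypothetical srch_id column: three queries with 2, 3 and 1 results
srch_ids = [7, 7, 3, 3, 3, 9]
sizes = contiguous_group_sizes(srch_ids)
print(sizes, sum(sizes) == len(srch_ids))
```

Unlike counting unique values, this approach does not require the IDs to be sorted, only that each query's rows sit next to each other.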

### 3.c Saving the Model

We need to create a new folder and add the pieces needed for Fiddler to use/run your model. This folder will have:

- your model file saved into the format of your choice (json, pickle, h5, ..)
- a wrapper: package.py (created in the next step)

In [None]:
import pickle
import pathlib

model_dir = pathlib.Path('model')
# exist_ok=True lets this cell be re-run without raising FileExistsError
model_dir.mkdir(exist_ok=True)

In [None]:
# Save the trained model to disk
with open(model_dir / 'model.pkl', 'wb') as outfile:
    pickle.dump(gbm, outfile)

# 4. Share Model Metadata and Upload the Model

Now let's add this model we just created to Fiddler.

### 4.a Adding model to Fiddler
To add a Ranking model you must specify the ModelTask as `RANKING` in the model info object.  

Additionally, you must provide the `group_by` argument that corresponds to the query search id. This `group_by` column should be present either in:
- `features` : if it is used to build and run the model
- `metadata_cols` : if not used by the model 

Optionally, you can give a `ranking_top_k` number (default is 50). This will be the number of results within each query to take into account while computing the performance metrics in monitoring.  

Unless the prediction column was part of your baseline dataset, you must provide the minimum and maximum values predictions can take in a dictionary format (see below).  

If your target is categorical (string), you need to provide the `categorical_target_class_details` argument. If your target is numerical and you don't specify this argument, Fiddler will infer it.

This argument is the **ordered** list of possible values for the target: the first element should be the least relevant target level and the last element the most relevant.

In [None]:
target = 'click_bool'
features = list(df.drop(columns=['click_bool', 'score']).columns)

model_info = fdl.ModelInfo.from_dataset_info(
    dataset_info=client.get_dataset_info(project_id=PROJECT_ID, dataset_id=DATASET_ID),
    target=target,
    features=features,
    input_type=fdl.ModelInputType.TABULAR,
    model_task=fdl.ModelTask.RANKING,
    outputs={'score':[-5.0, 3.0]},
    group_by='srch_id',
    ranking_top_k=20,
    categorical_target_class_details=[0, 1]
)

# inspect model info and modify as needed
model_info

In [None]:
MODEL_ID = 'expedia_model'

if MODEL_ID not in client.list_models(project_id=PROJECT_ID):
    client.add_model(
        project_id=PROJECT_ID,
        dataset_id=DATASET_ID,
        model_id=MODEL_ID,
        model_info=model_info
    )
else:
    print(f'Model: {MODEL_ID} already exists in Project: {PROJECT_ID}. Please use a different name.')

### 4.b Create a Model Wrapper Script

`package.py` is the interface between Fiddler’s backend and your model. This code helps Fiddler understand the model, its inputs, and its outputs.

You need to implement three parts:
- init: Load the model, and any associated files such as feature transformers.
- transform: If you use some pre-processing steps not part of the model file, transform the data into a format that the model recognizes.
- predict: Make predictions using the model.

In [None]:
%%writefile model/package.py

import pickle
from pathlib import Path
import pandas as pd

PACKAGE_PATH = Path(__file__).parent

class ModelPackage:

    def __init__(self):
        """
         Load the model file and any pre-processing files if needed.
        """
        self.output_columns = ['score']
        
        with open(PACKAGE_PATH / 'model.pkl', 'rb') as infile:
            self.model = pickle.load(infile)
    
    def transform(self, input_df):
        """
        Accepts a pandas DataFrame object containing rows of raw feature vectors. 
        Use pre-processing file to transform the data if needed. 
        In this example we don't need to transform the data.
        Outputs a pandas DataFrame object containing transformed data.
        """
        return input_df
    
    def predict(self, input_df):
        """
        Accepts a pandas DataFrame object containing rows of raw feature vectors. 
        Outputs a pandas DataFrame object containing the model predictions whose column labels 
        must match the output column names in model info.
        """
        transformed_df = self.transform(input_df)
        pred = self.model.predict(transformed_df)
        return pd.DataFrame(pred, columns=self.output_columns)
    
def get_model():
    return ModelPackage()

### 4.c Upload the model files to Fiddler


Now you can upload the model artifact files using `add_model_artifact`. 
   - The `model_dir` is the path for the folder containing the model file(s) and the `package.py` from the last step.

In [None]:
# Upload the model files (MODEL_ID was defined in step 4.a)
client.add_model_artifact(model_dir=model_dir, project_id=PROJECT_ID, model_id=MODEL_ID)

# 5. Send Traffic For Monitoring

### 5.a Gather and prepare Production Events
This is the production log file we are going to upload to Fiddler.

In [None]:
df_logs = pd.read_csv('https://media.githubusercontent.com/media/fiddler-labs/fiddler-examples/main/quickstart/data/expedia_logs.csv')
df_logs.tail()

In [None]:
df_logs['event_id'] = df_logs['event_id'].apply(str)
# Time-shift the data into the last 29 days, so that the latest event lands at the current time
df_logs['time_epoch'] = df_logs['time_epoch'] + (float(time.time()) - df_logs['time_epoch'].max())

For ranking, we need to ingest all events from a given Query ID together. To do that, we need to transform the data to a grouped format.  
You can use the `convert_flat_csv_data_to_grouped` utility function to do the transformation.


In [None]:
df_logs_grouped = fdl.utils.pandas.convert_flat_csv_data_to_grouped(input_data=df_logs, group_by_col='srch_id')

In [None]:
df_logs_grouped.head(2)
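
To build intuition for what a grouped format means, the sketch below collapses flat rows into one record per query, with every other column turned into a list. It is a conceptual illustration in plain Python with a hypothetical function name and made-up rows; the exact output layout of `convert_flat_csv_data_to_grouped` is not shown here and may differ.

```python
def group_events(rows, group_by_col):
    """Collapse flat rows into one record per group, with list-valued columns."""
    grouped = {}
    for row in rows:
        key = row[group_by_col]
        bucket = grouped.setdefault(key, {})
        for column, value in row.items():
            if column != group_by_col:
                bucket.setdefault(column, []).append(value)
    # One output record per query ID, in first-seen order
    return [{group_by_col: key, **columns} for key, columns in grouped.items()]

# Hypothetical flat log rows: two results for search 1, one for search 2
flat_rows = [
    {"srch_id": 1, "price_usd": 100, "click_bool": 0},
    {"srch_id": 1, "price_usd": 80, "click_bool": 1},
    {"srch_id": 2, "price_usd": 120, "click_bool": 0},
]
for record in group_events(flat_rows, "srch_id"):
    print(record)
```

Keeping all results of a query in a single record is what allows per-query metrics such as NDCG to be computed on each published event.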

### 5.b Publish events

In [None]:
client.publish_events_batch(project_id=PROJECT_ID,
                            model_id=MODEL_ID,
                            batch_source=df_logs_grouped,
                            id_field='event_id',
                            timestamp_field='time_epoch')

# 6. Get insights


**You're all done!**
  
You can now head to your Fiddler URL and start getting enhanced observability into your model's performance. Run the following code block to get your URL:



In [None]:
print('/'.join([URL, 'projects', PROJECT_ID, 'models', MODEL_ID, 'monitor']))

*Please allow 3-5 minutes for monitoring data to populate the charts.*

--------
**Questions?**  
  
Check out [our docs](https://docs.fiddler.ai/) for a more detailed explanation of what Fiddler has to offer.

Join our [community Slack](http://fiddler-community.slack.com/) to ask any questions!

If you're still looking for answers, fill out a ticket on [our support page](https://fiddlerlabs.zendesk.com/) and we'll get back to you shortly.