# Fiddler Ranking Model Quick Start Guide

With Fiddler, you can monitor and explain your ranking models.

### Monitoring
With Fiddler, you can monitor your ranking model.   
Feature and prediction drift work the same way as for other model types. Fiddler offers Jensen-Shannon divergence (JSD) and the Population Stability Index (PSI) as drift metrics, and you can choose the baseline against which drift is computed.  
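
To make the two drift metrics concrete, here is a minimal sketch of PSI and JSD computed on histograms of a baseline sample and a production sample. The binning strategy, the smoothing constant `eps`, and the function names are assumptions made for illustration; Fiddler's internal computation may differ.

```python
import numpy as np

def _binned_distributions(baseline, production, bins=10, eps=1e-6):
    """Histogram both samples on shared bin edges and return smoothed probabilities."""
    edges = np.histogram_bin_edges(np.concatenate([baseline, production]), bins=bins)
    p, _ = np.histogram(baseline, bins=edges)
    q, _ = np.histogram(production, bins=edges)
    # eps avoids log(0) and division by zero for empty bins (an illustrative choice)
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return p, q

def psi(baseline, production, bins=10):
    """Population Stability Index: sum over bins of (q - p) * ln(q / p)."""
    p, q = _binned_distributions(baseline, production, bins)
    return float(np.sum((q - p) * np.log(q / p)))

def jsd(baseline, production, bins=10):
    """Jensen-Shannon divergence in bits (base-2 logs), bounded between 0 and 1."""
    p, q = _binned_distributions(baseline, production, bins)
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * np.log2(p / m))
    kl_qm = np.sum(q * np.log2(q / m))
    return float(0.5 * kl_pm + 0.5 * kl_qm)
```

Both functions return 0 when the two samples are identical and grow as the production distribution moves away from the baseline.
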
Fiddler provides two ranking-specific performance metrics:
- MAP (Mean Average Precision) at k: available only for binary relevance ranking models.
- NDCG (Normalized Discounted Cumulative Gain) at k: available for both binary and graded relevance ranking models.  
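
As an illustration of what these two metrics measure, the sketch below computes NDCG@k and MAP@k with NumPy. It assumes the common exponential gain `2**rel - 1` for NDCG and normalizes average precision by the number of relevant items found in the top k; these conventions and the function names are assumptions, and Fiddler's exact definitions may differ.

```python
import numpy as np

def dcg_at_k(relevance, k):
    """Discounted cumulative gain over the first k items of a ranked list."""
    rel = np.asarray(relevance, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))  # log2(position + 1)
    return float(np.sum((2.0 ** rel - 1.0) / discounts))

def ndcg_at_k(relevance_in_ranked_order, k):
    """DCG of the model's ordering divided by the DCG of the ideal ordering."""
    ideal = sorted(relevance_in_ranked_order, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(relevance_in_ranked_order, k) / ideal_dcg

def average_precision_at_k(binary_relevance_in_ranked_order, k):
    """Mean of precision@i taken at each relevant position i within the top k."""
    rel = np.asarray(binary_relevance_in_ranked_order)[:k]
    if rel.sum() == 0:
        return 0.0
    precision_at_i = np.cumsum(rel) / np.arange(1, rel.size + 1)
    return float(np.sum(precision_at_i * rel) / rel.sum())

def map_at_k(queries, k):
    """Average of AP@k across queries; each query is a list of 0/1 labels in ranked order."""
    return float(np.mean([average_precision_at_k(q, k) for q in queries]))
```

Each input list holds the true relevance labels of one query's results, ordered by the model's score from highest to lowest. NDCG accepts graded labels such as 0, 1, 2, while the average precision helpers expect only 0 and 1.
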

Finally, you can also track Data Integrity and Service Metrics in Monitoring.

To use Ranking Monitoring, you must currently publish events with the `publish_events_batch` API. The `publish_event` API doesn't support ranking models yet.



### Explainability
#### Surrogate model
If you don't want to upload your own ranking model but would like to see how explainability works, you can ask Fiddler to create a surrogate model from your baseline dataset. This model is used to compute feature impact. The ranking surrogate model is implemented with the LightGBM library.

#### Upload your own ranking model
Alternatively, you can upload your own model artifact to get Explainability (point explanations, feature impact, and dependence plots).

#### Feature impact / importance
Feature impact (the average change in predictions when a feature is randomly ablated) is supported for ranking models. However, feature importance (the average increase in loss when a feature is randomly ablated) hasn't been implemented yet.  
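
The idea of ablation-based feature impact can be sketched as follows. This is an illustrative approximation, not Fiddler's implementation: it ablates a feature by shuffling its column and reports the mean absolute change in predictions. The function name, the shuffle-based ablation, and the use of absolute change are all assumptions.

```python
import numpy as np

def feature_impact(predict_fn, X, n_repeats=5, seed=0):
    """Mean absolute change in predictions when each feature column is shuffled."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    base = predict_fn(X)
    impact = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        changes = []
        for _ in range(n_repeats):
            X_ablated = X.copy()
            # Shuffling breaks the link between feature j and each row
            X_ablated[:, j] = rng.permutation(X[:, j])
            changes.append(np.mean(np.abs(predict_fn(X_ablated) - base)))
        impact[j] = np.mean(changes)
    return impact
```

Because only predictions are compared, no labels are needed. Feature importance, in contrast, would compare a loss before and after ablation and therefore requires the target.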

#### Point explanation
The SHAP algorithms (Fiddler-SHAP and traditional Kernel SHAP) have been modified to explain a result with respect to the rest of the query's results. For example, suppose query ID 'xyz' has 150 results and we want to understand why a particular item was ranked 3rd. The SHAP algorithms are run with a background dataset formed by the 150 results of query ID 'xyz'.
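
The effect of a per-query background can be illustrated with a toy linear scorer, for which the SHAP value of feature j is `w_j * (x_j - mean_j)` with the mean taken over the background rows. The data, the weights, and the helper name below are hypothetical; the sketch only shows how the background is restricted to a single query, not how Fiddler-SHAP works internally.

```python
import pandas as pd

def query_background(df, group_col, query_id):
    """Return all rows that share the given query ID; they act as the background."""
    return df.loc[df[group_col] == query_id].reset_index(drop=True)

# Hypothetical search results: three items for query 'xyz', one for query 'abc'
results = pd.DataFrame({
    "srch_id": ["xyz", "xyz", "xyz", "abc"],
    "price_usd": [120.0, 95.0, 210.0, 80.0],
    "prop_starrating": [4, 3, 5, 2],
})

# Hypothetical linear scorer: score = sum(weight * feature)
weights = pd.Series({"price_usd": -0.01, "prop_starrating": 0.5})

background = query_background(results, "srch_id", "xyz")
features = background[weights.index]
item = features.iloc[0]

# Exact SHAP values for a linear model relative to the per-query background
attributions = weights * (item - features.mean())
print(attributions)
```

The attributions sum to the item's score minus the average score of its own query, so they explain the item's standing among its competitors rather than against the whole dataset.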

#### Dependence plots
Dependence plots (ICE plots and PDP plots) are both supported for ranking models.
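
For readers unfamiliar with these plots, here is a minimal sketch of how ICE curves and a partial dependence curve can be computed for one feature. The function names and the NumPy-array interface are assumptions made for illustration.

```python
import numpy as np

def ice_curves(predict_fn, X, feature_idx, grid):
    """One curve per row: predictions as the chosen feature sweeps over the grid."""
    X = np.asarray(X, dtype=float)
    curves = np.empty((X.shape[0], len(grid)))
    for g, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, feature_idx] = value  # force the feature to the grid value for every row
        curves[:, g] = predict_fn(X_mod)
    return curves

def partial_dependence(predict_fn, X, feature_idx, grid):
    """The PDP is the average of the ICE curves at each grid value."""
    return ice_curves(predict_fn, X, feature_idx, grid).mean(axis=0)
```

Plotting each row of `ice_curves` gives the ICE plot, and plotting `partial_dependence` on the same axes gives the PDP.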


### Analytics
- The Evaluate tab hasn't been implemented for Ranking yet.
- All other Analytics functions are available.


### Fairness
Fairness hasn't been implemented for Ranking yet. 



# Example: Expedia search ranking
The following dataset comes from Expedia. It includes shopping and purchase data as well as information on price competitiveness. The data are organized around a set of “search result impressions”, that is, the ordered list of hotels that users see after they search for a hotel on the Expedia website. In addition to impressions from the existing algorithm, the data contain impressions where the hotels were randomly sorted, to avoid the position bias of the existing algorithm. The user response is provided as a click on a hotel. From: https://www.kaggle.com/c/expedia-personalized-sort/overview

In [13]:
import pandas as pd
import lightgbm as lgb
import numpy as np

# 1. Connect to Fiddler and Create a Project
First we install and import the Fiddler Python client.

In [14]:
!pip install -q fiddler-client
import fiddler as fdl
print(f"Running client version {fdl.__version__}")

Running client version 1.7.3


Before you can add information about your model with Fiddler, you'll need to connect using our API client.

---

**We need a few pieces of information to get started.**
1. The URL you're using to connect to Fiddler
2. Your organization ID
3. Your authorization token

The latter two of these can be found by pointing your browser to your Fiddler URL and navigating to the **Settings** page.

In [16]:
URL = 'https://mainbuild.dev.fiddler.ai' # Make sure to include the full URL (including https://). For example, https://abc.xyz.ai
ORG_ID = 'mainbuild'
AUTH_TOKEN = 'xTMiuhAMiBR__WCQ5zgpDQaBZA2p4Q2fh_hPnJGKPW8'
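
If you would rather not keep credentials in the notebook, one option is to read them from environment variables. The sketch below is only a suggestion: the variable names `FIDDLER_URL`, `FIDDLER_ORG_ID`, and `FIDDLER_AUTH_TOKEN` and the helper `load_fiddler_settings` are hypothetical and are not defined by the Fiddler client.

```python
import os

def load_fiddler_settings(env=None):
    """Read connection settings from environment variables (hypothetical names)."""
    env = os.environ if env is None else env
    names = ("FIDDLER_URL", "FIDDLER_ORG_ID", "FIDDLER_AUTH_TOKEN")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "url": env["FIDDLER_URL"],
        "org_id": env["FIDDLER_ORG_ID"],
        "auth_token": env["FIDDLER_AUTH_TOKEN"],
    }
```

The returned keys match the keyword arguments used in this guide, so the result can be passed as `fdl.FiddlerApi(**load_fiddler_settings())`.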

Next we run the following code block to connect to the Fiddler API.

In [17]:
client = fdl.FiddlerApi(url=URL, org_id=ORG_ID, auth_token=AUTH_TOKEN)

Once you connect, you can create a new project by specifying a unique project ID in the client's `create_project` function.

In [20]:
PROJECT_ID = 'search_ranking1'

if PROJECT_ID not in client.list_projects():
    print(f'Creating project: {PROJECT_ID}')
    client.create_project(PROJECT_ID)
else:
    print(f'Project: {PROJECT_ID} already exists')

Creating project: search_ranking1


# 2. Upload the Baseline Dataset

Now we retrieve the Expedia Dataset as a baseline for this model.

In [7]:
df = pd.read_csv("https://media.githubusercontent.com/media/fiddler-labs/fiddler-examples/main/quickstart/data/expedia_baseline_data.csv")
df.head()

Unnamed: 0,srch_id,site_id,visitor_location_country_id,visitor_hist_starrating,visitor_hist_adr_usd,prop_country_id,prop_id,prop_starrating,prop_review_score,prop_brand_bool,...,comp8_rate_percent_diff,click_bool,weekday,week_of_year,hour_time,minute_time,time_epoch,early_night,nans_count,score
0,1,12,187,,,219,893,3,3.5,1,...,,0,3,14,8,32,1365064000.0,False,20,-0.762692
1,1,12,187,,,219,10404,4,4.0,1,...,,0,3,14,8,32,1365064000.0,False,22,-1.412878
2,1,12,187,,,219,21315,3,4.5,1,...,,0,3,14,8,32,1365064000.0,False,20,-1.344691
3,1,12,187,,,219,27348,2,4.0,1,...,5.0,0,3,14,8,32,1365064000.0,False,17,-2.391245
4,1,12,187,,,219,29604,4,3.5,1,...,,0,3,14,8,32,1365064000.0,False,20,-0.723813


Fiddler uses this baseline dataset to keep track of important information about your data.
  
This includes **data types**, **data ranges**, and **unique values** for categorical variables.

---

You can construct a `DatasetInfo` object to be used as **a schema for keeping track of this information** by running the following code block.

In [21]:
dataset_info = fdl.DatasetInfo.from_dataframe(df=df, max_inferred_cardinality=100)
dataset_info

Unnamed: 0,column,dtype,count(possible_values),is_nullable,value_range
0,srch_id,INTEGER,,False,"1 - 1,670"
1,site_id,INTEGER,,False,1 - 34
2,visitor_location_country_id,INTEGER,,False,2 - 224
3,visitor_hist_starrating,FLOAT,,True,2.3 - 5.0
4,visitor_hist_adr_usd,FLOAT,,True,51.77 - 518.3
5,prop_country_id,INTEGER,,False,4 - 230
6,prop_id,INTEGER,,False,"1 - 140,800"
7,prop_starrating,INTEGER,,False,0 - 5
8,prop_review_score,FLOAT,,True,0.0 - 5.0
9,prop_brand_bool,INTEGER,,False,0 - 1


Then use the client's [upload_dataset](https://docs.fiddler.ai/reference/clientupload_dataset) function to send this information to Fiddler!
  
*Just include:*
1. A unique dataset ID
2. The baseline dataset as a pandas DataFrame
3. The [DatasetInfo](https://docs.fiddler.ai/reference/fdldatasetinfo) object you just created

In [23]:
DATASET_ID = 'expedia_data'
client.upload_dataset(project_id=PROJECT_ID,
                      dataset={'baseline': df},
                      dataset_id=DATASET_ID,
                      info=dataset_info)

{'uuid': '264b04cb-aabd-47bf-95cd-a7ff7eb04ada',
 'name': 'Ingestion dataset Upload',
 'info': {'project_name': 'search_ranking1',
  'resource_name': 'expedia_data',
  'resource_type': 'DATASET'},
 'status': 'SUCCESS',
 'progress': 100.0,
 'error_message': None,
 'error_reason': None}

# 3. Creating the Model

To explain a model's inner workings, we need to upload the model artifacts. Let's train a ranking model on the sample Expedia data that we just downloaded.
The following model is trained with **lightgbm 2.3.0**.

### 3.a Data Preparation

In [24]:
# Create training and validation splits by search ID (cutoff at the 0.94 quantile of srch_id)
cutoff_id = df["srch_id"].quantile(0.94)

X_train = df.loc[df.srch_id < cutoff_id].drop(["click_bool", 'score'], axis=1)
X_eval = df.loc[df.srch_id >= cutoff_id].drop(["click_bool", 'score'], axis=1)
y_train = df.loc[df.srch_id < cutoff_id]["click_bool"]
y_eval = df.loc[df.srch_id >= cutoff_id]["click_bool"]

### 3.b Training 

In [29]:
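gbm = lgb.LGBMRanker()
# `group` holds the number of rows per query, in row order.
# np.unique returns counts sorted by srch_id, so this assumes X_train is already sorted by srch_id.
groups = np.unique(X_train.srch_id, return_counts=True)
groups_number = list(groups[1])
gbm.fit(X_train, y_train, group=groups_number)
gbm.predict(X_eval)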

array([-0.54926111, -2.13484523, -2.77795386, ..., -1.46592082,
       -0.20929101, -1.97752888])
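
LightGBM's `group` argument lists the number of consecutive rows belonging to each query, so the rows of one query must be contiguous and the sizes must follow row order. If your data is not already sorted by query ID, a small helper like the one below can sort it and compute matching group sizes. The helper name `group_sizes` is hypothetical.

```python
import pandas as pd

def group_sizes(frame, group_col):
    """Sort rows so each query is contiguous and return the per-query row counts in that order."""
    # mergesort is stable, so rows keep their relative order within a query
    ordered = frame.sort_values(group_col, kind="mergesort")
    sizes = ordered.groupby(group_col, sort=False).size().tolist()
    return ordered, sizes

# Toy example with interleaved query IDs
toy = pd.DataFrame({"srch_id": [2, 1, 2, 1, 3], "price_usd": [10.0, 20.0, 30.0, 40.0, 50.0]})
ordered, sizes = group_sizes(toy, "srch_id")
print(sizes)
```

The sizes always sum to the number of rows, which is the condition LightGBM checks when fitting a ranker.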

### 3.c Saving the Model

We need to create a new folder and add the pieces needed for Fiddler to use/run your model. This folder will have:

- your model file saved in the format of your choice (JSON, pickle, h5, ...)
- a wrapper: package.py (created in the next step)

In [39]:
import pickle
import pathlib

# Create the model folder; exist_ok avoids an error when the cell is re-run
model_dir = pathlib.Path('model')
model_dir.mkdir(exist_ok=True)

In [40]:
# Save the trained model as a pickle file
with open(model_dir / 'model.pkl', 'wb') as outfile:
    pickle.dump(gbm, outfile)

# 4. Share Model Metadata and Upload the Model

Now let's add this model we just created to Fiddler.

### 4.a Adding model to Fiddler
To add a Ranking model you must specify the ModelTask as `RANKING` in the model info object.  

Additionally, you must provide the `group_by` argument that corresponds to the query search id. This `group_by` column should be present either in:
- `features` : if it is used to build and run the model
- `metadata_cols` : if not used by the model 

Optionally, you can give a `ranking_top_k` number (default is 50). This will be the number of results within each query to take into account while computing the performance metrics in monitoring.  

Unless the prediction column was part of your baseline dataset, you must provide the minimum and maximum values predictions can take in a dictionary format (see below).  

If your target is categorical (string), you need to provide the `categorical_target_class_details` argument. If your target is numerical and you don't specify this argument, Fiddler will infer it.   

This argument is the **ordered** list of possible target values. The first element should be the least relevant target level, and the last element should be the most relevant target level.

In [41]:
target = 'click_bool'
features = list(df.drop(columns=['click_bool', 'score']).columns)

model_info = fdl.ModelInfo.from_dataset_info(
    dataset_info=client.get_dataset_info(project_id=PROJECT_ID, dataset_id=DATASET_ID),
    target=target,
    features=features,
    input_type=fdl.ModelInputType.TABULAR,
    model_task=fdl.ModelTask.RANKING,
    outputs={'score':[-5.0, 3.0]},
    group_by='srch_id',
    ranking_top_k=20,
    categorical_target_class_details=[0, 1]
)

# inspect model info and modify as needed
model_info

Unnamed: 0,column,dtype,count(possible_values),is_nullable,value_range
0,click_bool,INTEGER,,False,0 - 1

Unnamed: 0,column,dtype,count(possible_values),is_nullable,value_range
0,srch_id,INTEGER,,False,"1 - 1,670"
1,site_id,INTEGER,,False,1 - 34
2,visitor_location_country_id,INTEGER,,False,2 - 224
3,visitor_hist_starrating,FLOAT,,True,2 - 5
4,visitor_hist_adr_usd,FLOAT,,True,51 - 518
5,prop_country_id,INTEGER,,False,4 - 230
6,prop_id,INTEGER,,False,"1 - 140,800"
7,prop_starrating,INTEGER,,False,0 - 5
8,prop_review_score,FLOAT,,True,0 - 5
9,prop_brand_bool,INTEGER,,False,0 - 1

Unnamed: 0,column,dtype,count(possible_values),is_nullable,value_range
0,score,FLOAT,,False,-5.0 - 3.0


In [43]:
MODEL_ID = 'expedia_model'

if MODEL_ID not in client.list_models(project_id=PROJECT_ID):
    client.add_model(
        project_id=PROJECT_ID,
        dataset_id=DATASET_ID,
        model_id=MODEL_ID,
        model_info=model_info
    )
else:
    print(f'Model: {MODEL_ID} already exists in Project: {PROJECT_ID}. Please use a different name.')

Model: expedia_model already exists in Project: search_ranking1. Please use a different name.


### 4.b Create a Model Wrapper Script

`package.py` is the interface between Fiddler’s backend and your model. This code helps Fiddler understand the model, its inputs, and its outputs.

You need to implement three parts:
- init: Load the model, and any associated files such as feature transformers.
- transform: If you use some pre-processing steps not part of the model file, transform the data into a format that the model recognizes.
- predict: Make predictions using the model.

In [46]:
%%writefile model/package.py

import pickle
from pathlib import Path
import pandas as pd

PACKAGE_PATH = Path(__file__).parent

class ModelPackage:

    def __init__(self):
        """
         Load the model file and any pre-processing files if needed.
        """
        self.output_columns = ['score']
        
        with open(PACKAGE_PATH / 'model.pkl', 'rb') as infile:
            self.model = pickle.load(infile)
    
    def transform(self, input_df):
        """
        Accepts a pandas DataFrame object containing rows of raw feature vectors. 
        Use pre-processing file to transform the data if needed. 
        In this example we don't need to transform the data.
        Outputs a pandas DataFrame object containing transformed data.
        """
        return input_df
    
    def predict(self, input_df):
        """
        Accepts a pandas DataFrame object containing rows of raw feature vectors. 
        Outputs a pandas DataFrame object containing the model predictions whose column labels 
        must match the output column names in model info.
        """
        transformed_df = self.transform(input_df)
        pred = self.model.predict(transformed_df)
        return pd.DataFrame(pred, columns=self.output_columns)
    
def get_model():
    return ModelPackage()

Overwriting model/package.py


### 4.c Upload to Fiddler

1. Add the model schema using `add_model`.
2. Now you can upload the model artifact files using `add_model_artifact`.
    - The `model_dir` is the path to the folder containing the model file(s) and the `package.py` from the last step.

In [48]:
# Remove any existing model with this ID (for example, the one added in 4.a) before re-adding it
client.delete_model(project_id=PROJECT_ID, model_id=MODEL_ID)
# Share model metadata
client.add_model(project_id=PROJECT_ID, model_id=MODEL_ID, dataset_id=DATASET_ID, model_info=model_info)
# Upload model files
client.add_model_artifact(model_dir=model_dir, project_id=PROJECT_ID, model_id=MODEL_ID)

# 5. Send Traffic For Monitoring

### 5.a Prepare Production Events
This is the production log file we are going to upload to Fiddler.

In [10]:
df_logs = pd.read_csv('https://media.githubusercontent.com/media/fiddler-labs/fiddler-examples/main/quickstart/data/expedia_logs.csv')
df_logs.head()

Unnamed: 0,srch_id,date_time,site_id,visitor_location_country_id,visitor_hist_starrating,visitor_hist_adr_usd,prop_country_id,prop_id,prop_starrating,prop_review_score,...,click_bool,weekday,week_of_year,hour_time,minute_time,time_epoch,early_night,nans_count,score,event_id
0,1672,2021-08-17 14:08:14,14,100,,,100,3674,2,4.5,...,0,3,25,14,8,1371737000.0,False,27,-0.303444,0
1,1672,2021-08-17 14:08:14,14,100,,,100,21062,4,4.5,...,0,3,25,14,8,1371737000.0,False,27,0.12773,1
2,1672,2021-08-17 14:08:14,14,100,,,100,29006,3,4.0,...,0,3,25,14,8,1371737000.0,False,27,-2.540481,2
3,1672,2021-08-17 14:08:14,14,100,,,100,42013,3,4.0,...,0,3,25,14,8,1371737000.0,False,27,-0.210803,3
4,1672,2021-08-17 14:08:14,14,100,,,100,43987,4,4.5,...,0,3,25,14,8,1371737000.0,False,27,-1.249932,4


In [67]:
df_logs['event_id'] = df_logs['event_id'].apply(str)

For ranking, we need to ingest all events from a given Query ID together. To do that, we need to transform the data to a grouped format.  
You can use the `convert_flat_csv_data_to_grouped` utility function to do the transformation.


In [68]:
df_logs_grouped = fdl.utils.pandas.convert_flat_csv_data_to_grouped(input_data=df_logs, group_by_col='srch_id')
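
To give a rough idea of what a grouped format can look like, the sketch below collapses a flat table into one row per query, with every other column holding the list of values for that query. This is an assumption about the general shape, written with plain pandas on made-up data; the actual output of `convert_flat_csv_data_to_grouped` may differ in details.

```python
import pandas as pd

# Made-up flat events: one row per (query, result) pair
flat = pd.DataFrame({
    "srch_id": [1, 1, 2],
    "prop_id": [893, 10404, 3674],
    "score": [-0.76, -1.41, -0.30],
})

# One row per query; each remaining column becomes a list of that query's values
grouped = flat.groupby("srch_id", sort=False).agg(list).reset_index()
print(grouped)
```

Keeping a query's results in a single row means they are always ingested together.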

### 5.b Publish events

In [70]:
client.publish_events_batch(project_id=PROJECT_ID,
                            model_id=MODEL_ID,
                            batch_source=df_logs_grouped,
                            id_field='event_id',
                            timestamp_field='date_time')

{'status': 202,
 'job_uuid': 'd2ef98ae-40ca-4f12-9b81-24a136e821a1',
 'files': ['tmp7rvyqwah.csv'],
 'message': 'Successfully received the event data. Please allow time for the event ingestion to complete in the Fiddler platform.'}

# 6. Get insights


**You're all done!**
  
You can now head to your Fiddler URL and start getting enhanced observability into your model's performance. Run the following code block to get your URL:



In [66]:
print('/'.join([URL, 'projects', PROJECT_ID, 'models', MODEL_ID, 'monitor']))

https://mainbuild.dev.fiddler.ai/projects/search_ranking1/models/expedia_model/monitor


--------
**Questions?**  
  
Check out [our docs](https://docs.fiddler.ai/) for a more detailed explanation of what Fiddler has to offer.

Join our [community Slack](http://fiddler-community.slack.com/) to ask any questions!

If you're still looking for answers, fill out a ticket on [our support page](https://fiddlerlabs.zendesk.com/) and we'll get back to you shortly.