# Monitoring a Multiclass Classifier Model with Text Inputs
Unstructured data such as text are usually represented as high-dimensional vectors when processed by ML models. In this example notebook we present how [Fiddler Vector Monitoring](https://www.fiddler.ai/blog/monitoring-natural-language-processing-and-computer-vision-models-part-1) can be used to monitor NLP models using a text classification use case.

Following the steps in this notebook you can see how to onboard models that deal with unstructured text inputs. In this example, we use the 20Newsgroups dataset and train a multi-class classifier that is applied to vector embeddings of text documents.

We monitor this model at production time and assess the performance of Fiddler's vector monitoring by manufacturing synthetic drift via sampling from specific text categories at different deployment time intervals.
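
As a rough illustration of the idea (not the script used to build the production dataset for this notebook), the sketch below draws documents from a small labeled pool and over-samples one category inside a chosen time window. The function name `sample_window`, the boost weight, and the toy pool are all assumptions made for this example.

```python
import random
from collections import Counter

CATEGORIES = ['computer', 'forsale', 'recreation', 'religion', 'science']

def sample_window(pool, n, boost_category=None, boost_weight=5.0, seed=0):
    """Draw n (category, text) pairs, optionally over-sampling one category."""
    rng = random.Random(seed)
    # Documents from the boosted category get a larger sampling weight.
    weights = [boost_weight if cat == boost_category else 1.0 for cat, _ in pool]
    return rng.choices(pool, weights=weights, k=n)

# Toy pool: 20 placeholder documents per category.
pool = [(cat, f'{cat} doc {i}') for cat in CATEGORIES for i in range(20)]

baseline_window = sample_window(pool, 1000)
drifted_window = sample_window(pool, 1000, boost_category='science')

baseline_share = Counter(cat for cat, _ in baseline_window)['science'] / 1000
drifted_share = Counter(cat for cat, _ in drifted_window)['science'] / 1000
print(f'science share: baseline={baseline_share:.2f}, drifted={drifted_share:.2f}')
```

In the drifted window the boosted category makes up a clearly larger share of the traffic, which is the kind of shift a drift monitor is expected to flag.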

---

Now we perform the following steps to demonstrate how Fiddler NLP monitoring works:

1. Connect to Fiddler
2. Load a Data Sample
3. Define the Model Specifications
4. Create a Model
5. Publish a Static Baseline (Optional)
6. Publish Production Events

Get insights!

## 0. Imports

In [1]:

%pip install -q fiddler-client

import time

import numpy as np
import pandas as pd
import fiddler as fdl

print(f"Running Fiddler Python client version {fdl.__version__}")


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.
Running Fiddler Python client version 3.4.0


## 1. Connect to Fiddler

Before you can add information about your model to Fiddler, you'll need to connect using the Fiddler Python client.


---


**We need a couple of pieces of information to get started.**
1. The URL you're using to connect to Fiddler
2. Your authorization token

Your authorization token can be found by navigating to the **Credentials** tab on the **Settings** page of your Fiddler environment.

In [2]:
URL = ''  # Make sure to include the full URL (including https:// e.g. 'https://your_company_name.fiddler.ai').
TOKEN = ''
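
If you would rather not paste credentials into the notebook, one option is to read them from environment variables. The sketch below is a minimal example of that pattern; the variable names `FIDDLER_URL` and `FIDDLER_TOKEN` and the helper `load_credentials` are assumptions for illustration, not names defined by the Fiddler client.

```python
import os

def load_credentials(env=os.environ):
    """Return (url, token) read from a mapping, raising if either looks wrong."""
    url = env.get('FIDDLER_URL', '').strip()
    token = env.get('FIDDLER_TOKEN', '').strip()
    if not url.startswith('https://'):
        raise ValueError("FIDDLER_URL must be a full URL starting with 'https://'")
    if not token:
        raise ValueError('FIDDLER_TOKEN is empty')
    # Drop a trailing slash so the URL has a single canonical form.
    return url.rstrip('/'), token

# Demonstration with an explicit mapping instead of the real environment.
example_env = {
    'FIDDLER_URL': 'https://your_company_name.fiddler.ai/',
    'FIDDLER_TOKEN': 'example-token',
}
URL_EXAMPLE, TOKEN_EXAMPLE = load_credentials(example_env)
print(URL_EXAMPLE)
```

Calling `load_credentials()` with no argument reads from the real process environment, and the returned values can then be assigned to `URL` and `TOKEN`.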

These are the constants for this example notebook. Change them as needed to create your own versions.

In [3]:
PROJECT_NAME = 'quickstart_examples'
MODEL_NAME = 'nlp_newsgroups_multiclass'

STATIC_BASELINE_NAME = 'baseline_dataset'

# Sample data hosted on GitHub
PATH_TO_SAMPLE_CSV = 'https://raw.githubusercontent.com/fiddler-labs/fiddler-examples/main/quickstart/data/v3/nlp_text_multiclassifier_data_sample.csv'
PATH_TO_EVENTS_CSV = 'https://media.githubusercontent.com/media/fiddler-labs/fiddler-examples/main/quickstart/data/v3/nlp_text_multiclassifier_production_data.csv'

Now just run the following to connect to your Fiddler environment.

In [4]:
fdl.init(url=URL, token=TOKEN)

#### 1.a Create New or Load Existing Project

Once you connect, you can create a new project by passing a unique project name to the `fdl.Project` constructor and calling the `create()` method. If a project with that name already exists, `create()` raises `fdl.Conflict`, and the code below loads the existing project instead.

In [5]:
try:
    # Create project
    project = fdl.Project(name=PROJECT_NAME).create()
    print(f'New project created with id = {project.id} and name = {project.name}')
except fdl.Conflict:
    # Get project by name
    project = fdl.Project.from_name(name=PROJECT_NAME)
    print(f'Loaded existing project with id = {project.id} and name = {project.name}')

Loaded existing project with id = 70b74177-c712-44b1-b431-2377c1b908ab and name = quickstart_examples


## 2. Load a Data Sample

Now we retrieve the 20Newsgroups dataset. This dataset is fetched from the [scikit-learn real-world datasets](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html#) and pre-processed using [this notebook](https://colab.research.google.com/github/fiddler-labs/fiddler-examples/blob/main/pre-proccessing/20newsgroups_prep_vectorization.ipynb). For simplicity's sake, we have stored a copy in our GitHub repo.

In [6]:
sample_data_df = pd.read_csv(PATH_TO_SAMPLE_CSV)
sample_data_df

Unnamed: 0,predicted_target,prob_computer,prob_forsale,prob_recreation,prob_religion,prob_science,original_text,original_target,target,n_tokens,string_size,timestamp
0,science,0.049720,0.015294,0.040910,0.025433,0.868644,: Has anyone ever heard of a food product call...,sci.space,science,87,482,1713288933452
1,computer,0.749487,0.045994,0.019984,0.016209,0.168327,Getting an image from a computer monitor to a ...,comp.graphics,computer,108,589,1713289192800
2,forsale,0.068282,0.739602,0.088766,0.032307,0.071043,The following used CD's are for sale. They ar...,misc.forsale,forsale,128,775,1713289452148
3,science,0.060254,0.036090,0.072608,0.028312,0.802736,"Since your MOSFET is a 1972 vintage, it's prob...",sci.electronics,science,117,652,1713289711496
4,computer,0.259612,0.088936,0.198272,0.206862,0.246319,Are 'Moody Monthly' and 'Moody' the same magaz...,soc.religion.christian,religion,26,148,1713289970844
...,...,...,...,...,...,...,...,...,...,...,...,...
2328,science,0.284201,0.027899,0.018825,0.012399,0.656676,Yes. I use 74HC4066 and others commerically f...,sci.electronics,science,100,513,1713892696059
2329,science,0.140781,0.017383,0.120168,0.063662,0.658006,Even if they somehow address this issue it is ...,sci.crypt,science,38,222,1713892955407
2330,computer,0.937898,0.017932,0.009471,0.007481,0.027219,"Hello, I am searching for rendering softw...",comp.graphics,computer,39,215,1713893214755
2331,science,0.153452,0.039111,0.073378,0.030796,0.703263,For those of you interested in the above Proce...,sci.med,science,114,556,1713893474103
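
In the rows shown above, `predicted_target` matches the class whose `prob_*` column is largest. The sketch below checks that on the first five rows (values copied from the output) and computes accuracy against `target`. The helper name `predicted_class` is an assumption for this example, and the argmax rule is inferred from the visible rows rather than stated by the dataset.

```python
CLASS_COLUMNS = ['prob_computer', 'prob_forsale', 'prob_recreation',
                 'prob_religion', 'prob_science']

# (probabilities in CLASS_COLUMNS order, target) for the first five sample rows.
rows = [
    ([0.049720, 0.015294, 0.040910, 0.025433, 0.868644], 'science'),
    ([0.749487, 0.045994, 0.019984, 0.016209, 0.168327], 'computer'),
    ([0.068282, 0.739602, 0.088766, 0.032307, 0.071043], 'forsale'),
    ([0.060254, 0.036090, 0.072608, 0.028312, 0.802736], 'science'),
    ([0.259612, 0.088936, 0.198272, 0.206862, 0.246319], 'religion'),
]

def predicted_class(probabilities):
    """Return the class name whose probability is largest."""
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return CLASS_COLUMNS[best_index][len('prob_'):]

predictions = [predicted_class(probs) for probs, _ in rows]
accuracy = sum(pred == target for pred, (_, target) in zip(predictions, rows)) / len(rows)
print(predictions, accuracy)
```

The fifth row is the only miss here: its probabilities are spread fairly evenly, and the largest one (`computer`) does not match the `religion` target.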


## 3. Define the Model Specifications

Next we should tell Fiddler a bit more about the model by creating a `ModelSpec` object that specifies the model's inputs, outputs, targets, and metadata, along with other information such as the enrichments we want performed on our model's data.

### Instruct Fiddler to generate embeddings for our unstructured model input

Fiddler offers a powerful set of enrichment services that we can use to enhance how we monitor our model's performance. In this example, we instruct Fiddler to generate embeddings for our unstructured text. Each generated embedding is a numerical vector that represents the content and context of our unstructured input field, _original_text_. These embeddings then power Fiddler's vector monitoring capability for detecting drift.

Before creating a `ModelSpec` object, we define a custom feature using the [fdl.Enrichment()](https://docs.fiddler.ai/python-client-3-x/api-methods-30#fdl.enrichment-private-preview) API. When creating an enrichment, a name must be assigned to the custom feature using the `name` argument. Each enrichment appears in the monitoring tab in the Fiddler UI with this assigned name. Finally, the default clustering setup can be modified by passing the number of cluster centroids to the `n_clusters` argument of `fdl.TextEmbedding`.

Here we define an [embedding fdl.Enrichment](https://docs.fiddler.ai/python-client-3-x/api-methods-30#embedding-private-preview) and then use that embedding enrichment to create a [fdl.TextEmbedding](https://docs.fiddler.ai/python-client-3-x/api-methods-30#customfeature) input that can be used to track drift and can be plotted in Fiddler's UMAP visualizations.

In [7]:
fiddler_backend_enrichments = [
    fdl.Enrichment(
        name='Enrichment Text Embedding',
        enrichment='embedding',
        columns=['original_text'],
    ),
    fdl.TextEmbedding(
        name='Original TextEmbedding',
        source_column='original_text',
        column='Enrichment Text Embedding',
        n_clusters=6,
    ),
]
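
To build some intuition for what the number of cluster centroids controls, the toy sketch below shows one common way to turn embedding vectors into a drift score: assign each vector to its nearest centroid, compare the resulting cluster histograms for baseline and production data, and summarize the difference with the Jensen-Shannon divergence. This is an illustrative approximation under stated assumptions, not Fiddler's implementation; the 2-D vectors, the three fixed centroids, and all function names are made up for the example.

```python
import math

def nearest_centroid(vector, centroids):
    """Index of the centroid closest to vector (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(vector, centroids[i]))

def cluster_histogram(vectors, centroids):
    """Fraction of vectors assigned to each centroid."""
    counts = [0] * len(centroids)
    for vector in vectors:
        counts[nearest_centroid(vector, centroids)] += 1
    return [count / len(vectors) for count in counts]

def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2, so the value lies in [0, 1])."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Pretend these centroids came from clustering the baseline embeddings.
centroids = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
baseline = [(0.1, 0.2), (0.3, -0.1), (5.1, 4.9), (4.8, 5.2), (9.9, 0.1), (10.2, -0.3)]
production = [(0.2, 0.1), (5.0, 5.1), (9.8, 0.2), (10.1, 0.0), (10.3, 0.1), (9.7, -0.2)]

baseline_hist = cluster_histogram(baseline, centroids)
production_hist = cluster_histogram(production, centroids)
drift = js_divergence(baseline_hist, production_hist)
print(baseline_hist, production_hist, round(drift, 3))
```

With more centroids the histogram becomes finer-grained, so smaller shifts in the embedding space can show up, at the cost of noisier bins when traffic is low.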

In [8]:
model_spec = fdl.ModelSpec(
    inputs=['original_text'],
    outputs=[col for col in sample_data_df.columns if col.startswith('prob_')],
    targets=['target'],
    metadata=['n_tokens', 'string_size', 'timestamp'],
    custom_features=fiddler_backend_enrichments,
)

If you have columns in your ModelSpec which denote **prediction IDs or timestamps**, then Fiddler can use these to power its analytics accordingly. Here we are just noting the column with the event datetime.

In [9]:
timestamp_column = 'timestamp'

Fiddler supports a variety of model tasks. In this case, we set the task type to multi-class classification to tell Fiddler what type of model we are monitoring.

In [10]:
model_task = fdl.ModelTask.MULTICLASS_CLASSIFICATION

task_params = fdl.ModelTaskParams(
    # Strip the 5-character 'prob_' prefix so class names match the target labels
    target_class_order=[col[5:] for col in model_spec.outputs]
)
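
The class order above is derived from the output column names. As a small standalone illustration, using the `prob_*` column names from the sample data, slicing off the `prob_` prefix yields class labels in the same order as the probability columns:

```python
# Output columns as they appear in the sample data.
output_columns = ['prob_computer', 'prob_forsale', 'prob_recreation',
                  'prob_religion', 'prob_science']

prefix = 'prob_'
# Guard against a column that does not follow the naming convention.
assert all(col.startswith(prefix) for col in output_columns)

# Equivalent to col[5:] because len('prob_') == 5.
target_class_order = [col[len(prefix):] for col in output_columns]

# Each class label lines up with its probability column by position.
class_to_column = dict(zip(target_class_order, output_columns))
print(target_class_order)
```

Keeping the two lists aligned by position matters because each probability column is interpreted as the score for the class at the same index.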

## 4. Create a Model

Once we've specified information about our model, we can onboard (add) it to Fiddler by building a `Model` object with `fdl.Model.from_data` and then calling its `create()` method.

In [11]:
model = fdl.Model.from_data(
    name=MODEL_NAME,
    project_id=project.id,
    source=sample_data_df,
    spec=model_spec,
    task=model_task,
    task_params=task_params,
    event_ts_col=timestamp_column,
)

model.create()
print(f'New model created with id = {model.id} and name = {model.name}')

New model created with id = 81ee43cf-3ae3-4d2f-a4c7-d49bee8f4422 and name = nlp_newsgroups_multiclass


## 5. Publish a Static Baseline (Optional)

Since Fiddler already knows how to process data for your model, we can now add a **baseline dataset**.

You can think of this as a static dataset which represents **"golden data,"** or the kind of data your model expects to receive. 

Note: The data given during model creation is used purely for schema inference and does not serve as a default baseline.

Then, once we start sending production data to Fiddler, you'll be able to see **drift scores** telling you whenever the production data starts to diverge from this static baseline.

***

Let's publish our **original data sample** as a pre-production dataset. This will automatically add it as a baseline for the model.


*For more information on how to design your baseline dataset, [click here](https://docs.fiddler.ai/client-guide/creating-a-baseline-dataset).*

In [12]:
baseline_publish_job = model.publish(
    source=sample_data_df,
    environment=fdl.EnvType.PRE_PRODUCTION,
    dataset_name=STATIC_BASELINE_NAME,
)
print(
    f'Initiated pre-production environment data upload with Job ID = {baseline_publish_job.id}'
)

# Uncomment the line below to wait for the job to finish, otherwise it will run in the background.
# You can check the status on the Jobs page in the Fiddler UI or use the job ID to query the job status via the API.
# baseline_publish_job.wait()

Initiated pre-production environment data upload with Job ID = 5582971f-713b-48cd-8860-40e386c39def


## 6. Publish Production Events

Now let's publish some sample production events into Fiddler. For the timestamps in the middle of the event dataset, we've over-sampled from certain topics, which introduces synthetic drift. Injecting drift this way lets us better illustrate how Fiddler's vector monitoring approach detects drift in our unstructured inputs.

In [13]:
production_data_df = pd.read_csv(PATH_TO_EVENTS_CSV)

# Shift the timestamps of the production events to be as recent as today
production_data_df['timestamp'] = production_data_df['timestamp'] + (
    int(time.time() * 1000) - production_data_df['timestamp'].max()
)
production_data_df

Unnamed: 0,predicted_target,prob_computer,prob_forsale,prob_recreation,prob_religion,prob_science,original_text,original_target,target,n_tokens,string_size,timestamp
0,science,0.049720,0.015294,0.040910,0.025433,0.868644,: Has anyone ever heard of a food product call...,sci.space,science,87,482,1729540281307
1,computer,0.749487,0.045994,0.019984,0.016209,0.168327,Getting an image from a computer monitor to a ...,comp.graphics,computer,108,589,1729571654307
2,forsale,0.068282,0.739602,0.088766,0.032307,0.071043,The following used CD's are for sale. They ar...,misc.forsale,forsale,128,775,1729607370307
3,science,0.060254,0.036090,0.072608,0.028312,0.802736,"Since your MOSFET is a 1972 vintage, it's prob...",sci.electronics,science,117,652,1729603415307
4,computer,0.259612,0.088936,0.198272,0.206862,0.246319,Are 'Moody Monthly' and 'Moody' the same magaz...,soc.religion.christian,religion,26,148,1729541667307
...,...,...,...,...,...,...,...,...,...,...,...,...
23928,science,0.079724,0.029163,0.174876,0.204127,0.512110,Stupid me. I believed the Democrats stood for ...,sci.crypt,science,176,858,1731666221307
23929,recreation,0.148276,0.064030,0.646090,0.035585,0.106019,I'm looking for the address to join the Clevel...,rec.sport.baseball,recreation,54,264,1731617135307
23930,religion,0.018349,0.011634,0.048080,0.883586,0.038351,The story I related is one of the seven appari...,soc.religion.christian,religion,794,2703,1731678505307
23931,recreation,0.068823,0.018957,0.781527,0.053339,0.077355,Has anybody noticed that Toyota has an uncanny...,rec.autos,recreation,28,173,1731617227307
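
The timestamp shift in the cell above adds one constant offset to every event, so the latest event lands at the current time while the spacing between events is preserved. A minimal standalone sketch of that arithmetic, with plain lists and a fixed "now" so the result is reproducible (the helper name `shift_to_now` is an assumption for this example):

```python
import time

def shift_to_now(timestamps_ms, now_ms=None):
    """Shift epoch-millisecond timestamps so the latest one equals now_ms."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    offset = now_ms - max(timestamps_ms)
    return [ts + offset for ts in timestamps_ms]

# First three timestamps from the data sample shown earlier.
original = [1713288933452, 1713289192800, 1713289452148]
shifted = shift_to_now(original, now_ms=1731678505307)

original_gaps = [b - a for a, b in zip(original, original[1:])]
shifted_gaps = [b - a for a, b in zip(shifted, shifted[1:])]
print(shifted, shifted_gaps)
```

Because only a constant is added, any drift pattern tied to relative time (for example, over-sampling in the middle of the dataset) stays in the same relative position after the shift.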


Next, let's publish our events with the synthetic drift.

In [14]:
production_publish_job = model.publish(production_data_df)

print(
    f'Initiated production environment data upload with Job ID = {production_publish_job.id}'
)

# Uncomment the line below to wait for the job to finish, otherwise it will run in the background.
# You can check the status on the Jobs page in the Fiddler UI or use the job ID to query the job status via the API.
# production_publish_job.wait()

Initiated production environment data upload with Job ID = 93b7df22-7173-4e1a-9f02-819bdd451d0f


## Get Insights


**You're all done!**
  
Head to the Fiddler UI to start getting enhanced observability into your model's performance.



In particular, you can go to your model's default dashboard in Fiddler and check out the resulting drift chart. The first image below is a screenshot of the automatically generated drift chart, which shows the drift of the original text. The second image is a custom drift chart with traffic details in hourly time bins. (Annotation bubbles are not generated by the Fiddler UI.)

<table>
    <tr>
        <td>
            <img src="https://raw.githubusercontent.com/fiddler-labs/fiddler-examples/main/quickstart/images/nlp_multiiclass_drift_2.png"  />
        </td>
    </tr>
</table>

<table>
    <tr>
        <td>
            <img src="https://raw.githubusercontent.com/fiddler-labs/fiddler-examples/main/quickstart/images/nlp_multiiclass_drift.png"  />
        </td>
    </tr>
</table>