# Monitoring a Multiclass Classifier Model with Text Inputs
Unstructured data such as text are usually represented as high-dimensional vectors when processed by ML models. In this example notebook we present how [Fiddler Vector Monitoring](https://www.fiddler.ai/blog/monitoring-natural-language-processing-and-computer-vision-models-part-1) can be used to monitor NLP models using a text classification use case.

Following the steps in this notebook you can see how to onboard models that deal with unstructured text inputs. In this example, we use the 20Newsgroups dataset and train a multi-class classifier that is applied to vector embeddings of text documents.

We monitor this model at production time and assess the performance of Fiddler's vector monitoring by manufacturing synthetic drift: we sample from specific text categories at different deployment time intervals.
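As a rough illustration of what "manufacturing synthetic drift" means here, the sketch below builds a toy event log and, inside a chosen time window, overwrites a fraction of the rows with samples from one category. The toy data, the window bounds, and the helper `inject_drift` are assumptions made for this example; this is not the preprocessing code used to build the dataset in this notebook.

```python
import numpy as np
import pandas as pd


def inject_drift(df, category, start, end, frac, seed=0):
    """Overwrite `frac` of the rows in [start, end) with samples of `category`.

    Hypothetical helper, for illustration only.
    """
    rng = np.random.default_rng(seed)
    out = df.copy()
    in_window = out.index[(out['timestamp'] >= start) & (out['timestamp'] < end)]
    n_replace = int(len(in_window) * frac)
    # Rows inside the window that will be overwritten.
    replace_idx = rng.choice(in_window, size=n_replace, replace=False)
    # Donor rows drawn (with replacement) from the oversampled category.
    donors = df[df['target'] == category]
    donor_idx = rng.choice(donors.index, size=n_replace, replace=True)
    for col in ('target', 'original_text'):
        out.loc[replace_idx, col] = df.loc[donor_idx, col].to_numpy()
    return out


# Toy event log: 300 events, three evenly mixed categories.
categories = ['science', 'computer', 'religion']
toy_df = pd.DataFrame({
    'timestamp': np.arange(300),
    'target': [categories[i % 3] for i in range(300)],
})
toy_df['original_text'] = 'text about ' + toy_df['target']

drifted_df = inject_drift(toy_df, 'religion', start=100, end=200, frac=0.5)

# Category mix inside the drifted window.
window = drifted_df[(drifted_df['timestamp'] >= 100) & (drifted_df['timestamp'] < 200)]
print(window['target'].value_counts(normalize=True))
```

Inside the window the `religion` share rises well above its one-third baseline, while events outside the window keep their original mix. That is the kind of shift a drift monitor is expected to flag.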

---

Now we perform the following steps to demonstrate how Fiddler NLP monitoring works:

1. Connect to Fiddler
2. Create a Project
3. Load a Data Sample
4. Add Information About the Model's Schema
5. Publish Production Events
6. Get insights

## Imports

In [13]:
import pandas as pd
import numpy as np
import random
import time
import os

# 1. Connect to Fiddler

First we install and import the Fiddler Python client.

In [14]:
!pip install -q fiddler-client;
import fiddler as fdl
print(f"Running client version {fdl.__version__}")

Running client version 3.1.2


Before you can add information about your model with Fiddler, you'll need to connect using our API client.

---

**We need a few pieces of information to get started.**
1. The URL you're using to connect to Fiddler
2. Your authorization token

These can be found by pointing your browser to your Fiddler URL and navigating to the **Settings** page.

In [15]:
URL = ''  # Make sure to include the full URL (including https://).
TOKEN = ''

Next, run the following code block to connect to the Fiddler API.

In [16]:
fdl.init(
    url=URL,
    token=TOKEN
)
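Rather than pasting credentials into the notebook, one option is to read them from environment variables. The sketch below assumes two hypothetical variable names, `FIDDLER_URL` and `FIDDLER_TOKEN`, and a hypothetical helper `load_credentials`; neither is defined by the Fiddler client. It only validates the values, which could then be passed to `fdl.init(url=..., token=...)` as in the cell above.

```python
import os


def load_credentials(env=os.environ):
    """Read and validate connection settings (hypothetical helper)."""
    url = env.get('FIDDLER_URL', '').strip()
    token = env.get('FIDDLER_TOKEN', '').strip()
    # The notebook asks for the full URL, including https://.
    if not url.startswith('https://'):
        raise ValueError('FIDDLER_URL must be a full URL starting with https://')
    if not token:
        raise ValueError('FIDDLER_TOKEN is not set')
    return url, token


# Example with an in-memory mapping and placeholder values
# instead of the real environment.
example_env = {
    'FIDDLER_URL': 'https://fiddler.example.com',
    'FIDDLER_TOKEN': 'abc123',
}
example_url, example_token = load_credentials(example_env)
print(example_url)
```

Failing early with a clear message is usually easier to debug than a connection error raised later by the client.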

# 2. Create a Project

Once you connect, you can create a new project by instantiating `fdl.Project` with a unique project name and calling its `create` method.

In [17]:
PROJECT_NAME = 'nlp_newsgroups'

project = fdl.Project(
    name=PROJECT_NAME
)

project.create()

<fiddler.entities.project.Project at 0x7f7e153dfaf0>

# 3. Load a Data Sample

Now we retrieve the 20Newsgroups dataset. This dataset is fetched from the [scikit-learn real-world datasets](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html#) and pre-processed using [this notebook](https://colab.research.google.com/github/fiddler-labs/fiddler-examples/blob/main/pre-proccessing/20newsgroups_prep_vectorization.ipynb). For simplicity's sake, we have stored the result here:

In [18]:
PATH_TO_SAMPLE_CSV = 'https://raw.githubusercontent.com/fiddler-labs/fiddler-examples/main/quickstart/data/v3/nlp_text_multiclassifier_data_sample.csv'

sample_df = pd.read_csv(PATH_TO_SAMPLE_CSV)
sample_df

| Unnamed: 0 | predicted_target | prob_computer | prob_forsale | prob_recreation | prob_religion | prob_science | original_text | original_target | target | n_tokens | string_size | timestamp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | science | 0.049720 | 0.015294 | 0.040910 | 0.025433 | 0.868644 | : Has anyone ever heard of a food product call... | sci.space | science | 87 | 482 | 1713288933452 |
| 1 | computer | 0.749487 | 0.045994 | 0.019984 | 0.016209 | 0.168327 | Getting an image from a computer monitor to a ... | comp.graphics | computer | 108 | 589 | 1713289192800 |
| 2 | forsale | 0.068282 | 0.739602 | 0.088766 | 0.032307 | 0.071043 | The following used CD's are for sale. They ar... | misc.forsale | forsale | 128 | 775 | 1713289452148 |
| 3 | science | 0.060254 | 0.036090 | 0.072608 | 0.028312 | 0.802736 | Since your MOSFET is a 1972 vintage, it's prob... | sci.electronics | science | 117 | 652 | 1713289711496 |
| 4 | computer | 0.259612 | 0.088936 | 0.198272 | 0.206862 | 0.246319 | Are 'Moody Monthly' and 'Moody' the same magaz... | soc.religion.christian | religion | 26 | 148 | 1713289970844 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2328 | science | 0.284201 | 0.027899 | 0.018825 | 0.012399 | 0.656676 | Yes. I use 74HC4066 and others commerically f... | sci.electronics | science | 100 | 513 | 1713892696059 |
| 2329 | science | 0.140781 | 0.017383 | 0.120168 | 0.063662 | 0.658006 | Even if they somehow address this issue it is ... | sci.crypt | science | 38 | 222 | 1713892955407 |
| 2330 | computer | 0.937898 | 0.017932 | 0.009471 | 0.007481 | 0.027219 | Hello, I am searching for rendering softw... | comp.graphics | computer | 39 | 215 | 1713893214755 |
| 2331 | science | 0.153452 | 0.039111 | 0.073378 | 0.030796 | 0.703263 | For those of you interested in the above Proce... | sci.med | science | 114 | 556 | 1713893474103 |
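Before onboarding, it can be worth confirming that the probability columns behave as expected. The sketch below copies three rows from the sample shown above and checks that each row's class probabilities sum to roughly 1 and that `predicted_target` is the highest-probability class. The tolerance of `1e-3` is an arbitrary choice for this example. The same checks could be run on the full `sample_df`.

```python
import pandas as pd

# Three rows copied from the sample output (label and probability columns only).
rows = pd.DataFrame({
    'predicted_target': ['science', 'computer', 'forsale'],
    'prob_computer': [0.049720, 0.749487, 0.068282],
    'prob_forsale': [0.015294, 0.045994, 0.739602],
    'prob_recreation': [0.040910, 0.019984, 0.088766],
    'prob_religion': [0.025433, 0.016209, 0.032307],
    'prob_science': [0.868644, 0.168327, 0.071043],
})

prob_cols = [col for col in rows.columns if col.startswith('prob_')]

# Each row of class probabilities should sum to roughly 1.
row_sums = rows[prob_cols].sum(axis=1)
sums_ok = (row_sums - 1.0).abs() < 1e-3

# The predicted label should be the class with the highest probability.
argmax_label = rows[prob_cols].idxmax(axis=1).str[len('prob_'):]
labels_ok = argmax_label == rows['predicted_target']

print(sums_ok.all(), labels_ok.all())
```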


# 4. Add Information About the Model's Schema

Next we should tell Fiddler a bit more about our model by creating a `ModelSpec` object that specifies the model's inputs, outputs, targets, and metadata, along with other information such as the enrichments we want performed on our model's data.

### Instruct Fiddler to generate embeddings for our unstructured model input

Fiddler offers a set of enrichment services that we can use to enhance how we monitor our model's performance. In this example, we instruct Fiddler to generate embeddings for our unstructured text. Each generated embedding is a numerical vector that represents the content and context of our unstructured input field, _original_text_. These embeddings then power Fiddler's vector monitoring capability for detecting drift.

Before creating a `ModelSpec` object, we define a custom feature using the [fdl.Enrichment()](https://docs.fiddler.ai/reference/fdlenrichment-beta) API. When creating an enrichment, a name must be assigned to the custom feature using the `name` argument. Each enrichment appears in the monitoring tab of the Fiddler UI under this assigned name. Finally, the default clustering setup can be modified by passing the number of cluster centroids to the `n_clusters` argument of `fdl.TextEmbedding`.

Here we define an [embedding fdl.Enrichment](https://docs.fiddler.ai/reference/embedding-enrichment-beta) and then use that embedding enrichment to create a [fdl.TextEmbedding](https://docs.fiddler.ai/reference/fdltextembedding) input that can be used to track drift and can be plotted in Fiddler's UMAP visualizations.

In [19]:
fiddler_backend_enrichments = [
    fdl.Enrichment(
        name='Enrichment Text Embedding',
        enrichment='embedding',
        columns=['original_text'],
    ),
    fdl.TextEmbedding(
        name='Original TextEmbedding',
        source_column='original_text',
        column='Enrichment Text Embedding',
        n_clusters=6
    )
]
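To give some intuition for why a clustering step appears in vector monitoring, the sketch below shows one common way to measure drift between two sets of vectors. It fits k-means centroids on a baseline set, assigns both sets to their nearest centroid, and compares the resulting cluster histograms with the Jensen-Shannon divergence. This is an illustration written with NumPy on random toy vectors; the function names, the k-means loop, and the choice of divergence are assumptions for this example, not a description of Fiddler's implementation.

```python
import numpy as np


def assign(vectors, centroids):
    """Index of the nearest centroid for every vector."""
    dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)


def fit_centroids(vectors, n_clusters, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns an (n_clusters, dim) array."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        labels = assign(vectors, centroids)
        for k in range(n_clusters):
            members = vectors[labels == k]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[k] = members.mean(axis=0)
    return centroids


def cluster_histogram(vectors, centroids):
    """Fraction of vectors that fall into each cluster."""
    counts = np.bincount(assign(vectors, centroids), minlength=len(centroids))
    return counts / counts.sum()


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy "embeddings": a baseline, a fresh sample from the same
# distribution, and a sample whose mean has shifted.
rng = np.random.default_rng(42)
baseline = rng.normal(size=(600, 8))
same_dist = rng.normal(size=(600, 8))
shifted = rng.normal(loc=1.5, size=(600, 8))

centroids = fit_centroids(baseline, n_clusters=6)
base_hist = cluster_histogram(baseline, centroids)
drift_same = js_divergence(base_hist, cluster_histogram(same_dist, centroids))
drift_shifted = js_divergence(base_hist, cluster_histogram(shifted, centroids))
print(f"same distribution: {drift_same:.3f}, shifted: {drift_shifted:.3f}")
```

Binning high-dimensional vectors by cluster turns a hard density-comparison problem into a comparison of two small histograms. The number of clusters controls how fine-grained that comparison is.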

In [20]:
model_spec = fdl.ModelSpec(
    inputs=[
        'original_text'
    ],
    outputs=[col for col in sample_df.columns if col.startswith('prob_')],
    targets=['target'],
    metadata=['n_tokens', 'string_size', 'timestamp'],
    custom_features=fiddler_backend_enrichments,
)
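A typo in a column name is an easy mistake to make when writing a spec by hand. The sketch below is a plain-Python check, independent of the Fiddler client, that reports any spec column missing from the data. The helper `missing_spec_columns` is hypothetical. The column names are the named columns from the sample shown earlier, with the unnamed index column omitted.

```python
def missing_spec_columns(spec_columns, available_columns):
    """Return spec column names absent from the data (hypothetical helper)."""
    available = set(available_columns)
    return {
        role: [col for col in cols if col not in available]
        for role, cols in spec_columns.items()
        if any(col not in available for col in cols)
    }


# Named columns from the sample data.
sample_columns = [
    'predicted_target', 'prob_computer', 'prob_forsale', 'prob_recreation',
    'prob_religion', 'prob_science', 'original_text', 'original_target',
    'target', 'n_tokens', 'string_size', 'timestamp',
]

# The same roles and columns used in the spec above.
spec_columns = {
    'inputs': ['original_text'],
    'outputs': [col for col in sample_columns if col.startswith('prob_')],
    'targets': ['target'],
    'metadata': ['n_tokens', 'string_size', 'timestamp'],
}

# An empty dict means every spec column exists in the data.
print(missing_spec_columns(spec_columns, sample_columns))
```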

In [21]:
timestamp_column = 'timestamp'

In [22]:
model_task = fdl.ModelTask.MULTICLASS_CLASSIFICATION

task_params = fdl.ModelTaskParams(target_class_order=[col[5:] for col in model_spec.outputs])
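The slice `col[5:]` in the cell above strips the five-character `prob_` prefix from each output column, so the class order follows the order of the probability columns. The standalone sketch below reproduces that derivation and checks that the labels appearing in the sample rows shown earlier are all covered. The set `observed_targets` is typed in by hand from those rows for illustration.

```python
outputs = [
    'prob_computer', 'prob_forsale', 'prob_recreation',
    'prob_religion', 'prob_science',
]

prefix = 'prob_'
# Same result as col[5:], but without a hard-coded prefix length.
target_class_order = [col[len(prefix):] for col in outputs]

# Labels seen in the `target` column of the sample rows shown earlier.
observed_targets = {'science', 'computer', 'forsale', 'religion'}
unknown = observed_targets - set(target_class_order)

print(target_class_order)
print(unknown)
```

Any label in the target column that is missing from the class order would show up in `unknown`.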

Once we've specified information about our model, we can onboard it to Fiddler by building a model object with `fdl.Model.from_data` and then calling its `create` method.

In [23]:
MODEL_NAME = 'log_reg_multiclassifier'

model = fdl.Model.from_data(
    name=MODEL_NAME,
    project_id=project.id,
    source=sample_df,
    spec=model_spec,
    task=model_task,
    task_params=task_params,
    event_ts_col=timestamp_column
)

model.create()

<fiddler.entities.model.Model at 0x7f7e153f3a90>

# 5. Publish Production Events

Now let's publish some sample production events into Fiddler. For the timestamps in the middle of the event dataset, we've oversampled from certain topics, which introduces synthetic drift. Injecting drift this way allows us to better illustrate how Fiddler's vector monitoring approach detects drift in our unstructured inputs.

In [24]:
PATH_TO_EVENTS_CSV = 'https://raw.githubusercontent.com/fiddler-labs/fiddler-examples/main/quickstart/data/v3/nlp_text_multiclassifier_production_data.csv'

production_df = pd.read_csv(PATH_TO_EVENTS_CSV)

# Shift the timestamps of the production events to be as recent as today
production_df['timestamp'] = production_df['timestamp'] + (int(time.time() * 1000) - production_df['timestamp'].max())
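The shift above adds one constant offset to every row, so the newest event lands at the current time and the spacing between events is preserved. The sketch below shows the same arithmetic on three timestamps taken from the sample data. The timestamps are in epoch milliseconds, which is why `time.time()` is multiplied by 1000.

```python
import time

import pandas as pd

# Three epoch-millisecond timestamps from the sample data.
toy = pd.DataFrame({'timestamp': [1713288933452, 1713289192800, 1713289452148]})

now_ms = int(time.time() * 1000)
offset = now_ms - toy['timestamp'].max()
shifted = toy['timestamp'] + offset

# The latest event now sits at "now"; gaps between events are unchanged.
print(shifted.max() == now_ms)
print((shifted.diff().dropna() == toy['timestamp'].diff().dropna()).all())
```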

Next, let's publish our events with the synthetic drift.

In [25]:
output = model.publish(production_df)

# 6. Get insights


**You're all done!**
  
You can now head to your Fiddler URL and start getting enhanced observability into your model's performance.

In particular, you can go to your model's default dashboard in Fiddler and check out the resulting drift chart. Below is a screenshot of the data drift chart after running this notebook on the [Fiddler demo](https://demo.fiddler.ai/) deployment. (Annotation bubbles are not generated by the Fiddler UI.)

<table>
    <tr>
        <td>
            <img src="https://raw.githubusercontent.com/fiddler-labs/fiddler-examples/main/quickstart/images/nlp_multiiclass_drift.png" />
        </td>
    </tr>
</table>