# Monitoring a Multiclass Classifier Model with Text Inputs
Unstructured data such as text are usually represented as high-dimensional vectors when processed by ML models. In this example notebook we present how [Fiddler Vector Monitoring](https://www.fiddler.ai/blog/monitoring-natural-language-processing-and-computer-vision-models-part-1) can be used to monitor NLP models using a text classification use case.

Following the steps in this notebook you can see how to onboard models that deal with unstructured text inputs. In this example, we use the 20Newsgroups dataset and train a multi-class classifier that is applied to vector embeddings of text documents. 

We monitor this model at production time and assess the performance of Fiddler's vector monitoring by manufacturing synthetic drift: we sample from specific text categories at different deployment time intervals.

---

Now we perform the following steps to demonstrate how Fiddler NLP monitoring works: 

1. Connect to Fiddler 
2. Create a Project
3. Upload Baseline Dataset
4. Add Information About the Model's Schema
5. Manufacture Synthetic Data Drift and Publish Production Events
6. Get insights

## Imports

In [None]:
import pandas as pd
import numpy as np
import random
import os

# 1. Connect to Fiddler

First we install and import the Fiddler Python client.

In [None]:
!pip install -q fiddler-client
import fiddler as fdl
print(f"Running client version {fdl.__version__}")

Before you can add information about your model with Fiddler, you'll need to connect using our API client.

---

**We need a few pieces of information to get started.**
1. The URL you're using to connect to Fiddler
2. Your organization ID
3. Your authorization token

The latter two of these can be found by pointing your browser to your Fiddler URL and navigating to the **Settings** page.

In [None]:
URL = ''
ORG_ID = ''
AUTH_TOKEN = ''
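Hard-coding credentials in a notebook makes them easy to leak. As a minimal sketch, the three values could instead be read from environment variables. The variable names `FIDDLER_URL`, `FIDDLER_ORG_ID`, and `FIDDLER_AUTH_TOKEN` and the helper `load_fiddler_credentials` are hypothetical; the Fiddler client does not look for them on its own.

```python
import os


def load_fiddler_credentials(env=os.environ):
    """Read connection settings from environment variables.

    The variable names below are hypothetical, chosen for this sketch only.
    """
    creds = {
        'url': env.get('FIDDLER_URL', ''),
        'org_id': env.get('FIDDLER_ORG_ID', ''),
        'auth_token': env.get('FIDDLER_AUTH_TOKEN', ''),
    }
    # Fail early with a readable message instead of a confusing connection error.
    missing = [name for name, value in creds.items() if not value]
    if missing:
        raise ValueError('Missing Fiddler settings: ' + ', '.join(missing))
    return creds
```

The returned values can then be assigned to `URL`, `ORG_ID`, and `AUTH_TOKEN` before connecting.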

Next we run the following code block to connect to the Fiddler API.

In [None]:
client = fdl.FiddlerApi(
    url=URL,
    org_id=ORG_ID,
    auth_token=AUTH_TOKEN)

# 2. Create a Project

Once you connect, you can create a new project by specifying a unique project ID in the client's `create_project` function.

In [None]:
PROJECT_ID = 'nlp_newsgroups'

if PROJECT_ID not in client.list_projects():
    print(f'Creating project: {PROJECT_ID}')
    client.create_project(PROJECT_ID)
else:
    print(f'Project: {PROJECT_ID} already exists')

# 3. Upload the Baseline Dataset

Now we retrieve the 20Newsgroups dataset. This dataset is fetched from the [scikit-learn real-world datasets](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html#) and pre-processed using [this notebook](https://colab.research.google.com/github/fiddler-labs/fiddler-examples/blob/main/pre-proccessing/20newsgroups_prep_vectorization.ipynb). For simplicity's sake, we have stored the pre-processed data here:

In [None]:
BASELINE_DATA_PATH = 'https://media.githubusercontent.com/media/fiddler-labs/fiddler-examples/main/quickstart/data/baseline_nlp_text_multiclassifier.csv'
baseline_df = pd.read_csv(BASELINE_DATA_PATH)
baseline_df
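Before uploading, it can be worth checking that the prediction columns are internally consistent. The sketch below runs on a small toy dataframe that reuses the `prob_` and `predicted_target` column names from this dataset. It assumes that the class probabilities sum to one and that `predicted_target` is the class with the highest probability. The helper `check_prediction_columns` is hypothetical and not part of the Fiddler client.

```python
import numpy as np
import pandas as pd


def check_prediction_columns(df, prob_prefix='prob_',
                             decision_col='predicted_target', atol=1e-3):
    """Check that probabilities sum to ~1 and agree with the predicted label."""
    prob_cols = [c for c in df.columns if c.startswith(prob_prefix)]
    row_sums = df[prob_cols].sum(axis=1)
    sums_ok = bool(np.allclose(row_sums, 1.0, atol=atol))
    # Column with the highest probability, with the prefix stripped off.
    top_class = df[prob_cols].idxmax(axis=1).str[len(prob_prefix):]
    decisions_ok = bool((top_class == df[decision_col]).all())
    return {'prob_cols': prob_cols, 'sums_ok': sums_ok, 'decisions_ok': decisions_ok}


# Toy rows for illustration only; these are not taken from the real dataset.
toy_df = pd.DataFrame({
    'predicted_target': ['recreation', 'computer'],
    'prob_computer': [0.1, 0.8],
    'prob_recreation': [0.7, 0.1],
    'prob_science': [0.2, 0.1],
    'original_text': ['a toy hockey post', 'a toy windows post'],
})
report = check_prediction_columns(toy_df)
print(report)
```

Calling `check_prediction_columns(baseline_df)` would apply the same checks to the real baseline data, under the same assumptions.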

Now we create a [DatasetInfo](https://docs.fiddler.ai/reference/fdldatasetinfo) object to describe our baseline dataset.

In [None]:
dataset_info = fdl.DatasetInfo.from_dataframe(baseline_df, max_inferred_cardinality=100)
dataset_info

Next we call the [upload_dataset()](https://docs.fiddler.ai/reference/clientupload_dataset) API to upload the baseline dataset to Fiddler. Only the baseline dataframe is uploaded here; the production data is published later as events.

In [None]:
DATASET_ID = 'newsgroups_baseline'

if DATASET_ID not in client.list_datasets(project_id=PROJECT_ID):
    print(f'Upload dataset {DATASET_ID}')
    client.upload_dataset(
        project_id=PROJECT_ID,
        dataset_id=DATASET_ID,
        dataset={'baseline': baseline_df},
        info=dataset_info
    )
else:
    print(f'Dataset: {DATASET_ID} already exists in Project: {PROJECT_ID}.\n'
          'The new dataset is not uploaded. (please use a different name.)')

# 4. Add Information About the Model's Schema

Next we should tell Fiddler a bit more about our model by creating a [model_info](https://docs.fiddler.ai/reference/fdlmodelinfo) object that specifies the model's task, inputs, outputs, and other information such as the enrichments we want performed on our model's data.

### Instruct Fiddler to generate embeddings for our unstructured model input

Fiddler offers a powerful set of enrichment services that we can use to enhance how we monitor our model's performance. In this example, we instruct Fiddler to generate embeddings for our unstructured text. These generated embeddings are numerical vectors that represent the content and context of our unstructured input field, _original_text_. These embeddings then power Fiddler's vector monitoring capability for detecting drift.

Before creating a [model_info](https://docs.fiddler.ai/reference/fdlmodelinfo) object, we define a custom feature using the [fdl.Enrichment()](https://docs.fiddler.ai/reference/fdlenrichment-beta) API. When creating an enrichment, a name must be assigned to the custom feature using the `name` argument. Each enrichment appears in the monitoring tab of the Fiddler UI under this assigned name. Finally, the default clustering setup can be modified by passing the number of cluster centroids to the `n_clusters` argument.

Here we define an [embedding fdl.Enrichment](https://docs.fiddler.ai/reference/embedding-enrichment-beta) and then use that embedding enrichment to create a [fdl.TextEmbedding](https://docs.fiddler.ai/reference/fdltextembedding) input that can be used to track drift and be plotted in Fiddler's UMAP visualizations.

In [None]:
fiddler_backend_enrichments = [
    fdl.Enrichment(
        name='Enrichment Text Embedding',
        enrichment='embedding',
        columns=['original_text'],
    ),
    fdl.TextEmbedding(
        name='Original TextEmbedding',
        source_column='original_text',
        column='Enrichment Text Embedding',
        n_clusters=6
    )
]
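To build intuition for what the `n_clusters` setting controls, the sketch below shows one way a clustering-based drift score for vectors could be computed: fit centroids on baseline embeddings, histogram each data window over those centroids, and compare the histograms with the Jensen-Shannon divergence. This is an illustration in plain NumPy on synthetic vectors, not Fiddler's implementation. All function names are hypothetical and the k-means routine is deliberately simplistic.

```python
import numpy as np


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)


def kmeans(points, n_clusters, n_iter=20, seed=0):
    """Tiny k-means for illustration; returns the fitted centroids."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        labels = assign(points, centroids)
        for k in range(n_clusters):
            members = points[labels == k]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)
    return centroids


def cluster_histogram(points, centroids):
    """Fraction of points that fall into each cluster."""
    counts = np.bincount(assign(points, centroids), minlength=len(centroids))
    return counts / counts.sum()


def kl_divergence(a, b):
    return float(np.sum(a * np.log2(a / b)))


def jensen_shannon_divergence(p, q, eps=1e-12):
    """JSD in bits; eps avoids log(0) for empty bins."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Synthetic "embeddings": three topics in the baseline, one topic in the drifted window.
rng = np.random.default_rng(42)
topics = (-2.0, 0.0, 2.0)
baseline = np.concatenate([rng.normal(t, 0.3, size=(200, 8)) for t in topics])
same_mix = np.concatenate([rng.normal(t, 0.3, size=(100, 8)) for t in topics])
drifted = rng.normal(2.0, 0.3, size=(300, 8))

centroids = kmeans(baseline, n_clusters=3)
base_hist = cluster_histogram(baseline, centroids)
jsd_same = jensen_shannon_divergence(base_hist, cluster_histogram(same_mix, centroids))
jsd_drift = jensen_shannon_divergence(base_hist, cluster_histogram(drifted, centroids))
print(f'JSD, same topic mix: {jsd_same:.4f}')
print(f'JSD, single topic:   {jsd_drift:.4f}')
```

In this toy setup the window drawn from a single topic scores a much higher divergence than the window with the baseline topic mix. The synthetic batches in step 5 are designed to produce this kind of signal.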

### Generate Model_Info Object and Add Model 

This notebook demonstrates a monitoring-only use case, and model predictions are already included in both the baseline and production data. There is therefore no need to access the model directly or to build a surrogate model, so we use the [add_model()](https://docs.fiddler.ai/reference/clientadd_model) API. This requires passing a [model_info](https://docs.fiddler.ai/reference/fdlmodelinfo) object, which contains information about our model's task, inputs, outputs, targets, and the enrichments that we would like to monitor.

In [None]:
model_task = fdl.ModelTask.MULTICLASS_CLASSIFICATION
model_target = 'target'
model_outputs = [col for col in baseline_df.columns if col.startswith('prob_')]
model_features = ['original_text']
model_categorical_target_class_details = [col[5:] for col in model_outputs]

model_info = fdl.ModelInfo.from_dataset_info(
    dataset_info=dataset_info,
    dataset_id=DATASET_ID,
    features=model_features,
    target=model_target,
    outputs=model_outputs,
    custom_features=fiddler_backend_enrichments,
    model_task=model_task,
    decision_cols=['predicted_target'],
    categorical_target_class_details=model_categorical_target_class_details,
    metadata_cols=['n_tokens', 'string_size'],
    description='A multi-class classifier trained on NLP original text inputs.'
)
model_info
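The class names above are derived by slicing each output column with `col[5:]`, which works because the prefix `prob_` is exactly five characters long. A slightly more defensive sketch of the same derivation is shown below; the helper `class_names_from_outputs` is hypothetical. It keeps the class names in the same order as the output columns and fails loudly if a column does not carry the expected prefix.

```python
PROB_PREFIX = 'prob_'


def class_names_from_outputs(output_columns, prefix=PROB_PREFIX):
    """Strip the probability prefix from each output column, preserving order."""
    names = []
    for col in output_columns:
        if not col.startswith(prefix):
            raise ValueError(f'Unexpected output column: {col!r}')
        names.append(col[len(prefix):])
    return names


# Output column names as they appear in this dataset.
outputs = ['prob_computer', 'prob_forsale', 'prob_recreation',
           'prob_religion', 'prob_science']
class_names = class_names_from_outputs(outputs)
print(class_names)
```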

In [None]:
MODEL_ID = 'logistic_regression_multiclassifier'

if MODEL_ID not in client.list_models(project_id=PROJECT_ID):
    client.add_model(
        project_id=PROJECT_ID,
        dataset_id=DATASET_ID,
        model_id=MODEL_ID,
        model_info=model_info
    )
else:
    print(f'Model: {MODEL_ID} already exists in Project: {PROJECT_ID}. Please use a different name.')

# 5. Manufacture Synthetic Data Drift and Publish Production Events

Now we publish some production events into Fiddler. We publish events in data batches and manually create data drift by sampling from particular newsgroups. This allows us to evaluate the effectiveness of Fiddler vector monitoring.

In [42]:
PROD_EVENTS_DATA_PATH = 'https://media.githubusercontent.com/media/fiddler-labs/fiddler-examples/main/quickstart/data/production_nlp_text_multiclassifier.csv'
production_df = pd.read_csv(PROD_EVENTS_DATA_PATH)
production_df

| | predicted_target | prob_computer | prob_forsale | prob_recreation | prob_religion | prob_science | original_text | original_target | target | n_tokens | string_size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | recreation | 0.076641 | 0.060584 | 0.568645 | 0.108312 | 0.185817 | Saku isn't that small any longer I guess I hea... | rec.sport.hockey | recreation | 22 | 106 |
| 1 | computer | 0.944998 | 0.011931 | 0.018384 | 0.004506 | 0.020182 | Has anyone had problems with Ami Pro 3.0 after... | comp.os.ms-windows.misc | computer | 204 | 1087 |
| 2 | forsale | 0.080242 | 0.599389 | 0.164612 | 0.013835 | 0.141921 | Whistler Spectrum 2-SE. X, K, Ka. Pulse prot... | misc.forsale | forsale | 20 | 107 |
| 3 | science | 0.057480 | 0.022540 | 0.045148 | 0.038236 | 0.836596 | :Thousands? Tens of thousands? Do some arith... | sci.crypt | science | 55 | 303 |
| 4 | religion | 0.036691 | 0.021405 | 0.104772 | 0.772644 | 0.064488 | Being a parent in need of some help, I ask tha... | soc.religion.christian | religion | 589 | 3207 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6994 | computer | 0.785213 | 0.032077 | 0.046269 | 0.025683 | 0.110758 | wing the suggestion of Stu Lynne, I have poste... | comp.graphics | computer | 19 | 112 |
| 6995 | science | 0.067842 | 0.041828 | 0.151476 | 0.190409 | 0.548445 | As many people have mentioned, there is no rea... | talk.religion.misc | religion | 113 | 647 |
| 6996 | computer | 0.834109 | 0.032805 | 0.046836 | 0.021237 | 0.065012 | Hmmmmmm...I got my comp with windows pre-insta... | comp.sys.ibm.pc.hardware | computer | 53 | 284 |
| 6997 | computer | 0.933943 | 0.011013 | 0.011684 | 0.008445 | 0.034915 | Well, the temp file thing creates an obvious p... | comp.graphics | computer | 38 | 221 |


In [43]:
batch_size = 900  # number of events per (daily) bin
event_batches_df = []

As a sanity check, we use the baseline data as the first event batch.

In [44]:
event_batches_df.append(baseline_df)

Next, we sample from all categories (the same mix as the baseline).

In [45]:
n_intervals = 6
for i in range(n_intervals):
    event_batches_df.append(production_df.sample(batch_size))

Now we generate synthetic data drift by adding event batches that are sampled from specific newsgroups. Each entry in `synthetic_intervals` defines the categories for one daily batch.

In [46]:
T1 = ['computer','science','recreation']
T2 = ['science','religion']
T3 = ['religion']
T4 = ['computer','science','religion','forsale']
T5 = ['science','religion','forsale']
T6 = ['forsale']
synthetic_intervals = [T1,
                       T2,
                       T3,T3,T3,T3,
                       T4,T4,T4,
                       T5,T5,
                       T6,
                       ]

In [47]:
for categories in synthetic_intervals:
    production_df_subset = production_df[production_df['target'].isin(categories)]
    event_batches_df.append(production_df_subset.sample(batch_size, replace=True))
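To see what this sampling does to the class distribution, the sketch below repeats the same steps on a small toy dataframe and prints the category mix of each batch. The data and the helper `category_mix` are made up for illustration. Note that the restricted subset can be smaller than the batch size, which is why sampling with `replace=True` is needed.

```python
import pandas as pd


def category_mix(batch, column='target'):
    """Share of each category in a batch, as fractions that sum to 1."""
    return batch[column].value_counts(normalize=True).sort_index()


# Toy production data: 100 rows across three categories.
toy_production = pd.DataFrame(
    {'target': ['computer'] * 50 + ['science'] * 30 + ['religion'] * 20}
)
toy_batch_size = 40

# No drift: sample from every category.
full_batch = toy_production.sample(toy_batch_size, random_state=0)

# Synthetic drift: restrict to one category, then sample with replacement
# because only 20 'religion' rows exist but the batch needs 40.
religion_rows = toy_production[toy_production['target'].isin(['religion'])]
religion_only = religion_rows.sample(toy_batch_size, replace=True, random_state=0)

print(category_mix(full_batch))
print(category_mix(religion_only))
```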

Finally, we add more intervals sampled from all categories (no data drift).

In [48]:
n_intervals = 6
for i in range(n_intervals):
    event_batches_df.append(production_df.sample(batch_size))

### Add Timestamp to Batches and Publish Events

In [49]:
from datetime import datetime, date, time 
today_beginning = datetime.combine(date.today(), time()).timestamp()
daily_time_gap = 24 * 3600  # daily time gap in seconds

In [50]:
# start from 29 days back
timestamp = today_beginning - 29 * daily_time_gap
for event_df in event_batches_df:
    timestamp_vec = [timestamp + random.randrange(daily_time_gap) for i in range(len(event_df))]
    event_df['timestamp'] = timestamp_vec
    timestamp += daily_time_gap
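The timestamp logic can be checked in isolation. The sketch below uses only the standard library and hypothetical names. It draws random timestamps within a single day window, confirms that they stay inside that window, and works out which days the batches cover: one baseline batch, six undrifted batches, twelve drifted batches, and six more undrifted batches make 25 daily batches, so a series starting 29 days back ends 5 days before today.

```python
import random
from datetime import datetime, date, time

SECONDS_PER_DAY = 24 * 3600


def batch_timestamps(day_start, n_events, rng=random):
    """Random Unix timestamps (in seconds) within the day starting at day_start."""
    return [day_start + rng.randrange(SECONDS_PER_DAY) for _ in range(n_events)]


today_start = datetime.combine(date.today(), time()).timestamp()
first_day = today_start - 29 * SECONDS_PER_DAY

# A seeded generator keeps this example reproducible.
stamps = batch_timestamps(first_day, n_events=5, rng=random.Random(0))

# 1 baseline batch + 6 undrifted + 12 drifted + 6 undrifted batches.
n_batches = 1 + 6 + 12 + 6
last_batch_days_back = 29 - (n_batches - 1)
print(f'{n_batches} daily batches; the last one is {last_batch_days_back} days back')
```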

In [58]:
concatenated_df = pd.concat(event_batches_df)
concatenated_df.to_csv('production_nlp_text_multiclassifier.csv', index=False)

Lastly, let's publish our events with synthetic drift.

In [None]:
for event_df in event_batches_df:
    client.publish_events_batch(
        project_id=PROJECT_ID,
        model_id=MODEL_ID,
        batch_source=event_df,
        timestamp_field='timestamp'  # comment out this line if you are not adding timestamps
    )

# 6. Get insights


**You're all done!**
  
You can now head to your Fiddler URL and start getting enhanced observability into your model's performance.



In particular, you can go to your model's default dashboard in Fiddler and check out the resulting drift chart. Below is a screenshot of the data drift chart after running this notebook on the [Fiddler demo](https://demo.fiddler.ai/) deployment. (Annotation bubbles are not generated by the Fiddler UI.)

<table>
    <tr>
        <td>
            <img src="https://raw.githubusercontent.com/fiddler-labs/fiddler-examples/main/quickstart/images/nlp_multiiclass_drift.png" />
        </td>
    </tr>
</table>