# Monitoring NLP data using Fiddler Vector Monitoring

This notebook walks through the steps for using Fiddler NLP monitoring. Fiddler employs a vector-based monitoring approach that can be used to monitor data drift in multi-dimensional data such as NLP embeddings and images. The use case shown here is monitoring NLP embeddings to detect drift in text data.

Fiddler is the pioneer in enterprise Model Performance Management (MPM), offering a unified platform that enables Data Science, MLOps, Risk, Compliance, Analytics, and LOB teams to **monitor, explain, analyze, and improve ML deployments at enterprise scale**. 
Obtain contextual insights at any stage of the ML lifecycle, improve predictions, increase transparency and fairness, and optimize business revenue.

---

You can experience Fiddler's NLP monitoring ***in minutes*** by following these six quick steps:

1. Connect to Fiddler
2. Fetch, prepare, and vectorize the 20 Newsgroups data
3. Upload the vectorized baseline dataset
4. Add metadata about your model
5. Publish production events
6. Get insights

## Imports

In [None]:
!pip install -q fiddler-client

import fiddler as fdl
import pandas as pd
import numpy as np
import time
import sys
import random
import logging

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

logging.basicConfig(level=logging.INFO, stream=sys.stdout)
logger = logging.getLogger(__name__)

logger.info(f"Running Fiddler client version {fdl.__version__}")

In [None]:
pd.__version__

# 1. Connect to Fiddler

Before you can add information about your model to Fiddler, you'll need to connect using our API client.

---

**We need a few pieces of information to get started.**
1. The URL you're using to connect to Fiddler
2. Your organization ID
3. Your authorization token

The latter two of these can be found by pointing your browser to your Fiddler URL and navigating to the **Settings** page.

In [None]:
URL = 'https://danny.trial.fiddler.ai' # Make sure to include the full URL (including https://).
ORG_ID = 'danny'
AUTH_TOKEN = 'qBVrDEtZfe9xWFZukYUGNSdWG004gWCD3drfNmfgNUg'
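
Hard-coding an authorization token in a shared notebook makes it easy to leak. As one possible alternative, the sketch below reads the three connection values from environment variables. The variable names `FIDDLER_URL`, `FIDDLER_ORG_ID`, and `FIDDLER_AUTH_TOKEN` and the helper `load_fiddler_settings` are assumptions made for this example; they are not defined by the Fiddler client.

```python
import os

# Hypothetical environment variable names; pick whatever fits your setup.
REQUIRED_VARS = ("FIDDLER_URL", "FIDDLER_ORG_ID", "FIDDLER_AUTH_TOKEN")


def load_fiddler_settings(env=None):
    """Return (url, org_id, auth_token) read from a mapping of environment variables."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError(f"Missing settings: {', '.join(missing)}")
    url = env["FIDDLER_URL"].rstrip("/")
    if not url.startswith("https://"):
        raise ValueError("The URL should be the full address, including https://")
    return url, env["FIDDLER_ORG_ID"], env["FIDDLER_AUTH_TOKEN"]


# Example with an explicit mapping so the sketch runs without real credentials.
example_env = {
    "FIDDLER_URL": "https://example.fiddler.ai/",
    "FIDDLER_ORG_ID": "example_org",
    "FIDDLER_AUTH_TOKEN": "not-a-real-token",
}
example_settings = load_fiddler_settings(example_env)
```

In the notebook, `URL, ORG_ID, AUTH_TOKEN = load_fiddler_settings()` could then take the place of the literal assignments.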

Now just run the following code block to connect to the Fiddler API!

In [None]:
client = fdl.FiddlerApi(
    url=URL,
    org_id=ORG_ID,
    auth_token=AUTH_TOKEN,
    version=2
)

Once you connect, you can create a new project by specifying a unique project ID in the client's `create_project` function.

In [None]:
PROJECT_ID = 'simple_nlp_monitoring_example'

if PROJECT_ID not in client.list_projects():
    logger.info(f'Creating project: {PROJECT_ID}')
    client.create_project(PROJECT_ID)
else:
    logger.info(f'Project: {PROJECT_ID} already exists')

# 2. Fetch, prepare, and vectorize the 20 Newsgroups data

In order to get insights into the model's performance, **Fiddler needs a small sample of data that can serve as a baseline** for making comparisons with production inferences (also known as events).

For this model's baseline dataset, we will use the __"20 newsgroups text dataset"__. This dataset contains around 18,000 newsgroup posts on 20 topics. It is available as one of the standard scikit-learn real-world datasets and can be fetched directly using scikit-learn.

Let's first configure our baseline size and the topics present in the baseline. For the baseline data, let's sample text data from topics related to computers, sports, and politics:

In [None]:
baseline_size = 2000 # number of records in the baseline dataset

# Topics present in the baseline dataset
baseline_cat = ["comp.graphics",
                "comp.sys.ibm.pc.hardware",
                "comp.sys.mac.hardware",
                "comp.windows.x",
                "rec.autos",
                "rec.motorcycles",
                "rec.sport.baseball",
                "rec.sport.hockey",
                "talk.politics.guns",
                "talk.politics.mideast",
                "talk.politics.misc",
               ]

Use sklearn to fetch our baseline dataset of newsgroup posts.

In [None]:
# Fetch newsgroup posts within our categories and remove headers, footers, and quotes.
data, labels = fetch_20newsgroups(
    shuffle=True,
    random_state=1,
    remove=("headers", "footers", "quotes"),
    return_X_y=True,
    categories=baseline_cat
)

# Strip leading/trailing newlines, commas, and equals signs, then drop posts that end up empty
data_filtered = list(filter(None, [s.strip('\n,=') for s in data]))

# Sample baseline_size (2000) random posts for our baseline
base_samples = random.sample(data_filtered, baseline_size)
base_series = pd.Series(base_samples)
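
To see what the filtering step does, here is a small sketch on made-up posts (the `posts` list is hypothetical). `str.strip` with an argument removes only leading and trailing characters from the given set, so commas and line breaks inside a post are left alone, and `filter(None, ...)` then drops any post that ends up empty.

```python
# Hypothetical raw posts, including two that contain no real content.
posts = [
    "\n\nHello, world\n",
    "===\n",
    "",
    ",=Short post=,\n",
    "line one\nline two",
]

# Same expression as in the notebook: strip edge characters, then drop empty strings.
cleaned = list(filter(None, [s.strip('\n,=') for s in posts]))

for post in cleaned:
    print(repr(post))
```

Posts that consist only of spaces would survive this step, because a space is not in the stripped character set.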

### Vectorization

Fiddler monitors NLP and CV data by using encoded data in the form of embeddings, or **vectors**.  Before we load our baseline or our event data into the Fiddler platform for monitoring purposes, we must *vectorize* the raw NLP input.  

The following section provides two methods of vectorizing the NLP data: *TF-IDF vectorization* and *word2vec*. Please run only one method.

***Method 1: TF-IDF vectorization***

In [None]:
vectorizer = TfidfVectorizer(sublinear_tf=True, max_features=300, min_df=0.02, max_df=0.9, stop_words='english')
vectorizer.fit(base_series)
tfidf_baseline_sparse = vectorizer.transform(base_series)

In [None]:
# Name the embedding columns f1, f2, ..., one per TF-IDF feature
n_features = len(vectorizer.get_feature_names_out())
embedding_cols = ['f'+str(c+1) for c in range(n_features)]

In [None]:
baseline_df = pd.DataFrame.sparse.from_spmatrix(tfidf_baseline_sparse, columns=embedding_cols)
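
To make the TF-IDF step more concrete, here is a minimal sketch on a four-document toy corpus. The documents and the `toy_` names are made up for illustration, and `max_features`, `min_df`, and `max_df` are left at their defaults because the fractional thresholds used above would prune almost everything from such a tiny corpus.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini corpus with two loose "topics".
toy_docs = [
    "The new graphics card renders frames quickly",
    "Our hockey team won the playoff game",
    "Graphics drivers crashed after the update",
    "The hockey season opens tonight",
]

toy_vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words='english')
toy_matrix = toy_vectorizer.fit_transform(toy_docs)  # sparse, one row per document

# Same column naming scheme as the notebook: f1, f2, ...
toy_cols = [f'f{i + 1}' for i in range(toy_matrix.shape[1])]
toy_df = pd.DataFrame(toy_matrix.toarray(), columns=toy_cols)

# With the default norm='l2', every non-empty row has unit length.
row_norms = np.linalg.norm(toy_matrix.toarray(), axis=1)

# Words never seen during fit are ignored at transform time.
unseen = toy_vectorizer.transform(["quantum chromodynamics"])

print(toy_df.shape, row_norms.round(3), unseen.toarray().sum())
```

The last line matters for monitoring: production text whose vocabulary has moved away from the baseline maps to sparser, differently distributed vectors, which is the kind of shift drift detection looks for.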


***Method 2: word2vec by spaCy***

The following lines show how to use the ***word2vec*** embedding from spaCy. In order to run the following cell, you need to install spaCy and one of its pre-trained models, such as 'en_core_web_lg'. See: https://spacy.io/usage

In [None]:
# import spacy
# nlp = spacy.load('en_core_web_lg')

# s = time.time()
# base_embeddings = base_series.apply(lambda sentence: nlp(sentence).vector)
# print(f' Time to compute embeddings {time.time() - s}')

# baseline_df = pd.DataFrame(base_embeddings.values.tolist())
# baseline_df = baseline_df.rename(columns = {c:'f'+str(c+1) for c in baseline_df.columns})
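
spaCy's `Doc.vector` defaults to an average of the token vectors, so each post becomes a single fixed-length vector regardless of how long the post is. The sketch below illustrates that averaging with plain NumPy. The tiny vocabulary `toy_vectors` and the helper `average_vector` are hypothetical, and treating unknown words as zero vectors is an assumption of this sketch.

```python
import numpy as np

DIM = 3  # toy dimensionality; pre-trained models use far more dimensions

# Hypothetical word vectors for illustration only.
toy_vectors = {
    "fast": np.array([1.0, 0.0, 0.0]),
    "car": np.array([0.0, 1.0, 0.0]),
    "engine": np.array([0.0, 1.0, 1.0]),
}


def average_vector(text):
    """Average the vectors of whitespace-separated tokens; unknown tokens count as zeros."""
    tokens = text.lower().split()
    if not tokens:
        return np.zeros(DIM)
    vectors = [toy_vectors.get(tok, np.zeros(DIM)) for tok in tokens]
    return np.mean(vectors, axis=0)


doc_vec = average_vector("Fast car engine")
print(doc_vec)
```

Stacking one such vector per post gives the same kind of wide numeric table as the TF-IDF route, which is why the rest of the notebook works with either method.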

### Add output, target and additional data columns

The following section manufactures other columns needed for our baseline dataset, namely our output, target and some synthetic tabular features.

In [None]:
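# Synthetic income feature
baseline_df['income'] = np.random.rand(baseline_size)*1000

# Random scores stand in for model output; targets are derived from the scores with a 0.5 threshold
predictions = np.random.rand(baseline_size)
targets = [1 if s>0.5 else 0 for s in predictions]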

baseline_df['predicted_score'] = predictions
baseline_df['target'] = targets
baseline_df.head()

# 3. Upload the vectorized baseline dataset to Fiddler

Next, let's create a [DatasetInfo](https://docs.fiddler.ai/reference/fdldatasetinfo) object to describe our baseline dataset and then call [upload_dataset()](https://docs.fiddler.ai/reference/clientupload_dataset) to upload it to Fiddler.

In [None]:
DATASET_ID = 'newsgroups1'  # The dataset name in Fiddler platform
dataset_info = fdl.DatasetInfo.from_dataframe(baseline_df)

if DATASET_ID not in client.list_datasets(project_id=PROJECT_ID):
    logger.info(f'Upload dataset {DATASET_ID}')
    
    client.upload_dataset(
        project_id=PROJECT_ID,
        dataset_id=DATASET_ID,
        dataset={'baseline': baseline_df},
        info=dataset_info
    )
    
else:
    logger.info(f'Dataset: {DATASET_ID} already exists in Project: {PROJECT_ID}.\n'
               'The new dataset was not uploaded; please use a different name.')

# 4. Add metadata about the model

Next, we must tell Fiddler a bit more about our model. This is done by calling either [.register_model()](https://docs.fiddler.ai/reference/clientregister_model) or [.add_model()](https://docs.fiddler.ai/reference/clientadd_model). This notebook uses [.register_model()](https://docs.fiddler.ai/reference/clientregister_model), which also creates a surrogate model that enables additional insights such as feature impact and partial dependence analysis. When calling [.register_model()](https://docs.fiddler.ai/reference/clientregister_model), we must pass in a [model_info](https://docs.fiddler.ai/reference/fdlmodelinfo) object that describes our model's task, inputs, output, and target, and specifies which features are part of the NLP vector created above.

Let's first define our NLP vector using a custom feature.

In [None]:
CF1 = fdl.CustomFeature.from_columns(embedding_cols, n_clusters=5, custom_name='vector1')
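
To give some intuition for what a cluster count such as `n_clusters` can mean when monitoring vectors, here is a minimal sketch of clustering-based binning: fit k-means on the baseline vectors, treat each cluster as a histogram bin, and compare the baseline histogram with a production histogram using the Jensen-Shannon divergence. This illustrates the general idea only and is an assumption, not a description of Fiddler's internal implementation. The 2-D toy data and all function names are hypothetical.

```python
import numpy as np


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)


def fit_centroids(points, k, n_iter=25, seed=0):
    """Plain k-means: returns k centroids fitted on the baseline points."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign(points, centroids)
        for j in range(k):
            members = points[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids


def cluster_histogram(points, centroids):
    """Fraction of points that fall into each cluster (the bins)."""
    counts = np.bincount(assign(points, centroids), minlength=len(centroids))
    return counts / counts.sum()


def jensen_shannon(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])  # three toy "topics"


def sample(topic_ids, n_per_topic):
    """Draw Gaussian points around the chosen topic centers."""
    return np.vstack([centers[t] + rng.normal(size=(n_per_topic, 2)) for t in topic_ids])


baseline = sample([0, 1, 2], 300)
same_mix = sample([0, 1, 2], 100)   # production data with the baseline topic mix
drifted = sample([2], 300)          # production data from a single topic

centroids = fit_centroids(baseline, k=3)
base_hist = cluster_histogram(baseline, centroids)
jsd_same = jensen_shannon(base_hist, cluster_histogram(same_mix, centroids))
jsd_drift = jensen_shannon(base_hist, cluster_histogram(drifted, centroids))
print(f"same mix: {jsd_same:.3f}  drifted: {jsd_drift:.3f}")
```

The drifted sample concentrates its mass in fewer bins than the baseline, so its divergence comes out clearly larger than that of the sample drawn from the same topic mix.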

Now let's define our [model_info](https://docs.fiddler.ai/reference/fdlmodelinfo) object.

In [None]:
# Specify task
model_task = 'binary'

if model_task == 'regression':
    model_task = fdl.ModelTask.REGRESSION
    
elif model_task == 'binary':
    model_task = fdl.ModelTask.BINARY_CLASSIFICATION

elif model_task == 'multiclass':
    model_task = fdl.ModelTask.MULTICLASS_CLASSIFICATION

elif model_task == 'ranking':
    model_task = fdl.ModelTask.RANKING
    
    
# Specify column types
target = 'target'
outputs = ['predicted_score']
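# Index.drop takes the labels as one list; a second positional argument would be read as `errors`
features = baseline_df.columns.drop(['target', 'predicted_score']).tolist()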

model_info = fdl.ModelInfo.from_dataset_info(
    dataset_info = dataset_info,
    dataset_id = DATASET_ID,
    features = features,
    target = target,
    outputs = outputs,
    custom_features = [CF1],
    model_task=model_task,
    binary_classification_threshold=0.5 #optional
)
model_info
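
Because the feature list is derived from the DataFrame's columns, it is worth confirming that neither the target nor the output column has slipped into it. Here is a minimal sketch on a toy frame (the values and the `toy_` names are made up); note that pandas' `Index.drop` takes the labels to remove as a single list.

```python
import pandas as pd

# Hypothetical frame with two embedding columns, one tabular feature, output, and target.
toy_df = pd.DataFrame({
    'f1': [0.1, 0.0],
    'f2': [0.0, 0.7],
    'income': [420.0, 130.0],
    'predicted_score': [0.8, 0.3],
    'target': [1, 0],
})

non_feature_cols = ['target', 'predicted_score']
toy_features = toy_df.columns.drop(non_feature_cols).tolist()

# Anything in this set would mean the model is being told its own output is an input.
leftover = set(non_feature_cols) & set(toy_features)
print(toy_features, leftover)
```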

And call [.register_model()](https://docs.fiddler.ai/reference/clientregister_model) to tell Fiddler about our model and to create a model surrogate.

In [None]:
MODEL_ID = 'newsgroup_classifier' # choose a different model ID if this one already exists

if MODEL_ID not in client.list_models(project_id=PROJECT_ID):
    client.register_model(
        project_id=PROJECT_ID,
        dataset_id=DATASET_ID,
        model_id=MODEL_ID,
        model_info=model_info
    )
else:
    logger.info(f'Model: {MODEL_ID} already exists in Project: {PROJECT_ID}. Please use a different name.')

# 5. Publish production events

Let's manufacture some events we can publish to Fiddler. Here, we will exploit the available topic tags to generate synthetic data drift. We achieve this by sampling the event data from a different set of topics than the one used for the baseline dataset.

In [None]:
event_batches=[]
FRAC = 0.2 #sample size in each time interval as a fraction of baseline data

#first bin is a random sample from baseline
event_df = baseline_df.sample(frac=FRAC)
event_batches.append(event_df)

Generate events based on two groups of topics, each a subset of the baseline topics.

In [None]:
group1 = ["rec.autos",
         "rec.motorcycles",
         "rec.sport.baseball",
         "rec.sport.hockey",
         "talk.politics.guns",
         "talk.politics.mideast",
         "talk.politics.misc",]

group2 = ["talk.politics.guns",
         "talk.politics.mideast",
         "talk.politics.misc",]

event_categories = {'group1':group1, 'group2':group2}
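
The drift in this notebook comes entirely from which topics are left out. A short sketch using plain Python sets makes that explicit by checking that each group is a subset of the baseline topics and listing what each group drops. The lists are copied from the cells above so the sketch runs on its own.

```python
baseline_cat = [
    "comp.graphics", "comp.sys.ibm.pc.hardware", "comp.sys.mac.hardware", "comp.windows.x",
    "rec.autos", "rec.motorcycles", "rec.sport.baseball", "rec.sport.hockey",
    "talk.politics.guns", "talk.politics.mideast", "talk.politics.misc",
]
event_categories = {
    'group1': ["rec.autos", "rec.motorcycles", "rec.sport.baseball", "rec.sport.hockey",
               "talk.politics.guns", "talk.politics.mideast", "talk.politics.misc"],
    'group2': ["talk.politics.guns", "talk.politics.mideast", "talk.politics.misc"],
}

dropped = {}
for name, topics in event_categories.items():
    # Every event topic should also exist in the baseline.
    assert set(topics) <= set(baseline_cat), f"{name} has topics outside the baseline"
    dropped[name] = sorted(set(baseline_cat) - set(topics))
    print(name, "drops", len(dropped[name]), "of", len(baseline_cat), "baseline topics")
```

Group 1 removes the computer topics and group 2 additionally removes the recreation topics, so the second group should sit further from the baseline than the first.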

Now let's manufacture events by fetching the newsgroup data again just for each group above.  Pay attention to the vectorization step as it is different based on whether we used TF-IDF or word2vec above.

In [None]:
# Now generate events from specific topic groups

for cat in event_categories.keys():
    data, labels = fetch_20newsgroups(
        shuffle=True,
        random_state=1,
        remove=("headers", "footers", "quotes"),
        categories=event_categories[cat],
        return_X_y=True,
    )
    
    event_size = int(baseline_size*FRAC)
    
    data_filtered = list(filter(None, [s.strip('\n,=') for s in data]))
    samples = pd.Series(random.sample(data_filtered,event_size))

    # use the following lines for TF-IDF embedding
    tfidf_embedding = vectorizer.transform(samples)
    event_df = pd.DataFrame.sparse.from_spmatrix(tfidf_embedding, columns=embedding_cols)
        
    # use the following lines for word2vec embedding 
    #embeddings = samples.apply(lambda sentence: nlp(sentence).vector)
    #event_df = pd.DataFrame(embeddings.values.tolist(), columns=embedding_cols)
    
    # Synthetic income, matching the 'income' feature defined in the baseline
    event_df['income'] = np.random.rand(event_size)*1000
                        
    predictions = np.random.rand(event_size)
    targets = [1 if s>0.5 else 0 for s in predictions]
    event_df['predicted_score'] = predictions
    event_df['target'] = targets

    event_batches.append(event_df)
    
    #cat_text[cat] = samples
    #cat_embeddings[cat] =  np.stack(embeddings.values)
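
Since every event batch is built by hand, a renamed or forgotten synthetic column is easy to miss. The sketch below compares column sets between a baseline frame and an event frame, assuming events are expected to carry the same columns as the baseline (apart from the timestamp added later). The helper `schema_diff` and the toy frames are hypothetical.

```python
import pandas as pd


def schema_diff(baseline, events):
    """Return (columns missing from the events, columns unknown to the baseline)."""
    base_cols, event_cols = set(baseline.columns), set(events.columns)
    return sorted(base_cols - event_cols), sorted(event_cols - base_cols)


# Toy frames: the event frame uses a different name for its tabular feature.
toy_baseline = pd.DataFrame(
    {'f1': [0.2], 'income': [510.0], 'predicted_score': [0.6], 'target': [1]}
)
toy_events = pd.DataFrame(
    {'f1': [0.4], 'age': [37.0], 'predicted_score': [0.1], 'target': [0]}
)

missing, unexpected = schema_diff(toy_baseline, toy_events)
print("missing from events:", missing, "| not in baseline:", unexpected)
```

Running a check like this on each `event_df` before publishing can catch a mismatch early.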

Now let's manufacture timestamps for our event data.

In [None]:
# Generate two-hour bins. You can try daily bins by setting time_gap = 24*3600*1000
time_gap = 3600*1000*2  # milliseconds

# Start from 12 hours ago (timestamps are in milliseconds since the epoch)
timestamp = time.time()*1000 - 3600*12*1000
for event_df in event_batches:
    event_df['timestamp'] = timestamp
    timestamp += time_gap  
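
The arithmetic above is easier to check when the millisecond values are printed as dates. Here is a small sketch using only the standard library; it fixes "now" to an arbitrary constant so the output is reproducible (an assumption of the sketch; the notebook uses `time.time()`), and it uses three batches, matching the baseline sample plus the two topic groups.

```python
from datetime import datetime, timezone

HOUR_MS = 3600 * 1000
time_gap_ms = 2 * HOUR_MS
now_ms = 1_700_000_000_000          # fixed stand-in for time.time() * 1000
start_ms = now_ms - 12 * HOUR_MS    # first batch starts 12 hours before "now"

n_batches = 3
batch_stamps = [start_ms + i * time_gap_ms for i in range(n_batches)]

for i, ms in enumerate(batch_stamps):
    readable = datetime.fromtimestamp(ms / 1000, tz=timezone.utc).isoformat()
    print(f"batch {i}: {ms} ms -> {readable}")

# How far in the past each batch lands, in hours.
hours_before_now = [(now_ms - ms) / HOUR_MS for ms in batch_stamps]
```

With these settings, all batches land in the past, spaced two hours apart.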

And, finally, publish our event groups in batches.

In [None]:
for event_df in event_batches:
    client.publish_events_batch(
        project_id=PROJECT_ID,
        model_id=MODEL_ID,
        batch_source=event_df,
        timestamp_field= 'timestamp' #comment this line if you are not adding timestamps
    )

# 6. Get insights

**You're all done!**
  
Now just head to your Fiddler URL and start getting enhanced observability into your model's performance.

Run the following code block to get your URL.

In [None]:
print('/'.join([URL, 'projects', PROJECT_ID, 'models', MODEL_ID, 'monitor']))

*Please allow 3-5 minutes for monitoring data to populate the charts.*

<table>
    <tr>
        <td><img src="https://raw.githubusercontent.com/fiddler-labs/fiddler-samples/master/content_root/tutorial/quickstart/images/nlp1.png" /></td>
    </tr>
</table>

---


**Questions?**  
  
Check out [our docs](https://docs.fiddler.ai/) for a more detailed explanation of what Fiddler has to offer.

Join our [community Slack](http://fiddler-community.slack.com/) to ask any questions!

If you're still looking for answers, fill out a ticket on [our support page](https://fiddlerlabs.zendesk.com/) and we'll get back to you shortly.