In this demo we'll use the AG news dataset to show how deepchecks can be used to identify and investigate data drift in NLP data.

The news dataset contains the first paragraphs of news stories, alongside their broad classifications into topics.

# Create a deepchecks TextData

In [None]:
import sys
!{sys.executable} -m pip install -U deepchecks[nlp]

## Download files

In [None]:
import pandas as pd
import requests
import zipfile
from io import BytesIO, StringIO

def download_gdrive_file(file_id, paqruet=False):
    """Download a file from Google Drive and load it into a DataFrame."""
    url = f"https://drive.google.com/uc?export=download&id={file_id}"

    # Send a request to download the file
    response = requests.get(url)

    # Fail loudly on an unsuccessful download instead of returning an undefined `df`
    if response.status_code != 200:
        raise RuntimeError(f"Unable to download the file. Status code: {response.status_code}")

    # Parse the payload as parquet or CSV (the `paqruet` flag name is kept for the calls below)
    if paqruet:
        return pd.read_parquet(BytesIO(response.content))
    csv_content = response.content.decode('utf-8')
    return pd.read_csv(StringIO(csv_content))

In [None]:
train_text = download_gdrive_file('17KlCcAaaUMoYzStyqpcNnmMBAEvsqMIO')
test_text = download_gdrive_file('14-lGyJ-UxJp-eek8Y376sjq9RpmWx8yu')

In [None]:
train_labels = download_gdrive_file('1XMjetF-2p46SjQDeopnB1wkAvaSriQDW')
test_labels = download_gdrive_file('1rTKihXkiJSqart3W88FxXzUBUzNDECi_')

## Create TextData

In [None]:
from deepchecks.nlp import TextData

Deepchecks' TextData object contains the text samples, the labels, and optionally properties and metadata.
It caches results to save time between repeated computations and provides functionality for input validation and sampling.

In [None]:
train = TextData(train_text.values.flatten(), label=train_labels.values.flatten(), task_type='text_classification')
test = TextData(test_text.values.flatten(), label=test_labels.values.flatten(), task_type='text_classification')

In [None]:
train.head()

## Load text properties

Some of Deepchecks' checks use properties of the text samples for various calculations.
Deepchecks has a wide variety of such properties: some are simple, while others rely on external models and are heavier to run.
For Deepchecks' checks to be able to access the properties, they must be stored within the TextData object.
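
To make the idea of a "property" concrete, here is a minimal sketch of two simple ones computed in plain Python. The function names and the sample texts are hypothetical, and this is not how deepchecks computes its own properties; it only shows that a property is one value per text sample.

```python
def text_length(text: str) -> int:
    """Number of characters in the sample."""
    return len(text)


def average_word_length(text: str) -> float:
    """Mean number of characters per whitespace-separated word."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)


# Hypothetical samples, not taken from the AG news data
samples = ["Stocks rally as oil prices fall", "What a game!!"]

# One row of property values per sample
properties = [
    {
        "Text Length": text_length(sample),
        "Average Word Length": average_word_length(sample),
    }
    for sample in samples
]
```

Each row lines up with one text sample, which is the same shape as the precomputed properties table loaded in the next cells.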

In [None]:
train_properties = download_gdrive_file('18cv_lsk9pshiRBI9xCbRlBV0f40mNqIZ')
test_properties = download_gdrive_file('1d_Ed2VDHr9nyhpnAEfxrSuu5FtYs-3Xm')

In [None]:
train.set_properties(train_properties)
test.set_properties(test_properties)

In [None]:
train.properties.head(2)

In [None]:
# # We could also have used deepchecks to calculate them
# from torch import device
# train.calculate_default_properties(include_long_calculation_properties=True, device=device('mps'))
# test.calculate_default_properties(include_long_calculation_properties=True,  device=device('mps'))

## Train a model on the data

We'll train a simple XGBoost classifier on OpenAI ada-002 embeddings of the text.

In [None]:
train_embeddings = download_gdrive_file('1I5ZLzgv6dQZ-S_uqUkoQPCXqckmceMs7')
test_embeddings = download_gdrive_file('1iiTugfVSUwkZaawwMcppKEn2N7381aMz', paqruet=True)

In [None]:
label_map = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tech"}

In [None]:
from xgboost import XGBClassifier

In [None]:
model = XGBClassifier(n_estimators=100, max_depth=7, random_state=42)

In [None]:
model.fit(train_embeddings, pd.Categorical(train.label, categories=label_map.values()).codes)

We'll compute the predictions and probabilities for the two datasets using that simple model.

In [None]:
train_pred = (pd.Series(model.predict(train_embeddings)) + 1).replace(label_map)
test_pred = (pd.Series(model.predict(test_embeddings)) + 1).replace(label_map)
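
The `+ 1` in the prediction cell undoes the encoding used for training: `pd.Categorical(...).codes` assigns each label its zero-based position in `categories`, while the keys of `label_map` start at 1. A plain-Python sketch of the same round trip, using a hypothetical list of labels:

```python
label_map = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tech"}
categories = list(label_map.values())

# Hypothetical labels, for illustration only
labels = ["Sports", "World", "Sci/Tech"]

# Encode: zero-based position of each label in `categories`
codes = [categories.index(name) for name in labels]

# Decode: shift back to the one-based keys of `label_map`
decoded = [label_map[code + 1] for code in codes]
```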

In [None]:
train_proba = model.predict_proba(train_embeddings)
test_proba = model.predict_proba(test_embeddings)

# Finding data drift using the Property Drift check

We'll instantiate the Property Drift check, which uses statistical measures to find changes in the distribution of each property between the two datasets.
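
As an illustration of what such a statistical measure can look like for a numeric property, the sketch below computes the two-sample Kolmogorov-Smirnov statistic, i.e. the largest gap between the two empirical distribution functions. The helper name and the values are hypothetical; which measures and defaults the check actually applies is defined by deepchecks, not by this sketch.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    return max(
        abs(bisect_right(a, value) / len(a) - bisect_right(b, value) / len(b))
        for value in a + b
    )


# Hypothetical text lengths, not taken from the AG news data
train_lengths = [100, 120, 130, 150]
test_lengths = [140, 160, 170, 190]

drift = ks_statistic(train_lengths, test_lengths)
```

A value of 0 means the two empirical distributions coincide, and a value of 1 means they do not overlap at all.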

In [None]:
from deepchecks.nlp.checks import PropertyDrift

check = PropertyDrift(n_top_properties=3)
res = check.run(train, test)
res

We can easily identify some significant drifts: the test data contains many more informal samples than the training data. Additionally, we can see that news stories in the test data tend to be slightly longer.

# Investigating data drift using the Embedding Drift check

We'll use the same embeddings from earlier to run the check:

In [None]:
train.set_embeddings(train_embeddings)
test.set_embeddings(test_embeddings)

In [None]:
# # We could also have calculated them on the spot
# train.calculate_default_embeddings(model='open_ai')
# test.calculate_default_embeddings(model='open_ai')

In [None]:
from deepchecks.nlp.checks import TextEmbeddingsDrift

check = TextEmbeddingsDrift()
res = check.run(train, test)
res

We now notice a significant cluster (on the bottom right) that consists mainly of test samples. If we look into it, we see that it mostly contains samples dealing with Sports!
So the test data contains more articles about sports events, and that is probably also why we saw more informal texts in the test data: sports reporting tends to be less formal than science, business and world politics reporting.
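
One way to put a number on "consists mainly of test samples" is to compute the share of test samples in each cluster. The sketch below assumes cluster ids are already available from some clustering of the embeddings; the ids and origins are hypothetical, and this is not the check's internal implementation.

```python
from collections import Counter

# Hypothetical cluster id and dataset of origin for eight samples
clusters = [0, 0, 0, 0, 1, 1, 1, 1]
origins = ["train", "train", "test", "train", "test", "test", "test", "train"]

# Count all samples and test samples per cluster
totals = Counter(clusters)
test_counts = Counter(
    cluster for cluster, origin in zip(clusters, origins) if origin == "test"
)

# Fraction of each cluster that comes from the test set
test_share = {cluster: test_counts[cluster] / totals[cluster] for cluster in totals}
```

A cluster whose share is far above the overall test fraction (0.5 in this toy data) is a candidate for closer inspection.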

# How did this affect our model's performance?

## Model performance

In [None]:
from deepchecks.nlp.checks import TrainTestPerformance

In [None]:
TrainTestPerformance().add_condition_train_test_relative_degradation_less_than().run(train, test, train_predictions=train_pred, test_predictions=test_pred)

Note that the scarcity of samples dealing with sports in the training data led to a decline in recall on this class, as our condition has captured.
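
The idea behind a relative degradation condition can be sketched as follows: compare the test score to the train score as a fraction of the train score, and flag classes where the loss exceeds a cutoff. The recall values and the 0.1 threshold here are assumptions for illustration, not results from this notebook or a statement of the library's default.

```python
def relative_degradation(train_score: float, test_score: float) -> float:
    """Fraction of the train score lost on the test set (positive means worse)."""
    return (train_score - test_score) / train_score


# Hypothetical per-class recall values
train_recall = {"World": 0.95, "Sports": 0.98}
test_recall = {"World": 0.93, "Sports": 0.80}
threshold = 0.1  # assumed cutoff for illustration

# Classes whose recall dropped by at least the threshold
failing = [
    label
    for label in train_recall
    if relative_degradation(train_recall[label], test_recall[label]) >= threshold
]
```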

## Segment performance

We can use our Property Segment Performance check to see whether there are specific sub-segments that perform worse than the rest of the data.

In [None]:
from deepchecks.nlp.checks import PropertySegmentsPerformance

In [None]:
PropertySegmentsPerformance(segment_minimum_size_ratio=0.1).run(test, predictions=test_pred, probabilities=test_proba)

We'll note two interesting facts:
1. First, we perform worse on the low formality samples. This was expected: we know that sports reporting is less formal, and because sports samples were less abundant in the training data, the model does worse on them.
2. Second, we note that the model also does worse on reports with a low average word length. This was also surfaced by our Property Drift check, but now, after looking at the model's performance, we can say something more: we have Concept Drift! The low formality samples in the test data (mostly Sports) also use simpler language (shorter words) than the training data, and that's an additional reason why our model does worse on these new sports samples in the test data.
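
The comparison behind both observations can be sketched by hand: pick a property, split the samples into value ranges, and compute accuracy within each range. The check searches for weak segments automatically; the helper, the `formality` values and the predictions below are hypothetical and only illustrate the idea.

```python
def segment_accuracy(rows, prop, low, high):
    """Accuracy over rows whose property value falls in [low, high)."""
    segment = [row for row in rows if low <= row[prop] < high]
    if not segment:
        return None  # no samples in this range
    return sum(row["label"] == row["pred"] for row in segment) / len(segment)


# Hypothetical samples with one property value, a label and a prediction each
rows = [
    {"formality": 0.2, "label": "Sports", "pred": "World"},
    {"formality": 0.3, "label": "Sports", "pred": "Sports"},
    {"formality": 0.8, "label": "Business", "pred": "Business"},
    {"formality": 0.9, "label": "World", "pred": "World"},
]

low_formality_acc = segment_accuracy(rows, "formality", 0.0, 0.5)
high_formality_acc = segment_accuracy(rows, "formality", 0.5, 1.01)
```

A large gap between the two values points to a segment worth investigating, which is what the check reports for the low formality and low average word length ranges.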