In this demo we'll use the AG news dataset to show how deepchecks can be used to identify and investigate data drift in NLP data.

The news dataset contains the first paragraphs of news stories, alongside their broad classifications into topics.

# Create a deepchecks TextData

In [1]:
import sys
!{sys.executable} -m pip install -U 'deepchecks[nlp]'
# !{sys.executable} -m pip install "git+https://github.com/deepchecks/deepchecks.git@noam/data-ai#egg=deepchecks[nlp]"

Collecting deepchecks[nlp]
  Downloading deepchecks-0.18.1-py3-none-any.whl.metadata (5.7 kB)
Collecting pandas<2.2.0,>=1.1.5 (from deepchecks[nlp])
  Downloading pandas-2.1.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Collecting scikit-learn<1.4.0,>=0.23.2 (from deepchecks[nlp])
  Downloading scikit_learn-1.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Collecting PyNomaly>=0.3.3 (from deepchecks[nlp])
  Downloading PyNomaly-0.3.4-py3-none-any.whl.metadata (581 bytes)
Collecting category-encoders>=2.3.0 (from deepchecks[nlp])
  Downloading category_encoders-2.6.4-py2.py3-none-any.whl.metadata (8.0 kB)
Collecting scipy<=1.10.1,>=1.4.1 (from deepchecks[nlp])
  Downloading scipy-1.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (58 kB)
Collecting jupyter-server>=2.7.2 (from deepch

## Download files

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
import os
import pandas as pd
import requests
import zipfile
from io import BytesIO, StringIO

def download_gdrive_file(file_id, paqruet=False):
    # Check if the file has already been downloaded and stored locally
    filename = f"{file_id}.parquet" if paqruet else f"{file_id}.csv"
    if os.path.exists(filename):
        # Read the file from the local cache
        if paqruet:
            df = pd.read_parquet(filename)
        else:
            df = pd.read_csv(filename)
    else:
        url = f"https://drive.google.com/uc?export=download&id={file_id}"

        # Send a request to download the file
        response = requests.get(url)

        # Check if the request was successful
        if response.status_code == 200:
            # Parse the payload as Parquet or CSV, depending on the flag
            if paqruet:
                df = pd.read_parquet(BytesIO(response.content))
            else:
                csv_content = response.content.decode('utf-8')
                df = pd.read_csv(StringIO(csv_content))

            # Cache the file locally
            if paqruet:
                df.to_parquet(filename)
            else:
                df.to_csv(filename, index=False)
        else:
            # Raise instead of printing, so we never reach `return df` with df undefined
            raise RuntimeError(f"Unable to download the file. Status code: {response.status_code}")

    return df

In [4]:
train_text = download_gdrive_file('17KlCcAaaUMoYzStyqpcNnmMBAEvsqMIO')
test_text = download_gdrive_file('14-lGyJ-UxJp-eek8Y376sjq9RpmWx8yu')

In [5]:
train_labels = download_gdrive_file('1XMjetF-2p46SjQDeopnB1wkAvaSriQDW')
test_labels = download_gdrive_file('1rTKihXkiJSqart3W88FxXzUBUzNDECi_')

## Create TextData

In [6]:
from deepchecks.nlp import TextData





Deepchecks' TextData object contains the text samples, the labels, and optionally properties and metadata. <br>
It caches intermediate results to save time across repeated computations and provides functionality for input validation and sampling.

In [7]:
train = TextData(train_text.values.flatten(), label=train_labels.values.flatten(), task_type='text_classification')
test = TextData(test_text.values.flatten(), label=test_labels.values.flatten(), task_type='text_classification')
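A quick sanity check on the raw arrays can catch misaligned inputs before they are wrapped. The sketch below is only an illustration: `check_aligned` is a hypothetical helper, not part of the deepchecks API, and TextData performs its own validation independently of it.

```python
from typing import Sequence


def check_aligned(texts: Sequence[str], labels: Sequence[str]) -> None:
    """Hypothetical pre-flight check (not a deepchecks function)."""
    # Every text sample needs exactly one label
    if len(texts) != len(labels):
        raise ValueError(f"{len(texts)} texts but {len(labels)} labels")
    # Text samples are expected to be plain strings
    bad = [i for i, t in enumerate(texts) if not isinstance(t, str)]
    if bad:
        raise ValueError(f"non-string text at positions {bad[:5]}")


# Aligned inputs pass silently
check_aligned(["a story", "another story"], ["Business", "Sci/Tech"])

# Misaligned inputs raise a ValueError
try:
    check_aligned(["a story"], ["Business", "Sports"])
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```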

In [8]:
train.head()

Unnamed: 0,text,label
0,BEDFORD -- Scientists at NitroMed Inc. hope th...,Business
1,Even a genius can mess up. Bill Gates was a br...,Business
2,Central Square in Lynn should be looking a bit...,Business
3,The Blues is alive and well in the Philippines...,Business
4,AP - The prospect that a tropical storm and a ...,Sci/Tech


## Load text properties

Some of Deepchecks' checks use properties of the text samples for various calculations. <br>
Deepchecks has a wide variety of such properties: some are simple, while others rely on external models and are heavier to run. <br>
For Deepchecks' checks to be able to access the properties, they must be stored within the TextData object.
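To give a feel for what the "simple" properties look like, here is a minimal sketch that approximates a few of them in plain Python. The function name `simple_text_properties` and the tokenization rules are assumptions made for illustration; Deepchecks' own implementations may tokenize and count differently, so the numbers will not necessarily match the table below.

```python
import string


def simple_text_properties(text: str) -> dict:
    """Approximate a few cheap per-sample properties (illustrative only)."""
    # Split on whitespace and strip surrounding punctuation from each token
    words = [w.strip(string.punctuation) for w in text.split()]
    words = [w for w in words if w]
    # Treat anything that is neither alphanumeric nor whitespace as "special"
    special = [c for c in text if not c.isalnum() and not c.isspace()]
    return {
        "Text Length": len(text),
        "Average Word Length": sum(map(len, words)) / len(words) if words else 0.0,
        "Max Word Length": max(map(len, words), default=0),
        "% Special Characters": len(special) / len(text) if text else 0.0,
    }


props = simple_text_properties("Even a genius can mess up.")
print(props)
```

Properties such as Toxicity, Fluency, or Formality cannot be approximated this way, since they rely on external models.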

In [9]:
train_properties = download_gdrive_file('18cv_lsk9pshiRBI9xCbRlBV0f40mNqIZ')
test_properties = download_gdrive_file('1d_Ed2VDHr9nyhpnAEfxrSuu5FtYs-3Xm')

In [10]:
train_properties

Unnamed: 0,Text Length,Average Word Length,Max Word Length,% Special Characters,Language,Sentiment,Subjectivity,Toxicity,Fluency,Formality,Lexical Density,Unique Noun Count
0,158,5.360000,12,0.031646,en,0.033333,0.416667,0.000617,0.975427,0.854858,95.83,10
1,368,5.150000,14,0.019022,en,0.145000,0.635000,0.002366,0.958778,0.820170,84.75,15
2,259,5.046512,12,0.046332,en,0.034091,0.442803,0.000546,0.794649,0.868064,90.70,23
3,231,5.270270,12,0.030303,en,-0.075000,0.325000,0.000600,0.944581,0.820948,86.49,14
4,187,4.812500,14,0.032086,en,0.000000,0.562500,0.000628,0.944103,0.791041,93.55,9
...,...,...,...,...,...,...,...,...,...,...,...,...
2787,180,5.961538,13,0.016667,en,-0.050000,0.150000,0.000594,0.975025,0.898856,100.00,13
2788,207,5.181818,14,0.033816,en,0.000000,0.000000,0.000712,0.966561,0.815303,90.91,13
2789,185,5.413793,14,0.032432,en,0.050000,0.350000,0.000799,0.930435,0.797099,93.33,10
2790,259,5.666667,15,0.015444,en,0.108939,0.339242,0.000795,0.914457,0.737241,87.18,9


In [11]:
train.set_properties(train_properties)
test.set_properties(test_properties)

In [12]:
train.properties.head(2)

Unnamed: 0,Text Length,Average Word Length,Max Word Length,% Special Characters,Language,Sentiment,Subjectivity,Toxicity,Fluency,Formality,Lexical Density,Unique Noun Count
0,158,5.36,12,0.031646,en,0.033333,0.416667,0.000617,0.975427,0.854858,95.83,10
1,368,5.15,14,0.019022,en,0.145,0.635,0.002366,0.958778,0.82017,84.75,15


In [13]:
# We could also have used deepchecks to calculate them
# from torch import device
# train.calculate_default_properties(include_long_calculation_properties=True, device=device('mps'))
# test.calculate_default_properties(include_long_calculation_properties=True,  device=device('mps'))

## Train a model on the data

We'll train a simple XGBoost classifier on OpenAI ada-002 embeddings of the text.

In [14]:
train_embeddings = download_gdrive_file('1I5ZLzgv6dQZ-S_uqUkoQPCXqckmceMs7')
test_embeddings = download_gdrive_file('1iiTugfVSUwkZaawwMcppKEn2N7381aMz', paqruet=True)

In [15]:
label_map = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tech"}

In [16]:
from xgboost import XGBClassifier

In [17]:
model = XGBClassifier(n_estimators=100, max_depth=7, random_state=42)

In [18]:
model.fit(train_embeddings, pd.Categorical(train.label, categories=label_map.values()).codes)

We'll compute predictions and class probabilities for both datasets using that simple model.

In [19]:
train_pred = (pd.Series(model.predict(train_embeddings)) + 1).replace(label_map)
test_pred = (pd.Series(model.predict(test_embeddings)) + 1).replace(label_map)
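The `+ 1` in the cell above exists because the model is trained on 0-based category codes, while `label_map` uses 1-based keys. The following pure-Python sketch walks through that round trip on a few toy labels; the `labels` list is made up for illustration.

```python
label_map = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tech"}
categories = list(label_map.values())

# Encode: the position of a name in `categories` is its 0-based code,
# which mirrors what pd.Categorical(..., categories=categories).codes returns
labels = ["Business", "Sci/Tech", "World", "Sports"]  # toy labels
codes = [categories.index(name) for name in labels]

# Decode: shift back to the 1-based keys of label_map, then look up the name
decoded = [label_map[code + 1] for code in codes]

print(codes)
print(decoded)
```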

In [20]:
train_proba = model.predict_proba(train_embeddings)
test_proba = model.predict_proba(test_embeddings)

# Finding data drift using the Property Drift check

We'll instantiate the Property Drift check, which uses statistical measures to find changes in the distribution of each property between the two datasets.

In [21]:
from deepchecks.nlp.checks import PropertyDrift

check = PropertyDrift(n_top_properties=3)
res = check.run(train, test)
res

We can easily identify some significant drifts: the test data contains many more informal samples than the training data. Additionally, news stories in the test data tend to be slightly longer.
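As an illustration of what a "statistical measure" of drift can look like for a numeric property, the sketch below computes the two-sample Kolmogorov-Smirnov statistic, i.e. the largest gap between the two empirical CDFs. This is one common choice, shown under the assumption that it is representative; the measure the check actually applies depends on its configuration, and the formality scores here are made up.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    gap = 0.0
    # The empirical CDFs only change at observed values, so checking those suffices
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Made-up formality scores, for illustration only
train_formality = [0.85, 0.82, 0.87, 0.82, 0.79, 0.90]
test_formality = [0.55, 0.60, 0.81, 0.45, 0.86, 0.50]

same = ks_statistic(train_formality, train_formality)
shifted = ks_statistic(train_formality, test_formality)
print(same, shifted)
```

A statistic near 0 means the two samples are distributed similarly; values approaching 1 mean the distributions barely overlap.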

# Investigating data drift using the Embedding Drift check

We'll use the same embeddings from earlier to run the check.

In [22]:
train.set_embeddings(train_embeddings)
test.set_embeddings(test_embeddings)

In [23]:
# # We could also have calculated them on the spot
# train.calculate_default_embeddings(model='open_ai')
# test.calculate_default_embeddings(model='open_ai')

In [24]:
train_with_labels = train.copy()
test_with_labels = test.copy()

In [25]:
train_with_labels = train_with_labels.cast_to_dataset(train_with_labels)
test_with_labels = test_with_labels.cast_to_dataset(test_with_labels)

In [26]:
train_with_labels.head()

Unnamed: 0,text,label
0,BEDFORD -- Scientists at NitroMed Inc. hope th...,Business
1,Even a genius can mess up. Bill Gates was a br...,Business
2,Central Square in Lynn should be looking a bit...,Business
3,The Blues is alive and well in the Philippines...,Business
4,AP - The prospect that a tropical storm and a ...,Sci/Tech


In [27]:
# TextData does not support item access, so the original
# `train_with_labels['label'] = train_with_labels['label'].map(label_map)`
# raised "TypeError: 'TextData' object is not subscriptable".
# No mapping is needed here: the labels are already topic names (see the output above).

In [None]:
from deepchecks.nlp.checks import TextEmbeddingsDrift

check = TextEmbeddingsDrift()
res = check.run(train, test)
res

We can now see a significant cluster (on the bottom right) that consists mainly of test samples. Looking into it, we find that it is mostly made up of samples dealing with Sports!
So the test data contains more articles about sports events, which is probably also why we saw more informal texts in the test data: sports reporting tends to be less formal than science, business and world politics reporting.
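One way to double-check a finding like this is to compare the share of each label in the two datasets. The sketch below does so on toy label lists (the counts are invented for illustration); in the notebook, the label arrays of `train` and `test` could be passed instead.

```python
from collections import Counter


def label_shares(labels):
    """Return the fraction of samples belonging to each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


# Toy label lists, invented for illustration
train_labels_demo = ["World"] * 4 + ["Business"] * 3 + ["Sci/Tech"] * 2 + ["Sports"] * 1
test_labels_demo = ["World"] * 2 + ["Business"] * 2 + ["Sci/Tech"] * 2 + ["Sports"] * 4

train_share = label_shares(train_labels_demo)
test_share = label_shares(test_labels_demo)

# Positive values mean the label is more common in the test data
all_names = sorted(set(train_share) | set(test_share))
shift = {n: test_share.get(n, 0.0) - train_share.get(n, 0.0) for n in all_names}
largest = max(shift, key=shift.get)
print(largest, round(shift[largest], 3))
```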

# How did this affect our model's performance?

## Model performance

In [None]:
from deepchecks.nlp.checks import TrainTestPerformance

In [None]:
check = TrainTestPerformance().add_condition_train_test_relative_degradation_less_than()
check.run(train, test, train_predictions=train_pred, test_predictions=test_pred)

First, note that the relative scarcity of sports samples in the training data led to a decline in Recall on this class, which our condition captured.
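The idea behind a relative-degradation condition can be sketched in a few lines: compute a per-class score on each dataset, then flag classes whose test score dropped by more than some fraction of the train score. Everything below is an assumption made for illustration, including the 0.1 threshold, the helper names, and the toy predictions; it is not the check's actual implementation.

```python
def recall_per_class(y_true, y_pred):
    """Recall for each class: correctly predicted samples / samples of that class."""
    out = {}
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        out[c] = sum(y_pred[i] == c for i in idx) / len(idx)
    return out


def relative_degradation(train_score, test_score):
    """Fraction of the train score that was lost on the test data."""
    return (train_score - test_score) / train_score if train_score else 0.0


THRESHOLD = 0.1  # assumed value, for illustration only

# Toy labels and predictions, invented for illustration
train_true_demo = ["Sports", "Sports", "World", "World"]
train_pred_demo = ["Sports", "Sports", "World", "World"]
test_true_demo = ["Sports", "Sports", "Sports", "Sports", "World", "World"]
test_pred_demo = ["Sports", "World", "Sports", "Business", "World", "World"]

train_recall = recall_per_class(train_true_demo, train_pred_demo)
test_recall = recall_per_class(test_true_demo, test_pred_demo)

degradation = {
    c: relative_degradation(train_recall[c], test_recall.get(c, 0.0))
    for c in train_recall
}
failing = [c for c, d in degradation.items() if d >= THRESHOLD]
print(degradation, failing)
```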

## Segment performance

We can use our Property Segments Performance check to see whether specific sub-segments of the data perform worse than the rest.

In [None]:
from deepchecks.nlp.checks import PropertySegmentsPerformance

In [None]:
PropertySegmentsPerformance(segment_minimum_size_ratio=0.1).run(test, predictions=test_pred, probabilities=test_proba)

We'll note two interesting facts:
1. First, we perform worse on the low-formality samples. This was expected: we know that sports reporting is less formal, and because sports articles were less abundant in the training data, the model does worse on them.
2. Second, the model also does worse on reports with a low average word length. This was also surfaced by our Property Drift check, but now, after looking at the model's performance, we can say something more: we have Concept Drift! The low-formality samples in the test data (mostly Sports) also use simpler language (shorter words) than the training data, and that is an additional reason why our model does worse on these new sports samples in the test data.
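The comparison behind the first observation can be sketched by splitting the samples on a single property value and scoring each side separately. The rows and the 0.5 cutoff below are invented for illustration; the actual check searches for weak segments automatically rather than relying on a hand-picked cutoff.

```python
def accuracy(pairs):
    """Fraction of (true, predicted) pairs that agree."""
    return sum(t == p for t, p in pairs) / len(pairs)


# Toy rows of (formality, true label, predicted label), invented for illustration
rows = [
    (0.35, "Sports", "World"),
    (0.40, "Sports", "Sports"),
    (0.42, "Sports", "Business"),
    (0.80, "Business", "Business"),
    (0.85, "World", "World"),
    (0.90, "Sci/Tech", "Sci/Tech"),
    (0.88, "World", "Business"),
    (0.83, "Business", "Business"),
]

CUTOFF = 0.5  # hand-picked for this sketch

# Split the samples into a low-formality and a high-formality segment
low = [(t, p) for f, t, p in rows if f < CUTOFF]
high = [(t, p) for f, t, p in rows if f >= CUTOFF]

low_acc = accuracy(low)
high_acc = accuracy(high)
print(f"low formality: {low_acc:.2f}, high formality: {high_acc:.2f}")
```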