Update drift guide for nlp #2595

Merged
merged 9 commits into from
Jun 15, 2023
90 changes: 88 additions & 2 deletions docs/source/general/guides/drift_guide.rst
@@ -11,6 +11,10 @@ This guide will help you understand what drift is and how you can detect in usin
* `Which Types of Drift Are There? <#which-types-of-drift-are-there>`__
* `How Do You Detect Drift? <#how-do-you-detect-drift>`__
* `How Can I Use Deepchecks to Detect Drift? <#how-can-i-use-deepchecks-to-detect-drift>`__

  * `Tabular Data <#tabular-data>`__
  * `Text (NLP) Data <#text-nlp-data>`__
  * `Computer Vision Data <#computer-vision-data>`__
* `What Can You Do in Case of Drift? <#what-can-you-do-in-case-of-drift>`__
* `Code Examples <#code-examples>`__

@@ -181,6 +185,32 @@ For code examples, see `here <#tabular-checks>`__
All of these checks appear also in the `deepchecks interactive demo <https://checks-demo.deepchecks.com>`__, where you can
insert corruption into the data and see the checks at work.

Text (NLP) Data
---------------

To detect `data <#data-drift>`__ or `concept drift <#concept-drift>`__ in text data, we can't measure drift on the text
directly, because raw text is not a structured variable that can be measured.
However, we can use different methods to represent the text as a structured variable, and then measure drift on that variable.
In deepchecks-NLP, we use two such methods: :ref:`Text Embeddings <nlp__embeddings_guide>` and :ref:`Text Properties <nlp__properties_guide>`.

Both methods have their pros and cons when used to measure drift: Properties are more explainable, but will not necessarily
capture all the information in the text. Embeddings are able to find more complex patterns in the text, but these
patterns may be difficult to explain. Therefore, we recommend using both methods to detect
`data <#data-drift>`__ or `concept drift <#concept-drift>`__:

#. The :ref:`Text Embeddings Drift Check <nlp__embeddings_drift>` uses embeddings to measure drift using a
   `domain classifier <#detection-by-domain-classifier>`__.
#. The :ref:`Text Properties Drift Check <nlp__property_drift>` uses properties to measure drift using
   `univariate measures <#detection-by-univariate-measure>`__.
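
As a rough illustration of the properties-based approach, the sketch below computes a single text property (character length) for two small corpora and compares the resulting distributions with a two-sample Kolmogorov-Smirnov statistic. This is only one possible univariate measure, chosen here for simplicity; it is not necessarily the measure the check itself uses. The helper names (``text_length``, ``ks_statistic``) and the toy texts are hypothetical.

```python
from bisect import bisect_right


def text_length(text):
    """A simple, explainable text property: the number of characters."""
    return len(text)


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the
    empirical CDFs of the two samples (0 = identical, 1 = fully separated)."""
    a, b = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    for value in set(a) | set(b):
        cdf_a = bisect_right(a, value) / len(a)
        cdf_b = bisect_right(b, value) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


# Hypothetical toy corpora: most "test" texts are much longer than the "train" texts.
train_texts = ["good", "bad service", "fine", "would buy again"]
test_texts = [
    "the delivery was late and the packaging was damaged on arrival",
    "I contacted support three times before anyone answered my question",
    "fine",
    "works as described, but the manual is confusing and incomplete",
]

train_lengths = [text_length(t) for t in train_texts]
test_lengths = [text_length(t) for t in test_texts]

drift_score = ks_statistic(train_lengths, test_lengths)
no_drift_score = ks_statistic(train_lengths, train_lengths)
print(f"length drift: {drift_score:.2f}, self-comparison: {no_drift_score:.2f}")
```

A score near 0 means the two length distributions are nearly identical, while a score near 1 means they barely overlap. Because the score is tied to one named property, it is easy to explain which aspect of the text changed.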

For drift in your label's distribution, deepchecks offers the :ref:`nlp__label_drift`,
which uses `univariate measures <#detection-by-univariate-measure>`__.
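
For a categorical label, a univariate comparison can be as simple as comparing the label frequencies of the two datasets. The sketch below uses the total variation distance as an assumed, illustrative measure; the measure the check applies may differ, and the function names and label lists are hypothetical.

```python
from collections import Counter


def label_distribution(labels):
    """Return the relative frequency of each label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def total_variation_distance(labels_a, labels_b):
    """Half the sum of absolute frequency differences (0 = same, 1 = disjoint)."""
    dist_a = label_distribution(labels_a)
    dist_b = label_distribution(labels_b)
    all_labels = set(dist_a) | set(dist_b)
    return 0.5 * sum(
        abs(dist_a.get(label, 0.0) - dist_b.get(label, 0.0)) for label in all_labels
    )


# Hypothetical labels: the share of "positive" grows from 50% to 80%.
train_labels = ["positive"] * 5 + ["negative"] * 5
test_labels = ["positive"] * 8 + ["negative"] * 2

label_drift_score = total_variation_distance(train_labels, test_labels)
print(f"label drift: {label_drift_score:.2f}")
```

Labels that appear in only one of the datasets are handled by treating their frequency in the other dataset as zero.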

In cases where the label is not available, we strongly recommend also using the :ref:`nlp__prediction_drift`,
which uses the same methods but on the model's predictions, and can detect possible changes in the distribution of the label.

For code examples, see `here <#text-nlp-checks>`__

Computer Vision Data
--------------------

@@ -294,6 +324,62 @@ Tabular Checks
result = check.run(train_dataset=train_dataset, test_dataset=test_dataset, model=model)


Text (NLP) Checks
-----------------

:ref:`nlp__embeddings_drift`:

.. code-block:: python

    # Load the embeddings from a file:
    train_dataset.set_embeddings('my_train_embeddings_file.npy')
    test_dataset.set_embeddings('my_test_embeddings_file.npy')

    # Alternatively, if you do not have a model to extract embeddings from,
    # you can use the deepchecks default embeddings:
    train_dataset.calculate_default_embeddings()
    test_dataset.calculate_default_embeddings()

    # For more on loading embeddings, see the NLP embeddings guide (nlp__embeddings_guide).

    # Run the check:
    from deepchecks.nlp.checks import TextEmbeddingsDrift
    check = TextEmbeddingsDrift()
    result = check.run(train_dataset=train_dataset, test_dataset=test_dataset)
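
To build intuition for what a domain classifier over embeddings measures, the following is a minimal, standard-library sketch and not the check's actual implementation. It assumes a deliberately simple nearest-centroid classifier that tries to tell "train" embeddings from "test" embeddings and reports its accuracy on held-out samples. All function names and the toy two-dimensional embeddings are hypothetical.

```python
def centroid(vectors):
    """Mean vector of a list of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def squared_distance(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))


def vote(vector, own_centroid, other_centroid):
    """1.0 if the vector is closer to its own domain's centroid, 0.5 on a tie, else 0.0."""
    d_own = squared_distance(vector, own_centroid)
    d_other = squared_distance(vector, other_centroid)
    if d_own == d_other:
        return 0.5
    return 1.0 if d_own < d_other else 0.0


def domain_classifier_accuracy(train_embeddings, test_embeddings):
    # Fit on even-indexed samples, evaluate on odd-indexed samples.
    fit_train, eval_train = train_embeddings[0::2], train_embeddings[1::2]
    fit_test, eval_test = test_embeddings[0::2], test_embeddings[1::2]
    c_train, c_test = centroid(fit_train), centroid(fit_test)
    correct = sum(vote(v, c_train, c_test) for v in eval_train)
    correct += sum(vote(v, c_test, c_train) for v in eval_test)
    return correct / (len(eval_train) + len(eval_test))


# Hypothetical 2-D embeddings: the "test" cloud is shifted away from the "train" cloud.
train_embeddings = [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0], [0.1, 0.2]]
test_embeddings = [[3.0, 3.1], [2.9, 3.0], [3.2, 2.8], [3.1, 3.0]]

drifted = domain_classifier_accuracy(train_embeddings, test_embeddings)
identical = domain_classifier_accuracy(train_embeddings, train_embeddings)
print(f"drifted: {drifted:.2f}, identical: {identical:.2f}")
```

An accuracy close to 0.5 means the classifier cannot distinguish the two datasets, which suggests little drift; an accuracy close to 1.0 means the datasets are easy to separate in embedding space.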

:ref:`nlp__property_drift`:

.. code-block:: python

    # If text properties were not calculated yet:
    train_dataset.calculate_default_properties()
    test_dataset.calculate_default_properties()

    from deepchecks.nlp.checks import PropertyDrift
    check = PropertyDrift()
    result = check.run(train_dataset=train_dataset, test_dataset=test_dataset)

:ref:`nlp__label_drift`:

.. code-block:: python

    from deepchecks.nlp.checks import LabelDrift
    check = LabelDrift()
    result = check.run(train_dataset=train_dataset, test_dataset=test_dataset)

:ref:`nlp__prediction_drift`:

.. code-block:: python

    from deepchecks.nlp.checks import PredictionDrift
    check = PredictionDrift()
    result = check.run(train_dataset=train_dataset, test_dataset=test_dataset,
                       train_predictions=train_predictions, test_predictions=test_predictions)

    # For Text Classification tasks, it is recommended to use the probabilities:
    result = check.run(train_dataset=train_dataset, test_dataset=test_dataset,
                       train_probabilities=train_probabilities, test_probabilities=test_probabilities)


Computer Vision Checks
----------------------

@@ -302,7 +388,7 @@ Computer Vision Checks
.. code-block:: python

from deepchecks.vision.checks import ImagePropertyDrift
-    check = TrainTestPropertyDrift()
+    check = ImagePropertyDrift()
result = check.run(train_dataset=train_dataset, test_dataset=test_dataset)

:ref:`vision__image_dataset_drift`:
@@ -327,4 +413,4 @@ Computer Vision Checks

from deepchecks.vision.checks import PredictionDrift
check = PredictionDrift()
-    result = check.run(train_dataset=train_dataset, test_dataset=test_dataset, model=model)
+    result = check.run(train_dataset=train_dataset, test_dataset=test_dataset)