Drift guide 1066 (#1447)
* added blank file

* draft

* small fixes to feature_importance.rst

* drift guide draft

* drift guide draft

* drift guide draft

* Apply suggestions from code review

Co-authored-by: Noam Bressler <noamzbr@gmail.com>

* Fixes

* Fixes

* measure instead of statistical test

* Fixes

* Fixes

* Fixes

* Bressler was right about grammar

* Apply suggestions from code review

Co-authored-by: Noam Bressler <noamzbr@gmail.com>

* PR Fixes

* Edited tabular docs

* Finished vision docs

* Added images

* Some small fixes

* Apply suggestions from code review

Co-authored-by: shir22 <33841818+shir22@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: shir22 <33841818+shir22@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: shir22 <33841818+shir22@users.noreply.github.com>

* small changes

* comments

* Apply suggestions from code review

Co-authored-by: shir22 <33841818+shir22@users.noreply.github.com>

* Fixed comments

* Additional changes

* Additional changes

* Fixed links

* Fixed links

* Fixed links

* Fixed links

* Fixed comments

* Apply suggestions from code review

Co-authored-by: Noam Bressler <noamzbr@gmail.com>

* Last Bressler comment

* Fixed comments
Added to index

* .

Co-authored-by: Noam Bressler <noamzbr@gmail.com>
Co-authored-by: shir22 <33841818+shir22@users.noreply.github.com>
3 people committed May 29, 2022
1 parent 40df64d commit e87b2cd
Showing 14 changed files with 485 additions and 244 deletions.
[3 changed files could not be displayed in this view; possibly the images added in this PR]
@@ -7,45 +7,31 @@
**Structure:**

* `What Is Prediction Drift? <#what-is-prediction-drift>`__
* `Generate Data <#generate-data>`__
* `Build Model <#build-model>`__
* `Run Check <#run-check>`__

What Is Prediction Drift?
=========================

Drift is simply a change in the distribution of data over time, and it is
also one of the top reasons why a machine learning model's performance
degrades over time.

Prediction drift is when drift occurs in the prediction itself.

Calculating prediction drift is especially useful in cases in which labels
are not available for the test dataset, so a drift in the predictions is
our only indication that a change has happened in the data that actually
affects model predictions. If labels are available, it's also recommended
to run the :doc:`Label Drift check </checks_gallery/tabular/train_test_validation/plot_train_test_label_drift>`.

For more information on drift, please visit our :doc:`drift guide </user-guide/general/drift_guide>`.

How Deepchecks Detects Prediction Drift
---------------------------------------

This check detects prediction drift by using :ref:`univariate measures <drift_detection_by_univariate_measure>`
on the prediction output.
"""

@@ -14,46 +14,22 @@
What is a feature drift?
========================

Drift is simply a change in the distribution of data over time, and it is
also one of the top reasons why a machine learning model's performance
degrades over time.

Feature drift is a data drift that occurs in a single feature in the dataset.

For more information on drift, please visit our :doc:`drift guide </user-guide/general/drift_guide>`.

How Deepchecks Detects Feature Drift
------------------------------------

This check detects feature drift by using :ref:`univariate measures <drift_detection_by_univariate_measure>`
on each feature column separately.

Another possible method for drift detection is :ref:`a domain classifier <drift_detection_by_domain_classifier>`,
which is used in the :doc:`Whole Dataset Drift check </checks_gallery/tabular/train_test_validation/plot_whole_dataset_drift>`.
"""

@@ -2,6 +2,32 @@
"""
Train Test Label Drift
**********************

This notebook provides an overview for using and understanding the label drift check.

**Structure:**

* `What Is Label Drift? <#what-is-label-drift>`__
* `Run Check on a Classification Label <#run-check-on-a-classification-label>`__
* `Run Check on a Regression Label <#run-check-on-a-regression-label>`__
* `Add a Condition <#add-a-condition>`__

What Is Label Drift?
====================

Drift is simply a change in the distribution of data over time, and it is
also one of the top reasons why a machine learning model's performance
degrades over time.

Label drift is when drift occurs in the label itself.

For more information on drift, please visit our :doc:`drift guide </user-guide/general/drift_guide>`.

How Deepchecks Detects Label Drift
----------------------------------

This check detects label drift by using :ref:`univariate measures <drift_detection_by_univariate_measure>`
on the label column.
"""

#%%
@@ -15,14 +41,17 @@
from deepchecks.tabular.checks import TrainTestLabelDrift

#%%
# Run Check on a Classification Label
# ====================================

# Generate data:
# --------------

np.random.seed(42)

train_data = np.concatenate([np.random.randn(1000,2), np.random.choice(a=[1,0], p=[0.5, 0.5], size=(1000, 1))], axis=1)
# Create test_data with drift in label:
test_data = np.concatenate([np.random.randn(1000,2), np.random.choice(a=[1,0], p=[0.35, 0.65], size=(1000, 1))], axis=1)

df_train = pd.DataFrame(train_data, columns=['col1', 'col2', 'target'])
df_test = pd.DataFrame(test_data, columns=['col1', 'col2', 'target'])
@@ -36,16 +65,19 @@

#%%
# Run Check
# ===============================

check = TrainTestLabelDrift()
result = check.run(train_dataset=train_dataset, test_dataset=test_dataset)
result

#%%
# Run Check on a Regression Label
# ================================

# Generate data:
# --------------

train_data = np.concatenate([np.random.randn(1000,2), np.random.randn(1000, 1)], axis=1)
test_data = np.concatenate([np.random.randn(1000,2), np.random.randn(1000, 1)], axis=1)

@@ -59,14 +91,15 @@

#%%
# Run Check
# ---------

check = TrainTestLabelDrift()
result = check.run(train_dataset=train_dataset, test_dataset=test_dataset)
result

#%%
# Add a Condition
# ===============

check_cond = TrainTestLabelDrift().add_condition_drift_score_not_greater_than()
check_cond.run(train_dataset=train_dataset, test_dataset=test_dataset)
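#%%
# The condition above uses default thresholds. A hedged sketch of passing
# explicit ones; the parameter names follow the deepchecks API of this
# period, so verify them against your installed version:

check_cond = TrainTestLabelDrift().add_condition_drift_score_not_greater_than(
    max_allowed_psi_score=0.2, max_allowed_earth_movers_score=0.1)
check_cond.run(train_dataset=train_dataset, test_dataset=test_dataset)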
@@ -8,53 +8,31 @@
**Structure:**

* `What Is Multivariate Drift? <#what-is-multivariate-drift>`__
* `Loading the Data <#loading-the-data>`__
* `Run the Check <#run-the-check>`__
* `Define a Condition <#define-a-condition>`__

What Is Multivariate Drift?
===========================

Drift is simply a change in the distribution of data over time, and it is
also one of the top reasons why a machine learning model's performance
degrades over time.

A multivariate drift is a drift that occurs in more than one feature at a
time, and may even affect the relationships between those features; such
changes cannot be detected by univariate drift methods.

The whole dataset drift check tries to detect multivariate drift between the
two input datasets.

For more information on drift, please visit our :doc:`drift guide </user-guide/general/drift_guide>`.

How Deepchecks Detects Dataset Drift
------------------------------------

This check detects multivariate drift by using :ref:`a domain classifier <drift_detection_by_domain_classifier>`.
Other methods to detect drift include :ref:`univariate measures <drift_detection_by_univariate_measure>`,
which are used in other checks, such as the :doc:`Train Test Feature Drift check </checks_gallery/tabular/train_test_validation/plot_train_test_feature_drift>`.
"""

@@ -89,7 +67,7 @@
train_ds.label_name

#%%
# Run the Check
# =============
from deepchecks.tabular.checks import WholeDatasetDrift

@@ -129,7 +107,7 @@
# contributed the most to that drift. This is reasonable since the sampling
# was biased based on that feature.
#
# Define a Condition
# ==================
# Now, we define a condition enforcing that the whole dataset drift score must be
# below 0.1. A condition is deepchecks' way to validate model and data quality,
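#%%
# A hedged sketch of such a condition; the method name follows the deepchecks
# API of this period, and ``test_ds`` is assumed to be defined alongside
# ``train_ds`` above, so verify both against your installed version:

check = WholeDatasetDrift().add_condition_overall_drift_value_not_greater_than(0.1)
check.run(train_dataset=train_ds, test_dataset=test_ds)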
