Added properties and metadata guides (#2468)
* Added properties and metadata guides
* Update deepchecks/nlp/text_data.py
  Co-authored-by: Noam Bressler <noamzbr@gmail.com>
* Update docs/source/user-guide/nlp/nlp_metadata.rst
  Co-authored-by: Noam Bressler <noamzbr@gmail.com>
* Update docs/source/user-guide/nlp/nlp_properties.rst
  Co-authored-by: Noam Bressler <noamzbr@gmail.com>
* Fixed CR comments
* Fixed CR comments
* Fixed CR comments
* pylint
* cr comments
* Apply suggestions from code review
  Co-authored-by: Nadav Barak <67195469+Nadav-Barak@users.noreply.github.com>
* Nir/dee 482 create plot files for nlp drift (#2472)
  * prediction drift plot file
  * Added label drift
  * property label correlation
  * Outliers check
  * Outliers check
  * small changes
  * Changes
  * Update deepchecks/vision/suites/default_suites.py
    Co-authored-by: Noam Bressler <noamzbr@gmail.com>
  * Changes
  * Update docs/source/checks/nlp/data_integrity/plot_property_label_correlation.py
    Co-authored-by: Noam Bressler <noamzbr@gmail.com>
  * Update docs/source/checks/nlp/data_integrity/plot_property_label_correlation.py
    Co-authored-by: Noam Bressler <noamzbr@gmail.com>
  * Update docs/source/checks/nlp/data_integrity/plot_text_property_outliers.py
    Co-authored-by: Noam Bressler <noamzbr@gmail.com>
  * Update docs/source/checks/nlp/model_evaluation/plot_prediction_drift.py
    Co-authored-by: Noam Bressler <noamzbr@gmail.com>
  * Update docs/source/checks/nlp/model_evaluation/plot_prediction_drift.py
    Co-authored-by: Noam Bressler <noamzbr@gmail.com>
  * CR changes
  * CR changes
  * CR changes
  Co-authored-by: Noam Bressler <noamzbr@gmail.com>
* Moved files
* Updated links

Co-authored-by: Noam Bressler <noamzbr@gmail.com>
Co-authored-by: Nadav Barak <67195469+Nadav-Barak@users.noreply.github.com>
1 parent e4429e3 · commit 0b39ef1 · Showing 12 changed files with 705 additions and 33 deletions.
docs/source/checks/nlp/data_integrity/plot_property_label_correlation.py (71 additions, 0 deletions)
@@ -0,0 +1,71 @@
# -*- coding: utf-8 -*-
"""
.. _nlp__property_label_correlation:

Property Label Correlation
**************************

This notebook provides an overview for using and understanding the "Property Label Correlation" check.

**Structure:**

* `What Is The Purpose of the Check? <#what-is-the-purpose-of-the-check>`__
* `Run the Check <#run-the-check>`__

What Is The Purpose of the Check?
=================================

The check estimates, for every :ref:`text property <nlp__properties_guide>`
(such as text length, language, etc.), its ability to predict the label by itself.

This check can help find a potential bias in the dataset - the labels being strongly correlated with simple text
properties such as percentage of special characters, sentiment, toxicity and more.

This is a critical problem that can result in a phenomenon called "shortcut learning", where the model is likely to
learn this property instead of the actual textual characteristics of each class, as it's easier to do so.
In this case, the model will show high performance on text collected under similar conditions (e.g. same source),
but will fail to generalize on other data (for example, when production receives new data from another source).
This kind of correlation will likely stay hidden without this check until the model is tested on the actual
problem data.

For example, in a classification dataset of true and false statements, if only true facts are written in detail,
and false facts are written in a short and vague manner, the model might learn to predict the label by the length
of the statement rather than by the actual content. In this case, the model will perform well on the training data,
and may even perform well on the test data, but will fail to generalize to new data.

The check is based on calculating the predictive power score (PPS) of each text property. In simple terms, the PPS
is a metric that measures how well one feature can predict another (in our case, how well a property can predict
the label).

For further information about PPS you can visit the `ppscore GitHub repository <https://github.com/8080labs/ppscore>`__
or the following blog post: `RIP correlation. Introducing the Predictive Power Score
<https://towardsdatascience.com/rip-correlation-introducing-the-predictive-power-score-3d90808b9598>`__.
"""
#%%
# Run the Check
# =============

from deepchecks.nlp.checks import PropertyLabelCorrelation
from deepchecks.nlp.datasets.classification import tweet_emotion

# For this example, we'll use the tweet emotion dataset: a dataset of tweets, each labeled with one of four
# emotions - happiness, anger, sadness or optimism.

# Load the data:
dataset = tweet_emotion.load_data(as_train_test=False)

#%%
# Let's see what our data looks like:
dataset.head()

#%%
# Now let's run the check:
result = PropertyLabelCorrelation().run(dataset)
result
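#%%
# Beyond the visual display, the raw scores can also be inspected
# programmatically. A minimal sketch (``result.value`` holds the check's
# computed output; its exact structure may differ between deepchecks versions):

# The PPS computed for each text property
print(result.value)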
#%%
# We can see that in our tweet emotion example, the label is correlated with the "sentiment" property,
# which makes sense, as the label is the emotion of the tweet, and the sentiment expresses whether the tweet is
# positive or negative.
# There is also some correlation with the "toxicity" property, which measures how toxic the tweet is.
# This is also reasonable, as some emotions are more likely to be expressed in a toxic way.
# However, these correlations may indicate that a model could learn to predict the label from curse words, for
# instance, instead of from the actual content of the tweet, which could lead it to fail on new tweets that don't
# contain curse words.
docs/source/checks/nlp/data_integrity/plot_text_property_outliers.py (75 additions, 0 deletions)
@@ -0,0 +1,75 @@
# -*- coding: utf-8 -*-
"""
.. _nlp__text_property_outliers:

Text Property Outliers
======================

This notebook provides an overview for using and understanding the text property
outliers check, used to detect outliers in simple text properties in a dataset.

**Structure:**

* `Why Check for Outliers? <#why-check-for-outliers>`__
* `How Does the Check Work? <#how-does-the-check-work>`__
* `Which Text Properties Are Used? <#which-text-properties-are-used>`__
* `Run the Check <#run-the-check>`__

Why Check for Outliers?
-----------------------

Examining outliers may help you gain insights that you couldn't have reached by taking an aggregate look or by
inspecting random samples. For example, it may help you understand that you have some corrupt samples (e.g.
texts without spaces between words), or samples you didn't expect to have (e.g. texts in Norwegian instead of English).
In some cases, these outliers may help debug performance discrepancies (the model can be excused for failing on
a totally blank text). In more extreme cases, the outlier samples may indicate the presence of samples interfering
with the model's training by teaching the model to fit "irrelevant" samples.

How Does the Check Work?
------------------------

Ideally we would like to directly find text samples which are outliers, but this is computationally expensive and
does not produce clear, explainable results. Therefore, we use text properties to find outliers (such as text length,
average word length, language, etc.), which are much more efficient to compute, and each outlier is easily explained.

* For numeric properties (such as "percent of special characters"), we use the
  `Interquartile Range <https://en.wikipedia.org/wiki/Interquartile_range#Outliers>`_ to define upper and lower
  limits for the properties' values.

* For categorical properties (such as "language"), we look for a "sharp drop" in the category distribution to
  define a lower limit for the properties' values. This method is based on the assumption that the distribution of
  categories in the dataset is "smooth" and differences in the commonality of categories are gradual.
  For example, in a clean dataset, if the distribution of English texts is 80%, the distribution of the next most
  common language would be of similar scale (e.g. 10%) and so forth. If we find a category that has a much lower
  distribution than the rest, we assume that this category and even smaller categories are outliers.

A short sketch of both rules appears right after this introduction.

Which Text Properties Are Used?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

By default the check uses the built-in text properties, and it's also possible to replace the default properties
with custom ones. For the list of built-in text properties and an explanation of custom properties, refer to
:ref:`NLP properties <nlp__properties_guide>`.
"""
#%%
# Run the Check
# -------------
# For this example, we'll use the tweet emotion dataset: a dataset of tweets, each labeled with one of four
# emotions - happiness, anger, sadness or optimism.

from deepchecks.nlp.checks import TextPropertyOutliers
from deepchecks.nlp.datasets.classification import tweet_emotion

dataset = tweet_emotion.load_data(as_train_test=False)

check = TextPropertyOutliers()
result = check.run(dataset)
result
#%%
# Observe Graphic Result
# ^^^^^^^^^^^^^^^^^^^^^^
# In this example, we can find many tweets that are outliers. For example, in the "average word length" property,
# we can see that there are tweets with a very large average word length, usually because of missing spaces in the
# tweet itself, or because Twitter hashtags remained in the data and they don't contain spaces. This could be
# problematic for the model, as it cannot comprehend the hashtags as words, and it may cause the model to fail on
# these tweets.
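#%%
# As mentioned in the introduction, the default properties can be replaced with
# custom ones. A hypothetical sketch (``set_properties`` and the expected input
# format follow the :ref:`NLP properties <nlp__properties_guide>` guide; verify
# the exact API against your deepchecks version):

import pandas as pd

# One row per sample, one column per custom property (toy values shown)
custom_props = pd.DataFrame({'digit_ratio': [0.0, 0.12, 0.05]})
# dataset.set_properties(custom_props)     # attach the custom properties
# TextPropertyOutliers().run(dataset)      # the check would now use them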