Add ECDF plot type to data_doctor; add SciPy to requirements #112
Merged
lshpaner
requested changes
Dec 15, 2025
Collaborator
looks good; merged
ECDF Plot Integration in data_doctor; add SciPy to requirements
Overview
This document describes the addition of an ECDF (Empirical Cumulative Distribution Function) plot option to the data_doctor function. The ECDF is introduced as a new ptype ("ecdf") and is designed to complement existing distribution plots such as KDEs, histograms, and box/violin plots.
This enhancement was motivated by user feedback and discussion at JupyterCon 2025, where attendees expressed interest in clearer, bin-free ways to inspect distributions during exploratory data analysis (EDA).
What is an ECDF?
An ECDF shows the cumulative distribution of a variable.
Each point on the ECDF corresponds to one observation in the dataset.
Unlike histograms or KDEs:
Visually, the ECDF appears as a step function, where each step corresponds to the inclusion of another data point.
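The step-function description can be made concrete with a minimal sketch. This is not the data_doctor implementation; the helper name ecdf and the choice to drop NaN values are assumptions made for illustration. It sorts the observations and assigns the i-th smallest value a height of i / n.

```python
import numpy as np


def ecdf(values):
    """Return sorted observations and their ECDF heights.

    Hypothetical helper for illustration; not part of data_doctor.
    """
    arr = np.asarray(values, dtype=float)
    arr = arr[~np.isnan(arr)]          # assumption: missing values are ignored
    x = np.sort(arr)                   # one x position per observation
    y = np.arange(1, x.size + 1) / x.size  # cumulative share: 1/n, 2/n, ..., 1
    return x, y


x, y = ecdf([3.0, 1.0, 2.0, 2.0])
# x is [1, 2, 2, 3] and y is [0.25, 0.5, 0.75, 1.0]
```

Drawing these pairs with a step-style line, for example matplotlib's plt.step(x, y, where="post"), produces the staircase shape described above, with one step per observation.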
Why ECDFs are useful in EDA
ECDFs answer a fundamentally different question than histograms:
Histograms ask:
ECDFs ask:
This makes ECDFs especially helpful for:
ECDF implementation details
The ECDF implementation in data_doctor includes:
The normal CDF overlay uses the empirical mean and standard deviation of the data and is intended as a visual diagnostic, not a formal goodness-of-fit test.
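As a rough sketch of how such a reference curve could be computed, the snippet below evaluates scipy.stats.norm.cdf on a grid spanning the data, using the sample mean and standard deviation. The function name normal_reference_cdf, the grid size, and the use of ddof=1 are assumptions for illustration; the PR does not state which standard deviation estimator data_doctor uses.

```python
import numpy as np
from scipy.stats import norm


def normal_reference_cdf(values, num_points=200):
    """Evaluate a normal CDF fitted by the data's mean and std on a grid.

    Hypothetical helper for illustration; not part of data_doctor.
    """
    arr = np.asarray(values, dtype=float)
    arr = arr[~np.isnan(arr)]          # assumption: missing values are ignored
    mu = arr.mean()
    sigma = arr.std(ddof=1)            # assumption: sample standard deviation
    grid = np.linspace(arr.min(), arr.max(), num_points)
    return grid, norm.cdf(grid, loc=mu, scale=sigma)


grid, ref = normal_reference_cdf([1, 2, 3, 4, 5], num_points=5)
# ref rises monotonically and equals 0.5 at the sample mean (grid value 3)
```

Plotting ref against grid on the same axes as the ECDF gives a smooth curve to compare the steps against.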
Because this overlay relies on scipy.stats.norm, SciPy is now included as a dependency.
Dependency update
SciPy requirement
The ECDF itself does not require SciPy. However, the optional normal reference curve does.
As a result:
scipy has been added to requirements.txt
Example usage
Why this combination works well for EDA
This combination is particularly effective for exploratory data analysis:
Interpreting the ECDF vs. normal reference
When the ECDF is plotted alongside a normal reference curve, it becomes easy to see where the data diverges from a normal approximation.
In the example shown in this PR:
tip values approximately between 0.60 and 1.10 show a noticeable deviation from the normal reference curve
Key points:
This illustrates a core strength of ECDFs in exploratory analysis: they reveal local structure that may be hidden by binning or smoothing.
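One way to locate such a divergence numerically is to subtract the normal reference from the ECDF at each observation and look for the largest gap. The sketch below uses synthetic right-skewed data rather than the tip values from this PR, and the helper name ecdf_vs_normal_gap is hypothetical. Like the overlay itself, it is meant as an informal diagnostic, not a goodness-of-fit test.

```python
import numpy as np
from scipy.stats import norm


def ecdf_vs_normal_gap(values):
    """Return sorted observations and ECDF minus fitted normal CDF at each.

    Hypothetical helper for illustration; not part of data_doctor.
    """
    arr = np.asarray(values, dtype=float)
    arr = arr[~np.isnan(arr)]          # assumption: missing values are ignored
    x = np.sort(arr)
    y = np.arange(1, x.size + 1) / x.size
    ref = norm.cdf(x, loc=x.mean(), scale=x.std(ddof=1))
    return x, y - ref


# Synthetic, right-skewed sample (not the data shown in the PR)
rng = np.random.default_rng(0)
skewed = rng.exponential(scale=1.0, size=500)

x, gap = ecdf_vs_normal_gap(skewed)
worst = int(np.argmax(np.abs(gap)))
print(f"largest gap {gap[worst]:+.3f} at x = {x[worst]:.3f}")
```

A positive gap means the data accumulates faster than the normal reference at that value; a negative gap means it accumulates more slowly.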
Relationship to existing plot types
The ECDF is not intended to replace existing plots, but to complement them:
Together, these views provide a more complete understanding of a feature’s distribution.
Origin of the enhancement
The addition of ECDF support was prompted by:
data_doctor
Summary
This PR:
"ecdf" as a supported plot type in data_doctor