
Add ECDF plot type to data_doctor; add SciPy to requirements #112

Merged
lshpaner merged 4 commits into main from data_doctor_ecdf_integration
Dec 15, 2025

@Oscar-Gil-Data
Collaborator

ECDF Plot Integration in data_doctor; add SciPy to requirements

Overview

This document describes the addition of an ECDF (Empirical Cumulative Distribution Function) plot option to the data_doctor function. The ECDF is introduced as a new plot_type value ("ecdf") and is designed to complement existing distribution plots such as KDEs, histograms, and box/violin plots.

This enhancement was motivated by user feedback and discussion at JupyterCon 2025, where attendees expressed interest in clearer, bin-free ways to inspect distributions during exploratory data analysis (EDA).


What is an ECDF?

An ECDF shows the cumulative distribution of a variable.

  • The x-axis represents observed values of the feature
  • The y-axis represents the proportion of observations less than or equal to each value

Each point on the ECDF corresponds to one observation in the dataset.

Unlike histograms or KDEs:

  • No binning is required
  • No smoothing assumptions are imposed
  • Every observation is explicitly represented

Visually, the ECDF appears as a step function, where each step corresponds to the inclusion of another data point.
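The definition above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the code added to data_doctor; the function name ecdf and the sample values are made up for the example.

```python
def ecdf(values):
    """Return the sorted values and, for each, the proportion of
    observations less than or equal to it (hypothetical helper)."""
    xs = sorted(values)
    n = len(xs)
    if n == 0:
        raise ValueError("ecdf needs at least one observation")
    # The i-th sorted value (0-based) has i + 1 observations at or below it.
    ys = [(i + 1) / n for i in range(n)]
    return xs, ys


tips = [1.0, 3.5, 2.0, 5.0]  # made-up values, not from the PR
xs, ys = ecdf(tips)
print(list(zip(xs, ys)))
# [(1.0, 0.25), (2.0, 0.5), (3.5, 0.75), (5.0, 1.0)]
```

Each (x, y) pair is one step of the plot: no bin width or bandwidth has to be chosen, and the curve reaches 1.0 at the largest observation.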


Why ECDFs are useful in EDA

ECDFs answer a fundamentally different question than histograms:

Histograms ask:

“How many observations fall into this bin?”

ECDFs ask:

“How much of the data have we seen up to this point?”

This makes ECDFs especially helpful for:

  • Understanding cumulative behavior
  • Identifying skew and asymmetry
  • Comparing distributions across groups
  • Spotting cutoff effects, saturation, or clustering
  • Diagnosing where parametric assumptions (such as normality) break down

ECDF implementation details

The ECDF implementation in data_doctor includes:

  • A step plot (R-style) showing the empirical CDF
  • Points overlaid on the steps to emphasize individual observations
  • An optional smooth normal CDF overlay for reference

The normal CDF overlay uses the empirical mean and standard deviation of the data and is intended as a visual diagnostic, not a formal goodness-of-fit test.

Because this overlay relies on scipy.stats.norm, SciPy is now included as a dependency.
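As a rough sketch of how such a reference curve can be computed, the snippet below evaluates a normal CDF at each sorted observation, using the mean and standard deviation of the data. To stay self-contained it uses statistics.NormalDist from the standard library as a stand-in; with SciPy the equivalent call would be scipy.stats.norm.cdf(x, loc=mu, scale=sigma). The helper name and the use of the sample standard deviation are assumptions, since the PR text does not say which estimator data_doctor uses.

```python
from statistics import NormalDist, mean, stdev


def normal_reference(values):
    """Normal CDF at each sorted value, parameterized by the data's own
    mean and standard deviation (hypothetical helper)."""
    mu = mean(values)
    sigma = stdev(values)  # assumption: sample standard deviation
    ref = NormalDist(mu, sigma)
    return [ref.cdf(x) for x in sorted(values)]


sample = [0.2, 0.5, 0.9, 1.1, 1.4, 2.0]  # made-up values
reference = normal_reference(sample)
```

Plotting these values against the sorted data gives the smooth curve that the stepped ECDF is compared with.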


Dependency update

SciPy requirement

The ECDF itself does not require SciPy. However, the optional normal reference curve does.

As a result:

  • scipy has been added to requirements.txt
  • Declaring it ensures consistent behavior across environments and avoids import errors when the normal overlay is requested

Example usage

data_doctor(
    df=df,
    feature_name="tip",
    plot_type=["kde", "ecdf", "box_violin"],
    scale_conversion="log",
)

Why this combination works well for EDA

Each plot in the example above contributes a different view of the feature:

  • KDE shows overall shape
  • ECDF shows cumulative behavior and deviations from parametric assumptions
  • Box/Violin summarizes spread, central tendency, and outliers

Interpreting the ECDF vs. normal reference

When the ECDF is plotted alongside a normal reference curve, it becomes easy to see where the data diverge from a normal approximation.

In the example shown in this PR:

  • Log-transformed tip values approximately between 0.60 and 1.10 show a noticeable deviation from the normal reference curve
  • In this mid-range, the ECDF accumulates at a different rate than the normal model would predict

Key points:

  • This deviation is localized, not global
  • The lower and upper tails track the normal reference more closely
  • The ECDF highlights where the normal approximation is imperfect, rather than implying that the distribution is fundamentally non-normal

This illustrates a core strength of ECDFs in exploratory analysis: they reveal local structure that may be hidden by binning or smoothing.
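The same visual comparison can be approximated numerically by taking the gap between the ECDF and the normal reference at each observation and listing the values where the gap is large. The sketch below is illustrative only: the function names, the tolerance tol, and the sample data are hypothetical, the sample standard deviation is an assumption, and statistics.NormalDist stands in for scipy.stats.norm.

```python
from statistics import NormalDist, mean, stdev


def ecdf_gaps(values):
    """Pair each sorted value with (ECDF - normal reference CDF)."""
    xs = sorted(values)
    n = len(xs)
    ref = NormalDist(mean(xs), stdev(xs))  # assumption: sample std
    return [(x, (i + 1) / n - ref.cdf(x)) for i, x in enumerate(xs)]


def flag_deviations(values, tol=0.05):
    """Values whose absolute gap exceeds tol (tol is a made-up threshold)."""
    return [x for x, gap in ecdf_gaps(values) if abs(gap) > tol]


sample = [0.1, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.9]  # made-up values
gaps = ecdf_gaps(sample)
flagged = flag_deviations(sample, tol=0.05)
```

If the flagged values cluster in one range while the tails are absent from the list, the deviation is localized in the sense described above. Like the overlay itself, this is a diagnostic aid, not a formal goodness-of-fit test.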


Relationship to existing plot types

The ECDF is not intended to replace existing plots, but to complement them:

  • Histograms / KDEs emphasize shape and density
  • Box/Violin plots emphasize summary statistics and outliers
  • ECDFs emphasize cumulative structure and ordering

Together, these views provide a more complete understanding of a feature’s distribution.


Origin of the enhancement

The addition of ECDF support was prompted by:

  • Practitioner feedback
  • Live discussion at JupyterCon 2025
  • A desire for more transparent, assumption-light distribution diagnostics within data_doctor

Summary

This PR:

  • Introduces "ecdf" as a supported plot type in data_doctor
  • Adds SciPy as a dependency to support a normal reference overlay
  • Expands the toolkit’s ability to diagnose distributional behavior during EDA
  • Aligns the package more closely with common statistical workflows and user expectations

Comment thread requirements.txt
@lshpaner lshpaner reopened this Dec 15, 2025
@lshpaner lshpaner merged commit 480c7d9 into main Dec 15, 2025
@lshpaner
Collaborator

looks good; merged

@lshpaner lshpaner deleted the data_doctor_ecdf_integration branch December 15, 2025 02:19