This repository has been archived by the owner. It is now read-only.

Add Quantiles to Numerical analyses in AnalyzeSpark #436

merged 3 commits into from Oct 16, 2017


2 participants

huitseeker commented Oct 4, 2017

What changes were proposed in this pull request?

Uses T-Digests¹ to cheaply track quantiles in the numerical column analysis in Spark. This also provides the structure for replacing histograms later, as we implement them (a T-Digest can cheaply track a CDF, from which a histogram is one derivative away).


Gives a direction towards fixing #290
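
To illustrate the remark that a histogram is one derivative away from a CDF, here is a minimal sketch, not the code from this PR. It assumes an exact empirical CDF over a sorted array as a stand-in for whatever a T-Digest would estimate; the class and method names (`CdfHistogramSketch`, `cdf`, `histogram`) are hypothetical. The mass of each bin is simply the difference of the CDF at its two edges.

```java
import java.util.Arrays;

// Hypothetical illustration: names and structure are assumptions, not part of the PR.
class CdfHistogramSketch {

    /** Empirical CDF: fraction of values <= x. The input array must be sorted ascending. */
    static double cdf(double[] sorted, double x) {
        int lo = 0, hi = sorted.length;
        // Binary search for the number of elements <= x.
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted[mid] <= x) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return (double) lo / sorted.length;
    }

    /** Probability mass of each bin (edges[i], edges[i+1]], obtained by differencing the CDF. */
    static double[] histogram(double[] sorted, double[] edges) {
        double[] mass = new double[edges.length - 1];
        for (int i = 0; i < mass.length; i++) {
            mass[i] = cdf(sorted, edges[i + 1]) - cdf(sorted, edges[i]);
        }
        return mass;
    }

    public static void main(String[] args) {
        double[] data = {1, 2, 2, 3, 5, 8, 8, 9};
        Arrays.sort(data);
        double[] edges = {0, 3, 6, 9};
        // prints [0.5, 0.125, 0.375]
        System.out.println(Arrays.toString(histogram(data, edges)));
    }
}
```

With a digest in place of the sorted array, only `cdf` would change; the differencing step stays the same, at the cost of approximate rather than exact bin masses.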

How was this patch tested?

Extension of the TestAnalysis unit test.

huitseeker added some commits Oct 4, 2017

Add tests
Switch to TDunning's serializable implementation for Spark
See tdunning/t-digest@f4ed7af

Comment round.

AlexDBlack left a comment

One tiny issue, but otherwise LGTM. Thanks!

          return "mean=" + mean + ",sampleStDev=" + sampleStdev + ",sampleVariance=" + sampleVariance + ",countZero="
                          + countZero + ",countNegative=" + countNegative + ",countPositive=" + countPositive
                          + ",countMinValue=" + countMinValue + ",countMaxValue=" + countMaxValue + ",count="
-                         + countTotal;
+                         + countTotal + ", quantiles=[" + quantiles.toString();



AlexDBlack Oct 4, 2017


Should this have a + "]" after the toString()?
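
For reference, a minimal sketch of what the suggested change might look like, assuming only the tail of the string is affected. The class `QuantileToStringSketch`, the method `describe`, and the `Object` parameter type are hypothetical stand-ins; the actual type of `quantiles` is not shown in this excerpt.

```java
// Hypothetical illustration of closing the bracket opened by ", quantiles=[".
class QuantileToStringSketch {

    /** Builds the tail of the summary string with a balanced pair of brackets. */
    static String describe(long countTotal, Object quantiles) {
        return ",count=" + countTotal + ", quantiles=[" + quantiles.toString() + "]";
    }

    public static void main(String[] args) {
        // prints ,count=3, quantiles=[p50=2.0]
        System.out.println(describe(3, "p50=2.0"));
    }
}
```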

huitseeker requested a review from AlexDBlack Oct 5, 2017

huitseeker merged commit 98ae8d5 into deeplearning4j:master Oct 16, 2017
