Add new KDE function. #1284

tomicapretto · 2020-07-07T19:28:56Z

Description

This pull requests replaces the function _fast_kde() with a new function called kde() that can be used to estimate density functions for both linear and circular data.

Major changes

Everything related to KDE now lives in kde_utils.py instead of numeric_utils.py. The new function requires so many utilities itself that it is worth a new file.
The old function _fast_kde() used to return density, xmin and xmax, which required creating a new grid after estimating the density.
The new function returns grid and pdf. It can also return the estimated bandwidth by calling kde() with the argument bw_return set to True.
The new function kde() supports circular data via its boolean argument circular.
The argument bw used to represent the bandwidth scaling factor in the old function _fast_kde(). Now it represents either the bandwidth method or the bandwidth itself.
It is still possible to pass a bandwidth scaling factor via the argument bw_fct which is 1 by default.
kde() supports adaptive bandwidth for linear data. This is enabled by setting adaptive to True.
kde() supports several bandwidth estimation methods (see below).
The old _fast_kde() used to perform boundary correction without extending the observed domain in all the cases.
This is also the default for kde() but these settings can be changed with the arguments bound_correction, extend, extend_fct, and custom_lims.

Bandwidth estimation methods supported

For linear data:

Scott's rule (bw = 'scott').
Silverman's rule (bw = 'silverman').
Improved Sheather-Jones plug-in method (bw = 'isj').
Expermiental, which is the average of Silverman's rule and the ISJ method (bw = 'experimental'). This is the default.

For circular data:

Taylor's rule (bw = 'taylor'). It's like the Scott's rule of the circular world.

Comments:

kde_utils.py does not contain anything related to 2D KDE yet. It would be good to put all that here too.
The circular argument is not passed to higher level functions yet. This is something to do in the near future.
All the functions that used to call _fast_kde() now call kde() and have been updated to receive the new objects.

Plots

Updating the KDE function impacts in many different plots. Here is an example that compares the old version versus the new one. Original is here.

Old:

New:

We also have the following comparison which inclues the new version, the old version and the KDE offered by scipy.
5000 samples are used in all the cases.

Checklist

Follows official PR format
Includes a sample plot to visually illustrate the changes (only for plot-related functions)
New features are properly documented (with an example if appropriate)?
Includes new or updated tests to cover the new feature
Code style correct (follows pylint and black guidelines)
Changes are listed in changelog

arviz/kde_utils.py

aloctavodia · 2020-07-10T13:57:55Z

thanks @tomicapretto for your contribution. This is a big one so I made a first round of revision and I will do a another one ASAP.

aloctavodia · 2020-07-10T14:05:25Z

BTW I really like that in the example you posted the new kde is more wiggly than the current version :-)

aloctavodia · 2020-07-17T17:36:21Z

It seems you need to rebase, there are a lot of commits unrelated to this PR

aloctavodia

nitpicks

CHANGELOG.md

aloctavodia · 2020-07-23T12:26:51Z

arviz/kde_utils.py

    return bw


-# Misc utils
+# Misc utils ------------------------------------------------------------------------


remove the dashes

in the line of comment below, maybe these misc methods should go in numeric utils instead? and the comments to divide in sections can be deleted

arviz/kde_utils.py

OriolAbril

Thanks! This is amazing work, I actually feel bad for not being able to correctly review right now. I'll try to review again with more time in some days/week

My main big comment is in terms of api, I think we should pay a lot of attention here to define an api for kde functions so it is easy to modify the functions, their components or even add new kde calculation methods.

CHANGELOG.md

arviz/kde_utils.py

OriolAbril · 2020-07-23T14:34:12Z

arviz/kde_utils.py

+            found = 0
+        if bw <= 0:
+            warnings.warn(
+                "Improved Sheather-Jones did not converge to a positive value. "


I think that whatever we choose we should use custom warnings so even in a worst case scenario they can be ignored without blocking all UserWarnings

OriolAbril · 2020-07-23T14:43:10Z

arviz/kde_utils.py

+def circular_mean(x):
+    sinr = np.sum(np.sin(x))
+    cosr = np.sum(np.cos(x))
+    mean = np.arctan2(sinr, cosr)
+    return mean


Do we still have numeric utils? Shouldn't this go there? Also, I think Using the @conditional_jit decorator here will speed the method up further. See https://github.com/arviz-devs/arviz/blob/master/arviz/stats/stats_utils.py#L526 for an example

Also scipy version seems to do extra checks with respect to this one, not sure if nan checking or something like this could be useful for us. source https://github.com/scipy/scipy/blob/v1.5.1/scipy/stats/morestats.py#L3300-L3366 moreover, we should probably test our function gives the same result.

I am aware of Scipy version but I decided to keep this one because it is faster. As it is used now, things like nan checking are done in kde_circular() which is the function that calls circular_mean().
I guess you mention checks and all that because you're considering using the function in other parts of arviz.
I think there are two alternatives here:

Implement a circular_mean() function that is going to be used everywhere in arviz, instead of Scipy version.

Use Scipy version in other parts of arviz and keep this function for the circular kde since it is faster and safety checks have already been performed when circular_mean() is called.

What is your opinion on this?

BTW, I'm not aware if Scipy circmean() is currently being used in other parts of arviz, I just guess this is the default alternative.

I think we use a circmean (not sure which) in summary with circular variables, I think we should have a single function used in both places.

I am fine with keeping the fast one, given that it covers our use case and we have tests checking it gives the same result as scipy one.

If you agree I can add the following function to numeric_utils.py and make use of it here.
I will also add a test to check the result matches the one obtained with scipy (as far as I've manually tested, it does).

def circular_mean(x, na_rm=False): if len(x[np.isnan(x)]) > 0: if na_rm: x = x[~np.isnan(x)] else: return np.nan sinr = np.sum(np.sin(x)) cosr = np.sum(np.cos(x)) mean = np.arctan2(sinr, cosr) return mean

Sounds good, I would use np.any instead of slicing and then checking length, and can you open an issue on using numba for this? There are other functions that could benefit from it

what about

def circular_mean(x, na_rm=False): if na_rm: x = x[~np.isnan(x)] sinr = np.sum(np.sin(x)) cosr = np.sum(np.cos(x)) mean = np.arctan2(sinr, cosr) return mean

arviz/kde_utils.py

OriolAbril · 2020-07-23T15:12:13Z

arviz/kde_utils.py

+    assert isinstance(bw_fct, (int, float))
+    assert bw_fct > 0


similar comment to one above, these two should probably be type error and value error respectively, right now they'd raise an assertion error without any custom message

the isinstance int, float will also give problems with numpy dtypes i.e. float32

I can replace it with the following

if not isinstance(bw_fct, (int, float, np.integer, np.floating)): raise TypeError(f"`bw_fct` must be a positive number, not an object of {type(bw_fct)}.") if not bw_fct > 0: raise ValueError(f"`bw_fct` must be a positive number, not {bw_fct}.")

Doesn't floating work with python ints too?

Is this what you mean? It returns False

isinstance(1, np.floating) # False

Haha, my bad: with python floats

aloctavodia · 2020-07-28T20:15:47Z

I think this is (almost) ready to merge, we can keep improving the code and its integration to the rest of ArviZ in future PRs. The only missing part is to add a DeprecationWarning indicating users calling _fast_kde to call instead kde. Also not sure if we should keep the dash or not. i.e. we should replace _fast_kde with kde or _kde. Thoughts @OriolAbril, @ahartikainen?

ahartikainen · 2020-07-28T20:45:07Z

Do we want users to call kde? At least for now our kde code has been only for internal use, and I think there is a good reason for that --> there is a lot of coordination between input <<-->> output

For example scipy kde can calculate pdf etc, can predict any point user wants. Our is purely for plotting.

aloctavodia · 2020-07-28T21:28:55Z

I sometimes call _fast_kde, but I think in general we do not want user to call it. So I would like to keep the dash.

OriolAbril · 2020-07-29T23:11:55Z

arviz/numeric_utils.py

 from scipy.sparse import coo_matrix

 from .stats.stats_utils import histogram
 from .utils import _stack, _dot, _cov
+from .kde_utils import _kde


I think this is a cyclic import

I moved things from kde_utils.py to numeric_utils.py (as suggested) but that gives us this cyclic import. It works well because I don't use a relative import in kde_utils, but I don't feel it is very elegant/proper. Any suggestions on how to improve it?

given that _fast_kde is a private method, I think we can leave it in numeric_utils raising an error and pointing to _kde.

Moreover, we used to have _fast_kde available as az._fast_kde (due to it being imported in plots' module __init__) and removed it at some point without any complaint/issue about it.

I am not sure if we should have _kde available again as az._kde though

I think we should as we used to have _fast_kde

canyon289 · 2020-07-31T15:27:08Z

@tomicapretto I really appreciate your very detailed initial PR message with lots of graphs and explanations for code changes at a human level and plots for comparison. It helps quite a bit to help people like me understand whats going on and difference is, again very much appreciate the effort you put in!

arviz/kde_utils.py

OriolAbril

LGTM, thanks @tomicapretto !

arviz/kde_utils.py

tomicapretto · 2020-08-05T14:18:42Z

@aloctavodia, @OriolAbril, @ahartikainen and @canyon289 thank you all for the reviews and the work!!

Added a new KDE function that can be used to estimate linear and circular data. Decided to put everything in a new file (instead of numeric_utils) because these functions required many utilities themselves. Also, updated all functions that used the old function. The circular option is not yet passed to higher level functions. That is something to be done.

* In the previous version the function tried to normalize the angles to [-pi, pi) but it was not correct. This commit includes a function that works appropiately.

…received. * Functions _circular_mean() and normalize_angle() now are in numeric_utils.py. * If the array has more than 1 dimension, _check_type() tries to flatten it silenty. If it is not possible, an error is raised. * Improved how bw_fct type and value are checked and reported. * All functions now start with _ because they are all internals. * _select_bw_method() no longer exists. Its functionality is now incorporated into _get_bw(). * Changed internal _bw_* functions to a common signature. * Added DeprecationWarning to _fast_kde() and now it calls _kde(). * Added tests for _normalize_angle() and _circular_mean(). The following functions and their backends have been modified to incorporate the circular argument that is passed to _kde() as well as the new meaning of the argument bw (see below): * plot_density() * plot_kde() * plot_dist() * plot_posterior() * plot_violion() * calculate_point_estimate() In addition, plot_kde() gained a new argument, adaptive, which is used to determine if the bandwidth is adaptive or not. Finally, note the meaning of the argument bw has changed. It used to be a bandwidth multiplication factor Now it can be either a string indicating the bandwidth method or the bandwidth itself. The internal function _kde() has an argument bw_fct which is the multiplicative factor.

* custom_lims in _kde() can be a tuple but then it tried to modify an inmutable object. Now it is converted to a list internally. * test_kde_scipy() had a slicing step that was not necessary an caused an error when unpacking * No error is raised when the data passed has lenght of 0 or 1. It raises a warning and returns np.nan

Co-authored-by: Oriol Abril-Pla <oriol.abril.pla@gmail.com>

codecov · 2020-08-05T22:11:26Z

Codecov Report

Merging #1284 into master will decrease coverage by 1.03%.
The diff coverage is 66.05%.

@@            Coverage Diff             @@
##           master    #1284      +/-   ##
==========================================
- Coverage   92.89%   91.86%   -1.04%     
==========================================
  Files         101      102       +1     
  Lines       10192    10471     +279     
==========================================
+ Hits         9468     9619     +151     
- Misses        724      852     +128

Impacted Files	Coverage Δ
arviz/plots/backends/bokeh/distplot.py	`86.56% <ø> (ø)`
arviz/plots/backends/matplotlib/distplot.py	`91.37% <ø> (ø)`
arviz/plots/distplot.py	`89.47% <ø> (ø)`
arviz/plots/energyplot.py	`100.00% <ø> (ø)`
arviz/plots/posteriorplot.py	`87.50% <ø> (ø)`
arviz/plots/violinplot.py	`83.33% <ø> (ø)`
arviz/kde_utils.py	`57.85% <57.85%> (ø)`
arviz/numeric_utils.py	`90.00% <66.66%> (-7.60%)`	⬇️
arviz/plots/densityplot.py	`88.88% <75.00%> (-1.12%)`	⬇️
arviz/plots/kdeplot.py	`96.96% <87.50%> (-3.04%)`	⬇️
... and 18 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0a1f548...c75989f. Read the comment docs.

ahartikainen · 2020-08-05T22:17:48Z

Thanks for the PR!

joelostblom · 2023-09-12T18:08:37Z

Thank you for this implementation of isj in Python; I have been struggling to find anything other than scott/silverman in Python (it's really only arviz, zfit, and KDEpy that have it, I believe). I'm curious to the motivation behind the "experimental" default of averaging the silverman and isj kernel. I haven't seen that elsewhere; the usual recommendation I come a across is to use isj. I'm not familiar with the literature on density estimation research so that might be why I haven't heard of it, and I'm interested to understand the benefits of this approach.

I did stumble upon your notebooks in this repo https://github.com/tomicapretto/density_estimation @tomicapretto and found them illuminating in general (thank you for explaining concepts and results with clarity and in such detail!). However, I did not find an investigation of the silverman/isj average there either. I did notice that you wrote about an issue with isj undersmoothing in sparse regions and suggested using density-adapted kernel methods as a remedy for that:

With that background, I could imagine that combining silverman and isj would maybe bring some of the same benefits as an adaptive kernel (balancing under and oversmoothing; although I would guess it gives somewhat inferior results?), but at a lower cost in both complexity and computation. Is this (one of) the reason for the choice of the silverman/isj average here, or is there another motivation behind that decision? I would also be curious if you still find it a more suitable choice after having used it as the default in ArViz for a few years now since this was implemented.

tomicapretto · 2023-09-13T13:05:33Z

Hi @joelostblom thanks for the words of recognition! It was a very challenging and fun work.

The sustain behind the experimental bandwidth was a simulation study that I performed. I wrote a report and uploaded it to my old webpage. Then I migrated it and that was lost (sorry!). But I went to the repository and found it in an old commit. So have a look at the report.html here https://github.com/tomicapretto/webpage/tree/5c5df838ed90fcb676d26904885740faf148fd0b/content/files.

joelostblom · 2023-09-13T16:21:31Z

Super, thank you so much for digging that up! That also cleared up a misunderstanding I had around how the adaptive kernel works. It's interesting that Scott and Silverman are used so widely although they appear to have clear oversmoothing drawbacks, especially for sample sizes <~5,000 - 10,000 observations. Thanks again!

aloctavodia requested changes Jul 10, 2020

View reviewed changes

aloctavodia reviewed Jul 23, 2020

View reviewed changes

tomicapretto mentioned this pull request Jul 23, 2020

Update functions using KDE to pass optional arguments #1316

Open

OriolAbril reviewed Jul 23, 2020

View reviewed changes

tomicapretto mentioned this pull request Jul 24, 2020

Use conditional_jit in utilities used by kde() #1318

Open

OriolAbril reviewed Jul 29, 2020

View reviewed changes

aloctavodia approved these changes Jul 31, 2020

View reviewed changes

OriolAbril reviewed Aug 3, 2020

View reviewed changes

arviz/kde_utils.py Outdated Show resolved Hide resolved

arviz/kde_utils.py Outdated Show resolved Hide resolved

arviz/kde_utils.py Outdated Show resolved Hide resolved

arviz/kde_utils.py Show resolved Hide resolved

OriolAbril approved these changes Aug 3, 2020

View reviewed changes

arviz/kde_utils.py Outdated Show resolved Hide resolved

tomicapretto and others added 9 commits August 5, 2020 22:11

Update according to requests

7499764

Fix how circular data is handled and listed changes in CHANGELOG.

8335720

* In the previous version the function tried to normalize the angles to [-pi, pi) but it was not correct. This commit includes a function that works appropiately.

Update according to required changes.

eaaff9d

DeprecationWarning replaced by FutureWarning

819e9a6

Minor modifications regarding last PR comments

4e5bb76

Update arviz/kde_utils.py

01e61c2

Co-authored-by: Oriol Abril-Pla <oriol.abril.pla@gmail.com>

ahartikainen changed the title ~~[WIP] Add new KDE function.~~ Add new KDE function. Aug 5, 2020

ahartikainen and others added 2 commits August 6, 2020 00:19

Minor fixes

0ee8cf3

fix docstrings

c75989f

ahartikainen approved these changes Aug 5, 2020

View reviewed changes

ahartikainen merged commit 74b332a into arviz-devs:master Aug 5, 2020

agustinaarroyuelo mentioned this pull request Oct 20, 2020

Fix labels and use circular KDE in plot_trace #1428

Merged

4 tasks

Add new KDE function. #1284

Add new KDE function. #1284

Conversation

tomicapretto commented Jul 7, 2020 • edited

Description

Major changes

Bandwidth estimation methods supported

Comments:

Plots

Checklist

aloctavodia commented Jul 10, 2020

aloctavodia commented Jul 10, 2020

aloctavodia commented Jul 17, 2020

aloctavodia left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

OriolAbril left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

aloctavodia Jul 24, 2020 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

aloctavodia commented Jul 28, 2020

ahartikainen commented Jul 28, 2020

aloctavodia commented Jul 28, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

OriolAbril Jul 31, 2020 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

canyon289 commented Jul 31, 2020

OriolAbril left a comment

Choose a reason for hiding this comment

tomicapretto commented Aug 5, 2020

codecov bot commented Aug 5, 2020

Codecov Report

ahartikainen commented Aug 5, 2020

joelostblom commented Sep 12, 2023 • edited

tomicapretto commented Sep 13, 2023

joelostblom commented Sep 13, 2023

tomicapretto commented Jul 7, 2020 •

edited

aloctavodia Jul 24, 2020 •

edited

OriolAbril Jul 31, 2020 •

edited

joelostblom commented Sep 12, 2023 •

edited