Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new KDE function. #1284

Merged
merged 11 commits into from
Aug 5, 2020
Merged

Add new KDE function. #1284

merged 11 commits into from
Aug 5, 2020

Conversation

tomicapretto
Copy link
Contributor

@tomicapretto tomicapretto commented Jul 7, 2020

Description

This pull requests replaces the function _fast_kde() with a new function called kde() that can be used to estimate density functions for both linear and circular data.

Major changes

  • Everything related to KDE now lives in kde_utils.py instead of numeric_utils.py. The new function requires so many utilities itself that it is worth a new file.
  • The old function _fast_kde() used to return density, xmin and xmax, which required creating a new grid after estimating the density.
    The new function returns grid and pdf. It can also return the estimated bandwidth by calling kde() with the argument bw_return set to True.
  • The new function kde() supports circular data via its boolean argument circular.
  • The argument bw used to represent the bandwidth scaling factor in the old function _fast_kde(). Now it represents either the bandwidth method or the bandwidth itself.
    It is still possible to pass a bandwidth scaling factor via the argument bw_fct which is 1 by default.
  • kde() supports adaptive bandwidth for linear data. This is enabled by setting adaptive to True.
  • kde() supports several bandwidth estimation methods (see below).
  • The old _fast_kde() used to perform boundary correction without extending the observed domain in all the cases.
    This is also the default for kde() but these settings can be changed with the arguments bound_correction, extend, extend_fct, and custom_lims.

Bandwidth estimation methods supported

For linear data:

  • Scott's rule (bw = 'scott').
  • Silverman's rule (bw = 'silverman').
  • Improved Sheather-Jones plug-in method (bw = 'isj').
  • Expermiental, which is the average of Silverman's rule and the ISJ method (bw = 'experimental'). This is the default.

For circular data:

  • Taylor's rule (bw = 'taylor'). It's like the Scott's rule of the circular world.

Comments:

  • kde_utils.py does not contain anything related to 2D KDE yet. It would be good to put all that here too.
  • The circular argument is not passed to higher level functions yet. This is something to do in the near future.
  • All the functions that used to call _fast_kde() now call kde() and have been updated to receive the new objects.

Plots

Updating the KDE function impacts in many different plots. Here is an example that compares the old version versus the new one. Original is here.

Old:
old

New:
new

We also have the following comparison which inclues the new version, the old version and the KDE offered by scipy.
5000 samples are used in all the cases.

comparison

Checklist

  • Follows official PR format
  • Includes a sample plot to visually illustrate the changes (only for plot-related functions)
  • New features are properly documented (with an example if appropriate)?
  • Includes new or updated tests to cover the new feature
  • Code style correct (follows pylint and black guidelines)
  • Changes are listed in changelog

arviz/kde_utils.py Outdated Show resolved Hide resolved
arviz/kde_utils.py Outdated Show resolved Hide resolved
arviz/kde_utils.py Show resolved Hide resolved
arviz/kde_utils.py Outdated Show resolved Hide resolved
arviz/kde_utils.py Outdated Show resolved Hide resolved
arviz/kde_utils.py Outdated Show resolved Hide resolved
arviz/kde_utils.py Outdated Show resolved Hide resolved
arviz/kde_utils.py Outdated Show resolved Hide resolved
arviz/kde_utils.py Outdated Show resolved Hide resolved
arviz/kde_utils.py Outdated Show resolved Hide resolved
@aloctavodia
Copy link
Contributor

thanks @tomicapretto for your contribution. This is a big one so I made a first round of revision and I will do a another one ASAP.

@aloctavodia
Copy link
Contributor

BTW I really like that in the example you posted the new kde is more wiggly than the current version :-)

@aloctavodia
Copy link
Contributor

It seems you need to rebase, there are a lot of commits unrelated to this PR

Copy link
Contributor

@aloctavodia aloctavodia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpicks

CHANGELOG.md Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
return bw


# Misc utils
# Misc utils ------------------------------------------------------------------------
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove the dashes

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the line of comment below, maybe these misc methods should go in numeric utils instead? and the comments to divide in sections can be deleted

arviz/kde_utils.py Outdated Show resolved Hide resolved
Copy link
Member

@OriolAbril OriolAbril left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! This is amazing work, I actually feel bad for not being able to correctly review right now. I'll try to review again with more time in some days/week

My main big comment is in terms of api, I think we should pay a lot of attention here to define an api for kde functions so it is easy to modify the functions, their components or even add new kde calculation methods.

CHANGELOG.md Outdated Show resolved Hide resolved
arviz/kde_utils.py Outdated Show resolved Hide resolved
arviz/kde_utils.py Outdated Show resolved Hide resolved
found = 0
if bw <= 0:
warnings.warn(
"Improved Sheather-Jones did not converge to a positive value. "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that whatever we choose we should use custom warnings so even in a worst case scenario they can be ignored without blocking all UserWarnings

Comment on lines 210 to 214
def circular_mean(x):
sinr = np.sum(np.sin(x))
cosr = np.sum(np.cos(x))
mean = np.arctan2(sinr, cosr)
return mean
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still have numeric utils? Shouldn't this go there? Also, I think Using the @conditional_jit decorator here will speed the method up further. See https://github.com/arviz-devs/arviz/blob/master/arviz/stats/stats_utils.py#L526 for an example

Also scipy version seems to do extra checks with respect to this one, not sure if nan checking or something like this could be useful for us. source https://github.com/scipy/scipy/blob/v1.5.1/scipy/stats/morestats.py#L3300-L3366 moreover, we should probably test our function gives the same result.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am aware of Scipy version but I decided to keep this one because it is faster. As it is used now, things like nan checking are done in kde_circular() which is the function that calls circular_mean().
I guess you mention checks and all that because you're considering using the function in other parts of arviz.
I think there are two alternatives here:

  • Implement a circular_mean() function that is going to be used everywhere in arviz, instead of Scipy version.
  • Use Scipy version in other parts of arviz and keep this function for the circular kde since it is faster and safety checks have already been performed when circular_mean() is called.

What is your opinion on this?

BTW, I'm not aware if Scipy circmean() is currently being used in other parts of arviz, I just guess this is the default alternative.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we use a circmean (not sure which) in summary with circular variables, I think we should have a single function used in both places.

I am fine with keeping the fast one, given that it covers our use case and we have tests checking it gives the same result as scipy one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you agree I can add the following function to numeric_utils.py and make use of it here.
I will also add a test to check the result matches the one obtained with scipy (as far as I've manually tested, it does).

def circular_mean(x, na_rm=False):   
    if len(x[np.isnan(x)]) > 0:
        if na_rm:
            x = x[~np.isnan(x)]
        else:
            return np.nan
    
    sinr = np.sum(np.sin(x))
    cosr = np.sum(np.cos(x))
    mean = np.arctan2(sinr, cosr)
    
    return mean

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, I would use np.any instead of slicing and then checking length, and can you open an issue on using numba for this? There are other functions that could benefit from it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about

def circular_mean(x, na_rm=False):   
    if na_rm:
        x = x[~np.isnan(x)]
    
    sinr = np.sum(np.sin(x))
    cosr = np.sum(np.cos(x))
    mean = np.arctan2(sinr, cosr)
    
    return mean

arviz/kde_utils.py Outdated Show resolved Hide resolved
arviz/kde_utils.py Outdated Show resolved Hide resolved
arviz/kde_utils.py Outdated Show resolved Hide resolved
arviz/kde_utils.py Outdated Show resolved Hide resolved
Comment on lines 667 to 668
assert isinstance(bw_fct, (int, float))
assert bw_fct > 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar comment to one above, these two should probably be type error and value error respectively, right now they'd raise an assertion error without any custom message

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the isinstance int, float will also give problems with numpy dtypes i.e. float32

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can replace it with the following

if not isinstance(bw_fct, (int, float, np.integer, np.floating)):
    raise TypeError(f"`bw_fct` must be a positive number, not an object of {type(bw_fct)}.")
    
if not bw_fct > 0:
    raise ValueError(f"`bw_fct` must be a positive number, not {bw_fct}.")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't floating work with python ints too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this what you mean? It returns False

isinstance(1, np.floating)
# False

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha, my bad: with python floats

@aloctavodia
Copy link
Contributor

I think this is (almost) ready to merge, we can keep improving the code and its integration to the rest of ArviZ in future PRs. The only missing part is to add a DeprecationWarning indicating users calling _fast_kde to call instead kde. Also not sure if we should keep the dash or not. i.e. we should replace _fast_kde with kde or _kde. Thoughts @OriolAbril, @ahartikainen?

@ahartikainen
Copy link
Contributor

Do we want users to call kde? At least for now our kde code has been only for internal use, and I think there is a good reason for that --> there is a lot of coordination between input <<-->> output

For example scipy kde can calculate pdf etc, can predict any point user wants. Our is purely for plotting.

@aloctavodia
Copy link
Contributor

I sometimes call _fast_kde, but I think in general we do not want user to call it. So I would like to keep the dash.

from scipy.sparse import coo_matrix

from .stats.stats_utils import histogram
from .utils import _stack, _dot, _cov
from .kde_utils import _kde
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a cyclic import

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved things from kde_utils.py to numeric_utils.py (as suggested) but that gives us this cyclic import. It works well because I don't use a relative import in kde_utils, but I don't feel it is very elegant/proper. Any suggestions on how to improve it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given that _fast_kde is a private method, I think we can leave it in numeric_utils raising an error and pointing to _kde.

Moreover, we used to have _fast_kde available as az._fast_kde (due to it being imported in plots' module __init__) and removed it at some point without any complaint/issue about it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if we should have _kde available again as az._kde though

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should as we used to have _fast_kde

@canyon289
Copy link
Member

@tomicapretto I really appreciate your very detailed initial PR message with lots of graphs and explanations for code changes at a human level and plots for comparison. It helps quite a bit to help people like me understand whats going on and difference is, again very much appreciate the effort you put in!

arviz/kde_utils.py Outdated Show resolved Hide resolved
arviz/kde_utils.py Outdated Show resolved Hide resolved
arviz/kde_utils.py Outdated Show resolved Hide resolved
arviz/kde_utils.py Show resolved Hide resolved
Copy link
Member

@OriolAbril OriolAbril left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @tomicapretto !

arviz/kde_utils.py Outdated Show resolved Hide resolved
@tomicapretto
Copy link
Contributor Author

@aloctavodia, @OriolAbril, @ahartikainen and @canyon289 thank you all for the reviews and the work!!

tomicapretto and others added 9 commits August 5, 2020 22:11
Added a new KDE function that can be used to estimate linear and circular data.
Decided to put everything in a new file (instead of numeric_utils) because these functions
required many utilities themselves.
Also, updated all functions that used the old  function.
The circular option is not yet passed to higher level functions. That is something to be done.
* In the previous version the function  tried to normalize
the angles to [-pi, pi) but it was not correct.
This commit includes a function  that works appropiately.
…received.

* Functions _circular_mean() and normalize_angle() now are in numeric_utils.py.
* If the array has more than 1 dimension, _check_type() tries to flatten it silenty. If it is not possible, an error is raised.
* Improved how bw_fct type and value are checked and reported.
* All functions now start with _ because they are all internals.
* _select_bw_method() no longer exists. Its functionality is now incorporated into _get_bw().
* Changed internal _bw_* functions to a common signature.
* Added DeprecationWarning to _fast_kde() and now it calls _kde().
* Added tests for _normalize_angle() and _circular_mean().
The following functions and their backends have been modified to incorporate the circular argument that is passed to _kde() as well as the new meaning of the argument bw (see below):
* plot_density()
* plot_kde()
* plot_dist()
* plot_posterior()
* plot_violion()
* calculate_point_estimate()
In addition, plot_kde() gained a new argument, adaptive, which is used to determine if the bandwidth is adaptive or not.
Finally, note the meaning of the argument bw has changed. It used to be a bandwidth multiplication factor Now it can be either a string indicating the bandwidth method or the bandwidth itself. The internal function _kde() has an argument bw_fct which is the multiplicative factor.
* custom_lims in _kde() can be a tuple but then it tried to modify an inmutable object. Now it is converted to a list internally.
* test_kde_scipy() had a slicing step that was not necessary an caused an error when unpacking
* No error is raised when the data passed has lenght of 0 or 1. It raises a warning and returns np.nan
Co-authored-by: Oriol Abril-Pla <oriol.abril.pla@gmail.com>
@ahartikainen ahartikainen changed the title [WIP] Add new KDE function. Add new KDE function. Aug 5, 2020
@codecov
Copy link

codecov bot commented Aug 5, 2020

Codecov Report

Merging #1284 into master will decrease coverage by 1.03%.
The diff coverage is 66.05%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1284      +/-   ##
==========================================
- Coverage   92.89%   91.86%   -1.04%     
==========================================
  Files         101      102       +1     
  Lines       10192    10471     +279     
==========================================
+ Hits         9468     9619     +151     
- Misses        724      852     +128     
Impacted Files Coverage Δ
arviz/plots/backends/bokeh/distplot.py 86.56% <ø> (ø)
arviz/plots/backends/matplotlib/distplot.py 91.37% <ø> (ø)
arviz/plots/distplot.py 89.47% <ø> (ø)
arviz/plots/energyplot.py 100.00% <ø> (ø)
arviz/plots/posteriorplot.py 87.50% <ø> (ø)
arviz/plots/violinplot.py 83.33% <ø> (ø)
arviz/kde_utils.py 57.85% <57.85%> (ø)
arviz/numeric_utils.py 90.00% <66.66%> (-7.60%) ⬇️
arviz/plots/densityplot.py 88.88% <75.00%> (-1.12%) ⬇️
arviz/plots/kdeplot.py 96.96% <87.50%> (-3.04%) ⬇️
... and 18 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0a1f548...c75989f. Read the comment docs.

@ahartikainen ahartikainen merged commit 74b332a into arviz-devs:master Aug 5, 2020
@ahartikainen
Copy link
Contributor

Thanks for the PR!

@joelostblom
Copy link

joelostblom commented Sep 12, 2023

Thank you for this implementation of isj in Python; I have been struggling to find anything other than scott/silverman in Python (it's really only arviz, zfit, and KDEpy that have it, I believe). I'm curious to the motivation behind the "experimental" default of averaging the silverman and isj kernel. I haven't seen that elsewhere; the usual recommendation I come a across is to use isj. I'm not familiar with the literature on density estimation research so that might be why I haven't heard of it, and I'm interested to understand the benefits of this approach.

I did stumble upon your notebooks in this repo https://github.com/tomicapretto/density_estimation @tomicapretto and found them illuminating in general (thank you for explaining concepts and results with clarity and in such detail!). However, I did not find an investigation of the silverman/isj average there either. I did notice that you wrote about an issue with isj undersmoothing in sparse regions and suggested using density-adapted kernel methods as a remedy for that:

image

With that background, I could imagine that combining silverman and isj would maybe bring some of the same benefits as an adaptive kernel (balancing under and oversmoothing; although I would guess it gives somewhat inferior results?), but at a lower cost in both complexity and computation. Is this (one of) the reason for the choice of the silverman/isj average here, or is there another motivation behind that decision? I would also be curious if you still find it a more suitable choice after having used it as the default in ArViz for a few years now since this was implemented.

@tomicapretto
Copy link
Contributor Author

Hi @joelostblom thanks for the words of recognition! It was a very challenging and fun work.

The sustain behind the experimental bandwidth was a simulation study that I performed. I wrote a report and uploaded it to my old webpage. Then I migrated it and that was lost (sorry!). But I went to the repository and found it in an old commit. So have a look at the report.html here https://github.com/tomicapretto/webpage/tree/5c5df838ed90fcb676d26904885740faf148fd0b/content/files.

@joelostblom
Copy link

Super, thank you so much for digging that up! That also cleared up a misunderstanding I had around how the adaptive kernel works. It's interesting that Scott and Silverman are used so widely although they appear to have clear oversmoothing drawbacks, especially for sample sizes <~5,000 - 10,000 observations. Thanks again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants