frequent/largest/smallest determined from source and compare combined #66

anthonyywangmath · 2020-11-18T22:43:29Z

When comparing two numerical series, the detailed tables of freq/max/min values to display was determined based on the source data only. This could make it seem like, for example, that all the largest values come from the source dataset when it is not the case.

For example if the largest values in your source dataset were 20, 18, 16 while in the compare dataset the largest were 21, 19, 17, then only 20, 18, 16 would appear in the table of largest values.

The proposed changes makes it so that the values shown in the table of frequent/largest/smallest is determined by combining both source and compare datasets whenever the compare dataset exists.

Other remarks:

the code might not work when source is int and compare is double (which is mentioned in the comments in get_comparison_num). Maybe the best way to handle this is to force source and compare to have the same datatypes when initially processing them, before getting value counts, etc.
the SWEETVIZ_REPORT.html in docs/examples/ may need to be regenerated to reflect changes made

fbdesignpro · 2020-11-24T14:57:54Z

Hi @anthonyywangmath, thank you so much for the detailed contribution!

This is a good find. I agree that the solution is to force source and compared to use the same data types. I am reviewing a few other issues, and in general it feels like coercing users into a bit more strictness is more desirable than trying to do a bunch of conversion and failsafe cases, because things degenerate quickly since there can be so many edge cases.

I'm not sure if I will bring this in wholesale as a pull request or integrate it piecemeal, as I am in the middle of a big revision and it might be easier to do. I will look into it further and let you know... Thanks again!

sagarishere · 2024-03-21T10:19:10Z

the code might not work when source is int and compare is double (which is mentioned in the comments in get_comparison_num). Maybe the best way to handle this is to force source and compare to have the same datatypes when initially processing them, before getting value counts, etc.

I want to chime in on this one.

Instead of forcing source to have same datatype, an advisory could be issued on top of the report to get more accurate results by having the same data-types.
I usually like the flexibility, as I have coded in JS a lot. Many people appreciate when things just work and give up if nothing works. For more adoption of SweetViz, we should provide users with options.

The entry point to the use of library should be low.

frequent/largest/smallest determined from source and compare combined

c530dfa

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

frequent/largest/smallest determined from source and compare combined #66

frequent/largest/smallest determined from source and compare combined #66

anthonyywangmath commented Nov 18, 2020

fbdesignpro commented Nov 24, 2020

sagarishere commented Mar 21, 2024

frequent/largest/smallest determined from source and compare combined #66

Are you sure you want to change the base?

frequent/largest/smallest determined from source and compare combined #66

Conversation

anthonyywangmath commented Nov 18, 2020

fbdesignpro commented Nov 24, 2020

sagarishere commented Mar 21, 2024