Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

frequent/largest/smallest determined from source and compare combined #66

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

anthonyywangmath
Copy link

When comparing two numerical series, the detailed tables of freq/max/min values to display was determined based on the source data only. This could make it seem like, for example, that all the largest values come from the source dataset when it is not the case.

For example if the largest values in your source dataset were 20, 18, 16 while in the compare dataset the largest were 21, 19, 17, then only 20, 18, 16 would appear in the table of largest values.

The proposed changes makes it so that the values shown in the table of frequent/largest/smallest is determined by combining both source and compare datasets whenever the compare dataset exists.

Other remarks:

  • the code might not work when source is int and compare is double (which is mentioned in the comments in get_comparison_num). Maybe the best way to handle this is to force source and compare to have the same datatypes when initially processing them, before getting value counts, etc.
  • the SWEETVIZ_REPORT.html in docs/examples/ may need to be regenerated to reflect changes made

@fbdesignpro
Copy link
Owner

Hi @anthonyywangmath, thank you so much for the detailed contribution!

This is a good find. I agree that the solution is to force source and compared to use the same data types. I am reviewing a few other issues, and in general it feels like coercing users into a bit more strictness is more desirable than trying to do a bunch of conversion and failsafe cases, because things degenerate quickly since there can be so many edge cases.

I'm not sure if I will bring this in wholesale as a pull request or integrate it piecemeal, as I am in the middle of a big revision and it might be easier to do. I will look into it further and let you know... Thanks again!

@sagarishere
Copy link

the code might not work when source is int and compare is double (which is mentioned in the comments in get_comparison_num). Maybe the best way to handle this is to force source and compare to have the same datatypes when initially processing them, before getting value counts, etc.

I want to chime in on this one.

Instead of forcing source to have same datatype, an advisory could be issued on top of the report to get more accurate results by having the same data-types.
I usually like the flexibility, as I have coded in JS a lot. Many people appreciate when things just work and give up if nothing works. For more adoption of SweetViz, we should provide users with options.

The entry point to the use of library should be low.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants