Rank Data, Correlation, Covariance, R Squared #3484

Merged
14 commits merged on May 30, 2022

Conversation

@jdunkerley jdunkerley commented May 25, 2022

Pull Request Description

  • Added new `Statistic`s: Covariance, Pearson, Spearman, R Squared
  • Added `covariance_matrix` function
  • Added `pearson_correlation` function to compute correlation matrix
  • Added `rank_data` and `Rank_Method` type to create rankings of a Vector
  • Added `spearman_correlation` function to compute Spearman Rank correlation matrix
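
For readers who want to sanity-check the statistics listed above, here is a minimal Python sketch of covariance, Pearson correlation, and R squared, plus a pairwise matrix helper. It is not the Enso implementation. The function names are hypothetical, the covariance uses the sample form (dividing by n - 1), and R squared is taken as the square of the Pearson coefficient. Both choices are assumptions that this PR description does not confirm.

```python
import math
from typing import Callable, List, Sequence


def _check(xs: Sequence[float], ys: Sequence[float]) -> None:
    if len(xs) != len(ys):
        raise ValueError("series must have the same length")
    if len(xs) < 2:
        raise ValueError("need at least two observations")


def _mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def covariance(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Sample covariance. Dividing by n - 1 is an assumption of this sketch."""
    _check(xs, ys)
    mx, my = _mean(xs), _mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation: co-deviation scaled by both spreads."""
    _check(xs, ys)
    mx, my = _mean(xs), _mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)


def r_squared(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Square of the Pearson coefficient (assumed definition)."""
    return pearson(xs, ys) ** 2


def pairwise_matrix(
    series: Sequence[Sequence[float]],
    stat: Callable[[Sequence[float], Sequence[float]], float],
) -> List[List[float]]:
    """Apply `stat` to every ordered pair of series, giving a square matrix."""
    return [[stat(a, b) for b in series] for a in series]


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0, 4.0]
    b = [2.0, 4.0, 6.0, 8.0]
    print(covariance(a, b))
    print(pairwise_matrix([a, b], pearson))
```

Passing `covariance` or `pearson` to `pairwise_matrix` gives the matrix form; the diagonal of a correlation matrix is always 1 because each series is perfectly correlated with itself.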

Important Notes

  • Added `Panic.throw_wrapped_if_error` and `Panic.handle_wrapped_dataflow_error` to help with errors within a loop.
  • Removed `Array.set_at` use from `Table.Vector_Builder`

Checklist

Please include the following checklist in your PR:

  • The documentation has been updated if necessary.
  • All code conforms to the Scala, Java, and Rust style guides.
  • All code has been tested:
    • Unit tests have been written where possible.
    • If GUI codebase was changed: Enso GUI was tested when built using BOTH ./run.sh ide dist and ./run.sh ide watch.

@jdunkerley jdunkerley marked this pull request as ready for review May 26, 2022 16:12
@jdunkerley jdunkerley force-pushed the wip/jd/covariance-182059993 branch from e5c14ac to 3b80fb9 Compare May 26, 2022 17:00
@jdunkerley jdunkerley requested a review from hubertp May 26, 2022 17:14
Comment on lines +2 to +18
## Specifies how to handle ranking of equal values.
type Rank_Method
    ## Use the mean of all ranks for equal values.
    type Average

    ## Use the lowest of all ranks for equal values.
    type Minimum

    ## Use the highest of all ranks for equal values.
    type Maximum

    ## Use same rank value for equal values and next group is the immediate
       following ranking number.
    type Dense

    ## Equal values are assigned the next rank in order that they occur.
    type Ordinal

This explanation is much better, but I still wonder whether it would be worth offering some examples, either here or next to a method that uses this type. I'm still not sure I correctly understand how Dense or Average work.
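
As an illustration of how the five methods differ on tied values, here is a small Python sketch. It is not the Enso `rank_data` implementation. The function name, the lowercase method strings, and the ascending direction (rank 1 goes to the smallest value) are all assumptions made for this example.

```python
from typing import List, Sequence

METHODS = ("average", "minimum", "maximum", "dense", "ordinal")


def rank_values(values: Sequence[float], method: str = "average") -> List[float]:
    """Rank values in ascending order (an assumption), resolving ties by `method`."""
    if method not in METHODS:
        raise ValueError(f"unknown method: {method}")

    # Positions sorted by value; sorted() is stable, so ties keep input order.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)

    dense = 0   # counts distinct values seen so far
    start = 0   # first sorted index of the current group of equal values
    while start < len(order):
        end = start  # last sorted index (inclusive) of the current group
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        dense += 1

        for offset, position in enumerate(order[start:end + 1]):
            if method == "minimum":
                rank = start + 1
            elif method == "maximum":
                rank = end + 1
            elif method == "dense":
                rank = dense
            elif method == "average":
                rank = (start + 1 + end + 1) / 2
            else:  # ordinal: ties numbered in order of occurrence
                rank = start + 1 + offset
            ranks[position] = float(rank)

        start = end + 1
    return ranks


if __name__ == "__main__":
    data = [10, 20, 20, 30]
    for name in METHODS:
        print(name, rank_values(data, name))
    # average [1.0, 2.5, 2.5, 4.0]
    # minimum [1.0, 2.0, 2.0, 4.0]
    # maximum [1.0, 3.0, 3.0, 4.0]
    # dense   [1.0, 2.0, 2.0, 3.0]
    # ordinal [1.0, 2.0, 3.0, 4.0]
```

The two tied values occupy sorted positions 2 and 3. Average gives both 2.5, Minimum gives both 2, Maximum gives both 3, Dense gives both 2 and lets the next value take 3 rather than 4, and Ordinal gives 2 and 3 in the order the values occur.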


@radeusgd radeusgd left a comment


Looks good to me.

Just two questions:

  1. I checked the code but did not read deeply into the formulas for each statistic; I assume they are correct. If you want me to double-check them, I can do that.
  2. Where do the reference values for the results come from? Excel, I guess? If computing them is not completely trivial, it may be worth indicating briefly how they were obtained, in case someone needs to double-check them later when an issue appears.

case MINIMUM -> start + 1;
case MAXIMUM -> index;
case DENSE -> dense;
case AVERAGE -> (start + 1 + index) / 2.0;

Out of curiosity, in which situations would one want to use the average ranking?

(And why is it the default? I'm not saying it shouldn't be; it's an honest question, because I haven't seen it used before.)
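
One common place where average ranks appear is Spearman's rank correlation, which is conventionally computed as the Pearson correlation of the ranks, with tied values sharing the mean of the positions they span. The Python sketch below illustrates that convention. It is not the code from this PR, the function names are hypothetical, and it assumes ascending ranks.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Ascending ranks; tied values share the mean of their 1-based positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + 1 + end + 1) / 2  # mean of positions start+1 .. end+1
        for position in order[start:end + 1]:
            ranks[position] = shared
        start = end + 1
    return ranks


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length, at least 2 values")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman correlation as Pearson correlation of average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


if __name__ == "__main__":
    # Monotonic but non-linear relationship: the ranks agree exactly.
    print(spearman([1, 2, 3, 4, 5], [1, 4, 9, 16, 25]))
    # Ties on both sides receive the same shared rank.
    print(spearman([1, 2, 2, 3], [10, 20, 20, 30]))
```

Average ranks keep the sum of ranks equal to what it would be without ties, which is one reason they are a usual default for rank-based statistics.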

@jdunkerley jdunkerley force-pushed the wip/jd/covariance-182059993 branch from df9297b to e3e75d4 Compare May 30, 2022 13:45
@jdunkerley jdunkerley added the CI: Ready to merge This PR is eligible for automatic merge label May 30, 2022
@mergify mergify bot merged commit 1aa0bb3 into develop May 30, 2022
@mergify mergify bot deleted the wip/jd/covariance-182059993 branch May 30, 2022 17:13
jdunkerley added a commit that referenced this pull request May 31, 2022
- Added new `Statistic`s: Covariance, Pearson, Spearman, R Squared
- Added `covariance_matrix` function
- Added `pearson_correlation` function to compute correlation matrix
- Added `rank_data` and Rank_Method type to create rankings of a Vector
- Added `spearman_correlation` function to compute Spearman Rank correlation matrix

# Important Notes
- Added `Panic.throw_wrapped_if_error` and `Panic.handle_wrapped_dataflow_error` to help with errors within a loop.
- Removed `Array.set_at` use from `Table.Vector_Builder`
3 participants