Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unify and fix rank module result types #183

Closed
bvenn opened this issue Jan 31, 2022 · 5 comments · Fixed by #204
Closed

Unify and fix rank module result types #183

bvenn opened this issue Jan 31, 2022 · 5 comments · Fixed by #204

Comments

@bvenn
Copy link
Member

bvenn commented Jan 31, 2022

There are 4 ranking methods in FSharp.Stats.Rank:

example = [5,3,3,4,2]

  • rankFirst - when ties are present, rank them corresponding to the order of their appearance
    • result :int [] = [5,2,3,4,1]
  • rankMin - when ties are present, their rank correspond to the rank of the first occurrence
    • result :float [] = [5,2,2,4,1]
  • rankMax - when ties are present, their rank correspond to the rank of the last occurrence
    • result :float [] = [5,3,3,4,1]
  • rankAvg - when ties are present, their rank correspond to the average of all tie ranks
    • result :float [] = [5,2.5,2.5,4,1]
  1. By now, all functions except rankFirst result in float arrays. The only function where floats can occur is rankAvg. For harmonization I would suggest, that rankFirst as well should report a float array, although it would be a breaking change.

  2. There seems to be an issue, that ties are not ranked correct here:

    let minTies a _ = (float a + 1.)

    It seems the +1 increment belongs to rankMax rather than rankMin.
    EDIT 01/02/22: THIS WAS A TEMPORAL LOCAL ERROR!

A fix is on the way!

@bvenn
Copy link
Member Author

bvenn commented Jan 31, 2022

  1. in line 23 there is abs (a-b) <= 0. Since an absolute cannot be smaller than 0 it can be changed to abs (a-b) = 0
    if (abs (data'.[i] - data'.[pi]) <= zero) then

@bvenn
Copy link
Member Author

bvenn commented Feb 1, 2022

Apparently, restarting the computer fixed issue 2. Unit tests are added and changes are on their way.

@bvenn bvenn closed this as completed in fad65e0 Feb 1, 2022
@bvenn
Copy link
Member Author

bvenn commented Feb 1, 2022

How to handle nan and infinity values within the sequence?

At this moment the ranking order is as follows:

  1. nan
  2. -infinity
  3. all real numbers
  4. infinity

nans and infinities are treated as individual elements:

let example = [|-infinity;1;nan;infinity;infinity|]
rankFirst =   [|    2;    3; 1;    4;       5    |]
rankMin   =   [|    2;    3; 1;    4;       5    |]
rankMax   =   [|    2;    3; 1;    4;       5    |]
rankAvg   =   [|    2;    3; 1;    4;       5    |] 

How do others handle nans?

  • pandas has the following options for na_option:

    • keep (default): assign NaN rank to NaN values
    • top: assign lowest rank to NaN values
    • bottom: assign highest rank to NaN values
  • R! has the following options for na.last:

    • TRUE (default), missing values in the data are put last
    • FALSE, they are put first
    • NA, they are removed
    • "keep" they are kept with rank NA

Suggestion

I would recommend to assign nan ranks to nan values as default case.

let example = [|-infinity;1;nan;infinity;infinity|]
rankFirst =   [|    1;    2;nan;   3;       4    |]

What do you think?
@muehlhaus
@kMutagene
@ZimmerD

@bvenn bvenn reopened this Feb 1, 2022
@muehlhaus
Copy link
Member

Yes, I think your suggestion is very good.

@bvenn
Copy link
Member Author

bvenn commented Feb 2, 2022

I've just created a update-rank branch to solve all issues. By default nan is sorted to the start of a sequence. This corrupts the loop of the implemented version. There are two possibilities to solve it:

  • Change the comparer to sort nan to the end. Afterwards the loop can be modified, that

    • a) nan ranks are assigned to nans. This version is implemented in f9e8a98 but because in every comparison the new comparer compNaNLast has to perform two nan checks the performance is reduced 20fold. Therefore I added specialized functions rankFirst -> rankFirstNaNLast
    • b) end the loop since all following nans must be nan and no comparison must be performed. All zeros can then be replaced by nans.
  • Leave the sorting as it is and add a counter to the loop that counts whenever a nan occurs. Set its rank to nan and subtract the counter value from the rank of all following real values. Thereby the nan checks are reduced by half. Because these checks are not necessary for all other types than nan, a type query in the beginning could help by reducing the number of nan checks.

bvenn added a commit that referenced this issue May 26, 2022
bvenn added a commit that referenced this issue May 26, 2022
bvenn added a commit that referenced this issue May 26, 2022
@bvenn bvenn mentioned this issue May 26, 2022
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants