Addition of a subroutine to compute the median of array elements #426

jvdp1 · 2021-06-04T21:21:20Z

The API can already be reviewed.

Still to do:

Update hyperlinks in specs
Add comments in common.fypp
Add some tests (e.g., for int64 and real(dp))
Change sort to ord_sort once "Issue with stdlib_sorting" #428 is merged

gareth-nx · 2021-06-05T05:19:11Z

Untested suggestion:

I wonder if the code duplication for the integer/real cases can be reduced by including the output type o1 in the code generation loop:
#:for k1, t1, o1 in ( list(zip(REAL_KINDS, REAL_TYPES, REAL_TYPES)) + list(zip(INT_KINDS, INT_TYPES, ['real(dp)']*len(INT_KINDS))) )

So here 'o1' is the type of res (which is always real(dp) in the integer case).

The calculations seem to rely on constants (1. and 0.5) that have the right precision, but it looks like they could be defined as local parameters of type o1, making the integer/real cases have identical code (?).

gareth-nx · 2021-06-05T05:28:45Z

Another suggestion: From the sorting review, I understood that ord_sort is much faster than sort on data with significant pre-sorted chunks, while it is just a fraction slower for random data.

Considering that for median calculation both cases should be pretty common, I wonder if ord_sort is a better general choice? Although the downside is that more scratch memory will be required.

Perhaps the routine could have an optional argument sort_method that allows switching between the two?

jvdp1 · 2021-06-05T18:07:41Z

Thank you @gareth-nx for your suggestions.

> I wonder if the code duplication for the integer/real cases can be reduced by including the output type o1 in the code generation loop:
> #:for k1, t1, o1 in ( list(zip(REAL_KINDS, REAL_TYPES, REAL_TYPES)) + list(zip(INT_KINDS, INT_TYPES, ['real(dp)']*len(INT_KINDS))) )
>
> So here 'o1' is the type of res (which is always real(dp) in the integer case).

Great idea. I did it a bit differently than what you proposed, but it reduced the code a lot.

> Considering that for median calculation both cases should be pretty common, I wonder if ord_sort is a better general choice? Although the downside is that more scratch memory will be required.

Indeed, I also think that ord_sort is a better choice. I didn't use it because it needs more scratch memory. While I was trying ord_sort in this code, I found a bug in it (see #428 for details). Once #428 is merged, I will replace sort with ord_sort in this procedure too.

> Perhaps the routine could have an optional argument sort_method that allows switching between the two?

The API of median is currently the same as the API of mean. IMO it would be good to keep it like that.

gareth-nx · 2021-06-07T10:29:05Z

In the case with mask as an argument, consider checking whether the size of the mask is equal to the size of x. Maybe this is not desired (to avoid throwing errors etc). But results like the following could easily be a user-error.

program median_local
    use stdlib_stats, only : median
    use iso_c_binding, only : dp => C_DOUBLE
    implicit none

    real(dp) :: x(10), y(11)
    integer :: i

    x = (/(i*1.0_dp, i = 1, size(x))/)
    y = (/(i*1.0_dp, i = 1, size(y))/)
    
    ! Should this be an error, because size(mask) != size(y)? 
    print*, median(y, mask= (x > 5.5_dp))
end program

jvdp1 · 2021-06-08T17:26:56Z

Good suggestion. I wonder what the intrinsic sum reports (and what the standard says) in such a case.

jvdp1 · 2021-06-08T17:44:23Z

With GFortran, it seems that sum performs no such check in release mode. In debug mode, a runtime error reports that a mismatch was found. I am in favor of keeping the same behavior as the intrinsic sum as implemented in gfortran (which is currently the case). However, I am open to implementing a check if the community wants one.

arjenmarkus · 2021-06-09T06:56:07Z

I would say such a check is light-weight and potentially saves a lot of problems that would otherwise not be easily noticed. For the runtime library of a compiler the situation may be different: the compiler knows that extra checks are called for (debug, checks on array bounds and such), but our code does not have that luxury. In debug builds you would see an array bounds problem reported, not a mismatch in array arguments. My preference would be to add the check ;), but others may disagree.

jvdp1 · 2021-06-11T17:58:06Z

@gareth-nx @arjenmarkus I added a check on the shapes of x and mask ;) Question: how can we test that it works properly, given that it uses an error stop?
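
A minimal sketch of such a shape check, written as a standalone program rather than as the code added in this PR. The helper name `shapes_match` is hypothetical and only serves the illustration; it compares the two shape vectors and returns a logical instead of calling error stop, which is one possible way to make the check itself testable.

```fortran
program shape_check_sketch
    implicit none

    real :: y(11)
    logical :: mask_ok(11), mask_bad(10)

    y = 1.0
    mask_ok = .true.
    mask_bad = .true.

    print *, shapes_match(shape(y), shape(mask_ok))    ! expected: T
    print *, shapes_match(shape(y), shape(mask_bad))   ! expected: F

contains

    ! Hypothetical helper: .true. when both shape vectors describe the same shape.
    pure logical function shapes_match(sx, sm)
        integer, intent(in) :: sx(:), sm(:)

        shapes_match = .false.
        ! Different ranks can never match; this also keeps the
        ! elementwise comparison below conformable.
        if (size(sx) /= size(sm)) return
        shapes_match = all(sx == sm)
    end function shapes_match

end program shape_check_sketch
```

In a routine, the same test could guard the computation, for example `if (.not. shapes_match(shape(x), shape(mask))) error stop "shapes of x and mask differ"`; the message text here is an assumption.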

jvdp1

removed an error stop that was inappropriate.

src/stdlib_stats_median.fypp

gareth-nx · 2021-07-04T00:27:34Z

Hi @jvdp1

Is this ready for review? I will be happy to do that once it is (but I note the very top of the thread suggests you still need to update some hyperlinks).

jvdp1 · 2021-07-04T06:33:28Z

Hi @gareth-nx, thank you. You may start the review. I will try to update the hyperlinks for FORD later...

src/stdlib_stats_median.fypp

gareth-nx

Very nice -- a couple of minor comments for your consideration, but they should be easy to address.

src/stdlib_stats_median.fypp

doc/specs/stdlib_stats.md

src/common.fypp

jvdp1 · 2021-07-21T21:20:17Z

Thank you @gareth-nx @leonfoks @ivan-pi @milancurcic for your review and comments.
I believe I answered all of them.

This whole discussion about selection algorithms led me to add some rules for cases where the array contains NaN. If it does, the result will now contain NaN too. In the previous implementation, the result was undetermined in such cases.

milancurcic · 2021-07-22T19:26:33Z

@ivan-pi can this PR be merged?

src/common.fypp

src/stdlib_stats.fypp

ivan-pi · 2021-07-23T06:59:18Z

src/stdlib_stats_median.fypp

+        n = size(x, kind=int64)
+        c = floor( (n + 1) / 2._${o1}$, kind=int64 )
+
+        x_tmp = reshape(x, [n])


Is the reshape necessary here?

Okay, I see this is to flatten the array. I guess pack could also be used in this case?

jvdp1

The subroutine sort only accepts rank-1 arrays, while median supports arrays of all ranks.

Do you have another suggestion to avoid the reshape?

Indeed, x_tmp = pack(x, .true.) should work too. Do you think that pack is more efficient than reshape in this case?

ivan-pi

I imagine that a good compiler would do the same thing. So it can remain as is.
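
To illustrate the point about flattening, here is a small standalone sketch (not taken from the PR) showing that reshape and pack with a scalar .true. mask produce the same rank-1 array, since both follow array element order. The names `a` and `b` are illustrative only.

```fortran
program flatten_sketch
    implicit none

    real :: x(2, 3)
    real, allocatable :: a(:), b(:)
    integer :: i

    ! Fill x with 1..6 in array element (column-major) order.
    x = reshape([(real(i), i = 1, 6)], shape(x))

    a = reshape(x, [size(x)])   ! flatten via reshape
    b = pack(x, .true.)         ! flatten via pack with a scalar mask

    print *, all(a == b)        ! expected: T
end program flatten_sketch
```

Which of the two is faster is left open here; that depends on the compiler.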

ivan-pi · 2021-07-23T07:07:45Z

src/tests/stats/test_median.fypp

+    call check( any(ieee_is_nan(median(d3, 3, .false.))), '${k1}$ median(d3, 3, .false.)' )
+
+    call check( abs(median(d1, 1) - 1.5_${k1}$) < ${k1}$tol, '${k1}$ median(d1, 1), even')
+    call check( sum(abs(median(d2, 1) - [2._${k1}$, -4._${k1}$, 7._${k1}$, 1._${k1}$])) < ${k1}$tol, &


Would using the array kind specifier help reduce the preprocessor noise?

[real(${k1}$) :: 2.0, -4.0, 7.0, 1.0]

jvdp1

With gfortran, this implies a conversion from real(4) to real(8), with a warning triggered by -Wconversion-extra.
For this reason, I am inclined to keep it as it is now.
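
For illustration, a small standalone sketch comparing the two spellings. It assumes a kind parameter `dp` taken from iso_fortran_env in place of the fypp-generated `${k1}$`. The values used here are exactly representable in single precision, so both constructors give the same array; a literal such as 0.1 would not, because it is rounded to default real before being converted.

```fortran
program typed_constructor_sketch
    use, intrinsic :: iso_fortran_env, only: dp => real64
    implicit none

    real(dp) :: a(4), b(4)

    ! Kind suffix on every literal: no conversion involved.
    a = [2._dp, -4._dp, 7._dp, 1._dp]

    ! Type-spec in the constructor: default-real literals are converted to real(dp).
    b = [real(dp) :: 2.0, -4.0, 7.0, 1.0]

    print *, all(a == b)   ! expected: T for these exactly representable values
end program typed_constructor_sketch
```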

ivan-pi

Following the links you provided the implementation looks good to me. The preprocessing work necessary to manage the different ranks is admirable.

The only comment I have left is: how does the behavior compare to other languages when NaN values are present? The check at the beginning, any(ieee_is_nan(x)), seems quite expensive, assuming that in the majority of cases NaNs won't be there. But perhaps I am wrong and this is not a big issue.

gareth-nx · 2021-07-23T07:17:58Z

The R interpreter gives NA if there are NA or NaN values present in the input vector:

> median(c(1,2,3,NA))
[1] NA

> median(c(1,2,3,NaN))
[1] NA

jvdp1 · 2021-07-23T08:10:35Z

Julia's median returns NaN when NaNs are present in the array (and, as mentioned in @gareth-nx's comment, it seems to be the case for R too).

My main issue was that the result of sort is undetermined in the presence of NaN. Therefore, the same would hold for median if there were no checks for NaN, and modifying its implementation (as proposed with quickselect) could result in different behaviour. Checking for NaN avoids possible future changes in the behavior of median if its implementation is modified.
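
The rule described above can be sketched as follows. This is a naive standalone illustration, not the stdlib implementation: the function name `naive_median` is hypothetical, it only handles rank-1 real(dp) arrays, it uses a simple insertion sort instead of sort or ord_sort, and returning NaN for an empty array is a choice made for this sketch.

```fortran
program nan_median_sketch
    use, intrinsic :: ieee_arithmetic, only: ieee_is_nan, ieee_value, ieee_quiet_nan
    use, intrinsic :: iso_fortran_env, only: dp => real64
    implicit none

    real(dp) :: nan

    nan = ieee_value(1._dp, ieee_quiet_nan)

    print *, naive_median([3._dp, 1._dp, 2._dp])              ! odd size: 2.0
    print *, naive_median([4._dp, 1._dp, 3._dp, 2._dp])       ! even size: 2.5
    print *, ieee_is_nan(naive_median([1._dp, nan, 3._dp]))   ! expected: T

contains

    ! Hypothetical helper: median of a rank-1 array, NaN if any element is NaN.
    function naive_median(x) result(res)
        real(dp), intent(in) :: x(:)
        real(dp) :: res
        real(dp) :: tmp(size(x)), key
        integer :: i, j, n, c

        n = size(x)

        ! Decide the NaN case before sorting, so the result does not
        ! depend on how the sort happens to order NaN values.
        if (n == 0 .or. any(ieee_is_nan(x))) then
            res = ieee_value(1._dp, ieee_quiet_nan)
            return
        end if

        ! Insertion sort on a copy (stand-in for a library sort).
        tmp = x
        do i = 2, n
            key = tmp(i)
            j = i - 1
            do while (j >= 1)
                if (tmp(j) <= key) exit
                tmp(j + 1) = tmp(j)
                j = j - 1
            end do
            tmp(j + 1) = key
        end do

        ! Middle element for odd n, mean of the two middle elements for even n.
        c = (n + 1) / 2
        if (mod(n, 2) == 0) then
            res = (tmp(c) + tmp(c + 1)) / 2._dp
        else
            res = tmp(c)
        end if
    end function naive_median

end program nan_median_sketch
```

Because the NaN test runs before the sort, swapping the insertion sort for another algorithm (for example a selection algorithm) would not change what the function returns for inputs containing NaN.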

@ivan-pi I answered all your questions. However, a couple of comments remain open.

ivan-pi · 2021-07-23T08:30:04Z

The open comments don't affect the behaviour, so with three approvals this can be merged.

Thanks for your work.🙏

milancurcic · 2021-07-23T13:56:27Z

Thank you all, I'll merge.

jvdp1 added 2 commits June 3, 2021 23:29

progress

8dddff4

median: add subroutine to compute median of elements of an array

dbc16af

jvdp1 added 2 commits June 5, 2021 18:51

median: mv to ord_sort + combine REAL and INTEGER procedures

5d38d82

median: mv to sort due to issue fortran-lang#428

935949e

jvdp1 added 2 commits June 6, 2021 20:10

median: some cleaning

d506f4a

median: remove trailing whitespaces

2bc1a8b

awvwgk added the "reviewers needed" label Jun 9, 2021

jvdp1 added 2 commits June 11, 2021 19:50

median: add check on shapes between mask and x

ac1a2a2

median: add pure statement

fdfd150

jvdp1 commented Jun 11, 2021

jvdp1 added 7 commits June 11, 2021 20:10

Update src/stdlib_stats_median.fypp

b19c537

Update src/stdlib_stats_median.fypp

8ed99fe

median: update test

0d46361

Merge branch 'median' of https://github.com/jvdp1/stdlib into median

bad19f8

median: add comments to common.fypp

3342c6a

Merge remote-tracking branch 'upstream/master' into median

929d5c1

median: replace sort to ord_sort

7e3111e

gareth-nx reviewed Jul 4, 2021

gareth-nx requested changes Jul 4, 2021

update specs

dfea79d

jvdp1 added 6 commits July 21, 2021 18:18

median: reorder fypp variable

9bbcb74

median: replace _ by numbers

f17b890

median: add in common.fypp where it is used for median case

391c658

median: add comment in test median

afc92a2

median: add warning about naive implementation

4d328dc

median: return NaN when real array contain NaN

bdb47b7

jvdp1 added the "reviewers needed" label and removed the "waiting for OP" label Jul 21, 2021

jvdp1 mentioned this pull request Jul 21, 2021

Selection algorithms #471

Open

ivan-pi reviewed Jul 23, 2021

ivan-pi reviewed Jul 23, 2021

ivan-pi reviewed Jul 23, 2021

ivan-pi reviewed Jul 23, 2021

@ivan-pi suggestions from code review

227f021

Co-authored-by: Ivan Pribec <ivan.pribec@gmail.com>

ivan-pi requested changes Jul 23, 2021

jvdp1 added 3 commits July 23, 2021 09:44

median: rename fypp RName by name

9d38d9d

median: replace median_mask_all by median_all_mask

cbdc4ac

Merge branch 'median' of https://github.com/jvdp1/stdlib into median

a13c700

ivan-pi approved these changes Jul 23, 2021

milancurcic merged commit dd81cf5 into fortran-lang:master Jul 23, 2021

jvdp1 deleted the median branch July 23, 2021 14:05

awvwgk removed the "reviewers needed" label Sep 25, 2021

14NGiestas mentioned this pull request May 5, 2022

Median #377

Closed
