Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Addition of a subroutine to compute the median of array elements #426

Merged
merged 33 commits into from
Jul 23, 2021

Conversation

jvdp1
Copy link
Member

@jvdp1 jvdp1 commented Jun 4, 2021

API can be already reviewed

Still to do:

  • Update hyperlinks in specs
  • Add comments in common.fypp
  • Add some tests (e.g., for int64 and real(dp))
  • Change sort to ord_sort when Issue with stdlib_sorting #428 will be merged

@gareth-nx
Copy link
Contributor

Untested suggestion:

I wonder if the code duplication for the integer/real cases can be reduced by including the output type o1 in the code generation loop:
#:for k1, t1, o1 in ( zip(REAL_KINDS, REAL_TYPES, REAL_TYPES) + zip(INT_KINDS, INT_TYPES, ['real(dp)']*len(INT_KINDS)) )

So here 'o1' is the type of res (which is always real(dp) in the integer case).

The calculations seem to rely on constants (1. and 0.5) that have the right precision, but it looks like they could be defined as local parameters of type o1, making the integer/real cases have identical code (?).

@gareth-nx
Copy link
Contributor

Another suggestion: From the sorting review, I understood that ord_sort is much faster than sort on data with significant pre-sorted chunks, while it is just a fraction slower for random data.

Considering that for median calculation both cases should be pretty common, I wonder if ord_sort is a better general choice? Although the downside is that more scratch memory will be required.

Perhaps the routine could have an optional argument sort_method that allows switching between the two?

@jvdp1
Copy link
Member Author

jvdp1 commented Jun 5, 2021

Thank you @gareth-nx for your suggestions.

I wonder if the code duplication for the integer/real cases can be reduced by including the output type o1 in the code generation loop:
#:for k1, t1, o1 in ( zip(REAL_KINDS, REAL_TYPES, REAL_TYPES) + zip(INT_KINDS, INT_TYPES, ['real(dp)']*len(INT_KINDS)) )

So here 'o1' is the type of res (which is always real(dp) in the integer case).

Great idea. I did it a bit differently that what you proposed. But it reduced the code a lot.

Considering that for median calculation both cases should be pretty common, I wonder if ord_sort is a better general choice? >Although the downside is that more scratch memory will be required.

Indeed, I also think that ord_sort is a better choice. I didn't use it because it needs more scratch memory. While I was trying ord_sort in this code, I found a bug in it (see #428 for details). When #428 will be merged, I will change sort for ord_sort in this procedure too.

Perhaps the routine could have an optional argument sort_method that allows switching between the two?

The API of median is currently the same as the API of mean. IMO it would be good to keep it like that.

@gareth-nx
Copy link
Contributor

In the case with mask as an argument, consider checking whether the size of the mask is equal to the size of x. Maybe this is not desired (to avoid throwing errors etc). But results like the following could easily be a user-error.

program median_local
    use stdlib_stats, only : median
    use iso_c_binding, only : dp => C_DOUBLE
    implicit none

    real(dp) :: x(10), y(11)
    integer :: i

    x = (/(i*1.0_dp, i = 1, size(x))/)
    y = (/(i*1.0_dp, i = 1, size(y))/)
    
    ! Should this be an error, because size(mask) != size(y)? 
    print*, median(y, mask= (x > 5.5_dp))
end program

@jvdp1
Copy link
Member Author

jvdp1 commented Jun 8, 2021

In the case with mask as an argument, consider checking whether the size of the mask is equal to the size of x. Maybe this is not desired (to avoid throwing errors etc). But results like the following could easily be a user-error.

program median_local
    use stdlib_stats, only : median
    use iso_c_binding, only : dp => C_DOUBLE
    implicit none

    real(dp) :: x(10), y(11)
    integer :: i

    x = (/(i*1.0_dp, i = 1, size(x))/)
    y = (/(i*1.0_dp, i = 1, size(y))/)
    
    ! Should this be an error, because size(mask) != size(y)? 
    print*, median(y, mask= (x > 5.5_dp))
end program

Good suggestion. I wonder what the intrinsic sum reports (and what the standard says) in such a case.

@jvdp1
Copy link
Member Author

jvdp1 commented Jun 8, 2021

In the case with mask as an argument, consider checking whether the size of the mask is equal to the size of x. Maybe this is not desired (to avoid throwing errors etc). But results like the following could easily be a user-error.

program median_local
    use stdlib_stats, only : median
    use iso_c_binding, only : dp => C_DOUBLE
    implicit none

    real(dp) :: x(10), y(11)
    integer :: i

    x = (/(i*1.0_dp, i = 1, size(x))/)
    y = (/(i*1.0_dp, i = 1, size(y))/)
    
    ! Should this be an error, because size(mask) != size(y)? 
    print*, median(y, mask= (x > 5.5_dp))
end program

Good suggestion. I wonder what the intrinsic sum reports (and what the standard says) in such a case.

With GFortran, it seems that no checks are done with sum in a release mode. In a debug mode, a runtime error is provided and mentioned that a mismatch was found. I am in favor to keep the same behavior as with the intrinsic sum implemented in gfortran (which is the case currently). However, I open to implemeent a check if desired by the community.

@arjenmarkus
Copy link
Member

arjenmarkus commented Jun 9, 2021 via email

@awvwgk awvwgk added the reviewers needed This patch requires extra eyes label Jun 9, 2021
@jvdp1
Copy link
Member Author

jvdp1 commented Jun 11, 2021

@gareth-nx @arjenmarkus I added a check on the shapes of x and mask ;) Question: how to test that it works properly, since it used an error stop?

Copy link
Member Author

@jvdp1 jvdp1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed an error stop that was inappropriate.

src/stdlib_stats_median.fypp Outdated Show resolved Hide resolved
src/stdlib_stats_median.fypp Outdated Show resolved Hide resolved
@gareth-nx
Copy link
Contributor

Hi @jvdp1

Is this ready for review? I will be happy to do that once it is (but I note the very top of the thread suggests you still need to update some hyperlinks).

@jvdp1
Copy link
Member Author

jvdp1 commented Jul 4, 2021 via email

Copy link
Contributor

@gareth-nx gareth-nx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice -- a couple of minor comments for your consideration, but they should be easy to address.

src/stdlib_stats_median.fypp Show resolved Hide resolved
src/stdlib_stats_median.fypp Show resolved Hide resolved
src/stdlib_stats_median.fypp Show resolved Hide resolved
doc/specs/stdlib_stats.md Outdated Show resolved Hide resolved
doc/specs/stdlib_stats.md Outdated Show resolved Hide resolved
doc/specs/stdlib_stats.md Outdated Show resolved Hide resolved
src/common.fypp Show resolved Hide resolved
@jvdp1
Copy link
Member Author

jvdp1 commented Jul 21, 2021

Thank you @gareth-nx @leonfoks @ivan-pi @milancurcic for your review and comments.
I believe I answered all of them.

This whole discussion about selection algorithms led me to add some rules for cases where array contains NaN. Now if it is the case, the result will contain NaN too. In the previous implementation, the result was undetermined in such cases.

@jvdp1 jvdp1 added reviewers needed This patch requires extra eyes and removed waiting for OP This patch requires action from the OP labels Jul 21, 2021
@jvdp1 jvdp1 mentioned this pull request Jul 21, 2021
@milancurcic
Copy link
Member

@ivan-pi can this PR be merged?

src/common.fypp Outdated Show resolved Hide resolved
src/common.fypp Outdated Show resolved Hide resolved
src/stdlib_stats.fypp Outdated Show resolved Hide resolved
n = size(x, kind=int64)
c = floor( (n + 1) / 2._${o1}$, kind=int64 )

x_tmp = reshape(x, [n])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the reshape necessary here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I see this is to flatten the array. I guess pack could also be used in this case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The subroutine sort only accepts rank-1 array, while median support arrays of all ranks.

Do you have another suggestion to avoid the reshape?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, x_tmp = pack(x, .true.) should work too. Do you think that pack is more efficient than reshape in this case?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I imagine that a good compiler would do the same thing. So it can remain as is.

call check( any(ieee_is_nan(median(d3, 3, .false.))), '${k1}$ median(d3, 3, .false.)' )

call check( abs(median(d1, 1) - 1.5_${k1}$) < ${k1}$tol, '${k1}$ median(d1, 1), even')
call check( sum(abs(median(d2, 1) - [2._${k1}$, -4._${k1}$, 7._${k1}$, 1._${k1}$])) < ${k1}$tol, &
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would using the array kind specifier help reduce the preprocessor noise?

[real(${k1}$) :: 2.0, -4.0, 7.0, 1.0]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With gfortran, this implies a conversion from real(4) to real(8), with a warning triggered by -Wconversion-extra.
For this reason, I am inclined to keep as it is now.

Co-authored-by: Ivan Pribec <ivan.pribec@gmail.com>
Copy link
Member

@ivan-pi ivan-pi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following the links you provided the implementation looks good to me. The preprocessing work necessary to manage the different ranks is admirable.

The only comment I have left is how does the behavior compare to other languages when NaN values are present? The check at the beginning any(ieee_is_nan(x)) seems quite expensive, assuming that the majority of cases NaNs won't be there. But perhaps I am wrong and this is not a big issue.

@gareth-nx
Copy link
Contributor

Following the links you provided the implementation looks good to me. The preprocessing work necessary to manage the different ranks is admirable.

The only comment I have left is how does the behavior compare to other languages when NaN values are present? The check at the beginning any(ieee_is_nan(x)) seems quite expensive, assuming that the majority of cases NaNs won't be there. But perhaps I am wrong and this is not a big issue.

The R interpreter gives NA if there are NA or NaN values present in the input vector:

> median(c(1,2,3,NA))
[1] NA

> median(c(1,2,3,NaN))
[1] NA

@jvdp1
Copy link
Member Author

jvdp1 commented Jul 23, 2021

The only comment I have left is how does the behavior compare to other languages when NaN values are present? The check at the beginning any(ieee_is_nan(x)) seems quite expensive, assuming that the majority of cases NaNs won't be there. But perhaps I am wrong and this is not a big issue.

Julia median returns NaN when NaN are present in the array (and as mentioned in @gareth-nx comment, it seems to be case for R too).

My main issue was that the result of sort is undetermined in prensence of NaN. Therefore, it would be also the case for median if there were no checks for NaN, and modifying its implementation (as proposed with quickselect) may result in a different behaviour. Checking for NaN allows to avoid possible future changes in the behavior of median if its implementation is modified.

@ivan-pi I answered all your questions. However, a couple of comments remain opened.

@ivan-pi
Copy link
Member

ivan-pi commented Jul 23, 2021

The open comments don't affect the behaviour, so with three approvals this can be merged.

Thanks for your work.🙏

@milancurcic
Copy link
Member

Thank you all, I'll merge.

@milancurcic milancurcic merged commit dd81cf5 into fortran-lang:master Jul 23, 2021
@jvdp1 jvdp1 deleted the median branch July 23, 2021 14:05
@awvwgk awvwgk removed the reviewers needed This patch requires extra eyes label Sep 25, 2021
@14NGiestas 14NGiestas mentioned this pull request May 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants