Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FS-1135 - Random functions for collections #732

Merged
merged 18 commits into from
May 29, 2024
Merged

FS-1135 - Random functions for collections #732

merged 18 commits into from
May 29, 2024

Conversation

Lanayx
Copy link
Contributor

@Lanayx Lanayx commented Feb 23, 2023

Click “Files changed” → “⋯” → “View file” for the rendered RFC (i.e.: here).

Discussion: in this thread

@Lanayx Lanayx changed the title Initial commit FS-1135 FS-1135 - Random functions for collections Feb 23, 2023
Copy link
Member

@cartermp cartermp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall, some changes needed though.

RFCs/FS-1135-random-functions-for-collections.md Outdated Show resolved Hide resolved
RFCs/FS-1135-random-functions-for-collections.md Outdated Show resolved Hide resolved
RFCs/FS-1135-random-functions-for-collections.md Outdated Show resolved Hide resolved
Lanayx and others added 2 commits February 23, 2023 11:04
Co-authored-by: Phillip Carter <pcarter@fastmail.com>
@Lanayx Lanayx requested a review from cartermp February 24, 2023 01:24
@T-Gro
Copy link
Contributor

T-Gro commented Feb 27, 2023

  1. The APIs which can fail on non-null values ( choice,choices,sample with collections that are emty/too small) -> I would favour having a try* variant of them, to avoid developers doing their own wrappers to prevent exceptions.
    - A try- alternative is in line with existing collections functions

  2. (personally, this can also wait for an impl PR, but for me it is an important detail):
    How do you plan to work with "Shared thread-safe Random instance ". Locking of a single global System.Random instance for the duration of entire operation? Or having multiple instances, either per-thread or in a pool?

  3. Since games and simulations were mentioned, it would be good to thing of high perf scenarios:

    • Just like it exists with sorting, could the Array.shuffle have an accompanying Array.shuffleInPlace to reduce copies?
    • Especially with a more secure random generator, it could be handy to pre-generate the random indexes for shuffle in parallel (to live in Array.Parallel module)
  4. The "with Random parameter" variants of the API : Does it only allow System.Random, or is https://learn.microsoft.com/en-us/dotnet/api/system.security.cryptography.randomnumbergenerator?view=net-7.0 considered?
    (an abstraction covering both regular System.Random as well as RandonNumberGenerator could be requiring a random generating function as an argument).

@Lanayx
Copy link
Contributor Author

Lanayx commented Feb 27, 2023

@T-Gro Hi, thank you for the review

  1. Not all of collection functions have try variant, only the ones where it makes sense. For example: max, average, reduce - just throw and that is ok for me. If function implies that collection is not empty, it is ok to throw, and choice,choices,sample imply that.
  2. The second option, you can see my implementation here https://github.com/Lanayx/fsharp/blob/master/src/FSharp.Core/Random.fs
  3. Agree, inplace shuffling can be useful, I'll add that
  4. System.Random is not sealed and it doesn't implement RandomNumberGenerator, so if someone wants use the latter he can inherit System.Random and implement virtual methods (BTW System.Random.Shared does exactly this)

@dsyme
Copy link
Contributor

dsyme commented Mar 1, 2023

  • I'm concerned about naming being confusing when colliding with existing functions, esp. List.choices v. List.choose. Perhaps

    • List.randomShuffle
    • List.randomShuffleWith
    • List.randomChoices
    • List.randomChoicesWith
    • List.randomSample
    • List.randomSampleWith

    It feels like the use of these functions is rare enough that adding "random" does help clarity here without too much cost.

  • Please list naming of equivalent python functions

  • Can we get any stats on use of equivalent python functions

  • I'm concerned there's no weighted choice option (as mentioned in the unresolved questions)

  • I'm concerned there are no functions to choose a single element.

  • I'm bit concerned by having both "randomChoices" and "randomSample" without the naming aligning, but let's first take a look at the Python naming, also perhaps other languages.

@dsyme
Copy link
Contributor

dsyme commented Mar 1, 2023

p.s. great to see progress on this

@Lanayx
Copy link
Contributor Author

Lanayx commented Mar 2, 2023

@dsyme thank you for the comments

  1. I don't like the random prefix, it makes me feel like those functions shouldn't belong there at all, while for example List.shuffle feels nice like it has always been there. If you are strongly against extending List module - we can consider creating a separate module Random (which will be static class), so the code will look like
[1; 2; 3] |> Random.shuffle
[|1; 2; 3|] |> Random.shuffle

This will allow us to use overloads, but prevent from using partial application. On one hand it is ok, on the other - it is not what the other collection functions look like, so it's not that consistent.

  1. Passing Random seems to fit better than function, since I think we should make it easy to pass new Random.Shared around. Still using shuffleWith instead of shuffleRand feels good with me.

  2. Idk about python functions usage, but I'm sure that every machine learning program should include random shuffling of the data between several epochs.

  3. Python: shuffle, choice, choices, sample (currently all names are taken from there). Java, Scala: shuffle only. Rust: choose, choose_multiple, choose_weighted, shuffle, partial_shuffle, choose_multiple_fill, choose_multiple_weighted. underscorejs: shuffle, sample

  4. choice is the function to get single element

  5. As for weights and counts parameters - my feeling is that they are more specific to data science, so if somebody needs them - they can contribute later, so the the functions will be for example List.choiceWeights and List.sampleCounts

  6. I understand concern about choose and choice wording, but that's a minor issue for me. There are much bigger intersections in F# with .Net base types, like Array or String being a module and a class, so functions are intersected with .net methods. Or having ...Async methods in different libraries returning Task and not Async etc

@Lanayx Lanayx requested review from dsyme and vzarytovskii and removed request for vzarytovskii and dsyme March 4, 2023 08:34
@vzarytovskii
Copy link
Contributor

vzarytovskii commented Mar 9, 2023

This looks fine to me overall.

I don't think we need a separate Random module for these functions, it will be a bit confusing on how it's related to collections.

As for naming - I don't have a strong opinion on that, since I don't have much experience in using those functions, but I share @dsyme's concern about naming - once they're in, there's no way for us changing or removing them in case they're confusing or lack some functionality.

I personally think, that the safest way is to align naming and signatures to what existing functions in .NET and Python have.

@Lanayx
Copy link
Contributor Author

Lanayx commented Mar 10, 2023

@dsyme I created a voting in the discussion and it seems that people like the idea of prefix, so I changed all the names in PR

@abelbraaksma abelbraaksma requested a review from dsyme May 16, 2024 14:53
Copy link
Contributor

@abelbraaksma abelbraaksma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, thanks! A few minor nits, mainly about being precise (and follow the language of NextDouble).

I think this is essentially ready. While I did mention a few things in the discussion thread, up to you to mention them here as well (i.e. on infinite sequences (note that .NET only supports arrays and spans, in contrast), and on the necessary caching).

RFCs/FS-1135-random-functions-for-collections.md Outdated Show resolved Hide resolved
RFCs/FS-1135-random-functions-for-collections.md Outdated Show resolved Hide resolved
RFCs/FS-1135-random-functions-for-collections.md Outdated Show resolved Hide resolved
RFCs/FS-1135-random-functions-for-collections.md Outdated Show resolved Hide resolved
RFCs/FS-1135-random-functions-for-collections.md Outdated Show resolved Hide resolved
Copy link
Contributor

@abelbraaksma abelbraaksma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved, but see these two suggestions of text remnants from the RFC template (not a blocker):


## Diagnostics

Please list the reasonable expectations for diagnostics for misuse of this feature. **I don't see a way to misuse it**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @Lanayx, it appears we both forgot about this line. We shouldn't leave the remnants of the template hanging around (I know, it happens, but we shouldn't ;) ).

This is my suggestion:

Suggested change
Please list the reasonable expectations for diagnostics for misuse of this feature. **I don't see a way to misuse it**
There are no known diagnostics on any abuse or misuse of this feature.

EDIT: this is resolved (but GH doesn't let me)

RFCs/FS-1135-random-functions-for-collections.md Outdated Show resolved Hide resolved
@abelbraaksma
Copy link
Contributor

@dsyme and @vzarytovskii, this proposal is ready. If you can give it your thumbs up, I can merge it (or you can ;) ).

@abelbraaksma
Copy link
Contributor

abelbraaksma commented May 27, 2024

/cc reminder for tomorrow: @vzarytovskii

@dsyme dsyme merged commit 63a9351 into fsharp:main May 29, 2024
@dsyme
Copy link
Contributor

dsyme commented May 29, 2024

Merged! Thank you everyone for your hard work on this

@KevinRansom
Copy link
Contributor

KevinRansom commented Jun 7, 2024

@dsyme -- please consider the revoking the approval for this suggestion.

FSharp.Core is already too big, and so far all attempts to refactor it to make it smaller have failed. Useful helper functions such as these belong in an external library where they can be selected by a developer who is working on scenarios where this type of functionality is useful.

We currently ship fsharp.core as a nuget package embedded within the dotnet sdk:

  • The current FSharp.Core.dll is currently 3MB and it looks like when nullability is supported it will bloat by perhaps an additional .5 MB, because we will have to duplicate the metadata stream for nullable aware compilers and legacy compilers. From time to time we get pressure from the SDK team to become an optional install feature because of our size and relatively small user base. I think whenever they hear a customer complaining that VMs for the dotnet SDK have to be larger than for other cloud platforms and languages.

In general we have only added APIs to FSharp.Core when they were widely applicable and supported an idiomatic programming style, or interop with C# assemblies of course some that don't match these criteria have slipped in.

Sure these APIs are small and compact, however, so are the many other APIs we could add, these suggested APIs deserve to live in a support library rather than FSharp.Core, someone should work on it. Because we have many gaps with the Numpy data library and FSharp.Core is not the place to address them.

@smoothdeveloper
Copy link
Contributor

smoothdeveloper commented Jun 7, 2024

FSharp.Stats?

get pressure from the SDK team to become an optional install feature because of our size and relatively small user base

It is not because F# is "small user base" (and that this line of thinking can lead to eradication of it) that it shouldn't be part of the SDK, unless we must make the CLR a C# only thing OR the SDK would come with no compilers, just the assemblies, msbuild stuff and dotnet tool. This also would save space in the SDK...

In anycase, I concur with your analysis, and also increasing the API surface for the compiler team, however useful, there are several packages that already extend FSharp.Core (https://fsprojects.github.io/FSharp.Collections.ParallelSeq/).

We should keep the fslang-design process though for those extensions though.

@Lanayx
Copy link
Contributor Author

Lanayx commented Jun 7, 2024

Because we have many gaps with the Numpy data library and FSharp.Core is not the place to address them.

Random functions are not part of Numpy, they are part of Python standard library, since they are very basic and widely applicable and were really "missing" in F# since inception
https://docs.python.org/3/library/random.html

@krauthaufen
Copy link

I too think that bloated libraries are hard to maintain and refactor since i have experience with it sadly. Maybe this would be a good starting point for a second nuget package?
Could live in this repo and just produce something called FSharp.Collections maybe?
I personally think that randomozing collections is somewhat oddly specific for a standard library.
Just my thoughts

@Lanayx
Copy link
Contributor Author

Lanayx commented Jun 9, 2024

bloated libraries are hard to maintain and refactor

Fully agree. However we should differentiate that from absence of the very basic and generally applicable functionality.

Maybe this would be a good starting point for a second nuget package?

I don't think so

I personally think that randomozing collections is somewhat oddly specific for a standard library.

The presence of random functions in Python standard library is one of the many things that lowered the barrier of ML entry for newcomers and it very positively influenced language popularity, we should follow the success path here rather than avoiding it.

@dsyme
Copy link
Contributor

dsyme commented Jun 12, 2024

Ultimately the team looking after the dotnet/fsharp repo do really have a right of veto on this.

@KevinRansom I'd request that we get those concerns raised much, much earlier in the design process. e.g. at the suggestion stage, or minimally at the RFC-discussion stage. This RFC PR was open 12 months.

@Lanayx Given @KevinRansom's concerns could we get a measure of the size delta?

@Lanayx I would favour dropping the With variants if it helps with size concerns. I'm ambivalent about those anyway. @Lanayx is there any technical reason to include them, e.g. is passing rnd.Next to the By variant equivalent?

@Lanayx
Copy link
Contributor Author

Lanayx commented Jun 12, 2024

@dsyme As for size question, here is measurements
Current main: 3261 KB
With random changes: 3293 KB
So the change is 32KB, which is 1% size change. As for dropping .With method I tend to disagree, since it's the fastest option, 2.5 times faster than .By option

module TestData =
    let arr = Array.init 1000 id

[<MemoryDiagnoser>]
type Rand() =


    [<Benchmark(Baseline = true)>]
    member _.Bcl () =
        let newArr = Array.copy TestData.arr
        Random.Shared.Shuffle newArr
        newArr

    [<Benchmark>]
    member _.RandomWith () =
        TestData.arr |> Array.randomShuffleWith Random.Shared

    [<Benchmark>]
    member _.RandomBy () =
        TestData.arr |> Array.randomShuffleBy Random.Shared.NextDouble
| Method     | Mean      | Error     | StdDev    | Ratio | RatioSD | Gen0   | Allocated | Alloc Ratio |
|----------- |----------:|----------:|----------:|------:|--------:|-------:|----------:|------------:|
| Bcl        |  4.229 us | 0.0650 us | 0.0608 us |  1.00 |    0.00 | 0.4730 |   3.93 KB |        1.00 |
| RandomWith |  3.743 us | 0.0354 us | 0.0331 us |  0.89 |    0.01 | 0.4768 |   3.93 KB |        1.00 |
| RandomBy   | 10.185 us | 0.0871 us | 0.0728 us |  2.41 |    0.04 | 0.4730 |   3.95 KB |        1.01 |

If 1% increase is indeed too much and I had to drop anything from API, I'd drop .By version of functions or Seq module functions (since all those functions internally do caching to the array as a first operation)

@brianrourkeboll
Copy link

@Lanayx If I had to guess, the HOF version being slower might be due to the randomizer not being inlined or else the branching when checking whether the value is between 0 and 1. The latter might be harder to address, I guess, although I haven't put thought into it.

@Lanayx
Copy link
Contributor Author

Lanayx commented Jun 12, 2024

@brianrourkeboll It will be slower anyway, because of checking value range and because of doing extra calculations of converting float 0..1 value to int min..max value

@brianrourkeboll
Copy link

brianrourkeboll commented Jun 12, 2024

@Lanayx Hmm, it seems to be possible to get randomShuffleBy much closer by making a few tweaks:

Benchmark source

| Method | Mean     | Error     | StdDev    | Ratio | RatioSD |
|------- |---------:|----------:|----------:|------:|--------:|
| Bcl    | 3.386 us | 0.0312 us | 0.0276 us |  1.00 |    0.00 |
| Pr     | 6.721 us | 0.0795 us | 0.0705 us |  1.99 |    0.03 |
| Faster | 3.881 us | 0.0290 us | 0.0271 us |  1.15 |    0.01 |

The lambda does seem to be required1 if you want full devirtualization, though. That is,

Faster.randomShuffleBy Random.Shared.NextDouble TestData.arr

is slightly slower (~2 μs on my machine) than

Faster.randomShuffleBy (fun () -> Random.Shared.NextDouble ()) TestData.arr

That's probably because the JIT can devirtualize the call to NextDouble in the (inlined) lambda case but not in the other.

So if a user really cares about maximum performance, they can:

  • Use Array.randomShuffle or the BCL Shuffle.
  • Use Array.randomShuffleBy with a lambda.
  • Bind Random.Shared to a local variable and pass in localRandom.NextDouble.1

Footnotes

  1. If you store the Random in a local value, you don't need the lambda:

    // Just as fast as (fun () -> Random.Shared.NextDouble ())
    [<Benchmark>]
    member _.Faster () =
        let random = Random.Shared
        Faster.randomShuffleBy random.NextDouble TestData.arr
    
    2

@dsyme
Copy link
Contributor

dsyme commented Jun 12, 2024

Interesting thanks! BTW what's the cost of the locking involved in Random.Shared? Just out of curiosity really.

I think from this I still approve the RFC, and my preference would still be to remove the With variants (@Lanayx is there any reason besides perf to include them?)

For me the deciding factor from a design perspective is that a large number of F# teaching scenarios become much simpler to teach if you just have List.randomShuffle and so on directly available, without having to explain System.Random or any .NET functionality (or the awful Span<T> overloads on System.Random.Shared.Shuffle - a new presence in the .NET world that is really problematic for teaching - imagine trying to explain to a beginner student why their Shuffle overload resolution fails!).

That's really one of the main things FSharp.Core is for: to present a coherent, teachable, usable, portable programming model that captures most common in-memory programming scenarios before moving on to advanced data structures or interop with system features or UX or tensors and so on. Doing basic random permutations has been a part of programming models like this ever since the days of Python and before. It is in Python for a reason, and should be in F# for the same reason. The addition of System.Random.Shared is a factor in this - it indicated that .NET has embraced the same principle. Therefore F# should embrace the principle too from its own perspective.

The owners of dotnet/fsharp still have a veto right on this one. Size is a factor affecting many scenarios that might not be obvious.

@brianrourkeboll
Copy link

@dsyme

BTW what's the cost of the locking involved in Random.Shared? Just out of curiosity really.

Do you mean System.Random.Shared in the BCL? If so, it uses an implementation with a ThreadStatic private field for thread-safety instead of explicit locking (that technique is also what lets it devirtualize and then inline calls to Next, etc.).

Compare: dotnet/fsharp#17277 (comment) and dotnet/fsharp#17277 (comment).

@dsyme
Copy link
Contributor

dsyme commented Jun 13, 2024

Do you mean System.Random.Shared in the BCL? If so, it uses an implementation with a ThreadStatic private field for thread-safety instead of explicit locking (that technique is also what lets it devirtualize and then inline calls to Next, etc.).

Got it, thanks!

@Lanayx
Copy link
Contributor Author

Lanayx commented Jun 13, 2024

For me the deciding factor from a design perspective is that a large number of F# teaching scenarios become much simpler to teach if you just have List.randomShuffle and so on directly available, without having to explain System.Random or any .NET functionality

But List.randomShuffle and so on will be directly available, so this probably doesn't apply? Very early beginners will just be able to use simplest version of random methods while studying. But anyway, System.Random is so central to random operations in .NET that even a beginner will have to learn it very soon. Also learners usually just copy documentation or samples or what ChatGPT tells them.

I think from this I still approve the RFC, and my preference would still be to remove the With variants (@Lanayx is there any reason besides perf to include them?)

From my hands-on perspective ideal F# API that is based on C# experience should the following properties:

  1. Simpler or easier to use than C# API
  2. Same (or better) performance that C# API
  3. Same (or better) functionality set

So I look at different options with regard to those properties:

  • Simple versions (randomShuffle etc) 1. + 2. A bit worse than C# 3. Random characteristics will be worse than with Random.Shared
  • .By version (randomShuffleBy etc) 1. A bit harder than C# (passing delegate is harder to comprehend than passing object) 2. Worse than C# (without hacks, with hacks still a bit worse) 3. +
  • .With version (randomShuffleWith etc) 1. + 2. + 3. +

Based on above I would use the .With option in my work project (with Random.Shared) it was available. If it's not available, I'll have to add one more method to my own collection of missing language functions (near isNotNull) in all projects that need it. Also, if it's not available, what will we recommend to use by default in production? Will we tell to use this snippet:

Array.randomShuffleBy (fun () -> Random.Shared.NextDouble ()) TestData.arr

This would raise a thought in my mind - "They had a chance to make it nice after 20 years, but failed it".

Ultimately the .With option would be not needed it Fsharp.Core targeted .NET 8, because the implementation could just be based on Random.Shared (which is by far the main case for .With option). But when (or if) this transition happens is unknown to me, so I'd keep it. Additional benefit is that there are many alternative implementations (like here) that all inherit System.Random and user can easily leverage one of them using .With versions without sacrificing performance.

@krauthaufen
Copy link

Just a thought: How exactly is randomShuffleBy any different from sortBy?

Signature-wise it seems to be identical and also the semantics seem to be the same.

@Lanayx
Copy link
Contributor Author

Lanayx commented Jun 13, 2024

Just a thought: How exactly is randomShuffleBy any different from sortBy?

Shuffle algorithmic complexity is O(N)

@krauthaufen
Copy link

krauthaufen commented Jun 13, 2024

So the generated double is actually interpreted as an index for classic swap-operations like this?

If so, i think it's very odd (what happens when i return NaN or simply 3.0, etc)

@Lanayx
Copy link
Contributor Author

Lanayx commented Jun 13, 2024

So the generated double is actually interpreted as an index for classic swap-operations like this?

Right

If so, i think it's very odd (what happens when i return NaN or simply 3.0, etc)

It's covered in the RFC, the ArgumentOutOfRange exception will be thrown
BTW, it's a good case with NaN, i've missed it in the implementation, will need to fix this case

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants