-
Notifications
You must be signed in to change notification settings - Fork 25
Wma/fix 488 #489
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Wma/fix 488 #489
Conversation
|
It looks like fsprand is erroring due to integer overflow on 32bit systems with this test case. |
|
this is due to randsubseq, not Finch. We need to implement our own randsubseq or improve the existing implementation |
|
It looks that it's strictly due to 32bit integer being a default one on Windows. It seems that length(CartesianIndices((1000000, 1000000)))Gives |
|
It's true that there's an overflow here, but it looks like it's on us to implement a version of the algorithm which uses the correct number types and does not overflow. The issue is basically that we want to randomly sample from |
|
the main challenge is that all of these algorithms assume the entire collection we are sampling from is representable in memory. |
47d5239 to
caff22b
Compare
|
I added custom implementation to make sure index variables are which doesn't really make any sense, considering that 38056 fits in Int32. |
|
I received a feedback on Julia Discourse. The issue here is that internally it still checks the index against the length (causing 32bit overflow), so it looks like it's going to be problematic on 32-bit anyway. |
|
One option is that we can build index ourselves, in the line where the error occurs: push!(S, A[i += ceil(Int64, s)])do instead something like: i += ceil(Int64, s)
i_tuple = Tuple{N}
for idx, dim in enumerate(dimensions):
i_tuple[idx] = i % dim
i = i // dim
end
push!(S, A[i_tuple...])This way we can index A with a tuple instead of linearized index that can exceed 32bit. |
|
We may want to move towards a custom implementation for tuples of ints, like you've sketched here |
|
Im also happy to implement this, if you like |
|
Let me take care of this! It's a small tweak to the existing implementation. |
|
I'm concerned we need to use a different algorithm. We can't just call |
|
FYI The original implementation does: s = randexp(r) * L
s >= n - i && return S
push!(S, A[i += ceil(Int,s)]) |
|
Yes, don't think the original algorithm is correct for sparse arrays with large dimension. The problem is that the function |
|
we might want to use something like https://github.com/JuliaStats/StatsBase.jl/blob/60fb5cd400c31d75efd5cdb7e4edd5088d4b1229/src/sampling.jl#L256C1-L278C44. This would allow us to customize the simple algorithm for tuples of ints, with correct behavior even for The above algorithm is for sampling k numbers without replacement. However, we still need to accurately sample the number of nonzeros to generate. We need a binomial distribution, but we need to be careful with how we call that too, since we want to sample from way more than n. However, it's debatable whether we should implement We may want to instead introduce a variant of |
|
Here's a concrete example of an issue: |
|
We have an implementation here if that helps: |
|
@hameerabbasi I don't think the algorithm you mentioned can handle this either: |
|
Just fyi, that number is the overflowed version of 1000000^4 |
|
Does this look good to everyone? There's a lot of statistics here so it would be nice to get a review. |
Yes, in |
fixes #488