
Conversation

@willow-ahrens
Member

fixes #488

@willow-ahrens willow-ahrens enabled auto-merge April 8, 2024 14:54
@willow-ahrens
Member Author

It looks like fsprand is erroring due to integer overflow on 32-bit systems with this test case.

@willow-ahrens
Member Author

This is due to randsubseq, not Finch. We need to implement our own randsubseq or improve the existing implementation.

@mtsokol
Member

mtsokol commented Apr 9, 2024

It looks like it's strictly due to 32-bit integers being the default on Windows.

It seems that length(A) overflows and returns -727379968. But then, isn't that a bug in Julia?

length(CartesianIndices((1000000, 1000000)))

Gives -727379968 as a 32-bit integer on Windows. But I think it should anticipate which size to use and return a 64-bit integer.
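For reference, that value is exactly 10^12 wrapped to 32 bits, which is easy to reproduce even on 64-bit Julia:

julia> 1_000_000 * 1_000_000            # the true element count, 10^12
1000000000000

julia> (1_000_000 * 1_000_000) % Int32  # the same value wrapped to 32 bits
-727379968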

@willow-ahrens
Member Author

willow-ahrens commented Apr 9, 2024

It's true that there's an overflow here, but it looks like it's on us to implement a version of the algorithm which uses the correct number types and does not overflow. The issue is basically that we want to randomly sample from (1:int_max, 1:int_max), but we cast this as a CartesianIndices object, which gets treated like an array. So we create an array-like object with int_max^2 elements, but we access it with an Int. I think we need to reimplement randsubseq to do this correctly, or check whether someone else has done this already. Unfortunately, I think the algorithms in StatsBase.jl also have this same problem, but we can totally copy their code and repurpose it.
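Roughly the pattern in question (a sketch, not the exact fsprand internals): randsubseq only visits about p * length(A) elements thanks to geometric skips, but it still relies on length(A) and linear Int indexing, which is exactly what overflows here.

using Random

# Index space with 10^12 elements; expected ~100 draws at p = 1e-10.
# Fine on 64-bit Julia, but length(I) overflows on a 32-bit build.
I = CartesianIndices((1_000_000, 1_000_000))
coords = randsubseq(Random.default_rng(), I, 1.0e-10)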

@willow-ahrens
Member Author

The main challenge is that all of these algorithms assume the entire collection we are sampling from is representable in memory.

@mtsokol mtsokol disabled auto-merge April 10, 2024 10:27
@mtsokol mtsokol added the bug label Apr 10, 2024
@mtsokol mtsokol force-pushed the wma/fix-488 branch 2 times, most recently from 47d5239 to caff22b on April 10, 2024 11:35
@mtsokol
Member

mtsokol commented Apr 10, 2024

I added a custom implementation to make sure index variables are Int64 for now. Weirdly, it fails with:

BoundsError: attempt to access 1000000×1000000 CartesianIndices{2, Tuple{Base.OneTo{Int32}, Base.OneTo{Int32}}} at index [38056]

which doesn't really make any sense, considering that 38056 fits in Int32.

@mtsokol
Member

mtsokol commented Apr 10, 2024

I received feedback on Julia Discourse. The issue is that indexing internally still checks the index against the length (which overflows at 32 bits), so it looks like it's going to be problematic on 32-bit systems anyway.

@mtsokol
Member

mtsokol commented Apr 10, 2024

One option is to build the index ourselves. In the line where the error occurs:

push!(S, A[i += ceil(Int64, s)])

we could instead do something like:

i += ceil(Int64, s)
j = i - 1                         # zero-based offset for the div/mod walk
subs = Vector{Int}(undef, N)
for (idx, dim) in enumerate(dimensions)
    subs[idx] = j % dim + 1       # back to a one-based subscript
    j = div(j, dim)
end
push!(S, A[subs...])

This way we can index A with a tuple of subscripts instead of a linearized index that can exceed 32 bits.
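As a self-contained sketch of that idea (a hypothetical helper, not the code that ended up in this PR), the conversion can be checked against CartesianIndices on a small shape where nothing overflows:

# Hypothetical helper: split a 1-based Int64 linear index into per-dimension
# subscripts without ever computing the full (possibly overflowing) length.
function linear_to_subs(i::Integer, dims::Dims{N}) where {N}
    j = Int64(i) - 1
    subs = Vector{Int}(undef, N)
    for d in 1:N
        subs[d] = j % dims[d] + 1
        j = div(j, dims[d])
    end
    return Tuple(subs)
end

# Agrees with column-major linear indexing on a small, safe shape:
@assert linear_to_subs(38056, (1000, 1000)) == Tuple(CartesianIndices((1000, 1000))[38056])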

@willow-ahrens
Member Author

willow-ahrens commented Apr 10, 2024

We may want to move towards a custom implementation for tuples of ints, like you've sketched here.

@willow-ahrens
Member Author

willow-ahrens commented Apr 10, 2024

I'm also happy to implement this, if you like.

@mtsokol
Member

mtsokol commented Apr 10, 2024

Let me take care of this! It's a small tweak to the existing implementation.

@willow-ahrens
Member Author

willow-ahrens commented Apr 10, 2024

I'm concerned we need to use a different algorithm. We can't just call randexp and convert to an integer; I'm not even sure that's correct for Int64. I don't think we can expect it to perform well after the float mantissa loses precision: around $n = 2^{53}$, this algorithm is only going to select $i$ equal to multiples of 2, then 4, 8, etc.
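A quick illustration of the precision cliff (just the Float64 spacing, not the sampling algorithm itself): above $2^{53}$, consecutive integers are no longer representable, so ceil(Int, x) can only land on every second integer, then every fourth, and so on.

julia> x = 2.0^53
9.007199254740992e15

julia> x + 1 == x            # 2^53 + 1 is not representable as a Float64
true

julia> nextfloat(x) - x      # adjacent floats are already 2 apart
2.0

julia> nextfloat(2.0^54) - 2.0^54   # and 4 apart past 2^54
4.0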

@mtsokol
Member

mtsokol commented Apr 10, 2024

FYI, the original implementation does:

s = randexp(r) * L
s >= n - i && return S
push!(S, A[i += ceil(Int,s)])

@willow-ahrens
Member Author

Yes, I don't think the original algorithm is correct for sparse arrays with large dimensions. The problem is that the function f(x) = ceil(Int, x::Float64) is not actually capable of producing all possible integers.

@willow-ahrens
Member Author

willow-ahrens commented Apr 10, 2024

We might want to use something like https://github.com/JuliaStats/StatsBase.jl/blob/60fb5cd400c31d75efd5cdb7e4edd5088d4b1229/src/sampling.jl#L256C1-L278C44. This would allow us to customize the simple algorithm for tuples of ints, with correct behavior even for fsprand(typemax(Int64), typemax(Int64), (10.0/typemax(Int64))/typemax(Int64)).

The above algorithm is for sampling k numbers without replacement. However, we still need to accurately sample the number of nonzeros to generate. We need a binomial distribution, but we need to be careful with how we call that too, since we want to sample from way more than n.

However, it's debatable whether we should implement fsprand(n, m, p) at all when n * m overflows, because even though we expect n * m * p nnzs, we aren't prepared for n * m nnzs, which could be a possibility, especially on 32-bit systems.

We may instead want to introduce a variant of fsprand that allows us to specify the exact number of nonzeros, then use the sample-$k$-without-replacement algorithm linked above.
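A rough sketch of that variant (hypothetical names, and a simple rejection loop rather than the exact StatsBase routine linked above): draw the requested number of distinct coordinate tuples directly, so no linearized index over the full shape is ever formed.

using Random

# Hypothetical sketch of an exact-nnz fsprand: sample `nnz` distinct coordinate
# tuples by rejection. Assumes nnz is tiny relative to the full index space,
# so collisions are rare and the loop terminates quickly.
function sample_distinct_coords(rng::AbstractRNG, dims::Dims{N}, nnz::Int) where {N}
    seen = Set{NTuple{N,Int}}()
    while length(seen) < nnz
        push!(seen, ntuple(d -> rand(rng, 1:dims[d]), N))
    end
    return collect(seen)
end

# Works even when the linearized length would overflow:
sample_distinct_coords(Random.default_rng(), (typemax(Int64), typemax(Int64)), 10)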

@willow-ahrens
Member Author

willow-ahrens commented Apr 10, 2024

Here's a concrete example of an issue:

julia> using Finch
Precompiling Finch
  1 dependency successfully precompiled in 24 seconds. 13 already precompiled.

julia> t = 1000
1000

julia> countstored(fsprand(t, t, (100/t)/t))
108

julia> countstored(fsprand(t, t, (100/t)/t))
108

julia> countstored(fsprand(t, t, (100/t)/t))
99

julia> t = typemax(Int64)
9223372036854775807

julia> countstored(fsprand(t, t, (100/t)/t))
0

julia> countstored(fsprand(t, t, (100/t)/t))
0

julia> countstored(fsprand(t, t, (100/t)/t))
0

julia> t = 1_000_000
1000000

julia> countstored(fsprand(t, t, t, 100/t/t/t))
132

julia> countstored(fsprand(t, t, t, 100/t/t/t))
112

julia> countstored(fsprand(t, t, t, t, 100/t/t/t/t))
0

julia> countstored(fsprand(t, t, t, t, 100/t/t/t/t))
0

julia> countstored(fsprand(t, t, t, t, 100/t/t/t/t))
0
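One likely explanation for the zero counts in the huge cases (assuming the expected number of draws is still derived from the linearized length): the length itself wraps around, so p * length collapses to essentially zero.

julia> typemax(Int64) * typemax(Int64)   # wraps all the way around to 1
1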

@hameerabbasi
Collaborator

@willow-ahrens
Member Author

@hameerabbasi I don't think the algorithm you mentioned can handle this either:

willow@Willows-Mac-mini ~ % poetry show 
joblib        1.4.0  Lightweight pipelining with Python functions
llvmlite      0.42.0 lightweight wrapper around basic LLVM functionality
numba         0.59.1 compiling Python code using LLVM
numpy         1.26.4 Fundamental package for array computing in Python
scikit-learn  1.4.2  A set of python modules for machine learning and data mining
scipy         1.13.0 Fundamental algorithms for scientific computing in Python
sparse        0.15.1 Sparse n-dimensional arrays for the PyData ecosystem
threadpoolctl 3.4.0  threadpoolctl
willow@Willows-Mac-mini ~ % poetry run python3
Python 3.11.8 (main, Feb  6 2024, 21:21:21) [Clang 15.0.0 (clang-1500.1.0.2.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sparse
>>> sparse.random([1000000,1000000,1000000,1000000], nnz=4)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/willow/Library/Caches/pypoetry/virtualenvs/willow-ENU-A1do-py3.11/lib/python3.11/site-packages/sparse/_utils.py", line 340, in random
    ).reshape(shape)
      ^^^^^^^^^^^^^^
  File "/Users/willow/Library/Caches/pypoetry/virtualenvs/willow-ENU-A1do-py3.11/lib/python3.11/site-packages/sparse/_coo/core.py", line 1029, in reshape
    raise ValueError(f"cannot reshape array of size {self.size} into shape {shape}")
ValueError: cannot reshape array of size 2003764205206896640 into shape (1000000, 1000000, 1000000, 1000000)

@willow-ahrens
Member Author

Just fyi, that number is the overflowed version of 1000000^4

julia> t = 1000000
1000000

julia> t * t * t * t
2003764205206896640

@willow-ahrens
Member Author

willow-ahrens commented Apr 11, 2024

Does this look good to everyone? There's a lot of statistics here, so it would be nice to get a review.

@willow-ahrens willow-ahrens merged commit f8c7ae6 into main Apr 12, 2024
@willow-ahrens willow-ahrens deleted the wma/fix-488 branch April 12, 2024 15:49
@hameerabbasi
Collaborator

Just fyi, that number is the overflowed version of 1000000^4

julia> t = 1000000
1000000

julia> t * t * t * t
2003764205206896640

Yes, in sparse, we currently don't support arrays of size >= 2 ** 64.



Development

Successfully merging this pull request may close these issues.

should we support fsprand(n, m, p) when n * m > typemax(Int)?
