
Weird behavior upon increasing input variable. (Illegal memory access) #1953

Closed
Ritobusk opened this issue May 28, 2023 · 5 comments · Fixed by #1954
Comments

@Ritobusk

Hello.

(I am assuming the people fixing issues have access to the futharkhpa03fl server, since the program uses a lot of memory!) I get the following error:

./knn-driver: knn-driver.c:10583: CUDA call
  cuLaunchKernel(ctx->program->gpu_map_transpose_f32, grid[0], grid[1], grid[2], 32, 8, 1, 0 + (shared_sizze_66634 + (8 - shared_sizze_66634 % 8) % 8), ((void *)0), kernel_args_66631, ((void *)0))
failed with error code 700 (an illegal memory access was encountered)

This happens when I increase either of the input variables k or d. E.g., with k = 60 it works as intended, but increasing k to 120 triggers the error.

The machine I used was futharkhpa03fl with Futhark version 0.22.4 (the default on the machine) and 0.25 with the following git ID: git: 761c9de (Wed May 24 21:20:42 2023 +0200).
Both produced the same error.
My program uses a lot of memory, so futharkhpa03fl, or another machine with a GPU with a lot of memory, should probably be used to run it!
I attached the code. There are two versions, since the way unflatten is called changed in the Futhark 0.25 version I used; otherwise the two zip files are identical.
It can be run without issue with the following commands:

unzip IssueCodeV0.22.4.zip

cd IssueCode/

futhark cuda knn-driver.fut

futhark dataset -b -g 1i32 -g 60i64 -g [500000][50]f32 -g [500000][50]f32 > tmpdata

./knn-driver -t /dev/stdout -n < tmpdata >> res.txt

This should work (and take a little over 10 seconds). To reproduce the bug, change the dataset to:

futhark dataset -b -g 1i32 -g 120i64 -g [500000][50]f32 -g [500000][50]f32 > tmpdata

and run the program as before:

./knn-driver -t /dev/stdout -n < tmpdata >> res.txt
Here k (the 2nd value of the dataset) was changed. If you instead change d (the dimensionality of the 3rd and 4th dataset values) but keep k at 60, you should get a similar error, e.g. with this dataset:

futhark dataset -b -g 1i32 -g 60i64 -g [500000][150]f32 -g [500000][150]f32 > tmpdata

The error occurs in this part of the code (knn-driver.fut, lines 68-82):

  let supercharging = 
    ----let ksqr = k**2
    let (super_knn_inds_seq) =
      loop (curr_knns_q) = (new_knns_q) for i < k do
        let k_inds = map (\knn_ind_q -> map (\q -> knn_inds_t[knn_ind_q[i],q]) (iota k)) knn_inds_q
        let k_points = map (\inds -> map (\ind -> 
                                        let test_point = map (\k_ind -> test_set[ind, k_ind]) (iota d)
                                        in (ind, test_point)
                                        ) inds) k_inds
                       --map (\inds -> map (\ind -> (ind, test_set[ind])) inds) k_inds
        in  map2 (\refs ind -> bruteForce queries[ind] curr_knns_q[ind] refs ) k_points (iota n)
    in super_knn_inds_seq

  let (super_knn_inds, _) =  unzip <| map (\i_knn -> unzip i_knn) supercharging
  in (super_knn_inds)

I tried running the program without this part, and it ran without errors.
Increasing either d or k increases the amount of memory needed. However, with the datasets above, the program does not fail because of allocating too much memory. That can happen if another process is run on futharkhpa03fl while the program is running, or if k, d, and the number of points in the 3rd and 4th dataset values are sufficiently large.

IssueCodeV0.25.zip
IssueCodeV0.22.4.zip

@athas
Member

athas commented May 28, 2023

Does it work with the non-GPU backends?

@Ritobusk
Author

Ritobusk commented May 28, 2023

I have not tried with the datasets above, sorry; I thought it would take too long. I have tried smaller datasets with the c backend, and those work as intended.

The dataset:
futhark dataset -b -g 1i32 -g 120i64 -g [100000][50]f32 -g [10000][50]f32 > tmpdata
takes around 142 seconds with the c backend. If the program scaled the way it does with the GPU backend, the dataset above (futhark dataset -b -g 1i32 -g 120i64 -g [500000][50]f32 -g [500000][50]f32 > tmpdata) would take around 10 hours.

I'll try to see if it scales in a similar fashion with the c backend!

To be clearer: the code works as intended up to a certain input size with the cuda backend. As the inputs grow, I would eventually expect it to fail by running out of memory, but before that happens, this 'weird' bug occurs.

Edit: it scales better! I think it will be finished within 1 hour.

@Ritobusk
Author

Ritobusk commented May 28, 2023

It runs without error with the c backend on the dataset
futhark dataset -b -g 1i32 -g 120i64 -g [500000][50]f32 -g [500000][50]f32 > tmpdata
This is the same dataset that produces the error with the cuda backend, as mentioned above!

@athas
Member

athas commented May 29, 2023

The issue arises during transposition of an array of shape [50][60000000]. Note that the total number of elements does not fit in a single 32-bit integer. (See @coancea, it's beginning to happen in real code!)

I have a pretty good guess at what is going wrong now. I just hope fixing it won't cause performance regressions for transposing smaller arrays. I'd really prefer not to have multi-versioned transpose kernels. (Or rather, even more versions, as we already have three for every primitive type.)

athas added a commit that referenced this issue May 29, 2023
Unfortunately, using 64 bit index arithmetic in all cases results in
unacceptable performance for small arrays, so we need this bit of
multi-versioning.  Fortunately the transpose kernels are not large.

Fixes #1953.
athas added a commit that referenced this issue May 30, 2023
@Ritobusk
Author

Thank you for solving the issue! It now works on my end.

razetime pushed a commit to razetime/futhark that referenced this issue Jul 16, 2023