Implement cuda kernels for unary ops #341
Comments
Currently the test for

I fixed it by changing line 27 in

```diff
- float dx = inp[i] == 0.0 ? 0.0 : (signbit(inp[i]) ? 1.0 : -1.0);
+ float dx = inp[i] == 0.0 ? 0.0 : (signbit(inp[i]) ? -1.0 : 1.0);
```

You could also write it a bit less confusingly like this:

```c
float dx = inp[i] == 0.0 ? 0.0 : copysignf(1.0, inp[i]);
```
If I understand correctly, the current kernels should be written for single-precision floating-point numbers. Here's a list of the available functions: Single Precision Mathematical Functions. What are your thoughts on using intrinsics?
Oooh, good catch about the single precision stuff, I had no idea, so glad y'all know 😀 What is the difference between intrinsics and the non-intrinsics?
… and #334 (#346)

* Add cuda implementations for unary and binary tensor operations
* Add cuda kernel for powi; use fewer 64-bit functions
* Use copysign in abs kernel, as suggested in #341 (comment)
Thanks for the PR @nkoppel
Intrinsics are faster than their non-intrinsic counterparts, at the cost of lower precision. Here is documentation for the standard mathematical functions, and here are their intrinsic counterparts. Note that ULP error stands for units in the last place. It's probably best to avoid intrinsics for now, because we can't guarantee that they will be worth it for every use case, and because these functions usually take only a small fraction of the runtime anyway.
Makes sense, it also seems like the
It may be best to expose
One big downside of intrinsics, other than reduced accuracy, is that they might handle special values (NaNs, infs, subnormals, etc.) differently. I think it definitely should not be enabled by default, but instead put behind a feature flag.
These can mirror abs & exp which are already implemented: