
Convert Futhark to 64-bit #134

Closed
athas opened this issue Mar 19, 2016 · 29 comments · Fixed by #1124
athas (Member) commented Mar 19, 2016

The Futhark compiler is presently tied heavily to the 32-bit world. Not only are all dimension sizes and loop indices 32-bit (signed!) integers, but the generated code also has these assumptions. Clearly this is not going to fly in the long run. I don't want to mix 32-bit and 64-bit sizes, so we should just move everything to 64-bit, always. Should the int type also default to 64-bit? I think that might be a little confusing.

I did a few experiments, and 64-bit integer arithmetic does not seem to be noticeably slower on the GPU, and we rarely have arrays of sizes, so programs should not end up using more storage.

athas (Member Author) commented Mar 23, 2016

Cosmin made a good point that we should also change iota to return 64-bit integers, because it is often used for generating indices. I must admit I am a little uneasy about making iota return a different type than the default int (which will likely remain an alias for i32).

athas (Member Author) commented Sep 29, 2016

This has some uncomfortable consequences. This program becomes invalid:


fun main(i: int, bs: [n]bool): bool =
  i >= n

Because n is now of type i64, while i is of type i32.

oleks (Member) commented Sep 29, 2016

Also beware of this stuff.

athas (Member Author) commented Sep 29, 2016

I am not sure you have read that paper.


athas (Member Author) commented Sep 29, 2016

Specifically, Futhark is a "safe" language, in that everything is bounds-checked. The issues raised in that paper are about implicit conversions and similar errors in a low-level language. I don't think it is relevant to us.

oleks (Member) commented Sep 29, 2016

Or thought about this issue.

athas added the compiler label Jan 16, 2017
athas (Member Author) commented Mar 19, 2017

This is becoming increasingly relevant and will have to be solved relatively soon. The compiler engineering part is straightforward enough; the big question is how the source language is affected. We will probably need @melsman's advice on language design here.

In Rust, all sizes are of a supposedly opaque size type. Under the covers, it is almost always a 64-bit integer, however. I think the intent is to make the programmer stop and think when he or she does size computation. Maybe that would be a good way to go.

athas added this to the Version 0.1 milestone Mar 23, 2017
athas (Member Author) commented Mar 25, 2017

I talked to @melsman about this and we decided to just make sizes of type i64. There is no reason for a size type.

However, there is one more problem that I forgot to bring up. Right now, the arguments to iota and replicate can be of any integral type. This means that the law shape (iota x) == [x] does not hold, because shape always returns an array of i32s (and i64s in the future), while x can be some other type. Do we care about this?

athas added a commit that referenced this issue Mar 25, 2017
athas (Member Author) commented Mar 25, 2017

Changing the type of dimension declarations from i32 to i64 makes 155 of our 641 test programs fail. Wonderful way to spend a weekend.

athas (Member Author) commented Mar 25, 2017

Oh, and we'll need to make some extensions to ScalExp handling and @coancea's algebraic simplification to handle 64-bit values. Possibly also propagate range information through type conversions. I begin to remember why I gave up last time I attempted this change.

athas (Member Author) commented Mar 26, 2017

I have modified enough of the compiler to translate OptionPricing. Unfortunately, we get about a 50% slowdown, likely because 64-bit operations are emulated on current GPUs (and take up more register space, too). The only things that are 64-bit are array shapes and index calculations, and I suspect the latter is what kills us. I'll suspend my efforts for now. It takes less than a day to convert the compiler to 64-bit, but the trick will be coming up with a technique that also makes it generate fast code.

One solution would be to put the burden on the programmer to indicate the type of dimension sizes of arrays. This feels very complicated and clunky, however. Another would be to use an opaque size type in the source language, which we can then translate as appropriate for the target hardware.

One thing is certain: we will have to come up with a fix if we want to scale to large multi-GPU/distributed programs. While we do support arrays taking up more than 4GiB of space, we cannot handle arrays with more than 2**31-1 elements. I have a nagging suspicion that such arrays will occur eventually.
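To make the limit concrete, here is a small C sketch (illustrative only, not Futhark compiler code): each dimension of a large 2-D array can fit comfortably in a signed 32-bit integer while the element count does not, so a 32-bit index cannot address all of it.

```c
#include <stdint.h>

/* Illustrative only: a 100000 x 50000 array has dimensions that each fit
   in i32, but 5e9 elements, which exceeds 2**31-1.  Computing the count
   in 64-bit arithmetic is exact. */
int64_t element_count(int64_t rows, int64_t cols) {
  return rows * cols;
}
```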

RasmusWL (Contributor) commented

I'm all in favour of introducing a size type. It feels like TheRightThingToDo™.

athas removed this from the Version 0.1 milestone Sep 14, 2017
athas (Member Author) commented Oct 10, 2018

This has turned up in the way we handle segmented/blocked operations like scan and reduce_by_index, where we first compute a flat index. This computation may overflow, even though the nested index will not.
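A minimal C sketch of this failure mode (names are illustrative, not the generated code): the segment number and the in-segment index each fit in i32, yet the flat index computed in 32-bit arithmetic wraps around.

```c
#include <stdint.h>

/* seg and i each fit in i32, but seg * seg_size + i does not. */
int64_t flat_index_i64(int64_t seg, int64_t seg_size, int64_t i) {
  return seg * seg_size + i;
}

/* The same computation in 32-bit arithmetic wraps modulo 2^32. */
int32_t flat_index_i32(int32_t seg, int32_t seg_size, int32_t i) {
  return (int32_t)((uint32_t)seg * (uint32_t)seg_size + (uint32_t)i);
}
```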

athas added a commit that referenced this issue Oct 10, 2018
This is related to #134.  I think we will just gradually move things
to 64-bit over time.
athas (Member Author) commented Jun 4, 2019

I have now encountered real programs that need to handle arrays with more than two billion elements. We need to address this.

athas added a commit that referenced this issue Jul 21, 2020
This is not a complete solution to #134, because individual array
dimensions must still fit in a signed 32-bit integer.  However, it
does allow the product of dimensions to be large.
athas added a commit that referenced this issue Jul 22, 2020
athas added a commit that referenced this issue Jul 24, 2020
athas added a commit that referenced this issue Jul 31, 2020
athas added a commit that referenced this issue Aug 14, 2020
athas added a commit that referenced this issue Aug 25, 2020
athas added a commit that referenced this issue Aug 31, 2020
athas added a commit that referenced this issue Aug 31, 2020
* Use 64-bit arithmetic for computing array offsets.

This is not a complete solution to #134, because individual array
dimensions must still fit in a signed 32-bit integer.  However, it
does allow the product of dimensions to be large.

* Use 64-bit arithmetic for computing thread IDs.

This allows kernels to have more than 2**32-1 virtual threads,
although the number of groups must still be less than this.

Note that this is not about *physical* threads, where having this many
would be pointlessly large, but about virtual threads.

* It is apparently better to use zero-extension here.

* More tags; just stop running this!

* This also needs 64-bit indexes.
athas (Member Author) commented Sep 9, 2020

Everyone I've talked to seems to think we should just do one massive compatibility break, and turn all size parameters (and functions like iota) into 64-bit versions, all at once. This will break pretty much every Futhark program, but is there a sensible alternative? At least the breakage will be pretty easy to fix.

athas (Member Author) commented Sep 10, 2020

On the RTX 2080 Ti, the difference is smaller, and in many cases negligible.

athas (Member Author) commented Sep 10, 2020

Impact on CPU performance appears fine. Actually, from what I can see it gets a little faster (5-10%). I hope that will also apply to the multicore backend.

athas (Member Author) commented Sep 10, 2020

Impact on LUD and SGEMM is negligible-to-zero, especially when using the CUDA backend. It really does seem CUDA behaves much better with 64-bit sizes. I am pleased that these relatively highly tuned benchmarks behave so well. So far, my analysis is that 64-bit sizes are primarily detrimental for map-reduce kernels with very large map kernels (like OptionPricing).

workingjubilee commented

Rust uses the usize/isize types partly because, as a strongly hardware-oriented abstract machine (in this sense "Rust" is "a thin layer on the metal"), it would like low-impedance abstractions over various ISAs, especially when handling raw pointers, and there are actually already 128-bit integer ISAs! They use 64-bit hardware in practice, of course; they just allow pointers to be as wide as 128 bits. So scaling from 16 bits to 64 or even 128 bits matters there.

I think that Futhark is fine selecting a 64-bit "pointer" type, however, if it thinks it will only see usage in that area.

athas (Member Author) commented Sep 11, 2020

Rust also needs to actually deal directly with object addresses ("pointers"). In Futhark, that is not exposed to programmers - only the sizes of objects, and offsets within them. Even with our current 32-bit indexes, a Futhark program can still allocate far more than 4GiB of memory, since the pointers that operate as array offsets are always whatever the target machine uses.

I have measured the impact of 64-bit sizes on our microbenchmarks, and as expected, there is very little difference. The one weird exception is regular segmented scans, which shows a 30% slowdown. I have not looked at the code yet, but I suspect it's something else that breaks, or an optimisation that no longer applies.

athas (Member Author) commented Sep 11, 2020

The segmented scans are expensive because of the frequent check for whether we are crossing a segment, which involves computing a 64-bit remainder. I think we can improve on that, but I will leave it for later.
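As a rough C sketch (illustrative names, not the generated code), the per-element boundary test involves a 64-bit remainder, whereas a carried counter can detect the same boundary with only an add and a compare, which is the kind of improvement hinted at above.

```c
#include <stdint.h>
#include <stdbool.h>

/* Remainder-based test: one 64-bit % per element. */
bool crosses_segment_rem(int64_t flat_i, int64_t segment_size) {
  return flat_i % segment_size == 0;
}

/* Counter-based alternative: carry the position within the current
   segment and reset it at the boundary; no division needed. */
typedef struct { int64_t pos; int64_t segment_size; } seg_cursor;

bool crosses_segment_cheap(seg_cursor *c) {
  bool crossing = c->pos == 0;
  c->pos = (c->pos + 1 == c->segment_size) ? 0 : c->pos + 1;
  return crossing;
}
```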

workingjubilee commented Sep 11, 2020

Yeah. With Rust, you're doing a lot of raw byte-bashing, so offsets come up very often, but I imagine Futhark code is really not doing as much raw byte-bashing.

C++ has even come to consider the general use of unsigned integers a mistake: the authors of the STL (including Bjarne Stroustrup) have said they would have preferred indexing to use signed integers, because they are strongly of the opinion that unsigned integers should only be used for expressing raw bitfields, and an offset is not a bitfield. That makes sense for their case, because you often want to do pointer offset arithmetic.
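A classic C illustration of that point (not from the thread itself): iterating backwards works as expected with a signed 64-bit index, while the same loop written with an unsigned index would wrap from 0 to the maximum value and never terminate.

```c
#include <stdint.h>

/* With int64_t the loop terminates for any n >= 0.  Written with an
   unsigned index (e.g. size_t), `i >= 0` is always true: decrementing
   past 0 wraps to SIZE_MAX, which is the pitfall behind the preference
   for signed indices. */
int64_t sum_backwards(const int32_t *xs, int64_t n) {
  int64_t sum = 0;
  for (int64_t i = n - 1; i >= 0; i--)
    sum += xs[i];
  return sum;
}
```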

athas added a commit that referenced this issue Sep 17, 2020
athas added a commit that referenced this issue Sep 17, 2020
This has wide-ranging implications for the types of things in the prelude:

* Functions like `replicate` and `iota` now take `i64` arguments.

* The `from_fraction` function now takes `i64`.

* The `to_i32` function has been removed.

Closes #134.
athas added a commit that referenced this issue Sep 17, 2020
athas added a commit that referenced this issue Sep 17, 2020
athas added a commit that referenced this issue Sep 17, 2020
athas added a commit that referenced this issue Sep 18, 2020
athas added a commit that referenced this issue Sep 18, 2020
athas added a commit that referenced this issue Sep 18, 2020
athas added a commit that referenced this issue Sep 18, 2020
athas added a commit that referenced this issue Sep 19, 2020
athas added a commit that referenced this issue Sep 19, 2020
athas added a commit that referenced this issue Sep 29, 2020
athas added a commit that referenced this issue Oct 5, 2020
athas added a commit that referenced this issue Oct 5, 2020
athas added a commit that referenced this issue Oct 5, 2020
athas added a commit that referenced this issue Oct 5, 2020
athas added a commit that referenced this issue Oct 5, 2020
athas added a commit that referenced this issue Oct 6, 2020
athas added a commit that referenced this issue Oct 7, 2020
athas added a commit that referenced this issue Oct 7, 2020
athas added a commit that referenced this issue Nov 12, 2020
* More Steve-friendly.

* Fix usage text.

* This is more type-safe.

* Detect more complex invariant loop parameters. (#1111)

Closes #1110.

* Detect bad entry points names early.

* Oops.

* Clearer error message.

* Style fix.

* Fix expected error.

* Fix #1112.

* Handle another obscure tiling case.

* Parallelise tiling across entry points.

* Test that this tiles.

* Remove dead reference.

* Use cache-oblivious transpose on CPU. (#1113)

This is a good bit faster in many cases, but a bit slower for small
arrays.  Maybe we can special case those later.

* This linker flag is needed.

* CUDA backend can now be called from multiple threads. (#1114)

Closes #1077.

* More principled use of phantom-typed PrimExp.

* Clean up implementation.

* Generalise the optimisation of concatenations. (#1116)

* Generalise the optimisation of concatenations.

* Add small hack to avoid (or delay...) code explosion.

* Use const pointers in array creation functions.

* Avoid some unused-parameter warnings.

* Avoid more untyped operations.

* Ignore parameters in a smarter way.

* Style fix.

* Use style-check.sh in precommit hook.

* Introduce typed variables in code generator (#1119)

These will help us keep our types straight, hopefully.

* Fix synchronisation bug.

* delete gitattributes

* Fix this description.

* Remove instance with nonobvious type.

* futhark-benchmarks: bump

* Do not tolerate warnings in benchmarks.

* Permit array indexing with any integer type. (#1123)

Also changes the rules for warning about type defaulting, so that only
types that propagate to the top-level binding are warned about.
Otherwise we would get an ambiguity warning for every instance of
`x[0]`.

Closes #1122.

* Ban unsigned ranges. (#1125)

This has various implications, such as removing u8.iota and similar.

The point is to simplify size handling in preparation for 64-bit
sizes, and the optimisations they will need.

* Better loop simplification for non-i32 loops.

* Fix #1126. (#1127)

* Clarify some type restrictions.

* Actually enforce this restriction.

* Relax these constraints.

* An allocation is a priori considered for hoisting.

* Remove certificates on safe statements.

* Revert "Remove certificates on safe statements."

This reverts commit 98db1fd.

Turns out this broke user-provided assertions.

* Remove certificates on some allocations.

* Build and upload nightly tarball on macOS. (#1130)

* Fix typo.

* Fix the typo again.

* Better opencl commands (#1131)

* Add --list-devices flag to opencl executables

This convenience feature lists the current devices and platforms on the system,
showing what to choose between using the `-p` and `-d` flags.

* Add --help command and usage in C-like compiled futhark programs

* Add --help command to man and usage pages

* Shorter help messages

* Add changelog entry

* Also run these with oclgrind.

* Hack around the local memory problem.

* futhark-benchmarks: bump

* This is 0.17.2.

* Onwards!

* Releases must be on master.

* Revert "Releases must be on master."

This reverts commit 4e02f90.

As usual, CI services have terrible documentation with no semantics.

* Try to restrict releases to master, again.

* Fix #1133.

* Better work queue in pmapIO.

* Fix prelude doc link (#1136)

* Fix NaN comparisons yet again.

* Fix action name and description.

* futhark-benchmarks: bump

* Fix a sum type corner case.

I hoped this would also fix #1139, but it did not.

* Add missing case for sum types.

* Add Bifunctor instance to TypeBase.

* Slightly more information in this internal error message.

* Consistently use the variable-free type here.

* Fix #1139.

* Better context information when type-checking If.

* Fix #1142.

* Fix #1143.

* Remember to zero-initialise this.

* This is 0.17.3.

* Onwards!

* Merge multicore back-end into master (#1146)

* Use strong compare_exchange

* Fixes for CAS SegHist

* don't start new task while in nested case

* Add name id to subtask struct

* Only take time if MCPROFILE is defined

* Make use of CAS swap too

* Use a faster rand num gen

* XXX

* We need direct execution to avoid too much overhead

* Bug fix for segHist

* Optimize code based on number of subtasks created

* Add name to subtask

* Check code body for possible imbalance

* Only generate 1 subtask when no free workers

* Support 64 bit integer CAS seghist

* Refactor some code

* Allocate cached intermediate arrays on stack

* This should not be a pointer

* Pass string for easier identification of generated code

* Choose histogram implementation based on condition

* Use lock-free deque

* Decide on sequential execution based on number of free workers

* This should not be a pointer

* Automatic granularity of dynamically scheduled tasks

* Override tid on steal

* This should be statically scheduling

* Remove now unused Code

* I don't need this anymore

* Uses own tid when subtask is chunkable

* Need this on Linux

* This should be up here

* Optimize segscan when op is on scalar values

* Remove unused code

* Add name identifier to task

* Start work on using timing

* Remove debug prints

* Consistently extract allocations

* Implement heartbeat style timing

* Improve timing

* Pass along the physical thread id

* Need to check for errors

* Use a Global variable for threads to exit

* Don't use the global var anyway

* Add tuning program

* Make use of kappa for dynamic scheduling

* Clean up code a bit

* Only create enough tasks based on minimum task overhead

* Clean up tuning program

* Implement a dynamic scheduling algorithm

* Try to steal from "main" queue first

* Reduce code duplication

* Need to break on successful steal

* More clean-up

* Remove unused stuff

* Remove unused code

* Better comments

* We don't use this in these cases

* I missed one

* More refinement

* Fix potential race condition

* Make tuning use separate threads

* Hack to avoid unsolvable deadlock

* Forgot to commit this

* This estimate is slightly better

* This should be zero

* This should be smaller or less than 1

* Measure timings inside of function bodies

* Remove redundant parenthesis

* Clean up timing program

* Remove redundant initialization

* Hack for avoiding division by zero in segreduce-iota

* Revert "Hack for avoiding division by zero in segreduce-iota"

This reverts commit c313ca9.

* Merge master into multicore.

* Do not perform flattening in multicore pipeline.

Instead we depend on sequentialisation to generate efficient code.

* Revert "Merge master into multicore."

* Revert "Revert "Merge master into multicore.""

* Use 64-bit for intermediate indices

* Better naming for chunks

* Only use 64-bit in cases where we compute the product of dimensions

* Make variables more readable

* Update tuning program to follow similar approach to heartbeat

* Let's try this automatic process

* Use simpler stop condition

* Use a more appropriate default value

* Add an AtomicXchg operation

* I forgot an _n

* Let's try the old queue again

* Implement half work-stealing

* Steal from the front

* Adapt tuning to new queue

* Disable auto tuning

* Change ints to int64_t

* This is not a float

* Let random number be unsigned

* Add some debug flags for later

* This looks prettier

* Measure time properly for nested parallelism

* This needs a fence now

* Measure time

* Modify to tuning again

* Modify tuning program

* This should finally work

* Prettier functions

* Should initialize this

* Oops

* Remove unused field

* This should be int64_t

* Clean up

* Setup for easy switching between queues

* Better error handling

* Fix typo

* Do not naively lift allocations out of loops.

* Call this a parloop

* Renaming

* Clean up

* Use a higher res clock (if available)

* Update tuning program too

* Stolen tasks are executed immediately

* Use threshold to select between segHist versions

* No need to measure sequential runs

* Show in us instead of ns

* Use half-work stealing with chaselev deque too

* Remove debug statements

* This should be int64_t

* Wake up threads when there is work with chaselev

* Need a fence here, just in case

* Steal from queue 0 first, else try random queue

* That was dumb

* Don't try to steal if there is no active work

* Use jobqueue again

* Accidentally swapped these

* This should be based on the number of subhistos

* Cast these to int64

* Simplify seghist

* Fix compile error

* try to use local variables when possible (WIP)

* Just hack with shape for now

* Don't run sequentially

* Fix potential deadlock

* Fix for more reduce cases

* Let's try to use nested ops too

* Fix for missing variable declaration for nested op

* Let's just avoid stack allocations

* Ok let's not

Revert "Let's just avoid stack allocations"

This reverts commit 9e80aa3.

* Only wake up threads when using the nested function

* Use the actual number of subtasks created to decide if task should be sequential

* Revert "Use the actual number of subtasks created to decide if task should be sequential"

This reverts commit 6961e61.

* Revert "Only wake up threads when using the nested function"

This reverts commit db8c704.

* Add field to wake up threads

* Need to load this

* Oops

* Clean up

* Use exact same process as in paper

* Use hardware cycle counter

* Remember to convert to ns

* I hope this works on linux

* Remove void

* Not consistent to use cycle counters

* Apply Ormolu.

* Reduce duplication.

* Test multicore on CI.

* Cleanup.

* Restore nice Dev module.

* These tests are now in the attributes/ subdir.

* Run CI on multicore branch, maybe.

* Strangle some warnings.

* Clean up code

* Prettier clamping of the number of subtasks

* Simplify

* Remove dead code

* Clean up more

* Avoid deadlocking in case of errors

* Add comments

* Simplify

* Clean up and add more comments

* Properly measure time working by each thread

* Only output thread usage if profiling

* I forgot this one

* Remove dead code

* Fix function args for deque destroy for chase-lev

* Only use the nested function when we don't have enough work

* Vectorise SegHist operators.

* Also do double-buffering in multicore backend.

* Measure time for sequential execution too

* Give multicore -P option a description.

* Eliminate DeclareStackMem.

* Improve SegRed with vectorised operators.

* Include possible allocation in prebody too

* Remove unused variable.

* Sequentialisation of histograms in multicore pipeline.

* Avoid division by zero like this instead

* Small fixes

* rename task_fn -> segop_fn

* Avoid using a shared accumulator array for scan

* Remove unused code

* Rename task to segop

* Use task-local small histograms.

* Clean up code generation a bit

* Remove chase-lev deque from multicore

use multicore-deque instead

* more cleanup

* More clean-up

* No need for this anymore.

* Apply Ormolu.

* futhark-benchmarks: bump

* Strangle some warnings.

* Propagate errors in multicore backend.

* Make the current thread the first worker when entering entry point.

* Wait for all subtasks to finish before propagating error.

* Duc says it is better to free last.

Co-authored-by: Troels Henriksen <athas@sigkill.dk>

* Strangle more warnings.

* Manpage for multicore backend.

* multicore does not work on Windows.

* Remove unused code.

* Make all sizes of type `i64`. (#1124)

This has wide-ranging implications for the types of things in the prelude:

* Functions like `replicate` and `iota` now take `i64` arguments.

* The `from_fraction` function now takes `i64`.

* The `to_i32` function has been removed.

Closes #134.

* futhark-benchmarks: bump

* These are 64-bit.

* futhark-benchmarks: bump

* Move a division out of the histogram kernel.

* Move more 64-bit divisions out of histogram kernels.

* Use $GITHUB_PATH instead of add-path (#1151)

The `add-path` method of adding stuff to $PATH has been deprecated.

More info:

https://github.blog/changelog/2020-10-01-github-actions-deprecating-set-env-and-add-path-commands/
https://docs.github.com/en/free-pro-team@latest/actions/reference/workflow-commands-for-github-actions#environment-files
https://docs.github.com/en/free-pro-team@latest/actions/reference/workflow-commands-for-github-actions#adding-a-system-path

* Fix input type.

* Ask for binary output when testing.

* Let's do both because apparently Ubuntu is inconsistent. (#1152)

* This is 0.18.1.

* Onwards!

* The macOS build must be done before we can deploy.

* Fix a 32-bit leftover.

* Document overloading subtlety (#1153).

* Clean up README more.

* Also run pyopencl backend on GA. (#1154)

* Report warnings even when type errors occur.

Closes #1153.

Closes #1155.

* Eliminate more 32-bit artifacts from code generation.

Most importantly, this lets the multicore backend handle more than
2**31-1 iterations per task.

* SizeOf is a 64-bit expression.

* Another 32/64-bit fix.

* Add some documentation for Tools (#1161)

* Add some documentation for tools

* Fix rendering issues

* No header guards in embedded RTS code.

* Move scheduler_common.h into scheduler.h.

* One global variable down (#1157).

* Make kappa scheduler-local.

* Rework self-tuning a little bit, does not appear to work.

* Add tests for matrix multiplication with register tiling; should be run with CUDA or OpenCL versions, but Cosmin does not know how to specify that inside the source file

* Combine scheduler.h and scheduler_tune.h and fix kappa-tuning.

* Silence warning about potential uninitialised variable.

* Benchmarks do not belong in the test suite.

* Further cleanup in scheduler implementation.

* These should be static.

* Combine all multicore headers into scheduler.h.

* Centralise scheduler initialisation.

* Stop using mutable global variables in multicore scheduler.

We still use a single thread-local variable to find the worker struct
for a thread, but that is harmless, as it does not prevent multiple
contexts from co-existing (they will have their own threads).

Closes #1157.

* Close #1162.

* Fix error message generation for multicore backend.

* Fix another 32-bit leftover.

* Restore newline after warnings.

* Fix exit code on bugs and limitations.

* Also do variable substitution inside Ops.

* Look properly for variant allocations deep in kernels.

* Look across loops when spelunking for parallelism.

* Implement partial tiling. (#1163)

Closes #1145.

* Print newline after warnings.

* Switch to newer Nixpkgs and GHC. (#1166)

* Fix RST syntax error.

* Use the newest version of 'versions'. (#1165)

* Fix multicore histograms with empty inputs.

* Not a bug.

* More descriptive internal names.

* Fix #1168.

* Fix #1169.

It's a bit ad-hoc that we just lock here.  This should use the
criticalSection abstraction, but that's internal to GenericC.  This is
good enough for now, but if we ever do more complex entry/exit
operations, this will need refactoring.

* futhark dataset now more accepting of piping into something dead.

* Freeing an opaque is a critical section (#1169).

* This does not need arguments.

* Better type error for #1171.

* Fix #1173.

* Fix #1174.

* This was hard, so it deserves a mention.

* Eliminate fishy instances.

* Use dedicated datatype for pattern literals.

* Fix #1134. (#1178)

Our counterexamples for missing matches are now slightly worse, but at
least we detect them properly (I hope!).

* Fix #1177.

* This is 0.18.2.

* Onwards!

* Polish some docs.

* Fix #1180.

* Add error handling for bad file paths (#1181)

* Add error handling for bad file paths

* Catch error, instead of check

* Fix toctou issue

* datacmp: simplify error handling.

* Oops, avoid deadlock.

* This is 0.18.3.

* Onwards!

* Fix style violation

Co-authored-by: Philip Lassen <philiplassen+git@gmail.com>
Co-authored-by: Philip Munksgaard <philip@munksgaard.me>
Co-authored-by: Ryan Huang <NPN@users.noreply.github.com>
Co-authored-by: Minh Duc Tran <minhtran1391@gmail.com>
Co-authored-by: Cosmin Oancea <cosmin.oancea@diku.dk>