ERM has a collection of ideas in it that might be foolish, crazy, or impossible to implement in hardware - or all three. It is, in many cases, a collection of oddball ideas that might or might not actually work in practice. Portions are completely unimplementable using the hardware design tools at hand - both in 1990, when I first started designing it, and now, in 2016, when the world has seemingly moved to massively distributed single clock domains on a very small scale, with wires being more of the problem than transistors at < 16nm.
There are also a few ideas that are not fully thought out. If I don’t at least write down the bits of the ideas as I have time, I’ll never get around to working on them more fully.
One of them is the idea of using lots and lots of very small but very wide cam (content-addressable memory) blocks, plus a lot of very small and very narrow ones, in addition to conventional flip flops.
Heat, generational memory and dram cams
ERM uses generational memory structures throughout, partially to handle nasty concurrency problems, and partially as an abstract way to “let die” data you don’t need anymore - and to cut down on heat. (In a software implementation, it also means swapping into the cache only the data and programs you need. Programs tend to throw away 80% of their data or more; in a data flow architecture you just keep actively throwing it away and keep “hot” only what you need. Actually, the sense of that is wrong - you “fling forward and out til it comes back” the data you are going to need, and let die the data you don’t.)
You can store data in the delay, just as we do today with protocols like TCP.
The generational state machine
Everything has essentially a minimum of a 4 state state machine in it and circulates through those states on a collaborative basis. I begrudge C for not having a 2 bit type.
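Lacking a 2 bit type, the usual workaround is to pack four 2-bit states into each byte. Here is a minimal sketch of that, assuming nothing about ERM's real code: the names `gen_get`, `gen_set` and `gen_advance` are made up for illustration, and the states are just numbered 0 to 3.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers, not ERM's actual code.
   Four states fit in 2 bits, so four of them fit in one byte. */
enum { GEN_STATES = 4, GEN_PER_BYTE = 4 };

/* Read the 2-bit state stored at index i. */
static unsigned gen_get(const uint8_t *packed, size_t i)
{
    unsigned shift = (unsigned)(i % GEN_PER_BYTE) * 2u;
    return (packed[i / GEN_PER_BYTE] >> shift) & 3u;
}

/* Overwrite the 2-bit state stored at index i, leaving neighbours alone. */
static void gen_set(uint8_t *packed, size_t i, unsigned state)
{
    assert(state < GEN_STATES);
    unsigned shift = (unsigned)(i % GEN_PER_BYTE) * 2u;
    uint8_t *byte = &packed[i / GEN_PER_BYTE];
    *byte = (uint8_t)((*byte & ~(3u << shift)) | (state << shift));
}

/* Circulate to the next state, wrapping from 3 back to 0. */
static unsigned gen_advance(uint8_t *packed, size_t i)
{
    unsigned next = (gen_get(packed, i) + 1u) & 3u;
    gen_set(packed, i, next);
    return next;
}
```

With a zeroed `uint8_t p[2]`, calling `gen_set(p, 5, 3)` touches only bits 2-3 of `p[1]`, and a following `gen_advance(p, 5)` wraps that slot back to state 0.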
One way to think of it is as the following four states:
The DRAM CAM
One of the old ideas in ERM is the idea of using DRAM based cam, which seemed to be a good idea (back in 2000). It had some nice properties - smaller, faster - and some bad ones (hotter, and needing a refresh). That concept appears to have died in the marketplace (and I don’t know why) - and cam memory designs are a high cost bit of IP that you seemingly have to buy differently for every architecture and that is not available on opencores. XilinxCAM does, at least, have an “enhanced ternary mode” - which basically reduces to some of the four valued logic I’d like to use throughout ERM. I find it ironic that folk still can’t think of it as 4 valued logic. Enhanced ternary mode. Sigh.
“Enhanced Ternary Mode: In this mode, bit X also matches either 1, 0, or X (1010 = 1X1X = 10XX) and is also referred to as a don’t care bit. Bit U does not match any of the four possible bit values 1, 0, X, or U, and is referred to as an unmatchable bit.”
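The matching rule in that quote is small enough to write down as plain four valued logic. This is a software model of the quoted rule only, not the Xilinx core's interface; `val4`, `val4_match` and `word_match` are names I made up for the sketch, and words are written as strings of `0`, `1`, `X`, `U` for readability.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One symbol of the four valued logic: 0, 1, X (don't care), U (unmatchable). */
typedef enum { V0, V1, VX, VU } val4;

/* Parse one of '0', '1', 'X', 'U' into a symbol. */
static val4 val4_from_char(char c)
{
    switch (c) {
    case '0': return V0;
    case '1': return V1;
    case 'X': return VX;
    default:  assert(c == 'U'); return VU;
    }
}

/* U matches nothing (not even U); X matches 0, 1 and X; otherwise exact. */
static bool val4_match(val4 a, val4 b)
{
    if (a == VU || b == VU) return false;
    if (a == VX || b == VX) return true;
    return a == b;
}

/* Compare two words symbol by symbol; words of different length never match. */
static bool word_match(const char *a, const char *b)
{
    size_t i = 0;
    for (; a[i] != '\0' && b[i] != '\0'; i++) {
        if (!val4_match(val4_from_char(a[i]), val4_from_char(b[i])))
            return false;
    }
    return a[i] == b[i]; /* both strings must end at the same place */
}
```

Under this model `word_match("1010", "1X1X")` and `word_match("1X1X", "10XX")` both hold, as in the quoted example, while a word containing a `U` matches nothing, itself included.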
Anyway - back to DRAM cams. On the refresh front, if you haven’t accessed the cam in the last x clock cycles, you aren’t using it anymore, so just let it die and don’t refresh it.
On the generational front, once you switch over to a new set of cams, the others can die - just get powered down, with no refresh either - until you need them again. You can select a larger or smaller set of cams for your next generation data set.
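The two “let it die” rules can be sketched as one decision made whenever a refresh would normally be due. This is only a toy software model under my own assumptions: the `cam_bank` struct, the `idle_limit` parameter (standing in for the “x clock cycles” above) and the function names are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping for one small cam; not a real hardware interface. */
typedef struct {
    uint64_t last_access; /* clock count at the most recent lookup */
    uint32_t generation;  /* generation this cam was built for */
    bool     alive;       /* false once the contents have been let die */
} cam_bank;

/* Record a lookup so the cam counts as recently used. */
static void cam_touch(cam_bank *b, uint64_t now)
{
    assert(b->alive);
    b->last_access = now;
}

/* Called when a refresh would be due. Returns true if the cam is refreshed,
   false if it is (or already was) left to die. */
static bool cam_refresh_tick(cam_bank *b, uint64_t now,
                             uint64_t idle_limit, uint32_t current_gen)
{
    if (!b->alive)
        return false;
    if (b->generation != current_gen || now - b->last_access > idle_limit) {
        b->alive = false; /* skip the refresh: contents decay, no power spent */
        return false;
    }
    return true;
}
```

A cam touched at clock 10 with an idle limit of 100 is still refreshed at clock 50, is dropped at clock 200, and a cam belonging to an older generation is dropped immediately.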
It’s certainly feasible to just use static cams. But they must DIE on a schedule to reduce heat generated by the architecture.
Writes on the Xilinx circuit above take 16 clocks, so “rebuilding a cam” takes a lot more time than accessing it (1 clock).
Another really core idea in ERM is the concept of using async circuit design throughout. You don’t care how long an operation takes, but you do want it to die on the first failure and return “don’t care”, (stopping power to all the extra circuits you would have used to complete that operation), stopping faster and cooler, and letting some other part of the chip “win” that result.
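In software terms, “die on the first failure” looks like an early-out comparison. The sketch below compares two 64 bit words in 4 bit slices and stops at the first slice that fails, reporting how many slices were ever evaluated as a crude stand-in for circuits powered. It is sequential C, so it only mimics the behaviour, not the async hardware; the names and the slice width are my assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: either the operation completed, or it died early. */
typedef enum { RES_MATCH, RES_DONT_CARE } async_result;

enum { CMP_SLICE_BITS = 4, CMP_SLICES = 64 / CMP_SLICE_BITS };

/* Compare a and b one 4-bit slice at a time, lowest slice first.
   *slices_used reports how many slices were evaluated before stopping. */
static async_result compare_die_early(uint64_t a, uint64_t b,
                                      unsigned *slices_used)
{
    assert(slices_used != 0);
    *slices_used = 0;
    for (unsigned s = 0; s < CMP_SLICES; s++) {
        unsigned shift = s * CMP_SLICE_BITS;
        (*slices_used)++;
        if (((a >> shift) & 0xFu) != ((b >> shift) & 0xFu))
            return RES_DONT_CARE; /* first failure: later slices never run */
    }
    return RES_MATCH;
}
```

Comparing `0x1234` with `0x1235` dies after a single slice, while two equal words run all 16 slices before returning a match.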
Circuits look weird in ERM - some look like loops, others like spirals, others like fractal patterns. (In today’s finfet world, they might be spirals, loops, and fractals in three dimensions!) The data passing on the first part of the loop (or spiral) supplies power (or additional power) to the later parts. Delay lines, sometimes measured in picoseconds, are needed…
There are enormous metastability problems, but as each bit of logic connecting a loop together can be tested individually, and tweaked, and loops built up from smaller operations, it seemed feasible to fix each of them, either by hand, or with better design tools. Or fall back on conventional clocked logic if that becomes too hard for a given subsystem.
Perhaps some decent async design tools now exist, but thus far, I’ve come up empty aside from CHP. I came up empty in 92, too!
And the world is so different today.
Everything today has these massively clocked central domains, and processors that have explicit power states, (that are a pain in the arse to get in and out of), and many tools are enforcing rigorous adherence to that centrally clocked design.
ERM has interval timers. That’s it. If you went to sleep, had a cache miss, or did anything else that took more or less time than expected, the only way to know how long it took is to check a nearby interval timer. That’s a sloppily synced clock, any given result can take variable time, and if you are late, you just get in a later line among 1024 other potential queues.
(Aside: you can’t even get at the cycle timer on an ARM box by default without first programming a special unit. This is nuts.)
And: All that said - the code and ideas in ERM are the way they are because of the speculation: IF you could build a dataflow engine with async logic, and no central clock - what might it look like? So no actual ability to construct a machine needs to exist, just the concepts in code.
“The Caltech Asynchronous Microprocessor (also know as CAM) is the world-first asynchronous microprocessor. It was fabricated in 1988 by our research group at Caltech. (The chip was taped-out in December 1988.) It is a 16-bit RISC machine with 16 general-purpose registers. Its peak performance is 5 MIPS at 2V drawing 5.2mA of current, 18 MIPS at 5V drawing 45mA, and 26 MIPS at 10V drawing 105mA in HP 1.6µm CMOS.” - It’s hard to believe that was nearly 30 years ago!
The language that Erm’s C implementation sort of looks like is “CHP”, which is a GPLv3 tool nowadays.
When that first async chip came out from Caltech back in 1988, I said - “Eureka! This is the answer!” No central clock, in particular, means that the RFI generated by such a chip is much lower, and then you can have a much more sensitive wireless circuit than otherwise feasible. You have heat problems, you slow down magically. You don’t have heat problems, you speed up.
Power consumption is less, across the board (the numbers quoted above were amazing), but none of the async chips since then has made it out to open source. And RISC is a poor map for the instruction set - what Moore has done with his latest 144 Forth processors was more apropos.
There are a bunch of really small adders in the design as well (2-4 bits), and there has been work, here and there, on doing bigger adders with speculative logic - which seemed highly desirable to me as you tried to get to 128 bits wide for data.
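To make “bigger adders out of small ones with speculative logic” concrete, here is one well known scheme, carry-select, swapped in purely for illustration; I am not claiming it is the scheme ERM or the work mentioned above uses. Each 4 bit slice computes its sum for both possible carry-ins without waiting on its neighbours, and the real carry then only selects between the two precomputed answers. The sketch is 64 bits wide because C has no portable 128 bit type, and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { ADD_SLICE_BITS = 4, ADD_SLICES = 64 / ADD_SLICE_BITS };

/* Add two 4-bit values plus a carry-in. Result is 5 bits; bit 4 is carry-out. */
static unsigned add4(unsigned a, unsigned b, unsigned carry_in)
{
    return (a & 0xFu) + (b & 0xFu) + (carry_in & 1u);
}

/* 64-bit add built from 4-bit slices using carry-select speculation. */
static uint64_t add64_speculative(uint64_t a, uint64_t b, unsigned *carry_out)
{
    unsigned sum0[ADD_SLICES]; /* slice results assuming carry-in = 0 */
    unsigned sum1[ADD_SLICES]; /* slice results assuming carry-in = 1 */

    assert(ADD_SLICES * ADD_SLICE_BITS == 64);

    /* Phase 1: every slice computes both answers independently. */
    for (unsigned s = 0; s < ADD_SLICES; s++) {
        unsigned sa = (unsigned)(a >> (s * ADD_SLICE_BITS)) & 0xFu;
        unsigned sb = (unsigned)(b >> (s * ADD_SLICE_BITS)) & 0xFu;
        sum0[s] = add4(sa, sb, 0);
        sum1[s] = add4(sa, sb, 1);
    }

    /* Phase 2: the real carry walks the slices and only picks a result. */
    uint64_t result = 0;
    unsigned carry = 0;
    for (unsigned s = 0; s < ADD_SLICES; s++) {
        unsigned chosen = carry ? sum1[s] : sum0[s];
        result |= (uint64_t)(chosen & 0xFu) << (s * ADD_SLICE_BITS);
        carry = chosen >> 4;
    }
    if (carry_out)
        *carry_out = carry;
    return result;
}
```

In hardware the first phase is where the parallelism lives; the second phase is a chain of multiplexers rather than a chain of full adders, which is the whole point of speculating.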
Xilinx vs Altera
I chose Xilinx over the other guys because they had a low cost chip that let you hook up virtual memory to the FPGA - which, so far, I haven’t seen used particularly well, or maybe I just misunderstand it. Intel bought Altera and there are plans to integrate Xeon with those FPGAs - which sounds really cool, except that I’m not sure they can pull it off. I really should take another look at Altera.
Xilinx’s new Zynq UltraScale+ parts DO seem rather attractive, with Cortex-A53 cores and that nifty set of memory ports. They also seem to be doing a good job with Linux in general.
Seems to have been grabbing up all the cool tools. They can’t possibly be well integrated or well maintained. But I should take a look at them.