## FP 14.1: Intelligent RAM (IRAM): Chips that Remember and Compute

David Patterson, Thomas Anderson, Neal Cardwell, Richard Fromm, Kimberley Keeton, Christoforos Kozyrakis, Randi Thomas, Kathy Yelick

Computer Science, EECS, University of California, Berkeley CA

Division of the semiconductor industry into microprocessor and memory camps provides many advantages: fabrication lines can be tailored to a device, packages are tailored to the pinout and power of a device, and the number of memory chips in a computer is independent of the number of processors.

The split has disadvantages as well. While microprocessors have been improving performance by 60% per year, DRAM access time has been improving by 7% per year. This processor-memory performance gap limits many applications. For example, one microprocessor spends 75% of its time in the memory hierarchy for database and matrix computations [1]. These delays occur despite tremendous resources being spent trying to bridge this gap. Table 1 shows that up to 60% of the area and 90% of the transistors of recent microprocessors are dedicated to the growing "memory gap penalty": on-chip memory latency-hiding hardware such as caches.
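To make the divergence of the two growth rates concrete, the following is a minimal sketch that compounds the 60% and 7% annual figures quoted above. It assumes both rates stay constant from year to year, which the text does not claim; the function name `gap_after` is illustrative only.

```python
# Annual improvement rates quoted in the text.
CPU_RATE = 0.60   # microprocessor performance improvement per year
DRAM_RATE = 0.07  # DRAM access time improvement per year


def gap_after(years: int) -> float:
    """Ratio of processor speedup to DRAM speedup after `years` years.

    ASSUMPTION: both rates compound at a constant annual rate.
    """
    return ((1 + CPU_RATE) / (1 + DRAM_RATE)) ** years


if __name__ == "__main__":
    for years in (1, 5, 10):
        print(f"after {years:2d} years the gap has grown {gap_after(years):.1f}x")
```

Under this assumption the gap widens by roughly half again each year, so it compounds to well over an order of magnitude within a decade.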

The DRAM industry also has difficulties. The number of DRAMs for the minimum memory of PCs is shrinking, from 32 1Mb DRAMs in 1986 to 2 64Mb DRAMs today, because the growth rate of the minimum memory size is half the growth rate of DRAM capacity. A challenge for wide DRAMs is that some customers want parity protection and some do not. Finally, today's cache-oriented microprocessors need lower latency but instead are offered higher bandwidth with higher latency. Hence customers may no longer automatically switch to the larger capacity DRAM: the minimum memory capacity may be too large; the larger capacity DRAM will need to be in a wider configuration that is more expensive per bit than the narrow version of a smaller DRAM; the wider configuration may not match the width needed for error checking; or memory latency may be higher. Thus the 256Mb or 1Gb DRAM may be greeted with indifference.

It is time to reconsider unifying logic and memory. Since most of the transistors on this merged chip will be devoted to memory, it is called "intelligent RAM." IRAM is attractive because the gigabit DRAM chip has enough transistors for both a powerful processor and a memory big enough to contain whole programs and data sets. It contains 1024 memory blocks, each 1kb wide. It needs more metal layers to accelerate the long lines of 600mm<sup>2</sup> chips [2]. It may require faster transistors for the high-speed interface of synchronous DRAM. Potential advantages of IRAM include lower memory latency (≈0.1x), higher memory bandwidth (≈100x), lower system power, adjustable memory width and size, and less board space. Challenges for IRAM include high chip yield, given that processors have not been repairable via redundancy; high memory retention rates, given that processors usually need higher power than DRAMs; and a fast processor, given that logic is slower in a DRAM process.

One microprocessor, the Alpha 21164, is described in sufficient detail to allow estimation of the performance of an IRAM using a similar organization [1]. Given the breakdown of where time is spent, the performance of each piece in an IRAM is estimated. Table 3 shows the performance factors used to scale the Alpha performance parameters to estimate the IRAM speed. Rather than a single number for each category, optimistic and pessimistic factors are used. The latency to IRAM main memory should be 5 to 10 times faster (a factor of 0.1 to 0.2) than the 200-300ns latency of typical computers.

Table 2 shows optimistic and pessimistic performance for an IRAM organized like a recent microprocessor [1]. The small SPEC92 benchmarks are the poorest performers in IRAM, being 1.2 to 1.8 times slower. The database varies from a little slower to a little faster, while linpack varies from 1.2 to 1.8 times faster. These two programs are more representative than SPEC92, which was replaced in part because of its limited memory traffic.
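The totals in Table 2 can be cross-checked against the component rows with a short script. The sketch below transcribes the Table 2 values, sums each column, and converts each total into a speed ratio. The names `TABLE2`, `column_error` and `speed_ratio` are illustrative, and the tolerance of 0.015 is an assumption that allows for rounding in the published two-decimal entries.

```python
# Each entry maps (program, case) -> (component times, reported total).
# Component order follows the rows of Table 2:
# processor, I-cache misses, D-cache misses, L2 misses, L3 misses.
TABLE2 = {
    ("SPECint92", "opt"): ([1.02, 0.04, 0.14, 0.05, 0.00], 1.25),
    ("SPECint92", "pes"): ([1.57, 0.05, 0.17, 0.05, 0.00], 1.83),
    ("SPECfp92", "opt"): ([0.89, 0.01, 0.26, 0.06, 0.00], 1.21),
    ("SPECfp92", "pes"): ([1.36, 0.01, 0.30, 0.06, 0.00], 1.74),
    ("Database", "opt"): ([0.30, 0.18, 0.15, 0.20, 0.03], 0.85),
    ("Database", "pes"): ([0.46, 0.21, 0.18, 0.20, 0.05], 1.10),
    ("Sparse Linpack", "opt"): ([0.35, 0.00, 0.08, 0.07, 0.06], 0.56),
    ("Sparse Linpack", "pes"): ([0.54, 0.00, 0.10, 0.07, 0.12], 0.82),
}


def column_error(key):
    """Absolute difference between the summed components and the reported total."""
    parts, total = TABLE2[key]
    return abs(sum(parts) - total)


def speed_ratio(key):
    """IRAM speed relative to the reference microprocessor (>1 means IRAM is faster)."""
    _, total = TABLE2[key]
    return 1.0 / total


if __name__ == "__main__":
    for (program, case), (_, total) in TABLE2.items():
        key = (program, case)
        print(f"{program:15s} {case}: total={total:.2f} "
              f"sum error={column_error(key):.3f} speed ratio={speed_ratio(key):.2f}")
```

A total below 1 means the IRAM finishes sooner, so the Sparse Linpack totals of 0.56 and 0.82 correspond to the 1.2 to 1.8 times speedup quoted in the text.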

An alternative computing style is vector processing that works on linear arrays of numbers. Vector processors do not need caches, but rely instead on low-latency memory, often made from SRAM, and high bandwidth using 100s of memory banks. Thus a gigabit IRAM memory system naturally matches the needs of a vector processor. An IRAM vector microprocessor might look like Figure 1. In a 0.18µm DRAM process with a 600mm<sup>2</sup> chip, an IRAM could have 16 add-multiply units running at 500MHz and 16 1024b-wide memory ports at 50MHz, giving a collective 100GB/s of memory bandwidth. It could run linpack at 8GFLOPS, more than five times faster than the fastest Cray vector supercomputer processor (Cray T-90). The popularity of IRAM is limited by the amount of memory on-chip. If IRAMs succeed, IRAM products should increase as memory size expands from graphics today (10 Mb) to the game and embedded markets (32Mb), and to network computers and portable PCs (128-256Mb). The semiconductor industry may soon see head-to-head competition between its currently segregated logic and memory camps.
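The bandwidth figure for the vector IRAM follows directly from the port count, port width and port clock given above, as this small sketch shows. The peak arithmetic rate is not stated in the text: the value of 2 floating-point operations per add-multiply unit per cycle is an assumption made here for illustration, as are all the constant and function names.

```python
# Parameters quoted in the text for the vector IRAM.
PORTS = 16
PORT_WIDTH_BITS = 1024
PORT_CLOCK_HZ = 50e6
UNITS = 16
UNIT_CLOCK_HZ = 500e6
LINPACK_FLOPS = 8e9

# ASSUMPTION (not in the text): each add-multiply unit completes one add
# and one multiply per cycle, i.e. 2 floating-point operations.
FLOPS_PER_UNIT_CYCLE = 2


def memory_bandwidth_bytes_per_s() -> float:
    """Aggregate bandwidth of all memory ports, in bytes per second."""
    return PORTS * (PORT_WIDTH_BITS / 8) * PORT_CLOCK_HZ


def peak_flops() -> float:
    """Peak arithmetic rate under the 2-flops-per-cycle assumption."""
    return UNITS * UNIT_CLOCK_HZ * FLOPS_PER_UNIT_CYCLE


if __name__ == "__main__":
    bw = memory_bandwidth_bytes_per_s()
    print(f"memory bandwidth: {bw / 1e9:.1f} GB/s")
    print(f"assumed peak:     {peak_flops() / 1e9:.1f} GFLOPS")
    print(f"linpack fraction of assumed peak: {LINPACK_FLOPS / peak_flops():.2f}")
    print(f"bytes of bandwidth per linpack flop: {bw / LINPACK_FLOPS:.1f}")
```

The computed bandwidth of 102.4GB/s agrees with the rounded 100GB/s in the text. Under the stated assumption, the 8GFLOPS linpack estimate would be half of the peak rate.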

## Acknowledgments:

This research was supported by DARPA (DABT63-C-0056), the California State MICRO program, and by research grants from Intel and Sun Microsystems.

## References:

- [1] Cvetanovic, Z., D. Bhandarkar, "Performance Characterization of the Alpha 21164," Proc. High Performance Computer Architecture, San Jose, CA, pp. 270-280, Feb. 1996.
- [2] Yoo, H. J., et al., "A 32-Bank 1Gb DRAM with 1GB/s Bandwidth," ISSCC Digest of Technical Papers, pp. 378-379, Feb. 1996.



Figure 1. Organization of a vector IRAM in a 0.18µm DRAM process.

| Year | Microprocessor           | On-Chip<br>Cache Size | Memory Gap Penalty:<br>% Die Area<br>(not counting pad ring) | Memory Gap Penalty:<br>% Transistors |
|------|--------------------------|-----------------------|--------------------------------------------------------------|--------------------------------------|
| 1992 | 1st generation 64b RISC  | I: 8KB, D: 8KB        | 21.4%                                                        | 59.5%                                |
| 1994 | 2nd generation 64b RISC  | I: 8KB, D: 8KB,<br>L2: 96KB | 37.4%                                                  | 77.4%                                |
| 1996 | Low power, embedded RISC | I: 16KB, D: 16KB      | 60.8%                                                        | 94.5%                                |
| 1989 | 4th generation x86       | 8KB                   | 19.9%                                                        | 50%                                  |
| 1993 | 5th generation x86       | I: 8KB, D: 8KB        | 31.9%                                                        | 32%                                  |
| 1995 | 6th generation x86<br>(2 chips, processor and L2 cache) | I: 8KB, D: 8KB,<br>L2: 512KB | P: 22.5%<br>+L2: 100%<br>(Total: 64.2%) | P: 18.2%<br>+L2: 100%<br>(Total: 87.5%) |

Table 1: Memory gap penalty for conventional microprocessors (I = instruction, D = data, L2 = level 2).

| Category                            | SPECint92<br>Opt. | SPECint92<br>Pes. | SPECfp92<br>Opt. | SPECfp92<br>Pes. | Database<br>Opt. | Database<br>Pes. | Sparse Linpack<br>Opt. | Sparse Linpack<br>Pes. |
|-------------------------------------|------|------|------|------|------|------|------|------|
| Fraction of time in processor       | 1.02 | 1.57 | 0.89 | 1.36 | 0.30 | 0.46 | 0.35 | 0.54 |
| Fraction of time in I cache misses  | 0.04 | 0.05 | 0.01 | 0.01 | 0.18 | 0.21 | 0.00 | 0.00 |
| Fraction of time in D cache misses  | 0.14 | 0.17 | 0.26 | 0.30 | 0.15 | 0.18 | 0.08 | 0.10 |
| Fraction of time in L2 cache misses | 0.05 | 0.05 | 0.06 | 0.06 | 0.20 | 0.20 | 0.07 | 0.07 |
| Fraction of time in L3 cache misses | 0.00 | 0.00 | 0.00 | 0.00 | 0.03 | 0.05 | 0.06 | 0.12 |
| Total = ratio of time vs. microprocessor [1]<br>(>1 means IRAM slower) | 1.25 | 1.83 | 1.21 | 1.74 | 0.85 | 1.10 | 0.56 | 0.82 |

Table 2: Estimated performance of conventional IRAM for four programs (int=integer, fp=floating point).

| Component of microprocessor<br>execution time | Optimistic<br>Scale Factor | Pessimistic<br>Scale Factor |
|-----------------------------|---------------------|----------------------|
| Logic                       | 1.3                 | 2.0                  |
| SRAM                        | 1.1                 | 1.3                  |
| DRAM                        | 0.1                 | 0.2                  |

Table 3: Scale factors for estimating IRAM.
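As a rough illustration of how the Table 3 factors relate to Table 2, the sketch below divides the Database column of Table 2 by the scale factors to back out the baseline time fractions they imply. The mapping of rows to components (processor to Logic, I and D cache misses to SRAM, L3 cache misses to DRAM, L2 cache misses unscaled) is an assumption inferred from the ratios between the optimistic and pessimistic columns; it is not stated in the text. All names are illustrative.

```python
# Scale factors from Table 3 as (optimistic, pessimistic).
FACTORS = {
    "logic": (1.3, 2.0),
    "sram": (1.1, 1.3),
    "dram": (0.1, 0.2),
    "unscaled": (1.0, 1.0),  # ASSUMPTION: L2 miss time is left unchanged
}

# Database column of Table 2: row -> (assumed component, optimistic, pessimistic).
# ASSUMPTION: the row-to-component mapping is inferred, not given in the text.
ROWS = {
    "processor": ("logic", 0.30, 0.46),
    "I cache misses": ("sram", 0.18, 0.21),
    "D cache misses": ("sram", 0.15, 0.18),
    "L2 cache misses": ("unscaled", 0.20, 0.20),
    "L3 cache misses": ("dram", 0.03, 0.05),
}


def implied_baseline(row):
    """Baseline fractions implied by the optimistic and pessimistic entries."""
    component, opt, pes = ROWS[row]
    f_opt, f_pes = FACTORS[component]
    return opt / f_opt, pes / f_pes


def baseline_totals():
    """Sum of implied baseline fractions for each case; near 1.0 if consistent."""
    pairs = [implied_baseline(row) for row in ROWS]
    return sum(p[0] for p in pairs), sum(p[1] for p in pairs)


if __name__ == "__main__":
    for row in ROWS:
        from_opt, from_pes = implied_baseline(row)
        print(f"{row:16s} baseline from opt={from_opt:.2f}, from pes={from_pes:.2f}")
    print("implied baseline totals:", baseline_totals())
```

If the inferred mapping is right, both columns should imply nearly the same baseline and the fractions should sum to about 1. Small differences are expected because the table entries are rounded to two decimals, and the DRAM row magnifies that rounding by a factor of 5 to 10.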