Static Design of Spin Transfer Torque Magnetic Look-Up Tables

***Abstract*—Spin Transfer Torque Magnetic Memory (STT-MRAM) is a promising technology for non-volatile storage in which information is stored as the magnetic orientation of a Magnetic Tunnel Junction (MTJ) rather than as electric charge. Besides memory applications, this technology is promising for CMOS-compatible non-volatile reconfigurable logic design. Given the relatively high power and long delay associated with changing the magnetic state of an MTJ, the most efficient method of designing reconfigurable logic using MTJs is the Look-Up-Table (LUT) based approach. In such STT-LUTs, the MTJs are written only during the configuration phase, while in the normal mode of operation they are only read. Most existing STT-LUT designs are dynamic because they use a shared precharge sense amplifier to read the state of the selected MTJ. Dynamic STT-LUTs are challenging to use alongside static combinational logic circuits and are not supported by automated design flow tools. In this paper, we propose a static approach to the design of STT-LUTs and investigate the sensing reliability of the proposed design in detail. The proposed design style utilizes STT-latches, whose sensing reliability is key in determining the overall reliability of the proposed static STT-LUT. Simulation results in a 10nm FinFET CMOS technology show that the proposed static STT-LUT design exhibits up to ??X read delay reduction, ??X active power reduction, and ???X sensing failure rate reduction compared to the best dynamic STT-LUT design.**

***Keywords—Look up table; low power; magnetic tunnel junction; reconfigurable logic; spin transfer torque.***

# INTRODUCTION

|  |  |
| --- | --- |
| (a) | (b) |
| **Figure 1. Perpendicular MTJ: (a) Parallel (low resistance) and anti-parallel (high resistance) states, (b) R-I characteristics** | |

Scaling of CMOS technology at the nano-scale faces challenges of excessive leakage and power consumption, as well as process variations and reliability issues. These problems are exacerbated in memory applications, where minimum sized transistors are used and leakage dominates the total power consumption. Such challenges have motivated research on alternative devices and materials to extend Moore’s law beyond 10nm. A promising alternative is the Magnetic Tunnel Junction (MTJ) device, which stores information in magnetic form and is therefore non-volatile. An MTJ is formed by two ferromagnetic layers stacked on each other with a thin insulator in between (Figure 1) [1]. This structure is compatible with the standard CMOS process and can be manufactured on top of transistors, between two top metallization layers [2]. One of the ferromagnetic layers (the pinned layer) has a hard, or fixed, magnetization orientation, while the magnetization of the other layer (the free layer) is programmable to one of two orientations: parallel (P) or anti-parallel (AP) to the magnetic orientation of the pinned layer. The insulator has to be thin enough to allow a tunneling current once the MTJ is biased at the nominal supply voltage. Depending on the relative magnetization of the two layers, the effective resistance of the MTJ changes between a low (RP) and a high (RAP) value, representing the binary logic “0” and “1” states. The magnetization of the free layer is flipped (i.e. the write operation is performed) by passing a sufficiently large current through the MTJ. The physical mechanism behind the write operation is the magnetic torque created by the spin polarized electrons passing through, or reflected from, the pinned layer to the free layer. This mechanism is referred to as Spin Transfer Torque (STT).

The primary application of STT has been in memory (STT-MRAM, also written STTRAM). STTRAM offers a CMOS compatible, non-volatile memory and hence a zero leakage alternative to SRAM. However, the main challenge remains the relatively high current and long delay needed for the write operation [3]. STTRAM is therefore particularly suited to realizing non-volatile and CMOS compatible programmable logic, such as on-chip FPGA, in which the write operation is infrequent. The basic programmable logic element is a look-up table that utilizes MTJs for storage (STT-LUT) [4,5].

Existing designs of STT-LUTs utilize a shared dynamic precharge sense amplifier to read the state of the MTJ selected by the inputs. The dynamic design of the shared sense amplifier makes the entire STT-LUT a dynamic design. While dynamic designs have the advantage of speed, they suffer from high power consumption, lower reliability, difficulty of integration with static logic, and lack of support by automated design flow tools.

In this paper, we propose a static approach to the design of STT-LUTs that offers improved power and robustness compared to the dynamic designs, as well as ease of integration with static logic and automated design flows. The contributions of the paper are as follows:

* Proposing a static approach to the design of STT-LUTs
* Improving the power and robustness of STT-LUT using the proposed approach
* Performing robustness comparisons by combined analysis of impact of CMOS and MTJ variations on read sensing failure rates of STT-LUTs
* Implementation of an adder benchmark circuit using the proposed STT-LUTs

The remainder of the paper is organized as follows. Section 2 offers a background on STT-LUT design and an overview of existing STT-LUT designs. Section 3 presents the proposed static approach to the design of STT-LUTs. Section 4 presents power and performance comparisons of the proposed and conventional STT-LUTs. Robustness of the STT-LUTs under process variations is analyzed and compared in Section 5. Section 6 presents the implementation of an adder benchmark circuit using the static STT-LUTs and its results. Finally, conclusions are drawn in Section 7.

# EXISTING STT-LUT DESIGNS: DYNAMIC

The design concept of a dynamic STT-LUT is shown in Figure 2(a). An n-input STT-LUT requires $2^n$ MTJs that store the binary states. A CMOS selection tree (i.e. a multiplexer) selects a unique MTJ to be biased for reading, and a CMOS sense amplifier produces a full swing output voltage. In the write mode, the data input is written to a unique MTJ. The challenging aspect of STT-LUT design is sensing the MTJ resistive state and converting it to a binary voltage reliably and with good power and performance efficiency.
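
The addressing described above can be illustrated with a purely functional model. The sketch below is not a circuit model; it only shows how n select inputs pick one of the $2^n$ stored bits. The function names and the bit ordering (first input is the least significant) are assumptions made for illustration.

```python
from typing import List, Sequence


def lut_index(inputs: Sequence[int]) -> int:
    """Map n select inputs to a cell index (inputs[0] is the least significant bit)."""
    index = 0
    for position, bit in enumerate(inputs):
        if bit not in (0, 1):
            raise ValueError("each input must be 0 or 1")
        index |= bit << position
    return index


def lut_read(cells: Sequence[int], inputs: Sequence[int]) -> int:
    """Read mode: return the bit stored in the cell selected by the inputs."""
    if len(cells) != 2 ** len(inputs):
        raise ValueError("an n-input LUT needs exactly 2**n cells")
    return cells[lut_index(inputs)]


def lut_write(cells: List[int], inputs: Sequence[int], value: int) -> None:
    """Write (configuration) mode: store one bit in the selected cell."""
    if len(cells) != 2 ** len(inputs):
        raise ValueError("an n-input LUT needs exactly 2**n cells")
    cells[lut_index(inputs)] = value


# Configure a 2-input LUT as XOR, one cell at a time.
xor_cells = [0, 0, 0, 0]
for a in (0, 1):
    for b in (0, 1):
        lut_write(xor_cells, [a, b], a ^ b)
```

In this model a stored 0 or 1 stands for an MTJ in the RP or RAP state, respectively.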

|  |
| --- |
| (a)    (b) |
| **Figure 2. STT-LUT design concept: (a) Existing dynamic style (b) Proposed static style** |

In this style, the sense amplifier is shared among all the MTJs, and due to the latching property of the sense amplifier, any change on the select inputs of the LUT requires the sense amplifier to re-evaluate and update its output. This requires a dynamic (i.e. precharge based) sense amplifier that periodically (using a clock signal) re-senses the MTJ selected by its inputs, regardless of whether the inputs of the LUT have changed. All existing STT-LUT designs use some form of dynamic sensing scheme to update the sense amplifier output on a cycle by cycle basis. The SRAM based sensing scheme proposed in [5] uses an SRAM cell that is biased in the metastable condition in the initialization phase and then switches to the full-swing direction according to the resistance difference between the selected MTJ and a reference resistor. This design cannot be cascaded because its initial output voltage sits at around half the supply level (VDD) and hence distorts the sensing of the next stage. The other existing STT-LUT design is the Dynamic Current Mode (DCM) logic based design [4]. This design uses the concept of dynamic current mode logic [6], utilizing a dynamic current source to reduce the current of the selection tree above it and hence to reduce the power consumption. However, by doing so, the read delay becomes very long. The capacitance of the dynamic current source creates a trade-off between power and delay: if the capacitance is increased to achieve a delay competitive with other styles, the power consumption becomes too high [?].

The Precharge Sense Amplifier (PCSA) STT-LUTs presented in [11] use MTJs in an array with a single access transistor per MTJ, and hence rely on a separate decoder to provide the decoded MTJ select signals from the primary LUT inputs. The Separated Precharge Sense Amplifier (SPCSA) is also presented in [11] as a more reliable style, owing to a higher voltage bias on the MTJs and a pre-amplification stage before the sense amplifier. The Dynamic Single Rail (DSR) and Dynamic Dual Rail (DDR) STT-LUT styles are presented in [?] and shown to be the most reliable and highest performance dynamic styles for STT-LUTs. The DDR STT-LUT style shown in Fig. 3 is chosen as the representative of the state of the art dynamic STT-LUT style.

|  |
| --- |
|  |
| **Figure 3. 2-input Dynamic Dual Rail (DDR) STT-LUT [?]** |

# PROPOSED STATIC STT-LUT DESIGN

|  |
| --- |
| (a)    (b) |
| **Figure 4. STT-Latches: (a) Precharge Sense Amplifier (PCSA) [] and (b) Separated Precharge Sense Amplifier (SPCSA) []** |

Fig. 2(b) shows the proposed static STT-LUT style. By moving the multiplexer stage after the dynamic sense amplifier and using a static implementation of the multiplexer, the STT-LUT logic from input to output becomes static. In this scheme, each MTJ cell requires its own dynamic latched sense amplifier. The sense amplifiers need to be enabled once at the beginning so that their outputs are initialized according to the state of their MTJs. This needs to be done once on every power-on. The MTJs retain the content of the LUT in a non-volatile form. In this configuration the MTJs are read only once, and for the rest of the time in the active mode, the LUT read power and delay are determined by the static multiplexer. Moreover, because the MTJs are not read repetitively as in the dynamic STT-LUT style, the read stress on the MTJs is removed, which enhances their lifetime.

The MTJ and sense amplifier form a non-volatile STT latch (STT-latch). An n-input static STT-LUT needs $2^n$ such STT-latches. A variety of STT-latches have been proposed in the literature. Fig. 4 shows the Precharge Sense Amplifier (PCSA) [] and the Separated Precharge Sense Amplifier (SPCSA) [] STT-latches, which offer better sensing reliability than other schemes such as the SRAM-based [] and dynamic current mode [] STT-latches.

The multiplexer is a $2^n$-to-1 ($2^n$:1) CMOS multiplexer (MUX) implemented in static style. The optimal CMOS implementation of the 2:1 and 4:1 MUX is a combination of tri-state inverters and transmission gates, as shown in Fig. 5. Higher fan-in multiplexers can be implemented by cascading 4:1 and 2:1 MUXes, as shown in Fig. 6. The power and performance of the proposed STT-LUT are determined by the multiplexer, but the reliability of its configuration is determined by the sensing reliability of the STT-latch itself.
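
A behavioral sketch of this organization is given below, under stated assumptions. It models each STT-latch as a bit that is captured once at power-on and models the static multiplexer as a cascade of 4:1 stages followed by one 2:1 stage when the number of inputs is odd. This is one plausible arrangement and not necessarily the one drawn in Fig. 6. The class and function names, and the `mtj_reads` counter, are hypothetical and exist only to show that the MTJs are sensed once per power-on rather than once per access.

```python
from typing import List, Optional, Sequence


def mux2(d: Sequence[int], s0: int) -> int:
    """Static 2:1 multiplexer."""
    return d[1] if s0 else d[0]


def mux4(d: Sequence[int], s0: int, s1: int) -> int:
    """Static 4:1 multiplexer; s0 is the least significant select bit."""
    return d[(s1 << 1) | s0]


def mux_tree(data: Sequence[int], selects: Sequence[int]) -> int:
    """Reduce 2**n inputs to one output with 4:1 stages, plus a 2:1 stage if n is odd."""
    level = list(data)
    sel = list(selects)
    if len(level) != 2 ** len(sel):
        raise ValueError("need exactly 2**n data inputs for n select inputs")
    while len(sel) >= 2:
        s0, s1 = sel[0], sel[1]
        level = [mux4(level[i:i + 4], s0, s1) for i in range(0, len(level), 4)]
        sel = sel[2:]
    if sel:
        level = [mux2(level[i:i + 2], sel[0]) for i in range(0, len(level), 2)]
    return level[0]


class StaticSttLut:
    """Behavioral model: 2**n STT-latches feeding a static multiplexer."""

    def __init__(self, mtj_states: Sequence[int]) -> None:
        self.mtj_states = list(mtj_states)       # non-volatile content
        self.latched: Optional[List[int]] = None  # volatile latch outputs
        self.mtj_reads = 0                       # hypothetical bookkeeping counter

    def power_on(self) -> None:
        """Enable every sense amplifier once so each latch captures its MTJ state."""
        self.latched = list(self.mtj_states)
        self.mtj_reads += len(self.mtj_states)

    def read(self, selects: Sequence[int]) -> int:
        """Active mode: only the static multiplexer is exercised."""
        if self.latched is None:
            raise RuntimeError("power_on() must be called before reading")
        return mux_tree(self.latched, selects)
```

For a 3-input LUT, `power_on()` senses the eight MTJs once; any number of subsequent `read()` calls leaves `mtj_reads` at 8.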

# POWER AND PERFORMANCE COMPARISONS

## Simulation Setup

The designs are simulated in a predictive 16nm CMOS process [7] with nominal supply voltage of 0.7V. In our first simulation attempt, we keep all transistors minimum sized for both designs (NMOS: W/L=30nm/16nm and PMOS: W/L=60nm/16nm). Since the critical operation in programmable logic applications, where these LUTs are used, is the read mode, we simulated these designs in the read mode, where MTJs can be modeled as resistors. The MTJ resistance depends on the MTJ physical geometries and specifically the thickness of the insulator and the 2D cross-section area of the MTJ, and the voltage bias (see Figure 1) [8]. The parallel state resistance (RP) can be modeled as [8]:

|  |  |
| --- | --- |
|  | (1) |

where *F* and *φ* are the tunneling conductivity and potential barrier height of the insulator, respectively. Considering a circular 2D cross-section of radius *r*, the area *A* is expressed as *A=πr²*. The MTJ parallel resistance is estimated for the 16nm process.

The resistance difference between the two MTJ states is quantified by the Tunneling Magneto Resistance (TMR) parameter defined as follows:

|  |  |
| --- | --- |
| $TMR = \dfrac{R_{AP} - R_{P}}{R_{P}}$ | (2) |

A larger difference between RP and RAP results in better read performance and reliability. Hence, for the STT-LUTs used in reconfigurable logic applications, a higher TMR is desirable even if it comes at the cost of higher write power and delay. Using crystalline MgO as the insulator, a TMR of up to 6 (604%) has been reported [9]. In this study, we choose a TMR value of 4.
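
As a short worked example, assuming the conventional definition of TMR as the resistance difference normalized to RP, the chosen TMR value fixes the ratio of the two resistance states:

$$
R_{AP} = (1 + TMR)\,R_{P} = (1 + 4)\,R_{P} = 5\,R_{P}
$$

Under the same definition, the reported TMR of 604% [9] corresponds to $R_{AP} \approx 7.04\,R_{P}$.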

Figure 5 shows the simulation waveform of the proposed 2-input DSR STT-LUT. Each input is a monotonic differential signal that is zero when the clock is low and differential when the clock is high. The output also exhibits such a monotonic pattern. The inputs are scanned through all possible combinations and the MTJs are programmed to be at RP and RAP levels such that the outputs toggle in every cycle.

## Results and Comparisons

Delay is measured from the rising edge of the clock (input) to the rising edge of the output. The delay of the DCM STT-LUT depends on the size of the capacitor of the dynamic current source (Figure 3). We sweep this capacitor from 0 to 100 fF. Figure 6 shows the delay comparison results for LUT fan-in ranging from 2 to 6 inputs. The delay of the DCM STT-LUT reduces as the capacitor value increases. The proposed design shows significant delay improvement, and its delay remains better than that of the DCM style irrespective of the capacitor value of the DCM current source. The improvements range from 3.3X to 16% depending on the fan-in and the capacitor value. Notice that the DCM style fails to operate at a fan-in of 6, and hence there is no data for DCM LUT6.

Figure 7 shows the comparison of the active mode power consumption of the DCM and DSR STT-LUTs. The power of the DCM style increases at higher capacitor values due to increased charge transfer in the dynamic current source. Unless the capacitor values are very small, the power of the proposed DSR STT-LUT is lower than that of the DCM style. For high values of the capacitor, the power improvements of the DSR over the DCM range from 1.5X to 2.4X depending on the LUT fan-in.

|  |
| --- |
| (a) (b) |
| **Figure 6. Building higher order multiplexers using cascade of 4:1 and 2:1 multiplexers (a) 8:1 mux (b) 16:1 mux** |

Because changing the capacitor value in the DCM style trades power against delay, the Power-Delay Product (PDP) provides a unified comparison between the two styles. Figure 8 shows the PDP comparisons. There is an optimum capacitor value that results in minimum PDP for the DCM style. However, the DSR style always shows better PDP than the DCM style. Compared to the highest PDP point of the DCM, the DSR reduces the PDP by 1.8X to 3.2X depending on the fan-in. Compared with the minimum PDP point of the DCM, the DSR reduces the PDP by 1.4X to 1.8X depending on the fan-in.
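
The search for the minimum PDP point over a capacitor sweep can be sketched as follows. The sweep values in the example are hypothetical placeholders chosen only to exercise the code; they are not simulation results from this work. Power is taken in µW and delay in ps, so the product is in aJ.

```python
from typing import List, Tuple

# One sweep point: (capacitance in fF, active power in uW, read delay in ps).
SweepPoint = Tuple[float, float, float]


def pdp_aj(power_uw: float, delay_ps: float) -> float:
    """Power-delay product; 1 uW * 1 ps = 1e-18 J = 1 aJ."""
    return power_uw * delay_ps


def min_pdp_point(sweep: List[SweepPoint]) -> SweepPoint:
    """Return the sweep point with the smallest power-delay product."""
    if not sweep:
        raise ValueError("sweep must contain at least one point")
    return min(sweep, key=lambda point: pdp_aj(point[1], point[2]))


# Hypothetical sweep: power rises and delay falls as the capacitor grows.
example_sweep: List[SweepPoint] = [
    (1.0, 2.0, 400.0),
    (3.0, 3.0, 200.0),
    (10.0, 6.0, 150.0),
    (100.0, 30.0, 100.0),
]
best = min_pdp_point(example_sweep)
```

With these placeholder numbers the PDP is smallest at the second point, which mirrors the qualitative behavior described above: an interior capacitor value minimizes the PDP.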

The standby power of the STT-LUTs can be zero if the power supply is turned off in the standby mode. Given the non-volatile nature of the MTJ storage, the data will not be lost in the standby mode. The active mode leakage power is already included in the total active mode power comparisons shown in Figure 7. Figure 9 shows a separate comparison of only the leakage component of the active mode power. The proposed DSR design shows higher leakage power because fewer OFF transistors are stacked in the standby condition, when the clock is zero. In the DCM style, when the clock is zero (CLK’=VDD), the sense amplifier is supply gated in addition to the selection trees. In the DSR style, however, there is a single shared power gating transistor (REN) that has to be large because it is shared among multiple LUTs. Nonetheless, this leakage will be zero once the supply is turned off in the standby condition. In the active mode, Figure 8 already showed that the proposed design offers better PDP than the DCM style.

# ROBUSTNESS ANALYSIS AND COMPARISONS

## MTJ Variations

|  |
| --- |
| (a)    (b) |
| **Figure 5. Static multiplexers: (a) 2:1 and (b) 4:1** |

According to Figure 1(b) and Equation (1), the resistance of the MTJ depends on the bias as well as on the MTJ geometries. Given that the MTJ voltage bias is fixed in the read mode, we concentrate on the influence of MTJ geometrical variations. The two geometries of most influence are the insulator thickness (*tox*) and the cross-section area (*A=πr²*), as expressed by Equation (1). Assume that *tox* and *r* exhibit variations represented by *dt* and *dr* from their nominal values, respectively. Treating *dt* and *dr* as uncorrelated normal random variables with zero mean and given standard deviations, we can obtain the distributions of the MTJ resistances. Figure 10 shows the distributions of RP, RAP, and the reference resistor (RREF) with the standard deviations of *dt* and *dr* set at 10% of the nominal *tox* (1 nm) and *r* (16 nm) values in a 16nm process node.

A read failure occurs when RP increases or RAP decreases far enough that the two states can no longer be distinguished. An accurate estimate of reliability against MTJ process variations can be obtained by applying statistical variations to *tox* and *r* of the MTJ and measuring the failure rate of the STT-LUT read operation.
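
The statistical procedure can be sketched at the resistance level as follows. This is an illustration under several loudly stated assumptions, not the circuit-level simulation used in this work: the resistance is taken to scale as *tox*/*A* times an exponential in *tox* with a hypothetical sensitivity `K_PER_NM`, RAP is taken as (1 + TMR) times RP, and the reference is assumed to be a fixed resistor at the midpoint of the nominal RP and RAP. All function names are hypothetical.

```python
import math
import random

T_OX_NOM_NM = 1.0  # nominal insulator thickness (from the text)
R_NOM_NM = 16.0    # nominal MTJ radius (from the text)
TMR = 4.0          # TMR value chosen in the text
K_PER_NM = 6.5     # hypothetical exponential sensitivity of resistance to t_ox


def relative_rp(t_ox_nm: float, r_nm: float) -> float:
    """Assumed R_P model, normalized so the nominal device gives 1.0."""
    area_ratio = (r_nm / R_NOM_NM) ** 2
    thickness_ratio = t_ox_nm / T_OX_NOM_NM
    return thickness_ratio / area_ratio * math.exp(K_PER_NM * (t_ox_nm - T_OX_NOM_NM))


def draw_positive(rng: random.Random, mean: float, sigma: float) -> float:
    """Draw a normal sample, redrawing the (rare) non-physical values <= 0."""
    while True:
        value = rng.gauss(mean, sigma)
        if value > 0.0:
            return value


def misread_fraction(sigma_fraction: float, samples: int = 5000, seed: int = 1) -> float:
    """Fraction of sampled MTJs whose R_P or R_AP crosses the assumed reference."""
    rng = random.Random(seed)
    r_ref = 0.5 * (1.0 + (1.0 + TMR))  # assumed midpoint of nominal R_P and R_AP
    misreads = 0
    for _ in range(samples):
        t_ox = draw_positive(rng, T_OX_NOM_NM, sigma_fraction * T_OX_NOM_NM)
        r = draw_positive(rng, R_NOM_NM, sigma_fraction * R_NOM_NM)
        rp = relative_rp(t_ox, r)
        rap = (1.0 + TMR) * rp
        if rp >= r_ref or rap <= r_ref:
            misreads += 1
    return misreads / samples
```

Because the model ignores the sense amplifier and the selection tree, the returned fraction only indicates how strongly the resistance distributions overlap the reference; it should not be read as an STT-LUT failure rate.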

## Transistor Variations

The primary source of variations in bulk CMOS transistors is random threshold voltage (Vt) variations caused by Random Dopant Fluctuations (RDF) [10]. The RDF induced Vt variation from the nominal Vt (Vt0) (dvt=Vt-Vt0) for a transistor of size L (channel length) and W (channel width), follows a normal distribution with a standard deviation inversely proportional to the square root of the transistor area as follows [10]:

|  |  |
| --- | --- |
| $\sigma_{dvt} = \sigma_{dvt0} \sqrt{\dfrac{L_{min} W_{min}}{L\,W}}$ | (3) |

where all the process parameters are lumped into σdvt0, the standard deviation of the threshold voltage variation of a minimum sized transistor with channel length and width of Lmin and Wmin. Increasing the transistor area provides a means to reduce the influence of variations, at the cost of increased area and power. To quantify the STT-LUT robustness against Vt variations, the read sensing failure rate can be measured by applying statistical variations to the threshold voltages of the transistors in the STT-LUT.
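
The area scaling described above can be sketched as a small helper. The default values are assumptions taken from elsewhere in this paper (a 30 mV standard deviation for the minimum sized device and minimum NMOS dimensions of W/L = 30nm/16nm); the function name is hypothetical.

```python
import math


def sigma_dvt_mv(w_nm: float, l_nm: float,
                 sigma0_mv: float = 30.0,
                 w_min_nm: float = 30.0,
                 l_min_nm: float = 16.0) -> float:
    """Vt standard deviation, inversely proportional to the square root of gate area."""
    if w_nm <= 0.0 or l_nm <= 0.0:
        raise ValueError("transistor dimensions must be positive")
    return sigma0_mv * math.sqrt((w_min_nm * l_min_nm) / (w_nm * l_nm))


minimum_device = sigma_dvt_mv(30.0, 16.0)    # minimum sized transistor
four_times_area = sigma_dvt_mv(120.0, 16.0)  # 4x the width, same length
```

Quadrupling the gate area halves the standard deviation, which is the trade-off between robustness and area noted above.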

|  |  |
| --- | --- |
| (a) | (b) |
| **Figure 10. Distributions of MTJ RP, RAP, and RREF at 16nm process for (a) TMR=1 and (b) TMR=4** | |

## Robustness Comparisons

|  |
| --- |
|  |
| **Figure 11. Monte Carlo statistical simulation waveforms showing STT-LUT read sensing failures** |

There are two possible failures in the read mode: read sensing failure and read disturbance failure. A read sensing failure is an incorrect decision by the sense amplifier, caused by mismatch due to CMOS and MTJ variations that results in a low sensing margin. A read disturbance failure is a read operation in which the read current passing through the MTJ exceeds the critical write current and flips the state of the MTJ (i.e. performs a write operation). In STT-LUTs the read sensing failure is the dominant type because the stack of transistors in the read path (i.e. in the multiplexer tree) limits the read current. Hence, we focus on the read sensing failures. Random statistical variations are applied to the MTJ parameters (*tox* and *r*) and the threshold voltage (*dvt*), and the read sensing failure rates of the STT-LUTs are measured. The standard deviations of the *tox* and *r* variations are set at 10% of their nominal values. In the 16nm CMOS model used, the NMOS threshold voltage of a minimum sized transistor with the short channel effects applied was found to be 233 mV. The standard deviation of the *Vt* variation was set at 30 mV, which is 13% of the nominal value. Figure 11 shows waveforms from statistical Monte Carlo simulations that include failure cases. All the MTJ cells of the STT-LUT are read, and a read failure is recorded if a wrong value is read from any cell or if the delay of reading the correct value of a cell exceeds 500 ps (half the evaluation cycle time). If reading from any of the MTJ cells fails, the entire STT-LUT is considered faulty.
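
The pass/fail criterion stated above can be written down as a small sketch. It assumes each Monte Carlo run yields, for every cell, the value read and the read delay in ps; the data layout and function names are hypothetical, and the runs in the usage lines are made-up placeholders rather than simulation output.

```python
from typing import List, Sequence, Tuple

DELAY_LIMIT_PS = 500.0  # half the evaluation cycle time, as stated in the text

# One cell observation: (value read, read delay in ps).
CellRead = Tuple[int, float]


def lut_read_fails(expected: Sequence[int], observed: Sequence[CellRead]) -> bool:
    """A LUT is faulty if any cell reads a wrong value or exceeds the delay limit."""
    if len(expected) != len(observed):
        raise ValueError("one observation is required per cell")
    return any(value != want or delay_ps > DELAY_LIMIT_PS
               for want, (value, delay_ps) in zip(expected, observed))


def failure_rate(expected: Sequence[int], runs: List[Sequence[CellRead]]) -> float:
    """Fraction of Monte Carlo runs in which the whole LUT is considered faulty."""
    if not runs:
        raise ValueError("at least one run is required")
    failed = sum(1 for observed in runs if lut_read_fails(expected, observed))
    return failed / len(runs)


# Placeholder runs for a 2-input LUT programmed as XOR.
stored = [0, 1, 1, 0]
good = [(0, 120.0), (1, 130.0), (1, 125.0), (0, 118.0)]
slow = [(0, 120.0), (1, 620.0), (1, 125.0), (0, 118.0)]   # one cell too slow
wrong = [(0, 120.0), (1, 130.0), (0, 125.0), (0, 118.0)]  # one cell misread
rate = failure_rate(stored, [good, slow, wrong, good])
```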

Figure 12 shows the comparison of delay distributions and failure rates under intra-die variations of these process parameters for the 2-input STT-LUTs. The capacitor of the DCM STT-LUT is chosen for its minimum PDP point (C=3fF from Figure 8). Figure 12(a) shows the delay distributions of the success cases. DSR offers a lower and tighter delay distribution, with a 29% reduction in the mean and a 45% reduction in the standard deviation of the delay. Figure 12(b) shows the delay histogram with the failure cases lumped into a single bin at 500 ps. The proposed DSR style reduces the failure rate by 38%.

|  |
| --- |
| (a)    (b) |
| **Figure 12. DCM and DSR STT-LUT variability comparisons: (a) delay distributions for successful reads (b) Delay histograms showing read sensing failure rates** |

Increasing the capacitor of the DCM STT-LUT should reduce the read sensing failure rates. Figure 13 shows the comparisons of the failure rates of the DCM and DSR as the capacitor is varied. These failure rates are measured under intra-die variations of the MTJ and transistor. Regardless of the sizing of the capacitor of the DCM, the proposed DSR STT-LUT shows significantly lower failure rates for all fan-ins. Depending on the fan-in and the capacitor of the DCM, the improvements vary from 26% to 58%.

To gain a better understanding of the influence of the various sources of variability on the STT-LUT read sensing failure rates, Figure 14 shows the DSR failure rates under intra-die and inter-die variations of MTJ and CMOS, individually and combined. First, it is observed that the inter-die variations alone do not cause any read sensing failures. That is because inter-die variations affect all transistors and MTJs in an LUT uniformly and therefore do not cause any mismatch among symmetric devices. It is also observed that CMOS (intra-die) variations play a more significant role in determining the failure rate than MTJ variations.
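
The reason inter-die variations alone produce no mismatch can be illustrated with a minimal sketch of a symmetric transistor pair. The nominal Vt of 233 mV and the 30 mV intra-die standard deviation are taken from the text; the 50 mV inter-die standard deviation and all names are hypothetical.

```python
import random


def pair_mismatch_mv(rng: random.Random,
                     sigma_inter_mv: float,
                     sigma_intra_mv: float,
                     vt0_mv: float = 233.0) -> float:
    """Vt difference between two symmetric devices on the same die."""
    die_shift = rng.gauss(0.0, sigma_inter_mv)  # shared by both devices
    vt_a = vt0_mv + die_shift + rng.gauss(0.0, sigma_intra_mv)
    vt_b = vt0_mv + die_shift + rng.gauss(0.0, sigma_intra_mv)
    return vt_a - vt_b


rng = random.Random(7)
# Inter-die variation only: the common shift cancels in the difference.
inter_only = [pair_mismatch_mv(rng, 50.0, 0.0) for _ in range(1000)]
# Intra-die variation only: each device moves independently.
intra_only = [pair_mismatch_mv(rng, 0.0, 30.0) for _ in range(1000)]
```

Every entry of `inter_only` is zero, while `intra_only` spreads around zero. The common shift still changes the operating point of both devices, which is why inter-die variation can modify the impact of intra-die variation even though it causes no mismatch by itself.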

|  |
| --- |
|  |
| **Figure 13. Read sensing failure rate comparisons of DCM and DSR STT-LUT** |
|  |
| **Figure 14. DSR read sensing failure rates under intra- and inter-die variations of MTJ and CMOS** |
|  |
| **Figure 15. Impact of inter-die Vt shift on intra-die induced read sensing failure rates** |

|  |
| --- |
|  |
| **Figure 16. Proposed Dynamic Dual Rail (DDR) STT-LUT** |

Although the inter-die variations do not directly cause read sensing failures, they modify the impact of intra-die variations. Figure 15 shows the intra-die induced failure rates as the inter-die Vt shift is varied from -0.15 V to +0.15 V for different LUT fan-ins of the DSR style. It is observed that an inter-die Vt shift towards high Vt increases the read sensing failure rates, but a shift towards low Vt does not have much influence. That is because at higher inter-die Vt corners the effective read current is reduced, leaving less sensing margin for the sense amplifier.

# SECOND PROPOSED STT-LUT DESIGN

The proposed DSR style reduces the failure rates significantly compared to the conventional DCM STT-LUT style; however, the failure rates are still fairly high. We propose another new STT-LUT design style that further enhances the robustness by utilizing dual differential MTJs per bit and eliminating the reference resistor. The schematic of the proposed Dynamic Dual Rail (DDR) style for a 2-input STT-LUT is shown in Fig. 16. By utilizing two differentially programmed MTJs per bit, the sensing margin for the sense amplifier is enhanced, resulting in shorter delay and lower failure rates. Fig. 17 shows the comparisons of the proposed DSR and DDR STT-LUTs. The DDR scheme shows lower delay, active power, and PDP, similar active leakage, and significantly lower read sensing failure rates. These benefits come at the cost of increased area. Compared to the conventional DCM STT-LUT and depending on the fan-in, the proposed DDR STT-LUT reduces the read delay by 39% to 44%, the active power by 0% to 20%, and the read sensing failure rate by 9X to 441X.
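
The margin improvement can be illustrated at the resistance level with a short worked example. It assumes TMR = 4 (so that $R_{AP} = 5R_{P}$) and, for the single rail case, a reference resistor placed at the midpoint of the two states; the midpoint placement is an assumption made for illustration.

$$
R_{REF} = \tfrac{1}{2}\left(R_{P} + R_{AP}\right) = 3R_{P}, \qquad
\left|R_{REF} - R_{P}\right| = \left|R_{AP} - R_{REF}\right| = 2R_{P}
$$

$$
R_{AP} - R_{P} = 4R_{P}
$$

Under these assumptions, comparing two differentially programmed MTJs doubles the nominal resistance difference seen by the sense amplifier relative to comparing one MTJ against the reference. The actual sensing margin depends on the resulting currents and voltages in the circuit, so this ratio is only indicative.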

|  |  |
| --- | --- |
| (a) | (b) |
| (c) | (d) |
| (e) | (f) |
| **Figure 17. Comparisons of proposed DSR and DDR STT-LUTs** | |

# CONCLUSION

In this paper, two new designs for STT-LUTs are presented that operate using dynamic voltage mode logic and outperform the conventional dynamic current mode STT-LUT in delay, active mode power, and robustness against process variations. The comprehensive analysis of the proposed designs and their comparison against the conventional design show that the proposed designs are able to offer higher fan-in STT-LUTs under given power, performance, and robustness constraints.

# ACKNOWLEDGMENTS

This research is funded by DARPA.

# REFERENCES

[1] C. Augustine et al., “Spin-Transfer Torque MRAMs for Low Power Memories: Perspective and Prospective,” IEEE Sensors Journal, vol. 12, no. 4, pp. 756–766, 2012.

[2] J. P. Wang et al., “Programmable spintronic logic devices for reconfigurable computation and beyond history and outlook,” J. Nano-Electron. Optoelectron., vol. 3, no. 1, pp. 12–23, Mar. 2008.

[3] M. Rasquinha et al., “An energy efficient cache design using Spin Torque Transfer (STT) RAM,” ACM/IEEE International Symposium on Low-Power Electronics and Design, pp. 389–394, 2010.

[4] D. Suzuki et al., “Fabrication of a nonvolatile lookup-table circuit chip using magneto/semiconductor-hybrid structure for an immediate-power-up field programmable gate array,” Symposium on VLSI Circuits, pp. 80–81, 2009.

[5] W. Zhao et al., “Spin Transfer Torque (STT)-MRAM-Based Runtime Reconfiguration FPGA Circuit,” ACM Transactions on Embedded Computing Systems, vol. 9, no. 2, 2009.

[6] M. Allam et al., “Dynamic current mode logic (DyCML): a new low-power high-performance logic style,” IEEE Journal of Solid-State Circuits, vol. 36, no. 3, pp. 550–558, Mar. 2001.

[7] Predictive Technology Model (PTM), <http://www.eas.asu.edu/~ptm>

[8] Z. Xu, C. Yang, M. Mao, K. B. Sutaria, C. Chakrabarti, Y. Cao, “Compact modeling of STT-MTJ devices,” Solid-State Electronics, vol. 102, pp. 76–81, Dec. 2014.

[9] S. Ikeda et al., “Tunnel magnetoresistance of 604% at 300 K by suppression of Ta diffusion in CoFeB/MgO/CoFeB pseudo-spin-valves annealed at high temperature,” Appl. Phys. Lett., vol. 93, pp. 082508-1–082508-3, 2008.

[10] Y. Taur and T. H. Ning, Fundamentals of Modern VLSI Devices. New York: Cambridge Univ. Press, 1998.

[11] E. Deng et al., “Design Optimization and Analysis of Multicontext STT-MTJ/CMOS Logic Circuits,” IEEE Trans. on Nanotechnology, vol. 14, no. 1, pp. 169–177, 2015.