

# Tiny Tate Bilinear Pairing Core Specification

Author: Homer Hsing homer.hsing@gmail.com

Rev. 0.1 April 29, 2012



This page has been intentionally left blank.



### **Revision History**

| Rev. | Date       | Author      | Description |
|------|------------|-------------|-------------|
| 0.1  | 04/28/2012 | Homer Hsing | First Draft |
|      |            |             |             |



# **Contents**

| 1  |
|----|
| 3  |
| 6  |
| 7  |
| 10 |
|    |
| 13 |
|    |



## Introduction

Tiny Tate Bilinear Pairing core is for calculating a special type of Tate bilinear pairing called reduced  $\eta_T$  pairing. Its features are:

- ✓ super-singular elliptic curve  $E: y^2 = x^3 x + 1$
- ✓ the field is the Galois field  $GF(3^m)$ , m = 97
- ✓ the irreducible polynomial is  $x^{97} + x^{12} + 2$
- ✓ vendor independent code
- ✓ very low hardware cost ( $\leq$  0.2 US dollar)

#### Mathematical background

The elliptic curve  $E: y^2 = x^3 - x + 1$  contains a large cyclic subgroup of prime order l. Also l divides  $3^{6m} - 1$  and not any  $3^{j \times m} - 1, j < 6$ , and  $l^2$  does not divide  $\#E^{[2]}$  Now  $E(GF(3^m))$  contains an l-torsion group  $E[l](GF(3^m))$ , i.e., the set of elements P of  $E(GF(3^m))$  satisfying  $l \cdot P = 0$ , where 0 is the point at infinity.  $[^{4]}E(GF(3^{6m}))$  also contains an l-torsion group  $E[l](GF(3^{6m}))$ . The Tate bilinear pairing (of order l) is a bilinear map between  $E[l](GF(3^m))$  and  $E[l](GF(3^{6m}))$  to an element of the multiplicative group  $GF(3^{6m})^*$ ,  $[^{2}]$  i.e.

$$e_l: E[l](GF(3^m)) \times E[l](GF(3^{6m})) \to GF(3^{6m})^*$$
 (1)

Let  $P \in E[l](GF(3^m))$ ,  $Q \in E[l](GF(3^{6m}))$ , let  $f_{l,P}$  be a rational function on the curve with divisor l[P] - l[O], there exists a divisor  $D_Q$  equivalent to [Q] - [O], with a support disjoint from the support of  $f_{l,P}$ . Then the Tate pairing of order l is defined by

$$e(P,Q) = f_{l,p}(D_Q)^{(3^{6m}-1)/l}$$
 (2)

It is necessary to raise the value on the right hand side to the power  $\epsilon = (3^{6m} - 1) / l$  to obtain a unique value in  $GF(3^{6m})$ . Consider  $P = (x_1, y_1), Q = (x_2, y_2) \in E[l](GF(3^m))$  i.e.  $x_1, y_1, x_2, y_2 \in GF(3^m)$  and define a distortion map  $\phi$  as

$$\phi(Q) = \phi(x_2, y_2) = (\rho - x_2, \sigma y_2) \tag{3}$$



, where  $\rho, \sigma \in GF(3^{6m})$  satisfying  $\rho^3 - \rho - 1 = 0$  and  $\sigma^2 + 1 = 0$ . Then one can define the modified Tate pairing  $\hat{e}$  for all points  $P, Q \in E[l](GF(3^m))$  as

$$\hat{e}(P,Q) = e_l(P,\phi(Q))^{\epsilon} = \tau \in GF(3^{6m})^*$$
(4)

Generally speaking, the modified Tate pairing is a transformation that takes in two elements in the elliptic curve point group and outputs a nonzero element in the extension field  $GF(3^{6m})$ .

The modified Tate pairing performs in two stages: [2]

Stage 1: calculation of  $\phi(Q) \in E[l](GF(3^{6m}))$  in the rational function  $f_P$  on  $E(GF(3^m))$  such that  $div(f_P) \sim l[P] - l[O]$ , i.e.  $e_l(P, \phi(Q)) = f_P(\phi(Q)) = t \in GF(3^{6m})^*$ . This is by Algorithm 1 (Modified Duursma-Lee algorithm) below.

#### Algorithm 1 (Modified Duursma-Lee algorithm)

Input:  $P = (x_1, y_1), Q = (x_2, y_2) \in E[l](GF(3^m))$ 

Output:  $t = e_l(P, \phi(Q)) \in GF(3^{6m})^*$ 

1: 
$$t = 1 \in GF(3^{6m})^*$$
,  $\alpha = x_1$ ,  $\beta = y_1$ ,  $x = x_2^3$ ,  $y = y_2^3$ ,  $\mu = 0 \in GF(3^m)$ ,  $d = m \mod 3$ 

2: for i from 0 to m-1 do

3: 
$$\alpha = \alpha^9, \beta = \beta^9, \mu = \alpha + x + d$$

4: 
$$\gamma = -\mu^2 - \beta y \sigma - \mu \rho - \rho^2$$
,  $t = t^3 \gamma$ ,  $y = -y$ ,  $d = d - 1 \mod 3$ 

5: next *i* 

6: return *t* 

Stage 2: raising t to the power  $\epsilon_1 = \epsilon/3^{3m} = 3^{3m} - 1$ 

In [3], Barreto et al. introduced the  $\eta_T$  pairing, which improved above algorithm. The  $\eta_T$  pairing of two points  $P, Q \in E[l](GF(3^m))$  is

$$\eta_T(P,Q) = \begin{cases} f_{T,P}(\phi(Q)), & \text{if } \mu b = -1\\ f_{-T,-P}(\phi(Q)), & \text{if } \mu b = 1 \end{cases}$$

, where  $b=1, T=3^m-\#E\big(GF(3^m)\big)$  and  $\mu=1$  if  $m\equiv \pm 1 \pmod{12}, \mu=-1$  if  $m\equiv \pm 5 \pmod{12}$ . To assure the obtained value is unique, we have to compute the reduced  $\eta_T$  pairing defined as  $\eta_T(P,Q)^M$  where  $M=(3^{6m}-1)/\#E\big(GF(3^m)\big).^{[4]}$ 



## **Architecture**

#### Architecture of the core



Figure 1: architecture of the core

#### **Processing element**

This is a unified module for addition, subtraction, multiplication and cubing. It has four registers  $R_0$ ,  $R_1$ ,  $R_2$  and  $R_3$  supporting following functionalities.

$$\checkmark$$
  $R_3 \leftarrow R_1 + R_2$ 

$$\checkmark R_3 \leftarrow R_1 - R_2$$

 $\checkmark$   $R_3 \leftarrow (R_1)^{3^{times}}$ , where *times* is specified by instructions.

$$\checkmark R_3 \leftarrow R_0 \cdot R_1$$

The functionality depends on a control word. The control word  $\{c_0, c_1, ..., c_{10}\}$  for the processing module is as follows.

Table 1: the control word for the processing element

|                    | $c_0$ | $c_1$ | $c_2$ | $c_3$ | $c_4$ | $c_5$ | <i>c</i> <sub>6</sub> | <i>c</i> <sub>7</sub> | $c_8$ | $c_9$ | <i>c</i> <sub>10</sub> |
|--------------------|-------|-------|-------|-------|-------|-------|-----------------------|-----------------------|-------|-------|------------------------|
| set R <sub>0</sub> |       |       |       |       | 1     | 0     |                       |                       |       |       |                        |



| set R <sub>1</sub>       | 1 | 1 |   |   |   |   |   |   |   |   |   |
|--------------------------|---|---|---|---|---|---|---|---|---|---|---|
| $\operatorname{set} R_2$ |   |   | 1 | 1 |   |   |   |   |   |   |   |
| add/sub                  |   |   |   |   |   |   | 1 | 0 | 0 | 0 | 1 |
| cubic                    | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| mult                     | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |

#### **RAM**

This is a dual-port RAM storing 64 words which is 198 bits wide.

#### **ROM** (instruction memory)

This module stores instructions. Each instruction contains five fields:

- ✓ SRC1 (6 bits): address for the first operand
- ✓ SRC2 (6 bits): address for the second operand if available
- $\checkmark$  OP (2 bits): operation type
- ✓ TIMES (6 bits): how many times an operation should be repeated
- ✓ DEST (6 bits): address for the operation result

#### **Control unit**

The control unit contains a finite state machine. The states are as follows.

Table 2: the state of the control unit

| State             | Description                         | successor   |
|-------------------|-------------------------------------|-------------|
| START             | the default after reset             | READ_SRC1   |
| READ_SRC1 $(S_1)$ | for reading the first operator      | READ_SRC2   |
| READ_SRC2 $(S_2)$ | for reading the second operator     | DON or CALC |
| CALC              | for arithmetic                      | WAIT        |
| WAIT              | arithmetic result $\rightarrow R_3$ | WRITE       |
| WRITE             | $R_3 \to RAM$                       | READ_SRC1   |
| DON               | the default after the computation   | DON         |

Its output signals are as follows.

Table 3: output signals of the control unit

| arithmetic type | state | ram_a_addr | ram_b_addr    | pe_ctrl    |
|-----------------|-------|------------|---------------|------------|
| add/sub         | $S_1$ | src1       | op_add/op_sub | set R0, R1 |



| cubic | $S_2$ $S_1$ | src2<br>src1 | op_cubic | set R2<br>set R0,R1,R2 |
|-------|-------------|--------------|----------|------------------------|
|       | $S_2$       | src2         |          |                        |
| mult  | $S_1$       | src1         |          | set R1,R2              |
|       | $S_2$       | src2         | src2     | set R0                 |



# Interface

The Tiny Tate Bilinear Pairing core implements the signals shown in the table below.

Input signals are synchronous and sampled at the rising edge of the clock.

Output signals are driven by flip-flops, and not directly connected to input signals by combinational logic.

For signals wider than 1 bit, the range is most significant bit down to least significant bit.

**Table 4: Interface signals** 

| Signal name | Width | In/Out | Description                                                                                                                                         |
|-------------|-------|--------|-----------------------------------------------------------------------------------------------------------------------------------------------------|
| clk         | 1     | In     | clock                                                                                                                                               |
| reset       | 1     | In     | active high synchronous reset                                                                                                                       |
| sel         | 1     | In     | active high for reading/writing RAM from outside                                                                                                    |
| addr        | 6     | In     | RAM address                                                                                                                                         |
| W           | 1     | In     | active high for writing RAM; low for reading RAM                                                                                                    |
| update      | 1     | In     | active high for updating the I/O buffer                                                                                                             |
| ready       | 1     | In     | active high if it is ready to read/write                                                                                                            |
| i           | 1     | In     | serial input                                                                                                                                        |
| 0           | 1     | Out    | serial output whose value is valid when the <i>done</i> signal is asserted                                                                          |
| done        | 1     | Out    | The termination signal. It is inactive low after the <i>reset</i> signal asserted, and is active high only if the Tate pairing computation is done. |



# **Timing**

The entire computing process is

- ✓ write input to RAM
- ✓ wait for the end of the calculation, the time when the *done* signal is asserted
- ✓ read the result from RAM

#### Memory address of the input/output

Assume that the input is  $P=(x_P,y_P)$ ,  $Q=(x_Q,y_Q)$ , assume the output is  $\eta_T(P,Q)^M=T$ ,  $T=t_0+t_1\sigma+t_2\rho+t_3\rho\sigma+t_4\rho^2+t_5\rho^2\sigma$ , assume each of  $\{x_P,y_P,x_Q,y_Q,t_0,t_1,...,t_5\}$  is 198-bit wide, equal to the width of RAM. The memory address for input/output is defined below.

Table 5: memory address for input/output

| input/output | address |
|--------------|---------|
| $x_P$        | 3       |
| $y_P$        | 5       |
| $x_Q$        | 6       |
| $y_o$        | 7       |
| $t_0$        | 9       |
| $t_1$        | 10      |
| $t_2$        | 11      |
| $t_3$        | 12      |
| $t_4$        | 13      |
| $t_4 \ t_5$  | 14      |



#### Write to RAM

In order to write a value V[197:0] to a specified RAM address Addr[5:0], firstly let sel=1, addr=Addr[5:0], w=1, and comply with the timing below.

**Table 6: timing for writing to RAM** 

| clock cycle | signal name/value |        |        |  |  |
|-------------|-------------------|--------|--------|--|--|
|             | update            | ready  | i      |  |  |
| T           | 1                 | 0 or 1 | 0 or 1 |  |  |
| T+1         | 0                 | 1      | V[0]   |  |  |
| T+2         | 0                 | 1      | V[1]   |  |  |
|             |                   |        |        |  |  |
| T + 198     | 0                 | 1      | V[197] |  |  |

Then let w = 0.

When write the RAM, the *reset* signal should be asserted and keep asserted until the termination of writing.

#### Wait for the end of the computation

Wait until the signal *done* is active.

#### **Read from RAM**

In order to read from a specified RAM address Addr[5:0], firstly let sel=1, addr=Addr[5:0], w=0, and comply with the timing below.

**Table 7: timing for reading from RAM** 

| clock cycle | signal name/value |        |              |  |  |
|-------------|-------------------|--------|--------------|--|--|
|             | update            | ready  | 0            |  |  |
| T           | 1                 | 0 or 1 | 0 or 1       |  |  |
| T+1         | 0                 | 1      | V[0]         |  |  |
| T+2         | 0                 | 1      | <i>V</i> [1] |  |  |
|             |                   |        |              |  |  |
| T + 198     | 0                 | 1      | V[197]       |  |  |

Then keep w = 0.



You may use a 198 bits wide register to record V.



# **FPGA Implementation**

#### Synthesis results (ISE)

**Table 8: Synthesis result (ISE)** 

| Device                     | Xilinx Spartan 3 XC3S200-5PQ208 |
|----------------------------|---------------------------------|
| Number of Slice Flip Flops | 1,319                           |
| Number of 4 input LUTs     | 3,028                           |
| Number of occupied Slices  | 1,730                           |
| Number of bonded IOBs      | 15                              |
| Minimum period             | 10.455ns                        |
| Maximum Frequency          | 95.6MHz                         |

<sup>\*</sup> Synthesis tool is Xilinx ISE 13.4.

#### **Synthesis results (Quartus)**

**Table 9: Synthesis result (Quartus)** 

| Device                    | Altera Cyclone II EP2C20F484C7 |
|---------------------------|--------------------------------|
| Total logic elements      | 3,637                          |
| Dedicated logic registers | 1,310                          |
| Total memory bits         | 25,984                         |
| Total pins                | 15                             |

<sup>\*</sup> Synthesis tool is Altera Quartus II 11.1.

#### **Speed**

The core computes one Tate pairing in 1.05 milliseconds if with a 50MHz clock.

<sup>\*</sup> Related Xilinx synthesis configuration files are in the directory synthesis/xilinx.



#### **Hardware price**

The core applies to Xilinx Spartan 3 XC3S200. The price of that FPGA <sup>[5]</sup> is less than 0.2 US dollar per piece.

#### **Compared to Tate Bilinear Pairing core**

**Table 10: Compared to Tate Bilinear Pairing core** 

|                            | Tiny Tate Bilinear Pairing core | Tate Bilinear Pairing core |
|----------------------------|---------------------------------|----------------------------|
| Device                     | Xilinx Spartan 3 XC3S200        | Xilinx Virtex 4 XC4VLX200  |
| Number of Slice Flip Flops | 1,319                           | 31,383                     |
| Number of 4 input LUTs     | 3,028                           | 47,083                     |
| Number of occupied Slices  | 1,730                           | 30,149                     |
| Computation time           | 1.02ms                          | 0.76 ms                    |



## **Test bench**

The file "testbench/simulation.do" is a batch file for ModelSim to compile the HDL files, setup the wave file, and begin function simulation. In order to make it work properly, the working directory of ModelSim must be the directory of "testbench".

The file "testbench/test\_pairing.v" is the main test bench for the Tate Bilinear Pairing core. The test bench is self-checked. It feeds input data to the core and compares the correct result with the output of the core. If the output is wrong, the test bench will display an error message.

If the function of the core is wrong, some other self-checking test benches in the "testbench" directory may help. The object of each test bench is listed in the table below.

Table 11: the object being tested by each test bench

| File name    | the object being tested                                                        |
|--------------|--------------------------------------------------------------------------------|
| test_const.v | the module outputting constant and the control word for the processing element |
| test_fsm.v   | control unit                                                                   |
| test_pe.v    | processing element                                                             |
| test_ram.v   | RAM                                                                            |
| test_tiny.v  | the module computing the Tate bilinear pairing                                 |



## References

- [1] I.Duursma, H.S.Lee. Tate pairing implementation for hyper-elliptic curves  $y^2 = x^p x + d$ .
- [2] T.Kerins, W.P.Marnane, E.M.Popovici, and P.S.L.M.Barreto. Efficient hardware for the Tate pairing calculation in characteristic three.
- [3] P.Barreto, S.Galbraith, C.O hEigeartaigh, and M.Scott. Efficient pairing computation on supersingular abelian varieties.
- [4] J.Beuchat, N.Brisebarre, J.Detrey, E.Okamoto, M.Shirase, and T.Takagi. Algorithms and arithmetic operators for computing the  $\eta_T$  pairing in characteristic three.
- [5] Price of Xilinx Spartan 3 FPGA. http://www.alibaba.com/showroom/xc3s200.html