



# Data Cache Prefetching: Final Report

Presenters: 邓开文, 赵龙征洋, 杜宇航

Team members: 田慧楠, 冯俊杰, 王一鸣, 张一凡, 吴柏宁, 冯昭栋, 张乐然, 金宇坤

2024/6/7







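- Basic tasks (required, difficulty: 3 stars):
  - 1. Understand the role and operating principles of cache prefetching in multi-level cache hierarchies, and survey how advanced processors apply it.
  - 2. Read the source code of the XuanTie C910 cache prefetch unit (`ct_lsu_pfu.v`) and write up how the C910 prefetch unit works.
  - 3. Implement the prefetch unit's hardware structure in Chisel, replace the `ct_lsu_pfu.v` file in the XuanTie source with the generated Verilog, and run CoreMark correctly on the open-source XuanTie platform.
- Challenge tasks (optional, difficulty: 6 stars):
  - 1. Survey more advanced prefetching algorithms from top conferences, compare them, and produce an insight report.
  - 2. Choose one prefetching algorithm, implement it in hardware, and compare its PPA with the original prefetch unit.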









[Figure: repository commit history, May 2024, with commits by wym-FDU, Janleron, and fzd]







## 2 Prefetching Techniques










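- Prefetching approaches: software prefetching, hardware prefetching
  - Software prefetching: real-time, non-real-time
  - Hardware prefetching: instruction prefetching, data prefetching
- Prefetching techniques: stream-based, address-correlated, spatially correlated, operation-correlated
- Advanced applications: programmable prefetching (POWER7), self-adaptive hardware prefetching (Xeon Phi), ...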

#### **POWER7** prefetcher

The prefetcher is programmable: the user can set several parameters, including

- *prefetch depth*: how many lines ahead to prefetch
- *prefetch on stores*: whether store operations trigger prefetches
- *stride-N*: whether to prefetch streams with a stride larger than one cache block
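To make the three knobs concrete, here is a minimal sketch of a stream prefetcher whose behavior is gated by them. The class name `PrefetchConfig`, the function `prefetch_candidates`, the default values, and the 128-byte line size are assumptions for illustration; this is not the POWER7 programming interface.

```python
from dataclasses import dataclass

LINE_BYTES = 128  # assumed cache-line size for this sketch


@dataclass
class PrefetchConfig:
    """Hypothetical knobs mirroring the three parameters listed above."""
    depth: int = 4                    # how many lines ahead to prefetch
    prefetch_on_stores: bool = False  # whether stores may trigger prefetches
    stride_n: bool = False            # whether strides larger than one line are followed


def prefetch_candidates(addr, prev_addr, is_store, cfg):
    """Return the byte addresses of the lines a simple stream prefetcher would request."""
    if is_store and not cfg.prefetch_on_stores:
        return []
    line, prev_line = addr // LINE_BYTES, prev_addr // LINE_BYTES
    stride = line - prev_line
    if stride == 0:
        return []  # same line as before: nothing new to learn
    if abs(stride) > 1 and not cfg.stride_n:
        return []  # stride-N streams are ignored unless enabled
    return [(line + i * stride) * LINE_BYTES for i in range(1, cfg.depth + 1)]


# A load that moves from line 0 to line 1 with depth 2 requests lines 2 and 3.
print(prefetch_candidates(128, 0, False, PrefetchConfig(depth=2)))
```

With the defaults, a store or a two-line stride produces no prefetch; enabling `prefetch_on_stores` or `stride_n` changes that.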



#### 3 Prefetching on Intel Xeon Phi[3]

#### 3.2 Self-Adaptive hw prefetching

When the compiler runs at -O2 or -O3:

- Software prefetching is enabled.
- The programmer can still intervene manually.
- If software prefetching works well (it covers the majority of accesses, or reduces L2 demand misses), the hardware prefetcher is rarely triggered aggressively, i.e., it throttles itself.



Fig. 1. Intel(c) Xeon Phi(c) coprocessor block diagram

| HW only | SW only | HW+SW |
|---------|---------|-------|
| 2.70    | 2.63    | 2.77  |
| 0.37    | 0.45    | 0.42  |
| 0.34    | 0.92    | 0.95  |
| 0.29    | 0.33    | 0.48  |





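## 3 C910 PFU Architecture

- RTL code size: 10 files, 6,896 lines
  - Code characteristics: generated from .vp files; heavy use of abbreviations and continuous assignments; poor readability
- Modes: global prefetch mode + multi-stream prefetch mode, running in parallel with arbitration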





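## 3 C910 PFU Architecture

- Global prefetch mode
  - Predicts addresses for the memory-access instructions in the pipeline.
  - GSDB analyzes the stride and maintains a confidence value; GPFB computes the prefetch address.
  - Very effective for dense matrix and array accesses.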
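The stride-plus-confidence idea can be sketched in a few lines. This is a generic model under stated assumptions, not the C910 logic: the class name `StrideDetector`, the threshold of 2, and the saturation value of 3 are all hypothetical.

```python
class StrideDetector:
    """Tracks one address stream: learns a stride and a saturating confidence."""

    def __init__(self, threshold=2, max_conf=3):
        self.last_addr = None
        self.stride = 0
        self.conf = 0
        self.threshold = threshold  # assumed confidence needed before prefetching
        self.max_conf = max_conf    # assumed saturation value of the counter

    def access(self, addr):
        """Observe an address; return a prefetch address, or None if not confident."""
        if self.last_addr is not None:
            new_stride = addr - self.last_addr
            if new_stride != 0 and new_stride == self.stride:
                self.conf = min(self.conf + 1, self.max_conf)
            else:
                # Stride changed: relearn it and drop the confidence.
                self.stride = new_stride
                self.conf = 0
        self.last_addr = addr
        if self.conf >= self.threshold:
            return addr + self.stride
        return None


detector = StrideDetector()
for a in (0, 64, 128, 192):
    print(a, detector.access(a))
```

In this sketch the first two repeats of a stride only build confidence; the prefetch address is emitted once the same stride has been confirmed twice.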







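## 3 C910 PFU Architecture

- Multi-stream prefetch mode
  - Predicts addresses for the several memory-access instructions that most recently missed in the cache.
  - PMB records and identifies the instructions under observation, SDB analyzes the stride and maintains a confidence value, and PFB computes the prefetch address.
  - Very effective for interleaved strided sequences, such as matrix-vector multiplication.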
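The value of tracking streams per instruction shows up when two strided streams interleave. The sketch below keeps one stride entry per missing load PC; the class `MultiStreamPrefetcher`, its table size, and its threshold are assumptions for illustration and do not reproduce the PMB/SDB/PFB implementation.

```python
from collections import OrderedDict


class MultiStreamPrefetcher:
    """Keeps a small LRU table of missing load PCs, each with its own stride state."""

    def __init__(self, entries=8, threshold=2):
        self.table = OrderedDict()  # pc -> [last_addr, stride, confidence]
        self.entries = entries      # assumed number of tracked streams
        self.threshold = threshold  # assumed confidence needed before prefetching

    def on_miss(self, pc, addr):
        """Record a cache miss by load `pc`; allocate an entry if it is not tracked yet."""
        if pc not in self.table:
            if len(self.table) >= self.entries:
                self.table.popitem(last=False)  # evict the least recently used stream
            self.table[pc] = [addr, 0, 0]

    def on_access(self, pc, addr):
        """Train on an access by a tracked PC; return a prefetch address or None."""
        entry = self.table.get(pc)
        if entry is None:
            return None
        self.table.move_to_end(pc)
        last, stride, conf = entry
        new_stride = addr - last
        if new_stride != 0 and new_stride == stride:
            conf += 1
        else:
            stride, conf = new_stride, 0
        self.table[pc] = [addr, stride, conf]
        return addr + stride if conf >= self.threshold else None


pf = MultiStreamPrefetcher()
pf.on_miss("A", 0)
pf.on_miss("B", 1000)
for a, b in ((8, 1064), (16, 1128), (24, 1192)):
    print(pf.on_access("A", a), pf.on_access("B", b))
```

A single global detector would see the address sequence 8, 1064, 16, 1128, ... and never settle on a stride; keyed by PC, each stream trains independently.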









# 4 Chisel Conversion Report

赵龙征洋 2024/06/07





## 4 Chisel Conversion Report





```
VCUNT SIM: CoreMark 1.0 : 6.367926 (iterations/sec)/MHz
2K performance run parameters for coremark.
ERROR: ee_ptr_int is not a datatype that holds an int pointer!
ERROR: Please modify the datatypes in core_portme.h!
CoreMark Size    : 666
Total ticks
Total time (secs): 42
Iterations/Sec
Iterations
Compiler version : GCC10.2.0
Compiler flags   : -0
Memory location  : STACK
seedcrc          : 0xe9f5
[0]crclist       : 0xe714
[0]crcmatrix     : 0x1fd7
[0]crcstate      : 0x8e3a
[0]crcfinal      : 0x72be
Errors detected

    * simulation finished successfully

$finish called from file "../logical/tb/tb.v", line 273.
$finish at simulation time          35040550000
          VCS Simulation Report
Time: 3504055000000 fs
CPU Time:    218.550 seconds;       Data structure size:  73.9Mb
```

CoreMark result of the original source code on the server

## 4 Chisel Conversion Report





```
Notice: timing checks disabled with +notimingcheck at compile-time
Chronologic VCS simulator copyright 1991-2018
Contains Synopsys proprietary information.
Compiler version O-2018.09-SP2 Full64; Runtime version O-2018.09-SP2 Full64; May 31 01:04 2024
        ******* Init Program *******
        ******* Wipe memory to 0 *******
        ******* Read program *******
        ******* Load program to memory *******
ERROR! Please define ee_ptr_int to a type that holds a pointer!
VCUNT SIM: CoreMark has been run 2 times, one times cost 156706 cycles !
VCUNT SIM: CoreMark 1.0 : 6.381377 (iterations/sec)/MHz
2K performance run parameters for coremark.
ERROR: ee_ptr_int is not a datatype that holds an int pointer!
ERROR: Please modify the datatypes in core_portme.h!
CoreMark Size    : 666
Total ticks
Total time (secs): 42
Iterations/Sec   : 0
Iterations       : 2
Compiler version : GCC10.2.0
Compiler flags   : -0
Memory location  : STACK
seedcrc          : 0xe9f5
[0]crclist       : 0xe714
[0]crcmatrix     : 0x1fd7
[0]crcstate      : 0x8e3a
[0]crcfinal      : 0x72be
Errors detected
*************
    * simulation finished successfully
*************
```

Simulation logs for gpfb, the PFU top level, and sdb entry (one of three shown).

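- All three scores are about 6.38: two runs at 6.381377 and one at 6.383087.
- We have not yet analyzed why these scores are higher than the original's 6.367926. Doing so requires comparing the generated Verilog with the original Verilog; our guess is that Chisel optimized the original logic and trimmed redundant parts.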
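The size of the gap can be checked from the cycle counts in the logs. The numbers appear consistent with the score being one million divided by the cycles per iteration (157037 cycles with 6.367926, 156706 cycles with 6.381377); that relationship is inferred from the logged values, not stated by the testbench, so treat the sketch below as an assumption.

```python
def coremark_per_mhz(cycles_per_iteration):
    """Iterations per second per MHz, i.e. iterations per million cycles (assumed relation)."""
    return 1_000_000 / cycles_per_iteration


original_cycles = 157037  # logged by the runs that score 6.367926
chisel_cycles = 156706    # logged by the runs that score 6.381377

original_score = coremark_per_mhz(original_cycles)
chisel_score = coremark_per_mhz(chisel_cycles)
speedup_percent = (original_cycles / chisel_cycles - 1) * 100

print(f"original: {original_score:.6f}")
print(f"chisel:   {chisel_score:.6f}")
print(f"speedup:  {speedup_percent:.2f}%")
```

The difference is 331 cycles per iteration, roughly 0.2%, so any explanation based on the generated Verilog has to account for a small but repeatable change rather than a large one.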



## 4 Chisel Conversion Report





```
ERROR! Please define ee_ptr_int to a type that holds a pointer!
VCUNT SIM: CoreMark has been run 2 times, one times cost 157037 cycles !
VCUNT SIM: CoreMark 1.0 : 6.367926 (iterations/sec)/MHz
2K performance run parameters for coremark.
ERROR: ee_ptr_int is not a datatype that holds an int pointer!
ERROR: Please modify the datatypes in core_portme.h!
CoreMark Size    : 666
Total ticks      : -1
Total time (secs): 42
Iterations/Sec   : 0
Iterations       : 2
Compiler version : GCC10.2.0
Compiler flags   : -0
Memory location  : STACK
seedcrc          : 0xe9f5
[0]crclist       : 0xe714
[0]crcmatrix     : 0x1fd7
[0]crcstate      : 0x8e3a
[0]crcfinal      : 0x72be
Errors detected
*************
    simulation finished successfully
*************
$finish called from file "../logical/tb/tb.v", line 273.
```

Simulation logs for pmb, sdb cmp, pfb, and gsdb (one of four shown).

All four CoreMark scores are 6.367926, identical to the score of the original source code.





## 4 Chisel Conversion Report







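### Some suggestions:

- A virtual machine is useful when software must run on Linux, or runs better on Linux than on Windows.
- For IDEA, a tutorial on setting up a Chisel project on Windows would help. Everyone in our group ended up installing a fresh copy of IDEA on Windows and cloning a Chisel template, so the provided virtual machine saw little use.
- We suggest shipping a virtual machine with a cracked copy of VCS and Verdi preinstalled, so that students need not keep asking the teaching assistant to upload files and can debug locally. Judging from one student who ran CoreMark in their own virtual machine, one CoreMark simulation takes about 10 minutes on a personal computer.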







# 5 Insights from Top-Conference Papers

杜宇航 2024/06/07











### Offset prefetching

On an access to line X, prefetch line X + D, where D is the prefetch offset.





### Late prefetches hurt performance









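### **Best-Offset prefetching**

- Tries to identify the single best offset.
- Introduces a new method for evaluating offsets, based on a Recent Requests (RR) table.
- To evaluate offset d on an access to line X, check whether X - d hits in the RR table; on a hit, increase the score of d. The offset with the highest score is selected.
- The score takes both coverage and timeliness into account.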
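The scoring scheme can be sketched as follows. This is a simplified model, not the published design: it tests every candidate offset on every access instead of one per access, uses a plain bounded queue instead of a tagged RR table, and models timeliness by inserting a request into the RR set only `LATENCY` accesses after it was issued. The names `BestOffsetLearner`, `record_completed`, and `LATENCY` are hypothetical.

```python
from collections import deque


class BestOffsetLearner:
    """Scores candidate offsets against a Recent Requests (RR) set."""

    def __init__(self, offsets=(1, 2, 3, 4), rr_capacity=256):
        self.offsets = list(offsets)
        self.scores = {d: 0 for d in self.offsets}
        self.rr = deque(maxlen=rr_capacity)  # lines whose requests have completed

    def record_completed(self, line):
        """Remember a line whose request was issued long enough ago to have completed."""
        self.rr.append(line)

    def train(self, line):
        """On an access to `line`, reward every offset d for which line - d is in RR."""
        for d in self.offsets:
            if line - d in self.rr:
                self.scores[d] += 1

    def best_offset(self):
        return max(self.offsets, key=lambda d: self.scores[d])


# Sequential stream; a request completes only LATENCY accesses after it was issued.
LATENCY = 2
learner = BestOffsetLearner()
for line in range(20):
    if line >= LATENCY:
        learner.record_completed(line - LATENCY)
    learner.train(line)

print(learner.scores, learner.best_offset())
```

Offset 1 never scores in this run because a prefetch issued one access earlier would still be in flight, which is how the RR lookup folds timeliness into the score.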







#### **Best-Offset prefetching**

- Prefetching and offset evaluation run at the same time.
- Prefetching is sometimes turned off.
- While prefetching is off, offset evaluation continues.











#### Hardware evaluation

- One score per offset in the list: 52 offsets × 5-bit scores → 260 bits
- RR table: 256 entries × 12-bit tags → 3072 bits
- 3 adders: 64 B lines, 2 MB pages → 15-bit adders
- Miscellaneous logic
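The budget figures above follow from simple arithmetic, reproduced here as a check. The variable names are ours; the constants are the ones listed on the slide.

```python
NUM_OFFSETS, SCORE_BITS = 52, 5
RR_ENTRIES, RR_TAG_BITS = 256, 12
LINE_BYTES, PAGE_BYTES = 64, 2 * 1024 * 1024

score_bits = NUM_OFFSETS * SCORE_BITS   # score storage
rr_bits = RR_ENTRIES * RR_TAG_BITS      # RR table storage

# A line index within a page needs log2(page / line) bits, which sets the adder width.
lines_per_page = PAGE_BYTES // LINE_BYTES
adder_bits = (lines_per_page - 1).bit_length()

total_bytes = (score_bits + rr_bits) / 8
print(score_bits, rr_bits, adder_bits, total_bytes)
```

Scores and RR table together come to 3332 bits, a little over 400 bytes, before counting the adders and miscellaneous logic.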

#### **Main Weakness**

Tradeoff between prefetch coverage and timeliness



### **Bingo Spatial Data Prefetcher [2]**





#### From TAGE to BINGO





TAGE-like prefetcher: uses several different tables to make a prediction

Bingo: uses a single history table, looked up with different indexes
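A minimal sketch of the single-table idea, under stated assumptions: each entry is indexed by a short event (PC + offset) and tagged with a long event (PC + address, written here as PC, region, and offset). A lookup prefers an exact long-event match and otherwise falls back to the most recent entry that shares the short event. The fallback policy, the class `BingoHistory`, and the integer bitmask footprints are simplifications for illustration, not the published design.

```python
class BingoHistory:
    """One table, indexed by the short event and tagged with the long event."""

    def __init__(self):
        # short event (pc, offset) -> list of (long event, footprint), newest last
        self.table = {}

    def train(self, pc, region, offset, footprint):
        """Store the footprint observed for a region whose first access was (pc, offset)."""
        short, long_event = (pc, offset), (pc, region, offset)
        ways = self.table.setdefault(short, [])
        ways[:] = [w for w in ways if w[0] != long_event]  # replace any older copy
        ways.append((long_event, footprint))

    def predict(self, pc, region, offset):
        """Return a footprint bitmask, or None if neither event has been seen."""
        ways = self.table.get((pc, offset), [])
        long_event = (pc, region, offset)
        for tag, footprint in ways:
            if tag == long_event:
                return footprint               # long-event match: most specific
        return ways[-1][1] if ways else None   # short-event match: most recent entry


history = BingoHistory()
history.train(0x400, region=1, offset=3, footprint=0b1010)
history.train(0x400, region=2, offset=3, footprint=0b0110)
print(bin(history.predict(0x400, 1, 3)), bin(history.predict(0x400, 9, 3)))
```

One storage structure serves both lookups because every long event contains the short event, which is what removes the need for a separate table per event type.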

### Bingo Spatial Data Prefetcher [2]





#### **Hardware Evaluation**



With a 16 K-entry history table, the total storage requirement is 119 KB, which accounts for 6% of the LLC.







Pythia [3] formulates prefetching as a reinforcement learning (RL) problem.

**Problem:** prior prefetchers lack inherent system awareness and lack in-silicon customizability.

**Pythia:**

- takes adaptive prefetch decisions using multiple features
- can be customized in silicon for target workloads
- proposes a practical hardware implementation of RL
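To show what casting prefetching as RL means in code, here is a minimal tabular sketch: the state is some feature of the current access, the action is a prefetch offset (0 meaning no prefetch), and a SARSA-style update moves the Q-value toward the observed reward. The class `QPrefetcher`, the action list, the learning parameters, and the reward value are assumptions for illustration and are not Pythia's configuration.

```python
import random


class QPrefetcher:
    """Tabular SARSA agent: state = a feature value, action = a prefetch offset."""

    def __init__(self, actions=(0, 1, 2, 4), alpha=0.5, gamma=0.5, epsilon=0.0, seed=0):
        self.actions = list(actions)  # offset 0 stands for "do not prefetch"
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.q = {}                   # (state, action) -> Q-value, default 0.0
        self.rng = random.Random(seed)

    def value(self, state, action):
        return self.q.get((state, action), 0.0)

    def choose(self, state):
        """Epsilon-greedy: usually the best-known offset, occasionally a random one."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.value(state, a))

    def update(self, s, a, reward, s_next, a_next):
        """SARSA update: Q(s,a) += alpha * (reward + gamma * Q(s',a') - Q(s,a))."""
        target = reward + self.gamma * self.value(s_next, a_next)
        self.q[(s, a)] = self.value(s, a) + self.alpha * (target - self.value(s, a))


agent = QPrefetcher()
# Hypothetical reward: offset 2 produced an accurate, timely prefetch in state "pc1".
agent.update("pc1", 2, 10.0, "pc1", 0)
print(agent.value("pc1", 2), agent.choose("pc1"))
```

The reward is where system awareness can enter: assigning different values to timely, late, and inaccurate prefetches changes what the agent learns without changing the hardware structure.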







#### Basics of RL and formulating prefetching as RL













#### **Hardware Evaluation and Performance**



| Structure | Description                                                                                                                                                                                       | Size    |
|-----------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------|
| QVStore   | <ul> <li># vaults = 2</li> <li># planes in each vault = 3</li> <li># entries in each plane = feature dimension (128) × action dimension (16)</li> <li>Entry size = Q-value width (16b)</li> </ul> | 24 KB   |
| EQ        | <ul> <li># entries = 256</li> <li>Entry size = state (21b) + action index (5b) + reward (5b) + filled-bit (1b) + address (16b)</li> </ul>                                                         | 1.5 KB  |
| Total     |                                                                                                                                                                                                   | 25.5 KB |
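The sizes in the table can be reproduced from the listed dimensions. The variable names are ours; the constants come from the table.

```python
VAULTS, PLANES = 2, 3
FEATURE_DIM, ACTION_DIM, Q_BITS = 128, 16, 16

# QVStore: vaults x planes x (feature dimension x action dimension) entries of 16 bits each.
qvstore_bytes = VAULTS * PLANES * FEATURE_DIM * ACTION_DIM * Q_BITS // 8

EQ_ENTRIES = 256
eq_entry_bits = 21 + 5 + 5 + 1 + 16  # state + action index + reward + filled bit + address
eq_bytes = EQ_ENTRIES * eq_entry_bits // 8

total_kb = (qvstore_bytes + eq_bytes) / 1024
print(qvstore_bytes / 1024, eq_bytes / 1024, total_kb)
```

Each EQ entry is 48 bits, so the queue is small next to the QVStore, which holds almost all of the 25.5 KB.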







#### Reference







- [1] P. Michaud, "Best-offset hardware prefetching," 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), Barcelona, Spain, 2016, pp. 469-480, doi: 10.1109/HPCA.2016.7446087.
- [2] M. Bakhshalipour, M. Shakerinava, P. Lotfi-Kamran and H. Sarbazi-Azad, "Bingo Spatial Data Prefetcher," 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), Washington, DC, USA, 2019, pp. 399-411, doi: 10.1109/HPCA.2019.00053.
- [3] R. Bera, K. Kanellopoulos, A. Nori, T. Shahroodi, S. Subramoney and O. Mutlu, "Pythia: A Customizable Hardware Prefetching Framework Using Online Reinforcement Learning," MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO '21), New York, NY, USA, 2021, pp. 1121-1137, doi: 10.1145/3466752.3480114.







## Thanks!

Good luck on the gaokao!

