# Your name:
Cindy Yang

In [1]:
import pandas as pd
import numpy as np

## Part 1: Single PE Modeling
We will start with a simple design consisting of a single PE as shown in the figure below. The PE contains a MAC unit to do multiplication and accumulation, and a scratchpad to store data locally for reuse. We also provide you with the loop nest for this single PE design in the figure below. Please find the necessary Accelergy/Timeloop descriptions at `designs/singlePE`.

<br>
<div class="row">
  <div class="column">
    <img align="left" src="designs/singlePE/figures/PE_arch.png" alt="PE Architecture" style="margin:100px 0px 30px 70px; width:35%">
  </div>
  <div class="column">
    <img  align="left"  src="designs/singlePE/figures/PE_loopnest.png" alt="PE Loopnest" style="width:40%">
  </div>
</div>

## Question 1 Introduction

### Question 1.1
Assuming you cannot reorder the provided loop nest, if you can only store one datatype (datatypes include *filter weights, input activations, output activations*) inside the PE scratchpad to maximize data reuse inside the PE, which datatype will you choose? In 1 or 2 sentences, explain why.

I would choose the filter weights. The four outermost loops iterate over the weight indices, so the computation is weight stationary: a weight kept in the PE scratchpad is reused across all iterations of the inner loops before it is swapped out.
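
To make the reuse argument concrete, here is a minimal sketch of a weight-stationary loop nest that counts how often a new weight is brought into the scratchpad versus how many MACs it serves. The loop order (weight indices `m, c, r, s` outermost) follows the answer above; the function name and the tiny loop bounds are hypothetical and chosen only for illustration.

```python
from itertools import product

def weight_stationary_counts(M, C, R, S, N, P, Q):
    """Count weight loads and MACs for a loop nest with weight indices outermost."""
    weight_loads = 0
    macs = 0
    for m, c, r, s in product(range(M), range(C), range(R), range(S)):
        weight_loads += 1  # weight (m, c, r, s) enters the scratchpad once
        for n, p, q in product(range(N), range(P), range(Q)):
            macs += 1      # the same weight is reused for every (n, p, q)
    return weight_loads, macs

# Hypothetical small bounds, not taken from any layer shape file.
loads, macs = weight_stationary_counts(M=2, C=2, R=3, S=3, N=1, P=4, Q=4)
print(loads, macs, macs // loads)  # 36 576 16
```

With these illustrative bounds, each weight is loaded once and then reused for all `N * P * Q = 16` inner iterations.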




### Question 1.2 
Take a look at the `designs/singlePE/arch/single_PE_arch.yaml` file. This file describes the hardware structure of the architecture. Please fill in the chart below:

*Hint: the operand registers of the MAC unit belong to the same memory level.*

In [5]:
# the Question 1.2 chart
d = {'# of memory levels (including DRAM and registers)': [3],   # DRAM, scratchpad, registers
     '  # of bits used to represent a data': [16],                # fill in your answer here
     '  size of local scrachpad (bytes)': [36],                   # fill in your answer here
    }
df = pd.DataFrame(data=d)
print(df.to_string(index=False, justify='center'))

 # of memory levels (including DRAM and registers)    # of bits used to represent a data    size of local scrachpad (bytes)
                        3                                            16                                  36                
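
The scratchpad size in bytes follows from its depth (number of entries) and its word width. A minimal sketch, assuming a depth of 18 entries; that depth is an assumption chosen to be consistent with the 36 bytes and 16-bit words in the chart, not a value quoted from the YAML file, and the function name is hypothetical.

```python
def scratchpad_bytes(depth, width_bits):
    """Capacity in bytes of a memory with `depth` entries of `width_bits` each."""
    total_bits = depth * width_bits
    if total_bits % 8 != 0:
        raise ValueError("capacity is not a whole number of bytes")
    return total_bits // 8

# depth=18 is an assumed value; width_bits=16 comes from the chart above.
print(scratchpad_bytes(depth=18, width_bits=16))  # 36
```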


Take a look at the compound component descriptions at `designs/singlePE/arch/components`. These files describe the hardware details of each component in the design. 

1. Are these compound components composed of a single subcomponent or multiple subcomponents?
   
   Each of these compound components is composed of a single subcomponent.
   

2. According to the description of the `mac_compute` compound component, is our architecture capable of performing floating-point computations? In 1 or 2 sentences, explain why.

    No. Its only subcomponent is of class `intmac`, which supports integer computations only.
    


### Question 1.3
The command below performs static hardware characterizations using **Accelergy**. You do not need to worry about the warning messages.

Examine the file `designs/singlePE/output/ERT.yaml`. Please fill in the chart below (**note that the implicit energy unit for the ERT is pJ**).

In [3]:
%%bash
cd designs/singlePE/
accelergy arch/ -o output

    _                _                      
   / \   ___ ___ ___| | ___ _ __ __ _ _   _ 
  / _ \ / __/ __/ _ \ |/ _ \ '__/ _` | | | |
 / ___ \ (_| (_|  __/ |  __/ | | (_| | |_| |
/_/   \_\___\___\___|_|\___|_|  \__, |\__, |
                                |___/ |___/ 

Info: generating outputs according to the following specified output flags... 
 Please use the -f flag to update the preference (default to all output files) 
{'ERT': 1, 'ERT_summary': 1, 'ART': 1, 'ART_summary': 1, 'energy_estimation': 1, 'flattened_arch': 1}
Info: Parsing file arch/single_PE_arch.yaml for architecture info 
Info: Parsing file arch/components/mac_compute.yaml for compound_components info 
Info: Parsing file arch/components/reg_storage.yaml for compound_components info 
Info: Parsing file arch/components/smart_storage.yaml for compound_components info 
Info: config file located: /home/workspace/.config/accelergy/accelergy_config.yaml 
config file content: 
 {'estimator_plug_ins': ['/usr/local/share/acce

In [4]:
# the Question 1.3 chart
d = {'DRAM read': [512],           # fill in your answer here
     ' DRAM write': [512],         # fill in your answer here
     ' scrachpad read': [0.226],     # fill in your answer here
     ' scrachpad write': [0.226],    # fill in your answer here
     ' mac compute': [3.275],        # fill in your answer here
    }
df = pd.DataFrame(data=d)
print(df.to_string(index=False, justify='center'))

 DRAM read   DRAM write   scrachpad read   scrachpad write   mac compute
   512         512           0.226             0.226           3.275    
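
To put these per-action energies in perspective, the short sketch below expresses each one as a multiple of a scratchpad read. The values are copied from the chart above; the dictionary name and layout are illustrative.

```python
# Per-action energies in pJ, copied from the Question 1.3 chart.
ert_pj = {
    "DRAM read": 512,
    "DRAM write": 512,
    "scratchpad read": 0.226,
    "scratchpad write": 0.226,
    "mac compute": 3.275,
}

base = ert_pj["scratchpad read"]
for action, energy in ert_pj.items():
    # Ratio of this action's energy to one scratchpad read.
    print(f"{action:>16}: {energy:8.3f} pJ = {energy / base:8.1f}x a scratchpad read")
```

A DRAM access costs roughly 2265 times as much as a scratchpad access here, which is why keeping reused data in the scratchpad matters.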


### Question 1.4 

Take a look at the `designs/singlePE/map/map.yaml` file. This file describes a mapping for a certain workload. By examining the mapping, can you tell the values of `M0`, `N0`, `C0`, `R`, `S`, `P`, `Q` in the loop nest above? For each of them, if you can, specify the value in the following chart; if you can't, state why in this cell. 

`R`, `S`, `P`, and `Q` cannot be determined from the mapping alone: their factors are set to 0 in `map.yaml`, so `R` and `S` depend on the weight tensor size, and `P` and `Q` depend on the workload size.

In [29]:
# the Question 1.4 chart, put down nan if you cannot tell what the value is 
d = {'M0': [2],   # fill in your answer here
     'N0': [1],   # fill in your answer here
     'C0': [1],   # fill in your answer here
     'S':  ['Factor set to 0, varies depending on weight tensor size'],   # fill in your answer here
     'R':  ['Factor set to 0, varies depending on weight tensor size'],   # fill in your answer here
     'P':  ['Factor set to 0, varies depending on workload size'],   # fill in your answer here
     'Q':  ['Factor set to 0, varies depending on workload size']    # fill in your answer here
    }
df = pd.DataFrame(data=d)
print(df.to_string(index=False, justify='center'))

 M0  N0  C0                            S                                                       R                                                    P                                                  Q                         
 2   1   1  Factor set to 0, varies depending on weight tensor size Factor set to 0, varies depending on weight tensor size Factor set to 0, varies depending on workload size Factor set to 0, varies depending on workload size


### Question 1.5
The command below performs a **Timeloop** runtime simulation of your design. **Accelergy** is queried as the backend to provide energy estimations for each simulated component, which is why you will also see Accelergy-related outputs (*e.g.,* `timeloop-model.ERT.yaml`).

1. Take a look at `timeloop-model.map.txt`. Can you now tell the dimensions of the layer shape by looking at the produced mapping? In 1 or 2 sentences, explain why. Then take a look at `timeloop-model.stats.txt` and fill in the following chart.

   Yes. The layer shape is (C=32, M=64, R=3, S=3, P=5, Q=5, N=2), because the Timeloop mapping shows the bound of every loop explicitly.

In [13]:
%%bash
cd designs/singlePE/
timeloop-model arch/*.yaml arch/components/*.yaml map/map.yaml ../../layer_shapes/small_layer.yaml

execute:/usr/local/bin/accelergy arch/single_PE_arch.yaml arch/components/mac_compute.yaml arch/components/reg_storage.yaml arch/components/smart_storage.yaml map/map.yaml ../../layer_shapes/small_layer.yaml --oprefix timeloop-model. -o ./ > timeloop-model.accelergy.log 2>&1
Generate Accelergy ART (area reference table) to replace internal area model.
Utilization = 1.00 | pJ/MACC =  392.023
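
The layer shape read from the mapping can be cross-checked against the cycle count: the product of the seven loop bounds is the total number of MACs, and with a single PE at a utilization of 1.00 it should match the 921600 cycles reported in the Question 1.5 chart. A minimal sketch, where the dictionary name is illustrative:

```python
from math import prod

# Loop bounds of small_layer as read from timeloop-model.map.txt (Question 1.5.1).
small_layer = {"C": 32, "M": 64, "R": 3, "S": 3, "P": 5, "Q": 5, "N": 2}

# One MAC per point in the 7-dimensional iteration space.
total_macs = prod(small_layer.values())
print(total_macs)  # 921600
```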




2. Run the simulation on another layer shape by running the command `timeloop-model arch/*yaml arch/components/*.yaml map/map.yaml ../../layer_shapes/medium_layer.yaml`. **Note that your previous outputs from Timeloop might be overwritten by this run if you don't move them to an output folder.**

   Fill in the second row in the chart below. Does the `pJ/MACC` value change? In 1 or 2 sentences, explain why. 

   Yes, the pJ/MACC decreased for medium_layer, which has a larger input/output fmap size than small_layer. Our architecture is weight stationary, so the energy spent on each weight is amortized across all the inputs it is applied to; a larger input/output feature map therefore results in a lower pJ/MACC.

In [19]:
%%bash
cd designs/singlePE/
timeloop-model arch/*.yaml arch/components/*.yaml map/map.yaml ../../layer_shapes/medium_layer.yaml

execute:/usr/local/bin/accelergy arch/single_PE_arch.yaml arch/components/mac_compute.yaml arch/components/reg_storage.yaml arch/components/smart_storage.yaml map/map.yaml ../../layer_shapes/medium_layer.yaml --oprefix timeloop-model. -o ./ > timeloop-model.accelergy.log 2>&1
Generate Accelergy ART (area reference table) to replace internal area model.
Utilization = 1.00 | pJ/MACC =  387.455




3. What's the benefit of allowing a factor of 0, e.g., R=0, in a mapping specification (*hint: we used the same `map.yaml` for 2 different layer shapes*)?

   It makes the mapping more flexible: a factor of 0 is filled in from the workload, so the same `map.yaml` can be reused for different layer shapes.
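
One way to picture a factor of 0 is as a placeholder for whatever remains of the workload bound once the explicit factors are accounted for. The helper below is a hypothetical illustration of that idea, not Timeloop's actual implementation; the function name and the example factor lists are assumptions.

```python
from math import prod

def resolve_zero_factor(factors, bound):
    """Replace a single 0 in `factors` so that the product equals `bound`."""
    zeros = factors.count(0)
    if zeros > 1:
        raise ValueError("at most one factor may be left as 0")
    known = prod(f for f in factors if f != 0)
    if zeros == 0:
        if known != bound:
            raise ValueError("explicit factors do not multiply to the bound")
        return list(factors)
    if bound % known != 0:
        raise ValueError("explicit factors do not divide the bound")
    return [bound // known if f == 0 else f for f in factors]

# The same factor list adapts to two different (illustrative) bounds.
print(resolve_zero_factor([0, 1], bound=5))   # [5, 1]
print(resolve_zero_factor([0, 2], bound=64))  # [32, 2]
```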
   

In [12]:
# the Question 1.5.1 and 1.5.2 chart
d = {'layer shape': ['small_layer', 'medium_layer'],
     '  number of cycles': [921600, 8294400],                # fill in your answer here
     '  mac energy total (pJ)': [3018240, 27164160],           # fill in your answer here
     '  scratchpad total energy (pJ)': [16662.53, 16662.53],    # fill in your answer here  
     '  DRAM total energy (pJ)':  [353484800, 3186081792],         # fill in your answer here  # hint: all datatypes
     '  pJ/MACC':  [392.02, 387.46]                         # fill in your answer here
    }
df = pd.DataFrame(data=d)
print(df.to_string(index=False, justify='center'))

layer shape     number of cycles    mac energy total (pJ)    scratchpad total energy (pJ)    DRAM total energy (pJ)    pJ/MACC
 small_layer        921600                3018240                     16662.53                     353484800          392.02  
medium_layer       8294400               27164160                     16662.53                    3186081792          387.46  
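
As a sanity check on the chart, the total MAC energy should equal the number of cycles (one MAC per cycle at a utilization of 1.00) times the per-MAC energy of 3.275 pJ from the Question 1.3 chart. The sketch below redoes that arithmetic; the variable names are illustrative.

```python
from math import isclose

MAC_PJ = 3.275  # per-MAC energy from the Question 1.3 chart

# (number of cycles, reported total MAC energy in pJ) from the chart above.
runs = {
    "small_layer": (921600, 3018240),
    "medium_layer": (8294400, 27164160),
}

for name, (cycles, mac_energy_pj) in runs.items():
    expected = cycles * MAC_PJ
    print(name, expected, isclose(expected, mac_energy_pj, rel_tol=1e-9))

# medium_layer performs 9 times as many MACs as small_layer.
print(runs["medium_layer"][0] // runs["small_layer"][0])  # 9
```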


**Now that you have an understanding of the input and output files of the tools, we would like you to write your own input files and feed them to the evaluation system.**


### Question 1.6

Many modern accelerator designs integrate address generators into their storages. The address generator is responsible for generating a sequence of read and write addresses for the memory, *i.e.,* for each read and write, the address is generated locally by the address generator. Typically, the address generator can be represented as an adder.

In this question, we would like you to update the compound component definition for the scratchpad to reflect the existence of such an additional address generator. To be specific:

    1. name of the address generator: address_generator
    2. class of the address generator: intadder
    3. attributes associated with the address generator: datawidth (hint: log2 function can be used), technology, latency
    4. you also need to specify the role your address generator plays when the storage is read and written
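
The `log2` hint reflects that an address generator only needs enough bits to index every entry of the memory. A minimal sketch of that relationship, assuming the width is rounded up to a whole number of bits; the function name and the depth of 18 are assumptions for illustration, not values taken from the YAML files.

```python
def address_width_bits(depth):
    """Bits needed to address `depth` entries, i.e. ceil(log2(depth))."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    # (depth - 1).bit_length() equals ceil(log2(depth)) for depth >= 2
    # and avoids floating-point rounding; a 1-entry memory still gets 1 bit.
    return max(1, (depth - 1).bit_length())

print(address_width_bits(18))  # 5
print(address_width_bits(16))  # 4
```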

Navigate to **`designs/singlePE_ag/arch/components/smart_storage.yaml`** to apply your updates...

1. After you have updated your architecture description, navigate to the design's root folder `designs/singlePE_ag` and run `accelergy arch/ -o output -v 1` (the command cell below). Examine the `ERT.yaml` and `ERT_summary_verbose.yaml` files in the output folder, and fill in the chart below. 


2. Without rerunning the Timeloop simulation for the `small_layer.yaml` workload, can you infer from the `ERT.yaml` how much more energy the local scratchpad will consume? In 1 or 2 sentences, explain why.

   provide your answer here...
   
  
3. If we have a huge workload and running simulations of it takes hours, how would using compound components help us when we perform design space exploration (*hint: can you avoid rerunning simulations when you change the details of a compound component*)?

   provide your answer here...
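
The kind of inference these questions point at can be sketched with numbers already in this notebook: dividing the scratchpad total energy for `small_layer` (Question 1.5 chart) by the per-access energy (Question 1.3 chart) gives the number of scratchpad accesses, and the extra energy is that count times the per-access address generation energy. The sketch assumes the address generator is used once per read or write; the function names are hypothetical, and the address generation energy is left as a parameter because no ERT value is available for it here.

```python
SCRATCHPAD_ACCESS_PJ = 0.226    # per read or write, from the Question 1.3 chart
SCRATCHPAD_TOTAL_PJ = 16662.53  # small_layer total, from the Question 1.5 chart

def scratchpad_accesses():
    """Number of scratchpad reads plus writes implied by the two chart values."""
    return round(SCRATCHPAD_TOTAL_PJ / SCRATCHPAD_ACCESS_PJ)

def added_energy_pj(address_gen_pj):
    """Extra energy if every access also triggers one address generation."""
    return scratchpad_accesses() * address_gen_pj

print(scratchpad_accesses())  # 73728
# 0.01 pJ is a made-up placeholder, not a value from any ERT.
print(added_energy_pj(0.01))
```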
   

In [41]:
%%bash
cd designs/singlePE_ag/
accelergy arch/ -o output -v 1

    _                _                      
   / \   ___ ___ ___| | ___ _ __ __ _ _   _ 
  / _ \ / __/ __/ _ \ |/ _ \ '__/ _` | | | |
 / ___ \ (_| (_|  __/ |  __/ | | (_| | |_| |
/_/   \_\___\___\___|_|\___|_|  \__, |\__, |
                                |___/ |___/ 

Info: generating outputs according to the following specified output flags... 
 Please use the -f flag to update the preference (default to all output files) 
{'ERT': 1, 'ERT_summary': 1, 'ART': 1, 'ART_summary': 1, 'energy_estimation': 1, 'flattened_arch': 1}


Traceback (most recent call last):
  File "/usr/local/bin/accelergy", line 33, in <module>
    sys.exit(load_entry_point('accelergy==0.3', 'console_scripts', 'accelergy')())
  File "/usr/local/lib/python3.6/dist-packages/accelergy/accelergy_console.py", line 72, in main
    raw_dicts = RawInputs2Dicts(raw_input_info)
  File "/usr/local/lib/python3.6/dist-packages/accelergy/raw_inputs_2_dicts.py", line 21, in __init__
    self.load_and_construct_dicts()
  File "/usr/local/lib/python3.6/dist-packages/accelergy/raw_inputs_2_dicts.py", line 66, in load_and_construct_dicts
    for loaded_content in loaded_content_list:
TypeError: 'NoneType' object is not iterable


CalledProcessError: Command 'b'cd designs/singlePE_ag/\naccelergy arch/ -o output -v 1\n'' returned non-zero exit status 1.

In [42]:
# Question 1.6 chart
d = {'read energy of the scratchpad (pJ)': [],  # fill in your answer here
     'write energy of the scratchpad (pJ)': [], # fill in your answer here
     'address generation energy (pJ)': []       # fill in your answer here
    }
df = pd.DataFrame(data=d)
print(df.to_string(index=False, justify='center'))

Empty DataFrame
Columns: [read energy of the scratchpad (pJ), write energy of the scratchpad (pJ), address generation energy (pJ)]
Index: []


### Question 1.7
So far, we have been focusing on studying the dataflow described in the provided loop nest above. In this question, we would like you to update the mapping to represent a new loop nest shown below. 

Navigate to **`designs/singlePE_os/map/map.yaml`**. Please set the bounds in the file according to the layer shape described in `layer_shapes/small_layer.yaml`  (**note that some of the inner bounds are set for you**) and **only keep outputs inside the scratchpad**.

After you have updated the mapping, go to `designs/singlePE_os/` and run the command `timeloop-model arch/*yaml arch/components/*.yaml map/map.yaml ../../layer_shapes/small_layer.yaml` to perform runtime simulations (run the command cell below). Please fill in the chart below:

<div class="row">
  <div class="column">
    <img align="center" src="designs/singlePE_os/figures/PE_loopnest.png" alt="PE Architecture" style="margin:0px 0px 70px 70px; width:50%">
  </div>
</div>

In [43]:
%%bash
cd designs/singlePE_os/
timeloop-model arch/*yaml arch/components/*.yaml map/map.yaml ../../layer_shapes/small_layer.yaml

execute:/usr/local/bin/accelergy arch/single_PE_arch.yaml arch/components/mac_compute.yaml arch/components/reg_storage.yaml arch/components/smart_storage.yaml map/map.yaml ../../layer_shapes/small_layer.yaml --oprefix timeloop-model. -o ./ > timeloop-model.accelergy.log 2>&1
Generate Accelergy ART (area reference table) to replace internal area model.
Utilization = 1.00 | pJ/MACC =  287.405


In [44]:
# the Question 1.7 chart
d = {'layer shape': ['small_layer'],    
     'number of cycles': [921600],          # fill in your answer here
     'mac Energy':  [3018240.00],               # fill in your answer here
     'scratchpad Energy (pJ)': [68704],    # fill in your answer here
     'DRAM Energy (pJ)': [261734400],          # fill in your answer here
     'pJ/MAC':[287.40]                      # fill in your answer here
    }
df = pd.DataFrame(data=d)
print(df.to_string(index=False, justify='center'))

layer shape  number of cycles  mac Energy  scratchpad Energy (pJ)  DRAM Energy (pJ)  pJ/MAC
small_layer       921600       3018240.0           68704              261734400      287.4 
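
To compare this mapping with the original one on the same `small_layer` workload, the sketch below computes the ratio of each energy figure between the Question 1.7 chart and the Question 1.5 chart. The numbers are copied from the two charts; the dictionary names are illustrative.

```python
# small_layer results copied from the Question 1.5 and Question 1.7 charts.
original = {"scratchpad_pj": 16662.53, "dram_pj": 353484800, "pj_per_mac": 392.02}
outputs_in_scratchpad = {"scratchpad_pj": 68704, "dram_pj": 261734400, "pj_per_mac": 287.40}

# Ratio of the new mapping to the original for each metric.
ratios = {key: outputs_in_scratchpad[key] / original[key] for key in original}
for key, ratio in ratios.items():
    print(f"{key}: {ratio:.3f}x of the original mapping")
```

The scratchpad is accessed more heavily with the new mapping, but the DRAM energy drops, and since DRAM dominates the total, the overall pJ/MAC drops with it.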
