# Memory Hierarchy

## 2 Procedure

### 2.1 Understanding the Program
Without delving into the details of the signal processing application, analyze the flow of the C program.
Observe the data access patterns and identify the critical sequence of accesses which may have a larger
impact on the performance of the system.

```
A: Accesses made while processing the macroblocks (L241 to L281).
```

### 2.3 Cache L1

#### 2.3.1 Theory of Cache

1. Explain the different types of cache misses: compulsory, capacity, and conflict.



```
Compulsory: Unavoidable misses on blocks that have never been loaded before, i.e. the first reference to a block after execution starts (cold start).
Capacity: Occur when a cache level cannot hold all the data that the execution needs, so blocks have to be replaced and later fetched again.
Conflict: Occur in direct-mapped or set-associative caches when a block must be stored in a location where another block already resides (both map to the same index), so the resident block has to be replaced.
```
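
The three categories can be told apart mechanically. The sketch below is a minimal illustration, not dineroIV's actual algorithm: it assumes a direct-mapped cache and classifies each miss by comparing it with a fully associative LRU cache of the same capacity. The function name `classify_misses` and the example addresses are hypothetical.

```python
from collections import OrderedDict


def classify_misses(addresses, cache_size, block_size):
    """Classify the misses of a direct-mapped cache with the 3C model."""
    num_blocks = cache_size // block_size
    seen = set()           # blocks referenced at least once
    direct = {}            # direct-mapped cache: index -> resident block
    lru = OrderedDict()    # fully associative LRU cache of the same capacity
    counts = {"compulsory": 0, "capacity": 0, "conflict": 0}

    for address in addresses:
        block = address // block_size

        # Reference model: fully associative cache with LRU replacement.
        fa_hit = block in lru
        if fa_hit:
            lru.move_to_end(block)
        else:
            if len(lru) == num_blocks:
                lru.popitem(last=False)  # evict the least recently used block
            lru[block] = True

        # Cache under study: direct-mapped.
        index = block % num_blocks
        dm_hit = direct.get(index) == block
        direct[index] = block

        if not dm_hit:
            if block not in seen:
                counts["compulsory"] += 1  # first reference ever
            elif not fa_hit:
                counts["capacity"] += 1    # even full associativity would miss
            else:
                counts["conflict"] += 1    # only the mapping caused the miss
        seen.add(block)
    return counts


# Two blocks that share index 0 evict each other in a 4-block cache.
print(classify_misses([0, 32, 0, 32], cache_size=32, block_size=8))
```

In the example call, the first two accesses are compulsory misses and the last two are conflict misses, because the cache has room for both blocks but maps them to the same line.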

2. Explain the different types of cache writing-policies.


```
There are two cache write policies:
Write-through: Every write is immediately propagated to memory, which makes writes slower.
Write-back: Writes are kept in the cache and are only propagated to memory when the corresponding block has to be evicted from the cache.
```
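
To make the difference concrete, the following sketch counts how many writes reach memory under each policy. It is a simplified model under stated assumptions: a direct-mapped cache, write-allocate on a miss, and a final flush of dirty blocks for write-back. The function `memory_writes` and the access list are hypothetical, and write-through counts individual writes while write-back counts block write-backs.

```python
def memory_writes(accesses, num_blocks, block_size, policy):
    """Count the writes that reach memory for a direct-mapped cache.

    accesses: iterable of (op, address), with op "r" (read) or "w" (write).
    policy:   "write-through" or "write-back".
    """
    lines = {}  # index -> (resident block, dirty flag)
    writes = 0
    for op, address in accesses:
        block = address // block_size
        index = block % num_blocks
        resident, dirty = lines.get(index, (None, False))
        if resident != block:
            # Miss: a dirty victim must be written back before being replaced.
            if policy == "write-back" and dirty:
                writes += 1
            resident, dirty = block, False  # write-allocate (assumption)
        if op == "w":
            if policy == "write-through":
                writes += 1   # every write goes straight to memory
            else:
                dirty = True  # memory is updated only on eviction
        lines[index] = (resident, dirty)
    if policy == "write-back":
        # Flush the blocks that are still dirty at the end (assumption).
        writes += sum(1 for _, is_dirty in lines.values() if is_dirty)
    return writes


accesses = [("w", 0)] * 4 + [("w", 64)]
print(memory_writes(accesses, num_blocks=4, block_size=16, policy="write-through"))
print(memory_writes(accesses, num_blocks=4, block_size=16, policy="write-back"))
```

Repeated writes to the same block are where write-back saves memory traffic: the four writes to address 0 cost a single write-back when the block is evicted.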

#### 2.3.2 Cache L1: dimension and block size

a) Consider a memory hierarchy composed of a single cache memory (L1), which interconnects the
SDRAM frame memory and the CPU.\
Considering the characteristics of the available memory devices (see Table 1), and the maximum total cost of the memory hierarchy, determine the maximum
storage space of cache L1.
- NOTES:
    - the size of any of the memory modules (frame buffer, any cache) must be an integer power of 2:
        - L1_size = $2^{MAX}$;
    - do not forget to consider the cost of the 128 kByte frame memory.

In [1]:
from math import pow

BUDGET = 0.02
PRICE_PER_MBYTE_L1 = 10
PRICE_PER_MBYTE_SDRAM = 0.01
SIZE_SDRAM = pow(2, 17)

def calculate_price(size_in_bytes, price_per_Mbyte):
    price_per_byte = price_per_Mbyte / pow(2,20)

    return price_per_byte * size_in_bytes


SDRAM_PRICE = calculate_price(SIZE_SDRAM, PRICE_PER_MBYTE_SDRAM)

print(f"SDRAM price: {SDRAM_PRICE}€")

final_budget = BUDGET - SDRAM_PRICE

print(f"Budget left for L1: {final_budget}€")

# Find the largest power of two whose price still fits in the remaining budget.
i = 0
while (calculate_price(pow(2, i), PRICE_PER_MBYTE_L1) < final_budget):
    i += 1
i -= 1

print(f"MAX is: {i}")
print(f"That is, L1_size = {pow(2, i)}B")

remaining_budget = final_budget - calculate_price(pow(2, i), PRICE_PER_MBYTE_L1)

print(f"Fun fact: {remaining_budget}€ of the budget is left over, so the total price is {BUDGET - remaining_budget}€")

SDRAM price: 0.00125€
Budget left for L1: 0.01875€
MAX is: 10
That is, L1_size = 1024.0B
Fun fact: 0.008984375€ of the budget is left over, so the total price is 0.011015625000000001€


b) Consider three different dimensions for the L1 data cache: L1_size $\in$ {$ 2^{MAX}, 2^{MAX-1}, 2^{MAX-2} $}.\
For each of these dimensions, and assuming a direct mapping configuration, use the dineroIV
simulator to evaluate the resulting average data miss-rate considering the following block sizes:
- Block_size $\in$ {$8, 16, 32, 64$}.\
Fill the following table with the obtained data:

$$
\begin{matrix}
& L1\_size = 2^{10} & L1\_size = 2^{9} & L1\_size = 2^{8}\\
Block size = 8\ Bytes & 0.0305 & 0.1247 & 0.1960\\
Block size = 16\ Bytes & 0.0363 & 0.1184 & 0.1829\\
Block size = 32\ Bytes & 0.0770 & 0.1492 & 0.2288\\
Block size = 64\ Bytes & 0.1181 & 0.2021 & 0.3340
\end{matrix}
$$
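
As a quick way to read the table, the sketch below picks the block size with the lowest miss rate for each cache size. The numbers are copied from the table above; the names `BLOCK_SIZES`, `MISS_RATES` and `best_block_size` are introduced here only for illustration.

```python
BLOCK_SIZES = [8, 16, 32, 64]  # bytes

# Miss rates from the table, one list per L1 size (in bytes),
# ordered by block size as in BLOCK_SIZES.
MISS_RATES = {
    2**10: [0.0305, 0.0363, 0.0770, 0.1181],
    2**9: [0.1247, 0.1184, 0.1492, 0.2021],
    2**8: [0.1960, 0.1829, 0.2288, 0.3340],
}


def best_block_size(miss_rates):
    """Return (block size, miss rate) for the lowest miss rate in the row."""
    best = min(range(len(miss_rates)), key=miss_rates.__getitem__)
    return BLOCK_SIZES[best], miss_rates[best]


best = {size: best_block_size(rates) for size, rates in MISS_RATES.items()}
for size, (block, rate) in best.items():
    print(f"L1_size = {size} B: best block size = {block} B (miss rate {rate})")
```

For the largest cache the smallest block wins, while the two smaller caches do slightly better with 16-byte blocks; in every case the miss rate rises again for 32- and 64-byte blocks.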

c) For each L1 cache size, plot the variation of the miss-rate with the size of the block. 

![Graph L1 cache size miss rate with size of the block](./plot1.png#gh-dark-mode-only)

d) By considering the obtained results, select two L1 cache configurations (dimension and block size)
that offer the best trade-off between the cost of the device and the resulting average miss-rate.\
Label in the previous plot the two configurations chosen.

In [2]:
# Miss rates from the table in b), ordered by block size (8, 16, 32 and 64 bytes).
MISS_RATES = [
    ("2¹⁰", 10, [0.0305, 0.0363, 0.0770, 0.1181]),
    ("2⁹", 9, [0.1247, 0.1184, 0.1492, 0.2021]),
    ("2⁸", 8, [0.1960, 0.1829, 0.2288, 0.3340]),
]

# Trade-off metric: L1 price weighted by the miss rate, plus the SDRAM price.
for label, exponent, miss_rates in MISS_RATES:
    print(f"Size = {label}")
    l1_price = calculate_price(pow(2, exponent), PRICE_PER_MBYTE_L1)
    for miss_rate in miss_rates:
        print(l1_price * miss_rate + SDRAM_PRICE)

Size = 2¹⁰
0.0015478515625
0.0016044921875
0.002001953125
0.0024033203125000003
Size = 2⁹
0.0018588867187500002
0.001828125
0.001978515625
0.0022368164062499998
Size = 2⁸
0.001728515625
0.001696533203125
0.00180859375
0.0020654296875


In [3]:
L1_CONFIG_1_COST = calculate_price(pow(2,10), PRICE_PER_MBYTE_L1) * 0.0305
L1_CONFIG_2_COST = calculate_price(pow(2,10), PRICE_PER_MBYTE_L1) * 0.0363

price_miss_rate_config_1 = L1_CONFIG_1_COST + SDRAM_PRICE # I do not agree that SDRAM_PRICE should be included
price_miss_rate_config_2 = L1_CONFIG_2_COST + SDRAM_PRICE # I do not agree that SDRAM_PRICE should be included

print(price_miss_rate_config_1)
print(price_miss_rate_config_2)

0.0015478515625
0.0016044921875


$$
\begin{matrix}
 & L1\ config \ 1\\
Cache\ size & 2^{10}\\
Block\ size & 2^{3}\\
Miss\ rate & 0.0305\\
Cost & 0.0015478515625
\end{matrix}
$$

$$
\begin{matrix}
 & L1\ config \ 2\\
Cache\ size & 2^{10}\\
Block\ size & 2^{4}\\
Miss\ rate & 0.0363\\
Cost & 0.0016044921875
\end{matrix}
$$

#### 2.3.3 Cache L1: set associativity

a) For each of the two L1 cache setups previously selected, evaluate the compulsory, capacity, conflict and total miss-rates when the following configurations are considered:
- set associativity of 1 (direct-mapped), 2, 4, 8.

Fill the following table with the obtained data:

In [4]:
# Miss counts per associativity: [total compulsory, total capacity, total conflict]
total_fetches = 12304896
l1_c1_n_ways_data = [
  [9504, 21010, 344230], # 1-way
  [9504, 21436, 408141], # 2-way
  [9504, 23465, 272], # 4-way
  [9504, 23401, 145], # 8-way
]
l1_c2_n_ways_data = [
  [4752, 13540, 428645], # 1-way
  [4752, 14715, 428356], # 2-way
  [4752, 14695, 0], # 4-way
  [4752, 14939, 0], # 8-way
]

def print_l1_n_way_table(total_fetches, data):
  # Convert the miss counts into miss rates (fractions of all fetches)
  data_fractions = [[count / total_fetches for count in way] for way in data]

  print(" Miss Rate |  1-way  |  2-way  |  4-way  |  8-way")
  print("--------------------------------------------------")
  for row, row_name in enumerate(("Compulsory", "  Capacity", "  Conflict")):
    row_data = [way[row] for way in data_fractions]
    print("{} | {:.5f} | {:.5f} | {:.5f} | {:.5f}".format(row_name, *row_data))
  # The total uses the raw counts, so it is not affected by the rounding of each row
  total_data = [sum(way) / total_fetches for way in data]
  print("     Total | {:.5f} | {:.5f} | {:.5f} | {:.5f}".format(*total_data))

print("L1 Config 1")
print_l1_n_way_table(total_fetches, l1_c1_n_ways_data)
print()
print("L1 Config 2")
print_l1_n_way_table(total_fetches, l1_c2_n_ways_data)

L1 Config 1
 Miss Rate |  1-way  |  2-way  |  4-way  |  8-way
--------------------------------------------------
Compulsory | 0.00077 | 0.00077 | 0.00077 | 0.00077
  Capacity | 0.00171 | 0.00174 | 0.00191 | 0.00190
  Conflict | 0.02798 | 0.03317 | 0.00002 | 0.00001
     Total | 0.03045 | 0.03568 | 0.00270 | 0.00269

L1 Config 2
 Miss Rate |  1-way  |  2-way  |  4-way  |  8-way
--------------------------------------------------
Compulsory | 0.00039 | 0.00039 | 0.00039 | 0.00039
  Capacity | 0.00110 | 0.00120 | 0.00119 | 0.00121
  Conflict | 0.03484 | 0.03481 | 0.00000 | 0.00000
     Total | 0.03632 | 0.03639 | 0.00158 | 0.00160


$$
\begin{matrix}
 &  & L1\ config\ 1 &  & \\
Miss\ rate & 1-way & 2-way & 4-way & 8-way\\
Compulsory & 0.00077 & 0.00077 & 0.00077 & 0.00077\\
Capacity   & 0.00171 & 0.00174 & 0.00191 & 0.00190\\
Conflict   & 0.02798 & 0.03317 & 0.00002 & 0.00001\\
Total      & 0.03045 & 0.03568 & 0.00270 & 0.00269
\end{matrix}
$$

$$
\begin{matrix}
 &  & L1\ config\ 2 &  & \\
Miss\ rate & 1-way & 2-way & 4-way & 8-way\\
Compulsory & 0.00039 & 0.00039 & 0.00039 & 0.00039\\
Capacity   & 0.00110 & 0.00120 & 0.00119 & 0.00121\\
Conflict   & 0.03484 & 0.03481 & 0.00000 & 0.00000\\
Total      & 0.03632 & 0.03639 & 0.00158 & 0.00160
\end{matrix}
$$

b) For each L1 cache setup, draw a plot with the variation of the obtained compulsory, capacity,
conflict and total miss-rates for the considered set associativity ways.

![Graph L1 cache size miss rate (detailed) with n-way](./plot2.png#gh-dark-mode-only)

c) Comment on the results above.

> In both configurations, the 1-way and 2-way caches suffer many _cache misses_ caused by conflicts (blue).
> This happens because the program repeatedly accesses addresses that end in the same combination of bits, that is, addresses that share the same _index_ in the cache.  
> With 4-way and 8-way associativity, however, the conflicts practically disappear, which lowers the _miss rate_ considerably.
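
The index sharing described above can be sketched in a few lines. This is a minimal illustration that assumes the usual mapping `set = (address // block_size) % num_sets`; the helper `set_index` and the example addresses are hypothetical and are not taken from the trace.

```python
def set_index(address, cache_size, block_size, ways):
    """Return the set that an address maps to in a set-associative cache."""
    num_sets = cache_size // (block_size * ways)
    return (address // block_size) % num_sets


CACHE_SIZE = 2**10  # bytes, as in the selected L1 configurations
BLOCK_SIZE = 16     # bytes
addresses = (0, 1024, 2048)  # hypothetical addresses one cache size apart

for ways in (1, 2, 4, 8):
    indexes = [set_index(a, CACHE_SIZE, BLOCK_SIZE, ways) for a in addresses]
    print(f"{ways}-way: sets {indexes}, lines available per set: {ways}")
```

All three addresses fall in set 0 for every associativity, but only caches with at least three ways per set can keep the three blocks resident at the same time. With one or two ways they keep evicting each other.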

d) Write the expression that provides the mean access time as a function of the L1 cache hit and miss rates, the L1 cache hit and miss access times, and the time penalty associated with each associativity level, as expressed in Table 1.\
Consider a non-blocking critical word-first load policy, where the bus occupancy rate has a lower impact on the performance of the cache.

> // TODO

In [5]:
def mean_acess_time(hit_rate, miss_rate, hit_time, miss_time, time_penalty_num_of_ways):
    # Every access pays the L1 lookup (hit time plus associativity penalty);
    # a miss additionally pays the miss time.
    return (hit_time + time_penalty_num_of_ways) * hit_rate + (hit_time + time_penalty_num_of_ways + miss_time) * miss_rate
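
Written as an expression, the function above corresponds to the following. The symbol names are assumptions introduced here: $h$ and $m = 1 - h$ are the L1 hit and miss rates, $t_{hit}$ and $t_{miss}$ are the hit and miss access times, and $t_{assoc}$ is the time penalty of the chosen associativity.

$$
\begin{aligned}
T_{mean} &= h\,(t_{hit} + t_{assoc}) + m\,(t_{hit} + t_{assoc} + t_{miss}) \\
         &= t_{hit} + t_{assoc} + m\,t_{miss}
\end{aligned}
$$

The second line follows from $h + m = 1$. For example, with $t_{hit} = 1.4$, $t_{assoc} = 1.4$ (4-way), $t_{miss} = 140$ and $m = 0.0016$, this gives $1.4 + 1.4 + 0.0016 \times 140 = 3.024$.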


e) Evaluate the mean access time of each configuration, considering the obtained miss-rates and the
time penalty associated to each associativity level.\
Evaluate the resulting cost function, as defined
in Eq. 1 (including the frame memory).\
Fill the following table with the obtained data:

In [6]:
from math import log2

TIME_PENALTY_1_WAY = 2 * 0.35 * log2(1)
TIME_PENALTY_2_WAY = 2 * 0.35 * log2(2)
TIME_PENALTY_4_WAY = 2 * 0.35 * log2(4)
TIME_PENALTY_8_WAY = 2 * 0.35 * log2(8)
TIME_PENALTIES = [TIME_PENALTY_1_WAY, TIME_PENALTY_2_WAY, TIME_PENALTY_4_WAY, TIME_PENALTY_8_WAY]

HIT_TIME = 2 * 0.7
MISS_TIME = 140

l1_c1_miss_rates = [0.0305, 0.0357, 0.0027, 0.0027]
l1_c2_miss_rates = [0.0363, 0.0364, 0.0016, 0.0016]

print("####### Access Time #######")
print("L1 config 1")
for i, miss_rate in enumerate(l1_c1_miss_rates):
    print(mean_acess_time(1 - miss_rate, miss_rate, HIT_TIME, MISS_TIME, TIME_PENALTIES[i]))
print("L1 config 2")
for i, miss_rate in enumerate(l1_c2_miss_rates):
    print(mean_acess_time(1 - miss_rate, miss_rate, HIT_TIME, MISS_TIME, TIME_PENALTIES[i]))
print("\n####### Price #######")
print("Config 1 and 2 price")
CONFIG_PRICE = calculate_price(pow(2, 10), PRICE_PER_MBYTE_L1) + SDRAM_PRICE
print(CONFIG_PRICE)
print("\n####### Cost Function #######")
print("L1 config 1")
for i, miss_rate in enumerate(l1_c1_miss_rates):
    print(mean_acess_time(1 - miss_rate, miss_rate, HIT_TIME, MISS_TIME, TIME_PENALTIES[i]) * CONFIG_PRICE)
print("L1 config 2")
for i, miss_rate in enumerate(l1_c2_miss_rates):
    print(mean_acess_time(1 - miss_rate, miss_rate, HIT_TIME, MISS_TIME, TIME_PENALTIES[i]) * CONFIG_PRICE)

####### Access Time #######
L1 config 1
5.67
7.097999999999999
3.1779999999999995
3.8779999999999992
L1 config 2
6.481999999999999
7.196
3.024
3.7239999999999993

####### Price #######
Config 1 and 2 price
0.011015625

####### Cost Function #######
L1 config 1
0.06245859374999999
0.07818890624999998
0.03500765624999999
0.04271859374999999
L1 config 2
0.07140328124999999
0.0792684375
0.03331125
0.04102218749999999


$$
\begin{matrix}
 &  & L1\ config\ 1 &  & \\
 & 1-way & 2-way & 4-way & 8-way\\
Miss\ rate & 0.0305 & 0.0357 & 0.0027 & 0.0027\\
Access\ time & 5.670 & 7.098 & 3.178 & 3.878\\
Price & & 0.011015625 & & \\
Cost\ function & 0.06246 & 0.07819 & 0.03501 & 0.04272
\end{matrix}
$$
$$
\begin{matrix}
 &  & L1\ config\ 2 &  & \\
 & 1-way & 2-way & 4-way & 8-way\\
Miss\ rate & 0.0363 & 0.0364 & 0.0016 & 0.0016\\
Access\ time & 6.482 & 7.196 & 3.024 & 3.724\\
Price & & 0.011015625 & & \\
Cost\ function & 0.07140 & 0.07927 & 0.03331 & 0.04102
\end{matrix}
$$

f) Draw conclusions:

> Our goal is to select the configuration with the lowest _cost function_.
> In both _config 1_ and _config 2_, the 4-way set-associative cache has the lowest _cost function_, which makes 4-way the optimal associativity.  
> Between the two, we should select L1 config 2 with 4 ways, since it has the lowest _cost function_ of all.

#### 2.3.4 Cache L1: write policy

a) By analyzing the sequence of memory accesses generated by the motion estimation algorithm (see
Fig. 3), select the best setup for the cache writing-policy: write-back versus write-through, write-allocate versus write-not-allocate.\
Justify. (Note that the number of writes is much smaller than
the number of reads.)

> // TODO

#### 2.3.5 Cache L1: final selection

a) By considering the obtained results, select the L1 cache setup that offers the best compromise
between the cost of the device and the resulting average access time.

$$
\begin{matrix}
 & L1\ config\\
Cache\ dimension & 2^{10}\\
Block\ size & 2^4\\
Associativity & 4-ways\\
Write\ policy & TODO\\
Miss\ rate & 0.0016\\
Access\ time & 3.024\\
Price & 0.01102\\
Cost\ function & 0.03331
\end{matrix}
$$

### 2.4 Cache L2

#### 2.4.1 Cache L2: dimension

a) Considering the maximum cost of the whole memory hierarchy, as well as the price of L1 cache
and the 128 kByte frame memory, determine the maximum storage space of L2 cache (an integer
power of 2), considering the characteristics of the available memory devices (see Table 1).

In [7]:
from math import pow

BUDGET = 0.02
PRICE_PER_MBYTE_L2 = 0.4
PRICE_PER_MBYTE_SDRAM = 0.01
L1_FINAL_PRICE = 0.01102
SIZE_SDRAM = pow(2,17)

def calculate_price(size_in_bytes, price_per_Mbyte):
    
    price_per_byte = price_per_Mbyte / pow(2,20)

    return price_per_byte * size_in_bytes


SDRAM_PRICE = calculate_price(SIZE_SDRAM,PRICE_PER_MBYTE_SDRAM)

print("SDRAM price: ", SDRAM_PRICE)

print("L1 Price: ", L1_FINAL_PRICE )

# NOTE: L1_FINAL_PRICE (0.01102) is the rounded price of L1 plus the frame memory,
# so SDRAM_PRICE is subtracted twice here. The budget is underestimated, but the
# resulting maximum L2 size (2^14) does not change.
l2_budget = BUDGET - (SDRAM_PRICE + L1_FINAL_PRICE)
print("Budget left for L2: ", l2_budget)


# Find the largest power of two whose price still fits in the remaining budget.
i = 0
while (calculate_price(pow(2,i),PRICE_PER_MBYTE_L2) < l2_budget):
    i += 1
i -= 1

print(f"Maximum L2 size is: 2^{i}")

l2_price = calculate_price(pow(2, i), PRICE_PER_MBYTE_L2)

print(f"Sanity check: budget surplus: {l2_budget - l2_price}")


SDRAM price:  0.00125
L1 Price:  0.01102
Budget left for L2:  0.007730000000000001
Maximum L2 size is: 2^14
Sanity check: budget surplus: 0.0014800000000000004


b) For the obtained maximum storage space for L2 cache, adopting a direct mapping configuration,
use dineroIV simulator to evaluate the resulting average data miss-rate considering the following
block sizes: (1 × L1_block), (2 × L1_block), (4 × L1_block) and (8 × L1_block).\
Fill the following table with the obtained data:

| Configuration | Block size (bytes) | Miss rate |
|---------------|--------------------|-----------|
| Block size = (1x L1_block) | 16 | 0.4773 |
| Block size = (2x L1_block) | 32 | 0.2386 |
| Block size = (4x L1_block) | 64 | 0.1193 |
| Block size = (8x L1_block) | 128 | 0.0598 |

> Miss rates obtained with `dineroIV -l1-usize 1k -l1-ubsize 16 -l1-uassoc 4 -l2-usize 16k -l2-ubsize <block-size> < trace.log`
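
The four runs differ only in the L2 block size, so the command lines can be generated rather than typed. The sketch below only builds the strings and does not execute anything. It reuses the flags exactly as they appear in the command above; the helper `l2_sweep_commands` and its parameters are hypothetical.

```python
L1_FLAGS = "-l1-usize 1k -l1-ubsize 16 -l1-uassoc 4"  # selected L1 setup
L1_BLOCK = 16  # bytes


def l2_sweep_commands(trace="trace.log", l2_size="16k", multipliers=(1, 2, 4, 8)):
    """Build one dineroIV command line per L2 block size (multiples of the L1 block)."""
    return [
        f"dineroIV {L1_FLAGS} -l2-usize {l2_size} -l2-ubsize {L1_BLOCK * m} < {trace}"
        for m in multipliers
    ]


commands = l2_sweep_commands()
for command in commands:
    print(command)
```

Each printed line can be pasted into a shell; the input redirection assumes the trace file is named `trace.log`, as in the command above.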

c) Plot the variation of the miss-rate with the size of the block. 

![Graph of the L2 cache miss rate versus block size](./l2-cache-plot.png#gh-dark-mode-only)

d) From the obtained results, select the block size that offers the best trade-off between the resulting
average miss-rate and the time penalty associated with each data fetch from the primary memory.\
Justify.

> L2 Block Size = 128B

> The time penalty is constant (the access time to the frame memory is always 140 ns and does not depend on the block size), so we choose the block size with the lowest miss rate.

#### 2.4.2 Cache L2

a) Evaluate the compulsory, capacity, conflict and total miss-rates for the direct-mapped L2 data
cache.\
Fill the following table with the obtained data:


Fractions of the L2 misses: 0.3741 (compulsory) and 0.6259 (capacity).

$$
\begin{matrix}
& Miss-rate\\
Compulsory & 0.02237118 \\
Capacity   & 0.03742882 \\
Conflict   & 0 \\
Total      & 0.0598
\end{matrix}
$$
Obtained with `dineroIV -l1-usize 1024 -l1-ubsize 16 -l1-uassoc 4 -l2-usize 16k -l2-ubsize 128 -l2-uccc < trace.log`
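
The per-category miss rates can be recovered from the total miss rate. The sketch below assumes that 0.3741 and 0.6259 are the fractions of L2 misses classified as compulsory and capacity, respectively, and multiplies them by the total miss rate of 0.0598. The variable names are introduced here for illustration.

```python
total_miss_rate = 0.0598

# Assumed breakdown of the L2 misses (fractions of all L2 misses).
fractions = {"compulsory": 0.3741, "capacity": 0.6259, "conflict": 0.0}

# Each component is its fraction of the total miss rate.
components = {kind: total_miss_rate * fraction for kind, fraction in fractions.items()}

for kind, rate in components.items():
    print(f"{kind:>10}: {rate:.8f}")
print(f"{'total':>10}: {sum(components.values()):.4f}")
```

Under this assumption the compulsory component is about 0.0224 and the capacity component about 0.0374, which together add up to the total of 0.0598.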

b) Plot the variation of the obtained compulsory, capacity, conflict and total miss-rate.

c) Write the expression which provides the mean access time as a function of the L1 and L2 cache
hit and miss rates, L1 and L2 cache hit and miss access
times, and the time penalty, as expressed in Table 1.

In [8]:
def mean_acess_time_L2(
    L1_hit_rate,
    L2_hit_rate,
    L1_miss_rate,
    L2_miss_rate,
    L1_hit_time,
    L2_hit_time,
    L1_miss_time,
    L2_miss_time,
    time_penalty_num_of_ways_L1,
    time_penalty_num_of_ways_L2
):

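    # An L1 hit costs the L1 lookup (hit time plus associativity penalty).
    # An L1 miss additionally costs the L2 lookup and, on an L2 miss, the L2 miss time.
    # L1_miss_time is unused: the cost of an L1 miss is given by the L2 terms.
    return (L1_hit_time + time_penalty_num_of_ways_L1) * L1_hit_rate + (
        (
            L1_hit_time
            + time_penalty_num_of_ways_L1
            + L2_hit_time
            + time_penalty_num_of_ways_L2
        )
        * L2_hit_rate
        + (
            L1_hit_time
            + time_penalty_num_of_ways_L1
            + L2_hit_time
            + time_penalty_num_of_ways_L2  # also paid on an L2 miss (zero for direct-mapped L2)
            + L2_miss_time
        )
        * L2_miss_rate
    ) * L1_miss_rate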
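
For the direct-mapped L2 used in this report the L2 associativity penalty is zero, and the function above reduces to a compact expression. The symbols are assumptions introduced here: $m_{L1}$ and $m_{L2}$ are the local miss rates, $t_{hit,L1}$ and $t_{hit,L2}$ the hit times, $t_{assoc,L1}$ the L1 associativity penalty, and $t_{miss,L2}$ the L2 miss time.

$$
T_{mean} = t_{hit,L1} + t_{assoc,L1} + m_{L1}\,(t_{hit,L2} + m_{L2}\,t_{miss,L2})
$$

This uses the fact that the hit and miss rates of each level add up to 1. With $t_{hit,L1} = 1.4$, $t_{assoc,L1} = 1.4$, $t_{hit,L2} = 7$, $t_{miss,L2} = 140$, $m_{L1} = 0.0016$ and $m_{L2} = 0.0598$, it gives $2.8 + 0.0016 \times (7 + 0.0598 \times 140) \approx 2.8246$.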

In [9]:
from math import log2

TIME_PENALTY_4_WAY = 2 * 0.35 * log2(4)
TIME_PENALTY_1_WAY = 10 * 0.55 * log2(1)
HIT_TIME_L1 = 2 * 0.7
MISS_TIME_L2 = 140
MISS_TIME_L1 = 0 # not used
HIT_TIME_L2 = 10*0.7

print(mean_acess_time_L2(1 - 0.0016, 1 - 0.0598, 0.0016, 0.0598, HIT_TIME_L1, HIT_TIME_L2,MISS_TIME_L1,MISS_TIME_L2 , TIME_PENALTY_4_WAY, TIME_PENALTY_1_WAY))

2.8245951999999996


d) Evaluate the mean access time provided by the chosen configuration, considering the obtained
miss-rate and the time penalty. Evaluate the resulting cost function, as defined in Eq. 1.\
Fill the following table with the obtained data:

$$
\begin{matrix}
Miss\ rate & \\
Access\ time & 2.8246\\
Price & \\
Cost\ function & \\
\end{matrix}
$$
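
The remaining entries can be derived from values already computed in this report. The sketch below is only a sanity check under stated assumptions: it takes the cost function of Eq. 1 to be the total price multiplied by the mean access time (as in the L1 evaluation), and it prices L1 at 2^10 bytes, L2 at 2^14 bytes and the frame memory at 2^17 bytes. The names `price`, `total_price` and `cost_function` are introduced here.

```python
MBYTE = 2**20


def price(size_in_bytes, price_per_mbyte):
    """Price of a memory module, given its size and the price per MByte."""
    return size_in_bytes * price_per_mbyte / MBYTE


sdram_price = price(2**17, 0.01)  # 128 kByte frame memory
l1_price = price(2**10, 10)       # selected L1 cache
l2_price = price(2**14, 0.4)      # maximum L2 cache

total_price = sdram_price + l1_price + l2_price
mean_access_time = 2.8245952      # from the two-level expression

# Assumption: Eq. 1 is price x mean access time.
cost_function = total_price * mean_access_time

print(f"Total price:   {total_price}")
print(f"Cost function: {cost_function:.5f}")
```

Under these assumptions the whole hierarchy costs about 0.01727, which stays below the 0.02 budget.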

### 2.5 Memory Hierarchy Configuration

a) By considering the obtained results, fill the following table with the selected characteristics for L1
and L2 cache memories, as well as the corresponding performance results of the overall memory
hierarchy.

$$
\begin{matrix}
 & Cache\ L1 & Cache\ L2 & Frame\ Memory\\
Dimension\ ( Bytes) & 2^{10} &  & 128\ *\ 1024\\
Block\ size\ ( Bytes) & 2^{4} &  & -\\
Associativity & 4-ways &  & -\\
Write\ policy & TODO &  & -\\
Local\ Miss\ rate\ ( \%) & 0.16 &  & -\\
Price\ ( € ) & 0.01102 &  & \\
Global\ Miss\ rate\ ( \%) &  &  & \\
Global\ access\ time\ ( ns) &  &  & \\
Total\ Price\ ( € ) &  &  & \\
Cost\ function\ ( € \ *\ ns) &  &  & 
\end{matrix}
$$
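
For the global miss rate row, a common definition is the product of the local miss rates of each level: an access reaches the frame memory only if it misses in L1 and then in L2. The sketch below assumes that 0.0016 and 0.0598 are local miss rates (misses relative to the accesses that reach each cache); `global_miss_rate` is a hypothetical helper.

```python
def global_miss_rate(local_miss_rates):
    """Fraction of all accesses that miss in every cache level."""
    rate = 1.0
    for local_rate in local_miss_rates:
        rate *= local_rate
    return rate


l1_local = 0.0016  # selected L1 configuration
l2_local = 0.0598  # direct-mapped L2 with 128-byte blocks

overall = global_miss_rate([l1_local, l2_local])
print(f"Global miss rate: {overall * 100:.6f} %")
```

Under this assumption roughly one access in ten thousand has to go all the way to the frame memory.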