#### 1 - Inspect the scripts provided in the course page. Identify the layers of the neural network, as well as the relevant instructions for evaluating the MX-compatible formats. It is also recommended to read the specifications regarding the MX formats and data types, presented in this document (OCP Microscaling Formats (MX) Specification).

<small>

```
self.conv1 = mx.Conv2d(1, 32, 3, 1, mx_specs=mx_specs)
self.conv2 = mx.Conv2d(32, 64, 3, 1, mx_specs=mx_specs)
self.dropout1 = nn.Dropout(0.25)	
self.dropout2 = nn.Dropout(0.5)	
self.fc1 = mx.Linear(9216, 128, mx_specs=mx_specs)
self.fc2 = mx.Linear(128, 10, mx_specs=mx_specs)

(...)

x = self.conv1(x)
x = mx.relu(x, mx_specs=self.mx_specs)
x = self.conv2(x)
x = mx.relu(x, mx_specs=self.mx_specs)
x = F.max_pool2d(x, 2)              
x = self.dropout1(x)
x = torch.flatten(x, 1)             
x = self.fc1(x)
x = mx.relu(x, mx_specs=self.mx_specs)
x = self.dropout2(x)
x = self.fc2(x)
output = mx.simd_log(mx.softmax(x, dim=1, mx_specs=self.mx_specs), mx_specs=self.mx_specs)  
```
</small>

| Layer Order | Layer Type | Implementation | Configuration | MX-Compatible? |
| :--- | :--- | :--- | :--- | :---: |
| 1 | Convolution | `mx.Conv2d` | Input: 1, Output: 32, Kernel: 3x3 | Yes |
| 2 | Activation | `mx.relu` | Elementwise ReLU | Yes |
| 3 | Convolution | `mx.Conv2d` | Input: 32, Output: 64, Kernel: 3x3 | Yes |
| 4 | Activation | `mx.relu` | Elementwise ReLU | Yes |
| 5 | Pooling | `F.max_pool2d` | Max Pooling (2x2) | No |
| 6 | Dropout | `nn.Dropout` | Probability: 0.25 | No |
| 7 | Flatten | `torch.flatten` | Reshapes feature maps to vector | No |
| 8 | Fully Connected | `mx.Linear` | Input: 9216, Output: 128 | Yes |
| 9 | Activation | `mx.relu` | Elementwise ReLU | Yes |
| 10 | Dropout | `nn.Dropout` | Probability: 0.5 | No |
| 11 | Fully Connected | `mx.Linear` | Input: 128, Output: 10 | Yes |
| 12 | Output | `mx.softmax` + `mx.simd_log` | LogSoftmax | Yes |
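The input size of 9216 for the first fully connected layer follows from the shapes of the preceding layers. The sketch below reproduces that arithmetic, assuming a 28x28 single-channel input (the Fashion-MNIST image size) and unpadded 3x3 convolutions with stride 1; the `conv_out` helper is illustrative and not part of the course scripts.

```python
def conv_out(size: int, kernel: int, stride: int = 1) -> int:
    """Output width/height of an unpadded convolution or pooling window."""
    return (size - kernel) // stride + 1


side = 28                    # assumed input: 28x28 grayscale image
side = conv_out(side, 3)     # conv1: 3x3 kernel, stride 1 -> 26
side = conv_out(side, 3)     # conv2: 3x3 kernel, stride 1 -> 24
side = conv_out(side, 2, 2)  # max_pool2d(x, 2): 2x2 window, stride 2 -> 12

# conv2 produces 64 channels, so flattening gives 64 * 12 * 12 features.
flat_features = 64 * side * side
print(flat_features)         # 9216
```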

To emulate different accelerators, we change the `mx_specs` dictionary in the `main()` function of `pytorch_code_mx.py`. This dictionary is then passed as the `mx_specs=` argument to the `mx` layers and functions.
The main configurations are:
*   **Format of matrix multiplication:**
    *   `mx_specs['w_elem_format']`: Weight precision.
    *   `mx_specs['a_elem_format']`: Input activation precision.
*   **Elementwise Formats:** The precision for operations like ReLU and Softmax is controlled by:
    *   `mx_specs['bfloat']`: If non-zero, uses the Brain Float format.
    *   `mx_specs['fp']`: If `bfloat` is `0`, this defines the total bit-width for a custom floating-point format (exponent is fixed at 5 bits). Otherwise it is ignored.
*   **MX format:**
    *   `mx_specs['block_size']`: Set to 32 to use MX formats, so that every 32 elements share one scale factor.
*   **Shared scale:**
    *   `mx_specs['scale_bits']`: Set to 8, the number of bits of the shared scale factor.
*   **CUDA acceleration:**
    *   `mx_specs['custom_cuda']`: Set to `True` to enable CUDA acceleration.
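As a minimal sketch, the entries listed above can be collected by a small helper. `make_mx_specs` is a hypothetical function written for illustration (it is not part of the microxcaling library or of `pytorch_code_mx.py`); it only builds a plain dictionary with the keys described in this section, and the real script may set additional entries.

```python
def make_mx_specs(matmul_format: str, elem_format: str, use_cuda: bool = False) -> dict:
    """Hypothetical helper: build the mx_specs entries described above."""
    specs = {
        "w_elem_format": matmul_format,  # weight precision
        "a_elem_format": matmul_format,  # input activation precision
        "block_size": 32,                # 32 elements share one scale factor
        "scale_bits": 8,                 # width of the shared scale factor
        "custom_cuda": use_cuda,
    }
    if elem_format == "bfloat16":
        specs["bfloat"] = 16             # non-zero: 'fp' is ignored
        specs["fp"] = 0
    elif elem_format.startswith("fp"):
        specs["bfloat"] = 0              # zero: fall back to the custom fp format
        specs["fp"] = int(elem_format[2:])
    else:
        raise ValueError(f"unsupported elementwise format: {elem_format}")
    return specs


print(make_mx_specs("fp8_e4m3", "fp8"))
```

Using one function per (matrix multiplication, elementwise) pair keeps the two knobs separate, which is convenient when sweeping over combinations.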


#### 2 - Assess the impact of using some of the MX formats for matrix multiplication operations, considering different elementwise operations (bfloat and fpX). Note that, when performing elementwise operations with the fp option, the exponent always features a width of 5 bits, as explained in the documentation. As a result, setting the fp entry of mx_specs to 7 (i.e. the minimum supported value), results in a mantissa of 1 bit. Acquire the accuracy after 5 and 10 epochs for the evaluated settings. These results should be displayed in the tables presented in the last page of this assignment. Comment on your findings.

| Format | int2 | int8 | fp8_e4m3 | fp8_e5m2 | bfloat16 |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **Accuracy after 5 epochs (%)** | 85 | 92  | 91 | 91  | 92  |
| **Accuracy after 10 epochs (%)** | 82 | 93 | 92  | 92 | 92  |



<small>

```
mx_specs['bfloat'] = 16
# mx_specs['fp'] is ignored
mx_precision = 'CHANGE THIS' # Options: 'int2', 'int8', 'fp8_e4m3', 'fp8_e5m2', 'bfloat16'
```
</small>

| Format | fp16 | fp12 | fp10 | fp8 |
| :--- | :--- | :--- | :--- | :--- |
| **Accuracy after 5 epochs (%)** | 91 | 91 | 91 | 91 |
| **Accuracy after 10 epochs (%)** | 92 | 92 | 92 | 91 |

<small>

```
mx_specs['bfloat'] = 0  # disable bfloat so that the custom fp format is used
mx_specs['fp'] = 16  # CHANGE THIS. Options: 16, 12, 10, 8
mx_precision = 'fp8_e4m3'
```
</small>
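Because the `fp` option fixes the exponent at 5 bits, the mantissa width follows directly from the total bit-width. The sketch below computes the bit layout for each evaluated value, assuming one sign bit; `fp_layout` is an illustrative helper, not a library function.

```python
SIGN_BITS = 1
EXPONENT_BITS = 5  # fixed for the 'fp' elementwise option


def fp_layout(total_bits: int) -> dict:
    """Split a total bit-width into sign, exponent and mantissa bits."""
    mantissa = total_bits - SIGN_BITS - EXPONENT_BITS
    if mantissa < 1:
        raise ValueError("at least 7 bits are needed (1 sign, 5 exponent, 1 mantissa)")
    return {"sign": SIGN_BITS, "exponent": EXPONENT_BITS, "mantissa": mantissa}


for bits in (16, 12, 10, 8, 7):
    print(f"fp{bits}: {fp_layout(bits)}")
```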



#### 3 - Consider the possibility of implementing a hardware accelerator for the presented neural network. Assuming that it should feature an accuracy of 90%, select the data formats that allow you to minimize the size of the accelerator. What accuracy did you obtain? Note: It might be useful to check which data formats are supported by the microxcaling library.

Results of the grid search (`bruteforce.py`):

| MatMul Format | Elem Format | Accuracy after 5 epochs (%) | Accuracy after 10 epochs (%) |
| :--- | :--- | :--- | :--- |
| int2 | bfloat16 | 84.93 | 86.20 |
| int2 | fp16 | 23.11 | 10.00 |
| int2 | fp12 | 81.85 | 82.23 |
| int2 | fp10 | 79.38 | 79.90 |
| int2 | fp8 | 78.82 | 80.06 |
| int8 | bfloat16 | 91.44 | 92.29 |
| int8 | fp16 | 89.41 | 88.78 |
| int8 | fp12 | 91.56 | 92.33 |
| int8 | fp10 | 91.61 | 92.40 |
| int8 | fp8 | 90.49 | 91.06 |
| fp8_e4m3 | bfloat16 | 91.27 | 92.13 |
| fp8_e4m3 | fp16 | 90.84 | 90.44 |
| fp8_e4m3 | fp12 | 91.16 | 91.87 |
| fp8_e4m3 | fp10 | 90.93 | 91.83 |
| fp8_e4m3 | fp8 | 90.61 | 91.12 |
| fp8_e5m2 | bfloat16 | 91.69 | 92.03 |
| fp8_e5m2 | fp16 | 90.48 | 83.44 |
| fp8_e5m2 | fp12 | 90.93 | 91.53 |
| fp8_e5m2 | fp10 | 90.96 | 91.95 |
| fp8_e5m2 | fp8 | 90.76 | 91.29 |
| bfloat16 | bfloat16 | 91.57 | 92.21 |
| bfloat16 | fp16 | 89.86 | 81.04 |
| bfloat16 | fp12 | 91.79 | 92.27 |
| bfloat16 | fp10 | 91.26 | 92.14 |
| bfloat16 | fp8 | 90.40 | 90.86 |
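The selection of the smallest configuration can be sketched as a filter over these results. The code below is illustrative and is not the content of `bruteforce.py`: it uses the sum of the matrix multiplication and elementwise bit-widths as a crude proxy for accelerator size (an assumption), and keeps every configuration that reaches the target after 10 epochs with the minimum combined width.

```python
# (matmul format, elementwise format, accuracy @ 5 epochs, accuracy @ 10 epochs)
RESULTS = [
    ("int2", "bfloat16", 84.93, 86.20), ("int2", "fp16", 23.11, 10.00),
    ("int2", "fp12", 81.85, 82.23), ("int2", "fp10", 79.38, 79.90),
    ("int2", "fp8", 78.82, 80.06),
    ("int8", "bfloat16", 91.44, 92.29), ("int8", "fp16", 89.41, 88.78),
    ("int8", "fp12", 91.56, 92.33), ("int8", "fp10", 91.61, 92.40),
    ("int8", "fp8", 90.49, 91.06),
    ("fp8_e4m3", "bfloat16", 91.27, 92.13), ("fp8_e4m3", "fp16", 90.84, 90.44),
    ("fp8_e4m3", "fp12", 91.16, 91.87), ("fp8_e4m3", "fp10", 90.93, 91.83),
    ("fp8_e4m3", "fp8", 90.61, 91.12),
    ("fp8_e5m2", "bfloat16", 91.69, 92.03), ("fp8_e5m2", "fp16", 90.48, 83.44),
    ("fp8_e5m2", "fp12", 90.93, 91.53), ("fp8_e5m2", "fp10", 90.96, 91.95),
    ("fp8_e5m2", "fp8", 90.76, 91.29),
    ("bfloat16", "bfloat16", 91.57, 92.21), ("bfloat16", "fp16", 89.86, 81.04),
    ("bfloat16", "fp12", 91.79, 92.27), ("bfloat16", "fp10", 91.26, 92.14),
    ("bfloat16", "fp8", 90.40, 90.86),
]

# Bit-width of each format name used in the table.
BITS = {"int2": 2, "int8": 8, "fp8_e4m3": 8, "fp8_e5m2": 8, "bfloat16": 16,
        "fp16": 16, "fp12": 12, "fp10": 10, "fp8": 8}


def width(row):
    """Size proxy (assumption): matmul bits plus elementwise bits."""
    return BITS[row[0]] + BITS[row[1]]


def smallest_configs(results, target=90.0):
    """Return all rows meeting the target at 10 epochs with minimal width."""
    valid = [row for row in results if row[3] >= target]
    if not valid:
        return []
    best = min(width(row) for row in valid)
    return [row for row in valid if width(row) == best]


for matmul, elem, _, acc10 in smallest_configs(RESULTS):
    print(f"{matmul} + {elem}: {acc10:.2f}%")
```

Under this proxy, the three 8-bit matrix multiplication formats paired with fp8 elementwise operations tie for the smallest width, so the final choice among them rests on other criteria.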

To reach the target accuracy of 90% while minimizing the bit-width (and therefore the size) of the accelerator, we selected the following configuration. For weights and activations we chose fp8_e4m3 (8-bit). We also evaluated int2, which performed worse: its best accuracy was 86.20% with bfloat16 elementwise operations, and it was lower with the fp formats. All 8-bit formats (int8, fp8_e4m3, fp8_e5m2) exceeded 90% after 10 epochs when paired with fp8 elementwise operations (91.06%, 91.12% and 91.29%, respectively). We selected fp8_e4m3, one of the FP8 element formats defined in the Microscaling specification, although the other combinations above 90% could also be used. For elementwise operations we adopted fp8 (8-bit). We tested elementwise formats from bfloat16 down to fp8, and the results showed that the custom fp8 format (1 sign, 5 exponent, 2 mantissa bits) is sufficient: even with this low precision, the network maintained high accuracy.
With this configuration (fp8_e4m3 + fp8), we obtained an accuracy of 91.12% after 10 epochs.

By selecting 8-bit formats for both matrix multiplications and elementwise operations, the accelerator can run entirely on 8-bit data paths. This roughly halves the memory requirement compared to a 16-bit system (such as bfloat16) and significantly reduces the complexity of the arithmetic logic units (ALUs), which satisfies the goal of minimizing the accelerator size.
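The memory saving can be estimated with a short calculation. The sketch below counts only the weights of the four `mx` layers (biases are omitted), assumes that each tensor is tiled into flat blocks of 32 elements with one 8-bit shared scale per block, and compares against a plain 16-bit baseline without shared scales. These are simplifying assumptions, not a description of how the library stores tensors.

```python
# Weight counts derived from the layer definitions (biases omitted).
WEIGHT_COUNTS = {
    "conv1": 1 * 32 * 3 * 3,
    "conv2": 32 * 64 * 3 * 3,
    "fc1": 9216 * 128,
    "fc2": 128 * 10,
}


def mx_bits(n_elements: int, elem_bits: int = 8,
            block_size: int = 32, scale_bits: int = 8) -> int:
    """Storage in bits: element bits plus one shared scale per block."""
    blocks = -(-n_elements // block_size)  # ceiling division
    return n_elements * elem_bits + blocks * scale_bits


total_weights = sum(WEIGHT_COUNTS.values())
mx_bytes = sum(mx_bits(n) for n in WEIGHT_COUNTS.values()) / 8
bf16_bytes = total_weights * 16 / 8
ratio = mx_bytes / bf16_bytes

print(f"weights: {total_weights}")
print(f"8-bit MX storage: {mx_bytes:.0f} bytes")
print(f"16-bit storage:   {bf16_bytes:.0f} bytes")
print(f"ratio: {ratio:.4f}")
```

The shared scale adds 8/32 = 0.25 bits per element, so the ratio is slightly above one half.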

TODO: explain why running on the CPU produces NaN values while running on the GPU does not.

#### 4 - (Optional) Select the data formats in order to obtain the worst possible accuracy for the considered network. How much did you obtain? Explain your methodology and comment on any notable behaviour that you find during training.

Question 4 Response (Optional)

1. Selected Data Formats:
To obtain the worst possible accuracy, we identified the following combination from our grid search results:

*   **Matrix Multiplication Format:** int2. This limits the weights to only 2 bits (4 unique values), providing the lowest information capacity.
*   **Elementwise Format:** fp16 (custom). This uses the library's custom floating-point format with a fixed 5-bit exponent.

2. Accuracy Obtained:
We obtained an accuracy of 10.00% after 10 epochs.

3. Methodology:
We performed a comprehensive grid search across all supported matrix multiplication and elementwise format combinations. We looked for the configuration that resulted in the lowest validation accuracy. While int2 generally performed poorly compared to 8-bit formats, this specific combination resulted in a complete model failure.

4. Comments on Notable Behavior:

*   **Random guessing:** Since the Fashion-MNIST dataset has 10 classes, an accuracy of 10.00% corresponds to chance level. The model failed to learn any distinguishing features of the images.

*   **"Unlearning" / divergence:** At epoch 5 this configuration had an accuracy of 23.11%, but by epoch 10 it had collapsed to 10.00%. This behavior is characteristic of numerical divergence. The model likely encountered exploding gradients or accumulated numerical errors (NaNs) in the later epochs, destroying whatever small patterns it had initially learned.

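*   **Instability of int2 with fp16:** The int2 format reached roughly 80% to 86% accuracy when paired with the other elementwise formats (bfloat16, fp12, fp10, fp8), but it collapsed when paired with the custom fp16 format. The grid search also shows the fp16 elementwise setting degrading other matrix multiplication formats between epochs 5 and 10 (for example, bfloat16 drops from 89.86% to 81.04%). The instability therefore appears to be tied to the fp16 elementwise setting rather than to low elementwise precision alone, with the extremely low-precision int2 weights being the most sensitive to it.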
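An accuracy of exactly 10.00% is also what a model that always predicts the same class would score on a balanced 10-class test set. The sketch below illustrates this, assuming the test split contains 1,000 images per class and that the diverged model's outputs collapse to a single class (an assumption that was not verified here); `accuracy` is an illustrative helper.

```python
def accuracy(predictions, labels) -> float:
    """Percentage of predictions that match the labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


# Assumed balanced test set: 10 classes, 1,000 samples each.
labels = [cls for cls in range(10) for _ in range(1000)]

# A collapsed model that outputs class 0 for every input.
constant_predictions = [0] * len(labels)

print(accuracy(constant_predictions, labels))  # 10.0
```

Truly random predictions would land near 10% without hitting it exactly, so a value of exactly 10.00% is consistent with the NaN hypothesis above.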