In [2]:
import numpy as np
import torch

![](./imgs/scaling_by_real_value.png)

In [33]:
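def scaling_by_real(x):
    m = 3  # number of bits (including the sign bit)
    # real-valued scale: map the largest input to the largest integer level
    s = np.max(x) / (2**(m-1) - 1)

    # quantize to the integer grid, then rescale back (dequantize)
    res = np.rint(x / s) * s
    return res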

inp = np.array([0.7, 1.4, 2.50, 6.001, 7.2])
quant_inp = scaling_by_real(inp)

print(quant_inp)

[0.  2.4 2.4 7.2 7.2]


![](./imgs/scaling_by_pot.png)

In [42]:
def round_to_power_of_two(n):
    # rounds *up* to the next power of two, so the scaled maximum stays in range
    return 2**np.ceil(np.log2(n))

def scaling_by_pot(x):
    m = 3
    s = np.max(x) / (2**(m-1) - 1)
    s = round_to_power_of_two(s)
    
    res = np.rint(x / s) * s
    return res

inp = np.array([0.7, 1.4, 2.50, 6.001, 7.2])
quant_inp = scaling_by_pot(inp)

print(quant_inp)

[0. 0. 4. 8. 8.]


![](./imgs/scaling_by_multi_real.png)

In [37]:
inp = np.array([0.7, 1.4, 2.50, 6.001, 7.2])
# partition
inp_1 = inp[[0, 3, 4]]
inp_2 = inp[[1, 2]]

quant_inp_1 = scaling_by_real(inp_1)
quant_inp_2 = scaling_by_real(inp_2)

print(quant_inp_1[[0]], quant_inp_2, quant_inp_1[[1,2]])

[0.] [1.66666667 2.5       ] [7.2 7.2]


![](./imgs/two_level_scaling.png)

In [43]:
inp = np.array([0.7, 1.4, 2.50, 6.001, 7.2])

# first-level (real) scaling
m = 3  # number of bits
s = np.max(inp) / (2**(m-1) - 1)
inp /= s
# print(inp)

# partitioning
inp_1 = inp[[0, 3, 4]]
inp_2 = inp[[1, 2]]
# print(inp_1, inp_2)

# second-level (PoT) scaling and quantization
quant_inp_1 = scaling_by_pot(inp_1)
quant_inp_2 = scaling_by_pot(inp_2)

# Rescale by the first-level scaling factor
quant_inp_1 *= s
quant_inp_2 *= s

print(quant_inp_1[[0]], quant_inp_2, quant_inp_1[[1,2]])

[0.] [1.2 2.4] [7.2 7.2]


![](./imgs/unification.png)

The two-level scaling framework above is general enough to describe many existing quantization methods.

Here $s$ is the scale and $ss$ is the sub-scale.

If the scale (or sub-scale) is a power of two, it can be implemented with a bit shift, and we call it "Hardware Type". Otherwise, we call it "Software Type".

$k_1$ (or $k_2$) is the block size of the scale (or sub-scale), that is, the number of elements that share one scaling factor. A smaller block size means finer granularity, which is more flexible but also more expensive, because more scaling factors must be stored.
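As a rough illustration of how these knobs fit together, here is a minimal sketch of a generic two-level quantizer. The function names (`pot_ceil`, `block_scales`, `two_level_quantize`) and the flags `s_hw` / `ss_hw` are hypothetical and not part of the original notebook. The sketch also assumes that blocks are contiguous and that the input length is divisible by both `k1` and `k2`.

```python
import numpy as np

def pot_ceil(s):
    """Round a positive scale up to the next power of two."""
    return 2.0 ** np.ceil(np.log2(s))

def block_scales(x, k, m, hardware):
    """Compute one scale per block of k elements, broadcast back to x's shape."""
    blocks = np.abs(x).reshape(-1, k)
    s = blocks.max(axis=1, keepdims=True) / (2 ** (m - 1) - 1)
    s = np.where(s > 0, s, 1.0)  # all-zero block: any scale works, use 1
    if hardware:                 # "Hardware Type": power-of-two scale
        s = pot_ceil(s)
    return np.repeat(s, k, axis=1).reshape(x.shape)

def two_level_quantize(x, m=3, k1=4, k2=2, s_hw=False, ss_hw=True):
    """Quantize x with a scale (block size k1) and a sub-scale (block size k2)."""
    s = block_scales(x, k1, m, s_hw)        # first-level scale
    ss = block_scales(x / s, k2, m, ss_hw)  # sub-scale on the normalized values
    q = np.rint(x / s / ss)                 # integer codes
    return q * ss * s                       # dequantized values

# The earlier example, reordered so that the two partitions are contiguous
# (a zero is appended to make both blocks the same size).
x = np.array([0.7, 6.001, 7.2, 1.4, 2.5, 0.0])
print(two_level_quantize(x, m=3, k1=6, k2=3, s_hw=False, ss_hw=True))
```

Setting `s_hw=False, ss_hw=True` corresponds to a real-valued scale with a power-of-two sub-scale, as in the two-level example above. Other combinations of the flags and block sizes give the other rows of the framework.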



## MX Format

We next introduce the MX format under the framework of two-level scaling. 

![](./imgs/three_examples.png)

The MX format has the following properties:
1. Both the first-level and second-level scaling factors are powers of two, which is convenient for hardware implementation.
2. Both levels use fine granularity (small block sizes), which makes the format flexible. In addition, the second-level scaling factor has only 1 bit, which reduces the overhead of storing scaling factors.

We can compute the average bit-width of the MX format:

![](./imgs/mx_average_bit.png)
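The same calculation can be sketched in code. This is an assumption-laden sketch rather than a transcription of the figure: it assumes each element stores `m` bits (including the sign, as in the code cells above), a scale of `d1` bits is shared by `k1` elements, and a sub-scale of `d2` bits is shared by `k2` elements. The names `average_bits`, `d1` and `d2` are hypothetical, and the numbers in the example are illustrative only.

```python
def average_bits(m, d1, k1, d2, k2):
    """Average storage cost per element, in bits.

    m  : bits stored per element (including sign)
    d1 : bits of the first-level scale, shared by k1 elements
    d2 : bits of the second-level sub-scale, shared by k2 elements
    """
    return m + d1 / k1 + d2 / k2

# Illustrative values: 8-bit elements, an 8-bit scale per 16 elements,
# and a 1-bit sub-scale per 2 elements.
print(average_bits(m=8, d1=8, k1=16, d2=1, k2=2))  # 9.0
```

The shared scaling factors add only `d1 / k1 + d2 / k2` bits per element, which is why a 1-bit sub-scale keeps the overhead small even at a very fine block size.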

## Hardware Implementation (MX Dot Product)

![](./imgs/hw_pipeline.png)

Suppose the input vectors have length $r$. Each vector has $\frac{r}{k_1}$ blocks, and thus $\frac{r}{k_1}$ different first-level scaling factors $s$.

The dot product of two vectors consists of two operations: point-wise multiplication (also known as the Hadamard product) and reduction (summing the results of the point-wise multiplication).

Both operations can be performed block by block: multiply each block of the first vector point-wise with the corresponding block of the second vector and sum the products within the block. Summing these per-block partial results then gives the final dot product.

![](./imgs/dot_product.png)
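The block-wise computation can be sketched in NumPy as follows. This is a simplified software model, not the hardware pipeline itself: it uses only the first-level power-of-two scale per block and omits the sub-scale. The helper names `quantize_blocks` and `block_dot` are hypothetical, and the vector length is assumed to be divisible by `k1`.

```python
import numpy as np

def quantize_blocks(x, k1, m=8):
    """Return integer codes of shape (r / k1, k1) and one power-of-two scale per block."""
    blocks = x.reshape(-1, k1)
    amax = np.abs(blocks).max(axis=1)
    s = np.where(amax > 0, amax / (2 ** (m - 1) - 1), 1.0)
    s = 2.0 ** np.ceil(np.log2(s))  # round the scale up to a power of two
    q = np.rint(blocks / s[:, None]).astype(np.int64)
    return q, s

def block_dot(qa, sa, qb, sb):
    """Dot product of two block-quantized vectors."""
    partial = np.sum(qa * qb, axis=1)  # integer multiply and reduce inside each block
    return np.sum(partial * sa * sb)   # apply both block scales, then reduce over blocks

rng = np.random.default_rng(0)
a, b = rng.normal(size=32), rng.normal(size=32)
qa, sa = quantize_blocks(a, k1=16)
qb, sb = quantize_blocks(b, k1=16)
print(block_dot(qa, sa, qb, sb), np.dot(a, b))
```

Because the scales are constant within a block, they can be factored out of the inner sum. The per-element work is then integer-only, and each block needs just one scale multiplication (a shift, for power-of-two scales).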