# COE3DQ5 Project Description 2023 Hardware Implementation of an Image Decompressor

#### 1 Introduction

#### 1.1 Objective

In this project, students will gain experience in digital system design by implementing the custom McMaster Image Compression revision 17 (.mic17) image compression specification in hardware. Compressed data for a 320 x 240 pixel image will be delivered to the Altera DE2 board via the universal asynchronous receiver/transmitter (UART) interface from a personal computer (PC) and stored in the external static random access memory (SRAM). The image decoding circuitry will read the compressed data, recover the image using a custom digital circuit designed by you and store it back to the SRAM, from where the video graphics array (VGA) controller will read it and display it to the monitor.

#### 1.2 Preparation

- Revise the first five labs, especially finite state machines, VGA, SRAM and UART interfaces
- Read this document and get familiarized with the C source code

#### 1.3 Organization of this Document

Section 2 discusses how an image is compressed with the .mic17 specification. In Section 3, image decompression is demonstrated by example with a 16x16 image. Section 4 presents the project guidelines (the project is divided into 3 milestones). A summary is provided in Appendix A.

# 2 Image Compression

Figure 1 shows the conceptual flow for basic image compression and decompression. The size of the arrow roughly indicates the relative amount of data at each stage. Generally, we start with an image in the standard red/green/blue (RGB) format for image compression. This is converted to YUV format, where Y represents brightness, and U and V are chrominance components (which contain colour information). The YUV image is downsampled and then transformed into the frequency domain. This result is quantized, and finally, we apply lossless encoding to obtain the compressed information bitstream. Decoding proceeds in the reverse order: lossless decoding and dequantization (restores the frequency domain image), then the inverse signal transform (which brings the image back to the spatial domain), and finally, upsampling and colourspace conversion (produces an RGB image).

Compression enables smaller file sizes, which reduces storage and transmission costs. We convert the RGB image to YUV because the human eye is more sensitive to brightness than colour, so if we want to obtain more lossy compression (by reducing the quality), it will be more effective to compress the colour information. In the RGB format, there is no immediate way to compress colour more than brightness. YUV information is compressed by downsampling U and V in the .mic17 specification. Additionally, the human eye is less sensitive to high-frequency components in an image (compared to low frequencies). Transforming into the frequency domain enables us to compress higher-frequency components more than lower-frequency ones. We compress frequency information by applying quantization. Finally, we use lossless compression as a final effort to reduce the file size.



Figure 1. Flow of basic image compression (down the left) and decompression (up the right).

Downsampling and quantization are not identically reversible, so some information is lost during compression. The decompressed image only approximates the original image. A good compression scheme leverages this loss of information to reduce the compressed file size but will attempt to allow information loss only in a way that does not drastically affect the quality of the decompressed image.

The focus of this project is the hardware implementation of a compression scheme. The .mic17 specification is not meant to rival existing compression standards for images. Simplifying choices have been made in developing the specification to facilitate a straightforward hardware implementation. This limits the capabilities in terms of the achievable compression/quality tradeoff. Even so, the concepts described above still apply in a general sense to the existing compression schemes. The following subsections will discuss each stage of image compression in more detail.

# 2.1 Colourspace Conversion and Downsampling

Figure 2 shows how a 320x240 image can be decomposed into its red, green, and blue components. Each pixel (composed of 1 red, 1 green, and 1 blue value) is independently transformed into the YUV colourspace. Each resulting pixel contains 1 Y (brightness), 1 U (blue chroma), and 1 V (red chroma). The chroma components have only colour information (no brightness). In Figure 2, the bottom of the motorcycle is red, and the corresponding bright area in the V image indicates there are large values at these pixels, representing a strong red colouring. Note this says nothing about whether it is bright red or dark red, as the brightness comes only from Y.



Figure 2. Colourspace conversion (RGB to YUV) and horizontal downsampling of U and V.

Each RGB pixel is converted to a YUV pixel by performing the following matrix multiplication:

$$\begin{bmatrix} Y \\ U \\ V \end{bmatrix} = \begin{bmatrix} 0.257 & 0.504 & 0.098 \\ -0.148 & -0.291 & 0.439 \\ 0.439 & -0.368 & -0.071 \end{bmatrix} \begin{bmatrix} R \\ G \\ B \end{bmatrix} + \begin{bmatrix} 16 \\ 128 \\ 128 \end{bmatrix}$$
(1)

The terminology YUV originates from the video transmission field, and different notations (depending on the video/colour/transmission standard) can be found in the literature (such as YIQ, YPbPr, YCbCr).

The human eye is much more sensitive to brightness than colour (due to the presence of more rods than cones in the eye). For this reason, we choose to downsample U and V (the chroma components) to reduce the amount of data while having little effect on how the image is perceived. We horizontally downsample by 2, so the U and V images become half the original width (no vertical downsampling, so the height is unchanged). Therefore, only 2/3 of the total data remains.

#### 2.2 Signal Transform (into the Frequency Domain) and Quantization

The subsequent compression stage involves transforming the image from the spatial domain to the frequency domain. High spatial frequencies (brightness/colour transitions that occur in a small amount of space) are less noticeable to the human eye, so they may be reduced in accuracy (or even removed) without drastically affecting the overall quality of the image.

In the .mic17 specification, we use a 2-dimensional discrete cosine transform (DCT), and we work on blocks of 8x8 pixels (each 8x8 block is processed independently; we do not consider images with heights and widths not divisible by 8). The 2D DCT is computed with the following matrix multiplications:

$$S' = CSC^{T} \text{ where each element } [C]_{i,j} = \begin{cases} \sqrt{\frac{1}{8}} \cos\left(\frac{i\pi}{8}(j+0.5)\right) & \text{if } i=0\\ \sqrt{\frac{2}{8}} \cos\left(\frac{i\pi}{8}(j+0.5)\right) & \text{otherwise} \end{cases}$$
(2)

S is the original 8x8 block (spatial domain) and S' is the transformed 8x8 block (frequency domain). In Equation 2 above, i and j are the row and column indexes, respectively, each goes from 0 to 7 inclusive.

To obtain compression, we quantize S'. Each element in S' is divided by a (possibly) different quantization coefficient and then rounded to the nearest integer. A larger coefficient means that we obtain a smaller magnitude after division, which requires fewer bits to be stored. The top left corner corresponds to frequency 0. The frequencies increase as we move down and to the right in the matrix. Thus, we should choose smaller quantization coefficients for the upper left part of the matrix and large coefficients for the lower right part to exploit the human eye's perception.

For typical images, many values of S' will become 0 after quantization, and these zeros can be stored efficiently, as shown in Section 2.3. Even if nonzero, the division results in an integer that requires fewer bits to be represented. Two possible quantization matrices can be used in .mic17:

$$\boldsymbol{Q_0} = \begin{bmatrix} 8 & 4 & 8 & 8 & 16 & 16 & 32 & 32 \\ 4 & 8 & 8 & 16 & 16 & 32 & 32 & 64 \\ 8 & 8 & 16 & 16 & 32 & 32 & 64 & 64 \\ 8 & 16 & 16 & 32 & 32 & 64 & 64 & 64 \\ 16 & 16 & 32 & 32 & 64 & 64 & 64 & 64 \\ 16 & 32 & 32 & 64 & 64 & 64 & 64 & 64 \\ 32 & 32 & 64 & 64 & 64 & 64 & 64 & 64 \\ 32 & 64 & 64 & 64 & 64 & 64 & 64 \end{bmatrix} \quad \boldsymbol{Q_1} = \begin{bmatrix} 8 & 2 & 2 & 2 & 4 & 4 & 8 & 8 \\ 2 & 2 & 2 & 4 & 4 & 8 & 8 & 16 & 16 \\ 2 & 4 & 4 & 8 & 8 & 16 & 16 & 16 & 32 \\ 4 & 8 & 8 & 16 & 16 & 16 & 32 & 32 \\ 8 & 8 & 16 & 16 & 16 & 32 & 32 & 32 \\ 8 & 8 & 16 & 16 & 16 & 32 & 32 & 32 \end{bmatrix}$$
(3)

S' is divided element-wise by one of the quantization matrices (suppose we pick  $Q_0$ ) and then it is rounded to the nearest integer:

$$Z = round(S' ./ Q_0) \tag{4}$$

If we look at each element of **Z**:

$$[Z]_{i,j} = round([S']_{i,j} / [Q_0]_{i,j})$$
 (5)

**Z** will later be losslessly coded.

After dividing each colour plane (Y, U, and V) into blocks of 8x8 pixels, we perform DCT and quantization to each block. Figure 3 shows this on 2 blocks of Y data. The image is 320x240, so there will be (320/8)\*(240/8) = 1200 Y blocks and 600 blocks each of downsampled U and downsampled V (since these planes are half as wide). All the Y blocks are processed, then all the U, and finally all the V blocks. Within each colour plane, we go across a row of blocks first, then down to the next row. Note that 1 row of blocks is 8 pixels high.

The block in Figure 3(a) is nearly uniform, so after DCT, in Figure 3(c2), there is a large value for frequency 0 and small values for the high frequencies. After quantization, we end up with primarily zeros, which can be stored efficiently. Many images contain nearly uniform blocks, so the DCT followed by quantization provides a way to represent the image with much less data. Although this leads to reduced quality of the decompressed image, the significant features of the original block are still preserved.

The block in Figure 3(b) contains more variation due to the details of the motorcycle's wheel. The DCT and quantization results are shown in Figures 3(d2) and (d3), respectively. In the presence of more details (captured by the higher frequencies), more data is needed to represent the block (compared to the block from Figure 3(a)). Even so, the amount of data (in Figure 3(d3)) is reduced from that required to represent the intensity of each pixel explicitly (the block in Figure 3(b)).



Figure 3. DCT (c1 $\rightarrow$ c2 and d1 $\rightarrow$ d2) and quantization with Q<sub>0</sub> (c2 $\rightarrow$ c3 and d2 $\rightarrow$ d3) of 2 blocks of Y data.

# 2.3 Lossless Coding

In the final decoding stage, we need a mechanism for efficiently storing the data after DCT and quantization. We start with a simple 8-byte header, which indicates the image's width and height, and which quantization matrix was used. After this, the losslessly coded blocks are packed into the bitstream in the same order described at the top of this page. To losslessly code one block, the elements of  $\mathbf{Z}$  (see Equation 4) are read out of the block in the order shown in Figure 4. This is known as the scan pattern.



Figure 4. Scan pattern (order of values to be read from an 8x8 block).

Copyright © Jason Thong, Adam Kinsman, and Nicola Nicolici Digital Systems Design Course at McMaster University, Canada Many images naturally have small values in the high frequencies, so after computing the DCT and then quantizing, we typically have many zeros in the lower right part of the matrix. The above scan pattern, which is standard for many mainstream video/image compression schemes, deals with the low frequencies first (upper left) and the high frequencies last (lower right). This is beneficial because, in this order, we can get many consecutive zeros (from the high-frequency part), which can be efficiently coded. For each 8x8 block, as we follow the order of the scan pattern, values are encoded as follows, and a variable length prefix-coded header indicates what data will follow (a higher priority is given to the cases higher in the list):

- 1. Only zeros remain in the block, output the ZEROS\_TO\_END header (bits 111).
- 2. A run of zeros exists, but some nonzero coefficients still remain in the block. Output the ZERO\_RUN header (bits 110), and then:
  - a. If the run of zeros is 8 or more long, output the 3 bits 000; now the run of zeros will be 8 elements shorter, so repeat step 2 if necessary.
  - b. If the run of zeros is less than 8 in length (i.e., 1-7 inclusive), output a 3-bit unsigned number indicating the length of the run.
- 3. A run of ones exists. Output the POSITIVE\_ONE\_RUN header (bits 101), and then:
  - a. If the run of ones is 4 or more in length, output the 2 bits 00; now the run will be 4 elements shorter, so repeat step 3 if necessary.
  - b. If the run of ones is less than 4, output a 2-bit unsigned number to indicate the length.
- 4. Apply the same rules in step 3 to runs of -1 but using the NEGATIVE ONE RUN header (bits 100).
- 5. If there is a value between -8 and 7 inclusive, output the 4\_BIT header (bits 01) followed by the 4-bit signed value. Notice that -1, 0 and +1 are always encoded as runs (not encoded here).
- 6. If there is a value between -256 and 255 inclusive, output the 9\_BIT header (bits 00) followed by the 9-bit signed value.

Figure 5 shows the coding of the final (quantized) blocks from Figure 3, in both cases, the blocks are scanned according to Figure 4. For the first block, the first value (127) needs to be coded as 9-bit signed, so we output 000 (the header) followed by 001111111 (the 9-bit signed representation of 127). The next value is -1, which we encode as a run of length 1, so we output 100 (header) followed by 01 (length 1 run). The next value is -2, which is representable on 4 bits signed. Following a similar approach, we output 01 (header) and then 1110 (-2 on 4-bit signed). Now only zeros remain in this block, so we add 111 (ZEROS\_TO\_END header) to the bitstream to finish this block.



Figure 5. Lossless coding of each block from Figure 3. Header bits are coloured red for illustrative purposes only.

Coding of the second block from Figure 3 proceeds in the same way, however, more data is required due to the presence of more nonzero values and the typically larger magnitude of the values. As before, encoding proceeds until the last nonzero value in the scanned sequence (-1) is reached, after which a ZEROS\_TO\_END header is placed in the bitstream. When the last block of the downsampled V plane has been coded, zero padding ensures the file ends at the end of a 16-bit word (for compliance with the 16-bit wide memory we use).

# 3 Image Decompression

In this section, the full decompression process for an image of size 16 x 16 is performed for the purpose of illustrating the flow of data from one stage to another and elaborating the internal details of each step in the decompression. Since the aim of this project is the hardware implementation of the decompression specification, *during the decompression, we only use fixed point arithmetic*, which has a simpler hardware design and generally uses fewer logic resources compared to floating point.

# 3.1 Lossless Decoding and Dequantization (Milestone 3 in Hardware)

Our reference image "example.ppm" is shown in Figure 6. The encoded bitstream is given in Figure 7. The header is 8 bytes. The first 4 bytes must be "DEADBEEF" in hex, which identifies a .mic17 file. After this, in the following 2 bytes, the most significant bit indicates which quantization matrix to use (in this example, bit 0 means use  $Q_0$ ) and the 15 least significant bits indicate the width (bits 000 0000 0001 0000 indicate a width of 16). The next 2 bytes indicate the height, which is also 16 in our example. After the header, the main content of the image, the sequence of compressed blocks, begins.



Figure 6. The reference image example.ppm to be compressed.

```
DE AD BE EF 00 10 00 10 0E 80 29 7A 9E D3 EC AD 5E 7B 1A B7 5C 9E 76 D1 AE 4E 57 58 E3 5C E3 8B D8 66 F6 EF 67 10 F6 7D DE 50 07 B0 EA CA C7 26 6A 9C 5D 7A 38 C7 18 E2 AE 55 CE DB 3D 8C 74 8E 89 6E F6 51 48 39 80 87 93 02 8E B0 ED 5E 7A 27 DD D4 AA 96 31 AE 55 C5 4A E5 6C 6B BB D8 D7 0C 52 DE 38 C7 1A 88 35 92 60 F8 55 A7 5A CA A7 DE 57 4E AE E3 1A B9 5D 63 78 E3 5C 5E C3 37 B7 7B 37 84 10 16 03 20 38 F9 80 B5 5E C6 36 71 55 1D 67 87 93 CE 39 D7 18 E3 57 B1 8E 18 63 1E 23 C0 F9 F6 01 CB 81 13 DA BB 4F 25 56 5D AA 69 29 58 A7 2C 63 8C 6B 8D 63 94 71 AE 55 C6 BC 3F 00 AA 9F 6C 4F A1 F6 C5 D7 9E C5 48 DD 44 A3 9D 72 AC B1 95 78 78 3E CA B4 F3 2B BA 96 31 C5 3C A4 F6 35 8E D1 C6 35 E0
```

Figure 7. The compressed bitstream example.mic17; the order of the bytes is across first and then down.

Figure 8 depicts the block decoding process. Variable length prefix coding is used for the header, which means depending on the first few bits of the header, we may need to read additional bits to determine the remainder of the header. We start by reading 2 bits: if we get 00, then there is a 9-bit value that follows; if we get 01, then there is a 4-bit value that follows; otherwise, we need to read another bit. Merging this with the 2-bit header bits already read gives us effectively a 3-bit header.

Cases 00X and 01X were covered (X is don't care). If this 3-bit header is 100, we have a run of -1 values, and 2 bits follow to indicate the length (which is 1 to 4 inclusive); if the header is 101, then we have a run of +1 values (2 bits follow for length); if the header is 110 then we have a run of zeros (3 bits follow for length, range of 1 to 8 inclusive); and if the header is 111, we have a run of zeros to the end of block.

Following this procedure, values are decoded one after another and placed in the 8x8 block according to the scan pattern of Figure 4. A block is complete either when a ZEROS\_TO\_END header is decoded (in which case the rest of the block is filled with 0s) or when 64 values (including from runs) have been decoded. The values are signed (2's complement format). Note for 4-bit or 9-bit coefficients, the header code LSB becomes the MSB of the coefficient, so the LSBs of the coefficient are read after the header.



Figure 8. Lossless decoding of a block based on the prefix code.

After the mic17 header, the bitstream begins with "00001110100000000101001..." ("0E 80 29..." in hex). Figures 9 and 10 on the next two pages show the results of the decoding example.mic17. The Y blocks appear first in the bitstream, row by row. Y 0,0 contains Y data for pixel rows 0-7, and columns 0-7 of the Y plane from the downsampled image (which is equivalent to the Y plane from the converted image since Y is not downsampled). Similarly, Y 0,1 contains rows 0-7, columns 8-15; Y 1,0 contains rows 8-15, columns 0-7; Y 1,1 contains rows 8-15, columns 8-15.

| Y 0 0                                                                                                                                                                                                                                                                                                                                                                                                                 |                                                                                                                                                                                                                        |                                                                                                                                                                                                                     |     |                                           |                                     |                                     |                                   |                                    |                                  |                                         |                                       |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----|-------------------------------------------|-------------------------------------|-------------------------------------|-----------------------------------|------------------------------------|----------------------------------|-----------------------------------------|---------------------------------------|
| 00001110100 : 116<br>00000001010 : 10<br>010111 : 7<br>10101 : 1<br>00111101101 : -19<br>00111101101 : -10<br>010101 : 5<br>10101 : 1<br>011110 : -2<br>011110 : -2<br>110001 : 0<br>10101 : 1<br>01101 : 1<br>01101 : 1<br>01101 : -5<br>10101 : 1<br>011011 : -5<br>10101 : 1<br>011011 : -5<br>10101 : 0<br>011110 : -2                                                                                            | 011101 : -3 10110 : 1 10001 : 1 110101 : 1 11010 : 0 011100 : -4 10101 : 1 110101 : 0 0 0 10001 : -1 110001 : 0 10101 : 1 110011 : 0 0 10101 : 1 110011 : 0 0 0 10101 : -1 110001 : 0 0 0 10101 : -2                   | 110000 : 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0                                                                                                                                                                            | (a) | 116<br>7<br>1<br>-2<br>0<br>-1<br>1       | 10<br>-19<br>-2<br>1<br>1<br>0<br>0 | -10<br>1<br>-5<br>1<br>0<br>1<br>-1 | 5<br>1<br>-3<br>-4<br>0<br>0<br>0 | 0<br>-2<br>1<br>-1<br>-2<br>0<br>0 | 0<br>0<br>0<br>0<br>0<br>-2<br>0 | 0<br>0<br>0<br>0<br>0<br>0<br>-2        | 0<br>0<br>0<br>0<br>0<br>0<br>0<br>0  |
| Y 0 1         00001111011       : 123         00111110111       : -9         011110       : -2         0101000       : 4         00000001111       : -15         011000       : -8         011101       : -3         010101       : 5         10001       : -1         110010       : 0         011001       : -7         10101       : 1         010011       : 3         10001       : -1         011101       : -3 | 011110 : -2<br>10001 : -1<br>110001 : 0<br>10001 : -1<br>110001 : 0<br>10001 : -1<br>110001 : 0<br>010101 : 5<br>110010 : 0<br>010101 : 1<br>110011 : 1<br>110011 : 1 1<br>110110 : 0 0 0<br>011110 : -2<br>110001 : 0 | 10001 : -1<br>110100 : 0 0 0<br>0<br>10001 : -1<br>110100 : 0 0 0<br>0<br>010010 : 2<br>110111 : 0 0 0<br>0 0 0<br>011110 : -2<br>110010 : 0 0<br>10001 : -1<br>010010 : 2                                          | (b) | 123<br>-2<br>4<br>-1<br>0<br>-1<br>0      | -9<br>15<br>5<br>0<br>0<br>-1<br>0  | -8<br>6<br>-7<br>-1<br>0<br>0       | -3<br>1<br>-2<br>5<br>1<br>0<br>0 | 3<br>-3<br>0<br>1<br>-2<br>-1<br>0 | -1<br>0<br>0<br>0<br>0<br>2<br>0 | 1<br>0<br>-1<br>0<br>0<br>0<br>-2<br>-1 | 0<br>0<br>0<br>0<br>0<br>0<br>0<br>0  |
| V 1 0  00001110011 : 115 011000 : 8 011110 : -2 010011 : 3 00000010100 : 20 011101 : -3 011000 : -8 011101 : -3 10101 : 1 011110 : -2 011110 : -2 011110 : -2 011110 : -2 10001 : -1 00111110111 : -9 011101 : -3 010010 : 2 10101 : 1                                                                                                                                                                                | 010010 : 2<br>110001 : 0<br>10001 : -1<br>10101 : 1<br>110010 : 0<br>010101 : 1<br>110001 : 0<br>010100 : 4<br>10101 : 1<br>110010 : 0<br>10110 : 1<br>110001 : 0<br>10110 : 1<br>110001 : 0<br>0 0 0                  | 011110 : -2<br>110001 : 0<br>10101 : 1<br>110000 : 0 0 0<br>0 0 0<br>110001 : 0<br>0 0 0<br>110001 : 2<br>110111 : 0 0 0 0<br>0 0 0<br>0 0 0<br>110001 : -1<br>110001 : 0<br>10001 : -1<br>110001 : 0<br>101001 : 0 | (c) | 115<br>-2<br>3<br>-2<br>-2<br>0<br>0      | 8<br>20<br>1<br>-1<br>1<br>0<br>0   | -3<br>-3<br>-9<br>-1<br>0<br>0      | -8<br>-3<br>0<br>4<br>0<br>0      | 2<br>2<br>1<br>1<br>-2<br>0<br>0   | 1<br>0<br>0<br>0<br>0<br>2<br>0  | 0<br>1<br>1<br>0<br>0<br>0<br>-1        | 1<br>0<br>0<br>0<br>0<br>0<br>-1<br>1 |
| V 1 1  00010000011 : 131 010110 : 6 010010 : 2 011000 : -8 00111110000 : -16 10101 : 1 011010 : -6 011101 : -3 011010 : -6 110010 : 0 10101 : 1 00111110111 : -9 10010 : -1 -1 10111 : 1 1 1 010011 : 3 10101 : 3                                                                                                                                                                                                     | 011101 : -3 110001 : 0 10001 : -1 10101 : 1 011100 : -4 10101 : 1 110101 : 0 0 0 0 0 10001 : -1 10111 : 1 1 1 10001 : -1 110001 : 0 10101 : 1 110001 : 0 011110 : 0                                                    | 110000 : 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0                                                                                                                                                                            | (d) | 131<br>2<br>-8<br>0<br>0<br>-3<br>0<br>-1 | 6<br>-16<br>-6<br>1<br>1<br>-1      | 1<br>-3<br>-9<br>3<br>1<br>1<br>0   | -6<br>-1<br>1<br>-4<br>1<br>0     | -1<br>1<br>1<br>-1<br>-2<br>0<br>0 | 1<br>0<br>0<br>0<br>0<br>-2<br>0 | 0<br>0<br>0<br>0<br>0<br>0<br>0<br>0    | 0<br>0<br>0<br>0<br>0<br>0<br>0<br>0  |

Figure 9. Lossless decoding of the Y blocks in example.mic17. Header bits are coloured in red for illustrative purposes only, the bitstream has no marker for header bits versus data bits.

Next in the bitstream comes the data for the U and V planes of the downsampled image, recall that they have half the columns of the Y plane due to downsampling. In the same way as for Y, U 0,0 contains U data for pixel rows 0-7, columns 0-7 of the downsampled image; U 1,0 contains rows 8-15, columns 0-7; V 0,0 contains V data for rows 0-7, columns 0-7; V 1,0 contains rows 8-15, columns 0-7.

| U 0 0                                                                                                                                                                                                                                                                                                                                                                                           |                                                                                                                                                                                                            |                                                                                                                                                                      |     |                                             |                                     |                                      |                                     |                                  |                                   |                                   |                                  |
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----|---------------------------------------------|-------------------------------------|--------------------------------------|-------------------------------------|----------------------------------|-----------------------------------|-----------------------------------|----------------------------------|
| 00010000010 : 130<br>00000010110 : 22<br>00000011001 : 25<br>00000001110 : 14<br>00111110011 : -13<br>00000001011 : 11<br>010101 : 5<br>011110 : -2<br>110001 : 0<br>10001 : -1<br>10110 : 1<br>011100 : -4<br>010101 : 5<br>010100 : -4                                                                                                                                                        | 011101 : -3<br>011010 : -6<br>011010 : -6<br>011110 : -2<br>010011 : 3<br>110011 : 0 0 0<br>10001 : -1<br>110011 : 0 0 0<br>10101 : 1<br>110001 : 0<br>10001 : -1<br>110001 : 0<br>10101 : 1<br>010101 : 1 | 110001 : 0<br>10001 : -1<br>110000 : 0 0 0<br>0 0 0<br>110000 : 0 0 0<br>0 0 0<br>110001 : 0<br>10001 : -1<br>111 : zeros<br>to<br>end                               | (a) | 130<br>25<br>14<br>-1<br>1<br>0<br>0        | 22<br>-13<br>0<br>1<br>3<br>0<br>-1 | 11<br>-2<br>-4<br>-2<br>-1<br>0<br>0 | 5<br>5<br>-6<br>0<br>-2<br>0<br>0   | 4<br>-6<br>0<br>1<br>0<br>0<br>0 | -3<br>0<br>0<br>0<br>0<br>0       | 1<br>-1<br>0<br>0<br>-1<br>0<br>0 | 0<br>0<br>0<br>0<br>0<br>0       |
| U 1 0                                                                                                                                                                                                                                                                                                                                                                                           |                                                                                                                                                                                                            |                                                                                                                                                                      |     |                                             |                                     |                                      |                                     |                                  |                                   |                                   |                                  |
| 000100011111       : 143         00000011111       : 31         00111110110       : -10         00000001110       : 14         010111       : 7         00000010001       : 17         0011110110       : -19         010111       : 7         011010       : -6         011110       : -2         010010       : 2         10101       : 1         011001       : -7         011101       : -3 | 10101 : 1<br>010011 : 3<br>010010 : 2<br>010100 : 4<br>10101 : 1<br>10001 : -1<br>010011 : 3<br>10010 : -1 -1<br>110001 : 0<br>10001 : -1<br>110001 : 0<br>10001 : -1<br>110001 : 1<br>110001 : 0          | 10101 : 1<br>10001 : -1<br>110010 : 0 0<br>10001 : -1<br>110001 : 0<br>10101 : 1<br>110010 : 0 0<br>10101 : 1<br>110001 : 0<br>10101 : 1<br>111 : zeros<br>to<br>end | (b) | 143<br>-10<br>14<br>-2<br>2<br>3<br>-1<br>1 | 31<br>7<br>-6<br>1<br>-1<br>-1<br>0 | 17<br>7<br>-7<br>1<br>0<br>-1<br>0   | -19<br>-3<br>4<br>-1<br>0<br>1<br>0 | 1<br>2<br>0<br>0<br>0<br>0<br>0  | 3<br>-1<br>-1<br>1<br>0<br>0<br>0 | 1<br>1<br>0<br>0<br>0<br>0<br>0   | 0<br>0<br>0<br>0<br>0<br>0       |
| v 0 0                                                                                                                                                                                                                                                                                                                                                                                           |                                                                                                                                                                                                            |                                                                                                                                                                      |     |                                             |                                     |                                      |                                     |                                  |                                   |                                   |                                  |
| 00001111110 : 126<br>00000001010 : 10<br>10101 : 1<br>00111110110 : -10<br>110001 : 0<br>00111110110 : -12<br>00111110110 : -12<br>00111110110 : -10<br>110001 : 0<br>011110 : -2<br>011110 : -2<br>011110 : -2<br>110001 : 0<br>010100 : 4<br>10001 : -1                                                                                                                                       | 10111 : 1 1 1<br>010100 : 4<br>010010 : 2<br>10001 : -1<br>110011 : 0 0 0<br>10101 : 1<br>110010 : 0 0<br>10101 : 1<br>10010 : -1 -1<br>110001 : 0<br>10010 : -1 -1<br>110010 : -1 -1                      | 111 : zeros<br>to<br>end                                                                                                                                             | (c) | 126<br>1<br>-10<br>-2<br>-2<br>0<br>0       | 10<br>0<br>-3<br>0<br>-1<br>0<br>0  | -12<br>0<br>4<br>2<br>1<br>0<br>0    | -10<br>-1<br>4<br>0<br>1<br>0<br>0  | 1<br>1<br>0<br>-1<br>0<br>0<br>0 | 1<br>1<br>-1<br>0<br>0<br>0<br>0  | -1<br>0<br>0<br>0<br>0<br>0<br>0  | -1<br>0<br>0<br>0<br>0<br>0<br>0 |
| v 1 0                                                                                                                                                                                                                                                                                                                                                                                           |                                                                                                                                                                                                            |                                                                                                                                                                      |     |                                             |                                     |                                      |                                     |                                  |                                   |                                   |                                  |
| 000011111000       : 120         00111110110       : -10         011010       : 5         011010       : -6         011110       : -7         011001       : -7         010111       : 7         011010       : -3         010010       : 2         110001       : 0         10001       : -1         110001       : 0         010011       : 3         110010       : 0                        | 10010 : -1 -1 011110 : -2 110001 : 0 10101 : 1 10001 : -1 110110 : 0 0 0 0 0 0 10001 : -1 110001 : 0 10001 : -1 110101 : 1 011 : zeros to end                                                              |                                                                                                                                                                      | (d) | 120<br>5<br>-6<br>0<br>-1<br>-1<br>0<br>0   | -10<br>-2<br>2<br>0<br>1<br>0<br>0  | -7<br>-3<br>3<br>0<br>0<br>0         | 7<br>0<br>-2<br>0<br>0<br>0<br>0    | 0<br>-1<br>0<br>0<br>0<br>0<br>0 | -1<br>0<br>1<br>0<br>0<br>0       | -1<br>-1<br>0<br>0<br>0<br>0<br>0 | 0<br>0<br>0<br>0<br>0            |

Figure 10. Lossless decoding of the downsampled U and V blocks in example.mic17. Header bits are coloured in red for illustrative purposes only, the bitstream has no marker for header bits versus data bits.

Upon completion of decoding, we have 8 blocks of data, 4 for Y (16/8 x 16/8), 2 for U and 2 for V. The matrices in the rightmost column (containing 8x8 matrices) of Figures 9 and 10 collectively make up the quantized image. From the flow illustrated in Figure 1, the next step is dequantization, which leads to the transformed image. This is done using pointwise multiplication between the losslessly decoded data (matrices in the rightmost column above) and the appropriate quantization matrix ( $Q_0$  in this case). For matrices A, B and C, pointwise (or element-wise) multiplication is defined in Equation 6 below.

$$A = B \cdot * C \qquad \langle \longrightarrow [A]_{i,j} = [B]_{i,j} * [C]_{i,j}$$
 (6)

The quantized images are shown on the leftmost column on page 12 and the result of dequantization is shown in the central column.

# 3.2 IDCT (Inverse Discrete Cosine Transform) (Milestone 2 in Hardware)

After dequantizing the blocks, the IDCT must be performed to recover the downsampled image. The IDCT is defined in Equation 7:

$$S = (int) \frac{1}{65536} \left( \mathbf{C}^T \times (int) \frac{1}{256} (\mathbf{S}' \mathbf{C}) \right)$$
 (7)

where the matrix S' is the dequantized 8x8 block, and the output of the transform S is the block of 8x8 samples from the downsampled image. The element on the  $i^{th}$  row and  $j^{th}$  column of the 8x8 matrix C is:

$$[C]_{i,j} = \begin{cases} (int) \left( \sqrt{\frac{1}{8}} \cos\left(\frac{i\pi}{8}(j+0.5)\right) \times 4096 \right) & \text{if } i = 0\\ (int) \left( \sqrt{\frac{2}{8}} \cos\left(\frac{i\pi}{8}(j+0.5)\right) \times 4096 \right) & \text{otherwise} \end{cases}$$
(8)

$$c = \begin{bmatrix} 1448 & 1448 & 1448 & 1448 & 1448 & 1448 & 1448 & 1448 \\ 2008 & 1702 & 1137 & 399 & -399 & -1137 & -1702 & -2008 \\ 1892 & 783 & -783 & -1892 & -1892 & -783 & 783 & 1892 \\ 1702 & -399 & -2008 & -1137 & 1137 & 2008 & 399 & -1702 \\ 1448 & -1448 & -1448 & 1448 & 1448 & -1448 & 1448 & 1448 \\ 1137 & -2008 & 399 & 1702 & -1702 & -399 & 2008 & -1137 \\ 783 & -1892 & 1892 & -783 & -783 & 1892 & -1892 & 783 \\ 399 & -1137 & 1702 & -2008 & 2008 & -1702 & 1137 & -399 \end{bmatrix}$$

Evaluating all cases in Equation 8 produces Equation 9. Equation 7 is defined similarly to the DCT from Equation 2, however the important differences are the intermediate and final scalar divisions, and the use of fixed-point coefficients. To re-iterate, the reason for choosing fixed-point (i.e., integer) operations is that they facilitate a more efficient hardware implementation when compared to floating-point.

Fixed-point, however, brings in necessary approximations that we achieve by using an intermediate bitwidth larger than the one necessary to store the IDCT outputs. For example, the cosine coefficients (which are between -1 and 1) are left-shifted by 12 positions (i.e., multiplied by 4096 as shown in Equation 7) to obtain an integer-only fixed-point value before processing (-4096 to 4096). Because the sample matrix is multiplied on both sides by the coefficient matrix the data will be left-shifted by 24 positions in total. The first adjustment of a right shift by 8 positions is provided after the post-multiplication S'C and the second adjustment is given by a right shift by 16 positions after the pre-multiplication  $C^T(S'C)$ . The reason why the first adjustment is only on 8 bits is because it retains more precision and thus it is more immune to the truncation errors.

Finally, a clipping function is used to ensure that rounding and overflows that have led to values out of the 8-bit unsigned range (0 to 255) are addressed by saturation (i.e., values smaller than 0 are forced to 0 and values greater than 255 are forced to 255). The results of dequantization and IDCT are as follows:



Figure 11. Dequantization (left to center column) followed by IDCT (center to right column) on each 8x8 block.

#### 0 149 119 110 131 97 89 64 79 127 143 130 8 116 139 128 140 120 122 139 152 17 121 167 107 176 151 157 163 139 33 96 116 148 103 108 139 133 97 12 134 138 85 96 82 7 145 163 152 147 139 153 142 153 147 159 109 18 144 153 23 111 150 135 122 136 195 131 36 146 120 158 10 195 144 153 126 112 124 107 0 101 8 99 146 143 132 153 127 128 131 126 102 124 139 15 111 146 144 139 111 14 120 141 168 114 15 136 169 95 13 122 137 154 127 80 14 63 96 128 107 144 143 143 101 122 143 125 135 179 37 169 186 182 135 131 175 154 155 104 8 134 150 202 163 29 152 135 135 144 141 157 173 178 98 152 164 190 193 115 132 127 145 150 150 110 168 168 175 182 187 193 159 149 124 154 140 94 122 167 104 57 80 97 91 123 174 149 121 111 114 116 114 121 156 154 113 72 54 41 61 139 169 164 162 165 133 84 171 215 203 180 170 145 108 104 106 110 113 114 117 120 112 120 124 118 106 98 130 87 86 86 89 138 181 125 129 129 139 151 142 122 116 118 117 114 113 112 110 187 110 79 105 86 46 104 107 104 97 99 108 151 142 126 126 130 130 133 143 160 174 177 183 186 163 173 139 102 98 100 109 113 116 126 174 171 159 161 132 155 167 62 77 79 29 53 171 174 113 110 100 106 116 114 110 192 165 159 191 168 172 172 136 113 116 137 141 182 174 78 83 68 16 37 160 180 83 90 119 130 134 184 165 177 172 174 180 137 147 157 122 122 122 123 114 103 103 106 104 102 107 106 130 130 127 128 122 110 106 142 160 166 167 150 119 100 145 138 123 122 123 115 108

#### 3.3 Upsampling and Colourspace Conversion (Milestone 1 in Hardware)

Figure 12. The output of IDCT is the downsampled YUV data.

The output of IDCT is downsampled YUV data, where U and V are horizontally downsampled by 2, as shown in Figure 12. The downsampled data becomes the even columns in the upsampled image, and the odd columns are generated by horizontal interpolation. For each row (rows are processed independently), we interpolate with a filter known as a finite impulse response (FIR) filter, which uses a weighted sum of a finite number of samples to produce each reconstructed sample. In our case that finite number is 6, so the filter is called a 6-tap filter. Given n samples in each row of the U and V planes, we produce 2n samples for the corresponding row of the reconstructed plane as follows:

$$U'[j] = \begin{cases} U\left[\frac{j}{2}\right] & j \ even \\ (int)\frac{1}{256}\left(21U\left[\frac{j-5}{2}\right] - 52U\left[\frac{j-3}{2}\right] + 159U\left[\frac{j-1}{2}\right] + 159U\left[\frac{j+1}{2}\right] - 52U\left[\frac{j+3}{2}\right] + 21U\left[\frac{j+5}{2}\right] + 128\right) & j \ odd \end{cases}$$
(10)

where U is the downsampled version and U' is the upsampled version. An analogous definition applies for V. In our example with an image of size 16x16, n is 8. In the project, the image will be of size 320x240, so n will be 320/2 = 160. The borders of **each row** are handled by replicating the respective border sample, i.e., U[0] = U[-1] = U[-2], and U[n-1] = U[n] = U[n+1] = U[n+2]. Note again the use of integer operations. These coefficients are borrowed from the ISO/IEC 13818-2 document on MPEG2 digital video decoding. Applying this filter to the U and V planes, the result in shown in Figure 13.

Having restored the full Y, U and V planes, the only step which remains is the colourspace conversion from YUV back to RGB. As discussed above (Equation 1), the conversion from RGB to YUV is in the form of a matrix multiplication. In essence, the RGB values of a given pixel are viewed as a vector in 3-space, and the matrix multiplication is a change of basis yielding another vector in 3-space. Since this is a linear transformation and the determinant of the matrix is nonzero, we may find its inverse, which will



Figure 13. The result after horizontally upsampling U and V.

provide the transformation back to the RGB basis from the YUV basis. This is applied to each pixel independently. Accounting for the vector addition to first shift the basis, the inverse conversion of the one above is:

$$\begin{bmatrix} R \\ G \\ B \end{bmatrix} = (int) \frac{1}{65536} \left( \begin{bmatrix} 76284 & 0 & 104595 \\ 76284 & -25624 & -53281 \\ 76284 & 132251 & 0 \end{bmatrix} \left( \begin{bmatrix} Y \\ U \\ V \end{bmatrix} - \begin{bmatrix} 16 \\ 128 \\ 128 \end{bmatrix} \right) \right)$$
(11)

As in the case of IDCT, an identical clipping function is used to deal with the under/overflow errors. The outputs R, G, and B are each on 8 bits unsigned. The result is shown in Figure 14.



Figure 14. After colourspace conversion, we obtain the final decompress RGB image.

The final decompressed image, as shown on the left (a) in Figure 15, requires 16x16x3 = 768 bytes (plus a header of 13 bytes), whereas the compressed image required only 230 bytes, giving a compression ratio of more than 3x. The original image appears on the right (c). As you can see, there is some reduction in the quality due to information lost during the compression and decompression process. The center image (b) shows the decompressed image for the case when quantization matrix  $Q_1$  from Equation 3 is used (instead of  $Q_0$  as is the case for this example, image (a)), which during compression retains more frequency information for use during reconstruction. The use of this matrix clearly improves the quality but reduces the compression. This discussion is summarized in the signal to noise ratio (SNR) values and file sizes given for both cases in Table 1.

|                                     | $oldsymbol{Q_0}$ image | $oldsymbol{Q_1}$ image | Original image |
|-------------------------------------|------------------------|------------------------|----------------|
| SNR (dB)                            | 18.65                  | 20.40                  | 8              |
| File size (bytes, including header) | 230                    | 378                    | 781            |
| Compression ratio                   | 3.4                    | 2.1                    | 1              |

Table 1. Quality versus compression tradeoff for the images using quantization matrices  $Q_0$  and  $Q_1$ .



Figure 15. Final decompressed images using quantization matrices  $Q_0$  (a) and  $Q_1$  (b) and the original image (c).

Equation 12 defines the signal to noise ratio where N is the number of pixels, error is the root mean squared error, and the 255 on the numerator is the maximum value of any pixel:

$$SNR(dB) = 20 \log_{10} \left( \frac{255}{\sqrt{\frac{error}{N}}} \right)$$
 where  $error = \sum_{n=0}^{N-1} (reference\_pixel_n - image\_pixel_n)^2$  (12)

This part of the document has provided the barebones of image compression (sufficient to work on the project). Further information on image compression can be found in the following references:

- [1] JPEG: Still Image Data Compression Standard, by William B., Pennebaker, Joan L., Mitchell, Springer; 1 edition (2006), ISBN: 0442012721
- [2] Digital Video and HDTV Algorithms and Interfaces, by Charles Poynton, Publisher: Morgan Kaufmann (2003), ISBN: 1558607927
- [3] Video Codec Design: Developing Image and Video Compression Systems, by Iain Richardson, Publisher: John Wiley & Sons (2002), ISBN: 0471485535
- [4] Introduction to Data Compression, Third Edition, by Khalid Sayood, Publisher: Morgan Kaufmann (2005), ISBN: 012620862X
- [5] Image and Video Compression for Multimedia Engineering: Fundamentals, Algorithms, and Standards, by Yun Q. Shi, Huifang Sun, Publisher: CRC Press (1999), ISBN: 0849334918
- [6] Digital Video Compression, by Peter Symes, Publisher: McGraw-Hill (2003), ISBN: 0071424873

#### 4 Project Guidelines

The design task of implementing the overall decoder has been partitioned into three milestones which can be completed each within one week. As part of the validation methodology, implementation will proceed from the VGA display backward to the compressed input. Starting from the display end enables a visual progress check at each milestone, thereby providing the framework for incremental integration of each newly designed module into the overall system.



Figure 16. Overview of the upsampling and colourspace conversion hardware implementation (milestone 1).

# 4.1 The Software Model and its Integration in the Design Process

Starting with a .mic17 file, there are 3 separate source codes, 1 for each milestone. The input/output files are organized exactly like the SRAM in hardware, that is,  $2^{18}$  = 262144 locations with 16 bits per location. Each source code reads data from a software model of the SRAM, performs the necessary computation, and then writes the data back to the SRAM software model. The decoding process starts at milestone 3, and the source code "decode\_m3" takes a .mic17 file and produces a .sram\_d2 file. Continuing into milestone 2, the source code "decode\_m2" takes a .sram\_d2 file and produces a .sram\_d1 file. Finally, for milestone 1, "decode\_m1" takes a .sram\_d1 file and produces a .sram\_d0 file and a .ppm file.

By separating the decoding process, we can do part of the decoding in software and part of it in hardware. The decoding software handles the first part of the decoding task, and the intermediate data can be passed to a testbench or to the board, where the hardware can take over the remainder of the decoding task to produce the image. In the case of the testbench, detailed feedback can be provided, which aids in tracing implementation errors, while the board provides less feedback but a full practical

operational test. For example, to validate if milestone 1 is working correctly in hardware, the .sram\_d1 file can be used to initialize the SRAM, and after the hardware simulation is finished, the .sram\_d0 file can be used to validate if the contents of SRAM are correct.

#### 4.2 Milestone 1: Upsampling and Colourspace Conversion

Figure 16 shows the implementation and validation plan for milestone 1. In this first week, colourspace conversion and interpolation will be tackled as indicated in the portion of the figure, which shows the flow from Figure 1. This will also require modifying the existing state machines to grant the decoder access to the external memory (by multiplexing). In fact, all the shaded portions of the figure will have to be built or modified. The SRAM Controller, UART SRAM Interface and VGA SRAM Interface shown in the figure have been provided in lab 5.

Figure 17 shows the memory layout, which is used for milestone 1. The last portion of the memory holds the completed RGB image stored in {R0 G0}, {B0 R1}, {G1 B1} ... format (as in lab 5 experiment 1). This is where the decompressed image will reside for all three milestones and, thus, for the completed project as well. It differs from the completed project because downsampled YUV data is stored in the SRAM as decoder stimuli since the earlier parts of the decoder are not yet implemented. By default (from the backbone in lab 5), the UART SRAM interface writes data in the SRAM starting at address 0, and the VGA SRAM interface starts reading from address 0. Clearly, the VGA SRAM interface must be modified to read the RGB data from the correct segment, as indicated in Figure 17. Also, the UART receiver in lab 5 experiment 4 strips the header from the .ppm file, which is no longer needed.



SRAM memory map (256k locations with 16 bits per location)

Figure 17. Memory layout for milestone 1.



Figure 18. Instantiation of a single 32-bit multiplier.

Figure 18 shows how a single 32-bit integer multiplier can be instantiated in your design. As will be elaborated on soon, multipliers must be shared to end up with a resource-efficient implementation because they are costly in terms of logic resources. Specifically, for this milestone, you are given the constraint of 4 multipliers (with input operands on 32 bits or less). Note these 4 multipliers are shared among colourspace conversion and upsampling, thus, there will be a total of three instantiations of the form for Figure 18. Each task comprises more than one multiplication, so the multipliers for each task must be shared. For this reason, the instantiation given in Figure 18 should be used, and the signals Mult\_op\_1, Mult\_op\_2 and Mult\_result should be multiplexed.

By explicitly instantiating multipliers, we are certain to have that many multipliers used in hardware. For example, in the following code snippet:

```
always_comb begin
  if (select == 1'b0) a = b*c;
  else a = d*e;
end
```

The multiplications happen at mutually exclusive times so that we could have 1 hardware multiplier with b and d multiplexed at one input to the multiplier and d and d multiplexed at the other input to the multiplier. However, this is not explicitly stated, so the computer-aided design (CAD) tool (in our case, Quartus) may choose to instantiate 2 hardware multipliers to compute both  $d^*d$  and  $d^*d$  and then take these 2 results and multiplex them to obtain the output d. To avoid this problem, we can explicitly specify that 1 (and only 1) multiplier should be used by multiplexing the operands:

```
always_comb begin
  if (select == 1'b0) begin
    op1 = b;
    op2 = c;
  end
  else begin
    op1 = d;
    op2 = e;
  end
end
assign a = op1*op2;
```

The average utilization of the 4 multipliers for milestone 1 must be at least 75%. The utilization of a multiplier is defined as the number of clock cycles a multiplier is used to perform calculations divided by the total number of clock cycles for the respective milestone. For clarification purposes, consider the following example on a hypothetical milestone whose duration is 100 clock cycles. Let's assume we use two multipliers. The first multiplier performs calculations for 75 clock cycles, hence its utilization is 75%. The second multiplier performs calculations for 95 clock cycles, hence its utilization is 95%. For this hypothetical milestone the average utilization is 85%.

Next, some architectural examples are provided regarding how the multipliers can be shared.

Figure 19 shows a straightforward implementation of a FIR filter, where the samples are stored in a shift-register structure. In this case, each sample undergoes a constant multiplication, and the results are all combined using an adder tree. Although this is attractive from the throughput perspective, as one filtered sample may be obtained every clock cycle, it is costly from the resource standpoint, especially considering that one such filter would be required for each component which must be upsampled (U and V). Furthermore, we must consider if we can receive a new U and a new V sample on every clock cycle, if we cannot, then on some clock cycles, this hardware will be under-utilized. A similar argument applies to whether we can process 1 upsampled U and 1 upsampled V per clock cycle after the FIR filter.



Figure 19. Resource inefficient FIR filter implementation.



Figure 20. Resource-efficient FIR filter implementation.

Figure 20 shows how a single multiplier can be shared to accomplish the same computation as but at the expense of throughput. In this case, the boxed portion is known as a Multiply-Accumulate (MAC) unit, which works by always multiplying the two inputs, adding the result to the register (known as the accumulation register or accumulator), and then storing that result back in the accumulator, to become an operand in the next clock cycle. The FSM guides the selection of both the sample and the constant factor, as well as initialization, which in this case is to account for the +128 in the upsampling equations of Equation 10, also shown in Figure 19. Using this architecture, only one multiplier is required, however, each reconstructed sample requires 7 clock cycles to produce instead of 1 as in the previous case. This is an example of a typical architectural tradeoff between resource usage and throughput.



Figure 21. Resource-efficient FIR filter implementation.

Figure 21 shows a cost-effective FIR implementation which is similar to Figure 20, but with the important distinction that the multiplexer that selects between the different samples (as set up by the FSM) is eliminated. The same result can be obtained by leveraging the shift structure that is already present due to the reuse of most of the samples in calculating adjacent reconstructed samples. Now instead of the FSM setting up select lines on a mux to obtain a desired sample, the calculation proceeds by holding the shift for the first clock cycle (during initialization of the accumulator to 128), and during the next 6 clock cycles, the samples are circularly shifted, each time the contents of the rightmost register being accumulated. At the end of these 6 clock cycles, each of the data returns to its original register, and the accumulation is complete (a new reconstructed sample has been produced). Although identical to Figure 20 in throughput, eliminating the large sample selector mux makes this design substantially more efficient. If one is more adventurous, we can exploit the symmetry of the FIR filter to need only 3 multiplications per upsample, however, this is **not needed** for this year's project; therefore, the number of multiplications for each upsampled odd value for U and V respectively **must** be done with 6 multiplications.



Figure 22. Resource inefficient colourspace conversion implementation.

In addition to the FIR filter implementation, we are faced with the challenge of sharing a multiplier for colourspace conversion. Figure 22 shows the straightforward way to implement Equation 11, with the advantage of high throughput (one full RGB pixel per clock cycle). However, due to the constraints of using only **four multipliers** for both interpolation and colourspace conversion, this solution is not suitable for our implementation.



Figure 23. Resource efficient colourspace conversion implementation.

In the same way that the MAC concept may be used to share multiplications for the FIR filters, it may be used for colourspace conversion. The data path for this is shown in Figure 23. In this case, the use of the mux between samples is sufficient since there are fewer samples from which to select, and the logic for the shift structure is not already in place as was the case for the FIR filter.

As emphasized during lab 5, state tables are a powerful way of capturing and synthesizing information about a state machine. Table 2 provides a good starting point for a state table that can be used for milestone 1. You are strongly urged to use state tables throughout this milestone's development, verification and debugging.

| State code / clock cycle | 0 | 1 | 2 | 3 | 4 | 5 |  |
|--------------------------|---|---|---|---|---|---|--|
| SRAM_address             |   |   |   |   |   |   |  |
| SRAM_read_data           |   |   |   |   |   |   |  |
| SRAM_write_data          |   |   |   |   |   |   |  |
| SRAM_we_n                |   |   |   |   |   |   |  |
| R                        |   |   |   |   |   |   |  |
| G                        |   |   |   |   |   |   |  |
| В                        |   |   |   |   |   |   |  |
| Υ                        |   |   |   |   |   |   |  |
| U'                       |   |   |   |   |   |   |  |
| U[(j+5)/2]               |   |   |   |   |   |   |  |
| U[(j-5)/2]               |   |   |   |   |   |   |  |
| V'                       |   |   |   |   |   |   |  |
| V[(j+5)/2]               |   |   |   |   |   |   |  |
| V[(j+5)/2]               |   |   |   |   |   |   |  |
|                          |   |   |   |   |   |   |  |

Table 2. A starting point for the state table in milestone 1.

As it is the case for any digital design project, one should not be wasteful with hardware resources. In your project report you will need to justify your design decisions, e.g., the need for registers, and their purpose.

#### 4.3 Milestone 2: IDCT

Figure 24 shows the overview for milestone 2, which is the implementation of IDCT. As before, the shaded portions of the figure indicate what needs to be built or modified. In particular, notice that the FSM and the logic in the top state machine (bottom portion of the figure) do not need to be modified since, in the previous milestone, you have made provision for the decoder circuitry (all of which will eventually reside in the block currently labelled "Partial Decoder") to access the SRAM.

As discussed in Section 4.1, the software model can be used to generate the necessary initialization and verification data for the SRAM. This data is used by the testbench for validation, and it can be sent to the board with a working design to get visual confirmation that the IDCT, upsampling and colourspace conversion are working together.



Figure 24. Overview of the IDCT implementation (milestone 2).

In Figure 25, the memory map for milestone 2 is provided, which is similar to that for milestone 1. The pre-IDCT data is stored as 1 value per memory location, whereas the post-IDCT data (YUV) is stored as 2 values per memory location. Note that after IDCT is complete, the RGB data computed in milestone 1 will overwrite some of the pre-IDCT data (used to initialize the SRAM for milestone 2).

Like milestone 1, the blocks are divided into segments of Y, segments of (downsampled) U, and segments of (downsampled) V. However, the way the data is accessed is different. In milestone 2, we want to fetch an entire 8x8 block of pre-IDCT data, perform IDCT and write it back to the YUV memory space. To determine the addressing for writing post-IDCT samples back to the memory, observe Figure 12 and note that the block from block row 0, block column 0 of the Y plane (Y 0,0) contains data from row 0, row1, ... row 7 of the image, all in columns 0 to 7 of the image. Extending this to the case of a 320x240 image, and noting that we have two samples per address, our write addressing for block 0 should be: 0 ... 3, ..., 1120 ... 1123 (as shown in Figure 26). For the second block of the first block row (Y 0,1), we have: 4 ... 7, ..., 1124 ... 1127. Once all the Y blocks have been processed, we proceed to U and V, with the important distinction that these two planes are downsampled and thus contain half the number of columns which Y contained. Thus U 0,0 would reside in addresses (38400+): 0 ... 3, ..., 560 ... 563. The last block V 29,19 would reside in addresses (57600+): 18636 .. 18639, ..., 19196 ... 19199.



Figure 25. Memory layout for milestone 2.



Figure 26. Reading pre-IDCT values and writing back YUV data to the external SRAM.

Like writing, the read addressing will also jump as we move from one pixel row to another within the same 8x8 block. However, now only 1 value is stored per memory location. To go across a row, the address simply increments. Within the same pixel column, to go down a pixel row (still within the same 8x8 block), the address increases by 320 (for Y, but only by 160 for U and V). Finally we add the appropriate offset depending on which colour segment we are in. Thus to read block Y 0,0 we use addresses (76800+): 0 ... 7, ..., 2240 ... 2247 (as in Figure 26). Likewise, to read block Y 0,1 we use (76800+): 8 ... 15, ..., 2248 ... 2255. On the next block row, to read block Y 1,0 we use (76800+): 2560 ... 2567, ..., 4800 ... 4807. For the block U 0,0 we use (153600+): 0 ... 7, ..., 1120 ... 1127 (please note for U and V we have half the columns, hence there are half the number of blocks per row).



Figure 27. MAC unit for block IDCT.

Figure 27 shows a MAC unit which may be used for calculating part of the IDCT on a single block of samples. Recall that the IDCT is simply a pair of matrix multiplications. Since a matrix multiplication with a constant coefficient matrix is nothing more than the sum of a number of products of the samples with constants, the MAC unit will meet the architectural need. The role of the FSM is to initialize the accumulator at the appropriate times and to set up the addresses properly to obtain the correct sample and coefficient in each clock cycle. Using such a unit, the pre- and post-matrix multiplication can be done concurrently (the number of multipliers for this milestone is clarified below).

To facilitate a simple hardware implementation, we can read from dual-port memories both and the fixed-point cosine coefficients (Equation 9) and the data value (pre-IDCT values or intermediate values after the first matrix multiplication). It is your task to determine a suitable memory layout for the dual-port RAMs. Because you are given this freedom, you must provide the appropriate .hex or .mif file to specify the dual-port memory initialization, particularly for the fixed-point cosine coefficients.

To maximize the utilization of the embedded memories, it is recommended that both ports are used and that the memory is organized on 32 bits per location (which is the maximum width allowed). It is your choice as to whether the results after the second matrix multiplication are written directly to the SRAM or buffered first in a dual-port RAM.

In this milestone, you may use 3 dual-port RAMs (of 4Kb capacity and not more than 32 bits per location) and only 4 multipliers are allowed (with input operands on 32 bits or less). Note also, the average utilization of the 4 multipliers for milestone 2 must be at least 95%.

You are encouraged to use a state table such as the one in Table 3 during your development.

| State code / clock cycle | 0 | 1 | 2 | 3 | 4 | 5 |  |
|--------------------------|---|---|---|---|---|---|--|
| SRAM_address             |   |   |   |   |   |   |  |
| SRAM_read_data           |   |   |   |   |   |   |  |
| SRAM_write_data          |   |   |   |   |   |   |  |
| SRAM_we_n                |   |   |   |   |   |   |  |
| Address_0                |   |   |   |   |   |   |  |
| Data_in_0                |   |   |   |   |   |   |  |
| Write_en_0               |   |   |   |   |   |   |  |
| Data_out_0               |   |   |   |   |   |   |  |
| Address_1                |   |   |   |   |   |   |  |
| Data_in_1                |   |   |   |   |   |   |  |
| Write_en_1               |   |   |   |   |   |   |  |
| Data_out_1               |   |   |   |   |   |   |  |
| S                        |   |   |   |   |   |   |  |
| S'                       |   |   |   |   |   |   |  |
| С                        |   |   |   |   |   |   |  |
|                          |   |   |   |   |   |   |  |

Table 3. A starting point for the state table in milestone 2.

# 4.4 Milestone 3: Lossless Decoding and Dequantization



Figure 28. Lossless decoding, dequantization, and complete integration (milestone 3).

Finally, once colourspace conversion, interpolation and IDCT are complete, we can move on to lossless decoding, where the compressed bitstream itself is parsed, and the raw blocks are extracted and dequantized before being passed to the block IDCT, which completes the IDCT and writes the complete block of YUV data (belonging to the downsampled image of Figure 1) back to the SRAM. Once this milestone is complete, you will have a working image decoder.



SRAM memory map (256k locations with 16 bits per location)

Figure 29. Memory layout for milestone 3.

Because the decoding process works on blocks of 8x8, after a block is decoded, it is immediately passed to block IDCT (milestone 2) by buffering into an embedded memory that is written to by milestone 3 and read from by milestone 2. The SRAM read address logic from milestone 2 will no longer be needed when milestones 2 and 3 work concurrently. Nevertheless, it is very important to note that you are strongly recommended to get milestone 3 working independently before integrating it to work concurrently with milestone 2. This will give you a chance first to understand the functionality and work out the implementation details for the lossless decoder and the dequantizer, and only then can you focus on dealing with concurrency-related, that will arise while doing the integration with milestone 2.

The memory layout for milestone 3 is shown above in Figure 29. No memory is allocated for the pre-IDCT data; rather this is processed by milestone 2 before being written back to the SRAM. After this, the RGB data generated by milestone 1 may overwrite the compressed image bitstream.

The basic decoding process is captured in Figure 8 at the beginning of the image decompression example. This flow chart can be used to construct a state table for guiding your state machine implementation. Part of the challenge of lossless decoding is implementing a serializer which enables extracting the coefficient data by reading 16-bit words stored in the SRAM and delivering them to the decoder in a serial fashion. Figure 30 shows the basic architecture of the serializer. A custom FSM is required to drive the SRAM address and to increment it only if all the bits have been extracted from the current memory location. A shift register, whose size is a multiple of 16, e.g., 32 bits, can be used to

buffer the data on-chip, as codes and values may span two memory locations. In each clock cycle, the mux network will choose which bits from the location will be processed and, therefore, also shifted out of the 32-bit shift register. Note the shifting of bits in this 32-bit buffer may happen over more than one clock cycle, depending on your FSM (for example, we may shift out 6 bits by doing a shift by 3 on two consecutive clock cycles).



Figure 30. SRAM data is serialized for extracting coefficients.

Once the coefficients have been extracted, they must be dequantized before being placed in the dual-port RAM to await the block IDCT. Due to the special structure of the two possible quantization matrices (as in Equation 3), which contain only powers of two, a barrel shifter structure can be used to implement the dequantization (the type of the quantization matrix must also be used as an input to this barrel shifter). Furthermore, note the scan sequence of Figure 4 must be implemented so that the values are delivered from the bitstream to the correct locations in the 8x8 block.

To facilitate easier integration between block IDCT and lossless decoding and quantization, you are allowed one extra dual-port RAM (of 4Kb capacity and not more than 32 bits per location), thus a total of 4 dual-port RAMs can be used in milestones 2 and 3 together. This extra RAM can be used for loading the out-of-order pre-IDCT values, and then during the clock cycles when milestone 2 does not need to fetch pre-IDCT values, the content from this extra embedded RAM can be loaded into the embedded RAM shared between milestones 2 and 3. You may also choose to write directly to the embedded RAM shared between milestones 2 and 3 (without the additional embedded RAM), however, this requires more design effort to ensure no scheduling conflicts between writing from milestone 3 and reading from milestone 2.

To handle runs of zeros, there are two basic approaches. One method is to simply write a sequence of zeros when it happens (in different memory locations and thus over several clocks). The other method is to initialize all values to zero, and then when a run of zeros is decoded, we can skip over that many memory locations.

Finally, successful implementation of the lossless decoding and dequantization relies heavily on getting your state machine to work properly and synchronizing the signals between the state machine and the dual-port RAMs (used for IDCT), the serializer, the top-level state machine, etc.

#### **ENJOY!**

# 5 Appendix A: Summary of Data Flow and Computation

#### 5.1 Milestone 1

<u>Input</u>: data is represented as **8-bit unsigned**. Each SRAM location contains 2 values. There are separate sections for Y, U, and V. Within each section, data is stored in raster-scan order (across first, then down).

<u>Computation</u>: it is advised to do *all arithmetic on 32-bit signed*. Upsample the odd values of U and V (no clipping here). Each pixel is independently converted from YUV to RGB, after this, clipping to 8-bit *unsigned* is done. In this milestone, you must use *4 multipliers* with at least 75% utilization (it is requested that the symmetry of the filter coefficients for interpolation **not** be used).

$$U'[j] = \begin{cases} U\left[\frac{j}{2}\right] & j \ even \\ (int)\frac{1}{256}\left(21U\left[\frac{j-5}{2}\right] - 52U\left[\frac{j-3}{2}\right] + 159U\left[\frac{j-1}{2}\right] + 159U\left[\frac{j+1}{2}\right] - 52U\left[\frac{j+3}{2}\right] + 21U\left[\frac{j+5}{2}\right] + 128\right) & j \ odd \\ \begin{bmatrix} R \\ G \\ B \end{bmatrix} = (int)\frac{1}{65536}\left(\begin{bmatrix} 76284 & 0 & 104595 \\ 76284 & -25624 & -53281 \\ 76284 & 132251 & 0 \end{bmatrix}\left(\begin{bmatrix} Y \\ U \\ V \end{bmatrix} - \begin{bmatrix} 16 \\ 128 \\ 128 \end{bmatrix}\right)\right) \end{cases}$$

<u>Output</u>: each of R, G and B are **8-bit unsigned**. Data for 2 pixels requires 3 SRAM locations to store. The memory layout of the RGB segment resembles lab 5 experiment 1 (see Figure 17 for starting address).

#### 5.2 Milestone 2

Input: data is represented as 16-bit signed.

<u>Computation</u>: it is advised to do *all arithmetic on 32-bit signed*. There are 2 matrix multiplications per output 8x8 block, after which each value is clipped to *8-bit unsigned*. Take note of the SRAM access pattern since data is processed in blocks of 8x8. *Up to 3 embedded memories* can be used, and the choice of how these are organized is yours. In this milestone, you must use *4 multipliers* with at least 95% utilization (note: these 4 multipliers are in addition to the 4 multipliers used for milestone 1).

$$\mathbf{S} = (int) \frac{1}{65536} \left( \mathbf{C}^T \times (int) \frac{1}{256} (\mathbf{S}' \mathbf{C}) \right)$$

Output: 8-bit unsigned values are written to the SRAM, 2 values per SRAM location.

#### 5.3 Milestone 3

<u>Input</u>: the .mic17 file is a *bitstream*, to get the next bit, go from the most significant bit to the least significant bit within the same SRAM location, and only then read and process the next SRAM location.

<u>Computation</u>: decode each 8x8 block independently. Each prefix code may produce several values. Values are processed in the zigzag order for each block. For each value, once the position in the 8x8 block is known, dequantization is applied. Which quantization matrix to use must be extracted from the .mic17 header. *One extra embedded memory* (in addition to the 3 used in milestone 2) may be used.

<u>Output</u>: each dequantized value is represented as **16-bit signed**. Before you integrate milestone 3 into milestone 2, it is strongly recommended that you implement it separately, in which case take note of the SRAM access pattern since data is processed in blocks of 8x8.