nydus-image: add documentation for chunk-level deduplication

Signed-off-by: Lin Wang <l.wang@mail.dlut.edu.cn>
dragonflyoss · Jul 2, 2024 · 6e60317
1 parent 733e57e
commit 6e60317

Showing 2 changed files with 153 additions and 0 deletions.
diff --git a/docs/chunk-deduplication.md b/docs/chunk-deduplication.md
@@ -0,0 +1,153 @@
+# Chunk-Level Deduplication: Storage Optimization for Nydus Images
+
+## Problem Introduction
+
+Container images often contain many duplicate files or content, and these duplicates occupy a large amount of storage space, especially in high-density deployment scenarios. As the number of Nydus images grows, this leads to problems such as low storage utilization and excessive bandwidth consumption. An effective deduplication mechanism is needed to address this.
+
+Unlike traditional OCI images, which are distributed at layer granularity, the smallest unit of a Nydus image is a chunk, so deduplication must operate on chunks. We also want to deduplicate along several dimensions: between different Nydus images and between different versions of the same Nydus image. Whichever dimension is considered, the essence is the same: only one copy of each repeated chunk is retained, and every other occurrence is replaced by a reference to that copy. This reduces storage usage, makes better use of Nydus's data transfer and storage capabilities, and improves image access speed and efficiency.
+
+## General idea
+
+The deduplication algorithm first selects the duplicate chunks in the images based on image information such as how often a chunk occurs, the chunk size, and the image and version the chunk belongs to, and then generates a chunkdict. The chunkdict records the unique identifier (fingerprint) of each selected chunk. Only the chunkdict needs to be stored; other images can then reference the chunks it contains.
+
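As a rough illustration of the idea (not the nydus-image implementation), a chunkdict can be modelled as a set of chunk digests: when a new image is built against it, only chunks whose digest is missing from the dictionary still need to be stored. The digests, sizes, and the helper name `new_bytes_needed` are hypothetical.

```python
from __future__ import annotations


# Hypothetical model: an image maps chunk digest -> compressed size in bytes.
def new_bytes_needed(image_chunks: dict[str, int], chunkdict: set[str]) -> int:
    """Bytes still to be stored once chunks found in the chunkdict are referenced."""
    return sum(size for digest, size in image_chunks.items() if digest not in chunkdict)


chunkdict = {"sha256:aaa", "sha256:bbb"}  # made-up digests
new_image = {"sha256:aaa": 4096, "sha256:bbb": 8192, "sha256:ccc": 1024}

# Only "sha256:ccc" is absent from the dictionary, so only its bytes are counted.
print(new_bytes_needed(new_image, chunkdict))
```
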
+The deduplication algorithm has two parts. The first part is the DBSCAN clustering algorithm, which deduplicates across different images. The second part is the exponential smoothing algorithm, which deduplicates across different versions of the same image.
+
+**The general process is as follows:**
+
+1. Store the image information in the local database.
+2. Extract the image information and call the DBSCAN clustering algorithm to deduplicate different images.
+3. Remove the dictionary content produced in step 2 from the images, then call the exponential smoothing algorithm on each image separately to deduplicate across its versions.
+4. Collect the deduplication dictionaries generated by the two algorithms and write them to disk.
+5. Generate a chunkdict image and push it to the remote repository.
+
+## Algorithm detailed process
+
+### Overall Input
+
+```shell
+# The --backend-config-file and --backend-type options are optional.
+nydusify chunkdict generate --sources \
+     registry.com/redis:nydus_7.0.1, \
+     registry.com/redis:nydus_7.0.2, \
+     registry.com/redis:nydus_7.0.3 \
+     --target registry.com/redis:nydus_chunkdict \
+     --source-insecure --target-insecure \
+     --backend-config-file /path/to/backend-config.json \
+     --backend-type oss
+```
+
+## Use the chunk dict image to reduce the incremental size of the new image
+
+```shell
+nydusify convert \
+     --source registry.com/redis:OCI_7.0.4 \
+     --target registry.com/redis:nydus_7.0.4 \
+     --chunk-dict registry.com/redis:nydus_chunkdict
+```
+
+***
+`nydusify chunkdict generate` calls the subcommand `nydus-image chunkdict generate` to store image information in the database and generate a new bootstrap as the chunkdict bootstrap.
+
+Download multiple Nydus images in advance and put them into the repository as a dataset, for example 10 consecutive versions each of redis and alpine. Then execute `nydus-image chunkdict generate` to store the chunk and blob information in the chunk and blob tables of the database.
+
+```shell
+# Store multiple images in the database
+nydus-image chunkdict generate --source \
+     /path/localhost:5000:redis:nydus_7.0.1/nydus_bootstrap, \
+     /path/localhost:5000:redis:nydus_7.0.2/nydus_bootstrap, \
+     /path/localhost:5000:redis:nydus_7.0.3/nydus_bootstrap \
+     --bootstrap /path/to/chunkdict_bootstrap \
+     --database /path/to/database.db \
+     --output-json /path/to/nydus_bootstrap_output.json
+```
+
+***
+
+### Deduplication algorithm
+
+#### Algorithm 1 Deduplication between different images (DBSCAN clustering algorithm)
+
+***
+**Basic principle:** DBSCAN is a density-based clustering algorithm that examines the connectivity between samples through sample density. Samples of the same category are closely connected; in other words, near any sample of a category there must be other samples of the same category. DBSCAN can therefore group objects that are densely packed and close together, can find clusters of arbitrary shape, and does not need the number of clusters to be specified in advance, which suits high-density deployment scenarios.
+
+**Input:** Read the chunk information from the database and store it in the chunk list. Chunk information includes: image_name, version, chunk_blob_id, chunk_digest, chunk_compressed_size, and so on.
+
+**Output:** The chunk dictionary corresponding to each image cluster.
+
+**Basic steps:**
+**1.** For every image, select a fixed proportion of its versions as the training set and use the rest as the test set.
+
+**2.** Group all chunks in the training set into lists by image_name, so that each list corresponds to one image and contains all of that image's chunks.
+
+**3.** Cluster these images using the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm.
+
+***
+3.1  Initialize the core point set $\Omega$ as an empty set, and set the neighborhood radius $\gamma = 0.5$ and the sample number threshold $MinPts = 10$.
+
+3.2  Loop through each image and its corresponding chunk list, and calculate its distance from every other image according to the following formula.
+$$ distance(x,y) = \frac{\lvert C(R_x) \cup C(R_y) \rvert - \lvert C(R_x) \cap C(R_y) \rvert}{\lvert C(R_x) \cup C(R_y) \rvert} $$
+where $C(R_x)$ is the set of unique chunks across all training-set versions of image x. Count the images y for which $distance(x,y) \leq \gamma$. If there are $M$ such images and $M \geq MinPts$, add image x to the core point set $\Omega$; each such image y is said to be in the neighborhood of the core image x.
+
+3.3  Initialize the number of clusters $k = 0$, then iterate over the core images in turn. Add all images in the neighborhood of a core image to a queue; if an image in the neighborhood is itself a core image, add all images in its neighborhood to the queue as well. Assign the images in the queue to one cluster, and continue traversing the core image set until every core image has been visited.
+
+3.4  Calculate how often each chunk appears across the images of each cluster. Add every chunk that appears in more than $90\%$ of the training-set images in a cluster to the dictionary for that cluster, producing a set of (cluster, dictionary) pairs.
+***
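Steps 3.1 to 3.3 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the nydus-image code: the function names are hypothetical, each image is represented by a plain set of chunk identifiers, an image is not counted as its own neighbor, and two empty chunk sets are treated as distance 0.

```python
from collections import deque


def jaccard_distance(a: set, b: set) -> float:
    """(|A ∪ B| - |A ∩ B|) / |A ∪ B|, the distance used in step 3.2."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty chunk sets count as identical
    return (len(union) - len(a & b)) / len(union)


def cluster_images(chunks_by_image: dict, radius: float = 0.5, min_pts: int = 10) -> list:
    """Return clusters as sets of image names; images in no cluster are omitted."""
    names = list(chunks_by_image)
    # Step 3.2: neighborhood of x = every other image y with distance(x, y) <= radius.
    neighbors = {
        x: {y for y in names
            if y != x and jaccard_distance(chunks_by_image[x], chunks_by_image[y]) <= radius}
        for x in names
    }
    core = {x for x in names if len(neighbors[x]) >= min_pts}

    # Step 3.3: grow one cluster per unvisited core image via a queue.
    clusters, visited = [], set()
    for x in names:
        if x not in core or x in visited:
            continue
        cluster, queue = {x}, deque([x])
        visited.add(x)
        while queue:
            current = queue.popleft()
            for y in neighbors[current]:
                if y in visited:
                    continue
                visited.add(y)
                cluster.add(y)
                if y in core:  # only core images pull in their own neighborhood
                    queue.append(y)
        clusters.append(cluster)
    return clusters


# Made-up data: a, b and c share most chunks, z shares none.
images = {"a": {1, 2, 3, 4}, "b": {1, 2, 3, 5}, "c": {1, 2, 3, 6}, "z": {7, 8, 9}}
print(cluster_images(images, radius=0.5, min_pts=2))
```

With the toy data, `min_pts` is lowered to 2 so that a cluster can form from only four images; with the document's value of 10, no image would qualify as a core image.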
+**4.** Adjust the neighborhood radius and repeat step 3 to obtain multiple deduplication dictionaries.
+
+**5.** Use the test set to evaluate the deduplication dictionaries from step 4, and select the chunk dictionary that gives the test set the smallest storage footprint.
+
+**6.** Remove the chunks in the dictionary selected in step 5 from all images (training set and test set), then repeat steps 1-5 to generate further chunk dictionaries until the maximum number of cycles (7) is reached, or until discrete (unclustered) images make up more than 80% of the total number of images.
+
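The frequency rule in step 3.4 can be sketched as below. It assumes one reading of that step: a chunk enters the cluster dictionary when it occurs in more than 90% of the images in the cluster. The function name `cluster_dictionary` and the chunk names are hypothetical.

```python
from collections import Counter


def cluster_dictionary(chunk_sets: list, threshold: float = 0.9) -> set:
    """Chunks present in more than `threshold` of the images of one cluster."""
    # Each image is a set, so a chunk is counted at most once per image.
    counts = Counter(chunk for chunks in chunk_sets for chunk in chunks)
    return {chunk for chunk, n in counts.items() if n / len(chunk_sets) > threshold}


# Made-up cluster of three images; only "c1" appears in all of them.
cluster = [{"c1", "c2", "c3"}, {"c1", "c2", "c4"}, {"c1", "c5"}]
print(cluster_dictionary(cluster))
```
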
+The diagram below shows how DBSCAN divides samples into clusters:
+![dbscan algorithm](images/nydus_chunkdict_dbscan_algorithm.png)
+**Remark:** The picture in this section and the associated DBSCAN description are taken from [https://en.wikipedia.org/wiki/DBSCAN](https://en.wikipedia.org/wiki/DBSCAN).
+
+#### Algorithm 2 Deduplication between different versions of the image (exponential smoothing algorithm)
+
+***
+**Basic principle:** Exponential smoothing is a method for predicting and smoothing time series data. It takes a weighted average of the data, gives higher weight to chunks that were repeated more recently, and continually updates the smoothed value, so newer chunks have a greater impact on future predictions while the impact of older data gradually weakens.
+
+**Input:** The training set and test set after deduplication in algorithm 1.
+
+**Output:** The chunk dictionary corresponding to each image.
+
+**Basic steps:**
+**1.** Group all chunks in the training set into lists by image_name, so that each list corresponds to one image and contains all of that image's chunks.
+
+**2.** Sort the versions of each image chronologically, and score each chunk according to the exponential smoothing formula.
+$$ S_0 = 0, \quad S_t = \alpha Y_{t-1} + (1 - \alpha) S_{t-1} $$
+where $\alpha = 0.5$ and $Y_{t-1}$ indicates whether the chunk appeared in the previous version of the image: 1 if it did, otherwise 0.
+
+**3.** Compute the score of each chunk and select all chunks with a score greater than the threshold $THs$ as the chunk dictionary. Deduplicate the image versions in the test set with this dictionary and calculate the storage space they occupy.
+
+**4.** Vary $THs$ from 0.8 down to 0.5 in steps of 0.05, repeating steps 2 and 3 each time to generate multiple chunk dictionaries.
+
+**5.** Choose the chunk dictionary that minimizes the test set's storage space.
+***
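The scoring in steps 2 to 4 can be sketched as follows. This is an illustrative sketch, not the nydus-image implementation: the function names are hypothetical, each version is a plain set of chunk identifiers ordered oldest first, and the score used for selection is assumed to be the final value $S_n$ after the last version.

```python
def smoothing_score(presence: list, alpha: float = 0.5) -> float:
    """S_0 = 0, S_t = alpha * Y_{t-1} + (1 - alpha) * S_{t-1}; returns the final S_n."""
    score = 0.0
    for y in presence:  # presence[t-1] is Y_{t-1}: 1 if the chunk is in that version
        score = alpha * y + (1 - alpha) * score
    return score


def version_dictionary(versions: list, threshold: float, alpha: float = 0.5) -> set:
    """Chunks whose final score exceeds `threshold`; `versions` is sorted oldest first."""
    all_chunks = set().union(*versions)
    return {
        chunk for chunk in all_chunks
        if smoothing_score([int(chunk in v) for v in versions], alpha) > threshold
    }


# Made-up versions: "a" is always present, "b" disappears, "c" appears later.
versions = [{"a", "b"}, {"a", "b"}, {"a", "c"}, {"a", "c"}]
thresholds = [t / 100 for t in range(80, 45, -5)]  # 0.8, 0.75, ..., 0.5
dictionaries = {th: version_dictionary(versions, th) for th in thresholds}
print(dictionaries[0.8], dictionaries[0.5])
```

In this toy run the scores are 0.9375 for `a`, 0.75 for `c`, and 0.1875 for `b`, so a threshold of 0.8 keeps only `a`, while 0.5 keeps `a` and `c`. The recently added chunk outranks the one that stopped appearing.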
+
+### Exponential Smoothing Algorithm Test
+
+#### Procedure
+
+**1.** Download 10 versions of each OCI image and count the total size in MB.
+**2.** Convert the OCI images to Nydus format and then count the total size in MB after conversion.
+**3.** Select three versions of each image to generate a chunk dictionary. Use the chunk dictionary to convert the remaining seven versions of the image, and then count the total size in MB after deduplication.
+
+#### Image Information Table
+
+| **Image Name** | **Number of Versions** | **Total Image Size (OCI)** | **Total Image Size (Nydus)** |
+| :------------: | :--------------------: | :------------------------: | :--------------------------: |
+|   **Redis**    |           10           |         341.78 MB          |          419.37 MB           |
+|   **Ubuntu**   |           10           |         290.26 MB          |          308.59 MB           |
+|   **Alpine**   |           10           |          26.9 MB           |           27.55 MB           |
+
+#### Deduplication Results Table
+
+| **Image Name** | **Chunkdict Image Size** | **Total Image Size (Nydus after Deduplication)** | **Deduplication Rate** |
+| :------------: | :----------------------: | :----------------------------------------------: | :--------------------: |
+|   **Redis**    |         41.87 MB         |                    319.48 MB                     |         23.82%         |
+|   **Ubuntu**   |         30.8 MB          |                    140.28 MB                     |         54.54%         |
+|   **Alpine**   |         2.74 MB          |                     24.7 MB                      |         10.34%         |
+
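The reported rates appear consistent with the relative saving against the Nydus total, that is (Nydus size - deduplicated size) / Nydus size. The document does not state this formula explicitly, so it is an inference from the table values; the short check below recomputes it from the two tables.

```python
# Sizes in MB copied from the two tables: (Nydus total, total after deduplication).
sizes = {
    "Redis": (419.37, 319.48),
    "Ubuntu": (308.59, 140.28),
    "Alpine": (27.55, 24.7),
}


def dedup_rate(before_mb: float, after_mb: float) -> float:
    """Percentage of the Nydus size saved by deduplication (assumed formula)."""
    return (before_mb - after_mb) / before_mb * 100


for name, (before, after) in sizes.items():
    print(f"{name}: {dedup_rate(before, after):.2f}%")
```
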
+***
diff --git a/docs/images/nydus_chunkdict_dbscan_algorithm.png b/docs/images/nydus_chunkdict_dbscan_algorithm.png