# Compaction Project

## Problem Formulation

Given a set of chunks whose sizes are

$$
S_n = \{d_1, \cdots, d_n\},
$$

where the positive integer $d_i \leq 2048$ for all $i = 1, \cdots, n$. Suppose the remaining operators need time 

$$
f(d_i) = C_1 + d_i \times C_2
$$

to process a data chunk with the size $d_i$. Our goal is to compact the set $S$, i.e., we need a transformation

$$
\mathcal{M}: S_n \rightarrow S'_m \triangleq \{d'_1,  \cdots, d'_m\},
$$

where $\sum_i^n d_i = \sum_j^m d'_j$ and $m$ is an arbitrary integer less than $n$, to minimize 

$$
\sum_j^m f(d'_j) + cost(M, S).
$$

where $cost(\mathcal{M}, S)$ is the cost of the transformation $\mathcal{M}$ on the set $S$. 

The cost of combining two or more chunks into one: $d_i + \cdots + d_j = d'_s \leq 2048$, is 

$$
g(d'_s) = C_3 + d'_s \times C_4.
$$

**Note:** This formulated problem is easier than the real compaction problem because we have the sizes of all data chunks in advance, rather than a chunk stream.

In [58]:
# utils
from termcolor import colored

def print_color(text, color='black'):
    print(colored(text, color))

## 1. Distribution of Chunk Sizes

In [59]:
import numpy as np

# Generate random chunk sizes from a Gaussian distribution
def generate_chunk_sizes(n, mean=64, scale=256):
    return np.maximum(0, np.random.normal(mean, scale, n)).astype(int)

## 2. Define the Processing Cost and Compaction Cost

In [107]:
#             fixed cost      per tuple cost
# probe()     1.5             0.03
# next()      0.9             0.06
# --------------------------------------
# compact()   0.3             0.03
# --------------------------------------

k_pcs_fixed_cost = (1.5 + 0.9)
k_pcs_per_tuple_cost = (0.03 + 0.06)
k_cpt_fixed_cost = 0.3
k_cpt_per_tuple_cost = 0.03

def simulate_join(sizes, compact_func, chunk_factor=1, level=1):
    prc_cost = 0
    cpt_cost = 0
    next_sizes = np.array(sizes)

    for _ in range(level):
        # join
        prc_cost += k_pcs_fixed_cost * len(next_sizes) + np.sum(next_sizes) * k_pcs_per_tuple_cost
        next_sizes = np.repeat(next_sizes // chunk_factor, chunk_factor)

        # compact
        next_sizes, cost = compact_func(next_sizes)
        cpt_cost += cost

    return prc_cost, cpt_cost


def compute_cpt_cost(compacted_chunk_sizes):
    return k_cpt_fixed_cost + np.sum(compacted_chunk_sizes) * k_cpt_per_tuple_cost

## 3. Compaction Strategies

In [108]:
# Strategy 1: Do not compact any chunks
def alg_no_compaction(chunk_sizes):
    return chunk_sizes, 0

# Strategy 2: Fully compact all chunks
def alg_full_compaction(chunk_sizes):
    transformed_sizes = []
    cpt_cost = 0
    cpt_sizes = []
    for size in chunk_sizes:
        if sum(cpt_sizes) + size <= 2048:
            cpt_sizes.append(size)
        else:
            cpt_cost += compute_cpt_cost(cpt_sizes)
            transformed_sizes.append(sum(cpt_sizes))
            cpt_sizes = [size]
    transformed_sizes.append(sum(cpt_sizes))

    return transformed_sizes, cpt_cost

In [110]:
chunk_sizes = generate_chunk_sizes(n=128, mean=2048, scale=0)
chunk_factor = 8
level = 2

grades = {
    "No Compaction": simulate_no_compaction(chunk_sizes, chunk_factor, level), 
    "Full Compaction": simulate_full_compaction(chunk_sizes, chunk_factor, level)
}

for grade in grades:
    print_color(f"{grade} cost: {grades[grade]:.2f} microseconds", 'green')

[32mNo Compaction cost: 52408.32 microseconds[0m
[32mFull Compaction cost: 60249.30 microseconds[0m
