## Iteration 1
Next iteration of the project we are working on:

We can add some constraints to the structure composed of (b, z).

1. We can fix the number of buckets to a power of 2, say 2^p. The larger p is, the higher the precision of the dataset approximation.

2. We can convert the bitsets that represent running zeros into integers. This will allow us to use bitwise operations on integers to determine whether a specific number of zeros belongs to the bucket.

This will standardize the presentation of any dataset into a fully comparable format.

This is a great direction for the next iteration of the project! By introducing structural constraints on the (b, z) representation, we can achieve a uniform and comparable format for any dataset. This will allow for efficient storage, comparison, and analysis of datasets while maintaining precision and scalability.

### Key Improvements:

1. Fixed Number of Buckets:

    - Fix the number of buckets to a power of 2, namely 2^p, where p is the number of hash bits used for the bucket index.
    - The larger p is, the higher the precision of the dataset approximation.
    - A smaller p reduces precision but uses less memory and processing time.

2. Integer Representation of Running Zeros:

    - Convert the bitsets representing running zeros into integers.

    - Use bitwise operations to efficiently check if a specific number of trailing zeros belongs to a bucket.

3. Uniform Dataset Representation:

    - All datasets will be represented in a consistent format, making them fully comparable.

    - This enables operations like dataset similarity comparison, merging, and querying.

### Implementation Details:
1. Fixed Number of Buckets:
    - The number of buckets is fixed to 2^p.
    - The bucket number b is calculated as the integer value of the first p bits of the 64-bit hash.
2. Integer Representation of Running Zeros:
    - Represent the number of trailing zeros z as an integer.
    - Use bitwise operations to check if a specific number of trailing zeros belongs to a bucket.
3. Uniform Dataset Format:
    - Each dataset is represented as a collection of buckets.
    - Each bucket contains:
        - A list of integers representing the number of trailing zeros z for tokens in that bucket.
        - Optional metadata, such as token frequencies or document references.
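
One way to read "integer representation of running zeros" is to pack all the z values seen in a bucket into a single integer, with bit z set when z trailing zeros were observed. The sketch below illustrates that reading; the helper names (`add_zero_count`, `has_zero_count`, `zero_counts`) are hypothetical and not part of the project.

```python
def add_zero_count(bucket_mask: int, z: int) -> int:
    """Return bucket_mask with bit z set, recording that z trailing zeros were seen."""
    return bucket_mask | (1 << z)


def has_zero_count(bucket_mask: int, z: int) -> bool:
    """Check with a shift and an AND whether z was recorded in this bucket."""
    return bool((bucket_mask >> z) & 1)


def zero_counts(bucket_mask: int) -> list[int]:
    """Unpack the integer back into the sorted list of recorded z values."""
    return [z for z in range(bucket_mask.bit_length()) if (bucket_mask >> z) & 1]


# Pack the z values 1, 2, 2, 3 into one integer (duplicates collapse).
mask = 0
for z in (1, 2, 2, 3):
    mask = add_zero_count(mask, z)

print(bin(mask))                 # 0b1110
print(has_zero_count(mask, 2))   # True
print(has_zero_count(mask, 0))   # False
print(zero_counts(mask))         # [1, 2, 3]
```

Unlike a list of z values, this packed form drops duplicates, so frequencies would have to be kept separately if they matter.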

### Updated Data Structure:

In [None]:
dataset = {
    b1: {
        "zeros": [z1, z2, ...],  # List of trailing zeros for tokens in this bucket
        "frequencies": [f1, f2, ...],  # Optional: Frequencies of tokens
        "documents": [doc_id1, doc_id2, ...]  # Optional: Document references
    },
    b2: {
        "zeros": [z3, z4, ...],
        "frequencies": [f3, f4, ...],
        "documents": [doc_id3, doc_id4, ...]
    },
    # ...
}

### Algorithm for Building the Dataset Representation:

1. Tokenization and Hashing:

    - Tokenize each document and convert tokens into 64-bit hashes.

2. Bucket Assignment:

    - For each hash, extract the first p bits to determine the bucket b.

3. Trailing Zeros Calculation:

    - Count the number of trailing zeros z in the hash.

    - Represent z as an integer.

4. Update Dataset Structure:

    - Append z to the list of zeros for the corresponding bucket b.

    - Optionally, update frequencies and document references.
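
To make steps 2 and 3 concrete, here is a worked example for a single hand-picked 64-bit value. The precision `P = 4` and the function name `split_hash` are assumptions chosen for illustration only.

```python
P = 4  # assumed precision: 2**4 = 16 buckets


def split_hash(hash_val: int, p: int = P) -> tuple[int, int]:
    """Return (b, z): the top p bits and the trailing-zero count of a 64-bit hash."""
    hash_val &= (1 << 64) - 1          # keep exactly 64 bits
    b = hash_val >> (64 - p)           # first p bits select the bucket
    if hash_val == 0:
        return b, 64                   # no set bit: all 64 bits are zeros
    z = (hash_val & -hash_val).bit_length() - 1  # position of the lowest set bit
    return b, z


# Top nibble 0xA = 0b1010 -> bucket 10; low nibble 0x8 = 0b1000 -> 3 trailing zeros.
h = 0xA000_0000_0000_0008
b, z = split_hash(h)
print(b, z)  # 10 3
```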

### Efficient Bitwise Operations:
To check if a specific number of trailing zeros z belongs to a bucket, we can use bitwise operations:

1. Mask Creation:

    - Create a mask for z trailing zeros: mask = (1 << z) - 1.

2. Check Trailing Zeros:

    - For a given hash, check whether the last z bits are all zero, i.e. whether the hash has at least z trailing zeros: (hash & mask) == 0.

3. Bucket Lookup:

    - Use the first p bits of the hash to determine the bucket b.

    - Check if z exists in the list of zeros for bucket b.
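
The three steps above can be sketched as follows. Note that the mask test answers "at least z trailing zeros", not "exactly z". The function names and the tiny `dataset` literal are hypothetical and only follow the bucket layout described earlier.

```python
def has_at_least_z_trailing_zeros(hash_val: int, z: int) -> bool:
    """Mask test: True when the lowest z bits of hash_val are all zero."""
    mask = (1 << z) - 1
    return (hash_val & mask) == 0


def trailing_zeros(hash_val: int) -> int:
    """Exact number of trailing zeros in a 64-bit hash (64 for a zero hash)."""
    if hash_val == 0:
        return 64
    return (hash_val & -hash_val).bit_length() - 1


def bucket_contains(dataset: dict, hash_val: int, p: int) -> bool:
    """Look up the hash's bucket and check whether its z value was recorded there."""
    b = (hash_val >> (64 - p)) & ((1 << p) - 1)
    return b in dataset and trailing_zeros(hash_val) in dataset[b]["zeros"]


h = 0b101000  # 3 trailing zeros, top bits all zero -> bucket 0
print(has_at_least_z_trailing_zeros(h, 2))  # True
print(has_at_least_z_trailing_zeros(h, 3))  # True
print(has_at_least_z_trailing_zeros(h, 4))  # False

dataset = {0: {"zeros": [1, 3]}}
print(bucket_contains(dataset, h, p=4))     # True
```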

### Example Code:

In [None]:
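import hashlib
import re

HASH_BITS = 64

def tokenize(document):
    # Placeholder tokenizer (assumption): lowercase the text and split on words
    return re.findall(r"\w+", document.lower())

def hash_function(token):
    # Placeholder 64-bit hash (assumption): first 8 bytes of the SHA-1 digest
    return int.from_bytes(hashlib.sha1(token.encode("utf-8")).digest()[:8], "big")

def count_trailing_zeros(hash_val):
    # A zero hash has no set bit; treat all 64 bits as trailing zeros
    if hash_val == 0:
        return HASH_BITS
    # hash_val & -hash_val isolates the lowest set bit
    return (hash_val & -hash_val).bit_length() - 1

def check_trailing_zeros(hash_val, z):
    # True if the hash has at least z trailing zeros
    mask = (1 << z) - 1
    return (hash_val & mask) == 0

def build_dataset(documents, p):
    dataset = {}
    for doc_id, document in enumerate(documents):
        for token in tokenize(document):
            hash_val = hash_function(token)
            b = (hash_val >> (HASH_BITS - p)) & ((1 << p) - 1)  # First p bits
            z = count_trailing_zeros(hash_val)  # Number of trailing zeros

            # Create the bucket on first use, then record this occurrence
            bucket = dataset.setdefault(b, {"zeros": [], "frequencies": [], "documents": []})
            bucket["zeros"].append(z)
            bucket["frequencies"].append(1)  # One entry per token occurrence
            bucket["documents"].append(doc_id)

    return dataset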

### Benefits of the Updated Approach:

1. Uniform Representation:

    - All datasets are represented in the same format, enabling easy comparison and merging.

2. Efficient Storage:

    - A fixed bucket count and the integer representation of trailing zeros reduce storage overhead.

3. Scalability:

    - The number of buckets 2^p can be adjusted to balance precision and efficiency.

4. Fast Lookups:

    - Bitwise operations enable efficient checks for specific trailing zeros.

5. Comparability:

    - Datasets can be compared directly using their bucket and trailing zero representations.

### Example Use Case: Dataset Comparison

To compare two datasets, we can:

1. Align their bucket structures (using the same p).

2. Compare the lists of trailing zeros for each bucket.

3. Compute similarity metrics (e.g., Jaccard similarity) based on overlapping zeros.

### Example Code for Dataset Comparison:

In [None]:
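def compare_datasets(dataset1, dataset2):
    # Average the per-bucket Jaccard similarity over every bucket that
    # appears in either dataset, so that the result is symmetric
    all_buckets = set(dataset1) | set(dataset2)
    if not all_buckets:
        return 0.0  # Two empty datasets: nothing to compare

    similarity = 0.0
    for b in all_buckets:
        if b in dataset1 and b in dataset2:
            zeros1 = set(dataset1[b]["zeros"])
            zeros2 = set(dataset2[b]["zeros"])
            union = len(zeros1 | zeros2)
            if union > 0:
                similarity += len(zeros1 & zeros2) / union
        # A bucket present in only one dataset contributes 0

    return similarity / len(all_buckets)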
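
If each bucket's trailing-zero values are packed into a single integer (bit z set when z was observed), the same per-bucket Jaccard score can be computed with AND, OR, and a bit count instead of Python sets. This is a minimal sketch under that assumption; `pack_zeros`, `popcount`, and `bucket_jaccard` are hypothetical helper names.

```python
def pack_zeros(zeros) -> int:
    """Pack a list of z values into one integer with bit z set for each value."""
    mask = 0
    for z in zeros:
        mask |= 1 << z
    return mask


def popcount(x: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def bucket_jaccard(mask1: int, mask2: int) -> float:
    """Jaccard similarity of two packed buckets: |A & B| / |A | B|."""
    union = popcount(mask1 | mask2)
    return popcount(mask1 & mask2) / union if union else 0.0


m1 = pack_zeros([1, 2, 2, 3])  # 0b1110
m2 = pack_zeros([0, 1, 3])     # 0b1011
print(bucket_jaccard(m1, m2))  # 0.5
```

Here the intersection is {1, 3} and the union is {0, 1, 2, 3}, which gives 2 / 4.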

### Conclusion:

By introducing a fixed number of buckets and integer representations of trailing zeros, we achieve a uniform, efficient, and comparable dataset format. This approach enhances scalability, enables fast lookups, and supports operations like dataset comparison and merging. Adjusting p allows precision and efficiency to be tuned for specific use cases.

## Iteration 2

The implemented structure gives us a standard way to do parallel processing.
Different buckets are, by definition, mutually exclusive, so we can process all buckets simultaneously.
This can help us optimize processing, especially on a GPU.

The mutually exclusive nature of buckets makes this structure inherently suitable for parallel processing. Since each bucket operates independently, we can distribute the workload across multiple CPU cores or even leverage GPU parallelism for significant performance gains. This is particularly beneficial for large-scale datasets or computationally intensive tasks like token restoration, dataset comparison, or frequency analysis.

### Key Advantages of Parallel Processing:

1. Mutually Exclusive Buckets:

    - Each bucket is independent, meaning no synchronization is required between buckets during processing.

    - This eliminates contention and allows for true parallelism.

2. Scalability:

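    - The workload can be distributed across multiple processing units (CPU cores or GPU threads); the split is roughly even when the hash spreads tokens uniformly over buckets.

    - Adding processing units can improve throughput close to linearly, until memory bandwidth or scheduling overhead becomes the bottleneck.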

3. GPU Optimization:

    - GPUs excel at parallel processing, especially for tasks involving bitwise operations or integer arithmetic.

    - The uniform structure of buckets and integer representations of trailing zeros aligns well with GPU architectures.

### Parallel Processing Strategies:

1. CPU Parallelism:
    - Use multi-threading or multi-processing to distribute buckets across CPU cores.

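    - Libraries like Python's concurrent.futures or multiprocessing can be used. For CPU-bound work in CPython, process-based pools avoid the global interpreter lock that limits threads.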

2. GPU Parallelism:
    - Use GPU frameworks like CUDA (for NVIDIA GPUs) or OpenCL (for cross-platform GPU support).

    - Map each bucket to a GPU thread or block for parallel execution.
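
Mapping one bucket to one GPU block usually requires the buckets to sit in contiguous memory rather than in a nested dictionary. The sketch below shows one possible layout, flat values plus an offsets array, and checks it with a CPU-side segmented sum. It is plain Python with no GPU calls; `flatten_buckets` and `segmented_sum` are hypothetical names, and the layout itself is an assumption, not something the project prescribes.

```python
def flatten_buckets(dataset: dict, field: str):
    """Concatenate one field of every bucket into a flat list plus offsets.

    Bucket i (in sorted key order) owns values[offsets[i]:offsets[i + 1]].
    """
    bucket_ids = sorted(dataset)
    values, offsets = [], [0]
    for b in bucket_ids:
        values.extend(dataset[b][field])
        offsets.append(len(values))
    return bucket_ids, values, offsets


def segmented_sum(values, offsets):
    """CPU reference for what one block per bucket would compute on a GPU."""
    return [sum(values[offsets[i]:offsets[i + 1]]) for i in range(len(offsets) - 1)]


dataset = {
    0: {"frequencies": [1, 2, 1, 1]},
    1: {"frequencies": [1, 1, 1]},
}
bucket_ids, values, offsets = flatten_buckets(dataset, "frequencies")
print(values)                          # [1, 2, 1, 1, 1, 1, 1]
print(offsets)                         # [0, 4, 7]
print(segmented_sum(values, offsets))  # [5, 3]
```

With this layout a single kernel launch could cover all buckets, with block i reading only its own slice.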

### Example: Parallel Processing on CPU

Here’s how you can implement parallel processing for bucket-level operations using Python's concurrent.futures:

In [None]:
from concurrent.futures import ThreadPoolExecutor

def process_bucket(b, bucket_data):
    # Example: Count the number of unique trailing zeros in the bucket
    unique_zeros = len(set(bucket_data["zeros"]))
    return b, unique_zeros

def parallel_process_dataset(dataset):
    results = {}
    with ThreadPoolExecutor() as executor:
        futures = [executor.submit(process_bucket, b, bucket_data) for b, bucket_data in dataset.items()]
        for future in futures:
            b, result = future.result()
            results[b] = result
    return results

# Example usage
dataset = {
    0: {"zeros": [1, 2, 2, 3], "frequencies": [1, 2, 1, 1], "documents": [0, 1, 2, 3]},
    1: {"zeros": [0, 1, 1], "frequencies": [1, 1, 1], "documents": [0, 1, 2]},
    # Add more buckets...
}

results = parallel_process_dataset(dataset)
print(results)

### Example: Parallel Processing on GPU (CUDA)
For GPU parallelism, we can use a framework like CUDA to process buckets in parallel. Below is a high-level outline of how this might look:

**CUDA Kernel for Bucket Processing:**

In [None]:
// Sums the frequencies of one bucket into results[0].
// results[0] must be zero-initialized before the launch.
__global__ void process_bucket(int* zeros, int* frequencies, int* results, int bucket_size) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < bucket_size) {
        // Every block accumulates into the same single-bucket result
        atomicAdd(&results[0], frequencies[idx]);
    }
}

**Python Wrapper (using PyCUDA):**

In [None]:
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy as np

# Define CUDA kernel: sums the frequencies of one bucket into results[0]
mod = SourceModule("""
    __global__ void process_bucket(int* zeros, int* frequencies, int* results, int bucket_size) {
        int idx = threadIdx.x + blockIdx.x * blockDim.x;
        if (idx < bucket_size) {
            atomicAdd(&results[0], frequencies[idx]);
        }
    }
""")
process_bucket = mod.get_function("process_bucket")

THREADS_PER_BLOCK = 256

def gpu_process_dataset(dataset):
    # Buckets are handled one launch at a time here; the parallelism is
    # across the elements within each bucket.
    results = {}
    for b, bucket_data in dataset.items():
        zeros = np.array(bucket_data["zeros"], dtype=np.int32)
        frequencies = np.array(bucket_data["frequencies"], dtype=np.int32)
        bucket_size = len(zeros)
        if bucket_size == 0:
            results[b] = 0  # Nothing to sum; zero-byte allocations are invalid
            continue

        # Allocate GPU memory
        zeros_gpu = cuda.mem_alloc(zeros.nbytes)
        frequencies_gpu = cuda.mem_alloc(frequencies.nbytes)
        result = np.zeros(1, dtype=np.int32)
        results_gpu = cuda.mem_alloc(result.nbytes)  # Single integer result

        # Copy data to GPU; the result starts at zero because the kernel accumulates into it
        cuda.memcpy_htod(zeros_gpu, zeros)
        cuda.memcpy_htod(frequencies_gpu, frequencies)
        cuda.memcpy_htod(results_gpu, result)

        # Launch enough blocks to cover the whole bucket
        num_blocks = (bucket_size + THREADS_PER_BLOCK - 1) // THREADS_PER_BLOCK
        process_bucket(zeros_gpu, frequencies_gpu, results_gpu, np.int32(bucket_size),
                       block=(THREADS_PER_BLOCK, 1, 1), grid=(num_blocks, 1))

        # Copy result back to CPU
        cuda.memcpy_dtoh(result, results_gpu)
        results[b] = int(result[0])

    return results

# Example usage
dataset = {
    0: {"zeros": [1, 2, 2, 3], "frequencies": [1, 2, 1, 1], "documents": [0, 1, 2, 3]},
    1: {"zeros": [0, 1, 1], "frequencies": [1, 1, 1], "documents": [0, 1, 2]},
    # Add more buckets...
}

results = gpu_process_dataset(dataset)
print(results)


### Benefits of GPU Parallelism:
1. Massive Parallelism:

    - GPUs have thousands of cores, enabling simultaneous processing of thousands of buckets.

2. Efficient Bitwise Operations:

    - Bitwise and integer operations map directly onto GPU integer instructions, so checks on trailing zeros are cheap per thread.

3. Scalability:

    - As the dataset grows, more or larger buckets can be spread over more threads, as long as the data fits in GPU memory and host-to-device transfers do not dominate.

4. Real-Time Processing:

    - GPU acceleration can bring processing of large-scale datasets closer to real time.

### Use Cases for Parallel Processing:

1. Token Restoration:

    - Restore tokens for multiple buckets in parallel.

2. Dataset Comparison:

    - Compare datasets by processing corresponding buckets in parallel.

3. Frequency Analysis:

    - Compute token frequencies across all buckets simultaneously.

4. Query Processing:

    - Perform queries (e.g., finding documents containing specific tokens) in parallel.
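
As an illustration of the dataset comparison use case, corresponding buckets of two datasets can be scored independently and the scores averaged afterwards. This is a minimal sketch using `ThreadPoolExecutor.map`; the function names are hypothetical, and averaging over the union of bucket keys is an assumed design choice. Threads are used to keep the example short; for CPU-bound work on large datasets a process pool would likely be the better fit.

```python
from concurrent.futures import ThreadPoolExecutor


def bucket_jaccard(pair) -> float:
    """Jaccard similarity of the trailing-zero sets of one bucket pair."""
    zeros1, zeros2 = pair
    s1, s2 = set(zeros1), set(zeros2)
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0


def parallel_compare(dataset1: dict, dataset2: dict, max_workers=None) -> float:
    """Score every bucket in parallel, then average over all buckets seen."""
    buckets = sorted(set(dataset1) | set(dataset2))
    if not buckets:
        return 0.0
    # A bucket missing from one dataset is treated as an empty list of zeros.
    pairs = [
        (dataset1.get(b, {}).get("zeros", []), dataset2.get(b, {}).get("zeros", []))
        for b in buckets
    ]
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        scores = list(executor.map(bucket_jaccard, pairs))
    return sum(scores) / len(scores)


d1 = {0: {"zeros": [1, 2, 2, 3]}, 1: {"zeros": [0, 1, 1]}}
d2 = {0: {"zeros": [1, 3]}, 2: {"zeros": [4]}}
print(parallel_compare(d1, d2))  # bucket 0 scores 2/3, buckets 1 and 2 score 0
```

Because no bucket reads another bucket's data, the per-bucket calls need no locking; only the final averaging step is sequential.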

### Conclusion:

By leveraging the mutually exclusive nature of buckets, we can implement efficient parallel processing on both CPUs and GPUs. This can substantially improve performance for large-scale datasets or computationally intensive tasks. GPU parallelism in particular scales to very large numbers of buckets, which makes it a good fit for this structure when transfer overhead is kept under control.