# **Project 1: Integration of Insertion Sort and Merge Sort**

## Team 5

| **Group Member** | **Matriculation Number** |
|---------|-------|
| Jyoshika Barathimogan | |
| Kong Fook Wah | U2421655E |
| Kris Khor Hai Xiang | |

## **Purpose of Project 1**

This project aims to find the optimal threshold S that maximises the efficiency of a hybrid of Insertion Sort and Merge Sort. By using Insertion Sort on subarrays of size less than or equal to S, the hybrid should show a visible improvement over plain Merge Sort in the number of key comparisons needed to complete the sort.

## **Sequence of build**

- (a) Implementation of Hybrid Algorithm

- (b) Generate arrays: n ∈ {10^3, ..., 10^7}

- (c) Analysis: (i) fix S, vary n, (ii) fix n, vary S

- (d) Compare with pure mergesort at n = 10^7 using optimal S found in (c)


Libraries used in the experiment

| **Module** | **Purpose** |
|--------|----------|
| `random` | Generate arrays of random numbers |
| `matplotlib` | Visualise the results of the experiment |
| `tqdm` | Show the progress of the algorithm while it runs |
| `math` | Compute the *n log n* reference curve for comparison |
| `time` | Measure execution time with `perf_counter` |

## **Reproducibility & Constant Variable**

To keep the experimental conditions constant, we seed the `random` module with a fixed value, 42.

In [None]:
import random
import matplotlib.pyplot as plt
from tqdm import tqdm
import math
import time

random.seed(42)  # fixed seed so the same numbers are generated on every run
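
As a small illustration of why a fixed seed gives reproducible inputs, the sketch below draws two samples from generators seeded with the same value and compares them. The helper `sample` is a hypothetical name introduced here for illustration; it uses a separate `random.Random` instance so that the global generator seeded above is not disturbed.

```python
import random


def sample(seed, k=5):
    """Return k pseudo-random integers from a generator seeded with `seed`."""
    rng = random.Random(seed)  # independent generator, global state untouched
    return [rng.randint(0, 10**4) for _ in range(k)]


first = sample(42)
second = sample(42)

# The same seed reproduces exactly the same sequence.
print(first == second)  # True
```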

## **(a) Algorithm Implementation**

**Counter class**

In [None]:
class Counter:
    def __init__(self):
        self.comparison = 0  # key comparisons made during one sorting run

**Insertion Sort (in place)**

In [None]:
def insertionSort(array, low, high, counter: Counter):
    # Sorts array[low:high] in place (high is exclusive) and counts key comparisons.
    for i in range(low + 1, high):
        key = array[i]
        j = i - 1

        while j >= low:
            counter.comparison += 1

            if array[j] > key:
                array[j + 1] = array[j]
                j -= 1
            else:
                break

        array[j + 1] = key
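
To make the counting rule concrete, here is a minimal self-contained sketch that applies the same loop to a whole list and reports the number of key comparisons. The function name `count_insertion_comparisons` is hypothetical and not part of the project code. It shows the two extremes for n = 8: already sorted input needs n - 1 comparisons, and reversed input needs n(n - 1)/2.

```python
def count_insertion_comparisons(values):
    """Insertion-sort a copy of `values`; return (sorted_list, comparisons)."""
    a = list(values)
    comparisons = 0
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        while j >= 0:
            comparisons += 1  # one key comparison per loop iteration
            if a[j] > key:
                a[j + 1] = a[j]  # shift the larger element right
                j -= 1
            else:
                break
        a[j + 1] = key
    return a, comparisons


n = 8
sorted_out, best = count_insertion_comparisons(range(1, n + 1))
reversed_out, worst = count_insertion_comparisons(range(n, 0, -1))

print(best, worst)  # 7 28
```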

**Merge Sort *(sorts the input array, using auxiliary storage)***

In [None]:
#aux: auxiliary storage
def merge(array, low, mid, high, aux, counter: Counter):
    
    i, j, k = low, mid, 0
    
    while i < mid and j < high:
        counter.comparison += 1
        
        if array[i] <= array[j]:
            aux[k] = array[i]; i += 1
        else:
            aux[k] = array[j]; j += 1
        k += 1
        
    while i < mid:  
        aux[k] = array[i]
        i += 1
        k += 1
        
    while j < high:   
        aux[k] = array[j]
        j += 1
        k += 1
        
    array[low:high] = aux[:k]


def mergeSort(array, low, high, aux, counter: Counter):
    
    n = high - low
    
    if n <= 1:
        return

    mid = (low + high) // 2

    # ranges are half-open: [low, mid) and [mid, high)
    mergeSort(array, low, mid, aux, counter)
    mergeSort(array, mid, high, aux, counter)
    merge(array, low, mid, high, aux, counter)


**Hybrid Sort *(in place)***

In [None]:
def hybridSort(array, low, high, S, aux, counter: Counter):
    
    n = high - low
    
    if n <= 1:
        return

    if n <= S:
        insertionSort(array, low, high, counter)
        return
    
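    mid = (low + high) // 2

    # recurse with hybridSort so that the threshold S applies at every level;
    # ranges are half-open: [low, mid) and [mid, high)
    hybridSort(array, low, mid, S, aux, counter)
    hybridSort(array, mid, high, S, aux, counter)
    merge(array, low, mid, high, aux, counter)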

## Testing Hybrid Sort algorithm

In [None]:
array = [random.randint(-50, 50) for _ in range(50)]

S = 5

aux = [None] * 50
counter = Counter()

hybridSort(array, 0, 50, S, aux, counter)
print("Array after sorting: ", array)
print("Key Comparison of the array: ", counter.comparison)
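
Reading a printed array by eye can miss a misplaced element, so a stricter check is to compare the result against Python's built-in `sorted` for several thresholds and sizes. The sketch below is self-contained and uses simplified, hypothetical helpers (`insertion_sort`, `merge`, `hybrid_sort`) without the comparison counter. It is an illustration of the check, not the project's own implementation.

```python
import random


def insertion_sort(a, low, high):
    """Sort a[low:high] in place (high is exclusive)."""
    for i in range(low + 1, high):
        key = a[i]
        j = i - 1
        while j >= low and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key


def merge(a, low, mid, high):
    """Merge the sorted runs a[low:mid] and a[mid:high]."""
    merged = []
    i, j = low, mid
    while i < mid and j < high:
        if a[i] <= a[j]:
            merged.append(a[i]); i += 1
        else:
            merged.append(a[j]); j += 1
    merged.extend(a[i:mid])   # leftover of the left run, if any
    merged.extend(a[j:high])  # leftover of the right run, if any
    a[low:high] = merged


def hybrid_sort(a, low, high, S):
    """Merge sort that hands subarrays of size <= S to insertion sort."""
    n = high - low
    if n <= 1:
        return
    if n <= S:
        insertion_sort(a, low, high)
        return
    mid = (low + high) // 2
    hybrid_sort(a, low, mid, S)
    hybrid_sort(a, mid, high, S)
    merge(a, low, mid, high)


rng = random.Random(0)  # local generator, arbitrary seed
all_ok = True
for S in (1, 2, 5, 16, 64):
    for n in (0, 1, 2, 7, 50, 257):
        data = [rng.randint(-50, 50) for _ in range(n)]
        expected = sorted(data)
        hybrid_sort(data, 0, n, S)
        all_ok = all_ok and data == expected

print(all_ok)  # True
```

Odd sizes and very small sizes are included on purpose, since off-by-one errors at the split point tend to show up there.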

It works. Good. Now let's test the efficiency.

## **(b) Generating input data**

Stepping linearly through n would make the runtime too heavy, so we sample input sizes on a logarithmic grid, with a ratio of 2 to 2.5 between consecutive values of n.

The code below uses the `random` module to generate arrays of random integers at each size.

In [None]:
# This spacing balances accuracy and runtime while preserving the n log n trend with far fewer experiments
sizes = [1_000, 2_000, 5_000, 
        10_000, 20_000, 50_000,
        100_000, 200_000, 500_000,
        1_000_000, 2_000_000, 5_000_000,
        10_000_000]

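# integers are drawn from the range 0 to 10_000 (inclusive)

for size in sizes:
    arr = [random.randint(0, 10**4) for _ in range(size)]
    # print a short preview only; printing up to 10^7 values would flood the output
    print(f"n = {size}: first 5 values = {arr[:5]}")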
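
The hand-written list of sizes follows a 1-2-5 pattern per decade, which is what produces the ratios of 2 and 2.5. As a sketch, the same grid could be generated programmatically. The function name `one_two_five_grid` and its parameters are assumptions introduced for this example.

```python
def one_two_five_grid(min_exp, max_exp):
    """Return 1-2-5 spaced sizes from 10**min_exp up to and including 10**max_exp."""
    grid = []
    for exponent in range(min_exp, max_exp):
        for multiplier in (1, 2, 5):
            grid.append(multiplier * 10**exponent)
    grid.append(10**max_exp)  # close the grid at the upper bound
    return grid


grid = one_two_five_grid(3, 7)
print(len(grid), grid[0], grid[-1])  # 13 1000 10000000

# Ratios between consecutive sizes are 2 or 2.5.
ratios = [b / a for a, b in zip(grid, grid[1:])]
print(sorted(set(ratios)))  # [2.0, 2.5]
```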
    

## **(c) Analysis of Time Complexity**

To keep the experiment as fair as possible, we use (1) the median and (2) the mean to compare the different conditions.

We report the **median runtime** over multiple trials to mitigate outliers that could be caused by limitations of our computer, such as background apps and caching effects.

For algorithmic work, we report the **mean key comparisons** as an estimate of the expected number of comparisons over random inputs.
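
The choice of median for runtime can be illustrated with a small sketch. The timings below are made-up values, not measurements from this experiment: four trials near 0.1 s and one trial slowed down by a hypothetical background process.

```python
import statistics

# Hypothetical timings in seconds (illustrative only, not measured data).
timings = [0.101, 0.099, 0.100, 0.102, 0.950]

median_time = statistics.median(timings)
mean_time = statistics.mean(timings)

# The single slow trial pulls the mean far above the typical run,
# while the median stays at a representative value.
print(f"median = {median_time:.3f} s, mean = {mean_time:.3f} s")
```

Key comparisons, by contrast, are deterministic for a given input, so averaging them over random inputs is not distorted by machine noise.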

In [None]:
# Demonstrate reading the elapsed time from a tqdm progress bar
list_ = []
with tqdm(total=10, desc="Processing") as pbar:
    for i in range(10):  # must match total, otherwise the bar overruns
        # Simulate some work
        time.sleep(0.1)
        pbar.update(1)

        # Access the elapsed time (in seconds) since the bar was created
        elapsed_time = pbar.format_dict["elapsed"]
        list_.append(elapsed_time)
        print(f"Elapsed: {elapsed_time:.2f}s")

print(list_)

### (i) Fix S, Vary N

In [None]:
# S is fixed at 32
S = 32
TRIALS = 3
kc_vals_ci = []          # mean key comparisons, one entry per size
execution_time_ci = []   # median execution time, one entry per size

for size in tqdm(sizes, desc = "Question part ci"):

    kc_total = 0
    trial_times = []

    for _ in range(TRIALS):

        # generate a fresh random array for every trial, so that no trial
        # runs on input already sorted by a previous one
        arr = [random.randint(0, 10**4) for _ in range(size)]

        counter = Counter()
        aux = [None] * size

        start_time = time.perf_counter()
        hybridSort(arr, 0, size, S, aux, counter)
        end_time = time.perf_counter()

        trial_times.append(end_time - start_time)
        kc_total += counter.comparison

    kc_vals_ci.append(kc_total / TRIALS)
    execution_time_ci.append(sorted(trial_times)[TRIALS // 2])  # median of the trials

for size, median_time in zip(sizes, execution_time_ci):
    print(f"n = {size}: median execution time = {median_time:.4f} s")

#Plot graph
plt.figure(figsize=(10,5))

plt.plot(sizes, kc_vals_ci, marker='o', label='Key Comparisons')
plt.plot(sizes, [n * math.log2(n) for n in sizes] , linestyle='--', label='~ n log2 n reference')  # mergesort is said to have a time complexity of O(n log n)

plt.xlabel('Array size (n)')
plt.ylabel('Operation count')
plt.title('Hybrid Sort Complexity')
plt.legend()
plt.grid(True)
plt.show()

### (ii) Fix N, Vary S
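
One possible way to set up this part is sketched below. It is an assumption about how the experiment could be structured, not the project's results. The function `hybrid_sort_count`, the candidate thresholds, and the small n = 2,000 (chosen so the sketch runs quickly) are all hypothetical. Every threshold sorts a copy of the same base array, so the counts are comparable.

```python
import random


def hybrid_sort_count(a, low, high, S):
    """Sort a[low:high] in place; return the number of key comparisons."""
    n = high - low
    if n <= 1:
        return 0
    comparisons = 0
    if n <= S:
        # insertion sort on the small subarray
        for i in range(low + 1, high):
            key = a[i]
            j = i - 1
            while j >= low:
                comparisons += 1
                if a[j] > key:
                    a[j + 1] = a[j]
                    j -= 1
                else:
                    break
            a[j + 1] = key
        return comparisons
    mid = (low + high) // 2
    comparisons += hybrid_sort_count(a, low, mid, S)
    comparisons += hybrid_sort_count(a, mid, high, S)
    # merge the two sorted halves
    merged = []
    i, j = low, mid
    while i < mid and j < high:
        comparisons += 1
        if a[i] <= a[j]:
            merged.append(a[i]); i += 1
        else:
            merged.append(a[j]); j += 1
    merged.extend(a[i:mid])
    merged.extend(a[j:high])
    a[low:high] = merged
    return comparisons


rng = random.Random(42)
n = 2_000  # small illustrative size, far below the project's n
base = [rng.randint(0, 10**4) for _ in range(n)]

comparisons_by_S = {}
for S in (1, 2, 4, 8, 16, 32, 64):
    trial = base.copy()  # identical input for every threshold
    comparisons_by_S[S] = hybrid_sort_count(trial, 0, n, S)
    assert trial == sorted(base)

for S, count in comparisons_by_S.items():
    print(f"S = {S}: {count} key comparisons")
```

With S = 1 the function never reaches the insertion sort branch, so that row serves as the pure merge sort baseline for the same input.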

### (iii) Determining the optimal value of S

## **(d) Comparison with original MergeSort**

## **Conclusion**