### Import Dependencies

In [2]:
import pandas as pd
import numpy as np
import random
import string
import time

### Create specified arrays

Arrays should have lengths 5000, 10000, 15000, 20000, and 25000, and should be of the following types:

- Uniformly distributed random floats
- Uniformly distributed random integers
- Random strings of length five
- Random strings of length fifteen

In [3]:
#Set empty arrays
int_arrays = []
float_arrays = []
len5_arrays = []
len15_arrays = []
array_lengths = [5000, 10000, 15000, 20000, 25000]

# Create arrays of four different styles: integers, floats, 5-length strings, and 15-length strings
# The upper bound of one million for the numeric dtypes is an arbitrary choice
for i in array_lengths:
    np.random.seed(14)
    integers = list(np.random.randint(low=0, high=1000000, size=i))
    int_arrays.append(integers)

    floats = list(np.random.uniform(low=0, high=1000000, size=i))
    float_arrays.append(floats)

    length_fives = [''.join(random.choices(string.ascii_letters, k=5)) for _ in range(i)]
    len5_arrays.append(length_fives)

    # k=15 so that these strings really have fifteen characters
    length_fifteens = [''.join(random.choices(string.ascii_letters, k=15)) for _ in range(i)]
    len15_arrays.append(length_fifteens)
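
A quick sanity check on generated data can catch mistakes in the string lengths before any timing is done. The sketch below is not part of the original notebook; `random_strings` is a hypothetical helper that mirrors the list comprehensions above, and the sample size of 1000 is arbitrary.

```python
import random
import string

def random_strings(n, length):
    """Hypothetical helper: n random ASCII-letter strings of the given length."""
    return [''.join(random.choices(string.ascii_letters, k=length)) for _ in range(n)]

sample_fives = random_strings(1000, 5)
sample_fifteens = random_strings(1000, 15)

# Every string in each sample should have exactly the requested length
assert all(len(s) == 5 for s in sample_fives)
assert all(len(s) == 15 for s in sample_fifteens)
print(len(sample_fives), len(sample_fifteens))
```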

### Implement and time Selection Sort algorithm

Selection sort algorithm as defined below was taken from the book _Grokking Algorithms_ by Aditya Bhargava.

In [4]:
# Finds the index of the smallest value in an array
def findSmallest(arr):
    # Stores the smallest value
    smallest = arr[0]
    # Stores the index of the smallest value
    smallest_index = 0
    for i in range(1, len(arr)):
        if arr[i] < smallest:
            smallest_index = i
            smallest = arr[i]
    return smallest_index

# Sort array (note: empties the input list, because each minimum is popped from it)
def selectionSort(arr):
    newArr = []
    for i in range(len(arr)):
        # Finds the smallest element in the array and adds it to the new array
        smallest = findSmallest(arr)
        newArr.append(arr.pop(smallest))
    return newArr
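
One property of this implementation worth noting is that `.pop()` removes elements from the list that is passed in, so the input is empty after the call. The sketch below restates the two functions in compact form (only so that it runs on its own) and shows the effect on a small list. Passing a copy, as done here with `list(data)`, is one way to keep the original data; it is an illustration rather than something the notebook does.

```python
def findSmallest(arr):
    smallest_index = 0
    for i in range(1, len(arr)):
        if arr[i] < arr[smallest_index]:
            smallest_index = i
    return smallest_index

def selectionSort(arr):
    newArr = []
    for _ in range(len(arr)):
        newArr.append(arr.pop(findSmallest(arr)))
    return newArr

data = [5, 3, 6, 2, 10]
result = selectionSort(list(data))   # sort a copy; data is preserved
print(result)                        # [2, 3, 5, 6, 10]

scratch = list(data)
selectionSort(scratch)
print(scratch)                       # [] because every element was popped
```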

In [5]:
# Time selectionSort on each array in a list and return the execution times in seconds
def time_selection_sort(arrays):
    execution_times = []
    for arr in arrays:
        t1 = time.perf_counter()
        selectionSort(arr)
        t2 = time.perf_counter()
        execution_times.append(t2 - t1)
    return execution_times

# Execute selection sort on integer arrays
int_execution_times = time_selection_sort(int_arrays)

# Execute selection sort on float arrays
float_execution_times = time_selection_sort(float_arrays)

# Execute selection sort on arrays of random strings, length 5
l5_execution_times = time_selection_sort(len5_arrays)

# Execute selection sort on arrays of random strings, length 15
l15_execution_times = time_selection_sort(len15_arrays)

### Organize into pandas dataframe, examine differences in Selection Sort execution times.

In [11]:
execution_df = pd.DataFrame({
    'array_size'      : array_lengths
    ,'l5_execution'   : l5_execution_times
    ,'l15_execution'  : l15_execution_times
    ,'float_execution': float_execution_times
    ,'int_execution'  : int_execution_times
})
execution_df

| | array_size | l5_execution | l15_execution | float_execution | int_execution |
|---|---|---|---|---|---|
| 0 | 5000 | 0.924436 | 0.942175 | 1.00242 | 0.92129 |
| 1 | 10000 | 4.47068 | 3.894142 | 3.974851 | 5.167803 |
| 2 | 15000 | 9.008994 | 9.252333 | 9.990125 | 8.70569 |
| 3 | 20000 | 16.289444 | 15.774717 | 16.642081 | 16.072135 |
| 4 | 25000 | 24.689617 | 23.079684 | 26.826612 | 25.422895 |

Execution times are in seconds.


In [25]:
import sys
# Integer size
print(f'Size of integer: {sys.getsizeof(999999)}')
# Float size
print(f'Size of float: {sys.getsizeof(999999.999)}')
# Single character size
print(f'Size of single character: {sys.getsizeof("q")}')
# Five character size
print(f'Size of five character string: {sys.getsizeof("shfow")}')
# Fifteen character size
print(f'Size of fifteen character string: {sys.getsizeof("fhwovbeiyvaldug")}')

Size of integer: 28
Size of float: 24
Size of single character: 54
Size of five character string: 54
Size of fifteen character string: 64


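The strings of length five should theoretically take no more time to sort than those of length fifteen, but the measurements show no consistent ordering. The five-character arrays were slower at sizes 10,000, 20,000, and 25,000, and faster at 5,000 and 15,000. Two random strings almost always differ within the first character or two, so a comparison rarely has to look at the whole string and the length should matter very little here. The gaps are also small relative to the total runtimes, so they are probably within run-to-run noise.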

We know that Python string objects take up more space than numeric objects, as `sys.getsizeof` reports above: 54 bytes for a single ASCII character or a five-character string, 64 bytes for a fifteen-character string, 28 bytes for an integer (as constrained in the above code, under one million), and 24 bytes for a float as constrained above. These figures include Python's per-object overhead, not just the raw data.
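
The sizes above are for native Python objects. As a side check that is not in the original notebook, the sketch below shows that calling `list()` on a NumPy array keeps the elements as NumPy scalars, whereas `.tolist()` converts them to native Python numbers. The exact type names and sizes printed depend on the platform and NumPy version, so none are quoted here.

```python
import sys
import numpy as np

int_array = np.random.randint(low=0, high=1000000, size=5)
float_array = np.random.uniform(low=0, high=1000000, size=5)

as_list = list(int_array)        # elements stay NumPy integer scalars
as_native = int_array.tolist()   # elements become native Python ints

print(type(as_list[0]).__name__, sys.getsizeof(as_list[0]))
print(type(as_native[0]).__name__, sys.getsizeof(as_native[0]))

float_list = list(float_array)       # NumPy float scalars
float_native = float_array.tolist()  # native Python floats
print(type(float_list[0]).__name__, sys.getsizeof(float_list[0]))
print(type(float_native[0]).__name__, sys.getsizeof(float_native[0]))
```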

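Still, the numeric arrays often took longer to sort than the strings, when we might expect to see the opposite: the float arrays were the slowest at four of the five sizes, and at 25,000 elements both numeric types took longer than either string type. This might be due to some type of system overhead, though I ran all four sets of timings back to back in the same cell in order to eliminate as much of this variation as possible. Another possible contributor is that `list()` applied to a NumPy array keeps the elements as NumPy scalar objects rather than native Python numbers, and comparing NumPy scalars generally carries more overhead per comparison.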
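
One common way to reduce the influence of system noise on measurements like these is to repeat each timing and keep the fastest run. The sketch below is a minimal illustration under that assumption: `best_of` is a hypothetical helper, and the built-in `sorted` stands in for the sort function so that the example runs on its own.

```python
import time

def best_of(func, make_input, repeats=3):
    """Hypothetical helper: time func on a fresh input several times, return the fastest time in seconds."""
    timings = []
    for _ in range(repeats):
        data = make_input()              # fresh copy, built outside the timed region
        start = time.perf_counter()
        func(data)
        timings.append(time.perf_counter() - start)
    return min(timings)

original = [5, 3, 6, 2, 10] * 200
elapsed = best_of(sorted, lambda: list(original))
print(f'Fastest of 3 runs: {elapsed:.6f} s')
```

Building a fresh copy for every repeat matters for a sort that consumes its input, since a second run on the same list would otherwise time an empty list.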

The Selection Sort algorithm has $O(n^{2})$ complexity regardless of data type, and that is where the bulk of the processing time comes from. Note that though there is some variation, the timings for all four data types were in the same neighborhood. The $O(n^{2})$ complexity comes from repeating a linear scan $n$ times: each pass iterates over the remaining list to select the minimum value, removes it from the initial array with the `.pop()` method, and appends it to the `newArr` list. With up to $n$ elements scanned on each of $n$ passes, the total work is proportional to $n^{2}$.
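
To make the quadratic count concrete, the sketch below counts the element comparisons made by a scan-and-pop selection sort and compares the total with $n(n-1)/2$. The function `count_comparisons` is written for this illustration and is not part of the original notebook.

```python
def count_comparisons(n):
    """Count element comparisons when selection-sorting n items by repeated scan-and-pop."""
    arr = list(range(n, 0, -1))
    comparisons = 0
    while arr:
        smallest_index = 0
        for i in range(1, len(arr)):
            comparisons += 1
            if arr[i] < arr[smallest_index]:
                smallest_index = i
        arr.pop(smallest_index)
    return comparisons

# The measured count matches n * (n - 1) / 2 for each n
for n in (10, 100, 1000):
    print(n, count_comparisons(n), n * (n - 1) // 2)
```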

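What we expect to see (and do) is that regardless of data type, the algorithm takes progressively longer to sort, and that runtime grows quadratically rather than linearly. Equal increments in array length produce ever larger increases in runtime, which reflects the $O(n^{2})$ complexity: going from 5,000 to 25,000 elements (5 times the length) increased the runtime by a factor of roughly 25 to 28 for every data type, close to the $5^{2} = 25$ that quadratic growth predicts.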
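
One way to check the growth rate against the measurements is to fit a straight line to the logarithm of runtime versus the logarithm of array size; the slope estimates the exponent. The sketch below does this with `np.polyfit` on the timings copied from the table above. It is an illustrative check, not part of the original analysis, and with only five points per series the estimate is rough.

```python
import numpy as np

array_sizes = np.array([5000, 10000, 15000, 20000, 25000])
timings = {
    'l5_execution':    [0.924436, 4.47068, 9.008994, 16.289444, 24.689617],
    'l15_execution':   [0.942175, 3.894142, 9.252333, 15.774717, 23.079684],
    'float_execution': [1.00242, 3.974851, 9.990125, 16.642081, 26.826612],
    'int_execution':   [0.92129, 5.167803, 8.70569, 16.072135, 25.422895],
}

slopes = {}
for name, seconds in timings.items():
    # Degree-1 fit in log-log space: the slope is the estimated exponent
    slope, intercept = np.polyfit(np.log(array_sizes), np.log(seconds), 1)
    slopes[name] = slope
    print(f'{name}: estimated exponent {slope:.2f}')
```

An exponent near 2 for each series is what quadratic growth would produce.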