# Introduction to Computation and Python Programming

## Lecture 8

### Today
----------

- Sorting and Searching

### Search Algorithms

- A **search algorithm** is a method for finding an item or group of items with specific properties within a collection of items
- The collection is referred to as the **search space**
- The collection could be implicit:
    - example - find square root as a search problem
        - exhaustive search
        - bisection search
        - Newton-Raphson
- The collection could be explicit:
    - example - determine whether a student record is in a stored collection of data
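
To make the implicit search space concrete, here is a minimal sketch of finding a square root by bisection search. The function name `sqrt_bisection` and the tolerance `epsilon` are illustrative assumptions, not part of the lecture code.

```python
def sqrt_bisection(x, epsilon=1e-6):
    """Approximate the square root of a non-negative x by bisection.

    The search space is implicit: all floats between 0 and max(1, x).
    """
    if x < 0:
        raise ValueError("x must be non-negative")
    low, high = 0.0, max(1.0, x)
    guess = (low + high) / 2
    # Halve the interval until guess**2 is within epsilon of x
    while abs(guess * guess - x) >= epsilon:
        if guess * guess < x:
            low = guess    # answer lies in the upper half
        else:
            high = guess   # answer lies in the lower half
        guess = (low + high) / 2
    return guess


print(sqrt_bisection(25))
```

Each iteration discards half of the remaining interval, which is the same idea that bisection search applies to an explicit sorted list.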

### Let us look at two algorithms

- Each meets the specification
```python
def search(L, e):
    """Assumes L is a list.
       Returns True if e is in L and False otherwise"""
```

- **linear search**
    - brute force search (aka British Museum algorithm)
    - list does not need to be sorted
- **bisection search**
    - list MUST be sorted to give correct answer
    - will look at two different implementations of the algorithm

### Linear Search on Unsorted List

```python
def search(L, e):
    for i in range(len(L)):
        if L[i] == e:
            return True
    return False
```

- worst case must look through all elements to decide
- $O(len(L))$ for the loop and $O(1)$ to test if `L[i] == e`
- overall complexity is $O(n)$ where $n$ is $len(L)$

### Linear Search on Sorted List

```python
def search(L, e):
    for i in range(len(L)):
        if L[i] == e:
            return True
        if L[i] > e:
            return False
    return False
```

- must look until reaching a number greater than `e`
- overall complexity is still $O(n)$
- average running time is, however, better
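
One way to see the better average behaviour is to count how many elements each version examines. The helpers `probes_unsorted` and `probes_sorted` below are hypothetical instrumented variants of the two linear searches, written only for this illustration.

```python
def probes_unsorted(L, e):
    """Number of elements examined by the plain linear search."""
    count = 0
    for item in L:
        count += 1
        if item == e:
            break
    return count


def probes_sorted(L, e):
    """Number of elements examined when L is sorted and the search
    stops at the first element that is >= e."""
    count = 0
    for item in L:
        count += 1
        if item >= e:   # found e, or passed the place where e would be
            break
    return count


L = list(range(0, 100, 2))     # sorted even numbers 0, 2, ..., 98
print(probes_unsorted(L, 7))   # 50: 7 is absent, so every element is checked
print(probes_sorted(L, 7))     # 5: stops at 8, the first element above 7
```

When `e` is larger than every element, both versions still examine the whole list, which is why the worst case stays $O(n)$.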

### Using Bisection Search

- The algorithm:
1. Pick an index, `i`, that divides the list `L` roughly in half
2. Ask if `L[i] == e`
3. If not, ask whether `L[i]` is larger or smaller than `e`
4. Depending upon the answer, search either the left or right half of `L` for `e`
<br><br>
- A new version of a divide-and-conquer algorithm:
    - Break into smaller version of problem (smaller list), plus some simple operations
    - Answer to smaller version is answer to original problem

### Bisection Search implementation

```python
def search(L, e):
    """Assumes L is a list, the elements of which are in
          ascending order.
       Returns True if e is in L and False otherwise"""
    
    def bSearch(L, e, low, high):
        #Decrements high - low
        if high == low:
            return L[low] == e
        mid = (low + high)//2
        if L[mid] == e:
            return True
        elif L[mid] > e:
            if low == mid: #nothing left to search
                return False
            else:
                return bSearch(L, e, low, mid - 1)
        else:
            return bSearch(L, e, mid + 1, high)
        
    if len(L) == 0:
        return False
    else:
        return bSearch(L, e, 0, len(L) - 1)
```

- Functions like `search` above are called **wrapper functions**
    - provides a nice interface for the client but is essentially a pass-through that does no serious computation
    - helper function `bSearch` does the work
    - why not call `bSearch` directly?
        - because parameters `low` and `high` have nothing to do with the abstraction of searching a list for an element
        - implementation details should be hidden
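
The same specification can also be met without recursion. The sketch below shows an iterative version and a version built on the standard library's `bisect.bisect_left`; the names `search_iterative` and `search_bisect` are assumptions made for this example.

```python
from bisect import bisect_left


def search_iterative(L, e):
    """Assumes L is a list sorted in ascending order.
       Returns True if e is in L and False otherwise"""
    low, high = 0, len(L) - 1
    while low <= high:
        mid = (low + high) // 2
        if L[mid] == e:
            return True
        elif L[mid] > e:
            high = mid - 1   # keep searching the left half
        else:
            low = mid + 1    # keep searching the right half
    return False


def search_bisect(L, e):
    """Same contract, using bisect_left to find where e would be inserted."""
    i = bisect_left(L, e)
    return i < len(L) and L[i] == e


data = [1, 3, 5, 7, 9]
print(search_iterative(data, 7), search_bisect(data, 7))   # True True
print(search_iterative(data, 4), search_bisect(data, 4))   # False False
```

Neither version needs a wrapper, because `low` and `high` are local variables rather than parameters.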

---

##### Complexity of Bisection Search

- $O(log\ n)$ bisection search calls
- reduce size of problem by a factor of 2 on each step
- pass list and indices as parameters
- list never copied, just re-passed as pointer
- constant work inside function
- $=> O(log\ n)$
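
The logarithmic growth can be checked by counting probes in a worst case. This is a rough sketch: `count_probes` is a hypothetical helper, and a `range` object stands in for a sorted list so that no large list has to be built.

```python
def count_probes(n):
    """Count loop iterations of a bisection search over n sorted items
    when the target is larger than every item (a worst case)."""
    items = range(n)   # stand-in for a sorted list; indexing is O(1)
    target = n         # larger than every item, so never found
    low, high = 0, n - 1
    probes = 0
    while low <= high:
        probes += 1
        mid = (low + high) // 2
        if items[mid] == target:
            break
        elif items[mid] > target:
            high = mid - 1
        else:
            low = mid + 1
    return probes


for n in (1_000, 1_000_000):
    print(n, count_probes(n))   # 10 probes, then 20 probes
```

Multiplying the size of the collection by 1,000 adds only about 10 probes, since $log_{2}(1000)$ is roughly 10.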
        

### Searching a Sorted List

- using **linear search**, search for an element is $O(n)$
- using **binary search**, can search for an element in $O(log\ n)$
    - assumes the **list is sorted**
- when does it make sense to **sort first then search**?
    - $SORT + O(log\ n) < O(n) => SORT < O(n) - O(log\ n)$
    - So, Sorting needs to be less than $O(n)$
- **NEVER TRUE!**
    - to sort a collection of $n$ elements, must look at each at least once

---

#### Amortized Cost

- why bother sorting first?
- in some cases, may **sort a list once**, then do **many searches**
- **AMORTIZE cost** of the sort over many searches
- $SORT + K*O(log\ n) < K*O(n)$
    - for large $K$, **SORT time becomes irrelevant**, if cost of sorting is small enough
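
As a back-of-the-envelope illustration, the inequality can be solved for $K$ using idealized operation counts. The sketch assumes a sort costs $n*log_{2}(n)$ steps, a binary search $log_{2}(n)$ steps and a linear search $n$ steps, ignoring all constant factors; `break_even_searches` is a hypothetical name.

```python
import math


def break_even_searches(n):
    """Smallest K with  n*log2(n) + K*log2(n) < K*n  (idealized step counts)."""
    sort_cost = n * math.log2(n)
    saving_per_search = n - math.log2(n)   # linear search minus binary search
    return math.floor(sort_cost / saving_per_search) + 1


for n in (1_000, 1_000_000):
    print(n, break_even_searches(n))   # 11 searches, then 20 searches
```

Under these assumptions the sort pays for itself after roughly $log_{2}(n)$ searches. Real break-even points depend on the constants hidden by the $O$ notation.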

### Sorting 

- Standard Python implementation of Sorting runs in roughly $O(n*log(n))$ time, where $n$ is the length of the list
- In practice you will rarely need to implement your own `sort` function
- Python's built-in `sort` method: `L.sort()` sorts the list `L`
- built-in function `sorted(L)` returns a list with the same elements as L, sorted but **does not mutate** `L`
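
A short example of the difference between the two:

```python
L = [3, 1, 2]

S = sorted(L)      # builds a new sorted list
print(L)           # [3, 1, 2]  (L is unchanged)
print(S)           # [1, 2, 3]

result = L.sort()  # sorts L in place
print(L)           # [1, 2, 3]
print(result)      # None  (sort() mutates L and returns nothing)
```

Because `L.sort()` returns `None`, writing `L = L.sort()` loses the list.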

### Selection Sort

```python
def selSort(L):
    """Assumes that L is a list of elements that can be
         compared using <.
       Sorts L in ascending order"""
    suffixStart = 0
    while suffixStart != len(L):
        #look at each element in suffix
        for i in range(suffixStart, len(L)):
            if L[i] < L[suffixStart]:
                #swap position of elements
                L[suffixStart], L[i] = L[i], L[suffixStart]
        suffixStart += 1
```
- works by maintaining the **loop invariant**:
    - given a partitioning of the list into a **prefix** `L[0:i]` and a **suffix** `L[i:len(L)]`:
        - prefix is sorted
        - no element in the prefix is larger than the smallest element in the suffix
- Each step of the algorithm - moves one element from the suffix to the prefix
    - append the minimum element from the suffix to the end of the prefix
    - By induction we know that if **loop invariant** was true before, it is true after the move
- Complexity:
    - Inner Loop: $O(len(L))$
    - Outer Loop: $O(len(L))$
    - Complexity of entire function is $O(len(L)^2)$
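
The loop invariant can be checked mechanically. The sketch below repeats the selection sort logic and adds assertions after each pass of the outer loop; `sel_sort_checked` is a hypothetical name, and the checks themselves are expensive, so this is for illustration only.

```python
def sel_sort_checked(L):
    """Selection sort that asserts the loop invariant after every pass."""
    suffix_start = 0
    while suffix_start != len(L):
        for i in range(suffix_start, len(L)):
            if L[i] < L[suffix_start]:
                L[suffix_start], L[i] = L[i], L[suffix_start]
        suffix_start += 1
        prefix, suffix = L[:suffix_start], L[suffix_start:]
        # Invariant part 1: the prefix is sorted
        assert prefix == sorted(prefix)
        # Invariant part 2: nothing in the prefix exceeds anything in the suffix
        assert not suffix or max(prefix) <= min(suffix)


data = [5, 2, 9, 1, 5, 6]
sel_sort_checked(data)
print(data)   # [1, 2, 5, 5, 6, 9]
```

When the loop ends the suffix is empty, so the invariant says the whole list is sorted.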

### Merge Sort

- Can do much better than quadratic time using a **divide-and-conquer algorithm**
- In general, a divide-and-conquer algorithm is characterized by:
    - A threshold input size below which the problem is not subdivided (this is called the **recursive base**)
    - The size and number of sub-instances into which an instance is split (in most examples we have seen, the ratio was **2**)
    - The algorithm used to combine sub-solutions

---

#### Merge Sort invented in 1945 by John von Neumann
1. If the list is of length 0 or 1, it is already sorted
2. If the list has more than one element, split the list into two lists, and use merge sort to sort each of them.
3. Merge the results

- Von Neumann's observation is that two sorted lists can be efficiently merged into a single sorted list
- example: merging `[1,5,12,18,19,20]` and `[2,3,4,17]`
<br>


|Remaining in list 1|Remaining in list 2|Result|
|:----|:----|:----|
|`[1,5,12,18,19,20]`|`[2,3,4,17]`|`[]`|
|`[5,12,18,19,20]`|`[2,3,4,17]`|`[1]`|
|`[5,12,18,19,20]`|`[3,4,17]`|`[1,2]`|
|`[5,12,18,19,20]`|`[4,17]`|`[1,2,3]`|
|`[5,12,18,19,20]`|`[17]`|`[1,2,3,4]`|
|`[12,18,19,20]`|`[17]`|`[1,2,3,4,5]`|
|`[18,19,20]`|`[17]`|`[1,2,3,4,5,12]`|
|`[18,19,20]`|`[]`|`[1,2,3,4,5,12,17]`|
|`[]`|`[]`|`[1,2,3,4,5,12,17,18,19,20]`|
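
The merge traced in the table can be sketched as follows. This is one possible implementation, not necessarily the one in the course code. It uses index variables instead of removing elements from the front of each list.

```python
def merge(left, right):
    """Assumes left and right are lists sorted in ascending order.
       Returns a new sorted list containing the elements of both."""
    result = []
    i, j = 0, 0
    # Repeatedly copy the smaller of the two front elements
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            result.append(left[i])
            i += 1
        else:
            result.append(right[j])
            j += 1
    # One list is exhausted; copy whatever remains of the other
    result.extend(left[i:])
    result.extend(right[j:])
    return result


print(merge([1, 5, 12, 18, 19, 20], [2, 3, 4, 17]))
# [1, 2, 3, 4, 5, 12, 17, 18, 19, 20]
```

The final `extend` calls correspond to the last row of the table, where the rest of list 1 is copied once list 2 is empty.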



### Complexity Analysis of Mergesort

#### Complexity of the merge process
- Two constant time operations: **comparing** the values of elements and **copying** elements from one list to another
- Number of comparisons is $O(len(L))$ where `L` is the longer of the two lists
- Number of copy operations is $O(len(L1) + len(L2))$ - each element gets copied exactly once
- Therefore, merging two sorted lists is linear ($O(n)$) in the length of the lists
<br><br>
- See Code
<br><br>
#### Complexity of mergeSort
- Complexity of `merge` is $O(len(L))$
- At each level of recursion the total number of elements merged is `len(L)`
- Therefore, time complexity of `mergeSort` is $O(len(L))$ multiplied by the number of levels of recursion
- Since we divide in half each time, number of levels of recursion is $O(log(len(L)))$
- Therefore, time complexity of `mergeSort` is $O(n*log(n))$, where $n$ is $len(L)$
- Comparison with selection sort - if L has 10,000 elements:
    - $len(L)^2$ is 100 million
    - $len(L)*log_{2}(len(L))$ is about 133,000
- There is a price:
    - Selection sort is an **in-place** sorting algorithm. Swapping requires a constant amount of extra storage (one element)
    - In comparison, Merge Sort involves making copies of the list. Space complexity is $O(len(L))$
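
A minimal sketch of the whole algorithm is shown below. To keep it short, the standard library's `heapq.merge` stands in for the merge step. The course code may use its own `merge` function instead, and `merge_sort` is an assumed name.

```python
import heapq


def merge_sort(L):
    """Returns a new list with the elements of L in ascending order.
       L itself is not mutated."""
    if len(L) < 2:
        return L[:]                    # length 0 or 1: already sorted
    middle = len(L) // 2
    left = merge_sort(L[:middle])      # sort each half recursively
    right = merge_sort(L[middle:])
    # heapq.merge lazily merges sorted inputs; list() collects the result
    return list(heapq.merge(left, right))


data = [2, 1, 4, 5, 3]
print(merge_sort(data))   # [1, 2, 3, 4, 5]
print(data)               # [2, 1, 4, 5, 3]
```

The slices `L[:middle]` and `L[middle:]` and the merged result are the copies that account for the $O(len(L))$ space cost.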

### Sorting in Python

- see code

### Hash Tables

- Merge sort + binary search seems like a good combination for amortized searches on a list
- if we search the list $k$ times, overall complexity is $O(n*log(n)\ + \ k*log(n))$
- Can we do better if we are willing to do some pre-processing?
<br><br>
#### Hashing
- Dictionaries in Python use a technique called hashing - converting the key to an integer and then using that integer to index into a list
- A value of any type can be converted to an integer, e.g. by reading its internal bit representation as a number: `'abc'` is stored as the bits `011000010110001001100011`, which is the decimal integer `6,382,179`.
- Think of how big the list needs to be
- **Hash function**: maps a large space of inputs (e.g. all natural numbers) to a smaller space of outputs (e.g. the natural numbers between 0 and 5000). Hash functions can be used to convert a large space of keys to a smaller space of integer indices
- Hash function is a **many-to-one mapping**. When two inputs are mapped to the same output it is called a **collision**
- A good hash function produces a **uniform distribution**: every output is equally probable, which minimizes the chance of a collision
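
The ideas above can be put together in a small sketch. Everything here is an illustrative assumption: `str_to_int` and `SimpleHashTable` are hypothetical names, the hash function is simply the key's bytes read as an integer modulo the number of buckets, and collisions are handled by keeping a list of key/value pairs in each bucket. Python's own dictionaries are implemented differently.

```python
def str_to_int(s):
    """Read the UTF-8 bytes of s as one big-endian integer."""
    return int.from_bytes(s.encode("utf-8"), "big")


class SimpleHashTable:
    """A toy dictionary with string keys."""

    def __init__(self, num_buckets):
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, key):
        # Hash function: map a huge space of integers to a bucket index
        return str_to_int(key) % len(self.buckets)

    def add(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)   # key already present: overwrite
                return
        bucket.append((key, value))

    def get(self, key):
        """Returns the value stored for key, or None if key is absent."""
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None


print(str_to_int("abc"))   # 6382179

table = SimpleHashTable(2)
table.add("abc", 1)
table.add("a", 2)          # collides with "abc": both integers are odd
print(table.get("abc"), table.get("a"), table.get("zzz"))   # 1 2 None
```

With few buckets, many keys share a bucket and `get` degrades toward a linear search. With enough buckets and a uniform hash function, each bucket stays short and lookups take roughly constant time.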