
## Searching and Hashing

**Objective**
* Explain and implement sequential search and binary search
* Understand the idea of hashing as a search technique
* Introduce the map abstract data type
* Implement the map abstract data type using hashing



### Searching

**Searching** is the algorithmic process of finding a particular item in a collection of items. A search typically answers either *True* or *False* as to whether the item is present in a collection. 

In Python, there is a very easy way to ask whether an item is in a list of items. We use the **in** operator. 

In [47]:
15 in [2, 12, 15, 31, 5]


True

## The Sequential Search

* When data items are stored in a collection such as a list, we say that they have a linear or sequential relationship. 
* Each data item is stored in a position relative to the others. In Python lists, these **relative positions** are the **index values** of the individual items. 
* Since these *index values* are *ordered*, it is possible to visit them in sequence. This process gives rise to our first searching technique, the **sequential search**.

**How sequential search works**: Starting at the first item in the list, we simply move from item to item, following the underlying sequential ordering, until we either find what we are looking for or run out of items. If we run out of items, we have discovered that the item we were searching for is not present in the list.



In [48]:
# Sequential Search Implementation
def sequentialSearch(alist, item):
    pos=0
    found=False
    
    while pos < len(alist) and not found:
        if alist[pos] == item:
            found = True
        else:
            pos = pos + 1
    
    return found

testlist = [2, 10, 43, 12, 44, 34, 94, 23, 59]
print(sequentialSearch(testlist, 34))
print(sequentialSearch(testlist, 80))

True
False


### Analyze Sequential Search

* For searching, the **basic unit of computation** is the number of comparisons performed. 
* Another assumption is that the list of items is **not ordered**. In other words, the probability that the item we are looking for is in any particular position is exactly the same for each position of the list.
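* **If the item is present**: *Best Case*: **1** comparison; *Worst Case*: **n** comparisons; *Average Case*: about **n/2** comparisons.
* **If the item is NOT present**: *Best Case*, *Worst Case* and *Average Case* are all **n** comparisons.
* As *n* grows, the constant factor 1/2 becomes insignificant, so the sequential search is **O(n)**.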
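
The comparison counts above can be checked empirically. The following is a minimal sketch, assuming a hypothetical helper `sequential_search_count` (not part of the original notebook) that returns the number of items examined along with the result. Averaged over every position of a list with *n* distinct items, the count comes out to (n+1)/2, which is the "about n/2" figure.

```python
def sequential_search_count(alist, item):
    """Return (found, comparisons) for a sequential search of alist."""
    comparisons = 0
    for value in alist:
        comparisons += 1              # one comparison per item visited
        if value == item:
            return True, comparisons
    return False, comparisons


testlist = [2, 10, 43, 12, 44, 34, 94, 23, 59]
n = len(testlist)

# Search once for every item that is present and average the cost.
total = sum(sequential_search_count(testlist, x)[1] for x in testlist)
print(total / n)                                 # 5.0, i.e. (n + 1) / 2 for n = 9

# An absent item forces a comparison against every element.
print(sequential_search_count(testlist, 80))     # (False, 9)
```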


In [49]:
# Sequential Search for ordered list
def orderedSequentialSearch(alist, item):
    pos=0
    found=False
    stopSearch=False
    
    while pos < len(alist) and not found and not stopSearch:
        if alist[pos] == item:
            found = True
        else:
            if alist[pos] > item:
                stopSearch=True
            else:
                pos = pos + 1
    
    return found

testlist = [0, 1, 2, 8, 13, 17, 19, 32, 42,]
print(orderedSequentialSearch(testlist, 3))
print(orderedSequentialSearch(testlist, 13))

False
True


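Ordering the list improves a sequential search only in the case where the item is not present: the search can stop as soon as it reaches an item greater than the one it is looking for.
* If the item is NOT present: Best Case: 1 comparison; Worst Case: n comparisons; Average Case: about n/2 comparisons. The search is still O(n).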

## The Binary Search

* It is possible to take greater advantage of the ordered list if we are clever with our comparisons. 

* Instead of searching the list in sequence, a **binary search** starts by examining the **middle item**. If that item is the one we are searching for, we are done. If it is **not the correct item**, we can use the ordered nature of the list to **eliminate half of the remaining items**: we continue with the **lower half of the list** if the search item is less than the middle value, or with the **upper half of the list** if it is greater. We keep repeating this process until we find the item or the list is exhausted.
* This algorithm is a great example of a **divide and conquer** strategy. We divide the problem into smaller pieces, solve the smaller pieces, and then reassemble the whole problem to get the result.


In [50]:
def binarySearch(alist, item):
    first=0
    last=len(alist)-1
    found=False
    
    while first <= last and not found:
        midpoint = (first + last)//2
        if alist[midpoint] == item:
            found = True
        else:
            if alist[midpoint] > item:
                last = midpoint-1
            else:
                first = midpoint+1
    
    return found



In [51]:
testlist = [0, 1, 2, 8, 13, 17, 19, 32, 42,]
print(binarySearch(testlist, 3))
print(binarySearch(testlist, 13))

False
True


**Binary Search Algorithm**: First check the middle item in the ordered list. If the item we are searching for is **less than** the middle item, we can simply perform a binary search on the left half of the original list. Likewise, if the item is **greater than** the middle item, we can perform a binary search on the right half of the original list. Either way, this is a **recursive call** to the binary search function, passing a smaller list.

In [52]:
# Recursive implementation of Binary Search
def recursiveBinarySearch(alist, item):
    if len(alist) == 0:
        return False
    else:
        midpoint = len(alist)//2
        if alist[midpoint] == item:
            return True
        else:
            if item < alist[midpoint]:
                return recursiveBinarySearch(alist[:midpoint], item)
            else:
                return recursiveBinarySearch(alist[midpoint+1:], item)

testlist = [3, 5, 6, 8, 11, 12, 14, 15, 17, 18]
print(recursiveBinarySearch(testlist, 8))
print(recursiveBinarySearch(testlist, 13))
    

True
False


### Analysis of Binary Search

* In the binary search algorithm, each comparison eliminates about half of the remaining items from consideration. 
* What is the maximum number of comparisons this algorithm will require to check the entire list? 
    * Start with **n** items; after the 1st comparison about **n/2** items are left.
    * After the 2nd comparison about **n/4** items are left, then **n/8**, **n/16**, and so on.
    * After the ith comparison, about **n/2^i** items are left.
* When we split the list enough times, we end up with a list that has just one item left. Either that is the item we are searching for or it is not.
* So after **i** comparisons we have **n/2^i = 1**. Solving for **i** gives us **i = log2(n)**.
* The maximum number of comparisons is **logarithmic** with respect to the number of items in the list. Therefore, the **Binary Search** is **O(log(n))**.
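
To make the halving argument concrete, here is a small sketch that counts how many midpoints the iterative binary search examines. The helper name `binary_search_count` is an assumption introduced for this illustration. Searching for a value larger than every item forces the longest path, and the count it reports matches floor(log2(n)) + 1.

```python
import math


def binary_search_count(alist, item):
    """Return (found, probes), where probes counts the midpoints examined."""
    first, last = 0, len(alist) - 1
    probes = 0
    while first <= last:
        midpoint = (first + last) // 2
        probes += 1
        if alist[midpoint] == item:
            return True, probes
        if alist[midpoint] > item:
            last = midpoint - 1       # continue in the lower half
        else:
            first = midpoint + 1      # continue in the upper half
    return False, probes


for n in (10, 1000, 1000000):
    data = list(range(n))
    # n itself is larger than every item in data, so the search never succeeds.
    found, probes = binary_search_count(data, n)
    print(n, probes, math.floor(math.log2(n)) + 1)
```

Going from one thousand to one million items only doubles the number of probes, from 10 to 20.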

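**Note**: The recursive call *recursiveBinarySearch(alist[:midpoint], item)* uses the **slice operator** to create the left half of the list. The analysis above assumes that slicing takes constant time, BUT in Python a slice of *k* items takes **O(k)** time, so the recursive version as written does not run in strictly logarithmic time. This can be avoided by passing the list along with its starting and ending indices.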
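
One way to avoid the slicing cost is to recurse on index bounds instead of on copies of the list. The sketch below is one possible variant, not the notebook's own implementation; the function name `recursive_binary_search` and its `first`/`last` parameters are introduced here for illustration.

```python
def recursive_binary_search(alist, item, first=0, last=None):
    """Binary search on alist[first..last] without copying the list."""
    if last is None:
        last = len(alist) - 1
    if first > last:                  # empty range: item is not present
        return False

    midpoint = (first + last) // 2
    if alist[midpoint] == item:
        return True
    if item < alist[midpoint]:
        return recursive_binary_search(alist, item, first, midpoint - 1)
    return recursive_binary_search(alist, item, midpoint + 1, last)


testlist = [3, 5, 6, 8, 11, 12, 14, 15, 17, 18]
print(recursive_binary_search(testlist, 8))     # True
print(recursive_binary_search(testlist, 13))    # False
```

Each call does a constant amount of work, so the number of calls, which is logarithmic, determines the running time.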

* Even though binary search is generally better than sequential search, for small values of *n* the additional cost of sorting is probably not worth it.
* If we can sort once and search multiple times, then the cost of sorting is not significant.
* For **large lists**, sorting even once can be **very expensive**, so when only a few searches are needed, a sequential search from the start may be the better choice.
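
The "sort once, search many times" pattern can be sketched with Python's standard `bisect` module, which performs the binary search on an already sorted list. The wrapper `sorted_contains` and the sample data are assumptions made for this example.

```python
import bisect


def sorted_contains(sorted_list, item):
    """Return True if item is in sorted_list, using a binary search."""
    index = bisect.bisect_left(sorted_list, item)
    return index < len(sorted_list) and sorted_list[index] == item


data = [44, 2, 94, 10, 59, 43, 23, 12, 34]
data.sort()                       # pay the sorting cost once

queries = [34, 80, 2, 95]         # then answer many searches cheaply
print([sorted_contains(data, q) for q in queries])
# [True, False, True, False]
```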

## Hashing

* **Hashing** is the concept of building a data structure that can be searched in **O(1)** time.
* If every item is stored exactly where it should be, a search needs only a single comparison to discover whether an item is present (which is typically not the case in practice).
* A **hash table** is a collection of items which are stored in such a way as to make it easy to find them later.
* Each position of the hash table, often called a **slot**, can hold an item and is named by an integer value starting at 0. 
* We can implement a hash table by using a **list** with each element initialized to the special Python value **None**.
* The **mapping** between an **item** and the **slot** where that item belongs in the hash table is called the **hash function**.
* A **hash function** will take any item in the collection and return an integer in the range of slot names, between **0** and **m-1**, where **m** is the table size.
* An example of a hash function is the **remainder method**: take an item from the collection, divide it by the table size, and return the remainder as its **hash value** (for a table of size 11, h(item) = item%11).
* This *remainder method* (modulo arithmetic) will typically be present in some form in all hash functions, since the result must be in the range of slot names.
* The **load factor** is commonly denoted by **lambda = numberofitems/tablesize**. 
* **Searching for an item**: use the hash function to compute the slot name for the item and then check the hash table to see whether the item is present. When there are no collisions, this search takes constant time, i.e. **O(1)**.
* When two or more items hash to the same slot, we have a **collision**. 
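
As a rough illustration of the ideas above, the following sketch places a few integers into an 11-slot table with the remainder method and then searches by recomputing the slot. The item values and the helper names `remainder_hash` and `contains` are chosen only for this example, and the items were picked so that none of them collide.

```python
def remainder_hash(item, table_size):
    """Remainder method: h(item) = item % table_size."""
    return item % table_size


table_size = 11
slots = [None] * table_size          # every slot starts out empty
items = [54, 26, 93, 17, 77, 31]     # illustrative values with no collisions

for item in items:
    slots[remainder_hash(item, table_size)] = item

print(slots)
# [77, None, None, None, 26, 93, 17, None, None, 31, 54]

load_factor = len(items) / table_size
print(load_factor)                   # 6 items in 11 slots


def contains(slots, item):
    """Single-comparison search; only valid while there are no collisions."""
    return slots[remainder_hash(item, len(slots))] == item


print(contains(slots, 93), contains(slots, 44))   # True False
```

Note that 44 would hash to slot 0, which already holds 77. Inserting it would be a collision, which this sketch does not handle.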

### Hash Function

* Given a collection of items, a hash function that maps each item into a **unique slot** is referred to as a **perfect hash function**. 
* Our goal is to create a hash function that *minimizes the number of collisions*, is *easy to compute* and *evenly distributes the items* in the hash table.
* The **folding method** for constructing hash functions begins by dividing the item into equal-size pieces (the last piece may not be of equal size). These pieces are then added together to give the resulting hash value. 
    * Example: suppose the item is the phone number 436-555-4601.
    * Take the digits and divide them into groups of 2 (43, 65, 55, 46, 01).
    * Adding them gives 43+65+55+46+01=210. Assume a hash table of size 11. 
    * Thus, 210%11=1, so the phone number hashes to slot 1. 
* **Mid-square method**: We first square the item, and then extract some portion of the resulting digits.
    * Example: if the item is *44*, we would first compute 44^2=1936. 
    * Extract the middle two digits, 93, and perform the remainder step. 93%11=5.
* Hash functions can also be created for character-based items such as **strings**.
    * Example: The word "cat" can be thought of as a sequence of *ordinal values*. 

In [53]:
print(ord('c'))
print(ord('a'))
print(ord('t'))

99
97
116


* Take these three ordinal values, add them up, and use the remainder method to get a hash value: 99+97+116=312, and applying the remainder method, 312%11=4. 
* Thus, the string "cat" goes in slot 4 of the hash table. 

In [54]:
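def hash_string(astring, tablesize):
    total = 0
    for pos, char in enumerate(astring):
        # Weight each character by its position (pos+1) so that anagrams
        # such as "cat" and "tac" usually hash to different slots.
        total = total + ord(char) * (pos + 1)

    return total % tablesize

print(hash_string("cat", 11))
print(hash_string("tac", 11))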

3
2
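
The folding and mid-square methods described earlier in this section can be sketched in the same style. The function names `folding_hash` and `mid_square_hash` are assumptions, as is the rule used here for picking the "middle" digits of the square (the two characters around the centre of its decimal string); other choices are equally valid.

```python
def folding_hash(digits, table_size, piece_size=2):
    """Split the digits into pieces, add the pieces, take the remainder."""
    digits = "".join(ch for ch in digits if ch.isdigit())   # drop dashes
    pieces = [int(digits[i:i + piece_size])
              for i in range(0, len(digits), piece_size)]
    return sum(pieces) % table_size


def mid_square_hash(item, table_size):
    """Square the item, take the middle two digits, take the remainder."""
    squared = str(item * item)
    mid = len(squared) // 2
    middle = squared[max(mid - 1, 0):mid + 1]
    return int(middle) % table_size


print(folding_hash("436-555-4601", 11))   # 43+65+55+46+01 = 210, 210 % 11 = 1
print(mid_square_hash(44, 11))            # 44^2 = 1936, middle 93, 93 % 11 = 5
```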


### Collision Resolution

* When two items hash to the same slot, we must have a systematic method for placing the second item in the hash table. This process is called **collision resolution**.
* One method of collision resolution is to start at the original hash value position and then move sequentially through the slots until we encounter the first slot that is empty. This collision resolution process is referred to as **open addressing**. 
* By systematically visiting each slot one at a time, we are performing an open addressing technique called **linear probing**.
* A disadvantage of linear probing is the tendency for **clustering**: items become clustered in the table. This means that if many collisions occur at the same hash value, a number of surrounding slots will be filled by the linear probing resolution. 
* **Rehashing** is the process of looking for another slot after a collision has occurred. With simple **linear probing** the rehash function is **newhashvalue = rehash(oldhashvalue)** where **rehash(pos) = (pos+1)%sizeoftable**. A **plus 3** rehash can be defined as **rehash(pos) = (pos+3)%sizeoftable**. 
* In general, **rehash(pos) = (pos+skip)%sizeoftable**. It is important to note that the **size of the skip** must be such that all the slots in the table will eventually be visited. To ensure this, it is often suggested that the **table size be a prime number**.
* **Quadratic probing** uses a skip consisting of successive **perfect squares**. For example, if the hash value is **h**, the successive hash values are **h+1, h+4, h+9, h+16 and so on**. 
* **Chaining** allows many items to exist at the same location in the hash table, by allowing each slot to hold a reference to a collection (or chain) of items.
    * When we want to search for an item, we use the hash function to generate the slot where it should reside. Since each slot holds a collection, we use a searching technique to decide whether the item is present. 
    * The advantage is that on average there are likely to be many fewer items in each slot, so the search is perhaps more efficient. 
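
Chaining can be sketched as a table whose slots are Python lists of (key, data) pairs. This is a minimal illustration under stated assumptions, not a tuned implementation: the class name `ChainedHashTable` is hypothetical, keys are assumed to be integers, and each chain is searched sequentially.

```python
class ChainedHashTable:
    """Hash table that resolves collisions by chaining."""

    def __init__(self, size=11):
        self.size = size
        self.slots = [[] for _ in range(size)]   # one empty chain per slot

    def hashfunction(self, key):
        return key % self.size                   # simple remainder method

    def put(self, key, data):
        chain = self.slots[self.hashfunction(key)]
        for i, (existing_key, _) in enumerate(chain):
            if existing_key == key:
                chain[i] = (key, data)           # replace the old value
                return
        chain.append((key, data))                # new key joins the chain

    def get(self, key):
        chain = self.slots[self.hashfunction(key)]
        for existing_key, data in chain:         # sequential search of chain
            if existing_key == key:
                return data
        return None


table = ChainedHashTable()
for key, value in [(77, "bird"), (44, "goat"), (55, "pig"), (26, "dog")]:
    table.put(key, value)

print(table.slots[0])       # 77, 44 and 55 all hash to slot 0
print(table.get(44))        # goat
print(table.get(99))        # None
```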



## Implementing the Map Abstract Data Type

* **Dictionary**: One of the most useful collections in Python. A dictionary is an associative data type, in which you can store **key-data** pairs.
* The **key** is used to look up the **associated data value**. This idea is often referred to as a **map**.
* The **map abstract data type** is defined as an unordered collection of associations between a key and a data value. The **keys** in a map are all **unique** so that there is a one-to-one relationship between a key and a value.
* **Map operations**:
    * **map()** : Create a new, empty map. It returns an empty map collection.
    * **put(key,val)** : Add a new key-value pair to the map. If the key already exists, replace the old value with the new value.
    * **get(key)** : Given a key, return the value stored in the map, or *None* otherwise.
    * **del** : Delete a key-value pair from the map, using a statement of the form *del map[key]*.
    * **len()** : Return the number of key-value pairs stored in the map.
    * **in** : Return *True* for a statement of the form *key in map* if the given key is in the map, *False* otherwise.
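
Python's built-in dictionary already supports each of these operations, so it serves as a quick reference for the behaviour a map implementation should have. The keys and values below are arbitrary sample data.

```python
animals = {}                  # map(): create a new, empty map
animals[54] = "cat"           # put(key, val)
animals[26] = "dog"
animals[54] = "lion"          # put with an existing key replaces the value

print(animals.get(54))        # get(key) -> lion
print(animals.get(99))        # get on a missing key -> None
print(len(animals))           # len() -> 2
print(26 in animals)          # in -> True

del animals[26]               # del map[key]
print(26 in animals)          # False
```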

In [56]:
class HashTable:
    def __init__(self):
        self.size = 11
        self.slots = [None] * self.size
        self.data = [None] * self.size

* **hashfunction** implements the simple remainder method.
* The **collision resolution** technique is linear probing with a +1 rehash function.

In [58]:
def put(self, key, data):
    hashvalue = self.hashfunction(key,len(self.slots))
    
    if self.slots[hashvalue] == None:
        self.slots[hashvalue] = key
        self.data[hashvalue] = data
    else:
        if self.slots[hashvalue] == key:
            self.data[hashvalue] = data #replace value
        else:
            nextslot = self.rehash(hashvalue,len(self.slots))
            while self.slots[nextslot] !=None and self.slots[nextslot] != key:
                nextslot = self.rehash(nextslot,len(self.slots))
            
            if self.slots[nextslot] == None:
                self.slots[nextslot] = key
                self.data[nextslot] = data
            else:
                self.data[nextslot] = data #replace value

def hashfunction(self,key,size):
    return key%size

def rehash(self,oldhash,size):
    return (oldhash+1)%size
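
The `rehash` method above uses a skip of 1. To see why the skip and the table size matter for the general form rehash(pos) = (pos+skip)%sizeoftable, the sketch below lists the slots a probe would visit. The helpers `probe_sequence` and `quadratic_probes` are assumptions introduced for this illustration, and wrapping the quadratic offsets with the modulo operator is also an assumption of the sketch.

```python
def probe_sequence(start, table_size, skip=1):
    """Slots visited by repeated rehashing until a slot would repeat."""
    visited = []
    pos = start
    while pos not in visited:
        visited.append(pos)
        pos = (pos + skip) % table_size
    return visited


def quadratic_probes(start, table_size, count):
    """First `count` slots h, h+1, h+4, h+9, ... wrapped to the table size."""
    return [(start + i * i) % table_size for i in range(count)]


print(probe_sequence(0, 11, skip=1))   # visits all 11 slots in order
print(probe_sequence(0, 11, skip=3))   # still visits all 11 slots
print(probe_sequence(0, 9, skip=3))    # [0, 3, 6]: most slots are never seen
print(quadratic_probes(0, 11, 5))      # [0, 1, 4, 9, 5]
```

With a table size of 9 and a skip of 3, the probe cycles through only three slots, which is the situation a prime table size is meant to prevent.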

* The **get** function begins by computing the initial hash value. If the key is not in the initial slot, **rehash** is used to locate the next possible slot position. 

In [60]:
def get(self,key):
    startslot = self.hashfunction(key,len(self.slots))
    
    data = None
    stop = False
    found = False
    position = startslot
    while self.slots[position] != None and not found and not stop:
        if self.slots[position] == key:
            found = True
            data = self.data[position]
        else:
            position=self.rehash(position,len(self.slots))
            if position == startslot:
                stop = True
    return data

def __getitem__(self,key):
    return self.get(key)

def __setitem__(self,key,data):
    self.put(key,data)

**Full hash table implementation with example**

In [67]:
class HashTable:
    def __init__(self):
        self.size = 11
        self.slots = [None] * self.size   # keys
        self.data = [None] * self.size    # values, parallel to slots

    def put(self, key, data):
        hashvalue = self.hashfunction(key, len(self.slots))

        if self.slots[hashvalue] is None:
            self.slots[hashvalue] = key
            self.data[hashvalue] = data
        else:
            if self.slots[hashvalue] == key:
                self.data[hashvalue] = data  # replace
            else:
                # Linear probing: move on until an empty slot or the key is found
                nextslot = self.rehash(hashvalue, len(self.slots))
                while (self.slots[nextslot] is not None and
                       self.slots[nextslot] != key):
                    nextslot = self.rehash(nextslot, len(self.slots))

                if self.slots[nextslot] is None:
                    self.slots[nextslot] = key
                    self.data[nextslot] = data
                else:
                    self.data[nextslot] = data  # replace

    def hashfunction(self, key, size):
        return key % size

    def rehash(self, oldhash, size):
        return (oldhash + 1) % size

    def get(self, key):
        startslot = self.hashfunction(key, len(self.slots))

        data = None
        stop = False
        found = False
        position = startslot
        while (self.slots[position] is not None and
               not found and not stop):
            if self.slots[position] == key:
                found = True
                data = self.data[position]
            else:
                position = self.rehash(position, len(self.slots))
                if position == startslot:
                    stop = True  # wrapped around: key is not in the table
        return data

    def __getitem__(self, key):
        return self.get(key)

    def __setitem__(self, key, data):
        self.put(key, data)

H=HashTable()
H[54]="cat"
H[26]="dog"
H[93]="lion"
H[17]="tiger"
H[77]="bird"
H[31]="cow"
H[44]="goat"
H[55]="pig"
H[20]="chicken"

print(H.slots)
print(H.data)

[77, 44, 55, 20, 26, 93, 17, None, None, 31, 54]
['bird', 'goat', 'pig', 'chicken', 'dog', 'lion', 'tiger', None, None, 'cow', 'cat']


In [69]:
print(H[20])

chicken


In [70]:
print(H[99])

None


### Analysis of Hashing

* In the best case *(with no collisions)*, hashing provides an **O(1)**, constant-time search technique.
* The load factor is commonly denoted by **lambda = numberofitems/tablesize**.
    * If **lambda is small**, there is a lower chance of collisions, meaning that items are more likely to be in the slots where they belong.
    * If **lambda is large**, the table is filling up and there will be more collisions.
    * With **chaining**, more collisions mean more items on each chain.
    * **Successful search** with open addressing and linear probing: the average number of comparisons is approximately (1/2)(1 + 1/(1-lambda)).
    * **Unsuccessful search** with open addressing and linear probing: approximately (1/2)(1 + (1/(1-lambda))^2).
    * **Successful search with chaining**: approximately 1 + lambda/2.
    * **Unsuccessful search with chaining**: approximately lambda.
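
Evaluating these approximations for a few load factors shows how quickly linear probing degrades as the table fills. The function names below are assumptions made for this sketch; the formulas are the ones listed above.

```python
def linear_probing_successful(lam):
    """Approximate comparisons for a successful search, linear probing."""
    return 0.5 * (1 + 1 / (1 - lam))


def linear_probing_unsuccessful(lam):
    """Approximate comparisons for an unsuccessful search, linear probing."""
    return 0.5 * (1 + (1 / (1 - lam)) ** 2)


def chaining_successful(lam):
    return 1 + lam / 2


def chaining_unsuccessful(lam):
    return lam


for lam in (0.25, 0.5, 0.75, 0.9):
    print(lam,
          round(linear_probing_successful(lam), 2),
          round(linear_probing_unsuccessful(lam), 2),
          round(chaining_successful(lam), 2),
          round(chaining_unsuccessful(lam), 2))
```

At a load factor of 0.5, linear probing needs about 1.5 comparisons for a successful search and 2.5 for an unsuccessful one. At 0.9 the unsuccessful case rises to roughly 50, while the chaining figures grow only linearly with lambda.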