
* A **hash table** maps keys to values (`dict` in Python, `std::unordered_map` in C++).
* Keys must be hashable.
* A **hash function** maps a key to an integer in a fixed range.
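
As a small illustration of the points above, the sketch below uses Python's built-in `dict` and `hash`. The names `ages` and `n_buckets`, and the table size of 8, are assumptions chosen for the example, not part of any particular implementation.

```python
# A dict is Python's built-in hash table: keys map to values.
ages = {"alice": 30, "bob": 25}
ages["carol"] = 41            # insertion
bob_age = ages["bob"]         # lookup by key
has_dave = "dave" in ages     # membership test

# Reducing a key's hash to a slot index in a table with a fixed number of slots.
n_buckets = 8                 # illustrative table size (an assumption)
bucket_index = hash(42) % n_buckets   # always falls in range(n_buckets)
```

Whatever integer `hash` returns, the modulo step confines the result to the fixed range `0 .. n_buckets - 1`.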

In [3]:
# Find the first duplicate
# Assume every number is between 1 and 10

numbers = [1,2,5,3,5,3,3,4,6,3,3,2,2,4,6,6]  # not named `input`, to avoid shadowing the built-in

arr_contains = [False] * 10  # arr_contains[k - 1] is True once k has been seen
for i in numbers:
  if arr_contains[i - 1]:
    print(i)
    break
  arr_contains[i - 1] = True

5


In [5]:
# Find the first duplicate
# No information on the range!

numbers = [1,2,5,3,5,3,3,4,6,3,3,2,2,4,6,6]  # not named `input`, to avoid shadowing the built-in

arr_contains = {} # hash map to the rescue
for i in numbers:
  if i in arr_contains:
    print(i)
    break
  arr_contains[i] = True

5


In [6]:
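# Simplest hashing: keep the last decimal digit
# (named simple_hash so that Python's built-in hash() is not shadowed)

def simple_hash(x):
  return x % 10

simple_hash(27), simple_hash(33), simple_hash(147)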

(7, 3, 7)

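A hash function $h$ must satisfy
$$
h(x) \neq h(y) \rightarrow x \neq y
$$
or, equivalently, the contrapositive:
$$
x = y \rightarrow h(x) = h(y)
$$
The converse does not hold: $h(x) = h(y)$ says nothing about whether $x = y$ (above, $27$ and $147$ both hash to $7$).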
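
The two implications can be checked on concrete numbers. This is a minimal sketch; the function name `h` mirrors the formulas and is otherwise an assumption.

```python
def h(x):
    # Toy hash: remainder modulo 10 (the last decimal digit for non-negative x)
    return x % 10

x, y = 27, 147
same_hash = h(x) == h(y)   # both keys hash to 7
same_key = x == y          # yet the keys differ

# Different hashes do guarantee different keys
different_hash = h(27) != h(33)
different_key = 27 != 33
```

Here `same_hash` is true while `same_key` is false, which is exactly the case the formulas leave open.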

In [8]:
# Objects of other types can be hashed too

class Student:
  def __init__(self, name, age, mark):
    self.name = name
    self.age = age
    self.mark = mark

  # A poor hash function, but still a valid one
  def hash(self):
    return (len(self.name) + 3*self.age + self.mark) % 10

x = Student('Ivanov Ivan', 20, 4)
x.hash()

5
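
For Python's own `dict` to accept such objects as keys, the class would define the special methods `__hash__` and `__eq__` rather than a custom `hash` method. The sketch below is one possible way to do that; the choice of hashing all three fields through a tuple is an assumption, not the only option.

```python
class Student:
    def __init__(self, name, age, mark):
        self.name = name
        self.age = age
        self.mark = mark

    def __eq__(self, other):
        if not isinstance(other, Student):
            return NotImplemented
        return (self.name, self.age, self.mark) == (other.name, other.age, other.mark)

    def __hash__(self):
        # Hash exactly the fields compared in __eq__, so equal students hash equally
        return hash((self.name, self.age, self.mark))

grades = {Student('Ivanov Ivan', 20, 4): 'passed'}
# A separate but equal object finds the same entry
result = grades[Student('Ivanov Ivan', 20, 4)]
```

Changing `name`, `age`, or `mark` while the object is used as a key would change its hash, and the entry could no longer be found.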

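* Do not confuse hash functions for hash maps with cryptographic hash functions: they serve different purposes and therefore have different requirements.
* For a hash map, the speed of the hash function matters, because it runs on every insertion, removal, and search.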

In [10]:
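# A hash function must be deterministic!
# The example below is NOT a valid hash function:
# the same argument gives a different result on every call

counter = 0
def unstable_hash(x):
  global counter
  counter += 1
  return (x + counter)

unstable_hash(3), unstable_hash(3), unstable_hash(3)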

(4, 5, 6)

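Requirements for hashing:
* deterministic: equal keys always produce the same hash
* uniform: hashes are spread evenly over the output range
* keys must be hashable, which in Python practically means immutable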
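
To make "uniform" concrete, the sketch below counts how many keys land in each bucket for two table sizes. The key set (multiples of 10) and the helper name `bucket_counts` are assumptions picked to make the effect visible.

```python
from collections import Counter

keys = range(0, 1000, 10)   # 100 keys, all multiples of 10

def bucket_counts(keys, n_buckets):
    """Count how many keys fall into each bucket under h(k) = k % n_buckets."""
    return Counter(k % n_buckets for k in keys)

poor = bucket_counts(keys, 10)    # every key lands in bucket 0
better = bucket_counts(keys, 7)   # keys spread over all 7 buckets
```

With 10 buckets the hash is deterministic but far from uniform for this input; with 7 buckets the same keys fill every bucket almost equally.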

In [11]:
# A list is mutable and therefore unhashable: using it as a key raises an error
{[1,2]: 3}

TypeError: unhashable type: 'list'
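
A common workaround is to use an immutable equivalent, such as a `tuple`, as the key. A minimal sketch (the variable names are assumptions):

```python
point_values = {(1, 2): 3}      # a tuple of hashable items is itself hashable
value = point_values[(1, 2)]

# The list version fails at construction time
try:
    {[1, 2]: 3}
except TypeError as err:
    message = str(err)          # mentions that list is an unhashable type
```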

What about **collisions** (i.e. $h(x) = h(y)$ but $x \neq y$)?

There are many techniques; the two most popular are:

* separate chaining (each bucket holds a data structure containing all entries that hash to it, e.g. a list, binary tree, or self-balancing tree)
* open addressing (on a collision, probes for another free slot in the hash table itself)
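
Both strategies can be sketched in a few lines. The classes below are simplified illustrations under stated assumptions: a fixed table size of 8, Python lists as chains, linear probing as the open-addressing scheme, and no deletion or resizing. The class and method names are hypothetical.

```python
class ChainedHashMap:
    """Separate chaining: each bucket is a list of (key, value) pairs."""

    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)   # overwrite an existing key
                return
        bucket.append((key, value))

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)


class LinearProbingHashMap:
    """Open addressing with linear probing (fixed size, no deletion)."""

    _EMPTY = object()

    def __init__(self, n_slots=8):
        self.slots = [self._EMPTY] * n_slots

    def _probe(self, key):
        # Visit slots starting at the home slot, wrapping around once
        start = hash(key) % len(self.slots)
        for offset in range(len(self.slots)):
            yield (start + offset) % len(self.slots)

    def put(self, key, value):
        for i in self._probe(key):
            if self.slots[i] is self._EMPTY or self.slots[i][0] == key:
                self.slots[i] = (key, value)
                return
        raise RuntimeError("table is full")

    def get(self, key):
        for i in self._probe(key):
            if self.slots[i] is self._EMPTY:
                break                      # an empty slot ends the probe sequence
            if self.slots[i][0] == key:
                return self.slots[i][1]
        raise KeyError(key)


chained = ChainedHashMap()
probing = LinearProbingHashMap()
for table in (chained, probing):
    table.put(1, "a")
    table.put(9, "b")   # 9 % 8 == 1, so key 9 collides with key 1
    table.put(1, "c")   # overwrites the value stored for key 1
```

In the chained table both keys share bucket 1; in the probing table key 9 moves on to the next free slot.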

Given a good hash function and a table size not much smaller than the number of elements:

operation|average|worst
---|---|---
insertion|$O(1)$|$O(n)$
removal|$O(1)$|$O(n)$
search|$O(1)$|$O(n)$
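
The gap between the average and worst case can be seen by counting key comparisons during a search in a chained table. This is a rough sketch: the helpers `build_table` and `comparisons_to_find`, the table size of 1024, and the two toy hash functions are all assumptions made for the illustration.

```python
def build_table(keys, hash_fn, n_buckets):
    """Insert keys into a chained table; each bucket is a plain list."""
    buckets = [[] for _ in range(n_buckets)]
    for key in keys:
        buckets[hash_fn(key) % n_buckets].append(key)
    return buckets

def comparisons_to_find(buckets, hash_fn, key):
    """Count the key comparisons a chained search makes before finding `key`."""
    bucket = buckets[hash_fn(key) % len(buckets)]
    return bucket.index(key) + 1

def identity(k):
    return k      # spreads distinct small integers over distinct buckets

def constant(k):
    return 0      # degenerate hash: every key collides

n = 1000
keys = list(range(n))
good = build_table(keys, identity, 1024)
bad = build_table(keys, constant, 1024)

good_cost = comparisons_to_find(good, identity, 999)   # one comparison
bad_cost = comparisons_to_find(bad, constant, 999)     # walks the whole chain
```

With the constant hash the table degenerates into a single list, so a search costs up to $n$ comparisons.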