
# Hashing & Symbol Tables 




## 1) Hashing Basics
A hash function maps a key (often a string) to an integer, which is then reduced to an index into a table.
In Python, ord() returns the Unicode code point of a single character; for example, ord('a') is 97.
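
As a quick illustration of the building block used throughout this notebook, the snippet below prints a few code points. The sample characters are arbitrary choices for the example.

```python
# ord() maps one character to its Unicode code point.
for ch in "aA é":
    print(repr(ch), ord(ch))
# 'a' 97
# 'A' 65
# ' ' 32
# 'é' 233

# chr() is the inverse of ord().
print(chr(104) + chr(105))  # hi
```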



### 1.1 A Naive Hash: sum of character codes
Try a toy function: sum ord() over every character. As the output shows, the result does not identify the string uniquely.


In [3]:

def naive_hash_sum(s: str) -> int:
    return sum(map(ord, s))

samples = ["hello world", "world hello", "gello xorld"]
for t in samples:
    print(f"{t!r} -> {naive_hash_sum(t)}")


'hello world' -> 1116
'world hello' -> 1116
'gello xorld' -> 1116



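Observation: Different strings can yield the same sum ⇒ collision. Addition ignores order, so any rearrangement of the same characters collides ('hello world' vs 'world hello'), and so do changes that cancel out ('h' to 'g' is -1, 'w' to 'x' is +1). Try more below.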
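
To make the order-blindness concrete, here is a small sketch that hashes every ordering of a three-letter string. The string "abc" is an arbitrary choice; any string behaves the same way.

```python
from itertools import permutations

def naive_hash_sum(s: str) -> int:
    return sum(map(ord, s))

# Hash all 6 orderings of "abc" and collect the results.
values = {"".join(p): naive_hash_sum("".join(p)) for p in permutations("abc")}
print(values)

# Every ordering produces the same value: 97 + 98 + 99.
print(set(values.values()))  # {294}
```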



### 1.2 Improving the Toy Hash: position-weighted sum
Multiply each character's code by its 1-based position (1, 2, 3, ...) so that the order of characters affects the result and trivial collisions become less common.
(This is still a toy, not a production hash.)


In [4]:

def myhash(s: str) -> int:
    mult = 1
    hv = 0
    for ch in s:
        hv += mult * ord(ch)
        mult += 1
    return hv

for t in samples:
    print(f"{t!r} -> {myhash(t)}")


'hello world' -> 6736
'world hello' -> 6616
'gello xorld' -> 6742



Still not perfect! Order now matters, but collisions are easy to find: 'ab' and 'ca' both hash to 293 (97 + 2*98 = 99 + 2*97). Try the prompts below.
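
One way to hunt for collisions is brute force. The sketch below checks every two-letter lowercase string; restricting the search to two letters is an assumption made to keep it fast, and `myhash` is rewritten compactly but computes the same value as the cell above.

```python
from itertools import product
from string import ascii_lowercase

def myhash(s: str) -> int:
    # Same position-weighted sum as above, written as one expression.
    return sum(i * ord(ch) for i, ch in enumerate(s, start=1))

seen = {}        # hash value -> first string that produced it
collisions = []  # (earlier string, later string, shared hash value)
for pair in product(ascii_lowercase, repeat=2):
    s = "".join(pair)
    hv = myhash(s)
    if hv in seen:
        collisions.append((seen[hv], s, hv))
    else:
        seen[hv] = s

print(collisions[0])  # ('ab', 'ca', 293)
print(len(collisions), "collisions among", 26 * 26, "strings")
```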



## 2) From Hash Values to Table Indices
To map a (possibly huge) hash value to a table index, we typically use modulo:
index = hash_value % table_size
For a positive table_size, Python's % always returns a value in [0, table_size - 1], even when hash_value is negative.


In [5]:

def index_in_table(hv: int, size: int) -> int:
    return hv % size

print(index_in_table(6736, 256))
print(index_in_table(6616, 256))


80
216
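
Modulo keeps indices in range, but it also folds many hash values onto the same index, so it is a second source of collisions on top of the hash function itself. A small sketch, with input values picked for illustration because they differ by multiples of 256:

```python
def index_in_table(hv: int, size: int) -> int:
    return hv % size

# All four values land on index 80 in a 256-slot table.
for hv in (80, 336, 6736, -176):
    print(hv, "->", index_in_table(hv, 256))
# 80 -> 80
# 336 -> 80
# 6736 -> 80
# -176 -> 80
```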



## 3) Implementing a Simple Hash Table (Open Addressing, Linear Probing)
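We'll build a minimal hash table storing key–value pairs. On a collision, linear probing steps to the next slot (wrapping around at the end of the table) until it finds the key or an empty slot.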


In [6]:

class HashItem:
    def __init__(self, key, value):
        self.key = key
        self.value = value

class HashTable:
    def __init__(self, size: int = 256):
        self.size = size
        self.slots = [None for _ in range(self.size)]  # stores HashItem or None
        self.count = 0  # number of items actually stored

    def _hash(self, key: str) -> int:
        mult = 1
        hv = 0
        for ch in key:
            hv += mult * ord(ch)
            mult += 1
        return hv % self.size

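    def put(self, key, value):
        """Insert or update a key with open addressing (linear probing)."""
        h = self._hash(key)
        # Probe at most `size` slots so a full table cannot loop forever.
        for _ in range(self.size):
            slot = self.slots[h]
            if slot is None:
                # Empty slot: the key is new, so store it here.
                self.slots[h] = HashItem(key, value)
                self.count += 1
                return
            if slot.key == key:
                # Existing key: update in place without changing count.
                slot.value = value
                return
            h = (h + 1) % self.size
        raise RuntimeError("hash table is full")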

    def get(self, key):
        """Retrieve a value by key using linear probing; return None if not found."""
        h = self._hash(key)
        start = h
        while self.slots[h] is not None:
            if self.slots[h].key == key:
                return self.slots[h].value
            h = (h + 1) % self.size
            if h == start:
                break
        return None

    def __setitem__(self, key, value):
        self.put(key, value)

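    def __getitem__(self, key):
        # Probe directly instead of calling get(), so a stored value of
        # None is not mistaken for a missing key.
        h = self._hash(key)
        for _ in range(self.size):
            slot = self.slots[h]
            if slot is None:
                break
            if slot.key == key:
                return slot.value
            h = (h + 1) % self.size
        raise KeyError(key)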


### 3.1 Quick Demo

In [7]:

ht = HashTable(size=16)  # small on purpose to force collisions sooner
ht['apple'] = 42
ht['banana'] = 17
ht['grape'] = 73
ht['apple'] = 99  # update

print("apple ->", ht.get('apple'))
print("banana ->", ht.get('banana'))
print("missing ->", ht.get('missing'))

# peek at occupied slots (for learning/demo)
occupied = [(i, (slot.key, slot.value)) for i, slot in enumerate(ht.slots) if slot is not None]
occupied


apple -> 99
banana -> 17
missing -> None


[(7, ('grape', 73)), (10, ('apple', 99)), (14, ('banana', 17))]
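
The three demo keys happen to land in distinct slots, so probing never kicks in. The standalone sketch below forces the issue with keys chosen because they all hash to slot 5 in a 16-slot table. `toy_hash` and `probe_insert` are hypothetical helper names for this illustration, not part of the `HashTable` class above.

```python
def toy_hash(key: str, size: int) -> int:
    """Position-weighted sum from section 1.2, reduced modulo the table size."""
    return sum(i * ord(ch) for i, ch in enumerate(key, start=1)) % size

def probe_insert(slots: list, key: str, value) -> int:
    """Insert with linear probing and return the slot index finally used."""
    size = len(slots)
    h = toy_hash(key, size)
    for _ in range(size):
        if slots[h] is None or slots[h][0] == key:
            slots[h] = (key, value)
            return h
        h = (h + 1) % size  # slot taken by another key: try the next one
    raise RuntimeError("table is full")

slots = [None] * 16
for k in ("ab", "ca", "qb"):
    home = toy_hash(k, 16)
    print(k, "home:", home, "stored at:", probe_insert(slots, k, 0))
# ab home: 5 stored at: 5
# ca home: 5 stored at: 6
# qb home: 5 stored at: 7
```

The run of occupied slots 5, 6, 7 is a small cluster: any later key that hashes to 5, 6, or 7 has to walk past it before finding a free slot.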


## 4) Activities — Practice (No Solutions Included)

3. Explain how the modulo operation maps any integer to a valid index for a given table size.



### B. Strengthening the Hash Function
Implement a polynomial rolling hash variant:
h = 0; p = 131 (experiment with other bases)
for each ch in s: h = h * p + ord(ch)
Then map to indices with % size and compare collision behavior against myhash on random strings.


In [None]:

# TODO: implement poly_hash and test collision rate vs myhash
# import random, string
# def poly_hash(s, base=131):
#     ...

# def random_string(n):
#     return ''.join(random.choice(string.ascii_letters + ' ') for _ in range(n))

# Compare on a sample:
# ...
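
For the comparison step, one possible way to measure collisions is sketched below. It is shown on the two toy hashes from section 1 only, so `poly_hash` is still yours to write and plug in. The helper name `count_collisions`, the fixed seed, the string length of 8, and the sample of 200 strings are all assumptions for this illustration.

```python
import random
import string
from collections import Counter

def naive_hash_sum(s: str) -> int:
    return sum(map(ord, s))

def myhash(s: str) -> int:
    # Same position-weighted sum as section 1.2, written compactly.
    return sum(i * ord(ch) for i, ch in enumerate(s, start=1))

def count_collisions(hash_fn, keys, size: int) -> int:
    """Count keys that land on an index already claimed by another key."""
    per_slot = Counter(hash_fn(k) % size for k in keys)
    return sum(n - 1 for n in per_slot.values())

rng = random.Random(0)  # fixed seed so reruns use the same sample
alphabet = string.ascii_letters + " "
# A set removes accidental duplicate strings from the sample.
keys = {"".join(rng.choice(alphabet) for _ in range(8)) for _ in range(200)}

for fn in (naive_hash_sum, myhash):
    print(fn.__name__, count_collisions(fn, keys, size=256))
```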



## 5) Alternative Strategy: Separate Chaining
Instead of probing for another open slot, store a bucket (often a list) at each index. On a collision, append the new pair to that bucket.
Pros: simple growth model; avoids primary clustering. Cons: a bucket that collects many keys turns lookups into a linear scan unless it is given further structure.


In [8]:

class ChainedHashTable:
    def __init__(self, size: int = 256):
        self.size = size
        self.buckets = [[] for _ in range(self.size)]  # each bucket is a list of (key,value)
        self.count = 0

    def _hash(self, key: str) -> int:
        mult = 1
        hv = 0
        for ch in key:
            hv += mult * ord(ch)
            mult += 1
        return hv % self.size

    def put(self, key, value):
        h = self._hash(key)
        bucket = self.buckets[h]
        for i, (k, v) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.count += 1

    def get(self, key):
        h = self._hash(key)
        bucket = self.buckets[h]
        for k, v in bucket:
            if k == key:
                return v
        return None

    def __setitem__(self, key, value):
        self.put(key, value)

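    def __getitem__(self, key):
        # Scan the bucket directly instead of calling get(), so a stored
        # value of None is not mistaken for a missing key.
        for k, v in self.buckets[self._hash(key)]:
            if k == key:
                return v
        raise KeyError(key)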

# Quick demo
cht = ChainedHashTable(size=8)
cht['alpha'] = 1
cht['beta'] = 2
cht['alpha'] = 3
print("alpha ->", cht.get('alpha'))
print("beta ->", cht['beta'])
# Inspect buckets
cht.buckets


alpha -> 3
beta -> 2


[[], [], [], [], [('beta', 2)], [], [('alpha', 3)], []]
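
In the demo above each key gets its own bucket, so no chain ever forms. The standalone sketch below uses keys picked because 'ab' and 'ca' both hash to bucket 5 in an 8-bucket table. `toy_hash` is a hypothetical helper name for this illustration, not part of `ChainedHashTable`.

```python
def toy_hash(key: str, size: int) -> int:
    """Position-weighted sum from section 1.2, reduced modulo the table size."""
    return sum(i * ord(ch) for i, ch in enumerate(key, start=1)) % size

buckets = [[] for _ in range(8)]
for key, value in (("ab", 1), ("ca", 2), ("beta", 3)):
    # Colliding keys simply share a bucket; nothing spills into neighbors.
    buckets[toy_hash(key, 8)].append((key, value))

print(buckets[4])  # [('beta', 3)]
print(buckets[5])  # [('ab', 1), ('ca', 2)]

# The longest chain bounds the work for a single lookup.
print(max(len(b) for b in buckets))  # 2
```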


### E. Reflection: Chaining Design Choices
3. What are the pros/cons of chaining vs open addressing for your use case?
