Chapter 15 Mappings and Hash Tables<br>

A mapping is an association between two sets of things. It associates a
value with a key. We refer to these associated pairs as key-value pairs. Keys
must be unique, so that there can be only one value associated with a given
key.<br>

The standard built-in data type in Python for mappings is the dictionary
(dict). This kind of mapping is used by Python itself to associate names of
variables (strings) with objects. In the notation for dictionaries, we would
write `d[some_key] = some_value`. This either creates a new key-value pair
if `some_key` was not already in the dictionary, or it overwrites the existing
pair with key `some_key`.<br>
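
As a small illustration of the two cases just described, the following sketch assigns to a new key and then to an existing one. The keys and values are made up for the example.

```python
d = {}
d["apple"] = 1     # "apple" is not present yet, so a new pair is created
d["banana"] = 2    # likewise, a second pair is created
d["apple"] = 10    # "apple" is already present, so its value is overwritten

# There are still only two pairs, because keys are unique.
assert len(d) == 2
assert d["apple"] == 10
assert d["banana"] == 2
```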

We're going to pretend for a short time that we don't have a Python
dictionary available to us and go through the process of implementing one
ourselves.<br>

Why does accessing or assigning a value in a dictionary take only
constant time?

15.1 The Mapping ADT<br>

A mapping is a collection of key-value pairs such that the keys are unique. It supports the following operations.

- get(k) - return the value associated with the key k. Usually an error (KeyError) is raised if the given key is not present.
- put(k, v) - add the key-value pair (k, v) to the mapping, replacing any existing pair with key k.

15.2 A minimal implementation

In [3]:
class Entry:
    def __init__(self, key, value):
        self.key = key
        self.value = value
    
    def __str__(self):
        return str(self.key) + " : " + str(self.value)
    
def mapput(L, key, value):
    for e in L:
        if e.key == key:
            e.value = value
            return
    L.append(Entry(key, value))

def mapget(L, key):
    for e in L:
        if e.key == key:
            return e.value
    raise KeyError
    
m = []
mapput(m, 4, "five")
mapput(m, 1, "one")
mapput(m, 13, "thirteen")
mapput(m, 4, "four")
assert mapget(m, 1) == "one"
assert mapget(m, 4) == "four"
for i in m:
    print(i)

4 : four
1 : one
13 : thirteen


We can encapsulate the underlying list in a class, so that it is only modified through the ***get*** and ***put*** methods. This avoids accidentally messing it up.

In [4]:
from ds2.mapping import Entry
class ListMappingSimple:
    def __init__(self):
        self._entries = []

    def put(self, key, value):
        for e in self._entries:
            if e.key == key:
                e.value = value
                return
        self._entries.append(Entry(key, value))

    def get(self, key):
        for e in self._entries:
            if e.key == key:
                return e.value
        raise KeyError

15.3 The extended Mapping ADT<br>

The standard behavior for iterators in dictionaries is to iterate over the
keys. Alternative iterators are provided to iterate over the values or to
iterate over the key-value pairs as tuples. For a dict object this is done as
follows.

In [5]:
d = {"key1": "value1", "key2": "value2"}

for k in d:
    print(k)

for v in d.values():
    print(v)

for k, v in d.items():
    print(k, v)

key1
key2
value1
value2
key1 value1
key2 value2


We’ll add the same kind of functionality to our Mapping ADT. The extended Mapping ADT includes the following methods.<br>

- `__getitem__(k)` - return the value associated with the key k. Usually an error (KeyError) is raised if the given key is not present.
- `__setitem__(k, v)` - add the key-value pair (k, v) to the mapping.
- `remove(k)` - remove the entry with key k if it exists.
- `__len__` - return the number of keys in the dictionary.
- `__contains__(k)` - return True if the mapping contains a pair with key k.
- `__iter__` - return an iterator over the keys in the dictionary.
- `values` - return an iterator over the values in the dictionary.
- `items` - return an iterator over the key-value pairs (as tuples).
- `__str__` - return a string representation of the mapping.

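The dict class is a non-sequential collection: its items are accessed by key rather than by position (although, since Python 3.7, iteration over a dict follows insertion order).<br>
Our first implementation will also have the items in a fixed order, because we are using a list to store them.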

In [7]:
from ds2.mapping import Entry

class ListMapping:
    def __init__(self):
        self._entries = []
    
    def put(self, key, value):
        e = self._entry(key)
        if e is not None:
            e.value = value
        else:
            self._entries.append(Entry(key, value))
    
    def get(self, key):
        e = self._entry(key)
        if e is not None:
            return e.value
        else:
            raise KeyError
    
    def remove(self, key):
        e = self._entry(key)
        if e is not None:
            self._entries.remove(e)
    
    def _entry(self, key):
        for e in self._entries:
            if e.key == key:
                return e
        return None

    def __str__(self):
        return "{" + ", ".join(str(e) for e in self._entries) + "}"
    
    def __len__(self):
        return len(self._entries)

    def __contains__(self, key):
        if self._entry(key) is None:
            return False
        else:
            return True

    def __iter__(self):
        return (e.key for e in self._entries)

    def values(self):
        return (e.value for e in self._entries)

    def items(self):
        return ((e.key, e.value) for e in self._entries)

    __getitem__ = get
    __setitem__ = put
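
The last two lines of the class bind the existing get and put functions to the names that Python looks up for square-bracket access. Here is a minimal sketch of that aliasing, using a hypothetical toy class called Squares that is not part of the chapter's code. It also shows one caveat: the alias keeps pointing at the original function even if a subclass overrides get.

```python
class Squares:
    """Hypothetical toy class, used only to illustrate method aliasing."""

    def get(self, key):
        return key * key

    # Bind the same function object under a second name, so that
    # obj[key] dispatches to the function defined as get above.
    __getitem__ = get


class Doubled(Squares):
    """Hypothetical subclass that overrides get but not __getitem__."""

    def get(self, key):
        return 2 * key


s = Squares()
assert s.get(3) == 9
assert s[3] == 9                              # square brackets call the aliased function
assert Squares.__getitem__ is Squares.get     # one function, two names

d = Doubled()
assert d.get(3) == 6                          # the override is used here
assert d[3] == 9                              # the alias still refers to Squares.get
```

This caveat is one reason a superclass might instead define `__getitem__` as a small method that calls self.get(key), so that subclass overrides are picked up.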

15.4 It’s Too Slow!<br>

Our goal is to get the same kind of constant-time operations as in the dict class.<br>

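The ListMapping is slow because get and put may have to search the whole
list, which takes time proportional to the number of entries. The idea is to
spread the entries over many small lists, called buckets, and to compute
directly from the key which bucket to look in. For this we want an integer,
i.e. the index into our list of buckets. A hash function takes a key and
returns an integer. Most classes in Python implement a method called
`__hash__` that does just this, and the built-in hash function calls it. We
can use it to implement a simple mapping scheme that improves on the
ListMapping.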
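
To make the step from key to bucket index concrete, here is a minimal sketch. The helper name bucket_index is made up for this example, and the comment about integer hashes describes CPython's behavior.

```python
size = 100  # number of buckets, chosen to match the example that follows


def bucket_index(key, size):
    """Map a hashable key to an index in range(size)."""
    return hash(key) % size


# In CPython, small non-negative integers hash to themselves.
assert hash(42) == 42
assert bucket_index(42, size) == 42

# Whatever the key type, the index lands in range(size), because Python's %
# returns a non-negative result when the modulus is positive.
for key in ["key1", (1, 2), -7, 3.5]:
    assert 0 <= bucket_index(key, size) < size
```

The hash of a string can differ from one run of Python to the next, so the sketch checks only the range of the index for those keys, not its exact value.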

In [9]:
from ds2.mapping import ListMapping

class HashMappingSimple:
    def __init__(self):
        self._size = 100
        self._buckets = [ListMapping() for i in range(self._size)]

    def put(self, key, value):
        m = self._bucket(key)
        m[key] = value
    
    def get(self, key):
        m = self._bucket(key)
        return m[key]

    def _bucket(self, key):
        return self._buckets[hash(key) % self._size]

First, the initializer creates a list of 100 ListMapping objects. These are
called the buckets. If the keys get spread evenly between the buckets, then
each bucket holds about 1/100 of the entries, so searching one bucket is
about 100 times faster than searching a single list. If two keys are placed
in the same bucket, this is called a **collision**.<br>

The get and put methods call the `_bucket` method to get one of these
buckets for the given key and then just use that ListMapping’s
`__getitem__` and `__setitem__` methods (through the square-bracket
notation). So, the idea is just to have several list mappings instead of one,
and then you just need a quick way to decide which to use. The hash function
returns an integer based on the value of the given key. How often collisions
occur will depend on the hash function and on the keys.
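
The following sketch counts collisions for a handful of integer keys under two different hash functions. The helper bucket_counts and the alternative function hundreds are made up for illustration, and the comments about integer hashes assume CPython.

```python
from collections import Counter

size = 100
keys = [1, 101, 201, 7, 55]


def bucket_counts(keys, hash_function, size):
    """Count how many of the keys land in each bucket index."""
    return Counter(hash_function(k) % size for k in keys)


# With the built-in hash, small non-negative ints hash to themselves in
# CPython, so 1, 101 and 201 all land in bucket 1 and collide.
builtin = bucket_counts(keys, hash, size)
assert builtin[1] == 3
assert builtin[7] == 1 and builtin[55] == 1


# A hypothetical alternative hash function groups the same keys differently:
# now 1, 7 and 55 share bucket 0, while 101 and 201 get buckets of their own.
def hundreds(k):
    return k // 100


alt = bucket_counts(keys, hundreds, size)
assert alt[0] == 3
assert alt[1] == 1 and alt[2] == 1
```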

15.4.1 How many buckets should we use?<br>

The number 100 is pretty arbitrary. If there are very many entries, then
one might get a 100-fold speedup over ListMapping, but not more, because
each bucket still grows in proportion to the number of entries.<br>
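
A quick way to see the limit of a fixed number of buckets is to count how large the buckets get as the number of entries grows. The helper max_bucket_size below is made up for this sketch, which uses the integer keys 0, 1, ..., n-1 so that the spread is perfectly even.

```python
def max_bucket_size(n, size=100):
    """Largest bucket when the integer keys 0..n-1 are spread over `size` buckets."""
    counts = [0] * size
    for key in range(n):
        counts[hash(key) % size] += 1
    return max(counts)


# Even in this best case, each bucket holds n / 100 entries, so the work per
# lookup still grows linearly with n.
assert max_bucket_size(1_000) == 10
assert max_bucket_size(10_000) == 100
assert max_bucket_size(100_000) == 1_000
```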

It makes sense to use more buckets as the number of entries increases.
To do this, we will keep track of the number of entries in the map. This
will allow us to implement `__len__` and also to grow the number of buckets
as needed: whenever the number of entries exceeds the number of buckets, we
double the number of buckets. Here is the code.

In [10]:
from ds2.mapping import Entry, ListMapping

class HashMapping:
    def __init__(self, size = 2):
        self._size = size
        self._buckets = [ListMapping() for i in range(self._size)]
        self._length = 0
    
    def put(self, key, value):
        m = self._bucket(key)
        if key not in m:
            self._length += 1
        m[key] = value
        if self._length > self._size:
            self._double()
        
    def get(self, key):
        m = self._bucket(key)
        return m[key]

    def remove(self, key):
        m = self._bucket(key)
        # Only adjust the entry count if the key is actually present.
        if key in m:
            m.remove(key)
            self._length -= 1
    
    def __contains__(self, key):
        m = self._bucket(key)
        return key in m

    def _bucket(self, key):
        return self._buckets[hash(key) % self._size]

    def _double(self):
        oldbuckets = self._buckets
        self._size *= 2
        self._buckets = [ListMapping() for i in range(self._size)]
        for bucket in oldbuckets:
            for key, value in bucket.items():
                m = self._bucket(key)
                m[key] = value
    
    def __len__(self):
        return self._length
    
    def __iter__(self):
        for b in self._buckets:
            for k in b:
                yield k

    def values(self):
        for b in self._buckets:
            for v in b.values():
                yield v
    
    def items(self):
        for b in self._buckets:
            for k, v in b.items():
                yield k, v
    
    def __str__(self):
        itemlist = [str(e) for b in self._buckets for e in b._entries]
        return "{" + ", ".join(itemlist) + "}"
    
    __getitem__ = get
    __setitem__ = put

15.4.2 Rehashing<br>

The most interesting part of the code above is the `_double` method. This
is a method that increases the number of buckets. It’s not enough to just
append more buckets to the list, because the `_bucket` method that we use to
find the right bucket depends on the number of buckets. When that number
changes, we have to reinsert all the items in the mapping so that they can
be found when we next get them.
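
The sketch below illustrates why appending buckets is not enough. It uses plain lists of keys as buckets and a made-up helper called index, rather than the HashMapping class itself, and it assumes CPython's behavior for small integer hashes.

```python
def index(key, size):
    """Bucket index for a key when there are `size` buckets."""
    return hash(key) % size


keys = [5, 12, 21]

# Insert the keys into 4 buckets: 5 -> bucket 1, 12 -> bucket 0, 21 -> bucket 1.
buckets = [[] for _ in range(4)]
for k in keys:
    buckets[index(k, 4)].append(k)
assert buckets == [[12], [5, 21], [], []]

# Naive growth: append 4 empty buckets without moving anything.
buckets.extend([] for _ in range(4))
# A lookup for 5 now checks bucket 5 % 8 == 5, but 5 still sits in bucket 1.
assert index(5, 8) == 5
assert 5 not in buckets[index(5, 8)]

# Rehashing: reinsert every key using the new number of buckets.
rehashed = [[] for _ in range(8)]
for bucket in buckets:
    for k in bucket:
        rehashed[index(k, 8)].append(k)
assert all(k in rehashed[index(k, 8)] for k in keys)
```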

15.5 Factoring Out A Superclass<br>

We have given two different implementations of the same ADT. There are
several methods that we implemented in the ListMapping that we will also
want in the HashMapping. It makes sense to avoid duplicating common parts
of these two (concrete) data structures. Inheritance provides a nice way to
do this.<br>

This is the most common way that inheritance appears in code. Two
classes want to share some code, so we factor out a superclass that both
can inherit from and share the underlying code.<br>

There are some methods that we expect to be implemented by the subclass.
We can enforce this by putting the methods in the superclass, but
raising an error if they are called. This way, the error will only be raised if
the subclass does not override those methods.
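
Before looking at the Mapping class itself, here is the same pattern in miniature. The classes Shape, Square, and Blob are hypothetical and exist only for this sketch.

```python
class Shape:
    """Hypothetical superclass, used only to illustrate the pattern."""

    # Subclasses are expected to override this.
    def area(self):
        raise NotImplementedError

    # Shared code that relies on the subclass's version of area().
    def describe(self):
        return f"{type(self).__name__} with area {self.area()}"


class Square(Shape):
    def __init__(self, side):
        self.side = side

    def area(self):
        return self.side ** 2


class Blob(Shape):
    pass  # does not override area()


# The subclass that overrides area() can use the shared code.
assert Square(3).describe() == "Square with area 9"

# The subclass that does not override it gets the error from the superclass.
try:
    Blob().describe()
    raised = False
except NotImplementedError:
    raised = True
assert raised
```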

In [11]:
class Mapping:
    # Child class needs to implement this!
    def get(self, key):
        raise NotImplementedError
    
    # Child class needs to implement this!
    def put(self, key, value):
        raise NotImplementedError
    
    # Child class needs to implement this!
    def __len__(self):
        raise NotImplementedError
    
    # Child class needs to implement this!
    def _entryiter(self):
        raise NotImplementedError
    
    def __iter__(self):
        return (e.key for e in self._entryiter())
    
    def values(self):
        return (e.value for e in self._entryiter())

    def items(self):
        return ((e.key, e.value) for e in self._entryiter())

    def __contains__(self, key):
        try:
            self.get(key)
        except KeyError:
            return False
        return True
    
    def __getitem__(self, key):
        return self.get(key)
    
    def __setitem__(self, key, value):
        self.put(key, value)
    
    def __str__(self):
        return "{" + ", ".join(str(e) for e in self._entryiter()) + "}"

There are four methods that a subclass has to implement: get, put, `__len__`, and a method called `_entryiter` that iterates through the entries.<br>
This last method is private
because the user of this class does not need to access Entry objects. They
have the Mapping ADT methods to provide access to the data. This is also why
the Entry class is treated as an internal detail rather than as part of the
public interface.
In [12]:
from ds2.mapping import Mapping, Entry

class ListMapping(Mapping):
    def __init__(self):
        self._entries = []
    
    def put(self, key, value):
        e = self._entry(key)
        if e is not None:
            e.value = value
        else:
            self._entries.append(Entry(key, value))
    
    def get(self, key):
        e = self._entry(key)
        if e is not None:
            return e.value
        else:
            raise KeyError
    
    def _entry(self, key):
        for e in self._entries:
            if e.key == key:
                return e
        return None
    
    def _entryiter(self):
        return iter(self._entries)
    
    def __len__(self):
        return len(self._entries)

The HashMapping class can also be rewritten as follows.

In [13]:
from ds2.mapping import Mapping, ListMapping

class HashMapping(Mapping):
    def __init__(self, size = 100):
        self._size = size
        self._buckets = [ListMapping() for i in range(self._size)]
        self._length = 0

    def _entryiter(self):
        return (e for bucket in self._buckets for e in bucket._entryiter())

    def get(self, key):
        bucket = self._bucket(key)
        return bucket[key]

    def put(self, key, value):
        bucket = self._bucket(key)
        if key not in bucket:
            self._length += 1
        bucket[key] = value
        # Check if we need more buckets.
        if self._length > self._size:
            self._double()
    
    def __len__(self):
        return self._length
    
    def _bucket(self, key):
        return self._buckets[hash(key) % self._size]
    
    def _double(self):
        # Save the old buckets
        oldbuckets = self._buckets
        # Reinitialize with more buckets.
        self.__init__(self._size * 2)
        for bucket in oldbuckets:
            for key, value in bucket.items():
                self[key] = value

Visualise hash maps [here](https://visualgo.net/en/hashtable)