# Hashing

*Hashing* is a technique that is frequently used in implementing efficient algorithms. In Python, the data structures `set` and `dict` (dictionary) are based on hashing.

In this chapter, we take a look at data structures based on hashing and their use in algorithm design. We will also cover some theory underlying the data structures.

## Set

The Python data structure `set`, based on hashing, maintains a set of elements. The operations on the data structure include:

- the method `add` adds an element to the set
- the operator `in` checks whether a given element is in the set
- the method `remove` removes an element from the set
  
The data structure is implemented so that each of the above operations takes $O(1)$ time on average.

### Example

The following code creates a set `numbers` and adds elements to the set:

In [1]:
numbers = set()

numbers.add(1)
numbers.add(2)
numbers.add(3)

print(numbers) # {1, 2, 3}

{1, 2, 3}


We can also create a set directly from a list:

In [2]:
numbers = set([1, 2, 3])

print(numbers) # {1, 2, 3}

{1, 2, 3}


The operator `in` tests if an element is in the set:

In [3]:
print(3 in numbers) # True
print(4 in numbers) # False

True
False


And we can remove an element from the set with the method `remove`:

In [4]:
print(numbers) # {1, 2, 3}
numbers.remove(2)
print(numbers) # {1, 3}

{1, 2, 3}
{1, 3}
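
One detail worth keeping in mind is that `remove` raises a `KeyError` if the element is not in the set. The following minimal sketch (the variable `removed` is only for illustration) shows this behavior, along with the built-in method `discard`, which removes an element if it is present and does nothing otherwise:

```python
numbers = set([1, 2, 3])

# remove raises KeyError if the element is not in the set
try:
    numbers.remove(4)
    removed = True
except KeyError:
    removed = False

print(removed)  # False

# discard ignores an element that is not in the set
numbers.discard(4)
numbers.discard(2)
print(numbers == set([1, 3]))  # True
```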


### List vs. set

A list and a set are similar data structures in that both maintain a collection of elements and support additions and removals. However, there are significant differences in their efficiency and other properties.

#### Efficiency

Adding an element to the end of a list is efficient, but finding or removing an element can be slow, because the list may need to be scanned element by element.

With a set, adding elements, finding elements and removing elements are all efficient operations.

| Operation               | List   | Set    |
| ----------------------- | ------ | ------ |
| Adding (`append`/`add`) | $O(1)$ | $O(1)$ |
| Finding (`in`)          | $O(n)$ | $O(1)$ |
| Removing (`remove`)     | $O(n)$ | $O(1)$ |
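
The difference shown in the table can also be observed in practice. The following sketch is a rough, machine-dependent measurement rather than a rigorous benchmark. The collection size and the helper function `count_hits` are choices made for this illustration only:

```python
import time

n = 10000
as_list = list(range(n))
as_set = set(as_list)

def count_hits(collection, queries):
    # Count how many of the queries are found in the collection
    hits = 0
    for q in queries:
        if q in collection:
            hits += 1
    return hits

# 2000 queries: the first half are in the collection, the rest are not
queries = range(n - 1000, n + 1000)

start = time.perf_counter()
list_hits = count_hits(as_list, queries)
list_time = time.perf_counter() - start

start = time.perf_counter()
set_hits = count_hits(as_set, queries)
set_time = time.perf_counter() - start

print(list_hits, set_hits)  # 1000 1000
print(f"list: {list_time:.4f} s, set: {set_time:.4f} s")
```

Both versions find the same elements, but the list version has to scan the list for each query, so its running time is expected to grow much faster as `n` increases.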

#### Indexing

In a list, elements can be accessed using an index:

In [5]:
numbers = [1, 2, 3]
print(numbers[1]) # 2

2


A set does not support indexing:

In [6]:
numbers = set([1, 2, 3])
print(numbers[1]) # TypeError: 'set' object is not subscriptable

TypeError: 'set' object is not subscriptable
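
If positional access is needed, one possible workaround is to copy the elements into a list first, for example with the built-in function `sorted`. This is only a sketch of one option: the copy takes $O(n \log n)$ time, so it is not a replacement for list indexing inside a loop.

```python
numbers = set([30, 10, 20])

# A set has no positions, but we can still iterate over its elements
total = 0
for x in numbers:
    total += x
print(total)  # 60

# sorted returns a new list, which supports indexing
ordered = sorted(numbers)
print(ordered[1])  # 20
```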

#### Repeated elements

In a list, an element can occur multiple times:

In [8]:
numbers = []

numbers.append(5)
numbers.append(5)
numbers.append(5)

print(numbers) # [5, 5, 5]

[5, 5, 5]


A set contains an element at most once. Adding an element that is already in the set has no effect:

In [7]:
numbers = set()

numbers.add(5)
numbers.add(5)
numbers.add(5)

print(numbers) # {5}

{5}


## Example: How many numbers?

**Task**

You are given a list of numbers. How many distinct numbers does it contain?

For example, when the list is $[3,1,2,1,5,2,2,3]$, the desired answer is $4$, because the distinct numbers are $1$, $2$, $3$ and $5$.

#### Slow solution (list)

We could solve the task using a list as follows:

In [9]:
def count_distinct(numbers):
    seen = []
    for x in numbers:
        if x not in seen:
            seen.append(x)
    return len(seen)

The algorithm goes through the numbers and adds a number to a list `seen` if it is not there already. At the end, the length of the list `seen` is the desired answer.

This algorithm is correct but not efficient, because every round of the loop calls the operator `in`, which can take $O(n)$ time. Thus the time complexity of the algorithm is $O(n^2)$. However, a simple improvement is to use a set instead of a list.

#### Efficient solution (set)

We can solve the task efficiently using a set as follows:

In [10]:
def count_distinct(numbers):
    seen = set()
    for x in numbers:
        if x not in seen:
            seen.add(x)
    return len(seen)

This function is almost identical to the preceding one; the only differences are that `seen` is defined as a set instead of a list and that the method `add` is used instead of `append`. This change has a big effect on the efficiency of the algorithm. After the change, the operator `in` takes only $O(1)$ time, and thus the time complexity of the algorithm is $O(n)$.

We can simplify the code further by using the fact that a set contains no duplicates, so the check for whether an element is already in the set can be removed:

In [11]:
def count_distinct(numbers):
    seen = set()
    for x in numbers:
        seen.add(x)
    return len(seen)

We can shorten the code further by creating the set directly from the list. Only one line is needed:

In [12]:
def count_distinct(numbers):
    return len(set(numbers))
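
As a quick check, the following sketch runs the one-line version on the example list from the task statement and compares the result with the list-based version. The name `count_distinct_slow` is used here only to keep the two versions apart:

```python
def count_distinct(numbers):
    # A set keeps each element at most once
    return len(set(numbers))

def count_distinct_slow(numbers):
    # The list-based version from the beginning of the section
    seen = []
    for x in numbers:
        if x not in seen:
            seen.append(x)
    return len(seen)

numbers = [3, 1, 2, 1, 5, 2, 2, 3]
print(count_distinct(numbers))       # 4
print(count_distinct_slow(numbers))  # 4
```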

## Dictionary

The Python data structure `dict` or dictionary is based on hashing and stores key-value pairs. The idea is that we can use the key to retrieve the associated value.

A dictionary can be seen as a generalization of a list: in a list, the keys are the indices $0, 1, \dots, n-1$, while in a dictionary, the keys can be arbitrary (hashable) objects.

Adding, accessing and removing data using a key takes $O(1)$ time.

### Example

The following code creates a dictionary `weights` where the keys are strings and the values are numbers.

In [17]:
weights = {}

weights["monkey"] = 100
weights["banana"] = 1
weights["harpsichord"] = 500

The same dictionary can also be created as follows:

In [18]:
weights = {"monkey": 100, "banana": 1, "harpsichord": 500}

The values in a dictionary can be used in the same way as the elements of a list:

In [19]:
print(weights["monkey"]) # 100
weights["monkey"] = 150
print(weights["monkey"]) # 150

100
150


The operator `in` checks if a given key is in the dictionary:

In [21]:
print("monkey" in weights) # True
print("pineapple" in weights) # False

True
False
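
The check with `in` matters because accessing a key that is not in the dictionary raises a `KeyError`. The following minimal sketch shows both the error and the check that avoids it. The key `"pineapple"` and the variable `found` are chosen only for this illustration:

```python
weights = {"monkey": 100, "banana": 1, "harpsichord": 500}

# Accessing a missing key with [] raises KeyError
try:
    print(weights["pineapple"])
    found = True
except KeyError:
    found = False

print(found)  # False

# Checking with in first avoids the error
if "pineapple" in weights:
    print(weights["pineapple"])
else:
    print("no weight known")  # this branch is executed
```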


The command `del` removes a key and the associated value from a dictionary:

In [23]:
print(weights) # {'monkey': 150, 'banana': 1, 'harpsichord': 500}
del weights["banana"]
print(weights) # {'monkey': 150, 'harpsichord': 500}

{'monkey': 150, 'banana': 1, 'harpsichord': 500}
{'monkey': 150, 'harpsichord': 500}
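
The keys in the examples above are strings, but other kinds of objects can be used as keys too. The following sketch uses tuples as keys. The dictionary `distances` and its contents are made up for this illustration. One caveat is that a key must be hashable: immutable objects such as numbers, strings and tuples work, while a list raises a `TypeError`.

```python
distances = {}

# Tuples of coordinates as keys
distances[(0, 0)] = 0
distances[(1, 2)] = 3

print(distances[(1, 2)])  # 3

# A list is mutable and therefore cannot be used as a key
try:
    distances[[1, 2]] = 3
    list_key_ok = True
except TypeError:
    list_key_ok = False

print(list_key_ok)  # False
```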


### Using a dictionary

We will next take a look at three common ways to use a dictionary in algorithm design.