# Data Structures and Their Access Characteristics

### © Luca de Alfaro, 2019, [CC-BY_NC License](http://creativecommons.org/licenses/by-nc-nd/4.0/).


Prepared on: Mon Sep 20 20:42:28 2021

This is a book chapter; it is not a homework assignment.  
Do not submit it as a solution to a homework assignment; you would receive no credit.


One of the things that makes Python so appealing for experimenting with algorithms and data science is that it comes with very flexible, well-implemented data structures that one can then use to build the solution to one's problem. 
We will review here briefly the main data structures we will be using in the course, and we will recall their approximate access characteristics. 
The goal is not to do a precise analysis of the running time of algorithms, but rather, to guide the choice of the data structures to be used in solving a problem.
Information about the run-time of Python methods can be found on the [Python wiki](https://wiki.python.org/moin/TimeComplexity).


## Lists

Lists in Python are incredibly flexible: they are at once lists (sequences of elements), arrays (lists where you can access an element at any position), stacks, and much more. 

A list can be created by listing its elements:

In [None]:
my_list = [1, 2, 3, 5]


The empty list is denoted by `[]`.  Lists are handled by reference: assigning a list to a second variable does not copy it, so both names refer to the same list:

In [None]:
l1 = [1, 2, 3, 5]
l2 = l1
l1.append(7)
print(l2)


[1, 2, 3, 5, 7]
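The same sharing happens when a list is passed to a function. The sketch below uses a hypothetical helper, `append_seven`, which is not part of the original text; it illustrates that the function receives a reference to the caller's list rather than a copy:

```python
def append_seven(lst):
    # Hypothetical helper: mutates the list it receives, in place.
    lst.append(7)

l1 = [1, 2, 3, 5]
append_seven(l1)
print(l1)  # [1, 2, 3, 5, 7]
```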


There are several ways of making (shallow) copies of lists, so that a modification to one does not modify the others:

In [None]:
l1 = [1, 2, 3, 5]
l2 = l1[:]
l3 = l1.copy()
l4 = list(l1)

# We modify l1, the other lists don't change.
l1.append(7)
print(l1, l2, l3, l4)


[1, 2, 3, 5, 7] [1, 2, 3, 5] [1, 2, 3, 5] [1, 2, 3, 5]
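The word "shallow" matters when the list contains mutable elements, such as other lists. As a small illustration (the variable names here are made up for the example), a shallow copy creates a new outer list but shares the inner lists, while `copy.deepcopy` from the standard library copies the inner lists as well:

```python
import copy

nested = [[1, 2], [3, 4]]
shallow = nested.copy()        # new outer list, same inner lists
deep = copy.deepcopy(nested)   # new outer list and new inner lists

nested[0].append(99)           # mutate one of the inner lists

print(shallow[0])  # [1, 2, 99]: the inner list is shared
print(deep[0])     # [1, 2]: the deep copy is unaffected
```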


### List operations

The list operations are described in the [Python standard library](https://docs.python.org/3/library/stdtypes.html#sequence-types-list-tuple-range), as well as in the [tutorial](https://docs.python.org/3/tutorial/datastructures.html#more-on-lists). 
Their time complexity is as follows: 

**Accessing an element** as in `my_list[4]` takes constant time: it is _not_ necessary to traverse the list from the beginning; this is true for both reading and writing (modifying) the element.  Intuitively, the elements of a Python list (or better, the pointers to the elements) are stored as an array, so if you need to access the $k$-th element, you can compute precisely which location of memory to read: the location $k$ cells after the beginning of the list. There is no need to traverse the memory locations $1, 2, \ldots, k-1$ to get to it. 

**Deleting an element given the index** takes time proportional to the number of elements following the deleted one, as these elements need to be rearranged. 

**Concatenation** of two lists has cost proportional to the length of the resulting list.

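**Extension** of a list has cost proportional to the length of the list being added to the existing one. This holds most of the time; occasionally, the underlying array runs out of spare room and must be reallocated, which takes time proportional to the length of the whole list. 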

**Getting the length** has constant cost, as the length is cached. 
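
As a quick reminder of what these operations look like in code, here is a small sketch; the values are arbitrary, and the comments restate the costs listed above:

```python
my_list = [10, 20, 30, 40, 50]

# Accessing (reading or writing) by index: constant time.
third = my_list[2]
my_list[2] = 35

# Deleting by index: the elements after index 1 are shifted left.
del my_list[1]

# Concatenation builds a new list; cost grows with the length of the result.
combined = my_list + [60, 70]

# Extension modifies the list in place; cost grows with the number of added elements.
my_list.extend([80, 90])

# Getting the length: constant time.
print(third, my_list, combined, len(my_list))
# 30 [10, 35, 40, 50, 80, 90] [10, 35, 40, 50, 60, 70] 6
```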

## Dictionaries

A dictionary (dict) is a mapping from keys to values.  Internally, a dictionary is implemented via a hash table.  This means, intuitively, that when accessing the element `d['hello']` for a dictionary `d`, the following happens:

**1. Compute the hash.** We compute the [_hash_](https://docs.python.org/3.4/library/functions.html?highlight=file#hash) of the key, that is, we somehow compute a number out of the key, in this case the string `hello`.  Yes, this takes time proportional to the length of the key, but in practice, it's pretty fast, and if the key is too long, you can consider only some initial part of it.  For a string, we could simply add up the numbers corresponding to the characters (well, a better method is used in the actual implementation). Python's built-in `hash` function shows the number actually generated from `'hello'`:

In [None]:
hash('hello')


-5764133991990426218

This dispels any belief that the hash function is computed by adding up the characters!  (The exact value changes from one Python session to the next, since string hashing is randomized by default.)  In fact, computing the hash is an art unto itself; some discussion can be [found in the git repository for Python](https://github.com/python/cpython/blob/bfe4fd5f2e96e72eecb5b8a0c7df0ac1689f3b7e/Python/pyhash.c).  But we don't need to worry about those things here.

**2. Look in the place given by the hash.** The dictionary internally has a fixed-length list of "cells"; each cell can contain one or more keys.  Suppose our dictionary has 100 cells; then for the key "hello" we look at position:

In [None]:
hash('hello') % 100


82

If we find the key `'hello'` in cell 82, we are all set.  If we don't find it there, it means the key `'hello'` is not in the dictionary.  The number of cells in the dictionary is tuned so that only a few keys are in each cell. 
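
The two steps above can be sketched as a toy hash table. The class `ToyDict` below is hypothetical and written only for illustration: it follows the simplified picture given here (a fixed number of cells, each holding a few keys), whereas Python's real dictionary differs in its details and is far more optimized.

```python
class ToyDict:
    """Hypothetical, simplified hash table with a fixed number of cells."""

    def __init__(self, num_cells=100):
        # Each cell holds a list of (key, value) pairs.
        self.cells = [[] for _ in range(num_cells)]

    def _cell(self, key):
        # Steps 1 and 2: compute the hash, then pick the cell.
        return self.cells[hash(key) % len(self.cells)]

    def __setitem__(self, key, value):
        cell = self._cell(key)
        for i, (k, _) in enumerate(cell):
            if k == key:
                cell[i] = (key, value)  # key already present: update it
                return
        cell.append((key, value))       # new key: add it to the cell

    def __getitem__(self, key):
        # Only the keys in one cell are examined, not the whole table.
        for k, v in self._cell(key):
            if k == key:
                return v
        raise KeyError(key)


d = ToyDict()
d['hello'] = 1
d['world'] = 2
d['hello'] = 3
print(d['hello'], d['world'])  # 3 2
```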

### Time complexity of dictionary operations

Since the hash of a key can be computed more or less in constant time (and in any case, really fast), and since each dictionary cell (on average) contains only a few keys, the upshot is that accessing one element of a dictionary via its key takes constant time on average, and so do insertions, updates, and deletions. 
This is not a guarantee, however: occasionally the dictionary (the space for the keys) needs to be resized, which takes time proportional to the size of the dictionary. 
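
The occasional resizing can be observed directly. The sketch below is only an illustration and assumes CPython, where `sys.getsizeof` reports the memory used by the dictionary's own table; the exact sizes are implementation details and may differ between versions.

```python
import sys

d = {}
sizes = []
for i in range(1000):
    d[i] = None
    sizes.append(sys.getsizeof(d))  # size of the dict's own table, in bytes

# Count how many times the reported size changed between consecutive insertions.
num_resizes = sum(1 for prev, cur in zip(sizes, sizes[1:]) if cur != prev)
print("insertions:", len(sizes), "size changes:", num_resizes)
```

The reported size changes only a small number of times over the 1000 insertions: most insertions leave the table untouched, and a few trigger a resize.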

## ... and the rest

Lists and dictionaries are the main underlying data structures of Python.  In lists, you can compute how to access an element in position $k$.  In dictionaries, you use a hash function to figure out fast where to look for the key.

Yes, there are [deques](https://docs.python.org/3.7/library/collections.html#collections.deque) and a few other specialized data structures, but by and large, you can understand the rest in terms of lists and dictionaries.  For instance, a set is just a dictionary whose keys all map to the same value (say, None).  Thus you can check set membership in constant time, and you can take the union of two sets just by adding to the first set all the keys of the second. 
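
To make the analogy concrete, here is a small sketch that emulates sets with dictionaries whose values are all `None`, next to the same operations on Python's built-in `set` type. The variable names are made up for the example.

```python
# Emulating sets with dictionaries whose values are all None.
s1 = {'a': None, 'b': None}
s2 = {'b': None, 'c': None}

print('a' in s1)   # True: membership is a key lookup
s1.update(s2)      # union: add all the keys of s2 to s1
print(sorted(s1))  # ['a', 'b', 'c']

# The same operations with the built-in set type.
t1, t2 = {'a', 'b'}, {'b', 'c'}
t1 |= t2           # in-place union
print(sorted(t1))  # ['a', 'b', 'c']
```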

## Numpy

And then, there's [numpy](https://numpy.org/), the amazingly fast, useful, and sophisticated numerical library for Python.  Numpy has been written on the basis of an enormous amount of accumulated knowledge and experience about numerical computations, and it truly represents a peak among computer science accomplishments. 

Numpy enables, among many other things, very fast operations on arrays and $n$-dimensional matrices. 



Here are some examples of things you can do in numpy.
You can convert a list to a numpy array: 

In [None]:
import numpy as np

a = np.array([3, 4, 5, 3, 2])
a


array([3, 4, 5, 3, 2])

You can generate a random numpy array of a given length:

In [None]:
b = np.random.random((5,))
b


array([0.66394404, 0.56719324, 0.77436979, 0.46613972, 0.33223195])

You can add (or multiply, subtract, etc) two arrays in a single blow:

In [None]:
a + b


array([3.66394404, 4.56719324, 5.77436979, 3.46613972, 2.33223195])

You can compare all elements to a constant: 

In [None]:
b > 0.5


array([ True,  True,  True, False, False])

You can also compare one array to another:

In [None]:
a > b


array([ True,  True,  True,  True,  True])

You can compute the sum (or maximum, ...) of an array:

In [None]:
np.sum(a)


17

Numpy is so fast, compared to Python, that I have found it useful to think about the time complexity of Python programs that include numpy as follows: 

* As a first approximation, count every numpy operation as being constant time, regardless of matrix size.  Your goal should be to minimize the number of numpy operations you perform, and to _avoid at all costs_ iterating over elements of arrays (one-dimensional or multi-dimensional) directly in Python.  This often involves writing code in ways that are very different from what initially might feel natural; below we will provide some simple examples.

* Only after you have minimized the number of numpy operations, should you start to worry about the matrix sizes, and the time taken by numpy itself.

Let us first see for ourselves the difference in speed between numpy and Python.  We have two long arrays, and we need to compute their element-wise sum.

In [None]:
import time
import numpy as np

r = 1000

# First in Python
a = list(range(100000))
b = list(range(100000, 200000))
avg_p_time = 0.
for _ in range(r):
    t = time.time()
    # This is the iteration over elements you need so desperately to avoid.
    c = []
    for i in range(len(a)):
        c.append(a[i] + b[i])
    avg_p_time += time.time() - t
print("Python:", avg_p_time / r)

# Then in Numpy
aa = np.arange(100000)
bb = np.arange(100000, 200000)
avg_n_time = 0.
for _ in range(r):
    t = time.time()
    cc = aa + bb
    avg_n_time += time.time() - t
print("Numpy:", avg_n_time / r)
print("Ratio:", avg_p_time / avg_n_time)


Python: 0.027707100629806518
Numpy: 0.0001637082099914551
Ratio: 169.24686080956306


About 170 times faster in this run.  The difference in speed justifies trying to solve problems relying as much as possible on numpy's ability to operate on a whole array at once, avoiding iterating over elements.  

As a simple example, consider the problem of counting how many elements in a random array are above a threshold.  The natural approach consists in keeping a counter and iterating over the array, incrementing the counter whenever an element is over the threshold.  However, in numpy, another strategy is actually much better:

* First, we compare (at once!) the whole array with the threshold, obtaining a boolean array of True and False values of the same length as the original array. 
* Then, we add (at once!) all the elements of the boolean array, obtaining the count. 

Let's see how the approaches compare in practice. 

In [None]:
# This is our large random array.
aa = np.random.random(100000)
a = list(aa) # Python list
r = 1000

# First in Python
avg_p_time = 0.
for _ in range(r):
    t = time.time()
    c = 0
    for x in a:
        if x > 0.9:
            c += 1
    avg_p_time += time.time() - t
print("Python:", avg_p_time / r)

# Then in numpy, still iterating.
avg_n_time = 0.
for _ in range(r):
    t = time.time()
    c = 0
    for x in aa:
        if x > 0.9:
            c += 1
    avg_n_time += time.time() - t
print("Numpy, by iterating:", avg_n_time / r)

# Finally in numpy, first comparing, then adding.
avg_nn_time = 0.
for _ in range(r):
    t = time.time()
    c = np.sum(aa > 0.9)
    avg_nn_time += time.time() - t
print("Numpy, array ops:", avg_nn_time / r)


Python: 0.013422878742218017
Numpy, by iterating: 0.022822985887527465
Numpy, array ops: 0.00015058255195617677


A speed difference of a factor of 100 is the difference between jogging (12 km/h) and the speed of sound, or between a [garden snail](https://hypertextbook.com/facts/1999/AngieYee.shtml) and walking (4.6 km/h).

In [None]:
hash("dog") % 4, hash('cat') % 4, hash('bird') % 4, hash('cow') % 4


(0, 0, 2, 0)