### CS102/CS103

Prof. Götz Pfeiffer<br />
School of Mathematics, Statistics and Applied Mathematics<br />
NUI Galway

# Lecture 12: Collections

Data usually don't come on their own, but as collections
of data, in some form or another.  `python` deals with collections
of many kinds.  We have already seen **strings** and **lists**.  Both
are examples of **ordered collections**.  Items in an ordered
collection can be accessed through **indices**, their positions
in the collection.  A **dictionary** (aka hash or record), in contrast,
is an example of an **unordered collection**: its items are labelled (and accessed) by **keys**.  

We will discuss these types of collections, what they
have in common and what their differences are.

Before that, a brief diversion on modular arithmetic.

## Chinese Remainders

In algebra, the **Chinese Remainder Theorem** states that
a system of $k$ **simultaneous congruences**
$$
x \equiv a_i \pmod{m_i}, \quad i = 1, \dots, k
$$
has a **unique solution** modulo 
$$
m = \prod_{i=1}^k m_i
$$
provided that the numbers $m_i$ are **pairwise coprime**.

Moreover, that solution is determined by the (simple) formula
$$
x = \sum_{i=1}^k a_i \cdot m_i' \cdot n_i,
$$
where $m_i' = m/m_i$ and $n_i$ is the **inverse** of $m_i'$ modulo $m_i$.

(Then, modulo $m_i$, we have $m_i' n_i \equiv 1$
and $m_j' \equiv 0$ for $j \neq i$, whence 
$$
x \equiv a_1 \cdot 0 + \dots + a_{i-1} \cdot 0 + a_i \cdot 1 + a_{i+1} \cdot 0 + \dots + a_k \cdot 0 \equiv a_i,
$$
as desired).

Let's try and implement this as an algorithm in `python`,
using what we've learned and what we've made so far. From a few weeks back, we already have a function that computes **modular 
inverses** based on the extended Euclidean algorithm:

In [1]:
def egcd(a, b):
    "find integers  x, y  such that  gcd(a,b) = x*a + y*b"
    if b == 0:
        return (1, 0)
    x, y = egcd(b, a % b)
    x, y = y, x - (a // b) * y
    return (x, y)

def modular_inverse(a, m):
    "compute the modular inverse of a modulo m"
    x, y = egcd(m, a)
    d = x * m + y * a
    if d == 1:
        return y % m
    print("error: ", a, "has no inverse mod", m, ": d = ", d)

We know how to use the **accumulator pattern** to compute the 
product of a list of numbers:

In [2]:
def product_numbers(numbers):
    "compute the product of a list of numbers"
    total = 1
    for number in numbers:
        total *= number
    return total

We can now use these as **building blocks** for a short `python`
function `chinese_remainder()` that solves any system of simultaneous congruences,
assuming the moduli are pairwise coprime.
We pass the list of moduli as one argument, and the list of
residues as a second argument.  And accumulate ...

In [3]:
def chinese_remainder(moduli, residues):
    "solve x = residues[i] (mod moduli[i]) for pairwise coprime moduli"
    M = product_numbers(moduli)
    total = 0
    for i in range(len(moduli)):
        a = residues[i]
        m = moduli[i]
        m1 = M // m                 # m_i' = m / m_i
        n = modular_inverse(m1, m)  # n_i, the inverse of m_i' modulo m_i
        total += a * m1 * n
    return total % M

**Example.**
$$
x \equiv 3 \pmod{13}, \quad
x \equiv 6 \pmod{14}, \quad
x \equiv 9 \pmod{15}
$$

In [4]:
x = chinese_remainder([13, 14, 15], [3, 6, 9])
x

2694

In [5]:
x % 13, x % 14, x % 15

(3, 6, 9)
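
To see the formula at work on this example, the following sketch recomputes the three summands $a_i \cdot m_i' \cdot n_i$ one by one and then confirms by brute force that the solution is unique modulo $m$. It assumes Python 3.8 or later, where the built-in `pow(b, -1, m)` returns a modular inverse; it is used here in place of `modular_inverse()` only to keep the snippet self-contained.

```python
moduli = [13, 14, 15]
residues = [3, 6, 9]

M = 1
for m in moduli:
    M *= m                      # M = 13 * 14 * 15 = 2730

terms = []
for a, m in zip(residues, moduli):
    m1 = M // m                 # m_i' = M / m_i
    n = pow(m1, -1, m)          # inverse of m_i' modulo m_i (Python 3.8+)
    terms.append(a * m1 * n)

x = sum(terms) % M

# brute force: every y in range(M) that satisfies all three congruences
solutions = [y for y in range(M)
             if all(y % m == a for a, m in zip(residues, moduli))]

print(terms)       # [4410, 15210, 13104]
print(x)           # 2694
print(solutions)   # [2694]
```

The brute-force search is only feasible because $m = 2730$ is small; the formula needs just one modular inverse per congruence.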

## Tuples

Lists are **mutable**, strings are not: they are **immutable**.
```python
s = "a string"
s[1] = 'b'  # <-- will produce an error
```

Tuples are **immutable lists**.
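
A small sketch of what this means in practice. The tuple `point` is a made-up example; indexing and unpacking work as for lists, but item assignment is refused.

```python
point = (3, 4)          # a tuple literal uses round brackets
print(point[0])         # 3: indexing works as for lists
print(len(point))       # 2

x, y = point            # unpacking, as in  x, y = egcd(b, a % b)
print(x + y)            # 7

try:
    point[0] = 5        # item assignment is not allowed for tuples
except TypeError as error:
    print("error:", error)
```

Functions such as `egcd()` above return tuples, which is why their results can be unpacked into several variables at once.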

## Dictionaries

**Example.** Suppose you want to determine, from a list of names, which
name is the most popular. If  you were to do this by hand,
you would probably read through the list of names,
record each new name you encounter, and keep a counter or tally
against each name, which is incremented by $1$
each time that name occurs.

Wouldn't it be nice if there was some sort of built-in data
structure in `python` that would allow you to do exactly this:
maintain a list of names, each with an associated counter?!

### Dictionary Literals

A **dictionary** in `python` is a data structure for representing
a **mapping** of objects to values, in the form of **key-value** pairs.  Such a data structure is extremely useful and versatile.

One might, for example, use it to describe a book by its attributes:

In [6]:
book = { "author" : "John Zelle", "title" : "Python Programming",
       "ISBN" : "978-1-59028-241-0" }
book

{'ISBN': '978-1-59028-241-0',
 'author': 'John Zelle',
 'title': 'Python Programming'}

Here all keys and all values are strings, but they could be numbers
as well, or boolean values; values can even be lists.  One constraint is
that a key must not be used more than once,
so that **each key** has a **unique value** assigned to it.
Just like, in mathematics, a function $f$ assigns
to each $x$ a unique value $y$ (which is then called **the** $f(x)$).

In general, a dictionary literal is a sequence of comma-separated
key-value pairs inside a pair of **curly braces**.
Each key is separated from its value by a colon (`:`). 
Values can be any kind of `python` object.  Keys must be **hashable**:
numbers, strings and tuples of these qualify, but mutable objects such as lists do not.
Strings are a popular choice for the keys.
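
The following sketch probes the rules for keys. All names and values in it are made up for illustration. It shows that a repeated key in a literal keeps only the last value, that numbers and tuples can serve as keys, and that a list cannot, because dictionary keys have to be hashable.

```python
ages = {"anna": 21, "anna": 22}     # the same key twice: the last value wins
print(ages)                         # {'anna': 22}

mixed = {2: "two", (1, 2): "a pair", "key": [1, 2, 3]}
print(mixed[(1, 2)])                # a pair
print(mixed["key"])                 # [1, 2, 3]: a list is fine as a value

try:
    bad = {[1, 2]: "a list as key"} # lists are mutable, hence not hashable
except TypeError as error:
    print("error:", error)
```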

###  Dictionary Operations

The values in a dictionary can then be **accessed by key**
(as opposed to the positional access to list items).

In [7]:
book["author"]

'John Zelle'

An existing dictionary can be expanded by adding further
key-value pairs:

In [8]:
book["year"] = 2000
book

{'ISBN': '978-1-59028-241-0',
 'author': 'John Zelle',
 'title': 'Python Programming',
 'year': 2000}

As in the case of list items, values in a dictionary can be updated:

In [9]:
book["year"] = 2010
book

{'ISBN': '978-1-59028-241-0',
 'author': 'John Zelle',
 'title': 'Python Programming',
 'year': 2010}

The methods `keys()`, `values()` and `items()` give access to the keys, the values, and the key-value pairs of a dictionary. Looping over a dictionary visits its keys:

In [10]:
book.keys()

dict_keys(['author', 'title', 'ISBN', 'year'])

In [11]:
book.values()

dict_values(['John Zelle', 'Python Programming', '978-1-59028-241-0', 2010])

In [12]:
book.items()

dict_items([('author', 'John Zelle'), ('title', 'Python Programming'), ('ISBN', '978-1-59028-241-0'), ('year', 2010)])

In [13]:
for key in book:
    print("{}: {}".format(key, book[key]))

author: John Zelle
title: Python Programming
ISBN: 978-1-59028-241-0
year: 2010
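
As a variation on the loop above, one can loop over `items()` and unpack each key-value pair directly. This is a minimal sketch that rebuilds `book` so that it runs on its own; the list `lines` is a helper introduced only for this example.

```python
book = {"author": "John Zelle", "title": "Python Programming",
        "ISBN": "978-1-59028-241-0", "year": 2010}

lines = []
for key, value in book.items():     # each item is a (key, value) tuple
    lines.append("{}: {}".format(key, value))
print("\n".join(lines))

print(sorted(book))                 # ['ISBN', 'author', 'title', 'year']
```

Note that `sorted()` applied to a dictionary returns a sorted list of its keys, with upper case letters ordered before lower case ones.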


One can test **membership of a key** in a dictionary.

In [14]:
'author' in book, 'price' in book

(True, False)

One can delete a key ...

In [15]:
del book["year"]
book

{'ISBN': '978-1-59028-241-0',
 'author': 'John Zelle',
 'title': 'Python Programming'}

... or clear out the entire dictionary:

In [16]:
book.clear()
book

{}

In [17]:
book = { "author" : "John Zelle", "title" : "Python Programming",
       "ISBN" : "978-1-59028-241-0" }

## Name Frequency

Student names are listed in a file `students.csv`. 
The extension `.csv` stands for "comma separated values".
Here, students' last names are separated by commas from students'
first names.  To access the file, we open it (under its name).

In [18]:
students_file = open("students.csv")

A file can be regarded as a sequence of lines, each
terminated by a newline character (`\n`).

In [19]:
lines = students_file.readlines()

In [20]:
from random import randrange
line = lines[randrange(0,len(lines))]
line

'MC NAMARA,THOMAS\n'

To access the first name, we need to do three things:
* get rid of the trailing newline character,
* split the line into its comma-separated pieces,
* pick the second of those pieces.

In `python`, we can 
* use `strip()` to clear a string of any leading or trailing whitespace,
* use `split()` with argument `,` to split a string at every comma,
* and, of course, indexing (`[1]`) to select one item from a sequence.

In [21]:
line.strip(), line.strip().split(','), line.strip().split(',')[1]

('MC NAMARA,THOMAS', ['MC NAMARA', 'THOMAS'], 'THOMAS')

Let's wrap this up in a function, and close the file for now.

In [22]:
def csv2name(line):
    "extract the first name from a line of the form LAST,FIRST"
    name = line.strip()
    name = name.split(',')
    name = name[1]
    return name

students_file.close()

In [23]:
csv2name(line)

'THOMAS'
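
As an aside, the standard library has a `csv` module that does the stripping and splitting for us. The sketch below is one possible alternative to `csv2name()`, not the approach used in the rest of this lecture. It reads from an `io.StringIO` object holding two made-up lines, so that it runs without the file `students.csv`.

```python
import csv
import io

# made-up sample data standing in for the contents of students.csv
sample = io.StringIO("MC NAMARA,THOMAS\nMURPHY,MOLLIE\n")

first_names = []
for row in csv.reader(sample):      # each row is a list of fields
    first_names.append(row[1])      # the second field is the first name

print(first_names)                  # ['THOMAS', 'MOLLIE']
```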

So here is the plan:

* Using the accumulator pattern (again), we first initialize
a dictionary `names` as an empty dictionary (`{}`).

* Then we open the `students.csv` file again, loop over its lines
and use the above method to extract a first name from each line.

* With each name found in this way we test the dictionary:
if the name is already in there (as a key), we add $1$ to the 
associated counter;  if it's not in yet,  we
create a new key and set its counter to $1$;
that is, we add the key-value pair `(name, 1)` to
the dictionary `names`.

In [24]:
names  = {}
students_file = open("students.csv")
for line in students_file:
    name = csv2name(line)
    if name in names:
        names[name] += 1
    else:
        names[name] = 1
students_file.close()

Now we have a dictionary that has all the first names as its keys,
each with an associated value, the number of students sharing this name.  Let's look at a random key-value pair:

In [25]:
from random import choice
name = choice(list(names.keys()))
name, names[name]

('MOLLIE', 1)

Actually, there is a dictionary method `get()` that can be used to
simplify the code.  `get()` accesses a value in a dictionary by its key:  `book.get('author')` is the same as `book['author']`.
But where `book['price']` gives an error (as the dictionary `book`
has no key `price`), the `get()` method takes as a second argument
a **default value** which is returned if the key is absent.

In [26]:
book.get('price', 12.00)

12.0

Here, we can use this behaviour to avoid the `if` statement
and get a shorter program as follows.

In [27]:
names  = {}
students_file = open("students.csv")
for line in students_file:
    name = csv2name(line)
    names[name] = names.get(name, 0) + 1
students_file.close()

In [28]:
name = choice(list(names.keys()))
name, names[name]

('DAYLE', 1)
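
Counting items is such a common task that the standard library offers a ready-made dictionary type for it, `Counter` from the `collections` module. The following sketch uses a short made-up list of first names in place of the names read from `students.csv`.

```python
from collections import Counter

# made-up first names, standing in for the names read from the file
first_names = ["SEAN", "OISIN", "ELLEN", "OISIN", "SEAN", "OISIN"]

names = Counter(first_names)        # counts every item in one go
print(names["OISIN"])               # 3
print(names["DAYLE"])               # 0: absent keys count as zero
print(names.most_common(2))         # [('OISIN', 3), ('SEAN', 2)]
```

A `Counter` behaves like the dictionary `names` built by hand above, so the accumulator loop remains the thing to understand first.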

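But which name occurs more often than all the other names?
The loop below prints a name whenever its count matches or exceeds the largest count seen so far, so the last line printed shows a most frequent name.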

In [29]:
most = 0
for name in names:
    if names[name] >= most:
        most = names[name]
        print("{}: {}".format(name, names[name]))

ELLEN: 3
OISIN: 6
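
If only the winner is of interest, the built-in function `max()` can do the search. With `key=names.get` it compares the keys by their associated values. The dictionary below is a small made-up sample, not the full data set.

```python
names = {"ELLEN": 3, "OISIN": 6, "SEAN": 5, "MOLLIE": 1}   # sample counts

top = max(names, key=names.get)     # the key whose value is largest
print(top, names[top])              # OISIN 6
```

If several names share the largest count, `max()` returns the first of them that it encounters.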


More systematically, one could proceed as follows:  

* Use the `items()` method to turn the dictionary into a list of pairs
(of names and frequencies), 

* then sort that list according to
frequency, 

* and finally print the top 10.

In [30]:
items = list(names.items())
def frequency(item):
    return item[1]
items.sort()                             # alphabetical order first ...
items.sort(key=frequency, reverse=True)  # ... then a stable sort by frequency
for i in range(10):
    name, count = items[i]
    print("{0:<15}{1:>5}".format(name, count))

OISIN              6
SEAN               5
MATTHEW            4
RONAN              4
CAOIMHE            3
CIAN               3
CONOR              3
DAVID              3
ELLEN              3
MICHAEL            3
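
The two sorting steps can also be combined into one by sorting on a pair: the negated count first, the name second. This sketch uses a small made-up dictionary and a `lambda` expression (an unnamed one-line function, not covered so far) as the sort key.

```python
names = {"SEAN": 5, "RONAN": 4, "OISIN": 6, "MATTHEW": 4}   # sample counts

# sort by descending count, then alphabetically by name
ranking = sorted(names.items(), key=lambda item: (-item[1], item[0]))
for name, count in ranking[:3]:     # slicing is safe even for short lists
    print("{0:<15}{1:>5}".format(name, count))
```

Here `sorted()` returns a new list and leaves the dictionary untouched, whereas the `sort()` method rearranges a list in place.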


## Summary: Dictionaries

* A `python` list is a **sequential collection** of data.

* In `python`, lists are **mutable**, but strings are not.

* A `python` **tuple** is an **immutable** list.

* A `python` dictionary implements an arbitrary **mapping** from keys to values.

* Dictionaries are very useful for representing **non-sequential collections** of data.

* The `sort()` method for lists takes additional **keyword arguments**
that define the sorting criteria.