# Geological abstractions 2

Next we're going to meet the following Python types:

- `dict`: mappings, short for 'dictionary'
- `set`: unordered collections of unique elements

Let's get started!

## `dict`

Dictionaries are an important data structure in Python. They are arguably the most important, since much of Python's own machinery (namespaces, object attributes, keyword arguments) is built on dictionaries. Called mappings or hash tables in other languages, a dictionary, or `dict`, is a collection of key / value pairs in which each key is unique. The dictionary itself is mutable, but the keys must be immutable (more precisely, hashable).
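
The restriction on keys can be seen directly. In this minimal sketch, the `ages` dictionary and its keys are made up for illustration and are not part of our dataset:

```python
# Strings, numbers and tuples are hashable, so they can be keys.
ages = {}
ages["Cambrian"] = 541
ages[("Cambrian", "start")] = 541

# A list is mutable and therefore unhashable; using one as a key fails.
try:
    ages[["Cambrian", "start"]] = 541
    list_key_error = None
except TypeError as err:
    list_key_error = err

print(type(list_key_error).__name__)
```

A tuple is the usual stand-in when a key needs more than one component.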

To see why the concept of key / value pairs is useful, recall the data structure we used in the last chapter for a list of geological periods, with further lists for their ages:

In [68]:
period_list = [["Cambrian (\uA792)", [541, 485]], ["Ordovician (O)", [485, 444]], 
               ["Silurian (S)", [444, 419]], ["Devonian (D)",[419, 359]], 
               ["Carboniferous (M)", [359, 299]], ["Permian (P)", [299, 252]],
               ["Triassic (T)", [252, 201]], ["Jurassic (J)", [201, 145]],
               ["Cretaceous (C)", [155, 66]], ["Palaeogene (Pg)", [66, 23]],
               ["Neogene (N)", [23, 2.6]], ["Quaternary (Q)", [2.6, 0]],
              ]

Dictionaries turn out to be a much more convenient way to store information like this. Instead of making a list of the values (the names and their start and end ages) and relying on their positions for meaning, we can use _keys_ to give the _values_ more natural names. For example, we could store a 'row' of our dataset like this:

In [69]:
period = {'name': "Cambrian", "start": 544, "end": 495}

This is nice, because instead of remembering the indices of our data, we can use names. For example:

In [70]:
period['name']

'Cambrian'

Adding new keys follows the same pattern:

In [71]:
period["abbreviation"] = "\uA792"
period

{'abbreviation': 'Ꞓ', 'end': 495, 'name': 'Cambrian', 'start': 544}

Notice that the keys are displayed in alphabetical order, not in the order we added them. This is a quirk of how IPython displays dictionaries (at the time of writing, version 6.1.0): its pretty-printer sorts the keys. Before Python 3.6, dictionaries made no promise about key order; in CPython 3.6 keys are stored in the order in which they are added, and from Python 3.7 this is guaranteed by the language. Some utilities have not caught up yet and still reflect the older behaviour.

In [72]:
import IPython
IPython.__version__

'6.1.0'
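
One way to check the stored key order, independent of how IPython displays the dictionary, is to convert the keys to a list. This sketch assumes Python 3.7 or later, where insertion order is guaranteed:

```python
period = {'name': "Cambrian", "start": 544, "end": 495}
period["abbreviation"] = "\uA792"

# list() on a dictionary gives its keys in their stored order.
key_order = list(period)
print(key_order)
```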

Returning to our example, there are a few ways we could organize this dataset. The structure we choose depends on the purpose we have in mind and the features we want. One possibility is to make a dictionary that maps each period name to another dict containing its metadata.

Instead of typing it out explicitly, challenge yourself to build the dictionary programmatically in the following exercise.

### Exercise

Build a dictionary from `period_list`. The keys should be `'Cambrian'`, `'Ordovician'`, etc. The values should be like this:

    {'abbreviation': 'Ꞓ', 'start': 541, 'end': 485}
    
It can be done in a single 'dictionary comprehension', but you will find it easier to do it in a loop.

In [73]:
# SOLUTION 1
periods = {}
for period in period_list:
    # period[0] looks like 'Cambrian (Ꞓ)'; period[1] is [start, end].
    name, abbr = period[0].split()
    meta = dict(abbreviation=abbr.strip('()'),
                start=period[1][0],
                end=period[1][1],
                )
    periods[name] = meta

In [74]:
# SOLUTION 2
periods = {i[0].split()[0]: {'abbreviation': i[0].split()[1].strip('()'), 'start': i[1][0], 'end': i[1][1]}
           for i in period_list}

In [75]:
periods

{'Cambrian': {'abbreviation': 'Ꞓ', 'end': 485, 'start': 541},
 'Carboniferous': {'abbreviation': 'M', 'end': 299, 'start': 359},
 'Cretaceous': {'abbreviation': 'C', 'end': 66, 'start': 155},
 'Devonian': {'abbreviation': 'D', 'end': 359, 'start': 419},
 'Jurassic': {'abbreviation': 'J', 'end': 145, 'start': 201},
 'Neogene': {'abbreviation': 'N', 'end': 2.6, 'start': 23},
 'Ordovician': {'abbreviation': 'O', 'end': 444, 'start': 485},
 'Palaeogene': {'abbreviation': 'Pg', 'end': 23, 'start': 66},
 'Permian': {'abbreviation': 'P', 'end': 252, 'start': 299},
 'Quaternary': {'abbreviation': 'Q', 'end': 0, 'start': 2.6},
 'Silurian': {'abbreviation': 'S', 'end': 419, 'start': 444},
 'Triassic': {'abbreviation': 'T', 'end': 201, 'start': 252}}
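
As a possible variation on these solutions, tuple unpacking in the `for` statement can replace the numeric indices. This is only a sketch: `label`, `start`, `end` and `periods_short` are names chosen here, and the list is cut down to two entries to keep the example small.

```python
short_list = [["Cambrian (\uA792)", [541, 485]],
              ["Ordovician (O)", [485, 444]]]

periods_short = {}
for label, (start, end) in short_list:
    # 'Cambrian (Ꞓ)' splits into the name and the bracketed abbreviation.
    name, abbr = label.split()
    periods_short[name] = {'abbreviation': abbr.strip('()'),
                           'start': start,
                           'end': end}

print(periods_short['Ordovician'])
```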

## Changing entries

Dictionaries are mutable, so we can change their contents in-place. We do this by simply reassigning a value; let's fix the abbreviation for the Carboniferous:

In [76]:
periods['Carboniferous'] = {'abbreviation': 'C', 'start': 359, 'end': 299}

Instead of making a whole new dictionary for the value, we could index into it and change only the abbreviation itself:

In [77]:
periods['Cretaceous']['abbreviation'] = 'K'

Now we can step over our dictionary's items and do something with them. The convenient `items()` method on a dictionary returns an iterable of `(key, value)` tuples:

In [78]:
for key, value in periods.items():
    print(f"The {key} started {value['start']} million years ago.")

The Cambrian started 541 million years ago.
The Ordovician started 485 million years ago.
The Silurian started 444 million years ago.
The Devonian started 419 million years ago.
The Carboniferous started 359 million years ago.
The Permian started 299 million years ago.
The Triassic started 252 million years ago.
The Jurassic started 201 million years ago.
The Cretaceous started 155 million years ago.
The Palaeogene started 66 million years ago.
The Neogene started 23 million years ago.
The Quaternary started 2.6 million years ago.


We used `key` and `value` for the names of those elements in that loop, but normally we'd look for more descriptive names for these variables. In this case, `period` and `meta` might make sense.
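
With those names, the loop might look like the following sketch. It uses a two-entry stand-in called `sample_periods` so that it runs on its own, and the duration calculation is just an illustration:

```python
sample_periods = {
    'Cambrian': {'abbreviation': '\uA792', 'start': 541, 'end': 485},
    'Ordovician': {'abbreviation': 'O', 'start': 485, 'end': 444},
}

durations = {}
for period, meta in sample_periods.items():
    # Ages are in millions of years before present, so start > end.
    durations[period] = meta['start'] - meta['end']
    print(f"The {period} lasted {durations[period]} million years.")
```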

## Default values

Dictionaries are a bit like small databases. Just as with a real database, we will sometimes attempt to retrieve a value from a dictionary, only to find it is not there:

In [58]:
periods['Mississippian']

KeyError: 'Mississippian'

Dictionaries have a `get()` method, which takes a default value that will be returned if the key is missing from the dictionary:

In [60]:
periods.get('Mississippian', 'Key not found')

'Key not found'
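
If we omit the default, `get()` returns `None` for a missing key rather than raising an error. A small sketch, using a cut-down stand-in called `sample_periods` so that it runs on its own:

```python
sample_periods = {'Cambrian': {'abbreviation': '\uA792', 'start': 541, 'end': 485}}

# No default given: a missing key yields None instead of a KeyError.
missing = sample_periods.get('Mississippian')

# An empty dict as the default lets us chain a second get() safely.
missing_start = sample_periods.get('Mississippian', {}).get('start')
cambrian_start = sample_periods.get('Cambrian', {}).get('start')

print(missing, missing_start, cambrian_start)
```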

There's also a `setdefault()` method that takes a default value which will be inserted into the dictionary if the key is not present. Like `get()`, the method returns the value (after inserting it if necessary).

In [61]:
periods.setdefault('Mississippian', {'abbreviation': 'M', 'start':  358.9, 'end': 323.2})

{'abbreviation': 'M', 'end': 323.2, 'start': 358.9}

In [62]:
periods

{'Cambrian': {'abbreviation': 'Ꞓ', 'end': 485, 'start': 541},
 'Carboniferous': {'abbreviation': 'C', 'end': 299, 'start': 359},
 'Cretaceous': {'abbreviation': 'K', 'end': 66, 'start': 155},
 'Devonian': {'abbreviation': 'D', 'end': 359, 'start': 419},
 'Jurassic': {'abbreviation': 'J', 'end': 145, 'start': 201},
 'Mississippian': {'abbreviation': 'M', 'end': 323.2, 'start': 358.9},
 'Neogene': {'abbreviation': 'N', 'end': 2.6, 'start': 23},
 'Ordovician': {'abbreviation': 'O', 'end': 444, 'start': 485},
 'Palaeogene': {'abbreviation': 'Pg', 'end': 23, 'start': 66},
 'Permian': {'abbreviation': 'P', 'end': 252, 'start': 299},
 'Quaternary': {'abbreviation': 'Q', 'end': 0, 'start': 2.6},
 'Silurian': {'abbreviation': 'S', 'end': 419, 'start': 444},
 'Triassic': {'abbreviation': 'T', 'end': 201, 'start': 252}}

Note that `setdefault()` doesn't change the value if the key was already present. Now that we've inserted it, future attempts to retrieve the value corresponding to this key will get the one from the dictionary, regardless of the default value we pass:

In [67]:
periods.setdefault('Mississippian', {'abbreviation': 'Miss'})

{'abbreviation': 'M', 'end': 323.2, 'start': 358.9}
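
A common use of `setdefault()` is building up a dictionary of lists without first checking whether each key exists. The sketch below groups a few period names by era; the `period_eras` list and the variable names are introduced here for illustration:

```python
period_eras = [
    ("Cambrian", "Palaeozoic"),
    ("Permian", "Palaeozoic"),
    ("Triassic", "Mesozoic"),
    ("Jurassic", "Mesozoic"),
    ("Neogene", "Cenozoic"),
]

eras = {}
for period_name, era in period_eras:
    # Insert an empty list the first time we see an era, then append to it.
    eras.setdefault(era, []).append(period_name)

print(eras)
```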

### Exercise

- What is the expected output of the following?
  - `periods['Triassic']`
  - `periods['Jurassic'].get('start')`
- What command would you type to return the age of the end of the Permian?
- The start of the Cretaceous is wrong: it should be 145. Change it to the correct value.
- The Ediacaran Period (635 Ma, to the beginning of the Cambrian) is not in our dataset. Add it.

In [65]:
# your code here

## `set`

Instances of the `set` type implement something like the mathematical concept of sets. Sets in Python are unordered containers of unique values (i.e. duplicated elements are ignored). Sets are mutable, but their items must be immutable, so you can't have a set of lists, or a set of sets. Sets are defined by comma-separated values (of any hashable type) between curly braces `{}`.

In [83]:
s = {7, 7, 1, 3, 2, 4, 7}
s

{1, 2, 3, 4, 7}

It looks as if the set is ordered, but it is not. IPython has sorted the values for display 'prettiness'. 

Because the items are unordered, you cannot index into a set by position:

In [86]:
s[3]

TypeError: 'set' object does not support indexing
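
Two details are easy to trip over: elements must be hashable, and empty curly braces make a `dict`, not a `set`. A minimal sketch (the rock names are only examples):

```python
# Tuples are hashable, so a set of tuples is fine.
pairs = {("Limestone", "Dolostone"), ("Sandstone", "Siltstone")}

# Lists are not hashable, so a set of lists raises a TypeError.
try:
    bad = {["Limestone", "Dolostone"]}
except TypeError as err:
    print(f"TypeError: {err}")

# {} is an empty dictionary; use set() for an empty set.
empty_braces = {}
empty_set = set()
print(type(empty_braces).__name__, type(empty_set).__name__)
```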

What are sets useful for? It seems rather trivial, but the most popular use of sets is to get the unique items from a list of immutable objects:

In [103]:
sedimentary = ["Sandstone", "Siltstone", "Mudstone", "Limestone", "Dolostone", "Limestone"]
set(sedimentary)

{'Dolostone', 'Limestone', 'Mudstone', 'Sandstone', 'Siltstone'}

If we have another set that overlaps this one, we can perform set-theoretic operations, such as unions and intersections:

In [101]:
carbonates = {"Limestone", "Dolostone", "Marble", "Carbonatite"}

set(sedimentary).intersection(carbonates)

{'Dolostone', 'Limestone'}

There's no elegant way to achieve the same thing with a list comprehension. The obvious attempt keeps the duplicate, and we'd have to resort to a `for` loop to avoid adding duplicates to the list.

In [102]:
[r for r in sedimentary if r in carbonates]

['Limestone', 'Dolostone', 'Limestone']
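
Besides `intersection()`, sets offer `union()` and `difference()`, and each has an operator form. A short sketch using the same rock names (the variable names here are chosen for illustration):

```python
sedimentary_set = {"Sandstone", "Siltstone", "Mudstone", "Limestone", "Dolostone"}
carbonate_set = {"Limestone", "Dolostone", "Marble", "Carbonatite"}

both = sedimentary_set & carbonate_set             # intersection
either = sedimentary_set | carbonate_set           # union
non_carbonates = sedimentary_set - carbonate_set   # difference

# sorted() returns a list, giving a stable order for display.
print(sorted(both))
print(sorted(non_carbonates))
```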

You probably won't come across sets very often in your early use of Python. But it's good to know they are there. 

## All the types

That's it! Congratulations, you have met all the most important 'types', Python's family of object classes that you will use every day of your coding life. If some of the details seem a little murky right now, don't worry about it. By the end of this book, you'll be intimately familiar with them, and you'll be wielding them without a second thought.