# Dictionaries, Sets, Tuples

By [Allison Parrish](http://www.decontextualize.com/)

## Dictionaries

The dictionary is a very useful data structure in Python. The easiest way to conceptualize a dictionary is that it's like a spreadsheet with two columns: the "key" column, and the "value" column. Unlike a spreadsheet (or a Python list), dictionaries are implemented in such a way that it is very fast and efficient to find values by looking up their corresponding key.

We're going to focus here just on learning how to get data *out* of dictionaries, not how to build new dictionaries from existing data. We're also going to omit some of the nitty-gritty details about how dictionaries work internally. As a consequence, some of what I'm going to tell you will seem weird and magical. Be prepared!

### Why dictionaries?

For our purposes, the benefit of having data that can be parsed into dictionaries, as opposed to lists, is that dictionary keys tend to be *mnemonic*. That is, a dictionary key will usually tell you something about what its value is. (This is in opposition to parsing, say, CSV data, where we have to keep counting fields in the header row and translating that to the index that we want.)

Lists and dictionaries work together and are used extensively to represent all different kinds of data. Often, when we get data from a remote source, or when we choose how to represent data internally, we'll use both in tandem. The most common form this will take is representing a table, or a database, as a *list* of records that are themselves represented as *dictionaries* (mapping the name of the column to the value for that column).

Dictionaries are also good for storing *associations* or *mappings* for quick lookups. For example, if you wanted to write a program that was able to recall the capital city of every US state, you might use a dictionary whose keys are the names of the states, and whose values are the corresponding capitals. Dictionaries are also used for data analysis tasks, like keeping track of how many times a particular token occurs in an incoming data stream.

### What dictionaries look like

Dictionaries are written with curly brackets, surrounding a series of comma-separated pairs of *keys* and *values*. Here's a very simple dictionary, with one key, `Obama`, associated with a value, `Hawaii`:

In [1]:
{'Obama': 'Hawaii'}

{'Obama': 'Hawaii'}

Here's another dictionary, with more entries:

In [2]:
{'Obama': 'Hawaii', 'Bush': 'Texas', 'Clinton': 'Arkansas', 'Trump': 'New York'}

{'Obama': 'Hawaii',
 'Bush': 'Texas',
 'Clinton': 'Arkansas',
 'Trump': 'New York'}

As you can see, we're building a simple dictionary that associates the names of presidents to the home states of those presidents.

The association of a key with a value is sometimes called a *mapping*. (In fact, in other programming languages like Java, the dictionary data structure is called a "Map.") So, in the above dictionary for example, we might say that the key `Clinton` *maps to* the value `Arkansas`.

A dictionary is just like any other Python value. You can assign it to a variable:

In [3]:
president_states = {'Obama': 'Hawaii', 'Bush': 'Texas', 'Clinton': 'Arkansas', 'Trump': 'New York'}

And that value has a type:

In [4]:
type(president_states)

dict

At its most basic level, a dictionary is sort of like a two-column spreadsheet, where the key is one column and the value is another column. If you were to represent the dictionary above as a spreadsheet, it might look like this:

| key   | value   |
| ----- | ------- |
| Obama | Hawaii |
| Bush | Texas |
| Clinton | Arkansas |
| Trump | New York |

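The main difference between a spreadsheet and a dictionary is that you look up a dictionary's values by key, not by row number. (Older versions of Python made no promise about the order of a dictionary's entries; since Python 3.7, a dictionary keeps its keys in the order in which they were added.)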

Keys and values in dictionaries can be of any data type, not just strings. Here's a dictionary, for example, that maps integers to lists of floating point numbers:

In [5]:
{17: [1.6, 2.45], 42: [11.6, 19.4], 101: [0.123, 4.89]}

{17: [1.6, 2.45], 42: [11.6, 19.4], 101: [0.123, 4.89]}

> HEAD-SPINNING OPTIONAL ASIDE: Actually, "any type" above is a simplification: *values* can be of any type, but keys must be *hashable*---see [the Python glossary](https://docs.python.org/2/glossary.html#term-hashable) for more information. In practice, this limitation means you can't use lists (or dictionaries themselves) as keys in dictionaries. There are ways of getting around this, though!
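
To make the aside above a little more concrete, here's a minimal sketch of what happens when you try to use a list as a key, along with one common workaround: using a tuple instead. The `city_by_coords` name and the rough coordinates are just illustrative assumptions, not real data you should rely on.

```python
# Lists are mutable, so Python refuses to use them as dictionary keys.
try:
    bad = {[40.7, -74.0]: 'New York'}
    error_message = None
except TypeError as err:
    error_message = str(err)

print(error_message)

# A tuple holding the same values is hashable, so it works as a key.
# (The coordinates here are rough, made-up illustrative values.)
city_by_coords = {(40.7, -74.0): 'New York', (34.1, -118.2): 'Los Angeles'}
print(city_by_coords[(40.7, -74.0)])
```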

A dictionary can also be empty, containing no key/value pairs at all:

In [6]:
{}

{}

### Getting values out of dictionaries

The primary operation that we'll perform on dictionaries is writing an expression that evaluates to the value for a particular key. We do that with the same syntax we used to get a value at a particular index from a list. Except with dictionaries, instead of using a number, we use one of the keys that we had specified for the value when making the dictionary. For example, if we wanted to know what Bill Clinton's home state was, or, more precisely, what the value for the key `Clinton` is, we would write this expression:

In [7]:
president_states["Clinton"]

'Arkansas'

Going back to our spreadsheet analogy, this is like looking for the row whose first column is "Clinton" and getting the value from the corresponding second column.

If we put a key in those brackets that does not exist in the dictionary, we get an error similar to the one we get when trying to access an index beyond the end of a list:

In [8]:
president_states['Franklin']

KeyError: 'Franklin'
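
If you'd rather not get an error for a missing key, one option is the dictionary's `.get()` method, which is part of Python's standard `dict` type. The sketch below assumes the same `president_states` dictionary as above; the `'unknown'` fallback string is just a placeholder I made up.

```python
president_states = {'Obama': 'Hawaii', 'Bush': 'Texas',
                    'Clinton': 'Arkansas', 'Trump': 'New York'}

# .get() evaluates to None (instead of raising KeyError) for a missing key
missing = president_states.get('Franklin')
print(missing)

# an optional second argument supplies a fallback value of your choosing
fallback = president_states.get('Franklin', 'unknown')
print(fallback)

# for keys that do exist, .get() behaves like square-bracket lookup
print(president_states.get('Clinton'))
```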

As you might suspect, the thing you put inside the brackets doesn't have to be a string; it can be any Python expression, as long as it evaluates to something that is a key in the dictionary:

In [9]:
president = 'Obama'
president_states[president]

'Hawaii'

You can get a list of all the keys in a dictionary using the dictionary's `.keys()` method:

In [10]:
president_states.keys()

dict_keys(['Obama', 'Bush', 'Clinton', 'Trump'])

That funny-looking `dict_keys(...)` thing isn't *exactly* a list, but it's close enough: you can use it in most places where you would loop over a list, like in a list comprehension:

In [11]:
[item.upper() for item in president_states.keys()]

['OBAMA', 'BUSH', 'CLINTON', 'TRUMP']

... or a `for` loop:

In [12]:
for item in president_states.keys():
    print(item)

Obama
Bush
Clinton
Trump


And a list of all the values with the `.values()` method:

In [13]:
president_states.values()

dict_values(['Hawaii', 'Texas', 'Arkansas', 'New York'])

If you want a list of all key/value pairs, you can call the `.items()` method:

In [14]:
president_states.items()

dict_items([('Obama', 'Hawaii'), ('Bush', 'Texas'), ('Clinton', 'Arkansas'), ('Trump', 'New York')])

(The weird list-like things here that use parentheses instead of brackets are called *tuples*. We'll discuss those later.)
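
One place where the list-like objects that come back from `.keys()`, `.values()` and `.items()` differ from real lists is indexing. Here's a small sketch (again assuming the `president_states` dictionary from above) showing that square brackets don't work on them directly, and that the built-in `list()` function converts them when you need a real list.

```python
president_states = {'Obama': 'Hawaii', 'Bush': 'Texas',
                    'Clinton': 'Arkansas', 'Trump': 'New York'}
keys = president_states.keys()

# A dict_keys object can't be indexed with square brackets...
try:
    keys[0]
    indexable = True
except TypeError:
    indexable = False
print(indexable)

# ...but list() turns it into a real list, which can be.
key_list = list(keys)
print(key_list[0])
```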

### Other operations on dictionaries

[Here's a list of all the methods that dictionaries support](https://docs.python.org/3.6/library/stdtypes.html#mapping-types-dict). I want to talk about a few of these in particular. First, the `in` operator (which we've used previously to check to see if there's a substring in a string, or a particular item in a list) also works with dictionaries! It checks to see if a particular key exists in the dictionary:

In [15]:
'Obama' in president_states

True

In [16]:
'Franklin' in president_states

False

A dictionary can also go in a `for` loop, in the spot between `in` and the colon (where you might normally put a list). If you write a for loop like this, the loop will iterate over each key in the dictionary:

In [17]:
for item in president_states:
    print(item)

Obama
Bush
Clinton
Trump


### Dictionaries can contain lists and other dictionaries

Dictionaries are often used to represent *hierarchical* data structures, that is, data structures with a top-down organization. For example, consider a program intended to keep track of a shopping list. In such a program, you might want to categorize grocery items by category, so you might make a dictionary that has a key for each category:

In [18]:
shopping = {'produce': ['apples', 'oranges', 'spinach', 'carrots'],
            'meat': ['ground beef', 'chicken breast']}

The `shopping` dictionary above has two keys, whose values are both *lists*. Writing an expression that evaluates to one of these lists is easy, e.g.:

In [19]:
shopping['meat']

['ground beef', 'chicken breast']

And you could write a `for` loop to print out the items of one of these lists fairly easily as well, e.g.:

In [20]:
print("Produce items on your list:")
for item in shopping['produce']:
    print("* " + item)

Produce items on your list:
* apples
* oranges
* spinach
* carrots


Slightly more challenging is this: how do you write an expression that evaluates to (let's say) the *first item* of the list of produce? The trick to this is to remember how indexing syntax works. When you have a pair of square brackets with a single value inside of them, Python looks immediately to the left of those square brackets for an expression that *evaluates to* either a list or a dictionary. For example, in the following expression:

In [21]:
[5, 10, 15, 20][3]

20

... you can think of Python as looking at this expression from right to left. It sees the `[3]` first and then thinks, "okay, I need to find something that is a list or dictionary directly to the left of this, and grab the item at index 3." In fact, it *does* find a list or a dictionary (i.e., the list `[5, 10, 15, 20]`) and evaluates the entire expression to `20` accordingly.

With that in mind, let's rephrase the task. I want to get:

* the first item
* from the list that is the value for the key `produce`
* in the dictionary `shopping`

We can work at this problem by following these instructions and then writing the expression *in reverse*. To get the first item from a list, we write:

    ????[0] # the first item
    
`????` is just a placeholder for the part of the code that we haven't written yet, but we know that it has to be a list. Then, to get the value for the key `produce`:

    ????["produce"][0] # from the list that is the value for the key `produce`
    
Again, `????` is a placeholder, but now we know it has to be a dictionary. The dictionary, of course, is `shopping`, so we can fill that in as the last step:

    shopping["produce"][0]
    
Let's see what that expression evaluates to:

In [22]:
shopping["produce"][0]

'apples'

Exactly right! But let's say we want to take the organization in our dictionary up a notch and create separate categories for fruits and vegetables. One way to do this would be to make the value for the key `produce` be... another dictionary, like so:

In [23]:
shopping = {'produce': {'fruits': ['apples', 'oranges'], 'vegetables': ['spinach', 'carrots']},
            'meat': ['ground beef', 'chicken breast']}

This is now a pretty complicated data structure! (Well, not *that* complicated compared to what you'll see, e.g., in responses from web APIs. But it's the most complicated data structure we've made so far.) If we were to draw a schematic of this data structure, it might look something like this:

    shopping (dictionary)
        -> produce (dictionary)
            -> fruits (list)
            -> vegetables (list)
        -> meat (list)
        
In prose: `shopping` is a variable that contains a dictionary. That dictionary has two keys: `produce`, whose value is itself a dictionary, and `meat`, whose value is a list. (Whew!)

Given this data structure, let's work through how to do the following tasks:

* Get a list of all fruits
* Get a list of all categories of produce
* Get the first fruit
* Get the second vegetable

Getting a list of the fruits requires getting the value for the `fruits` key in the dictionary that is the value for the `produce` key. So we start out with:

                       ['fruits'] -> Step one
            ['produce']['fruits'] -> Step two
    shopping['produce']['fruits'] -> Step three
    
The final expression:

In [24]:
# A list of all fruits
shopping['produce']['fruits']

['apples', 'oranges']

Continuing with our tasks:

In [25]:
# a list of all categories of produce
shopping['produce'].keys()

dict_keys(['fruits', 'vegetables'])

In [26]:
# the first fruit
shopping['produce']['fruits'][0]

'apples'

In [27]:
# the second vegetable
shopping['produce']['vegetables'][1]

'carrots'
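
The same stacked-bracket expressions work inside loops. As a rough sketch, here's one way to walk through every category of produce and every item in it, using the `shopping` dictionary from above. The `all_produce` list is a name I've introduced just for this example.

```python
shopping = {'produce': {'fruits': ['apples', 'oranges'],
                        'vegetables': ['spinach', 'carrots']},
            'meat': ['ground beef', 'chicken breast']}

all_produce = []
# the outer loop visits each key of the inner dictionary
for category in shopping['produce']:
    # the inner loop visits each item in the list for that key
    for item in shopping['produce'][category]:
        print(category, "->", item)
        all_produce.append(item)
```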

### Adding key/value pairs to a dictionary

Once you've assigned a dictionary to a variable, you can add another key/value pair to the dictionary by assigning a value to a new key, like so:

In [28]:
president_states['Reagan'] = 'California'

Take a look at the dictionary to see that there's a new key/value pair in there:

In [29]:
president_states

{'Obama': 'Hawaii',
 'Bush': 'Texas',
 'Clinton': 'Arkansas',
 'Trump': 'New York',
 'Reagan': 'California'}
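
Combining key assignment with the `in` operator gives you the token-counting pattern mentioned near the top of this tutorial. This is only a minimal sketch: the `tokens` list is made-up sample data, and there are more compact ways to do the same thing.

```python
tokens = ['to', 'be', 'or', 'not', 'to', 'be']

counts = {}
for token in tokens:
    if token in counts:
        # we've seen this token before: add one to its running count
        counts[token] = counts[token] + 1
    else:
        # first time we've seen this token: add a new key/value pair
        counts[token] = 1

print(counts)
```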

### Dictionary keys are unique

Another important fact about dictionaries is that you can't put the same key into one dictionary twice. If you try to write out a dictionary that has the same key used more than once, Python will silently discard all but the last of the key/value pairs for that key. For example:

In [32]:
{'a': 1, 'a': 2, 'a': 3}

{'a': 3}

Similarly, if we attempt to set the value for a key that already exists in the dictionary (using `=`), we won't add a second key/value pair for that key---we'll just overwrite the existing value:

In [33]:
test_dict = {'a': 1, 'b': 2}
test_dict['a']

1

In [34]:
test_dict['a'] = 100
test_dict['a']

100

In the case where a key needs to map to multiple values, we might instead see a data structure in which the key maps to another kind of data structure that itself can contain multiple values, like a list:

In [35]:
{'a': [1, 2, 3]}

{'a': [1, 2, 3]}
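
Here's a small sketch of how a key-to-list structure like this might get built up one value at a time. The `words` list and the choice to group by first letter are assumptions I've made purely for illustration.

```python
words = ['apple', 'avocado', 'banana', 'blueberry', 'cherry']

by_letter = {}
for word in words:
    first = word[0]
    # start a new, empty list the first time we see a letter
    if first not in by_letter:
        by_letter[first] = []
    # then add the word to the list for that letter
    by_letter[first].append(word)

print(by_letter)
```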

### Representing column-oriented data with dictionaries

Dictionaries are often used to represent table-like data structures (i.e., two-dimensional data structures with columns and rows). One way of representing such data is to make a list of dictionaries, where each individual row is represented with a dictionary whose keys are the column names, and whose values are the contents of the cell for that row. That's a mouthful—maybe easier to understand with an example:

In [58]:
city_data_list = [
    {
      "city": "New York",
      "state": "New York",
      "population": 8461961
    },
    {
      "city": "Los Angeles",
      "state": "California",
      "population": 3918872
    },
    {
      "city": "Chicago",
      "state": "Illinois",
      "population": 2714017
    },
    {
      "city": "Houston",
      "state": "Texas",
      "population": 2240582
    },
    {
      "city": "Philadelphia",
      "state": "Pennsylvania",
      "population": 1559938
    }
]

An individual row can be accessed by indexing the list, e.g.:

In [59]:
city_data_list[2]

{'city': 'Chicago', 'state': 'Illinois', 'population': 2714017}

... and an individual cell value can be accessed by indexing a key name from that row:

In [60]:
city_data_list[2]['state']

'Illinois'

Getting all of the values from a particular column involves a list comprehension. For example, to find the sum of the population of these cities:

In [61]:
sum([city['population'] for city in city_data_list])

18895370
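
A list comprehension can also pick out only the rows that meet some condition. The sketch below repeats the city data so that it runs on its own; the 2.5 million cutoff is an arbitrary number I picked for the example.

```python
city_data_list = [
    {"city": "New York", "state": "New York", "population": 8461961},
    {"city": "Los Angeles", "state": "California", "population": 3918872},
    {"city": "Chicago", "state": "Illinois", "population": 2714017},
    {"city": "Houston", "state": "Texas", "population": 2240582},
    {"city": "Philadelphia", "state": "Pennsylvania", "population": 1559938},
]

# keep the name of each city whose population is above the cutoff
big_cities = [city['city'] for city in city_data_list
              if city['population'] > 2500000]
print(big_cities)
```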

Another way of using a dictionary to represent tabular data is to have keys for every column, whose value is a list of all of the values for cells in that column. The data above, for example, can also be represented like this:

In [73]:
city_data_dict = {
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Philadelphia'],
    'state': ['New York', 'California', 'Illinois', 'Texas', 'Pennsylvania'],
    'population': [8461961, 3918872, 2714017, 2240582, 1559938]
}

In this arrangement, getting the value for a particular column and row involves indexing the key for the column, and then the numerical index for the row:

In [74]:
city_data_dict['state'][2]

'Illinois'

... but performing operations on an entire column only involves one index (i.e., getting the value for the key corresponding to the column):

In [75]:
sum(city_data_dict['population'])

18895370

It's easy to use either of these formats with Pandas, by the way. Just pass the data structure as an argument to the `DataFrame` constructor. This works both for lists of dictionaries...

In [76]:
import pandas as pd

In [77]:
df = pd.DataFrame(city_data_list)
df

|   | city | state | population |
| - | ---- | ----- | ---------- |
| 0 | New York | New York | 8461961 |
| 1 | Los Angeles | California | 3918872 |
| 2 | Chicago | Illinois | 2714017 |
| 3 | Houston | Texas | 2240582 |
| 4 | Philadelphia | Pennsylvania | 1559938 |


... and dictionaries of lists:

In [78]:
df = pd.DataFrame(city_data_dict)
df

|   | city | state | population |
| - | ---- | ----- | ---------- |
| 0 | New York | New York | 8461961 |
| 1 | Los Angeles | California | 3918872 |
| 2 | Chicago | Illinois | 2714017 |
| 3 | Houston | Texas | 2240582 |
| 4 | Philadelphia | Pennsylvania | 1559938 |


Neat!

By the way, the [city data](https://github.com/dariusk/corpora/blob/master/data/geography/us_cities.json) I used in this example is from the [Corpora Project](https://github.com/dariusk/corpora/tree/master), which has a number of useful small datasets. These datasets are in JSON format, but for the most part, they can be copy/pasted into your Python code without modification.

> *Advanced exercise*: Write some code that converts data in the list-of-dictionaries format to the dictionary-of-lists format, and vice versa. You might look into the `defaultdict` class from the `collections` module.
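
If you get stuck, here's a hedged sketch of *one* direction only (list-of-dictionaries to dictionary-of-lists), which leaves the reverse conversion for you to work out. It assumes every row has the same keys; the `rows` and `columns` names are mine, and this is just one possible approach among several.

```python
from collections import defaultdict

rows = [
    {'city': 'New York', 'state': 'New York', 'population': 8461961},
    {'city': 'Los Angeles', 'state': 'California', 'population': 3918872},
    {'city': 'Chicago', 'state': 'Illinois', 'population': 2714017},
]

# defaultdict(list) creates an empty list the first time a key is used
columns = defaultdict(list)
for row in rows:
    for key, value in row.items():
        columns[key].append(value)

# convert back to a plain dictionary for display
columns = dict(columns)
print(columns)
```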

## Sets

The set is our second important data structure. You can think of a set as a kind of list, but with the following caveats:

1. Sets don’t maintain the order of objects after you’ve put them in.
2. You can’t add an object to a set if it already has an identical object.

Objects can be added to a set by calling its `.add()` method (as opposed to the `.append()` method used for lists).

A corollary to item 1 from the list above is that you can’t use the square bracket notation to access a particular element in a set. Once you’ve added an object, the only operations you can do are to check to see if an object is in the set (with the `in` operator), and iterate over all objects in the set (with, for example, `for`). So, for example, to initialize a set:

In [36]:
emojis = set()

And then add some items to the set:

In [37]:
emojis.add("😀")
emojis.add("😜")
emojis.add("😐")

Evaluating the set shows us its members:

In [38]:
emojis

{'😀', '😐', '😜'}

And you can check to see if something is in a set using the `in` operator:

In [39]:
'😀' in emojis

True

In [40]:
'hello' in emojis

False

You can write a loop that executes code for every item in a set by putting the set between `in` and the colon in a `for` loop:

In [42]:
for item in emojis:
    print(item)

😀
😜
😐


An additional aspect of sets to note from the transcript above: because sets don’t maintain the order of objects, you’ll get the objects back in (seemingly) random order when you iterate over the set. For most applications, this isn’t a problem, but it’s something to keep in mind.

One thing you'll notice about sets is that you can't add the same item twice. Observe:

In [43]:
important_numbers = set([5, 10, 15, 20, 25])

In [44]:
important_numbers

{5, 10, 15, 20, 25}

In [45]:
important_numbers.add(15)

In [46]:
important_numbers

{5, 10, 15, 20, 25}

For this reason, sets are often used as a way to quickly remove duplicates from a list. All you need to do is pass a list to the `set()` function, and then use the `list()` function to convert the data back into a list. For example:

In [47]:
letters = ["a", "b", "c", "a", "d", "a", "b", "e"]

In [48]:
list(set(letters))

['e', 'b', 'd', 'c', 'a']
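
Notice that the order of the result is scrambled. If the order matters to you, here are two possible workarounds, sketched with a made-up list of letters: `sorted()` puts the unique items in sorted order, while `dict.fromkeys()` takes advantage of the fact that dictionary keys are unique and (in Python 3.7 and later) stay in the order they were added.

```python
letters = ["d", "a", "d", "c", "a", "b"]

# sorted() accepts a set and evaluates to a list in sorted order
alphabetical = sorted(set(letters))
print(alphabetical)

# dict.fromkeys() makes a dictionary whose keys are the items in the list;
# duplicates collapse, and the first-seen order is preserved
first_seen = list(dict.fromkeys(letters))
print(first_seen)
```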

## Tuples

Tuples (rhymes with "supple") are data structures very similar to lists. You can create a tuple using parentheses (instead of square brackets, as you would with a list):

In [49]:
t = ("alpha", "beta", "gamma", "delta")
t

('alpha', 'beta', 'gamma', 'delta')

You can access the values in a tuple in the same way as you access the values in a list: using square bracket indexing syntax. Tuples support slice syntax and negative indexes, just like lists:

In [50]:
t[-2]

'gamma'

In [51]:
t[1:3]

('beta', 'gamma')

The difference between a list and a tuple is that the values in a tuple can't be changed after the tuple is created. This means, for example, that attempting to `.append()` a value to a tuple will fail:

In [52]:
t.append("epsilon")

AttributeError: 'tuple' object has no attribute 'append'

Likewise, assigning to an index of a tuple will fail:

In [53]:
t[2] = "bravo"

TypeError: 'tuple' object does not support item assignment

### Why tuples? Why now?

"So," you think to yourself. "Tuples are just like... broken lists. That's strange and a little unreasonable. Why even have them in your programming language?" That's a fair question, and answering it requires a bit of knowledge of how Python works with these two kinds of values (lists and tuples) behind the scenes.

Essentially, tuples are *faster* and *smaller* than lists. Because lists can be modified, potentially becoming larger after they're initialized, Python has to allocate more memory than is strictly necessary whenever you create a list value. If your list grows beyond what Python has already allocated, Python has to allocate more memory. Allocating memory, copying values into memory, and then freeing memory when it's no longer needed are all (perhaps surprisingly) slow processes: slower, at least, than using data already loaded into memory when your program begins.

Because a tuple can't grow or shrink after it's created, Python knows exactly how much memory to allocate when you create a tuple in your program. That means: less wasted memory, and less wasted time allocating and deallocating memory. The cost of this decreased resource footprint is less versatility.

Tuples are often called an immutable data type. "Immutable" in this context simply means that it can't be changed after it's created.
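
You can get a rough sense of the size difference described above with `sys.getsizeof()` from the standard library. Treat this as a sketch rather than a benchmark: the exact numbers depend on your Python version and platform (I'm assuming the standard CPython interpreter), and `getsizeof()` only measures the container itself, not the values inside it.

```python
import sys

values_list = [1, 2, 3, 4, 5]
values_tuple = (1, 2, 3, 4, 5)

# size in bytes of each container (not counting the integers themselves)
list_size = sys.getsizeof(values_list)
tuple_size = sys.getsizeof(values_tuple)

print("list:", list_size, "bytes")
print("tuple:", tuple_size, "bytes")
```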

### Tuple unpacking in `for` loops

Python "unpacks" tuples in `for` loops. This means that a `for` loop can have more than one temporary variable name in the initial `for` statement; on each pass through the loop, those temporary variables will be filled, in order, with the values from the current tuple in the list that you are iterating over. For example, here we have a list of tuples:

In [86]:
names = [
    ('James', 'Kirk', 'Enterprise'),
    ('Jean-Luc', 'Picard', 'Enterprise D'),
    ('Benjamin', 'Sisko', 'Defiant'),
    ('Kathryn', 'Janeway', 'Voyager')
]

Iterating over this list in a `for` loop gives the following:

In [87]:
for item in names:
    print(item)

('James', 'Kirk', 'Enterprise')
('Jean-Luc', 'Picard', 'Enterprise D')
('Benjamin', 'Sisko', 'Defiant')
('Kathryn', 'Janeway', 'Voyager')


But if we use *several* temporary variables, separated by commas (three, in this case), the first temporary variable gets the value of the first item in the tuple, the second temporary variable gets the value of the second item, and so forth:

In [89]:
for a, b, c in names:
    print("First name:", a)
    print("Last name:", b)
    print("Vessel:", c)
    print("---")

First name: James
Last name: Kirk
Vessel: Enterprise
---
First name: Jean-Luc
Last name: Picard
Vessel: Enterprise D
---
First name: Benjamin
Last name: Sisko
Vessel: Defiant
---
First name: Kathryn
Last name: Janeway
Vessel: Voyager
---


Of course the temporary variable names can be whatever you choose. If you use more (or fewer) temporary variables than there are values in the tuple, you get an error:

In [90]:
for a, b in names:
    print(a, b)

ValueError: too many values to unpack (expected 2)

In [91]:
for a, b, c, d in names:
    print(a, b, c, d)

ValueError: not enough values to unpack (expected 4, got 3)

### Tuples in the standard library

Because tuples are faster, they're often the data type that gets returned from methods and functions in Python's built-in library. For example, the `.items()` method of the dictionary object gives you each key/value pair as a tuple (rather than, as you might otherwise expect, as a list):

In [92]:
moon_counts = {'mercury': 0, 'venus': 0, 'earth': 1, 'mars': 2}
moon_counts.items()

dict_items([('mercury', 0), ('venus', 0), ('earth', 1), ('mars', 2)])
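
Because each of these pairs is a tuple, the unpacking trick from the previous section works here too. Here's a brief sketch that loops over keys and values at the same time; the `with_moons` list is a name I've added just for the example.

```python
moon_counts = {'mercury': 0, 'venus': 0, 'earth': 1, 'mars': 2}

with_moons = []
# each item is a (key, value) tuple, unpacked into two variables
for planet, count in moon_counts.items():
    print(planet, "has", count, "moon(s)")
    if count > 0:
        with_moons.append(planet)
```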

The `tuple()` function takes a list and returns it as a tuple:

In [93]:
tuple([1, 2, 3, 4, 5])

(1, 2, 3, 4, 5)

If you want to initialize a new list with data in a tuple, you can pass the tuple to the `list()` function:

In [94]:
list((1, 2, 3, 4, 5))

[1, 2, 3, 4, 5]

### Zip and enumerate

There are two very useful functions that work with lists and tuples. The first is `enumerate()`, which takes a list and converts it to a list of tuples, where each tuple's second element is the value from the original list, and the first element is the index of that value in the list. Again, this is easier to describe with an example:

In [95]:
planets = ["Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune"]

In [96]:
list(enumerate(planets))

[(0, 'Mercury'),
 (1, 'Venus'),
 (2, 'Earth'),
 (3, 'Mars'),
 (4, 'Jupiter'),
 (5, 'Saturn'),
 (6, 'Uranus'),
 (7, 'Neptune')]

(Technically, `enumerate()` evaluates to an `enumerate` object, which we can turn into a list with the `list()` function. But it operates like a list in other contexts—see below.) This is especially useful in `for` loops, where we want both the numerical index of the item, as well as its value:

In [97]:
for i, item in enumerate(planets):
    print(i, item)

0 Mercury
1 Venus
2 Earth
3 Mars
4 Jupiter
5 Saturn
6 Uranus
7 Neptune
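
If you'd like the count to begin somewhere other than zero, `enumerate()` accepts an optional `start` parameter. Here's a short sketch that numbers the planets from one; the `numbered` list is just a name I've made up to hold the results.

```python
planets = ["Mercury", "Venus", "Earth", "Mars",
           "Jupiter", "Saturn", "Uranus", "Neptune"]

numbered = []
# start=1 makes the first index 1 instead of 0
for position, planet in enumerate(planets, start=1):
    numbered.append(str(position) + ". " + planet)

print(numbered[0])
print(numbered[-1])
```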


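The `zip()` function takes two (or more) lists and pairs up each successive item from the lists as tuples. (Like `enumerate()`, it technically evaluates to its own kind of object, which `list()` converts to a list.) This is useful for joining together two different lists into a single data structure, when you know that the items in both lists are related by index. The lists should normally be the same length: if they aren't, `zip()` stops at the end of the shortest one. For example: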

In [99]:
vessels = ["Enterprise", "Enterprise-D", "Defiant", "Voyager"]
captains = ["Kirk", "Picard", "Sisko", "Janeway"]
list(zip(captains, vessels))

[('Kirk', 'Enterprise'),
 ('Picard', 'Enterprise-D'),
 ('Sisko', 'Defiant'),
 ('Janeway', 'Voyager')]

Again, this is often used in `for` loops or list comprehensions to easily iterate over two lists at once:

In [100]:
for a, b in zip(captains, vessels):
    print(a, "is the captain of", b)

Kirk is the captain of Enterprise
Picard is the captain of Enterprise-D
Sisko is the captain of Defiant
Janeway is the captain of Voyager
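
Since `zip()` produces pairs, and a dictionary is made of pairs, you can also hand the result of `zip()` to the built-in `dict()` function to build a lookup table. The sketch below does that, and also shows what happens when one list is shorter than the other. The `vessel_for` and `short` names are my own.

```python
vessels = ["Enterprise", "Enterprise-D", "Defiant", "Voyager"]
captains = ["Kirk", "Picard", "Sisko", "Janeway"]

# dict() accepts (key, value) pairs, which is what zip() produces
vessel_for = dict(zip(captains, vessels))
print(vessel_for["Sisko"])

# with lists of different lengths, zip() stops at the shortest one
short = list(zip(captains, ["Enterprise", "Enterprise-D"]))
print(short)
```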
