# Dictionaries and collections

We have seen that lists, strings and so forth have a definite, intrinsic order.  This is not necessarily a sorted order: the words of a sentence are not in alphabetical order.  This means that in order to find out, for example, if a list contains a given word, you need to go through the whole list.  That may not be a problem for a sentence, but if your list is a list of all of the sentences in the 19th C. British novel, that might take a long time.

In this notebook, we are going to look at unordered collections, which are implemented in a way that makes it very fast to find out whether some item is already a member of the collection and to put your finger on where it is in the collection.  These data structures are also used for mapping one set of items to another, as in a table of values you can look up by keywords.  

This makes sense, if you think about it.  If you are looking up a keyword or adding a new one to the table, you need to be able to find it quickly; you do not want to be seeking through each one.  So a data structure that permits fast look-ups is essential.  There are two ways of doing this.  One is to use a *sorted* list.  This permits you to find a given item quickly, because you know where it belongs.  But many lists, such as the words in a sentence, do not come in sorted order.

So most languages implement the second way: a data structure which is variously called a hash or an associative array.  Python calls this a dictionary, because it is a structure that permits you to look up a word (or other object) quickly, and find an entry for it (a value, which is another object).  But this is also a misnomer, since real dictionaries afford fast lookups because their entries are sorted, whereas the keys in Python dictionaries are not sorted.
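The two approaches can be put side by side in a minimal sketch.  The sample sentence and the helper name contains_sorted are made up for illustration; the binary search uses the standard-library bisect module.

```python
import bisect

# Hypothetical sample data, not taken from any text used in this notebook.
sentence = "the quick brown fox jumps over the lazy dog".split()

# Approach 1: keep a sorted copy and find items by binary search.
sorted_words = sorted(sentence)

def contains_sorted(items, target):
    """Return True if target occurs in the sorted list items."""
    i = bisect.bisect_left(items, target)
    return i < len(items) and items[i] == target

# Approach 2: hash-based lookup; a dict needs no sorting at all.
seen = {word: True for word in sentence}

print(contains_sorted(sorted_words, "fox"), "fox" in seen)
print(contains_sorted(sorted_words, "cat"), "cat" in seen)
```

Both give the same answers; the difference is that the dict never has to be kept in sorted order.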

## Dictionaries

A dictionary is traditionally described as an unordered set of key, value pairs.  The keys may be any hashable object (in practice, immutable objects such as strings, numbers and tuples) and the values may be any object (even another dict, which permits nested dicts).
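As a small sketch of what is and is not allowed as a key (the names point_names and nested are invented for this example):

```python
# Tuples are immutable, so they can serve as keys.
point_names = {(0, 0): 'origin', (1, 0): 'unit x'}
print(point_names[(0, 0)])

# A list is mutable, so Python refuses to use it as a key.
try:
    bad = {[0, 0]: 'origin'}
except TypeError as err:
    print('TypeError:', err)

# Values can be anything, including another dict.
nested = {'greek': {'a': 'α'}, 'count': 2}
print(nested['greek']['a'])
```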

You can create a dict with curly braces {}, and access individual items with square brackets.  The square brackets syntax is similar to lists, except that the keys are not limited to the integers from zero up to the index of the last item. 

In [None]:
letters = {'a': 'α', 
           'b': 'β', 
           'g': 'γ', 
           'd': 'δ', 
           'e': 'ε'}
letters

In [None]:
letters['b']

In [None]:
letters['z'] = 'ζ'
letters

**NB** You see that here the order of the dict has been preserved: the items stay in the order in which they have been inserted.  This is new behaviour in recent versions of Python (3.7+).  So technically dicts are no longer unordered data types.  But in most cases that won't matter.  In the lookup table above, it does not matter that we have listed the letters in alphabetical order, because you cannot access them by numerical index, or take a slice of them, and so on.
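A short sketch of what this means in practice, assuming Python 3.7 or later.  The name greek is a throwaway used here so as not to overwrite letters.

```python
greek = {'a': 'α', 'b': 'β', 'g': 'γ'}
greek['z'] = 'ζ'

# Iterating over a dict yields its keys in insertion order.
key_order = list(greek)
print(key_order)

# But 0 is treated as a key, not as a position, so this fails.
try:
    greek[0]
except KeyError as err:
    print('KeyError:', err)

# If you really need "the first key", you have to iterate.
first_key = next(iter(greek))
print(first_key)
```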

In [None]:
letters.get('g')

In [None]:
letters['g']

In [None]:
letters['h']

In [None]:
print(letters.get('h'))

In [None]:
letters.get('h', "Oh no!")
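The cells above contrast square brackets with get() when a key is missing.  The following sketch gathers the options in one place, again using a throwaway dict called greek:

```python
greek = {'a': 'α', 'b': 'β'}

# Square brackets raise KeyError for a missing key.
try:
    greek['h']
except KeyError:
    print("no entry for 'h'")

# get() returns None, or a default that you supply, instead of raising.
missing = greek.get('h')
fallback = greek.get('h', 'Oh no!')
print(missing, fallback)

# The in operator tests for a key without fetching its value.
print('h' in greek, 'a' in greek)
```

Note that get() only reports the default; it does not add 'h' to the dict.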

In [None]:
letters.keys()

In [None]:
for key in letters.keys():
    print(key, letters[key])

In [None]:
letters.items()

In [None]:
for key, value in letters.items():
    print(key, value)

items() returns a view of the dict's (key, value) pairs, each of which is a tuple.  A tuple is like a list that cannot be changed once it has been created.  It uses parentheses.

In [None]:
letters.items()
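As a sketch of how the result of items() behaves (greek, pairs and pair_list are names invented for this example): it is a live view of the dict rather than a copy, and each pair unpacks like any other tuple.

```python
greek = {'a': 'α', 'b': 'β'}
pairs = greek.items()

# The view tracks later changes to the dict.
greek['g'] = 'γ'
print(len(pairs))

# Convert to a list if you want a snapshot you can index.
pair_list = list(pairs)
key, value = pair_list[0]   # tuple unpacking
print(key, value)
```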

### Example

As an example, let's load in a text and count the number of times each word appears.  This means a dict where the keys are words and the values are the number of times we have seen each word.

In [None]:
words = {}
with open('carroll-alice.txt') as f:
    for line in f:
        for word in line.split():
            words[word] = words.get(word, 0) + 1

In [None]:
words

In [None]:
words['The']

In [None]:
import string
string.punctuation

In [None]:
words = {}
with open('carroll-alice.txt') as f:
    for line in f:
        for word in line.split():
            # Strip surrounding punctuation and fold case, so 'The' and 'the,' count together
            word = word.strip(string.punctuation).lower()
            if not word:
                continue
            words[word] = words.get(word, 0) + 1

In [None]:
words

In [None]:
words.items()

We really want these sorted by value (the count); at the moment they simply appear in the order in which each word was first seen.  We could define a function to help do this, but there is an easier way.
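For comparison, the manual route can be sketched with the built-in sorted() and a key function that picks out the count.  The sample_counts dict is made-up data standing in for the real word counts.

```python
# Hypothetical counts, standing in for the words dict built above.
sample_counts = {'the': 3, 'alice': 2, 'rabbit': 1, 'a': 2}

# Each item is a (word, count) tuple; sort on the count, largest first.
by_count = sorted(sample_counts.items(), key=lambda pair: pair[1], reverse=True)

for word, count in by_count:
    print(word, count)
```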

## Collections

Above, we started off with an empty dict, {}, and then incremented the value for each key.  Because a word we had never seen before has no entry yet, we used the second argument of the get() method to supply a default value of zero.

Always using get() with two arguments instead of accessing values by means of the key in square brackets is tiresome, so we can take a shortcut by using a modification of the dict data structure from the collections library called defaultdict.  This permits you to define a default value for the dict to return any time the key is not found.

So instead of writing as we did above:

    words[word] = words.get(word, 0) + 1

We could just write:

    words[word] += 1

This works because the first time you access a key that doesn't exist in the defaultdict, the key is created with the default value, which is zero for a defaultdict(int).

In [None]:
from collections import defaultdict
test = defaultdict(int)  # int() returns 0, so missing keys start at zero
test['the'] += 1
test['the'] += 1
test['a'] += 1
test
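One consequence worth knowing, shown here as a small sketch with a throwaway name counts: merely reading a missing key with square brackets adds that key to a defaultdict, whereas get() and the in operator do not.

```python
from collections import defaultdict

counts = defaultdict(int)
counts['the'] += 1

# Looking up a missing key returns the default AND stores it.
looked_up = counts['missing']
print(looked_up, 'missing' in counts)

# get() does not trigger the default, and adds nothing.
got = counts.get('absent')
print(got, 'absent' in counts)
```

So if you only want to check whether a word has been seen, test with in rather than indexing.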

In fact, the collections library also has a handy Counter data structure: a dict whose values are counts, where a missing key counts as zero, so you can increment it straight away.

In [None]:
from collections import Counter
test = Counter()
test['the'] += 1
test['the'] += 1
test['a'] += 1
test

In [None]:
import string
from collections import Counter

words = Counter()
with open('carroll-alice.txt') as f:
    for line in f:
        for word in line.split():
            # Strip surrounding punctuation and fold case before counting
            word = word.strip(string.punctuation).lower()
            if not word:
                continue
            words[word] += 1

In [None]:
words.most_common()
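A minimal sketch of two Counter conveniences on made-up data (the name sample and its sentence are invented here): most_common() accepts the number of entries you want, and a Counter can be built directly from any sequence of items.

```python
from collections import Counter

# Build the counts in one step from a list of words.
sample = Counter("the cat and the hat and the bat".split())

# Ask for just the two most frequent words.
top_two = sample.most_common(2)
print(top_two)

# A word that never occurred counts as zero rather than raising KeyError.
print(sample['dog'])
```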

## Nested Dictionaries

In [None]:
from collections import Counter, defaultdict
test = defaultdict(Counter)
test['Chapter 1']['the'] += 1
test['Chapter 1']['the'] += 1
test['Chapter 2']['the'] += 1
test['Chapter 2']['a'] += 1
test

In general, working with nested dicts is made much easier when you use defaultdict to define a dict recursively so that you can initialize several levels at once.

In [None]:
from collections import defaultdict
def tree(): 
    return defaultdict(tree)
taxonomy = tree()
taxonomy['Animalia']['Chordata']['Mammalia']['Carnivora']['Felidae']['Felis'] = 'cat'
taxonomy['Animalia']['Chordata']['Mammalia']['Carnivora']['Canidae']['Canis'] = 'dog'
taxonomy['Plantae']['Solanales']['Solanaceae']['Solanum'] = 'tomato'
taxonomy.keys()

In [None]:
taxonomy['Animalia'].keys()

In [None]:
taxonomy['Animalia']['Chordata']['Mammalia']['Carnivora'].keys()
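Because every lookup on such a tree creates missing branches, it can be useful to convert the finished structure into ordinary dicts.  The following is one possible sketch; to_plain and demo are hypothetical names, and the small tree is invented so as not to disturb taxonomy.

```python
from collections import defaultdict

def tree():
    return defaultdict(tree)

def to_plain(node):
    """Recursively copy a tree of defaultdicts into ordinary dicts."""
    if isinstance(node, defaultdict):
        return {key: to_plain(child) for key, child in node.items()}
    return node  # a leaf value such as 'cat'

demo = tree()
demo['Animalia']['Chordata']['Mammalia'] = 'mammals'
demo['Plantae']['Solanales'] = 'nightshades'

plain = to_plain(demo)
print(plain)
```

After the conversion, a mistyped key such as plain['Fungi'] raises KeyError instead of silently adding an empty branch.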