

# Summer 2018 Status: UNREVIEWED

# Dictionaries

Dictionaries are like lists in that they hold collections of data.  However, as we discussed with the for and while loops, they have fundamentally different purposes and so they act in very different ways.  Don't try to understand dictionaries like you understand lists.  Yes, they each hold things, but that's where the similarities stop.

Think of a dictionary as a small database.  It has two pieces:  unique IDs, and associated records.  We call these keys and values.  

## Think of an apartment building

Your apartment has a unique address.  First off, the street address tells you which unique building you want to find.  There may be several identical buildings, but each has a unique address.

Inside that apartment building there are unit IDs.  These may be letters (A, B, etc.), labels (Garden, Penthouse, etc.), or numbers (6, 303, etc.).  Many apartment buildings share the same numbering systems, and this is fine because each apartment belongs to its own building.  So the unique combination of street address and unit number can uniquely identify a unit within a city.

Inside these units you can hold whatever you want: usually household goods, food, pets, collectables, human bodies, etc.  The idea is that in order to access a unit, you must go into a building and enter a specific ID.  The IDs within a building must be unique.

You also have to access things in order.  So you can't just directly go into an apartment number.  Even if that apartment door is outside, you still have to find the building.  Imagine saying, "Hi, I live in New York City and I live in apartment 3.  The party will be at 8pm."  Will anyone be at your party?  Not so much.

In this analogy, the building is our dictionary.  The street address is our variable name, and the unit number is our key.  You use the key with the variable name to access the contents of that unit.

That's pretty much it.

In [1]:
mybuilding = {1: ['cat', 'adult human', 'adult human',
                  'human child', 'rosy boa', "Dumeril's boa"],
              2: ['adult human'],
              3: ['adult human', 'adult human', 'human child', 'dog?']}

If I'm using this to represent my apartment, there are 3 units:  `1`, `2`, and `3`.  Each of these unit labels is actually my key.  See the : in there after each key?  That is separating the key and the value, and the order does matter.  It is always `key: value`.  Meanwhile, each of my values is a list that contains strings of the occupants.

As we start covering syntax, I suggest that you write them down or make a small cheat sheet for yourself.  You'll likely want to look it up each time you want to use it until you learn them by heart.

# Common patterns that you will want to look up constantly

## Accessing values with a key

Here's how we can look up the contents:  `dictvariable[key]`

In [2]:
print(mybuilding[1])

['cat', 'adult human', 'adult human', 'human child', 'rosy boa', "Dumeril's boa"]


In [3]:
print(mybuilding[2])

['adult human']


In [4]:
print(mybuilding[3])

['adult human', 'adult human', 'human child', 'dog?']


And if I try to look up a key that isn't in there?  I get a `KeyError`.

In [5]:
print(mybuilding[4])

KeyError: 4
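
If you would rather not crash on a missing key, one option is the dictionary's `.get()` method.  Here is a minimal sketch; the small `building` dictionary is a made-up stand-in for `mybuilding` so the example runs on its own.

```python
# Hypothetical, shortened copy of the building so this example stands alone.
building = {1: ['cat', 'adult human'], 2: ['adult human']}

# .get(key) returns None instead of raising KeyError when the key is missing.
missing = building.get(4)

# .get(key, default) returns the default you supply instead of None.
empty_unit = building.get(4, [])

print(missing)              # None
print(empty_unit)           # []
print(building.get(2, []))  # ['adult human']
```

Note that `.get()` only looks things up.  It does not add unit 4 to the dictionary.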

Something to note is that values can be of any data type, and keys can be of any hashable type (strings, numbers, and tuples of those all work; a list does not).  You can have a combination of them in the same dictionary.  You just need to match how that data type is actually typed in.  Don't forget the quotes if you have a string!
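
One caveat is that a key has to be something Python can hash, which rules out lists.  The sketch below uses a made-up `mixed` dictionary (not part of the apartment example) to show which kinds of keys work and what happens when one doesn't.

```python
# Strings, numbers, and tuples all work as keys.
mixed = {'Garden': ['adult human'],
         2: ['adult human', 'cat'],
         ('Building B', 3): ['dog']}   # a tuple key: (building, unit)

print(mixed[('Building B', 3)])   # ['dog']

# A list is mutable, so Python refuses to use it as a key.
try:
    mixed[['Building B', 4]] = ['parrot']
except TypeError as err:
    list_key_error = str(err)
    print("TypeError:", list_key_error)
```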

In [6]:
anotherbuilding = {'Garden': ['Adult human'],
                    2: ['Adult human', 'Adult human', 'cat']}

## Adding a new key/value pair

Say that I learn more about my neighbors and I want to add more things in.  I can add garages as well.

The syntax for adding a key/value looks like an assignment statement extending our lookup syntax.

`dictvariable[new_key] = new_value`

So you place the new key that you want to add in the `[]` and whatever the value is after the `=`.

In [7]:
mybuilding['garage 1'] = ['car', 'innumerable crap']

In [8]:
print(mybuilding)

{1: ['cat', 'adult human', 'adult human', 'human child', 'rosy boa', "Dumeril's boa"], 2: ['adult human'], 3: ['adult human', 'adult human', 'human child', 'dog?'], 'garage 1': ['car', 'innumerable crap']}


Take a closer look at that syntax.  It has the `[]` with the new key on the left side of the `=` statement and only 1 `=`.  So I don't need to completely reassign my dictionary to change it.

## Changing the value with a given key

Warning!  This syntax is also shared for the reassignment.  If you reuse a key that is in the dictionary, it will overwrite that value without warning.

In [9]:
mybuilding['garage 1'] = ['car', 'bike', 'bike', 'innumerable crap']

In [10]:
print(mybuilding)

{1: ['cat', 'adult human', 'adult human', 'human child', 'rosy boa', "Dumeril's boa"], 2: ['adult human'], 3: ['adult human', 'adult human', 'human child', 'dog?'], 'garage 1': ['car', 'bike', 'bike', 'innumerable crap']}


Alternatively, since the value for each key is in fact a mutable object (a list), we can directly reference that object using our access syntax and alter it in place.

Say that the person in apartment 2 has a baby and I want to add that to their record.  Since the data type of that value is a list, I can call .append() to that lookup statement.

In [11]:
mybuilding[2].append('human child')

In [12]:
print(mybuilding)

{1: ['cat', 'adult human', 'adult human', 'human child', 'rosy boa', "Dumeril's boa"], 2: ['adult human', 'human child'], 3: ['adult human', 'adult human', 'human child', 'dog?'], 'garage 1': ['car', 'bike', 'bike', 'innumerable crap']}


## Checking to see if a key is in there

You'll often be looping over some data and building up a dictionary in the process.  Oftentimes you'll want to handle things differently if that key is not already in the dictionary.  For example, let's say I want to add some new people to a unit not already in my dictionary.

In [13]:
mybuilding[5].append('adult human')

KeyError: 5

We can check this with the `in` keyword that we've seen elsewhere.  This will create a boolean expression that will return `True` if the key does exist in that dictionary, and `False` if it doesn't.

In [14]:
5 in mybuilding

False
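
Keep in mind that `in` only looks at the keys, never at the values.  A small sketch of the difference, using a made-up `building` dictionary so it runs on its own:

```python
# Hypothetical, shortened copy of the building so this example stands alone.
building = {1: ['cat', 'adult human'], 2: ['adult human']}

print(1 in building)        # True: 1 is a key
print('cat' in building)    # False: 'cat' is inside a value, not a key

# To search the values, look inside each unit's list yourself.
units_with_cat = []
for unit in building:
    if 'cat' in building[unit]:
        units_with_cat.append(unit)

print(units_with_cat)       # [1]
```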

I can't append something to a key that doesn't already exist, but if I use my assignment statement I'll destroy the data I already have in there.  Here's a fragment showing how we might handle such a situation.

In [15]:
new_member = 'adult human'
key = 5

if key in mybuilding:
    mybuilding[key].append(new_member)
else:
    mybuilding[key] = [new_member]

In [16]:
print(mybuilding)

{1: ['cat', 'adult human', 'adult human', 'human child', 'rosy boa', "Dumeril's boa"], 2: ['adult human', 'human child'], 3: ['adult human', 'adult human', 'human child', 'dog?'], 'garage 1': ['car', 'bike', 'bike', 'innumerable crap'], 5: ['adult human']}


This is an incredibly common pattern.  You'll also notice that I can reference my key as a variable.  Hint hint, this means you can loop over a set of keys and access the values in turn.
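
Because this check-then-append pattern comes up so often, dictionaries have a `.setdefault()` method that can fold it into one line.  This is only an alternative sketch, not a replacement for understanding the `if`/`else` version; the `building` and `new_arrivals` data are made up for illustration.

```python
# Hypothetical starting data so this example stands alone.
building = {2: ['adult human']}

# (unit, occupant) pairs we want to record.
new_arrivals = [(2, 'human child'), (5, 'adult human'), (5, 'cat')]

for unit, occupant in new_arrivals:
    # setdefault returns the existing list for this unit, or stores and
    # returns a new empty list if the unit is not in the dictionary yet.
    building.setdefault(unit, []).append(occupant)

print(building)
# {2: ['adult human', 'human child'], 5: ['adult human', 'cat']}
```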

## Getting the keys and values out separately

There are several helpful dictionary methods to use.

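Warning!  Don't lean on the order of a dictionary.  Since Python 3.7 a dictionary keeps its keys in the order they were inserted, but older versions made no such promise, and a dictionary is never sorted for you.  The following methods give you view objects that you can turn into lists with `list()`.  Lists have order, but that order should not be depended on.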

`mydict.keys()` will give you the keys and `mydict.values()` will give you the values.

In [17]:
print(list(mybuilding.keys()))

[1, 2, 3, 'garage 1', 5]


In [18]:
print(list(mybuilding.values()))

[['cat', 'adult human', 'adult human', 'human child', 'rosy boa', "Dumeril's boa"], ['adult human', 'human child'], ['adult human', 'adult human', 'human child', 'dog?'], ['car', 'bike', 'bike', 'innumerable crap'], ['adult human']]


I know they may look like they are in order, but that is just the order the keys were added in (guaranteed only in Python 3.7 and later), so you shouldn't depend on it.
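
If you need a predictable order, one option is to sort the keys yourself before using them.  A minimal sketch with a made-up `rent` dictionary (the numbers are invented):

```python
# Hypothetical data: unit label -> monthly rent.
rent = {'Penthouse': 3000, 'Garden': 1200, 'Loft': 1800}

# sorted() on a dictionary gives a sorted list of its keys.
units_in_order = sorted(rent)
print(units_in_order)   # ['Garden', 'Loft', 'Penthouse']

for unit in units_in_order:
    print(unit, rent[unit])

# Caveat: this only works when the keys can be compared with each other.
# A dictionary mixing int and str keys cannot be sorted this way in Python 3.
```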

## Getting the keys and values out together

Here's how you can get the pairs of data out such that the pair relationship can be reliably maintained.

In [19]:
print(list(mybuilding.items()))

[(1, ['cat', 'adult human', 'adult human', 'human child', 'rosy boa', "Dumeril's boa"]), (2, ['adult human', 'human child']), (3, ['adult human', 'adult human', 'human child', 'dog?']), ('garage 1', ['car', 'bike', 'bike', 'innumerable crap']), (5, ['adult human'])]


These are tuple pairs, and are valuable when looping over them.

## Looping over dictionaries

You may want to loop over the entire thing at once.  Using `.items()` for this is great for when you need both the key and value items out at the same time, and you want to keep your code very tidy.

In [20]:
for key, value in mybuilding.items():
    print("Unit", key, "has", ", ".join(value))

Unit 1 has cat, adult human, adult human, human child, rosy boa, Dumeril's boa
Unit 2 has adult human, human child
Unit 3 has adult human, adult human, human child, dog?
Unit garage 1 has car, bike, bike, innumerable crap
Unit 5 has adult human


However, sometimes you might want to mess with the keys and only look up a few things.  (massive homework hint) In this case, you can extract out the list of keys, do what you need with it, and then loop through those remaining (LITERALLY DYING OF HINT HERE) keys.

Say that I only want the integer value keys.

In [21]:
allkeys = mybuilding.keys()

justints = []

for key in allkeys:
    if type(key) == int:
        justints.append(key)
    else:
        print("rejected!", key)
        
print("Remaining keys are:", justints)

for key in justints:
    print(key, "has:", ", ".join(mybuilding[key]))

rejected! garage 1
Remaining keys are: [1, 2, 3, 5]
1 has: cat, adult human, adult human, human child, rosy boa, Dumeril's boa
2 has: adult human, human child
3 has: adult human, adult human, human child, dog?
5 has: adult human
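
The same filtering can be written more compactly with a list comprehension and `isinstance()`.  This is just one possible shorthand, shown on a made-up, shortened `building` dictionary:

```python
# Hypothetical, shortened copy of the building so this example stands alone.
building = {1: ['cat'], 2: ['adult human'], 'garage 1': ['car'], 5: ['adult human']}

# Keep only the keys that are integers.
just_ints = [key for key in building if isinstance(key, int)]
print("Remaining keys are:", just_ints)   # Remaining keys are: [1, 2, 5]

for key in just_ints:
    print(key, "has:", ", ".join(building[key]))
```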


## Counting occurrences

There are several things we can do to count occurrences, but I suggest using the `Counter` object first.

Let's revisit last week's example of randomly counting to 100.  Below we've got code that runs the simulation 10,000 times and collects up the number of steps taken for each.

In [22]:
import random

# priming all the values
count_sum = 0
top = 100
step_count = 0

all_step_counts = []

for i in range(10000):
    while count_sum < top:
        my_num = random.randint(0, 10)

        count_sum += my_num # count_sum = count_sum + my_num
        step_count += 1 # step_count = step_count + 1
#     print(count_sum, step_count)
    all_step_counts.append(step_count)
    # reset!
    count_sum = 0
    step_count = 0

print("The average number of steps to take is:", sum(all_step_counts) / len(all_step_counts))

The average number of steps to take is: 20.5947


# Make a dictionary of the data

In [24]:
counts_dict = {} # make an empty dictionary

for one_count in all_step_counts:
    if one_count not in counts_dict:
        counts_dict[one_count] = 1 # start at one if it isn't already there
    else:
        counts_dict[one_count] += 1 # increment up by one if it is already there

In [25]:
print(counts_dict)

{24: 596, 17: 753, 22: 1156, 27: 158, 18: 1080, 20: 1385, 23: 834, 19: 1274, 21: 1314, 15: 175, 26: 251, 16: 390, 25: 418, 30: 19, 29: 31, 28: 83, 14: 64, 31: 7, 34: 2, 13: 6, 32: 3, 33: 1}
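
That printout is hard to read because the keys come out in whatever order they were first seen.  Here is a sketch of one way to tidy it up: sort the keys and draw a small text histogram.  It also checks the hand-built dictionary against `Counter`.  The `sample_step_counts` list is a tiny made-up stand-in for `all_step_counts` so the result is repeatable.

```python
from collections import Counter

# A tiny made-up stand-in for all_step_counts.
sample_step_counts = [20, 19, 21, 20, 22, 20, 19]

# Manual tally, same idea as the if/else cell, written with .get().
manual_counts = {}
for one_count in sample_step_counts:
    manual_counts[one_count] = manual_counts.get(one_count, 0) + 1

# Counter does the same tallying in one call.
counter_counts = Counter(sample_step_counts)
print(manual_counts == dict(counter_counts))   # True

# Text histogram, sorted by number of steps.
for steps in sorted(manual_counts):
    print(steps, '#' * manual_counts[steps])
```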


# But what can we do with this in text?

### Problem Statement
From the book DRACULA, the contents of the book itself, count every word in each chapter. How many of them are repeated? What is the count of each word repeated in Dracula?

------

To figure this problem out, we have to think about a few things.

#### 1. What is our fidelity?
Our unit of measure in this particular problem is a word. Specifically, we want to break things down into a list of lines so that we can then split those up into a list of words. Additionally, we want *only* the book Dracula. This means that we will need to deal with getting the book separated from the table of contents as well as the header and footer. We also need to deal with the infamous *NOTE*. So we have that already!

In [17]:
# we open up the Dracula text AND readlines() so that we can get all the lines into a list.
dracula = open('dracula.txt', 'r')
all_text = dracula.readlines()
dracula.close() # remember to close the file!

# alternatively, you could also do something like:

# with open('dracula.txt', 'r') as dracula:
#     all_text = dracula.readlines()

# This would open the text file, bind it to a variable called dracula, read all of
# its lines into a variable called all_text, AND close the file automatically.

# we then need to get all of dracula accounted for. You will recall this example from class!
start_index = all_text.index('DRACULA\n')
end_index = all_text.index('                                THE END\n')+1
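
To see what those two `.index()` calls are doing without needing the real file, here is a small sketch on a made-up list of lines.  The names `sample_lines`, `start`, `end`, and `book_only` are invented for this illustration.

```python
# A tiny made-up stand-in for the list that readlines() would return.
sample_lines = ['Project header\n',
                'DRACULA\n',
                'CHAPTER I\n',
                'Some story text.\n',
                'THE END\n',
                'License footer\n']

start = sample_lines.index('DRACULA\n')     # position of the first line we want
end = sample_lines.index('THE END\n') + 1   # + 1 so the slice includes 'THE END'

book_only = sample_lines[start:end]
print(book_only)
# ['DRACULA\n', 'CHAPTER I\n', 'Some story text.\n', 'THE END\n']
```

Note that `.index()` needs an exact match, including spaces and the `\n`, and raises a `ValueError` if the line isn't found.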

#### 2. How do we get what we want?

The problem asks, "For every word in Dracula, how many of them are repeated? What is the count of each word repeated in Dracula?"

We already marked off the whole of Dracula with `start_index` and `end_index`, so how do we go on from there? Well, we need to think about what `readlines()` did! It put every line of the file into a giant list for us. As it is, we can add to this code a for loop that will split up the list entries into words.

In [20]:
# First, we need to get everything into one space. We'll make a variable for it.

content_lines = all_text[start_index:end_index] #everything in the book.

# then we need to split stuff up! That means we take each line and .split it on spaces. SO:

# for line in content_lines:
#     each_word = line.split()
    
# But we need to consider actually collecting them so we need to add an accumulator of some kind or another. 

for line in content_lines:
    each_word = line.lower().split() #we're adding .lower here so that every word is uniform. 

# But how do we actually count these words?
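
As written, the loop above throws away each line's words as soon as it moves to the next line.  Here is a minimal sketch of adding an accumulator, using two made-up lines in place of `content_lines`:

```python
# Made-up lines standing in for content_lines.
sample_content = ['The castle was dark.\n', 'The Count was waiting.\n']

all_words = []   # accumulator: collects the words from every line
for line in sample_content:
    each_word = line.lower().split()
    all_words.extend(each_word)   # add this line's words to the running list

print(all_words)
# ['the', 'castle', 'was', 'dark.', 'the', 'count', 'was', 'waiting.']
```

Notice that punctuation stays glued to the words (`'dark.'`), since `.split()` only breaks on whitespace.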

#### 3. How to count things?

There are 2 ways that you can count things. We'll cover one version without using a dictionary and another with dictionaries. First, we need to think about how to count things and we can do that with a module. 

We've talked about these in a bunch of different places but we've never really addressed them directly. We've used the [String Module](https://docs.python.org/2/library/string.html). In particular, we used 'string.punctuation' for getting things like commas and periods from the text itself. 

There are other modules that ship with Python, and together we call them the [Python Standard Library](https://docs.python.org/3/library/). One of these modules provides specialized container data types, and it is called `collections`. You can read about it here: [http://pymbook.readthedocs.io/en/latest/collections.html](http://pymbook.readthedocs.io/en/latest/collections.html). 

Of interest is `Counter`, which is described like this: "Counter is a dict subclass which helps to count hashable objects. Inside it elements are stored as dictionary keys and counts are stored as values which can be zero or negative."
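
Before pointing it at a whole novel, here is a small sketch of how a `Counter` behaves on a short made-up list of pets:

```python
from collections import Counter

pets = ['cat', 'dog', 'cat', 'rosy boa', 'cat', 'dog']
pet_counts = Counter(pets)   # tallies every item in the list

print(pet_counts['cat'])          # 3
print(pet_counts['parrot'])       # 0: a missing key counts as zero, no KeyError
print(pet_counts.most_common(2))  # [('cat', 3), ('dog', 2)]
```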

In [31]:
from collections import Counter  # Counter lives in the collections module

word_counts = Counter()
for line in content_lines:
    words = line.lower().split()
    for word in words:
        word_counts[word] += 1  # a Counter treats a missing key as 0

for word, freq in word_counts.most_common(50):
    print(word, '\t|', str(freq).zfill(4))

the 	| 7777
and 	| 5683
i 	| 4495
to 	| 4420
of 	| 3574
a 	| 2869
he 	| 2507
in 	| 2410
that 	| 2349
was 	| 1801
it 	| 1717
as 	| 1552
we 	| 1484
for 	| 1459
his 	| 1445
is 	| 1428
not 	| 1290
with 	| 1259
my 	| 1211
you 	| 1123
at 	| 1064
have 	| 1041
all 	| 1029
be 	| 1023
had 	| 1019
but 	| 0998
so 	| 0983
on 	| 0959
her 	| 0857
me 	| 0847
she 	| 0794
when 	| 0743
there 	| 0693
* 	| 0640
which 	| 0630
if 	| 0615
from 	| 0611
him 	| 0594
this 	| 0544
are 	| 0543
were 	| 0529
by 	| 0482
could 	| 0460
they 	| 0447
or 	| 0442
then 	| 0437
some 	| 0434
must 	| 0428
one 	| 0413
what 	| 0413


#### But what if we want to use dictionaries directly? What if we want to count everything by hand instead of using a module?

In this case, things get a little tricky. 

In [32]:
# Let's open up the dracula text and work with it. 
stopwords = ['the', 'of', 'and', 'to', 'a', 'in', 'that', 'was', 'for', 'it', 'i']  # among the most common words in English


def main():
    dracula = open('dracula.txt', 'r')
    all_text = dracula.readlines()
    dracula.close()

    start_index = all_text.index('DRACULA\n')
    end_index = all_text.index('                                THE END\n')+1

    content_lines = all_text[start_index:end_index]

    # See the 2CR answer for a simpler Counter-based solution.
    word_counts = {}
    for line in content_lines:
        words = line.lower().split()
        for word in words:
            if word not in stopwords:
                word_counts[word] = word_counts.get(word, 0) + 1

    # Calling items() and casting to list results in a list of tuples, with each tuple containing the (key,
    # value). For word counts, this looks like [('happy', 3), ('nail-studded', 1), ('lucy', 141), ...]. A dictionary
    # isn't directly sortable, but turning into a list in this way makes the elements sortable.
    word_count_tuples = list(word_counts.items())

    # sort() will call the byFreq() function once for each element in the list that's being sorted. In this case,
    # each element is an individual tuple (see above). The value returned by byFreq() will be the basis of the sorting,
    # so we want to make sure we return the count.
    word_count_tuples.sort(key=byFreq, reverse=True)
    for word, freq in word_count_tuples[:100]:  # limit to the top 100 after sorting
        print(word, freq)


# As mentioned above, the value returned by byFreq() will be the basis of the sorting. Since we want the list to be
# sorted on the basis of the word counts, and since the word count is at index 1 of each (word, count) tuple,
# we return the value at index 1.
def byFreq(pair):
    return pair[1]


main()


he 2507
as 1552
we 1484
his 1445
is 1428
not 1290
with 1259
my 1211
you 1123
at 1064
have 1041
all 1029
be 1023
had 1019
but 998
so 983
on 959
her 857
me 847
she 794
when 743
there 693
* 640
which 630
if 615
from 611
him 594
this 544
are 543
were 529
by 482
could 460
they 447
or 442
then 437
some 434
must 428
one 413
what 413
shall 411
would 409
our 403
no 400
will 398
been 375
do 367
may 365
up 364
can 347
has 340
see 337
out 332
more 329
am 322
van 317
know 305
them 300
came 295
said 293
an 292
your 292
went 291
us 287
into 287
over 285
me, 284
any 280
only 277
who 270
very 264
did 263
now 263
like 261
come 252
time 249
go 244
seemed 238
before 235
even 232
took 220
such 218
their 218
than 216
helsing 215
about 213
though 213
saw 206
through 205
after 201
where 201
down 200
back 199
should 190
how 185
made 185
poor 182
tell 177
think 177
looked 176
much 174
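
You may have spotted `me,` in that list: because we only split on whitespace, punctuation stays attached and `me,` is counted separately from `me`.  Here is one possible way to handle that with `string.punctuation`, sketched on two made-up lines rather than the real text.  It also uses `sorted()` with a small `lambda` in place of the `byFreq()` helper; both return the count at index 1 of each tuple.

```python
import string

# Made-up lines standing in for content_lines.
sample_lines = ['He looked at me, and I looked at him.\n',
                'Tell me, he said.\n']

word_counts = {}
for line in sample_lines:
    for word in line.lower().split():
        word = word.strip(string.punctuation)   # 'me,' becomes 'me'
        if word:                                # skip tokens that were only punctuation
            word_counts[word] = word_counts.get(word, 0) + 1

# sorted() with a key function does the same job as .sort(key=byFreq).
top_words = sorted(word_counts.items(), key=lambda pair: pair[1], reverse=True)
print(top_words[:3])
```

Note that `.strip()` only removes punctuation from the two ends of a word, so hyphenated words like `nail-studded` stay in one piece.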
