# Dictionaries

[Click here to run this notebook on Colab](https://colab.research.google.com/github/AllenDowney/ElementsOfDataScience/blob/master/05_dictionaries.ipynb) or
[click here to download it](https://github.com/AllenDowney/ElementsOfDataScience/raw/master/05_dictionaries.ipynb).

In the previous chapter we used a `for` loop to read a file and count the words.  In this chapter, you'll learn about a new type called a **dictionary**, and we'll use it to count the number of unique words and the number of times each one appears.

You will also see how to select an element from a sequence (tuple, list, or array).  And you will learn a little about Unicode, which is used to represent letters, numbers, and punctuation for almost every language in the world.

## Indexing

Suppose you have a variable named `t` that refers to a list or tuple.
You can select an element using the **bracket operator**, `[]`.
For example, here's a tuple of strings:

In [2]:
t = 'zero', 'one', 'two'

To select the first element, we put `0` in brackets:

In [3]:
t[0]

'zero'

To select the second element, we put `1` in brackets:

In [4]:
t[1]

'one'


To select the third element, we put `2` in brackets:

In [5]:
t[2]

'two'

The number in brackets is called an **index** because it indicates which element we want.
Tuples and lists use zero-based numbering; that is, the index of the first element is 0.  Some other programming languages use one-based numbering.  There are pros and cons of both systems (see <https://en.wikipedia.org/wiki/Zero-based_numbering>).

The index in brackets can also be a variable:

In [6]:
i = 1
t[i]

'one'

Or an expression with variables, values, and operators:

In [7]:
t[i+1]

'two'

But if the index goes past the end of the list or tuple, you get an error.

Run this line of code in the next cell to see what happens.

```
t[3]
```

In [8]:
# t[3]
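As a rough illustration of what that error looks like, the sketch below catches it with `try`/`except`, a feature this chapter has not introduced. The variable name `error_message` is made up for this example.

```python
t = 'zero', 'one', 'two'

# A tuple with three elements has valid indices 0, 1, and 2,
# so index 3 is past the end.
try:
    t[3]
except IndexError as e:
    error_message = str(e)
    print('IndexError:', error_message)
```

Running `t[3]` directly, without `try`/`except`, stops the cell and displays the same kind of error.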

Also, the index has to be an integer; if it is any other type, you get an error.

Run these lines of code in the next cell to see what happens.

```
t[1.5]
t['1']
```

In [9]:
# t[1.5]
# t['1']
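Here is a small sketch of the same idea for non-integer indices. It assumes nothing beyond the tuple `t` from above; the names `bad_index` and `errors` are illustrative only.

```python
t = 'zero', 'one', 'two'

errors = []
for bad_index in [1.5, '1']:
    try:
        t[bad_index]
    except TypeError as e:
        # Both a float and a string are rejected as indices.
        errors.append(type(e).__name__)
        print(repr(bad_index), '->', e)
```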

**Exercise:** You can use negative integers as indices.  Try using `-1` and `-2` as indices, and see if you can figure out what they do. 

In [10]:
t[-1]

'two'

In [11]:
t[-2]

'one'
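One way to think about negative indices, sketched below, is that `t[-k]` selects the same element as `t[len(t) - k]`; in other words, negative indices count backward from the end.

```python
t = 'zero', 'one', 'two'
n = len(t)

# Each row shows a negative index, the element it selects,
# and the element at the equivalent non-negative index.
for i in [-1, -2, -3]:
    print(i, t[i], t[n + i])
```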

## Dictionaries

A dictionary is similar to a tuple or list, but in a dictionary, the index can be almost any type, not just an integer.
We can create an empty dictionary like this:

In [12]:
d = {}

Then we can add elements like this:

In [13]:
d['one'] = 1
d['two'] = 2

In this example, the indices are the strings `'one'` and `'two'`.
If you display the dictionary, it shows each index and the corresponding value. 

In [14]:
d

{'one': 1, 'two': 2}

Instead of creating an empty dictionary and then adding elements, you can create a dictionary and specify the elements at the same time:

In [15]:
d = {'one': 1, 'two': 2, 'three': 3}
d

{'one': 1, 'two': 2, 'three': 3}

When we are talking about dictionaries, an index is usually called a **key**.  In this example, the keys are strings and the corresponding values are integers.

A dictionary is also called a **map**, because it represents a correspondence, or "mapping", between keys and values.  So we might say that this dictionary maps from English number names to the corresponding integers.

You can use the bracket operator to select an element from a dictionary, like this:

In [16]:
d['two']

2

But don't forget the quotation marks.
Without them, Python looks for a variable named `two` and doesn't find one.

Try running the following line to see what happens.

```
d[two]
```

In [17]:
# d[two]
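The sketch below contrasts two different mistakes, again using `try`/`except` (not covered in this chapter) so the cell keeps running. Leaving off the quotation marks is a problem with the variable name, while a quoted key that is not in the dictionary is a problem with the lookup. The key `'four'` and the list `errors` are made up for the example.

```python
d = {'one': 1, 'two': 2, 'three': 3}
errors = []

try:
    d[two]          # no quotation marks: Python looks for a variable named two
except NameError as e:
    errors.append(type(e).__name__)
    print('NameError:', e)

try:
    d['four']       # quoted, but not a key in the dictionary
except KeyError as e:
    errors.append(type(e).__name__)
    print('KeyError:', e)
```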

To check whether a particular key is in a dictionary, you can use the special word `in`:

In [18]:
'one' in d

True

In [19]:
'zero' in d

False
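Note that `in` checks the keys of a dictionary, not the values. The following sketch shows the difference; `values` is a dictionary method that provides the values, and the three variable names are only for illustration.

```python
d = {'one': 1, 'two': 2, 'three': 3}

key_found = 'one' in d          # 'one' is a key
value_as_key = 1 in d           # 1 is a value, not a key
value_found = 1 in d.values()   # check the values explicitly

print(key_found, value_as_key, value_found)
```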

Because `in` is a reserved keyword in Python, you can't use it as a variable name.

Try this to see what happens:

```
in = 5
```

In [20]:
# in = 5

If a key is already in a dictionary, adding it again with the same value has no effect:

In [21]:
d

{'one': 1, 'two': 2, 'three': 3}

In [22]:
d['one'] = 1
d

{'one': 1, 'two': 2, 'three': 3}

But you can change the value associated with a key:

In [23]:
d['one'] = 100
d

{'one': 100, 'two': 2, 'three': 3}

You can loop through the keys in a dictionary like this:

In [24]:
for key in d:
    print(key)

one
two
three


If you want the keys and the values, one way to get them is to loop through the keys and look up the values:

In [25]:
for key in d:
    print(key, d[key])

one 100
two 2
three 3


Or you can loop through both at the same time, like this:

In [26]:
for key, value in d.items():
    print(key, value)

one 100
two 2
three 3


The `items` method loops through the key-value pairs in the dictionary; each time through the loop, they are assigned to `key` and `value`.
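To see what `items` provides, one option is to convert the result to a list, as in this sketch. Each element is a tuple with two elements, which is why the loop can assign them to two variables at once.

```python
d = {'one': 100, 'two': 2, 'three': 3}

pairs = list(d.items())
print(pairs)

# Assigning a two-element tuple to two variables unpacks it,
# which is what happens each time through the loop.
key, value = pairs[0]
print(key, value)
```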

**Exercise:** Make a dictionary with the integers `1`, `2`, and `3` as keys and strings as values.  The strings should be the words "one", "two", and "three" or their equivalents in any language you know.

Write a loop that prints just the values from the dictionary.

In [27]:
my_dict = {1: 'eins', 2: 'zwei', 3: 'drei'}

In [28]:
for key, value in my_dict.items():
    print(value)

eins
zwei
drei


## Counting Unique Words

In the previous chapter we downloaded *War and Peace* from Project Gutenberg and counted the number of lines and words.
Now that we have dictionaries, we can also count the number of unique words and the number of times each one appears.

First, let's download the book again.  
When you run the following cell, it checks to see whether you already have a file named `2600-0.txt`, which is the name of the file that contains the text of *War and Peace*.
If not, it copies the file from Project Gutenberg to your computer.  

In [30]:
from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)
    
download('https://www.gutenberg.org/files/2600/2600-0.txt')

As we did in the previous chapter, we can read the text of *War and Peace* and count the number of words.

In [31]:
fp = open('2600-0.txt')
count = 0
for line in fp:
    count += len(line.split())
    
count

566334

To count the number of unique words, we'll loop through the words in each line and add them as keys in a dictionary:

In [32]:
fp = open('2600-0.txt')
unique_words = {}
for line in fp:
    for word in line.split():
        unique_words[word] = 1

unique_words

{'\ufeffThe': 1,
 'Project': 1,
 'Gutenberg': 1,
 'eBook': 1,
 'of': 1,
 'War': 1,
 'and': 1,
 'Peace,': 1,
 'by': 1,
 'Leo': 1,
 'Tolstoy': 1,
 'This': 1,
 'is': 1,
 'for': 1,
 'the': 1,
 'use': 1,
 'anyone': 1,
 'anywhere': 1,
 'in': 1,
 'United': 1,
 'States': 1,
 'most': 1,
 'other': 1,
 'parts': 1,
 'world': 1,
 'at': 1,
 'no': 1,
 'cost': 1,
 'with': 1,
 'almost': 1,
 'restrictions': 1,
 'whatsoever.': 1,
 'You': 1,
 'may': 1,
 'copy': 1,
 'it,': 1,
 'give': 1,
 'it': 1,
 'away': 1,
 'or': 1,
 're-use': 1,
 'under': 1,
 'terms': 1,
 'License': 1,
 'included': 1,
 'this': 1,
 'online': 1,
 'www.gutenberg.org.': 1,
 'If': 1,
 'you': 1,
 'are': 1,
 'not': 1,
 'located': 1,
 'States,': 1,
 'will': 1,
 'have': 1,
 'to': 1,
 'check': 1,
 'laws': 1,
 'country': 1,
 'where': 1,
 'before': 1,
 'using': 1,
 'eBook.': 1,
 'Title:': 1,
 'Peace': 1,
 'Author:': 1,
 'Translators:': 1,
 'Louise': 1,
 'Aylmer': 1,
 'Maude': 1,
 'Release': 1,
 'Date:': 1,
 'April,': 1,
 '2001': 1,
 '[eBook': 1,
 

This is the first example we've seen with one loop **nested** inside another.

* The outer loop runs through the lines in the file.

* The inner loop runs through the words in each line.

Each time through the inner loop, we add a word as a key in the dictionary, with the value 1.  If the same word appears more than once, the assignment runs again with the same key and value, which leaves the dictionary unchanged.  So the dictionary contains only one copy of each unique word in the file.
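To watch this happen on a small scale, here is a sketch that runs the same nested loop over a short list of strings instead of a file. The list `lines` is made up for the example.

```python
lines = ['the war and the peace', 'the peace']

unique_words = {}
for line in lines:                 # outer loop: one string at a time
    for word in line.split():      # inner loop: one word at a time
        unique_words[word] = 1     # repeated words reuse the same key

print(list(unique_words))
print(len(unique_words))
```

The two strings contain seven words in total, but only four different ones, so the dictionary ends up with four keys.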

At the end of the loop, we can display the first 10 keys:

In [33]:
i = 0
for key in unique_words:
    print(key)
    i += 1
    if i == 10:
        break

The
Project
Gutenberg
eBook
of
War
and
Peace,
by
Leo


The dictionary contains all the words in the file, in order of first appearance.
But each word only appears once, so the number of keys is the number of unique words:

In [34]:
len(unique_words)

41971

It looks like there are about 42,000 different words in the book, which is substantially fewer than the total number of words, about 566,000.
But that's not quite right, because we have not taken into account capitalization and punctuation.

**Exercise:** Before we deal with that problem, let's practice with nested loops, that is, one loop inside another.
Suppose you have a list of words, like this:

In [35]:
line = ['War', 'and', 'Peace']
line_2 = 'War and Peace'

Write a nested loop that iterates through each word in the list, and each letter in each word, and prints the letters on separate lines.

In [36]:
for word in line:
    for letter in word:
        print(letter)

W
a
r
a
n
d
P
e
a
c
e


In [37]:
for word in line_2.split():
    for letter in word:
        print(letter)

W
a
r
a
n
d
P
e
a
c
e


## Dealing with Capitalization

When we count unique words, we probably want to treat `The` and `the` as the same word.  We can do that by converting all words to lower case, using the `lower` method:

In [38]:
word = 'The'
word.lower()

'the'

`lower` creates a new string; it does not modify the original string.  

In [39]:
word

'The'

However, you can assign the new string back to the existing variable, like this:

In [40]:
word = word.lower()

Now if we display the new value of `word`, we get the lowercase version:

In [41]:
word

'the'

**Exercise:** Modify the previous loop so it makes a lowercase version of each word before adding it to the dictionary.  How many unique words are there, if we ignore the difference between uppercase and lowercase?

In [42]:
fp = open('2600-0.txt')

unique_words = {}

for line in fp:
    for word in line.split():
        unique_words[word.casefold()] = 1

len(unique_words.keys())

40114
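The cell above uses `casefold` rather than `lower`. For ordinary English words the two give the same result; `casefold` is a more aggressive variant intended for comparing strings without regard to case. The sketch below shows one character where they differ; the German word is chosen only as an illustration.

```python
word = 'The'
print(word.lower(), word.casefold())   # the same for this word

german = 'Straße'
print(german.lower())      # the ß is kept
print(german.casefold())   # the ß becomes ss
```

Because the two methods can differ on some characters, the number of unique words might not be exactly the same with `lower`.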

## Removing Punctuation

To remove punctuation from the words, we can use `strip`, which removes specified characters from the beginning and end of a string.  Here's an example:

In [43]:
word = 'abracadabra'
word.strip('ab')

'racadabr'

In [44]:
word = 'abracadabra'
word.strip('ba')

'racadabr'

In this example, `strip` removes all instances of `a` and `b` from the beginning and end of the word, but not from the middle.
But note that it makes a new word; it doesn't modify the original:

In [45]:
word

'abracadabra'
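A couple more cases may help clarify the behavior. In this sketch, the argument is treated as a set of characters to remove, and `strip` works inward from each end until it reaches a character that is not in the set.

```python
word = 'abracadabra'

# Only 'a' is removed, so stripping stops at the first 'b' from each end.
only_a = word.strip('a')
print(only_a)

# With 'a', 'b', and 'r' in the set, stripping continues until 'c' and 'd'.
abr = word.strip('abr')
print(abr)
```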

To remove punctuation, we can use the `string` library, which provides a variable named `punctuation`.

In [46]:
import string

string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

`string.punctuation` contains the most common punctuation marks, but as we'll see, not all of them.
Nevertheless, we can use it to handle most cases.  Here's an example:

In [47]:
line = "It's not given to people to judge what's right or wrong."

for word in line.split():
    word = word.strip(string.punctuation)
    print(word)

It's
not
given
to
people
to
judge
what's
right
or
wrong


`strip` removes the period at the end of `wrong`, but not the apostrophes in `It's` and `what's`.
So that's good, but we have one more problem to solve.
Here's another line from the book.

In [48]:
line = 'anyone, and so you don’t deserve to have them.”'

Here's what happens when we try to remove the punctuation.

In [49]:
for word in line.split():
    word = word.strip(string.punctuation)
    print(word)

anyone
and
so
you
don’t
deserve
to
have
them.”


The comma after `anyone` is removed, but not the period and quotation mark after `them`.
The problem is that this kind of quotation mark is not in `string.punctuation`, so `strip` stops before it gets to the period.
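The following sketch checks that explanation directly: the curly quotation mark is not in `string.punctuation`, and once we add it to the characters to strip, the period is removed as well. The variable names are illustrative.

```python
import string

word = 'them.”'

# The curly closing quotation mark is not one of the ASCII punctuation marks.
in_punctuation = '”' in string.punctuation
print(in_punctuation)

# strip stops at the first character from the end that is not in the set.
before = word.strip(string.punctuation)
after = word.strip(string.punctuation + '”')
print(before, after)
```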

To fix this problem, we'll use the following code, which

1. Reads the file and builds a dictionary that contains all punctuation marks that appear in the book, then

2. Uses the `join` method to concatenate the keys of the dictionary into a single string.

You don't have to understand everything about how it works, but you should read it and see how much you can figure out.  You can read the documentation of the `unicodedata` library at <https://docs.python.org/3/library/unicodedata.html>.

In [50]:
import unicodedata

fp = open('2600-0.txt')
punc_marks = {}
for line in fp:
    for x in line:
        category = unicodedata.category(x)
        if category[0] == 'P':
            punc_marks[x] = 1
        
all_punctuation = ''.join(punc_marks)
print(all_punctuation)

,.-:[#]*/“’—‘!?”;()"%'


In [51]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [52]:
my_punc = string.punctuation + "“’—‘”"
my_punc


'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~“’—‘”'

The string `all_punctuation` contains all of the punctuation characters that appear in the document, in the order they first appear.
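To get a feel for what `unicodedata.category` returns, here is a small sketch with a few sample characters. Each category is a two-letter code, and the loop in the cell above keeps the characters whose code starts with `P`, for punctuation. The list of sample characters is chosen for illustration.

```python
import unicodedata

categories = {}
for ch in [',', '-', '”', 'a', '5']:
    categories[ch] = unicodedata.category(ch)
    print(repr(ch), categories[ch])
```

The comma, hyphen, and curly quotation mark all get codes that start with `P`; the letter and the digit do not.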

**Exercise:** Modify the word-counting loop from the previous section to convert words to lower case *and* strip punctuation before adding them to the dictionary.  Now how many unique words are there?

Optional: You might want to skip over the front matter and start with the text of Chapter 1, and skip over the license at the end, as we did in the previous chapter.

In [53]:
fp = open('2600-0.txt')

unique_words = {}
my_unique_words = {}

for line in fp:
    for word in line.split():
        their_word = word.strip(all_punctuation).casefold()
        unique_words[their_word] = 1

        my_word = word.strip(my_punc).casefold()
        my_unique_words[my_word] = 1

missing_chars = []

for key in unique_words:
    if key in my_unique_words:
        continue
    missing_chars.append(key)

missing_chars

['=', '$1', '$5,000']
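One likely explanation for these three leftover keys, offered as an interpretation rather than something the cell above proves: Unicode classifies `$` and `=` as symbols, not punctuation, so they never get into `all_punctuation`, but they are in `string.punctuation` and therefore in `my_punc`. The sketch below checks both facts.

```python
import string
import unicodedata

results = {}
for ch in ['$', '=']:
    # Record the Unicode category and whether the character
    # is one of the ASCII punctuation marks.
    results[ch] = (unicodedata.category(ch), ch in string.punctuation)
    print(ch, results[ch])
```

Under that reading, stripping with `my_punc` turns `$1` into `1`, while stripping with `all_punctuation` leaves it as `$1`.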

## Counting Word Frequencies

In the previous section we counted the number of unique words, but we might also want to know how often each word appears.  Then we can find the most common and least common words in the book.
To count the frequency of each word, we'll make a dictionary that maps from each word to the number of times it appears.

Here's an example that loops through a string and counts the number of times each letter appears.

In [54]:
word = 'Mississippi'

letter_counts = {}
for x in word:
    if x in letter_counts:
        letter_counts[x] += 1
    else:
        letter_counts[x] = 1
        
letter_counts

{'M': 1, 'i': 4, 's': 4, 'p': 2}

The `if` statement here uses a feature we have not seen before, an `else` clause.
Here's how it works.

1. First, it checks whether the letter, `x`, is already a key in the dictionary, `letter_counts`.

2. If so, it runs the first statement, `letter_counts[x] += 1`, which increments the value associated with the letter.

3. Otherwise, it runs the second statement, `letter_counts[x] = 1`, which adds `x` as a new key, with the value `1` indicating that we have seen the new letter once.

The result is a dictionary that maps from each letter to the number of times it appears.
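As an aside, the same counting pattern can be written without `if` and `else` by using the dictionary method `get`, which returns a default value when the key is missing. This is an alternative sketch, not the approach the chapter uses.

```python
word = 'Mississippi'

letter_counts = {}
for x in word:
    # get returns the current count, or 0 if x is not yet a key.
    letter_counts[x] = letter_counts.get(x, 0) + 1

print(letter_counts)
```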

To get the most common letters, we can use a `Counter`, which is similar to a dictionary.  To use it, we have to import a library called `collections`: 

In [55]:
import collections

Then we use `collections.Counter` to convert the dictionary to a `Counter`:

In [56]:
counter = collections.Counter(letter_counts)
type(counter)

collections.Counter

In [58]:
counter

Counter({'M': 1, 'i': 4, 's': 4, 'p': 2})

`Counter` provides a method called `most_common` that we can use to get the most common characters:

In [59]:
counter.most_common(3)

[('i', 4), ('s', 4), ('p', 2)]

The result is a list of tuples, where each tuple contains a character and an integer.
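A `Counter` can also do the counting itself. In the sketch below, which is a shortcut rather than the approach used above, we pass the string directly to `collections.Counter` and get the same counts without writing a loop.

```python
import collections

# Counter counts the elements of any sequence; here, the letters of a string.
counter = collections.Counter('Mississippi')
top_three = counter.most_common(3)
print(top_three)
```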

**Exercise:** Modify the loop from the previous exercise to count the frequency of the words in *War and Peace*; then print the 20 most common words and the number of times each one appears.

In [60]:
fp = open('2600-0.txt')

word_counts = collections.Counter()

for line in fp:
    for word in line.split():
        word = word.strip(my_punc).casefold()
        if word in word_counts:
            word_counts[word] += 1
        else:
            word_counts[word] = 1

word_counts.most_common(20)

[('the', 34574),
 ('and', 22147),
 ('to', 16713),
 ('of', 14994),
 ('a', 10493),
 ('he', 9811),
 ('in', 8930),
 ('his', 7965),
 ('that', 7806),
 ('was', 7332),
 ('with', 5694),
 ('had', 5354),
 ('it', 5179),
 ('her', 4697),
 ('not', 4665),
 ('him', 4571),
 ('at', 4533),
 ('i', 4106),
 ('but', 4011),
 ('on', 3996)]

**Exercise:** You can run `most_common` with no value in parentheses, like this:

```
word_freq_pairs = word_counts.most_common()
```

The result is a list of tuples, with one tuple for every unique word in the book.  Assign the result to a variable so it doesn't get displayed.  Then answer the following questions:

1. How many times does the #1 ranked word appear (that is, the first element of the list)?

2. How many times does the #10 ranked word appear?

3. How many times does the #100 ranked word appear?

4. How many times does the #1000 ranked word appear?

5. How many times does the #10000 ranked word appear?

Do you see a pattern in the results?  We will explore this pattern more in the next chapter.

In [62]:
word_freq_pairs = word_counts.most_common()

In [63]:
word_freq_pairs[0][1]

34574

In [64]:
word_freq_pairs[9][1]

7332

In [65]:
word_freq_pairs[99][1]

681

In [66]:
word_freq_pairs[999][1]

59

In [67]:
word_freq_pairs[9999][1]

2

**Exercise:** Write a loop that counts how many words appear 200 times.  What are they?  How many words appear 100 times, 50 times, and 20 times?

**Optional:** If you know how to define a function, write a function that takes a `Counter` and a frequency as arguments, prints all words with that frequency, and returns the number of words with that frequency.

In [78]:
def get_count_of_min_freq(frequencies, min_freq):
    count_of_min_freq = 0

    for word, freq in frequencies:
        if freq < min_freq:
            break

        print(word)
        count_of_min_freq += 1

    return count_of_min_freq
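The function above counts the words that appear at least `min_freq` times. If we read the exercise as asking for words that appear exactly a given number of times, a variant might look like the following sketch. The name `count_words_with_freq` and the small `example` counter are made up for illustration.

```python
import collections

def count_words_with_freq(counter, target_freq):
    """Print each word whose count equals target_freq; return how many there are."""
    n = 0
    for word, freq in counter.items():
        if freq == target_freq:
            print(word)
            n += 1
    return n

# A tiny Counter to try it on: 'a' and 'b' appear twice, 'c' and 'd' once.
example = collections.Counter(['a', 'b', 'a', 'c', 'b', 'd'])
result = count_words_with_freq(example, 2)
print(result)
```

With the word counts from the book, the call would be something like `count_words_with_freq(word_counts, 200)`.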

In [79]:
get_count_of_min_freq(word_freq_pairs, 200)

the
and
to
of
a
he
in
his
that
was
with
had
it
her
not
him
at
i
but
on
as
you
for
she
is
said
all
from
be
by
were
what
they
who
this
one
which
have
prince
so
pierre
an
or
up
them
when
did
been
there
their
no
would
if
now
only
are
me
out
my
could
natásha
will
man
more
do
andrew
about
himself
into
how
we
then
time
princess
face
french
went
some
know
after
before
eyes
your
old
very
room
thought
men
go
like
chapter
see
rostóv
began
moscow
has
again
down
well
came
come
without
asked
still
same
those
count
looked
army
say
felt
nicholas
our
first
where
away
mary
left
another
over
something
these
such
life
two
other
seemed
its
head
just
little
yes
am
day
hand
why
than
whole
don’t
people
emperor
should
back
long
even
any
own
heard
way
having
because
general
countess
must
here
look
can
napoleon
always
saw
nothing
being
made
russian
right
kutúzov
though
young
love
suddenly
off
voice
round
us
smile
moment
sónya
officer
knew
told
never
everything
whom
while
took
much
words
looking
too
house
turned


320

In [75]:
# Solution goes here
get_count_of_min_freq(word_freq_pairs, 100)


629

In [76]:
# Solution goes here
get_count_of_min_freq(word_freq_pairs, 50)


1158

In [77]:
# Solution goes here
get_count_of_min_freq(word_freq_pairs, 20)


2496

## Summary

This chapter introduces dictionaries, which represent a collection of keys and values.
We used a dictionary to count the number of unique words in a file and the number of times each one appears.

It also introduces the bracket operator, which selects an element from a list or tuple, or looks up a key in a dictionary and finds the corresponding value.

We saw some new methods for working with strings, including `lower` and `strip`. Finally, we used the `unicodedata` library to identify characters that are considered punctuation.

*Elements of Data Science*

Copyright 2021 [Allen B. Downey](https://allendowney.com)

License: [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-nc-sa/4.0/)