# Tues, 27 Sept. 2018

Continuing text analysis intro. Dictionaries and complex data types.

## Dictionaries

Last week we learned about two data types that *contain other values*: **lists**, which contain an ordered sequence of items, and **strings**, which contain an ordered sequence of characters. This week we introduce **dictionaries**, another data type that functions as a container. But order is not the organizing principle in this case:

A **dictionary** is a collection of key-value pairs. That means that each **value** in the dictionary has a unique **key** associated with it.

 - values can be any data type; keys can be any immutable type, such as strings or numbers (but not lists or dictionaries)
 - keys and values are associated using the colon (`:`)
 - key-value pairs are separated by commas
 - the whole dictionary is wrapped in curly braces (`{}`)
 
 
**Example:**

```python
results = {
    'PC': 22,
    'Lib': 21,
    'PA': 3,
    'Grn': 3,
    'NDP': 0,
}
```

<div class="alert alert-info" style="margin:1em 2em;">
⚠️ The order of the key-value pairs is not particularly important. You can add key-value pairs in any order and the resulting dictionaries will compare as equal. Since Python 3.7, a dictionary remembers the order in which its keys were added (it does not alphabetize them), but you will normally look values up by key rather than by position.
</div>
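
Here's a quick check of how order behaves, as a minimal sketch. It assumes Python 3.7 or later; the names `a` and `b` are just for illustration.

```python
# same key-value pairs, added in a different order
a = {'PC': 22, 'Lib': 21, 'PA': 3}
b = {'PA': 3, 'Lib': 21, 'PC': 22}

# equality ignores the order of the pairs
print(a == b)

# but each dictionary remembers the order in which its keys were added
print(list(a))
print(list(b))
```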

### Checking dictionary size

As with lists and strings, you can check the number of things in a dictionary using `len()`. Note that for dictionaries, this returns the number of **keys**.

```python
# number of parties for which we have results
len(results)
```

## Retrieving values

To retrieve a value from the dictionary, you must provide the appropriate key. Put it in square brackets after the dictionary name, just as you would retrieve an item from a list:

**Example:**

```python
results['PC']
```

<div class="alert alert-warning" style="margin:1em 2em">
<h4>⚠️ KeyError</h4>
<p>
    If you ask the dictionary for a key that doesn't exist in the dictionary, Python raises a <code>KeyError</code> and <strong>your program will crash</strong>.
</p>
</div>

**Example:**
```python
results['PQ']
```

To avoid this, you have some options.

### Check whether the key exists

Just like with lists, you can use the `in` keyword to check membership in the dictionary's keys.

**Example:**
```python
if 'PQ' in results:
    print('The Parti Québécois won ' + str(results['PQ']) + ' seats.')
else:
    print('The Parti Québécois did not run in the New Brunswick election.')
```

### Use `.get()` to specify a fallback value

As an alternative to putting the key in square brackets, you can also use the `.get()` method to ask a dictionary for a particular key. Although it's a bit longer to type, `.get()` has the advantage that you can specify a default value: in case the key doesn't exist, `.get()` will return the default.

😀 Then your program won't crash!

**Example:**
```python
results.get('Reform Party', 0)
```
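
A few more cases are worth trying, sketched here with the `results` dictionary repeated so the cell runs on its own. If you leave out the default, `.get()` returns `None` for a missing key, and in no case does `.get()` change the dictionary.

```python
results = {'PC': 22, 'Lib': 21, 'PA': 3, 'Grn': 3, 'NDP': 0}

# missing key, with a default: returns the fallback value
print(results.get('Reform Party', 0))

# missing key, no default: returns None instead of crashing
print(results.get('Reform Party'))

# existing key: returns the stored value and ignores the default
print(results.get('NDP', 99))

# .get() only looks; the key has not been added
print('Reform Party' in results)
```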

### Use `.setdefault()` to add the key if it doesn't exist

A related method is `.setdefault()`. Like `.get()`, it takes a key name to look for, and a default value in case the key does not exist in your dictionary. But unlike `.get()`, the `.setdefault()` method will **add the key and its value** to the dictionary.

**Example:**
```python
# ⚠️ this changes the dictionary!
results.setdefault('Ind', 0)
```
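
To see the difference from `.get()`, here is a small sketch (the variable names `added` and `existing` are mine). Note that `.setdefault()` only adds the key when it is missing; an existing value is returned untouched.

```python
results = {'PC': 22, 'Lib': 21, 'PA': 3, 'Grn': 3, 'NDP': 0}

# 'Ind' is missing, so it is added with the default value
added = results.setdefault('Ind', 0)
print(added, len(results))

# 'PC' already exists, so its value is returned and left unchanged
existing = results.setdefault('PC', 0)
print(existing, results['PC'])
```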

## Iterating over dictionaries

Just as with lists, you can use a `for` loop to iterate over the contents of a dictionary. In fact, you have a couple of options:

### Iterate over the keys

Use `for` ... `in` to iterate over the **keys** of the dictionary. Python goes through the keys in the order they were added to the dictionary (guaranteed since Python 3.7), not in alphabetical order.

**Example:**
```python
for name in results:
    seats = results[name]
    print('The ' + name + ' party won ' + str(seats) + ' seats.')
```

### Iterate over keys and values together

Use the `.items()` method in your loop to take key-value pairs one at a time.

**Example:**
```python
for name, seats in results.items():
    if seats == 0:
        msg = 'The ' + name + ' party won no seats.'
    else:
        msg = 'The ' + name + ' party won ' + str(seats) + ' seats.'
    print(msg)
```

### Other iterables: `.keys()` and `.values()`

In addition to the `.items()` method, you can also use `.keys()` or `.values()` to produce an *iterable* version of just the keys or values. This might be useful in a loop, or certain other circumstances where you want to treat e.g. all the values as a collection.

**Example:**
```python
most_seats = max(results.values())

if most_seats >= 25:
    print('One party has a majority')
else:
    print('We have a minority government.')
```
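
A couple of other things you might do with these iterables, as a sketch. The names `total_seats`, `needed` and `parties` are my own; the threshold is worked out from the data rather than typed in by hand.

```python
results = {'PC': 22, 'Lib': 21, 'PA': 3, 'Grn': 3, 'NDP': 0}

# treat all the values as one collection and add them up
total_seats = sum(results.values())

# a majority is more than half of the seats
needed = total_seats // 2 + 1

# treat all the keys as one collection and alphabetize them
parties = sorted(results.keys())

print(total_seats, 'seats in total;', needed, 'needed for a majority')
print(parties)
```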

## Use case: wordCount revisited

Let's recreate the `wordCount()` function from a couple of weeks ago, using a dictionary.

```python
def wordCount(text, normalize=False):
    '''count all the words in a text'''

    # clean up and break into words
    text = text.lower().strip()
    words = text.split()

    # dictionary to hold the tallies
    wc = {}

    # tally the words
    for word in words:
        wc[word] = wc.get(word, 0) + 1

    # optional normalization
    if normalize:
        # get the total number of words
        total = sum(wc.values())

        # calculate normalized freq per 1000 words
        for word in wc:
            wc[word] = wc[word] / total
            wc[word] = round(wc[word] * 1000, 2)
    
    return wc
```

Test it out:

```python
text = '''
    MRS. Rachel Lynde  lived just where the  Avonlea main road
    dipped down into a little  hollow, fringed with alders and
    ladies’ eardrops  and traversed  by a  brook that  had its
    source away back  in the woods of the  old Cuthbert place;
    it was reputed  to be an intricate, headlong  brook in its
    earlier course  through those woods, with  dark secrets of
    pool  and cascade;  but  by the  time  it reached  Lynde’s
    Hollow it  was a quiet, well-conducted  little stream, for
    not even a  brook could run past Mrs.  Rachel Lynde’s door
    without due  regard for  decency and decorum;  it probably
    was conscious that Mrs. Rachel  was sitting at her window,
    keeping  a  sharp  eye  on everything  that  passed,  from
    brooks and children  up, and that if  she noticed anything
    odd or  out of place  she would  never rest until  she had
    ferreted out the whys and wherefores thereof.
'''

# call the function, save the resulting dictionary
wc = wordCount(text)

# check some word counts
for word in ['that', 'mrs.', 'a', 'for']:
    print(wc[word], word)
```
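
The `normalize` option isn't exercised in the test above. Here is a minimal sketch of what it does, with a condensed copy of `wordCount()` so the cell runs on its own; the sample sentence is made up for illustration.

```python
def wordCount(text, normalize=False):
    '''count all the words in a text (condensed copy)'''
    words = text.lower().strip().split()

    # tally the words
    wc = {}
    for word in words:
        wc[word] = wc.get(word, 0) + 1

    # optionally convert tallies to frequency per 1000 words
    if normalize:
        total = sum(wc.values())
        for word in wc:
            wc[word] = round(wc[word] / total * 1000, 2)

    return wc

# made-up sample: eight words in total
sample = 'the brook and the road and the woods'

raw = wordCount(sample)
freq = wordCount(sample, normalize=True)

# raw tally next to the rate per 1000 words
for word in ['the', 'and', 'brook']:
    print(word, raw[word], freq[word])
```

Normalized counts are what let us compare texts of different lengths later on.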

### Bonus: count the entire document

Let's download the entire novel from the Internet Archive:

```python
import requests

url = 'https://ia902604.us.archive.org/18/items/anneofgreengable00045gut/anne11.txt'
anne = requests.get(url).text
```

Check that it worked -- sometimes the online repositories get angry if we all try to download the same text at once.

```python
# check first hundred chars
print(anne[:100])
```

If all's well, then proceed with counting the words:

```python
# count words
wc = wordCount(anne)

# sort by value
words = sorted(wc, key=wc.get, reverse=True)

# show top ten words
for word in words[:10]:
    print(wc[word], word)
```
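
The `sorted(wc, key=wc.get, reverse=True)` line packs a lot in, so here it is on a tiny made-up tally. Sorting a dictionary gives you a list of its keys; `key=wc.get` tells Python to rank each key by its value instead of alphabetically.

```python
# small made-up tally
wc = {'the': 3, 'brook': 1, 'and': 2}

# sorting the dictionary alone sorts the keys alphabetically
alphabetical = sorted(wc)
print(alphabetical)

# key=wc.get ranks each word by its count; reverse=True puts the biggest first
by_count = sorted(wc, key=wc.get, reverse=True)
print(by_count)
```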

# Complex data types

As we've seen, lists and dictionaries can contain other data types, like strings, numbers, Boolean values, etc. They can also contain **other lists or dictionaries**. For example, we could have a **list of lists**, or a **list of dictionaries**, or even a **dictionary whose values are lists or dictionaries**.


### 😕 Wait, why would anyone do that??

These **complex data structures** are more common (and useful) than you might guess. A **list of lists**, for example, is really good for representing a **table of values**. Suppose my Fitbit checks my heart rate five times a day for a week:

<table class="table">
<caption>Heart Rate, 1–7 Sept.</caption>
    <thead>
    </thead>
    <tbody>
<tr>
    <th>01-09-2018</th>
<td>73</td>
<td>74</td>
<td>71</td>
<td>72</td>
<td>73</td>
</tr>
<tr>
<th>02-09-2018</th>
<td>70</td>
<td>70</td>
<td>70</td>
<td>71</td>
<td>71</td></tr>
<tr>
<th>03-09-2018</th>
<td>73</td>
<td>76</td>
<td>74</td>
<td>76</td>
<td>73</td>
</tr>
<tr>
<th>04-09-2018</th>
<td>77</td>
<td>79</td>
<td>79</td>
<td>79</td>
<td>76</td>
</tr>
<tr>
<th>05-09-2018</th>    
<td>70</td>
<td>69</td>
<td>66</td>
<td>69</td>
<td>69</td>
</tr>
<tr>
    <th>06-09-2018</th>    
<td>77</td>
<td>76</td>
<td>77</td>
<td>78</td>
<td>81</td>
</tr>
<tr>
        <th>07-09-2018</th>
<td>70</td>
<td>69</td>
<td>69</td>
<td>70</td>
<td>68</td>
</tr>
    </tbody>
</table>

I can represent each row (each day in the chart) as a list of integer values. The whole week is represented as a list of rows, i.e. a list of lists.

**Example:**

```python
hr = [
    [73, 74, 71, 72, 73,],
    [70, 70, 70, 71, 71,],
    [73, 76, 74, 76, 73,],
    [77, 79, 79, 79, 76,],
    [70, 69, 66, 69, 69,],
    [77, 76, 77, 78, 81,],
    [70, 69, 69, 70, 68,],
]
```

### Getting a row from a list of lists

I can get specific days out of the chart by indexing the outer list. The result is a list.

**Example:**

```python
# day 3
hr[2]
```

And I can use it like a list.

**Example:**

```python
# get average heart rate for day 2
avg_hr = sum(hr[1]) / len(hr[1])
print('Avg hr =', avg_hr)
```

**Another example:**
```python
# iterate over all values in day 3
total_active = 0

for rate in hr[2]:
    if rate > 90:
        total_active = total_active + 1

print(total_active, 'active sessions today.')
```
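
The same idea extends to every row at once: loop over the outer list and treat each row as an ordinary list. A sketch, with `daily_avg` as a name of my choosing:

```python
hr = [
    [73, 74, 71, 72, 73],
    [70, 70, 70, 71, 71],
    [73, 76, 74, 76, 73],
    [77, 79, 79, 79, 76],
    [70, 69, 66, 69, 69],
    [77, 76, 77, 78, 81],
    [70, 69, 69, 70, 68],
]

# one average per day, in the same order as the rows
daily_avg = []

for day in hr:
    daily_avg.append(sum(day) / len(day))

print(daily_avg)
```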

### Getting one value from a list of lists

If I want to pull a specific value, I need 👉🏻 **two sets of square brackets**: one for the outer list and a second for the inner list.

**Example:**

```python
# get the first value from day 4
rate = hr[3][0]
print(rate)
```

### Iterating over all the individual values

If I want to take each value from the whole chart in turn, I need 👉🏻 **two `for` loops**: one for the outer list (i.e. taking each row in turn) and one for the inner list (i.e. each column in the row).

**Example:**

```python
# count measurements with low hr
total_lazy = 0

# loop over day, measurement
for day in hr:
    for rate in day:
        
        # check each measurement
        if rate < 70:
            total_lazy = total_lazy + 1

# print total
print(total_lazy, 'inactive sessions this week.')
```
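
The start of this section also mentioned dictionaries whose values are lists. As a sketch, the first three rows of the heart rate table could be keyed by date instead of by position; `hr_by_date` is a name chosen for illustration.

```python
# first three days of the table, keyed by date
hr_by_date = {
    '01-09-2018': [73, 74, 71, 72, 73],
    '02-09-2018': [70, 70, 70, 71, 71],
    '03-09-2018': [73, 76, 74, 76, 73],
}

# one day's readings: look up by key, get back a list
print(hr_by_date['02-09-2018'])

# one reading: the key first, then the position in the inner list
second_reading = hr_by_date['03-09-2018'][1]
print(second_reading)

# highest reading for each day
for date, rates in hr_by_date.items():
    print(date, max(rates))
```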

## Dictionaries as records

Another common use of complex data structures is to use a dictionary like a **record**, that is, a collection of labelled data that you might want to collect for each of a number of objects, events, or samples.


### Case study: two children's authors

Let's say we want to compare Lucy Maud Montgomery's children's writing to that of E. Nesbit. We might have several books that we want to look at, with some important **metadata** for each one.

<table class="table"> <thead><col><col><col width="10%"><col width="40%"> <tr> <th>author</th> <th>title</th> <th>year</th> <th>url</th> </tr> </thead> <tbody> <tr> <td>Montgomery</td> <td><em>Anne of Green Gables</em></td> <td>1908</td> <td>http://mirror.csclub.uwaterloo.ca/gutenberg/4/45/45.txt</td> </tr> <tr> <td>Montgomery</td> <td><em>Anne of Avonlea</em></td> <td>1909</td> <td>http://mirror.csclub.uwaterloo.ca/gutenberg/4/47/47.txt</td> </tr> <tr> <td>Montgomery</td> <td><em>Anne of the Island</em></td> <td>1915</td> <td>http://mirror.csclub.uwaterloo.ca/gutenberg/5/51/51.txt</td> </tr> <tr> <td>Montgomery</td> <td><em>Anne's House of Dreams</em></td> <td>1917</td> <td>http://mirror.csclub.uwaterloo.ca/gutenberg/5/4/544/544.txt</td> </tr> <tr> <td>Montgomery</td> <td><em>Rilla of Ingleside</em></td> <td>1921</td> <td>http://mirror.csclub.uwaterloo.ca/gutenberg/3/7/9/3796/3796.txt</td> </tr> <tr> <td>Montgomery</td> <td><em>Rainbow Valley</em></td> <td>1919</td> <td>http://mirror.csclub.uwaterloo.ca/gutenberg/5/3/4/5343/5343-0.txt</td> </tr> <tr> <td>Nesbit</td> <td><em>The Enchanted Castle</em></td> <td>1907</td> <td>http://mirror.csclub.uwaterloo.ca/gutenberg/3/5/3/3536/3536.txt</td> </tr> <tr> <td>Nesbit</td> <td><em>The Railway Children</em></td> <td>1906</td> <td>http://mirror.csclub.uwaterloo.ca/gutenberg/1/8/7/1874/1874-0.txt</td> </tr> <tr> <td>Nesbit</td> <td><em>The Story of the Treasure Seekers</em></td> <td>1899</td> <td>http://mirror.csclub.uwaterloo.ca/gutenberg/7/7/770/770-0.txt</td> </tr> <tr> <td>Nesbit</td> <td><em>Five Children and It</em></td> <td>1902</td> <td>http://mirror.csclub.uwaterloo.ca/gutenberg/1/7/3/1/17314/17314.txt</td> </tr> <tr> <td>Nesbit</td> <td><em>The Phoenix and the Carpet</em></td> <td>1904</td> <td>http://mirror.csclub.uwaterloo.ca/gutenberg/8/3/836/836-0.txt</td> </tr> <tr> <td>Nesbit</td> <td><em>The Story of the Amulet</em></td> <td>1906</td> <td>http://mirror.csclub.uwaterloo.ca/gutenberg/8/3/837/837-0.txt</td> 
</tr> </tbody></table>

This is like a set of **records**. Each record represents a book available online, and there are labelled **data fields** we might want to fill in for each record. A dictionary is great for this: the **keys** become the **field labels** and the **values** are the data. With one dictionary per book, the corpus becomes a **list of dictionaries**.

**Example:**

```python
corpus = [
    {
        "author": "Montgomery",
        "title": "Anne of Green Gables",
        "date": 1908,
        "url": "http://mirror.csclub.uwaterloo.ca/gutenberg/4/45/45.txt",
    },
    {
        "author": "Montgomery",
        "title": "Anne of Avonlea",
        "date": 1909,
        "url": "http://mirror.csclub.uwaterloo.ca/gutenberg/4/47/47.txt",
    },
    {
        "author": "Montgomery",
        "title": "Anne of the Island",
        "date": 1915,
        "url": "http://mirror.csclub.uwaterloo.ca/gutenberg/5/51/51.txt",
    },
    {
        "author": "Montgomery",
        "title": "Anne's House of Dreams",
        "date": 1917,
        "url": "http://mirror.csclub.uwaterloo.ca/gutenberg/5/4/544/544.txt",
    },
    {
        "author": "Montgomery",
        "title": "Rilla of Ingleside",
        "date": 1921,
        "url": "http://mirror.csclub.uwaterloo.ca/gutenberg/3/7/9/3796/3796.txt",
    },
    {
        "author": "Montgomery",
        "title": "Rainbow Valley",
        "date": 1919,
        "url": "http://mirror.csclub.uwaterloo.ca/gutenberg/5/3/4/5343/5343-0.txt",
    },
    {
        "author": "Nesbit",
        "title": "The Enchanted Castle",
        "date": 1907,
        "url": "http://mirror.csclub.uwaterloo.ca/gutenberg/3/5/3/3536/3536.txt",
    },
    {
        "author": "Nesbit",
        "title": "The Railway Children",
        "date": 1906,
        "url": "http://mirror.csclub.uwaterloo.ca/gutenberg/1/8/7/1874/1874-0.txt",
    },
    {
        "author": "Nesbit",
        "title": "The Story of the Treasure Seekers",
        "date": 1899,
        "url": "http://mirror.csclub.uwaterloo.ca/gutenberg/7/7/770/770-0.txt",
    },
    {
        "author": "Nesbit",
        "title": "Five Children and It",
        "date": 1902,
        "url": "http://mirror.csclub.uwaterloo.ca/gutenberg/1/7/3/1/17314/17314.txt",
    },
    {
        "author": "Nesbit",
        "title": "The Phoenix and the Carpet",
        "date": 1904,
        "url": "http://mirror.csclub.uwaterloo.ca/gutenberg/8/3/836/836-0.txt",
    },
    {
        "author": "Nesbit",
        "title": "The Story of the Amulet",
        "date": 1906,
        "url": "http://mirror.csclub.uwaterloo.ca/gutenberg/8/3/837/837-0.txt",
    },
]
```

### Try it out:

How many books are in the corpus?

```python
len(corpus)
```

What's the title of the third book?

```python
corpus[2]['title']
```

Iterate over the whole corpus and print out just the dates:

```python
for book in corpus:
    print(book['date'])
```
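
Two more things to try, sketched on a shortened copy of the corpus (four records, urls left out) so the cell runs on its own. The first reuses the `.get()` tally pattern from `wordCount()`; the names `by_author` and `early` are mine.

```python
# shortened copy of the corpus, for illustration only
corpus = [
    {"author": "Montgomery", "title": "Anne of Green Gables", "date": 1908},
    {"author": "Montgomery", "title": "Anne of Avonlea", "date": 1909},
    {"author": "Nesbit", "title": "The Railway Children", "date": 1906},
    {"author": "Nesbit", "title": "Five Children and It", "date": 1902},
]

# how many books does each author have?
by_author = {}
for book in corpus:
    author = book['author']
    by_author[author] = by_author.get(author, 0) + 1

print(by_author)

# which titles were published before 1908?
early = []
for book in corpus:
    if book['date'] < 1908:
        early.append(book['title'])

print(early)
```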

## Bonus: authorship attribution with Montgomery and Nesbit

Just for fun, let's see if we can tell Montgomery and Nesbit apart using function words. In the code below, I'm going to iterate over each of the books in `corpus` and:
 - download the complete text from the url
 - calculate word counts from the text
 - save counts for 'the', 'and', 'to', and 'a' to a new table
 
**Example:**

```python
# empty list for function word counts
results = []

# words to check
my_words = ['and', 'the', 'to', 'a']

# check each book in turn
for book in corpus:
    
    # download and do complete word count
    print('Downloading ' + book['title'] + '...')
    
    fulltext = requests.get(book['url']).text
    wc = wordCount(fulltext, normalize=True)
    
    # check the words of interest
    these_counts = []
    for word in my_words:
        these_counts.append(wc[word])
        
    # add the results to the master table
    results.append(these_counts)
```

What do the results look like?

**Example:**
```python
results
```

### Graphing the results: 'and' versus 'the'

Let's try a graphical approach:

```python
from matplotlib import pyplot
%matplotlib inline

# x values are 'and' counts, y values are 'the' counts
x = []
y = []

for row in results:
    x.append(row[0])
    y.append(row[1])

# plot the first six books (Montgomery) in red
pyplot.plot(x[:6], y[:6], marker='o', color='r', linestyle='')

# plot the second six books (Nesbit) in green
pyplot.plot(x[6:], y[6:], marker='o', color='g', linestyle='')
```