# Dictionaries

Today we'll learn about our last core Python datatype, dictionaries. 

## Why?

Dictionaries let us store many kinds of data, including word counts. They store information as `items`: pairs made up of a `key` and a `value`:

```python
word_counts = {
    'to' : 2,
    'be' : 2,
    'or' : 1,
    'not': 1
}
```

## Why are they called "dictionaries?"

Python dictionaries work like real-world dictionaries.

If you want to know what the word "syzygy" means, you flip to the word, and find this:
```
syz·y·gy | ˈsizijē |
noun (plural syzygies) Astronomy
a conjunction or opposition, especially of the moon with the sun: the planets were aligned in syzygy.
```
In Python parlance, the word is the `key` and its definition is the `value`.

In a Python dictionary, the same definition might be expressed like this:

In [90]:
# Create an empty dictionary. Just like initializing an empty list!
my_dict = {}

# Add an entry
my_dict['syzygy'] = 'a conjunction or opposition, especially of the moon with the sun'

In [91]:
my_dict

{'syzygy': 'a conjunction or opposition, especially of the moon with the sun'}

In [92]:
len(my_dict)

1

## Structure

A dictionary is indicated by `{}`. Each `key` is on the left of a `:`, with its `value` on the right side of the `:`. Each `item` in the dictionary, which is one pair of a key and a value, is separated by a `,` just like in lists.

You could manually create this dictionary like so:

In [3]:
my_dict = {'syzygy' : 'a conjunction or opposition, especially of the moon with the sun'}

In [4]:
my_dict

{'syzygy': 'a conjunction or opposition, especially of the moon with the sun'}

# Looking up values in your dictionary
In the above example, `'syzygy'` is the `key`, and its definition is the `value`.

We use the `key` to retrieve the `value`:

In [5]:
my_dict['syzygy']

'a conjunction or opposition, especially of the moon with the sun'

Dictionary keys are unique; values are not. Watch what happens if we do this:

In [6]:
my_dict['syzygy'] = 'a rare sneeze'

In [7]:
my_dict

{'syzygy': 'a rare sneeze'}

The `value` of the dictionary can be overwritten or updated while the key stays in place. Like lists, dictionaries are mutable (meaning they can be changed in place).

So we can fix it the same way:

In [8]:
my_dict['syzygy'] = 'a conjunction or opposition, especially of the moon with the sun'

In [9]:
my_dict

{'syzygy': 'a conjunction or opposition, especially of the moon with the sun'}

# Dictionaries can contain any object we've discussed

In [10]:
d = {'a string': 'like this'}
d['a string']

'like this'

In [192]:
d = {'an integer' : 600}
d['an integer']

600

In [95]:
d = {'a list': [1,2,3]}
d['a list']

[1, 2, 3]

# Making Dictionaries
Since they are mutable, dictionaries can be updated in place, just like `appending` to a list:

In [197]:
my_dict

{'syzygy': 'a conjunction or opposition, especially of the moon with the sun'}

In [88]:
my_dict['curlicue'] = 'a decorative curl or twist in calligraphy or in the design of an object'

In [89]:
len(my_dict)

2

`my_dict` now holds two items, separated by a `,` after the definition for `syzygy`.

## What happens if you request a `key` that is not in the dictionary?

In [96]:
my_dict['plunder']

KeyError: 'plunder'

Alternatively, you can look up a key with the `get` method, which returns a default value instead of raising an error when the key does not exist:

In [100]:
my_dict.get('plunder', 'no definition found') # the second argument is the default value, returned if the key does not exist

'no definition found'
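Here is a minimal sketch of how `get` behaves with and without a default. The dictionary `definitions` and its contents are made up for illustration and are not part of the lesson's data:

```python
# A small stand-in dictionary; the entry is illustrative only.
definitions = {'syzygy': 'a conjunction or opposition'}

# Without a second argument, get returns None for a missing key.
missing = definitions.get('plunder')

# With a second argument, that value is returned for a missing key.
fallback = definitions.get('plunder', 'no definition found')

# For a key that exists, get behaves like square-bracket lookup.
found = definitions.get('syzygy', 'no definition found')

print(missing)
print(fallback)
print(found)
```

Unlike `definitions['plunder']`, none of these calls raises a `KeyError`, which makes `get` handy inside loops.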

# Deleting entries
Deleting dictionary `pairs` uses a similar syntax to the one we saw with lists:

In [101]:
del my_dict['syzygy']

Using `del` gets rid of *both* the `key` and the `value`.

In [103]:
my_dict

{'curlicue': 'a decorative curl or twist in calligraphy or in the design of an object'}

# Getting dictionary `keys` and `values`
Sometimes you may want all of your keys and/or values at once. The `keys` and `values` methods return list-like views of them, which are especially useful for looping:

In [104]:
hp = {'Hermione' : 'Gryffindor',
    'Severus': 'Slytherin',
    'Luna': 'Ravenclaw',
    'Cedric': 'Hufflepuff'}

In [105]:
hp.keys()

dict_keys(['Hermione', 'Severus', 'Luna', 'Cedric'])

In [106]:
hp.values()

dict_values(['Gryffindor', 'Slytherin', 'Ravenclaw', 'Hufflepuff'])

Lastly, you can also have your dictionary export all of its pairs using `items`:

In [96]:
hp.items()

dict_items([('Hermione', 'Gryffindor'), ('Severus', 'Slytherin'), ('Luna', 'Ravenclaw'), ('Cedric', 'Hufflepuff')])

The pairs of items grouped by `()` are called tuples.
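As a small sketch of what a tuple lets you do, the example below turns the items into a list and unpacks one pair into two names. The dictionary `houses` is a shortened, illustrative copy of the one above:

```python
# Illustrative data in the same shape as the example above.
houses = {'Hermione': 'Gryffindor', 'Luna': 'Ravenclaw'}

# list() turns the view returned by items() into a list of tuples.
pairs = list(houses.items())
first_pair = pairs[0]

# A tuple can be unpacked into separate names in a single assignment.
name, house = first_pair

print(pairs)
print(name, house)
```

This unpacking is the same mechanism that makes `for key, value in ...` work when looping over `items()`.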

# Looping over dictionary keys and values

In [107]:
for key, value in hp.items():
    print('key: ', key)
    print('value: ', value)
    print(key + '\'s house is ' + value)
    print('-'*50)

key:  Hermione
value:  Gryffindor
Hermione's house is Gryffindor
--------------------------------------------------
key:  Severus
value:  Slytherin
Severus's house is Slytherin
--------------------------------------------------
key:  Luna
value:  Ravenclaw
Luna's house is Ravenclaw
--------------------------------------------------
key:  Cedric
value:  Hufflepuff
Cedric's house is Hufflepuff
--------------------------------------------------


# Booleans and dictionaries
You can find out whether specific keys or values are in your dictionary using boolean expressions:

In [19]:
'Harry' in hp

False

In [20]:
'Luna' in hp

True

Checking whether a string is `in` a dictionary evaluates against its `keys`. So:

In [21]:
'Slytherin' in hp

False

In [23]:
# but this works because it checks values
'Slytherin' in hp.values()

True

# Combining dictionaries, lists, and loops to count words
We can combine these to accurately count and store values for each unique word in any text.

In [26]:
# pretend this is a list of words
my_string = 'the cat sat on the mat'
my_list = my_string.split(' ')
my_list

['the', 'cat', 'sat', 'on', 'the', 'mat']

In [108]:
my_dict = {}

for x in my_list:
    if x not in my_dict: # this checks to see if the word is not yet in my_dict
        my_dict[x] = 1
    else: # this increments with every subsequent instance of the word
        my_dict[x] += 1

In [113]:
my_dict = {}
for x in my_list:
    if x in my_dict:
        my_dict[x] += 1
    else:
        my_dict[x] = 1

In [114]:
my_dict

{'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}

Now we can use the same methods as before to find out how many instances of a word appeared in our text:

In [115]:
my_dict['the']

2

In [116]:
type(my_dict['the'])

int

In [30]:
my_dict['cat']

1

Why write the loop above that way? Python raises a `KeyError` if we try to increment a key that doesn't exist yet, which stops the loop:

In [110]:
other_dict = {}

for x in my_list:
    other_dict[x] += 1

KeyError: 'the'
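There are shorter ways to avoid this error that the lesson does not use. The sketch below shows two of them, offered as alternatives rather than as the approach taken above: `get` with a default of `0`, and `Counter` from the standard library's `collections` module.

```python
from collections import Counter

words = 'the cat sat on the mat'.split(' ')

# get(word, 0) supplies 0 for unseen words, so no KeyError is raised.
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1

# Counter does the same bookkeeping in one call.
counter = Counter(words)

print(counts)
print(counter['the'])
```

Both produce the same counts as the `if`/`else` version; the explicit version is still worth knowing because it shows what is happening at each step.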

## Calculating scaled frequencies in a dictionary
We can use our dictionary to calculate the scaled frequencies of each of our words easily:

In [118]:
sum(my_dict.values())

6

In [120]:
total = sum(my_dict.values()) # remember that the counts are values, the words are keys

print('total number of words in list: ', total)
print('-'*50)

scaled_dict = {}

for key in my_dict.keys():
    print('key: ', key)
    print('my_dict[key]:', my_dict[key])
    scaled_dict[key] = my_dict[key] / total
    print('scaled_dict[key]:', scaled_dict[key])
    print('-'*50)

total number of words in list:  6
--------------------------------------------------
key:  the
my_dict[key]: 2
scaled_dict[key]: 0.3333333333333333
--------------------------------------------------
key:  cat
my_dict[key]: 1
scaled_dict[key]: 0.16666666666666666
--------------------------------------------------
key:  sat
my_dict[key]: 1
scaled_dict[key]: 0.16666666666666666
--------------------------------------------------
key:  on
my_dict[key]: 1
scaled_dict[key]: 0.16666666666666666
--------------------------------------------------
key:  mat
my_dict[key]: 1
scaled_dict[key]: 0.16666666666666666
--------------------------------------------------


In [121]:
scaled_dict['the']

0.3333333333333333
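The same calculation can be written more compactly once the loop version makes sense. The sketch below uses a dictionary comprehension, a piece of syntax not covered in this lesson, and checks that the scaled values add up to roughly 1:

```python
counts = {'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}
total = sum(counts.values())

# Build the scaled dictionary in one expression: each count divided by the total.
scaled = {word: count / total for word, count in counts.items()}

print(scaled['the'])
# The scaled frequencies should add up to (approximately) 1.
print(sum(scaled.values()))
```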

# Using loops and dictionaries to compare frequencies across multiple books
Now, we could easily create multiple dictionaries representing the frequencies of words in separate texts in order to compare them.

Here I have a directory with all of the *Harry Potter* novels. Let's compare the scaled frequencies of character names in each one.

In [122]:
hp_path = '/Users/e/code/literarytextmining/corpora/harry_potter/texts'

Python can list all of the files in this directory and create a list of filepaths for me. We can then use that list of filepaths to `open` the texts one at a time, perform calculations on them, and move to the next book.

But to do that, we have to learn how to `import`.

## `import` statements
In Python, many functions and methods like `print` are loaded by default. But some are only used in special circumstances, so we have to ask Python to `import` them for a particular session. For example:

In [123]:
pi

NameError: name 'pi' is not defined

But this works:

In [124]:
import math
math.pi

3.141592653589793

Here, we `import` a Python module called `math` that defines the constant `pi`.

`import` allows us to load modules and code we need in some circumstances, but perhaps not every time.
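For reference, here is a short sketch of the three most common forms of `import`, all using the `math` module from the standard library. The alias `m` is an arbitrary choice for illustration:

```python
import math                # access names as math.pi, math.sqrt, ...
from math import pi        # import a single name directly
import math as m           # give the module a shorter alias

print(math.pi)
print(pi)
print(m.sqrt(16))
```

All three refer to the same module, so which one to use is mostly a matter of readability.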

# Using `import` to list a directory of texts
To our *Harry Potter* problem:

In [125]:
hp_path = '/Users/e/code/literarytextmining/corpora/harry_potter/texts'

To list the files in this directory, we need to `import` the `os` module, which gives us some special functions for interacting with our filesystem:

In [41]:
import os # os stands for "operating system"
potters = os.listdir(hp_path) #os.listdir lists all of the files in the path you give it
potters

['5 Order of the Phoenix.txt',
 '4 Goblet of Fire.txt',
 '6 Half-Blood Prince.txt',
 '1 Sorcerers Stone.txt',
 '3 Prisoner of Azkaban.txt',
 '7 Deathly Hallows.txt',
 '2 Chamber of Secrets.txt']

`os.listdir` returns only the file names, which are paths *relative* to `hp_path`. We want to give Python the full (*absolute*) paths to these files so it knows where to look.

To do that, we use another part of the `os` module called `os.path.join`.

In [42]:
os.path.join(hp_path, potters[0]) # takes two args: the base path, and the filename to be joined to it

'/Users/e/code/literarytextmining/corpora/harry_potter/texts/5 Order of the Phoenix.txt'

So we can use a `for` loop to get absolute paths for all of the files in our directory:

In [43]:
hp_paths = []

for book in potters:
    absolute = os.path.join(hp_path, book)
    hp_paths.append(absolute)

In [44]:
hp_paths = sorted(hp_paths) # to do the books in order...

In [45]:
hp_paths

['/Users/e/code/literarytextmining/corpora/harry_potter/texts/1 Sorcerers Stone.txt',
 '/Users/e/code/literarytextmining/corpora/harry_potter/texts/2 Chamber of Secrets.txt',
 '/Users/e/code/literarytextmining/corpora/harry_potter/texts/3 Prisoner of Azkaban.txt',
 '/Users/e/code/literarytextmining/corpora/harry_potter/texts/4 Goblet of Fire.txt',
 '/Users/e/code/literarytextmining/corpora/harry_potter/texts/5 Order of the Phoenix.txt',
 '/Users/e/code/literarytextmining/corpora/harry_potter/texts/6 Half-Blood Prince.txt',
 '/Users/e/code/literarytextmining/corpora/harry_potter/texts/7 Deathly Hallows.txt']
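The join-then-sort steps can also be tried on any machine. The sketch below creates a throwaway folder with three empty, made-up files (the names are illustrative, not the lesson's corpus) and builds the sorted list of paths in one expression:

```python
import os
import tempfile

# Create a throwaway directory with a few empty files so the example runs anywhere.
with tempfile.TemporaryDirectory() as folder:
    for name in ['2 second.txt', '1 first.txt', '3 third.txt']:
        open(os.path.join(folder, name), 'w').close()

    # Join each file name to the folder, then sort the full paths.
    paths = sorted(os.path.join(folder, name) for name in os.listdir(folder))
    names = [os.path.basename(p) for p in paths]

print(names)
```

Because every path shares the same folder prefix, sorting the full paths puts the files in the same order as sorting the file names would.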

## Looping over a directory of texts
Now we can count our characters in each of our files:

In [129]:
main_chars = ['Harry', 'Hermione', 'Ron']

results = []

for book in hp_paths:
    print(book)
    d = {}
    d['file'] = book
    text = open(book).read()
    
    for char in main_chars:
        print(char)
        d[char] = text.count(char)
        
    results.append(d)

/Users/e/code/literarytextmining/corpora/harry_potter/texts/1 Sorcerers Stone.txt
Harry
Hermione
Ron
/Users/e/code/literarytextmining/corpora/harry_potter/texts/2 Chamber of Secrets.txt
Harry
Hermione
Ron
/Users/e/code/literarytextmining/corpora/harry_potter/texts/3 Prisoner of Azkaban.txt
Harry
Hermione
Ron
/Users/e/code/literarytextmining/corpora/harry_potter/texts/4 Goblet of Fire.txt
Harry
Hermione
Ron
/Users/e/code/literarytextmining/corpora/harry_potter/texts/5 Order of the Phoenix.txt
Harry
Hermione
Ron
/Users/e/code/literarytextmining/corpora/harry_potter/texts/6 Half-Blood Prince.txt
Harry
Hermione
Ron
/Users/e/code/literarytextmining/corpora/harry_potter/texts/7 Deathly Hallows.txt
Harry
Hermione
Ron


What did we make? A *list of dictionaries*. This is a common way of storing and processing labeled data, such as our results here.

In [130]:
results

[{'file': '/Users/e/code/literarytextmining/corpora/harry_potter/texts/1 Sorcerers Stone.txt',
  'Harry': 722,
  'Hermione': 57,
  'Ron': 176},
 {'file': '/Users/e/code/literarytextmining/corpora/harry_potter/texts/2 Chamber of Secrets.txt',
  'Harry': 1745,
  'Hermione': 329,
  'Ron': 721},
 {'file': '/Users/e/code/literarytextmining/corpora/harry_potter/texts/3 Prisoner of Azkaban.txt',
  'Harry': 2057,
  'Hermione': 672,
  'Ron': 789},
 {'file': '/Users/e/code/literarytextmining/corpora/harry_potter/texts/4 Goblet of Fire.txt',
  'Harry': 3277,
  'Hermione': 880,
  'Ron': 1051},
 {'file': '/Users/e/code/literarytextmining/corpora/harry_potter/texts/5 Order of the Phoenix.txt',
  'Harry': 4092,
  'Hermione': 1304,
  'Ron': 1310},
 {'file': '/Users/e/code/literarytextmining/corpora/harry_potter/texts/6 Half-Blood Prince.txt',
  'Harry': 2797,
  'Hermione': 694,
  'Ron': 883},
 {'file': '/Users/e/code/literarytextmining/corpora/harry_potter/texts/7 Deathly Hallows.txt',
  'Harry': 3146

There are two main things to notice about the code above. First, we initialize a list called `results` outside of the `for` loop. Then we initialize a dictionary, `d`, inside the first `for` loop. We collect data on each book in `d`, `append` it to `results`, and then start over with a fresh, empty `d` for the next book.

There is a second `for` loop that checks each of the characters in turn, and adds new `keys` and `values` to the dictionary, `d`.
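The same nested-loop pattern can be tried without any files on disk. In the sketch below, the two short "books" in `texts` are made up for illustration and stand in for the text files:

```python
# Made-up stand-ins for the book files, so the sketch runs without any files.
texts = {
    'book one': 'Harry met Ron. Harry met Hermione.',
    'book two': 'Ron and Hermione waited for Harry.',
}
names = ['Harry', 'Hermione', 'Ron']

results = []                      # created once, outside the loop
for title, text in texts.items():
    row = {'file': title}         # a brand-new dictionary for each text
    for name in names:
        row[name] = text.count(name)
    results.append(row)

print(results)
```

Note that `count` matches substrings, so a name like `Ron` would also be counted inside a longer word that contains it.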

We *could* check three names manually. But what about a *ton* of characters from Harry Potter?

In [131]:
chars="""Ginny
Luna
Nymphadora
Lily
Bellatrix
Narcissa
Petunia
Fleur
Cho
Helga
Rowena
Nagini
Lavender
Pansy
Parvati
Padme
Molly
Penelope
Dolores
Minerva
Madame
Sybill
Rita
Severus
Draco
Albus
Neville
Kingsley
James
Sirius
Remus
Cedric
Gellert
Tom
Peter
Viktor
Dudley
Blaise
Godric
Salazar
Vernon
Hedwig
Newt
Lee
Fred
George
Bill
Charlie
Percy
Arthur
Oliver
Cornelius
Mad-Eye
Gilderoy
Barty
Lucius""".split('\n')

In [132]:
chars = sorted(chars)

In [133]:
chars[:5]

['Albus', 'Arthur', 'Barty', 'Bellatrix', 'Bill']

All we have to do is change our variable, and we get a lot more data with no new code.

This is the great thing about writing abstract code!

In [134]:
results = []

for book in hp_paths:
    d = {}
    d['file'] = book
    text = open(book).read()
    
    for char in chars:
        d[char] = text.count(char)
        
    results.append(d)

In [135]:
results[:2]

[{'file': '/Users/e/code/literarytextmining/corpora/harry_potter/texts/1 Sorcerers Stone.txt',
  'Albus': 8,
  'Arthur': 0,
  'Barty': 0,
  'Bellatrix': 0,
  'Bill': 4,
  'Blaise': 1,
  'Cedric': 0,
  'Charlie': 10,
  'Cho': 7,
  'Cornelius': 1,
  'Dolores': 0,
  'Draco': 7,
  'Dudley': 138,
  'Fleur': 0,
  'Fred': 17,
  'Gellert': 0,
  'George': 15,
  'Gilderoy': 0,
  'Ginny': 3,
  'Godric': 1,
  'Hedwig': 9,
  'Helga': 0,
  'James': 5,
  'Kingsley': 0,
  'Lavender': 2,
  'Lee': 3,
  'Lily': 7,
  'Lucius': 0,
  'Luna': 0,
  'Mad-Eye': 0,
  'Madame': 0,
  'Minerva': 1,
  'Molly': 0,
  'Nagini': 0,
  'Narcissa': 0,
  'Neville': 50,
  'Newt': 1,
  'Nymphadora': 0,
  'Oliver': 3,
  'Padme': 0,
  'Pansy': 1,
  'Parvati': 3,
  'Penelope': 0,
  'Percy': 31,
  'Peter': 0,
  'Petunia': 56,
  'Remus': 0,
  'Rita': 0,
  'Rowena': 0,
  'Salazar': 0,
  'Severus': 1,
  'Sirius': 1,
  'Sybill': 0,
  'Tom': 1,
  'Vernon': 112,
  'Viktor': 0},
 {'file': '/Users/e/code/literarytextmining/corpora/harry_

We're going to use this data structure, the *list of dictionaries*, to capture a lot of information about our texts.

# Pandas
This list of dictionaries has a ton of information in it, but it's not in the easiest format to work with. Thankfully, there is a Python library made for this task: `pandas`.

In [56]:
import pandas as pd # this may take a minute!

It's conventional to `import pandas as pd`. What that means is that you call `pandas` methods by typing `pd` instead of `pandas`.

`pandas` has a lot of features, but one way to think about it is as a tool for working with tabular data in Python.

In [136]:
df = pd.DataFrame(results) # yes capitalization matters for this one

In [137]:
df

Unnamed: 0,Albus,Arthur,Barty,Bellatrix,Bill,Blaise,Cedric,Charlie,Cho,Cornelius,...,Rita,Rowena,Salazar,Severus,Sirius,Sybill,Tom,Vernon,Viktor,file
0,8,0,0,0,4,1,0,10,7,1,...,0,0,0,1,1,0,1,112,0,/Users/e/code/literarytextmining/corpora/harry...
1,7,19,0,0,10,0,0,4,0,4,...,0,1,10,6,0,0,22,55,0,/Users/e/code/literarytextmining/corpora/harry...
2,5,6,0,0,4,0,4,1,20,11,...,0,0,0,25,158,6,18,78,0,/Users/e/code/literarytextmining/corpora/harry...
3,19,34,51,0,75,0,274,58,41,24,...,96,0,0,14,224,0,6,104,50,/Users/e/code/literarytextmining/corpora/harry...
4,15,48,1,53,33,0,42,5,162,32,...,33,0,0,1,638,11,5,111,3,/Users/e/code/literarytextmining/corpora/harry...
5,18,23,1,56,55,5,1,4,37,6,...,1,0,2,39,57,4,74,27,2,/Users/e/code/literarytextmining/corpora/harry...
6,110,21,1,98,125,0,2,15,13,0,...,28,4,1,62,69,0,12,50,5,/Users/e/code/literarytextmining/corpora/harry...


As you can see, `pandas` converts our list of dictionaries into a format that looks a lot like an Excel file. This is called a `DataFrame`.

In text mining, we often use what are called "Document-Term Matrices" (DTM). The DataFrame above is an example. In a DTM, the rows (horizontal) are our texts ("documents"). And the columns are our words ("terms"). In the sciences, rows contain our observations, and columns contain our features.

Our DTM isn't quite right just yet. All of our columns except one hold the counts for a character name, like `Severus` below. The last column, `file`, holds the path to the file each row was calculated from:

In [141]:
df['Severus']

0     1
1     6
2    25
3    14
4     1
5    39
6    62
Name: Severus, dtype: int64

The `file` column should label our rows as the DataFrame's `index`. We can set that like so:

In [142]:
df.set_index('file')

file,Albus,Arthur,Barty,Bellatrix,Bill,Blaise,Cedric,Charlie,Cho,Cornelius,...,Remus,Rita,Rowena,Salazar,Severus,Sirius,Sybill,Tom,Vernon,Viktor
/Users/e/code/literarytextmining/corpora/harry_potter/texts/1 Sorcerers Stone.txt,8,0,0,0,4,1,0,10,7,1,...,0,0,0,0,1,1,0,1,112,0
/Users/e/code/literarytextmining/corpora/harry_potter/texts/2 Chamber of Secrets.txt,7,19,0,0,10,0,0,4,0,4,...,0,0,1,10,6,0,0,22,55,0
/Users/e/code/literarytextmining/corpora/harry_potter/texts/3 Prisoner of Azkaban.txt,5,6,0,0,4,0,4,1,20,11,...,15,0,0,0,25,158,6,18,78,0
/Users/e/code/literarytextmining/corpora/harry_potter/texts/4 Goblet of Fire.txt,19,34,51,0,75,0,274,58,41,24,...,1,96,0,0,14,224,0,6,104,50
/Users/e/code/literarytextmining/corpora/harry_potter/texts/5 Order of the Phoenix.txt,15,48,1,53,33,0,42,5,162,32,...,10,33,0,0,1,638,11,5,111,3
/Users/e/code/literarytextmining/corpora/harry_potter/texts/6 Half-Blood Prince.txt,18,23,1,56,55,5,1,4,37,6,...,13,1,0,2,39,57,4,74,27,2
/Users/e/code/literarytextmining/corpora/harry_potter/texts/7 Deathly Hallows.txt,110,21,1,98,125,0,2,15,13,0,...,21,28,4,1,62,69,0,12,50,5


The full file paths make for an unwieldy index, so I'm going to clean them up first. You can *overwrite* a column in `pandas` in the same way you would overwrite a value in a dictionary:

In [61]:
titles = list(df['file'])
titles

['/Users/e/code/literarytextmining/corpora/harry_potter/texts/1 Sorcerers Stone.txt',
 '/Users/e/code/literarytextmining/corpora/harry_potter/texts/2 Chamber of Secrets.txt',
 '/Users/e/code/literarytextmining/corpora/harry_potter/texts/3 Prisoner of Azkaban.txt',
 '/Users/e/code/literarytextmining/corpora/harry_potter/texts/4 Goblet of Fire.txt',
 '/Users/e/code/literarytextmining/corpora/harry_potter/texts/5 Order of the Phoenix.txt',
 '/Users/e/code/literarytextmining/corpora/harry_potter/texts/6 Half-Blood Prince.txt',
 '/Users/e/code/literarytextmining/corpora/harry_potter/texts/7 Deathly Hallows.txt']

In [62]:
cleaner = []
for x in titles:
    nice_title = x.split('/')[-1].split('.txt')[0]
    cleaner.append(nice_title)

In [63]:
cleaner

['1 Sorcerers Stone',
 '2 Chamber of Secrets',
 '3 Prisoner of Azkaban',
 '4 Goblet of Fire',
 '5 Order of the Phoenix',
 '6 Half-Blood Prince',
 '7 Deathly Hallows']

In [76]:
df['file'] = cleaner

In [77]:
df['file']

0         1 Sorcerers Stone
1      2 Chamber of Secrets
2     3 Prisoner of Azkaban
3          4 Goblet of Fire
4    5 Order of the Phoenix
5       6 Half-Blood Prince
6         7 Deathly Hallows
Name: file, dtype: object

In [78]:
df = df.set_index('file')
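The reassignment in the cell above matters: `set_index` returns a new DataFrame rather than changing the original in place. The sketch below shows this on two made-up rows (the names `rows`, `small`, and `indexed` are illustrative):

```python
import pandas as pd

# Two made-up rows in the same list-of-dictionaries shape as results.
rows = [
    {'file': 'book one', 'Harry': 2, 'Ron': 1},
    {'file': 'book two', 'Harry': 1, 'Ron': 0},
]
small = pd.DataFrame(rows)

# set_index returns a new DataFrame; small itself is left unchanged.
indexed = small.set_index('file')

print(list(small.columns))
print(list(indexed.columns))
print(list(indexed.index))
```

Writing `df = df.set_index('file')` simply stores the new DataFrame back under the old name.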

In [79]:
df.head() #df.head() prints the first 5 rows. useful when you have big matrices!

file,Albus,Arthur,Barty,Bellatrix,Bill,Blaise,Cedric,Charlie,Cho,Cornelius,...,Remus,Rita,Rowena,Salazar,Severus,Sirius,Sybill,Tom,Vernon,Viktor
1 Sorcerers Stone,8,0,0,0,4,1,0,10,7,1,...,0,0,0,0,1,1,0,1,112,0
2 Chamber of Secrets,7,19,0,0,10,0,0,4,0,4,...,0,0,1,10,6,0,0,22,55,0
3 Prisoner of Azkaban,5,6,0,0,4,0,4,1,20,11,...,15,0,0,0,25,158,6,18,78,0
4 Goblet of Fire,19,34,51,0,75,0,274,58,41,24,...,1,96,0,0,14,224,0,6,104,50
5 Order of the Phoenix,15,48,1,53,33,0,42,5,162,32,...,10,33,0,0,1,638,11,5,111,3


There are a lot of different ways we can slice and dice our data:

## What are my columns called?

In [80]:
df.columns

Index(['Albus', 'Arthur', 'Barty', 'Bellatrix', 'Bill', 'Blaise', 'Cedric',
       'Charlie', 'Cho', 'Cornelius', 'Dolores', 'Draco', 'Dudley', 'Fleur',
       'Fred', 'Gellert', 'George', 'Gilderoy', 'Ginny', 'Godric', 'Hedwig',
       'Helga', 'James', 'Kingsley', 'Lavender', 'Lee', 'Lily', 'Lucius',
       'Luna', 'Mad-Eye', 'Madame', 'Minerva', 'Molly', 'Nagini', 'Narcissa',
       'Neville', 'Newt', 'Nymphadora', 'Oliver', 'Padme', 'Pansy', 'Parvati',
       'Penelope', 'Percy', 'Peter', 'Petunia', 'Remus', 'Rita', 'Rowena',
       'Salazar', 'Severus', 'Sirius', 'Sybill', 'Tom', 'Vernon', 'Viktor'],
      dtype='object')

## What did I set my index as again?

In [81]:
df.index

Index(['1 Sorcerers Stone', '2 Chamber of Secrets', '3 Prisoner of Azkaban',
       '4 Goblet of Fire', '5 Order of the Phoenix', '6 Half-Blood Prince',
       '7 Deathly Hallows'],
      dtype='object', name='file')

## How big is my dataframe?

In [82]:
df.shape # the first number is the number of rows, the second the number of columns

(7, 56)

## How do I access the values of a named column?

In [83]:
df['Tom']

file
1 Sorcerers Stone          1
2 Chamber of Secrets      22
3 Prisoner of Azkaban     18
4 Goblet of Fire           6
5 Order of the Phoenix     5
6 Half-Blood Prince       74
7 Deathly Hallows         12
Name: Tom, dtype: int64

## How do I access a particular row by index number?

In [84]:
df.iloc[-1] # give me the last row

Albus         110
Arthur         21
Barty           1
Bellatrix      98
Bill          125
Blaise          0
Cedric          2
Charlie        15
Cho            13
Cornelius       0
Dolores         4
Draco          62
Dudley         49
Fleur         101
Fred           94
Gellert        14
George         77
Gilderoy        0
Ginny         122
Godric         60
Hedwig         19
Helga           1
James          57
Kingsley       51
Lavender        3
Lee            17
Lily           80
Lucius         48
Luna          140
Mad-Eye        49
Madame          8
Minerva         8
Molly          11
Nagini         19
Narcissa       21
Neville        89
Newt            0
Nymphadora      3
Oliver          2
Padme           0
Pansy           2
Parvati         2
Penelope        1
Percy          33
Peter           2
Petunia        47
Remus          21
Rita           28
Rowena          4
Salazar         1
Severus        62
Sirius         69
Sybill          0
Tom            12
Vernon         50
Viktor    

## How do I access a particular row by name?

In [85]:
df.loc['1 Sorcerers Stone']

Albus           8
Arthur          0
Barty           0
Bellatrix       0
Bill            4
Blaise          1
Cedric          0
Charlie        10
Cho             7
Cornelius       1
Dolores         0
Draco           7
Dudley        138
Fleur           0
Fred           17
Gellert         0
George         15
Gilderoy        0
Ginny           3
Godric          1
Hedwig          9
Helga           0
James           5
Kingsley        0
Lavender        2
Lee             3
Lily            7
Lucius          0
Luna            0
Mad-Eye         0
Madame          0
Minerva         1
Molly           0
Nagini          0
Narcissa        0
Neville        50
Newt            1
Nymphadora      0
Oliver          3
Padme           0
Pansy           1
Parvati         3
Penelope        0
Percy          31
Peter           0
Petunia        56
Remus           0
Rita            0
Rowena          0
Salazar         0
Severus         1
Sirius          1
Sybill          0
Tom             1
Vernon        112
Viktor    

# Filtering with Pandas
`pandas` allows us to filter our data easily.

Let's say we wanted to figure out which characters were mentioned at least 5 times in every book. How would we do that?

In [86]:
for x in df.columns:
    if all(df[x] >= 5):
        print(x)

Albus
Draco
Dudley
Fred
George
Hedwig
Neville
Percy
Petunia
Vernon


`all()` checks the `True` or `False` condition for every element in the DataFrame column. Only if all of the elements are `True` does it `return` `True`.

`any()` works similarly. It checks the condition of every element in the column to see if any of them return `True`. Once it hits a `True` value, `any` `returns` `True`.

Which characters are *not* mentioned in at least one book?

In [87]:
for x in df.columns:
    if any(df[x] == 0):
        print(x)

Arthur
Barty
Bellatrix
Blaise
Cedric
Cho
Cornelius
Dolores
Fleur
Gellert
Gilderoy
Godric
Helga
Kingsley
Lucius
Luna
Mad-Eye
Madame
Molly
Nagini
Narcissa
Newt
Nymphadora
Oliver
Padme
Pansy
Penelope
Peter
Remus
Rita
Rowena
Salazar
Sirius
Sybill
Viktor
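The loops above can also be written with the `all` and `any` methods that `pandas` provides on DataFrames. This is a sketch on made-up counts, not the lesson's data, and is offered as an alternative to the loops rather than a replacement:

```python
import pandas as pd

# Made-up counts: rows are books, columns are character names.
small = pd.DataFrame(
    {'Harry': [10, 12, 9], 'Luna': [0, 0, 7], 'Dobby': [0, 6, 0]},
    index=['book one', 'book two', 'book three'],
)

# (small >= 5) is a table of True/False; .all() collapses each column to one value.
in_every_book = list(small.columns[(small >= 5).all()])

# .any() is True for a column if at least one of its values meets the condition.
missing_somewhere = list(small.columns[(small == 0).any()])

print(in_every_book)
print(missing_somewhere)
```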


# Using `pandas` with metadata

# NLTK
We can split texts into words with a library called `nltk`, which stands for Natural Language Toolkit. It's a library written in Python designed to do some of the work we have been doing by hand to learn:

In [7]:
import nltk # this may take a second to load!
nltk.word_tokenize('This can help us out')

['This', 'can', 'help', 'us', 'out']

`nltk` isn't perfect. It gets tripped up by stuff like this:

In [8]:
nltk.word_tokenize('Sometimes things-even simple things-cause problems')

['Sometimes', 'things-even', 'simple', 'things-cause', 'problems']

We can improve its results by preprocessing our texts some. We can `import` lists of punctuation to save time, too:

In [9]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

`re` is a module whose name stands for regular expressions. It is the same system that Piper used to count punctuation. It may be most useful for us to match *arbitrary* amounts of whitespace, as in the string below:

In [10]:
import re
re.findall(r'\s+', '  instances of multiple \t spaces \n  are.   present') # the r prefix keeps the backslash as-is

['  ', ' ', ' ', ' \t ', ' \n  ', '   ']

All you need to know is that the function `findall` from the module `re` can find runs of whitespace of any length and any mix of whitespace characters. The special sequence `\s+` matches one or more whitespace characters in a row.
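Two related functions in `re` use the same pattern. The sketch below reuses the example string to show `re.sub`, which replaces each match, and `re.split`, which cuts the string at each match:

```python
import re

messy = '  instances of multiple \t spaces \n  are.   present'

# re.sub replaces every run of whitespace with a single space.
tidy = re.sub(r'\s+', ' ', messy).strip()

# re.split cuts the string wherever a run of whitespace occurs.
pieces = re.split(r'\s+', messy.strip())

print(tidy)
print(pieces)
```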

We can use that flexibility to find *all* of the whitespace and replace it with spaces for simplicity:

In [11]:
with open('/Users/e/Downloads/walden.txt') as f: # f is the open file
    thoreau = f.read() # thoreau is the text it contains

In [28]:
def tokenize(text, keep_punct=False, to_lower=True):
    if to_lower:
        text = text.lower().strip()

    if keep_punct:
        for punct in string.punctuation:
            text = text.replace(punct, ' ' + punct + ' ') # adding spaces to either side of any punctuation mark
    else:
        for punct in string.punctuation:
            text = text.replace(punct, ' ') # adding spaces in the place of punctuation

    # this replaces *any* run of whitespace with a single space:
    text = re.sub(r'\s+', ' ', text)

    # then we can use nltk's tokenizer
    return nltk.word_tokenize(text)

In [30]:
thoreau_toks = tokenize(thoreau)

In [31]:
thoreau_toks[:20]

['walden',
 'and',
 'on',
 'the',
 'duty',
 'of',
 'civil',
 'disobedience',
 'by',
 'henry',
 'david',
 'thoreau',
 'contents',
 'walden',
 'economy',
 'where',
 'i',
 'lived',
 'and',
 'what']

In [32]:
def ttr(text):
    '''
    Calculates the type-token ratio for any given text. Depends on function tokenize.
    '''
    tokens = tokenize(text, keep_punct=False)
    types = set(tokens)
    return len(types)/len(tokens)

In [33]:
ttr(thoreau) # long texts are very repetitive!

0.09475101859902146

In [34]:
ttr("Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.")

0.125

In [35]:
ttr('the cat sat on the mat')

0.8333333333333334
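If `nltk` is not installed, the same ratio can be approximated with the standard library alone. The function below, `simple_ttr`, is a hypothetical stand-in written for illustration: it strips punctuation the same way as `tokenize` but splits on spaces instead of calling `nltk.word_tokenize`, so its results may differ slightly on more complicated texts.

```python
import re
import string

def simple_ttr(text):
    """Type-token ratio using only the standard library (a stand-in for ttr)."""
    text = text.lower()
    for punct in string.punctuation:
        text = text.replace(punct, ' ')  # swap each punctuation mark for a space
    # collapse runs of whitespace, then split into tokens on single spaces
    tokens = re.sub(r'\s+', ' ', text).strip().split(' ')
    return len(set(tokens)) / len(tokens)

print(simple_ttr('the cat sat on the mat'))
print(simple_ttr('Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.'))
```

A ratio close to 1 means almost every word is new; a ratio close to 0 means the text repeats a small vocabulary.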