## Overview

This week, we discuss dictionaries (or hash tables), data structures which are highly efficient for storage and retrieval of information associated with a 'key'.  This is done through hashing, which will be discussed in class. In this assignment, we will learn to use dictionaries, and to read files. File reading is an essential skill for scientific programmers, as virtually all data, even data on the web, resides in files of some sort. Your operating system will treat many external devices that measure something as a file too. 

## Assignment

Write a program that reads the file `human_evolution.txt`, which holds information about human ancestors including when they lived, height, mass, and brain volume, and stores the data in a dictionary called `humans`. The keys in the dictionary correspond to the species names, e.g. `"H. habilis"`, `"H. erectus"`, etc. The values associated with each key are themselves dictionaries. These dictionaries have keys corresponding to each column title, e.g. `"Lived"`, `"Height"`, `"Mass"`, etc. The values of these nested dictionaries (one for each species) correspond to the data in the file. If done correctly, the dictionary called `humans` can be used as follows:

`print(humans['H. floresiensis']['Mass'])`

`>>>25`

For instructional purposes, the data set here is small, but you should see the utility of such a data type when the data set is large. When reading the file, you are not permitted to change anything in the data file to make it easier to read. Print the entire dictionary (and sub-dictionaries) at the end of the program, so that the output can be inspected.

Please note that, due to the structure of the data itself, all data remain strings! You need not try to coerce the data into numerical types.

### Technical Notes

This assignment contains two new ideas: 1) dictionaries, and 2) file input/output (I/O). First, the dictionary.

#### Dictionary

There are many  [good overviews](http://www.sthurlow.com/python/lesson06/) as well as a cheat sheet on how to use a dictionary. Important points for the assignment:

* Dictionaries store information according to keys. 
* The pieces of information stored are called values.
* Creating a dictionary is like creating a list, but done with `d = {<key0>:<value0>,<key1>:<value1>,...}`, noting that each key is separated from its value with "`:`", entries are separated with "`,`", and the whole thing is surrounded by "`{}`".
* Values can be any valid Python data type, including dictionaries.
* Getting a value back from a dictionary is done with `d[<key>]`.
* Values can be added or changed with `d[<key>] = new_value`.
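
The points above can be sketched in a few lines. The names `elements` and `nested` and their contents are made up for illustration; they are not part of the assignment.

```python
# A small dictionary keyed by element symbol (illustrative values only).
elements = {"H": "Hydrogen", "He": "Helium"}

# Look up a value by key.
print(elements["He"])            # Helium

# Add a new key/value pair, then change an existing one.
elements["Li"] = "Lithium"
elements["H"] = "hydrogen"

# Values can themselves be dictionaries (a nested dictionary).
nested = {"He": {"Name": "Helium", "Number": 2}}
print(nested["He"]["Number"])    # 2

# Membership tests look at keys, not values.
print("Li" in elements)          # True
print(len(elements))             # 3
```

Asking for a key that is not present, such as `elements["Xe"]`, raises a `KeyError`, so a membership test with `in` is a handy guard.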

#### File Input/Output

While important, file I/O is often overlooked in introductory classes. Not here! We use it here, and later when reading spreadsheets. 

As with dictionaries, there are many good tutorials online. I liked [this one](http://www.afterhoursprogramming.com/tutorial/Python/Reading-Files/). Key points are as follows. Note that I use `<>` to indicate a variable that the programmer must provide - you can call these whatever you like, or specify a particular file.

* Make the file something that can be accessed in a program with `<file_object> = open("<file_name>")`, where `<file_object>` is any valid variable name and `<file_name>` is a full (including extensions) name of a file in the same directory as the program. 
* If the file is not in the same directory, then `<file_name>` has to include path information.
* `<file_object>.readline()` will read one line, often useful for special cases in a text file, like the column titles, or a line of comments.
* `for line in <file_object>:` is the key to getting the job done. Looping over each line of the file, things like `data = line.split()` will give you a list of the contents of each column.
* Don't forget to do a `<file_object>.close()` at the end of file use to avoid 'unflushed buffers'.
* "Cleaning" irregularities in data files is often a big part of the hassle. `<string>.strip()` might help you; it removes white space before and after a string (but not in the middle). Logic tests on a leading character, such as `if l[0] == 'H':`, are also useful. Many lines that are not interesting start with a special character, like `#`.
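
As a rough sketch of how these pieces fit together, the block below writes a tiny throwaway file and then reads it back. The file name `sample.txt` and its contents are invented for this example so that it runs on its own.

```python
import os
import tempfile

# Write a small sample file so the sketch is self-contained
# (the file name and contents are made up for illustration).
sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
out_file = open(sample_path, "w")
out_file.write("x y\n# a comment line\n1 2\n  3 4  \n")
out_file.close()

in_file = open(sample_path)
header = in_file.readline().split()   # first line: the column titles
rows = []
for line in in_file:                  # continues from the second line
    line = line.strip()               # drop leading/trailing whitespace
    if line == "" or line[0] == "#":  # skip blank lines and comments
        continue
    rows.append(line.split())
in_file.close()

print(header)   # ['x', 'y']
print(rows)     # [['1', '2'], ['3', '4']]
```

Python also offers `with open("<file_name>") as <file_object>:`, which closes the file automatically when the indented block ends, so the explicit `close()` cannot be forgotten.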

##### File Example
Below is an example that reads the file:

`# test data file`

1 2 3

4 5 6

7 8 9

8 1 4

6 9 2

`# This is comment. It breaks the pattern in the data!`

1 1 1

I called this file `test_data.txt`. The file-reading example further below (after two dictionary examples) reads the file, skips the comment lines, and adds the first and third columns.

In [3]:
#Example Dictionary
phone_book={}
while True:
    print("What is the name? (nothing to exit)")
    name=input() #what will be our key
    if name=="":
        break
    print()
    print("What is the phone number?")
    number=input() #what will be our value
    phone_book[name]=number #how we add key-value pairs into a dictionary
    print()

print(phone_book.keys()) #gives all the keys in a dictionary
print(phone_book.values()) #gives all the values
print()

for names, numbers in phone_book.items(): #gives the key-value pairs
    print(names,numbers)

What is the name? (nothing to exit)
Dan

What is the phone number?
5328061230986

What is the name? (nothing to exit)
Dave

What is the phone number?
158906589

What is the name? (nothing to exit)
Pink

What is the phone number?
13589609851

What is the name? (nothing to exit)

dict_keys(['Dan', 'Dave', 'Pink'])
dict_values(['5328061230986', '158906589', '13589609851'])

Dan 5328061230986
Dave 158906589
Pink 13589609851


In [7]:
#You can make a nested dictionary
periodic_table={"H":{"Name":"Hydrogen","Mass":1.008,"Number":1},"He":{"Name":"Helium","Mass":4.0026,"Number":2}}
print(periodic_table["He"]) #grabbing the outer value (itself a dictionary)
print(periodic_table["He"]["Mass"]) #grabbing a value from the inner dictionary

{'Name': 'Helium', 'Mass': 4.0026, 'Number': 2}
4.0026


In [16]:
example_file = open("test_data.txt")
for line in example_file:          # Each line in the file is iterated over.
    if line[0]!="#": #skip comment lines
        columns=line.split() #split the line on whitespace into a list of strings
        print(float(columns[0])+float(columns[2])) #float() converts the strings so they are added as numbers
        #if we did not use float, the strings would have been concatenated
        #example_file.seek(0) would move the file position back to the start
example_file.close()

4.0
10.0
16.0
12.0
8.0
2.0
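
The comment about `float` in the cell above can be checked directly. This is a minimal sketch that uses one line of the test data as a literal string instead of reading the file.

```python
columns = "1 2 3".split()      # ['1', '2', '3'], every item is a string

as_text = columns[0] + columns[2]                  # string concatenation
as_number = float(columns[0]) + float(columns[2])  # numeric addition

print(as_text)     # 13
print(as_number)   # 4.0
```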


### Working with `human_evolution.txt`
Now do this yourself, using the file `human_evolution.txt`. **Hint -** while I used `split()` in the previous example, achieving your goals here is best done by thinking carefully about which columns each data field is in. Don't forget to print the contents of the dictionary at the end; that's an important part of understanding how to iterate over keys.

In [43]:
#not sure where I'm going with this thing, we'll see what I do...

In [44]:
humans=open("human_evolution.txt") #opens the file as a variable we can access

for line in humans: #starting by printing every line
    print(line) 
print()

humans.close()

Species              Lived when      Adult        Adult       Brain volume 

                     (mill. yrs)     height (m)   mass (kg)   (cm**3) 

-------------------------------------------------------------------------------

H. habilis           2.2 - 1.6       1.0 - 1.5    33 - 55     660

H. erectus           1.4 - 0.2       1.8          60          850 (early) - 1100 (late)

H. ergaster          1.9 - 1.4       1.9                      700 - 850

H. heidelbergensis   0.6 - 0.35      1.8          60          1100 - 1400

H. neanderthalensis  0.35 - 0.03     1.6          55 - 70     1200 - 1900

H. sapiens sapiens   0.2 - present   1.4 - 1.9    50 - 100    1000 - 1850

H. floresiensis      0.10 - 0.012    1.0          25          400

-------------------------------------------------------------------------------



Source: http://en.wikipedia.org/wiki/Human_evolution




In [45]:
humans=open("human_evolution.txt")

for line in humans: #loop over every line
    s_line=line.split() #creating a list of the whitespace-separated tokens
    print(s_line,len(s_line))

humans.close()

['Species', 'Lived', 'when', 'Adult', 'Adult', 'Brain', 'volume'] 7
['(mill.', 'yrs)', 'height', '(m)', 'mass', '(kg)', '(cm**3)'] 7
['-------------------------------------------------------------------------------'] 1
['H.', 'habilis', '2.2', '-', '1.6', '1.0', '-', '1.5', '33', '-', '55', '660'] 12
['H.', 'erectus', '1.4', '-', '0.2', '1.8', '60', '850', '(early)', '-', '1100', '(late)'] 12
['H.', 'ergaster', '1.9', '-', '1.4', '1.9', '700', '-', '850'] 9
['H.', 'heidelbergensis', '0.6', '-', '0.35', '1.8', '60', '1100', '-', '1400'] 10
['H.', 'neanderthalensis', '0.35', '-', '0.03', '1.6', '55', '-', '70', '1200', '-', '1900'] 12
['H.', 'sapiens', 'sapiens', '0.2', '-', 'present', '1.4', '-', '1.9', '50', '-', '100', '1000', '-', '1850'] 15
['H.', 'floresiensis', '0.10', '-', '0.012', '1.0', '25', '400'] 8
['-------------------------------------------------------------------------------'] 1
[] 0
['Source:', 'http://en.wikipedia.org/wiki/Human_evolution'] 2
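
The token counts printed above differ from row to row, which is why `split()` alone cannot tell you which column a token belongs to. The sketch below rebuilds two of the rows with assumed column widths (21, 16, 13 and 12 characters, so the mass column would span characters 50 to 61); check those numbers against the real file before relying on them.

```python
# Two rows laid out in fixed-width columns (the widths are an assumption
# made for this sketch, not read from the real file).
widths = (21, 16, 13, 12)
rows = [
    ("H. erectus", "1.4 - 0.2", "1.8", "60", "850 (early) - 1100 (late)"),
    ("H. ergaster", "1.9 - 1.4", "1.9", "", "700 - 850"),
]
lines = []
for fields in rows:
    # zip stops after four pairs, so only the first four fields are padded
    padded = "".join(f.ljust(w) for f, w in zip(fields, widths))
    lines.append(padded + fields[-1])

for line in lines:
    tokens = line.split()        # token count depends on the row's content
    mass = line[50:62].strip()   # slicing by position always hits the mass column
    print(len(tokens), repr(mass))
```

For `H. ergaster` the mass field is empty, so `split()` silently shifts the brain volume tokens into its place, while the slice correctly returns an empty string.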


In [24]:
#let's try processing the information some
humans=open("human_evolution.txt") #opens the file as a variable we can access

for line in humans: #visualizing how we're going to take the lines and make them into columns
    if line[0] not in ["S","-"," "]: #skipping the header, ruler, and source lines (the blank line still gets through)
        names=line[0:20] #the relative section where names would be
        names=names.replace(" ","") #getting rid of the whitespace
        print(names)
print()

humans.close()

H.habilis
H.erectus
H.ergaster
H.heidelbergensis
H.neanderthalensis
H.sapienssapiens
H.floresiensis
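
Note that `replace(" ","")` also removes the space inside each name, so the keys no longer match the `"H. habilis"` form given in the assignment. A small sketch of the difference, using a literal 20-character string in place of `line[0:20]`:

```python
field = "H. sapiens sapiens  "          # a 20-character slice, as in line[0:20]
print(repr(field.replace(" ", "")))     # 'H.sapienssapiens'
print(repr(field.strip()))              # 'H. sapiens sapiens'
```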





In [32]:
humans=open("human_evolution.txt") #opens the file as a variable we can access

for line in humans: #slicing out the "lived when" column
    if line[0] not in ["S","-"," "]: #skipping the header, ruler, and source lines
        living=line[20:35]
        living=living.replace(" ","") #getting rid of the whitespace
        print(living)
print()

humans.close()

2.2-1.6
1.4-0.2
1.9-1.4
0.6-0.35
0.35-0.03
0.2-present
0.10-0.012




In [33]:
humans=open("human_evolution.txt") #opens the file as a variable we can access

for line in humans: #slicing out the height column
    if line[0] not in ["S","-"," "]: #skipping the header, ruler, and source lines
        height=line[35:50]
        height=height.replace(" ","") #getting rid of the whitespace
        print(height)
print()

humans.close()

1.0-1.5
1.8
1.9
1.8
1.6
1.4-1.9
1.0




In [34]:
humans=open("human_evolution.txt") #opens the file as a variable we can access

for line in humans: #slicing out the mass column
    if line[0] not in ["S","-"," "]: #skipping the header, ruler, and source lines
        mass=line[50:60]
        mass=mass.replace(" ","") #getting rid of the whitespace
        print(mass)
print()

humans.close()

33-55
60

60
55-70
50-100
25




In [35]:
humans=open("human_evolution.txt") #opens the file as a variable we can access

for line in humans: #slicing out the brain volume column
    if line[0] not in ["S","-"," "]: #skipping the header, ruler, and source lines
        brain=line[60:] #runs to the end of the line, so it still includes the newline
        brain=brain.replace(" ","") #getting rid of the spaces (the newline stays, hence the blank lines in the output)
        print(brain)
print()

humans.close()

660

850(early)-1100(late)

700-850

1100-1400

1200-1900

1000-1850

400





In [9]:
humans=open("human_evolution.txt")
import re #I'm going to use regular expressions, because they make life easier
l_f=re.compile('.*$') #matches everything up to (but not including) the newline

for line in humans: #starting by printing every line
    l_s=l_f.match(line) #we match the regular expression against the line
    l_h=l_s.group() #the whole match, which is the entire line without its newline
    print(l_h) #we print the line
humans.close()

Species              Lived when      Adult        Adult       Brain volume 
                     (mill. yrs)     height (m)   mass (kg)   (cm**3) 
-------------------------------------------------------------------------------
H. habilis           2.2 - 1.6       1.0 - 1.5    33 - 55     660
H. erectus           1.4 - 0.2       1.8          60          850 (early) - 1100 (late)
H. ergaster          1.9 - 1.4       1.9                      700 - 850
H. heidelbergensis   0.6 - 0.35      1.8          60          1100 - 1400
H. neanderthalensis  0.35 - 0.03     1.6          55 - 70     1200 - 1900
H. sapiens sapiens   0.2 - present   1.4 - 1.9    50 - 100    1000 - 1850
H. floresiensis      0.10 - 0.012    1.0          25          400
-------------------------------------------------------------------------------

Source: http://en.wikipedia.org/wiki/Human_evolution
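
If the only goal is to drop the trailing newline, the string method `rstrip("\n")` is a simpler alternative to the regular expression. The sketch below compares the two on a few made-up lines (not read from the file).

```python
import re

# Sample lines, including a blank line and a last line with no newline.
lines = ["H. floresiensis      0.10 - 0.012\n", "-----\n", "\n", "last line"]

pattern = re.compile('.*$')
results = []
for line in lines:
    via_regex = pattern.match(line).group()   # the approach used above
    via_rstrip = line.rstrip("\n")            # plain string method
    results.append(via_regex == via_rstrip)
    print(repr(via_rstrip), via_regex == via_rstrip)
```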


In [52]:
import re #needed if this cell is run on its own
humans=open("human_evolution.txt")
l_f=re.compile(r'\w\. \w*') #matches the species names, e.g. "H. habilis" (raw string, with the dot escaped)

for line in humans: #loop over every line
    try:
        l_s=l_f.match(line) #we match the regular expression at the start of the line
        l_h=l_s.group() #the matched text; raises AttributeError when there was no match
        print(l_h) #we print the match
    except AttributeError: #header, ruler, blank, and source lines have no match
        print("Monkey on a pedestal")
humans.close()

Monkey on a pedestal
Monkey on a pedestal
Monkey on a pedestal
H. habilis
H. erectus
H. ergaster
H. heidelbergensis
H. neanderthalensis
H. sapiens
H. floresiensis
Monkey on a pedestal
Monkey on a pedestal
Monkey on a pedestal
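
The output above shows `H. sapiens` rather than `H. sapiens sapiens`, because the pattern only allows one word after the `H.`. One possible adjustment is sketched below; the pattern is an assumption of mine, and it tests the match for `None` instead of relying on `try`/`except`.

```python
import re

# Pattern is an assumption for this sketch: "H." followed by one or more
# lowercase words separated by single spaces.
species_pattern = re.compile(r"H\. [a-z]+(?: [a-z]+)*")

lines = [
    "Species              Lived when",
    "H. habilis           2.2 - 1.6",
    "H. sapiens sapiens   0.2 - present",
    "",
]

found = []
for line in lines:
    match = species_pattern.match(line)
    if match is None:        # no species name at the start of this line
        continue
    found.append(match.group())

print(found)   # ['H. habilis', 'H. sapiens sapiens']
```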


In [12]:
#putting it all together

import re #regular expressions just b/c I want to get rid of the newline characters
l_f=re.compile('.*$') #we match everything up to the newline character

names=[]
living=[]
height=[]
mass=[]
brain=[]

humans=open("human_evolution.txt") #opens the file as a variable we can access
for line in humans: #taking each line and cutting it into columns
    if line[0] not in ["S","-"," "]: #skipping the header, ruler, and source lines
        #the names
        names_i=line[0:20] #the relative section where names would be
        names.append(names_i.replace(" ","")) #getting rid of the whitespace
        #living
        living_i=line[20:35]
        living.append(living_i.replace(" ",""))
        #height
        height_i=line[35:50]
        height.append(height_i.replace(" ",""))
        #mass
        mass_i=line[50:60]
        mass.append(mass_i.replace(" ",""))
        #Braaaaiinzz!!!
        l_s=l_f.match(line) #we match the regular expression against the line
        l_h=l_s.group() #the whole match, which is the entire line without its newline
        brain_i=l_h[60:]
        brain.append(brain_i.replace(" ",""))
humans.close()

print(names,"\n",living,"\n",height,"\n",mass,"\n",brain)
print()

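#let's clean up the contents of our lists
def cleaner(lst):
    if "\n" in lst: #getting rid of any newline entry that got through
        lst.remove("\n")
    if lst[-1]=='': #deleting the empty string at the end of the list
        del lst[-1]


#cleaning all of the lists up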
cleaner(names)
cleaner(living)
cleaner(height)
cleaner(mass)
cleaner(brain)

print(names,len(names),"\n",living,len(living),"\n",height,len(height),"\n",mass,len(mass),"\n",brain,len(brain))

['H.habilis', 'H.erectus', 'H.ergaster', 'H.heidelbergensis', 'H.neanderthalensis', 'H.sapienssapiens', 'H.floresiensis', '\n'] 
 ['2.2-1.6', '1.4-0.2', '1.9-1.4', '0.6-0.35', '0.35-0.03', '0.2-present', '0.10-0.012', ''] 
 ['1.0-1.5', '1.8', '1.9', '1.8', '1.6', '1.4-1.9', '1.0', ''] 
 ['33-55', '60', '', '60', '55-70', '50-100', '25', ''] 
 ['660', '850(early)-1100(late)', '700-850', '1100-1400', '1200-1900', '1000-1850', '400', '']

['H.habilis', 'H.erectus', 'H.ergaster', 'H.heidelbergensis', 'H.neanderthalensis', 'H.sapienssapiens', 'H.floresiensis'] 7 
 ['2.2-1.6', '1.4-0.2', '1.9-1.4', '0.6-0.35', '0.35-0.03', '0.2-present', '0.10-0.012'] 7 
 ['1.0-1.5', '1.8', '1.9', '1.8', '1.6', '1.4-1.9', '1.0'] 7 
 ['33-55', '60', '', '60', '55-70', '50-100', '25'] 7 
 ['660', '850(early)-1100(late)', '700-850', '1100-1400', '1200-1900', '1000-1850', '400'] 7


In [20]:
#putting it all into a dictionary

import re #regular expressions just b/c I want to get rid of the newline characters
l_f=re.compile('.*$') #we match everything up to the newline character

#our lists of data from the columns, yet to be filled
names=[]
living=[]
height=[]
mass=[]
brain=[]

humans=open("human_evolution.txt") #opens the file as a variable we can access
for line in humans: #taking each line and cutting it into columns
    if line[0] not in ["S","-"," "]: #skipping the header, ruler, and source lines
        #the names
        names_i=line[0:20] #the relative section where names would be
        names.append(names_i.replace(" ","")) #getting rid of the whitespace and appending it to our list
        #living
        living_i=line[20:35]
        living.append(living_i.replace(" ",""))
        #height
        height_i=line[35:50]
        height.append(height_i.replace(" ",""))
        #mass
        mass_i=line[50:60]
        mass.append(mass_i.replace(" ",""))
        #Braaaaiinzz!!!
        l_s=l_f.match(line) #we match the regular expression against the line
        l_h=l_s.group() #the whole match, which is the entire line without its newline
        brain_i=l_h[60:]
        brain.append(brain_i.replace(" ",""))
humans.close()

#let's clean up the contents of our lists
def cleaner(lst):
    if "\n" in lst: #getting rid of any newline entry that got through
        lst.remove("\n")
    if lst[-1]=='': #deleting the empty string at the end of the list
        del lst[-1]


#cleaning all of the lists up
cleaner(names)
cleaner(living)
cleaner(height)
cleaner(mass)
cleaner(brain)

#Our dictionaries being made
human_evo={}
for person in range(len(names)):
    personal_data={}
    n_person=names[person]
    personal_data["Lived When (mil. yrs)"]=living[person]
    personal_data["Adult Height (m)"]=height[person]
    personal_data["Body Mass (kg)"]=mass[person]
    personal_data["Brain Volume (cm**3)"]=brain[person]
    human_evo[n_person]=personal_data

print(human_evo)

{'H.habilis': {'Lived When (mil. yrs)': '2.2-1.6', 'Adult Height (m)': '1.0-1.5', 'Body Mass (kg)': '33-55', 'Brain Volume (cm**3)': '660'}, 'H.erectus': {'Lived When (mil. yrs)': '1.4-0.2', 'Adult Height (m)': '1.8', 'Body Mass (kg)': '60', 'Brain Volume (cm**3)': '850(early)-1100(late)'}, 'H.ergaster': {'Lived When (mil. yrs)': '1.9-1.4', 'Adult Height (m)': '1.9', 'Body Mass (kg)': '', 'Brain Volume (cm**3)': '700-850'}, 'H.heidelbergensis': {'Lived When (mil. yrs)': '0.6-0.35', 'Adult Height (m)': '1.8', 'Body Mass (kg)': '60', 'Brain Volume (cm**3)': '1100-1400'}, 'H.neanderthalensis': {'Lived When (mil. yrs)': '0.35-0.03', 'Adult Height (m)': '1.6', 'Body Mass (kg)': '55-70', 'Brain Volume (cm**3)': '1200-1900'}, 'H.sapienssapiens': {'Lived When (mil. yrs)': '0.2-present', 'Adult Height (m)': '1.4-1.9', 'Body Mass (kg)': '50-100', 'Brain Volume (cm**3)': '1000-1850'}, 'H.floresiensis': {'Lived When (mil. yrs)': '0.10-0.012', 'Adult Height (m)': '1.0', 'Body Mass (kg)': '25', 'Bra

In [31]:
#putting it all into a dictionary

import re #regular expressions just b/c I want to get rid of the newline characters
l_f=re.compile('.*$') #we match everything up to the newline character

#our lists of data from the columns, yet to be filled
names=[]
living=[]
height=[]
mass=[]
brain=[]

humans=open("human_evolution.txt") #opens the file as a variable we can access
for line in humans: #taking each line and cutting it into columns
    if line[0] not in ["S","-"," "]: #skipping the header, ruler, and source lines
        #the names
        names_i=line[0:20] #the relative section where names would be
        names.append(names_i.replace(" ","")) #getting rid of the whitespace and appending it to our list
        #living
        living_i=line[20:35]
        living.append(living_i.replace(" ",""))
        #height
        height_i=line[35:50]
        height.append(height_i.replace(" ",""))
        #mass
        mass_i=line[50:60]
        mass.append(mass_i.replace(" ",""))
        #Braaaaiinzz!!!
        l_s=l_f.match(line) #we match the regular expression against the line
        l_h=l_s.group() #the whole match, which is the entire line without its newline
        brain_i=l_h[60:]
        brain.append(brain_i.replace(" ",""))
humans.close()

#let's clean up the contents of our lists
def cleaner(lst):
    if "\n" in lst: #getting rid of any newline entry that got through
        lst.remove("\n")
    if lst[-1]=='': #deleting the empty string at the end of the list
        del lst[-1]


#cleaning all of the lists up
cleaner(names)
cleaner(living)
cleaner(height)
cleaner(mass)
cleaner(brain)

#Our dictionaries being made
human_evo={}
for person in range(len(names)): #person is an index into the lists
    personal_data={} #personal_data is an empty dictionary to fill
    n_person=names[person] #the species name, used as the outer key
    personal_data["Lived When (mil. yrs)"]=living[person] #we add key-value pairs into the dictionary
    personal_data["Adult Height (m)"]=height[person] #from our lists
    personal_data["Body Mass (kg)"]=mass[person]
    personal_data["Brain Volume (cm**3)"]=brain[person]
    human_evo[n_person]=personal_data

for key,val in human_evo.items(): #we get items from the dictionary
    print(key) #we print the key
    for kee,vaa in val.items(): #the value accessed is a dictionary
        print(kee,vaa) #so we print a key and value from that dictionary
    print() #we print a blank line b/c why not make it look nice
        

H.habilis
Lived When (mil. yrs) 2.2-1.6
Adult Height (m) 1.0-1.5
Body Mass (kg) 33-55
Brain Volume (cm**3) 660

H.erectus
Lived When (mil. yrs) 1.4-0.2
Adult Height (m) 1.8
Body Mass (kg) 60
Brain Volume (cm**3) 850(early)-1100(late)

H.ergaster
Lived When (mil. yrs) 1.9-1.4
Adult Height (m) 1.9
Body Mass (kg) 
Brain Volume (cm**3) 700-850

H.heidelbergensis
Lived When (mil. yrs) 0.6-0.35
Adult Height (m) 1.8
Body Mass (kg) 60
Brain Volume (cm**3) 1100-1400

H.neanderthalensis
Lived When (mil. yrs) 0.35-0.03
Adult Height (m) 1.6
Body Mass (kg) 55-70
Brain Volume (cm**3) 1200-1900

H.sapienssapiens
Lived When (mil. yrs) 0.2-present
Adult Height (m) 1.4-1.9
Body Mass (kg) 50-100
Brain Volume (cm**3) 1000-1850

H.floresiensis
Lived When (mil. yrs) 0.10-0.012
Adult Height (m) 1.0
Body Mass (kg) 25
Brain Volume (cm**3) 400
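
For comparison, here is a more compact sketch that builds the nested dictionary in one pass and keeps the space in the species names, so that `humans['H. floresiensis']['Mass']` works as the assignment describes. Several things here are assumptions of mine rather than part of the solution above: the column start offsets in `STARTS`, the short key names in `KEYS`, the helper `make_row`, and the use of `io.StringIO` with a few rebuilt rows as a stand-in for the real file.

```python
import io

# Assumed layout: columns start at these character offsets.
STARTS = (0, 21, 37, 50, 62)
# Assumed inner keys; the assignment's example uses 'Mass'.
KEYS = ("Lived", "Height", "Mass", "Brain")

def make_row(fields):
    """Pad the first four fields to the assumed widths, then add the last."""
    widths = (21, 16, 13, 12)
    return "".join(f.ljust(w) for f, w in zip(fields, widths)) + fields[4] + "\n"

# Stand-in for human_evolution.txt (a few rows copied from the table above).
text = (
    make_row(("Species", "Lived when", "Adult", "Adult", "Brain volume"))
    + "-" * 79 + "\n"
    + make_row(("H. habilis", "2.2 - 1.6", "1.0 - 1.5", "33 - 55", "660"))
    + make_row(("H. ergaster", "1.9 - 1.4", "1.9", "", "700 - 850"))
    + make_row(("H. floresiensis", "0.10 - 0.012", "1.0", "25", "400"))
    + "-" * 79 + "\n\n"
)

humans = {}
for line in io.StringIO(text):          # iterates line by line, like a file
    if not line.startswith("H. "):      # keep only the species rows
        continue
    line = line.rstrip("\n")
    ends = STARTS[1:] + (len(line),)    # each column ends where the next starts
    fields = [line[a:b].strip() for a, b in zip(STARTS, ends)]
    humans[fields[0]] = dict(zip(KEYS, fields[1:]))

print(humans["H. floresiensis"]["Mass"])   # 25
for species, data in humans.items():
    print(species)
    for key, value in data.items():
        print("   ", key, ":", value)
```

With the real file, `open("human_evolution.txt")` would take the place of `io.StringIO(text)`, provided the offsets in `STARTS` match its layout. Filtering on `startswith("H. ")` skips the blank line up front, so no clean-up pass over the lists is needed.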

