_HDS5210 Programming for Data Science_

# Week 8 - Data Structures

https://drive.google.com/open?id=1CEFBEQaoFiA4fuVMfVeD2iq_rlfyu0U0oVV3SyotkBY

At the beginning of the semester, we learned about some basic variables of types like `int`, `str`, `float`, and `bool`.  These "single valued" data types are referred to as **"scalar"** data types.  In this context **scalar** refers to a data type that holds just a single value, as opposed to a **complex** data type that holds a more complicated data structure, like a `list` or `class`.

In this lecture, we'll talk about **sets** and **tuples**, but primarily focus on another complex data type called a **dictionary**.  https://docs.python.org/3/tutorial/datastructures.html#dictionaries

A dictionary is a collection of key/value pairs stored in a single variable.  In this example, we have a variable `v` that contains the keys `'a'` and `'b'` with corresponding values `'Three'` and `3`.
```
v = { 
  'a': 'Three', 
  'b': 3 
}
```

In [1]:
# We create an empty dictionary using curly braces:
d = {}

In [2]:
# Or we can initialize it with key / value pairs:
v = { 'a': 'Three', 'b': 3 }

In [3]:
v

{'a': 'Three', 'b': 3}

In [4]:
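# The values within a dictionary can be of any type, including other dictionaries
# (keys are more restricted: they must be hashable, such as strings or numbers):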
elements = { 
    'H' : {
        'name': 'Hydrogen',
        'number': 1,
        'isotopes': [1,2,3]
    },
    'He' : {
        'name': 'Helium',
        'number': 2,
        'isotopes': [2,3,4]
    }
}

In [5]:
elements

{'H': {'isotopes': [1, 2, 3], 'name': 'Hydrogen', 'number': 1},
 'He': {'isotopes': [2, 3, 4], 'name': 'Helium', 'number': 2}}

In [13]:
elements[0] 

KeyError: 0
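
The `KeyError` above arises because a dictionary is looked up by key, not by position, and `0` is not one of the keys. Below is a minimal sketch of two common ways to guard a lookup. The dictionary `symbols` is a small stand-in made up for illustration, not the `elements` dictionary itself.

```python
# A small stand-in dictionary; `symbols` is an illustrative name.
symbols = {'H': 'Hydrogen', 'He': 'Helium'}

# Membership test: `in` checks the keys, not the values.
has_zero = 0 in symbols      # False, so symbols[0] would raise KeyError
has_h = 'H' in symbols       # True

# dict.get returns a default (None unless one is given) instead of raising.
missing = symbols.get('Li')
fallback = symbols.get('Li', 'unknown')

print(has_zero, has_h, missing, fallback)
```

Which guard to use is a judgment call: `in` reads well when the missing case needs its own branch, while `.get()` is convenient when a default value is enough.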

In [6]:
# Note that each key can appear only once within a dictionary.
# The following isn't illegal, but it may not do what you expect: the last value given for a key wins.

duplicate = {
    'name': 'Paul Boal',
    'name': 'Eric Westhus'
}

In [7]:
duplicate

{'name': 'Eric Westhus'}

In [8]:
duplicate = {}
duplicate['name'] = 'Paul Boal'

In [9]:
duplicate

{'name': 'Paul Boal'}

In [10]:
duplicate['name'] = 'Eric Westhus'

In [11]:
duplicate

{'name': 'Eric Westhus'}

## Accessing Keys/Values of the Dictionary

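**Note that in Python 3.7 and later, dictionaries preserve the order in which keys were inserted. In older versions, the ordering of keys is ARBITRARY: not necessarily how they were entered, and not alphabetical. Some of the output in this notebook shows that older behavior.**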
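
When the order matters, one option is to sort the keys explicitly rather than rely on the dictionary's own order. A short sketch, using a made-up dictionary `counts` for illustration:

```python
# `counts` is an illustrative dictionary, not part of the lecture data.
counts = {'pear': 3, 'apple': 7, 'fig': 1}

# sorted() on a dictionary returns its keys as a sorted list.
by_key = sorted(counts)

# Sort the (key, value) pairs by value instead, using a key function.
by_value = sorted(counts.items(), key=lambda pair: pair[1])

for name in by_key:
    print(name, counts[name])
```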

In [14]:
# __getitem__ == []
help(dict.__getitem__)

Help on method_descriptor:

__getitem__(...)
    x.__getitem__(y) <==> x[y]



In [20]:
# for key in dictionary
for e in elements:
    print(e)
    print(elements[e])
    print("{:s} is short for {:s}".format(e, elements[e]['name']))

He
{'name': 'Helium', 'number': 2, 'isotopes': [2, 3, 4]}
He is short for Helium
H
{'name': 'Hydrogen', 'number': 1, 'isotopes': [1, 2, 3, 4]}
H is short for Hydrogen


In [18]:
elements['H']['isotopes'].append(4)

In [19]:
elements['H']

{'isotopes': [1, 2, 3, 4], 'name': 'Hydrogen', 'number': 1}

In [21]:
elements['H']['name'] = 'Hydrogen?'

In [22]:
elements['H']


{'isotopes': [1, 2, 3, 4], 'name': 'Hydrogen?', 'number': 1}

## Other ways of creating dictionaries

In [23]:
keys = ['one', 'two', 'three']
vals = [1,     2,     3      ]


In [26]:
d = dict(zip(keys, vals))

In [27]:
d

{'one': 1, 'three': 3, 'two': 2}

If key names are really simple, we can use a different syntax:

In [None]:
d = dict(one=1, two=2, three=3)

In [None]:
d

In [32]:
keys = ['one', 'two', 'three']
vals = [1,2,3]
dict(zip(keys, vals))

{'one': 1, 'three': 3, 'two': 2}
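
A dictionary comprehension is another way to build the same mapping, and it also allows transforming or filtering entries along the way. This is a sketch that reuses the `keys` and `vals` lists from above. The name `squares` is introduced only for illustration.

```python
keys = ['one', 'two', 'three']
vals = [1, 2, 3]

# Build the same mapping as dict(zip(keys, vals)) with a comprehension.
d = {k: v for k, v in zip(keys, vals)}

# A comprehension can also transform or filter while building.
squares = {k: v * v for k, v in zip(keys, vals) if v > 1}

print(d)
print(squares)
```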

In [37]:
d = {}

In [38]:
for index in range(len(keys)):
    d[vals[index]] = keys[index]

In [39]:
d

{1: 'one', 2: 'two', 3: 'three'}

In [40]:
d[1] 

'one'

In [41]:
d[1] = 'Seven'
d

{1: 'Seven', 2: 'two', 3: 'three'}

In [42]:
elements

{'H': {'isotopes': [1, 2, 3, 4], 'name': 'Hydrogen?', 'number': 1},
 'He': {'isotopes': [2, 3, 4], 'name': 'Helium', 'number': 2}}

In [44]:
elements['H'] = 1

In [45]:
elements

{'H': 1, 'He': {'isotopes': [2, 3, 4], 'name': 'Helium', 'number': 2}}

In [51]:
d = dict(zip(keys, vals))

In [52]:
d[1] = 'Seven'

In [53]:
d

{1: 'Seven', 'two': 2, 'three': 3, 'one': 1}

In [54]:
d = { 'one': 1, 'uno': 1 }
d

{'one': 1, 'uno': 1}

## Other ways of looping over dictionaries

In [59]:
for abbr, info in elements.items():
    print("{:s} = {:s}".format(str(abbr),str(info)))

He = {'name': 'Helium', 'number': 2, 'isotopes': [2, 3, 4]}
H = 1
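
Besides `.items()`, dictionaries also offer `.keys()` and `.values()` for looping over just one side of each pair. A small sketch, using a made-up dictionary `atomic_numbers` so the example does not depend on the current state of `elements`:

```python
# `atomic_numbers` is an illustrative dictionary.
atomic_numbers = {'H': 1, 'He': 2, 'Li': 3}

# .keys() and .values() are views over the dictionary's keys and values.
symbols = list(atomic_numbers.keys())
numbers = list(atomic_numbers.values())

# Views work directly with functions such as sum() and max().
total = sum(atomic_numbers.values())
largest = max(atomic_numbers.values())

print(symbols, numbers, total, largest)
```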


## Ways to use dictionaries

In [62]:
Aspirin100 = dict( drug="Aspirin", amount=100, mass_unit="mg", time_unit="hr")
Aspirin100

{'amount': 100, 'drug': 'Aspirin', 'mass_unit': 'mg', 'time_unit': 'hr'}

In [63]:
dosages = [
    dict( drug="Aspirin", amount=100, mass_unit="mg", time_unit="hr"),
    dict( drug="Digoxin", amount=50,  mass_unit="mg", time_unit="hr")
]

In [64]:
dosages

[{'amount': 100, 'drug': 'Aspirin', 'mass_unit': 'mg', 'time_unit': 'hr'},
 {'amount': 50, 'drug': 'Digoxin', 'mass_unit': 'mg', 'time_unit': 'hr'}]

In [65]:
for d in dosages:
    print("{:s} {:d} {:s}/{:s}".format(d["drug"],d["amount"],d["mass_unit"],d["time_unit"]))

Aspirin 100 mg/hr
Digoxin 50 mg/hr


## Reading CSV as a dictionary

The `csv` module has a `DictReader` class that reads a CSV file one row at a time.  Each data row becomes a dictionary whose keys are the column names from the header row and whose values are that row's entries.

In [71]:
import csv
with open('/midterm/census.csv') as f:
    l = csv.DictReader(f)
    for item in l:
        print(item['NAME'])
        


UNITED STATES
NORTHEAST REGION
MIDWEST REGION
SOUTH REGION
WEST REGION
ALABAMA
ALASKA
ARIZONA
ARKANSAS
CALIFORNIA
COLORADO
CONNECTICUT
DELAWARE
DISTRICT OF COLUMBIA
FLORIDA
GEORGIA
HAWAII
IDAHO
ILLINOIS
INDIANA
IOWA
KANSAS
KENTUCKY
LOUISIANA
MAINE
MARYLAND
MASSACHUSETTS
MICHIGAN
MINNESOTA
MISSISSIPPI
MISSOURI
MONTANA
NEBRASKA
NEVADA
NEW HAMPSHIRE
NEW JERSEY
NEW MEXICO
NEW YORK
NORTH CAROLINA
NORTH DAKOTA
OHIO
OKLAHOMA
OREGON
PENNSYLVANIA
RHODE ISLAND
SOUTH CAROLINA
SOUTH DAKOTA
TENNESSEE
TEXAS
UTAH
VERMONT
VIRGINIA
WASHINGTON
WEST VIRGINIA
WISCONSIN
WYOMING
PUERTO RICO


In [74]:
import csv
with open('/midterm/census.csv') as f:
    L = csv.DictReader(f)
    row1 = next(L)
#    print(row1)
    row2 = next(L)
    print(row2)

{'RBIRTH2011': '11.64168009', 'RDOMESTICMIG2011': '-2.947186446', 'STATE': '00', 'RNATURALINC2014': '2.7602304128', 'RDOMESTICMIG2015': '-5.763683328', 'RNATURALINC2013': '2.7844946193', 'NATURALINC2010': '52643', 'BIRTHS2014': '631620', 'NPOPCHG_2013': '184297', 'INTERNATIONALMIG2012': '242341', 'POPESTIMATE2011': '55638038', 'RNATURALINC2012': '3.1727476767', 'RNATURALINC2011': '3.159516597', 'RNETMIG2013': '0.78949056', 'RDEATH2014': '8.4995330359', 'NETMIG2013': '44154', 'NPOPCHG_2014': '151928', 'SUMLEV': '020', 'DEATHS2012': '461046', 'RBIRTH2014': '11.259763449', 'POPESTIMATE2010': '55387174', 'DOMESTICMIG2015': '-324078', 'RNETMIG2015': '-0.448231941', 'NPOPCHG_2015': '112610', 'RINTERNATIONALMIG2015': '5.3154513872', 'REGION': '1', 'RDEATH2013': '8.5832289365', 'NATURALINC2015': '155837', 'RNATURALINC2015': '2.7715399342', 'BIRTHS2013': '635765', 'RDEATH2011': '8.4821634927', 'NETMIG2010': '18218', 'NATURALINC2011': '175393', 'RINTERNATIONALMIG2012': '4.3479729736', 'RESIDUAL2

In [78]:
import csv

def get_population(filename, year):
    with open(filename) as f:
        # Named `reader` rather than `dict` to avoid shadowing the built-in dict type
        reader = csv.DictReader(f)
        row1 = next(reader)
        #print(row1['NAME'], row1['POPESTIMATE'+str(year)])
        return row1['POPESTIMATE'+str(year)]
    
print(get_population('/midterm/census.csv', 2014))

318907401
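
Note that `DictReader` returns every value as a string, so the population above is the text `'318907401'` rather than a number. The sketch below shows converting values before doing arithmetic. It reads from an in-memory string through `io.StringIO` so that it runs without a file. The column names and rows are made up and are not taken from the census file.

```python
import csv
import io

# Made-up sample data standing in for a real file; not the census file.
sample = io.StringIO("NAME,POP\nAlpha,100\nBeta,250\n")

reader = csv.DictReader(sample)
rows = list(reader)    # each row is a dictionary keyed by header name

# Every value comes back as a string, so convert before doing arithmetic.
total_pop = sum(int(row['POP']) for row in rows)

print(rows[0]['NAME'], type(rows[0]['POP']).__name__, total_pop)
```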


In [80]:
import csv
with open('/midterm/census.csv') as f:
    L = csv.DictReader(f)
    row1 = next(L)
    print(list(row1.keys()))

['RBIRTH2011', 'RDOMESTICMIG2011', 'STATE', 'RNATURALINC2014', 'RDOMESTICMIG2015', 'RNATURALINC2013', 'NATURALINC2010', 'BIRTHS2014', 'NPOPCHG_2013', 'INTERNATIONALMIG2012', 'POPESTIMATE2011', 'RNATURALINC2012', 'RNATURALINC2011', 'RNETMIG2013', 'RDEATH2014', 'NETMIG2013', 'NPOPCHG_2014', 'SUMLEV', 'DEATHS2012', 'RBIRTH2014', 'POPESTIMATE2010', 'DOMESTICMIG2015', 'RNETMIG2015', 'NPOPCHG_2015', 'RINTERNATIONALMIG2015', 'REGION', 'RDEATH2013', 'NATURALINC2015', 'RNATURALINC2015', 'BIRTHS2013', 'RDEATH2011', 'NETMIG2010', 'NATURALINC2011', 'RINTERNATIONALMIG2012', 'RESIDUAL2011', 'NPOPCHG_2011', 'DOMESTICMIG2014', 'RNETMIG2014', 'ESTIMATESBASE2010', 'RINTERNATIONALMIG2011', 'RESIDUAL2010', 'RDEATH2012', 'POPESTIMATE2015', 'RESIDUAL2013', 'NETMIG2014', 'INTERNATIONALMIG2015', 'BIRTHS2010', 'RDOMESTICMIG2014', 'DEATHS2014', 'BIRTHS2015', 'POPESTIMATE2012', 'DOMESTICMIG2012', 'RNETMIG2012', 'NETMIG2015', 'BIRTHS2012', 'POPESTIMATE2014', 'NETMIG2011', 'NATURALINC2014', 'RBIRTH2012', 'DEATHS20

## Inverting Dictionaries

In [81]:
patient_ages = {
    "E143291": 19,
    "E872839": 32,
    "E878198": 19,
    "E871111": 21,
    "E143299": 3,
    "E123332": 21,
    "E989891": 19
} 

In [85]:
age_patients = {}
for pat_id, age in patient_ages.items():
    print(pat_id, age)
    
    if age in age_patients:
        print(age, "is already here")
        age_patients[age].append(pat_id)
    else:
        print(age, "is not here")
        age_patients[age] = [pat_id]

age_patients

E123332 21
21 is not here
E989891 19
19 is not here
E143291 19
19 is already here
E872839 32
32 is not here
E143299 3
3 is not here
E871111 21
21 is already here
E878198 19
19 is already here


{3: ['E143299'],
 19: ['E989891', 'E143291', 'E878198'],
 21: ['E123332', 'E871111'],
 32: ['E872839']}

In [89]:
age_patients = {}
for pat_id, age in patient_ages.items():
    print(pat_id, age)

    if age not in age_patients:
        age_patients[age] = []
    
    age_patients[age].append(pat_id)
    print("Age {:d} has {:d} patients".format(age, len(age_patients[age])))

age_patients

for age, pat_list in age_patients.items():
    print("-- Age {:d} has {:d} patients".format(age, len(pat_list)))
    

E123332 21
Age 21 has 1 patients
E989891 19
Age 19 has 1 patients
E143291 19
Age 19 has 2 patients
E872839 32
Age 32 has 1 patients
E143299 3
Age 3 has 1 patients
E871111 21
Age 21 has 2 patients
E878198 19
Age 19 has 3 patients
-- Age 32 has 1 patients
-- Age 3 has 1 patients
-- Age 19 has 3 patients
-- Age 21 has 2 patients
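
The "create the list if the key is missing, then append" pattern above is common enough that the standard library has shortcuts for it. The sketch below shows two of them, `collections.defaultdict` and `dict.setdefault`, on a shortened copy of `patient_ages`. The name `age_patients_plain` is introduced only for illustration.

```python
from collections import defaultdict

# A shortened copy of the patient_ages data from above.
patient_ages = {"E143291": 19, "E872839": 32, "E878198": 19, "E871111": 21}

# defaultdict(list) creates an empty list the first time a key is touched.
age_patients = defaultdict(list)
for pat_id, age in patient_ages.items():
    age_patients[age].append(pat_id)

# The same inversion with a plain dict and setdefault.
age_patients_plain = {}
for pat_id, age in patient_ages.items():
    age_patients_plain.setdefault(age, []).append(pat_id)

print(dict(age_patients))
```

Both versions produce the same mapping as the explicit `if age not in age_patients` check. Which one reads best is largely a matter of taste.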


## Sets

A set is an unordered collection of unique values - there won't be any duplicates, and adding a value that is already present has no effect.

You can also think about it as just a dictionary with only keys - remember that dictionaries can only have one entry for each key.

In [90]:
sex = {'M', 'F', 'U', 'O'}
sex

{'F', 'M', 'O', 'U'}

In [91]:
sex = {'M', 'F', 'U', 'M', 'F', 'O'}
sex

{'F', 'M', 'O', 'U'}

In [92]:
sex.add('T')
sex

{'F', 'M', 'O', 'T', 'U'}

You can also do all kinds of other set operations: See Chapter 11, p202
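
As a brief sketch of those operations, here are union, intersection, difference, and symmetric difference on two made-up sets of patient IDs. The names `clinic_a` and `clinic_b` are hypothetical.

```python
# Two illustrative sets of patient IDs (hypothetical clinics).
clinic_a = {'E143291', 'E872839', 'E878198'}
clinic_b = {'E878198', 'E871111'}

either = clinic_a | clinic_b      # union: in at least one set
both = clinic_a & clinic_b        # intersection: in both sets
only_a = clinic_a - clinic_b      # difference: in a but not in b
one_only = clinic_a ^ clinic_b    # symmetric difference: in exactly one set

print(sorted(both), sorted(only_a))
```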

## Tuples

A tuple is another ordered sequence, similar to a list.  What's special about tuples is that they are immutable: they can't be altered once they're created, even though mutable objects stored inside them can still be changed.  Weird.

One thing you can do with tuples is assign to multiple variables at once.

In [93]:
(a, b) = (1, 2)
a

1

In [94]:
b

2
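
To make the point about immutability concrete, the sketch below tries to reassign a tuple element, which raises `TypeError`, then changes a list stored inside the tuple, which is allowed. It also shows the same unpacking syntax inside a `for` loop. All names here are made up for illustration.

```python
# A tuple holding an int and a (mutable) list.
point = (1, [2, 3])

# Rebinding an element of a tuple is not allowed.
try:
    point[0] = 10
    reassigned = True
except TypeError:
    reassigned = False

# A mutable object stored inside the tuple can still change.
point[1].append(4)

# Unpacking also works in a for loop over pairs.
pairs = [('H', 1), ('He', 2)]
labels = []
for symbol, number in pairs:
    labels.append("{:s}={:d}".format(symbol, number))

print(reassigned, point, labels)
```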