_HDS5210 Programming for Data Science_

# Week 8 - Data Structures

https://drive.google.com/drive/folders/12qaE8sNPWRMqcfeI9eSBxzUscz8KZDBf?usp=sharing

At the beginning of the semester, we learned about some basic data types like `int`, `str`, `float`, and `bool`.  These single-valued data types are referred to as **"scalar"** data types.  In this context, **scalar** refers to a data type that holds just a single value, as opposed to a **complex** data type that holds a more complicated data structure, like a `list` or `class`.

In this lecture, we'll talk about **sets** and **tuples**, but primarily focus on another complex data type called a **dictionary**.  https://docs.python.org/3/tutorial/datastructures.html#dictionaries

A dictionary is a collection of key/value pairs stored in a single variable.  Each key is used to look up its corresponding value.  In this example, we have a variable `v` that contains the keys `'a'` and `'b'` with corresponding values `'Three'` and `3`.
```
v = { 
  'a': 'Three', 
  'b': 3 
}
```

In [1]:
# We create an empty dictionary using curly braces:
d = {}

In [2]:
d

{}

In [5]:
# Or we can initialize it with key / value pairs:
v = { 
    'a': 'Three', 
    'b': 3 }

In [6]:
v

{'a': 'Three', 'b': 3}

In [7]:
v = { 1: 'one', 2:'two'}

In [8]:
v

{1: 'one', 2: 'two'}

In [9]:
v = { one: 1, two: 2 }

NameError: name 'one' is not defined

In [10]:
one  = 'one'
two = 'two'
v = {one: 1, two:2}

In [11]:
v

{'one': 1, 'two': 2}

In [12]:
v = {
    [1,2]: 'one and two'
}

TypeError: unhashable type: 'list'
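
The `TypeError` above comes from the fact that dictionary keys must be hashable, which for the built-in types means immutable values such as `str`, `int`, or `tuple`. A minimal sketch of swapping the list for a tuple (the name `pair_labels` is made up for illustration):

```python
# Keys must be hashable; a tuple works where a list does not.
# pair_labels is a hypothetical name used only for this example.
pair_labels = {
    (1, 2): 'one and two',
    (3, 4): 'three and four',
}

print(pair_labels[(1, 2)])

# A list key still fails; catching the error lets us inspect the message.
try:
    bad = {[1, 2]: 'one and two'}
except TypeError as err:
    message = str(err)
    print("TypeError:", message)
```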

In [38]:
# The values within a dictionary can be of any type:
elements = { 
    'H' : {
        'name': 'Hydrogen',
        'number': 1,
        'isotopes': [1,2,3]
    },
    'He' : {
        'name': 'Helium',
        'number': 2,
        'isotopes': [2,3,4]
    }
}

In [14]:
elements

{'H': {'isotopes': [1, 2, 3], 'name': 'Hydrogen', 'number': 1},
 'He': {'isotopes': [2, 3, 4], 'name': 'Helium', 'number': 2}}

In [17]:
elements['H']['name']

'Hydrogen'

In [18]:
# Note that each key can appear only once within a dictionary.
# The following is legal, but it may not do what you expect: the last value given for a key wins.

duplicate = {
    'name': 'Paul Boal',
    'name': 'Eric Westhus'
}

In [19]:
duplicate

{'name': 'Eric Westhus'}

In [20]:
duplicate = {}
duplicate['name'] = 'Paul Boal'

In [21]:
duplicate

{'name': 'Paul Boal'}

In [22]:
duplicate['name'] = 'Eric Westhus'

In [23]:
duplicate

{'name': 'Eric Westhus'}

## Accessing Keys/Values of the Dictionary

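**Note: in Python 3.7 and later, a dictionary keeps its keys in the order they were inserted. In older versions the ordering was ARBITRARY, which is why some outputs recorded in this notebook show keys in a different order than they were entered. Keys are never sorted alphabetically for you; use `sorted()` when you need that.**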

In [None]:
# __getitem__ == [] 
help(dict.__getitem__)

In [24]:
# for key in dictionary
for e in elements:
    print(e)
    print(elements[e])
    print("{:s} is short for {:s}".format(e, elements[e]['name']))

H
{'name': 'Hydrogen', 'number': 1, 'isotopes': [1, 2, 3]}
H is short for Hydrogen
He
{'name': 'Helium', 'number': 2, 'isotopes': [2, 3, 4]}
He is short for Helium


In [25]:
elements['H']['isotopes'].append(4)

In [26]:
elements['H']

{'isotopes': [1, 2, 3, 4], 'name': 'Hydrogen', 'number': 1}

In [27]:
elements['H']['name'] = 'Hydrogen?'

In [28]:
elements['H']


{'isotopes': [1, 2, 3, 4], 'name': 'Hydrogen?', 'number': 1}
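
Square-bracket access raises a `KeyError` when the key is missing. The sketch below shows two common ways around that, the `in` operator and `dict.get()`. It uses a small stand-in dictionary, `elements_demo`, which is a hypothetical name so that it doesn't disturb the `elements` variable above.

```python
# A small stand-in for the elements dictionary (hypothetical name).
elements_demo = {
    'H': {'name': 'Hydrogen', 'number': 1},
    'He': {'name': 'Helium', 'number': 2},
}

# elements_demo['Li'] would raise KeyError, so check first with `in` ...
has_lithium = 'Li' in elements_demo
print(has_lithium)

# ... or use get(), which returns a default instead of raising.
lithium = elements_demo.get('Li')  # None when the key is missing
helium_name = elements_demo.get('He', {}).get('name', 'unknown')
lithium_name = elements_demo.get('Li', {}).get('name', 'unknown')
print(lithium, helium_name, lithium_name)
```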

## Other ways of creating dictionaries

In [29]:
keys = ['one', 'two', 'three']
vals = [1,     2,     3      ]


In [30]:
d = dict(zip(keys, vals))

In [31]:
d

{'one': 1, 'three': 3, 'two': 2}

If key names are really simple, we can use a different syntax:

In [32]:
d = dict(one=1, two=2, three=3)

In [33]:
d

{'one': 1, 'three': 3, 'two': 2}

In [34]:
keys = ['one', 'two', 'three']
vals = [1,2,3]
dict(zip(keys, vals))

{'one': 1, 'three': 3, 'two': 2}
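
A dictionary comprehension is one more way to build a dictionary. This is a minimal sketch that reuses the `keys` and `vals` lists from above; `squares` and `odd_only` are names introduced only for illustration.

```python
keys = ['one', 'two', 'three']
vals = [1, 2, 3]

# Same result as dict(zip(keys, vals)), written as a comprehension.
d = {k: v for k, v in zip(keys, vals)}

# A comprehension can also transform or filter values while building.
squares = {k: v * v for k, v in d.items()}
odd_only = {k: v for k, v in d.items() if v % 2 == 1}

print(d)
print(squares)
print(odd_only)
```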

In [None]:
d = {}

In [None]:
for index in range(len(keys)):
    d[vals[index]] = keys[index]

In [None]:
d

In [None]:
d[1] 

In [None]:
d[1] = 'Seven'
d

In [35]:
elements

{'H': {'isotopes': [1, 2, 3, 4], 'name': 'Hydrogen?', 'number': 1},
 'He': {'isotopes': [2, 3, 4], 'name': 'Helium', 'number': 2}}

In [36]:
elements['H'] = 1

In [37]:
elements

{'H': 1, 'He': {'isotopes': [2, 3, 4], 'name': 'Helium', 'number': 2}}

In [None]:
d = dict(zip(keys, vals))

In [None]:
d[1] = 'Seven'

In [None]:
d

In [None]:
d = { 'one': 1, 'uno': 1 }
d

## Other ways of looping over dictionaries

In [39]:
for abbr, info in elements.items():
    print("{:s} = {:s}".format(str(abbr),str(info)))

H = {'name': 'Hydrogen', 'number': 1, 'isotopes': [1, 2, 3]}
He = {'name': 'Helium', 'number': 2, 'isotopes': [2, 3, 4]}


In [45]:
alpha = { 'a': 1, 'b': 2, 'd': 4, 'c': 3}
alpha

{'a': 1, 'b': 2, 'c': 3, 'd': 4}

In [46]:
for letter in alpha:
    print(letter)

c
b
a
d


In [47]:
for letter in sorted(alpha):
    print(letter)

a
b
c
d
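
`sorted()` on a dictionary sorts its keys. To sort by value instead, one option is to sort the `(key, value)` pairs from `items()` with a key function. A short sketch, where `by_key` and `by_value_desc` are illustrative names:

```python
alpha = {'a': 1, 'b': 2, 'd': 4, 'c': 3}

# sorted() on items() yields (key, value) tuples ordered by key.
by_key = sorted(alpha.items())

# Pass a key function to order by value instead, largest first.
by_value_desc = sorted(alpha.items(), key=lambda pair: pair[1], reverse=True)

for letter, number in by_value_desc:
    print(letter, number)
```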


## Ways to use dictionaries

In [48]:
Asprin100 = dict( drug="Aspirin", amount=100, mass_unit="mg", time_unit="hr")
Asprin100

{'amount': 100, 'drug': 'Aspirin', 'mass_unit': 'mg', 'time_unit': 'hr'}

In [49]:
dosages = [
    dict( drug="Aspirin", amount=100, mass_unit="mg", time_unit="hr"),
    dict( drug="Digoxin", amount=50,  mass_unit="mg", time_unit="hr")
]

In [50]:
dosages

[{'amount': 100, 'drug': 'Aspirin', 'mass_unit': 'mg', 'time_unit': 'hr'},
 {'amount': 50, 'drug': 'Digoxin', 'mass_unit': 'mg', 'time_unit': 'hr'}]

In [51]:
for d in dosages:
    print("{:s} {:d} {:s}/{:s}".format(d["drug"],d["amount"],d["mass_unit"],d["time_unit"]))

Aspirin 100 mg/hr
Digoxin 50 mg/hr


In [62]:
Asprin100 = dict( drug="Aspirin", amount=100, mass_unit="mg", time_unit="hr")
Asprin100

{'amount': 100, 'drug': 'Aspirin', 'mass_unit': 'mg', 'time_unit': 'hr'}

In [63]:
Asprin100['drug name'] = Asprin100['drug']
del Asprin100['drug']
Asprin100

{'amount': 100, 'drug name': 'Aspirin', 'mass_unit': 'mg', 'time_unit': 'hr'}

In [64]:
Asprin100 = {'drug name': 'Aspirin', 'amount':100, 'mass_unit': 'mg', 'time_unit': 'hr'}
Asprin100

{'amount': 100, 'drug name': 'Aspirin', 'mass_unit': 'mg', 'time_unit': 'hr'}

## Reading CSV as a dictionary

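The `csv` module has a `DictReader` class that reads a CSV file one row at a time, yielding one dictionary per input data row.  Each dictionary uses the column names from the header row as its keys.  Note that every value comes back as a string, and that the reader is an iterator rather than a list, so you step through it with `next()` or a `for` loop.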

In [65]:
import csv
f = open('/data/preexisting.csv')
reader = csv.DictReader(f)
    
    

In [66]:
row  = next(reader)

In [67]:
row

{'Enrolled Through April 30, 2011': '103',
 'Enrolled Through April 30, 2012': '524',
 'Enrolled Through April 30, 2013': '820',
 'Enrolled Through August 31, 2011': '230',
 'Enrolled Through August 31, 2012': '679',
 'Enrolled Through August 31, 2013': '711',
 'Enrolled Through Dec. 31, 2012': '911',
 'Enrolled Through December 31, 2011': '389',
 'Enrolled Through December 31, 2013': '123',
 'Enrolled Through Feb. 28, 2013': '1006',
 'Enrolled Through February 1, 2011': '77',
 'Enrolled Through February 29, 2012': '429',
 'Enrolled Through Jan. 31, 2013': '972',
 'Enrolled Through January 31, 2014  ': '115',
 'Enrolled Through July 31, 2011': '182',
 'Enrolled Through July 31, 2012': '635',
 'Enrolled Through July 31, 2013': '736',
 'Enrolled Through June 30, 2011': '138',
 'Enrolled Through June 30, 2012': '590',
 'Enrolled Through June 30, 2013': '766',
 'Enrolled Through Mar. 31, 2013': '821',
 'Enrolled Through March 31, 2011': '91',
 'Enrolled Through March 31, 2012': '466',
 'En

In [None]:
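# Example input, including rows with blank or missing values:
# State,POPESTIMATE2014,POPESTIMATE2015,POPESTIMATE2016
# AL,3098123,3498234,3398475
# AR,4587691,,
# CO,,

import csv

# Hypothetical path: point this at a CSV with the header shown above.
filename = 'population.csv'

years = [2014, 2015, 2016]
with open(filename) as f:
    reader = csv.DictReader(f)
    for row in reader:
        total = 0
        count = 0
        for yr in years:
            value = row['POPESTIMATE' + str(yr)]
            # Values are strings; blank cells are '' and missing cells are None.
            if value:
                total += int(value)
                count += 1

        # Average only over the years that actually have data.
        if count > 0:
            avg = total / count
            print(avg)
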

In [None]:
import csv
with open('/midterm/census.csv') as f:
    L = csv.DictReader(f)
    row1 = next(L)
    print(list(row1.keys()))
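
To try `DictReader` without any of the course data files, here is a self-contained sketch. `DictReader` accepts any iterable of lines, so an in-memory `io.StringIO` stands in for a real file here. The sample rows are invented for illustration.

```python
import csv
import io

# In-memory stand-in for a CSV file; the rows are invented sample data.
sample = io.StringIO(
    "State,POPESTIMATE2014,POPESTIMATE2015\n"
    "AL,100,110\n"
    "AR,200,\n"
)

reader = csv.DictReader(sample)
rows = list(reader)  # consume the iterator into a list of dictionaries

print(reader.fieldnames)  # column names taken from the header row
for row in rows:
    # Every value arrives as a string, so convert before doing arithmetic.
    pop_2014 = int(row['POPESTIMATE2014'])
    print(row['State'], pop_2014, repr(row['POPESTIMATE2015']))
```

The blank cell in the `AR` row comes back as an empty string, which is why it is worth checking for before calling `int()`.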

## Inverting Dictionaries

In [68]:
patient_ages = {
    "E143291": 19,
    "E872839": 32,
    "E878198": 19,
    "E871111": 21,
    "E143299": 3,
    "E123332": 21,
    "E989891": 19
} 

In [69]:
age_counts = {}
for pat_id, age in patient_ages.items():
    if age in age_counts:
        age_counts[age] += 1
    else:
        age_counts[age] = 1
        
age_counts

{3: 1, 19: 3, 21: 2, 32: 1}

In [70]:
age_counts = {}
for pat_id, age in patient_ages.items():
    age_counts.setdefault(age, 0)
    age_counts[age] += 1
    
age_counts

{3: 1, 19: 3, 21: 2, 32: 1}
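
The counting pattern above is common enough that the standard library offers `collections.Counter` for it. This is an alternative sketch rather than a replacement for the loops shown, since the loops make the logic easier to see.

```python
from collections import Counter

patient_ages = {
    "E143291": 19,
    "E872839": 32,
    "E878198": 19,
    "E871111": 21,
    "E143299": 3,
    "E123332": 21,
    "E989891": 19,
}

# Counter tallies how many times each value appears.
age_counts = Counter(patient_ages.values())

print(age_counts[19])             # number of patients aged 19
print(age_counts.most_common(1))  # the most frequent age and its count
```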

In [71]:
age_patients = {}
for pat_id, age in patient_ages.items():
    print(pat_id, age)
    
    if age in age_patients:
        print(age, "is already here")
        age_patients[age].append(pat_id)
    else:
        print(age, "is not here")
        age_patients[age] = [pat_id]

age_patients

E123332 21
21 is not here
E143299 3
3 is not here
E871111 21
21 is already here
E878198 19
19 is not here
E143291 19
19 is already here
E989891 19
19 is already here
E872839 32
32 is not here


{3: ['E143299'],
 19: ['E878198', 'E143291', 'E989891'],
 21: ['E123332', 'E871111'],
 32: ['E872839']}

In [72]:
age_patients = {}
for pat_id, age in patient_ages.items():
    print(pat_id, age)
    
    age_patients.setdefault(age,[])
    age_patients[age].append(pat_id)

age_patients

E123332 21
E143299 3
E871111 21
E878198 19
E143291 19
E989891 19
E872839 32


{3: ['E143299'],
 19: ['E878198', 'E143291', 'E989891'],
 21: ['E123332', 'E871111'],
 32: ['E872839']}

In [None]:
age_patients = {}
for pat_id, age in patient_ages.items():
    print(pat_id, age)

    if age not in age_patients:
        age_patients[age] = []
    
    age_patients[age].append(pat_id)
    print("Age {:d} has {:d} patients".format(age, len(age_patients[age])))

age_patients

for age, pat_list in age_patients.items():
    print("-- Age {:d} has {:d} patients".format(age, len(pat_list)))
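
The "create an empty list if the key is new, then append" step can also be handled by `collections.defaultdict`. The sketch below is one alternative to `setdefault`, using the same patient data:

```python
from collections import defaultdict

patient_ages = {
    "E143291": 19,
    "E872839": 32,
    "E878198": 19,
    "E871111": 21,
    "E143299": 3,
    "E123332": 21,
    "E989891": 19,
}

# defaultdict(list) creates an empty list the first time a key is accessed.
age_patients = defaultdict(list)
for pat_id, age in patient_ages.items():
    age_patients[age].append(pat_id)

for age in sorted(age_patients):
    print(age, age_patients[age])
```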
    

# Sets

Sets are an unordered collection of unique values - there won't be any duplicates.  Unlike a list, a set has no positions, so you can't index into it.

You can also think about it as just a dictionary with only keys - remember that dictionaries can only have one entry for each key.

In [73]:
sex = {'M', 'F', 'U', 'O'}
sex

{'F', 'M', 'O', 'U'}

In [74]:
sex = {'M', 'F', 'U', 'M', 'F', 'O'}
sex

{'F', 'M', 'O', 'U'}

In [75]:
sex.add('T')
sex

{'F', 'M', 'O', 'T', 'U'}

Sets also support the usual set operations, such as union, intersection, and difference: see Chapter 11, p. 202.
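
As a brief sketch of a few of those operations, using two made-up sets named `recorded` and `observed`:

```python
# Two hypothetical sets of codes, invented for illustration.
recorded = {'M', 'F', 'U', 'O'}
observed = {'M', 'F', 'T'}

both = recorded & observed           # intersection: in both sets
either = recorded | observed         # union: in at least one set
only_recorded = recorded - observed  # difference: in recorded but not observed

# Sets are unordered, so sort them for a predictable display.
print(sorted(both))
print(sorted(either))
print(sorted(only_recorded))

# Membership and subset tests
t_is_recorded = 'T' in recorded
mf_is_subset = {'M', 'F'} <= recorded
print(t_is_recorded, mf_is_subset)
```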

# Tuples

There's another special kind of ordered sequence called a tuple.  What's special about a tuple is that it can't be altered once it's created: you can't add, remove, or replace its items.  A mutable object stored inside a tuple, such as a list, can still be changed, though.  Weird.

One thing you can do with tuples is assign to multiple variables at once.

In [76]:
(a, b) = (1, 2) 
a

1

In [77]:
b

2
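
A short sketch of the two points above: a tuple rejects item assignment, a list stored inside it can still change, and tuple assignment makes it easy to swap two variables. The name `point` is made up for illustration.

```python
# A tuple holding an int and a list (hypothetical example data).
point = (1, [2, 3])

# Replacing an item of the tuple is not allowed.
try:
    point[0] = 10
except TypeError as err:
    error_text = str(err)
    print("TypeError:", error_text)

# The list inside the tuple is still mutable.
point[1].append(4)
print(point)

# Tuple assignment also gives a compact way to swap two variables.
a, b = 1, 2
a, b = b, a
print(a, b)
```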