In [1]:
import sys
print("Python Version:", sys.version, '\n')

Python Version: 3.6.7 |Anaconda, Inc.| (default, Oct 23 2018, 14:01:38) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] 



# Advanced Data Types: The Collections Module

Throughout Python's existence, several tasks have popped up over time that are regularly a pain for people. To address those, the collections model has several "new" data types that smooth over constant issues in python. Let's look at some of those types.

## DefaultDict

Dictionaries expect that you will create a key-value pair before using the value. That's pretty reasonable most of the time, but sometimes you just want it to assume some basic value whenever a new key is entered. See this example.

In [2]:
count = {}
count['duck'] = 0

animals = ['duck','duck','duck','goose']

for animal in animals:
    count[animal] += 1
    print(animal)

count

duck
duck
duck


KeyError: 'goose'

It didn't have a value for `goose` so it couldn't add 1 to it. We can get around that with some try-except work - but that's sort of annoying. The `defaultdict` allows us to specify ahead of time to just assume a basic type of value for any new key. For instance, if we tell it to expect an `int` it will assume 0.

In [3]:
count = {}

animals = ['duck','duck','duck','goose']

for animal in animals:
    try:
        count[animal] += 1
    except KeyError:
        count[animal] = 1

count

{'duck': 3, 'goose': 1}

In [4]:
count = {} 

animals = ['duck','duck','duck','goose']

for animal in animals:
    if animal in count:
        count[animal] += 1
    else:
        count[animal] = 1

count

{'duck': 3, 'goose': 1}

In [5]:
from collections import defaultdict

count = defaultdict(int)
animals = ['duck','duck','duck','goose']

for animal in animals:
    count[animal] += 1
    
count

defaultdict(int, {'duck': 3, 'goose': 1})

## Named Tuple

Sometimes you want to create a class, but the class only needs to store data, and you are lazy.

You could put the data in a dictionary, but there is a set amount of info that never changes for each instance. You could put the data in a tuple, but then you need to remember the order. What if you could have the simplicity of a tuple, but labels like a dictionary, and access methods by name like a dictionary? That's a **named tuple**.

In [6]:
from collections import namedtuple

Alumni = namedtuple('Alumni','name age gender degree title salary employer')

alice = Alumni(name='Alice',
               age=29,
               gender='F',
               degree ='PhD',
               title = 'Data Scientist',
               salary = 115000,
               employer = 'Thumbtack')

alice.age

29

In [9]:
print(alice.name, alice.age, alice.gender , alice.degree, alice.title, alice.salary, alice.employer)

Alice 29 F PhD Data Scientist 115000 Thumbtack


## Deque

A deque (double-ended queue) is a lovely type of object that's designed for accessing data on either end. A normal list is only optimized for adding-removing from the right with things like append and pop. Deque's are designed to be ambivalent about sides. 

In [18]:
from collections import deque

d = deque([1,2,3,4])
d.appendleft(3)
d

deque([3, 1, 2, 3, 4])

In [19]:
# remove the left element from the list
print(d.popleft())
d

3


deque([1, 2, 3, 4])

In [20]:
d.append(9)
d

deque([1, 2, 3, 4, 9])

In [21]:
# remove the right element from the list
print(d.pop())
d

9


deque([1, 2, 3, 4])

We can also use deque's as a sliding window so we don't have to play weird games about chopping bits and pieces off if we want a fixed length.

In [22]:
window = deque(maxlen=4)
for idx in range(10):
    window.append(idx)
    print(window)
    
print("---SWITCH---")
for idx in range(10):
    window.appendleft(idx)
    print(window)

deque([0], maxlen=4)
deque([0, 1], maxlen=4)
deque([0, 1, 2], maxlen=4)
deque([0, 1, 2, 3], maxlen=4)
deque([1, 2, 3, 4], maxlen=4)
deque([2, 3, 4, 5], maxlen=4)
deque([3, 4, 5, 6], maxlen=4)
deque([4, 5, 6, 7], maxlen=4)
deque([5, 6, 7, 8], maxlen=4)
deque([6, 7, 8, 9], maxlen=4)
---SWITCH---
deque([0, 6, 7, 8], maxlen=4)
deque([1, 0, 6, 7], maxlen=4)
deque([2, 1, 0, 6], maxlen=4)
deque([3, 2, 1, 0], maxlen=4)
deque([4, 3, 2, 1], maxlen=4)
deque([5, 4, 3, 2], maxlen=4)
deque([6, 5, 4, 3], maxlen=4)
deque([7, 6, 5, 4], maxlen=4)
deque([8, 7, 6, 5], maxlen=4)
deque([9, 8, 7, 6], maxlen=4)


In [23]:
new_window = deque(maxlen=8)
for i in range(10):
    new_window.append(i)
    print(new_window)
    
for i in range(10):
    new_window.appendleft(i)
    print(new_window)

deque([0], maxlen=8)
deque([0, 1], maxlen=8)
deque([0, 1, 2], maxlen=8)
deque([0, 1, 2, 3], maxlen=8)
deque([0, 1, 2, 3, 4], maxlen=8)
deque([0, 1, 2, 3, 4, 5], maxlen=8)
deque([0, 1, 2, 3, 4, 5, 6], maxlen=8)
deque([0, 1, 2, 3, 4, 5, 6, 7], maxlen=8)
deque([1, 2, 3, 4, 5, 6, 7, 8], maxlen=8)
deque([2, 3, 4, 5, 6, 7, 8, 9], maxlen=8)
deque([0, 2, 3, 4, 5, 6, 7, 8], maxlen=8)
deque([1, 0, 2, 3, 4, 5, 6, 7], maxlen=8)
deque([2, 1, 0, 2, 3, 4, 5, 6], maxlen=8)
deque([3, 2, 1, 0, 2, 3, 4, 5], maxlen=8)
deque([4, 3, 2, 1, 0, 2, 3, 4], maxlen=8)
deque([5, 4, 3, 2, 1, 0, 2, 3], maxlen=8)
deque([6, 5, 4, 3, 2, 1, 0, 2], maxlen=8)
deque([7, 6, 5, 4, 3, 2, 1, 0], maxlen=8)
deque([8, 7, 6, 5, 4, 3, 2, 1], maxlen=8)
deque([9, 8, 7, 6, 5, 4, 3, 2], maxlen=8)


# Generators

Generators aren't in the `collections` package, but are instead a standard part of Python 3. They're extremely powerful and solve a lot of problems for us.

Often times in an analysis, we don't really want to load a whole thing into memory. We really just want a `cursor` that knows where it is in the data. For instance, imagine I was trying to load all the books ever written into Python... that's too big for my RAM. However, if I just had an object that kept track of which book it was on, and what page it needs to read next, I could load things page-by-page. That's exactly what a generator does (albeit, I've oversimplified a bit). 

We can use that to give us data over and over, without having to pre-generate all the data. Let's see an example.

In [24]:
def generate_numbers():
    """
    An infinite number generator
    """
    x = 0
    while True:
        x += 1
        yield x # instead of return, I use yield, which makes this into a generator!
        
        
my_generator = generate_numbers()
for iteration in range(10):
    next_number = next(my_generator)
    print(next_number)

1
2
3
4
5
6
7
8
9
10


This could go on until infinity! Now realistically, if I asked python to generate an infinite `list` of numbers, I'd run out of RAM. But here, I've just asked Python to keep track of what number comes next, and to forget everything else. Then when it updates, it just says, "oh this number comes next now". Let's prove to ourselves that Python isn't pre-generating the whole `list` by comparing the size in memory of the generator and the list.

In [25]:
from sys import getsizeof as sizeof

In [26]:
a = [idx for idx in range(200)]
b = (idx for idx in range(200)) # By wrapping in parens, this is a generator
print(sizeof(a))
print(sizeof(b))

1672
88


The list is 1672 bytes, the generator is only 88 bytes! That's because it's not storing all the data, just a cursor to loop through the data.

In [28]:
a
b

<generator object <genexpr> at 0x106df36d0>

In [29]:
type(b)

generator

Generators are iterables, so we can loop through them with a `for` just like normal.

In [30]:
for ix in b:
    print(ix)

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199


Why does this matter? Because if we want to work with large, streaming data, we can't always fit it into memory. The generator doesn't ask it to fit in memory, it just remembers where it is pulling the data from... for instance, what line in the CSV am I on? Then it hands to the next data as you ask for it. You can keep adding data to a file, or always pull the most recent data and use that with generators.

In [97]:
#!/bin/python3

import sys
import os
import tempfile
from pathlib import Path
import csv
import re
from collections import defaultdict

facultycsv = """name, degree, title, email
Scarlett L. Bellamy, Sc.D.,Associate Professor of Biostatistics,bellamys@mail.med.upenn.edu
Warren B. Bilker,Ph.D.,Professor of Biostatistics,warren@upenn.edu
Matthew W Bryan, PhD,Assistant Professor of Biostatistics,bryanma@upenn.edu
Jinbo Chen, Ph.D.,Associate Professor of Biostatistics,jinboche@upenn.edu
Susan S Ellenberg, Ph.D.,Professor of Biostatistics,sellenbe@upenn.edu
Jonas H. Ellenberg, Ph.D.,Professor of Biostatistics,jellenbe@mail.med.upenn.edu
Rui Feng, Ph.D,Assistant Professor of Biostatistics,ruifeng@upenn.edu
Benjamin C. French, PhD,Associate Professor of Biostatistics,bcfrench@mail.med.upenn.edu
Phyllis A. Gimotty, Ph.D,Professor of Biostatistics,pgimotty@upenn.edu
Wensheng Guo, Ph.D,Professor of Biostatistics,wguo@mail.med.upenn.edu
Yenchih Hsu, Ph.D.,Assistant Professor of Biostatistics,hsu9@mail.med.upenn.edu
Rebecca A Hubbard, PhD,Associate Professor of Biostatistics,rhubb@mail.med.upenn.edu
Wei-Ting Hwang, Ph.D.,Associate Professor of Biostatistics,whwang@mail.med.upenn.edu
Marshall M. Joffe, MD MPH Ph.D,Professor of Biostatistics,mjoffe@mail.med.upenn.edu
J. Richard Landis, B.S.Ed. M.S. Ph.D.,Professor of Biostatistics,jrlandis@mail.med.upenn.edu
Yimei Li, Ph.D.,Assistant Professor of Biostatistics,liy3@email.chop.edu
Mingyao Li, Ph.D.,Associate Professor of Biostatistics,mingyao@mail.med.upenn.edu
Hongzhe Li, Ph.D,Professor of Biostatistics,hongzhe@upenn.edu
A. Russell Localio, JD MA MPH MS PhD,Associate Professor of Biostatistics,rlocalio@upenn.edu
Nandita Mitra, Ph.D.,Associate Professor of Biostatistics,nanditam@mail.med.upenn.edu
Knashawn H. Morales, Sc.D.,Associate Professor of Biostatistics,knashawn@mail.med.upenn.edu
Kathleen Joy Propert, Sc.D.,Professor of Biostatistics,propert@mail.med.upenn.edu
Mary E. Putt, PhD ScD,Professor of Biostatistics,mputt@mail.med.upenn.edu
Sarah Jane Ratcliffe, Ph.D.,Associate Professor of Biostatistics,sratclif@upenn.edu
Michelle Elana Ross, PhD,Assistant Professor is Biostatistics,michross@upenn.edu
Jason A. Roy, Ph.D.,Associate Professor of Biostatistics,jaroy@mail.med.upenn.edu
Mary D. Sammel, Sc.D.,Professor of Biostatistics,msammel@cceb.med.upenn.edu
Pamela Ann Shaw, PhD,Assistant Professor of Biostatistics,shawp@upenn.edu
Russell Takeshi Shinohara,0,Assistant Professor of Biostatistics,rshi@mail.med.upenn.edu
Haochang Shou, Ph.D.,Assistant Professor of Biostatistics,hshou@mail.med.upenn.edu
Justine Shults, Ph.D.,Professor of Biostatistics,jshults@mail.med.upenn.edu
Alisa Jane Stephens, Ph.D.,Assistant Professor of Biostatistics,alisaste@mail.med.upenn.edu
Andrea Beth Troxel, ScD,Professor of Biostatistics,atroxel@mail.med.upenn.edu
Rui Xiao, PhD,Assistant Professor of Biostatistics,rxiao@mail.med.upenn.edu
Sharon Xiangwen Xie, Ph.D.,Associate Professor of Biostatistics,sxie@mail.med.upenn.edu
Dawei Xie, PhD,Assistant Professor of Biostatistics,dxie@upenn.edu
Wei (Peter) Yang, Ph.D.,Assistant Professor of Biostatistics,weiyang@mail.med.upenn.edu"""


#temp_dir = Path(os.environ['OUTPUT_PATH']).parent

#faculty_path = temp_dir / "faculty.csv"

#faculty_path.write_text(facultycsv)


file = open("faculty.csv","w") 
file.writelines(facultycsv) 
file.close() #to change file access modes 

# Complete the function below.

def count_degrees(csv_file_name):
    
    ## open the file
    with open(csv_file_name, 'r') as infile:
        ## read the file as a dictionary for each row ({header : value}) and remove whitespace 
        reader = csv.DictReader(infile, skipinitialspace=True)
        data = {}
        for row in reader:
            for header, value in row.items():
                try:
                    data[header].append(value)
                except KeyError:
                    data[header] = [value]

                    
    ## remove punctuation and whitespace characters
    degrees = [re.sub(r'[^\w\s]','',i).split() for i in data['degree']]
    
    ## flatten lists in list
    degrees = sum(degrees , [])

    count = defaultdict(int)

    for degree in degrees:
        count[degree] += 1
    
    return count


degreecounts = count_degrees(faculty_path)

    
degreecounts = {
    str(key).replace(' ', '').replace('.', '').upper(): val
    for key, val in degreecounts.items()
}

degrees = ['MD', 'MA', 'SCD', 'BSED', 'PHD', '0', 'MPH', 'MS', 'JD']
assert len(degrees) >= len(degreecounts), 'did you get all the different degrees?'
assert len(degrees) == len(degreecounts), 'your output has too many degrees'
for degree in degrees:
    count = degreecounts.get(degree, -1)
    print(count)

1
1
6
1
31
1
2
2
1


In [98]:
facultycsv = """name, degree, title, email
Scarlett L. Bellamy, Sc.D.,Associate Professor of Biostatistics,bellamys@mail.med.upenn.edu
Warren B. Bilker,Ph.D.,Professor of Biostatistics,warren@upenn.edu
Matthew W Bryan, PhD,Assistant Professor of Biostatistics,bryanma@upenn.edu
Jinbo Chen, Ph.D.,Associate Professor of Biostatistics,jinboche@upenn.edu
Susan S Ellenberg, Ph.D.,Professor of Biostatistics,sellenbe@upenn.edu
Jonas H. Ellenberg, Ph.D.,Professor of Biostatistics,jellenbe@mail.med.upenn.edu
Rui Feng, Ph.D,Assistant Professor of Biostatistics,ruifeng@upenn.edu
Benjamin C. French, PhD,Associate Professor of Biostatistics,bcfrench@mail.med.upenn.edu
Phyllis A. Gimotty, Ph.D,Professor of Biostatistics,pgimotty@upenn.edu
Wensheng Guo, Ph.D,Professor of Biostatistics,wguo@mail.med.upenn.edu
Yenchih Hsu, Ph.D.,Assistant Professor of Biostatistics,hsu9@mail.med.upenn.edu
Rebecca A Hubbard, PhD,Associate Professor of Biostatistics,rhubb@mail.med.upenn.edu
Wei-Ting Hwang, Ph.D.,Associate Professor of Biostatistics,whwang@mail.med.upenn.edu
Marshall M. Joffe, MD MPH Ph.D,Professor of Biostatistics,mjoffe@mail.med.upenn.edu
J. Richard Landis, B.S.Ed. M.S. Ph.D.,Professor of Biostatistics,jrlandis@mail.med.upenn.edu
Yimei Li, Ph.D.,Assistant Professor of Biostatistics,liy3@email.chop.edu
Mingyao Li, Ph.D.,Associate Professor of Biostatistics,mingyao@mail.med.upenn.edu
Hongzhe Li, Ph.D,Professor of Biostatistics,hongzhe@upenn.edu
A. Russell Localio, JD MA MPH MS PhD,Associate Professor of Biostatistics,rlocalio@upenn.edu
Nandita Mitra, Ph.D.,Associate Professor of Biostatistics,nanditam@mail.med.upenn.edu
Knashawn H. Morales, Sc.D.,Associate Professor of Biostatistics,knashawn@mail.med.upenn.edu
Kathleen Joy Propert, Sc.D.,Professor of Biostatistics,propert@mail.med.upenn.edu
Mary E. Putt, PhD ScD,Professor of Biostatistics,mputt@mail.med.upenn.edu
Sarah Jane Ratcliffe, Ph.D.,Associate Professor of Biostatistics,sratclif@upenn.edu
Michelle Elana Ross, PhD,Assistant Professor is Biostatistics,michross@upenn.edu
Jason A. Roy, Ph.D.,Associate Professor of Biostatistics,jaroy@mail.med.upenn.edu
Mary D. Sammel, Sc.D.,Professor of Biostatistics,msammel@cceb.med.upenn.edu
Pamela Ann Shaw, PhD,Assistant Professor of Biostatistics,shawp@upenn.edu
Russell Takeshi Shinohara,0,Assistant Professor of Biostatistics,rshi@mail.med.upenn.edu
Haochang Shou, Ph.D.,Assistant Professor of Biostatistics,hshou@mail.med.upenn.edu
Justine Shults, Ph.D.,Professor of Biostatistics,jshults@mail.med.upenn.edu
Alisa Jane Stephens, Ph.D.,Assistant Professor of Biostatistics,alisaste@mail.med.upenn.edu
Andrea Beth Troxel, ScD,Professor of Biostatistics,atroxel@mail.med.upenn.edu
Rui Xiao, PhD,Assistant Professor of Biotatistics,rxiao@mail.med.upenn.edu
Sharon Xiangwen Xie, Ph.D.,Associate Professor of Biostatistics,sxie@mail.med.upenn.edu
Dawei Xie, PhD,Assistant Professor of Biostatistics,dxie@upenn.edu
Wei (Peter) Yang, Ph.D.,Assistant Professor of Biostatistics,weiyang@mail.med.upenn.edu"""

file = open("faculty.csv","w") 
file.writelines(facultycsv) 
file.close() #to change file access modes 
  
#faculty=[line.strip().split(',') for line in open('faculty.csv')]

# open the file in universal line ending mode 
with open('faculty.csv', 'r') as infile:
  # read the file as a dictionary for each row ({header : value}) and remove whitespace 
    reader = csv.DictReader(infile, skipinitialspace=True)
    data = {}
    for row in reader:
        for header, value in row.items():
            try:
                data[header].append(value)
            except KeyError:
                data[header] = [value]

In [99]:
print(data.keys())
degrees = [i.strip() for i in data['degree']]
degrees

dict_keys(['name', 'degree', 'title', 'email'])


['Sc.D.',
 'Ph.D.',
 'PhD',
 'Ph.D.',
 'Ph.D.',
 'Ph.D.',
 'Ph.D',
 'PhD',
 'Ph.D',
 'Ph.D',
 'Ph.D.',
 'PhD',
 'Ph.D.',
 'MD MPH Ph.D',
 'B.S.Ed. M.S. Ph.D.',
 'Ph.D.',
 'Ph.D.',
 'Ph.D',
 'JD MA MPH MS PhD',
 'Ph.D.',
 'Sc.D.',
 'Sc.D.',
 'PhD ScD',
 'Ph.D.',
 'PhD',
 'Ph.D.',
 'Sc.D.',
 'PhD',
 '0',
 'Ph.D.',
 'Ph.D.',
 'Ph.D.',
 'ScD',
 'PhD',
 'Ph.D.',
 'PhD',
 'Ph.D.']

In [100]:
import re
s = "string. With. Punctuation?"
print(s)
s = re.sub(r'[^\w\s]','',s)
s

string. With. Punctuation?


'string With Punctuation'

In [101]:
## remove whitespace
#degrees = [i.strip() for i in data['degree']]

## remove punctuation characters
#degrees = [re.sub(r'[^\w\s]','',i) for i in degrees]

## remove whitespace
#degrees = [i.split() for i in degrees]

## combine all together
degrees = [re.sub(r'[^\w\s]','',i).split() for i in data['degree']]
degrees = sum(degrees , [])
print(data['degree'], '\n')
print(degrees)

['Sc.D.', 'Ph.D.', 'PhD', 'Ph.D.', 'Ph.D.', 'Ph.D.', 'Ph.D', 'PhD', 'Ph.D', 'Ph.D', 'Ph.D.', 'PhD', 'Ph.D.', 'MD MPH Ph.D', 'B.S.Ed. M.S. Ph.D.', 'Ph.D.', 'Ph.D.', 'Ph.D', 'JD MA MPH MS PhD', 'Ph.D.', 'Sc.D.', 'Sc.D.', 'PhD ScD', 'Ph.D.', 'PhD', 'Ph.D.', 'Sc.D.', 'PhD', '0', 'Ph.D.', 'Ph.D.', 'Ph.D.', 'ScD', 'PhD', 'Ph.D.', 'PhD', 'Ph.D.'] 

['ScD', 'PhD', 'PhD', 'PhD', 'PhD', 'PhD', 'PhD', 'PhD', 'PhD', 'PhD', 'PhD', 'PhD', 'PhD', 'MD', 'MPH', 'PhD', 'BSEd', 'MS', 'PhD', 'PhD', 'PhD', 'PhD', 'JD', 'MA', 'MPH', 'MS', 'PhD', 'PhD', 'ScD', 'ScD', 'PhD', 'ScD', 'PhD', 'PhD', 'PhD', 'ScD', 'PhD', '0', 'PhD', 'PhD', 'PhD', 'ScD', 'PhD', 'PhD', 'PhD', 'PhD']


In [102]:
from collections import defaultdict

count = defaultdict(int)

for degree in degrees:
    count[degree] += 1
    
count

defaultdict(int,
            {'ScD': 6,
             'PhD': 31,
             'MD': 1,
             'MPH': 2,
             'BSEd': 1,
             'MS': 2,
             'JD': 1,
             'MA': 1,
             '0': 1})

In [112]:
#!/bin/python3

import sys
import os
import tempfile
from pathlib import Path


facultycsv = """name, degree, title, email
Scarlett L. Bellamy, Sc.D.,Associate Professor of Biostatistics,bellamys@mail.med.upenn.edu
Warren B. Bilker,Ph.D.,Professor of Biostatistics,warren@upenn.edu
Matthew W Bryan, PhD,Assistant Professor of Biostatistics,bryanma@upenn.edu
Jinbo Chen, Ph.D.,Associate Professor of Biostatistics,jinboche@upenn.edu
Susan S Ellenberg, Ph.D.,Professor of Biostatistics,sellenbe@upenn.edu
Jonas H. Ellenberg, Ph.D.,Professor of Biostatistics,jellenbe@mail.med.upenn.edu
Rui Feng, Ph.D,Assistant Professor of Biostatistics,ruifeng@upenn.edu
Benjamin C. French, PhD,Associate Professor of Biostatistics,bcfrench@mail.med.upenn.edu
Phyllis A. Gimotty, Ph.D,Professor of Biostatistics,pgimotty@upenn.edu
Wensheng Guo, Ph.D,Professor of Biostatistics,wguo@mail.med.upenn.edu
Yenchih Hsu, Ph.D.,Assistant Professor of Biostatistics,hsu9@mail.med.upenn.edu
Rebecca A Hubbard, PhD,Associate Professor of Biostatistics,rhubb@mail.med.upenn.edu
Wei-Ting Hwang, Ph.D.,Associate Professor of Biostatistics,whwang@mail.med.upenn.edu
Marshall M. Joffe, MD MPH Ph.D,Professor of Biostatistics,mjoffe@mail.med.upenn.edu
J. Richard Landis, B.S.Ed. M.S. Ph.D.,Professor of Biostatistics,jrlandis@mail.med.upenn.edu
Yimei Li, Ph.D.,Assistant Professor of Biostatistics,liy3@email.chop.edu
Mingyao Li, Ph.D.,Associate Professor of Biostatistics,mingyao@mail.med.upenn.edu
Hongzhe Li, Ph.D,Professor of Biostatistics,hongzhe@upenn.edu
A. Russell Localio, JD MA MPH MS PhD,Associate Professor of Biostatistics,rlocalio@upenn.edu
Nandita Mitra, Ph.D.,Associate Professor of Biostatistics,nanditam@mail.med.upenn.edu
Knashawn H. Morales, Sc.D.,Associate Professor of Biostatistics,knashawn@mail.med.upenn.edu
Kathleen Joy Propert, Sc.D.,Professor of Biostatistics,propert@mail.med.upenn.edu
Mary E. Putt, PhD ScD,Professor of Biostatistics,mputt@mail.med.upenn.edu
Sarah Jane Ratcliffe, Ph.D.,Associate Professor of Biostatistics,sratclif@upenn.edu
Michelle Elana Ross, PhD,Assistant Professor is Biostatistics,michross@upenn.edu
Jason A. Roy, Ph.D.,Associate Professor of Biostatistics,jaroy@mail.med.upenn.edu
Mary D. Sammel, Sc.D.,Professor of Biostatistics,msammel@cceb.med.upenn.edu
Pamela Ann Shaw, PhD,Assistant Professor of Biostatistics,shawp@upenn.edu
Russell Takeshi Shinohara,0,Assistant Professor of Biostatistics,rshi@mail.med.upenn.edu
Haochang Shou, Ph.D.,Assistant Professor of Biostatistics,hshou@mail.med.upenn.edu
Justine Shults, Ph.D.,Professor of Biostatistics,jshults@mail.med.upenn.edu
Alisa Jane Stephens, Ph.D.,Assistant Professor of Biostatistics,alisaste@mail.med.upenn.edu
Andrea Beth Troxel, ScD,Professor of Biostatistics,atroxel@mail.med.upenn.edu
Rui Xiao, PhD,Assistant Professor of Biostatistics,rxiao@mail.med.upenn.edu
Sharon Xiangwen Xie, Ph.D.,Associate Professor of Biostatistics,sxie@mail.med.upenn.edu
Dawei Xie, PhD,Assistant Professor of Biostatistics,dxie@upenn.edu
Wei (Peter) Yang, Ph.D.,Assistant Professor of Biostatistics,weiyang@mail.med.upenn.edu"""

#temp_dir = Path(os.environ['OUTPUT_PATH']).parent

#faculty_path = temp_dir / "faculty.csv"

#faculty_path.write_text(facultycsv)


file = open("faculty.csv","w") 
file.writelines(facultycsv) 
file.close() #to change file access modes 

# Complete the function below.
import csv
from collections import defaultdict

def count_titles(csv_file_name):

    ## open the file
    with open(csv_file_name, 'r') as infile:
        ## read the file as a dictionary for each row ({header : value}) and remove whitespace 
        reader = csv.DictReader(infile, skipinitialspace=True)
        data = {}
        for row in reader:
            for header, value in row.items():
                try:
                    data[header].append(value)
                except KeyError:
                    data[header] = [value]

                    
    ## correct some errors in title
    titles = [i.replace(' is ', ' of ') for i in data['title']]
    titles = [i.replace('Biotatistics', 'Biostatistics') for i in titles]

    count = defaultdict(int)

    for title in titles:
        count[title] += 1
    
    return count

titlecounts = count_titles(faculty_path)
    
titlecounts = {
    str(key).replace(' ', '').replace('.', '').lower()[:9]: val
    for key, val in titlecounts.items()
}

titles = ['professor', 'associate', 'assistant']
assert len(titles) >= len(titlecounts), 'did you get all the different titles?'
assert len(titles) == len(titlecounts), 'your output has too many titles'
for title in titles:
    count = titlecounts.get(title, -1)
    print(count)

13
12
12


In [113]:
print(data.keys())
titles = [i.replace(' is ', ' of ') for i in data['title']]
titles = [i.replace('Biotatistics', 'Biostatistics') for i in titles]
titles

dict_keys(['name', 'degree', 'title', 'email'])


['Associate Professor of Biostatistics',
 'Professor of Biostatistics',
 'Assistant Professor of Biostatistics',
 'Associate Professor of Biostatistics',
 'Professor of Biostatistics',
 'Professor of Biostatistics',
 'Assistant Professor of Biostatistics',
 'Associate Professor of Biostatistics',
 'Professor of Biostatistics',
 'Professor of Biostatistics',
 'Assistant Professor of Biostatistics',
 'Associate Professor of Biostatistics',
 'Associate Professor of Biostatistics',
 'Professor of Biostatistics',
 'Professor of Biostatistics',
 'Assistant Professor of Biostatistics',
 'Associate Professor of Biostatistics',
 'Professor of Biostatistics',
 'Associate Professor of Biostatistics',
 'Associate Professor of Biostatistics',
 'Associate Professor of Biostatistics',
 'Professor of Biostatistics',
 'Professor of Biostatistics',
 'Associate Professor of Biostatistics',
 'Assistant Professor of Biostatistics',
 'Associate Professor of Biostatistics',
 'Professor of Biostatistics',
 'A

In [114]:
from collections import defaultdict

count = defaultdict(int)

for degree in titles:
    count[degree] += 1
    
count

defaultdict(int,
            {'Associate Professor of Biostatistics': 12,
             'Professor of Biostatistics': 13,
             'Assistant Professor of Biostatistics': 12})

In [None]:
#!/bin/python3

import sys
import os
import tempfile
from pathlib import Path


facultycsv = """name, degree, title, email
Scarlett L. Bellamy, Sc.D.,Associate Professor of Biostatistics,bellamys@mail.med.upenn.edu
Warren B. Bilker,Ph.D.,Professor of Biostatistics,warren@upenn.edu
Matthew W Bryan, PhD,Assistant Professor of Biostatistics,bryanma@upenn.edu
Jinbo Chen, Ph.D.,Associate Professor of Biostatistics,jinboche@upenn.edu
Susan S Ellenberg, Ph.D.,Professor of Biostatistics,sellenbe@upenn.edu
Jonas H. Ellenberg, Ph.D.,Professor of Biostatistics,jellenbe@mail.med.upenn.edu
Rui Feng, Ph.D,Assistant Professor of Biostatistics,ruifeng@upenn.edu
Benjamin C. French, PhD,Associate Professor of Biostatistics,bcfrench@mail.med.upenn.edu
Phyllis A. Gimotty, Ph.D,Professor of Biostatistics,pgimotty@upenn.edu
Wensheng Guo, Ph.D,Professor of Biostatistics,wguo@mail.med.upenn.edu
Yenchih Hsu, Ph.D.,Assistant Professor of Biostatistics,hsu9@mail.med.upenn.edu
Rebecca A Hubbard, PhD,Associate Professor of Biostatistics,rhubb@mail.med.upenn.edu
Wei-Ting Hwang, Ph.D.,Associate Professor of Biostatistics,whwang@mail.med.upenn.edu
Marshall M. Joffe, MD MPH Ph.D,Professor of Biostatistics,mjoffe@mail.med.upenn.edu
J. Richard Landis, B.S.Ed. M.S. Ph.D.,Professor of Biostatistics,jrlandis@mail.med.upenn.edu
Yimei Li, Ph.D.,Assistant Professor of Biostatistics,liy3@email.chop.edu
Mingyao Li, Ph.D.,Associate Professor of Biostatistics,mingyao@mail.med.upenn.edu
Hongzhe Li, Ph.D,Professor of Biostatistics,hongzhe@upenn.edu
A. Russell Localio, JD MA MPH MS PhD,Associate Professor of Biostatistics,rlocalio@upenn.edu
Nandita Mitra, Ph.D.,Associate Professor of Biostatistics,nanditam@mail.med.upenn.edu
Knashawn H. Morales, Sc.D.,Associate Professor of Biostatistics,knashawn@mail.med.upenn.edu
Kathleen Joy Propert, Sc.D.,Professor of Biostatistics,propert@mail.med.upenn.edu
Mary E. Putt, PhD ScD,Professor of Biostatistics,mputt@mail.med.upenn.edu
Sarah Jane Ratcliffe, Ph.D.,Associate Professor of Biostatistics,sratclif@upenn.edu
Michelle Elana Ross, PhD,Assistant Professor is Biostatistics,michross@upenn.edu
Jason A. Roy, Ph.D.,Associate Professor of Biostatistics,jaroy@mail.med.upenn.edu
Mary D. Sammel, Sc.D.,Professor of Biostatistics,msammel@cceb.med.upenn.edu
Pamela Ann Shaw, PhD,Assistant Professor of Biostatistics,shawp@upenn.edu
Russell Takeshi Shinohara,0,Assistant Professor of Biostatistics,rshi@mail.med.upenn.edu
Haochang Shou, Ph.D.,Assistant Professor of Biostatistics,hshou@mail.med.upenn.edu
Justine Shults, Ph.D.,Professor of Biostatistics,jshults@mail.med.upenn.edu
Alisa Jane Stephens, Ph.D.,Assistant Professor of Biostatistics,alisaste@mail.med.upenn.edu
Andrea Beth Troxel, ScD,Professor of Biostatistics,atroxel@mail.med.upenn.edu
Rui Xiao, PhD,Assistant Professor of Biostatistics,rxiao@mail.med.upenn.edu
Sharon Xiangwen Xie, Ph.D.,Associate Professor of Biostatistics,sxie@mail.med.upenn.edu
Dawei Xie, PhD,Assistant Professor of Biostatistics,dxie@upenn.edu
Wei (Peter) Yang, Ph.D.,Assistant Professor of Biostatistics,weiyang@mail.med.upenn.edu"""

#temp_dir = Path(os.environ['OUTPUT_PATH']).parent

#faculty_path = temp_dir / "faculty.csv"

#faculty_path.write_text(facultycsv)


file = open("faculty.csv","w") 
file.writelines(facultycsv) 
file.close() #to change file access modes 

# Complete the function below.
import csv
from collections import defaultdict


def emails(csv_file_name):
    ## open the file
    with open(csv_file_name, 'r') as infile:
        ## read the file as a dictionary for each row ({header : value}) and remove whitespace 
        reader = csv.DictReader(infile, skipinitialspace=True)
        data = {}
        for row in reader:
            for header, value in row.items():
                try:
                    data[header].append(value)
                except KeyError:
                    data[header] = [value]

                    
    emails = data['email']
    
    return emails


email_list = emails(faculty_path)
    
for email in sorted(email_list):
    print(email.lower())

In [116]:
print(data.keys())
emails = data['email']
emails

dict_keys(['name', 'degree', 'title', 'email'])


['bellamys@mail.med.upenn.edu',
 'warren@upenn.edu',
 'bryanma@upenn.edu',
 'jinboche@upenn.edu',
 'sellenbe@upenn.edu',
 'jellenbe@mail.med.upenn.edu',
 'ruifeng@upenn.edu',
 'bcfrench@mail.med.upenn.edu',
 'pgimotty@upenn.edu',
 'wguo@mail.med.upenn.edu',
 'hsu9@mail.med.upenn.edu',
 'rhubb@mail.med.upenn.edu',
 'whwang@mail.med.upenn.edu',
 'mjoffe@mail.med.upenn.edu',
 'jrlandis@mail.med.upenn.edu',
 'liy3@email.chop.edu',
 'mingyao@mail.med.upenn.edu',
 'hongzhe@upenn.edu',
 'rlocalio@upenn.edu',
 'nanditam@mail.med.upenn.edu',
 'knashawn@mail.med.upenn.edu',
 'propert@mail.med.upenn.edu',
 'mputt@mail.med.upenn.edu',
 'sratclif@upenn.edu',
 'michross@upenn.edu',
 'jaroy@mail.med.upenn.edu',
 'msammel@cceb.med.upenn.edu',
 'shawp@upenn.edu',
 'rshi@mail.med.upenn.edu',
 'hshou@mail.med.upenn.edu',
 'jshults@mail.med.upenn.edu',
 'alisaste@mail.med.upenn.edu',
 'atroxel@mail.med.upenn.edu',
 'rxiao@mail.med.upenn.edu',
 'sxie@mail.med.upenn.edu',
 'dxie@upenn.edu',
 'weiyang@mail.m

In [None]:
#!/bin/python3
import sys
import os



# Complete the function below.
import re 

def unique_domains(emails):
    domain = [re.search("@[\w.]+", i).group()[1:] for i in emails]
    domain = list(set(domain))
    return domain

emails = sys.stdin.readlines()
emails = [email.strip() for email in emails]
domains = unique_domains(emails)
for domain in sorted(domains):
    print(domain)

In [122]:
import re
s = 'My name is Conrad, and blahblah@gmail.com is my email.'
domain = re.search("@[\w.]+", s)
print(domain.group())
print(domain)

My
<_sre.SRE_Match object; span=(0, 2), match='My'>


In [126]:
domain = [re.search("@[\w.]+", i).group()[1:] for i in emails]
print(domain)
domain = list(set(domain))
domain

['mail.med.upenn.edu', 'upenn.edu', 'upenn.edu', 'upenn.edu', 'upenn.edu', 'mail.med.upenn.edu', 'upenn.edu', 'mail.med.upenn.edu', 'upenn.edu', 'mail.med.upenn.edu', 'mail.med.upenn.edu', 'mail.med.upenn.edu', 'mail.med.upenn.edu', 'mail.med.upenn.edu', 'mail.med.upenn.edu', 'email.chop.edu', 'mail.med.upenn.edu', 'upenn.edu', 'upenn.edu', 'mail.med.upenn.edu', 'mail.med.upenn.edu', 'mail.med.upenn.edu', 'mail.med.upenn.edu', 'upenn.edu', 'upenn.edu', 'mail.med.upenn.edu', 'cceb.med.upenn.edu', 'upenn.edu', 'mail.med.upenn.edu', 'mail.med.upenn.edu', 'mail.med.upenn.edu', 'mail.med.upenn.edu', 'mail.med.upenn.edu', 'mail.med.upenn.edu', 'mail.med.upenn.edu', 'upenn.edu', 'mail.med.upenn.edu']


['upenn.edu', 'email.chop.edu', 'cceb.med.upenn.edu', 'mail.med.upenn.edu']

In [None]:
#!/bin/python3
import sys
import os



# Complete the function below.

def write_to_csv(list_of_emails):
    with open('emails.csv','w') as file:
    file.write("list_of_emails\n")
    for line in list_of_emails:
        file.write(line)
        file.write('\n')

emails = list(map(str.strip, sys.stdin.readlines()))
write_to_csv(emails)
assert os.path.exists('emails.csv'), 'did you write to "emails.csv"?'
with open('emails.csv', 'r') as f:
    header = f.readline()
    emails2 = []
    for line in f.readlines():
        emails2.append(line.strip())
os.remove('emails.csv')
assert all(i == j for i, j in zip(emails, emails2)), 'this list of emails is different'
assert len(emails) == len(emails2), 'this list of emails is different'
print(1)

In [184]:
import csv
list_of_emails = emails

outfile = open("./emails.csv", "w")
writer = csv.writer(outfile)
writer.writerow(["list_of_emails"])
writer.writerows(list_of_emails)

In [182]:
list_of_emails

['bellamys@mail.med.upenn.edu',
 'warren@upenn.edu',
 'bryanma@upenn.edu',
 'jinboche@upenn.edu',
 'sellenbe@upenn.edu',
 'jellenbe@mail.med.upenn.edu',
 'ruifeng@upenn.edu',
 'bcfrench@mail.med.upenn.edu',
 'pgimotty@upenn.edu',
 'wguo@mail.med.upenn.edu',
 'hsu9@mail.med.upenn.edu',
 'rhubb@mail.med.upenn.edu',
 'whwang@mail.med.upenn.edu',
 'mjoffe@mail.med.upenn.edu',
 'jrlandis@mail.med.upenn.edu',
 'liy3@email.chop.edu',
 'mingyao@mail.med.upenn.edu',
 'hongzhe@upenn.edu',
 'rlocalio@upenn.edu',
 'nanditam@mail.med.upenn.edu',
 'knashawn@mail.med.upenn.edu',
 'propert@mail.med.upenn.edu',
 'mputt@mail.med.upenn.edu',
 'sratclif@upenn.edu',
 'michross@upenn.edu',
 'jaroy@mail.med.upenn.edu',
 'msammel@cceb.med.upenn.edu',
 'shawp@upenn.edu',
 'rshi@mail.med.upenn.edu',
 'hshou@mail.med.upenn.edu',
 'jshults@mail.med.upenn.edu',
 'alisaste@mail.med.upenn.edu',
 'atroxel@mail.med.upenn.edu',
 'rxiao@mail.med.upenn.edu',
 'sxie@mail.med.upenn.edu',
 'dxie@upenn.edu',
 'weiyang@mail.m

In [188]:
list_of_emails = emails

with open('emails.csv','w') as file:
    file.write("list_of_emails\n")
    for line in list_of_emails:
        file.write(line)
        file.write('\n')

In [278]:
#!/bin/python3

import sys
import os
#import tempfile
#from pathlib import Path


facultycsv = """name, degree, title, email
Scarlett L. Bellamy, Sc.D.,Associate Professor of Biostatistics,bellamys@mail.med.upenn.edu
Warren B. Bilker,Ph.D.,Professor of Biostatistics,warren@upenn.edu
Matthew W Bryan, PhD,Assistant Professor of Biostatistics,bryanma@upenn.edu
Jinbo Chen, Ph.D.,Associate Professor of Biostatistics,jinboche@upenn.edu
Susan S Ellenberg, Ph.D.,Professor of Biostatistics,sellenbe@upenn.edu
Jonas H. Ellenberg, Ph.D.,Professor of Biostatistics,jellenbe@mail.med.upenn.edu
Rui Feng, Ph.D,Assistant Professor of Biostatistics,ruifeng@upenn.edu
Benjamin C. French, PhD,Associate Professor of Biostatistics,bcfrench@mail.med.upenn.edu
Phyllis A. Gimotty, Ph.D,Professor of Biostatistics,pgimotty@upenn.edu
Wensheng Guo, Ph.D,Professor of Biostatistics,wguo@mail.med.upenn.edu
Yenchih Hsu, Ph.D.,Assistant Professor of Biostatistics,hsu9@mail.med.upenn.edu
Rebecca A Hubbard, PhD,Associate Professor of Biostatistics,rhubb@mail.med.upenn.edu
Wei-Ting Hwang, Ph.D.,Associate Professor of Biostatistics,whwang@mail.med.upenn.edu
Marshall M. Joffe, MD MPH Ph.D,Professor of Biostatistics,mjoffe@mail.med.upenn.edu
J. Richard Landis, B.S.Ed. M.S. Ph.D.,Professor of Biostatistics,jrlandis@mail.med.upenn.edu
Yimei Li, Ph.D.,Assistant Professor of Biostatistics,liy3@email.chop.edu
Mingyao Li, Ph.D.,Associate Professor of Biostatistics,mingyao@mail.med.upenn.edu
Hongzhe Li, Ph.D,Professor of Biostatistics,hongzhe@upenn.edu
A. Russell Localio, JD MA MPH MS PhD,Associate Professor of Biostatistics,rlocalio@upenn.edu
Nandita Mitra, Ph.D.,Associate Professor of Biostatistics,nanditam@mail.med.upenn.edu
Knashawn H. Morales, Sc.D.,Associate Professor of Biostatistics,knashawn@mail.med.upenn.edu
Kathleen Joy Propert, Sc.D.,Professor of Biostatistics,propert@mail.med.upenn.edu
Mary E. Putt, PhD ScD,Professor of Biostatistics,mputt@mail.med.upenn.edu
Sarah Jane Ratcliffe, Ph.D.,Associate Professor of Biostatistics,sratclif@upenn.edu
Michelle Elana Ross, PhD,Assistant Professor is Biostatistics,michross@upenn.edu
Jason A. Roy, Ph.D.,Associate Professor of Biostatistics,jaroy@mail.med.upenn.edu
Mary D. Sammel, Sc.D.,Professor of Biostatistics,msammel@cceb.med.upenn.edu
Pamela Ann Shaw, PhD,Assistant Professor of Biostatistics,shawp@upenn.edu
Russell Takeshi Shinohara,0,Assistant Professor of Biostatistics,rshi@mail.med.upenn.edu
Haochang Shou, Ph.D.,Assistant Professor of Biostatistics,hshou@mail.med.upenn.edu
Justine Shults, Ph.D.,Professor of Biostatistics,jshults@mail.med.upenn.edu
Alisa Jane Stephens, Ph.D.,Assistant Professor of Biostatistics,alisaste@mail.med.upenn.edu
Andrea Beth Troxel, ScD,Professor of Biostatistics,atroxel@mail.med.upenn.edu
Rui Xiao, PhD,Assistant Professor of Biostatistics,rxiao@mail.med.upenn.edu
Sharon Xiangwen Xie, Ph.D.,Associate Professor of Biostatistics,sxie@mail.med.upenn.edu
Dawei Xie, PhD,Assistant Professor of Biostatistics,dxie@upenn.edu
Wei (Peter) Yang, Ph.D.,Assistant Professor of Biostatistics,weiyang@mail.med.upenn.edu"""

with open('faculty.csv', 'w') as f:
    f.write(facultycsv)
    
#temp_dir = Path(os.environ['OUTPUT_PATH']).parent

#faculty_path = temp_dir / "faculty.csv"

#faculty_path.write_text(facultycsv)

# Complete the function below.
# Read the csv file "faculty.csv"
# Don't hit the url in the problem description.
from collections import defaultdict, OrderedDict

def get_dict():
     
    ## open the file
    with open('faculty.csv', 'r') as infile:
        ## read the file as a dictionary for each row ({header : value}) and remove whitespace 
        reader = csv.DictReader(infile, skipinitialspace=True)
        data = {}
        for row in reader:
            for header, value in row.items():
                try:
                    data[header].append(value)
                except KeyError:
                    data[header] = [value]
                    
                    
    ## extract last names from data
    last_names = [i.split()[-1] for i in data['name']]
    
    ## extract the corresponding row (list of degree, title and email)
    others = [ [' ' + i, j, k] for i, j, k in zip(data['degree'], data['title'], data['email'])]

    
   
    
    ## dictionary keys as a list of degree, title and email and dictionary keys as last name
    ## because keys are IMMUTABLE, we should convert LIST to TUPLE before set them as keys
    new_dict = {tuple(k): v for k, v in zip(others, last_names)}

    
    ## dictionary keys as a list and create a new dictionary with this list as a value
    v = defaultdict(list)

    for key, value in sorted(new_dict.items()):
        v[value].append(list(key))
    
    
    v = dict( OrderedDict(sorted(v.items())) )
    
    #v = dict((k, v) for k, v in zip(last_names, others) )
 
    return v


answer = get_dict()
n = 0
for key, vals in answer.items():
    print('{key},{val}'.format(key=key, val=','.join(val)))
    if in facultycsv for val in vals
    #assert all('{key},{val}'.format(key=key, val=','.join(val)) in facultycsv for val in vals)
    n += len(vals)
assert n == facultycsv.count('\n')
print(1)

Bellamy, Ph.D.,Assistant Professor of Biostatistics,weiyang@mail.med.upenn.edu
Bilker, Ph.D.,Assistant Professor of Biostatistics,weiyang@mail.med.upenn.edu
Bryan, Ph.D.,Assistant Professor of Biostatistics,weiyang@mail.med.upenn.edu
Chen, Ph.D.,Assistant Professor of Biostatistics,weiyang@mail.med.upenn.edu
Ellenberg, Ph.D.,Assistant Professor of Biostatistics,weiyang@mail.med.upenn.edu
Feng, Ph.D.,Assistant Professor of Biostatistics,weiyang@mail.med.upenn.edu
French, Ph.D.,Assistant Professor of Biostatistics,weiyang@mail.med.upenn.edu
Gimotty, Ph.D.,Assistant Professor of Biostatistics,weiyang@mail.med.upenn.edu
Guo, Ph.D.,Assistant Professor of Biostatistics,weiyang@mail.med.upenn.edu
Hsu, Ph.D.,Assistant Professor of Biostatistics,weiyang@mail.med.upenn.edu
Hubbard, Ph.D.,Assistant Professor of Biostatistics,weiyang@mail.med.upenn.edu
Hwang, Ph.D.,Assistant Professor of Biostatistics,weiyang@mail.med.upenn.edu
Joffe, Ph.D.,Assistant Professor of Biostatistics,weiyang@mail.med.upe

In [203]:
print(data.keys())
last_names = [i.split()[-1] for i in data['name']]
last_names

dict_keys(['name', 'degree', 'title', 'email'])


['Bellamy',
 'Bilker',
 'Bryan',
 'Chen',
 'Ellenberg',
 'Ellenberg',
 'Feng',
 'French',
 'Gimotty',
 'Guo',
 'Hsu',
 'Hubbard',
 'Hwang',
 'Joffe',
 'Landis',
 'Li',
 'Li',
 'Li',
 'Localio',
 'Mitra',
 'Morales',
 'Propert',
 'Putt',
 'Ratcliffe',
 'Ross',
 'Roy',
 'Sammel',
 'Shaw',
 'Shinohara',
 'Shou',
 'Shults',
 'Stephens',
 'Troxel',
 'Xiao',
 'Xie',
 'Xie',
 'Yang']

In [193]:
[(i , i.split()[-1]) for i in data['name']]

[('Scarlett L. Bellamy', 'Bellamy'),
 ('Warren B. Bilker', 'Bilker'),
 ('Matthew W Bryan', 'Bryan'),
 ('Jinbo Chen', 'Chen'),
 ('Susan S Ellenberg', 'Ellenberg'),
 ('Jonas H. Ellenberg', 'Ellenberg'),
 ('Rui Feng', 'Feng'),
 ('Benjamin C. French', 'French'),
 ('Phyllis A. Gimotty', 'Gimotty'),
 ('Wensheng Guo', 'Guo'),
 ('Yenchih Hsu', 'Hsu'),
 ('Rebecca A Hubbard', 'Hubbard'),
 ('Wei-Ting Hwang', 'Hwang'),
 ('Marshall M. Joffe', 'Joffe'),
 ('J. Richard Landis', 'Landis'),
 ('Yimei Li', 'Li'),
 ('Mingyao Li', 'Li'),
 ('Hongzhe Li', 'Li'),
 ('A. Russell Localio', 'Localio'),
 ('Nandita Mitra', 'Mitra'),
 ('Knashawn H. Morales', 'Morales'),
 ('Kathleen Joy Propert', 'Propert'),
 ('Mary E. Putt', 'Putt'),
 ('Sarah Jane Ratcliffe', 'Ratcliffe'),
 ('Michelle Elana Ross', 'Ross'),
 ('Jason A. Roy', 'Roy'),
 ('Mary D. Sammel', 'Sammel'),
 ('Pamela Ann Shaw', 'Shaw'),
 ('Russell Takeshi Shinohara', 'Shinohara'),
 ('Haochang Shou', 'Shou'),
 ('Justine Shults', 'Shults'),
 ('Alisa Jane Stephens'

In [218]:
values = [ [i, j, k] for i, j, k in zip(data['degree'], data['title'], data['email'])]
#print(len(values), len(last_names))
#dict(zip(values, last_names))

new_dict = {tuple(k): v for k, v in zip(values, last_names)}
new_dict

{('Sc.D.',
  'Associate Professor of Biostatistics',
  'bellamys@mail.med.upenn.edu'): 'Bellamy',
 ('Ph.D.', 'Professor of Biostatistics', 'warren@upenn.edu'): 'Bilker',
 ('PhD', 'Assistant Professor of Biostatistics', 'bryanma@upenn.edu'): 'Bryan',
 ('Ph.D.',
  'Associate Professor of Biostatistics',
  'jinboche@upenn.edu'): 'Chen',
 ('Ph.D.', 'Professor of Biostatistics', 'sellenbe@upenn.edu'): 'Ellenberg',
 ('Ph.D.',
  'Professor of Biostatistics',
  'jellenbe@mail.med.upenn.edu'): 'Ellenberg',
 ('Ph.D', 'Assistant Professor of Biostatistics', 'ruifeng@upenn.edu'): 'Feng',
 ('PhD',
  'Associate Professor of Biostatistics',
  'bcfrench@mail.med.upenn.edu'): 'French',
 ('Ph.D', 'Professor of Biostatistics', 'pgimotty@upenn.edu'): 'Gimotty',
 ('Ph.D', 'Professor of Biostatistics', 'wguo@mail.med.upenn.edu'): 'Guo',
 ('Ph.D.',
  'Assistant Professor of Biostatistics',
  'hsu9@mail.med.upenn.edu'): 'Hsu',
 ('PhD',
  'Associate Professor of Biostatistics',
  'rhubb@mail.med.upenn.edu'): '

In [248]:
v = defaultdict(list)

for key, value in sorted(new_dict.items()):
    v[value].append(list(key))
    
v = dict(v)
v

{'Shinohara': [['0',
   'Assistant Professor of Biostatistics',
   'rshi@mail.med.upenn.edu']],
 'Landis': [['B.S.Ed. M.S. Ph.D.',
   'Professor of Biostatistics',
   'jrlandis@mail.med.upenn.edu']],
 'Localio': [['JD MA MPH MS PhD',
   'Associate Professor of Biostatistics',
   'rlocalio@upenn.edu']],
 'Joffe': [['MD MPH Ph.D',
   'Professor of Biostatistics',
   'mjoffe@mail.med.upenn.edu']],
 'Feng': [['Ph.D',
   'Assistant Professor of Biostatistics',
   'ruifeng@upenn.edu']],
 'Li': [['Ph.D', 'Professor of Biostatistics', 'hongzhe@upenn.edu'],
  ['Ph.D.', 'Assistant Professor of Biostatistics', 'liy3@email.chop.edu'],
  ['Ph.D.',
   'Associate Professor of Biostatistics',
   'mingyao@mail.med.upenn.edu']],
 'Gimotty': [['Ph.D', 'Professor of Biostatistics', 'pgimotty@upenn.edu']],
 'Guo': [['Ph.D', 'Professor of Biostatistics', 'wguo@mail.med.upenn.edu']],
 'Stephens': [['Ph.D.',
   'Assistant Professor of Biostatistics',
   'alisaste@mail.med.upenn.edu']],
 'Shou': [['Ph.D.',
   

In [250]:
answer = get_dict()

n = 0
for key, vals in answer.items():
    print(key, vals)
    n +=1
    #assert all('{key},{val}'.format(key=key, val=','.join(val)) in facultycsv for val in vals)
print(n, facultycsv.count('\n'))

{'Bellamy': [['Sc.D.', 'Associate Professor of Biostatistics', 'bellamys@mail.med.upenn.edu']], 'Bilker': [['Ph.D.', 'Professor of Biostatistics', 'warren@upenn.edu']], 'Bryan': [['PhD', 'Assistant Professor of Biostatistics', 'bryanma@upenn.edu']], 'Chen': [['Ph.D.', 'Associate Professor of Biostatistics', 'jinboche@upenn.edu']], 'Ellenberg': [['Ph.D.', 'Professor of Biostatistics', 'jellenbe@mail.med.upenn.edu'], ['Ph.D.', 'Professor of Biostatistics', 'sellenbe@upenn.edu']], 'Feng': [['Ph.D', 'Assistant Professor of Biostatistics', 'ruifeng@upenn.edu']], 'French': [['PhD', 'Associate Professor of Biostatistics', 'bcfrench@mail.med.upenn.edu']], 'Gimotty': [['Ph.D', 'Professor of Biostatistics', 'pgimotty@upenn.edu']], 'Guo': [['Ph.D', 'Professor of Biostatistics', 'wguo@mail.med.upenn.edu']], 'Hsu': [['Ph.D.', 'Assistant Professor of Biostatistics', 'hsu9@mail.med.upenn.edu']], 'Hubbard': [['PhD', 'Associate Professor of Biostatistics', 'rhubb@mail.med.upenn.edu']], 'Hwang': [['Ph.D

In [249]:
from collections import defaultdict
d = {1: 6, 2: 1, 3: 1, 4: 9, 5: 9, 6: 1}
v = defaultdict(list)

for key, value in sorted(d.items()):
    v[value].append(key)
    
v

defaultdict(list, {6: [1], 1: [2, 3, 6], 9: [4, 5]})

In [275]:
answer = get_dict()
n = 0
for key, vals in answer.items():
    for val in vals:
        #print(val)
        #print('{key},{val}'.format(key=key, val=','.join(val)) )
        x = '{key},{val}'.format(key=key, val=','.join(val))
        if x in facultycsv:
            print(1)
    #assert all('{key},{val}'.format(key=key, val=','.join(val)) in facultycsv for val in vals)
    #n += len(vals)

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1


In [274]:
x='Bellamy, Sc.D.,Associate Professor of Biostatistics,bellamys@mail.med.upenn.edu'
x='Yang, Ph.D.,Assistant Professor of Biostatistics,weiyang@mail.med.upenn.edu'
x='Yang, Ph.D.,Assistant Professor of Biostatistics,weiyang@mail.med.upenn.edu'
if x in facultycsv:
    print(1)

1


In [300]:
#!/bin/python3

import sys
import os


facultycsv = """name, degree, title, email
Scarlett L. Bellamy, Sc.D.,Associate Professor of Biostatistics,bellamys@mail.med.upenn.edu
Warren B. Bilker,Ph.D.,Professor of Biostatistics,warren@upenn.edu
Matthew W Bryan, PhD,Assistant Professor of Biostatistics,bryanma@upenn.edu
Jinbo Chen, Ph.D.,Associate Professor of Biostatistics,jinboche@upenn.edu
Susan S Ellenberg, Ph.D.,Professor of Biostatistics,sellenbe@upenn.edu
Jonas H. Ellenberg, Ph.D.,Professor of Biostatistics,jellenbe@mail.med.upenn.edu
Rui Feng, Ph.D,Assistant Professor of Biostatistics,ruifeng@upenn.edu
Benjamin C. French, PhD,Associate Professor of Biostatistics,bcfrench@mail.med.upenn.edu
Phyllis A. Gimotty, Ph.D,Professor of Biostatistics,pgimotty@upenn.edu
Wensheng Guo, Ph.D,Professor of Biostatistics,wguo@mail.med.upenn.edu
Yenchih Hsu, Ph.D.,Assistant Professor of Biostatistics,hsu9@mail.med.upenn.edu
Rebecca A Hubbard, PhD,Associate Professor of Biostatistics,rhubb@mail.med.upenn.edu
Wei-Ting Hwang, Ph.D.,Associate Professor of Biostatistics,whwang@mail.med.upenn.edu
Marshall M. Joffe, MD MPH Ph.D,Professor of Biostatistics,mjoffe@mail.med.upenn.edu
J. Richard Landis, B.S.Ed. M.S. Ph.D.,Professor of Biostatistics,jrlandis@mail.med.upenn.edu
Yimei Li, Ph.D.,Assistant Professor of Biostatistics,liy3@email.chop.edu
Mingyao Li, Ph.D.,Associate Professor of Biostatistics,mingyao@mail.med.upenn.edu
Hongzhe Li, Ph.D,Professor of Biostatistics,hongzhe@upenn.edu
A. Russell Localio, JD MA MPH MS PhD,Associate Professor of Biostatistics,rlocalio@upenn.edu
Nandita Mitra, Ph.D.,Associate Professor of Biostatistics,nanditam@mail.med.upenn.edu
Knashawn H. Morales, Sc.D.,Associate Professor of Biostatistics,knashawn@mail.med.upenn.edu
Kathleen Joy Propert, Sc.D.,Professor of Biostatistics,propert@mail.med.upenn.edu
Mary E. Putt, PhD ScD,Professor of Biostatistics,mputt@mail.med.upenn.edu
Sarah Jane Ratcliffe, Ph.D.,Associate Professor of Biostatistics,sratclif@upenn.edu
Michelle Elana Ross, PhD,Assistant Professor is Biostatistics,michross@upenn.edu
Jason A. Roy, Ph.D.,Associate Professor of Biostatistics,jaroy@mail.med.upenn.edu
Mary D. Sammel, Sc.D.,Professor of Biostatistics,msammel@cceb.med.upenn.edu
Pamela Ann Shaw, PhD,Assistant Professor of Biostatistics,shawp@upenn.edu
Russell Takeshi Shinohara,0,Assistant Professor of Biostatistics,rshi@mail.med.upenn.edu
Haochang Shou, Ph.D.,Assistant Professor of Biostatistics,hshou@mail.med.upenn.edu
Justine Shults, Ph.D.,Professor of Biostatistics,jshults@mail.med.upenn.edu
Alisa Jane Stephens, Ph.D.,Assistant Professor of Biostatistics,alisaste@mail.med.upenn.edu
Andrea Beth Troxel, ScD,Professor of Biostatistics,atroxel@mail.med.upenn.edu
Rui Xiao, PhD,Assistant Professor of Biostatistics,rxiao@mail.med.upenn.edu
Sharon Xiangwen Xie, Ph.D.,Associate Professor of Biostatistics,sxie@mail.med.upenn.edu
Dawei Xie, PhD,Assistant Professor of Biostatistics,dxie@upenn.edu
Wei (Peter) Yang, Ph.D.,Assistant Professor of Biostatistics,weiyang@mail.med.upenn.edu"""

with open('faculty.csv', 'w') as f:
    f.write(facultycsv)

# Complete the function below.

def get_dict():
    
    ## open the file
    with open('faculty.csv', 'r') as infile:
        ## read the file as a dictionary for each row ({header : value}) and remove whitespace 
        reader = csv.DictReader(infile, skipinitialspace=True)
        data = {}
        for row in reader:
            for header, value in row.items():
                try:
                    data[header].append(value)
                except KeyError:
                    data[header] = [value]

                    
    ## extract last names from data
    names = [tuple(i.split()) for i in data['name']]
    
    ## extract the corresponding row (list of degree, title and email)
    others = [ [' ' + i, j, k] for i, j, k in zip(data['degree'], data['title'], data['email'])]
    
 
    ## dictionary keys as tuple of name and dictionary values as a list of degree, title and email  
    new_dict = {k: v for k, v in zip(names, others)}
 
    return new_dict                    
                    
                    
answer = get_dict()
for key, val in answer.items():
    assert '{key},{val}'.format(key=' '.join(key), val=','.join(val)) in facultycsv
assert len(answer) == facultycsv.count('\n')
print(1)

AssertionError: 

In [301]:
names = [tuple(i.split()) for i in data['name']]
print(len(names))
others = [ [' ' + i, j, k] for i, j, k in zip(data['degree'], data['title'], data['email'])]
print(len(others))
names

new_dict = {k: v for k, v in zip(names, others)}
new_dict

37
37


{('Scarlett', 'L.', 'Bellamy'): [' Sc.D.',
  'Associate Professor of Biostatistics',
  'bellamys@mail.med.upenn.edu'],
 ('Warren', 'B.', 'Bilker'): [' Ph.D.',
  'Professor of Biostatistics',
  'warren@upenn.edu'],
 ('Matthew', 'W', 'Bryan'): [' PhD',
  'Assistant Professor of Biostatistics',
  'bryanma@upenn.edu'],
 ('Jinbo', 'Chen'): [' Ph.D.',
  'Associate Professor of Biostatistics',
  'jinboche@upenn.edu'],
 ('Susan', 'S', 'Ellenberg'): [' Ph.D.',
  'Professor of Biostatistics',
  'sellenbe@upenn.edu'],
 ('Jonas', 'H.', 'Ellenberg'): [' Ph.D.',
  'Professor of Biostatistics',
  'jellenbe@mail.med.upenn.edu'],
 ('Rui', 'Feng'): [' Ph.D',
  'Assistant Professor of Biostatistics',
  'ruifeng@upenn.edu'],
 ('Benjamin', 'C.', 'French'): [' PhD',
  'Associate Professor of Biostatistics',
  'bcfrench@mail.med.upenn.edu'],
 ('Phyllis', 'A.', 'Gimotty'): [' Ph.D',
  'Professor of Biostatistics',
  'pgimotty@upenn.edu'],
 ('Wensheng', 'Guo'): [' Ph.D',
  'Professor of Biostatistics',
  'wguo

In [347]:
answer = get_dict()
for key, val in answer.items():
    #assert '{key},{val}'.format(key=' '.join(key), val=','.join(val)) in facultycsv
    #print('{key},{val}'.format(key=' '.join(key), val=','.join(val)) )
    x = '{key},{val}'.format(key=' '.join(key), val=','.join(val))
    assert x in facultycsv
    print(x,'\n')
    #print(x1)

Scarlett L. Bellamy, Sc.D.,Associate Professor of Biostatistics,bellamys@mail.med.upenn.edu 



AssertionError: 

In [336]:
x = 'Wei (Peter) Yang, Ph.D.,Assistant Professor of Biostatistics,weiyang@mail.med.upenn.edu'
x = 'Scarlett L. Bellamy, Sc.D.,Associate Professor of Biostatistics,bellamys@mail.med.upenn.edu'
print(x, type(x))
assert x in facultycsv
assert 'Wei (Peter) Yang, Ph.D.,Assistant Professor of Biostatistics,weiyang@mail.med.upenn.edu' in facultycsv

Scarlett L. Bellamy, Sc.D.,Associate Professor of Biostatistics,bellamys@mail.med.upenn.edu <class 'str'>


In [339]:
facultycsv

'name, degree, title, email\nScarlett L. Bellamy, Sc.D.,Associate Professor of Biostatistics,bellamys@mail.med.upenn.edu\nWarren B. Bilker,Ph.D.,Professor of Biostatistics,warren@upenn.edu\nMatthew W Bryan, PhD,Assistant Professor of Biostatistics,bryanma@upenn.edu\nJinbo Chen, Ph.D.,Associate Professor of Biostatistics,jinboche@upenn.edu\nSusan S Ellenberg, Ph.D.,Professor of Biostatistics,sellenbe@upenn.edu\nJonas H. Ellenberg, Ph.D.,Professor of Biostatistics,jellenbe@mail.med.upenn.edu\nRui Feng, Ph.D,Assistant Professor of Biostatistics,ruifeng@upenn.edu\nBenjamin C. French, PhD,Associate Professor of Biostatistics,bcfrench@mail.med.upenn.edu\nPhyllis A. Gimotty, Ph.D,Professor of Biostatistics,pgimotty@upenn.edu\nWensheng Guo, Ph.D,Professor of Biostatistics,wguo@mail.med.upenn.edu\nYenchih Hsu, Ph.D.,Assistant Professor of Biostatistics,hsu9@mail.med.upenn.edu\nRebecca A Hubbard, PhD,Associate Professor of Biostatistics,rhubb@mail.med.upenn.edu\nWei-Ting Hwang, Ph.D.,Associate P