# Advanced Python
Michelle L. Gill  
2016/05/18  

This Jupyter notebook reproduces all of the analysis in the following files:
* [advanced_python_cleaning.py](advanced_python_cleaning.py)
* [advanced_python_regex.py](advanced_python_regex.py)
* [advanced_python_csv.py](advanced_python_csv.py)
* [advanced_python_dict.py](advanced_python_dict.py)

In [1]:
! head -n 5 faculty.csv

name, degree, title, email
Scarlett L. Bellamy, Sc.D.,Associate Professor of Biostatistics,bellamys@mail.med.upenn.edu
Warren B. Bilker,Ph.D.,Professor of Biostatistics,warren@upenn.edu
Matthew W Bryan, PhD,Assistant Professor of Biostatistics,bryanma@upenn.edu
Jinbo Chen, Ph.D.,Associate Professor of Biostatistics,jinboche@upenn.edu


In [2]:
import pandas as pd
import re

strip_sp = lambda x: x.strip()

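# Add a period after each run of word characters that is followed by a capital
# letter or a space (e.g. 'PhD' -> 'Ph.D.'); already-punctuated degrees are unchanged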
clean_degrees = lambda x: re.sub(r"""(\w+?)(?=[ A-Z])""", r'\1.', x+' ').strip()

# Import the data and strip surrounding whitespace from each field
faculty = pd.read_csv('faculty.csv',
                      names=['name', 'degree', 'title', 'email'],
                      skiprows=[0],
                      converters={'name': strip_sp, 'degree': clean_degrees,
                                  'title': strip_sp, 'email': strip_sp})

# Add missing degree
faculty.loc[faculty['name'] == 'Russell Takeshi Shinohara', 'degree'] = 'Ph.D.'

# Keep only the rank, dropping the department name from the title
# (expand=False returns a Series for a single capture group)
faculty['title'] = faculty['title'].str.extract(r"""((?:(?:Assistant|Associate) )?Professor)""", expand=False)

# Determine email domain: everything after the '@'
faculty['email_domain'] = faculty.email.str.extract(r"""@(.+)""", expand=False)

# Extract first and last name:
# 1. Skip a leading first initial when it is listed as a single letter
# 2. Also skip any middle names, middle initials, or nicknames in parentheses
faculty[['first_name','last_name']] = ( faculty['name'].str
                                        .extract(r"""(?:[A-Z]\. )?(.+?)(?: \(?([\w.]+)\)?)? (.+)""", 
                                                 expand=True)
                                        .fillna('')
                                        [[0,2]]
                                       )
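
The two regular expressions in the cell above are easier to trust once they are run on a few strings in isolation. The sketch below copies both patterns as written and wraps them in small helpers; the helper name `first_last` and the sample names `Anna (Annie) Smith` and `J. Alex Doe` are made up for illustration and do not come from `faculty.csv`.

```python
import re

# Same pattern as the clean_degrees converter: lazily match word characters
# that are followed by a space or a capital letter, then append a period.
# The trailing ' ' lets the final letter of the string match as well.
def clean_degrees(raw):
    return re.sub(r"(\w+?)(?=[ A-Z])", r"\1.", raw + " ").strip()

# Same pattern as the name extraction: optional leading initial, first name,
# optional middle name/initial/nickname, then the last name.
NAME_RE = re.compile(r"(?:[A-Z]\. )?(.+?)(?: \(?([\w.]+)\)?)? (.+)")

def first_last(name):
    """Return (first_name, last_name); helper name is illustrative only."""
    match = NAME_RE.search(name)
    return match.group(1), match.group(3)

for raw in [" PhD", "Ph.D.", " Sc.D.", "MD MPH", "BSEd"]:
    print(repr(raw), "->", clean_degrees(raw))

# The last two names are hypothetical test cases, not rows from the data.
for name in ["Scarlett L. Bellamy", "Matthew W Bryan", "Jinbo Chen",
             "Anna (Annie) Smith", "J. Alex Doe"]:
    print(name, "->", first_last(name))
```

Degrees that already contain periods pass through unchanged because a letter followed by `.` never satisfies the lookahead.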

### Q1. Find how many different degrees there are, and their frequencies (e.g., PhD, ScD, MD, MPH, BSEd, MS, JD).

In [3]:
# Split entries that hold multiple degrees and flatten them into a single Series
degrees = faculty.degree.apply(lambda x: x.split(' ')).values
degrees = pd.Series(sum(degrees, []))

print(degrees.nunique())

8


In [4]:
print(degrees.value_counts())

Ph.D.      32
Sc.D.       6
M.P.H.      2
M.S.        2
B.S.Ed.     1
J.D.        1
M.D.        1
M.A.        1
dtype: int64
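
The same count can be sketched without pandas, which also shows an alternative to `sum(degrees, [])` for flattening. The list `degree_strings` below is a small hypothetical sample, not the real column, so the numbers it prints are not the notebook's results.

```python
from collections import Counter
from itertools import chain

# Hypothetical sample of cleaned degree strings; one entry holds two degrees.
degree_strings = ["Sc.D.", "Ph.D.", "Ph.D.", "M.D. M.P.H.", "Ph.D."]

# chain.from_iterable flattens the per-row lists in a single pass, avoiding
# the repeated list copies that sum(list_of_lists, []) makes.
flat_degrees = list(chain.from_iterable(s.split(" ") for s in degree_strings))

degree_counts = Counter(flat_degrees)

print(len(degree_counts))            # number of distinct degrees
for degree, count in degree_counts.most_common():
    print(degree, count)             # frequencies, most common first
```

`Counter.most_common()` plays the role of `value_counts()` here, and `len(degree_counts)` the role of `nunique()`.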


### Q2. Find how many different titles there are, and their frequencies (e.g., Assistant Professor, Professor).

In [5]:
print(faculty.title.value_counts())

Professor              13
Associate Professor    12
Assistant Professor    12
Name: title, dtype: int64


### Q3. Search for email addresses and put them in a list.  Print the list of email addresses.

In [6]:
print(list(faculty.email))

['bellamys@mail.med.upenn.edu', 'warren@upenn.edu', 'bryanma@upenn.edu', 'jinboche@upenn.edu', 'sellenbe@upenn.edu', 'jellenbe@mail.med.upenn.edu', 'ruifeng@upenn.edu', 'bcfrench@mail.med.upenn.edu', 'pgimotty@upenn.edu', 'wguo@mail.med.upenn.edu', 'hsu9@mail.med.upenn.edu', 'rhubb@mail.med.upenn.edu', 'whwang@mail.med.upenn.edu', 'mjoffe@mail.med.upenn.edu', 'jrlandis@mail.med.upenn.edu', 'liy3@email.chop.edu', 'mingyao@mail.med.upenn.edu', 'hongzhe@upenn.edu', 'rlocalio@upenn.edu', 'nanditam@mail.med.upenn.edu', 'knashawn@mail.med.upenn.edu', 'propert@mail.med.upenn.edu', 'mputt@mail.med.upenn.edu', 'sratclif@upenn.edu', 'michross@upenn.edu', 'jaroy@mail.med.upenn.edu', 'msammel@cceb.med.upenn.edu', 'shawp@upenn.edu', 'rshi@mail.med.upenn.edu', 'hshou@mail.med.upenn.edu', 'jshults@mail.med.upenn.edu', 'alisaste@mail.med.upenn.edu', 'atroxel@mail.med.upenn.edu', 'rxiao@mail.med.upenn.edu', 'sxie@mail.med.upenn.edu', 'dxie@upenn.edu', 'weiyang@mail.med.upenn.edu']


### Q4. Find how many different email domains there are (Ex:  mail.med.upenn.edu, upenn.edu, email.chop.edu, etc.).  Print the list of unique email domains.

In [7]:
print(faculty.email_domain.unique())

['mail.med.upenn.edu' 'upenn.edu' 'email.chop.edu' 'cceb.med.upenn.edu']
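
As a rough pandas-free sketch of the same step, the pattern `@(.+)` can be applied with `re` directly. The five addresses are copied from the email list printed in Q3; collecting the domains in a list (rather than a set) keeps them in order of first appearance, as `unique()` does.

```python
import re

# A handful of addresses taken from the list printed in Q3.
emails = [
    "bellamys@mail.med.upenn.edu",
    "warren@upenn.edu",
    "liy3@email.chop.edu",
    "msammel@cceb.med.upenn.edu",
    "bryanma@upenn.edu",
]

# Same pattern as the notebook: capture everything after the '@'.
DOMAIN_RE = re.compile(r"@(.+)")

unique_domains = []
for email in emails:
    domain = DOMAIN_RE.search(email).group(1)
    if domain not in unique_domains:   # keep first-seen order
        unique_domains.append(domain)

print(len(unique_domains))
print(unique_domains)
```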


### Q5.  Write the email addresses from Part I to a CSV file

In [8]:
faculty.email.to_csv('../emails.csv', header=False, index=False)
! head ../emails.csv

bellamys@mail.med.upenn.edu
warren@upenn.edu
bryanma@upenn.edu
jinboche@upenn.edu
sellenbe@upenn.edu
jellenbe@mail.med.upenn.edu
ruifeng@upenn.edu
bcfrench@mail.med.upenn.edu
pgimotty@upenn.edu
wguo@mail.med.upenn.edu
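
For comparison, here is a minimal sketch of the same export using the standard-library `csv` module, one address per row with no header and no index column. It assumes Python 3 (for the `newline=""` argument) and writes to a temporary directory rather than `../emails.csv`, so it will not overwrite the file produced above.

```python
import csv
import os
import tempfile

# First three addresses from the output above.
emails = ["bellamys@mail.med.upenn.edu", "warren@upenn.edu", "bryanma@upenn.edu"]

# Temporary location chosen for this sketch only.
out_path = os.path.join(tempfile.mkdtemp(), "emails.csv")

# newline="" lets the csv module control line endings itself.
with open(out_path, "w", newline="") as handle:
    writer = csv.writer(handle)
    for email in emails:
        writer.writerow([email])       # one single-column row per address

# Read the file back to confirm the round trip.
with open(out_path, newline="") as handle:
    written = [row[0] for row in csv.reader(handle)]

print(written)
```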


### Q6.  Create a dictionary in the below format:
```
faculty_dict = { 'Ellenberg': [\
              ['Ph.D.', 'Professor', 'sellenbe@upenn.edu'],\
              ['Ph.D.', 'Professor', 'jellenbe@mail.med.upenn.edu']
                            ],
              'Li': [\
              ['Ph.D.', 'Assistant Professor', 'liy3@email.chop.edu'],\
              ['Ph.D.', 'Associate Professor', 'mingyao@mail.med.upenn.edu'],\
              ['Ph.D.', 'Professor', 'hongzhe@upenn.edu']
                            ]
            }
```

In [9]:
faculty_dict = dict([(x,y.drop('last_name', axis=1).values.tolist())
                      for x,y in 
                      faculty[['last_name', 'degree', 'title', 'email']]
                      .groupby('last_name')])

# Preview three entries; list() makes the keys sliceable on Python 3 as well
print(dict([(key, faculty_dict[key]) for key in list(faculty_dict)[:3]]))

{'Putt': [['Ph.D. Sc.D.', 'Professor', 'mputt@mail.med.upenn.edu']], 'Feng': [['Ph.D.', 'Assistant Professor', 'ruifeng@upenn.edu']], 'Bilker': [['Ph.D.', 'Professor', 'warren@upenn.edu']]}
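
The grouping that `groupby('last_name')` performs can also be sketched with `collections.defaultdict`. The rows below are copied from the target format shown in the question; the variable name `rows` and the tuple layout are assumptions made for this example.

```python
from collections import defaultdict

# (last_name, degree, title, email) rows taken from the target format in Q6.
rows = [
    ("Ellenberg", "Ph.D.", "Professor", "sellenbe@upenn.edu"),
    ("Ellenberg", "Ph.D.", "Professor", "jellenbe@mail.med.upenn.edu"),
    ("Li", "Ph.D.", "Assistant Professor", "liy3@email.chop.edu"),
    ("Li", "Ph.D.", "Associate Professor", "mingyao@mail.med.upenn.edu"),
    ("Li", "Ph.D.", "Professor", "hongzhe@upenn.edu"),
]

# Each new last name starts with an empty list; every row appends one record.
grouped = defaultdict(list)
for last_name, degree, title, email in rows:
    grouped[last_name].append([degree, title, email])

faculty_dict = dict(grouped)   # plain dict for display

print(faculty_dict["Ellenberg"])
print(len(faculty_dict["Li"]))
```

Each value is a list of records, which is why faculty who share a last name end up under a single key.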


### Q7.  The previous dictionary does not have the best design for keys.  Create a new dictionary with keys as:

```
professor_dict = {('Susan', 'Ellenberg'): ['Ph.D.', 'Professor', 'sellenbe@upenn.edu'],\
                ('Jonas', 'Ellenberg'): ['Ph.D.', 'Professor', 'jellenbe@mail.med.upenn.edu'],\
                ('Yimei', 'Li'): ['Ph.D.', 'Assistant Professor', 'liy3@email.chop.edu'],\
                ('Mingyao','Li'): ['Ph.D.', 'Associate Professor', 'mingyao@mail.med.upenn.edu'],\
                ('Hongzhe','Li'): ['Ph.D.', 'Professor', 'hongzhe@upenn.edu']
            }
```

In [10]:
professor_dict = dict([(x,y.drop(['last_name', 'first_name'], axis=1).values.tolist())
                        for x,y in 
                        faculty[['last_name', 'first_name', 'degree', 'title', 'email']]
                        .groupby(['first_name', 'last_name'])])

# Preview three entries; list() makes the keys sliceable on Python 3 as well
print(dict([(key, professor_dict[key]) for key in list(professor_dict)[:3]]))

{('Hongzhe', 'Li'): [['Ph.D.', 'Professor', 'hongzhe@upenn.edu']], ('Knashawn', 'Morales'): [['Sc.D.', 'Associate Professor', 'knashawn@mail.med.upenn.edu']], ('Yimei', 'Li'): [['Ph.D.', 'Assistant Professor', 'liy3@email.chop.edu']]}
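
Note that the `groupby` version stores each value as a list of rows (`[[...]]`), whereas the target format in the question shows a flat list per person. A minimal sketch of the flat variant, assuming every `(first_name, last_name)` pair is unique, is a dict comprehension over the rows. The rows are copied from the target format; the name `rows` is an assumption of this example.

```python
# (first_name, last_name, degree, title, email) rows from the target format in Q7.
rows = [
    ("Susan", "Ellenberg", "Ph.D.", "Professor", "sellenbe@upenn.edu"),
    ("Jonas", "Ellenberg", "Ph.D.", "Professor", "jellenbe@mail.med.upenn.edu"),
    ("Yimei", "Li", "Ph.D.", "Assistant Professor", "liy3@email.chop.edu"),
    ("Mingyao", "Li", "Ph.D.", "Associate Professor", "mingyao@mail.med.upenn.edu"),
    ("Hongzhe", "Li", "Ph.D.", "Professor", "hongzhe@upenn.edu"),
]

# Tuples are hashable, so (first, last) works directly as a dictionary key.
# If two people shared both names, the later row would overwrite the earlier.
professor_dict = {
    (first, last): [degree, title, email]
    for first, last, degree, title, email in rows
}

print(professor_dict[("Yimei", "Li")])
```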


### Q8.  It looks like the current dictionary is printing by first name.  Sort by last name and print the first 3 key and value pairs.

In [11]:
last_name_sort = lambda x: x[1]

print(dict([(key, professor_dict[key]) 
            for key in sorted(professor_dict, key=last_name_sort)[:3]]))

{('Warren', 'Bilker'): [['Ph.D.', 'Professor', 'warren@upenn.edu']], ('Scarlett', 'Bellamy'): [['Sc.D.', 'Associate Professor', 'bellamys@mail.med.upenn.edu']], ('Matthew', 'Bryan'): [['Ph.D.', 'Assistant Professor', 'bryanma@upenn.edu']]}
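
The printed pairs above are not in last-name order (Bilker appears before Bellamy) because the sorted keys are placed back into a `dict`, and dictionaries only guarantee insertion order from Python 3.7 onward. One way to keep the order regardless of Python version is to print the sorted `(key, value)` pairs as a list. The sketch below uses five entries copied from the outputs of Q7 and Q8 rather than the full dictionary.

```python
# Entries copied from the printed outputs above (values are nested lists,
# matching the groupby-based professor_dict).
professor_dict = {
    ("Hongzhe", "Li"): [["Ph.D.", "Professor", "hongzhe@upenn.edu"]],
    ("Knashawn", "Morales"): [["Sc.D.", "Associate Professor", "knashawn@mail.med.upenn.edu"]],
    ("Warren", "Bilker"): [["Ph.D.", "Professor", "warren@upenn.edu"]],
    ("Scarlett", "Bellamy"): [["Sc.D.", "Associate Professor", "bellamys@mail.med.upenn.edu"]],
    ("Matthew", "Bryan"): [["Ph.D.", "Assistant Professor", "bryanma@upenn.edu"]],
}

# Sort the (key, value) pairs by the second element of the key: the last name.
sorted_pairs = sorted(professor_dict.items(), key=lambda item: item[0][1])

# A list of pairs keeps its order, so the first three print sorted.
for key, value in sorted_pairs[:3]:
    print(key, value)
```

`collections.OrderedDict(sorted_pairs[:3])` would be another option if a mapping is still needed after sorting.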
