In [30]:
# add imports here


# Designing Functions: Build Hierarchically

We have 800 data files to go through in `Data/Roster`. We could put all the code into a `for` loop that rips through every file in turn, but if we ever wanted to do anything else (e.g. get the height), we would need to make major modifications. What's more, we would have dozens of variables flying around, so tracking program flow can be difficult (e.g. "where did that object get set?", "where is it being modified?").

Also, if our data source changes (e.g. to a CSV or database), we would need to carefully replace the code that loads the data. If it's not compartmentalized into a function, it might be easy to overlook some of the variables used or created.

Functions can help.

Let's rough out a program in pseudocode to help us find insurance liabilities.

In [14]:
# data_files = find_student_records(directory)

# for each file in data_files:
    # data = parse_student_record(file)
    
    # age = calculate_age(data)
    
    # if age < threshold:
        # remember the person

# 1. Sketching out the functions 

What are the inputs and outputs of the functions that we expect?

In [15]:
def find_student_records(directory):
    """Finds all the data filenames in a directory"""
    return paths

In [16]:
def parse_student_record(path):
    """Loads the data in the file"""
    return data

In [11]:
def calculate_age(dob):
    """Find out how old someone is in years"""
    return age

# 2. `find_student_records` - Finding the records

With separate functions, it is easy to develop and test each one without having to process the entire data set every time. Let's start by finding all the records.

> **Note:** We're going to keep all the `import`s together at the top of the notebook.
> This reduces the chance we'll forget one if we move code around.

### Step 1. Do a flat (no functions) prototype

We're using `glob`, so add `import glob` to the imports at the top of the notebook.

In [18]:
glob.glob('../Data/Roster/*.txt')

['../Data/Roster/Ezekiel_Bryant_644.txt',
 '../Data/Roster/Betty_Gonzalez_787.txt',
 '../Data/Roster/Buzz_Torres_598.txt',
 '../Data/Roster/Matthias_Collins_703.txt',
 '../Data/Roster/Howard_Kelly_516.txt',
 '../Data/Roster/Eustace_Scoot_488.txt',
 '../Data/Roster/Matthias_Lopez_317.txt',
 '../Data/Roster/Agatha_Perez_387.txt',
 '../Data/Roster/Trip_Powell_217.txt',
 '../Data/Roster/Wallace_Nelson_553.txt',
 '../Data/Roster/Tabitha_williams_635.txt',
 '../Data/Roster/Betty_Barnes_178.txt',
 '../Data/Roster/Zelda_Simmons_379.txt',
 '../Data/Roster/Annabelle_Foster_621.txt',
 '../Data/Roster/Zelda_Nelson_249.txt',
 '../Data/Roster/May-Sue_Simmons_414.txt',
 '../Data/Roster/Zelda_Collins_39.txt',
 '../Data/Roster/Zelda_Kelly_220.txt',
 '../Data/Roster/Orville_Bailey_254.txt',
 '../Data/Roster/May-Sue_Johnson_462.txt',
 '../Data/Roster/Ernest_Evans_425.txt',
 '../Data/Roster/Yvonne_Brooks_244.txt',
 '../Data/Roster/Buzz_Allen_77.txt',
 '../Data/Roster/Annabelle_Bailey_28.txt',
 '../Data/Ro

### Step 2. Separate the inputs and outputs

In [19]:
directory = '../Data/Roster'

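paths = glob.glob(directory + '/*.txt')

paths[:5]

['../Data/Roster/Ezekiel_Bryant_644.txt',
 '../Data/Roster/Betty_Gonzalez_787.txt',
 '../Data/Roster/Buzz_Torres_598.txt',
 '../Data/Roster/Matthias_Collins_703.txt',
 '../Data/Roster/Howard_Kelly_516.txt']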

### Step 3. Wrap it in a function

In [None]:
def find_student_records(directory):
    """Find all the student records in the specified directory."""
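    paths = glob.glob(directory + '/*.txt')
    return paths

In [31]:
find_student_records('../Data/Roster/')[:5]

['../Data/Roster/Ezekiel_Bryant_644.txt',
 '../Data/Roster/Betty_Gonzalez_787.txt',
 '../Data/Roster/Buzz_Torres_598.txt',
 '../Data/Roster/Matthias_Collins_703.txt',
 '../Data/Roster/Howard_Kelly_516.txt']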

### Extra! Avoiding arbitrary limitations: Optional arguments
The file extension is an arbitrary choice, so rather than hard-code it into the function, we can make it an optional argument by adding `=<default_value>` after the argument name (or after its annotation, if it has one).

In [34]:
def find_student_records(directory, extension='txt'):
    """
    Find all the student records in the specified directory. Specifying extension
    is optional; defaults to "txt"
    
    Returns a list.
    """
    paths = glob.glob(directory + '/*.' + extension)
    return paths

If we call it just as before, it gives the same result:

In [35]:
find_student_records('../Data/Roster')[:5]

['../Data/Roster/Ezekiel_Bryant_644.txt',
 '../Data/Roster/Betty_Gonzalez_787.txt',
 '../Data/Roster/Buzz_Torres_598.txt',
 '../Data/Roster/Matthias_Collins_703.txt',
 '../Data/Roster/Howard_Kelly_516.txt']

But now we have the option to specify the extension explicitly. Let's look in a different directory (the `Roster` directory contains nothing but `txt` files), naming the extension this time:

In [36]:
find_student_records('../Data', extension='txt')

['../Data/Shakespeare.txt']
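
The optional argument can be tried out without the course data. The following is a minimal sketch that builds a throwaway directory and searches it twice; the name `find_files` and the file names are made up for illustration, and the `sorted` call is an addition (it is not part of `find_student_records`), there because `glob` does not promise any particular order.

```python
import glob
import os
import tempfile


def find_files(directory, extension='txt'):
    """Return a sorted list of paths in `directory` ending in `.extension`."""
    pattern = os.path.join(directory, '*.' + extension)
    # glob makes no promise about ordering, so sort for repeatable results
    return sorted(glob.glob(pattern))


# Build a throwaway directory holding two .txt files and one .csv file
with tempfile.TemporaryDirectory() as tmp:
    for name in ('a.txt', 'b.txt', 'c.csv'):
        open(os.path.join(tmp, name), 'w').close()

    # Default extension, then an explicit one passed by keyword
    txt_names = [os.path.basename(p) for p in find_files(tmp)]
    csv_names = [os.path.basename(p) for p in find_files(tmp, extension='csv')]

print(txt_names)   # ['a.txt', 'b.txt']
print(csv_names)   # ['c.csv']
```

Callers that never mention `extension` keep working unchanged, which is the main appeal of giving a new argument a default value.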

# 3. `parse_student_record` - Reading an individual record

Now that we have a way to find all the records (the `find_student_records` function), the next piece of the puzzle we laid out in our pseudocode is a function to read an individual record. Recall that the records look like the following:

    #This is a file that holds important personal information that should not be shared. 
    #You are being watched.



    Name:	Buzz M. Baker
    Date of Birth:	4/20/87
    Email Address:	buzz.baker@northwestern.edu
    Department:	Engineering
    Height:	5ft,3in
    Weight:	194lbs
    Favorite Color:	Pink
    Favorite Animal:	Snake
    Zodiac Sign:	April


### Processing the data file

**Question**: What is the best data type to represent the student's data?

1. string
2. list
3. dictionary
4. set
5. tuple

### Roughing it out

In [21]:
path = '../Data/Roster/Agatha_Bailey_798.txt'

# create something to hold the data

# open the file
    # for each line in the file
        # ignore comment lines (those that start with "#")

        # Exercise parts
        # --------------
        # split the line
        # make sure the line has the correct number of parts
        # clean up the parts (strip whitespace)
        # store data in the 'data holder'

### Turn it into a function

In [37]:
def parse_student_record(path: 'data file location') -> 'dict':
    """Load a data file"""
    # create something to hold the data
    data = {}

    with open(path) as file:
        for line in file:
            # ignore comment lines (those that start with "#")
            if line.startswith('#'):
                continue

            # split the line
            parts = line.split(':')

            # make sure the line has the correct number of parts
            if len(parts) != 2:
                continue

            # clean up the parts (strip whitespace)
            key, value = parts
            key = key.strip()
            value = value.strip()

            # store data in the 'data holder'
            data[key] = value
    return data

# validators
assert parse_student_record('../Data/Roster/Agatha_Lee_11.txt')['Favorite Animal'] == 'Dog'
assert parse_student_record('../Data/Roster/Buzz_Baker_618.txt')['Department'] == 'Engineering'

In [38]:
data_files = find_student_records('../Data/Roster/')
some_file = data_files[0]
some_file

'../Data/Roster/Ezekiel_Bryant_644.txt'

In [39]:
parse_student_record(some_file)

{'Date of Birth': '6/11/83',
 'Department': 'Engineering',
 'Email Address': 'ezekiel.bryant@northwestern.edu',
 'Favorite Animal': 'Snake',
 'Favorite Color': 'Lime',
 'Height': '6ft,0in',
 'Name': 'Ezekiel Z. Bryant',
 'Weight': '205lbs',
 'Zodiac Sign': 'June'}
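
One thing to keep in mind about `line.split(':')` combined with the `len(parts) != 2` check: a line whose value itself contains a colon splits into three or more parts and is silently skipped. None of the fields shown in the sample record has that problem, so this is only a hedged sketch of an alternative. It uses `str.partition`, which splits at the first colon only, and it works on a list of strings instead of a file so it can run anywhere. The function name `parse_lines` and the `Office Hours` field are made up for illustration.

```python
def parse_lines(lines):
    """Parse 'Key: value' lines into a dict, ignoring comments and blanks."""
    data = {}
    for line in lines:
        # ignore comment lines (those that start with "#")
        if line.startswith('#'):
            continue
        # partition splits at the first colon only, so colons in the value survive
        key, sep, value = line.partition(':')
        if not sep:
            continue  # no colon at all (e.g. a blank line)
        data[key.strip()] = value.strip()
    return data


sample = [
    '#You are being watched.\n',
    '\n',
    'Name:\tBuzz M. Baker\n',
    'Date of Birth:\t4/20/87\n',
    'Office Hours:\t10:30\n',   # made-up field whose value contains a colon
]

record = parse_lines(sample)
print(record['Name'])           # Buzz M. Baker
print(record['Office Hours'])   # 10:30
```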

# 4. Discovering there's more to do: Data Cleaning

**Question**: How many fields are storing numeric data as strings and could be improved by using a better data type?

1. 1
2. 2
3. 3
4. 4
5. 5


## Exercise: Clean up the Date of Birth field.

First get some code working...

In [None]:
dob = '7/12/68'

# ... code ...

assert type(dob) == datetime.datetime
assert dob.year == 1968
assert dob.month == 7

In [40]:
def clean_dob(dob: "string of form M/D/YY") -> "datetime object":
    # your code from above
    
    return dob

assert type(clean_dob('1/1/11')) == datetime.datetime
assert clean_dob('1/1/03').year == 2003
assert clean_dob('1/1/83').year == 1983
assert clean_dob('2/4/00').month == 2
assert clean_dob('3/5/84').day == 5
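
A tempting shortcut for this exercise is `datetime.datetime.strptime` with the `%y` format code. It parses the string, but Python maps two-digit years 00 to 68 onto 2000 to 2068, so `'7/12/68'` comes back as 2068 and the assertion above would fail. The sketch below shows the pitfall and one possible workaround with an explicit cutoff. The name `parse_dob` and the `pivot` parameter are assumptions made for illustration; a cutoff of 20 is just one choice that satisfies the asserts above.

```python
import datetime

# strptime's %y maps 00-68 to 2000-2068 and 69-99 to 1969-1999
parsed = datetime.datetime.strptime('7/12/68', '%m/%d/%y')
print(parsed.year)   # 2068, not 1968


def parse_dob(text, pivot=20):
    """Parse 'M/D/YY', reading years below `pivot` as 20YY and the rest as 19YY."""
    month, day, year = (int(part) for part in text.split('/'))
    century = 2000 if year < pivot else 1900
    return datetime.datetime(century + year, month, day)


print(parse_dob('7/12/68').year)   # 1968
print(parse_dob('1/1/03').year)    # 2003
```

Any two-digit year scheme has to guess the century; the cutoff only works while nobody in the data set was born before the pivot year of the previous century.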

### Where to call it?

We could call the `clean_dob` function on the value returned from `parse_student_record`, or we could call it inside `parse_student_record` itself. Why make the caller clean up later?

In [None]:
# adding clean functions
def parse_student_record(path: "pathlib object") -> "dict":
    # COPY CODE FROM ABOVE
    
    data['Date of Birth'] = clean_dob(data['Date of Birth'])
    
    return data

assert parse_student_record('../Data/Roster/Buzz_Baker_618.txt')['Date of Birth'].month == 4

# Sidebar: Partial Assembly

We have a couple of functions squared away, so let's start assembling them to see whether they work well together. Fixing little errors or incompatibilities early is easier, while we still remember how we made everything. Let's try to grab everyone who was born in 1975.

In [None]:
data_dir = '../Data/Roster'

# data_files = find_student_records(data_dir)

# for each file in data_files:
    # data = parse_student_record(file)
    
    # age = calculate_age(date of birth)
    
    # if born in 1975:
        # remember the person
        

**Question**: How many people were born in March?
1. 24
2. 49
3. 53
4. 69
5. 75

In [None]:
# ...code...

len(born_in_march)
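
The filtering step can be tried on a few hand-made records before running it over all the files. This is a small sketch, assuming each record is a dictionary whose `'Date of Birth'` value is already a `datetime` object (as `parse_student_record` returns once `clean_dob` is wired in). The three records are invented for illustration and say nothing about the real roster.

```python
import datetime
from collections import Counter

# Made-up records standing in for the output of parse_student_record
records = [
    {'Name': 'A', 'Date of Birth': datetime.datetime(1975, 3, 9)},
    {'Name': 'B', 'Date of Birth': datetime.datetime(1983, 6, 11)},
    {'Name': 'C', 'Date of Birth': datetime.datetime(1990, 3, 30)},
]

# Keep only the people born in March (month number 3)
born_in_march = [r for r in records if r['Date of Birth'].month == 3]

# Counter tallies every month in one pass, handy for checking the answer
births_per_month = Counter(r['Date of Birth'].month for r in records)

print(len(born_in_march))    # 2
print(births_per_month[3])   # 2
```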

# 5. `calculate_age` - Turning the age algorithm developed earlier into a function

Your algorithm may have looked something like this:

    currentDay = 3
    currentMonth = 3
    currentYear = 2014

    bornDay = 3
    bornMonth = 3
    bornYear = 1984

    correction = 0
    if currentMonth < bornMonth:
        correction = 1
    elif currentMonth == bornMonth and currentDay < bornDay:
        correction = 1
    
    age = currentYear - bornYear - correction

Let's adapt that into a function.

In [24]:
def calculate_age(dob: "datetime object", today: "datetime object"=None):
    """Calculate the age of someone born on 'dob' on date 'today' (today if not specified)"""
    if today is None:
        today = datetime.datetime.today()
    
    ## your algorithm, adapted for datetime objects
    
    return age

calculate_age(datetime.datetime(1985, 4, 4))


In [41]:
# test suite
assert calculate_age(datetime.datetime(2000, 1, 1), datetime.datetime(2001, 1, 1)) == 1
assert calculate_age(datetime.datetime(1000, 1, 1), datetime.datetime(2000, 1, 1)) == 1000
assert calculate_age(datetime.datetime(2000, 1, 1), datetime.datetime(2010, 1, 1)) == 10
assert calculate_age(datetime.datetime(2000, 1, 31), datetime.datetime(2011, 1, 1)) == 10
assert calculate_age(datetime.datetime(2000, 6, 1), datetime.datetime(2011, 1, 1)) == 10
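
The two-branch correction in the algorithm above can also be written as a single tuple comparison, because Python compares tuples element by element: month first, then day. The following is a sketch of that variant, not a replacement for your own solution; the name `age_in_years` is made up for illustration.

```python
import datetime


def age_in_years(dob, today):
    """Whole years between `dob` and `today` (both datetime objects)."""
    # True counts as 1: subtract a year if this year's birthday has not happened yet
    before_birthday = (today.month, today.day) < (dob.month, dob.day)
    return today.year - dob.year - before_birthday


print(age_in_years(datetime.datetime(2000, 1, 31), datetime.datetime(2011, 1, 1)))  # 10
print(age_in_years(datetime.datetime(1984, 3, 3), datetime.datetime(2014, 3, 3)))   # 30
```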

# 6. Completing the Project

Now that we have a function to calculate age, we can implement the final part of our pseudocode: checking each person's age against the threshold.

In [26]:
datafiles = find_student_records('../Data/Roster/')

youths = []
for file in datafiles:
    data = parse_student_record(file)
    
    age = calculate_age(data['Date of Birth'])
    if age < 25:
        youths.append(data)
        
print(len(youths))
youths[:2]


# 7. More cleaning, more analysis

## Who's the tallest?

In [27]:
def clean_height(height):
    """Convert a foot/inches string (e.g. Xft,Yin) into a number"""
    
    return height_in_inches

# 8. Back of the Book

Not just the odd problems.

In [32]:
import glob
import datetime
import re

def find_student_records(directory, extension="txt"):
    """
    Find all the student records in the specified directory. Specifying extension
    is optional; defaults to "txt"

    Returns an iterator.
    """
    paths = glob.iglob(directory + '/*.' + extension)
    return paths

def calculate_age(dob: "datetime object", today: "datetime object"=None) -> int:
    """Calculate the age of someone born on 'dob' on date 'today' (today if not specified)"""
    if today is None:
        today = datetime.datetime.today()

    correction = 0
    if today.month < dob.month:
        correction = 1
    elif today.month == dob.month and today.day < dob.day:
        correction = 1

    age = today.year - dob.year - correction

    return age

def clean_dob(dob: "string of form M/D/YY") -> datetime.datetime:
    month, day, year = dob.split('/')

    month = int(month)
    day = int(day)
    year = int(year)

    year += 1900
    if year < 1920:
        year += 100

    dob = datetime.datetime(year=year, month=month, day=day)

    return dob

def clean_height(height: "string with format Xft,Y.ZZin") -> float:
    feet, inches = (float(x) for x in re.findall('[0-9.]+', height))
    return 12 * feet + inches

def clean_weight(weight: "string with format 123lbs") -> float:
    return float(re.findall('[0-9.]+', weight)[0])

def parse_student_record(path: "pathlib object") -> dict:
    """Load a data file"""
    data = {}

    with open(path) as file:
        for line in file:
            # ignore comment lines (those that start with "#")
            if line.startswith('#'):
                continue

            # split the line
            parts = line.split(':')

            # make sure the line has the correct number of parts
            if len(parts) != 2:
                continue

            # clean up the parts (strip whitespace) and store them
            key, value = parts
            key = key.strip()
            value = value.strip()

            data[key] = value

    data['Date of Birth'] = clean_dob(data['Date of Birth'])
    data['Weight'] = clean_weight(data['Weight'])
    data['Height'] = clean_height(data['Height'])

    return data

THRESHOLD = 25

data_dir = "../Data/Roster"

# list comprehensions and generators can be used for some efficiency and brevity.
records = [parse_student_record(f) for f in find_student_records(data_dir)]
n_march = sum(1 for r in records if r['Date of Birth'].month == 3)
n_youths = sum(1 for r in records if calculate_age(r['Date of Birth']) < THRESHOLD)
tallest = max(records, key=lambda r: r['Height'])

# longer method
n_march = 0
n_youths = 0
tallest = {'Height': 0}
for f in find_student_records(data_dir):
    data = parse_student_record(f)
    if data['Date of Birth'].month == 3:
        n_march += 1
    if calculate_age(data['Date of Birth']) < THRESHOLD:
        n_youths += 1
    if data['Height'] > tallest['Height']:
        tallest = data

print('{} people born in March.'.format(n_march))
print('{} people younger than {}.'.format(n_youths, THRESHOLD))
print('{} is the tallest ({:.1f} inches or {:.0f} cm).'.format(
    tallest['Name'], tallest['Height'], tallest['Height'] * 2.54))

75 people born in March.
117 people younger than 25.
Betty T. Gonzalez is the tallest (73.0 inches or 185 cm).
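
Note that this final `find_student_records` returns an iterator (`glob.iglob`) where the earlier version returned a list. That is why the longer method calls it a second time in place of reusing an earlier result. The sketch below, run on a throwaway directory with made-up file names, shows the two practical differences: an iterator cannot be indexed or sliced, and it is empty after one pass.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    for name in ('a.txt', 'b.txt'):
        open(os.path.join(tmp, name), 'w').close()

    lazy_paths = glob.iglob(os.path.join(tmp, '*.txt'))

    # Indexing an iterator raises TypeError; only lists support [0] or [:5]
    try:
        lazy_paths[0]
        indexing_failed = False
    except TypeError:
        indexing_failed = True

    # The first pass consumes the iterator, so the second pass finds nothing
    first_pass = len(list(lazy_paths))
    second_pass = len(list(lazy_paths))

print(indexing_failed)            # True
print(first_pass, second_pass)    # 2 0
```

If the paths are needed more than once, or need slicing, wrapping the call in `list(...)` is a simple way to get the list behaviour back.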
