In [None]:
# add imports here


# Designing your code around functions

You have 800 data files to go through in `Data/Roster`. You could put all the code we wrote in the previous lesson into a `for` loop and "rip" through every single file in turn.

**But why shouldn't you?**

Because of **code readability**!

If you ever want to expand the capabilities of your code, you would need to change it dramatically. With all the variables floating around, it gets really hard to keep track of everything while you are writing the code, and really hard to understand what you were doing when you read it later.

This is where functions come in.  They can make code modular and easy to read. They also avoid statement duplication.


In [None]:
# Let's rough out a program in pseudocode to help us find insurance liabilities.

# data_files = find_student_records(directory)

# for each file in data_files:
    # data = parse_student_record(file)
    
    # age = calculate_age(data)
    
    # if age < threshold:
        # remember the person

## 1. Sketch out your functions 

What are the inputs and outputs of the functions that you plan to write?

In [None]:
def find_student_records(directory):
    """Finds all the names of data files in a given directory"""
    return paths

In [None]:
def parse_student_record(path):
    """Loads the data in the file located at path"""
    return data

In [None]:
def calculate_age(dob):
    """Finds the current age of an individual in years, given a date of birth"""
    return age

With separate functions, it is easier to **develop** and **test** each one without having to process the entire data set every time. 

>**Note:** We're going to keep all the `import`s together at the top of the notebook. 
>This reduces the chance we'll forget one if we move code around.


## 2. The function `find_student_records` 


### Step 1. Do a flat (no functions) prototype

We're using `glob`, so add it up at the top of the notebook.

In [None]:
glob.glob('../Data/Roster/*.txt')

### Step 2. Separate the inputs and outputs

In [None]:
directory = '../Data/Roster'

paths = # ...

paths[:5]

### Step 3. Wrap it in a function

In [None]:
def find_student_records(directory):
    """Finds all the student records in the specified directory."""
    # ...
    return paths

In [None]:
find_student_records('../Data/Roster/')[:5]

### Extra! Avoiding arbitrary limitations: Optional arguments

The extension of the files is arbitrary, so rather than hard-code one value into the function, we can make it an optional argument by adding `=<default_value>` after the argument name or its annotation.

In [None]:
def find_student_records(directory, extension='txt') -> "list":
    """
    Finds all the student records in the specified directory. Specifying extension
    is optional; defaults to "txt"
    """
    # build the pattern from the extension argument instead of hard-coding 'txt'
    paths = glob.glob(directory + '/*.' + extension)
    return paths

If we call it just as before, it gives the same result:

In [None]:
find_student_records('../Data/Roster')[:5]

But now we have the option to explicitly specify the extension. Let's look somewhere else for some other type of file (`txt` files are the only thing in the `Roster` directory):

In [None]:
find_student_records('../Data', extension='csv')
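To see the optional argument at work without depending on the `../Data` directory, here is a minimal, self-contained sketch. The helper name `find_files` and the throwaway file names are made up for illustration; the pattern is the same `glob` call used above, with the extension supplied by the caller.

```python
import glob
import os
import tempfile

def find_files(directory, extension='txt'):
    """Return a sorted list of paths in `directory` ending in `.extension`.

    Hypothetical stand-in for find_student_records, for illustration only.
    """
    pattern = os.path.join(directory, '*.' + extension)
    return sorted(glob.glob(pattern))

# Build a throwaway directory so the example does not need ../Data.
with tempfile.TemporaryDirectory() as tmp:
    for name in ('a.txt', 'b.txt', 'c.csv'):
        with open(os.path.join(tmp, name), 'w') as handle:
            handle.write('placeholder\n')

    # Default argument: only the .txt files are found.
    txt_names = [os.path.basename(p) for p in find_files(tmp)]
    # Explicit keyword argument: only the .csv file is found.
    csv_names = [os.path.basename(p) for p in find_files(tmp, extension='csv')]

print(txt_names)  # ['a.txt', 'b.txt']
print(csv_names)  # ['c.csv']
```

Sorting the result is a choice made here so the output is predictable; `glob.glob` itself makes no promise about order.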

## 3.  The function `parse_student_record` 

The purpose of this function is to read an individual record. Recall that the records are structured as follows:

    #This is a file that holds important personal information that should not be shared. 
    #You are being watched.



    Name:	Buzz M. Baker
    Date of Birth:	4/20/87
    Email Address:	buzz.baker@northwestern.edu
    Department:	Engineering
    Height:	5ft,3in
    Weight:	194lbs
    Favorite Color:	Pink
    Favorite Animal:	Snake
    Zodiac Sign:	April


### Processing the data file

**Question**: What is the best data type to represent the student's data?

1. string
2. list
3. dictionary
4. set
5. tuple

### Sketch your code

In [None]:
path = '../Data/Roster/Agatha_Bailey_798.txt'

# create something to hold the data

# open the file
    # for each line in the file
        # ignore comment lines (those that start with "#")

        # Exercise parts
        # --------------
        # split the line
        # make sure the line has the correct number of parts
        # clean up the parts (strip whitespace)
        # store data in the 'data holder'

### Turn it into a function

In [None]:
def parse_student_record(path: 'data file location') -> 'dict':
    """Load a data file"""
    # create something to hold the data
    data = {}

    with open(path) as file:
        for line in file:
            # ignore comment lines (those that start with "#")
            if line.startswith('#'):
                continue

            # split the line
            parts = line.split(':')

            # make sure the line has the correct number of parts
            if len(parts) != 2:
                continue

            # clean up the parts (strip whitespace)
            key, value = parts
            key = key.strip()
            value = value.strip()

            # store data in the 'data holder'
            data[key] = value
    return data

# validators
assert parse_student_record('../Data/Roster/Agatha_Lee_11.txt')['Favorite Animal'] == 'Dog'
assert parse_student_record('../Data/Roster/Buzz_Baker_618.txt')['Department'] == 'Engineering'

In [None]:
data_files = find_student_records('../Data/Roster/')
some_file = data_files[0]
some_file

In [None]:
parse_student_record(some_file)
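The line-handling step can be developed and checked on plain strings before any file is opened. The sketch below is one possible variant, not the version used in this notebook: it uses `str.partition`, which splits only at the first colon, so a value that itself contains a colon is kept instead of being skipped by the `len(parts) != 2` check. The helper name `split_record_line` and the `Office Hours` field are hypothetical and do not appear in the roster files.

```python
def split_record_line(line):
    """Return (key, value) for a 'Key: value' line, or None if it is not one."""
    # comment lines carry no data
    if line.startswith('#'):
        return None

    # partition splits at the FIRST colon only; separator is '' if none found
    key, separator, value = line.partition(':')
    if not separator:
        return None

    return key.strip(), value.strip()

print(split_record_line('Name:\tBuzz M. Baker\n'))   # ('Name', 'Buzz M. Baker')
print(split_record_line('#You are being watched.'))  # None
print(split_record_line('\n'))                       # None
# A hypothetical field whose value contains a colon survives intact:
print(split_record_line('Office Hours:\t10:30\n'))   # ('Office Hours', '10:30')
```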

## 4. Data Cleaning

**Question**: You are storing all **values** in the record as strings. But some of them clearly should not be strings, and keeping them as strings makes it impossible to do arithmetic or comparisons on them.

How many fields are storing numeric data as strings?

1. 1
2. 2
3. 3
4. 4
5. 5


### Exercise: Clean up the Date of Birth field.

First get some code working...

In [None]:
dob = '7/12/68'

# ... code ...

assert type(dob) == datetime.datetime
assert dob.year == 1968
assert dob.month == 7

In [None]:
def clean_dob(dob: "string of form M/D/YY") -> "datetime object":
    # your code from above
    
    return dob

assert type(clean_dob('1/1/11')) == datetime.datetime
assert clean_dob('1/1/03').year == 2003
assert clean_dob('1/1/83').year == 1983
assert clean_dob('2/4/00').month == 2
assert clean_dob('3/5/84').day == 5
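One thing to watch for if you reach for `datetime.datetime.strptime`: the `%y` directive maps two-digit years 69 to 99 onto 1969 to 1999 and 00 to 68 onto 2000 to 2068, so `'7/12/68'` parses as the year 2068. The sketch below is one possible way around that, under the assumption that nobody in the roster was born after a reference date; the function name `parse_two_digit_year_date` and the `latest` parameter are invented for this example.

```python
import datetime

def parse_two_digit_year_date(text, latest=None):
    """Parse 'M/D/YY', assuming no date of birth falls after `latest`."""
    if latest is None:
        latest = datetime.datetime.today()

    parsed = datetime.datetime.strptime(text, '%m/%d/%y')

    # A birth date in the future means %y picked the wrong century.
    # (Assumption: shifting by exactly 100 years is the right repair;
    # this could fail for a Feb 29 date if the target year is not a leap year.)
    if parsed > latest:
        parsed = parsed.replace(year=parsed.year - 100)

    return parsed

reference = datetime.datetime(2014, 3, 3)
print(datetime.datetime.strptime('7/12/68', '%m/%d/%y').year)  # 2068
print(parse_two_digit_year_date('7/12/68', reference).year)    # 1968
print(parse_two_digit_year_date('1/1/03', reference).year)     # 2003
```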

### Where should the cleaning occur?

We could either call the `clean_dob` function on the value returned from the `parse_student_record` function or do it inside. 

Does it make sense to clean up later?

In [None]:
# adding data cleaning functions
#
def parse_student_record(path: "pathlib object") -> "dict":
    # COPY CODE FROM ABOVE
    
    data['Date of Birth'] = clean_dob(data['Date of Birth'])
    
    return data

assert parse_student_record('../Data/Roster/Buzz_Baker_618.txt')['Date of Birth'].month == 4

## Testing inter-operability of the functions

It is very important to test your code as you write it.

We have tested our functions using the built-in `assert` statement. However, we have not yet tested whether the functions we wrote work well together.

Fixing incompatibilities early on is easier because we still remember how we made everything.


In [None]:
# As a test, grab everyone who was born in 1975.

data_dir = '../Data/Roster'

# data_files = find_student_records(data_dir)

# for each file in data_files:
    # data = parse_student_record(file)
    
    # age = calculate_age(date of birth)
    
    # if born in 1975:
        # remember the person
        

**Question**: How many people were born in March?
1. 24
2. 49
3. 53
4. 69
5. 75

In [None]:
# ...code...

len(born_in_march)

## 5. The function `calculate_age` 

Your function may have looked something like this:

    currentDay = 3
    currentMonth = 3
    currentYear = 2014

    bornDay = 3
    bornMonth = 3
    bornYear = 1984

    correction = 0
    if currentMonth < bornMonth:
        correction = 1
    elif currentMonth == bornMonth and currentDay < bornDay:
        correction = 1
    
    age = currentYear - bornYear - correction

Now, convert it into a function.

In [None]:
def calculate_age(dob: "datetime object", today: "datetime object"=None) -> "int":
    """Calculate the age of someone born on 'dob' on date 'today' (today if not specified)"""
    if today is None:
        today = datetime.datetime.today()
    
    ## your algorithm, adapted for datetime objects
    
    return age

calculate_age(datetime.datetime(1985, 4, 4))

In [None]:
# test suite
assert calculate_age(datetime.datetime(2000, 1, 1), datetime.datetime(2001, 1, 1)) == 1
assert calculate_age(datetime.datetime(1000, 1, 1), datetime.datetime(2000, 1, 1)) == 1000
assert calculate_age(datetime.datetime(2000, 1, 1), datetime.datetime(2010, 1, 1)) == 10
assert calculate_age(datetime.datetime(2000, 1, 31), datetime.datetime(2011, 1, 1)) == 10
assert calculate_age(datetime.datetime(2000, 6, 1), datetime.datetime(2011, 1, 1)) == 10
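The month-then-day correction in the algorithm above can also be written as a single tuple comparison, because Python compares tuples element by element from left to right. This is only an alternative sketch; the name `age_on` is made up, and it requires `today` to be passed explicitly.

```python
import datetime

def age_on(dob, today):
    """Age in whole years of someone born on `dob`, as of `today`."""
    # (month, day) tuples compare month first, then day on a tie
    had_birthday = (today.month, today.day) >= (dob.month, dob.day)
    return today.year - dob.year - (0 if had_birthday else 1)

# birthday is today: the full year counts
print(age_on(datetime.datetime(1984, 3, 3), datetime.datetime(2014, 3, 3)))  # 30
# birthday is tomorrow: one year less
print(age_on(datetime.datetime(1984, 3, 4), datetime.datetime(2014, 3, 3)))  # 29
```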

## 6. Putting it all together

Now that we have a function to calculate age, we can implement the final part of our pseudocode and check the age.

In [None]:
datafiles = find_student_records('../Data/Roster/')

youths = []
for file in datafiles:
    data = parse_student_record(file)
    
    age = calculate_age(data['Date of Birth'])
    if age < 25:
        youths.append(data)
        
print(len(youths))
youths[:2]

## 7. If we do more data cleaning, we can perform additional analyses!

**Who's the tallest?**

In [None]:
def clean_height(height) -> "int":
    """Convert a foot/inches string (e.g. Xft,Yin) into height in inches"""
    
    return height_in_inches
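As a starting point for the height conversion, here is a hedged sketch that validates the string before converting it. It assumes heights always look like `5ft,3in` (whole feet, inches possibly with a decimal part); the names `HEIGHT_PATTERN` and `height_to_inches` are invented for this example, and raising an error on anything else is a design choice rather than a requirement of the exercise.

```python
import re

# whole feet, then inches with an optional decimal part, e.g. '5ft,3in'
HEIGHT_PATTERN = re.compile(r'(\d+)ft,(\d+(?:\.\d+)?)in')

def height_to_inches(height):
    """Convert a string such as '5ft,3in' into a height in inches."""
    match = HEIGHT_PATTERN.fullmatch(height.strip())
    if match is None:
        # fail loudly rather than silently returning a wrong number
        raise ValueError('unrecognized height: {!r}'.format(height))

    feet, inches = match.groups()
    return 12 * int(feet) + float(inches)

print(height_to_inches('5ft,3in'))    # 63.0
print(height_to_inches('6ft,0.5in'))  # 72.5
```

Failing on unexpected input makes a malformed record show up immediately instead of skewing a later "who's the tallest?" comparison.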

# Back of the Book

Not just the odd problems.

    import glob
    import datetime
    import re

    def find_student_records(directory, extension="txt"):
        """
        Find all the student records in the specified directory. Specifying extension
        is optional; defaults to "txt"

        Returns an iterator.
        """
        paths = glob.iglob(directory + '/*.' + extension)
        return paths

    def calculate_age(dob: "datetime object", today: "datetime object"=None) -> int:
        """Calculate the age of someone born on 'dob' on date 'today' (today if not specified)"""
        if today is None:
            today = datetime.datetime.today()

        correction = 0
        if today.month < dob.month:
            correction = 1
        elif today.month == dob.month and today.day < dob.day:
            correction = 1

        age = today.year - dob.year - correction

        return age

    def clean_dob(dob: "string of form M/D/YY") -> datetime.datetime:
        month, day, year = dob.split('/')

        month = int(month)
        day = int(day)
        year = int(year)

        year += 1900
        if year < 1920:
            year += 100

        dob = datetime.datetime(year=year, month=month, day=day)

        return dob

    def clean_height(height: "string with format Xft,Y.ZZin") -> float:
        feet, inches = (float(x) for x in re.findall('[0-9.]+', height))
        return 12 * feet + inches

    def clean_weight(weight: "string with format 123lbs") -> float:
        return float(re.findall('[0-9.]+', weight)[0])

    def parse_student_record(path: "pathlib object") -> dict:
        """Load a data file"""
        data = {}

        with open(path) as file:
            for line in file:
                # ignore comment lines (those that start with "#")
                if line.startswith('#'):
                    continue

                # split the line
                parts = line.split(':')

                # make sure the line has the correct number of parts
                if len(parts) != 2:
                    continue

                # clean up the parts (strip whitespace) and store them
                key, value = parts
                key = key.strip()
                value = value.strip()

                data[key] = value

        data['Date of Birth'] = clean_dob(data['Date of Birth'])
        data['Weight'] = clean_weight(data['Weight'])
        data['Height'] = clean_height(data['Height'])

        return data

    THRESHOLD = 25

    data_dir = "../Data/Roster"

    # list comprehensions and generators can be used for some efficiency and brevity.
    records = [parse_student_record(f) for f in find_student_records(data_dir)]
    n_march = sum(1 for r in records if r['Date of Birth'].month == 3)
    n_youths = sum(1 for r in records if calculate_age(r['Date of Birth']) < THRESHOLD)
    tallest = max(records, key=lambda r: r['Height'])

    # longer method
    n_march = 0
    n_youths = 0
    tallest = {'Height': 0}
    for f in find_student_records(data_dir):
        data = parse_student_record(f)
        if data['Date of Birth'].month == 3:
            n_march += 1
        if calculate_age(data['Date of Birth']) < THRESHOLD:
            n_youths += 1
        if data['Height'] > tallest['Height']:
            tallest = data

    print('{} people born in March.'.format(n_march))
    print('{} people younger than {}.'.format(n_youths, THRESHOLD))
    print('{} is the tallest ({:.1f} inches or {:.0f} cm).'.format(
        tallest['Name'], tallest['Height'], tallest['Height'] * 2.54))