## Data Workflows and Automation

So far, we’ve used Python and the pandas library to explore and manipulate individual datasets by hand, much like we would do in a spreadsheet. The beauty of using a programming language like Python, though, comes from the ability to automate data processing through the use of loops and functions.

### For loops

Loops allow us to repeat a workflow (or series of actions) a given number of times or while some condition is true. We would use a loop to automatically process data that’s stored in multiple files (daily values with one file per year, for example). Loops lighten our workload by performing repeated tasks without our direct involvement, and they make it less likely that we’ll introduce errors while processing each file by hand.

Let’s write a simple for loop that simulates what a kid might see during a visit to the zoo:

In [20]:
animals = ['lion', 'tiger', 'crocodile', 'vulture', 'hippo']
print(animals)

['lion', 'tiger', 'crocodile', 'vulture', 'hippo']


In [21]:
for creature in animals:
    print(creature)

lion
tiger
crocodile
vulture
hippo


The line defining the loop must start with for and end with a colon, and the body of the loop must be indented.

In this example, creature is the loop variable that takes the value of the next entry in animals every time the loop goes around. We can call the loop variable anything we like. After the loop finishes, the loop variable will still exist and will have the value of the last entry in the collection:

In [22]:
animals = ['lion', 'tiger', 'crocodile', 'vulture', 'hippo']
for creature in animals:
    pass

In [23]:
print('The loop variable is now: ' + creature)

The loop variable is now: hippo
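
As a small sketch of the two points above, the loop below uses a different, arbitrary loop variable name (x) and then checks its value after the loop ends. The names seen, nothing and message are introduced here purely for illustration. The second part shows a related detail: when the collection is empty the loop body never runs, so the loop variable is never created.

```python
animals = ['lion', 'tiger', 'crocodile', 'vulture', 'hippo']

# The loop variable name is arbitrary: 'x' works just as well as 'creature'
seen = []
for x in animals:
    seen.append(x)

print(seen == animals)
print('The loop variable is now: ' + x)

# Looping over an empty list never runs the body,
# so the loop variable 'nothing' is never created
try:
    for nothing in []:
        pass
    message = 'nothing is ' + str(nothing)
except NameError:
    message = 'nothing was never assigned'

print(message)
```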


### Automating data processing using for loops

The file we’ve been using so far, surveys.csv, contains data for every year from 1977 through 2002 and is very large. We would like to separate the data for each year into a separate file.

Let’s start by making a new directory inside the folder data to store all of these files using the module os:

In [24]:
import os

os.mkdir('../Files/yearly_files')

In [25]:
os.listdir('../Files')

['humchrx.txt',
 'test.txt',
 'surveys.csv',
 'command_out.txt',
 'surveys_complete.csv',
 'yearly_files']

The command os.listdir is equivalent to ls in the shell.
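
One thing to keep in mind is that os.mkdir raises a FileExistsError if the directory already exists, so running the cell above a second time fails. A possible alternative, sketched below, is os.makedirs with exist_ok=True. A temporary directory stands in for ../Files here so the example can run anywhere; base_dir, target and contents are names made up for this sketch.

```python
import os
import tempfile

# A temporary directory stands in for '../Files' so the sketch runs anywhere
with tempfile.TemporaryDirectory() as base_dir:
    target = os.path.join(base_dir, 'yearly_files')

    # exist_ok=True makes the call safe to repeat
    os.makedirs(target, exist_ok=True)
    os.makedirs(target, exist_ok=True)  # no error the second time

    contents = os.listdir(base_dir)
    print(contents)
```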

In previous lessons, we saw how to use the library pandas to load the species data into memory as a DataFrame, how to select a subset of the data using some criteria, and how to write the DataFrame into a CSV file. Let’s write a script that performs those three steps in sequence for the year 2002:

In [26]:
import pandas as pd

# Load the data into a DataFrame
surveys_df = pd.read_csv('../Files/surveys.csv')

# Select only data for the year 2002
surveys2002 = surveys_df[surveys_df.year == 2002]

# Write the new DataFrame to a CSV file
surveys2002.to_csv('../Files/yearly_files/surveys2002.csv')

To create yearly data files, we could repeat the last two commands over and over, once for each year of data. Repeating code is neither elegant nor practical, and is very likely to introduce errors into your code. We want to turn what we’ve just written into a loop that repeats the last two commands for every year in the dataset.

Let’s start by writing a loop that prints the names of the files we want to create. The dataset we are using covers 1977 through 2002, and we’ll create a separate file for each of those years. Listing the filenames is a good way to confirm that the loop is behaving as we expect.

We have seen that we can loop over a list of items, so we need a list of years to loop over. We can get the years in our DataFrame with:

In [27]:
surveys_df['year']

0        1977
1        1977
2        1977
3        1977
4        1977
         ... 
35544    2002
35545    2002
35546    2002
35547    2002
35548    2002
Name: year, Length: 35549, dtype: int64

but we want each year only once, which we can get with the unique method we have already seen:

In [28]:
surveys_df['year'].unique()

array([1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987,
       1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998,
       1999, 2000, 2001, 2002])

Putting this into our for loop, we get:

In [29]:
for year in surveys_df['year'].unique():
    filename = '../Files/yearly_files/surveys' + str(year) + '.csv'
    print(filename)

../Files/yearly_files/surveys1977.csv
../Files/yearly_files/surveys1978.csv
../Files/yearly_files/surveys1979.csv
../Files/yearly_files/surveys1980.csv
../Files/yearly_files/surveys1981.csv
../Files/yearly_files/surveys1982.csv
../Files/yearly_files/surveys1983.csv
../Files/yearly_files/surveys1984.csv
../Files/yearly_files/surveys1985.csv
../Files/yearly_files/surveys1986.csv
../Files/yearly_files/surveys1987.csv
../Files/yearly_files/surveys1988.csv
../Files/yearly_files/surveys1989.csv
../Files/yearly_files/surveys1990.csv
../Files/yearly_files/surveys1991.csv
../Files/yearly_files/surveys1992.csv
../Files/yearly_files/surveys1993.csv
../Files/yearly_files/surveys1994.csv
../Files/yearly_files/surveys1995.csv
../Files/yearly_files/surveys1996.csv
../Files/yearly_files/surveys1997.csv
../Files/yearly_files/surveys1998.csv
../Files/yearly_files/surveys1999.csv
../Files/yearly_files/surveys2000.csv
../Files/yearly_files/surveys2001.csv
../Files/yearly_files/surveys2002.csv
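
The same filenames could also be assembled with os.path.join and an f-string instead of string concatenation. This is only a sketch of an alternative style, not a change to the lesson's approach: output_dir and filenames are names introduced here, and the short years list is a stand-in for surveys_df['year'].unique().

```python
import os

# Same folder as above, built from its parts
output_dir = os.path.join('..', 'Files', 'yearly_files')

# A short stand-in for surveys_df['year'].unique()
years = [1977, 1978, 1979]

filenames = []
for year in years:
    # The f-string inserts the year without an explicit str() call
    filename = os.path.join(output_dir, f'surveys{year}.csv')
    filenames.append(filename)
    print(filename)
```

The path separator in the printed names depends on the operating system, which is the main reason to prefer os.path.join over hard-coded slashes.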


We can now add the rest of the steps we need to create a separate CSV file for each year:

In [30]:
# Load the data into a DataFrame
surveys_df = pd.read_csv('../Files/surveys.csv')

for year in surveys_df['year'].unique():

    # Select data for the year
    surveys_year = surveys_df[surveys_df.year == year]

    # Write the new DataFrame to a CSV file
    filename = '../Files/yearly_files/surveys' + str(year) + '.csv'
    surveys_year.to_csv(filename)

Look inside the yearly_files directory and check a couple of the files you just created to confirm that everything worked as expected.
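
That check can also be scripted. The sketch below repeats the loop on a tiny, made-up table (toy_df, with illustrative values only) inside a temporary folder, then reads each file back and confirms that it contains a single year. The names toy_df, out_dir and checks are assumptions made for this example.

```python
import os
import tempfile

import pandas as pd

# Tiny made-up table standing in for surveys.csv (values are illustrative only)
toy_df = pd.DataFrame({
    'record_id': [1, 2, 3, 4],
    'year': [1977, 1977, 1978, 1978],
})

with tempfile.TemporaryDirectory() as out_dir:
    # Same loop as above, writing into a temporary folder
    for year in toy_df['year'].unique():
        toy_year = toy_df[toy_df.year == year]
        toy_year.to_csv(os.path.join(out_dir, 'surveys' + str(year) + '.csv'))

    # Read each file back and record which years it contains
    checks = {}
    for name in sorted(os.listdir(out_dir)):
        yearly = pd.read_csv(os.path.join(out_dir, name))
        checks[name] = yearly['year'].unique().tolist()
        print(name, len(yearly), checks[name])
```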

### Building reusable and modular code with functions

Suppose that separating large data files into individual yearly files is a task that we frequently have to perform. We could write a for loop like the one above every time we needed to do it but that would be time consuming and error prone. A more elegant solution would be to create a reusable tool that performs this task with minimum input from the user. To do this, we are going to turn the code we’ve already written into a function.

Functions are reusable, self-contained pieces of code that are called with a single command. They can be designed to accept arguments as input and to return values, but they don’t need to do either. Variables created inside a function exist only while the function is running. If a variable inside the function (a local variable) has the same name as a variable elsewhere in the code, the local variable hides the other one but doesn’t overwrite it.
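
A minimal sketch of that hiding behaviour, using the made-up names value, show_scope and result:

```python
value = 'outside'


def show_scope():
    # This assignment creates a local variable that hides the outer 'value'
    value = 'inside'
    return value


result = show_scope()
print(result)  # the local variable: inside
print(value)   # the outer variable is unchanged: outside
```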

Many of the commands we have used so far (for example, print) are functions, and the libraries we import (say, pandas) bundle collections of functions. In this lesson we will only use functions defined in the same code that calls them, but we can also write functions that are shared between different programs.

Functions are declared following this general structure:

In [31]:
def this_is_the_function_name(input_argument1, input_argument2):
    # The body of the function is indented
    # This function prints the two arguments to screen
    print('The function arguments are:', input_argument1, input_argument2, '(this is done inside the function!)')
    
    # And returns their product
    return input_argument1 * input_argument2

The function declaration starts with the word def, followed by the function name and any arguments in parentheses, and ends with a colon. The body of the function is indented, just like the body of a loop. A function that returns a value includes a return statement, usually at the end of its body.

This is how we call the function:

In [32]:
product_of_inputs = this_is_the_function_name(2, 5)

The function arguments are: 2 5 (this is done inside the function!)
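
The call above prints only the message produced inside the function, because the returned value is stored in product_of_inputs rather than displayed. As a short sketch, repeating the definition so the example is self-contained, the stored value can be printed afterwards:

```python
def this_is_the_function_name(input_argument1, input_argument2):
    # Print the two arguments, then return their product
    print('The function arguments are:', input_argument1, input_argument2, '(this is done inside the function!)')
    return input_argument1 * input_argument2


# The return value is captured in a variable...
product_of_inputs = this_is_the_function_name(2, 5)

# ...and can be used after the function has finished
print('Their product is:', product_of_inputs, '(this is done outside the function!)')
```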


We can now turn our code for saving yearly data files into a function. There are many different “chunks” of this code that we can turn into functions, and we can even create functions that call other functions inside them. Let’s first write a function that separates data for just one year and saves that data to a file:

In [33]:
def one_year_csv_writer(this_year, all_data):
    """
    Writes a csv file for data from a given year.

    this_year -- year for which data is extracted
    all_data -- DataFrame with multi-year data
    """

    # Select data for the year
    surveys_year = all_data[all_data.year == this_year]

    # Write the new DataFrame to a csv file
    filename = '../Files/yearly_files/function_surveys' + str(this_year) + '.csv'
    surveys_year.to_csv(filename)

The text between the two sets of triple double quotes is called a docstring and contains the documentation for the function. It does nothing when the function is running and is therefore not necessary, but it is good practice to include docstrings as a reminder of what the code does. Docstrings in functions also become part of their ‘official’ documentation, and we can see them by typing help(function_name):

In [34]:
help(one_year_csv_writer)

Help on function one_year_csv_writer in module __main__:

one_year_csv_writer(this_year, all_data)
    Writes a csv file for data from a given year.
    
    this_year -- year for which data is extracted
    all_data -- DataFrame with multi-year data
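
To illustrate the idea of functions that call other functions, here is one possible sketch. It is a variation on one_year_csv_writer, not the function defined above: write_one_year, write_all_years and the output_dir argument are hypothetical names added so the example can write into a temporary folder, and toy_df is a tiny made-up table.

```python
import os
import tempfile

import pandas as pd


def write_one_year(this_year, all_data, output_dir):
    """Write the rows of all_data for this_year to a CSV file in output_dir."""
    surveys_year = all_data[all_data.year == this_year]
    filename = os.path.join(output_dir, 'function_surveys' + str(this_year) + '.csv')
    surveys_year.to_csv(filename)
    return filename


def write_all_years(all_data, output_dir):
    """Call write_one_year once for every year found in all_data."""
    written_files = []
    for year in all_data['year'].unique():
        written_files.append(write_one_year(year, all_data, output_dir))
    return written_files


# Tiny made-up table standing in for surveys.csv (values are illustrative only)
toy_df = pd.DataFrame({'record_id': [1, 2, 3], 'year': [2001, 2001, 2002]})

with tempfile.TemporaryDirectory() as out_dir:
    written = write_all_years(toy_df, out_dir)
    created = sorted(os.listdir(out_dir))
    print(created)
```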



### If statements

The body of the test function now has two conditionals (if statements) that check the values of start_year and end_year. If statements execute a segment of code when some condition is met. They commonly look something like this:

In [35]:
a = 5

if a < 0:  # Meets first condition?

    # if a IS less than zero
    print('a is a negative number')

elif a > 0:  # Did not meet first condition. Meets second condition?

    # if a ISN'T less than zero and IS more than zero
    print('a is a positive number')

else:  # Met neither condition

    # if a ISN'T less than zero and ISN'T more than zero
    print('a must be zero!')

a is a positive number
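
The same three-way test can be wrapped in a function so that it is easy to try with several values. This is only a sketch: describe_sign and results are names invented for the example.

```python
def describe_sign(a):
    """Return a short description of the sign of a."""
    if a < 0:
        return 'negative'
    elif a > 0:
        return 'positive'
    else:
        # Reached only when a is neither below nor above zero
        return 'zero'


results = []
for number in [-3, 0, 5]:
    results.append(describe_sign(number))
    print(number, 'is', describe_sign(number))
```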
