# Data Workflows and Automation
**--Break--**

**60 min**
14:45 – 15:45

So far, we’ve used Python and the pandas library to explore and manipulate individual datasets by hand, much like we would do in a spreadsheet. The beauty of using a programming language like Python, though, comes from the ability to automate data processing through the use of loops and functions.

## For loops

We'll start with loops. Loops allow us to repeat a workflow (or series of actions) 

- a given number of times or 
- while some condition is true. 

We would use a loop to automatically process data that’s stored in multiple files (daily values with one file per year, for example). Loops lighten our workload by performing repeated tasks without our direct involvement, and they make it less likely that we’ll introduce errors while processing each file by hand.
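
As a minimal sketch of the two kinds of repetition listed above (the variable names `squares`, `countdown`, and `steps` are made up for illustration):

```python
# Repeat an action a fixed number of times with `for` and `range`
squares = []
for i in range(3):
    squares.append(i * i)
print(squares)  # [0, 1, 4]

# Repeat an action while a condition is true
countdown = 3
steps = []
while countdown > 0:
    steps.append(countdown)
    countdown -= 1
print(steps)  # [3, 2, 1]
```

The rest of this lesson uses only `for` loops.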

We'll write a `for` loop that simulates what someone might see during a visit to the zoo. Let's start by making a list of animals:

In [None]:
animals = ['lion', 'tiger', 'crocodile', 'vulture', 'hippo']
print(animals)

The `print` function lets us see the contents of `animals`, but what if we want to go through the list of animals one at a time and print each one on its own line? This is where we can use a `for` loop to iterate through the list:

In [None]:
for creature in animals:
    print(creature)

We can do more than print each item in the list:

In [None]:
for creature in animals:
    print("a " + creature + ", how scary!")

The line defining the loop must start with the special term `for` and end with a colon. The body of the loop must be indented.

In this example, `creature` is the loop variable that takes the value of the next entry in `animals` every time the loop goes around. We can call the loop variable anything we like. *(Show)*
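
For instance, here is a quick sketch of the same loop with a deliberately odd loop variable name (`banana` and `seen` are arbitrary names chosen for illustration):

```python
animals = ['lion', 'tiger', 'crocodile', 'vulture', 'hippo']

# Same loop as before; only the name of the loop variable has changed
seen = []
for banana in animals:
    seen.append(banana)
    print(banana)
```

Python does not care what the loop variable is called, but a descriptive name such as `creature` makes the code easier to read.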

After the loop finishes, the loop variable will still exist and will have the value of the last entry in the collection:

In [None]:
for creature in animals:
    pass

In [None]:
print('The loop variable is now: ' + creature)

We are no longer asking Python to print the value of the loop variable, but the `for` loop still runs and the value of `creature` changes on each pass through the loop.

The statement `pass` in the body of the loop just means “do nothing”.

### Automating data processing using For Loops

Let's first create a new directory for our output. Try:

In [None]:
!ls

In [None]:
!mkdir 'surveys_by_year'

If that doesn't work, you can navigate to the directory with your data and create a subfolder manually. Or try:

In [None]:
import os

os.mkdir('surveys_by_year')
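
Note that `os.mkdir` raises a `FileExistsError` if the directory already exists, which can happen when a notebook cell is run twice. One possible workaround, sketched here in a temporary directory so that nothing is left behind (`base_dir` and `out_dir` are names chosen for this example), is `os.makedirs` with `exist_ok=True`:

```python
import os
import tempfile

# Work inside a throwaway directory so this sketch leaves no files behind
base_dir = tempfile.mkdtemp()
out_dir = os.path.join(base_dir, 'surveys_by_year')

os.makedirs(out_dir, exist_ok=True)  # creates the directory
os.makedirs(out_dir, exist_ok=True)  # running it again is harmless

print(os.path.isdir(out_dir))
```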

In previous lessons, we saw 

1. how to use the library pandas to load the species data into memory as a DataFrame, 
2. how to select a subset of the data using some criteria, and 
3. how to write the DataFrame into a CSV file. 

Let’s write a script that performs those three steps in sequence for the year 2002:

In [None]:
import pandas as pd

surveys_df = pd.read_csv('surveys.csv')

Select data from `year` `2002`

In [None]:
surveys_2002 = surveys_df[surveys_df.year == 2002]
surveys_2002

Write `surveys_2002` to a CSV

In [None]:
surveys_2002.to_csv('surveys_by_year/surveys2002.csv')

To create yearly data files, we could repeat the last two commands over and over, once for each year of data. Repeating code is neither elegant nor practical, and is very likely to introduce errors into your code. We want to turn what we’ve just written into a loop that repeats the last two commands for every year in the dataset.

Let’s start by writing a loop that simply prints the names of the files we want to create - the dataset we are using covers 1977 through 2002, and we’ll create a separate file for each of those years. Listing the filenames is a good way to confirm that the loop is behaving as we expect.

We have seen that we can loop over a list of items, so we need a list of years to loop over. We can get the years in our DataFrame with:

In [None]:
surveys_df['year']

But we want only the unique years, which we can get with the `unique` method we have already seen.

In [None]:
surveys_df['year'].unique()

# or
# surveys_df.year.unique()

Putting this into a `for` loop, we get:

In [None]:
for year in surveys_df['year'].unique():
    file_path = 'surveys_by_year/surveys' + str(year) + '.csv'
    print(file_path)

So we see that we're able to create a new file name for each year. Now we want to add the steps that pull out the data for each year and save it to a file with that name:

In [None]:
for year in surveys_df['year'].unique():
    # Select data for the year
    surveys_year = surveys_df[surveys_df.year == year]

    # Write the new DataFrame to a CSV file
    file_path = 'surveys_by_year/surveys' + str(year) + '.csv'
    surveys_year.to_csv(file_path)

Look inside your directory and check a couple of the files you just created to confirm that everything worked as expected.
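
The check can also be done from Python. The sketch below is self-contained: it uses a small made-up DataFrame (`toy_df`) and a temporary directory instead of `surveys_df` and `surveys_by_year`, so the values shown are not real survey data.

```python
import os
import tempfile
import pandas as pd

# Toy stand-in for surveys_df (values are made up for illustration)
toy_df = pd.DataFrame({
    'record_id': [1, 2, 3, 4],
    'year': [1977, 1977, 1978, 1978],
    'weight': [40.0, 48.0, 37.0, 29.0],
})

out_dir = tempfile.mkdtemp()  # throwaway directory for this sketch
for year in toy_df['year'].unique():
    file_path = os.path.join(out_dir, 'surveys' + str(year) + '.csv')
    toy_df[toy_df.year == year].to_csv(file_path)

# List what was written, then read one file back to inspect it
written = sorted(os.listdir(out_dir))
print(written)

check_df = pd.read_csv(os.path.join(out_dir, 'surveys1977.csv'))
print(check_df)
```

Because `to_csv` also writes the DataFrame index by default, the file that is read back gains an extra unnamed first column. Passing `index=False` to `to_csv` leaves it out.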

### Unique file names

Our code writes a unique file name for each iteration of the loop. Let’s break down the parts of the `file_path` variable:

- The first part specifies the directory to store our data file in (`surveys_by_year`), followed by a forward slash and the first part of the file name (`surveys`): `'surveys_by_year/surveys'`
- This is concatenated with the value of the `year` variable using the plus sign. Because `year` is a number, we first convert it to a string with the `str` function.

In [None]:
type(year)

In [None]:
type(str(year))

- Then we add the file extension as another text string: `+ '.csv'`

Notice that quotes surround the literal text strings, while the variable `year` is not surrounded by quotes.

This code produces a string that contains the path to the new file and the file name.
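
For comparison, here is a short sketch of two other common ways to build the same path. These are alternatives, not what the lesson code uses, and the names `path_concat`, `path_fstring`, and `path_joined` are chosen for this example only.

```python
import os

year = 2002  # an integer, like the loop variable above

# Concatenation needs str(); 'surveys' + year would raise a TypeError
path_concat = 'surveys_by_year/surveys' + str(year) + '.csv'

# An f-string converts the number to text for us
path_fstring = f'surveys_by_year/surveys{year}.csv'

# os.path.join inserts the separator used by the operating system
path_joined = os.path.join('surveys_by_year', f'surveys{year}.csv')

print(path_concat)
print(path_fstring)
print(path_joined)
```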

**Challenge**: Some of the surveys you saved are missing data (they have null values that show up as NaN - Not A Number - in the DataFrames and are empty in the text files). 

Modify the for loop so that the entries with null values are not included in the yearly files. 

**Hint**: Try using the `.dropna()` method for a dataframe, for help use: `surveys_2002.dropna?`

In [None]:
# Challenge solution

!mkdir 'surveys_no_nan'

for year in surveys_df['year'].unique():
    # Select data for the year
    surveys_year = surveys_df[surveys_df.year == year]
    surveys_year = surveys_year.dropna(how='any',axis=0) #new line, axis=0 acts on rows
    
    # Write the new DataFrame to a CSV file
    file_path = 'surveys_no_nan/surveys' + str(year) + '.csv' # new directory here
    surveys_year.to_csv(file_path)
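
To see what `dropna` does to individual rows, here is a small sketch on made-up data (`toy_df` is not the real survey table). It also shows the `subset` argument, which restricts the null check to chosen columns:

```python
import numpy as np
import pandas as pd

# Toy data: row 0 is complete, row 1 lacks a species, row 2 lacks a weight
toy_df = pd.DataFrame({
    'year': [2002, 2002, 2002],
    'species_id': ['DM', None, 'PE'],
    'weight': [40.0, 35.0, np.nan],
})

# how='any' drops a row if any of its columns is null
complete_rows = toy_df.dropna(how='any', axis=0)

# subset= limits the check to the listed columns
has_weight = toy_df.dropna(subset=['weight'])

print(len(toy_df), len(complete_rows), len(has_weight))  # 3 1 2
```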

**Challenge**: Instead of splitting out the data by years, a colleague wants to analyse each species separately. How would you write a unique CSV file for each species?

**Note**: Make a new directory for your data first!

In [None]:
# Challenge solution
surveys_df.columns

In [None]:
# Challenge solution

!mkdir 'survey_by_species'

unique_species = surveys_df['species_id'].dropna().unique() # new line; dropna() skips records with no species ID

for species in unique_species: # changed line
    # Select data for the species
    surveys_species = surveys_df[surveys_df.species_id == species] # changed line
    
    # Write the new DataFrame to a CSV file
    file_path = 'survey_by_species/species-' + str(species) + '.csv'  # changed line
    surveys_species.to_csv(file_path)  # changed line
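
A different technique for the same task, not used elsewhere in this lesson, is the pandas `groupby` method, which hands us each species together with its rows. This is a minimal sketch on made-up data (`toy_df` and `row_counts` are names for this example), with the file-writing step left as a comment:

```python
import pandas as pd

# Toy data (made up for illustration); the last record has no species ID
toy_df = pd.DataFrame({
    'record_id': [1, 2, 3, 4],
    'species_id': ['DM', 'PE', 'DM', None],
    'weight': [40.0, 21.0, 44.0, 30.0],
})

# groupby yields (key, sub-DataFrame) pairs; rows with a missing key
# are left out by default
row_counts = {}
for species, group in toy_df.groupby('species_id'):
    row_counts[species] = len(group)
    # group.to_csv('survey_by_species/species-' + str(species) + '.csv')
    # would write one file per species here

print(row_counts)
```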

## Building reusable and modular code with functions

Suppose that separating large data files into individual yearly files is a task that we frequently have to perform. We could write a `for` loop like the one above every time we needed to do it, but that would be time-consuming and error-prone.

A more elegant solution would be to create a reusable tool that performs this task with minimum input from the user. To do this, we are going to turn the code we’ve already written into a function.

Let's start with an explanatory function to go over the structure:

In [None]:
def this_is_the_function_name(input_argument1, input_argument2):

    # The body of the function is indented
    # This function prints the two arguments to screen
    print('The function arguments are:', input_argument1, input_argument2, '(this is done inside the function!)')

    # And returns their product
    return input_argument1 * input_argument2

We've now written our first function. Unlike the `for` loop, writing the function is separate from using the function. When the function is written, it is "defined", but it doesn't produce any output.

In order to use the function, we need to explicitly "call" it and feed it some input arguments.

In [None]:
product_of_inputs = this_is_the_function_name(2,5)

In [None]:
print('Their product is:', product_of_inputs, '(this is done outside the function!)')

**Challenges**:
1. Change the values of the input arguments in the function to different numbers and check its output
2. Try calling the function by giving it the wrong number of arguments (not 2)
3. Declare a variable inside the function and test to see where it exists (Hint: can you print it from outside the function?)
4. Explore what happens when a variable inside the function and a variable outside it have the same name. What happens to the global variable when you change the value of the local variable?

In [None]:
# Challenge answer 3 and 4

#my_variable = "I am outside the function"

def this_is_the_function_name(input_argument1, input_argument2):
    
    my_variable = "I am inside the function"

    print('The function arguments are:', input_argument1, input_argument2, '(this is done inside the function!)')

    return input_argument1 * input_argument2
#    return my_variable

In [None]:
output = this_is_the_function_name(2,3)

my_variable

In [None]:
# change to `return my_variable`
output
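
The behaviour explored in challenges 3 and 4 can be condensed into one self-contained sketch (`show_scope` and `returned` are hypothetical names used only here):

```python
my_variable = "I am outside the function"

def show_scope():
    # This assignment creates a new local variable;
    # the global variable of the same name is left untouched
    my_variable = "I am inside the function"
    return my_variable

returned = show_scope()
print(returned)      # the local value, passed back through return
print(my_variable)   # the global value, unchanged by the function call
```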

We can now turn our code for saving yearly data files into a function. 

There are many different “chunks” of this code that we can turn into functions, and we can even create functions that call other functions inside them. 

Let’s first write a function that separates data for just one year and saves that data to a file:

In [None]:
def one_year_csv_writer(this_year, all_data):
    """
    Writes a csv file for data from a given year.

    this_year --- year for which data is extracted
    all_data --- DataFrame with multi-year data
    """

    # Select data for the year
    surveys_year = all_data[all_data.year == this_year]

    # Write the new DataFrame to a csv file
    file_path = 'surveys_by_year/function_surveys' + str(this_year) + '.csv'
    surveys_year.to_csv(file_path)
    
    print("Now writing data for the year: {}".format(this_year))

The text between the two sets of triple double quotes is called a docstring and contains the documentation for the function. 

It does nothing when the function is running and is therefore not necessary, but it is good practice to include docstrings as a reminder of what the code does. 

Docstrings in functions also become part of their ‘official’ documentation:

In [None]:
one_year_csv_writer?
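
The `?` syntax only works in IPython and Jupyter. As a sketch of how the same text can be reached in plain Python, the example below uses a stand-in function (`docstring_demo` is a hypothetical name, with the body left out on purpose):

```python
import inspect

def docstring_demo(this_year, all_data):
    """
    Writes a csv file for data from a given year.

    this_year --- year for which data is extracted
    all_data --- DataFrame with multi-year data
    """
    pass  # body omitted; only the docstring matters for this sketch

# The raw docstring is stored on the function object itself
raw_doc = docstring_demo.__doc__

# inspect.getdoc returns the same text with the indentation cleaned up
clean_doc = inspect.getdoc(docstring_demo)
print(clean_doc)
```

The built-in `help` function, called as `help(docstring_demo)`, displays the same documentation together with the function signature.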

Now we can call our function with some arguments:

In [None]:
one_year_csv_writer(2002, surveys_df)

What we really want to do is create files for multiple years without having to request them one by one.

Let’s write another function that replaces the `for` loop by simply looping through a sequence of years and repeatedly calling the function we just wrote, `one_year_csv_writer`:

In [None]:
def yearly_data_csv_writer(start_year, end_year, all_data):
    """
    Writes separate CSV files for each year of data.

    start_year --- the first year of data we want
    end_year --- the last year of data we want
    all_data --- DataFrame with multi-year data
    """

    # "end_year" is the last year of data we want to pull, so we loop to end_year+1
    for year in range(start_year, end_year+1):
        one_year_csv_writer(year, all_data)

Let's try to call our new function by giving it a start and end year:

In [None]:
yearly_data_csv_writer(1978, 1983, surveys_df)

**Challenge**: How could you use the function `yearly_data_csv_writer` to create a CSV file for only one year? (Hint: think about the syntax for `range`)

In [None]:
# Challenge solution
range?
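
One possible answer, sketched without writing any files, is to look at how `range` treats its endpoints (the variable `years` is used only in this example):

```python
# range(start, stop) includes start but stops just before stop
print(list(range(1978, 1984)))  # [1978, 1979, 1980, 1981, 1982, 1983]

# With start_year equal to end_year, range(start_year, end_year + 1)
# contains exactly one year
start_year = 2002
end_year = 2002
years = list(range(start_year, end_year + 1))
print(years)  # [2002]
```

So a call such as `yearly_data_csv_writer(2002, 2002, surveys_df)` should write a file for 2002 only.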

The functions we wrote require a value for every argument. Ideally, we would like these functions to be as flexible and independent as possible. 

Let’s look at how we could modify the function `yearly_data_csv_writer` so that the `start_year` and `end_year` default to the full range of the data if they are not supplied by the user. 

Arguments can be given default values with an equals sign in the function definition. Any argument without a default value (here, `all_data`) is required and **must** come before the arguments with default values (which are optional in the function call).

In [None]:
def yearly_data_arg_test(all_data, start_year=None, end_year=None):

    if start_year is None:
        start_year = min(all_data.year)
    if end_year is None:
        end_year = max(all_data.year)

    return start_year, end_year

In [None]:
start, end = yearly_data_arg_test(surveys_df, 1988, 1993)
print('Both optional arguments: ', start, end)

start, end = yearly_data_arg_test(surveys_df)
print('Default values: ', start, end)
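
Optional arguments can also be passed by name, which lets us supply only some of them. The sketch below repeats the function so that it runs on its own, and it uses a made-up `toy_df` in place of `surveys_df`:

```python
import pandas as pd

# Toy stand-in for surveys_df (values are made up for illustration)
toy_df = pd.DataFrame({'year': [1977, 1985, 2002]})

def yearly_data_arg_test(all_data, start_year=None, end_year=None):
    if start_year is None:
        start_year = min(all_data.year)
    if end_year is None:
        end_year = max(all_data.year)
    return start_year, end_year

# Supply only the end year; start_year falls back to the earliest year
only_end = yearly_data_arg_test(toy_df, end_year=1990)
print('Only end_year given: ', only_end[0], only_end[1])

# Keyword arguments may be given in any order
reordered = yearly_data_arg_test(toy_df, end_year=1993, start_year=1988)
print('Keywords in reverse order: ', reordered[0], reordered[1])
```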

The body of the test function now has two conditionals (`if` statements) that check the values of `start_year` and `end_year`. 

`if` statements execute a segment of code when some condition is met. They commonly look something like this:

In [None]:
def check_if_zero(number):
    if number < 0:  # Meets first condition?
        # number IS less than zero
        print('This is a negative number.')

    elif number > 0:  # Did not meet first condition. Meets second condition?
        # number ISN'T less than zero and IS more than zero
        print('This is a positive number.')

    else:  # Met neither condition
        # number ISN'T less than zero and ISN'T more than zero
        print('This must be zero!')

In [None]:
a = 5
check_if_zero(a)

We can change the value of `a` to see how this function works. The statement `elif` means “else if”, and every conditional statement must end in a colon.

**(Optional) Challenges**:
1. Rewrite the `yearly_data_csv_writer` function to have keyword arguments with default values.
2. Modify the `one_year_csv_writer` function so that it doesn’t create a yearly file if there is no data for a given year and instead displays an alert to the user (**Hint**: use conditional statements to do this).

In [None]:
# Challenge answer 1

def yearly_data_csv_writer(all_data, start_year=None, end_year=None):
    """
    Modified from yearly_data_csv_writer to test default argument values!

    start_year --- the first year of data we want --- default: None - check all_data
    end_year --- the last year of data we want --- default: None - check all_data
    all_data --- DataFrame with multi-year data
    """
    if start_year is None:
        start_year = min(all_data.year)
    if end_year is None:
        end_year = max(all_data.year)
    
    print("Start year: {} \nEnd year: {}".format(start_year,end_year))

    # "end_year" is the last year of data we want to pull, so we loop to end_year+1
    for year in range(start_year, end_year+1):
        one_year_csv_writer(year, all_data)

In [None]:
# Challenge answer 2

def one_year_csv_writer(this_year, all_data):
    """
    Writes a csv file for data from a given year.

    this_year --- year for which data is extracted
    all_data --- DataFrame with multi-year data
    """

    # Select data for the year
    if this_year in all_data['year'].unique():
        surveys_year = all_data[all_data.year == this_year]
    
        # Write the new DataFrame to a csv file
        file_path = 'surveys_by_year/function_surveys' + str(this_year) + '.csv'
        surveys_year.to_csv(file_path)
    
        print("Now writing data for the year: {}".format(this_year))
    
    else:
        print("Alert: The year {} is missing from this dataframe".format(this_year))

In [None]:
one_year_csv_writer(2018, surveys_df)

In [None]:
yearly_data_csv_writer(surveys_df)

In [None]:
yearly_data_csv_writer(surveys_df, start_year=1990, end_year=2008)

## Key Points
- Loops help automate repetitive tasks over sets of items.
- Loops combined with functions provide a way to process data more efficiently than we could by hand.
- Conditional statements enable execution of different operations on different data.
- Functions enable code reuse.