# CORE Skills Prerequisite - Intro to Automation

This lesson is adapted from the [Data Carpentry Ecology lesson](http://www.datacarpentry.org/python-ecology-lesson/)

## How to use a Jupyter Notebook

https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/index.html

https://jupyterlab.readthedocs.io/en/stable/user/notebook.html

- The file autosaves
- You run a cell with **shift + enter** or using the run button in the toolbar
- If you run a cell with **option + enter** (**alt + enter** on Windows and Linux) it will also create a new cell below
- See *Help > Keyboard Shortcuts* or the *Cheatsheet* for more info


- The notebook has different types of cells: Code and Markdown are the most commonly used
- **Code** cells expect code for the kernel you have chosen. Syntax highlighting is available, and comments are marked with `#`: anything after it on the same line is not executed
- **Markdown** cells allow you to write report-style text, using Markdown to format it (e.g. headers, bold face, etc.)

# Automating data processing using For Loops

So far, we've used Python and the pandas library to explore and manipulate
individual datasets by hand, much like we would do in a spreadsheet. The beauty
of using a programming language like Python, though, comes from the ability to
automate data processing through the use of loops and functions.

## For loops

Loops allow us to repeat a workflow (or series of actions) a given number of
times or while some condition is true. We would use a loop to automatically
process data that's stored in multiple files (daily values with one file per
year, for example). Loops lighten our work load by performing repeated tasks
without our direct involvement and make it less likely that we'll introduce
errors by making mistakes while processing each file by hand.

Let's write a simple for loop that simulates what a kid might see during a
visit to the zoo:

In [None]:
animals = ['lion','tiger','crocodile','vulture','hippo']
print(animals)

In [None]:
for animal in animals:
    print(animal)
    print(animals)

The line defining the loop must start with `for` and end with a colon, and the
body of the loop must be indented.

In this example, `animal` is the loop variable that takes the value of the next
entry in `animals` every time the loop goes around. We can call the loop variable
anything we like. After the loop finishes, the loop variable will still exist
and will have the value of the last entry in the collection:

In [None]:
for i in range(0,5):
    print(animals[i])
    print(i)

In [None]:
for animal in animals:
    pass

In [None]:
animal

We are not asking Python to print the value of the loop variable anymore, but
the for loop still runs and the value of `animal` changes on each pass through
the loop. The statement `pass` in the body of the loop just means "do nothing".
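
Loops can also repeat while some condition is true, as mentioned at the start of this section. A minimal sketch of that form, reusing the zoo list (the variable names `remaining` and `seen` are made up for this illustration):

```python
# Hypothetical example: visit animals until none are left
remaining = ['lion', 'tiger', 'crocodile', 'vulture', 'hippo']
seen = []

# The body repeats as long as the condition is true (the list is not empty)
while len(remaining) > 0:
    animal = remaining.pop(0)  # take the first animal off the list
    seen.append(animal)

print(seen)       # all five animals, in the original order
print(remaining)  # an empty list: the condition became false
```

A `for` loop is usually the better fit when the collection is known in advance, as it is throughout this lesson.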

---



## Starting in the same spot

To help the lesson run smoothly, let's ensure everyone is in the same directory.
This should help us avoid path and file name issues. At this time please
navigate to the workshop directory. If you are working in a Jupyter Notebook, be sure
that you start your notebook in the workshop directory.

As a quick aside, Python libraries such as the [`os`
library](https://docs.python.org/3/library/os.html) can work with our
directory structure; however, that is not our focus today.

If you need to change your directory, `import os` and use `os.chdir`.

**We want to be in the data folder**

In [None]:
# check if you need to change your directory
import os
os.getcwd()  

In [None]:
# os.listdir("../")

In [None]:
# os.chdir("../data/")

In [None]:
# os.getcwd()  

In [None]:
import pandas as pd
#check your version, we need v0.19 or higher
pd.__version__

In [None]:
surveys_df = pd.read_csv("surveys.csv")

The file we've been using so far, `surveys.csv`, contains 25 years of data and is
very large. We would like to separate the data for each year into a separate
file.

Let's start by making a new directory inside the folder `data` to store all of
these files using the module `os`:

In [None]:
os.mkdir('yearly_files')

The command `os.mkdir` is equivalent to `mkdir` in the shell. Just so we are
sure, we can check that the new directory was created within the `data` folder:

In [None]:
os.listdir('./')

The command `os.listdir` is equivalent to `ls` in the shell.
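
Running `os.mkdir` a second time on a directory that already exists raises a `FileExistsError`, which is easy to hit when re-running notebook cells. One possible way around it is sketched here in a temporary directory so that nothing in the workshop folder is touched (the name `demo_yearly_files` is made up for the example):

```python
import os
import tempfile

# Work in a throwaway directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'demo_yearly_files')  # hypothetical name

    os.mkdir(target)          # first call creates the directory
    try:
        os.mkdir(target)      # second call fails: it already exists
        second_call_failed = False
    except FileExistsError:
        second_call_failed = True

    # exist_ok=True makes the call safe to repeat
    os.makedirs(target, exist_ok=True)
    created = os.path.isdir(target)

print(second_call_failed, created)
```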

---


Previously, we saw how to use the library pandas to load the species
data into memory as a DataFrame, how to select a subset of the data using some
criteria, and how to write the DataFrame into a csv file. Let's write a script
that performs those three steps in sequence for the year 2002:

```python
import pandas as pd

# Load the data into a DataFrame
surveys_df = pd.read_csv('data/surveys.csv')

# Select only data for 2002
surveys2002 = surveys_df[surveys_df.year == 2002]

# Write the new DataFrame to a csv file
surveys2002.to_csv('data/yearly_files/surveys2002.csv')
```

To **create yearly data files**, we could repeat the last two commands over and
over, once for each year of data. Repeating code is neither elegant nor
practical, and is very likely to introduce errors into your code. **We want to
turn what we've just written into a loop** that repeats the last two commands for
every year in the dataset.

Let's start by writing a loop that simply prints the names of the files we want
to create - the dataset we are using covers 1977 through 2002, and we'll create
a separate file for each of those years. Listing the filenames is a good way to
confirm that the loop is behaving as we expect.


We have seen that we can loop over a list of items, so we need a list of years 
to loop over. We can get the *unique* years in our DataFrame with:

In [None]:
surveys_df.year.unique()

Putting this into our for loop we get

In [None]:
for  year in surveys_df.year.unique():
    # creating filename
    filename = 'yearly_files/surveys_year' + str(year) + '.csv'
    print(filename)


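Notice that we use single quotes around the text strings, while the variable
`year` is not surrounded by quotes. Because `year` is a number, we convert it
with `str()` before joining it to the text. For the year 2002 this code produces
the string `yearly_files/surveys_year2002.csv`, which contains both the path to
the new file and the file name itself.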
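
String concatenation with `+` and `str()` is only one way to build the file name. As an alternative sketch, an f-string combined with `os.path.join` builds the same kind of path; this is a stylistic option, not something the lesson requires.

```python
import os

year = 2002

# Concatenation, as in the loop above
name_concat = 'yearly_files/surveys_year' + str(year) + '.csv'

# f-string: the variable goes inside curly braces, no str() needed
name_fstring = f'surveys_year{year}.csv'

# os.path.join inserts the separator appropriate for the operating system
path_joined = os.path.join('yearly_files', name_fstring)

print(name_concat)
print(path_joined)
```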

We can now add the rest of the steps we need to create separate text files.
Once finished, look inside the `yearly_files` directory and check a couple of the files you
just created to confirm that everything worked as expected.

In [None]:
for year in surveys_df.year.unique():
    #creating filename
    filename =  'yearly_files/surveys_year' + str(year) + '.csv'
    # extracting data of a specific year
    surveys_year = surveys_df[surveys_df.year == year]
    # writing to file
    surveys_year.to_csv(filename)

In [None]:
os.listdir('yearly_files/')
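
To try the loop without the full survey file, here is a small self-contained sketch. The three-row DataFrame and the `demo_year` file prefix are invented for illustration, and the files go to a temporary directory. It also passes `index=False` to `to_csv`, an optional argument that leaves the row index out of the file; the loop above keeps the pandas default and writes it.

```python
import os
import tempfile

import pandas as pd

# Tiny made-up stand-in for surveys_df
demo_df = pd.DataFrame({
    'record_id': [1, 2, 3],
    'year': [1977, 1977, 1978],
    'species': ['NL', 'DM', 'DM'],
})

with tempfile.TemporaryDirectory() as tmp:
    for year in demo_df.year.unique():
        filename = os.path.join(tmp, 'demo_year' + str(year) + '.csv')
        demo_year = demo_df[demo_df.year == year]
        # index=False: do not write the row index as an extra column
        demo_year.to_csv(filename, index=False)

    written = sorted(os.listdir(tmp))
    rows_1977 = len(pd.read_csv(os.path.join(tmp, 'demo_year1977.csv')))

print(written)
print(rows_1977)
```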


### Challenge




   1. What happens if there is no data for a year in the sequence (for example, imagine we had used 1976 as the start year in range)?

   2. Let's say you only want to look at data from a given multiple of years. How would you modify your loop in order to generate a data file for only every 5th year, starting from 1977? Hint: you will need to use `range` to specify the list of numbers.

   ```python
   range(start, stop, step)
   ```

   3. Instead of splitting out the data by years, a colleague wants to analyze each species separately. How would you write a unique csv file for each species?
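
As a quick check of how the three arguments to `range` behave, the sketch below lists every 5th year from 1977. The stop value is excluded, so 2003 is used here to let 2002 be included.

```python
# range(start, stop, step): stop itself is never included
every_fifth_year = list(range(1977, 2003, 5))
print(every_fifth_year)
# [1977, 1982, 1987, 1992, 1997, 2002]
```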


In [None]:
# what do we get returned for a year that does not exist? 
surveys_df[(surveys_df['year']== 1976)]

In [None]:
# only save data for every 5th year using range
for year in range(surveys_df.year.min(),surveys_df.year.max()+1,5):
    #creating filename
    filename = 'yearly_files/5yeardata_' + str(year) + '.csv'
    # extracting data of a specific year
    surveys_year = surveys_df[surveys_df.year == year]
    # writing to file
    surveys_year.to_csv(filename)

In [None]:
os.listdir('yearly_files/')

In [None]:
#find the unique species
surveys_df.species.dropna().unique()

In [None]:
# create the new folder for the species data
os.mkdir('species')

In [None]:
# save data files for each species, 
# Caution: skip the nan
for species in surveys_df.species.dropna().unique():
    #creating filename
    filename = 'species/species_' + species + '.csv'
    # extracting data of a specific species
    surveys_species = surveys_df[surveys_df.species == species]
    # writing to file
    surveys_species.to_csv(filename)

In [None]:
os.listdir('species/')

## Building reusable and modular code with functions

Suppose that separating large data files into individual yearly files is a task
that we frequently have to perform. We could write a **for loop** like the one above
every time we needed to do it but that would be time consuming and error prone.
A more elegant solution would be to create a reusable tool that performs this
task with minimum input from the user. To do this, we are going to turn the code
we've already written into a function.

Functions are reusable, self-contained pieces of code that are called with a
single command. They can be designed to accept arguments as input and return
values, but they don't need to do either. Variables declared inside functions
only exist while the function is running and if a variable within the function
(a local variable) has the same name as a variable somewhere else in the code,
the local variable hides but doesn't overwrite the other.
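
A minimal sketch of that last point, with made-up names (`label`, `relabel`): assigning to a variable inside a function creates a local variable and leaves the outer one untouched.

```python
label = 'outside'  # variable defined outside the function

def relabel():
    # This assignment creates a new local variable named label;
    # it hides the outer one only while the function runs
    label = 'inside'
    return label

result = relabel()
print(result)  # inside
print(label)   # outside
```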

Every method used in Python (for example, `print`) is a function, and the
libraries we import (say, `pandas`) are a collection of functions. We will only
use functions that are housed within the same code that uses them, but it's also
easy to write functions that can be used by different programs.

Functions are declared following this general structure:

```python
def this_is_the_function_name(input_argument1, input_argument2):

    # The body of the function is indented
    # This function prints the two arguments to screen
    print('The function arguments are:', input_argument1, input_argument2, '(this is done inside the function!)')

    # And returns their product
    return input_argument1 * input_argument2
```

The function declaration starts with the word `def`, followed by the function
name and any arguments in parentheses, and ends in a colon. The body of the
function is indented just like loops are. If the function returns something when
it is called, it includes a return statement at the end.

In [None]:
#let's define this function
def this_is_the_function_name(input_argument1, input_argument2):

    # The body of the function is indented
    # This function prints the two arguments to screen
    print('The function arguments are:', input_argument1, input_argument2, '(this is done inside the function!)')

    # And returns their product
    return input_argument1 * input_argument2

In [None]:
#and now let's call the function:
this_is_the_function_name(5,2)

In [None]:
this_is_the_function_name(input_argument2=5, input_argument1=2)

### Challenge:

1. Try calling the function by giving it the wrong number of arguments (not 2)
2. Declare a variable inside the function and test to see where it exists (Hint:
   can you print it from outside the function?)
3. Explore what happens when variables inside and outside the function
   have the same name. What happens to the global variable when you change the
   value of the local variable?

In [None]:
# try giving only 1 or maybe 3 inputs
this_is_the_function_name(5,2,3)

In [None]:
def this_other_function(in1, in2=74646):
    new_variable = 3
    print(new_variable, in1, in2)
    return

In [None]:
this_other_function(1)

In [None]:
this_other_function(1,2)

In [None]:
new_variable

In [None]:
new_variable = 5
print(new_variable)
this_other_function(1,2)
print(new_variable)

In [None]:
def this_last_function(in1, in2=74646):
    new_variable = 3
    print(new_variable, in1, in2)
    return new_variable

In [None]:
print(new_variable)
other_variable = this_last_function(1,2)
print(new_variable)
print(other_variable)

---

We can now turn our code for saving yearly data files into a function. There are
many different "chunks" of this code that we can turn into functions, and we can
even create functions that call other functions inside them. Let's first write a
function that separates data for just one year and saves that data to a file:

```python
def one_year_csv_writer(this_year, all_data):
    """
    Writes a csv file for data from a given year.

    this_year --- year for which data is extracted
    all_data --- DataFrame with multi-year data
    """

    # Select data for the year
    surveys_year = all_data[all_data.year == this_year]

    # Write the new DataFrame to a csv file
    filename = 'data/yearly_files/function_surveys_year' + str(this_year) + '.csv'
    surveys_year.to_csv(filename)
```

In [None]:
def one_year_csv_writer(this_year, all_data):
    """
    Writes a csv file for data from a given year.

    this_year --- year for which data is extracted
    all_data --- DataFrame with multi-year data
    """

    # Select data for the year
    surveys_year = all_data[all_data.year == this_year]

    # Write the new DataFrame to a csv file
    filename = 'yearly_files/function_surveys_year' + str(this_year) + '.csv'
    surveys_year.to_csv(filename)

The text between the two sets of triple double quotes is called a docstring and
contains the documentation for the function. It does nothing when the function
is running and is therefore not necessary, but it is good practice to include
docstrings as a reminder of what the code does. Docstrings in functions also
become part of their 'official' documentation:

In [None]:
sum?

In [None]:
one_year_csv_writer(2002,surveys_df)

In [None]:
os.listdir('yearly_files/')

In [None]:
one_year_csv_writer?
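
The `?` syntax in the cells above is an IPython/Jupyter feature. In a plain Python script the same text can be reached with the built-in `help` function or the `__doc__` attribute. A small sketch with a made-up function, `double`:

```python
def double(value):
    """Return twice the given value."""
    return value * 2

# The docstring is stored on the function object itself
print(double.__doc__)

# help() prints the signature together with the docstring
help(double)
```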

We changed the root of the name of the csv file so we can distinguish it from
the one we wrote before. Check the `yearly_files` directory for the file. Did it
do what you expect?

---

What we really want to do, though, is **create files for multiple years without
having to request them one by one**. Let's write another function that replaces
the entire `for loop` by simply looping through a sequence of years and repeatedly
calling the function we just wrote, `one_year_csv_writer`:


```python
def yearly_data_csv_writer(start_year, end_year, all_data):
    """
    Writes separate csv files for each year of data.

    start_year --- the first year of data we want
    end_year --- the last year of data we want
    all_data --- DataFrame with multi-year data
    """

    # "end_year" is the last year of data we want to pull, so we loop to end_year+1
    for year in range(start_year, end_year+1):
        one_year_csv_writer(year, all_data)
```

In [None]:
def yearly_data_csv_writer(start_year, end_year, all_data):
    """
    Writes separate csv files for each year of data.

    start_year --- the first year of data we want
    end_year --- the last year of data we want
    all_data --- DataFrame with multi-year data
    """

    # "end_year" is the last year of data we want to pull, so we loop to end_year+1
    for year in range(start_year, end_year+1):
        one_year_csv_writer(year, all_data)

Because people will naturally expect `end_year` to be the last year for which a
file is written, the for loop inside the function runs over
`range(start_year, end_year + 1)`. The extra 1 is needed because `range()` does
not include its last number. Try it for yourself:

In [None]:
list(range(5))

By writing the entire loop into a function, we've made a reusable tool for whenever
we need to break a large data file into yearly files. Because we can specify the
first and last year for which we want files, we can even use this function to
create files for a subset of the years available. This is how we call this
function:

In [None]:
yearly_data_csv_writer(1980,1990,surveys_df)

In [None]:
os.listdir('yearly_files/')

**BEWARE!** If you are using Jupyter Notebooks and you modify a function, you MUST
re-run that cell in order for the changed function to be available to the rest
of the code. Nothing will visibly happen when you do this, though, because
simply defining a function without *calling* it doesn't produce an output. Any
cells that use the now-changed functions will also have to be re-run for their
output to change.

### Challenge:

1. **Add two arguments** to the functions we wrote that take the *path* of the
   directory where the files will be written and the *root* of the file name.
   Additionally, **add default values for all year inputs**. Note: rearrange the order of the inputs so that arguments with default values are listed last.
   Create a new set of files with a different name in a different directory.
2. Make the functions **return a list** of the files they have written. There are
   many ways you can do this (and you should try them all!): either of the
   functions can print to screen, either can use a return statement to give back
   numbers or strings to their function call, or you can use some combination of
   the two. You could also try using the `os` library to list the contents of
   directories.

In [None]:
# adding two arguments 

def one_year_csv_writer(all_data,  directory,  file_root, this_year = 1977):
    """
    Writes a csv file for data from a given year in the specified directory using the specified root name

    this_year --- year for which data is extracted --- default: 1977
    all_data --- DataFrame with multi-year data
    directory --- directory in which the data is to be saved, include the /
    file_root --- prefix for the filename [prefix_year.csv], include any _ wanted
    """

    # Select data for the year
    surveys_year = all_data[all_data.year == this_year]

    # Write the new DataFrame to a csv file
    filename = directory +  file_root + str(this_year) + '.csv'
    surveys_year.to_csv(filename)

def yearly_data_csv_writer(all_data,  directory,  file_root, start_year = 1977, end_year = 1979):
    """
    Writes separate csv files for each year of data.

    start_year --- the first year of data we want --- default: 1977
    end_year --- the last year of data we want --- default: 1979
    all_data --- DataFrame with multi-year data
    directory --- directory in which the data is to be saved, include the /
    file_root --- prefix for the filename [prefix_year.csv], include any _ wanted
    """

    # "end_year" is the last year of data we want to pull, so we loop to end_year+1
    for year in range(start_year, end_year+1):
        one_year_csv_writer(all_data, directory,  file_root, year)

In [None]:
# adding a list of filenames to be returned
# adding two arguments 

def one_year_csv_writer(all_data,  directory,  file_root, this_year = 1977):
    """
    Writes a csv file for data from a given year in the specified directory using the specified root name

    this_year --- year for which data is extracted --- default: 1977
    all_data --- DataFrame with multi-year data
    directory --- directory in which the data is to be saved, include the /
    file_root --- prefix for the filename [prefix_year.csv], include any _ wanted
    """

    # Select data for the year
    surveys_year = all_data[all_data.year == this_year]

    # Write the new DataFrame to a csv file
    filename = directory +  file_root + str(this_year) + '.csv'
    surveys_year.to_csv(filename)
    return filename

def yearly_data_csv_writer(all_data,  directory,  file_root, start_year = 1977, end_year = 1979):
    """
    Writes separate csv files for each year of data.

    start_year --- the first year of data we want --- default: 1977
    end_year --- the last year of data we want --- default: 1979
    all_data --- DataFrame with multi-year data
    directory --- directory in which the data is to be saved, include the /
    file_root --- prefix for the filename [prefix_year.csv], include any _ wanted
    """
    fname = []
    # "end_year" is the last year of data we want to pull, so we loop to end_year+1
    for year in range(start_year, end_year+1):
        fname.append(one_year_csv_writer(all_data, directory,  file_root, year))
    return fname

In [None]:
yearly_data_csv_writer(surveys_df, 'yearly_files/','test_')

--- 

But what if our dataset doesn't start in 1977 and end in 2002? We can modify the
function so that it looks for the start and end years in the dataset if those
dates are not provided:

```python
def yearly_data_arg_test(all_data, start_year = None, end_year = None):
    """
    Modified from yearly_data_csv_writer to test default argument values!

    start_year --- the first year of data we want --- default: None - check all_data
    end_year --- the last year of data we want --- default: None - check all_data
    all_data --- DataFrame with multi-year data
    """

    if not start_year:
        start_year = min(all_data.year)
    if not end_year:
        end_year = max(all_data.year)

    return start_year, end_year
```

In [None]:
# define function
def yearly_data_arg_test(all_data, start_year = None, end_year = None):
    """
    Modified from yearly_data_csv_writer to test default argument values!

    start_year --- the first year of data we want --- default: None - check all_data
    end_year --- the last year of data we want --- default: None - check all_data
    all_data --- DataFrame with multi-year data
    """

    if not start_year:
        start_year = min(all_data.year)
    if not end_year:
        end_year = max(all_data.year)
    if start_year == 1980:
        print("don't do that")

    return start_year, end_year

In [None]:
yearly_data_arg_test(surveys_df, 1980, 1990)

In [None]:
# test function
yearly_data_arg_test(surveys_df)

The default values of the `start_year` and `end_year` arguments in the function
`yearly_data_arg_test` are now `None`. This is a built-in constant in Python
that indicates the absence of a value: the variable exists in the namespace of
the function (the directory of variable names), but it has not been given a
meaningful value yet.
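
A short sketch of how `None` behaves as a default, using a made-up function `describe_year`: `None` is a single built-in object, and a common way to test for it is with `is None`.

```python
def describe_year(year=None):
    # "is None" asks specifically whether no value was passed
    if year is None:
        return 'no year given'
    return 'year ' + str(year)

print(describe_year())      # no year given
print(describe_year(1980))  # year 1980
print(type(None))           # <class 'NoneType'>
```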

The body of the test function now has two conditional statements (**if statements**) that
check the values of `start_year` and `end_year`. An if statement executes its
body only when some condition is met.

`if` statements work like the boolean logic we saw earlier when we created masks to select our data.
They commonly look something like this:

```python
a = 5

if a<0: # meets first condition?

    # if a IS less than zero
    print('a is a negative number')

elif a>0: # did not meet first condition. meets second condition?

    # if a ISN'T less than zero and IS more than zero
    print('a is a positive number')

else: # met neither condition

    # if a ISN'T less than zero and ISN'T more than zero
    print('a must be zero!')
```

In [None]:
a = 0.0

if a<0: # meets first condition?

    # if a IS less than zero
    print('a is a negative number')

elif a>0: # did not meet first condition. meets second condition?

    # if a ISN'T less than zero and IS more than zero
    print('a is a positive number')

else: # met neither condition

    # if a ISN'T less than zero and ISN'T more than zero
    print('a must be zero!')

Change the value of `a` to see how this code works. The statement `elif`
means "else if", and all of the conditional statements must end in a colon.

The if statements in the function `yearly_data_arg_test` check whether a value
was passed for `start_year` and `end_year`. If those variables are `None`, the
condition `not start_year` (or `not end_year`) evaluates to `True` and the body
of the if statement is executed. On the other hand, if the variables are
associated with some value (they got a number in the function call), the
condition evaluates to `False` and the body is skipped. The opposite conditional
statements, which would evaluate to `True` if the variables had received a value
in the function call, would be `if start_year` and `if end_year`.
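
One caveat, offered as a side note rather than a correction to the lesson: `not x` is true for `None`, but also for other "empty" values such as `0`. For years this rarely matters, but the sketch below shows the difference between `not x` and the stricter `x is None`.

```python
values = [None, 1980, 0]

# For each value, record (not value, value is None)
results = [(not value, value is None) for value in values]

for value, (is_falsy, is_none) in zip(values, results):
    # not value: True for None and for 0
    # value is None: True only for None
    print(value, is_falsy, is_none)
```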

### Challenge:

1. Rewrite the `one_year_csv_writer` and `yearly_data_csv_writer` functions to use `None` as default for the years.

2. Modify the functions so that they don't create yearly files if there is no
data for a given year and display an alert to the user (Hint: use conditional
statements to do this.)

3. The code below checks to see whether a directory exists and creates one if it
doesn't. Add some code to your function that writes out the CSV files, to check
for a directory to write to.

```python
if 'dir_name_here' in os.listdir('.'):
    print('Processed directory exists')
else:
    os.mkdir('dir_name_here')
    print('Processed directory created')
```

In [None]:
# skipping years without data
# checking for the output directory before writing

def one_year_csv_writer(all_data,  directory,  file_root, this_year = 1977):
    """
    Writes a csv file for data from a given year in the specified directory using the specified root name.
    Returns the file name, or None if there is no data for that year.

    this_year --- year for which data is extracted --- default: 1977
    all_data --- DataFrame with multi-year data
    directory --- directory in which the data is to be saved, include the /
    file_root --- prefix for the filename [prefix_year.csv], include any _ wanted
    """

    # Select data for the year
    surveys_year = all_data[all_data.year == this_year]

    # Alert the user and skip the file if there is no data for this year
    if len(surveys_year) == 0:
        print("No data for year " + str(this_year))
        return None

    # Write the new DataFrame to a csv file
    filename = directory + file_root + str(this_year) + '.csv'
    surveys_year.to_csv(filename)
    return filename

def yearly_data_csv_writer(all_data,  directory,  file_root, start_year = 1977, end_year = 1979):
    """
    Writes separate csv files for each year of data.

    start_year --- the first year of data we want --- default: 1977
    end_year --- the last year of data we want --- default: 1979
    all_data --- DataFrame with multi-year data
    directory --- directory in which the data is to be saved, include the /
    file_root --- prefix for the filename [prefix_year.csv], include any _ wanted
    """
    # os.path.isdir also works when the directory name ends in a /
    if os.path.isdir(directory):
        print('Processed directory exists')
    else:
        os.mkdir(directory)
        print('Processed directory created')

    fname = []
    # "end_year" is the last year of data we want to pull, so we loop to end_year+1
    for year in range(start_year, end_year+1):
        filename = one_year_csv_writer(all_data, directory, file_root, year)
        # Only keep the names of files that were actually written
        if filename is not None:
            fname.append(filename)
    return fname

In [None]:
one_year_csv_writer(all_data=surveys_df,directory='yearly_files/',file_root='final_test_', this_year=1976)

In [None]:
yearly_data_csv_writer(all_data=surveys_df,directory='test_creation/',file_root='final_test_', start_year=1977, end_year=1978)
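
The cells above leave the first challenge item (using `None` as the default for the years) open. Here is one possible sketch, not the only solution. The function name `yearly_files_with_defaults`, the demo DataFrame and the `demo_` prefix are all invented here, and the output goes to a temporary directory.

```python
import os
import tempfile

import pandas as pd


def yearly_files_with_defaults(all_data, directory, file_root,
                               start_year=None, end_year=None):
    """Write one csv per year that has data; return the list of files."""
    # Fall back to the range found in the data when no year is given
    if start_year is None:
        start_year = int(all_data.year.min())
    if end_year is None:
        end_year = int(all_data.year.max())

    # Create the output directory if it does not exist yet
    os.makedirs(directory, exist_ok=True)

    written = []
    for year in range(start_year, end_year + 1):
        surveys_year = all_data[all_data.year == year]
        if len(surveys_year) == 0:
            print('No data for year ' + str(year))
            continue  # skip years without data
        filename = os.path.join(directory, file_root + str(year) + '.csv')
        surveys_year.to_csv(filename)
        written.append(filename)
    return written


# Made-up data with a gap in 1978
demo_df = pd.DataFrame({'year': [1977, 1977, 1979], 'weight': [40, 48, 29]})

with tempfile.TemporaryDirectory() as tmp:
    files = yearly_files_with_defaults(demo_df, tmp, 'demo_')
    names = [os.path.basename(f) for f in files]

print(names)
```

With the real data, the call would look like `yearly_files_with_defaults(surveys_df, 'yearly_files/', 'defaults_')`, where the `defaults_` prefix is again only a placeholder.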