# Data Analysis and Visualization in Python
## Data workflows and automation
Questions
* Can I automate operations in Python?
* What are functions and why should I use them?

Objectives
* Describe why `for` loops are used in Python.
* Employ `for` loops to automate data analysis.
* Write unique filenames in Python.
* Build reusable code in Python.
* Write functions using conditional statements (`if`, `elif`, `else`).

## Loading our Data

In [None]:
import pandas as pd

# Load the data into a DataFrame
surveys_df = pd.read_csv('../data/surveys.csv')
species_df = pd.read_csv("../data/species.csv")

## Automating data processing using For Loops

In [None]:
import os

In [None]:
os.mkdir('yearly_files')

In [None]:
os.listdir('.')

In [None]:
# Select only data for 2002
surveys2002 = surveys_df[surveys_df['year'] ### ###]

# Write the new DataFrame to a csv file
surveys2002.to_csv('yearly_files/surveys2002.csv')
os.listdir('yearly_files')

In [None]:
surveys_df['year']###

In [None]:
for year in surveys_df['year'].unique():
    filename='yearly_files/surveys' + ### + '.csv'
    print(filename)
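
The loop above builds each filename by joining pieces of text with `+`. A minimal sketch, using a made-up list of years in place of `surveys_df['year'].unique()`, shows why the year has to be converted to text first: joining a string and a number with `+` raises a `TypeError`.

```python
# Hypothetical years standing in for surveys_df['year'].unique()
years = [1977, 1978, 1979]

filenames = []
for year in years:
    # str() turns the number into text so it can be joined with '+'
    filename = 'yearly_files/surveys' + str(year) + '.csv'
    filenames.append(filename)
    print(filename)

# An f-string builds the same name without an explicit str() call
same_name = f'yearly_files/surveys{years[0]}.csv'
```

Either style works; the f-string form is often easier to read when a name has several variable parts.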

In [None]:
for year in surveys_df['year'].unique():
    # Select data for the year
    surveys_### = surveys_df[surveys_df['year'] == ###]
    
    # Write the new DataFrame to a csv file
    filename='yearly_files/surveys' + str(year) + '.csv'
    surveys_year.to_csv(filename)

os.listdir('yearly_files')

### Exercises
`1`. Some of the surveys you saved have missing data: null values that show up as NaN (Not a Number) in the DataFrames and as empty fields in the csv files. Modify the `for` loop so that rows with null values are not included in the yearly files. Hint: `dropna()`

In [None]:
for year in surveys_df['year'].unique():
    # Select data for the year
    surveys_year = surveys_df[surveys_df['year'] == year].###
    
    # Write the new DataFrame to a csv file
    filename='yearly_files/surveys' + str(year) + '.csv'
    surveys_year.to_csv(filename)
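
To see what `dropna()` does before applying it to the survey data, here is a small sketch on a made-up DataFrame (`toy_df` and its values are invented for illustration). With no arguments, `dropna()` removes every row that has a NaN in any column.

```python
import numpy as np
import pandas as pd

# Made-up records; the real surveys data has more columns
toy_df = pd.DataFrame({
    'year': [1977, 1977, 1978],
    'weight': [40.0, np.nan, 48.0],
})

# dropna() returns a new DataFrame without the rows that contain any NaN
complete_rows = toy_df.dropna()

# The original DataFrame is left unchanged
print(len(toy_df), len(complete_rows))
```

Because a row is dropped when any single value is missing, this can discard more data than expected. If only some columns matter, `dropna(subset=[...])` restricts the check to those columns.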

`2`. Suppose you only want data for years at a regular interval. How would you modify your loop to generate a data file for only every 5th year, starting from 1977? Hint: `range(start, stop, step)`

In [None]:
for year in ###(1977, surveys_df['year'].### + 1, ###):
    # Select data for the year
    surveys_year = surveys_df[surveys_df['year'] == year].dropna()
    
    # Write the new DataFrame to a csv file
    filename='yearly_files/surveys' + str(year) + '.csv'
    surveys_year.to_csv(filename)
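
The three-argument form of `range` can be checked on its own before it goes into the loop. This sketch assumes a hypothetical last year of 2002; in the exercise that value would come from the data.

```python
# Hypothetical last year; in the exercise it would come from the data
last_year = 2002

# The stop value is exclusive, so add 1 to include last_year itself
every_fifth_year = list(range(1977, last_year + 1, 5))
print(every_fifth_year)
```

Note that `range` produces a year whether or not the data contains it. If a year has no rows, the filter returns an empty DataFrame and the csv file will hold only the header line.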

`3`. Instead of splitting out the data by years, a colleague wants to analyse each species separately. How would you write a unique csv file for each species?

In [None]:
os.mkdir('species_files')

In [None]:
merged_left = pd.merge(left=surveys_df, right=species_df, how='###', on="species###")

for species in merged_left['###'].unique():
    # Select data for the species
    merged_left_species = merged_left[merged_left### == ###].dropna()
    
    # Write the new DataFrame to a csv file
    filename='species_files/surveys_' + ### + '.csv'
    merged_left_species.to_csv(filename)

os.listdir('species_files')
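
As an alternative to looping over `unique()` values and filtering each time, pandas can split a DataFrame into groups directly with `groupby`. The sketch below is not the exercise solution: it uses a made-up DataFrame, an assumed column name `species_id`, and a temporary folder so that no real files are touched.

```python
import os
import tempfile

import pandas as pd

# Made-up records; 'species_id' is an assumed column name
toy_df = pd.DataFrame({
    'species_id': ['DM', 'DM', 'NL'],
    'weight': [40.0, 48.0, 120.0],
})

# A temporary folder keeps the sketch from touching real data
out_dir = tempfile.mkdtemp()

# groupby yields one (key, sub-DataFrame) pair per distinct value
for species, species_rows in toy_df.groupby('species_id'):
    filename = os.path.join(out_dir, 'surveys_' + str(species) + '.csv')
    species_rows.to_csv(filename, index=False)

written = sorted(os.listdir(out_dir))
print(written)
```

By default `groupby` skips rows whose key is NaN, so records without a species are left out of every file.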

## Building reusable and modular code with functions

In [None]:
### one_year_csv_writer(this_year, all_data):
    """
    Writes a csv file for data from a given year.

    this_year --- year for which data is extracted
    all_data --- DataFrame with multi-year data
    """

    # Select data for the year
    surveys_year = ###_data[###_data['year'] == ###_year]

    # Write the new DataFrame to a csv file
    filename = 'yearly_files/function_surveys' + ###(###_year) + '.csv'
    surveys_year.to_csv(filename)

In [None]:
one_year_csv_writer###

In [None]:
one_year_csv_writer(2002, surveys_df)
os.listdir('yearly_files')
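
The triple-quoted string at the top of a function is its docstring, and Python stores it so that it can be displayed later. A minimal sketch with a hypothetical helper, `label_for_year`, which is not part of the lesson:

```python
def label_for_year(this_year):
    """
    Builds a short label for a given year.

    this_year --- year to include in the label
    """
    return 'surveys' + str(this_year)


# The docstring is stored on the function object
print(label_for_year.__doc__)

# help() prints the signature together with the docstring
help(label_for_year)
```

Writing the docstring when you write the function means the description of each argument travels with the code wherever it is reused.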

In [None]:
def yearly_data_csv_writer(start_year, end_year, all_data):
    """
    Writes separate csv files for each year of data.

    start_year --- the first year of data we want
    end_year --- the last year of data we want
    all_data --- DataFrame with multi-year data
    """

    # "end_year" is the last year of data we want to pull, so we loop to end_year+1
    for year in ###(start_year, end_year ###):
        one_year_csv_writer(year, all_data)

In [None]:
yearly_data_csv_writer(1977, 2002, surveys_df)
os.listdir('yearly_files')

### Exercise - More functions
1. Add two arguments to the functions we wrote that take the path of the directory where the files will be written (`folder_to_save`) and the root of the file name (`root_name`). Create a new set of files with a different name.
1. How could you use the function `yearly_data_csv_writer` to create a CSV file for only one year?
1. Make the functions return a list of the files they have written.

In [None]:
def one_year_csv_writer(this_year, all_data):
    """
    Writes a csv file for data from a given year. Returns the filename.

    this_year --- year for which data is extracted
    all_data --- DataFrame with multi-year data
    ### --- folder to save the data files
    ### --- root of the filenames to save the data
    """

    # Select data for the year
    surveys_year = all_data[all_data['year'] == this_year]

    # Write the new DataFrame to a csv file
    filename = os.path.join(###, ''.join([###, str(this_year), '.csv']))
    surveys_year.to_csv(filename)
    ###

In [None]:
def yearly_data_csv_writer(start_year, end_year, all_data):
    """
    Modified from yearly_data_csv_writer to collect and return filenames

    start_year --- the first year of data we want
    end_year --- the last year of data we want
    all_data --- DataFrame with multi-year data
    ### --- folder to save the data files
    ### --- root of the filenames to save the data
    """
    
    # filenames = ###

    # "end_year" is the last year of data we want to pull, so we loop to end_year+1
    for year in range(start_year, end_year + 1):
        one_year_csv_writer(year, all_data ###)
        
    ###

In [None]:
yearly_data_csv_writer(###, ###, surveys_df, 'yearly_files', 'rootname')
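
The pattern of one function returning a value and another collecting those values in a list can be tried without any data. The sketch below only builds file names and writes nothing; the function names `one_file_name` and `yearly_file_names` are hypothetical stand-ins, not the lesson's functions.

```python
import os


def one_file_name(this_year, folder_to_save, root_name):
    """Returns the path of the file for a single year."""
    return os.path.join(folder_to_save, ''.join([root_name, str(this_year), '.csv']))


def yearly_file_names(start_year, end_year, folder_to_save, root_name):
    """Returns a list with one path per year, end_year included."""
    filenames = []
    for year in range(start_year, end_year + 1):
        # Keep each returned value instead of discarding it
        filenames.append(one_file_name(year, folder_to_save, root_name))
    return filenames


names = yearly_file_names(1977, 1979, 'yearly_files', 'rootname')
print(names)
```

Calling the second function with the same start and end year gives a list with a single entry, which is one way to handle a single year.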

### Default values for Arguments

In [None]:
def yearly_data_arg_test(all_data, start_year ###, end_year ###):
    """
    Modified from yearly_data_csv_writer to test default argument values!

    all_data --- DataFrame with multi-year data
    start_year --- the first year of data we want --- default: ###
    end_year --- the last year of data we want --- default: ###
    """

    return start_year### end_year

In [None]:
start###end = yearly_data_arg_test(surveys_df, 1988, 1993)
print('Both optional arguments:\t', start, end)

start###end = yearly_data_arg_test(surveys_df)
print('Default values:\t\t\t', start, end)

In [None]:
def yearly_data_arg_test(all_data, start_year = ###, end_year = ###):
    """
    Modified from yearly_data_csv_writer to test default argument values!

    all_data --- DataFrame with multi-year data
    start_year --- the first year of data we want --- default: ### - check all_data
    end_year --- the last year of data we want --- default: ### - check all_data
    """

    if ### start_year:
        start_year = ###(all_data['year'])
    if not end_year:
        end_year = ###(all_data['year'])

    return start_year, end_year

In [None]:
start, end = yearly_data_arg_test(surveys_df)
print('Default values:\t\t\t', start, end)

start, end = yearly_data_arg_test(surveys_df, 1988, 1993)
print('No keywords:\t\t\t', start, end)

start, end = yearly_data_arg_test(surveys_df, start_year=1988, end_year=1993)
print('Both keywords, in order:\t', start, end)

start, end = yearly_data_arg_test(surveys_df, end_year=1993, start_year=1988)
print('Both keywords, flipped:\t\t', start, end)

start, end = yearly_data_arg_test(surveys_df, start_year=1988)
print('One keyword, default end:\t', start, end)

start, end = yearly_data_arg_test(surveys_df, end_year=1993)
print('One keyword, default start:\t', start, end)
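
One detail of the `if not end_year:` test is worth knowing: `not` is true for `None` but also for `0`, so a caller who really passes `0` would have it replaced by the default. Years are never zero in this dataset, so the lesson's version is fine here. A stricter variant, sketched with a hypothetical function `year_bounds` that takes a plain list of years, compares against `None` explicitly.

```python
def year_bounds(years, start_year=None, end_year=None):
    """
    Returns (start_year, end_year), filling in missing values from the data.

    years --- list of years present in the data
    start_year --- first year wanted --- default: None - use min(years)
    end_year --- last year wanted --- default: None - use max(years)
    """
    # 'is None' only matches a missing argument, never a real value like 0
    if start_year is None:
        start_year = min(years)
    if end_year is None:
        end_year = max(years)
    return start_year, end_year


print(year_bounds([1977, 1990, 2002]))
print(year_bounds([1977, 1990, 2002], start_year=1988))
```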

## If Statements

### Exercise (done with the group) - Complete example
`1`. Check whether `folder_to_save` already exists, and create it only if it does not. Hint:

```python
if 'dir_name_here' in os.listdir('.'):
    print('Processed directory exists')
else:
    os.mkdir('dir_name_here')
    print('Processed directory created')
```
`2`. Use `None` as the default value for `start_year` and `end_year`.
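
For the directory check in step 1, the `in os.listdir('.')` test only sees names directly inside the current working directory, and it matches files as well as folders. A sketch of an alternative using `os.path.isdir` and `os.makedirs`, run inside a temporary folder so nothing real is modified (the folder name `processed` is made up):

```python
import os
import tempfile

# Work inside a temporary folder so nothing real is modified
base_dir = tempfile.mkdtemp()
folder_to_save = os.path.join(base_dir, 'processed')

# os.path.isdir checks the path itself, wherever it is located
existed_before = os.path.isdir(folder_to_save)

# exist_ok=True makes the call safe to repeat
os.makedirs(folder_to_save, exist_ok=True)
os.makedirs(folder_to_save, exist_ok=True)

print(existed_before, os.path.isdir(folder_to_save))
```

The `if`/`else` version in the hint is still useful when you want to print a different message for each case.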

In [None]:
def one_year_csv_writer(all_data, folder_to_save, root_name, ###):
    """
    Writes a csv file for data from a given year. Returns the filename.

    all_data --- DataFrame with multi-year data
    folder_to_save --- folder to save the data files
    root_name --- root of the filenames to save the data
    ### --- year for which data is extracted
    """
    
    # Select data for the year
    surveys_year = all_data[all_data['year'] == this_year]

    # Write the new DataFrame to a csv file
    filename = os.path.join(folder_to_save, ''.###([root_name, str(this_year), '.csv']))
    surveys_year.to_csv(filename)
    return filename

In [None]:
def yearly_data_csv_writer(all_data, folder_to_save, root_name, start_year = ###, end_year = None):
    """
    Writes separate csv files for each year of data. Returns the list of filenames.

    all_data --- DataFrame with multi-year data
    folder_to_save --- folder to save the data files
    root_name --- root of the filenames to save the data
    ### --- the first year of data we want --- default: ### - check all_data
    ### --- the last year of data we want --- default: ### - check all_data
    """
    
    if folder_to_save ### os.listdir('.'):
        print('Processed directory exists')
    else:
        os.mkdir(folder_to_save)
        print('Processed directory created')

    if not start_year:
        start_year = min(all_data['year'])
    if not end_year:
        end_year = max(all_data['year'])
        
    filenames = []

    # "end_year" is the last year of data we want to pull, so we loop to end_year+1
    for year in range(start_year, end_year + 1):
        filenames.###(one_year_csv_writer(all_data, folder_to_save, root_name, year))
        
    return filenames

In [None]:
yearly_data_csv_writer(surveys_df, 'final', 'result', 1995, 1998)