# Data Analysis and Visualization in Python
## Data workflows and automation
Questions
* How can I reuse the same code for multiple sets of data?

Objectives
* Describe why for loops are used in Python.
* Employ for loops to automate data analysis.
* Write unique filenames in Python.
* Build reusable code in Python.
* Write functions using conditional statements (if, elif, else).

### How to Use Jupyter
When a cell is in edit mode:

  Shortcut  | Description
----------- | -----------
Shift+Enter | Run the cell, and go to the next
Tab         | Indent code or auto-completion
Esc         | Go to command mode

When a cell is in command mode:

  Shortcut   | Description
------------ | -----------
Shift+Enter  | Run the cell, and go to the next
Double-click | Go to edit mode
Enter        | Go to edit mode

  Shortcut   | Description
------------ | -----------
A            | Insert a cell above
B            | Insert a cell below
C            | Copy the current cell
V            | Paste the cell below
D D          | Delete the current cell

To reset all cells:
* Go to the top menu, and select Kernel -> Restart & Clear Output

## Making Sure Our Data Are Loaded

In [None]:
import pandas as pd

# Load the data into a DataFrame
surveys_df = pd.read_csv('../data/surveys.csv')
surveys_df = surveys_df.rename(columns={'species': 'species_id'})

species_df = pd.read_csv("../data/species.csv")

## For loops

In [None]:
animals = ['lion', 'tiger', 'crocodile', 'vulture', 'hippo']
print(animals)

In [None]:
### creature ### animals###
    print(###)

In [None]:
for creature in animals:
    ###
print('The loop variable is now: ' + creature)
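
As a minimal sketch of the general pattern (the names `animals_demo`, `creature_demo`, and `visited` are hypothetical, chosen so they do not overwrite the lesson's variables): a `for` loop binds the loop variable to each item in turn, runs the indented body once per item, and leaves the variable holding the last item once the loop ends.

```python
# Hypothetical stand-in list, named differently from the lesson's `animals`
animals_demo = ['lion', 'tiger', 'crocodile']

visited = []
for creature_demo in animals_demo:
    # The indented body runs once per item
    visited.append(creature_demo)
    print(creature_demo)

# This line is not indented, so it runs once, after the loop has finished.
# The loop variable still holds the last item it was bound to.
print('The loop variable is now: ' + creature_demo)
```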

### Exercise - Loops
Rewrite the loop so that the animals are separated by commas, not new lines (Hint: You can concatenate strings using a plus sign. For example, `print(string1 + string2)` outputs ‘string1string2’).

In [None]:
creatures = animals###
for creature in animals[###]:
    creatures = ###
print(creatures)
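
One possible approach to this kind of exercise, sketched on a hypothetical list `words` rather than the lesson's `animals`: start from the first item, then append a separator plus each remaining item. The built-in `str.join` method gives the same result in one step.

```python
# Hypothetical list used only for this illustration
words = ['lion', 'tiger', 'crocodile']

# Start with the first item, then add ', ' + item for each remaining one
combined = words[0]
for word in words[1:]:
    combined = combined + ', ' + word
print(combined)

# str.join builds the same string without an explicit loop
joined = ', '.join(words)
print(joined)
```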

## Automating data processing using For Loops

In [None]:
import os

In [None]:
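# exist_ok=True avoids an error if the folder already exists (e.g. when re-running the cell)
os.makedirs('yearly_files', exist_ok=True)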

In [None]:
os.listdir('.')

In [None]:
# Select only data for 2002
surveys2002 = surveys_df[surveys_df.year ### ###]

# Write the new DataFrame to a csv file
surveys2002.to_csv('yearly_files/surveys2002.csv')
os.listdir('yearly_files')

In [None]:
surveys_df['year']

In [None]:
surveys_df['year'].###

In [None]:
for year in surveys_df['year'].unique():
    filename='yearly_files/surveys' + ### + '.csv'
    print(filename)

In [None]:
for year in surveys_df['year'].unique():
    # Select data for the year
    surveys_### = surveys_df[surveys_df.year == ###]
    
    # Write the new DataFrame to a csv file
    filename='yearly_files/surveys' + str(year) + '.csv'
    surveys_year.to_csv(filename)

os.listdir('yearly_files')
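
The steps above (select one year, build a unique filename, write the file) can be sketched end to end on a toy DataFrame. Everything here is an assumption made for illustration: the data in `toy_df` is made up, the files go to a temporary folder instead of `yearly_files`, and `index=False` is an optional choice that leaves the row index out of the file.

```python
import os
import tempfile

import pandas as pd

# Toy data standing in for surveys_df; the values are made up
toy_df = pd.DataFrame({'year': [1977, 1977, 1978, 1979],
                       'weight': [40.0, 48.0, 37.0, 52.0]})

# Write into a temporary folder so no real files are touched
out_dir = tempfile.mkdtemp()

for year in toy_df['year'].unique():
    # The boolean mask keeps only the rows for this year
    one_year = toy_df[toy_df['year'] == year]
    # str() is needed because a number cannot be concatenated to a string
    filename = os.path.join(out_dir, 'surveys' + str(year) + '.csv')
    one_year.to_csv(filename, index=False)

written = sorted(os.listdir(out_dir))
print(written)
```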

### Exercises
`1`. Some of the surveys you saved are missing data (they have null values that show up as NaN - Not A Number - in the DataFrames and do not show up in the text files). Modify the for loop so that the entries with null values are not included in the yearly files. Hint: `dropna()`

In [None]:
for year in surveys_df['year'].unique():
    # Select data for the year
    surveys_year = surveys_df[surveys_df.year == year].###
    
    # Write the new DataFrame to a csv file
    filename='yearly_files/surveys' + str(year) + '.csv'
    surveys_year.to_csv(filename)
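
To see what `dropna()` does in isolation, here is a small sketch on made-up data (the DataFrame `toy` is hypothetical). With no arguments, `dropna()` returns a new DataFrame without any row that contains at least one missing value; the original is left unchanged.

```python
import pandas as pd

# Made-up data: the second row has a missing weight
toy = pd.DataFrame({'year': [1977, 1977, 1978],
                    'weight': [40.0, None, 37.0]})

# Rows with any NaN are dropped; `toy` itself is not modified
clean = toy.dropna()
print(len(toy), len(clean))
```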

`2`. Let’s say you only want to look at data at regular intervals of years. How would you modify your loop to generate a data file for only every 5th year, starting from 1977? Hint: `range(start, stop, step)`

In [None]:
for year in ###(1977, surveys_df['year'].###, ###):
    # Select data for the year
    surveys_year = surveys_df[surveys_df.year == year].dropna()
    
    # Write the new DataFrame to a csv file
    filename='yearly_files/surveys' + str(year) + '.csv'
    surveys_year.to_csv(filename)
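
A short sketch of how `range(start, stop, step)` behaves, using hypothetical bounds `first_year` and `last_year`. The `stop` value is exclusive, so 1 is added here to make the last year itself eligible.

```python
# Hypothetical bounds for illustration
first_year = 1977
last_year = 2002

# stop is exclusive, so add 1 if the last year itself should be eligible
every_fifth = list(range(first_year, last_year + 1, 5))
print(every_fifth)
```

Note that `range` produces years whether or not they occur in the data, so a loop built this way can select an empty DataFrame for a year with no records.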

`3`. Instead of splitting out the data by years, a colleague wants to analyze each species separately. How would you write a unique csv file for each species?

In [None]:
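# exist_ok=True avoids an error if the folder already exists (e.g. when re-running the cell)
os.makedirs('species_files', exist_ok=True)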

In [None]:
merged_left = pd.merge(left=surveys_df, right=species_df, how='###', on="species###")

for species in merged_left['###'].unique():
    # Select data for the species
    merged_left_species = merged_left[merged_left.### == ###].dropna()
    
    # Write the new DataFrame to a csv file
    filename='species_files/surveys_' + ### + '.csv'
    merged_left_species.to_csv(filename)

os.listdir('species_files')

## Building reusable and modular code with functions

In [None]:
### this_is_the_function_name(input_argument1, input_argument2)###

    # The body of the function is indented
    # This function prints the two arguments to screen
    print('The function arguments are:', input_argument1, input_argument2, '(this is done inside the function!)')

    # And returns their product
    ### input_argument1 ### input_argument2

In [None]:
product_of_inputs = this_is_the_function_name(2,5)

In [None]:
print('Their product is:', product_of_inputs, '(this is done outside the function!)')
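
The anatomy of a function can be sketched with a smaller, hypothetical example (`multiply` is not part of the lesson): a `def` line ending in a colon, an indented body, and a `return` statement that hands a value back to the caller.

```python
def multiply(first, second):
    """Return the product of two numbers (hypothetical example function)."""
    # This line runs only when the function is called
    print('Inside the function:', first, second)
    return first * second

# The returned value can be stored in a variable and used later
result = multiply(2, 5)
print('Outside the function:', result)
```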

In [None]:
### one_year_csv_writer(this_year, all_data):
    """
    Writes a csv file for data from a given year.

    this_year --- year for which data is extracted
    all_data --- DataFrame with multi-year data
    """

    # Select data for the year
    surveys_year = ###_data[###_data.year == ###_year]

    # Write the new DataFrame to a csv file
    filename = 'yearly_files/function_surveys' + ###(###_year) + '.csv'
    surveys_year.to_csv(filename)

In [None]:
one_year_csv_writer###

In [None]:
one_year_csv_writer(2002, surveys_df)
os.listdir('yearly_files')

In [None]:
def yearly_data_csv_writer(start_year, end_year, all_data):
    """
    Writes separate csv files for each year of data.

    start_year --- the first year of data we want
    end_year --- the last year of data we want
    all_data --- DataFrame with multi-year data
    """

    # "end_year" is the last year of data we want to pull, so we loop to end_year+1
    for year in ###(start_year, end_year ###):
        one_year_csv_writer(year, all_data)

In [None]:
yearly_data_csv_writer(1977, 2002, surveys_df)
os.listdir('yearly_files')

In [None]:
def yearly_data_arg_test(all_data, start_year ###, end_year ###):
    """
    Modified from yearly_data_csv_writer to test default argument values!

    start_year --- the first year of data we want --- default: 1977
    end_year --- the last year of data we want --- default: 2002
    all_data --- DataFrame with multi-year data
    """

    return start_year### end_year

In [None]:
start###end = yearly_data_arg_test (surveys_df, 1988, 1993)
print('Both optional arguments:\t', start, end)

start###end = yearly_data_arg_test (surveys_df)
print('Default values:\t\t\t', start, end)

In [None]:
a = 5

### a < 0# # meets first condition?

    # if a IS less than zero
    print('a is a negative number')

### a > 0## # did not meet first condition. meets second condition?

    # if a ISN'T less than zero and IS more than zero
    print('a is a positive number')

###### # met neither condition

    # if a ISN'T less than zero and ISN'T more than zero
    print('a must be zero!')
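
The same three-way decision can be sketched as a small function (`describe_sign` is a hypothetical helper). Python checks the conditions from top to bottom and runs only the first branch that matches; `else` catches everything that is left.

```python
def describe_sign(value):
    """Hypothetical helper: classify a number as negative, positive, or zero."""
    if value < 0:
        return 'negative'
    elif value > 0:
        # Reached only when the first condition was not met
        return 'positive'
    else:
        # Reached only when neither condition was met
        return 'zero'

print(describe_sign(5))
```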

In [None]:
def yearly_data_arg_test(all_data, start_year = ###, end_year = ###):
    """
    Modified from yearly_data_csv_writer to test default argument values!

    start_year --- the first year of data we want --- default: None - check all_data
    end_year --- the last year of data we want --- default: None - check all_data
    all_data --- DataFrame with multi-year data
    """

    if ### start_year:
        start_year = min(all_data.year)
    if not end_year:
        end_year = max(all_data.year)

    return start_year, end_year

In [None]:
start, end = yearly_data_arg_test(surveys_df)
print('Default values:\t\t\t', start, end)

start, end = yearly_data_arg_test(surveys_df, 1988, 1993)
print('No keywords:\t\t\t', start, end)

start, end = yearly_data_arg_test(surveys_df, start_year=1988, end_year=1993)
print('Both keywords, in order:\t', start, end)

start, end = yearly_data_arg_test(surveys_df, end_year=1993, start_year=1988)
print('Both keywords, flipped:\t\t', start, end)

start, end = yearly_data_arg_test(surveys_df, start_year=1988)
print('One keyword, default end:\t', start, end)

start, end = yearly_data_arg_test(surveys_df, end_year=1993)
print('One keyword, default start:\t', start, end)
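
A self-contained sketch of the same idea, without pandas. The helper `year_bounds` and the list `years_demo` are hypothetical. It uses `is None` as an explicit alternative to `not`: the two behave the same for real years, but `not` would also treat a value of 0 as missing.

```python
def year_bounds(years, start_year=None, end_year=None):
    """Hypothetical helper: fill in missing bounds from the data."""
    # `is None` only triggers when the argument was left out
    if start_year is None:
        start_year = min(years)
    if end_year is None:
        end_year = max(years)
    return start_year, end_year

years_demo = [1977, 1988, 1993, 2002]
print(year_bounds(years_demo))
print(year_bounds(years_demo, end_year=1993))
print(year_bounds(years_demo, end_year=1993, start_year=1988))
```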

### Complete example

In [None]:
def one_year_csv_writer(all_data, folder_to_save, root_name, this_year):
    """
    Writes a csv file for data from a given year.

    Parameters
    ----------
    all_data : pd.DataFrame
        DataFrame with multi-year data
    folder_to_save : str
        folder to save the data files
    root_name : str
        root of the filenames to save the data
    this_year : int
        year for which data is extracted
    """
    
    # Select data for the year
    surveys_year = all_data[all_data.year == this_year]

    # Write the new DataFrame to a csv file
    filename = os.path.join(folder_to_save, ''.###([root_name, str(this_year), '.csv']))
    surveys_year.to_csv(filename)
    return filename

In [None]:
def yearly_data_arg_test(all_data, folder_to_save, root_name, start_year = None, end_year = None):
    """
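    Writes separate csv files for each year of data and returns their filenames.

    all_data --- DataFrame with multi-year data
    folder_to_save --- folder to save the data files (created if it does not exist)
    root_name --- root of the filenames to save the data
    start_year --- the first year of data we want --- default: None - check all_data
    end_year --- the last year of data we want --- default: None - check all_data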
    """
    
    if folder_to_save ### os.listdir('.'):
        print('Processed directory exists')
    else:
        os.mkdir(folder_to_save)
        print('Processed directory created')

    if not start_year:
        start_year = min(all_data.year)
    if not end_year:
        end_year = max(all_data.year)
        
    filenames = []

    # "end_year" is the last year of data we want to pull, so we loop to end_year+1
    for year in range(start_year, end_year + 1):
        filenames.###(one_year_csv_writer(all_data, folder_to_save, root_name, year))
        
    return filenames

In [None]:
yearly_data_arg_test(surveys_df, 'final', 'result', 1995, 1998)
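
The two file-system steps in the complete example, making sure the folder exists and building the file path, can be sketched on their own. This is an alternative under stated assumptions, not the lesson's method: it uses `os.makedirs(..., exist_ok=True)` instead of checking `os.listdir('.')`, and it works in a temporary folder. The names `base` and `target` are hypothetical.

```python
import os
import tempfile

# Throwaway location for the demo
base = tempfile.mkdtemp()
target = os.path.join(base, 'final')

# exist_ok=True makes the call safe to repeat
os.makedirs(target, exist_ok=True)
os.makedirs(target, exist_ok=True)

# ''.join concatenates the name pieces; os.path.join adds the path separator
filename = os.path.join(target, ''.join(['result', str(1995), '.csv']))
print(os.path.basename(filename))
```

Checking `os.listdir('.')` only looks in the current working directory, so the `makedirs` variant may be more convenient when `folder_to_save` is a nested or absolute path.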