# Data Analysis and Visualization in Python
## Data workflows and automation
Questions
* Can I automate operations in Python?
* What are functions and why should I use them?

Objectives
* Employ `for` loops to automate data analysis.
* Write unique filenames in Python.
* Build reusable code in Python.
* Write functions using conditional statements (`if`, `elif`, `else`).

## Loading our Data

In [None]:
import pandas as pd

# Load the data
surveys_df = pd.read_csv("../data/surveys.csv")
species_df = pd.read_csv("../data/species.csv")

## Automating data processing using `for` loops

In [None]:
import os

In [None]:
folder_years = "yearly_files"
os.makedirs(folder_years, exist_ok=True)  # No error if the folder already exists

In [None]:
os.listdir('.')

In [None]:
for year in surveys_df['year'].unique():
    # Create a unique filename for each year
    filename = os.path.join(folder_years, "surveys_" + str(year) + ".csv")
    print(filename)

    # Select data for the year
    surveys_year = surveys_df[surveys_df['year'] == year]
    surveys_year.to_csv(filename, index=False)

os.listdir(folder_years)
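
The loop above filters the full table once per year. As an alternative, here is a minimal sketch of the same split done with pandas `groupby`, which hands back each year's rows directly. The table `toy_df` and its values are made up for illustration, and the files go into a temporary folder rather than `yearly_files`.

```python
import os
import tempfile

import pandas as pd

# Tiny made-up table standing in for surveys_df (values are illustrative only)
toy_df = pd.DataFrame({
    "record_id": [1, 2, 3, 4],
    "year": [1977, 1977, 1978, 1979],
    "weight": [40.0, 48.0, 29.0, 46.0],
})

# Write into a temporary folder so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_folder:
    # groupby yields one (year, sub-DataFrame) pair per distinct year
    for year, year_df in toy_df.groupby("year"):
        filename = os.path.join(tmp_folder, "surveys_" + str(year) + ".csv")
        year_df.to_csv(filename, index=False)

    # Collect the names before the temporary folder is removed
    written_files = sorted(os.listdir(tmp_folder))

print(written_files)
```

Both approaches should produce one file per year; `groupby` simply avoids repeating the boolean filter inside the loop.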

### Exercises - Creating multiple CSV files
Instead of splitting the data by year, a colleague wants to analyse each species separately. How would you write a unique CSV file for each species?

In [None]:
folder_species = "species_files"
os.makedirs(folder_species, exist_ok=True)  # Create the directory if it does not exist

In [None]:
merged_left = pd.merge(left=surveys_df, right=species_df, how='left', on="species_id")

# Skip missing species names so that no empty "surveys_nan.csv" file is written
for species in merged_left['species'].dropna().unique():
    # Create a unique filename for each species
    filename = os.path.join(folder_species, "surveys_" + str(species) + ".csv")
    print(filename)

    # Select data for the current species; dropna() removes rows with any missing value
    merged_left_species = merged_left[merged_left['species'] == species].dropna()
    merged_left_species.to_csv(filename, index=False)

os.listdir(folder_species)
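
One detail of a left merge is worth checking before looping over species: survey rows whose `species_id` has no match in the species table receive a missing `species` value. The sketch below shows this on two made-up tables (`toy_surveys` and `toy_species` are hypothetical stand-ins, not the real data).

```python
import pandas as pd

# Made-up survey records; the third one has no species_id
toy_surveys = pd.DataFrame({
    "record_id": [1, 2, 3],
    "species_id": ["NL", "DM", None],
})

# Made-up species lookup table
toy_species = pd.DataFrame({
    "species_id": ["NL", "DM"],
    "species": ["albigula", "merriami"],
})

# A left merge keeps every survey row, matched or not
toy_merged = pd.merge(left=toy_surveys, right=toy_species, how="left", on="species_id")

# unique() reports the missing value as its own entry; dropna() removes it first
all_labels = toy_merged["species"].unique()
named_labels = toy_merged["species"].dropna().unique()

print(len(all_labels), len(named_labels))
```

Here `all_labels` has one more entry than `named_labels`, which is the missing value carried over from the unmatched row.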

## Building reusable and modular code with functions
The two functions below wrap the yearly loop in reusable code, with these refinements:
* Automatically create the `folder_to_save` if it does not exist.
* Use `None` as the default for `start_year` and `end_year`.
* Make the second function return a list of the generated files.

In [None]:
def one_year_csv_writer(all_data, folder_to_save, prefix, this_year):
    """
    Writes a csv file for data from a given year. Returns the filename.

    all_data --- DataFrame with multi-year data
    folder_to_save --- folder to save the data files
    prefix --- prefix for the CSV file name
    this_year --- year for which data is extracted
    """

    # Create a unique filename for each year
    filename = os.path.join(folder_to_save, prefix + str(this_year) + ".csv")

    # Select data for the year
    data_for_year = all_data[all_data['year'] == this_year]
    data_for_year.to_csv(filename, index=False)

    return filename

In [None]:
def yearly_data_csv_writer(all_data, folder_to_save, prefix,
                           start_year = None, end_year = None):
    """
    Writes a separate CSV file for each year of data. Returns the list of filenames.

    all_data --- DataFrame with multi-year data
    folder_to_save --- folder to save the data files
    prefix --- prefix for the CSV file name
    start_year --- the first year of data we want --- default: None - check all_data
    end_year --- the last year of data we want --- default: None - check all_data
    """

    # os.path.isdir also works when folder_to_save is a nested path
    if os.path.isdir(folder_to_save):
        print('Processed directory exists')
    else:
        os.makedirs(folder_to_save)
        print('Processed directory created')

    # Compare with None explicitly so a falsy value is not mistaken for "not given"
    if start_year is None:
        start_year = all_data['year'].min()

    if end_year is None:
        end_year = all_data['year'].max()

    filenames = []

    # "end_year" is the last year of data we want to pull, so we loop to end_year+1
    for year in range(start_year, end_year + 1):
        filenames.append(one_year_csv_writer(all_data, folder_to_save, prefix, year))

    return filenames
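
A default of `None` acts as a sentinel meaning "not supplied". The sketch below uses a hypothetical helper, `resolve_start_year`, with made-up years to show how such a sentinel can be tested with `is None`, so that a falsy but valid value such as `0` is not mistaken for a missing argument.

```python
def resolve_start_year(start_year=None, available_years=(1977, 1978, 1979)):
    """Hypothetical helper: fall back to the earliest available year."""
    # "is None" is true only when the caller did not supply a value
    if start_year is None:
        return min(available_years)
    return start_year


print(resolve_start_year())      # falls back to the earliest year: 1977
print(resolve_start_year(1978))  # the supplied value is kept: 1978
print(resolve_start_year(0))     # 0 is falsy but is still respected: 0
```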

In [None]:
yearly_data_csv_writer(surveys_df, 'final', 'results_', 1995, 1998)

In [None]:
os.listdir("final")

In [None]:
yearly_data_csv_writer(surveys_df, 'final', 'results_')
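
When no year bounds are given, the function loops over every integer from the earliest to the latest year. If a year in that range has no records, the filter selects zero rows and the resulting file would hold only a header. The sketch below counts rows per year on a made-up table (`toy_df`, with a deliberate gap at 1996) to show one way to spot such years.

```python
import pandas as pd

# Made-up table with no records for 1996
toy_df = pd.DataFrame({
    "year": [1995, 1995, 1997],
    "weight": [40.0, 48.0, 29.0],
})

# Convert to plain Python ints for use in range()
start_year = int(toy_df["year"].min())
end_year = int(toy_df["year"].max())

# Count how many rows each year in the range would contribute
rows_per_year = {}
for year in range(start_year, end_year + 1):
    rows_per_year[year] = len(toy_df[toy_df["year"] == year])

# Keep only the years that actually have data
years_with_data = [year for year, n_rows in rows_per_year.items() if n_rows > 0]

print(rows_per_year)
print(years_with_data)
```

A check like this could be used to skip empty years before writing, if header-only files are not wanted.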