# Developing Functions

This exercise will lead you through taking some common data processing steps and wrapping them up into a reusable function.

The function should import a file, do a bit of processing and formatting, output the processed data elsewhere and return the path to the processed file.

In [None]:
# Install required packages if using jupyterhub
# %pip install -r ../requirements.txt

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
from pathlib import Path # handy for working with file paths, consistent across systems (windows, mac, unix)

Let's first check we're working with a similar version of pandas:

In [None]:
pd.__version__ 

We'll be using a small timeseries air quality dataset from Oxford Street in London.

This contains hourly-averaged readings of different gas and particulate 
species in the near-road environment.

In [None]:
data_filepath = Path("../data/OxfordStreetAirQuality.csv")

In [None]:
# Set up an output folder, this will be used later on to save some data after we have processed it
output_folder = Path("../data/processed")

In [None]:
# import a csv file
df = pd.read_csv(data_filepath)

Let's check what we've got. Use `df.info()` to see the column types and non-null counts, or `df.describe()` to get some summary statistics.

In [None]:
df.info()

We have some rows where we're missing some data:

In [None]:
df.count()

As missing values don't tend to play nicely with the machine learning steps we might apply later, let's drop the rows with *any* missing data:

In [None]:
df.dropna(how='any', axis=0, inplace=True)
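
To see what `how='any'` does, here is a minimal sketch on a toy table. The column names mirror the dataset, but the values are made up for illustration and are not taken from the Oxford Street file.

```python
import numpy as np
import pandas as pd

# Toy data: hypothetical values, not taken from the real file
toy = pd.DataFrame({
    "Species": ["NO2", "NOX", "O3"],
    "Value": [41.0, np.nan, 12.5],
})

# how='any' drops a row if at least one of its values is missing
cleaned = toy.dropna(how="any", axis=0)

print(len(toy), "rows before,", len(cleaned), "rows after")
```

Without `inplace=True`, `dropna` returns a new DataFrame and leaves `toy` unchanged.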

Our data is indexed by time, so let's make sure it has the right data type (in this case `datetime64`).

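Recent versions of pandas require an explicit unit such as `[ns]` (nanoseconds) when casting. For example:

```python
df[COLUMN_NAME] = df[COLUMN_NAME].astype('datetime64[ns]')
```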

In [None]:
# set the data type for the reading datetime
df['ReadingDateTime'] = df['ReadingDateTime'].astype('datetime64[ns]')
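
`pd.to_datetime` is another common way to make this conversion. The sketch below uses made-up ISO-formatted timestamps; the real file may use a different layout, in which case the `format` or `dayfirst` arguments of `pd.to_datetime` may be needed.

```python
import pandas as pd

# Hypothetical timestamps in ISO format, for illustration only
toy = pd.DataFrame({"ReadingDateTime": ["2013-01-01 00:00", "2013-01-01 01:00"]})

print(toy["ReadingDateTime"].dtype)  # strings before conversion

toy["ReadingDateTime"] = pd.to_datetime(toy["ReadingDateTime"])
print(toy["ReadingDateTime"].dtype)  # a datetime64 type after conversion

# Datetime columns expose date parts through the .dt accessor
print(toy["ReadingDateTime"].dt.hour.tolist())
```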

Let's also use this as the index for the DataFrame, rather than just another column.

Setting an index for a dataframe is a bit like setting names for the rows. You will then be able to select rows using these new indices. Operations like merging, concatenation, and pivoting also depend on the index. 

In [None]:
# set the index to the reading time
df.set_index('ReadingDateTime', inplace=True, drop=True)
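
As a small illustration of selecting rows by index, the sketch below builds a toy table (hypothetical values) and uses `.loc` with a date string to pick out one day.

```python
import pandas as pd

# Hypothetical readings, for illustration only
toy = pd.DataFrame({
    "ReadingDateTime": pd.to_datetime(
        ["2013-01-01 00:00", "2013-01-01 01:00", "2013-01-02 00:00"]
    ),
    "Value": [40.0, 42.0, 38.0],
})
toy = toy.set_index("ReadingDateTime")

# With a datetime index, a date string selects every row on that day
first_day = toy.loc["2013-01-01"]
print(first_day)
```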

This table isn't in the most useful format for what we might want to do with it: we need each variable to be its own column.

Let's pivot it to obtain a table of `Value`s for each `Species` over `Time`:

In [None]:
# pivot the table 
pivoted = pd.pivot_table(df, index=['ReadingDateTime'], columns=['Species'], values='Value')
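
The reshaping is easier to see on a tiny table. This sketch assumes the same column names as the dataset, with made-up values: each distinct `Species` becomes a column, and each distinct time becomes a row.

```python
import pandas as pd

# Hypothetical long-format readings: one row per (time, species) pair
times = pd.to_datetime(["2013-01-01 00:00", "2013-01-01 00:00",
                        "2013-01-01 01:00", "2013-01-01 01:00"])
long_format = pd.DataFrame(
    {
        "Species": ["NO2", "NOX", "NO2", "NOX"],
        "Value": [40.0, 90.0, 42.0, 95.0],
    },
    index=pd.Index(times, name="ReadingDateTime"),
)

# Pivot to wide format: one column per species, one row per time
wide = pd.pivot_table(long_format, index=["ReadingDateTime"],
                      columns=["Species"], values="Value")
print(wide)
```

If several rows share the same time and species, `pivot_table` averages them by default.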

You might notice that the data we've imported is aggregated on an hourly basis.

Perhaps we're looking at longer term trends, and this level of detail is unnecessary.

We can `resample()` this data to a weekly (`W`) `mean()`. In other words, instead of hourly data, we get weekly data, with the values averaged over each week.

In [None]:
# calculate a weekly mean
weekly_mean = pivoted.resample('W', label='right').mean()
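
Here is a minimal sketch of the same resampling step on two weeks of made-up hourly readings. It relies on the pandas `W` alias, whose weeks end on a Sunday, so the toy data starts on a Monday to fill two complete weeks.

```python
import numpy as np
import pandas as pd

# Two weeks of hypothetical hourly readings, starting on Monday 7 January 2013
hours = pd.date_range("2013-01-07", periods=14 * 24, freq=pd.Timedelta(hours=1))

# First week is a constant 10.0, second week a constant 20.0
hourly = pd.DataFrame({"NOX": np.repeat([10.0, 20.0], 7 * 24)}, index=hours)

# One row per week, labelled with the Sunday that ends the week
weekly = hourly.resample("W").mean()
print(weekly)
```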

For good measure, we can check what this looks like:

In [None]:
ax = weekly_mean.plot()

Let's add a flag for where one of the variables is beyond a certain threshold (you could do something similar for e.g. data quality).

In this case let's add a column called `'Hazardous'` which contains values which are `True` where the column `'NOX'` is above 50:

In [None]:
# add a data flag
weekly_mean['Hazardous'] = weekly_mean['NOX'] > 50.
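
Comparing a column with a number gives one `True` or `False` per row. A small sketch with made-up values shows that the comparison is strict, so a reading of exactly 50 is not flagged:

```python
import pandas as pd

# Hypothetical weekly means, for illustration only
weekly = pd.DataFrame({"NOX": [35.0, 50.0, 72.5]})

# Element-wise comparison produces a boolean column
weekly["Hazardous"] = weekly["NOX"] > 50
print(weekly)

# True counts as 1, so sum() gives the number of flagged rows
print(weekly["Hazardous"].sum())
```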

Now that we've finished processing our data file, let's save it to our `output_folder`.

In [None]:
# make a processed data folder if there isn't one already
if not output_folder.exists():
    output_folder.mkdir(parents=True)

In [None]:
# output this to a 'processed files' folder
output_filepath = output_folder / 'WeeklyMeanAQ.csv'
weekly_mean.to_csv(output_filepath)
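
One way to check a saved file is to read it back in. The sketch below writes a toy table (made-up values) to a temporary folder rather than to `output_folder`, then reloads it with the first column restored as a datetime index.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical weekly means, for illustration only
weekly = pd.DataFrame(
    {"NOX": [60.0, 45.0]},
    index=pd.Index(pd.to_datetime(["2013-01-13", "2013-01-20"]),
                   name="ReadingDateTime"),
)

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "WeeklyMeanAQ.csv"
    weekly.to_csv(out_path)

    # index_col=0 uses the first column as the index;
    # parse_dates=True converts that index back to datetimes
    reloaded = pd.read_csv(out_path, index_col=0, parse_dates=True)

print(reloaded)
```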

Now that we have a workflow we can use, let's combine the data processing steps above into a reusable function.

A function is a block of code that performs a specific task. Functions help organise your code, avoid repetition, and can be reused in different parts of a large program.

Example: 

```python
def calc_quadratic(x):

    y = x**2 + 2*x + 5
    
    return y
```


To use the function:

```python
calc_quadratic(10)
```

Another example: 

```python
def calc_quadratic(x, a, b, c):

    y = a*x**2 + b*x + c
    
    return y
```

To use the function:
```python
calc_quadratic(10, 1, 2, 5)
```

A function is a self-contained piece of code. When you are writing your own function, check that the variables used inside the function are defined in relation to the arguments (in the above example, `x`, `a`, `b` and `c` are arguments of the function `calc_quadratic`). Calculations performed outside of the function will not affect what's inside the function.
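
A short sketch of this idea: the `y` assigned inside the function is separate from a `y` defined outside it, even though they share a name. Note that `**` is Python's power operator.

```python
def calc_quadratic(x, a, b, c):
    # This y exists only while the function is running
    y = a * x**2 + b * x + c
    return y


y = 0  # a different variable that happens to share the name

result = calc_quadratic(10, 1, 2, 5)

print(result)  # 125
print(y)       # 0, unchanged by the call above
```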

We've added the rough structure for you below:

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path

def process_csv_file(filepath, output_folder=Path("../data/processed/"), fill_with=np.nan):
    """
    Process a csv file so it's ready for exploratory data analysis.
    
    Parameters
    -----------
    filepath
        Path to the csv file to import.
    output_folder 
        Path to the folder where you want the processed version to reside.
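    fill_with 
        Value to substitute for zero. The steps below do not currently use it.
        
    Returns
    --------
    output_filepath
        Path to the csv file which is output.
        
    Notes
    --------
    This function drops rows with missing data, converts the reading time to a
    datetime index, pivots by species, takes a weekly mean and flags hazardous NOX levels.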
    """
    
    # 'filepath' is in the argument of this function
    df = pd.read_csv(filepath) 
    
    # remove rows with NAs
    df.dropna(how='any', axis=0, inplace=True)
    
    df['ReadingDateTime'] = df['ReadingDateTime'].astype('datetime64[ns]')
    
    df.set_index('ReadingDateTime', inplace=True, drop=True)
    
    pivoted = pd.pivot_table(df, index=['ReadingDateTime'], columns=['Species'], values='Value')
    
    weekly_mean = pivoted.resample('W').mean()
    
    weekly_mean['Hazardous'] = weekly_mean['NOX'] > 50
    
    if not output_folder.exists():
        output_folder.mkdir(parents=True)
    
    output_filepath = output_folder / 'WeeklyMeanAQ.csv'

    weekly_mean.to_csv(output_filepath)
    
    # add the return value
    return output_filepath

Copy this function, along with the imports above, into a separate file called `processor.py` inside a folder named `process_pipeline` (create a new text file and rename it to `processor.py`).

Now when we want to use this function, we can import it:

In [None]:
from process_pipeline.processor import process_csv_file

If you want to check back to see what arguments the function takes, you can use the inline help:

In [None]:
help(process_csv_file)

In [None]:
process_csv_file(data_filepath,
                 output_folder=Path("../data/another_processed_data_folder/"),
                 fill_with=" ")

Storing functions in separate files is useful for organising code in large programs. Functions used for processing data may be kept in one file, while functions used for importing data may be stored in another, for example. The main program then imports the required functions, remaining uncluttered. 
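
The sketch below illustrates the mechanics of importing from a separate file. It writes a tiny module into a temporary folder and imports a function from it. The names `demo_pipeline` and `double` are hypothetical and chosen only for this illustration; in practice you would create the file by hand, as with `processor.py`.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Build a minimal package: a folder containing __init__.py and one module
    package_dir = Path(tmp) / "demo_pipeline"
    package_dir.mkdir()
    (package_dir / "__init__.py").write_text("")
    (package_dir / "processor.py").write_text(
        "def double(x):\n"
        "    return 2 * x\n"
    )

    # Python looks for packages in the folders listed in sys.path
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        from demo_pipeline.processor import double
        result = double(21)
    finally:
        sys.path.remove(tmp)

print(result)  # 42
```

The import succeeds because the folder containing the package is on `sys.path`. A notebook can usually import a package that sits in the same folder as the notebook for the same reason.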