# Perth ResBaz 2019 - Python Stream


This lesson is adapted from the [Data Carpentry Ecology lesson](http://www.datacarpentry.org/python-ecology-lesson/)

We'll be using an EtherPad channel to share solutions to challenges, ask questions and chat:

<http://146.118.64.54/p/resbaz>

### Contents:

1. [Short Introduction to Python](#1.-Short-Introduction-to-Python)
2. [Starting With Data](#2.-Starting-With-Data)
3. [Indexing, Slicing, and Subsetting DataFrames](#3.-Indexing,-Slicing-and-Subsetting-DataFrames)
4. [Data Types and Formats](#4.-Data-Types-and-Formats)
5. [Combining DataFrames using Pandas](#5.-Combining-DataFrames-using-Pandas)
6. [Automating data processing](#6.-Automating-data-processing)
7. [A Brief intro to Matplotlib and Seaborn](#7.-A-Brief-intro-to-Matplotlib-and-Seaborn)

### Download this notebook!

We'll be working from this notebook and filling in the answers as we go.
To fetch the notebook and the example data we will be using...

**Please download the Git repository at <https://github.com/darcyabjones/resbaz-perth-2019-python>.**

If you have `git` installed you can `clone` it from the terminal at the location that you want to store it...

```bash
git clone https://github.com/darcyabjones/resbaz-perth-2019-python.git
```


If you don't have `git` installed, you can download a Zip archive of the repository from <https://github.com/darcyabjones/resbaz-perth-2019-python/archive/master.zip> and unzip it wherever you want :).


### Check that you have all of the necessary packages

In later sections we will be using some Python packages that aren't always distributed with Python.
If you installed the full Anaconda distribution, you should already have what you need installed.
If you used Miniconda or another method you may need to install the packages.

We will be using:

- [jupyter](https://jupyter.org/)
- [pandas](https://pandas.pydata.org/)
- [matplotlib](https://matplotlib.org/)
- [seaborn](https://seaborn.pydata.org/)

To check that you have these packages installed, run the following in your bash terminal from the directory that you downloaded this repository into.

In [None]:
%%bash

python3 scripts/check_packages.py

If you are missing packages you can install them using conda (e.g. if you installed Miniconda) or pip from your terminal.

```bash
conda install -y pandas matplotlib jupyter seaborn

# or

python3 -m pip install --user --upgrade pip pandas matplotlib jupyter seaborn
```

### How to use a Jupyter Notebook

https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/index.html

https://www.packtpub.com/books/content/getting-started-jupyter-notebook-part-1

- The file autosaves
- You run a cell with **Ctrl + enter** (or **Cmd + enter** on Mac OS) or using the run button in the tool bar
- If you run a cell with **Shift + enter** it will also create a new cell below
- See *Help > Keyboard Shortcuts* or the *Cheatsheet* for more info


- The notebook has different types of cells: Code and Markdown are most commonly used
- **Code** cells expect code for the Kernel you have chosen. Syntax highlighting is available, and comments are marked with `#`: anything after it on that line will not be executed
- **Markdown** cells allow you to write report-style text, using Markdown to format it (e.g. headers, bold face etc.)

# 1. Short Introduction to Python

# 2. Starting With Data

# 3. Indexing, Slicing and Subsetting DataFrames

We often want to work with subsets of a `DataFrame` object.

In general we can select by labels (e.g. column headings), or by specific row and/or column indices.


## Extracting Range based Subsets: Slicing

**REMINDER**: Python Uses 0-based Indexing

This means that the first element in an object is located at position `0`.


<img src="https://datacarpentry.org/python-ecology-lesson/fig/slicing-indexing.png" alt="indexing diagram" width=450>


<img src="https://datacarpentry.org/python-ecology-lesson/fig/slicing-slicing.png" alt="slicing diagram" width=450>


### Challenge (5 mins)

```python
# Create a list of numbers:
a = [1, 2, 3, 4, 5]
```

1. What value does the code below return?
   ```python
   a[0]
   ```
2. What about this?
   ```python
   a[-1]
   ```
3. What about a slice?
   ```python
   a[2:3]
   ```
4. How about this:
   ```python
   a[5]
   ```
5. Or this?
   ```python
   a[len(a)]
   ```
6. In the example above, calling `a[5]` returns an error. Why is that?

## Slicing `DataFrames` using the `[]` operator

Slicing using the `[]` operator selects a set of rows and/or columns from a DataFrame.
To slice out a set of rows, you use the following syntax: `data[start:stop]`.

When slicing, the start bound is included in the output **but the stop bound is not included**.
The stop bound is one step **BEYOND** the row you want to select.
So if you want to select rows 0, 1 and 2 your code would look like this:

In [None]:
# Load the dataframe if you don't already have it.

In [None]:
# select rows 0,1,2 (but not 3)

To access columns we will need to call them by their labels like we did in the last lesson.

```python
# selecting the plot_id column
surveys_df['plot_id']
```


If you want more than one column, you need to supply a list of column names.

In [None]:
# selecting plot_id and weight columns

In [None]:
# What happens if you ask for a column that doesn't exist?
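The `[]` patterns described above can be sketched on a small made-up table. This is only an illustration: `toy_df` and its values are assumptions invented for the example, not the survey data.

```python
import pandas as pd

# A tiny, made-up DataFrame standing in for the survey data (hypothetical values)
toy_df = pd.DataFrame({
    "plot_id": [2, 3, 2, 7, 3],
    "weight": [40.0, 48.0, 29.0, 46.0, 36.0],
    "sex": ["M", "M", "F", "M", "M"],
})

# A slice inside [] selects rows; the stop bound (3) is excluded
first_three = toy_df[0:3]

# A single label inside [] selects one column and returns a Series
plot_ids = toy_df["plot_id"]

# A list of labels selects several columns and returns a DataFrame
subset = toy_df[["plot_id", "weight"]]

print(first_three)
print(subset.shape)  # (5, 2)
```

Note the double brackets in the last selection: the outer pair is the `[]` operator and the inner pair is the list of column names.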

Using this information we can start to construct more complicated subsets.

In [None]:
# Access the first three rows of the `plot_id` column using a slice

In [None]:
# Access the rows 0, 1, and 4 of the `plot_id` column using a list

In [None]:
# Access the first three rows of the `plot_id`, `weight` sub-dataframe

We can also reassign values within subsets of our DataFrame.
But while we do that, let's assign our DataFrame to a new variable to demonstrate something that you might not expect.



In [None]:
# Assign `surveys_df` to a new variable `surveys_copy` to avoid modifying the original DataFrame

# Reassign the first three rows of data in the new DataFrame variable to 0

In [None]:
# Show the head of the copied dataframe.

In [None]:
# Show the head of the original dataframe.

What is the difference between the two data frames?

What happens if you copy the following blocks and run them?

```python
numbers = [1, 2, 3, 4]
numbers_copy = numbers
numbers_copy[1:3] = [0, 0]

print(numbers)
```

or this?

```python
number = 1
number_copy = number

number_copy = 0
print(number)
```

## Referencing Objects vs Copying Objects in Python

We might have thought that we were creating a fresh copy of the `surveys_df` object when we used the code `surveys_copy = surveys_df`.
However the statement `y = x` doesn’t create a copy of our data frame.
It creates a new variable `y` that refers to the **same** object `x` refers to.
This means that there is only one object (the data frame), and both `x` and `y` refer to it.

To create a fresh copy of the `surveys_df` data frame we use the syntax `y = x.copy()`.
But first we have to read `surveys_df` in again, because the current version contains the unintentional changes made to the first 3 rows.

In [None]:
# Import data/surveys.csv again

# Copy it to a new variable with the `.copy()` method.


In [None]:
# Reassign the first three rows of data in the new DataFrame variable to 0

# show the head of the original dataframe
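The difference between a second name for the same object and an independent copy can be checked with the `is` operator. The sketch below uses a made-up `toy_df` (an assumption for illustration only) rather than the survey data.

```python
import pandas as pd

toy_df = pd.DataFrame({"weight": [40.0, 48.0, 29.0]})  # hypothetical values

alias = toy_df             # a second name for the same object
duplicate = toy_df.copy()  # an independent copy of the data

print(alias is toy_df)      # True
print(duplicate is toy_df)  # False

# Changing the data through the alias also changes what `toy_df` shows,
# but leaves the copy untouched
alias.loc[0, "weight"] = 0.0
print(toy_df.loc[0, "weight"])     # 0.0
print(duplicate.loc[0, "weight"])  # 40.0
```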

The [`.copy()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.copy.html) method used here is provided by Pandas.

To copy objects in general you can use the [`copy`](https://docs.python.org/3/library/copy.html) module from the standard library.
The syntax for copying a `list` would be:

```python
import copy

numbers = [1, 2, 3, 4, 5]
numbers_copy = copy.deepcopy(numbers)
```

## Slicing and subsetting: label vs integer-based indexing

We can select specific locations or ranges of our data in both the row and column directions
using either [label](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#selection-by-label) or [integer-based indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-integer).

- `loc`: indexing via *labels* (which can be numbers)
- `iloc`: indexing via *integers*


<img src="http://104.236.88.249/wp-content/uploads/2016/10/Pandas-selections-and-indexing.png" alt="loc_iloc_subsetting" width="550px" />

<img src="https://vrzkj25a871bpq7t1ugcgmn9-wpengine.netdna-ssl.com/wp-content/uploads/2019/01/pandas-dataframe-has-indexes.png" alt="dataframe_indexing" width="600px" />


To select a subset of rows AND columns from our DataFrame by integer position, we can use the `iloc` indexer.

**Note**: the order of selection is ROW followed by COLUMN

In [None]:
# if you haven't done so yet read the csv back in (we had unwanted changes when 
# we 'copied' the data frame earlier)

# try .iloc

Notice that we asked for a slice from `0:3`. This yielded 3 rows of data. When you
ask for `0:3`, you are actually telling Python to start at index 0 and select rows
`0`, `1`, and `2` (**up to but not including 3**).


---

Next let's explore subsetting our data using labels.
**Note**: When "slicing" labels, the start bound and the stop bound are **included**.

In [None]:
# select all columns for rows of index values 0 and 10

In [None]:
# We can also select multiple columns using a list
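To see the difference in how the stop bound is treated, here is a minimal sketch on a made-up table (`toy_df` and its values are assumptions for illustration). Its default index labels happen to be the integers 0 to 4, so the same numbers can be used with both indexers.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "plot_id": [2, 3, 2, 7, 3],
    "weight": [40.0, 48.0, 29.0, 46.0, 36.0],
})  # hypothetical values

# iloc: integer positions, stop bound excluded -> rows at positions 0, 1, 2
by_position = toy_df.iloc[0:3, 0:2]

# loc: labels, stop bound included -> rows labelled 0, 1, 2, 3
by_label = toy_df.loc[0:3, ["plot_id", "weight"]]

print(by_position.shape)  # (3, 2)
print(by_label.shape)     # (4, 2)
```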

A common "gotcha" is trying to select a non-existent value from the data, which can produce some unexpected results.


```python

# What happens when you type the code below?
surveys_df.loc[[0, 10, 35549], :]

# Try using iloc instead?
surveys_df.iloc[[0, 10, 35549], :]
```

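Using `iloc`, every position must be within the bounds of the DataFrame or you will get an `IndexError`.
Using `loc` with a list that contains a missing label, the result depends on your Pandas version: older versions (before 1.0) return a row of `NaN` entries, while versions from 1.0 onwards raise a `KeyError`, so be careful!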


### Challenge

1. What happens when you type:
	- `surveys_df[0:3]`
	- `surveys_df[:5]`
	- `surveys_df[-1:]`

2. What happens when you call:
    - `surveys_df.iloc[0:4, 1:4]`
    - `surveys_df.loc[0:4, 1:4]`
    - How are the two commands different?

## Subsetting using masks

A mask can be useful to locate where a particular subset of values exist or don't exist - for example, `NaN`, or "Not a Number" values.

To understand masks we also need to understand `bool` objects in Python.

Boolean values are either `True` or `False`.
So for example:

```python
# set x to 5
x = 5

# what does the code below return?
x > 5

# how about this?
x == 5
```


To create a [boolean mask](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#boolean-indexing), you first create the `True` / `False` criteria (e.g. `values > 5`).
Python will then assess each value in the object to determine whether the value meets the criteria (`True`) or not (`False`).
Python creates an output object that is the same shape as the original object, but with a `True` or `False` value for each index location.

You can use the syntax below when querying data from a DataFrame.
Experiment with selecting various subsets of our data.

* Equals: `==`
* Not equals: `!=`
* Greater than, less than: `>` or `<`
* Greater than or equal to `>=`
* Less than or equal to `<=`

Let's try this out. 

In [None]:
# Select all rows with `weight` < 100

In [None]:
# Select all rows with `weight` > 100 and `sex` == "F"

Note the use of the `&` operator, which is a bit like `and` but in Pandas it does the `and` for each element of a `Series`.
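As a rough sketch of how a mask is built and combined, the example below uses a made-up `toy_df` (an assumption for illustration, not the survey data). Each comparison is wrapped in parentheses because `&` binds more tightly than `>` and `==`.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "weight": [40.0, 148.0, 29.0, 146.0, 136.0],
    "sex": ["M", "M", "F", "F", "M"],
})  # hypothetical values

# Each comparison yields a Series of True/False values, one per row
heavy = toy_df["weight"] > 100
female = toy_df["sex"] == "F"
print(heavy.tolist())  # [False, True, False, True, True]

# `&` combines the two masks element by element;
# using the combined mask inside [] keeps only the rows that are True
heavy_females = toy_df[(toy_df["weight"] > 100) & (toy_df["sex"] == "F")]
print(heavy_females)
```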


Next, let's identify all locations in the survey data that have null (missing or `NaN`) data values.
We can use the pandas [`isnull()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isnull.html) function to do this.
Each cell with a null value will be assigned a value of  `True` in the new boolean object.

In [None]:
# Use isnull() to find all null values in the dataframe

To select the rows where there are null values, we can use the mask as an index to subset our data as follows:

In [None]:
# To select just the rows with NaN values in any column, we can use the `.any()` method

Note that there are many null or `NaN` values in the `weight` column of our DataFrame.
We will explore different ways of dealing with these in section 4.

We can run `isnull` on a particular column too.
What does the code below do?

```python
# what does this do?
empty_weights = surveys_df.loc[pd.isnull(surveys_df["weight"])]
```

Let's take a minute to look at the statement above. 

We are using the Boolean object as an index. 
We are asking Python to select rows that have a `NaN` value for weight.
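The null-value masks discussed in this section can be tried on a small made-up table first. In this sketch, `toy_df` and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "species_id": ["NL", "DM", None, "DM"],
    "weight": [np.nan, 40.0, 29.0, np.nan],
})  # hypothetical values

# A DataFrame of True/False with the same shape as toy_df
null_mask = toy_df.isnull()

# Rows that have a null value in at least one column
rows_with_any_null = toy_df[null_mask.any(axis=1)]

# Rows where one particular column is null
empty_weights = toy_df[toy_df["weight"].isnull()]

print(len(rows_with_any_null))  # 3
print(len(empty_weights))       # 2
```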

### Challenge

1. Select a subset of rows in the `surveys_df` DataFrame that contain data from
   the year 1999 and that contain weight values less than or equal to 8.
   How many rows did you end up with?
   What did your neighbor get?
2. You can use the [`.isin()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isin.html) method to query a DataFrame based upon a list of values as follows:
   ```python
   surveys_df[surveys_df['species_id'].isin([listGoesHere])]
   ```
   Use the `.isin()` method to find all plots that contain particular species in
   the surveys DataFrame.
   How many records contain these values?
3. Experiment with other queries. Create a query that finds all rows with a weight value greater than or equal to 0.
4. The `~` symbol in Python can be used to return the OPPOSITE of the selection that you specify.
   It is equivalent to **is not in**.
   Write a query that selects all rows with `sex` NOT equal to 'M' or 'F' in the surveys data.

# 4. Data Types and Formats

The format of individual columns and rows will impact analysis performed on a dataset read into Python.
For example, you can’t perform mathematical calculations on a string (text formatted data).
This might seem obvious, however sometimes numeric values are read into Python as strings.


## Types of Data

There are two main types of data that we’ll explore in this lesson: numeric and text data types.


### Numeric Data Types

Numeric data types include integers and floats.
A floating point (known as a float) number has decimal points even if that decimal point value is 0.
For example: `1.13`, `2.0`, `1234.345`.

An integer will never have a decimal point.
Thus if we wanted to store 1.13 as an integer it would be stored as 1.
You will often see the data type `int64` when using Pandas which stands for "64 bit integer".
The 64 simply refers to the memory allocated to store data in each cell which effectively relates to how many digits it can store.

If we have a column that contains both integers and floating point numbers, Pandas will assign the entire column to the float data type so the decimal points are not lost.


### Text Data Type (Strings)

Strings can contain numbers and/or characters.
For example, a string might be a word, a sentence, or several sentences.
A string can also contain or consist of numbers. For instance, ‘1234’ could be stored as a string, as could ‘10.23’. 
However strings that contain numbers can not be used for mathematical operations!


Pandas and base Python use slightly different names for data types. More on this is in the table below:


| Pandas Type | Native Python Type | Description |
|-------------|--------------------|-------------|
| object | string | The most general dtype. Will be assigned to your column if column has mixed types (numbers and strings). |
| int64  | int | Numeric characters. 64 refers to the memory allocated to hold this character. |
| float64 | float | Numeric characters with decimals. If a column contains numbers and NaNs (see below), pandas will default to float64, in case your missing value has a decimal. |
| datetime64, timedelta[ns] | N/A (but see the [datetime] module in Python's standard library) | Values meant to hold time data. Look into these for time series experiments. |

[datetime]: https://docs.python.org/3/library/datetime.html



## Checking the format of our data

Now that we’re armed with a basic understanding of numeric and text data types, let’s explore the format of our survey data.
We’ll be working with the same `surveys.csv` dataset that we’ve used in previous lessons.

In [None]:
# Reload the `data/surveys.csv` dataframe

Remember that we can check the type of an object like this:

Next, let’s look at the structure of our surveys data.
In Pandas, we can check the type of one column in a DataFrame using the syntax `dataFrameName[column_name].dtype`

In [None]:
# What is the dtype of the `sex` column?

In [None]:
# What is the dtype of the `record_id` column?

We can use the `.dtypes` attribute to view the data type for each column in a DataFrame (all at once).
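As a minimal sketch of these checks, the example below builds a made-up `toy_df` (an assumption for illustration) and inspects its types. The `weight` column mixes integers with a missing value, so Pandas stores it as floats.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "record_id": [1, 2, 3],
    "sex": ["M", "F", "M"],
    "weight": [40, np.nan, 29],
})  # hypothetical values

# The type of the whole object
print(type(toy_df))

# The dtype of a single column
print(toy_df["record_id"].dtype)  # int64
print(toy_df["weight"].dtype)     # float64

# The dtype of every column at once
print(toy_df.dtypes)
```

The exact name reported for the text column `sex` depends on the Pandas version, so it is worth checking what your installation prints.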


## Working With Integers and Floats

So we’ve learned that computers store numbers in one of two ways: as integers or as floating-point numbers (or floats).

Let’s next consider how the data type can impact mathematical operations on our data.

In [None]:
# Addition and subtraction


If we divide one integer by another with `/`, we get a float.
This differs from Python 2, where `/` applied to two integers performs integer division.

In [None]:
# Float division

In [None]:
# Integer division

We can also convert a floating point number to an integer or an integer to floating point number.
Notice that Python truncates toward zero when it converts from floating point to integer, so positive numbers are rounded down.

In [None]:
# Convert a float to an integer

In [None]:
# Convert an integer to a float

Normally when converting a `float` to an `int`, we want to be specific about how we want to round things.
We can use the functions [`ceil`](https://docs.python.org/3/library/math.html#math.ceil) or [`floor`](https://docs.python.org/3/library/math.html#math.floor) from the standard library [`math`](https://docs.python.org/3/library/math.html) module to do this.
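The division and conversion rules in this section can be checked with a few plain Python expressions. The variable names below are only illustrative.

```python
import math

float_division = 7 / 2     # `/` always gives a float in Python 3
integer_division = 7 // 2  # `//` discards the fractional part
print(float_division)    # 3.5
print(integer_division)  # 3

# int() drops the fractional part, which is not the same as floor
# for negative numbers
print(int(7.9))   # 7
print(int(-7.9))  # -7

# floor and ceil make the rounding direction explicit
rounded_down = math.floor(-7.9)
rounded_up = math.ceil(7.1)
print(rounded_down)  # -8
print(rounded_up)    # 8

as_float = float(7)
print(as_float)  # 7.0
```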

## Working With Our Survey Data

We can modify the format of values within our data frames if we want.

For instance, we could convert the `record_id` field to floating point values.

In [None]:
# Convert the record_id field from an integer to a float

### Challenge (3 mins)

Try converting the column `plot_id` to floats using:

```python
surveys_df["plot_id"].astype("float")
```


Next try converting weight to an integer.
What goes wrong here?
What is Pandas telling you?
We will talk about some solutions to this later.

## Missing Data Values - `NaN`

What happened in the last challenge activity?

Notice that this throws a value error:

```
ValueError: Cannot convert NA to integer
```

If we look at the `weight` column in the surveys data we notice that there are `NaN` (Not a Number) values.
`NaN` values are undefined values that cannot be represented mathematically.
Pandas, for example, will read an empty cell in a CSV or Excel sheet as a `NaN`.

`NaN`s have some desirable properties: if we were to average the weight column without replacing our `NaN`s, Pandas would know to skip over those cells.


Dealing with missing data values is always a challenge. It's sometimes hard to know why values are missing - was it because of a data entry error? Or data that someone was unable to collect? Should the value be 0?

We need to know how missing values are represented in the dataset in order to make good decisions. If we're lucky, we have some metadata that will tell us more about how null values were handled.

For instance, in some disciplines, like Remote Sensing, missing data values are often defined as -9999. Having a bunch of -9999 values in your data could really alter numeric calculations.

Often in spreadsheets, cells are left empty where no data are available. Pandas will, by default, replace those missing values with NaN. However it is good practice to get in the habit of intentionally marking cells that have no data, with a no data value! That way there are no questions in the future when you (or someone else) explores your data.


## Where Are the `NaN`s?

Let's explore the `NaN` values in our data a bit further.
First, let's figure out how many rows contain `NaN` values for weight.

We can do this by identifying how many rows have a NULL value using the `.isnull()` function or by counting the number of rows that have a meaningful value (e.g. `weight` > 0):

In [None]:
# Remind ourselves how many values are in the df.

In [None]:
# Use `.isnull` function to find `NaN` values.

In [None]:
# Find the number of records with weight > 0 

We can replace all `NaN` values with zeroes using the `.fillna()` method.

In [None]:
# Fill na values in weight with 0 

To make a change to a DataFrame permanent, we assign the updated `Series` back to the `DataFrame` column with `=` (after making a copy of the data so we don't lose our work).

In [None]:
# Copy the dataframe using the `.copy()` method

# replace NaN with 0 using the `.fillna()` method

# Show the updated dataframe

However, `NaN` and `0` will yield different analysis results.
The mean value when `NaN` values are replaced with `0` is different from when `NaN` values are simply thrown out or ignored.

In [None]:
# check mean of weights in the new na filled dataframe

In [None]:
# how does the new mean differ from before?

We can fill `NaN` values with any value that we choose.
The code below fills all `NaN` values with the mean of all weight values.

```python
df1['weight'] = surveys_df['weight'].fillna(surveys_df['weight'].mean())
```

We could also choose to create a subset of our data, only keeping rows that do not contain `NaN` values, using the `.dropna()` method.
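To make the effect of each choice concrete, here is a small sketch on a made-up `Series` of weights (the values are assumptions chosen so the arithmetic is easy to follow).

```python
import numpy as np
import pandas as pd

weights = pd.Series([40.0, np.nan, 20.0, np.nan], name="weight")  # hypothetical values

# By default the mean skips NaN values: (40 + 20) / 2
mean_skipping_nan = weights.mean()
print(mean_skipping_nan)  # 30.0

# Filling with 0 pulls the mean down: (40 + 0 + 20 + 0) / 4
mean_zero_filled = weights.fillna(0).mean()
print(mean_zero_filled)  # 15.0

# Filling with the mean leaves the mean unchanged
mean_mean_filled = weights.fillna(weights.mean()).mean()
print(mean_mean_filled)  # 30.0

# Dropping NaN values keeps only the complete observations
complete = weights.dropna()
print(len(complete))  # 2
```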

The point is to make conscious decisions about how to manage missing data using your domain expertise.

### Challenge (5 mins)

Count the number of missing values per column.
Hint: the method `.count()` gives you the number of non-NA observations per column. Try looking into the `.isnull()` method.

### A Brief aside about `inplace`

The pattern of modifying a column and reassigning it to the same column in a DataFrame is very common.
In Pandas, many methods have an optional `inplace` argument.

Setting it to `True` will modify the column "in place".

For example, the following is equivalent to what we did with `df1` previously.

In [None]:
df2 = surveys_df.copy()

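# Fill on the DataFrame itself: in recent Pandas versions, calling
# `.fillna(0, inplace=True)` on `df2["weight"]` may not update `df2`
df2.fillna({"weight": 0}, inplace=True)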
df2["weight"].mean()

## Writing out data to CSV

We’ve learned about manipulating data to get desired outputs.
But we’ve also discussed keeping data that has been manipulated separate from our raw data.
Something we might be interested in doing is working with only the columns that have full data.
First, let’s reload the data so we're not mixing up all of our previous manipulations.

Next, let’s drop all the rows that contain missing values.
We will use the method `.dropna()`.
By default, `.dropna()` removes rows that contain missing data for even just one column.

If you now type `df_na`, you should observe that the resulting DataFrame has 30676 rows and 9 columns, much smaller than the 35549 row original.

We can now use the `.to_csv()` method to export a `DataFrame` in CSV format.
Note that the code below will by default save the data into the current working directory.
We can save it to a different folder by adding the foldername and a slash before the filename: `df.to_csv('foldername/out.csv')`.

In [None]:
# Save the NA-dropped dataframe
# What does the `index` parameter do?


Have a look at the data!
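The drop-then-export workflow can be sketched as follows, assuming a made-up `toy_df` and a temporary output directory (both are illustrative choices so the example does not touch your real data folder).

```python
import os
import tempfile

import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "record_id": [1, 2, 3],
    "weight": [40.0, np.nan, 29.0],
})  # hypothetical values

# Keep only the rows without missing values
df_na = toy_df.dropna()

# Write to a temporary directory; index=False leaves the row index out of the file
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "toy_complete.csv")
df_na.to_csv(out_path, index=False)

# Read the file back in to check what was written
round_trip = pd.read_csv(out_path)
print(list(round_trip.columns))  # ['record_id', 'weight']
print(len(round_trip))           # 2
```

Without `index=False`, the row index is written as an extra unnamed first column, which then shows up as a column when the file is read back in.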

# 5. Combining DataFrames using Pandas

In many "real world" situations, the data that we want to use come in multiple files.
We often need to combine these files into a single `DataFrame` to analyze the data.

The Pandas package provides various functions for combining DataFrames including `merge` and `concat`.


## Concatenating

We can use the `concat` function in Pandas to append either columns or rows from one DataFrame to another.
Let's grab two subsets of our data to see how this works.

In [None]:
# Save a slice of the first 10 rows of the surveys table to a new variable.


In [None]:
# Save a slice of the last 10 rows to a new variable.


Notice that the subset containing the last few rows of the dataframe has an odd index.

We can reset this index to be regular sequential indices using the `.reset_index()` method.

In [None]:
# Use the .reset_index() method to 'fix' the index.
# What does the `drop` argument do?

When we concatenate DataFrames, we need to specify the axis.

`axis=0` tells Pandas to stack the second DataFrame under the first one.
It will automatically detect whether the column names are the same and will stack accordingly.
To stack the data vertically, we need to make sure we have the same columns and associated column format in both datasets.

`axis=1` will stack the columns in the second DataFrame to the RIGHT of the first DataFrame.
When we stack horizontally (`axis=1`), we want to make sure what we are doing makes sense (i.e. the data are related in some way).

In [None]:
# Stack the DataFrames on top of each other using the `concat` function.

In [None]:
# Place the DataFrames side by side
# Do we need to reset_index?
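Here is a minimal sketch of both stacking directions, using two made-up DataFrames (`top` and `bottom` are assumptions for illustration). `bottom` is given a non-sequential index on purpose, to mimic a slice taken from the end of a larger table.

```python
import pandas as pd

top = pd.DataFrame({"record_id": [1, 2], "weight": [40.0, 48.0]})
bottom = pd.DataFrame(
    {"record_id": [9, 10], "weight": [29.0, 46.0]},
    index=[8, 9],
)  # hypothetical values

# axis=0: rows of `bottom` are placed under `top`; the original indices are kept
vertical = pd.concat([top, bottom], axis=0)
print(vertical.shape)        # (4, 2)
print(list(vertical.index))  # [0, 1, 8, 9]

# axis=1: columns are placed side by side and rows are matched on the index,
# so the index of `bottom` is reset first to line the rows up
horizontal = pd.concat([top, bottom.reset_index(drop=True)], axis=1)
print(horizontal.shape)  # (2, 4)
```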

### Challenge (5-10 mins)

Have a look at the vertically stacked dataframe, do you notice anything unusual?

The row indexes for the two data frames `surveys_first_10` and `surveys_last_10` are not sequential.


1. Reindex the new concatenated dataframe using the `.reset_index()` method.
2. What does the `ignore_index` parameter do?
   Try entering `pd.concat([surveys_first_10, surveys_last_10], axis=0, ignore_index=True)`
3. BONUS. What does `pd.concat([surveys_first_10, surveys_last_10.reset_index(drop=True)], axis=1, ignore_index=True)` do?
   How does this relate to the axis joined on?

## Joining two `DataFrames`

Another way to combine `DataFrames` is to use columns in each dataset that contain common values (a common unique id).

Combining `DataFrames` using a common field is called "joining".
The columns containing the common values are called "join key(s)".
Joining `DataFrames` in this way is often useful when one `DataFrame` contains additional data that we want to include in the other.

NOTE: This process of joining tables is similar to what we do with tables in an SQL database.


For example, the file `data/species.csv` contains additional data about the species we saw in `data/surveys.csv`.
This table contains the genus, species and taxa code for 55 species.
The species code is unique for each line. These species are identified in our survey data as well using the unique species code.

To work through the examples below, we first need to load the species file into a Pandas `DataFrame`.

In [None]:
# Import our surveys and species csvs into dataframes.

Now we can see the relationship between the dataframes.

In [None]:
# Show the head of surveys_df

In [None]:
# Show the head of species_df

We see that the column `species_id` in `species_df` corresponds to the `species_id` column in `surveys_df`.

So `species_id` will be our **join key** in both dataframes.

We can now use the pandas `merge` function to join the two DataFrames.

In [None]:
# Use pd.merge to combine the two dataframes.

In [None]:
# What's the shape of the output table?

We can see that the joined DataFrame now has genera and species information for the survey records.


## Types of joins

The most common type of join is called an inner join, which we just did by default with `pd.merge`.
An inner join combines two DataFrames based on a join key and returns a new DataFrame that contains only those rows that have matching values in both of the original DataFrames.

Inner joins yield a DataFrame that contains only rows where the value being joined on exists in BOTH tables.


<img src="https://datacarpentry.org/python-ecology-lesson/fig/inner-join.png" alt="inner-join" width="450px" >


We can also perform other types of joins like a **left join** by providing the `how` parameter.

In [None]:
# Perform a left-join using how="left".
# What happens if you subset one of the dataframes?

A "left join" or "left outer join" joins whatever keys it can and fills any missing keys in the "right" dataframe with `NaN` values.
Unlike the "inner" join, none of the rows in the left dataframe are discarded.

<img src="https://datacarpentry.org/python-ecology-lesson/fig/left-join.png" alt="left-join" width="450px" >

Other types of join include "right" joins and "full outer joins".
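The difference between an inner and a left join is easiest to see on tiny tables. In this sketch, `toy_surveys` and `toy_species` are made-up stand-ins (assumptions for illustration), and the species code `XX` deliberately has no match in the species table.

```python
import pandas as pd

toy_surveys = pd.DataFrame({
    "record_id": [1, 2, 3, 4],
    "species_id": ["NL", "DM", "XX", "DM"],
})
toy_species = pd.DataFrame({
    "species_id": ["NL", "DM", "PF"],
    "genus": ["Neotoma", "Dipodomys", "Perognathus"],
})  # illustrative values

# Inner join (the default): only rows whose key exists in both tables
inner = pd.merge(toy_surveys, toy_species, on="species_id")

# Left join: every row of the left table is kept;
# unmatched keys get NaN in the columns that come from the right table
left = pd.merge(toy_surveys, toy_species, on="species_id", how="left")

print(len(inner))                    # 3
print(len(left))                     # 4
print(left["genus"].isnull().sum())  # 1
```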


### Challenge (10 - 15 mins)

In the data folder, there is a `plots.csv` file that contains information about the "plot type" associated with each plot.

1. Read that data into a pandas dataframe and identify the **join keys** (hint: in `surveys.csv` the column will be `plot_id`).
2. Perform an **inner join** using the `pd.merge` function.
3. Find the number of different `species` for each `plot_type` using the `.groupby()` and `.nunique()` methods.

# 6. Automating data processing

So far, we've used Python and the Pandas library to explore and manipulate individual datasets by hand, much like we would do in a spreadsheet.

The beauty of using a programming language like Python comes from the ability to automate data processing through the use of loops and functions.


## For loops

Loops allow us to repeat a workflow (or series of actions) a given number of times or while some condition is true.
We would use a loop to automatically process data that's stored in multiple files, e.g. daily values with one file per year.

Loops lighten our work load by performing repeated tasks without our direct involvement and make it less likely that we'll introduce errors by making mistakes while processing each file by hand.

Let's write a simple for loop that simulates what a kid might see during a visit to the zoo:

In [None]:
# Write a for loop that prints each animal out

The line defining the loop must start with `for` and end with a colon, and the body of the loop must be indented.

In this example, `creature` is the loop variable that takes the value of the next entry in `animals` every time the loop goes around.

We can call the loop variable anything we like.
After the loop finishes, the loop variable will still exist and will have the value of the last entry in the collection:

We are not asking python to print the value of the loop variable anymore, but the for loop still runs and the value of `creature` changes on each pass through the loop.
The statement `pass` in the body of the loop just means "do nothing".
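The behaviour described above can be sketched as follows. The contents of the `animals` list are made up for illustration.

```python
animals = ["lion", "tiger", "crocodile", "vulture", "hippo"]  # hypothetical zoo

# Print each entry in turn; `creature` is the loop variable
for creature in animals:
    print(creature)

# After the loop, the loop variable still holds the last entry
print("The loop variable is now: " + creature)

# The loop still runs and updates `creature` even when the body does nothing
for creature in animals:
    pass

print(creature)  # hippo
```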

---

The file we've been using so far (`surveys.csv`) contains 25 years of data and is a bit large.
We might like to separate the data for each year into a separate file.

Let's start by making a new directory called `yearly_files` inside the folder `data` to store all of these files using the function [`mkdir`](https://docs.python.org/3/library/os.html#os.mkdir) in the [`os`](https://docs.python.org/3/library/os.html) module.

In [None]:
# Create a new directory with the os.mkdir function.

In [None]:
# Check that we created the directory with os.listdir

The command `os.listdir` is equivalent to `ls` in the shell.

Previously, we saw how to use the library pandas to load the surveys dataframe, subset it, and save it to a new file.

```python
import pandas as pd

# Load the data into a DataFrame
surveys_df = pd.read_csv('data/surveys.csv')

# Select only data for 2002
surveys2002 = surveys_df.loc[surveys_df["year"] == 2002]

# Write the new DataFrame to a csv file
surveys2002.to_csv('data/yearly_files/surveys2002.csv')
```

Rather than copy-pasting the last two commands for every subset, we can use a loop.

We have seen that we can loop over a list of items, so we need a list of years to loop over.
We can get the *unique* years in our DataFrame with:

In [None]:
# Find the unique years in surveys_df

Putting this into our for loop we get

In [None]:
# Loop over the unique years and print them.

We can combine strings using the `+` operator.

In [None]:
# Join the year with strings to produce a new string for the full path.

We can now add the rest of the steps we need to create separate text files.
Once finished look inside the `yearly_files` directory and check a couple of the files you
just created to confirm that everything worked as expected.

In [None]:
# list the data/yearly_files directory
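
Putting the pieces together, the complete loop might look like the sketch below. The toy DataFrame is a hypothetical stand-in for the real `surveys_df`, and `os.makedirs(..., exist_ok=True)` is used here only so the example runs even if the output directory has not been created yet.

```python
import os
import pandas as pd

# Hypothetical stand-in for surveys_df; in the lesson this comes from data/surveys.csv.
surveys_df = pd.DataFrame({
    "record_id": [1, 2, 3, 4],
    "year": [1977, 1977, 1978, 1979],
    "weight": [40.0, None, 29.0, 46.0],
})

# Make sure the output directory exists (harmless if it is already there).
os.makedirs("data/yearly_files", exist_ok=True)

for year in surveys_df["year"].unique():
    # Select the rows for this year only.
    surveys_year = surveys_df.loc[surveys_df["year"] == year]

    # Build the file name and write the subset to it.
    filename = "data/yearly_files/surveys" + str(year) + ".csv"
    surveys_year.to_csv(filename)

# Check which files now exist.
print(sorted(os.listdir("data/yearly_files")))
```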


### Challenge (10 - 15 mins)

1. Instead of splitting out the data by years, a colleague wants to analyse each species separately. How would you write a unique csv file for each species?

2. Let's say you only want to look at data from years at a regular interval. How would you modify your loop in order to generate a data file for only every 5th year, starting from 1977? Hint: you will need to use `range` to specify the sequence of numbers.

``` python
range(start, stop, step)
```
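
As a quick illustration of how `range` behaves (using unrelated numbers so as not to give the answer away), note that the sequence stops *before* the second argument:

```python
# range(start, stop, step) counts from start up to, but not including, stop.
every_third = list(range(0, 10, 3))
print(every_third)  # [0, 3, 6, 9]

# Leaving out the step counts in ones; the stop value is still excluded.
one_step = list(range(3, 6))
print(one_step)  # [3, 4, 5]
```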


## Building reusable and modular code with functions

Suppose that separating large data files into individual yearly files is a task that we frequently have to perform.
Again, rather than copy-pasting for loops and changing one or two things, we can package chunks of code into functions.

Functions are reusable, self-contained pieces of code that are called with a single command.
They can be designed to accept arguments as input and return values, but they don't need to do either.

Variables declared inside functions only exist while the function is running.
If a variable within the function (a local variable) has the same name as a variable somewhere else in the code, the local variable hides but doesn't overwrite the other.

We will only use functions that are housed within the same code that uses them, but it's [possible to write functions that can be used by different programs](https://docs.python.org/3/tutorial/modules.html).


Functions are declared following this general structure:

```python
def function_name(input_arg):
    print(input_arg)
    return "Returned value"
```
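
To see the structure in action, here is a small sketch that defines the same function and then calls it. The variable name `result` and the argument `"Hello"` are arbitrary choices for illustration.

```python
def function_name(input_arg):
    # This line runs every time the function is called.
    print(input_arg)
    # This is the value handed back to whoever called the function.
    return "Returned value"

# The argument is printed inside the function; the return value is stored in `result`.
result = function_name("Hello")
print(result)
```

The first `print` happens inside the function, while the second shows the value that the function returned.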

### Challenge (10 mins)

1. Try calling the function by giving it the wrong number of arguments (not 1)
   or without assigning the result of the function call to a variable.
1. Declare a variable inside the function and test to see where it exists (Hint:
   can you print it from outside the function?)
1. Explore what happens when a variable inside the function and a variable
   outside it have the same name. What happens to the global variable when you
   change the value of the local variable?

---

We can now turn our code for saving yearly data files into a function.
There are many different "chunks" of this code that we can turn into functions, and we can even create functions that call other functions inside them.
Let's first write a function that separates data for just one year and saves that data to a file:

In [None]:
# Define a `one_year_csv_writer()` function

In [None]:
# Call our new function for a single year

In [None]:
# Loop through unique years and call new function.
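
One possible way to fill in these three cells is sketched below. The argument names (`this_year`, `all_data`) and the `function_surveys` file-name root are assumptions, as is the toy DataFrame standing in for `surveys_df`; what you write in the session may differ.

```python
import os
import pandas as pd

def one_year_csv_writer(this_year, all_data):
    """
    Writes a csv file for data from a given year.

    this_year -- year for which data is extracted
    all_data -- DataFrame with multi-year data
    """
    # Select data for the year.
    surveys_year = all_data[all_data["year"] == this_year]

    # Write the new DataFrame to a csv file.
    filename = "data/yearly_files/function_surveys" + str(this_year) + ".csv"
    surveys_year.to_csv(filename)

# Hypothetical stand-in for surveys_df.
surveys_df = pd.DataFrame({
    "year": [2000, 2001, 2002, 2002],
    "weight": [40.0, 48.0, 29.0, 46.0],
})
os.makedirs("data/yearly_files", exist_ok=True)

# Call the function for a single year...
one_year_csv_writer(2002, surveys_df)

# ...or once for every year in the data.
for year in surveys_df["year"].unique():
    one_year_csv_writer(year, surveys_df)
```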

The text between the two sets of triple double quotes is called a docstring and contains the documentation for the function.
It does nothing when the function is running and is therefore not necessary, but it is good practice to include docstrings as a reminder of what the code does.

Docstrings in functions also become part of their 'official' documentation:

In [None]:
# Show the help for our newly created function

We changed the root of the name of the csv file so we can distinguish it from the one we wrote before.
Check the `yearly_files` directory for the file.
Did it do what you expect?


Let's write another function that replaces the entire `for` loop by simply looping through a sequence of years and repeatedly calling the function we just wrote, `one_year_csv_writer`:

In [None]:
# Create a `yearly_data_csv_writer()` function that does the loop and calls the previous function.

By writing the entire loop into a function, we've made a reusable tool for whenever we need to break a large data file into yearly files.
Because we can specify the first and last year for which we want files, we can even use this function to create files for a subset of the years available.

This is how we call this function:
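
A call might look like the last line of the sketch below. The argument order (`start_year`, `end_year`, `all_data`), the `function_surveys` file-name root and the toy DataFrame are all assumptions made for illustration; `one_year_csv_writer` is repeated in shortened form so the example runs on its own.

```python
import os
import pandas as pd

def one_year_csv_writer(this_year, all_data):
    """Writes a csv file for data from a given year."""
    surveys_year = all_data[all_data["year"] == this_year]
    filename = "data/yearly_files/function_surveys" + str(this_year) + ".csv"
    surveys_year.to_csv(filename)

def yearly_data_csv_writer(start_year, end_year, all_data):
    """
    Writes separate csv files for each year of data.

    start_year -- the first year of data we want
    end_year -- the last year of data we want
    all_data -- DataFrame with multi-year data
    """
    # range() excludes its stop value, so add 1 to include end_year.
    for year in range(start_year, end_year + 1):
        one_year_csv_writer(year, all_data)

# Hypothetical stand-in for surveys_df.
surveys_df = pd.DataFrame({
    "year": [1977, 1978, 1978, 1979],
    "weight": [40.0, 48.0, 29.0, 46.0],
})
os.makedirs("data/yearly_files", exist_ok=True)

# Write one file per year from 1977 to 1979 inclusive.
yearly_data_csv_writer(1977, 1979, surveys_df)
```

Because this version loops over every number in the range, it would also write a (header-only) file for a year that has no rows in the data.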

### Challenge (15 mins)

1. **Add two arguments** to the functions we wrote that take the *path* of the
   directory where the files will be written and the *root* of the file name.
   Create a new set of files with a different name in a different directory.
2. Make the functions **return a list** of the files they have written. There are
   many ways you can do this (and you should try them all!): either of the
   functions can print to screen, either can use a return statement to give back
   numbers or strings to their function call, or you can use some combination of
   the two. You could also try using the `os` library to list the contents of
   directories.

---

But what if our dataset doesn't start in 1977 and end in 2002? We can modify the
function so that it looks for the start and end years in the dataset if those
dates are not provided:

In [None]:
# define function using None defaults
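
A minimal sketch of such a test function is shown below. It only works out and returns the years rather than writing any files, and the toy DataFrame is a hypothetical stand-in for `surveys_df`.

```python
import pandas as pd

def yearly_data_arg_test(all_data, start_year=None, end_year=None):
    """
    Returns the first and last year that would be processed.

    all_data -- DataFrame with multi-year data
    start_year -- the first year of data we want (default: first year in all_data)
    end_year -- the last year of data we want (default: last year in all_data)
    """
    # Fall back to the earliest year in the data if none was given.
    if start_year is None:
        start_year = all_data["year"].min()
    # Fall back to the latest year in the data if none was given.
    if end_year is None:
        end_year = all_data["year"].max()
    return start_year, end_year

# Hypothetical stand-in for surveys_df.
surveys_df = pd.DataFrame({"year": [1977, 1985, 1993, 2002]})

# No years given: both come from the data.
default_years = yearly_data_arg_test(surveys_df)
print(default_years)

# Both years given: the data is not consulted.
given_years = yearly_data_arg_test(surveys_df, 1988, 1993)
print(given_years)

# Only the end year given, by keyword.
mixed_years = yearly_data_arg_test(surveys_df, end_year=1993)
print(mixed_years)
```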

The body of the test function now has two conditional blocks (**if statements**) that
check the values of `start_year` and `end_year`. If statements execute the body of
the 'block' when some condition is met.


## If statements

`if` statements work like the boolean logic we saw earlier when we created masks to select our data.
They commonly look something like this:

```python
favourite = "apple"

if favourite == "apple":  # meets first condition?
    print("Ace! I love apples!")

elif favourite == "banana":  # did not meet first condition. meets second condition?
    print("I'm bonkers for bananas too!")

# did not meet first or second condition. meets third condition?
elif favourite == "durian":
    print("Ohh...")

else:  # met no conditions
    print("I've never heard of that fruit.")
```

In [None]:
# Create an if statement that checks if a value `a` is positive, negative or zero.
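
One way to write this cell is sketched below; the variable `sign` is an arbitrary name used to hold the result.

```python
a = 5

if a < 0:  # meets first condition?
    sign = "negative"
elif a > 0:  # did not meet first condition. meets second condition?
    sign = "positive"
else:  # met neither condition, so a must be zero
    sign = "zero"

print(a, "is", sign)
```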


Change the value of `a` to see how this `if` statement works.
The statement `elif` means "else if", and all of the conditional statements must end in a colon.

The `if` statements in the function `yearly_data_arg_test` check whether the variable names `start_year` and `end_year` have been given a value.
If those variables are `None`, the conditions evaluate to `True` and the body of each `if` statement is executed.
On the other hand, if the variable names are associated with some value (they got a number in the function call), the conditions evaluate to `False` and the bodies are skipped.
The opposite conditional statements, which would be `True` if the variables had received a value in the function call, would be `if start_year` and `if end_year` (or, more strictly, `if start_year is not None`, since a bare `if start_year` also treats `0` as false).

### Challenge (10 mins)

1. Rewrite the `one_year_csv_writer` and `yearly_data_csv_writer` functions to use `None` as default for the years.
2. Modify the functions so that they don't create yearly files if there is no
   data for a given year and display an alert to the user (Hint: use conditional
   statements to do this)

# 7. A Brief intro to Matplotlib and Seaborn

[Matplotlib](https://matplotlib.org/) is an extremely flexible plotting package for Python, but it lacks an intuitive interface and the default plots are a bit unpleasant.

There are many libraries that build on top of the matplotlib foundations to provide a more convenient way to visualise data.
We already saw Pandas' `.plot()` method, which calls some matplotlib functions.
Other popular libraries include [plotnine](https://plotnine.readthedocs.io/en/stable/) which works a little bit like [`ggplot2`](https://ggplot2.tidyverse.org/) in R.

Here we'll look more at some examples from another package called [seaborn](https://seaborn.pydata.org/index.html) which has some great tutorials and an [example gallery](https://seaborn.pydata.org/examples/index.html) with code to pick and choose from.

Let's first look at matplotlib, then we'll use seaborn to recreate some earlier plots.

In [None]:
# Import matplotlib and seaborn
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

Matplotlib plots consist of a `Figure` and one or more `Axes` objects.
To create these objects we use the `subplots()` function from `matplotlib.pyplot`.

The axes are the things that we actually plot on.
So we can call methods like `.hist()` on the `Axes` to show data.

In [None]:
# Plot a histogram of the weights in surveys_df
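
A minimal sketch of this cell, using a handful of made-up weights as a stand-in for `surveys_df`. The bin count and axis labels are arbitrary choices.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for surveys_df, including one missing weight.
surveys_df = pd.DataFrame({"weight": [40.0, 48.0, None, 29.0, 46.0, 36.0, 52.0]})

# subplots() returns the Figure and, by default, a single Axes.
fig, ax = plt.subplots()

# Drop missing weights before plotting; they cannot be placed in a bin.
counts, bin_edges, patches = ax.hist(surveys_df["weight"].dropna(), bins=5)

# Label the axes so the plot can be read on its own.
ax.set_xlabel("weight")
ax.set_ylabel("count")
```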

We can also plot to these `Axes` objects from other packages.

In [None]:
# Specify the `ax` to plot for pd `.hist()` method.
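
For example, the pandas `.hist()` method accepts an `ax` argument, so the histogram is drawn on an `Axes` we created ourselves. The data below is a made-up stand-in for `surveys_df`.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for surveys_df.
surveys_df = pd.DataFrame({"weight": [40.0, 48.0, None, 29.0, 46.0, 36.0, 52.0]})

# Create the Figure and Axes ourselves...
fig, ax = plt.subplots()

# ...and tell pandas to draw on that Axes rather than making a new one.
surveys_df["weight"].hist(ax=ax, bins=5)
ax.set_xlabel("weight")
```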

By default, most plotting functions and methods that use matplotlib create a new figure and axis for you if you don't provide one.

Seaborn is another library that provides convenient plotting from dataframes.

In [None]:
# Construct a scatterplot of year vs weight using seaborn
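
A sketch of the seaborn call, again with a small made-up DataFrame in place of `surveys_df`. The column names are passed as strings and looked up in `data`.

```python
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for surveys_df.
surveys_df = pd.DataFrame({
    "year": [1977, 1977, 1978, 1979, 1979],
    "weight": [40.0, 48.0, 29.0, 46.0, 36.0],
})

# seaborn returns the matplotlib Axes it drew on.
ax = sns.scatterplot(x="year", y="weight", data=surveys_df)
```

Seaborn labels the axes with the column names automatically.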

We can tell the package to use a more "ggplot"-like style using the `set` function.

In [None]:
# Set the style to "darkgrid" using set. 
sns.set(style="darkgrid")

In [None]:
# Plot the same scatterplot

Next we can plot a bar chart of summaries.

In [None]:
# Group surveys_df by sex and species and plot the mean weights.
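
One way to build the summary is sketched below, with made-up data standing in for `surveys_df`. Calling `.reset_index()` turns the group labels back into ordinary columns, which is the layout seaborn expects.

```python
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for surveys_df.
surveys_df = pd.DataFrame({
    "species_id": ["DM", "DM", "DM", "PF", "PF", "PF"],
    "sex": ["F", "F", "M", "F", "M", "M"],
    "weight": [40.0, 44.0, 48.0, 8.0, 7.0, 9.0],
})

# Mean weight for every species/sex combination, as a tidy DataFrame.
mean_weights = (
    surveys_df
    .groupby(["species_id", "sex"])["weight"]
    .mean()
    .reset_index()
)

# One bar per species, coloured by sex.
ax = sns.barplot(x="species_id", y="weight", hue="sex", data=mean_weights)
```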

The reason you'll want to know some matplotlib is that it allows you to create more complex plots, for example grids of plots, or to adjust small features of a plot.

In [None]:
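# Create one figure with two side-by-side axes.
fig, axes = plt.subplots(ncols=2)

# Left panel: mean weight per species, split by sex.
sns.barplot(
    x="species_id",
    y="weight",
    hue="sex",
    data=mean_weights,
    ax=axes[0]
)

# Right panel: individual weights over time.
sns.scatterplot(x="year", y="weight", data=surveys_df, ax=axes[1])

# Rotate and shrink the species labels so they don't overlap.
axes[0].set_xticklabels(
    axes[0].get_xticklabels(),
    rotation=90,
    fontsize=7,
    horizontalalignment="center"
)

# Adjust the spacing so the two panels don't overlap.
fig.tight_layout()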

## Saving figures

We can save a `Figure` using the `.savefig()` method.

In [None]:
# create and save a figure.
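
A minimal sketch, using made-up numbers and a hypothetical file name. The file format is inferred from the extension, and `dpi` controls the resolution of raster formats such as PNG.

```python
import os
import matplotlib.pyplot as plt

# Create a simple figure to save (the values are made up for illustration).
fig, ax = plt.subplots()
ax.plot([1977, 1978, 1979], [40.0, 42.5, 41.0])
ax.set_xlabel("year")
ax.set_ylabel("mean weight")

# Hypothetical file name; use ".pdf" or ".svg" instead for vector output.
fig.savefig("mean_weight_by_year.png", dpi=150)

# Confirm that the file was written.
print(os.path.exists("mean_weight_by_year.png"))
```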
