# Homework 8: Investigating Mammalian Fecundity and Conservation using Filtering, Joins, and Arithmetic

## Logistics

**Due date**: The homework is due at 11:59pm on Tuesday, March 11.

You will submit your work on [MarkUs](https://markus.teach.cs.toronto.edu/markus/main/login_remote_auth).
To submit your work:

1. Download this file (`Homework_8.ipynb`) from JupyterHub. (See [our JupyterHub Guide](../../../guides/jupyterhub_guide.ipynb) for detailed instructions.)
2. Submit this file to MarkUs under the **hw8** assignment. (See [our MarkUs Guide](../../../guides/markus_guide.ipynb) for detailed instructions.)

All homeworks will take place in a Jupyter notebook (like this one).

## Introduction

For this week's homework, we are going to continue to work with the PanTHERIA dataset and the IUCN categories.

We will create a new metric using the PanTHERIA data that estimates: how many offspring do individuals within each species produce throughout their lifetime, on average? We call this "lifetime fecundity". We will be looking to see whether there is a relationship between average lifetime fecundity and a species' risk of going extinct.

In this homework, you will:

* Start a data story in a notebook exploring the question: is the number of offspring birthed by a lineage related to its risk of extinction?
* Write and use advanced Boolean expressions to filter specific observations in our dataset. (Specifically, you're encouraged to practice using comparison operators such as `!=`, `<=`, `>=`, `>`, `<`.)
* Join two related datasets to create a larger, more comprehensive dataset.
* Perform arithmetic on several pandas `Series` to estimate the maximum theoretical number of offspring that mothers within each species are capable of bearing throughout their lifetime.

### Question

The overarching question you're answering in this homework:

> **Is there a difference in IUCN category between species with smaller mean lifetime fecundity and species with larger mean lifetime fecundity?**

## Problem 1: Read in the data files

Import the raw data from the PanTHERIA (`PanTHERIA_WR05_Aug2008.csv`) and phylacine (`phylacine.csv`) datasets and name the `DataFrame`s as `pantheria_raw` and `iucn_raw`, respectively.
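
As a reminder of how `pd.read_csv` behaves, here is a minimal sketch. It reads from an in-memory string rather than a real file, and the names `csv_text` and `toy_raw` and the data values are made up for illustration; they are not part of this homework's datasets.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = "species,mass_g\nMus musculus,19.3\nFelis catus,4000.0\n"

# pd.read_csv accepts a file path or a file-like object and returns a DataFrame.
toy_raw = pd.read_csv(io.StringIO(csv_text))
print(toy_raw.head())
```

With a real file, you would pass the file's path as a string instead of the `io.StringIO` object.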

In [None]:
# The following code is provided for you; please do not change it.
import pandas as pd
pd.set_option('mode.chained_assignment', None) 

# Write your code here



In [None]:
# Check your work
display(pantheria_raw.head())
display(iucn_raw.head())

## Problem 2: Cleaning the data

You'll now perform various data cleaning operations on these two datasets, similar to what you did last week.
At each step, we've specified a variable to store the result in, so that all of your work can be autograded.
Note that as we saw in lecture, all of these steps create a new `DataFrame`, rather than modifying an existing `DataFrame`. (That makes it easier for you to check your work at each step.)
You should use the result of the previous step as the "input" of the next step.

### Problem 2a: Cleaning the PanTHERIA data

1. Extract just the columns `'MSW05_Order'`, `'MSW05_Binomial'`, `'23-1_SexualMaturityAge_d'`, `'14-1_InterbirthInterval_d'`, `'17-1_MaxLongevity_m'`, and `'15-1_LitterSize'`, in the order listed.
    Store the resulting `DataFrame` in `pantheria_data`.
    
    You are encouraged, but not required, to create a new list variable to store the column names, just like we did in lecture.

2. Rename the columns according to the table below. Store the result in `pantheria_data_renamed`.

    | Old column name              | New column name            |
    |------------------------------|----------------------------|
    | `MSW05_Order`                | `Order`                    |
    | `MSW05_Binomial`             | `Genus_Species`            |
    | `23-1_SexualMaturityAge_d`   | `Age to Maturity (days)`   |
    | `14-1_InterbirthInterval_d`  | `Interbirth Interval (days)` |
    | `17-1_MaxLongevity_m`        | `Max Longevity (months)`   |
    | `15-1_LitterSize`            | `Litter Size`              |

3. Use the `DataFrame.convert_dtypes()` method to automatically convert each column into its most appropriate type, storing the resulting `DataFrame` in a variable called `pantheria_data_converted`.

4. Finally, use the `DataFrame.replace(old, new)` method to replace all occurrences of `-999` with `pd.NA`. Store the result in a variable called `pantheria_data_clean`.
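
The four steps above follow a common pattern: select, rename, convert, replace. Here is a sketch of that pattern on a small made-up `DataFrame`. All of the names and values (`toy`, `name_raw`, `mass_raw`, and so on) are hypothetical and are not the columns you need for this problem.

```python
import pandas as pd

# Hypothetical toy data; -999 plays the role of a "missing value" code.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "name_raw": ["a", "b", "c"],
    "mass_raw": [10.5, -999, 3.2],
    "unused": [0, 0, 0],
})

# Step 1: keep only some columns, in the order given by the list.
columns_to_keep = ["name_raw", "mass_raw"]
toy_data = toy[columns_to_keep]

# Step 2: rename columns using an {old name: new name} dictionary.
toy_renamed = toy_data.rename(columns={"name_raw": "Name", "mass_raw": "Mass (g)"})

# Step 3: let pandas choose a more specific type for each column.
toy_converted = toy_renamed.convert_dtypes()

# Step 4: replace the placeholder value with a proper missing value.
toy_clean = toy_converted.replace(-999, pd.NA)
print(toy_clean)
```

Each step returns a new `DataFrame`, so `toy_converted` still contains `-999` after Step 4 has run.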

In [1]:
# Write your code here

# Check your work
pantheria_data_clean.head()

### Problem 2b: Cleaning the IUCN data

1. Extract just the columns `'Binomial.1.2'` and `'IUCN.Status.1.2'`. Store the resulting `DataFrame` in `iucn_data`.

2. Rename the columns to `Genus_Species_IUCN` and `IUCN Status`, respectively. Store the resulting `DataFrame` in `iucn_data_renamed`.

3. Convert column types using `DataFrame.convert_dtypes`, and store the resulting `DataFrame` in `iucn_data_clean`.

In [None]:
# Write your code here

# Check your work
iucn_data_clean.head()

## Problem 3: Merging the DataFrames

Now let's do something we just learned this week: merge the two `DataFrame`s together.
To do so, we'll need to make sure that the two "Genus_Species" columns in the `DataFrame`s match.
We'll take an approach that is similar to, but slightly different from, the one we used in lecture.

### Problem 3a: String formatting

1. Create a new `Series` called `genus_species_formatted` that consists of the `'Genus_Species'` column from `pantheria_data_clean`, except with all spaces (`" "`) replaced by underscores (`"_"`).
    To do this, you'll need to extract the right column from the `DataFrame` and then use the `Series.str.replace(old, new)` method on the column.

2. *Modify* `pantheria_data_clean` by adding the `Series` from the previous step to it under the column name `'Genus_Species_Formatted'`.

    *Reminder*: because your code for this question actually modifies `pantheria_data_clean`, if you want to restart you should re-run all cells above this one (in the JupyterHub menu, select Cell -> Run All Above).
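
Here is a small sketch of the same two moves (string replacement on a column, then adding the result as a new column) on made-up data. The `DataFrame` `toy` and its `City` column are hypothetical, and the replacement character is deliberately different from the one this problem asks for.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"City": ["New York", "Los Angeles", "Toronto"]})

# .str.replace works element by element and returns a new Series.
city_formatted = toy["City"].str.replace(" ", "-")

# Assigning to a new column name modifies toy in place.
toy["City_Formatted"] = city_formatted
print(toy)
```

Note that the original `City` column is left unchanged; only the new column holds the formatted strings.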

In [None]:
# Write your code here

# Check your work
pantheria_data_clean.head()

### Problem 3b: Merge the two `DataFrame`s

Merge `pantheria_data_clean` and `iucn_data_clean` using the `pd.merge` function.
You'll need to determine the appropriate arguments for `left_on` and `right_on`.

Name the resulting `DataFrame` `joined_pantheria_iucn_data`.
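
To illustrate how `left_on` and `right_on` work when the matching columns have different names, here is a sketch on two made-up `DataFrame`s. The names `left`, `right`, `student_id`, and `id` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: the key column has a different name in each DataFrame.
left = pd.DataFrame({"student_id": ["s1", "s2", "s3"], "grade": [90, 75, 82]})
right = pd.DataFrame({"id": ["s2", "s3", "s4"], "college": ["A", "B", "C"]})

# By default, pd.merge keeps only the rows whose key appears in BOTH DataFrames.
joined = pd.merge(left, right, left_on="student_id", right_on="id")
print(joined)
```

Because the key columns have different names, both of them appear in the result. Rows with no match on the other side (`s1` and `s4` here) are dropped by the default inner join.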

In [None]:
# Write your code here

# Check your work
joined_pantheria_iucn_data.head()

## Problem 4: Eliminate irrelevant IUCN categories

Now that we have our joined `DataFrame`, we're almost ready to perform the computation necessary to answer our question.
But first, the IUCN status values `'DD'` and `'EP'` are not useful to us, so we'll remove them.

1. Extract all rows from `joined_pantheria_iucn_data` with IUCN categories OTHER THAN `'DD'` and `'EP'`. Name this resulting `DataFrame` `pantheria_iucn_clean`.

    You are strongly encouraged to create your own variable to store the *boolean `Series`* you're using as a filter. You'll need to use a comparison operator (e.g., `==` or `!=`) along with one of the two logical operators, either `&` or `|`.
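
As a reminder of the syntax, here is a sketch of combining two comparisons into one boolean `Series` on made-up data. The `DataFrame` `toy` and its columns are hypothetical, and the condition is intentionally different from the one you need here, so you will still have to decide which logical operator fits this problem.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "Name": ["a", "b", "c", "d"],
    "Mass": [5, 12, 30, 8],
    "Colour": ["red", "red", "blue", "blue"],
})

# Each comparison needs its own parentheses, because & and | bind
# more tightly than comparison operators like >= and != in Python.
keep = (toy["Mass"] >= 10) & (toy["Colour"] != "red")

# Indexing with a boolean Series keeps only the rows where it is True.
toy_filtered = toy[keep]
print(toy_filtered)
```

Printing `keep` on its own is a quick way to check that the filter marks the rows you expect before you apply it.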

In [None]:
# Write your code here

# Check your work
pantheria_iucn_clean

## Problem 5: Computing fecundity

Using `pantheria_iucn_clean`, you will estimate a new measurement that we will call `Max Fecundity`.

This will be computed using the following columns:

`'Age to Maturity (days)'`: How long it takes for the average individual to grow to maturity. This is measured in days as the interval between birth and the time when the individual first reproduces.
 
`'Max Longevity (months)'`: How long individuals within each species can live, expressed in months.

`'Interbirth Interval (days)'`: How long do adult females wait, on average, between giving birth and becoming pregnant again?

`'Litter Size'`: How many babies do females within each species have at one time, on average?

The **maximum fecundity** of a species is calculated using the following formula:

$$ \frac{\text{max longevity} - \text{age to maturity}}{\text{interbirth interval}} \times \text{litter size}
$$
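
To get a feel for the formula, here is a worked example for a single made-up species, with every duration first expressed in years. All of the numbers and variable names below are hypothetical and do not come from the PanTHERIA data.

```python
# Hypothetical values for one imaginary species.
max_longevity_months = 240      # lives up to 20 years
age_to_maturity_days = 730      # first reproduces at 2 years old
interbirth_interval_days = 365  # one litter per year
litter_size = 2                 # two offspring per litter

# Express all three durations in years so that the units cancel.
max_longevity_years = max_longevity_months / 12
age_to_maturity_years = age_to_maturity_days / 365
interbirth_interval_years = interbirth_interval_days / 365

# (reproductive years / years per litter) * offspring per litter
max_fecundity = (
    (max_longevity_years - age_to_maturity_years)
    / interbirth_interval_years
    * litter_size
)
print(max_fecundity)  # 36.0
```

In words: this imaginary species has 18 reproductive years, has one litter per year, and has two offspring per litter, for a maximum of 36 offspring.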

### Problem 5a: Adding the column

Your task is to add a new column called `'Max Fecundity'` to `pantheria_iucn_clean` that contains the maximum fecundity of each species. Do not perform any rounding.

**NOTE**: currently, the age to maturity, longevity, and interbirth interval columns use different units. You'll first need to convert them to *years* by dividing by 365 (for days) or 12 (for months) before you can use the above formula.
Do not modify the existing `pantheria_iucn_clean` for these unit conversions; instead, use new variables to store the converted `Series`.
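
Here is a sketch of the general technique (convert units into new `Series`, then combine them into a new column) on an unrelated, made-up dataset. The `DataFrame` `trips` and its columns are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with mismatched units.
trips = pd.DataFrame({
    "Distance (m)": [1500, 3000],
    "Duration (min)": [30, 90],
})

# Dividing a Series by a number returns a NEW Series;
# the columns stored in trips are not changed.
distance_km = trips["Distance (m)"] / 1000
duration_h = trips["Duration (min)"] / 60

# Arithmetic between two Series is performed row by row.
trips["Speed (km/h)"] = distance_km / duration_h
print(trips)
```

After this runs, `trips` has one extra column, and its original two columns still hold the values in metres and minutes.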

In [None]:
# Write your code here

# Check your work
pantheria_iucn_clean

### Problem 5b: Sort

Finally, use the `DataFrame.sort_values` method to sort `pantheria_iucn_clean` in ascending order of its `'Max Fecundity'` column. You may, but are not required to, store the result in a variable.
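
As a quick reminder, here is a sketch of `sort_values` on made-up data. The `DataFrame` `toy` and its columns are hypothetical.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"Name": ["a", "b", "c"], "Score": [3.5, 1.0, 2.25]})

# sort_values sorts in ascending order by default and returns a new DataFrame.
toy_sorted = toy.sort_values("Score")
print(toy_sorted)
```

The original `toy` keeps its row order, which is why storing the result in a variable can be useful.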


In [None]:
# Write your code here


## Problem 6: Computing the average Max Fecundity for each IUCN Status

You will now calculate the average `Max Fecundity` value for each IUCN Status group.

Like in the lecture, this will involve two steps:
1. Group `pantheria_iucn_clean` by the `IUCN Status` column, using the `DataFrame.groupby()` method.
2. Compute the `mean` of the `Max Fecundity` column for the grouped data.

Store the output of these steps in a new variable called `iucn_avg_fecundity`. This variable should be of type `Series`, and associate each IUCN category with the average of the `Max Fecundity` values for the species in that category.

You may store the output of Step 1 in another variable, if you wish, or chain both steps together in one expression.
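
Here is a sketch of the group-then-average pattern on made-up data, written as two separate steps. The `DataFrame` `toy` and its `Group` and `Value` columns are hypothetical.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "Group": ["x", "y", "x", "y", "z"],
    "Value": [1.0, 2.0, 3.0, 6.0, 5.0],
})

# Step 1: group the rows by the values in one column.
grouped = toy.groupby("Group")

# Step 2: select one column and average it within each group.
# The result is a Series indexed by the group labels.
group_means = grouped["Value"].mean()
print(group_means)
```

By default, `mean` skips missing values, so a group's average is computed only from the rows that have a value.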

In [11]:
# Write your code here


In [None]:
# This code is provided to check your work. Do not modify it.
print(type(iucn_avg_fecundity))
display(iucn_avg_fecundity.sort_values())

## Conclusion

Based on your analysis, answer each of these questions:
 
1. Explain, in biological terms, what our new `'Max Fecundity'` column measures. __(3 marks)__
2. What can you say about the relationship between the IUCN Status and the average maximum fecundity of species? __(3 marks)__

**WRITE YOUR RESPONSE HERE.**