# Homework 8: Investigating Mammalian Fecundity and Conservation using Filtering, Joins, and Arithmetic

## Logistics

**Due date**: The homework is due 17:00 (5:00pm) on Tuesday, March 12.

You will submit your work on [MarkUs](https://markus-ds.teach.cs.toronto.edu).
To submit your work:

1. Download this file (`Homework_8.ipynb`) from JupyterHub. (See [our JupyterHub Guide](../../../guides/jupyterhub_guide.ipynb) for detailed instructions.)
2. Submit this file to MarkUs under the **hw8** assignment. (See [our MarkUs Guide](../../../guides/markus_guide.ipynb) for detailed instructions.)

All homeworks will take place in a Jupyter notebook (like this one).

## Introduction

For this week's homework, we are going to continue to work with the PanTHERIA dataset and the IUCN categories.

We will create a new metric using the PanTHERIA data that estimates: how many offspring do individuals within each species produce throughout their lifetime, on average? We call this "lifetime fecundity". We will be looking to see whether there is a relationship between average lifetime fecundity and a species' risk of going extinct.

In this homework, you will:

* Start a data story in a notebook exploring the question: is the number of offspring birthed by a lineage related to its risk of extinction?
* Write and use advanced Boolean expressions to filter specific observations in our dataset. (Specifically, you're encouraged to practice using comparison operators such as `!=`, `<=`, `>=`, `>`, `<`.)
* Join two related datasets to create a larger, more comprehensive dataset.
* Perform arithmetic on several pandas series to estimate the maximum theoretical number of offspring that mothers within each species are capable of producing throughout their lifetime.

### Question

The overarching question you're answering in this homework:

> **Is there a difference in IUCN category between species with smaller mean lifetime fecundity and species with larger mean lifetime fecundity?**

## Problem 1: Read in the data files

Import the raw data from the PanTHERIA (`PanTHERIA_WR05_Aug2008.csv`) and phylacine (`phylacine.csv`) datasets and name the `DataFrame`s as `pantheria_raw` and `iucn_raw`, respectively.

In [1]:
# The following code is provided for you; please do not change it.
import pandas as pd
pd.set_option('mode.chained_assignment', None) 

# Write your code here
pantheria_raw = pd.read_csv('PanTHERIA_WR05_Aug2008.csv')
iucn_raw = pd.read_csv('phylacine.csv')

# Check your work
display(pantheria_raw.head())
display(iucn_raw.head())

Unnamed: 0,MSW05_Order,MSW05_Family,MSW05_Genus,MSW05_Species,MSW05_Binomial,1-1_ActivityCycle,5-1_AdultBodyMass_g,8-1_AdultForearmLen_mm,13-1_AdultHeadBodyLen_mm,2-1_AgeatEyeOpening_d,...,26-6_GR_MinLong_dd,26-7_GR_MidRangeLong_dd,27-1_HuPopDen_Min_n/km2,27-2_HuPopDen_Mean_n/km2,27-3_HuPopDen_5p_n/km2,27-4_HuPopDen_Change,28-1_Precip_Mean_mm,28-2_Temp_Mean_01degC,30-1_AET_Mean_mm,30-2_PET_Mean_mm
0,Artiodactyla,Camelidae,Camelus,dromedarius,Camelus dromedarius,3,492714.47,-999.0,-999.0,-999.0,...,-999.0,-999.0,-999,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0
1,Carnivora,Canidae,Canis,adustus,Canis adustus,1,10392.49,-999.0,745.32,-999.0,...,-17.53,13.0,0,35.2,1.0,0.14,90.75,236.51,922.9,1534.4
2,Carnivora,Canidae,Canis,aureus,Canis aureus,2,9658.7,-999.0,827.53,7.5,...,-17.05,45.74,0,79.29,0.0,0.1,44.61,217.23,438.02,1358.98
3,Carnivora,Canidae,Canis,latrans,Canis latrans,2,11989.1,-999.0,872.39,11.94,...,-168.12,-117.6,0,27.27,0.0,0.06,53.03,58.18,503.02,728.37
4,Carnivora,Canidae,Canis,lupus,Canis lupus,2,31756.51,-999.0,1055.0,14.01,...,-171.84,3.9,0,37.87,0.0,0.04,34.79,4.82,313.33,561.11


Unnamed: 0,Binomial.1.2,Order.1.2,Family.1.2,Genus.1.2,Species.1.2,Terrestrial,Marine,Freshwater,Aerial,Life.Habit.Method,...,Mass.Comparison,Mass.Comparison.Source,Island.Endemicity,IUCN.Status.1.2,Added.IUCN.Status.1.2,Diet.Plant,Diet.Vertebrate,Diet.Invertebrate,Diet.Method,Diet.Source
0,Abditomys_latidens,Rodentia,Muridae,Abditomys,latidens,1,0,0,0,Reported,...,,,Occurs only on isolated islands,DD,No,100,0,0,Reported 000 Species,"Wilman, H., et al. 2014. EltonTraits 1.0: spec..."
1,Abeomelomys_sevia,Rodentia,Muridae,Abeomelomys,sevia,1,0,0,0,Reported,...,,,Occurs on large land bridge islands,LC,No,78,3,19,Imputed 000 Species,PHYLACINE 1.2
2,Abrawayaomys_ruschii,Rodentia,Cricetidae,Abrawayaomys,ruschii,1,0,0,0,Reported,...,,,Occurs on mainland,LC,No,88,1,11,Imputed 000 Species,PHYLACINE 1.2
3,Abrocoma_bennettii,Rodentia,Abrocomidae,Abrocoma,bennettii,1,0,0,0,Reported,...,,,Occurs on mainland,LC,No,100,0,0,Reported 000 Species,"Wilman, H., et al. 2014. EltonTraits 1.0: spec..."
4,Abrocoma_boliviensis,Rodentia,Abrocomidae,Abrocoma,boliviensis,1,0,0,0,Reported,...,,,Occurs on mainland,CR,No,100,0,0,Reported 000 Species,"Wilman, H., et al. 2014. EltonTraits 1.0: spec..."


## Problem 2: Cleaning the data

You'll now perform various data cleaning operations on these two datasets, similar to what you did last week.
At each step, we've specified a variable to store the result in, so that all of your work can be autograded.
Note that as we saw in lecture, all of these steps create a new `DataFrame`, rather than modifying an existing `DataFrame`. (That makes it easier for you to check your work at each step.)
You should use the result of the previous step as the "input" of the next step.

### Problem 2a: Cleaning the PanTHERIA data

1. Extract just the columns `'MSW05_Order'`, `'MSW05_Binomial'`, `'23-1_SexualMaturityAge_d'`, `'14-1_InterbirthInterval_d'`, `'17-1_MaxLongevity_m'`, and `'15-1_LitterSize'`, in the order listed.
    Store the resulting `DataFrame` in `pantheria_data`.
    
    You are encouraged, but not required, to create a new list variable to store the column names, just like we did in lecture.

2. Rename the columns according to the table below. Store the result in `pantheria_data_renamed`.

    | Old column name              | New column name            |
    |------------------------------|----------------------------|
    | `MSW05_Order`                | `Order`                    |
    | `MSW05_Binomial`             | `Genus_Species`            |
    | `23-1_SexualMaturityAge_d`   | `Age to Maturity (days)`   |
    | `14-1_InterbirthInterval_d`  | `Interbirth Interval (days)` |
    | `17-1_MaxLongevity_m`        | `Max Longevity (months)`   |
    | `15-1_LitterSize`            | `Litter Size`              |

3. Use the `DataFrame.convert_dtypes()` method to automatically convert each column into its most appropriate type, storing the resulting `DataFrame` in a variable called `pantheria_data_converted`.

4. Finally, use the `DataFrame.replace(old, new)` method to replace all occurrences of `-999` with `pd.NA`. Store the result in a variable called `pantheria_data_clean`.
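
The four steps above can be tried on a small table first. The following is a minimal sketch, assuming a hypothetical `toy_raw` frame whose values are invented for illustration (they are not PanTHERIA data); only two of the real column names are reused.

```python
import pandas as pd

# Hypothetical toy data (not from PanTHERIA); -999 marks a missing value.
toy_raw = pd.DataFrame({
    'MSW05_Binomial': ['Aaa bbb', 'Ccc ddd'],
    '15-1_LitterSize': [2.5, -999.0],
    'Unused': [1, 2],
})

# Step 1: keep only the columns of interest, in the listed order.
toy_data = toy_raw[['MSW05_Binomial', '15-1_LitterSize']]

# Step 2: rename columns with an {old name: new name} dictionary.
toy_renamed = toy_data.rename(columns={'MSW05_Binomial': 'Genus_Species',
                                       '15-1_LitterSize': 'Litter Size'})

# Step 3: let pandas choose the most appropriate dtype for each column.
toy_converted = toy_renamed.convert_dtypes()

# Step 4: swap the -999 placeholder for a proper missing value.
toy_clean = toy_converted.replace(-999, pd.NA)

print(toy_clean)
```

Each step returns a new `DataFrame`, so `toy_raw` and the intermediate variables are left unchanged and can be inspected individually.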

In [2]:
important_columns = ['MSW05_Order', 'MSW05_Binomial', '23-1_SexualMaturityAge_d',
                     '14-1_InterbirthInterval_d', '17-1_MaxLongevity_m', '15-1_LitterSize']
pantheria_data = pantheria_raw[important_columns]

pantheria_new_column_names = {'MSW05_Order': 'Order',
                              'MSW05_Binomial': 'Genus_Species',
                              '23-1_SexualMaturityAge_d': 'Age to Maturity (days)',
                              '14-1_InterbirthInterval_d': 'Interbirth Interval (days)',
                              '17-1_MaxLongevity_m': 'Max Longevity (months)',
                              '15-1_LitterSize': 'Litter Size'}
pantheria_data_renamed = pantheria_data.rename(columns=pantheria_new_column_names)

pantheria_data_converted = pantheria_data_renamed.convert_dtypes()

pantheria_data_clean = pantheria_data_converted.replace(-999, pd.NA)

# Check your work
pantheria_data_clean.head()

Unnamed: 0,Order,Genus_Species,Age to Maturity (days),Interbirth Interval (days),Max Longevity (months),Litter Size
0,Artiodactyla,Camelus dromedarius,1947.94,614.41,480.0,0.98
1,Carnivora,Canis adustus,249.88,,137.0,4.5
2,Carnivora,Canis aureus,371.23,365.0,192.0,3.74
3,Carnivora,Canis latrans,372.9,365.0,262.0,5.72
4,Carnivora,Canis lupus,679.37,365.0,354.0,4.98


### Problem 2b: Cleaning the IUCN data

1. Extract just the columns `'Binomial.1.2'` and `'IUCN.Status.1.2'`. Store the resulting `DataFrame` in `iucn_data`.

2. Rename the columns to `Genus_Species_IUCN` and `IUCN Status`, respectively. Store the resulting `DataFrame` in `iucn_data_renamed`.

3. Convert column types using `DataFrame.convert_dtypes`, and store the resulting `DataFrame` in `iucn_data_clean`.

In [3]:
important_columns = ['Binomial.1.2', 'IUCN.Status.1.2']
iucn_data = iucn_raw[important_columns]

iucn_data_new_column_names = {
    'Binomial.1.2': 'Genus_Species_IUCN',
    'IUCN.Status.1.2': 'IUCN Status'
}
iucn_data_renamed = iucn_data.rename(columns=iucn_data_new_column_names)

iucn_data_clean = iucn_data_renamed.convert_dtypes()

# Check your work
iucn_data_clean.head()

Unnamed: 0,Genus_Species_IUCN,IUCN Status
0,Abditomys_latidens,DD
1,Abeomelomys_sevia,LC
2,Abrawayaomys_ruschii,LC
3,Abrocoma_bennettii,LC
4,Abrocoma_boliviensis,CR


## Problem 3: Merging the DataFrames

Now let's do something we just learned this week: merge the two `DataFrame`s together.
To do so, we'll need to make sure that the two "Genus_Species" columns in the `DataFrame`s match.
We'll take a similar, but slightly different, approach from the one we used in lecture.

### Problem 3a: String formatting

1. Create a new `Series` called `genus_species_formatted` that consists of the `'Genus_Species'` column from `pantheria_data_clean`, except with all spaces (`" "`) replaced by underscores (`"_"`).
    To do this, you'll need to extract the right column from the `DataFrame` and then use the `Series.str.replace(old, new)` method on the column.

2. *Modify* `pantheria_data_clean` by adding the `Series` from the previous step to it under the column name `'Genus_Species_Formatted'`.

    *Reminder*: because your code for this question actually modifies `pantheria_data_clean`, if you want to restart you should re-run all cells above this one (in the JupyterHub menu, select Cell -> Run All Above).
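
As a quick illustration of the string replacement in step 1, here is a minimal sketch on a standalone `Series` holding two species names that appear in the tables above. The variable names `names` and `formatted` are hypothetical, and `regex=False` is an optional argument added here to make the literal match explicit.

```python
import pandas as pd

# Two species names copied from the PanTHERIA output above.
names = pd.Series(['Canis lupus', 'Bos javanicus'])

# .str.replace works element-wise on a Series of strings;
# regex=False treats " " as a literal character, not a pattern.
formatted = names.str.replace(' ', '_', regex=False)

print(formatted.tolist())
```

The original `names` is not modified; `.str.replace` returns a new `Series`, which is why step 2 assigns the result back into the `DataFrame` as a new column.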

In [4]:
genus_species_formatted = pantheria_data_clean['Genus_Species'].str.replace(" ", "_")

pantheria_data_clean['Genus_Species_Formatted'] = genus_species_formatted

# Check your work
pantheria_data_clean.head()

Unnamed: 0,Order,Genus_Species,Age to Maturity (days),Interbirth Interval (days),Max Longevity (months),Litter Size,Genus_Species_Formatted
0,Artiodactyla,Camelus dromedarius,1947.94,614.41,480.0,0.98,Camelus_dromedarius
1,Carnivora,Canis adustus,249.88,,137.0,4.5,Canis_adustus
2,Carnivora,Canis aureus,371.23,365.0,192.0,3.74,Canis_aureus
3,Carnivora,Canis latrans,372.9,365.0,262.0,5.72,Canis_latrans
4,Carnivora,Canis lupus,679.37,365.0,354.0,4.98,Canis_lupus


### Problem 3b: Merge the two `DataFrame`s

Merge `pantheria_data_clean` and `iucn_data_clean` using the `pd.merge` function.
You'll need to determine the appropriate arguments for `left_on` and `right_on`.

Name the resulting `DataFrame` `joined_pantheria_iucn_data`.
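
To see how `left_on` and `right_on` behave when the key columns have different names, here is a minimal sketch on two hypothetical toy tables (`left` and `right`, with invented values). It relies on the default `how='inner'` behaviour of `pd.merge`, which keeps only keys found in both tables.

```python
import pandas as pd

# Hypothetical toy tables; the key columns have different names.
left = pd.DataFrame({'name_left': ['a_b', 'c_d', 'e_f'],
                     'x': [1, 2, 3]})
right = pd.DataFrame({'name_right': ['a_b', 'e_f', 'g_h'],
                      'status': ['LC', 'EN', 'VU']})

# Default how='inner': only keys present in both tables survive.
merged = pd.merge(left=left, right=right,
                  left_on='name_left', right_on='name_right')

print(merged)
```

Rows with keys `'c_d'` and `'g_h'` are dropped because each appears in only one table, and both key columns are kept in the result, just as both `Genus_Species_Formatted` and `Genus_Species_IUCN` appear in the merged homework data.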

In [5]:
joined_pantheria_iucn_data = pd.merge(
    left=pantheria_data_clean,
    right=iucn_data_clean, 
    left_on='Genus_Species_Formatted',
    right_on='Genus_Species_IUCN'
)

# Check your work
joined_pantheria_iucn_data.head()

Unnamed: 0,Order,Genus_Species,Age to Maturity (days),Interbirth Interval (days),Max Longevity (months),Litter Size,Genus_Species_Formatted,Genus_Species_IUCN,IUCN Status
0,Artiodactyla,Camelus dromedarius,1947.94,614.41,480.0,0.98,Camelus_dromedarius,Camelus_dromedarius,EP
1,Carnivora,Canis adustus,249.88,,137.0,4.5,Canis_adustus,Canis_adustus,LC
2,Carnivora,Canis aureus,371.23,365.0,192.0,3.74,Canis_aureus,Canis_aureus,LC
3,Carnivora,Canis latrans,372.9,365.0,262.0,5.72,Canis_latrans,Canis_latrans,LC
4,Carnivora,Canis lupus,679.37,365.0,354.0,4.98,Canis_lupus,Canis_lupus,LC


## Problem 4: Eliminate irrelevant IUCN categories

Now that we have our joined `DataFrame`, we're almost ready to perform the computation necessary to answer our question.
But first, the IUCN status values `'DD'` and `'EP'` are not useful to us, so we'll remove them.

1. Extract all rows from `joined_pantheria_iucn_data` with IUCN categories OTHER THAN `'DD'` and `'EP'`. Name this resulting `DataFrame` `pantheria_iucn_clean`.

    You are strongly encouraged to create your own variable to store the *boolean `Series`* you're using as a filter. You'll need to use a comparison operator (e.g., `==` or `!=`) along with one of the two logical operators, either `&` or `|`.
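
The filtering pattern described above can be sketched on a hypothetical four-row table (`toy`, with invented species labels). The `isin` variant is shown only as an equivalent alternative, not as the expected solution.

```python
import pandas as pd

# Hypothetical toy table with invented species labels.
toy = pd.DataFrame({'species': ['s1', 's2', 's3', 's4'],
                    'IUCN Status': ['LC', 'DD', 'EP', 'EN']})

# Each comparison yields a boolean Series. The parentheses are required
# because & binds more tightly than != in Python.
keep = (toy['IUCN Status'] != 'DD') & (toy['IUCN Status'] != 'EP')
filtered = toy[keep]

# Equivalent alternative: negate a membership test with ~.
same = toy[~toy['IUCN Status'].isin(['DD', 'EP'])]

print(filtered)
```

Note that `&` is the right operator here: a row should be kept only if its status is not `'DD'` and also not `'EP'`. Combining the two `!=` comparisons with `|` would keep every row, since no status can equal both values at once.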

In [7]:
# pantheria_iucn_clean
nomiss = (joined_pantheria_iucn_data['IUCN Status'] != 'DD') & (joined_pantheria_iucn_data['IUCN Status'] != 'EP')

pantheria_iucn_clean = joined_pantheria_iucn_data[nomiss]

# Check your work
pantheria_iucn_clean

Unnamed: 0,Order,Genus_Species,Age to Maturity (days),Interbirth Interval (days),Max Longevity (months),Litter Size,Genus_Species_Formatted,Genus_Species_IUCN,IUCN Status
1,Carnivora,Canis adustus,249.88,,137.0,4.5,Canis_adustus,Canis_adustus,LC
2,Carnivora,Canis aureus,371.23,365.0,192.0,3.74,Canis_aureus,Canis_aureus,LC
3,Carnivora,Canis latrans,372.9,365.0,262.0,5.72,Canis_latrans,Canis_latrans,LC
4,Carnivora,Canis lupus,679.37,365.0,354.0,4.98,Canis_lupus,Canis_lupus,LC
5,Artiodactyla,Bos javanicus,797.31,,318.96,1.22,Bos_javanicus,Bos_javanicus,EN
...,...,...,...,...,...,...,...,...,...
4941,Rodentia,Zygogeomys trichopus,,,,,Zygogeomys_trichopus,Zygogeomys_trichopus,EN
4942,Rodentia,Zyzomys argurus,155.06,219.0,,2.76,Zyzomys_argurus,Zyzomys_argurus,LC
4943,Rodentia,Zyzomys maini,,,,,Zyzomys_maini,Zyzomys_maini,VU
4944,Rodentia,Zyzomys pedunculatus,,,,,Zyzomys_pedunculatus,Zyzomys_pedunculatus,CR


## Problem 5: Computing fecundity

Using `pantheria_iucn_clean`, you will estimate a new measurement, the maximum lifetime fecundity, which we will store in a column called `'Max Fecundity'`.

This will be computed using the following columns:

`'Age to Maturity (days)'`: How long it takes for the average individual to grow to maturity. This is measured in days as the interval between birth and the time when the individual first reproduces.
 
`'Max Longevity (months)'`: How long can individuals within each species live, expressed in months.

`'Interbirth Interval (days)'`: How long do adult females wait, on average, between giving birth and becoming pregnant again?

`'Litter Size'`: How many babies do females within each species have at one time, on average?

The **maximum fecundity** of a species is calculated using the following formula:

$$ \frac{\text{max longevity} - \text{age to maturity}}{\text{interbirth interval}} \times \text{litter size}
$$

### Problem 5a: Adding the column

Your task is to add a new column called `'Max Fecundity'` to `pantheria_iucn_clean` that contains the maximum fecundity of each species. Do not perform any rounding.

**NOTE**: currently, the age to maturity, max longevity, and interbirth interval columns use different units. You'll first need to convert them to *years* by dividing by 365 (for days) or 12 (for months) before you can use the above formula.
Do not modify the existing `pantheria_iucn_clean` for these unit conversions; instead, use new variables to store the converted `Series`.
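
As a sanity check on the formula and the unit conversions, the arithmetic can be worked by hand for a single species. The sketch below uses the *Canis aureus* values shown in the earlier tables; the variable names are chosen here for illustration.

```python
# Values for Canis aureus, copied from the tables above.
age_to_maturity_days = 371.23
max_longevity_months = 192.0
interbirth_interval_days = 365.0
litter_size = 3.74

# Convert everything to years before combining.
maturity_yr = age_to_maturity_days / 365
longevity_yr = max_longevity_months / 12
interbirth_yr = interbirth_interval_days / 365

# (reproductive years / years per litter) * offspring per litter
max_fecundity = (longevity_yr - maturity_yr) / interbirth_yr * litter_size

print(round(max_fecundity, 6))
```

Here the species lives 16 years, matures after roughly 1.02 years, and breeds once per year, giving about 14.98 litters of 3.74 offspring each, or roughly 56 offspring. The same operators applied to whole `Series` perform this calculation for every row at once.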

In [8]:
maturity_yr = pantheria_iucn_clean['Age to Maturity (days)'] / 365
longevity_yr = pantheria_iucn_clean['Max Longevity (months)'] / 12
interbirth_yr = pantheria_iucn_clean['Interbirth Interval (days)'] / 365
litter_size_series = pantheria_iucn_clean['Litter Size']

max_lifetime_fecundity = (((longevity_yr - maturity_yr) / interbirth_yr) * litter_size_series)
pantheria_iucn_clean['Max Fecundity'] = max_lifetime_fecundity

# Check your work
pantheria_iucn_clean

Unnamed: 0,Order,Genus_Species,Age to Maturity (days),Interbirth Interval (days),Max Longevity (months),Litter Size,Genus_Species_Formatted,Genus_Species_IUCN,IUCN Status,Max Fecundity
1,Carnivora,Canis adustus,249.88,,137.0,4.5,Canis_adustus,Canis_adustus,LC,
2,Carnivora,Canis aureus,371.23,365.0,192.0,3.74,Canis_aureus,Canis_aureus,LC,56.036164
3,Carnivora,Canis latrans,372.9,365.0,262.0,5.72,Canis_latrans,Canis_latrans,LC,119.042864
4,Carnivora,Canis lupus,679.37,365.0,354.0,4.98,Canis_lupus,Canis_lupus,LC,137.640787
5,Artiodactyla,Bos javanicus,797.31,,318.96,1.22,Bos_javanicus,Bos_javanicus,EN,
...,...,...,...,...,...,...,...,...,...,...
4941,Rodentia,Zygogeomys trichopus,,,,,Zygogeomys_trichopus,Zygogeomys_trichopus,EN,
4942,Rodentia,Zyzomys argurus,155.06,219.0,,2.76,Zyzomys_argurus,Zyzomys_argurus,LC,
4943,Rodentia,Zyzomys maini,,,,,Zyzomys_maini,Zyzomys_maini,VU,
4944,Rodentia,Zyzomys pedunculatus,,,,,Zyzomys_pedunculatus,Zyzomys_pedunculatus,CR,


### Problem 5b: Sort

Finally, use the `DataFrame.sort_values` method to sort `pantheria_iucn_clean` in ascending order of its `'Max Fecundity'` column. You may, but are not required to, store the result in a variable.
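
A minimal sketch of how `sort_values` treats missing values, using a hypothetical three-row table (`toy`, with invented numbers):

```python
import pandas as pd

# Hypothetical toy table; one fecundity value is missing.
toy = pd.DataFrame({
    'species': ['s1', 's2', 's3'],
    'Max Fecundity': pd.array([12.5, pd.NA, 3.0], dtype='Float64'),
})

# sort_values returns a new DataFrame. Ascending order is the default,
# and missing values are placed last (na_position='last').
toy_sorted = toy.sort_values(by='Max Fecundity')

print(toy_sorted)
```

This is why rows with a missing `'Max Fecundity'` show up at the bottom of the sorted homework output, and why the original row labels (the index) appear out of order after sorting.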


In [9]:
pantheria_iucn_clean.sort_values(by="Max Fecundity")

Unnamed: 0,Order,Genus_Species,Age to Maturity (days),Interbirth Interval (days),Max Longevity (months),Litter Size,Genus_Species_Formatted,Genus_Species_IUCN,IUCN Status,Max Fecundity
493,Diprotodontia,Lagostrophus fasciatus,391.96,365.0,48.0,0.9,Lagostrophus_fasciatus,Lagostrophus_fasciatus,VU,2.633523
4918,Diprotodontia,Wyulda squamicaudata,834.72,365.0,72.0,1.01,Wyulda_squamicaudata,Wyulda_squamicaudata,NT,3.750227
3548,Dasyuromorphia,Pseudantechinus ningbing,348.62,365.0,24.0,4.0,Pseudantechinus_ningbing,Pseudantechinus_ningbing,LC,4.179507
1124,Microbiotheria,Dromiciops gliroides,556.85,365.0,38.4,3.0,Dromiciops_gliroides,Dromiciops_gliroides,NT,5.023151
353,Cetacea,Balaena mysticetus,6041.21,1642.5,480.0,1.0,Balaena_mysticetus,Balaena_mysticetus,LC,5.210831
...,...,...,...,...,...,...,...,...,...,...
4941,Rodentia,Zygogeomys trichopus,,,,,Zygogeomys_trichopus,Zygogeomys_trichopus,EN,
4942,Rodentia,Zyzomys argurus,155.06,219.0,,2.76,Zyzomys_argurus,Zyzomys_argurus,LC,
4943,Rodentia,Zyzomys maini,,,,,Zyzomys_maini,Zyzomys_maini,VU,
4944,Rodentia,Zyzomys pedunculatus,,,,,Zyzomys_pedunculatus,Zyzomys_pedunculatus,CR,


## Problem 6: Computing the average Max Fecundity for each IUCN Status

You will now calculate the average `Max Fecundity` value for each IUCN Status group.

Like in the lecture, this will involve two steps:
1. Group `pantheria_iucn_clean` by the `IUCN Status` column, using the `DataFrame.groupby()` method.
2. Compute the `mean` of the `Max Fecundity` column for the grouped data.

Store the output of these steps in a new variable called `iucn_avg_fecundity`. This variable should be of type `Series`, and associate each IUCN category with the average of the `Max Fecundity` values for the species in that category.

You may store the output of Step 1 in another variable, if you wish, or chain both the steps together in one command.
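
The two steps can be sketched on a hypothetical toy table (`toy`, with invented values) to show how missing values affect the group means:

```python
import pandas as pd

# Hypothetical toy table; some fecundity values are missing.
toy = pd.DataFrame({
    'IUCN Status': ['LC', 'LC', 'EN', 'EN', 'EX'],
    'Max Fecundity': pd.array([10.0, 30.0, 4.0, pd.NA, pd.NA],
                              dtype='Float64'),
})

# Step 1: group the rows by status.
grouped = toy.groupby('IUCN Status')

# Step 2: select one column and average it within each group.
avg = grouped['Max Fecundity'].mean()

print(avg)
```

By default, `mean` skips missing values within a group, so the `'EN'` average here is based on its single non-missing value. A group with no non-missing values at all, like `'EX'` in this sketch, has a missing mean.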

In [11]:
iucn_avg_fecundity = pantheria_iucn_clean.groupby('IUCN Status')['Max Fecundity'].mean()

In [19]:
# This code is provided to check your work. Do not modify it.
print(type(iucn_avg_fecundity))
display(iucn_avg_fecundity.sort_values())

<class 'pandas.core.series.Series'>


IUCN Status
CR    24.656159
VU     28.08233
EN     28.21208
NT    56.979062
LC    63.847946
EW         <NA>
EX         <NA>
Name: Max Fecundity, dtype: Float64

## Conclusion

Based on your analysis, answer each of these questions:
 
1. Explain, in biological terms, what our new `'Max Fecundity'` column measures. __(3 marks)__
2. What can you say about the relationship between the IUCN Status and the average maximum fecundity of species? __(3 marks)__

**Sample Answers**

**1.** It measures the average number of offspring produced by a species in their reproductive years. The numerator computes the number of reproductive years, and this is divided by the time between successive reproductions to obtain the average number of times a species can reproduce. This is then multiplied by the litter size to get the number of offspring a species can produce in its lifetime (a short explanation is acceptable).

**2.** The average max fecundity is nearly consistently lower for species that are more endangered according to the IUCN categorization (2 marks). Lower-risk species (LC, NT) have much higher average max fecundity than those that are vulnerable, endangered, or critically endangered (VU, EN, CR), though the relationship is not strictly monotonic (EN is slightly higher than VU, for example) (1 mark).