# Homework 4 - Investigating Mammalian Fecundity and Conservation using Filtering, Joins, and Arithmetic
 


# Introduction

For this week's homework, we are going to continue to work with the Pantheria Dataset and the IUCN categories. 

We will create a new metric using the Pantheria data that estimates: what is the maximum number of offspring that individuals within each species can theoretically produce throughout their lifetime? We call this 'maximum lifetime fecundity'. We will be looking to see whether there is a relationship between maximum lifetime fecundity and a species' risk of going extinct. 


# Question

The overarching question you'll be working toward answering in this homework:

**_Is there a difference in conservation risk between species with smaller maximum lifetime fecundity and species with larger maximum lifetime fecundity?_**

Answering this question requires a lot of steps, so this week we will be preparing our data and computing our maximum lifetime fecundity metric so that we can take a closer look at this question in next week's assignment.

# Lab Instructions and Learning Objectives

Just like in the previous homework, you will be creating and submitting a data story answering a data science question. You will be required to submit your work in the same format as last time, complete with sections for *Introduction*, *Data*, *Methods*, *Computation*, and *Conclusion*.

In this lab, you will:
* Start a data story in a notebook exploring the question: is the number of offspring birthed by a lineage related to its risk of extinction?
* Write and use advanced Boolean expressions to filter specific observations in our dataset. (Specifically, you're encouraged to practice using comparison operators such as `!=`, `<=`, `>=`, `>`, `<`.)
* Join two related datasets to create a larger, more comprehensive dataset.
* Perform arithmetic on several pandas series to estimate a `max_lifetime_fecundity` metric, which is an estimate of the maximum theoretical number of offspring that mothers within each species are capable of producing throughout their lifetime.


# Due date 

You will submit your completed Homework 4 on MarkUs by <mark>XX</mark>. We will send an announcement in a couple days when autotesting has been set up on MarkUs.

# EEB: How to submit

1. Download your homework to your local computer and save it as `EEB125_Homework_4.ipynb`.
2. Log in here: https://markus-ds.teach.cs.toronto.edu.
3. Submit your homework to `HW4: Homework 4`.

# Marking Rubric


Section     | 0 | 1 | 2 | 3
------------|---|---|---|---
Introduction|The question is not stated correctly or is left blank | The question is stated correctly | NA | NA 
Data (for each Python variable)       |Autotest fails | Autotest passes | NA | NA 
Methods (for each part) | No answer | The data extracted is specified or a reasonable rationale is given, but not both | Both the data extracted is specified and a reasonable rationale is given | NA
Computation |Autotest fails | Autotest passes | NA | NA 
Conclusion (for each part) | No answer | The question is answered but no explanation is given | The question is answered but the explanation is not supported or weakly supported by the data | The question is answered and the explanation is supported by the data 

Maximum grade: **35**

<mark> to be modified </mark>


# Introduction section

This should introduce the question being explored in a sentence. __(1 mark)__

# Data section

+ Step 1: Import the raw data from Pantheria (`pantheria.txt`) and Phylacine (`phylacine.csv`) as `DataFrame`s using pandas and assign them to the variables: 

+ `pantheria_raw`: the `DataFrame` created by reading the `pantheria.txt` file. __(1 mark)__
+ `iucn_raw`: the `DataFrame` created by reading the `phylacine.csv` file. __(1 mark)__

In [63]:
import pandas as pd

pantheria_raw = pd.read_csv('pantheria.txt', sep = '\t')
iucn_raw = pd.read_csv('phylacine.csv')
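
As a minimal sketch of why `sep = '\t'` is needed for the Pantheria file, the example below reads a tiny tab-delimited table from an in-memory string, once with the default comma separator and once with a tab separator. The two-row table is a shortened stand-in for the real file (an assumption for illustration), and the names `tab_text`, `wrong`, and `right` are hypothetical.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for a tab-delimited file such as pantheria.txt.
tab_text = (
    "MSW05_Order\tMSW05_Binomial\t15-1_LitterSize\n"
    "Carnivora\tCanis lupus\t4.98\n"
    "Carnivora\tCanis aureus\t3.74\n"
)

# With the default separator (a comma), each line is read as a single wide column.
wrong = pd.read_csv(io.StringIO(tab_text))

# Telling pandas that the fields are separated by tabs gives one column per field.
right = pd.read_csv(io.StringIO(tab_text), sep='\t')

print(wrong.shape)  # number of (rows, columns) with the wrong separator
print(right.shape)  # number of (rows, columns) with the tab separator
```

Checking `.shape` or `.columns` right after reading a file is a quick way to confirm that the separator was chosen correctly.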


Step 2: create new dataframes containing only the columns we need for this homework.  __(1 mark)__

 
 +  `pantheria_data`: the `DataFrame` containing only the relevant columns from `pantheria_raw`: `'MSW05_Order'`, `'MSW05_Binomial'`, `'23-1_SexualMaturityAge_d'`, `'14-1_InterbirthInterval_d'`, `'17-1_MaxLongevity_m'`, and `'15-1_LitterSize'`. __(1 mark)__ 
 
 
 +  `iucn_data`: the `DataFrame` containing only the relevant columns from `iucn_raw`: `'Binomial.1.2'` and `'IUCN.Status.1.2'`. __(1 mark)__ 





In [64]:
important_columns = ['MSW05_Order', 'MSW05_Binomial', '23-1_SexualMaturityAge_d', '14-1_InterbirthInterval_d', '17-1_MaxLongevity_m', 
                                '15-1_LitterSize']
pantheria_data = pantheria_raw[important_columns]

In [65]:
important_columns = ['Binomial.1.2', 'IUCN.Status.1.2']
iucn_data = iucn_raw[important_columns]

Step 3: Create dictionaries that map the old column names to new ones, and store them as variables so that we can rename our selected columns.

+  `pantheria_new_column_names`: the `dictionary` mapping the column names from `pantheria_data` to the values `'order'`, `'genus_species'`, `'maturity_d'`, `'interbirth_d'`, `'longevity_m'`, and `'litter_size_ind'`, respectively. __(1 mark)__


+ `iucn_data_new_columns`: the `dictionary` mapping the column names from `iucn_data` to `'genus_species'` and `'iucn_status'`, respectively. __(1 mark)__


+ Step 4: Rename the selected columns using the dictionaries


In [66]:
pantheria_new_column_names = {'MSW05_Order': 'order',
                              'MSW05_Binomial': 'genus_species',
                              '23-1_SexualMaturityAge_d': 'maturity_d',
                              '14-1_InterbirthInterval_d': 'interbirth_d',
                              '17-1_MaxLongevity_m': 'longevity_m',
                              '15-1_LitterSize': 'litter_size_ind'}

iucn_data_new_columns = {'Binomial.1.2':'genus_species', 
                         'IUCN.Status.1.2':'iucn_status'}

+ Step 5: Please use the `rename()` method to rename the columns in `pantheria_data` and `iucn_data` using the dictionaries that we have just created. Assign the outputs to the new variables `pantheria_data_clean` and `iucn_data_clean`.

+ `pantheria_data_clean`: the `DataFrame` that is the result of renaming the columns in `pantheria_data`. __(1 mark)__


+ `iucn_data_clean`: the `DataFrame` that is the result of renaming the columns in `iucn_data`. (We will not autotest this `DataFrame` until you have added columns, as described below.) __(1 mark)__


In [67]:
pantheria_data_clean = pantheria_data.rename(columns=pantheria_new_column_names)
pantheria_data_clean

Unnamed: 0,order,genus_species,maturity_d,interbirth_d,longevity_m,litter_size_ind
0,Artiodactyla,Camelus dromedarius,1947.94,614.41,480.0,0.98
1,Carnivora,Canis adustus,249.88,,137.0,4.50
2,Carnivora,Canis aureus,371.23,365.00,192.0,3.74
3,Carnivora,Canis latrans,372.90,365.00,262.0,5.72
4,Carnivora,Canis lupus,679.37,365.00,354.0,4.98
...,...,...,...,...,...,...
5411,Rodentia,Zyzomys argurus,155.06,219.00,,2.76
5412,Rodentia,Zyzomys maini,,,,
5413,Rodentia,Zyzomys palatilis,,,,
5414,Rodentia,Zyzomys pedunculatus,,,,


In [68]:
iucn_data_clean = iucn_data.rename(columns=iucn_data_new_columns)
iucn_data_clean

Unnamed: 0,genus_species,iucn_status
0,Abditomys_latidens,DD
1,Abeomelomys_sevia,LC
2,Abrawayaomys_ruschii,LC
3,Abrocoma_bennettii,LC
4,Abrocoma_boliviensis,CR
...,...,...
5826,Zyzomys_argurus,LC
5827,Zyzomys_maini,VU
5828,Zyzomys_palatalis,CR
5829,Zyzomys_pedunculatus,CR


+ Step 6: Please replace the spaces `" "` in the species names stored in `pantheria_data_clean['genus_species']` with underscores `"_"` so that the punctuation matches in both dataframes that we are trying to merge. Store the result back in the `'genus_species'` column of `pantheria_data_clean`.



In [69]:
pantheria_data_clean['genus_species'] = pantheria_data_clean['genus_species'].str.replace(" ","_")
pantheria_data_clean

Unnamed: 0,order,genus_species,maturity_d,interbirth_d,longevity_m,litter_size_ind
0,Artiodactyla,Camelus_dromedarius,1947.94,614.41,480.0,0.98
1,Carnivora,Canis_adustus,249.88,,137.0,4.50
2,Carnivora,Canis_aureus,371.23,365.00,192.0,3.74
3,Carnivora,Canis_latrans,372.90,365.00,262.0,5.72
4,Carnivora,Canis_lupus,679.37,365.00,354.0,4.98
...,...,...,...,...,...,...
5411,Rodentia,Zyzomys_argurus,155.06,219.00,,2.76
5412,Rodentia,Zyzomys_maini,,,,
5413,Rodentia,Zyzomys_palatilis,,,,
5414,Rodentia,Zyzomys_pedunculatus,,,,


Step 7: Merge (Join) `pantheria_data_clean` and `iucn_data_clean`

+  `joined_pantheria_iucn_data`: the `DataFrame` created as a result of *joining* the `DataFrames` `pantheria_data_clean` and `iucn_data_clean`. Let `pantheria_data_clean` be the `left` dataframe, and `iucn_data_clean` be the `right` dataframe. Join on the column `genus_species`. __(1 mark)__

In [70]:

joined_pantheria_iucn_data = pantheria_data_clean.merge(iucn_data_clean, 
                                                right_on= 'genus_species', # the right data frame is iucn 
                                                left_on='genus_species')   # the left data frame is pantheria 

joined_pantheria_iucn_data.head()

Unnamed: 0,order,genus_species,maturity_d,interbirth_d,longevity_m,litter_size_ind,iucn_status
0,Artiodactyla,Camelus_dromedarius,1947.94,614.41,480.0,0.98,EP
1,Carnivora,Canis_adustus,249.88,,137.0,4.5,LC
2,Carnivora,Canis_aureus,371.23,365.0,192.0,3.74,LC
3,Carnivora,Canis_latrans,372.9,365.0,262.0,5.72,LC
4,Carnivora,Canis_lupus,679.37,365.0,354.0,4.98,LC
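
The merge above uses pandas' default inner join, so a species that appears in only one of the two tables is dropped without any message. A minimal sketch of one way to list such species is to repeat the merge as a left join with `indicator=True` and look at the rows marked `left_only`. The toy tables below reuse a few names from the outputs above, where Pantheria spells one species `Zyzomys_palatilis` and the IUCN table spells it `Zyzomys_palatalis`; the names `left_toy`, `right_toy`, `checked`, and `unmatched` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for pantheria_data_clean (left) and iucn_data_clean (right).
left_toy = pd.DataFrame({
    'genus_species': ['Canis_lupus', 'Zyzomys_maini', 'Zyzomys_palatilis'],
    'order': ['Carnivora', 'Rodentia', 'Rodentia'],
})
right_toy = pd.DataFrame({
    'genus_species': ['Canis_lupus', 'Zyzomys_maini', 'Zyzomys_palatalis'],
    'iucn_status': ['LC', 'VU', 'CR'],
})

# Default (inner) join: only species whose names match exactly are kept.
inner = left_toy.merge(right_toy, on='genus_species')

# Left join with an indicator column: every left row is kept, and the
# '_merge' column records whether a matching right row was found.
checked = left_toy.merge(right_toy, on='genus_species', how='left', indicator=True)
unmatched = checked[checked['_merge'] == 'left_only']['genus_species'].tolist()

print(len(inner))   # number of species that survived the inner join
print(unmatched)    # species in the left table with no match in the right table
```

This check is not required for the autotested variables; it is only a way to see how many species are lost to mismatched names.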


Step 8: Eliminate irrelevant IUCN categories: `DD` and `EP`. In a markdown cell <mark> below </mark>, describe why we are eliminating these IUCN categories. __(1 mark)__

+ `pantheria_iucn_clean`: Eliminate rows with IUCN categories `DD` and `EP` (missing data and errors) from our dataset `joined_pantheria_iucn_data`. 


We will check the value of these variables in the autotester. You may wish to use a few other intermediate variables along the way, but we're not autotesting those.


In [71]:
# pantheria_iucn_clean

nomiss = (joined_pantheria_iucn_data['iucn_status'] != 'DD') & (joined_pantheria_iucn_data['iucn_status'] != 'EP')

pantheria_iucn_clean = joined_pantheria_iucn_data[nomiss]
pantheria_iucn_clean


Unnamed: 0,order,genus_species,maturity_d,interbirth_d,longevity_m,litter_size_ind,iucn_status
1,Carnivora,Canis_adustus,249.88,,137.00,4.50,LC
2,Carnivora,Canis_aureus,371.23,365.00,192.00,3.74,LC
3,Carnivora,Canis_latrans,372.90,365.00,262.00,5.72,LC
4,Carnivora,Canis_lupus,679.37,365.00,354.00,4.98,LC
5,Artiodactyla,Bos_javanicus,797.31,,318.96,1.22,EN
...,...,...,...,...,...,...,...
4941,Rodentia,Zygogeomys_trichopus,,,,,EN
4942,Rodentia,Zyzomys_argurus,155.06,219.00,,2.76,LC
4943,Rodentia,Zyzomys_maini,,,,,VU
4944,Rodentia,Zyzomys_pedunculatus,,,,,CR
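
The two `!=` comparisons joined with `&` can also be written with `isin()` and the `~` (not) operator, which stays readable when more categories need to be excluded. The sketch below runs both versions on a small made-up table and checks that they select the same rows. Calling `.copy()` on the filtered result is one common way to make it an independent `DataFrame`, so that columns can be added to it later without pandas warning about setting values on a slice. The names `toy`, `keep_a`, `keep_b`, and `toy_clean` are hypothetical.

```python
import pandas as pd

# Small made-up table with the same column names as joined_pantheria_iucn_data.
toy = pd.DataFrame({
    'genus_species': ['Camelus_dromedarius', 'Canis_lupus',
                      'Abditomys_latidens', 'Zyzomys_maini'],
    'iucn_status': ['EP', 'LC', 'DD', 'VU'],
})

# Version 1: two comparisons combined with & (both must be True to keep a row).
keep_a = (toy['iucn_status'] != 'DD') & (toy['iucn_status'] != 'EP')

# Version 2: isin() marks rows whose status is in the list; ~ flips True and False.
keep_b = ~toy['iucn_status'].isin(['DD', 'EP'])

# .copy() makes the filtered table independent of the original.
toy_clean = toy[keep_b].copy()

print(keep_a.equals(keep_b))              # do both filters select the same rows?
print(toy_clean['iucn_status'].tolist())  # statuses that remain after filtering
```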


# Methods section


We will be estimating a new measurement that we will call `max_lifetime_fecundity`. This will be computed using the following columns:

`'maturity_d'`: How long does it take for the average individual to grow to maturity? This is measured in days as the interval between birth and the time when the individual first reproduces.
 
`'longevity_m'`: The maximum time individuals within each species can live, expressed in months.

`'interbirth_d'`: The average time between successive births for an adult female, expressed in days.

`'litter_size_ind'`: The number of babies females within each species have at one time, on average.

1. The three measurements relating to time (`'maturity_d'`, `'longevity_m'`, and `'interbirth_d'`) are expressed in two different units. Convert each of these columns so that they are expressed in years and assign each transformed measurement to a set of new variables named `maturity_yr`, `longevity_yr`, and `interbirth_yr`. Also, create a new variable called `litter_size_series` and assign `'litter_size_ind'` to it. 

2. Estimate the maximum lifetime fecundity metric for each species using the formula: 

`((longevity-maturity)/(interbirth))*litter_size` 

What are the units of our new column? Briefly explain. __(2 marks)__ 

3. Create a new column in our `pantheria_iucn_clean` for the maximum lifetime fecundity metric previously estimated.  
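
As a worked example of the formula for a single species, the sketch below plugs in the values shown for `Canis_aureus` in the tables above (maturity 371.23 days, interbirth interval 365 days, longevity 192 months, litter size 3.74). It assumes 365 days per year and 12 months per year for the conversions. The scalar names here mirror the column names but are plain numbers, not pandas series.

```python
# Values for Canis_aureus as shown in the tables above.
maturity_d = 371.23     # days from birth to first reproduction
interbirth_d = 365.00   # days between births
longevity_m = 192.0     # maximum lifespan in months
litter_size = 3.74      # average offspring per litter

# Convert every time measurement to years (assuming 365 days and 12 months per year).
maturity_yr = maturity_d / 365
interbirth_yr = interbirth_d / 365
longevity_yr = longevity_m / 12

# Reproductive lifespan divided by the time per litter, times the litter size.
estimate = ((longevity_yr - maturity_yr) / interbirth_yr) * litter_size

print(round(estimate, 6))  # → 56.036164
```

Writing the unit next to each quantity, as in the comments above, makes it easier to track which units cancel in the final expression.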



# Computation section - Estimating maximum lifetime fecundity metric

There are a few steps to this, as outlined in the [Methods section](#Methods-section). 

## Step 1

Transform the variables of interest `'maturity_d'`, `'longevity_m'`, and `'interbirth_d'` so that the time units are all expressed in years, and name the transformed variables `maturity_yr`, `longevity_yr`, and `interbirth_yr`, respectively. Also assign the `'litter_size_ind'` column, unchanged, to `litter_size_series`. __(3 marks total/1 mark each)__

## Step 2

In this step you will calculate the maximum lifetime fecundity metric and add it as a column to `pantheria_iucn_clean`.  Calculate

`((longevity-maturity)/(interbirth))*litter_size` 

using the transformed variables from step 1, and name this variable `max_lifetime_fecundity`. __(1 mark)__

All of the variables in this section will be checked in the autotester.


In [74]:
maturity_yr = pantheria_iucn_clean['maturity_d']/365
longevity_yr = pantheria_iucn_clean['longevity_m']/12
interbirth_yr = pantheria_iucn_clean['interbirth_d']/365
litter_size_series = pantheria_iucn_clean['litter_size_ind']


In [73]:
max_lifetime_fecundity = ((longevity_yr - maturity_yr) / interbirth_yr) * litter_size_series
# assign() returns a new DataFrame with the extra column, which avoids the
# SettingWithCopyWarning raised when a column is set on a filtered slice
pantheria_iucn_clean = pantheria_iucn_clean.assign(max_lifetime_fecundity=max_lifetime_fecundity)
pantheria_iucn_clean


Unnamed: 0,order,genus_species,maturity_d,interbirth_d,longevity_m,litter_size_ind,iucn_status,max_lifetime_fecundity
1,Carnivora,Canis_adustus,249.88,,137.00,4.50,LC,
2,Carnivora,Canis_aureus,371.23,365.00,192.00,3.74,LC,56.036164
3,Carnivora,Canis_latrans,372.90,365.00,262.00,5.72,LC,119.042864
4,Carnivora,Canis_lupus,679.37,365.00,354.00,4.98,LC,137.640787
5,Artiodactyla,Bos_javanicus,797.31,,318.96,1.22,EN,
...,...,...,...,...,...,...,...,...
4941,Rodentia,Zygogeomys_trichopus,,,,,EN,
4942,Rodentia,Zyzomys_argurus,155.06,219.00,,2.76,LC,
4943,Rodentia,Zyzomys_maini,,,,,VU,
4944,Rodentia,Zyzomys_pedunculatus,,,,,CR,
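
Many rows in the output above have no estimate. Arithmetic on pandas series propagates missing values: if any input to the formula is `NaN` for a species, the result for that species is `NaN` too. A minimal sketch with three rows copied from the tables above is given below; the names `toy` and `estimate` are hypothetical.

```python
import numpy as np
import pandas as pd

# Three rows copied from the tables above; np.nan marks a missing measurement.
toy = pd.DataFrame({
    'genus_species': ['Canis_adustus', 'Canis_aureus', 'Zyzomys_argurus'],
    'maturity_d': [249.88, 371.23, 155.06],
    'interbirth_d': [np.nan, 365.00, 219.00],
    'longevity_m': [137.0, 192.0, np.nan],
    'litter_size_ind': [4.50, 3.74, 2.76],
})

# Same formula as in the homework, applied to whole columns at once.
estimate = ((toy['longevity_m'] / 12 - toy['maturity_d'] / 365)
            / (toy['interbirth_d'] / 365)) * toy['litter_size_ind']

print(estimate.isna().tolist())     # which species have no estimate
print(int(estimate.notna().sum()))  # how many species have a usable estimate
```

Counting the non-missing values in this way shows how much of the dataset can actually be used when the metric is analysed.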


In [75]:
homosap_fecundity_estimate = pantheria_iucn_clean[pantheria_iucn_clean['genus_species']=="Homo_sapiens"]['max_lifetime_fecundity']
homosap_fecundity_estimate

1541    40.839947
Name: max_lifetime_fecundity, dtype: float64

# Conclusion

Include cells with your answers to each of these questions:
 
1. Explain, in biological terms, what our new `max_lifetime_fecundity` metric measures. __(3 marks)__

2. Check the value we estimated for humans. Does our estimate seem reasonable to you? Is this higher or lower than you would expect? Can you think of any reasons that our metric might be inaccurate, especially for species like humans? __(3 marks)__

# Printing the required variables


Run the code cell at the very end of your notebook to check if you have the correct variable names:

In [76]:
print("pantheria_raw")
print(pantheria_raw)
print("iucn_raw")
print(iucn_raw)
print("pantheria_iucn_clean")
print(pantheria_iucn_clean)
print("maturity_yr")
print(maturity_yr)
print("longevity_yr") 
print(longevity_yr) 
print("interbirth_yr") 
print(interbirth_yr) 
print("litter_size_series")
print(litter_size_series)
print("max_lifetime_fecundity")
print(max_lifetime_fecundity)
print("pantheria_iucn_clean['max_lifetime_fecundity']")
print(pantheria_iucn_clean['max_lifetime_fecundity'])
print("homosap_fecundity_estimate")
print(homosap_fecundity_estimate)