### Milestone 2: Flat File Cleaning/Formatting

**Perform at least 5 data transformation and/or cleansing steps on your flat file data. For example:**

- Replace Headers
- Format data into a more readable format
- Identify outliers and bad data
- Find duplicates
- Fix casing or inconsistent values
- Conduct Fuzzy Matching

This dataset lists species held in Association of Zoos and Aquariums (AZA) facilities along with their median life expectancies (MLE). For the purposes of this project, I will treat it as an exhaustive list of all species in AZA facilities. (Many species in AZA facilities are not on this list, but that should not affect the exploratory nature of this project.) The end goal is to compare the endangered status of each species with its documented appearances in the wild, as recorded in the iNaturalist app, and with its MLE in human care facilities. With that in mind, I only want to remove animals that do not have sufficient MLE data. Some species have a separate MLE for each sex, some have data for only one sex, and some have a single combined MLE for both sexes.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

In [2]:
# I first extracted the table from my PDF into a CSV using Tabula to make it easier to use, so here I only need to load that CSV.
mle_data = pd.read_csv('aza_mle.csv')
mle_data.head()

| | Common Name | Scientific Name | MLE Males &\rFemales\rCombined | MLE Males\r(Years) | MLE Females\r(Years) |
|---|---|---|---|---|---|
| 0 | Aardvark | Orycteropus afer | - | Data deficient | 11.5 |
| 1 | Addax | Addax nasomaculatus | - | 10.7 | 13.9 |
| 2 | Agouti, Brazilian | Dasyprocta leporina | 8.2 | - | - |
| 3 | Alligator, Chinese | Alligator sinensis | 25 | - | - |
| 4 | Anoa, Lowland | Bubalus depressicornis | 17.4 | - | - |


In [3]:
# I want to replace the string 'Data deficient' with NaN to make finding the missing data easier. I will leave the dashes in the
# columns that don't have MLE because that data is not missing, it is just not necessary because the MLE is represented in a 
# different column.
mle_data = mle_data.replace('Data deficient', np.nan)
mle_data.head()

| | Common Name | Scientific Name | MLE Males &\rFemales\rCombined | MLE Males\r(Years) | MLE Females\r(Years) |
|---|---|---|---|---|---|
| 0 | Aardvark | Orycteropus afer | - | NaN | 11.5 |
| 1 | Addax | Addax nasomaculatus | - | 10.7 | 13.9 |
| 2 | Agouti, Brazilian | Dasyprocta leporina | 8.2 | - | - |
| 3 | Alligator, Chinese | Alligator sinensis | 25 | - | - |
| 4 | Anoa, Lowland | Bubalus depressicornis | 17.4 | - | - |
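
To see how many true gaps each column has after this replacement, the NaN values can be counted per column. The following is a minimal sketch on a few made-up rows in the same shape as the table above; the `sample` frame and the shortened column names are illustrative assumptions, not the real data.

```python
import numpy as np
import pandas as pd

# Illustrative rows only; the real data comes from aza_mle.csv.
sample = pd.DataFrame({
    'Common Name': ['Aardvark', 'Addax', 'Agouti, Brazilian'],
    'MLE Males': ['Data deficient', '10.7', '-'],
    'MLE Females': ['11.5', '13.9', '-'],
})

# Same replacement as in the cell above: only 'Data deficient' becomes NaN.
sample = sample.replace('Data deficient', np.nan)

# isna() does not flag the '-' placeholders, so this counts true gaps only.
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

Because the dashes are ordinary strings, they stay out of the missing-value count, which matches the intent of leaving them in place.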


In [4]:
# When perusing the PDF, I saw that there were some duplicates, so I need to deal with those. Before I do that, I will
# rename the headers so that they are easier to reference in the code.
mle_data.columns = ['common_name', 'scientific_name', 'MLE_combined', 'MLE_male', 'MLE_female']
mle_data.head()

| | common_name | scientific_name | MLE_combined | MLE_male | MLE_female |
|---|---|---|---|---|---|
| 0 | Aardvark | Orycteropus afer | - | NaN | 11.5 |
| 1 | Addax | Addax nasomaculatus | - | 10.7 | 13.9 |
| 2 | Agouti, Brazilian | Dasyprocta leporina | 8.2 | - | - |
| 3 | Alligator, Chinese | Alligator sinensis | 25 | - | - |
| 4 | Anoa, Lowland | Bubalus depressicornis | 17.4 | - | - |
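
Assigning a list to `columns` relies on the column order never changing. As a hedged alternative, the same renaming could be written as an explicit old-to-new mapping, which raises an error if an expected header is missing. This sketch assumes the raw headers are exactly the strings shown in the earlier output (including the `\r` characters); the empty `raw` frame is only a stand-in.

```python
import pandas as pd

# Stand-in frame with the raw headers as they appear in the output above.
raw = pd.DataFrame(columns=[
    'Common Name', 'Scientific Name',
    'MLE Males &\rFemales\rCombined', 'MLE Males\r(Years)', 'MLE Females\r(Years)',
])

# Map each old header to its new name explicitly, so a reordered or
# missing column is not silently mislabeled.
new_names = {
    'Common Name': 'common_name',
    'Scientific Name': 'scientific_name',
    'MLE Males &\rFemales\rCombined': 'MLE_combined',
    'MLE Males\r(Years)': 'MLE_male',
    'MLE Females\r(Years)': 'MLE_female',
}

# errors='raise' makes rename fail if any key is not an existing column.
renamed = raw.rename(columns=new_names, errors='raise')
print(list(renamed.columns))
```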


In [5]:
# Now I can confirm whether there are duplicates. I will check the scientific name, since common names can vary even within
# the same species. Then I will use drop_duplicates to drop the duplicate rows based on duplicated scientific names.
print('Scientific Name is duplicated - {}'.format(mle_data.duplicated('scientific_name').any()))

Scientific Name is duplicated - True


In [6]:
# I also want to save the original size so that I can see the change after duplicates are removed.
orig_size = mle_data.shape
print(orig_size)

(489, 5)


My first attempt used drop_duplicates on the scientific name. It dropped 16 rows, which was more than I was expecting, so I looked through the CSV in Excel to see what happened. It turns out that, because of how the PDF is laid out, 14 rows (besides the actual header) just repeat the header labels. There were also 3 animals that were each listed twice. That makes a total of 20 rows flagged as "duplicates" in four sets (the repeated headers plus the three animals). The code kept the first instance in each set, so it removed 20 - 4 = 16 rows. I also noticed that the duplicated animals had different MLE data on each of their rows. Instead of deleting those duplicates, I should merge the information from the two rows for each animal. So, I went back and commented out the original code.
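
The row counts described above can be reproduced on a toy frame. This is only a sketch: the species names are placeholders, and the frame is built to have the same structure (14 repeated header rows, 3 names listed twice, and a couple of unique names).

```python
import pandas as pd

# Toy stand-in: 14 repeated header rows, 3 names listed twice, 2 unique names.
names = (['Scientific Name'] * 14
         + ['Species A', 'Species A', 'Species B', 'Species B', 'Species C', 'Species C']
         + ['Species D', 'Species E'])
toy = pd.DataFrame({'scientific_name': names})

# keep=False flags every member of a duplicated group.
rows_in_duplicate_groups = toy.duplicated(subset='scientific_name', keep=False).sum()

# The default keep='first' spares one row per group, so fewer rows are dropped.
rows_dropped = len(toy) - len(toy.drop_duplicates(subset='scientific_name'))

print(rows_in_duplicate_groups, rows_dropped)
```

Twenty rows belong to a duplicated group, but only sixteen are dropped, because one row from each of the four groups survives.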

In [8]:
# I used duplicated to find and print all the duplicated rows so I know which ones I can remove entirely and which need to be merged.
duplicates = mle_data[mle_data.duplicated(['scientific_name'], keep=False)]
print(duplicates)

                            common_name          scientific_name  \
33                          Common Name          Scientific Name   
66                          Common Name          Scientific Name   
100                         Common Name          Scientific Name   
134                         Common Name          Scientific Name   
135   Frog, Panamanian Golden (Ahogado)          Atelopus zeteki   
136      Frog, Panamanian Golden (Sora)          Atelopus zeteki   
168                         Common Name          Scientific Name   
202                         Common Name          Scientific Name   
236                         Common Name          Scientific Name   
270                         Common Name          Scientific Name   
294             Partridge, Crested Wood         Rollulus rouloul   
304                         Common Name          Scientific Name   
337                         Common Name          Scientific Name   
365  Spider, Sapphire Ornamental Spider  Poecilo

In [9]:
# I deleted the rows with the header titles. A closer inspection of the Panamanian Golden Frog shows that the duplicate rows are
# actually two subspecies so I will leave them there. I will merge the wood partridge and the spider rows together.
mle_data_smol = mle_data[mle_data.common_name != 'Common Name']
duplicate_spec = mle_data_smol[mle_data_smol.duplicated(['scientific_name'], keep=False)]
print(duplicate_spec)

                            common_name          scientific_name MLE_combined  \
135   Frog, Panamanian Golden (Ahogado)          Atelopus zeteki          7.5   
136      Frog, Panamanian Golden (Sora)          Atelopus zeteki            -   
294             Partridge, Crested Wood         Rollulus rouloul          4.9   
365  Spider, Sapphire Ornamental Spider  Poecilotheria metallica          NaN   
366  Spider, Sapphire Ornamental Spider  Poecilotheria metallica         13.1   
485             Wood-partridge, Crested         Rollulus rouloul          4.8   

    MLE_male MLE_female  
135        -          -  
136      5.1        4.2  
294        -          -  
365      NaN        NaN  
366        -          -  
485        -          -  


In [11]:
# Using groupby, I was able to merge the spider rows because they have the same common name. The difference in common name
# between the two wood-partridge rows seems to be a data entry inconsistency. I'm not sure which MLE is the most recent, but
# since they are so close I will just delete row 469 of the grouped dataframe.
mle_sort = mle_data_smol.groupby(['common_name', 'scientific_name'], as_index=False).agg('last')
duplicates = mle_sort[mle_sort.duplicated(['scientific_name'], keep=False)]
print(duplicates)

                           common_name   scientific_name MLE_combined  \
131  Frog, Panamanian Golden (Ahogado)   Atelopus zeteki          7.5   
132     Frog, Panamanian Golden (Sora)   Atelopus zeteki            -   
286            Partridge, Crested Wood  Rollulus rouloul          4.9   
469            Wood-partridge, Crested  Rollulus rouloul          4.8   

    MLE_male MLE_female  
131        -          -  
132      5.1        4.2  
286        -          -  
469        -          -  
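
The merge works because `last` in a pandas groupby returns the last non-null entry of each column within a group, so an empty row does not overwrite a row that holds a value. A minimal sketch on made-up rows (the shortened common name and the row order are assumptions chosen to show the NaN being skipped):

```python
import numpy as np
import pandas as pd

# Two rows for the same species: the first carries the value and the
# second is empty, to show that the NaN is skipped.
toy = pd.DataFrame({
    'common_name': ['Spider', 'Spider', 'Addax'],
    'scientific_name': ['Poecilotheria metallica'] * 2 + ['Addax nasomaculatus'],
    'MLE_combined': ['13.1', np.nan, '-'],
})

# One row per (common_name, scientific_name) pair, keeping the last
# non-null value in every other column.
merged = toy.groupby(['common_name', 'scientific_name'], as_index=False).agg('last')
print(merged)
```

Note that the grouping also sorts by the key columns and resets the index, which is why the row labels change between the earlier and later outputs.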


In [12]:
# I had to do some guess and check work to find the row with the spider and make sure the row that was saved was the one with
# the MLE.
print(mle_sort.loc[[355]])

                            common_name          scientific_name MLE_combined  \
355  Spider, Sapphire Ornamental Spider  Poecilotheria metallica         13.1   

    MLE_male MLE_female  
355        -          -  
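
Instead of guessing the index label, the row can be selected by value with a boolean mask. A small sketch, using an illustrative `toy` frame in place of `mle_sort`:

```python
import pandas as pd

# Illustrative frame; in the notebook this would be mle_sort.
toy = pd.DataFrame({
    'common_name': ['Addax', 'Spider, Sapphire Ornamental Spider', 'Aardvark'],
    'scientific_name': ['Addax nasomaculatus', 'Poecilotheria metallica', 'Orycteropus afer'],
    'MLE_combined': ['-', '13.1', '-'],
})

# Select by value instead of guessing the index label.
spider = toy[toy['scientific_name'] == 'Poecilotheria metallica']
print(spider)
```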


In [13]:
# One last check after dropping the row shows that the only remaining duplicated scientific name is the Panamanian
# Golden Frog, which I am keeping on purpose.
mle_no_dupes = mle_sort.drop([469])
duplicates = mle_no_dupes[mle_no_dupes.duplicated(['scientific_name'], keep=False)]
print(duplicates)

                           common_name  scientific_name MLE_combined MLE_male  \
131  Frog, Panamanian Golden (Ahogado)  Atelopus zeteki          7.5        -   
132     Frog, Panamanian Golden (Sora)  Atelopus zeteki            -      5.1   

    MLE_female  
131          -  
132        4.2  
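
Dropping by a hard-coded label such as 469 breaks if the row order ever changes. One possible alternative, sketched here on made-up rows, is to filter on the common name and then re-run the duplicate check:

```python
import pandas as pd

# Illustrative rows; in the notebook this would be mle_sort.
toy = pd.DataFrame({
    'common_name': ['Partridge, Crested Wood', 'Wood-partridge, Crested', 'Addax'],
    'scientific_name': ['Rollulus rouloul', 'Rollulus rouloul', 'Addax nasomaculatus'],
    'MLE_combined': ['4.9', '4.8', '-'],
})

# Drop the redundant entry by name, so the step does not rely on a
# hard-coded index label.
no_dupes = toy[toy['common_name'] != 'Wood-partridge, Crested']

# Verify that no scientific name is repeated afterwards.
still_duplicated = no_dupes.duplicated(subset='scientific_name', keep=False).any()
print(still_duplicated)
```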


In [14]:
mle_no_dupes.shape

(473, 5)

In [15]:
# I used dropna with thresh=3 to keep only the rows with at least 3 non-NA values. If a row has only two non-NA values
# (the two names), it is missing all of the MLE data.
mle_final = mle_no_dupes.dropna(thresh = 3)
mle_final.head(10)

| | common_name | scientific_name | MLE_combined | MLE_male | MLE_female |
|---|---|---|---|---|---|
| 0 | Aardvark | Orycteropus afer | - | NaN | 11.5 |
| 1 | Addax | Addax nasomaculatus | - | 10.7 | 13.9 |
| 2 | Agouti, Brazilian | Dasyprocta leporina | 8.2 | - | - |
| 3 | Alligator, Chinese | Alligator sinensis | 25 | - | - |
| 4 | Anoa, Lowland | Bubalus depressicornis | 17.4 | - | - |
| 5 | Anteater, Giant | Myrmecophaga tridactyla | 19.2 | - | - |
| 6 | Antelope, Roan | Hippotragus equinus | - | NaN | 12.3 |
| 7 | Antelope, Sable | Hippotragus niger | 11 | - | - |
| 9 | Aracari, Green | Pteroglossus viridis | 7.9 | - | - |
| 10 | Argus, Great | Argusianus argus | 12.4 | - | - |


In [16]:
mle_final.shape

(372, 5)
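
One caveat with `thresh=3`: the `-` placeholders count as non-NA, so a row with a dash in one MLE column and NaN in the other two would still be kept even though it holds no number. A stricter check, sketched below on made-up rows, converts the MLE columns with `pd.to_numeric(errors='coerce')` and keeps only rows with at least one numeric value. This is an alternative under stated assumptions, not what the cell above does; `Species X` and `Species Y` are hypothetical.

```python
import numpy as np
import pandas as pd

mle_cols = ['MLE_combined', 'MLE_male', 'MLE_female']
toy = pd.DataFrame({
    'common_name': ['Aardvark', 'Species X', 'Species Y'],
    'scientific_name': ['Orycteropus afer', 'Genus x', 'Genus y'],
    'MLE_combined': ['-', np.nan, '-'],
    'MLE_male': [np.nan, np.nan, np.nan],
    'MLE_female': ['11.5', np.nan, np.nan],
})

# thresh=3 keeps rows with at least 3 non-NA cells; '-' counts as non-NA.
kept_by_thresh = toy.dropna(thresh=3)

# Coerce everything that is not a number ('-', other strings) to NaN,
# then keep rows with at least one numeric MLE.
numeric = toy[mle_cols].apply(pd.to_numeric, errors='coerce')
kept_by_numeric = toy[numeric.notna().any(axis=1)]

print(len(kept_by_thresh), len(kept_by_numeric))
```

In this toy frame the threshold keeps two rows, while the numeric check keeps only the one row that actually has an MLE value.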

The only other potentially problematic feature in this data is the MLE for male elephants.

In [17]:
print(mle_sort.loc[[117, 118]])

           common_name     scientific_name MLE_combined       MLE_male  \
117  Elephant, African  Loxodonta africana            -  See Full Repo   
118    Elephant, Asian     Elephas maximus            -  See Full Repo   

    MLE_female  
117       38.6  
118       46.9  


There is apparently a report somewhere on the MLE for male African and Asian elephants. For now, I'm going to treat those strings as if they were NaN values. I can't find the report anywhere, but I work at an AZA facility, so when I go back to work (I'm currently out of town) I'm going to ask around and see if there is more information somewhere that I can take a look at.
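
The cells shown here leave those strings in place. If they should become real NaN values before saving, one way it might be done is with the same `replace` approach used for 'Data deficient'. This sketch assumes the placeholder appears exactly as printed above; the `toy` frame is illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative rows; the placeholder string is copied from the output above.
toy = pd.DataFrame({
    'common_name': ['Elephant, African', 'Addax'],
    'MLE_male': ['See Full Repo', '10.7'],
    'MLE_female': ['38.6', '13.9'],
})

# Replace the exact placeholder string with NaN; other values are untouched.
toy['MLE_male'] = toy['MLE_male'].replace('See Full Repo', np.nan)
print(toy)
```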

In [18]:
# Store the clean dataframe as a CSV
mle_final.to_csv('dsc540_animalinfo.csv', index=False)