### Milestone 3: Website table data 

Website URL: https://awionline.org/content/list-endangered-species

*Goal*: Pull the endangered status information from the website and retrieve the specific information for the AZA animals from Milestone 2.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Store the url as a variable to make writing code easier
url = 'https://awionline.org/content/list-endangered-species'

In [3]:
# Store the contents of the page as a variable to make scraping easier. That way I don't have to keep accessing the web page.
# The read_html command returns a list with one dataframe per table found on the page.
page = pd.read_html(url)

In [4]:
# I can use len to check how many tables are in the variable. It says there are 11 and a quick check of the website does confirm
# that there are 11. The tables are divided by animal type: mammals, birds, reptiles, amphibians, fish, clams, snails, insects, 
# arachnids, crustaceans, and corals. 
len(page)

11

I already know that there are no corals or clams in my AZA information so I can remove those tables. I want to join the endangered status information from these tables with the information from my table from Milestone 2. I saved my cleaned up table from Milestone 2 as a CSV so that I can utilize it here and add the endangered status as a new column. So, I need to extract only the information about the animals that are also on the AZA cleaned up list.

I'm going to start by joining all the tables from the website into a single dataframe.

In [5]:
# By indexing into the list, I can see which table is at which position. Since the list is zero-indexed, the table order is as
# follows: 0 - mammals, 1 - birds, 2 - reptiles, 3 - amphibians, 4 - fish, 5 - clams, 6 - snails, 7 - insects, 8 - arachnids,
# 9 - crustaceans, and 10 - corals.
page[0]

Unnamed: 0,Common Name,Scientific Name,Status
0,Addax,Addax nasomaculatus,Endangered
1,"Anoa, lowland",Bubalus depressicornis,Endangered
2,"Anoa, mountain",Bubalus quarlesi,Endangered
3,"Antelope, giant sable",Hippotragus niger variani,Endangered
4,"Antelope, Tibetan",Panthalops hodgsonii,Endangered
...,...,...,...
377,"Woodrat, riparian (San Joaquin Valley)",Neotoma fuscipes riparia,Endangered
378,"Yak, wild",Bos mutus (=grunniens m.),Endangered
379,"Zebra, Grevy's",Equus grevyi,Threatened
380,"Zebra, Hartmann's mountain",Equus zebra hartmannae,Threatened


In [6]:
# I will save each dataframe as its own variable (but not including clams and corals).

mamm, birds, rept, amph, fish, snails, insects, arach, crust = page[0], page[1], page[2], page[3], page[4], page[6], page[7], page[8], page[9]

In [7]:
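# Create a master dataframe with all the information in it. Add appropriate headers (common_name, scientific_name, status).
# pd.concat is used here because DataFrame.append was removed in pandas 2.0. Like append, it keeps each table's original row index.
ess_master = pd.concat([mamm, birds, rept, amph, fish, snails, insects, arach, crust])
ess_master.columns = ['common_name', 'scientific_name', 'status']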
ess_master.head()

Unnamed: 0,common_name,scientific_name,status
0,Addax,Addax nasomaculatus,Endangered
1,"Anoa, lowland",Bubalus depressicornis,Endangered
2,"Anoa, mountain",Bubalus quarlesi,Endangered
3,"Antelope, giant sable",Hippotragus niger variani,Endangered
4,"Antelope, Tibetan",Panthalops hodgsonii,Endangered
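Because the combined dataframe keeps each table's own row numbers, the same index value can appear several times, and the animal group a row came from is lost. A minimal sketch of one way to keep that information, using two toy frames built from rows shown in this notebook (the names `first`, `second`, `group_a`, and `group_b` are hypothetical and only for illustration):

```python
import pandas as pd

# Toy stand-ins for two scraped tables; the rows are copied from the outputs above.
first = pd.DataFrame({
    "Common Name": ["Addax", "Anoa, lowland"],
    "Scientific Name": ["Addax nasomaculatus", "Bubalus depressicornis"],
    "Status": ["Endangered", "Endangered"],
})
second = pd.DataFrame({
    "Common Name": ["Zebra, Grevy's", "Zebra, Hartmann's mountain"],
    "Scientific Name": ["Equus grevyi", "Equus zebra hartmannae"],
    "Status": ["Threatened", "Threatened"],
})

# Passing a dict to pd.concat uses the keys as an outer index level,
# which can then be turned into an ordinary column.
combined = pd.concat({"group_a": first, "group_b": second}, names=["group", "row"])
combined = combined.reset_index(level="group").reset_index(drop=True)
combined.columns = ["group", "common_name", "scientific_name", "status"]

print(combined)
```

With a fresh 0..n-1 index, row labels are unique again, which makes later row lookups less ambiguous.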


In [8]:
# I'm using the same duplicates code I used in Milestone 2 to check for the same type of duplication (on scientific name) to
# see where there might be repeated animals.
duplicates = ess_master[ess_master.duplicated(['scientific_name'], keep=False)]
print(duplicates)

                                          common_name  \
5   Argali [All populations except Kyrgyzstan, Mon...   
6       Argali [Kyrgyzstan, Mongolia, and Tajikistan]   
40                                      Bear, grizzly   
41                                      Bear, grizzly   
45                                        Bison, wood   
..                                                ...   
15                              Riversnail, Anthony's   
8                            Beetle, American burying   
9                            Beetle, American burying   
51                       Butterfly, Oregon silverspot   
52                       Butterfly, Oregon silverspot   

              scientific_name      status  
5                  Ovis ammon  Endangered  
6                  Ovis ammon  Threatened  
40    Ursus arctos horribilis  Threatened  
41    Ursus arctos horribilis          XN  
45     Bison bison athabascae  Threatened  
..                        ...         ...  
15     

I can see that a few of these repeats are species with separate populations in different regions. Others are repeated because of the XN value under status. XN = Nonessential experimental population. According to the U.S. Fish and Wildlife Service, "A “nonessential” designation for a 10(j) experimental population means that, on the basis of the best available information, the experimental population is not essential for the continued existence of the species." So, most likely the repetitions here indicate that there are multiple regional groups of that species but certain regional groups are not essential. Therefore, those rows can be deleted for the purposes of this project.

In [9]:
# First, I need to replace the 'XN' values with NaN and drop those rows. Note that dropna() also drops rows with any other
# missing value, not just the former 'XN' rows.
ess_master = ess_master.replace('XN', np.nan)
ess_clean = ess_master.dropna()
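An alternative to the replace-then-dropna approach is to filter on the status column directly, which only removes the XN rows and makes it easy to count how many were dropped. This is a minimal sketch on a toy frame (`ess_toy` is a hypothetical name; the rows are taken from the duplicates output above), not the code that was actually run for this milestone:

```python
import pandas as pd

# Toy frame built from rows shown in the duplicates output above.
ess_toy = pd.DataFrame({
    "common_name": ["Bear, grizzly", "Bear, grizzly", "Bison, wood"],
    "scientific_name": [
        "Ursus arctos horribilis",
        "Ursus arctos horribilis",
        "Bison bison athabascae",
    ],
    "status": ["Threatened", "XN", "Threatened"],
})

# Boolean mask marking the nonessential experimental populations.
is_xn = ess_toy["status"].eq("XN")

# Keep everything that is not XN and count what was removed.
ess_toy_clean = ess_toy[~is_xn]
rows_dropped = len(ess_toy) - len(ess_toy_clean)

# Count the rows that still share a scientific name after filtering.
remaining_dupes = int(ess_toy_clean.duplicated("scientific_name", keep=False).sum())

print(rows_dropped, remaining_dupes)
```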

In [10]:
duplicates = ess_clean[ess_clean.duplicated(['scientific_name'], keep=False)]
print(duplicates)

                                           common_name  \
5    Argali [All populations except Kyrgyzstan, Mon...   
6        Argali [Kyrgyzstan, Mongolia, and Tajikistan]   
67                                       Deer, Barbary   
163                                            Leopard   
164              Leopard [Southern Africa populations]   
..                                                 ...   
176  Sturgeon, Atlantic (Atlantic subspecies)[Carol...   
177  Sturgeon, Atlantic (Atlantic subspecies)[Chesa...   
178  Sturgeon, Atlantic (Atlantic subspecies)[Gulf ...   
179  Sturgeon, Atlantic (Atlantic subspecies)[New Y...   
180  Sturgeon, Atlantic (Atlantic subspecies)[South...   

                     scientific_name      status  
5                         Ovis ammon  Endangered  
6                         Ovis ammon  Threatened  
67           Cervus elaphus barbarus  Endangered  
163                  Panthera pardus  Endangered  
164                  Panthera pardus  Threatened

So, at this point with 97 rows that still have duplicates (most being regional groups of the same species), I think there must be a better way to deal with this data. I only need information on the animals that match the AZA list. I don't need to spend a lot of time going through sturgeons because there are no sturgeons on that list. I need to go through the ESS list and extract only those animals.

In [11]:
aza = pd.read_csv('dsc540_animalinfo.csv')
aza_names = aza['scientific_name'].tolist()
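# This was my first attempt at using a list to extract the rows from the dataframe. It did not work and returned a table with a
# shape of (0,3). The likely cause is that aza_names is already a list, so wrapping it in [ ] asks isin to match the whole
# list as a single value.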
# ess_final = ess_clean[ess_clean['scientific_name'].isin([aza_names])]
# ess_final.shape
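The difference between a flat list and a list wrapped in another list can be seen with plain Python membership tests, which is a rough analogy for what `isin` is asked to do. A small sketch, assuming a toy series and name list (`species` and `names` are hypothetical stand-ins for the real columns):

```python
import pandas as pd

# Toy stand-ins; the scientific names are taken from tables in this notebook.
names = ["Addax nasomaculatus", "Ursus maritimus"]
species = pd.Series(["Addax nasomaculatus", "Equus grevyi", "Ursus maritimus"])

# With a flat list, each name is compared against the individual strings.
flat_hit = "Addax nasomaculatus" in names

# With a nested list, the only element is the inner list itself,
# so no individual string can match it.
nested_hit = "Addax nasomaculatus" in [names]

# Passing the flat list to isin gives the expected row filter.
matched = species[species.isin(names)].tolist()

print(flat_hit, nested_hit, matched)
```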

In [12]:
# I figured this method out on my own but it did not work as expected. The first variation extracted all the rows 
# but inserted other variables into all of the other rows. It was messy. After hours and hours of trying to find a solution, I 
# finally gave up and asked StackOverflow and got help almost immediately. Lesson learned.
# ess_aza = []

# for i in aza_names:
#     if True:
#         ess_aza.append(ess_clean.loc[ess_clean['scientific_name'] == i])
#     else:
#         return

In [13]:
# This was the solution from StackOverflow. Now I have a dataframe with (presumably) any AZA animals with endangered statuses.
clean_df = ess_clean[ess_clean['scientific_name'].isin(aza_names)]
clean_df.shape

(55, 3)

In [14]:
# It looks like I don't have to worry about duplicates with this trimmed set.
clean_df.head(55)

Unnamed: 0,common_name,scientific_name,status
0,Addax,Addax nasomaculatus,Endangered
1,"Anoa, lowland",Bubalus depressicornis,Endangered
34,"Bat, Rodrigues fruit (flying fox)",Pteropus rodricensis,Endangered
42,"Bear, Mexican grizzly",Ursus arctos,Endangered
43,"Bear, polar",Ursus maritimus,Threatened
49,"Camel, Bactrian",Camelus bactrianus,Endangered
53,"Cat, black-footed",Felis nigripes,Endangered
61,Cheetah,Acinonyx jubatus,Endangered
62,Chimpanzee,Pan troglodytes,Endangered
63,"Chimpanzee, pygmy",Pan paniscus,Endangered


In [15]:
# Now I want to add the endangered status of the 55 available species to my original AZA dataframe. I tried concat first because
# that's what I used in the other assignment. I forgot that since I have similar indices I can do a join. I tried merge, but that
# reduced the dataset to only those 55 animals because I passed only the key and left the how argument at its default ('inner').

# aza_ess = pd.concat([aza.reset_index(drop=True), clean_df.reset_index(drop=True)], axis = 1)
# aza_ess = pd.merge( aza, clean_df, on = 'scientific_name')
# aza_ess = pd.merge( aza, clean_df, how ='outer')

# Once I used how, it looked like it worked but for some reason I have more rows...
# Upon further investigation, the how ='outer' argument did not work as intended because some of the common_name values are
# different between the two datasets.
# Easiest solution to this problem will be to take the common_name column out of the clean_df dataframe. Then 'outer' will work.
clean_df = clean_df.drop(columns=['common_name'])

In [16]:
clean_df.head()

Unnamed: 0,scientific_name,status
0,Addax nasomaculatus,Endangered
1,Bubalus depressicornis,Endangered
34,Pteropus rodricensis,Endangered
42,Ursus arctos,Endangered
43,Ursus maritimus,Threatened


In [17]:
aza_ess = pd.merge(aza, clean_df, on='scientific_name', how='outer')
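Since the goal is to keep every AZA row and attach a status where one exists, a left merge on the scientific name is another option worth considering. The sketch below uses toy frames (`aza_toy` and `status_toy` are hypothetical names, with rows taken from this notebook) and two optional `pd.merge` arguments: `validate` raises an error if a scientific name appears more than once in the status table, and `indicator` adds a `_merge` column showing which rows found a match.

```python
import pandas as pd

# Toy stand-ins for the AZA table and the trimmed status table.
aza_toy = pd.DataFrame({
    "common_name": ["Aardvark", "Addax", "Anoa, Lowland"],
    "scientific_name": [
        "Orycteropus afer",
        "Addax nasomaculatus",
        "Bubalus depressicornis",
    ],
})
status_toy = pd.DataFrame({
    "scientific_name": ["Addax nasomaculatus", "Bubalus depressicornis"],
    "status": ["Endangered", "Endangered"],
})

# how="left" keeps every AZA row exactly once as long as the right-hand keys
# are unique, which validate="many_to_one" checks.
merged = pd.merge(
    aza_toy,
    status_toy,
    on="scientific_name",
    how="left",
    validate="many_to_one",
    indicator=True,
)

# Animals with no entry in the status table.
unmatched = merged.loc[merged["_merge"] == "left_only", "common_name"].tolist()

print(merged)
print(unmatched)
```

Comparing `len(merged)` with `len(aza_toy)` is a quick way to confirm that the merge did not add rows.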

In [18]:
aza_ess.head(20)

Unnamed: 0,common_name,scientific_name,MLE_combined,MLE_male,MLE_female,status
0,Aardvark,Orycteropus afer,-,,11.5,
1,Addax,Addax nasomaculatus,-,10.7,13.9,Endangered
2,"Agouti, Brazilian",Dasyprocta leporina,8.2,-,-,
3,"Alligator, Chinese",Alligator sinensis,25,-,-,Endangered
4,"Anoa, Lowland",Bubalus depressicornis,17.4,-,-,Endangered
5,"Anteater, Giant",Myrmecophaga tridactyla,19.2,-,-,
6,"Antelope, Roan",Hippotragus equinus,-,,12.3,
7,"Antelope, Sable",Hippotragus niger,11,-,-,
8,"Aracari, Green",Pteroglossus viridis,7.9,-,-,
9,"Argus, Great",Argusianus argus,12.4,-,-,


In [19]:
aza_ess.shape

(372, 6)

In [20]:
# Finally, it seems to have worked correctly. That was a lot of back and forth for something that was a simple fix.

# Now I just need to replace the NaN values in the 'status' column with 'Least Concern' based on the verbiage of the Endangered
# Species Act. Assigning the result back avoids calling replace with inplace=True on a single column, which newer pandas
# versions warn about.
aza_ess['status'] = aza_ess['status'].fillna('Least Concern')

In [21]:
aza_ess.head()

Unnamed: 0,common_name,scientific_name,MLE_combined,MLE_male,MLE_female,status
0,Aardvark,Orycteropus afer,-,,11.5,Least Concern
1,Addax,Addax nasomaculatus,-,10.7,13.9,Endangered
2,"Agouti, Brazilian",Dasyprocta leporina,8.2,-,-,Least Concern
3,"Alligator, Chinese",Alligator sinensis,25,-,-,Endangered
4,"Anoa, Lowland",Bubalus depressicornis,17.4,-,-,Endangered


In [22]:
# Store the clean dataframe as a CSV
aza_ess.to_csv('dsc540_mlestatus.csv', index=False)