## Milestone 4

*Connecting to an API/Pulling in the Data and Cleaning/Formatting*

Perform at least 5 data transformation and/or cleansing steps to your API data

I had to change my API source because the original API was not structured well for straightforward data scraping. I am now using the API for the Global Biodiversity Information Facility (GBIF), which is one of the organizations that my original API institution sends its data to.

In [1]:
import urllib.request, urllib.parse, urllib.error
import json
import pandas as pd
%matplotlib inline 
import matplotlib.pyplot as plt

In [2]:
# Variables for the pieces of the API url to search for species
site = 'https://api.gbif.org/v1/'
species = 'occurrence/'
search = 'search?q='

In [3]:
# Import the list of AZA animals to search for
aza = pd.read_csv('dsc540_animalinfo.csv')

In [4]:
# Some animals were listed with shorthand for multiple scientific names, so I manually changed them to the
# most common scientific name so that they work correctly with the API.
aza = aza.replace({
    'Lithobates (Rana) sevosa': 'Lithobates sevosa',
    'Enhydra lutris kenyoni & E. l. nereis': 'Enhydra lutris kenyoni',
    'Ailurus fulgens refulgens (styani)': 'Ailurus fulgens refulgens',
    'Bufo (Anaxyrus) houstonensis': 'Anaxyrus houstonensis',
})

In [5]:
# I turned the column of scientific names into a list for reformatting.
aza_names = aza['scientific_name'].tolist()

In [6]:
# Function to split the scientific names into genus, species, and (if applicable) subspecies. Need to split them in order to
# correctly format them for insertion into API call.
def split_name(name):
    split = []
    split.append(name.split())
    return split

In [7]:
# Function to format the split scientific name into a URL-appropriate string for the API call.
def format_name(names):
    formatted = [['%20'.join(n) for n in name] for name in names]
    return formatted

In [8]:
# Function to send a list of names to be formatted into the correct form for the API url. It uses the split_name and
# format_name functions to split and then format each name. The result is a list of lists.
def api_names(name_list):
    names = []

    for name in name_list:
        names.append(split_name(name))

    api_list = format_name(names)
    return api_list

In [9]:
master = api_names(aza_names)
# Flatten list of lists for easier access.
flat_master = [item for sublist in master for item in sublist]
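
The split-and-join on `%20` above can also be done in one step by the standard library. This is a minimal sketch, not the approach used in this notebook; the two sample names are taken from the URLs printed further down and stand in for `aza_names`.

```python
from urllib.parse import quote

# Sample names only; the real list comes from the AZA CSV.
sample_names = ['Orycteropus afer', 'Phoenicopterus ruber roseus']

# quote() percent-encodes spaces (and any other unsafe characters) in one call,
# so no separate split, join, and flatten steps are needed.
encoded = [quote(name) for name in sample_names]
print(encoded)
```

The result is already a flat list of strings, one per animal.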

In [10]:
# I want to save the information from the API for each animal to a list. I only want specific tags in the API so I've isolated
# those and will select that information from the API.
def gather_intel(file):
    intel = []
    # Note: GBIF spells this key 'basisOfRecord', so 'basisofRecord' below never matches and that field is not collected.
    file_keys = ['year', 'month', 'day', 'country', 'locality', 'species', 'basisofRecord', 'occurrenceStatus', 'occurrenceRemarks']

    for item in file:
        # Keep only the wanted keys from each occurrence record
        intel.append({key: val for key, val in item.items() if key in file_keys})
    return intel
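
To make the key filtering concrete, here is a small sketch on a single made-up record. The record and its values are hypothetical and only shaped like one entry of the API's results; the `wanted` list is a shortened version of `file_keys`.

```python
wanted = ['year', 'month', 'day', 'country', 'locality', 'species']

# Hypothetical record, shaped like one entry of the API's 'results' list.
record = {
    'key': 123,
    'species': 'Orycteropus afer',
    'country': 'Kenya',
    'year': 2014,
    'datasetName': 'example',
}

# Keys that are not wanted are dropped; wanted keys that the record lacks
# are simply absent and show up as NaN once the data is in a dataframe.
filtered = {k: v for k, v in record.items() if k in wanted}
print(filtered)
```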

In [11]:
# Access API
def find_species(animals):
    gbif_data = []
    for animal in animals:
        try:
            print('Accessing API...')
            print(site+species+search+animal)
            # Close the connection as soon as the response has been read
            with urllib.request.urlopen(site+species+search+animal) as uh:
                data = uh.read()
            file = json.loads(data)
            results = file['results'] # The occurrence records sit under the 'results' key of the response
            
            gbif_data.append(gather_intel(results))
        
        except urllib.error.URLError as e:
            print(f"ERROR: {e.reason}")
    return gbif_data
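
A possible refinement, not what this notebook does: as far as I understand the GBIF occurrence search, `q` is a free-text search, each response is one page of records (the default page size appears to be 20), and the response carries a `count` field with the total number of matches. The sketch below builds a URL that filters on `scientificName` and reads the paging fields from a parsed response. The helper names and the sample response are made up for illustration, and no network call is made.

```python
from urllib.parse import urlencode

BASE = 'https://api.gbif.org/v1/occurrence/search'

def build_occurrence_url(name, limit=20, offset=0):
    """Build a search URL that filters on scientificName instead of free text."""
    params = {'scientificName': name, 'limit': limit, 'offset': offset}
    return BASE + '?' + urlencode(params)

def summarize_page(page):
    """Pull the paging fields out of one parsed JSON response."""
    return {
        'total_matches': page.get('count'),
        'records_on_page': len(page.get('results', [])),
        'last_page': page.get('endOfRecords'),
    }

# Hypothetical response, trimmed to the paging fields.
sample_page = {'offset': 0, 'limit': 20, 'endOfRecords': False,
               'count': 1500, 'results': [{}, {}]}

print(build_occurrence_url('Orycteropus afer'))
print(summarize_page(sample_page))
```

If the number of observations per species is the main goal, `count` would give it directly without downloading every record.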

In [12]:
# Run all species through the API and collect the information using the find_species function.
new = find_species(flat_master)

Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Orycteropus%20afer
Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Addax%20nasomaculatus
Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Dasyprocta%20leporina
Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Alligator%20sinensis
Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Bubalus%20depressicornis
Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Myrmecophaga%20tridactyla
Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Hippotragus%20equinus
Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Hippotragus%20niger
Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Pteroglossus%20viridis
Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Argusianus%20argus
Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Tolypeutes%20matacas
Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Papio%20hamadryas
...

Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Elephas%20maximus
Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Polihierax%20semitorquatus
Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Phoenicopterus%20ruber
Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Phoenicopterus%20chilensis
Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Phoenicopterus%20ruber%20roseus
Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Phoeniconaias%20minor
Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Pteropus%20hypomelanus
Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Pteropus%20vampyrus
Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Vulpes%20zerda
Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Vulpes%20velox
Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Lithobates%20sevosa
Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Atelopus%20varius
...

Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Mandrillus%20sphinx
Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Cercocebus%20torquatus
Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Capra%20falconeri%20heptneri
Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Callithrix%20geoffroyi
Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Callithrix%20pygmaea
Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Suricata%20suricatta
Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Mergus%20squamatus
Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Callicebus%20donacophilus
Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Ateles%20goeffroyi
Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Saimiri%20sciureus
Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Cercopithecus%20neglectus
Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Cercopithecus%20diana


Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Callosciurus%20prevostii
Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Coccycolius%20iris
Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Cosmopsarus%20regius
Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Scissirostrum%20dubium
Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Cinnyricinclus%20leucogaster
Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Raphicerus%20campestris
Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Himantopus%20mexicanus
Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Potamotrygon%20motoro
Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Potamotrygon%20leopoldi
Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Ciconia%20ciconia%20ciconia
Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Leptoptilos%20crumenifer
Accessing API...
https://api.gbif.org/v1/occurrence/search?q=Mycte

In [13]:
# Turn data into dataframe for easier cleaning.
gbif_df = pd.DataFrame(new).stack().apply(pd.Series)
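
`stack().apply(pd.Series)` builds a Series for every record, which can get slow as the data grows. A sketch of an alternative that should give the same two-level index (query number, record number), using toy records in place of `new`:

```python
import pandas as pd

# Toy stand-in for `new`: one list of record dicts per queried animal.
toy_results = [
    [{'species': 'Orycteropus afer', 'year': 2014}, {'species': 'Orycteropus afer'}],
    [{'species': 'Addax nasomaculatus', 'country': 'Niger'}],
]

# One small dataframe per animal, then stacked with the animal's position as the outer index level.
frames = [pd.DataFrame(records) for records in toy_results]
toy_df = pd.concat(frames, keys=range(len(frames)))
print(toy_df)
```

Keys missing from a record become NaN, the same as in the original approach.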

In [14]:
gbif_df.head(10)

Unnamed: 0,Unnamed: 1,occurrenceStatus,species,year,month,day,country,locality,occurrenceRemarks
0,0,PRESENT,Orycteropus afer,2014.0,9.0,30.0,Kenya,Marsabit Forest Ecosystem,
0,1,PRESENT,Orycteropus afer,,,,,,
0,2,PRESENT,Orycteropus afer,,,,Germany,,
0,3,PRESENT,Orycteropus afer,,,,South Africa,,
0,4,PRESENT,Orycteropus afer,2015.0,4.0,8.0,Niger,"Say, Parc W",
0,5,PRESENT,Orycteropus afer,,,,,,
0,6,PRESENT,Orycteropus afer,,,,,,
0,7,PRESENT,Orycteropus afer,,,,,,
0,8,PRESENT,Orycteropus afer,,,,,,
0,9,PRESENT,Orycteropus afer,,,,,,


In [15]:
gbif_df.shape

(7420, 8)

In [16]:
# I wanted to investigate how many NaN values are in each column to see if there are any unnecessary variables present. I want 
# to keep all the observations, even if they have no information other than the species name because the number of observations
# was my original goal to extract from this dataset. occurrenceRemarks is the only column that I feel is unnecessary since almost
# all the rows have no value. So, I will remove that column.
gbif_df.isna().sum()

occurrenceStatus        0
species               118
year                 3216
month                3525
day                  3636
country              2452
locality             3436
occurrenceRemarks    6731
dtype: int64
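
To put "almost all the rows" into numbers, the counts above can be turned into percentages of the 7420 rows. A short sketch using the counts exactly as printed:

```python
import pandas as pd

# NaN counts copied from the output above.
nan_counts = pd.Series({
    'occurrenceStatus': 0, 'species': 118, 'year': 3216, 'month': 3525,
    'day': 3636, 'country': 2452, 'locality': 3436, 'occurrenceRemarks': 6731,
})
total_rows = 7420

# Share of rows with a missing value in each column, in percent.
pct_missing = (nan_counts / total_rows * 100).round(1)
print(pct_missing)
```

On the live dataframe, `gbif_df.isna().mean() * 100` gives the same shares without copying the counts by hand.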

In [17]:
gbif_df = gbif_df.drop('occurrenceRemarks', axis=1)

In [18]:
gbif_df.head(10)

Unnamed: 0,Unnamed: 1,occurrenceStatus,species,year,month,day,country,locality
0,0,PRESENT,Orycteropus afer,2014.0,9.0,30.0,Kenya,Marsabit Forest Ecosystem
0,1,PRESENT,Orycteropus afer,,,,,
0,2,PRESENT,Orycteropus afer,,,,Germany,
0,3,PRESENT,Orycteropus afer,,,,South Africa,
0,4,PRESENT,Orycteropus afer,2015.0,4.0,8.0,Niger,"Say, Parc W"
0,5,PRESENT,Orycteropus afer,,,,,
0,6,PRESENT,Orycteropus afer,,,,,
0,7,PRESENT,Orycteropus afer,,,,,
0,8,PRESENT,Orycteropus afer,,,,,
0,9,PRESENT,Orycteropus afer,,,,,


In [19]:
# I want to rearrange the columns so that the location information columns are next to each other.
gbif_df = gbif_df[['occurrenceStatus', 'species', 'country', 'locality', 'year', 'month', 'day']]

In [20]:
gbif_df.head()

Unnamed: 0,Unnamed: 1,occurrenceStatus,species,country,locality,year,month,day
0,0,PRESENT,Orycteropus afer,Kenya,Marsabit Forest Ecosystem,2014.0,9.0,30.0
0,1,PRESENT,Orycteropus afer,,,,,
0,2,PRESENT,Orycteropus afer,Germany,,,,
0,3,PRESENT,Orycteropus afer,South Africa,,,,
0,4,PRESENT,Orycteropus afer,Niger,"Say, Parc W",2015.0,4.0,8.0


In [21]:
# Now I want to duplicate the country information to the locality information if the locality is NaN. Locality is just the more
# specific location of the observation. If that information is not available, it would be better to have the general location
# represented there.
gbif_df['locality'] = gbif_df['locality'].fillna(gbif_df['country'])

In [22]:
gbif_df.head()

Unnamed: 0,Unnamed: 1,occurrenceStatus,species,country,locality,year,month,day
0,0,PRESENT,Orycteropus afer,Kenya,Marsabit Forest Ecosystem,2014.0,9.0,30.0
0,1,PRESENT,Orycteropus afer,,,,,
0,2,PRESENT,Orycteropus afer,Germany,Germany,,,
0,3,PRESENT,Orycteropus afer,South Africa,South Africa,,,
0,4,PRESENT,Orycteropus afer,Niger,"Say, Parc W",2015.0,4.0,8.0


In [23]:
# There are also NaN values represented in the country, so I want to replace those with the locality if that is available.
gbif_df['country'] = gbif_df['country'].fillna(gbif_df['locality'])

In [24]:
gbif_df.head()

Unnamed: 0,Unnamed: 1,occurrenceStatus,species,country,locality,year,month,day
0,0,PRESENT,Orycteropus afer,Kenya,Marsabit Forest Ecosystem,2014.0,9.0,30.0
0,1,PRESENT,Orycteropus afer,,,,,
0,2,PRESENT,Orycteropus afer,Germany,Germany,,,
0,3,PRESENT,Orycteropus afer,South Africa,South Africa,,,
0,4,PRESENT,Orycteropus afer,Niger,"Say, Parc W",2015.0,4.0,8.0


In [25]:
# The rest of the location NaN values I will replace with 'Unknown'.
gbif_df.update(gbif_df[['country', 'locality']].fillna('Unknown'))
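
The three location steps (locality from country, country from locality, then 'Unknown' for whatever is left) can be checked on a toy dataframe. This is a sketch with made-up rows that cover each case; it assigns the filled columns back rather than filling in place.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'country':  ['Kenya', np.nan, 'Germany', np.nan],
    'locality': ['Marsabit Forest Ecosystem', np.nan, np.nan, 'CRANDON PARK ZOO'],
})

# Step 1: a missing locality falls back to the country.
toy['locality'] = toy['locality'].fillna(toy['country'])
# Step 2: a missing country falls back to the locality.
toy['country'] = toy['country'].fillna(toy['locality'])
# Step 3: rows with neither value get a placeholder.
toy[['country', 'locality']] = toy[['country', 'locality']].fillna('Unknown')
print(toy)
```

Note that step 2 can put a locality string in the country column, as in the last toy row.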

In [26]:
# I want to reformat the date numbers from floats to integers for easier reading. I'm not sure if I need to convert them to
# datetime format yet. First, I need to replace the NaN values in those columns with zero. Then I will convert them to integers.
gbif_df.update(gbif_df[['year', 'month', 'day']].fillna(0))

In [27]:
gbif_df[['year', 'month', 'day']] = gbif_df[['year', 'month', 'day']].astype(int)
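
An alternative to filling with zero, shown only as a sketch on toy data: pandas' nullable `Int64` type displays whole numbers while keeping missing dates marked as missing, and rows with a complete date can be converted to datetimes if that turns out to be needed.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'year':  [2014.0, np.nan, 2015.0],
    'month': [9.0, np.nan, 4.0],
    'day':   [30.0, np.nan, 8.0],
})

# Nullable integers: whole numbers for display, <NA> where the date is unknown.
as_int = toy.astype('Int64')

# Datetimes only for rows where year, month, and day are all present.
complete = toy.dropna().astype(int)
dates = pd.to_datetime(complete[['year', 'month', 'day']])
print(as_int)
print(dates)
```

This avoids a year of 0 being mistaken for a real value later on.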

In [28]:
gbif_df.head(670)

Unnamed: 0,Unnamed: 1,occurrenceStatus,species,country,locality,year,month,day
0,0,PRESENT,Orycteropus afer,Kenya,Marsabit Forest Ecosystem,2014,9,30
0,1,PRESENT,Orycteropus afer,Unknown,Unknown,0,0,0
0,2,PRESENT,Orycteropus afer,Germany,Germany,0,0,0
0,3,PRESENT,Orycteropus afer,South Africa,South Africa,0,0,0
0,4,PRESENT,Orycteropus afer,Niger,"Say, Parc W",2015,4,8
...,...,...,...,...,...,...,...,...
33,5,PRESENT,Syncerus caffer,Unknown,Unknown,1834,8,1
33,6,PRESENT,Syncerus caffer,Botswana,"Khwai, Mababe",0,0,0
33,7,PRESENT,Syncerus caffer,Liberia,Liberia,0,0,0
33,8,PRESENT,Syncerus caffer,Botswana,"Khwai, Mababe",0,0,0


In [29]:
# There are 118 instances of NaN values in the species column. Since the whole point of pulling this data is to match
# information to the species, that is a problem. But I'm not entirely sure why those are NaN. So, I'm going to find the rows to
# see if I can figure out why they're NaN.
gbif_df['species'].isnull().sum()

118

In [30]:
gbif_df[gbif_df['species'].isnull()]

Unnamed: 0,Unnamed: 1,occurrenceStatus,species,country,locality,year,month,day
1,10,PRESENT,,,,0,0,0
6,4,PRESENT,,CRANDON PARK ZOO,CRANDON PARK ZOO,1977,11,25
6,14,PRESENT,,,,1976,3,31
10,6,PRESENT,,Unknown,Unknown,1735,0,0
10,12,PRESENT,,NO DATA,NO DATA,1968,11,18
...,...,...,...,...,...,...,...,...
359,4,PRESENT,,,,1967,10,22
361,6,PRESENT,,No data,No data,1965,4,1
361,7,PRESENT,,,,1983,4,4
370,6,PRESENT,,NO DATA,NO DATA,1980,12,18


In [31]:
# The numbers on the far left indicate which list, in the list of lists pulled from the API, each row came from. I can use that
# to identify which animal is represented by the observation. This summary also shows that there are other variations in the way
# information was entered in some of the columns (such as using 'N/A' or 'no specific locality' instead of leaving the information
# blank). I'm going to drop the rows with NaN species values, since it would be a pretty big assumption to fill in the
# species based on the list it was pulled from. Although the odds are high that it is from the same species, these could also be
# instances of misidentified species.
gbif_df = gbif_df[gbif_df['species'].notna()]
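
Before dropping such rows, the outer index level can be used to count how many of them came from each query. The sketch below uses toy data; `queried` is a stand-in for `flat_master`. It assumes every request succeeded, because a failed request is skipped in `find_species` and would shift the positions.

```python
import numpy as np
import pandas as pd

queried = ['Orycteropus%20afer', 'Addax%20nasomaculatus']  # stand-in for flat_master

toy = pd.DataFrame(
    {'species': ['Orycteropus afer', np.nan, 'Addax nasomaculatus']},
    index=pd.MultiIndex.from_tuples([(0, 0), (0, 1), (1, 0)]),
)

# Rows without a species, counted by the query (outer index level) they came from.
missing = toy[toy['species'].isna()]
per_query = missing.index.get_level_values(0).value_counts()

# Swap the query positions for the query strings.
per_query.index = [queried[i] for i in per_query.index]
print(per_query)
```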

In [32]:
gbif_df['species'].isnull().sum() # Success

0

In [33]:
gbif_df.shape

(7302, 7)

In [34]:
# I'm going to replace the potential 'N/A' values. I'm not sure how many other variations of 'N/A' might be present in the data.
# I might just have to deal with them as they arise. However, for the ultimate analysis, I think I'm going to end up not using
# the locality/country data anyway, and those seem to be the main inconsistencies.
gbif_df = gbif_df.replace('N/A', 'Unknown')
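
The rows shown earlier also contain spellings such as 'NO DATA' and 'No data', which an exact match on 'N/A' does not catch. A sketch of a case-insensitive cleanup on toy data follows; the set of placeholder spellings is an assumption and would need extending as new ones turn up.

```python
import pandas as pd

toy = pd.DataFrame({'locality': ['N/A', 'NO DATA', 'No data', 'CRANDON PARK ZOO', 'Unknown']})

# Assumed placeholder spellings, compared in lowercase with surrounding spaces removed.
placeholders = {'n/a', 'no data', 'no specific locality'}

normalized = toy['locality'].str.strip().str.lower()
# Replace a value with 'Unknown' wherever its normalized form is a placeholder.
toy['locality'] = toy['locality'].mask(normalized.isin(placeholders), 'Unknown')
print(toy)
```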

In [35]:
# Store the clean dataframe as CSV
gbif_df.to_csv('dsc540_gbif.csv', index=False)