# Exploratory Data Analysis (EDA)

**Topic:** Endangered species

**Data source:** [International Union for Conservation of Nature’s Red List of Threatened Species](https://www.iucnredlist.org/)

**Goal:** This notebook joins several files, obtained through web scraping and from official IUCN Red List sources, into a single dataset describing endangered species worldwide. The dataset will feed a web page that shows the status of endangered species across the globe. After preprocessing, an exploratory data analysis is carried out to understand the data and draw insights from it.

#### Import required libraries

In [1]:
import pandas as pd
import json
import os
import glob

#### Utils

In [7]:
def save_json(dictionary, filename):
    with open(filename, 'w') as f:
        json.dump(dictionary, f)
        print(f"Saved data to {filename}")

#### Preprocessing

In [9]:
df_species = pd.read_csv('./data_viz_animals/species.csv')
df_species.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1799 entries, 0 to 1798
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   assessment_id  1799 non-null   int64  
 1   id_no          1799 non-null   int64  
 2   sci_name       1799 non-null   object 
 3   presence       1799 non-null   int64  
 4   origin         1799 non-null   int64  
 5   seasonal       1799 non-null   int64  
 6   compiler       1798 non-null   object 
 7   yrcompiled     1799 non-null   int64  
 8   citation       1798 non-null   object 
 9   legend         1799 non-null   object 
 10  subspecies     756 non-null    object 
 11  subpop         748 non-null    object 
 12  dist_comm      497 non-null    object 
 13  island         835 non-null    object 
 14  tax_comm       331 non-null    object 
 15  source         1468 non-null   object 
 16  basisofrec     1233 non-null   object 
 17  event_year     1577 non-null   float64
 18  longitud
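
The non-null counts above can be turned into missing-value fractions to see which columns are sparse. The sketch below is illustrative only: it hard-codes the counts printed by `df_species.info()`, and the 0.5 cutoff is an assumption, not a rule stated in this notebook. With the real data loaded, `df_species.isna().mean()` should give the same fractions directly.

```python
import pandas as pd

# Non-null counts copied from the df_species.info() output (1799 rows in total).
n_rows = 1799
non_null = pd.Series({
    "compiler": 1798,
    "citation": 1798,
    "subspecies": 756,
    "subpop": 748,
    "dist_comm": 497,
    "island": 835,
    "tax_comm": 331,
    "source": 1468,
    "basisofrec": 1233,
    "event_year": 1577,
})

# Fraction of missing values per column.
missing_fraction = 1 - non_null / n_rows

# Assumed cutoff: flag columns where more than half of the values are missing.
threshold = 0.5
sparse_columns = missing_fraction[missing_fraction > threshold].index.tolist()

print(missing_fraction.round(3).sort_values(ascending=False))
print(sparse_columns)
```

Under this assumed cutoff, the flagged columns are the same five that `clean_species_data` drops in the next cell.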

In [10]:
def clean_species_data(df_species):
    print("Removing columns with many missing values...")
    # More than half of the values in these columns are missing (see df_species.info()).
    sparse_columns = ['subspecies', 'subpop', 'dist_comm', 'island', 'tax_comm']
    return df_species.drop(columns=sparse_columns)

def join_files(raw_data_path, species_path, output_path):
    """Merge each species.csv row with its per-species JSON files and save the result."""
    animals = []

    df_species = clean_species_data(pd.read_csv(species_path))

    try:
        print("Joining files to create a single dataset...")

        # Each entry in raw_data_path is a folder named after a species id (id_no).
        for folder_id in os.listdir(raw_data_path):
            animal = {
                "id": folder_id,
            }

            # Copy the matching species.csv row into the record.
            animal.update(df_species[df_species['id_no'] == int(folder_id)].to_dict(orient='records')[0])

            path = f"{raw_data_path}/{folder_id}"

            if os.path.exists(path):
                # Each JSON file keeps its payload under the "result" key.
                animal['countries'] = pd.read_json(f"{path}/countries.json")["result"].to_list()
                animal['habitats'] = pd.read_json(f"{path}/habitats.json")["result"].to_list()
                animal['history'] = pd.read_json(f"{path}/history.json")["result"].to_list()
                # The narrative is stored as a single record rather than a list.
                animal['narrative'] = pd.read_json(f"{path}/narrative.json")["result"][0]
                animal['species'] = pd.read_json(f"{path}/species.json")["result"].to_list()
                animal['threats'] = pd.read_json(f"{path}/threats.json")["result"].to_list()

            animals.append(animal)

        save_json(animals, output_path)
    except Exception as e:
        # Any error (for example a missing file) stops the loop before the data is saved.
        print(e)


In [11]:
raw_data_path = './data_viz_animals/raw_data'
species_path = './data_viz_animals/species.csv'
output_path = './data_viz_animals/animals.json'

join_files(raw_data_path, species_path, output_path)

Removing columns with many missing values...
Joining files to create a single dataset...
File ./data_viz_animals/raw_data/186593/countries.json does not exist
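
The message above suggests the run stopped at the first missing file, because the `try`/`except` in `join_files` wraps the whole loop. One possible alternative, sketched below under assumptions, is to handle missing files per folder so the remaining species are still collected. The helpers `read_result` and `collect_details` are hypothetical names that do not exist in this notebook, and the sketch reads files with the standard `json` module instead of `pd.read_json`.

```python
import json
import os
import tempfile


def read_result(folder, name):
    """Return the "result" value of <folder>/<name>.json, or None if the file is missing."""
    file_path = os.path.join(folder, f"{name}.json")
    if not os.path.isfile(file_path):
        print(f"Skipping missing file: {file_path}")
        return None
    with open(file_path) as f:
        return json.load(f)["result"]


def collect_details(raw_data_path, names=("countries", "habitats", "threats")):
    """Build one record per species folder, tolerating missing files."""
    records = []
    for folder_id in sorted(os.listdir(raw_data_path)):
        folder = os.path.join(raw_data_path, folder_id)
        if not os.path.isdir(folder):
            continue  # ignore stray files that are not species folders
        record = {"id": folder_id}
        for name in names:
            record[name] = read_result(folder, name)
        records.append(record)
    return records


# Small demonstration on a temporary directory with made-up content:
# folder "1" has countries.json, folder "2" does not.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "1"))
    os.makedirs(os.path.join(tmp, "2"))
    with open(os.path.join(tmp, "1", "countries.json"), "w") as f:
        json.dump({"result": [{"code": "MX"}]}, f)
    demo = collect_details(tmp, names=("countries",))

print(demo)
```

Whether a species with missing files should be kept with empty fields, as here, or dropped entirely is a design choice that depends on what the web page needs.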


#### Explore the dataset

In [8]:
data_path = './data_viz_animals/animals.json'

df_animals = pd.read_json(data_path)
df_animals.head(3)

| | id | assessment_id | id_no | sci_name | presence | origin | seasonal | compiler | yrcompiled | citation | ... | basisofrec | event_year | longitude | latitude | countries | habitats | history | narrative | species | threats |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12395 | 506299 | 12395 | Lucania interioris | 1 | 1 | 1 | Brown, W.L. | 2018 | IUCN (International Union for Conservation of ... | ... | | 0.0 | -102.099998 | 26.933332 | `[{'code': 'MX', 'country': 'Mexico', 'presence...` | `[{'code': '5.1', 'habitat': 'Wetlands (inland)...` | `[{'year': '2019', 'assess_year': '2018', 'code...` | `{'species_id': 12395, 'taxonomicnotes': None, ...` | `[{'taxonid': 12395, 'scientific_name': 'Lucani...` | `[{'code': '7.2', 'title': 'Dams & water manage...` |
| 1 | 186459 | 1813377 | 186459 | Steindachneridion punctatum | 1 | 1 | 1 | Salvador, G.N. | 2020 | IUCN (International Union for Conservation of ... | ... | | 0.0 | -51.461113 | -27.606112 | `[{'code': 'BR', 'country': 'Brazil', 'presence...` | `[{'code': '5.1', 'habitat': 'Wetlands (inland)...` | `[{'year': '2023', 'assess_year': '2023', 'code...` | `{'species_id': 186459, 'taxonomicnotes': '<p>T...` | `[{'taxonid': 186459, 'scientific_name': 'Stein...` | `[{'code': '3.3', 'title': 'Renewable energy', ...` |
| 2 | 186541 | 1814672 | 186541 | Gymnogeophagus setequedas | 1 | 1 | 1 | Fernando, E. | 2023 | IUCN (International Union for Conservation of ... | ... | | 0.0 | -55.5 | -25.333334 | `[{'code': 'AR', 'country': 'Argentina', 'prese...` | `[{'code': '5.1', 'habitat': 'Wetlands (inland)...` | `[{'year': '2023', 'assess_year': '2020', 'code...` | `{'species_id': 186541, 'taxonomicnotes': '<p>T...` | `[{'taxonid': 186541, 'scientific_name': 'Gymno...` | `[{'code': '2.1', 'title': 'Annual & perennial ...` |
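
Columns such as `countries` hold a list of dictionaries per species, so they need flattening before they can be counted or grouped. The sketch below shows one way to do that with `DataFrame.explode` and `pd.json_normalize`. It runs on a toy frame whose species names and country lists are made up for illustration; only the `code` and `country` keys are taken from the output above.

```python
import pandas as pd

# Toy data mimicking the nested "countries" column; all values are illustrative.
toy = pd.DataFrame({
    "sci_name": ["species_a", "species_b"],
    "countries": [
        [{"code": "MX", "country": "Mexico"}],
        [{"code": "MX", "country": "Mexico"}, {"code": "BR", "country": "Brazil"}],
    ],
})

# One row per (species, country) pair.
exploded = toy.explode("countries", ignore_index=True)

# Expand each country dictionary into its own columns.
country_info = pd.json_normalize(exploded["countries"].tolist())
flat = pd.concat([exploded[["sci_name"]], country_info], axis=1)

# Number of distinct species recorded per country.
species_per_country = (
    flat.groupby("country")["sci_name"].nunique().sort_values(ascending=False)
)
print(species_per_country)
```

On the real `df_animals`, rows with an empty or missing `countries` list would need to be filtered out before the `json_normalize` step, since `explode` turns them into missing values rather than dictionaries.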


In [13]:
df_animals.info()

<class 'pandas.core.frame.DataFrame'>
Index: 24 entries, 0 to 23
Data columns (total 22 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             24 non-null     int64  
 1   assessment_id  24 non-null     int64  
 2   id_no          24 non-null     int64  
 3   sci_name       24 non-null     object 
 4   presence       24 non-null     int64  
 5   origin         24 non-null     int64  
 6   seasonal       24 non-null     int64  
 7   compiler       24 non-null     object 
 8   yrcompiled     24 non-null     int64  
 9   citation       24 non-null     object 
 10  legend         24 non-null     object 
 11  source         20 non-null     object 
 12  basisofrec     17 non-null     object 
 13  event_year     19 non-null     float64
 14  longitude      24 non-null     float64
 15  latitude       24 non-null     float64
 16  countries      24 non-null     object 
 17  habitats       24 non-null     object 
 18  history        24