# Exploratory Data Analysis (EDA)

**Topic:** Endangered species

**Data source:** [International Union for Conservation of Nature’s Red List of Threatened Species](https://www.iucnredlist.org/)

**Goal:** This notebook joins several files, obtained through web scraping and from official IUCN Red List sources, into a single dataset of endangered species worldwide. The dataset will feed a web page that shows the status of endangered species across the globe. After preprocessing, an exploratory data analysis is carried out to understand the data and draw insights from it.

#### Import required libraries

In [1]:
import pandas as pd
import json
import os
import glob

from tqdm import tqdm

#### Utils

In [2]:
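def save_json(dictionary, filename):
    """Write `dictionary` to `filename` as JSON, replacing any existing file."""
    # Mode 'w' truncates an existing file, so no explicit removal is needed
    with open(filename, 'w') as f:
        json.dump(dictionary, f)

    print(f"Saved data to {filename}")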

#### Preprocessing

1. Load data and join files to have a single dataset

In [3]:
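def join_files(raw_data_path, species_path, output_path):
    """Join each species' scraped JSON files with its row in the species CSV."""
    animals = []

    df_species = pd.read_csv(species_path)

    print("Joining files to create a single dataset...")

    for folder_id in tqdm(os.listdir(raw_data_path), desc="Processing files"):
        path = os.path.join(raw_data_path, folder_id)

        # Skip anything that is not a species folder named after a numeric id
        if not (os.path.isdir(path) and folder_id.isdigit()):
            continue

        try:
            animal = {
                "id": folder_id,
            }

            # Attributes from the species CSV (first row matching the folder id)
            matches = df_species[df_species['id_no'] == int(folder_id)]
            animal.update(matches.to_dict(orient='records')[0])

            # Single-record files: merge their fields into the top level
            narrative_data = pd.read_json(f"{path}/narrative.json")["result"][0]
            animal.update(narrative_data)

            species = pd.read_json(f"{path}/species.json")["result"][0]
            animal.update(species)

            # Multi-record files: keep each one as a nested list
            data_files = ['countries', 'habitats', 'history', 'threats']
            for file_name in data_files:
                animal[file_name] = pd.read_json(f"{path}/{file_name}.json")["result"].tolist()

            animals.append(animal)
        except Exception as e:
            # Report the failing folder and keep processing the others
            print(f"Skipping {folder_id}: {e}")

    save_json(animals, output_path)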

In [4]:
raw_data_path = './data_viz_animals/raw_data'
species_path = './data_viz_animals/species.csv'
output_path = './data_viz_animals/animals.json'

join_files(raw_data_path, species_path, output_path)

Joining files to create a single dataset...


Processing files: 100%|██████████| 1799/1799 [00:08<00:00, 211.72it/s]


Saved data to ./data_viz_animals/animals.json
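The loop in `join_files` assumes one folder per species under `raw_data`, named after the species `id_no`, with every JSON file wrapping its payload in a top-level `result` list. The sketch below illustrates that merge for a single species using only the standard library. The folder contents are made-up placeholder values, not real IUCN records, and `merge_species_folder` is a hypothetical helper name.

```python
import json
import os
import tempfile


def merge_species_folder(folder, csv_row):
    """Combine one species' CSV row with the JSON files in its folder."""
    def read_result(name):
        with open(os.path.join(folder, f"{name}.json")) as f:
            return json.load(f)["result"]

    animal = dict(csv_row)
    # narrative.json and species.json hold a single record each
    animal.update(read_result("narrative")[0])
    animal.update(read_result("species")[0])
    # The remaining files hold a list of records, kept as nested lists
    for name in ("countries", "habitats", "history", "threats"):
        animal[name] = read_result(name)
    return animal


# Build a tiny placeholder folder to exercise the helper
with tempfile.TemporaryDirectory() as tmp:
    placeholder = {
        "narrative": [{"rationale": "placeholder text"}],
        "species": [{"category": "EN"}],
        "countries": [{"code": "MX", "country": "Mexico"}],
        "habitats": [],
        "history": [],
        "threats": [],
    }
    for name, records in placeholder.items():
        with open(os.path.join(tmp, f"{name}.json"), "w") as f:
            json.dump({"result": records}, f)

    merged = merge_species_folder(tmp, {"id_no": 1, "sci_name": "Placeholder species"})

print(sorted(merged))
```

Single-record files are flattened into the top level of the record, while multi-record files stay nested, which is why `countries`, `habitats`, and `history` later appear as list-valued columns.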


### Exploratory Data Analysis

1. General information about the dataset

In [5]:
df_animals = pd.read_json('./data_viz_animals/animals.json')
df_animals.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1799 entries, 0 to 1798
Data columns (total 64 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    1799 non-null   int64  
 1   assessment_id         1799 non-null   int64  
 2   id_no                 1799 non-null   int64  
 3   sci_name              1799 non-null   object 
 4   presence              1799 non-null   int64  
 5   origin                1799 non-null   int64  
 6   seasonal              1799 non-null   int64  
 7   compiler              1798 non-null   object 
 8   yrcompiled            1799 non-null   int64  
 9   citation              1798 non-null   object 
 10  legend                1799 non-null   object 
 11  subspecies            756 non-null    object 
 12  subpop                748 non-null    object 
 13  dist_comm             497 non-null    object 
 14  island                835 non-null    object 
 15  tax_comm              331 
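The non-null counts printed by `info()` can be turned into a per-column missing share, which makes it easier to decide which columns to drop. This is a minimal sketch on a small made-up frame standing in for `df_animals`; the 50% threshold is an illustrative assumption, not the rule used in this notebook.

```python
import pandas as pd

# Made-up frame standing in for df_animals
df_demo = pd.DataFrame({
    "sci_name": ["a", "b", "c", "d"],
    "subspecies": [None, None, None, "x"],
    "island": [None, "y", None, "z"],
})

# Fraction of missing values per column, highest first
missing_share = df_demo.isna().mean().sort_values(ascending=False)

# Columns above an illustrative 50% threshold are candidates for removal
threshold = 0.5
sparse_columns = missing_share[missing_share > threshold].index.tolist()

print(missing_share)
print(sparse_columns)
```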

2. Clean data and check for missing values

In [6]:
def clean_animals_data(df_animals):
    """Drop columns that are mostly empty, duplicated, or not informative."""
    print("Removing columns with many missing values...")
    columns_to_drop = [
        'subspecies', 'subpop', 'dist_comm', 'island', 'tax_comm',
        'taxonomicnotes', 'main_common_name', 'aoo_km2', 'eoo_km2',
        'elevation_upper', 'elevation_lower', 'depth_upper', 'depth_lower',
        'errata_flag', 'errata_reason', 'amended_flag', 'amended_reason',
        'id_no', 'species_id', 'assessment_id', 'origin', 'presence',
        'seasonal', 'assessor', 'reviewer'
    ]

    print("Removing columns with many missing values, duplicated, or no meaningful information...")
    # drop() returns a new DataFrame; the caller's original frame is left unchanged
    df_animals = df_animals.drop(columns=columns_to_drop)

    return df_animals

In [7]:
df_animals = clean_animals_data(df_animals)
df_animals.info()

Removing columns with many missing values...
Removing columns with many missing values, duplicated, or no meaningful information...
<class 'pandas.core.frame.DataFrame'>
Index: 1799 entries, 0 to 1798
Data columns (total 39 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    1799 non-null   int64  
 1   sci_name              1799 non-null   object 
 2   compiler              1798 non-null   object 
 3   yrcompiled            1799 non-null   int64  
 4   citation              1798 non-null   object 
 5   legend                1799 non-null   object 
 6   source                1468 non-null   object 
 7   basisofrec            1233 non-null   object 
 8   event_year            1577 non-null   float64
 9   longitude             1799 non-null   float64
 10  latitude              1799 non-null   float64
 11  rationale             1799 non-null   object 
 12  geographicrange       1799 non-null   object 
 

#### Explore the dataset

In [8]:
df_animals.head()

| | id | sci_name | compiler | yrcompiled | citation | legend | source | basisofrec | event_year | longitude | ... | assessment_date | category | criteria | population_trend | marine_system | freshwater_system | terrestrial_system | countries | habitats | history |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 192897 | Herichthys labridens | Fabian Pérez, Omar Mejía, Eduardo Soto-Galera... | 2018 | IUCN (International Union for Conservation of ... | Extant (resident) | ENCB-IPN-P1888 | PRESERVED_SPECIMEN | 1955.0 | -100.04833 | ... | 2018-10-18 | EN | B1ab(iii) | Decreasing | False | True | False | [{'code': 'MX', 'country': 'Mexico', 'presence... | [{'code': '5.1', 'habitat': 'Wetlands (inland)... | [{'year': '2019', 'assess_year': '2018', 'code... |
| 1 | 139371719 | Onychogomphus thienemanni | R.A. Dow | 2019 | R.A. Dow | Extant (resident) | Choong & Rahim 2014 | PreservedSpecimen | 2010.0 | 101.9879 | ... | 2019-05-26 | NT | | Decreasing | False | True | True | [{'code': 'ID', 'country': 'Indonesia', 'prese... | [{'code': '1.6', 'habitat': 'Forest - Subtropi... | [{'year': '2020', 'assess_year': '2019', 'code... |
| 2 | 176408492 | Andromakhe latens | Alonso, F.A. | 2020 | IUCN (International Union for Conservation of ... | Extant (resident) | | | | -64.500597 | ... | 2020-12-14 | EN | B1ab(iii) | Unknown | False | True | False | [{'code': 'AR', 'country': 'Argentina', 'prese... | [{'code': '5.1', 'habitat': 'Wetlands (inland)... | [{'year': '2022', 'assess_year': '2020', 'code... |
| 3 | 197975 | Caridina striata | K. von Rintelen, MfN Berlin | 2017 | K. von Rintelen, MfN Berlin | Extant (resident) | von Rintelen K., and Y. Cai. 2009. The Raffles... | PreservedSpecimen | 2003.0 | 121.396333 | ... | 2018-12-06 | CR | A3e | Unknown | False | True | False | [{'code': 'ID', 'country': 'Indonesia', 'prese... | [{'code': '5.5', 'habitat': 'Wetlands (inland)... | [{'year': '2019', 'assess_year': '2018', 'code... |
| 4 | 218123787 | Ancylodactylus spawlsi | PK Malonza | 2022 | National Museums of Kenya (NMK) | Extant (resident) | NMK | PreservedSpecimen | 2013.0 | 37.12985 | ... | 2022-07-04 | VU | B1ab(iii) | Decreasing | False | False | True | [{'code': 'KE', 'country': 'Kenya', 'presence'... | [{'code': '1.5', 'habitat': 'Forest - Subtropi... | [{'year': '2023', 'assess_year': '2022', 'code... |
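The preview shows the same `basisofrec` value spelled two ways (`PRESERVED_SPECIMEN` and `PreservedSpecimen`), which would split one group into two in any count. One possible normalization is sketched below on a few values. The target spelling (lowercase, no underscores) is an arbitrary choice, and other values in the full column may need their own handling.

```python
import pandas as pd

# A few values mimicking the basisofrec column, including a missing one
basisofrec = pd.Series(
    ["PRESERVED_SPECIMEN", "PreservedSpecimen", None, "PreservedSpecimen"]
)

# Drop underscores and lowercase so both spellings collapse to one label;
# missing values stay missing
normalized = basisofrec.str.replace("_", "", regex=False).str.lower()

print(normalized.value_counts())
```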


3. Show plots to understand the data:
* Population trend
* Countries with the most endangered species
* Endangered species by category
* Extinct species by category
* Vulnerable species by category
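Each plot in the list reduces to a count table. The sketch below builds such tables on made-up rows that mimic the `category`, `population_trend`, and `countries` columns. How the per-category plots map onto the IUCN category codes is not stated in the notebook, so the code only computes the overall counts and leaves any filtering by category as an assumption for the reader to fill in.

```python
import pandas as pd

# Made-up rows standing in for df_animals
df_demo = pd.DataFrame({
    "category": ["EN", "VU", "EN", "CR"],
    "population_trend": ["Decreasing", "Unknown", "Decreasing", "Stable"],
    "countries": [
        [{"code": "MX", "country": "Mexico"}],
        [{"code": "ID", "country": "Indonesia"}, {"code": "MX", "country": "Mexico"}],
        [{"code": "KE", "country": "Kenya"}],
        [],
    ],
})

# Population trend and Red List category: one count per distinct value
trend_counts = df_demo["population_trend"].value_counts()
category_counts = df_demo["category"].value_counts()

# Countries: one row per (species, country) pair, then count species per country.
# Species with an empty country list explode to a missing value and are dropped.
country_rows = df_demo.explode("countries").dropna(subset=["countries"])
country_counts = country_rows["countries"].map(lambda c: c["country"]).value_counts()

print(trend_counts, category_counts, country_counts, sep="\n\n")
```

Each resulting series can be passed to a bar chart, for example with `trend_counts.plot(kind="bar")` when matplotlib is installed.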
