# Biodiversity Dataset

This dataset, from the National Park Service, describes species observed in different national parks and their conservation status.

> ## Exploring Data:

First, I want to know how many columns there are, what their data types are, and how many unique values (and which ones) each contains. From there, I can tell whether any data is missing and explore it further to decide how to treat it.

In [1]:
# Import libraries and read both datasets
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

observations = pd.read_csv('observations.csv')
species = pd.read_csv('species_info.csv')

In [2]:
# Merge both datasets into a single one and look at the merged information
biodiversity = species.merge(right = observations, how = 'outer', on = 'scientific_name')
biodiversity.head()

,category,scientific_name,common_names,conservation_status,park_name,observations
0,Vascular Plant,Abies bifolia,Rocky Mountain Alpine Fir,,Bryce National Park,109
1,Vascular Plant,Abies bifolia,Rocky Mountain Alpine Fir,,Yellowstone National Park,215
2,Vascular Plant,Abies bifolia,Rocky Mountain Alpine Fir,,Great Smoky Mountains National Park,72
3,Vascular Plant,Abies bifolia,Rocky Mountain Alpine Fir,,Yosemite National Park,136
4,Vascular Plant,Abies concolor,"Balsam Fir, Colorado Fir, Concolor Fir, Silver...",,Great Smoky Mountains National Park,101
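
An outer merge keeps unmatched rows from both sides and repeats rows whenever a key occurs more than once, so it can be worth checking what the merge actually did. The sketch below uses the `indicator` parameter of `DataFrame.merge` on two made-up toy tables; `species_toy`, `observations_toy` and their contents are hypothetical.

```python
import pandas as pd

# Hypothetical toy tables; names and values are made up for illustration.
species_toy = pd.DataFrame({
    "scientific_name": ["Abies bifolia", "Abies bifolia", "Canis lupus"],
    "common_names": ["Alpine Fir", "Rocky Mountain Alpine Fir", "Gray Wolf"],
})
observations_toy = pd.DataFrame({
    "scientific_name": ["Abies bifolia", "Canis lupus", "Ursus arctos"],
    "park_name": ["Bryce National Park"] * 3,
    "observations": [109, 30, 12],
})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
merged = species_toy.merge(
    observations_toy, how="outer", on="scientific_name", indicator=True
)
print(merged["_merge"].value_counts())

# A key repeated on the species side repeats the matching observation row,
# so the merged table can be longer than either input.
print(len(species_toy), len(observations_toy), len(merged))
```

If the real `species` table lists a scientific name more than once, the same effect would inflate the row count of `biodiversity`.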


In [3]:
# Get to know the number of columns, their data type and null values
biodiversity.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25632 entries, 0 to 25631
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   category             25632 non-null  object
 1   scientific_name      25632 non-null  object
 2   common_names         25632 non-null  object
 3   conservation_status  880 non-null    object
 4   park_name            25632 non-null  object
 5   observations         25632 non-null  int64 
dtypes: int64(1), object(5)
memory usage: 1.2+ MB


In [4]:
# Number of unique values
for column in biodiversity.columns:
    print(f'{column} unique values:')
    print(biodiversity[column].nunique())

category unique values:
7
scientific_name unique values:
5541
common_names unique values:
5504
conservation_status unique values:
4
park_name unique values:
4
observations unique values:
304


In [5]:
# Unique values for 'category', 'conservation_status' and 'park_name' (the columns with the fewest unique values)
columns_unique_values = ['category', 'conservation_status', 'park_name']

for column in columns_unique_values:
    print(f'{column} unique values:')
    print(biodiversity[column].unique())

category unique values:
['Vascular Plant' 'Nonvascular Plant' 'Bird' 'Amphibian' 'Reptile'
 'Mammal' 'Fish']
conservation_status unique values:
[nan 'Species of Concern' 'Threatened' 'Endangered' 'In Recovery']
park_name unique values:
['Bryce National Park' 'Yellowstone National Park'
 'Great Smoky Mountains National Park' 'Yosemite National Park']


In [119]:
# Show fully duplicated rows (only the second and later copies are flagged)
biodiversity[biodiversity.duplicated()]

,category,scientific_name,common_names,conservation_status,park_name,observations
1668,Vascular Plant,Arctium minus,"Lesser Burdock, Lesser Burrdock",,Yosemite National Park,162
1676,Vascular Plant,Arctium minus,"Beggar's Button, Burdock, Common Burdock, Less...",,Yosemite National Park,162
2741,Vascular Plant,Botrychium virginianum,Rattlesnake Fern,,Yellowstone National Park,232
2749,Vascular Plant,Botrychium virginianum,"Common Grapefern, Rattlesnake Fern",,Yellowstone National Park,232
5638,Vascular Plant,Cichorium intybus,"Chickory, Chicory",,Yellowstone National Park,266
5646,Vascular Plant,Cichorium intybus,"Blue Sailors, Chickory, Chicory, Coffeeweed, C...",,Yellowstone National Park,266
8119,Vascular Plant,Echinochloa crus-galli,Barnyard Grass,,Great Smoky Mountains National Park,62
8127,Vascular Plant,Echinochloa crus-galli,"Barnyard Grass, Barnyardgrass, Cockspur, Japan...",,Great Smoky Mountains National Park,62
8245,Vascular Plant,Eleocharis palustris,Spike-Rush,,Great Smoky Mountains National Park,62
8253,Vascular Plant,Eleocharis palustris,"Common Spikerush, Creeping Spikerush, Creeping...",,Great Smoky Mountains National Park,62
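
By default `duplicated()` flags only the second and later copies of a row, which is why the rows listed here do not appear next to their first copies. A minimal sketch of the `keep` and `subset` options, on a made-up toy frame (`toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy frame; values are made up for illustration.
toy = pd.DataFrame({
    "scientific_name": ["Arctium minus", "Arctium minus", "Arctium minus", "Canis lupus"],
    "common_names": ["Lesser Burdock", "Lesser Burdock", "Common Burdock", "Gray Wolf"],
    "park_name": ["Yosemite National Park"] * 4,
    "observations": [162, 162, 162, 30],
})

# Default keep='first': only later copies of a fully identical row are flagged.
later_copies = toy[toy.duplicated()]

# keep=False flags every member of a duplicate group, including the first copy.
all_copies = toy[toy.duplicated(keep=False)]

# subset= compares only the chosen columns, e.g. the same species in the same park
# even when the common names differ.
same_species_and_park = toy[
    toy.duplicated(subset=["scientific_name", "park_name"], keep=False)
]

print(later_copies.index.tolist())
print(all_copies.index.tolist())
print(same_species_and_park.index.tolist())
```

Which variant is appropriate depends on whether rows that differ only in `common_names` should count as the same record, which is a judgment call for this dataset.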


> ## Exploring missing data:

In [6]:
# Get the % of missing data
max_rows = len(biodiversity)

print('% Missing Data:')
print((1 - biodiversity.count() / max_rows) * 100)

% Missing Data:
category                0.000000
scientific_name         0.000000
common_names            0.000000
conservation_status    96.566792
park_name               0.000000
observations            0.000000
dtype: float64
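
The same percentages can also be computed from the boolean mask returned by `isna()`, since the mean of a boolean column is the fraction of `True` values. A small sketch on a made-up toy frame (`toy` is hypothetical) showing that both forms agree:

```python
import pandas as pd

# Hypothetical toy frame; values are made up for illustration.
toy = pd.DataFrame({
    "category": ["Bird", "Bird", "Mammal", "Fish"],
    "conservation_status": [None, "Endangered", None, None],
})

# Form used in the cell above: 1 minus the share of non-null values.
pct_missing_count = (1 - toy.count() / len(toy)) * 100

# Equivalent form: the mean of the boolean missing-value mask.
pct_missing_mean = toy.isna().mean() * 100

print(pct_missing_mean)
```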


In [7]:
# Get every row where conservation_status is missing
biodiversity[biodiversity.conservation_status.isna()]

,category,scientific_name,common_names,conservation_status,park_name,observations
0,Vascular Plant,Abies bifolia,Rocky Mountain Alpine Fir,,Bryce National Park,109
1,Vascular Plant,Abies bifolia,Rocky Mountain Alpine Fir,,Yellowstone National Park,215
2,Vascular Plant,Abies bifolia,Rocky Mountain Alpine Fir,,Great Smoky Mountains National Park,72
3,Vascular Plant,Abies bifolia,Rocky Mountain Alpine Fir,,Yosemite National Park,136
4,Vascular Plant,Abies concolor,"Balsam Fir, Colorado Fir, Concolor Fir, Silver...",,Great Smoky Mountains National Park,101
...,...,...,...,...,...,...
25627,Nonvascular Plant,Zygodon viridissimus,Zygodon Moss,,Bryce National Park,100
25628,Nonvascular Plant,Zygodon viridissimus var. rupestris,Zygodon Moss,,Yellowstone National Park,237
25629,Nonvascular Plant,Zygodon viridissimus var. rupestris,Zygodon Moss,,Bryce National Park,102
25630,Nonvascular Plant,Zygodon viridissimus var. rupestris,Zygodon Moss,,Yosemite National Park,210


In [11]:
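# Row counts per category within each conservation status (rows with a missing status are dropped by groupby)
biodiversity.category.groupby(biodiversity.conservation_status).value_counts()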

conservation_status  category         
Endangered           Mammal                44
                     Bird                  16
                     Fish                  12
                     Amphibian              4
                     Vascular Plant         4
In Recovery          Bird                  12
                     Mammal                12
Species of Concern   Bird                 320
                     Vascular Plant       172
                     Mammal               168
                     Nonvascular Plant     20
                     Reptile               20
                     Amphibian             16
                     Fish                  16
Threatened           Fish                  20
                     Amphibian              8
                     Mammal                 8
                     Vascular Plant         8
Name: count, dtype: int64
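
These are row counts, and each species appears to occur once per park in the merged table, so they overstate the number of distinct species in each group. A sketch of counting distinct species instead, on a made-up toy frame (`toy` and its values are hypothetical):

```python
import pandas as pd

parks = [
    "Bryce National Park",
    "Great Smoky Mountains National Park",
    "Yellowstone National Park",
    "Yosemite National Park",
]

# Hypothetical toy frame: two species, each listed once per park.
toy = pd.DataFrame({
    "scientific_name": ["Canis lupus"] * 4 + ["Grus americana"] * 4,
    "category": ["Mammal"] * 4 + ["Bird"] * 4,
    "conservation_status": ["Endangered"] * 8,
    "park_name": parks * 2,
})

groups = toy.groupby(["conservation_status", "category"])

# Number of rows per group (one row per species and park).
rows_per_group = groups.size()

# Number of distinct species per group.
species_per_group = groups["scientific_name"].nunique()

print(rows_per_group)
print(species_per_group)
```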

In [18]:
# Number of rows with a missing conservation_status, per category
missing_data_by_category = biodiversity.conservation_status.isnull().groupby(biodiversity.category).sum().reset_index()
print(missing_data_by_category)

            category  conservation_status
0          Amphibian                  300
1               Bird                 2016
2               Fish                  476
3             Mammal                  968
4  Nonvascular Plant                 1312
5            Reptile                  304
6     Vascular Plant                19376
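
Raw counts are dominated by the largest category, so the share of missing values per category may be easier to compare. A minimal sketch using named aggregation on the boolean mask; `toy`, `n_missing` and `share_missing` are hypothetical names and the values are made up.

```python
import pandas as pd

# Hypothetical toy frame; values are made up for illustration.
toy = pd.DataFrame({
    "category": ["Amphibian", "Amphibian",
                 "Vascular Plant", "Vascular Plant", "Vascular Plant", "Vascular Plant"],
    "conservation_status": [None, "Threatened", None, None, None, "Species of Concern"],
})

is_missing = toy["conservation_status"].isna()

# For a boolean mask, sum gives the number of missing rows and
# mean gives the fraction of missing rows in each category.
missing_summary = is_missing.groupby(toy["category"]).agg(
    n_missing="sum", share_missing="mean"
)
print(missing_summary)
```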


In [19]:
# Number of rows with a missing conservation_status, per park
missing_data_by_park = biodiversity.conservation_status.isnull().groupby(biodiversity.park_name).sum().reset_index()
print(missing_data_by_park)

                             park_name  conservation_status
0                  Bryce National Park                 6188
1  Great Smoky Mountains National Park                 6188
2            Yellowstone National Park                 6188
3               Yosemite National Park                 6188


In [117]:
# Non-null counts in each remaining column, per category and park
biodiversity.groupby(['category', 'park_name']).count()

category,park_name,scientific_name,common_names,conservation_status,observations
Amphibian,Bryce National Park,82,82,7,82
Amphibian,Great Smoky Mountains National Park,82,82,7,82
Amphibian,Yellowstone National Park,82,82,7,82
Amphibian,Yosemite National Park,82,82,7,82
Bird,Bryce National Park,591,591,87,591
Bird,Great Smoky Mountains National Park,591,591,87,591
Bird,Yellowstone National Park,591,591,87,591
Bird,Yosemite National Park,591,591,87,591
Fish,Bryce National Park,131,131,12,131
Fish,Great Smoky Mountains National Park,131,131,12,131
