# Biodiversity Dataset

This dataset comes from the National Park Service and covers endangered species in different national parks.

> ## Exploring Data:

First, I want to know how many columns there are, what data type each one has, and how many unique values each contains. From there, I can check whether any data is missing and explore it further to decide how to treat it.

In [14]:
# Import libraries and read both datasets
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

observations = pd.read_csv('observations.csv')
species = pd.read_csv('species_info.csv')

In [15]:
# Merge both datasets into a single one and look at the merged information
biodiversity = species.merge(right = observations, how = 'outer', on = 'scientific_name')
biodiversity.head()

,category,scientific_name,common_names,conservation_status,park_name,observations
0,Vascular Plant,Abies bifolia,Rocky Mountain Alpine Fir,,Bryce National Park,109
1,Vascular Plant,Abies bifolia,Rocky Mountain Alpine Fir,,Yellowstone National Park,215
2,Vascular Plant,Abies bifolia,Rocky Mountain Alpine Fir,,Great Smoky Mountains National Park,72
3,Vascular Plant,Abies bifolia,Rocky Mountain Alpine Fir,,Yosemite National Park,136
4,Vascular Plant,Abies concolor,"Balsam Fir, Colorado Fir, Concolor Fir, Silver...",,Great Smoky Mountains National Park,101
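
An outer merge keeps rows from both sides even when a `scientific_name` has no match, so it can be worth checking how many rows actually matched. The sketch below uses pandas' `indicator=True` option on two small made-up frames (`species_demo` and `observations_demo` are hypothetical stand-ins, not the real CSV files).

```python
import pandas as pd

# Toy stand-ins for species_info.csv and observations.csv (hypothetical values)
species_demo = pd.DataFrame({
    'scientific_name': ['Abies bifolia', 'Abies concolor', 'Acer rubrum'],
    'category': ['Vascular Plant', 'Vascular Plant', 'Vascular Plant'],
})
observations_demo = pd.DataFrame({
    'scientific_name': ['Abies bifolia', 'Abies concolor', 'Zea mays'],
    'park_name': ['Bryce National Park', 'Yosemite National Park', 'Bryce National Park'],
    'observations': [109, 136, 50],
})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
merged_demo = species_demo.merge(observations_demo, how='outer',
                                 on='scientific_name', indicator=True)

# Count how many rows matched on both sides versus only one side
merge_summary = merged_demo['_merge'].value_counts()
print(merge_summary)
```

If every row of the real merge came back as `both`, the outer join would be equivalent to an inner join for this data.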


In [16]:
# Get to know the number of columns, their data type and null values
biodiversity.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25632 entries, 0 to 25631
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   category             25632 non-null  object
 1   scientific_name      25632 non-null  object
 2   common_names         25632 non-null  object
 3   conservation_status  880 non-null    object
 4   park_name            25632 non-null  object
 5   observations         25632 non-null  int64 
dtypes: int64(1), object(5)
memory usage: 1.2+ MB


In [17]:
# Number of unique values
for column in biodiversity.columns:
    print(f'{column} unique values:')
    print(biodiversity[column].nunique())

category unique values:
7
scientific_name unique values:
5541
common_names unique values:
5504
conservation_status unique values:
4
park_name unique values:
4
observations unique values:
304
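
The loop above reports 4 unique values for `conservation_status` because `nunique()` skips missing values by default. As a minimal sketch on a made-up frame (`demo` is hypothetical), the same counts can be obtained in one call, and `dropna=False` makes the missing value count as its own entry.

```python
import pandas as pd

# Small made-up frame with missing conservation statuses
demo = pd.DataFrame({
    'category': ['Bird', 'Bird', 'Fish', 'Mammal'],
    'conservation_status': ['Endangered', None, None, 'Threatened'],
})

# DataFrame.nunique() returns one count per column; missing values are excluded by default
unique_counts = demo.nunique()

# dropna=False counts the missing value as one additional unique value
unique_counts_with_nan = demo.nunique(dropna=False)

print(unique_counts)
print(unique_counts_with_nan)
```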


In [18]:
# List the unique values of 'category', 'conservation_status' and 'park_name' (the columns with the fewest unique values)
columns_unique_values = ['category', 'conservation_status', 'park_name']

for column in columns_unique_values:
    print(f'{column} unique values:')
    print(biodiversity[column].unique())

category unique values:
['Vascular Plant' 'Nonvascular Plant' 'Bird' 'Amphibian' 'Reptile'
 'Mammal' 'Fish']
conservation_status unique values:
[nan 'Species of Concern' 'Threatened' 'Endangered' 'In Recovery']
park_name unique values:
['Bryce National Park' 'Yellowstone National Park'
 'Great Smoky Mountains National Park' 'Yosemite National Park']


In [19]:
# Show duplicated rows (every copy after the first occurrence)
biodiversity[biodiversity.duplicated()]

,category,scientific_name,common_names,conservation_status,park_name,observations
1668,Vascular Plant,Arctium minus,"Lesser Burdock, Lesser Burrdock",,Yosemite National Park,162
1676,Vascular Plant,Arctium minus,"Beggar's Button, Burdock, Common Burdock, Less...",,Yosemite National Park,162
2741,Vascular Plant,Botrychium virginianum,Rattlesnake Fern,,Yellowstone National Park,232
2749,Vascular Plant,Botrychium virginianum,"Common Grapefern, Rattlesnake Fern",,Yellowstone National Park,232
5638,Vascular Plant,Cichorium intybus,"Chickory, Chicory",,Yellowstone National Park,266
5646,Vascular Plant,Cichorium intybus,"Blue Sailors, Chickory, Chicory, Coffeeweed, C...",,Yellowstone National Park,266
8119,Vascular Plant,Echinochloa crus-galli,Barnyard Grass,,Great Smoky Mountains National Park,62
8127,Vascular Plant,Echinochloa crus-galli,"Barnyard Grass, Barnyardgrass, Cockspur, Japan...",,Great Smoky Mountains National Park,62
8245,Vascular Plant,Eleocharis palustris,Spike-Rush,,Great Smoky Mountains National Park,62
8253,Vascular Plant,Eleocharis palustris,"Common Spikerush, Creeping Spikerush, Creeping...",,Great Smoky Mountains National Park,62
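
By default `duplicated()` flags only the second and later copies of a row, so the first occurrence of each duplicate does not appear in the output above. A minimal sketch on made-up rows (`demo` is hypothetical) shows how `keep=False` lists every copy side by side.

```python
import pandas as pd

# Made-up rows: index 0, 1 and 3 are identical
demo = pd.DataFrame({
    'scientific_name': ['Arctium minus', 'Arctium minus', 'Abies bifolia', 'Arctium minus'],
    'park_name': ['Yosemite National Park', 'Yosemite National Park',
                  'Bryce National Park', 'Yosemite National Park'],
    'observations': [162, 162, 109, 162],
})

# Default keep='first': only the second and later copies are flagged
later_copies = demo[demo.duplicated()]

# keep=False: every member of a duplicated group is flagged
all_copies = demo[demo.duplicated(keep=False)]

print(later_copies)
print(all_copies)
```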


In [20]:
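# Preview the data without duplicated rows (this returns a new DataFrame; biodiversity itself is not modified)
biodiversity.drop_duplicates()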

,category,scientific_name,common_names,conservation_status,park_name,observations
0,Vascular Plant,Abies bifolia,Rocky Mountain Alpine Fir,,Bryce National Park,109
1,Vascular Plant,Abies bifolia,Rocky Mountain Alpine Fir,,Yellowstone National Park,215
2,Vascular Plant,Abies bifolia,Rocky Mountain Alpine Fir,,Great Smoky Mountains National Park,72
3,Vascular Plant,Abies bifolia,Rocky Mountain Alpine Fir,,Yosemite National Park,136
4,Vascular Plant,Abies concolor,"Balsam Fir, Colorado Fir, Concolor Fir, Silver...",,Great Smoky Mountains National Park,101
...,...,...,...,...,...,...
25627,Nonvascular Plant,Zygodon viridissimus,Zygodon Moss,,Bryce National Park,100
25628,Nonvascular Plant,Zygodon viridissimus var. rupestris,Zygodon Moss,,Yellowstone National Park,237
25629,Nonvascular Plant,Zygodon viridissimus var. rupestris,Zygodon Moss,,Bryce National Park,102
25630,Nonvascular Plant,Zygodon viridissimus var. rupestris,Zygodon Moss,,Yosemite National Park,210
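
`drop_duplicates()` returns a new DataFrame and leaves the original untouched unless the result is assigned back. Whether the duplicates should be removed permanently here is a judgment call. The sketch below, on a made-up frame (`demo` and `deduplicated` are hypothetical names), only illustrates the difference.

```python
import pandas as pd

# Made-up frame with one repeated row
demo = pd.DataFrame({
    'scientific_name': ['Arctium minus', 'Arctium minus', 'Abies bifolia'],
    'observations': [162, 162, 109],
})

# Calling drop_duplicates() without assigning the result leaves demo unchanged
demo.drop_duplicates()
rows_before = len(demo)

# Assign the result (and rebuild the index) to keep the change
deduplicated = demo.drop_duplicates().reset_index(drop=True)
rows_after = len(deduplicated)

print(rows_before, rows_after)
```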


> ## Exploring missing data:

In [21]:
# Get the % of missing data
max_rows = len(biodiversity)

print('% Missing Data:')
print((1 - biodiversity.count() / max_rows) * 100)

% Missing Data:
category                0.000000
scientific_name         0.000000
common_names            0.000000
conservation_status    96.566792
park_name               0.000000
observations            0.000000
dtype: float64
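
The figure for `conservation_status` follows from the counts reported by `info()`: 880 non-null values out of 25,632 rows gives (1 - 880 / 25632) * 100 ≈ 96.57. An equivalent shortcut, sketched here on a made-up frame (`demo` is hypothetical), is to take the mean of `isna()`.

```python
import numpy as np
import pandas as pd

# Made-up frame where 3 of 4 statuses are missing
demo = pd.DataFrame({
    'category': ['Bird', 'Fish', 'Mammal', 'Reptile'],
    'conservation_status': ['Endangered', np.nan, np.nan, np.nan],
})

# isna() gives a boolean frame; the mean of booleans is the share of True values
missing_pct = demo.isna().mean() * 100
print(missing_pct)
```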


In [22]:
# Get every row where Conservation Status is missing
biodiversity[biodiversity.conservation_status.isna()]

,category,scientific_name,common_names,conservation_status,park_name,observations
0,Vascular Plant,Abies bifolia,Rocky Mountain Alpine Fir,,Bryce National Park,109
1,Vascular Plant,Abies bifolia,Rocky Mountain Alpine Fir,,Yellowstone National Park,215
2,Vascular Plant,Abies bifolia,Rocky Mountain Alpine Fir,,Great Smoky Mountains National Park,72
3,Vascular Plant,Abies bifolia,Rocky Mountain Alpine Fir,,Yosemite National Park,136
4,Vascular Plant,Abies concolor,"Balsam Fir, Colorado Fir, Concolor Fir, Silver...",,Great Smoky Mountains National Park,101
...,...,...,...,...,...,...
25627,Nonvascular Plant,Zygodon viridissimus,Zygodon Moss,,Bryce National Park,100
25628,Nonvascular Plant,Zygodon viridissimus var. rupestris,Zygodon Moss,,Yellowstone National Park,237
25629,Nonvascular Plant,Zygodon viridissimus var. rupestris,Zygodon Moss,,Bryce National Park,102
25630,Nonvascular Plant,Zygodon viridissimus var. rupestris,Zygodon Moss,,Yosemite National Park,210


In [79]:
# Keep only the rows where Conservation Status is not null
filtered_data = biodiversity[['category', 'park_name', 'conservation_status']]
filtered_data = filtered_data[filtered_data.conservation_status.notna()]

# Count how many rows of each Conservation Status there are in each Category and Park
grouped_data = filtered_data.groupby(['category', 'park_name', 'conservation_status']).size().reset_index(name = 'count')

In [81]:
# Pivot grouped data to find patterns
pivoted_data = grouped_data.pivot_table(index = ['category', 'park_name'], columns = 'conservation_status', values = 'count', fill_value = 0)
pivoted_data

category,park_name,Endangered,In Recovery,Species of Concern,Threatened
Amphibian,Bryce National Park,1.0,0.0,4.0,2.0
Amphibian,Great Smoky Mountains National Park,1.0,0.0,4.0,2.0
Amphibian,Yellowstone National Park,1.0,0.0,4.0,2.0
Amphibian,Yosemite National Park,1.0,0.0,4.0,2.0
Bird,Bryce National Park,4.0,3.0,80.0,0.0
Bird,Great Smoky Mountains National Park,4.0,3.0,80.0,0.0
Bird,Yellowstone National Park,4.0,3.0,80.0,0.0
Bird,Yosemite National Park,4.0,3.0,80.0,0.0
Fish,Bryce National Park,3.0,0.0,4.0,5.0
Fish,Great Smoky Mountains National Park,3.0,0.0,4.0,5.0
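
The counts above are shown as floats (1.0, 4.0) because `pivot_table` aggregates with the mean by default. One possible alternative, sketched on a made-up frame (`demo` and `status_counts` are hypothetical names), is to unstack the grouped sizes directly, which keeps the counts as integers.

```python
import pandas as pd

# Made-up rows with a known conservation status
demo = pd.DataFrame({
    'category': ['Bird', 'Bird', 'Bird', 'Fish', 'Fish'],
    'park_name': ['Bryce National Park', 'Bryce National Park', 'Bryce National Park',
                  'Yosemite National Park', 'Yosemite National Park'],
    'conservation_status': ['Endangered', 'Species of Concern', 'Species of Concern',
                            'Threatened', 'Threatened'],
})

# Count rows per (category, park, status), then move the status level into columns.
# fill_value=0 fills combinations that never occur and keeps integer counts.
status_counts = (demo.groupby(['category', 'park_name', 'conservation_status'])
                     .size()
                     .unstack(fill_value=0))
print(status_counts)
```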


In [41]:
# Create a new dataframe with only the columns needed to study missing Conservation Status values
missing_data = biodiversity[['category', 'park_name', 'conservation_status']].reset_index()
missing_data.head()

,index,category,park_name,conservation_status
0,0,Vascular Plant,Bryce National Park,
1,1,Vascular Plant,Yellowstone National Park,
2,2,Vascular Plant,Great Smoky Mountains National Park,
3,3,Vascular Plant,Yosemite National Park,
4,4,Vascular Plant,Great Smoky Mountains National Park,


In [43]:
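# Group by Category and Park Name to look for a trend in missing Conservation Status.
# Note: x.isnull().count() counts every row in the group (null or not), so the values
# below are group sizes; x.isnull().sum() would count only the missing values.
null_counts = missing_data.groupby(['category', 'park_name'])['conservation_status'].apply(lambda x: x.isnull().count()).reset_index()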
null_counts

,category,park_name,conservation_status
0,Amphibian,Bryce National Park,82
1,Amphibian,Great Smoky Mountains National Park,82
2,Amphibian,Yellowstone National Park,82
3,Amphibian,Yosemite National Park,82
4,Bird,Bryce National Park,591
5,Bird,Great Smoky Mountains National Park,591
6,Bird,Yellowstone National Park,591
7,Bird,Yosemite National Park,591
8,Fish,Bryce National Park,131
9,Fish,Great Smoky Mountains National Park,131
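
On a boolean Series, `count()` and `sum()` answer different questions: `count()` returns the number of non-missing entries, which for the result of `isnull()` is always the full group size, while `sum()` adds up the `True` values. A minimal sketch on a made-up frame (`demo`, `group_sizes` and `null_counts_demo` are hypothetical names) shows the difference.

```python
import numpy as np
import pandas as pd

# Made-up frame: Bird has 2 missing statuses out of 3 rows, Fish has 1 out of 2
demo = pd.DataFrame({
    'category': ['Bird', 'Bird', 'Bird', 'Fish', 'Fish'],
    'conservation_status': ['Endangered', np.nan, np.nan, np.nan, 'Threatened'],
})

grouped = demo.groupby('category')['conservation_status']

# isnull() yields True/False with no missing entries, so count() is the group size
group_sizes = grouped.apply(lambda x: x.isnull().count())

# sum() adds up the True values, i.e. the number of missing statuses per group
null_counts_demo = grouped.apply(lambda x: x.isnull().sum())

print(group_sizes)
print(null_counts_demo)
```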
