# Biodiversity Dataset

This dataset comes from the National Park Service and records species observations in different national parks, along with each species' conservation status.

> ## Exploring Data:

First, I want to know how many columns there are, what type each one is, and how many unique values (and which ones) each contains. From there, I can tell whether any data is missing and explore it further to decide how to treat it.

In [36]:
# Import libraries and read both datasets
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

observations = pd.read_csv('observations.csv')
species = pd.read_csv('species_info.csv')

In [37]:
# Merge both datasets into a single one and look at the merged information
biodiversity = species.merge(right=observations, how='outer', on='scientific_name')
biodiversity.head()

|  | category | scientific_name | common_names | conservation_status | park_name | observations |
|---|---|---|---|---|---|---|
| 0 | Vascular Plant | Abies bifolia | Rocky Mountain Alpine Fir | NaN | Bryce National Park | 109 |
| 1 | Vascular Plant | Abies bifolia | Rocky Mountain Alpine Fir | NaN | Yellowstone National Park | 215 |
| 2 | Vascular Plant | Abies bifolia | Rocky Mountain Alpine Fir | NaN | Great Smoky Mountains National Park | 72 |
| 3 | Vascular Plant | Abies bifolia | Rocky Mountain Alpine Fir | NaN | Yosemite National Park | 136 |
| 4 | Vascular Plant | Abies concolor | Balsam Fir, Colorado Fir, Concolor Fir, Silver... | NaN | Great Smoky Mountains National Park | 101 |
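An outer merge keeps rows that appear in only one of the two tables, and a name repeated in the species table multiplies its matching observation rows. A minimal sketch of how both effects could be checked, using pandas' `indicator=True` option on made-up toy frames (`species_demo`, `observations_demo`, and their rows are hypothetical stand-ins, not the real CSV contents):

```python
import pandas as pd

# Toy stand-ins for the two CSV files (made-up rows, for illustration only).
species_demo = pd.DataFrame({
    'scientific_name': ['Species A', 'Species B', 'Species C', 'Species C'],
    'category': ['Vascular Plant', 'Vascular Plant', 'Mammal', 'Mammal'],
})
observations_demo = pd.DataFrame({
    'scientific_name': ['Species A', 'Species C', 'Species D'],
    'park_name': ['Park X', 'Park Y', 'Park X'],
    'observations': [10, 20, 30],
})

# indicator=True adds a '_merge' column recording where each row came from.
merged_demo = species_demo.merge(
    observations_demo, how='outer', on='scientific_name', indicator=True
)

# 'both' = matched; 'left_only' / 'right_only' = present in one table only.
merge_counts = merged_demo['_merge'].value_counts()
print(merge_counts)

# Names repeated in the species table duplicate the matching observation rows.
duplicated_names = species_demo['scientific_name'].duplicated().sum()
print(duplicated_names)
```

If every row of the real merge came back as `both`, the outer join would behave like an inner join here; that is something to verify on the actual data rather than assume.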


In [38]:
# Inspect the columns, their data types, and their non-null counts
biodiversity.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25632 entries, 0 to 25631
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   category             25632 non-null  object
 1   scientific_name      25632 non-null  object
 2   common_names         25632 non-null  object
 3   conservation_status  880 non-null    object
 4   park_name            25632 non-null  object
 5   observations         25632 non-null  int64 
dtypes: int64(1), object(5)
memory usage: 1.2+ MB
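The `info()` output above shows that only `conservation_status` has missing values. As a quick worked check, the share of missing rows follows directly from the two counts it reports; the second half of the sketch shows one way to get the same share per column with `isna().mean()`, on a small made-up frame (`demo` is hypothetical):

```python
import numpy as np
import pandas as pd

# Counts read from the info() output above.
total_rows = 25632
non_null_status = 880

missing_status = total_rows - non_null_status
missing_pct = round(missing_status / total_rows * 100, 1)
print(missing_status, missing_pct)  # 24752 96.6

# Same idea computed from a frame: the mean of a boolean mask is the share of True.
demo = pd.DataFrame({
    'conservation_status': [np.nan, 'Threatened', np.nan, np.nan],
    'observations': [109, 215, 72, 136],
})
missing_share = demo.isna().mean()
print(missing_share)
```

So roughly 96.6% of the merged rows carry no conservation status, which is why that column gets a closer look later on.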


In [61]:
# Number of unique values
for column in biodiversity.columns:
    print(f'{column} unique values:')
    print(biodiversity[column].nunique())

category unique values:
7
scientific_name unique values:
5541
common_names unique values:
5504
conservation_status unique values:
4
park_name unique values:
4
observations unique values:
304
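The per-column checks so far (data type, non-null count, number of unique values) could also be gathered into a single table. A minimal sketch, assuming a helper named `summarize_columns` (hypothetical, not part of pandas) and a small made-up frame:

```python
import numpy as np
import pandas as pd

def summarize_columns(df):
    """Return one row per column with its dtype, non-null count and unique count."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'non_null': df.notna().sum(),
        'n_unique': df.nunique(),  # NaN is not counted as a value by default
    })

# Made-up rows, for illustration only.
demo = pd.DataFrame({
    'category': ['Mammal', 'Bird', 'Bird', 'Fish'],
    'conservation_status': [np.nan, 'Threatened', np.nan, np.nan],
    'observations': [109, 215, 72, 136],
})

summary = summarize_columns(demo)
print(summary)
```

Note that `nunique()` ignores `NaN` by default, which matches the 4 reported above for `conservation_status` even though missing values are present.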


In [62]:
# Unique values for 'category', 'conservation_status' and 'park_name' (the columns with the fewest unique values)
columns_unique_values = ['category', 'conservation_status', 'park_name']

for column in columns_unique_values:
    print(f'{column} unique values:')
    print(biodiversity[column].unique())

category unique values:
['Vascular Plant' 'Nonvascular Plant' 'Bird' 'Amphibian' 'Reptile'
 'Mammal' 'Fish']
conservation_status unique values:
[nan 'Species of Concern' 'Threatened' 'Endangered' 'In Recovery']
park_name unique values:
['Bryce National Park' 'Yellowstone National Park'
 'Great Smoky Mountains National Park' 'Yosemite National Park']


> ## Exploring missing data:

In [53]:
biodiversity[biodiversity.conservation_status.isna()]

|  | category | scientific_name | common_names | conservation_status | park_name | observations |
|---|---|---|---|---|---|---|
| 0 | Vascular Plant | Abies bifolia | Rocky Mountain Alpine Fir | NaN | Bryce National Park | 109 |
| 1 | Vascular Plant | Abies bifolia | Rocky Mountain Alpine Fir | NaN | Yellowstone National Park | 215 |
| 2 | Vascular Plant | Abies bifolia | Rocky Mountain Alpine Fir | NaN | Great Smoky Mountains National Park | 72 |
| 3 | Vascular Plant | Abies bifolia | Rocky Mountain Alpine Fir | NaN | Yosemite National Park | 136 |
| 4 | Vascular Plant | Abies concolor | Balsam Fir, Colorado Fir, Concolor Fir, Silver... | NaN | Great Smoky Mountains National Park | 101 |
| ... | ... | ... | ... | ... | ... | ... |
| 25627 | Nonvascular Plant | Zygodon viridissimus | Zygodon Moss | NaN | Bryce National Park | 100 |
| 25628 | Nonvascular Plant | Zygodon viridissimus var. rupestris | Zygodon Moss | NaN | Yellowstone National Park | 237 |
| 25629 | Nonvascular Plant | Zygodon viridissimus var. rupestris | Zygodon Moss | NaN | Bryce National Park | 102 |
| 25630 | Nonvascular Plant | Zygodon viridissimus var. rupestris | Zygodon Moss | NaN | Yosemite National Park | 210 |
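Listing the rows with a missing status shows what they look like, but not how they are distributed. One possible next step, sketched here on a made-up frame (`demo` and its rows are hypothetical), is to count `NaN` alongside the real statuses and to compare the missing share across categories:

```python
import numpy as np
import pandas as pd

# Made-up rows, for illustration only.
demo = pd.DataFrame({
    'category': ['Mammal', 'Mammal', 'Bird', 'Bird', 'Fish'],
    'conservation_status': ['Endangered', np.nan, np.nan, 'Threatened', np.nan],
})

# dropna=False keeps NaN as its own entry in the counts.
status_counts = demo['conservation_status'].value_counts(dropna=False)
print(status_counts)

# Share of rows without a status, per category.
missing_by_category = (
    demo['conservation_status'].isna().groupby(demo['category']).mean()
)
print(missing_by_category)
```

Whether a missing status should be read as "no conservation concern" is an assumption that would need checking against the dataset's documentation before filling the values.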
