### INTRODUCTION

The goal of this project is to analyse biodiversity data from the National Park Service, specifically the species observed in 4 different national parks.

The data was collected to keep a record of at-risk species, in order to maintain the level of biodiversity within the parks. Hence, the objectives of the project are to understand the relationships between the species and their parks and the distribution of conservation statuses.

The project aims to answer:
<ul>
  <li>What is the distribution of conservation status for species?</li>
  <li>Are certain species more likely to be endangered?</li>
  <li>Which parks stand out in terms of hosting unique or rare species?</li>
</ul>

### IMPORTING PYTHON MODULES

First, import the libraries that will be used in this project.

In [None]:
import pandas as pd
import numpy as np

### IMPORTING DATA

**SPECIES**

The *species_info.csv* contains information on different species in the National Parks.


In [None]:
species = pd.read_csv('Downloads/species_info.csv')
species.head()

Unnamed: 0,category,scientific_name,common_names,conservation_status
0,Mammal,Clethrionomys gapperi gapperi,Gapper's Red-Backed Vole,
1,Mammal,Bos bison,"American Bison, Bison",
2,Mammal,Bos taurus,"Aurochs, Aurochs, Domestic Cattle (Feral), Dom...",
3,Mammal,Ovis aries,"Domestic Sheep, Mouflon, Red Sheep, Sheep (Feral)",
4,Mammal,Cervus elaphus,Wapiti Or Elk,


**OBSERVATION**

The *observations.csv* file contains recorded sightings of different species throughout the national parks over one week.

In [None]:
observations = pd.read_csv('Downloads/observations.csv')
observations.head()

Unnamed: 0,scientific_name,park_name,observations
0,Vicia benghalensis,Great Smoky Mountains National Park,68
1,Neovison vison,Great Smoky Mountains National Park,77
2,Prunus subcordata,Yosemite National Park,138
3,Abutilon theophrasti,Bryce National Park,84
4,Githopsis specularioides,Great Smoky Mountains National Park,85


### CLEANING DATA

**Check for missing data**

In [None]:
for col in species.columns:
    pct_missing = np.mean(species[col].isnull())
    print('{} - {}%'.format(col, round(pct_missing*100)))

category - 0%
scientific_name - 0%
common_names - 0%
conservation_status - 97%


**97%** of species do not have a conservation status.


**Convert the NaN to another value**

In [None]:
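# Assign the result back; calling fillna(..., inplace=True) on a selected column is deprecated in recent pandas
species['conservation_status'] = species['conservation_status'].fillna('Least Concern')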

In [None]:
for col in observations.columns:
    pct_missing = np.mean(observations[col].isnull())
    print('{} - {}%'.format(col, round(pct_missing*100)))

scientific_name - 0%
park_name - 0%
observations - 0%
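
The per-column loops above can be condensed into a single expression. The following is a minimal sketch on a small made-up frame (`toy_species` and its rows are hypothetical stand-ins, not the real data); it also shows the assignment form of `fillna`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for the species table
toy_species = pd.DataFrame({
    'category': ['Mammal', 'Bird', 'Bird', 'Fish'],
    'scientific_name': ['Canis lupus', 'Aquila chrysaetos',
                        'Buteo lineatus', 'Salmo trutta'],
    'conservation_status': ['Endangered', np.nan, np.nan, np.nan],
})

# Share of missing values per column, as a rounded percentage
pct_missing = toy_species.isnull().mean().mul(100).round()
print(pct_missing)

# Replace missing statuses by assigning the filled column back
toy_species['conservation_status'] = (
    toy_species['conservation_status'].fillna('Least Concern')
)
print(toy_species['conservation_status'].isnull().sum())
```

On the real data the same expression would be `species.isnull().mean().mul(100).round()`.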


**Check for duplicates**

In [None]:
print(species.duplicated())

0       False
1       False
2       False
3       False
4       False
        ...  
5819    False
5820    False
5821    False
5822    False
5823    False
Length: 5824, dtype: bool


In [None]:
print(observations.duplicated())

0        False
1        False
2        False
3        False
4        False
         ...  
23291    False
23292    False
23293    False
23294    False
23295    False
Length: 23296, dtype: bool
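
Printing the boolean Series hides most rows, so it is hard to tell whether any duplicate exists. A possible alternative is to sum the flags, and to check the key column on its own, since *species* has 5824 rows but only 5541 unique scientific names. This is a sketch on made-up rows (`toy_species` is hypothetical):

```python
import pandas as pd

# Hypothetical toy rows: the same species listed twice with different common names
toy_species = pd.DataFrame({
    'category': ['Mammal', 'Mammal', 'Bird'],
    'scientific_name': ['Canis lupus', 'Canis lupus', 'Aquila chrysaetos'],
    'common_names': ['Gray Wolf', 'Gray Wolf, Wolf', 'Golden Eagle'],
})

# Rows that are identical in every column
full_row_dupes = toy_species.duplicated().sum()

# Rows that repeat an earlier scientific_name, even if other columns differ
name_dupes = toy_species.duplicated(subset='scientific_name').sum()

print(f"fully duplicated rows: {full_row_dupes}")
print(f"repeated scientific names: {name_dupes}")
```

Repeated names matter later on, because each repeat is matched again when *species* is merged with *observations*.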


**Check for data dimension**

In *species*, there are **5824 rows** and **4 columns**, and there are **23296 rows** and **3 columns** in *observations*.


In [None]:
print(f"species shape: {species.shape}")
print(f"observations shape: {observations.shape}")

species shape: (5824, 4)
observations shape: (23296, 3)


### EXPLORING DATA

**SPECIES**

How many categories and species are there in *species*?

There are 7 categories in total, covering both animals and plants. Vascular plants form the largest category with 4470 entries, and reptiles the smallest with 79.

In [None]:
print(f"Number of categories: {species['category'].nunique()}")
print(f"Number of species: {species['scientific_name'].nunique()}")
print(species.groupby('category').size().sort_values())

Number of categories: 7
Number of species: 5541
category
Reptile                79
Amphibian              80
Fish                  127
Mammal                214
Nonvascular Plant     333
Bird                  521
Vascular Plant       4470
dtype: int64


What are the conservation statuses?

In [None]:
print(species.groupby('conservation_status').size().sort_values())

conservation_status
In Recovery              4
Threatened              10
Endangered              16
Species of Concern     161
Least Concern         5633
dtype: int64
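
To read the distribution as shares rather than raw counts, the counts printed above can be turned into percentages. The sketch below rebuilds the Series from those printed counts (the name `status_counts` is introduced here for illustration):

```python
import pandas as pd

# Counts copied from the groupby output above
status_counts = pd.Series({
    'In Recovery': 4,
    'Threatened': 10,
    'Endangered': 16,
    'Species of Concern': 161,
    'Least Concern': 5633,
}, name='count')

# Percentage of all species rows in each status
share = (status_counts / status_counts.sum() * 100).round(2)
print(share)
```

On the real frame, `species['conservation_status'].value_counts(normalize=True)` should give the same proportions directly.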


**OBSERVATIONS**

Total observations in one week


In [None]:
print(f"Number of observations: {observations['observations'].sum()}")

Number of observations: 3314739


A total of **3314739** observations were recorded across the 4 parks.

What are the 4 parks?

In [None]:
print(f"Unique parks: {observations['park_name'].unique()}")

Unique parks: ['Great Smoky Mountains National Park' 'Yosemite National Park'
 'Bryce National Park' 'Yellowstone National Park']
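
A natural follow-up is how the sightings split between parks. This sketch applies the idea to only the five rows shown by `observations.head()` earlier, so the totals are illustrative and not the real park totals:

```python
import pandas as pd

# The five rows printed by observations.head(), copied by hand
sample = pd.DataFrame({
    'scientific_name': ['Vicia benghalensis', 'Neovison vison',
                        'Prunus subcordata', 'Abutilon theophrasti',
                        'Githopsis specularioides'],
    'park_name': ['Great Smoky Mountains National Park',
                  'Great Smoky Mountains National Park',
                  'Yosemite National Park',
                  'Bryce National Park',
                  'Great Smoky Mountains National Park'],
    'observations': [68, 77, 138, 84, 85],
})

# Sum the sightings per park, largest first
per_park = (
    sample.groupby('park_name')['observations']
    .sum()
    .sort_values(ascending=False)
)
print(per_park)
```

Replacing `sample` with `observations` would give the per-park totals for the full data set.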


### ANALYSIS

Are certain types of species more likely to be endangered?

In [None]:
conservation_category = (
    species[species.conservation_status != 'Least Concern']
    .groupby(['conservation_status', 'category'])['scientific_name']
    .count()
)
print(conservation_category)

conservation_status  category         
Endangered           Amphibian             1
                     Bird                  4
                     Fish                  3
                     Mammal                7
                     Vascular Plant        1
In Recovery          Bird                  3
                     Mammal                1
Species of Concern   Amphibian             4
                     Bird                 72
                     Fish                  4
                     Mammal               28
                     Nonvascular Plant     5
                     Reptile               5
                     Vascular Plant       43
Threatened           Amphibian             2
                     Fish                  4
                     Mammal                2
                     Vascular Plant        2
Name: scientific_name, dtype: int64


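By raw count, **Bird** (79) has the most protected entries, followed by **Vascular Plant** (46) and **Mammal** (38). Relative to category size, however, **Mammal** (38 of 214, about 18%) and **Bird** (79 of 521, about 15%) are far more likely to be protected than **Vascular Plant** (46 of 4470, about 1%).

In the *In Recovery* status, there are 3 Birds and 1 Mammal, which may suggest that Birds are recovering better, although the counts are too small to be conclusive.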
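
Because the categories differ greatly in size, "more likely to be endangered" is better judged by a rate than by a count. One way to compute it, sketched on made-up rows (`toy_species` is hypothetical and much smaller than the real table):

```python
import pandas as pd

# Hypothetical toy rows: 2 mammals, 4 birds, 5 vascular plants
toy_species = pd.DataFrame({
    'category': ['Mammal'] * 2 + ['Bird'] * 4 + ['Vascular Plant'] * 5,
    'conservation_status': (
        ['Endangered', 'Least Concern']
        + ['Species of Concern'] + ['Least Concern'] * 3
        + ['Threatened'] + ['Least Concern'] * 4
    ),
})

# Anything other than 'Least Concern' counts as protected
toy_species['is_protected'] = toy_species['conservation_status'] != 'Least Concern'

# Protected count, category size, and protection rate per category
protection_rate = (
    toy_species.groupby('category')['is_protected']
    .agg(protected='sum', total='count', rate='mean')
    .sort_values('rate', ascending=False)
)
print(protection_rate)
```

The mean of a boolean column is the fraction of `True` values, which is why `rate='mean'` gives the protection rate.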

Create a boolean column, *is_protected*, that flags every species whose status is anything other than *Least Concern*.

In [None]:
species['is_protected'] = species.conservation_status != 'Least Concern'

Let's look at the number of Bird and Mammal sightings at the different parks.

First, flag which rows of *species* refer to mammals and which to birds.


In [None]:
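species['is_mammal'] = species['category'] == 'Mammal'
species['is_bird'] = species['category'] == 'Bird'
# is_va_plant appears in the outputs below; this definition is assumed from the column name
species['is_va_plant'] = species['category'] == 'Vascular Plant'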

Filter the data to the rows where **is_mammal** is true

In [None]:
species[species.is_mammal].head(n=10)

Unnamed: 0,category,scientific_name,common_names,conservation_status,is_protected,is_mammal,is_bird,is_va_plant
0,Mammal,Clethrionomys gapperi gapperi,Gapper's Red-Backed Vole,Least Concern,False,True,False,False
1,Mammal,Bos bison,"American Bison, Bison",Least Concern,False,True,False,False
2,Mammal,Bos taurus,"Aurochs, Aurochs, Domestic Cattle (Feral), Dom...",Least Concern,False,True,False,False
3,Mammal,Ovis aries,"Domestic Sheep, Mouflon, Red Sheep, Sheep (Feral)",Least Concern,False,True,False,False
4,Mammal,Cervus elaphus,Wapiti Or Elk,Least Concern,False,True,False,False
5,Mammal,Odocoileus virginianus,White-Tailed Deer,Least Concern,False,True,False,False
6,Mammal,Sus scrofa,"Feral Hog, Wild Pig",Least Concern,False,True,False,False
7,Mammal,Canis latrans,Coyote,Species of Concern,True,True,False,False
8,Mammal,Canis lupus,Gray Wolf,Endangered,True,True,False,False
9,Mammal,Canis rufus,Red Wolf,Endangered,True,True,False,False


Merge the result with *observations* to create a **DataFrame** of mammal observations across the 4 parks.

In [None]:
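# Join on the shared key explicitly rather than relying on the inferred common column
mammal_observations = species[species.is_mammal].merge(observations, on='scientific_name')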
mammal_observations.head(n=10)


Unnamed: 0,category,scientific_name,common_names,conservation_status,is_protected,is_mammal,is_bird,is_va_plant,park_name,observations
0,Mammal,Clethrionomys gapperi gapperi,Gapper's Red-Backed Vole,Least Concern,False,True,False,False,Bryce National Park,130
1,Mammal,Clethrionomys gapperi gapperi,Gapper's Red-Backed Vole,Least Concern,False,True,False,False,Yellowstone National Park,270
2,Mammal,Clethrionomys gapperi gapperi,Gapper's Red-Backed Vole,Least Concern,False,True,False,False,Great Smoky Mountains National Park,98
3,Mammal,Clethrionomys gapperi gapperi,Gapper's Red-Backed Vole,Least Concern,False,True,False,False,Yosemite National Park,117
4,Mammal,Bos bison,"American Bison, Bison",Least Concern,False,True,False,False,Yosemite National Park,128
5,Mammal,Bos bison,"American Bison, Bison",Least Concern,False,True,False,False,Yellowstone National Park,269
6,Mammal,Bos bison,"American Bison, Bison",Least Concern,False,True,False,False,Bryce National Park,68
7,Mammal,Bos bison,"American Bison, Bison",Least Concern,False,True,False,False,Great Smoky Mountains National Park,77
8,Mammal,Bos taurus,"Aurochs, Aurochs, Domestic Cattle (Feral), Dom...",Least Concern,False,True,False,False,Bryce National Park,99
9,Mammal,Bos taurus,"Aurochs, Aurochs, Domestic Cattle (Feral), Dom...",Least Concern,False,True,False,False,Yosemite National Park,124
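
Since *species* contains repeated scientific names (5824 rows against 5541 unique names), a plain merge pairs every repeat with every matching observation row, which can inflate the sums computed afterwards. The sketch below shows the effect on made-up rows and one possible guard using the `validate` argument of `merge`. The frames are hypothetical, and whether dropping the repeats is appropriate for the real data is a judgment call.

```python
import pandas as pd

# Hypothetical toy rows: 'Canis lupus' is listed twice in the species table
toy_species = pd.DataFrame({
    'scientific_name': ['Canis lupus', 'Canis lupus', 'Bos bison'],
    'category': ['Mammal', 'Mammal', 'Mammal'],
})
toy_observations = pd.DataFrame({
    'scientific_name': ['Canis lupus', 'Bos bison'],
    'park_name': ['Yellowstone National Park', 'Yellowstone National Park'],
    'observations': [60, 269],
})

# Plain merge: the wolf row is matched twice, so 60 is counted twice
naive = toy_species.merge(toy_observations, on='scientific_name')
naive_total = naive['observations'].sum()

# Keep one row per species, then let pandas verify the key is unique on the left
deduped = toy_species.drop_duplicates(subset='scientific_name')
checked = deduped.merge(
    toy_observations, on='scientific_name', how='inner', validate='one_to_many'
)
checked_total = checked['observations'].sum()

print(naive_total, checked_total)
```

With `validate='one_to_many'`, pandas raises a `MergeError` if the left-hand key is not unique, so the double counting cannot pass unnoticed.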


Total mammal observations made at each park

In [None]:
mammal_by_park = mammal_observations.groupby(['park_name', 'is_protected']).observations.sum().reset_index()
mammal_by_park

Unnamed: 0,park_name,is_protected,observations
0,Bryce National Park,False,24129
1,Bryce National Park,True,4701
2,Great Smoky Mountains National Park,False,18105
3,Great Smoky Mountains National Park,True,2951
4,Yellowstone National Park,False,59671
5,Yellowstone National Park,True,11030
6,Yosemite National Park,False,36069
7,Yosemite National Park,True,6464


In [None]:
# Yellowstone National Park has the most mammal sightings, both protected and non-protected.
# Great Smoky Mountains National Park has the fewest. In every park, protected mammals make up
# roughly 14 to 16% of mammal sightings, so the raw counts alone do not show weaker conservation there.
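
To compare parks of different sizes, the protected share of sightings is more informative than the raw totals. This sketch rebuilds `mammal_by_park` from the output printed above and derives that share; the column names `not_protected`, `protected`, and `protected_share` are introduced here for illustration.

```python
import pandas as pd

# Totals copied from the mammal_by_park output above
mammal_by_park = pd.DataFrame({
    'park_name': ['Bryce National Park'] * 2
                 + ['Great Smoky Mountains National Park'] * 2
                 + ['Yellowstone National Park'] * 2
                 + ['Yosemite National Park'] * 2,
    'is_protected': [False, True] * 4,
    'observations': [24129, 4701, 18105, 2951, 59671, 11030, 36069, 6464],
})

# One row per park, one column per protection flag (False first, then True)
wide = mammal_by_park.pivot(
    index='park_name', columns='is_protected', values='observations'
)
wide.columns = ['not_protected', 'protected']

# Fraction of each park's mammal sightings that involve protected species
wide['protected_share'] = (
    wide['protected'] / (wide['protected'] + wide['not_protected'])
).round(3)
print(wide)
```

The same steps apply unchanged to `bird_by_park`.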

Filter the data to the rows where **is_bird** is true

In [None]:
species[species.is_bird].head(n=10)

Unnamed: 0,category,scientific_name,common_names,conservation_status,is_protected,is_mammal,is_bird,is_va_plant
90,Bird,Vermivora pinus X chrysoptera,Brewster's Warbler,Least Concern,False,False,True,False
91,Bird,Accipiter cooperii,Cooper's Hawk,Species of Concern,True,False,True,False
92,Bird,Accipiter gentilis,Northern Goshawk,Least Concern,False,False,True,False
93,Bird,Accipiter striatus,Sharp-Shinned Hawk,Species of Concern,True,False,True,False
94,Bird,Aquila chrysaetos,Golden Eagle,Species of Concern,True,False,True,False
95,Bird,Buteo jamaicensis,Red-Tailed Hawk,Least Concern,False,False,True,False
96,Bird,Buteo lineatus,Red-Shouldered Hawk,Species of Concern,True,False,True,False
97,Bird,Buteo platypterus,Broad-Winged Hawk,Least Concern,False,False,True,False
98,Bird,Circus cyaneus,Northern Harrier,Species of Concern,True,False,True,False
99,Bird,Elanoides forficatus,"American Swallow-Tailed Kite, Swallow-Tailed Kite",Species of Concern,True,False,True,False


Merge the result with *observations* to create a **DataFrame** of bird sightings across the 4 parks.

In [130]:
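# Join on the shared key explicitly rather than relying on the inferred common column
bird_observations = species[species.is_bird].merge(observations, on='scientific_name')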
bird_observations.head(n=10)

Unnamed: 0,category,scientific_name,common_names,conservation_status,is_protected,is_mammal,is_bird,is_va_plant,park_name,observations
0,Bird,Vermivora pinus X chrysoptera,Brewster's Warbler,Least Concern,False,False,True,False,Great Smoky Mountains National Park,98
1,Bird,Vermivora pinus X chrysoptera,Brewster's Warbler,Least Concern,False,False,True,False,Yosemite National Park,136
2,Bird,Vermivora pinus X chrysoptera,Brewster's Warbler,Least Concern,False,False,True,False,Yellowstone National Park,259
3,Bird,Vermivora pinus X chrysoptera,Brewster's Warbler,Least Concern,False,False,True,False,Bryce National Park,89
4,Bird,Accipiter cooperii,Cooper's Hawk,Species of Concern,True,False,True,False,Bryce National Park,95
5,Bird,Accipiter cooperii,Cooper's Hawk,Species of Concern,True,False,True,False,Yosemite National Park,138
6,Bird,Accipiter cooperii,Cooper's Hawk,Species of Concern,True,False,True,False,Yellowstone National Park,245
7,Bird,Accipiter cooperii,Cooper's Hawk,Species of Concern,True,False,True,False,Great Smoky Mountains National Park,65
8,Bird,Accipiter gentilis,Northern Goshawk,Least Concern,False,False,True,False,Great Smoky Mountains National Park,78
9,Bird,Accipiter gentilis,Northern Goshawk,Least Concern,False,False,True,False,Yellowstone National Park,232


Total bird observations made at each park

In [134]:
bird_by_park = bird_observations.groupby(['park_name', 'is_protected']).observations.sum().reset_index()
bird_by_park

Unnamed: 0,park_name,is_protected,observations
0,Bryce National Park,False,50982
1,Bryce National Park,True,7608
2,Great Smoky Mountains National Park,False,37572
3,Great Smoky Mountains National Park,True,5297
4,Yellowstone National Park,False,125774
5,Yellowstone National Park,True,18526
6,Yosemite National Park,False,75319
7,Yosemite National Park,True,11293


*Yellowstone National Park* again has the highest number of bird sightings, both protected and non-protected.
In every park, non-protected birds account for the large majority of sightings; protected birds make up only about 12 to 13%.

### CONCLUSION

The vast majority of species entries have no conservation status (**5633 vs 191**). Mammals and Birds are the categories most likely to be protected, and both were recorded most often in *Yellowstone National Park*. *Great Smoky Mountains National Park* had the lowest number of sightings for both categories. However, *Yellowstone National Park* may simply be much larger than the other parks, which could explain its higher and more diverse count of sightings.