# Introduction
### Author: Alec Swainston
The goal of this project is to analyze biodiversity data from the National Park Service, focusing on the species observed at different national park locations.

This project scopes, prepares, analyzes, and plots the data, then seeks to explain the findings from the analysis.

Here are a few questions that this project has sought to answer:

- What is the distribution of conservation status for species?
- Are certain types of species more likely to be endangered?
- Are the differences between species and their conservation status significant?
- Which animal is most prevalent, and what is its distribution amongst parks?

**Data sources:**

Both `observations.csv` and `species_info.csv` were provided by [Codecademy.com](https://www.codecademy.com).

Note: The data for this project is *inspired* by real data, but is mostly fictional.

### Goals
The aim is to better understand the characteristics of the species and their conservation status, and how those species relate to the national parks. The questions listed in the introduction guide the analysis.

### Data

One `csv` file has information about each species; the other has observations of species with park locations. Both files are used to address the project goals. The data is inspired by real data but is not taken from any actual source; it was created by Codecademy.com.

### Analysis

Descriptive statistics and data visualization techniques will be employed to understand the data better. Statistical inference will also be used to test whether the observed differences are statistically significant. Some of the key metrics that will be computed include:

1. Distributions
1. Counts
1. Relationships between species
1. Conservation status of species
1. Observations of species in parks
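
As an illustration of the inference step, here is a minimal sketch of a chi-square test of independence computed by hand with NumPy. The counts in `toy_counts` are invented for illustration and are not results from this dataset; the helper name `chi_square_statistic` is also hypothetical. Unlike some library routines, this sketch applies no continuity correction.

```python
import numpy as np

def chi_square_statistic(observed):
    """Return (statistic, degrees of freedom, expected counts) for a contingency table."""
    observed = np.asarray(observed, dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)   # shape (rows, 1)
    col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, cols)
    # Expected counts if rows and columns were independent
    expected = row_totals @ col_totals / observed.sum()
    statistic = ((observed - expected) ** 2 / expected).sum()
    dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
    return statistic, dof, expected

# Hypothetical counts: rows are two species categories,
# columns are (protected, not protected)
toy_counts = [[30, 170],
              [10, 190]]

statistic, dof, expected = chi_square_statistic(toy_counts)
print(f"chi-square = {statistic:.3f}, dof = {dof}")

# 3.841 is the 5% critical value of the chi-square distribution with 1 degree of freedom
print("significant at 5%:", statistic > 3.841)
```

With one degree of freedom, a statistic above 3.841 would reject independence at the 5% level; for larger tables the critical value changes with the degrees of freedom.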

## Import Python Modules

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

## Loading in the Data

In [4]:
species = pd.read_csv('species_info.csv')
species.head()

| | category | scientific_name | common_names | conservation_status |
| ----------- | ----------- | ----------- | ----------- | ----------- |
| 0 | Mammal | Clethrionomys gapperi gapperi | Gapper's Red-Backed Vole | NaN |
| 1 | Mammal | Bos bison | American Bison, Bison | NaN |
| 2 | Mammal | Bos taurus | Aurochs, Aurochs, Domestic Cattle (Feral), Dom... | NaN |
| 3 | Mammal | Ovis aries | Domestic Sheep, Mouflon, Red Sheep, Sheep (Feral) | NaN |
| 4 | Mammal | Cervus elaphus | Wapiti Or Elk | NaN |


In [5]:
observations = pd.read_csv('observations.csv')
observations.head()

| | scientific_name | park_name | observations |
| ----------- | ----------- | ----------- | ----------- |
| 0 | Vicia benghalensis | Great Smoky Mountains National Park | 68 |
| 1 | Neovison vison | Great Smoky Mountains National Park | 77 |
| 2 | Prunus subcordata | Yosemite National Park | 138 |
| 3 | Abutilon theophrasti | Bryce National Park | 84 |
| 4 | Githopsis specularioides | Great Smoky Mountains National Park | 85 |


In [6]:
print(f"species shape: {species.shape}")
print(f"observations shape: {observations.shape}")

species shape: (5824, 4)
observations shape: (23296, 3)


## Exploring the Data
Getting a better grasp of what the data contains, how it is structured, and what it looks like.

### Number of Species = 5541

In [9]:
print(species.scientific_name.nunique())

5541
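
The species table has 5,824 rows but only 5,541 unique scientific names, so 283 rows repeat a name that already appears. A minimal sketch of how such repeats could be surfaced, using a toy frame: the first two rows come from the `head()` output above, while the third row is a hypothetical duplicate added for illustration.

```python
import pandas as pd

# Toy frame; the last row is a hypothetical duplicate, not a row from the real file
toy_species = pd.DataFrame({
    "category": ["Mammal", "Mammal", "Mammal"],
    "scientific_name": ["Bos bison", "Cervus elaphus", "Cervus elaphus"],
    "common_names": ["American Bison, Bison", "Wapiti Or Elk", "Rocky Mountain Elk"],
})

n_rows = len(toy_species)
n_unique = toy_species["scientific_name"].nunique()
print(f"rows: {n_rows}, unique names: {n_unique}, extra rows: {n_rows - n_unique}")

# keep=False marks every row whose scientific_name occurs more than once
repeated = toy_species[toy_species.duplicated(subset="scientific_name", keep=False)]
print(repeated)
```

Running the same `duplicated` call on the real `species` frame would show which names repeat and whether the repeated rows differ in other columns.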


### Categories
7 categories: Mammal, Bird, Reptile, Amphibian, Fish, Vascular Plant, Nonvascular Plant

| Category | Count |
| ----------- | ----------- |
| Amphibian | 80 |
| Bird | 521 |
| Fish | 127 |
| Mammal | 214 |
| Nonvascular Plant | 333 |
| Reptile | 79 |
| Vascular Plant | 4470 |

In [14]:
print(species.category.nunique())
print(species.category.unique())
species.groupby("category").size()

7
['Mammal' 'Bird' 'Reptile' 'Amphibian' 'Fish' 'Vascular Plant'
 'Nonvascular Plant']


category
Amphibian              80
Bird                  521
Fish                  127
Mammal                214
Nonvascular Plant     333
Reptile                79
Vascular Plant       4470
dtype: int64
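
Raw counts are easier to compare as shares of the total. The sketch below rebuilds the counts printed above as a Series and converts them to percentages; on the real frame, `species.category.value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Category counts copied from the groupby output above
category_counts = pd.Series(
    {
        "Amphibian": 80,
        "Bird": 521,
        "Fish": 127,
        "Mammal": 214,
        "Nonvascular Plant": 333,
        "Reptile": 79,
        "Vascular Plant": 4470,
    },
    name="count",
)

# Percentage of all rows in each category, rounded to one decimal place
category_share = (category_counts / category_counts.sum() * 100).round(1)
print(category_share.sort_values(ascending=False))
```

Vascular plants make up roughly three quarters of the rows, which is worth keeping in mind when comparing protection rates across categories.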

### Conservation Status
Four statuses plus missing values: Species of Concern, Endangered, Threatened, In Recovery, and `NaN`.

| Conservation Status | Count |
| ----------- | ----------- |
| NaN | 5633 |
| Endangered | 16 |
| In Recovery | 4 |
| Species of Concern | 161 |
| Threatened | 10 |

In [17]:
print(species.conservation_status.unique())
print(species.conservation_status.isna().sum())

print(species.groupby("conservation_status").size())

[nan 'Species of Concern' 'Endangered' 'Threatened' 'In Recovery']
5633
conservation_status
Endangered             16
In Recovery             4
Species of Concern    161
Threatened             10
dtype: int64
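
The missing values and the status counts can also be obtained in a single call by passing `dropna=False` to `value_counts`. A minimal sketch on a toy Series (the values are illustrative, not rows from the real file):

```python
import numpy as np
import pandas as pd

# Toy conservation_status column with three missing entries
toy_status = pd.Series(
    [np.nan, "Species of Concern", np.nan, "Endangered", np.nan],
    name="conservation_status",
)

# dropna=False keeps NaN as its own row in the result
counts_with_missing = toy_status.value_counts(dropna=False)
print(counts_with_missing)

# Cross-check against the explicit missing-value count
n_missing = int(toy_status.isna().sum())
print(f"missing: {n_missing}")
```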


### National Parks in this Study:
Great Smoky Mountains, Yosemite, Bryce Canyon (recorded as `Bryce National Park` in the data), Yellowstone

3,314,739 observations in total

In [19]:
print(observations.park_name.unique())
print(observations.observations.sum())

['Great Smoky Mountains National Park' 'Yosemite National Park'
 'Bryce National Park' 'Yellowstone National Park']
3314739
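
The overall total can be broken down per park with a `groupby`. The sketch below uses only the five rows shown in the `head()` output above, so its numbers describe that sample and not the full dataset.

```python
import pandas as pd

# The five rows from observations.head() shown earlier
sample_observations = pd.DataFrame({
    "scientific_name": [
        "Vicia benghalensis",
        "Neovison vison",
        "Prunus subcordata",
        "Abutilon theophrasti",
        "Githopsis specularioides",
    ],
    "park_name": [
        "Great Smoky Mountains National Park",
        "Great Smoky Mountains National Park",
        "Yosemite National Park",
        "Bryce National Park",
        "Great Smoky Mountains National Park",
    ],
    "observations": [68, 77, 138, 84, 85],
})

# Sum of observations per park, largest first
per_park = (
    sample_observations.groupby("park_name")["observations"]
    .sum()
    .sort_values(ascending=False)
)
print(per_park)
```

Applying the same `groupby` to the full `observations` frame would show how the 3,314,739 observations are split across the four parks.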


## Cleaning Data

In [21]:
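# Replace NaN in conservation_status only with "No Intervention"
# (a frame-wide fillna would also overwrite missing values in other columns)
species['conservation_status'] = species['conservation_status'].fillna('No Intervention')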
species.groupby("conservation_status").size()

conservation_status
Endangered              16
In Recovery              4
No Intervention       5633
Species of Concern     161
Threatened              10
dtype: int64
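
A note on the scope of `fillna`: called on a whole DataFrame, it fills every missing value in every column, while calling it on a single column leaves the others alone. A minimal sketch on a toy frame, where the missing common name is a hypothetical gap and not something observed in the real file:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "scientific_name": ["Bos bison", "Cervus elaphus"],
    "common_names": ["American Bison, Bison", np.nan],   # hypothetical missing value
    "conservation_status": [np.nan, "Endangered"],       # hypothetical status values
})

# Column-scoped fill: only conservation_status changes
column_only = toy.assign(
    conservation_status=toy["conservation_status"].fillna("No Intervention")
)

# Frame-wide fill: the missing common name is overwritten too
whole_frame = toy.fillna("No Intervention")

print(column_only)
print(whole_frame)
```

If `conservation_status` is the only column with missing values, both approaches give the same result, but the column-scoped version states the intent more precisely.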