# Biodiversity Project

## Introduction

In this project we will interpret data from the National Parks Service about endangered species in different parks.

The data files used, `observations.csv` and `species_info.csv`, were provided by [Codecademy.com](https://www.codecademy.com).

Note: The data for this project is *inspired* by real data, but is mostly fictional.

## Scoping 

When starting a new project, it's a good idea to define its scope. Four parts have been designed to guide the project's development and progress. The first component is the project goals, which outline the project's high-level aims and purposes. The next component is the data; fortunately, data is already supplied in this project, but it must be determined whether the project goals can be satisfied with the data available. Third, the analysis must be well planned, including the methodologies and questions that will be used to achieve the project's objectives. Finally, evaluation will assist us in drawing conclusions and results from our study.

### Project Goals

In this project, the perspective is that of a biodiversity analyst for the National Parks Service. The National Parks Service seeks to secure the survival of at-risk species and to maintain the level of biodiversity in its parks. The analyst's primary goals are therefore to understand the characteristics of the species and their conservation status, as well as how those species relate to the national parks. The project poses the following questions:

- What is the distribution of `conservation_status` for animals?
- Are certain types of species more likely to be endangered?
- Are the differences between species and their conservation status significant?
- Which species were spotted the most at each park?
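
As a rough illustration of how the last question might be approached, the sketch below totals sightings per park and species and then keeps the top species in each park. The data frame `toy_obs` and its values are made up for this example; only the column names follow the project files, and this is one possible approach rather than the method used later in the project.

```python
import pandas as pd

# Toy data for illustration only; these rows are invented, not taken from the project files.
toy_obs = pd.DataFrame({
    "scientific_name": ["Species A", "Species B", "Species A", "Species B"],
    "park_name": ["Park X", "Park X", "Park Y", "Park Y"],
    "observations": [10, 25, 40, 5],
})

# Total sightings per (park, species) pair, in case a species appears in several rows.
totals = toy_obs.groupby(
    ["park_name", "scientific_name"], as_index=False
)["observations"].sum()

# For each park, find the row label of the largest total, then select those rows.
top_idx = totals.groupby("park_name")["observations"].idxmax()
most_spotted = totals.loc[top_idx].reset_index(drop=True)

print(most_spotted)
```

Note that `idxmax` returns a single row per park, so ties between species would need separate handling.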

### Data

This project comes with two data sets. The first `csv` file contains information about each species, while the second contains observations of species along with park locations. This data will be used to address the project's goals.

### Analysis

In this part, descriptive statistics and data visualization techniques will be used to better comprehend the data. Statistical inference will also be applied to determine if the observed values are statistically significant. Some of the major metrics that will be calculated are: 

1. Counts
1. Distributions
1. Relationship between Species
1. `conservation_status` of Species
1. Observations of Species in Parks.
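
To make these metrics more concrete, here is a minimal sketch on invented data. It builds a table of counts, turns it into per-category shares, and computes a Pearson chi-squared statistic by hand. The `status` column and its two labels are hypothetical stand-ins, and a library routine such as `scipy.stats.chi2_contingency` would normally be used for the test itself.

```python
import numpy as np
import pandas as pd

# Hypothetical species table; the categories and status labels are placeholders.
toy_species = pd.DataFrame({
    "category": ["Mammal"] * 4 + ["Bird"] * 6,
    "status": ["Protected", "Protected", "Not protected", "Not protected",
               "Protected", "Not protected", "Not protected",
               "Not protected", "Not protected", "Not protected"],
})

# Counts: contingency table of category versus status.
counts = pd.crosstab(toy_species["category"], toy_species["status"])

# Distribution: share of each status within a category (each row sums to 1).
shares = pd.crosstab(toy_species["category"], toy_species["status"], normalize="index")

# Pearson chi-squared statistic: sum of (observed - expected)^2 / expected,
# where expected counts assume category and status are independent.
observed = counts.to_numpy()
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()
chi2_stat = ((observed - expected) ** 2 / expected).sum()

# 3.841 is the 5% critical value of the chi-squared distribution with 1 degree of freedom.
significant = chi2_stat > 3.841

print(counts)
print(shares)
print(f"chi-squared statistic: {chi2_stat:.3f}, significant at 5%: {significant}")
```

The toy table is far too small for a reliable test (some expected counts fall below 5); it only shows the mechanics.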

### Evaluation

Finally, it is a good idea to revisit the goals and check whether the output of the analysis answers the questions posed in the goals section. This part will also reflect on what was learned during the process and whether any of the questions could not be answered. It may also cover limitations of the analysis and whether it could have been conducted with alternative approaches.

## Importing Python Modules

We begin by importing the modules that will be used in this project:

In [3]:
import pandas as pd
import numpy as np

from matplotlib import pyplot as plt
import seaborn as sns

%matplotlib inline

## Loading the Data

The next step is to load the provided data so that it can be analysed further. We will take a quick look at the contents of each file using the `.head()` method.

#### Species

`species_info.csv` contains information about the different species in the National Parks. The dataset consists of 4 columns:

- **category** - the taxonomic class of each species
- **scientific_name** - the scientific name of each species
- **common_names** - the common names of each species
- **conservation_status** - each species' current conservation status

In [4]:
species = pd.read_csv('species_info.csv', encoding='utf-8')
species.head()

|   | category | scientific_name | common_names | conservation_status |
|---|---|---|---|---|
| 0 | Mammal | Clethrionomys gapperi gapperi | Gapper's Red-Backed Vole | |
| 1 | Mammal | Bos bison | American Bison, Bison | |
| 2 | Mammal | Bos taurus | Aurochs, Aurochs, Domestic Cattle (Feral), Dom... | |
| 3 | Mammal | Ovis aries | Domestic Sheep, Mouflon, Red Sheep, Sheep (Feral) | |
| 4 | Mammal | Cervus elaphus | Wapiti Or Elk | |
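
The preview shows no value in `conservation_status` for any of the first five rows, so it may be worth counting missing values per column. The sketch below does this on a small inline CSV that copies three of the rows shown above. The inline text stands in for `species_info.csv`, and the names `csv_text`, `sample` and `missing_per_column` are introduced here for illustration only.

```python
import io

import pandas as pd

# Inline CSV copying three of the preview rows; a stand-in for species_info.csv.
csv_text = """category,scientific_name,common_names,conservation_status
Mammal,Clethrionomys gapperi gapperi,Gapper's Red-Backed Vole,
Mammal,Bos bison,"American Bison, Bison",
Mammal,Cervus elaphus,Wapiti Or Elk,
"""

sample = pd.read_csv(io.StringIO(csv_text))

# isna() flags missing cells; summing the flags gives a count per column.
missing_per_column = sample.isna().sum()

print(missing_per_column)
```

On the full data, the same call would be `species.isna().sum()`. How the missing statuses should be interpreted is a separate question that the preview alone cannot answer.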


#### Observations

`observations.csv` contains recorded sightings of the different species throughout the national parks over the past 7 days. The dataset consists of 3 columns:

- **scientific_name** - the scientific name of each species
- **park_name** - the park where the species was found
- **observations** - the number of times each species was observed at the park

In [5]:
observations = pd.read_csv('observations.csv', encoding='utf-8')
observations.head()

|   | scientific_name | park_name | observations |
|---|---|---|---|
| 0 | Vicia benghalensis | Great Smoky Mountains National Park | 68 |
| 1 | Neovison vison | Great Smoky Mountains National Park | 77 |
| 2 | Prunus subcordata | Yosemite National Park | 138 |
| 3 | Abutilon theophrasti | Bryce National Park | 84 |
| 4 | Githopsis specularioides | Great Smoky Mountains National Park | 85 |
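
As a small example of how the three columns fit together, the sketch below rebuilds the five preview rows shown above and sums the sightings per park. The names `sample_obs` and `park_totals` are introduced here for illustration. Because only the preview rows are used, the totals say nothing about the full data set.

```python
import pandas as pd

# The five rows displayed by observations.head() above.
sample_obs = pd.DataFrame({
    "scientific_name": [
        "Vicia benghalensis",
        "Neovison vison",
        "Prunus subcordata",
        "Abutilon theophrasti",
        "Githopsis specularioides",
    ],
    "park_name": [
        "Great Smoky Mountains National Park",
        "Great Smoky Mountains National Park",
        "Yosemite National Park",
        "Bryce National Park",
        "Great Smoky Mountains National Park",
    ],
    "observations": [68, 77, 138, 84, 85],
})

# Sum the sightings in each park and list the parks from most to fewest sightings.
park_totals = (
    sample_obs.groupby("park_name")["observations"]
    .sum()
    .sort_values(ascending=False)
)

print(park_totals)
```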


#### Data Characteristics

Next, we will check the dimensions (rows and columns) of the data loaded from `species_info.csv` and `observations.csv`.

In [6]:
species_shape = species.shape
observations_shape = observations.shape

In [7]:
print(f"Species shape: {species_shape}")
print(f"Observations shape: {observations_shape}")

Species shape: (5824, 4)
Observations shape: (23296, 3)


From above we can see that `species_info.csv` consists of 5,824 rows and 4 columns. 

`observations.csv` consists of 23,296 rows and 3 columns.
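
A shape is a `(rows, columns)` tuple, so it can be unpacked and formatted directly. The sketch below hard-codes the two shapes reported above and also computes how many observation rows there are per species row. Reading that ratio as "one observation row per species row and park" is an assumption, not something the shapes alone establish; it could be checked with `observations.park_name.nunique()`.

```python
# Shapes copied from the output above.
species_shape = (5824, 4)
observations_shape = (23296, 3)

# Unpack each (rows, columns) tuple into named values.
species_rows, species_cols = species_shape
obs_rows, obs_cols = observations_shape

# The ',' format specifier adds thousands separators.
print(f"species: {species_rows:,} rows x {species_cols} columns")
print(f"observations: {obs_rows:,} rows x {obs_cols} columns")

# Ratio of observation rows to species rows.
rows_per_species_row = obs_rows / species_rows
print(f"observation rows per species row: {rows_per_species_row}")
```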

### Exploring the Data

We will first explore the `species` data in depth, starting with the number of unique species it contains.

In [9]:
unique_sn = species.scientific_name.nunique()
print(f"Number of Species: {unique_sn}")

Number of Species: 5541


Counting the distinct values in the `scientific_name` column gives 5,541 unique species. This is 283 fewer than the 5,824 rows in the table, so some scientific names appear in more than one row.
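
One way to look into the gap between the row count and the unique count is to list the rows whose scientific name occurs more than once. The sketch below shows the idea on invented data. The frame `toy` and its values are hypothetical; on the real data the same calls would be applied to `species`.

```python
import pandas as pd

# Invented rows for illustration; "Genus alpha" deliberately appears twice.
toy = pd.DataFrame({
    "scientific_name": ["Genus alpha", "Genus beta", "Genus alpha", "Genus gamma"],
    "common_names": ["Name one", "Name two", "Name one, alternate", "Name three"],
})

n_rows = len(toy)
n_unique = toy["scientific_name"].nunique()

# duplicated(keep=False) marks every occurrence of a repeated name, not just the later ones.
repeated = toy[toy["scientific_name"].duplicated(keep=False)]

print(f"rows: {n_rows}, unique names: {n_unique}")
print(repeated)
```

Inspecting the repeated rows side by side helps decide whether they are true duplicates or differ in another column, such as `common_names`.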