# Biodiversity in National Parks - Data Analysis
### Analysis Overview
In this analysis, we examine two datasets containing information on National Parks and the species that inhabit them.

The datasets are:

- ```"observations.csv"``` - includes data on the national park being observed, which species live there, and how many times each species has been spotted within the last 7 days.
- ```"species_info.csv"``` - includes data on a large number of different species.  It features the species' scientific name, common name, category of species, and conservation status.

---

## Section 1 - Understanding the Data
In order to work with any data, you first need to understand what you'll be working with.  

1. Import the Python libraries you'll need to conduct your analysis 
2. Load the two CSV datasets into two separate pandas DataFrames, ```parks``` and ```species```
3. View the first 10 rows of the ```parks``` dataframe
4. View the first 10 rows of the ```species``` dataframe

In [1]:
# 1
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

In [2]:
# 2
parks = pd.read_csv('observations.csv')
species = pd.read_csv('species_info.csv')

In [4]:
# 3 
parks.head(10)

Unnamed: 0,scientific_name,park_name,observations
0,Vicia benghalensis,Great Smoky Mountains National Park,68
1,Neovison vison,Great Smoky Mountains National Park,77
2,Prunus subcordata,Yosemite National Park,138
3,Abutilon theophrasti,Bryce National Park,84
4,Githopsis specularioides,Great Smoky Mountains National Park,85
5,Elymus virginicus var. virginicus,Yosemite National Park,112
6,Spizella pusilla,Yellowstone National Park,228
7,Elymus multisetus,Great Smoky Mountains National Park,39
8,Lysimachia quadrifolia,Yosemite National Park,168
9,Diphyscium cumberlandianum,Yellowstone National Park,250


In [5]:
# 4
species.head(10)

Unnamed: 0,category,scientific_name,common_names,conservation_status
0,Mammal,Clethrionomys gapperi gapperi,Gapper's Red-Backed Vole,
1,Mammal,Bos bison,"American Bison, Bison",
2,Mammal,Bos taurus,"Aurochs, Aurochs, Domestic Cattle (Feral), Dom...",
3,Mammal,Ovis aries,"Domestic Sheep, Mouflon, Red Sheep, Sheep (Feral)",
4,Mammal,Cervus elaphus,Wapiti Or Elk,
5,Mammal,Odocoileus virginianus,White-Tailed Deer,
6,Mammal,Sus scrofa,"Feral Hog, Wild Pig",
7,Mammal,Canis latrans,Coyote,Species of Concern
8,Mammal,Canis lupus,Gray Wolf,Endangered
9,Mammal,Canis rufus,Red Wolf,Endangered


The first thing that stands out is in the ```species``` dataframe.  Without looking any deeper, you can already see that its ```conservation_status``` column contains missing values (```NaN```).  In order to work with this data accurately, you'll need to clean up your dataframes.  Aside from that one column, you don't yet know which columns in either dataframe may have missing data, so you'll need to do some work to find out.
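One common way to find out which columns hold missing data is to count the ```NaN``` entries per column.  The sketch below uses a small made-up dataframe, ```species_demo```, as a stand-in for ```species```; its rows are illustrative and are not taken from the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the species dataframe (illustrative values only)
species_demo = pd.DataFrame({
    "category": ["Mammal", "Mammal", "Bird"],
    "scientific_name": ["Canis latrans", "Bos bison", "Spizella pusilla"],
    "conservation_status": ["Species of Concern", np.nan, np.nan],
})

# Number of missing values in each column
missing_counts = species_demo.isna().sum()
print(missing_counts)

# Fraction of rows that are missing in each column
missing_share = species_demo.isna().mean()
print(missing_share)
```

Running the same two calls on ```parks``` and ```species``` would give a per-column count that is easier to scan than the first few rows of output.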

Let's focus on looking at your ```parks``` dataframe first!

5. Check to see if the ```parks``` dataframe contains any missing values.  Check column names and data types as well
6. Take a look at the summary statistics
7. Find out how many species observations there are for each National Park

In [7]:
# 5
parks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23296 entries, 0 to 23295
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   scientific_name  23296 non-null  object
 1   park_name        23296 non-null  object
 2   observations     23296 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 546.1+ KB


In [10]:
# 6
parks.describe(include='all')

Unnamed: 0,scientific_name,park_name,observations
count,23296,23296,23296.0
unique,5541,4,
top,Myotis lucifugus,Great Smoky Mountains National Park,
freq,12,5824,
mean,,,142.287904
std,,,69.890532
min,,,9.0
25%,,,86.0
50%,,,124.0
75%,,,195.0
max,,,321.0


In [12]:
# 7
parks.park_name.value_counts()

Great Smoky Mountains National Park    5824
Yosemite National Park                 5824
Bryce National Park                    5824
Yellowstone National Park              5824
Name: park_name, dtype: int64

You can see that the dataframe has 23,296 rows and 3 columns, none of which has missing values.  The column names are clean and easy to work with, and the data type of each column makes sense.

Looking at the output from code block #6, you can see that there are a lot of unique species within the dataframe (5,541), but only 4 different National Parks.

Since ```park_name``` has so few unique values, you can use ```.value_counts()``` on that column to show that there are 5,824 rows of species observations for each National Park.
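Keep in mind that ```.value_counts()``` counts rows, not sightings.  To illustrate the difference, here is a minimal sketch on a made-up dataframe, ```parks_demo``` (the park names and numbers are hypothetical), that compares the row count per park with the summed ```observations``` per park.

```python
import pandas as pd

# Hypothetical stand-in for the parks dataframe (illustrative values only)
parks_demo = pd.DataFrame({
    "scientific_name": ["A a", "B b", "A a", "B b"],
    "park_name": ["Park X", "Park X", "Park Y", "Park Y"],
    "observations": [10, 20, 30, 40],
})

# Number of rows (species records) per park
rows_per_park = parks_demo["park_name"].value_counts()
print(rows_per_park)

# Total number of sightings per park
total_obs_per_park = parks_demo.groupby("park_name")["observations"].sum()
print(total_obs_per_park)
```

Both parks have the same number of rows here, yet their total sightings differ, which is why the two summaries answer different questions.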

Though the data looks nice and tidy, you could further inspect the ```scientific_name``` column for strange values hidden within it that could affect the analysis.
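What counts as a "strange value" is a judgment call.  As a rough sketch, the checks below look for stray whitespace, empty strings, and names that start with a lowercase letter.  The series ```names_demo``` is a made-up stand-in for ```parks.scientific_name```, and the specific checks are assumptions about what might go wrong, not problems known to exist in the real data.

```python
import pandas as pd

# Hypothetical stand-in for parks.scientific_name (illustrative values only)
names_demo = pd.Series(
    ["Canis lupus", " Canis lupus", "canis lupus", "Bos bison", "Bos  bison", ""],
    name="scientific_name",
)

# Leading or trailing whitespace
has_outer_space = names_demo != names_demo.str.strip()

# Two or more consecutive whitespace characters inside the name
has_double_space = names_demo.str.contains(r"\s{2,}", regex=True)

# Empty (or whitespace-only) entries
is_empty = names_demo.str.strip() == ""

# Names that begin with a lowercase letter (genus names are normally capitalized)
starts_lowercase = names_demo.str.match(r"^[a-z]")

print(names_demo[has_outer_space | has_double_space | is_empty | starts_lowercase])

# Whitespace-normalized copy, useful for comparing unique counts before and after
normalized = names_demo.str.strip().str.replace(r"\s+", " ", regex=True)
print(names_demo.nunique(), normalized.nunique())
```

If the unique count drops after normalization, some names differ only by whitespace and would otherwise be treated as separate species.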
  

# Appendix
### Section 1
1. Python library imports:
* NumPy - Used for fast numerical calculations; pandas is built on top of it.  Can perform matrix calculations.
* Pandas - Used for loading datasets into workable dataframes.  Crucial for data analysis with Python.
* Matplotlib - Used for creating plots and visualizations from pandas dataframes.
* Seaborn - Newer, more intuitive visualization library built to work with Matplotlib and pandas.
* Statsmodels - Used for creating predictive models.

2. Use the ```read_csv()``` function along with ```pd``` (pandas) to import a CSV file into a pandas dataframe.  The function takes the CSV filename, surrounded by quotation marks, as a parameter (Example: ```parks = pd.read_csv("observations.csv")```)

3. Use the ```.head()``` method from the ```pandas``` library on the ```parks``` dataframe, passing ```10``` as the argument, to show the first 10 rows of the dataframe (the default is 5 rows)

4. Use the ```.head()``` method from the ```pandas``` library on the ```species``` dataframe, passing ```10``` as the argument, to show the first 10 rows of the dataframe

5. Use the ```.info()``` method from the ```pandas``` library on the ```parks``` dataframe to show the column names, the data types of the columns, the length of the dataframe, and the non-null count of each column.  This shows that the dataframe has 23,296 rows and 3 columns, none of which has missing values.  We can see that the column names are clean and easy to work with, and that the data types of each column are workable for now.

6. Use the ```.describe()``` method from the ```pandas``` library on the ```parks``` dataframe to show some summary statistics about each column.  In order to see summary statistics for categorical variables, you need to use the ```include=``` parameter and set its value to ```'all'```.  This shows the number of unique values for the categorical variables (```scientific_name``` and ```park_name```), and some useful statistics about the quantitative variable (```observations```).  On average, a single species was observed 142 times in 7 days.  The minimum was 9 observations and the maximum was 321.  This is useful primarily because now you know that the ```observations``` column doesn't contain any zero values.  This means you can move on to exploring a different variable within the data.

7. Use the ```.value_counts()``` method from the ```pandas``` library on the ```parks.park_name``` dataframe column to show the frequency of the unique values within the column.  You can see that each of the 4 National Parks has an equal number of entries.  This information also shows that the 4 unique values within this column are in fact National Parks.
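
To make the effect of ```include='all'``` in item 6 concrete, the sketch below compares the default ```.describe()``` output with the ```include='all'``` output on a small made-up dataframe, ```demo``` (its values are hypothetical).

```python
import pandas as pd

# Hypothetical dataframe with one categorical and one numeric column
demo = pd.DataFrame({
    "park_name": ["Park X", "Park X", "Park Y"],
    "observations": [10, 20, 30],
})

# Default: only numeric columns are summarized
numeric_only = demo.describe()
print(numeric_only)

# include='all': categorical columns gain unique/top/freq rows,
# and statistics that do not apply to a column are shown as NaN
everything = demo.describe(include="all")
print(everything)
```

In the default output the ```park_name``` column is absent entirely, while the ```include='all'``` output reports its number of unique values, its most frequent value, and that value's frequency.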
