First, we'll import `pandas` to load our CSV files into DataFrames, `numpy` for numerical work, and `matplotlib` to help us create data visualizations.

In [46]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [3]:
obs = pd.read_csv('observations.csv')
species = pd.read_csv('species_info.csv')

In [5]:
obs.head()

Unnamed: 0,scientific_name,park_name,observations
0,Vicia benghalensis,Great Smoky Mountains National Park,68
1,Neovison vison,Great Smoky Mountains National Park,77
2,Prunus subcordata,Yosemite National Park,138
3,Abutilon theophrasti,Bryce National Park,84
4,Githopsis specularioides,Great Smoky Mountains National Park,85


In [6]:
species.head()

Unnamed: 0,category,scientific_name,common_names,conservation_status
0,Mammal,Clethrionomys gapperi gapperi,Gapper's Red-Backed Vole,
1,Mammal,Bos bison,"American Bison, Bison",
2,Mammal,Bos taurus,"Aurochs, Aurochs, Domestic Cattle (Feral), Dom...",
3,Mammal,Ovis aries,"Domestic Sheep, Mouflon, Red Sheep, Sheep (Feral)",
4,Mammal,Cervus elaphus,Wapiti Or Elk,


It looks like our observations data lists species by scientific name, along with the national park where each was observed and how many observations of that species were recorded there.

The species info data shares the `scientific_name` column and adds more information about each species: its common names, its conservation status, and a `category` showing whether it's a mammal, reptile, etc.

In [36]:
species.groupby('category').count()

category,scientific_name,common_names,conservation_status
Amphibian,80,80,7
Bird,521,521,79
Fish,127,127,11
Mammal,214,214,38
Nonvascular Plant,333,333,5
Reptile,79,79,5
Vascular Plant,4470,4470,46


In [39]:
species.scientific_name.count()

5824

In [38]:
len(pd.unique(species.scientific_name))

5541

In [42]:
species.drop_duplicates(inplace=True)
species.scientific_name.count()

5824

There are fewer unique scientific names (5541) than rows (5824), yet `drop_duplicates` removed nothing, so no row is an exact duplicate. I'm going to assume that some scientific names repeat with different common names listed, since a single Latin name can have many common names.
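To test that assumption rather than take it on faith, one option is to pull out every row whose scientific name appears more than once and compare the rows side by side. The sketch below uses a small made-up frame (`toy_species`, with placeholder names such as "Genus alpha") in place of the real file, and `repeated_names` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy rows standing in for species_info.csv (not the real data).
toy_species = pd.DataFrame({
    "category": ["Mammal", "Mammal", "Bird"],
    "scientific_name": ["Genus alpha", "Genus alpha", "Genus beta"],
    "common_names": ["Alpha", "Alpha, Greater Alpha", "Beta"],
    "conservation_status": [None, None, None],
})

def repeated_names(df: pd.DataFrame) -> pd.DataFrame:
    """Return every row whose scientific_name appears more than once."""
    # keep=False marks all members of a repeated group, not just the later ones.
    mask = df.duplicated(subset="scientific_name", keep=False)
    return df[mask].sort_values("scientific_name")

repeats = repeated_names(toy_species)
print(repeats)
```

Calling `repeated_names(species)` on the notebook's DataFrame should list the rows behind the gap between 5824 and 5541, so the differing columns can be inspected directly.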

The species data is dominated by vascular plants (4470 of 5824 rows), with far fewer entries in the other categories, especially reptiles (79) and amphibians (80). We can also see that the only missing data is under conservation_status, and it seems safe to assume that a missing value here means the species' population is not considered threatened. We'll check what values exist for conservation_status.

In [12]:
species.groupby('conservation_status').count()

conservation_status,category,scientific_name,common_names
Endangered,16,16,16
In Recovery,4,4,4
Species of Concern,161,161,161
Threatened,10,10,10
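Note that `groupby` leaves out missing keys by default, so the rows with no conservation status do not appear in the table above. A minimal sketch of one way to count them alongside the named statuses is `value_counts(dropna=False)`, shown here on a made-up Series (`toy_status`). The fill label "No Intervention" is an assumed placeholder, not a value from the dataset.

```python
import pandas as pd

# Hypothetical toy column standing in for species.conservation_status.
toy_status = pd.Series(
    ["Endangered", None, None, "Threatened", None],
    name="conservation_status",
)

# dropna=False keeps the missing values as their own entry in the tally.
counts = toy_status.value_counts(dropna=False)
print(counts)

# Assumed placeholder label for "no status recorded"; pick whatever name fits.
filled = toy_status.fillna("No Intervention")
filled_counts = filled.value_counts()
print(filled_counts)
```

On the real data the same calls would be `species.conservation_status.value_counts(dropna=False)` and `species.conservation_status.fillna(...)`.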


Now I'd like to get some data about the observations, such as how many data points there are, how species repeat across parks, etc.

In [14]:
obs.count()

scientific_name    23296
park_name          23296
observations       23296
dtype: int64

In [34]:
len(pd.unique(obs.scientific_name))

5541

Above, we saw that our species_info data has 5824 rows but only 5541 unique scientific names. Our observations data also contains 5541 unique scientific names, which lines up with the species_info count.
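Matching counts suggest the two tables cover the same species, but equal counts alone do not guarantee that the names themselves are identical. A quick sketch of a stricter check compares the two sets of names directly. The Series below are made-up stand-ins for `obs.scientific_name` and `species.scientific_name`, and `compare_names` is a hypothetical helper.

```python
import pandas as pd

def compare_names(left: pd.Series, right: pd.Series):
    """Return (names only in left, names only in right) as two sets."""
    left_set = set(left.dropna())
    right_set = set(right.dropna())
    return left_set - right_set, right_set - left_set

# Hypothetical toy columns (not the real data).
toy_obs_names = pd.Series(["Genus alpha", "Genus beta", "Genus beta"])
toy_species_names = pd.Series(["Genus alpha", "Genus beta", "Genus gamma"])

only_in_obs, only_in_species = compare_names(toy_obs_names, toy_species_names)
print(only_in_obs)      # names observed but missing from the species table
print(only_in_species)  # names in the species table that were never observed
```

If both sets come back empty for `compare_names(obs.scientific_name, species.scientific_name)`, the two tables can be joined on `scientific_name` without losing rows on either side.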