# Biodiversity Portfolio Project

I aim to answer the questions posed in the solution document:
- What is the distribution of conservation status for species?
- Are certain types of species more likely to be endangered?
- Are the differences between species and their conservation status significant?
- Which animal is most prevalent and what is their distribution amongst parks?

In [6]:
import pandas as pd
import numpy as np

from matplotlib import pyplot as plt
import seaborn as sns

First, I will load the two relevant datasets, `species_info.csv` and `observations.csv`.

In [7]:
species = pd.read_csv('species_info.csv')
species

,category,scientific_name,common_names,conservation_status
0,Mammal,Clethrionomys gapperi gapperi,Gapper's Red-Backed Vole,
1,Mammal,Bos bison,"American Bison, Bison",
2,Mammal,Bos taurus,"Aurochs, Aurochs, Domestic Cattle (Feral), Dom...",
3,Mammal,Ovis aries,"Domestic Sheep, Mouflon, Red Sheep, Sheep (Feral)",
4,Mammal,Cervus elaphus,Wapiti Or Elk,
...,...,...,...,...
5819,Vascular Plant,Solanum parishii,Parish's Nightshade,
5820,Vascular Plant,Solanum xanti,"Chaparral Nightshade, Purple Nightshade",
5821,Vascular Plant,Parthenocissus vitacea,"Thicket Creeper, Virginia Creeper, Woodbine",
5822,Vascular Plant,Vitis californica,"California Grape, California Wild Grape",


In [8]:
observations = pd.read_csv('observations.csv')
observations.head()

,scientific_name,park_name,observations
0,Vicia benghalensis,Great Smoky Mountains National Park,68
1,Neovison vison,Great Smoky Mountains National Park,77
2,Prunus subcordata,Yosemite National Park,138
3,Abutilon theophrasti,Bryce National Park,84
4,Githopsis specularioides,Great Smoky Mountains National Park,85


Check for duplicate rows and drop if necessary.

In [9]:
print(species.duplicated().nunique())
print(observations.duplicated().nunique())

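# duplicated() returns a boolean Series, so nunique() == 1 means every value is False (no duplicates),
# while nunique() == 2 means both True and False occur. Only observations has duplicate rows, so I drop them there.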

observations = observations.drop_duplicates()

1
2
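Counting the duplicated rows directly is a little easier to read than counting distinct boolean values. The following is a minimal sketch on a made-up three-row frame (the rows are hypothetical and only mimic the layout of `observations`):

```python
import pandas as pd

# Hypothetical toy data in the same layout as the observations dataframe.
toy = pd.DataFrame({
    'scientific_name': ['Abies bifolia', 'Abies bifolia', 'Abies concolor'],
    'park_name': ['Bryce National Park'] * 3,
    'observations': [109, 109, 83],
})

# duplicated() marks every repeat of an earlier row as True; sum() counts them.
n_duplicates = toy.duplicated().sum()
print(n_duplicates)

# drop_duplicates() keeps the first occurrence of each row.
deduplicated = toy.drop_duplicates()
print(deduplicated.shape)
```

On the real data, `observations.duplicated().sum()` would report how many rows `drop_duplicates()` is about to remove.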


Check the shape of the dataframes and the data types of their columns.

In [10]:
print(species.shape, observations.shape)
print(species.dtypes)
print(observations.dtypes)

(5824, 4) (23281, 3)
category               object
scientific_name        object
common_names           object
conservation_status    object
dtype: object
scientific_name    object
park_name          object
observations        int64
dtype: object


### Exploring the species dataframe

The columns and values will be analyzed and, if needed, replaced or adapted.

In [11]:
print(species.category.unique())
print(species.scientific_name.nunique())
print(species.common_names.nunique())
print(species.conservation_status.unique())

['Mammal' 'Bird' 'Reptile' 'Amphibian' 'Fish' 'Vascular Plant'
 'Nonvascular Plant']
5541
5504
[nan 'Species of Concern' 'Endangered' 'Threatened' 'In Recovery']


In [12]:
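# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about and which may not update the original dataframe.
species['conservation_status'] = species['conservation_status'].fillna('No Concern')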

print(species.conservation_status.unique())

['No Concern' 'Species of Concern' 'Endangered' 'Threatened' 'In Recovery']


### Exploring the observations dataframe

Duplicates have already been dropped. Next, I check whether there are any NaN values that need to be replaced.

In [13]:
print(observations.scientific_name.isna().sum())
print(observations.park_name.unique())
print(observations.observations.isna().sum())

0
['Great Smoky Mountains National Park' 'Yosemite National Park'
 'Bryce National Park' 'Yellowstone National Park']
0


No missing data could be found, meaning that the dataframe is ready for analysis.

### Conservation status by species

To answer the first three questions, I will first create a pivot table by category and conservation status.

- What is the distribution of conservation status for species?
- Are certain types of species more likely to be endangered?
- Are the differences between species and their conservation status significant?

In [14]:
species_conservation = species.groupby(['conservation_status', 'category']).scientific_name.count()

species_conservation.unstack(level=0)

category,Endangered,In Recovery,No Concern,Species of Concern,Threatened
Amphibian,1.0,,73.0,4.0,2.0
Bird,4.0,3.0,442.0,72.0,
Fish,3.0,,116.0,4.0,4.0
Mammal,7.0,1.0,176.0,28.0,2.0
Nonvascular Plant,,,328.0,5.0,
Reptile,,,74.0,5.0,
Vascular Plant,1.0,,4424.0,43.0,2.0
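To answer the first question directly, the pivot table can be collapsed into one total per conservation status. The sketch below re-enters the counts from the table above by hand, with empty cells treated as 0; on the real dataframe, `species.conservation_status.value_counts()` should give the same totals without retyping anything.

```python
import pandas as pd

# Counts copied from the pivot table above; empty (NaN) cells are entered as 0.
status_by_category = pd.DataFrame(
    {
        'Endangered': [1, 4, 3, 7, 0, 0, 1],
        'In Recovery': [0, 3, 0, 1, 0, 0, 0],
        'No Concern': [73, 442, 116, 176, 328, 74, 4424],
        'Species of Concern': [4, 72, 4, 28, 5, 5, 43],
        'Threatened': [2, 0, 4, 2, 0, 0, 2],
    },
    index=['Amphibian', 'Bird', 'Fish', 'Mammal',
           'Nonvascular Plant', 'Reptile', 'Vascular Plant'],
)

# Sum over categories to get one total per status, then express it as a percentage.
status_totals = status_by_category.sum()
status_share = (status_totals / status_totals.sum() * 100).round(2)
print(pd.DataFrame({'species': status_totals, 'share_percent': status_share}))
```

The totals add up to the 5824 rows of the species dataframe, and the large majority of entries fall under 'No Concern'.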


I will add a column to the dataframe that indicates whether a species has any protected conservation status, and then create a new pivot table from it. This table will also be the basis for testing whether the differences between categories of species are significant.

The table above already suggests that a larger share of some categories, such as mammals and birds, is protected.

In [15]:
# A species counts as protected if it has any status other than 'No Concern'.
species['is_protected'] = species.conservation_status != 'No Concern'

species_protection = species.groupby(['is_protected', 'category']).scientific_name.count().unstack(level=0)
species_protection.columns = ['not_protected', 'protected']
# Percentage share: protected species divided by all species in the category.
species_protection['share_protected'] = species_protection.protected / (species_protection.protected + species_protection.not_protected) * 100
species_protection

category,not_protected,protected,share_protected
Amphibian,73,7,8.75
Bird,442,79,15.163148
Fish,116,11,8.661417
Mammal,176,38,17.757009
Nonvascular Plant,328,5,1.501502
Reptile,74,5,6.329114
Vascular Plant,4424,46,1.029083
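The share of protected species in a category is the number of protected species divided by all species in that category. As a cross-check, `pd.crosstab` with `normalize='index'` computes such row-wise shares in one step. This is a minimal sketch that rebuilds a small frame from the Fish and Mammal counts above; the `status` column and its two labels are names I made up for the example.

```python
import pandas as pd

# Rebuild one row per species from the Fish and Mammal counts in the table above.
rows = (
    [('Fish', 'not_protected')] * 116 + [('Fish', 'protected')] * 11
    + [('Mammal', 'not_protected')] * 176 + [('Mammal', 'protected')] * 38
)
sample = pd.DataFrame(rows, columns=['category', 'status'])

# normalize='index' divides every count by its row total, giving shares per category.
share = pd.crosstab(sample['category'], sample['status'], normalize='index') * 100
print(share.round(2))
```

With these counts, roughly 8.7% of fish and 17.8% of mammals are protected.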


A chi-square test of independence is a suitable way to decide whether the differences between categories are statistically significant. It requires a contingency table with the counts of non-protected and protected species per category. As examples, I will test whether protection status differs between Fish and Mammal, and between Mammal and Bird.

In [16]:
from scipy.stats import chi2_contingency

# Each row is a category; the columns are [not_protected, protected].
# Fish and Mammal
contingency1 = [[116, 11],
                [176, 38]]

# Mammal and Bird
contingency2 = [[176, 38],
                [442, 79]]

print(chi2_contingency(contingency1))
print(chi2_contingency(contingency2))

(4.644937895246063, 0.031145264082780604, 1, array([[108.75073314,  18.24926686],
       [183.24926686,  30.75073314]]))
(0.5810483277947567, 0.445901703047197, 1, array([[179.93469388,  34.06530612],
       [438.06530612,  82.93469388]]))


The p-value for the first pair is about 0.031, which is below the 0.05 threshold. This suggests that protection status depends on the category for Fish and Mammal: mammals are protected at a significantly higher rate than fish. For the second pair the p-value is about 0.446, so the test finds no significant difference between Mammal and Bird. This is a lack of evidence for a difference, not proof that the two variables are independent.
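The two pairs above are only examples. A rough sketch of how the same test could be run for every pair of categories follows, using the counts from the protection table. Two caveats: `chi2_contingency` applies Yates' continuity correction to 2x2 tables by default, and testing 21 pairs at once inflates the chance of false positives, so I compare against a Bonferroni-adjusted threshold here. That adjustment is my own choice for the sketch, not something the project prescribes.

```python
from itertools import combinations

from scipy.stats import chi2_contingency

# (not_protected, protected) counts copied from the protection table above.
counts = {
    'Amphibian': (73, 7),
    'Bird': (442, 79),
    'Fish': (116, 11),
    'Mammal': (176, 38),
    'Nonvascular Plant': (328, 5),
    'Reptile': (74, 5),
    'Vascular Plant': (4424, 46),
}

# Run one 2x2 chi-square test per pair of categories and keep the p-value.
p_values = {}
for a, b in combinations(counts, 2):
    chi2, p, dof, expected = chi2_contingency([counts[a], counts[b]])
    p_values[(a, b)] = p

# Bonferroni: divide the usual 0.05 threshold by the number of tests.
alpha = 0.05 / len(p_values)
for (a, b), p in sorted(p_values.items(), key=lambda item: item[1]):
    verdict = 'significant' if p < alpha else 'not significant'
    print(f'{a} vs {b}: p = {p:.4f} ({verdict} at adjusted alpha)')
```

The Fish and Mammal pair reproduces the p-value of about 0.031 from above, which no longer passes the stricter adjusted threshold.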

### Observations by parks

For the last question, 

- Which animal is most prevalent and what is their distribution amongst parks?

I will group the data of the observation table by scientific name and park, and sum the observations.

In [18]:
observation_species = observations.groupby(['scientific_name', 'park_name']).observations.sum().unstack(level=-1)
observation_species

scientific_name,Bryce National Park,Great Smoky Mountains National Park,Yellowstone National Park,Yosemite National Park
Abies bifolia,109,72,215,136
Abies concolor,83,101,241,205
Abies fraseri,109,81,218,110
Abietinella abietina,101,65,243,183
Abronia ammophila,92,72,222,137
...,...,...,...,...
Zonotrichia leucophrys gambelii,58,87,246,169
Zonotrichia leucophrys oriantha,73,123,227,135
Zonotrichia querula,105,83,268,160
Zygodon viridissimus,100,71,270,159


In [20]:
# Sum across the four park columns only, so re-running the cell does not add sum_sightings to itself.
park_columns = list(observations.park_name.unique())
observation_species['sum_sightings'] = observation_species[park_columns].sum(axis=1)
observation_species.sort_values('sum_sightings', ascending=False)

scientific_name,Bryce National Park,Great Smoky Mountains National Park,Yellowstone National Park,Yosemite National Park,sum_sightings
Holcus lanatus,296,216,805,463,1780
Castor canadensis,278,243,703,501,1725
Hypochaeris radicata,294,195,726,505,1720
Puma concolor,311,239,753,408,1711
Procyon lotor,247,247,745,453,1692
...,...,...,...,...,...
Rana sierrae,31,11,60,42,144
Noturus baileyi,22,23,67,31,143
Vermivora bachmanii,20,18,58,45,141
Canis rufus,30,13,60,34,137


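As the sorted dataframe shows, the most sighted species overall is Holcus lanatus with 1780 sightings, most of them (805) in Yellowstone National Park. Holcus lanatus is a grass, however, not an animal. The most sighted animal is Castor canadensis, the North American beaver, with 1725 sightings, and it was also seen most often in Yellowstone National Park (703), followed by Yosemite (501), Bryce (278) and Great Smoky Mountains (243).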
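Since the question asks about animals, the totals can be joined with the category column of the species data and the plant categories filtered out. The sketch below uses the top totals from the output above together with a small stand-in lookup table. The category labels in that lookup are assumed for illustration; in the notebook they would come from `species[['scientific_name', 'category']]`.

```python
import pandas as pd

# Totals for the top rows, copied from the sorted output above.
totals = pd.DataFrame({
    'scientific_name': ['Holcus lanatus', 'Castor canadensis',
                        'Hypochaeris radicata', 'Puma concolor'],
    'sum_sightings': [1780, 1725, 1720, 1711],
})

# Stand-in for species[['scientific_name', 'category']]; these category
# labels are assumed for illustration and should come from the real data.
species_lookup = pd.DataFrame({
    'scientific_name': ['Holcus lanatus', 'Castor canadensis', 'Castor canadensis',
                        'Hypochaeris radicata', 'Puma concolor'],
    'category': ['Vascular Plant', 'Mammal', 'Mammal', 'Vascular Plant', 'Mammal'],
})
# A scientific name can appear more than once, so keep one row per name before merging.
species_lookup = species_lookup.drop_duplicates('scientific_name')

# Attach the category to each total and drop the plant categories.
plant_categories = ['Vascular Plant', 'Nonvascular Plant']
merged = totals.merge(species_lookup, on='scientific_name', how='left')
animals = merged[~merged['category'].isin(plant_categories)]
top_animal = animals.sort_values('sum_sightings', ascending=False).iloc[0]
print(top_animal['scientific_name'], top_animal['sum_sightings'])
```

Dropping duplicate names before the merge matters on the real data, because the species dataframe has 5824 rows but only 5541 unique scientific names, and a plain merge would repeat the totals for names that occur more than once.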