In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import pearsonr
import seaborn as sns

obs = pd.read_csv('observations.csv')
species = pd.read_csv('species_info.csv')

print(obs.head())
print(species.head())




            scientific_name                            park_name  observations
0        Vicia benghalensis  Great Smoky Mountains National Park            68
1            Neovison vison  Great Smoky Mountains National Park            77
2         Prunus subcordata               Yosemite National Park           138
3      Abutilon theophrasti                  Bryce National Park            84
4  Githopsis specularioides  Great Smoky Mountains National Park            85
  category                scientific_name  \
0   Mammal  Clethrionomys gapperi gapperi   
1   Mammal                      Bos bison   
2   Mammal                     Bos taurus   
3   Mammal                     Ovis aries   
4   Mammal                 Cervus elaphus   

                                        common_names conservation_status  
0                           Gapper's Red-Backed Vole                 NaN  
1                              American Bison, Bison                 NaN  
2  Aurochs, Aurochs, Domestic 

We have two datasets. The first records the number of observations of each species and the national park in which it was seen. The second lists each species with its category, common names and conservation status (if of concern).
We explored the conservation status of different species in the national parks.


Vascular plants (77%), followed by birds (8.9%), were the most frequently recorded species groups.

In [None]:
print(species.groupby("category").size())
print(len(species))
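
The percentages quoted above can be read straight off the category column rather than worked out from the counts by hand. A minimal sketch, assuming a small invented frame (`toy_species`, a hypothetical stand-in for `species`) so the numbers are easy to check:

```python
import pandas as pd

# Toy stand-in for species_info.csv (hypothetical rows, not the real data)
toy_species = pd.DataFrame({
    "category": ["Vascular Plant", "Vascular Plant", "Vascular Plant", "Bird", "Mammal"],
    "scientific_name": [
        "Vicia benghalensis", "Prunus subcordata", "Abutilon theophrasti",
        "Strix occidentalis", "Bos bison",
    ],
})

# normalize=True returns each category's share of all rows instead of raw counts
category_share = toy_species["category"].value_counts(normalize=True).mul(100).round(1)
print(category_share)
```

On the real `species` frame the same call would give the share of each species group in one step.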

In [None]:
# unique() also lists NaN, which marks species with no recorded status
print(species.conservation_status.unique())
print('There are 4 categories of conservation status')
# value_counts() drops NaN, so only species with a recorded status are plotted
species['conservation_status'].value_counts().plot(kind='bar')
plt.ylabel('Frequency')
plt.xlabel('Conservation status')
plt.title('Frequency of conservation statuses recorded for species in US National Parks')
plt.show()
print('Among species with a recorded status, the majority were a Species of Concern')



We next looked at whether species groups differed in their distribution across conservation statuses.


In [None]:
print(species.category.unique())
# first count each conservation status within each category, then plot the counts as bars
# fill missing statuses in conservation_status only, so other columns are left untouched
species['conservation_status'] = species['conservation_status'].fillna('No Intervention')
species_grouped = species.groupby(['category']).conservation_status.value_counts()
# convert the Series back to a pandas DataFrame with the counts in a 'count' column
species_grouped = species_grouped.rename('count').reset_index()
print(species_grouped)



In [None]:
# unstack pivots the long table into wide format, one column per conservation status
print(species_grouped.set_index(['category','conservation_status']).unstack())

species_grouped.set_index(['category','conservation_status']).unstack().plot.bar(figsize=(7, 5))
plt.title('Conservation status by species group')
plt.xlabel('Species group')
plt.ylabel('Count')
# legend labels follow the alphabetical order of the unstacked status columns
plt.legend(['Endangered','In Recovery','No Intervention','Species of Concern','Threatened'])



Vascular plants and birds have the largest counts of species classified as 'No Intervention'. Because the large majority of species need no intervention, raw counts cannot show which groups have relatively more endangered species, so we need to explore each status as a proportion of the number of species in each group.

In [None]:
species_grouped_norm = species_grouped.set_index(['category','conservation_status']).unstack()
print(species_grouped_norm)
# axis=1 sums across each row, giving the total number of species in each category
total_species = species_grouped_norm.sum(axis=1)
# axis=0 aligns on the row index, so each row is divided by its own category total
species_grouped_norm = species_grouped_norm.div(total_species, axis=0)
print(species_grouped_norm)
species_grouped_norm.plot.bar(figsize=(7, 5))
plt.title('Conservation status as a proportion of each species group')
plt.xlabel('Species group')
plt.ylabel('Proportion of species in group')
plt.legend(['Endangered','In Recovery','No Intervention','Species of Concern','Threatened'])


From this plot we can see that, as a proportion of all records in each species group, the largest shares of endangered species are in the fish and mammal groups. Reptiles, vascular plants and nonvascular plants had fewer recordings, and of these the species sighted were all species of concern (or not recorded).

In [None]:
species_grouped_norm = species_grouped.set_index(['category','conservation_status']).unstack()
print(species_grouped_norm)
# axis=0 sums down each column, giving the total number of species with each status
total_species = species_grouped_norm.sum(axis=0)
# axis=1 aligns on the column labels, so each column is divided by its own status total
species_grouped_norm = species_grouped_norm.div(total_species, axis=1)
print(species_grouped_norm)
species_grouped_norm.plot.bar(figsize=(7, 5))
plt.title('Species group as a proportion of each conservation status')
plt.xlabel('Species group')
plt.ylabel('Proportion of species with status')
plt.legend(['Endangered','In Recovery','No Intervention','Species of Concern','Threatened'])

When we look at the composition of each conservation status, we can see variation in which species groups predominate. For the endangered status, most species are mammals (44%), followed by birds (25%). These two groups also make up all of the species in recovery: birds (75%) and mammals (25%).
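
The two normalisations used above (by species group, then by conservation status) can also be produced with `pd.crosstab` and its `normalize` argument. This is an alternative sketch, not the method used in the cells above, and the rows in `toy_species` are invented for illustration:

```python
import pandas as pd

# Invented rows for illustration only (not the real species_info.csv)
toy_species = pd.DataFrame({
    "category": ["Mammal", "Mammal", "Mammal", "Mammal",
                 "Bird", "Bird", "Bird", "Bird"],
    "conservation_status": ["Endangered", "Endangered", "No Intervention", "No Intervention",
                            "Endangered", "In Recovery", "No Intervention", "No Intervention"],
})

# normalize="index": each row (species group) sums to 1
by_group = pd.crosstab(toy_species["category"],
                       toy_species["conservation_status"],
                       normalize="index")

# normalize="columns": each column (conservation status) sums to 1
by_status = pd.crosstab(toy_species["category"],
                        toy_species["conservation_status"],
                        normalize="columns")

print(by_group)
print(by_status)
```

Unlike `unstack()`, `crosstab` fills absent combinations with 0 rather than NaN, which may or may not be what a plot needs.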

In [7]:

print(obs.sort_values(by='observations',ascending=False))
print('The highest single-park count was for Lycopodium tristachyum, a clubmoss, in Yellowstone National Park')

                 scientific_name                            park_name  \
11281     Lycopodium tristachyum            Yellowstone National Park   
1168          Castilleja miniata            Yellowstone National Park   
20734        Cryptantha fendleri            Yellowstone National Park   
8749   Dracocephalum parviflorum            Yellowstone National Park   
7112           Bidens tripartita            Yellowstone National Park   
...                          ...                                  ...   
20375          Sambucus mexicana  Great Smoky Mountains National Park   
18823               Rana sierrae  Great Smoky Mountains National Park   
16054         Strix occidentalis  Great Smoky Mountains National Park   
15511         Collomia tinctoria  Great Smoky Mountains National Park   
9418             Corydalis aurea                  Bryce National Park   

       observations  
11281           321  
1168            317  
20734           316  
8749            316  
7112         
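
Sorting the raw rows ranks individual species-and-park records, so the top row is the highest count in a single park. To rank species overall, the counts would first need to be summed across parks. A minimal sketch with invented counts (`toy_obs` is a hypothetical stand-in for `obs`):

```python
import pandas as pd

# Invented counts for illustration; only the column names match observations.csv
toy_obs = pd.DataFrame({
    "scientific_name": ["Lycopodium tristachyum", "Lycopodium tristachyum",
                        "Castilleja miniata", "Castilleja miniata"],
    "park_name": ["Yellowstone National Park", "Bryce National Park",
                  "Yellowstone National Park", "Yosemite National Park"],
    "observations": [30, 5, 25, 20],
})

# Highest single record: one species in one park
top_single = toy_obs.loc[toy_obs["observations"].idxmax(), "scientific_name"]

# Highest overall: sum each species' counts across all parks, then sort
totals = (toy_obs.groupby("scientific_name")["observations"]
          .sum()
          .sort_values(ascending=False))
top_total = totals.index[0]

print(top_single, top_total)
```

In this toy data the two answers differ, which is why the distinction matters when describing the "most observed" species.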

Conclusions:
    -> There are lots of sightings recorded in the National Parks; the majority of species recorded have a conservation status of 'No Intervention' (this makes sense, as species are more commonly seen if not endangered)
    -> Vascular plants and birds are the predominant species groups that have species with a 'No Intervention' status
    -> At the species group level, fish and mammals have the highest proportions of species that are endangered
    -> At the conservation status level, mammals and birds make up most of the endangered species and all of the species in recovery (good news)
    -> However, this is a 7 day dataset. For exploring conservation status and species recovery, a longitudinal dataset would be of more interest
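
To relate observation counts to conservation status, the two files would need to be joined on `scientific_name`. A minimal sketch with invented toy rows; the statuses and counts below are made up, and dropping duplicate species rows before the merge is an assumption about how repeated names should be handled:

```python
import pandas as pd

# Invented rows for illustration; only the column names match the real files
toy_obs = pd.DataFrame({
    "scientific_name": ["Bos bison", "Bos bison", "Strix occidentalis", "Rana sierrae"],
    "park_name": ["Yellowstone National Park", "Bryce National Park",
                  "Yosemite National Park", "Yosemite National Park"],
    "observations": [100, 60, 20, 10],
})
toy_species = pd.DataFrame({
    "scientific_name": ["Bos bison", "Strix occidentalis", "Rana sierrae", "Rana sierrae"],
    "category": ["Mammal", "Bird", "Amphibian", "Amphibian"],
    # statuses here are placeholders, not the species' real statuses
    "conservation_status": [None, "Species of Concern", "Endangered", "Endangered"],
})

# Keep one row per species so the merge cannot duplicate observation rows
species_unique = toy_species.drop_duplicates(subset="scientific_name").copy()
species_unique["conservation_status"] = (
    species_unique["conservation_status"].fillna("No Intervention")
)

# validate raises an error if a species name still appears more than once
merged = toy_obs.merge(species_unique, on="scientific_name",
                       how="left", validate="many_to_one")

# Total sightings (not species counts) per conservation status
sightings_by_status = merged.groupby("conservation_status")["observations"].sum()
print(sightings_by_status)
```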
   