# Portfolio Project on Biodiversity in National Parks
The purpose of this notebook is to use data exploration, visualisation and inference techniques to gain an understanding of the following datasets. <br>
<br>
species_info.csv - contains data about different species and their conservation status. <br>
observations.csv - holds recorded sightings of different species at several national parks for the past 7 days.

This notebook will answer the following research questions. <br>
<br>
What is the distribution of conservation statuses? <br>
Are certain types of species more likely to be endangered? <br>
Which species were spotted the most at each park?

First we will import the relevant libraries.

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt



We will load the datasets into pandas dataframes, then print the first 5 rows of each dataframe and compute some summary statistics to gain an understanding of their contents.

In [None]:
print('Species Info')
species = pd.read_csv('species_info.csv')
print(species.head(5))
print('')
print('Observations')
observations = pd.read_csv('observations.csv')
print(observations.head(5))



From observing the first 5 rows of each dataframe, we see that the dataset Species Info has column names _category, scientific_name, common_names_ and _conservation_status_. The dataset Observations has column names _scientific_name, park_name_ and _observations_. <br>

We have also learnt that _conservation_status_ has missing values.

It makes sense to merge these tables on the common variable _scientific_name_ so that insights can be more easily visualised across the two tables. <br>

However, before we can do this we must do two things. First, we must deal with any missing values in each dataframe, since a merge also produces NaN wherever a row has no match and we want to be able to tell the two cases apart. Second, we must check for duplicates of _scientific_name_ in _species_, because each duplicated name would repeat the matching rows of _observations_ in the merged table.
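
To make both effects concrete, here is a minimal sketch on toy data. The frames `toy_species` and `toy_observations` and all of their values are made up for illustration; they are not taken from the real CSV files.

```python
import pandas as pd

# Toy stand-ins for the two tables (made-up values, for illustration only)
toy_species = pd.DataFrame({
    'scientific_name': ['Canis lupus', 'Canis lupus', 'Ursus americanus'],
    'conservation_status': ['Endangered', 'In Recovery', None],
})
toy_observations = pd.DataFrame({
    'scientific_name': ['Canis lupus', 'Ursus americanus', 'Puma concolor'],
    'observations': [30, 50, 20],
})

# A right merge keeps every observation row and attaches species info where the key matches
toy_merged = pd.merge(toy_species, toy_observations, on='scientific_name', how='right')
print(toy_merged)

# The duplicated 'Canis lupus' key repeats its observation row, inflating the total
print(toy_observations['observations'].sum(), '->', toy_merged['observations'].sum())
```

The duplicated key turns 3 observation rows into 4, and the unmatched 'Puma concolor' row gets a NaN status that cannot be told apart from the status that was already missing for 'Ursus americanus'.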

In [None]:
print('Null Values in Species Info')
print(species.isna().sum())
print('\nNull Values in Observations')
print(observations.isna().sum())

We have found out that only _conservation_status_ contains null values.

In [None]:
species.conservation_status.value_counts()

By getting more information about the counts of different values of _conservation_status_, we see that the status is either _Species of Concern, Endangered, Threatened,_ or _In Recovery_. <br>
We can use this context to infer that the _NaN_ values represent species without a conservation status. <br>
Therefore this is structurally missing data. <br>
We can navigate this by filling any _NaN_ values for _conservation_status_ with a new value 'No Conservation Status'.

In [None]:
species['conservation_status'] = species['conservation_status'].fillna('No Conservation Status')

In [None]:
print(species.head(5))

Now that we have dealt with all missing values, we just need to check for duplicates of _scientific_name_ in the _species_ dataframe before we merge.

In [None]:
print('Species Unique Values')
print(species.nunique())
print('\nShape of Species Dataframe')
print(species.shape)

We see that _species_ has 5,824 rows but only 5,541 unique species according to _scientific_name_, so there are duplicates present in this column. <br>
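
Before choosing a fix, it can help to see which columns actually differ within each duplicated name. The following is one possible check, sketched on a made-up `toy_species` frame that only borrows the column names from _species_.

```python
import pandas as pd

# Made-up rows that mimic the two kinds of duplicate we might find
toy_species = pd.DataFrame({
    'category': ['Mammal'] * 5,
    'scientific_name': ['Canis lupus', 'Canis lupus',
                        'Cervus elaphus', 'Cervus elaphus',
                        'Ursus americanus'],
    'common_names': ['Gray Wolf', 'Gray Wolf',
                     'Wapiti Or Elk', 'Rocky Mountain Elk',
                     'Black Bear'],
    'conservation_status': ['Endangered', 'In Recovery',
                            'No Conservation Status', 'No Conservation Status',
                            'No Conservation Status'],
})

# keep=False marks every row whose scientific_name appears more than once
dupes = toy_species[toy_species.duplicated('scientific_name', keep=False)]

# For each duplicated name, flag the columns that hold more than one distinct value
differing = dupes.groupby('scientific_name').nunique().gt(1)
print(differing)
```

Applied to the real _species_ dataframe, the same two lines would show whether the duplicates come from _common_names_, _conservation_status_, or both.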

To solve this we will group rows by _scientific_name_ in the species table. If duplicates are caused by differences in _common_names_ we will aggregate these into a list of distinct common names.  <br>

If duplicates are caused by differences in _conservation_status_ we will choose the worst case scenario to ensure that the danger to the species is not understated. <br>
To do this we will make _conservation_status_ an ordinal categorical variable.
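
As a small illustration of why the ordering matters, the sketch below builds an ordered categorical on toy data and compares two aggregations. The names `toy_order` and `toy` and the species labels are hypothetical, and only three statuses are used to keep the example short.

```python
import pandas as pd

# Hypothetical severity order, least to most severe
toy_order = ['No Conservation Status', 'Threatened', 'Endangered']

toy = pd.DataFrame({
    'scientific_name': ['Species a', 'Species a', 'Species b'],
    'conservation_status': ['Endangered', 'Threatened', 'No Conservation Status'],
})
toy['conservation_status'] = pd.Categorical(
    toy['conservation_status'], categories=toy_order, ordered=True
)

# max follows the category order, so it returns the most severe status per species
worst = toy.groupby('scientific_name')['conservation_status'].max()

# last follows row order, so it returns whichever status happens to appear last
last = toy.groupby('scientific_name')['conservation_status'].last()

print(worst['Species a'], '|', last['Species a'])
```

Here `max` picks 'Endangered' for 'Species a' while `last` picks 'Threatened', so only an order-aware aggregation guarantees the worst case.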

In [None]:
# ordered from least to most severe
status_order = [
    'No Conservation Status',
    'Species of Concern',
    'In Recovery',   # assumption: a recovering species ranks below Threatened and Endangered
    'Threatened',
    'Endangered'
]

In [None]:
species['conservation_status'] = pd.Categorical(
    species['conservation_status'],
    categories=status_order,
    ordered=True
)

In [None]:
species = species.groupby('scientific_name').agg({
    'common_names': lambda x: ', '.join(sorted(set(x))),
    'category': 'first',                # take first value (assuming it's the same across duplicates)
    'conservation_status': 'max'        # highest-ranked status in status_order, i.e. the worst case scenario
}).reset_index()

print(species.shape)

Finally we can see that species now has the same number of rows as distinct scientific names. We can now merge the two tables. <br>

Since the variable of most interest is _observations_, followed by _conservation_status_, we will use a right merge with _observations_ as the right-hand table. The new dataframe then keeps every row of _observations_ and adds the _species_ columns (including _conservation_status_) wherever _scientific_name_ matches. <br>

In [None]:
df = pd.merge(left=species, right=observations, on='scientific_name', how='right')
print('Shape of Observations Dataframe')
print(observations.shape)
print('\nShape of New Dataframe Df')
print(df.shape)

The new dataframe _df_ has the same number of rows as _observations_ but three extra columns as expected. <br>

In [None]:
print(df.isna().sum())

The new dataframe has no null values so every scientific name in _observations_ has been matched to a scientific name in _species_.
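
A more explicit way to confirm the match is to let pandas report it, using the `validate` and `indicator` arguments of `pd.merge`. The sketch below uses made-up toy frames rather than the real tables.

```python
import pandas as pd

# Toy tables: 'Puma concolor' is observed but missing from the species table
toy_species = pd.DataFrame({
    'scientific_name': ['Canis lupus', 'Ursus americanus'],
    'category': ['Mammal', 'Mammal'],
})
toy_observations = pd.DataFrame({
    'scientific_name': ['Canis lupus', 'Canis lupus', 'Puma concolor'],
    'park_name': ['Park A', 'Park B', 'Park A'],
    'observations': [30, 25, 20],
})

checked = pd.merge(
    toy_species, toy_observations,
    on='scientific_name', how='right',
    validate='one_to_many',  # raises MergeError if the left table has duplicate keys
    indicator=True,          # adds a _merge column: 'both', 'left_only' or 'right_only'
)

# Rows present only in the right table had no matching species record
unmatched = checked.loc[checked['_merge'] == 'right_only', 'scientific_name'].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would confirm the same conclusion as the null count above, without relying on the missing values having been filled beforehand.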

Now that we have a combined dataset we will explore the dataset using summary statistics.

In [None]:
df.nunique()

A count of unique values reveals that our dataframe describes 5,540 distinct species which fall into 7 categories and have been observed across 4 different national parks.

In [None]:
print('Categories')
print(df['category'].unique())
print('')
print('Parks')
print(df['park_name'].unique())

In [None]:
df.info()

We see that _df_ has 23,296 rows. All variables have the datatype _object_ which is expected for strings, except _conservation_status_ which we made categorical and _observations_ which has datatype _int64_ as expected for a counting variable. <br>

In [None]:
df.describe(include='all')

The most frequent park reported is Great Smoky Mountains National Park, but it accounts for exactly 25% of the rows, so the rows are evenly split between the 4 parks and the park shown as most frequent is simply the result of a four-way tie.

We also see that 77% of rows in this dataset fall under the category Vascular Plant.

Because each row records one species at one park, this share reflects how many plant species are recorded rather than how often plants are sighted. It still makes sense that plants are well represented, as they cannot move and hide in their surroundings.

In [None]:
colours = [
    'lightcoral',
    'lightsalmon',
    'palegoldenrod',
    'lightgreen',
    'lightcyan',
    'paleturquoise',
    'lightblue',
    'plum',
    'lavender',
    'thistle',
    'pink',
    'mistyrose',
    'peachpuff',
    'wheat',
    'powderblue',
    'honeydew',
    'mintcream'
]

Which types of animals are we most likely to observe at each park?

In [None]:
filtered_df = df[~df['category'].isin(['Vascular Plant', 'Nonvascular Plant'])] # filtering out plants as they dominate the dataset
grouped = filtered_df.groupby(['park_name', 'category'])['observations'].sum().reset_index() # grouping by park
pivot_table = grouped.pivot(index='park_name', columns='category', values='observations') 
pivot_table.plot(kind='bar', figsize=(10, 6), color=colours)
plt.xlabel('Park')
plt.ylabel('Total Observations')
plt.title('Most Observed Classes of Animals by Park')
plt.xticks(rotation=45)
plt.legend(title='Class')
plt.tight_layout()
plt.show()

The above grouped bar chart shows that Yellowstone National Park is the best for bird-spotting and that the fewest animal observations occurred in the Great Smoky Mountains.

Are the least observed animals also the most endangered?

In [None]:
# plot total observations vs conservation_status (perhaps log scale so that the count for 'no conservation status' fits on the graph)
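
One way the planned plot could be prepared is sketched below. The helper `observations_by_status` and the `toy_df` values are hypothetical. The function assumes a frame with the same _category_, _conservation_status_ and _observations_ columns as _df_, and it excludes plants because the question is about animals.

```python
import pandas as pd

def observations_by_status(frame):
    """Sum animal observations per conservation status, largest first."""
    animals = frame[~frame['category'].isin(['Vascular Plant', 'Nonvascular Plant'])]
    return (
        animals.groupby('conservation_status', observed=True)['observations']
        .sum()
        .sort_values(ascending=False)
    )

# Made-up rows, for illustration only
toy_df = pd.DataFrame({
    'category': ['Bird', 'Mammal', 'Mammal', 'Fish', 'Vascular Plant'],
    'conservation_status': ['No Conservation Status', 'No Conservation Status',
                            'Endangered', 'Species of Concern',
                            'No Conservation Status'],
    'observations': [1000, 2000, 10, 300, 5000],
})

totals = observations_by_status(toy_df)
print(totals)

# In the notebook, a log-scaled bar chart keeps the small groups visible:
# observations_by_status(df).plot(kind='bar', logy=True)
```

The log scale is a presentation choice: without it, the 'No Conservation Status' bar would likely dwarf the others.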

Which types of animals are the most endangered?

In [None]:
# TODO: change observation sum to row count as the graph is currently biased by number of observations
at_risk = df[
    (df['conservation_status'] != 'No Conservation Status')
    & (df['conservation_status'] != 'Species of Concern')
]
obs_sum = (
    at_risk
    .groupby(['conservation_status', 'category'], observed=True)['observations']
    .sum()
    .unstack()
)
obs_sum.plot(kind='bar', stacked=True, color=colours)
plt.xticks(rotation=45)
plt.legend(title='Class')
plt.xlabel('Conservation Status')
plt.ylabel('Total Observations')
plt.title('Make-Up of Most at Risk Groups')
plt.tight_layout()
plt.show()

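The above stacked bar chart shows that, by total observations, fish make up the largest share of the Threatened group and mammals the largest share of the Endangered group. Whilst many of the bird observations are of endangered species, more are of birds in recovery. Because the bars sum observations rather than count species, these shares are weighted by how often each species was sighted.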
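
To remove the bias from sighting frequency, the chart could instead count distinct species in each group. The helper `species_counts` and the toy rows below are hypothetical, and the function assumes the same column names as _df_.

```python
import pandas as pd

def species_counts(frame, exclude=('No Conservation Status', 'Species of Concern')):
    """Count distinct species per conservation status and category."""
    at_risk = frame[~frame['conservation_status'].isin(list(exclude))]
    return (
        at_risk.groupby(['conservation_status', 'category'], observed=True)['scientific_name']
        .nunique()
        .unstack(fill_value=0)
    )

# Made-up rows: one threatened fish seen at two parks, two endangered mammals
toy_df = pd.DataFrame({
    'scientific_name': ['Fish a', 'Fish a', 'Mammal m1', 'Mammal m2', 'Bird b'],
    'category': ['Fish', 'Fish', 'Mammal', 'Mammal', 'Bird'],
    'conservation_status': ['Threatened', 'Threatened', 'Endangered',
                            'Endangered', 'No Conservation Status'],
    'park_name': ['Park A', 'Park B', 'Park A', 'Park A', 'Park A'],
    'observations': [100, 120, 5, 3, 500],
})

counts = species_counts(toy_df)
print(counts)

# In the notebook: species_counts(df).plot(kind='bar', stacked=True)
```

In the toy data the fish has far more observations, yet it counts as one species against two endangered mammals, which is the distinction the observation sum hides.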

Which species were most observed across all 4 parks?

In [None]:
from wordcloud import WordCloud

print('Wordcloud of Most Observed Species')

df['common_names_cleaned'] = (
    df['common_names']
    .astype(str)
    .str.replace(r"[^\w\s\-]", "", regex=True)   # Remove punctuation except hyphens
    .str.replace(r"\s+", " ", regex=True)        # Normalise whitespace
    .str.strip()                                 # Remove leading/trailing spaces
)

species_counts = df.groupby('common_names_cleaned')['observations'].sum()
wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(species_counts)
plt.figure(figsize=(12, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

The above word cloud suggests that mosses are among the most observed species across the 4 parks combined. It does not break the counts down by park.

Which animal species are most likely to be observed at each park?

In [None]:
# print table
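
A possible way to build this table is sketched below. The helper `top_species_per_park`, its parameter `n` and the toy rows are hypothetical. The function assumes the same _park_name_, _category_, _scientific_name_ and _observations_ columns as _df_.

```python
import pandas as pd

def top_species_per_park(frame, n=3):
    """Return the n most observed animal species for each park."""
    animals = frame[~frame['category'].isin(['Vascular Plant', 'Nonvascular Plant'])]
    totals = animals.groupby(
        ['park_name', 'scientific_name'], as_index=False
    )['observations'].sum()
    return (
        totals.sort_values(['park_name', 'observations'], ascending=[True, False])
        .groupby('park_name')
        .head(n)                  # first n rows per park after sorting
        .reset_index(drop=True)
    )

# Made-up rows, for illustration only
toy_df = pd.DataFrame({
    'park_name': ['Park A', 'Park A', 'Park A', 'Park B', 'Park B'],
    'category': ['Mammal', 'Bird', 'Nonvascular Plant', 'Mammal', 'Bird'],
    'scientific_name': ['Species x', 'Species y', 'Moss z', 'Species x', 'Species y'],
    'observations': [10, 30, 999, 50, 20],
})

table = top_species_per_park(toy_df, n=1)
print(table)
```

In the notebook, `top_species_per_park(df)` would list the three most observed animal species for each of the 4 parks.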