### Biodiversity Python Project by Albert Cort Banke
_____________________________________________________________________________________________________________________________________________

Project scope

***Goal***: Draw insights from the included biodiversity data and illustrate interesting features by answering questions with visualizations and summary statistics.

***Data***: There are two datasets. One describes species, and the other records observations of different species in national parks.

***Analysis***: We use exploratory data analysis and hypothesis testing to assess the data and determine associations in the data.

***Questions***: In what ways do species represent biodiversity? What parks have the most biodiversity? Why?

_____________________________________________________________________________________________________________________________________________

**1. Preparation**

Import the relevant libraries (A), then read the data (B).

In [1]:
# (A)
from matplotlib import pyplot as plt
from scipy.stats import pearsonr
from itertools import cycle
import pandas as pd
import numpy as np
import seaborn as sns

In [2]:
# (B)
species = pd.read_csv('species_info.csv')

observations = pd.read_csv('observations.csv')

# We print the first few rows of each dataset to check that the data has been imported correctly

print(species.head())
print(observations.head())

# The datasets look correct at first glance. However, in the species dataset the conservation_status column is NaN in all of the first five rows.
# Moreover, the common_names column can contain several names for a single scientific_name. This is important to note.

  category                scientific_name  \
0   Mammal  Clethrionomys gapperi gapperi   
1   Mammal                      Bos bison   
2   Mammal                     Bos taurus   
3   Mammal                     Ovis aries   
4   Mammal                 Cervus elaphus   

                                        common_names conservation_status  
0                           Gapper's Red-Backed Vole                 NaN  
1                              American Bison, Bison                 NaN  
2  Aurochs, Aurochs, Domestic Cattle (Feral), Dom...                 NaN  
3  Domestic Sheep, Mouflon, Red Sheep, Sheep (Feral)                 NaN  
4                                      Wapiti Or Elk                 NaN  
            scientific_name                            park_name  observations
0        Vicia benghalensis  Great Smoky Mountains National Park            68
1            Neovison vison  Great Smoky Mountains National Park            77
2         Prunus subcordata               
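
The two remarks about the species data (missing `conservation_status` values and several names packed into `common_names`) can be checked in numbers rather than by eye. The following is a minimal sketch, not part of the original notebook. It assumes a small stand-in frame built from four of the rows printed above; the name `sample` is introduced here for illustration only, and the row with the truncated `common_names` value is left out.

```python
import pandas as pd

# Stand-in for species_info.csv: four rows copied from the head() output above.
# The row whose common_names value is truncated in the printout is omitted.
sample = pd.DataFrame({
    "category": ["Mammal", "Mammal", "Mammal", "Mammal"],
    "scientific_name": [
        "Clethrionomys gapperi gapperi",
        "Bos bison",
        "Ovis aries",
        "Cervus elaphus",
    ],
    "common_names": [
        "Gapper's Red-Backed Vole",
        "American Bison, Bison",
        "Domestic Sheep, Mouflon, Red Sheep, Sheep (Feral)",
        "Wapiti Or Elk",
    ],
    "conservation_status": [None, None, None, None],
})

# How many conservation_status entries are missing in this sample?
n_missing = int(sample["conservation_status"].isna().sum())

# How many comma-separated common names does each species carry?
name_counts = sample["common_names"].str.split(",").str.len()

print(n_missing)
print(name_counts.tolist())
```

Running the same two expressions on the full `species` frame would show whether the pattern seen in the first five rows holds for the whole dataset.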

In [4]:
# Check for missing values in the datasets

print(species.info())

# The conservation_status column has only 191 non-null entries out of 5824, so 5633 values are missing. The other columns in the species dataset are complete

print(observations.info())

# The observations dataset is complete. No missing values are present



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5824 entries, 0 to 5823
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   category             5824 non-null   object
 1   scientific_name      5824 non-null   object
 2   common_names         5824 non-null   object
 3   conservation_status  191 non-null    object
dtypes: object(4)
memory usage: 182.1+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23296 entries, 0 to 23295
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   scientific_name  23296 non-null  object
 1   park_name        23296 non-null  object
 2   observations     23296 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 546.1+ KB
None
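
The `info()` output above gives non-null counts, and the share of missing values follows directly from them. The following is a hedged sketch of that calculation, not part of the original notebook. It assumes a synthetic series that reproduces only the counts reported above (191 non-null out of 5824); the placeholder string `"listed"` is hypothetical and does not correspond to a real status value in the data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: reproduces only the counts reported by species.info().
# "listed" is a hypothetical placeholder, not an actual conservation status.
n_rows, n_non_null = 5824, 191
status = pd.Series(
    ["listed"] * n_non_null + [np.nan] * (n_rows - n_non_null),
    name="conservation_status",
)

# Count the missing entries and express them as a share of all rows
missing = int(status.isna().sum())
share = missing / n_rows

print(f"{missing} of {n_rows} values missing ({share:.1%})")
```

On the real data, `species.isna().sum()` gives the same kind of count for every column at once.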
