# Portfolio Project: Biodiversity in National Parks
Code and analysis by Adam Laviguer \<adamlaviguer@gmail.com\>

___
## Project Goals
1. Complete a project to supplement my portfolio
2. Use Jupyter Notebook to communicate findings
3. Run an analysis on a set of data
4. Become familiar with data analysis workflow

## Prompt
For this project, you will interpret data from the National Park Service about endangered species in different parks.

You will perform some data analysis on the conservation statuses of these species and investigate if there are any patterns or themes to the types of species that become endangered. During this project, you will analyze, clean up, and plot data as well as pose questions and seek to answer them in a meaningful way.

After you perform your analysis, you will share your findings about the National Park Service.

___
## Project Data

This project is based on data provided in two CSV files called `observations.csv` and `species_info.csv`. Refer to the explanation below to understand the variables in each dataset.

**observations.csv:** 23,296 total rows
- **scientific_name** - the scientific name of each species
- **park_name** - the park where the species was observed
- **observations** - the number of times the species was observed at that park

**species_info.csv:** 5,824 total rows
- **category** - the taxonomic category of each species (e.g., Mammal, Vascular Plant)
- **scientific_name** - the scientific name of each species
- **common_names** - the common names of each species
- **conservation_status** - each species' current conservation status

From this preliminary overview, we can see that both datasets share one variable, `scientific_name`. We will therefore combine the datasets by merging on `scientific_name` as the key, which lets us draw conclusions from all of the data provided.
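As a rough illustration of what merging on a shared key does, the sketch below joins two tiny hand-made frames on `scientific_name`. The park names and values are made up for illustration and are not taken from the project files. It also shows one thing worth watching for: a key that repeats on one side multiplies the matching rows on the other.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (hypothetical values).
obs = pd.DataFrame({
    "scientific_name": ["Bos bison", "Bos bison", "Ovis aries"],
    "park_name": ["Park A", "Park B", "Park A"],
    "observations": [10, 20, 30],
})
info = pd.DataFrame({
    "scientific_name": ["Bos bison", "Ovis aries", "Ovis aries"],
    "category": ["Mammal", "Mammal", "Mammal"],
    "common_names": ["American Bison", "Domestic Sheep", "Sheep (Feral)"],
})

# Inner join on the shared key (the pandas default for pd.merge).
merged = pd.merge(obs, info, on="scientific_name")

# A key that appears twice on the right side yields two output rows
# for every matching row on the left.
rows_per_species = merged.groupby("scientific_name").size()
print(rows_per_species)
```

Comparing `len(merged)` with `len(obs)` is a quick way to spot this kind of row multiplication after a merge.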

In [19]:
# Import common libraries. All or some of these may be used.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from scipy import stats
from scipy.stats import iqr, ttest_ind, pearsonr, trim_mean, chi2_contingency, ttest_1samp, binomtest

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

### Understanding `observations.csv`

In [20]:
# Read the data file into a pandas DataFrame.
observations_df = pd.read_csv('./codecademy-portfolio-project-biodiversity-in-national-parks/observations.csv')
# Display the first five rows, the column summary, and summary statistics.
print('\n=================== FIRST FIVE ROWS ===================\n{}\n'.format(observations_df.head()))
print('\n=================== DF INFO ===================')
observations_df.info()
print('\n\n=================== DF DESCRIBE ===================\n{}\n'.format(observations_df.describe()))


            scientific_name                            park_name  observations
0        Vicia benghalensis  Great Smoky Mountains National Park            68
1            Neovison vison  Great Smoky Mountains National Park            77
2         Prunus subcordata               Yosemite National Park           138
3      Abutilon theophrasti                  Bryce National Park            84
4  Githopsis specularioides  Great Smoky Mountains National Park            85


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23296 entries, 0 to 23295
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   scientific_name  23296 non-null  object
 1   park_name        23296 non-null  object
 2   observations     23296 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 546.1+ KB


       observations
count  23296.000000
mean     142.287904
std       69.890532
min        9.000000
25%       86.000000
50%      124.000

### Understanding `species_info.csv`

In [21]:
# Read the data file into a pandas DataFrame.
species_info_df = pd.read_csv('./codecademy-portfolio-project-biodiversity-in-national-parks/species_info.csv')
# Display the first five rows, the column summary, and summary statistics.
print('\n=================== FIRST FIVE ROWS ===================\n{}\n'.format(species_info_df.head()))
print('\n=================== DF INFO ===================')
species_info_df.info()
print('\n\n=================== DF DESCRIBE ===================\n{}\n'.format(species_info_df.describe()))


  category                scientific_name  \
0   Mammal  Clethrionomys gapperi gapperi   
1   Mammal                      Bos bison   
2   Mammal                     Bos taurus   
3   Mammal                     Ovis aries   
4   Mammal                 Cervus elaphus   

                                        common_names conservation_status  
0                           Gapper's Red-Backed Vole                 NaN  
1                              American Bison, Bison                 NaN  
2  Aurochs, Aurochs, Domestic Cattle (Feral), Dom...                 NaN  
3  Domestic Sheep, Mouflon, Red Sheep, Sheep (Feral)                 NaN  
4                                      Wapiti Or Elk                 NaN  


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5824 entries, 0 to 5823
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   category             5824 non-null   object
 1   scientific_name   

### Combining the Datasets

In [22]:
# Inner join (the pandas default) on the shared key.
merged_df = pd.merge(observations_df, species_info_df, on='scientific_name')

# Display the first five rows, the column summary, and summary statistics.
print('\n=================== FIRST FIVE ROWS ===================\n{}\n'.format(merged_df.head()))
print('\n=================== DF INFO ===================')
merged_df.info()
print('\n\n=================== DF DESCRIBE ===================\n{}\n'.format(merged_df.describe()))


            scientific_name                            park_name  \
0        Vicia benghalensis  Great Smoky Mountains National Park   
1            Neovison vison  Great Smoky Mountains National Park   
2         Prunus subcordata               Yosemite National Park   
3      Abutilon theophrasti                  Bryce National Park   
4  Githopsis specularioides  Great Smoky Mountains National Park   

   observations        category                        common_names  \
0            68  Vascular Plant  Purple Vetch, Reddish Tufted Vetch   
1            77          Mammal                       American Mink   
2           138  Vascular Plant                        Klamath Plum   
3            84  Vascular Plant                          Velvetleaf   
4            85  Vascular Plant                      Common Bluecup   

  conservation_status  
0                 NaN  
1                 NaN  
2                 NaN  
3                 NaN  
4                 NaN  


<class 'pandas.co

### Handling Duplicates

In [None]:
merged_df = merged_df.sort_values(by=['category', 'scientific_name', 'common_names', 'park_name'])

# Count duplicate rows. keep=False flags every row in a duplicate group,
# so the sum counts all copies rather than only the repeats.
print(merged_df.duplicated(keep=False).sum())

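def consolidate_common_names(df):
    """Merge adjacent row pairs that share category, scientific_name, and
    park_name, joining their common_names. Assumes df is sorted so that
    such rows are adjacent; only two rows are merged at a time."""
    consolidated_rows = []
    skip_next = False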

    for i in range(len(df)):
        if skip_next:
            skip_next = False
            continue

        current_row = df.iloc[i]

        if i < len(df) - 1:
            next_row = df.iloc[i + 1]

            if (current_row['category'] == next_row['category'] and
                current_row['scientific_name'] == next_row['scientific_name'] and
                current_row['park_name'] == next_row['park_name']):
                
                combined_common_names = f"{current_row['common_names']}, {next_row['common_names']}"
                consolidated_row = current_row.copy()
                consolidated_row['common_names'] = combined_common_names
                consolidated_rows.append(consolidated_row)
                skip_next = True
            else:
                consolidated_rows.append(current_row)
        else:
            consolidated_rows.append(current_row)

    return pd.DataFrame(consolidated_rows)


# # Drop the duplicate rows in place.
# merged_df.drop_duplicates(inplace=True)
# # Check for duplicate rows.
# print(merged_df.duplicated(keep=False).sum())

62
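
The pairwise loop in `consolidate_common_names` merges only two adjacent rows at a time. One possible alternative, sketched here on made-up rows rather than the project data, groups on the identifying columns and joins every `common_names` value in each group. The grouping keys and the choice to keep the first `observations` value are assumptions for illustration, not the method used in this notebook.

```python
import pandas as pd

# Made-up rows: the first two differ only in common_names.
df = pd.DataFrame({
    "category": ["Mammal", "Mammal", "Mammal"],
    "scientific_name": ["Cervus elaphus", "Cervus elaphus", "Bos bison"],
    "park_name": ["Park A", "Park A", "Park A"],
    "observations": [100, 100, 50],
    "common_names": ["Wapiti Or Elk", "Rocky Mountain Elk", "American Bison"],
})

keys = ["category", "scientific_name", "park_name"]

# Collapse each group to one row: join all common names and keep the
# first observations value (an assumption; summing or averaging may be
# more appropriate depending on what the duplicates mean).
consolidated = (
    df.groupby(keys, as_index=False, sort=False)
      .agg(observations=("observations", "first"),
           common_names=("common_names", ", ".join))
)
print(consolidated)
```

Unlike the loop, this handles groups of any size and does not depend on the rows being sorted first.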


## Analysis

### 1. What is the most diverse national park?

To begin answering this question, we define a park's "diversity" as the number of unique species that have been observed there.

In [24]:
for park in merged_df['park_name'].unique():
    unique_species_count = merged_df[merged_df['park_name'] == park]['scientific_name'].nunique()
    print(f'{park}: {unique_species_count} unique species')

Bryce National Park: 5541 unique species
Great Smoky Mountains National Park: 5541 unique species
Yellowstone National Park: 5541 unique species
Yosemite National Park: 5541 unique species
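
The per-park loop can also be written as a single grouped aggregation. The sketch below uses a small made-up frame, not the project data, to show `groupby(...).nunique()` producing the same kind of count, sorted so the park with the most species comes first.

```python
import pandas as pd

# Made-up observations; "Park A" lists one species twice.
toy = pd.DataFrame({
    "park_name": ["Park A", "Park A", "Park A", "Park B"],
    "scientific_name": ["Bos bison", "Bos bison", "Ovis aries", "Bos bison"],
})

# Number of distinct species per park, largest first.
species_per_park = (
    toy.groupby("park_name")["scientific_name"]
       .nunique()
       .sort_values(ascending=False)
)
print(species_per_park)
```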
