In [5]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/Plankton-Patrol/notebooks


In [7]:
cd plankton_patrol_models/

/content/drive/MyDrive/Plankton-Patrol/notebooks/plankton_patrol_models


# Introduction

This notebook is for practicing XGBoost and for an initial analysis of the Tidal Benthic dataset from the [Chesapeake Bay Program DataHub](https://datahub.chesapeakebay.net/LivingResources). The data runs through September 26, 2013.

The Tidal Benthic database contains sediment, biomass, taxonomic, and water quality data. These separate datasets were merged using the monitoring event data, as described by the [2012 Users Guide to CBP Biological Monitoring Data](https://d18lev1ok5leia.cloudfront.net/chesapeakebay/documents/guide2012_final.pdf). Further cleaning is described in the [cleaning notebook](../../notebooks/planton_patrol-APIs/plank_cleaningBenthic.ipynb).

In [2]:
import pandas as pd

Next, read in our parameter dictionary.

In [8]:
import json

# Import from JSON file
with open('../../data/plank_CBPparam_dict.json', 'r') as f:
    benthic_param_dict = json.load(f)
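
As a minimal sketch of how this dictionary can be used, the snippet below looks up the description and units for a parameter code. It uses a small stand-in dictionary rather than the real JSON file, and assumes each entry carries the `Units` and `Type` keys that appear later in this notebook. `describe_param` is a hypothetical helper, not part of the project.

```python
# Stand-in for benthic_param_dict; the real file may contain other keys.
example_param_dict = {
    'SW': {'Units': 'None', 'Type': 'Shannon Wiener Index'},
    'TOT_ABUND': {'Units': 'Count', 'Type': 'Total Number Of Individuals'},
}


def describe_param(param_dict, code):
    """Return a readable label for a parameter code, or the code itself if unknown."""
    entry = param_dict.get(code)
    if entry is None:
        return code
    return f"{entry['Type']} ({entry['Units']})"


print(describe_param(example_param_dict, 'TOT_ABUND'))
print(describe_param(example_param_dict, 'WTEMP'))
```

Falling back to the raw code for unknown parameters keeps plot labels usable even before the dictionary is complete.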


# Using the 'wide' dataset

The 'wide' dataset converts each parameter to a column, creating one row per `FieldActivityId`. That is, any data with the same location (including depth), date, time, and sample volume will be in the same row of the dataframe. There are 903 data points from July 29, 2004 through September 26, 2013.

In [12]:
import pandas as pd

benthic_data_wide = pd.read_csv('../../data/plank_ChesapeakeBayBenthic_clean_wide.csv')
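
To illustrate what 'wide' means here, the sketch below pivots a tiny long-format table into one row per `FieldActivityId`. The column names `Parameter` and `Value` and all the numbers are made up for illustration. The actual reshaping is done in the cleaning notebook and may differ.

```python
import pandas as pd

# Toy long-format data: one row per (sample, parameter) measurement.
long_df = pd.DataFrame({
    'FieldActivityId': [1, 1, 2, 2],
    'Parameter': ['DO', 'WTEMP', 'DO', 'WTEMP'],
    'Value': [6.5, 21.0, 7.1, 19.5],
})

# One row per FieldActivityId, one column per parameter.
wide_df = (
    long_df.pivot(index='FieldActivityId', columns='Parameter', values='Value')
    .reset_index()
)
wide_df.columns.name = None  # drop the leftover 'Parameter' axis label

print(wide_df)
```

A parameter that was never measured for a given sample would show up as `NaN` in that row, which is why the wide table has so many missing values.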

Now, a few columns somehow survived cleaning. Let's remove them again.

In [13]:
benthic_data_wide = benthic_data_wide.drop(columns=['EventId','SampleType','SampleReplicate']).reset_index(drop=True)

In [14]:
benthic_data_wide.shape

(903, 569)

We can also verify that there are 903 unique samples.

In [15]:
benthic_data_wide['FieldActivityId'].nunique()

903

Since `FieldActivityId` is a unique numerical value for each sample, it might give spurious correlations. Let's remove it (other columns should probably be removed later).

In [16]:
benthic_data_wide = benthic_data_wide.drop(columns='FieldActivityId')

A lot of parameters are missing values, especially the taxonomic and sediment data. Let's remove columns missing more than half of the values.

In [17]:
threshold = 0.5

missing_percentage = benthic_data_wide.isnull().mean()

columns_to_keep = missing_percentage[missing_percentage <= threshold].index

# Create a new DataFrame with only the columns to keep
benthic_data_less_wide = benthic_data_wide[columns_to_keep]

print('Remaining columns:', benthic_data_less_wide.columns)


Remaining columns: Index(['CBSeg2003', 'CBSeg2003Description', 'Station', 'Latitude', 'Longitude',
       'SampleDate', 'SampleTime', 'TotalDepth', 'Source', 'ProjectIdentifier',
       'SampleVolume', 'PDepth', 'Salzone', 'MOIST', 'SAND', 'SILTCLAY', 'TC',
       'TIC', 'TN', 'TOC', 'PCT_CARN_OMN', 'PCT_DEPO', 'PCT_PI_ABUND',
       'PCT_PI_BIO', 'PCT_PS_ABUND', 'PCT_PS_BIO', 'SW', 'TOTAL_SCORE',
       'TOT_ABUND', 'TOT_BIOMASS_G', 'SampleDepth', 'DO', 'DO_SAT_P', 'PH',
       'SALINITY', 'SPCOND', 'WTEMP'],
      dtype='object')
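
The same filter can be checked on a toy frame. The sketch below (column names and values invented for illustration) keeps columns with at most half of their values missing, then takes an explicit `.copy()` of the result. Assigning into a column subset of another DataFrame without copying is a common trigger for pandas' `SettingWithCopyWarning`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'full': [1.0, 2.0, 3.0, 4.0],              # 0% missing
    'half': [1.0, np.nan, 3.0, np.nan],        # 50% missing -> kept (<= threshold)
    'sparse': [np.nan, np.nan, np.nan, 4.0],   # 75% missing -> dropped
})

threshold = 0.5
missing_fraction = toy.isnull().mean()  # fraction of missing values per column
keep = missing_fraction[missing_fraction <= threshold].index

# .copy() makes the subset an independent DataFrame, so later edits
# do not refer back to `toy`.
toy_less_wide = toy[keep].copy()
toy_less_wide['full'] = toy_less_wide['full'].astype('int64')

print(list(toy_less_wide.columns))
```

Note that `isnull().mean()` returns a fraction between 0 and 1, so a threshold of `0.5` corresponds to 50% missing.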


## Updating the dictionary--first time only

Some of these columns are already in our parameter dictionary, and some do not need to be (SampleDate, Latitude, Longitude). Let's find which ones are missing.

In [None]:
columns_in_data = benthic_data_less_wide.columns.to_list()

no_dict_columns = ['CBSeg2003', 'CBSeg2003Description', 'Station', 'Latitude', 'Longitude', 'SampleDate', 'SampleTime', 'TotalDepth', 'Source','ProjectIdentifier', 'SampleVolume', 'PDepth', 'Salzone']

columns_to_check = [value for value in columns_in_data if value not in no_dict_columns]

[col for col in columns_to_check if col not in benthic_param_dict]


Now we consult the [2012 Users Guide to CBP Biological Monitoring Data](https://d18lev1ok5leia.cloudfront.net/chesapeakebay/documents/guide2012_final.pdf) for the missing definitions.

In [None]:
additional_param_dict = {'PCT_CARN_OMN': {'Units': 'Percent', 'Type':'Percent Carnivores And Omnivores'},
 'PCT_DEPO': {'Units': 'Percent', 'Type':'Percent Deep Deposit Feeders'},
 'PCT_PI_ABUND': {'Units': 'Percent', 'Type': 'Percent Pollution Indicative Species Abundance'},
 'PCT_PI_BIO': {'Units': 'Percent', 'Type': 'Percent Pollution Indicative Species Biomass'},
 'PCT_PS_ABUND': {'Units': 'Percent', 'Type': 'Percent Pollution Sensitive Species Abundance-Tidal Benthic'},
 'PCT_PS_BIO': {'Units': 'Percent', 'Type': 'Percent Pollution Sensitive Species Biomass'},
 'SW': {'Units': 'None', 'Type': 'Shannon Wiener Index'},
 'TOTAL_SCORE': {'Units': 'None', 'Type': 'Total Benthic Restoration Goal Score For Sample'},
 'TOT_ABUND': {'Units': 'Count', 'Type': 'Total Number Of Individuals'},
 'TOT_BIOMASS_G': {'Units': 'Will vary', 'Type': 'Total Species Biomass In'},
 'SampleDepth':  {'Units': 'M', 'Type': 'Sample Collection Depth'}}

In [None]:
# # only run once!
# new_param_dict = {**benthic_param_dict, **additional_param_dict}

# with open('../../data/plank_CBPparam_dict.json', 'w') as f:
#     json.dump(new_param_dict, f)

# # Reimport from JSON file
# with open('../../data/plank_CBPparam_dict.json', 'r') as f:
#     benthic_param_dict = json.load(f)
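
A possible alternative to the "only run once" convention is to make the update idempotent, so that rerunning the cell is harmless. The sketch below writes to a temporary file rather than the project's JSON. `update_param_file` is a hypothetical helper, and the `DO` entry is an invented placeholder. The helper only adds keys that are missing and rewrites the file only when something changed.

```python
import json
import os
import tempfile


def update_param_file(path, additions):
    """Add missing keys from `additions` to the JSON dict stored at `path`.

    Existing entries are left untouched. Returns the list of keys added.
    """
    with open(path, 'r') as f:
        current = json.load(f)

    added = [key for key in additions if key not in current]
    if added:
        for key in added:
            current[key] = additions[key]
        with open(path, 'w') as f:
            json.dump(current, f, indent=2)
    return added


# Demonstration on a throwaway file (not the project's dictionary).
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'param_dict_demo.json')
with open(demo_path, 'w') as f:
    json.dump({'DO': {'Units': 'placeholder', 'Type': 'placeholder'}}, f)

extra = {'SW': {'Units': 'None', 'Type': 'Shannon Wiener Index'}}
first_run = update_param_file(demo_path, extra)   # adds 'SW'
second_run = update_param_file(demo_path, extra)  # nothing left to add
print(first_run, second_run)
```

Because existing keys are never overwritten, this variant would not pick up corrections to entries already in the file. Those would still need a deliberate manual edit.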

## XGBoost

Since this dataset does not include any chlorophyll data, we will pick another parameter to model: `PCT_PI_BIO`, the percent of pollution-indicative species biomass.

In [18]:
from typing import Tuple

import numpy as np
import pandas as pd

import xgboost as xgb

Create a list of categorical data columns for the model.

In [19]:
benthic_data_less_wide.columns

Index(['CBSeg2003', 'CBSeg2003Description', 'Station', 'Latitude', 'Longitude',
       'SampleDate', 'SampleTime', 'TotalDepth', 'Source', 'ProjectIdentifier',
       'SampleVolume', 'PDepth', 'Salzone', 'MOIST', 'SAND', 'SILTCLAY', 'TC',
       'TIC', 'TN', 'TOC', 'PCT_CARN_OMN', 'PCT_DEPO', 'PCT_PI_ABUND',
       'PCT_PI_BIO', 'PCT_PS_ABUND', 'PCT_PS_BIO', 'SW', 'TOTAL_SCORE',
       'TOT_ABUND', 'TOT_BIOMASS_G', 'SampleDepth', 'DO', 'DO_SAT_P', 'PH',
       'SALINITY', 'SPCOND', 'WTEMP'],
      dtype='object')
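
One way to get a first draft of that list is to ask pandas which columns are non-numeric. This is only a sketch on invented data (the station and salinity-zone values are made up). Non-numeric columns also include dates and times stored as text (here `SampleDate`), so the result still needs manual pruning.

```python
import pandas as pd

toy = pd.DataFrame({
    'Station': ['A1', 'B2', 'A1'],
    'Salzone': ['X', 'Y', 'X'],
    'SampleDate': ['2004-07-29', '2005-08-01', '2006-08-15'],
    'WTEMP': [24.1, 25.3, 23.8],
})

# Non-numeric columns are candidates for the categorical list.
candidates = [
    col for col in toy.columns
    if not pd.api.types.is_numeric_dtype(toy[col])
]

# Dates stored as text are not categories; drop them by hand.
cat_cols_demo = [col for col in candidates if col != 'SampleDate']

# Cast all chosen columns in one call instead of a loop.
toy = toy.astype({col: 'category' for col in cat_cols_demo})
print(toy.dtypes)
```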

In [20]:
# List of categorical columns
cat_cols = ['CBSeg2003', 'CBSeg2003Description', 'Station', 'Source', 'ProjectIdentifier', 'Salzone']

# Convert categorical columns to category type
for col in cat_cols:
    benthic_data_less_wide[col] = benthic_data_less_wide[col].astype('category')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  benthic_data_less_wide[col] = benthic_data_less_wide[col].astype('category')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  benthic_data_less_wide[col] = benthic_data_less_wide[col].astype('category')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  benthic_data_less_wide[col] = benthic_data_less_wid