Modeling Toxic Phosphorus Levels in the Chesapeake Watershed


Chesapeake Bay Water Quality Hackathon

organized by Booz Allen Hamilton


Clay Carson
Bibor Szabo
Mike Blow

Problem Statement

Harmful algal blooms are on the radar of state agencies and local communities alike. The rapid growth of algae signifies an alarming level of water pollution: blooms produce toxins harmful to humans and aquatic animals, form a thick mat that prevents sunlight from reaching the lower layers, and deplete the oxygen that aquatic organisms need to survive. But the underlying process, called eutrophication, starts long before algal blooms become visible on the water surface.

In modern societies, eutrophication is accelerated by land-use practices that send excessive amounts of nutrients into a water body, causing a growth spurt first in the plant population (such as algae) and then in the animal population. Phosphorus, a key nutrient, plays an important role in both producing and controlling algal blooms. Phosphates are essential to cell reproduction, which means that the plant population can grow only to the extent supported by the amount of phosphates in the water, regardless of the availability of other nutrients. Therefore, while a high level of phosphorus stimulates rapid algae growth, controlling the level of phosphorus in the water helps maintain a healthy aquatic ecosystem.

The first step towards controlling the total amount of phosphorus in a water body is to monitor when levels are approaching a critical point. Our model classifies total phosphorus in the Potomac River into three distinct categories: 1) a healthy amount, 2) an increased amount that stimulates plant growth, and 3) a problematic amount that signals unhealthy algal blooms.


The original dataset contained water quality data collected across the entire Chesapeake Bay and its watershed by both the Chesapeake Bay Program (CBP) and the Chesapeake Monitoring Cooperative (CMC).

Original Dataset
Clean Dataset - Potomac River
Clean Dataset - Chesapeake Watershed

Missing Values

  • First, columns with more than 10% NaN values were either dropped or had their NaNs imputed.
  • Second, rows with any remaining NaN values were dropped (by this point, NaNs made up less than 10% of every column).

Tidestage: missing values were filled at random with either 'Ebb Tide' or 'Flood Tide'.
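The cleaning steps above could be sketched roughly as follows with pandas. This is an illustrative sketch, not the notebook's actual code: the function name `clean_missing`, the column name `TideStage`, and the fixed random seed are assumptions. The source says sparse columns were either dropped or imputed without saying which columns got which treatment, so this sketch imputes only the tide stage and drops every other column above the 10% threshold.

```python
import numpy as np
import pandas as pd


def clean_missing(df, tide_col="TideStage", max_nan_frac=0.10, seed=0):
    """Hypothetical helper mirroring the missing-value steps described above."""
    out = df.copy()
    rng = np.random.default_rng(seed)

    # Impute tide stage: fill each missing value at random with one of two labels.
    if tide_col in out.columns:
        missing = out[tide_col].isna()
        out.loc[missing, tide_col] = rng.choice(
            ["Ebb Tide", "Flood Tide"], size=int(missing.sum())
        )

    # Step 1: drop columns whose NaN fraction still exceeds the threshold.
    nan_frac = out.isna().mean()
    out = out.drop(columns=nan_frac[nan_frac > max_nan_frac].index)

    # Step 2: drop rows that contain any remaining NaN.
    return out.dropna().reset_index(drop=True)


# Tiny made-up example (not real monitoring data).
example = pd.DataFrame(
    {
        "TideStage": ["Ebb Tide", None] * 10,
        "Sparse": [1.0] + [np.nan] * 19,   # 95% missing, so it is dropped
        "Temp": [np.nan] + [20.0] * 19,    # 5% missing, so only the row is dropped
    }
)
cleaned = clean_missing(example)
```

Filling the tide stage before the column filter keeps that column from being discarded when more than 10% of its values are missing.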

Target Variable

Total Phosphorus

  • healthy amount (0.01 - 0.03 mg/L)
  • increased amount (0.025 - 0.1 mg/L)
  • problematic amount (> 0.1 mg/L)
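As a rough illustration, the three categories can be derived from a raw total phosphorus reading with a small function. The function name and the label strings are hypothetical. Note that the listed ranges overlap between 0.025 and 0.03 mg/L and do not cover readings below 0.01 mg/L; this sketch assumes a cut at 0.03 mg/L and treats everything at or below it as healthy, which may differ from the thresholds used in the notebooks.

```python
def phosphorus_category(tp_mg_per_l: float) -> str:
    """Map a total phosphorus reading (mg/L) to one of three classes.

    Assumed cut points: 0.03 mg/L and 0.1 mg/L (see the note above).
    """
    if tp_mg_per_l < 0:
        raise ValueError("total phosphorus cannot be negative")
    if tp_mg_per_l > 0.1:
        return "problematic"
    if tp_mg_per_l > 0.03:
        return "increased"
    return "healthy"


# Made-up readings, for illustration only.
labels = [phosphorus_category(v) for v in (0.02, 0.05, 0.25)]
```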


  • All parameter measures
  • Engineered features listed below

Feature Engineering

Transformed Variables:

  • Date_Time -> Year and Month

Dummified Variables:

  • Tide Stage
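A minimal pandas sketch of these two transformations is shown here. The helper name `engineer_features` and the tide column name `TideStage` are assumptions; `Date_Time` is taken from the list above. The original timestamp column is dropped after the year and month are extracted, which is also an assumption about the notebook's behavior.

```python
import pandas as pd


def engineer_features(df, date_col="Date_Time", tide_col="TideStage"):
    """Hypothetical helper: split the timestamp and one-hot encode tide stage."""
    out = df.copy()

    # Transformed variables: derive Year and Month from the timestamp.
    timestamps = pd.to_datetime(out[date_col])
    out["Year"] = timestamps.dt.year
    out["Month"] = timestamps.dt.month
    out = out.drop(columns=[date_col])

    # Dummified variables: one indicator column per tide stage value.
    return pd.get_dummies(out, columns=[tide_col])


# Made-up rows, for illustration only.
sample = pd.DataFrame(
    {
        "Date_Time": ["2018-06-15 10:30:00", "2019-01-02 08:00:00"],
        "TideStage": ["Ebb Tide", "Flood Tide"],
    }
)
features = engineer_features(sample)
```

With the default prefix, `pd.get_dummies` names the new columns `TideStage_Ebb Tide` and `TideStage_Flood Tide`.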

Model Description

Random Forest Classifier


  • n_estimators: 750
  • min_samples_split: 5

Evaluation of the Model

The model predicts whether the measured amount of total phosphorus is dangerously high, stimulates plant growth, or is at a healthy level, with a cross-validated accuracy of 97%. The testing accuracy is also 97%, which suggests that the model generalizes well to unseen data. (Training accuracy: 100%.)
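For readers who want to reproduce this kind of evaluation, the sketch below trains a scikit-learn random forest with the two hyperparameters listed above and reports training, cross-validated, and test accuracy. It runs on synthetic data from `make_classification`, not on the water quality data, so its scores will not match the figures reported here. The split, the number of folds, and the random seed are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic three-class stand-in for the real feature matrix and target.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Hyperparameters from the model description; everything else is left at defaults.
model = RandomForestClassifier(
    n_estimators=750, min_samples_split=5, random_state=42
)

# Cross-validated accuracy on the training portion (5 folds is an assumption).
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")

# Fit once on the full training portion, then score both splits.
model.fit(X_train, y_train)
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)

print(f"train: {train_acc:.3f}  cv: {cv_scores.mean():.3f}  test: {test_acc:.3f}")
```

Comparing the three numbers side by side is what supports a generalization claim: a cross-validated score close to the test score indicates that the held-out result is not a fluke of one particular split.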

Future Directions

  • Incorporate data sets with more features (such as benthic data and weather data).
  • Try a neural net classifier.

Data cleaning
Data transformation
Random Forest - Potomac River
Random Forest - entire Chesapeake Watershed


Phosphates in the Environment - Water Research Center
Indicator: Phosphorus - U.S. Environmental Protection Agency
Nutrient Pollution - U.S. Environmental Protection Agency
Nutrients: Phosphorus - Minnesota Pollution Control Agency
Harmful Algal Bloom - CDC
Satellite Imagery Can Track Harmful Algal Blooms - USGS


Month-long hackathon for Booz Allen Hamilton to help the Chesapeake Bay






