organized by Booz Allen Hamilton
Clay Carson
Bibor Szabo
Mike Blow
Harmful algal blooms are on the radar of state agencies and local communities alike. Rapid algae growth signals an alarming level of water pollution: blooms produce toxins harmful to humans and aquatic animals, form thick mats that prevent sunlight from reaching the lower layers, and deplete the oxygen that aquatic organisms need to survive. But the underlying process, called eutrophication, starts well before algal blooms become visible on the water surface.
In modern societies, eutrophication is accelerated by land-use practices that send excessive amounts of nutrients into a water body, causing a growth spurt first in the plant population (such as algae) and then in the animal population. Phosphorus, as a key nutrient, plays an important role both in producing and in controlling algal blooms. Phosphates are essential to cell reproduction, which means that the plant population can grow only to the extent supported by the amount of phosphates in the water, regardless of the availability of other nutrients. A high level of phosphorus therefore stimulates rapid algae growth, while controlling the level of phosphorus in the water helps maintain a healthy aquatic ecosystem.
The first step towards controlling the total phosphorus amount in the water body is to monitor when levels are reaching a critical point. Our model classifies total phosphorus into three distinct categories: 1) healthy amount, 2) increased amount that stimulates plant growth, and 3) problematic amount that signals unhealthy algal blooms in the Potomac River.
The original dataset contained water quality data collected in the entire Chesapeake Bay and Watershed by both the CBP and the CMC.
Original Dataset
Clean Dataset - Potomac River
Clean Dataset - Chesapeake Watershed
- First, columns with more than 10% NaN values were either dropped or had their NaN values imputed
- Second, rows with any remaining NaN values were dropped (at this point NaN values made up less than 10% of any one column)
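The two cleaning steps above can be sketched with pandas. This is a minimal illustration rather than the project's actual cleaning code: the function name, the toy data, and the column names are hypothetical, and it covers only the "drop" path for sparse columns (imputation is handled separately).

```python
import numpy as np
import pandas as pd


def drop_sparse_columns_then_rows(df: pd.DataFrame, max_nan_frac: float = 0.10) -> pd.DataFrame:
    """Drop columns with more than max_nan_frac NaN values, then drop rows with any NaN left."""
    nan_frac = df.isna().mean()  # fraction of NaN values per column
    keep = nan_frac[nan_frac <= max_nan_frac].index
    return df[keep].dropna(axis=0)


# Hypothetical toy data; the real column names may differ.
toy = pd.DataFrame({
    "TotalPhosphorus": [0.02] * 19 + [np.nan],   # 5% NaN: column kept, one row dropped
    "Salinity": [np.nan] * 5 + [1.0] * 15,       # 25% NaN: column dropped
    "WaterTemp": [20.0] * 20,                    # no NaN
})

clean = drop_sparse_columns_then_rows(toy)
print(clean.shape)
```

Dropping sparse columns first keeps the later row-wise drop from discarding most of the dataset because of a single poorly populated column.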
Imputations:
Tidestage: Missing values were filled at random with either 'Ebb Tide' or 'Flood Tide'
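A random fill of this kind might look like the sketch below. It assumes both labels are drawn with equal probability and uses a fixed seed for reproducibility; neither detail is stated above, and the function and series names are hypothetical.

```python
import numpy as np
import pandas as pd


def fill_tide_stage_randomly(series: pd.Series, seed: int = 42) -> pd.Series:
    """Replace missing tide stages with a random choice of 'Ebb Tide' or 'Flood Tide'."""
    rng = np.random.default_rng(seed)  # assumption: fixed seed, equal probabilities
    filled = series.copy()
    missing = filled.isna()
    filled.loc[missing] = rng.choice(["Ebb Tide", "Flood Tide"], size=int(missing.sum()))
    return filled


# Hypothetical example column with three missing entries.
tide = pd.Series(["Ebb Tide", None, "Flood Tide", None, None], name="TideStage")
filled = fill_tide_stage_randomly(tide)
print(filled.tolist())
```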
Total Phosphorus
Categories:
- healthy amount (0.01 - 0.03 mg/L)
- increased amount (0.025 - 0.1 mg/L)
- problematic amount (> 0.1 mg/L)
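To make the target concrete, here is a minimal sketch of how a measurement could be mapped to one of the three categories. The listed ranges overlap between 0.025 and 0.03 mg/L, so the cut points used here (0.03 and 0.1 mg/L) are an assumption, as is treating readings below 0.01 mg/L as healthy; the function name is hypothetical.

```python
def phosphorus_category(
    tp_mg_per_l: float,
    healthy_max: float = 0.03,    # assumed upper bound of the healthy range
    increased_max: float = 0.1,   # assumed upper bound of the increased range
) -> str:
    """Map a total phosphorus reading (mg/L) to one of three category labels."""
    if tp_mg_per_l <= healthy_max:
        return "healthy"
    if tp_mg_per_l <= increased_max:
        return "increased"
    return "problematic"


for reading in (0.02, 0.05, 0.2):
    print(reading, phosphorus_category(reading))
```

With pandas, a function like this can be applied to a whole column through `Series.apply` to build the classification target.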
- All parameter measures
- Engineered features listed below
Transformed Variables:
- Date_Time -> Year and Months
Dummified Variables:
- Tide Stage
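The transformation and dummification listed above could be sketched as follows. The toy rows, the `TideStage` column name, and the `engineer_features` helper are hypothetical; only the idea of splitting Date_Time into year and month and one-hot encoding the tide stage comes from the lists above.

```python
import pandas as pd

# Hypothetical sample rows; the real dataset has many more columns.
raw = pd.DataFrame({
    "Date_Time": ["2018-06-15 09:30:00", "2019-01-03 14:00:00", "2019-07-21 11:15:00"],
    "TideStage": ["Ebb Tide", "Flood Tide", "Ebb Tide"],
})


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Split Date_Time into Year and Month, then one-hot encode the tide stage."""
    out = df.copy()
    stamps = pd.to_datetime(out.pop("Date_Time"))  # remove the raw timestamp column
    out["Year"] = stamps.dt.year
    out["Month"] = stamps.dt.month
    return pd.get_dummies(out, columns=["TideStage"], dtype=int)


features = engineer_features(raw)
print(features)
```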
Random Forest Classifier
Hyperparameters:
- n_estimators: 750
- min_samples_split: 5
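With scikit-learn, a classifier using these two hyperparameters can be set up as in the sketch below. The synthetic three-class data from `make_classification` stands in for the water quality features, and `random_state` and `n_jobs` are added assumptions for reproducibility and speed; all other settings are left at the library defaults.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the water quality features and the three phosphorus classes.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)

model = RandomForestClassifier(
    n_estimators=750,      # number of trees, as listed above
    min_samples_split=5,   # minimum samples needed to split an internal node
    random_state=0,        # assumption: fixed seed for reproducibility
    n_jobs=-1,             # assumption: use all available cores
)
model.fit(X, y)
print(model.predict(X[:5]))
```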
Evaluation of the Model
The model predicts whether the measured amount of total phosphorus is at a healthy level, stimulates plant growth, or is dangerously high, with a cross-validated accuracy of 97%. The testing accuracy is also 97%, which suggests that the model generalizes well to unseen data. (Training accuracy score: 100%.)
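The three scores reported above (cross-validated, testing, and training accuracy) can be computed along the lines of this sketch. It runs on synthetic data, so its numbers will not match the reported ones; the split ratio, the five folds, and the fixed seeds are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data with three classes.
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=4, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(
    n_estimators=750, min_samples_split=5, random_state=0, n_jobs=-1
)

# Cross-validated accuracy on the training split (5 folds assumed).
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
cv_acc = cv_scores.mean()

# Fit once on the full training split, then score on both splits.
model.fit(X_train, y_train)
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)

print(f"cross-validated: {cv_acc:.3f}  train: {train_acc:.3f}  test: {test_acc:.3f}")
```

Comparing the training score with the cross-validated and testing scores is what shows whether the model is overfitting: a small gap between them indicates that it generalizes.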
- Incorporate data sets with more features (such as benthic data and weather data).
- Try a neural net classifier.
Data cleaning
Data transformation
Random Forest - Potomac River
Random Forest - entire Chesapeake Watershed
Presentation
Presentation_Video
Phosphates in the Environment - Water Research Center
Indicator: Phosphorus - U.S. Environmental Protection Agency
Nutrient Pollution - U.S. Environmental Protection Agency
Nutrients: Phosphorus - Minnesota Pollution Control Agency
Harmful Algal Bloom - CDC
Satellite Imagery Can Track Harmful Algal Blooms - USGS