Look into cases we get very wrong

We'll look at the predictions from our best model: third place Sentinel and land cover features, predicting log density, with folds and a higher number of boosting rounds.

In [None]:
%load_ext autoreload
%autoreload 2

In [16]:
from cloudpathlib import AnyPath
import pandas as pd

from cyano.data.utils import add_unique_identifier

In [65]:
# load all metadata for reference
meta = pd.read_csv(
    AnyPath(
        "s3://drivendata-competition-nasa-cyanobacteria/data/final/combined_final_release.csv"
    )
)
meta.head(3)

Unnamed: 0,uid,data_provider,region,latitude,longitude,date,density_cells_per_ml,severity,distance_to_water_m
0,aabm,Indiana State Department of Health,midwest,39.080319,-86.430867,2018-05-14,585.0,1,0.0
1,aabn,California Environmental Data Exchange Network,west,36.5597,-121.51,2016-08-31,5867500.0,4,3512.0
2,aacd,N.C. Division of Water Resources N.C. Departme...,south,35.875083,-78.878434,2020-11-19,290.0,1,514.0


In [12]:
best_exp_dir = AnyPath(
    "s3://drivendata-competition-nasa-cyanobacteria/experiments/results/third_sentinel_with_folds"
)

In [17]:
# load best predictions
preds = pd.read_csv(best_exp_dir / 'preds.csv', index_col=0)
preds.head()

Unnamed: 0_level_0,date,latitude,longitude,log_density,severity
sample_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
4a89ead93e2caa84da636236bb361e12,2016-08-31,36.5597,-121.51,15.472787,4.0
a7e2d76f204ac347ae5529557eb7f665,2014-11-01,33.0426,-117.076,,
6afcb31acb56fc25af76983df6a60d0a,2015-08-26,40.703968,-80.29305,,
494fb91d1fb8697e73b90e5d0c9420ed,2019-08-26,38.9725,-94.67293,10.436847,2.0
21e1f934b140745d1cde757090aa07fa,2018-01-08,34.279,-118.905,13.364631,3.0


In [18]:
# load actual
true = pd.read_csv(
    AnyPath(
        "s3://drivendata-competition-nasa-cyanobacteria/experiments/splits/competition/test.csv"
    )
)
true = add_unique_identifier(true)
true.head(2)

Unnamed: 0_level_0,uid,data_provider,region,latitude,longitude,date,density_cells_per_ml,severity,distance_to_water_m,log_density
sample_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
4a89ead93e2caa84da636236bb361e12,aabn,California Environmental Data Exchange Network,west,36.5597,-121.51,2016-08-31,5867500.0,4,3512.0,15.584939
a7e2d76f204ac347ae5529557eb7f665,aair,California Environmental Data Exchange Network,west,33.0426,-117.076,2014-11-01,2769000.0,4,195.0,14.833997


In [24]:
true['pred_severity'] = preds.loc[true.index].severity
true['pred_log_density'] = preds.loc[true.index].log_density

In [25]:
# check samples with actual severity 1 but predicted severity 4
check = true[(true.severity == 1) & (true.pred_severity == 4)]
check.shape

(36, 12)
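
To put these samples in context, a cross-tabulation of actual against predicted severity shows where every large error falls, not just the 1-versus-4 case. The following is a minimal sketch on a small made-up frame: the column names match `true` above, but `toy` and its values are invented for illustration only.

```python
import pandas as pd

# Toy stand-in for `true`; the values are invented for illustration.
toy = pd.DataFrame(
    {
        "severity": [1, 1, 1, 2, 4, 4],
        "pred_severity": [1, 4, 4, 2, 4, 3],
    }
)

# Rows are actual severity, columns are predicted severity.
confusion = pd.crosstab(toy["severity"], toy["pred_severity"])

# Number of samples with actual severity 1 that were predicted as severity 4.
n_1_as_4 = confusion.loc[1, 4]

print(confusion)
print(n_1_as_4)
```

On the real data, samples without a prediction (NaN `pred_severity`) would be left out of the table, so the total may be smaller than the number of test samples.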

## Check metadata

What region are these from? What providers? 

In [26]:
# almost all are in the south
check.region.value_counts()

region
south      29
midwest     5
west        2
Name: count, dtype: int64

In [27]:
# almost all are north carolina
# could these be routine sites with inaccurate gps data?
check.data_provider.value_counts()

data_provider
N.C. Division of Water Resources N.C. Department of Environmental Quality    29
US Army Corps of Engineers                                                    3
California Environmental Data Exchange Network                                2
EPA National Aquatic Research Survey                                          1
Bureau of Water Kansas Department of Health and Environment                   1
Name: count, dtype: int64

In [35]:
# Check some lat / longs
check[check.region == 'south'][['latitude', 'longitude']].drop_duplicates().head()

Unnamed: 0_level_0,latitude,longitude
sample_id,Unnamed: 1_level_1,Unnamed: 2_level_1
fbf3fdc2d37e96b93ecd668a793de81f,35.036,-79.117499
8aef8a6cce8d04d7123a94e6cf7f11e3,35.036,-79.118092
6529046f1dc37e148efadd3dc747b8ea,35.48,-76.82
354232a64d8764dbfcb7b67af11e8fdf,35.036,-79.118191
078bffe42a677d9e6a4a9a3c6000920b,35.55,-78.02


In [37]:
# pull in original NC data
nc_raw = pd.read_excel(
    AnyPath(
        "s3://drivendata-competition-nasa-cyanobacteria/data/raw/nc/New Use This NCDWR phyto 2013-2021 All Data.xlsx"
    ),
    sheet_name="Cyanobacteria Density"
)
nc_raw.shape 

(14339, 13)

In [56]:
nc_raw = nc_raw.rename(columns={'Lat':'latitude', 'Long':'longitude'})
nc_raw['date'] = pd.to_datetime(nc_raw.Date)
nc_raw.head(2)

Unnamed: 0,Waterbody,StationDesc,SiteCode,Date,Smplid,AlgalGroup,Genus,Species,Cell Density,Unit Density,Biovolume,latitude,longitude,date
0,Laurel Lea Lake,,ES9954,2013-10-24,9954-2013,Cyanobacteria,Microcystis,aeruginosa,27131.0,209.0,814.0,34.28071,-77.76675,2013-10-24
1,Laurel Lea Lake,,ES9954,2013-10-24,9954-2013,Cyanobacteria,Aphanocapsa,incerta,129396.0,626.0,65.0,34.28071,-77.76675,2013-10-24


In [58]:
raw_subset = check[['latitude', 'longitude']].merge(nc_raw, how='inner', on=['latitude', 'longitude'])
raw_subset.shape

(1629, 14)
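
One caveat when reading this row count: `check` may contain several samples at the same coordinates, and a direct merge repeats each matching raw row once per such sample. A minimal sketch with made-up values (`toy_check` and `toy_raw` are hypothetical stand-ins, not the real frames) shows how de-duplicating the coordinates first changes the count:

```python
import pandas as pd

# Toy stand-ins for `check` and `nc_raw`; the values are invented.
toy_check = pd.DataFrame(
    {"latitude": [35.0, 35.0, 36.0], "longitude": [-79.0, -79.0, -77.0]}
)
toy_raw = pd.DataFrame(
    {
        "latitude": [35.0, 35.0, 36.0, 34.0],
        "longitude": [-79.0, -79.0, -77.0, -78.0],
        "Waterbody": ["A", "A", "B", "C"],
    }
)

# Direct merge: the two toy_check rows at (35.0, -79.0) each match both raw rows.
direct = toy_check.merge(toy_raw, how="inner", on=["latitude", "longitude"])

# De-duplicated merge: each raw row appears at most once.
deduped = toy_check.drop_duplicates().merge(
    toy_raw, how="inner", on=["latitude", "longitude"]
)

print(len(direct), len(deduped))
```

Both versions rely on exact equality of floating-point coordinates, as the merge above does.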

In [59]:
raw_subset.Waterbody.value_counts(dropna=False)

Waterbody
Cape Fear River    984
Pamlico River      630
Lake Stewart        13
Lake Monroe          2
Name: count, dtype: int64

In [61]:
raw_subset.groupby(['Waterbody', raw_subset.date.dt.year]).size().sort_index()

Waterbody        date
Cape Fear River  2019    400
                 2020    584
Lake Monroe      2021      2
Lake Stewart     2021     13
Pamlico River    2017     78
                 2018     84
                 2019    192
                 2020     18
                 2021    258
dtype: int64

In [54]:
raw_subset.StationDesc.value_counts(dropna=False)

StationDesc
Cape Fear Riv at 42 nr Corinth              984
Bath Crk at NC 92 nr Bath                   630
Lake Stewart at Dam nr Fowler Crossroads     13
Lake Monroe Nr Monroe                         2
Name: count, dtype: int64

In an email from Elizabeth Fensin in NC:
> I'm also surprised I sent data from the Cape Fear River 2020-2021 study since it was unusual. We usually assign one taxonomist to particular waterbodies to ensure uniform results. In the Cape Fear study, one taxonomist did the 2020 and another did the 2021 samples. This created confusing results and some samples were recounted.

She also noted that routinely monitored sites were more likely to have inaccurate GPS data. Pamlico is one of their ambient sites.

One option is to drop the Cape Fear River data altogether.

In [64]:
# how much of our final data is from cape fear river?
(cape_fear_coords := nc_raw[nc_raw.Waterbody == 'Cape Fear River'][['latitude', 'longitude']].drop_duplicates())

Unnamed: 0,latitude,longitude
1447,35.55,-78.02
1570,34.85,-78.83
1598,34.41,-78.3
1627,34.63,-78.58


In [91]:
cape_fear_final = cape_fear_coords.merge(meta, on=['latitude', 'longitude'], how='inner')
cape_fear_final.shape

(82, 9)

In [92]:
cape_fear_final.groupby(['latitude', 'longitude', 'data_provider']).size()

latitude  longitude  data_provider                                                            
34.41     -78.30     N.C. Division of Water Resources N.C. Department of Environmental Quality    16
34.63     -78.58     N.C. Division of Water Resources N.C. Department of Environmental Quality    21
34.85     -78.83     N.C. Division of Water Resources N.C. Department of Environmental Quality    17
35.55     -78.02     N.C. Division of Water Resources N.C. Department of Environmental Quality    28
dtype: int64

In [89]:
pd.concat(
    [cape_fear_final.distance_to_water_m.describe().rename('cape_fear'),
     meta.distance_to_water_m.describe().rename('all_data')
    ], axis=1
)

Unnamed: 0,cape_fear,all_data
count,82.0,23569.0
mean,415.341463,436.444525
std,318.091035,600.447428
min,46.0,0.0
25%,199.0,0.0
50%,339.0,246.0
75%,835.0,664.0
max,835.0,6468.0


**Takeaway**

I recommend removing the Cape Fear data from NC from our training set. It was flagged by our NC contact as potentially confusing, and the odd behavior of the model on these cases could be related to inaccurate lat / longs or inaccurate ground truth measurements. Cape Fear data points also have a higher median distance to water than the full dataset (339 m vs. 246 m), although the means are similar (415 m vs. 436 m).
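
One way to carry out this removal is an anti-join on coordinates: a left merge with `indicator=True`, keeping only the rows that found no match. This is a sketch under assumptions, not the project's actual filtering code; `toy_meta` and `toy_coords` are hypothetical stand-ins for `meta` and `cape_fear_coords`, with illustrative values.

```python
import pandas as pd

# Toy stand-ins for `meta` and `cape_fear_coords`; the values are illustrative.
toy_meta = pd.DataFrame(
    {
        "uid": ["a", "b", "c"],
        "latitude": [35.55, 34.85, 39.08],
        "longitude": [-78.02, -78.83, -86.43],
    }
)
toy_coords = pd.DataFrame(
    {"latitude": [35.55, 34.85], "longitude": [-78.02, -78.83]}
)

# Left merge with an indicator column, then keep rows with no match in toy_coords.
merged = toy_meta.merge(
    toy_coords, on=["latitude", "longitude"], how="left", indicator=True
)
filtered = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

print(filtered.uid.tolist())
```

Because the match is on coordinates alone, this would also drop any other samples that happen to share those exact lat / longs.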

In [90]:
# would we also want to drop pamlico? how much data is from pamlico?
# pamlico has a lot more different lat / longs
pamlico_coords = nc_raw[nc_raw.Waterbody == 'Pamlico River'][['latitude', 'longitude']].drop_duplicates()

# how much of our data is pamlico?
pamlico_final = pamlico_coords.merge(meta, on=['latitude', 'longitude'], how='inner')
pamlico_final.shape

(683, 9)

In [80]:
pamlico_final.data_provider.value_counts()

data_provider
N.C. Division of Water Resources N.C. Department of Environmental Quality    683
Name: count, dtype: int64

Pamlico River accounts for 630 of the 1,629 raw NC rows that match the coordinates of samples with actual severity 1 but predicted severity 4. It's very possible that, as with Cape Fear, this is just because the data is noisy.

In [94]:
pd.concat(
    [cape_fear_final.distance_to_water_m.describe().rename('cape_fear'),
        pamlico_final.distance_to_water_m.describe().rename('pamlico'),
     meta.distance_to_water_m.describe().rename('all_data')
    ], axis=1
)

Unnamed: 0,cape_fear,pamlico,all_data
count,82.0,683.0,23569.0
mean,415.341463,735.096633,436.444525
std,318.091035,342.847096,600.447428
min,46.0,0.0,0.0
25%,199.0,564.0,0.0
50%,339.0,759.0,246.0
75%,835.0,964.0,664.0
max,835.0,1364.0,6468.0


## Example images