Use grid search to check whether there are better LightGBM parameters than the ones we are currently using.

We'll use the features that have already been generated for our current best experiment (third sentinel + land cover features, trained with folds), and compare the results to that best experiment: `s3://drivendata-competition-nasa-cyanobacteria/experiments/results/filter_water_distance_550`

In [1]:
%load_ext lab_black
%load_ext autoreload
%autoreload 2

In [2]:
import yaml

from cloudpathlib import AnyPath
import lightgbm as lgb
from loguru import logger
import pandas as pd
from sklearn.model_selection import GridSearchCV

from cyano.data.utils import add_unique_identifier
from cyano.experiment.experiment import ExperimentConfig
from cyano.pipeline import CyanoModelPipeline
from cyano.settings import RANDOM_STATE

### Load data

In [3]:
tmp_dir = AnyPath("tmp_dir")
tmp_dir.mkdir(exist_ok=True)

In [4]:
experiment_dir = AnyPath(
    "s3://drivendata-competition-nasa-cyanobacteria/experiments/results/filter_water_distance_550"
)

In [5]:
train_features = pd.read_csv(experiment_dir / "features_train.csv", index_col=0)
train_features.head()

sample_id,AOT_mean,AOT_min,AOT_max,AOT_range,B01_mean,B01_min,B01_max,B01_range,B02_mean,B02_min,...,WVP_min,WVP_max,WVP_range,NDVI_B04,NDVI_B05,NDVI_B06,NDVI_B07,month,days_before_sample,land_cover
0001cfa683171fe80161cdbe6a090c94,102.0,102.0,102.0,0.0,319.84375,144.0,1083.0,939.0,441.956349,146.0,...,277.0,1413.0,1136.0,0.581754,0.418335,0.114064,0.033551,2.0,15.0,11
0001cfa683171fe80161cdbe6a090c94,132.0,132.0,132.0,0.0,435.984375,229.0,1055.0,826.0,503.288549,154.0,...,167.0,828.0,661.0,0.574375,0.418914,0.120972,0.042572,3.0,8.0,11
0001cfa683171fe80161cdbe6a090c94,204.0,204.0,204.0,0.0,902.640625,703.0,1661.0,958.0,938.553288,672.0,...,931.0,931.0,0.0,0.455419,0.329302,0.09791,0.027511,3.0,5.0,11
0001cfa683171fe80161cdbe6a090c94,102.0,102.0,102.0,0.0,518.484375,316.0,1554.0,1238.0,618.096939,277.0,...,574.0,1436.0,862.0,0.548216,0.39169,0.115218,0.032818,2.0,13.0,11
0001cfa683171fe80161cdbe6a090c94,102.0,102.0,102.0,0.0,318.21875,104.0,1314.0,1210.0,485.587302,162.0,...,340.0,1114.0,774.0,0.582335,0.425137,0.129981,0.039196,2.0,10.0,11


In [6]:
# Load train labels
train = pd.read_csv(
    AnyPath(
        "s3://drivendata-competition-nasa-cyanobacteria/experiments/splits/competition/train.csv"
    )
)
train = add_unique_identifier(train)
train.shape

(17060, 10)

In [7]:
train.head(2)

sample_id,uid,data_provider,region,latitude,longitude,date,density_cells_per_ml,severity,distance_to_water_m,log_density
d7ebbce63c7d1498cc627a1e77f6061c,aabm,Indiana State Department of Health,midwest,39.080319,-86.430867,2018-05-14,585.0,1,0.0,6.37332
0856c3740614b5ee606f82d6c3a215a0,aacd,N.C. Division of Water Resources N.C. Departme...,south,35.875083,-78.878434,2020-11-19,290.0,1,514.0,5.673323


### Grid search

Try params from each of the winners

In [8]:
param_grid = {
    "max_depth": [-1, 8],
    "num_leaves": [31],
    "learning_rate": [0.005, 0.1],
    "bagging_fraction": [0.3, 1.0],
    "feature_fraction": [0.3, 1.0],
    "min_split_gain": [0.0, 0.1],
    "n_estimators": [100, 1000, 470],  # same as num_boost_round
}

Note that this differs slightly from our real training process, which uses `lgb.Booster`. A Booster cannot be passed to `GridSearchCV`, so we use the scikit-learn wrapper `lgb.LGBMModel` instead. As a result, the grid search does not use a validation set or early stopping.

In [9]:
lgb_model = lgb.LGBMModel(objective="regression", metric="rmse")

In [10]:
grid_search = GridSearchCV(
    estimator=lgb_model,
    param_grid=param_grid,
    cv=5,
    n_jobs=-1,
    scoring="neg_root_mean_squared_error",
)

`GridSearchCV` always maximizes the score, so we use the negated root mean squared error: a score closer to zero means a lower RMSE.
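A small plain-Python illustration of why negating RMSE works with a maximizing search. The two candidate "models" and their predictions are made up for the example.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


y_true = [1.0, 2.0, 3.0]

# Hypothetical predictions from two candidate models
candidates = {
    "model_a": [1.0, 2.0, 4.0],  # close to the truth
    "model_b": [0.0, 0.0, 0.0],  # far from the truth
}

# Negate RMSE so that "higher is better", as a maximizing search expects
neg_scores = {name: -rmse(y_true, preds) for name, preds in candidates.items()}

# Picking the maximum negated score selects the model with the lowest RMSE
best = max(neg_scores, key=neg_scores.get)
print(best, neg_scores)
```

This is also why every `mean_test_score` in the results is negative and why sorting with `ascending=False` puts the best parameter set first.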

In [11]:
# Load past grid search results if we can
grid_search_results_path = AnyPath(
    "s3://drivendata-competition-nasa-cyanobacteria/experiments/grid_search.csv"
)
if grid_search_results_path.exists():
    logger.info("Loading existing grid search results")
    results = pd.read_csv(grid_search_results_path)
    results = results.sort_values(by="mean_test_score", ascending=False)

# Otherwise run grid search -- takes ~30 min
else:
    logger.info("Running grid search")
    grid_search.fit(train_features, train.loc[train_features.index].log_density)
    results = pd.DataFrame(grid_search.cv_results_).sort_values(
        by="mean_test_score", ascending=False
    )
    with grid_search_results_path.open("w") as fp:
        results.to_csv(fp, index=False)
    logger.success(f"Grid search results saved to {grid_search_results_path}")


results.shape

2023-08-30 14:50:52.403 | INFO     | __main__:<module>:6 - Loading existing grid search results


(96, 20)
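The 96 rows correspond to one row per parameter combination. As a quick sanity check, the grid size can be recomputed from `param_grid` with the standard library alone (the dictionary is copied from the cell above):

```python
from itertools import product

param_grid = {
    "max_depth": [-1, 8],
    "num_leaves": [31],
    "learning_rate": [0.005, 0.1],
    "bagging_fraction": [0.3, 1.0],
    "feature_fraction": [0.3, 1.0],
    "min_split_gain": [0.0, 0.1],
    "n_estimators": [100, 1000, 470],
}

# Every combination of one value per parameter
combinations = list(product(*param_grid.values()))
n_combinations = len(combinations)

# With cv=5, each combination is fit once per fold
n_cv_fits = n_combinations * 5

print(n_combinations, n_cv_fits)
```

That is 2 x 1 x 2 x 2 x 2 x 2 x 3 = 96 combinations, or 480 cross-validation fits with `cv=5`.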

In [12]:
results.head()

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_bagging_fraction,param_feature_fraction,param_learning_rate,param_max_depth,param_min_split_gain,param_n_estimators,param_num_leaves,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
1,38.002673,4.174478,0.297605,0.032193,1.0,0.3,0.1,-1,0.0,1000,31,"{'bagging_fraction': 1.0, 'feature_fraction': ...",-2.358523,-2.325254,-2.220499,-2.205061,-2.166473,-2.255162,0.073721,1
0,27.807155,0.867462,0.230027,0.031164,0.3,0.3,0.1,-1,0.0,1000,31,"{'bagging_fraction': 0.3, 'feature_fraction': ...",-2.358523,-2.325254,-2.220499,-2.205061,-2.166473,-2.255162,0.073721,1
3,45.640214,0.748273,0.414326,0.142349,1.0,0.3,0.1,-1,0.1,1000,31,"{'bagging_fraction': 1.0, 'feature_fraction': ...",-2.354467,-2.327364,-2.228813,-2.201996,-2.170161,-2.25656,0.071848,3
2,39.600896,1.034801,0.40317,0.092548,0.3,0.3,0.1,-1,0.1,1000,31,"{'bagging_fraction': 0.3, 'feature_fraction': ...",-2.354467,-2.327364,-2.228813,-2.201996,-2.170161,-2.25656,0.071848,3
5,17.172113,2.006705,0.095154,0.005267,1.0,0.3,0.1,-1,0.0,470,31,"{'bagging_fraction': 1.0, 'feature_fraction': ...",-2.359355,-2.323075,-2.220049,-2.221747,-2.161495,-2.257144,0.072899,5


In [13]:
# do we have multiple tied for first?
# yes, two are tied
results.rank_test_score.value_counts().sort_index().head()

rank_test_score
1    2
3    2
5    2
7    2
9    2
Name: count, dtype: int64
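The ranks skip from 1 to 3 to 5 because tied scores share the lowest rank in their group and the next rank is skipped. A plain-Python sketch of that ranking rule, applied to the top five `mean_test_score` values shown above (this illustrates the pattern only, it is not scikit-learn's implementation):

```python
# Top five mean_test_score values from results.head()
scores = [-2.255162, -2.255162, -2.25656, -2.25656, -2.257144]


def min_rank(values):
    """Rank 1 is the highest score; ties share the lowest rank of their group."""
    return [1 + sum(other > value for other in values) for value in values]


ranks = min_rank(scores)
print(ranks)  # [1, 1, 3, 3, 5]
```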

In [14]:
results[results.rank_test_score == 1].filter(regex="param_")

Unnamed: 0,param_bagging_fraction,param_feature_fraction,param_learning_rate,param_max_depth,param_min_split_gain,param_n_estimators,param_num_leaves
1,1.0,0.3,0.1,-1,0.0,1000,31
0,0.3,0.3,0.1,-1,0.0,1000,31


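The only difference is `param_bagging_fraction`. The scores are identical fold by fold, which is expected: LightGBM only applies bagging when `bagging_freq` is greater than 0, and we leave it at its default of 0, so `bagging_fraction` has no effect here.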

**Do the best values of the other params differ by n_estimators?**

Our `n_estimators` doesn't exactly match the real process because we don't have a validation set and can't use early stopping. We are more interested in what grid search finds for the other parameters.

In [15]:
# Do the best values of the other params differ by n_estimators?
by_estimator = []
include_cols = results.filter(regex="param_").columns.tolist() + ["mean_test_score"]

for n_est in results.param_n_estimators.unique():
    sub = results[results.param_n_estimators == n_est]
    sub = sub[sub.rank_test_score == sub.rank_test_score.min()][include_cols]
    by_estimator.append(sub)

pd.concat(by_estimator).set_index("param_n_estimators").T

param_n_estimators,1000,1000,470,470,100,100
param_bagging_fraction,1.0,0.3,1.0,0.3,1.0,0.3
param_feature_fraction,0.3,0.3,0.3,0.3,1.0,1.0
param_learning_rate,0.1,0.1,0.1,0.1,0.1,0.1
param_max_depth,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
param_min_split_gain,0.0,0.0,0.0,0.0,0.0,0.0
param_num_leaves,31.0,31.0,31.0,31.0,31.0,31.0
mean_test_score,-2.255162,-2.255162,-2.257144,-2.257144,-2.329749,-2.329749


In [17]:
# how much worse is the 100 estimator model with a feature_fraction of 0.3?
# not much
results[
    (results.param_n_estimators == 100) & (results.param_feature_fraction == 0.3)
].mean_test_score.max()

-2.3360803306994544

In [16]:
# how much worse is the 470 estimator model with a feature_fraction of 1.0?
# also not much
results[
    (results.param_n_estimators == 470) & (results.param_feature_fraction == 1.0)
].mean_test_score.max()

-2.2608359852880544

In [18]:
# n_estimators = 1000 also doesn't change much with feature_fraction
results[
    (results.param_n_estimators == 1000) & (results.param_feature_fraction == 1.0)
].mean_test_score.max()

-2.260117311630598
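To put the three comparisons side by side, the sketch below computes the gap between `feature_fraction` 0.3 and 1.0 for each `n_estimators`, using only the scores printed above (best `mean_test_score` per setting; the dictionary layout is just for illustration).

```python
# Best mean_test_score for each (n_estimators, feature_fraction), copied from
# the outputs above
best_score = {
    (100, 0.3): -2.3360803306994544,
    (100, 1.0): -2.329749,
    (470, 0.3): -2.257144,
    (470, 1.0): -2.2608359852880544,
    (1000, 0.3): -2.255162,
    (1000, 1.0): -2.260117311630598,
}

# Positive gap: feature_fraction=0.3 scores better (lower RMSE) than 1.0
gaps = {
    n_est: best_score[(n_est, 0.3)] - best_score[(n_est, 1.0)]
    for n_est in (100, 470, 1000)
}

for n_est, gap in gaps.items():
    print(f"n_estimators={n_est}: gap={gap:+.4f}")
```

With 100 estimators, 0.3 is about 0.006 worse than 1.0; with 470 and 1000 estimators, 0.3 is about 0.004 and 0.005 better. All of the gaps are small.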

In [19]:
param_grid

{'max_depth': [-1, 8],
 'num_leaves': [31],
 'learning_rate': [0.005, 0.1],
 'bagging_fraction': [0.3, 1.0],
 'feature_fraction': [0.3, 1.0],
 'min_split_gain': [0.0, 0.1],
 'n_estimators': [100, 1000, 470]}

**Takeaways**

Best set of LGB params based on grid search:

- `max_depth` = -1. This is the same as what we're already using (3rd place)

- `learning_rate` = 0.1. This is the same as what we're already using (3rd place)

- `bagging_fraction` = 1.0. The bagging fraction does not change the performance, and 1.0 is the default

- `feature_fraction` = 0.3. A feature fraction of 1.0 is best when we have only 100 boosting iterations, but 0.3 is best with either 470 or 1000. This makes sense because it helps deal with overfitting. The gaps are small in every case: with 100 estimators, 0.3 is about 0.006 RMSE worse than 1.0 (-2.3361 vs. -2.3297), while with 470 and 1000 estimators, 1.0 is about 0.004 and 0.005 RMSE worse than 0.3. Since 0.3 wins at both of the larger settings and costs little at 100, we go with 0.3. From the LightGBM documentation:
    > LightGBM will randomly select a subset of features on each iteration (tree) if feature_fraction is smaller than 1.0. For example, if you set it to 0.8, LightGBM will select 80% of features before training each tree
    > 
    > can be used to speed up training
    > 
    > can be used to deal with over-fitting

- `min_split_gain` = 0.0. This is the lightGBM default and the same as what we're already using (3rd place)