Use grid search to check whether there are better LightGBM hyperparameters than the ones we are currently using.

We'll use the features that have already been generated for our current best experiment (third sentinel + land cover features, trained with folds), and compare the results to that [best experiment](https://docs.google.com/presentation/d/1zWrSMSivxylx_iH_aOapJfyziRsDuyuXOELduOn6x3c/edit#slide=id.g278eb39bdd6_0_43).

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import yaml

from cloudpathlib import AnyPath
import lightgbm as lgb
import pandas as pd
from sklearn.model_selection import GridSearchCV

from cyano.data.utils import add_unique_identifier
from cyano.experiment.experiment import ExperimentConfig
from cyano.pipeline import CyanoModelPipeline
from cyano.settings import RANDOM_STATE

In [3]:
tmp_dir = AnyPath("tmp_dir")
tmp_dir.mkdir(exist_ok=True)

In [4]:
experiment_dir = AnyPath(
    "s3://drivendata-competition-nasa-cyanobacteria/experiments/results/third_sentinel_with_folds"
)

In [5]:
train_features = pd.read_csv(experiment_dir / "features_train.csv", index_col=0)
train_features.head()

sample_id,AOT_mean,AOT_min,AOT_max,AOT_range,B01_mean,B01_min,B01_max,B01_range,B02_mean,B02_min,...,WVP_min,WVP_max,WVP_range,NDVI_B04,NDVI_B05,NDVI_B06,NDVI_B07,month,days_before_sample,land_cover
0001cfa683171fe80161cdbe6a090c94,102.0,102.0,102.0,0.0,319.84375,144.0,1083.0,939.0,441.956349,146.0,...,277.0,1413.0,1136.0,0.581754,0.418335,0.114064,0.033551,2.0,15.0,11
0001cfa683171fe80161cdbe6a090c94,132.0,132.0,132.0,0.0,435.984375,229.0,1055.0,826.0,503.288549,154.0,...,167.0,828.0,661.0,0.574375,0.418914,0.120972,0.042572,3.0,8.0,11
0001cfa683171fe80161cdbe6a090c94,204.0,204.0,204.0,0.0,902.640625,703.0,1661.0,958.0,938.553288,672.0,...,931.0,931.0,0.0,0.455419,0.329302,0.09791,0.027511,3.0,5.0,11
0001cfa683171fe80161cdbe6a090c94,102.0,102.0,102.0,0.0,518.484375,316.0,1554.0,1238.0,618.096939,277.0,...,574.0,1436.0,862.0,0.548216,0.39169,0.115218,0.032818,2.0,13.0,11
0001cfa683171fe80161cdbe6a090c94,102.0,102.0,102.0,0.0,318.21875,104.0,1314.0,1210.0,485.587302,162.0,...,340.0,1114.0,774.0,0.582335,0.425137,0.129981,0.039196,2.0,10.0,11


In [6]:
# Load train labels
train = pd.read_csv(
    AnyPath(
        "s3://drivendata-competition-nasa-cyanobacteria/experiments/splits/competition/train.csv"
    )
)
train = add_unique_identifier(train)
train.shape

(17060, 10)

In [7]:
train.head(2)

sample_id,uid,data_provider,region,latitude,longitude,date,density_cells_per_ml,severity,distance_to_water_m,log_density
d7ebbce63c7d1498cc627a1e77f6061c,aabm,Indiana State Department of Health,midwest,39.080319,-86.430867,2018-05-14,585.0,1,0.0,6.37332
0856c3740614b5ee606f82d6c3a215a0,aacd,N.C. Division of Water Resources N.C. Departme...,south,35.875083,-78.878434,2020-11-19,290.0,1,514.0,5.673323


In [8]:
train.loc[train_features.index].log_density

sample_id
0001cfa683171fe80161cdbe6a090c94    15.800642
0001cfa683171fe80161cdbe6a090c94    15.800642
0001cfa683171fe80161cdbe6a090c94    15.800642
0001cfa683171fe80161cdbe6a090c94    15.800642
0001cfa683171fe80161cdbe6a090c94    15.800642
                                      ...    
fff76b0c751a22eda5a33f3f0d7fda98    10.338188
fff948d10c4ef9e03fddb358140b2755     9.275660
fff948d10c4ef9e03fddb358140b2755     9.275660
fff948d10c4ef9e03fddb358140b2755     9.275660
fff948d10c4ef9e03fddb358140b2755     9.275660
Name: log_density, Length: 43409, dtype: float64

Build the grid from the parameter values used by each of the winners.

In [9]:
param_grid = {
    'max_depth': [-1, 8],
    # 'num_leaves': [31],
    'learning_rate': [0.005, 0.1],
    # NOTE: bagging_fraction has no effect unless bagging_freq > 0 (LightGBM default is 0)
    'bagging_fraction': [0.3, 1.0],
    'feature_fraction': [0.3, 1.0],
    'min_split_gain': [0.0, 0.1],
    'n_estimators': [1000, 100000, 470], # same as num_boost_round
}
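
Before launching the search it helps to know how many fits the grid implies. The following is a minimal sketch that counts the combinations with `itertools.product`. The grid is copied from the cell above, and the fold count of 5 is taken from the `cv=5` argument used later in this notebook.

```python
from itertools import product

# Same grid as in the cell above
param_grid = {
    'max_depth': [-1, 8],
    'learning_rate': [0.005, 0.1],
    'bagging_fraction': [0.3, 1.0],
    'feature_fraction': [0.3, 1.0],
    'min_split_gain': [0.0, 0.1],
    'n_estimators': [1000, 100000, 470],
}

# Every combination of values is one candidate model
n_candidates = len(list(product(*param_grid.values())))

# Each candidate is fit once per cross-validation fold
cv_folds = 5
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)  # 96 480
```

With the default `refit=True`, `GridSearchCV` also fits the best candidate once more on the full training set after these cross-validation fits.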

In [10]:
lgb_model = lgb.LGBMRegressor(objective='regression', metric='rmse')

In [11]:
grid_search = GridSearchCV(
    estimator=lgb_model,
    param_grid=param_grid,
    cv=5,
    n_jobs=-1,
    scoring='neg_root_mean_squared_error'
)
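
One caveat on `cv=5`: for a regressor, an integer `cv` means plain `KFold`, and the feature rows above repeat each `sample_id` several times, so rows from one sample can land in both the training and validation folds. A possible alternative, not what this notebook runs, is group-aware splitting with `GroupKFold`. The sketch below uses synthetic data and a `Ridge` model as stand-ins. In this notebook the groups would presumably be `train_features.index`, which is an assumption.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, GroupKFold

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 20 samples with 3 rows (images) each
groups = np.repeat(np.arange(20), 3)
X = rng.normal(size=(groups.size, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=groups.size)

cv = GroupKFold(n_splits=5)

# Check that no group appears on both sides of any split
leaks = [
    bool(set(groups[train_idx]) & set(groups[test_idx]))
    for train_idx, test_idx in cv.split(X, y, groups=groups)
]

# The splitter is passed as cv, and the groups are passed to fit()
search = GridSearchCV(
    estimator=Ridge(),
    param_grid={"alpha": [0.1, 1.0]},
    cv=cv,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y, groups=groups)
```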

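Note that `GridSearchCV` always maximizes the score, so an error metric such as root mean squared error has to be negated. That is why the scorer is `neg_root_mean_squared_error`, and why the reported scores are negative.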
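
As a quick check of the sign convention, this minimal sketch compares scikit-learn's `neg_root_mean_squared_error` scorer against an RMSE computed by hand. The tiny dataset and the `DummyRegressor` are illustrative stand-ins, not part of the experiment.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import get_scorer, mean_squared_error

# Toy data: the dummy model always predicts the mean of y
X = np.zeros((4, 1))
y = np.array([1.0, 2.0, 3.0, 6.0])
model = DummyRegressor(strategy="mean").fit(X, y)

# RMSE computed directly (always >= 0)
rmse = np.sqrt(mean_squared_error(y, model.predict(X)))

# The scorer returns the negated RMSE, so higher (closer to 0) is better
score = get_scorer("neg_root_mean_squared_error")(model, X, y)

print(rmse, score)
```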

In [None]:
%%time
grid_search.fit(
    train_features,
    train.loc[train_features.index].log_density
)

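If we don't specify `scoring` in `grid_search`, the score falls back to the estimator's `score` method, which for `LGBMRegressor` is R^2. Specifying `metric="rmse"` does not change this: `metric` only sets the evaluation metric LightGBM reports during training, so `scoring` must be set explicitly to rank candidates by RMSE.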
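
The default-score behavior can be checked without LightGBM. This minimal sketch uses scikit-learn's `LinearRegression` on synthetic data. Like `LGBMRegressor`, it gets `score` from scikit-learn's `RegressorMixin`, so the same result should hold for the model in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Synthetic regression data, for illustration only
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, -0.5, 2.0]) + rng.normal(scale=0.3, size=50)

model = LinearRegression().fit(X, y)

# The estimator's own score method ...
default_score = model.score(X, y)

# ... matches R^2 computed from the predictions
r2 = r2_score(y, model.predict(X))

print(default_score, r2)
```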

In [None]:
grid_search.best_estimator_.get_params()

In [None]:
# mean_test_score is negative RMSE (scoring is set in grid_search), so the best candidates sort first
pd.DataFrame(grid_search.cv_results_).sort_values(by='mean_test_score', ascending=False)
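
Because `mean_test_score` holds negated RMSE values, it can be easier to read the results with the sign flipped back. The sketch below runs a small grid search on synthetic data with a `Ridge` model, both stand-ins for the real search. The `mean_rmse` column name is an illustrative choice.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Synthetic data, for illustration only
X = rng.normal(size=(60, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.2, size=60)

search = GridSearchCV(
    estimator=Ridge(),
    param_grid={"alpha": [0.01, 1.0, 100.0]},
    cv=5,
    scoring="neg_root_mean_squared_error",
).fit(X, y)

results = pd.DataFrame(search.cv_results_)

# Flip the sign so the column reads as a plain RMSE (lower is better)
results["mean_rmse"] = -results["mean_test_score"]

summary = results.sort_values("mean_rmse")[
    ["params", "mean_rmse", "std_test_score", "rank_test_score"]
]
print(summary)
```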