
Create Cross Validation Framework #1709

Closed · 3 tasks · Tracked by #1708
TrentonBush opened this issue Jun 22, 2022 · 13 comments

Labels: data-repair (Interpolating or extrapolating data that we don't actually have.), eia923 (Anything having to do with EIA Form 923)

@TrentonBush (Member)

The core of model building is evaluation - does the new model improve on the old one? Does it generalize outside of the training data? Answering those questions requires an appropriate cross validation framework that emulates the real application.

In our case, fuel price data is redacted for about 1/3 of records, and we want to impute it. The data represent fuel purchases at individual power plants over time.

Outline of Work

  • examine and characterize how fuel_price_per_mmbtu goes missing - is it per plant, per year, does it correlate with other columns, etc.?
  • identify key requirements of cross validation (CV)
  • design and implement the CV framework, ideally using sklearn built-ins

Considerations

The primary goal here is to avoid using predictive information that may link train and test sets but will not exist between observed data and imputed data.

For example, about half of the price data are part of long-term contracts, which means there is likely an informational link between records that belong to the same contract. If we take a random row-wise subset of our observations, training set records belonging to a given contract will be extra informative about test set records belonging to that same contract, so our model will perform very well according to our cross validation.

But in our actual application, whole plants are redacted, so we will not have access to any values that share contracts. The model will not be able to take advantage of that information and will operate with a) reduced performance, and b) unknown performance, because we did not evaluate it under realistic conditions.

@zaneselvans (Member)

What do we need to define for the cross validation? Can we create a custom method for subsetting the records and just plug that into existing sklearn cross validation infrastructure?

Can we get something naive up and running first just with random subsets before we try and refine things? Is this kind of setup crazy?

from sklearn.model_selection import GridSearchCV, KFold, cross_validate

# pipe, frc_data, and frc_target are assumed to be defined elsewhere:
# a Pipeline whose final step is named "hist_gbr", plus the fuel receipts
# and costs features and the fuel price target.
params = {
    "hist_gbr__max_depth": [3, 8],
    "hist_gbr__max_leaf_nodes": [15, 31],
    "hist_gbr__learning_rate": [0.1, 1],
}
search = GridSearchCV(pipe, params)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
results = cross_validate(search, frc_data, frc_target, cv=cv)

Looking at how the missingness breaks down by fuels...

  • natural_gas: 55% (MMBTU), 34% (deliveries)
  • coal: 26% (MMBTU), 29% (deliveries)
  • petroleum: 31% (MMBTU), 41% (deliveries)

@zaneselvans (Member)

Blurgh. Okay, I spent some time re-familiarizing myself with sklearn yesterday evening and finally managed to get an extremely basic model running in this notebook

I couldn't figure out how to pass the sample weights into the cross validation, though. And also I don't know how to evaluate the "test scores." Is it an error metric? Is it supposed to be zero? It does seem consistently indistinguishable from zero.

@zaneselvans (Member)

Supposedly the HistGBR model gracefully handles NA values, but when I leave NA values in the categorical (string) columns to be encoded, the OrdinalEncoder fails, complaining that

TypeError: Encoders require their input to be uniformly strings or numbers. Got ['NAType', 'str']

Even though OrdinalEncoder can also supposedly handle NA values and retain them to be passed through to the HistGBR model.
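
One workaround that might be worth trying (a sketch, not something confirmed in this thread): if the mixed-type complaint comes from pandas' pd.NA (which the encoder's type check doesn't treat as a missing value the way it treats np.nan), converting the NA markers in the string columns to np.nan before encoding may let them flow through. The column contents here are hypothetical.

import numpy as np
import pandas as pd

# Hypothetical string column containing pd.NA.
fuel_group = pd.Series(["coal", pd.NA, "natural_gas"], dtype="string")

# Cast to plain object dtype and swap pd.NA for np.nan, which OrdinalEncoder
# (with its default encoded_missing_value=np.nan in scikit-learn >= 1.1)
# should recognize as missing and pass through to the downstream model.
cleaned = fuel_group.astype(object).where(fuel_group.notna(), np.nan)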

@zaneselvans (Member)

It looks like the GroupKFold iterator is the kind of thing we would need to avoid (for example) learning about prices for a particular (plant_id_eia, fuel_group_eiaepm) combination in the training phase, in a way that isn't applicable with real data (since whole swathes of plant-fuel records tend to get redacted). Or maybe GroupShuffleSplit, where the groups are defined by plant_id_eia, so all train-test splitting happens across plant boundaries?
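
Roughly, that would plug into the earlier cross_validate setup like this (a sketch, assuming frc_data carries a plant_id_eia column; the search, frc_data, and frc_target names are carried over from the snippet above):

from sklearn.model_selection import GroupKFold, cross_validate

# Every record from a given plant lands in the same fold, so the model is
# always scored on plants it never saw during training.
groups = frc_data["plant_id_eia"]
cv = GroupKFold(n_splits=5)
results = cross_validate(search, frc_data, frc_target, cv=cv, groups=groups)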

@TrentonBush (Member, Author)

Do you have to use OrdinalEncoder or will it take pandas categorical dtypes?

@TrentonBush (Member, Author)

Ya, I think there will be a built-in that works for us. Just have to see what we need first!

@TrentonBush (Member, Author)

I don't know how to evaluate the "test scores." Is it an error metric?

Yes, there are two things here: 1) what objective function the GBDT is optimizing, and 2) what metric(s) we are using to evaluate the success of the model. The objective and metric are usually the same, but can differ under some circumstances (for example, if you have a complex custom metric that you can't convert into a twice-differentiable objective function, training on a proxy objective can hurt performance).

For our purposes I think l1 aka mae (mean absolute error) is a good choice. Minimizing l1 error produces an estimate of the median, compared to l2 aka mse (mean squared error), which produces an estimate of the mean. Because our dataset seems to have some wild outliers, the median is probably a more robust central estimate (unless we can get rid of those outliers).

There are a bunch of other possible objective functions and metrics out there we can play with, but changing the objective function changes the purpose of the model. That is fundamentally different from changing model "hyperparameters" such as max_depth etc., which change the tactics/implementation of the model.
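
In sklearn terms that would look roughly like this (a sketch, reusing the hypothetical frc_data and frc_target names from above and assuming the features are already numerically encoded):

from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import cross_validate

# An L1 (absolute error) objective makes the model estimate a conditional median.
hist_gbr = HistGradientBoostingRegressor(loss="absolute_error")

# Score with MAE as well; sklearn reports it negated so that higher is always better.
results = cross_validate(
    hist_gbr, frc_data, frc_target, cv=5, scoring="neg_mean_absolute_error"
)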

I couldn't figure out how to pass the sample weights into the cross validation, though.

I think there are two param dicts here: one with lists of hyperparameters to grid search, and one of constant values to pass to the model. If putting sample weights in the second one doesn't work, I'm not sure how to do it. Would have to do some digging.
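
One possibility worth trying (a sketch under assumptions, not something settled in this thread): cross_validate can forward fit parameters to the estimator via fit_params, and a Pipeline routes "step__param" names to the named step. This assumes the pipeline's final step is named "hist_gbr" and that sample_weights is a 1-D array aligned row-for-row with frc_data.

from sklearn.model_selection import cross_validate

# fit_params entries with one value per row should be subset along with each
# CV fold before being handed to Pipeline.fit().
results = cross_validate(
    pipe,
    frc_data,
    frc_target,
    cv=cv,
    fit_params={"hist_gbr__sample_weight": sample_weights},
)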

@TrentonBush (Member, Author)

As far as interpreting the error metrics, lower is better, with one caveat. Error on the training set is often smaller than error on the test set. This is called 'overfitting', because the model has essentially memorized fine details of the training data that don't generalize to new data points in the test set. So we want to keep training the model basically until test error stops decreasing or starts increasing again (this is what "early stopping" does in an automated way).

When optimizing hyperparameters, you would usually choose the params that produced the smallest error on the validation set. An exception might be if the model is unstable in some way, like if a multi-fold CV such as k-fold shows that, although the average error is lowest, there is high variance between folds. That indicates your CV is probably not set up properly and the folds contain dissimilar data.
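
For reference, HistGradientBoostingRegressor can do that early stopping automatically; a minimal sketch (the parameter values are illustrative, not from this thread):

from sklearn.ensemble import HistGradientBoostingRegressor

# Hold out 10% of the training data internally and stop adding trees once the
# held-out score hasn't improved for 10 consecutive iterations.
hist_gbr = HistGradientBoostingRegressor(
    loss="absolute_error",
    max_iter=1000,
    early_stopping=True,
    validation_fraction=0.1,
    n_iter_no_change=10,
)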

@zaneselvans (Member) commented Jun 23, 2022

Do you have to use OrdinalEncoder or will it take pandas categorical dtypes?

It would be great if it could just take a native categorical type! And they're stored as integers under the hood anyway I think. But all the examples I've seen thus far are still encoding categorical columns, and they seem to have a strong preference for the OrdinalEncoder with this model, since it can happily deal with having all the categories in a single column.

Edit: Indeed, there is native CategoricalDtype support.

@zaneselvans (Member)

Hmm, even using the "native" categorical support you still have to run it through the OrdinalEncoder, but you can have it select the columns that it's applied to based on the dtype of the columns.

Weirdly, it seems like you then have to pass the integer indices of the categorical columns to the model. Is there really no way to just give it the column names and have it pick out the right columns regardless of what order they're showing up in?
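
Here's roughly the pattern I mean (a sketch based on sklearn's categorical-support example; frc_data is the hypothetical feature DataFrame from the earlier snippets):

from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Pick out the pandas category-dtype columns by dtype rather than by name.
cat_selector = make_column_selector(dtype_include="category")
cat_cols = cat_selector(frc_data)

# Encode only those columns; everything else passes through unchanged.
# The encoded columns come out of the ColumnTransformer first, so their
# positions are 0..len(cat_cols)-1.
encoder = make_column_transformer(
    (OrdinalEncoder(), cat_cols),
    remainder="passthrough",
)
hist_gbr = HistGradientBoostingRegressor(
    categorical_features=list(range(len(cat_cols)))
)
pipe = make_pipeline(encoder, hist_gbr)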

@TrentonBush (Member, Author)

Huh, apparently? The LightGBM model was much more user-friendly in that regard. It just asked that categorical columns have pandas "category" dtypes.
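
For comparison, something like this works with LightGBM (a sketch, assuming lightgbm is installed and X_train is a DataFrame with category-dtype columns):

import lightgbm as lgb

# LightGBM detects pandas category-dtype columns automatically; no manual
# encoding or index bookkeeping is required.
model = lgb.LGBMRegressor(objective="regression_l1")
model.fit(X_train, y_train)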

@zaneselvans (Member)

I probably just don't understand how to use it correctly.

@zaneselvans (Member)

Superseded by #1767
