Create Cross Validation Framework #1709
Comments
What do we need to define for the cross validation? Can we create a custom method for subsetting the records and just plug that into existing sklearn cross validation infrastructure? Can we get something naive up and running first just with random subsets before we try and refine things? Is this kind of setup crazy?

```python
from sklearn.model_selection import GridSearchCV, KFold, cross_validate

# pipe, frc_data, and frc_target come from earlier setup (not shown).
params = {
    "hist_gbr__max_depth": [3, 8],
    "hist_gbr__max_leaf_nodes": [15, 31],
    "hist_gbr__learning_rate": [0.1, 1],
}
search = GridSearchCV(pipe, params)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
results = cross_validate(search, frc_data, frc_target, cv=cv)
```

Looking at how the missingness breaks down by fuels...
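On the question above about a custom subsetting method: I believe sklearn's CV tools accept any object that implements split() and get_n_splits() as the cv= argument, so a minimal sketch like the one below (the class name and defaults are made up) should plug straight into GridSearchCV / cross_validate.

```python
import numpy as np

class RandomSubsetCV:
    """Minimal sketch of a custom splitter: anything with split() and
    get_n_splits() can be passed as cv= to sklearn's CV machinery."""

    def __init__(self, n_splits=5, test_fraction=0.2, random_state=0):
        self.n_splits = n_splits
        self.test_fraction = test_fraction
        self.random_state = random_state

    def split(self, X, y=None, groups=None):
        # Yield (train_indices, test_indices) pairs based on random subsets.
        rng = np.random.default_rng(self.random_state)
        n_samples = len(X)
        n_test = int(n_samples * self.test_fraction)
        for _ in range(self.n_splits):
            shuffled = rng.permutation(n_samples)
            yield shuffled[n_test:], shuffled[:n_test]

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits

# Usage would be the same as any built-in iterator, e.g.:
# results = cross_validate(search, frc_data, frc_target, cv=RandomSubsetCV())
```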
Blurgh. Okay, I spent some time re-familiarizing myself with sklearn yesterday evening and finally managed to get an extremely basic model running in this notebook. I couldn't figure out how to pass the sample weights into the cross validation, though. And also I don't know how to evaluate the "test scores." Is it an error metric? Is it supposed to be zero? It does seem consistently indistinguishable from zero.
Supposedly the HistGBR model gracefully works with NA values, but when I leave NA values in the categorical (string) columns to be encoded, the
Even though
It looks like the GroupKFold iterator is the kind of thing that we would need to avoid (for example) learning about prices for a particular plant.
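For reference, a grouped split might look like the sketch below; using plant_id_eia as the grouping column is just my assumption about which ID we would group on.

```python
from sklearn.model_selection import GroupKFold, cross_validate

# Keep all records from a given plant together, so a plant never appears
# in both the training fold and the test fold.
group_cv = GroupKFold(n_splits=5)
results = cross_validate(
    search, frc_data, frc_target,
    cv=group_cv,
    groups=frc_data["plant_id_eia"],  # assumed grouping column
)
```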
Do you have to use
Ya I think there will be a built-in that works for us. Just have to see what we need first!
Yes, there are two things here: 1) what objective function the GBDT is optimizing, and 2) what metric(s) we are using to evaluate the success of the model. The objective and metric are usually the same, but they can differ under some circumstances (for example, if you have some complex custom metric that you can't convert to a twice-differentiable objective function, optimizing a proxy objective can hurt performance). For our purposes I think

There are a bunch of other possible objective funcs and metrics out there we can play with, but changing the objective function is changing the purpose of the model. That is a fundamentally different thing compared with changing model "hyperparameters" such as
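To make the distinction concrete, here is a rough sketch (the specific loss and scorer choices are just placeholders, not a recommendation): the loss argument sets the objective the trees optimize during training, while scoring only changes how the fitted model gets evaluated.

```python
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import cross_validate

# 1) The objective the GBDT optimizes during training (sklearn >= 1.0 spelling):
model = HistGradientBoostingRegressor(loss="squared_error")

# 2) The metric(s) we use to judge the fitted model, which can differ from the loss.
#    X_encoded / y are hypothetical pre-encoded features and the target.
results = cross_validate(
    model, X_encoded, y,
    scoring=["neg_mean_absolute_error", "neg_root_mean_squared_error"],
)
```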
I think there are two param dicts here: one with lists of hyperparameters to grid search, and one of constant values to pass to the model. If putting sample weights in the second one doesn't work, I'm not sure how to do it. Would have to do some digging.
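One thing that might work (I haven't verified it on our pipeline) is passing the weights as a fit parameter and letting cross_validate route them; the "hist_gbr__" prefix is the pipeline step name from the earlier snippet, and frc_weights is a hypothetical per-record weight array.

```python
from sklearn.model_selection import cross_validate

# frc_weights: hypothetical array of per-record weights, same length as frc_data.
# Array-like fit params that match the data length should get subset along with
# each fold, and the "hist_gbr__" prefix routes the kwarg through the Pipeline
# to the model's fit().
results = cross_validate(
    search, frc_data, frc_target, cv=cv,
    fit_params={"hist_gbr__sample_weight": frc_weights},
)
```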
As far as interpreting the error metrics, lower is better, with one caveat. Error on the training set is often smaller than error on the test set. This is called 'overfitting', because the model has essentially memorized fine details of the training data that don't generalize to new data points in the test set. So we want to keep training the model basically until test error stops decreasing or starts increasing again (this is what "early stopping" does in an automated way). When optimizing hyperparameters, you would usually choose the params that produced the smallest error on the validation set. An exception might be if the model is unstable in some way, e.g. if a multi-fold CV like k-fold shows that even though the average error is lowest, there is high variance between folds. That indicates that your CV is probably not set up properly and the folds contain dissimilar data.
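So in practice, after running cross_validate it's worth looking at both the average and the spread of the fold scores, something like:

```python
results = cross_validate(search, frc_data, frc_target, cv=cv)

# With no explicit scoring, cross_validate reports the estimator's default score
# (R^2 for regressors). A large spread across folds is a warning sign that the
# folds contain dissimilar data.
print("mean test score:", results["test_score"].mean())
print("std across folds:", results["test_score"].std())
```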
It would be great if it could just take a native categorical type! And they're stored as integers under the hood anyway, I think. But all the examples I've seen thus far are still encoding categorical columns, and they seem to have a strong preference for the

Edit: Indeed, there is native CategoricalDtype support.
Hmm, even using the "native" categorical support you still have to run it through the

Weirdly, it seems like you then have to pass the integer indices of the categorical columns to the model. Is there really no way to just give it the column names and have it pick out the right columns regardless of what order they're showing up in?
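For the record, here is roughly the shape of what I mean (the column names below are placeholders, not necessarily our real FRC columns): the encoder maps strings to integer codes, and categorical_features then takes the positions of those columns in the transformed array. I believe newer sklearn releases can also accept column names or infer categories from the pandas dtype, but with integer indices it looks like this.

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Placeholder column names:
cat_cols = ["fuel_type_code", "contract_type"]
num_cols = ["fuel_received_units", "fuel_mmbtu_per_unit"]

# The ColumnTransformer puts the encoded categorical columns first, so their
# positions in the transformed array are 0 .. len(cat_cols) - 1.
encode = ColumnTransformer([
    ("cat", OrdinalEncoder(), cat_cols),
    ("num", "passthrough", num_cols),
])
hist_gbr = HistGradientBoostingRegressor(
    categorical_features=list(range(len(cat_cols))),
)
pipe = Pipeline([("encode", encode), ("hist_gbr", hist_gbr)])
```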
Huh, apparently? The LightGBM model was much more user friendly in that regard. It just asked that categorical values had pd.Category dtypes.
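For comparison, the LightGBM version was basically just the sketch below (same placeholder column names as above):

```python
import lightgbm as lgb

X_cat = frc_data.copy()
for col in ["fuel_type_code", "contract_type"]:  # placeholder column names
    X_cat[col] = X_cat[col].astype("category")

# LightGBM picks up pandas category dtype columns automatically
# (categorical_feature="auto" is the default).
lgbm = lgb.LGBMRegressor()
lgbm.fit(X_cat, frc_target)
```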
I probably just don't understand how to use it correctly.
Superseded by #1767
The core of model building is evaluation - does the new model improve on the old one? Does it generalize outside of the training data? Answering those questions requires an appropriate cross validation framework that emulates the real application.
In our case, fuel price data is redacted for about 1/3 of records, and we want to impute it. The data represent fuel purchases at individual power plants over time.
Outline of Work
- Figure out how fuel_price_per_mmbtu goes missing - is it per plant, per year, does it correlate with other columns, etc.

Considerations
The primary goal here is to avoid using predictive information that may link train and test sets but will not exist between observed data and imputed data.
For example, about half of the price data are part of long term contracts, which means there is likely an informational link between records that are part of the same contract. If we take a random row-wise subset of our observations, training set records belonging to a contract will be extra informative about test set records belonging to that same contract, and our model will look like it performs very well according to our cross validation.
But in our actual application, whole plants are redacted, so we will not have access to any values that share contracts. The model will not be able to take advantage of that information and will operate with a) reduced performance, and b) unknown performance, because we did not evaluate it under realistic conditions.
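One way to check whether this leakage is actually biasing our evaluation is to score the same pipeline under both a random row-wise split and a plant-grouped split and compare them; a sketch, again assuming plant_id_eia is the grouping column:

```python
from sklearn.model_selection import GroupKFold, KFold, cross_validate

# Random row-wise folds can share contracts/plants between train and test...
random_scores = cross_validate(
    pipe, frc_data, frc_target,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)

# ...while grouping by plant keeps each plant entirely in train or test,
# which is closer to how whole plants are redacted in practice.
grouped_scores = cross_validate(
    pipe, frc_data, frc_target,
    cv=GroupKFold(n_splits=5),
    groups=frc_data["plant_id_eia"],  # assumed grouping column
)

# A big gap between the two sets of test scores suggests the random CV is
# leaking information we won't have when imputing redacted records.
```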