[ML] Improve hyperparameter tuning performance #1941

Merged · 37 commits from select-data-size into master on Jul 12, 2021

Changes from 23 commits

Commits (37)
2adce69
WIP
tveasey Feb 22, 2021
4f5d558
Merge branch 'master' into select-data-size
tveasey Jun 24, 2021
b2af714
Restrict the maximum number of rows used during hyperparameter tuning…
tveasey Jun 29, 2021
26070c4
Allow one to disable fine tuning entirely for fast mode
tveasey Jul 1, 2021
81d3ffd
Uncouple training fraction parameter from the number of folds
tveasey Jul 1, 2021
04248ee
Adjust the validation loss variance estimate to remove affects of sam…
tveasey Jul 2, 2021
f72dd4c
Formatting
tveasey Jul 2, 2021
fc0a3bc
Docs
tveasey Jul 2, 2021
78f6e37
Avoid infinite loop
tveasey Jul 2, 2021
caa7c82
Correct handling of eta growth rate per tree
tveasey Jul 2, 2021
b46c76e
Correct edge case test
tveasey Jul 2, 2021
7318193
Test threshold
tveasey Jul 2, 2021
e55ea41
Handle the case we can't sample train/test folds without replacement …
tveasey Jul 5, 2021
dd002c3
Handle edge case creating train/test splits with very little data
tveasey Jul 7, 2021
37d4690
Slightly relax tests to pass on all platforms
tveasey Jul 7, 2021
28f22f4
Review comments
tveasey Jul 8, 2021
4f3e3f9
Review comments
tveasey Jul 8, 2021
5d4edba
Explain p.
tveasey Jul 8, 2021
5748ce1
Explain poly
tveasey Jul 8, 2021
c252b24
Add explanation of mechanics of fit
tveasey Jul 8, 2021
9a7feea
Make k dependency clear
tveasey Jul 8, 2021
5b1a018
Document test interface
tveasey Jul 8, 2021
93d3264
Names, explanation and coding style guideline fixes
tveasey Jul 8, 2021
ae45379
Explicit capture
tveasey Jul 8, 2021
efdadc0
Typo
tveasey Jul 8, 2021
59c9add
Capture by reference
tveasey Jul 8, 2021
74c27f9
Rename
tveasey Jul 8, 2021
ca1d910
Update comment to reflect the current behaviour
tveasey Jul 8, 2021
40eae57
Name variable for readability
tveasey Jul 8, 2021
92de10f
Typedef
tveasey Jul 8, 2021
d0be22f
Define small constant used to prefer fast training if test error is s…
tveasey Jul 8, 2021
a380b20
We should record the fraction and number of training rows in the mode…
tveasey Jul 9, 2021
06460e6
Handle case we don't need to sample for last fold
tveasey Jul 9, 2021
8e98de0
Merge branch 'master' into select-data-size
tveasey Jul 9, 2021
ad037ec
Add an explanation of variance treatment in BO
tveasey Jul 9, 2021
e58ed73
Comments
tveasey Jul 9, 2021
e0a61bf
Move fraction of training data into its own section in instrumentation
tveasey Jul 12, 2021
15 changes: 12 additions & 3 deletions include/api/CInferenceModelMetadata.h
@@ -40,8 +40,10 @@ class API_EXPORT CInferenceModelMetadata {
static const std::string JSON_MEAN_MAGNITUDE_TAG;
static const std::string JSON_MIN_TAG;
static const std::string JSON_MODEL_METADATA_TAG;
+static const std::string JSON_NUM_TRAINING_ROWS_TAG;
static const std::string JSON_RELATIVE_IMPORTANCE_TAG;
static const std::string JSON_TOTAL_FEATURE_IMPORTANCE_TAG;
+static const std::string JSON_TRAIN_PARAMETERS_TAG;

public:
using TVector = maths::CDenseVector<double>;
@@ -64,6 +66,10 @@ class API_EXPORT CInferenceModelMetadata {
//! to the baseline value).
void featureImportanceBaseline(TVector&& baseline);
void hyperparameterImportance(const maths::CBoostedTree::THyperparameterImportanceVec& hyperparameterImportance);
+//! Set the number of rows used to train the model.
+void numberTrainingRows(std::size_t numberRows);
+//! Set the fraction of data per fold used for training when tuning hyperparameters.
+void trainFractionPerFold(double fraction);

private:
struct SHyperparameterImportance {
@@ -86,20 +92,23 @@ class API_EXPORT CInferenceModelMetadata {

private:
void writeTotalFeatureImportance(TRapidJsonWriter& writer) const;
-void writeHyperparameterImportance(TRapidJsonWriter& writer) const;
void writeFeatureImportanceBaseline(TRapidJsonWriter& writer) const;
+void writeHyperparameterImportance(TRapidJsonWriter& writer) const;
+void writeTrainParameters(TRapidJsonWriter& writer) const;

private:
TSizeMeanAccumulatorUMap m_TotalShapValuesMean;
TSizeMinMaxAccumulatorUMap m_TotalShapValuesMinMax;
TOptionalVector m_ShapBaseline;
TStrVec m_ColumnNames;
TStrVec m_ClassValues;
-TPredictionFieldTypeResolverWriter m_PredictionFieldTypeResolverWriter =
+TPredictionFieldTypeResolverWriter m_PredictionFieldTypeResolverWriter{
[](const std::string& value, TRapidJsonWriter& writer) {
writer.String(value);
-};
+}};
THyperparametersVec m_HyperparameterImportance;
+std::size_t m_NumberTrainingRows{0};
+double m_TrainFractionPerFold{0.0};
};
}
}
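For context, here is a minimal sketch of how writeTrainParameters might serialise these new fields, assuming a RapidJSON-style writer; the tag string values, the object layout, and the "train_fraction_per_fold" key are illustrative assumptions, not taken from this diff:

void CInferenceModelMetadata::writeTrainParameters(TRapidJsonWriter& writer) const {
    // Group the training data statistics under one object (layout assumed).
    writer.Key(JSON_TRAIN_PARAMETERS_TAG.c_str());
    writer.StartObject();
    writer.Key(JSON_NUM_TRAINING_ROWS_TAG.c_str());
    writer.Uint64(m_NumberTrainingRows);
    writer.Key("train_fraction_per_fold"); // hypothetical tag name
    writer.Double(m_TrainFractionPerFold);
    writer.EndObject();
}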
8 changes: 7 additions & 1 deletion include/maths/CBoostedTree.h
@@ -208,7 +208,7 @@ class MATHS_EXPORT CBoostedTree final : public CDataFramePredictiveModel {
class MATHS_EXPORT CVisitor : public CDataFrameCategoryEncoder::CVisitor,
public CBoostedTreeNode::CVisitor {
public:
-virtual ~CVisitor() = default;
+~CVisitor() override = default;
virtual void addTree() = 0;
virtual void addClassificationWeights(TDoubleVec weights) = 0;
virtual void addLossFunction(const TLossFunction& lossFunction) = 0;
@@ -236,6 +236,12 @@ class MATHS_EXPORT CBoostedTree final : public CDataFramePredictiveModel {
//! Get the vector of hyperparameter importances.
THyperparameterImportanceVec hyperparameterImportance() const;

+//! Get the number of rows used to train the model.
+std::size_t numberTrainingRows() const override;
+
+//! Get the fraction of data per fold used for training when tuning hyperparameters.
+double trainFractionPerFold() const override;

//! Get the column containing the dependent variable.
std::size_t columnHoldingDependentVariable() const override;

25 changes: 13 additions & 12 deletions include/maths/CBoostedTreeFactory.h
@@ -209,9 +209,10 @@ class MATHS_EXPORT CBoostedTreeFactory final {
TDoubleDoublePrVec estimateTreeGainAndCurvature(core::CDataFrame& frame,
const TDoubleVec& percentiles) const;

-//! Perform a line search for the test loss w.r.t. a single regularization
-//! hyperparameter and apply Newton's method to find the minimum. The plan
-//! is to find a value near where the model starts to overfit.
+//! Perform a line search for the test loss w.r.t. a single hyperparameter.
+//! At the end we use a smooth curve fit through all test loss values (using
+//! LOWESS regression) and use this to get a best estimate of where the true
+//! minimum occurs.
//!
//! \return The interval to search during the main hyperparameter optimisation
//! loop or null if this couldn't be found.
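To make the mechanics concrete, here is a minimal sketch of such a line search built on the CLowess interface declared later in this PR (see include/maths/CLowess.h below); the candidate grid, the testLossAt callback, the number of folds, and the margin constant are hypothetical, and the real factory code may differ:

#include <maths/CLowess.h>

#include <functional>
#include <utility>
#include <vector>

// Sketch: fit a smooth curve through (hyperparameter value, test loss) pairs
// and return the interval to hand to the main optimisation loop.
std::pair<double, double>
lineSearchInterval(const std::vector<double>& candidates,
                   const std::function<double(double)>& testLossAt,
                   double margin) {
    ml::maths::CLowess::TDoubleDoublePrVec losses;
    for (double x : candidates) {
        losses.emplace_back(x, testLossAt(x));
    }
    ml::maths::CLowess lowess;
    lowess.fit(std::move(losses), 5 /*numberFolds*/);
    auto [xmin, fmin] = lowess.minimum();
    // Any value whose smoothed loss is within margin of the minimum is
    // plausibly optimal, so search that whole interval.
    return lowess.sublevelSet(xmin, fmin, fmin + margin);
}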
@@ -277,14 +278,14 @@
private:
TOptionalDouble m_MinimumFrequencyToOneHotEncode;
TOptionalSize m_BayesianOptimisationRestarts;
-bool m_StratifyRegressionCrossValidation = true;
-double m_InitialDownsampleRowsPerFeature = 200.0;
-std::size_t m_MaximumNumberOfTrainRows = 500000;
-double m_GainPerNode1stPercentile = 0.0;
-double m_GainPerNode50thPercentile = 0.0;
-double m_GainPerNode90thPercentile = 0.0;
-double m_TotalCurvaturePerNode1stPercentile = 0.0;
-double m_TotalCurvaturePerNode90thPercentile = 0.0;
+bool m_StratifyRegressionCrossValidation{true};
+double m_InitialDownsampleRowsPerFeature{200.0};
+std::size_t m_MaximumNumberOfTrainRows{500000};
+double m_GainPerNode1stPercentile{0.0};
+double m_GainPerNode50thPercentile{0.0};
+double m_GainPerNode90thPercentile{0.0};
+double m_TotalCurvaturePerNode1stPercentile{0.0};
+double m_TotalCurvaturePerNode90thPercentile{0.0};
std::size_t m_NumberThreads;
TBoostedTreeImplUPtr m_TreeImpl;
TVector m_LogDownsampleFactorSearchInterval;
@@ -294,7 +295,7 @@
TVector m_LogLeafWeightPenaltyMultiplierSearchInterval;
TVector m_SoftDepthLimitSearchInterval;
TVector m_LogEtaSearchInterval;
-TTrainingStateCallback m_RecordTrainingState = noopRecordTrainingState;
+TTrainingStateCallback m_RecordTrainingState{noopRecordTrainingState};
};
}
}
12 changes: 9 additions & 3 deletions include/maths/CBoostedTreeImpl.h
@@ -150,6 +150,13 @@ class MATHS_EXPORT CBoostedTreeImpl final {
//! \return The best hyperparameters for validation error found so far.
const CBoostedTreeHyperparameters& bestHyperparameters() const;

+//! \return The fraction of data we use for training per fold when tuning hyperparameters.
+double trainFractionPerFold() const;
+
+//! \return The full training set data mask, i.e. all rows which aren't missing
+//! the dependent variable.
+core::CPackedBitVector allTrainingRowsMask() const;

//! \name Test Only
//@{
//! The name of the object holding the best hyperparameters in the state document.
@@ -203,9 +210,8 @@
//! Check if we can train a model.
bool canTrain() const;

-//! Get the full training set data mask, i.e. all rows which aren't missing
-//! the dependent variable.
-core::CPackedBitVector allTrainingRowsMask() const;
+//! Get the mean number of training examples which are used in each fold.
+double meanNumberTrainingRowsPerFold() const;

//! Compute the \p percentile percentile gain per split and the sum of row
//! curvatures per internal node of \p forest.
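The decoupling matters because the train fraction, not the fold count, now determines how much data each fold trains on. A sketch of the likely relationship, assuming manhattan() returns the number of set bits in the mask:

#include <core/CPackedBitVector.h>

// Sketch: with train fraction p per fold, each fold trains on roughly p * N
// of the N rows which have the dependent variable, independently of the
// number of folds (previously the fraction was implied by the fold count,
// e.g. (k - 1) / k for k folds).
double meanNumberTrainingRowsPerFold(double trainFractionPerFold,
                                     const ml::core::CPackedBitVector& allTrainingRowsMask) {
    return trainFractionPerFold * allTrainingRowsMask.manhattan();
}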
2 changes: 1 addition & 1 deletion include/maths/CDataFrameAnalysisInstrumentationInterface.h
@@ -116,7 +116,7 @@ class MATHS_EXPORT CDataFrameTrainBoostedTreeInstrumentationInterface
SRegularization s_Regularization;
double s_DownsampleFactor{-1.0};
std::size_t s_NumFolds{0};
-double s_TrainFractionPerFold{0.0};
+double s_NumTrainingRows{0};
std::size_t s_MaxTrees{0};
double s_FeatureBagFraction{-1.0};
double s_EtaGrowthRatePerTree{-1.0};
6 changes: 6 additions & 0 deletions include/maths/CDataFramePredictiveModel.h
@@ -61,6 +61,12 @@ class MATHS_EXPORT CDataFramePredictiveModel {
//! \warning Will return a nullptr if a trained model isn't available.
virtual CTreeShapFeatureImportance* shap() const = 0;

+//! Get the number of rows used to train the model.
+virtual std::size_t numberTrainingRows() const = 0;
+
+//! Get the fraction of data per fold used for training when tuning hyperparameters.
+virtual double trainFractionPerFold() const = 0;

//! Get the column containing the dependent variable.
virtual std::size_t columnHoldingDependentVariable() const = 0;

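A small usage sketch for these accessors, for example to log data usage after training; apart from the two accessors added above, everything here is hypothetical:

#include <maths/CDataFramePredictiveModel.h>

#include <iostream>

// Sketch: report how much data a trained model actually used.
void reportTrainingDataUsage(const ml::maths::CDataFramePredictiveModel& model) {
    std::cout << "trained on " << model.numberTrainingRows() << " rows, using "
              << 100.0 * model.trainFractionPerFold()
              << "% of the data per fold for hyperparameter tuning\n";
}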
23 changes: 10 additions & 13 deletions include/maths/CLowess.h
@@ -7,9 +7,8 @@
#ifndef INCLUDED_ml_maths_CLowess_h
#define INCLUDED_ml_maths_CLowess_h

-#include <maths/CLeastSquaresOnlineRegression.h>
-
#include <maths/CBasicStatistics.h>
+#include <maths/CLeastSquaresOnlineRegression.h>

#include <utility>
#include <vector>
@@ -34,7 +33,8 @@ class CLowess {
//!
//! \param[in] data The training data.
//! \param[in] numberFolds The number of folds to use in cross-validation to
-// compute the best weight function from the family exp(-k |xi - xj|).
+//! compute the best weight function from the family exp(-k |xi - xj|) with
+//! k a free parameter which determines the amount of smoothing to use.
void fit(TDoubleDoublePrVec data, std::size_t numberFolds);
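For intuition, a minimal sketch of the weight computation this family implies; this is illustrative only and not the class's actual internals:

#include <cmath>
#include <utility>
#include <vector>

// Sketch: weights for a local polynomial fit at x. Each point (xi, yi)
// contributes with weight exp(-k * |xi - x|), so larger k localises the fit
// (less smoothing) while k -> 0 tends to an ordinary global least squares fit.
std::vector<double> lowessWeights(double x, double k,
                                  const std::vector<std::pair<double, double>>& data) {
    std::vector<double> weights;
    weights.reserve(data.size());
    for (const auto& point : data) {
        weights.push_back(std::exp(-k * std::abs(point.first - x)));
    }
    return weights;
}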

//! Predict the value at \p x.
@@ -47,23 +47,17 @@
//! \note Defined as (0,0) if no data have been fit.
TDoubleDoublePr minimum() const;

+//! \name Test Only
+//@{
//! Get an estimate of residual variance at the observed values.
//!
//! \note Defined as zero if no data have been fit.
double residualVariance() const;
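As an illustration, a sketch of one way such an estimate could be computed, assuming a predict(double) accessor matching the doc comment above; the real implementation may apply bias corrections:

#include <maths/CLowess.h>

#include <utility>
#include <vector>

// Sketch: mean squared residual of the observed losses about the smooth fit,
// plausibly what the variance treatment in Bayesian optimisation consumes.
double estimateResidualVariance(const ml::maths::CLowess& lowess,
                                const std::vector<std::pair<double, double>>& data) {
    if (data.empty()) {
        return 0.0; // defined as zero if no data have been fit
    }
    double sum{0.0};
    for (const auto& point : data) {
        double residual{point.second - lowess.predict(point.first)};
        sum += residual * residual;
    }
    return sum / static_cast<double>(data.size());
}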

//! Compute the sublevel set of \p f containing \p xmin.
//!
//! \param[in] xmin The argument of the minimum of the interpolated function.
//! \param[in] fmin The value of the minimum of the function.
//! \param[in] f The value of the function for which to compute the sublevel set.
//! \note \p f should be greater than fmin.
//! \note Defined as (0,0) if no data have been fit.
TDoubleDoublePr sublevelSet(double xmin, double fmin, double f) const;

//! Get how far we are prepared to extrapolate as the interval we will search
//! in the minimum and sublevelSet functions.
TDoubleDoublePr extrapolationInterval() const;
+//@}

private:
using TDoubleVec = std::vector<double>;
Expand All @@ -81,7 +75,10 @@ class CLowess {
private:
TDoubleDoublePrVec m_Data;
TSizeVec m_Mask;
-double m_K = 0.0;
+//! The weight to assign to data points when fitting the polynomial at x is
+//! given by exp(-k |xi - xj|). This can therefore be thought of as the inverse
+//! of the amount of smoothing.
+double m_K{0.0};
};
}
}