# Scaling the target

### Should you scale the target in regression tasks?

It's fairly well known that it's often useful, and sometimes essential, to scale (e.g. normalize or standardize) the features (i.e. `X`) before training a predictive model on them.

The question of when, if ever, one should _also_ scale the target, `y`, sometimes comes up.

I asked ChatGPT for its opinion, and it surprised me a bit because it almost seems to suggest that you should _always_ scale the target:

> Scaling the target variable in a regression task can be beneficial in certain scenarios:
> 
> 1. When using regularization techniques: If you are using a regression algorithm with regularization, such as Lasso or Ridge regression, scaling the target variable can help ensure that the magnitude of the coefficients and the regularization penalty term are on a similar scale. This can lead to more stable and accurate models.
> 
> 2. When using neural networks: Neural networks often work better when both input features and target variables are on a similar scale, as it can help with gradient-based optimization and avoid issues like vanishing or exploding gradients. Scaling the target variable in this context can lead to faster convergence and better model performance.
> 
> 3. When comparing models with different target scales: If you are comparing models across different datasets with varying target variable scales, scaling the target variable can help make performance comparisons more consistent and meaningful.
> 
> 4. When using gradient-based optimization algorithms: Some machine learning algorithms, like support vector regression (SVR), use gradient-based optimization methods. Scaling the target variable can help improve the convergence of the optimization process.
> 
> 5. When the target variable has a large range or extreme values: In cases where the target variable has a large range or extreme values, scaling it can help stabilize the learning process and avoid potential issues with numerical stability or model performance.
> 
> However, it is important to remember that after training the model, you will need to reverse the scaling transformation to interpret the model's predictions in their original scale.
> 
> In summary, while not always necessary, scaling the target variable in a regression task can be beneficial in specific situations, such as when using regularization, neural networks, or gradient-based optimization algorithms.

And there's [this conversation](https://stats.stackexchange.com/questions/111467/is-it-necessary-to-scale-the-target-value-in-addition-to-scaling-features-for-re) on Cross Validated (the statistics Stack Exchange site), with answers ranging from "No" to "Maybe" to "Yes".

So let's load some data and try some things!

### TLDR

Experimentation suggests that very large or very small target values make SVR and MLPRegressor very unstable and hard to train, even with hyperparameter search. (I probably didn't think hard enough about how to compensate for the value magnitudes; I mostly just played with a couple of arguments.) In each case, scaling the target (standardizing) solved the problem.

As far as I can tell, scaling the target did not substantially affect the performance of LinearRegression (with or without L2 regularization; see below re L1), KNN regression, SGD regression (which surprised me, although the huge target did cause it trouble), or RandomForestRegressor.

For whatever reason, I cannot get Lasso (L1 regularization) to converge at all on my data, even when scaling the input and the output.

If you're getting unstable results from an SVR or MLP, I think the best strategy may be to just use another algorithm, rather than trying to find the right hyperparameters or dealing with the hassle of scaling the target. If you're hellbent on using a neural net or SVR, just scale the target.

## Load some data

In [None]:
import pandas as pd

df = pd.read_csv('https://geocomp.s3.amazonaws.com/data/MD-GR-NPOR-RHOB-DT4P-DT4S.txt',
                 names='MD-GR-NPOR-RHOB-DT4P-DT4S'.split('-'),
                )

df.describe()

In [None]:
# Remove problem samples.
df = df[df['DT4S'] > 200]

# Make new targets.
df['DT4S_scaled'] = (df['DT4S'] - df['DT4S'].mean()) / df['DT4S'].std()
df['DT4S_huge'] = df['DT4S'] * 1e6
df['DT4S_tiny'] = df['DT4S'] / 1e6
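
A small aside on the manual standardization above: pandas' `Series.std()` uses the sample standard deviation (`ddof=1`) by default, whereas scikit-learn's `StandardScaler` divides by the population standard deviation (`ddof=0`). So a hand-made column like `DT4S_scaled` should differ from the `StandardScaler` result by a constant factor close to 1. A minimal sketch on made-up numbers (the values are hypothetical, not taken from the well log):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical target values, for illustration only.
s = pd.Series([250.0, 300.0, 350.0, 400.0, 480.0])

# Manual standardization, as in the cell above (pandas default: ddof=1).
manual = ((s - s.mean()) / s.std()).to_numpy()

# StandardScaler needs 2D input and uses the population std (ddof=0).
scaled = StandardScaler().fit_transform(s.to_numpy().reshape(-1, 1)).ravel()

n = len(s)
ratio = scaled / manual  # the same constant for every sample
print(ratio, np.sqrt(n / (n - 1)))
```

For a log with thousands of samples the factor is negligible, so this matters only if you need the two routes to agree exactly.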

In [None]:
from sklearn.linear_model import LinearRegression, Ridge, SGDRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import make_pipeline


# Make X and y.
X = df[['MD', 'GR', 'NPOR', 'RHOB', 'DT4P']].values
y = df['DT4S_tiny'].values

# Split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Linear regression

It should not make a difference whether I scale the target or not.

In [None]:
# Make and fit pipeline.
regr = LinearRegression()
pipe = make_pipeline(StandardScaler(), regr)
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

With scaled target (the [`TransformedTargetRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.TransformedTargetRegressor.html) class helps a lot because transforming the target is super-fiddly):

In [None]:
regr = TransformedTargetRegressor(LinearRegression(), transformer=StandardScaler())
pipe = make_pipeline(StandardScaler(), regr)
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)
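
To make the "super-fiddly" part concrete, here is a rough sketch of what the wrapper saves you from doing by hand: reshaping `y` to 2D, fitting the scaler on the training target only, and undoing the scaling on the predictions. It uses synthetic data from `make_regression` as a stand-in (an assumption, not the well-log data), and the `_demo` names are made up for the example.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the well-log data).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Manual route: scalers expect 2D arrays, so reshape y, fit the scaler on the
# training target only, then invert the scaling on the predictions.
y_scaler = StandardScaler().fit(ytr.reshape(-1, 1))
model = LinearRegression().fit(Xtr, y_scaler.transform(ytr.reshape(-1, 1)).ravel())
manual_pred = y_scaler.inverse_transform(model.predict(Xte).reshape(-1, 1)).ravel()

# Same thing with the wrapper.
ttr = TransformedTargetRegressor(regressor=LinearRegression(), transformer=StandardScaler())
ttr_pred = ttr.fit(Xtr, ytr).predict(Xte)

print(np.allclose(manual_pred, ttr_pred))
```

The wrapper also means `score()` is reported in the original units of the target, which keeps the comparisons in this notebook fair.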

Now with regularization, using `Ridge()`. The idea is that if you use regularization then it might be a good idea to scale the target.

In [None]:
regr = Ridge()
pipe = make_pipeline(StandardScaler(), regr)
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

With scaled target:

In [None]:
regr = TransformedTargetRegressor(Ridge(), transformer=StandardScaler())
pipe = make_pipeline(StandardScaler(), regr)
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

Makes no difference in either case, i.e. regularization or not.
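
This is what you would expect from the maths: the least-squares and ridge solutions are both linear in `y`, so multiplying the target by a constant (or standardizing it) just rescales the coefficients and intercept, and R² does not change. A quick check on synthetic stand-in data (an assumption; the `_demo` names are made up for this sketch):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the well-log data).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

raw = Ridge(alpha=1.0).fit(Xtr, ytr)           # target as-is
tiny = Ridge(alpha=1.0).fit(Xtr, ytr / 1e6)    # target divided by a million
wrapped = TransformedTargetRegressor(
    regressor=Ridge(alpha=1.0), transformer=StandardScaler()
).fit(Xtr, ytr)                                # target standardized

r2_raw = raw.score(Xte, yte)
r2_tiny = tiny.score(Xte, yte / 1e6)
r2_wrapped = wrapped.score(Xte, yte)
print(r2_raw, r2_tiny, r2_wrapped)

# The coefficients simply scale with the target.
print(np.allclose(raw.coef_, tiny.coef_ * 1e6))
```

Note that this argument relies on the penalty being applied to the coefficients only, with the features left unchanged; it does not carry over to L1 regularization, where the solution is not linear in `y`.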

Now with `Lasso()`

In [None]:
from sklearn.linear_model import Lasso

regr = Lasso(alpha=0)
pipe = make_pipeline(StandardScaler(), regr)
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

In [None]:
from sklearn.linear_model import Lasso

regr = Lasso()
pipe = make_pipeline(StandardScaler(), regr)
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

In [None]:
regr = TransformedTargetRegressor(Lasso(), transformer=StandardScaler())
pipe = make_pipeline(StandardScaler(), regr)
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

`Lasso` does not converge at all and I don't know why.
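
A possible explanation (untested on this dataset, so treat it as an assumption): scikit-learn's `Lasso` minimizes `(1 / (2 * n_samples)) * ||y - Xw||² + alpha * ||w||_1`, and the default `alpha=1.0` is large relative to a standardized or tiny target. With standardized `X` and `y`, the quantity that has to exceed `alpha` for any coefficient to become non-zero is a correlation, which can never be more than 1, so every coefficient is shrunk to exactly zero and R² ends up near 0. Separately, the scikit-learn documentation advises against `alpha=0` with `Lasso` for numerical reasons and suggests `LinearRegression` instead. The sketch below shows the all-zero effect on synthetic stand-in data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the well-log data).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
Xs = StandardScaler().fit_transform(X_demo)
ys = StandardScaler().fit_transform(y_demo.reshape(-1, 1)).ravel()

# With standardized X and y, |X_j . y| / n is a correlation (at most 1), so the
# default alpha=1.0 is enough to shrink every coefficient to zero.
default_lasso = Lasso().fit(Xs, ys)

# A much weaker penalty lets the coefficients through.
weaker_lasso = Lasso(alpha=0.01).fit(Xs, ys)

print(default_lasso.coef_)
print(weaker_lasso.coef_)
```

If this is what is happening here, trying a much smaller `alpha` (scaled to the magnitude of the target) would be the first thing to check.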

## Random forest

Now with `RandomForestRegressor()`

In [None]:
from sklearn.ensemble import RandomForestRegressor

regr = RandomForestRegressor()
pipe = make_pipeline(StandardScaler(), regr)
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

In [None]:
regr = TransformedTargetRegressor(RandomForestRegressor(), transformer=StandardScaler())
pipe = make_pipeline(StandardScaler(), regr)
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

## Stochastic gradient descent

And with SGD: the tiny target has no effect, but the huge one is bad.

In [None]:
regr = SGDRegressor(penalty=None)  # No regularization.
pipe = make_pipeline(StandardScaler(), regr)
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

In [None]:
regr = SGDRegressor(alpha=0.01)
pipe = make_pipeline(StandardScaler(), regr)
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

In [None]:
regr = TransformedTargetRegressor(SGDRegressor(alpha=0.01), transformer=StandardScaler())
pipe = make_pipeline(StandardScaler(), regr)
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

## Neural network

In [None]:
from sklearn.neural_network import MLPRegressor

regr = MLPRegressor(alpha=0, max_iter=1000)
pipe = make_pipeline(StandardScaler(), regr)
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

In [None]:
regr = MLPRegressor(alpha=0.0001, max_iter=1000)
pipe = make_pipeline(StandardScaler(), regr)
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

In [None]:
regr = TransformedTargetRegressor(MLPRegressor(), transformer=StandardScaler())
pipe = make_pipeline(StandardScaler(), regr)
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

💥 So this one does blow up.

## KNN

In [None]:
from sklearn.neighbors import KNeighborsRegressor

regr = KNeighborsRegressor(n_neighbors=50)
pipe = make_pipeline(StandardScaler(), regr)
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

In [None]:
regr = KNeighborsRegressor()
pipe = make_pipeline(StandardScaler(), regr)
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

In [None]:
regr = TransformedTargetRegressor(KNeighborsRegressor(), transformer=StandardScaler())
pipe = make_pipeline(StandardScaler(), regr)
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

## Support vector machine

And with SVR: every rescaled target (tiny and huge) is affected.

In [None]:
regr = SVR(C=1e12)  # Almost no regularization.
pipe = make_pipeline(StandardScaler(), regr)
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

In [None]:
regr = SVR(C=1)
pipe = make_pipeline(StandardScaler(), regr)
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

In [None]:
regr = TransformedTargetRegressor(SVR(C=1), transformer=StandardScaler())
pipe = make_pipeline(StandardScaler(), regr)
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)
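
One likely reason the tiny target hurts so much: `SVR` ignores residuals smaller than `epsilon`, and the default `epsilon=0.1` is expressed in the units of `y`. If the whole target spans only about 1e-4, every sample sits inside that tube, nothing is penalized, and the fitted function is flat. The sketch below illustrates this on synthetic stand-in data (an assumption, with made-up `_demo` names), not on the well log:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (an assumption), with a target of the order of 1e-4.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))
y_demo = (300 + 50 * X_demo[:, 0] - 30 * X_demo[:, 1] + 10 * X_demo[:, 2]
          + rng.normal(scale=5, size=400)) / 1e6
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Default epsilon=0.1 is far wider than the whole range of y, so no sample is
# penalized and the model predicts a constant.
raw = make_pipeline(StandardScaler(), SVR(C=1)).fit(Xtr, ytr)

# Standardizing the target puts it on the scale the defaults were chosen for.
wrapped = make_pipeline(
    StandardScaler(),
    TransformedTargetRegressor(regressor=SVR(C=1), transformer=StandardScaler()),
).fit(Xtr, ytr)

r2_raw = raw.score(Xte, yte)
r2_wrapped = wrapped.score(Xte, yte)
print(r2_raw, r2_wrapped)
```

`C` is also tied to the scale of the target, because it weights the residual term against the penalty on the weights, which may be part of why tuning `epsilon` alone does not rescue things.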

## Question: can you always rescue things with (say) `GridSearchCV`?

**Short answer: Maybe, if you know which params to change and really check all the cases (esp near edges!). But for SVR at least, tiny values seem to be very difficult to compensate for.**

In [None]:
regr = SVR(C=1)
pipe = make_pipeline(StandardScaler(), regr)
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

In [None]:
from sklearn.model_selection import GridSearchCV
import numpy as np


grid = GridSearchCV(pipe, param_grid={'svr__epsilon': np.logspace(-6, 6, 13)}, cv=6, verbose=1)
grid.fit(X_train, y_train)

In [None]:
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(pipe, param_grid={'svr__C': np.logspace(-6, 6, 13),
                                      'svr__epsilon': np.logspace(-6, 6, 13)},
                    cv=6, verbose=1)
grid.fit(X_train, y_train)

In [None]:
grid.best_params_

In [None]:
grid.score(X_test, y_test)

In [None]:
regr = SVR(C=1e6)
pipe = make_pipeline(StandardScaler(), regr)
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

In [None]:
regr = SVR(C=1e9)
pipe = make_pipeline(StandardScaler(), regr)
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)
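
If you do want to stick with SVR, one option is to run the search with the target already standardized, so that sensible ranges for `C` and `epsilon` are easier to guess. The only fiddly part is the parameter names: `make_pipeline` names each step after its lowercased class name, and the wrapped estimator is reached through the wrapper's `regressor` parameter. A minimal sketch on synthetic stand-in data (an assumption; the grid ranges and `_demo` names are made up for illustration):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (an assumption), with a tiny target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (300 + 50 * X_demo[:, 0] - 30 * X_demo[:, 1]
          + rng.normal(scale=5, size=200)) / 1e6

ttr = TransformedTargetRegressor(regressor=SVR(), transformer=StandardScaler())
pipe_demo = make_pipeline(StandardScaler(), ttr)

# Step name, then the wrapper's `regressor` parameter, then the SVR argument.
param_grid = {
    'transformedtargetregressor__regressor__C': np.logspace(-1, 2, 4),
    'transformedtargetregressor__regressor__epsilon': np.logspace(-3, 0, 4),
}
grid_demo = GridSearchCV(pipe_demo, param_grid=param_grid, cv=3)
grid_demo.fit(X_demo, y_demo)
print(grid_demo.best_params_, grid_demo.best_score_)
```

Because the scaler is inside the pipeline, it is refit on each training fold, so the cross-validation scores are not contaminated by the held-out targets.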

Conclusion: it's probably better to just use another algorithm rather than trying to rescue stability from SVR or MLP.

---

&copy; 2023 Matt Hall, licensed CC BY