# Comparing R2 between different datasets!

### "Calculating R-squared on the testing data is a little tricky, as you have to remember what your baseline is. Your baseline projection is a mean of your training data." - https://stackoverflow.com/questions/25691127/r-squared-on-test-data

### "Second, let us see what using R2 for model choice/evaluation means. Suppose we choose from a set of predictions $\bar{Y}_M$ that were generated using a model $M: M \in \mathcal{M}$, where $\mathcal{M}$ is the collection of models under consideration (in your example, this collection would contain Neural networks, random forests, elastic nets, ...). Since SST will remain constant amongst all the models, if minimizing R2 you will choose exactly the model that minimizes SSR. In other words, you will choose $M \in \mathcal{M}$ that produces the minimal square error loss!" - https://stats.stackexchange.com/questions/83948/is-r-squared-value-appropriate-for-comparing-models

$$SSE\_residuals = \sum_{i=1}^{m}(y_i - \hat{y}_i) ^{2}$$

$$SSE\_total = \sum_{i=1}^{m}(y_i - \bar{y}) ^{2}$$

$$R^2 = 1 - \frac{SSE\_residuals}{SSE\_total}$$

$R^2$ will be high when either:
* $SSE\_residuals$ is low --> as it approaches 0, the ratio goes to 0 and $R^2$ goes to 1
* $SSE\_total$ is high --> a large denominator also brings the ratio close to 0
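
To make the second point concrete, here is a minimal sketch on made-up toy data (the arrays and the `r_squared` helper are illustrative assumptions, not part of the original analysis). Both cases have exactly the same residuals, yet the dataset with the larger spread gets a much higher $R^2$:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SSE_residuals / SSE_total, using the mean of y_true as baseline."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sse_residuals = np.sum((y_true - y_pred) ** 2)
    sse_total = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - sse_residuals / sse_total

# Hypothetical toy data: every prediction is off by exactly 1 in both cases,
# so SSE_residuals is 4 for both datasets.
errors = np.array([1.0, -1.0, 1.0, -1.0])

y_narrow = np.array([1.0, 2.0, 3.0, 4.0])     # low spread  -> SSE_total = 5
y_wide = np.array([10.0, 20.0, 30.0, 40.0])   # high spread -> SSE_total = 500

r2_narrow = r_squared(y_narrow, y_narrow + errors)
r2_wide = r_squared(y_wide, y_wide + errors)

print('R2 narrow ' + str(round(r2_narrow, 3)))  # 0.2
print('R2 wide ' + str(round(r2_wide, 3)))      # 0.992
```

The model error is identical in both cases; only the normalisation factor changed.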

$SSE\_residuals$ depends on the model, but the baseline / normalisation factor ($SSE\_total$) should be the same if we want to compare multiple runs. How do the two compare in our dataset?

In [18]:

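import numpy as np

# y_train, y_pred_train, y_val and y_pred_val come from earlier cells.

# Training set: model error vs. spread around the training mean
SSE_residuals_train = np.sum((y_train - y_pred_train)**2)
SSE_total_train = np.sum((y_train - np.mean(y_train))**2)
r_2_train = 1 - (SSE_residuals_train)/SSE_total_train

# Validation set: model error vs. spread around the validation mean
SSE_residuals_validation = np.sum((y_val - y_pred_val)**2)
SSE_total_validation = np.sum((y_val - np.mean(y_val))**2)
r_2_val = 1 - (SSE_residuals_validation)/SSE_total_validation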


print('--- TRAIN')
print('SSE_residuals_train ' + str(SSE_residuals_train))
print('SSE_total_train ' + str(SSE_total_train))
print('R2 train ' + str(r_2_train))
print('\n--- validation')
print('SSE_residuals_validation ' + str(SSE_residuals_validation))
print('SSE_total_validation ' + str(SSE_total_validation))
print('R2 val ' + str(r_2_val))

print('\nRatio between SSE_residuals_train and SSE_residuals_validation ' + str(SSE_residuals_train/SSE_residuals_validation))
print('Ratio between SSE_total_train and SSE_total_validation ' + str(SSE_total_train/SSE_total_validation))

--- TRAIN
SSE_residuals_train 11926839144.44499
SSE_total_train 81514569108.86674
R2 train 0.8536845710548245

--- validation
SSE_residuals_validation 12026156003.659708
SSE_total_validation 40894736795.43265
R2 val 0.7059241128309974

Ratio between SSE_residuals_train and SSE_residuals_validation 0.9917415956366694
Ratio between SSE_total_train and SSE_total_validation 1.9932777539717714


SSE_total_train is almost **double** SSE_total_validation, while the two SSE_residuals values are almost equal. For both $R^2$ values to match, the model would therefore have to perform **twice** as well on the validation set, i.e. SSE_residuals_validation would have to be cut in half. Let's check:

In [20]:
SSE_residuals_validation_SUPER = np.sum((y_val - y_pred_val)**2) / 2 # dividing the error in half!
r_2_val_SUPER = 1 - (SSE_residuals_validation_SUPER)/SSE_total_validation

print('SSE_residuals_train ' + str(SSE_residuals_train))
print('SSE_residuals_validation_SUPER ' + str(SSE_residuals_validation_SUPER))
print('\nRatio between SSE_residuals_train and SSE_residuals_validation_SUPER ' + str(SSE_residuals_train/SSE_residuals_validation_SUPER))

print('R2 train ' + str(r_2_train))
print('R2 val ' + str(r_2_val_SUPER))

SSE_residuals_train 11926839144.44499
SSE_residuals_validation_SUPER 6013078001.829854

Ratio between SSE_residuals_train and SSE_residuals_validation_SUPER 1.9834831912733388
R2 train 0.8536845710548245
R2 val 0.8529620564154987


# So what?

## R2 is useful for comparing the performance of different models on the **same** dataset. This avoids having different $SSE\_total$ values and therefore different baselines

## If we are comparing training and validation (apart from comparing them inside a CV setup) we can establish the training baseline in order to calculate the validation R2, *i.e.*:

$$R^2_{train} = 1 - \frac{SSE\_residuals_{train}}{SSE\_total_{train}}$$  
  
  
$$R^2_{val} = 1 - \frac{SSE\_residuals_{val}}{SSE\_total_{train}}$$
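
As a rough sketch of what this looks like with the numbers printed earlier, the snippet below plugs the three SSE values into the two formulas above. The constants are copied from the output of the first cell; the `r_squared` helper and the name `r_2_val_train_baseline` are illustrative assumptions.

```python
# Values copied from the printed output of the first cell.
SSE_residuals_train = 11926839144.44499
SSE_total_train = 81514569108.86674
SSE_residuals_validation = 12026156003.659708

def r_squared(sse_residuals, sse_total):
    """R^2 = 1 - SSE_residuals / SSE_total (hypothetical helper)."""
    return 1 - sse_residuals / sse_total

# Both scores share the same normalisation factor: SSE_total_train.
r_2_train = r_squared(SSE_residuals_train, SSE_total_train)
r_2_val_train_baseline = r_squared(SSE_residuals_validation, SSE_total_train)

print('R2 train ' + str(r_2_train))
print('R2 val (training baseline) ' + str(r_2_val_train_baseline))
```

With a shared denominator the two scores differ only through their residuals, so the validation value lands at roughly 0.85, just below the training score. The Stack Overflow quote at the top describes a closely related baseline, the mean of the training targets; evaluating that variant would require the raw `y_val` and `y_train` arrays, which are not reproduced here.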