# Notebook 2: Least Squares Regression and Scaling

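This notebook fits linear least squares regression to the preprocessed wb_data indicators, with whr_data as ground truth, and examines how normalization and standardization affect the loss and the regression coefficients.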

In [5]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import os
import tqdm
import glob
import pandas as pd
import sklearn
from src import ana_utils as utils

#np.set_printoptions(suppress=True)
from sklearn.linear_model import LinearRegression

np.random.seed(7)

Import the datasets that were preprocessed in Notebook 1.

In [6]:
wb_data = pd.read_csv("data/wb_data.csv", index_col="Country Name")
wb_data_short = pd.read_csv("data/wb_data_short.csv", index_col="Country Name")
whr_data = pd.read_csv("data/whr_data.csv", index_col="Country name")

# test: are the same countries present in each dataset?
print(sorted(list(wb_data.index))==sorted(list(whr_data.index)))

True


Split the data into a training set and a test set. We choose an 80/20 split, i.e. 120 countries in the training set and 30 countries in the test set.

In [7]:
test_size = 30
train, test, train_gt, test_gt = utils.split_data(wb_data_short, whr_data, test_size)

# verify set shapes
print(train.shape, test.shape, train_gt.shape, test_gt.shape)

# verify that data order and ground-truth order and indices match
print(list(train.index)==list(train_gt.index), list(test.index)==list(test_gt.index))


(120, 120) (30, 120) (120, 1) (30, 1)
True True
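
The helper utils.split_data is not shown in this notebook. A minimal sketch of what such a split could look like, assuming a random split over countries in which features and ground truth are indexed with the same labels; `split_data_sketch` and the toy data are hypothetical and only illustrate the shape and order checks above:

```python
import numpy as np
import pandas as pd


def split_data_sketch(data, gt, test_size, seed=0):
    """Hypothetical stand-in for utils.split_data: random row split with aligned targets."""
    rng = np.random.default_rng(seed)
    # Shuffle the row labels once, then use the first `test_size` labels as the test set.
    shuffled = rng.permutation(data.index.to_numpy())
    test_idx, train_idx = shuffled[:test_size], shuffled[test_size:]
    # Index features and ground truth with the same labels so their order stays aligned.
    return data.loc[train_idx], data.loc[test_idx], gt.loc[train_idx], gt.loc[test_idx]


# Toy data: 10 "countries", 3 indicators, 1 target column.
countries = [f"country_{i}" for i in range(10)]
features = pd.DataFrame(np.arange(30).reshape(10, 3), index=countries)
target = pd.DataFrame({"score": np.arange(10.0)}, index=countries)

train, test, train_gt, test_gt = split_data_sketch(features, target, test_size=2)
print(train.shape, test.shape, train_gt.shape, test_gt.shape)
print(list(train.index) == list(train_gt.index), list(test.index) == list(test_gt.index))
```

Selecting both frames by label rather than by position is what keeps each country's indicators paired with its own target value.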


## Linear regression


Let's see how linear least squares regression performs on wb_data and wb_data_short (redundant indicators removed). We choose 1000-fold validation after noticing considerable variance in the results for lower n.

In [8]:
least_squares = sklearn.linear_model.LinearRegression()

# full set
loss_list, mean_loss, mean_train_loss, coef_list, avg_coefs, adjusted_r_squared = utils.n_fold_ceval(reg_model=least_squares, n=1000, data=wb_data, gt=whr_data, test_size=test_size, scaling="no_scaling", calc_adj_r_squared=True)
print("Mean test loss (full set of indicators):", mean_loss)
print("Mean train loss (full set of indicators):", mean_train_loss)
print("Adjusted R^2:", adjusted_r_squared)
print("The average size of the first ten coefficients (full set of indicators):", avg_coefs[:10], "\n")

# manually reduced set
loss_list, mean_loss, mean_train_loss, coef_list, avg_coefs, adjusted_r_squared = utils.n_fold_ceval(reg_model=least_squares, n=1000, data=wb_data_short, gt=whr_data, test_size=test_size, scaling="no_scaling", calc_adj_r_squared=True)
print("Mean test loss (reduced set of indicators):", mean_loss)
print("Mean train loss (reduced set of indicators):", mean_train_loss)
print("Adjusted R^2:", adjusted_r_squared)
print("The average size of the first ten coefficients (reduced set of indicators):", avg_coefs[:10])

ValueError: not enough values to unpack (expected 6, got 4)
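
The helper utils.n_fold_ceval is defined outside this notebook. As a rough sketch of the idea behind repeated evaluation, and not the helper's actual implementation, the following assumes that the loss is the mean squared error and that every repetition draws a fresh random train/test split. It uses np.linalg.lstsq instead of scikit-learn's LinearRegression to stay dependency-free; `repeated_split_mse` and the synthetic data are hypothetical.

```python
import numpy as np


def repeated_split_mse(X, y, n_repeats, test_size, seed=0):
    """Fit OLS on n_repeats random train/test splits; return per-split test and train MSE."""
    rng = np.random.default_rng(seed)
    X1 = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    test_losses, train_losses = [], []
    for _ in range(n_repeats):
        perm = rng.permutation(len(X))
        test_idx, train_idx = perm[:test_size], perm[test_size:]
        # Least-squares fit on the training rows only.
        coef, *_ = np.linalg.lstsq(X1[train_idx], y[train_idx], rcond=None)
        test_losses.append(np.mean((X1[test_idx] @ coef - y[test_idx]) ** 2))
        train_losses.append(np.mean((X1[train_idx] @ coef - y[train_idx]) ** 2))
    return np.array(test_losses), np.array(train_losses)


# Synthetic data: 150 samples, 5 features, linear target plus noise.
rng = np.random.default_rng(7)
X = rng.normal(size=(150, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=150)

few_test, _ = repeated_split_mse(X, y, n_repeats=10, test_size=30)
many_test, many_train = repeated_split_mse(X, y, n_repeats=1000, test_size=30)
print("mean test MSE, 10 splits:   ", few_test.mean())
print("mean test MSE, 1000 splits: ", many_test.mean())
print("mean train MSE, 1000 splits:", many_train.mean())
# Spread of the per-split test losses: the reason a single split is unreliable.
print("std of per-split test MSE:  ", many_test.std(ddof=1))
```

With only 30 test rows, the loss of a single split fluctuates noticeably, so averaging over many splits gives a more stable estimate.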

Linear regression performs better after manually removing redundant indicators, but both results are still quite poor. 
We suspect multicollinearity to be a main reason for the poor performance. 
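
One way to check such a suspicion is to look at variance inflation factors (VIF) and the condition number of the feature matrix. The sketch below uses synthetic data, not the notebook's indicators, in which one feature is almost a linear combination of the other two; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 150
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + rng.normal(scale=0.05, size=n)  # nearly a linear combination of x1 and x2
X = np.column_stack([x1, x2, x3])

# Standardize columns so the condition number reflects collinearity, not units.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
corr = np.corrcoef(Z, rowvar=False)

# Variance inflation factors are the diagonal of the inverse correlation matrix.
vif = np.diag(np.linalg.inv(corr))
cond = np.linalg.cond(Z)

print("VIF per feature:", np.round(vif, 1))
print("condition number:", round(cond, 1))
```

A common rule of thumb treats VIF values above roughly 10 as a sign of problematic collinearity. Note that all three features are flagged here, not only the constructed one, because each of them can be predicted from the other two.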

Before dealing with multicollinearity, however, we want to normalize or standardize the data. In the end, we aim to compare coefficients, but as of now coefficient sizes vary drastically, so the features need to be rescaled.

We therefore also run the analysis on the normalized/standardized data along the way; otherwise we risk developing a model that works only on the unscaled data. 
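
To illustrate why unscaled coefficients are hard to compare, the following sketch builds two synthetic features that contribute equally to the target but are recorded in units that differ by a factor of 1000. The data and the helper `ols_coefs` are hypothetical and not part of the notebook's utilities.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
a = rng.normal(size=n)            # feature on a unit scale
b = 1000.0 * rng.normal(size=n)   # equally important feature, recorded in units 1000x larger
y = 2.0 * a + 2.0 * (b / 1000.0) + rng.normal(scale=0.1, size=n)


def ols_coefs(X, y):
    """Least-squares coefficients (an intercept is fitted but not returned)."""
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef[1:]


X = np.column_stack([a, b])
raw = ols_coefs(X, y)
standardized = ols_coefs((X - X.mean(axis=0)) / X.std(axis=0), y)

print("raw coefficients:         ", raw)           # second one is tiny only because of its units
print("standardized coefficients:", standardized)  # both of similar size
```

The predictions are the same in both cases; only the coefficient magnitudes change, which is what matters when coefficients are to be compared.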

### Normalization 

Here, we normalize each row, using the L2 norm. That is, for each country the indicator values are scaled such that the sum of all squared indicator values is one.
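
As a small worked example of this row-wise scaling, the sketch below normalizes a toy matrix with NumPy. How the "normalize" option is implemented inside utils.n_fold_ceval is not shown here, so this only illustrates the operation described above; sklearn.preprocessing.normalize with norm="l2" performs the same row-wise operation by default.

```python
import numpy as np

# Two toy "countries" with three indicator values each.
X = np.array([[3.0, 4.0, 0.0],
              [1.0, 2.0, 2.0]])

# Divide each row by its Euclidean (L2) length.
row_norms = np.linalg.norm(X, axis=1, keepdims=True)
X_normalized = X / row_norms

print(X_normalized)
print((X_normalized ** 2).sum(axis=1))  # each row's squared values sum to 1
```

The first row has length 5 and becomes (0.6, 0.8, 0), the second has length 3 and becomes (1/3, 2/3, 2/3). A row consisting only of zeros would need special handling, since its norm is zero.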

In [None]:
least_squares = sklearn.linear_model.LinearRegression()

# full set
loss_list, mean_loss, mean_train_loss, coef_list, avg_coefs, adjusted_r_squared = utils.n_fold_ceval(reg_model=least_squares, n=1000, data=wb_data, gt=whr_data, test_size=test_size, scaling="normalize", calc_adj_r_squared=True)
print("Mean test loss (full set of indicators):", mean_loss)
print("Mean train loss (full set of indicators):", mean_train_loss)
print("Adjusted R^2:", adjusted_r_squared)
print("The average size of the first ten coefficients (full set of indicators):", avg_coefs[:10], "\n")

# manually reduced set
loss_list, mean_loss, mean_train_loss, coef_list, avg_coefs, adjusted_r_squared = utils.n_fold_ceval(reg_model=least_squares, n=1000, data=wb_data_short, gt=whr_data, test_size=test_size, scaling="normalize", calc_adj_r_squared=True)
print("Mean test loss (reduced set of indicators):", mean_loss)
print("Mean train loss (reduced set of indicators):", mean_train_loss)
print("Adjusted R^2:", adjusted_r_squared)
print("The average size of the first ten coefficients (reduced set of indicators):", avg_coefs[:10])

Mean loss (full set of indicators): 31.28188481600015
The average size of the first ten coefficients ((full set of indicators)): [ 28.7335 -15.5441  -0.1055  -0.523   -0.7349  60.7378  21.7037  67.2968
   0.1485 -20.7325] 

Mean loss (reduced set of indicators): 4.944546921287039
The average size of the first ten coefficients (reduced set of indicators): [ 21.4104  -7.3186   1.0311  -0.4875  -3.9778 101.6384 116.0385  23.1098
  -5.2663   1.4116]


### Standardization
Each value $x$ is scaled with the formula $\frac{x-\mu}{\sigma}$, where $\mu$ and $\sigma$ denote the mean and the standard deviation.
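
A minimal sketch of this formula, assuming that $\mu$ and $\sigma$ are computed per indicator (column) and estimated on the training rows only, which is the usual way to avoid leaking test information. Whether utils.n_fold_ceval does exactly this is not shown in this notebook, and the data below is synthetic. scikit-learn's StandardScaler follows the same fit-on-train, transform-both pattern.

```python
import numpy as np

rng = np.random.default_rng(2)
train = rng.normal(loc=50.0, scale=10.0, size=(120, 4))
test = rng.normal(loc=50.0, scale=10.0, size=(30, 4))

# Estimate mu and sigma per column on the training rows only ...
mu = train.mean(axis=0)
sigma = train.std(axis=0)

# ... and apply the same transformation to both sets.
train_std = (train - mu) / sigma
test_std = (test - mu) / sigma

print(train_std.mean(axis=0).round(6), train_std.std(axis=0).round(6))
print(test_std.mean(axis=0).round(2), test_std.std(axis=0).round(2))
```

The training columns end up with mean 0 and standard deviation 1 by construction, while the test columns are only approximately centered and scaled.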

In [None]:
least_squares = sklearn.linear_model.LinearRegression()

# full set
loss_list, mean_loss, mean_train_loss, coef_list, avg_coefs, adjusted_r_squared = utils.n_fold_ceval(reg_model=least_squares, n=1000, data=wb_data, gt=whr_data, test_size=test_size, scaling="standardize", calc_adj_r_squared=True)
print("Mean test loss (full set of indicators):", mean_loss)
print("Mean train loss (full set of indicators):", mean_train_loss)
print("Adjusted R^2:", adjusted_r_squared)
print("The average size of the first ten coefficients (full set of indicators):", avg_coefs[:10], "\n")

# manually reduced set
loss_list, mean_loss, mean_train_loss, coef_list, avg_coefs, adjusted_r_squared = utils.n_fold_ceval(reg_model=least_squares, n=1000, data=wb_data_short, gt=whr_data, test_size=test_size, scaling="standardize", calc_adj_r_squared=True)
print("Mean test loss (reduced set of indicators):", mean_loss)
print("Mean train loss (reduced set of indicators):", mean_train_loss)
print("Adjusted R^2:", adjusted_r_squared)
print("The average size of the first ten coefficients (reduced set of indicators):", avg_coefs[:10])

Mean loss (full set of indicators): 25.08420041301712
The average size of the first ten coefficients ((full set of indicators)): [ 0.6734 -0.1709  0.0026 -0.0407 -0.049   2.535   1.4409  1.1724  0.2123
 -0.9681] 

Mean loss (reduced set of indicators): 5.3233445138595075
The average size of the first ten coefficients (reduced set of indicators): [ 0.5083 -0.0975  0.0848 -0.0108 -0.189   4.9715  4.1716  0.674  -0.3332
  0.0779]


### Conclusion for Standardization/Normalization
Both methods seem to worsen the mean loss of the regression on the full dataset (31.28 with normalization, 25.08 with standardization). However, they do not change the mean loss notably for the manually reduced set of indicators (4.94 and 5.32, respectively).  
Therefore, from now on we normalize the data in order to make the resulting regression coefficients more interpretable.