# Notebook 2: Conducting and Evaluating Regression Analysis

In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import os
import tqdm
import glob
import pandas as pd
import sklearn
from src import ana_utils as utils

#np.set_printoptions(suppress=True)
from sklearn.linear_model import LinearRegression

np.random.seed(7)

Import datasets that were preprocessed in Notebook 1

In [2]:
wb_data = pd.read_csv("data/wb_data.csv", index_col="Country Name")
wb_data_short = pd.read_csv("data/wb_data_short.csv", index_col="Country Name")
whr_data = pd.read_csv("data/whr_data.csv", index_col="Country name")

# test: are the same countries present in each dataset?
print(sorted(list(wb_data.index))==sorted(list(whr_data.index)))

True


Split the data into a training set and a test set. We choose an 80/20 split, i.e. 120 countries in the training set and 30 countries in the test set.

In [3]:
test_size = 30
train, test, train_gt, test_gt = utils.split_data(wb_data_short, whr_data, test_size)

# verify set shapes
print(train.shape, test.shape, train_gt.shape, test_gt.shape)

# verify that data order and ground-truth order and indices match
print(list(train.index)==list(train_gt.index), list(test.index)==list(test_gt.index))


(120, 120) (30, 120) (120, 1) (30, 1)
True True
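
The helper `utils.split_data` is project code that is not shown in this notebook. As a rough sketch of what such a split could look like, assuming it shuffles the countries and holds out `test_size` of them, consider the following. The function `split_by_country` and the toy data are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd


def split_by_country(features, target, test_size, rng):
    """Hold out `test_size` random rows (hypothetical stand-in, not utils.split_data)."""
    # Align the target to the feature index so rows correspond by country
    target = target.loc[features.index]
    shuffled = rng.permutation(features.index.to_numpy())
    test_idx, train_idx = shuffled[:test_size], shuffled[test_size:]
    return (features.loc[train_idx], features.loc[test_idx],
            target.loc[train_idx], target.loc[test_idx])


# Toy data: 150 made-up "countries" with 5 indicators each
rng = np.random.default_rng(7)
countries = [f"country_{i}" for i in range(150)]
X = pd.DataFrame(rng.normal(size=(150, 5)), index=countries)
y = pd.DataFrame({"score": rng.normal(size=150)}, index=countries)

train, test, train_gt, test_gt = split_by_country(X, y, 30, rng)
print(train.shape, test.shape, train_gt.shape, test_gt.shape)
print(list(train.index) == list(train_gt.index))
```

Selecting features and ground truth with the same index labels is what keeps the row order of both frames in sync.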


## Linear regression


Let's see how linear regression performs on wb_data and wb_data_short (redundant indicators removed). We repeat the evaluation 2000 times and average the loss, because the mean loss varied considerably between runs for lower n.
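
To illustrate the idea of averaging over many random splits, here is a minimal sketch on synthetic data. It is not the implementation of `utils.n_fold_ceval`: the helper `repeated_holdout`, the use of mean squared error as the loss, and the toy data are all assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


def repeated_holdout(n_repeats, X, y, test_size, model, rng):
    """Test MSE per random split and coefficients averaged over all splits (hypothetical helper)."""
    losses, coefs = [], []
    for _ in range(n_repeats):
        order = rng.permutation(len(X))
        test_idx, train_idx = order[:test_size], order[test_size:]
        model.fit(X[train_idx], y[train_idx])
        losses.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))
        coefs.append(model.coef_.copy())
    return np.array(losses), np.mean(coefs, axis=0)


# Synthetic data with known coefficients and a little noise
rng = np.random.default_rng(7)
X = rng.normal(size=(150, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=150)

losses, avg_coefs = repeated_holdout(200, X, y, 30, LinearRegression(), rng)
print("mean loss:", losses.mean())
print("standard error of the mean loss:", losses.std(ddof=1) / np.sqrt(len(losses)))
print("average coefficients:", avg_coefs)
```

The standard error of the mean loss shrinks with the square root of the number of repetitions, which gives one way to judge how many repetitions are enough.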

In [4]:
least_squares = sklearn.linear_model.LinearRegression()

In [5]:
# For the full dense indicator data
loss_list, mean_loss, coef_list, avg_coefs = utils.n_fold_ceval(2000, wb_data, whr_data, test_size, scaling="no_scaling", reg_model=least_squares)
print("Mean loss (full set of indicators):", mean_loss)
print("The average size of the first ten coefficients (full set of indicators):", avg_coefs[:10], "\n")

# For the manually reduced indicator data
loss_list, mean_loss, coef_list, avg_coefs = utils.n_fold_ceval(2000, wb_data_short, whr_data, test_size, scaling="no_scaling", reg_model=least_squares)
print("Mean loss (reduced set of indicators):", mean_loss)
print("The average size of the first ten coefficients (reduced set of indicators):", avg_coefs[:10])

Mean loss (full set of indicators): 14.448268503467897
The average size of the first ten coefficients (full set of indicators): [[ 1.530e-02 -8.200e-03 -0.000e+00 -0.000e+00 -2.900e-03  6.360e-02
  -9.500e-03  7.310e-02  7.740e-02 -8.620e-02 -2.280e-02  1.198e-01
  -1.645e-01  1.871e-01 -1.450e-02 -2.496e-01 -2.301e-01  5.030e-02
   1.144e-01  2.140e-02 -5.950e-02  5.000e-04 -1.047e-01  1.590e-02
   3.990e-02 -2.330e-01  1.970e-02  4.478e-01 -2.379e-01 -6.268e-01
   8.632e-01 -6.018e-01  3.000e-04 -1.028e-01  1.964e-01 -7.580e-02
   8.270e-02  1.303e-01 -6.590e-02  2.219e-01 -3.221e-01 -2.942e-01
  -1.801e-01 -0.000e+00 -0.000e+00  6.693e-01  1.131e-01 -5.282e-01
  -8.860e-02  0.000e+00  1.250e-02 -1.030e-02  9.700e-03  1.000e-02
   1.560e-02  1.780e-02 -6.800e-03 -0.000e+00  2.300e-03  2.970e-02
   1.840e-02  3.090e-02  3.460e-02 -3.600e-02  1.570e-02  1.830e-02
  -2.578e-01  1.941e-01 -1.776e-01  9.690e-02  2.197e-01 -4.080e-02
   2.198e-01  2.919e-01 -7.377e-01  1.162e-01  3.403e-

Although linear regression performs better after manually removing redundancies, both results are still quite poor. 
We suspect multicollinearity to be a main reason for the bad performance. 
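
One common way to check such a suspicion is the condition number of the column-standardized design matrix: very large values indicate that some columns are nearly linear combinations of others. The sketch below uses synthetic data, not the World Bank indicators, and the helper `condition_number` is a name introduced only for this example.

```python
import numpy as np


def condition_number(X):
    """Condition number of X after standardizing each column."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    return np.linalg.cond(Z)


rng = np.random.default_rng(7)
n = 150
a = rng.normal(size=n)
b = rng.normal(size=n)
# Third column is almost exactly the sum of the first two
c = a + b + rng.normal(scale=1e-3, size=n)

X_independent = np.column_stack([a, b, rng.normal(size=n)])
X_collinear = np.column_stack([a, b, c])

print("independent columns:", condition_number(X_independent))
print("collinear columns:  ", condition_number(X_collinear))
```

With nearly dependent columns, least-squares coefficients become unstable: small changes in the training data can produce large swings in individual coefficients.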

Before addressing multicollinearity, we normalize or standardize the data, because in the end we aim to compare coefficients. Performing the analysis on the normalized/standardized data along the way keeps us from developing a model that works only on unscaled data.

### Normalization 

Here, we normalize each row, using the L2 norm. That is, for each country the indicator values are scaled such that the sum of all squared indicator values is one.
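
As a small worked example of row-wise L2 normalization, the sketch below divides each row of a toy matrix by its Euclidean norm. The matrix is made up for illustration. How `utils.n_fold_ceval` applies `scaling="normalize"` internally is not shown in this notebook, so this is only a sketch of the operation described above.

```python
import numpy as np

# Two toy "countries" with three indicator values each
X = np.array([[3.0, 4.0, 0.0],
              [1.0, 2.0, 2.0]])

# Divide each row by its Euclidean (L2) norm
row_norms = np.linalg.norm(X, axis=1, keepdims=True)
X_normalized = X / row_norms

print(X_normalized)
print((X_normalized ** 2).sum(axis=1))  # each row's squared values sum to 1
```

`sklearn.preprocessing.normalize` performs the same row-wise L2 scaling with its default arguments.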

In [6]:
# For the full dense indicator data
loss_list, mean_loss, coef_list, avg_coefs = utils.n_fold_ceval(2000, wb_data, whr_data, test_size, scaling="normalize", reg_model=least_squares)
print("Mean loss (full set of indicators):", mean_loss)
print("The average size of the first ten coefficients (full set of indicators):", avg_coefs[:10], "\n")

# For the manually reduced indicator data
loss_list, mean_loss, coef_list, avg_coefs = utils.n_fold_ceval(2000, wb_data_short, whr_data, test_size, scaling="normalize", reg_model=least_squares)
print("Mean loss (reduced set of indicators):", mean_loss)
print("The average size of the first ten coefficients (reduced set of indicators):", avg_coefs[:10])

Mean loss (full set of indicators): 31.28188481600015
The average size of the first ten coefficients (full set of indicators): [[ 2.873350e+01 -1.554410e+01 -1.055000e-01 -5.230000e-01 -7.349000e-01
   6.073780e+01  2.170370e+01  6.729680e+01  1.485000e-01 -2.073250e+01
  -3.417900e+00  2.036860e+01 -1.383680e+01  5.558600e+00  1.046250e+01
  -2.308350e+01 -5.012930e+01  6.280900e+00  3.571740e+01  1.436430e+01
  -9.814000e+00  4.462000e+00 -2.309000e+01 -3.169000e+00  1.969130e+01
  -6.239250e+01  2.913470e+01  5.520390e+01  9.847900e+00 -2.179852e+02
   2.015448e+02 -1.038450e+01  6.337000e-01 -1.105073e+02  6.661500e+01
   5.759830e+01  4.759510e+01 -9.912370e+01 -7.942900e+01  7.162820e+01
  -5.639740e+01  1.020272e+02 -1.122523e+02 -8.484800e+00 -1.880000e+00
   3.867868e+02 -2.049211e+02 -3.147230e+01 -7.710000e+00  1.069610e+01
   5.160000e-02 -1.626900e+00  5.195700e+00  1.179200e+00  1.087000e+00
   1.776400e+00 -2.367700e+00 -1.417010e+01  5.231000e-01  1.433160e+01
   2.06

### Standardization
Each value $x$ is rescaled to $\frac{x-\mu}{\sigma}$, where $\mu$ is the mean and $\sigma$ the standard deviation.
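
The formula can be sketched in a few lines of NumPy. This example assumes that $\mu$ and $\sigma$ are computed per indicator (per column), which is the usual convention, though the notebook does not state it. The toy matrix is made up for illustration.

```python
import numpy as np

# Three toy "countries" (rows) with two indicators (columns) on different scales
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

mu = X.mean(axis=0)      # per-indicator mean
sigma = X.std(axis=0)    # per-indicator population standard deviation (ddof=0)
X_standardized = (X - mu) / sigma

print(X_standardized.mean(axis=0))  # approximately 0 for every column
print(X_standardized.std(axis=0))   # 1 for every column
```

In a train/test setting, $\mu$ and $\sigma$ are usually estimated on the training set only and then reused to transform the test set, so that no information leaks from the test data.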

In [7]:
# For the full dense indicator data
loss_list, mean_loss, coef_list, avg_coefs = utils.n_fold_ceval(2000, wb_data, whr_data, test_size, scaling="standardize", reg_model=least_squares)
print("Mean loss (full set of indicators):", mean_loss)
print("The average size of the first ten coefficients (full set of indicators):", avg_coefs[:10], "\n")

# For the manually reduced indicator data
loss_list, mean_loss, coef_list, avg_coefs = utils.n_fold_ceval(2000, wb_data_short, whr_data, test_size, scaling="standardize", reg_model=least_squares)
print("Mean loss (reduced set of indicators):", mean_loss)
print("The average size of the first ten coefficients (reduced set of indicators):", avg_coefs[:10])

Mean loss (full set of indicators): 25.08420041301712
The average size of the first ten coefficients (full set of indicators): [[ 6.7340e-01 -1.7090e-01  2.6000e-03 -4.0700e-02 -4.9000e-02  2.5350e+00
   1.4409e+00  1.1724e+00  2.1230e-01 -9.6810e-01 -3.4300e-02  7.4930e-01
  -2.7600e-01  2.2900e-01  6.6260e-01 -1.1026e+00 -1.0718e+00 -8.1400e-02
   8.9340e-01  9.2850e-01 -5.0190e-01 -2.6360e-01 -1.1950e-01  2.6500e-02
   1.8290e-01 -1.8295e+00  1.0490e-01  6.7160e-01  6.2610e-01 -4.6759e+00
   5.0977e+00 -5.4920e-01  2.3400e-02 -3.4373e+00  1.0347e+00  1.3878e+00
   1.7285e+00 -2.1630e+00 -4.4370e-01  4.3790e-01 -1.1262e+00  1.5499e+00
  -2.1301e+00 -4.5770e-01 -7.7400e-02  5.4737e+00 -2.9100e-01 -4.0696e+00
  -7.3700e-02  8.9030e-01  2.5800e-02 -1.1270e-01  1.8870e-01  8.7600e-02
   9.6800e-02  1.2370e-01 -1.2120e-01 -1.1341e+00  5.0800e-02  4.7290e-01
   1.1140e-01  4.1510e-01  1.8290e-01 -1.6430e-01  1.8300e-01  2.3420e-01
  -1.1241e+00  1.5586e+00 -1.5860e-01 -2.0394e+00  1.3440

### Conclusion for standardization/normalization
Both methods increase the mean loss for the regression on the full dataset (from about 14.4 without scaling to about 31.3 with normalization and 25.1 with standardization). However, they do not change the mean loss substantially for the manually reduced set of indicators.  
From now on, we therefore normalize the data to make the resulting regression coefficients easier to interpret.