## CROSS VALIDATION FOR DIFFERENT MODELS

In [7]:
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import helpers
from implementations import *
from crossvalidation import *
from preprocessing import *
from dataset_splitting import *
from feature_engineering import *
%matplotlib inline
%load_ext autoreload
%autoreload 2



In [8]:
# Load the training data
filename = 'train.csv'
data_folder = './data/'
file_path = data_folder + filename
y, tx, ids, features = load_train_data(file_path)

# Run the preprocessing routine, which returns three subsets and their labels
list_subsets, list_features, y_0, y_1, y_2_3, columns_to_drop_in_subsets = preprocessing(tx, y, ids, features)

# Standardize each of the three subsets
for idx in range(3):
    list_subsets[idx], mean, std = standardize(list_subsets[idx])

We want to introduce interaction terms between variables during the preprocessing routine. 
However, there is no statistically significant evidence that multiplying variables by the trigonometric 
features improves model performance. Since the trigonometric features are the last columns of each 
dataset, we store in a list how many trailing columns of each subset are trigonometric, so that they 
can be excluded from the products later.

In [9]:
how_many_trig_features=[2,1,2]
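
The exclusion described above can be sketched as follows. This is a minimal illustration, not the code in `feature_engineering`: `add_interactions` is a hypothetical helper, and it assumes that the interaction terms are plain pairwise products of the non-trigonometric columns.

```python
import numpy as np

def add_interactions(x, n_trig):
    """Append pairwise products of the non-trigonometric columns of x.

    Hypothetical helper: the last n_trig columns are assumed to be the
    trigonometric features and are left out of every product.
    """
    n_base = x.shape[1] - n_trig
    rows, cols = np.triu_indices(n_base, k=1)  # all column pairs i < j
    products = x[:, rows] * x[:, cols]
    return np.hstack([x, products])

# Toy matrix: three ordinary columns followed by one trigonometric column
x = np.array([[1.0, 2.0, 3.0, 0.5],
              [4.0, 5.0, 6.0, -0.5]])
x_aug = add_interactions(x, n_trig=1)
print(x_aug.shape)  # (2, 7): 4 original columns + 3 pairwise products
```

The trigonometric column is kept in the output unchanged; it is only skipped when forming products.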

Defining parameters to test

In [10]:
lambdas = np.logspace(-6,-3,5)
degrees = [3,5,7]
k_fold = 4
gamma = 0.1
max_iters = 200
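
To make the grid concrete, the snippet below prints the five candidate values produced by `np.logspace(-6,-3,5)`, both rounded to five decimals and in scientific notation. Note that the smallest candidate, `1e-6`, rounds to `0.00000` at five decimals, so a result reported with that precision may hide a nonzero lambda (this assumes the result is printed with five-decimal formatting).

```python
import numpy as np

lambdas = np.logspace(-6, -3, 5)  # exponents -6, -5.25, -4.5, -3.75, -3

for lam in lambdas:
    # Five-decimal rounding next to scientific notation
    print(f"{lam:.5f}  ({lam:.2e})")
```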

Defining lists to save optimal degrees and lambdas for each subset

In [8]:
optimal_lambdas = [0]*3
optimal_degrees = [1]*3

#### RIDGE REGRESSION

Doing cross validation on subsets_0 for ridge regression

In [6]:
optimal_degrees[0], optimal_lambdas[0], best_rmse = cross_validation_demo_ridge(y_0, list_subsets[0], k_fold, lambdas, degrees,
                                                                               how_many_trig_features[0])

The choice of lambda which leads to the best test rmse is 0.00018 with a test rmse of 0.335. The best degree is 7.0


Doing cross validation on subsets_1 for ridge regression

In [7]:
optimal_degrees[1], optimal_lambdas[1], best_rmse = cross_validation_demo_ridge(y_1, list_subsets[1], k_fold, lambdas, degrees,
                                                                               how_many_trig_features[1])

The choice of lambda which leads to the best test rmse is 0.00018 with a test rmse of 0.371. The best degree is 7.0


Doing cross validation on subsets_2_3 for ridge regression

In [8]:
optimal_degrees[2], optimal_lambdas[2], best_rmse = cross_validation_demo_ridge(y_2_3, list_subsets[2], k_fold, lambdas, degrees,
                                                                               how_many_trig_features[2])

The choice of lambda which leads to the best test rmse is 0.00003 with a test rmse of 0.347. The best degree is 7.0
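
For reference, a grid search of this kind can be sketched as below. The internals of `cross_validation_demo_ridge` are not shown in this notebook, so everything here is an assumption: the helper names (`build_poly`, `ridge_fit`, `rmse`, `cv_ridge`) are hypothetical, the penalty is scaled by `2 * N * lambda_` as one common convention, and the handling of trigonometric columns is omitted.

```python
import numpy as np

def build_poly(x, degree):
    """Stack a bias column and the powers x, x**2, ..., x**degree."""
    powers = [x ** d for d in range(1, degree + 1)]
    return np.hstack([np.ones((x.shape[0], 1))] + powers)

def ridge_fit(y, tx, lambda_):
    """Solve (X^T X + 2 N lambda I) w = X^T y (assumed penalty scaling)."""
    n, d = tx.shape
    a = tx.T @ tx + 2 * n * lambda_ * np.eye(d)
    return np.linalg.solve(a, tx.T @ y)

def rmse(y, tx, w):
    """Root mean squared error of the linear prediction tx @ w."""
    return np.sqrt(np.mean((y - tx @ w) ** 2))

def cv_ridge(y, x, k_fold, lambdas, degrees, seed=1):
    """Return (best_degree, best_lambda, best_rmse) over the grid."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k_fold)
    best = (None, None, np.inf)
    for degree in degrees:
        tx = build_poly(x, degree)
        for lambda_ in lambdas:
            scores = []
            for k in range(k_fold):
                test_idx = folds[k]
                train_idx = np.concatenate(
                    [folds[j] for j in range(k_fold) if j != k])
                w = ridge_fit(y[train_idx], tx[train_idx], lambda_)
                scores.append(rmse(y[test_idx], tx[test_idx], w))
            mean_rmse = np.mean(scores)  # average test RMSE over the folds
            if mean_rmse < best[2]:
                best = (degree, lambda_, mean_rmse)
    return best

# Toy data: a noisy quadratic in one variable
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(200, 1))
y = 1.0 + 2.0 * x[:, 0] - 3.0 * x[:, 0] ** 2 + 0.05 * rng.normal(size=200)
best_degree, best_lambda, best_rmse = cv_ridge(
    y, x, 4, np.logspace(-6, -3, 5), [1, 2, 3])
print(best_degree, best_lambda, best_rmse)
```

Each (degree, lambda) pair is scored by its mean test RMSE over the folds, and the pair with the lowest score is kept.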


#### REGULARIZED LOGISTIC REGRESSION

Doing cross validation on subsets_0 for regularized logistic regression

In [33]:
best_degree,best_lambda,_ = cross_validation_demo_log(y_0[:50000], list_subsets[0][:50000], k_fold, lambdas, gamma, max_iters,degrees, how_many_trig_features[0])

The choice of lambda which leads to the best test logloss is 0.00000 with a test logloss of 0.360. The best degree is 3.0


Doing cross validation on subsets_1 for regularized logistic regression

In [11]:
best_degree,best_lambda,_ = cross_validation_demo_log(y_1, list_subsets[1], k_fold, lambdas, gamma, max_iters,degrees, how_many_trig_features[1])

The choice of lambda which leads to the best test logloss is 0.00000 with a test logloss of 0.419. The best degree is 3.0


Doing cross validation on subsets_2_3 for regularized logistic regression

In [12]:
best_degree,best_lambda,_ = cross_validation_demo_log(y_2_3, list_subsets[2], k_fold, lambdas, gamma, max_iters,degrees, how_many_trig_features[2])

The choice of lambda which leads to the best test logloss is 0.00000 with a test logloss of 0.377. The best degree is 3.0
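
The model being cross-validated here can be sketched as follows. This is an assumption-laden illustration rather than the code in `implementations`: the function names are hypothetical, labels are assumed to be in {0, 1}, the loss is averaged over samples, and the penalty is taken to be `lambda_ * ||w||^2`.

```python
import numpy as np

def sigmoid(t):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-t))

def log_loss(y, tx, w):
    """Mean negative log-likelihood for labels y in {0, 1}."""
    z = tx @ w
    return np.mean(np.logaddexp(0.0, z) - y * z)

def reg_logistic_gd(y, tx, lambda_, gamma, max_iters):
    """Gradient descent on log_loss + lambda_ * ||w||^2 with step size gamma."""
    w = np.zeros(tx.shape[1])
    for _ in range(max_iters):
        grad = tx.T @ (sigmoid(tx @ w) - y) / len(y) + 2 * lambda_ * w
        w = w - gamma * grad
    return w

# Toy data: two Gaussian features and a linear decision boundary
rng = np.random.default_rng(0)
x = rng.normal(size=(500, 2))
y = (x[:, 0] + 0.5 * x[:, 1] > 0).astype(float)
tx = np.hstack([np.ones((500, 1)), x])

w = reg_logistic_gd(y, tx, lambda_=1e-6, gamma=0.1, max_iters=200)
print(log_loss(y, tx, w))
```

With all-zero weights the log loss equals log(2), about 0.693, which gives a baseline for reading the log loss values reported above.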


#### COMPUTING ACCURACY FOR RIDGE REGRESSION

Computing accuracy for ridge regression using optimal values as hyperparameters

In [10]:
list_outputs = [y_0,y_1,y_2_3]

In [14]:
compute_accuracy(list_outputs,list_subsets,0.7,[0.00018,0.00018,0.00003],[7,7,7],how_many_trig_features,pred_threshold = 0.5)

Average train accuracy: 0.8356289020064241
std train accuracy: 0.00036995549985801624
Average test accuracy: 0.8351266427558743
std test accuracy: 0.0008986234722572053
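
The role of `pred_threshold` can be illustrated with a small sketch. It assumes that labels are encoded as 0/1 and that a sample is predicted positive when its score exceeds the threshold; `accuracy` is a hypothetical helper, not the notebook's `compute_accuracy`.

```python
import numpy as np

def accuracy(y_true, scores, pred_threshold=0.5):
    """Fraction of samples whose thresholded score matches the 0/1 label."""
    y_pred = (scores > pred_threshold).astype(int)
    return np.mean(y_pred == y_true)

y_true = np.array([1, 0, 1, 1, 0])
scores = np.array([0.9, 0.2, 0.52, 0.4, 0.6])

print(accuracy(y_true, scores, 0.5))   # 0.6
print(accuracy(y_true, scores, 0.55))  # 0.4
```

Raising the threshold from 0.5 to 0.55 flips the third sample to the negative class, which is why the two calls disagree on this toy input.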


#### COMPUTING ACCURACY FOR REGULARIZED LOGISTIC REGRESSION

Computing accuracy for regularized logistic regression using optimal values as hyperparameters

In [15]:
compute_accuracy(list_outputs,list_subsets,0.7,[0.0000, 0.0000, 0.0000],[3,3,3],how_many_trig_features, pred_threshold=0.55,method = 'logistic',gamma = 0.1)

Average train accuracy: 0.8320193575709321
std train accuracy: 0.00046693834276353334
Average test accuracy: 0.8319370556540728
std test accuracy: 0.0010278248657757754


#### COMPUTING ACCURACY FOR LEAST SQUARES USING NORMAL EQUATIONS

Since degree 7 proved to be a good choice for ridge regression, we test the accuracy of least squares after 
a polynomial expansion up to that degree.
<br/>
Note that the lambda values are not used in this case. The value of gamma is chosen so as not to move too far along the gradient direction.

In [26]:
compute_accuracy(list_outputs,list_subsets,0.7,[0.0000, 0.0000, 0.0000],[7,7,7],how_many_trig_features,
                 pred_threshold=0.55,method = 'ls_normal_equations',gamma = 0.3)

Average train accuracy: 0.8337482563392427
std train accuracy: 0.0004043635190978398
Average test accuracy: 0.8333355555259263
std test accuracy: 0.0009857922860326252
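
As a reminder of what solving least squares via the normal equations involves, here is a minimal sketch. The `least_squares` function is a hypothetical stand-in for the routine selected by `method = 'ls_normal_equations'`; in this closed-form version no regularization strength appears.

```python
import numpy as np

def least_squares(y, tx):
    """Solve the normal equations X^T X w = X^T y."""
    return np.linalg.solve(tx.T @ tx, tx.T @ y)

# Toy data lying exactly on the line y = 1 + 2 x
tx = np.array([[1.0, 0.0],
               [1.0, 1.0],
               [1.0, 2.0],
               [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

w = least_squares(y, tx)
print(w)  # approximately [1. 2.]
```

With a degree-7 expansion the matrix `tx.T @ tx` can become ill-conditioned, in which case `np.linalg.lstsq` is a more robust alternative to forming the normal equations explicitly.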
