# Project 1 of Machine Learning CS-433, 2023

### _By_ _Salya_ _Diallo_, _Shrinidi_ _Singaravelan_ _and_ _Fanny_ _Ghez_

In this project, our main goal is to estimate a person's risk of developing cardiovascular disease (CVD) from features describing their personal lifestyle, using the given data set.

---

In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

%load_ext autoreload
%autoreload 2

In [2]:
from helpers import *
from standardization import *
from clean_and_predict import *
from implementations import *
from cross_validation import *

### We then import and clean our train and test data:

In [3]:
# Change the path below to match where the data set is located on your computer
x_train, x_test, y_train1, train_ids, test_ids = load_csv_data('dataset_to_release', sub_sample=False)

#### We do not want any _NaN_ values in our data. In particular, x_train and x_test contain unknown numerical values: we replace them with another value, for example the median of the corresponding column, which is what we do now.

In [4]:
x_train1, i_1, mean_1, std_1 = clean_data(x_train, [1])
x_train2, i_2, mean_2, std_2 = clean_data(x_train, [2])

Youpi!
(138783, 322)
Youpi!
(189352, 322)
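
The internals of `clean_data` are not shown in this notebook. As a minimal sketch of the median imputation described above, assuming missing entries are encoded as `np.nan`, one could write the following. The helper name `impute_median` is hypothetical and is not part of `clean_and_predict.py`.

```python
import numpy as np

def impute_median(x):
    """Replace NaNs in each column by that column's median (hypothetical helper)."""
    x = np.asarray(x, dtype=float).copy()
    medians = np.nanmedian(x, axis=0)      # per-column median, ignoring NaNs
    rows, cols = np.where(np.isnan(x))     # positions of the missing entries
    x[rows, cols] = medians[cols]          # fill each one with its column median
    return x, medians

# Tiny illustrative matrix (made-up values, not the project data)
x_demo = np.array([[1.0, np.nan],
                   [3.0, 4.0],
                   [np.nan, 8.0]])
x_filled, col_medians = impute_median(x_demo)
```

A column that contains only NaNs has no median, so in practice such columns would have to be dropped or handled separately before this step.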


In [5]:
# We need a separate cleaning function for the test set, because it is not standardized the same way:
# we standardize it with the mean and std of our train set.

x_test1, ind_1_test = clean_data_test_set(x_test, [1], mean_1, std_1) 
x_test2, ind_2_test = clean_data_test_set(x_test, [2], mean_2, std_2)

Youpi!
(46401, 322)
Youpi!
(62978, 322)
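
The comment in the cell above says that the test set is standardized with the mean and standard deviation of the train set. A minimal sketch of that idea, assuming plain column-wise z-scoring, is given below. `standardize_train` and `standardize_with` are hypothetical names, not the functions defined in `standardization.py`.

```python
import numpy as np

def standardize_train(x):
    """Z-score the columns of x and return the statistics that were used."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    std = np.where(std == 0, 1.0, std)   # constant columns: avoid dividing by zero
    return (x - mean) / std, mean, std

def standardize_with(x, mean, std):
    """Apply previously computed train statistics to another set."""
    return (x - mean) / std

# Made-up values for illustration only
x_tr = np.array([[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]])
x_te = np.array([[3.0, 12.0]])
x_tr_std, mu, sigma = standardize_train(x_tr)
x_te_std = standardize_with(x_te, mu, sigma)
```

Reusing the train statistics keeps both sets on the same scale, so weights fitted on the train set remain meaningful on the test set.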


#### y_train contains no _NaN_ values, only -1 or 1. However, we want 0s instead of -1s, so this is the only change we make to y_train:

In [6]:
y_train = y_train1.copy()
y_train = np.where(y_train == -1, 0, y_train)

In [7]:
y_train.shape

(328135,)

### We now create a function that computes the mistake percentage of our prediction on y_train and then our prediction for y_test, given the weights obtained from one of our algorithms. It prints the mistake percentage and the prediction, and returns the prediction.

In [8]:
def error_and_prediction(w_1, w_2):
    # Let us first compute and print the mistake percentage of our prediction of y_train:
    y_pred_1 = predict_labels(w_1, x_train1)
    y_pred_2 = predict_labels(w_2, x_train2)
    
    e1 = np.count_nonzero(y_pred_1 - y_train[i_1])
    e2 = np.count_nonzero(y_pred_2 - y_train[i_2])

    print('The mistake percentage on the train set is:',round((e1+e2)*100/y_train.shape[0], 4), '%.')
    
    # Now we predict and print our y_test:
    y_pred_test_1 = predict_labels(w_1, x_test1)
    y_pred_test_2 = predict_labels(w_2, x_test2)
    
    prediction = np.zeros(len(test_ids))
    
    prediction[ind_1_test] = y_pred_test_1
    prediction[ind_2_test] = y_pred_test_2
    # We still need to change the zeros into -1s since y_train initially has -1s instead of zeros
    prediction[prediction==0] = -1
    print('Our prediction for y_test is:', prediction,'.')
    
    return prediction

---
## Gradient Descent


In [9]:
# Initialization:
D1 = x_train1.shape[1]
initial_w1 = np.zeros((D1,))

# Parameters: 
gamma = 0.001
max_iters = 100

# Computation of the weights and loss:
w_gd1, loss_gd1 = mean_squared_error_gd(y_train[i_1], x_train1, initial_w1, max_iters, gamma)

In [10]:
# Initialization:
D2 = x_train2.shape[1]
initial_w2 = np.zeros((D2,))

# Parameters: 
gamma = 0.001
max_iters = 100

# Computation of the weights and loss:
w_gd2, loss_gd2 = mean_squared_error_gd(y_train[i_2], x_train2, initial_w2, max_iters, gamma)

In [11]:
print(loss_gd1)
print(loss_gd2)

0.04901843315569615
0.03233364855262832
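
`mean_squared_error_gd` is defined in `implementations.py` and is not shown here. As a rough sketch of what such a routine typically does, assuming the loss is `1/(2N) * ||y - Xw||^2` (the factor 1/2 is an assumption and the project's convention may differ), gradient descent can be written as below. The name `mse_gd_sketch` is hypothetical.

```python
import numpy as np

def mse_loss(y, tx, w):
    """MSE with a 1/2 factor (assumed convention)."""
    e = y - tx @ w
    return e @ e / (2 * len(y))

def mse_gd_sketch(y, tx, initial_w, max_iters, gamma):
    """Plain gradient descent on the MSE loss; returns the last weights and loss."""
    w = initial_w.copy()
    for _ in range(max_iters):
        e = y - tx @ w
        grad = -tx.T @ e / len(y)   # gradient of the loss above
        w = w - gamma * grad        # step of size gamma against the gradient
    return w, mse_loss(y, tx, w)

# Synthetic, noise-free data (illustrative only)
rng = np.random.default_rng(0)
tx_demo = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y_demo = tx_demo @ w_true
w_hat, loss_hat = mse_gd_sketch(y_demo, tx_demo, np.zeros(3), max_iters=500, gamma=0.1)
```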


### We now want to predict our y. To do so we use our previously defined function to compute the error percentage and our prediction:

In [12]:
y_pred_gd = error_and_prediction(w_gd1, w_gd2)

The mistake percentage on the train set is: 8.8262 %.
Our prediction for y_test is: [-1. -1. -1. ... -1. -1. -1.] .


In [13]:
OUTPUT_PATH = 'Gradient_Descent' 
create_csv_submission(test_ids, y_pred_gd, OUTPUT_PATH)

F1 score: 0.411

Accuracy: 0.843
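
For reference, here is a small sketch of how these two metrics are defined for labels in {-1, 1}, taking 1 as the positive class. `accuracy_and_f1` is a hypothetical helper and not part of the project's modules.

```python
import numpy as np

def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels, with `positive` as the positive class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))  # true positives
    fp = np.sum((y_pred == positive) & (y_true != positive))  # false positives
    fn = np.sum((y_pred != positive) & (y_true == positive))  # false negatives
    accuracy = np.mean(y_pred == y_true)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom > 0 else 0.0
    return accuracy, f1

# Made-up labels for illustration
y_true_demo = np.array([-1, -1, -1, -1, 1, 1])
y_pred_demo = np.array([-1, -1, -1, 1, 1, -1])
acc_demo, f1_demo = accuracy_and_f1(y_true_demo, y_pred_demo)

# Predicting -1 everywhere: accuracy stays at 4/6 but F1 drops to 0
acc_all_neg, f1_all_neg = accuracy_and_f1(y_true_demo, -np.ones(6))
```

On an imbalanced data set, a classifier that almost always predicts -1 can reach a high accuracy while its F1 score stays close to 0, which is why both values are tracked.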

---

## Stochastic Gradient Descent

In [14]:
# Initialization:
w1 = np.zeros((x_train1.shape[1],))

# Parameters: 
gamma = 0.001
max_iters = 100

# Computation of the weights and loss:
w_sgd1, loss_sgd1 = mean_squared_error_sgd(y_train[i_1], x_train1, w1, 1, max_iters, gamma)

In [15]:
# Initialization:
w2 = np.zeros((x_train2.shape[1],))

# Parameters: 
gamma = 0.001
max_iters = 100

# Computation of the weights and loss:
w_sgd2, loss_sgd2 = mean_squared_error_sgd(y_train[i_2], x_train2, w2, 1, max_iters, gamma)

In [16]:
print(loss_sgd1)
print(loss_sgd2)

0.05203075128358397
0.03334871416215792


In [17]:
y_pred_sgd = error_and_prediction(w_sgd1, w_sgd2)

The mistake percentage on the train set is: 9.5445 %.
Our prediction for y_test is: [-1. -1. -1. ... -1.  1. -1.] .


In [18]:
OUTPUT_PATH = 'Stochastic_Gradient_Descent' 
create_csv_submission(test_ids, y_pred_sgd, OUTPUT_PATH)

F1 score: 0.158

Accuracy: 0.558

---

## Least Squares

In [19]:
w_ls1, loss_ls1 = least_squares(y_train[i_1], x_train1)

In [20]:
w_ls2, loss_ls2 = least_squares(y_train[i_2], x_train2)

In [21]:
print(loss_ls1)
print(loss_ls2)

0.04082041310188043
0.028765329027445052
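
`least_squares` is defined in `implementations.py` and is not shown here. A minimal sketch, assuming it solves the normal equations `(X^T X) w = X^T y` and reports the same MSE convention with a factor 1/2 (both are assumptions), could look like this. `least_squares_sketch` is a hypothetical name.

```python
import numpy as np

def least_squares_sketch(y, tx):
    """Solve the normal equations (X^T X) w = X^T y and return (w, MSE loss)."""
    w = np.linalg.solve(tx.T @ tx, tx.T @ y)
    e = y - tx @ w
    return w, e @ e / (2 * len(y))

# Points lying exactly on y = 1 + 2x, with a column of ones for the offset
tx_demo = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y_demo = np.array([1.0, 3.0, 5.0])
w_demo, loss_demo = least_squares_sketch(y_demo, tx_demo)
```

`np.linalg.solve` raises an error when `X^T X` is singular; `np.linalg.lstsq` is a common alternative in that situation.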


In [22]:
y_pred_ls = error_and_prediction(w_ls1, w_ls2)

The mistake percentage on the train set is: 11.1415 %.
Our prediction for y_test is: [-1. -1. -1. ... -1.  1.  1.] .


In [23]:
OUTPUT_PATH = 'Least_Squares' 
create_csv_submission(test_ids, y_pred_ls, OUTPUT_PATH)

F1 score: 0.379

Accuracy: 0.787

---

## Ridge regression

In [24]:
# Parameter: 
lambda_ = 0.1

w_rr1, loss_rr1 = ridge_regression(y_train[i_1], x_train1, lambda_)

In [25]:
w_rr2, loss_rr2 = ridge_regression(y_train[i_2], x_train2, lambda_)

Let us now predict our test data:

In [26]:
y_pred_rr = error_and_prediction(w_rr1, w_rr2)

The mistake percentage on the train set is: 10.0166 %.
Our prediction for y_test is: [-1. -1. -1. ... -1.  1. -1.] .


In [27]:
OUTPUT_PATH = 'Ridge_Regression' 
create_csv_submission(test_ids, y_pred_rr, OUTPUT_PATH)

F1 score: 0.380

Accuracy: 0.788

### We now try Cross-Validation (CV) to see whether it changes our accuracy:

In [28]:
lambdas = np.logspace(-6, 0, 6)
k = 4

w1, loss1 = ridge_regression_cross_validation(y_train[i_1], x_train1, k, lambdas)
w2, loss2 = ridge_regression_cross_validation(y_train[i_2], x_train2, k, lambdas)

The best parameter lambda is: 1e-06
The best parameter lambda is: 1e-06


In [29]:
y_pred_rr_cv = error_and_prediction(w1, w2)

The mistake percentage on the train set is: 11.1411 %.
Our prediction for y_test is: [-1. -1. -1. ... -1.  1.  1.] .


In [30]:
OUTPUT_PATH = 'Ridge_Regression_CV' 
create_csv_submission(test_ids, y_pred_rr_cv, OUTPUT_PATH)

F1 score: 0.380

Accuracy: 0.788

We see here that using cross-validation does not change our accuracy or F1 score in this case. Modifying lambdas or k changes these values, but not by much.
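
`ridge_regression_cross_validation` is defined in `cross_validation.py` and is not shown here. The sketch below illustrates k-fold selection of lambda under two assumptions: the ridge weights solve `(X^T X + 2 N lambda I) w = X^T y` (the `2 N` scaling is an assumption), and candidates are compared by their mean validation MSE. All function names are hypothetical.

```python
import numpy as np

def ridge_sketch(y, tx, lambda_):
    """Ridge weights from the regularized normal equations (assumed 2*N*lambda scaling)."""
    n, d = tx.shape
    a = tx.T @ tx + 2 * n * lambda_ * np.eye(d)
    return np.linalg.solve(a, tx.T @ y)

def mse(y, tx, w):
    e = y - tx @ w
    return e @ e / (2 * len(y))

def cv_best_lambda(y, tx, k, lambdas, seed=1):
    """Return the lambda with the lowest mean validation MSE over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)   # k disjoint index sets
    mean_val_losses = []
    for lambda_ in lambdas:
        losses = []
        for i in range(k):
            val = folds[i]                                # fold i is held out
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            w = ridge_sketch(y[train], tx[train], lambda_)
            losses.append(mse(y[val], tx[val], w))
        mean_val_losses.append(np.mean(losses))
    best = int(np.argmin(mean_val_losses))
    return lambdas[best], mean_val_losses

# Synthetic data (illustrative only), same grid and k as in the cell above
rng = np.random.default_rng(0)
tx_demo = rng.normal(size=(120, 4))
y_demo = tx_demo @ np.array([1.0, 0.0, -1.0, 2.0]) + 0.1 * rng.normal(size=120)
lambdas_demo = np.logspace(-6, 0, 6)
best_lambda, val_losses = cv_best_lambda(y_demo, tx_demo, 4, lambdas_demo)
```

When the selected lambda sits at the lower edge of the grid, as with 1e-06 in the run above, the regularized solution is close to plain least squares, which would be consistent with the nearly identical train mistake percentages of the two methods.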

---

## Logistic regression

In [31]:
# Initialization:
D1 = x_train1.shape[1]
w1 = np.zeros((D1,))

# Parameters: 
gamma = 0.001
max_iters = 100

# Computation of the weights and loss:
w_lr1, loss_lr1 = logistic_regression(y_train[i_1], x_train1, w1, max_iters, gamma)

In [32]:
# Initialization:
D2 = x_train2.shape[1]
w2 = np.zeros((D2,))

# Parameters: 
gamma = 0.001
max_iters = 100

# Computation of the weights and loss:
w_lr2, loss_lr2 = logistic_regression(y_train[i_2], x_train2, w2, max_iters, gamma)

In [33]:
print(loss_lr1)
print(loss_lr2)

0.6687461415287278
0.6705917743935812


In [34]:
y_pred_lr = error_and_prediction(w_lr1, w_lr2)

The mistake percentage on the train set is: 8.8205 %.
Our prediction for y_test is: [-1. -1. -1. ... -1. -1. -1.] .


In [35]:
OUTPUT_PATH = 'Logistic_Regression' 
create_csv_submission(test_ids, y_pred_lr, OUTPUT_PATH)

F1 score: 0.332

Accuracy: 0.903

---

## Regularized Logistic regression

In [36]:
# Initialization:
D1 = x_train1.shape[1]
initial_w1 = np.zeros((D1,))

# Parameters:
lambda_ = 0.01
gamma = 0.001
max_iters = 1000

# Computation of the weights and loss:
w_rlr1, loss_rlr1 = reg_logistic_regression(y_train[i_1], x_train1, lambda_, initial_w1, max_iters, gamma)

In [None]:
# Initialization:
D2 = x_train2.shape[1]
initial_w2 = np.zeros((D2,))

# Parameters:
lambda_ = 0.01
gamma = 0.001
max_iters = 1000

# Computation of the weights and loss:
w_rlr2, loss_rlr2 = reg_logistic_regression(y_train[i_2], x_train2, lambda_, initial_w2, max_iters, gamma)

In [None]:
print(loss_rlr1)
print(loss_rlr2)

In [None]:
y_pred_rlr = error_and_prediction(w_rlr1, w_rlr2)

In [None]:
all(y_pred_rlr==-1)

In [None]:
OUTPUT_PATH = 'Ridge_Logistic_Regression_1000' 
create_csv_submission(test_ids, y_pred_rlr, OUTPUT_PATH)

400 iterations:
F1 score: 0.385
and
Accuracy: 0.896

1000 iterations:
F1 score: 0.369
and
Accuracy: 0.890


#### We see that increasing the number of iterations from 100 to 400 raises the F1 score sharply while accuracy decreases slightly. With 100 iterations, we got an F1 score of 0.012 (not good at all) and an accuracy of 0.912; with 1000 iterations, the F1 score (0.369) is slightly lower than with 400 iterations (0.385). This means that this algorithm needs more iterations to perform well. Nevertheless, even with 1000 iterations, it takes only a few 