<a href="https://colab.research.google.com/github/felixsimard/comp551-p2/blob/main/P2_Optimization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Part 1: Optimization (80 points)**

## Setup

In [None]:
import time
import pandas as pd
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt
from IPython.core.debugger import set_trace
import warnings
warnings.filterwarnings('ignore')
from joblib import Parallel, delayed

# Additional Python files
from LogisticRegression import LogisticRegression, TrainingResults
from Gradient import *
LogisticRegression.gradient = gradient

In [None]:
# Define datasets paths
diabetes_train_dir = r'diabetes/diabetes_train.csv'
diabetes_val_dir = r'diabetes/diabetes_val.csv'
diabetes_test_dir = r'diabetes/diabetes_test.csv'

# `error_bad_lines` was removed in pandas 2.0; `on_bad_lines="skip"` (pandas >= 1.3) replaces it
diabetes_train_df = pd.read_csv(diabetes_train_dir, engine="python", on_bad_lines="skip")
diabetes_val_df = pd.read_csv(diabetes_val_dir, engine="python", on_bad_lines="skip")
diabetes_test_df = pd.read_csv(diabetes_test_dir, engine="python", on_bad_lines="skip")

## Feature-Target split

In [None]:
# split into feature and target
diabetes_train_X =  diabetes_train_df.drop('Outcome', axis=1)
diabetes_train_y = diabetes_train_df.loc[:, 'Outcome']
diabetes_val_X = diabetes_val_df.drop('Outcome', axis=1)
diabetes_val_y = diabetes_val_df.loc[:, 'Outcome']
diabetes_test_X = diabetes_test_df.drop('Outcome', axis=1)
diabetes_test_y = diabetes_test_df.loc[:, 'Outcome']

## 1. Gradient descent
You should first start by running the logistic regression code using the given implementation. This will serve as a baseline for the following steps. Find a learning rate and a number of training iterations such that the model has fully converged to a solution. Make sure to provide empirical evidence supporting your decision (e.g. training and validation accuracy as a function of number of training iterations).
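
The `fit_for_vis` method used below lives in `LogisticRegression.py`, which is not shown in this notebook. As a rough, self-contained sketch of what such a baseline might look like, the code below runs full-batch gradient descent on the logistic loss and records validation accuracy every `itv` iterations. The function name `gd_with_history`, the synthetic data, and all hyperparameter values are illustrative assumptions, not the course implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gd_with_history(X, y, X_val, y_val, lr=0.1, max_iters=1000, itv=100, epsilon=1e-4):
    """Full-batch gradient descent; records validation accuracy every `itv` steps."""
    n, d = X.shape
    w = np.zeros(d)
    val_acc = []
    for t in range(int(max_iters)):
        # Gradient of the mean logistic loss over the whole training set
        grad = X.T @ (sigmoid(X @ w) - y) / n
        if np.linalg.norm(grad) < epsilon:
            break  # stop early once the gradient norm is small
        w -= lr * grad
        if t % itv == 0:
            val_acc.append(np.mean((sigmoid(X_val @ w) > 0.5) == y_val))
    return w, val_acc

# Synthetic, linearly separable data (hypothetical; stands in for the diabetes set)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X @ np.array([1.5, -2.0, 0.5]) > 0).astype(float)
w, val_acc = gd_with_history(X[:150], y[:150], X[150:], y[150:])
print(val_acc)
```

Plotting `val_acc` against the iteration count for several learning rates gives the kind of empirical evidence the question asks for.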

In [None]:
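# Create and fit the LR model; returns the accuracy recorded every `itv` iterations
def get_acc_list(lr, max_iters, itv):
    model = LogisticRegression(verbose=True, learning_rate=lr, max_iters=max_iters)
    # Use the diabetes splits defined above (train_X, val_X, etc. were never defined)
    acc_list = model.fit_for_vis(diabetes_train_X, diabetes_train_y, diabetes_val_X, diabetes_val_y, itv)
    return acc_list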

We will fit the model with the following learning rates: [0.2, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6, 1e-7]

In [None]:
# configuration
max_iter = 1e6
itv = int(1e3)
lr_list = [0.2, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6, 1e-7]

In [None]:
# parallelize training across learning rates
accs_list = Parallel(n_jobs=-1, verbose=10)(delayed(get_acc_list)(i, max_iter, itv) for i in lr_list)

In [None]:
# Create plot
length = len(accs_list[0])
row = list(range(1, length*itv, itv))
for accs in accs_list:
    plt.plot(row, accs)
plt.legend(['0.2', '1e-1', '1e-2', '1e-3', '1e-4', '1e-5', '1e-6', '1e-7'],
           bbox_to_anchor=(1.04,1))
plt.grid()
plt.title("Change in validation accuracy over iterations")
# plt.savefig('/content/drive/MyDrive/COMP551/mini2/figures/lrs_compare_max_itrs=1e6.png', bbox_inches="tight")

Now increase `max_iters` and try again with the smaller learning rates only: [1e-4, 1e-5, 1e-6, 1e-7].

In [10]:
# configuration
new_max_iter = 3*(1e6)
new_itv = int(1e4)
new_lr_list = [1e-4, 1e-5, 1e-6, 1e-7]

In [None]:
new_accs_list = []
for i in range(len(new_lr_list)):
    result = get_acc_list(new_lr_list[i], new_max_iter, new_itv)
    print('\n')
    new_accs_list.append(result)

In [None]:
# Create plot
new_length = len(new_accs_list[0])
new_row = list(range(0, new_length*new_itv, new_itv))
for accs in new_accs_list:
    plt.plot(new_row, accs)
plt.legend(['1e-4', '1e-5', '1e-6', '1e-7'],
           bbox_to_anchor=(1.04,1))
plt.grid()
plt.title("Change in validation accuracy over iterations")
plt.savefig('/content/drive/MyDrive/COMP551/mini2/figures/lrs_compare_max_itrs=3_1e6.png',
            bbox_inches="tight")

## Best Configuration
From the plot above:

max_iters = 1.8e6

lr = 1e-4

---------------------------
`epsilon` was left at its default value of 1e-4. The norm of the gradient never dropped below 1e-4 in my experiments, so the stopping criterion was never triggered and this parameter was not tuned.
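
One way to make "read it off the plot" more reproducible is to compute the first recorded point after which the accuracy curve stays within a small tolerance of its final value. The sketch below is a heuristic under stated assumptions, not the procedure actually used to pick `max_iters` above: the helper name `convergence_index`, the tolerance, and the example curve are all made up for illustration.

```python
import numpy as np

def convergence_index(acc, tol=1e-3):
    """Index of the first recorded point after which `acc` stays within `tol` of its last value."""
    acc = np.asarray(acc, dtype=float)
    outside = np.flatnonzero(np.abs(acc - acc[-1]) > tol)
    return 0 if outside.size == 0 else int(outside[-1]) + 1

# Hypothetical accuracy curve, one value per recording interval
curve = [0.60, 0.65, 0.70, 0.74, 0.75, 0.75, 0.75, 0.75, 0.75, 0.75]
idx = convergence_index(curve)
print(idx)  # 4
```

Multiplying the returned index by the recording interval (`itv`) converts it to an iteration count. The result is only meaningful if the run was long enough for the curve to flatten.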

## 2. Mini-batch stochastic gradient descent
Implement mini-batch stochastic gradient descent. Then, using growing minibatch sizes (e.g. 8, 16, 32, ...), compare the convergence speed and the quality of the final solution to the fully batched baseline. What configuration works the best among the ones you tried?
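
The mini-batch logic itself is implemented in `fit_for_vis_complex` inside `LogisticRegression.py` and is not reproduced in this notebook. As a minimal sketch of the general technique, assuming a shuffled pass over the data each epoch, the code below updates the weights once per mini-batch. The name `minibatch_sgd`, the synthetic data, and the hyperparameters are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def minibatch_sgd(X, y, batch_size=32, lr=0.1, max_epochs=50, seed=0):
    """Mini-batch SGD for logistic regression (no bias term, for brevity)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_epochs):
        order = rng.permutation(n)  # reshuffle the data every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            # Gradient estimated from the current mini-batch only
            grad = X[idx].T @ (sigmoid(X[idx] @ w) - y[idx]) / len(idx)
            w -= lr * grad
    return w

# Synthetic, linearly separable data (hypothetical)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X @ np.array([1.5, -2.0, 0.5]) > 0).astype(float)
w = minibatch_sgd(X, y, batch_size=32)
acc = np.mean((sigmoid(X @ w) > 0.5) == y)
print(acc)
```

Setting `batch_size` to the number of training rows recovers the fully batched baseline, which makes the two settings easy to compare with the same code path.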

### Helper methods and variables

In [None]:

batch_sizes = [8, 16, 32, 64, 128, 256, 512]

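# Create and fit an LR model and return its training info.
# The defaults are an assumption so that calls omitting these arguments run:
# `momentum=0` means no momentum; `max_epochs=None` assumes that
# `fit_for_vis_complex` treats None as "no epoch limit".
def get_training_results_batch(lr, max_iters, itv, batch_size, max_epochs=None, momentum=0):
    model = LogisticRegression(verbose=True, learning_rate=lr, max_iters=max_iters)
    return model.fit_for_vis_complex(diabetes_train_X, diabetes_train_y, diabetes_val_X, diabetes_val_y, itv, batch_size, max_epochs, momentum)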

# helper function to have all time lists be the same length
def same_length_lsts(results):
    max_time_num_itv = 0
    for r in results:
        if len(r.acc_list_time) > max_time_num_itv:
            max_time_num_itv = len(r.acc_list_time)
    for r in results:
        if len(r.acc_list_time) < max_time_num_itv:
            r.acc_list_time += [r.acc_list_time[-1]] * (max_time_num_itv - len(r.acc_list_time))
        if len(r.grad_list_time) < max_time_num_itv:
            r.grad_list_time += [r.grad_list_time[-1]] * (max_time_num_itv - len(r.grad_list_time))
    return results

# Plot validation accuracy per epoch
def plot_epochs(results, num_epochs):
    epoch_row = list(range(1, num_epochs + 1))
    learning_rate = results[0].lr_model.learning_rate  # was an undefined name
    for r in results:
        plt.plot(epoch_row, r.acc_list_epoch)
    plt.legend(batch_sizes, bbox_to_anchor=(1.04, 1))
    plt.grid()
    plt.title(f"Change in validation accuracy over epochs by batch size (lr = {learning_rate})")
    # f-string avoids the str + int TypeError in the file name
    plt.savefig(f'./{num_epochs}_acc_epoch_batch')
    
# Plot all other results: accuracy and gradient, per iteration and per elapsed time
def plot_results(results, lr_val, itv=int(1e4)):
    # `itv` must match the recording interval used during training (1e4 in the runs below)
    length = len(results[0].acc_list_it)
    it_row = list(range(1, length*itv, itv))
    # Time-based lists are assumed to be sampled every 15 seconds
    time_row = list(range(0, 15*len(results[0].acc_list_time), 15))
    learning_rate = results[0].lr_model.learning_rate

    def plot_one(x, attr, title, suffix):
        plt.figure()  # new figure so the four plots do not overlap
        for r in results:
            plt.plot(x, getattr(r, attr))
        plt.legend(batch_sizes, bbox_to_anchor=(1.04, 1))
        plt.grid()
        plt.title(f"{title} by batch size (lr = {learning_rate})")
        # All figures go to ./figures/ (the first one was previously saved to ./)
        plt.savefig('./figures/' + lr_val + suffix, bbox_inches="tight")

    plot_one(it_row, 'acc_list_it', "Change in validation accuracy over iterations", '_acc_iter_batch')
    plot_one(time_row, 'acc_list_time', "Speed of accuracy convergence", '_acc_speed_batch')
    plot_one(it_row, 'grad_list_it', "Change in gradient over iterations", '_grad_iter_batch')
    plot_one(time_row, 'grad_list_time', "Speed of gradient convergence", '_grad_speed_batch')

### Train (by epochs) with all batch sizes


In [None]:
from sklearn.metrics import accuracy_score  # not imported in Setup

# train with 10000 epochs
results = [get_training_results_batch(1e-4, 3*(1e6), int(1e4), batch_size, 10000, 0) for batch_size in batch_sizes]
print("10000 epochs training results")
for result in results:
    # Evaluate on the held-out diabetes test split (test_X / test_y were never defined)
    test_yh = (result.lr_model.predict(diabetes_test_X) > 0.5).astype('int')
    print("Batch size: ", result.batch_size, " Accuracy: ", accuracy_score(diabetes_test_y, test_yh))
print()

### Train (by iterations) with all batch sizes and 3 learning rates

In [None]:
new_max_iter = 3*(1e6)
new_itv = int(1e4)  # keep the interval an integer, as in Part 1

batch_sizes = [8, 16, 32, 64, 128, 256, 512]
results_lr_best = Parallel(n_jobs=-1, verbose=10)(delayed(get_training_results_batch)(1e-4, new_max_iter, new_itv, batch_size, max_epochs=3) for batch_size in batch_sizes)
results_lr_low = Parallel(n_jobs=-1, verbose=10)(delayed(get_training_results_batch)(1e-7, new_max_iter, new_itv, batch_size) for batch_size in batch_sizes)
results_lr_high = Parallel(n_jobs=-1, verbose=10)(delayed(get_training_results_batch)(0.2, new_max_iter, new_itv, batch_size) for batch_size in batch_sizes)

modified_results_lr_best = same_length_lsts(results_lr_best)
modified_results_lr_low = same_length_lsts(results_lr_low)
modified_results_lr_high = same_length_lsts(results_lr_high)

plot_results(modified_results_lr_low, 'lr_low')
plot_results(modified_results_lr_best, 'lr_best')
plot_results(modified_results_lr_high, 'lr_high')

## 3. Momentum
Add momentum to the gradient descent implementation. Trying multiple values for the momentum coefficient, how does it compare to regular gradient descent? Specifically, analyze the impact of momentum on the convergence speed and the quality of the final solution.

In [None]:
# Momentum added to LogisticRegression.py fit_for_vis_complex function
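
Since the momentum code lives in `LogisticRegression.py`, here is a minimal sketch of one common formulation, in which the update direction is an exponential moving average of past gradients. Whether `fit_for_vis_complex` uses this exact form is an assumption; the name `gd_momentum`, the synthetic data, and the hyperparameters are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gd_momentum(X, y, lr=0.1, beta=0.9, max_iters=500):
    """Full-batch gradient descent with momentum coefficient `beta`."""
    n, d = X.shape
    w = np.zeros(d)
    velocity = np.zeros(d)
    for _ in range(max_iters):
        grad = X.T @ (sigmoid(X @ w) - y) / n
        # Exponential moving average of gradients; beta = 0 gives plain gradient descent
        velocity = beta * velocity + (1 - beta) * grad
        w -= lr * velocity
    return w

# Synthetic, linearly separable data (hypothetical)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X @ np.array([1.5, -2.0, 0.5]) > 0).astype(float)

w_plain = gd_momentum(X, y, beta=0.0)
w_mom = gd_momentum(X, y, beta=0.9)
for name, w in [("beta=0.0", w_plain), ("beta=0.9", w_mom)]:
    print(name, np.mean((sigmoid(X @ w) > 0.5) == y))
```

Sweeping `beta` over a few values and plotting accuracy against iterations is one way to compare convergence speed with the regular gradient descent baseline.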

## 4. Momentum with mini-batches
Repeat the previous step for a) the smallest batch size and b) the largest batch size you tried in 2). In which setting (small mini-batch, large mini-batch, fully batched) is it the most / least effective?

In [5]:
# do
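
This cell is left unfinished in the notebook. As a hedged sketch of how the comparison could be organized, the code below sweeps a small grid of batch sizes and momentum coefficients on synthetic data and stores the final training loss for each pair. The function `sgd_momentum`, the grid values, and the data are assumptions for illustration; they are not results or code from this assignment.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(X, y, w):
    p = np.clip(sigmoid(X @ w), 1e-12, 1 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def sgd_momentum(X, y, batch_size, beta, lr=0.1, max_epochs=20, seed=0):
    """Mini-batch SGD with an exponential-moving-average momentum term."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, velocity = np.zeros(d), np.zeros(d)
    for _ in range(max_epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            grad = X[idx].T @ (sigmoid(X[idx] @ w) - y[idx]) / len(idx)
            velocity = beta * velocity + (1 - beta) * grad
            w -= lr * velocity
    return w

# Synthetic, noisy data (hypothetical)
rng = np.random.default_rng(1)
X = rng.normal(size=(256, 3))
y = (X @ np.array([1.0, -1.0, 0.5]) + 0.5 * rng.normal(size=256) > 0).astype(float)

# Smallest batch, a mid-sized batch, and the full batch, each with three momentum values
losses = {}
for batch_size in (8, 64, len(X)):
    for beta in (0.0, 0.5, 0.9):
        w = sgd_momentum(X, y, batch_size, beta)
        losses[(batch_size, beta)] = log_loss(X, y, w)

for key, value in sorted(losses.items()):
    print(key, round(value, 4))
```

With a fixed number of epochs, smaller batches perform many more updates, so comparing per-update and per-epoch curves separately gives a fairer picture than the final loss alone.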