#### Homework2
Please explain clearly and include your entire computational work when needed. Should you include any code, please make sure to provide additional comments to explain your solution. 

Q1 (8 points) Answer the following questions clearly. 
- (4 points) Compare the cost functions in Ridge and Lasso Regression and indicate the regularization parameter. 

- (4 points) Explain which weights are more penalized in Ridge Regression and why (discuss your answer in the context of constraint satisfaction and take into account the constraint on Ridge Regression coefficients). 


LASSO

Loss function: $\sum_{i=1}^M (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^N|w_j| $

Regularization term: $\lambda \sum_{j=1}^N|w_j|$ 

Regularization parameter: $\lambda$ 


RIDGE

Loss function: $\sum_{i=1}^M(y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^Nw_j^2 $

Regularization term: $\lambda \sum_{j=1}^Nw_j^2$ 

Regularization parameter: $\lambda$ 

Comparing the penalty terms of the two functions, Ridge (L2) penalizes large weights more heavily because its penalty is quadratic: the gradient of $\lambda w_j^2$ is $2\lambda w_j$, which grows with the magnitude of the weight. Lasso (L1) applies a penalty whose gradient has the same magnitude for every nonzero weight, and it can reduce some coefficients entirely to zero, which aids in feature selection. Ridge Regression therefore penalizes larger weights more than smaller ones.

Constraint: $c > 0, \sum_{j=1}^Nw_j^2 < c$  

Ridge Regression imposes a constraint on the sum of squares of the weights. This forces the algorithm to find a solution that balances minimizing the residual sum of squares and satisfying the regularization constraint.
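As a small illustration of the two cost functions above, the sketch below evaluates both penalties and their per-weight gradients. The function names and the values of `y`, `y_hat`, `weights`, and `lam` are assumptions chosen for illustration and are not part of the assignment.

```python
def ridge_cost(y, y_hat, weights, lam):
    """Sum of squared errors plus the L2 penalty lam * sum(w_j^2)."""
    sse = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))
    return sse + lam * sum(w ** 2 for w in weights)


def lasso_cost(y, y_hat, weights, lam):
    """Sum of squared errors plus the L1 penalty lam * sum(|w_j|)."""
    sse = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))
    return sse + lam * sum(abs(w) for w in weights)


def ridge_penalty_gradient(weights, lam):
    """d/dw_j of lam * w_j^2 is 2 * lam * w_j, so it grows with |w_j|."""
    return [2 * lam * w for w in weights]


def lasso_penalty_gradient(weights, lam):
    """d/dw_j of lam * |w_j| is lam * sign(w_j), the same size for every nonzero weight."""
    return [lam * ((w > 0) - (w < 0)) for w in weights]


# Hypothetical example values
y = [1.0, 2.0, 3.0]
y_hat = [1.0, 2.0, 2.0]
weights = [0.5, -4.0]
lam = 1.0

print("ridge cost:", ridge_cost(y, y_hat, weights, lam))
print("lasso cost:", lasso_cost(y, y_hat, weights, lam))
print("ridge penalty gradient:", ridge_penalty_gradient(weights, lam))
print("lasso penalty gradient:", lasso_penalty_gradient(weights, lam))
```

In the gradient output, the large weight receives a much stronger push toward zero under Ridge than the small one, while under Lasso both receive a push of the same size.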



Q2 (12 points) In the context of training a linear regression model using Maximum-Likelihood-Estimation, answer the following questions:
- (4 points) Indicate all assumptions discussed in the lecture under the MLE principle about the data, residual error, and the type of the probability density function used in the Likelihood function. 
- (4 points) Indicate the Likelihood function mathematically with respect to the assumptions made under MLE principle, and describe each term/parameters used in the likelihood function. 
- (4 points) Explain how the concept of maximizing the likelihood of observing data under model parameters is convertible to minimizing the NLL? Discuss in terms of the mathematical notation and the shape of the function. 



ASSUMPTIONS:

$x_0 = 1$

$x_1$, $x_2$, ... $x_n$ -> are independent variables

$y$ -> dependent variable

$y$ is independent across observations

$\epsilon \sim N(0, \sigma^2)$ The residual term follows a normal distribution.

$\epsilon$ are independent across observations


LIKELIHOOD FUNCTION:

$L(w_0, w_1, ..., w_m) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi \sigma^2}}\exp\left(-\frac{(y_i - g(x_i))^2}{2\sigma^2}\right)$ 

$w_0, w_1, ..., w_m$ -> weights

$\sigma^2$ -> variance

$g(x_i)$ -> predicted value for the ith observation

$y_i$ -> actual output value for the ith observation


The logarithm function is monotonically increasing, so maximizing the likelihood function is equivalent to maximizing the log-likelihood function. Negating the log-likelihood flips the function, so its maximum becomes a minimum: minimizing the negative log-likelihood (NLL) is therefore equivalent to maximizing the likelihood function. It is also easier to work with the log-likelihood function because taking the log converts the product into a sum, which is computationally much simpler and more stable than multiplying many small probability values.
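The step from the likelihood to the NLL can be written out as follows. This is a sketch of the standard derivation under the Gaussian assumption, using the same symbols as the likelihood function above:

$$
\begin{aligned}
\log L(w_0, ..., w_m) &= \sum_{i=1}^n \log\left[\frac{1}{\sqrt{2\pi \sigma^2}}\exp\left(-\frac{(y_i - g(x_i))^2}{2\sigma^2}\right)\right] \\
&= -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - g(x_i))^2 \\
NLL(w_0, ..., w_m) &= -\log L(w_0, ..., w_m) = \frac{n}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - g(x_i))^2
\end{aligned}
$$

The first term does not depend on the weights, so minimizing the NLL amounts to minimizing the sum of squared errors, which is a bowl-shaped (convex) function of the weights when $g$ is linear in them.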



Q3 (10 points) Use the sklearn Breast_cancer dataset and use the min-max scaler to transform the input attributes. Next, develop two classifiers using logistic regression and perceptron learning. Train on the training data (75% of the entire data) and compare the performance of the models by reporting accuracy: "accuracy = accuracy_score(y_test, y_pred)". Which model performs better? Provide your coding for the developed models and document your code. Failing proper documentation leads to losing points. Necessary library functions are provided.

In [None]:
%pip install scikit-learn
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.metrics import accuracy_score

# load breast cancer dataset
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target

# split data into training and testing (75%/25%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# scale input attributes using Min-Max scaler
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# create the logistic regression classifier
logistic_model = LogisticRegression(random_state=42)

# create perceptron classifier
perceptron = Perceptron(random_state=42)

# train classifiers
logistic_model.fit(X_train_scaled, y_train)
perceptron.fit(X_train_scaled, y_train)

# test classifiers via test data prediction 
model_y_pred = logistic_model.predict(X_test_scaled)
perceptron_y_pred = perceptron.predict(X_test_scaled)

# calculate accuracy scores
model_accuracy = accuracy_score(y_test, model_y_pred)
perceptron_accuracy = accuracy_score(y_test, perceptron_y_pred)

# display accuracy scores
print(f"{model_accuracy = }")
print(f"{perceptron_accuracy = }")

Q4 (6 points) Compare and contrast Newton's method and gradient descent as optimization algorithms for finding the minimum of a function. Provide insights into their convergence properties, computational complexities, and practical considerations. Discuss situations where Newton's method should not be used. 

-GRADIENT DESCENT:   

$w_j = w_j - \alpha \frac{\partial l(w)}{\partial w_j}$  

Learning rate: $\alpha$  

Gradient of the loss function with respect to weight $w_j$: $\frac{\partial l(w)}{\partial w_j}$


-NEWTON'S METHOD:

$w = w - \alpha H^{-1} \nabla l(w)$

Learning rate: $\alpha$ 

Gradient of loss function: $\nabla l(w)$ 

Hessian (matrix of second derivatives) of the loss function: $H$

ANALYSIS:

Gradient descent is a first-order optimization algorithm, while Newton's method is a second-order one. The gradient gives the direction of steepest ascent (and its negative the direction of steepest descent), while the Hessian matrix gives the curvature of the function. Because it uses curvature to scale each step, Newton's method usually converges in fewer iterations than gradient descent, but each iteration is computationally intensive because the Hessian matrix must be computed and inverted. Newton's method should not be used when the dataset has a large number of features (the size of the Hessian grows quadratically with the number of weights), when the function is not convex (it can converge to local minima or saddle points rather than the global minimum), or when the function is too noisy.
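The difference in convergence can be seen on a small example. The sketch below minimizes a hypothetical convex quadratic with both update rules; the matrix `A`, the vector `b`, the starting point, and the learning rates are all assumptions chosen for illustration.

```python
# Minimize f(w) = 0.5 * w^T A w - b^T w, a convex quadratic (hypothetical example).
A = [[3.0, 1.0], [1.0, 2.0]]   # Hessian H of f (constant for a quadratic)
b = [1.0, 1.0]


def gradient(w):
    """Gradient of f: A w - b."""
    return [A[0][0] * w[0] + A[0][1] * w[1] - b[0],
            A[1][0] * w[0] + A[1][1] * w[1] - b[1]]


def inverse_2x2(m):
    """Closed-form inverse of a 2x2 matrix; fails if the determinant is zero."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    if det == 0:
        raise ValueError("Hessian is singular, so the Newton step is undefined")
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


def gradient_descent(w, alpha=0.1, tol=1e-8, max_iter=10_000):
    """First-order update: w = w - alpha * gradient."""
    for step in range(max_iter):
        g = gradient(w)
        if max(abs(gi) for gi in g) < tol:
            return w, step
        w = [wi - alpha * gi for wi, gi in zip(w, g)]
    return w, max_iter


def newton(w, alpha=1.0, tol=1e-8, max_iter=100):
    """Second-order update: w = w - alpha * H^{-1} * gradient."""
    h_inv = inverse_2x2(A)
    for step in range(max_iter):
        g = gradient(w)
        if max(abs(gi) for gi in g) < tol:
            return w, step
        direction = [h_inv[0][0] * g[0] + h_inv[0][1] * g[1],
                     h_inv[1][0] * g[0] + h_inv[1][1] * g[1]]
        w = [wi - alpha * di for wi, di in zip(w, direction)]
    return w, max_iter


w_gd, steps_gd = gradient_descent([5.0, -3.0])
w_newton, steps_newton = newton([5.0, -3.0])
print("gradient descent:", w_gd, "steps:", steps_gd)
print("Newton's method: ", w_newton, "steps:", steps_newton)
```

On a quadratic, the Newton step with $\alpha = 1$ lands on the minimum directly, while gradient descent needs many small steps. The price is the matrix inverse, which is trivial here for two weights but costly when there are many.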

Q5 (6 points) Mathematically explain how a perceptron learning model is trained. Discuss in terms of the gradient of the error function used in Perceptron Learning algorithm. 

The perceptron learning model is trained by iteratively updating its weights and bias with gradient descent on the error function. Each update moves the parameters of the model in the direction that reduces the error. Since it is a binary classification algorithm, the perceptron takes the input features, multiplies them by their corresponding weights, and calculates the weighted sum. It then passes this weighted sum through an activation function and returns 0 or 1.


Error function: $E(w) = \frac{1}{2}\sum^{n}_{i=1}(y_i - \hat{y}_i)^2 = \frac{1}{2}\sum^{n}_{i=1}(y_i - w^T x_i)^2$

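The weights and bias are updated iteratively using the gradient descent update rule.

Gradient of the error with respect to weight $w_j$: $\frac{\partial E}{\partial w_j} = \sum^{n}_{i=1}(y_i - \hat{y}_i) (-x_{ij})$

Updated value of the weight, with learning rate $\alpha$: $w_j = w_j - \alpha \frac{\partial E}{\partial w_j} = w_j + \alpha \sum^{n}_{i=1}(y_i - \hat{y}_i) x_{ij}$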
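The update can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: it uses the squared error of the weighted sum as written in the error function above, the toy dataset (logical AND), the learning rate, the number of epochs, and the 0.5 threshold in `classify` are all hypothetical choices.

```python
def predict_linear(w, x):
    """Weighted sum w^T x, where x[0] = 1 carries the bias."""
    return sum(wj * xj for wj, xj in zip(w, x))


def train(X, y, alpha=0.1, epochs=500):
    """Batch gradient descent on E(w) = 0.5 * sum_i (y_i - w^T x_i)^2."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        # dE/dw_j = sum_i (y_i - y_hat_i) * (-x_ij)
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            error = yi - predict_linear(w, xi)
            for j, xij in enumerate(xi):
                grad[j] += error * (-xij)
        # w_j = w_j - alpha * dE/dw_j
        w = [wj - alpha * gj for wj, gj in zip(w, grad)]
    return w


def classify(w, x, threshold=0.5):
    """Binary step on the weighted sum (the threshold is an assumption for 0/1 targets)."""
    return 1 if predict_linear(w, x) >= threshold else 0


# Hypothetical toy data: logical AND, with x_0 = 1 as the bias input
X = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]
y = [0, 0, 0, 1]

w = train(X, y)
print("weights:", [round(wj, 3) for wj in w])
print("predictions:", [classify(w, xi) for xi in X])
```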

Q6 (4 points) Compare a Perceptron Learning algorithm with "Binary Step function" used as activation function, with a linear regression function in the context of binary classification. 

QUESTION 6

Q7 (10 points) Answer the following questions: 
- (6 points) Discuss the vanishing gradient problem in the context of training deep neural networks and identify activation functions that are particularly susceptible to this phenomenon. 
  
- (4 points) Explain why these activation functions lead to vanishing gradients during backpropagation (hint: discuss in terms of the shape of the activation function). 


The vanishing gradient problem occurs when the gradients become extremely small as they are propagated backward through the network, so the weight updates become negligible and training stalls. In deep neural networks this is prevalent because backpropagation multiplies the derivatives of successive layers together, so the gradient shrinks with depth and the first layers of the network (those closest to the input) are affected the most. When those layers stop updating, the later layers are left to learn from poorly trained features, which is detrimental to the whole model.

Sigmoid and Tanh lead to vanishing gradients. In the sigmoid function, the curve flattens out as the input moves away from 0, so the gradient approaches 0; even at its steepest point (input 0) the gradient is only 0.25. In the tanh function, large negative and positive inputs likewise push the curve into its flat regions, which similarly results in a gradient that approaches 0. 
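The flattening can be checked numerically. The short sketch below evaluates the derivatives of both activation functions at a few inputs and then shows how quickly a product of sigmoid derivatives shrinks with depth; the inputs and depths are arbitrary illustrative values, and the weight terms that also enter the real backpropagation product are ignored.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative of the sigmoid: s(z) * (1 - s(z)), at most 0.25 (at z = 0)."""
    s = sigmoid(z)
    return s * (1.0 - s)


def tanh_grad(z):
    """Derivative of tanh: 1 - tanh(z)^2, at most 1 (at z = 0)."""
    return 1.0 - math.tanh(z) ** 2


for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z={z:5.1f}  sigmoid'={sigmoid_grad(z):.6f}  tanh'={tanh_grad(z):.6f}")

# Backpropagation multiplies one such derivative per layer. Even in the best
# case for the sigmoid (every pre-activation at 0), the factor is 0.25 per layer.
for depth in (1, 5, 10, 20):
    print(f"{depth:2d} layers -> best-case sigmoid factor {0.25 ** depth:.3e}")
```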

Q8 (10 points) List all hyperparameters discussed in the class related to Artificial and Deep Neural Networks and explain the role/impact of each hyperparameter. Which technique(s) can be used to perform hyperparameter tuning? Explain how the technique(s) work.  

LIST OF RELEVANT HYPERPARAMETERS:

-Batch size: The number of samples that are processed by the model in each training iteration. It impacts the training efficiency and the convergence behavior of the model.

-Activation function: Adds non-linearity so that the network can model complex relationships and work with a variety of data. The sigmoid, for example, returns a value between 0 and 1, which can be utilized for classification problems.

-Num of epochs: The number of full passes over the training dataset used to train the model and update the weights. Using too many epochs can lead to overfitting, as the model may start to learn noise and outliers in the training data, whereas using too few epochs may result in underfitting because the model doesn't have enough time to learn the patterns in the data.

-Regularization constant: Controls the strength of the penalty that prevents overfitting. A regularization constant that is too large can underfit the model, and one that is too small can leave the model overfitted.

-Learning rate: Determines the step size when updating the weights, that is, how fast the weights are updated. A larger learning rate can make the model converge faster, but one that is too large can overshoot the minimum and diverge, while one that is too small makes training slow.

-Num of hidden layers: The number of layers between the input layer and the output layer. It determines the network's capacity to learn complex patterns in the data. More hidden layers capture intricate relationships but can lead to an overfitted and overly complex model. Too few lead to underfitting and poor performance.

-Num of neurons @ each layer: The size of a specific layer. More neurons give the layer more capacity but also increase the risk of overfitting and the training cost.

-Momentum term: Determines how much the past gradient updates influence the current update. A large momentum keeps the model moving in the direction of previous updates for longer, but it can cause the model to overshoot the global minimum, converge more slowly, and/or oscillate around the solution. If the momentum is too small, the update is less affected by the past updates.


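We can use grid search, scored on a validation set (or with cross-validation), to tune hyperparameters. Grid search trains the model once for every combination of values in a predefined grid and keeps the combination with the best validation performance. Alternatively, instead of searching through all combinations like grid search, you can manually select hyperparameter values and train the model multiple times using different combinations. Then, you can choose the hyperparameters that result in the best performance.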
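The mechanics of grid search can be sketched without any library. In the example below, `validation_score` is a hypothetical stand-in for "train the model with these hyperparameters and score it on the validation set", and the grid values are assumptions chosen for illustration.

```python
from itertools import product


def validation_score(learning_rate, batch_size):
    """Hypothetical stand-in for 'train the model, then score it on the validation set'.

    A real version would build and fit a network with these hyperparameters.
    This toy formula simply peaks at learning_rate=0.01 and batch_size=32.
    """
    return 1.0 - abs(learning_rate - 0.01) * 10 - abs(batch_size - 32) / 1000


def grid_search(param_grid, score_fn):
    """Try every combination in the grid and keep the best-scoring one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


param_grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "batch_size": [16, 32, 64],
}
best_params, best_score = grid_search(param_grid, validation_score)
print(best_params, best_score)
```

The number of models trained is the product of the list lengths (here 3 x 3 = 9), which is why the grid grows expensive quickly as hyperparameters are added.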


Q9 (20 points) Given a dataset with input attributes x1 and x2, and output variable y, you are training a 3 layer neural network. Assume that activation function used in each layer is sigmoid. Mathematically describe one feed-forward pass followed by one backward-pass in terms of updating the weights of each layer in this neural network. 

Since there are two input attributes (x1 and x2), an output y, and a 3 layer neural network, our Layer 1 will be the input layer, Layer 2 will be the hidden layer, and Layer 3 the output layer. 
In the feed-forward pass, the input values are multiplied by their respective weights and added to a bias term. The result is then the input of the sigmoid activation function. Basically it goes from:

input->hidden layer ->  hidden->output layer 

Just like before, the hidden layer's output is multiplied by weights and added with biases for each neuron in the output layer and another sigmoid activation is applied to get the final output of the neural network model.

In the backward pass (backpropagation), the error is calculated at the output layer by comparing the predicted outputs to the actual outputs. The derivative of the sigmoid function appears in the error terms because the chain rule multiplies the error by the derivative of each activation it passes through. The error is then propagated backward from the output layer to the hidden layer (one in this case). The weights and biases between the hidden and output layers, and then those between the input and hidden layers, are updated based on these error terms. The process then repeats for the next pass.

Updating weight of hidden layer (from input $j$ to hidden unit $k$):  $w_{kj} = w_{kj} - \alpha \sum^{N}_{i=1} (\hat{y}_i - y_i) \times \hat{y}_i(1-\hat{y}_i) v_{k} \times o_{ik}(1-o_{ik}) x_{ij}$

Updating weight of output layer:  $v_{k} = v_{k} - \alpha \sum^{N}_{i=1} (\hat{y}_i - y_i) \times \hat{y}_i(1-\hat{y}_i) o_{ik}$
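One forward pass followed by one backward pass can be sketched as below, assuming a squared-error loss $E = \frac{1}{2}\sum_i (\hat{y}_i - y_i)^2$, two hidden units, and sigmoid activations in both layers. The data, starting weights, learning rate, and the bias variables `b_hidden` and `b_out` are hypothetical values added for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w, b_hidden, v, b_out):
    """Feed-forward pass for one sample x = [x1, x2]."""
    # Hidden layer: o_k = sigmoid(sum_j w_kj * x_j + b_k)
    o = [sigmoid(sum(w[k][j] * x[j] for j in range(len(x))) + b_hidden[k])
         for k in range(len(w))]
    # Output layer: y_hat = sigmoid(sum_k v_k * o_k + b_out)
    y_hat = sigmoid(sum(v[k] * o[k] for k in range(len(v))) + b_out)
    return o, y_hat


def backward_step(X, y, w, b_hidden, v, b_out, alpha):
    """One gradient-descent update of E = 0.5 * sum_i (y_hat_i - y_i)^2."""
    grad_w = [[0.0] * len(w[0]) for _ in w]
    grad_bh = [0.0] * len(b_hidden)
    grad_v = [0.0] * len(v)
    grad_bo = 0.0
    for xi, yi in zip(X, y):
        o, y_hat = forward(xi, w, b_hidden, v, b_out)
        # Output delta: (y_hat - y) * y_hat * (1 - y_hat)
        delta_out = (y_hat - yi) * y_hat * (1.0 - y_hat)
        grad_bo += delta_out
        for k in range(len(v)):
            grad_v[k] += delta_out * o[k]
            # Hidden delta: delta_out * v_k * o_k * (1 - o_k)
            delta_hidden = delta_out * v[k] * o[k] * (1.0 - o[k])
            grad_bh[k] += delta_hidden
            for j in range(len(xi)):
                grad_w[k][j] += delta_hidden * xi[j]
    # All gradients use the old weights; the update is applied afterwards.
    new_w = [[w[k][j] - alpha * grad_w[k][j] for j in range(len(w[0]))]
             for k in range(len(w))]
    new_bh = [b - alpha * g for b, g in zip(b_hidden, grad_bh)]
    new_v = [vk - alpha * g for vk, g in zip(v, grad_v)]
    new_bo = b_out - alpha * grad_bo
    return new_w, new_bh, new_v, new_bo


def total_error(X, y, w, b_hidden, v, b_out):
    return 0.5 * sum((forward(xi, w, b_hidden, v, b_out)[1] - yi) ** 2
                     for xi, yi in zip(X, y))


# Hypothetical data and starting weights
X = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
y = [1.0, 1.0, 0.0]
w = [[0.1, -0.2], [0.4, 0.3]]
b_hidden = [0.0, 0.0]
v = [0.2, -0.3]
b_out = 0.0

before = total_error(X, y, w, b_hidden, v, b_out)
w, b_hidden, v, b_out = backward_step(X, y, w, b_hidden, v, b_out, alpha=0.5)
after = total_error(X, y, w, b_hidden, v, b_out)
print("error before:", before, "error after:", after)
```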


Q10 (14 points) In this exercise, use the output information generated by this code to perform a comparative study of the performance (i.e., loss , Accuracy) of the neural networks models based on the hyperparameters used. Generate a table to report your analysis.
Note: If you want to run this code and have trouble with imported libraries, try 'pip install keras==2.12.0'

In [7]:
%pip install keras==2.12.0
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

# Load dataset
digits = load_digits()
X, y = digits.data, digits.target

# Preprocess data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Define create_model function for KerasClassifier
def create_model(num_layers=1, num_neurons=64, activation='relu', dropout_rate=0.0, momentum=0.9):
    model = Sequential()
    model.add(Dense(num_neurons, input_dim=X_scaled.shape[1], activation=activation))
    model.add(Dropout(dropout_rate))
    for _ in range(num_layers - 1):
        model.add(Dense(num_neurons, activation=activation))
        model.add(Dropout(dropout_rate))
    model.add(Dense(10, activation='softmax'))
    optimizer = SGD(learning_rate=0.01, momentum=momentum)
    model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model

# Define parameters grid for grid search
param_grid = {
    'num_layers': [3],
    'num_neurons': [32, 64],
    'activation': ['relu', 'tanh'],
    'dropout_rate': [0.2, 0.5],
    'momentum': [0.5, 0.9]
}

# Create KerasClassifier wrapper for scikit-learn
model = KerasClassifier(build_fn=create_model, epochs=10, batch_size=32)

# Perform grid search with cross-validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=kfold, scoring='accuracy')
grid_result = grid_search.fit(X_scaled, y)

# Print results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, std, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, std, param))



| Activation | Dropout Rate | Momentum | # Layers   | # Neurons   | Mean Accuracy | Std Deviation |
|------------|--------------|----------|------------|-------------|---------------|---------------|
| relu       | 0.2          | 0.5      | 3          | 32          | 0.856476      | 0.037973      |
| relu       | 0.2          | 0.5      | 3          | 64          | 0.919864      | 0.012134      |
| relu       | 0.2          | 0.9      | 3          | 32          | 0.945461      | 0.008758      |
| relu       | 0.2          | 0.9      | 3          | 64          | 0.968830      | 0.007578      |
| relu       | 0.5          | 0.5      | 3          | 32          | 0.636066      | 0.043534      |
| relu       | 0.5          | 0.5      | 3          | 64          | 0.821951      | 0.028709      |
| relu       | 0.5          | 0.9      | 3          | 32          | 0.865873      | 0.035402      |
| relu       | 0.5          | 0.9      | 3          | 64          | 0.941006      | 0.015917      |
| tanh       | 0.2          | 0.5      | 3          | 32          | 0.913729      | 0.012792      |
| tanh       | 0.2          | 0.5      | 3          | 64          | 0.936001      | 0.013650      |
| tanh       | 0.2          | 0.9      | 3          | 32          | 0.960477      | 0.010216      |
| tanh       | 0.2          | 0.9      | 3          | 64          | 0.966605      | 0.008465      |
| tanh       | 0.5          | 0.5      | 3          | 32          | 0.864214      | 0.021556      |
| tanh       | 0.5          | 0.5      | 3          | 64          | 0.900379      | 0.007878      |
| tanh       | 0.5          | 0.9      | 3          | 32          | 0.942683      | 0.009247      |
| tanh       | 0.5          | 0.9      | 3          | 64          | 0.958256      | 0.007508      |

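On average, the tanh activation function yields higher accuracy and a lower standard deviation than relu, although the single best configuration in the table is relu with a dropout rate of 0.2, momentum 0.9, and 64 neurons (0.968830). Relu is more sensitive to the dropout rate than tanh: its accuracy falls much more sharply when the dropout rate rises from 0.2 to 0.5. In every matching configuration, the dropout rate of 0.2 has better accuracy than the dropout rate of 0.5. Likewise, a momentum of 0.9 beats 0.5 and 64 neurons beat 32 in every matching configuration.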
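The averages behind this comparison can be recomputed from the table. The sketch below copies the mean accuracies from the table rows and groups them by each hyperparameter; the helper name `mean_by` is an assumption for illustration.

```python
from collections import defaultdict

# (activation, dropout_rate, momentum, num_neurons, mean_accuracy), copied from the table
results = [
    ("relu", 0.2, 0.5, 32, 0.856476),
    ("relu", 0.2, 0.5, 64, 0.919864),
    ("relu", 0.2, 0.9, 32, 0.945461),
    ("relu", 0.2, 0.9, 64, 0.968830),
    ("relu", 0.5, 0.5, 32, 0.636066),
    ("relu", 0.5, 0.5, 64, 0.821951),
    ("relu", 0.5, 0.9, 32, 0.865873),
    ("relu", 0.5, 0.9, 64, 0.941006),
    ("tanh", 0.2, 0.5, 32, 0.913729),
    ("tanh", 0.2, 0.5, 64, 0.936001),
    ("tanh", 0.2, 0.9, 32, 0.960477),
    ("tanh", 0.2, 0.9, 64, 0.966605),
    ("tanh", 0.5, 0.5, 32, 0.864214),
    ("tanh", 0.5, 0.5, 64, 0.900379),
    ("tanh", 0.5, 0.9, 32, 0.942683),
    ("tanh", 0.5, 0.9, 64, 0.958256),
]


def mean_by(column_index):
    """Average the mean accuracy over all rows sharing a value in one column."""
    groups = defaultdict(list)
    for row in results:
        groups[row[column_index]].append(row[4])
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


for name, idx in [("activation", 0), ("dropout_rate", 1),
                  ("momentum", 2), ("num_neurons", 3)]:
    print(name, {k: round(v, 4) for k, v in mean_by(idx).items()})

best = max(results, key=lambda row: row[4])
print("best configuration:", best)
```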