# Final Exam - Practice Problems SOLUTIONS

# Problem 1 - Backpropagation

**Use the figure below and the chain rule of derivatives to show how the weights $w_i$ can be trained through a sigmoid unit ($f(\bullet)$ is a sigmoid) using the gradient of any smooth non-negative cost function $J=C(e)$.**

<div><img src="net.png" width="400"></div>

<div><img src="net_solution.png" width="400"></div>
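The solution figure carries the derivation. As a hedged restatement in symbols, assume (these names are not confirmed by the figure) that the unit computes $v=\sum_i w_i x_i$ and $y=f(v)$, that the error is $e=d-y$ for a desired output $d$, and that $\eta$ is a learning rate. Under those assumptions, the chain rule would give:

\begin{align}
\frac{\partial J}{\partial w_i} &= \frac{\partial C}{\partial e}\,\frac{\partial e}{\partial y}\,\frac{\partial y}{\partial v}\,\frac{\partial v}{\partial w_i} = C'(e)\,(-1)\,f'(v)\,x_i \\
f'(v) &= f(v)\left(1-f(v)\right) \quad \text{(sigmoid)} \\
w_i^{(t+1)} &= w_i^{(t)} - \eta \frac{\partial J}{\partial w_i} = w_i^{(t)} + \eta\, C'(e)\, f(v)\left(1-f(v)\right) x_i
\end{align}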

# Problem 2 - Neural Network

**Consider the following two-dimensional, two-class data set plotted in the figure below. Design a neural network that would obtain 100% classification accuracy on this data set. Be sure to define the network architecture and all parameters precisely (give exact numbers). Use a hard limit activation function as defined in the following equation:**

$$f(x) = \begin{cases} 0 & \text{if } x\leq 0 \\ 1 & \text{if } x>0 \end{cases}$$

**Show and explain your work.**

<div><img src="data.png" width="300"></div>

We first need to determine the decision boundaries that separate class 0 from class 1. The linear decision boundaries are shown in the figure below:

<div><img src="data_solution.png" width="300"></div>
    
The bottom line, $L_1$, passes through points $(10,0)$ and $(0,5)$. The top line, $L_2$, passes through points $(20,0)$ and $(0,10)$.
    
These lines have the approximate equations:

$$L_1: y = -\tfrac{1}{2}x+5 \iff x + 2y - 10 = 0$$

and

$$L_2: y = -\tfrac{1}{2}x+10 \iff x + 2y - 20 = 0$$

The network architecture needed is a 2-2-1: 2 units in the input layer, 2 units in the single hidden layer and 1 unit in the output layer.

Using the equations above, we can fill in the weights and biases for the connection input-hidden layers.
    
The final network architecture with all the parameters is:

<div><img src="net_problem2.png" width="700"></div>
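As a rough check of the 2-2-1 design, the sketch below builds each hidden unit from the two points given for its line and combines them in the output unit. It assumes, without confirmation from the figure, that class 1 is the band between the two lines; the helper names (`line_through`, `network`) and the output weights are illustrative choices, not the values in the figure.

```python
def hard_limit(v):
    """Hard limit activation: 0 if v <= 0, 1 if v > 0."""
    return 1 if v > 0 else 0


def line_through(p, q):
    """Coefficients (a, b, c) of a*x + b*y + c = 0 through points p and q,
    oriented so that the origin lies on the negative side."""
    (x1, y1), (x2, y2) = p, q
    a, b, c = y1 - y2, x2 - x1, x1 * y2 - x2 * y1
    if c > 0:  # flip the sign so that the origin evaluates to a negative value
        a, b, c = -a, -b, -c
    return a, b, c


L1 = line_through((10, 0), (0, 5))    # bottom line
L2 = line_through((20, 0), (0, 10))   # top line


def network(x, y):
    h1 = hard_limit(L1[0] * x + L1[1] * y + L1[2])  # 1 if the point is above L1
    h2 = hard_limit(L2[0] * x + L2[1] * y + L2[2])  # 1 if the point is above L2
    # ASSUMPTION: class 1 is the band between the lines (above L1, not above L2).
    return hard_limit(h1 - h2 - 0.5)
```

If class 1 is instead the region outside the band, the output unit could use `hard_limit(-h1 + h2 + 0.5)`.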

# Problem 3 - Random Forests

**What is bootstrapping and how does it relate to random forest classifiers? Be precise (use pseudo-code).**

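Bootstrapping is a resampling technique: given a data set with $N$ points, a bootstrap sample is created by drawing $N$ points from it uniformly at random, with replacement. Such resamples are often used to test a hypothesis or to estimate the variability of a statistic.

Random forests make use of bootstrapping. Every individual learner in a random forest, a decision tree, is trained on its own bootstrap sample of the training data.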
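Since the question asks for pseudo-code, here is a minimal Python sketch of the idea. `fit_tree` is a hypothetical callable (bootstrap sample in, fitted tree out) that the caller would supply, and a fitted tree is assumed to be a callable mapping an input to a label. The random selection of features at each split, which random forests also use, is not shown.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]


def train_forest(data, n_trees, fit_tree, seed=0):
    """Fit n_trees learners, each on its own bootstrap sample of data.

    fit_tree is a hypothetical callable: sample -> fitted tree.
    """
    rng = random.Random(seed)
    return [fit_tree(bootstrap_sample(data, rng)) for _ in range(n_trees)]


def predict_forest(trees, x):
    """Majority vote over the predictions of the individual trees."""
    votes = [tree(x) for tree in trees]
    return max(set(votes), key=votes.count)
```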

# Problem 4 - K-NN

**This question has two parts.**

1. **Given the following training data set, what is the predicted class label of the test point $[2, 2]$ if you are using a K-NN classifier with squared Euclidean distance as the distance metric and $K = 3$?**

<div><img src="K-NN.png" width="300"></div>

2. **What is the predicted class label of the test point $[100, 100]$ if you are using a K-NN classifier with squared Euclidean distance as the distance metric and $K=3$? Given the training data, would you trust this result? Why or why not?**

**Solution to part 1**

Using squared Euclidean distance and $K=3$ neighbors, the test point $[2,2]$ is assigned to class cross (x), because all 3 of its closest training points belong to class cross (x).

**Solution to part 2**

Similarly, the test point $[100,100]$ is assigned to class cross (x). However, this point falls far outside the region covered by the training data, so the distances to its neighbors are very large. We can use this distance information to assess our confidence in the class assignment of each test point: if the distances are short, we can be confident in the decision; if they are large, we should not trust it.
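The sketch below shows one way to return the neighbor distances alongside the predicted label so that they can serve as a confidence cue. The training points are hypothetical (the real ones appear only in the figure) and were chosen merely so that both test points land in class x.

```python
from collections import Counter


def knn_predict(train, query, k=3):
    """Return (label, distances) of the k nearest training points,
    using squared Euclidean distance."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    nearest = sorted(train, key=lambda item: sq_dist(item[0], query))[:k]
    label = Counter(lab for _, lab in nearest).most_common(1)[0][0]
    return label, [sq_dist(p, query) for p, _ in nearest]


# HYPOTHETICAL training set, not the data from the figure.
train = [((2, 3), "x"), ((3, 2), "x"), ((3, 3), "x"), ((4, 4), "x"),
         ((0, 0), "o"), ((0, 1), "o"), ((1, 0), "o")]

label_near, d_near = knn_predict(train, (2, 2))
label_far, d_far = knn_predict(train, (100, 100))
```

Both calls return the same label, but the distances for `(100, 100)` are several orders of magnitude larger than those for `(2, 2)`, which is the signal that the second prediction should not be trusted.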

# Problem 5 - Feed-forward Neural Network

**Suppose you have the following network with all activation functions equal to the hard limit defined in the following equation:**

$$f(x) = \begin{cases} 0 & \text{if }x\leq 0 \\ 1 & \text{if } x>0  \end{cases}$$

**Given a test point with the value of $[1, -2]$ what is the output of the network? Show your work.**

<div><img src="ANN.png" width="500"></div>

Let's call the output at neuron A $y_A$, and the output at neuron B $y_B$. So the output equations for each unit in the network are given as:

\begin{align}
y_A &= f\left(x_1 +0.25 x_2 - 0.25\right) \\
y_B &= f\left(0.5x_1 + 0.1 x_2 +0.9\right)\\
y_1 &= f\left(y_A + 0.5 y_B -1.35\right)\\
y_2 &= f\left(2y_A + 0.2\right)
\end{align}

where $f(x) = \begin{cases} 0 & \text{if }x\leq 0 \\ 1 & \text{if } x>0  \end{cases}$.

For the test point $[1,-2]$: 

\begin{align}
y_A &= f(1 -0.5 - 0.25) = 1\\
y_B &= f(0.5 - 0.2 + 0.9) = 1\\ 
y_1 &= f(1 + 0.5 - 1.35) = 1\\ 
y_2 &= f(2 + 0.2) = 1
\end{align}

The output of the network is $[1,1]$.
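The same forward pass can be checked with a few lines of Python. The function names are illustrative; the weights and biases are the ones in the output equations above.

```python
def hard_limit(v):
    """Hard limit activation: 0 if v <= 0, 1 if v > 0."""
    return 1 if v > 0 else 0


def forward(x1, x2):
    """Forward pass through the network of Problem 5."""
    y_a = hard_limit(x1 + 0.25 * x2 - 0.25)
    y_b = hard_limit(0.5 * x1 + 0.1 * x2 + 0.9)
    y_1 = hard_limit(y_a + 0.5 * y_b - 1.35)
    y_2 = hard_limit(2 * y_a + 0.2)
    return [y_1, y_2]


output = forward(1, -2)
print(output)  # [1, 1]
```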

# Problem 6 - Gradient Descent

**Gradient descent is a general optimization approach where parameters of interest are iteratively updated. Does gradient descent ensure finding the global optima? Why or why not? What effect does the learning rate have on gradient descent?**

No. The resulting solution depends on the initialization: given an appropriate learning rate, gradient descent moves to a nearby local optimum, which is not necessarily the global one.

If the learning rate is too large, a gradient descent step can jump over an optimum (possibly causing oscillation or divergence). If the learning rate is too small, it will take a long time to converge to a local optimum.
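A small numerical illustration of both points, using a made-up non-convex cost $f(x) = x^4 - 3x^2 + x$ (not from the course material) with two local minima, near $x \approx -1.30$ (the lower one) and $x \approx 1.13$. The step sizes and starting points are likewise arbitrary choices.

```python
def grad(x):
    """Derivative of the illustrative cost f(x) = x**4 - 3*x**2 + x."""
    return 4 * x * x * x - 6 * x + 1


def gradient_descent(x0, lr, steps):
    """Plain gradient descent on f, starting from x0."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x


# Same learning rate, different initializations: different local minima.
x_left = gradient_descent(-2.0, lr=0.01, steps=1000)
x_right = gradient_descent(2.0, lr=0.01, steps=1000)

# A learning rate that is too large overshoots and diverges within a few steps.
x_big = gradient_descent(2.0, lr=0.5, steps=5)
```

Starting at $-2$ reaches the lower minimum, while starting at $2$ gets stuck in the higher one, and the run with the large learning rate moves away from both.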

# Problem 7 - Online vs Batch Learning

**The following three questions are about online vs batch learning of MLPs.**

1. **Briefly define Online learning and provide pseudo-code to describe how it can be implemented.**

2. **Briefly define Batch learning and provide pseudo-code to describe how it can be implemented.**

3. **Briefly compare and contrast online vs batch learning. What are their relative advantages/disadvantages? When would you use one over the other?**

4. **Will any of these training approaches find the globally best parameter settings for a network? Why or why not?**

Batch learning uses all of the training data to compute each parameter update, so an epoch is equivalent to one iteration.

Mini-batch learning uses a subset of the training data for each parameter update. An epoch loops through all mini-batches to cover all training data points.

Online learning uses a single data point to update the parameters at each iteration. An epoch loops through all training data points.

In neural networks, all of these approaches still rely on gradient descent. So, no, finding the global optimum is not guaranteed.
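The three update schemes differ only in how many points feed each update, which the following sketch makes explicit. To stay short it fits a one-weight linear model $y = wx$ rather than an MLP and averages the gradient over each batch; the toy data, learning rate, and epoch count are assumptions for illustration.

```python
import random


def train(data, batch_size, lr=0.1, epochs=200, seed=0):
    """Fit y = w * x by gradient descent on the squared error.

    batch_size == len(data) -> batch learning (one update per epoch)
    batch_size == 1         -> online learning (one update per data point)
    anything in between     -> mini-batch learning
    """
    rng = random.Random(seed)
    data = list(data)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the points in a new order each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of 0.5 * mean((w*x - d)**2) over the batch, w.r.t. w.
            g = sum((w * x - d) * x for x, d in batch) / len(batch)
            w -= lr * g
    return w


# Toy data (hypothetical), generated from d = 2 * x.
data = [(x, 2.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]
w_batch = train(data, batch_size=len(data))
w_mini = train(data, batch_size=2)
w_online = train(data, batch_size=1)
```

On this noise-free convex problem all three variants recover $w \approx 2$; the differences in noise and cost per update matter more on larger, non-convex problems such as MLP training.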

# Problem 8 - Model

**Suppose you have the following training data set and suppose you expect your training data set to be representative of your test data. Of all of the methods we covered in the course, which would you use to classify this data into black x vs. red o? Why? Justify your answer.**

<div><img src="model.png" width="300"></div>

The first observation is that the classes in this data set are not linearly separable. Therefore, linear models such as LDA and the Perceptron will not succeed at learning these classes.

An MLP can be used for this data. A 2-19-1 architecture can work for this data set: the roughly 19 units in the hidden layer represent the 19 linear boundaries that need to be placed diagonally to separate the regions of class 'x' from those of class 'o'.

The k-NN model is also able to learn such spaces, but it is highly prone to overfitting. Furthermore, empty regions without any training samples (e.g., around point ~[9,3]) may cause it to perform poorly on test data.

A model that is well suited to this type of problem is the SVM. An SVM equipped with an RBF kernel implicitly maps this data into a higher-dimensional space where the classes are linearly separable.

# Problem 9 - Neural Network

**Consider the following two-dimensional data set and desired values for a two-class classification problem:**

|  $x_1$  |  $x_2$  |    d    |
|---------|---------|---------|
|    0    |    0    |    0    |
|  -0.01  |   0.01  |    0    |
|   0.05  |   0.05  |    0    |
|    0    |    1    |    1    |
|  -0.01  |   1.05  |    1    |
|   0.01  |   0.99  |    1    |
|    1    |    0    |    1    |
|   1.05  |  -0.06  |    1    |
|   1.01  |   0.04  |    1    |
|    1    |    1    |    0    |
|   1.01  |   1.02  |    0    |
|   0.98  |   0.99  |    0    |

**Define a neural network structure and associated parameter values that can solve this classification problem (with zero error on this data set).**

This data set can be seen in the figure below:

<div><img src="problem9_solution.png" width="500"></div>

This data set is a noisy XOR problem. Therefore, we need an MLP with one hidden layer to solve it. I chose the two linear boundaries depicted in the figure above, but other choices are also possible.

The equations for these boundaries are:

$$L_1: x_2 = x_1 + 0.5$$

$$L_2: x_2 = x_1 - 0.5$$
    
The final network architecture with all parameters is as follows:
    
<div><img src="problem9_net.png" width="700"></div>
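One possible set of parameters built from $L_1$ and $L_2$ can be checked against the table in a few lines. This sketch assumes the hard limit activation from Problems 2 and 5, and the specific weights below are one valid choice, not necessarily those shown in the figure.

```python
def hard_limit(v):
    """Hard limit activation: 0 if v <= 0, 1 if v > 0."""
    return 1 if v > 0 else 0


def xor_net(x1, x2):
    h1 = hard_limit(-x1 + x2 - 0.5)   # 1 above L1 (x2 > x1 + 0.5)
    h2 = hard_limit(x1 - x2 - 0.5)    # 1 below L2 (x2 < x1 - 0.5)
    return hard_limit(h1 + h2 - 0.5)  # logical OR of the two hidden units


# (x1, x2, d) rows copied from the table above.
DATA = [
    (0, 0, 0), (-0.01, 0.01, 0), (0.05, 0.05, 0),
    (0, 1, 1), (-0.01, 1.05, 1), (0.01, 0.99, 1),
    (1, 0, 1), (1.05, -0.06, 1), (1.01, 0.04, 1),
    (1, 1, 0), (1.01, 1.02, 0), (0.98, 0.99, 0),
]

predictions = [xor_net(x1, x2) for x1, x2, _ in DATA]
errors = sum(p != d for p, (_, _, d) in zip(predictions, DATA))
```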

# Problem 10 - Regularization in Neural Networks

**Assume you have a network with a single output neuron. Assume you would like to use gradient descent to minimize the objective function and you want to attempt to prevent overtraining through weight decay regularization (L2-norm). Derive the gradient descent update equation for a weight at the output layer neuron by minimizing the following objective function:**

$$J = \frac{1}{2} \sum_{i=1}^N e_i^2 + \lambda R(w)$$

**where $e_i$ is the error given the current network parameters and the desired output, $d_i$, for data point $x_i$, and $R(w)$ is the weight decay term. Keep your result general and applicable for any activation function, $\phi(\bullet)$. Clearly define any notation you use.**

The weight decay regularization term corresponds to the L2-norm penalty.

The cost function with the L2 penalty on the weights at the output layer, $w_j$, is given as follows:

$$J = J_e + \lambda J_r = \frac{1}{2}\sum_{i=1}^N e_i^2 + \lambda \sum_{j=1}^M w_j^2$$

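The gradient descent update for the weights at the output layer is given as follows:

\begin{align}
w_j^{(t+1)} &= w_j^{(t)} - \eta \frac{\partial J}{\partial w_j} \\
w_j^{(t+1)} &= w_j^{(t)} - \eta \left(\frac{\partial J_e}{\partial w_j} + \lambda\frac{\partial J_r}{\partial w_j} \right)\\
w_j^{(t+1)} &= w_j^{(t)} - \eta \left(-\sum_{i=1}^N e_i\,\phi'(v_i)\,y_{j,i} + 2\lambda w_j^{(t)} \right)
\end{align}

where $\eta$ is the learning rate, $y_{j,i}$ is the $j$-th input to the output neuron for data point $x_i$, $v_i = \sum_{j=1}^M w_j y_{j,i}$ is the neuron's pre-activation, and $e_i = d_i - \phi(v_i)$ is the error.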
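The update can be sanity-checked numerically. The sketch below assumes a sigmoid for $\phi$ (the derivation itself holds for any differentiable activation), takes the error as $e_i = d_i - \phi(v_i)$, and sums the error term over all $N$ data points. All function names are illustrative.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def cost(w, inputs, targets, lam):
    """J = 0.5 * sum_i e_i**2 + lam * sum_j w_j**2, with e_i = d_i - phi(v_i)."""
    j_e = 0.0
    for y, d in zip(inputs, targets):
        v = sum(wj * yj for wj, yj in zip(w, y))
        j_e += 0.5 * (d - sigmoid(v)) ** 2
    return j_e + lam * sum(wj ** 2 for wj in w)


def gradient(w, inputs, targets, lam):
    """dJ/dw_j = -sum_i e_i * phi'(v_i) * y_ji + 2 * lam * w_j."""
    g = [2.0 * lam * wj for wj in w]
    for y, d in zip(inputs, targets):
        v = sum(wj * yj for wj, yj in zip(w, y))
        out = sigmoid(v)
        e = d - out
        dphi = out * (1.0 - out)  # derivative of the sigmoid at v
        for j, yj in enumerate(y):
            g[j] -= e * dphi * yj
    return g


def update(w, inputs, targets, lam, eta):
    """One gradient descent step with weight decay."""
    g = gradient(w, inputs, targets, lam)
    return [wj - eta * gj for wj, gj in zip(w, g)]
```

Comparing `gradient` against a finite-difference estimate of `cost` is a quick way to confirm the signs of both terms.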

# Problem 11 - Unknown Class

**Suppose you trained an MLP using standard backpropagation on a training set that contained two classes. Suppose, you designed your MLP to have a single output neuron and you trained the MLP to map the input data to the class labels -1 or 1. During test, data from a third class was put through your trained network for classification. Is it possible for the MLP you trained to be able to identify and correctly classify data from the third class? Why or Why not? What strategies can you employ during training to guard against this situation?**

No. The network never saw the third class during training, so it has not learned to generalize to it: its single output maps any input toward -1 or 1 and will not give a result that discriminates the third class from class 1 and class 2.

However, we can threshold the network output so that only confident predictions are assigned to a class. In this way, we can treat uncertain samples as belonging to an unknown class.
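A minimal sketch of such a reject option, assuming the network output lies in $[-1, 1]$ (e.g., a tanh output unit). The function name and the cut-off of 0.5 are hypothetical and would need to be tuned on validation data.

```python
def classify_with_reject(output, threshold=0.5):
    """Map a network output in [-1, 1] to -1, 1, or None (unknown class).

    threshold is a hypothetical confidence cut-off, not a value from the course.
    """
    if output >= threshold:
        return 1
    if output <= -threshold:
        return -1
    return None  # too close to the decision boundary: treat as unknown


labels = [classify_with_reject(o) for o in (0.93, -0.81, 0.12)]
```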

# Problem 12 - Deep Learning

**What is the difference between Deep Learning and Machine Learning? When would you use one or the other?**

Machine Learning refers to the process of having a machine learn patterns in data. Machine Learning is typically accompanied by the design of hand-crafted features.

Deep Learning is a subfield of Machine Learning. Deep Learning uses deep neural network architectures that generate their own features. Therefore, Deep Learning bypasses the need for manual feature extraction; instead, it learns the features necessary to discriminate the data directly from the data.

# Problem 13 - Convolutional Neural Networks

1. **How are convolutional neural networks different from MLPs?**

2. **Why are they a class of deep learning models?**

3. **Discuss advantages and disadvantages of training a CNN for data classification.**

1. CNNs are weight-shared networks, so the parameter space of a CNN is much smaller than that of an MLP. In addition, CNNs have convolutional layers that employ convolution operations to extract particular features. An MLP could in principle also learn a convolution operation, but without any weight-sharing restrictions this is very unlikely.

2. CNNs are a class of deep learning models because they generate their own features. CNNs are also typically deep architectures.

3. A CNN has a clear advantage over an MLP in that features can be learned without the need for user input. The CNN parameter space is greatly reduced, which helps prevent overfitting. On the other hand, CNNs need a great amount of data to learn *important* features.

# Problem 14 - Pipeline

**Suppose you are interested in performing classification of a data set with $M$ classes.**

1. **Describe the standard pipeline approach of a Machine Learning algorithm. List all steps and their functions.**

2. **Describe the standard pipeline approach of a Deep Learning algorithm. List all steps and their functions.**

1. **Step 1**: Data Acquisition. **Step 2**: Feature Generation. Includes feature extraction and/or feature selection, i.e., the hand-crafting of features that will be used to discriminate between classes. **Step 3**: Pre-processing. Includes data scaling and data transformation. **Step 4**: Model Selection. Involves selecting a model and its hyper-parameters. **Step 5**: Hyper-parameter tuning. Involves training the model and fine-tuning its hyper-parameters for a given data set. **Step 6**: Model Evaluation.

2. **Step 1**: Data Acquisition. **Step 2**: Pre-processing. Includes data scaling and data transformation. **Step 3**: Model Selection. Involves selecting a model and its hyper-parameters. **Step 4**: Hyper-parameter tuning. Involves training the model and fine-tuning its hyper-parameters for a given data set. **Step 5**: Model Evaluation.

# Problem 15 - Avoiding Overfitting

**What are some strategies that we can use to avoid overfitting while training a neural network? Provide a brief description for all items listed.**

While training a neural network, there are quite a few strategies that can be used to avoid overfitting. They include:

* Adding more data.
* Cross-validation. Make use of a validation set during training to check the network's generalization ability.
* Regularization. We can add a regularization term to the cost function that penalizes large weights.
* Early Stopping. Training can be stopped when the overall error/accuracy of the model has not improved for a certain number of epochs.
* Batch normalization. At each layer, we can normalize the output values before feeding them to the next layer. This keeps the output values on a consistent scale (roughly zero mean and unit variance over each mini-batch).
* Boosting. Assign weights to *difficult* samples.
* Bagging. Train the same model using multiple Bootstrap samples.

# Problem 16 - Accelerated Learning & Adaptive Learning

**What are some strategies that we can use to accelerate the learning process in a neural network? Provide a brief description for all items listed.**

The learning process with gradient descent can be accelerated using a momentum term. Typical learning strategies are Stochastic Gradient Descent (SGD) with momentum or SGD with Nesterov's momentum.

We can also adaptively change the learning rate value. In addition to adding a momentum term to speed up learning, the ADAM algorithm also adaptively changes the value of the learning rate as needed.
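To see the effect of a momentum term in isolation, the sketch below runs full-gradient descent with and without classical momentum on a made-up, poorly conditioned quadratic cost $f(x, y) = \tfrac{1}{2}(x^2 + 100y^2)$. The cost function, learning rate, and momentum coefficient are illustrative assumptions; Nesterov's variant and ADAM are not shown.

```python
def grad(p):
    """Gradient of the illustrative cost f(x, y) = 0.5 * (x**2 + 100 * y**2)."""
    x, y = p
    return (x, 100.0 * y)


def cost(p):
    x, y = p
    return 0.5 * (x * x + 100.0 * y * y)


def descend(p, lr, steps, momentum=0.0):
    """Gradient descent with an optional classical momentum term:
    v <- momentum * v - lr * grad(p);  p <- p + v
    """
    v = (0.0, 0.0)
    for _ in range(steps):
        g = grad(p)
        v = tuple(momentum * vi - lr * gi for vi, gi in zip(v, g))
        p = tuple(pi + vi for pi, vi in zip(p, v))
    return p


start = (1.0, 1.0)
plain = descend(start, lr=0.01, steps=100)
with_momentum = descend(start, lr=0.01, steps=100, momentum=0.9)
```

With the same learning rate and number of steps, the run with momentum ends at a lower cost, because the accumulated velocity speeds up progress along the flat $x$ direction.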

# Problem 17 - Take your pick

**Suppose a group of friends are going to train a neural network for the first time to perform classification, and they ask you (the ML expert) which network architecture they should use in order to avoid overfitting. Would you recommend them to use a single hidden layer network with a large number of units in the hidden layer, or would you recommend them to use a network with several layers (e.g., ten layers)? Why?**

A deeper network is more prone to overfitting, as adding layers creates a more complex feature representation.

In order to avoid overfitting, a shallow network is preferred.