# **Homework 3 Solutions**

Due:

## **Written Assignment**

### **Gradient Descent**

Consider using gradient descent to find the minimum of $f$, where,
- $f$ is a convex function over the closed interval \[-*b*,*b*\], *b* > 0
- $f'$ is the derivative of $f$
- $\alpha$ is some positive number which will represent a learning rate parameter

The steps of gradient descent are as follows:

- Start at $x_{0} = 0$
- At each step, set $x_{t+1} = x_{t} - \alpha f'(x_{t})$
- If $x_{t+1}$ falls below  -*b*, set it to -*b*, and if it goes above *b*, set it to *b*.
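
The update-and-clip rule above can be sketched in a few lines of Python. This is only an illustration, not part of the assignment: the helper name `clipped_gradient_descent` and the example objective $f(x) = (x - 0.5)^2$ are assumptions chosen for demonstration.

```python
def clipped_gradient_descent(f_prime, alpha, b, steps):
    """Run gradient descent from x_0 = 0, clipping each iterate to [-b, b].

    f_prime: derivative of the (convex) objective
    Returns the list of iterates [x_0, x_1, ..., x_steps].
    """
    xs = [0.0]
    for _ in range(steps):
        x_next = xs[-1] - alpha * f_prime(xs[-1])
        # Project back onto the interval [-b, b]
        x_next = max(-b, min(b, x_next))
        xs.append(x_next)
    return xs


# Hypothetical example: f(x) = (x - 0.5)^2, so f'(x) = 2 (x - 0.5)
xs = clipped_gradient_descent(lambda x: 2 * (x - 0.5), alpha=0.1, b=1, steps=100)
print(xs[-1])
```

For this well-behaved example the iterates approach the minimizer $x = 0.5$; the questions below ask for functions where the same procedure behaves much worse.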

We say that an optimization algorithm (such as gradient descent)
*$\epsilon$-converges* if, at some point, $x_{t}$ stays within  $\epsilon$ of
the true minimum. Formally, we have *$\epsilon$-convergence* at time $t$ if

$\quad \quad |x_{t'}-x_{\min}| \le \epsilon \quad \text{for all } t' \geq t, \quad \text{where } x_{\min}=\underset{x \in [-b,b]}{argmin} f(x)$.

### **Question 1**
For $\alpha$ = 0.1, *b* = 1, and $\epsilon$ = 0.001, find a convex function $f$ so that running gradient descent does not  $\epsilon$-converge.
Specifically, make it so that *x*<sub>0</sub> = 0,
$x_1$ = *b*,  $x_2$ =  - *b*,  $x_3$ = *b*,  $x_4$ =  - *b*, etc. 

**Solution:**

It suffices to choose a convex $f$ with $f'(0) \le -10$, $f'(1) \ge 20$, and $f'(-1) \le -20$: with $\alpha = 0.1$, every step then overshoots the opposite endpoint and is clipped.  
Consider $f(x) = 100(x-1/2)^2$.

$f'(x) = 200 (x-1/2) \\\\
    f'(0) = -100 \\\\
    f'(1) = 100 \\\\
    f'(-1) = -300$

The first step will shoot to $x_1$ = 0 - .1(-100) = 10,
which is clipped back to 1.  
The second step will shoot to $x_2$ = 1 - .1(100) =  - 9,
which is clipped back to -1.  
The third step will shoot to $x_3$ =  - 1 - .1(-300) = 29,
which is clipped back to 1.  
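
As a sanity check, the iterates can be reproduced numerically. The snippet below is a sketch that hard-codes $f'(x) = 200(x - 1/2)$ with $\alpha = 0.1$ and $b = 1$; the variable names are illustrative.

```python
alpha, b = 0.1, 1.0

def f_prime(x):
    # Derivative of f(x) = 100 (x - 1/2)^2
    return 200 * (x - 0.5)

x = 0.0
iterates = [x]
for _ in range(4):
    x = x - alpha * f_prime(x)
    x = max(-b, min(b, x))  # clip to [-b, b]
    iterates.append(x)

print(iterates)
```

After the first step the iterates alternate between $b$ and $-b$, so they never stay within $\epsilon$ of the true minimizer $x_{\min} = 1/2$.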

### **Question 2**
For *$\alpha$* = 0.1, *b* = 1, and  $\epsilon$ = 0.001, find a convex function $f$
so that gradient descent does *$\epsilon$-converge*, but only after at least
10,000 steps. 

**Solution:** 
 
Consider $f(x) = 0.0001x$. Within the range
\[-*b*,*b*\] = \[-1,1\], this function is minimized at
$x_{min}$ =  - 1. Under gradient descent, each
$x_{t+1}$ is 0.1 \* 0.0001 = 0.00001
smaller than $x_{t}$. Since $x_0$ starts at 0,
after 100,000 steps,

$x_{100,000} = 0 - 100,000 \cdot 0.00001 =  - 1 = x_{min}$.

The iterates first come within $\epsilon$ = 0.001 of $x_{min}$ at step 99,900, well past the required 10,000 steps.

Many other solutions are possible, as long as gradient descent
$\epsilon$-converges only after at least 10,000 steps.
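
The step count can also be checked by simulation. The loop below is a minimal sketch for $f(x) = 0.0001x$ with the stated $\alpha$, $b$, and $\epsilon$; because the iterates decrease monotonically and are then held at $-b$ by clipping, the first step that lands within $\epsilon$ of $x_{min}$ is also the $\epsilon$-convergence time.

```python
alpha, b, eps = 0.1, 1.0, 0.001
x_min = -b  # minimizer of f(x) = 0.0001 x on [-1, 1]

x, t = 0.0, 0
while abs(x - x_min) > eps:
    x = x - alpha * 0.0001   # f'(x) = 0.0001 everywhere
    x = max(-b, min(b, x))   # clip to [-b, b]
    t += 1

# Roughly 99,900; floating-point rounding may shift the count by one step
print(t)
```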

## **Coding Assignment**

#### Run the environment test below and make sure you get all green checks. You will lose 2 points for each red flag.

In [5]:
from __future__ import print_function
import sys  # needed for sys.version in the failure message below
from packaging.version import parse as Version
from platform import python_version

OK = '\x1b[42m[ OK ]\x1b[0m'
FAIL = "\x1b[41m[FAIL]\x1b[0m"

try:
    import importlib
except ImportError:
    print(FAIL, "Python version 3.10 is required,"
                " but %s is installed." % sys.version)

def import_version(pkg, min_ver, fail_msg=""):
    mod = None
    try:
        mod = importlib.import_module(pkg)
        if pkg in {'PIL'}:
            ver = mod.VERSION
        else:
            ver = mod.__version__
        if Version(ver) == Version(min_ver):
            print(OK, "%s version %s is installed."
                  % (pkg, min_ver))
        else:
            print(FAIL, "%s version %s is required, but %s installed."
                  % (pkg, min_ver, ver))
    except ImportError:
        print(FAIL, '%s not installed. %s' % (pkg, fail_msg))
    return mod


# first check the python version
pyversion = Version(python_version())

if pyversion >= Version("3.10.7"):
    print(OK, "Python version is %s" % pyversion)
elif pyversion < Version("3.10.7"):
    print(FAIL, "Python version 3.10.7 is required,"
                " but %s is installed." % pyversion)
else:
    print(FAIL, "Unknown Python version: %s" % pyversion)

    
print()
requirements = {'matplotlib': "3.7.2", 'numpy': "1.24.4",'sklearn': "1.3.0", 
                'pandas': "2.0.3", "pytest": "7.2.1"}

# now the dependencies
for lib, required_version in list(requirements.items()):
    import_version(lib, required_version)

[ OK ] Python version is 3.10.7

[ OK ] matplotlib version 3.7.2 is installed.
[ OK ] numpy version 1.24.4 is installed.
[ OK ] sklearn version 1.3.0 is installed.
[ OK ] pandas version 2.0.3 is installed.
[ OK ] pytest version 7.2.1 is installed.


### Introduction

In this assignment, you will be using a modified version of the UCI
Census Income data set to predict the education levels of individuals
based on certain attributes collected from the 1994 census database. You
can read more about the dataset here:
[`https://archive.ics.uci.edu/ml/datasets/Census+Income`](https://archive.ics.uci.edu/ml/datasets/Census+Income).  

### Stencil Code

We have provided the following stencil code within this file:

-   `Model` contains the `LogisticRegression` model you will be
    implementing.

-   `Check Model` contains a series of tests to ensure you are coding your 
    model properly.

-   `Main` is the entry point of the program, which will read in the
    dataset, run the model, and print the results.

You should not modify any code in `Check Model` and `Main`. If you do for debugging
or other purposes, please make sure any additions are commented out in
the final handin. All the functions you need to fill in reside in this notebook,
marked by `TODO`s. You can see a full description of them in the section
below.

### **The Assignment**

In `Model`, there are a few functions you will implement. They are:

-   `LogisticRegression`:

    -   **train()** uses stochastic gradient descent to train the
        weights of the model.

    -   **loss()** calculates the log loss of some dataset divided by
        the number of examples.

    -   **predict()** predicts the labels of data points using the
        trained weights. For each data point, you should apply the
        softmax function to it and return the label with the highest
        assigned probability.

    -   **accuracy()** computes the fraction (between 0 and 1) of
        correctly predicted labels over a dataset.

*Note*: You are not allowed to use any packages that have already
implemented these models (e.g. scikit-learn). We have also included some
code in `Main` for you to test out the different random seeds and
calculate the average accuracy of your model across those random seeds.

### **Logistic Regression**

Logistic Regression, despite its name, is used in classification
problems. It learns sigmoid functions of the inputs
$$h_{\bf w}(x)_j = \phi_{sig}(\langle {\bf w}_j, {\bf x} \rangle)$$
where $h_{\bf w}(x)_j$ is the probability that sample
$\bf x$ is a member of class *j*.  

In multi-class classification, we need to apply the `softmax` function
to normalize the probabilities of each class. The loss function of a
Logistic Regression classifier over *k* classes on a *single* example
(*x*,*y*) is the **log-loss**, sometimes called **cross-entropy loss**:
$$\ell(h_{\bf w}, ({\bf x}, y)) = - \sum_{j = 1}^{k}
\left\{\begin{array}{lr}
    \log( h_{\bf w}({\bf x})_j ), & y = j\\
    0, & \text{otherwise} \\
\end{array}\right\}$$
Therefore, the ERM hypothesis of **w** on a dataset of *m* samples has weights
$$
{\bf w} = \underset{\bf w}{argmin} (-\frac{1}{m}\sum_{i = 1}^m \sum_{j = 1}^{k}
\left\{\begin{array}{lr}
    \log( h_{\bf w}({\bf x}_i)_j), & y_{i} = j\\
    0, & \text{otherwise} \\
\end{array}\right\} )
$$
To learn the ERM hypothesis, we need to perform gradient descent. The
partial derivative of the loss on a single data point $({\bf x}, y)$ with respect to a single weight ${\bf w}_{st}$ is
$$
\frac{\partial l_S(h_{\bf w})}{\partial {\bf w}_{st}} =
\left\{\begin{array}{lr}
    h_{\bf w}({\bf x})_s - 1, & y = s\\
    h_{\bf w}({\bf x})_s, & \text{otherwise} \\
    \end{array}\right\}
    {\bf x}_t
$$
With respect to a single row in the weights matrix, ${\bf w}_s$,
the partial derivative of the loss is
$$
\frac{\partial l_S(h_{\bf w})}{\partial {\bf w}_{s}} =
\left\{\begin{array}{lr}
    h_{\bf w}({\bf x})_s - 1, & y = s\\
    h_{\bf w}({\bf x})_s, & \text{otherwise} \\
    \end{array}\right\}
    {\bf x}
$$
You will need to descend this gradient to update the weights of your
Logistic Regression model.
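
One way to gain confidence in the gradient formula is to compare it against finite differences. The sketch below does this for a single example; the helper names (`example_loss`, `example_gradient`) and the random test values are assumptions made for illustration, and it uses a plain softmax without the small smoothing constant that the notebook's own `softmax` adds.

```python
import numpy as np

def softmax(z):
    # Standard softmax, shifted by max(z) for numerical stability
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def example_loss(W, x, y):
    # Log loss of a single example (x, y): -log h_w(x)_y
    return -np.log(softmax(W @ x)[y])

def example_gradient(W, x, y):
    # Row s of the gradient is (h_w(x)_s - 1[y == s]) * x
    p = softmax(W @ x)
    p[y] -= 1.0
    return np.outer(p, x)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # k = 3 classes, d = 4 features (incl. bias)
x = rng.normal(size=4)
y = 2

# Central finite differences as an independent check of the formula
numeric = np.zeros_like(W)
h = 1e-6
for s in range(W.shape[0]):
    for t in range(W.shape[1]):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[s, t] += h
        W_minus[s, t] -= h
        numeric[s, t] = (example_loss(W_plus, x, y)
                         - example_loss(W_minus, x, y)) / (2 * h)

# Largest entrywise gap between the numeric and analytic gradients
print(np.max(np.abs(numeric - example_gradient(W, x, y))))
```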

### **Stochastic Gradient Descent**

You will be using Stochastic Gradient Descent (SGD) to train your
`LogisticRegression` model. Below, we have provided pseudocode for SGD
on a sample *S*:

$\text{initialize parameters } {\bf w}\text{, learning rate } \alpha \text{, and batch size b}$  <br />
$\quad \text{converge = False}$ <br />
$\quad \text{while not converge:}$ <br />
$\quad \quad	\text{epoch + 1}$ <br />
$\quad \quad	\text{shuffle training examples}$ <br />
$\quad \quad	\text{calculate last epoch loss}$ <br />
$\quad \quad	\text{for } i = 0,1,...,\lceil{n_{examples}/b}\rceil-1: \text{-- iterate over batches:}$ <br />
$\quad \quad \quad X_{batch} = X[i \cdot b: (i+1) \cdot b] \text{ -- select the X in the current batch}$ <br />
$\quad \quad \quad {\bf y}_{batch} = {\bf y}[i \cdot b: (i+1) \cdot b] \text{ -- select the labels in the current batch}$ <br />
$\quad \quad \quad \text{initialize } \nabla L_{{\bf w}} \text{ to be a matrix of zeros}$ <br />
$\quad \quad \quad \text{for each pair of training data point } ({\bf x},y)\in (X_{batch}, {\bf y}_{batch}):$ <br />
$\quad \quad \quad \quad \text{for }j = 0,1,..., n_{classes} - 1:$ <br />
$\quad \quad \quad \quad \quad \text{-- calculate the partial derivative of the loss with respect to}$ <br />
$\quad \quad \quad \quad \quad \text{-- a single row in the weights matrix}$ <br />
$\quad \quad \quad \quad \quad \text{if }y = j: \nabla L_{{\bf w}_j} \text{ += } 
(softmax({\bf w}{\bf x})_j - 1) \cdot {\bf x} $ <br />
$\quad \quad \quad \quad \quad \text{else: }\nabla L_{{\bf w}_j} \text{ += } softmax({\bf w}{\bf x})_j \cdot {\bf x}$ <br />
$\quad \quad \quad {\bf w} = {\bf w} - \frac{\alpha \nabla L_{\bf w}}{len(X_{batch})} \text{ -- update the weights}$ <br />
$\quad \quad \text{calculate this epoch loss}$ <br />
$\quad \quad \text{if |Loss}(X,{\bf y})_{this-epoch}-Loss(X,{\bf y})_{last-epoch}| <  \text{CONV-THRESHOLD: }$ <br />
$\quad \quad \quad \text{converge = True -- break the loop if loss converged}$


 **Hints**: Consistent with the notation in the lecture, ${\bf w}$ are
initialized as a *k* x *d* matrix, where *k* is the number of classes
and *d* is the number of features (with the bias term). With *n* as the
number of examples, *X* is a *n* x *d* matrix, and ${\bf y}$ is a vector
of length *n*.
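
With these shapes, the two inner loops of the pseudocode can be collapsed into matrix operations. The following is a sketch of one possible vectorization, not the required implementation: `softmax_rows`, `batch_gradient`, and the synthetic data are hypothetical, and the softmax again omits the smoothing constant used in the notebook.

```python
import numpy as np

def softmax_rows(Z):
    # Row-wise softmax: each row of Z holds the k class scores of one example
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def batch_gradient(W, X_batch, y_batch):
    # W: (k, d), X_batch: (m, d), y_batch: (m,) integer labels
    P = softmax_rows(X_batch @ W.T)            # (m, k) class probabilities
    P[np.arange(len(y_batch)), y_batch] -= 1   # subtract 1 where y == j
    return P.T @ X_batch                       # (k, d), summed over the batch

rng = np.random.default_rng(0)
n, d, k, b = 10, 3, 2, 4
X = rng.normal(size=(n, d))
y = rng.integers(0, k, size=n)
W = np.zeros((k, d))

n_batches = int(np.ceil(n / b))   # 3 batches of sizes 4, 4, 2
for i in range(n_batches):
    X_batch, y_batch = X[i * b:(i + 1) * b], y[i * b:(i + 1) * b]
    grad = batch_gradient(W, X_batch, y_batch)
    W = W - 0.03 * grad / len(X_batch)   # alpha = 0.03, averaged over the batch

print(W.shape)
```

Taking the ceiling of `n / b` ensures that the final, smaller batch is still visited, and dividing by `len(X_batch)` keeps the step size comparable across batches of different sizes.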

### **Tuning Parameters**

Convergence is achieved when the change in loss between epochs falls
below some small threshold. Usually, this threshold will be very close to but not
equal to zero, so it is up to you to tune it to best
optimize your model's performance. Typically, this number will be a
power of ten, 10<sup>-x</sup>, where you experiment with *x*. Note that
when calculating the loss for checking convergence, you should be
calculating the loss for the entire dataset, not for a single batch
(i.e., at the end of every epoch).  
  
You will also be tuning batch size (and one of the report questions
addresses the impact of batch size on model performance). In order to
reach the accuracy threshold, you will need to tune both parameters. *$\alpha$*
would typically be tuned during the training process, but we are fixing
*$\alpha$* = 0.03 for this assignment. **Please do not change *$\alpha$* in your
code**.  
  
You can tune the batch size and convergence threshold in `Main`.

### **Extra: Numpy Shortcuts**

While optional, there are many numpy shortcuts and functions that can make your code cleaner. We encourage you to look up numpy documentation and learn new functions.

Some useful shortcuts:
* `A @ B` is a shortcut for `np.matmul(A, B)`
* `X.T` is a shortcut for `np.transpose(X)`
* `X.shape` is a shortcut for `np.shape(X)`
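
A quick way to convince yourself that these shortcuts are equivalent is to compare them on small arrays. The arrays `A` and `B` below are arbitrary examples.

```python
import numpy as np

A = np.arange(6).reshape(2, 3)   # shape (2, 3)
B = np.arange(12).reshape(3, 4)  # shape (3, 4)

# `@` is matrix multiplication
assert np.array_equal(A @ B, np.matmul(A, B))
# `.T` is the transpose
assert np.array_equal(A.T, np.transpose(A))
# `.shape` gives the dimensions as a tuple
assert A.shape == np.shape(A) == (2, 3)

print((A @ B).shape)  # (2, 4)
```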

## **Model**

In [6]:
import random
import numpy as np

def softmax(x):
    '''
    Apply softmax to an array
    @params:
        x: the original array
    @return:
        an array with softmax applied elementwise.
    '''
    e = np.exp(x - np.max(x))
    return (e + 1e-6) / (np.sum(e) + 1e-6)

class LogisticRegression:
    '''
    Multiclass Logistic Regression that learns weights using 
    stochastic gradient descent.
    '''
    def __init__(self, n_features, n_classes, batch_size, conv_threshold):
        '''
        Initializes a LogisticRegression classifier.
        @attrs:
            n_features: the number of features in the classification problem
            n_classes: the number of classes in the classification problem
            weights: The weights of the Logistic Regression model
            alpha: The learning rate used in stochastic gradient descent
        '''
        self.n_classes = n_classes
        self.n_features = n_features
        self.weights = np.zeros((n_classes, n_features + 1))  # An extra column added for the bias
        self.alpha = 0.03  # DO NOT TUNE THIS PARAMETER
        self.batch_size = batch_size
        self.conv_threshold = conv_threshold

    def train(self, X, Y):
        '''
        Trains the model using stochastic gradient descent
        @params:
            X: a 2D Numpy array where each row contains an example, padded by 1 column for the bias
            Y: a 1D Numpy array containing the corresponding labels for each example
        @return:
            num_epochs: integer representing the number of epochs taken to reach convergence
        '''
        # TODO
        converge = False
        epoch = 0

        while not converge:
            epoch += 1

            # Shuffle training examples
            c = list(zip(X, Y))
            random.shuffle(c)
            X, Y = zip(*c)
            
            # the previous loss needs to be calculated on X and Y
            previous_loss = self.loss(X, Y)
            
            # Use the ceiling so a final, smaller batch is not dropped
            for i in range(int(np.ceil(len(X) / self.batch_size))):

                X_batch = X[i*self.batch_size:(i+1)*self.batch_size]
                Y_batch = Y[i*self.batch_size:(i+1)*self.batch_size]
                L_gradient = np.zeros(self.weights.shape)

                for (x, y) in zip(X_batch, Y_batch):
                    for j in range(self.n_classes):
                        if y == j:
                            L_gradient[j] += (softmax(self.weights @ x)[j] - 1) * x
                        else:
                            L_gradient[j] += softmax(self.weights @ x)[j] * x
                            
                self.weights = self.weights - (self.alpha*L_gradient)/len(X_batch)
                
            # the current loss need to be calculated on X and Y
            current_loss = self.loss(X, Y)
            
            # Check if converged
            if abs(current_loss - previous_loss) < self.conv_threshold:
                converge = True

        return epoch
    

    def loss(self, X, Y):
        '''
        Returns the total log loss on some dataset (X, Y), divided by the number of examples.
        @params:
            X: 2D Numpy array where each row contains an example, padded by 1 column for the bias
            Y: 1D Numpy array containing the corresponding labels for each example
        @return:
            A float number which is the average loss of the model on the dataset
        '''
        # TODO
        total = 0
        for i in range(len(Y)):
            # Calculates probs that x belongs to each class
            prob = softmax(self.weights @ X[i])
            for j in range(self.n_classes):
                if Y[i] == j:
                    total -= np.log(prob[j])
        return total/len(Y)


    def predict(self, X):
        '''
        Compute predictions based on the learned weights and examples X
        @params:
            X: a 2D Numpy array where each row contains an example, padded by 1 column for the bias
        @return:
            A 1D Numpy array with one element for each row in X containing the predicted class.
        '''
        # TODO
        # Creates Predictions 
        pred = np.zeros(len(X))
        for i in range(len(X)):
            pred[i] = np.argmax(softmax(self.weights @ X[i]))
        return pred


    def accuracy(self, X, Y):
        '''
        Outputs the accuracy of the trained model on a given testing dataset X and labels Y.
        @params:
            X: a 2D Numpy array where each row contains an example, padded by 1 column for the bias
            Y: a 1D Numpy array containing the corresponding labels for each example
        @return:
            a float number indicating accuracy (between 0 and 1)
        '''
        # TODO
        pred = self.predict(X)
        acc = sum(pred == Y)/len(Y)
        return acc

## **Check Model**

In [16]:
import pytest
# Sets random seed for testing purposes
random.seed(0)
np.random.seed(0)

# Creates Test Model with 2 predictors, 2 classes, a Batch Size of 5 and a Threshold of 1e-2
test_model1 = LogisticRegression(2, 2, 5, 1e-2)

# Creates Test Data
x_bias = np.array([[0,4,1], [0,3,1], [5,0,1], [4,1,1], [0,5,1]])
y = np.array([0,0,1,1,0])
x_bias_test = np.array([[0,0,1], [-5,3,1], [9,0,1], [1,0,1], [6,-7,1]])
y_test = np.array([0,0,1,0,1])

# Creates Test Model with 2 predictors, 3 classes, a Batch Size of 1 and a Threshold of 1e-2
test_model2 = LogisticRegression(2, 3, 1, 1e-2)

# Creates Test Data
x_bias2 = np.array([[0,0,1], [0,3,1], [4,0,1], [6,1,1], [0,1,1], [0,4,1]])
y2 = np.array([0,1,2,2,0,1])
x_bias_test2 = np.array([[0,0,1], [-5,3,1], [9,0,1], [1,0,1]])
y_test2 = np.array([0,1,2,0])


# Test Model Loss
# assert test_model1.loss(x_bias, y) == pytest.approx(0.693, .001) # Checks if answer is within .001
# print(test_model1.loss(x_bias, y))
# assert test_model2.loss(x_bias2, y2) == pytest.approx(1.099, .001) # Checks if answer is within .001

# # Test Train Model and Checks Model Weights
# assert test_model1.train(x_bias, y) == 14
# assert test_model1.weights == pytest.approx(np.array([[-0.218, 0.231, 0.0174], [ 0.218, -0.231, -0.0174]]), 0.01) # Answer within .01

# assert test_model2.train(x_bias, y) == 9
# assert test_model2.weights == pytest.approx(np.array([[-0.300,  0.560,  0.093], [ 0.523, -0.257,  0.032], [-0.226, -0.304, -0.123]]), .05) 

# # Test Model Predict
# assert (test_model1.predict(x_bias_test) == np.array([0.,0., 1., 1., 1.])).all()
# assert (test_model2.predict(x_bias_test2) == np.array([0, 0, 1, 1])).all()

# # Test Model Accuracy
# assert test_model1.accuracy(x_bias_test, y_test) == .8
# assert test_model2.accuracy(x_bias_test2, y_test2) == .25


In [17]:
random.seed(0)
np.random.seed(0)
print(test_model1.loss(x_bias, y), pytest.approx(0.693, .001))

0.6931466805603205 0.693 ± 6.9e-04


## **Main**

In [23]:
from sklearn.model_selection import train_test_split

DATA_FILE_NAME = 'normalized_data.csv'
# DATA_FILE_NAME = 'unnormalized_data.csv'
# DATA_FILE_NAME = 'normalized_data_nosens.csv'

CENSUS_FILE_PATH = DATA_FILE_NAME

NUM_CLASSES = 3
BATCH_SIZE = 1  # tune this parameter
CONV_THRESHOLD = 1 # tune this parameter

def import_census(file_path):
    '''
        Helper function to import the census dataset and split it 70/30 into train and test sets
        @param:
            file_path: path to the census data file (features, with labels in the last column)
        @return:
            X_train: training data inputs
            Y_train: training data labels
            X_test: testing data inputs
            Y_test: testing data labels
    '''
    data = np.genfromtxt(file_path, delimiter=',', skip_header=False)
    X = data[:, :-1]
    Y = data[:, -1].astype(int)
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
    return X_train, Y_train, X_test, Y_test

def test_logreg():
    X_train, Y_train, X_test, Y_test = import_census(CENSUS_FILE_PATH)
    num_features = X_train.shape[1]

    # Add a bias
    X_train_b = np.append(X_train, np.ones((len(X_train), 1)), axis=1)
    X_test_b = np.append(X_test, np.ones((len(X_test), 1)), axis=1)

    ### Logistic Regression ###
    model = LogisticRegression(num_features, NUM_CLASSES, BATCH_SIZE, CONV_THRESHOLD)
    num_epochs = model.train(X_train_b, Y_train)
    acc = model.accuracy(X_test_b, Y_test) * 100
    print("Test Accuracy: {:.1f}%".format(acc))
    print("Number of Epochs: " + str(num_epochs))


# Set random seeds. DO NOT CHANGE THIS IN YOUR FINAL SUBMISSION.
random.seed(0)
np.random.seed(0)
test_logreg()

Test Accuracy: 72.4%
Number of Epochs: 1


## **Report Questions**

### **Question 1**

Make sure that you have implemented a variable batch size using the
constructor given for `LogisticRegression`. Try different batch
sizes ([1, 8, 64, 512, 4096] - there are ~5700 points in the dataset), and try different convergence thresholds ([1e-2, 1e-3, 1e-4]) in the cell below. Visualize the accuracy and number of epochs taken to converge.

Answer the following questions:
-   What tradeoffs exist between good accuracy and quick
    convergence?
-    Why do you think the batch size led to the results you received?

### **Question 1: Visualization**

Fill in the `generate_array()` and `generate_heatmap()` functions so you can visualize how accuracy and number of epochs taken changes as we change batch size and convergence threshold. Fill out BATCH_SIZE_ARR and CONV_THRESHOLD_ARR with different values (at least 3 of each).

-   **generate_array()** should loop through both BATCH_SIZE_ARR and CONV_THRESHOLD_ARR to populate `epoch_arr` and `acc_arr`. Make sure to round `acc_arr` to 2 decimal places before returning (Hint: `np.round`).
        
-   **generate_heatmap()** should create a matplotlib heatmap of the arrays. You should label the axis and title of each plot using BATCH_SIZE_ARR and CONV_THRESHOLD_ARR. It might be helpful to look at Matplotlib's guide for heatmaps: https://matplotlib.org/stable/gallery/images_contours_and_fields/image_annotated_heatmap.html

**Hint:** Runs with large batch sizes and low convergence thresholds might take several minutes to complete. We recommend that you develop the code below with a small subset of the parameters (e.g., batch size of [1,2,4] and conv_threshold of [1e-2, 1e-3]). Once your code works and your figures look good, rerun everything with the batch size and conv_threshold values described in Question 1 above.

In [5]:
import matplotlib.pyplot as plt
import time
random.seed(0)
np.random.seed(0)

BATCH_SIZE_ARR = [1, 8, 64, 512, 4096]
CONV_THRESHOLD_ARR = [1e-2, 1e-3, 1e-4]

def generate_array():
    '''
        Runs the logistic regression model on different batch sizes and
        convergence thresholds to populate arrays for accuracy and number of epochs taken.
        @return:
            epoch_arr: 2D array of epochs taken, for each batch size and conv threshold
            acc_arr: 2D array of accuracies, for each batch size and conv threshold
            time_arr: 2D array of training times in seconds (TA-only extra)
    '''
    X_train, Y_train, X_test, Y_test = import_census(CENSUS_FILE_PATH)
    num_features = X_train.shape[1]

    # Add a bias
    X_train_b = np.append(X_train, np.ones((len(X_train), 1)), axis=1)
    X_test_b = np.append(X_test, np.ones((len(X_test), 1)), axis=1)

    # Initializes the accuracy and epoch arrays
    acc_arr = np.zeros((len(BATCH_SIZE_ARR), len(CONV_THRESHOLD_ARR)))
    epoch_arr = np.zeros((len(BATCH_SIZE_ARR), len(CONV_THRESHOLD_ARR)))

    ### EXTRA (ONLY FOR TA'S): DON'T GRADE ON THIS
    time_arr = np.zeros((len(BATCH_SIZE_ARR), len(CONV_THRESHOLD_ARR)))

    ### Populate arrays ###
    # [TODO]
    for i in range(len(BATCH_SIZE_ARR)):
        for j in range(len(CONV_THRESHOLD_ARR)):
            model = LogisticRegression(num_features, NUM_CLASSES, BATCH_SIZE_ARR[i], CONV_THRESHOLD_ARR[j])

            start = time.time()
            num_epochs = model.train(X_train_b, Y_train)
            end = time.time()
            acc = model.accuracy(X_test_b, Y_test) * 100

            epoch_arr[i][j] = num_epochs
            acc_arr[i][j] = acc
            time_arr[i][j] = end - start
    return epoch_arr, np.round(acc_arr, 2), np.round(time_arr, 2)


def generate_heatmap(arr, name):
    '''
        Generates a matplotlib heatmap for an array
        @param:
            arr: 2D array to generate heatmap of
            name: title of the plot (Hint: use plt.title)
        @return:
            None
    '''
    # [TODO]
    fig, ax = plt.subplots()
    im = ax.imshow(arr)

    # Show ticks and label them with the respective list entries
    ax.set_xticks(np.arange(len(CONV_THRESHOLD_ARR)), labels=CONV_THRESHOLD_ARR)
    ax.set_yticks(np.arange(len(BATCH_SIZE_ARR)), labels=BATCH_SIZE_ARR)

    # Loop over data dimensions and create text annotations
    for i in range(len(BATCH_SIZE_ARR)):
        for j in range(len(CONV_THRESHOLD_ARR)):
            ax.text(j, i, arr[i, j], ha="center", va="center", color="w")

    fig.tight_layout()
    plt.xlabel("CONV THRESHOLDS")
    plt.ylabel("BATCH SIZES")
    plt.title(name)
    plt.show()

# Students do not need to have time array
epoch_arr, acc_arr, time_arr = generate_array()
generate_heatmap(epoch_arr, "Epochs")
generate_heatmap(acc_arr, "Accuracy")

### EXTRA (ONLY FOR TA'S): DON'T GRADE ON THIS
generate_heatmap(time_arr, "Time")


**Solution:**

- A lower convergence threshold makes SGD take longer to converge,
but the learned weights better approximate those that minimize the
training loss. Accuracy therefore tends to rise at first as you
decrease the threshold, although training for too long
could also result in overfitting. A higher convergence
threshold makes SGD converge faster, but could result in lower
accuracy (since training loss will remain higher). This is why
it is important to experiment with different convergence
thresholds and batch sizes to find the right balance.

- Students should notice that as they increase the batch size, the
    number of epochs it takes for their loss to converge
    increases. A likely reason is that larger batches mean fewer weight
    updates per epoch, so each epoch makes less progress.

### **Question 2**
Try to run the model with `unnormalized_data.csv` instead of
`normalized_data.csv`. Report your findings when running the model
on the unnormalized data. In a few short sentences, explain what
normalizing the data does and why it affected your model's
performance.

**Solution:**  

Note to Grader: The solution code works extremely poorly with
unnormalized data; it does not reach close to the minimum accuracy.

There is a gap in accuracy when training on normalized vs
unnormalized data. Normalizing rescales every feature to a
comparable range. Without it, the gap is huge because the optimizer cannot
tune the weights for all features at the same time very well, since
there's one step size (i.e., alpha in the pseudocode, also referred
to as learning rate) for all weights. In other words, if feature A
ranges from 0 to 1000 and feature B ranges from 0 to 1, then the
gradient for the weights on feature A is scaled by values up to 1000
times larger than the gradient for the weights on feature B, so a
single step size that suits one feature is far too large or too small
for the other.
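
For illustration, one common way to normalize is to standardize each feature to zero mean and unit variance. The sketch below shows this on synthetic data; how `normalized_data.csv` was actually produced is not stated here, so the `standardize` helper and the choice of z-scores are assumptions.

```python
import numpy as np

def standardize(X_train, X_test):
    # Compute statistics on the training set only, then apply to both splits
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant features
    return (X_train - mean) / std, (X_test - mean) / std

rng = np.random.default_rng(0)
# Synthetic data: feature A ranges over [0, 1000], feature B over [0, 1]
X_train = rng.uniform(0, 1, size=(100, 2)) * np.array([1000.0, 1.0])
X_test = rng.uniform(0, 1, size=(20, 2)) * np.array([1000.0, 1.0])

X_train_n, X_test_n = standardize(X_train, X_test)
print(X_train_n.mean(axis=0), X_train_n.std(axis=0))
```

After standardization both features have the same scale, so a single learning rate produces comparably sized updates for all weights.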

### **Question 3**
Try the model with `normalized_data_nosens.csv`; in this data file,
we have removed the `race` and `sex` attributes. Report your
findings on the accuracy of your model on this dataset (averaging
over many random seeds here may be useful). Can we make any
conclusion based on these accuracy results about whether there is a
correlation between sex/race and education level? Why or why not?

**Solution:** 

We expect the accuracy to stay approximately the same.
However, this does not let us claim that there is no correlation between
sex/race and education level. We expect answers like: *accuracy* is
distinct from *correlation*; or there may be other attributes that
serve as proxy variables for race and sex.