PROBLEM 1

a)

I noticed that the key frame annotations are not always accurately labeled. For example,
some have the left/right labels backwards relative to what the keyframes show, some keyframes don't
show movements indicative of what is labeled, some keyframes look different from the others (a different
subject or completely different outfits/angles), some keyframes are not in portrait mode, etc.

I observed that the keypoint annotations are sometimes mislabeled: in some frames they
are not even on the body or not even inside the picture frame, some frames are missing many keypoints,
some have the left/right keypoints backwards, etc.

b)

There is a wide variety in the lighting, angle of video capture, distance of the subject
from the camera, position of the subject in the frame, clarity of the capture, etc. across
the videos. It seems it would be very difficult to analyze the data using the raw key frame
images. Even with keypoint annotations there would be some variability in
the locations of the points, due to the different angles people recorded at and the various
poses of the people doing the exercises.

c) 

I noticed that the keypoint annotations are not that accurate, as some keypoints are 
not on the subject, some are on the wrong side, and some are in the wrong locations. Also, 
the bounding boxes often don't enclose the entire subject and sometimes more than one 
subject is detected even if there is only one.

The sampling rate seems to capture the movement adequately: in the squat example
you can see frames in which the subject raised their arms, began to squat, went into a full
squat, and then came back up.

d) I saw that compared to the normalized aligned Neck keypoints, the raw keypoints had 
much more variability and outliers. The raw graph had keypoints scattered across the graph
with a very large area of points as the main cluster, but the normalized aligned graph
had almost all the points centered in a cluster of small area around 0 with only a few
outliers.

\pagebreak

PROBLEM 2

2.1.1

1. Code:

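import numpy as np

def affine_forward(x, w, b):
    
    out = None
    
    # Flatten each example into a row: (N, d1, ..., dk) -> (N, D)
    N = x.shape[0]
    D = np.prod(x.shape[1:])
    xM = x.reshape(N, D)
    # Affine map: (N, D) x (D, M) + (M,) -> (N, M)
    out = np.dot(xM, w) + b
 
    cache = (x, w, b)
    return out, cache
    
def affine_backward(dout, cache):

    x, w, b = cache
    dx, dw, db = None, None, None
    
    N = x.shape[0]
    D = np.prod(x.shape[1:])
    xM = x.reshape(N, D)
    # dx: (N, M) x (M, D), then restored to the original shape of x
    dx = np.dot(dout, w.T).reshape(x.shape)
    # dw: (D, N) x (N, M) -> (D, M)
    dw = np.dot(xM.T, dout)
    # db: upstream gradient summed over the batch axis -> (M,)
    db = np.sum(dout, axis=0)

    return dx, dw, db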
    
2. 
From lecture, the partial derivative for component i in the backward pass 
is equal to $$\sum_{j=1}^{k} \frac{dL}{dz_j} \Delta_i z_j $$
where $\Delta_i z_j$ is the derivative of output $z_j$ with respect to component i.
 
For dx: 
the derivative of xw + b with respect to x is w, so the gradient
for the backward pass is the matrix product of the upstream
derivative, dout, with w transpose. dx is then reshaped to have
the same shape as x.

For dw:
the derivative of xw + b with respect to w is x, so the gradient
for the backward pass is the matrix product of x transpose with
the upstream derivative, dout. x is first reshaped to dimension
(N, D), where D is d1 * ... * dk, so that the multiplication is
defined.

For db:
the derivative of xw + b with respect to b is 1, so the gradient
for the backward pass is the upstream derivative, dout, multiplied
by 1 and accumulated over the N examples that share b, which is
the sum of dout over the batch axis (axis 0).
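
As a rough illustration of the reasoning above, the sketch below writes each gradient as an explicit chain-rule sum with `np.einsum` and compares it with the matrix-product form. The sizes `N`, `D`, `M`, the random inputs, and the `*_sum` names are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, M = 4, 6, 3                      # hypothetical batch, input and output sizes
x = rng.standard_normal((N, 2, 3))     # each example has shape (2, 3), so D = 6
w = rng.standard_normal((D, M))
dout = rng.standard_normal((N, M))     # stand-in for the upstream gradient

xM = x.reshape(N, -1)

# Matrix-product forms of the three gradients.
dx = np.dot(dout, w.T).reshape(x.shape)
dw = np.dot(xM.T, dout)
db = np.sum(dout, axis=0)

# The same gradients as explicit sums:
# dx[n, d] = sum_m dout[n, m] * w[d, m]
# dw[d, m] = sum_n xM[n, d] * dout[n, m]
# db[m]    = sum_n dout[n, m]
dx_sum = np.einsum("nm,dm->nd", dout, w).reshape(x.shape)
dw_sum = np.einsum("nd,nm->dm", xM, dout)
db_sum = np.einsum("nm->m", dout)

print(np.allclose(dx, dx_sum), np.allclose(dw, dw_sum), np.allclose(db, db_sum))
```

The index strings make it visible which axis is summed out in each case: the output axis for dx, and the batch axis for dw and db.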


3. Output of numerical gradient checking:

grad:

[[-1.04345093  0.20915809 -0.47062589 -0.58676516]

 [-1.58678198  0.45455414  0.13067486 -0.30436602]
 
 [-0.95020206  0.30773446  0.29861707 -0.02918212]]
 
ngrad:

[[-1.04345093  0.20915809 -0.47062589 -0.58676516]

 [-1.58678198  0.45455414  0.13067486 -0.30436602]
 
 [-0.95020206  0.30773446  0.29861707 -0.02918212]]
 
4. No inline questions

\pagebreak

2.1.2

1. Code:

def relu_forward(x):

    out = None
   
    out = x * (x>=0)
    
    cache = x
    return out, cache
    
def relu_backward(dout, cache):

    dx, x = None, cache
    
    dx = dout * (x>=0)
    
    return dx

2. 
The ReLU forward function replaces x values where x < 0 with 0. The ReLU backward function does the same thing but replaces the derivative values from dout with 0 where x < 0.


3. Output of numerical gradient checking:

grad:

[[ 0.33349295  1.23562329 -0.12733505]

 [ 0.16847097  0.         -0.        ]]
 
 ngrad:
 
[[ 0.33349295  1.23562329 -0.12733505]

 [ 0.16847097  0.          0.        ]]

4. The sigmoid and tanh functions both have the vanishing gradient problem. 
It occurs when the inputs for most training points drive the gradient to almost 
zero, which happens when the sigmoid output is close to 0 or 1 and when the tanh output 
is close to -1 or 1. At these values the sigmoid and tanh curves are nearly flat 
(gradient near 0).
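
A small sketch of this saturation effect, using the standard derivatives sigmoid'(x) = s(1 - s) and tanh'(x) = 1 - tanh(x)^2. The helper names and the sample inputs are chosen only for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: s * (1 - s)
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    # Derivative of tanh: 1 - tanh(x)^2
    return 1.0 - np.tanh(x) ** 2

# Gradients shrink rapidly as the input moves away from 0.
xs = np.array([0.0, 2.0, 5.0, 10.0])
for x, sg, tg in zip(xs, sigmoid_grad(xs), tanh_grad(xs)):
    print(f"x={x:5.1f}  sigmoid'={sg:.2e}  tanh'={tg:.2e}")
```

Both gradients are largest at x = 0 (0.25 for sigmoid, 1 for tanh) and fall toward zero as the outputs approach their limits.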

\pagebreak

2.1.3

1. Code:

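def softmax_loss(x, y):

    loss = 0.0
    dx = None
    
    # Shift each row by its max so np.exp cannot overflow (softmax is unchanged)
    m = np.max(x, axis=1, keepdims=True)
    N = x.shape[0]
    val = x - m
    exp_val = np.exp(val)
    exp_sum = np.sum(exp_val, axis=1, keepdims=True)
    log_exp_sum = np.log(exp_sum)
    # Negative log-probability of the correct class, averaged over the batch
    loss = np.mean((-val + log_exp_sum)[np.arange(N), y])
    # Gradient: softmax output minus the one-hot labels, divided by N
    dx = exp_val/exp_sum
    dx[np.arange(N), y] -= 1
    dx /= N
    
    return loss, dx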

2. From lecture, the loss function for cross entropy loss is 
$L = -\sum_{k} y_k \ln(z_k)$ where z is the softmax activation output and y is the one-hot label vector, and 
the gradient of that with respect to $z_k$ is $\frac{-y_k}{z_k}$. So, by the chain rule, component j of dx is 
$$\frac{dL}{dx_j} = \sum_{k} \frac{-y_k}{z_k} \frac{dz_k}{dx_j}$$
Also from lecture, the gradient of the softmax output is $\frac{dz_k}{dx_j} = z_j(1-z_j)$ if $k=j$, and $-z_k z_j$ if $k\neq j$.
So, $$\frac{dL}{dx_j} = -y_j (1-z_j) - \sum_{k\neq j} \frac{y_k}{z_k} (-z_k z_j)$$
$$= z_j (y_j + \sum_{k\neq j} y_k) - y_j$$
which equals $z_j - y_j$ because $y_j + \sum_{k\neq j} y_k = 1$. 

So, in the code for dx we subtract 1 from the softmax activation outputs where $y=1$.

In the code, to get a single loss value I took the mean of the per-example losses, and correspondingly divided dx by N.
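
The result $z_j - y_j$ (divided by N for the mean loss) can be checked against centered finite differences. The sketch below uses a condensed re-implementation of the loss, not the exact submitted code; the random inputs and the step size `h` are assumptions for illustration.

```python
import numpy as np

def softmax_loss(x, y):
    # Condensed version of the loss: mean cross-entropy and its gradient.
    N = x.shape[0]
    shifted = x - np.max(x, axis=1, keepdims=True)
    log_probs = shifted - np.log(np.sum(np.exp(shifted), axis=1, keepdims=True))
    loss = -np.mean(log_probs[np.arange(N), y])
    dx = np.exp(log_probs)           # softmax output z
    dx[np.arange(N), y] -= 1         # z - y for one-hot y
    return loss, dx / N

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3))      # hypothetical scores: 2 examples, 3 classes
y = np.array([1, 0])                 # hypothetical labels
loss, dx = softmax_loss(x, y)

# Centered finite differences, one input entry at a time.
h = 1e-5
ngrad = np.zeros_like(x)
for idx in np.ndindex(*x.shape):
    old = x[idx]
    x[idx] = old + h
    loss_plus, _ = softmax_loss(x, y)
    x[idx] = old - h
    loss_minus, _ = softmax_loss(x, y)
    x[idx] = old
    ngrad[idx] = (loss_plus - loss_minus) / (2 * h)

print(np.max(np.abs(dx - ngrad)))
```

Each row of dx also sums to zero, since the softmax outputs and the one-hot labels both sum to 1.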

3. Output of numerical gradient checking:

grad:

[[ 0.18840093 -0.317676    0.12927507]

 [ 0.31719863 -0.40597951  0.08878088]]
 
ngrad:

[[ 0.18840093 -0.317676    0.12927507]

 [ 0.31719863 -0.40597951  0.08878088]]
 
4. No inline questions.


\pagebreak

2.2

Best parameter combination: 

hidden_dim: 5000, lr_decay: 0.9, num_epochs: 10,
batch_size: 50, learning_rate: 0.0001, weight_scale: 0.01

Best validation accuracy: 
0.930833




hidden_dim values tried: 10, 100, 1000, 5000, 4000

lr_decay values tried: .8, .7, .9

num_epochs values tried: 5, 15, 10

batch_size values tried: 10, 100, 50

learning_rate values tried: .01, .001, .0001

weight_scale values tried: .1, .01, .2, .001

In my exploration, I tuned one parameter at a time, trying larger and
smaller values until I found the one that maximized the validation
accuracy. I started with hidden_dim, increasing it until I saw 
a noticeable increase in accuracy at 1000, then experimented with the other
parameters one at a time. Once I thought I had found the best values for 
the other hyperparameters, I increased hidden_dim again until my validation 
accuracy stopped increasing. I did not do any transformations to the data before
testing.
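
The one-parameter-at-a-time procedure can be sketched as a simple coordinate search. This is only an outline of the idea: `coordinate_search` is a hypothetical helper, and `toy_evaluate` is a made-up scoring function standing in for "train the network and return the validation accuracy", so its numbers say nothing about the real model.

```python
def coordinate_search(evaluate, start, candidates):
    """Tune one hyperparameter at a time, keeping the others fixed."""
    best = dict(start)
    best_score = evaluate(best)
    for name, values in candidates.items():
        for value in values:
            trial = dict(best, **{name: value})
            score = evaluate(trial)
            if score > best_score:
                best, best_score = trial, score
    return best, best_score

# Hypothetical stand-in for training and validating the network.
def toy_evaluate(params):
    return (-abs(params["hidden_dim"] - 1000) / 1000
            - abs(params["learning_rate"] - 1e-3))

start = {"hidden_dim": 10, "learning_rate": 1e-2}
candidates = {
    "hidden_dim": [10, 100, 1000, 5000],
    "learning_rate": [1e-2, 1e-3, 1e-4],
}
best, score = coordinate_search(toy_evaluate, start, candidates)
print(best)
```

Because each parameter is tuned with the others held fixed, the result can depend on the order of the parameters, which is one reason to revisit an earlier parameter after the rest are set.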

Output of best parameter training:

(Iteration 1 / 720) loss: 49.434835

(Epoch 0 / 10) train acc: 0.109000; val_acc: 0.125000

(Epoch 1 / 10) train acc: 0.846000; val_acc: 0.821667

(Iteration 101 / 720) loss: 0.865761

(Epoch 2 / 10) train acc: 0.957000; val_acc: 0.910833

(Iteration 201 / 720) loss: 0.221717

(Epoch 3 / 10) train acc: 0.962000; val_acc: 0.924167

(Epoch 4 / 10) train acc: 0.967000; val_acc: 0.923333

(Iteration 301 / 720) loss: 0.008394

(Epoch 5 / 10) train acc: 0.981000; val_acc: 0.924167

(Iteration 401 / 720) loss: 0.001829

(Epoch 6 / 10) train acc: 0.978000; val_acc: 0.930000

(Iteration 501 / 720) loss: 0.024000

(Epoch 7 / 10) train acc: 0.990000; val_acc: 0.938333

(Epoch 8 / 10) train acc: 0.990000; val_acc: 0.925833

(Iteration 601 / 720) loss: 0.022938

(Epoch 9 / 10) train acc: 0.992000; val_acc: 0.930833

(Iteration 701 / 720) loss: 0.154167

(Epoch 10 / 10) train acc: 0.994000; val_acc: 0.930833

\pagebreak

2.3

Hidden layer combinations tried:

One layer: [4000] only, because single-hidden-layer networks were already explored in 2.2

Two layers: [1000, 500], [1000, 1000], [2000, 1000], [3000, 3000], [2000, 4000]

Three layers: [1000, 1000, 1000], [2000, 2000, 2000], [3000, 1000, 3000], [1000, 3000, 1000]

Four layers: [500, 1000, 750, 2000], [3000, 1000, 2000, 3000], [1000, 2000, 3000, 4000]

Best result: Two hidden layers, [2000, 4000]

Validation Accuracy: 0.934167

Hyperparameters kept the same as in 2.2

I tried different combinations for each network depth, experimenting with different variations
of lower/higher dims for each layer, and trying not to use too many dims for 3 or 4 layers
because it made computation slow on my computer. Accuracies ranged from 75% to 92% 
for all but the best result, with networks that used fewer than 1000 nodes doing the worst.
The best result for me came from a network with two hidden layers, with the other hyperparameters 
the same as in 2.2. Again, no transformations were done to the data.

Training and Validation Accuracies for [2000, 4000] network:

(Iteration 1 / 720) loss: 8.675402

(Epoch 0 / 10) train acc: 0.120000; val_acc: 0.125000

(Epoch 1 / 10) train acc: 0.792000; val_acc: 0.787500

(Iteration 101 / 720) loss: 0.217967

(Epoch 2 / 10) train acc: 0.919000; val_acc: 0.901667

(Iteration 201 / 720) loss: 0.098382

(Epoch 3 / 10) train acc: 0.901000; val_acc: 0.895833

(Epoch 4 / 10) train acc: 0.946000; val_acc: 0.919167

(Iteration 301 / 720) loss: 0.267109

(Epoch 5 / 10) train acc: 0.964000; val_acc: 0.926667

(Iteration 401 / 720) loss: 0.067065

(Epoch 6 / 10) train acc: 0.983000; val_acc: 0.925833

(Iteration 501 / 720) loss: 0.126912

(Epoch 7 / 10) train acc: 0.971000; val_acc: 0.930000

(Epoch 8 / 10) train acc: 0.973000; val_acc: 0.933333

(Iteration 601 / 720) loss: 0.069198

(Epoch 9 / 10) train acc: 0.975000; val_acc: 0.931667

(Iteration 701 / 720) loss: 0.212222

(Epoch 10 / 10) train acc: 0.980000; val_acc: 0.934167

\pagebreak

PROBLEM 4

a) For my first model attempt, I started with 

parameters: batch_size=128, learning_rate=.001, num_epochs=5, padding=2

architecture: 

NeuralNet(

  (conv1): Conv2d(3, 6, kernel_size=(5, 5), stride=(1, 1))
  
  (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  
  (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  
  (fc1): Linear(in_features=95040, out_features=120, bias=True)
  
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  
  (fc3): Linear(in_features=84, out_features=8, bias=True))
  
validation accuracy: 52.0 %
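
The `in_features` of `fc1` is the channel count times the spatial size left after the convolution and pooling layers. The sketch below applies the standard output-size formula, floor((n + 2p - k) / s) + 1, to a hypothetical 64x64 input. The input size, the zero padding, and the conv, pool, conv, pool ordering are all assumptions for illustration and are not taken from the actual model.

```python
def conv_out(size, kernel, stride=1, padding=0):
    # Output length along one spatial axis of a convolution.
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel=2, stride=2):
    # Output length along one spatial axis of max pooling (no padding).
    return (size - kernel) // stride + 1

h = w = 64                    # hypothetical input height and width
for _ in range(2):            # assumed ordering: conv (5x5) then pool, twice
    h, w = conv_out(h, 5), conv_out(w, 5)
    h, w = pool_out(h), pool_out(w)

out_channels = 16             # channels produced by the last convolution
in_features = out_channels * h * w
print(h, w, in_features)
```

With `padding=2` a 5x5 convolution at stride 1 keeps the spatial size unchanged, e.g. `conv_out(64, 5, padding=2)` is 64.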

I then decided to try increasing the number of convolution layers, with the new layers having 
a smaller kernel size, and increased the output layers as much as my GPU would allow without 
crashing. Increasing the output layers also required me to lower my batch size for my code to run.
I also increased the number of epochs to make sure there was still room to learn before 
overfitting. In implementing my convolutional layers I drew inspiration from AlexNet when
choosing dimensions, as Piazza mentioned that nets similar to it would be unlikely
to perform poorly, but I used a stride of 1 for all my convolution layers and never used a kernel
size larger than 5. I also added regularization because I thought my low validation 
accuracies were a product of overfitting too quickly.

architecture:

NeuralNet(

  (conv1): Conv2d(3, 64, kernel_size=(5, 5), stride=(1, 1))
  
  (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  
  (conv2): Conv2d(64, 192, kernel_size=(3, 3), stride=(1, 1))
  
  (conv3): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1))
  
  (conv4): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1))
  
  (fc1): Linear(in_features=79872, out_features=200, bias=True)
  
  (fc2): Linear(in_features=200, out_features=100, bias=True)
  
  (fc3): Linear(in_features=100, out_features=8, bias=True))
  
validation accuracy: 62.56410256410256 %

After I restarted my computer, Google Colab suddenly allowed me to use much larger layers
without crashing, so I added a convolutional layer and increased the output layer sizes to 4000 to 
be similar in size to my multilayered net. I also decreased my learning rate by a factor of 10, 
calculated the means and standard deviations for each color channel, removed regularization 
because it was now preventing my model from learning as quickly as I wanted, and reduced the epochs 
back to 5 because I thought the model was overfitting.

architecture:

NeuralNet(

  (conv1): Conv2d(3, 64, kernel_size=(5, 5), stride=(1, 1))
  
  (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  
  (conv2): Conv2d(64, 192, kernel_size=(3, 3), stride=(1, 1))
  
  (conv3): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1))
  
  (conv4): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1))
  
  (conv5): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1))
  
  (fc1): Linear(in_features=15360, out_features=4000, bias=True)
  
  (fc2): Linear(in_features=4000, out_features=4000, bias=True)
  
  (fc3): Linear(in_features=4000, out_features=8, bias=True))
  
validation accuracy: 77.43589743589743 %
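
A minimal sketch of computing per-channel means and standard deviations with NumPy. The `(N, C, H, W)` layout, the random stand-in images, and the use of the statistics to standardize the inputs are assumptions for illustration; the write-up only states that the statistics were calculated.

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.random((10, 3, 8, 8))   # hypothetical batch: (N, C, H, W), values in [0, 1)

# Reduce over every axis except the channel axis.
mean = images.mean(axis=(0, 2, 3))
std = images.std(axis=(0, 2, 3))

# Assumed use: standardize each channel to zero mean and unit variance.
normalized = (images - mean[None, :, None, None]) / std[None, :, None, None]

print(mean.shape, std.shape)
```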

For my final model, I decided to run the previous one for 20 epochs even though training
loss reached near zero because validation accuracy continued to increase, although very slowly. 

validation accuracy: 78.05128205128206 %

\pagebreak

c) As the table shows, you would need more than 2 million additional units 
in order to implement a fully connected (FC) network comparable to the convolutional layers of the CNN.
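
For a sense of scale, the sketch below counts the weights and biases in the five convolutional layers of the final architecture listed in part a), using (k * k * in_channels + 1) * out_channels per layer. Whether this is the same quantity the table reports is an assumption.

```python
# (in_channels, out_channels, kernel_size) for conv1..conv5 of the final model in part a)
conv_layers = [
    (3, 64, 5),
    (64, 192, 3),
    (192, 384, 3),
    (384, 256, 3),
    (256, 256, 3),
]

def conv_params(in_channels, out_channels, kernel_size):
    # One k x k filter per input channel plus one bias, for every output channel.
    return (kernel_size * kernel_size * in_channels + 1) * out_channels

counts = [conv_params(*layer) for layer in conv_layers]
total = sum(counts)
print(counts, total)   # [4864, 110784, 663936, 884992, 590080] 2254656
```

The small total relative to the feature-map sizes comes from weight sharing: each filter is reused at every spatial position, whereas a fully connected layer needs a separate weight for every input-output pair.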

\pagebreak

d) I ran the model on random frames for one epoch because training was so slow that 
just one epoch plus calculating validation loss took almost 4 hours. The training 
loss was .2213 and validation accuracy was 66.10555847799404 %. 

reach: 0.340956340956341

squat: 0.4829749103942652

inline: 0.5451327433628319

lunge: 0.8029993183367417

hamstrings: 0.787920384351407

stretch: 0.8116683725690891

deadbug: 0.6883468834688347

pushup: 0.7184873949579832

\pagebreak

e)

import pandas as pd

label_map = ['reach','squat','inline','lunge','hamstrings','stretch',
             'deadbug','pushup', 'overall %']
random_frame = [0.340956340956341, 0.4829749103942652, 
                0.5451327433628319, 0.8029993183367417, 
                0.787920384351407, 0.8116683725690891,
                0.6883468834688347, 0.7184873949579832,
                66.10555847799404]
key_frame = [0.72, 0.8133333333333334,
             0.8, 0.7733333333333333,
             0.7533333333333333, 0.8333333333333334,
             0.7333333333333333, 0.8266666666666667,
             78.05128205128206]

table = pd.DataFrame(index = label_map)
table['key frame'] = key_frame
table['random frame'] = random_frame
table

label,key frame,random frame
reach,0.72,0.340956
squat,0.813333,0.482975
inline,0.8,0.545133
lunge,0.773333,0.802999
hamstrings,0.753333,0.78792
stretch,0.833333,0.811668
deadbug,0.733333,0.688347
pushup,0.826667,0.718487
overall %,78.051282,66.105558


\pagebreak

f) I observed that my better model was the key frame one, because I was able to 
train it for longer since it did not take so many hours to run. I chose to 
explore the errors of my key frame model. I noticed that reach was 
sometimes classified as inline, while hamstrings was sometimes classified as deadbug.
Most pictures that were misclassified were ones in which the image was a little 
blurrier or cut off, or in which the subject was in a position that was somewhat
different from the normal keyframe position. 
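
Confusions such as reach being predicted as inline can be tallied with a confusion matrix. In the sketch below the class names come from the label list in part e), while `y_true` and `y_pred` are made-up arrays used only to show the mechanics.

```python
import numpy as np

label_map = ['reach', 'squat', 'inline', 'lunge', 'hamstrings', 'stretch',
             'deadbug', 'pushup']

# Hypothetical true and predicted class indices (not real model output).
y_true = np.array([0, 0, 0, 4, 4, 2, 6, 7])
y_pred = np.array([0, 2, 2, 6, 4, 2, 6, 7])

# confusion[t, p] counts examples of true class t predicted as class p.
K = len(label_map)
confusion = np.zeros((K, K), dtype=int)
np.add.at(confusion, (y_true, y_pred), 1)

# Most frequent mistake: the largest off-diagonal entry.
off_diagonal = confusion.copy()
np.fill_diagonal(off_diagonal, 0)
t, p = np.unravel_index(np.argmax(off_diagonal), off_diagonal.shape)
print(f"{label_map[t]} -> {label_map[p]}: {off_diagonal[t, p]}")
```

`np.add.at` is used instead of `confusion[y_true, y_pred] += 1` because the latter would count repeated (true, predicted) pairs only once.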

\pagebreak

g) My kaggle username is ayee and my best submission was:

0.69979

I think my model would have been able to do better, but Google Colab
went through my data so slowly for random frames that I was not able 
to experiment with parameters much, and was only able to get it to 
train for four epochs.