In [1]:
import numpy as np
import math
from hw6_nlt import LinRegNLT2
from hw6_dataload import LFD_Data

# HW 6
## Overfitting and Deterministic Noise
Given a hypothesis set &Hscr; for the target function f and a subset &Hscr;' &subset; &Hscr;, using &Hscr;' will in general lead to **increased deterministic noise**: with fewer hypotheses at our disposal, we are less able to fit a higher-order deterministic target function. Moreover, restricting to a subset can never decrease deterministic noise, since the best approximation to f in &Hscr;' is also available in &Hscr;.

## Regularization with Weight Decay

**Given :** two-dimensional input x = (x1, x2) so that &Xscr; = &reals;<sup>2</sup>, with labels &Yscr; = {-1,1}.

**Want :** Linear Regression with a non-linear transformation for classification given by:

&phi;(x1, x2) = (1, x1, x2, x1<sup>2</sup>, x2<sup>2</sup>, x1x2, |x1-x2|, |x1+x2|)


Classification error is defined as the fraction of misclassified points.
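
The transform and the error measure above can be sketched in plain NumPy. This is a minimal sketch, not the contents of `hw6_nlt` (which is not shown here): the names `phi`, `fit_weight_decay`, and `classification_error` are hypothetical, and the closed-form solution assumes the weight-decay objective is ||Zw - y||<sup>2</sup> + &lambda;||w||<sup>2</sup>.

```python
import numpy as np

def phi(X):
    """Map each row (x1, x2) to (1, x1, x2, x1^2, x2^2, x1*x2, |x1-x2|, |x1+x2|)."""
    X = np.asarray(X, dtype=float)
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([
        np.ones(len(X)), x1, x2, x1 ** 2, x2 ** 2, x1 * x2,
        np.abs(x1 - x2), np.abs(x1 + x2),
    ])

def fit_weight_decay(Z, y, lam):
    """Assumed objective: ||Zw - y||^2 + lam * ||w||^2, solved in closed form."""
    return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)

def classification_error(w, Z, y):
    """Fraction of points whose predicted sign differs from the label."""
    return float(np.mean(np.sign(Z @ w) != y))
```

With `lam = 0` and a non-singular Z<sup>T</sup>Z, `fit_weight_decay` reduces to ordinary linear regression.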

In [2]:
rwd_train = "hw6_train.dta"
rwd_test = "hw6_test.dta"
l_reg = math.pow(10.0, -3) #lambda regularization term

# load data from external files and init
rwd_data = LFD_Data(rwd_train, rwd_test)
rwd_algo = LinRegNLT2(rwd_data.dim, 7, l_reg)

def rwd_print_error(algo, data):
    """Print the in-sample and out-of-sample error of a trained model."""
    ein = algo.calc_error(data.train_X, data.train_Y)
    eout = algo.calc_error(data.test_X, data.test_Y)
    print("E_in: %f, E_out: %f" % (ein, eout))

#train without regularization
rwd_algo.train(rwd_data.train_X, rwd_data.train_Y)
print("Linear Regression without Weight Decay:")
rwd_print_error(rwd_algo, rwd_data)


rwd_k = np.arange(-3, 4)

for k in rwd_k:
    cur_lam = math.pow(10.0, k)
    rwd_algo.set_lambda(cur_lam)
    rwd_algo.train_reg(rwd_data.train_X, rwd_data.train_Y)
    print("Linear Regression with Weight Decay (k = %d):" % k)
    rwd_print_error(rwd_algo, rwd_data)

Linear Regression without Weight Decay:
E_in: 0.028571, E_out: 0.084000
Linear Regression with Weight Decay (k = -3):
E_in: 0.028571, E_out: 0.080000
Linear Regression with Weight Decay (k = -2):
E_in: 0.028571, E_out: 0.084000
Linear Regression with Weight Decay (k = -1):
E_in: 0.028571, E_out: 0.056000
Linear Regression with Weight Decay (k = 0):
E_in: 0.000000, E_out: 0.092000
Linear Regression with Weight Decay (k = 1):
E_in: 0.057143, E_out: 0.124000
Linear Regression with Weight Decay (k = 2):
E_in: 0.200000, E_out: 0.228000
Linear Regression with Weight Decay (k = 3):
E_in: 0.371429, E_out: 0.436000


## Regularization for Polynomials

**Given :** Transform from a linear model to a space &Zscr; given by &Phi;: &Xscr; &rarr; &Zscr; where z &isin; &Zscr; is a vector of Legendre polynomials (1, L<sub>1</sub>(x), L<sub>2</sub>(x),...,L<sub>Q</sub>(x)) and the hypothesis set &Hscr;<sub>Q</sub> is given by:

&Hscr;<sub>Q</sub> = { h | h(x) = w<sup>T</sup>z = &sum;(q=0;Q){&wscr;<sub>q</sub>L<sub>q</sub>(x)}}, where L<sub>0</sub>(x) = 1



Given the constrained hypothesis set:

&Hscr;(Q,C,Q<sub>o</sub>) = { h | h(x) = w<sup>T</sup>z &isin; &Hscr;<sub>Q</sub>; &wscr;<sub>q</sub> = C for q &ge; Q<sub>o</sub>}


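we see that if C = 0, every weight &wscr;<sub>q</sub> with q &ge; Q<sub>o</sub> is zero, so the set contains no Legendre terms of degree &ge; Q<sub>o</sub> and the largest degree polynomial in it is of degree Q<sub>o</sub> - 1. Hence &Hscr;(10,0,3) = &Hscr;<sub>2</sub> and &Hscr;(10,0,4) = &Hscr;<sub>3</sub>, and since &Hscr;<sub>2</sub> &subset; &Hscr;<sub>3</sub>: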


&Hscr;(10,0,3) &Intersection; &Hscr;(10,0,4) = &Hscr;<sub>2</sub>
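
As a rough illustration, the Legendre feature vector and the intersection argument can be checked numerically. The helper names `legendre_features` and `free_indices` are hypothetical; `numpy.polynomial.legendre.legvander` builds the columns L<sub>0</sub>(x), ..., L<sub>Q</sub>(x).

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_features(x, Q):
    """Return the matrix whose row for each x is (L_0(x), L_1(x), ..., L_Q(x))."""
    return legendre.legvander(np.asarray(x, dtype=float), Q)

def free_indices(Q, Q_o):
    """Indices q whose weight is unconstrained in H(Q, 0, Q_o), i.e. q < Q_o."""
    return set(range(min(Q_o, Q + 1)))

# Weights that may be non-zero in both H(10,0,3) and H(10,0,4).
shared = free_indices(10, 3) & free_indices(10, 4)
max_degree = max(shared)  # highest Legendre degree left in the intersection
```

Here `shared` contains the indices 0, 1, and 2, which matches &Hscr;<sub>2</sub>.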

## Neural Networks

**Given :** a fully connected neural network with L = 2, d<sup>(0)</sup> = 5, d<sup>(1)</sup> = 3, d<sup>(2)</sup> = 1, counting only products of the form w<sub>ij</sub><sup>(l)</sup>x<sub>i</sub><sup>(l-1)</sup>, w<sub>ij</sub><sup>(l)</sup>&delta;<sub>j</sub><sup>(l)</sup>, and x<sub>i</sub><sup>(l-1)</sup>&delta;<sub>j</sub><sup>(l)</sup> as operations. The &delta;<sub>j</sub><sup>(l)</sup> are the partial derivatives &PartialD;e(w)/&PartialD;s<sub>j</sub><sup>(l)</sup>, where e(w) is the error function.


Furthermore, the signals s<sub>j</sub> are given by:

s<sub>j</sub><sup>(l)</sup> = &sum;(i=0;d<sup>(l-1)</sup>){w<sub>ij</sub><sup>(l)</sup>x<sub>i</sub><sup>(l-1)</sup>}. (1)



The &delta;<sub>j</sub><sup>(l)</sup> for layers other than l = L are calculated as:

&delta;<sub>i</sub><sup>(l-1)</sup> = (1-(x<sub>i</sub><sup>(l-1)</sup>)<sup>2</sup>)&sum;(j=1;d<sup>(l)</sup>){w<sub>ij</sub><sup>(l)</sup>&delta;<sub>j</sub><sup>(l)</sup>} (2)


After backpropagation, weights are updated with the following equation:


w<sub>ij</sub><sup>(l)</sup> = w<sub>ij</sub><sup>(l)</sup> - &eta;x<sub>i</sub><sup>(l-1)</sup>&delta;<sub>j</sub><sup>(l)</sup> (3)
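
Equations (1) to (3) can be put together into one stochastic gradient descent step. The sketch below makes two assumptions that are not stated above: &theta; is tanh (consistent with the 1 - x<sup>2</sup> factor in equation (2)), and the error is e(w) = (x<sup>(L)</sup> - y)<sup>2</sup>, which fixes the output-layer &delta;. The function names and the weight layout (row i, column j holds w<sub>ij</sub>, with row 0 for the constant unit) are illustrative choices.

```python
import numpy as np

def forward(weights, x):
    """Return the layer outputs x^(0..L); every non-output layer gets x_0 = 1 prepended."""
    xs = [np.concatenate(([1.0], x))]
    for l, W in enumerate(weights, start=1):
        out = np.tanh(xs[-1] @ W)  # equation (1) followed by theta = tanh
        xs.append(out if l == len(weights) else np.concatenate(([1.0], out)))
    return xs

def sgd_step(weights, x, y, eta=0.1):
    """One backpropagation update; returns new weight matrices."""
    xs = forward(weights, x)
    # Assumed error e = (x^(L) - y)^2, so delta^(L) = 2 (x^(L) - y) (1 - (x^(L))^2).
    delta = 2.0 * (xs[-1] - y) * (1.0 - xs[-1] ** 2)
    new_weights = [None] * len(weights)
    for l in range(len(weights), 0, -1):
        W = weights[l - 1]
        new_weights[l - 1] = W - eta * np.outer(xs[l - 1], delta)  # equation (3)
        if l > 1:
            # Equation (2); the constant unit i = 0 has no delta, so skip row 0.
            delta = (1.0 - xs[l - 1][1:] ** 2) * (W[1:] @ delta)
    return new_weights
```

For the network above, `weights` would hold one array of shape (6, 3) and one of shape (4, 1).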


### Calculating operations needed for backpropagation

- Calculating the x<sub>i</sub>'s requires the **w<sub>ij</sub><sup>(l)</sup>x<sub>i</sub><sup>(l-1)</sup>** products, since x<sub>j</sub><sup>(l)</sup> = &theta;(s<sub>j</sub><sup>(l)</sup>) and the s<sub>j</sub>'s are given by equation (1). The input layer has 6 nodes (x<sub>0</sub> = 1, x<sub>1</sub>, x<sub>2</sub>, x<sub>3</sub>, x<sub>4</sub>, x<sub>5</sub>) feeding into 3 nodes in layer 1, for 18 operations. Layer 1 has 4 nodes (including the constant i = 0 node) feeding into 1 node in layer 2, for 4 operations. That gives 18 + 4 = **22** operations to calculate all the x<sub>i</sub>'s.
- For backpropagation proper, we do not count the operations needed to calculate &delta;<sub>1</sub><sup>(L)</sup> (its formula is not listed above). To calculate the &delta;'s for l < 2, we use equation (2), which involves the **w<sub>ij</sub><sup>(l)</sup>&delta;<sub>j</sub><sup>(l)</sup>** products. For each l, x<sub>0</sub> = 1, so the factor (1 - x<sub>0</sub><sup>2</sup>) in equation (2) is 0 and the &delta;<sub>0</sub>'s do not factor into the operation count. For the &delta;<sup>(1)</sup>'s we therefore have 3 operations, one per &delta;, since each right-hand sum has a single term w<sub>ij</sub><sup>(2)</sup>&delta;<sub>j</sub><sup>(2)</sup> because d<sup>(2)</sup> = 1. Since there are no s<sub>i</sub> signals for the input layer l = 0, no &delta;<sub>i</sub>'s are calculated there and l = 0 contributes 0 operations. Thus backpropagation takes 3 + 0 = **3** operations.
- Updating the weights requires the **x<sub>i</sub><sup>(l-1)</sup>&delta;<sub>j</sub><sup>(l)</sup>** products. Every weight is updated with equation (3), and each update involves one such product. As found in the feedforward calculation, there are 22 weights (6 &times; 3 = 18 going from l=0 to l=1, and 4 &times; 1 = 4 going from l=1 to l=2), and thus **22** operations.
- Thus overall, there are 22 + 3 + 22 = **47** operations used.
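
The tally above can be reproduced for any layer sizes with a short helper. This is only a sketch of the counting convention used in the list (the function name `backprop_operation_counts` is hypothetical), with `dims[l]` standing for d<sup>(l)</sup>, the number of non-constant units in layer l.

```python
def backprop_operation_counts(dims):
    """Return (forward, backward, update) operation counts for layer sizes dims."""
    n_layers = len(dims) - 1  # this is L
    # Forward pass: one w*x product per weight, including weights from x_0 = 1.
    forward = sum((dims[l - 1] + 1) * dims[l] for l in range(1, n_layers + 1))
    # Backward pass: deltas for hidden layers 1..L-1 only; each of the d^(l)
    # deltas sums d^(l+1) products of the form w*delta.
    backward = sum(dims[l] * dims[l + 1] for l in range(1, n_layers))
    # Weight update: one x*delta product per weight.
    update = forward
    return forward, backward, update

forward_ops, backward_ops, update_ops = backprop_operation_counts([5, 3, 1])
total_ops = forward_ops + backward_ops + update_ops
```

For `[5, 3, 1]` this gives 22, 3, and 22, for a total of 47.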


**Given :** Neural network with 10 input units (counting x0s), 36 hidden units (counting x0s), and 1 output unit, with the hidden units arrangeable in any number of layers l = 1,...,L-1 and each layer fully connected to the layer above it.

- The **minimum** number of weights is obtained by spreading the hidden units over as many layers as possible: 18 hidden layers of 2 units each (the constant x<sub>0</sub> plus one regular unit). The 10 input units contribute 10 weights into the first hidden layer, and each of the 36 hidden units contributes exactly 1 weight into the single regular unit of the next layer (or into the output unit), giving a total of 10 + 36 = **46** weights.
- For the **maximum**, transitioning from one layer to the next requires &rho; = d<sup>(l-1)</sup> &times; (d<sup>(l)</sup> - 1) weights (since we don't feed into the x0 terms), so we want to make this quantity as large as possible. A product of two factors with a fixed sum is largest when the factors are close to equal, which suggests choosing d<sup>(l-1)</sup> close to d<sup>(l)</sup> - 1 so that the product is as close to a square as possible.

In [26]:
hidden_layers = [22,14]

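def number_of_weights(d_list):
    """Count weights for hidden layer sizes d_list (each size includes its x0 unit)."""
    prev_d = 10  # number of units in the input layer (counting x0)
    num_w = 0  # running total of weights
    for d in d_list:
        # every unit of the previous layer feeds each non-constant unit of this layer
        num_w += (d - 1) * prev_d
        prev_d = d
    # every unit of the last hidden layer feeds the single output unit
    return num_w + d_list[-1]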

print("With hidden layer dimensions:")
print(hidden_layers)
print("We require %d weights." % number_of_weights(hidden_layers))


With hidden layer dimensions:
[22, 14]
We require 510 weights.


Using the above reasoning as a jumping-off point (and some fiddling around with values), the **maximum** number of weights the given neural network can have is **510**.
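
The trial-and-error step can also be replaced by an exhaustive search over every way of splitting the 36 hidden units into layers of at least 2 units (the constant x0 plus at least one regular unit). The sketch below is one possible way to do this with memoized recursion; the name `extreme_weights` is hypothetical and the counting rule is the same one used in `number_of_weights`.

```python
from functools import lru_cache

N_INPUT = 10   # input units, counting x0
N_HIDDEN = 36  # hidden units, counting each layer's x0

def extreme_weights(pick):
    """Return the best total weight count over all layerings; pick is max or min."""
    @lru_cache(maxsize=None)
    def best(prev_d, remaining):
        if remaining == 0:
            # every unit of the last hidden layer feeds the single output unit
            return prev_d
        options = []
        for d in range(2, remaining + 1):
            if remaining - d == 1:
                continue  # one leftover unit cannot form a layer on its own
            # prev_d units feed the d - 1 non-constant units of this layer
            options.append(prev_d * (d - 1) + best(d, remaining - d))
        return pick(options)
    return best(N_INPUT, N_HIDDEN)

max_weights = extreme_weights(max)
min_weights = extreme_weights(min)
```

Under these assumptions the search agrees with both figures derived above: 510 for the maximum and 46 for the minimum.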