In [1]:
from IPython.display import IFrame

In [3]:
IFrame("hw5.pdf", width=1000, height=1000)

## Problem 1

__[c]__

For $\sigma = 0.1$ and $d=8$, we want to solve the following inequality for $N$: $$ \sigma^2 \left( 1 - \dfrac{d+1}{N} \right) \geq 0.008 $$

Plugging things in we get $$ (0.1)^2 \left( 1 - \dfrac{9}{N} \right) \geq 0.008,$$
where dividing both sides by $\sigma^2 = 0.01$ gives $$ \left( 1 - \dfrac{9}{N} \right) \geq 0.8, $$
and by moving terms around more we have $$ 0.2 \geq \dfrac{9}{N}. $$

Finally, multiplying both sides by $N$ and dividing through by $0.2$ gives us the solution $$ N \geq 45. $$

Among the answer choices, the smallest $N$ that gives us the desired expected value for $E_{in}$ is $N = 100$.
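As a quick numerical check of the algebra above, the sketch below evaluates the expected in-sample error formula for a few values of $N$. The function name `expected_ein` and the list of candidate values are illustrative assumptions, not something taken from the notebook.

```python
def expected_ein(sigma, d, n):
    """Expected in-sample error sigma^2 * (1 - (d + 1) / N)."""
    return sigma ** 2 * (1 - (d + 1) / n)

sigma, d, target = 0.1, 8, 0.008

# Hypothetical candidate values of N, chosen for illustration only.
candidates = [10, 25, 100, 500, 1000]

# Keep the candidates whose expected E_in reaches the target.
feasible = [n for n in candidates if expected_ein(sigma, d, n) >= target]
smallest = min(feasible)

for n in candidates:
    print(n, expected_ein(sigma, d, n))
print("smallest feasible N among candidates:", smallest)
```

At $N = 45$ the formula sits right at the target value, so any candidate below that falls short.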

## Problem 2

__[d]__

Let's recall the standard equation for a hyperbola centered at $(0,0)$ and see if that helps (in terms of $x_1$ and $x_2$): $$ \dfrac{x_1^2}{a^2} - \dfrac{x_2^2}{b^2} = 1, $$ where the distance between the vertices is $2a$ and the distance between the foci is $2c$ with $c^2 = a^2+b^2$, _not that we actually care about any of that information_.

Note that the hypothesis assigns the value $+1$ to a point $(1,x_1,x_2)$ if it falls __in between the two branches of the hyperbola__.

Let's simplify things further and get rid of the annoying $a$ and $b$ to yield the equation $$ x_1^2 - x_2^2 = 1, $$ which tells us we want something like $$ x_1^2 - x_2^2 < 1. $$

In the $\mathcal{Z}$-space, we have $ z_1 = x_1^2 $ and $ z_2 = x_2^2 $. Our linear hypothesis looks like $$ \tilde{w}^T\boldsymbol{z} = \tilde{w}_0z_0 + \tilde{w}_1z_1 + \tilde{w}_2z_2 = \tilde{w}_0 + \tilde{w}_1x_1^2 + \tilde{w}_2x_2^2. $$

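Rewriting the inequality as $ 1 - x_1^2 + x_2^2 > 0 $ and matching terms with the linear hypothesis gives $ \tilde{w}_0 = 1 $, $ \tilde{w}_1 = -1 $ and $ \tilde{w}_2 = 1 $, so we want $ \tilde{w}_1 < 0 $ and $ \tilde{w}_2 > 0 $.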
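To see the sign pattern in action, here is a small sketch that classifies a few points with the weights $(1, -1, 1)$ in the $\mathcal{Z}$-space. The function name `classify` and the sample points are assumptions chosen for illustration.

```python
def classify(x1, x2, w=(1.0, -1.0, 1.0)):
    """Sign of w0 + w1*x1^2 + w2*x2^2, the linear hypothesis in Z-space."""
    z = (1.0, x1 ** 2, x2 ** 2)  # nonlinear transform (1, x1^2, x2^2)
    signal = sum(wi * zi for wi, zi in zip(w, z))
    return 1 if signal > 0 else -1

print(classify(0.0, 0.0))  # origin, between the branches
print(classify(2.0, 0.0))  # beyond the right-hand branch
print(classify(0.0, 5.0))  # far along the x2 axis, still between the branches
```

Points with large $|x_1|$ are pushed negative by $\tilde{w}_1 < 0$, while large $|x_2|$ only makes the signal more positive.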

## Problem 3

__[c]__

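Counting the constant coordinate, the transformed space has $15$ coordinates, so the VC dimension of a linear model in that space is at most $15$.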
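The count can be checked by enumerating monomials $x_1^i x_2^j$ with $i + j \leq 4$. This sketch assumes the transform in question is the fourth-order polynomial transform of a two-dimensional input; the variable names are illustrative.

```python
max_degree = 4

# All exponent pairs (i, j) with i + j <= max_degree, including (0, 0) for the constant.
monomials = [(i, j)
             for i in range(max_degree + 1)
             for j in range(max_degree + 1 - i)]

print(len(monomials))
```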

## Problem 4

__[e]__

We want the partial derivative $ \frac{\partial E}{\partial u} $ of $$ E(u,v) = (ue^v - 2ve^{-u})^2 $$

Using the chain rule, we get $$ \dfrac{\partial E}{\partial u} = 2(ue^v - 2ve^{-u})(e^v + 2ve^{-u}), $$ where the second term in the last factor is positive because $ \frac{\partial}{\partial u} e^{-u} = -e^{-u} $, so $ \frac{\partial}{\partial u}\left(-2ve^{-u}\right) = +2ve^{-u} $.

## Problem 5

__[d]__

We also want the partial derivative of $E$ with respect to $v$: $$ \dfrac{\partial E}{\partial v} = 2(ue^v - 2ve^{-u})(ue^v - 2e^{-u}) $$
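Both partial derivatives can be sanity-checked against central finite differences. The sketch below is self-contained and uses `math` rather than NumPy; the helper names (`E`, `dE_du_analytic`, `dE_dv_analytic`, `central_difference`) and the step size `h` are assumptions made for this check.

```python
import math

def E(u, v):
    return (u * math.exp(v) - 2 * v * math.exp(-u)) ** 2

def dE_du_analytic(u, v):
    return 2 * (u * math.exp(v) - 2 * v * math.exp(-u)) * (math.exp(v) + 2 * v * math.exp(-u))

def dE_dv_analytic(u, v):
    return 2 * (u * math.exp(v) - 2 * v * math.exp(-u)) * (u * math.exp(v) - 2 * math.exp(-u))

def central_difference(f, u, v, h=1e-6):
    """Numerical partials of f at (u, v) using symmetric differences."""
    du = (f(u + h, v) - f(u - h, v)) / (2 * h)
    dv = (f(u, v + h) - f(u, v - h)) / (2 * h)
    return du, dv

num_du, num_dv = central_difference(E, 1.0, 1.0)
print(num_du, dE_du_analytic(1.0, 1.0))
print(num_dv, dE_dv_analytic(1.0, 1.0))
```

If the two numbers on each line agree to several digits, the analytic expressions are very likely correct at that point.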

In [5]:
import numpy as np

In [5]:
def gradientError(u, v):
    """returns the error surface value E(u,v) (the error itself, not its gradient)"""
    return (u*np.exp(v) - 2*v*np.exp(-u))**2

In [14]:
def dE_du(u, v):
    """
    returns the partial derivative of E(u,v) w/r/t u
    """
    return 2 * (u * np.exp(v) - 2 * v * np.exp(-u)) * (np.exp(v) + 2 * v * np.exp(-u))

In [8]:
def dE_dv(u, v):
    """
    returns the partial derivative of E(u,v) w/r/t v
    """
    return 2 * (u * np.exp(v) - 2 * v * np.exp(-u)) * (u * np.exp(v) - 2 * np.exp(-u))

In [9]:
def updateWeights(weights, nabla):
    '''
    assuming there are just 2 weights we care about
    new_weight = old_weight - (learning rate) * (partial derivative w.r.t. u or v)
    note: `nabla` holds the learning rate, not the gradient
    '''
    # read both weights before updating so each partial uses the old (u, v)
    u = weights[0]
    v = weights[1]

    # perform update
    weights[0] -= nabla * dE_du(u,v)
    weights[1] -= nabla * dE_dv(u,v)

    return weights

In [10]:
def q5():
    nabla = 0.1 # Lr given in problem
    thresh = 10 ** -14 # thresh we want error to get below
    weights = [1.0,1.0] # initial weight values
    num_iters = 0
    
    while True:
        error = gradientError(weights[0], weights[1]) # calculate current error
        num_iters += 1
        
        if error < thresh or num_iters > 10000:
            print("num iterations: " + str(num_iters))
            break
        else:
            weights = updateWeights(weights, nabla) # perform GD
            
    print("Error: " + str(gradientError(weights[0], weights[1])))
    return weights

In [15]:
q5()

num iterations: 11
Error: 1.20868339442e-15


[0.044736290397782069, 0.023958714099141746]

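The counter is incremented before the threshold check, so the reported $11$ includes the final check that ends the loop; the weights were updated $10$ times, which matches the answer choice $10$.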

## Problem 6

__[e]__

We returned the weights above so we can just use the values we got, which are closest to the answer pair $[0.045, 0.024]$.

## Problem 7

__[a]__

We'll want functions that update $u$ and $v$ separately, each stepping along its own partial derivative, so let's write those.

In [16]:
def updateU(weights, nabla):
    # u is the first coord: step along dE/du only, scaled by the learning rate
    weights[0] -= nabla * dE_du(weights[0], weights[1])
    return weights

In [17]:
def updateV(weights, nabla):
    # v is the second coord: step along dE/dv, evaluated at the already-updated u
    weights[1] -= nabla * dE_dv(weights[0], weights[1])
    return weights

In [18]:
def q7():
    nabla = 0.1 # learning rate
    weights = [1.0, 1.0]
    num_iters = 15

    for i in range(num_iters):
        weights = updateU(weights, nabla) # update u first
        weights = updateV(weights, nabla) # then update v

    print("error: " + str(gradientError(weights[0], weights[1])))
    return weights

In [19]:
q7()

After 15 full iterations the error is roughly $0.14$, which is closest to the answer value $10^{-1}$.

## Problem 8

Let's copy over some utility functions from HW1.

In [4]:
import random
import numpy as np

In [2]:
def generatePoints(numberOfPoints):
    x1 = random.uniform(-1, 1)
    y1 = random.uniform(-1, 1)
    x2 = random.uniform(-1, 1)
    y2 = random.uniform(-1, 1)
    points = []

    for i in range (numberOfPoints):
        x = random.uniform (-1, 1)
        y = random.uniform (-1, 1)
        points.append([1, x, y, targetFunction(x1, y1, x2, y2, x, y)]) # add 1/-1 indicator to the end of each point list
    return x1, y1, x2, y2, points

In [3]:
def targetFunction(x1,y1,x2,y2,x3,y3):
    u = (x2-x1)*(y3-y1) - (y2-y1)*(x3-x1)
    if u >= 0:
        return 1
    elif u < 0:
        return -1

In [6]:
def stochasticGradient(weights, point):
    """
    gradient of the cross-entropy error ln(1 + exp(-y * w.x)) on a single point,
    where point[:3] = x and point[3] = y
    """
    return (np.array(point[:3]) * point[3])/(1.0 + np.exp(point[3]*np.dot(weights, point[:3]))) * - 1

In [8]:
def stochasticLogisticRegression(threshold, weights, points, nabla):
    increment = 1.0
    iterations = 0
    
    while np.any(increment >= threshold):
        iterations += 1
        random.shuffle(points)
        oldWeights = list(weights)
        
        # perform gradient descent on the weights
        for point in points:
            grad = stochasticGradient(weights, point)
            weights -= nabla * grad # decrement using LR
            
        # after each full pass over the points (one epoch), record how much each weight changed;
        # the loop stops once every component of w(t-1) - w(t) is below the threshold (0.01)
        increment = np.abs(oldWeights - weights)
        
    return weights, iterations

In [7]:
def crossEntropy(weights, points):
    """
    cross-entropy cost function
    
    for each point (x_n,y_n) we know point[3] = y_n, while point[:3] = x_n because of how defined.
    """
    vals = []
    for point in points:
        vals.append(np.log(1.0 + np.exp(- point[3] * np.dot(weights, point[:3]))))
        
    return np.mean(vals) # the 1/N term in the cost

In [11]:
def q8(numTrials, numPoints):
    """
    for this question we will do 100 trials with N=100 points.
    """
    eouts = []
    iters = []
    thresh = 0.01
    nabla = 0.01
    numPoints = 1000
    
    for i in range(numTrials):
        weights = np.array([0.0,0.0,0.0]) # init weights
        x1, y1, x2, y2, points = generatePoints(numPoints)
        weights, iterations = stochasticLogisticRegression(thresh, weights, points, nabla)
        errorCt = 0
        
        # generate more points to get e_out
        x = []
        for i in range(numPoints):
            x_ = random.uniform(-1,1)
            y_ = random.uniform(-1,1)
            x.append([1, x_, y_, targetFunction(x1,y1,x2,y2,x_,y_)]) # same form as usual
            
        eouts.append(crossEntropy(weights, x)) # calculate crossentropy of trained weights on new points
        iters.append(iterations)
        
    return np.mean(eouts), np.mean(iters)

In [13]:
q8(100,100)

(0.035797335507102269, 692.15999999999997)

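From the above run, the average $E_{out}$ is roughly $0.0358$.

At the moment, the answers for this question and question 9 do not match the expected responses. Two departures from the stated setup are likely causes: `q8` overwrites its `numPoints` argument with $1000$, so each trial trains on $1000$ points rather than $N=100$, and `stochasticLogisticRegression` stops when every component of $w(t-1) - w(t)$ is below $0.01$ rather than when the norm $\lVert w(t-1) - w(t) \rVert$ is.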
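For comparison, here is a minimal, self-contained sketch of the same experiment with $N=100$ training points per trial and a stopping rule based on the Euclidean norm of the change in weights over one epoch. The function name `run_trial`, the test-set size, the seed, the epoch cap, and the number of trials are all assumptions made for illustration. It uses plain Python lists rather than NumPy so that it runs without the helpers above.

```python
import math
import random

def run_trial(rng, n_train=100, n_test=1000, eta=0.01, tol=0.01, max_epochs=10000):
    """One trial: train with SGD on n_train points, estimate E_out on n_test."""
    # Random target line through two points in [-1, 1] x [-1, 1].
    x1, y1, x2, y2 = (rng.uniform(-1, 1) for _ in range(4))

    def label(x, y):
        return 1 if (x2 - x1) * (y - y1) - (y2 - y1) * (x - x1) >= 0 else -1

    def sample(n):
        pts = []
        for _ in range(n):
            x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
            pts.append(((1.0, x, y), label(x, y)))
        return pts

    train, test = sample(n_train), sample(n_test)
    w = [0.0, 0.0, 0.0]
    epochs = 0
    while epochs < max_epochs:
        epochs += 1
        w_old = list(w)
        rng.shuffle(train)
        for x, y in train:
            s = y * sum(wi * xi for wi, xi in zip(w, x))
            # Gradient of ln(1 + exp(-y w.x)) is -y x / (1 + exp(y w.x)).
            coeff = y / (1.0 + math.exp(s))
            w = [wi + eta * coeff * xi for wi, xi in zip(w, x)]
        # Stop on the Euclidean norm of the change over one epoch.
        if math.sqrt(sum((a - b) ** 2 for a, b in zip(w_old, w))) < tol:
            break

    # Cross-entropy error on fresh points drawn from the same target.
    e_out = sum(math.log(1.0 + math.exp(-y * sum(wi * xi for wi, xi in zip(w, x))))
                for x, y in test) / n_test
    return e_out, epochs

rng = random.Random(0)  # fixed seed, an arbitrary choice for repeatability
results = [run_trial(rng) for _ in range(5)]
print("mean E_out:", sum(r[0] for r in results) / len(results))
print("mean epochs:", sum(r[1] for r in results) / len(results))
```

Five trials give only a rough estimate; raising the trial count to 100 would match the averaging used in `q8`.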

## Problem 9

We can also use the above run: the mean number of epochs before stopping was roughly $692$.

## Problem 10

__[e]__

This question asks which error measure, if used with SGD, would produce exactly the same weight update as the PLA algorithm we learned at the beginning of the course. For reference, recall that PLA picks a _misclassified_ point, one with $\text{sign}(\boldsymbol{w}^T\boldsymbol{x}_n) \neq y_n$, and updates the weight vector $$ \boldsymbol{w} \leftarrow \boldsymbol{w} + y_n\boldsymbol{x}_n. $$

In SGD, we perform the update using the error $e_n$ on a single point: $$ \boldsymbol{w} \leftarrow \boldsymbol{w} - \eta \nabla e_n(\boldsymbol{w}). $$

The only answer with the expected behavior is $$ e_n(\boldsymbol{w}) = -\min(0, y_n\boldsymbol{w}^T\boldsymbol{x}_n). $$
As with PLA, $\boldsymbol{w}$ is not updated when the point is correctly classified (the error and its gradient are both 0). When the point is misclassified, $ e_n = -y_n\boldsymbol{w}^T\boldsymbol{x}_n $, whose gradient with respect to $\boldsymbol{w}$ is $-y_n\boldsymbol{x}_n$, so with $\eta = 1$ the SGD step is $\boldsymbol{w} \leftarrow \boldsymbol{w} + y_n\boldsymbol{x}_n$, which is exactly the PLA update.
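The equivalence can be checked on a concrete point. The sketch below compares one SGD step on $-\min(0, y_n\boldsymbol{w}^T\boldsymbol{x}_n)$ with learning rate $1$ against the PLA update; the function names `sgd_step` and `pla_step` and the sample weights are assumptions for illustration.

```python
def sgd_step(w, x, y, eta=1.0):
    """One SGD step on e_n(w) = -min(0, y * w.x)."""
    signal = y * sum(wi * xi for wi, xi in zip(w, x))
    if signal < 0:
        # Misclassified: e_n = -y * w.x, so the gradient is -y * x.
        grad = [-y * xi for xi in x]
    else:
        # Correctly classified: e_n = 0, so the gradient is zero.
        # (At signal == 0 the error is not differentiable; this sketch
        # picks the zero subgradient there, which is an assumption.)
        grad = [0.0 for _ in x]
    return [wi - eta * gi for wi, gi in zip(w, grad)]

def pla_step(w, x, y):
    """PLA update, applied only when the point is misclassified."""
    signal = y * sum(wi * xi for wi, xi in zip(w, x))
    if signal < 0:
        return [wi + y * xi for wi, xi in zip(w, x)]
    return list(w)

w = [0.5, -1.0, 2.0]
misclassified = ([1.0, 1.0, 1.0], -1)  # w.x = 1.5 but the label is -1
correct = ([1.0, 1.0, 1.0], 1)         # w.x = 1.5 and the label is +1
for x, y in (misclassified, correct):
    print(sgd_step(w, x, y), pla_step(w, x, y))
```

Both functions return the same weights in each case: a shifted vector for the misclassified point and the unchanged vector for the correct one.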