In [28]:
import util
import numpy as np

In [29]:
def scl(s,v):
    """Return s*v, where s is a scalar and v is a sparse vector (dict)."""
    return {key:s*val for key,val in v.items()}

# Problem 1

In [30]:
reviews = [
            "pretty bad",
            "good plot",
            "not good",
            "pretty scenery",
          ]

def phi(review):
    """Returns sparse feature vector from review"""
    return {word:1 for word in review.split()}

xs = list(map(phi,reviews))
ys = [-1.,1.,-1.,1.] # ground truth

xs

[{'pretty': 1, 'bad': 1},
 {'good': 1, 'plot': 1},
 {'not': 1, 'good': 1},
 {'pretty': 1, 'scenery': 1}]

In [45]:
def loss_hinge(x,y,w):
    return max(0,1-y * util.dotProduct(w,x))

def training_loss(xs,ys,w):
    return np.array([loss_hinge(x,y,w) for x,y in zip(xs,ys)]).mean()

def loss_hinge_grad(x,y,w):
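    # Margin at least 1: zero gradient, returned as an empty sparse vector
    # so the return type is always a dict
    if 1 - y*util.dotProduct(w,x) <= 0: return {}
    return scl(-y,x)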

In [48]:
eta = 0.5
w = {}
for x,y in zip(xs,ys):
    print("w: {}, training loss: {}".format(w, training_loss(xs,ys,w)))
    util.increment(w, -eta, loss_hinge_grad(x,y,w))
print("w: {}, training loss: {}".format(w, training_loss(xs,ys,w)))

w: {}, training loss: 1.0
w: {'pretty': -0.5, 'bad': -0.5}, training loss: 0.875
w: {'pretty': -0.5, 'bad': -0.5, 'good': 0.5, 'plot': 0.5}, training loss: 0.75
w: {'pretty': -0.5, 'bad': -0.5, 'good': 0.0, 'plot': 0.5, 'not': -0.5}, training loss: 0.625
w: {'pretty': 0.0, 'bad': -0.5, 'good': 0.0, 'plot': 0.5, 'not': -0.5, 'scenery': 0.5}, training loss: 0.5


That last vector is the answer to problem 1e.

A second pass over the data (two epochs in total) brings the training loss to 0:

In [53]:
eta = 0.5
w = {}
for _ in range(2):
    for x,y in zip(xs,ys):
        print("w: {}, training loss: {}".format(w, training_loss(xs,ys,w)))
        util.increment(w, -eta, loss_hinge_grad(x,y,w))
print("w: {}, training loss: {}".format(w, training_loss(xs,ys,w)))

w: {}, training loss: 1.0
w: {'pretty': -0.5, 'bad': -0.5}, training loss: 0.875
w: {'pretty': -0.5, 'bad': -0.5, 'good': 0.5, 'plot': 0.5}, training loss: 0.75
w: {'pretty': -0.5, 'bad': -0.5, 'good': 0.0, 'plot': 0.5, 'not': -0.5}, training loss: 0.625
w: {'pretty': 0.0, 'bad': -0.5, 'good': 0.0, 'plot': 0.5, 'not': -0.5, 'scenery': 0.5}, training loss: 0.5
w: {'pretty': -0.5, 'bad': -1.0, 'good': 0.0, 'plot': 0.5, 'not': -0.5, 'scenery': 0.5}, training loss: 0.5
w: {'pretty': -0.5, 'bad': -1.0, 'good': 0.5, 'plot': 1.0, 'not': -0.5, 'scenery': 0.5}, training loss: 0.5
w: {'pretty': -0.5, 'bad': -1.0, 'good': 0.0, 'plot': 1.0, 'not': -1.0, 'scenery': 0.5}, training loss: 0.25
w: {'pretty': 0.0, 'bad': -1.0, 'good': 0.0, 'plot': 1.0, 'not': -1.0, 'scenery': 1.0}, training loss: 0.0
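
The cells above depend on the course's `util` module. As a self-contained sketch, the same updates can be reproduced with plain-dict helpers. Here `dot` and `increment` are hypothetical stand-ins: they assume that `util.dotProduct` is a sparse dot product and that `util.increment` performs an in-place scaled add. `sgd` and `mean_hinge` are illustrative names as well.

```python
def dot(w, x):
    """Sparse dot product of two dict vectors (assumed stand-in for util.dotProduct)."""
    return sum(w.get(k, 0.0) * v for k, v in x.items())

def increment(w, scale, d):
    """In-place w += scale * d (assumed stand-in for util.increment)."""
    for k, v in d.items():
        w[k] = w.get(k, 0.0) + scale * v

def hinge_grad(x, y, w):
    """Subgradient of max(0, 1 - y * w.x) with respect to w."""
    if 1 - y * dot(w, x) <= 0:
        return {}
    return {k: -y * v for k, v in x.items()}

def sgd(xs, ys, eta, n_epochs):
    """Run n_epochs passes of SGD over the data, starting from w = {}."""
    w = {}
    for _ in range(n_epochs):
        for x, y in zip(xs, ys):
            increment(w, -eta, hinge_grad(x, y, w))
    return w

def mean_hinge(xs, ys, w):
    """Average hinge loss over the dataset."""
    return sum(max(0.0, 1 - y * dot(w, x)) for x, y in zip(xs, ys)) / len(xs)

reviews = ["pretty bad", "good plot", "not good", "pretty scenery"]
xs = [{word: 1 for word in r.split()} for r in reviews]
ys = [-1.0, 1.0, -1.0, 1.0]

w1 = sgd(xs, ys, eta=0.5, n_epochs=1)
w2 = sgd(xs, ys, eta=0.5, n_epochs=2)
print(mean_hinge(xs, ys, w1), mean_hinge(xs, ys, w2))
```

This prints `0.5 0.0`, matching the last line of each run above. After the second pass every example has margin exactly 1.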


## Problem 1f

Suppose we had a dataset consisting of the reviews
- $r_1 = $"good"
- $r_2=$"not good"
- $r_3=$"bad"
- $r_4=$"not bad"

The ground truth about these reviews is obvious: $[+1,-1,-1,+1]$
(where $+1$ means it's a positive review and $-1$ means negative review).

Then the feature space of word counts has dimension 3, if we say
$$ \phi([\text{a review}]) = 
\left[
\begin{array}{c}
\text{number of occurrences of "not"}\\
\text{number of occurrences of "good"}\\
\text{number of occurrences of "bad"}
\end{array}
\right]$$

**Claim:** It is impossible for a linear classifier to get zero error on the dataset above.

_Proof:_ Suppose we had a linear classifier given by weight vector
$w$, and that it gets zero error on the training set above. Remember that the prediction
for input review $r$ would be $\operatorname{sign}(w\cdot\phi(r))$.
Correctly classifying $r_1,\dots,r_4$ would then require, respectively,
- $w_2 > 0$
- $w_1 + w_2 < 0$
- $w_3<0$
- $w_1+w_3>0$

From the first two we have
$$ w_1 < -w_2 < 0, $$
and from the last two we have
$$ w_1 > -w_3 > 0, $$
a contradiction. $\square$
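
As an informal sanity check of the claim (an illustration, not a replacement for the proof), the sketch below samples random weight vectors and counts how many classify all four reviews correctly. The sample size, seed, and Gaussian sampling are arbitrary choices made for this example.

```python
import numpy as np

# Feature order: [count("not"), count("good"), count("bad")]
X = np.array([
    [0, 1, 0],  # "good"
    [1, 1, 0],  # "not good"
    [0, 0, 1],  # "bad"
    [1, 0, 1],  # "not bad"
], dtype=float)
y = np.array([1.0, -1.0, -1.0, 1.0])

rng = np.random.default_rng(0)
W = rng.normal(size=(100_000, 3))   # candidate weight vectors, one per row
margins = (W @ X.T) * y             # entry (j, i) is y_i * (w_j . phi(r_i))

# A weight vector has zero error only if every margin is strictly positive
n_perfect = int(np.all(margins > 0, axis=1).sum())
print(n_perfect)
```

No sampled weight vector achieves zero error, as the proof predicts.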

# Problem 2

(To cut down on notation I write $x$ where I should really write $\phi(x)$)

(a)
$\text{Loss}(x,y,w)=(\sigma(w\cdot x) - y)^2$

(b)
$\nabla_w \text{Loss} = 2(\sigma(w\cdot x)-y)\sigma(w\cdot x)(1-\sigma(w\cdot x)) x$

or if we write $p=\sigma(w\cdot x)$ then it's
$\nabla_w \text{Loss} = 2(p-y)p(1-p) x$
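
A quick way to gain confidence in this formula is a finite-difference check. The sketch below compares the analytic gradient with central differences at a random point. The dimension, seed, and step size `h` are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, x, y):
    """Squared loss on the sigmoid output."""
    return (sigmoid(w @ x) - y) ** 2

def grad(w, x, y):
    """Analytic gradient 2 (p - y) p (1 - p) x, with p = sigmoid(w . x)."""
    p = sigmoid(w @ x)
    return 2 * (p - y) * p * (1 - p) * x

rng = np.random.default_rng(0)
w, x, y = rng.normal(size=3), rng.normal(size=3), 1.0

# Central differences along each coordinate direction
h = 1e-6
numeric = np.array([
    (loss(w + h * e, x, y) - loss(w - h * e, x, y)) / (2 * h)
    for e in np.eye(3)
])
max_abs_diff = float(np.max(np.abs(numeric - grad(w, x, y))))
print(max_abs_diff)
```

The printed difference should be tiny, on the order of the finite-difference error.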

(c) Suppose we set $w = \frac{\alpha}{||x||^2} x$, where $\alpha$ is a very large positive number (assuming $x\neq 0$).
Then $w\cdot x = \alpha$, so $p= \sigma(w\cdot x) = \sigma(\alpha)\approx 1$ and the factor $p(1-p)$ is close to $0$. If $y$ is $\pm 1$, the factor $(p-y)$ stays bounded, so $\nabla_w \text{Loss}$ is very close to $0$, although never exactly $0$ because $p\in(0,1)$. This is the vanishing gradient problem.
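
To see the effect numerically, the sketch below evaluates the gradient norm at $w = \frac{\alpha}{||x||^2} x$ for growing $\alpha$ and both labels. The input `x` and the values of `alpha` are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_norm(w, x, y):
    """Norm of 2 (p - y) p (1 - p) x, with p = sigmoid(w . x)."""
    p = sigmoid(w @ x)
    return abs(2 * (p - y) * p * (1 - p)) * np.linalg.norm(x)

x = np.array([1.0, 2.0, 2.0])   # arbitrary example input, ||x|| = 3
alphas = [1.0, 10.0, 30.0]

norms = {}
for y in (1.0, -1.0):
    # w = alpha * x / ||x||^2, so that w . x = alpha
    norms[y] = [grad_norm(alpha / (x @ x) * x, x, y) for alpha in alphas]
    print(y, norms[y])
```

For both labels the norm shrinks rapidly as $\alpha$ grows, while staying strictly positive.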

(d) Assume $y$ is $1$. Then
$$
\begin{align*}
|| \nabla_w \text{Loss} || 
&= 2 |(p-1)p(1-p)|  \, ||x||\\
&= 2 p (1-p)^2  \, ||x||
\end{align*}
$$
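We maximize $p (1-p)^2$ over the domain $p\in (0,1)$ (i.e. the range of the sigmoid). Its derivative is $(1-p)^2 - 2p(1-p) = (1-p)(1-3p)$, which vanishes in $(0,1)$ only at $p=\frac{1}{3}$, changing sign from positive to negative there,
so $p=\frac{1}{3}$ is the maximizing value. For $x\neq 0$ there certainly exists $w$ that makes $\sigma(w\cdot x)$ equal $\frac{1}{3}$. Hence the best bound we can get is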
$$
\begin{align*}
|| \nabla_w \text{Loss} || 
&\leq 2 \cdot \frac{1}{3} \left(1-\frac{1}{3}\right)^2  \, ||x||\\
&=\frac{8}{27}||x||
\end{align*}
$$
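
The maximization can be double-checked numerically. The sketch below evaluates $p(1-p)^2$ on a fine grid over $(0,1)$. The grid resolution is an arbitrary choice.

```python
import numpy as np

# Grid over the open interval (0, 1): drop both endpoints
p = np.linspace(0.0, 1.0, 100_001)[1:-1]
f = p * (1 - p) ** 2

p_star = float(p[np.argmax(f)])    # grid point where p (1 - p)^2 is largest
bound_coeff = float(2 * f.max())   # coefficient multiplying ||x|| in the bound
print(p_star, bound_coeff, 8 / 27)
```

The maximizer lands next to $\frac{1}{3}$ and the coefficient agrees with $\frac{8}{27}$ up to grid resolution.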

(e) Set $y_n'=\sigma^{-1}(y_n)$ and choose $w^*=w$... if I understood the problem correctly.