# DSCI 572 Lab 2

#### Extra dependencies:

- autograd: `pip install autograd`

#### Notes: 

- This lab has a lot of questions, but many are optional.
- Roughly speaking, Exercises 1-4 pertain to Monday's lecture, and the rest to Wednesday's.

In [2]:
import autograd 
from autograd import elementwise_grad as egrad  # for functions that vectorize over inputs
from autograd import grad
import autograd.numpy as np

import pandas as pd 
import matplotlib.pyplot as plt
%matplotlib inline

import scipy.optimize as spo
import scipy.special
import sklearn.datasets
from sklearn.model_selection import train_test_split

from sklearn.feature_extraction.text import CountVectorizer

## Instructions
rubric={mechanics:20}

Follow the [general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions/).

## Exercise 1: floating point errors

For each of the code snippets below, explain the result. The first three are answered for you, so you can get a sense of what types of explanations we're looking for. Note that in both the second and third cases we get the "expected" result (zero), but the explanations are very different.

In [3]:
0.3 - 0.2 - 0.1

-2.7755575615628914e-17

**Sample solution**: The result is not zero because 0.3, 0.2, and 0.1 are not represented exactly as floating point numbers.

In [4]:
0.5 - 0.25 - 0.125 - 0.125

0.0

**Sample solution**: The result is zero because 0.5, 0.25, and 0.125 are powers of 2 and therefore they _are_ represented exactly as floating point numbers. There is no roundoff error present.

In [5]:
0.4 - 0.2 - 0.2

0.0

**Sample solution**: The result is correct (zero) despite the fact that 0.4 and 0.2 are not represented exactly as floating point numbers. This is a case of good luck: while 0.4 and 0.2 are not represented exactly, the roundoff errors happened to cancel out during the subtractions.
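
One way to check which decimal literals are stored exactly is to compare the literal's exact decimal value with the exact value of the stored double. The sketch below uses only the standard library; the helper name `is_exact` is hypothetical and not part of the lab.

```python
from decimal import Decimal
from fractions import Fraction

def is_exact(literal):
    """Return True if the decimal string `literal` is stored exactly as a double."""
    # Fraction(str) is the exact decimal value written in the literal;
    # Fraction(float) is the exact value of the double that Python stores.
    return Fraction(literal) == Fraction(float(literal))

for literal in ["0.5", "0.25", "0.125", "0.1", "0.2", "0.3", "0.4"]:
    # Decimal(float) prints every digit of the stored binary value.
    print(literal, is_exact(literal), Decimal(float(literal)))
```

The powers of 2 should come out as exact and the other literals as inexact, which matches the three sample solutions above.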

### snippet (a)
rubric={reasoning:5}

In [6]:
30 - 20 - 10

0

### snippet (b)
rubric={reasoning:5}

In [7]:
30.0 - 20.0 - 10.0

0.0

### snippet (c)
rubric={reasoning:5}

In [8]:
(10.0**100 + 1) == 10.0**100

True

### snippet (d)
rubric={reasoning:5}

In [9]:
(10.0**100000 + 1) - 10.0**100000

OverflowError: (34, 'Result too large')

### snippet (e)
rubric={reasoning:5}

In [10]:
np.exp(1000) - np.exp(1000)



nan

### snippet (f)
rubric={reasoning:5}

In [11]:
1/np.exp(1000) == np.exp(-1000)



True

### snippet (g)
rubric={reasoning:5}

In [12]:
1/np.exp(100) == np.exp(-100)

False

### snippet (h)
rubric={reasoning:5}

In [13]:
np.exp(1000)==np.exp(10000)



True

### snippet (i)
rubric={reasoning:5}

In [14]:
sum(np.zeros(10)+0.1)

0.9999999999999999

### snippet (j)
rubric={reasoning:5}

In [15]:
np.sin(np.pi)

1.2246467991473532e-16

### (optional) snippet (k)
rubric={reasoning:1}

In [16]:
x = np.ones(100000)
x[0] = 1e20

y = np.ones(100000)
y[-1] = 1e20

sum(x) == sum(y)

False

### (optional) snippet (l)
rubric={reasoning:1}

In [3]:
f = lambda x: np.sqrt(1+x**2)
g = lambda x: x * np.sqrt(1+1/x**2)

print(f(10)  == g(10))
print(f(100) == g(100))
print(f(300) == g(300))

True
False
True


### (optional) snippet (m)
rubric={reasoning:1}

In [21]:
x = np.zeros(10)+0.1
sum(x) == np.sum(x)

False

Hint for the above: see [Pairwise summation](https://en.wikipedia.org/wiki/Pairwise_summation). 

## Exercise 2: floating point max/min
rubric={reasoning:20}

As discussed in lecture, IEEE double precision floating point numbers use 53 bits for the mantissa (one of which is a sign bit) and 11 bits for the exponent (again, one of which is a sign bit). Given this, calculate the largest (furthest from zero) and smallest (closest to zero) possible representable floating point numbers. Then empirically check your results with Python. Are they what you expected? Discuss.

NOTE: Python has a special behaviour that we need to watch out for. If you do something like `10**1000` you will get a giant integer. That's because Python has a dynamically expanding integer type. This has nothing to do with floating point representations, which are what we really care about for scientific computation (not to mention that most other languages, including R, do not do this). So, when playing around, make sure you write `10**1000.0` or `10.0**1000` to ensure you are working with a floating point number. Or `1e1000`... but that doesn't work for bases other than 10.
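
The difference between the integer and floating point versions can be seen directly. This is a small sketch using only built-in Python; the variable names are illustrative.

```python
# Integer power: Python ints grow as needed, so this never overflows.
big_int = 10**1000
print(type(big_int).__name__, len(str(big_int)))

# Float literal beyond the double range: silently becomes infinity.
inf_literal = 1e1000
print(inf_literal)

# Float power beyond the double range: Python raises instead of returning inf.
try:
    10.0**1000
    overflowed = False
except OverflowError as err:
    overflowed = True
    print("OverflowError:", err)
```

Note that the two floating point spellings fail in different ways (`inf` versus an exception), so it is worth knowing which one you are using when probing the limits.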

(Also, and this is _REALLY_ out of scope but just FYI if anyone cares, in some languages `1eX` and `10^X` will return slightly different answers, if the language uses a special routine for `1eX` that is more optimized than the general power function. I cannot imagine this ever mattering to any of us.)

## (optional) Exercise 3: $\log(1+\exp(z))$, continued
rubric={accuracy:1,reasoning:1}

In lecture we discussed computing $\log(1+\exp(z))$. We wrote a better version of the function for when $z\gg 1$, reproduced below.

In [28]:
def log_1_plus_exp(z):
    return np.log(1+np.exp(z))

def log_1_plus_exp_safe(z):
    if z > 100:
        return z
    else:
        return np.log(1+np.exp(z))
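
As a reminder of why the safe version exists, the sketch below evaluates both functions at a large positive $z$. It assumes plain NumPy rather than `autograd.numpy` and repeats the two definitions so that it runs on its own; the test value `1000.0` is an arbitrary choice.

```python
import numpy as np  # plain NumPy here; the lab itself imports autograd.numpy as np

def log_1_plus_exp(z):
    return np.log(1 + np.exp(z))

def log_1_plus_exp_safe(z):
    if z > 100:
        return z
    else:
        return np.log(1 + np.exp(z))

# exp(1000) overflows to inf, so the naive version returns inf as well.
with np.errstate(over="ignore"):  # silence the overflow RuntimeWarning
    naive = log_1_plus_exp(1000.0)

# The safe version skips exp() entirely for large z.
safe = log_1_plus_exp_safe(1000.0)
print(naive, safe)
```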

Now let's consider the case of $z\ll -1$, i.e. when $z$ is a large negative number. In that case, we get underflow with both of the above implementations:

In [29]:
print(log_1_plus_exp(-100))

0.0


In [30]:
print(log_1_plus_exp_safe(-100))

0.0


Your tasks:

1. Investigate why this is happening. Is the problem that $\exp(-100)$ itself underflows?
2. Write a function `log_1_plus_exp_safer` that works well when $z\gg 1$ and $z\ll -1$.
3. For what range of values of $z$ does your `log_1_plus_exp_safer` function give reasonable results?
4. Your code presumably contains two thresholds, the upper and lower cutoffs for $z$ at which the approximations are invoked. Can you reason about the "optimal" or "best" values for these thresholds?

## Exercise 4: softmax logistic regression and log-sum-exp
rubric={reasoning:15,accuracy:5}

In the "multinomial" (aka softmax) approach to multi-class logistic regression, your loss ends up having one term for each class, so you get something of the form $\log \sum_{i=1}^n \exp(x_i)$. We can rewrite this as 

$$\log \displaystyle\sum_{i=1}^n \exp(x_i) = a+ \log \displaystyle\sum_{i=1}^n \exp(x_i-a)$$

for any $a$. Make sure you understand why this is the case before proceeding.
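
The identity follows from factoring $\exp(a)$ out of the sum, since $\exp(x_i) = \exp(a)\exp(x_i - a)$. A quick numerical sanity check on small, well-behaved inputs might look like the sketch below; the function names `lse_direct` and `lse_shifted` are hypothetical, and the inputs are deliberately small so that neither side overflows.

```python
import math

def lse_direct(xs):
    # Left-hand side: log of the sum of exponentials, computed directly.
    return math.log(sum(math.exp(x) for x in xs))

def lse_shifted(xs, a):
    # Right-hand side: shift every x_i by a, then add a back outside the log.
    return a + math.log(sum(math.exp(x - a) for x in xs))

xs = [0.5, -1.0, 2.0]
for a in [0.0, 1.7, -3.0]:
    # Both columns should agree up to roundoff, whatever a is.
    print(a, lse_direct(xs), lse_shifted(xs, a))
```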

1. Explain why this formulation might be more numerically stable and why $a=\max \{x_1,x_2,\ldots,x_n\}$ is a sensible choice.
2. If $a=\max \{x_1,x_2,\ldots,x_n\}$, this trick seems to rely on the fact that overflow is more of a danger than underflow, because we may now compute $\exp(z)$ for $z\gg 1$. Explain why overflow is more of a problem than underflow here.
3. (optional) Earlier, we used the approximation $\log(1+\exp(x))\approx x$ when $x\gg 1$. Is that just a specific case of what we have here? It seems that what we did earlier was an approximation, whereas what we did here is mathematically exact. Do you think there is any significance to this distinction, in practice?
4. Write a Python function `log_sum_exp` that takes in an array `x` and computes the above expression. Try it out for a few values: does it help with the overflow problems? Also, compare your implementation to [`scipy.special.logsumexp`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.logsumexp.html#scipy.special.logsumexp) for one or two cases.

(FYI: you'll see this trick in real implementations of ML algorithms!)

## Exercise 5: finite differences

### 5(a) Optimization without a gradient: experiments
rubric={reasoning:10}

The `scipy.optimize.minimize` function takes in the derivative information through the optional argument `jac` (for Jacobian). As shown below, the code works even if you don't pass in the derivative.


In [3]:
def fun(x):
    return (x[0]-2)**2 + (x[1]+4)**4

In [4]:
spo.minimize(fun, np.zeros(2)).x

array([ 1.99999996, -3.98923058])

When the gradient is not provided, `scipy.optimize.minimize` uses finite differences to estimate the gradient.
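
To get a feel for what a finite-difference gradient estimate looks like, the sketch below compares `scipy.optimize.approx_fprime` with a hand-derived gradient of the toy function above, and then passes that gradient to `minimize` through `jac`. The name `fun_grad`, the test point `x0`, and the step size `eps` are illustrative assumptions, and this is not necessarily the exact scheme `minimize` uses internally.

```python
import numpy as np
import scipy.optimize as spo

def fun(x):
    return (x[0] - 2)**2 + (x[1] + 4)**4

def fun_grad(x):
    # Hand-derived gradient of fun, used here only for comparison.
    return np.array([2 * (x[0] - 2), 4 * (x[1] + 4)**3])

x0 = np.array([1.0, -3.0])
eps = 1e-8  # finite-difference step size (an illustrative choice)

# Forward-difference estimate of the gradient at x0.
fd_grad = spo.approx_fprime(x0, fun, eps)
print(fd_grad, fun_grad(x0))

# The same optimization as above, but with the gradient passed in explicitly.
res = spo.minimize(fun, np.zeros(2), jac=fun_grad)
print(res.x)
```

Each finite-difference gradient costs extra function evaluations, which is worth keeping in mind for the timing comparison in this exercise.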

In lab 1, you implemented logistic regression "from scratch" using `scipy.optimize.minimize`, passing in the gradient. Modify your code so that the gradient is not passed in. Compare the accuracy and speed to your implementation from lab 1. We'll use a slightly smaller version of the dataset from lab 1, with 2000 examples and 100 features, prepared below.

In [69]:
imdb_df = pd.read_csv('../lab1/imdb_master.csv', encoding = "ISO-8859-1")

# Only keep the reviews with pos and neg labels
imdb_df = imdb_df[imdb_df['label'].str.startswith(('pos','neg'))]

# Train/test split
imdb_df_train = imdb_df[imdb_df["type"] == "train"]
imdb_df_test  = imdb_df[imdb_df["type"] == "test"]

# Sample 2000 rows from each of the train and test dataframes.
imdb_df_subset_train = imdb_df_train.sample(n=2000, random_state=0)
imdb_df_subset_test  = imdb_df_test.sample(n=2000, random_state=0)

# Vectorizer
movie_vec = CountVectorizer(max_features=100, stop_words='english', binary = True)
movie_vec.fit(imdb_df_subset_train['review'])

# Create X and y
X_train = movie_vec.transform(imdb_df_subset_train['review']).toarray() 
y_train = (imdb_df_subset_train.label.values == "pos")*2-1

X_test = movie_vec.transform(imdb_df_subset_test['review']).toarray() 
y_test = (imdb_df_subset_test.label.values == "pos")*2-1

### 5(b) Optimization without a gradient: pondering
rubric={reasoning:10}

The code should be slower when you don't provide the gradient. How much slower do you expect it to be, in terms of the size of the data set? Does theory align with experiment here?

### 5(c) Gradient checkers
rubric={reasoning:20}

One big issue with implementing the gradient by hand is that we often make mistakes. In Lab 1 you were given the following code, which calls the "gradient checker" from `scipy`:

In [40]:
def check_grad_and_print(fun, fun_grad, dims=5):
    x0 = np.random.rand(dims)
    diff = spo.check_grad(fun, fun_grad, np.array(x0))
    if diff < 1e-5:
        print('Success (probably)')
    else:
        print('Gradient incorrect (probably)')

Under what circumstances might the gradient checker not work as intended, or give misleading results? Find a "false positive" and a "false negative", i.e. a case in which the gradient is correct but your gradient checker thinks the gradient is wrong, and a case in which the gradient is wrong but your gradient checker test passes.
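
As a baseline before hunting for failure cases, it can help to run the underlying `scipy.optimize.check_grad` call on a function whose gradient is known to be correct. In this sketch, `quad` and `quad_grad` are hypothetical examples and not part of the lab.

```python
import numpy as np
import scipy.optimize as spo

def quad(x):
    # Simple smooth test function: sum of squares.
    return np.sum(x**2)

def quad_grad(x):
    # Exact gradient of quad.
    return 2 * x

x0 = np.random.rand(5)

# check_grad returns the norm of (analytic gradient - finite-difference estimate).
diff = spo.check_grad(quad, quad_grad, x0)
print(diff)
```

The printed difference is small but not zero, which is why the checker compares against a tolerance rather than testing for equality.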

### (optional) 5(d): implementing a gradient checker
rubric={accuracy:1,quality:1}

Earlier, you used the [scipy gradient checker](https://docs.scipy.org/doc/scipy-0.17.0/reference/generated/scipy.optimize.check_grad.html). Implement your own gradient checker, that has the same interface as the scipy one (i.e. takes arguments `func`, `grad`, `x0` and returns the norm of the difference.) So long as you **cite your sources**, you may refer to, or even take code from, the lecture notes and/or the [scipy gradient checker source code](https://github.com/scipy/scipy/blob/v0.17.0/scipy/optimize/optimize.py#L625-L671) and/or the [scipy finite differences source code](https://github.com/scipy/scipy/blob/v0.18.1/scipy/optimize/optimize.py#L633-L688).


## Exercise 6: AutoGrad

### 6(a) Running AutoGrad
rubric={accuracy:5,quality:5}

Install [AutoGrad](https://github.com/HIPS/autograd) using `pip install autograd`. Then, make a plot like the `tanh` example on the AutoGrad README, but with a different (differentiable) function of your choosing. To make things easier, I suggest a scalar function, so that we're dealing with derivatives rather than gradients (though AutoGrad can handle gradients too). 

### (optional) 6(b): AutoGrad for logistic regression
rubric={reasoning:1}

In lab 1, you fit parameters for logistic regression using a gradient function that was derived by hand. Here, copy over the loss function `loss_lr` and use AutoGrad to handle the gradient. Compare the result to `loss_lr_grad` from lab 1. Does it work, i.e. does it give you the same gradient? You can generate some random training data for this experiment.

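Note: in real situations, you would need to worry about the numerical issues discussed in Exercise 4, since autograd differentiates the loss as written and the loss isn't implemented in a stable way. You can solve the problem by using autograd's `logsumexp` (`autograd.scipy.misc.logsumexp`, or `autograd.scipy.special.logsumexp` in more recent versions, depending on what your installation provides). However, you shouldn't need that here.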