# DSCI 572 Lab 1
Meta-comment: in the first half of this course, we break away from the typical MDS approach of using and interpreting data science software and start digging into how such software is implemented. If this isn't to your taste, please keep in mind that it's a small part of the program. Also, I find it hard to imagine that the concepts learned here will turn out to be completely useless.

In [5]:
import numpy as np
import pandas as pd

import scipy.optimize as spo

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer

### (optional) Exercise 0: gradients of mathematical functions
rubric={raw:7}

Compute the gradient of each of the following mathematical functions. The notation for the gradient of a function $f$ is $\nabla f(x)$. Note that $x$ may be a vector, but $f$ returns a scalar in all the cases below. The gradient has the same number of elements as $x$, so it is a vector whenever $x$ is.

In some cases, the dimensionality of $x$ is provided: for example

> $x \in \mathbb{R}^3$

means that $x$ is a vector with 3 elements. In other cases, you should be able to infer the dimension from the context (for example, for $f_2$ we can infer that $x \in \mathbb{R}^2$ since otherwise the matrix multiplication wouldn't make sense). Finally, in some cases (like $f_6$) the dimension is unknown but you should be able to give an answer that holds regardless of the dimension of $x$. 

Hint: for $\nabla f_5(x)$ you can write $x^\top A x$ as a sum of a few terms. Take the derivative of each term, then put the result back into vector notation at the end.

1. $f_1(x) = \sin(x)$ where $x\in \mathbb{R}$
2. $f_2(x) = [0\;\; 1]x$
3. $f_3(x) = \exp(x_1 + x_2x_3)$ where $x \in \mathbb{R}^3$
4. $f_4(x) = \exp(x_1 + x_2x_3)$ where $x \in \mathbb{R}^4$
5. $f_5(x) = x^\top A x$ where $A=\left[ \begin{array}{cc}1 & 2 \\0 & -3 \end{array} \right]$
6. $f_6(x) = x^\top x$
7. $f_7(x) = x_1^2\sin(x_2)$ where $x \in \mathbb{R}^2$
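
As a warm-up on the notation, here is a worked gradient for a hypothetical function $g$ that is not one of the exercises above. Each component of the gradient is the partial derivative with respect to the matching component of $x$:

$$
g(x) = x_1 x_2, \quad x \in \mathbb{R}^2
\qquad\Longrightarrow\qquad
\nabla g(x) = \begin{bmatrix} \partial g / \partial x_1 \\ \partial g / \partial x_2 \end{bmatrix}
= \begin{bmatrix} x_2 \\ x_1 \end{bmatrix}
$$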

### Exercise 1: gradients of Python functions

Write a Python function that computes the gradient of each of the following Python functions. 

Notes: 

- All of the functions we deal with here return a scalar, regardless of the size of the input vector `x`. We will not consider the case where the output itself is a vector, since it is not relevant for our context of loss function minimization. 
- You do not need to consider the case where `x.ndim` is 2 (or higher); assume `x.ndim` is always 1.
- You can use the [SciPy gradient checker, `scipy.optimize.check_grad`](http://docs.scipy.org/doc/scipy-0.17.0/reference/generated/scipy.optimize.check_grad.html), to check your results for a few values of the inputs, as shown in the example below.

In [6]:
# EXAMPLE
def example(x):
    return np.sum(x**2)

def example_grad(x):
    return 2.0*x

def check_grad_and_print(fun, fun_grad, dims=5):
    # Compare fun_grad with a finite-difference estimate at a random point
    x0 = np.random.rand(dims)
    diff = spo.check_grad(fun, fun_grad, x0)
    if diff < 1e-5:
        print('Success (probably)')
    else:
        print('Gradient incorrect (probably)')
        
check_grad_and_print(example, example_grad)

Success (probably)
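
For intuition about what the checker reports: `check_grad` returns the 2-norm of the difference between your gradient and a finite-difference approximation of it. The sketch below imitates that idea by hand. The helper `finite_diff_grad`, the step size `h`, and the test function `cube_sum` are all made up for illustration and are not SciPy's actual implementation.

```python
import numpy as np

def finite_diff_grad(f, x, h=1e-7):
    """Forward-difference estimate of the gradient of f at x (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    fx = f(x)
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h                      # perturb one coordinate at a time
        grad[i] = (f(x + step) - fx) / h
    return grad

def cube_sum(x):
    # illustrative function, not one of the exercises
    return np.sum(x**3)

def cube_sum_grad(x):
    return 3.0 * x**2

x0 = np.array([0.5, -1.0, 2.0])
diff = np.linalg.norm(cube_sum_grad(x0) - finite_diff_grad(cube_sum, x0))
print(diff)  # small if the analytic gradient is right
```

Because the finite-difference estimate is itself approximate, the difference is small but not exactly zero, which is why the messages above say "probably".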


### 1(a)
rubric={accuracy:5}

In [None]:
def foo(x):
    return np.sum(x)

def foo_grad(x):
    pass # TODO

check_grad_and_print(foo, foo_grad)

### 1(b)
rubric={accuracy:5}

In [None]:
def pin(x):
    return np.sin(x[1])

def pin_grad(x):
    pass # TODO

check_grad_and_print(pin, pin_grad)

### 1(c)
rubric={accuracy:5}

In [None]:
w = np.random.rand(100)

def zap(x):
    return w@x # @ is matrix multiplication (like %*% in R); for two 1-D arrays it is the dot product

def zap_grad(x):
    pass # TODO

check_grad_and_print(zap, zap_grad, dims=100)

### (optional) 1(d)
rubric={accuracy:1}

In [None]:
def baz(x):
    result = 0
    for i in range(len(x)):
        result = result + x[i]**i
    return result

def baz_grad(x):
    pass # TODO

check_grad_and_print(baz, baz_grad)

### (optional) 1(e)
rubric={accuracy:1}

In [None]:
################
### OPTIONAL ###
################

def bar(x):
    if np.abs(x[1]) > 2:
        return 0
    else:
        return -(x[0]*x[0]+1)*np.cos(x[1]-1)

def bar_grad(x):
    pass # TODO
    
check_grad_and_print(bar, bar_grad)

### Exercise 2: gradient descent
rubric={accuracy:10,quality:10}

Write a function `gradient_descent` that takes in a function `f`, its gradient `f_grad`, and a starting point `x0`, and returns an (approximate) local minimizer of `f`.

The code below tests your function by comparing the results to [`scipy.optimize.minimize`](http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html#scipy.optimize.minimize).
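
For reference, the textbook gradient descent iteration is written here with the parameter names from the skeleton below. Reading `α` as a constant step size is an assumption; `ϵ` as a tolerance on the gradient norm follows from the `while` condition.

$$
x^{(k+1)} = x^{(k)} - \alpha \, \nabla f\left(x^{(k)}\right),
\qquad \text{stop when } \left\| \nabla f\left(x^{(k)}\right) \right\| \le \epsilon
$$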

In [None]:
def gradient_descent(f, f_grad, x0, α=0.001, ϵ=0.01):
    """DOCSTRING GOES HERE"""
    x = x0
    
    n = 0
    while np.linalg.norm(f_grad(x)) > ϵ:        
        # YOUR CODE GOES HERE (hint: it's not a lot of code!)
        
        n += 1
    return x

In [None]:
# testing, using a function from Exercise 1b
print(gradient_descent(pin, pin_grad, np.zeros(2)))
print(spo.minimize(pin, np.zeros(2), jac=pin_grad).x)

### Exercise 3: logistic regression 
rubric={accuracy:20,quality:20,reasoning:10}

The code below loads the IMDB dataset used in DSCI 571, with positive reviews encoded as $y=+1$ and negative reviews encoded as $y=-1$. You'll need the file `imdb_master.csv` in the current directory, or you can modify the path in `read_csv` below to where you have the file stored. As a reminder, **please do not commit/push this file**. I have attempted to seed your repos with a `.gitignore` file to this effect, but I'm not sure if it will work... let's see!

In [None]:
imdb_df = pd.read_csv('imdb_master.csv', encoding = "ISO-8859-1")

# Only keep the reviews with pos and neg labels
imdb_df = imdb_df[imdb_df['label'].str.startswith(('pos','neg'))]

# Train/test split
imdb_df_train = imdb_df[imdb_df["type"] == "train"]
imdb_df_test  = imdb_df[imdb_df["type"] == "test"]

# Sample 5000 rows from each of the train and test sets
imdb_df_subset_train = imdb_df_train.sample(n=5000, random_state=0)
imdb_df_subset_test  = imdb_df_test.sample(n=5000, random_state=0)

# Vectorizer
movie_vec = CountVectorizer(max_features=200, 
                            stop_words='english', 
                            binary = True)
movie_vec.fit(imdb_df_subset_train['review'])

# Create X and y
X_train = movie_vec.transform(imdb_df_subset_train['review']).toarray() 
y_train = (imdb_df_subset_train.label.values == "pos")*2-1

X_test = movie_vec.transform(imdb_df_subset_test['review']).toarray() 
y_test = (imdb_df_subset_test.label.values == "pos")*2-1
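
The `(... == "pos")*2-1` expression above is what turns the string labels into $\pm 1$: the comparison gives a boolean array, multiplying by 2 maps it to 2/0, and subtracting 1 gives $+1$/$-1$. A tiny sketch with made-up labels:

```python
import numpy as np

labels = np.array(["pos", "neg", "neg", "pos"])  # made-up labels, not from the dataset

is_pos = (labels == "pos")   # boolean array: True where the label is "pos"
y = is_pos * 2 - 1           # True -> 1, False -> -1
print(y)
```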

In this exercise, you'll implement a logistic regression classifier "from scratch". You'll implement the `fit` function using [`scipy.optimize.minimize`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html), which does something similar to your `gradient_descent` function above (but fancier). You should pass in the gradient using the `jac` argument as shown earlier in the lab.

**Your tasks:**

1. Complete the `fit` function.
2. Report your training and test error. 
3. Report a confusion matrix for both train and test.
4. Find one false positive case and one false negative case in the test set. Print out the corresponding reviews. Does the model trip up on reviews that seem more neutral? Or can you explain the results in some other way?
5. (optional) Can you get the code to work using your `gradient_descent` function instead of `scipy.optimize.minimize`? This may be a pain because your `gradient_descent` probably uses a constant learning rate, and it's hard to pick a good one. More sophisticated methods pick the learning rate adaptively at every iteration. I haven't tried it yet on this data set but I'm happy to help you if you're interested.
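
For tasks 2 to 4, it may help to see the bookkeeping on a toy example first. The sketch below uses made-up labels and predictions in the same $+1$/$-1$ encoding, and assumes "positive" means $+1$. The `confusion_matrix` function imported at the top of the lab produces the full table; this only shows the error rate and where the two kinds of mistakes are.

```python
import numpy as np

# Made-up labels and predictions in the lab's +1/-1 encoding
y_true = np.array([ 1, -1,  1, -1, -1,  1])
y_pred = np.array([ 1,  1, -1, -1,  1,  1])

# Error rate: fraction of examples where the prediction differs from the label
error = np.mean(y_pred != y_true)

# Indices of false positives (predicted +1, truly -1) and false negatives (the reverse)
fp_idx = np.where((y_pred == 1) & (y_true == -1))[0]
fn_idx = np.where((y_pred == -1) & (y_true == 1))[0]

print(error, fp_idx, fn_idx)
```

Indices like these can then be used to look up the corresponding rows of the original dataframe.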

Notes: 

- There's no intercept term, just for simplicity. Adding one is easy, in a way that will be explained shortly in DSCI 573.
- I suggest initializing your weights to all zeros.
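
In math notation, `loss_lr` and `loss_lr_grad` in the cell below compute the following. Here $x_i$ denotes row $i$ of `X` written as a column vector and $n$ is the number of rows; this notation is introduced only for illustration.

$$
f(w) = \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i w^\top x_i\right)\right),
\qquad
\nabla f(w) = -\sum_{i=1}^{n} \frac{y_i}{1 + \exp\left(y_i w^\top x_i\right)} \, x_i
$$

The gradient follows from the chain rule, using $\frac{\exp(-z)}{1+\exp(-z)} = \frac{1}{1+\exp(z)}$ with $z = y_i w^\top x_i$.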

In [None]:
def loss_lr(w, X, y):
    return np.sum(np.log(1 + np.exp(-y*(X@w))))

def loss_lr_grad(w, X, y):
    return -X.T @ (y/(1+np.exp(y*(X@w))))

class MyLogisticRegression:

    def predict(self, X):
        return np.sign(X@self.w)

    # Fits the regression coefficients for a logistic regression model given the data X, y
    def fit(self, X, y):
        # YOUR CODE HERE (hint: it's not a lot of code!)
        pass

In [None]:
# check gradient of loss
grad_checker = spo.check_grad(lambda w: loss_lr(w, X_train, y_train), 
                              lambda w: loss_lr_grad(w, X_train, y_train), 
                              np.zeros(X_train.shape[1]))
print(grad_checker, "<-- should be small")

### Exercise 4: scikit-learn logistic regression
rubric={reasoning:15}

The scikit-learn implementation of logistic regression, which we looked at in DSCI 571, can be found [here](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). Compare your implementation to the sklearn one, both in terms of speed and accuracy, on the same problem as above. For a fair comparison, use the following hyperparameters with scikit-learn's `LogisticRegression`:

- `C=1e8` to (mostly) disable regularization for sklearn, since your implementation doesn't use regularization. 
- `fit_intercept=False` since your code above doesn't fit an intercept term.
- `solver="lbfgs"` since that is the default as of scikit-learn v0.22, and it is a close relative of what `minimize` uses by default in your code (BFGS, when no bounds or constraints are given).
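
For the speed comparison, one simple option is to time a single call with the standard library. The helper `time_call` below is hypothetical, and the workload is a stand-in; in the lab you would pass a model's `fit` method and the training data instead.

```python
import time

def time_call(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) once; return (result, elapsed_seconds). Hypothetical helper."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload, just to show the call pattern
result, seconds = time_call(sum, range(100_000))
print(f"took {seconds:.4f} s")
```

A single run can be noisy, so repeating the measurement a few times (or using the `%timeit` magic in a notebook) gives a fairer comparison.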