# DSCI 572 Lab 1


In [None]:
import numpy as np
import pandas as pd

from scipy.optimize import minimize, check_grad

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer

### (optional) Exercise 0: gradients of mathematical functions
rubric={raw:7}

Compute the gradient of each of the following mathematical functions. The notation for the gradient of a function $f$ is $\nabla f(x)$. Note that $x$ may be a vector, but $f$ returns a scalar in all the cases below. The gradient is therefore a vector with the same number of elements as $x$.

In some cases, the dimensionality of $x$ is provided: for example

> $x \in \mathbb{R}^3$

means that $x$ is a vector with 3 elements. In other cases, you should be able to infer the dimension from the context (for example, for $f_2$ we can infer that $x \in \mathbb{R}^2$ since otherwise the matrix multiplication wouldn't make sense). Finally, in some cases (like $f_6$) the dimension is unknown but you should be able to give an answer that holds regardless of the dimension of $x$. 
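
To make the "scalar in, vector out" idea concrete, here is a minimal sketch that compares a hand-derived gradient with a central-difference approximation. The function `g`, its gradient `g_grad`, and the helper `numerical_grad` are hypothetical names chosen for illustration; `g` is deliberately not one of the exercise functions.

```python
import numpy as np

def g(x):
    # Hypothetical example function (not one of the exercises):
    # takes x in R^3 and returns a scalar
    return x[0] * x[1] + x[2] ** 2

def g_grad(x):
    # Hand-derived gradient: one partial derivative per element of x
    return np.array([x[1], x[0], 2.0 * x[2]])

def numerical_grad(f, x, h=1e-6):
    # Central-difference approximation of each partial derivative
    grad = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        step = np.zeros_like(x, dtype=float)
        step[i] = h
        grad[i] = (f(x + step) - f(x - step)) / (2 * h)
    return grad

x = np.array([1.0, 2.0, 3.0])
analytic = g_grad(x)
numeric = numerical_grad(g, x)

print(analytic)                                    # same length as x
print(np.allclose(analytic, numeric, atol=1e-6))   # the two should agree
```

The same kind of check works for any of your pencil-and-paper answers, as long as you pick a concrete dimension for $x$.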

Hint: for $\nabla f_5(x)$ you can write $x^\top A x$ as a sum of a few terms. Try taking the derivative, and then putting it back into vector notation at the end.

1. $f_1(x) = \sin(x)$ where $x\in \mathbb{R}$
2. $f_2(x) = [0\;\; 1]x$
3. $f_3(x) = \exp(x_1 + x_2x_3)$ where $x \in \mathbb{R}^3$
4. $f_4(x) = \exp(x_1 + x_2x_3)$ where $x \in \mathbb{R}^4$
5. $f_5(x) = x^\top A x$ where $A=\left[ \begin{array}{cc}1 & 2 \\0 & -3 \end{array} \right]$
6. $f_6(x) = x^\top x$
7. $f_7(x) = x_1^2\sin(x_2)$ where $x \in \mathbb{R}^2$

### Exercise 1: gradients of Python functions

Write a Python function that computes the gradient of each of the following Python functions. 

Notes: 

- All of the functions we deal with here return a scalar, regardless of the size of the input vector `x`. We will not consider the case where the output itself is a vector, since it is not relevant for our context of loss function minimization. 
- You do not need to consider the case where `x.ndim` is 2 (or higher); assume `x.ndim` is always 1.
- You do not need to write docstrings for your functions.
- You can use the [SciPy gradient checker](http://docs.scipy.org/doc/scipy-0.17.0/reference/generated/scipy.optimize.check_grad.html), `check_grad`, to check your results for a few values of the inputs, as shown in the example below. 

In [None]:
# EXAMPLE
def example(x):
    return np.sum(x**2)

def example_grad(x):
    return 2.0*x

def check_grad_and_print(fun, fun_grad, dims=5):
    x0 = np.random.rand(dims)
    fg = fun_grad(x0)
    if not isinstance(fg, np.ndarray):
        print("Gradient should be a numpy array")
        return
    if len(fg) != dims:
        print("Gradient is the wrong size")
        return
    
    diff = check_grad(fun, fun_grad, np.array(x0))
    if diff < 1e-5:
        print('Success (probably)')
    else:
        print('Gradient incorrect (probably)')
        
check_grad_and_print(example, example_grad)
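
To get a feel for the number that `check_grad` returns, the sketch below compares a correct gradient with a deliberately wrong one. The names `quad`, `quad_grad`, and `quad_grad_wrong` are hypothetical and only used here; `check_grad` returns the 2-norm of the difference between the supplied gradient and a finite-difference approximation at the given point.

```python
import numpy as np
from scipy.optimize import check_grad

def quad(x):
    return np.sum(x**2)

def quad_grad(x):
    # Correct gradient
    return 2.0 * x

def quad_grad_wrong(x):
    # Deliberately wrong: the factor of 2 is missing
    return x

x0 = np.ones(5)
err_good = check_grad(quad, quad_grad, x0)
err_bad = check_grad(quad, quad_grad_wrong, x0)

print(err_good)  # tiny: only finite-difference error remains
print(err_bad)   # roughly the norm of (2*x0 - x0), i.e. about sqrt(5)
```

A small value is evidence, not proof, that the gradient is right, which is why the helper above prints "(probably)".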

### 1(a)
rubric={accuracy:5}

In [None]:
def foo(x):
    return np.sum(x)

def foo_grad(x):
    

check_grad_and_print(foo, foo_grad)

### 1(b)
rubric={accuracy:5}

In [None]:
def pin(x):
    return np.sin(x[1])

def pin_grad(x):
    

check_grad_and_print(pin, pin_grad)

### 1(c)
rubric={accuracy:5}

In [None]:
a = np.random.rand(100)

def zap(x):
    return a@x # this is matrix multiplication; equivalent to %*% in R.

def zap_grad(x):
    

check_grad_and_print(zap, zap_grad, dims=100)

### (optional) 1(d)
rubric={accuracy:1}

In [None]:
def baz(x):
    result = 0
    for i in range(len(x)):
        result = result + x[i]**i
    return result

def baz_grad(x):
    

check_grad_and_print(baz, baz_grad)

### (optional) 1(e)
rubric={accuracy:1}

In [None]:
def bar(x):
    if np.abs(x[1]) > 2:
        return 0
    else:
        return -(x[0]*x[0]+1)*np.cos(x[1]-1)

def bar_grad(x):
    
    
check_grad_and_print(bar, bar_grad)

### Exercise 2: gradient descent
rubric={accuracy:10}

Complete the `gradient_descent` function below.

In [None]:
def gradient_descent(f, f_grad, x0, α=0.001, ϵ=0.01, verbose=False):
    """
    Minimizes the function f given the function itself, 
    its gradient, and a starting point.
    
    Parameters
    ----------
    f : function
        the objective
    f_grad : function
        the gradient
    x0 : numpy.ndarray
        the starting point
    α : float (optional, default=0.001)
        the learning rate
    ϵ : float (optional, default=0.01)
        the tolerance for termination
    verbose : bool (optional, default=False)
        whether or not to print out extra output
    
    Returns
    -------
    numpy.ndarray
        the minimizer of f
        
    Example
    -------
    >>> gradient_descent(lambda x: x**2, lambda x: 2*x, 1.0, ϵ=1e-6)
    4.990720528955806e-07
    """
    x = x0
    
    n = 0
    while np.linalg.norm(f_grad(x)) > ϵ:
        # YOUR CODE HERE
        
        
        if verbose and n % 100 == 0:
            print("----")
            print("Iter =", n)
            print("||∇f|| =", np.linalg.norm(f_grad(x)))
            print("f =", f(x))
        
        n += 1
    return x

The code below tests your function by comparing the results to [`scipy.optimize.minimize`](http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html#scipy.optimize.minimize). You should get similar results.

In [None]:
print(gradient_descent(pin, pin_grad, np.zeros(2), ϵ=1e-6))
print(minimize(pin, np.zeros(2), jac=pin_grad).x)

### Exercise 3: logistic regression 

The code below loads [the IMDB dataset](https://www.kaggle.com/utathya/imdb-review-dataset) from DSCI 571, with positive reviews encoded as $y=+1$ and negative reviews encoded as $y=-1$. You'll need the file `imdb_master.csv` in the current directory, or you can modify the path in `read_csv` below to where you have the file stored. As a reminder, **please do not commit/push this file**. I have attempted to seed your repos with a `.gitignore` file to this effect.

In [None]:
imdb_df = pd.read_csv('imdb_master.csv', encoding = "ISO-8859-1")

In [None]:
def transform_df(df, split, sample_size=5000,
                 labels=('pos', 'neg')):
    """
    Wrangles the imdb_master data set into a test or train
    split and takes a sample of the requested size. Returns the
    "review" column and an array with the numerical
    representation of its class labels.
    
    Parameters
    ----------
    df : pandas.core.frame.DataFrame
        the full IMDB data set
    split : str
        the split desired, either "train" or "test"
    sample_size : int (default : 5000)
        the sample size from the split
    labels : tuple, optional
        the classification labels (default : ('pos','neg'))
    
    Returns
    -------
    X : pandas.core.series.Series
        the sampled reviews (the "review" column), with the original index
    y: numpy.ndarray
        the class labels 
    """
    
    # get only the positive and negative reviews
    df = df[df["label"].str.startswith(labels)]
    
    # grab either the train or the test
    df = df[df["type"] == split] 
    
    # subset the rows
    df_subset = df.sample(n=sample_size, random_state=0)
    
    # X is the reviews
    X = df_subset["review"]
    
    # y is the labels, as either -1 or +1
    y = (df_subset["label"].values == labels[0])*2-1
    return X , y

In [None]:
X_train_reviews, y_train = transform_df(imdb_df, "train")
X_test_reviews, y_test = transform_df(imdb_df, "test")

countvec = CountVectorizer(max_features=200, stop_words='english', binary = True)

X_train = countvec.fit_transform(X_train_reviews).toarray()
X_test = countvec.transform(X_test_reviews).toarray()

# Note: dataset is small enough that we don't need to bother with a sparse matrix, hence toarray()

In this exercise, you'll implement a logistic regression classifier "from scratch". You'll implement the `fit` function using [`scipy.optimize.minimize`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html), which does something similar to your `gradient_descent` function above (but fancier).
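
For reference, here is a sketch of the loss that the provided `loss_lr` function computes and of where `loss_lr_grad` comes from. The notation is an assumption made for this note: $x_i$ is the $i$-th row of $X$ written as a column vector, $n$ is the number of rows, and $z_i = y_i w^\top x_i$.

$$f(w) = \sum_{i=1}^n \log\left(1 + \exp(-y_i w^\top x_i)\right)$$

Applying the chain rule to one term gives

$$\nabla_w \log\left(1 + \exp(-z_i)\right) = \frac{\exp(-z_i)}{1 + \exp(-z_i)}\,(-y_i x_i) = -\frac{y_i x_i}{1 + \exp(z_i)},$$

and summing over $i$ gives

$$\nabla f(w) = -\sum_{i=1}^n \frac{y_i x_i}{1 + \exp(y_i w^\top x_i)} = -X^\top r, \qquad r_i = \frac{y_i}{1 + \exp(y_i w^\top x_i)},$$

which matches the vectorized expression in `loss_lr_grad`.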

#### 3(a)
rubric={accuracy:10,quality:10}

Complete the `fit` function, using `scipy.optimize.minimize` to solve the optimization problem. 

Notes: 

- There's no intercept term, to keep things simple. Adding one is easy, in a way that will be explained shortly in DSCI 573.
- I suggest initializing your weights to all zeros.
- You should pass in the gradient to `minimize` using the `jac` argument.
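
As a reminder of the `minimize` calling convention, here is a minimal sketch on an unrelated toy problem (least squares, not logistic regression). The functions `sq_loss` and `sq_loss_grad` and the data `A`, `b` are hypothetical; the `jac` and `args` parameters and the `.x` attribute of the result are part of the SciPy API.

```python
import numpy as np
from scipy.optimize import minimize

def sq_loss(w, A, b):
    # Sum of squared residuals
    r = A @ w - b
    return np.sum(r**2)

def sq_loss_grad(w, A, b):
    # Gradient of the sum of squared residuals
    return 2.0 * A.T @ (A @ w - b)

# Hypothetical toy data
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))
b = A @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

# Extra arguments after w are supplied via `args`; the gradient via `jac`
result = minimize(sq_loss, np.zeros(3), jac=sq_loss_grad, args=(A, b))

# Closed-form least-squares solution, for comparison
w_ls = np.linalg.lstsq(A, b, rcond=None)[0]

print(result.x)
print(w_ls)
```

Wrapping the loss in a `lambda w: ...` that closes over the data, as the gradient check further down does, is an equally reasonable alternative to `args`.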

In [None]:
def loss_lr(w, X, y):
    return np.sum(np.log(1 + np.exp(-y*(X@w))))

def loss_lr_grad(w, X, y):
    return -X.T @ (y/(1+np.exp(y*(X@w))))

class MyLogisticRegression:

    def predict(self, X):
        return np.sign(X@self.w)

    # Fits the regression coefficients for a logistic regression model given the data X, y
    def fit(self, X, y):
        # YOUR CODE AND DOCSTRING HERE
        
        

In [None]:
# check gradient of loss
grad_checker = check_grad(lambda w: loss_lr(w, X_train, y_train), 
                              lambda w: loss_lr_grad(w, X_train, y_train), 
                              np.zeros(X_train.shape[1]))
print(grad_checker, "<-- should be small")
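
One practical caveat about `loss_lr` as written: `np.exp` can overflow when `-y*(X@w)` is large, which may happen if the optimizer tries large weights. The sketch below shows one possible way around this using `np.logaddexp`. The name `loss_lr_stable` and the toy arrays are hypothetical, and you are not required to use this in the lab.

```python
import numpy as np

def loss_lr_stable(w, X, y):
    # log(1 + exp(-z)) equals logaddexp(0, -z), evaluated without overflow
    z = y * (X @ w)
    return np.sum(np.logaddexp(0, -z))

# Hypothetical toy data
X_toy = np.array([[1.0, 2.0], [3.0, -1.0]])
y_toy = np.array([1, -1])

# For moderate weights the two formulas agree
w_small = np.array([0.1, -0.2])
naive_small = np.sum(np.log(1 + np.exp(-y_toy * (X_toy @ w_small))))
stable_small = loss_lr_stable(w_small, X_toy, y_toy)
print(naive_small, stable_small)

# For extreme weights the naive formula would overflow to inf,
# while the stable version stays finite
w_large = np.array([-1000.0, 0.0])
stable_large = loss_lr_stable(w_large, X_toy, y_toy)
print(stable_large)
```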

#### 3(b)

rubric={reasoning:2}

Report your training and test error. 

#### 3(c)
rubric={reasoning:2}

Report a confusion matrix for both train and test.
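
If the mechanics are unfamiliar, here is a small sketch on made-up labels showing how an error rate and a confusion matrix can be computed when the classes are encoded as $-1$ and $+1$. The arrays `y_true` and `y_pred` are hypothetical stand-ins for your real labels and predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, 1, 1])

# Error rate: fraction of examples where the prediction is wrong
error = np.mean(y_pred != y_true)

# Rows are true classes, columns are predicted classes, in the order of `labels`
cm = confusion_matrix(y_true, y_pred, labels=[-1, 1])
tn, fp, fn, tp = cm.ravel()

print(error)
print(cm)
```

Passing `labels=[-1, 1]` explicitly fixes the row and column order, so the top-right entry counts false positives and the bottom-left entry counts false negatives.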

#### 3(d)
rubric={reasoning:5}

Find one false positive case and one false negative case in the test set. Print out the corresponding reviews. Discuss your results.

#### (optional) 3(e)
rubric={reasoning:1}

To explore this further, pick a false positive and false negative example. Then, for each one, see what words (features) are present, and then print out the weights from the model corresponding to those words. Does your false positive example indeed contain many "positive" words, and vice versa for your false negative example? Discuss.

#### 3(f)
rubric={accuracy:5,reasoning:10}

Adapt your code to use your `gradient_descent` function instead of `scipy.optimize.minimize`. Try using $\alpha=10^{-4}$ and $\epsilon=1$. Do you get similar results to what you got with `scipy.optimize.minimize`?

In [None]:
class MyLogisticRegressionGD:
    
    def __init__(self, verbose=False):
        self.verbose=verbose # whether to print stuff out during gradient descent

    def predict(self, X):
        return np.sign(X@self.w)

    # Fits the regression coefficients for a logistic regression model given the data X, y
    def fit(self, X, y):
        

### Exercise 4: scikit-learn logistic regression
rubric={reasoning:15}

The scikit-learn implementation of logistic regression, which we looked at in DSCI 571, can be found [here](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). Compare your implementation to the sklearn one, both in terms of speed and accuracy, on the same problem as above. For a fair comparison, use the following hyperparameters with scikit-learn's `LogisticRegression`:

- `C=1e8` to (mostly) disable regularization for sklearn, since your implementation doesn't use regularization. 
- `fit_intercept=False` since your code above doesn't fit an intercept term.
- `solver="lbfgs"` just in case your scikit-learn version is older than 0.22, where the default solver was different. (With no bounds or constraints, `minimize` defaults to BFGS, a close relative of [L-BFGS](https://en.wikipedia.org/wiki/Limited-memory_BFGS).)
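
As a starting point for the speed comparison, here is a minimal timing sketch with these hyperparameters. It runs on synthetic data (`X_syn`, `y_syn` are hypothetical) so that it is self-contained; in the lab you would fit on `X_train` and `y_train` instead, and in a notebook `%timeit` is a reasonable alternative to `time.perf_counter`.

```python
import time
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data with noisy labels in {-1, +1},
# so the classes are not perfectly separable
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(500, 10))
w_syn = rng.normal(size=10)
y_syn = np.sign(X_syn @ w_syn + rng.normal(scale=2.0, size=500)).astype(int)

model = LogisticRegression(C=1e8, fit_intercept=False, solver="lbfgs")

# Time a single call to fit
start = time.perf_counter()
model.fit(X_syn, y_syn)
elapsed = time.perf_counter() - start

train_error = np.mean(model.predict(X_syn) != y_syn)
print("fit time (s):", elapsed)
print("training error:", train_error)
```

A single timing is noisy, so repeating the fit a few times and comparing typical values gives a fairer picture.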