# Logistic Regression

## Task 1
Recall the data processing routines from the last lab course. The following exercises build on the extracted feature representations, but instead of using the prebuilt classifier, we implement logistic regression by hand. To this end, make sure that the variables `train`, `test`, `train_data_features` and `test_data_features` are loaded in your IPython shell.

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import re
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

#%% download nltk stopwords (only needed once)
# import nltk
# nltk.download('stopwords')

# load stopwords
stops = set(stopwords.words('english'))

# function for preprocessing the data
def review_prepro(data, remove_stopwords=False):
    # remove HTML tags
    review_text = BeautifulSoup(data, 'lxml').get_text()
    # replace every character that is not a letter (digits included) with a space
    letters_only = re.sub( '[^a-zA-Z]',
                          ' ',
                          review_text )
    # make all characters lower case and split the documents into single words
    words = letters_only.lower().split()
    
    if remove_stopwords:
        # remove stop words
        meaningful_words = [ w for w in words if w not in stops ]
        # return concatenated single string
        return ' '.join(meaningful_words)
    else:
        # or don't and concatenate to single string
        return ' '.join(words)

# load data as pandas dataframe
train = pd.read_csv('labeledTrainData.tsv', 
                    header=0,
                    delimiter="\t", 
                    quoting=3 )

test = pd.read_csv('labeledTestData.tsv', 
                   header=0,
                   delimiter="\t",
                   quoting=3 )


# preprocess train and test data
num_reviews = train['review'].size

clean_train_reviews = []
for i in range(num_reviews):
    # if (i+1) % 1000 == 0:
    #     print('Review {} of {}\n'.format(i+1, num_reviews))
    clean_train_reviews.append( review_prepro(train['review'][i], remove_stopwords=True) )
    
num_test_reviews = test['review'].size

clean_test_reviews = []
for i in range(num_test_reviews):
    # if (i+1) % 1000 == 0:
    #     print('Review {} of {}\n'.format(i+1, num_test_reviews))
    clean_test_reviews.append( review_prepro(test['review'][i], remove_stopwords=True) )
    

#%% create BoW
# Documentation:
# http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
vectorizer = CountVectorizer(analyzer = 'word',
                             tokenizer = None,
                             preprocessor = None,
                             stop_words = list(stops),  # pass a list; newer scikit-learn versions validate this type
                             max_features = 5000)

# fit the vectorizer and return transformed reviews
vectorizer.fit(clean_train_reviews)
train_data_features = vectorizer.transform(clean_train_reviews)
clean_train_reviews=None 
# convert to numpy array
train_data_features = train_data_features.toarray()

# create BoW representation of test data
test_data_features = vectorizer.transform(clean_test_reviews)
clean_test_reviews=None
test_data_features = test_data_features.toarray()
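
The letter filtering and stop-word removal in `review_prepro` can be tried in isolation on a single string. The following is a minimal sketch using only the standard library; `toy_clean` and `toy_stops` are hypothetical names introduced for illustration, the stop-word set is hand-picked rather than NLTK's list, and the HTML stripping step is left out.

```python
import re

# Tiny hand-picked stop-word set, for illustration only (the lab uses NLTK's list).
toy_stops = {"the", "was", "a", "and"}

def toy_clean(text, stops=toy_stops):
    # replace everything that is not a letter with a space
    letters_only = re.sub('[^a-zA-Z]', ' ', text)
    # lower-case and split into single words
    words = letters_only.lower().split()
    # drop stop words and join back into one string
    return ' '.join(w for w in words if w not in stops)

cleaned = toy_clean("The plot was GREAT, 10/10 and a must-see!")
print(cleaned)
```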

a) Write a PYTHON function `logistic_gradient` that expects a training set matrix `X_train`, a ground truth label vector `y_train` and a current weight vector `w` as its input and returns the gradient `g` of the negative log-likelihood function of the logistic regression. Refer to the lecture notes for the mathematical definition.

In [2]:
import numpy as np

def sigmoid(x):
    # https://timvieira.github.io/blog/post/2014/02/11/exp-normalize-trick/
    z = np.exp(-np.abs(x))
    return np.where(x>=0.0,1.0/(1.0+z),z/(1.0+z))

def logistic_gradient(X_train, y_train, w, reg=0.0):
    # Rows are variables, columns are samples
    # reg*w is the gradient of the penalty term (reg/2)*||w||^2
    g = np.dot(X_train, sigmoid(np.dot(X_train.T, w)) - y_train) + reg*w
    return g
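
One way to gain confidence in a hand-written gradient is to compare it against central finite differences of the cost. The sketch below does this on random toy data for the unregularized case, assuming the same layout as above (columns are samples). The helpers `nll` and `numeric_gradient` are hypothetical names, not part of the lab code, and `nll` is written in the form sum(log(1 + exp(a)) - y*a), which equals the negative log-likelihood for labels in {0, 1}.

```python
import numpy as np

def sigmoid(x):
    # numerically stable sigmoid, as in the cell above
    z = np.exp(-np.abs(x))
    return np.where(x >= 0.0, 1.0 / (1.0 + z), z / (1.0 + z))

def nll(X, y, w):
    # negative log-likelihood; columns of X are samples
    a = X.T @ w
    # log(1 + exp(a)) evaluated stably via logaddexp
    return np.sum(np.logaddexp(0.0, a) - y * a)

def logistic_gradient(X, y, w, reg=0.0):
    return X @ (sigmoid(X.T @ w) - y) + reg * w

def numeric_gradient(f, w, eps=1e-6):
    # central finite differences, one coordinate at a time
    g = np.zeros_like(w)
    for k in range(w.size):
        e = np.zeros_like(w)
        e[k] = eps
        g[k] = (f(w + e) - f(w - e)) / (2.0 * eps)
    return g

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 20))                    # 3 variables, 20 samples
y = rng.integers(0, 2, size=20).astype(float)   # random 0/1 labels
w = rng.normal(size=3)

g_analytic = logistic_gradient(X, y, w)
g_numeric = numeric_gradient(lambda v: nll(X, y, v), w)
max_err = np.max(np.abs(g_analytic - g_numeric))
print('largest deviation:', max_err)
```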

b) Write a PYTHON function `logistic_hessian` that expects a training set matrix `X_train` and a current weight vector `w` as its input and returns the Hessian `H` of the negative log-likelihood function of the logistic regression. Refer to the lecture notes for the mathematical definition.

In [3]:
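def logistic_hessian(X_train, w, reg=0.0):
    # predicted probabilities for all samples (columns of X_train)
    S = sigmoid(np.dot(X_train.T, w))
    # diagonal entries of the weight matrix B = diag(S*(1-S))
    diagB = S*(1.0 - S)
    # scale each sample (column) by its weight, equivalent to X_train @ B
    XS = X_train*np.expand_dims(diagB, axis=0)
    H = np.dot(XS, X_train.T) + reg*np.eye(w.shape[0])
    return H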
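
The broadcasting in `logistic_hessian` avoids building the full diagonal matrix. As a rough sanity check on random toy data (an illustration, not part of the lab solution), one can compare it against the explicit product with `np.diag` and confirm that the result is symmetric and has no negative eigenvalues, as expected for a convex cost.

```python
import numpy as np

def sigmoid(x):
    z = np.exp(-np.abs(x))
    return np.where(x >= 0.0, 1.0 / (1.0 + z), z / (1.0 + z))

def logistic_hessian(X, w, reg=0.0):
    s = sigmoid(X.T @ w)
    b = s * (1.0 - s)
    # broadcasting scales column i of X by b[i]
    return (X * b[np.newaxis, :]) @ X.T + reg * np.eye(w.shape[0])

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 30))   # 4 variables, 30 samples
w = rng.normal(size=4)

H = logistic_hessian(X, w)

# explicit (memory-hungry) variant with the full diagonal matrix
s = sigmoid(X.T @ w)
H_explicit = X @ np.diag(s * (1.0 - s)) @ X.T

matches = np.allclose(H, H_explicit)
is_symmetric = np.allclose(H, H.T)
min_eig = np.linalg.eigvalsh(H).min()
print(matches, is_symmetric, min_eig)
```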

c) Write a PYTHON function `find_w` that expects a training set matrix `X_train`, a ground truth label vector `y_train`, a fixed step size `h` and a maximum iteration number `max_it`, and that determines the optimal logistic regression weight vector `w_star` by performing gradient descent, calling `logistic_gradient` in each iteration. Make sure to include the affine offset $w_0$ in your model.

In [4]:
def find_w(X_train, y_train, max_it, reg=0.0, w_init=None):
    # prepend a row of ones so that w[0] acts as the affine offset w_0
    X_offs = np.ones((X_train.shape[0]+1, X_train.shape[1]))
    X_offs[1:, :] = X_train
    w = np.zeros((X_offs.shape[0]))
    if w_init is not None:
        w = w_init
    cost = np.inf
    eps = np.finfo('float32').eps
    for i in range(max_it):
        # Newton step (uses the Hessian instead of a fixed step size h):
        # solve H*delta_w = g in the least-squares sense
        delta_w = np.linalg.lstsq(logistic_hessian(X_offs, w, reg=reg),
                                  logistic_gradient(X_offs, y_train, w, reg=reg))[0]
        w -= np.array(delta_w)
        S = sigmoid(np.dot(w.T, X_offs))
        oldcost = cost
        # regularized negative log-likelihood; probabilities are clipped to avoid log(0)
        cost = -(np.dot(np.log(np.where(S < eps, eps, S)), y_train)
                 + np.dot(np.log(np.where((1.0-S) < eps, eps, 1.0-S)), 1.0-y_train)
                 - reg*np.sum(w**2))
        # stop as soon as the cost no longer decreases
        if cost >= oldcost:
            break
        w_star = w
        print('Iteration:', i+1, ' Current cost:', cost)
    return w_star
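
The task statement describes gradient descent with a fixed step size `h`, whereas the `find_w` cell above takes Newton steps. For comparison, a fixed-step variant could look roughly like the sketch below. The name `find_w_gd`, the toy data and the values `h=0.1` and `max_it=2000` are assumptions made for illustration; on the full bag-of-words features a much smaller step size would likely be needed.

```python
import numpy as np

def sigmoid(x):
    z = np.exp(-np.abs(x))
    return np.where(x >= 0.0, 1.0 / (1.0 + z), z / (1.0 + z))

def logistic_gradient(X, y, w):
    # unregularized gradient; columns of X are samples
    return X @ (sigmoid(X.T @ w) - y)

def find_w_gd(X_train, y_train, h, max_it):
    # prepend a row of ones so that w[0] plays the role of the offset w_0
    X_offs = np.vstack([np.ones((1, X_train.shape[1])), X_train])
    w = np.zeros(X_offs.shape[0])
    for _ in range(max_it):
        # fixed-step gradient descent update
        w = w - h * logistic_gradient(X_offs, y_train, w)
    return w

# toy 1-D data set that is not linearly separable, so a finite optimum exists
X_toy = np.array([[-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]])
y_toy = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

w_gd = find_w_gd(X_toy, y_toy, h=0.1, max_it=2000)

# at a minimum the gradient should (almost) vanish
X_toy_offs = np.vstack([np.ones((1, X_toy.shape[1])), X_toy])
grad_norm = np.linalg.norm(logistic_gradient(X_toy_offs, y_toy, w_gd))
print(w_gd, grad_norm)
```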

d) Write a function `classify_log` that expects a weight vector `w` and a test set matrix `X_test` and classifies the samples in `X_test` via logistic regression, returning a label vector `y_test`. Test your implementation on `train_data_features` and `test_data_features` with one iteration and with 10 iterations. What do you observe?

In [5]:
from sklearn.metrics import roc_auc_score as AUC

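def classify_log(w, X_test):
    # w[0] is the affine offset; columns of X_test are samples
    w0 = w[0]
    # returns the predicted probabilities P(y=1|x); threshold them (e.g. at 0.5) to obtain hard labels
    y_test = sigmoid(w0 + np.dot(X_test.T, w[1:]))
    return y_test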

#Testing implementation
y_train=train['sentiment'].values
y_test=test['sentiment'].values
train=None
test=None
w=find_w(train_data_features.T,y_train,1,reg=0)
y_pred=classify_log(w,test_data_features.T)
auc = AUC( y_test, y_pred )
print('AUC score after 1 iteration:',auc)
w=find_w(train_data_features.T,y_train,10,reg=0)
y_pred=classify_log(w,test_data_features.T)
auc = AUC( y_test, y_pred )
print('AUC score after 10 iteration:',auc)


Iteration: 1  Current cost: 6165.48143183
AUC score after 1 iteration: 0.906356178126
Iteration: 1  Current cost: 6165.48143183
Iteration: 2  Current cost: 3961.1406912
Iteration: 3  Current cost: 2634.09296401
Iteration: 4  Current cost: 1738.31616292
Iteration: 5  Current cost: 1054.12358097
Iteration: 6  Current cost: 514.98272427
Iteration: 7  Current cost: 196.676386247
Iteration: 8  Current cost: 71.7497222914
Iteration: 9  Current cost: 26.299656854
Iteration: 10  Current cost: 9.67265985876
AUC score after 10 iteration: 0.875910041024
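
`classify_log` returns probabilities, which is what the AUC computation needs. If hard labels are required, the scores can be thresholded. The helper `predict_labels` and the cut-off of 0.5 below are assumptions for illustration, shown on made-up scores rather than on the review data.

```python
import numpy as np

def predict_labels(scores, threshold=0.5):
    # map probabilities to 0/1 labels; scores >= threshold count as positive
    return (np.asarray(scores) >= threshold).astype(int)

scores = np.array([0.05, 0.40, 0.50, 0.93])   # made-up probabilities
y_true = np.array([0, 1, 1, 1])               # made-up ground truth

labels = predict_labels(scores)
accuracy = np.mean(labels == y_true)
print(labels, accuracy)
```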


e) Logistic regression is prone to *overfitting*. To prevent this, regularization can be used. Adjust your implementation so that, instead of minimizing $L(\mathbf{w})$, it minimizes
\begin{equation}
L(\mathbf{w})+\alpha\|\mathbf{w}\|^2,
\end{equation}
where $\alpha$ is a non-negative regularization parameter. Test your implementation with $\alpha=1$, once with one iteration and once with 10 iterations.
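
As a short worked step, assuming $\|\cdot\|$ denotes the Euclidean norm and that the offset $w_0$ is penalized together with the other weights, the penalty adds one simple term each to the gradient and the Hessian. Note that the `reg*w` and `reg*np.eye(...)` terms used in `logistic_gradient` and `logistic_hessian` correspond to a penalty of $\frac{\text{reg}}{2}\|\mathbf{w}\|^2$, so they agree with the expressions below up to a factor of 2 in $\alpha$.

```latex
\begin{align}
\nabla_{\mathbf{w}}\left(L(\mathbf{w})+\alpha\|\mathbf{w}\|^2\right)
  &= \nabla_{\mathbf{w}} L(\mathbf{w}) + 2\alpha\,\mathbf{w},\\
\nabla^2_{\mathbf{w}}\left(L(\mathbf{w})+\alpha\|\mathbf{w}\|^2\right)
  &= \nabla^2_{\mathbf{w}} L(\mathbf{w}) + 2\alpha\,\mathbf{I}.
\end{align}
```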

In [6]:
w=find_w(train_data_features.T,y_train,1,reg=1)
y_pred=classify_log(w,test_data_features.T)
auc = AUC( y_test, y_pred )
print('AUC score after 1 iteration:',auc)
w=find_w(train_data_features.T,y_train,10,reg=1)
y_pred=classify_log(w,test_data_features.T)
auc = AUC( y_test, y_pred )
print('AUC score after 10 iteration:',auc)

Iteration: 1  Current cost: 6304.09530912
AUC score after 1 iteration: 0.910314236004
Iteration: 1  Current cost: 6304.09530912
Iteration: 2  Current cost: 4330.85401281
Iteration: 3  Current cost: 3437.94965462
Iteration: 4  Current cost: 3205.4784196
AUC score after 10 iteration: 0.92835827158
