# Logistic regression

## Model
We use linear regression as the baseline, but we map its output (which ranges from $-\infty$ to $+\infty$) to the interval (0, 1) and interpret the result as the probability that the given data point belongs to a class.

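We call the function which does the mapping a "link function".
Logistic regression has a linear score with a logistic link function (also called the sigmoid; it is the cumulative distribution function of the logistic distribution). In generalized-linear-model terminology the link is strictly the logit, and the sigmoid is its inverse.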

$$
P(y_i = +1 | \mathbf{x}_i,\mathbf{w}) = \frac{1}{1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))}
$$
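
As a minimal sketch of the formula above, assuming NumPy and that the feature vectors $h(\mathbf{x}_i)$ are stored as the rows of a matrix. The names `sigmoid_probability`, `H`, and `w` and the toy numbers are illustrative only:

```python
import numpy as np

def sigmoid_probability(H, w):
    """Return P(y_i = +1 | x_i, w) for each row h(x_i) of H."""
    scores = H @ w                          # linear score w^T h(x_i)
    return 1.0 / (1.0 + np.exp(-scores))    # sigmoid maps scores into (0, 1)

# Toy inputs, made up for illustration: the scores are 0, 2 and -2.
H = np.array([[1.0, 0.0], [1.0, 2.0], [1.0, -2.0]])
w = np.array([0.0, 1.0])
probs = sigmoid_probability(H, w)
print(probs)
```

A score of zero maps to a probability of exactly 0.5, and scores of opposite sign map to probabilities that sum to 1.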

Logistic regression is a type of linear classifier, since the score $\mathbf{w}^T h(\mathbf{x}_i)$ is a weighted sum of the features.
For linear classifiers, the decision boundary is a line/plane/hyperplane.
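
To see why the boundary is linear, consider the common rule of predicting $+1$ when the probability is at least $\frac{1}{2}$ (the threshold is an assumption here; these notes do not fix one):

$$
P(y_i = +1 | \mathbf{x}_i,\mathbf{w}) \ge \frac{1}{2}
\iff \exp(-\mathbf{w}^T h(\mathbf{x}_i)) \le 1
\iff \mathbf{w}^T h(\mathbf{x}_i) \ge 0
$$

The boundary $\mathbf{w}^T h(\mathbf{x}) = 0$ is therefore a hyperplane in the feature space $h(\mathbf{x})$.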

Logistic regression can be seen as a special case of the generalized linear model and is thus analogous to linear regression.


## Learning

We choose the coefficients to maximize the likelihood function. For the negative data points we want to maximize the probability that the output $y$ is $-1$, and for the positive data points we want to maximize the probability that the output $y$ is $+1$.
The probabilities of the individual data points are combined by multiplication, which treats the data points as independent:
$$\ell(\mathbf{w}) = \prod_{i=1}^N P(y_i | \mathbf{x}_i, \mathbf{w})$$

To simplify the math, we use the log-likelihood, that is, the natural logarithm of the expression above:

$$\ell\ell(\mathbf{w}) = \sum_{i=1}^N \Big( (\mathbf{1}[y_i = +1] - 1)\mathbf{w}^T h(\mathbf{x}_i) - \ln\left(1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))\right) \Big) $$
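
The sum follows from treating the two label values separately. Writing $s_i = \mathbf{w}^T h(\mathbf{x}_i)$ (a shorthand introduced only for this derivation):

$$
\ln P(y_i = +1 | \mathbf{x}_i, \mathbf{w}) = -\ln\left(1 + \exp(-s_i)\right)
$$

$$
\ln P(y_i = -1 | \mathbf{x}_i, \mathbf{w}) = \ln\frac{\exp(-s_i)}{1 + \exp(-s_i)} = -s_i - \ln\left(1 + \exp(-s_i)\right)
$$

Both cases are covered by $(\mathbf{1}[y_i = +1] - 1)s_i - \ln\left(1 + \exp(-s_i)\right)$, because the factor $(\mathbf{1}[y_i = +1] - 1)$ is $0$ for positive points and $-1$ for negative ones.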

To find the maximum, we use gradient ascent, which is based on calculating the derivative (there is no closed-form solution).

The derivative of the log-likelihood with respect to the coefficient $w_j$ is:
$$
\frac{\partial\ell\ell}{\partial w_j} = \sum_{i=1}^N h_j(\mathbf{x}_i)\left(\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w})\right)
$$
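
One way to gain confidence in this formula is to compare it with a numerical derivative. The sketch below assumes NumPy, labels in $\{+1, -1\}$, and made-up toy data; the function names `log_likelihood` and `gradient` are illustrative:

```python
import numpy as np

def log_likelihood(H, y, w):
    """Log-likelihood for labels y in {+1, -1} and feature rows H."""
    scores = H @ w
    indicator = (y == +1).astype(float)
    return np.sum((indicator - 1.0) * scores - np.log1p(np.exp(-scores)))

def gradient(H, y, w):
    """Vector of partial derivatives, one entry per coefficient w_j."""
    indicator = (y == +1).astype(float)
    probs = 1.0 / (1.0 + np.exp(-(H @ w)))
    return H.T @ (indicator - probs)   # sum_i h_j(x_i) * (1[y_i = +1] - P_i)

# Toy data, made up for illustration only.
H = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0], [1.0, -0.3]])
y = np.array([+1, -1, +1, -1])
w = np.array([0.1, -0.2])

# Central finite differences as an independent check of the formula.
eps = 1e-6
numeric = np.array([
    (log_likelihood(H, y, w + eps * e) - log_likelihood(H, y, w - eps * e)) / (2 * eps)
    for e in np.eye(len(w))
])
analytic = gradient(H, y, w)
print(analytic, numeric)
```

The two vectors should agree to several decimal places; a larger gap usually points to a sign or indexing mistake in the gradient code.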

The gradient ascent update, where $\eta$ denotes the step size, is:
$$
\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} + \eta \nabla \ell\ell(\mathbf{w}^{(t)})
$$


### Choosing the step size
* If it is too small: slow convergence.
* If it is too big: divergence or oscillations.

To choose, try several exponentially spaced values.
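
A sketch of such a sweep, assuming NumPy, labels in $\{+1, -1\}$, and a small made-up data set. The grid `np.logspace(-4, 0, num=5)`, the 100-iteration budget, and the helper names are arbitrary choices for illustration:

```python
import numpy as np

def log_likelihood(H, y, w):
    """Log-likelihood for labels y in {+1, -1} and feature rows H."""
    scores = H @ w
    indicator = (y == +1).astype(float)
    return np.sum((indicator - 1.0) * scores - np.log1p(np.exp(-scores)))

def gradient_ascent(H, y, step_size, max_iter=100):
    """Plain gradient ascent on the log-likelihood, starting from zeros."""
    w = np.zeros(H.shape[1])
    indicator = (y == +1).astype(float)
    for _ in range(max_iter):
        probs = 1.0 / (1.0 + np.exp(-(H @ w)))
        w = w + step_size * (H.T @ (indicator - probs))
    return w

# Toy, non-separable data made up for illustration (column 0 is the intercept).
H = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0],
              [1.0, 2.0], [1.0, 0.5], [1.0, -0.5]])
y = np.array([-1, -1, +1, +1, -1, +1])

step_sizes = np.logspace(-4, 0, num=5)   # 1e-4, 1e-3, 1e-2, 1e-1, 1
final_ll = [log_likelihood(H, y, gradient_ascent(H, y, s)) for s in step_sizes]
for s, ll in zip(step_sizes, final_ll):
    print('step_size = %g: log likelihood after 100 iterations = %.6f' % (s, ll))
```

Comparing the log likelihood reached within a fixed iteration budget shows which step sizes make slow progress and which overshoot; the grid can then be refined around the best value.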


## Regularization
As usual, regularization helps us avoid overfitting.
Overfitting symptoms for logistic regression are:
* large coefficient values
* overconfident estimates: probability values very close to 1 or 0, leaving almost no zone of uncertainty
* test set accuracy worse than training set accuracy
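
The first two symptoms can be reproduced on linearly separable data, where unregularized gradient ascent keeps increasing the coefficient the longer it runs. This is a sketch under stated assumptions: NumPy, labels in $\{+1, -1\}$, one made-up feature without an intercept, and an illustrative helper `fit`:

```python
import numpy as np

# Linearly separable toy data (made up): one feature, no intercept.
H = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([-1, -1, +1, +1])
indicator = (y == +1).astype(float)

def fit(num_iter, step_size=0.1):
    """Unregularized gradient ascent, starting from w = 0."""
    w = np.zeros(1)
    for _ in range(num_iter):
        probs = 1.0 / (1.0 + np.exp(-(H @ w)))
        w = w + step_size * (H.T @ (indicator - probs))
    return w

w_short = fit(100)
w_long = fit(10000)
probs_long = 1.0 / (1.0 + np.exp(-(H @ w_long)))
print('coefficient after 100 iterations:  ', w_short)
print('coefficient after 10000 iterations:', w_long)
print('predicted probabilities:', probs_long)
```

The coefficient never settles, and the predicted probabilities are pushed toward 0 and 1 even for the points closest to the boundary.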

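### L2 Regularization

With L2 regularization we maximize the log-likelihood minus a penalty on the squared coefficients (excluding the intercept $w_0$), where $\lambda \ge 0$ controls the strength of the penalty:
$$
\ell\ell(\mathbf{w}) - \lambda \sum_{k \ge 1} w_k^2
$$

For $j \ge 1$, the per-coefficient derivative of this objective is:
$$
\frac{\partial}{\partial w_j}\left(\ell\ell(\mathbf{w}) - \lambda \sum_{k \ge 1} w_k^2\right) = \sum_{i=1}^N h_j(\mathbf{x}_i)\left(\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w})\right) \color{red}{-2\lambda w_j }
$$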

We do not regularize the intercept term, so for $w_0$ the penalty term drops out and we have:
$$
\frac{\partial\ell\ell}{\partial w_0} = \sum_{i=1}^N h_0(\mathbf{x}_i)\left(\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w})\right)
$$
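
The two derivatives above can be combined into one gradient function. The following is a sketch, not a reference implementation: it assumes NumPy, labels in $\{+1, -1\}$, and that column 0 of the feature matrix is the constant intercept feature. The names `l2_gradient`, `fit_l2`, and `l2_penalty` (standing for $\lambda$) and the toy data are illustrative:

```python
import numpy as np

def l2_gradient(H, y, w, l2_penalty):
    """Gradient of the log-likelihood minus l2_penalty * sum_{j>=1} w_j^2."""
    indicator = (y == +1).astype(float)
    probs = 1.0 / (1.0 + np.exp(-(H @ w)))
    grad = H.T @ (indicator - probs)
    penalty = 2.0 * l2_penalty * w
    penalty[0] = 0.0                  # column 0 is the intercept: not regularized
    return grad - penalty

def fit_l2(H, y, l2_penalty, step_size=0.05, max_iter=2000):
    """Gradient ascent on the regularized objective, starting from zeros."""
    w = np.zeros(H.shape[1])
    for _ in range(max_iter):
        w = w + step_size * l2_gradient(H, y, w, l2_penalty)
    return w

# Separable toy data, made up for illustration (column 0 is the intercept).
H = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([-1, -1, +1, +1])

w_unreg = fit_l2(H, y, l2_penalty=0.0)
w_reg = fit_l2(H, y, l2_penalty=1.0)
print('without penalty:', w_unreg)
print('with penalty:   ', w_reg)
```

On this data the penalized slope stays much smaller than the unpenalized one, which is the effect regularization is meant to have.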



### L1 Regularization
%TODO

## Implementing logistic regression without regularization from scratch

In [16]:
import numpy as np

def calcProbability(feature_matrix, coefficients):
    '''
    Produces a probabilistic estimate for P(y_i = +1 | x_i, w).
    The estimate ranges between 0 and 1.
    '''
    # Linear score w^T h(x_i) for every data point (row) at once
    score = np.dot(feature_matrix, coefficients)
    # Apply the sigmoid element-wise
    return 1. / (1. + np.exp(-score))

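def calcDerivative(errors, feature):
    '''
    Computes the partial derivative of the log likelihood with respect to
    a single coefficient, given the errors (indicator - predictions) and
    the feature column associated with that coefficient.
    '''
    # Dot product of errors and feature: sum_i h_j(x_i) * error_i
    return np.dot(errors, feature)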

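def calcLogLikelihood(features, labels, coefficients):
    '''
    Computes the log likelihood of the observed labels (+1 or -1).
    '''
    # Convert to an array so that the comparison below is element-wise
    indicator = (np.asarray(labels)==+1)
    scores = np.dot(features, coefficients)
    logexp = np.log(1. + np.exp(-scores))

    # Simple check to prevent overflow: for very negative scores
    # ln(1 + exp(-s)) is approximately -s
    mask = np.isinf(logexp)
    logexp[mask] = -scores[mask]

    lp = np.sum((indicator-1)*scores - logexp)
    return lp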

def logisticRegression(features, labels, initial_coefficients, step_size, max_iter):
    '''
    features: a matrix of data points (one row per data point)
    labels: an array containing the true labels (+1 or -1) associated with the data points
    '''
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    coefficients = np.array(initial_coefficients, dtype=float) # make sure it's a float numpy array
    for itr in range(max_iter):

        # Predict P(y_i = +1|x_i,w) using the calcProbability() function
        predictions = calcProbability(features, coefficients)

        # Compute indicator value for (y_i = +1)
        indicator = (labels==+1)

        # Compute the errors as indicator - predictions
        errors = indicator - predictions

        for j in range(len(coefficients)): # loop over each coefficient

            # features[:,j] is the feature column associated with coefficients[j].
            # Compute the derivative for coefficients[j]
            derivative = calcDerivative(errors, features[:,j])

            # go up the hill
            coefficients[j] = coefficients[j] + step_size * derivative

        # Checking whether log likelihood is increasing
        if itr <= 15 or (itr <= 100 and itr % 10 == 0) or (itr <= 1000 and itr % 100 == 0) \
        or (itr <= 10000 and itr % 1000 == 0) or itr % 10000 == 0:
            lp = calcLogLikelihood(features, labels, coefficients)
            print('iteration %*d: log likelihood of observed labels = %.8f' % \
                (int(np.ceil(np.log10(max_iter))), itr, lp))
    return coefficients

## Running logistic regression on example data

In [19]:
#coefficients = logistic_regression(feature_matrix, sentiment, initial_coefficients=np.zeros(194),
 #                                  step_size=1e-7, max_iter=301)

# TODO use realistic data
# Toy placeholder data: 3 data points with 4 features each, labels in {+1, -1}
features = np.array([[1, 2, 3, 4], [5, 6, 6, 8], [1, 2, 3, 4]])
labels = np.array([+1, -1, +1])
# The coefficient vector needs one entry per feature column
coefficients = logisticRegression(features, labels, initial_coefficients=np.zeros(features.shape[1]),
                                   step_size=1e-7, max_iter=301)

## Implementing logistic regression with L2 regularization


In [None]:
%TODO