# Advanced Logistic Regression

## Logistic Regression using Scikit-Learn

Here we shall learn how to perform modeling using scikit-learn.
```python
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
```

Instantiate a logistic regression model as:
```python
log_sci_model = LogisticRegression()
```

Train the model with 'Fare':
```python
log_sci_model = log_sci_model.fit(train_data['Fare'], train_data['Survived'])
```
Measure the performance of the trained model over the training set:
```python
log_sci_model.score(train_data['Fare'], train_data['Survived'])
0.66554433221099885
```

<br/>

## Exercise:

Train the model with all possible features.

- Assign all the list of features to the variable, features.
- Train using scikit learn logistic regression module.
- Get the prediction on the training set and print out the score.

In [15]:
# Here is the distplot used to generate Age plot. Modify features variable for fare.
import pandas as pd
import numpy as np
import seaborn as sns
import math
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
import time


train_data = pd.read_csv("https://raw.githubusercontent.com/agconti/kaggle-titanic/master/data/train.csv")
test_data = pd.read_csv("https://raw.githubusercontent.com/agconti/kaggle-titanic/master/data/test.csv")

girl_child_est = train_data[train_data['Name'].str.contains('Miss. ') & (train_data['Parch'] == 1)].Age.mean()
boy_child_est = train_data[train_data['Name'].str.contains('Master. ') & (train_data['Parch'] == 1)].Age.mean()
woman_adult_est = train_data[train_data['Name'].str.contains('Miss. ') & (train_data['Parch'] == 0)].Age.mean()
man_adult_est = train_data[train_data['Name'].str.contains('Master. ') & (train_data['Parch'] == 1)].Age.mean()
woman_married_est = train_data[train_data['Name'].str.contains('Mrs. ')].Age.mean()
man_married_est = train_data[train_data['Name'].str.contains('Mr. ')].Age.mean()

def impute_age(row):
    if math.isnan(row[5]):
        if ((('Miss. ') in row[3]) and (row[7] == 1)):
            return girl_child_est
        elif ((('Master. ') in row[3]) and (row[7] == 1)):
            return boy_child_est
        elif ((('Miss. ') in row[3]) and (row[7] == 0)):
            return woman_adult_est
        elif (('Mrs. ') in row[3]):
            return woman_married_est
        else:
            return man_married_est
    else:
        return row[5]

train_data['Imputed_Age'] = train_data.apply(impute_age, axis=1)
test_data['Imputed_Age'] = test_data.apply(impute_age, axis=1)

try:
    train_embarked = pd.get_dummies(train_data['Embarked'])
    train_sex = pd.get_dummies(train_data['Sex'])
    train_data = train_data.join([train_embarked, train_sex])

    test_embarked = pd.get_dummies(test_data['Embarked']) 
    test_sex = pd.get_dummies(test_data['Sex'])
    test_data = test_data.join([test_embarked, test_sex])
except:
    print("The columns have categorical variables")

#modify the below code
features = ['Pclass']



0.7968574635241302


### Solution

```python

features = ['Pclass', 'Imputed_Age', 'SibSp', 'Parch', 'Fare', 'C', 'Q', 'female']

log_sci_model = LogisticRegression()
log_sci_model = log_sci_model.fit(train_data[features], train_data['Survived'])
log_score = log_sci_model.score(train_data[features], train_data['Survived'])
print(log_score)
```


## LogisticRegression Hyperparameters

SKlearn's LogisticRegression class takes several parameters to tune the LogisticRegression module.


Here are some of the parameters:
<li>dual : Dual or primal formulation. The dual formulation is only implemented for l2 penalty with liblinear solver. 

<li>Prefer dual=False when n_samples > n_features.

<li>max_iter : Maximum number of iterations taken to converge.

<li>solver : Used to specify the optimization algorithm.

<li>penalty : Used to specify the norm used in the penalization (regularization).


We need to use GridSearchCV to find the best hyperparameters.

## Exercise

First let us try to find the best option between dual and max_iter.

hit run to execute.

In [39]:
from sklearn.model_selection import GridSearchCV

dual=[True,False]
max_iter=[1, 10, 20, 40, 80, 100,110,120,130,140]

param_grid = dict(dual=dual,max_iter=max_iter)


log_sci_model = LogisticRegression()
grid = GridSearchCV(estimator=log_sci_model, param_grid=param_grid, cv = 3, n_jobs=-1)

X = train_data[features]
y = train_data['Survived']

start_time = time.time()
grid_result = grid.fit(X, y)
# Summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
print("Execution time: " + str((time.time() - start_time)) + ' ms')

Best: 0.786756 using {'dual': False, 'max_iter': 10}
Execution time: 0.3925168514251709 ms


### Solution

Just press Run to execute

## Penalty Hyperparameter


<li>penalty : Used to specify the norm used in the penalization (regularization).</li>


Penalty can take values 'l1', 'l2' or 'elasticnet'


## Exercise

Now let us try to find the best option for penalty with values 'l1', 'l2'.

hit run to execute.

In [65]:
penalty = ['l1', 'l2' ]
param_grid = dict(penalty=penalty)


log_sci_model = LogisticRegression()
grid = GridSearchCV(estimator=log_sci_model, param_grid=param_grid, cv = 3, n_jobs=-1)

X = train_data[features]
y = train_data['Survived']

start_time = time.time()
grid_result = grid.fit(X, y)
# Summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
print("Execution time: " + str((time.time() - start_time)) + ' ms')

Best: 0.784512 using {'penalty': 'l2'}
Execution time: 0.21001815795898438 ms


### Solution

Just press Run to execute

## Penalty as Elasticnet

Elasticnet works only with 'saga' solver. We will look at solvers later.

## Exercise

Now let us try to find out if this works with penalty values 'elasticnet'.

hit run to execute.

In [62]:
C = np.linspace(0.1, 1.0, num=5)
param_grid = dict(max_iter=max_iter,C=C)

log_sci_model = LogisticRegression(penalty='elasticnet',solver='saga')
grid = GridSearchCV(estimator=log_sci_model, param_grid=param_grid, cv = 5, n_jobs=-1)

X = train_data[features]
y = train_data['Survived']

start_time = time.time()
grid_result = grid.fit(X, y)
# Summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
print("Execution time: " + str((time.time() - start_time)) + ' ms')















Best: 0.690236 using {'C': 0.1, 'max_iter': 140}
Execution time: 2.6450631618499756 ms




### Solution

Just press Run to execute

## Optimization Algorithms

Recall from the Linear regression section of this course, that one of the main tasks of the algorithm is to find the loss function and find the co-efficients to minimize the loss. This is done by optimization function.

In Logistic regression using sklearn, these are called solvers and some of the optimization algorithms used are described below:

### Newton Conjugate Gradient 

Newton-cg mthod is a constrained optimization using Hessian Matrix to find the minima, maxima, or saddle points. Newton's method is applied to the derivative f′ of a twice-differentiable function f to find the roots of the derivative, also known as the stationary points of f. One of these will be the optima. The Newton's method converges much faster towards a local maximum or minimum than gradient descent. 

It’s computationally expensive algorithm and hence not suitable for large feature space.


### Limited-memory Broyden–Fletcher–Goldfarb–Shanno Algorithm (L-BFGS)

Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm is an iterative method for solving unconstrained nonlinear optimization problems.

The BFGS method belongs to quasi-Newton methods, a class of hill-climbing optimization techniques that seek a stationary point of a (preferably twice continuously differentiable) function. For such problems, a necessary condition for optimality is that the gradient be zero. Newton's method and the BFGS methods are not guaranteed to converge unless the function has a quadratic Taylor expansion near an optimum. However, BFGS can have acceptable performance even for non-smooth optimization instances. The Hessian matrix of second derivatives is not computed. Instead, the Hessian matrix is approximated using updates specified by gradient evaluations.

Based on the BFGS(Broyden–Fletcher–Goldfarb–Shanno Algorithm), L-BFGS uses an estimation to the inverse Hessian matrix to steer its search through variable space, but where BFGS stores a dense  n*n approximation to the inverse Hessian (n being the number of variables in the problem), L-BFGS stores only a few vectors that represent the approximation implicitly. Due to its resulting linear memory requirement, the L-BFGS method is particularly well suited for optimization problems with a large number of variables. Instead of the inverse Hessian Hk, L-BFGS maintains a history of the past m updates of the position x and gradient ∇f(x), where generally the history size m can be small (m<10). These updates are used to implicitly do operations requiring the Hk-vector product.

This uses less memory, and hence when dataset is small (small number of samples), L-BFGS relatively performs better than others.

### Liblinear - Library for Large Linear Classification

It’s a linear classification that supports logistic regression.

The solver uses a coordinate descent (CD) algorithm that solves optimization problems by successively performing approximate minimization along coordinate directions or coordinate hyperplanes.

LIBLINEAR provides Automatic parameter selection - L1 Regularization and it’s recommended when you have high dimension dataset - with lot os features.

It cannot learn a true multinomial (multiclass) model; instead, the optimization problem is decomposed in a “one-vs-rest” fashion so separate binary classifiers are trained for all classes.

### SAG - Stochastic Average Gradient:

SAG method optimizes the sum of a finite number of smooth convex functions. Like stochastic gradient (SG) methods, the SAG method's iteration cost is independent of the number of terms in the sum. However, by incorporating a memory of previous gradient values the SAG method achieves a faster convergence rate than black-box SG methods.

For large datasets, the number of samples and the number of features are large, SAG performs better.

### SAGA:

The SAGA solver is a variant of SAG that also supports the non-smooth penalty=l1 option (i.e. L1 Regularization). This is therefore the solver of choice for sparse multinomial logistic regression and it’s also suitable very Large dataset.

Like SAG, for large datasets SAGA is preferred.


## Exercise

Let us now try different options for solver for the same dataset


In [None]:
solver = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga' ]
C = np.linspace(0.1, 1.0, num=5)

param_grid = dict(max_iter=max_iter, solver=solver, C=C)

l1_ratio = np.linspace(0, 1.0, num=20)
log_sci_model = LogisticRegression(penalty='l2')
grid = GridSearchCV(estimator=log_sci_model, param_grid=param_grid, cv = 3, n_jobs=-1)

X = train_data[features]
y = train_data['Survived']

start_time = time.time()
grid_result = grid.fit(X, y)
# Summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
print("Execution time: " + str((time.time() - start_time)) + ' ms')

### Solution

Just press Run to execute