## What is Logistic Regression?
Logistic regression is the appropriate regression analysis to conduct when the dependent variable is binary. It generates a probability, a value between 0 and 1. If we tackle a problem such as spam email detection with linear regression, we have to choose a threshold, so that whenever the predicted value is larger than the threshold we label the email as spam. The problem is that linear regression is unbounded: its predicted values can be arbitrarily large or small, so a few extreme values make the threshold sensitive to them. Logistic regression instead constrains the predicted values strictly between 0 and 1.
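
To make the "between 0 and 1" claim concrete, here is a minimal sketch of the logistic (sigmoid) function, which logistic regression applies to a linear score. The names `sigmoid` and `linear_scores` are illustrative and do not come from the notebook.

```python
import math

def sigmoid(z):
    """Map any real number z to the open interval (0, 1)."""
    # Branch on the sign so math.exp never receives a large positive argument
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# A linear score can be any real number ...
linear_scores = [-10.0, -1.0, 0.0, 1.0, 10.0]
# ... but the sigmoid squashes each one into a probability.
probabilities = [sigmoid(z) for z in linear_scores]

for z, p in zip(linear_scores, probabilities):
    print(f"score {z:6.1f} -> probability {p:.4f}")
```

A score of 0 maps to a probability of 0.5, and scores far from 0 approach 0 or 1 without reaching them.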

In this tutorial, we are going to see how to do logistic regression using scikit-learn.

First, we load our data. We use the Iris dataset that ships with scikit-learn: each row is one flower, the four columns are the features, and the target is the species.

https://www.kaggle.com/enespolat/grid-search-with-logistic-regression

In [1]:
import numpy as np
import pandas as pd
import os
from sklearn.datasets import load_iris
# print(os.listdir("./input"))

Now we load our dataset.
Note that the iris dataset is a dictionary-like object; we set x and y to its data and target.
Then we split the data into train and test sets.

In [2]:
from sklearn.datasets import load_iris
iris = load_iris()  # dictionary-like object; try print(iris.keys())
x = iris.data       # features, x.shape is (150, 4)
y = iris.target     # class labels, y.shape is (150,)

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test=train_test_split(x,y,test_size=0.3)

Each Iris flower data point has four features, as the first five training rows show:

In [3]:
print(x_train[:5])

[[4.6 3.2 1.4 0.2]
 [6.9 3.1 5.4 2.1]
 [5.5 2.6 4.4 1.2]
 [4.6 3.4 1.4 0.3]
 [5.6 2.5 3.9 1.1]]


### Normalization (Scaling)
Next, we fit the scaler on the training dataset and apply the same transform to the test dataset.
After fitting, the scale and offset computed from the training data are stored in the scaler. We reuse them on the test dataset with scaler.transform.

In [4]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x_train = scaler.fit_transform( x_train )
x_test = scaler.transform( x_test )
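
What the scaler stores can be sketched by hand with NumPy. The arrays below are tiny made-up stand-ins for x_train and x_test, not the Iris data, and the variable names are illustrative.

```python
import numpy as np

# Tiny made-up arrays standing in for x_train and x_test
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

# "Fit": learn the per-column offset (mean) and scale (std) from train only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), which StandardScaler also uses

# "Transform": apply the stored training statistics to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0))  # approximately 0 per column
print(train_scaled.std(axis=0))   # approximately 1 per column
print(test_scaled)
```

The test set is never used to compute the mean or standard deviation, so no information from it leaks into training.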

### Cross Validation with Regularization
Every model has its own hyperparameters. These parameters can change the training results significantly, so we tune them via cross validation.

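One method is called grid search cross validation. It exhaustively tries every combination of the candidate values (the grid). With n candidate values for each of two hyperparameters, that gives n squared combinations; in general, the count is the product of the number of candidate values per hyperparameter.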

The grid is the parameter set that lists every hyperparameter we want to tune together with the values we would like to test. In logistic regression, there are two hyperparameters we could tune: the inverse regularization strength C (smaller values mean stronger regularization) and the regularization method, that is, the penalty (L1 or L2).
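
As a rough illustration of what "every combination" means, the sketch below enumerates a grid with the standard library. It uses the same candidate values as this tutorial (seven values of C from 10^-3 to 10^3 and two penalties); the variable names are illustrative.

```python
from itertools import product

# Seven values of C (10^-3 ... 10^3) and two penalties
grid = {
    "C": [10.0 ** k for k in range(-3, 4)],
    "penalty": ["l1", "l2"],
}

# Cartesian product of the value lists, one dict per candidate model
names = list(grid)
combinations = [dict(zip(names, values)) for values in product(*grid.values())]

print(len(combinations))  # 7 * 2 = 14
print(combinations[0])
```

Each of the 14 dictionaries corresponds to one candidate model that the grid search has to train and score.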

Finally, K-fold cross validation is used to estimate the accuracy of each model. The cv argument in GridSearchCV sets the number of folds. An average score across the folds is then computed for each hyperparameter combination, and the best model and its hyperparameters are reported at the end.
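
The fold bookkeeping can be sketched in plain Python. This is a simplified illustration, not what GridSearchCV does internally: for classifiers, an integer cv means stratified folds, and this sketch ignores stratification and shuffling. The helper name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

# 105 training samples (70% of the 150 Iris rows) split into 10 folds
folds = list(k_fold_indices(105, 10))
print([len(val_idx) for _, val_idx in folds])
```

Each sample lands in exactly one validation fold. For every hyperparameter combination, a model is trained on each train_idx, scored on the matching val_idx, and the k scores are averaged.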

source:
1. https://www.youtube.com/watch?v=IXPgm1e0IOo
2. https://towardsdatascience.com/l1-and-l2-regularization-methods-ce25e7fc831c

In [5]:
# Simply remove warnings
import warnings
warnings.filterwarnings("ignore")

# Grid search cross validation
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
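grid={"C":np.logspace(-3,3,7), "penalty":["l1","l2"]}# l1 lasso l2 ridge
# saga supports both the l1 and l2 penalties; the default lbfgs solver does not support l1
logreg=LogisticRegression(solver="saga", max_iter=5000)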
logreg_cv=GridSearchCV(logreg,grid,cv=10)
logreg_cv.fit(x_train,y_train)

print("tuned hyperparameters :(best parameters) ",logreg_cv.best_params_)
print("accuracy :",logreg_cv.best_score_)

tuned hyperparameters :(best parameters)  {'C': 10.0, 'penalty': 'l1'}
accuracy : 0.9619047619047619


### Predictions
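Finally, we fit a model on the training set and score it on the held-out test set. Here the model uses C=1 and the L2 penalty.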

In [6]:
logreg2=LogisticRegression(C=1,penalty="l2")
logreg2.fit(x_train,y_train)
print("score",logreg2.score(x_test,y_test))

score 0.9111111111111111
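
The score above summarizes accuracy; individual predictions come from predict and predict_proba. The following is a self-contained sketch rather than a continuation of the cells above: it reloads the data, and the fixed random_state and stratify arguments are assumptions added so the split is repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
# random_state and stratify are assumptions, not part of the original split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Fit the scaler on the training data only, then reuse it on the test data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = LogisticRegression(C=1)  # L2-regularized by default
model.fit(X_train, y_train)

labels = model.predict(X_test[:5])       # predicted class index per flower
probs = model.predict_proba(X_test[:5])  # one probability per class, rows sum to 1

print(labels)
print(probs.round(3))
```

Each predicted label is the class with the highest probability in the corresponding row of probs.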


### end

### P.S.1

One could manually compute the cv score with the following lines. This example shows that with C=10 and the L1 norm as regularization, the average score of the 10-fold cross validation is 96.17%.

In [9]:
from sklearn.model_selection import cross_val_score
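# saga is used because the default lbfgs solver does not support the l1 penalty
print(cross_val_score(LogisticRegression(C=10,penalty="l1",solver="saga",max_iter=5000), x_train, y_train, scoring='accuracy', cv = 10).mean())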

0.9616666666666667
