
# Logistic Regression


## About

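Logistic regression is a statistical learning technique. It belongs to the supervised machine learning (ML) methods and is used for classification tasks: it estimates the probability that an observation belongs to a class and assigns the label from that probability.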
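
As a rough illustration of how such a probability is produced, the sketch below applies the logistic (sigmoid) function to a linear score. The weights, bias, and feature values are made up for the example and are not taken from the model fitted later in this notebook.

```python
import math

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights, bias, and feature values, for illustration only.
weights = [0.8, -0.5]
bias = -1.0
features = [2.0, 1.0]

# Linear score: bias + w1*x1 + w2*x2
score = bias + sum(w * x for w, x in zip(weights, features))

# The sigmoid turns the score into a probability of the positive class.
probability = sigmoid(score)

# A 0.5 cut-off converts the probability into a class label.
predicted_class = 1 if probability >= 0.5 else 0
print(score, probability, predicted_class)
```

A score of zero maps to a probability of exactly 0.5, so the sign of the linear score decides the predicted class under a 0.5 cut-off.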

In [None]:
from IPython.display import Image
Image('/Users/avimanur0/logistic_regression.png')

## Implementation

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('ggplot')
from sklearn import metrics
#plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay
%matplotlib inline

### Import data

In [None]:
#read the data
df = pd.read_csv('conversion_data.csv')

In [None]:
df.columns

In [None]:
#sneak peek into the data
df.head()

In [None]:
#more details about df
df.info()

In [None]:
#target class frequency
df.converted.value_counts()

In [None]:
#share of converted users (positive class) in the full dataset
10200/(306000+10200)

### Preparing Data For Modeling

In [None]:
#dummy data for categorical variables
df = pd.get_dummies(df, columns=['country','source'])

In [None]:
df.head()

In [None]:
df.info()

In [None]:
input_columns = [column for column in df.columns if column != 'converted']
output_column = 'converted'
print (input_columns)
print (output_column)

In [None]:
#input data
X = df.loc[:,input_columns].values
#output data 
y = df.loc[:,output_column]
#shape of input and output dataset
print (X.shape, y.shape)

### Modeling : Logistic Regression

In [None]:
#import model specific libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [None]:
#Split the data into training and test data (70/30 ratio)
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3, random_state=100, stratify=y)


In [None]:
#validate the shape of train and test dataset
print (X_train.shape)
print (y_train.shape)

print (X_test.shape)
print (y_test.shape)

In [None]:
y_train.value_counts()

In [None]:
y_test.value_counts()

In [None]:
#check on number of positive classes in train and test data set
print(np.sum(y_train))
print(np.sum(y_test))

### Model Training

In [None]:
#fit the logistic regression model on the training dataset
logreg = LogisticRegression().fit(X_train,y_train)

In [None]:
y_train_pred = logreg.predict(X_train)
y_test_pred = logreg.predict(X_test)

### Model Evaluation

**Classification accuracy:** percentage of correct predictions
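
Accuracy can look deceptively high on an imbalanced target. The sketch below assumes the class counts implied by the conversion-rate cell earlier in the notebook (306000 non-converted and 10200 converted users) and computes the accuracy of a trivial model that always predicts "not converted".

```python
# Class counts assumed from the earlier conversion-rate calculation.
n_negative = 306000  # users who did not convert
n_positive = 10200   # users who converted

# Share of the positive class in the data.
conversion_rate = n_positive / (n_negative + n_positive)

# A model that always predicts 0 is right for every negative row.
baseline_accuracy = n_negative / (n_negative + n_positive)

print(f"conversion rate:   {conversion_rate:.4f}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
```

Any accuracy reported for the fitted model is best read against this baseline rather than against 50%.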

In [None]:
print(metrics.accuracy_score(y_train, y_train_pred))
print(metrics.accuracy_score(y_test, y_test_pred))

In [None]:
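#ROC AUC is computed from the predicted probability of the positive class, not from hard labels
print(metrics.roc_auc_score(y_train, logreg.predict_proba(X_train)[:, 1]))
print(metrics.roc_auc_score(y_test, logreg.predict_proba(X_test)[:, 1]))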

**Confusion Matrix Basics**

- **True Positives (TP):** we *correctly* predicted that the user *did* convert
- **True Negatives (TN):** we *correctly* predicted that the user *did not* convert
- **False Positives (FP):** we *incorrectly* predicted that the user *did* convert (a "Type I error")
- **False Negatives (FN):** we *incorrectly* predicted that the user *did not* convert (a "Type II error")
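
The four counts can be tallied by hand, which also makes the matrix layout explicit. The sketch below uses made-up labels and arranges the matrix with actual classes as rows and predicted classes as columns, the same layout that `metrics.confusion_matrix` returns.

```python
# Made-up actual and predicted labels, for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Rows index the actual class, columns the predicted class.
matrix = [[0, 0], [0, 0]]
for actual, predicted in zip(y_true, y_pred):
    matrix[actual][predicted] += 1

# Row 0 holds the actual negatives, row 1 the actual positives.
TN, FP = matrix[0]
FN, TP = matrix[1]

print(matrix)
print("TP:", TP, "TN:", TN, "FP:", FP, "FN:", FN)
```

With this layout, `[1, 1]` is TP, `[0, 0]` is TN, `[0, 1]` is FP, and `[1, 0]` is FN.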

In [None]:
train_confusion = metrics.confusion_matrix(y_train, y_train_pred)
train_TP = train_confusion[1, 1]
train_TN = train_confusion[0, 0]
train_FP = train_confusion[0, 1]
train_FN = train_confusion[1, 0]

In [None]:
test_confusion = metrics.confusion_matrix(y_test, y_test_pred)
test_TP = test_confusion[1, 1]
test_TN = test_confusion[0, 0]
test_FP = test_confusion[0, 1]
test_FN = test_confusion[1, 0]

Target: whether the user converted or not

0 - user did not convert

1 - user converted

Actuals (training set):

0 - 214200

1 - 7140

Correct predictions (training set):

0 - 213355

1 - 4910

In [None]:
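#actual 0s in the training set: correctly predicted 0s + 0s predicted as 1
213355+845

In [None]:
#actual 1s in the training set: 1s predicted as 0 + correctly predicted 1s
2230+4910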

In [None]:
print(train_confusion)

In [None]:
print(test_confusion)

**True and False Positive Rates**

- **True Positive Rate (tpr):** When the actual value is positive, how often is the prediction correct?

                           tpr = TP / (TP + FN)
   

- **False Positive Rate (fpr):** When the actual value is negative, how often is the prediction incorrect?

                           fpr = FP / (FP + TN)
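
As a worked example, the sketch below plugs in the training-set counts that appear in the arithmetic cells earlier in this section. It assumes those sums read as TN + FP = 213355 + 845 and FN + TP = 2230 + 4910.

```python
# Training-set counts, assumed from the sums shown earlier in this section.
TP = 4910    # converted users predicted as converted
FN = 2230    # converted users predicted as not converted
TN = 213355  # non-converted users predicted as not converted
FP = 845     # non-converted users predicted as converted

# True positive rate: share of actual positives that were caught.
tpr = TP / (TP + FN)

# False positive rate: share of actual negatives flagged by mistake.
fpr = FP / (FP + TN)

print(f"tpr = {tpr:.4f}")
print(f"fpr = {fpr:.4f}")
```

The parentheses matter: without them, `TP / FN + TP` divides first and then adds `TP`, which is not a rate.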

In [None]:
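#pass predicted probabilities so the curve has one point per threshold, not a single 0/1 cut
train_fpr, train_tpr, train_thresholds = metrics.roc_curve(y_train, logreg.predict_proba(X_train)[:, 1])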

In [None]:
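#pass predicted probabilities so the curve has one point per threshold, not a single 0/1 cut
test_fpr, test_tpr, test_thresholds = metrics.roc_curve(y_test, logreg.predict_proba(X_test)[:, 1])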
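
An ROC curve is traced by sweeping the decision threshold over continuous scores and recording one (fpr, tpr) pair per threshold. The pure-Python sketch below shows the idea on made-up labels and scores; it is a simplified illustration, not a reimplementation of `metrics.roc_curve`.

```python
# Made-up labels and predicted probabilities, for illustration only.
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]

def roc_point(threshold):
    """Return (fpr, tpr) when scores >= threshold are predicted positive."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return fp / negatives, tp / positives

# One ROC point per candidate threshold, from strictest to most lenient,
# starting from the origin (nothing predicted positive).
thresholds = sorted(set(scores), reverse=True)
roc_points = [(0.0, 0.0)] + [roc_point(th) for th in thresholds]

for point in roc_points:
    print(point)
```

Hard 0/1 predictions correspond to a single threshold, so they yield only one interior point instead of the full curve.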

In [None]:
plt.title('Receiver Operating Characteristic (train)')
#label each line so that plt.legend has entries to show
plt.plot(train_fpr, train_tpr, 'b', label='Train ROC')
plt.plot([0, 1], [0, 1],'r--', label='Chance')
plt.legend(loc = 'lower right')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

In [None]:
plt.title('Receiver Operating Characteristic (test)')
#label each line so that plt.legend has entries to show
plt.plot(test_fpr, test_tpr, 'b', label='Test ROC')
plt.plot([0, 1], [0, 1],'r--', label='Chance')
plt.legend(loc = 'lower right')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()