# Logistic Regression with Scikit-Learn: Practical ML #2

In this notebook, we will cover logistic regression using Scikit-Learn. The dataset being used is the [Diabetes Dataset](https://www.kaggle.com/kandij/diabetes-dataset), where we will predict whether a person has diabetes based on features such as blood pressure, BMI, and glucose level using **Logistic Regression**.


**Link to Part 1:** [Regression with Scikit-Learn: Practical ML #1](https://www.kaggle.com/aadhavvignesh/regression-with-scikit-learn-practical-ml-1)

# Inspecting Data

Let us load the dataset and save it in a DataFrame:

In [None]:
import pandas as pd

df = pd.read_csv("../input/diabetes-dataset/diabetes2.csv")
df.head()

Let us check the number of non-null values and the datatype of each column in our DataFrame using `.info()`.

In [None]:
df.info()

Fortunately, we do not have any missing values in our dataset. We can now proceed with visualizations and making predictions.
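
As a cross-check of what `.info()` reports, missing values can also be counted per column. The sketch below uses a small made-up DataFrame (the values are hypothetical and are not taken from the dataset). It also counts zeros, on the assumption that a zero in a column such as glucose or BMI may be a placeholder for a missing measurement rather than a real reading; whether that applies to this dataset should be verified.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "BMI": [33.6, 26.6, 23.3, 0.0],
    "Outcome": [1, 0, 1, 0],
})

# Count explicit missing values (NaN) per column
null_counts = toy.isnull().sum()
print(null_counts)

# Count zeros in columns where a zero is physiologically implausible
# (assumption: zeros here act as placeholders for missing data)
zero_counts = (toy[["Glucose", "BMI"]] == 0).sum()
print(zero_counts)
```

The same two lines can be run on `df` to see whether any column needs cleaning before modelling.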

# Visualizing Data

Let us visualize the dataset using various plots.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

## Distribution of Ages

Let us plot the **distribution of ages** of the patients.

In [None]:
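# `distplot` is deprecated in recent seaborn releases; `histplot` with a KDE overlay is the replacement
sns.histplot(df['Age'], kde=True)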

Here, we can see that most patients in the dataset are between 20 and 30 years old.

Note that this only describes who was sampled: it does not by itself show that people aged 20-30 are more prone to diabetes. To compare risk across ages, we need to look at how `Outcome` varies with `Age`.

## Plot relation of blood sugar levels and age

The joint density plot below shows that most patients are 20-30 years old, with blood sugar levels of approximately 100 mg/dL.

**Note:** Normal sugar levels lie in the range of 60 - 90 mg/dL. For more information, see [this WebMD article](https://www.webmd.com/diabetes/how-sugar-affects-diabetes).

In [None]:
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.jointplot(data=df, x='Age', y='Glucose', kind='kde')

## Pair plot

Let us plot pairwise relationships in a dataset using `seaborn`'s `pairplot` function.

In [None]:
sns.pairplot(df, hue = 'Outcome')

Let us create a helper function to evaluate predictions and calculate accuracy.

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import precision_score, recall_score, f1_score

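def evaluate(true, pred):
    # Compute the metrics from the function arguments, not from global variables
    precision = precision_score(true, pred)
    recall = recall_score(true, pred)
    f1 = f1_score(true, pred)

    print("Confusion Matrix:\n", confusion_matrix(true, pred))
    # Cross-tabulate actual vs. predicted labels and draw the counts as a heatmap
    cm = pd.crosstab(true, pred, rownames=['Actual'], colnames=['Predicted'])
    sns.heatmap(cm, annot=True, fmt='d')
    print("Accuracy Score:", accuracy_score(true, pred))
    print("Precision:", precision)
    print("Recall:", recall)
    print("F1 Score:", f1)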

Now, let us prepare the data by separating the features from the target: we drop `Outcome` from the feature matrix `X` and use it as the target variable `y`.

In [None]:
X = df.drop(['Outcome'], axis = 1)
y = df['Outcome']

Let us now split our data into training and test data sets. We use the training set in order to fit the model and the test set to evaluate our model's predictions.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 7)
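
To make the effect of `test_size` and `random_state` concrete, here is a minimal sketch on made-up data (the arrays `X_toy` and `y_toy` are hypothetical). It shows the resulting shapes for a 20% test split and that repeating the call with the same `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 2 features, binary labels
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=7
)
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)

# Repeating the call with the same random_state gives the same test rows
_, X_te_again, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=7
)
print(np.array_equal(X_te, X_te_again))  # True
```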

# Logistic Regression

Logistic regression is a **classification algorithm** used to assign observations to a discrete set of classes. Unlike linear regression, which outputs continuous values, logistic regression passes its output through the logistic sigmoid function to obtain a probability, which is then mapped to a discrete class.

In its basic form, it is used when the target variable is binary (0 or 1), as in our dataset.

![Linear vs Logistic Regression](https://www.machinelearningplus.com/wp-content/uploads/2017/09/linear_vs_logistic_regression.jpg)

### Sigmoid Function:

The sigmoid function is represented by:

$\large S(z) = \frac{1} {1 + e^{-z}}$

where:

- $\large S(z)$ = output between 0 and 1 (probability estimate)
- $\large z$ = input to the function (your algorithm’s prediction, e.g. mx + b)
- $\large e$ = base of the natural logarithm
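
The formula can be tried out directly. The following is a minimal sketch using only the standard library; the function name `sigmoid` and the sample inputs are chosen here for illustration.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: maps any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Large negative inputs approach 0, large positive inputs approach 1
for z in (-5, 0, 5):
    print(z, round(sigmoid(z), 4))
# -5 0.0067
# 0 0.5
# 5 0.9933
```

An input of 0 maps to exactly 0.5, which is why 0.5 is the usual cut-off between the two classes.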

We'll create a logistic regression model using scikit-learn's `LogisticRegression`:

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver = 'liblinear')
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

In [None]:
evaluate(y_test, y_pred)
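
To connect the fitted model back to the sigmoid function, the sketch below trains a `LogisticRegression` on synthetic data (generated with `make_classification`; the data and the names `X_demo`, `y_demo`, and `clf` are made up for illustration). It checks two things for a binary problem: that the predicted probability of class 1 is the sigmoid of the linear score, and that `predict` corresponds to thresholding that probability at 0.5.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (hypothetical, for illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

clf = LogisticRegression(solver="liblinear")
clf.fit(X_demo, y_demo)

# Linear score z = w.x + b for each sample
z = clf.decision_function(X_demo)

# Probability of class 1, compared against the sigmoid of the linear score
proba = clf.predict_proba(X_demo)[:, 1]
matches_sigmoid = np.allclose(proba, 1 / (1 + np.exp(-z)))
print(matches_sigmoid)

# Thresholding the probability at 0.5, compared against predict()
labels = (proba > 0.5).astype(int)
matches_predict = np.array_equal(labels, clf.predict(X_demo))
print(matches_predict)
```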

# Measuring Performance

### 1. Confusion Matrix

This is what a confusion matrix looks like:
![Confusion Matrix](https://miro.medium.com/max/712/1*Z54JgbS4DUwWSknhDCvNTQ.png)

Now, let us understand what TP, TN, FP, FN denote in this matrix:


- **True Positives (TP):** These are cases in which we predicted yes (they have the disease), and they do have the disease.
- **True Negatives (TN):** We predicted no, and they don't have the disease.
- **False Positives (FP):** We predicted yes, but they don't actually have the disease. (Also known as a "Type I error.")
- **False Negatives (FN):** We predicted no, but they actually do have the disease. (Also known as a "Type II error.")
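
The four counts can be worked out by hand on a small example. This sketch uses made-up labels (`y_true` and `y_hat` are hypothetical) and plain Python so that each definition is visible.

```python
# Hypothetical actual labels and predictions (1 = has the disease)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat = [1, 0, 0, 1, 1, 0, 1, 0]

pairs = list(zip(y_true, y_hat))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # predicted yes, actually yes
tn = sum(t == 0 and p == 0 for t, p in pairs)  # predicted no, actually no
fp = sum(t == 0 and p == 1 for t, p in pairs)  # predicted yes, actually no
fn = sum(t == 1 and p == 0 for t, p in pairs)  # predicted no, actually yes

print(tp, tn, fp, fn)  # 3 3 1 1
```

For binary labels 0 and 1, scikit-learn's `confusion_matrix` arranges these counts as `[[TN, FP], [FN, TP]]`.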


### 2. Precision

Precision is defined as the number of true positives (TP) over the number of true positives plus the number of false positives (FP).

![Precision](https://miro.medium.com/max/948/1*HGd3_eAJ3-PlDQvn-xDRdg.png)

### 3. Recall

Recall is defined as the number of true positives (TP) over the number of true positives plus the number of false negatives (FN).

![Recall](https://miro.medium.com/max/836/1*dXkDleGhA-jjZmZ1BlYKXg.png)

### 4. F1-Score

F1-score is the harmonic mean of precision and recall.

![F1_Score](https://miro.medium.com/max/564/1*T6kVUKxG_Z4V5Fm1UXhEIw.png)
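
The three definitions above can be checked with a few lines of arithmetic. The counts below are hypothetical and chosen only to make the numbers easy to follow.

```python
# Hypothetical confusion-matrix counts
tp, fp, fn = 30, 10, 20

precision = tp / (tp + fp)  # 30 / 40
recall = tp / (tp + fn)     # 30 / 50

# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, round(f1, 4))  # 0.75 0.6 0.6667
```

Because it is a harmonic mean, the F1-score is pulled towards the lower of the two values, so a model cannot score well by maximizing only one of them.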

## PR Curve

A PR curve is simply a graph with Precision values on the y-axis and Recall values on the x-axis. 

Ideally, a model has both high precision and high recall. In practice, there is usually a trade-off between the two: raising the decision threshold tends to increase precision while lowering recall. **A good PR curve has a larger AUC (area under the curve).**
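
The trade-off becomes visible when the decision threshold is varied by hand. The sketch below uses made-up scores and labels (`demo_labels`, `demo_scores`, and the helper `pr_at` are hypothetical names) and computes precision and recall at three thresholds.

```python
# Hypothetical true labels and predicted probabilities for class 1
demo_labels = [0, 0, 1, 0, 1, 1]
demo_scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]

def pr_at(threshold):
    """Return (precision, recall) when predicting 1 for scores >= threshold."""
    preds = [int(s >= threshold) for s in demo_scores]
    tp = sum(t == 1 and p == 1 for t, p in zip(demo_labels, preds))
    fp = sum(t == 0 and p == 1 for t, p in zip(demo_labels, preds))
    fn = sum(t == 1 and p == 0 for t, p in zip(demo_labels, preds))
    return tp / (tp + fp), tp / (tp + fn)

for threshold in (0.2, 0.5, 0.8):
    precision, recall = pr_at(threshold)
    print(threshold, round(precision, 3), round(recall, 3))
# 0.2 0.6 1.0
# 0.5 0.667 0.667
# 0.8 1.0 0.333
```

A PR curve is this same calculation repeated over every distinct threshold, which is why it needs predicted probabilities rather than hard 0/1 labels.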

In [None]:
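from sklearn.metrics import precision_recall_curve

# Use predicted probabilities for the positive class rather than hard 0/1 labels,
# so that the curve is traced over many thresholds
y_scores = model.predict_proba(X_test)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_test, y_scores)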

In [None]:
plt.figure(figsize=(12, 8))
# Recall on the x-axis, precision on the y-axis
plt.plot(recalls, precisions)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("PR Curve: precision/recall tradeoff");

## ROC Curve

The ROC curve plots the true positive rate (another name for recall) against the false positive rate at different classification thresholds. The false positive rate (FPR) is the ratio of negative instances that are incorrectly classified as positive. It is equal to one minus the true negative rate, which is the ratio of negative instances that are correctly classified as negative.
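
The relationship between these rates can be verified with a small calculation. The counts below are hypothetical.

```python
# Hypothetical confusion-matrix counts
tp, fn = 40, 10  # 50 actual positives
tn, fp = 90, 10  # 100 actual negatives

tpr = tp / (tp + fn)  # true positive rate (recall)
fpr = fp / (fp + tn)  # false positive rate
tnr = tn / (tn + fp)  # true negative rate

print(tpr, fpr, tnr)  # 0.8 0.1 0.9

# FPR and TNR share the same denominator, so they add up to 1
print(abs(fpr - (1 - tnr)) < 1e-12)  # True
```

Each point on a ROC curve is one such (FPR, TPR) pair, obtained at one threshold.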

In [None]:
from sklearn.metrics import roc_curve

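# Pass predicted probabilities for the positive class so the curve covers many thresholds
fpr, tpr, thresholds = roc_curve(y_test, model.predict_proba(X_test)[:, 1])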

In [None]:
plt.figure(figsize=(12, 8))
plt.plot(fpr, tpr, linewidth=2)

plt.plot([0, 1], [0, 1], "k--")
plt.axis([0, 1, 0, 1])

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')

In [None]:
from sklearn.metrics import roc_auc_score
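# ROC AUC is computed from predicted probabilities, not from hard 0/1 predictions
roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])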
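
One way to read the ROC AUC value is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way on made-up data (`demo_labels` and `demo_scores` are hypothetical); it is meant to illustrate the interpretation, not to replace `roc_auc_score`.

```python
from itertools import product

# Hypothetical true labels and predicted probabilities for class 1
demo_labels = [0, 0, 1, 0, 1, 1]
demo_scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]

pos = [s for s, t in zip(demo_scores, demo_labels) if t == 1]
neg = [s for s, t in zip(demo_scores, demo_labels) if t == 0]

# Count (positive, negative) pairs where the positive is ranked higher;
# ties count as half a win
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
auc = wins / (len(pos) * len(neg))

print(round(auc, 4))  # 0.8889
```

Here 8 of the 9 pairs are ordered correctly; the only mistake is the negative scored 0.6 outranking the positive scored 0.4.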

# Hyperparameter Tuning

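Hyperparameter tuning is the process of choosing the set of hyperparameters that gives the best model performance. Here we use a grid search, which tries every combination of the penalty type, the regularization strength `C`, the class weights, and the solver, and scores each combination with cross-validated ROC AUC.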

In [None]:
from sklearn.model_selection import GridSearchCV

penalty = ['l1', 'l2']
C = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]
class_weight = [{1:0.5, 0:0.5}, {1:0.4, 0:0.6}, {1:0.6, 0:0.4}, {1:0.7, 0:0.3}]
solver = ['liblinear', 'saga']

param_grid = dict(penalty=penalty, C=C, class_weight=class_weight, solver=solver)

# Note: the `iid` argument was removed from GridSearchCV in scikit-learn 0.24
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring='roc_auc',
                    verbose=1, n_jobs=-1, cv=10)
grid_result = grid.fit(X_train, y_train)

In [None]:
y_pred = grid_result.predict(X_test)

evaluate(y_test, y_pred)
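
After a grid search has been fitted, it is useful to look at which combination won. The sketch below runs a deliberately small search on synthetic data (the data, the reduced grid, and the names `demo_grid` and `best_model` are made up for illustration) and reads the results from the `best_params_`, `best_score_`, and `best_estimator_` attributes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data (hypothetical, for illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

# A much smaller grid than the one above, to keep the example fast
demo_grid = GridSearchCV(
    estimator=LogisticRegression(solver="liblinear"),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    scoring="roc_auc",
    cv=5,
)
demo_grid.fit(X_demo, y_demo)

print(demo_grid.best_params_)  # winning hyperparameter combination
print(demo_grid.best_score_)   # its mean cross-validated ROC AUC

# The best model, refit on all of the data passed to fit()
best_model = demo_grid.best_estimator_
```

The same attributes are available on `grid_result` from the search above.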

# Summary:

In this notebook, you learned about:

- Inspecting Data
- Visualizing Data
- Splitting Data into Training and Test Sets
- Logistic Regression
- Measuring Performance
- PR, ROC Curve
- Hyperparameter Tuning

**Link to Part 1:** [Regression with Scikit-Learn: Practical ML #1](https://www.kaggle.com/aadhavvignesh/regression-with-scikit-learn-practical-ml-1)

## Part 3 coming soon!