
# A Python example for binary classification
For our data, we will use the breast cancer dataset from scikit-learn. This dataset contains tumor observations and corresponding labels for whether the tumor was malignant or benign.

First, we'll import a few libraries and then load the data. When loading the data, we'll specify as_frame=True so we can work with pandas objects (see our pandas tutorial for an introduction).

In [2]:
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer


The dataset contains a DataFrame for the observation data and a Series for the target data. Let's see what the first few rows of observations look like:

In [20]:
datasets = load_breast_cancer(as_frame=True)
datasets['data'].head()

,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [None]:
datasets['data'].info()
# iterating the columns
for col in datasets['data'].columns:
    print(col)

list(datasets['data'].columns)

In [49]:
datasets['target'].count()

569
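
As a quick sanity check before modeling, a sketch like the following confirms the size of the feature table and that no values are missing. It reloads the dataset so that it runs on its own; the names features, n_rows, n_cols, and n_missing are illustrative and not part of the original notebook.

```python
from sklearn.datasets import load_breast_cancer

# Reload the dataset so this cell is self-contained
datasets = load_breast_cancer(as_frame=True)
features = datasets['data']

# Number of observations (rows) and features (columns)
n_rows, n_cols = features.shape
print(n_rows, n_cols)

# Total count of missing values across all columns
n_missing = int(features.isna().sum().sum())
print(n_missing)
```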

The head() output above shows five observations with a column for each feature we'll use to predict malignancy, and the count confirms 569 observations in total. Now, for the targets:

In [48]:
datasets['target'].head()

0    0
1    0
2    0
3    0
4    0
Name: target, dtype: int64

The targets for the first five observations are all zero. In scikit-learn's encoding of this dataset, 0 denotes malignant and 1 denotes benign, so these five tumors are malignant. Here's how many tumors of each class are in our dataset:

In [51]:
datasets['target'].value_counts()

1    357
0    212
Name: target, dtype: int64

In [None]:
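# list() on the loaded dataset object returns its keys (e.g. 'data', 'target')
datasets_list = list(datasets)

# 'radius' is not one of those keys, so this count is 0
datasets_list.count('radius')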

So we have 357 benign tumors, denoted as 1, and 212 malignant tumors, denoted as 0. With two possible classes, this is a binary classification problem.
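
To avoid guessing which integer stands for which class, the label names can be read from the dataset object itself. The sketch below relies on the target_names entry of the object returned by load_breast_cancer; label_names and named_counts are illustrative variable names, not part of the original notebook.

```python
from sklearn.datasets import load_breast_cancer

datasets = load_breast_cancer(as_frame=True)

# target_names[i] is the class name for integer label i
label_names = [str(name) for name in datasets['target_names']]
print(label_names)

# Replace the integer codes with their names, then count each class
named_counts = datasets['target'].map(lambda code: label_names[code]).value_counts()
print(named_counts)
```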

To perform binary classification using logistic regression with sklearn, we must accomplish the following steps.

# Step 1: Define explanatory and target variables

We'll store the rows of observations in a variable x and the corresponding class of those observations (0 or 1) in a variable y.

In [47]:
x = datasets['data']
y = datasets['target']

# Step 2: Split the dataset into training and testing sets

We use 75% of the data for training and 25% for testing. Setting random_state=0 makes the split reproducible, so your results should match ours.

In [52]:
#Step 2: Split the dataset into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
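
To see what the split actually produced, a short check like the one below prints the size of each part and the class proportions in it. This is only a sketch that repeats the same call so it runs on its own.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

datasets = load_breast_cancer(as_frame=True)
x = datasets['data']
y = datasets['target']

X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)

# Row counts of each split; together they cover all 569 observations
print(len(X_train), len(X_test))

# Fraction of each class in the two splits
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```

If the class proportions differed noticeably between the two parts, passing stratify=y to train_test_split would be one option, though it would change the split and therefore the results shown in this notebook.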

# Step 3: Normalize the data for numerical stability

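Note that we normalize after splitting the data, so that statistics from the test set never influence the training data (data leakage). The cell below fits a separate StandardScaler on each split, which is what produced the outputs recorded in this notebook. The more common practice is to fit the scaler on the training data only and reuse it to transform the test data.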

In [55]:
from sklearn.preprocessing import StandardScaler

ss_train = StandardScaler()
X_train = ss_train.fit_transform(X_train)

ss_test = StandardScaler()
X_test = ss_test.fit_transform(X_test)
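
For comparison, here is a minimal sketch of the more conventional pattern, in which a single scaler learns its statistics from the training data and is then applied unchanged to the test data. It is not the code that produced the outputs in this notebook, so results obtained this way may differ slightly. The names scaler, X_train_scaled, and X_test_scaled are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

datasets = load_breast_cancer(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    datasets['data'], datasets['target'], test_size=0.25, random_state=0
)

# Learn the mean and standard deviation from the training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Reuse the training statistics to transform the test data
X_test_scaled = scaler.transform(X_test)

# Training columns are centered at zero; test columns only approximately so
print(np.abs(X_train_scaled.mean(axis=0)).max())
print(np.abs(X_test_scaled.mean(axis=0)).max())
```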

# Step 4: Fit a logistic regression model to the training data

This step effectively trains the model to predict the targets from the data.

In [57]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train,y_train)


LogisticRegression()

# Step 5: Make predictions on the testing data

With the model trained, we now ask the model to predict targets based on the test data.

In [59]:
pred_y = model.predict(X_test)
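
Behind each predicted label is a class probability. The sketch below shows how predict_proba relates to predict for a binary logistic regression. As an assumption made for brevity, it bundles scaling and the classifier into one pipeline with make_pipeline, which differs from the separate scaling cells above; clf, proba, and labels_from_proba are illustrative names.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

datasets = load_breast_cancer(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    datasets['data'], datasets['target'], test_size=0.25, random_state=0
)

# Assumption: scaling and the classifier are bundled in one pipeline for brevity
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_train, y_train)

# One row per test observation, one column per class (0, then 1)
proba = clf.predict_proba(X_test)

# Thresholding the class-1 probability at 0.5 reproduces predict()
labels_from_proba = (proba[:, 1] > 0.5).astype(int)
labels = clf.predict(X_test)
print((labels_from_proba == labels).all())
```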


# Step 6: Calculate the accuracy score by comparing the actual and predicted values

We can now calculate how well the model performed by comparing the model's predictions to the true target values, which we reserved in the y_test variable.

First, we'll calculate the confusion matrix to get the four counts we need (true and false positives and negatives):

In [66]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, pred_y)
print(cm)

# ravel() flattens the 2x2 matrix row by row: TN, FP, FN, TP
TN, FP, FN, TP = cm.ravel()
print('True Positive(TP)  = ', TP)
print('False Positive(FP) = ', FP)
print('True Negative(TN)  = ', TN)
print('False Negative(FN) = ', FN)

[[51  2]
 [ 4 86]]
True Positive(TP)  =  86
False Positive(FP) =  2
True Negative(TN)  =  51
False Negative(FN) =  4
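
The layout of the matrix is easy to mix up, so here is a small sketch on made-up labels (y_true and y_hat are invented for illustration) showing that rows correspond to true classes, columns to predicted classes, and that ravel() yields the counts in the order TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

# Toy labels (made up for illustration): 2 negatives and 3 positives
y_true = [0, 0, 1, 1, 1]
y_hat = [0, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes
toy_cm = confusion_matrix(y_true, y_hat)
print(toy_cm)

# ravel() flattens row by row, giving TN, FP, FN, TP for labels 0 and 1
tn, fp, fn, tp = toy_cm.ravel()
print(tn, fp, fn, tp)
```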


With these values, we can now calculate an accuracy score:

In [69]:
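# Accuracy computed by hand from the confusion matrix counts
accuracy = (TP + TN) / (TP + FP + TN + FN)
print('Accuracy of the binary classifier = {:0.3f}'.format(accuracy))

# The same value from the model's built-in score method
accuracy = model.score(X_test, y_test)
print('Accuracy of the binary classifier = {:0.3f}'.format(accuracy))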

Accuracy of the binary classifier = 0.958
Accuracy of the binary classifier = 0.958
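
The same four counts also give precision and recall. The sketch below simply plugs in the numbers printed in the confusion matrix output above, treating class 1 as the positive class, as the TP/FP labels there do.

```python
# Counts copied from the confusion matrix output above
TP, FP, TN, FN = 86, 2, 51, 4

# Precision: of everything predicted positive, how much was truly positive
precision = TP / (TP + FP)

# Recall: of everything truly positive, how much was predicted positive
recall = TP / (TP + FN)

# Accuracy: share of all predictions that were correct
accuracy = (TP + TN) / (TP + FP + TN + FN)

print('Precision = {:0.3f}'.format(precision))
print('Recall    = {:0.3f}'.format(recall))
print('Accuracy  = {:0.3f}'.format(accuracy))
```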


# Other binary classifiers in the scikit-learn library

Logistic regression is just one of many classification algorithms defined in scikit-learn. We'll compare several of the most common, but feel free to read more about these algorithms in the sklearn docs: https://scikit-learn.org/stable/supervised_learning.html

We'll also use the sklearn accuracy, precision, and recall metrics for performance evaluation. See the sklearn docs if you'd like to read more about the available metrics.

## Initializing each binary classifier
To quickly train each model in a loop, we'll initialize each model and store it by name in a dictionary:

In [74]:
model = {}

# Logistic Regression
from sklearn.linear_model import LogisticRegression
model['Logistic Regression'] = LogisticRegression()

# Support Vector Machines
from sklearn.svm import LinearSVC
model['Support Vector Machines'] = LinearSVC()

# Naive Bayes
from sklearn.naive_bayes import GaussianNB
model['Naive Bayes'] = GaussianNB()

# K-Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier
model['K-Nearest Neighbors'] = KNeighborsClassifier()

