# Model Summary 
The dataset lets us predict whether a breast tumour is malignant or benign. Since there are two possible outcomes, this is a binary classification problem, which we model below with logistic regression. scikit-learn's copy of the dataset encodes a malignant tumour as 0 and a benign tumour as 1; these are the class labels used in the results below.
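
As a rough illustration of what logistic regression computes, the sketch below turns a linear score into a probability with the logistic (sigmoid) function and thresholds it at 0.5. The weights, bias, and feature values are made-up numbers for illustration, not values fitted to this dataset.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(features, weights, bias, threshold=0.5):
    """Return (probability of class 1, predicted class) for one sample."""
    # Linear score: weighted sum of the features plus the bias term
    score = sum(w * x for w, x in zip(weights, features)) + bias
    prob = sigmoid(score)
    return prob, int(prob >= threshold)

# Hypothetical weights and a hypothetical two-feature sample
prob, label = predict_class(features=[2.0, -1.0], weights=[0.5, 0.25], bias=0.0)
print(prob, label)
```

The probability is always that of the class coded as 1, so which diagnosis it refers to depends on how the labels were encoded.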

Importing the necessary libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_breast_cancer

Load the dataset into a pandas DataFrame

In [11]:
breast_cancer_dataset = load_breast_cancer()


# Begin preprocessing: build a DataFrame with one column per feature
breast_cancer_df = pd.DataFrame(data=breast_cancer_dataset['data'], columns=breast_cancer_dataset['feature_names'])

breast_cancer_df['target'] = pd.Series(breast_cancer_dataset.target)

# scikit-learn ships target_names as ['malignant', 'benign'], i.e. 0 = malignant and 1 = benign.
# Keep that order so that each row gets the correct diagnosis label.
breast_cancer_df['diagnosis'] = breast_cancer_df['target'].map(dict(enumerate(breast_cancer_dataset.target_names)))


Preview the first rows and the fields we will be working with

In [None]:
breast_cancer_df.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target,diagnosis
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0,malignant
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0,malignant
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.00615,0.04006,0.03832,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0,malignant
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,0.4956,1.156,3.445,27.23,0.00911,0.07458,0.05661,0.01867,0.05963,0.009208,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0,malignant
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,0.7572,0.7813,5.438,94.44,0.01149,0.02461,0.05688,0.01885,0.01756,0.005115,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0,malignant


Check the last rows as well, to confirm that the target column is not 0 for every row

In [None]:
breast_cancer_df.tail()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target,diagnosis
564,21.56,22.39,142.0,1479.0,0.111,0.1159,0.2439,0.1389,0.1726,0.05623,1.176,1.256,7.673,158.7,0.0103,0.02891,0.05198,0.02454,0.01114,0.004239,25.45,26.4,166.1,2027.0,0.141,0.2113,0.4107,0.2216,0.206,0.07115,0,malignant
565,20.13,28.25,131.2,1261.0,0.0978,0.1034,0.144,0.09791,0.1752,0.05533,0.7655,2.463,5.203,99.04,0.005769,0.02423,0.0395,0.01678,0.01898,0.002498,23.69,38.25,155.0,1731.0,0.1166,0.1922,0.3215,0.1628,0.2572,0.06637,0,malignant
566,16.6,28.08,108.3,858.1,0.08455,0.1023,0.09251,0.05302,0.159,0.05648,0.4564,1.075,3.425,48.55,0.005903,0.03731,0.0473,0.01557,0.01318,0.003892,18.98,34.12,126.7,1124.0,0.1139,0.3094,0.3403,0.1418,0.2218,0.0782,0,malignant
567,20.6,29.33,140.1,1265.0,0.1178,0.277,0.3514,0.152,0.2397,0.07016,0.726,1.595,5.772,86.22,0.006522,0.06158,0.07117,0.01664,0.02324,0.006185,25.74,39.42,184.6,1821.0,0.165,0.8681,0.9387,0.265,0.4087,0.124,0,malignant
568,7.76,24.54,47.92,181.0,0.05263,0.04362,0.0,0.0,0.1587,0.05884,0.3857,1.428,2.548,19.15,0.007189,0.00466,0.0,0.0,0.02676,0.002783,9.456,30.37,59.16,268.6,0.08996,0.06444,0.0,0.0,0.2871,0.07039,1,benign


The next step is to identify the dimensions of the dataset

In [None]:
print(breast_cancer_df.shape)

(569, 32)


The shape is m x n = 569 x 32, where m is the number of rows and n the number of columns of the DataFrame: the 30 feature columns plus the target and diagnosis columns added earlier

Next we check for any null values

In [None]:
# isna() is an alias of isnull(), so a single call is enough
breast_cancer_df.isnull().sum()

mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
target                     0
diagnosis                  0
dtype: int64

Here we encode the class labels as integers for the model. The target column is already numeric, so LabelEncoder leaves its values unchanged (0 = malignant, 1 = benign)


In [31]:

from sklearn.preprocessing import LabelEncoder
X = breast_cancer_dataset.data
# Encode from the numeric target column so the class codes match the dataset's own encoding
Y = breast_cancer_df['target']
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
# Show the first and last five encoded labels
print(Y[:5], Y[-5:])

[0 0 0 0 0] [0 0 0 0 1]
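
For string labels, LabelEncoder assigns integer codes in sorted order of the distinct class values. A minimal plain-Python sketch of that rule follows; it is not scikit-learn's implementation, and the label list is a made-up example rather than rows from the dataset.

```python
def encode_labels(labels):
    """Map each distinct label to its index in sorted order, as LabelEncoder does."""
    classes = sorted(set(labels))
    mapping = {name: code for code, name in enumerate(classes)}
    return [mapping[name] for name in labels], mapping

# Hypothetical labels, not rows from the dataset
codes, mapping = encode_labels(["malignant", "benign", "benign", "malignant"])
print(mapping)  # {'benign': 0, 'malignant': 1}
print(codes)    # [1, 0, 0, 1]
```

Because the codes follow sorted order, 'benign' maps to 0 and 'malignant' to 1 whenever those two strings are encoded, whatever numeric coding the source data used.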


We then split the data into training and test sets: 75% for training and 25% for testing

In [21]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25, random_state = 0)
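
A sketch of the split arithmetic: for a fractional test_size, scikit-learn rounds the test count up, so 569 rows give 143 test rows and 426 training rows (143 matches the total support in the classification report). The shuffle below uses Python's random module only to illustrate the idea; split_indices is a hypothetical helper and does not reproduce the exact rows that train_test_split selects.

```python
import math
import random

def split_indices(n_samples, test_size=0.25, seed=0):
    """Shuffle row indices and split them into train and test index lists."""
    n_test = math.ceil(test_size * n_samples)  # test count is rounded up
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # seeded, so the split is repeatable
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(569)
print(len(train_idx), len(test_idx))  # 426 143
```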

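We then standardize the features with StandardScaler so that each has zero mean and unit variance. The scaler is fitted on the training set only and then applied to the test set, so no test-set statistics leak into training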


In [22]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
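
To make the fit/transform distinction concrete, here is a plain-Python sketch of standardization for a single feature. It assumes the population standard deviation (which is what StandardScaler uses); fit_scaler and transform are hypothetical helpers and the numbers are made up.

```python
import statistics

def fit_scaler(column):
    """Return the mean and population standard deviation of one feature column."""
    return statistics.mean(column), statistics.pstdev(column)

def transform(column, mean, std):
    """Standardize values with previously fitted statistics."""
    return [(x - mean) / std for x in column]

# Hypothetical values for a single feature
train_col = [2.0, 4.0, 6.0]
test_col = [8.0]

mean, std = fit_scaler(train_col)             # statistics come from the training data only
train_scaled = transform(train_col, mean, std)
test_scaled = transform(test_col, mean, std)  # test data reuses the training statistics
print(train_scaled, test_scaled)
```

The test value lands outside the range of the scaled training values, which is expected: it is expressed in units of the training distribution rather than rescaled on its own.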

We then choose the model for classification, in our case Logistic Regression. We create a classifier instance and fit it to the training data

In [33]:
from sklearn.linear_model import LogisticRegression
logReg = LogisticRegression(random_state = 0)
logReg.fit(X_train, Y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=0, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

We then make predictions on the test data and calculate the accuracy score

In [34]:
preds = logReg.predict(X_test)
score = logReg.score(X_test, Y_test)
print(score)

0.958041958041958


We then use the confusion matrix below to evaluate the model. Rows are the true classes and columns the predicted classes; in scikit-learn's encoding of this dataset, 0 is malignant and 1 is benign

In [38]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y_test, preds)
# Rows are the true classes, columns the predicted classes
confusion_df = pd.DataFrame(cm,
             columns=["Predicted Class " + str(label) for label in [0, 1]],
             index=["Class " + str(label) for label in [0, 1]])
print(confusion_df)

         Predicted Class 0  Predicted Class 1
Class 0                 50                  3
Class 1                  3                 87
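
The accuracy score reported earlier can be recovered from this matrix: the diagonal holds the correctly classified samples. A small sketch, with the matrix values typed in by hand from the output above:

```python
# Confusion matrix from the output above: rows are true classes, columns are predictions
cm = [[50, 3],
      [3, 87]]

correct = sum(cm[i][i] for i in range(len(cm)))  # diagonal: correctly classified samples
total = sum(sum(row) for row in cm)              # all test samples
accuracy = correct / total
print(correct, total, round(accuracy, 4))  # 137 143 0.958
```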


We can then print the classification report

In [40]:
from sklearn.metrics import classification_report
print(classification_report(Y_test, preds))

              precision    recall  f1-score   support

           0       0.94      0.94      0.94        53
           1       0.97      0.97      0.97        90

    accuracy                           0.96       143
   macro avg       0.96      0.96      0.96       143
weighted avg       0.96      0.96      0.96       143
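
The per-class figures in the report follow from the confusion matrix. The sketch below recomputes them by hand; class_metrics is a hypothetical helper, and the matrix values are copied from the earlier output.

```python
cm = [[50, 3],
      [3, 87]]

def class_metrics(cm, k):
    """Precision, recall, F1 and support for class k of a square confusion matrix."""
    true_pos = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum: everything predicted as k
    actual_k = sum(cm[k])                    # row sum: everything that really is k (support)
    precision = true_pos / predicted_k
    recall = true_pos / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, actual_k

for k in (0, 1):
    p, r, f1, support = class_metrics(cm, k)
    print(k, round(p, 2), round(r, 2), round(f1, 2), support)
```

Precision and recall coincide for each class here because the two off-diagonal counts are equal (3 and 3).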

