## Breast Cancer Data

This dataset comes from the UCI Machine Learning Repository (Breast Cancer dataset). The objective is to predict whether a patient's breast cancer recurs (`recurrence-events`) or not (`no-recurrence-events`), based on the clinical attributes included in the dataset. Several constraints were placed on the selection of these instances from a larger database.

## Introduction

Breast cancer is a disease in which abnormal breast cells grow out of control and form tumours. If left unchecked, the tumours can spread throughout the body and become fatal.

Breast cancer cells begin inside the milk ducts and/or the milk-producing lobules of the breast. The earliest form (in situ) is not life-threatening. Cancer cells can spread into nearby breast tissue (invasion). This creates tumours that cause lumps or thickening.

Invasive cancers can spread to nearby lymph nodes or other organs (metastasize). Metastasis can be fatal.

Treatment is based on the person, the type of cancer and its spread. Treatment combines surgery, radiation therapy and medications.

This data set includes 201 instances of one class (no-recurrence-events) and 85 instances of the other (recurrence-events). The instances are described by 9 attributes, some of which are linear and some nominal, plus the class label.

1. age
2. menopause
3. tumor-size (named `tumor` in the code below)
4. inv-nodes
5. node-caps
6. deg-malig
7. breast
8. breast-quad
9. irradiat

Target: Class (no-recurrence-events, recurrence-events)

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
pd.set_option("display.max_columns",None)

In [2]:
# Read the file; a regex separator needs the Python parsing engine
data = pd.read_csv('breast-cancer.data', header=None, delimiter=r' *, *', engine='python')

In [3]:
data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,no-recurrence-events,30-39,premeno,30-34,0-2,no,3,left,left_low,no
1,no-recurrence-events,40-49,premeno,20-24,0-2,no,2,right,right_up,no
2,no-recurrence-events,40-49,premeno,20-24,0-2,no,2,left,left_low,no
3,no-recurrence-events,60-69,ge40,15-19,0-2,no,2,right,left_up,no
4,no-recurrence-events,40-49,premeno,0-4,0-2,no,2,right,right_low,no


In [4]:
# Check the number of rows and columns
data.shape

(286, 10)

In [5]:
data.columns = ['Class','age','menopause','tumor',
                'inv-nodes','node-caps','deg-malig',
                'breast','breast-quad','irradiat']
 
data.head()

Unnamed: 0,Class,age,menopause,tumor,inv-nodes,node-caps,deg-malig,breast,breast-quad,irradiat
0,no-recurrence-events,30-39,premeno,30-34,0-2,no,3,left,left_low,no
1,no-recurrence-events,40-49,premeno,20-24,0-2,no,2,right,right_up,no
2,no-recurrence-events,40-49,premeno,20-24,0-2,no,2,left,left_low,no
3,no-recurrence-events,60-69,ge40,15-19,0-2,no,2,right,left_up,no
4,no-recurrence-events,40-49,premeno,0-4,0-2,no,2,right,right_low,no


In [6]:
# Summary statistics for every column (numeric and categorical)
data.describe(include="all")

Unnamed: 0,Class,age,menopause,tumor,inv-nodes,node-caps,deg-malig,breast,breast-quad,irradiat
count,286,286,286,286,286,286,286.0,286,286,286
unique,2,6,3,11,7,3,,2,6,2
top,no-recurrence-events,50-59,premeno,30-34,0-2,no,,left,left_low,no
freq,201,96,150,60,213,222,,152,110,218
mean,,,,,,,2.048951,,,
std,,,,,,,0.738217,,,
min,,,,,,,1.0,,,
25%,,,,,,,2.0,,,
50%,,,,,,,2.0,,,
75%,,,,,,,3.0,,,


# Pre-Processing the Data

In [7]:
# Work on a copy so the raw DataFrame stays untouched
data_rev = data.copy()

In [8]:
data.duplicated().sum()

14

In [9]:
# Handle the missing values
# isnull().sum() only detects NaN/None. It does not detect placeholder characters (such as '?') used as missing values.

In [10]:
data_rev.isnull().sum()

Class          0
age            0
menopause      0
tumor          0
inv-nodes      0
node-caps      0
deg-malig      0
breast         0
breast-quad    0
irradiat       0
dtype: int64

In [11]:
data_rev.dtypes

Class          object
age            object
menopause      object
tumor          object
inv-nodes      object
node-caps      object
deg-malig       int64
breast         object
breast-quad    object
irradiat       object
dtype: object

In [12]:
# int64 column (deg-malig) --> no missing values
# object columns --> missing values may be present in the form of special characters

In [13]:
for i in data_rev.columns:
    print({i:data_rev[i].unique()})

{'Class': array(['no-recurrence-events', 'recurrence-events'], dtype=object)}
{'age': array(['30-39', '40-49', '60-69', '50-59', '70-79', '20-29'], dtype=object)}
{'menopause': array(['premeno', 'ge40', 'lt40'], dtype=object)}
{'tumor': array(['30-34', '20-24', '15-19', '0-4', '25-29', '50-54', '10-14',
       '40-44', '35-39', '5-9', '45-49'], dtype=object)}
{'inv-nodes': array(['0-2', '6-8', '9-11', '3-5', '15-17', '12-14', '24-26'],
      dtype=object)}
{'node-caps': array(['no', 'yes', '?'], dtype=object)}
{'deg-malig': array([3, 2, 1], dtype=int64)}
{'breast': array(['left', 'right'], dtype=object)}
{'breast-quad': array(['left_low', 'right_up', 'left_up', 'right_low', 'central', '?'],
      dtype=object)}
{'irradiat': array(['no', 'yes'], dtype=object)}


In [14]:
# Replace the '?' placeholder with NaN

In [15]:
data_rev.replace('?',np.nan,inplace=True)

In [16]:
data_rev.isnull().sum()

Class          0
age            0
menopause      0
tumor          0
inv-nodes      0
node-caps      8
deg-malig      0
breast         0
breast-quad    1
irradiat       0
dtype: int64
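
The `'?'` placeholders can also be converted to NaN while reading the file. The sketch below assumes the same comma-separated layout as `breast-cancer.data`; the three rows are an in-memory stand-in (the second and third are made up so that each carries a `'?'`), and `df`, `sample` and `columns` are illustrative names.

```python
import io

import pandas as pd

# In-memory stand-in for breast-cancer.data (rows 2 and 3 are made up).
sample = io.StringIO(
    "no-recurrence-events,30-39,premeno,30-34,0-2,no,3,left,left_low,no\n"
    "recurrence-events,50-59,ge40,25-29,3-5,?,2,right,left_up,yes\n"
    "no-recurrence-events,40-49,premeno,20-24,0-2,no,2,left,?,no\n"
)
columns = ['Class', 'age', 'menopause', 'tumor', 'inv-nodes', 'node-caps',
           'deg-malig', 'breast', 'breast-quad', 'irradiat']

# na_values tells read_csv to treat '?' as missing in every column.
df = pd.read_csv(sample, header=None, names=columns, na_values='?')

missing = df.isnull().sum()
print(missing[missing > 0])
```

With this approach `isnull().sum()` reports the missing entries immediately, without a separate `replace` step.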

In [17]:
# Both variables with missing values are categorical, so we impute them with the mode

In [18]:
# Replace the missing values with the mode of each column
for value in ['node-caps', 'breast-quad']:
    data_rev[value] = data_rev[value].fillna(data_rev[value].mode()[0])

# mode() returns the most frequent value(s); [0] picks the first one, which fills the missing entries

In [19]:
data_rev.isnull().sum()

Class          0
age            0
menopause      0
tumor          0
inv-nodes      0
node-caps      0
deg-malig      0
breast         0
breast-quad    0
irradiat       0
dtype: int64

The code above fills the missing values with the mode, which is appropriate because both columns are categorical.

In [20]:
data_rev.shape

(286, 10)

In [21]:
colname=[]
for x in data_rev.columns:
    if data_rev[x].dtype=='object':
        colname.append(x)
colname

['Class',
 'age',
 'menopause',
 'tumor',
 'inv-nodes',
 'node-caps',
 'breast',
 'breast-quad',
 'irradiat']

In [22]:
# Encode every categorical (object) column as integer codes
from sklearn.preprocessing import LabelEncoder        # OneHotEncoder is an alternative for nominal features
 
le=LabelEncoder()
 
for x in colname:
    data_rev[x]=le.fit_transform(data_rev[x])
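
`LabelEncoder` assigns codes in sorted (lexicographic) order of the labels, which does not always match the natural order of a binned range. The sketch below uses the `tumor` categories printed earlier to show the codes it produces, next to an explicit ordinal mapping sorted by each bin's lower bound; `lexicographic` and `ordinal` are illustrative names, and the ordinal mapping is an alternative rather than what this notebook uses.

```python
from sklearn.preprocessing import LabelEncoder

# Categories of the 'tumor' column, as printed by unique() above.
tumor_bins = ['30-34', '20-24', '15-19', '0-4', '25-29', '50-54', '10-14',
              '40-44', '35-39', '5-9', '45-49']

# LabelEncoder sorts the labels as strings before numbering them.
le = LabelEncoder().fit(tumor_bins)
lexicographic = {str(label): code for code, label in enumerate(le.classes_)}

# Alternative: order the bins by their numeric lower bound.
ordered = sorted(tumor_bins, key=lambda b: int(b.split('-')[0]))
ordinal = {label: rank for rank, label in enumerate(ordered)}

for label in ['0-4', '5-9', '45-49', '50-54']:
    print(label, lexicographic[label], ordinal[label])
```

Under string sorting `'5-9'` lands between `'45-49'` and `'50-54'`, so its code does not reflect tumour size; whether that matters depends on the model (tree-based models are less sensitive to it than linear ones).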

In [23]:
data_rev=data_rev.iloc[:,[1,2,3,4,5,6,7,8,9,0]]

In [24]:
data_rev.head()

Unnamed: 0,age,menopause,tumor,inv-nodes,node-caps,deg-malig,breast,breast-quad,irradiat,Class
0,1,2,5,0,0,3,0,1,0,0
1,2,2,3,0,0,2,1,4,0,0
2,2,2,3,0,0,2,0,1,0,0
3,4,0,2,0,0,2,1,2,0,0
4,2,2,0,0,0,2,1,3,0,0


In [25]:
data_rev.dtypes

age            int32
menopause      int32
tumor          int32
inv-nodes      int32
node-caps      int32
deg-malig      int64
breast         int32
breast-quad    int32
irradiat       int32
Class          int32
dtype: object

In [26]:
# Create X and Y
X = data_rev.values[:,0:-1]  # all columns except the last one (the features)
Y = data_rev.values[:,-1]    # the last column (Class)

# .values returns a NumPy array, whereas .loc returns a DataFrame or Series.
# Arrays are lighter in weight because they drop the index and column labels.

In [27]:
print(X.shape)
print(Y.shape)

(286, 9)
(286,)


In [28]:
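# Standardize the features (zero mean, unit variance) before building the model.
# Note: the scaler is fit on all rows here, so test-set statistics leak into training;
# fitting it on the training split only is the safer pattern.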
from sklearn.preprocessing import StandardScaler
 
scaler = StandardScaler()
 
scaler.fit(X)
X = scaler.transform(X)
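
The cell above fits the scaler on all 286 rows before the train/test split. A minimal sketch of the split-first pattern follows, assuming synthetic data from `make_classification` as a stand-in for the encoded features; `X_demo`, `y_demo`, `X_tr`, `X_te`, `y_tr`, `y_te` and `pipe` are hypothetical names, and the score it prints is not comparable to the results in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the encoded data (286 x 9).
X_demo, y_demo = make_classification(n_samples=286, n_features=9,
                                     weights=[0.7, 0.3], random_state=10)

# Split first, so the test rows play no part in fitting the scaler.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=10, stratify=y_demo)

# The pipeline fits StandardScaler on X_tr only, then reuses those
# statistics to transform X_te at prediction time.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

scaler = pipe.named_steps['standardscaler']
print("Test accuracy:", pipe.score(X_te, y_te))
```

Wrapping the scaler and the model in one pipeline also keeps the two steps together if cross-validation is added later.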

# Running a basic model

In [29]:
from sklearn.model_selection import train_test_split
 
#Split the data into test and train
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2,  # Default value -> test_size= 0.25
                                                    random_state=10)

In [30]:
print(X_train.shape)
print(Y_train.shape)
print(X_test.shape)
print(Y_test.shape)

(228, 9)
(228,)
(58, 9)
(58,)


# Model using Logistic regression

In [31]:
from sklearn.linear_model import LogisticRegression
#create a model object
classifier = LogisticRegression()
#train the model object
classifier.fit(X_train,Y_train)      # fit is the function that is used for training the data

Y_pred = classifier.predict(X_test)
print(Y_pred)

[0 0 0 1 1 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0
 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0]


In [32]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
 
cfm=confusion_matrix(Y_test,Y_pred)
print(cfm)
 
print("Classification report: ")
 
print(classification_report(Y_test,Y_pred))
 
acc=accuracy_score(Y_test, Y_pred)
print("Accuracy of the model: ",acc)

[[36  5]
 [10  7]]
Classification report: 
              precision    recall  f1-score   support

           0       0.78      0.88      0.83        41
           1       0.58      0.41      0.48        17

    accuracy                           0.74        58
   macro avg       0.68      0.64      0.66        58
weighted avg       0.72      0.74      0.73        58

Accuracy of the model:  0.7413793103448276
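
The class-1 figures in the report follow directly from the confusion matrix. The short worked calculation below uses the matrix printed above, read in scikit-learn's layout (rows are true labels, columns are predicted labels); the variable names are illustrative.

```python
# Confusion matrix printed above: [[TN, FP], [FN, TP]]
cfm = [[36, 5],
       [10, 7]]
tn, fp = cfm[0]
fn, tp = cfm[1]

accuracy = (tp + tn) / (tn + fp + fn + tp)          # 43 / 58
precision = tp / (tp + fp)                          # 7 / 12
recall = tp / (tp + fn)                             # 7 / 17
f1 = 2 * precision * recall / (precision + recall)  # 14 / 29

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

Here `fn` (10) is the Type 2 error count and `fp` (5) is the Type 1 error count referred to later in the notebook.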


## Tuning the Logistic Regression

## Adjusting the Threshold

In [33]:
# store the predicted probabilities
y_pred_prob = classifier.predict_proba(X_test)
print(y_pred_prob)

[[0.84698552 0.15301448]
 [0.54054138 0.45945862]
 [0.90912491 0.09087509]
 [0.26985198 0.73014802]
 [0.40004512 0.59995488]
 [0.88476483 0.11523517]
 [0.93735829 0.06264171]
 [0.35556159 0.64443841]
 [0.36764321 0.63235679]
 [0.86426454 0.13573546]
 [0.9079807  0.0920193 ]
 [0.87573519 0.12426481]
 [0.85299619 0.14700381]
 [0.88067238 0.11932762]
 [0.25209489 0.74790511]
 [0.48179132 0.51820868]
 [0.52166145 0.47833855]
 [0.53207216 0.46792784]
 [0.7257515  0.2742485 ]
 [0.83487598 0.16512402]
 [0.76139611 0.23860389]
 [0.7434792  0.2565208 ]
 [0.24012395 0.75987605]
 [0.91775707 0.08224293]
 [0.77891858 0.22108142]
 [0.64157927 0.35842073]
 [0.94351163 0.05648837]
 [0.76833288 0.23166712]
 [0.39977181 0.60022819]
 [0.805799   0.194201  ]
 [0.91035718 0.08964282]
 [0.84939799 0.15060201]
 [0.9147948  0.0852052 ]
 [0.92025752 0.07974248]
 [0.61666383 0.38333617]
 [0.84875543 0.15124457]
 [0.61448    0.38552   ]
 [0.8632688  0.1367312 ]
 [0.91035718 0.08964282]
 [0.36635359 0.63364641]


In [34]:
y_pred_class=[]
for value in y_pred_prob[:,1]:
    if value > 0.45:
        y_pred_class.append(1)
    else:
        y_pred_class.append(0)
#print(y_pred_class)

In [35]:
from sklearn.metrics import confusion_matrix,accuracy_score,classification_report
cfm=confusion_matrix(Y_test,y_pred_class)
print(cfm)
acc=accuracy_score(Y_test, y_pred_class)
print("Accuracy of the model: ",acc)
print(classification_report(Y_test, y_pred_class))

[[35  6]
 [ 8  9]]
Accuracy of the model:  0.7586206896551724
              precision    recall  f1-score   support

           0       0.81      0.85      0.83        41
           1       0.60      0.53      0.56        17

    accuracy                           0.76        58
   macro avg       0.71      0.69      0.70        58
weighted avg       0.75      0.76      0.75        58



**Always search for the optimum threshold, where the overall error is at its minimum and the Type 2 error is also low.**

In [36]:
# Trial-and-error approach --> scan the thresholds and pick the one with the lowest total error and the lowest Type 2 error
for a in np.arange(0.4,0.61,0.01):
    predict_mine = np.where(y_pred_prob[:,1] > a, 1, 0) # acts as an if else statement
    cfm=confusion_matrix(Y_test, predict_mine)
    total_err=cfm[0,1]+cfm[1,0]  # Addition of Type 1 and Type 2 error
    print("Errors at threshold ", a, ":",total_err, " , type 2 error :", 
          cfm[1,0]," , type 1 error:", cfm[0,1])

Errors at threshold  0.4 : 16  , type 2 error : 8  , type 1 error: 8
Errors at threshold  0.41000000000000003 : 16  , type 2 error : 8  , type 1 error: 8
Errors at threshold  0.42000000000000004 : 15  , type 2 error : 8  , type 1 error: 7
Errors at threshold  0.43000000000000005 : 14  , type 2 error : 8  , type 1 error: 6
Errors at threshold  0.44000000000000006 : 14  , type 2 error : 8  , type 1 error: 6
Errors at threshold  0.45000000000000007 : 14  , type 2 error : 8  , type 1 error: 6
Errors at threshold  0.4600000000000001 : 15  , type 2 error : 9  , type 1 error: 6
Errors at threshold  0.4700000000000001 : 14  , type 2 error : 9  , type 1 error: 5
Errors at threshold  0.4800000000000001 : 15  , type 2 error : 10  , type 1 error: 5
Errors at threshold  0.4900000000000001 : 15  , type 2 error : 10  , type 1 error: 5
Errors at threshold  0.5000000000000001 : 15  , type 2 error : 10  , type 1 error: 5
Errors at threshold  0.5100000000000001 : 15  , type 2 error : 10  , type 1 error: 
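
The scan above can be turned into a programmatic choice. The sketch below picks the threshold with the lowest total error and breaks ties by the lowest Type 2 error; `sweep_thresholds` and `best_threshold` are hypothetical helpers, and the labels and probabilities are made up for illustration. Choosing the threshold on the test set gives an optimistic estimate, so a separate validation split would be the safer place to run it.

```python
import numpy as np

def sweep_thresholds(y_true, prob_pos, thresholds):
    """Return (threshold, total_error, type2, type1) for each threshold."""
    y_true = np.asarray(y_true)
    prob_pos = np.asarray(prob_pos)
    rows = []
    for t in thresholds:
        pred = (prob_pos > t).astype(int)
        type1 = int(np.sum((pred == 1) & (y_true == 0)))  # false positives
        type2 = int(np.sum((pred == 0) & (y_true == 1)))  # false negatives
        rows.append((round(float(t), 2), type1 + type2, type2, type1))
    return rows

def best_threshold(rows):
    """Lowest total error first, then lowest Type 2 error."""
    return min(rows, key=lambda r: (r[1], r[2]))

# Made-up labels and positive-class probabilities.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
prob_pos = [0.10, 0.30, 0.465, 0.555, 0.445, 0.585, 0.80, 0.20]

rows = sweep_thresholds(y_true, prob_pos, np.linspace(0.40, 0.60, 21))
best = best_threshold(rows)
print("threshold, total, type 2, type 1:", best)
```

Using `np.linspace` with rounding also avoids printing values such as `0.41000000000000003`. If missing a positive case is costlier than a false alarm, the sort key can be swapped to `(r[2], r[1])`.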

# Model using Decision Tree

In [37]:
# predicting using Decision Tree Classifier.
from sklearn.tree import DecisionTreeClassifier

model_DT = DecisionTreeClassifier(random_state=10,
                                   criterion="gini")

# fit the model on data and predict the values
model_DT.fit(X_train,Y_train)      # fit is the function that is used for training the data
Y_pred = model_DT.predict(X_test) # Validation Data
#print(Y_pred)
print(list(zip(Y_test,Y_pred)))

[(0, 0), (1, 1), (0, 0), (0, 0), (1, 0), (0, 1), (0, 0), (0, 1), (0, 1), (1, 0), (0, 0), (0, 0), (0, 1), (0, 0), (1, 1), (1, 0), (1, 0), (0, 0), (0, 1), (0, 0), (0, 0), (0, 0), (1, 1), (0, 0), (0, 1), (0, 1), (0, 0), (0, 0), (1, 1), (0, 1), (0, 0), (0, 0), (0, 0), (0, 0), (0, 1), (1, 0), (0, 0), (0, 0), (1, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 1), (1, 0), (1, 0), (0, 1), (0, 0), (0, 0), (0, 0), (1, 0), (0, 0), (1, 0), (1, 1), (0, 1), (1, 0), (1, 1)]


In [38]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
 
cfm=confusion_matrix(Y_test,Y_pred)
print(cfm)
 
print("Classification report: ")
 
print(classification_report(Y_test,Y_pred))
 
acc=accuracy_score(Y_test, Y_pred)
print("Accuracy of the model: ",acc)

[[29 12]
 [11  6]]
Classification report: 
              precision    recall  f1-score   support

           0       0.72      0.71      0.72        41
           1       0.33      0.35      0.34        17

    accuracy                           0.60        58
   macro avg       0.53      0.53      0.53        58
weighted avg       0.61      0.60      0.61        58

Accuracy of the model:  0.603448275862069


## Tuned Decision Tree Model

In [39]:
# predicting using Decision Tree Classifier.
from sklearn.tree import DecisionTreeClassifier

model_DT = DecisionTreeClassifier(random_state=10,
                                   criterion="gini",
                                 splitter="best",
                                 min_samples_leaf=3,
                                 min_samples_split=5,
                                 )

# hyperparameters :-   min_samples_leaf, min_samples_split, max_depth, max_features, max_leaf_nodes

# fit the model on data and predict the values
model_DT.fit(X_train,Y_train)      # fit is the function that is used for training the data
Y_pred = model_DT.predict(X_test) # Validation Data
#print(Y_pred)
print(list(zip(Y_test,Y_pred)))

[(0, 0), (1, 0), (0, 0), (0, 0), (1, 1), (0, 1), (0, 0), (0, 1), (0, 1), (1, 0), (0, 0), (0, 0), (0, 0), (0, 0), (1, 0), (1, 0), (1, 0), (0, 0), (0, 1), (0, 0), (0, 0), (0, 0), (1, 0), (0, 0), (0, 1), (0, 0), (0, 0), (0, 0), (1, 1), (0, 1), (0, 0), (0, 0), (0, 0), (0, 0), (0, 1), (1, 0), (0, 0), (0, 0), (1, 0), (0, 1), (0, 0), (0, 1), (0, 0), (0, 0), (0, 1), (1, 0), (1, 0), (0, 0), (0, 1), (0, 0), (0, 0), (1, 0), (0, 0), (1, 0), (1, 0), (0, 0), (1, 0), (1, 1)]


In [40]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
 
cfm=confusion_matrix(Y_test,Y_pred)
print(cfm)
 
print("Classification report: ")
 
print(classification_report(Y_test,Y_pred))
 
acc=accuracy_score(Y_test, Y_pred)
print("Accuracy of the model: ",acc)

[[30 11]
 [14  3]]
Classification report: 
              precision    recall  f1-score   support

           0       0.68      0.73      0.71        41
           1       0.21      0.18      0.19        17

    accuracy                           0.57        58
   macro avg       0.45      0.45      0.45        58
weighted avg       0.54      0.57      0.56        58

Accuracy of the model:  0.5689655172413793
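
The hyperparameters above were set by hand. One common alternative is a cross-validated grid search over the training split; the sketch below assumes synthetic data from `make_classification` as a stand-in for the encoded features, and the grid values, the `recall` scoring choice and all variable names are illustrative rather than taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with the same shape as the encoded data (286 x 9).
X_demo, y_demo = make_classification(n_samples=286, n_features=9,
                                     weights=[0.7, 0.3], random_state=10)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=10, stratify=y_demo)

# Candidate values for the hyperparameters named in the comment above.
param_grid = {
    "max_depth": [2, 3, 4, None],
    "min_samples_leaf": [1, 3, 5],
    "min_samples_split": [2, 5, 10],
}

# 5-fold cross-validation on the training split only; recall of class 1
# is scored because false negatives are the costly error here.
search = GridSearchCV(DecisionTreeClassifier(random_state=10),
                      param_grid, scoring="recall", cv=5)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Cross-validated recall:", search.best_score_)
print("Test recall:", search.score(X_te, y_te))
```

Because the search only sees the training folds, the test set stays available for a final, unbiased check of the chosen tree.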


# Model using Random Forest

In [41]:
#predicting using the Random_Forest_Classifier
from sklearn.ensemble import RandomForestClassifier
 
model_RandomForest=RandomForestClassifier(n_estimators=1050,                  # number of trees (default value is 100)
                                          random_state=10, bootstrap=True,   # bootstrap=True (the default) --> each tree is trained on a sample drawn with replacement
                                         n_jobs=-1)                          # n_jobs=-1 --> use all available CPU cores to speed up the process
 
#fit the model on the data and predict the values
model_RandomForest.fit(X_train,Y_train)
 
Y_pred=model_RandomForest.predict(X_test)

In [42]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
 
cfm=confusion_matrix(Y_test,Y_pred)
print(cfm)
 
print("Classification report: ")
 
print(classification_report(Y_test,Y_pred))
 
acc=accuracy_score(Y_test, Y_pred)
print("Accuracy of the model: ",acc)

[[37  4]
 [ 9  8]]
Classification report: 
              precision    recall  f1-score   support

           0       0.80      0.90      0.85        41
           1       0.67      0.47      0.55        17

    accuracy                           0.78        58
   macro avg       0.74      0.69      0.70        58
weighted avg       0.76      0.78      0.76        58

Accuracy of the model:  0.7758620689655172


# Model using Extra_Trees_Classifier

In [43]:
# Predicting using the ExtraTreesClassifier
from sklearn.ensemble import ExtraTreesClassifier
 
model_EXT=ExtraTreesClassifier(n_estimators=200,                  # number of trees to build (default value is 100)
                                          random_state=10, bootstrap=True,   # bootstrap=True --> each tree is trained on a sample drawn with replacement (the ExtraTrees default is False)
                                         n_jobs=-1)                          # n_jobs=-1 --> use all available CPU cores to speed up the process
 
#fit the model on the data and predict the values
model_EXT.fit(X_train,Y_train)
 
Y_pred=model_EXT.predict(X_test)

In [44]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
 
cfm=confusion_matrix(Y_test,Y_pred)
print(cfm)
 
print("Classification report: ")
 
print(classification_report(Y_test,Y_pred))
 
acc=accuracy_score(Y_test, Y_pred)
print("Accuracy of the model: ",acc)

[[36  5]
 [10  7]]
Classification report: 
              precision    recall  f1-score   support

           0       0.78      0.88      0.83        41
           1       0.58      0.41      0.48        17

    accuracy                           0.74        58
   macro avg       0.68      0.64      0.66        58
weighted avg       0.72      0.74      0.73        58

Accuracy of the model:  0.7413793103448276


# Using Multiple Models :- Decision_Tree, SVC, KNeighbors and Logistic Regression

In [45]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
 
# first, initialize the classifiers
tree= DecisionTreeClassifier(random_state=10) # using the random state for reproducibility
knn= KNeighborsClassifier(n_neighbors=5,metric='euclidean')
svm= SVC(kernel="rbf", gamma=0.1, C=1,random_state=10)    # This is Base SVM
logreg=LogisticRegression(multi_class="multinomial",random_state=10)

In [46]:
# now, create a list with the objects 
models= [tree, knn, svm, logreg]

In [47]:
from sklearn.metrics import confusion_matrix, accuracy_score,classification_report
 
for model in models:
    model.fit(X_train, Y_train) # fit the model
    Y_pred= model.predict(X_test) # then predict on the test set
    accuracy= accuracy_score(Y_test, Y_pred) 
    clf_report= classification_report(Y_test, Y_pred) 
    print(confusion_matrix(Y_test,Y_pred))
    print("The accuracy of the ",type(model).__name__, " model is ", accuracy*100 )
    print("Classification report:\n", clf_report)
    print("\n")

[[29 12]
 [11  6]]
The accuracy of the  DecisionTreeClassifier  model is  60.3448275862069
Classification report:
               precision    recall  f1-score   support

           0       0.72      0.71      0.72        41
           1       0.33      0.35      0.34        17

    accuracy                           0.60        58
   macro avg       0.53      0.53      0.53        58
weighted avg       0.61      0.60      0.61        58



[[37  4]
 [10  7]]
The accuracy of the  KNeighborsClassifier  model is  75.86206896551724
Classification report:
               precision    recall  f1-score   support

           0       0.79      0.90      0.84        41
           1       0.64      0.41      0.50        17

    accuracy                           0.76        58
   macro avg       0.71      0.66      0.67        58
weighted avg       0.74      0.76      0.74        58



[[38  3]
 [11  6]]
The accuracy of the  SVC  model is  75.86206896551724
Classification report:
               pr

# Optimization Technique: Implementation of SMOTE

In [48]:
print("Before OverSampling, counts of label '1': ", (sum(Y_train == 1)))
print("Before OverSampling, counts of label '0': ", (sum(Y_train == 0)))
  
# import SMOTE from imblearn library
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state = 10,k_neighbors=5)
X_train_res, Y_train_res = sm.fit_resample(X_train, Y_train)
  
print('After OverSampling, the shape of train_X: ', (X_train_res.shape))
print('After OverSampling, the shape of train_y: ', (Y_train_res.shape))
  
print("After OverSampling, counts of label '1': ", (sum(Y_train_res == 1)))
print("After OverSampling, counts of label '0': ", (sum(Y_train_res == 0)))

Before OverSampling, counts of label '1':  68
Before OverSampling, counts of label '0':  160
After OverSampling, the shape of train_X:  (320, 9)
After OverSampling, the shape of train_y:  (320,)
After OverSampling, counts of label '1':  160
After OverSampling, counts of label '0':  160


In [49]:
from sklearn.linear_model import LogisticRegression
#create a model object
classifier = LogisticRegression()
#train the model object
classifier.fit(X_train,Y_train)      # fit is the function that is used for training the data

Y_pred = classifier.predict(X_test)
print(Y_pred)

[0 0 0 1 1 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0
 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0]


In [50]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
 
cfm=confusion_matrix(Y_test,Y_pred)
print(cfm)
 
print("Classification report: ")
 
print(classification_report(Y_test,Y_pred))
 
acc=accuracy_score(Y_test, Y_pred)
print("Accuracy of the model: ",acc)

[[36  5]
 [10  7]]
Classification report: 
              precision    recall  f1-score   support

           0       0.78      0.88      0.83        41
           1       0.58      0.41      0.48        17

    accuracy                           0.74        58
   macro avg       0.68      0.64      0.66        58
weighted avg       0.72      0.74      0.73        58

Accuracy of the model:  0.7413793103448276


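The results after SMOTE are identical to those before it, and the recall shows no change. This is expected here: the classifier in cell 49 is fit on the original `X_train`, `Y_train` rather than the resampled `X_train_res`, `Y_train_res`, so the oversampled data never reaches the model and these numbers do not measure the effect of SMOTE.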

# Conclusion

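The logistic regression model with the decision threshold lowered to 0.45 reaches an accuracy of 75.86% on the 58-row test set, up from 74.14% at the default threshold. Its Type II error count drops from 10 to 8 (out of 17 recurrence cases, a recall of 0.53), fewer false negatives than any other confusion matrix reported above, which makes it a strong choice for applications where identifying positive cases is crucial. The Random Forest has a slightly higher accuracy (77.59%) but 9 false negatives.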