# [9660] Exercise # 5 - Logistic Regression
Data file: https://raw.githubusercontent.com/vjavaly/Baruch-CIS-9660/main/data/breast_cancer_diagnosis.csv

## Exercise # 5 Requirements
* Load data into dataframe
* Examine data
  * Check for missing values
* Prepare data for model training and testing
  * Drop non-numeric columns
  * Replace missing values (use any technique)
  * Separate independent and dependent variables
  * Split train and test sets
* Train logistic regression model
  * If you get errors, change appropriate hyperparameters to eliminate errors
* Calculate and display model performance metrics
  * accuracy
  * classification report
  * confusion matrix

In [1]:
from datetime import datetime
print(f'Run time: {datetime.now().strftime("%D %T")}')

Run time: 10/10/24 22:44:31


### Import libraries

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.impute import SimpleImputer
import numpy as np

### Load data

In [3]:
# Read data from file (breast_cancer_diagnosis.csv) into dataframe
df = pd.read_csv('https://raw.githubusercontent.com/vjavaly/Baruch-CIS-9660/main/data/breast_cancer_diagnosis.csv')

### Examine data

In [4]:
# Review dataframe shape
df.shape

(569, 13)

In [5]:
# Display first few rows of dataframe
df.head()

| | id | name | radius | texture | perimeter | area | smoothness | compactness | concavity | symmetry | fractal_dimension | age | diagnosis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ID842302 | Glynnis Munson | NaN | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.2419 | 0.07871 | 35 | 1 |
| 1 | ID842517 | Lana Behrer | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.1812 | 0.05667 | 27 | 1 |
| 2 | ID84300903 | Devondra Vanvalkenburgh | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.2069 | 0.05999 | 31 | 1 |
| 3 | ID84348301 | Glory Maravalle | NaN | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.2597 | 0.09744 | 49 | 1 |
| 4 | ID84358402 | Mellie Mccurdy | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1809 | 0.05883 | 20 | 1 |


Note: the radius column contains missing values (NaN), for example in rows 0 and 3.

### Prepare data

#### Drop non-numeric variables
Remember to pass "inplace=True" so that the dataframe itself is modified.

In [6]:
df.drop(['name','id'], axis=1, inplace=True)
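
To illustrate why "inplace=True" matters, here is a minimal sketch on a made-up dataframe (the `toy` data and its values are hypothetical, not taken from the exercise file). Without "inplace=True", `drop` returns a new dataframe and leaves the original untouched.

```python
import pandas as pd

# Hypothetical toy dataframe, not the exercise data
toy = pd.DataFrame({"id": ["A1", "B2"], "name": ["x", "y"], "radius": [1.0, 2.0]})

# Without inplace=True, drop() returns a new dataframe; toy is unchanged
dropped = toy.drop(["name", "id"], axis=1)
print(list(toy.columns))      # original still has all three columns
print(list(dropped.columns))  # only 'radius' remains

# With inplace=True, toy itself is modified and drop() returns None
result = toy.drop(["name", "id"], axis=1, inplace=True)
print(list(toy.columns), result)
```

Reassigning the result (`df = df.drop(...)`) is an equivalent alternative to "inplace=True".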

In [7]:
# Display first few rows of updated dataframe
df.head()

| | radius | texture | perimeter | area | smoothness | compactness | concavity | symmetry | fractal_dimension | age | diagnosis |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.2419 | 0.07871 | 35 | 1 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.1812 | 0.05667 | 27 | 1 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.2069 | 0.05999 | 31 | 1 |
| 3 | NaN | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.2597 | 0.09744 | 49 | 1 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1809 | 0.05883 | 20 | 1 |


#### Check for missing values

In [8]:
df.isnull().sum()

| | 0 |
|---|---|
| radius | 71 |
| texture | 0 |
| perimeter | 0 |
| area | 0 |
| smoothness | 0 |
| compactness | 0 |
| concavity | 0 |
| symmetry | 0 |
| fractal_dimension | 0 |
| age | 0 |


#### Replace missing values

In [9]:
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')

In [10]:
cols_to_impute = ['radius']

In [11]:
df[cols_to_impute] = imp_mean.fit_transform(df[cols_to_impute])

In [12]:
# Display first few rows of updated dataframe
df.head()

| | radius | texture | perimeter | area | smoothness | compactness | concavity | symmetry | fractal_dimension | age | diagnosis |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.326635 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.2419 | 0.07871 | 35 | 1 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.1812 | 0.05667 | 27 | 1 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.2069 | 0.05999 | 31 | 1 |
| 3 | 14.326635 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.2597 | 0.09744 | 49 | 1 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1809 | 0.05883 | 20 | 1 |
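
The cells above fit the imputer on the full dataframe before the train/test split. A common variant, sketched here under the assumption that the split happens first, fits the imputer on the training rows only and reuses that mean for the test rows, so test values do not influence the fill value. The `train` and `test` arrays are hypothetical toy values, not the exercise data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy values standing in for a 'radius' column
train = np.array([[1.0], [3.0], [np.nan]])
test = np.array([[np.nan], [10.0]])

imp = SimpleImputer(missing_values=np.nan, strategy="mean")
train_filled = imp.fit_transform(train)  # learns the mean of observed train values: (1 + 3) / 2
test_filled = imp.transform(test)        # reuses the training mean; test values are not used to fit

print(imp.statistics_)        # learned fill value per column
print(train_filled.ravel())
print(test_filled.ravel())
```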


#### Check for missing values again

In [13]:
df.isnull().sum()

| | 0 |
|---|---|
| radius | 0 |
| texture | 0 |
| perimeter | 0 |
| area | 0 |
| smoothness | 0 |
| compactness | 0 |
| concavity | 0 |
| symmetry | 0 |
| fractal_dimension | 0 |
| age | 0 |


### Separate independent and dependent variables
* Independent variables: all remaining columns except diagnosis
* Dependent variable: diagnosis

In [14]:
X = df.drop('diagnosis', axis=1)
y = df['diagnosis']

In [15]:
# Display first few rows of independent variables
X.head()


| | radius | texture | perimeter | area | smoothness | compactness | concavity | symmetry | fractal_dimension | age |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.326635 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.2419 | 0.07871 | 35 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.1812 | 0.05667 | 27 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.2069 | 0.05999 | 31 |
| 3 | 14.326635 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.2597 | 0.09744 | 49 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1809 | 0.05883 | 20 |


In [16]:
# Display first few rows of dependent variable
y.head()

| | diagnosis |
|---|---|
| 0 | 1 |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |


### Split data into training and test sets

In [25]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    test_size=0.3,
                                                    random_state=42)
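
The `stratify=y` argument asks `train_test_split` to keep the class proportions roughly equal in both subsets. A minimal sketch on hypothetical imbalanced labels (`X_toy` and `y_toy` are made up for illustration) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 zeros and 30 ones
y_toy = np.array([0] * 70 + [1] * 30)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, test_size=0.3, random_state=42
)

# Share of class 1 overall, in train, and in test;
# with stratify these stay close to the overall 0.30
print(y_toy.mean(), y_tr.mean(), y_te.mean())
```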

### Train Logistic Regression model

In [26]:
model = LogisticRegression()

### If fitting raises an error or warning, review the message, consult the LogisticRegression documentation, and adjust the model hyperparameters accordingly

In [27]:
model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [28]:
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
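
The warning also suggests scaling the data as an alternative to raising max_iter. The following is a rough sketch of that option using a pipeline with StandardScaler. As an assumption for illustration, it uses scikit-learn's bundled `load_breast_cancer` dataset as a stand-in, which is not the exercise CSV (its columns and label encoding differ), and the names `X_demo`, `y_demo`, and `scaled_model` are hypothetical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data: scikit-learn's bundled breast cancer dataset
# (an assumption for this sketch, not the exercise CSV)
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.3, random_state=42
)

# The scaler is fit on the training split only, then applied to the test split
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())
scaled_model.fit(X_tr, y_tr)

print("iterations used:", scaled_model[-1].n_iter_)
print("test accuracy:", round(scaled_model.score(X_te, y_te), 4))
```

Standardized features typically let the default solver converge in fewer iterations, so the default max_iter is often sufficient; whether that holds for the exercise data would need to be checked by running it.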

### Test model

In [29]:
# Test the model: generate predictions for the test set
predictions = model.predict(X_test)

# Print predictions
print(predictions)

[0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 1 0 1 1 1 1 0 1 0 0 1 1 1 0 1 0 0 0 1 0 0 1
 1 0 1 1 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 1 1 0 0 1 1 1 1 0 0 0 0 1 0 0
 1 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 1 1 0 1 0 1 1 0 1 0
 1 0 0 1 0 0 1 0 0 1 0 0 0 0 1 0 0 1 0 1 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0
 0 1 1 0 0 0 0 1 0 1 1 1 0 0 0 0 0 0 0 0 0 0 1]


### Model evaluation

In [22]:
# Print model accuracy
accuracy = model.score(X_test, y_test)
print("accuracy =", round((accuracy * 100), 2), "%")

accuracy = 90.35 %


In [23]:
# Print classification report
# target_names must follow the sorted label order (0 first, then 1)
target_names = ['0', '1']
print(classification_report(y_test, predictions, target_names=target_names))

              precision    recall  f1-score   support

           0       0.92      0.93      0.92        72
           1       0.88      0.86      0.87        42

    accuracy                           0.90       114
   macro avg       0.90      0.89      0.90       114
weighted avg       0.90      0.90      0.90       114



In [24]:
# Print confusion matrix
cnf_matrix = confusion_matrix(y_test, predictions)
cnf_matrix

array([[67,  5],
       [ 6, 36]])
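
The accuracy and the per-class figures in the classification report can be recomputed from this confusion matrix. The sketch below copies the matrix shown above; in scikit-learn's layout, rows are actual classes and columns are predicted classes, in sorted label order (0, then 1). The variable names are chosen here for illustration.

```python
import numpy as np

# Confusion matrix copied from the output above:
# rows = actual class, columns = predicted class, label order (0, 1)
cm = np.array([[67, 5],
               [6, 36]])

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
precision = np.diag(cm) / cm.sum(axis=0)  # per class: correct / predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)     # per class: correct / actually that class
f1 = 2 * precision * recall / (precision + recall)

print("accuracy =", round(accuracy * 100, 2), "%")
print("precision:", np.round(precision, 2))
print("recall:   ", np.round(recall, 2))
print("f1-score: ", np.round(f1, 2))
```

The row sums (72 and 42) are the support values in the classification report, and 103 correct predictions out of 114 give the reported accuracy.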