# Breast cancer prediction | Logistic Regression

This notebook fits a logistic regression model to the breast cancer dataset bundled with scikit-learn. Logistic regression is well suited to binary classification, and the task here is to distinguish malignant from benign tumors. The dataset contains 569 samples with 30 numeric features each, which is enough to examine how well the model predicts the diagnosis.

### 1. Imports

In [8]:
import pandas as pd # data processing

from sklearn.datasets import load_breast_cancer #dataset
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


### 2. The dataset

In [9]:
# Dataset and variables
data = load_breast_cancer()
x = pd.DataFrame(data.data, columns=data.feature_names)  # pass the names directly; wrapping them in a list would create a MultiIndex
y = pd.Series(data.target)

x.head()

,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [11]:
print(y.shape, x.shape)

(569,) (569, 30)
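
The later results are easier to read once the label encoding is known. As a minimal sketch (the variable names label_names and class_counts are illustrative and not part of the original notebook), the class names and class sizes can be read from the loaded dataset:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# target_names[i] is the class name that corresponds to label i
label_names = list(data.target_names)

# Number of samples per label (index 0 -> label 0, index 1 -> label 1)
class_counts = np.bincount(data.target)

for label, (name, count) in enumerate(zip(label_names, class_counts)):
    print(f"{label} = {name}: {count} samples")
```

In this dataset, label 0 is malignant (212 samples) and label 1 is benign (357 samples), so the classes are moderately imbalanced.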


### 3. Model

In [None]:
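# Train-test split: hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=9)

# Fit a logistic regression model on the training set
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict class labels for the held-out test set
y_pred = model.predict(X_test)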

- train_test_split is a scikit-learn function that randomly splits a dataset into a training set and a test set.
- x is the feature matrix and y is the target variable.
- test_size=0.3 holds out 30% of the samples as the test set; the remaining 70% are used for training.
- random_state=9 fixes the seed of the random number generator, so the same split is produced on every run.

- LogisticRegression() creates a scikit-learn logistic regression model with default settings.
- fit trains the model on the training data.
- X_train is the feature matrix of the training set, and y_train holds the corresponding target values.

- predict returns a predicted class label (0 or 1) for each row it is given.
- X_test is the feature matrix of the test set, which the model did not see during training.
- y_pred contains the predicted labels for the rows of X_test.
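
The steps above can be sketched end to end as follows. This is an illustrative, self-contained version rather than the notebook's exact cell: it loads the data with return_X_y=True, and it sets max_iter=10000, which is an assumption not present in the original and is there only so the solver has room to converge on the unscaled features. Because of that change, the predictions may differ slightly from the ones shown in this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load features and labels as NumPy arrays
X, y = load_breast_cancer(return_X_y=True)

# Same split parameters as in the notebook: 30% test, fixed seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=9
)
print(X_train.shape, X_test.shape)

# max_iter=10000 is an assumption for this sketch (the notebook uses the default)
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)

# predict returns hard labels; predict_proba returns one probability per class
y_pred = model.predict(X_test)
proba = model.predict_proba(X_test)

# For a binary model, the predicted label is 1 when P(class 1) exceeds 0.5
labels_from_proba = (proba[:, 1] > 0.5).astype(int)
print((y_pred == labels_from_proba).all())
```

With 569 samples, the split yields 398 training rows and 171 test rows. The last line illustrates that predict is a thresholded view of predict_proba, which is useful when a threshold other than 0.5 is wanted.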

In [39]:
y_pred

array([1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1,
       1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1,
       0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1,
       1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1,
       1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1])

### 4. Performance 

In [38]:

# Fraction of test samples that are classified correctly
accuracy = accuracy_score(y_test, y_pred)
# Confusion matrix: rows are true classes, columns are predicted classes.
# Its four entries count the true and false predictions for each class (TN, FP, FN, TP).
conf_matrix = confusion_matrix(y_test, y_pred)
# Precision, recall, and F1-score for each class
class_report = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy}')
print(f'Confusion Matrix:\n{conf_matrix}')
print(f'Classification Report:\n{class_report}')

Accuracy: 0.9532163742690059
Confusion Matrix:
[[ 56   6]
 [  2 107]]
Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.90      0.93        62
           1       0.95      0.98      0.96       109

    accuracy                           0.95       171
   macro avg       0.96      0.94      0.95       171
weighted avg       0.95      0.95      0.95       171



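On the 171 test samples, the model classifies 163 correctly, an accuracy of about 95.3%. In scikit-learn's encoding of this dataset, class 0 is malignant and class 1 is benign, so the confusion matrix shows 6 malignant tumors predicted as benign and 2 benign tumors predicted as malignant. Logistic regression is therefore effective on this split, although the 6 missed malignant cases (recall 0.90 for class 0) are the more serious kind of error in a diagnostic setting, and the figures come from a single train-test split.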
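
The reported figures can be re-derived from the confusion matrix alone. The sketch below copies the matrix printed above into a NumPy array (the variable names are illustrative) and recomputes accuracy and the per-class precision, recall, and F1-score:

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class
cm = np.array([[56, 6],
               [2, 107]])

# Correct predictions lie on the diagonal
accuracy = np.trace(cm) / cm.sum()

# Precision: correct predictions divided by all predictions of that class (column sums)
precision = np.diag(cm) / cm.sum(axis=0)

# Recall: correct predictions divided by all true members of that class (row sums)
recall = np.diag(cm) / cm.sum(axis=1)

# F1-score: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy: {accuracy:.4f}")
for label in (0, 1):
    print(f"class {label}: precision={precision[label]:.2f} "
          f"recall={recall[label]:.2f} f1={f1[label]:.2f}")
```

The rounded values agree with the classification report: for class 0, precision is 56/58 and recall is 56/62; for class 1, precision is 107/113 and recall is 107/109.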