# 1. Import Libraries and Load Dataset

## 1. Loading the Dataset:
   - The Breast Cancer dataset from sklearn contains 569 samples and 30 numerical features. It also has a binary target (0 for malignant, 1 for benign).
   - It is suitable for binary classification problems.
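
As a quick sanity check on the figures above, the sketch below loads the dataset and inspects its shape, label encoding, and class balance. It relies only on the `data`, `target`, and `target_names` attributes returned by `load_breast_cancer()`; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

n_samples, n_features = data.data.shape
# target_names is indexed by label: position 0 is 'malignant', position 1 is 'benign'
label_names = {i: str(name) for i, name in enumerate(data.target_names)}
# np.bincount counts how many samples carry each integer label
class_counts = {label_names[i]: int(c) for i, c in enumerate(np.bincount(data.target))}

print(n_samples, n_features)
print(label_names)
print(class_counts)
```

The class counts show that the two classes are not perfectly balanced, which is worth keeping in mind when reading accuracy figures later on.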

## 2. Checking for Missing Values:
   - First, I checked for missing values using df.isnull().sum().
   - No missing values were found in the dataset, so no imputation was necessary.
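
When a frame has many columns, the per-column listing can be condensed into a single number. The sketch below shows this on a hypothetical toy frame (`toy` is not part of the notebook) that contains one missing entry.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing entry, for illustration only
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

per_column = toy.isnull().sum()                # missing count for each column
total_missing = int(per_column.sum())          # one number for the whole frame
has_missing = bool(toy.isnull().values.any())  # True if any cell is missing

print(per_column)
print(total_missing, has_missing)
```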

## 3. Splitting the Data:
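   - The data was divided into training and testing sets using train_test_split (80% training, 20% testing, random_state=42), giving 455 training samples and 114 test samples.
   - Holding out a test set does not by itself make a model generalize, but it makes it possible to estimate how well each model performs on unseen data.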
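
One optional refinement, not used in this notebook, is a stratified split, which keeps the class proportions roughly equal in both subsets. The sketch below assumes the same `test_size` and `random_state` as above and adds `stratify=y`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# stratify=y is an assumption added here; the notebook's split does not use it
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
# Fraction of class 1 (benign) overall, in the training set, and in the test set
print(round(y.mean(), 3), round(y_train.mean(), 3), round(y_test.mean(), 3))
```

Because stratification changes which samples land in the test set, the scores it produces would not match the outputs shown in this notebook.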

## 4. Feature Scaling:
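   - Since the features have very different ranges (for example, mean area is in the hundreds or thousands while mean smoothness is around 0.1), standard scaling was applied using StandardScaler().
   - StandardScaler transforms each feature to have a mean of 0 and a standard deviation of 1. The scaler is fitted on the training set only and then applied to the test set, so no test-set statistics leak into training.
   - Scaling matters most for distance-based and margin-based algorithms such as k-NN and SVM, and it helps regularized Logistic Regression converge. Tree-based models (Decision Tree, Random Forest) are largely insensitive to it.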
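
The transformation itself is simple enough to write by hand. The sketch below reproduces it with NumPy on hypothetical random data (two features on very different scales) and reuses the training statistics for the test set.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: two features on very different scales
X_train = rng.normal(loc=[500.0, 0.1], scale=[200.0, 0.02], size=(100, 2))
X_test = rng.normal(loc=[500.0, 0.1], scale=[200.0, 0.02], size=(20, 2))

# Statistics come from the training set only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # population std (ddof=0), which is what StandardScaler uses

X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std  # reuse the training statistics

print(X_train_scaled.mean(axis=0).round(6), X_train_scaled.std(axis=0).round(6))
print(X_test_scaled.mean(axis=0).round(3), X_test_scaled.std(axis=0).round(3))
```

The scaled training features have mean 0 and standard deviation 1 by construction, while the scaled test features are only approximately standardized, which is expected.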

In [6]:
import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

data = datasets.load_breast_cancer()

df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

df.head()

,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0


# 2. Preprocessing the Data
- Check for Missing Values
- Perform Feature Scaling

In [9]:
print("Missing Values:\n", df.isnull().sum())

# Separate features and target
X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Missing Values:
 mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
target                     0
dtype: int64


# 3. Implement Classification Algorithms

## 1. Logistic Regression

### Description:
   - Logistic Regression is a linear model that uses the sigmoid function to predict probabilities for a binary classification problem.
   - The model finds the best linear boundary to separate classes.

### Why Suitable?
   - Effective when the classes are approximately linearly separable.
   - Provides class probabilities and per-feature coefficients, which makes it relatively interpretable for medical datasets.
   - Computationally efficient.
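
To make the role of the sigmoid concrete, the sketch below computes a prediction by hand. The weights, bias, and sample are hypothetical values chosen for illustration; they are not taken from the fitted model.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and one scaled sample (not from the fitted model)
w = np.array([0.8, -1.2])
b = 0.3
x = np.array([1.0, 0.5])

z = w @ x + b                    # linear score: 0.8 - 0.6 + 0.3 = 0.5
p_class1 = sigmoid(z)            # estimated probability of class 1
predicted = int(p_class1 >= 0.5) # threshold at 0.5 to get a hard label

print(round(float(p_class1), 4), predicted)
```

The decision boundary is the set of points where the linear score is 0, which is where the sigmoid returns exactly 0.5.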

In [11]:
logistic = LogisticRegression()
logistic.fit(X_train_scaled, y_train)
lr_preds = logistic.predict(X_test_scaled)
print("Logistic Regression Accuracy:", accuracy_score(y_test, lr_preds))
print(confusion_matrix(y_test, lr_preds))
print(classification_report(y_test, lr_preds))


Logistic Regression Accuracy: 0.9736842105263158
[[41  2]
 [ 1 70]]
              precision    recall  f1-score   support

           0       0.98      0.95      0.96        43
           1       0.97      0.99      0.98        71

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114
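
The figures in the report can be derived directly from the confusion matrix. The sketch below does the arithmetic for the Logistic Regression matrix printed above, using the scikit-learn convention that rows are true labels and columns are predicted labels.

```python
import numpy as np

# Confusion matrix printed above: rows are true labels, columns are predictions
cm = np.array([[41, 2],
               [1, 70]])

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
precision = np.diag(cm) / cm.sum(axis=0)  # per class: correct / predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)     # per class: correct / actually that class
f1 = 2 * precision * recall / (precision + recall)

print(round(float(accuracy), 4))
print(precision.round(2), recall.round(2), f1.round(2))
```

Read this way, the matrix says that 2 malignant samples (class 0) were predicted as benign and 1 benign sample (class 1) was predicted as malignant.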



## 2. Decision Tree Classifier

### Description:
   - A Decision Tree recursively partitions the data into subsets using threshold conditions on feature values.
   - The result is a tree structure in which internal nodes represent decisions and leaf nodes represent predicted classes.

### Why Suitable?
   - Easy to interpret and visualize.
   - Suitable for both linear and non-linear relationships.
   - Works well even with small datasets.
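
The interpretability claim can be illustrated by printing a tree as text rules. The sketch below fits a deliberately shallow tree (`max_depth=2` is an illustrative choice, not the notebook's setting) on the unscaled features and prints it with `export_text`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()

# max_depth=2 is an illustrative choice to keep the printout short;
# the notebook's tree is grown without a depth limit
shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=42)
shallow_tree.fit(data.data, data.target)

# Each line of the printout is a threshold test or a leaf with its predicted class
rules = export_text(shallow_tree, feature_names=list(data.feature_names))
print(rules)
```

Fitting on unscaled features keeps the thresholds in the original units, which is usually easier to read. Trees split on one feature at a time, so scaling does not change which splits they can find.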

In [12]:
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train_scaled, y_train)
dt_preds = dt.predict(X_test_scaled)
print("Decision Tree Accuracy:", accuracy_score(y_test, dt_preds))
print(confusion_matrix(y_test, dt_preds))
print(classification_report(y_test, dt_preds))


Decision Tree Accuracy: 0.9473684210526315
[[40  3]
 [ 3 68]]
              precision    recall  f1-score   support

           0       0.93      0.93      0.93        43
           1       0.96      0.96      0.96        71

    accuracy                           0.95       114
   macro avg       0.94      0.94      0.94       114
weighted avg       0.95      0.95      0.95       114



## 3. Random Forest Classifier

### Description:
   - Random Forest is an ensemble method that combines multiple Decision Trees to reduce overfitting and increase accuracy.
   - It selects random samples of data and features to build multiple trees, then aggregates their predictions.

### Why Suitable?
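   - Fairly robust to noise and outliers, because errors of individual trees tend to average out.
   - Handles datasets with many features, such as the 30 features here, without manual feature selection.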
   - Provides feature importance insights.
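
The feature importance insights mentioned above are available through the `feature_importances_` attribute of a fitted forest. The sketch below fits a forest on the full dataset purely for illustration and lists the five highest-ranked features.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(data.data, data.target)  # fitted on all samples, for illustration only

importances = forest.feature_importances_  # impurity-based scores that sum to 1
top5 = np.argsort(importances)[::-1][:5]   # indices of the five largest scores
for idx in top5:
    print(f"{data.feature_names[idx]:<25} {importances[idx]:.3f}")
```

Impurity-based importances can be spread across correlated features, so the ranking is best treated as a rough guide rather than a definitive ordering.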

In [13]:
rf = RandomForestClassifier(random_state=42, n_estimators=100)
rf.fit(X_train_scaled, y_train)
rf_preds = rf.predict(X_test_scaled)
print("Random Forest Accuracy:", accuracy_score(y_test, rf_preds))
print(confusion_matrix(y_test, rf_preds))
print(classification_report(y_test, rf_preds))


Random Forest Accuracy: 0.9649122807017544
[[40  3]
 [ 1 70]]
              precision    recall  f1-score   support

           0       0.98      0.93      0.95        43
           1       0.96      0.99      0.97        71

    accuracy                           0.96       114
   macro avg       0.97      0.96      0.96       114
weighted avg       0.97      0.96      0.96       114



## 4. Support Vector Machine (SVM)

### Description:
   - SVM finds the hyperplane that separates the classes with the largest possible margin.
   - It uses kernels to handle non-linear data; SVC() uses the RBF kernel by default.

### Why Suitable?
   - Effective for complex, high-dimensional data.
   - Works well when there is a clear margin of separation between classes.
   - Suitable for small to medium-sized datasets.
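
A kernel is a similarity function between two samples. The sketch below writes out the RBF kernel, which is the default for `SVC`, by hand; the points and the `gamma` value are hypothetical and chosen only to show how similarity decays with distance.

```python
import numpy as np

def rbf_kernel(x, z, gamma):
    """RBF (Gaussian) kernel: exp(-gamma * ||x - z||^2)."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

# Hypothetical points and gamma, chosen only to show the behaviour
a = [0.0, 0.0]
b = [0.1, 0.0]
c = [3.0, 4.0]
gamma = 0.5

print(rbf_kernel(a, a, gamma))  # identical points give the maximum similarity, 1.0
print(rbf_kernel(a, b, gamma))  # nearby points stay close to 1
print(rbf_kernel(a, c, gamma))  # distant points decay towards 0
```

Because the kernel depends on squared Euclidean distance, features with large ranges would dominate it, which is one reason scaling matters for SVM.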

In [18]:
svm = SVC()
svm.fit(X_train_scaled, y_train)
svm_preds = svm.predict(X_test_scaled)
print("SVM Accuracy:", accuracy_score(y_test, svm_preds))
print(confusion_matrix(y_test, svm_preds))
print(classification_report(y_test, svm_preds))

SVM Accuracy: 0.9824561403508771
[[41  2]
 [ 0 71]]
              precision    recall  f1-score   support

           0       1.00      0.95      0.98        43
           1       0.97      1.00      0.99        71

    accuracy                           0.98       114
   macro avg       0.99      0.98      0.98       114
weighted avg       0.98      0.98      0.98       114



## 5. k-Nearest Neighbors (k-NN)

### Description:
   - k-NN is a simple, instance-based algorithm that classifies data based on the majority class of its k nearest neighbors.
   - It calculates distances (e.g., Euclidean) to determine similarity.

### Why Suitable?
   - Easy to implement and interpret.
   - Works well for small datasets.
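   - Has essentially no training phase, since it only stores the training data; the cost is paid at prediction time, when distances to the stored samples are computed.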
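
The algorithm is short enough to sketch from scratch. The function below (`knn_predict` is a hypothetical helper, not part of scikit-learn) classifies a single query point by Euclidean distance and majority vote, using toy data.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5):
    """Classify one point by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distances
    nearest = np.argsort(distances)[:k]                    # indices of the k closest
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated clusters
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_near_origin = knn_predict(X_toy, y_toy, np.array([0.05, 0.1]), k=3)
pred_near_five = knn_predict(X_toy, y_toy, np.array([5.0, 5.1]), k=3)
print(pred_near_origin, pred_near_five)
```

All of the work happens inside the prediction call, which is why prediction cost grows with the size of the stored training set.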

In [15]:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
knn_preds = knn.predict(X_test_scaled)
print("k-NN Accuracy:", accuracy_score(y_test, knn_preds))
print(confusion_matrix(y_test, knn_preds))
print(classification_report(y_test, knn_preds))


k-NN Accuracy: 0.9473684210526315
[[40  3]
 [ 3 68]]
              precision    recall  f1-score   support

           0       0.93      0.93      0.93        43
           1       0.96      0.96      0.96        71

    accuracy                           0.95       114
   macro avg       0.94      0.94      0.94       114
weighted avg       0.95      0.95      0.95       114



# 4. Compare and Evaluate the Models

In [16]:
results = {
    "Logistic Regression": accuracy_score(y_test, lr_preds),
    "Decision Tree": accuracy_score(y_test, dt_preds),
    "Random Forest": accuracy_score(y_test, rf_preds),
    "Support Vector Machine": accuracy_score(y_test, svm_preds),
    "k-Nearest Neighbors": accuracy_score(y_test, knn_preds)
}

# Print Results
for model, score in results.items():
    print(f"{model}: {score:.4f}")

best_model = max(results, key=results.get)
worst_model = min(results, key=results.get)

print(f"\nBest Performing Model: {best_model} with Accuracy: {results[best_model]:.4f}")
print(f"Worst Performing Model: {worst_model} with Accuracy: {results[worst_model]:.4f}")


Logistic Regression: 0.9737
Decision Tree: 0.9474
Random Forest: 0.9649
Support Vector Machine: 0.9825
k-Nearest Neighbors: 0.9474

Best Performing Model: Support Vector Machine with Accuracy: 0.9825
Worst Performing Model: Decision Tree with Accuracy: 0.9474
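
A single 114-sample test set separates the models by only a handful of predictions, so a cross-validated comparison may give a steadier picture. The sketch below is one possible approach, not something run in this notebook: it wraps the scaler and each classifier in a pipeline and scores two of the models with 5-fold cross-validation (the choice of `cv=5` is an assumption).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# The pipeline refits the scaler inside every fold, so no fold sees
# statistics computed from its own held-out samples
candidates = {
    "Support Vector Machine": make_pipeline(StandardScaler(), SVC()),
    "Decision Tree": make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=42)),
}

cv_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: {scores.mean():.4f} +/- {scores.std():.4f}")
```

The spread across folds gives a sense of how much of the gap between two models could be due to the particular split.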


# Conclusion

## Best-Performing Algorithm:
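   - SVM (Support Vector Machine) achieved the highest test accuracy (98.25%), misclassifying only 2 of the 114 test samples.
   - With its default RBF kernel, SVM can model non-linear boundaries on the scaled features, which may explain its edge here, although its lead over Logistic Regression amounts to a single test sample.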

## Worst-Performing Algorithm:
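   - The Decision Tree Classifier and k-NN tied for the lowest test accuracy (94.74%), each misclassifying 6 of the 114 test samples. The comparison cell reports only Decision Tree because min() returns the first of the tied entries.
   - A single unpruned Decision Tree tends to overfit the training data, which can hurt generalization.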

## Other Observations:
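   - Logistic Regression (97.37%) and Random Forest (96.49%) also performed well, indicating they are reliable choices for medical data where interpretability and performance are both essential.
   - k-NN reached 94.74%, the same accuracy as the Decision Tree. Its sensitivity to noisy points, class imbalance, and the choice of k may account for the gap to the top models.
   - The differences between models amount to a few test samples, so they should be read as indicative rather than conclusive.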
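
ROC-AUC is a common complement to accuracy for binary classifiers. The sketch below shows one way it could be computed for the SVM, assuming the same split and scaling as in the preprocessing step; it uses `decision_function` because ROC-AUC needs continuous scores rather than hard labels.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

svm = SVC().fit(X_train_scaled, y_train)

# decision_function returns a signed score per sample relative to the
# separating boundary; larger values mean "more like class 1"
svm_scores = svm.decision_function(X_test_scaled)
svm_auc = roc_auc_score(y_test, svm_scores)
print(f"SVM ROC-AUC: {svm_auc:.4f}")
```

For classifiers that expose `predict_proba`, such as Logistic Regression or Random Forest, the probability of class 1 can be passed to `roc_auc_score` in the same way.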