# 1. Loading and Preprocessing the Data

### 1) Loading Data

In [1]:
from sklearn.datasets import load_breast_cancer
import pandas as pd

# Load the dataset
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

### 2) Preprocessing Data

In [2]:
print(df.isnull().sum())  # Check for missing values

mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
target                     0
dtype: int64


In [3]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
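# Scale every feature to zero mean and unit variance.
# Note: fitting on all rows before the split lets test-set statistics leak into training.
X_scaled = scaler.fit_transform(df.drop('target', axis=1))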
y = df['target']

### 3) Explanation:

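Missing Values: The check above reports zero missing values in all 30 feature columns and in the target, so no imputation is needed.

Feature Scaling: Standardization rescales each feature to zero mean and unit variance. This matters for distance- and margin-based algorithms such as k-NN and SVM, where features with large numeric ranges (for example, mean area compared with mean smoothness) would otherwise dominate the distance calculations. It also helps the Logistic Regression solver converge. Tree-based models (Decision Tree, Random Forest) are insensitive to feature scale.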
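
One caveat with the scaling cell above: the scaler is fit on every row before the data is split, so the test rows influence the learned means and standard deviations. The following is a minimal sketch of a leakage-free variant, assuming the same 80/20 split used later in this notebook. The `_raw` and `_std` variable names are illustrative and are not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_raw, y_raw = load_breast_cancer(return_X_y=True)

# Split first so the test rows never influence the scaling statistics.
X_train_raw, X_test_raw, y_train_raw, y_test_raw = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42
)

scaler_train_only = StandardScaler()
X_train_std = scaler_train_only.fit_transform(X_train_raw)  # learn mean/std from training rows
X_test_std = scaler_train_only.transform(X_test_raw)        # reuse the training statistics

# Training columns should now have (approximately) zero mean and unit standard deviation.
print(np.allclose(X_train_std.mean(axis=0), 0.0))
print(np.allclose(X_train_std.std(axis=0), 1.0))
print(X_train_std.shape, X_test_std.shape)
```

The test columns will not be exactly zero-mean after this transform, which is expected: they are scaled with statistics the model was allowed to see.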

# 2. Classification Algorithm Implementation

### 1) Logistic Regression:

Description: Logistic Regression is a linear model that passes a weighted sum of the features through the logistic (sigmoid) function to estimate the probability of one of two classes. It suits this dataset because the target is binary (malignant vs. benign).

In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
y_pred_log_reg = log_reg.predict(X_test)
accuracy_log_reg = accuracy_score(y_test, y_pred_log_reg)
print(f'Logistic Regression Accuracy: {accuracy_log_reg}')

Logistic Regression Accuracy: 0.9736842105263158
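
To make the "logistic function" in the description concrete, the sketch below recomputes the predicted probabilities by hand from the model's linear scores. It is illustrative only: the model is fit on all rows to show the link function, not to evaluate accuracy, and the names `lr_demo`, `X_lr`, and `y_lr` are assumptions introduced here.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_lr, y_lr = load_breast_cancer(return_X_y=True)
X_lr_std = StandardScaler().fit_transform(X_lr)

lr_demo = LogisticRegression(max_iter=1000).fit(X_lr_std, y_lr)

# Linear score z = w . x + b for the first five rows.
z = lr_demo.decision_function(X_lr_std[:5])

# The logistic (sigmoid) function maps each score to a probability for class 1.
p_manual = 1.0 / (1.0 + np.exp(-z))
p_sklearn = lr_demo.predict_proba(X_lr_std[:5])[:, 1]

print(np.allclose(p_manual, p_sklearn))
```

A score of zero corresponds to a probability of 0.5, which is the default decision threshold used by `predict`.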


### 2) Decision Tree Classifier
Description: A Decision Tree recursively splits the data: each internal node tests one feature against a threshold, and each leaf assigns a class label. The resulting decision boundary is non-linear and can capture feature interactions without any feature scaling, but a fully grown tree tends to overfit the training data.

In [12]:
from sklearn.tree import DecisionTreeClassifier

# Decision Tree Model
dt_clf = DecisionTreeClassifier(random_state=42)
dt_clf.fit(X_train, y_train)
y_pred_dt = dt_clf.predict(X_test)
accuracy_dt = accuracy_score(y_test, y_pred_dt)
print(f'Decision Tree Accuracy: {accuracy_dt}')

Decision Tree Accuracy: 0.9473684210526315
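
The overfitting tendency of an unrestricted tree can be inspected by comparing training and test accuracy against a depth-limited tree. This is a rough sketch, assuming the same split parameters as above. The `max_depth=3` value is an arbitrary choice for illustration, not a tuned setting, and raw features are used because trees do not depend on feature scale.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_dt, y_dt = load_breast_cancer(return_X_y=True)
X_dt_train, X_dt_test, y_dt_train, y_dt_test = train_test_split(
    X_dt, y_dt, test_size=0.2, random_state=42
)

# An unrestricted tree keeps splitting until its leaves are pure.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_dt_train, y_dt_train)
# A shallow tree trades training fit for a simpler decision rule.
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_dt_train, y_dt_train)

for name, tree in [('full', full_tree), ('max_depth=3', shallow_tree)]:
    print(
        f'{name}: depth={tree.get_depth()}, '
        f'train={tree.score(X_dt_train, y_dt_train):.4f}, '
        f'test={tree.score(X_dt_test, y_dt_test):.4f}'
    )
```

A large gap between training and test accuracy for the full tree would point to overfitting; whether the shallow tree generalizes better on this split has to be read from the output.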


### 3) Random Forest Classifier
Description: Random Forest is an ensemble method that trains many decision trees on bootstrap samples, considers a random subset of features at each split, and combines the trees' predictions. Averaging many decorrelated trees reduces the overfitting of any single tree while still capturing interactions between features.

In [6]:
from sklearn.ensemble import RandomForestClassifier

# Random Forest Model
rf_clf = RandomForestClassifier(random_state=42)
rf_clf.fit(X_train, y_train)
y_pred_rf = rf_clf.predict(X_test)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f'Random Forest Accuracy: {accuracy_rf}')

Random Forest Accuracy: 0.9649122807017544
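
A fitted forest also exposes impurity-based feature importances, which give a rough view of the features the trees split on most. The sketch below is an illustration under the same split parameters; the names `cancer`, `forest_demo`, and `importances` are assumptions introduced here.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_rf_train, X_rf_test, y_rf_train, y_rf_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=42
)

forest_demo = RandomForestClassifier(random_state=42).fit(X_rf_train, y_rf_train)

# Impurity-based importances are normalized to sum to 1 across all features.
importances = pd.Series(
    forest_demo.feature_importances_, index=cancer.feature_names
).sort_values(ascending=False)

print(importances.head(5))
```

Impurity-based importances can be spread across correlated features, so they are better read as a ranking hint than as a precise measure of effect.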


### 4) Support Vector Machine (SVM)
Description: SVM finds the decision boundary that separates the two classes with the largest margin. `SVC()` uses an RBF kernel by default, so the boundary is a hyperplane in the kernel-induced feature space and can be non-linear in the original features. SVMs cope well with many features, but they are sensitive to feature scale.

In [7]:
from sklearn.svm import SVC

# SVM Model
svm_clf = SVC()
svm_clf.fit(X_train, y_train)
y_pred_svm = svm_clf.predict(X_test)
accuracy_svm = accuracy_score(y_test, y_pred_svm)
print(f'SVM Accuracy: {accuracy_svm}')

SVM Accuracy: 0.9736842105263158
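
The "separating boundary" idea can be made concrete through the signed score that `SVC` assigns to each sample: the sign says which side of the boundary a point falls on. The sketch below is illustrative, with the scaler fit on the training rows only; the `svm_demo` and `_svm` names are assumptions introduced here.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_svm, y_svm = load_breast_cancer(return_X_y=True)
X_svm_train, X_svm_test, y_svm_train, y_svm_test = train_test_split(
    X_svm, y_svm, test_size=0.2, random_state=42
)
svm_scaler = StandardScaler().fit(X_svm_train)
X_svm_train_std = svm_scaler.transform(X_svm_train)
X_svm_test_std = svm_scaler.transform(X_svm_test)

svm_demo = SVC().fit(X_svm_train_std, y_svm_train)
print(svm_demo.kernel)       # default kernel of SVC
print(svm_demo.n_support_)   # number of support vectors per class

# Positive scores fall on the class-1 side of the boundary, negative on the class-0 side.
signed_scores = svm_demo.decision_function(X_svm_test_std)
pred_from_sign = (signed_scores > 0).astype(int)
print(np.array_equal(pred_from_sign, svm_demo.predict(X_svm_test_std)))
```

Only the support vectors determine the boundary, which is why their count is a useful indicator of how complex the fitted model is.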


### 5) k-Nearest Neighbors (k-NN)
Description: k-NN assigns each point the majority class among its k nearest training points. It is a non-parametric method that works well on small datasets. Because it relies on distances, it is sensitive to feature scaling, which makes standardization an essential preprocessing step.

In [8]:
from sklearn.neighbors import KNeighborsClassifier

# k-NN Model
knn_clf = KNeighborsClassifier(n_neighbors=5)
knn_clf.fit(X_train, y_train)
y_pred_knn = knn_clf.predict(X_test)
accuracy_knn = accuracy_score(y_test, y_pred_knn)
print(f'k-NN Accuracy: {accuracy_knn}')

k-NN Accuracy: 0.9473684210526315
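
The majority vote described above can be reproduced by hand from the neighbor indices that the fitted model returns. This is a small sketch assuming `n_neighbors=5` with the default uniform weights, so a binary vote cannot tie; the `knn_demo` and `_knn` names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X_knn, y_knn = load_breast_cancer(return_X_y=True)
X_knn_train, X_knn_test, y_knn_train, y_knn_test = train_test_split(
    X_knn, y_knn, test_size=0.2, random_state=42
)
knn_scaler = StandardScaler().fit(X_knn_train)
X_knn_train_std = knn_scaler.transform(X_knn_train)
X_knn_test_std = knn_scaler.transform(X_knn_test)

knn_demo = KNeighborsClassifier(n_neighbors=5).fit(X_knn_train_std, y_knn_train)

# Indices of the 5 nearest training points for every test point.
_, neighbor_idx = knn_demo.kneighbors(X_knn_test_std)
neighbor_labels = y_knn_train[neighbor_idx]          # shape: (n_test, 5)

# With 5 binary votes, class 1 wins when at least 3 neighbors carry label 1.
manual_vote = (neighbor_labels.sum(axis=1) >= 3).astype(int)
print(np.array_equal(manual_vote, knn_demo.predict(X_knn_test_std)))
```

Because the vote depends entirely on which training points are nearest, any change to the feature scaling changes the neighbors and can change the prediction.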


# 3. Model Comparison

In [13]:
# Store the results in a dictionary
results = {
    'Logistic Regression': accuracy_log_reg,
    'Decision Tree': accuracy_dt,
    'Random Forest': accuracy_rf,
    'SVM': accuracy_svm,
    'k-NN': accuracy_knn
}

# Print the accuracy for each model
for model, accuracy in results.items():
    print(f'{model}: {accuracy}')

Logistic Regression: 0.9736842105263158
Decision Tree: 0.9473684210526315
Random Forest: 0.9649122807017544
SVM: 0.9736842105263158
k-NN: 0.9473684210526315
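
The accuracies above come from a single train/test split, so small differences between models may not be stable. The following is a hedged sketch of a more robust comparison using 5-fold cross-validation, with scaling placed inside a pipeline so that each fold fits the scaler on its own training portion. The fold count, `max_iter=1000`, and the `candidates` and `cv_results` names are assumptions chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X_cv, y_cv = load_breast_cancer(return_X_y=True)

candidates = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'SVM': SVC(),
    'k-NN': KNeighborsClassifier(n_neighbors=5),
}

cv_results = {}
for name, estimator in candidates.items():
    # The scaler is refit inside every fold, so no validation rows leak into it.
    pipe = make_pipeline(StandardScaler(), estimator)
    scores = cross_val_score(pipe, X_cv, y_cv, cv=5, scoring='accuracy')
    cv_results[name] = (scores.mean(), scores.std())
    print(f'{name}: {scores.mean():.4f} +/- {scores.std():.4f}')
```

The standard deviation across folds gives a sense of how much of the gap between two models could be due to the choice of split.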


### Model Comparison
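Best Performing Model: On this test split, Logistic Regression and SVM tie for the highest accuracy (0.9737), followed by Random Forest (0.9649). A linear model matching the kernel SVM suggests that the two classes are largely linearly separable in the standardized feature space.

Worst Performing Model: Decision Tree and k-NN tie for the lowest accuracy (0.9474). The features were already standardized, so scaling sensitivity does not explain the k-NN result here. With a 114-sample test set, the gap between the best and worst models is only three samples (111 vs. 108 correct), so the ranking should not be over-interpreted.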

### Summary
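Logistic Regression: Tied for best here (0.9737); a strong baseline when the classes are close to linearly separable.

Decision Tree: Handles non-linear data but is prone to overfitting; tied for lowest here (0.9474).

Random Forest: Reduces the overfitting of a single tree by combining many; second-best here (0.9649).

SVM: Tied for best here (0.9737); performs well on standardized, high-dimensional data.

k-NN: Sensitive to feature scaling and slow to query on large datasets; tied for lowest here (0.9474).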

### Final Notes:
Explanation of Preprocessing: The features were standardized because SVM and k-NN depend on distances between points, so unscaled features with large ranges can dominate those distances and degrade accuracy. Tree-based models are unaffected by scaling.
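
The effect of scaling on SVM and k-NN can be checked directly instead of assumed. The sketch below fits each model once on raw features and once inside a scaling pipeline, using the same split parameters as the notebook. The `scaling_effect` dictionary and the `_chk` variable names are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_chk, y_chk = load_breast_cancer(return_X_y=True)
X_chk_train, X_chk_test, y_chk_train, y_chk_test = train_test_split(
    X_chk, y_chk, test_size=0.2, random_state=42
)

scaling_effect = {}
for name, make_model in {'SVM': SVC, 'k-NN': KNeighborsClassifier}.items():
    # Same estimator, default settings, with and without standardization.
    raw_acc = make_model().fit(X_chk_train, y_chk_train).score(X_chk_test, y_chk_test)
    scaled_acc = (
        make_pipeline(StandardScaler(), make_model())
        .fit(X_chk_train, y_chk_train)
        .score(X_chk_test, y_chk_test)
    )
    scaling_effect[name] = {'raw': raw_acc, 'scaled': scaled_acc}
    print(f'{name}: raw={raw_acc:.4f}, scaled={scaled_acc:.4f}')
```

On a single small test split the difference may be modest or may even favor the raw features for one of the models, so the output is best treated as a sanity check and not as proof.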

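Performance Summary: All five models score between 0.9474 and 0.9737 on this split. Logistic Regression and SVM lead, Random Forest follows closely, and Decision Tree and k-NN trail. Because a single train/test split is a noisy estimate, cross-validation would give a more reliable ranking.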