# Classification Algorithms Assignment



## 1. Loading & Preprocessing

In [1]:
from sklearn.datasets import load_breast_cancer
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load data
data = load_breast_cancer(as_frame=True)
df = data.frame.copy()

# Features and target
X = df.drop(columns=['target']).copy()
y = df['target'].copy()

# Check missing values
missing_count = X.isna().sum().sum()

# Feature scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42, stratify=y)

print('Shape:', X.shape)
print('Missing values count:', missing_count)


Shape: (569, 30)
Missing values count: 0


**Explanation:**

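- **Missing values:** The dataset contains no missing values (`missing_count` is 0), so no imputation is needed.
- **Feature scaling:** `StandardScaler` is applied because Logistic Regression, SVM, and k-NN are sensitive to feature scale: it affects optimizer convergence and lets large-valued features dominate distance calculations. Decision Tree and Random Forest do not require it. Note that the scaler here is fitted on the full dataset before the split, so test-set statistics influence the scaling; fitting it on the training set only avoids this leakage.
- **Train-test split:** Passing `stratify=y` keeps the class proportions approximately equal in the training and test sets.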
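
As a minimal sketch of the scaling and stratification points above, the following splits first and lets a pipeline fit `StandardScaler` on the training rows only. The use of `make_pipeline` is an assumption of this sketch, not part of the original cell, and its accuracy may differ slightly from the figures reported in this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Split first, so the test rows play no part in computing the scaling statistics.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The pipeline fits StandardScaler on the training rows only, then reuses
# those means and standard deviations when transforming the test rows.
pipe = make_pipeline(
    StandardScaler(), LogisticRegression(max_iter=1000, random_state=42)
)
pipe.fit(X_train, y_train)

# Stratification check: the share of class 1 should be close in both subsets.
train_share = y_train.mean()
test_share = y_test.mean()
test_accuracy = pipe.score(X_test, y_test)

print(f"Class-1 share: train={train_share:.3f}, test={test_share:.3f}")
print(f"Test accuracy: {test_accuracy:.4f}")
```

Wrapping the scaler and the model in one estimator also means the same object can be passed to cross-validation utilities without re-introducing the leakage.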


## 2. Classification Algorithms

In [3]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'SVM': SVC(probability=False, random_state=42),
    'k-NN': KNeighborsClassifier(n_neighbors=5)
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    acc = accuracy_score(y_test, pred)
    results[name] = {
        'accuracy': acc,
        'report': classification_report(y_test, pred, output_dict=True)
    }

# Display accuracies
for name, res in results.items():
    print(f"{name}: Accuracy = {res['accuracy']:.4f}")


Logistic Regression: Accuracy = 0.9825
Decision Tree: Accuracy = 0.9123
Random Forest: Accuracy = 0.9561
SVM: Accuracy = 0.9825
k-NN: Accuracy = 0.9649
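
The loop above stores a `classification_report` dictionary for each model, but only accuracy is printed. A minimal, self-contained sketch of reading per-class precision and recall from such a dictionary follows. It trains a single decision tree on unscaled features purely to stay short, and passing `target_names` is a choice made for this sketch, not something the original cell does.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)

# A decision tree is used only because it needs no feature scaling.
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# target_names maps label 0 -> 'malignant' and label 1 -> 'benign'.
class_names = [str(name) for name in data.target_names]
report = classification_report(
    y_test, tree.predict(X_test), target_names=class_names, output_dict=True
)

for name in class_names:
    row = report[name]
    print(
        f"{name:>9}: precision={row['precision']:.3f} "
        f"recall={row['recall']:.3f} support={int(row['support'])}"
    )
```

Recall on the malignant class is often the figure of interest for this dataset, since it counts how many malignant cases the model actually caught.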


### Description

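- **Logistic Regression:** Linear model that passes a weighted sum of the features through the sigmoid function to estimate class probability. Well suited to binary classification, and its coefficients are interpretable.

- **Decision Tree:** Recursively splits the data on feature thresholds. Captures non-linear relationships and is easy to interpret, but an unpruned tree can overfit.

- **Random Forest:** Ensemble of decision trees trained on bootstrap samples with random feature subsets; combining their votes reduces overfitting relative to a single tree and often yields strong performance.

- **SVM:** Finds a maximum-margin decision boundary; effective in high-dimensional spaces. `SVC` uses an RBF kernel by default, so the boundary here is non-linear in the original features.

- **k-NN:** Instance-based; predicts the majority class among the nearest training points (`n_neighbors=5` here). Simple, but distance-based, so it relies on scaled features.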
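
To illustrate why k-NN depends on scaled data, here is a small pure-Python sketch of a majority-vote classifier run on raw and on standardized features. The four training points and the query are hypothetical toy values chosen for illustration, and `knn_predict` and `standardize` are helper names invented for this sketch, not scikit-learn functions.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def knn_predict(train_points, train_labels, query, k=3):
    """Majority vote among the k training points closest to the query (Euclidean)."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


def standardize(points, reference):
    """Scale each column using the mean and population std of `reference`."""
    columns = list(zip(*reference))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(point, means, stds))
        for point in points
    ]


# Hypothetical toy data: column 0 separates the classes,
# column 1 is noise measured on a much larger scale.
train = [(1.0, 100.0), (1.2, 900.0), (5.0, 200.0), (5.2, 800.0)]
labels = [0, 0, 1, 1]
query = (1.1, 750.0)

raw_prediction = knn_predict(train, labels, query)
scaled_prediction = knn_predict(
    standardize(train, train), labels, standardize([query], train)[0]
)

print("Unscaled:", raw_prediction)
print("Scaled:  ", scaled_prediction)
```

On the raw values the large-scale noise column dominates the distances and the vote goes to class 1; after standardization the informative column carries equal weight and the vote goes to class 0.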


## 3. Model Comparison


In [4]:
import pandas as pd
acc_df = pd.DataFrame(
    [{'Model': name, 'Accuracy': res['accuracy']} for name, res in results.items()]
)
acc_df = acc_df.sort_values(by='Accuracy', ascending=False).reset_index(drop=True)
acc_df


|   | Model | Accuracy |
|---|---|---|
| 0 | Logistic Regression | 0.982456 |
| 1 | SVM | 0.982456 |
| 2 | k-NN | 0.964912 |
| 3 | Random Forest | 0.956140 |
| 4 | Decision Tree | 0.912281 |


**Conclusion:**

- Logistic Regression and SVM tie for the highest test accuracy (0.982456).
- The Decision Tree has the lowest test accuracy (0.912281).
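
To put these gaps in perspective, each accuracy can be converted back into a count of misclassified test samples. This sketch assumes the test split holds 114 rows (20% of 569, rounded up), which is consistent with every reported accuracy being a multiple of 1/114.

```python
# Accuracies copied from the comparison table in this section.
accuracies = {
    "Logistic Regression": 0.982456,
    "SVM": 0.982456,
    "k-NN": 0.964912,
    "Random Forest": 0.956140,
    "Decision Tree": 0.912281,
}

# Assumption: test_size=0.2 on 569 samples yields 114 test rows.
n_test = 114

errors = {}
for model, accuracy in accuracies.items():
    correct = round(accuracy * n_test)
    errors[model] = n_test - correct
    print(f"{model:<20} correct={correct:>3}  misclassified={errors[model]:>2}")
```

On this split the two top models misclassify 2 samples each and the Decision Tree misclassifies 10, so the whole ranking rests on a difference of 8 test samples.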

