# ML Classification Experiments (scikit-learn)

This notebook demonstrates a clean, end-to-end classification workflow:
- Load CSV data
- Prepare features/target
- Train/test split
- Train multiple models (DT, KNN, SVM, RF, MLP)
- Compare models using test accuracy and 5-fold cross-validation
- Detailed evaluation for the best model (Random Forest)

**Author:** Ahmad Abdulla  
**GitHub:** https://github.com/fanshaa


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier


## 1) Load data
Place your dataset file in the same folder as this notebook, then set the filename below.

> If your dataset contains sensitive information, do **not** upload it to GitHub.


In [None]:
DATA_PATH = "supplier1.csv"  # change if needed
data = pd.read_csv(DATA_PATH)
data.head()

In [None]:
data.info()

In [None]:
data.describe(include='all').T.head(15)

## 2) Quick EDA (optional)


In [None]:
sns.countplot(x=data['Class'])
plt.title('Class Distribution')
plt.show()

## 3) Prepare features and target
The target is `Class`. The date columns `datevalid` and `offerdate` are dropped along with it; adjust `drop_cols` to match your dataset.


In [None]:
drop_cols = ['Class', 'datevalid', 'offerdate']
drop_cols = [c for c in drop_cols if c in data.columns]  # drop only existing columns

X = data.drop(drop_cols, axis=1)
y = data['Class']

X.shape, y.shape
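
The scikit-learn models used later expect numeric features, so it can help to check `X` for text columns before training. The sketch below uses a small made-up frame (the `price`, `quantity`, and `region` columns are hypothetical, not taken from `supplier1.csv`) and shows one possible way to find and one-hot encode non-numeric columns.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
toy = pd.DataFrame({
    "price": [10.5, 12.0, 9.75, 11.2],
    "quantity": [100, 250, 80, 120],
    "region": ["north", "south", "north", "east"],  # text column
    "Class": [0, 1, 0, 1],
})

X_toy = toy.drop(columns=["Class"])

# Columns that are not numeric need encoding (or dropping) before fitting.
non_numeric = X_toy.select_dtypes(exclude="number").columns.tolist()
print("Non-numeric columns:", non_numeric)

# One option among several: one-hot encode the text columns.
X_encoded = pd.get_dummies(X_toy, columns=non_numeric)
print(X_encoded.columns.tolist())
```

Whether one-hot encoding is the right choice depends on the column; high-cardinality text fields may be better dropped or encoded differently.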

## 4) Train/test split


In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.33,
    random_state=44,
    shuffle=True,
    stratify=y
)

X_train.shape, X_test.shape, y_train.shape, y_test.shape
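
Passing `stratify=y` keeps the class proportions roughly equal in the train and test sets. A minimal sketch of how to verify this, using made-up imbalanced labels (`X_demo` and `y_demo` are illustrative names, not part of the notebook's data):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80% class 0, 20% class 1.
y_demo = pd.Series([0] * 80 + [1] * 20)
X_demo = pd.DataFrame({"feature": np.arange(100)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.33,
    random_state=44,
    stratify=y_demo,
)

# Both splits should show close to the original 80/20 class ratio.
print("Train:", y_tr.value_counts(normalize=True).round(2).to_dict())
print("Test: ", y_te.value_counts(normalize=True).round(2).to_dict())
```

The same `value_counts(normalize=True)` check can be run on `y_train` and `y_test` from the cell above.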

## 5) Define models
We use `Pipeline` with `StandardScaler` for models that benefit from scaling (KNN, SVM, MLP).


In [None]:
models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=6, random_state=44),

    "KNN": Pipeline([
        ("scaler", StandardScaler()),
        ("model", KNeighborsClassifier(n_neighbors=29))
    ]),

    "SVM (linear)": Pipeline([
        ("scaler", StandardScaler()),
        ("model", SVC(kernel="linear"))
    ]),

    "Random Forest": RandomForestClassifier(
        n_estimators=300,
        random_state=44,
        n_jobs=-1
    ),

    "MLP": Pipeline([
        ("scaler", StandardScaler()),
        ("model", MLPClassifier(
            hidden_layer_sizes=(150, 100, 50),
            max_iter=300,
            random_state=44
        ))
    ])
}

list(models.keys())
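
Wrapping the scaler and the model in one `Pipeline` means the scaler is fitted only on the data passed to `fit`, so test data (or a held-out CV fold) never influences the scaling statistics. The sketch below illustrates this on synthetic data from `make_classification`, which is an assumption standing in for the real dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the notebook's dataset).
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=44)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.33, random_state=44, stratify=y_syn
)

knn_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", KNeighborsClassifier(n_neighbors=29)),
])
knn_pipe.fit(X_tr, y_tr)

# The fitted scaler holds statistics from the training split only.
scaler = knn_pipe.named_steps["scaler"]
print("Scaler means match training means:",
      np.allclose(scaler.mean_, X_tr.mean(axis=0)))

# predict() applies the already-fitted scaler, then the classifier.
preds_syn = knn_pipe.predict(X_te)
print("Predictions shape:", preds_syn.shape)
```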

## 6) Train & compare models
We report **Test Accuracy** on the held-out test set and the mean **5-fold CV Accuracy** computed on the training set.


In [None]:
results = []

for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    test_acc = accuracy_score(y_test, preds)

    cv_acc = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy').mean()
    results.append([name, test_acc, cv_acc])

results_df = pd.DataFrame(results, columns=["Model", "Test Accuracy", "CV Accuracy (5-fold)"])
results_df.sort_values(by="Test Accuracy", ascending=False)
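
The mean CV accuracy alone hides how much the score varies between folds. As a possible extension, the sketch below reports the mean together with the standard deviation, using an explicit `StratifiedKFold` splitter. It runs on synthetic data from `make_classification` and a smaller forest (`n_estimators=50`) to keep it quick; both are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (not the notebook's dataset).
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=44)

# Explicit splitter: shuffled, stratified, and reproducible.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=44)

scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=44),
    X_syn, y_syn,
    cv=cv,
    scoring="accuracy",
)

# One score per fold; the spread indicates how stable the estimate is.
print("Fold accuracies:", scores.round(3))
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

In the loop above, `scores.std()` could be stored as an extra column in `results_df`.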

## 7) Detailed evaluation for the best model
Here we evaluate **Random Forest**. To evaluate a different model, set `best_model_name` to the top-ranked model in the results table.


In [None]:
best_model_name = "Random Forest"
best_model = models[best_model_name]

best_model.fit(X_train, y_train)
y_pred = best_model.predict(X_test)

print("Best model:", best_model_name)
print("Accuracy:", accuracy_score(y_test, y_pred))
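# Compute the confusion matrix once and reuse it for printing and plotting
cm = confusion_matrix(y_test, y_pred, labels=best_model.classes_)
print("\nConfusion Matrix:\n", cm)
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Label the heatmap axes with the actual class names
sns.heatmap(cm, annot=True, fmt='d',
            xticklabels=best_model.classes_, yticklabels=best_model.classes_)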
plt.title(f"{best_model_name} – Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

## 8) Next steps (optional)
- Add hyperparameter tuning (GridSearchCV) for one model (e.g., SVM or Random Forest).
- Add a results plot.
- Save the trained model with `joblib`.
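
As a starting point for the tuning and saving steps, here is a minimal sketch that runs `GridSearchCV` over a small Random Forest grid and then saves and reloads the best estimator with `joblib`. The synthetic data, the parameter grid, and the `best_model.joblib` filename are all assumptions for illustration, not tuned recommendations.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (not the notebook's dataset).
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=44)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.33, random_state=44, stratify=y_syn
)

# Small, hypothetical grid; widen it for a real search.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 6],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=44),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

print("Best params:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")

# Save the refitted best estimator and load it back.
model_path = os.path.join(tempfile.gettempdir(), "best_model.joblib")
joblib.dump(search.best_estimator_, model_path)
restored = joblib.load(model_path)

# The restored model should reproduce the original predictions.
same = (restored.predict(X_te) == search.best_estimator_.predict(X_te)).all()
print("Restored model matches:", same)
```

The search uses only the training split, so the test set stays untouched for the final evaluation.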
