This project applies machine learning techniques to predict software bugs using the dataset Antfile17.CSV. The main goal is to classify whether a software module contains a bug (1) or not (0) based on 20 features.
- File: `Antfile17.CSV`
- Features: 20 software metrics (columns)
- Label: `bug` (0 = No Bug, 1 = Bug)
Two approaches are shown (legacy sweep, plus improved pipeline + CV):
- Legacy: sweep `k = 1..30` and pick the best value on a simple split (kept for reference).
- Improved: Pipeline (SMOTETomek → StandardScaler → KNN) with Stratified 5-fold GridSearchCV on train only, then evaluation on an untouched test set.
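
The legacy sweep can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the synthetic data from `make_classification` stands in for `Antfile17.CSV`, and accuracy as the selection metric is an assumption, since the README does not say which score the sweep used.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the 20 software metrics and the bug label (assumption).
X, y = make_classification(
    n_samples=400, n_features=20, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fit one KNN per k and score it on the single held-out split.
scores = {}
for k in range(1, 31):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores[k] = accuracy_score(y_test, knn.predict(X_test))

# Pick the k with the highest score on that split.
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```

Because `k` is chosen on the same split that is then reported, the resulting score tends to be optimistic, which is the motivation for tuning on the training set only in the improved approach.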
- Load dataset and check class distribution
- Train/test split: 80% train / 20% test (stratified)
- Imbalanced-learn Pipeline: `SMOTETomek` → `StandardScaler` → `KNeighborsClassifier`
- Hyperparameter tuning via `GridSearchCV` (CV on train only)
- Evaluate on test: Accuracy, F1, ROC AUC, Confusion Matrix, full classification report
- Diagnostics: Precision–Recall (Average Precision) and Calibration (Brier score + reliability curve)
- Persist best model: `models/knn_pipeline.joblib`
Pipeline (SMOTETomek → StandardScaler → LogisticRegression) with Stratified 5-fold GridSearchCV on train only.
- Load dataset
- Train/test split: 80% train / 20% test (stratified)
- Pipeline: `SMOTETomek` → `StandardScaler` → `LogisticRegression(max_iter=1000)`
- Hyperparameter tuning via `GridSearchCV` (CV on train only)
- Evaluate on test: Accuracy, F1, ROC AUC, Confusion Matrix, full classification report
- Diagnostics: Precision–Recall (Average Precision) and Calibration (Brier score + reliability curve)
- Persist best model: `models/logreg_pipeline.joblib`
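
The diagnostics step can be illustrated with a short sketch. To keep it small, it assumes synthetic data and deliberately omits the SMOTETomek resampling step and the grid search; it shows only how Average Precision, the Brier score, and the points of a reliability curve might be computed from test-set probabilities. The choice of `n_bins=5` is also an assumption.

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the real dataset (assumption).
X, y = make_classification(
    n_samples=400, n_features=20, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Simplified model: scaling + logistic regression, no resampling step.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]

# Average Precision summarizes the precision-recall curve in one number.
ap = average_precision_score(y_test, proba)

# Brier score: mean squared error of the predicted probabilities (lower is better).
brier = brier_score_loss(y_test, proba)

# Reliability curve: observed bug rate vs. mean predicted probability per bin.
frac_pos, mean_pred = calibration_curve(y_test, proba, n_bins=5)

print(f"AP={ap:.3f}  Brier={brier:.3f}")
for observed, predicted in zip(frac_pos, mean_pred):
    print(f"predicted={predicted:.2f}  observed={observed:.2f}")
```

A well-calibrated model yields points close to the diagonal, where predicted and observed values agree.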
Install all required packages:

    pip install scikit-learn
    pip install imbalanced-learn
    pip install seaborn
    pip install matplotlib
    pip install joblib

Both notebooks save their best cross-validated pipelines under `models/`:
- `models/knn_pipeline.joblib`
- `models/logreg_pipeline.joblib`

Example: load a saved pipeline and predict on new data `X_new` (same columns and order as the training features):

    from joblib import load
    import pandas as pd

    # Load one of the persisted pipelines
    pipe = load('models/logreg_pipeline.joblib')  # or 'models/knn_pipeline.joblib'

    # X_new must be a DataFrame with the same feature columns as training
    # X_new = pd.DataFrame([...], columns=[...])
    pred = pipe.predict(X_new)
    proba = pipe.predict_proba(X_new)[:, 1]
    print(pred, proba)

Notes:
- The Pipeline encapsulates resampling and scaling; pass raw feature columns matching training schema.
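
One way to guard against schema mismatches before calling `predict` is a small alignment helper. This is a sketch only: `align_features` is a hypothetical helper, not part of the project, and the column names below are placeholders rather than the real feature names from `Antfile17.CSV`.

```python
import pandas as pd


def align_features(X_new: pd.DataFrame, feature_columns: list[str]) -> pd.DataFrame:
    """Return X_new restricted to the training columns, in training order."""
    missing = [c for c in feature_columns if c not in X_new.columns]
    if missing:
        raise ValueError(f"Missing feature columns: {missing}")
    # Selecting by the training list drops extra columns and fixes the order.
    return X_new.loc[:, feature_columns]


# Placeholder column names (assumption); use the actual training columns.
feature_columns = ["loc", "wmc", "cbo"]
X_new = pd.DataFrame({"cbo": [2], "extra": [0], "loc": [120], "wmc": [7]})

aligned = align_features(X_new, feature_columns)
print(list(aligned.columns))  # ['loc', 'wmc', 'cbo']
```

The aligned frame can then be passed to `pipe.predict` and `pipe.predict_proba` as in the example above.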