dustyAlgo/BugClassifier

Bug Prediction using KNN and Logistic Regression

This project trains two classifiers, k-nearest neighbours (KNN) and logistic regression, to predict software bugs from the Antfile17.CSV dataset. Each software module is classified as containing a bug (1) or not (0) based on 20 software-metric features.

Dataset

  • File: Antfile17.CSV
  • Features: 20 software metrics (columns)
  • Label: bug (0 = No Bug, 1 = Bug)
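As a rough illustration of the layout described above, the sketch below reads a CSV, separates the `bug` label from the feature columns, and reports the class balance. The helper names `load_bug_dataset` and `class_distribution` are hypothetical, and the in-memory CSV uses made-up column names and values as a stand-in for Antfile17.CSV.

```python
import io

import pandas as pd


def load_bug_dataset(source, label_col="bug"):
    """Read a CSV and split it into a feature frame X and a label series y."""
    df = pd.read_csv(source)
    if label_col not in df.columns:
        raise KeyError(f"label column {label_col!r} not found")
    X = df.drop(columns=[label_col])
    y = df[label_col].astype(int)
    return X, y


def class_distribution(y):
    """Return the share of each class as a dict keyed by label."""
    return y.value_counts(normalize=True).sort_index().to_dict()


# Tiny in-memory stand-in for the real file (illustrative values only).
demo_csv = io.StringIO(
    "loc,wmc,bug\n"
    "120,4,0\n"
    "300,9,1\n"
    "80,2,0\n"
    "45,1,0\n"
)
X_demo, y_demo = load_bug_dataset(demo_csv)
demo_dist = class_distribution(y_demo)
print(X_demo.shape, demo_dist)
```

With the real data, the same call would take the file path, for example `load_bug_dataset("Antfile17.CSV")`, assuming the label column is named `bug` as listed above.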

Project Structure

🔹 KNN Classifier

The notebook shows two approaches, a legacy sweep over k and an improved pipeline tuned with cross-validation:

  1. Legacy: sweep k = 1..30 and pick the best k on a single train/test split (kept for reference).
  2. Improved: Pipeline (SMOTETomek → StandardScaler → KNN) tuned with stratified 5-fold GridSearchCV on the training set only, then evaluated on an untouched test set.
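The legacy sweep can be sketched roughly as follows. This is not the notebook's code: it runs on synthetic data from `make_classification`, the helper name `sweep_k` is hypothetical, and test accuracy is assumed as the selection metric.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in: 20 features and an imbalanced binary label.
X, y = make_classification(
    n_samples=400, n_features=20, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)


def sweep_k(X_train, y_train, X_test, y_test, k_values=range(1, 31)):
    """Fit one KNN per k and return the best k with all accuracy scores."""
    scores = {}
    for k in k_values:
        model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        scores[k] = model.score(X_test, y_test)  # accuracy on the held-out split
    best_k = max(scores, key=scores.get)  # first k reaching the top score
    return best_k, scores


best_k, scores = sweep_k(X_train, y_train, X_test, y_test)
print(best_k, round(scores[best_k], 3))
```

Because k is chosen on the same split that reports the score, the resulting accuracy is optimistic. The improved approach avoids this by tuning on the training set only.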

Updated workflow:

  • Load dataset and check class distribution
  • Train/test split: 80% train / 20% test (stratified)
  • Imbalanced-learn Pipeline: SMOTETomek → StandardScaler → KNeighborsClassifier
  • Hyperparameter tuning via GridSearchCV (CV on train only)
  • Evaluate on test: Accuracy, F1, ROC AUC, Confusion Matrix, full classification report
  • Diagnostics: Precision–Recall (Average Precision) and Calibration (Brier score + reliability curve)
  • Persist best model: models/knn_pipeline.joblib

🔹 Logistic Regression

Pipeline (SMOTETomek → StandardScaler → LogisticRegression) with Stratified 5-fold GridSearchCV on train only.

Updated workflow:

  • Load dataset
  • Train/test split: 80% train / 20% test (stratified)
  • Pipeline: SMOTETomek → StandardScaler → LogisticRegression(max_iter=1000)
  • Hyperparameter tuning via GridSearchCV (CV on train only)
  • Evaluate on test: Accuracy, F1, ROC AUC, Confusion Matrix, full classification report
  • Diagnostics: Precision–Recall (Average Precision) and Calibration (Brier score + reliability curve)
  • Persist best model: models/logreg_pipeline.joblib

Requirements

Install all required packages:

pip install scikit-learn
pip install imbalanced-learn
pip install pandas
pip install seaborn
pip install matplotlib
pip install joblib

Model persistence and inference

Both notebooks save their best cross-validated pipelines under models/:

  • models/knn_pipeline.joblib
  • models/logreg_pipeline.joblib

Example: load a saved pipeline and predict on new data X_new (same columns/order as training features):

from joblib import load
import pandas as pd

# Load one of the persisted pipelines
pipe = load('models/logreg_pipeline.joblib')  # or 'models/knn_pipeline.joblib'

# X_new must be a DataFrame with the same feature columns as training
# X_new = pd.DataFrame([...], columns=[...])
pred = pipe.predict(X_new)
proba = pipe.predict_proba(X_new)[:, 1]
print(pred, proba)

Notes:

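  • The Pipeline bundles resampling, scaling, and the classifier. Resampling (SMOTETomek) runs only during fitting; at prediction time the input is scaled and classified as-is, so pass raw, unscaled feature columns that match the training schema.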
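One way to guard against schema mismatches before calling `predict` is to select and reorder the columns explicitly. The helper below is a hypothetical sketch: `align_features` is not part of this project, and the column names are made up for illustration.

```python
import pandas as pd


def align_features(X_new, feature_columns):
    """Return X_new restricted to feature_columns, in that exact order."""
    missing = [c for c in feature_columns if c not in X_new.columns]
    if missing:
        raise ValueError(f"missing feature columns: {missing}")
    return X_new.loc[:, list(feature_columns)]


# Hypothetical training schema; the real one has 20 metric columns.
training_columns = ["loc", "wmc", "cbo"]

# New data with shuffled columns and one extra column.
raw = pd.DataFrame({"cbo": [3], "extra": [1], "loc": [120], "wmc": [4]})
X_aligned = align_features(raw, training_columns)
print(list(X_aligned.columns))
```

The aligned frame can then be passed to the loaded pipeline's `predict` and `predict_proba`.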
