Bug Prediction using KNN and Logistic Regression

This project applies machine learning to predict software bugs from the dataset Antfile17.csv. The goal is to classify whether a software module contains a bug (1) or not (0) based on 20 software-metric features.

Dataset

File: Antfile17.csv
Features: 20 software metrics (columns)
Label: bug (0 = No Bug, 1 = Bug)
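As a minimal sketch of the first workflow step (load the data, then check class balance), the helpers below assume the label column is literally named `bug` and that every other column is a feature. The function names `load_dataset` and `class_distribution` are illustrative and do not come from the notebooks.

```python
import pandas as pd


def load_dataset(path="Antfile17.csv", label="bug"):
    """Return (X, y): the feature columns and the binary label column."""
    df = pd.read_csv(path)
    X = df.drop(columns=[label])   # assumption: all non-label columns are features
    y = df[label].astype(int)      # 0 = no bug, 1 = bug
    return X, y


def class_distribution(y):
    """Count and share of each class, to spot imbalance before resampling."""
    counts = y.value_counts().sort_index()
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})


# Usage (requires the CSV next to the notebook):
# X, y = load_dataset()
# print(X.shape)
# print(class_distribution(y))
```

A strongly skewed `share` column is the motivation for the resampling step used in both pipelines.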

Project Structure

🔹 KNN Classifier

Two approaches are shown: a legacy sweep over k, and an improved pipeline tuned with cross-validation.

Legacy: sweep k = 1..30 and pick best on a simple split (kept for reference).
Improved: Pipeline (SMOTETomek → StandardScaler → KNN) with Stratified 5-fold GridSearchCV on train only, then evaluation on an untouched test set.
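For reference, the legacy sweep might look roughly like the sketch below. The source only says that k runs from 1 to 30 on a simple split, so the scaling step, the accuracy metric, and the name `sweep_k` are assumptions made for illustration.

```python
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler


def sweep_k(X, y, k_values=range(1, 31), test_size=0.2, random_state=42):
    """Fit KNN for each k on one train/test split and return (best_k, scores)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=random_state
    )
    # Fit the scaler on the training part only, then apply it to both parts.
    scaler = StandardScaler().fit(X_train)
    X_train_s = scaler.transform(X_train)
    X_test_s = scaler.transform(X_test)

    scores = {}
    for k in k_values:
        model = KNeighborsClassifier(n_neighbors=k).fit(X_train_s, y_train)
        scores[k] = accuracy_score(y_test, model.predict(X_test_s))

    best_k = max(scores, key=scores.get)  # first k with the highest accuracy
    return best_k, scores
```

Because k is chosen on the same split that is used for reporting, the resulting score tends to be optimistic; the improved approach avoids this by tuning with cross-validation on the training data only.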

Updated workflow:

Load dataset and check class distribution
Train/test split: 80% train / 20% test (stratified)
Imbalanced-learn Pipeline: SMOTETomek → StandardScaler → KNeighborsClassifier
Hyperparameter tuning via GridSearchCV (CV on train only)
Evaluate on test: Accuracy, F1, ROC AUC, Confusion Matrix, full classification report
Diagnostics: Precision–Recall (Average Precision) and Calibration (Brier score + reliability curve)
Persist best model: models/knn_pipeline.joblib

🔹 Logistic Regression

Pipeline (SMOTETomek → StandardScaler → LogisticRegression) with Stratified 5-fold GridSearchCV on train only.

Updated workflow:

Load dataset
Train/test split: 80% train / 20% test (stratified)
Pipeline: SMOTETomek → StandardScaler → LogisticRegression(max_iter=1000)
Hyperparameter tuning via GridSearchCV (CV on train only)
Evaluate on test: Accuracy, F1, ROC AUC, Confusion Matrix, full classification report
Diagnostics: Precision–Recall (Average Precision) and Calibration (Brier score + reliability curve)
Persist best model: models/logreg_pipeline.joblib
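The diagnostics step listed in both workflows can be computed from the test labels and the predicted probabilities of class 1. The sketch below uses standard scikit-learn functions; the helper name `probability_diagnostics` and the default of 10 bins are assumptions.

```python
from sklearn.calibration import calibration_curve
from sklearn.metrics import (average_precision_score, brier_score_loss,
                             precision_recall_curve)


def probability_diagnostics(y_true, proba, n_bins=10):
    """Summarize ranking quality (PR curve, AP) and calibration (Brier, reliability)."""
    precision, recall, _ = precision_recall_curve(y_true, proba)
    # calibration_curve returns (observed fraction of positives, mean predicted probability)
    frac_pos, mean_pred = calibration_curve(y_true, proba, n_bins=n_bins)
    return {
        "average_precision": average_precision_score(y_true, proba),
        "brier": brier_score_loss(y_true, proba),  # mean squared error of the probabilities
        "pr_curve": (precision, recall),
        "reliability_curve": (mean_pred, frac_pos),
    }


# Usage with a fitted pipeline:
# proba = pipe.predict_proba(X_test)[:, 1]
# diag = probability_diagnostics(y_test, proba)
```

A lower Brier score indicates better calibrated probabilities, and a reliability curve close to the diagonal means predicted probabilities match observed bug rates.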

Requirements

Install all required packages:

pip install scikit-learn
pip install imbalanced-learn
pip install pandas
pip install seaborn
pip install matplotlib
pip install joblib

Model persistence and inference

Both notebooks save their best cross-validated pipelines under models/:

models/knn_pipeline.joblib
models/logreg_pipeline.joblib

Example: load a saved pipeline and predict on new data X_new (same columns/order as training features):

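from joblib import load
import pandas as pd

# Load one of the persisted pipelines (imbalanced-learn must be installed to unpickle it)
pipe = load('models/logreg_pipeline.joblib')  # or 'models/knn_pipeline.joblib'

# X_new must be a DataFrame with the same feature columns, in the same order, as training:
# X_new = pd.DataFrame([...], columns=[...])
# Illustration only: reuse a few rows of the training file.
# Assumption: the label column is named 'bug' and all other columns are features.
X_new = pd.read_csv('Antfile17.csv').drop(columns=['bug']).head()

pred = pipe.predict(X_new)               # hard labels: 0 = no bug, 1 = bug
proba = pipe.predict_proba(X_new)[:, 1]  # estimated probability of class 1 (bug)
print(pred, proba)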

Notes:

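The pipeline encapsulates resampling and scaling, so pass raw, unscaled feature columns that match the training schema. Resampling (SMOTETomek) runs only while fitting; at prediction time the input passes through the scaler and the classifier only.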
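One way to guard against schema mismatches before calling the pipeline is a small column check like the sketch below. The helper `align_features` is hypothetical, and it assumes the list of training feature names is available (for example, saved alongside the model).

```python
import pandas as pd


def align_features(X_new, training_columns):
    """Return X_new restricted to the training columns, in training order.

    Raises ValueError if any training column is missing; extra columns are dropped.
    """
    missing = [c for c in training_columns if c not in X_new.columns]
    if missing:
        raise ValueError(f"missing feature columns: {missing}")
    return X_new.loc[:, list(training_columns)]


# Usage:
# X_ready = align_features(X_new, training_columns)
# pred = pipe.predict(X_ready)
```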
