# Objective


- Build a baseline Logistic Regression model using scikit-learn's default hyperparameters (only `max_iter` is raised, to 1000).
- Evaluate accuracy and understand initial performance.
- This serves as a benchmark for later optimization techniques (GridSearch, PSO, ACO).


`Import Libraries`

In [6]:
import pandas as pd
import time
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

`Load and Preprocess Dataset`

In [8]:
# For an in-depth analysis of data preprocessing and feature scaling, see file 02_Data_Preprocessing_and_Feature_Scaling

In [9]:
# Load dataset
df = pd.read_csv('C:/Users/ajayr/Desktop/Projects to upload/Metaheuristic_Optimization_for_Logistic_Regression/data/data.csv')

# Drop irrelevant columns
if 'id' in df.columns:
    df.drop(columns=['id'], inplace=True)

# Encode target variable
df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})

# Split features and target
X = df.drop(columns=['diagnosis'])
y = df['diagnosis']

# Normalize features using MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
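
`Series.map` with a dict returns NaN for any label that is not a key in the dict, so an unexpected value in `diagnosis` would pass silently into the split. A minimal sketch of a guard follows, using a small made-up frame in place of `data.csv`; the helper name `encode_diagnosis` is hypothetical and not part of this notebook.

```python
import pandas as pd

def encode_diagnosis(frame: pd.DataFrame) -> pd.Series:
    """Map 'M' -> 1 and 'B' -> 0, raising if any label is left unmapped."""
    encoded = frame["diagnosis"].map({"M": 1, "B": 0})
    if encoded.isna().any():
        # Collect the offending labels so the error message is actionable
        bad = sorted(frame.loc[encoded.isna(), "diagnosis"].astype(str).unique())
        raise ValueError(f"Unexpected diagnosis labels: {bad}")
    return encoded.astype(int)

# Toy data standing in for the real dataset (illustrative values only)
toy = pd.DataFrame({"diagnosis": ["M", "B", "B", "M"]})
encoded_toy = encode_diagnosis(toy)
print(encoded_toy.tolist())
```

In the notebook, the equivalent call would be `df['diagnosis'] = encode_diagnosis(df)`.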

`Train-Test Split`

In [11]:
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y)
print("Training samples:", len(X_train), "| Testing samples:", len(X_test))

Training samples: 455 | Testing samples: 114
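
Note that the cells above fit `MinMaxScaler` on the full feature matrix before splitting, so the minima and maxima of the test rows influence the scaling. One possible leakage-free variant splits first and wraps the scaler and model in a pipeline, so the scaler is fit on training rows only. The sketch below assumes scikit-learn's bundled `load_breast_cancer` data as a stand-in for `data.csv` (both have 569 rows, but that the two are identical is an assumption), and its score is not claimed to match the one reported in this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Stand-in data (assumption): the bundled dataset encodes malignant as 0,
# so flip it to match this notebook's encoding (1 = malignant, 0 = benign).
data = load_breast_cancer()
X_raw = data.data
y_enc = (data.target == 0).astype(int)

# Split BEFORE scaling so the test rows never influence the scaler
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_enc, test_size=0.2, random_state=42, stratify=y_enc)

# The pipeline fits MinMaxScaler on X_tr only, then reuses it on X_te
pipe = make_pipeline(
    MinMaxScaler(),
    LogisticRegression(max_iter=1000, random_state=42))
pipe.fit(X_tr, y_tr)

print("Training samples:", len(X_tr), "| Testing samples:", len(X_te))
print(f"Leakage-free test accuracy: {pipe.score(X_te, y_te):.4f}")
```

A pipeline also carries over cleanly to cross-validation, where the scaler is refit inside every fold.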


`Train Baseline Logistic Regression`

In [13]:
# Start timer
start_time = time.time()

In [14]:
# Initialize Logistic Regression with default hyperparameters
# (only max_iter is raised from its default of 100 to 1000)
log_reg = LogisticRegression(max_iter=1000, random_state=42)

# Fit the model
log_reg.fit(X_train, y_train)

In [15]:
# End timer
end_time = time.time()
execution_time = end_time - start_time
print(f"Training Time: {execution_time:.4f} seconds")

Training Time: 0.0308 seconds
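
Because the timer above starts in one cell and stops in another, the reported time also covers estimator construction and any delay between cell executions. A sketch of timing a single call with `time.perf_counter`, a monotonic clock intended for measuring short intervals, is shown below; the helper `timed_call` and the toy workload are hypothetical.

```python
import time

def timed_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Toy workload standing in for log_reg.fit(X_train, y_train)
total, seconds = timed_call(sum, range(1_000_000))
print(f"Result: {total} | Elapsed: {seconds:.4f} seconds")
```

In the notebook, this would read `_, execution_time = timed_call(log_reg.fit, X_train, y_train)`, which measures the fit alone.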


`Evaluate the Model`

In [17]:
# Predict on test data
y_pred = log_reg.predict(X_test)

In [18]:
# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Baseline Logistic Regression Accuracy: {accuracy:.4f}")

Baseline Logistic Regression Accuracy: 0.9737


In [19]:
# Classification report
print("\nClassification Report:\n", classification_report(y_test, y_pred))


Classification Report:
               precision    recall  f1-score   support

           0       0.96      1.00      0.98        72
           1       1.00      0.93      0.96        42

    accuracy                           0.97       114
   macro avg       0.98      0.96      0.97       114
weighted avg       0.97      0.97      0.97       114



In [20]:
# Confusion matrix
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

Confusion Matrix:
 [[72  0]
 [ 3 39]]
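
Every figure in the classification report follows from these four counts. As a check, the sketch below recomputes them by hand from the matrix printed above, using scikit-learn's layout (rows are true labels, columns are predicted labels).

```python
# Confusion matrix from the output above: rows = true, columns = predicted
cm = [[72, 0],
      [3, 39]]
tn, fp = cm[0]  # true benign: predicted benign, predicted malignant
fn, tp = cm[1]  # true malignant: predicted benign, predicted malignant

accuracy_cm = (tp + tn) / (tp + tn + fp + fn)
precision_malignant = tp / (tp + fp)
recall_malignant = tp / (tp + fn)
precision_benign = tn / (tn + fn)
recall_benign = tn / (tn + fp)
f1_malignant = (2 * precision_malignant * recall_malignant
                / (precision_malignant + recall_malignant))

print(f"Accuracy:            {accuracy_cm:.4f}")          # 0.9737
print(f"Malignant precision: {precision_malignant:.2f}")  # 1.00
print(f"Malignant recall:    {recall_malignant:.2f}")     # 0.93
print(f"Malignant F1:        {f1_malignant:.2f}")         # 0.96
print(f"Benign precision:    {precision_benign:.2f}")     # 0.96
print(f"Benign recall:       {recall_benign:.2f}")        # 1.00
```

The benign precision of 0.96 and the malignant recall of 0.93 both come from the same 3 false negatives, counted against 75 benign predictions and 42 true malignant cases respectively.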


# Conclusion 

The baseline Logistic Regression model achieved 97.37% accuracy on the test set (111 of 114 samples correct). For class 0 (Benign), recall is 1.00, so all 72 benign cases were correctly identified; precision is 0.96 because 3 of the 75 benign predictions were actually malignant. For class 1 (Malignant), precision is 1.00, meaning no false positives, but recall is lower at 0.93: 3 of the 42 malignant cases were misclassified as benign, as the confusion matrix shows. High accuracy therefore hides the costliest error in this setting, because a missed cancer diagnosis is riskier than a false alarm, so improving recall for malignant cases is the priority. This baseline is the benchmark for the hyperparameter tuning and metaheuristic optimization applied next.

# Summary

- Built a **baseline Logistic Regression model** using default hyperparameters (with `max_iter` raised to 1000).
- Achieved **Accuracy:** 97% on the test set.
- **Class 0 (Benign):** Precision = 0.96, Recall = 1.00 → Excellent detection.
- **Class 1 (Malignant):** Precision = 1.00, Recall = 0.93 → Slightly lower recall; 3 malignant cases missed.
- **Confusion Matrix:** [[72, 0], [3, 39]] → No false positives, but 3 false negatives.
- **Insight:** Overall accuracy is high, but recall for malignant cases (39 of 42 detected) is the metric that most needs improvement.
- **Next Step:** Apply hyperparameter tuning (GridSearchCV) to boost recall and overall performance.
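
As a preview of that next step, here is a hedged sketch of a grid search that selects `C` by malignant-class recall instead of accuracy. The grid values, the 5-fold setting, and the use of scikit-learn's bundled `load_breast_cancer` data as a stand-in for `data.csv` are all assumptions, not the settings used later in this project.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Stand-in data (assumption); re-encode so that 1 = malignant, 0 = benign
data = load_breast_cancer()
X_all = data.data
y_all = (data.target == 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42, stratify=y_all)

# Scaling lives inside the pipeline so it is refit within every CV fold
pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),
])

# Hypothetical grid over the inverse regularization strength C
param_grid = {"clf__C": [0.01, 0.1, 1, 10, 100]}

# scoring="recall" measures recall of the positive class (label 1, malignant)
search = GridSearchCV(pipe, param_grid, scoring="recall", cv=5)
search.fit(X_tr, y_tr)

print("Best C:", search.best_params_["clf__C"])
print(f"Cross-validated malignant recall: {search.best_score_:.4f}")
```

Optimizing recall alone can trade away precision, so the selected model should still be checked with a full classification report on the held-out test set.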
