# Baseline Models

We evaluated four baseline models for our project: Logistic Regression, Support Vector Machine (SVM), Random Forest, and XGBoost.

**Logistic Regression** serves as a simple, interpretable baseline to assess the linear separability of the data.  
**Support Vector Machine (SVM)** is included for its robustness to noise and ability to model non-linear decision boundaries using kernel functions.  
**Random Forest** and **XGBoost** are tree-based models that offer greater flexibility, are less sensitive to class imbalance, and provide transparency through feature importance analysis.

To address class imbalance:  
- We trained all baseline models on a SMOTEd version of the training data.  
- Each model was evaluated using class-based metrics: **Precision**, **Recall**, and **F1 score**. Less focus was placed on **Accuracy**, as it may be misleading due to the class imbalance.
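
To make the concern about accuracy concrete, the minimal sketch below scores a hypothetical classifier that predicts 'Good' (1) for every row. It uses the test-set class counts from the `support` column of the classification reports in this notebook (641 Bad, 6,651 Good); the variable names are illustrative and not part of the project code.

```python
# Test-set class counts, taken from the "support" column of the
# classification reports in this notebook.
n_bad, n_good = 641, 6651

# Hypothetical trivial classifier: predict 'Good' (1) for every row.
y_true = [0] * n_bad + [1] * n_good
y_pred = [1] * (n_bad + n_good)

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on class 0 = bad-credit rows predicted as 0 / all bad-credit rows.
bad_caught = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
recall_bad = bad_caught / n_bad

print(f"Accuracy of always predicting Good: {accuracy:.3f}")
print(f"Recall on Bad credit (class 0):     {recall_bad:.3f}")
```

Under these counts the trivial classifier reaches roughly 91% accuracy while detecting no bad-credit cases, which is why the per-class metrics carry more weight here.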


In [1]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, f1_score
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
import warnings
from sklearn.exceptions import ConvergenceWarning
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier


# Load datasets
train_df = pd.read_csv("../data/processed/train_set_SMOTEd.csv")
test_df = pd.read_csv("../data/processed/test_set.csv")

### Data Preparation

We encode the credit status variable, assigning `'Good' = 1` and `'Bad' = 0`.  
A `MinMaxScaler` is applied to unscaled numerical features to normalize them to the [0, 1] range, ensuring compatibility with models sensitive to feature scales, such as Logistic Regression and SVM.
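
As a rough illustration of what min-max scaling does, the sketch below applies the formula x' = (x - min) / (max - min) by hand, taking the minimum and maximum from the training column only. The helper names and the income values are made up for this example, and mapping a constant column to 0.0 is a choice made for the sketch.

```python
def fit_min_max(column):
    """Return the (min, max) of a training column."""
    return min(column), max(column)

def apply_min_max(column, col_min, col_max):
    """Scale values with x' = (x - min) / (max - min)."""
    span = col_max - col_min
    if span == 0:
        # Constant column: map everything to 0.0 to avoid dividing by zero.
        return [0.0 for _ in column]
    return [(x - col_min) / span for x in column]

# Made-up values for illustration only.
train_income = [20_000, 50_000, 80_000]
test_income = [35_000, 95_000]

# Learn the range on the training data, then reuse it for the test data.
col_min, col_max = fit_min_max(train_income)
train_scaled = apply_min_max(train_income, col_min, col_max)
test_scaled = apply_min_max(test_income, col_min, col_max)

print(train_scaled)  # [0.0, 0.5, 1.0]
print(test_scaled)   # [0.25, 1.25]
```

Because the range is learned on the training data, a test value above the training maximum scales to more than 1, so the [0, 1] guarantee strictly holds only for the training set.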

In [2]:
# Define feature columns: everything except the target, the ID, and the two log_* columns
feature_columns = [col for col in train_df.columns if col not in ["credit_status", "id", "log_annual_income", "log_years_employed"]]

# Split features and target
X_train = train_df[feature_columns]
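label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(train_df["credit_status"])  # classes are sorted, so 'Bad' -> 0, 'Good' -> 1
X_test = test_df[feature_columns]
y_test = label_encoder.transform(test_df["credit_status"])  # reuse the mapping learned on the training labels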

# Scale features to the [0, 1] range (scaler is fit on the training data only)
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

## 1. Logistic Regression Model

In [3]:
# Train logistic regression model
logreg = LogisticRegression(max_iter=1000, random_state=42, class_weight="balanced")
logreg.fit(X_train_scaled, y_train);

# Predict on test set
y_pred = logreg.predict(X_test_scaled)

# Attach predictions to customer IDs
results_df = test_df[["id"]].copy()
results_df["predicted_credit_status"] = y_pred

# Display results
print(results_df.head())

# Evaluate performance
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

        id  predicted_credit_status
0  5052720                        0
1  5087861                        1
2  5068206                        0
3  5137255                        0
4  5023163                        1
Accuracy: 0.849424026330225

Confusion Matrix:
 [[  61  580]
 [ 518 6133]]

Classification Report:
               precision    recall  f1-score   support

           0       0.11      0.10      0.10       641
           1       0.91      0.92      0.92      6651

    accuracy                           0.85      7292
   macro avg       0.51      0.51      0.51      7292
weighted avg       0.84      0.85      0.85      7292



The model performs well on the majority class (Good credit), with a **precision of 0.91** and **recall of 0.92**. On the minority class (Bad credit) it performs poorly, with a **precision of 0.11** and **recall of 0.10**: it correctly identifies only 61 of the 641 individuals with bad credit in the test set, which is a critical failure in credit risk applications.

These results suggest that the classes are likely **not linearly separable** in this feature space, and that a linear model like logistic regression cannot capture the underlying complexity of the problem. More flexible, non-linear models, such as kernelized SVMs or tree-based models, may be better suited for this task.
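
The per-class figures in the report can be recomputed directly from the confusion matrix. The sketch below does this for the logistic regression output above; `class_metrics` is a helper written for this illustration, not part of scikit-learn.

```python
# Logistic regression confusion matrix from the output above
# (rows = true class, columns = predicted class).
cm = [[61, 580],
      [518, 6133]]

def class_metrics(cm, cls):
    """Precision, recall and F1 for class index `cls` of a 2x2 matrix."""
    other = 1 - cls
    tp = cm[cls][cls]
    fp = cm[other][cls]   # other class predicted as `cls`
    fn = cm[cls][other]   # `cls` predicted as the other class
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

for cls, name in [(0, "Bad"), (1, "Good")]:
    p, r, f = class_metrics(cm, cls)
    print(f"class {cls} ({name}): precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

For class 0 this gives 61 / 579 for precision and 61 / 641 for recall, which round to the 0.11 and 0.10 shown in the report.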

## 2. SVM Model

We performed hyperparameter tuning for the SVM model, as its performance is highly sensitive to both the **kernel choice** and the **regularization parameter (C)**. These hyperparameters directly influence the model’s capacity to handle our non-linearly separable data:

- The **kernel function** determines how the input space is transformed into a higher-dimensional feature space, enabling the SVM to learn non-linear decision boundaries.
- The **regularization parameter (C)** controls the trade-off between maximizing the margin and minimizing classification error. A smaller C encourages a wider margin at the cost of some misclassifications, while a larger C prioritizes correct classification of training points.
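
For reference, the four kernels compared below can be written as plain functions of two feature vectors. This is a minimal sketch of the standard kernel formulas; the `gamma`, `coef0`, and `degree` values are placeholders chosen for illustration and are not the settings used in this notebook.

```python
import math

def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def linear_kernel(x, y):
    return dot(x, y)

def poly_kernel(x, y, gamma=1.0, coef0=0.0, degree=3):
    return (gamma * dot(x, y) + coef0) ** degree

def rbf_kernel(x, y, gamma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def sigmoid_kernel(x, y, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(x, y) + coef0)

# Two made-up points, for illustration only.
a, b = [0.2, 0.7], [0.4, 0.1]
for kernel in (linear_kernel, poly_kernel, rbf_kernel, sigmoid_kernel):
    print(f"{kernel.__name__}: {kernel(a, b):.4f}")
```

Each function returns a similarity score between two points; the SVM uses these scores in place of raw inner products, which is what allows a non-linear decision boundary in the original feature space.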

In [4]:
# Suppress ConvergenceWarnings
warnings.filterwarnings("ignore", category=ConvergenceWarning)

# List of kernels to compare
kernels = ['linear', 'rbf', 'poly', 'sigmoid']

# Dictionary to store results
f1_scores = {}

# Train and evaluate SVM for each kernel
for kernel in kernels:
    print(f" Training SVM with kernel = '{kernel}'")
    model = SVC(kernel=kernel, probability=False, random_state=42, max_iter=1000)
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    score = f1_score(y_test, y_pred)
    f1_scores[kernel] = score
    print(f" F1 Score ({kernel}): {score:.4f}\n")

# Summary
print(" F1 Score Comparison:")
for kernel, score in f1_scores.items():
    print(f" - {kernel}: {score:.4f}")

 Training SVM with kernel = 'linear'
 F1 Score (linear): 0.8624

 Training SVM with kernel = 'rbf'
 F1 Score (rbf): 0.5911

 Training SVM with kernel = 'poly'
 F1 Score (poly): 0.9121

 Training SVM with kernel = 'sigmoid'
 F1 Score (sigmoid): 0.9276

 F1 Score Comparison:
 - linear: 0.8624
 - rbf: 0.5911
 - poly: 0.9121
 - sigmoid: 0.9276


The sigmoid kernel yielded the highest F1 score (0.9276), so we selected it for further tuning. Because `f1_score` defaults to `pos_label=1`, this score measures performance on class 1 (Good credit) only; it says little about how well each kernel detects class 0.

In [8]:
# Train and evaluate SVM with sigmoid kernel for each C value
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100]
}

grid_search = GridSearchCV(
    estimator = SVC(kernel = 'sigmoid', probability = False, random_state = 42, max_iter = 1000),
    param_grid = param_grid,
    cv = 3,
    scoring = 'f1',
    n_jobs = 2,
    verbose = 2
)

grid_search.fit(X_train_scaled, y_train)
print(" Best C:", grid_search.best_params_['C'])

Fitting 3 folds for each of 5 candidates, totalling 15 fits
 Best C: 0.01


Based on the cross-validation results, we selected C = 0.01 to train our final SVM baseline model.
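
With `scoring = 'f1'`, the grid search ranks candidates by the F1 score of class 1. One possible alternative, sketched below, is to rank them by the F1 score of the minority class with `make_scorer(f1_score, pos_label=0)`. The data here is a synthetic stand-in generated with `make_classification`, not the credit dataset, so the selected C is not meaningful for this project.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in data (about 10% class 0); not the credit data.
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.1, 0.9], random_state=42
)
X_scaled = MinMaxScaler().fit_transform(X)

# Score each candidate by the F1 of class 0 instead of the default class 1.
minority_f1 = make_scorer(f1_score, pos_label=0, zero_division=0)

search = GridSearchCV(
    estimator=SVC(kernel="sigmoid", random_state=42),
    param_grid={"C": [0.01, 0.1, 1, 10, 100]},
    cv=3,
    scoring=minority_f1,
)
search.fit(X_scaled, y)
print("Best C by class-0 F1:", search.best_params_["C"])
```

Setting `zero_division=0` makes a candidate that never predicts class 0 score 0 for that fold rather than raising a warning.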

In [10]:
# Train final SVM model using best C and kernel
svm_final = SVC(kernel = 'sigmoid', C = 0.01, random_state = 42, max_iter = 1000)
svm_final.fit(X_train_scaled, y_train);

# Predict on test data
y_pred = svm_final.predict(X_test_scaled)

# Attach predictions to customer IDs
results_df = test_df[["id"]].copy()
results_df["predicted_credit_status"] = y_pred

# Display results
print(results_df.head())

# Evaluate performance
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

        id  predicted_credit_status
0  5052720                        1
1  5087861                        1
2  5068206                        1
3  5137255                        1
4  5023163                        1
Accuracy: 0.8974218321448162

Confusion Matrix:
 [[  15  626]
 [ 122 6529]]

Classification Report:
               precision    recall  f1-score   support

           0       0.11      0.02      0.04       641
           1       0.91      0.98      0.95      6651

    accuracy                           0.90      7292
   macro avg       0.51      0.50      0.49      7292
weighted avg       0.84      0.90      0.87      7292



The model predicted class 1 (Good credit) for nearly all instances (7,155 of 7,292), giving that class a **precision of 0.91** and **recall of 0.98**. The minority class (Bad credit) was poorly classified, with a **precision of 0.11**, **recall of 0.02**, and **F1-score of just 0.04**.
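
The gap between the two classes is easy to miss when only a single F1 value is reported. As a small check, the sketch below rebuilds label vectors from the SVM confusion matrix above and calls `f1_score` with and without `pos_label=0`; the reconstructed vectors are only a stand-in for the real predictions.

```python
from sklearn.metrics import f1_score

# Rebuild label vectors from the SVM confusion matrix shown above:
# [[15, 626], [122, 6529]] (rows = true class, columns = predicted class).
y_true = [0] * 15 + [0] * 626 + [1] * 122 + [1] * 6529
y_pred = [0] * 15 + [1] * 626 + [0] * 122 + [1] * 6529

f1_good = f1_score(y_true, y_pred)              # default: pos_label=1
f1_bad = f1_score(y_true, y_pred, pos_label=0)  # minority class

print(f"F1 for class 1 (Good): {f1_good:.2f}")
print(f"F1 for class 0 (Bad):  {f1_bad:.2f}")
```

The default call reports only the class-1 score, so a model that almost never predicts class 0 can still look strong under it.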

These results indicate that the sigmoid kernel did not capture the underlying structure of the data, particularly for the minority class. Despite testing several kernel types and running a grid search over the regularization parameter \(C\), the SVM continues to underperform on class 0. Every SVM fit was capped at `max_iter = 1000` solver iterations, so some of these models may not have fully converged.

This reinforces the need for more effective handling of class imbalance via alternative model architectures, such as tree-based models.

## 3. Random Forest Model

In [11]:
# Building the Random Forest model
rf = RandomForestClassifier(random_state=42, class_weight='balanced')
rf.fit(X_train, y_train);

# Predictions and Evaluation
y_pred = rf.predict(X_test)
# Attach predictions to customer IDs
results_df = test_df[["id"]].copy()
results_df["predicted_credit_status"] = y_pred

# Display results
print(results_df.head())

# Evaluate performance
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

        id  predicted_credit_status
0  5052720                        1
1  5087861                        1
2  5068206                        1
3  5137255                        1
4  5023163                        1
Accuracy: 0.8775370268787712

Confusion Matrix:
 [[ 228  413]
 [ 480 6171]]

Classification Report:
               precision    recall  f1-score   support

           0       0.32      0.36      0.34       641
           1       0.94      0.93      0.93      6651

    accuracy                           0.88      7292
   macro avg       0.63      0.64      0.64      7292
weighted avg       0.88      0.88      0.88      7292



The model attained a **precision of 0.32** and **recall of 0.36** for class 0, resulting in an **F1-score of 0.34**. While performance on the minority class is still limited, it marks a notable improvement over the Logistic Regression and SVM models. For class 1 (Good credit), the model maintained high performance, with an F1-score of 0.93.

These results indicate that the Random Forest model is better able to handle the non-linear relationships and class imbalance, making it a more effective baseline than linear classifiers in this context.


## 4. XGBoost Model

In [12]:
# Initialize and train XGBoost classifier
xgb_model = XGBClassifier(eval_metric='logloss')
xgb_model.fit(X_train, y_train);

# Predict on test set
y_pred = xgb_model.predict(X_test)

# Attach predictions to customer IDs
results_df = test_df[["id"]].copy()
results_df["predicted_credit_status"] = y_pred

# Display results
print(results_df.head())

# Evaluate performance
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

        id  predicted_credit_status
0  5052720                        1
1  5087861                        1
2  5068206                        1
3  5137255                        1
4  5023163                        1
Accuracy: 0.8809654415798135

Confusion Matrix:
 [[ 119  522]
 [ 346 6305]]

Classification Report:
               precision    recall  f1-score   support

           0       0.26      0.19      0.22       641
           1       0.92      0.95      0.94      6651

    accuracy                           0.88      7292
   macro avg       0.59      0.57      0.58      7292
weighted avg       0.86      0.88      0.87      7292



The XGBoost model also demonstrates strong performance on the majority class (Good credit), with a **precision of 0.92**, **recall of 0.95**, and **F1-score of 0.94**.

For the minority class (Bad credit), XGBoost underperformed compared to Random Forest, with a **recall of 0.19** and **F1-score of 0.22**. While this marks a decline from Random Forest, it still represents a substantial improvement over Logistic Regression and the SVM, which detected almost none of class 0.
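
To put the four baselines side by side, the sketch below recomputes class-0 recall and F1 from the test-set confusion matrices printed in this notebook and ranks the models by class-0 F1. The `minority_scores` helper and the dictionary layout are illustrative only.

```python
# Test-set confusion matrices copied from the outputs in this notebook
# (rows = true class, columns = predicted class).
confusion = {
    "Logistic Regression": [[61, 580], [518, 6133]],
    "SVM (sigmoid)":       [[15, 626], [122, 6529]],
    "Random Forest":       [[228, 413], [480, 6171]],
    "XGBoost":             [[119, 522], [346, 6305]],
}

def minority_scores(cm):
    """Recall and F1 for class 0 (Bad credit) from a 2x2 matrix."""
    tp, fn = cm[0]   # true class 0: predicted 0, predicted 1
    fp = cm[1][0]    # true class 1 predicted as 0
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return recall, f1

scores = {name: minority_scores(cm) for name, cm in confusion.items()}
ranking = sorted(scores, key=lambda name: scores[name][1], reverse=True)

for name in ranking:
    recall, f1 = scores[name]
    print(f"{name:<20} recall={recall:.2f}  f1={f1:.2f}")
```

On these matrices Random Forest ranks first on class-0 F1, followed by XGBoost, Logistic Regression, and the SVM.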