# Model Development

**Author:**  Cesar Merino

**Goal:** Train multiple predictive models for coronary heart disease.

**Input:**
* Cleaned DataFrame from [Normalization & Correlation](./2_normalization_correlation.ipynb)

**Tasks:**  
• Split data into training and testing sets.  
• Train Logistic Regression (mandatory).  
• Train KNN model.  
• Train one tree-based model (Random Forest or similar).  
• Address class imbalance if needed.  
• Save trained models.  
• Record performance metrics.
  
**Deliverables:**  
• Trained models.  
• Initial performance metrics.

---

## 0. Imports and Global Configuration

In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

In [2]:
cleaned_df = pd.read_csv('../data/cleaned_df.csv')

final_features = ['sex',
 'age',
 'education_level',
 'current_smoker',
 'bp_meds',
 'prevalent_stroke',
 'prevalent_hypertension',
 'diabetes',
 'total_cholesterol',
 'systolic_bp',
 'diastolic_bp',
 'bmi',
 'heart_rate',
 'glucose',
 'ten_year_chd',
 'smoker_intensity',
 'pulse_pressure']

X_train, X_test, y_train, y_test = train_test_split(cleaned_df[final_features].drop('ten_year_chd', axis=1),
                                                    cleaned_df['ten_year_chd'],
                                                    test_size=0.2,
                                                    random_state=42,
                                                    stratify=cleaned_df['ten_year_chd'])

---

## 1. General models

In [3]:
models = {
    "logistic": LogisticRegression(),
    "random_forest": RandomForestClassifier(random_state=42),
    "svm": SVC(probability=True),
    "knn": KNeighborsClassifier()
}

In [4]:
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print(f"--- {name} ---")
    print(f"Accuracy: {accuracy_score(y_test, predictions):.2f}")
    print(classification_report(y_test, predictions, zero_division=0))

--- logistic ---
Accuracy: 0.86
              precision    recall  f1-score   support

           0       0.86      0.99      0.92       620
           1       0.73      0.14      0.24       112

    accuracy                           0.86       732
   macro avg       0.80      0.57      0.58       732
weighted avg       0.84      0.86      0.82       732

--- random_forest ---
Accuracy: 0.84
              precision    recall  f1-score   support

           0       0.85      0.97      0.91       620
           1       0.36      0.08      0.13       112

    accuracy                           0.84       732
   macro avg       0.61      0.53      0.52       732
weighted avg       0.78      0.84      0.79       732

--- svm ---
Accuracy: 0.85
              precision    recall  f1-score   support

           0       0.85      1.00      0.92       620
           1       0.40      0.02      0.03       112

    accuracy                           0.85       732
   macro avg       0.62      0.5

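As the classes are unbalanced, the general results are poor. Although accuracy is above 0.8 for the models shown, recall on the positive class is very low (0.14 at best), meaning that the models predict 'healthy' for almost every patient. To address this, logistic regression, random forest and SVM accept a `class_weight` parameter that weights each class inversely to its frequency, so that both classes carry the same overall importance during training. KNN has no such parameter.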
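
To see why accuracy alone is misleading here, the short calculation below scores a hypothetical classifier that predicts class 0 for every row. It is only a sketch: the class counts are copied from the `support` column of the reports above, and the "always healthy" classifier is an assumption made for illustration, not one of the trained models.

```python
# Class counts taken from the "support" column of the reports above.
n_negative = 620  # test rows with ten_year_chd == 0
n_positive = 112  # test rows with ten_year_chd == 1
n_total = n_negative + n_positive

# Hypothetical classifier that predicts 0 ("no CHD") for every row:
# every negative is a true negative, every positive is missed.
true_negatives = n_negative
true_positives = 0

baseline_accuracy = (true_negatives + true_positives) / n_total
baseline_recall_positive = true_positives / n_positive

print(f"Baseline accuracy: {baseline_accuracy:.2f}")                 # 0.85
print(f"Baseline recall (class 1): {baseline_recall_positive:.2f}")  # 0.00
```

An accuracy of about 0.85 is therefore reachable without detecting a single positive case, which is why positive-class recall is the more informative number in this notebook.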

---

## 2. Balanced models
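
For reference, scikit-learn documents `class_weight='balanced'` as assigning each class the weight `n_samples / (n_classes * count_of_class)`. The sketch below reproduces that formula in plain Python. The helper name `balanced_class_weights` and the example labels are illustrative assumptions; the labels only mimic the 620/112 ratio of the test split, whereas the estimators compute their weights from `y_train`.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * count_of_class)}."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Illustrative labels with the same class ratio as the test split (620 vs 112).
example_labels = [0] * 620 + [1] * 112
weights = balanced_class_weights(example_labels)

print({cls: round(w, 2) for cls, w in weights.items()})  # {0: 0.59, 1: 3.27}
```

With this ratio, an error on a positive sample counts roughly 5.5 times as much as an error on a negative one.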

In [5]:
models = {
    "logistic": LogisticRegression(class_weight='balanced'),
    "random_forest": RandomForestClassifier(random_state=42, class_weight='balanced'),
    "svm": SVC(probability=True, class_weight='balanced'),
    "knn": KNeighborsClassifier()
}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print(f"--- {name} ---")
    print(f"Accuracy: {accuracy_score(y_test, predictions):.2f}")
    print(classification_report(y_test, predictions, zero_division=0))

--- logistic ---
Accuracy: 0.69
              precision    recall  f1-score   support

           0       0.92      0.70      0.79       620
           1       0.28      0.67      0.40       112

    accuracy                           0.69       732
   macro avg       0.60      0.68      0.60       732
weighted avg       0.82      0.69      0.73       732

--- random_forest ---
Accuracy: 0.85
              precision    recall  f1-score   support

           0       0.86      1.00      0.92       620
           1       0.73      0.07      0.13       112

    accuracy                           0.85       732
   macro avg       0.79      0.53      0.53       732
weighted avg       0.84      0.85      0.80       732

--- svm ---
Accuracy: 0.71
              precision    recall  f1-score   support

           0       0.92      0.72      0.81       620
           1       0.29      0.63      0.40       112

    accuracy                           0.71       732
   macro avg       0.60      0.6

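As can be seen, the accuracy of logistic regression and SVM has dropped (from 0.86 to 0.69 and from 0.85 to 0.71, respectively). Despite this, the results are better: recall on the positive class is now above 60%, meaning that these models detect most of the positive cases, at the price of many false positives (precision of 0.28 to 0.29). The random forest is barely affected by the class weights (recall of 0.07). Next, the hyperparameters are varied to try to obtain better results.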
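
One possible way to compare hyperparameter variants without repeatedly scoring them on the test set is stratified cross-validation on the training split. The sketch below is not part of the original notebook: it uses synthetic data from `make_classification` as a stand-in for `X_train` and `y_train`, and the fold count and the two candidates are arbitrary choices made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for (X_train, y_train) with roughly 15% positives.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.85, 0.15], random_state=42
)

# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "LR_C01": LogisticRegression(C=0.1, class_weight="balanced", max_iter=1000),
    "LR_C10": LogisticRegression(C=10, class_weight="balanced", max_iter=1000),
}

cv_recall = {}
for name, model in candidates.items():
    # scoring="recall" measures recall on the positive class (label 1).
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="recall")
    cv_recall[name] = scores.mean()
    print(f"{name}: recall = {scores.mean():.2f} +/- {scores.std():.2f}")
```

The spread across folds gives a rough idea of how much of a difference between two variants is just noise from the split.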

---

## 3. Tweak models

In [6]:
models_logistic = {
    "LR_C01": LogisticRegression(C=0.1, class_weight='balanced', max_iter=1000), # smaller C = stronger regularization
    "LR_C10": LogisticRegression(C=10, class_weight='balanced', max_iter=1000), # larger C = weaker regularization
    "LR_Hard_Penalization": LogisticRegression(class_weight={0: 1, 1: 10}, max_iter=1000), # hard penalization on positive class
    "LR_Liblinear": LogisticRegression(solver='liblinear', class_weight='balanced') # different solver
}
for name, model in models_logistic.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print(f"--- {name} ---")
    print(f"Accuracy: {accuracy_score(y_test, predictions):.2f}")
    print(classification_report(y_test, predictions, zero_division=0))

--- LR_C01 ---
Accuracy: 0.69
              precision    recall  f1-score   support

           0       0.92      0.70      0.79       620
           1       0.28      0.66      0.39       112

    accuracy                           0.69       732
   macro avg       0.60      0.68      0.59       732
weighted avg       0.82      0.69      0.73       732

--- LR_C10 ---
Accuracy: 0.69
              precision    recall  f1-score   support

           0       0.92      0.70      0.79       620
           1       0.28      0.67      0.40       112

    accuracy                           0.69       732
   macro avg       0.60      0.68      0.60       732
weighted avg       0.82      0.69      0.73       732

--- LR_Hard_Penalization ---
Accuracy: 0.52
              precision    recall  f1-score   support

           0       0.94      0.46      0.62       620
           1       0.22      0.85      0.35       112

    accuracy                           0.52       732
   macro avg       0.58 

In [7]:
models_rf = {
    "RF_Balanced": RandomForestClassifier(class_weight='balanced', random_state=42), # balanced classes
    "RF_Balanced_Subsample": RandomForestClassifier(class_weight='balanced_subsample', random_state=42), # balanced per bootstrap sample
    "RF_Limited_Depth": RandomForestClassifier(max_depth=5, class_weight='balanced', random_state=42), # limit depth to reduce overfitting
    "RF_More_Trees": RandomForestClassifier(n_estimators=500, class_weight='balanced', random_state=42) # more trees
}

for name, model in models_rf.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print(f"--- {name} ---")
    print(f"Accuracy: {accuracy_score(y_test, predictions):.2f}")
    print(classification_report(y_test, predictions, zero_division=0))

--- RF_Balanced ---
Accuracy: 0.85
              precision    recall  f1-score   support

           0       0.86      1.00      0.92       620
           1       0.73      0.07      0.13       112

    accuracy                           0.85       732
   macro avg       0.79      0.53      0.53       732
weighted avg       0.84      0.85      0.80       732

--- RF_Balanced_Subsample ---
Accuracy: 0.85
              precision    recall  f1-score   support

           0       0.86      0.99      0.92       620
           1       0.67      0.07      0.13       112

    accuracy                           0.85       732
   macro avg       0.76      0.53      0.52       732
weighted avg       0.83      0.85      0.80       732

--- RF_Limited_Depth ---
Accuracy: 0.73
              precision    recall  f1-score   support

           0       0.90      0.77      0.83       620
           1       0.30      0.54      0.38       112

    accuracy                           0.73       732
   macro

In [8]:
models_svm = {
    "SVM_RBF_Standard": SVC(kernel='rbf', class_weight='balanced', probability=True), # standard RBF kernel
    "SVM_Linear": SVC(kernel='linear', class_weight='balanced', probability=True), # linear kernel
    "SVM_Polynomial": SVC(kernel='poly', degree=3, class_weight='balanced', probability=True), # polynomial kernel
    "SVM_C_High": SVC(C=10, class_weight='balanced', probability=True) # high C value for less regularization
}

for name, model in models_svm.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print(f"--- {name} ---")
    print(f"Accuracy: {accuracy_score(y_test, predictions):.2f}")
    print(classification_report(y_test, predictions, zero_division=0))

--- SVM_RBF_Standard ---
Accuracy: 0.71
              precision    recall  f1-score   support

           0       0.92      0.72      0.81       620
           1       0.29      0.63      0.40       112

    accuracy                           0.71       732
   macro avg       0.60      0.68      0.60       732
weighted avg       0.82      0.71      0.75       732

--- SVM_Linear ---
Accuracy: 0.67
              precision    recall  f1-score   support

           0       0.93      0.67      0.78       620
           1       0.28      0.71      0.40       112

    accuracy                           0.67       732
   macro avg       0.60      0.69      0.59       732
weighted avg       0.83      0.67      0.72       732

--- SVM_Polynomial ---
Accuracy: 0.77
              precision    recall  f1-score   support

           0       0.91      0.82      0.86       620
           1       0.34      0.53      0.41       112

    accuracy                           0.77       732
   macro avg    

In [9]:
models_knn = {
    "KNN_3": KNeighborsClassifier(n_neighbors=3), # number of neighbors = 3
    "KNN_7": KNeighborsClassifier(n_neighbors=7), # number of neighbors = 7
    "KNN_Weight_Distance": KNeighborsClassifier(n_neighbors=5, weights='distance'), # weight by distance
    "KNN_Algorithm_BallTree": KNeighborsClassifier(algorithm='ball_tree') # ball-tree neighbor search (changes the search structure, not the decision rule)
}

for name, model in models_knn.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print(f"--- {name} ---")
    print(f"Accuracy: {accuracy_score(y_test, predictions):.2f}")
    print(classification_report(y_test, predictions, zero_division=0))

--- KNN_3 ---
Accuracy: 0.83
              precision    recall  f1-score   support

           0       0.86      0.96      0.91       620
           1       0.36      0.13      0.19       112

    accuracy                           0.83       732
   macro avg       0.61      0.55      0.55       732
weighted avg       0.78      0.83      0.80       732

--- KNN_7 ---
Accuracy: 0.84
              precision    recall  f1-score   support

           0       0.86      0.98      0.91       620
           1       0.46      0.11      0.17       112

    accuracy                           0.84       732
   macro avg       0.66      0.54      0.54       732
weighted avg       0.80      0.84      0.80       732

--- KNN_Weight_Distance ---
Accuracy: 0.84
              precision    recall  f1-score   support

           0       0.86      0.97      0.91       620
           1       0.41      0.11      0.17       112

    accuracy                           0.84       732
   macro avg       0.64    

---

## 4. Results

Based on the preliminary metrics, logistic regression and SVM give the most acceptable results, with positive-class recall above 0.6. The hyperparameter tweaks brought some improvement, although it is not remarkable. KNN and random forest, on the other hand, detect few positive cases: in the configurations shown, their recall stays between 0.07 and 0.13, and only the depth-limited random forest reaches 0.54.
In the end, the choice will depend on how the predictions are used and on the relative cost of false positives and false negatives.
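
The trade-off between false positives and false negatives can be made explicit by attaching a cost to each error type and choosing the decision threshold that minimizes the total. The sketch below is purely illustrative: the labels, the probabilities, the 5:1 cost ratio and the helper `total_cost` are all hypothetical and do not come from the models trained above.

```python
def total_cost(y_true, y_prob, threshold, cost_fn, cost_fp):
    """Total cost of predicting class 1 whenever the probability >= threshold."""
    cost = 0.0
    for label, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if label == 1 and predicted == 0:
            cost += cost_fn  # missed positive case
        elif label == 0 and predicted == 1:
            cost += cost_fp  # false alarm
    return cost


# Hypothetical labels and predicted probabilities of class 1.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.20, 0.35, 0.45, 0.40, 0.60, 0.55, 0.90]

# Hypothetical costs: a missed case is five times as costly as a false alarm.
COST_FN, COST_FP = 5.0, 1.0

thresholds = [0.3, 0.4, 0.5, 0.6]
costs = {t: total_cost(y_true, y_prob, t, COST_FN, COST_FP) for t in thresholds}
best_threshold = min(costs, key=costs.get)

print(costs)           # {0.3: 3.0, 0.4: 2.0, 0.5: 6.0, 0.6: 11.0}
print(best_threshold)  # 0.4
```

With real models, `y_prob` would come from `predict_proba` on a validation set rather than the test set, so that the chosen threshold does not leak into the final metrics.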

---

## 5. Summary

In this section, four different model types (logistic regression, random forest, SVM and KNN) have been used to predict ten-year coronary heart disease risk.

To obtain the metrics for each model, the dataset was split once into training and test sets in a 0.8/0.2 proportion, stratified by the target. No cross-validation was done, so the metrics come from a single split and, because the hyperparameter variants were compared on that same test set, they may be optimistic.
Class imbalance has been addressed with class weights, and several hyperparameter variants of each model have been tested to try to obtain the best results. The trained models are kept in in-memory dictionaries, one per model type, and the results for each model are displayed. These results include accuracy, precision, recall and F1-score.
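
The dictionaries hold the fitted models only for the duration of the notebook session. One common way to persist them is `joblib`, sketched below under stated assumptions: the data is synthetic, the `trained` dictionary is a stand-in for the notebook's dictionaries, and a temporary directory replaces whatever output folder the project would actually use.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

# Stand-in for one of the notebook's model dictionaries.
trained = {
    "logistic": LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_demo, y_demo)
}

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write one file per model, named after its dictionary key.
    for name, model in trained.items():
        joblib.dump(model, str(Path(tmp_dir) / f"{name}.joblib"))
    reloaded = joblib.load(str(Path(tmp_dir) / "logistic.joblib"))

# The reloaded model should reproduce the original predictions.
same_predictions = bool(
    (reloaded.predict(X_demo) == trained["logistic"].predict(X_demo)).all()
)
print(same_predictions)
```

Files written this way should be loaded with the same scikit-learn version that produced them.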

The results show that none of the models achieves both high recall and high precision on the positive class (the best F1-scores are around 0.4), so more advanced models might be needed to obtain better results.