## Model Training

Importing packages

In [34]:
import pandas as pd

from datetime import datetime

from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold

from sklearn.metrics import make_scorer, accuracy_score, f1_score, recall_score, precision_score

from sklearn.svm import LinearSVC
from sklearn.dummy import DummyClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

from xgboost import XGBClassifier

seed = 31 # defined to ensure that the random processes are reproducible

We will apply the same data cleaning and preparation steps used in the first notebook (1. EDA CARDIO.ipynb) in this repository.

Specifically, we will:

- Remove the id column.
- Map the numerical inputs gluc and cholesterol, which have possible values 1, 2, and 3, into 'normal', 'above normal', and 'well above normal'. This mapping will then be converted into one-hot encoding to avoid introducing bias related to the levels of these features.
- Convert age to years for better interpretability.

In [9]:
def get_cardio_df():
    return pd.read_csv('data/cardio.csv', sep=";")

def drop_id_column(input_df):
    df_without_id = input_df.copy()
    df_without_id = df_without_id.drop(columns=['id'])
    return df_without_id

def map_features_categories(input_df):
    df_with_categories = input_df.copy()

    level_dict = {1: 'normal', 2: 'above normal', 3: 'well above normal'}
    df_with_categories['cholesterol'] = df_with_categories['cholesterol'].map(level_dict)
    df_with_categories['gluc'] = df_with_categories['gluc'].map(level_dict)

    return df_with_categories

def convert_age_to_years(input_df):
    df_age_formatted = input_df.copy()
    df_age_formatted['age'] = df_age_formatted['age'] / 365
    return df_age_formatted


def get_formatted_df():
    df0 = get_cardio_df()
    df1 = drop_id_column(df0)
    df2 = map_features_categories(df1)
    df3 = convert_age_to_years(df2)
    return df3

df = get_formatted_df()
df.head()

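|   | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio |
|---|-----|--------|--------|--------|-------|-------|-------------|------|-------|------|--------|--------|
| 0 | 50.391781 | 2 | 168 | 62.0 | 110 | 80 | normal | normal | 0 | 0 | 1 | 0 |
| 1 | 55.419178 | 1 | 156 | 85.0 | 140 | 90 | well above normal | normal | 0 | 0 | 1 | 1 |
| 2 | 51.663014 | 1 | 165 | 64.0 | 130 | 70 | well above normal | normal | 0 | 0 | 0 | 1 |
| 3 | 48.282192 | 2 | 169 | 82.0 | 150 | 100 | normal | normal | 0 | 0 | 1 | 1 |
| 4 | 47.873973 | 1 | 156 | 56.0 | 100 | 60 | normal | normal | 0 | 0 | 0 | 0 |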
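
One property of the category mapping is worth checking: `Series.map` with a dictionary returns `NaN` for any value that is not a key. The sketch below illustrates this on a small made-up frame (the `toy` data and the invalid level `4` are assumptions for illustration, not values from the cardio dataset).

```python
import pandas as pd

# Toy frame standing in for the cardio data; the value 4 is deliberately invalid.
toy = pd.DataFrame({"cholesterol": [1, 2, 3, 4], "gluc": [1, 1, 2, 3]})

level_dict = {1: "normal", 2: "above normal", 3: "well above normal"}
mapped = toy.assign(
    cholesterol=toy["cholesterol"].map(level_dict),
    gluc=toy["gluc"].map(level_dict),
)

# Series.map leaves NaN wherever the key is missing from the dictionary,
# so counting NaNs after the mapping reveals unexpected levels.
unmapped_counts = mapped[["cholesterol", "gluc"]].isna().sum()
print(unmapped_counts)
```

Running the same `isna().sum()` check on the real `df` would confirm that every `cholesterol` and `gluc` value fell into one of the three expected levels.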


Now we will create the X (features) and y (target) variables:

In [10]:
X = df.drop(columns=['cardio'])  # `columns=` already selects the axis, so `axis=1` is redundant
y = df['cardio']

In [11]:
X.head()

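|   | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active |
|---|-----|--------|--------|--------|-------|-------|-------------|------|-------|------|--------|
| 0 | 50.391781 | 2 | 168 | 62.0 | 110 | 80 | normal | normal | 0 | 0 | 1 |
| 1 | 55.419178 | 1 | 156 | 85.0 | 140 | 90 | well above normal | normal | 0 | 0 | 1 |
| 2 | 51.663014 | 1 | 165 | 64.0 | 130 | 70 | well above normal | normal | 0 | 0 | 0 |
| 3 | 48.282192 | 2 | 169 | 82.0 | 150 | 100 | normal | normal | 0 | 0 | 1 |
| 4 | 47.873973 | 1 | 156 | 56.0 | 100 | 60 | normal | normal | 0 | 0 | 0 |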


In [12]:
y

0        0
1        1
2        1
3        1
4        0
        ..
69995    0
69996    1
69997    1
69998    1
69999    0
Name: cardio, Length: 70000, dtype: int64

We will perform a train/test split with an 80%/20% ratio. The test set will remain untouched until the model selection phase is completed.

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=seed)
X_train.shape, X_test.shape

((56000, 11), (14000, 11))
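
The split above is a plain random split. As a hedged variant, `train_test_split` also accepts a `stratify` argument that keeps the class ratio identical in both partitions. The sketch below shows the effect on synthetic data (the arrays `X_toy` and `y_toy` and the 30% positive rate are assumptions for illustration only).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 30% positive labels.
rng = np.random.default_rng(31)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 300 + [0] * 700)

# stratify=y_toy asks for the same class proportions in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=31, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
print("positive rate (train, test):", y_tr.mean(), y_te.mean())
```

For a roughly balanced target, a random split of this size usually lands close to the same ratio anyway, so stratifying matters most for imbalanced targets or small test sets.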

We will now create a data processing pipeline consisting of two components: a numerical pipeline and a categorical pipeline. The numerical pipeline will first impute missing values using the mean and then apply standard scaling. The categorical pipeline will start by imputing missing values with the most frequent categories and will follow with one-hot encoding.

Although the exploratory data analysis (EDA) revealed no missing values in the current dataset, these imputation steps are included to account for potential missing values in future data.

In [14]:
# Split columns by dtype: the mapped cholesterol and gluc columns are object-typed
num_features = X.select_dtypes(exclude="object").columns
cat_features = X.select_dtypes(include="object").columns

# Numerical columns: mean imputation, then standardization
num_pipeline = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
    ]
)

# Categorical columns: most-frequent imputation, then one-hot encoding
cat_pipeline = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("one_hot_encoder", OneHotEncoder()),
    ]
)

preprocessor = ColumnTransformer(
    [
        ("num_pipeline", num_pipeline, num_features),
        ("cat_pipeline", cat_pipeline, cat_features),
    ]
)

We fit and transform the pipeline on the training set. The test set will be transformed only, to prevent data leakage.

In [15]:
X_train = preprocessor.fit_transform(X_train)
X_train.shape

(56000, 15)

The number of features has increased from 11 to 15 due to the one-hot encoding: the 9 numerical columns are kept, and cholesterol and gluc each expand into 3 indicator columns.

Next, we'll create an evaluation function to provide all relevant metrics after model training.

In [16]:
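def evaluate_model(true, predicted):
    """Return accuracy, precision, recall and F1 for binary predictions."""
    accuracy = accuracy_score(true, predicted)
    # zero_division=0 avoids warnings when a model never predicts the positive class
    precision = precision_score(true, predicted, zero_division=0)
    recall = recall_score(true, predicted, zero_division=0)
    f1 = f1_score(true, predicted, zero_division=0)
    return accuracy, precision, recall, f1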
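
To make the four metrics concrete, here is a worked example on six hand-made labels (the `y_true` and `y_pred` lists are illustrative, not taken from the dataset). The predictions contain 2 true positives, 1 false positive, 1 false negative and 2 true negatives.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hand-made labels: TP=2, FP=1, FN=1, TN=2.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total = 4/6
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2/3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2/3
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall = 2/3

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```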

Below, we define the various models to be trained on the training set:

- **Dummy:** A baseline model that makes predictions based on simple rules, such as predicting the most frequent class.
- **Logistic Regression:** A linear model used for binary classification that estimates probabilities using a logistic function.
- **LinearSVM:** A Support Vector Machine model that finds the optimal hyperplane for class separation in a linear manner.
- **KNN:** A k-nearest neighbors model that classifies each sample by a vote among the closest training samples.
- **XGBoost:** An efficient and scalable implementation of gradient boosting that focuses on speed and performance. It handles large datasets efficiently, supports regularization to prevent overfitting, and can handle missing data natively.
- **Random Forest:** An ensemble model that constructs multiple decision trees and aggregates their predictions to improve accuracy.

In [27]:
initial_models = {
    "Dummy": DummyClassifier(random_state=seed),
    "Logistic Regression": LogisticRegression(random_state=seed, verbose=False),
    "LinearSVM": LinearSVC(random_state=seed),
    "KNN": KNeighborsClassifier(),
    "XGBoost": XGBClassifier(random_state=seed),
    "Random Forest": RandomForestClassifier(random_state=seed)
}

For each model, we define a set of hyperparameters. A hyperparameter is a parameter that is set before the training process begins and controls various aspects of the model's learning process. Unlike model parameters, which are learned from the data during training, hyperparameters are specified manually and influence how well the model performs.

In [26]:
initial_params={
    "Dummy": {
        "strategy": ["most_frequent", "uniform"]
    },
    "Logistic Regression": {
        'C': [0.1, 1, 10],
    },
    "LinearSVM": {
        'C': [0.1, 1, 10]
    },
    'KNN': {
        'n_neighbors': [3, 5, 7],
        'weights': ['uniform', 'distance']
    },
    "XGBoost": {
        'n_estimators': [100, 200, 300],
        'max_depth': [3, 5],
        'learning_rate': [0.01, 0.1, 0.2]
    },
    "Random Forest": {
        'min_samples_leaf': [1, 20],
        'n_estimators': [100, 200],
        'max_features': ['sqrt', 'log2']
    }
}
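
The size of each grid determines how many model fits the search performs. As a rough sketch, `ParameterGrid` from scikit-learn can count the candidates. The fold count of 10 below is an assumption taken from the cross-validation setup described in this section.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the "XGBoost" entry of initial_params.
xgb_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [3, 5],
    "learning_rate": [0.01, 0.1, 0.2],
}

n_candidates = len(ParameterGrid(xgb_grid))  # 3 * 2 * 3 combinations
n_folds = 10                                 # assumed number of CV folds
n_fits = n_candidates * n_folds              # fits during the search, excluding the final refit

print(n_candidates, n_fits)
```

The count grows multiplicatively with every added hyperparameter, which is one reason the grids here are kept small.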

Finally, we will perform a grid search with 10-fold cross-validation to identify the best hyperparameters for each model. After determining the optimal hyperparameters, we will evaluate the models' accuracy and select the one with the highest performance.

In [28]:
def train_and_evaluate_models(models_dic, params_dic):
    best_model_name = None
    best_model_params = None
    highest_accuracy = -float('inf')
    scoring = ['accuracy', 'precision', 'recall', 'f1']

    for model_name, model in models_dic.items():
        start_time = datetime.now()

        param = params_dic.get(model_name, {})
        # Stratified folds preserve the class ratio of the target in every split;
        # 10 folds, as stated in the text above
        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)

        # Multi-metric search; refit='accuracy' picks best_params_ by mean CV accuracy
        # and refits the best estimator on the full training set
        gs = GridSearchCV(model, param, cv=cv, scoring=scoring, refit='accuracy')
        gs.fit(X_train, y_train)

        # Note: .mean() averages each metric over all candidates in the grid,
        # not only the best one (use gs.best_index_ to report the best candidate alone)
        accuracy = gs.cv_results_['mean_test_accuracy'].mean()
        precision = gs.cv_results_['mean_test_precision'].mean()
        recall = gs.cv_results_['mean_test_recall'].mean()
        f1 = gs.cv_results_['mean_test_f1'].mean()

        end_time = datetime.now()
        elapsed_time = (end_time - start_time).total_seconds()

        print('Model: ', model_name)
        print("- Best parameters: ", gs.best_params_)
        print("- Time Elapsed: {:.2f} seconds".format(elapsed_time))
        print("- Accuracy: {:.4f}".format(accuracy))
        print("- Precision: {:.4f}".format(precision))
        print("- Recall: {:.4f}".format(recall))
        print("- F1-score: {:.4f}".format(f1))
        print("\n\n")

        if accuracy > highest_accuracy:
            highest_accuracy = accuracy
            best_model_name = model_name
            best_model_params = gs.best_params_

    print("\n\n\nBest Model")
    print("Model: ", best_model_name)
    print("Parameters: ", best_model_params)
    print("Highest Accuracy: {:.4f}".format(highest_accuracy))

train_and_evaluate_models(initial_models, initial_params)

Model:  Dummy
- Best parameters:  {'strategy': 'most_frequent'}
- Time Elapsed: 0.17 seconds
- Accuracy: 0.4996
- Precision: 0.2491
- Recall: 0.2485
- F1-score: 0.2488



Model:  Logistic Regression
- Best parameters:  {'C': 10}
- Time Elapsed: 1.05 seconds
- Accuracy: 0.7201
- Precision: 0.7405
- Recall: 0.6767
- F1-score: 0.7072



Model:  LinearSVM
- Best parameters:  {'C': 0.1}
- Time Elapsed: 1.35 seconds
- Accuracy: 0.6519
- Precision: 0.6630
- Recall: 0.6165
- F1-score: 0.6389



Model:  KNN
- Best parameters:  {'n_neighbors': 7, 'weights': 'uniform'}
- Time Elapsed: 103.76 seconds
- Accuracy: 0.6416
- Precision: 0.6470
- Recall: 0.6214
- F1-score: 0.6339



Model:  XGBoost
- Best parameters:  {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 300}
- Time Elapsed: 33.54 seconds
- Accuracy: 0.7346
- Precision: 0.7549
- Recall: 0.6938
- F1-score: 0.7230



Model:  Random Forest
- Best parameters:  {'max_features': 'sqrt', 'min_samples_leaf': 20, 'n_estimators': 200}
- Time Ela

Based on these results, XGBoost is the best-performing model. We will now perform another round of hyperparameter tuning, focusing only on the XGBoost model.

In [40]:
final_models = {
    "XGBoost": XGBClassifier(random_state=seed)
}

final_params = {
    "XGBoost": {
        'n_estimators': [300, 400, 500],
        'max_depth': [3, 4, 5],
        'learning_rate': [0.02, 0.05, 0.1, 0.15],
    }
}

train_and_evaluate_models(final_models, final_params)

Model:  XGBoost
- Best parameters:  {'learning_rate': 0.05, 'max_depth': 4, 'n_estimators': 300}
- Time Elapsed: 77.08 seconds
- Accuracy: 0.7361
- Precision: 0.7554
- Recall: 0.6975
- F1-score: 0.7253






Best Model
Model:  XGBoost
Parameters:  {'learning_rate': 0.05, 'max_depth': 4, 'n_estimators': 300}
Highest Accuracy: 0.7361
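
The printed summary shows one set of numbers per model. `GridSearchCV` also stores the scores of every candidate in its `cv_results_` attribute, which can be loaded into a DataFrame and ranked. The sketch below uses synthetic data and a `LogisticRegression` stand-in so that it runs quickly; `X_demo`, `y_demo` and the 5-fold setting are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic binary classification data standing in for the prepared training set.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=31)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=31)
gs = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.1, 1, 10]},
    cv=cv,
    scoring={"accuracy": "accuracy", "f1": "f1"},
    refit="accuracy",
)
gs.fit(X_demo, y_demo)

# One row per candidate, sorted so the best candidate by accuracy comes first.
columns = ["params", "mean_test_accuracy", "std_test_accuracy",
           "mean_test_f1", "rank_test_accuracy"]
results = pd.DataFrame(gs.cv_results_)[columns].sort_values("rank_test_accuracy")
print(results)

# Mean CV accuracy of the selected candidate only.
best_cv_accuracy = gs.cv_results_["mean_test_accuracy"][gs.best_index_]
print("best candidate accuracy:", round(best_cv_accuracy, 4))
```

Indexing with `gs.best_index_` reports the score of the selected candidate alone, which can differ from an average taken over the whole grid.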


The tuned XGBoost model, with parameters {'learning_rate': 0.05, 'max_depth': 4, 'n_estimators': 300}, achieved a cross-validated accuracy of 0.7361. We will now retrain it on the full training set and measure its accuracy on the held-out test set.

In [41]:
# `verbose` is not an XGBClassifier constructor parameter (XGBoost uses `verbosity`), so it is omitted
model = XGBClassifier(random_state=seed, max_depth=4, n_estimators=300, learning_rate=0.05)
model.fit(X_train, y_train)

In [42]:
X_test_prepared = preprocessor.transform(X_test)
y_test_pred = model.predict(X_test_prepared)

print(accuracy_score(y_test, y_test_pred))

0.7309285714285715


### Results

The model achieved an accuracy of 0.7309 on the test set, meaning it correctly classified about 73% of the 14,000 test instances as having or not having cardiovascular disease. This is close to the cross-validated accuracy of 0.7361, which suggests the model generalizes about as well as the cross-validation estimate indicated.

- **Contextual Interpretation**: Whether 73% accuracy is sufficient depends on how the model will be used. For medical diagnostics such as this one, a higher threshold is usually required because of the critical nature of the decisions, and accuracy alone does not distinguish missed cases from false alarms.

- **Next Steps**: To further enhance the model's performance, consider:
  - **Feature Engineering**: Refining input features or creating new ones.
  - **Model Tuning**: Optimizing hyperparameters further.
  
In conclusion, 73% test accuracy is a clear improvement over the dummy baseline (about 50%), but further refinement and evaluation are needed to improve the model's predictive power and reliability in practical applications.
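
For the medical-diagnostics concern raised above, a confusion matrix separates the two kinds of error that accuracy merges into one number. The sketch below uses ten hand-made labels (`y_true` and `y_pred` are illustrative, not the notebook's test predictions).

```python
from sklearn.metrics import confusion_matrix

# Hand-made labels: 4 patients with disease (1), 6 without (0).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)  # share of diseased patients that were detected
specificity = tn / (tn + fp)  # share of healthy patients correctly cleared

print(f"missed cases (FN): {fn}, false alarms (FP): {fp}")
print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
```

Applying `confusion_matrix(y_test, y_test_pred)` to the real predictions would show how the 27% of errors split between missed cases and false alarms.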