
    Introduction: Describes the data and where you got the data. Describe the question being answered and the method(s) being used to answer the question.
    Data pre-processing: What's needed to load the data, clean the data, normalize, etc.
    Model setup: Set up one or more models
    Hyperparameter tuning: Do some playing with the model hyperparameters (learning rate, optimizer, batch size, epochs, whatever makes sense)
    Results: How did the model do
    Discussion: Summarize what worked, what didn't etc.


We work with the Heart Disease dataset from the UCI Machine Learning Repository, specifically the processed Cleveland subset, and load it with pandas directly from the repository.

In [1]:
import pandas as pd
import numpy as np
import sklearn

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data'
df = pd.read_csv(url, sep=',', header=None)

ModuleNotFoundError: No module named 'sklearn'

This is a preview of the dataframe, which has 14 columns. Because the file has no header row, the columns are labeled 0 to 13; they are renamed below to the attribute names given in the dataset documentation: https://archive.ics.uci.edu/ml/datasets/Heart+Disease.

In [None]:
# Preview the first 5 rows
print(df.head())

     0    1    2      3      4    5    6      7    8    9    10   11   12  13
0  63.0  1.0  1.0  145.0  233.0  1.0  2.0  150.0  0.0  2.3  3.0  0.0  6.0   0
1  67.0  1.0  4.0  160.0  286.0  0.0  2.0  108.0  1.0  1.5  2.0  3.0  3.0   2
2  67.0  1.0  4.0  120.0  229.0  0.0  2.0  129.0  1.0  2.6  2.0  2.0  7.0   1
3  37.0  1.0  3.0  130.0  250.0  0.0  0.0  187.0  0.0  3.5  3.0  0.0  3.0   0
4  41.0  0.0  2.0  130.0  204.0  0.0  2.0  172.0  0.0  1.4  1.0  0.0  3.0   0


In [None]:
df.columns = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num'] 

In [None]:
df

      age  sex   cp  trestbps   chol  fbs  restecg  thalach  exang  oldpeak  slope   ca  thal  num
0    63.0  1.0  1.0     145.0  233.0  1.0      2.0    150.0    0.0      2.3    3.0  0.0   6.0    0
1    67.0  1.0  4.0     160.0  286.0  0.0      2.0    108.0    1.0      1.5    2.0  3.0   3.0    2
2    67.0  1.0  4.0     120.0  229.0  0.0      2.0    129.0    1.0      2.6    2.0  2.0   7.0    1
3    37.0  1.0  3.0     130.0  250.0  0.0      0.0    187.0    0.0      3.5    3.0  0.0   3.0    0
4    41.0  0.0  2.0     130.0  204.0  0.0      2.0    172.0    0.0      1.4    1.0  0.0   3.0    0
...   ...  ...  ...       ...    ...  ...      ...      ...    ...      ...    ...  ...   ...  ...
298  45.0  1.0  1.0     110.0  264.0  0.0      0.0    132.0    0.0      1.2    2.0  0.0   7.0    1
299  68.0  1.0  4.0     144.0  193.0  1.0      0.0    141.0    0.0      3.4    2.0  2.0   7.0    2
300  57.0  1.0  4.0     130.0  131.0  0.0      0.0    115.0    1.0      1.2    2.0  1.0   7.0    3
301  57.0  0.0  2.0     130.0  236.0  0.0      2.0    174.0    0.0      0.0    2.0  1.0   3.0    1


Pre-processing removes entries that are not valid numbers. Non-numerical values are converted to NA and the corresponding rows are dropped. Attributes that the dataset documentation describes as categorical are converted to integer category codes for simplicity. A standard scaler is also applied for consistency, although feature scaling is generally unnecessary for tree-based models such as XGBoost.

In [None]:
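# The raw file marks missing values with '?'; both 'ca' and 'thal' contain them
for col in ['ca', 'thal']:
    df[col] = pd.to_numeric(df[col], errors='coerce')
df = df.dropna()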
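
The same clean-up can also be done at load time. The sketch below is one possible way to do it, not the notebook's method: it passes `na_values='?'` to `pd.read_csv` so that every `?` becomes NaN in all columns. The first two rows are copied from the preview above; the third row is made up for illustration and exists only to exercise the missing-value path.

```python
import io
import pandas as pd

columns = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
           'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num']

# Rows 1-2 come from the preview above; row 3 is a made-up row with '?'
# placeholders, included only to demonstrate the missing-value handling.
raw = io.StringIO(
    "63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0\n"
    "67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2\n"
    "50.0,1.0,3.0,130.0,220.0,0.0,0.0,150.0,0.0,1.0,2.0,?,?,0\n"
)

# na_values='?' turns every '?' into NaN while parsing, for all columns
sample = pd.read_csv(raw, header=None, names=columns, na_values='?')

n_missing = int(sample.isna().sum().sum())  # NaN cells across the whole frame
clean = sample.dropna()                     # drop rows containing any NaN

print(n_missing, len(clean))
```

Handling the placeholder while parsing keeps `ca` and `thal` numeric from the start, so no per-column coercion is needed afterwards.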

In [None]:
df = df.copy()
# Replace each categorical attribute with integer codes (0..k-1 over its sorted unique values)
for x in ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'thal']:
    df[x] = df[x].astype('category').cat.codes
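
As a small illustration of what the category-code conversion does, the sketch below applies it to a few `thal` values of the kind seen in the preview (3.0, 6.0, 7.0). The series is a toy example, not the full column.

```python
import pandas as pd

# 'thal' takes the values 3.0, 6.0 and 7.0 in the rows previewed above
thal = pd.Series([6.0, 3.0, 7.0, 3.0], name='thal')

as_category = thal.astype('category')
categories = list(as_category.cat.categories)  # sorted unique values
codes = as_category.cat.codes                  # position of each value in that list

print(categories)      # [3.0, 6.0, 7.0]
print(codes.tolist())  # [1, 0, 2, 0]
```

The codes carry an implied ordering (0 < 1 < 2) that may not be meaningful for a nominal attribute; tree models usually tolerate this, but it is a simplification.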

In [None]:
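from sklearn.preprocessing import StandardScaler

# Scale only the 13 features; the target 'num' must keep its original class labels
features = df.drop(columns='num')
scaler = StandardScaler()
scaled_df = pd.DataFrame(scaler.fit_transform(features), columns=features.columns, index=df.index)
scaled_df['num'] = df['num']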
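
For reference, standardization maps each value to z = (x - mean) / std, computed per column. The sketch below checks this by hand against `StandardScaler` on the five `age` values from the preview, and also confirms that the ordering of the values is unchanged, which is why tree splits are not affected by scaling.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Ages from the first five rows previewed above
age = np.array([[63.0], [67.0], [67.0], [37.0], [41.0]])

scaled = StandardScaler().fit_transform(age)

# Same result by hand: subtract the column mean and divide by the
# population standard deviation (ddof=0), which StandardScaler uses
by_hand = (age - age.mean(axis=0)) / age.std(axis=0)

# Scaling is monotone, so the rank order of the values is preserved
same_order = np.array_equal(
    np.argsort(age.ravel(), kind='stable'),
    np.argsort(scaled.ravel(), kind='stable'),
)
print(np.allclose(scaled, by_hand), same_order)
```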

The scaled dataframe is split into the explanatory variables X and the response variable y, which are then divided into training and testing sets for validation.

In [None]:
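X = scaled_df.drop('num', axis=1)
# Binarize the unscaled target for 'binary:logistic': 0 = no disease, 1-4 = disease present
y = (df['num'] > 0).astype(int)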

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)
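
A possible variation on this split, shown here as a sketch on synthetic data rather than the heart disease data: `stratify` keeps the class ratio the same in both splits, and fitting the scaler on the training rows only keeps test-set statistics out of the pre-processing. The names `X_demo`, `y_demo` and the 60/40 class balance are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
X_demo = rng.normal(size=(100, 3))      # synthetic features, not the heart data
y_demo = np.array([0] * 60 + [1] * 40)  # synthetic binary labels, 40% positive

# stratify=y_demo preserves the 60/40 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=5, stratify=y_demo
)

demo_scaler = StandardScaler().fit(X_tr)   # statistics from training rows only
X_tr_scaled = demo_scaler.transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)  # test rows reuse the training mean and std

print(len(X_tr), len(X_te), int(y_te.sum()))
```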

An XGBoost model is trained on the training set, X_train and y_train. GridSearchCV is used to tune several model hyperparameters, including max_depth, learning rate, n_estimators, and gamma. The ranges are arbitrary and kept narrow to limit the running time of the grid search. A GPU-accelerated tree method is used to speed up training. NVIDIA GPU implementations of numpy, sklearn, and pandas (cupy, cuml, and cudf) were tried initially, but they could not support certain datatypes.
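
The cost of a grid search can be estimated before fitting anything. The sketch below counts the candidates with scikit-learn's `ParameterGrid`, using the same tunable ranges as the notebook's `param_grid`; the dictionary name `search_space` is introduced here for illustration, and the single-value entries are left out because they do not change the count.

```python
import math
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same tunable ranges as the notebook's param_grid; fixed single-value
# entries are omitted because they do not change the number of candidates
search_space = {
    'max_depth': list(range(10)),
    'learning_rate': list(np.arange(0.05, 0.2 + 0.05, 0.05)),
    'n_estimators': list(np.arange(10, 60, 10)),
    'gamma': list(np.arange(0, 0.25, 0.05)),
}

n_candidates = len(ParameterGrid(search_space))
n_fits = n_candidates * 5  # cv=5 fits every candidate once per fold

print(n_candidates, n_fits)
```

Because the count is the product of the range lengths, adding one more value to any range multiplies the number of fits; a randomized search is a common alternative when the grid grows too large.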

Overall, the model performs reasonably well. The test size of 0.2 means that 20% of the data is randomly held out as the testing set and the remaining 80% is used for training. Performance could be improved, but that would require further hyperparameter adjustments. According to the dataset repository, XGBoost classification models on this dataset average 81.579% accuracy and 83.185% precision, although the test size and parameters behind those figures are not stated. These figures appear under "Baseline Model Performance" at this link: https://archive-beta.ics.uci.edu/dataset/45/heart+disease

In [None]:
import xgboost as xgb
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV

clf = xgb.XGBClassifier()

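param_grid = {
    # Tuned hyperparameters
    'max_depth': list(range(10)),  # includes 0, which XGBoost treats as "no depth limit"
    'learning_rate': list(np.arange(0.05, 0.2+0.05, 0.05)),
    'n_estimators': list(np.arange(10, 60, 10)),
    'gamma': list(np.arange(0, 0.25, 0.05)),
    # Fixed settings, given as single-value lists so they apply to every candidate
    'objective': ['binary:logistic'],
    'random_state': [5],
    'enable_categorical': [True],
    # 'gpu_hist' is deprecated since XGBoost 2.0; there, use tree_method='hist' with device='cuda'
    'tree_method': ['gpu_hist']
}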

# Creating the GridSearchCV object
grid_search = GridSearchCV(
    estimator=clf,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    verbose=2,
    error_score='raise'
)

# Fitting the GridSearchCV object to the data
grid_search.fit(X_train, y_train)

# Printing the best parameters and the best score
print("Best parameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_)

Fitting 5 folds for each of 1000 candidates, totalling 5000 fits
[CV] END enable_categorical=True, gamma=0.0, learning_rate=0.05, max_depth=0, n_estimators=10, objective=binary:logistic, random_state=5, tree_method=gpu_hist; total time=   0.1s
[CV] END enable_categorical=True, gamma=0.0, learning_rate=0.05, max_depth=0, n_estimators=10, objective=binary:logistic, random_state=5, tree_method=gpu_hist; total time=   0.1s
[CV] END enable_categorical=True, gamma=0.0, learning_rate=0.05, max_depth=0, n_estimators=10, objective=binary:logistic, random_state=5, tree_method=gpu_hist; total time=   0.1s
[CV] END enable_categorical=True, gamma=0.0, learning_rate=0.05, max_depth=0, n_estimators=10, objective=binary:logistic, random_state=5, tree_method=gpu_hist; total time=   0.1s
[CV] END enable_categorical=True, gamma=0.0, learning_rate=0.05, max_depth=0, n_estimators=10, objective=binary:logistic, random_state=5, tree_method=gpu_hist; total time=   0.1s
[CV] END enable_categorical=True, gamma=
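
The grid search above reports a cross-validated score on the training data; the held-out test split can be scored separately. The sketch below is one way to compute accuracy and precision with scikit-learn. The helper name `report` and the demo labels are made up for illustration and are not results from this notebook.

```python
from sklearn.metrics import accuracy_score, precision_score

def report(y_true, y_pred):
    """Return accuracy and precision for binary labels (1 = disease present)."""
    return {
        'accuracy': accuracy_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred),
    }

# Made-up labels to show the arithmetic; with the notebook's objects the call
# would be report(y_test, grid_search.best_estimator_.predict(X_test))
y_true_demo = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred_demo = [0, 1, 0, 0, 1, 1, 1, 1]

scores = report(y_true_demo, y_pred_demo)
print(scores)  # accuracy 6/8 = 0.75, precision 4/5 = 0.8
```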

Overall, the model worked reasonably well for a first attempt at using XGBoost, an industry-standard gradient-boosted decision tree library, to classify heart disease patients. Wider grid search parameter ranges may help produce more accurate models. Other possible ways of improving this model include providing more input data, feature engineering (combining features), more hyperparameter tuning, tree ensemble methods (bagging, boosting), cross-validation, feature selection (recursive feature elimination with logistic regression), and transfer learning.
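
As a rough sketch of the feature-selection idea mentioned above, recursive feature elimination with logistic regression could look like the following. It runs on synthetic data with 13 features (matching the number of predictors here), not on the heart disease data, and the choice of keeping 5 features is a hypothetical setting that would itself need tuning.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in with 13 features; the values themselves are random
X_demo, y_demo = make_classification(
    n_samples=200, n_features=13, n_informative=5, random_state=5
)

selector = RFE(
    estimator=LogisticRegression(max_iter=1000),
    n_features_to_select=5,  # hypothetical target, chosen only for the demo
)
selector.fit(X_demo, y_demo)

# Column indices of the features that survived elimination
kept = [i for i, keep in enumerate(selector.support_) if keep]
X_reduced = selector.transform(X_demo)

print(kept, X_reduced.shape)
```

The reduced matrix could then be passed to the same grid search in place of the full feature set.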