## Machine Learning Approach

**Here are some things I am curious about, after exploring the data:**
1. Do any of the features strongly predict heart disease on their own?
2. Which combinations of features boost heart disease prediction the most?
3. Are any of the features only weakly related to heart disease on their own, but informative when combined with others?

[how to examine permutation (feature) importance](https://www.kaggle.com/dansbecker/permutation-importance)
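
As a rough illustration of the linked technique, here is a minimal sketch of permutation importance using scikit-learn's `permutation_importance`. The data is a synthetic stand-in generated with `make_classification`, not the heart dataset, and the names `X_demo`, `y_demo`, and `demo_model` are placeholders chosen for this example.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 303 rows, 5 features, only 2 informative
X_demo, y_demo = make_classification(
    n_samples=303, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

demo_model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Shuffle one feature column at a time on held-out data and
# record how much the model's accuracy drops
result = permutation_importance(
    demo_model, X_val, y_val, n_repeats=10, random_state=0
)

# Print the features from most to least important
for i in result.importances_mean.argsort()[::-1]:
    mean = result.importances_mean[i]
    std = result.importances_std[i]
    print(f"feature {i}: {mean:.3f} +/- {std:.3f}")
```

A feature whose shuffling barely changes the score contributes little on its own, which speaks to question 1 above. Questions 2 and 3 would need a model that can capture interactions between features.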

**But first, I need to select an appropriate machine learning algorithm.**

[These are the important considerations:](https://www.kdnuggets.com/2020/05/guide-choose-right-machine-learning-algorithm.html#:~:text=%20An%20easy%20guide%20to%20choose%20the%20right,time.%20Higher%20accuracy%20typically%20means%20higher...%20More%20)

1. *Size of the training data (small in this case)*

    If the training data is small, or if the dataset has few observations relative to the number of features, 
    choose algorithms with high bias/low variance like Linear regression, Naïve Bayes, or Linear SVM.
    
    If the training data is large and the number of observations is higher, compared to the number of 
    features, you can use low bias/high variance algorithms like KNN, Decision trees, or kernel SVM.
    

2. *Desired Accuracy and/or Interpretability of the output (interpretability in this case)*

    Accuracy of a model means that the function predicts a response value for a given observation that is 
    close to the true response value for that observation. 
    
    A highly interpretable algorithm (a restrictive model like Linear Regression) makes it easy to 
    understand how any individual predictor is associated with the response, while flexible models tend to give 
    higher accuracy at the cost of interpretability.
    
    
3.  *Speed or Training time (non-issue for this small dataset)*

    Higher accuracy typically means higher training time. Also, algorithms require more time to train on 
    large training data. In real-world applications, the choice of algorithm is driven by these two factors 
    predominantly.

    Algorithms like Naïve Bayes and Linear and Logistic regression are easy to implement and quick to run. 
    Algorithms like SVM (which involves parameter tuning), neural networks (which can take a long time to converge), and 
    random forests need much more time to train.
    
    
4. *Linearity (to be determined)*

    The best way to check for linearity is to either fit a straight line or run a logistic regression or linear SVM 
    and check the residual errors. A higher error suggests the data is not linear and would need more complex 
    algorithms to fit.
    

5. *Number of features (small in this case)*

    The dataset may have a large number of features that may not all be relevant and significant. For a 
    certain type of data, such as genetics or textual, the number of features can be very large compared to 
    the number of data points.
    

6. *Supervised or Unsupervised learning (Supervised in this case)*

   Supervised learning algorithms are used when the training data has output variables corresponding to the 
   input variables. The algorithm analyses the input data and learns a function to map the relationship 
   between the input and output variables.
   
   Unsupervised learning algorithms are used when the training data does not have a response variable. Such 
   algorithms try to find the intrinsic pattern and hidden structures in the data.

**I am going to start with a Linear or Logistic regression algorithm and evaluate whether the data is non-linear.**  

If it is non-linear, I will apply random forests (because they handle non-linear relationships and their feature importances are still fairly easy to inspect).
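
One way to sketch this check is to compare the cross-validated accuracy of a linear model against a flexible one. The example below uses `make_moons`, a deliberately non-linear toy dataset, as a stand-in for the heart data. It is an assumption-laden illustration rather than the notebook's actual evaluation, and `X_demo`, `y_demo`, `linear_scores`, and `forest_scores` are placeholder names.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy data with a curved class boundary (assumption: stands in for real data)
X_demo, y_demo = make_moons(n_samples=300, noise=0.2, random_state=0)

# 5-fold cross-validated accuracy for a linear and a flexible model
linear_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=5
)
forest_scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0), X_demo, y_demo, cv=5
)

print(f"logistic regression: {linear_scores.mean():.3f}")
print(f"random forest:       {forest_scores.mean():.3f}")
```

If the flexible model scores clearly higher than the linear one, that hints at non-linear structure. Similar scores suggest the simpler, more interpretable model is adequate.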

[The choice between Linear and Logistic regression depends on what type of output I am expecting:](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)

I am expecting to output a category (Logistic Regression): yes - heart disease or no - not heart disease.

I am not expecting to output a quantity (Linear Regression).

## Fit my data to a logistic regression model

[Instead of predicting exactly 0 or 1, logistic regression generates a probability—a value between 0 and 1, exclusive.](https://developers.google.com/machine-learning/crash-course/logistic-regression/video-lecture)
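
The probability comes from passing a real-valued score (the log-odds) through the sigmoid function. A minimal sketch follows; the helper name `sigmoid` is introduced here for illustration and is not part of the notebook.

```python
import math


def sigmoid(z: float) -> float:
    """Map a real-valued score (the log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Negative scores map below 0.5, zero maps to exactly 0.5,
# and positive scores map above 0.5
for z in (-4.0, 0.0, 4.0):
    print(z, round(sigmoid(z), 3))
# prints:
# -4.0 0.018
# 0.0 0.5
# 4.0 0.982
```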



In [28]:
#import relevant libraries
import numpy as np
import pandas as pd

#visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

#machine learning libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix, accuracy_score

In [3]:
#import the dataset
heart_data = pd.read_csv("data/heart.csv")

#preview the columns
heart_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        303 non-null    int64  
 12  thal      303 non-null    int64  
 13  target    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB


**Prepare data for the model by:**
1. determining the target
2. separating the features from the label 
3. splitting out the training and testing sets

In [5]:
### Determine the target (y) and separate label (y) from features (X)
target = 'target'
y = heart_data[target]
X = heart_data.drop(target,axis=1)

In [6]:
### Separate out Train Set from Test Set
# note: no random_state is passed, so the split (and every score below) changes from run to run
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
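
For a split that can be reproduced and that keeps the class balance the same in both sets, `train_test_split` accepts `random_state` and `stratify`. The sketch below uses hypothetical stand-in arrays (`X_demo`, `y_demo`) with the same shape as the heart data and the class counts (165 vs 138) implied by the supports reported later in this notebook. The seed value 42 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 303 rows, 13 features, 165 positives, 138 negatives
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(303, 13))
y_demo = np.array([1] * 165 + [0] * 138)

# random_state fixes the shuffle; stratify keeps class proportions equal
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)                    # (212, 13) (91, 13)
print(round(y_tr.mean(), 2), round(y_te.mean(), 2))  # share of class 1 in each set
```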

**Train the model**

In [8]:
#create a variable called ml_model and use it to hold the logistic regression classifier
ml_model = LogisticRegression(max_iter=1000)

#fit the classifier to the training data - the training features and the training labels are passed in
ml_model.fit(X_train,y_train)

LogisticRegression(max_iter=1000)

**Generate Predictions**

In [21]:
#call predict on the fitted classifier, passing in the features of the testing dataset
predictions = ml_model.predict(X_test)
#also predict on the training features; comparing train and test scores is a check for overfitting
overfit = ml_model.predict(X_train)

[**Evaluate the Predictions**](https://stackabuse.com/overview-of-classification-methods-in-python-with-scikit-learn/#:~:text=The%20classification%20report%20is%20a%20Scikit-Learn%20built%20in,quick%20intuition%20of%20how%20your%20model%20is%20performing.)

[Logistic Regression outputs predictions about test data points on a binary scale, zero or one.](https://stackabuse.com/overview-of-classification-methods-in-python-with-scikit-learn/#:~:text=The%20classification%20report%20is%20a%20Scikit-Learn%20built%20in,quick%20intuition%20of%20how%20your%20model%20is%20performing.)

If the predicted probability is 0.5 or above, the observation is classified as belonging to class 1; below 0.5, it is classified as belonging to class 0.

The label (`target`) of each observation is also only 0 or 1.

Logistic regression is a linear classifier, so it works best when the log-odds of the outcome are roughly a linear function of the features.
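
The thresholding described above can be sketched by comparing `predict` with a manual cut of `predict_proba` at 0.5. This uses synthetic data from `make_classification` rather than the heart dataset, and `clf`, `proba_class1`, and `manual_labels` are names made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (assumption: stands in for real data)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Column 1 of predict_proba holds the estimated probability of class 1
proba_class1 = clf.predict_proba(X_demo)[:, 1]

# Apply the 0.5 threshold by hand
manual_labels = (proba_class1 >= 0.5).astype(int)

# Barring probabilities that land exactly on 0.5, this matches predict()
print(np.array_equal(manual_labels, clf.predict(X_demo)))
```

Keeping the probabilities around also makes it possible to pick a different threshold, for example a lower one if missing a case of heart disease is costlier than a false alarm.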

In [29]:
# Accuracy score is the simplest way to evaluate how the model performs
#pass in the ground truth labels (the test labels) followed by the predictions
print(accuracy_score(y_test, predictions))

#While accuracy gives a quick idea of how the classifier is performing,
#it is best used when the number of observations/examples in each class is roughly equivalent.
#When the classes are imbalanced, another metric is usually a better choice.

0.8681318681318682


In [31]:
# The Confusion Matrix and Classification Report give more details about performance
#pass in the ground truth labels first so that rows are true classes and columns are predicted classes
print(confusion_matrix(y_test, predictions))

#the number of correct predictions for each class run on the diagonal from top-left to bottom-right

[[39  8]
 [ 4 40]]


In [32]:
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

           0       0.91      0.83      0.87        47
           1       0.83      0.91      0.87        44

    accuracy                           0.87        91
   macro avg       0.87      0.87      0.87        91
weighted avg       0.87      0.87      0.87        91
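
The figures in this report can be reproduced by hand from the four test-set counts: 39 true negatives, 8 false positives, 4 false negatives, and 40 true positives, with class 1 (heart disease) as the positive class. A small sketch of the arithmetic, with variable names chosen for this example:

```python
# Counts read off the test-set results (class 1 = heart disease)
tn, fp, fn, tp = 39, 8, 4, 40

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are right
precision_1 = tp / (tp + fp)                 # of predicted positives, how many are real
recall_1 = tp / (tp + fn)                    # of real positives, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(round(accuracy, 2), round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
# prints: 0.87 0.83 0.91 0.87
```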



In [23]:
print(classification_report(y_train,overfit))

              precision    recall  f1-score   support

           0       0.89      0.75      0.81        91
           1       0.83      0.93      0.88       121

    accuracy                           0.85       212
   macro avg       0.86      0.84      0.85       212
weighted avg       0.86      0.85      0.85       212



In [24]:
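#note: this scores the hard 0/1 predictions, which reduces ROC AUC to the average of the two per-class recalls;
#passing predicted probabilities (ml_model.predict_proba(X_test)[:, 1]) would evaluate the ranking instead
print(roc_auc_score(y_test,predictions))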

0.8694390715667311


In [25]:
print(roc_auc_score(y_train,overfit))

0.8405685223867042
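
ROC AUC is usually computed from continuous scores rather than hard labels. The sketch below, on synthetic data from `make_classification` and not the heart dataset, contrasts the two. With hard 0/1 labels the ROC curve has a single interior point, so the area equals the balanced accuracy. Names such as `clf`, `hard_labels`, and `scores` are placeholders for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 303 rows, 13 features
X_demo, y_demo = make_classification(
    n_samples=303, n_features=13, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

hard_labels = clf.predict(X_te)            # 0/1 decisions
scores = clf.predict_proba(X_te)[:, 1]     # estimated probability of class 1

auc_from_labels = roc_auc_score(y_te, hard_labels)
auc_from_scores = roc_auc_score(y_te, scores)

print(f"AUC from hard labels:   {auc_from_labels:.3f}")
print(f"balanced accuracy:      {balanced_accuracy_score(y_te, hard_labels):.3f}")
print(f"AUC from probabilities: {auc_from_scores:.3f}")
```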
