## The Data

At this link, you will find a dataset containing information about heart disease patients: https://www.dropbox.com/scl/fi/0vrpdnq5asmeulc4gd50y/ha_1.csv?rlkey=ciisalceotl77ffqhqe3kujzv&dl=1

A description of the original dataset can be found here: https://archive.ics.uci.edu/dataset/45/heart+disease (However, this dataset has been cleaned and reduced, and the people have been given fictitious names.)

## 1. Logistic Regression

Fit a Logistic Regression using only `age` and `chol` (cholesterol) as predictors.

For a 55-year-old, how high would their cholesterol need to be for the doctors to predict that heart disease is present?

How high for the doctors to estimate a 90% chance that heart disease is present?

In [38]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

ha_data = pd.read_csv("https://www.dropbox.com/scl/fi/0vrpdnq5asmeulc4gd50y/ha_1.csv?rlkey=ciisalceotl77ffqhqe3kujzv&dl=1")

In [39]:
X = ha_data[['age', 'chol']]
y = ha_data['diagnosis']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [40]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

logreg = LogisticRegression()
logreg.fit(X_train_scaled, y_train)

# Baseline: predicted probability for a 55-year-old with cholesterol set to 0.
# Column 1 of predict_proba is the probability of logreg.classes_[1];
# check that this is the label meaning "disease present".
age_55_chol_pred = [[55, 0]]
chol_pred_probability = logreg.predict_proba(scaler.transform(age_55_chol_pred))[:, 1]

print(f"For a 55-year-old, the predicted probability of heart disease is: {chol_pred_probability[0]*100:.2f}%")

# Find the cholesterol level for a 90% chance of heart disease
# by raising cholesterol one unit at a time, starting from 0.
chol_threshold_90_percent = 0.90  # Set the desired probability threshold
chol_threshold = 0
max_chol = 5000  # safety cap so the loop ends even if 90% is never reached

while chol_pred_probability[0] < chol_threshold_90_percent and chol_threshold < max_chol:
    chol_threshold += 1
    chol_pred_probability = logreg.predict_proba(scaler.transform([[55, chol_threshold]]))[:, 1]

print(f"For a 55-year-old, the cholesterol threshold for a 90% chance of heart disease is approximately: {chol_threshold}")

For a 55-year-old, the predicted probability of heart disease is: 30.83%
For a 55-year-old, the cholesterol threshold for a 90% chance of heart disease is approximately: 1218
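
The loop above only targets the 90% question, and it searches one unit at a time. Because logistic regression is linear in the log-odds, both thresholds (50% for a predicted diagnosis, 90% for the second question) can also be solved for directly. The following is a minimal sketch under stated assumptions: the helper name `chol_for_probability` is hypothetical, the numbers passed to it are made up for illustration and are not the fitted values from `ha_1.csv`, and the positive class is assumed to be `logreg.classes_[1]`.

```python
import math

def chol_for_probability(p, age, coef, intercept, mean, scale):
    """Cholesterol at which a logistic model fit on standardized
    (age, chol) predicts probability p for the positive class.

    coef = (b_age, b_chol) and intercept come from the fitted model;
    mean and scale are the scaler's per-feature means and standard deviations.
    """
    b_age, b_chol = coef
    logit = math.log(p / (1 - p))        # target log-odds; 0 when p = 0.5
    z_age = (age - mean[0]) / scale[0]   # standardize the age
    z_chol = (logit - intercept - b_age * z_age) / b_chol
    return mean[1] + scale[1] * z_chol   # convert back to original units

# Hypothetical coefficients and scaler statistics, for illustration only
coef, intercept = (0.5, 0.2), -0.1
mean, scale = (54.0, 246.0), (9.0, 52.0)

chol_50 = chol_for_probability(0.5, 55, coef, intercept, mean, scale)
chol_90 = chol_for_probability(0.9, 55, coef, intercept, mean, scale)
print(f"50% threshold: {chol_50:.1f}, 90% threshold: {chol_90:.1f}")
```

With the objects from the cells above, the call would look like `chol_for_probability(0.5, 55, logreg.coef_[0], logreg.intercept_[0], scaler.mean_, scaler.scale_)`.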




## 2. Linear Discriminant Analysis

Fit an LDA model using only `age` and `chol` (cholesterol) as predictors.

For a 55-year-old, how high would their cholesterol need to be for the doctors to predict that heart disease is present?

In [43]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis()
lda.fit(X_train_scaled, y_train)

age_55 = 55
chol_value = 1  # cholesterol value that appears in the recorded output below

prediction_probability = lda.predict_proba(scaler.transform([[age_55, chol_value]]))[:, 1]

print(f"For a 55-year-old with cholesterol {chol_value}, the predicted probability of heart disease is: {prediction_probability[0]*100:.2f}%")

For a 55-year-old with cholesterol 1, the predicted probability of heart disease is: 31.08%
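
The cell above evaluates a single cholesterol value, which does not by itself say where the predicted diagnosis flips. One model-agnostic way to find that point is to scan a range of cholesterol values and report the first one the classifier labels as positive. This is a sketch only: `first_chol_predicting` and `toy_predict` are hypothetical names, and the toy rule stands in for the fitted model so the example runs on its own.

```python
def first_chol_predicting(predict_label, age, positive_label, chol_values):
    """Return the first value in chol_values for which
    predict_label(age, chol) equals positive_label, or None if none does."""
    for chol in chol_values:
        if predict_label(age, chol) == positive_label:
            return chol
    return None

# Toy stand-in classifier (a made-up integer rule, NOT the fitted LDA)
def toy_predict(age, chol):
    return 1 if 4 * age + chol > 500 else 0

threshold = first_chol_predicting(toy_predict, 55, 1, range(100, 601))
print(threshold)
```

With the fitted objects, `predict_label` could be `lambda age, chol: lda.predict(scaler.transform([[age, chol]]))[0]`, and `positive_label` must match whatever value the `diagnosis` column uses for disease. The same helper works unchanged for any classifier with a `predict` method.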




## 3. Support Vector Classifier

Fit an SVC model using only `age` and `chol` as predictors. Don't forget to tune the regularization parameter.

For a 55-year-old, how high would their cholesterol need to be for the doctors to predict that heart disease is present?

In [44]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

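# C is the regularization parameter; gamma controls the RBF kernel width.
# Note: an RBF kernel gives a nonlinear boundary and exposes no coef_,
# so it cannot supply a linear separator.
param_grid = {'C': [0.1, 1, 10, 100, 1000], 'gamma': [1, 0.1, 0.01, 0.001, 0.0001], 'kernel': ['rbf']}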

svc = SVC()

grid_search = GridSearchCV(svc, param_grid, refit=True, verbose=3, cv=5)
grid_search.fit(X_train_scaled, y_train)

print("Best Parameters:", grid_search.best_params_)

age_55 = 55
# This is the predicted class label at cholesterol 0, not a cholesterol threshold
predicted_label = grid_search.predict(scaler.transform([[age_55, 0]]))[0]

print(f"For a 55-year-old, the predicted diagnosis based on the best SVC model is: {predicted_label}")


Fitting 5 folds for each of 25 candidates, totalling 125 fits
[CV 1/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.606 total time=   0.0s
[CV 2/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.636 total time=   0.0s
[CV 3/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.606 total time=   0.0s
[CV 4/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.531 total time=   0.0s
[CV 5/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.562 total time=   0.0s
[CV 1/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.576 total time=   0.0s
[CV 2/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.576 total time=   0.0s
[CV 3/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.545 total time=   0.0s
[CV 4/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.562 total time=   0.0s
[CV 5/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.562 total time=   0.0s
[CV 1/5] END .....C=0.1, gamma=0.01, kernel=rbf;, score=0.576 total time=   0.0s
[CV 2/5] END .....C=0.1, gamma=0.01, kernel=rbf
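
The exercise asks for the regularization parameter to be tuned, and scikit-learn exposes `coef_` on `SVC` only when the kernel is linear. A variant that tunes only `C` with a linear kernel might look like the sketch below. It runs on synthetic data generated here as a stand-in for `age` and `chol` (not the heart-disease data), and the grid of `C` values is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for (age, chol); labels follow a noisy linear rule
rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.normal(55, 9, 200), rng.normal(245, 50, 200)])
score = 0.05 * (X_demo[:, 0] - 55) + 0.01 * (X_demo[:, 1] - 245)
y_demo = (score + rng.normal(0, 0.5, 200) > 0).astype(int)

# Scaling inside the pipeline keeps each CV fold's scaler fit on its own training part
pipe = make_pipeline(StandardScaler(), SVC(kernel="linear"))
search = GridSearchCV(pipe, {"svc__C": [0.01, 0.1, 1, 10, 100]}, cv=5)
search.fit(X_demo, y_demo)

best_svc = search.best_estimator_.named_steps["svc"]
print("Best C:", search.best_params_["svc__C"])
print("Coefficients (standardized units):", best_svc.coef_, best_svc.intercept_)
```

The fitted `coef_` and `intercept_` describe the separating line in standardized units, so they need converting before being drawn on raw `age` and `chol` axes.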



## 4. Comparing Decision Boundaries

Make a scatterplot of `age` and `chol`, coloring the points by their true disease outcome.  Add a line to the plot representing the **linear separator** (aka **decision boundary**) for each of the three models above.
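
Because the models above were fit on standardized predictors, their coefficients describe a line in standardized units rather than in years and cholesterol units. The short derivation below maps that line back to the plot's axes. The notation is introduced here and is not part of the original exercise: $\beta_0$ is the model intercept, $\beta_{\text{age}}$ and $\beta_{\text{chol}}$ are its coefficients, and $\mu$, $\sigma$ are the scaler's means and standard deviations.

```latex
\begin{aligned}
0 &= \beta_0
   + \beta_{\text{age}}\,\frac{\text{age}-\mu_{\text{age}}}{\sigma_{\text{age}}}
   + \beta_{\text{chol}}\,\frac{\text{chol}-\mu_{\text{chol}}}{\sigma_{\text{chol}}} \\[4pt]
\text{chol} &=
   \underbrace{\mu_{\text{chol}}
     - \frac{\sigma_{\text{chol}}}{\beta_{\text{chol}}}
       \Bigl(\beta_0 - \frac{\beta_{\text{age}}\,\mu_{\text{age}}}{\sigma_{\text{age}}}\Bigr)}_{\text{intercept}}
   \;+\;
   \underbrace{\Bigl(-\frac{\beta_{\text{age}}\,\sigma_{\text{chol}}}{\beta_{\text{chol}}\,\sigma_{\text{age}}}\Bigr)}_{\text{slope}}
   \,\text{age}
\end{aligned}
```

The same conversion applies to any of the three models, as long as it exposes a linear `coef_` and `intercept_`.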

In [47]:
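from plotnine import (ggplot, aes, geom_point, geom_abline, labs,
                      theme_minimal, scale_color_manual)

# coef_ exists only for a linear kernel, so tune C for a linear SVC here
svc_linear = GridSearchCV(SVC(kernel='linear'), {'C': [0.1, 1, 10, 100]}, cv=5)
svc_linear = svc_linear.fit(X_train_scaled, y_train).best_estimator_

def boundary_in_raw_units(model):
    """Slope and intercept of the model's decision line on the age/chol axes.

    The models were fit on standardized features, so their coefficients
    must be converted back using the scaler's means and standard deviations.
    """
    b_age, b_chol = model.coef_[0]
    mu_age, mu_chol = scaler.mean_
    sd_age, sd_chol = scaler.scale_
    slope = -(b_age / b_chol) * (sd_chol / sd_age)
    intercept = mu_chol - (sd_chol / b_chol) * (model.intercept_[0] - b_age * mu_age / sd_age)
    return slope, intercept

logreg_slope, logreg_int = boundary_in_raw_units(logreg)
lda_slope, lda_int = boundary_in_raw_units(lda)
svc_slope, svc_int = boundary_in_raw_units(svc_linear)

# Create a DataFrame for plotting
plot_data = pd.DataFrame({'age': X_test['age'], 'chol': X_test['chol'], 'diagnosis': y_test})

# Create a ggplot scatterplot with decision boundaries
# (blue = logistic regression, green = LDA, red = linear SVC)
scatter_plot = (
    ggplot(plot_data, aes(x='age', y='chol', color='factor(diagnosis)')) +
    geom_point(alpha=0.8) +
    geom_abline(slope=logreg_slope, intercept=logreg_int, linetype='dashed', color='blue') +
    geom_abline(slope=lda_slope, intercept=lda_int, linetype='dashed', color='green') +
    geom_abline(slope=svc_slope, intercept=svc_int, linetype='dashed', color='red') +
    labs(title='Scatterplot with Decision Boundaries', x='Age', y='Cholesterol', color='Diagnosis') +
    theme_minimal() +
    scale_color_manual(values=["#440154", "#00A08A"])
)

# Display the plot
print(scatter_plot)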