---
title: PA 9.1
author: Marvin (Wenxiang) Li
format:
    html:
        toc: False
        code-fold: true
embed-resources: true
---

## The Data

At this link, you will find a dataset containing information about heart disease patients: https://www.dropbox.com/scl/fi/0vrpdnq5asmeulc4gd50y/ha_1.csv?rlkey=ciisalceotl77ffqhqe3kujzv&dl=1

A description of the original dataset can be found here: https://archive.ics.uci.edu/dataset/45/heart+disease (However, this dataset has been cleaned and reduced, and the people have been given fictitious names.)

## 1. Logistic Regression

Fit a Logistic Regression using only `age` and `chol` (cholesterol) as predictors.

For a 55-year-old, how high would their cholesterol need to be for the doctors to predict that heart disease is present?

How high for the doctors to estimate a 90% chance that heart disease is present?

In [None]:
import numpy as np
import pandas as pd
HeartAttack = pd.read_csv("https://www.dropbox.com/scl/fi/0vrpdnq5asmeulc4gd50y/ha_1.csv?rlkey=ciisalceotl77ffqhqe3kujzv&dl=1")
HeartAttack

Unnamed: 0,Name,age,sex,cp,trtbps,chol,restecg,thalach,diagnosis
0,Magnolia Cassin,60,1,1,117,230,1,160,No Disease
1,Dr. Cathern Heathcote,60,0,3,102,318,1,160,Disease
2,Miles Wolf,62,0,3,130,263,1,97,No Disease
3,Mikaila Block,43,1,1,115,303,1,181,Disease
4,Mrs. Jacquline Marquardt,66,1,1,120,302,0,151,Disease
...,...,...,...,...,...,...,...,...,...
199,Bridgett Franecki,55,0,1,128,205,2,130,No Disease
200,Mr. Foster Zieme,51,1,3,94,227,1,154,Disease
201,Lashanda Hagenes,42,1,2,120,295,1,162,Disease
202,Levern Trantow III,35,0,1,138,183,1,182,Disease


In [None]:
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, PolynomialFeatures, FunctionTransformer
from sklearn.model_selection import cross_val_score, cross_val_predict, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

In [None]:
HeartAttack.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 204 entries, 0 to 203
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Name       204 non-null    object
 1   age        204 non-null    int64 
 2   sex        204 non-null    int64 
 3   cp         204 non-null    int64 
 4   trtbps     204 non-null    int64 
 5   chol       204 non-null    int64 
 6   restecg    204 non-null    int64 
 7   thalach    204 non-null    int64 
 8   diagnosis  204 non-null    object
dtypes: int64(7), object(2)
memory usage: 14.5+ KB


In [None]:
X = HeartAttack[["age","chol"]]
y = HeartAttack["diagnosis"]

In [None]:
ct = ColumnTransformer(
  [
    ("keep", FunctionTransformer(None),["age","chol"])
  ],
  remainder = "drop"
)
model_1 = Pipeline(
  [
    ("column_transformer", ct),
    ("linear_regression", LogisticRegression(random_state = 42))
  ]
)

In [None]:
model_1_fitted = model_1.fit(X,y)
coef1 = model_1_fitted.named_steps['linear_regression'].coef_
coef1

array([[0.04686331, 0.00180124]])

In [None]:
log_reg = model_1.named_steps["linear_regression"]

# Get the intercept
intercept = log_reg.intercept_
intercept

array([-3.24011226])

In [None]:
# Solve ln(p/(1-p)) = intercept + beta1 * age + beta2 * chol for chol, with p = 0.5 and age = 55
(np.log(0.5/(1-0.5)) + 3.24011226 - 0.04686331 * 55) / 0.00180124

367.87446980968673

In [None]:
# chol = (ln(p/(1-p)) - intercept - beta1 * x1) / beta2, here with p = 0.9 and x1 = 55
(np.log(0.9/(1-0.9)) + 3.24011226 - 0.04686331 * 55 ) / 0.00180124

1587.7144563390887
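
The hand calculation above can also be written against the fitted model's own attributes, which avoids retyping coefficients. The sketch below is only illustrative: it uses synthetic stand-in data (the `demo` frame and its generating rule are assumptions, not the real file), and `chol_for_probability` is a hypothetical helper. One point worth checking on the real fit: scikit-learn sorts string labels, so with `Disease` and `No Disease` the entry `classes_[1]` is `No Disease`, and `coef_` and `intercept_` describe the log-odds of that class. The helper flips the sign of the target log-odds when the requested label is `classes_[0]`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the heart data (assumed values, not the real file).
rng = np.random.default_rng(0)
n = 300
demo = pd.DataFrame({
    "age": rng.integers(30, 78, size=n),
    "chol": rng.integers(150, 400, size=n),
})
# Assumed generating rule: the chance of "Disease" rises with age and cholesterol.
true_logit = -9.0 + 0.08 * demo["age"] + 0.02 * demo["chol"]
p_disease = 1 / (1 + np.exp(-true_logit))
demo["diagnosis"] = np.where(rng.random(n) < p_disease, "Disease", "No Disease")

clf = LogisticRegression(max_iter=1000).fit(demo[["age", "chol"]], demo["diagnosis"])

def chol_for_probability(model, age, p, label="Disease"):
    """Cholesterol at which `model` assigns probability `p` to `label` (hypothetical helper)."""
    b_age, b_chol = model.coef_[0]
    b0 = model.intercept_[0]
    target = np.log(p / (1 - p))
    # coef_ and intercept_ give the log-odds of classes_[1];
    # for the other class the target log-odds changes sign.
    if label != model.classes_[1]:
        target = -target
    return (target - b0 - b_age * age) / b_chol

chol_50 = chol_for_probability(clf, age=55, p=0.5)
chol_90 = chol_for_probability(clf, age=55, p=0.9)
print(clf.classes_, chol_50, chol_90)
```

If the direction of the effect on the real data is the opposite of what the question assumes, the solved cholesterol value can fall outside any plausible range; comparing against `predict_proba` at a few points is a quick sanity check.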

## 2. Linear Discriminant Analysis

Fit an LDA model using only `age` and `chol` (cholesterol) as predictors.

For a 55-year-old, how high would their cholesterol need to be for the doctors to predict that heart disease is present?

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

In [None]:
ct = ColumnTransformer(
  [
    ("keep", FunctionTransformer(None),["age","chol"])
  ],
  remainder = "drop"
)
model_2 = Pipeline(
  [
    ("column_transformer", ct),
    ("model", LinearDiscriminantAnalysis())
  ]
)

In [None]:
model_2_fitted = model_2.fit(X,y)
coef2 = model_2_fitted.named_steps['model'].coef_
coef2

array([[0.04655744, 0.00178967]])

In [None]:
intercept2 = model_2_fitted.named_steps['model'].intercept_
intercept2

array([-3.21967766])

In [None]:
# Use the LDA intercept (intercept2), not the logistic regression intercept
np.round(-(intercept2 + coef2[0, 0] * 55) / coef2[0, 1], 2)

array([368.23])
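
A boundary value on its own does not say which side of it is predicted as disease. A minimal sketch of one way to check, assuming synthetic stand-in data (the `demo` frame and its class means are made up for illustration): solve for the cholesterol where the LDA decision function is zero at age 55, then predict one unit below and one unit above it.

```python
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic stand-in data (assumed class means, not the real file).
rng = np.random.default_rng(1)
n = 300
label = rng.choice(["Disease", "No Disease"], size=n)
is_disease = label == "Disease"
demo = pd.DataFrame({
    "age": np.where(is_disease, rng.normal(58, 8, n), rng.normal(50, 8, n)),
    "chol": np.where(is_disease, rng.normal(270, 40, n), rng.normal(230, 40, n)),
})

lda = LinearDiscriminantAnalysis().fit(demo, label)

# In the binary case coef_ has shape (1, 2) and intercept_ has shape (1,).
b_age, b_chol = lda.coef_[0]
b0 = lda.intercept_[0]

age = 55
# Decision boundary: b0 + b_age * age + b_chol * chol = 0
boundary_chol = -(b0 + b_age * age) / b_chol

# Predict just below and just above the boundary to see which side is which.
probe = pd.DataFrame({"age": [age, age],
                      "chol": [boundary_chol - 1, boundary_chol + 1]})
below, above = lda.predict(probe)
print(boundary_chol, below, above)
```

On the real fit, the same two-point probe with `model_2_fitted.predict` would show whether disease is predicted above or below the computed cholesterol.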

## 3. Support Vector Classifier

Fit an SVC model using only `age` and `chol` as predictors. Don't forget to tune the regularization parameter.

For a 55-year-old, how high would their cholesterol need to be for the doctors to predict that heart disease is present?

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import classification_report

In [None]:
model_3 = Pipeline([
    ('scaler', StandardScaler()),  # Scaling step
    ('svc', SVC(kernel='linear'))  # SVC model
])

In [None]:
param_grid = {'svc__C': [0.1, 1, 10, 100, 1000]}
model_3_search = GridSearchCV(model_3, param_grid, cv=5)

In [None]:
model_3_search.fit(X, y)
best_C = model_3_search.best_params_['svc__C']
print(f"Best regularization parameter (C): {best_C}")

Best regularization parameter (C): 10
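
Because the SVC sits behind a `StandardScaler`, its `coef_` and `intercept_` live in standardized units, so the boundary has to be mapped back to raw cholesterol. The sketch below shows one way to do that, assuming synthetic stand-in data and a fixed `C=10`; on the real fit the same steps would presumably apply to `model_3_search.best_estimator_`.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumed class means, not the real file).
rng = np.random.default_rng(3)
n = 300
label = rng.choice(["Disease", "No Disease"], size=n)
is_disease = label == "Disease"
demo = pd.DataFrame({
    "age": np.where(is_disease, rng.normal(58, 8, n), rng.normal(50, 8, n)),
    "chol": np.where(is_disease, rng.normal(270, 40, n), rng.normal(230, 40, n)),
})

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(kernel="linear", C=10)),
]).fit(demo, label)

scaler = pipe.named_steps["scaler"]
svc = pipe.named_steps["svc"]

w_age, w_chol = svc.coef_[0]        # weights on the standardized features
b = svc.intercept_[0]
mean_age, mean_chol = scaler.mean_  # column order follows the training frame
sd_age, sd_chol = scaler.scale_

age = 55
z_age = (age - mean_age) / sd_age
# Boundary in standardized space: w_age * z_age + w_chol * z_chol + b = 0
z_chol = -(b + w_age * z_age) / w_chol
# Undo the scaling to get cholesterol in original units.
boundary_chol = mean_chol + sd_chol * z_chol
print(boundary_chol)
```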


## 4. Comparing Decision Boundaries

Make a scatterplot of `age` and `chol`, coloring the points by their true disease outcome.  Add a line to the plot representing the **linear separator** (aka **decision boundary**) for each of the three models above.
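
One possible way to draw such a plot is sketched below with matplotlib. It assumes synthetic stand-in data rather than the real file, and `boundary_chol` is a hypothetical helper. Since each model's decision function is linear in `chol`, the helper evaluates it at two cholesterol values and interpolates to the zero crossing, which works the same way for a bare estimator and for a scaled pipeline.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumed class means, not the real file).
rng = np.random.default_rng(2)
n = 300
label = rng.choice(["Disease", "No Disease"], size=n)
is_disease = label == "Disease"
demo = pd.DataFrame({
    "age": np.where(is_disease, rng.normal(58, 8, n), rng.normal(50, 8, n)),
    "chol": np.where(is_disease, rng.normal(270, 40, n), rng.normal(230, 40, n)),
})

models = {
    "Logistic": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "SVC": make_pipeline(StandardScaler(), SVC(kernel="linear", C=10)),
}

def boundary_chol(model, ages, c1=0.0, c2=1.0):
    """Cholesterol where a linear decision function crosses zero at each age."""
    f1 = model.decision_function(pd.DataFrame({"age": ages, "chol": c1}))
    f2 = model.decision_function(pd.DataFrame({"age": ages, "chol": c2}))
    # The function is affine in chol, so linear interpolation is exact.
    return c1 - f1 * (c2 - c1) / (f2 - f1)

fig, ax = plt.subplots()

# Scatter the points, one color per true outcome.
for name, group in demo.groupby(label):
    ax.scatter(group["age"], group["chol"], label=name, alpha=0.6)

# One boundary line per fitted model.
ages = np.linspace(demo["age"].min(), demo["age"].max(), 50)
for name, model in models.items():
    model.fit(demo, label)
    ax.plot(ages, boundary_chol(model, ages), label=f"{name} boundary")

ax.set_xlabel("age")
ax.set_ylabel("chol")
ax.legend()
fig.tight_layout()  # outside a notebook, call plt.show() to display the figure
```

With the notebook's own objects, the dictionary would instead hold `model_1_fitted`, `model_2_fitted`, and the tuned SVC pipeline, all already fitted on `X` and `y`.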