<a href="https://colab.research.google.com/github/francji1/01ZLMA/blob/main/hw/01ZLMA_assignment_2025_R.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment for Course 01ZLMA in 2024/2025

The assignment is based on data from patients examined for heart disease.

The original dataset can be found here: https://archive.ics.uci.edu/ml/datasets/Heart+Disease

Various analyses and visualizations of this dataset can also be found here: https://www.kaggle.com/ronitf/heart-disease-uci (note that it uses a different notation, handles missing values differently, etc.)

For this assignment, however, the data have been slightly modified and split into a training set and a test set, which are loaded from the links below.

## 00 - Data description


    age: Age in years
    sex:
        0: Female
        1: Male
    chest_pain_type: Chest Pain Type
        0: asymptomatic
        1: atypical angina
        2: non-anginal pain
        3: typical angina
    blood_pressure: Resting Blood Pressure
    cholesterol: Serum Cholesterol in mg/dl
    blood_sugar: Fasting Blood Sugar
        0: Less than 120 mg/dl
        1: Greater than 120 mg/dl
    rest_ecg: Resting Electrocardiographic Measurement
        0: showing probable or definite left ventricular hypertrophy by Estes' criteria
        1: normal
        2: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    heart_rate: Maximum Heart Rate Achieved
    ex_angina: Exercise Induced Angina
        1: Yes
        0: No
    st_depression: ST depression induced by exercise relative to rest
    st_slope: Slope of the peak exercise ST segment
        0: downsloping
        1: flat
        2: upsloping
    thal: Thalassemia (a blood disorder)
        1: fixed defect
        2: normal
        3: reversible defect
    num_vessels: Number of major vessels colored by fluoroscopy


### Loading and preprocessing data



In [None]:
library(tidyverse)
library(knitr)

In [None]:
# Read the training set directly from the course repository
url_train  <- "https://raw.githubusercontent.com/francji1/01ZLMA/main/data/heart_train.csv"
data_train <- read.table(url_train, header = TRUE, sep = ",")
head(data_train)

In [None]:
# Read the test set directly from the course repository
url_test  <- "https://raw.githubusercontent.com/francji1/01ZLMA/main/data/heart_test.csv"
data_test <- read.table(url_test, header = TRUE, sep = ",")
head(data_test)

### Creating an aggregated table

In [None]:
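# Bin age and blood pressure, then count patients with and without disease per cell.
# cut() uses right-closed intervals by default, e.g. (44, 60] for the "45-60" label.
data_table <- data_train %>%
  dplyr::select(age, sex, blood_pressure, disease) %>%
  mutate(
    age            = cut(age, breaks = c(-Inf, 44, 60, Inf),
                         labels = c("30-45", "45-60", "60-75")),
    blood_pressure = cut(blood_pressure, breaks = c(-Inf, 120, 130, 140, Inf),
                         labels = c("100-120", "121-130", "131-140", "140-180"))
  ) %>%
  group_by(age, blood_pressure) %>%
  summarise(
    n           = n(),
    disease_yes = sum(disease),
    disease_no  = n - disease_yes,
    .groups     = "drop"
  )
data_table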

## 01 - Graphical data visualization (optional)

Use `data_train` only. To make the graphs easier to read, replace the coded values of the factor variables with the labels from the data description.

* Select the categorical variables, convert them to categories, and rename coded labels according to the data description.
* Plot the discrete variables with histograms, using color to distinguish patients with and without heart disease (target 0/1).
* For continuous variables, show two boxplots by response (with vs. without heart disease) and add pairwise scatterplots of the continuous variables, coloring points by response (with/without heart disease).
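
The recoding step in the first bullet can be sketched as follows. This is only an illustration in base R on simulated data: the data frame `toy` and its values are made up and merely mimic a few columns of `data_train`, and the labels are taken from the data description above.

```r
# Simulated stand-in for a few columns of data_train (values are NOT real data)
set.seed(1)
toy <- data.frame(
  sex             = sample(0:1, 100, replace = TRUE),
  chest_pain_type = sample(0:3, 100, replace = TRUE),
  age             = round(runif(100, 30, 75)),
  disease         = sample(0:1, 100, replace = TRUE)
)

# Replace the numeric codes with the labels from the data description
toy$sex <- factor(toy$sex, levels = 0:1, labels = c("Female", "Male"))
toy$chest_pain_type <- factor(
  toy$chest_pain_type,
  levels = 0:3,
  labels = c("asymptomatic", "atypical angina", "non-anginal pain", "typical angina")
)
toy$disease <- factor(toy$disease, levels = 0:1, labels = c("no", "yes"))

# Counts of a discrete variable split by disease status (input for a bar chart)
counts <- table(disease = toy$disease, chest_pain_type = toy$chest_pain_type)
counts
```

With the factors in place, `barplot(counts, beside = TRUE, legend.text = TRUE)` would draw the grouped bar chart and `boxplot(age ~ disease, data = toy)` the boxplots by response; `ggplot2` equivalents work just as well.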



## 02 - Logistic regression on aggregated tabular data

Use `data_table`.





* Define the response for a binomial logistic model and fit the **null model** (intercept only). What are the **average odds** of heart disease in the sample, and what is the **probability** of heart disease?

* Fit a model where heart disease depends **only on blood pressure**. Is blood pressure statistically significant at the 0.05 level? If yes, by how many times are the **odds** of heart disease higher for patients with blood pressure **140–180** compared to those with **100–120**?

* Fit a model where heart disease depends **only on age**. Is age statistically significant at the 0.01 level? If yes, by how many times are the **odds** of heart disease higher for patients aged **60–75** compared to those aged **45–60**?

* Assume the odds of heart disease increase **exponentially** with blood pressure and **exponentially** with age. Create corresponding **numeric continuous predictors** as the midpoints of the blood-pressure and age intervals. Fit a model where the odds depend on these numeric values **without interaction**. What is the **odds ratio** for two patients who differ by **10 years of age** but have the same blood pressure?

* Test the previous model **against the saturated model**. Does this test make sense here? Add a short comment on the result.
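
As a rough sketch of the mechanics behind the tasks above, the following code fits binomial models to an aggregated table. The table `toy_table` and all its counts are invented for illustration and only copy the layout of `data_table`; the interval midpoints are likewise an assumption about how one might define them.

```r
# Made-up aggregated table with the same layout as data_table (counts are NOT real)
toy_table <- expand.grid(
  age            = c("30-45", "45-60", "60-75"),
  blood_pressure = c("100-120", "121-130", "131-140", "140-180"),
  stringsAsFactors = TRUE
)
toy_table$disease_yes <- c(5, 20, 12, 6, 25, 14, 7, 22, 15, 4, 18, 16)
toy_table$disease_no  <- c(15, 25, 8, 14, 24, 7, 10, 18, 6, 5, 10, 5)

# Null model: the binomial response is the two-column matrix (successes, failures)
m0 <- glm(cbind(disease_yes, disease_no) ~ 1, family = binomial, data = toy_table)
odds0 <- exp(coef(m0)[[1]])     # overall odds of disease
prob0 <- plogis(coef(m0)[[1]])  # overall probability of disease

# Numeric predictors: assumed midpoints of the intervals
age_mid <- c("30-45" = 37.5, "45-60" = 52.5, "60-75" = 67.5)
bp_mid  <- c("100-120" = 110, "121-130" = 125.5, "131-140" = 135.5, "140-180" = 160)
toy_table$age_num <- unname(age_mid[as.character(toy_table$age)])
toy_table$bp_num  <- unname(bp_mid[as.character(toy_table$blood_pressure)])

# Log-odds linear in age and blood pressure, no interaction
m_num <- glm(cbind(disease_yes, disease_no) ~ age_num + bp_num,
             family = binomial, data = toy_table)

# Odds ratio for a 10-year age difference at fixed blood pressure
or_10y <- exp(10 * coef(m_num)[["age_num"]])

# Test against the saturated model: residual deviance vs. chi-square
p_sat <- pchisq(deviance(m_num), df.residual(m_num), lower.tail = FALSE)
c(odds0 = odds0, prob0 = prob0, or_10y = or_10y, p_sat = p_sat)
```

For grouped binomial data the residual deviance is the likelihood-ratio statistic against the saturated model, which is why `deviance()` and `df.residual()` are enough here; the chi-square approximation is only as good as the cell counts allow.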


## 03 - Poisson regression on aggregated tabular data

Use `data_table`.


* Reshape the table into the required format and fit a **purely additive log-linear model** for the **group counts**, assuming **mutual independence** among the three grouping predictors (**age**, **blood pressure**, **disease**).

* From that model, what is the estimated **odds** of heart disease among all selected patients, and what is the estimated **probability** of heart disease?

* Fit a model that includes **all pairwise interactions** among the classification variables and compare it to the previous **no-interaction** model. Is the interaction model **significantly better**?

* Using the interaction model, what is the estimated **odds ratio** for heart disease for patients aged **60–75** compared to those aged **45–60**?

* Fit the **saturated model** and print the **parameter estimates**. Is this model significantly better than the model with pairwise interactions?

* Based on the saturated model, is the relationship between **blood pressure** and **heart disease** the **same across all age groups**, or does it differ?

* In which **age category** is the **largest difference** in heart disease between people with **blood pressure < 120** and those with **blood pressure > 140**?
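
The reshaping and the sequence of log-linear models might look like the sketch below. It again uses an invented table (`toy_table`, with made-up counts in the layout of `data_table`), so the numbers mean nothing; only the model formulas and the way odds are read off the coefficients are the point.

```r
# Made-up aggregated table with the same layout as data_table (counts are NOT real)
toy_table <- expand.grid(
  age            = c("30-45", "45-60", "60-75"),
  blood_pressure = c("100-120", "121-130", "131-140", "140-180"),
  stringsAsFactors = TRUE
)
toy_table$disease_yes <- c(5, 20, 12, 6, 25, 14, 7, 22, 15, 4, 18, 16)
toy_table$disease_no  <- c(15, 25, 8, 14, 24, 7, 10, 18, 6, 5, 10, 5)

# Long format: one row per (age, blood_pressure, disease) cell with its count
long <- rbind(
  data.frame(toy_table[c("age", "blood_pressure")], disease = "yes",
             count = toy_table$disease_yes),
  data.frame(toy_table[c("age", "blood_pressure")], disease = "no",
             count = toy_table$disease_no)
)
long$disease <- factor(long$disease, levels = c("no", "yes"))

# Mutual independence: main effects only
m_ind <- glm(count ~ age + blood_pressure + disease, family = poisson, data = long)
odds_ind <- exp(coef(m_ind)[["diseaseyes"]])  # estimated odds of disease
prob_ind <- odds_ind / (1 + odds_ind)         # estimated probability of disease

# All pairwise interactions, compared with the independence model
m_pair <- glm(count ~ (age + blood_pressure + disease)^2, family = poisson, data = long)
cmp_pair <- anova(m_ind, m_pair, test = "Chisq")

# Odds ratio of disease, 60-75 vs. 45-60, from the age:disease interaction terms
b <- coef(m_pair)
or_age <- exp(b[["age60-75:diseaseyes"]] - b[["age45-60:diseaseyes"]])

# Saturated model: three-way interaction, zero residual degrees of freedom
m_sat <- glm(count ~ age * blood_pressure * disease, family = poisson, data = long)
cmp_sat <- anova(m_pair, m_sat, test = "Chisq")
```

In a log-linear model the association between a predictor and `disease` sits in the interaction terms that involve `disease`, so odds and odds ratios are obtained by exponentiating those coefficients rather than the main effects of age or blood pressure.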


## 04 - Logistic regression - statistical approach

Use `data_train`.

* Print a **contingency table** for `sex` and `disease`. From that table, **by hand**, compute the **empirical odds ratio** for heart disease (men vs. women) and the **probability** of disease for women and for men. Compare these to a **logistic regression** with `sex` as the only predictor and `disease` as the response. For the odds ratio, also report a **95% confidence interval**, and comment on whether women have **significantly lower odds** of heart disease.

* Print a **contingency table** for `chest_pain_type` and `disease`. From that table, **by hand**, compute the **empirical odds ratio** for heart disease comparing **type 0 (asymptomatic)** vs. **all other types**, and compute the **probability** of disease for each type. Compare these to a **logistic regression** with `chest_pain_type` as the only predictor and `disease` as the response. For the odds ratio, also report a **95% confidence interval**, and comment on whether patients with **asymptomatic** chest pain have **significantly lower odds** of heart disease than the other types.

* Fit a model using **all available variables** (both categorical and numeric). Use **deviance tests** to **stepwise reduce** the model. Compare the final model to the model you would get using **automatic stepwise selection** (e.g., `step()`).

* For your selected model, compute the **odds** of heart disease for **men vs. women**, including **95% confidence intervals**. Do the same for **asymptomatic chest pain** vs. **other types**. How did these results change compared to the simple models, and how would you **explain** the change?

* Using your model, compute **predicted probabilities** of heart disease for the **test data** and, for the predictor `blood_pressure`, **plot prediction intervals / confidence bands** for the predictions.

* Based on the **training data**, choose a suitable **threshold** for classifying **disease vs. no disease**. On the **test data**, compute **Accuracy** and draw the **ROC curve**.
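
To illustrate the "by hand vs. model" comparison from the first bullet, here is a minimal sketch on simulated data. The data frame `toy` is generated inside the code and has nothing to do with `data_train`; the Wald interval is one possible choice of confidence interval, not necessarily the one expected in the assignment.

```r
# Simulated data: disease probability depends on sex (NOT the assignment data)
set.seed(42)
n   <- 200
sex <- rbinom(n, 1, 0.6)
toy <- data.frame(
  sex     = factor(sex, levels = 0:1, labels = c("Female", "Male")),
  disease = rbinom(n, 1, plogis(-0.8 + 1.2 * sex))
)

# Contingency table: rows = sex, columns = disease (0/1)
tab <- table(sex = toy$sex, disease = toy$disease)

# By hand: odds ratio (men vs. women) and disease probability per sex
or_hand <- (tab["Male", "1"] * tab["Female", "0"]) /
           (tab["Male", "0"] * tab["Female", "1"])
p_hand  <- prop.table(tab, margin = 1)[, "1"]

# Logistic regression with sex as the only predictor
fit    <- glm(disease ~ sex, family = binomial, data = toy)
or_glm <- exp(coef(fit)[["sexMale"]])
p_glm  <- predict(
  fit,
  newdata = data.frame(sex = factor(c("Female", "Male"), levels = levels(toy$sex))),
  type = "response"
)

# 95% Wald confidence interval for the odds ratio
ci_wald <- exp(confint.default(fit)["sexMale", ])
list(or_hand = or_hand, or_glm = or_glm, ci_wald = ci_wald)
```

Because a single binary predictor makes the model saturated for the 2x2 table, the fitted odds ratio and probabilities reproduce the hand calculations. `confint(fit)` would give profile-likelihood intervals instead of Wald intervals.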




## 05 - Logistic regression - machine learning approach

Use `data_train` and `data_test`.

* Build a **pipeline** on the **training data** for logistic regression with **elastic-net regularization** that includes:
  * Variable preparation: transformations, one-hot encoding, normalization, etc.
  * Hyper-parameter search for the “optimal” regularization settings.
  * **k-fold cross-validation**.

* Using the pipeline/workflow, choose the hyper-parameter value. If the goal is to **detect patients with heart disease**, which statistic should we focus on to avoid **sending a sick patient home as healthy** (i.e., minimize this error)?

* Compute and compare common **binary-classification metrics** on the **training** and **test** sets. Plot the **ROC curve** and compute the **AUC** for both the training and test data. What can we say about the model from **Section 05** compared to the model from **Section 04**?
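
The evaluation step in the last bullet can be sketched in base R without extra packages. Everything below is an assumption for illustration: the data are simulated, the scores come from a plain unpenalized `glm` rather than an elastic-net pipeline, and `classify_metrics`, `roc_points` and `auc` are hypothetical helper names, not functions from any library.

```r
# Simulated train/test split with one predictor (NOT the assignment data)
set.seed(7)
n <- 300
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.5 + 1.5 * x))
train <- data.frame(x = x[1:200],   y = y[1:200])
test  <- data.frame(x = x[201:300], y = y[201:300])

# Stand-in model; any model that outputs probabilities can be evaluated the same way
fit    <- glm(y ~ x, family = binomial, data = train)
p_test <- predict(fit, newdata = test, type = "response")

# Confusion-matrix metrics at a given threshold
classify_metrics <- function(y, p, threshold = 0.5) {
  pred <- as.integer(p >= threshold)
  tp <- sum(pred == 1 & y == 1); tn <- sum(pred == 0 & y == 0)
  fp <- sum(pred == 1 & y == 0); fn <- sum(pred == 0 & y == 1)
  c(accuracy    = (tp + tn) / length(y),
    sensitivity = tp / (tp + fn),   # share of sick patients that are detected
    specificity = tn / (tn + fp))   # share of healthy patients that are cleared
}

# ROC curve: true and false positive rates over all thresholds
roc_points <- function(y, p) {
  th <- c(Inf, sort(unique(p), decreasing = TRUE), -Inf)
  data.frame(
    fpr = sapply(th, function(t) mean(p[y == 0] >= t)),
    tpr = sapply(th, function(t) mean(p[y == 1] >= t))
  )
}

# AUC via the rank (Mann-Whitney) formula
auc <- function(y, p) {
  r  <- rank(p)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

metrics_test <- classify_metrics(test$y, p_test, threshold = 0.5)
roc_test     <- roc_points(test$y, p_test)
auc_test     <- auc(test$y, p_test)
```

Calling `plot(roc_test$fpr, roc_test$tpr, type = "l")` followed by `abline(0, 1, lty = 2)` draws the curve against the chance diagonal. Running the same three helpers on the training data gives the train/test comparison; a large gap between the two would hint at overfitting.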
