# Assignment 4: Predicting Heart Disease Using Decision Trees and Causal Forest (R)

This notebook implements:
1. Classification tree for heart disease prediction
2. Causal forest analysis for treatment effects

## Part 1: Predicting Heart Disease Using a Classification Tree

In [None]:
# Load required libraries
library(rpart)
library(rpart.plot)
library(caret)
library(ggplot2)
library(dplyr)
library(tidyr)
library(randomForest)
library(pheatmap)

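# Set random seed for reproducibility
set.seed(123)

# Create the output directory used by the png() calls below, if it is missing
dir.create('../output', showWarnings = FALSE, recursive = TRUE)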

### 1.1 Data Cleaning (2 points)

In [None]:
# Load the data
column_names <- c('age', 'sex', 'cp', 'restbp', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'hd')
df <- read.csv('../input/processed.cleveland.data', header = FALSE, col.names = column_names, na.strings = '?')

cat(sprintf("Original dataset shape: %d rows, %d columns\n", nrow(df), ncol(df)))
cat("\nMissing values:\n")
print(colSums(is.na(df)))

# Remove missing values
df <- na.omit(df)
cat(sprintf("\nDataset shape after removing missing values: %d rows, %d columns\n", nrow(df), ncol(df)))

In [None]:
# Create binary variable y (1 if heart disease, 0 otherwise)
df$y <- as.factor(ifelse(df$hd > 0, 1, 0))

cat("Distribution of heart disease:\n")
print(table(df$y))
cat(sprintf("\nPercentage with heart disease: %.2f%%\n", mean(df$y == 1) * 100))

In [None]:
# Identify categorical variables and convert to factors
categorical_vars <- c('sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal')

for (var in categorical_vars) {
  df[[var]] <- as.factor(df[[var]])
}

# Create dummy variables
# Using model.matrix to create dummy variables
formula_str <- paste("~ ", paste(categorical_vars, collapse = " + "), "- 1")
dummies <- model.matrix(as.formula(formula_str), data = df)

# Combine with continuous variables
continuous_vars <- c('age', 'restbp', 'chol', 'thalach', 'oldpeak')
df_encoded <- cbind(df[continuous_vars], dummies, y = df$y, hd = df$hd)

cat(sprintf("Dataset shape after creating dummy variables: %d rows, %d columns\n", nrow(df_encoded), ncol(df_encoded)))
cat("\nColumn names:\n")
print(colnames(df_encoded))
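
As a minimal illustration of what `model.matrix` does with the `- 1` term, the sketch below encodes a hypothetical two-factor toy data frame (`toy` and its values are made up, not taken from the Cleveland data). Without an intercept, the first factor keeps an indicator for every level, while later factors drop their first level as the reference.

```r
# Hypothetical toy data, used only to illustrate the encoding
toy <- data.frame(
  sex = factor(c(0, 1, 1, 0)),
  cp  = factor(c(1, 2, 3, 1))
)

# "- 1" removes the intercept, so the first factor (sex) keeps both levels;
# cp still drops its first level (cp = 1) as the reference category
toy_dummies <- model.matrix(~ sex + cp - 1, data = toy)
print(colnames(toy_dummies))
print(dim(toy_dummies))
```

For tree-based models the redundant `sex0`/`sex1` pair is harmless, because each split considers one column at a time.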

In [None]:
# Prepare features and target
X <- df_encoded[, !(colnames(df_encoded) %in% c('y', 'hd'))]
y <- df_encoded$y

cat(sprintf("Features shape: %d rows, %d columns\n", nrow(X), ncol(X)))
cat(sprintf("Target length: %d\n", length(y)))

### 1.2 Data Analysis (8 points)

#### (1 point) Split data and plot classification tree

In [None]:
# Split the data into training and test sets
set.seed(123)
train_index <- createDataPartition(y, p = 0.7, list = FALSE)
X_train <- X[train_index, ]
X_test <- X[-train_index, ]
y_train <- y[train_index]
y_test <- y[-train_index]

cat(sprintf("Training set size: %d\n", nrow(X_train)))
cat(sprintf("Test set size: %d\n", nrow(X_test)))

In [None]:
# Train initial classification tree (without pruning)
train_data <- cbind(X_train, y = y_train)
tree_model <- rpart(y ~ ., data = train_data, method = "class", control = rpart.control(cp = 0))

# Plot the tree
png('../output/initial_tree.png', width = 2000, height = 1000, res = 150)
rpart.plot(tree_model, type = 4, extra = 101, under = TRUE, cex = 0.8, 
           box.palette = "RdYlGn", shadow.col = "gray", main = "Initial Classification Tree (Unpruned)")
dev.off()

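# Depth comes from rpart node numbers: a node numbered k sits at depth floor(log2(k))
tree_depth <- max(floor(log2(as.numeric(rownames(tree_model$frame)))))
cat(sprintf("Tree depth: %d\n", tree_depth))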
cat(sprintf("Number of leaves: %d\n", sum(tree_model$frame$var == "<leaf>")))

#### (2 points) Plot confusion matrix and interpret

In [None]:
# Predictions on test set
y_pred <- predict(tree_model, newdata = X_test, type = "class")

# Compute confusion matrix, treating "1" (heart disease) as the positive class
cm <- confusionMatrix(y_pred, y_test, positive = "1")
print(cm)

# Plot confusion matrix
cm_table <- as.data.frame(cm$table)
png('../output/confusion_matrix_initial.png', width = 800, height = 600, res = 100)
ggplot(data = cm_table, aes(x = Reference, y = Prediction, fill = Freq)) +
  geom_tile() +
  geom_text(aes(label = Freq), size = 8) +
  scale_fill_gradient(low = "white", high = "steelblue") +
  scale_x_discrete(labels = c("Does not have HD", "Has HD")) +
  scale_y_discrete(labels = c("Does not have HD", "Has HD")) +
  labs(title = sprintf("Confusion Matrix - Initial Tree\nAccuracy: %.4f", cm$overall['Accuracy']),
       x = "Actual", y = "Predicted") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14),
        axis.text = element_text(size = 10))
dev.off()

cat(sprintf("\nTest Accuracy: %.4f\n", cm$overall['Accuracy']))

**Interpretation of Initial Confusion Matrix:**
- The initial tree is grown with `cp = 0` (no pruning), so it is prone to overfitting: it can fit the training set closely and still misclassify test cases
- True Positives: Correctly identified patients with heart disease
- True Negatives: Correctly identified patients without heart disease
- False Positives: Patients incorrectly classified as having heart disease (Type I error)
- False Negatives: Patients incorrectly classified as not having heart disease (Type II error - more concerning in medical diagnosis)
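
The four cells map onto the usual summary rates as sketched below. The labels are hypothetical toy vectors, not the notebook's predictions, and class `1` (heart disease) is treated as the positive class.

```r
# Hypothetical toy labels: 1 = heart disease, 0 = no heart disease
actual    <- factor(c(1, 1, 1, 0, 0, 0, 0, 1), levels = c(0, 1))
predicted <- factor(c(1, 0, 1, 0, 1, 0, 0, 1), levels = c(0, 1))

# Rows are predictions, columns are true labels
tab <- table(Predicted = predicted, Actual = actual)

tp <- tab["1", "1"]  # true positives
tn <- tab["0", "0"]  # true negatives
fp <- tab["1", "0"]  # false positives (Type I error)
fn <- tab["0", "1"]  # false negatives (Type II error)

sensitivity <- tp / (tp + fn)  # share of diseased patients that are caught
specificity <- tn / (tn + fp)  # share of healthy patients that are cleared
accuracy    <- (tp + tn) / sum(tab)

cat(sprintf("Sensitivity: %.2f, Specificity: %.2f, Accuracy: %.2f\n",
            sensitivity, specificity, accuracy))
```

A false negative lowers sensitivity only, which is why that rate is the one to watch when missed diagnoses are the main concern.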

#### (1.5 points) Fix overfitting using cross-validation

In [None]:
# Generate 50 alpha (cp) values equally spaced on a logarithmic scale between e^-10 and 0.05
alphas <- exp(seq(-10, log(0.05), length.out = 50))
cat(sprintf("Alpha range: %.10f to %.4f\n", min(alphas), max(alphas)))
cat(sprintf("Number of alphas: %d\n", length(alphas)))

In [None]:
# Perform 4-fold cross-validation to select optimal alpha
set.seed(123)
folds <- createFolds(y_train, k = 4)

cv_scores <- sapply(alphas, function(alpha) {
  fold_accuracies <- sapply(folds, function(fold) {
    train_fold <- train_data[-fold, ]
    val_fold <- train_data[fold, ]
    
    tree_cv <- rpart(y ~ ., data = train_fold, method = "class", 
                     control = rpart.control(cp = alpha))
    pred_cv <- predict(tree_cv, newdata = val_fold, type = "class")
    mean(pred_cv == val_fold$y)
  })
  mean(fold_accuracies)
})

optimal_idx <- which.max(cv_scores)
optimal_alpha <- alphas[optimal_idx]
optimal_cv_score <- cv_scores[optimal_idx]

cat(sprintf("Optimal alpha: %.10f\n", optimal_alpha))
cat(sprintf("Best CV accuracy: %.4f\n", optimal_cv_score))

#### (1.5 points) Plot Inaccuracy Rate (1 - Accuracy) against alpha

In [None]:
# Calculate inaccuracy rate
inaccuracy_rates <- 1 - cv_scores

# Plot inaccuracy rate vs alpha
png('../output/inaccuracy_vs_alpha.png', width = 1200, height = 600, res = 100)
plot_df <- data.frame(alpha = alphas, inaccuracy = inaccuracy_rates)
ggplot(plot_df, aes(x = alpha, y = inaccuracy)) +
  geom_line(linewidth = 1, color = "steelblue") +
  geom_point(size = 2, color = "steelblue") +
  geom_vline(xintercept = optimal_alpha, linetype = "dashed", color = "red", linewidth = 1) +
  annotate("text", x = optimal_alpha, y = max(inaccuracy_rates), 
           label = sprintf("Optimal α = %.6f", optimal_alpha), hjust = -0.1, color = "red") +
  scale_x_log10() +
  labs(title = "Inaccuracy Rate vs Alpha (4-fold Cross-Validation)",
       x = "Alpha (log scale)",
       y = "Inaccuracy Rate (1 - Accuracy)") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14))
dev.off()

cat(sprintf("Minimum inaccuracy rate: %.4f\n", min(inaccuracy_rates)))
cat(sprintf("Maximum inaccuracy rate: %.4f\n", max(inaccuracy_rates)))

#### (2 points) Plot optimal tree and confusion matrix with interpretation

In [None]:
# Train tree with optimal alpha
tree_optimal <- rpart(y ~ ., data = train_data, method = "class", 
                      control = rpart.control(cp = optimal_alpha))

# Plot optimal tree
png('../output/optimal_tree.png', width = 2000, height = 1000, res = 150)
rpart.plot(tree_optimal, type = 4, extra = 101, under = TRUE, cex = 0.8,
           box.palette = "RdYlGn", shadow.col = "gray",
           main = sprintf("Optimal Classification Tree (α = %.6f)", optimal_alpha))
dev.off()

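# Depth comes from rpart node numbers: a node numbered k sits at depth floor(log2(k))
optimal_depth <- max(floor(log2(as.numeric(rownames(tree_optimal$frame)))))
cat(sprintf("Optimal tree depth: %d\n", optimal_depth))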
cat(sprintf("Optimal number of leaves: %d\n", sum(tree_optimal$frame$var == "<leaf>")))

In [None]:
# Predictions with optimal tree
y_pred_optimal <- predict(tree_optimal, newdata = X_test, type = "class")

# Compute confusion matrix, treating "1" (heart disease) as the positive class
cm_optimal <- confusionMatrix(y_pred_optimal, y_test, positive = "1")
print(cm_optimal)

# Plot confusion matrix
cm_optimal_table <- as.data.frame(cm_optimal$table)
png('../output/confusion_matrix_optimal.png', width = 800, height = 600, res = 100)
ggplot(data = cm_optimal_table, aes(x = Reference, y = Prediction, fill = Freq)) +
  geom_tile() +
  geom_text(aes(label = Freq), size = 8) +
  scale_fill_gradient(low = "white", high = "steelblue") +
  scale_x_discrete(labels = c("Does not have HD", "Has HD")) +
  scale_y_discrete(labels = c("Does not have HD", "Has HD")) +
  labs(title = sprintf("Confusion Matrix - Optimal Tree (α = %.6f)\nAccuracy: %.4f", 
                       optimal_alpha, cm_optimal$overall['Accuracy']),
       x = "Actual", y = "Predicted") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14),
        axis.text = element_text(size = 10))
dev.off()

cat(sprintf("\nOptimal Test Accuracy: %.4f\n", cm_optimal$overall['Accuracy']))
cat(sprintf("Sensitivity (True Positive Rate): %.4f\n", cm_optimal$byClass['Sensitivity']))
cat(sprintf("Specificity (True Negative Rate): %.4f\n", cm_optimal$byClass['Specificity']))

**Interpretation and Discussion:**

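1. **Tree Complexity**: The tree refit with the cross-validated α has at most as many splits as the initial tree grown with `cp = 0`; when the selected α is large enough to remove splits, the simpler tree is less prone to overfitting.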

2. **Performance Comparison**: 
   - The pruned tree may have slightly lower training accuracy but better generalization on test data
   - The cross-validation process helped select an alpha that balances bias and variance

3. **Clinical Implications**:
   - Sensitivity measures the ability to correctly identify patients with heart disease
   - Specificity measures the ability to correctly identify patients without heart disease
   - In medical diagnosis, high sensitivity is often preferred to avoid missing cases (minimize false negatives)

4. **Model Insights**: 
   - The tree reveals which features are most important for heart disease prediction
   - Pruning removes splits whose improvement in fit does not exceed the complexity penalty α
   - The optimal model provides interpretable decision rules for clinical use
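
The effect of pruning on tree size can be sketched with rpart's built-in tools. This is an alternative to the manual 4-fold loop used above, not a replacement for it: it relies on the cross-validated error (`xerror`) that `rpart` stores in `cptable`, and it uses the `kyphosis` data shipped with rpart as a stand-in for the heart disease data.

```r
library(rpart)

set.seed(123)  # rpart's internal cross-validation is random

# kyphosis ships with rpart; it stands in for the heart disease data here
full_tree <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                   method = "class",
                   control = rpart.control(cp = 0, minsplit = 5))

# cptable stores the cross-validated error (xerror) for each candidate cp
cp_table <- full_tree$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# prune() returns the subtree obtained by collapsing splits below best_cp
pruned_tree <- prune(full_tree, cp = best_cp)

count_leaves <- function(tree) sum(tree$frame$var == "<leaf>")
cat(sprintf("Leaves before pruning: %d\n", count_leaves(full_tree)))
cat(sprintf("Leaves after pruning:  %d\n", count_leaves(pruned_tree)))
```

The pruned tree is always a subtree of the full one, so it can never have more leaves.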

## Part 2: Causal Forest Analysis

### (0.5 points) Create binary treatment variable T

In [None]:
# Reset random seed
set.seed(123)

# Create binary treatment variable (random assignment)
df$T <- rbinom(nrow(df), 1, 0.5)

cat("Treatment distribution:\n")
print(table(df$T))
cat(sprintf("\nProportion treated: %.4f\n", mean(df$T)))

### (1 point) Create outcome variable Y

In [None]:
# Generate outcome variable Y based on the specified formula
# Y = (1 + 0.05*age + 0.3*sex + 0.2*restbp) * T + 0.5*oldpeak + ε
# where ε ~ N(0, 1)

epsilon <- rnorm(nrow(df), 0, 1)
df$Y <- (1 + 0.05 * df$age + 0.3 * as.numeric(as.character(df$sex)) + 0.2 * df$restbp) * df$T + 
        0.5 * df$oldpeak + epsilon

cat("Outcome variable Y statistics:\n")
print(summary(df$Y))
cat(sprintf("\nMean Y for treated: %.4f\n", mean(df$Y[df$T == 1])))
cat(sprintf("Mean Y for control: %.4f\n", mean(df$Y[df$T == 0])))
cat(sprintf("Raw difference: %.4f\n", mean(df$Y[df$T == 1]) - mean(df$Y[df$T == 0])))

### (1 point) Calculate treatment effect using OLS

In [None]:
# OLS regression: Y ~ T
ols_model <- lm(Y ~ T, data = df)
ols_summary <- summary(ols_model)

cat("OLS Regression Results: Y ~ T\n")
cat("=================================================\n")
print(ols_summary)

treatment_effect_ols <- coef(ols_model)['T']
cat(sprintf("\nTreatment Effect (β_T): %.4f\n", treatment_effect_ols))
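
Because Y is simulated from a known formula and T is randomly assigned, the true individual effect is `1 + 0.05*age + 0.3*sex + 0.2*restbp`, and the OLS coefficient on T should land near its average. The sketch below checks this on simulated covariates; the covariate distributions and the names `treat`, `tau`, `true_ate`, and `ols_ate` are assumptions for illustration, not the Cleveland data.

```r
set.seed(123)
n <- 5000

# Hypothetical covariates, loosely mimicking the ranges of the real columns
age     <- round(runif(n, 30, 75))
sex     <- rbinom(n, 1, 0.5)
restbp  <- rnorm(n, mean = 130, sd = 15)
oldpeak <- rexp(n, rate = 1)

# Randomly assigned treatment and the outcome formula used above
treat <- rbinom(n, 1, 0.5)
tau   <- 1 + 0.05 * age + 0.3 * sex + 0.2 * restbp   # true individual effect
Y     <- tau * treat + 0.5 * oldpeak + rnorm(n)

true_ate <- mean(tau)
ols_ate  <- unname(coef(lm(Y ~ treat))["treat"])

cat(sprintf("True ATE (mean of tau): %.4f\n", true_ate))
cat(sprintf("OLS estimate:           %.4f\n", ols_ate))
```

The same comparison can be made in this notebook by averaging the bracketed term of the Y formula over `df` and setting it next to `treatment_effect_ols`.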

### (2 points) Use Random Forest to estimate causal effects

In [None]:
# Prepare features for causal forest
covariates <- c('age', 'sex', 'cp', 'restbp', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal')

# Create data with treatment and covariates
rf_data <- df[, c(covariates, 'T')]
for (var in covariates) {
  if (is.factor(rf_data[[var]])) {
    rf_data[[var]] <- as.numeric(as.character(rf_data[[var]]))
  }
}

# Add interaction terms
for (var in covariates) {
  rf_data[[paste0('T_x_', var)]] <- rf_data$T * rf_data[[var]]
}

# Fit Random Forest
set.seed(123)
rf_model <- randomForest(x = rf_data, y = df$Y, ntree = 100, maxnodes = NULL, nodesize = 5)

# Predict outcomes under treatment and control
rf_treated <- rf_data
rf_treated$T <- 1
for (var in covariates) {
  rf_treated[[paste0('T_x_', var)]] <- rf_treated[[var]]
}

rf_control <- rf_data
rf_control$T <- 0
for (var in covariates) {
  rf_control[[paste0('T_x_', var)]] <- 0
}

Y_pred_treated <- predict(rf_model, newdata = rf_treated)
Y_pred_control <- predict(rf_model, newdata = rf_control)

# Individual treatment effects
df$ITE <- Y_pred_treated - Y_pred_control

cat("Random Forest Causal Effects Estimation\n")
cat("=================================================\n")
cat(sprintf("Average Treatment Effect (ATE): %.4f\n", mean(df$ITE)))
cat(sprintf("Standard Deviation of ITE: %.4f\n", sd(df$ITE)))
cat(sprintf("Min ITE: %.4f\n", min(df$ITE)))
cat(sprintf("Max ITE: %.4f\n", max(df$ITE)))

# Plot distribution of treatment effects
png('../output/ite_distribution.png', width = 1000, height = 600, res = 100)
ggplot(data.frame(ITE = df$ITE), aes(x = ITE)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "black", alpha = 0.7) +
  geom_vline(xintercept = mean(df$ITE), linetype = "dashed", color = "red", linewidth = 1) +
  annotate("text", x = mean(df$ITE), y = Inf, 
           label = sprintf("Mean ITE = %.4f", mean(df$ITE)), 
           vjust = 2, hjust = -0.1, color = "red") +
  labs(title = "Distribution of Individual Treatment Effects",
       x = "Individual Treatment Effect",
       y = "Frequency") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14))
dev.off()

### (2 points) Plot representative tree with max_depth=2

In [None]:
# Train a single decision tree with maxdepth=2
tree_rf_data <- cbind(rf_data, Y = df$Y)
tree_model_rf <- rpart(Y ~ ., data = tree_rf_data, method = "anova", 
                       control = rpart.control(maxdepth = 2, minsplit = 10, cp = 0))

# Plot the tree
png('../output/representative_tree.png', width = 2000, height = 1000, res = 150)
rpart.plot(tree_model_rf, type = 4, extra = 101, under = TRUE, cex = 0.8,
           box.palette = "RdYlGn", shadow.col = "gray",
           main = "Representative Tree (max_depth=2) for Treatment Effect Heterogeneity")
dev.off()

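# Depth comes from rpart node numbers: a node numbered k sits at depth floor(log2(k))
rf_tree_depth <- max(floor(log2(as.numeric(rownames(tree_model_rf$frame)))))
cat(sprintf("Tree depth: %d\n", rf_tree_depth))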
cat(sprintf("Number of leaves: %d\n", sum(tree_model_rf$frame$var == "<leaf>")))

**Interpretation of Representative Tree:**

This shallow tree (max_depth=2) is fit to the outcome Y, not to the estimated treatment effects, so it shows treatment effect heterogeneity only indirectly:
- Splits on `T` or on an interaction term (T_x_*) mainly separate control patients, for whom every T_x_* is 0, from treated patients
- Splits on a plain covariate reflect that covariate's direct association with Y
- Leaf nodes show the predicted outcome for patients in that subgroup
- The interaction terms (T_x_*) capture how treatment effects vary by patient characteristics
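
The role of the interaction terms can be seen in a small simulation: fit one model on the covariate, the treatment, and their product, then predict every unit twice, once as treated and once as control. In this sketch `lm()` is swapped in for the random forest to keep it dependency-free, and `x`, `treat`, `treat_x`, and the toy outcome formula are all hypothetical.

```r
set.seed(123)
n <- 2000

# Hypothetical single covariate and randomly assigned treatment
x     <- rnorm(n)
treat <- rbinom(n, 1, 0.5)

# Assumed toy outcome: the treatment effect grows with x (tau = 1 + 2 * x)
tau <- 1 + 2 * x
Y   <- tau * treat + 0.5 * x + rnorm(n)

toy <- data.frame(Y = Y, treat = treat, x = x, treat_x = treat * x)

# One model for both arms; lm() is swapped in for the random forest here
fit <- lm(Y ~ treat + x + treat_x, data = toy)

# Counterfactual copies: everyone treated vs. everyone untreated
toy_treated <- transform(toy, treat = 1, treat_x = x)
toy_control <- transform(toy, treat = 0, treat_x = 0)

# Estimated individual effect = predicted treated outcome - predicted control outcome
ite_hat <- predict(fit, newdata = toy_treated) - predict(fit, newdata = toy_control)
cat(sprintf("Correlation between estimated and true effects: %.4f\n", cor(ite_hat, tau)))
```

Setting the interaction column to the covariate value for the treated copy and to 0 for the control copy mirrors how `rf_treated` and `rf_control` are built above.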

### (1.5 points) Feature importance visualization

In [None]:
# Get feature importances from Random Forest
importance_df <- data.frame(
  feature = names(rf_model$importance[, 1]),
  importance = rf_model$importance[, 1]
) %>% arrange(desc(importance))

# Plot top 15 most important features
png('../output/feature_importance.png', width = 1200, height = 800, res = 100)
top_features <- head(importance_df, 15)
ggplot(top_features, aes(x = reorder(feature, importance), y = importance)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 15 Feature Importances (Random Forest)",
       x = "Feature",
       y = "Importance (IncNodePurity)") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14))
dev.off()

cat("\nTop 10 Most Important Features:\n")
print(head(importance_df, 10))

### (2 points) Covariate distribution by treatment effect terciles

In [None]:
# Standardize covariates
covariate_data <- df[, covariates]
for (var in covariates) {
  if (is.factor(covariate_data[[var]])) {
    covariate_data[[var]] <- as.numeric(as.character(covariate_data[[var]]))
  }
}

standardized_covariates <- scale(covariate_data)
standardized_df <- as.data.frame(standardized_covariates)

# Divide predicted treatment effects into terciles
df$ITE_tercile <- cut(df$ITE, breaks = quantile(df$ITE, probs = c(0, 1/3, 2/3, 1)), 
                      labels = c('Low', 'Medium', 'High'), include.lowest = TRUE)

cat("Treatment Effect Terciles:\n")
print(tapply(df$ITE, df$ITE_tercile, summary))

In [None]:
# Compute mean of each standardized covariate within each tercile
standardized_df$tercile <- df$ITE_tercile

tercile_means <- standardized_df %>%
  group_by(tercile) %>%
  summarise(across(all_of(covariates), ~ mean(.x, na.rm = TRUE))) %>%
  as.data.frame()

rownames(tercile_means) <- tercile_means$tercile
tercile_means <- tercile_means[, -1]

# Create heatmap
png('../output/terciles_heatmap.png', width = 1400, height = 600, res = 100)
pheatmap(tercile_means, 
         cluster_rows = FALSE, 
         cluster_cols = FALSE,
         display_numbers = TRUE,
         number_format = "%.2f",
         color = colorRampPalette(c("blue", "white", "red"))(50),
         breaks = seq(-2, 2, length.out = 51),
         main = "Mean Standardized Covariates by Predicted Treatment Effect Terciles",
         fontsize = 10,
         fontsize_number = 8)
dev.off()

cat("\nMean Standardized Covariates by Tercile:\n")
print(tercile_means)

**Interpretation of Tercile Analysis:**

This heatmap shows how patient characteristics differ across treatment effect terciles:

- **Low tercile**: Patients with lowest predicted treatment effects
- **Medium tercile**: Patients with moderate predicted treatment effects  
- **High tercile**: Patients with highest predicted treatment effects

The color intensity indicates how each covariate's mean differs from zero (the population mean after standardization):
- Red colors indicate above-average values for that tercile
- Blue colors indicate below-average values for that tercile

This analysis helps identify which patient characteristics are associated with higher or lower treatment effects, informing targeted intervention strategies.
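
How a tercile row of the heatmap is computed can be sketched with a single covariate. The data are hypothetical: `age` is simulated and `ite` is constructed to rise with it, so the standardized mean should increase from Low to High.

```r
set.seed(123)
n <- 300

# Hypothetical covariate and effect estimates (effect rises with the covariate)
age <- rnorm(n, mean = 55, sd = 9)
ite <- 1 + 0.05 * age

# Standardize so that 0 is the overall mean and units are standard deviations
age_z <- as.numeric(scale(age))

# Split the effect estimates into three equally sized groups
tercile <- cut(ite, breaks = quantile(ite, probs = c(0, 1/3, 2/3, 1)),
               labels = c("Low", "Medium", "High"), include.lowest = TRUE)

# Mean standardized covariate within each tercile (one heatmap column)
tercile_mean_age <- tapply(age_z, tercile, mean)
print(table(tercile))
print(round(tercile_mean_age, 2))
```

A covariate unrelated to the estimated effect would instead show means close to 0 in all three terciles.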

## Summary of Results

### Part 1: Classification Tree
- Successfully built and pruned a classification tree for heart disease prediction
- Used cross-validation to find optimal complexity parameter (alpha)
- Evaluated the pruned tree on the held-out test set (accuracy, sensitivity, specificity) while keeping the model interpretable

### Part 2: Causal Forest
- Estimated heterogeneous treatment effects using Random Forest
- Identified key characteristics associated with treatment response
- Visualized treatment effect heterogeneity across patient subgroups

All figures have been saved to the `../output/` directory.