# Test: Bank Marketing Dataset
Checks whether this real-world dataset gives a Random Forest baseline that leaves room for stacking gains.

## How to get the dataset:
- Source: https://archive.ics.uci.edu/ml/datasets/bank+marketing
- No manual download is needed: the first code cell downloads and extracts the archive automatically


In [1]:
# Download and extract from UCI
library(caret)
library(dplyr)
set.seed(42)

cat("Downloading Bank Marketing dataset...\n")
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip"
temp_zip <- tempfile(fileext = ".zip")
download.file(url, temp_zip, mode = "wb")
unzip(temp_zip, exdir = tempdir())
bank <- read.csv(file.path(tempdir(), "bank-additional/bank-additional-full.csv"), sep=";")

cat("Dataset Info:\n")
cat("Dimensions:", nrow(bank), "x", ncol(bank), "\n")
cat("Target variable (y): yes/no subscription\n")
cat("Classes:\n")
print(table(bank$y))
cat("\nProportions:", round(prop.table(table(bank$y)) * 100, 1), "%\n")


Loading required package: ggplot2



Loading required package: lattice




Attaching package: 'dplyr'




The following objects are masked from 'package:stats':

    filter, lag




The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union




Downloading Bank Marketing dataset...


Dataset Info:


Dimensions: 41188 x 21 


Target variable (y): yes/no subscription


Classes:



   no   yes 
36548  4640 



Proportions: 88.7 11.3 %


In [2]:
# Preprocessing
bank <- bank %>% 
  select(-duration) %>%  # Remove duration: only known after the call ends, so it leaks the target
  rename(target = y)     # Rename target column

# Convert target to factor
bank$target <- as.factor(bank$target)

cat("Features after selection:", ncol(bank)-1, "\n\n")

# Identify numeric and categorical columns
numeric_cols <- names(bank)[sapply(bank, is.numeric)]
cat_cols <- names(bank)[sapply(bank, is.factor) | sapply(bank, is.character)]
cat_cols <- cat_cols[cat_cols != "target"]

cat("Numeric features:", length(numeric_cols), "\n")  # target is a factor, so nothing to subtract
cat("Categorical features:", length(cat_cols), "\n\n")

# One-hot encode categorical variables
bank_encoded <- bank
for (col in cat_cols) {
  dummies <- model.matrix(~ . - 1, data.frame(bank_encoded[[col]]))
  colnames(dummies) <- paste0(col, "_", colnames(dummies))
  bank_encoded[[col]] <- NULL
  # Drop the last level as the reference; drop = FALSE keeps a named matrix
  # even when only one column remains (two-level factors such as contact)
  bank_encoded <- cbind(bank_encoded, dummies[, -ncol(dummies), drop = FALSE])
}

cat("Final feature count:", ncol(bank_encoded) - 1, "\n")
cat("Class balance remains:", table(bank_encoded$target), "\n")


Features after selection: 19 



Numeric features: 9 


Categorical features: 10 



Final feature count: 52 


Class balance remains: 36548 4640 
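
The per-column encoding loop above can also be written as a single `model.matrix` call. The following is a minimal sketch on a hypothetical toy data frame (`toy` and its values are made up for illustration and are not rows from the dataset). It drops the first level of each factor as the reference, whereas the loop drops the last, so the resulting columns differ even though the feature count works out the same way.

```r
# Hypothetical toy data: one numeric and two categorical columns
toy <- data.frame(
  age     = c(30, 45, 52, 28),
  job     = factor(c("admin.", "technician", "admin.", "services")),
  contact = factor(c("cellular", "telephone", "cellular", "cellular"))
)

# Treatment-coded indicators for every factor; column 1 is the intercept, so drop it
encoded <- model.matrix(~ ., data = toy)[, -1, drop = FALSE]

print(colnames(encoded))
print(ncol(encoded))  # 1 numeric + (3 - 1) job levels + (2 - 1) contact levels
```

Either convention is fine for tree-based models; what matters is that the same encoding is applied to the training and test rows.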


In [3]:
# Split data
train_idx <- createDataPartition(bank_encoded$target, p = 0.8, list = FALSE)
X_train <- bank_encoded[train_idx, -which(names(bank_encoded) == "target")]
y_train <- bank_encoded$target[train_idx]  # Keep as factor
X_test <- bank_encoded[-train_idx, -which(names(bank_encoded) == "target")]
y_test <- bank_encoded$target[-train_idx]  # Keep as factor

cat("Train:", nrow(X_train), "| Test:", nrow(X_test), "\n")
cat("Features:", ncol(X_train), "\n")
cat("y_train class:", class(y_train), "\n")
cat("Class balance - Train:", table(y_train), "\n")
cat("Class balance - Test:", table(y_test), "\n")


Train: 32951 | Test: 8237 


Features: 52 


y_train class: factor 


Class balance - Train: 29239 3712 


Class balance - Test: 7309 928 
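
As a quick sanity check on the stratified split, the class counts printed above can be turned into proportions. This is only a sketch that re-enters the printed counts by hand; the variable names are illustrative.

```r
# Class counts copied from the output above
train_counts <- c(no = 29239, yes = 3712)
test_counts  <- c(no = 7309,  yes = 928)

train_prop <- prop.table(train_counts)
test_prop  <- prop.table(test_counts)

# Both rows should show nearly identical class shares if the split is stratified
print(round(rbind(train = train_prop, test = test_prop), 4))

# Share of all rows that ended up in the training set (requested p = 0.8)
train_share <- sum(train_counts) / (sum(train_counts) + sum(test_counts))
print(round(train_share, 4))
```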


In [4]:
# Quick baseline test
library(randomForest)
library(pROC)

# Standardize
preproc <- preProcess(X_train, method = c("center", "scale"))
X_train_scaled <- predict(preproc, X_train)
X_test_scaled <- predict(preproc, X_test)

# Train RF
cat("Training Random Forest (this may take ~60 seconds)...\n")
rf_test <- randomForest(x = X_train_scaled, y = y_train, ntree = 500)
rf_pred_prob <- predict(rf_test, X_test_scaled, type = "prob")[, "yes"]  # Probability of the positive class 'yes'
rf_pred_class <- predict(rf_test, X_test_scaled)  # Get class predictions
rf_acc <- mean(rf_pred_class == y_test)

cat("\n\nRandom Forest Results:\n")
cat("Accuracy:", round(rf_acc, 4), "\n")

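# AUC (pass the factor directly and state the class order and direction explicitly)
roc_obj <- roc(response = y_test, predictor = rf_pred_prob,
               levels = c("no", "yes"), direction = "<", quiet = TRUE)
cat("AUC-ROC:", round(auc(roc_obj), 4), "\n")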

if (rf_acc >= 0.85 && rf_acc <= 0.92) {
  cat("\n✓ This dataset has IDEAL baseline (85-92% range)\n")
  cat("✓ Room for stacking to improve!\n")
  cat("✓ Proceed to integrate into main project\n")
} else if (rf_acc < 0.85) {
  cat("\n⚠ Baseline is LOW (<85%). May need different preprocessing.\n")
} else {
  cat("\n⚠ Baseline is HIGH (>92%). Ceiling effect - limited room for improvement.\n")
}


randomForest 4.7-1.2



Type rfNews() to see new features/changes/bug fixes.




Attaching package: 'randomForest'




The following object is masked from 'package:dplyr':

    combine




The following object is masked from 'package:ggplot2':

    margin




Type 'citation("pROC")' for a citation.




Attaching package: 'pROC'




The following objects are masked from 'package:stats':

    cov, smooth, var




Training Random Forest (this may take ~60 seconds)...




Random Forest Results:


Accuracy: 0.9003 


AUC-ROC: 0.7981 



✓ This dataset has IDEAL baseline (85-92% range)
✓ Room for stacking to improve!
✓ Proceed to integrate into main project
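
Because the classes are imbalanced, the accuracy figure is easier to interpret next to the accuracy of a model that always predicts the majority class. The sketch below re-enters the test-set counts and the Random Forest accuracy from the outputs above. `class_metrics` is a hypothetical helper written for this illustration, and the toy vectors are made up.

```r
# Test-set class counts and Random Forest accuracy, copied from the outputs above
test_counts <- c(no = 7309, yes = 928)
rf_acc <- 0.9003

# Accuracy of a model that always predicts the majority class
no_info_rate <- max(test_counts) / sum(test_counts)
cat("No-information rate:", round(no_info_rate, 4), "\n")
cat("Accuracy gain over it:", round(rf_acc - no_info_rate, 4), "\n")

# Hypothetical helper: sensitivity, specificity and balanced accuracy
class_metrics <- function(pred, truth, positive = "yes") {
  tp <- sum(pred == positive & truth == positive)
  tn <- sum(pred != positive & truth != positive)
  fp <- sum(pred == positive & truth != positive)
  fn <- sum(pred != positive & truth == positive)
  sens <- tp / (tp + fn)
  spec <- tn / (tn + fp)
  c(sensitivity = sens, specificity = spec, balanced_accuracy = (sens + spec) / 2)
}

# Toy vectors (made up) to show the call
truth <- factor(c("no", "no", "no", "yes", "yes"), levels = c("no", "yes"))
pred  <- factor(c("no", "no", "yes", "yes", "no"), levels = c("no", "yes"))
toy_metrics <- class_metrics(pred, truth)
print(round(toy_metrics, 3))
```

In the notebook itself, the helper could be called as `class_metrics(rf_pred_class, y_test)` to see how much of the accuracy comes from the minority class.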


## Summary
- **Dataset**: Bank Marketing (real-world financial data)
- **Size**: 41,188 observations (large enough to train a meta-learner on out-of-fold predictions)
- **Features**: 52 after one-hot encoding (19 before), mixing numeric and categorical variables
- **Characteristics**:
  - Imbalanced (88.7% no, 11.3% yes) - realistic minority class problem
  - Mixed features requiring preprocessing
  - Real-world noise
  - Similar complexity to Ames Housing

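## Advantages over previously tested datasets:
1. **Much larger**: 41,188 obs vs. 351 for Ionosphere, which reduces the risk of meta-model overfitting
2. **Real-world data**: Drawn from real telemarketing campaigns, not a controlled measurement setting like Ionosphere
3. **Preprocessing required**: Like Ames (categorical encoding)
4. **Good baseline for improvement**: Random Forest reaches 90.0% accuracy and 0.798 AUC-ROC. Because always predicting "no" already scores 88.7% accuracy, AUC-ROC is the better yardstick for stacking gains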
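
The meta-model overfitting concern in the first point comes from how stacking trains its meta-learner: it is fitted on out-of-fold predictions from the base learners, so every training row contributes one prediction made by models that never saw it. The following is a minimal base-R sketch of that idea on simulated data. The two logistic-regression base learners, the variable names, and the simulated data are assumptions for illustration only, not the project's actual stacking pipeline.

```r
set.seed(42)

# Simulated binary-classification data (made up for illustration)
n  <- 400
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(1.5 * x1 - x2))
dat <- data.frame(x1, x2, y)

# Assign each row to one of k folds
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))

# Out-of-fold predictions from two deliberately simple base learners
oof <- matrix(NA_real_, nrow = n, ncol = 2, dimnames = list(NULL, c("m1", "m2")))
for (i in seq_len(k)) {
  tr <- dat[fold != i, ]
  te <- dat[fold == i, ]
  m1 <- glm(y ~ x1, data = tr, family = binomial)
  m2 <- glm(y ~ x2, data = tr, family = binomial)
  oof[fold == i, "m1"] <- predict(m1, newdata = te, type = "response")
  oof[fold == i, "m2"] <- predict(m2, newdata = te, type = "response")
}

# Meta-learner: logistic regression on the out-of-fold predictions
meta <- glm(y ~ m1 + m2, data = data.frame(oof, y = dat$y), family = binomial)
print(coef(meta))
```

With only a few hundred rows, each fold holds few minority-class examples and the meta-learner's coefficients become unstable, which is the practical reason a larger dataset helps here.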

## Next steps (baseline accuracy of 90.0% falls in the 85-92% range):
- ✓ Proceed to integrate into main project as Dataset 3
- ✓ Replace Ionosphere (which had only 351 obs and ceiling effect)
- ✓ Will complement Ames (real estate) + Pima (medical) with financial/commercial data
