# Test: Breast Cancer Wisconsin Dataset
Testing whether this dataset produces better stacking gains than Ionosphere.

In [1]:
# Setup
library(caret)
library(mlbench)
library(dplyr)
set.seed(42)

# Load Breast Cancer Wisconsin
data("BreastCancer")
bc <- BreastCancer

cat("Dataset Info:\n")
cat("Dimensions:", nrow(bc), "x", ncol(bc), "\n")
cat("Classes:\n")
print(table(bc$Class))
cat("\n")
str(bc)

Loading required package: ggplot2

Loading required package: lattice


Attaching package: 'dplyr'


The following objects are masked from 'package:stats':

    filter, lag


The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union




Dataset Info:
Dimensions: 699 x 11 
Classes:

   benign malignant 
      458       241 

'data.frame':	699 obs. of  11 variables:
 $ Id             : chr  "1000025" "1002945" "1015425" "1016277" ...
 $ Cl.thickness   : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 5 5 3 6 4 8 1 2 2 4 ...
 $ Cell.size      : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 1 1 2 ...
 $ Cell.shape     : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 2 1 1 ...
 $ Marg.adhesion  : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 5 1 1 3 8 1 1 1 1 ...
 $ Epith.c.size   : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 2 7 2 3 2 7 2 2 2 2 ...
 $ Bare.nuclei    : Factor w/ 10 levels "1","2","3","4",..: 1 10 2 4 1 10 10 1 1 1 ...
 $ Bl.cromatin    : Factor w/ 10 levels "1","2","3","4",..: 3 3 3 3 3 9 3 3 1 2 ...
 $ Normal.nucleoli: Factor w/ 10 levels "1","2","3","4",..: 1 2 1 7 1 7 1 1 1 1 ...
 $ Mitoses        : Factor w/ 9 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 5 1 ...
 $ Class          : F

In [2]:
# Preprocessing
# Remove Id column
bc <- bc %>% select(-Id)

# Handle missing values
cat("Missing values by column:\n")
print(colSums(is.na(bc)))

# Remove rows with NA (Bare.nuclei has missing values)
bc <- na.omit(bc)
cat("\nAfter removing NAs: ", nrow(bc), "observations\n")

# Convert feature columns from factor to numeric.
# Go through as.character() so the labels, not the internal level codes, are used.
for (col in names(bc)[names(bc) != "Class"]) {
  bc[[col]] <- as.numeric(as.character(bc[[col]]))
}

# Check distribution
cat("\nClass distribution:\n")
print(table(bc$Class))
cat("Proportions:", round(prop.table(table(bc$Class)) * 100, 1), "%\n")

Missing values by column:
   Cl.thickness       Cell.size      Cell.shape   Marg.adhesion    Epith.c.size 
              0               0               0               0               0 
    Bare.nuclei     Bl.cromatin Normal.nucleoli         Mitoses           Class 
             16               0               0               0               0 

After removing NAs:  683 observations

Class distribution:

   benign malignant 
      444       239 
Proportions: 65 35 %
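
The `as.character()` step in the conversion loop matters: calling `as.numeric()` directly on a factor returns its internal level codes, not its labels. `Mitoses` shows only 9 levels in the `str()` output, so its codes need not match its labels. A small base-R illustration, using a made-up factor `f` that is not taken from the dataset:

```r
# Hypothetical factor whose labels are digits stored as strings
f <- factor(c("1", "10", "8"), levels = c("1", "8", "10"))

codes  <- as.numeric(f)                # positions in levels(f): 1 3 2
values <- as.numeric(as.character(f))  # labels parsed as numbers: 1 10 8

print(codes)
print(values)
```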


In [3]:
# Split data
train_idx <- createDataPartition(bc$Class, p = 0.8, list = FALSE)
X_train <- bc[train_idx, -which(names(bc) == "Class")]
y_train <- bc$Class[train_idx]
X_test <- bc[-train_idx, -which(names(bc) == "Class")]
y_test <- bc$Class[-train_idx]

cat("Train:", nrow(X_train), "| Test:", nrow(X_test), "\n")
cat("Features:", ncol(X_train), "\n")

Train: 548 | Test: 135 
Features: 9 
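
`createDataPartition()` samples within each class, so the train and test sets should keep roughly the 65/35 class ratio seen above. A quick check, sketched on a fresh copy of the data (the names `bc_chk` and `idx` are illustrative and not part of the notebook):

```r
library(caret)
library(mlbench)

data("BreastCancer")
bc_chk <- na.omit(BreastCancer)  # same 683 complete rows as above

set.seed(42)
idx <- createDataPartition(bc_chk$Class, p = 0.8, list = FALSE)

# Class proportions in each partition
train_prop <- prop.table(table(bc_chk$Class[idx]))
test_prop  <- prop.table(table(bc_chk$Class[-idx]))

print(round(rbind(train = train_prop, test = test_prop), 3))
```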


In [4]:
# Quick baseline test with a single model
library(randomForest)

# Standardize using training-set statistics only
# (random forests do not need scaling, but it is harmless here)
preproc <- preProcess(X_train, method = c("center", "scale"))
X_train_scaled <- predict(preproc, X_train)
X_test_scaled <- predict(preproc, X_test)

# Train RF
rf_test <- randomForest(x = X_train_scaled, y = y_train, ntree = 500)
rf_pred <- predict(rf_test, X_test_scaled)
rf_acc <- mean(rf_pred == y_test)

cat("Random Forest Baseline Accuracy:", round(rf_acc, 4), "\n")
cat("This leaves room for stacking to improve!\n")

randomForest 4.7-1.2

Type rfNews() to see new features/changes/bug fixes.


Attaching package: 'randomForest'


The following object is masked from 'package:dplyr':

    combine


The following object is masked from 'package:ggplot2':

    margin




Random Forest Baseline Accuracy: 0.9778 
This leaves room for stacking to improve!
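
A single 135-row test split gives a coarse accuracy estimate. Converting the reported 0.9778 back to counts and attaching an exact binomial interval makes that visible. The count of correct predictions below is inferred by rounding 0.9778 × 135; it is not read from the notebook.

```r
n_test       <- 135     # test rows reported above
acc_reported <- 0.9778  # accuracy reported above

# Inferred counts (assumption: the accuracy was rounded to 4 decimals)
n_correct <- round(acc_reported * n_test)
n_errors  <- n_test - n_correct

# Exact (Clopper-Pearson) 95% interval for the accuracy
ci <- binom.test(n_correct, n_test)$conf.int

cat("Correct:", n_correct, "| Errors:", n_errors, "\n")
cat("95% CI for accuracy:", round(ci[1], 3), "to", round(ci[2], 3), "\n")
```

Each test row is worth about 0.74 percentage points (1/135), so a stacking gain of one or two rows on this split would be hard to tell apart from noise.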


## Decision
- **Observations**: 683 (after removing NAs) - Good size, between Ionosphere (351) and Pima (768)
- **Baseline RF Accuracy**: 97.78% on the 135-row test set (single 80/20 split)
- **Compatibility**: A baseline of 85-92% is ideal for stacking (room to improve, not at ceiling). The observed 97.78% is above that range, so little headroom remains.
- **Features**: 9 numeric columns → moderate dimensionality

**Recommendation**: Add this to the main project only if the baseline is in the 90-92% range. At 97.78% it is not, so expect small stacking gains at best.
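
To measure the stacking gain directly rather than infer it from the baseline, a minimal stacking sketch can be run on the same split. The main project's setup is not shown here, so everything below is an assumption for illustration: the two base learners (`rf` and `glm`), the 5 folds, the logistic meta-learner, and all object names.

```r
library(caret)
library(mlbench)
library(randomForest)

data("BreastCancer")
bc_s <- na.omit(BreastCancer[, -1])  # drop Id, keep complete rows
feat <- names(bc_s) != "Class"
bc_s[feat] <- lapply(bc_s[feat], function(f) as.numeric(as.character(f)))

set.seed(42)
idx      <- createDataPartition(bc_s$Class, p = 0.8, list = FALSE)
train_df <- bc_s[idx, ]
test_df  <- bc_s[-idx, ]

# Shared folds so both base learners produce out-of-fold predictions on the same rows
folds <- createFolds(train_df$Class, k = 5, returnTrain = TRUE)
ctrl  <- trainControl(method = "cv", index = folds,
                      classProbs = TRUE, savePredictions = "final")

fit_rf  <- train(Class ~ ., data = train_df, method = "rf",
                 trControl = ctrl, tuneGrid = data.frame(mtry = 3))
fit_glm <- train(Class ~ ., data = train_df, method = "glm", trControl = ctrl)

# Out-of-fold P(malignant), reordered to match the rows of train_df
oof <- function(fit) fit$pred[order(fit$pred$rowIndex), "malignant"]

meta_train <- data.frame(p_rf = oof(fit_rf), p_glm = oof(fit_glm),
                         Class = train_df$Class)
meta <- glm(Class ~ p_rf + p_glm, data = meta_train, family = binomial)

# Base-learner probabilities on the test set feed the meta-learner
meta_test <- data.frame(
  p_rf  = predict(fit_rf,  test_df, type = "prob")$malignant,
  p_glm = predict(fit_glm, test_df, type = "prob")$malignant
)
p_stack    <- predict(meta, meta_test, type = "response")
pred_stack <- factor(ifelse(p_stack > 0.5, "malignant", "benign"),
                     levels = levels(test_df$Class))
acc_stack  <- mean(pred_stack == test_df$Class)

cat("Stacked accuracy:", round(acc_stack, 4), "\n")
```

The meta-learner is trained only on out-of-fold predictions, which keeps it from seeing base-learner outputs for rows those learners were fitted on. Comparing `acc_stack` with the random forest baseline on the same test rows gives the gain for this split.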