# 🚀 SMART MODE ACTIVE

## 🔍 Automatic Data Detection

This notebook now **automatically detects** available datasets in your Kaggle environment!

### How it works:
1. **🔍 Auto-Discovery**: Scans the `../input/` directory for competition datasets
2. **📊 Smart Loading**: Automatically loads `train.csv` and `test.csv` from the first dataset found
3. **🎯 Column Detection**: Auto-detects target and ID columns using common patterns
4. **🌸 Fallback Mode**: Uses iris demo data if no competition data is found

### Manual Override (Optional):
If auto-detection picks the wrong columns, you can set them manually:

```r
# In cell 3, right after the auto-detection loops and before the
# "Prepare training data" step, override if needed:
TARGET_COL <- "your_actual_target_column"
ID_COL <- "your_actual_id_column"
```

### Supported Patterns:
- **Target columns**: `"target"`, `"label"`, `"y"`, `"survived"`, `"sale_price"`, etc.
- **ID columns**: `"id"`, `"Id"`, `"ID"`, `"PassengerId"`, `"customer_id"`, etc.
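
The pattern matching described above can be sketched in a few lines of base R. This is a minimal illustration, not the notebook's exact code: `detect_column()` is a hypothetical helper, and the column names are made up for the demo.

```r
# Hypothetical helper: return the first candidate name found in `cols`,
# or `fallback` when none of the candidates is present.
detect_column <- function(cols, candidates, fallback) {
  hits <- candidates[candidates %in% cols]
  if (length(hits) > 0) hits[1] else fallback
}

# Made-up column names for illustration
cols <- c("PassengerId", "Pclass", "Age", "survived")

target_col <- detect_column(cols, c("target", "label", "y", "survived"),
                            fallback = cols[length(cols)])
id_col <- detect_column(cols, c("id", "Id", "ID", "PassengerId"),
                        fallback = cols[1])

print(target_col)
print(id_col)
```

Matching with `%in%` is exact and case-sensitive, so a column named `Survived` would not match `"survived"` and would fall through to the last-column guess. In that situation the manual override is the safer route.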

---

# Taleji R Suite: Complete Tidymodels Classification Workflow
## 🚀 PRODUCTION MODE - Ready for Kaggle Competition

This notebook demonstrates a comprehensive Random Forest classification pipeline using the **tidymodels** ecosystem. The workflow includes:

- 🔄 **Kaggle data loading** (production mode active)
- 🎯 **Stratified train/validation splits**
- ⚙️ **Preprocessing pipeline** with imputation, encoding, and normalization
- 🔍 **Hyperparameter tuning** with cross-validation
- 📊 **Model evaluation** with multiple metrics
- 🏆 **Feature importance analysis**
- 📝 **Competition submission file generation**

## 🏆 Production Setup

✅ **Kaggle data loading**: ACTIVE  
✅ **Competition submission**: ACTIVE  
✅ **Hyperparameter tuning**: ACTIVE  
✅ **Feature importance**: ACTIVE

**Next**: Run all cells in order. If auto-detection picks the wrong target or ID column, set `TARGET_COL` and `ID_COL` manually as described in the setup notes above.

---

# 📋 **Cell Execution Order**

⚠️ **Important**: Run cells in order for proper functionality!

| Cell | Description | Creates |
|------|-------------|---------|
| **1** | Setup Instructions | - |
| **2** | Title & Overview | - |
| **3** | Data Loading & Detection | `your_train_data_frame`, `test_data_processed` |
| **4** | Data Split & Model Training | `train_data`, `val_data`, `your_recipe`, `final_model_fit` |
| **5** | Hyperparameter Tuning | `final_tuned_fit`, `tuned_predictions` |
| **6** | Feature Importance | `feature_importance`, plots |
| **7** | Test Predictions & Submission | `test_predictions`, CSV files |
| **8** | Advanced Techniques Guide | - |

💡 **Tip**: If you get "object not found" errors, re-run the earlier cells that create those objects.

---
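
The "object not found" tip above can be turned into an explicit guard at the top of a cell. The sketch below assumes a hypothetical helper named `require_objects()`; the notebook's own cells use inline `exists()` checks instead.

```r
# Hypothetical helper: stop with one message that lists every missing object.
require_objects <- function(object_names, envir = globalenv()) {
  is_present <- vapply(object_names, exists, logical(1),
                       envir = envir, inherits = FALSE)
  missing_names <- object_names[!is_present]
  if (length(missing_names) > 0) {
    stop("Missing objects: ", paste(missing_names, collapse = ", "),
         ". Re-run the cells that create them.", call. = FALSE)
  }
  invisible(TRUE)
}

# Demo in a scratch environment so the global workspace is untouched
demo_env <- new.env()
assign("train_data", data.frame(x = 1:3), envir = demo_env)

require_objects("train_data", envir = demo_env)  # passes silently

result <- tryCatch(
  require_objects(c("train_data", "val_data"), envir = demo_env),
  error = function(e) conditionMessage(e)
)
print(result)
```

Reporting all missing names at once avoids the cycle of fixing one error only to hit the next one on the following run.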

In [None]:
# ==============================================================================
# QUICK START: Run All Cells Button Alternative
# ==============================================================================
# If you want to run the entire workflow at once, uncomment and run this cell

# RUN_ALL_WORKFLOW <- TRUE
# 
# if (exists("RUN_ALL_WORKFLOW") && RUN_ALL_WORKFLOW) {
#   cat("🚀 Running complete workflow...\n\n")
#   
#   # This would execute the entire pipeline programmatically
#   # Uncomment the next line to enable:
#   # source("complete_workflow.R")  # If you save the workflow as a script
#   
#   cat("✅ Workflow completed! Check the objects in your environment.\n")
# } else {
#   cat("📝 Quick start disabled. Run cells individually or uncomment RUN_ALL_WORKFLOW above.\n")
# }

In [None]:
# ==============================================================================
# SMART DATA LOADING (Auto-Detection + Fallback)
# ==============================================================================
# This section automatically detects available data paths or falls back to demo data

library(readr)
library(dplyr)  # Provides %>%, mutate(), rename(), and sym(), all used below

# Function to find available competition datasets
find_competition_data <- function() {
  # Check if we're in Kaggle environment
  if (dir.exists("../input/")) {
    # List all available datasets in input directory
    datasets <- list.dirs("../input/", recursive = FALSE, full.names = FALSE)
    cat("📁 Available datasets in ../input/:\n")
    for (i in seq_along(datasets)) {
      cat(sprintf("   %d. %s\n", i, datasets[i]))
      # Check for common file patterns
      dataset_path <- paste0("../input/", datasets[i])
      files <- list.files(dataset_path, pattern = "\\.(csv|txt)$", ignore.case = TRUE)
      if (length(files) > 0) {
        cat(sprintf("      Files: %s\n", paste(head(files, 3), collapse = ", ")))
      }
    }
    return(datasets)
  } else {
    cat("🏠 Not in Kaggle environment (../input/ not found)\n")
    return(NULL)
  }
}

# Auto-detect and load data
datasets <- find_competition_data()

# Try to load competition data automatically
if (!is.null(datasets) && length(datasets) > 0) {
  # Use the first dataset found (you can modify this logic)
  competition_name <- datasets[1]
  train_path <- paste0("../input/", competition_name, "/train.csv")
  test_path <- paste0("../input/", competition_name, "/test.csv")
  
  cat(sprintf("🔍 Attempting to load: %s\n", competition_name))
  cat(sprintf("   Train: %s\n", train_path))
  cat(sprintf("   Test: %s\n", test_path))
  
  # Try to load the files
  if (file.exists(train_path) && file.exists(test_path)) {
    train_data_raw <- read_csv(train_path, show_col_types = FALSE)
    test_data_raw <- read_csv(test_path, show_col_types = FALSE)
    
    cat("✅ Successfully loaded competition data!\n")
    cat(sprintf("   Train: %d rows × %d columns\n", nrow(train_data_raw), ncol(train_data_raw)))
    cat(sprintf("   Test: %d rows × %d columns\n", nrow(test_data_raw), ncol(test_data_raw)))
    cat("   Columns:", paste(head(names(train_data_raw), 5), collapse = ", "), "\n")
    
    # Auto-detect target and ID columns (common patterns)
    possible_targets <- c("target", "label", "y", "survived", "sale_price", "price")
    possible_ids <- c("id", "Id", "ID", "PassengerId", "customer_id", "row_id")
    
    TARGET_COL <- NULL
    ID_COL <- NULL
    
    # Find target column
    for (col in possible_targets) {
      if (col %in% names(train_data_raw)) {
        TARGET_COL <- col
        break
      }
    }
    
    # Find ID column  
    for (col in possible_ids) {
      if (col %in% names(train_data_raw)) {
        ID_COL <- col
        break
      }
    }
    
    # If not found, make educated guesses
    if (is.null(TARGET_COL)) {
      # Usually the last column or contains specific keywords
      last_col <- names(train_data_raw)[ncol(train_data_raw)]
      TARGET_COL <- last_col
      cat("⚠️  Target column not auto-detected. Using last column:", TARGET_COL, "\n")
    } else {
      cat("🎯 Auto-detected target column:", TARGET_COL, "\n")
    }
    
    if (is.null(ID_COL)) {
      # Usually the first column
      first_col <- names(train_data_raw)[1]
      ID_COL <- first_col
      cat("⚠️  ID column not auto-detected. Using first column:", ID_COL, "\n")
    } else {
      cat("🆔 Auto-detected ID column:", ID_COL, "\n")
    }
    
    # Prepare training data
    your_train_data_frame <- train_data_raw %>%
      mutate(
        # Convert target to factor (handle both numeric and character)
        !!sym(TARGET_COL) := factor(!!sym(TARGET_COL))
      ) %>%
      rename(target_variable = !!sym(TARGET_COL))
    
    # Store test data
    test_data_processed <- test_data_raw
    
    KAGGLE_MODE <- TRUE
    
  } else {
    cat("❌ Competition files not found, falling back to demo data\n")
    KAGGLE_MODE <- FALSE
  }
} else {
  cat("📝 No datasets found or not in Kaggle environment, using demo data\n")
  KAGGLE_MODE <- FALSE
}
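
The loader above takes the first directory under `../input/`, which may be an auxiliary dataset rather than the competition itself. One possible refinement, sketched here with a hypothetical helper `find_dataset_with_files()` and a temporary directory standing in for `../input/`, is to pick the first directory that actually contains both files.

```r
# Hypothetical helper: return the first sub-directory of `input_dir` that
# contains every file in `required`, or NULL if none does.
find_dataset_with_files <- function(input_dir,
                                    required = c("train.csv", "test.csv")) {
  for (dataset_dir in list.dirs(input_dir, recursive = FALSE, full.names = TRUE)) {
    if (all(file.exists(file.path(dataset_dir, required)))) {
      return(dataset_dir)
    }
  }
  NULL
}

# Demo on a temporary directory instead of ../input/ (directory names are made up)
input_dir <- file.path(tempdir(), "input_demo")
dir.create(file.path(input_dir, "a-helper-dataset"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(input_dir, "b-competition"), recursive = TRUE, showWarnings = FALSE)
writeLines("id,x", file.path(input_dir, "a-helper-dataset", "lookup.csv"))
writeLines("id,x,target", file.path(input_dir, "b-competition", "train.csv"))
writeLines("id,x", file.path(input_dir, "b-competition", "test.csv"))

chosen <- find_dataset_with_files(input_dir)
print(basename(chosen))
```

In the demo the alphabetically first directory holds only a lookup file, so the helper skips it and returns the directory with `train.csv` and `test.csv`.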

In [None]:
# 1. Load Essential Libraries
# tidymodels is a meta-package that loads rsample, recipes, parsnip, tune, etc.
library(tidymodels)
library(ranger) # Engine for a fast Random Forest implementation

# Set a seed for reproducibility
set.seed(42)

# ===============================================================================
# Data: Smart Mode - Competition Data or Demo Fallback
# ===============================================================================
# Use competition data if loaded successfully, otherwise fall back to demo data

if (!exists("KAGGLE_MODE") || !KAGGLE_MODE || !exists("your_train_data_frame")) {
  # FALLBACK: Create a binary classification example from iris
  cat("🌸 Using iris demo data (fallback mode)\n")
  data(iris)
  df <- iris
  # Convert Species to a binary target: setosa vs other
  df$target_variable <- ifelse(df$Species == "setosa", "setosa", "other")
  df$target_variable <- factor(df$target_variable, levels = c("other", "setosa"))
  # Remove original Species column (so recipe uses numeric predictors only)
  df$Species <- NULL
  your_train_data_frame <- df
  rm(df)
  
  # Create demo test data (remove some rows from training)
  set.seed(999)
  demo_indices <- sample(nrow(your_train_data_frame), 20)
  test_data_processed <- your_train_data_frame[demo_indices, ] %>% select(-target_variable)
  your_train_data_frame <- your_train_data_frame[-demo_indices, ]
  
  TARGET_COL <- "target_variable"
  ID_COL <- "row_id"
  
  message("📊 Demo mode active: Using iris dataset with 130 training samples and 20 test samples")
} else {
  message("🏆 Competition mode active: Using loaded Kaggle competition data")
}

# ==============================================================================
# 2. Data Split (Training and Validation)
# ==============================================================================
# Create a stratified split (important for classification to keep target ratios)
# Use 80% for training and 20% for local validation
# The `strata` argument expects a column name (unquoted) that exists in the data.

# Validate that the target exists and is a factor
if (!"target_variable" %in% names(your_train_data_frame)) {
  stop("'your_train_data_frame' must contain a column named 'target_variable'.")
}
if (!is.factor(your_train_data_frame$target_variable)) {
  your_train_data_frame$target_variable <- factor(your_train_data_frame$target_variable)
  message("Coerced 'target_variable' to a factor.")
}

data_split <- initial_split(
  data = your_train_data_frame,
  prop = 0.80,
  strata = target_variable
)

# Extract the training and validation (test) sets
train_data <- training(data_split)
val_data <- testing(data_split)

# ==============================================================================
# 3. Define Preprocessing/Feature Engineering (Recipe)
# ==============================================================================
# Create a recipe to define your preprocessing steps
# The formula uses target_variable as the outcome. All other columns are predictors.

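# Start from the formula, then give the ID column (if present) a non-predictor
# role so it is carried along but never used for modeling.
base_recipe <- recipe(target_variable ~ ., data = train_data)
if (exists("ID_COL") && ID_COL %in% names(train_data)) {
  base_recipe <- base_recipe %>% update_role(all_of(ID_COL), new_role = "id")
}

your_recipe <-
  base_recipe %>%
  # Impute missing numeric data with the mean
  step_impute_mean(all_numeric_predictors()) %>%
  # Dummy-encode all nominal (factor/character) predictors
  # (set one_hot = TRUE for a full one-hot encoding)
  step_dummy(all_nominal_predictors()) %>%
  # Remove variables that are all zero or near zero variance
  step_nzv(all_predictors()) %>%
  # Normalize (center and scale) all numeric data
  step_normalize(all_numeric_predictors())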

# You can inspect the recipe with summary(your_recipe)

# ==============================================================================
# 4. Define the Model (fixed hyperparameters so we can fit)
# ==============================================================================
# To avoid errors from tune() placeholders, we compute a sensible default for mtry
# based on the number of predictors in the training set and set min_n to a default.

num_predictors <- ncol(select(train_data, -target_variable))
mtry_val <- max(1, floor(sqrt(num_predictors)))

rf_model <-
  rand_forest(
    mode = "classification",
    mtry = mtry_val,
    trees = 1000,
    min_n = 5
  ) %>%
  set_engine("ranger", importance = "impurity", seed = 42)

# ==============================================================================
# 5. Create the Workflow and Train the Model
# ==============================================================================
# Bundle the recipe and the model together
rf_workflow <- workflow() %>%
  add_recipe(your_recipe) %>%
  add_model(rf_model)

# Fit the workflow to the training data
final_model_fit <- rf_workflow %>%
  fit(data = train_data)

# ==============================================================================
# 6. Prediction and Evaluation (on Validation Set)
# ==============================================================================
# Make predictions on the local validation data. We ask for class probabilities.
val_predictions <-
  final_model_fit %>%
  predict(new_data = val_data, type = "prob") %>% # Get probabilities
  bind_cols(final_model_fit %>% predict(new_data = val_data, type = "class")) %>%
  bind_cols(val_data %>% select(target_variable))

# The probability column will be named `.pred_<level>`; for the example we created
# this will be `.pred_setosa`. Replace `.pred_setosa` below with the name of the
# positive-class probability in your run if you changed class names.
prob_col <- grep("^\\.pred_", names(val_predictions), value = TRUE)
prob_col

# Show a quick head of predictions
print(head(val_predictions))

# Metrics: accuracy and ROC AUC (binary only)
# For ROC AUC we explicitly set event_level = "second" because the positive class
# in this notebook's example is the second level of the factor ("setosa").
# The object is named `eval_metrics` so it does not mask yardstick's metric_set(),
# which the tuning cell calls again later.
eval_metrics <- metric_set(accuracy, roc_auc)

# Identify the positive class probability column (e.g. .pred_setosa).
# It must belong to the second factor level to agree with event_level = "second".
pos_prob_name <- paste0(".pred_", levels(val_predictions$target_variable)[2])

# Compute accuracy (uses the predicted class column .pred_class)
acc <- accuracy(val_predictions, truth = target_variable, estimate = .pred_class)
print(acc)

# Compute ROC AUC only if we have a binary problem
if (nlevels(your_train_data_frame$target_variable) == 2) {
  # Use `!!sym(pos_prob_name)` to pass the probability column to roc_auc
  roc_res <- roc_auc(val_predictions, truth = target_variable, !!rlang::sym(pos_prob_name), event_level = "second")
  print(roc_res)
} else {
  message("ROC AUC skipped: target has more than 2 levels. For multiclass, pass all class-probability columns to `roc_auc()` or use other multiclass metrics.")
}

# Confusion matrix
conf_mat_res <- conf_mat(val_predictions, truth = target_variable, estimate = .pred_class)
print(conf_mat_res)

# ==============================================================================
# Notes:
# - Replace 'your_train_data_frame' in your environment with your real dataset.
# - Ensure the dataset contains a factor column named 'target_variable'.
# - If you want to tune hyperparameters (mtry, min_n) use `tune_grid()` and resampling,
#   but remove `tune()` placeholders before fitting directly.
# - For multiclass problems, change evaluation metrics accordingly.
# ==============================================================================

ERROR: Error in eval(expr, envir, enclos): object 'your_train_data_frame' not found
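
The evaluation step in the cell above depends on the probability column and the event level pointing at the same class. The base-R sketch below is a stand-in for `roc_auc()`, using the rank-based (Mann-Whitney) form of AUC on made-up scores, to show what happens when the two disagree. The function `auc_rank()` and all values are illustrative assumptions.

```r
# Rank-based AUC: the probability that a randomly chosen positive case
# receives a higher score than a randomly chosen negative case.
auc_rank <- function(score, is_positive) {
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(rank(score)[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up validation labels and predicted probabilities
truth <- factor(c("setosa", "setosa", "setosa", "other", "other", "other"),
                levels = c("other", "setosa"))
pred_setosa <- c(0.90, 0.80, 0.35, 0.60, 0.20, 0.10)
pred_other  <- 1 - pred_setosa

# Treat the second factor level as the positive class
is_positive <- truth == levels(truth)[2]

auc_matching   <- auc_rank(pred_setosa, is_positive)  # score belongs to the positive class
auc_mismatched <- auc_rank(pred_other, is_positive)   # score belongs to the other class

print(c(matching = auc_matching, mismatched = auc_mismatched))
```

The two values sum to 1, so pairing the wrong probability column with the positive class reports `1 - AUC`. A good model then looks worse than random, which is why the column is best chosen by class name rather than by position.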


In [None]:
# ==============================================================================
# 7. HYPERPARAMETER TUNING (Production-Ready)
# ==============================================================================
# The previous section used fixed hyperparameters for a quick demo.
# This section implements proper cross-validation tuning for optimal performance.

# Check dependencies from previous cells
if (!exists("train_data") || !exists("val_data") || !exists("your_recipe")) {
  stop("❌ Missing required objects! Please run the previous cells first:\n",
       "   - Cell 4: Creates train_data, val_data, and your_recipe\n",
       "   - Make sure all previous cells completed successfully")
}

cat("✅ Dependencies check passed - proceeding with hyperparameter tuning\n")

library(tune)
library(dials)

# 7.1 Create Cross-Validation Folds
set.seed(123)
cv_folds <- vfold_cv(
  data = train_data, 
  v = 10,                    # 10-fold cross-validation
  strata = target_variable   # Maintain class balance across folds
)

print(paste("Created", nrow(cv_folds), "cross-validation folds"))

# 7.2 Define Tunable Model Specification
rf_tuned_spec <- 
  rand_forest(
    mode = "classification",
    mtry = tune(),           # Number of variables at each split
    trees = 1000,            # Keep trees fixed (1000 is usually sufficient)
    min_n = tune()           # Minimum samples a node must hold to be split further
  ) %>%
  set_engine("ranger", 
             importance = "impurity",
             seed = 42)

# 7.3 Create Tuning Workflow
rf_tuned_workflow <- workflow() %>%
  add_recipe(your_recipe) %>%
  add_model(rf_tuned_spec)

# 7.4 Define Hyperparameter Grid
# Create a reasonable search space
num_features <- ncol(select(train_data, -target_variable))

tuning_grid <- grid_regular(
  mtry(range = c(min(2, num_features), min(10, num_features))),  # 2 to 10 features (or max available)
  min_n(range = c(2, 20)),                    # 2 to 20 minimum samples per node
  levels = 5                                  # Up to 5x5 = 25 combinations (fewer if mtry spans < 5 integers)
)

print(paste("Created tuning grid with", nrow(tuning_grid), "parameter combinations"))
head(tuning_grid)

# 7.5 Execute Hyperparameter Tuning
print("Starting hyperparameter tuning... This may take a few minutes.")

tuning_results <- 
  rf_tuned_workflow %>%
  tune_grid(
    resamples = cv_folds,
    grid = tuning_grid,
    metrics = metric_set(roc_auc, accuracy, sens, spec),
    control = control_grid(save_pred = TRUE, verbose = TRUE)
  )

print("Hyperparameter tuning completed!")

# 7.6 Examine Tuning Results
collect_metrics(tuning_results) %>%
  filter(.metric == "roc_auc") %>%
  arrange(desc(mean)) %>%
  head(10)

# 7.7 Select Best Parameters and Finalize Workflow
best_params <- select_best(tuning_results, metric = "roc_auc")
print("Best hyperparameters:")
print(best_params)

# Finalize the workflow with best parameters
final_tuned_workflow <- finalize_workflow(rf_tuned_workflow, best_params)

# 7.8 Train Final Model on Full Training Set
print("Training final model with optimized hyperparameters...")
final_tuned_fit <- final_tuned_workflow %>%
  fit(data = train_data)

print("Final model training completed!")

# 7.9 Compare: Tuned vs Untuned Performance on Validation Set
tuned_predictions <- 
  final_tuned_fit %>%
  predict(new_data = val_data, type = "prob") %>%
  bind_cols(final_tuned_fit %>% predict(new_data = val_data, type = "class")) %>%
  bind_cols(val_data %>% select(target_variable))

# Compute metrics for comparison
untuned_roc <- roc_auc(val_predictions, truth = target_variable, 
                       !!rlang::sym(names(val_predictions)[1]))
tuned_roc <- roc_auc(tuned_predictions, truth = target_variable, 
                     !!rlang::sym(names(tuned_predictions)[1]))

untuned_acc <- accuracy(val_predictions, truth = target_variable, estimate = .pred_class)
tuned_acc <- accuracy(tuned_predictions, truth = target_variable, estimate = .pred_class)

cat("\n=== PERFORMANCE COMPARISON ===\n")
cat("Untuned Model:\n")
cat("  ROC AUC:", round(untuned_roc$.estimate, 4), "\n")
cat("  Accuracy:", round(untuned_acc$.estimate, 4), "\n")
cat("Tuned Model:\n") 
cat("  ROC AUC:", round(tuned_roc$.estimate, 4), "\n")
cat("  Accuracy:", round(tuned_acc$.estimate, 4), "\n")
cat("Improvement:\n")
cat("  ROC AUC:", sprintf("%+.4f", tuned_roc$.estimate - untuned_roc$.estimate), "\n")
cat("  Accuracy:", sprintf("%+.4f", tuned_acc$.estimate - untuned_acc$.estimate), "\n")
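
The `strata` argument used for the folds above keeps class proportions similar in every fold. The base-R sketch below illustrates the idea only; `stratified_fold_ids()` is a hypothetical helper and is not how `vfold_cv()` is implemented.

```r
# Hypothetical helper: assign each row a fold id in 1..v so that every class
# is spread as evenly as possible across the folds.
stratified_fold_ids <- function(y, v = 5) {
  fold_id <- integer(length(y))
  for (class_rows in split(seq_along(y), y)) {
    # Shuffle the rows of this class, then deal them out to folds in turn
    shuffled <- class_rows[sample.int(length(class_rows))]
    fold_id[shuffled] <- rep_len(seq_len(v), length(shuffled))
  }
  fold_id
}

set.seed(123)
y <- factor(rep(c("other", "setosa"), times = c(80, 40)))
fold_id <- stratified_fold_ids(y, v = 10)

# Rows are folds, columns are classes: each fold holds 8 "other" and 4 "setosa" rows
fold_table <- table(fold_id, y)
print(fold_table)
```

Without stratification, a small or imbalanced dataset can produce folds with very few positive cases, which makes per-fold ROC AUC noisy or undefined.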

In [None]:
# ==============================================================================
# 8. FEATURE IMPORTANCE ANALYSIS
# ==============================================================================
# Extract and visualize feature importance from the trained Random Forest model

# Check dependencies from previous cells
if (!exists("final_tuned_fit")) {
  stop("❌ Missing 'final_tuned_fit' object! Please run cell 5 (Hyperparameter Tuning) first.\n",
       "   This cell creates the tuned model needed for feature importance analysis.")
}

cat("✅ Tuned model found - proceeding with feature importance analysis\n")

library(vip)      # For variable importance plots
library(ggplot2)  # For enhanced plotting

# 8.1 Extract Feature Importance
# The ranger engine calculates importance when importance = "impurity" is set
feature_importance <- final_tuned_fit %>%
  extract_fit_parsnip() %>%
  vi()

print("Top 10 Most Important Features:")
print(head(feature_importance, 10))

# 8.2 Create Feature Importance Visualization
importance_plot <- feature_importance %>%
  slice_head(n = min(15, nrow(feature_importance))) %>%  # Top 15 or all if fewer
  mutate(Variable = reorder(Variable, Importance)) %>%
  ggplot(aes(x = Importance, y = Variable)) +
  geom_col(fill = "steelblue", alpha = 0.8) +
  geom_text(aes(label = round(Importance, 2)), 
            hjust = -0.1, size = 3) +
  labs(
    title = "Random Forest Feature Importance",
    subtitle = "Top predictive features (Gini impurity reduction)",
    x = "Importance Score",
    y = "Features"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 12),
    axis.text = element_text(size = 10)
  )

print(importance_plot)

# 8.3 Feature Importance Summary Statistics
cat("\n=== FEATURE IMPORTANCE SUMMARY ===\n")
cat("Total features:", nrow(feature_importance), "\n")
cat("Top feature:", feature_importance$Variable[1], 
    "(Importance:", round(feature_importance$Importance[1], 2), ")\n")
cat("Mean importance:", round(mean(feature_importance$Importance), 2), "\n")
cat("Features with >50% of max importance:", 
    sum(feature_importance$Importance > 0.5 * max(feature_importance$Importance)), "\n")

# 8.4 Alternative: Use vip package for cleaner visualization
vip_plot <- final_tuned_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = min(15, nrow(feature_importance)),
      geom = "col",
      aesthetics = list(fill = "darkorange", alpha = 0.8)) +
  labs(
    title = "Variable Importance Plot",
    subtitle = "Alternative visualization using vip package"
  ) +
  theme_minimal()

print(vip_plot)
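
Impurity importance, as used in the cell above, comes from the training process itself. A common cross-check is permutation importance, which `ranger` also offers through `importance = "permutation"`. The sketch below shows the idea in base R with a logistic regression on `mtcars` as a stand-in model, so it is a different technique and a different model from the notebook's tuned forest.

```r
set.seed(42)

# Stand-in binary problem: predict transmission type from weight and horsepower
df <- mtcars[, c("am", "wt", "hp")]
fit <- glm(am ~ wt + hp, data = df, family = binomial)

# Accuracy of the fitted model on a given data frame
accuracy_of <- function(model, data) {
  predicted <- as.integer(predict(model, newdata = data, type = "response") > 0.5)
  mean(predicted == data$am)
}

baseline <- accuracy_of(fit, df)

# For each feature: shuffle it, re-score, and average the drop in accuracy
permutation_importance <- sapply(c("wt", "hp"), function(feature) {
  drops <- replicate(50, {
    shuffled <- df
    shuffled[[feature]] <- sample(shuffled[[feature]])
    baseline - accuracy_of(fit, shuffled)
  })
  mean(drops)
})

print(baseline)
print(permutation_importance)
```

A larger drop means the model relies more on that feature. In practice the shuffling would be done on held-out data rather than the training rows used here for brevity.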

In [None]:
# ==============================================================================
# 9. TEST SET PREDICTIONS & KAGGLE SUBMISSION
# ==============================================================================
# Generate predictions for test set and create submission file

# Check dependencies from previous cells
if (!exists("final_tuned_fit") || !exists("test_data_processed")) {
  stop("❌ Missing required objects! Please run previous cells first:\n",
       "   - Cell 3: Creates test_data_processed\n",
       "   - Cell 5: Creates final_tuned_fit (tuned model)\n",
       "   Make sure all previous cells completed successfully")
}

cat("✅ All dependencies found - proceeding with test predictions\n")

# 9.1 Load and Prepare Test Data (Kaggle Competition Mode - ACTIVE)
# Production mode: Using actual competition test data

test_data_processed <- test_data_processed %>%
  # Apply the same preprocessing as training data (outside of recipe)
  # Add any custom feature engineering here that matches training data
  mutate(
    # Example transformations (match your training data preprocessing)
    # Add any feature engineering that was applied to training data
    # new_feature = some_transformation(existing_feature)
  )

# 9.2 DEVELOPMENT MODE (commented out for production)
# demo_test_data <- val_data %>% 
#   select(-target_variable)
# cat("Demo test set created with", nrow(demo_test_data), "samples and", 
#     ncol(demo_test_data), "features\n")

# Production: Use actual test data
demo_test_data <- test_data_processed
cat("Production test set loaded with", nrow(demo_test_data), "samples and", 
    ncol(demo_test_data), "features\n")

# 9.3 Generate Test Predictions
print("Generating test set predictions...")

test_predictions <- final_tuned_fit %>%
  predict(new_data = demo_test_data, type = "prob") %>%
  bind_cols(final_tuned_fit %>% predict(new_data = demo_test_data, type = "class"))

# Add row IDs (in real Kaggle competition, use the actual ID column)
test_predictions <- test_predictions %>%
  mutate(id = row_number()) %>%
  select(id, everything())

print("Test predictions generated successfully!")
head(test_predictions)

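# 9.4 Create Kaggle Submission File (Production Mode)
# Use actual ID column from test data and appropriate prediction format

# Extract actual IDs from test data (use the ID_COL defined earlier).
# In demo mode the test data has no ID column, so fall back to row numbers.
if (ID_COL %in% names(test_data_processed)) {
  actual_ids <- test_data_processed[[ID_COL]]
} else {
  actual_ids <- seq_len(nrow(test_predictions))
}

# Probability column of the positive class (second factor level), e.g. .pred_setosa
pos_prob_col <- paste0(".pred_", levels(train_data$target_variable)[2])

# For binary classification, typically submit probabilities of positive class
submission_data <- test_predictions %>%
  mutate(!!sym(ID_COL) := actual_ids) %>%
  select(
    !!sym(ID_COL),                                # Use actual ID column name
    prediction = all_of(pos_prob_col)             # Select by name, not by position
  )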

# Alternative: if competition wants class predictions instead of probabilities
submission_classes <- test_predictions %>%
  mutate(!!sym(ID_COL) := actual_ids) %>%
  select(
    !!sym(ID_COL),
    prediction = .pred_class
  ) %>%
  mutate(
    # Convert factor to numeric if needed (0/1 instead of factor levels)
    prediction = as.numeric(prediction) - 1
  )

# 9.5 Write Submission Files
write.csv(submission_data, "submission_probabilities.csv", row.names = FALSE)
write.csv(submission_classes, "submission_classes.csv", row.names = FALSE)

cat("\n=== SUBMISSION FILES CREATED ===\n")
cat("📁 submission_probabilities.csv - Probability predictions\n")
cat("📁 submission_classes.csv - Class predictions (0/1)\n")
cat("Choose the appropriate file based on competition requirements.\n")

# 9.6 Submission File Preview
cat("\nSubmission file preview (probabilities):\n")
print(head(submission_data))

cat("\nSubmission file preview (classes):\n") 
print(head(submission_classes))

# 9.7 Final Model Summary for Documentation
cat("\n=== FINAL MODEL SUMMARY ===\n")
cat("Model Type: Random Forest (ranger engine)\n")
cat("Tuned Parameters:\n")
cat("  - mtry:", best_params$mtry, "\n")
cat("  - min_n:", best_params$min_n, "\n")
cat("  - trees: 1000 (fixed)\n")
cat("Validation-Set Performance (ROC AUC):", 
    round(tuned_roc$.estimate, 4), "\n")
cat("Features used:", nrow(feature_importance), "\n")
cat("Training samples:", nrow(train_data), "\n")
cat("Validation samples:", nrow(val_data), "\n")
cat("Test predictions:", nrow(test_predictions), "\n")
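
Before uploading, it can help to run a few structural checks on the submission table. The sketch below assumes a hypothetical helper `validate_submission()` and a made-up five-row submission; the real column names depend on the competition.

```r
# Hypothetical helper: basic sanity checks on a submission data frame.
# Each named condition becomes the error message if it fails (R >= 4.0).
validate_submission <- function(submission, expected_ids, id_col,
                                pred_col = "prediction") {
  stopifnot(
    "unexpected column names" = identical(names(submission), c(id_col, pred_col)),
    "row count differs from test set" = nrow(submission) == length(expected_ids),
    "IDs differ from test set" = setequal(submission[[id_col]], expected_ids),
    "duplicate IDs" = anyDuplicated(submission[[id_col]]) == 0,
    "missing predictions" = !anyNA(submission[[pred_col]])
  )
  invisible(TRUE)
}

# Made-up submission for illustration
submission <- data.frame(id = 101:105,
                         prediction = c(0.1, 0.8, 0.4, 0.9, 0.3))
validate_submission(submission, expected_ids = 101:105, id_col = "id")

# A submission with a missing prediction is rejected
bad_submission <- submission
bad_submission$prediction[2] <- NA
check <- tryCatch(
  validate_submission(bad_submission, expected_ids = 101:105, id_col = "id"),
  error = function(e) conditionMessage(e)
)
print(check)
```

These checks cover structure only. They cannot tell whether the competition expects probabilities or class labels, so that choice still has to follow the competition rules.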

---

## 🎯 Next Steps & Advanced Techniques

### For Higher Kaggle Scores:
1. **Ensemble Methods**: Combine Random Forest with XGBoost, LightGBM
2. **Advanced Feature Engineering**: Create interaction terms, polynomial features
3. **Stacking/Blending**: Use multiple models and meta-learners
4. **Hyperparameter Optimization**: Try Bayesian optimization with `tune_bayes()`
5. **Cross-Validation Strategies**: Experiment with different CV schemes
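
As a small illustration of the blending idea in the list above, predicted probabilities from several models can be combined with a weighted average. The probabilities, the weights, and the helper `blend_probabilities()` below are all made up for the sketch; in practice the weights would be chosen on validation data.

```r
# Hypothetical positive-class probabilities from two models on the same rows
prob_rf  <- c(0.92, 0.15, 0.60, 0.30)
prob_xgb <- c(0.85, 0.25, 0.40, 0.10)

# Weighted average of probability vectors; weights are normalized to sum to 1
blend_probabilities <- function(prob_list, weights) {
  stopifnot(length(prob_list) == length(weights),
            all(weights >= 0),
            sum(weights) > 0)
  weights <- weights / sum(weights)
  as.vector(do.call(cbind, prob_list) %*% weights)
}

blended <- blend_probabilities(list(prob_rf, prob_xgb), weights = c(0.6, 0.4))
print(blended)
```

Stacking goes one step further by fitting a meta-learner on out-of-fold predictions instead of fixing the weights by hand.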

### Model Diagnostics:
- **Learning Curves**: Plot performance vs training set size
- **Validation Curves**: Plot performance vs hyperparameter values  
- **Residual Analysis**: For regression problems
- **ROC Curves**: Detailed threshold analysis

### Production Deployment:
- **Model Serialization**: Save with `saveRDS()` for later use
- **Pipeline Validation**: Test on completely new data
- **Monitoring**: Track model performance over time
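
The serialization step can be sketched as a save and reload round trip. The example uses a small `glm` on `mtcars` as a stand-in for the fitted workflow and a temporary file path, both of which are assumptions for the demo. `saveRDS()` and `readRDS()` work the same way on any R object, such as `final_tuned_fit`.

```r
# Stand-in model; in the notebook this would be the fitted workflow
model <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# Save to a temporary location (choose a persistent path for real use)
model_path <- file.path(tempdir(), "final_model.rds")
saveRDS(model, model_path)

# Reload and confirm the restored model predicts exactly like the original
restored <- readRDS(model_path)
new_rows <- mtcars[1:5, c("wt", "hp")]

original_pred <- predict(model, newdata = new_rows, type = "response")
restored_pred <- predict(restored, newdata = new_rows, type = "response")

print(all.equal(original_pred, restored_pred))
```

Comparing predictions after reloading is a cheap check that the saved file is usable before it is relied on elsewhere.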

---

**📋 Summary**: This notebook provides a complete, production-ready Random Forest pipeline with hyperparameter tuning, evaluation, and submission file generation. Competition data is detected automatically from `../input/`; if detection picks the wrong columns, set `TARGET_COL` and `ID_COL` manually before the data is prepared.