# Random Forest Scaling Laws Analysis

This notebook demonstrates how to use the Random Forest scaling laws research framework to analyze how the computational cost of a Random Forest (training time and memory) scales with dataset size and model hyperparameters.

## Overview

We'll explore:
1. How training time scales with dataset size (samples and features)
2. How training time scales with Random Forest parameters
3. Memory usage patterns
4. Scaling law analysis and interpretation

In [None]:
# Import necessary libraries
import sys

# Make the scaling_laws_research package in the parent directory importable
sys.path.append('..')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from scaling_laws_research.utils.config import ExperimentConfig, RandomForestConfig, DataConfig
from scaling_laws_research.experiments.scaling_experiments import ScalingExperiment
from scaling_laws_research.analysis.scaling_laws import ScalingLawAnalyzer
from scaling_laws_research.visualizations.scaling_plots import ScalingPlotter

# Set up plotting
plt.style.use('seaborn-v0_8')
plt.rcParams['figure.figsize'] = (12, 8)

print("Libraries imported successfully!")

## 1. Configure Experiment

First, let's set up a simple experiment configuration. We'll use smaller scales for the notebook to keep runtime reasonable.
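
To make the size of the experiment concrete, here is a minimal sketch of how the base sizes and scale factors used in the configuration cell could combine into a grid of dataset shapes. It assumes each dataset size is simply `base * scale`; the framework may derive sizes differently, and `dataset_grid` is a hypothetical name used only for illustration.

```python
from itertools import product

# Values mirror the configuration cell in this notebook
base_samples = 500
base_features = 10
sample_scales = [1, 2, 4, 8]
feature_scales = [1, 2, 4]

# Assumption: each dataset size is base * scale (the framework may differ)
dataset_grid = [
    (base_samples * s, base_features * f)
    for s, f in product(sample_scales, feature_scales)
]

for n_samples, n_features in dataset_grid:
    print(f"{n_samples:>5} samples x {n_features:>2} features")
```

Under this assumption the grid has 4 × 3 = 12 dataset shapes, which matches the "Total dataset combinations" count printed by the configuration cell.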

In [None]:
# Create a custom configuration for the notebook
config = ExperimentConfig(
    name="notebook_scaling_demo",
    description="Random Forest scaling demo for notebook",
    output_dir="../results/notebook_demo",
    verbose=True
)

# Customize data configuration for faster execution
config.data.base_samples = 500
config.data.base_features = 10
config.data.sample_scales = [1, 2, 4, 8]  # Smaller scales
config.data.feature_scales = [1, 2, 4]    # Smaller scales

# Customize Random Forest parameters for faster execution
config.random_forest.n_estimators = [10, 50, 100]  # Fewer trees
config.random_forest.max_depth = [None, 5, 10]     # Fewer depth options
config.random_forest.min_samples_split = [2, 5]    # Fewer options
config.random_forest.min_samples_leaf = [1, 2]     # Fewer options

print("Experiment configuration:")
print(f"Base samples: {config.data.base_samples}")
print(f"Base features: {config.data.base_features}")
print(f"Sample scales: {config.data.sample_scales}")
print(f"Feature scales: {config.data.feature_scales}")
print(f"Total dataset combinations: {len(config.data.sample_scales) * len(config.data.feature_scales)}")
print(f"Total parameter combinations: {len(config.get_parameter_grid())}")

## 2. Run Scaling Experiment

Now let's run the experiment to collect scaling data.
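
For intuition about what a single data point involves, the sketch below times one call and records its peak traced memory using only the standard library. It is not the framework's measurement code: `measure` is a hypothetical helper, and the sorting call is a stand-in workload for something like a model's `fit` method.

```python
import time
import tracemalloc


def measure(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds, peak_mb)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        # Always stop tracing, even if fn raises
        tracemalloc.stop()
    return result, elapsed, peak_bytes / (1024 * 1024)


# Stand-in workload; in practice fn would be something like model.fit
result, seconds, peak_mb = measure(sorted, list(range(200_000, 0, -1)))
print(f"elapsed: {seconds:.4f} s, peak traced memory: {peak_mb:.2f} MB")
```

Two caveats apply to this sketch: `tracemalloc` only sees allocations made through Python's memory allocators, so memory allocated directly by native code may be undercounted, and tracing adds overhead, so a careful benchmark would time and trace in separate runs.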

In [None]:
# Create and run the experiment
experiment = ScalingExperiment(config)
results_df = experiment.run_full_experiment()

print(f"\nExperiment completed! Collected {len(results_df)} data points.")
print("\nFirst few results:")
print(results_df[['experiment_id', 'scaling_type', 'n_samples_train', 'n_features', 'training_time', 'accuracy']].head())

## 3. Analyze Data Scaling Results

Let's examine how training time, memory usage, and accuracy vary with dataset size.

In [None]:
# Filter data scaling results
data_results = results_df[results_df['scaling_type'] == 'data'].copy()

print("Data Scaling Results:")
print(data_results[['n_samples_train', 'n_features', 'training_time', 'training_memory_mb', 'accuracy']].head(10))

# Quick statistics
print("\nScaling ranges:")
print(f"Samples: {data_results['n_samples_train'].min()} - {data_results['n_samples_train'].max()}")
print(f"Features: {data_results['n_features'].min()} - {data_results['n_features'].max()}")
print(f"Training time: {data_results['training_time'].min():.3f} - {data_results['training_time'].max():.3f} seconds")

## 4. Visualize Scaling Patterns

Let's create visualizations to understand the scaling behavior.

In [None]:
# Create plotter and generate data scaling plots
plotter = ScalingPlotter()

# Plot training time scaling
fig = plotter.plot_data_scaling(results_df, metric="training_time")
plt.show()

print("The plots above show how training time scales with:")
print("- Number of training samples (top left)")
print("- Number of features (top right)")
print("- Heatmap of both dimensions (bottom left)")
print("- Overall dataset complexity (bottom right)")

In [None]:
# Plot parameter scaling
fig = plotter.plot_parameter_scaling(results_df, metric="training_time")
plt.show()

print("The plots above show how training time scales with Random Forest parameters:")
print("- Number of estimators (trees)")
print("- Maximum depth")
print("- Minimum samples for split")
print("- Minimum samples per leaf")

## 5. Extract Scaling Laws

Now let's analyze the scaling behavior mathematically and extract scaling laws.
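
A scaling law of the form y = a × x^b becomes a straight line after taking logarithms: log y = log a + b · log x. The exponent b is then the slope of a linear fit in log-log space. The sketch below shows this idea with NumPy; `fit_power_law` is a hypothetical helper, not the method used by `ScalingLawAnalyzer`, and the data is synthetic rather than measured.

```python
import numpy as np


def fit_power_law(x, y):
    """Fit y = a * x**b by least squares in log-log space.

    Returns (a, b, r_squared), with R^2 computed on the log-transformed data.
    """
    log_x = np.log(np.asarray(x, dtype=float))
    log_y = np.log(np.asarray(y, dtype=float))
    b, log_a = np.polyfit(log_x, log_y, 1)  # slope, intercept
    residuals = log_y - (log_a + b * log_x)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((log_y - log_y.mean()) ** 2)
    r_squared = 1.0 - ss_res / ss_tot
    return float(np.exp(log_a)), float(b), float(r_squared)


# Synthetic timings generated from y = 0.002 * x**1.2 (not measured data)
x = np.array([500, 1000, 2000, 4000])
y = 0.002 * x ** 1.2
a, b, r2 = fit_power_law(x, y)
print(f"a = {a:.4f}, b = {b:.2f}, R^2 = {r2:.3f}")
```

Because the synthetic data follows the power law exactly, the fit recovers the generating exponent. With real timings, noise and the small number of scale points in this notebook mean the fitted exponents should be read as rough estimates.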

In [None]:
# Analyze scaling laws
analyzer = ScalingLawAnalyzer()
data_analysis = analyzer.analyze_data_scaling(results_df)
param_analysis = analyzer.analyze_parameter_scaling(results_df)

print("Data Scaling Analysis:")
print("=" * 50)

for x_var, metrics in data_analysis.items():
    print(f"\n{x_var.replace('_', ' ').title()}:")
    for metric, analysis in metrics.items():
        if analysis.get('valid', False) and analysis['r_squared'] > 0.5:
            print(f"  {metric}: O(x^{analysis['b']:.2f}) - R² = {analysis['r_squared']:.3f}")
            if 'interpretation' in analysis:
                print(f"    → {analysis['interpretation']}")

In [None]:
# Generate and display the full report
report = analyzer.generate_summary_report(data_analysis, param_analysis)
print(report)

## 6. Scaling Laws Visualization

Let's create detailed plots showing the power-law fits.

In [None]:
# Generate scaling laws plot with power-law fits
fig = plotter.plot_scaling_laws(results_df)
plt.show()

print("The plots above show:")
print("- Log-log plots of performance metrics vs dataset dimensions")
print("- Red dashed lines show power-law fits (y = a × x^b)")
print("- Bottom right panel shows scaling exponents and fit quality")

## 7. Performance Comparison

Next, let's compare how predictive performance, time, memory, and CPU usage trade off against one another.

In [None]:
# Generate performance comparison plots
fig = plotter.plot_performance_comparison(results_df)
plt.show()

print("The comparison plots show:")
print("- Performance vs training time trade-offs")
print("- Memory vs time relationships")
print("- Training vs prediction time scaling")
print("- CPU usage patterns")

## 8. Key Insights

Let's summarize the key findings from our scaling analysis.

In [None]:
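# Extract key insights
# Note: _extract_key_insights is a private method (leading underscore),
# so its name and signature may change between versions of the framework
insights = analyzer._extract_key_insights(data_analysis, param_analysis)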

print("Key Insights from Random Forest Scaling Analysis:")
print("=" * 60)
for i, insight in enumerate(insights, 1):
    print(f"{i}. {insight}")

print("\nConclusion:")
print("This analysis provides empirical scaling laws for Random Forest models,")
print("helping to predict computational requirements and optimize resource allocation")
print("for machine learning workflows.")

## Next Steps

To extend this analysis:

1. **Larger Scale Experiments**: Run with larger datasets and more parameter combinations
2. **Different Datasets**: Test with various real-world datasets to see how scaling laws generalize
3. **Other Algorithms**: Compare Random Forest scaling with other ensemble methods
4. **Hardware Analysis**: Study how scaling laws change with different hardware configurations
5. **Parallel Processing**: Analyze scaling with different numbers of CPU cores
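
Before launching a larger run, a fitted exponent can give a rough cost estimate. Under y = a × x^b, the constant a cancels when taking the ratio of two sizes, so only one measured point and the exponent are needed. The sketch below uses a hypothetical helper, `predict_cost`, and made-up numbers rather than results from this notebook.

```python
def predict_cost(measured_cost, measured_size, target_size, exponent):
    """Extrapolate cost under y = a * x**b; a cancels in the ratio."""
    return measured_cost * (target_size / measured_size) ** exponent


# Hypothetical numbers: 2.0 s measured at 4,000 samples, exponent b = 1.0
estimate = predict_cost(
    measured_cost=2.0,
    measured_size=4000,
    target_size=32000,
    exponent=1.0,
)
print(f"Estimated training time at 32,000 samples: {estimate:.1f} s")
```

Extrapolating far beyond the measured range is risky, since effects such as cache behavior or memory limits can change the exponent at larger scales.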

The framework is designed to be extensible: you can modify configurations and add new metrics for analysis.