# Random Forest Model - Cloud Resource Forecasting

---

## Objectives

1. **Hyperparameter Tuning**: Grid or randomized search for optimal Random Forest parameters
2. **Training**: Fit Random Forest models with high-correlation features
3. **Forecasting**: Predict 20 steps (10 minutes) ahead
4. **Evaluation**: Calculate MAE, RMSE, MAPE, R² metrics
5. **Comparison**: Save results for model comparison

---

**Dataset Info:**
- Time interval: 30 seconds
- Forecast horizon: 10 minutes (20 steps)
- Models: 3 (memory_usage_pct, cpu_total_usage, system_load)
- Method: Random Forest Regressor
- Feature selection: High correlation features from ETL
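
The step counts used throughout follow from the 30-second sampling interval. A minimal sketch of the conversion is shown here; `minutes_to_steps` is a hypothetical helper for illustration and is not part of the notebook or of `model_utils`.

```python
SAMPLE_INTERVAL_SECONDS = 30  # time interval stated in the dataset info


def minutes_to_steps(minutes, interval_seconds=SAMPLE_INTERVAL_SECONDS):
    """Convert a forecast horizon in minutes to a number of samples."""
    total_seconds = minutes * 60
    if total_seconds % interval_seconds:
        raise ValueError("horizon is not a whole number of sampling intervals")
    return int(total_seconds // interval_seconds)


horizon_steps = minutes_to_steps(10)
print(f"10 minutes = {horizon_steps} steps")
```

The same conversion gives 10, 40, and 60 steps for horizons of 5, 20, and 30 minutes.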


## 1. Import Libraries


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
import warnings
from datetime import datetime
import time

# Machine Learning
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Import model utilities
from model_utils import (
    save_model,
    load_model,
    calculate_metrics,
    print_metrics,
    save_results,
    create_models_directory
)

warnings.filterwarnings('ignore')

# Display settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
pd.set_option('display.max_columns', None)

# Create models directory
create_models_directory()

print("✓ Libraries imported")
print("✓ Model utilities loaded")
print(f"Analysis started: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")


## 2. Load Processed Data


In [None]:
# Load feature metadata
with open('processed_data/feature_metadata.json', 'r') as f:
    feature_metadata = json.load(f)

print("Feature Metadata:")
print("="*80)
for target, info in feature_metadata.items():
    print(f"\n{target}:")
    print(f"  Features: {info['n_features']}")
    print(f"  List: {info['features']}")

# Target variables
target_vars = ['memory_usage_pct', 'cpu_total_usage', 'system_load']

print("\n" + "="*80)
print("✓ Metadata loaded")


In [None]:
# Load train/test datasets
datasets = {}

for target in target_vars:
    print(f"\nLoading {target}...")
    
    X_train = pd.read_csv(f'processed_data/{target}/X_train.csv')
    X_test = pd.read_csv(f'processed_data/{target}/X_test.csv')
    y_train = pd.read_csv(f'processed_data/{target}/y_train.csv').squeeze()
    y_test = pd.read_csv(f'processed_data/{target}/y_test.csv').squeeze()
    
    datasets[target] = {
        'X_train': X_train,
        'X_test': X_test,
        'y_train': y_train,
        'y_test': y_test,
        'features': feature_metadata[target]['features']
    }
    
    print(f"  X_train: {X_train.shape}")
    print(f"  X_test: {X_test.shape}")
    print(f"  y_train: {len(y_train):,} samples")
    print(f"  y_test: {len(y_test):,} samples")

print("\n" + "="*80)
print("✓ All datasets loaded")
print("="*80)


## 3. Random Forest Configuration & Hyperparameter Tuning

Random Forest hyperparameters to tune:
- **n_estimators**: Number of trees in the forest
- **max_depth**: Maximum depth of each tree (`None` lets trees grow until the other limits stop them)
- **min_samples_split**: Minimum samples required to split an internal node
- **min_samples_leaf**: Minimum samples required at a leaf node
- **max_features**: Number of features considered when searching for the best split
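
To see why a randomized search is attractive here, it helps to count the work involved. The sketch below reuses the grid values, `cv=3`, and `N_ITER = 20` from this notebook; the variable names `n_combinations`, `grid_fits`, and `random_fits` are introduced only for illustration.

```python
from math import prod

# Same values as the parameter grid used in this notebook
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2'],
}
CV_FOLDS = 3  # matches cv=3 in the search calls
N_ITER = 20   # parameter settings sampled by the randomized search

# Every combination of one value per hyperparameter
n_combinations = prod(len(values) for values in param_grid.values())

# Each candidate is fitted once per fold (the final refit is not counted)
grid_fits = n_combinations * CV_FOLDS
random_fits = min(N_ITER, n_combinations) * CV_FOLDS

print(f"Combinations in grid:   {n_combinations}")
print(f"Fits, exhaustive grid:  {grid_fits}")
print(f"Fits, randomized:       {random_fits}")
```

Per target, the exhaustive grid needs 648 fits while the randomized search needs 60, at the cost of evaluating only about 9% of the combinations.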


In [None]:
# Random Forest hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2']
}

# Forecast horizon
FORECAST_HORIZON = 20  # 10 minutes

# Grid search configuration
GRID_SEARCH = True  # Set to False to skip the search and train with fixed parameters
USE_RANDOMIZED = True  # Use RandomizedSearchCV for faster search
N_ITER = 20  # Number of parameter settings sampled
N_JOBS = -1  # Use all CPU cores
RANDOM_STATE = 42

print("Random Forest Configuration:")
print("="*80)
print(f"Parameter grid:")
for param, values in param_grid.items():
    print(f"  {param}: {values}")
print(f"\nGrid search enabled: {GRID_SEARCH}")
print(f"Randomized search: {USE_RANDOMIZED}")
if USE_RANDOMIZED:
    print(f"Iterations: {N_ITER}")
print(f"Forecast horizon: {FORECAST_HORIZON} steps (10 minutes)")
print(f"Random state: {RANDOM_STATE}")
print("="*80)
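
The searches in the training section pass `cv=3`, which for a regressor means plain 3-fold splitting without shuffling, so some folds are validated on samples that precede part of their training data. If the rows of `X_train` are in chronological order (an assumption, since the ETL step is not shown here), an expanding-window scheme is a common alternative. The sketch below illustrates the idea in plain Python; `expanding_window_splits` is a hypothetical helper, not part of scikit-learn or `model_utils`.

```python
def expanding_window_splits(n_samples, n_splits=3):
    """Yield (train_indices, validation_indices); validation always follows training."""
    if n_splits < 1 or n_samples < n_splits + 1:
        raise ValueError("need at least n_splits + 1 samples")
    fold_size = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train_end = fold_size * k
        # The last fold absorbs any remainder so every sample is used
        val_end = n_samples if k == n_splits else train_end + fold_size
        yield list(range(train_end)), list(range(train_end, val_end))


for train_idx, val_idx in expanding_window_splits(12, n_splits=3):
    print(f"train 0..{train_idx[-1]}  validate {val_idx[0]}..{val_idx[-1]}")
```

scikit-learn provides a splitter of this kind as `TimeSeriesSplit` in `sklearn.model_selection`; passing `cv=TimeSeriesSplit(n_splits=3)` to either search class would be one way to try it, though its fold boundaries differ slightly from this sketch.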


## 4. Train Random Forest Models


In [None]:
# Train Random Forest models
rf_models = {}
training_results = {}

print("="*80)
print("TRAINING RANDOM FOREST MODELS")
print("="*80)

for target in target_vars:
    print(f"\n{'='*80}")
    print(f"Target: {target}")
    print(f"{'='*80}")
    
    X_train = datasets[target]['X_train']
    y_train = datasets[target]['y_train']
    
    print(f"Training samples: {len(y_train):,}")
    print(f"Features: {len(X_train.columns)}")
    
    start_time = time.time()
    
    try:
        if GRID_SEARCH:
            rf = RandomForestRegressor(random_state=RANDOM_STATE, n_jobs=N_JOBS)
            
            if USE_RANDOMIZED:
                print("\nPerforming randomized search...")
                search = RandomizedSearchCV(
                    rf,
                    param_grid,
                    n_iter=N_ITER,
                    cv=3,
                    scoring='neg_mean_squared_error',
                    n_jobs=N_JOBS,
                    verbose=1,
                    random_state=RANDOM_STATE
                )
            else:
                print("\nPerforming grid search...")
                search = GridSearchCV(
                    rf,
                    param_grid,
                    cv=3,
                    scoring='neg_mean_squared_error',
                    n_jobs=N_JOBS,
                    verbose=1
                )
            
            search.fit(X_train, y_train)
            
            best_model = search.best_estimator_
            best_params = search.best_params_
            best_score = -search.best_score_  # Convert to positive MSE
            
            print(f"\n✓ Search completed")
            print(f"  Best parameters: {best_params}")
            print(f"  Best CV MSE: {best_score:.6f}")
            
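        else:
            print("\nTraining with fixed parameters (no search)...")
            best_model = RandomForestRegressor(
                n_estimators=100,
                max_depth=20,
                random_state=RANDOM_STATE,
                n_jobs=N_JOBS
            )
            best_model.fit(X_train, y_train)
            # Record the values the model actually used. A hard-coded dict can
            # disagree with scikit-learn's defaults (max_features is not 'sqrt'
            # for regressors unless it is passed explicitly).
            fitted_params = best_model.get_params()
            best_params = {name: fitted_params[name] for name in param_grid}
            best_score = None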
        
        training_time = time.time() - start_time
        
        # Store model
        rf_models[target] = best_model
        
        # Save model
        print("\nSaving model...")
        model_path = save_model(
            best_model,
            model_name='random_forest',
            target=target,
            config={
                'params': best_params,
                'n_features': len(X_train.columns),
                'features': list(X_train.columns)
            },
            models_dir='models'
        )
        
        training_results[target] = {
            'params': best_params,
            'n_features': len(X_train.columns),
            'n_samples': len(y_train),
            'training_time': training_time,
            'cv_mse': best_score,
            'model_path': model_path
        }
        
        print(f"✓ Training completed in {training_time:.2f}s")
        
        # Feature importance
        feature_importance = pd.DataFrame({
            'feature': X_train.columns,
            'importance': best_model.feature_importances_
        }).sort_values('importance', ascending=False)
        
        print(f"\nTop 5 Important Features:")
        for idx, row in feature_importance.head(5).iterrows():
            print(f"  {row['feature']:30s}: {row['importance']:.4f}")
        
    except Exception as e:
        print(f"✗ Training failed: {str(e)}")
        training_results[target] = {'error': str(e), 'success': False}

print("\n" + "="*80)
print(f"✓ Training finished: {len(rf_models)}/{len(target_vars)} models trained and saved")
print("="*80)


## 5. Multi-Step Forecasting


In [None]:
def rf_rolling_forecast(model, X_test, y_test, horizon=20):
    """
    Fixed-offset forecast for Random Forest.
    The prediction made from the features at row i is paired with the
    actual value at row i + horizon. The model is not re-fitted and does
    not consume its own predictions.
    """
    n_test = len(y_test)
    predictions = []
    
    # Row i needs an actual value at row i + horizon, so only the first
    # n_test - horizon rows can be scored
    n_forecast_points = n_test - horizon
    if n_forecast_points <= 0:
        raise ValueError(f"horizon={horizon} leaves no test rows to score (n_test={n_test})")
    
    print(f"  Forecasting {n_forecast_points} points with horizon={horizon}")
    
    for i in range(n_forecast_points):
        # Predict from the features at time t; scored against the value at t+horizon
        X_current = X_test.iloc[i:i+1]
        pred = model.predict(X_current)[0]
        predictions.append(pred)
        
        if (i + 1) % 5000 == 0:
            print(f"    Progress: {i+1}/{n_forecast_points}")
    
    # Align actual values (same length as predictions)
    predictions = np.array(predictions)
    actual = y_test.iloc[horizon:horizon+n_forecast_points].values
    
    return predictions, actual

# Perform forecasting
print("="*80)
print(f"MULTI-STEP FORECASTING (Horizon: {FORECAST_HORIZON} steps = 10 minutes)")
print("="*80)

forecast_results = {}

for target in target_vars:
    if target not in rf_models:
        print(f"\n✗ Skipping {target} - model not trained")
        continue
    
    print(f"\n{'='*80}")
    print(f"Target: {target}")
    print(f"{'='*80}")
    
    model = rf_models[target]
    X_test = datasets[target]['X_test']
    y_test = datasets[target]['y_test']
    
    start_time = time.time()
    
    try:
        predictions, actual = rf_rolling_forecast(model, X_test, y_test, FORECAST_HORIZON)
        
        forecast_time = time.time() - start_time
        
        forecast_results[target] = {
            'predictions': predictions,
            'actual': actual,
            'n_predictions': len(predictions),
            'forecast_time': forecast_time,
            'horizon': FORECAST_HORIZON
        }
        
        print(f"✓ Completed in {forecast_time:.2f}s")
        print(f"  Predictions: {len(predictions):,}")
        print(f"  Avg time: {forecast_time/len(predictions)*1000:.2f}ms per forecast")
        
    except Exception as e:
        print(f"✗ Forecasting failed: {str(e)}")
        forecast_results[target] = {'error': str(e), 'success': False}

print("\n" + "="*80)
print("✓ Forecasting completed")
print("="*80)
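
The forecasting step scores the prediction made from row i against the target at row i + horizon. A toy sketch of that pairing with plain lists makes the bookkeeping explicit: a series of length n yields n - horizon usable pairs. `align_for_horizon` is a hypothetical helper for illustration, not part of `model_utils`.

```python
def align_for_horizon(feature_rows, targets, horizon):
    """Pair the features at row i with the target at row i + horizon."""
    if len(feature_rows) != len(targets):
        raise ValueError("features and targets must have the same length")
    if not 0 < horizon < len(targets):
        raise ValueError("horizon must be between 1 and len(targets) - 1")
    n_pairs = len(targets) - horizon
    return feature_rows[:n_pairs], targets[horizon:]


# Placeholder rows and values, chosen only to make the offsets visible
rows = ["x0", "x1", "x2", "x3", "x4"]
values = [10, 11, 12, 13, 14]

inputs, expected = align_for_horizon(rows, values, horizon=2)
for row, value in zip(inputs, expected):
    print(f"{row} is scored against {value}")
```

With five rows and a horizon of 2, three pairs remain: the last two rows have no future value to be compared with.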


## 6. Evaluation


In [None]:
# Calculate metrics
print("="*80)
print("EVALUATION METRICS")
print("="*80)

evaluation_results = {}

for target in target_vars:
    if target not in forecast_results or 'predictions' not in forecast_results[target]:
        print(f"\n✗ Skipping {target} - no predictions")
        continue
    
    print(f"\n{'='*80}")
    print(f"Target: {target}")
    print(f"{'='*80}")
    
    y_true = forecast_results[target]['actual']
    y_pred = forecast_results[target]['predictions']
    
    # Calculate metrics
    metrics = calculate_metrics(y_true, y_pred)
    evaluation_results[target] = metrics
    
    # Print formatted metrics
    print_metrics(metrics, target)

print("\n" + "="*80)
print("✓ Evaluation completed")
print("="*80)
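
For reference, the four metrics reported here can be written out in a few lines. This is a sketch under stated assumptions, not the implementation of `calculate_metrics` in `model_utils`, which is not shown in this notebook. In particular, skipping near-zero actual values in MAPE (the `eps` parameter) is an assumption, and `sketch_metrics` is a hypothetical name.

```python
import math


def sketch_metrics(y_true, y_pred, eps=1e-8):
    """Return MAE, RMSE, MAPE (%) and R² for two equal-length sequences."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")

    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)

    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(ss_res / n)

    # MAPE is undefined where the actual value is zero; skip those points
    pct = [abs(e / t) for e, t in zip(errors, y_true) if abs(t) > eps]
    mape = 100 * sum(pct) / len(pct) if pct else float("nan")

    # R² compares the model's squared error with that of predicting the mean
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    return {"mae": mae, "rmse": rmse, "mape": mape, "r2": r2}
```

An R² of 0 means the forecast is no better than always predicting the mean of the actual values, and a negative R² means it is worse, which is why the R² plots in this notebook draw a reference line at zero.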


## 7. Feature Importance Analysis


In [None]:
# Plot feature importance for all targets
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
fig.suptitle('Random Forest: Feature Importance', fontsize=16, fontweight='bold')

for idx, target in enumerate(target_vars):
    if target not in rf_models:
        continue
    
    model = rf_models[target]
    X_train = datasets[target]['X_train']
    
    # Get feature importance
    importance_df = pd.DataFrame({
        'feature': X_train.columns,
        'importance': model.feature_importances_
    }).sort_values('importance', ascending=True)
    
    # Plot
    importance_df.plot(kind='barh', x='feature', y='importance', ax=axes[idx], legend=False)
    axes[idx].set_title(f'{target}')
    axes[idx].set_xlabel('Importance')
    axes[idx].set_ylabel('')
    axes[idx].grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.show()


## 8. Visualization


In [None]:
# Plot predictions vs actual
fig, axes = plt.subplots(3, 1, figsize=(15, 12))
fig.suptitle('Random Forest: Predictions vs Actual (10-minute horizon)', fontsize=16, fontweight='bold')

for idx, target in enumerate(target_vars):
    if target not in forecast_results or 'predictions' not in forecast_results[target]:
        continue
    
    y_true = forecast_results[target]['actual']
    y_pred = forecast_results[target]['predictions']
    
    # Plot first 500 points
    n_plot = min(500, len(y_true))
    
    axes[idx].plot(y_true[:n_plot], label='Actual', alpha=0.7, linewidth=1.5)
    axes[idx].plot(y_pred[:n_plot], label='Predicted', alpha=0.7, linewidth=1.5)
    axes[idx].set_title(f'{target} - MAE: {evaluation_results[target]["mae"]:.4f}, R²: {evaluation_results[target]["r2"]:.4f}')
    axes[idx].set_xlabel('Time Step')
    axes[idx].set_ylabel('Normalized Value')
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()


In [None]:
# Scatter plots
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
fig.suptitle('Random Forest: Predicted vs Actual', fontsize=16, fontweight='bold')

for idx, target in enumerate(target_vars):
    if target not in forecast_results or 'predictions' not in forecast_results[target]:
        continue
    
    y_true = forecast_results[target]['actual']
    y_pred = forecast_results[target]['predictions']
    
    axes[idx].scatter(y_true, y_pred, alpha=0.3, s=10)
    
    # Perfect prediction line
    min_val = min(y_true.min(), y_pred.min())
    max_val = max(y_true.max(), y_pred.max())
    axes[idx].plot([min_val, max_val], [min_val, max_val], 'r--', linewidth=2, label='Perfect')
    
    axes[idx].set_title(f'{target}\nR² = {evaluation_results[target]["r2"]:.4f}')
    axes[idx].set_xlabel('Actual')
    axes[idx].set_ylabel('Predicted')
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()


In [None]:
# Metrics comparison
metrics_df = pd.DataFrame(evaluation_results).T

fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('Random Forest Performance Metrics', fontsize=16, fontweight='bold')

metrics_df['mae'].plot(kind='bar', ax=axes[0, 0], color='skyblue')
axes[0, 0].set_title('MAE')
axes[0, 0].set_ylabel('MAE')
axes[0, 0].grid(True, alpha=0.3)

metrics_df['rmse'].plot(kind='bar', ax=axes[0, 1], color='lightcoral')
axes[0, 1].set_title('RMSE')
axes[0, 1].set_ylabel('RMSE')
axes[0, 1].grid(True, alpha=0.3)

metrics_df['mape'].plot(kind='bar', ax=axes[1, 0], color='lightgreen')
axes[1, 0].set_title('MAPE')
axes[1, 0].set_ylabel('MAPE (%)')
axes[1, 0].grid(True, alpha=0.3)

metrics_df['r2'].plot(kind='bar', ax=axes[1, 1], color='plum')
axes[1, 1].set_title('R² Score')
axes[1, 1].set_ylabel('R²')
axes[1, 1].axhline(y=0, color='r', linestyle='--', linewidth=1)
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()


## 9. Save Results


In [None]:
# Compile results
final_results = {
    'model': 'Random Forest',
    'timestamp': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
    'forecast_horizon': FORECAST_HORIZON,
    'forecast_horizon_minutes': FORECAST_HORIZON * 0.5,
    'targets': {}
}

for target in target_vars:
    if target not in evaluation_results:
        continue
    
    final_results['targets'][target] = {
        'model_config': {
            'params': training_results[target]['params'],
            'n_features': training_results[target]['n_features']
        },
        'training': {
            'samples': training_results[target]['n_samples'],
            'time_seconds': training_results[target]['training_time'],
            'cv_mse': training_results[target].get('cv_mse'),
            'model_path': training_results[target]['model_path']
        },
        'forecasting': {
            'n_predictions': forecast_results[target]['n_predictions'],
            'time_seconds': forecast_results[target]['forecast_time']
        },
        'metrics': evaluation_results[target]
    }

# Save results
output_file = save_results(final_results, 'results_random_forest.json')

print("="*80)
print("RESULTS SUMMARY")
print("="*80)
print(f"Model: Random Forest")
print(f"Forecast horizon: {FORECAST_HORIZON} steps ({FORECAST_HORIZON*0.5:.1f} min)")
print(f"Results saved to: {output_file}")
print(f"Models saved in: models/")

print(f"\nMetrics Overview:")
for target in target_vars:
    if target in final_results['targets']:
        metrics = final_results['targets'][target]['metrics']
        print(f"\n  {target}:")
        print(f"    MAE:  {metrics['mae']:.6f}")
        print(f"    RMSE: {metrics['rmse']:.6f}")
        print(f"    R²:   {metrics['r2']:.6f}")

print("="*80)


## 10. Test with Different Horizons


In [None]:
# Test with multiple horizons
HORIZONS_TO_TEST = [10, 20, 40, 60]  # 5, 10, 20, 30 minutes

print("="*80)
print("TESTING MULTIPLE HORIZONS")
print("="*80)
print(f"Horizons: {HORIZONS_TO_TEST} steps = {[h*0.5 for h in HORIZONS_TO_TEST]} minutes")
print()

horizon_comparison = {}

for target in target_vars:
    if target not in rf_models:
        continue
    
    print(f"\n{'='*80}")
    print(f"Target: {target}")
    print(f"{'='*80}")
    
    model = rf_models[target]
    X_test = datasets[target]['X_test']
    y_test = datasets[target]['y_test']
    
    horizon_comparison[target] = {}
    
    for horizon in HORIZONS_TO_TEST:
        print(f"  Horizon: {horizon} steps ({horizon*0.5:.1f} min)... ", end='')
        
        try:
            predictions, actual = rf_rolling_forecast(model, X_test, y_test, horizon)
            metrics = calculate_metrics(actual, predictions)
            
            horizon_comparison[target][f"h{horizon}"] = {
                'horizon': horizon,
                'horizon_minutes': horizon * 0.5,
                'n_predictions': len(predictions),
                'metrics': metrics
            }
            
            print(f"MAE={metrics['mae']:.4f}, R²={metrics['r2']:.4f}")
            
        except Exception as e:
            print(f"Failed: {str(e)}")

print("\n" + "="*80)
print("✓ Horizon testing completed")
print("="*80)


In [None]:
# Comparison table
print("\nHORIZON COMPARISON:")
print("="*80)

for target in target_vars:
    if target not in horizon_comparison:
        continue
    
    print(f"\n{target.upper()}:")
    print(f"{'Horizon':>10} {'Minutes':>10} {'MAE':>12} {'RMSE':>12} {'R²':>12}")
    print("-" * 80)
    
    for h_key in sorted(horizon_comparison[target].keys(),
                       key=lambda x: horizon_comparison[target][x]['horizon']):
        h = horizon_comparison[target][h_key]
        print(f"{h['horizon']:>10} {h['horizon_minutes']:>10.1f} "
              f"{h['metrics']['mae']:>12.6f} {h['metrics']['rmse']:>12.6f} "
              f"{h['metrics']['r2']:>12.6f}")


In [None]:
# Visualize performance vs horizon
fig, axes = plt.subplots(2, 2, figsize=(16, 10))
fig.suptitle('Random Forest: Performance vs Forecast Horizon', fontsize=16, fontweight='bold')

metrics_to_plot = ['mae', 'rmse', 'mape', 'r2']
titles = ['MAE', 'RMSE', 'MAPE (%)', 'R²']
colors_list = ['skyblue', 'lightcoral', 'lightgreen']

for idx, (metric, title) in enumerate(zip(metrics_to_plot, titles)):
    ax = axes[idx // 2, idx % 2]
    
    for cidx, target in enumerate(target_vars):
        if target not in horizon_comparison:
            continue
        
        horizons = []
        values = []
        
        for h_key in sorted(horizon_comparison[target].keys(),
                           key=lambda x: horizon_comparison[target][x]['horizon']):
            h = horizon_comparison[target][h_key]
            horizons.append(h['horizon_minutes'])
            values.append(h['metrics'][metric])
        
        ax.plot(horizons, values, marker='o', linewidth=2,
               label=target, alpha=0.7, color=colors_list[cidx])
    
    ax.set_title(title)
    ax.set_xlabel('Horizon (minutes)')
    ax.set_ylabel(title)
    ax.legend()
    ax.grid(True, alpha=0.3)
    
    if metric == 'r2':
        ax.axhline(y=0, color='r', linestyle='--', alpha=0.5)

plt.tight_layout()
plt.show()


In [None]:
# Save horizon comparison
horizon_results = {
    'model': 'Random Forest',
    'timestamp': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
    'horizons_tested': HORIZONS_TO_TEST,
    'targets': horizon_comparison
}

horizon_file = save_results(horizon_results, 'results_random_forest_horizon_comparison.json')

print("\nBest horizon by MAE:")
for target in target_vars:
    if target in horizon_comparison:
        best = min(horizon_comparison[target].items(),
                  key=lambda x: x[1]['metrics']['mae'])
        print(f"  {target}: {best[1]['horizon']} steps ({best[1]['horizon_minutes']:.1f} min) "
              f"- MAE: {best[1]['metrics']['mae']:.6f}")

print(f"\n✓ Horizon comparison saved to: {horizon_file}")


## Summary

### Completed:

1. ✅ **Data Loading**: Loaded preprocessed train/test data
2. ✅ **Hyperparameter Tuning**: Grid/Randomized search for optimal parameters
3. ✅ **Model Training**: Random Forest with optimal hyperparameters
4. ✅ **Model Saving**: Saved to `models/` directory
5. ✅ **Feature Importance**: Analyzed most important features
6. ✅ **Forecasting**: 10-minute ahead predictions (20 steps)
7. ✅ **Evaluation**: MAE, RMSE, MAPE, R² metrics
8. ✅ **Multi-Horizon Testing**: Tested with 5, 10, 20, 30 minute horizons
9. ✅ **Visualization**: Performance comparisons
10. ✅ **Results Saved**: JSON files for comparison

### Output Files:

- `models/random_forest_[target]_[timestamp].pkl`: Trained models
- `results_random_forest.json`: Main results (20-step horizon)
- `results_random_forest_horizon_comparison.json`: Multi-horizon comparison
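
Because the model filenames embed a timestamp, it can be convenient to look up the most recent file for a target instead of typing the name by hand. The sketch below assumes the `models/random_forest_[target]_[timestamp].pkl` pattern listed above and picks the newest file by modification time; `latest_model_path` is a hypothetical helper, not part of `model_utils`.

```python
from pathlib import Path


def latest_model_path(target, models_dir="models", model_name="random_forest"):
    """Return the most recently modified saved model file for a target."""
    pattern = f"{model_name}_{target}_*.pkl"
    candidates = sorted(
        Path(models_dir).glob(pattern),
        key=lambda path: path.stat().st_mtime,
    )
    if not candidates:
        raise FileNotFoundError(f"no files matching {pattern} in {models_dir}")
    return candidates[-1]


# Example (requires saved models to exist):
# model, metadata = load_model(str(latest_model_path("memory_usage_pct")))
```

Sorting by modification time avoids relying on the exact timestamp format in the filename, which this notebook does not show.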

### How to Load and Test:

```python
from model_utils import load_model, calculate_metrics

# Load saved model
model, metadata = load_model('models/random_forest_memory_usage_pct_xxx.pkl')

# Predict with a custom horizon (rf_rolling_forecast, X_test and y_test
# come from the notebook cells above)
predictions, actual = rf_rolling_forecast(model, X_test, y_test, horizon=120)

# Evaluate
metrics = calculate_metrics(actual, predictions)
print(f"R²: {metrics['r2']:.4f}")
```

### Model Comparison:

Compare the ARIMAX, SVR, and Random Forest results:
```python
from model_utils import compare_results

comparison = compare_results([
    'results_arimax.json',
    'results_svr.json',
    'results_random_forest.json'
])
print(comparison)
```
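
How `compare_results` works internally is not shown in this notebook. As a rough sketch of the kind of flattening such a comparison needs, the code below turns result files into one row per model and target. It assumes every file follows the layout written by the Save Results section (`model` at the top level, metrics under `targets[target]['metrics']`); whether the ARIMAX and SVR files match that layout is an assumption. `metrics_rows` and `load_metrics_rows` are hypothetical names.

```python
import json


def metrics_rows(results):
    """Flatten one results dict into a list of {model, target, metric...} rows."""
    rows = []
    for target, entry in results["targets"].items():
        row = {"model": results["model"], "target": target}
        row.update(entry["metrics"])
        rows.append(row)
    return rows


def load_metrics_rows(paths):
    """Read several results JSON files and concatenate their rows."""
    rows = []
    for path in paths:
        with open(path, "r") as f:
            rows.extend(metrics_rows(json.load(f)))
    return rows


# Example (requires the result files to exist):
# table = pd.DataFrame(load_metrics_rows(['results_random_forest.json']))
```

Rows in this shape can be passed directly to `pd.DataFrame` and pivoted by `target` to put the models side by side.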

### Advantages of Random Forest:

- ✅ Handles non-linear relationships
- ✅ Feature importance analysis
- ✅ Robust to outliers
- ✅ No assumptions about data distribution
- ✅ Fast prediction after training

---

**All Random Forest models and results saved successfully!**
