# Ensemble Methods Example
This notebook demonstrates Random Forest, a widely used ensemble method, by applying it to both a classification task and a regression task using implementations from the rice_ml library.

In [1]:
from sklearn.datasets import load_iris, make_regression
from rice_ml.supervised_learning.ensemble_methods import RandomForestClassifier, RandomForestRegressor
from rice_ml.processing.preprocessing import train_test_split
from rice_ml.processing.post_processing import accuracy_score, mse, r2_score

## 1. Classification with RandomForestClassifier
We will use the Iris dataset to train and evaluate the classifier's ability to categorize flowers into three species.
### 1.1 Load Data and Preparation
The Iris dataset contains four features (sepal length, sepal width, petal length, and petal width) and three target classes.

In [2]:
iris = load_iris()
X_cls, y_cls = iris.data, iris.target

print(f"Total Classification Samples: {X_cls.shape[0]}")
print(f"Target Classes: {iris.target_names}")

Total Classification Samples: 150
Target Classes: ['setosa' 'versicolor' 'virginica']


### 1.2 Data Pre-Processing: Splitting the Dataset
We split the dataset into separate training and testing sets. This ensures we evaluate the model's generalization ability on data it has not previously seen.

In [3]:
# Split the dataset into training (80%) and testing (20%) sets
X_train_cls, X_test_cls, y_train_cls, y_test_cls = train_test_split(
    X_cls, y_cls, test_size=0.2, random_state=67
)

# Verify the split integrity
print(f"\nTraining Set Size: {X_train_cls.shape[0]} samples")
print(f"Testing Set Size: {X_test_cls.shape[0]} samples")


Training Set Size: 120 samples
Testing Set Size: 30 samples
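
The shuffle-and-slice idea behind a hold-out split can be sketched in plain Python. This is only an illustration: `simple_split` is a hypothetical helper, not the rice_ml `train_test_split`, and it is not expected to reproduce the same shuffling for a given seed.

```python
import random

def simple_split(n_samples, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) index lists for a shuffled hold-out split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)   # shuffle the row indices reproducibly
    n_test = round(n_samples * test_size)  # number of held-out rows
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = simple_split(150, test_size=0.2, seed=0)
print(len(train_idx), len(test_idx))  # 120 30
```

Because the two index lists never overlap, no test row can leak into training.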


### 1.3 Initialize and Train the Model
We instantiate the RandomForestClassifier with specific hyperparameters and fit it to the training data. The model internally trains 50 individual decision trees (n_estimators=50).

In [4]:
# Initialize the RandomForestClassifier
rfc = RandomForestClassifier(
    n_estimators=50,             # Number of trees in the forest
    max_depth=5,                 # Max depth of each individual tree
    max_features='sqrt',         # Number of features to consider at each split (standard for classification)
    random_state=67
)

print("\nBeginning RandomForestClassifier Training...")

# Fit the ensemble model
rfc.fit(X_train_cls, y_train_cls)

print("Training Complete.")


Beginning RandomForestClassifier Training...
Training Complete.
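
The textbook random forest recipe gives each tree its own bootstrap sample of rows, restricts each split to a random subset of features, and combines the trees by majority vote. The sketch below shows those three pieces in plain Python. Whether rice_ml follows this recipe exactly is an assumption, and all helper names here are hypothetical.

```python
import math
import random
from collections import Counter

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def sqrt_feature_subset(n_features, rng):
    """Pick floor(sqrt(n_features)) distinct feature indices, at least one."""
    k = max(1, math.isqrt(n_features))
    return sorted(rng.sample(range(n_features), k))

def majority_vote(tree_predictions):
    """Combine per-tree label lists into one label per sample."""
    # zip(*...) regroups the lists so each tuple holds all votes for one sample
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*tree_predictions)]

rng = random.Random(0)
rows = bootstrap_indices(120, rng)      # rows one tree would be trained on
features = sqrt_feature_subset(4, rng)  # 2 of the 4 Iris features

# Hypothetical predictions from three trees for four test samples
votes = [
    [0, 1, 2, 2],
    [0, 1, 1, 2],
    [0, 2, 1, 2],
]
print(majority_vote(votes))  # [0, 1, 1, 2]
```

Ties in `majority_vote` are resolved in favor of the label that appears first among the votes, which is a simplification chosen for this sketch.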


### 1.4 Prediction and Evaluation
Predict the classes for the test set and assess performance with accuracy_score, the fraction of test samples that are classified correctly.

In [5]:
# Generate predictions on the held-out test set
y_pred_cls = rfc.predict(X_test_cls)

# Calculate the Accuracy Score
accuracy = accuracy_score(y_test_cls, y_pred_cls)

print("\n--- Classification Results ---")
print(f"Random Forest Accuracy: {accuracy:.4f}")



--- Classification Results ---
Random Forest Accuracy: 1.0000
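
For reference, accuracy is simply the number of matching labels divided by the number of samples. A minimal sketch follows, using a hypothetical `accuracy_sketch` function rather than the rice_ml implementation:

```python
def accuracy_sketch(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

print(accuracy_sketch([0, 1, 2, 2], [0, 1, 1, 2]))  # 0.75
```

With only 30 test samples, each misclassification moves this score by about 0.033, so a perfect score reflects 30 correct predictions out of 30.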


## 2. Regression with RandomForestRegressor
We generate a synthetic dataset to test the regressor's ability to predict continuous numerical values.
### 2.1 Load Data and Preparation
A synthetic regression dataset is created with 200 samples and 4 features, with Gaussian noise (standard deviation 15.0) added to the target.

In [6]:
X_reg, y_reg = make_regression(
    n_samples=200, n_features=4, noise=15.0, random_state=67
)
print(f"Total Regression Samples: {X_reg.shape[0]}")

Total Regression Samples: 200


### 2.2 Data Pre-Processing: Splitting the Dataset
The regression data is split to ensure the evaluation is performed on truly independent observations.

In [7]:
# Split the regression dataset
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.3, random_state=67
)

# Verify the split integrity
print(f"\nTraining Set Size: {X_train_reg.shape[0]} samples")
print(f"Testing Set Size: {X_test_reg.shape[0]} samples")


Training Set Size: 140 samples
Testing Set Size: 60 samples


### 2.3 Initialize and Train the Model
Initialize the RandomForestRegressor, which uses a specialized Decision Tree Regressor as its base estimator and aggregates results via averaging.

In [8]:
# Initialize the RandomForestRegressor
rfr = RandomForestRegressor(
    n_estimators=50,
    max_depth=10,
    max_features=0.5,            # Using 50% of features for regression
    random_state=67
)

print("\nBeginning RandomForestRegressor Training...")

# Fit the ensemble model
rfr.fit(X_train_reg, y_train_reg)

print("Training Complete.")


Beginning RandomForestRegressor Training...
Training Complete.
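
Aggregation by averaging means the forest's prediction for a sample is the mean of the individual trees' predictions for that sample. A minimal sketch with made-up numbers follows; `average_predictions` is a hypothetical helper, not part of rice_ml.

```python
from statistics import fmean

def average_predictions(tree_predictions):
    """Average per-tree prediction lists into one value per sample."""
    # zip(*...) regroups the lists so each tuple holds all trees' outputs for one sample
    return [fmean(values) for values in zip(*tree_predictions)]

# Hypothetical predictions from three trees for two test samples
preds = [
    [10.0, -4.0],
    [14.0, -2.0],
    [12.0, 0.0],
]
print(average_predictions(preds))  # [12.0, -2.0]
```

Averaging many trees trained on different bootstrap samples reduces the variance of the combined prediction compared with any single tree.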


### 2.4 Prediction and Evaluation
Predict the continuous target values for the test set and evaluate performance using two key regression metrics: Mean Squared Error (MSE) and R-squared ($R^2$).

In [9]:
# Generate continuous predictions
y_pred_reg = rfr.predict(X_test_reg)

# Calculate evaluation metrics
mean_squared_error = mse(y_test_reg, y_pred_reg)
r2 = r2_score(y_test_reg, y_pred_reg)

print("\n--- Regression Results ---")
print(f"Mean Squared Error (MSE): {mean_squared_error:.2f}")
print(f"R-squared (R2) Score: {r2:.4f}")


--- Regression Results ---
Mean Squared Error (MSE): 5465.86
R-squared (R2) Score: -0.0301
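
To help read these numbers, both metrics can be sketched in plain Python. MSE is the mean of the squared residuals, and $R^2$ is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. An $R^2$ of 0 matches a model that always predicts that mean, so a negative value, as in the output above, indicates predictions that do worse than that constant baseline on the test set. The function names below are hypothetical and this is not the rice_ml implementation.

```python
from statistics import fmean

def mse_sketch(y_true, y_pred):
    """Mean of squared differences between true and predicted values."""
    return fmean((t - p) ** 2 for t, p in zip(y_true, y_pred))

def r2_sketch(y_true, y_pred):
    """1 - SS_res / SS_tot, where SS_tot is measured around the mean of y_true."""
    mean_true = fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
mean_baseline = [2.5, 2.5, 2.5, 2.5]  # always predict the mean of y_true
reversed_pred = [4.0, 3.0, 2.0, 1.0]  # deliberately poor predictions

print(mse_sketch(y_true, mean_baseline), r2_sketch(y_true, mean_baseline))  # 1.25 0.0
print(mse_sketch(y_true, reversed_pred), r2_sketch(y_true, reversed_pred))  # 5.0 -3.0
```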
