# Regression Trees Example
This notebook demonstrates the basic workflow for using `DecisionTreeRegressor` on a continuous prediction task.\
The objective is to train a regression tree on synthetic data and then assess how well it predicts a continuous target.

In [1]:
import numpy as np
from sklearn.datasets import make_regression 
from rice_ml.supervised_learning.regression_trees import DecisionTreeRegressor
from rice_ml.processing.preprocessing import train_test_split
from rice_ml.processing.post_processing import mse, r2_score

## 1. Load and Prepare the Data
Generate a synthetic dataset whose target variable is continuous, which makes it suitable for regression.

In [2]:
# Generate synthetic regression data (100 samples, 4 features)
X, y = make_regression(
    n_samples=100, 
    n_features=4, 
    n_informative=2,  # Only 2 of the 4 features are used to generate y
    noise=10.0, 
    random_state=67
)

print(f"Total Samples in Dataset: {X.shape[0]}")
print(f"Number of Features: {X.shape[1]}")

Total Samples in Dataset: 100
Number of Features: 4
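
To make the role of `n_informative` concrete, here is a minimal NumPy sketch of the kind of data `make_regression` produces: standard-normal features, a linear signal built from only the informative columns, and Gaussian noise. The names `X_demo`, `y_demo`, and `coef`, and the seed, are illustrative assumptions; this is not scikit-learn's actual implementation, which also shuffles the feature columns by default.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen for illustration only

n_samples, n_features, n_informative = 100, 4, 2
noise_std = 10.0

# Features drawn from a standard normal distribution.
X_demo = rng.standard_normal((n_samples, n_features))

# Only the first `n_informative` coefficients are nonzero;
# the remaining features contribute nothing to the target.
coef = np.zeros(n_features)
coef[:n_informative] = 100.0 * rng.uniform(size=n_informative)

# Linear signal plus Gaussian noise with standard deviation `noise_std`.
y_demo = X_demo @ coef + rng.normal(scale=noise_std, size=n_samples)

print("Coefficients:", coef)
print("Shapes:", X_demo.shape, y_demo.shape)
```

In this sketch the last two columns of `X_demo` are pure noise with respect to `y_demo`, which is the situation a tree has to cope with when it searches for useful splits.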


## 2. Data Pre-Processing: Splitting the Dataset
Separate the data into a training set and a test set so that the model can be evaluated on data it did not see during training.

In [3]:
# Split the dataset into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=67
)

# Verify the split integrity
print(f"\nTraining Set Size (X_train): {X_train.shape[0]} samples")
print(f"Testing Set Size (X_test): {X_test.shape[0]} samples")


Training Set Size (X_train): 80 samples
Testing Set Size (X_test): 20 samples
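
The split above can be pictured as shuffling the row indices and holding out a fraction of them. The following is a minimal sketch under that assumption; `simple_train_test_split` is a hypothetical helper written for illustration and is not necessarily how `rice_ml`'s `train_test_split` is implemented.

```python
import numpy as np

def simple_train_test_split(X, y, test_size=0.2, seed=None):
    """Hypothetical helper: shuffle row indices and hold out a `test_size` fraction."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    n_test = int(round(n_samples * test_size))
    indices = rng.permutation(n_samples)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # The same indices are applied to X and y so that rows stay aligned.
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data: row i of X_demo is [2i, 2i + 1] and y_demo[i] = i.
X_demo = np.arange(200, dtype=float).reshape(100, 2)
y_demo = np.arange(100, dtype=float)

Xtr, Xte, ytr, yte = simple_train_test_split(X_demo, y_demo, test_size=0.2, seed=0)
print(Xtr.shape, Xte.shape)  # (80, 2) (20, 2)
```

Fixing the seed makes the split reproducible, which is the same purpose `random_state=67` serves in the cell above.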


## 3. Initialize and Train the Model
Instantiate the `DecisionTreeRegressor` and fit it using only the training data. Setting `max_depth` limits how deep the tree can grow, which helps keep it from overfitting this small dataset.

In [4]:
# 1. Initialize the Decision Tree Regressor
dtr = DecisionTreeRegressor(
    max_depth=5,                 # Maximum depth to limit complexity
    min_samples_split=5,         # Require at least 5 samples to split a node
    random_state=67
)

print("\nBeginning Decision Tree Regressor Training...")

# 2. Fit the model to the training data (X_train, y_train)
dtr.fit(X_train, y_train)

print("Training Complete. Tree structure learned.")


Beginning Decision Tree Regressor Training...
Training Complete. Tree structure learned.
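
To give some intuition for what "learning a tree structure" involves, the sketch below searches a single feature for the threshold that minimizes the total squared error of the two resulting groups, which is the standard CART-style criterion for regression trees. Whether `rice_ml` uses exactly this criterion is an assumption, and `best_split_1d` is a hypothetical helper, not part of the library.

```python
import numpy as np

def best_split_1d(x, y):
    """Return (threshold, sse) minimizing the summed squared error of a split on one feature."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_threshold, best_sse = None, np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # no threshold can separate identical values
        left, right = y_sorted[:i], y_sorted[i:]
        # Each side predicts its own mean; sum the squared errors of both sides.
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if sse < best_sse:
            best_threshold = (x_sorted[i - 1] + x_sorted[i]) / 2.0
            best_sse = sse
    return best_threshold, best_sse

x_demo = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y_demo = np.array([5.0, 6.0, 5.5, 20.0, 21.0, 19.5])

threshold, sse = best_split_1d(x_demo, y_demo)
print(f"Best threshold: {threshold}")  # 6.5
print(f"Squared error after split: {sse:.4f}")
```

A tree repeats a search like this over every feature and then recurses into each side. Under that reading, `max_depth` caps how many times the recursion can nest, and `min_samples_split` stops a node with too few samples from being split further.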


## 4. Prediction and Evaluation
Use the trained tree structure to predict outcomes for the unseen test data and assess performance using standard regression metrics.

In [5]:
# 1. Generate predictions on the held-out test set
y_pred = dtr.predict(X_test)

# 2. Calculate Mean Squared Error (MSE)
mean_squared_error = mse(y_test, y_pred)

# 3. Calculate the R-squared (Coefficient of Determination) Score
r2 = r2_score(y_test, y_pred)

print("\n--- Regression Results ---")
print(f"Mean Squared Error (MSE): {mean_squared_error:.2f}")
print(f"R-squared (R2) Score: {r2:.4f}")

# Note: R^2 indicates the proportion of the variance in the dependent variable that is predictable from the independent variables.


--- Regression Results ---
Mean Squared Error (MSE): 663.15
R-squared (R2) Score: 0.7641
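
For reference, both metrics can be written out directly from their standard definitions. The sketch below uses NumPy; `mse_manual` and `r2_manual` are illustrative names and are not claimed to match the internals of `rice_ml`'s `mse` and `r2_score`.

```python
import numpy as np

def mse_manual(y_true, y_pred):
    """Mean of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def r2_manual(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares around the mean)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.0, 5.0, 8.0, 9.0])

print(f"MSE: {mse_manual(y_true_demo, y_pred_demo):.2f}")  # MSE: 0.50
print(f"R2:  {r2_manual(y_true_demo, y_pred_demo):.2f}")   # R2:  0.90
```

Read against these definitions, the reported MSE of 663.15 corresponds to a root-mean-squared error of about 25.75 in the units of `y`, and the R2 score of 0.7641 means the tree explains roughly 76% of the variance in the test targets.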
