# Random Forest Regression Demo
This notebook trains a **RandomForestRegressor** on the California Housing dataset. It explains bagging and ensemble averaging, then evaluates the model on a held-out test set and plots feature importances and predicted-versus-actual values.

## Intuition: why ensembles help
- **Bagging (bootstrap aggregating):** each tree trains on a bootstrap sample (rows drawn with replacement from the training set), so every tree sees a slightly different version of the data.
- **Random feature selection:** at each split a random subset of features is considered, making trees less correlated.
- **Averaging predictions:** by averaging many diverse trees, the forest reduces variance and overfitting compared to a single deep tree.
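
To make the bootstrap step concrete, here is a minimal standard-library sketch that draws one bootstrap sample of row indices and measures how many distinct rows it contains. The sample size `n` and the seed are arbitrary choices for illustration, not values from this notebook. In expectation about 1 - 1/e (roughly 63%) of the rows appear at least once, which is why each tree ends up seeing a different subset.

```python
import random

def bootstrap_indices(n, rng):
    """Draw n row indices with replacement from range(n)."""
    return rng.choices(range(n), k=n)

rng = random.Random(0)   # arbitrary seed, for repeatability only
n = 10_000               # hypothetical training-set size
sample = bootstrap_indices(n, rng)
unique_fraction = len(set(sample)) / n

# Roughly 63% of rows are expected to appear at least once;
# the remaining "out-of-bag" rows are never seen by this tree.
print(f"unique rows in bootstrap sample: {unique_fraction:.3f}")
```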

In [None]:
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style='whitegrid')  # set_theme is the current name; sns.set is an alias

In [None]:
data = fetch_california_housing(as_frame=True)
df = data.frame
X = df.drop(columns=['MedHouseVal'])
y = df['MedHouseVal']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.head()

### Train the model
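We use 200 trees, leave depth unlimited (`max_depth=None`), and fix a random seed for reproducibility. Note that `max_features=1.0` makes every split consider all features, so this configuration is effectively bagged trees; a smaller value such as `'sqrt'` would enable the random feature selection described above.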

In [None]:
model = RandomForestRegressor(
    n_estimators=200,
    max_depth=None,
    max_features=1.0,
    random_state=42,
    n_jobs=-1,
)
model.fit(X_train, y_train)
preds = model.predict(X_test)

### Evaluate
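We compute MSE, MAE, RMSE, and R² on the held-out test set. The target `MedHouseVal` is expressed in units of $100,000, so MAE and RMSE are in those units as well.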

In [None]:
mse = mean_squared_error(y_test, preds)
mae = mean_absolute_error(y_test, preds)
rmse = mse ** 0.5  # square root of MSE, in the same units as the target
r2 = r2_score(y_test, preds)
mse, mae, rmse, r2
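
As a cross-check on what these metrics mean, the sketch below recomputes them by hand in plain Python. The helper `regression_metrics` and the toy values are made up for illustration and are not part of scikit-learn or of this notebook's data.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (mse, mae, rmse, r2) for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mse = ss_res / n
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                     # assumes y_true is not constant
    return mse, mae, rmse, r2

# Toy values, invented for this example
y_true_toy = [1.0, 2.0, 3.0, 4.0]
y_pred_toy = [1.5, 2.0, 2.5, 4.0]
mse_toy, mae_toy, rmse_toy, r2_toy = regression_metrics(y_true_toy, y_pred_toy)
print(mse_toy, mae_toy, rmse_toy, r2_toy)
```

R² compares the model's squared error with that of always predicting the mean of the true values, so a value of 1 is a perfect fit and 0 is no better than the mean.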

### Feature importance
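Random forests estimate feature importance from how much each feature's splits reduce impurity (for regression, the variance of the target), averaged over all trees and normalized to sum to 1. These impurity-based scores are computed on the training data and can overstate features with many distinct values, so treat them as a rough ranking.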

In [None]:
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
plt.figure(figsize=(10,6))
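# seaborn >= 0.13 deprecates `palette` without `hue`; on older versions drop `hue` and `legend`
sns.barplot(x=importances.values, y=importances.index, hue=importances.index, palette='viridis', legend=False)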
plt.title('Feature importance (Random Forest)')
plt.xlabel('Importance')
plt.tight_layout()
plt.show()
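
The impurity reduction behind these scores can be sketched for a single split in plain Python. This is a simplified illustration on made-up data, not scikit-learn's implementation: the helpers `variance` and `impurity_decrease` are hypothetical names, and only a split at the root is shown. A forest additionally weights each node by the share of samples reaching it, sums the decreases per feature, and averages over trees.

```python
def variance(values):
    """Population variance, the impurity measure for squared-error trees."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def impurity_decrease(feature, target, threshold):
    """Weighted variance reduction from splitting on feature <= threshold.

    Assumes the threshold leaves at least one sample on each side.
    """
    left = [t for f, t in zip(feature, target) if f <= threshold]
    right = [t for f, t in zip(feature, target) if f > threshold]
    n = len(target)
    weighted_child = (len(left) / n) * variance(left) + (len(right) / n) * variance(right)
    return variance(target) - weighted_child

# Toy data, invented for this example
feature_toy = [1.0, 2.0, 3.0, 4.0]
target_toy = [1.0, 1.0, 3.0, 3.0]

good_split = impurity_decrease(feature_toy, target_toy, threshold=2.5)
poor_split = impurity_decrease(feature_toy, target_toy, threshold=1.5)
print(good_split, poor_split)
```

The split at 2.5 separates the two target levels perfectly and removes all of the variance, while the split at 1.5 removes only part of it. A feature that keeps offering splits like the first one accumulates a high importance score.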

### Predicted vs Actual
For a well-performing regressor, the points should cluster near the dashed diagonal, where the predicted value equals the actual value.

In [None]:
plt.figure(figsize=(6,6))
sns.scatterplot(x=y_test, y=preds, alpha=0.4)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Predicted vs Actual')
plt.tight_layout()
plt.show()