<a href="https://colab.research.google.com/github/divyakanojia/machinelearning/blob/main/ML_regressor_algorithms.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ML Pipeline Steps (Common to All)
Problem Statement: Predict a continuous target (regression problem)

Import Libraries

Load Dataset

Preprocess Data (handle nulls, encode categoricals, feature scaling)

Split Data (into train and test sets)

Train Model (fit the regressor)

Predict

Evaluate (R², MAE, MSE, RMSE)

(Optional): Tune Hyperparameters
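
The four metrics named in the Evaluate step can be worked out by hand on a tiny example. The sketch below uses only the standard library; `regression_metrics` is a hypothetical helper name and the four data points are made up for illustration. The scikit-learn functions imported in the next cell should give the same MAE, MSE, and R² values, and RMSE is simply the square root of MSE.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R² for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n          # mean absolute error
    mse = sum(e ** 2 for e in errors) / n          # mean squared error
    rmse = math.sqrt(mse)                          # same units as the target
    mean_true = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)           # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

# Illustrative values only (not taken from the housing dataset)
metrics = regression_metrics([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
print(metrics)
```

RMSE is often easier to read than MSE because it is expressed in the units of the target variable.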





In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import pandas as pd
import numpy as np


# Decision Tree Regressor

🌳 Decision Tree Regressor – Definition & Explanation
✅ Definition:
A Decision Tree Regressor is a machine learning model used to predict continuous numeric values by learning decision rules from features, structured in a tree-like format.
It splits the data into smaller and smaller subsets using feature thresholds that reduce prediction error.

📏 How It Works:
Starts at the root node with all data.

At each node, chooses the feature and split point that minimize the Mean Squared Error (MSE) of the resulting child nodes.

Splits the data into child nodes.

Repeats this until a stopping condition is met (like max depth or min samples).

The prediction for a sample is the mean of the training target values in the leaf node it reaches.

🧠 Loss Function (MSE):
$$
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y})^2
$$

The tree chooses splits that minimize MSE at each step.
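
The split search can be sketched for a single feature in plain Python. Here `mse` and `best_split` are hypothetical helper names, not scikit-learn functions, and ŷ is taken to be the mean target value of a node. This is a minimal illustration of the idea; a real implementation repeats the search over every feature and is heavily optimized.

```python
def mse(values):
    """MSE of the values around their own mean (the node's prediction)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_split(xs, ys):
    """Return (threshold, weighted_mse) of the best split on one feature."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        # Child MSEs weighted by the share of samples in each child
        score = (len(left) * mse(left) + len(right) * mse(right)) / n
        if score < best[1]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint
            best = (threshold, score)
    return best

# Toy data: the target jumps from 1 to 5 between x = 3 and x = 10
threshold, score = best_split([1, 2, 3, 10, 11, 12], [1, 1, 1, 5, 5, 5])
print(threshold, score)
```

On this toy data the best threshold falls midway between the two groups, and both children are perfectly pure.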

📌 Key Hyperparameters:

| Parameter | Meaning |
|---|---|
| max_depth | Maximum depth of the tree |
| min_samples_split | Minimum number of samples required to split a node |
| min_samples_leaf | Minimum number of samples allowed in a leaf node |
| max_features | Number of features considered at each split |
| criterion | Function used to measure split quality; the default for regression is 'squared_error' |
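
To see how these parameters limit tree growth, the following sketch compares an unrestricted tree with a depth-limited one. It assumes scikit-learn is installed and uses synthetic data from `make_regression` rather than the housing dataset; the variable names and parameter values are chosen only for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration, not the housing dataset)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# One tree grown without limits, one constrained by max_depth and min_samples_leaf
full_tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
shallow_tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=5, random_state=0).fit(X_tr, y_tr)

for name, tree in [("unrestricted", full_tree), ("max_depth=3", shallow_tree)]:
    print(name, "depth:", tree.get_depth(),
          "train R2:", round(tree.score(X_tr, y_tr), 3),
          "test R2:", round(tree.score(X_te, y_te), 3))
```

A large gap between train and test R² for the unrestricted tree is the usual sign of overfitting; the constrained tree trades some training accuracy for a simpler model.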

✅ Advantages:
Simple to understand and interpret

No need for feature scaling

Handles non-linear relationships well

Can capture feature interactions

❌ Limitations:
Prone to overfitting

Small changes in data can lead to a very different tree (high variance)

Usually less accurate than ensemble methods such as Random Forest

🧪 Use Cases:
Predicting house prices

Predicting sales or revenue

Regression tasks in finance, health, and marketing

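📘 Summary Table:

| Feature | Decision Tree Regressor |
|---|---|
| Output type | Continuous (Regression) |
| Scaling needed? | ❌ No |
| Handles missing values? | ⚠️ Depends on the implementation (scikit-learn supports NaN natively from version 1.3) |
| Overfitting risk | ⚠️ High (if not pruned) |
| Interpretability | ✅ High |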

In [None]:
from sklearn.tree import DecisionTreeRegressor

# Load data
data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model
dt_model = DecisionTreeRegressor(random_state=42)
dt_model.fit(X_train, y_train)

# Predict
y_pred_dt = dt_model.predict(X_test)

# Evaluate
print("Decision Tree R2:", r2_score(y_test, y_pred_dt))
print("MSE:", mean_squared_error(y_test, y_pred_dt))
print("MAE:", mean_absolute_error(y_test, y_pred_dt))


Decision Tree R2: 0.622075845135081
MSE: 0.495235205629094
MAE: 0.45467918846899225


# Random Forest Regressor

Random Forest Regressor – Definition & Explanation
✅ Definition:
Random Forest Regressor is an ensemble learning method that uses multiple decision trees to predict a continuous target variable. The final prediction is made by averaging the predictions of all individual decision trees.

🧠 Core Idea:
Build many decision trees, each trained on a bootstrap sample of the rows and restricted to a random subset of features at each split.

Each tree gives a prediction.

The final prediction is the average of all the tree predictions.

🔍 Why "Random"?
Random rows: Each tree is trained on a bootstrapped sample (random sample with replacement).

Random columns: At each split, only a random subset of features is considered.

This randomness:

Reduces overfitting

Increases model robustness

📏 How It Works (Steps):
Draw a bootstrap sample of the rows for each tree; at each split, consider only a random subset of the features.

Train many decision trees.

Predict using each tree.

Average the results for final output.
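
These steps can be sketched in plain Python with depth-1 trees (stumps) as the base learners. Everything here is a simplification for illustration: `fit_stump`, `predict_stump`, `fit_forest`, and `predict_forest` are hypothetical names, there is only one feature (so no feature subsampling), and the toy data is made up.

```python
import random

def fit_stump(xs, ys):
    """Fit a depth-1 regression tree: one threshold and two leaf means."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical feature values cannot be separated
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, (pairs[i - 1][0] + pairs[i][0]) / 2, lm, rm)
    if best is None:  # all x values identical in this sample: predict the mean
        mean = sum(ys) / len(ys)
        return (float("inf"), mean, mean)
    return best[1:]  # (threshold, left_mean, right_mean)

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def fit_forest(xs, ys, n_trees=25, seed=0):
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap: rows drawn with replacement
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest

def predict_forest(forest, x):
    preds = [predict_stump(stump, x) for stump in forest]
    return sum(preds) / len(preds)  # final output is the average over trees

# Toy data: targets near 1 for small x, near 5 for large x
xs = [1, 2, 3, 4, 10, 11, 12, 13]
ys = [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1]
forest = fit_forest(xs, ys)
low, high = predict_forest(forest, 2), predict_forest(forest, 12)
print(low, high)
```

Because each stump sees a different bootstrap sample, individual thresholds and leaf means vary; averaging smooths out that variation.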

📌 Key Parameters:

| Parameter | Description |
|---|---|
| n_estimators | Number of trees in the forest |
| max_depth | Maximum depth of each tree |
| min_samples_split | Minimum number of samples required to split a node |
| min_samples_leaf | Minimum number of samples at a leaf node |
| max_features | Number of features to consider at each split |
| bootstrap | Whether to use bootstrapped samples |

✅ Advantages:
Usually more accurate than a single decision tree

Fairly robust to outliers and noise

Handles large datasets and high-dimensional feature spaces

Reduces overfitting compared with a single decision tree

Provides feature importance estimates out of the box
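
As a small illustration of the feature importance point, the sketch below fits a forest on synthetic data where only the first of three features drives the target, then reads the fitted model's `feature_importances_` attribute (impurity-based importances in scikit-learn). The data and variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: only column 0 influences the target (illustrative assumption)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(400, 3))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=400)

forest_model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_toy, y_toy)
importances = forest_model.feature_importances_  # one value per feature, summing to 1
print(importances)
```

The same attribute is available on the `rf_model` fitted on the housing data, where pairing it with the column names shows which features the forest relies on most.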

❌ Limitations:
Less interpretable than a single tree

Slower prediction time (due to multiple trees)

Memory-intensive for very large forests

🧪 Use Cases:
Predicting house prices

Weather forecasting

Energy consumption estimation

Risk prediction in finance or insurance

📘 Summary Table:

| Feature | Value |
|---|---|
| Algorithm type | Ensemble (Bagging) |
| Base model | Decision Tree |
| Output | Average of predictions |
| Scaling required? | ❌ No |
| Handles non-linearity? | ✅ Yes |
| Robust to outliers? | ✅ Yes |

In [None]:
from sklearn.ensemble import RandomForestRegressor

# Model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predict
y_pred_rf = rf_model.predict(X_test)

# Evaluate
print("Random Forest R2:", r2_score(y_test, y_pred_rf))
print("MSE:", mean_squared_error(y_test, y_pred_rf))
print("MAE:", mean_absolute_error(y_test, y_pred_rf))


Random Forest R2: 0.8051230593157366
MSE: 0.2553684927247781
MAE: 0.32754256845930246


# SVR (Support Vector Regressor)

📘 SVR (Support Vector Regressor) – Definition
SVR (Support Vector Regressor) is a type of Support Vector Machine (SVM) that is used for regression problems (predicting continuous values).

✅ Definition:
SVR aims to find a function (or line, curve, surface) that predicts target values within a specified margin of tolerance (ε), while also being as flat as possible and ignoring small errors.

🔍 Key Concepts:
It does not try to minimize prediction error directly, but rather fits the best line within a tube of size ε around the data.

Only points on or outside the boundary of this ε-tube (called support vectors) influence the final model.

Uses kernel functions (like RBF, linear, polynomial) to handle non-linear data.

📏 Objective:
Find a function f(x) such that:

$$
|y_i - f(x_i)| \le \varepsilon
$$
for most training data points, and also keep f(x) as flat (simple) as possible.
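
The tolerance in this objective corresponds to the ε-insensitive loss, which can be written in a couple of lines. `epsilon_insensitive_loss` is a hypothetical helper name, and the default of 0.1 mirrors the default `epsilon` of scikit-learn's `SVR`.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Loss is zero inside the ε-tube and grows linearly outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

# A prediction inside the tube costs nothing; one outside is penalized
for prediction in (1.05, 1.5):
    print(prediction, epsilon_insensitive_loss(1.0, prediction))
```

A prediction that misses by 0.05 incurs no loss, while one that misses by 0.5 is penalized only for the 0.4 that exceeds ε.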

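📌 SVR Parameters:

| Parameter | Description |
|---|---|
| C | Regularization: controls the trade-off between keeping f(x) flat and penalizing errors larger than ε (a larger C penalizes those errors more) |
| epsilon (ε) | Width of the margin of tolerance; errors smaller than ε are not penalized |
| kernel | Function that implicitly maps data to higher dimensions (e.g., 'linear', 'rbf') |
| gamma | Kernel coefficient: defines how far the influence of a single training example reaches |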
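
The effect of `epsilon` can be observed through the number of support vectors a fitted model keeps. The sketch below assumes scikit-learn and NumPy are installed and uses a noisy sine curve made up for illustration; the two epsilon values are arbitrary choices.

```python
import numpy as np
from sklearn.svm import SVR

# Noisy sine curve (synthetic data, chosen only for illustration)
rng = np.random.default_rng(0)
X_sin = np.sort(rng.uniform(0, 5, size=(100, 1)), axis=0)
y_sin = np.sin(X_sin).ravel() + rng.normal(scale=0.1, size=100)

# Same kernel and C, different tube widths
narrow = SVR(kernel="rbf", C=1.0, epsilon=0.01).fit(X_sin, y_sin)
wide = SVR(kernel="rbf", C=1.0, epsilon=0.5).fit(X_sin, y_sin)

# support_ holds the indices of the training points kept as support vectors
print("support vectors with epsilon=0.01:", len(narrow.support_))
print("support vectors with epsilon=0.5: ", len(wide.support_))
```

A wider tube leaves more points strictly inside it, so fewer of them end up as support vectors and the fitted function is typically smoother.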

✅ Use Cases:
Predicting house prices

Stock price forecasting

Real estate valuations

Any regression task with non-linear trends

In [None]:
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler

# Feature Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Model
svr_model = SVR(kernel='rbf')
svr_model.fit(X_train_scaled, y_train)

# Predict
y_pred_svr = svr_model.predict(X_test_scaled)

# Evaluate
print("SVR R2:", r2_score(y_test, y_pred_svr))
print("MSE:", mean_squared_error(y_test, y_pred_svr))
print("MAE:", mean_absolute_error(y_test, y_pred_svr))


SVR R2: 0.7275628923016773
MSE: 0.357004031933865
MAE: 0.39859907695205365


In [None]:
pip install streamlit


# Using Streamlit

In [None]:
import streamlit as st
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

st.title("🏡 California Housing Price Predictor")
st.write("Select a regression algorithm to predict housing prices")

# Load dataset
data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model_choice = st.selectbox("Select Regressor", ["Decision Tree", "Random Forest", "SVR"])

# Initialize model and hyperparameters
if model_choice == "Decision Tree":
    st.subheader("Decision Tree Regressor Parameters")
    max_depth = st.slider("Max Depth", 1, 20, 5)
    model = DecisionTreeRegressor(max_depth=max_depth, random_state=42)

elif model_choice == "Random Forest":
    st.subheader("Random Forest Regressor Parameters")
    n_estimators = st.slider("Number of Trees", 10, 200, 100, step=10)
    max_depth = st.slider("Max Depth", 1, 20, 5)
    model = RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth, random_state=42)

elif model_choice == "SVR":
    st.subheader("Support Vector Regressor Parameters")
    kernel = st.selectbox("Kernel", ["linear", "rbf", "poly"])
    C = st.slider("Regularization (C)", 0.1, 10.0, 1.0)
    gamma = st.selectbox("Gamma", ["scale", "auto"])
    # Feature scaling
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    model = SVR(kernel=kernel, C=C, gamma=gamma)

# Train model
if st.button("Train & Predict"):
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)

    # Evaluation
    st.subheader("📊 Evaluation Metrics")
    st.write("R² Score:", r2_score(y_test, predictions))
    st.write("MSE:", mean_squared_error(y_test, predictions))
    st.write("MAE:", mean_absolute_error(y_test, predictions))

    st.subheader("🔍 Sample Predictions")
    results = pd.DataFrame({"Actual": y_test[:10], "Predicted": predictions[:10]})
    st.dataframe(results)

# Optional: Hyperparameter Tuning
if st.checkbox("🔧 Run Hyperparameter Tuning (GridSearchCV)"):
    st.write("This might take some time...")
    if model_choice == "Decision Tree":
        param_grid = {'max_depth': [3, 5, 10, 15]}
        search_model = GridSearchCV(DecisionTreeRegressor(random_state=42), param_grid, cv=3)
    elif model_choice == "Random Forest":
        param_grid = {'n_estimators': [50, 100], 'max_depth': [5, 10]}
        search_model = GridSearchCV(RandomForestRegressor(random_state=42), param_grid, cv=3)
    else:  # SVR
        # X_train and X_test were already standardized in the SVR branch above
        # (Streamlit reruns the whole script on every interaction), so they are
        # not scaled a second time here.
        param_grid = {
            'kernel': ['linear', 'rbf'],
            'C': [0.1, 1, 10],
            'gamma': ['scale', 'auto']
        }
        search_model = GridSearchCV(SVR(), param_grid, cv=3)

    search_model.fit(X_train, y_train)
    best_model = search_model.best_estimator_
    y_pred_best = best_model.predict(X_test)

    st.success(f"Best Parameters: {search_model.best_params_}")
    st.write("Best R² Score:", r2_score(y_test, y_pred_best))


Note: this script is meant to be launched with `streamlit run`. Executed as a notebook cell, Streamlit only logs a warning that session state does not function when running a script without `streamlit run`.
