Decision Tree as Regression

A decision tree can be used as a regression model to predict a continuous target variable. In regression, the goal is to predict an output value based on input features, and a decision tree regression model does this by recursively splitting the data into subsets based on the input features, creating a tree-like structure.

Here’s how a decision tree works for regression:

***Steps in Decision Tree Regression***

**Splitting:**

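The dataset is split into subsets on the feature and threshold that give the **highest reduction in variance (equivalently, the lowest Mean Squared Error when each subset predicts its own mean).** Other criteria, such as mean absolute error, can be used instead.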

The split is done recursively, choosing the best feature and threshold at each step to maximize homogeneity within subsets.
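The search for a single split can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper names `variance` and `best_split` are made up here, and taking midpoints between consecutive feature values as candidate thresholds is an assumption, since the text does not say how thresholds are enumerated. The data is the house-price table from the worked example in this document.

```python
from statistics import fmean


def variance(values):
    """Population variance (divide by N), the form used in this document."""
    mean = fmean(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split(xs, ys):
    """Return (threshold, variance_reduction) for the best single-feature split.

    Assumption of this sketch: candidate thresholds are the midpoints
    between consecutive distinct feature values.
    """
    pairs = sorted(zip(xs, ys))
    parent_var = variance([y for _, y in pairs])
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical feature values cannot be separated
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        # Child variances are weighted by each child's share of the samples.
        weighted = (len(left) * variance(left) + len(right) * variance(right)) / len(pairs)
        reduction = parent_var - weighted
        if reduction > best[1]:
            best = (threshold, reduction)
    return best


sizes = [600, 800, 1000, 1200, 1400, 1600, 1800, 2000, 2200, 2400]
prices = [150_000, 180_000, 200_000, 220_000, 240_000,
          280_000, 300_000, 320_000, 340_000, 360_000]
threshold, reduction = best_split(sizes, prices)
print(threshold, reduction)
```

On this table the search selects the midpoint between 1400 and 1600 sq ft, that is, 1500 sq ft.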

**Stopping Criteria:**

The splitting process stops when a stopping criterion is met, such as:
* A maximum depth is reached.
* A minimum number of samples in a node is reached.
* Further splits do not significantly reduce variance.
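The three stopping criteria above can be sketched as guards in a recursive builder. This is a simplified single-feature version under stated assumptions: the function name `build_tree`, the parameter names `max_depth`, `min_samples`, and `min_reduction`, and the nested-dict tree layout are all hypothetical choices for illustration.

```python
from statistics import fmean


def variance(values):
    """Population variance (divide by N)."""
    mean = fmean(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def build_tree(rows, depth=0, max_depth=2, min_samples=2, min_reduction=1e-9):
    """Build a regression tree from (x, y) pairs with one feature x.

    Leaves are {"value": mean_y}; internal nodes hold a threshold
    and "left" / "right" subtrees (layout is an assumption of this sketch).
    """
    ys = [y for _, y in rows]
    leaf = {"value": fmean(ys)}

    # Criteria 1 and 2: depth limit reached, or too few samples to split.
    if depth >= max_depth or len(rows) < min_samples:
        return leaf

    rows = sorted(rows)
    parent_var = variance(ys)
    best_gain, best_i, best_thr = 0.0, None, None
    for i in range(1, len(rows)):
        if rows[i - 1][0] == rows[i][0]:
            continue  # identical feature values cannot be separated
        left = [y for _, y in rows[:i]]
        right = [y for _, y in rows[i:]]
        weighted = (len(left) * variance(left) + len(right) * variance(right)) / len(rows)
        gain = parent_var - weighted
        if gain > best_gain:
            best_gain, best_i = gain, i
            best_thr = (rows[i - 1][0] + rows[i][0]) / 2

    # Criterion 3: no candidate split reduces the variance enough.
    if best_i is None or best_gain < min_reduction:
        return leaf

    return {
        "threshold": best_thr,
        "left": build_tree(rows[:best_i], depth + 1, max_depth, min_samples, min_reduction),
        "right": build_tree(rows[best_i:], depth + 1, max_depth, min_samples, min_reduction),
    }


sizes = [600, 800, 1000, 1200, 1400, 1600, 1800, 2000, 2200, 2400]
prices = [150_000, 180_000, 200_000, 220_000, 240_000,
          280_000, 300_000, 320_000, 340_000, 360_000]
tree = build_tree(list(zip(sizes, prices)), max_depth=2)
print(tree["threshold"])
```

Because the search tries every midpoint, the second-level thresholds it selects need not match thresholds that are picked by hand for illustration.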

**Prediction:**

* For predicting the value of a new sample, the sample is passed through the tree from the root to a leaf node, following the decision rules.
* The predicted value is typically the mean of the target values in the leaf node.
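The prediction step above amounts to a short loop. The sketch below hard-codes the tree from the house-price walkthrough in this document as a nested dict; the dict layout and the function name `predict` are assumptions made for illustration.

```python
def predict(node, x):
    """Walk from the root to a leaf and return the leaf's stored mean."""
    while "value" not in node:
        # Decision rule: go left when x <= threshold, otherwise go right.
        node = node["left"] if x <= node["threshold"] else node["right"]
    return node["value"]


# Tree transcribed from the house-price walkthrough (leaf values are leaf means).
tree = {
    "threshold": 1500,
    "left": {
        "threshold": 1000,
        "left": {"value": 176_667},
        "right": {"value": 230_000},
    },
    "right": {
        "threshold": 2000,
        "left": {"value": 300_000},
        "right": {"value": 350_000},
    },
}

print(predict(tree, 1800))  # 300000
```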

***Key Concepts***

**Variance Reduction:**

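* At each split, the goal is to minimize the variance of the target variable within the resulting subsets, with each subset weighted by its share of the samples. This parallels minimizing the sum of squared residuals in linear regression: each subset predicts its own mean, so its variance is its mean squared error.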

**Pruning:**

* Pruning is used to prevent overfitting. It removes branches that contribute little to predictive accuracy and are unlikely to generalize to new data.
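One simple way to picture pruning is a bottom-up pass that collapses any split whose variance reduction falls below a threshold. This is only one possible rule, chosen here for brevity; the text does not say which pruning method is intended. The tree layout, the `gain` field, the `min_gain` parameter, and the numbers in the toy tree are all hypothetical.

```python
def is_leaf(node):
    return "left" not in node


def prune(node, min_gain):
    """Collapse splits whose stored variance reduction is below min_gain.

    Children are pruned first, so a weak split is only removed once
    both of its children are leaves.
    """
    if is_leaf(node):
        return node
    left = prune(node["left"], min_gain)
    right = prune(node["right"], min_gain)
    if is_leaf(left) and is_leaf(right) and node["gain"] < min_gain:
        # Replace the whole branch by a leaf predicting the node's mean.
        return {"mean": node["mean"]}
    return {**node, "left": left, "right": right}


# Toy tree with made-up numbers: every node stores the mean of its samples,
# and internal nodes also store the variance reduction ("gain") of their split.
tree = {
    "mean": 10.0, "gain": 50.0, "threshold": 5,
    "left": {
        "mean": 4.0, "gain": 0.4, "threshold": 2,
        "left": {"mean": 3.8}, "right": {"mean": 4.3},
    },
    "right": {
        "mean": 16.0, "gain": 12.0, "threshold": 8,
        "left": {"mean": 12.0}, "right": {"mean": 20.0},
    },
}

pruned = prune(tree, min_gain=1.0)
print(pruned["left"])  # {'mean': 4.0}
```

The left branch is collapsed because its split gains only 0.4, while the right branch and the root are kept.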

***In decision tree regression, we calculate the variance of the parent node and the weighted average of the child variances (left and right child, each weighted by its share of the samples), measure the reduction, and keep the split that reduces the variance the most.***
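Written as a formula, the quantity compared across candidate splits is the following. The symbols are notation introduced here for illustration: S is the set of samples in the parent node, S_L and S_R are the left and right children, and N, N_L, N_R are their sample counts.

```latex
\Delta
  = \operatorname{Var}(S)
  - \left( \frac{N_L}{N}\,\operatorname{Var}(S_L)
         + \frac{N_R}{N}\,\operatorname{Var}(S_R) \right),
\qquad
\operatorname{Var}(S) = \frac{1}{N} \sum_{i \in S} \left( y_i - \bar{y} \right)^2
```

A split is worth making only when this reduction is positive, and the split with the largest reduction is preferred.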

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Decision Tree Regression Example</title>
<style>
body {
font-family: Arial, sans-serif;
line-height: 1.6;
}
h1, h2, h3 {
color: #333;
}
table {
width: 100%;
border-collapse: collapse;
margin-bottom: 20px;
}
table, th, td {
border: 1px solid #ddd;
}
th, td {
padding: 8px;
text-align: left;
}
th {
background-color: #f4f4f4;
}
.formula {
background-color: #f9f9f9;
border-left: 4px solid #ccc;
padding: 10px;
margin: 20px 0;
font-family: "Courier New", Courier, monospace;
}
.section {
margin-bottom: 30px;
}
</style>
</head>
<body>

<h1>Decision Tree Regression Example</h1>

<div class="section">
<h2>Dataset</h2>
<table>
<tr>
<th>House Size (sq ft)</th>
<th>Price ($)</th>
</tr>
<tr><td>600</td><td>150,000</td></tr>
<tr><td>800</td><td>180,000</td></tr>
<tr><td>1000</td><td>200,000</td></tr>
<tr><td>1200</td><td>220,000</td></tr>
<tr><td>1400</td><td>240,000</td></tr>
<tr><td>1600</td><td>280,000</td></tr>
<tr><td>1800</td><td>300,000</td></tr>
<tr><td>2000</td><td>320,000</td></tr>
<tr><td>2200</td><td>340,000</td></tr>
<tr><td>2400</td><td>360,000</td></tr>
</table>
</div>

<div class="section">
<h2>Initial Split</h2>
<h3>Calculate the Variance of the Entire Dataset</h3>
<div class="formula">
Variance = (1/N) &sum;(y<sub>i</sub> - &#563;)²
</div>
<p>
Mean price, &#563; = (150,000 + 180,000 + 200,000 + 220,000 + 240,000 + 280,000 + 300,000 + 320,000 + 340,000 + 360,000) / 10 = 2,590,000 / 10 = 259,000
</p>
<p>
Variance = (1/10) [(150000 - 259000)² + (180000 - 259000)² + ... + (360000 - 259000)²] = 46,090,000,000 / 10 = 4,609,000,000
</p>

<h3>Calculate the Variance for Different Splits</h3>
<p>Split at 1500 sq ft:</p>
<p>Subset 1 (≤ 1500 sq ft): {150000, 180000, 200000, 220000, 240000}</p>
<p>Subset 2 (> 1500 sq ft): {280000, 300000, 320000, 340000, 360000}</p>

<h3>Calculate Mean and Variance for Each Subset</h3>
<p>For Subset 1 (≤ 1500 sq ft):</p>
<p>
Mean price, &#563;<sub>left</sub> = (150000 + 180000 + 200000 + 220000 + 240000) / 5 = 198000
</p>
<p>
Variance<sub>left</sub> = (1/5) [(150000 - 198000)² + (180000 - 198000)² + ... + (240000 - 198000)²] = 4,880,000,000 / 5 = 976,000,000
</p>

<p>For Subset 2 (> 1500 sq ft):</p>
<p>
Mean price, &#563;<sub>right</sub> = (280000 + 300000 + 320000 + 340000 + 360000) / 5 = 320000
</p>
<p>
Variance<sub>right</sub> = (1/5) [(280000 - 320000)² + (300000 - 320000)² + ... + (360000 - 320000)²] = 4,000,000,000 / 5 = 800,000,000
</p>

<h3>Weighted Average Variance for the Split</h3>
<div class="formula">
Weighted Variance = (5/10) × 976,000,000 + (5/10) × 800,000,000 = 888,000,000
</div>

<h3>Compare the Variance Before and After the Split</h3>
<p>
Variance before the split: 4,609,000,000<br>
Weighted variance after the split: 888,000,000
</p>
<p>The split reduces the variance by 4,609,000,000 - 888,000,000 = 3,721,000,000 (about 81%), so this split is beneficial.</p>
</div>

<div class="section">
<h2>Recursive Splitting</h2>
<p>Repeat the above steps for each subset until a stopping criterion is met. For simplicity, we'll consider only one more level of splits.</p>

<h3>Left Subset (≤ 1500 sq ft) split</h3>
<p>Let's try 1000 sq ft:</p>
<p>Subset 1 (≤ 1000 sq ft): {150000, 180000, 200000}</p>
<p>Subset 2 (> 1000 sq ft): {220000, 240000}</p>

<h3>Calculate Mean and Variance for Each Subset</h3>
<p>For Subset 1 (≤ 1000 sq ft):</p>
<p>
Mean price, &#563;<sub>left</sub> = (150000 + 180000 + 200000) / 3 ≈ 176667
</p>
<p>
Variance<sub>left</sub> = (1/3) [(150000 - 176667)² + (180000 - 176667)² + (200000 - 176667)²] = 422,222,222
</p>

<p>For Subset 2 (> 1000 sq ft):</p>
<p>
Mean price, &#563;<sub>right</sub> = (220000 + 240000) / 2 = 230000
</p>
<p>
Variance<sub>right</sub> = (1/2) [(220000 - 230000)² + (240000 - 230000)²] = 100,000,000
</p>

<h3>Weighted Average Variance for the Split</h3>
<div class="formula">
Weighted Variance = (3/5) × 422,222,222 + (2/5) × 100,000,000 ≈ 253,333,333 + 40,000,000 = 293,333,333
</div>

<h3>Right Subset (> 1500 sq ft) split</h3>
<p>Let's try 2000 sq ft:</p>
<p>Subset 1 (≤ 2000 sq ft): {280000, 300000, 320000}</p>
<p>Subset 2 (> 2000 sq ft): {340000, 360000}</p>

<h3>Calculate Mean and Variance for Each Subset</h3>
<p>For Subset 1 (≤ 2000 sq ft):</p>
<p>
Mean price, &#563;<sub>left</sub> = (280000 + 300000 + 320000) / 3 = 300000
</p>
<p>
Variance<sub>left</sub> = (1/3) [(280000 - 300000)² + (300000 - 300000)² + (320000 - 300000)²] = 266,666,667
</p>

<p>For Subset 2 (> 2000 sq ft):</p>
<p>
Mean price, &#563;<sub>right</sub> = (340000 + 360000) / 2 = 350000
</p>
<p>
Variance<sub>right</sub> = (1/2) [(340000 - 350000)² + (360000 - 350000)²] = 100,000,000
</p>

<h3>Weighted Average Variance for the Split</h3>
<div class="formula">
Weighted Variance = (3/5) × 266,666,667 + (2/5) × 100,000,000 ≈ 160,000,000 + 40,000,000 = 200,000,000
</div>
</div>

<div class="section">
<h2>Final Decision Tree</h2>
<p>The final decision tree might look like this:</p>
<ul>
<li>Root split at 1500 sq ft.
<ul>
<li>Left subtree (≤ 1500 sq ft): further split at 1000 sq ft.
<ul>
<li>Left leaf (≤ 1000 sq ft): predict &#563; = 176667</li>
<li>Right leaf (> 1000 sq ft): predict &#563; = 230000</li>
</ul>
</li>
<li>Right subtree (> 1500 sq ft): further split at 2000 sq ft.
<ul>
<li>Left leaf (≤ 2000 sq ft): predict &#563; = 300000</li>
<li>Right leaf (> 2000 sq ft): predict &#563; = 350000</li>
</ul>
</li>
</ul>
</li>
</ul>
</div>

<div class="section">
<h2>Predictions</h2>
<p>Given a new house size, you follow the tree to make a prediction. For example, a house size of 1800 sq ft:</p>
<ul>
<li>Follow the right subtree (since 1800 > 1500).</li>
<li>Within the right subtree, follow the left subtree (since 1800 ≤ 2000).</li>
<li>Predicted price: $300,000.</li>
</ul>
</div>

</body>
</html>
