# Decision tree - Regression tree

## Recursive partitioning (partitions/subsets) - Example 1

<img src="img/linkedin.png" width="15"/> [LinkedIn - How is a regression tree built?](https://www.linkedin.com/learning/machine-learning-with-python-decision-trees/how-is-a-regression-tree-built)

SSR (sum of squared residuals): a residual is the difference between an observed data point and a reference data point, here the mean of the partition. The SSR is the sum of the squares of those residuals.
- The SSR of a partition quantifies the overall difference between the values in the partition and the average of the values in the partition.
- A high SSR implies that the values in the partition are dissimilar, or far from the mean (a partition that explains the data poorly).
- A low SSR implies that the values in the partition are similar, or close to the mean (a partition that explains the data well).

![SSR Formula](../../img/SSR_Formula.png)

So how does a regression tree algorithm use SSR to determine the best split?  
Consider this example:

<img src="img/regressionTree_splitSample01.png" style="width:300px; height:auto">

Let's assume that the first split the algorithm evaluates is at age 27.5, the halfway point between the data points for age 25 and those for age 30. The values in the left partition are 16.8, 43.9, and 50.4, and their average is about 37.0. Recall that a residual is the difference between an observed data point and a reference data point. Here the reference is the average value, so the residuals are the differences between each value and the mean. To get the SSR, we square the residuals and add them up, which comes out to 635.2.
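
The arithmetic for the left partition can be checked with a few lines of Python. This is only a sketch of the calculation described above; the function name `ssr` is an illustrative choice, not something from the course.

```python
def ssr(values):
    """Sum of squared residuals of `values` around their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


# Left-partition values quoted in the text (age below 27.5).
left_partition = [16.8, 43.9, 50.4]

print(round(sum(left_partition) / len(left_partition), 1))  # 37.0
print(round(ssr(left_partition), 1))                        # 635.2
```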

<img src="img/regressionTree_splitSample02.png" style="width:450px; height:auto">

Using the same approach for the right partition, we get an SSR of 13106.9.  
The combined sum of squared residuals for both partitions, if the data is split at age equal to 27.5, is the sum of the left SSR and the right SSR: 635.2 + 13106.9 = 13742.1.

 <img src="img/regressionTree_splitSample03.png" style="width:450px; height:auto">

<img src="img/regressionTree_splitSample04.png?v=1" style="width:390px; height:auto">

In order to determine the split that reduces variability the most, the regression tree algorithm evaluates the SSR based on each possible split, and chooses the one with the lowest SSR, which is the split where age is equal to 40.
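
A minimal sketch of this search, assuming the candidate splits are the midpoints between adjacent distinct ages. The `ages` and `salaries` values below are hypothetical (the data in the figures is not reproduced here), and `best_split` is an illustrative name.

```python
def ssr(values):
    """Sum of squared residuals of `values` around their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, combined_ssr) for the split with the lowest combined SSR."""
    distinct = sorted(set(xs))
    best = None
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2  # halfway between adjacent values
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        total = ssr(left) + ssr(right)  # left SSR + right SSR
        if best is None or total < best[1]:
            best = (threshold, total)
    return best


# Hypothetical data for illustration only, NOT the data shown in the figures.
ages = [25, 25, 30, 35, 40, 45, 50]
salaries = [30.0, 34.0, 32.0, 36.0, 35.0, 80.0, 84.0]

threshold, total = best_split(ages, salaries)
print(threshold, round(total, 1))  # 42.5 31.2
```

With this made-up data the salaries jump between ages 40 and 45, so the midpoint 42.5 gives the lowest combined SSR.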

<img src="img/regressionTree_splitSample05.png?v=1" style="width:500px">

This initial split creates the logic for the root node of our regression tree, which is shown here. It asks the question, is a worker 40 years old or younger? 

<img src="img/regressionTree_splitSample06.png" style="width:400px">

To create the branches and the next set of nodes, the regression tree algorithm makes some generalizations or simplifying assumptions based on the data in the two partitions. The first generalization it makes is based on the left partition. It estimates that if a worker is 40 years old or younger, then the annual salary will be 44,503, which is the average of the left partition. 

<img src="img/regressionTree_splitSample07.png" width="400px">

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------

## Recursive partitioning (partitions/subsets) - Example 2

<img src="img/youtubeIcon.jpg" alt="Play" width="15"/>[Regression Trees, Clearly Explained!!! - StatQuest with Josh Starmer](https://www.youtube.com/watch?v=g9c66TUylZ4)

Regression Trees are one of the fundamental machine learning techniques that more complicated methods, like Gradient Boost, are based on. They are useful for times when there isn't an obviously linear relationship between what you want to predict, and the things you are using to make the predictions. 

Now imagine we developed a new drug to cure the common cold. However, we don't know the optimal dosage to give patients. 
So we do a clinical trial with different dosages and measure how effective each dosage is. If the data looked like this, with higher doses generally being more effective, then we could easily fit a line to the data. And if someone told us they were taking a 27 milligram dose, we could use the line to predict that a 27 milligram dose should be 62% effective.

<img src="img/regressionStatquest_01.png" width="300px">

However, what if the data looked like this: low dosages are not effective, moderate dosages work really well, somewhat higher dosages work at about 50% effectiveness, and high dosages are not effective at all? In this case, fitting a straight line to the data will not be very useful. 

<img src="img/regressionStatquest_02.png" width="300px">

<img src="img/regressionStatquest_03.png" width="300px">

For example, if someone told us they were taking a 20 milligram dose, then we would predict that a 20 milligram dose should be 45% effective — even though the observed data says it should be 100% effective. So we need to use something other than a straight line to make predictions. 

<img src="img/regressionStatquest_04.png?vs=1" width="300px">

One option is to use a regression tree. Regression trees are a type of decision tree. In a regression tree, each leaf represents a numeric value. In contrast, classification trees have true or false, or some other discrete category, in their leaves.

<img src="img/regressionStatquest_05.png" width="400px">

With this regression tree, we start by asking if the dosage is less than 14.5. If so, then we are talking about these six observations in the training data, and the average drug effectiveness for these six observations is 4.2%. So the tree uses the average value, 4.2%, as its prediction for people with dosages less than 14.5.

<img src="img/regressionStatquest_06.png" width="500px">

On the other hand, if the dosage is greater than or equal to 14.5 and greater than or equal to 29, then we are talking about these four observations in the training dataset, and the average drug effectiveness for these four observations is 2.5%. So the tree uses the average value, 2.5%, as its prediction for people with dosages greater than or equal to 29.

<img src="img/regressionStatquest_07.png" width="500px">

Now, if the dosage is greater than or equal to 14.5 and less than 29 and greater than or equal to 23.5, then we are talking about these five observations in the training dataset, and the average drug effectiveness for these five observations is 52.8%. So the tree uses the average value, 52.8%, as its prediction for people with dosages between 23.5 and 29.

<img src="img/regressionStatquest_08.png?v=1" width="500px">

Lastly, if the dosage is greater than or equal to 14.5 and less than 29 and less than 23.5, then we are talking about these four observations in the training dataset, and the average drug effectiveness for these four observations is 100%. So the tree uses the average value, 100%, as its prediction for people with dosages between 14.5 and 23.5.
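
The four branches described above can be written as a chain of conditions. This is just a transcription of the thresholds and leaf averages quoted in the text; the function name `predict_effectiveness` is illustrative.

```python
def predict_effectiveness(dosage):
    """Predicted drug effectiveness (%) using the thresholds and leaf averages from the text."""
    if dosage < 14.5:
        return 4.2      # average of the six observations below 14.5
    if dosage >= 29:
        return 2.5      # average of the four observations at or above 29
    if dosage >= 23.5:
        return 52.8     # average of the five observations in [23.5, 29)
    return 100.0        # average of the four observations in [14.5, 23.5)


for dose in (10, 20, 27, 35):
    print(dose, predict_effectiveness(dose))
# 10 4.2
# 20 100.0
# 27 52.8
# 35 2.5
```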

<img src="img/regressionStatquest_09.png" width="500px">

Since each leaf corresponds to the average drug effectiveness in a different cluster of observations, the tree does a better job reflecting the data than the straight line.
At this point, you might be thinking, "The regression tree is cool, but I can also predict drug effectiveness just by looking at the graph." For example, if someone said they were taking a 27 milligram dose, then just by looking at the graph, I can tell that the drug will be about 50% effective. So why make a big deal about the regression tree?
When the data are super simple and we are only using one predictor — dosage — to predict drug effectiveness, making predictions by eye isn't terrible. 

<img src="img/regressionStatquest_10.png" width="500px">

But when we have three or more predictors, like dosage, age, and sex, to predict drug effectiveness, drawing a graph is very difficult, if not impossible. In contrast, a regression tree easily accommodates the additional predictors.

<img src="img/regressionStatquest_11.png" width="500px">

For example, if we wanted to predict the drug effectiveness for this patient, we would start by asking if they are older than 50. And since they are not over 50, we follow the branch on the right and ask if their dosage is greater than or equal to 29. And since their dosage is not greater than or equal to 29, we follow the branch on the right and ask if they are female. And since they are female, we follow the branch on the left and predict that the dosage will be 100% effective. And that's not too far off from the truth — 98%.

<img src="img/regressionStatquest_12.png" width="500px">

Okay, now that we know that regression trees can easily handle complicated data, let's go back to the original data with just one predictor — dosage — and talk about how to build this regression tree from scratch. And since regression trees are built from the top down, the first thing we do is figure out why we start by asking if dosage is less than 14.5.

<img src="img/regressionStatquest_13.png" width="500px">

Going back to the graph of the data, let's focus on the two observations with the smallest dosages. Their average dosage is three, and that corresponds to this dotted red line. Now we can build a very simple tree that splits the observations into two groups based on whether or not dosage is less than three.

The point on the far left is the only one with dosage less than three, and the average drug effectiveness for that one point is zero. So we put zero in the leaf on the left side for when dosage is less than three. All of the other points have dosages greater than or equal to three, and the average drug effectiveness for all of the points with dosages greater than or equal to three is 38.8 (the green line). 

So we put 38.8 in the leaf on the right side for when dosage is greater than or equal to three. The values in each leaf are the predictions that this simple tree will make for drug effectiveness. For example, the point on the far left has dosage less than three, and the tree predicts that the drug effectiveness will be zero. The prediction for this point — drug effectiveness equals zero — is pretty good since it is the same as the observed value.

<img src="img/regressionStatquest_14.png" width="400px">

In contrast, for this point which has dosage greater than three, the tree predicts that the drug effectiveness will be 38.8. And that prediction is not very good since the observed drug effectiveness is 100%.

<img src="img/regressionStatquest_15.png" width="400px">

Note: we can visualize how bad the prediction is by drawing a dotted line between the observed and predicted values. In other words, the dotted line is a residual. 

<img src="img/regressionStatquest_16.png" width="400px">

For each point in the data, we can draw its residual — the difference between the observed and predicted values — and we can use the residuals to quantify the quality of these predictions.

<img src="img/regressionStatquest_17.png" width="400px">

Starting with the only point with dosage less than three, we calculate the difference between its observed drug effectiveness (zero) and the predicted drug effectiveness (zero), and then square the difference. In other words, this is the squared residual for the first point.

<img src="img/regressionStatquest_18.png?v=1" width="400px">

Now we add the squared residuals for the remaining points with dosages greater than or equal to three. In other words, for this point, we calculate the difference between the observed and predicted values and square it, and then add it to the first term. Then we do the same thing for the next point, and the next point, and the rest of the points until we have added squared residuals for every point.
Thus, to evaluate the predictions made when the threshold is dosage less than three, we add up the squared residuals for every point and get 27,468.5. 
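
As a rough sketch of this calculation, the function below evaluates a single threshold: each side predicts its own average, and the squared residuals of all points are added up. The data is hypothetical, so it does not reproduce the 27,468.5 from the figure, and `stump_ssr` is an illustrative name.

```python
def stump_ssr(dosages, effectiveness, threshold):
    """Return (left_prediction, right_prediction, ssr) for one threshold.

    Assumes the threshold lies strictly between the smallest and largest
    dosage, so that neither side is empty.
    """
    left = [e for d, e in zip(dosages, effectiveness) if d < threshold]
    right = [e for d, e in zip(dosages, effectiveness) if d >= threshold]
    left_pred = sum(left) / len(left)     # value stored in the left leaf
    right_pred = sum(right) / len(right)  # value stored in the right leaf

    total = 0.0
    for d, e in zip(dosages, effectiveness):
        predicted = left_pred if d < threshold else right_pred
        total += (e - predicted) ** 2     # squared residual for this point
    return left_pred, right_pred, total


# Hypothetical observations for illustration only.
dosages = [2, 4, 6, 8, 10]
effectiveness = [0, 0, 50, 100, 100]

print(stump_ssr(dosages, effectiveness, 3))  # (0.0, 62.5, 6875.0)
```

With the same made-up data, moving the threshold to 5 gives a smaller sum of squared residuals (about 1666.7), which is the kind of comparison the next steps rely on.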

<img src="img/regressionStatquest_19.png" width="400px">

Note: we can plot the sum of squared residuals on a graph. The y-axis corresponds to the sum of squared residuals, and the x-axis corresponds to dosage thresholds. In this case, the dosage threshold was three.

<img src="img/regressionStatquest_20.png" width="400px">

But if we focus on the next two points in the graph and calculate their average dosage, which is five, then we can use dosage less than five as a new threshold. And using dosage less than five gives us new predictions and new residuals, and that means we can add a new sum of squared residuals to our graph.
In this case, the new threshold — dosage less than five — results in a smaller sum of squared residuals. And that means using dosage less than five as the threshold resulted in better predictions overall. 

<img src="img/regressionStatquest_21.png" width="400px">

Now let's focus on the next two points. Calculate their average, which is seven, and use dosage less than seven as a new threshold.
Again, the new threshold gives us new predictions, new residuals, and a new sum of squared residuals. Now shift the threshold over to the average dosage for the next two points and add a new sum of squared residuals to the graph. And we repeat until we have calculated the sum of squared residuals for all of the remaining thresholds. 
Now we can see the sum of squared residuals for all of the thresholds, and dosage less than 14.5 has the smallest sum of squared residuals. So dosage less than 14.5 will be the root of the tree.
In summary, we split the data into two groups by finding the threshold that gave us the smallest sum of squared residuals. Now let's focus on the six observations with dosage less than 14.5 that ended up in the node to the left of the root.

In theory, we could split these six observations into two smaller groups, just like we did before, by calculating the sum of squared residuals for different thresholds and choosing the threshold with the lowest sum of squared residuals.

<img src="img/regressionStatquest_22.png" width="400px">

<img src="img/regressionStatquest_23.png" width="400px">

Note: this observation has dosage less than 14.5 and does not have dosage less than 11.5, so it is the only observation to end up in this node. And since we can't split a single observation into two groups, we will call this node a leaf.

<img src="img/regressionStatquest_24.png" width="400px">

However, since the remaining five observations go to the other node, we can split them once more. Now we have divided the observations with dosage less than 14.5 into three separate groups. These two leaves only contain one observation each and cannot be split into smaller groups. In contrast, this leaf contains four observations.
That said, those four observations all have the same drug effectiveness, so we don't need to split them into smaller groups. So we are done splitting the observations with dosage less than 14.5 into smaller groups.
Note: the predictions that this tree makes for all observations with dosage less than 14.5 are perfect. In other words, this observation has 20% drug effectiveness, and the tree predicts 20% drug effectiveness. So the observed and predicted values are the same. This observation has 5% drug effectiveness, and that's exactly what the tree predicts. These four observations all have 0% drug effectiveness, and that's exactly what the tree predicts.

<img src="img/regressionStatquest_25.png" width="400px">

<img src="img/regressionStatquest_26.png" width="400px">

Is that awesome? **No**. 

When a model fits the training data perfectly, it probably means it is overfit and will not perform well with new data. In machine learning lingo, the model has no **bias** but potentially large **variance**. Bummer.
Is there a way to prevent our tree from overfitting the training data? Yes! There are a bunch of techniques. The simplest is to only split observations when there are more than some minimum number. A commonly used minimum number of observations to allow for a split is 20 (for example, it is the default `minsplit` in R's `rpart`).
However, since this example doesn't have many observations, I set the minimum to 7. In other words, since there are only six observations with dosage less than 14.5, we will not split the observations in this node. Instead, this node will become a leaf, and the output will be the average drug effectiveness for the six observations with dosage less than 14.5: 4.2%.

Now we need to figure out what to do with the remaining 13 observations with dosages greater than or equal to 14.5. Since we have more than seven observations on the right side, we can split them into two groups. And we do that by finding the threshold that gives us the smallest sum of squared residuals.
Note: there are only four observations with dosage greater than or equal to 29, so there are only four observations in this node. Because it contains fewer than seven observations, we make this node a leaf, and the output will be the average drug effectiveness for these four observations: 2.5%.

<img src="img/regressionStatquest_27.png" width="400px">

<img src="img/regressionStatquest_28.png" width="400px">

Now we need to figure out what to do with the nine observations with dosages between 14.5 and 29. Since we have more than seven observations, we can split them into two groups by finding the threshold that gives us the minimum sum of squared residuals.
Note: since each of these two groups has fewer than seven observations, this is the last split.
So we use the average drug effectiveness for the observations with dosages between 14.5 and 23.5 — 100% — as the output for the leaf on the right. And we use the average drug effectiveness for observations with dosages between 23.5 and 29 — 52.8% — as the output for the leaf on the left.
Since no leaf has more than seven observations in it, we're done building the tree. Each leaf corresponds to the average drug effectiveness from a different cluster of observations.
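
The whole procedure (find the best threshold, split, and stop when a node is too small) can be sketched recursively. This is a minimal illustration under stated assumptions, not the video's implementation: a node with fewer than `min_samples_split` observations becomes a leaf, nodes whose values are all identical are not split, candidate thresholds are midpoints between adjacent dosages, and the 19 data points are hypothetical.

```python
from pprint import pprint


def ssr(values):
    """Sum of squared residuals of `values` around their mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_threshold(xs, ys):
    """Midpoint threshold with the lowest combined SSR (None if xs has one distinct value)."""
    distinct = sorted(set(xs))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

    def combined_ssr(t):
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        return ssr(left) + ssr(right)

    return min(candidates, key=combined_ssr) if candidates else None


def build_tree(xs, ys, min_samples_split=7):
    """Split recursively until a node is too small, already uniform, or unsplittable."""
    leaf = {"leaf": sum(ys) / len(ys), "n": len(ys)}
    # Assumption: fewer than `min_samples_split` observations means "make a leaf".
    if len(ys) < min_samples_split or ssr(ys) == 0:
        return leaf
    threshold = best_threshold(xs, ys)
    if threshold is None:
        return leaf
    left = [(x, y) for x, y in zip(xs, ys) if x < threshold]
    right = [(x, y) for x, y in zip(xs, ys) if x >= threshold]
    return {
        "threshold": threshold,  # go left when x < threshold
        "left": build_tree(*zip(*left), min_samples_split),
        "right": build_tree(*zip(*right), min_samples_split),
    }


# Hypothetical dosage/effectiveness data, loosely shaped like the example.
dosages = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38]
effects = [0, 0, 0, 0, 5, 20, 100, 100, 100, 100, 60, 55, 50, 50, 49, 5, 0, 5, 0]

pprint(build_tree(dosages, effects, min_samples_split=7))
```

Raising `min_samples_split` gives a shallower tree with fewer leaves, which is the overfitting control described above.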

<img src="img/regressionStatquest_29.png" width="400px">

<img src="img/regressionStatquest_30.png" width="400px">

So far, we have built a tree using a single predictor — dosage — to predict drug effectiveness. Now let's talk about how to build a tree to predict drug effectiveness using a bunch of predictors.
Just like before, we will start by using dosage to predict drug effectiveness. Thus, just like before, we will try different thresholds for dosage and calculate the sum of squared residuals at each step and pick the threshold that gives us the minimum sum of squared residuals. The best threshold becomes a candidate for the root.

<img src="img/regressionStatquest_31.png" width="400px">

<img src="img/regressionStatquest_32.png" width="350px">

Now we focus on using age to predict drug effectiveness. Just like with dosage, we try different thresholds for age and calculate the sum of squared residuals at each step and pick the one that gives us the minimum sum of squared residuals. The best threshold becomes another candidate for the root.
Now we focus on using sex to predict drug effectiveness. With sex, there is only one threshold to try, so we use that threshold to calculate the sum of squared residuals, and that becomes another candidate for the root.
Now we compare the sum of squared residuals (SSRs) for each candidate and pick the candidate with the lowest value. 

Since age greater than 50 had the lowest sum of squared residuals, it becomes the root of the tree.
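
A small sketch of how the root could be chosen across several predictors: find the best threshold for each predictor, then keep the candidate with the lowest sum of squared residuals. The rows below are hypothetical, sex is encoded as a 0/1 column named `female` (an assumption, which leaves a single threshold of 0.5 to try), and the function names are illustrative.

```python
def ssr(values):
    """Sum of squared residuals of `values` around their mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split_for_feature(rows, feature, target):
    """Return (threshold, combined_ssr) for one predictor, or None if it cannot be split."""
    values = sorted({row[feature] for row in rows})
    best = None
    for a, b in zip(values, values[1:]):
        t = (a + b) / 2
        left = [r[target] for r in rows if r[feature] < t]
        right = [r[target] for r in rows if r[feature] >= t]
        total = ssr(left) + ssr(right)
        if best is None or total < best[1]:
            best = (t, total)
    return best


def choose_root(rows, features, target):
    """Return (feature, threshold, combined_ssr) for the candidate with the lowest SSR."""
    candidates = {}
    for f in features:
        split = best_split_for_feature(rows, f, target)
        if split is not None:
            candidates[f] = split
    feature = min(candidates, key=lambda f: candidates[f][1])
    return (feature, *candidates[feature])


# Hypothetical rows for illustration only.
rows = [
    {"dosage": 30, "age": 20, "female": 1, "effect": 90},
    {"dosage": 10, "age": 30, "female": 0, "effect": 95},
    {"dosage": 20, "age": 40, "female": 1, "effect": 100},
    {"dosage": 15, "age": 60, "female": 0, "effect": 5},
    {"dosage": 35, "age": 70, "female": 1, "effect": 0},
    {"dosage": 25, "age": 80, "female": 0, "effect": 10},
]

print(choose_root(rows, ["dosage", "age", "female"], "effect"))  # ('age', 50.0, 100.0)
```

In this made-up data only age separates the high and low effectiveness values cleanly, so the age candidate wins; growing the rest of the tree repeats the same comparison inside each node.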

Then we grow the tree just like before, except now we compare the lowest sum of squared residuals from each predictor. And just like before, when a node has fewer than a minimum number of observations (usually 20, but we are using 7), we stop trying to divide them.

<img src="img/regressionStatquest_33.png" width="200px">

<img src="img/regressionStatquest_34.png" width="400px">

In summary, regression trees are a type of decision tree. In a regression tree, each leaf represents a numeric value. We determine how to divide the observations by trying different thresholds and calculating the sum of squared residuals at each step.
The threshold with the smallest sum of squared residuals becomes a candidate for the root of the tree. If we have more than one predictor, we find the optimal threshold for each one and pick the candidate with the smallest sum of squared residuals to be the root.
When we have fewer than some minimum number of observations in a node — 7 in this example, but more commonly 20 — then that node becomes a leaf. Otherwise, we repeat the process to split the remaining observations until we can no longer split the observations into smaller groups. And then we are done.