## Regression Tree

A regression tree is a type of decision tree used for solving regression problems. Instead of predicting discrete classes, like in classification, regression trees predict continuous values. They work by recursively partitioning the input space into smaller regions and assigning a numerical value to each region based on the average (or another summary statistic) of the target variable for the data points within that region.

Here's a simplified explanation of how regression trees work:

1. **Start with the Whole Dataset**: At the beginning, the entire dataset is considered one big region.

2. **Select a Splitting Feature**: The algorithm looks for the feature and the threshold that best splits the dataset into two smaller regions. The goal is to minimize the variance of the target variable within each region.

3. **Create Two Child Nodes**: Once the best split is found, the dataset is divided into two subsets based on the chosen feature and threshold.

4. **Repeat the Process**: The process continues recursively for each subset, with the algorithm searching for the best split in each subset until a stopping criterion is met. This criterion could be reaching a maximum depth, having too few samples in a node to split it further, or finding no split that reduces the variance any further.

5. **Assign Predictions**: Finally, when the tree is fully grown, each terminal node (or leaf node) contains a prediction value. This value is typically the average of the target variable for the data points in that region.

6. **Making Predictions**: To make predictions for new data points, you traverse the tree from the root node down to a leaf node based on the values of the features. Once you reach a leaf node, you use the prediction value associated with that node as the predicted output.

In summary, regression trees partition the input space into smaller regions and assign a continuous value to each region based on the average of the target variable. They're intuitive and easy to interpret, making them useful for tasks where understanding the decision-making process is important.
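
The link between regions and predictions can be checked directly. The sketch below is a minimal illustration on synthetic data (the noisy step function, the seed, and the variable names are assumptions made for this example, not part of the original notebook). It fits a shallow tree with scikit-learn and confirms that the prediction for a new point equals the mean target of the training samples that share its leaf.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D data (assumed for illustration): a noisy step function.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.where(X[:, 0] < 5, 1.0, 3.0) + rng.normal(0, 0.1, size=200)

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

# apply() returns the index of the leaf that each sample ends up in.
leaf_ids = tree.apply(X)

# Mean of the training targets in every leaf (every region).
leaf_means = {leaf: y[leaf_ids == leaf].mean() for leaf in np.unique(leaf_ids)}

# Traverse the tree for a new point and compare with the mean of its leaf.
x_new = np.array([[2.5]])
leaf_of_new = tree.apply(x_new)[0]
prediction = tree.predict(x_new)[0]
print(prediction, leaf_means[leaf_of_new])
```

With the default `squared_error` criterion the two printed numbers should agree, because the mean is the constant that minimizes squared error within a leaf.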

#### Regression trees vs Linear regression

![](https://i.stack.imgur.com/yE8JP.png)

Regression trees can be advantageous over linear regression in certain scenarios due to their flexibility and ability to capture nonlinear relationships in the data. Here are some situations where regression trees might be preferred over linear regression:

1. **Nonlinear Relationships**: When the relationship between the independent variables and the dependent variable is nonlinear, regression trees can capture these nonlinearities more effectively than linear regression. Linear regression assumes a linear relationship between the independent and dependent variables, while regression trees can model complex nonlinear relationships through recursive partitioning of the feature space.

2. **Interaction Effects**: Regression trees can automatically capture interaction effects between variables without explicitly specifying them. In linear regression, you need to manually include interaction terms, which can become cumbersome and may not capture all interactions accurately.

3. **Robustness to Outliers**: Regression trees are generally less sensitive to outliers in the input features, because a split depends only on the ordering of a feature's values and not on their magnitude. In linear regression, by contrast, a few extreme points can significantly shift the estimated coefficients and bias the results. Outliers in the target variable can still distort a tree trained with squared error, since each leaf predicts the mean of its samples, but the effect is confined to the leaves that contain those points.

4. **Handling Categorical Variables**: In principle, regression trees can split directly on categorical variables without one-hot encoding or other preprocessing. Linear regression typically requires categorical variables to be converted into dummy variables, which can increase the dimensionality of the feature space and potentially lead to overfitting. Note that some implementations, including scikit-learn's `DecisionTreeRegressor`, accept only numeric input, so categorical features still have to be encoded before fitting.

5. **Interpretability**: Regression trees offer intuitive interpretability, as the decision-making process can be visualized as a tree structure. This makes it easier to understand and explain the factors driving predictions compared to linear regression models, which may have coefficients that are harder to interpret, especially when dealing with interactions or multicollinearity.

However, it's essential to consider the trade-offs when choosing between regression trees and linear regression. Regression trees can suffer from overfitting, especially if not properly regularized, and they may not perform well with small datasets. Additionally, they may not generalize as well to new data compared to linear regression, particularly if the relationships in the data are relatively simple and linear. Therefore, the choice between regression trees and linear regression should be based on the specific characteristics of the dataset and the goals of the analysis.
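
To make the trade-off concrete, here is a small sketch that compares the two models on a deliberately nonlinear problem. The sine-shaped data, noise level, and `max_depth=4` are assumptions chosen for illustration; on data with a genuinely linear trend the ranking would be expected to flip.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic nonlinear data (assumed for illustration): one period of a sine wave.
rng = np.random.default_rng(42)
X = rng.uniform(0, 2 * np.pi, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

linear = LinearRegression().fit(X_train, y_train)
tree = DecisionTreeRegressor(max_depth=4, random_state=42).fit(X_train, y_train)

# Compare held-out R^2: the straight line cannot follow the curve.
r2_linear = r2_score(y_test, linear.predict(X_test))
r2_tree = r2_score(y_test, tree.predict(X_test))
print(f"linear R^2: {r2_linear:.3f}, tree R^2: {r2_tree:.3f}")
```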

### How Do Regression Trees Find Splitting Criteria?

Regression trees find splitting criteria by selecting the feature and the threshold that minimizes some measure of impurity or variance within the resulting subsets. The process involves evaluating all possible splits on each feature and selecting the one that optimally separates the data into more homogeneous subsets.

Here's a step-by-step explanation of how regression trees find splitting criteria:

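1. **Evaluate Potential Splits**: For each feature in the dataset, the algorithm evaluates all possible split points. For continuous features, the candidate thresholds are usually the midpoints between consecutive sorted unique values of that feature, since any threshold between two adjacent values produces the same partition. For categorical features, candidate splits divide the categories into two groups, for example one category against all the others.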

2. **Calculate Impurity or Variance Reduction**: For each split point, the algorithm calculates a measure of impurity or variance within the resulting subsets. Common measures used for regression trees include:
   - Mean Squared Error (MSE): Measures the average squared difference between the actual and predicted values within each subset.
   - Total Variance Reduction: Measures the reduction in variance of the target variable achieved by the split.
   
3. **Select Best Split**: The algorithm selects the split point that maximizes the reduction in impurity or variance. This split point becomes the splitting criterion for that particular feature.

4. **Repeat for All Features**: Steps 1-3 are repeated for all features in the dataset. The algorithm evaluates the potential splits for each feature and selects the best overall splitting criterion based on the chosen impurity or variance measure.

5. **Choose the Best Split Among All Features**: Finally, the algorithm selects the feature and split point that result in the greatest reduction in impurity or variance across all features. This becomes the final splitting criterion used to partition the data into two subsets at that node of the tree.

By iteratively selecting the feature and split point that minimizes impurity or variance, regression trees can recursively partition the input space into regions that lead to more accurate predictions of the target variable. This process continues until a stopping criterion is met, such as reaching a maximum tree depth or having all leaf nodes contain a minimum number of samples.
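
The search described above can be sketched in a few lines of NumPy. This is a simplified illustration, not scikit-learn's actual implementation: the helper names `best_split` and `best_split_all_features` are hypothetical, the toy arrays are made up, and the score is the sample-weighted variance of the two children (lower is better, which is equivalent to maximizing variance reduction).

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, weighted child variance) of the best split on one feature."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    n = len(y)
    best_threshold, best_score = None, np.inf
    for i in range(1, n):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # identical values cannot be separated by a threshold
        left, right = y_sorted[:i], y_sorted[i:]
        # Sample-weighted average of the variance inside each child.
        score = (len(left) * left.var() + len(right) * right.var()) / n
        if score < best_score:
            best_score = score
            # Candidate threshold: midpoint between two adjacent distinct values.
            best_threshold = (x_sorted[i - 1] + x_sorted[i]) / 2
    return best_threshold, best_score

def best_split_all_features(X, y):
    """Run best_split on every column and return (feature index, threshold, score)."""
    results = [(j, *best_split(X[:, j], y)) for j in range(X.shape[1])]
    return min(results, key=lambda r: r[2])

# Toy data (assumed): column 0 separates low and high targets, column 1 does not.
X = np.array([[1.0, 3.0], [2.0, 1.0], [3.0, 2.0],
              [10.0, 1.0], [11.0, 3.0], [12.0, 2.0]])
y = np.array([5.0, 6.0, 5.5, 20.0, 21.0, 19.5])

feature, threshold, score = best_split_all_features(X, y)
print(feature, threshold, score)
```

Growing a full tree amounts to calling this search on each child subset in turn until a stopping criterion applies.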

#### Decision Tree Regressor Hyperparameters

```
class sklearn.tree.DecisionTreeRegressor(*, criterion='squared_error', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, ccp_alpha=0.0, monotonic_cst=None)
```


1. **criterion**:
   - The criterion used to measure the quality of a split. The default is 'squared_error', which minimizes the mean squared error (MSE) between the actual values and the leaf prediction. Other options are 'friedman_mse', 'absolute_error', and 'poisson'.

2. **splitter**:
   - The strategy used to choose the split at each node. It can be 'best' to choose the best split or 'random' to choose the best random split.

3. **max_depth**:
   - The maximum depth of the tree. It limits the number of levels in the tree. Default is None, which means nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

4. **min_samples_split**:
   - The minimum number of samples required to split an internal node. Default is 2.

5. **min_samples_leaf**:
   - The minimum number of samples required to be at a leaf node. Default is 1.

6. **min_weight_fraction_leaf**:
   - The minimum weighted fraction of the sum total of weights required to be at a leaf node. Default is 0.0.

7. **max_features**:
   - The number of features to consider when looking for the best split. It can be an int, a float (interpreted as a fraction of the features), 'sqrt', 'log2', or None. Default is None, which uses all features. The older 'auto' option was removed in scikit-learn 1.3.

8. **random_state**:
   - Seed used by the random number generator. It controls the randomness of the estimator. Default is None.

9. **max_leaf_nodes**:
   - The maximum number of leaf nodes in the tree. Default is None.

10. **min_impurity_decrease**:
    - A node will be split if this split induces a decrease of the impurity greater than or equal to this value. Default is 0.0.

11. **ccp_alpha**:
    - Complexity parameter used for Minimal Cost-Complexity Pruning. Default is 0.0.

12. **monotonic_cst**:
    - Monotonic constraints for the decision tree. Default is None.

These hyperparameters control various aspects of the decision tree regressor, such as its depth, the criteria for splitting, the minimum size of its leaves, and how aggressively it is pruned. By adjusting these parameters, you can tailor the decision tree model to suit the specific characteristics of your dataset and the requirements of your regression problem.
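
As a rough illustration of how a few of these settings constrain the tree, the sketch below fits four regressors on the same synthetic data and prints their depth and leaf count. The data, the seed, and the particular values (`max_depth=3`, `min_samples_leaf=20`, `ccp_alpha=0.05`) are assumptions picked for the example, not recommendations.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumed): a quadratic signal in the first of two features.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 2))
y = X[:, 0] ** 2 + rng.normal(0, 0.5, size=300)

models = {
    "unconstrained": DecisionTreeRegressor(random_state=0),
    "max_depth=3": DecisionTreeRegressor(max_depth=3, random_state=0),
    "min_samples_leaf=20": DecisionTreeRegressor(min_samples_leaf=20, random_state=0),
    "ccp_alpha=0.05": DecisionTreeRegressor(ccp_alpha=0.05, random_state=0),
}

for name, model in models.items():
    model.fit(X, y)
    # get_depth() and get_n_leaves() describe the size of the fitted tree.
    print(f"{name}: depth={model.get_depth()}, leaves={model.get_n_leaves()}")
```

The unconstrained tree keeps splitting until its leaves are pure, which is why the constrained variants end up much smaller.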

### Code Example

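In [1]:
import warnings

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import r2_score
# load_boston was removed in scikit-learn 1.2, and criterion='squared_error'
# needs 1.0 or later, so this notebook runs on scikit-learn 1.0 or 1.1.
from sklearn.datasets import load_boston

# Silence the deprecation warnings raised by load_boston and the old criterion names.
warnings.filterwarnings('ignore')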

In [2]:
boston = load_boston()
df = pd.DataFrame(boston.data)

In [3]:
df.head()

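|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |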


In [4]:
df.columns = boston.feature_names
df['MEDV'] = boston.target

In [5]:
df.head()

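|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |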


In [6]:
X = df.iloc[:,0:-1]
y = df.iloc[:,-1]

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state=42)

In [8]:
rt = DecisionTreeRegressor(criterion= 'squared_error', max_depth= 5)

In [9]:
rt.fit(X_train, y_train)

In [10]:
y_pred = rt.predict(X_test)

In [11]:
r2_score(y_test, y_pred)

0.8833565347917997

#### Hyperparameter Tuning

In [12]:
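# 'mse' and 'mae' are the older names of 'squared_error' and 'absolute_error'.
# They were deprecated in scikit-learn 1.0 and removed in 1.2.
param_grid = {
    'max_depth': [2, 4, 8, 10, None],
    'criterion': ['mse', 'mae'],
    'max_features': [0.25, 0.5, 1.0],       # floats are fractions of the features
    'min_samples_split': [0.25, 0.5, 1.0]   # floats are fractions of the samples
}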

In [13]:
reg = GridSearchCV(DecisionTreeRegressor(), param_grid= param_grid)

In [14]:
reg.fit(X_train, y_train)

In [15]:
reg.best_score_

0.6254023352956153

In [16]:
reg.best_params_

{'criterion': 'mse',
 'max_depth': 4,
 'max_features': 1.0,
 'min_samples_split': 0.25}

### Feature Importance

In [17]:
for importance, name in sorted(zip(rt.feature_importances_, X_train.columns), reverse=True):
    print(name, importance)

RM 0.6292222501550527
LSTAT 0.20562720153418312
DIS 0.07272221949122784
CRIM 0.05725882102630604
TAX 0.01669708628286318
AGE 0.00617612617436646
PTRATIO 0.0055650568587021785
NOX 0.0035610403857025503
INDUS 0.0026274687266826754
B 0.0005427293649131192
ZN 0.0
RAD 0.0
CHAS 0.0


In a decision tree model, `rt.feature_importances_` refers to an attribute that holds the importance scores of each feature in the dataset as determined by the trained decision tree model `rt`.

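The feature importance represents the relative importance of each feature in making accurate predictions with the decision tree model. It is calculated from how much each feature reduces impurity across all the nodes where it is used for a split, weighted by the number of samples reaching those nodes, and then normalized so that the scores sum to 1. For a regression tree the impurity is the variance (MSE) of the target, rather than the Gini impurity or entropy used in classification.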

Higher feature importance values indicate that the feature is more influential in making decisions within the decision tree model. These scores can provide insights into which features are most informative or relevant for the model's predictions.

In the code snippet provided earlier, `rt.feature_importances_` is used to obtain the feature importance scores, which are then zipped with the column names of the features (`X_train.columns`). This allows for sorting and printing the features along with their corresponding importance scores.
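
A quick sanity check of how these scores behave is to build a dataset where the answer is known in advance. In the sketch below, which uses synthetic data and hypothetical feature names (`signal`, `noise_1`, `noise_2`), only the first column influences the target, so it should receive nearly all of the importance.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
# Only the first column drives the target; the other two are pure noise (assumed setup).
y = 3.0 * X[:, 0] + rng.normal(0, 0.1, size=500)

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)

importances = tree.feature_importances_
names = ["signal", "noise_1", "noise_2"]  # hypothetical labels for the three columns

# Print features from most to least important.
for name, importance in sorted(zip(names, importances), key=lambda p: p[1], reverse=True):
    print(f"{name}: {importance:.3f}")

print("sum of importances:", importances.sum())
```

Because the scores are normalized, they describe each feature's share of the total impurity reduction within this one tree, not an absolute effect size.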