# Decision Tree Regression

## Intuition behind Decision Tree Regression

Decision Tree Regression is the regression half of CART (Classification and Regression Trees): the same tree-building procedure, applied to a continuous target.

![Decision Tree Regression - Intuition](Decision-Tree-Regression-Intuition-01.PNG)

Suppose our dataset contains two features, X1 and X2, and a target variable y. X1 and X2 can be shown on a scatter plot as below. Notice that y lies along a third dimension.

![Decision Tree Regression - Dataset](Decision-Tree-Regression-Intuition-02.PNG)

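Decision Tree Regression repeatedly splits the dataset into regions. The regions that remain once splitting stops are called "leaves" (terminal leaves). How and where the splits are made is determined by the algorithm. For regression, the usual criterion is the reduction in squared error (variance) of y, which is the counterpart of the "Information Entropy" criterion used for classification trees. We then take the average of y within each terminal leaf.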

![Decision Tree Regression - Splitting & Averaging](Decision-Tree-Regression-Intuition-03.PNG)
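
To make the splitting step concrete, here is a minimal sketch of how a single split could be chosen on one feature by minimising squared error. The function name `best_split` and the toy arrays are hypothetical, and a real library implementation is considerably more elaborate.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, sse) for the single split on x that minimises
    the total squared error around the two resulting leaf means."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_threshold, best_sse = None, np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # identical values cannot be separated by a threshold
        left, right = y_sorted[:i], y_sorted[i:]
        # Squared error of each side around its own mean (the leaf average)
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_threshold = (x_sorted[i - 1] + x_sorted[i]) / 2
            best_sse = sse
    return best_threshold, best_sse

# Hypothetical toy data: y jumps between x = 3 and x = 4
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([10.0, 12.0, 11.0, 30.0, 31.0, 29.0])

threshold, sse = best_split(x, y)
print(threshold, sse)
```

A full tree applies the same search recursively inside each resulting region, and across every feature, until a stopping rule is met.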

The predicted y for a new data point is the average of y in the leaf into which that point falls.

![Decision Tree Regression - Prediction](Decision-Tree-Regression-Intuition-04.PNG)
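
The leaf-average rule can be checked with scikit-learn. The sketch below fits a shallow tree to hypothetical two-feature data (`X_demo`, `y_demo` and `new_point` are made up for illustration), then compares the prediction for a new point with the mean of the training targets in the same leaf.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: two features (X1, X2) and a noisy step-shaped target
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(200, 2))
y_demo = np.where(X_demo[:, 0] < 50, 10.0, 40.0) + rng.normal(0, 1, 200)

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, y_demo)

new_point = np.array([[20.0, 70.0]])
leaf_id = tree.apply(new_point)[0]            # index of the leaf the point lands in
in_same_leaf = tree.apply(X_demo) == leaf_id  # training rows in that same leaf
leaf_mean = y_demo[in_same_leaf].mean()

# With the default squared-error criterion, these two values should agree
print(tree.predict(new_point)[0], leaf_mean)
```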


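**Important Note:** Information Entropy is the split criterion usually associated with classification trees. Regression trees typically score a split by how much it reduces the squared error (variance) of y within the resulting regions. Since the algorithm takes care of the optimal splitting of the dataset, the underlying mathematics need not be discussed now.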

## Problem Statement

We have a dataset containing salary data of employees of different position levels (1 to 10) in a company. Here,


*   Independent Variable/Feature = Level
*   Dependent/Target Variable = Salary

Salary and Level have a non-linear relationship. We want to build a model that can predict the salary given the level of an employee.

**Important Note:** Decision Tree Regression works well with complex datasets having multiple features. Here, we are using a simple dataset with only one feature and hence this dataset is not ideal for visualizing the merits of Decision Tree Regression.

## Importing the libraries

```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
```

## Importing the dataset

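```python
dataset = pd.read_csv('Position_Salaries.csv')
# Feature matrix: every column except the first ('Position') and the last ('Salary')
X = dataset.iloc[:, 1:-1].values
# Target vector: the last column ('Salary')
y = dataset.iloc[:, -1].values
```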

## Taking care of missing data

Here we have no missing data in the dataset.

**Important Note:** As a rule of thumb, if missing data accounts for less than 1% of the dataset, the affected rows can be discarded. In all other cases, the missing data should be replaced. Missing values can be replaced with the mean, the median, the most frequent value or a constant using `SimpleImputer` from `sklearn.impute`. Other options include `IterativeImputer` and `KNNImputer`, while `MissingIndicator` flags which values were missing rather than replacing them.
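
As a rough illustration of mean imputation, the sketch below applies `SimpleImputer` to a small hypothetical array (`X_missing` is not part of this dataset, which has no missing values).

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix with one missing entry in the second column
X_missing = np.array([[1.0, 10.0],
                      [2.0, np.nan],
                      [3.0, 30.0]])

# Replace each NaN with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X_missing)
print(X_filled)
```

Changing `strategy` to `"median"`, `"most_frequent"` or `"constant"` selects the other replacement rules mentioned above.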



## Encoding categorical data

In the dataset, you can see that the 'Position' column is effectively encoded in the 'Level' column. Hence, the 'Position' column is redundant and is not included in the feature matrix X.

## Splitting the dataset into the Training set and Test set

Here, we want to predict the salary for an employee level, which can be any real value between 1 and 10. For accurate predictions we need the data for every level from 1 to 10, so in this scenario we do not split the dataset.

## Feature Scaling

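Decision Tree Regression doesn't require Feature Scaling: each split compares a single feature against a threshold, so rescaling a feature does not change which samples fall on which side of a split. Hence, it is not applied.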
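
One way to convince yourself of this is to fit the same tree on raw and on standardised features and compare the predictions. The sketch below uses hypothetical placeholder data (`X_levels`, `y_placeholder`), not the salary file.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Hypothetical single-feature data: levels 1..10 with a non-linear target
X_levels = np.arange(1, 11, dtype=float).reshape(-1, 1)
y_placeholder = X_levels.ravel() ** 3

# Tree fitted on the raw feature
raw_tree = DecisionTreeRegressor(random_state=0).fit(X_levels, y_placeholder)

# Tree fitted on the standardised feature
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_levels)
scaled_tree = DecisionTreeRegressor(random_state=0).fit(X_scaled, y_placeholder)

# The same query, expressed in raw and in scaled units
query = np.array([[6.4]])
same = np.allclose(raw_tree.predict(query),
                   scaled_tree.predict(scaler.transform(query)))
print(same)
```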

## Training the Decision Tree Regression model on the whole dataset
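
The training cell is not shown here, so the following is only a sketch of how this step might look with scikit-learn's `DecisionTreeRegressor`. In the notebook, `X` and `y` come from the import step; they are rebuilt below with illustrative placeholder values (not taken from `Position_Salaries.csv`) so that the snippet runs on its own. The choice of `random_state=0` is also an assumption.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Placeholder stand-ins for the notebook's X (Level) and y (Salary)
X = np.arange(1, 11).reshape(-1, 1)
y = np.array([45000, 50000, 60000, 80000, 110000,
              150000, 200000, 300000, 500000, 1000000])

# Fit an unrestricted tree on the whole dataset (no train/test split)
regressor = DecisionTreeRegressor(random_state=0)
regressor.fit(X, y)

print(regressor.get_n_leaves())
```

With no depth limit and one distinct target per level, the tree can grow until every training point sits in its own leaf, so it reproduces the training salaries exactly.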

## Predicting a new result
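
The prediction cell is also not shown, so this is a hedged sketch of the step. The query level 6.5 is a hypothetical example, and the data are the same illustrative placeholders as in the training sketch rather than the contents of the CSV file.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Placeholder stand-ins for the notebook's X (Level) and y (Salary)
X = np.arange(1, 11).reshape(-1, 1)
y = np.array([45000, 50000, 60000, 80000, 110000,
              150000, 200000, 300000, 500000, 1000000])
regressor = DecisionTreeRegressor(random_state=0).fit(X, y)

# predict expects a 2D array: one row per sample, one column per feature
new_level = np.array([[6.5]])
predicted_salary = regressor.predict(new_level)[0]
print(predicted_salary)
```

Because the prediction is a leaf average, the result is always one of the values the tree stored during training, never an interpolation between two levels.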

## Visualising the Decision Tree Regression results (higher resolution)
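
The plotting cell is missing as well. A possible sketch, again on illustrative placeholder data, evaluates the model on a fine grid of levels so that the step-shaped predictions become visible. The grid step of 0.01, the colours and the axis labels are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor

# Placeholder stand-ins for the notebook's X (Level) and y (Salary)
X = np.arange(1, 11).reshape(-1, 1)
y = np.array([45000, 50000, 60000, 80000, 110000,
              150000, 200000, 300000, 500000, 1000000])
regressor = DecisionTreeRegressor(random_state=0).fit(X, y)

# Dense grid of levels between the smallest and largest observed level
X_grid = np.arange(X.min(), X.max(), 0.01).reshape(-1, 1)
y_grid = regressor.predict(X_grid)

plt.scatter(X, y, color='red')          # observed data points
plt.plot(X_grid, y_grid, color='blue')  # piecewise-constant model output
plt.title('Decision Tree Regression')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()
```

Plotting only at the ten integer levels would join the points with sloped lines and hide the fact that the model is constant within each leaf, which is why the finer grid is used.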