# Decision Trees and Gradient Boosting

## Setting up the Environment

For this laboratory exercise, you will need the conda package and environment manager. We will install a minimal distribution, [Miniconda](https://docs.conda.io/projects/miniconda/en/latest/). Choose the appropriate installer for your operating system, then download and run it.

Or use the following commands:

### Windows
```shell
curl https://repo.anaconda.com/miniconda/Miniconda3-latest-Windows-x86_64.exe -o miniconda.exe
start /wait "" miniconda.exe /S
del miniconda.exe
```

### Linux
```shell
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh
```

### macOS

```shell
mkdir -p ~/miniconda3
curl https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-arm64.sh -o ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh
```

On both Linux and macOS, initialize your newly installed Miniconda after the installation finishes. The following commands initialize it for the bash and zsh shells:

```shell
~/miniconda3/bin/conda init bash
~/miniconda3/bin/conda init zsh
```


Once you have installed Miniconda, run the following command to create an environment that includes Python:
```bash
conda create --name myenv python
```

'myenv' is the name of the environment; you can choose any name you like.

When conda asks whether to proceed, type y.

After successfully creating the environment, activate it with the following command:
```bash
conda activate myenv
```

For more detailed information you can read the [documentation](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#creating-an-environment-with-commands).

With the environment activated, install the required libraries:

```bash
pip install numpy pandas scikit-learn xgboost matplotlib seaborn gdown
```
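To confirm that the installation landed in the active environment, a quick check along the following lines can help. This is only a sketch: the helper name `missing_packages` is hypothetical, and the list uses import names, which differ from the pip names in one case (scikit-learn is imported as `sklearn`).

```python
import importlib.util

# Import names for the packages installed above; note that scikit-learn
# is imported as "sklearn".
REQUIRED = ["numpy", "pandas", "sklearn", "xgboost", "matplotlib", "seaborn", "gdown"]


def missing_packages(names=REQUIRED):
    """Return the import names that cannot be found in the active environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages()
print("Missing packages:", missing if missing else "none")
```

If the list is not empty, check that the environment is activated and run the pip command again.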

Next, we need to make the environment available in Jupyter. Use the following commands to install ipykernel and register the environment as a kernel.

```bash
pip install ipykernel
```
```bash
python -m ipykernel install --user --name=myenv
```


Next, start Jupyter Notebook, download this starter notebook, and open it. In the Kernel menu, choose the name of the environment you created, as shown in the picture below.


![jupyter](https://drive.google.com/uc?export=view&id=1N-27jjlIgpTILi-_6lny7ng8sE52SAZx)


## Download and Read the Dataset

Run the code below to download the dataset.

In [17]:
!gdown 1boIax8d9Sat6OJzkiIjjpqmtSZKuRYrx

Error:

	HTTPSConnectionPool(host='doc-14-1k-docs.googleusercontent.com',
	port=443): Max retries exceeded with url: /docs/securesc/ha0ro937gcuc7
	l7deffksulhg5h7mbp1/h198dtpgsnbqh9b4o0ae3ast6tg3d8vh/1735238775000/040
	43802626059007351/*/1boIax8d9Sat6OJzkiIjjpqmtSZKuRYrx?uuid=a8d4f1f2-
	ed14-431a-aa7a-b6a2c48ee32e (Caused by
	NameResolutionError("<urllib3.connection.HTTPSConnection object at
	0x1072f1700>: Failed to resolve 'doc-14-1k-docs.googleusercontent.com'
	([Errno 8] nodename nor servname provided, or not known)"))

To report issues, please visit https://github.com/wkentaro/gdown/issues.


In [18]:
!pip install numpy pandas scikit-learn xgboost matplotlib seaborn gdown



### Import the required libraries

In [19]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor 
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from xgboost import XGBRegressor

### Read the dataset

CONTEXT:
This is a dataset of electric vehicles.

It contains the following columns:


*   Brand
*   Model
*   AccelSec - Time to accelerate from 0 to 100 km/h, in seconds
*   TopSpeed_KmH - Top speed in km/h
*   Range_Km - Range in km
*   Efficiency_WhKm - Efficiency in Wh/km
*   FastCharge_KmH - Fast-charging speed in km/h (km of range added per hour of charging)
*   RapidCharge - Yes / No
*   PowerTrain - Front, rear, or all wheel drive
*   PlugType
*   BodyStyle - Basic size or style
*   Segment - Market segment
*   Seats - Number of seats
*   PriceEuro - Price in Germany before tax incentives




TASK:
Predict the target 'PriceEuro' and compare the performance of the DecisionTreeRegressor and the XGBRegressor models.

In [20]:
data = pd.read_csv('ElectricCarData.csv')

In [21]:
data.head()

Unnamed: 0,Brand,Model,AccelSec,TopSpeed_KmH,Range_Km,Efficiency_WhKm,FastCharge_KmH,RapidCharge,PowerTrain,PlugType,BodyStyle,Segment,Seats,PriceEuro
0,Tesla,Model 3 Long Range Dual Motor,4.6,233,450,161,940,Yes,AWD,Type 2 CCS,Sedan,D,5,55480
1,Volkswagen,ID.3 Pure,10.0,160,270,167,250,Yes,RWD,Type 2 CCS,Hatchback,C,5,30000
2,Polestar,2,4.7,210,400,181,620,Yes,AWD,Type 2 CCS,Liftback,D,5,56440
3,BMW,iX3,6.8,180,360,206,560,Yes,RWD,Type 2 CCS,SUV,D,5,68040
4,Honda,e,9.5,145,170,168,190,Yes,RWD,Type 2 CCS,Hatchback,B,4,32997


In [22]:
data.isnull().sum()

Brand              0
Model              0
AccelSec           0
TopSpeed_KmH       0
Range_Km           0
Efficiency_WhKm    0
FastCharge_KmH     0
RapidCharge        0
PowerTrain         0
PlugType           0
BodyStyle          0
Segment            0
Seats              0
PriceEuro          0
dtype: int64

In [23]:
data.head(30)

Unnamed: 0,Brand,Model,AccelSec,TopSpeed_KmH,Range_Km,Efficiency_WhKm,FastCharge_KmH,RapidCharge,PowerTrain,PlugType,BodyStyle,Segment,Seats,PriceEuro
0,Tesla,Model 3 Long Range Dual Motor,4.6,233,450,161,940,Yes,AWD,Type 2 CCS,Sedan,D,5,55480
1,Volkswagen,ID.3 Pure,10.0,160,270,167,250,Yes,RWD,Type 2 CCS,Hatchback,C,5,30000
2,Polestar,2,4.7,210,400,181,620,Yes,AWD,Type 2 CCS,Liftback,D,5,56440
3,BMW,iX3,6.8,180,360,206,560,Yes,RWD,Type 2 CCS,SUV,D,5,68040
4,Honda,e,9.5,145,170,168,190,Yes,RWD,Type 2 CCS,Hatchback,B,4,32997
5,Lucid,Air,2.8,250,610,180,620,Yes,AWD,Type 2 CCS,Sedan,F,5,105000
6,Volkswagen,e-Golf,9.6,150,190,168,220,Yes,FWD,Type 2 CCS,Hatchback,C,5,31900
7,Peugeot,e-208,8.1,150,275,164,420,Yes,FWD,Type 2 CCS,Hatchback,B,5,29682
8,Tesla,Model 3 Standard Range Plus,5.6,225,310,153,650,Yes,RWD,Type 2 CCS,Sedan,D,5,46380
9,Audi,Q4 e-tron,6.3,180,400,193,540,Yes,AWD,Type 2 CCS,SUV,D,5,55000


In [24]:
data = data.drop(columns='Model')

In [25]:
data

Unnamed: 0,Brand,AccelSec,TopSpeed_KmH,Range_Km,Efficiency_WhKm,FastCharge_KmH,RapidCharge,PowerTrain,PlugType,BodyStyle,Segment,Seats,PriceEuro
0,Tesla,4.6,233,450,161,940,Yes,AWD,Type 2 CCS,Sedan,D,5,55480
1,Volkswagen,10.0,160,270,167,250,Yes,RWD,Type 2 CCS,Hatchback,C,5,30000
2,Polestar,4.7,210,400,181,620,Yes,AWD,Type 2 CCS,Liftback,D,5,56440
3,BMW,6.8,180,360,206,560,Yes,RWD,Type 2 CCS,SUV,D,5,68040
4,Honda,9.5,145,170,168,190,Yes,RWD,Type 2 CCS,Hatchback,B,4,32997
...,...,...,...,...,...,...,...,...,...,...,...,...,...
98,Nissan,7.5,160,330,191,440,Yes,FWD,Type 2 CCS,Hatchback,C,5,45000
99,Audi,4.5,210,335,258,540,Yes,AWD,Type 2 CCS,SUV,E,5,96050
100,Nissan,5.9,200,325,194,440,Yes,AWD,Type 2 CCS,Hatchback,C,5,50000
101,Nissan,5.1,200,375,232,450,Yes,AWD,Type 2 CCS,Hatchback,C,5,65000


In [26]:
data['FastCharge_KmH'].unique()

array(['940', '250', '620', '560', '190', '220', '420', '650', '540',
       '440', '230', '380', '210', '590', '780', '170', '260', '930',
       '850', '910', '490', '470', '270', '450', '350', '710', '240',
       '390', '570', '610', '340', '730', '920', '-', '550', '900', '520',
       '430', '890', '410', '770', '460', '360', '810', '480', '290',
       '330', '740', '510', '320', '500'], dtype=object)

In [27]:
data.replace('-', np.nan, inplace=True)

In [28]:
data.fillna(data.mean(numeric_only=True), inplace=True)

In [29]:
data.isnull().sum()

Brand              0
AccelSec           0
TopSpeed_KmH       0
Range_Km           0
Efficiency_WhKm    0
FastCharge_KmH     5
RapidCharge        0
PowerTrain         0
PlugType           0
BodyStyle          0
Segment            0
Seats              0
PriceEuro          0
dtype: int64
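The count above shows that FastCharge_KmH still has 5 missing values after the fill. The likely reason is that the column holds strings (dtype object), so `data.mean(numeric_only=True)` skips it and `fillna` has no value to put there. A minimal sketch of one possible fix on a toy frame (the frame `toy` is made up for illustration): convert the column to numbers first, then impute with its mean.

```python
import pandas as pd

# Toy frame standing in for the real dataset: the column was read as strings
# because some rows contain '-' instead of a number.
toy = pd.DataFrame({"FastCharge_KmH": ["940", "250", "-", "620"]})

# Turn the text into numbers; anything unparseable (here '-') becomes NaN.
toy["FastCharge_KmH"] = pd.to_numeric(toy["FastCharge_KmH"], errors="coerce")

# The column is now float64, so mean imputation actually applies to it.
toy["FastCharge_KmH"] = toy["FastCharge_KmH"].fillna(toy["FastCharge_KmH"].mean())

print(toy["FastCharge_KmH"].dtype)
print(toy["FastCharge_KmH"].isnull().sum())
```

Applied to the real data, the same two steps would replace the '-' entries with the mean charging speed of the remaining cars. Whether the mean is the right fill value is a modelling choice; the median is a common alternative.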

### Encode string variables

In [30]:
data = pd.get_dummies(data, columns=['PlugType', 'Segment', 'PowerTrain'], prefix=['PlugType', 'Segment', 'PowerTrain'])  # one-hot encode the nominal columns

In [31]:
encoder = LabelEncoder()

data['Brand'] = encoder.fit_transform(data['Brand'])
data['BodyStyle'] = encoder.fit_transform(data['BodyStyle'])
data['RapidCharge'] = encoder.fit_transform(data['RapidCharge'])

In [32]:
data

Unnamed: 0,Brand,AccelSec,TopSpeed_KmH,Range_Km,Efficiency_WhKm,FastCharge_KmH,RapidCharge,BodyStyle,Seats,PriceEuro,...,Segment_B,Segment_C,Segment_D,Segment_E,Segment_F,Segment_N,Segment_S,PowerTrain_AWD,PowerTrain_FWD,PowerTrain_RWD
0,30,4.6,233,450,161,940,1,7,5,55480,...,False,False,True,False,False,False,False,True,False,False
1,31,10.0,160,270,167,250,1,1,5,30000,...,False,True,False,False,False,False,False,False,False,True
2,23,4.7,210,400,181,620,1,2,5,56440,...,False,False,True,False,False,False,False,True,False,False
3,2,6.8,180,360,206,560,1,6,5,68040,...,False,False,True,False,False,False,False,False,False,True
4,9,9.5,145,170,168,190,1,1,4,32997,...,True,False,False,False,False,False,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98,20,7.5,160,330,191,440,1,1,5,45000,...,False,True,False,False,False,False,False,False,True,False
99,1,4.5,210,335,258,540,1,6,5,96050,...,False,False,False,True,False,False,False,True,False,False
100,20,5.9,200,325,194,440,1,1,5,50000,...,False,True,False,False,False,False,False,True,False,False
101,20,5.1,200,375,232,450,1,1,5,65000,...,False,True,False,False,False,False,False,True,False,False


## Split the dataset into training and test sets in an 80:20 ratio

In [33]:
x = data.drop(columns='PriceEuro')
y = data['PriceEuro']

In [34]:
x

Unnamed: 0,Brand,AccelSec,TopSpeed_KmH,Range_Km,Efficiency_WhKm,FastCharge_KmH,RapidCharge,BodyStyle,Seats,PlugType_Type 1 CHAdeMO,...,Segment_B,Segment_C,Segment_D,Segment_E,Segment_F,Segment_N,Segment_S,PowerTrain_AWD,PowerTrain_FWD,PowerTrain_RWD
0,30,4.6,233,450,161,940,1,7,5,False,...,False,False,True,False,False,False,False,True,False,False
1,31,10.0,160,270,167,250,1,1,5,False,...,False,True,False,False,False,False,False,False,False,True
2,23,4.7,210,400,181,620,1,2,5,False,...,False,False,True,False,False,False,False,True,False,False
3,2,6.8,180,360,206,560,1,6,5,False,...,False,False,True,False,False,False,False,False,False,True
4,9,9.5,145,170,168,190,1,1,4,False,...,True,False,False,False,False,False,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98,20,7.5,160,330,191,440,1,1,5,False,...,False,True,False,False,False,False,False,False,True,False
99,1,4.5,210,335,258,540,1,6,5,False,...,False,False,False,True,False,False,False,True,False,False
100,20,5.9,200,325,194,440,1,1,5,False,...,False,True,False,False,False,False,False,True,False,False
101,20,5.1,200,375,232,450,1,1,5,False,...,False,True,False,False,False,False,False,True,False,False


In [35]:
y

0      55480
1      30000
2      56440
3      68040
4      32997
       ...  
98     45000
99     96050
100    50000
101    65000
102    62000
Name: PriceEuro, Length: 103, dtype: int64

In [36]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
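Without a fixed seed, `train_test_split` shuffles differently on every run, so the metrics reported further down change each time the notebook is executed. The sketch below, on made-up data with the same number of rows as the dataset (103), shows how the `random_state` parameter makes the split repeatable. The value 42 is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy features and target with the same number of rows as the dataset (103).
x_toy = pd.DataFrame({"feature": np.arange(103)})
y_toy = pd.Series(np.arange(103) * 10, name="target")

# Fixing random_state makes the shuffle, and therefore the split, repeatable.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.2, random_state=42
)

print(len(x_tr), len(x_te))
```

With 103 rows and `test_size=0.2`, the test set contains 21 rows and the training set 82, which matches the 21 predictions shown later.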

## Initialize the DecisionTreeRegressor model and train it with the fit function

Set values for the parameters max_depth, min_samples_split, and max_features.

Then fit the model on the training data using the fit function.


In [37]:
data['FastCharge_KmH'].unique()

array(['940', '250', '620', '560', '190', '220', '420', '650', '540',
       '440', '230', '380', '210', '590', '780', '170', '260', '930',
       '850', '910', '490', '470', '270', '450', '350', '710', '240',
       '390', '570', '610', '340', '730', '920', nan, '550', '900', '520',
       '430', '890', '410', '770', '460', '360', '810', '480', '290',
       '330', '740', '510', '320', '500'], dtype=object)

In [38]:
for column in x_train.columns:
    if x_train[column].dtype == 'object':  # Check for non-numeric columns
        print(column, x_train[column].unique())

FastCharge_KmH ['710' '450' '380' '480' nan '810' '260' '350' '560' '170' '520' '890'
 '240' '210' '290' '340' '730' '360' '230' '850' '470' '610' '330' '220'
 '190' '440' '320' '920' '490' '430' '770' '590' '650' '740' '420' '540'
 '550' '270' '390' '940' '510' '620' '910' '250' '500' '900']


In [39]:
x_train.replace('-', np.nan, inplace=True)

In [40]:
data.isnull().sum()

Brand                      0
AccelSec                   0
TopSpeed_KmH               0
Range_Km                   0
Efficiency_WhKm            0
FastCharge_KmH             5
RapidCharge                0
BodyStyle                  0
Seats                      0
PriceEuro                  0
PlugType_Type 1 CHAdeMO    0
PlugType_Type 2            0
PlugType_Type 2 CCS        0
PlugType_Type 2 CHAdeMO    0
Segment_A                  0
Segment_B                  0
Segment_C                  0
Segment_D                  0
Segment_E                  0
Segment_F                  0
Segment_N                  0
Segment_S                  0
PowerTrain_AWD             0
PowerTrain_FWD             0
PowerTrain_RWD             0
dtype: int64

In [41]:
model = DecisionTreeRegressor(max_depth=5, min_samples_split=3, max_features='sqrt')

model.fit(x_train, y_train)

## Predict the outcomes for the test set

In [42]:
x_test

Unnamed: 0,Brand,AccelSec,TopSpeed_KmH,Range_Km,Efficiency_WhKm,FastCharge_KmH,RapidCharge,BodyStyle,Seats,PlugType_Type 1 CHAdeMO,...,Segment_B,Segment_C,Segment_D,Segment_E,Segment_F,Segment_N,Segment_S,PowerTrain_AWD,PowerTrain_FWD,PowerTrain_RWD
21,30,5.1,217,425,171,930.0,1,6,7,False,...,False,False,True,False,False,False,False,True,False,False
44,26,12.3,130,195,166,170.0,1,1,4,False,...,False,False,False,False,False,False,False,False,True,False
102,3,7.5,190,400,238,480.0,1,6,5,False,...,False,False,False,True,False,False,False,True,False,False
69,8,6.0,180,430,209,410.0,1,6,5,False,...,False,False,True,False,False,False,False,True,False,False
80,31,7.3,160,340,171,470.0,1,1,5,False,...,False,True,False,False,False,False,False,False,False,True
41,10,9.9,155,255,154,210.0,1,6,5,False,...,True,False,False,False,False,False,False,False,True,False
73,3,5.5,190,390,244,460.0,1,6,5,False,...,False,False,False,True,False,False,False,True,False,False
9,1,6.3,180,400,193,540.0,1,6,5,False,...,False,False,True,False,False,False,False,True,False,False
46,21,7.3,150,335,173,210.0,1,3,5,False,...,True,False,False,False,False,False,False,False,True,False
35,20,7.3,157,325,172,390.0,1,1,5,False,...,False,True,False,False,False,False,False,False,True,False


In [43]:
data['FastCharge_KmH'].unique()

array(['940', '250', '620', '560', '190', '220', '420', '650', '540',
       '440', '230', '380', '210', '590', '780', '170', '260', '930',
       '850', '910', '490', '470', '270', '450', '350', '710', '240',
       '390', '570', '610', '340', '730', '920', nan, '550', '900', '520',
       '430', '890', '410', '770', '460', '360', '810', '480', '290',
       '330', '740', '510', '320', '500'], dtype=object)

In [44]:
x_test = x_test.replace('-', np.nan)

In [45]:
y_pred = model.predict(x_test)

In [46]:
y_pred

array([ 62250.66666667,  22977.5       ,  72741.63157895,  54549.16666667,
        39397.22222222,  32725.65      ,  72741.63157895,  54549.16666667,
        32725.65      ,  32725.65      ,  72741.63157895,  72741.63157895,
       149150.5       ,  32725.65      ,  37879.4       ,  36559.5       ,
        72741.63157895,  22030.        ,  36559.5       ,  22030.        ,
        22030.        ])

## Assess the model performance using sklearn regression metrics

In [47]:
print(r2_score(y_test, y_pred))
print(mean_absolute_error(y_test, y_pred))
print(mean_squared_error(y_test, y_pred))

0.7272285852061902
11833.77081592871
432867232.64086515
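The three numbers above are, in order, R2, MAE, and MSE. MSE is expressed in squared euros, which is hard to read: its square root (RMSE) is back in euros, and for the value above it is roughly 20,800. The sketch below labels each metric and adds RMSE. The helper name `regression_report` and the toy prices are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Return R2, MAE, MSE and RMSE in a dict (hypothetical helper)."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "R2": r2_score(y_true, y_pred),
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),  # same unit as the target (euros)
    }


# Made-up prices, only to show the output format.
y_true_toy = [30000, 50000, 70000]
y_pred_toy = [32000, 49000, 66000]

for name, value in regression_report(y_true_toy, y_pred_toy).items():
    print(f"{name}: {value:,.3f}")
```

In the notebook, the same helper could be called as `regression_report(y_test, y_pred)`.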


## Initialize the XGBRegressor model, and use the fit function

Add values for the parameters: n_estimators, max_depth, learning_rate, and set the objective to "reg:squarederror"

Fit the model using the fit function

In [48]:
model2 = XGBRegressor(n_estimators = 50, max_depth = 5, learning_rate = 0.2, objective ='reg:squarederror')

model2.fit(x_train, y_train)

ValueError: DataFrame.dtypes for data must be int, float, bool or category. When categorical type is supplied, the experimental DMatrix parameter`enable_categorical` must be set to `True`.  Invalid columns:FastCharge_KmH: object
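The error above says that FastCharge_KmH still has the object dtype when it reaches XGBRegressor. One way to deal with it, assuming the values are numbers stored as text, is to parse every non-numeric column and to apply the same step to both the training and the test features. The sketch below illustrates this on toy frames. The helper name `coerce_object_columns` and the data are hypothetical.

```python
import pandas as pd


def coerce_object_columns(frame):
    """Return a copy in which every non-numeric, non-boolean column is parsed
    as numbers. Hypothetical helper; unparseable values become NaN."""
    result = frame.copy()
    for column in result.select_dtypes(exclude=["number", "bool"]).columns:
        result[column] = pd.to_numeric(result[column], errors="coerce")
    return result


# Toy stand-ins for x_train and x_test.
train_toy = pd.DataFrame({
    "Range_Km": [450, 270, 400],
    "FastCharge_KmH": ["940", None, "620"],  # text column with a missing value
    "PowerTrain_AWD": [True, False, True],
})
test_toy = pd.DataFrame({
    "Range_Km": [360],
    "FastCharge_KmH": ["560"],
    "PowerTrain_AWD": [False],
})

train_toy = coerce_object_columns(train_toy)
test_toy = coerce_object_columns(test_toy)

print(train_toy.dtypes)
```

XGBoost treats NaN in a numeric column as a missing value, so the remaining NaN entries do not have to be imputed for this model, although imputing them keeps the input consistent across models.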

## Predict the outcomes for the test set

In [453]:
y_pred2 = model2.predict(x_test)

In [455]:
y_pred2

array([ 54992.402,  75573.93 ,  72279.77 ,  36651.676,  55892.703,
        39745.223, 104804.41 ,  33551.652,  40984.414,  53505.477,
        95938.78 ,  65654.56 ,  49289.402,  33901.125,  32431.36 ,
        45243.836,  54475.883,  74741.99 ,  23598.312,  30871.182,
        30608.709], dtype=float32)

## Assess the model performance using sklearn regression metrics

In [457]:
print(r2_score(y_test, y_pred2))
print(mean_absolute_error(y_test, y_pred2))
print(mean_squared_error(y_test, y_pred2))

0.9101185764354356
4510.181919642857
44073881.31966836


## Compare the performance of both models on at least three regression metrics
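One way to lay out the comparison is a small table built from the values printed in the two assessment cells above. The sketch below copies those numbers by hand. The `better` column is an illustrative addition, and it assumes that higher is better for R2 and lower is better for MAE and MSE.

```python
import pandas as pd

models = ["DecisionTreeRegressor", "XGBRegressor"]

# Metric values copied from the printed outputs earlier in this notebook.
results = pd.DataFrame(
    {
        "DecisionTreeRegressor": [0.7272285852061902, 11833.77081592871, 432867232.64086515],
        "XGBRegressor": [0.9101185764354356, 4510.181919642857, 44073881.31966836],
    },
    index=["R2", "MAE", "MSE"],
)

# Higher is better for R2; lower is better for the two error metrics.
higher_is_better = {"R2": True, "MAE": False, "MSE": False}
best = {
    metric: (results.loc[metric, models].idxmax() if higher
             else results.loc[metric, models].idxmin())
    for metric, higher in higher_is_better.items()
}
results["better"] = pd.Series(best)

print(results)
```

On these numbers XGBRegressor comes out ahead on all three metrics. The comparison is only meaningful if both models were evaluated on the same train/test split, and with a test set of 21 cars the scores can vary noticeably from one split to another.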