# **Decision Tree**
---

* As versatile as SVMs: it handles both classification and regression.
* Requires very little data preparation.
* Features do not need to be scaled or centered.
* It is a white-box model: its decisions are easy to interpret.

**CART** (*Classification And Regression Tree*) is the algorithm scikit-learn uses to train decision trees. It is a *greedy algorithm*: at each node it picks the locally best split instead of searching for the globally best tree, so it **is not guaranteed to find the optimal solution**.
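
The greedy step can be sketched in a few lines of plain Python. This is only an illustration of the idea, not scikit-learn's implementation: `gini` and `best_split` are hypothetical helper names, and the toy rows are made up. At a single node, the search tries every feature and threshold and keeps the split with the lowest weighted impurity, without looking ahead to later levels.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Greedy search: return (feature_index, threshold, cost) of the single
    split that minimizes the weighted Gini impurity of the two children."""
    n = len(rows)
    best = (None, None, float("inf"))
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue  # a split must send samples to both children
            cost = len(left) / n * gini(left) + len(right) / n * gini(right)
            if cost < best[2]:
                best = (feature, threshold, cost)
    return best

# Toy data (made up for illustration): columns are [age, na_to_k]
rows = [[23, 25.4], [47, 13.1], [47, 10.1], [28, 7.8], [61, 18.0]]
labels = ["drugY", "drugC", "drugC", "drugX", "drugY"]

feature, threshold, cost = best_split(rows, labels)
print(feature, threshold, round(cost, 3))
```

A full tree repeats this search recursively on each child. Because every choice is made locally, an early split that looks best can prevent a better overall tree from being found.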

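**Node attributes**:
* samples: number of training samples that reach this node.
* value: number of training samples of each class that reach this node.
* Impurity:
    * gini: Gini impurity. gini = 0 means the node is pure (all of its samples belong to the same class).
    * entropy: an alternative impurity measure, also 0 for a pure node.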
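
Both impurity measures can be computed directly from a node's `value`. The sketch below assumes a hypothetical node with class counts `[0, 49, 5]`; the function names are illustrative, not part of scikit-learn.

```python
import math

def gini(counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy in bits; empty classes contribute nothing."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

node_value = [0, 49, 5]      # hypothetical per-class sample counts
samples = sum(node_value)    # the node's "samples" attribute

print(samples)
print(round(gini(node_value), 3))
print(round(entropy(node_value), 3))
```

A node whose samples all belong to one class gets 0 from both measures, and a two-class node split 50/50 gets the maximum (0.5 for Gini, 1 bit for entropy).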

**Regularization**:
* Increase the *min_* hyperparameters, such as *min_samples_split, min_samples_leaf, min_weight_fraction_leaf*.
* Decrease the *max_* hyperparameters, such as *max_depth, max_leaf_nodes*.
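
A minimal sketch of the effect, assuming a synthetic dataset from `make_moons` (not the data used in this notebook) and arbitrary hyperparameter values chosen only for illustration:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy two-class data (assumption: stand-in for a real dataset)
X_demo, y_demo = make_moons(n_samples=500, noise=0.3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# No constraints: the tree grows until every leaf is pure
unrestricted = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

# Lower max_depth and higher min_samples_leaf restrict the tree's freedom
regularized = DecisionTreeClassifier(
    max_depth=4, min_samples_leaf=5, random_state=42
).fit(X_tr, y_tr)

for name, model in [("unrestricted", unrestricted), ("regularized", regularized)]:
    print(
        name,
        "depth:", model.get_depth(),
        "leaves:", model.get_n_leaves(),
        "train:", model.score(X_tr, y_tr),
        "test:", model.score(X_te, y_te),
    )
```

The unrestricted tree typically fits the training set almost perfectly with many more leaves. Comparing the train and test scores shows how much of that fit generalizes.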

![](assets/cv_gs.png)

In [1]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt 

## **Dataset**
---

In [3]:
df = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/drug200.csv', delimiter=",")
df.head()

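| | Age | Sex | BP | Cholesterol | Na_to_K | Drug |
|---|---|---|---|---|---|---|
| 0 | 23 | F | HIGH | HIGH | 25.355 | drugY |
| 1 | 47 | M | LOW | HIGH | 13.093 | drugC |
| 2 | 47 | M | LOW | HIGH | 10.114 | drugC |
| 3 | 28 | F | NORMAL | HIGH | 7.798 | drugX |
| 4 | 61 | F | LOW | HIGH | 18.043 | drugY |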


In [4]:
df.shape

(200, 6)

In [11]:
X = df.drop("Drug", axis=1).values

In [12]:
y = df["Drug"]

## **Preprocessing**
---

In [15]:
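from sklearn.preprocessing import LabelEncoder

# scikit-learn trees need numeric features, so encode the categorical columns.
# LabelEncoder assigns integer codes in alphabetical order of the labels.

# Column 1 (Sex): F=0, M=1
le_sex = LabelEncoder()
le_sex.fit(["F", "M"])
X[:,1] = le_sex.transform(X[:, 1])

# Column 2 (BP): HIGH=0, LOW=1, NORMAL=2
le_BP = LabelEncoder()
le_BP.fit(["LOW", "NORMAL", "HIGH"])
X[:, 2] = le_BP.transform(X[:, 2])

# Column 3 (Cholesterol): HIGH=0, NORMAL=1
le_chol = LabelEncoder()
le_chol.fit(["NORMAL", "HIGH"])
X[:, 3] = le_chol.transform(X[:, 3])

X[:5]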

array([[23, 0, 0, 0, 25.355],
       [47, 1, 1, 0, 13.093],
       [47, 1, 1, 0, 10.114],
       [28, 0, 2, 0, 7.798],
       [61, 0, 1, 0, 18.043]], dtype=object)
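
`LabelEncoder` sorts labels alphabetically, which is why BP comes out as HIGH=0, LOW=1, NORMAL=2 in the output above, an order that does not match LOW < NORMAL < HIGH. One possible alternative, sketched here under the assumption that a meaningful order is wanted, is `OrdinalEncoder` with explicit `categories`. The small frame below only reuses the five rows shown by `df.head()`.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# The categorical columns of the five rows displayed earlier
sample = pd.DataFrame({
    "Sex": ["F", "M", "M", "F", "F"],
    "BP": ["HIGH", "LOW", "LOW", "NORMAL", "LOW"],
    "Cholesterol": ["HIGH", "HIGH", "HIGH", "HIGH", "HIGH"],
})

# One category list per column; the position in each list becomes the code
encoder = OrdinalEncoder(categories=[
    ["F", "M"],
    ["LOW", "NORMAL", "HIGH"],
    ["NORMAL", "HIGH"],
])
encoded = encoder.fit_transform(sample)
print(encoded)
```

With this encoding a threshold such as `BP <= 0.5` separates LOW from the rest, which is easier to read in the plotted tree. The tree can still learn from the alphabetical codes, so this is a readability choice more than a correctness fix.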

## **Decision Tree**
---

In [16]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)

In [17]:
X_train.shape, X_test.shape

((140, 5), (60, 5))

In [20]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(criterion='entropy', max_depth=4)
dt

In [21]:
dt.fit(X_train, y_train)

In [22]:
dt.score(X_test, y_test)

0.9833333333333333
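
For a classifier, `score` returns the mean accuracy, so 0.9833 on the 60 test rows corresponds to 59 correct predictions. A confusion matrix shows which classes are being confused. The sketch below uses the iris dataset bundled with scikit-learn as a stand-in (an assumption, chosen so it runs without downloading the drug data); the same calls apply to `dt`, `X_test` and `y_test`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: iris ships with scikit-learn, so no download is needed
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=3
)

clf = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=0)
clf.fit(X_tr, y_tr)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, clf.predict(X_te), labels=[0, 1, 2])
accuracy = np.trace(cm) / cm.sum()  # correct predictions sit on the diagonal

print(cm)
print(accuracy, clf.score(X_te, y_te))
```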

In [25]:
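from sklearn.tree import export_graphviz
# Write the fitted tree to a Graphviz .dot file, with nodes colored by majority class
export_graphviz(dt, out_file='tree.dot', filled=True, feature_names=['Age', 'Sex', 'BP', 'Cholesterol', 'Na_to_K'])
# Render the .dot file to PNG (notebook shell command; requires the Graphviz `dot` executable)
!dot -Tpng tree.dot -o tree.png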
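
If Graphviz is not installed, the learned rules can also be printed as plain text with `export_text`. This sketch again uses the bundled iris data as a stand-in (an assumption); for the drug model the call would take `dt` and the same `feature_names` list as above.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# One indented line per node: a threshold test or a leaf's predicted class
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```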

# **Regression Trees**
---

In [30]:
import pandas as pd
# Regression Tree Algorithm
from sklearn.tree import DecisionTreeRegressor
# Split our data into a training and testing data
from sklearn.model_selection import train_test_split

In [31]:
data = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/real_estate_data.csv")
data.head()

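| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | NaN | 36.2 |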


In [32]:
data.shape

(506, 13)

In [34]:
data.isna().sum()

CRIM       20
ZN         20
INDUS      20
CHAS       20
NOX         0
RM          0
AGE        20
DIS         0
RAD         0
TAX         0
PTRATIO     0
LSTAT      20
MEDV        0
dtype: int64

## **Preprocessing**

In [35]:
data.dropna(inplace=True)

In [36]:
data.isna().sum()

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
LSTAT      0
MEDV       0
dtype: int64

In [37]:
X = data.drop(columns=["MEDV"])
Y = data["MEDV"]

In [38]:
X.head()

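| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 2.94 |
| 5 | 0.02985 | 0.0 | 2.18 | 0.0 | 0.458 | 6.43 | 58.7 | 6.0622 | 3 | 222 | 18.7 | 5.21 |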


In [39]:
Y

0      24.0
1      21.6
2      34.7
3      33.4
5      28.7
       ... 
499    17.5
500    16.8
502    20.6
503    23.9
504    22.0
Name: MEDV, Length: 394, dtype: float64

In [40]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.2, random_state=1)

## **Regression Tree**

In [43]:
regression_tree = DecisionTreeRegressor(criterion = 'squared_error')

In [44]:
regression_tree.fit(X_train, Y_train)

In [45]:
regression_tree.score(X_test, Y_test)

0.7371095053533683

In [46]:
prediction = regression_tree.predict(X_test)

# Mean absolute error. MEDV is in thousands of dollars, so multiply by 1000 to report dollars
print("$",(prediction - Y_test).abs().mean()*1000)

$ 3148.1012658227846


In [49]:
regression_tree = DecisionTreeRegressor(criterion = 'absolute_error')

In [50]:
regression_tree.fit(X_train, Y_train)

In [51]:
regression_tree.score(X_test, Y_test)

0.7791031062539148

In [52]:
prediction = regression_tree.predict(X_test)

# Mean absolute error in dollars for the absolute_error tree (MEDV is in thousands of dollars)
print("$",(prediction - Y_test).abs().mean()*1000)

$ 2948.101265822785
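
For a regressor, `score` returns the coefficient of determination R², while the printed dollar figure is the mean absolute error (MAE). Both are also available in `sklearn.metrics`. The sketch below uses the first four MEDV values shown earlier as targets and hypothetical predictions invented for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

y_true = np.array([24.0, 21.6, 34.7, 33.4])  # MEDV, in thousands of dollars
y_pred = np.array([22.0, 23.6, 30.7, 33.4])  # hypothetical predictions

mae = mean_absolute_error(y_true, y_pred)    # same as np.abs(y_pred - y_true).mean()
r2 = r2_score(y_true, y_pred)                # what regressor.score() reports

print("MAE in dollars:", mae * 1000)
print("R^2:", r2)
```

Neither tree in this notebook sets `random_state`, so re-running the cells may give slightly different scores. Fixing the seed makes the comparison between the two criteria reproducible.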
