# Topic 5 - Exercise
## Decision trees

The Carseats dataset, included in the ISLR library, contains sales data for child car seats at 400 stores.
Detailed information on each variable in the dataset is available at <br>
https://www.rdocumentation.org/packages/ISLR/versions/1.4/topics/Carseats.

Using this dataset, build a decision tree, with 75% of the sample as the training set, to predict the Sales variable from
the remaining variables, and interpret the results, commenting on the rules obtained.

Before running this exercise, it is advisable to convert Sales into a categorical variable using the ifelse function. This requires
setting a cut point using some predefined criterion (i.e., a value above or below the mean or the median).
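
In Python, the closest analogue of R's `ifelse` is `np.where`. The following is a minimal sketch on a handful of made-up sales values; the names `sales`, `cutoff` and `sales_cat` are illustrative and not part of the exercise.

```python
import numpy as np
import pandas as pd

# Illustrative values only (not the full Carseats data)
sales = pd.Series([9.5, 11.22, 10.06, 7.4, 4.15, 0.0], name="Sales")

# Cut point: the mean here; the median would work the same way
cutoff = sales.mean()

# Vectorised equivalent of R's ifelse(Sales > cutoff, "Yes", "No")
sales_cat = pd.Series(np.where(sales > cutoff, "Yes", "No"), name="SalesCat")
print(sales_cat.tolist())
```

Because the comparison is evaluated element by element, values exactly at zero are labelled like any other value below the cut point.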


We import the dependencies

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import matplotlib.patches as mpatches
import seaborn as sb

%matplotlib inline
plt.rcParams['figure.figsize'] = (16, 9)
plt.style.use('ggplot')

from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
#in the doc: "Able to handle both numerical and categorical data. However, the scikit-learn implementation does not support categorical variables for now."
from sklearn.preprocessing import OneHotEncoder  

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [2]:
# Suppress warnings:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

## Step 1: import the data

We import the file (previously exported from RStudio)

In [3]:
carseats_0 = pd.read_csv(r"carseats.csv",sep=',')

## Step 2: process the data

In [4]:
carseats_0.describe()

Unnamed: 0.1,Unnamed: 0,Sales,CompPrice,Income,Advertising,Population,Price,Age,Education
count,400.0,400.0,400.0,400.0,400.0,400.0,400.0,400.0,400.0
mean,200.5,7.496325,124.975,68.6575,6.635,264.84,115.795,53.3225,13.9
std,115.614301,2.824115,15.334512,27.986037,6.650364,147.376436,23.676664,16.200297,2.620528
min,1.0,0.0,77.0,21.0,0.0,10.0,24.0,25.0,10.0
25%,100.75,5.39,115.0,42.75,0.0,139.0,100.0,39.75,12.0
50%,200.5,7.49,125.0,69.0,5.0,272.0,117.0,54.5,14.0
75%,300.25,9.32,135.0,91.0,12.0,398.5,131.0,66.0,16.0
max,400.0,16.27,175.0,120.0,29.0,509.0,191.0,80.0,18.0


We can see that there is an "Unnamed" field, and that the categorical variables do not appear. <br>
To show some statistics for them:

In [5]:
# Summary statistics for the categorical (object) columns
carseats_0.describe(include='object')

Unnamed: 0,ShelveLoc,Urban,US
count,400,400,400
unique,3,2,2
top,Medium,Yes,Yes
freq,219,282,258


The first thing to do is drop the first column, which is just a row counter (1 to 400) and carries no information:

In [6]:
#remove first column
carseats = carseats_0.drop(carseats_0.columns[0], axis=1)
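
An alternative, sketched here with a small in-memory CSV rather than the real file, is to tell `read_csv` that the first column holds the row labels. The `csv_text` string is a hypothetical stand-in for `carseats.csv`, assuming the file has an empty header for its first column (as R's `write.csv` produces by default).

```python
import io
import pandas as pd

# Stand-in for carseats.csv: row names sit in an unnamed first column
csv_text = '"","Sales","Price","US"\n"1",9.5,120,"Yes"\n"2",11.22,83,"Yes"\n'

# Default read: the row names show up as a column called "Unnamed: 0"
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())

# index_col=0 uses that column as the index instead of as data
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(clean.columns.tolist())
```

With this approach the index starts at 1 instead of 0, so any later index-based `join` needs both frames to share the same index.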

In [7]:
carseats.head(5)

Unnamed: 0,Sales,CompPrice,Income,Advertising,Population,Price,ShelveLoc,Age,Education,Urban,US
0,9.5,138,73,11,276,120,Bad,42,17,Yes,Yes
1,11.22,111,48,16,260,83,Good,65,10,Yes,Yes
2,10.06,113,35,10,269,80,Medium,59,12,Yes,Yes
3,7.4,117,100,4,466,97,Medium,55,14,Yes,Yes
4,4.15,141,64,3,340,128,Bad,38,13,Yes,No


Because the scikit-learn implementation cannot handle categorical variables, the 3 categorical variables must be transformed using **"one hot encoding"**

In [8]:
carseats_cat = carseats.select_dtypes(include='object')
carseats_cat.head(5)

Unnamed: 0,ShelveLoc,Urban,US
0,Bad,Yes,Yes
1,Good,Yes,Yes
2,Medium,Yes,Yes
3,Medium,Yes,Yes
4,Bad,Yes,No


In [9]:
encoder = OneHotEncoder(sparse_output=False, handle_unknown='error')

In [10]:
carseats_cat_encoded = encoder.fit_transform(carseats_cat)
carseats_cat_encoded

array([[1., 0., 0., ..., 1., 0., 1.],
       [0., 1., 0., ..., 1., 0., 1.],
       [0., 0., 1., ..., 1., 0., 1.],
       ...,
       [0., 0., 1., ..., 1., 0., 1.],
       [1., 0., 0., ..., 1., 0., 1.],
       [0., 1., 0., ..., 1., 0., 1.]], shape=(400, 7))

In [11]:
#categorical columns
carseats_cat.columns

Index(['ShelveLoc', 'Urban', 'US'], dtype='object')

In [12]:
#Categories the encoder found:
for cat in encoder.categories_:
    print(cat)

['Bad' 'Good' 'Medium']
['No' 'Yes']
['No' 'Yes']


In [13]:
# To combine both:
#
# categorical_columns = []
# for i, col in enumerate(carseats_cat.columns):
#     for cat in encoder.categories_[i]:
#         #print(f"{col}_{cat}"),
#         categorical_columns.append(f"{col}_{cat}")
#
# categorical_columns

In [14]:
#more pythonic style:
categorical_columns = [f"{col}_{cat}" for i, col in enumerate(carseats_cat.columns) for cat in encoder.categories_[i]]
categorical_columns

['ShelveLoc_Bad',
 'ShelveLoc_Good',
 'ShelveLoc_Medium',
 'Urban_No',
 'Urban_Yes',
 'US_No',
 'US_Yes']

In [15]:
#put the one-hot encoded features into their own dataframe
one_hot_features = pd.DataFrame(carseats_cat_encoded, columns=categorical_columns)
one_hot_features.head(5)

Unnamed: 0,ShelveLoc_Bad,ShelveLoc_Good,ShelveLoc_Medium,Urban_No,Urban_Yes,US_No,US_Yes
0,1.0,0.0,0.0,0.0,1.0,0.0,1.0
1,0.0,1.0,0.0,0.0,1.0,0.0,1.0
2,0.0,0.0,1.0,0.0,1.0,0.0,1.0
3,0.0,0.0,1.0,0.0,1.0,0.0,1.0
4,1.0,0.0,0.0,0.0,1.0,1.0,0.0


In [16]:
carseats = carseats.select_dtypes(exclude='object')
carseats.head(5)

Unnamed: 0,Sales,CompPrice,Income,Advertising,Population,Price,Age,Education
0,9.5,138,73,11,276,120,42,17
1,11.22,111,48,16,260,83,65,10
2,10.06,113,35,10,269,80,59,12
3,7.4,117,100,4,466,97,55,14
4,4.15,141,64,3,340,128,38,13


In [17]:
#Putting all together
carseats =  carseats.join(one_hot_features)
carseats.head(5)

Unnamed: 0,Sales,CompPrice,Income,Advertising,Population,Price,Age,Education,ShelveLoc_Bad,ShelveLoc_Good,ShelveLoc_Medium,Urban_No,Urban_Yes,US_No,US_Yes
0,9.5,138,73,11,276,120,42,17,1.0,0.0,0.0,0.0,1.0,0.0,1.0
1,11.22,111,48,16,260,83,65,10,0.0,1.0,0.0,0.0,1.0,0.0,1.0
2,10.06,113,35,10,269,80,59,12,0.0,0.0,1.0,0.0,1.0,0.0,1.0
3,7.4,117,100,4,466,97,55,14,0.0,0.0,1.0,0.0,1.0,0.0,1.0
4,4.15,141,64,3,340,128,38,13,1.0,0.0,0.0,0.0,1.0,1.0,0.0
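
For exploratory work, the encode, name and join steps can be sketched more compactly with `pd.get_dummies`. This is an alternative to the approach used here, not a replacement for it; the frame `df` below holds four rows and three columns taken from the head of the dataset.

```python
import pandas as pd

# Rows 0, 1, 2 and 4 of the dataset, reduced to three columns for brevity
df = pd.DataFrame({
    "Sales": [9.5, 11.22, 10.06, 4.15],
    "ShelveLoc": ["Bad", "Good", "Medium", "Bad"],
    "US": ["Yes", "Yes", "Yes", "No"],
})

# Encode the listed columns and leave the numeric ones untouched
encoded = pd.get_dummies(df, columns=["ShelveLoc", "US"], dtype=float)
print(encoded.columns.tolist())
```

Unlike a fitted `OneHotEncoder`, `get_dummies` does not remember the categories it saw, so the encoder remains the safer choice when new data has to be transformed later with the same columns.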


In [18]:
carseats.describe()

Unnamed: 0,Sales,CompPrice,Income,Advertising,Population,Price,Age,Education,ShelveLoc_Bad,ShelveLoc_Good,ShelveLoc_Medium,Urban_No,Urban_Yes,US_No,US_Yes
count,400.0,400.0,400.0,400.0,400.0,400.0,400.0,400.0,400.0,400.0,400.0,400.0,400.0,400.0,400.0
mean,7.496325,124.975,68.6575,6.635,264.84,115.795,53.3225,13.9,0.24,0.2125,0.5475,0.295,0.705,0.355,0.645
std,2.824115,15.334512,27.986037,6.650364,147.376436,23.676664,16.200297,2.620528,0.427618,0.409589,0.498362,0.456614,0.456614,0.479113,0.479113
min,0.0,77.0,21.0,0.0,10.0,24.0,25.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,5.39,115.0,42.75,0.0,139.0,100.0,39.75,12.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,7.49,125.0,69.0,5.0,272.0,117.0,54.5,14.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0
75%,9.32,135.0,91.0,12.0,398.5,131.0,66.0,16.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0
max,16.27,175.0,120.0,29.0,509.0,191.0,80.0,18.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


Finally, we need to transform the numeric variable "Sales" into a categorical "Yes"/"No" variable. As in R, if the value is above the mean, the value will be "Yes".

In [19]:
sales_mean = np.mean(carseats["Sales"])
sales_mean

np.float64(7.496325000000001)

In [20]:
sales_max = np.max(carseats["Sales"])
sales_max

16.27

In [21]:
#use pd.cut
carseats["SalesCat"] = pd.cut(x=carseats["Sales"], bins=[0.0, sales_mean, sales_max], labels=["No", "Yes"])
carseats["SalesCat"].head(175)

0      Yes
1      Yes
2      Yes
3       No
4       No
      ... 
170    Yes
171    Yes
172    Yes
173     No
174    NaN
Name: SalesCat, Length: 175, dtype: category
Categories (2, object): ['No' < 'Yes']

In [22]:
carseats.loc[pd.isna(carseats["SalesCat"]), :].index

Index([174], dtype='int64')

In [23]:
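# pd.cut builds right-closed intervals by default, so the first bin is (0.0, sales_mean]
# and does not contain its left edge. The 0.0 in row 174 falls outside every bin and becomes NaN:
# ..
# 173     No
# 174    NaN
# (Passing include_lowest=True to pd.cut would avoid this.)
# As the original value was 0.0 < sales_mean, we assign "No" manually
carseats.loc[174, "SalesCat"] = "No"
carseats.loc[174, "SalesCat"]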

'No'

In [24]:
carseats["SalesCat"].head(175)

0      Yes
1      Yes
2      Yes
3       No
4       No
      ... 
170    Yes
171    Yes
172    Yes
173     No
174     No
Name: SalesCat, Length: 175, dtype: category
Categories (2, object): ['No' < 'Yes']
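
The missing value comes from how `pd.cut` treats bin edges: intervals are closed on the right and open on the left unless told otherwise. A minimal sketch with illustrative values (the `cutoff` of 7.5 is a hypothetical stand-in for the mean) shows the behaviour and the `include_lowest=True` option:

```python
import pandas as pd

# Illustrative values; 0.0 sits exactly on the lowest bin edge
sales = pd.Series([9.5, 7.4, 0.0, 16.27])
cutoff = 7.5  # hypothetical cut point standing in for the mean

bins = [0.0, cutoff, sales.max()]

# Default: bins are (0.0, 7.5] and (7.5, 16.27], so 0.0 is left out
default_cut = pd.cut(sales, bins=bins, labels=["No", "Yes"])
print(default_cut.isna().tolist())

# include_lowest=True closes the first interval on the left
fixed_cut = pd.cut(sales, bins=bins, labels=["No", "Yes"], include_lowest=True)
print(fixed_cut.tolist())
```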

In [25]:
carseats.head(5)

Unnamed: 0,Sales,CompPrice,Income,Advertising,Population,Price,Age,Education,ShelveLoc_Bad,ShelveLoc_Good,ShelveLoc_Medium,Urban_No,Urban_Yes,US_No,US_Yes,SalesCat
0,9.5,138,73,11,276,120,42,17,1.0,0.0,0.0,0.0,1.0,0.0,1.0,Yes
1,11.22,111,48,16,260,83,65,10,0.0,1.0,0.0,0.0,1.0,0.0,1.0,Yes
2,10.06,113,35,10,269,80,59,12,0.0,0.0,1.0,0.0,1.0,0.0,1.0,Yes
3,7.4,117,100,4,466,97,55,14,0.0,0.0,1.0,0.0,1.0,0.0,1.0,No
4,4.15,141,64,3,340,128,38,13,1.0,0.0,0.0,0.0,1.0,1.0,0.0,No


In [26]:
# features and target
y = carseats["SalesCat"]
X = carseats.drop(columns=["Sales", "SalesCat"])

In [27]:
#to check if there are nan
#y.isnull().any().any()

Now we can split the data into two separate sets (training and test)

In [28]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=10, shuffle=True, stratify=y)
print(f"x_train.shape: {x_train.shape}, x_test.shape: {x_test.shape}, y_train.shape: {y_train.shape}, y_test.shape: {y_test.shape}") 

x_train.shape: (300, 14), x_test.shape: (100, 14), y_train.shape: (300,), y_test.shape: (100,)
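
The `stratify=y` argument keeps the class proportions equal in both subsets. A minimal sketch on synthetic labels (a 70/30 split invented for illustration, not the Carseats proportions) makes the effect visible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 70 "No" and 30 "Yes" (illustrative only)
y = np.array(["No"] * 70 + ["Yes"] * 30)
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=10, stratify=y
)

# With stratify=y both subsets keep the original share of "Yes" labels
train_share = np.mean(y_tr == "Yes")
test_share = np.mean(y_te == "Yes")
print(train_share, test_share)
```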


## Steps 3 and 4: training and evaluation

In [29]:
def classifier_testing(clf, X_train, X_test, y_train, y_test):
    # Training
    clf.fit(X_train, y_train)

    #Predictions
    y_pred = clf.predict(X_test)

    #Accuracy
    clf_accuracy_score = accuracy_score(y_test, y_pred)
    print("Accuracy Score:\n", clf_accuracy_score, "\n")

    #Classification Report
    class_rep = classification_report(y_test, y_pred)
    print("Classification Report:\n", class_rep, "\n")

    #Confusion Matrix
    conf_mtx = confusion_matrix(y_test, y_pred)
    print("Confusion Matrix:\n", conf_mtx, "\n")

In [30]:
#dtree = DecisionTreeClassifier()
dtree = DecisionTreeClassifier(criterion='entropy')    # with gini (default option) an earlier run scored 0.72; no random_state is set, so results can vary between runs
classifier_testing(dtree, x_train, x_test, y_train, y_test)

Accuracy Score:
 0.72 

Classification Report:
               precision    recall  f1-score   support

          No       0.70      0.76      0.73        50
         Yes       0.74      0.68      0.71        50

    accuracy                           0.72       100
   macro avg       0.72      0.72      0.72       100
weighted avg       0.72      0.72      0.72       100
 

Confusion Matrix:
 [[38 12]
 [16 34]] 



In [31]:
# Same tree, but capped at 100 leaf nodes
dtree_leaf100 = DecisionTreeClassifier(criterion='entropy', max_leaf_nodes=100)
classifier_testing(dtree_leaf100, x_train, x_test, y_train, y_test)

Accuracy Score:
 0.73 

Classification Report:
               precision    recall  f1-score   support

          No       0.72      0.76      0.74        50
         Yes       0.74      0.70      0.72        50

    accuracy                           0.73       100
   macro avg       0.73      0.73      0.73       100
weighted avg       0.73      0.73      0.73       100
 

Confusion Matrix:
 [[38 12]
 [15 35]] 
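
To comment on the rules a tree has learned, `sklearn.tree.export_text` prints them as nested conditions. The sketch below uses a tiny made-up dataset (six rows, with the label depending only on `Price`) rather than the trees fitted above, so the printed rules are illustrative only:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny made-up dataset: the label is "Yes" whenever Price is below 100
X = pd.DataFrame({
    "Price": [80, 90, 95, 110, 120, 130],
    "Advertising": [0, 5, 10, 0, 5, 10],
})
y = np.array(["Yes", "Yes", "Yes", "No", "No", "No"])

small_tree = DecisionTreeClassifier(criterion="entropy", max_leaf_nodes=4, random_state=0)
small_tree.fit(X, y)

# One line per split or leaf, readable as if/else rules
rules = export_text(small_tree, feature_names=list(X.columns))
print(rules)
```

Applied to the fitted classifier and the real column names, the same call would give a text listing of the splits that can be interpreted rule by rule.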



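The result is very similar to the one obtained in R (with small differences depending on the value of max_leaf_nodes). Note that the R output lists predictions in rows and reference values in columns, whereas scikit-learn's confusion_matrix puts the true labels in rows.

In R:<br>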
&emsp; &emsp; &emsp; &emsp; Reference<br>
&emsp; Prediction  No Yes<br>
&emsp; &emsp; &emsp; No &emsp; 41  16<br>
&emsp; &emsp; &emsp; Yes &emsp; 9  33<br>
                                        
&emsp; &emsp; Accuracy : 0.7475<br>
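
Both accuracy figures can be recomputed from the confusion matrices shown above, which is a quick check on the comparison. The helper `accuracy_from_cm` is a name introduced here for illustration.

```python
import numpy as np

# scikit-learn layout: rows = true class, columns = predicted class (No, Yes)
cm_sklearn = np.array([[38, 12],
                       [15, 35]])

# R output: rows = prediction, columns = reference (No, Yes)
cm_r = np.array([[41, 16],
                 [9, 33]])

def accuracy_from_cm(cm):
    # Correct predictions sit on the diagonal in either layout
    return np.trace(cm) / cm.sum()

print(round(accuracy_from_cm(cm_sklearn), 4))
print(round(accuracy_from_cm(cm_r), 4))

# Transposing puts the R matrix in scikit-learn's orientation
cm_r_as_sklearn = cm_r.T
```

The R matrix sums to 99 observations rather than 100, so the two test sets are not identical and the accuracies are only roughly comparable.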



## Step 5: improving the model

Here too we will use the "boosting" technique

**AdaBoost**, as in R:

https://scikit-learn.org/stable/auto_examples/ensemble/plot_adaboost_multiclass.html#sphx-glr-auto-examples-ensemble-plot-adaboost-multiclass-py  <br>

In [32]:
from sklearn.ensemble import AdaBoostClassifier

In [33]:
weak_learner = DecisionTreeClassifier(criterion='entropy', max_leaf_nodes=8)
n_estimators = 100

In [34]:
adaboost_clf = AdaBoostClassifier(
    estimator = weak_learner,
    n_estimators = n_estimators,
    random_state = 123,
)

In [35]:
classifier_testing(adaboost_clf, x_train, x_test, y_train, y_test)

Accuracy Score:
 0.81 

Classification Report:
               precision    recall  f1-score   support

          No       0.77      0.88      0.82        50
         Yes       0.86      0.74      0.80        50

    accuracy                           0.81       100
   macro avg       0.82      0.81      0.81       100
weighted avg       0.82      0.81      0.81       100
 

Confusion Matrix:
 [[44  6]
 [13 37]] 



With boosting, the result is somewhat better than in R (0.81 versus 0.7576 accuracy, choosing max_leaf_nodes=8)

In R:<br>
&emsp; &emsp; &emsp; &emsp; Reference<br>
&emsp; Prediction  No Yes<br>
&emsp; &emsp; &emsp; No &emsp; 40  14<br>
&emsp; &emsp; &emsp; Yes &emsp;10  35<br>
                                        
&emsp; &emsp; Accuracy : 0.7576<br>