# Topic 5 - Exercise
## Decision trees

The Carseats dataset, included in the ISLR library, contains data on sales of child car seats at 400 stores. 
Detailed information on each variable in the dataset can be found at <br>
https://www.rdocumentation.org/packages/ISLR/versions/1.4/topics/Carseats.

Using this dataset, build a decision tree, with 75% of the sample as the training set, to predict the Sales variable
from the remaining variables, and interpret the results, commenting on the rules obtained.

Before doing so, it is recommended to convert Sales into a categorical variable using the ifelse function. This requires
choosing a cutoff point according to some predefined criterion (e.g., value above or below the mean or the median).
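
The exercise statement refers to R's `ifelse`; in Python the same binarization can be sketched with `numpy.where`. The snippet below is only an illustration: it assumes the median as the cutoff and uses the five `Sales` values that `carseats.head(5)` prints further down. The column name `High` and the labels `Yes`/`No` are illustrative choices, not part of the exercise.

```python
import numpy as np
import pandas as pd

# First five Sales values of the dataset, used as a toy sample
sample = pd.DataFrame({"Sales": [9.5, 11.22, 10.06, 7.4, 4.15]})

# Cutoff: the median of the sample (the mean would work the same way)
cutoff = sample["Sales"].median()

# Equivalent of R's ifelse(Sales > cutoff, "Yes", "No")
sample["High"] = np.where(sample["Sales"] > cutoff, "Yes", "No")
print(sample)
```

With the median as cutoff the two classes come out roughly balanced on the full dataset, which keeps accuracy easy to interpret.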


Import the dependencies

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import matplotlib.patches as mpatches
import seaborn as sb

%matplotlib inline
plt.rcParams['figure.figsize'] = (16, 9)
plt.style.use('ggplot')

from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
#in the doc: "Able to handle both numerical and categorical data. However, the scikit-learn implementation does not support categorical variables for now."
from sklearn.preprocessing import OneHotEncoder  

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [2]:
# Suppress warnings:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
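
Overriding `warnings.warn` silences every warning in the session, including ones that may point to real problems. A narrower alternative uses the standard library's warning filters. The sketch below assumes that `FutureWarning` messages are the noisy ones, which may not hold for every setup.

```python
import warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")                            # report everything...
    warnings.filterwarnings("ignore", category=FutureWarning)  # ...except FutureWarning
    warnings.warn("this argument will be renamed", FutureWarning)
    warnings.warn("something worth reading", UserWarning)

# Only the UserWarning was recorded
print([w.category.__name__ for w in caught])
```

In a notebook, a single `warnings.filterwarnings("ignore", category=FutureWarning)` call in the first cell would have the same effect for the whole session.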

## Step 1: import the data

We import the file (previously exported from RStudio)

In [3]:
carseats_0 = pd.read_csv(r"carseats.csv",sep=',')


## Step 2: process the data

In [4]:
carseats_0.describe()

,Unnamed: 0,Sales,CompPrice,Income,Advertising,Population,Price,Age,Education
count,400.0,400.0,400.0,400.0,400.0,400.0,400.0,400.0,400.0
mean,200.5,7.496325,124.975,68.6575,6.635,264.84,115.795,53.3225,13.9
std,115.614301,2.824115,15.334512,27.986037,6.650364,147.376436,23.676664,16.200297,2.620528
min,1.0,0.0,77.0,21.0,0.0,10.0,24.0,25.0,10.0
25%,100.75,5.39,115.0,42.75,0.0,139.0,100.0,39.75,12.0
50%,200.5,7.49,125.0,69.0,5.0,272.0,117.0,54.5,14.0
75%,300.25,9.32,135.0,91.0,12.0,398.5,131.0,66.0,16.0
max,400.0,16.27,175.0,120.0,29.0,509.0,191.0,80.0,18.0


The output includes an "Unnamed: 0" field, and the categorical variables do not appear. <br>
To show some statistics about them:

In [5]:
# describe() only summarizes numeric columns by default; ask for the object (string) columns explicitly
carseats_0.describe(include='object')

,ShelveLoc,Urban,US
count,400,400,400
unique,3,2,2
top,Medium,Yes,Yes
freq,219,282,258


The first step is to drop the first column, which only holds the row number (1 to 400) and is of no use for modeling:

In [6]:
#remove first column
carseats = carseats_0.drop(carseats_0.columns[0], axis=1)
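
An alternative is to tell `read_csv` that the first column is the index, so it never shows up as a feature. In the sketch below, the inline CSV is a three-row stand-in for `carseats.csv` with only a few of its columns, so that the snippet runs on its own.

```python
import io
import pandas as pd

# Stand-in for carseats.csv: the row-name column has an empty header,
# as R's write.csv produces by default
csv_text = '''"","Sales","CompPrice","ShelveLoc"
"1",9.5,138,"Bad"
"2",11.22,111,"Good"
"3",10.06,113,"Medium"
'''

# index_col=0 turns the row-name column into the index instead of a feature
df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# The index now starts at 1; reset it so that later joins against frames
# with a default 0-based index line up row by row
df = df.reset_index(drop=True)
print(df.columns.tolist())
```

The `reset_index(drop=True)` call matters in this notebook because the one-hot features are later attached with `join`, which aligns rows on the index.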

In [7]:
carseats.head(5)

,Sales,CompPrice,Income,Advertising,Population,Price,ShelveLoc,Age,Education,Urban,US
0,9.5,138,73,11,276,120,Bad,42,17,Yes,Yes
1,11.22,111,48,16,260,83,Good,65,10,Yes,Yes
2,10.06,113,35,10,269,80,Medium,59,12,Yes,Yes
3,7.4,117,100,4,466,97,Medium,55,14,Yes,Yes
4,4.15,141,64,3,340,128,Bad,38,13,Yes,No


Because the scikit-learn implementation cannot handle categorical variables, the 3 categorical variables have to be transformed using **"one-hot encoding"**

In [8]:
carseats_cat = carseats.select_dtypes(include='object')
carseats_cat

,ShelveLoc,Urban,US
0,Bad,Yes,Yes
1,Good,Yes,Yes
2,Medium,Yes,Yes
3,Medium,Yes,Yes
4,Bad,Yes,No
...,...,...,...
395,Good,Yes,Yes
396,Medium,No,Yes
397,Medium,Yes,Yes
398,Bad,Yes,Yes


In [9]:
encoder = OneHotEncoder(sparse_output=False, handle_unknown='error')

In [10]:
carseats_cat_encoded = encoder.fit_transform(carseats_cat)
carseats_cat_encoded

array([[1., 0., 0., ..., 1., 0., 1.],
       [0., 1., 0., ..., 1., 0., 1.],
       [0., 0., 1., ..., 1., 0., 1.],
       ...,
       [0., 0., 1., ..., 1., 0., 1.],
       [1., 0., 0., ..., 1., 0., 1.],
       [0., 1., 0., ..., 1., 0., 1.]], shape=(400, 7))

In [11]:
carseats_cat.columns

Index(['ShelveLoc', 'Urban', 'US'], dtype='object')

In [12]:
#Categories the encoder found:
for cat in encoder.categories_:
    print(cat)

['Bad' 'Good' 'Medium']
['No' 'Yes']
['No' 'Yes']


In [13]:
# categorical_columns = []
# for i, col in enumerate(carseats_cat.columns):
#     for cat in encoder.categories_[i]:
#         #print(f"{col}_{cat}"),
#         categorical_columns.append(f"{col}_{cat}")
#
# categorical_columns

In [14]:
#more pythonic style:
categorical_columns = [f"{col}_{cat}" for i, col in enumerate(carseats_cat.columns) for cat in encoder.categories_[i]]
categorical_columns

['ShelveLoc_Bad',
 'ShelveLoc_Good',
 'ShelveLoc_Medium',
 'Urban_No',
 'Urban_Yes',
 'US_No',
 'US_Yes']

In [15]:
#put the one-hot encoded features into their own dataframe
one_hot_features = pd.DataFrame(carseats_cat_encoded, columns=categorical_columns)
one_hot_features.head(5)

,ShelveLoc_Bad,ShelveLoc_Good,ShelveLoc_Medium,Urban_No,Urban_Yes,US_No,US_Yes
0,1.0,0.0,0.0,0.0,1.0,0.0,1.0
1,0.0,1.0,0.0,0.0,1.0,0.0,1.0
2,0.0,0.0,1.0,0.0,1.0,0.0,1.0
3,0.0,0.0,1.0,0.0,1.0,0.0,1.0
4,1.0,0.0,0.0,0.0,1.0,1.0,0.0


In [16]:
carseats = carseats.select_dtypes(exclude='object')
carseats.head(5)

,Sales,CompPrice,Income,Advertising,Population,Price,Age,Education
0,9.5,138,73,11,276,120,42,17
1,11.22,111,48,16,260,83,65,10
2,10.06,113,35,10,269,80,59,12
3,7.4,117,100,4,466,97,55,14
4,4.15,141,64,3,340,128,38,13


In [17]:
#Putting all together
carseats =  carseats.join(one_hot_features)
carseats.head(5)

,Sales,CompPrice,Income,Advertising,Population,Price,Age,Education,ShelveLoc_Bad,ShelveLoc_Good,ShelveLoc_Medium,Urban_No,Urban_Yes,US_No,US_Yes
0,9.5,138,73,11,276,120,42,17,1.0,0.0,0.0,0.0,1.0,0.0,1.0
1,11.22,111,48,16,260,83,65,10,0.0,1.0,0.0,0.0,1.0,0.0,1.0
2,10.06,113,35,10,269,80,59,12,0.0,0.0,1.0,0.0,1.0,0.0,1.0
3,7.4,117,100,4,466,97,55,14,0.0,0.0,1.0,0.0,1.0,0.0,1.0
4,4.15,141,64,3,340,128,38,13,1.0,0.0,0.0,0.0,1.0,1.0,0.0
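
For a one-off analysis like this one, `pandas.get_dummies` gives the same layout in a single call. The sketch below uses the five rows printed above, restricted to a few columns. Unlike a fitted `OneHotEncoder`, `get_dummies` only creates columns for the categories present in the data it is given, so it is less suited to encoding a separate test set consistently.

```python
import pandas as pd

# First five rows of the dataset, restricted to four columns
sample = pd.DataFrame({
    "Sales":     [9.5, 11.22, 10.06, 7.4, 4.15],
    "Price":     [120, 83, 80, 97, 128],
    "ShelveLoc": ["Bad", "Good", "Medium", "Medium", "Bad"],
    "US":        ["Yes", "Yes", "Yes", "Yes", "No"],
})

# Numeric columns are kept as they are; each listed column is replaced
# by one 0/1 indicator per category, named <column>_<category>
encoded = pd.get_dummies(sample, columns=["ShelveLoc", "US"], dtype=float)
print(encoded.columns.tolist())
```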


In [18]:
carseats.describe()

,Sales,CompPrice,Income,Advertising,Population,Price,Age,Education,ShelveLoc_Bad,ShelveLoc_Good,ShelveLoc_Medium,Urban_No,Urban_Yes,US_No,US_Yes
count,400.0,400.0,400.0,400.0,400.0,400.0,400.0,400.0,400.0,400.0,400.0,400.0,400.0,400.0,400.0
mean,7.496325,124.975,68.6575,6.635,264.84,115.795,53.3225,13.9,0.24,0.2125,0.5475,0.295,0.705,0.355,0.645
std,2.824115,15.334512,27.986037,6.650364,147.376436,23.676664,16.200297,2.620528,0.427618,0.409589,0.498362,0.456614,0.456614,0.479113,0.479113
min,0.0,77.0,21.0,0.0,10.0,24.0,25.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,5.39,115.0,42.75,0.0,139.0,100.0,39.75,12.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,7.49,125.0,69.0,5.0,272.0,117.0,54.5,14.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0
75%,9.32,135.0,91.0,12.0,398.5,131.0,66.0,16.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0
max,16.27,175.0,120.0,29.0,509.0,191.0,80.0,18.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
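
One quick way to check the encoding: the mean of a 0/1 indicator column is the share of rows in that category, so it should equal the frequency reported earlier by `describe(include='object')` divided by the 400 rows. A small check with the numbers copied from the two outputs:

```python
import math

n_rows = 400

# Counts of the most frequent category, from describe(include='object')
freq = {"ShelveLoc_Medium": 219, "Urban_Yes": 282, "US_Yes": 258}

# Means of the matching indicator columns, from carseats.describe()
means = {"ShelveLoc_Medium": 0.5475, "Urban_Yes": 0.705, "US_Yes": 0.645}

# Each mean should equal count / number of rows
matches = {col: math.isclose(count / n_rows, means[col]) for col, count in freq.items()}
print(matches)
```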


Split the data into two separate sets (training and test)
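
A minimal sketch of this step, assuming scikit-learn's `train_test_split` and the 75% training share requested in the exercise. The data below are random placeholders with the same number of rows as Carseats, not the real dataset. The target name `High`, the median cutoff and `random_state=0` are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder data: 400 rows, a numeric Sales column and two numeric features
data = pd.DataFrame({
    "Sales": rng.normal(7.5, 2.8, size=400),
    "Price": rng.normal(115, 24, size=400),
    "Age":   rng.integers(25, 81, size=400),
})

# Binary target: is Sales above the median or not
data["High"] = np.where(data["Sales"] > data["Sales"].median(), "Yes", "No")

# Sales defines the target, so it must not remain among the features
X = data.drop(columns=["Sales", "High"])
y = data["High"]

# 75% training, 25% test; stratify keeps the class balance in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
print(X_train.shape, X_test.shape)
```

Dropping `Sales` from the features is the important detail: leaving it in would let the tree recover the target from a single split.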

## Step 3: training