# Categorical variables

In this class we will see how to use pandas and scikit-learn to transform categorical variables into something that machine learning models can understand.

We will use a small, hand-built dataset to learn how to use scikit-learn and pandas.

Afterwards, you will apply what you learned to the dataset from the previous class (ecommerce).

In [17]:
#from google.colab import drive # Used to mount our Google Drive
#drive.mount('/content/drive') # Mount our Google Drive

In [18]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [19]:
data = {'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot', 'Warm', 'Warm', 'Hot', 'Hot', 'Cold'],
        'Color': ['Red', 'Yellow','Blue', 'Blue', 'Red', 'Yellow', 'Red', 'Yellow', 'Yellow', 'Blue']}

df = pd.DataFrame(data)
df

Unnamed: 0,Temperature,Color
0,Hot,Red
1,Cold,Yellow
2,Very Hot,Blue
3,Warm,Blue
4,Hot,Red
5,Warm,Yellow
6,Warm,Red
7,Hot,Yellow
8,Hot,Yellow
9,Cold,Blue


In [20]:
df.Color.nunique()

3

## One hot encoding

In this simple case, the Temperature variable can be treated as ordinal because temperature goes from Cold up to Very Hot.

The Color variable, on the other hand, has no natural order, so we cannot treat it as ordinal.

We will apply one-hot encoding to the Color variable.

This can be done with pandas or with scikit-learn's OneHotEncoder.

Let's start with pandas.

pandas provides the get_dummies() function:

In [21]:
dummies=pd.get_dummies(df.Color,dtype=int)
print(type(dummies))
dummies

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,Blue,Red,Yellow
0,0,1,0
1,0,0,1
2,1,0,0
3,1,0,0
4,0,1,0
5,0,0,1
6,0,1,0
7,0,0,1
8,0,0,1
9,1,0,0


How do we add these columns to our dataset?

We can concatenate this DataFrame of dummy variables horizontally with the original one.

In the next class we will look at the concat and merge methods, among others, in more detail.

In [22]:
df_encoded = pd.concat([df, dummies], axis=1) # Concatenate the two DataFrames horizontally with axis=1

df_encoded

Unnamed: 0,Temperature,Color,Blue,Red,Yellow
0,Hot,Red,0,1,0
1,Cold,Yellow,0,0,1
2,Very Hot,Blue,1,0,0
3,Warm,Blue,1,0,0
4,Hot,Red,0,1,0
5,Warm,Yellow,0,0,1
6,Warm,Red,0,1,0
7,Hot,Yellow,0,0,1
8,Hot,Yellow,0,0,1
9,Cold,Blue,1,0,0
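
One detail worth keeping in mind: `pd.concat(..., axis=1)` matches rows by index label, not by position. It works above because both DataFrames share the default 0..9 index. The sketch below uses two made-up toy frames (`left` and `right` are hypothetical names, not part of the class dataset) to show what happens when the indexes differ and one possible way to fix it:

```python
import pandas as pd

# Two toy frames with the same number of rows but different index labels.
left = pd.DataFrame({'Color': ['Red', 'Blue', 'Red']}, index=[10, 11, 12])
right = pd.DataFrame({'Red': [1, 0, 1]})  # default index 0, 1, 2

# Rows are matched by label, so nothing lines up and NaN values appear.
misaligned = pd.concat([left, right], axis=1)
print(misaligned.shape)

# Reusing the index of the original frame restores row-by-row alignment.
right.index = left.index
aligned = pd.concat([left, right], axis=1)
print(aligned)
```

If a DataFrame has been filtered or shuffled before encoding, passing `index=df.index` when building the encoded DataFrame avoids this problem.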


Now we can drop the original column:

In [23]:
df_encoded = df_encoded.drop('Color', axis=1)
df_encoded

Unnamed: 0,Temperature,Blue,Red,Yellow
0,Hot,0,1,0
1,Cold,0,0,1
2,Very Hot,1,0,0
3,Warm,1,0,0
4,Hot,0,1,0
5,Warm,0,0,1
6,Warm,0,1,0
7,Hot,0,0,1
8,Hot,0,0,1
9,Cold,1,0,0


<span style='color:peru'>This could also have been done in a more compact and readable way.</span>

In [24]:
df_encoded = pd.get_dummies(df, columns=['Color'],dtype=int,drop_first=False)
df_encoded

Unnamed: 0,Temperature,Color_Blue,Color_Red,Color_Yellow
0,Hot,0,1,0
1,Cold,0,0,1
2,Very Hot,1,0,0
3,Warm,1,0,0
4,Hot,0,1,0
5,Warm,0,0,1
6,Warm,0,1,0
7,Hot,0,0,1
8,Hot,0,0,1
9,Cold,1,0,0


How do we do the same with scikit-learn?

The OneHotEncoder lives in the preprocessing module:

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder



In [25]:
from sklearn.preprocessing import OneHotEncoder

In [26]:
enc = OneHotEncoder(handle_unknown='ignore', sparse_output=False , drop='first')

In [27]:
# handle_unknown='ignore': if transform finds a category that was not seen during fit, all of its one-hot columns are set to 0
# sparse_output: what does it tell us? what is it used for?
# sparse_output: bool, default=True. Returns a SciPy sparse matrix if True, otherwise a dense NumPy array.

Look up in the documentation:
- What does `handle_unknown='ignore'` mean?
- What is a "sparse" matrix?
- What happens if we set `sparse_output=True`?

We call fit:

<span style='color:peru'>The point here is that fit expects a DataFrame or a numpy.ndarray THAT IS TWO-DIMENSIONAL.
That is why ***df.Color.values.reshape(-1,1)*** is used, but since a DataFrame is also accepted we could pass df[['Color']] instead.</span>

In [28]:
df.Color.values # the dtype is correct, BUT THE ARRAY IS NOT TWO-DIMENSIONAL

array(['Red', 'Yellow', 'Blue', 'Blue', 'Red', 'Yellow', 'Red', 'Yellow',
       'Yellow', 'Blue'], dtype=object)

In [29]:
df.Color.values.shape # it has only ONE dimension, of length 10, which is why the next step is needed

(10,)

In [30]:
df.Color.values.reshape(-1,1).shape # now it has TWO dimensions, one of length 10 and one of length 1.
# The -1 tells reshape to infer the size of dimension 0 automatically; here that is the number of elements in the array.

(10, 1)
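
To make the role of `-1` more concrete, here is a small sketch on a made-up array (`values` is a hypothetical name). NumPy fills in the `-1` dimension so that the total number of elements stays the same:

```python
import numpy as np

values = np.arange(6)           # shape (6,)
column = values.reshape(-1, 1)  # one column, number of rows inferred
pairs = values.reshape(-1, 2)   # two columns, number of rows inferred
print(values.shape, column.shape, pairs.shape)
```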

In [31]:
# We could also have passed a one-column DataFrame directly to fit
df[['Color']]

Unnamed: 0,Color
0,Red
1,Yellow
2,Blue
3,Blue
4,Red
5,Yellow
6,Red
7,Yellow
8,Yellow
9,Blue


In [32]:
type(df[['Color']])

pandas.core.frame.DataFrame

In [33]:
df[['Color']].shape # This also works

(10, 1)

In [34]:
enc.fit(df[['Color']])

In [35]:
df.Color.values.reshape(-1,1)

array([['Red'],
       ['Yellow'],
       ['Blue'],
       ['Blue'],
       ['Red'],
       ['Yellow'],
       ['Red'],
       ['Yellow'],
       ['Yellow'],
       ['Blue']], dtype=object)

In [36]:
df[['Color']]

Unnamed: 0,Color
0,Red
1,Yellow
2,Blue
3,Blue
4,Red
5,Yellow
6,Red
7,Yellow
8,Yellow
9,Blue


- What happens if we remove .reshape(-1, 1)? -----> it fails: fit raises an error because the input is one-dimensional

- What other way can you think of to fix the error without using reshape? <span style='color:peru'>Answered above</span>

In [37]:
encoded_color = enc.transform(df[['Color']])
encoded_color

array([[1., 0.],
       [0., 1.],
       [0., 0.],
       [0., 0.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [0., 0.]])

Now, how do we add this to our DataFrame?

The get_feature_names_out method gives us the names of the newly created features:

In [38]:
enc.get_feature_names_out(['Color'])

array(['Color_Red', 'Color_Yellow'], dtype=object)

In [39]:
encoded_color_columns = enc.get_feature_names_out(['Color'])

In [40]:
encoded_color_df = pd.DataFrame(data=encoded_color, columns= encoded_color_columns)
encoded_color_df

Unnamed: 0,Color_Red,Color_Yellow
0,1.0,0.0
1,0.0,1.0
2,0.0,0.0
3,0.0,0.0
4,1.0,0.0
5,0.0,1.0
6,1.0,0.0
7,0.0,1.0
8,0.0,1.0
9,0.0,0.0


Now, as before, we can concatenate and drop the original column:

In [41]:
pd.concat([df, encoded_color_df], axis=1).drop('Color', axis=1)

Unnamed: 0,Temperature,Color_Red,Color_Yellow
0,Hot,1.0,0.0
1,Cold,0.0,1.0
2,Very Hot,0.0,0.0
3,Warm,0.0,0.0
4,Hot,1.0,0.0
5,Warm,0.0,1.0
6,Warm,1.0,0.0
7,Hot,0.0,1.0
8,Hot,0.0,1.0
9,Cold,0.0,0.0


Often, instead of creating every column, the `drop='first'` parameter is used, as we did when creating the encoder above.

This creates every column except the first one (in our case Color_Blue is not created). No information is lost: if none of the remaining columns is 1, the row must be Blue. It saves us one column.

For binary variables, we can create a single column with:

`drop='if_binary'`

## Label encoder

It is used in much the same way as scikit-learn's OneHotEncoder.

In [42]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder() # create the encoder object

df['Temperature_label_encoded'] = le.fit_transform(df.Temperature)
df

Unnamed: 0,Temperature,Color,Temperature_label_encoded
0,Hot,Red,1
1,Cold,Yellow,0
2,Very Hot,Blue,2
3,Warm,Blue,3
4,Hot,Red,1
5,Warm,Yellow,3
6,Warm,Red,3
7,Hot,Yellow,1
8,Hot,Yellow,1
9,Cold,Blue,0


It should not be used for ordinal data: scikit-learn assigns the integer codes in sorted (here, alphabetical) order, which ignores that we want Cold to be lower than Hot.
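
The order LabelEncoder used can be inspected through its `classes_` attribute. A minimal sketch with the four temperature labels (`le_demo` is a hypothetical name):

```python
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder()
codes = le_demo.fit_transform(['Hot', 'Cold', 'Very Hot', 'Warm'])

# classes_ lists the categories in the order of their integer codes.
print(list(le_demo.classes_))
print(list(codes))
```

Here Very Hot ends up with a lower code than Warm, which is not the order we want for temperatures.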

When we want to choose the numeric value for each category ourselves, we can use the pandas .replace() method.

This method takes a dictionary in which each key is the value we want to transform and each value is the result we want.

Let's look at an example:

In [43]:
df.Temperature.unique()

array(['Hot', 'Cold', 'Very Hot', 'Warm'], dtype=object)

In [44]:
mapping_dict = {
    'Cold': 1,
    'Warm': 2,
    'Hot': 3,
    'Very Hot': 4
}

temperature_ordinal = df.Temperature.replace(mapping_dict)
temperature_ordinal

0    3
1    1
2    4
3    2
4    3
5    2
6    2
7    3
8    3
9    1
Name: Temperature, dtype: int64

In [45]:
df['Temperature_ordinal']=df['Temperature'].replace(mapping_dict)
# df['Temperature_ordinal'] = temperature_ordinal
df

Unnamed: 0,Temperature,Color,Temperature_label_encoded,Temperature_ordinal
0,Hot,Red,1,3
1,Cold,Yellow,0,1
2,Very Hot,Blue,2,4
3,Warm,Blue,3,2
4,Hot,Red,1,3
5,Warm,Yellow,3,2
6,Warm,Red,3,2
7,Hot,Yellow,1,3
8,Hot,Yellow,1,3
9,Cold,Blue,0,1


In [46]:
df.drop('Temperature_label_encoded', axis=1, inplace=True)

In [47]:
df

Unnamed: 0,Temperature,Color,Temperature_ordinal
0,Hot,Red,3
1,Cold,Yellow,1
2,Very Hot,Blue,4
3,Warm,Blue,2
4,Hot,Red,3
5,Warm,Yellow,2
6,Warm,Red,2
7,Hot,Yellow,3
8,Hot,Yellow,3
9,Cold,Blue,1
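
scikit-learn also has an `OrdinalEncoder` that accepts an explicit category order, which may be convenient when the encoding has to be reused on new data. The sketch below is an alternative to the `.replace()` approach above, not the method used in this class; note that its codes start at 0 instead of 1 (`toy`, `order` and `oe` are hypothetical names):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy = pd.DataFrame({'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm']})

# One list per column, ordered from lowest to highest.
order = [['Cold', 'Warm', 'Hot', 'Very Hot']]
oe = OrdinalEncoder(categories=order)
toy['Temperature_ordinal'] = oe.fit_transform(toy[['Temperature']]).ravel().astype(int)
print(toy)
```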


# Discretization

Let's see how to do it with sklearn. For this case we will use another dataset with a continuous variable.

We create the dataset:

In [48]:
variable_continua = np.arange(200)
df_cont = pd.DataFrame({'X': variable_continua})

In [49]:
df_cont.head()

Unnamed: 0,X
0,0
1,1
2,2
3,3
4,4


In [50]:
edades = np.random.randint(10, 91, size=200)
df2=pd.DataFrame({'Edades':edades})
df2

Unnamed: 0,Edades
0,68
1,27
2,10
3,87
4,10
...,...
195,71
196,22
197,43
198,74


We apply KBinsDiscretizer.

We need to pass it the number of bins (n_bins), encode and strategy.

Find out what these parameters mean:

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html

In [51]:
from sklearn.preprocessing import KBinsDiscretizer
est = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy = 'uniform')

In [52]:
# n_bins: number of bins ("boxes") the data is grouped into
# encode: how the result is encoded: 'onehot' (returns a sparse matrix), 'onehot-dense' (returns a dense array), 'ordinal' (returns the bin identifier as an integer value)
# strategy: 'uniform': all bins have the same width
#         : 'quantile': all bins have the same number of points
#         : 'kmeans': values in each bin have the same nearest center of a 1D k-means cluster

So far we applied fit and transform as separate steps with OneHotEncoder. scikit-learn also lets us apply both in a single line with the fit_transform method:

In [53]:
df_cont['discretized'] = est.fit_transform(df_cont[['X']]) # Here we use double brackets [[ ]] instead of reshape(-1, 1)



In [54]:
est.fit_transform(df_cont[['X']])



array([[0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],

In [55]:
df_cont

Unnamed: 0,X,discretized
0,0,0.0
1,1,0.0
2,2,0.0
3,3,0.0
4,4,0.0
...,...,...
195,195,4.0
196,196,4.0
197,197,4.0
198,198,4.0


In [56]:
df_cont.discretized.value_counts()

discretized
0.0    40
1.0    40
2.0    40
3.0    40
4.0    40
Name: count, dtype: int64

In [57]:
ed = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy = 'uniform', subsample=None)
edades_dicretizadas=ed.fit_transform(df2[['Edades']]).astype(int)
df2['discretized']=edades_dicretizadas
df2.head()

Unnamed: 0,Edades,discretized
0,68,3
1,27,1
2,10,0
3,87,4
4,10,0


In [58]:
intervalos=ed.bin_edges_[0]
df2_intervalos = pd.DataFrame({
    'Número de Bin': sorted(df2['discretized'].unique()),
    'Intervalo': list(zip(intervalos[:-1],intervalos[1:]))
})

# centros
# centros = (intervalos[:-1] + intervalos[1:]) / 2

df2_intervalos

Unnamed: 0,Número de Bin,Intervalo
0,0,"(10.0, 26.0)"
1,1,"(26.0, 42.0)"
2,2,"(42.0, 58.0)"
3,3,"(58.0, 74.0)"
4,4,"(74.0, 90.0)"


# Exercise

We are going to load the dataset from the previous class (this time without nulls) and transform the categorical variables.

You will have to use your own judgment to decide when ordinal encoding, one-hot encoding, etc. is the better choice.

Recall that the dataset columns are:


- id: User id
- administrative: Number of times the user visited the "administrative" section
- administrative_duration: Time the user spent in the "administrative" section
- informational: Number of times the user visited the "informational" section
- informational_duration: Time the user spent in the "informational" section
- productrelated: Number of times the user visited the "products related" section
- productrelated_duration: Time the user spent in the "products related" section
- bouncerates: Percentage of visitors who enter the page and leave immediately without interacting with it. This metric only counts when the page is the first one visited on the website.
- exitrates: Of all visits to the pages of the website, the percentage of users who left the site on this page. That is, the percentage of users whose last visit to the site was on this page.
- pagevalues: Average value of the web page; it indicates the contribution this page made to the visitor reaching the final purchase page or section.
- specialday: Whether it is a special date or not (1 or 0)
- operatingsystems: Operating system
- browser: Browser name
- region: Geographic region of the user
- traffictype: Type of web traffic
- visitortype: New visitor or one returning to the site
- Weekend: 1 if it is a weekend and 0 otherwise
- revenue: 1 if the user made a purchase and 0 otherwise

In [59]:
df = pd.read_csv('onlineShopperFix.csv')

In [60]:
df.isna().sum()

Unnamed: 0                 0
id                         0
Administrative             0
Administrative_Duration    0
Informational              0
Informational_Duration     0
ProductRelated             0
ProductRelated_Duration    0
BounceRates                0
ExitRates                  0
PageValues                 0
SpecialDay                 0
Month                      0
OperatingSystems           0
Browser                    0
Region                     0
TrafficType                0
VisitorType                0
Weekend                    0
revenue                    0
dtype: int64

In [61]:
columnas_categoricas = df.select_dtypes(include=['object', 'category'])
df.head()

Unnamed: 0.1,Unnamed: 0,id,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,Month,OperatingSystems,Browser,Region,TrafficType,VisitorType,Weekend,revenue
0,0,1.0,0.0,0.0,0.0,0.0,5.0,81.083333,0.04,0.05,0.0,0.0,Dec,3.0,2.0,1.0,2.0,New_Visitor,0.0,0.0
1,1,2.0,0.0,0.0,0.0,0.0,3.0,189.0,0.02228,0.066667,0.0,0.0,Mar,2.0,2.0,8.0,1.0,Returning_Visitor,0.0,0.0
2,2,3.0,0.0,0.0,1.0,132.0,8.0,445.0,0.0,0.014286,0.0,0.0,Mar,4.0,2.0,4.0,14.0,Returning_Visitor,1.0,0.0
3,3,4.0,0.0,0.0,0.0,0.0,3.0,0.0,0.2,0.2,0.0,0.0,Mar,2.0,8.0,2.0,1.0,Returning_Visitor,0.0,0.0
4,4,5.0,0.0,0.0,0.0,0.0,4.0,14.0,0.1,0.15,0.0,0.0,Mar,3.0,2.0,1.0,1.0,Returning_Visitor,0.0,0.0


Transform the variables:

- Month
- VisitorType
- Weekend

using the methods we learned.

Discretize:
- ExitRates
- BounceRates


Investigate:

- How can I find out the range of values covered by each bin in KBinsDiscretizer? (look for the discretizer's attributes in the documentation)
- What happens if I use encode='onehot' or 'onehot-dense' instead of encode='ordinal'?
- What is the difference between strategy='uniform' and strategy='quantile'?

In [62]:
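# Month
import calendar

mesesdf=df.Month.unique()
print(mesesdf)
meses=list(calendar.month_abbr) # month_abbr[0] is an empty string, followed by 'Jan'..'Dec'
meses[0]='June' # the dataset spells June as 'June', so reuse the empty slot for that spelling
print(meses)
nums=list(range(13))
print(nums)
mesesReemplazo=dict(zip(meses,nums))
mesesReemplazo['June']=6 # zip assigned 0 to 'June'; set it to the correct month number
print(mesesReemplazo)
df['Month_Ordinal']=df['Month'].replace(mesesReemplazo)
df.head()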

['Dec' 'Mar' 'Oct' 'May' 'Nov' 'Aug' 'Jul' 'Sep' 'Feb' 'June']
['June', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
{'June': 6, 'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6, 'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}


Unnamed: 0.1,Unnamed: 0,id,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,...,SpecialDay,Month,OperatingSystems,Browser,Region,TrafficType,VisitorType,Weekend,revenue,Month_Ordinal
0,0,1.0,0.0,0.0,0.0,0.0,5.0,81.083333,0.04,0.05,...,0.0,Dec,3.0,2.0,1.0,2.0,New_Visitor,0.0,0.0,12
1,1,2.0,0.0,0.0,0.0,0.0,3.0,189.0,0.02228,0.066667,...,0.0,Mar,2.0,2.0,8.0,1.0,Returning_Visitor,0.0,0.0,3
2,2,3.0,0.0,0.0,1.0,132.0,8.0,445.0,0.0,0.014286,...,0.0,Mar,4.0,2.0,4.0,14.0,Returning_Visitor,1.0,0.0,3
3,3,4.0,0.0,0.0,0.0,0.0,3.0,0.0,0.2,0.2,...,0.0,Mar,2.0,8.0,2.0,1.0,Returning_Visitor,0.0,0.0,3
4,4,5.0,0.0,0.0,0.0,0.0,4.0,14.0,0.1,0.15,...,0.0,Mar,3.0,2.0,1.0,1.0,Returning_Visitor,0.0,0.0,3
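
The same month dictionary can be built in a slightly more direct way. This is only a sketch on a made-up Series (`months` and `month_to_num` are hypothetical names); it assumes an English locale, since `calendar.month_abbr` follows the current locale, and it uses `.map()` instead of `.replace()`:

```python
import calendar
import pandas as pd

# 'Jan'..'Dec' -> 1..12; month_abbr[0] is an empty string and is skipped.
month_to_num = {abbr: i for i, abbr in enumerate(calendar.month_abbr) if abbr}
month_to_num['June'] = 6  # extra spelling used in the dataset

months = pd.Series(['Dec', 'Mar', 'June', 'Feb'])
month_ordinal = months.map(month_to_num)
print(month_ordinal.tolist())
```

Unlike `.replace()`, `.map()` turns any value missing from the dictionary into NaN, which makes unexpected spellings easier to spot.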


In [63]:
df['VisitorType'].unique()

array(['New_Visitor', 'Returning_Visitor', 'Other'], dtype=object)

In [64]:
reemp={'New_Visitor':0,'Returning_Visitor':1,'Other':2}
df['VisitorType_catg']=df['VisitorType'].replace(reemp)
df.head()

Unnamed: 0.1,Unnamed: 0,id,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,...,Month,OperatingSystems,Browser,Region,TrafficType,VisitorType,Weekend,revenue,Month_Ordinal,VisitorType_catg
0,0,1.0,0.0,0.0,0.0,0.0,5.0,81.083333,0.04,0.05,...,Dec,3.0,2.0,1.0,2.0,New_Visitor,0.0,0.0,12,0
1,1,2.0,0.0,0.0,0.0,0.0,3.0,189.0,0.02228,0.066667,...,Mar,2.0,2.0,8.0,1.0,Returning_Visitor,0.0,0.0,3,1
2,2,3.0,0.0,0.0,1.0,132.0,8.0,445.0,0.0,0.014286,...,Mar,4.0,2.0,4.0,14.0,Returning_Visitor,1.0,0.0,3,1
3,3,4.0,0.0,0.0,0.0,0.0,3.0,0.0,0.2,0.2,...,Mar,2.0,8.0,2.0,1.0,Returning_Visitor,0.0,0.0,3,1
4,4,5.0,0.0,0.0,0.0,0.0,4.0,14.0,0.1,0.15,...,Mar,3.0,2.0,1.0,1.0,Returning_Visitor,0.0,0.0,3,1
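
Mapping VisitorType to 0, 1 and 2 implies an order between the categories that the data does not obviously have. Since the exercise leaves the choice to your own judgment, one alternative worth comparing is one-hot encoding. A minimal sketch on a made-up frame that reuses the three category names (`visitors` and `visitor_dummies` are hypothetical names):

```python
import pandas as pd

visitors = pd.DataFrame({
    'VisitorType': ['New_Visitor', 'Returning_Visitor', 'Other', 'Returning_Visitor']
})

# One 0/1 column per category, with no implied order between them.
visitor_dummies = pd.get_dummies(visitors, columns=['VisitorType'], dtype=int)
print(visitor_dummies)
```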


In [65]:
df['Weekend'].unique()

array([0., 1.])

In [66]:
df.BounceRates.describe()

count    8251.000000
mean        0.021526
std         0.047036
min         0.000000
25%         0.000000
50%         0.003765
75%         0.016667
max         0.200000
Name: BounceRates, dtype: float64

In [67]:
bc = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy = 'uniform', subsample=None)
bounce_discretized=bc.fit_transform(df[['BounceRates']]).astype(int)
df['Bounce_disc']=bounce_discretized
df.head()

Unnamed: 0.1,Unnamed: 0,id,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,...,OperatingSystems,Browser,Region,TrafficType,VisitorType,Weekend,revenue,Month_Ordinal,VisitorType_catg,Bounce_disc
0,0,1.0,0.0,0.0,0.0,0.0,5.0,81.083333,0.04,0.05,...,3.0,2.0,1.0,2.0,New_Visitor,0.0,0.0,12,0,2
1,1,2.0,0.0,0.0,0.0,0.0,3.0,189.0,0.02228,0.066667,...,2.0,2.0,8.0,1.0,Returning_Visitor,0.0,0.0,3,1,1
2,2,3.0,0.0,0.0,1.0,132.0,8.0,445.0,0.0,0.014286,...,4.0,2.0,4.0,14.0,Returning_Visitor,1.0,0.0,3,1,0
3,3,4.0,0.0,0.0,0.0,0.0,3.0,0.0,0.2,0.2,...,2.0,8.0,2.0,1.0,Returning_Visitor,0.0,0.0,3,1,9
4,4,5.0,0.0,0.0,0.0,0.0,4.0,14.0,0.1,0.15,...,3.0,2.0,1.0,1.0,Returning_Visitor,0.0,0.0,3,1,5


In [70]:
intervalos_bc=bc.bin_edges_[0]
df_intervalos_bc = pd.DataFrame({
    'Número de Bin': sorted(df['Bounce_disc'].unique()),
    'Intervalo': list(zip(intervalos_bc[:-1],intervalos_bc[1:]))
})
df_intervalos_bc

Unnamed: 0,Número de Bin,Intervalo
0,0,"(0.0, 0.02)"
1,1,"(0.02, 0.04)"
2,2,"(0.04, 0.06)"
3,3,"(0.06, 0.08)"
4,4,"(0.08, 0.1)"
5,5,"(0.1, 0.12)"
6,6,"(0.12, 0.14)"
7,7,"(0.14, 0.16)"
8,8,"(0.16, 0.18)"
9,9,"(0.18, 0.2)"


In [69]:
df.ExitRates.describe()

count    8251.000000
mean        0.042420
std         0.047457
min         0.000000
25%         0.014667
50%         0.025000
75%         0.048947
max         0.201960
Name: ExitRates, dtype: float64

In [71]:
ex = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy = 'uniform', subsample=None)
exit_discretized=ex.fit_transform(df[['ExitRates']]).astype(int)
df['Exit_disc']=exit_discretized
df.head()

Unnamed: 0.1,Unnamed: 0,id,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,...,Browser,Region,TrafficType,VisitorType,Weekend,revenue,Month_Ordinal,VisitorType_catg,Bounce_disc,Exit_disc
0,0,1.0,0.0,0.0,0.0,0.0,5.0,81.083333,0.04,0.05,...,2.0,1.0,2.0,New_Visitor,0.0,0.0,12,0,2,2
1,1,2.0,0.0,0.0,0.0,0.0,3.0,189.0,0.02228,0.066667,...,2.0,8.0,1.0,Returning_Visitor,0.0,0.0,3,1,1,3
2,2,3.0,0.0,0.0,1.0,132.0,8.0,445.0,0.0,0.014286,...,2.0,4.0,14.0,Returning_Visitor,1.0,0.0,3,1,0,0
3,3,4.0,0.0,0.0,0.0,0.0,3.0,0.0,0.2,0.2,...,8.0,2.0,1.0,Returning_Visitor,0.0,0.0,3,1,9,9
4,4,5.0,0.0,0.0,0.0,0.0,4.0,14.0,0.1,0.15,...,2.0,1.0,1.0,Returning_Visitor,0.0,0.0,3,1,5,7


In [75]:
intervalos_ex=ex.bin_edges_[0]
df_intervalos_ex = pd.DataFrame({
    'Número de Bin': sorted(df['Exit_disc'].unique()),
    # This time, unlike the previous one, the bin edges need to be rounded.
    'Intervalo': [(round(a,2),round(b,2)) for a,b in list(zip(intervalos_ex[:-1],intervalos_ex[1:]))] 
})
df_intervalos_ex

Unnamed: 0,Número de Bin,Intervalo
0,0,"(0.0, 0.02)"
1,1,"(0.02, 0.04)"
2,2,"(0.04, 0.06)"
3,3,"(0.06, 0.08)"
4,4,"(0.08, 0.1)"
5,5,"(0.1, 0.12)"
6,6,"(0.12, 0.14)"
7,7,"(0.14, 0.16)"
8,8,"(0.16, 0.18)"
9,9,"(0.18, 0.2)"
