![imagen](data/foto1.png)

### KERAS PREPROCESSING LAYERS

Just as sklearn has its "transformers" and various functions for processing data, which we can then include in a pipeline if we want, Keras has preprocessing "layers" that we can include in the model in an analogous way.

### The problem and the dataset

In this case our problem is to classify a series of drugs from numerical features and text features, which we will have to convert before we can use them.

We are going to work with a dataset of medicine reviews that we have used before, although back then we did not use the text fields. We will preprocess it with the Keras layers equivalent to what we already used with sklearn. That will let us introduce Embedding layers and, from there, review an example of natural-language text classification, this time with DL models.

First, the imports, loading the dataset, and a quick look at it:

In [1]:
import numpy as np
import pandas as pd
import random as rm
import re
import tensorflow as tf

from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split


In [2]:
df = pd.read_csv("./data/pharma_full.csv")
df.head()

Unnamed: 0,urlDrugName,rating,effectiveness,sideEffects,condition,benefitsReview,sideEffectsReview,commentsReview,Sales,Production
0,enalapril,4,Highly Effective,Mild Side Effects,management of congestive heart failure,slowed the progression of left ventricular dys...,"cough, hypotension , proteinuria, impotence , ...","monitor blood pressure , weight and asses for ...",318440,398.0
1,ortho-tri-cyclen,1,Highly Effective,Severe Side Effects,birth prevention,Although this type of birth control has more c...,"Heavy Cycle, Cramps, Hot Flashes, Fatigue, Lon...","I Hate This Birth Control, I Would Not Suggest...",888949,909.0
2,ponstel,10,Highly Effective,No Side Effects,menstrual cramps,I was used to having cramps so badly that they...,Heavier bleeding and clotting than normal.,I took 2 pills at the onset of my menstrual cr...,264077,465.0
3,prilosec,3,Marginally Effective,Mild Side Effects,acid reflux,The acid reflux went away for a few months aft...,"Constipation, dry mouth and some mild dizzines...",I was given Prilosec prescription at a dose of...,542110,602.0
4,lyrica,2,Marginally Effective,Severe Side Effects,fibromyalgia,I think that the Lyrica was starting to help w...,I felt extremely drugged and dopey. Could not...,See above,83761,124.0


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3107 entries, 0 to 3106
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   urlDrugName        3107 non-null   object 
 1   rating             3107 non-null   int64  
 2   effectiveness      3107 non-null   object 
 3   sideEffects        3107 non-null   object 
 4   condition          3106 non-null   object 
 5   benefitsReview     3089 non-null   object 
 6   sideEffectsReview  3032 non-null   object 
 7   commentsReview     3095 non-null   object 
 8   Sales              3107 non-null   int64  
 9   Production         3107 non-null   float64
dtypes: float64(1), int64(2), object(7)
memory usage: 242.9+ KB


We will prepare the columns a bit so that we can use every type of layer.

For now, we keep handling missing values separately.

In [4]:
df_clean = df.fillna("No Value")

In [5]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3107 entries, 0 to 3106
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   urlDrugName        3107 non-null   object 
 1   rating             3107 non-null   int64  
 2   effectiveness      3107 non-null   object 
 3   sideEffects        3107 non-null   object 
 4   condition          3107 non-null   object 
 5   benefitsReview     3107 non-null   object 
 6   sideEffectsReview  3107 non-null   object 
 7   commentsReview     3107 non-null   object 
 8   Sales              3107 non-null   int64  
 9   Production         3107 non-null   float64
dtypes: float64(1), int64(2), object(7)
memory usage: 242.9+ KB


Clearly, the target is "rating".

In [6]:
target = "rating"

### Split train/test

In [7]:
train_set, test_set  = train_test_split(df_clean, test_size= 0.2, random_state= 42)

In [8]:
train_set.shape

(2485, 10)

In [9]:
test_set.shape

(622, 10)

### Mini EDA

From the mini EDA we will only analyze the target; we keep all the other variables and treat them as features.

In [10]:
train_set[target].value_counts()

Unnamed: 0_level_0,count
rating,Unnamed: 1_level_1
10,591
8,464
9,370
7,267
1,243
5,132
3,123
6,119
4,89
2,87


The dataset is imbalanced in the target. Later we will see how to compensate for this as far as possible without applying SMOTE or undersampling (which, with text fields, are generally even less useful).

As you probably know, Keras expects class labels to start at zero, so we fix that quickly:

In [11]:
train_set[target] = train_set[target]-1
test_set[target] = test_set[target] - 1

### Feature processing

We are going to do the following:  
1- Create new features (word counts per free-text field)  
2- Scale the numerical features  
3- Convert the categorical features  
4- Vectorize the text features  

#### #1 Creating new features

There are several free-text fields that we will vectorize shortly, but first it may be useful to create new features that count the approximate number of words in each of those fields. These are the fields whose names contain the word "Review".

In [12]:
for col in [col for col in train_set.columns if "Review" in col]:
    train_set[col + "_wc"] = train_set[col].apply(lambda value: len(value.split()))
    test_set[col + "_wc"] = test_set[col].apply(lambda value: len(value.split()))


In [13]:
train_set

Unnamed: 0,urlDrugName,rating,effectiveness,sideEffects,condition,benefitsReview,sideEffectsReview,commentsReview,Sales,Production,benefitsReview_wc,sideEffectsReview_wc,commentsReview_wc
2686,zoloft,9,Highly Effective,No Side Effects,anxiety,becoming a normal functioning individual. it h...,I had no real side effects unless you consider...,taken once daily.,135166,335.0,62,15,3
1192,adipex-p,9,Highly Effective,Mild Side Effects,to lose 30 pounds,When BMI rating is over 27 its time to get tha...,jittery.. sometimes euphoric.. tired but cant ...,last results 3 months on adipex-p lost 30 poun...,739564,939.0,163,10,248
2738,tri-luma,7,Moderately Effective,Severe Side Effects,hyperpigmentation,lightening of acne scars,"redness, dryness, breakdown of skin",this topical compound was used to reduce hyper...,683247,843.0,4,5,42
2780,yasmin,2,Highly Effective,Moderate Side Effects,birth control,There was no chance of pregancy because i lost...,no libido what so ever. My breasts hurt like n...,I was given this to prevent pregnancy and i am...,222295,282.0,22,50,27
1735,requip,7,Considerably Effective,No Side Effects,restless leg syndrome,I took this medication to treat restless leg s...,"I did need to take more as time went on, and I...",I took one milligram at night to treat the sym...,344748,505.0,93,69,106
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3092,zyrtec,9,Highly Effective,Mild Side Effects,atopic eczema,The drug was very effective in counteracting s...,I felt slightly tired after taking Zyrtec. How...,This is a non-prescription drug and I took it ...,156028,356.0,74,54,51
1095,neurontin,0,Ineffective,Severe Side Effects,all over and various type pain,No Value,"Nausea, extreme sleepiness,severe headache, ji...",Day one I felt negative side effects after tak...,317886,337.0,2,10,120
1130,wellbutrin,6,Moderately Effective,Mild Side Effects,depression,"I experienced less depression, increased libid...",I have a dry mouth and eyes. Drug has decrease...,"Prescribed at 300 mg per day, have taken as mu...",556187,696.0,15,15,24
1294,wellbutrin,2,Considerably Effective,Moderate Side Effects,stop smoking,"The treatment was effective in reducing my ""ap...","After a while, I began to notice that NOTHING ...",I was taking the Bupropion as an aide to break...,728338,789.0,44,105,92


***

#### Scaling/Normalization

First we need to identify the numerical columns.

In [14]:
train_set.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2485 entries, 2686 to 860
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   urlDrugName           2485 non-null   object 
 1   rating                2485 non-null   int64  
 2   effectiveness         2485 non-null   object 
 3   sideEffects           2485 non-null   object 
 4   condition             2485 non-null   object 
 5   benefitsReview        2485 non-null   object 
 6   sideEffectsReview     2485 non-null   object 
 7   commentsReview        2485 non-null   object 
 8   Sales                 2485 non-null   int64  
 9   Production            2485 non-null   float64
 10  benefitsReview_wc     2485 non-null   int64  
 11  sideEffectsReview_wc  2485 non-null   int64  
 12  commentsReview_wc     2485 non-null   int64  
dtypes: float64(1), int64(5), object(7)
memory usage: 271.8+ KB


They are clearly the ones whose dtype is not `object`.

In [15]:
numericas = [col for col in train_set.columns if train_set[col].dtype != "object" and col != target]
numericas

['Sales',
 'Production',
 'benefitsReview_wc',
 'sideEffectsReview_wc',
 'commentsReview_wc']

Now we will normalize them using Keras, and we will also train a baseline model using only these numerical features.

### Normalization Layer
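
Before training, it may help to see what this layer is expected to compute. The following is a minimal plain-Python sketch, not the Keras implementation: an `adapt`-like step learns a mean and a variance per column, and the layer then applies `(x - mean) / sqrt(var)`. The helper names (`adapt_stats`, `standardize`), the `eps` guard and the toy rows are assumptions made for illustration.

```python
import math

def adapt_stats(rows):
    """Learn the per-column mean and (population) variance from the data."""
    n_rows, n_cols = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n_rows for j in range(n_cols)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in rows) / n_rows
        for j in range(n_cols)
    ]
    return means, variances

def standardize(rows, means, variances, eps=1e-7):
    """Apply (x - mean) / sqrt(var) column by column; eps avoids dividing by zero."""
    return [
        [(x - m) / max(math.sqrt(v), eps) for x, m, v in zip(row, means, variances)]
        for row in rows
    ]

# Toy rows: (Sales, benefitsReview_wc) values taken from the sample rows shown earlier
toy_rows = [[135166, 62], [739564, 163], [683247, 4], [222295, 22]]
means, variances = adapt_stats(toy_rows)
scaled = standardize(toy_rows, means, variances)
```

Columns such as Sales (hundreds of thousands) and the word counts (tens) end up on a comparable scale, each with mean 0 and variance 1.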

In [16]:
X_num = train_set[numericas].copy()
y_num = train_set[target]

Let's first try without normalizing (we leave the class-weight dictionary that compensates for the imbalance for later):


In [17]:
tf.random.set_seed(42)  # extra code – ensures reproducibility
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(100, activation = "relu"),
    tf.keras.layers.Dense(10, activation = "softmax")
])
earlyS = tf.keras.callbacks.EarlyStopping(patience = 30, restore_best_weights= True, monitor = "val_acc")
model.compile(loss="sparse_categorical_crossentropy", 
              optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              metrics =["acc"])


In [18]:
history = model.fit(X_num, y_num, validation_split= 0.2, epochs=300, callbacks= earlyS)

Epoch 1/300
63/63 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - acc: 0.1873 - loss: 16373993472.0000 - val_acc: 0.2233 - val_loss: 2.2786
Epoch 2/300
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - acc: 0.2408 - loss: 2.2739 - val_acc: 0.2233 - val_loss: 2.2560
Epoch 3/300
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - acc: 0.2413 - loss: 2.2490 - val_acc: 0.2233 - val_loss: 2.2363
Epoch 4/300
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - acc: 0.2413 - loss: 2.2297 - val_acc: 0.2233 - val_loss: 2.2192
Epoch 5/300
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - acc: 0.2413 - loss: 2.2129 - val_acc: 0.2233 - val_loss: 2.2044
Epoch 6/300
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - acc: 0.2413 - loss: 2.1983 - val_acc: 0.2233 - val_loss: 2.1915
Epoch 7/300
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step

In [19]:
model.evaluate(test_set[numericas], test_set[target])

20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - acc: 0.2589 - loss: 2.2785


[2.2815656661987305, 0.24276527762413025]

That should be easy to beat... Let's move on to normalization.

In [20]:
tf.random.set_seed(42)  # extra code – ensures reproducibility
norm_layer = tf.keras.layers.Normalization() # It actually performs standardization (zero mean, unit variance)
 # We need a prior "fit" (the adapt method), as with StandardScaler
model = tf.keras.models.Sequential([
    norm_layer,
    tf.keras.layers.Dense(100, activation = "relu"),
    tf.keras.layers.Dense(10, activation = "softmax")
])
model.compile(loss="sparse_categorical_crossentropy", 
              optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              metrics =["acc"])
norm_layer.adapt(X_num.to_numpy())
history = model.fit(X_num, y_num, validation_split = 0.2, epochs=300, callbacks= earlyS)

Epoch 1/300
63/63 ━━━━━━━━━━━━━━━━━━━━ 2s 13ms/step - acc: 0.1237 - loss: 2.3351 - val_acc: 0.1469 - val_loss: 2.2547
Epoch 2/300
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - acc: 0.1912 - loss: 2.2340 - val_acc: 0.2072 - val_loss: 2.1879
Epoch 3/300
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - acc: 0.2329 - loss: 2.1686 - val_acc: 0.2133 - val_loss: 2.1440
Epoch 4/300
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - acc: 0.2454 - loss: 2.1260 - val_acc: 0.2193 - val_loss: 2.1139
Epoch 5/300
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - acc: 0.2514 - loss: 2.0972 - val_acc: 0.2233 - val_loss: 2.0922
Epoch 6/300
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - acc: 0.2551 - loss: 2.0766 - val_acc: 0.2233 - val_loss: 2.0757
Epoch 7/300
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - acc: 0

In [21]:
model.evaluate(test_set[numericas], test_set[target])

20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - acc: 0.7225 - loss: 0.8456


[0.8535881638526917, 0.7234726548194885]

In [22]:
y_pred  = [np.argmax(pred) for pred in model.predict(test_set[numericas])]

20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step


In [23]:
from sklearn.metrics import classification_report

In [24]:
print(classification_report(test_set[target], y_pred))

              precision    recall  f1-score   support

           0       0.78      1.00      0.87        62
           1       0.00      0.00      0.00        16
           2       0.68      0.65      0.67        23
           3       0.67      0.33      0.44        18
           4       0.62      0.67      0.64        27
           5       0.00      0.00      0.00        38
           6       0.23      0.14      0.18        83
           7       0.53      0.99      0.69        94
           8       0.99      0.85      0.91       110
           9       0.96      1.00      0.98       151

    accuracy                           0.72       622
   macro avg       0.55      0.56      0.54       622
weighted avg       0.67      0.72      0.68       622



Good, we now have a baseline and a simple way of building normalization into our model. Let's continue with the remaining transformations, but in the next session.

***

### Transforming categorical variables

We are going to transform the categorical variables, first with an "ordinal encoder" and then with a "one-hot" encoder, in both cases using Keras layers. First, let's see which columns are categorical:

In [25]:
train_set.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2485 entries, 2686 to 860
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   urlDrugName           2485 non-null   object 
 1   rating                2485 non-null   int64  
 2   effectiveness         2485 non-null   object 
 3   sideEffects           2485 non-null   object 
 4   condition             2485 non-null   object 
 5   benefitsReview        2485 non-null   object 
 6   sideEffectsReview     2485 non-null   object 
 7   commentsReview        2485 non-null   object 
 8   Sales                 2485 non-null   int64  
 9   Production            2485 non-null   float64
 10  benefitsReview_wc     2485 non-null   int64  
 11  sideEffectsReview_wc  2485 non-null   int64  
 12  commentsReview_wc     2485 non-null   int64  
dtypes: float64(1), int64(5), object(7)
memory usage: 271.8+ KB


It looks like they are all the `object` columns that are not Review fields.

In [26]:
categoricals = [col for col in train_set.columns if "Review" not in col and train_set[col].dtype == "object"]
categoricals

['urlDrugName', 'effectiveness', 'sideEffects', 'condition']

Let's check the cardinality:

In [27]:
for col in categoricals:
    print(f"For <{col}>: {train_set[col].nunique()}")

For <urlDrugName>: 463
For <effectiveness>: 5
For <sideEffects>: 5
For <condition>: 1203


In [28]:
train_set.condition.value_counts()

Unnamed: 0_level_0,count
condition,Unnamed: 1_level_1
depression,182
acne,135
anxiety,51
insomnia,41
birth control,34
...,...
major sinus infection,1
bipolar/raciness/insomnia,1
anxiety/insomnia/,1
infected site of oreal surgery,1


We have to drop urlDrugName and condition because of their high cardinality. The latter would also need special treatment (somewhere between vectorization and category compression, which we skip here for reasons of time and space).

In [29]:
categoricals.remove("urlDrugName")
categoricals.remove("condition")
categoricals

['effectiveness', 'sideEffects']
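
The category compression mentioned above is not done in this notebook. As a rough idea only, one common approach is to keep the most frequent values and lump everything else into a single bucket. The sketch below is plain Python; the function name `compress_categories`, the threshold and the `"other"` label are assumptions, not part of the original workflow.

```python
from collections import Counter

def compress_categories(values, min_count=2, other_label="other"):
    """Keep values seen at least min_count times; map the rest to other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]

# Toy sample in the spirit of the condition column
conditions = [
    "depression", "acne", "depression", "anxiety", "acne",
    "major sinus infection", "anxiety/insomnia/", "depression",
]
compressed = compress_categories(conditions)
n_before = len(set(conditions))   # distinct values before compression
n_after = len(set(compressed))    # distinct values after compression
```

On a column like condition, with over a thousand distinct values and most of them appearing once, a step of this kind would shrink the vocabulary considerably before any encoding.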

### Ordinal Encoding layer (aka StringLookup)

The way to do categorical encoding is to use a StringLookup layer, which can either learn the mapping itself or be given one (as with sklearn's OrdinalEncoder).

We are going to add the categorical features in two ways, but first we apply ordinal encoding to them. The only catch is that it has to be done feature by feature.
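
To make the mapping concrete, here is a minimal plain-Python sketch of the idea behind such a lookup: a vocabulary is built from the training values (most frequent first, which to my understanding is what `adapt` does), index 0 is kept for unseen values, and every known string maps to its position. `build_vocabulary`, `lookup` and the toy column are hypothetical names for illustration, not Keras API.

```python
from collections import Counter

OOV_TOKEN = "[UNK]"  # placeholder for values never seen while building the vocabulary

def build_vocabulary(values):
    """Return the vocabulary: OOV token first, then values by descending frequency."""
    counts = Counter(values)
    return [OOV_TOKEN] + [value for value, _ in counts.most_common()]

def lookup(vocabulary, value):
    """Map a string to its integer index; unseen strings fall back to index 0."""
    index = {token: i for i, token in enumerate(vocabulary)}
    return index.get(value, 0)

toy_column = [
    "Highly Effective", "Highly Effective", "Highly Effective",
    "Considerably Effective", "Considerably Effective",
    "Ineffective",
]
vocab = build_vocabulary(toy_column)
codes = [lookup(vocab, v) for v in toy_column]
unseen_code = lookup(vocab, "Somewhat Effective")
```

Note that the resulting integers reflect frequency, not any natural order of the categories, so "ordinal" here only means "integer-coded".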

In [30]:
X_train = train_set[numericas + categoricals].copy()
#X_train[categoricals] = ordinalEncoding(train_set[categoricals])
X_test = test_set[numericas + categoricals].copy()
#X_test[categoricals] = ordinalEncoding(test_set[categoricals])

In [31]:
tf.random.set_seed(42)
normalization_layer = tf.keras.layers.Normalization()
ordinalEncoding_layers = [tf.keras.layers.StringLookup() for cat in categoricals] # we pass no vocabulary, but the vocabulary argument accepts a list,
                                                 # a tuple, an array or a tensor of strings; with the default settings index 0 is reserved for
                                                 # out-of-vocabulary values, so the first string gets index 1, the second index 2, etc.

#ordinalEncoding.adapt(train_set[categoricals]) # Like the normalization layer, it needs its "fit", which here is the adapt method
concat_layer = tf.keras.layers.Concatenate(axis = -1)
hidden_layer1 = tf.keras.layers.Dense(100, activation="relu")
output_layer = tf.keras.layers.Dense(10, activation = "softmax")


normalization_layer.adapt(train_set[numericas].to_numpy())
_ = [ordinalEncoding_layers[indice].adapt(train_set[col]) for indice,col in enumerate(categoricals)]

earlyS = tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights= True)
input_num = tf.keras.layers.Input(shape=(train_set[numericas].shape[1],))
input_cats = [tf.keras.layers.Input(shape=(train_set[[col]].shape[1],), dtype = tf.string) for col in categoricals] # Important: declare that the input is a string
normalized = normalization_layer(input_num)
encoded = [ordinalEncoding_layers[i](input_cats[i]) for i in range(len(ordinalEncoding_layers))]
concat = concat_layer([normalized] + encoded)
hidden1 = hidden_layer1(concat)
output = output_layer(hidden1)

model = tf.keras.Model(inputs=[input_num] + input_cats, outputs=[output])

In [32]:
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=tf.keras.optimizers.SGD(learning_rate=2e-1),
              metrics =["acc"])


In [33]:
salida = tf.keras.Model(inputs = input_cats, outputs = encoded)
salida([X_train[col] for col in categoricals])

[<tf.Tensor: shape=(2485, 1), dtype=int64, numpy=
 array([[1],
        [1],
        [3],
        ...,
        [3],
        [2],
        [1]])>,
 <tf.Tensor: shape=(2485, 1), dtype=int64, numpy=
 array([[2],
        [1],
        [4],
        ...,
        [1],
        [3],
        [5]])>]

In [34]:
X_train[[categoricals[0]]]

Unnamed: 0,effectiveness
2686,Highly Effective
1192,Highly Effective
2738,Moderately Effective
2780,Highly Effective
1735,Considerably Effective
...,...
3092,Highly Effective
1095,Ineffective
1130,Moderately Effective
1294,Considerably Effective


In [35]:
entradas = [X_train[numericas]] + [X_train[col] for col in categoricals]

In [36]:
history = model.fit(entradas, y_num, validation_split=0.2, epochs=100, callbacks= earlyS)

Epoch 1/100
63/63 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - acc: 0.2878 - loss: 2.1227 - val_acc: 0.2938 - val_loss: 1.9137
Epoch 2/100
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - acc: 0.4050 - loss: 1.7566 - val_acc: 0.3461 - val_loss: 1.7071
Epoch 3/100
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - acc: 0.4244 - loss: 1.5807 - val_acc: 0.3662 - val_loss: 1.6023
Epoch 4/100
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - acc: 0.4387 - loss: 1.4814 - val_acc: 0.3763 - val_loss: 1.5400
Epoch 5/100
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - acc: 0.4578 - loss: 1.4167 - val_acc: 0.3843 - val_loss: 1.5157
Epoch 6/100
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - acc: 0.4677 - loss: 1.3733 - val_acc: 0.3903 - val_loss: 1.4816
Epoch 7/100
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - acc: 0.

In [37]:
entrada_test = [X_test[numericas]] + [X_test[col] for col in categoricals]
model.evaluate(entrada_test, test_set[target])

20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - acc: 0.9291 - loss: 0.2985


[0.32268407940864563, 0.9083601236343384]

We have improved somewhat... haven't we?

In [38]:
y_pred  = [np.argmax(pred) for pred in model.predict(entrada_test)]

20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step


In [39]:
print(classification_report(test_set[target], y_pred))

              precision    recall  f1-score   support

           0       0.92      0.98      0.95        62
           1       0.78      0.44      0.56        16
           2       0.00      0.00      0.00        23
           3       0.33      0.83      0.47        18
           4       0.74      0.74      0.74        27
           5       0.91      0.76      0.83        38
           6       0.94      0.95      0.95        83
           7       0.98      1.00      0.99        94
           8       1.00      0.99      1.00       110
           9       0.99      1.00      1.00       151

    accuracy                           0.91       622
   macro avg       0.76      0.77      0.75       622
weighted avg       0.90      0.91      0.90       622



#### One-hot encoding

We can also do the one-hot encoding in a single step by configuring the StringLookup layer appropriately. As before, we use the functional API to include the layer inside the model (so the conversion does not have to happen outside it).

In [40]:
tf.random.set_seed(42)
normalization_layer = tf.keras.layers.Normalization()
ordinalEncoding_layers = [tf.keras.layers.StringLookup(output_mode= "one_hot") for cat in categoricals] # we pass no vocabulary, but the vocabulary argument accepts a list,
                                                 # a tuple, an array or a tensor of strings; with the default settings position 0 is reserved for
                                                 # out-of-vocabulary values, so the first string maps to position 1, etc.

#ordinalEncoding.adapt(train_set[categoricals]) # Like the normalization layer, it needs its "fit", which here is the adapt method
concat_layer = tf.keras.layers.Concatenate(axis = -1)
hidden_layer1 = tf.keras.layers.Dense(100, activation="relu")
output_layer = tf.keras.layers.Dense(10, activation = "softmax")


normalization_layer.adapt(train_set[numericas].to_numpy())
_ = [ordinalEncoding_layers[indice].adapt(train_set[col]) for indice,col in enumerate(categoricals)]

earlyS = tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights= True)
input_num = tf.keras.layers.Input(shape=(train_set[numericas].shape[1],))
input_cats = [tf.keras.layers.Input(shape=(train_set[[col]].shape[1],), dtype = tf.string) for col in categoricals] # Important: declare that the input is a string
normalized = normalization_layer(input_num)
encoded = [ordinalEncoding_layers[i](input_cats[i]) for i in range(len(ordinalEncoding_layers))]
concat = concat_layer([normalized] + encoded)
hidden1 = hidden_layer1(concat)
output = output_layer(hidden1)

model = tf.keras.Model(inputs=[input_num] + input_cats, outputs=[output])

In [41]:
salida = tf.keras.Model(inputs = input_cats, outputs = encoded)
salida([X_train[col] for col in categoricals])

[<tf.Tensor: shape=(2485, 6), dtype=int64, numpy=
 array([[0, 1, 0, 0, 0, 0],
        [0, 1, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0],
        ...,
        [0, 0, 0, 1, 0, 0],
        [0, 0, 1, 0, 0, 0],
        [0, 1, 0, 0, 0, 0]])>,
 <tf.Tensor: shape=(2485, 6), dtype=int64, numpy=
 array([[0, 0, 1, 0, 0, 0],
        [0, 1, 0, 0, 0, 0],
        [0, 0, 0, 0, 1, 0],
        ...,
        [0, 1, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0],
        [0, 0, 0, 0, 0, 1]])>]
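
Each tensor above has 6 columns even though each feature has only 5 categories: the extra first position is the one reserved for out-of-vocabulary values. A minimal plain-Python sketch of that conversion follows (the helper `one_hot` is a hypothetical name, not Keras API).

```python
def one_hot(index, depth):
    """Return a list of length depth with a single 1 at position index."""
    return [1 if position == index else 0 for position in range(depth)]

n_categories = 5           # cardinality of effectiveness and of sideEffects
depth = n_categories + 1   # one extra slot, position 0, for unseen values

encoded_known = one_hot(1, depth)   # index 1, the first real vocabulary entry
encoded_unseen = one_hot(0, depth)  # index 0, a value never seen during adapt
```

Compared with the integer codes, this representation does not impose any artificial order or distance between categories, at the cost of a wider input.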

In [42]:
X_train[[categoricals[1]]]

Unnamed: 0,sideEffects
2686,No Side Effects
1192,Mild Side Effects
2738,Severe Side Effects
2780,Moderate Side Effects
1735,No Side Effects
...,...
3092,Mild Side Effects
1095,Severe Side Effects
1130,Mild Side Effects
1294,Moderate Side Effects


In [43]:
model.compile(loss="sparse_categorical_crossentropy", 
              optimizer=tf.keras.optimizers.SGD(learning_rate=2e-1),
              metrics =["acc"])


In [44]:
entradas = [X_train[numericas]] + [X_train[col] for col in categoricals]

In [45]:
history = model.fit(entradas, y_num, validation_split=0.2, epochs=100, callbacks= earlyS)

Epoch 1/100
63/63 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - acc: 0.2780 - loss: 1.9971 - val_acc: 0.3662 - val_loss: 1.6751
Epoch 2/100
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - acc: 0.4478 - loss: 1.5460 - val_acc: 0.3964 - val_loss: 1.5310
Epoch 3/100
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - acc: 0.4794 - loss: 1.3976 - val_acc: 0.4225 - val_loss: 1.4637
Epoch 4/100
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - acc: 0.4927 - loss: 1.3219 - val_acc: 0.4286 - val_loss: 1.4248
Epoch 5/100
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - acc: 0.4972 - loss: 1.2755 - val_acc: 0.4427 - val_loss: 1.3955
Epoch 6/100
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - acc: 0.5081 - loss: 1.2413 - val_acc: 0.4427 - val_loss: 1.3748
Epoch 7/100
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - acc: 0.

In [46]:
model.evaluate(entrada_test,test_set[target])

20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - acc: 0.9366 - loss: 0.2385


[0.2522585093975067, 0.9340835809707642]

Slightly better than with ordinal encoding, but not by much, and we still have to look at the impact of the imbalance.

In [47]:
y_pred  = [np.argmax(pred) for pred in model.predict(entrada_test)]

20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step


In [48]:
print(classification_report(test_set[target], y_pred))

              precision    recall  f1-score   support

           0       0.98      0.98      0.98        62
           1       0.75      0.38      0.50        16
           2       0.65      0.87      0.74        23
           3       0.82      0.50      0.62        18
           4       0.58      0.93      0.71        27
           5       0.86      0.66      0.75        38
           6       0.98      0.96      0.97        83
           7       0.99      1.00      0.99        94
           8       1.00      1.00      1.00       110
           9       1.00      1.00      1.00       151

    accuracy                           0.93       622
   macro avg       0.86      0.83      0.83       622
weighted avg       0.94      0.93      0.93       622



Once again the model leans towards the majority classes. In the next session we will add the vectorization of the text features, and I will show you how to configure class_weight to alleviate (not always by much) the effect of the imbalance.
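
As a preview, one common heuristic (the "balanced" rule, `n_samples / (n_classes * count)`) turns the class counts seen in the mini EDA into one weight per class. This is only a sketch under that assumption; the resulting dictionary maps class index to weight, which is the shape Keras accepts in the `class_weight` argument of `fit`. The variable names are illustrative.

```python
# Training counts per rating (1 to 10), copied from the value_counts output above
rating_counts = {10: 591, 8: 464, 9: 370, 7: 267, 1: 243,
                 5: 132, 3: 123, 6: 119, 4: 89, 2: 87}

n_samples = sum(rating_counts.values())
n_classes = len(rating_counts)

# Labels were shifted to start at zero, so rating r corresponds to class r - 1
class_weight = {
    rating - 1: n_samples / (n_classes * count)
    for rating, count in rating_counts.items()
}
```

It could then be passed as `model.fit(..., class_weight=class_weight)`. With these counts the rarest class (class 1) gets a weight of roughly 2.9, while the most frequent one (class 9) gets roughly 0.42.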

***

First we will quickly show how the Keras text vectorization layer works, and then we will apply the full process (remember that the text has to be cleaned and tokenized before it is vectorized). Let's start with the vectorization layer(s).

## Vectorization layers

The equivalent of sklearn's CountVectorizer and TfidfVectorizer is the TextVectorization layer.

In [49]:
text_vec_layer_count = tf.keras.layers.TextVectorization(output_mode = "count") # output_mode = "count" -> CountVectorizer
text_vec_layer_count.adapt(train_set[["sideEffectsReview"]])

In [50]:
text_vec_layer_count.get_vocabulary()[:10]

['[UNK]', 'i', 'the', 'and', 'to', 'a', 'of', 'my', 'it', 'was']

In [51]:
texto = train_set["sideEffectsReview"][2:3].values
print(texto)

['redness, dryness, breakdown of skin']


In [52]:
text_vec_layer_tfidf = tf.keras.layers.TextVectorization(output_mode= "tf_idf") # output_mode = "tf_idf" -> TfidfVectorizer; there is a third mode (the default one), which we will see a bit later
text_vec_layer_tfidf.adapt(train_set["sideEffectsReview"])

In [53]:
vectors = text_vec_layer_count(train_set["sideEffectsReview"])

In [54]:
vectors

<tf.Tensor: shape=(2485, 6564), dtype=int64, numpy=
array([[0, 1, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 1, 0, ..., 0, 0, 0],
       [0, 8, 3, ..., 0, 0, 0],
       [0, 1, 2, ..., 0, 0, 0]])>

In [55]:
df_vectors = pd.DataFrame(vectors.numpy(),
                          columns= text_vec_layer_count.get_vocabulary())

In [56]:
df_vectors

Unnamed: 0,[UNK],i,the,and,to,a,of,my,it,was,...,10142008,1014,1012,100mgthe,100mgs,100mgdoses,100110,1000mg,10000,072009
0,0,1,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,2,3,2,0,0,0,2,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,4,7,3,3,0,2,0,2,2,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2480,0,3,5,1,0,0,0,0,1,2,...,0,0,0,0,0,0,0,0,0,0
2481,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2482,0,1,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2483,0,8,3,1,7,3,3,2,1,3,...,0,0,0,0,0,0,0,0,0,0


In [57]:
train_set.iloc[0]["sideEffectsReview"]

'I had no real side effects unless you consider lack of fear or worry one.'

In [58]:
df_vectors.T[df_vectors.T[0] != 0][0]


Unnamed: 0,0
i,1
of,1
side,1
effects,1
had,1
no,1
or,1
you,1
one,1
lack,1
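
What the count mode does for a single text can be approximated in plain Python: lowercase, strip punctuation, split on whitespace (the layer's default standardization and split, as far as I know) and count the tokens. The helper `count_tokens` is a hypothetical name; the real layer additionally places each count in the column assigned to that token in the vocabulary.

```python
import string
from collections import Counter

def count_tokens(text):
    """Lowercase, remove punctuation, split on whitespace and count each token."""
    table = str.maketrans("", "", string.punctuation)
    tokens = text.lower().translate(table).split()
    return Counter(tokens)

review = "I had no real side effects unless you consider lack of fear or worry one."
counts = count_tokens(review)
```

Every word of this review appears exactly once, which is consistent with the column of ones shown above.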


#### Vectorization

We will need to normalize/clean the text and "tokenize" it.

In [60]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [61]:
# Reusing the cleaning function we wrote for the IMDB Reviews

from nltk.corpus import stopwords
# Raw strings avoid invalid-escape warnings; the patterns themselves are unchanged
replace_no_space = r"(\.)|(\;)|(\:)|(\!)|(\?)|(\,)|(\")|(\()|(\))|(\[)|(\])|(\d+)"
REPLACE_NO_SPACE = re.compile(replace_no_space)
replace_with_space = r"(<br \s*/><br\s*/>)|(\-)|(\/)"
REPLACE_WITH_SPACE = re.compile(replace_with_space)
NO_SPACE = ""
SPACE = " "
dictionary = stopwords.words("english")


def clean(row):
    # Strip punctuation and digits, and convert to lowercase
    dato = REPLACE_NO_SPACE.sub(NO_SPACE, row.lower())
    # Replace line breaks (<br /><br />), hyphens ("-") and slashes ("/") with spaces
    dato = REPLACE_WITH_SPACE.sub(SPACE, dato)
    # Drop any link
    dato = " ".join([word for word in dato.split() if "http" not in word])
    # Drop the stopwords
    dato = " ".join([word for word in dato.split(" ") if word not in dictionary])
    return dato
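
A quick way to see what the two regex passes and the stop-word filter do, without downloading NLTK data, is to run the same steps with a tiny hand-written stop-word set. `TOY_STOPWORDS` and `clean_sketch` are stand-ins for illustration; the real function uses NLTK's full English list.

```python
import re

# Same two patterns as the cleaning function, written as raw strings
REMOVE = re.compile(r"(\.)|(\;)|(\:)|(\!)|(\?)|(\,)|(\")|(\()|(\))|(\[)|(\])|(\d+)")
TO_SPACE = re.compile(r"(<br \s*/><br\s*/>)|(\-)|(\/)")

TOY_STOPWORDS = {"i", "of", "the", "and"}  # assumption: tiny stand-in for NLTK's list

def clean_sketch(text):
    """Lowercase, drop punctuation and digits, split hyphens/slashes, drop stop words."""
    text = REMOVE.sub("", text.lower())
    text = TO_SPACE.sub(" ", text)
    words = [word for word in text.split() if "http" not in word]
    return " ".join(word for word in words if word not in TOY_STOPWORDS)

sample = "Redness, dryness; breakdown of skin - I took 2 pills!"
cleaned = clean_sketch(sample)
```

Digits disappear together with the punctuation, so dosage information such as "2 pills" is reduced to "pills"; whether that loss matters depends on the task.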




First, we identify the features that are natural-language text:


In [62]:
text_feats = [col for col in train_set.columns if "Review" in col and "wc" not in col]
text_feats

['benefitsReview', 'sideEffectsReview', 'commentsReview']


And now we apply the cleaning function to each of them, in both train and test:

In [63]:
for col in text_feats:
    train_set[col] = train_set[col].apply(clean)
    test_set[col] = test_set[col].apply(clean)

In [64]:
train_set

Unnamed: 0,urlDrugName,rating,effectiveness,sideEffects,condition,benefitsReview,sideEffectsReview,commentsReview,Sales,Production,benefitsReview_wc,sideEffectsReview_wc,commentsReview_wc
2686,zoloft,9,Highly Effective,No Side Effects,anxiety,becoming normal functioning individual helped ...,real side effects unless consider lack fear wo...,taken daily,135166,335.0,62,15,3
1192,adipex-p,9,Highly Effective,Mild Side Effects,to lose 30 pounds,bmi rating time get taken care goes beyond tak...,jittery sometimes euphoric tired cant sleep zone,last results months adipex p lost pounds gaine...,739564,939.0,163,10,248
2738,tri-luma,7,Moderately Effective,Severe Side Effects,hyperpigmentation,lightening acne scars,redness dryness breakdown skin,topical compound used reduce hyperpigmentation...,683247,843.0,4,5,42
2780,yasmin,2,Highly Effective,Moderate Side Effects,birth control,chance pregancy lost percent libido skin clear...,libido ever breasts hurt like got bigger itche...,given prevent pregnancy going i'd rather child...,222295,282.0,22,50,27
1735,requip,7,Considerably Effective,No Side Effects,restless leg syndrome,took medication treat restless leg syndrome ef...,need take time went experienced augmentation r...,took one milligram night treat symtoms restles...,344748,505.0,93,69,106
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3092,zyrtec,9,Highly Effective,Mild Side Effects,atopic eczema,drug effective counteracting severe itching ca...,felt slightly tired taking zyrtec however seve...,non prescription drug took experiencing severe...,156028,356.0,74,54,51
1095,neurontin,0,Ineffective,Severe Side Effects,all over and various type pain,value,nausea extreme sleepinesssevere headache jitte...,day one felt negative side effects taking seco...,317886,337.0,2,10,120
1130,wellbutrin,6,Moderately Effective,Mild Side Effects,depression,experienced less depression increased libido p...,dry mouth eyes drug decreased effectiveness la...,prescribed mg per day taken much mg per day li...,556187,696.0,15,15,24
1294,wellbutrin,2,Considerably Effective,Moderate Side Effects,stop smoking,treatment effective reducing appetite smoking ...,began notice nothing seemed important also see...,taking bupropion aide break smoking habit also...,728338,789.0,44,105,92


In [65]:
print("\n\n".join(train_set["sideEffectsReview"][0:2].values))

real side effects unless consider lack fear worry one

jittery sometimes euphoric tired cant sleep zone


It is time to apply the vectorization "layer". Each TextVectorization layer handles a single text input (feature), so we need to create three, one per text column.

In [66]:
vectorizer_layers = []
for indice,col in enumerate(text_feats):
    vectorizer_layers.append(tf.keras.layers.TextVectorization(output_mode="count"))
    vectorizer_layers[indice].adapt(train_set[col])
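
To make the `output_mode="count"` setting concrete, the following is a minimal sketch on a hypothetical toy corpus (`toy_reviews` is invented for illustration and is not part of the dataset). It assumes a TensorFlow version that ships `tf.keras.layers.TextVectorization`.

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy corpus standing in for one cleaned review column
toy_reviews = np.array(["dry mouth dry eyes", "nausea headache", "dry skin"])

layer = tf.keras.layers.TextVectorization(output_mode="count")
layer.adapt(toy_reviews)  # learns the vocabulary from the corpus

# In count mode, index 0 is the out-of-vocabulary token
vocab = layer.get_vocabulary()

# One row per input text, one column per vocabulary entry
counts = layer(tf.constant(["dry dry nausea unseen"])).numpy()

print(vocab)
print(counts)
```

Each review becomes a vector as long as the vocabulary, which is why the concatenated input of the model grows so quickly with three text columns.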

And now we build the model with the functional API to bring everything together:

In [67]:
tf.random.set_seed(43)
normalization_layer = tf.keras.layers.Normalization()
ordinalEncoding_layers = [tf.keras.layers.StringLookup(output_mode= "one_hot") for cat in categoricals] # We do not set it here, but the vocabulary argument accepts a list,
                                                 # a tuple, a 1D array or a 1D tensor of strings; they are indexed in the given order, after the index reserved for out-of-vocabulary values

#ordinalEncoding.adapt(train_set[categoricals]) # Like the normalization layer, this layer has to be "fit"; in Keras that is the adapt method
concat_layer = tf.keras.layers.Concatenate(axis = -1)
hidden_layer1 = tf.keras.layers.Dense(100, activation="relu")
output_layer = tf.keras.layers.Dense(10, activation = "softmax")


normalization_layer.adapt(train_set[numericas].to_numpy())
_ = [ordinalEncoding_layers[indice].adapt(train_set[col]) for indice,col in enumerate(categoricals)]

earlyS = tf.keras.callbacks.EarlyStopping(patience=15, restore_best_weights= True)
input_num = tf.keras.layers.Input(shape=(train_set[numericas].shape[1],))
input_cats = [tf.keras.layers.Input(shape=(train_set[[col]].shape[1],), dtype = tf.string) for col in categoricals] # Important: declare that the input is a string
# Add the inputs for the vectorization layers
input_vects = [tf.keras.layers.Input(shape = train_set[col].shape[1:], dtype = tf.string) for col in text_feats]
normalized = normalization_layer(input_num)
encoded = [ordinalEncoding_layers[i](input_cats[i]) for i in range(len(ordinalEncoding_layers))]
# Now we vectorize
vectorized = [vectorizer_layers[i](input_vects[i]) for i in range(len(text_feats))]
concat = concat_layer([normalized] + encoded + vectorized)
hidden1 = hidden_layer1(concat)
output = output_layer(hidden1)

model = tf.keras.Model(inputs=[input_num] + input_cats + input_vects, outputs=[output])
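
The categorical inputs in this model go through `StringLookup(output_mode="one_hot")`. A minimal sketch of what that layer emits, assuming a hypothetical two-value vocabulary passed explicitly instead of learned with `adapt`:

```python
import tensorflow as tf

# Hypothetical vocabulary; the notebook learns it with adapt() instead
lookup = tf.keras.layers.StringLookup(
    vocabulary=["Mild", "Severe"], output_mode="one_hot"
)

# With the default settings, index 0 is reserved for unseen strings
vocab = lookup.get_vocabulary()

# Column of strings, shaped like the Input(shape=(1,), dtype=tf.string) above
batch = tf.constant([["Mild"], ["Severe"], ["Unknown"]])
encoded = lookup(batch).numpy()

print(vocab)
print(encoded)
```

The first output column is the out-of-vocabulary slot, so a category seen only at test time still gets a valid encoding rather than raising an error.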

In [68]:
model.compile(loss="sparse_categorical_crossentropy", 
              optimizer=tf.keras.optimizers.SGD(learning_rate= 2e-2, momentum = 0.9),
              metrics =["acc"])
entradas = [train_set[numericas]] + [train_set[col] for col in categoricals] \
                + [train_set[col] for col in text_feats]
history = model.fit(entradas, y_num, validation_split = 0.2, epochs=150, callbacks=[earlyS])

Epoch 1/150
63/63 ━━━━━━━━━━━━━━━━━━━━ 7s 83ms/step - acc: 0.2532 - loss: 2.1040 - val_acc: 0.3924 - val_loss: 1.7146
Epoch 2/150
63/63 ━━━━━━━━━━━━━━━━━━━━ 9s 71ms/step - acc: 0.5055 - loss: 1.3673 - val_acc: 0.4125 - val_loss: 1.5790
Epoch 3/150
63/63 ━━━━━━━━━━━━━━━━━━━━ 5s 78ms/step - acc: 0.7225 - loss: 0.8827 - val_acc: 0.4245 - val_loss: 1.6223
Epoch 4/150
63/63 ━━━━━━━━━━━━━━━━━━━━ 6s 95ms/step - acc: 0.8919 - loss: 0.5332 - val_acc: 0.4205 - val_loss: 1.7238
Epoch 5/150
63/63 ━━━━━━━━━━━━━━━━━━━━ 4s 68ms/step - acc: 0.9582 - loss: 0.3085 - val_acc: 0.4185 - val_loss: 1.7929
Epoch 6/150
63/63 ━━━━━━━━━━━━━━━━━━━━ 5s 84ms/step - acc: 0.9797 - loss: 0.1886 - val_acc: 0.4185 - val_loss: 1.8105
Epoch 7/150
63/63 ━━━━━━━━━━━━━━━━━━━━ 5s 75ms/step - 

In [69]:
entradas_test = [test_set[numericas]] + [test_set[col] for col in categoricals] \
                                         + [test_set[col] for col in text_feats]
model.evaluate(entradas_test, test_set[target])

20/20 ━━━━━━━━━━━━━━━━━━━━ 1s 67ms/step - acc: 0.4152 - loss: 1.5102


[1.5470683574676514, 0.41961413621902466]

In [70]:
y_pred = [np.argmax(pred) for pred in model.predict(entradas_test)]

20/20 ━━━━━━━━━━━━━━━━━━━━ 2s 73ms/step


In [71]:
print(classification_report(test_set[target], y_pred))

              precision    recall  f1-score   support

           0       0.59      0.89      0.71        62
           1       0.00      0.00      0.00        16
           2       0.60      0.13      0.21        23
           3       0.50      0.06      0.10        18
           4       0.00      0.00      0.00        27
           5       1.00      0.05      0.10        38
           6       0.22      0.10      0.13        83
           7       0.25      0.64      0.36        94
           8       0.27      0.04      0.06       110
           9       0.57      0.85      0.68       151

    accuracy                           0.42       622
   macro avg       0.40      0.27      0.24       622
weighted avg       0.41      0.42      0.34       622
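
The two average rows of this report can be recomputed by hand, which helps to see why they diverge when classes are imbalanced. The sketch below copies the rounded per-class recall and support values from the table above, so the results only match the report up to rounding.

```python
import numpy as np

# Per-class recall and support, copied (rounded) from the report above
recall = np.array([0.89, 0.00, 0.13, 0.06, 0.00, 0.05, 0.10, 0.64, 0.04, 0.85])
support = np.array([62, 16, 23, 18, 27, 38, 83, 94, 110, 151])

# Macro average: every class counts the same, however rare it is
macro_recall = recall.mean()

# Weighted average: each class counts in proportion to its support
weighted_recall = np.average(recall, weights=support)

print(round(macro_recall, 3), round(weighted_recall, 3))
```

The weighted recall lands near the accuracy (about 0.42) because the large classes 0, 7 and 9 dominate it, while the macro recall (about 0.28 from these rounded values) exposes the classes the model almost never predicts.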



The classes are imbalanced and the macro-average recall drops (0.27, against an accuracy of 0.42). There are also too many features. In the next session, which is theoretical, we will see a more powerful way to vectorize text; to finish this one, let's see how to "compensate" for the imbalance.

In [72]:
y_num.value_counts(True)

rating,proportion
9,0.237827
7,0.18672
8,0.148893
6,0.107445
0,0.097787
4,0.053119
2,0.049497
5,0.047887
3,0.035815
1,0.03501


In [73]:
from sklearn.utils.class_weight import compute_class_weight

clases = np.array(sorted(y_num.unique()))
clases

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [74]:
dict_weight = {clases[i]:peso for i,peso in enumerate(compute_class_weight("balanced",classes= clases, y = y_num))}
dict_weight

{0: 1.022633744855967,
 1: 2.8563218390804597,
 2: 2.0203252032520327,
 3: 2.792134831460674,
 4: 1.8825757575757576,
 5: 2.088235294117647,
 6: 0.9307116104868914,
 7: 0.5355603448275862,
 8: 0.6716216216216216,
 9: 0.42047377326565144}
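
These weights follow the "balanced" heuristic, `n_samples / (n_classes * count)`. For rating 9, for example, 1 / (10 × 0.237827) ≈ 0.4205, which matches the dictionary above. The sketch below reproduces the formula with NumPy on hypothetical toy labels and shows, under the assumption that each sample's loss is simply multiplied by the weight of its true class, how such weights rescale a cross-entropy loss. All values in it are invented for illustration.

```python
import numpy as np

# Hypothetical toy labels: class 0 is frequent, class 2 is rare
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 2])
counts = np.bincount(y_toy)                      # samples per class: [4, 3, 1]
balanced = len(y_toy) / (len(counts) * counts)   # n_samples / (n_classes * count)
toy_weights = dict(enumerate(balanced))

# Hypothetical predicted probabilities for three samples, one per class
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])
y_true = np.array([0, 1, 2])

# Sparse categorical cross-entropy per sample: -log(p of the true class)
per_sample_loss = -np.log(probs[np.arange(len(y_true)), y_true])

# Scale each sample's loss by the weight of its true class
weighted_loss = per_sample_loss * balanced[y_true]

print(toy_weights)
print(per_sample_loss, weighted_loss)
```

Errors on the rare class cost more than errors on the frequent one, which pushes the optimizer to stop ignoring minority ratings.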

In [75]:
model.compile(loss="sparse_categorical_crossentropy", 
              optimizer=tf.keras.optimizers.SGD(learning_rate= 1e-1, momentum = 0.9, nesterov = True),
              metrics =["acc"])

history = model.fit(entradas, y_num.to_numpy(), validation_split = 0.2, epochs=10, callbacks=[earlyS], class_weight= dict_weight)

Epoch 1/10
63/63 ━━━━━━━━━━━━━━━━━━━━ 7s 84ms/step - acc: 0.6210 - loss: 1.3108 - val_acc: 0.3823 - val_loss: 1.8190
Epoch 2/10
63/63 ━━━━━━━━━━━━━━━━━━━━ 6s 101ms/step - acc: 0.8468 - loss: 0.4410 - val_acc: 0.4024 - val_loss: 2.4332
Epoch 3/10
63/63 ━━━━━━━━━━━━━━━━━━━━ 9s 85ms/step - acc: 0.9686 - loss: 0.0864 - val_acc: 0.4105 - val_loss: 2.8015
Epoch 4/10
63/63 ━━━━━━━━━━━━━━━━━━━━ 10s 80ms/step - acc: 0.9925 - loss: 0.0247 - val_acc: 0.3944 - val_loss: 2.8384
Epoch 5/10
63/63 ━━━━━━━━━━━━━━━━━━━━ 6s 97ms/step - acc: 0.9982 - loss: 0.0112 - val_acc: 0.4024 - val_loss: 2.8666
Epoch 6/10
63/63 ━━━━━━━━━━━━━━━━━━━━ 5s 73ms/step - acc: 1.0000 - loss: 0.0052 - val_acc: 0.4105 - val_loss: 2.9381
Epoch 7/10
63/63 ━━━━━━━━━━━━━━━━━━━━ 5s 74ms/step - acc: 

In [76]:

model.evaluate(entradas_test, test_set[target].to_numpy())

20/20 ━━━━━━━━━━━━━━━━━━━━ 1s 38ms/step - acc: 0.4148 - loss: 1.6926


[1.7677065134048462, 0.3858520984649658]

In [77]:
y_pred = [np.argmax(pred) for pred in model.predict(entradas_test)]

20/20 ━━━━━━━━━━━━━━━━━━━━ 2s 64ms/step


In [78]:
print(classification_report(test_set[target], y_pred))

              precision    recall  f1-score   support

           0       0.67      0.84      0.74        62
           1       0.09      0.06      0.07        16
           2       0.19      0.13      0.15        23
           3       0.13      0.11      0.12        18
           4       0.09      0.37      0.15        27
           5       0.35      0.29      0.32        38
           6       0.16      0.05      0.07        83
           7       0.28      0.44      0.34        94
           8       0.33      0.05      0.09       110
           9       0.64      0.73      0.68       151

    accuracy                           0.39       622
   macro avg       0.29      0.31      0.27       622
weighted avg       0.38      0.39      0.35       622

