<center>
<h1 style="font-family:verdana">
 💻 🧑 Classificació d'intencions 🧑 💻


<p> 🎯 <b>Objectiu</b>: en aquesta pràctica aprendrem a detectar la intenció de l'usuari a partir d'interaccions reals amb un xatbot. En el context de xatbots, la classificació d'intencions ajuda a entendre quina acció o resposta hauria de prendre el sistema en funció de la consulta de l'usuari.  


<p> ✨ <b>Contingut</b>: en primer lloc, farem servir una base de dades amb oracions d'interaccions en espanyol etiquetades com 19 intencions diferents. En segon lloc, realitzarem el preprocessament de les dades, és a dir, transformarem les dades perquè tinguen un format adequat per a ser introduïdes al model. I finalment, dissenyarem i entrenarem el model de classificació per detectar automàticament la intenció de les oracions.</p>  


<p> ✏ <b>Exercicis</b>: en cada secció anireu trobant exercicis que haureu d'anar resolent. </p>



---

<h2> Índex </h2>

1. [Inspecció del conjunt de dades](#section-one)
  * [Exercici 1](#ex-one)
2. [Preprocessament de dades](#section-two)
  * [Exercici 2](#ex-two)
  * [Exercici 3](#ex-three)
3. [Disseny del model i entrenament](#section-three)
  * [Exercici 4](#ex-four)
  * [Exercici 5](#ex-five)
4. [Lliurable](#section-four)
---

In [2]:
# %pip install tensorflow keras
# %pip install numpy pandas matplotlib scikit-learn seaborn
# %pip install nltk

In [3]:
import random
import pandas as pd
import numpy as np
import tensorflow as tf
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, GlobalMaxPooling1D, Dropout, Conv1D, GlobalAveragePooling1D, LayerNormalization #Remove
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [4]:
# %pip install gdown
# %gdown "https://drive.google.com/uc?id=1u2wzXvsuscLeFHwXcDwMDaNDy0u_99-t"
# %tar -zxf nlu_ATIS_data.tar.gz

<h1><a name="section-one"> 1. Inspecció del conjunt de dades </a></h1>

A la carpeta `data` tenim els diferents fitxers CSV que utilitzarem per a aquesta pràctica.

En primer lloc, llegirem les dades dels fitxers CSV amb `pandas`.

In [5]:
train_data = pd.read_csv('./data/train.csv', header=None)
val_data = train_data.tail(900)
train_data = pd.read_csv('./data/train.csv', header=None, nrows=4078)
test_data = pd.read_csv('./data/test.csv', header=None)

print('Training size:', len(train_data))
print('Validation dataset size:', len(val_data))
print('Test dataset size:', len(test_data))

Training size: 4078
Validation dataset size: 900
Test dataset size: 893


Per a aquesta primera part de la pràctica ens centrarem en la primera columna dels arxius que correspon amb les **oracions** en anglès introduïdes per l'usuari. I en la tercera columna que correspon amb la **intenció** de cada oració, és a dir, cada oració tindrà una etiqueta.

Podeu executar la cel·la següent tantes vegades com vulgueu per veure instàncies d'aquest conjunt de dades.


In [6]:
random_number = random.randint(0, len(train_data)-1)
print('Random number:', random_number)

train_sentences = list(train_data[0])
train_labels = list(s.replace('"', '') for s in train_data[2])
train_labels = list(s.replace(' ', '') for s in train_labels)

print('Sentence: ', train_sentences[random_number])
print('Intent: ', train_labels[random_number])

Random number: 2269
Sentence:  flights from philadelphia to oakland
Intent:  flight


A continuació analitzarem quantes etiquetes diferents hi ha al dataset i quines són.

In [7]:
num_labels = 0
for label in set(train_labels):
  print(f'Label {num_labels}:', label.split('.')[-1])
  num_labels += 1

print(f'\nThere are a total of {num_labels} intent labels')

Label 0: abbreviation
Label 1: flight_no
Label 2: meal
Label 3: ground_service+ground_fare
Label 4: flight_time
Label 5: ground_fare
Label 6: airline
Label 7: flight+airfare
Label 8: restriction
Label 9: distance
Label 10: ground_service
Label 11: airfare+flight_time
Label 12: capacity
Label 13: cheapest
Label 14: airline+flight_no
Label 15: quantity
Label 16: flight
Label 17: airport
Label 18: airfare
Label 19: city
Label 20: aircraft
Label 21: aircraft+flight+flight_no

There are a total of 22 intent labels


<h1><a name="section-two"> 2. Preprocessament de dades </a></h1>

En primer lloc, haurem de tokenitzar les oracions. Això consisteix a convertir el text en representacions numèriques, ja que els models esperen unitats discretes.

En aquesta pràctica farem servir una tokenització senzilla, simplement dividirem les oracions en paraules i crearem un vocabulari basat en les paraules úniques de les dades d'entrenament. Cada paraula (token) tindrà assignat un ID únic.

Vegem com queda el vocabulari.

In [8]:
num_words=500
tokenizer = Tokenizer(num_words)
tokenizer.fit_on_texts(train_sentences)

vocab = tokenizer.word_index
print(vocab)

{'to': 1, 'from': 2, 'flights': 3, 'the': 4, 'on': 5, 'what': 6, 'me': 7, 'flight': 8, 'boston': 9, 'show': 10, 'san': 11, 'i': 12, 'denver': 13, 'a': 14, 'francisco': 15, 'in': 16, 'and': 17, 'atlanta': 18, 'pittsburgh': 19, 'is': 20, 'dallas': 21, 'baltimore': 22, 'all': 23, 'philadelphia': 24, 'like': 25, 'are': 26, 'list': 27, 'airlines': 28, 'of': 29, 'between': 30, 'that': 31, 'washington': 32, 'leaving': 33, 'please': 34, 'pm': 35, 'morning': 36, 'would': 37, 'fly': 38, 'for': 39, 'fare': 40, 'first': 41, 'wednesday': 42, 'after': 43, 'there': 44, 'oakland': 45, "'d": 46, 'ground': 47, 'you': 48, 'does': 49, 'trip': 50, 'transportation': 51, 'class': 52, 'arriving': 53, 'cheapest': 54, 'need': 55, 'city': 56, 'round': 57, 'with': 58, 'before': 59, 'which': 60, 'available': 61, 'have': 62, 'give': 63, 'at': 64, 'fares': 65, 'american': 66, 'afternoon': 67, 'one': 68, 'want': 69, 'how': 70, 'way': 71, 'new': 72, 'dc': 73, 'nonstop': 74, 'arrive': 75, 'earliest': 76, 'york': 77, 'g

---

 <h1><a name="ex-one"><center> ✏ Exercici 1 ✏</a></h1>

En aquest primer exercici us demanem que donat el vocabulari anterior convertiu la llista d'oracions de la partició d'entrenament, és a dir, `train_sentenes` en seqüències d'IDs.

Podeu trobar la documentació [aquí](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer).

In [9]:

train_sequences = tokenizer.texts_to_sequences(train_sentences)
print(train_sentences[4066])
print(train_sequences[4066])

# Aparecen 10 indices en la secuencia, porque la palabra actually no estaba entre las 500 palabras más frecuentes
# Por lo tanto, no tiene índice y no aparece en la secuencia.

i actually want to go from ontario to westchester via chicago
[12, 69, 1, 78, 2, 258, 1, 278, 279, 100]


Si ho heu fet correctament hauríeu d'obtenir això:

```
print(train_sentences[0])
print(train_sequences[0])

i want to fly from boston at 838 am and arrive in denver at 1110 in the morning
[12, 69, 1, 38, 2, 9, 64, 415, 84, 17, 75, 16, 13, 64, 493, 16, 4, 36]
```



---
A continuació haurem d'aconseguir que totes les seqüències tinguen una longitud fixa. Per a fer això primer fixarem la longitud segons la longitud màxima trobada a les seqüències del conjunt d'entrenament. I a continuació omplirem (*pad*) les seqüències que tinguen una longitud menor.


In [10]:
max_sequence_length = max(map(len, train_sequences))
train_pad_sequences = pad_sequences(train_sequences, maxlen=max_sequence_length)
print('Padded sequence: ', train_pad_sequences[0])

Padded sequence:  [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0  12  69   1  38   2   9  64 415  84
  17  75  16  13  64 493  16   4  36]


---

 <h1><a name="ex-two"><center> ✏ Exercici 2 ✏</a></h1>

Com l'ordre de les paraules sí que importa als models que utilitzarem en aquesta pràctica, és aconsellable que el *padding* estiga al final i no al principi. Busqueu [aquí](https://www.tensorflow.org/api_docs/python/tf/keras/utils/pad_sequences) com fer perquè el codi anterior afegisca els zeros al final i no al principi de la seqüència.

In [11]:

train_pad_sequences = pad_sequences(train_sequences, maxlen=max_sequence_length, padding='post')
print('Padded sequence: ', train_pad_sequences[0])

Padded sequence:  [ 12  69   1  38   2   9  64 415  84  17  75  16  13  64 493  16   4  36
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0]


Si ho heu fet correctament hauríeu d'obtenir això:

```
print('Padded sequence: ', train_pad_sequences[0])

Padded sequence:  [ 12  69   1  38   2   9  64 415  84  17  75  16  13  64 493  16   4  36
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0]
```


---

A continuació convertirem les classes d'intencions categòriques (*capacity*, *ground_service*, *flight*, etc.) en el que anomenem one-hot vector encoding. Aquesta tècnica s'utilitza per representar les dades categòriques com a vectors binaris. On cada vector representa una classe específica i l'element corresponent a la classe es posa a 1 i la resta d'elements es mantenen a 0.

Imaginem que tenim tres classes: *capacity*, *ground_service*, *flight*. Podríem codificar aquestes classes amb un vector únic de la forma següent:


```
   capacity -> [1, 0, 0]
   ground_service -> [0, 1, 0]
   flight -> [0, 0, 1]
```

Per aconseguir això primer codificarem les classes d'intenció en etiquetes numèriques.

In [12]:
label_encoder = LabelEncoder()
train_numerical_labels = label_encoder.fit_transform(train_labels)

print(f'Original labels: {train_labels}\n')
print(f'Encoded labels: {train_numerical_labels} \n')

Original labels: ['flight', 'flight', 'flight_time', 'airfare', 'airfare', 'flight', 'aircraft', 'flight', 'flight', 'ground_service', 'flight', 'flight', 'airport', 'flight', 'flight', 'airfare', 'ground_service', 'flight', 'flight', 'flight', 'flight', 'flight', 'flight', 'aircraft', 'airfare', 'flight', 'airline', 'flight', 'ground_service', 'flight', 'airfare', 'flight', 'flight', 'flight', 'flight', 'airfare', 'airline', 'flight', 'flight', 'flight', 'distance', 'flight', 'airline', 'airline', 'flight', 'airline', 'ground_service', 'abbreviation', 'flight', 'flight', 'flight_time', 'flight', 'flight', 'ground_fare', 'flight', 'abbreviation', 'flight', 'flight', 'flight', 'flight', 'flight', 'airline', 'flight', 'ground_service', 'airline', 'flight', 'flight', 'airport', 'flight', 'flight', 'abbreviation', 'flight', 'flight', 'flight', 'flight', 'aircraft', 'airfare', 'flight', 'flight', 'flight', 'flight', 'flight', 'flight', 'flight', 'airline', 'flight', 'flight', 'flight', 'fli

I a continuació convertim les etiquetes a vectors one-hot.

In [13]:
num_classes = len(np.unique(train_numerical_labels))
train_encoded_labels = to_categorical(train_numerical_labels, num_classes)

print('Example: \n')
print(f'Original label: {train_labels[0]}\n')
print(f'Numerical label: {train_numerical_labels[0]}\n')
print(f'One-hot: {train_encoded_labels[0]}\n')

Example: 

Original label: flight

Numerical label: 12

One-hot: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]



---

 <h1><a name="ex-three"><center> ✏ Exercici 3 ✏</a></h1>

Amb la partició de validació i test haurem de realitzar els mateixos passos. Per tant, en aquest exercici us demanem que obtingueu `val_pad_sequences`, `val_encoded_labels`, `test_pad_sequences` i `test_encoded_labels`.

In [14]:
val_sentences = list(val_data[0])
val_sequences = tokenizer.texts_to_sequences(val_sentences)
val_pad_sequences = pad_sequences(val_sequences, maxlen=max_sequence_length, padding='post')

test_sentences = list(test_data[0])
test_sequences = tokenizer.texts_to_sequences(test_sentences)
test_pad_sequences = pad_sequences(test_sequences, maxlen=max_sequence_length, padding='post')

In [15]:
# Cleaning labels

val_labels = list(s.replace('"', '') for s in val_data[2])
test_labels = list(s.replace('"', '') for s in test_data[2])

val_labels = list(s.replace(' ', '') for s in val_labels)
test_labels = list(s.replace(' ', '') for s in test_labels)

def remove_values_and_indices(input_list, values_to_remove, other_list):
    indices_to_remove = [idx for idx, item in enumerate(input_list) if item in values_to_remove]
    cleaned_list = [item for item in input_list if item not in values_to_remove]
    cleaned_other_list = [item for idx, item in enumerate(other_list) if idx not in indices_to_remove]
    return cleaned_list, np.array(cleaned_other_list)

values_to_remove = ['day_name','airfare+flight','flight+airline','flight_no+airline']
val_labels, val_pad_sequences = remove_values_and_indices(val_labels, values_to_remove, val_pad_sequences)
test_labels, test_pad_sequences = remove_values_and_indices(test_labels, values_to_remove, test_pad_sequences)


In [16]:
# Transforming cleaned labels to numerical and one-hot encoding

val_numerical_labels = label_encoder.transform(val_labels)
val_encoded_labels = to_categorical(val_numerical_labels, num_classes)

test_numerical_labels = label_encoder.transform(test_labels)
test_encoded_labels = to_categorical(test_numerical_labels, num_classes)

print(f'Original label: {val_labels[0]}\n')
print(f'Numerical label: {val_numerical_labels[0]}\n')
print(f'One-hot: {val_encoded_labels[0]}\n')


Original label: flight

Numerical label: 12

One-hot: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]



---

<h1><a name="section-three"> 3. Disseny del model i entrenament </a></h1>

En primer lloc, anem a comprovar si hi ha GPUs disponibles. A continuació si hi ha GPUs disponibles el codi assegurarà que *TensorFlow* només assigne memòria GPU quan siga necessari.

In [17]:
if tf.config.list_physical_devices('GPU'):
    print("GPU is available!")
else:
    print("GPU is not available. The model will be trained on CPU.")

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        print(e)

GPU is not available. The model will be trained on CPU.


---

 <h1><a name="ex-four"><center> ✏ Exercici 4 ✏</a></h1>

En aquest exercici haureu de dissenyar l'arquitectura del model. El nostre model tindrà quatre capes:

1. La primera capa serà un **embedding**. Aquesta capa permetrà convertir les dades de text d'entrada, en vectors densos amb una mida fixa (*embedding_dim*). Aquesta representació més compacta permetrà per una part capturar la informació semàntica del text d'entrada, permetent així generalitzar millor i comprendre les relacions entre les paraules. I, per una altra banda, reduir la complexitat computacional, accelerant així el temps d'entrenament i inferència. En resum, aquesta capa assignarà a cada índex de cada paraula un vector dens de mida *embedding_dim*.

2. La segona capa serà un **pooling** layer. L'entrada d'aquesta capa serà un tensor 3D (*batch_size*, *sequence_length*, *embedding_dim*). Aquesta capa es centrarà a capturar la informació més important de la seqüència d'entrada, és a dir, prendrà el valor màxim de la seqüència, donant lloc a un tensor 2D (batch_size, embedding_dim).

3. La tercera capa serà una capa **densa**. És a dir, una capa completament connectada (*fully-connected*): cada neurona d'aquesta capa estarà connectada a totes les neurones de la capa anterior. La funció d'activació que utilitzarem serà una ReLU. Aquesta funció introdueix una no-linealitat al model permetent així aprendre relacions complexes en les dades.

4. L'última capa també serà una capa **densa**. En aquest cas la funció d'activació haurà de ser la funció Softmax. Aquesta funció es fa servir per a convertir els valors de la capa anterior (*logits*) en probabilitats normalitzades. El valor de cada element de sortida representarà la probabilitat que l'entrada pertanya a una classe específica.


📢  Les capes que haureu de fer servir les podreu trobar [aquí](https://www.tensorflow.org/api_docs/python/tf/keras/layers).





In [18]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, MaxPooling1D, Dense, Flatten

#TODO
embedding_dim = 75

vocab_size = num_words + 1

model = Sequential()

# Capa d'Embedding
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_sequence_length))
# Capa de MaxPooling
model.add(MaxPooling1D(pool_size=2))
# Flatten layer to convert 3D tensor to 2D
model.add(Flatten())
# Capa Densa amb activació ReLU
model.add(Dense(128, activation='relu'))
# Capa de sortida amb Softmax
model.add(Dense(num_classes, activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])


# Train the model
batch_size = 32
epochs = 2
model.fit(train_pad_sequences, train_encoded_labels, batch_size=batch_size, epochs=epochs, validation_data=(val_pad_sequences, val_encoded_labels))

# Evaluate the model on the test set
loss, accuracy = model.evaluate(test_pad_sequences, test_encoded_labels, batch_size=batch_size)
print(f"Test accuracy: {accuracy:.2f}")

model.summary()


Epoch 1/2




[1m128/128[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 5ms/step - accuracy: 0.7570 - loss: 1.0214 - val_accuracy: 0.8256 - val_loss: 0.7053
Epoch 2/2
[1m128/128[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 5ms/step - accuracy: 0.7570 - loss: 1.0214 - val_accuracy: 0.8256 - val_loss: 0.7053
Epoch 2/2
[1m128/128[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 3ms/step - accuracy: 0.8683 - loss: 0.5204 - val_accuracy: 0.8856 - val_loss: 0.4736
[1m128/128[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 3ms/step - accuracy: 0.8683 - loss: 0.5204 - val_accuracy: 0.8856 - val_loss: 0.4736
[1m28/28[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.8277 - loss: 0.7078 
Test accuracy: 0.83
[1m28/28[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.8277 - loss: 0.7078 
Test accuracy: 0.83


Podeu veure a continuació les oracions que el model ha classificat incorrectament.

In [19]:
probs = model.predict(test_pad_sequences)
_predicted_labels = np.argmax(probs, axis=1)
predicted_labels = label_encoder.inverse_transform(_predicted_labels)
counter = 0

for i in range(0, len(predicted_labels)):
  if test_labels[i] != predicted_labels[i]:
    print('Sentence: ', test_sentences[i])
    print('Original label: ', test_labels[i])
    print('Predicted label: ', predicted_labels[i])
    print()
    counter += 1

print(f"Total misclassifications: {counter}")


[1m28/28[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 3ms/step
[1m28/28[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 3ms/step
Sentence:  on april first i need a ticket from tacoma to san jose departing before 7 am
Original label:  airfare
Predicted label:  flight

Sentence:  show flight and prices kansas city to chicago on next wednesday arriving in chicago by 7 pm
Original label:  flight+airfare
Predicted label:  flight

Sentence:  i need a flight from tampa to milwaukee
Original label:  meal
Predicted label:  flight

Sentence:  i need a flight from milwaukee to seattle
Original label:  meal
Predicted label:  flight

Sentence:  please find a flight from orlando to kansas city
Original label:  airport
Predicted label:  flight

Sentence:  i would like to fly from columbus to phoenix through cincinnati in the afternoon
Original label:  flight
Predicted label:  airfare

Sentence:  what is the most expensive one way fare between detroit and westchester county
Original l

---

 <h1><a name="ex-five"><center> ✏ Exercici 5 ✏ </a></h1>

Modifiqueu els següents paràmetres del model anterior i analitzeu com afecten a la seva *accuracy*:

 1. **Preprocessament.** Modifiqueu el Tokenizer per canviar la mida del vostre vocabulari i afegiu nous passos de preprocessament. Alguns possibles canvis són canviar la mida del vocabulari, treure la capitalització o fer servir *lemmatització* o *stemming*.

 2. **Mida dels Embeddings.** Proveu diferents mides d'*Embeddings* i observeu com canvia l'*accuracy* del model. Heu d'explicar les vostres conclusions.

 3. **Xarxes Convolucionals.** Afegiu capes convolucionals al vostre model. Expliqueu amb detall els valors que heu provat i la vostra motivació a l'hora d'escollir-los. Recordeu, que també podeu provar diferents configuracions de *pooling*.

 4. **Xarxes Recurrents.**  Afegiu capes recurrents al vostre model (LSTM, GRU). Expliqueu amb detall els valors que heu provat i la vostra motivació.

 5. **Regularització.** Quan proveu configuracions amb més paràmetres veureu que el model comença a tenir *overfitting* molt prompte durant l'entrenament. Afegiu *Dropout* al vostre model. Heu d'explicar la vostra decisió de valors i de posició dins de la xarxa.

 6. **Balancejat de les classes.** Si analitzeu el dataset, veureu que la freqüència de les classes està molt desbalancejada. Keras us permet afegir un pes per a cada classe a l'hora de calcular la loss (Mireu el paràmetre "class_weigth" a la documentació https://keras.io/api/models/model_training_apis/). Calculeu un pes per a cada classe i afegiu-lo al mètode fit del vostre model.

 ---

## Diferents preprocessings

In [35]:
# Diferents preprocessings
from sklearn.metrics import f1_score

# Diferentes preprocessings
## Cambiar tamaño tokenizer
### 300 
num_words=300
tokenizer_300 = Tokenizer(num_words)
tokenizer_300.fit_on_texts(train_sentences)

train_sequences_300 = tokenizer_300.texts_to_sequences(train_sentences)
train_pad_sequences_300 = pad_sequences(train_sequences_300, maxlen=max_sequence_length, padding='post')

val_sequences_300 = tokenizer_300.texts_to_sequences(val_sentences)
val_pad_sequences_300 = pad_sequences(val_sequences_300, maxlen=max_sequence_length, padding='post')

# Rebuild model with vocab_size=301
model_300 = Sequential()
model_300.add(Embedding(input_dim=301, output_dim=75, input_length=max_sequence_length))
model_300.add(MaxPooling1D(pool_size=2))
model_300.add(Flatten())
model_300.add(Dense(128, activation='relu'))
model_300.add(Dense(num_classes, activation='softmax'))
model_300.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train using train and validate with val
model_300.fit(train_pad_sequences_300, train_encoded_labels, 
              batch_size=32, epochs=5, 
              validation_data=(val_pad_sequences_300, val_encoded_labels))

# Evaluate on validation set (NOT test yet)
val_loss, val_accuracy = model_300.evaluate(val_pad_sequences_300, val_encoded_labels, batch_size=32, verbose=0)

# Get predictions for F1-score
val_probs = model_300.predict(val_pad_sequences_300, verbose=0)
val_pred = np.argmax(val_probs, axis=1)
val_true = np.argmax(val_encoded_labels, axis=1)
val_f1 = f1_score(val_true, val_pred, average='weighted')

print(f"Validation accuracy (vocab=300): {val_accuracy:.4f}")
print(f"Validation F1-score (vocab=300): {val_f1:.4f}")

Epoch 1/5




[1m128/128[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 6ms/step - accuracy: 0.7543 - loss: 1.0446 - val_accuracy: 0.8033 - val_loss: 0.7441
Epoch 2/5
[1m128/128[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 4ms/step - accuracy: 0.8649 - loss: 0.5379 - val_accuracy: 0.8789 - val_loss: 0.4944
Epoch 3/5
[1m128/128[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 4ms/step - accuracy: 0.9129 - loss: 0.3404 - val_accuracy: 0.9133 - val_loss: 0.3759
Epoch 4/5
[1m128/128[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 5ms/step - accuracy: 0.9392 - loss: 0.2333 - val_accuracy: 0.9233 - val_loss: 0.3113
Epoch 5/5
[1m128/128[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 5ms/step - accuracy: 0.9559 - loss: 0.1727 - val_accuracy: 0.9311 - val_loss: 0.2773
Validation accuracy (vocab=300): 0.9311
Validation F1-score (vocab=300): 0.9148


In [37]:
# 800 palabras
num_words=800
tokenizer_800 = Tokenizer(num_words)
tokenizer_800.fit_on_texts(train_sentences)

train_sequences_800 = tokenizer_800.texts_to_sequences(train_sentences)
train_pad_sequences_800 = pad_sequences(train_sequences_800, maxlen=max_sequence_length, padding='post')

val_sequences_800 = tokenizer_800.texts_to_sequences(val_sentences)
val_pad_sequences_800 = pad_sequences(val_sequences_800, maxlen=max_sequence_length, padding='post')

# Rebuild the model with the correct vocab_size
vocab_size = num_words + 1
model_800 = Sequential()
model_800.add(Embedding(input_dim=vocab_size, output_dim=75, input_length=max_sequence_length))
model_800.add(MaxPooling1D(pool_size=2))
model_800.add(Flatten())
model_800.add(Dense(128, activation='relu'))
model_800.add(Dense(num_classes, activation='softmax'))
model_800.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

model_800.fit(train_pad_sequences_800, train_encoded_labels, 
              batch_size=32, epochs=5, 
              validation_data=(val_pad_sequences_800, val_encoded_labels))

# Evaluate on validation
val_loss, val_accuracy = model_800.evaluate(val_pad_sequences_800, val_encoded_labels, batch_size=32, verbose=0)
val_probs = model_800.predict(val_pad_sequences_800, verbose=0)
val_pred = np.argmax(val_probs, axis=1)
val_true = np.argmax(val_encoded_labels, axis=1)
val_f1 = f1_score(val_true, val_pred, average='weighted')

print(f"Validation accuracy (vocab=800): {val_accuracy:.4f}")
print(f"Validation F1-score (vocab=800): {val_f1:.4f}")

Epoch 1/5




[1m128/128[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 5ms/step - accuracy: 0.7599 - loss: 1.0274 - val_accuracy: 0.7933 - val_loss: 0.7305
Epoch 2/5
[1m128/128[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 5ms/step - accuracy: 0.8590 - loss: 0.5335 - val_accuracy: 0.8833 - val_loss: 0.4895
Epoch 3/5
[1m128/128[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 4ms/step - accuracy: 0.9166 - loss: 0.3362 - val_accuracy: 0.9189 - val_loss: 0.3686
Epoch 4/5
[1m128/128[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 5ms/step - accuracy: 0.9419 - loss: 0.2221 - val_accuracy: 0.9267 - val_loss: 0.3062
Epoch 5/5
[1m128/128[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 6ms/step - accuracy: 0.9578 - loss: 0.1509 - val_accuracy: 0.9333 - val_loss: 0.2779
Validation accuracy (vocab=800): 0.9333
Validation F1-score (vocab=800): 0.9178


In [43]:
# Preprocessing + training using Porter Stemming
import nltk
from nltk.stem import PorterStemmer
from sklearn.metrics import f1_score
nltk.download('punkt', quiet=True)

stemmer = PorterStemmer()

# Aplicar stemming y lowercase
train_sentences_stemmed = [' '.join([stemmer.stem(word.lower()) for word in sentence.split()]) for sentence in train_sentences]
val_sentences_stemmed = [' '.join([stemmer.stem(word.lower()) for word in sentence.split()]) for sentence in val_sentences]

# Tokenizar con stemming
num_words_stem = 500
tokenizer_stem = Tokenizer(num_words=num_words_stem)
tokenizer_stem.fit_on_texts(train_sentences_stemmed)

train_seq_stem = tokenizer_stem.texts_to_sequences(train_sentences_stemmed)
val_seq_stem = tokenizer_stem.texts_to_sequences(val_sentences_stemmed)

max_seq_stem = max(map(len, train_seq_stem))
train_pad_stem = pad_sequences(train_seq_stem, maxlen=max_seq_stem, padding='post')
val_pad_stem = pad_sequences(val_seq_stem, maxlen=max_seq_stem, padding='post')

# Construir modelo
vocab_size_stem = num_words_stem + 1
model_stem = Sequential()
model_stem.add(Embedding(input_dim=vocab_size_stem, output_dim=75, input_length=max_seq_stem))
model_stem.add(MaxPooling1D(pool_size=2))
model_stem.add(Flatten())
model_stem.add(Dense(128, activation='relu'))
model_stem.add(Dense(num_classes, activation='softmax'))
model_stem.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

history_stem = model_stem.fit(train_pad_stem, train_encoded_labels, 
                              batch_size=32, epochs=5, 
                              validation_data=(val_pad_stem, val_encoded_labels))

# Evaluate on validation
val_loss, val_accuracy = model_stem.evaluate(val_pad_stem, val_encoded_labels, batch_size=32, verbose=0)
val_probs = model_stem.predict(val_pad_stem, verbose=0)
val_pred = np.argmax(val_probs, axis=1)
val_true = np.argmax(val_encoded_labels, axis=1)
val_f1 = f1_score(val_true, val_pred, average='weighted')

print(f"[Stemming] Validation accuracy: {val_accuracy:.4f}")
print(f"[Stemming] Validation F1-score: {val_f1:.4f}")

Epoch 1/5




[1m128/128[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 5ms/step - accuracy: 0.7690 - loss: 0.9536 - val_accuracy: 0.8400 - val_loss: 0.6735
Epoch 2/5
[1m128/128[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 4ms/step - accuracy: 0.8830 - loss: 0.4804 - val_accuracy: 0.9011 - val_loss: 0.4505
Epoch 3/5
[1m128/128[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 4ms/step - accuracy: 0.9282 - loss: 0.2946 - val_accuracy: 0.9267 - val_loss: 0.3302
Epoch 4/5
[1m128/128[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 6ms/step - accuracy: 0.9495 - loss: 0.1912 - val_accuracy: 0.9311 - val_loss: 0.2906
Epoch 5/5
[1m128/128[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 6ms/step - accuracy: 0.9674 - loss: 0.1270 - val_accuracy: 0.9400 - val_loss: 0.2608
[Stemming] Validation accuracy: 0.9400
[Stemming] Validation F1-score: 0.9280


In [45]:
# Preprocessing + training using WordNet Lemmatization
import nltk
from nltk.stem import WordNetLemmatizer
from sklearn.metrics import f1_score
nltk.download('wordnet', quiet=True)
nltk.download('omw-1.4', quiet=True)

lemmatizer = WordNetLemmatizer()

# Aplicar lemmatization y lowercase
train_sentences_lem = [' '.join([lemmatizer.lemmatize(word.lower()) for word in sentence.split()]) for sentence in train_sentences]
val_sentences_lem = [' '.join([lemmatizer.lemmatize(word.lower()) for word in sentence.split()]) for sentence in val_sentences]

# Tokenizar
num_words_lem = 500
tokenizer_lem = Tokenizer(num_words=num_words_lem)
tokenizer_lem.fit_on_texts(train_sentences_lem)

train_seq_lem = tokenizer_lem.texts_to_sequences(train_sentences_lem)
val_seq_lem = tokenizer_lem.texts_to_sequences(val_sentences_lem)

max_seq_lem = max(map(len, train_seq_lem))
train_pad_lem = pad_sequences(train_seq_lem, maxlen=max_seq_lem, padding='post')
val_pad_lem = pad_sequences(val_seq_lem, maxlen=max_seq_lem, padding='post')

# Construir modelo
vocab_size_lem = num_words_lem + 1
model_lem = Sequential()
model_lem.add(Embedding(input_dim=vocab_size_lem, output_dim=75, input_length=max_seq_lem))
model_lem.add(MaxPooling1D(pool_size=2))
model_lem.add(Flatten())
model_lem.add(Dense(128, activation='relu'))
model_lem.add(Dense(num_classes, activation='softmax'))
model_lem.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

history_lem = model_lem.fit(train_pad_lem, train_encoded_labels, 
                            batch_size=32, epochs=5, 
                            validation_data=(val_pad_lem, val_encoded_labels))

# Evaluate on validation
val_loss, val_accuracy = model_lem.evaluate(val_pad_lem, val_encoded_labels, batch_size=32, verbose=0)
val_probs = model_lem.predict(val_pad_lem, verbose=0)
val_pred = np.argmax(val_probs, axis=1)
val_true = np.argmax(val_encoded_labels, axis=1)
val_f1 = f1_score(val_true, val_pred, average='weighted')

print(f"[Lemmatization] Validation accuracy: {val_accuracy:.4f}")
print(f"[Lemmatization] Validation F1-score: {val_f1:.4f}")

Epoch 1/5




[1m128/128[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 5ms/step - accuracy: 0.7616 - loss: 1.0124 - val_accuracy: 0.8300 - val_loss: 0.6915
Epoch 2/5
[1m128/128[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 4ms/step - accuracy: 0.8754 - loss: 0.5007 - val_accuracy: 0.8933 - val_loss: 0.4408
Epoch 3/5
[1m128/128[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 4ms/step - accuracy: 0.9296 - loss: 0.2941 - val_accuracy: 0.9144 - val_loss: 0.3415
Epoch 4/5
[1m128/128[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 5ms/step - accuracy: 0.9490 - loss: 0.1937 - val_accuracy: 0.9289 - val_loss: 0.2912
Epoch 5/5
[1m128/128[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 6ms/step - accuracy: 0.9644 - loss: 0.1332 - val_accuracy: 0.9389 - val_loss: 0.2618
[Lemmatization] Validation accuracy: 0.9389
[Lemmatization] Validation F1-score: 0.9245


In [None]:
## Utilitzar n-grams

### Diferents embeddings

In [46]:
# Experiments: provar diferents mides d'embeddings
import pandas as pd
from sklearn.metrics import f1_score

embedding_sizes = [50, 75, 100, 150, 200]
results = []
vocab_size = num_words + 1
batch_size = 32
epochs = 3

for emb in embedding_sizes:
    print('---------------------------------------')
    print(f'Starting experiment: embedding_dim={emb}')
    
    # Construir model
    model_emb = Sequential()
    model_emb.add(Embedding(input_dim=vocab_size, output_dim=emb, input_length=max_sequence_length))
    model_emb.add(MaxPooling1D(pool_size=2))
    model_emb.add(Flatten())
    model_emb.add(Dense(128, activation='relu'))
    model_emb.add(Dense(num_classes, activation='softmax'))
    model_emb.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    
    # Entrenar con train, validar con val
    history = model_emb.fit(train_pad_sequences, train_encoded_labels, 
                           batch_size=batch_size, epochs=epochs, 
                           validation_data=(val_pad_sequences, val_encoded_labels), 
                           verbose=0)
    
    # Evaluar en validation
    val_loss, val_acc = model_emb.evaluate(val_pad_sequences, val_encoded_labels, batch_size=batch_size, verbose=0)
    
    # Calcular F1-score en validation
    val_probs = model_emb.predict(val_pad_sequences, verbose=0)
    val_pred = np.argmax(val_probs, axis=1)
    val_true = np.argmax(val_encoded_labels, axis=1)
    val_f1 = f1_score(val_true, val_pred, average='weighted')
    
    # Guardar resultados
    val_acc_final = history.history['val_accuracy'][-1]
    results.append({
        'embedding_dim': emb, 
        'val_accuracy': float(val_acc), 
        'val_f1_score': float(val_f1),
        'val_loss': float(val_loss),
        'train_accuracy': float(history.history['accuracy'][-1])
    })
    
    print(f'Finished embedding_dim={emb} -> Val accuracy: {val_acc:.4f}, Val F1: {val_f1:.4f}')

# Crear tabla resumen
df_results = pd.DataFrame(results).sort_values('embedding_dim').reset_index(drop=True)
print('\n=== Embedding experiment summary table ===')
print(df_results.to_string(index=False))
df_results.to_csv('embedding_experiments_summary.csv', index=False)
print('Saved results to embedding_experiments_summary.csv')

---------------------------------------
Starting experiment: embedding_dim=50




Finished embedding_dim=50 -> Val accuracy: 0.9033, Val F1: 0.8781
---------------------------------------
Starting experiment: embedding_dim=75




Finished embedding_dim=75 -> Val accuracy: 0.9100, Val F1: 0.8868
---------------------------------------
Starting experiment: embedding_dim=100




Finished embedding_dim=100 -> Val accuracy: 0.9256, Val F1: 0.9047
---------------------------------------
Starting experiment: embedding_dim=150




Finished embedding_dim=150 -> Val accuracy: 0.9367, Val F1: 0.9189
---------------------------------------
Starting experiment: embedding_dim=200




Finished embedding_dim=200 -> Val accuracy: 0.9300, Val F1: 0.9123

=== Embedding experiment summary table ===
 embedding_dim  val_accuracy  val_f1_score  val_loss  train_accuracy
            50      0.903333      0.878118  0.422167        0.900196
            75      0.910000      0.886798  0.378808        0.915890
           100      0.925556      0.904659  0.334355        0.932565
           150      0.936667      0.918924  0.279682        0.944090
           200      0.930000      0.912326  0.284074        0.947033
Saved results to embedding_experiments_summary.csv


### Afedir capes convolucionals

In [47]:
# Afegir capes convolucionals
from sklearn.metrics import f1_score

configs_conv = [
    {'filters': 64, 'kernel_size': 3, 'pooling': 'max'},
    {'filters': 128, 'kernel_size': 3, 'pooling': 'max'},
    {'filters': 64, 'kernel_size': 5, 'pooling': 'max'},
    {'filters': 128, 'kernel_size': 5, 'pooling': 'average'},
]

results_conv = []
embedding_dim = 75
vocab_size = num_words + 1
batch_size = 32
epochs = 3

for i, config in enumerate(configs_conv):
    print(f'\n=== Config {i+1}: filters={config["filters"]}, kernel={config["kernel_size"]}, pooling={config["pooling"]} ===')
    
    model_conv = Sequential()
    model_conv.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_sequence_length))
    model_conv.add(Conv1D(filters=config['filters'], kernel_size=config['kernel_size'], activation='relu'))
    
    if config['pooling'] == 'max':
        model_conv.add(GlobalMaxPooling1D())
    else:
        model_conv.add(GlobalAveragePooling1D())
    
    model_conv.add(Dense(128, activation='relu'))
    model_conv.add(Dense(num_classes, activation='softmax'))
    model_conv.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    
    # Entrenar
    history = model_conv.fit(train_pad_sequences, train_encoded_labels, 
                            batch_size=batch_size, epochs=epochs,
                            validation_data=(val_pad_sequences, val_encoded_labels),
                            verbose=0)
    
    # Evaluar en validation
    val_loss, val_acc = model_conv.evaluate(val_pad_sequences, val_encoded_labels, batch_size=batch_size, verbose=0)
    
    # Calcular F1-score
    val_probs = model_conv.predict(val_pad_sequences, verbose=0)
    val_pred = np.argmax(val_probs, axis=1)
    val_true = np.argmax(val_encoded_labels, axis=1)
    val_f1 = f1_score(val_true, val_pred, average='weighted')
    
    results_conv.append({
        'config': f"filters={config['filters']}, kernel={config['kernel_size']}, pooling={config['pooling']}",
        'val_accuracy': float(val_acc),
        'val_f1_score': float(val_f1),
        'val_loss': float(val_loss),
        'train_accuracy': float(history.history['accuracy'][-1])
    })
    
    print(f'Val accuracy: {val_acc:.4f}, Val F1: {val_f1:.4f}')

df_conv = pd.DataFrame(results_conv)
print('\n=== Convolutional layers experiment summary ===')
print(df_conv.to_string(index=False))
df_conv.to_csv('conv_experiments_summary.csv', index=False)


=== Config 1: filters=64, kernel=3, pooling=max ===




Val accuracy: 0.9244, Val F1: 0.9009

=== Config 2: filters=128, kernel=3, pooling=max ===




Val accuracy: 0.9300, Val F1: 0.9122

=== Config 3: filters=64, kernel=5, pooling=max ===




Val accuracy: 0.9300, Val F1: 0.9113

=== Config 4: filters=128, kernel=5, pooling=average ===




Val accuracy: 0.8833, Val F1: 0.8462

=== Convolutional layers experiment summary ===
                                config  val_accuracy  val_f1_score  val_loss  train_accuracy
     filters=64, kernel=3, pooling=max      0.924444      0.900892  0.313098        0.925699
    filters=128, kernel=3, pooling=max      0.930000      0.912223  0.269351        0.935508
     filters=64, kernel=5, pooling=max      0.930000      0.911255  0.298243        0.929868
filters=128, kernel=5, pooling=average      0.883333      0.846157  0.489401        0.872977


In [48]:
# Afegir capes recurrents (LSTM i GRU)
from keras.layers import LSTM, GRU, Bidirectional
from sklearn.metrics import f1_score

configs_rnn = [
    {'type': 'LSTM', 'units': 64, 'bidirectional': False},
    {'type': 'LSTM', 'units': 128, 'bidirectional': False},
    {'type': 'LSTM', 'units': 64, 'bidirectional': True},
    {'type': 'GRU', 'units': 64, 'bidirectional': False},
    {'type': 'GRU', 'units': 128, 'bidirectional': False},
    {'type': 'GRU', 'units': 64, 'bidirectional': True},
]

results_rnn = []
embedding_dim = 75
vocab_size = num_words + 1
batch_size = 32
epochs = 3

for i, config in enumerate(configs_rnn):
    print(f'\n=== Config {i+1}: {config["type"]}, units={config["units"]}, bidirectional={config["bidirectional"]} ===')
    
    model_rnn = Sequential()
    model_rnn.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_sequence_length))
    
    if config['type'] == 'LSTM':
        if config['bidirectional']:
            model_rnn.add(Bidirectional(LSTM(config['units'])))
        else:
            model_rnn.add(LSTM(config['units']))
    else:  # GRU
        if config['bidirectional']:
            model_rnn.add(Bidirectional(GRU(config['units'])))
        else:
            model_rnn.add(GRU(config['units']))
    
    model_rnn.add(Dense(128, activation='relu'))
    model_rnn.add(Dense(num_classes, activation='softmax'))
    model_rnn.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    
    # Entrenar
    history = model_rnn.fit(train_pad_sequences, train_encoded_labels,
                           batch_size=batch_size, epochs=epochs,
                           validation_data=(val_pad_sequences, val_encoded_labels),
                           verbose=0)
    
    # Evaluar en validation
    val_loss, val_acc = model_rnn.evaluate(val_pad_sequences, val_encoded_labels, batch_size=batch_size, verbose=0)
    
    # Calcular F1-score
    val_probs = model_rnn.predict(val_pad_sequences, verbose=0)
    val_pred = np.argmax(val_probs, axis=1)
    val_true = np.argmax(val_encoded_labels, axis=1)
    val_f1 = f1_score(val_true, val_pred, average='weighted')
    
    results_rnn.append({
        'config': f"{config['type']}, units={config['units']}, bidirectional={config['bidirectional']}",
        'val_accuracy': float(val_acc),
        'val_f1_score': float(val_f1),
        'val_loss': float(val_loss),
        'train_accuracy': float(history.history['accuracy'][-1])
    })
    
    print(f'Val accuracy: {val_acc:.4f}, Val F1: {val_f1:.4f}')

df_rnn = pd.DataFrame(results_rnn)
print('\n=== Recurrent layers experiment summary ===')
print(df_rnn.to_string(index=False))
df_rnn.to_csv('rnn_experiments_summary.csv', index=False)


=== Config 1: LSTM, units=64, bidirectional=False ===




Val accuracy: 0.7144, Val F1: 0.5954

=== Config 2: LSTM, units=128, bidirectional=False ===




Val accuracy: 0.7144, Val F1: 0.5954

=== Config 3: LSTM, units=64, bidirectional=True ===




Val accuracy: 0.9189, Val F1: 0.8982

=== Config 4: GRU, units=64, bidirectional=False ===




Val accuracy: 0.8578, Val F1: 0.8082

=== Config 5: GRU, units=128, bidirectional=False ===




Val accuracy: 0.8411, Val F1: 0.8067

=== Config 6: GRU, units=64, bidirectional=True ===




Val accuracy: 0.9233, Val F1: 0.9019

=== Recurrent layers experiment summary ===
                              config  val_accuracy  val_f1_score  val_loss  train_accuracy
 LSTM, units=64, bidirectional=False      0.714444      0.595448  1.195156        0.741295
LSTM, units=128, bidirectional=False      0.714444      0.595448  1.218132        0.741295
  LSTM, units=64, bidirectional=True      0.918889      0.898222  0.344093        0.909514
  GRU, units=64, bidirectional=False      0.857778      0.808186  0.590852        0.815841
 GRU, units=128, bidirectional=False      0.841111      0.806684  0.655135        0.770231
   GRU, units=64, bidirectional=True      0.923333      0.901920  0.309552        0.924718


### Afegir Dropout

In [49]:
# Afegir Dropout per evitar overfitting
from sklearn.metrics import f1_score

dropout_configs = [
    {'dropout_embedding': 0.2, 'dropout_dense': 0.3},
    {'dropout_embedding': 0.3, 'dropout_dense': 0.5},
    {'dropout_embedding': 0.4, 'dropout_dense': 0.5},
    {'dropout_embedding': 0.5, 'dropout_dense': 0.5},
]

results_dropout = []
embedding_dim = 75
vocab_size = num_words + 1
batch_size = 32
epochs = 5

for i, config in enumerate(dropout_configs):
    print(f'\n=== Config {i+1}: dropout_emb={config["dropout_embedding"]}, dropout_dense={config["dropout_dense"]} ===')
    
    model_drop = Sequential()
    model_drop.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_sequence_length))
    model_drop.add(Dropout(config['dropout_embedding']))
    model_drop.add(MaxPooling1D(pool_size=2))
    model_drop.add(Flatten())
    model_drop.add(Dense(128, activation='relu'))
    model_drop.add(Dropout(config['dropout_dense']))
    model_drop.add(Dense(num_classes, activation='softmax'))
    
    model_drop.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    
    # Entrenar
    history = model_drop.fit(train_pad_sequences, train_encoded_labels,
                            batch_size=batch_size, epochs=epochs,
                            validation_data=(val_pad_sequences, val_encoded_labels),
                            verbose=0)
    
    # Evaluar en validation
    val_loss, val_acc = model_drop.evaluate(val_pad_sequences, val_encoded_labels, batch_size=batch_size, verbose=0)
    
    # Calcular F1-score
    val_probs = model_drop.predict(val_pad_sequences, verbose=0)
    val_pred = np.argmax(val_probs, axis=1)
    val_true = np.argmax(val_encoded_labels, axis=1)
    val_f1 = f1_score(val_true, val_pred, average='weighted')
    
    # Calcular diferencia train-val per veure overfitting
    train_acc_final = history.history['accuracy'][-1]
    val_acc_final = history.history['val_accuracy'][-1]
    overfitting_gap = train_acc_final - val_acc_final
    
    results_dropout.append({
        'dropout_embedding': config['dropout_embedding'],
        'dropout_dense': config['dropout_dense'],
        'val_accuracy': float(val_acc),
        'val_f1_score': float(val_f1),
        'train_accuracy': float(train_acc_final),
        'overfitting_gap': float(overfitting_gap)
    })
    
    print(f'Val Acc: {val_acc:.4f}, Val F1: {val_f1:.4f}, Train: {train_acc_final:.4f}, Gap: {overfitting_gap:.4f}')

df_dropout = pd.DataFrame(results_dropout)
print('\n=== Dropout experiment summary ===')
print(df_dropout.to_string(index=False))
df_dropout.to_csv('dropout_experiments_summary.csv', index=False)


=== Config 1: dropout_emb=0.2, dropout_dense=0.3 ===




Val Acc: 0.9300, Val F1: 0.9111, Train: 0.9409, Gap: 0.0109

=== Config 2: dropout_emb=0.3, dropout_dense=0.5 ===




Val Acc: 0.9211, Val F1: 0.9001, Train: 0.9267, Gap: 0.0056

=== Config 3: dropout_emb=0.4, dropout_dense=0.5 ===




Val Acc: 0.9200, Val F1: 0.9012, Train: 0.9183, Gap: -0.0017

=== Config 4: dropout_emb=0.5, dropout_dense=0.5 ===




Val Acc: 0.9200, Val F1: 0.8976, Train: 0.9007, Gap: -0.0193

=== Dropout experiment summary ===
 dropout_embedding  dropout_dense  val_accuracy  val_f1_score  train_accuracy  overfitting_gap
               0.2            0.3      0.930000      0.911050        0.940902         0.010902
               0.3            0.5      0.921111      0.900120        0.926680         0.005569
               0.4            0.5      0.920000      0.901213        0.918342        -0.001658
               0.5            0.5      0.920000      0.897630        0.900687        -0.019313


### Blancejar les dades

In [50]:
# Calcular class weights per balancejar les classes
from sklearn.utils.class_weight import compute_class_weight
from sklearn.metrics import f1_score

# Calcular pesos de les classes
class_weights = compute_class_weight('balanced', 
                                    classes=np.unique(train_numerical_labels), 
                                    y=train_numerical_labels)
class_weight_dict = dict(enumerate(class_weights))

print("Class weights calculats:")
for i, weight in class_weight_dict.items():
    label_name = label_encoder.inverse_transform([i])[0]
    count = np.sum(train_numerical_labels == i)
    print(f"  {label_name} (count={count}): {weight:.3f}")

embedding_dim = 75
vocab_size = num_words + 1
batch_size = 32
epochs = 5

# Model sense class weights
print('\n=== Training WITHOUT class weights ===')
model_no_weights = Sequential()
model_no_weights.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_sequence_length))
model_no_weights.add(MaxPooling1D(pool_size=2))
model_no_weights.add(Flatten())
model_no_weights.add(Dense(128, activation='relu'))
model_no_weights.add(Dense(num_classes, activation='softmax'))
model_no_weights.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

history_no_weights = model_no_weights.fit(train_pad_sequences, train_encoded_labels,
                                         batch_size=batch_size, epochs=epochs,
                                         validation_data=(val_pad_sequences, val_encoded_labels),
                                         verbose=0)

val_loss_no_w, val_acc_no_w = model_no_weights.evaluate(val_pad_sequences, val_encoded_labels, batch_size=batch_size, verbose=0)
val_probs = model_no_weights.predict(val_pad_sequences, verbose=0)
val_pred = np.argmax(val_probs, axis=1)
val_true = np.argmax(val_encoded_labels, axis=1)
val_f1_no_w = f1_score(val_true, val_pred, average='weighted')

print(f'Validation accuracy WITHOUT weights: {val_acc_no_w:.4f}')
print(f'Validation F1-score WITHOUT weights: {val_f1_no_w:.4f}')

# Model amb class weights
print('\n=== Training WITH class weights ===')
model_with_weights = Sequential()
model_with_weights.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_sequence_length))
model_with_weights.add(MaxPooling1D(pool_size=2))
model_with_weights.add(Flatten())
model_with_weights.add(Dense(128, activation='relu'))
model_with_weights.add(Dense(num_classes, activation='softmax'))
model_with_weights.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

history_with_weights = model_with_weights.fit(train_pad_sequences, train_encoded_labels,
                                              batch_size=batch_size, epochs=epochs,
                                              class_weight=class_weight_dict,
                                              validation_data=(val_pad_sequences, val_encoded_labels),
                                              verbose=0)

val_loss_with_w, val_acc_with_w = model_with_weights.evaluate(val_pad_sequences, val_encoded_labels, batch_size=batch_size, verbose=0)
val_probs = model_with_weights.predict(val_pad_sequences, verbose=0)
val_pred = np.argmax(val_probs, axis=1)
val_f1_with_w = f1_score(val_true, val_pred, average='weighted')

print(f'Validation accuracy WITH weights: {val_acc_with_w:.4f}')
print(f'Validation F1-score WITH weights: {val_f1_with_w:.4f}')

# Comparación
print(f'\n=== Comparison ===')
print(f'Accuracy improvement: {(val_acc_with_w - val_acc_no_w):.4f}')
print(f'F1-score improvement: {(val_f1_with_w - val_f1_no_w):.4f}')

# Guardar resultats
results_weights = pd.DataFrame({
    'approach': ['without_weights', 'with_weights'],
    'val_accuracy': [float(val_acc_no_w), float(val_acc_with_w)],
    'val_f1_score': [float(val_f1_no_w), float(val_f1_with_w)],
    'val_loss': [float(val_loss_no_w), float(val_loss_with_w)]
})
print('\n', results_weights.to_string(index=False))
results_weights.to_csv('class_weights_comparison.csv', index=False)

Class weights calculats:
  abbreviation (count=120): 1.545
  aircraft (count=68): 2.726
  aircraft+flight+flight_no (count=1): 185.364
  airfare (count=333): 0.557
  airfare+flight_time (count=1): 185.364
  airline (count=125): 1.483
  airline+flight_no (count=2): 92.682
  airport (count=14): 13.240
  capacity (count=13): 14.259
  cheapest (count=1): 185.364
  city (count=16): 11.585
  distance (count=15): 12.358
  flight (count=3023): 0.061
  flight+airfare (count=15): 12.358
  flight_no (count=11): 16.851
  flight_time (count=46): 4.030
  ground_fare (count=15): 12.358
  ground_service (count=203): 0.913
  ground_service+ground_fare (count=1): 185.364
  meal (count=3): 61.788
  quantity (count=50): 3.707
  restriction (count=2): 92.682

=== Training WITHOUT class weights ===




Validation accuracy WITHOUT weights: 0.9344
Validation F1-score WITHOUT weights: 0.9228

=== Training WITH class weights ===




Validation accuracy WITH weights: 0.7978
Validation F1-score WITH weights: 0.8270

=== Comparison ===
Accuracy improvement: -0.1367
F1-score improvement: -0.0958

        approach  val_accuracy  val_f1_score  val_loss
without_weights      0.934444      0.922842  0.278805
   with_weights      0.797778      0.827002  0.865408


### Grid-Search

In [None]:
# Grid Search

Experiments per provar quina es la mida optima de vocab, basat en q quina metrica ? accuracy, es la millor (nomes si es classes estan balancejades), si ni la f1-score es la millor.

provar diferents mides embeddings - posarli grafiques per decidir quina es la millor

xarxes conv -> potser hi ha una millor opcio que fer el maxim (??)

fer experiments per probar quin kernel es millor

provar drop out diferents valors-

class weights, hi ha classes desbalanceadas, si hay alguna muy frecuente puede hacer que se equivoque mas en las mayoritarias y menos en las minoritarias  


<h1><a name="section-four"> 4. Lliurable </a></h1>

Heu d'entregar un document PDF de com a **màxim 10 pàgines** que incloga els resultats de tots els exercicis així com una explicació de cadascun dels resultats i de la modificació que heu fet. L'estructura del document és:

1. Introducció.
2. Experiments i Resultats (amb raonament).
3. Conclusions.

No cal que afegiu el vostre codi al document, podeu entregar el *notebook* juntament amb el document.

 ---