## Natural language inference

### Contradictory, My Dear Watson
***

Project: https://www.kaggle.com/competitions/contradictory-my-dear-watson

__First approach to the NLI problem__

The goal of this notebook is to explore the dataset and test different approaches to the problem:

- Hypothesis baseline approach
- Classic machine learning
- Neural networks


Datasets: 
- Train
- Test
- Submission

Labels: 

- entailment = 0 
- neutral = 1
- contradiction = 2

Models to compare:

- bert_base_multi

- deberta_v3_base_multi

- distil_bert_base_multi

- xlm_roberta_base_multi

- xlm_roberta_large_multi

__Keras BERT model__

https://keras.io/api/keras_nlp/models/bert/


References:

https://www.kaggle.com/code/alexia/kerasnlp-starter-notebook-contradictory-dearwatson


__Libraries to install__

In [None]:
# Uncomment the lines below if the packages are not already installed.
#!pip install -q keras-nlp --upgrade
#!pip install seaborn
#!pip install tensorflow
#!pip install tensorflow-text

__Import libraries__

In [None]:
import numpy as np 
import pandas as pd 
import tensorflow as tf
from tensorflow import keras
import keras_nlp
import seaborn as sns
import matplotlib.pyplot as plt
import os

print("TensorFlow version:", tf.__version__)
print("KerasNLP version:", keras_nlp.__version__)

__Configuration__

In [None]:
try:
    # detect and init the TPU
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.TPUStrategy(resolver)
    print("All devices: ", tf.config.list_logical_devices('TPU'))
except ValueError:
    strategy = tf.distribute.get_strategy()  # default strategy if no TPU available

In [None]:
RESULT_DICT = {
    0 : "entailment",
    1 : "neutral",
    2 : "contradiction"
}

__Data__

In [None]:
df_train = pd.read_csv('/kaggle/input/contradictory-my-dear-watson/train.csv')
df_test = pd.read_csv('/kaggle/input/contradictory-my-dear-watson/test.csv')

In [None]:
df_test.head()

In [None]:
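def display_pair_of_sentence(x):
    """Print the premise, hypothesis, language and label of one row."""
    print("Premise: " + x['premise'])
    print("Hypothesis: " + x['hypothesis'])
    print("Language: " + x['language'])
    print("Label: " + str(x['label']))
    print()

# Iterate explicitly: the function only prints, so apply() would just build a Series of None.
for _, row in df_train.head(5).iterrows():
    display_pair_of_sentence(row)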

df_train.shape

In [None]:
f, ax = plt.subplots(figsize=(12, 4))

sns.set_color_codes("pastel")
sns.despine()
ax = sns.countplot(data=df_train, 
                   y="label",
                   order = df_train['label'].value_counts().index)

abs_values = df_train['label'].value_counts(ascending=False)
rel_values = df_train['label'].value_counts(ascending=False, normalize=True).values * 100
lbls = [f'{p[0]} ({p[1]:.0f}%)' for p in zip(abs_values, rel_values)]

ax.bar_label(container=ax.containers[0], labels=lbls)

ax.set_yticklabels([RESULT_DICT[index] for index in abs_values.index])

ax.set_title("Distribution of labels in the training set")

In [None]:
f, ax = plt.subplots(figsize=(10, 10))

# Plot the number of training examples per language
sns.set_color_codes("pastel")
sns.despine()
ax = sns.countplot(data=df_train, 
                   y="language",
                   order = df_train['language'].value_counts().index)

abs_values = df_train['language'].value_counts(ascending=False)
rel_values = df_train['language'].value_counts(ascending=False, normalize=True).values * 100
lbls = [f'{p[0]} ({p[1]:.0f}%)' for p in zip(abs_values, rel_values)]

ax.bar_label(container=ax.containers[0], labels=lbls)

ax.set_title("Distribution of languages in the training set")

In [None]:
# Lengths are measured in characters, not tokens.
df_train["premise_length"] = df_train["premise"].str.len()
df_train["hypothesis_length"] = df_train["hypothesis"].str.len()
df_train[["hypothesis_length", "premise_length"]].describe()

__Data preprocessing__

Text inputs must be converted to numeric token IDs and arranged in several tensors before being fed to BERT.

The BertClassifier model can be configured with a preprocessor layer, in which case it will automatically apply preprocessing to raw inputs during fit(), predict(), and evaluate(). This is done by default when creating the model with from_preset().
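
To make the preprocessing step concrete, here is a minimal pure-Python sketch of how a premise/hypothesis pair is typically packed for a BERT-style model. The whitespace tokenizer, the toy vocabulary and the helper name `pack_pair` are assumptions for illustration only; the real KerasNLP preprocessor uses a WordPiece vocabulary and returns tensors rather than lists.

```python
# Toy vocabulary (hypothetical); a real BERT vocabulary has many thousands of WordPiece entries.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "the": 4, "cat": 5, "sleeps": 6, "is": 7, "awake": 8}


def pack_pair(premise, hypothesis, sequence_length=12):
    """Pack two sentences as [CLS] premise [SEP] hypothesis [SEP], then pad."""
    first = ["[CLS]"] + premise.lower().split() + ["[SEP]"]
    second = hypothesis.lower().split() + ["[SEP]"]
    # Naive truncation for brevity; real preprocessors trim the segments
    # themselves so that the separator tokens are preserved.
    tokens = (first + second)[:sequence_length]
    segments = ([0] * len(first) + [1] * len(second))[:sequence_length]
    n_pad = sequence_length - len(tokens)
    unk, pad = TOY_VOCAB["[UNK]"], TOY_VOCAB["[PAD]"]
    return {
        # Integer ID of every token, padded to a fixed length.
        "token_ids": [TOY_VOCAB.get(t, unk) for t in tokens] + [pad] * n_pad,
        # 0 for the premise segment (and padding), 1 for the hypothesis segment.
        "segment_ids": segments + [0] * n_pad,
        # 1 for real tokens, 0 for padding.
        "padding_mask": [1] * len(tokens) + [0] * n_pad,
    }


packed = pack_pair("The cat sleeps", "The cat is awake")
for name, values in packed.items():
    print(name, values)
```

The three keys mirror the inputs a BERT backbone expects: the token IDs, a segment marker that tells the two sentences apart, and a mask that hides the padding.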

The standard BERT presets are trained on an English corpus only, while this dataset contains several languages. That is why multilingual BERT or XLM-RoBERTa is commonly used for this competition.

Here are some multilingual models available in KerasNLP:

- bert_base_multi
- deberta_v3_base_multi
- distil_bert_base_multi
- xlm_roberta_base_multi
- xlm_roberta_large_multi

In [None]:
VALIDATION_SPLIT = 0.2
TRAIN_SIZE = int(df_train.shape[0]*(1-VALIDATION_SPLIT))
BATCH_SIZE = 16 * strategy.num_replicas_in_sync
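
As a worked example of the arithmetic in this cell, the sketch below computes the split sizes, the global batch size and the number of full batches per epoch. The row count (10,003) and the replica count (8) are made-up values, and the helper names are hypothetical; the floor division reflects the `drop_remainder=True` option used when the datasets are batched in this notebook.

```python
def split_sizes(n_rows, validation_split=0.2):
    """Return (train_size, val_size) using the same rounding as TRAIN_SIZE."""
    train_size = int(n_rows * (1 - validation_split))
    return train_size, n_rows - train_size


def full_batches(n_examples, per_replica_batch, num_replicas):
    """Return (global_batch_size, batches_kept, examples_dropped)."""
    global_batch = per_replica_batch * num_replicas
    kept = n_examples // global_batch  # incomplete final batch is discarded
    return global_batch, kept, n_examples - kept * global_batch


# Hypothetical numbers: 10,003 rows, 16 examples per replica, 8 replicas.
train_size, val_size = split_sizes(10_003)
global_batch, train_steps, train_dropped = full_batches(train_size, 16, 8)
_, val_steps, val_dropped = full_batches(val_size, 16, 8)

print("train/val sizes:", train_size, val_size)
print("global batch size:", global_batch)
print("train steps per epoch:", train_steps, "dropped:", train_dropped)
print("validation steps:", val_steps, "dropped:", val_dropped)
```

With a single replica the global batch size stays at 16, so far fewer examples are discarded per epoch.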

Here's a utility function that splits the example into an (x, y) tuple that is suitable for model.fit().

By default, keras_nlp.models.BertClassifier will tokenize and pack together raw strings using a "[SEP]" token during training.

Therefore, this label splitting is all the data preparation that we need to perform.

In [None]:
def split_labels(x, y):
    return (x[0], x[1]), y


training_dataset = (
    tf.data.Dataset.from_tensor_slices(
        (
            df_train[['premise','hypothesis']].values,
            keras.utils.to_categorical(df_train['label'], num_classes=3)
        )
    )
)

train_dataset = training_dataset.take(TRAIN_SIZE)
val_dataset = training_dataset.skip(TRAIN_SIZE)

# Use `map()` to split every sample into ((premise, hypothesis), label); tokenization happens inside the model.
# `tf.data.AUTOTUNE` and `prefetch()` are options to tune performance, see
# https://www.tensorflow.org/guide/data_performance for details.

train_preprocessed = train_dataset.map(split_labels, tf.data.AUTOTUNE).batch(BATCH_SIZE, drop_remainder=True).cache().prefetch(tf.data.AUTOTUNE)
val_preprocessed = val_dataset.map(split_labels, tf.data.AUTOTUNE).batch(BATCH_SIZE, drop_remainder=True).cache().prefetch(tf.data.AUTOTUNE)
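
The `take()`/`skip()` split above follows the row order of `train.csv`. If that order is not already random, the validation set may not be representative of the training data. Below is a minimal sketch of a seeded, shuffled split on row indices; `shuffled_split_indices` is a hypothetical helper that is not part of the notebook, and the seed value is arbitrary.

```python
import random


def shuffled_split_indices(n_rows, validation_split=0.2, seed=42):
    """Shuffle row indices reproducibly, then cut them into train and validation parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # local RNG, does not touch global state
    train_size = int(n_rows * (1 - validation_split))
    return indices[:train_size], indices[train_size:]


train_idx, val_idx = shuffled_split_indices(10)
print("train rows:", train_idx)
print("validation rows:", val_idx)
```

The resulting lists could be passed to `df_train.iloc[...]` before building the datasets. A one-line alternative with the same intent is `df_train.sample(frac=1, random_state=42)`, which shuffles the whole DataFrame before the split.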

In [None]:
# Load a BERT model.
with strategy.scope():
    classifier = keras_nlp.models.BertClassifier.from_preset("bert_base_multi", num_classes=3)

    # In distributed training, the recommendation is to scale the batch size and learning rate with the number of workers.
    classifier.compile(optimizer=keras.optimizers.Adam(1e-5*strategy.num_replicas_in_sync),
                       loss=keras.losses.CategoricalCrossentropy(from_logits=True),
                       metrics=['accuracy'])
    
    classifier.summary()
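
Because the loss is configured with `from_logits=True`, the model is expected to output unnormalized scores (logits) rather than probabilities. The sketch below shows, in plain Python, how one row of logits could be turned into a label name and a confidence value. The logit values are made up, and `softmax` and `decode` are hypothetical helpers written for this illustration.

```python
import math

RESULT_DICT = {0: "entailment", 1: "neutral", 2: "contradiction"}


def softmax(logits):
    """Convert logits to probabilities; subtracting the max keeps exp() stable."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits):
    """Return (label_name, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return RESULT_DICT[best], probs[best]


example_logits = [0.3, -1.2, 2.1]  # made-up scores for one premise/hypothesis pair
label, confidence = decode(example_logits)
print(label, round(confidence, 3))
```

For a whole batch of predictions, the same idea is usually written as an argmax over the last axis of the output array.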

__Fine-tuning BERT__

In [None]:
EPOCHS=30
history = classifier.fit(train_preprocessed,
                         epochs=EPOCHS,
                         validation_data=val_preprocessed
                        )
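
The `History` object returned by `fit()` stores per-epoch metrics in its `history` dictionary, which makes it easy to check after training which epoch generalized best. The sketch below uses a made-up dictionary in place of `history.history`; `best_epoch` is a hypothetical helper, and the numbers are illustrative, not results from this notebook.

```python
def best_epoch(history_dict, metric="val_accuracy"):
    """Return (1-based epoch number, value) of the best epoch for a metric to maximize."""
    values = history_dict[metric]
    best_index = max(range(len(values)), key=values.__getitem__)
    return best_index + 1, values[best_index]


# Made-up values standing in for `history.history`.
example_history = {
    "accuracy": [0.45, 0.62, 0.78, 0.90],
    "val_accuracy": [0.44, 0.58, 0.61, 0.59],
}

epoch, score = best_epoch(example_history)
print(f"best validation accuracy {score:.2f} at epoch {epoch}")
```

If validation accuracy peaks early, as in this toy example, training for all 30 epochs mostly adds overfitting. One option worth considering is the `keras.callbacks.EarlyStopping` callback with `monitor="val_accuracy"` and `restore_best_weights=True`.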