# 🤗 Transformers Episode 4 - Going from position 1200 to 700

We are experimenting with the Hugging Face Transformers library on the Disaster Tweets competition.
In the previous episode we reached `0.8` with 3 epochs on a small number of samples by fine-tuning BERT. This time we will try to gain 2 points and reach position 700.

* [twitch.tv/dataista0](http://twitch.tv/dataista0)
* [Kaggle competition](https://www.kaggle.com/c/nlp-getting-started/)
* [Library used](https://huggingface.co/)


In [69]:
import utils
import importlib
importlib.reload(utils)

from sklearn.model_selection import train_test_split

import numpy as np
import pandas as pd
from utils import get_submission, submit, get_tokenizer_and_model,\
                  compute_metrics, load_dfs, tokenize, get_train_args

# Make the whole library more verbose
import transformers
transformers.logging.set_verbosity_info()

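def read_scores():
    # IPython shell capture: one string per line of the Kaggle CLI output
    s = !kaggle competitions submissions nlp-getting-started
    # Skip the two header lines; keep the file name (first field) and the score (second-to-last field)
    df = pd.DataFrame([(l.split()[0], l.split()[-2]) for l in s[2:]], columns=["Archivo", "Score"]).set_index("Archivo")
    display(df)
    return df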

utils.nb_set_width()

In [70]:
pred_file_name = "episodio_4_pred_1.csv"
submit_message = "Episodio 4 Prediccion 1"

model_name = "bert-base-cased"
tokenizer, model = get_tokenizer_and_model(model_name)

df_base, df_test = load_dfs()
df_train, df_val = train_test_split(df_base, test_size=0.1)
ds_train, ds_val, ds_test = tokenize(tokenizer, df_train, df_val, df_test)

loading configuration file https://huggingface.co/bert-base-cased/resolve/main/config.json from cache at /home/dataista/.cache/huggingface/transformers/a803e0468a8fe090683bdc453f4fac622804f49de86d7cecaee92365d4a0f829.a64a22196690e0e82ead56f388a3ef3a50de93335926ccfa20610217db589307
Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.7.0.dev0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}

loading file https://huggingface.co/bert-base-cased/resolve/main/vocab.txt from cache at /home/dataista/.cache/huggingface/

In [71]:
df_val.shape

(762, 2)

In [72]:
df_train.shape

(6851, 2)
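
The two shapes follow from `test_size=0.1` applied to the 7613 labelled rows (762 + 6851). A small sketch of the arithmetic, assuming the validation share is rounded up, which matches the shapes printed above; `split_sizes` is a hypothetical helper, not part of `utils`:

```python
import math

def split_sizes(n_samples: int, test_size: float) -> tuple:
    """Return (n_train, n_val) for a fractional test_size, rounding the validation share up."""
    n_val = math.ceil(test_size * n_samples)
    return n_samples - n_val, n_val

n_train, n_val = split_sizes(6851 + 762, 0.1)
print(n_train, n_val)
```

Since the `train_test_split` call passes no `random_state`, each run draws a different split, so validation numbers from different runs are not strictly comparable. Passing a fixed `random_state` would make the split reproducible.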

In [29]:
from transformers import Trainer  # not imported in the first cell

trainer = Trainer(model=model, args=get_train_args(),
                  train_dataset=ds_train, 
                  eval_dataset=ds_val,
                  compute_metrics=compute_metrics)

trainer.train()

df_res = get_submission(trainer, ds_test, pred_file_name)
#submit(pred_file_name, submit_message)

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.5347 | 0.473811 | 0.805774 |


There is an echo
stdout=
kaggle competitions submit -f episodio_4_pred_1.csv -m "Episodio 4 Prediccion 1" nlp-getting-started

stderr=
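
`compute_metrics` lives in `utils` and is not shown in this notebook. A minimal sketch of what such a function might look like, assuming the `Trainer` hands it a `(logits, labels)` pair and that the positive class is label 1. The name `compute_metrics_sketch` and the extra F1 value are illustrative, not the author's implementation. F1 is included because the leaderboard is scored with F1, while the training output above reports accuracy:

```python
import numpy as np

def compute_metrics_sketch(eval_pred):
    """Accuracy and F1 for binary classification from (logits, labels)."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)   # predicted class per example
    labels = np.asarray(labels)

    tp = int(np.sum((preds == 1) & (labels == 1)))
    fp = int(np.sum((preds == 1) & (labels == 0)))
    fn = int(np.sum((preds == 0) & (labels == 1)))

    accuracy = float(np.mean(preds == labels))
    # F1 = 2*TP / (2*TP + FP + FN); reported as 0.0 when the denominator is zero.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"accuracy": accuracy, "f1": f1}
```

Tracking F1 on the validation set would make local numbers easier to compare with the leaderboard score.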



In [33]:
submit(pred_file_name, submit_message)

stdout=
Successfully submitted to Natural Language Processing with Disaster Tweets
stderr=
  0%|          | 0.00/22.2k [00:00<?, ?B/s]100%|██████████| 22.2k/22.2k [00:00<00:00, 120kB/s]



In [54]:
df_scores = read_scores()

| Archivo | Score |
|---|---|
| episodio_4_pred_1.csv | 0.81274 |
| bert_first_submission.csv | 0.80049 |
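
`read_scores` relies on the IPython `!` shell capture, so it only works inside a notebook. A rough standard-library equivalent, assuming the `kaggle` CLI is on the PATH and that the score is still the second-to-last whitespace-separated field, as in the notebook version; `parse_submissions` and `read_scores_cli` are hypothetical names:

```python
import subprocess

COMPETITION = "nlp-getting-started"

def parse_submissions(lines):
    """Return {file_name: score_string} from the CLI table, skipping its two header lines."""
    scores = {}
    for line in lines[2:]:
        fields = line.split()
        if len(fields) < 2:      # ignore blank or malformed lines
            continue
        scores[fields[0]] = fields[-2]
    return scores

def read_scores_cli():
    """Run the Kaggle CLI and parse its output (needs configured Kaggle credentials)."""
    result = subprocess.run(
        ["kaggle", "competitions", "submissions", COMPETITION],
        capture_output=True, text=True, check=True,
    )
    return parse_submissions(result.stdout.splitlines())
```

Scores are kept as strings, as in the notebook, and can be converted with `float` when needed.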


# Improvement since episode 3

* +1.22 F1 points on the leaderboard (0.80049 to 0.81274)
* 388 positions higher (1220 to 832)

Only change: from 4000 samples to 6851, with a single epoch (10 minutes, 10 GB of GPU memory)

In [61]:
round(df_scores.astype(float).diff(-1).iloc[0, 0] * 100, 2)

1.22

In [62]:
1220 - 832

388

# Next steps


**Ideas**
* More samples
* More epochs
* Other models
* Better preprocessing
* Understand more of BERT's parameters
* Checkpoints are essential!
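
As one possible starting point for the preprocessing idea, here is a minimal sketch that normalizes URLs, user mentions, HTML entities and whitespace before tokenization. The function name and the placeholder tokens are assumptions, and whether this helps BERT on this dataset is untested here:

```python
import html
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
SPACE_RE = re.compile(r"\s+")

def clean_tweet(text: str) -> str:
    """Normalize a tweet: unescape HTML entities, mask URLs and mentions, collapse spaces."""
    text = html.unescape(text)            # e.g. "&amp;" becomes "&"
    text = URL_RE.sub("URL", text)        # replace links with a placeholder
    text = MENTION_RE.sub("@USER", text)  # replace user handles with a placeholder
    return SPACE_RE.sub(" ", text).strip()

print(clean_tweet("Fire near @bob   &amp; smoke http://t.co/abc"))
# Fire near @USER & smoke URL
```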


## Checkpoints

* `load_best_model_at_end`
* `save_strategy`
* `save_steps`

### Parameters of [TrainingArguments](https://huggingface.co/transformers/v3.0.2/main_classes/trainer.html#trainingarguments)

```
output_dir (str) – The output directory where the model predictions and checkpoints will be written.

overwrite_output_dir (bool, optional, defaults to False) – If True, overwrite the content of the output directory. Use this to continue training if output_dir points to a checkpoint directory.

save_steps (int, optional, defaults to 500) – Number of updates steps before two checkpoint saves.

```
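
A minimal sketch of how these settings could be combined so that a checkpoint is written every 500 steps and the best one is restored at the end of training. `get_train_args` from `utils` is not shown, so this is not its actual content: the `output_dir` value and the helper name are assumptions, and the keyword names are those of the 4.x releases (newer releases rename `evaluation_strategy` to `eval_strategy`). The import is done lazily so the dictionary can be inspected without `transformers` installed:

```python
CHECKPOINT_KWARGS = {
    "output_dir": "data",              # assumption: folder that receives checkpoint-<step> dirs
    "num_train_epochs": 1,
    "per_device_train_batch_size": 8,
    "evaluation_strategy": "steps",    # evaluate on the same schedule as saving
    "eval_steps": 500,
    "save_strategy": "steps",
    "save_steps": 500,
    "save_total_limit": 2,             # cap the number of checkpoints kept on disk
    "load_best_model_at_end": True,    # reload the best checkpoint when training ends
    "metric_for_best_model": "accuracy",
}

def build_train_args():
    """Build TrainingArguments from the dictionary above (hypothetical helper)."""
    from transformers import TrainingArguments
    return TrainingArguments(**CHECKPOINT_KWARGS)
```

The evaluation and save schedules are kept identical because picking the best model only makes sense if every saved checkpoint also has an evaluation score.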

Parameter of [Trainer.train()](https://huggingface.co/transformers/main_classes/trainer.html#transformers.Trainer.train)
> _resume_from_checkpoint_ (str or bool, optional) – If a str, local path to a saved checkpoint as saved by a previous instance of Trainer. If a bool and equals True, load the last checkpoint in args.output_dir as saved by a previous instance of Trainer. If present, training will resume from the model/optimizer/scheduler states loaded here.
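
With `resume_from_checkpoint=True`, the trainer has to locate the most recent `checkpoint-<step>` folder inside `output_dir`. The library ships its own helper for this (`get_last_checkpoint` in `transformers.trainer_utils`); the following is an independent standard-library sketch of the idea, with `find_last_checkpoint` as a hypothetical name:

```python
import os
import re
import tempfile

CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")

def find_last_checkpoint(output_dir):
    """Return the path of the checkpoint-<step> folder with the highest step, or None."""
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = CHECKPOINT_RE.match(name)
        path = os.path.join(output_dir, name)
        # Compare steps numerically: "checkpoint-90" sorts after "checkpoint-1000" as text.
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path

# Demo on a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    for step in (500, 1000, 90):
        os.mkdir(os.path.join(tmp, f"checkpoint-{step}"))
    last = os.path.basename(find_last_checkpoint(tmp))
    print(last)  # checkpoint-1000
```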

In [66]:
trainer.train(resume_from_checkpoint=True)

Loading model from /home/dataista/git/twitch-streams/data/checkpoint-500).
***** Running training *****
  Num examples = 6851
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 857
  Continuing training from checkpoint, will skip to saved global_step
  Continuing training from epoch 0
  Continuing training from global step 500
  Will skip the first 0 epochs then the first 500 batches in the first epoch. If this takes a lot of time, you can add the `--ignore_data_skip` flag to your launch command, but you will resume the training on data already seen by your model.






| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.5347 | 0.492033 | 0.790026 |


***** Running Evaluation *****
  Num examples = 762
  Batch size = 8


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=857, training_loss=0.17376625301401144, metrics={'train_runtime': 297.6317, 'train_samples_per_second': 23.018, 'train_steps_per_second': 2.879, 'total_flos': 2567041886668800.0, 'train_loss': 0.17376625301401144, 'epoch': 1.0})
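
The step counts in this log can be reproduced by hand. A short sketch of the arithmetic, using the numbers printed above and the default `save_steps` of 500:

```python
import math

num_examples = 6851   # "Num examples" in the log
batch_size = 8        # "Total train batch size" in the log
save_steps = 500      # default checkpoint interval
epochs = 1

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs
# Steps at which a checkpoint is written during the run
checkpoint_steps = list(range(save_steps, total_steps + 1, save_steps))
# Steps that still have to be trained after resuming from the last checkpoint
remaining_after_resume = total_steps - checkpoint_steps[-1]

print(total_steps, checkpoint_steps, remaining_after_resume)  # 857 [500] 357
```

So a one-epoch run produces a single checkpoint at step 500, and resuming from it skips the first 500 batches and trains only the last 357.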

In [68]:
trainer.save_state()