# Huggingface GPT-2 Spanish

## Authors
The model was trained by Jorge Ortiz Fuentes (Chile) and Alejandro Oñate Latorre (Spain), members of [DeepESP](https://t.me/joinchat/VoEp1bPrDYEexc6h), an open-source community on Natural Language Processing in Spanish. Thanks to the community members who helped fund the initial tests.

Colab developed by Jorge Ortiz Fuentes and [Mathias Gatti](http://mathigatti.com/).

## Cautions
The model generates text according to the patterns learned from the training corpus. This data was not filtered, so the model may generate offensive or discriminatory content.





# 1. Try the model

In [None]:
#@title ⇠ Download and load GPT-2-spanish (Takes a few minutes)
from IPython.display import clear_output
!pip install git+https://github.com/huggingface/transformers
clear_output()
from transformers import pipeline
model = "flax-community/gpt-2-spanish" #@param ['DeepESP/gpt2-spanish-medium', 'DeepESP/gpt2-spanish', "flax-community/gpt-2-spanish"]
generator2 = pipeline('text-generation', model=model, device=0)

The module name flax_hyphen_community/gpt_hyphen_2_hyphen_spanish (originally flax-community/gpt-2-spanish) is not a valid Python identifier. Please rename the original module to avoid import issues.
Device set to use cuda:0


In [None]:
# Write here how you want the text to start
initial_text = '''La puerta del faro se abrió una noche sin viento.'''

# Change this if you want the final text to be shorter or longer.
# If you want GPT-2 to generate a long text it will take more time.
max_length = 200

generated_text = generator2(initial_text, do_sample=True, pad_token_id=50256, max_length=max_length)[0]["generated_text"]
print(generated_text)

Both `max_new_tokens` (=256) and `max_length`(=200) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


La puerta del faro se abrió una noche sin viento. Se abrió una puerta, y la casa se encendió sin saber que iba a pasar.
Los niños le preguntaban que qué estaba haciendo en la casa, ya que hacía mucho que no veía una sola luz.
El niño que estaba escondido en el interior de la casa, no sabía que hacer. El niño que se encontraba escondido, se había hecho daño.
El niño que estaba escondido, se puso la mano alrededor de la cabeza, y se volvió a la casa para preguntar a los niños de la casa, por qué no estaba allí.
El niño que se encontraba escondido, le preguntó a los niños que estaban en la casa, por qué no estaba allí.
El niño que estaba escondido, le dijo a los niños de la casa, que el niño que estaba escondido, no entendía porque no estaba allí.
El niño que estaba escondido, le dijo a los niños de la casa, que el niño que estaba escondido, no entendía porque no estaba allí.
El niño que estaba escondido, se acercó a la ventana, y se quedó mirando la ventana, sin saber qué hacer, pues se 
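
The warning above says that `max_new_tokens` takes precedence when both limits are set. The cell only passes `max_length`, so we assume the value 256 comes from a default. As a minimal sketch of how the two limits relate (this is not the library's code; `new_token_budget` and the prompt length of 12 tokens are hypothetical), `max_length` counts prompt plus generated tokens, while `max_new_tokens` counts generated tokens only:

```python
def new_token_budget(prompt_tokens, max_length=None, max_new_tokens=None):
    """Return how many tokens may be generated (hypothetical helper)."""
    # max_new_tokens wins when both are given, as the warning states
    if max_new_tokens is not None:
        return max_new_tokens
    # max_length includes the prompt, so the prompt length is subtracted
    if max_length is not None:
        return max(0, max_length - prompt_tokens)
    return None  # no explicit limit given


prompt_tokens = 12  # assumed prompt length, for illustration only
both_set = new_token_budget(prompt_tokens, max_length=200, max_new_tokens=256)
only_max_length = new_token_budget(prompt_tokens, max_length=200)
print(both_set, only_max_length)
```

Under this reading, passing `max_new_tokens` in the generator call is the direct way to control the output length.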

# 2. Fine-tune model

In this tutorial, we fine-tune a Spanish GPT-2 from the [Huggingface model hub](https://huggingface.co/models). As training data, we use [this horoscope dataset](https://gist.github.com/mathigatti/bb1045ffb48377e4557eac42e54a34db/).

In [None]:
#@title  ⬅ Run this cell to restart the notebook in order to free memory space
#@markdown It's going to look like an error but that's fine
import os
os.kill(os.getpid(), 9)

In [None]:
#@title ⬅ Run this cell to install huggingface and download a sample dataset
!pip install git+https://github.com/huggingface/transformers
#!wget https://gist.githubusercontent.com/mathigatti/bb1045ffb48377e4557eac42e54a34db/raw/90c11844ab8a962a15e7a3705cb1a916ea2e925e/horoscopo.txt


The next step is to download the tokenizer. We load it from the checkpoint selected in the cell below; [`gpt2-spanish-medium`](https://huggingface.co/DeepESP/gpt2-spanish-medium) on Huggingface is one of the options. The tokenizer should match the model that is fine-tuned later.

In [None]:
from transformers import AutoTokenizer

# GPT-2 Small model
model = "flax-community/gpt-2-spanish" #@param ['DeepESP/gpt2-spanish-medium', 'DeepESP/gpt2-spanish', "flax-community/gpt-2-spanish"]

tokenizer = AutoTokenizer.from_pretrained(model)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



# Prepare the dataset and build a ``TextDataset``

The next step is to build a `TextDataset`. The `TextDataset` is an implementation of the [PyTorch `Dataset` class](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html#dataset-class) provided by the transformers library. If you want to know more about `Dataset` in PyTorch, you can check out this [YouTube video](https://www.youtube.com/watch?v=PXOzkkB5eH0&ab_channel=PythonEngineer).
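
To make the role of `block_size` concrete, here is a simplified sketch of the chunking step. It assumes the dataset tokenizes the whole file into one stream, cuts it into non-overlapping blocks, and drops an incomplete tail; `chunk_into_blocks` is a hypothetical helper, not part of transformers, and plain integers stand in for token ids.

```python
def chunk_into_blocks(token_ids, block_size=128):
    """Cut a token stream into non-overlapping blocks of block_size tokens.

    A trailing remainder shorter than block_size is dropped (assumed behavior).
    """
    last_start = len(token_ids) - block_size + 1
    return [token_ids[i:i + block_size] for i in range(0, last_start, block_size)]


token_ids = list(range(300))           # stand-in for a tokenized text file
blocks = chunk_into_blocks(token_ids)  # 300 tokens -> 2 full blocks of 128
dropped = len(token_ids) - sum(len(b) for b in blocks)
print(len(blocks), dropped)
```

Each block becomes one training example, so with a batch size of 1 the number of blocks equals the number of optimizer steps per epoch.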

In [None]:
!pip uninstall -y wandb

Found existing installation: wandb 0.21.4
Uninstalling wandb-0.21.4:
  Successfully uninstalled wandb-0.21.4


In [None]:
from transformers import TextDataset, DataCollatorForLanguageModeling

def load_dataset(train_path, test_path, tokenizer):
    # Each dataset cuts its text file into blocks of 128 tokens
    train_dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=train_path,
        block_size=128)

    test_dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=test_path,
        block_size=128)

    # mlm=False selects causal (next-token) language modeling instead of masked LM
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=False,
    )
    return train_dataset, test_dataset, data_collator


train_path = 'train.txt'
# Just a dummy example, we should use a different dataset for testing
test_path = 'test.txt'

train_dataset,test_dataset,data_collator = load_dataset(train_path,test_path,tokenizer)
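
The cell above expects `train.txt` and `test.txt` to exist already. One possible way to produce them from a single text file is sketched below; `split_corpus`, the 10% test fraction, and the file name `corpus.txt` are assumptions for illustration. The demo writes into a temporary directory so it does not touch the notebook's files.

```python
import random
import tempfile
from pathlib import Path


def split_corpus(source_path, train_path, test_path, test_fraction=0.1, seed=0):
    """Shuffle non-empty lines and write a train/test split (hypothetical helper)."""
    text = Path(source_path).read_text(encoding="utf-8")
    lines = [line for line in text.splitlines() if line.strip()]
    random.Random(seed).shuffle(lines)  # fixed seed keeps the split reproducible
    n_test = max(1, int(len(lines) * test_fraction))
    test_lines, train_lines = lines[:n_test], lines[n_test:]
    Path(train_path).write_text("\n".join(train_lines) + "\n", encoding="utf-8")
    Path(test_path).write_text("\n".join(test_lines) + "\n", encoding="utf-8")
    return len(train_lines), len(test_lines)


# Demo on a toy corpus of 20 lines
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "corpus.txt"
    source.write_text("\n".join(f"línea {i}" for i in range(20)), encoding="utf-8")
    n_train, n_test = split_corpus(source, Path(tmp) / "train.txt", Path(tmp) / "test.txt")
    print(n_train, n_test)
```

Splitting by line assumes that each line is an independent sample; for running text, a split by document or paragraph may be more appropriate.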

# Initialize `Trainer` with `TrainingArguments` and GPT-2 model

The [Trainer](https://huggingface.co/transformers/main_classes/trainer.html#transformers.Trainer) class provides an API for feature-complete training. It is used in most of the [example scripts](https://huggingface.co/transformers/examples.html) from Huggingface. Before we can instantiate our `Trainer`, we need to download our GPT-2 model and create a [TrainingArguments](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments) object, which gives access to all the points of customization during training. In the `TrainingArguments`, we define the hyperparameters for the training process, such as `learning_rate`, `num_train_epochs`, or `per_device_train_batch_size`. You can find a complete list [here](https://huggingface.co/transformers/main_classes/trainer.html#trainingarguments).

In [None]:
from transformers import Trainer, TrainingArguments, AutoModelForCausalLM

# `model` holds the checkpoint name until this line; afterwards it is the loaded model.
# AutoModelForCausalLM replaces the deprecated AutoModelWithLMHead.
model = AutoModelForCausalLM.from_pretrained(model)

training_args = TrainingArguments(
    output_dir="./gpt2-gerchef",  # the output directory
    overwrite_output_dir=True,  # overwrite the content of the output directory
    num_train_epochs=3,  # number of training epochs
    per_device_train_batch_size=1,  # batch size for training
    per_device_eval_batch_size=1,  # batch size for evaluation
    eval_steps=400,  # update steps between two evaluations (only used with a step-based evaluation strategy)
    save_steps=800,  # a checkpoint is saved every 800 steps
    warmup_steps=500,  # number of warmup steps for the learning rate scheduler
    prediction_loss_only=True,
    )

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)




The module name flax_hyphen_community/gpt_hyphen_2_hyphen_spanish (originally flax-community/gpt-2-spanish) is not a valid Python identifier. Please rename the original module to avoid import issues.
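
The effect of `warmup_steps=500` can be sketched in plain Python. This assumes the defaults that the cell leaves untouched, namely a linear scheduler and a base learning rate of 5e-5; `linear_warmup_lr` is a hypothetical helper, and the total of 102108 steps is taken from the training output of this run.

```python
def linear_warmup_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=102108):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        # Warmup: ramp up from 0 to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay: go down linearly from base_lr to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 250, 500, 51304, 102108):
    print(step, linear_warmup_lr(step))
```

With only 500 warmup steps out of roughly 100,000, the schedule is almost entirely a linear decay.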


# Train and save the model

To train the model we can simply run `Trainer.train()`. It can take a few hours, depending on the size of your dataset and the number of epochs you chose.

In [None]:
trainer.train()

`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


| Step | Training Loss |
|------|---------------|
| 500 | 3.5402 |
| 1000 | 3.3928 |
| 1500 | 3.3563 |
| 2000 | 3.3185 |
| 2500 | 3.3129 |
| 3000 | 3.3304 |
| 3500 | 3.3188 |
| 4000 | 3.3185 |
| 4500 | 3.3178 |
| 5000 | 3.306 |


TrainOutput(global_step=102108, training_loss=3.097180782737587, metrics={'train_runtime': 3423.4867, 'train_samples_per_second': 29.826, 'train_steps_per_second': 29.826, 'total_flos': 6670001700864000.0, 'train_loss': 3.097180782737587, 'epoch': 3.0})
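
The numbers in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation, so that one optimizer step processes exactly one block of 128 tokens; all variable names are ours.

```python
global_step = 102108      # from TrainOutput
num_train_epochs = 3      # from TrainingArguments
batch_size = 1            # per_device_train_batch_size
block_size = 128          # from TextDataset
train_runtime = 3423.4867 # seconds, from TrainOutput

steps_per_epoch = global_step // num_train_epochs
examples_per_epoch = steps_per_epoch * batch_size
tokens_per_epoch = examples_per_epoch * block_size
steps_per_second = global_step / train_runtime

print("blocks per epoch:", examples_per_epoch)
print("tokens per epoch:", tokens_per_epoch)
print("steps per second:", round(steps_per_second, 3))
```

With a batch size of 1, steps per second and samples per second coincide, which matches the two identical throughput values in the output.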

After training is done you can save the model by calling `save_model()`. This will save the trained model to our `output_dir` from our `TrainingArguments`.

In [None]:
trainer.save_model()

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "./gpt2-final"

# Save ONLY the essentials
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

('./gpt2-final/tokenizer_config.json',
 './gpt2-final/special_tokens_map.json',
 './gpt2-final/vocab.json',
 './gpt2-final/merges.txt',
 './gpt2-final/added_tokens.json',
 './gpt2-final/tokenizer.json')

In [None]:
!zip -r model_ft.zip ./gpt2-final

  adding: gpt2-final/ (stored 0%)
  adding: gpt2-final/tokenizer_config.json (deflated 79%)
  adding: gpt2-final/vocab.json (deflated 60%)
  adding: gpt2-final/special_tokens_map.json (deflated 52%)
  adding: gpt2-final/added_tokens.json (stored 0%)
  adding: gpt2-final/config.json (deflated 51%)
  adding: gpt2-final/model.safetensors (deflated 7%)
  adding: gpt2-final/generation_config.json (deflated 24%)
  adding: gpt2-final/tokenizer.json (deflated 82%)
  adding: gpt2-final/merges.txt (deflated 57%)


In [None]:
from google.colab import files
files.download("model_ft.zip")


# Test the model

To test the model we are going to use another [highlight of the transformers library](https://huggingface.co/transformers/main_classes/pipelines.html?highlight=pipelines) called `pipeline`. [Pipelines](https://huggingface.co/transformers/main_classes/pipelines.html?highlight=pipelines) are objects that offer a simple API dedicated to several tasks, including `text-generation`.

In [None]:
from transformers import pipeline
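# Pass the tokenizer object: after the training cell, `model` holds the loaded network, not a checkpoint name
horoscope_generator = pipeline('text-generation', model='./gpt2-gerchef', tokenizer=tokenizer, device=0)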

Device set to use cuda:0


In [None]:
from transformers import AutoTokenizer, pipeline

model_path = "./gpt2-gerchef"
tokenizer = AutoTokenizer.from_pretrained(model_path)

generator = pipeline(
    "text-generation",
    model=model_path,        # <- path as a string
    tokenizer=tokenizer,
    device=0
)

Device set to use cuda:0


In [None]:
generated_text = generator('Un antiguo mapa con tinta borroneada apareció en la mesa del despacho.', pad_token_id=50256)[0]['generated_text']
print(generated_text)

Un antiguo mapa con tinta borroneada apareció en la mesa del despacho. Le
pregunté en qué consistía y me contestó que era en un informe, por lo
que tenía que ir a buscar el papel, y que, si no lo encontraba, le
echaba una ojeada a ver si era un documento que había en el fondo del
archivo.

--¿Qué es esto?--le pregunté.--Es un informe que se me ha olvidado
y, como no lo encontré en ningún otro lugar, he pensado que era un
papel. Es un documento que he perdido.

--¿Es un documento?--replicó.--No.

--¿De dónde ha salido?

--De una reunión de accionistas que se celebró en un café.

--¿Cómo ha sido el primer día del año?

--No recuerdo.

--¿Quién era, al parecer, el más guapo, el más elegante?--pregunté.

--El día de la reunión era el primero de enero.

--¿Y qué había sido de mí?

--¿Cómo he podido conocer a todo el mundo?

--Nada.

--¿


In [None]:
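import math

# `trainer` must still exist from the training cell; in a fresh session this raises a NameError
eval_results = trainer.evaluate()
print(eval_results)
# Perplexity is the exponential of the mean cross-entropy loss
ppl = math.exp(eval_results["eval_loss"])
print("Perplexity:", ppl)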

NameError: name 'trainer' is not defined
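
The evaluation cell fails here because `trainer` no longer exists in the session. The perplexity formula itself can still be illustrated without it. The sketch below uses the final training loss reported by `trainer.train()` as a stand-in; this is an assumption for illustration only, since a real evaluation would use `eval_loss` on held-out data.

```python
import math


def perplexity(mean_cross_entropy_loss):
    """Perplexity is the exponential of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy_loss)


train_loss = 3.097180782737587  # from TrainOutput; stand-in for eval_loss
train_ppl = perplexity(train_loss)
print("Perplexity (training loss):", round(train_ppl, 2))
```

A perplexity computed from the training loss tends to look better than one computed on unseen text, so it should not be reported as an evaluation result.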