# Overview

**Objective:** Generate text that replicates the stylistic attributes of Gustavo Adolfo Bécquer's books.

**Input:** Dataset for finetuning (concatenated books).

*   [Legends, Tales, and Poems](https://www.gutenberg.org/ebooks/10814)
*   [Obras escogidas](https://www.gutenberg.org/ebooks/53552)

**Expected Output:** Finetuned model (.pt file).


This code includes:

*   Downloading and preprocessing datasets from gutenberg.org
*   Tokenizing and finetuning [gpt2-spanish](https://huggingface.co/DeepESP/gpt2-spanish?text=Quisiera+saber+que+va+a+suceder)
*   Saving the output model to Google Drive



Follow the code below or [go straight to the results](#conclusion)

# Basic imports

In [None]:
!pip install transformers datasets torch
!pip install transformers[torch]
!pip install accelerate -U

In [None]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel, TextDataset, DataCollatorForLanguageModeling, Trainer, TrainingArguments

In [None]:
import numpy as np
import tensorflow as tf
import os
import re

# Data cleaning and preprocessing

In [None]:
files_gutemberg = {
    "Leyendas, cuentos y poemas": { "file_name":"leyendas_cuentos_poemas.txt","url":"https://www.gutenberg.org/files/10814/10814-8.txt","encoding":"ISO-8859-1" },
    "Obras escogidas": { "file_name":"obras_escogidas.txt","url":"https://www.gutenberg.org/files/53552/53552-8.txt","encoding":"ISO-8859-1" }
    }

## Doc 1: Leyendas, cuentos y poemas

In [None]:
path = tf.keras.utils.get_file("Leyendas, cuentos y poemas", "https://www.gutenberg.org/files/10814/10814-8.txt")
text = open(path, "rb").read().decode(encoding="ISO-8859-1")
print(f"Leyendas, cuentos y poemas")
print(f"Length of text: {len(text):,} characters")
print()


Downloading data from https://www.gutenberg.org/files/10814/10814-8.txt
Leyendas, cuentos y poemas
Length of text: 698,992 characters



In [None]:
# Remove anything before the first sentence of the book.
sequence = "(_Cartas Literarias_)"

# Split the string based on the sequence
split_result = text.split(sequence, 1) # splits just 1 time

# Check if the sequence was found in the string
if len(split_result) > 1:
    result_string = split_result[1].lstrip()
else:
    result_string = text

In [None]:
# Remove anything after the last sentence of the book.
sequence = "Á quienes no conozco!"

# Split the string based on the sequence
split_result = result_string.split(sequence, 1) # splits just one time

# Keep the text before the sequence. If the sequence is not found,
# leave result_string as it is so the earlier trimming is not undone.
if len(split_result) > 1:
    result_string = split_result[0]

In [None]:
# Remove footnotes in English (text inside brackets).

# Use regular expressions to remove text inside brackets
cleaned_document = re.sub(r'\[.*?\]', '', result_string, flags=re.DOTALL)

# Print the cleaned document
print(cleaned_document)
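
The bracket removal above relies on two details of the pattern: the non-greedy `.*?` and the `re.DOTALL` flag. The sketch below shows both on a made-up sentence. The helper name `strip_bracketed` and the sample text are hypothetical and do not come from the books.

```python
import re

def strip_bracketed(text: str) -> str:
    """Remove every [...] span, including spans that run across line breaks."""
    # '.*?' is non-greedy, so each match stops at the first closing bracket.
    # re.DOTALL lets '.' match newlines, which multi-line footnotes need.
    return re.sub(r"\[.*?\]", "", text, flags=re.DOTALL)

# Hypothetical sample, not taken from the books.
sample = "Era de noche[Footnote 1: It was\nnight.] y llovía [2]mucho."
print(strip_bracketed(sample))
```

Without `re.DOTALL`, the footnote that spans two lines would be left in place, because `.` does not match a newline by default. Note also that an unmatched `[` would make the pattern remove everything up to the next `]`, so it is worth checking the cleaned text by eye.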


In [None]:
type(cleaned_document)

str

In [None]:
# Get the current working directory
current_directory = os.getcwd()

# Print the current working directory
print("Current working directory:", current_directory)

Current working directory: /content


In [None]:
# Save file in the working directory

# Specify the file path where you want to save the file
file_path = "gutemberg_cuentos_leyendas.txt"

# Open the file in write mode ('w') with an explicit encoding so accented characters are written consistently
with open(file_path, 'w', encoding='utf-8') as file:
    # Write the string to the file
    file.write(cleaned_document)

# The file is automatically closed when the 'with' block exits


## Doc 2: Obras escogidas

In [None]:
path = tf.keras.utils.get_file("Obras escogidas", "https://www.gutenberg.org/files/53552/53552-8.txt")
text = open(path, "rb").read().decode(encoding="ISO-8859-1")
print(f"Obras escogidas")
print(f"Length of text: {len(text):,} characters")
print()


Downloading data from https://www.gutenberg.org/files/53552/53552-8.txt
Obras escogidas
Length of text: 585,523 characters



In [None]:
# Original text with new lines
original_text = "el monumento erigido al\npoeta."

# Convert new lines to ISO-8859-1 encoding with appropriate newline character
iso_8859_1_text = original_text.encode('iso-8859-1').replace(b'\n', b'\r\n')
  # replace(b'\n', b'\r\n') replaces the newline character (b'\n') with the Windows-style newline character (b'\r\n').
  # Adjust the replacement value according to the desired newline character format.

# Print the encoded text
print(iso_8859_1_text)


b'el monumento erigido al\r\npoeta.'


In [None]:
substring = "el monumento erigido al\r\npoeta."

if substring in text:
    print("Substring found!")
else:
    print("Substring not found.")

Substring found!


In [None]:
# Remove anything before the first sentence of the book.
sequence = "el monumento erigido al\r\npoeta."

# Split the string based on the sequence
split_result = text.split(sequence, 1) # splits just 1 time

# Check if the sequence was found in the string
if len(split_result) > 1:
    result_string = split_result[1].lstrip()
else:
    result_string = text

In [None]:
print(result_string)

In [None]:
# Remove anything after the last sentence of the book.
sequence = "creo que le sucedería lo mismo." # Not including the 'Rimas' section.

# Split the string based on the sequence
split_result = result_string.split(sequence, 1) # splits just one time

# Keep the text before the sequence. If the sequence is not found,
# leave result_string as it is so the earlier trimming is not undone.
if len(split_result) > 1:
    result_string = split_result[0]



In [None]:
print(result_string)
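
Both books are trimmed the same way: drop everything up to a start marker, then drop everything from an end marker onward. A minimal sketch of a reusable helper follows. The name `trim_between` and the `<<INICIO>>`/`<<FIN>>` markers are hypothetical and only serve the illustration.

```python
def trim_between(text: str, start_marker: str, end_marker: str) -> str:
    """Return the text after start_marker and before end_marker.

    A missing marker leaves that side of the text untouched.
    """
    # partition() returns (before, marker, after); marker is "" when not found.
    _, found_start, after_start = text.partition(start_marker)
    body = after_start.lstrip() if found_start else text

    # Keep only what comes before the end marker (the marker itself is dropped).
    before_end, _, _ = body.partition(end_marker)
    return before_end

# Hypothetical example with made-up markers.
raw = "licencia <<INICIO>>  Había una vez. <<FIN>> pie de página"
print(repr(trim_between(raw, "<<INICIO>>", "<<FIN>>")))
```

As in the cells above, the end marker is excluded from the result, so when the marker is the book's last sentence, that sentence is dropped too.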

In [None]:
# Remove editorial comments (in brackets).

# Use regular expressions to remove text inside brackets
cleaned_document_2 = re.sub(r'\[.*?\]', '', result_string, flags=re.DOTALL)


In [None]:
print(cleaned_document_2)

In [None]:
# Remove editorial underscores that denote italics, keep text in between.

# Remove underscores using str.replace() method
cleaned_document_2 = cleaned_document_2.replace("_", "")

# Print the cleaned string
print(cleaned_document_2)


In [None]:
type(cleaned_document_2)

str

In [None]:
# Get the current working directory
current_directory = os.getcwd()

# Print the current working directory
print("Current working directory:", current_directory)

Current working directory: /content


In [None]:
# Save file in the working directory to get path

# Specify the file path where you want to save the file
file_path = "gutemberg_obras_escogidas.txt"

# Open the file in write mode ('w') with an explicit encoding so accented characters are written consistently
with open(file_path, 'w', encoding='utf-8') as file:
    # Write the string to the file
    file.write(cleaned_document_2)

# The file is automatically closed when the 'with' block exits


## Join both datasets

In [None]:
# Two strings to join
string1 = cleaned_document
string2 = cleaned_document_2

# Join the strings together
joined_string = string1 + string2

# Specify the file path where you want to create the text file
file_path = "gutemberg_joined.txt"

# Open the file in write mode with an explicit encoding and write the joined string
with open(file_path, 'w', encoding='utf-8') as file:
    file.write(joined_string)

print("Text file created successfully.")


Text file created successfully.


In [None]:
print('The total length of both documents is {} characters.'.format(len(joined_string)))

The total length of both documents is 750699 characters.


# Finetuning gpt2-spanish

## Set up Google Drive to save the model

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

In [None]:
!ls /content/gdrive

In [None]:
import os
DATADIR = os.path.join('/content/gdrive', 'My Drive', 'XXX', 'XXX', 'XXX') # Replace with your Gdrive folder path

In [None]:
DATADIR

## About the model

[**DeepESP/gpt2-spanish**](https://huggingface.co/DeepESP/gpt2-spanish?text=Quisiera+saber+que+va+a+suceder)

License: [MIT](https://choosealicense.com/licenses/mit/)

Why is it a good fit for this project?

* It was trained on literary text, the same domain as our author's writing.
* The tokenizer was trained from scratch on a Spanish corpus, so it captures the morphosyntactic differences between English and Spanish.

## Finetuning

In [None]:
# Load model directly with Hugging Face Auto Classes
from transformers import AutoTokenizer, AutoModelForCausalLM

In [None]:
# Constants
MODEL_NAME = "DeepESP/gpt2-spanish"  # Full Hugging Face Hub identifier
TRAIN_FILE = "gutemberg_joined.txt"
TOKENIZER_FILE = "tokenizer.json"
OUTPUT_DIR = DATADIR # Going to Gdrive

In [None]:
# Load the tokenizer and model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)


In [None]:
# Prepare the dataset
train_dataset = TextDataset(
    tokenizer = tokenizer,
    file_path = TRAIN_FILE,
    block_size = 128
)
data_collator = DataCollatorForLanguageModeling(
    tokenizer = tokenizer,
    mlm = False
)
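
To make `block_size = 128` more concrete, here is a simplified sketch of fixed-length chunking on a toy list of token ids. It assumes that no special tokens are added and that a trailing partial block is discarded. The function name `chunk_token_ids` is hypothetical, and the library's actual behaviour may differ between versions.

```python
def chunk_token_ids(token_ids, block_size):
    """Split a flat list of token ids into consecutive blocks of block_size."""
    blocks = []
    # Step through the ids one full block at a time; a trailing partial block is dropped.
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        blocks.append(token_ids[start:start + block_size])
    return blocks

toy_ids = list(range(10))  # stand-in for the tokenized book text
print(chunk_token_ids(toy_ids, block_size=4))
```

Each block becomes one training example, so the number of examples is roughly the token count divided by the block size.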


In [None]:
# Define training arguments and start fine-tuning
training_args = TrainingArguments(
    output_dir = OUTPUT_DIR,
    overwrite_output_dir = False,
    per_device_train_batch_size = 8,
    num_train_epochs = 1,
    save_steps = 10_000,
    save_total_limit = 2,
)

trainer = Trainer(
    model = model,
    args = training_args,
    data_collator = data_collator,
    train_dataset = train_dataset,
)

trainer.train()
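
A quick way to sanity-check `save_steps` is to estimate how many optimizer steps one epoch takes. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch is kept. The block count is a hypothetical number, not one measured in this notebook.

```python
import math

def optimizer_steps(num_blocks, batch_size, epochs=1):
    """Estimate optimizer steps: one step per batch, rounded up per epoch."""
    steps_per_epoch = math.ceil(num_blocks / batch_size)
    return steps_per_epoch * epochs

hypothetical_blocks = 1_500  # assumed number of 128-token training examples
print(optimizer_steps(hypothetical_blocks, batch_size=8))  # 188 under these assumed numbers
```

If the estimate stays well below `save_steps = 10_000`, training would likely finish before any step-based checkpoint is written, so the explicit save after training is what preserves the weights.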

## Save finetuned model

From [How to save our model to google Drive and reuse it](https://medium.com/@ml_kid/how-to-save-our-model-to-google-drive-and-reuse-it-2c1028058cb2)

In [None]:
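model_save_name = 'gpt2-spanish_Becquer-joined.pt'
# Build the full file path inside the Drive folder; torch.save needs a file path, not a directory
path = os.path.join(DATADIR, model_save_name)
torch.save(model.state_dict(), path)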

## Use model for text generation

In [None]:
# Save finetuning files
model.save_pretrained(OUTPUT_DIR)

print(f"Model fine-tuned and saved to {OUTPUT_DIR}")


In [None]:
prompt = "Será verdad que"  # Change this to your desired starting prompt
max_length = 150  # Change this to desired output length

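input_ids = tokenizer.encode(prompt, return_tensors="pt")
input_ids = input_ids.to(model.device)  # Use the model's device instead of assuming a GPU is present

# Note: with the default settings (do_sample=False), generate() decodes greedily and temperature has no effect
output = model.generate(input_ids, max_length=max_length, temperature=0.7)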
output_text = tokenizer.decode(output[0], skip_special_tokens=True)

print(output_text)
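
The `generate` call above passes `temperature=0.7`. The sketch below illustrates what temperature does, assuming the standard softmax-with-temperature formulation. The function name and the toy scores are hypothetical. It shows that temperature reshapes the probabilities without changing which token ranks first, so it only matters when tokens are sampled rather than picked greedily.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities, sharpening them when temperature < 1."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.5]  # hypothetical next-token scores
for t in (1.0, 0.7):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

A lower temperature concentrates more probability on the top-scoring token, but the top token stays the same. Greedy decoding therefore yields the same text at any temperature, which is consistent with the repetitive output reported in the conclusion.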


<a name="conclusion"></a>
# Conclusion

The model was finetuned successfully, but it generates repetitive nonsense that bears no resemblance to the writing style of our author.

For improvements and fixes to the output, see `Decoding_methods-gpt2_finetuned.ipynb`, where I use different decoding methods for language generation with Transformers.

## Output example

Prompt:
`"Será verdad que..."`

Output:
`Será verdad que no se ha de
hablar de la verdad, y que no se ha de
haber de la verdad, y que no se ha de
haber de la verdad, sino de la verdad, y que
se ha de creer que no se ha de creer en nada, y que
se ha de creer que no se ha de creer en nada, y que
se ha de creer en algo, y que no se ha de creer en nada, y que
se ha de creer en algo, y que no se ha de creer en nada, y que
se ha de creer en algo, y que no se ha de creer en nada, y que
se ha de creer en algo, y que se`