

# **1. Prepare the Dataset**



**Step 1: Download and Load the Raw Text**

In [11]:
%pip install transformers datasets torch tqdm scikit-learn



In [1]:
# Load the raw text
with open("text.txt", "r", encoding="utf-8") as file:
    raw_data = file.read()

In [2]:
# Display the first 500 characters
print(raw_data[:500])

The Project Gutenberg eBook of Le soleil intérieur
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.




**Step 2: Clean the Text**

In [3]:
import re

def clean_text(text):
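    # Remove the "*** START OF ... ***" and "*** END OF ... ***" marker lines
    # (the license text before and after the markers is kept)
    text = re.sub(r"(\*\*\* START OF.*?\*\*\*)|(\*\*\* END OF.*?\*\*\*)", "", text, flags=re.DOTALL)
    # Keep only word characters, whitespace, and basic punctuation
    # (curly apostrophes and quotation marks are dropped)
    text = re.sub(r"[^\w\s.,;:!?'-]", "", text)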
    # Normalize whitespace
    text = re.sub(r"\s+", " ", text).strip()
    return text

In [4]:
# Clean the raw text
cleaned_data = clean_text(raw_data)

In [5]:
# Save cleaned text
with open("cleaned_text.txt", "w", encoding="utf-8") as file:
    file.write(cleaned_data)

In [6]:
# Display the first 500 characters of the cleaned text
print(cleaned_data[:500])

The Project Gutenberg eBook of Le soleil intérieur This ebook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this ebook or online at www.gutenberg.org. If you are not located in the United States, you will have to check the laws of the country where you are located before using this eBook. Title:
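The cleaned output above still begins with the Project Gutenberg license header, because `clean_text` deletes only the marker lines themselves. One possible way to drop the header and footer is to keep only the text between the two markers. The sketch below reuses the marker patterns from `clean_text`; the function name `strip_gutenberg_boilerplate` and the sample string are hypothetical, and the exact marker wording in a given ebook is an assumption worth checking.

```python
import re

# Same marker patterns as in clean_text, compiled separately so that
# their positions in the text can be located.
START_MARKER = re.compile(r"\*\*\* START OF.*?\*\*\*", re.DOTALL)
END_MARKER = re.compile(r"\*\*\* END OF.*?\*\*\*", re.DOTALL)

def strip_gutenberg_boilerplate(text):
    """Return only the text between the START and END markers.

    If either marker is missing, the text is returned unchanged.
    """
    start = START_MARKER.search(text)
    end = END_MARKER.search(text)
    if start and end and start.end() < end.start():
        return text[start.end():end.start()].strip()
    return text

# Made-up miniature ebook, for illustration only
sample = (
    "The Project Gutenberg eBook of Le soleil intérieur\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK LE SOLEIL INTÉRIEUR ***\n"
    "Dans la lumière du matin.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK LE SOLEIL INTÉRIEUR ***\n"
    "Updated editions will replace the previous one.\n"
)
body = strip_gutenberg_boilerplate(sample)
print(body)
```

Calling such a helper on `raw_data` before `clean_text` would keep the English license text out of the training data.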


**Step 3: Segment the Text**

In [7]:
# Segment text into sentences (naive: splits on every period, so URLs such as www.gutenberg.org are broken up)
segments = cleaned_data.split('.')
segments = [segment.strip() + "." for segment in segments if segment.strip()]

In [8]:
# Save segmented text
with open("segmented_text.txt", "w", encoding="utf-8") as file:
    file.writelines(segment + "\n" for segment in segments)

In [9]:
# Display first 5 segments
print(segments[:5])

['The Project Gutenberg eBook of Le soleil intérieur This ebook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no restrictions whatsoever.', 'You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this ebook or online at www.', 'gutenberg.', 'org.', 'If you are not located in the United States, you will have to check the laws of the country where you are located before using this eBook.']
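The output above shows `www.gutenberg.org` split into three fragments, since `split('.')` treats every period as a sentence end. A slightly more careful approach, sketched below, splits only where sentence-ending punctuation is followed by whitespace. The name `split_sentences` and the sample string are hypothetical; abbreviations followed by a space (for example "M. Dupont") would still be split, so this is a rough heuristic rather than a full sentence tokenizer.

```python
import re

# Split at whitespace that directly follows '.', '!' or '?'.
# Periods inside URLs have no whitespace after them, so they are left alone.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")

def split_sentences(text):
    """Split text into sentences, keeping the closing punctuation."""
    return [s.strip() for s in SENTENCE_BOUNDARY.split(text) if s.strip()]

# Illustrative input only
sample = (
    "You may re-use it online at www.gutenberg.org. "
    "If you are not located in the United States, check the laws. Really?"
)
sentences = split_sentences(sample)
for sentence in sentences:
    print(sentence)
```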


# **2. Fine-Tune GPT-2**

**Step 1: Load Pre-trained GPT-2**

In [12]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

In [13]:
# GPT-2 has no padding token by default: add one, then resize the embeddings to match
tokenizer.add_special_tokens({"pad_token": "<|pad|>"})
model.resize_token_embeddings(len(tokenizer))

The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


Embedding(50258, 768)

**Step 2: Tokenize the Dataset**

In [14]:
from datasets import Dataset

# Create a dataset object
data = Dataset.from_dict({"text": segments})

In [15]:
# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=128)

tokenized_data = data.map(tokenize_function, batched=True)

Map:   0%|          | 0/3119 [00:00<?, ? examples/s]

In [16]:
# Display an example of tokenized data
print(tokenized_data[0])

{'text': 'The Project Gutenberg eBook of Le soleil intérieur This ebook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no restrictions whatsoever.', 'input_ids': [464, 4935, 20336, 46566, 286, 1004, 6195, 346, 493, 2634, 5034, 333, 770, 47179, 318, 329, 262, 779, 286, 2687, 6609, 287, 262, 1578, 1829, 290, 749, 584, 3354, 286, 262, 995, 379, 645, 1575, 290, 351, 2048, 645, 8733, 16014, 13, 50257, 50257, 50257, ... (output truncated; the remaining positions shown are the padding id 50257)

In [25]:
!pip install --upgrade transformers

Collecting transformers
  Downloading transformers-4.47.0-py3-none-any.whl.metadata (43 kB)
Collecting tokenizers<0.22,>=0.21 (from transformers)
  Downloading tokenizers-0.21.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading transformers-4.47.0-py3-none-any.whl (10.1 MB)
Downloading tokenizers-0.21.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB)
Installing collected packages: tokenizers, transformers
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.20.3
    Uninstalling tokenizers-0.20.3:
      Successfully uninstalled tokenizers-0.20

**Step 3: Set Up the Training Arguments**

In [17]:
from transformers import TrainingArguments

# Training arguments
training_args = TrainingArguments(
    output_dir="./gpt2-finetuned",
    overwrite_output_dir=True,
    eval_strategy="epoch",  # called `evaluation_strategy` in older transformers releases
    learning_rate=3e-5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    num_train_epochs=1,
    save_steps=500,
    save_total_limit=1,
    logging_dir="./logs",
    logging_steps=50,
    fp16=True,
    push_to_hub=False,
)

**Step 4: Train the Model**

In [18]:
from transformers import Trainer, DataCollatorForLanguageModeling
from sklearn.model_selection import train_test_split

In [19]:
# Data collator for causal language modeling (mlm=False): builds labels from input_ids and masks padding positions
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
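With `mlm=False`, the collator prepares labels for causal language modeling: the labels are a copy of `input_ids`, and positions holding the padding token are set to `-100` so that the loss skips them. The pure-Python sketch below only illustrates that idea and is not the library's implementation; `make_causal_lm_labels` is a hypothetical helper, and the padding id 50257 is taken from the tokenized example shown earlier.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default

def make_causal_lm_labels(input_ids, pad_token_id):
    """Copy input_ids into labels, masking padding positions."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]

# Shortened example: four real tokens followed by two padding tokens
input_ids = [464, 4935, 20336, 13, 50257, 50257]
labels = make_causal_lm_labels(input_ids, pad_token_id=50257)
print(labels)  # [464, 4935, 20336, 13, -100, -100]
```

Because every segment is padded to 128 tokens during tokenization, this masking is what keeps the many padding positions from contributing to the training loss.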

In [20]:
# Split dataset into train and evaluation datasets (90% train, 10% eval)
train_texts, eval_texts = train_test_split(segments, test_size=0.1, random_state=42)

In [21]:
# Tokenize both datasets
train_data = Dataset.from_dict({"text": train_texts}).map(tokenize_function, batched=True)
eval_data = Dataset.from_dict({"text": eval_texts}).map(tokenize_function, batched=True)

Map:   0%|          | 0/2807 [00:00<?, ? examples/s]

Map:   0%|          | 0/312 [00:00<?, ? examples/s]

In [22]:
# Trainer setup
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=eval_data,  # Provide the evaluation dataset here
    data_collator=data_collator,
)

In [23]:
# Train the model
trainer.train()

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 4.3745        | 4.206096        |


TrainOutput(global_step=351, training_loss=4.5507548595765375, metrics={'train_runtime': 147.5901, 'train_samples_per_second': 19.019, 'train_steps_per_second': 2.378, 'total_flos': 183361683456000.0, 'train_loss': 4.5507548595765375, 'epoch': 1.0})
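Two of the numbers in this output can be sanity-checked by hand. The sketch below recomputes the number of optimizer steps from the 2807 training examples, the batch size of 4, and the gradient accumulation of 2, and converts the validation loss into a perplexity. It assumes a single device, and the rounding used for the step count is an assumption (for these values, rounding up or down gives the same result).

```python
import math

# Values taken from the training setup and output above
train_examples = 2807
per_device_train_batch_size = 4
gradient_accumulation_steps = 2
num_train_epochs = 1

# One optimizer step consumes gradient_accumulation_steps batches
batches_per_epoch = math.ceil(train_examples / per_device_train_batch_size)
steps_per_epoch = math.ceil(batches_per_epoch / gradient_accumulation_steps)
total_steps = steps_per_epoch * num_train_epochs
print(f"optimizer steps: {total_steps}")

# Perplexity is the exponential of the mean cross-entropy loss
validation_loss = 4.206096
perplexity = math.exp(validation_loss)
print(f"validation perplexity: {perplexity:.1f}")
```

The step count agrees with `global_step=351`. A perplexity in this range after a single epoch suggests the model is still far from fluent on this text, which fits the partly garbled sample generated below.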

# **3. Save and Test the Model**

**Step 1: Save the Fine-Tuned Model**

In [24]:
# Save the fine-tuned model and tokenizer
model.save_pretrained("./gpt2-finetuned")
tokenizer.save_pretrained("./gpt2-finetuned")

('./gpt2-finetuned/tokenizer_config.json',
 './gpt2-finetuned/special_tokens_map.json',
 './gpt2-finetuned/vocab.json',
 './gpt2-finetuned/merges.txt',
 './gpt2-finetuned/added_tokens.json')

**Step 2: Generate Text**

In [25]:
from transformers import pipeline

# Load the fine-tuned model
generator = pipeline("text-generation", model="./gpt2-finetuned", tokenizer=tokenizer)

Device set to use cuda:0


In [26]:
# Generate text
prompt = "Dans la lumière du matin,"
generated_text = generator(prompt, max_length=50, num_return_sequences=1)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


In [27]:
# Display the generated text
print(generated_text[0]["generated_text"])

Dans la lumière du matin, ces lettres dans la nuit comme quil y de souffait une même. Tout les autorité et la neuve révolution le mère,
