### Importing Libraries


The code snippet begins with a series of import statements, each serving a specific purpose in facilitating the subsequent tasks related to natural language processing (NLP) and model training. Firstly, the Dataset class is imported from the datasets library, which is part of the Hugging Face ecosystem. This class provides functionalities for handling datasets, including loading, processing, and manipulating data. Following this, the DataCollatorWithPadding class is imported from the transformers library. This class is instrumental in preparing input data for model training by padding sequences to ensure uniform length, a crucial step for batch processing.

Next, several essential components for fine-tuning a pre-trained GPT-2 model are imported from the transformers library. These include the TrainingArguments and Trainer classes, which define the training parameters and orchestrate the training loop, respectively. Additionally, the GPT2LMHeadModel and GPT2Tokenizer classes are imported, representing the pre-trained GPT-2 model architecture and tokenizer, respectively. These components are pivotal for fine-tuning the GPT-2 model on custom data and generating text.

Continuing with the imports, the code brings in the PorterStemmer and SnowballStemmer classes from the nltk.stem module. Stemming reduces words to a root form by stripping suffixes, which can help normalize word variations during preprocessing; the two classes implement different stemming algorithms. Neither stemmer is called in the cells shown below, where spaCy lemmatization is used instead. The string module is also imported for its string utilities, in particular the string.punctuation constant used later to strip punctuation.

The code also imports essential libraries for data manipulation and analysis. Specifically, the torch library is imported for tensor computations, which are fundamental for neural network operations. Additionally, the pandas library is imported for efficient data manipulation, particularly for working with tabular data structures such as DataFrames. Moreover, the spacy library is imported for advanced NLP tasks, including tokenization, part-of-speech tagging, and named entity recognition.

The train_test_split function is imported from the sklearn.model_selection module to split the dataset into training and validation sets. The re module provides regular expression operations for pattern matching and text manipulation, and the gc module gives access to the garbage collector so that memory can be freed explicitly during execution. Lastly, the warnings module is imported so that warnings raised during execution can be suppressed, keeping the notebook output uncluttered. Overall, these import statements lay the groundwork for the preprocessing steps and for fine-tuning a transformer model for creative text generation.

In [1]:
# Importing the Dataset class from the datasets library
from datasets import Dataset

# Importing the DataCollatorWithPadding class from the transformers library
from transformers import DataCollatorWithPadding

# Importing the Trainer and TrainingArguments classes, GPT2LMHeadModel and GPT2Tokenizer from the transformers library
from transformers import Trainer, TrainingArguments, GPT2LMHeadModel, GPT2Tokenizer

# Importing the PorterStemmer class from the nltk.stem module
from nltk.stem import PorterStemmer

# Importing the string module
import string

# Importing the torch library for tensor computations
import torch

# Importing the pandas library for data manipulation
import pandas as pd

# Importing the SnowballStemmer class from the nltk.stem module
from nltk.stem import SnowballStemmer

# Importing the spacy library for advanced NLP tasks
import spacy

# Importing the train_test_split function from the sklearn.model_selection module
from sklearn.model_selection import train_test_split

# Importing the re module for regular expression operations
import re


# Importing the gc module for manual garbage collection
import gc

# Importing the warnings module to handle warnings
import warnings

# Ignoring any warnings that might be generated when running the code
warnings.filterwarnings('ignore')




### Combining Data into One DataFrame

The load_and_preprocess_data function serves as a critical component for preparing the dataset before it is used for training or further processing. The function takes two arguments, source_file and target_file, which represent the paths to the source and target files containing the data, respectively.

Within the function, each file is opened with the open function in read mode ('r'). The read method returns the entire contents of the file as a single string, and the splitlines method then splits that string into individual lines, producing a list of strings for the source data and another for the target data.
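As a small illustration of this loading step, the sketch below uses an in-memory `io.StringIO` object as a stand-in for an opened file (an assumption made so that the example runs without the dataset; the two lines of text are made up). It shows that `read()` returns the whole text at once and that `splitlines()` yields one string per line with the line endings removed.

```python
import io

# Stand-in for an opened text file; the two lines are made-up sample prompts.
fake_file = io.StringIO("[ WP ] first prompt\n[ CW ] second prompt\n")

# read() returns the entire content as one string ...
raw_text = fake_file.read()

# ... and splitlines() splits it into a list, dropping the line endings.
lines = raw_text.splitlines()

print(lines)
```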

Subsequently, the source and target data are combined into a single DataFrame named df using the pd.DataFrame constructor from the Pandas library. At this point the DataFrame has two columns: 'source', which contains the text prompts, and 'target', which contains the corresponding story continuations. A third column, 'tag', is added in the next step.

To facilitate further analysis, the function extracts any tags present in the source data and stores them in a new column named 'tag'. This is done by applying the re.findall function from the re module, which returns a list of all substrings matching the regular expression r'\[(.*?)\]' (any text enclosed in square brackets) within each source text. Each entry in the 'tag' column is therefore a list, which is empty when no tag is found.

After extracting the tags, the function proceeds to remove them from the source text using the re.sub function, which substitutes any occurrences of the tag pattern with an empty string (''). This effectively cleanses the source text of any tags, ensuring that only the raw text remains.
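The tag extraction and removal can be tried on a single string without pandas. The following is a minimal sketch using the same regular expression; the sample sentence and the name `TAG_PATTERN` are illustrative assumptions, not part of the notebook.

```python
import re

TAG_PATTERN = r'\[(.*?)\]'  # same pattern as in the function described above

# Made-up sample line in the style of the dataset.
sample = "[ WP ] You live in Skyrim ."

tags = re.findall(TAG_PATTERN, sample)      # list of the captured tag texts
cleaned = re.sub(TAG_PATTERN, '', sample)   # tag removed, rest untouched
no_tags = re.findall(TAG_PATTERN, "a line without any tag")

print(tags)
print(repr(cleaned))
print(no_tags)
```

Two details are worth noting: the captured tag keeps its inner spaces, and the cleaned text keeps the space that followed the tag, so a `.strip()` could be added if leading whitespace matters. A line without a tag yields an empty list rather than a missing value, so it would not be affected by a later `dropna` call.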

Finally, any rows containing missing values (NaN) are removed from the DataFrame using the dropna method, ensuring data integrity and consistency.

In summary, the load_and_preprocess_data function effectively loads, preprocesses, and structures the dataset, readying it for subsequent tasks such as model training or analysis. It encapsulates essential data preprocessing steps, including data loading, combining, tag extraction, tag removal, and handling missing values, ensuring that the dataset is well-prepared and suitable for further processing.

In [2]:

def load_and_preprocess_data(source_file, target_file):
    # Load the data
    with open(source_file, 'r') as f:
        source = f.read().splitlines()

    with open(target_file, 'r') as f:
        target = f.read().splitlines()

    # Combine the data into one DataFrame
    df = pd.DataFrame({
        'source': source,
        'target': target
    })

    # Extract tags and create a new column for them
    df['tag'] = df['source'].apply(lambda x: re.findall(r'\[(.*?)\]', x))

    # Remove the tags from the 'source' column
    df['source'] = df['source'].apply(lambda x: re.sub(r'\[(.*?)\]', '', x))

    # Remove any rows with missing values
    df = df.dropna()
    
    return df


# Load and preprocess the data
df=load_and_preprocess_data('/kaggle/input/writing-prompts/writingPrompts/valid.wp_source','/kaggle/input/writing-prompts/writingPrompts/valid.wp_target')



In [3]:
df

Unnamed: 0,source,target,tag
0,Every person in the world undergoes a `` good...,"Clancy Marguerian , 154 , private first class ...",[ WP ]
1,Space mining is on the rise . The Space tanke...,„… and the little duckling will never be able ...,[ WP ]
2,`` I wo n't have time to explain all of this ...,I wo n't have the time to explain all of this ...,[ WP ]
3,Write about a song . Each sentence must start...,* '' [ Sally ] ( https : //www.youtube.com/wat...,[ CW ]
4,You live in Skyrim . It is your job to keep l...,Light is a marvelous thing . It alone can turn...,[ EU ]
...,...,...,...
15615,You are a teenager with the ability to measur...,I decided to go with a 1-15 scale instead of 1...,[ WP ]
15616,"As your dying wish , you ask that your body i...",The shock hit me hard as my lungs filled with ...,[ WP ]
15617,A young child stumbles upon a serial killer d...,`` Your mommy and daddy did n't raise you righ...,[ WP ]
15618,Write from the perspective of a dog who think...,She wants me to get into the car . It 's just ...,[ WP ]


### Text Normalization: This includes converting all text to lower case, which can help ensure that your algorithm does not treat the same words in different cases as different words.

In the following code snippet, several text normalization and cleansing operations are performed on the source and target columns of the DataFrame df. Let's break down each line and discuss its purpose:

df['source'] = df['source'].str.lower(): This line converts all text in the 'source' column to lowercase using the str.lower() method. Converting text to lowercase helps standardize the text data and ensures consistency in subsequent processing steps, such as tokenization and modeling. It prevents the model from treating the same words with different cases as different entities.

df['target'] = df['target'].str.lower(): Similar to the first line, this line converts all text in the 'target' column to lowercase, following the same rationale as above.

df['target'] = df['target'].str.replace('„', ''): Here, the str.replace() method removes every occurrence of the character '„' (a low opening double quotation mark) from the text in the 'target' column. Removing such typographic quotation marks helps clean the text data before further processing.

df['target'] = df['target'].str.replace('”', ''): Similarly, this line removes every occurrence of the character '”' (a closing double quotation mark) from the text in the 'target' column.

df['target'] = df['target'].str.replace('< newlin >', ' '): In this line, the str.replace() method replaces the string '< newlin >' with a space (' ') in the text of the 'target' column. This operation appears intended to handle line breaks that are encoded as a marker token in the text data. Replacing the marker with a space keeps the surrounding words separated. A later cell additionally removes the word 'newline' after punctuation has been stripped, which suggests that this exact pattern may not catch every marker in the data.

Overall, these operations contribute to data cleaning and normalization, which are essential preprocessing steps in natural language processing tasks. By standardizing the text data and removing irrelevant characters or symbols, these operations help ensure that the dataset is well-prepared for subsequent analysis or modeling tasks, ultimately improving the quality and effectiveness of the downstream processes.
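The same operations can be tried on a plain Python string, since the pandas `.str.lower()` and `.str.replace()` accessors apply the corresponding string operations element-wise. The sketch below uses a made-up sample sentence that contains each of the characters discussed above.

```python
# Made-up sample text containing the characters discussed above.
sample = "„Hello ,” She Said . < newlin > Next Line"

normalized = sample.lower()                          # lowercase everything
normalized = normalized.replace('„', '')             # drop low opening quotes
normalized = normalized.replace('”', '')             # drop closing quotes
normalized = normalized.replace('< newlin >', ' ')   # marker becomes a space

# Split on whitespace to inspect the remaining tokens.
print(normalized.split())
```

Because the marker is replaced by a space while the spaces around it remain, the result can contain runs of several spaces; splitting on whitespace, as done here for inspection, is one way to ignore them.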

In [4]:
df['source'] = df['source'].str.lower()
df['target'] = df['target'].str.lower()


In [5]:
df['target'] = df['target'].str.replace('„', '')
df['target'] = df['target'].str.replace('”', '')
df['target'] = df['target'].str.replace('< newlin >', ' ')


### Removing Punctuation: Punctuation can provide less value when training language models, and removing it can reduce the size of the vocabulary your model needs to learn.

In [6]:
df['source'] = df['source'].str.translate(str.maketrans('', '', string.punctuation))
df['target'] = df['target'].str.translate(str.maketrans('', '', string.punctuation))
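
To see what this translation table does, the following sketch applies it to a single made-up string. One point it illustrates is that `string.punctuation` only covers ASCII punctuation, so characters such as '…' are left in place, which matches the '…' still visible in the DataFrame output after this step.

```python
import string

# Translation table that deletes every character listed in string.punctuation.
remove_punct = str.maketrans('', '', string.punctuation)

# Made-up sample with ASCII punctuation and one non-ASCII ellipsis.
sample = "it 's a 1-15 scale … ok ?"
cleaned = sample.translate(remove_punct)

print(repr(cleaned))
```

Deleting the hyphen also merges "1-15" into "115", so numeric ranges and hyphenated words lose their separator with this approach.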


In [7]:
df

Unnamed: 0,source,target,tag
0,every person in the world undergoes a goodne...,clancy marguerian 154 private first class of...,[ WP ]
1,space mining is on the rise the space tanker...,… and the little duckling will never be able t...,[ WP ]
2,i wo nt have time to explain all of this to ...,i wo nt have the time to explain all of this t...,[ WP ]
3,write about a song each sentence must start ...,sally https wwwyoutubecomwatch v6qyvil0...,[ CW ]
4,you live in skyrim it is your job to keep li...,light is a marvelous thing it alone can turn ...,[ EU ]
...,...,...,...
15615,you are a teenager with the ability to measur...,i decided to go with a 115 scale instead of 11...,[ WP ]
15616,as your dying wish you ask that your body is...,the shock hit me hard as my lungs filled with ...,[ WP ]
15617,a young child stumbles upon a serial killer d...,your mommy and daddy did nt raise you right ...,[ WP ]
15618,write from the perspective of a dog who think...,she wants me to get into the car it s just so...,[ WP ]


In [8]:
df['target'] = df['target'].str.replace('newline', '')
df['source'] = df['source'].str.replace('newline', '')

### Lemmatization: These techniques are used to reduce words to their root form. This can help your model generalize better to variations of words.

In this section of the code, the Natural Language Processing (NLP) library spaCy is utilized for lemmatization, a process that involves reducing words to their base or root form. Here's a detailed explanation of each step:

nlp = spacy.load('en_core_web_sm'): This line loads the English language model from spaCy. The model, 'en_core_web_sm', is a small English pipeline trained on web text data and includes components for tokenization, part-of-speech tagging, dependency parsing, and named entity recognition.

df['source'] = df['source'].apply(lambda x: ' '.join([token.lemma_ for token in nlp(x)])): This line applies lemmatization to each word in the 'source' column of the DataFrame df. It uses a lambda function to iterate over each row (x) in the 'source' column. Within the lambda function, each sentence (x) is processed by spaCy's NLP pipeline (nlp(x)), which tokenizes the text into individual tokens. Then, for each token in the sentence, the lemma (base form) of the token is extracted (token.lemma_). Finally, the lemmatized tokens are joined back into a string using ' '.join().

df['target'] = df['target'].apply(lambda x: ' '.join([token.lemma_ for token in nlp(x)])): Similar to the previous line, this line performs lemmatization on each word in the 'target' column of the DataFrame df. It follows the same approach of applying spaCy's NLP pipeline to each sentence (x), extracting lemmas for each token, and joining the lemmatized tokens back into a string.

In [None]:
nlp = spacy.load('en_core_web_sm')

# Apply lemmatization to each word in the 'source' and 'target' columns of your DataFrame
df['source'] = df['source'].apply(lambda x: ' '.join([token.lemma_ for token in nlp(x)]))
df['target'] = df['target'].apply(lambda x: ' '.join([token.lemma_ for token in nlp(x)]))



In [9]:
train_df, valid_df = train_test_split(df, test_size=0.2, random_state=42)

In [10]:
# Clear GPU memory
torch.cuda.empty_cache()


In [11]:
# Force garbage collection
gc.collect()


287

This section of the code involves preparing the dataset for training, tokenizing the input data, and initializing the model for fine-tuning. Here's a breakdown of each step:

1. **Load Dataset Function**: The `load_dataset` function takes three arguments: `train_df`, `valid_df`, and `tokenizer`. It creates training and validation datasets from pandas DataFrames (`train_df` and `valid_df`) using the `Dataset.from_pandas` method provided by the Hugging Face Datasets library. This function also defines an inner function `tokenize_function` to tokenize the input examples. Within this function, the `tokenizer` is applied to the prompts (`examples["source"]`) with truncation enabled and `padding="max_length"`, so every sequence is padded to the tokenizer's maximum length. Note that only the 'source' column is tokenized; the story continuations in 'target' are not passed to the model in this setup. The function then copies the input IDs to a new "labels" key, since for causal language modeling the model learns to predict the input sequence itself. After defining the tokenization function, it maps this function to both the training and validation datasets using the `map` method with `batched=True`, which enables batch processing for efficiency.

2. **Training Arguments**: The `TrainingArguments` object `training_args` is initialized to specify various training settings. These settings include the output directory for checkpoints (`output_dir`), the number of training epochs (`num_train_epochs`), the batch size per device (`per_device_train_batch_size`), and the number of steps between checkpoint saves (`save_steps`). Additionally, `save_total_limit` sets the maximum number of checkpoints to keep.

3. **Tokenizer Initialization**: The GPT2 tokenizer (`tokenizer`) is initialized from the pre-trained "distilgpt2" model using `GPT2Tokenizer.from_pretrained`. This tokenizer is specifically designed for GPT-2 models and handles tokenization, special tokens, and padding.

4. **Dataset Preparation**: The `load_dataset` function is called to prepare the training and validation datasets (`train_dataset` and `valid_dataset`) by tokenizing the input examples using the specified `tokenizer`. The tokenization function `tokenize_function` is applied to each dataset, ensuring that the input data is properly tokenized and formatted for training.

5. **Data Collator Initialization**: The `DataCollatorWithPadding` object `data_collator` is initialized with the `tokenizer` to handle padding of input sequences during training. This collator ensures that sequences within each batch have the same length by padding shorter sequences with the appropriate padding token.

6. **Model Initialization**: The GPT2 language model (`model`) is initialized from the pre-trained "distilgpt2" checkpoint using `GPT2LMHeadModel.from_pretrained`. DistilGPT2 is a smaller, distilled version of GPT-2 that trades some modeling capacity for faster inference and a reduced memory footprint.

7. **Trainer Initialization**: Finally, the `Trainer` object `trainer` is initialized with the specified `model`, `training_args`, training dataset (`train_dataset`), evaluation dataset (`valid_dataset`), and data collator (`data_collator`). This trainer will be responsible for executing the fine-tuning process, utilizing the specified training arguments, and monitoring training progress.
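
The tokenization step described in item 1 can be sketched in plain Python to show what the resulting fields look like. This is not the tokenizer's implementation: `pad_and_label` is a hypothetical helper, the token ids 10, 11, 12 are arbitrary, and the `mask_padding` option is an optional variant that the notebook does not use. The value 50256 is GPT-2's end-of-text token id, which the notebook reuses as the padding token, and -100 is the label value that PyTorch's cross-entropy loss ignores by default.

```python
PAD_ID = 50256        # GPT-2 end-of-text id, reused as the padding token
IGNORE_INDEX = -100   # label value skipped by PyTorch's cross-entropy loss

def pad_and_label(token_ids, max_length, mask_padding=False):
    """Hypothetical helper: truncate, pad to max_length, and copy ids to labels."""
    ids = token_ids[:max_length]                     # truncation
    n_pad = max_length - len(ids)
    input_ids = ids + [PAD_ID] * n_pad               # right padding
    attention_mask = [1] * len(ids) + [0] * n_pad    # 1 = real token, 0 = padding
    labels = list(input_ids)                         # plain copy, as in the notebook
    if mask_padding:
        # Optional variant: exclude padded positions from the loss.
        labels = [tok if m == 1 else IGNORE_INDEX
                  for tok, m in zip(input_ids, attention_mask)]
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

example = pad_and_label([10, 11, 12], max_length=5)
masked = pad_and_label([10, 11, 12], max_length=5, mask_padding=True)
print(example["labels"])
print(masked["labels"])
```

With a plain copy, the padded positions also appear in the labels, so they contribute to the training loss. When most of each sequence is padding, this can make the reported loss look lower than the loss on real tokens alone; masking the padded label positions is a common way to avoid that.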

In [12]:
def load_dataset(train_df, valid_df, tokenizer):
    train_dataset = Dataset.from_pandas(train_df)
    valid_dataset = Dataset.from_pandas(valid_df)

    def tokenize_function(examples):
        tokenized_inputs = tokenizer(examples["source"], truncation=True, padding="max_length")
        tokenized_inputs["labels"] = tokenized_inputs["input_ids"].copy()
        return tokenized_inputs

    train_dataset = train_dataset.map(tokenize_function, batched=True)
    valid_dataset = valid_dataset.map(tokenize_function, batched=True)

    return train_dataset, valid_dataset

training_args = TrainingArguments(
    output_dir="./distilgpt2_story_gen",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=1,
    save_steps=10_000,
    save_total_limit=2,
)

tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token

# Make sure to define train_df and valid_df before this line
train_dataset, valid_dataset = load_dataset(train_df, valid_df, tokenizer)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

model = GPT2LMHeadModel.from_pretrained("distilgpt2")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    data_collator=data_collator,
)

trainer.train()




Step,Training Loss
500,0.1669
1000,0.102
1500,0.0965
2000,0.0896
2500,0.0853
3000,0.0788
3500,0.0753
4000,0.0756
4500,0.0721
5000,0.0699


TrainOutput(global_step=18744, training_loss=0.06136047417823443, metrics={'train_runtime': 5995.7929, 'train_samples_per_second': 6.252, 'train_steps_per_second': 3.126, 'total_flos': 9795492586192896.0, 'train_loss': 0.06136047417823443, 'epoch': 3.0})
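
The numbers in this output can be cross-checked with a little arithmetic. The sketch below only uses values copied from the `TrainOutput` line; the interpretation in the comments is an inference, since the notebook does not state how many devices were used.

```python
# Values copied from the TrainOutput line above.
global_step = 18744
epochs = 3.0
samples_per_second = 6.252
steps_per_second = 3.126
train_runtime = 5995.7929

steps_per_epoch = global_step / epochs                     # optimizer steps per epoch
samples_per_step = samples_per_second / steps_per_second   # effective batch size
approx_samples_seen = samples_per_second * train_runtime   # samples over all epochs

print(steps_per_epoch)
print(round(samples_per_step, 3))
print(round(approx_samples_seen))
```

The ratio of samples to steps is about 2, although `per_device_train_batch_size` is 1. One plausible explanation is that training ran on two devices, which would give an effective batch size of 2.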

**Save Trained Model:**
The trained model is saved using the save_model method of the trainer object. The model is written to the specified directory (./creative_writing_distilgpt2_story_gen) as a weights file (pytorch_model.bin or model.safetensors, depending on the transformers version) together with a config.json file describing the architecture. These files allow the model to be reloaded with from_pretrained and used later without retraining, which makes it easy to deploy or to share for further experimentation.

**Save Tokenizer:** 
Similarly, the tokenizer used during training is saved using the save_pretrained method of the tokenizer object. It is written to the same directory (./creative_writing_distilgpt2_story_gen) as several files, listed in the cell output below: tokenizer_config.json and special_tokens_map.json hold the configuration and special tokens, while vocab.json and merges.txt hold the vocabulary and the byte-pair merge rules. Saving the tokenizer alongside the model ensures consistent tokenization when the model is used for inference or further fine-tuning, and makes the setup easy to reproduce in other environments.

In [13]:
# Save the trained model
trainer.save_model("./creative_writing_distilgpt2_story_gen")

# Save the tokenizer
tokenizer.save_pretrained("./creative_writing_distilgpt2_story_gen")


('./creative_writing_distilgpt2_story_gen/tokenizer_config.json',
 './creative_writing_distilgpt2_story_gen/special_tokens_map.json',
 './creative_writing_distilgpt2_story_gen/vocab.json',
 './creative_writing_distilgpt2_story_gen/merges.txt',
 './creative_writing_distilgpt2_story_gen/added_tokens.json')

In [20]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load the trained model and tokenizer
model = GPT2LMHeadModel.from_pretrained("./creative_writing_distilgpt2_story_gen")
tokenizer = GPT2Tokenizer.from_pretrained("./creative_writing_distilgpt2_story_gen")

# Define the prompts for text generation
prompts = ["Once upon a time", "In a galaxy far far away", "In the heart of the city"]

# Generate text for each prompt
for prompt in prompts:
    # Encode the prompt to tokens
    encoded_prompt = tokenizer.encode(prompt, return_tensors="pt")
    
    # Generate text
    output_sequences = model.generate(
        input_ids=encoded_prompt,
        attention_mask=encoded_prompt.ne(tokenizer.pad_token_id),  # Create attention mask
        pad_token_id=tokenizer.pad_token_id,  # Set pad_token_id
        max_length=200,
        temperature=0.9,
        top_k=1,  # only the single most likely token is kept, so sampling is effectively greedy
        top_p=0.9,
        repetition_penalty=1.0,
        do_sample=True,
        num_return_sequences=1,
    )
    
    # Decode the output sequences to text
    generated_text = tokenizer.decode(output_sequences[0], clean_up_tokenization_spaces=True)
    
    print(f"Prompt: {prompt}\nGenerated Text: {generated_text}\n")


Prompt: Once upon a time
Generated Text: Once upon a time  you are given a deal by a higher power that grants you eternal life  the catch  you have to kill one person every year  if you fail do so  even a minute too late  you will die <|endoftext|>

Prompt: In a galaxy far far away
Generated Text: In a galaxy far far away  you ve been assigned to the first manned mission to the far future  but as you approach your destination  you notice that the mission has been abandoned <|endoftext|>

Prompt: In the heart of the city
Generated Text: In the heart of the city  a man is banished to the wilderness for 20 years  write his diary entries for his first and last days of exile <|endoftext|>
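
The generation call above combines `do_sample=True` with `top_k=1`. The following simplified sketch shows why that combination behaves like greedy decoding. It filters a made-up probability distribution directly, whereas the library works on logits and applies temperature and top-p as well, so `top_k_filter` is an illustrative helper rather than the library's implementation.

```python
def top_k_filter(probs, k):
    """Keep the k most probable entries and renormalize (simplified sketch)."""
    # Indices of the k largest probabilities.
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

# Made-up next-token distribution over a four-token vocabulary.
probs = [0.1, 0.6, 0.25, 0.05]

only_best = top_k_filter(probs, k=1)
best_two = top_k_filter(probs, k=2)
print(only_best)
print(best_two)
```

With `k=1`, all probability mass ends up on the most likely token, so sampling always picks it and the `temperature` and `top_p` settings no longer change the result. A larger `top_k` would be needed for the sampling parameters to add variety to the generated text.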

