# Language Generation Task Using Transfer Learning with Hugging Face Models

In this example, we'll load a pre-trained language model from Hugging Face, fine-tune it on a small text dataset, and then use it to generate text. We'll use GPT-2, a causal (left-to-right) language model that is well suited to text generation.

# Step 1: Install Required Libraries
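First, we'll install the necessary libraries. Hugging Face's transformers library provides easy access to pre-trained models and tokenizers, datasets loads the training corpus, and accelerate is required by the Trainer when running on PyTorch.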

In [None]:
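# Quote the requirement specifiers so the shell does not treat ">" as output
# redirection or "[...]" as a glob pattern.
!pip install "transformers[torch]"
!pip install datasets
!pip install "accelerate>=0.21.0"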

In [20]:
import torch

# Step 2: Load a Pre-Trained Model and Tokenizer
We'll load the pre-trained GPT-2 model and its tokenizer. The tokenizer converts text into the integer token IDs that the model takes as input, and converts generated token IDs back into text.

In [21]:
# We start by loading a pre-trained GPT-2 model and its tokenizer from the
# Hugging Face transformers library. The pre-trained model already has learned
# representations from vast amounts of text data, which we can leverage for our task.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the pre-trained GPT-2 model
model_name = 'gpt2'
model = GPT2LMHeadModel.from_pretrained(model_name)

# Load the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
# GPT-2 has no dedicated padding token, so reuse the end-of-sequence token for padding
tokenizer.pad_token = tokenizer.eos_token
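
Because GPT-2 ships without a padding token, reusing the end-of-sequence token means that padded positions look identical to a real end-of-text token. The attention mask is what tells them apart. The following is a minimal plain-Python sketch of padding and truncating to a fixed length. `pad_or_truncate` is a hypothetical helper written for illustration, not part of the transformers API; the input token IDs are arbitrary, and `50256` is GPT-2's end-of-sequence token ID.

```python
EOS_ID = 50256  # GPT-2's end-of-sequence token ID, reused here as the pad ID


def pad_or_truncate(token_ids, max_length, pad_id=EOS_ID):
    """Hypothetical helper: return (input_ids, attention_mask), each of length max_length."""
    kept = token_ids[:max_length]          # truncation: drop tokens past max_length
    n_pad = max_length - len(kept)         # padding: fill the remainder on the right
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# A short sequence is padded; a long one is truncated (token IDs are made up).
short_ids, short_mask = pad_or_truncate([11, 22, 33], max_length=5)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Positions where the mask is 0 are ignored by attention, which is why sharing one ID between padding and end-of-sequence is workable in practice.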


# Step 3: Fine-Tune the Model
Fine-tuning continues training a pre-trained model on a task-specific dataset so that its weights adapt to the new data. We'll load the data with the Hugging Face datasets library. For this example, we fine-tune on WikiText-2, a small corpus of Wikipedia text.
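
For a causal language model such as GPT-2, the fine-tuning objective is next-token prediction: the target at each position is the token that follows it. The sketch below shows this in plain Python with made-up token IDs. `make_labels` and `next_token_pairs` are hypothetical helpers for illustration only; `-100` is the label value that PyTorch's cross-entropy loss ignores by default, and it is used here to keep padding out of the loss.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def make_labels(input_ids, attention_mask):
    """Copy input_ids into labels, masking padded positions so they add no loss."""
    return [tok if m == 1 else IGNORE_INDEX for tok, m in zip(input_ids, attention_mask)]


def next_token_pairs(input_ids, labels):
    """Pair each prefix with the token one step ahead, skipping ignored targets."""
    pairs = []
    for i in range(len(input_ids) - 1):
        target = labels[i + 1]
        if target != IGNORE_INDEX:
            pairs.append((input_ids[: i + 1], target))
    return pairs


# Made-up example: three real tokens followed by two padding positions.
input_ids = [5, 8, 13, 0, 0]
attention_mask = [1, 1, 1, 0, 0]
labels = make_labels(input_ids, attention_mask)
pairs = next_token_pairs(input_ids, labels)
print(labels)
print(pairs)
```

In the transformers library, `DataCollatorForLanguageModeling` with `mlm=False` prepares labels in a similar way, and `GPT2LMHeadModel` applies the one-position shift internally when `labels` are passed.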

In [None]:
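# Fine-tuning trains the pre-trained model on a new, task-specific dataset.
# This adjusts the model's weights slightly to better fit the new data.
# We use the Trainer class from the transformers library, which simplifies the
# training loop. TrainingArguments holds the training configuration, such as
# the number of epochs, batch size, and warmup steps.
from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

# Load a dataset (we'll use the wikitext dataset for demonstration)
dataset = load_dataset('wikitext', 'wikitext-2-raw-v1')
train_dataset = dataset['train']
test_dataset = dataset['test']

# Drop empty lines, which would otherwise become all-padding examples
# with no tokens to learn from
train_dataset = train_dataset.filter(lambda example: example['text'].strip() != '')
test_dataset = test_dataset.filter(lambda example: example['text'].strip() != '')

# Tokenize the dataset. Plain Python lists are returned here (no return_tensors),
# because the data collator converts each batch to tensors during training.
def tokenize_function(examples):
    return tokenizer(examples['text'], truncation=True, padding='max_length', max_length=250)

train_dataset = train_dataset.map(tokenize_function, batched=True, remove_columns=['text'])
test_dataset = test_dataset.map(tokenize_function, batched=True, remove_columns=['text'])

# Build causal language modeling batches: with mlm=False the collator copies
# input_ids into labels (masking padding) so that the model can compute a loss.
# Without labels, the Trainer has no loss to optimize.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Set training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=1,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    data_collator=data_collator,
)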

# Train the model
trainer.train()
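
The `warmup_steps=500` setting ramps the learning rate up from zero over the first 500 optimizer steps instead of starting at full strength. The sketch below reproduces a linear warmup followed by a linear decay in plain Python, which mirrors the shape of the Trainer's default `linear` schedule. `linear_warmup_lr` is a hypothetical helper, the peak rate of `5e-5` is the `TrainingArguments` default, and the total of 9,000 steps is an assumed round number, not a measured value for this run.

```python
def linear_warmup_lr(step, peak_lr=5e-5, warmup_steps=500, total_steps=9000):
    """Hypothetical helper: learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Warmup phase: rise linearly from 0 to peak_lr
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: fall linearly from peak_lr to 0 at total_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Start, mid-warmup, peak, mid-decay, and end of training
for step in (0, 250, 500, 4750, 9000):
    print(step, linear_warmup_lr(step))
```

A gentle start like this is a common choice when fine-tuning, since large early updates can disturb the pre-trained weights.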


# Step 4: Use the Fine-Tuned Model for Text Generation
After fine-tuning, we can use the model to generate text from a given prompt. The generate method accepts parameters such as max_length, which caps the total number of tokens (prompt plus continuation), and num_return_sequences, which sets how many continuations to return.

In [None]:
# Once the model is fine-tuned, we can use it for text generation. We provide
# a text prompt and use the generate method to produce a continuation of the text.
# The tokenizer converts the text into tokens, which the model processes to
# generate the output tokens. Finally, we decode the tokens back into human-readable text.

# Define a text prompt
prompt = "Once upon a time"

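# Tokenize the prompt and move it to the model's device (the Trainer may have
# moved the model to a GPU during fine-tuning)
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)

# Generate text; max_length counts the prompt tokens plus the generated tokens.
# Passing the attention mask and pad_token_id avoids the warnings GPT-2
# otherwise emits because it has no dedicated padding token.
output = model.generate(
    **inputs,
    max_length=100,
    num_return_sequences=1,
    pad_token_id=tokenizer.eos_token_id,
)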

# Decode the generated text
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
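
With the arguments shown above, `generate` does not sample: it defaults to greedy decoding, picking the highest-scoring next token at each step until the sequence reaches `max_length` total tokens (prompt included) or an end-of-sequence token is produced. A toy sketch of that loop follows. `toy_next_token_scores` is a made-up stand-in for the model's forward pass, and the five-token vocabulary is hypothetical.

```python
EOS = 0  # hypothetical end-of-sequence ID for this toy vocabulary


def toy_next_token_scores(sequence):
    """Stand-in for a model: score each of 5 token IDs given the sequence so far."""
    preferred = (sequence[-1] + 1) % 5  # this toy "model" always favors last + 1
    return [1.0 if tok == preferred else 0.0 for tok in range(5)]


def greedy_generate(prompt_ids, max_length, eos_id=EOS):
    """Append the highest-scoring token until max_length total tokens or EOS."""
    sequence = list(prompt_ids)
    while len(sequence) < max_length:
        scores = toy_next_token_scores(sequence)
        next_id = max(range(len(scores)), key=scores.__getitem__)
        sequence.append(next_id)
        if next_id == eos_id:
            break
    return sequence


# max_length counts the two prompt tokens, so only two new tokens fit here.
print(greedy_generate([1, 2], max_length=4))
# With more room, generation stops early once EOS is produced.
print(greedy_generate([1, 2], max_length=10))
```

Because greedy decoding is deterministic, repeated calls with the same prompt return the same text; sampling-based settings are what introduce variety.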


# Try it yourself

1. Dataset Modification: Modify the code to use a different text dataset available in Hugging Face's datasets library. How does the model's output change with the new dataset?

2. Text Generation Parameters: Modify the code to generate text with different parameters such as max_length, num_return_sequences, and temperature. Note that temperature only has an effect when sampling is enabled with do_sample=True. How do these parameters affect the generated text?

3. Model Performance: Fine-tune the model for more epochs. How does the quality of the generated text change with additional training?

4. Output Analysis: Generate text with different prompts and analyze the coherence and relevance of the output. Are there any patterns in how the model responds to different prompts?
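
For the generation-parameter exercise, it helps to see what temperature does to the next-token distribution: the logits are divided by the temperature before the softmax, so values below 1 sharpen the distribution and values above 1 flatten it. A minimal sketch with made-up logits follows; `softmax_with_temperature` is a hypothetical helper, not a transformers function.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Hypothetical helper: turn logits into probabilities after temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                             # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token, which tends to give safer but more repetitive text, while higher temperatures spread it out and tend to give more varied output.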