In [0]:
# Install necessary libraries 
%pip install transformers torch datasets

# Import required modules
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
import torch


Note: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.


2024-11-20 14:28:25.973685: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-11-20 14:28:26.008677: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


## Introduction to Hugging Face Transformers

Hugging Face Transformers is a popular library for working with state-of-the-art pre-trained models for NLP and Generative AI. These models are:

- Easy to integrate.
- Pre-trained on diverse datasets.
- Optimized for a variety of tasks like text generation, summarization, and translation.

We'll use the Hugging Face `pipeline` API, which wraps tokenization, model inference, and decoding in a single call, to simplify working with Generative AI models.


In [0]:
# Load pre-trained GPT-2 through the text-generation pipeline
generator = pipeline('text-generation', model='gpt2')

# Generate text; max_length counts the prompt tokens plus the generated tokens
prompt = "The future of AI is"
result = generator(prompt, max_length=50, num_return_sequences=1)

# Display generated text
print("Generated Text:", result[0]['generated_text'])


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated Text: The future of AI is at stake, they worry about its safety when it becomes too "aggressive". As we have explained, the threat is that AI will eventually lead to AI that does not understand human reasoning. Humans have learned a lot over the last


## Explanation of GPT Model

GPT (Generative Pre-trained Transformer) is an autoregressive language model: it predicts the next token in a sequence from the tokens that precede it. Key features:

- **Transformer Architecture**: Uses self-attention so each token can draw on the earlier tokens in the sequence, and so training can process all positions of a sequence in parallel.
- **Pre-Training**: Trained on large text datasets to learn language structure.
- **Fine-Tuning**: Can be trained further for specific tasks or domains, such as summarization or domain-specific text generation.

In the previous example, we used GPT-2, which is capable of generating coherent and contextually relevant text based on a given prompt.
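
To make "autoregressive" concrete, here is a minimal toy sketch of greedy decoding. The lookup table `NEXT_TOKEN_PROBS` and the function `generate_greedy` are hypothetical stand-ins, not part of the Transformers library, and the toy conditions only on the last token, whereas GPT-2 conditions on the whole preceding sequence.

```python
# Toy next-token table standing in for a trained model (hypothetical values).
NEXT_TOKEN_PROBS = {
    "the": {"future": 0.6, "model": 0.4},
    "future": {"of": 0.9, "is": 0.1},
    "of": {"ai": 0.7, "the": 0.3},
    "ai": {"is": 0.8, "<eos>": 0.2},
    "is": {"<eos>": 1.0},
}


def generate_greedy(prompt_tokens, max_new_tokens=10):
    """Append the most probable next token until <eos> or the token budget."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        candidates = NEXT_TOKEN_PROBS.get(tokens[-1])
        if not candidates:
            break  # No known continuation for this token
        next_token = max(candidates, key=candidates.get)
        if next_token == "<eos>":
            break  # End-of-sequence token stops generation
        tokens.append(next_token)
    return tokens


print(generate_greedy(["the"]))
```

Each step feeds the sequence produced so far back in as input, which is the loop the `pipeline` call runs internally (with sampling options and a real model in place of the table).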


## Fine-Tuning Overview

Fine-tuning allows you to adapt a pre-trained model to specific tasks or domains. This is done by training the model on task-specific datasets while leveraging its pre-trained knowledge.

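In this section, we will:
- Load and preprocess a dataset.
- Perform basic fine-tuning on a small model (GPT-2).

Fine-tuning is resource-intensive: beyond the model weights, training must hold gradients, optimizer state, and activations in memory, so a GPU or TPU instance is usually needed.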
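
As a rough illustration of why, the sketch below estimates the memory needed for weights, gradients, and optimizer state alone. It assumes full-precision (4-byte) values and an Adam-style optimizer that keeps two extra values per parameter; `training_memory_gib` is a hypothetical helper, and the figure excludes activations, which grow with batch size and sequence length.

```python
def training_memory_gib(num_params, bytes_per_value=4):
    """Lower-bound estimate in GiB, ignoring activations and framework overhead."""
    weights = num_params * bytes_per_value
    gradients = num_params * bytes_per_value
    optimizer_state = 2 * num_params * bytes_per_value  # Adam: two moment estimates
    return (weights + gradients + optimizer_state) / 1024**3


# Parameter count of the smallest GPT-2 checkpoint
GPT2_SMALL_PARAMS = 124_439_808

print(f"{training_memory_gib(GPT2_SMALL_PARAMS):.2f} GiB before activations")
```

Even for the smallest GPT-2 this is close to 2 GiB before a single batch is processed.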


In [0]:
# Load a sample dataset from Hugging Face
from datasets import load_dataset

# Load a small dataset for text generation fine-tuning
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:10000]")  
print("Sample data:", dataset)




Sample data: Dataset({
    features: ['text'],
    num_rows: 10000
})


In [0]:
# Load the GPT-2 tokenizer and model
tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained('gpt2')

# GPT-2 has no padding token, so reuse the end-of-sequence token for padding
tokenizer.pad_token = tokenizer.eos_token

# Tokenize the dataset, padding or truncating every example to 128 tokens
def tokenize_function(examples):
    return tokenizer(
        examples['text'],
        truncation=True,
        padding="max_length",
        max_length=128
    )

# Apply tokenization
tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Reusing the EOS token adds no new vocabulary entry, so this call leaves the
# embedding matrix unchanged; it only matters if a new pad token is added
model.resize_token_embeddings(len(tokenizer))

# Display a tokenized example
print("Tokenized Sample:", tokenized_dataset[0])


Tokenized Sample: {'text': '', 'input_ids': [50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256], 'attention_mask': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
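
The sample above is all padding (token 50256 with a zero attention mask) because the first row of the dataset has an empty `text` field. One possible remedy, sketched here on hypothetical rows, is to drop blank rows before tokenizing. `has_text` is an illustrative helper; on a Hugging Face `Dataset` the equivalent call would be `dataset.filter(has_text)`.

```python
def has_text(example):
    """Return True when the row contains something other than whitespace."""
    return len(example["text"].strip()) > 0


# Hypothetical rows shaped like the dataset's records
rows = [
    {"text": ""},
    {"text": " = Example heading = \n"},
    {"text": " \n"},
    {"text": "An example sentence. \n"},
]

kept = [row for row in rows if has_text(row)]
print(len(rows), "->", len(kept))
```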

In [0]:
print(model)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
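
The shapes in this printout are enough to work out the model's size. The sketch below is a back-of-the-envelope count. It assumes the standard GPT-2 layout, which the printout does not show: `c_attn` projects 768 to 3 x 768, `c_fc` projects 768 to 4 x 768, and `lm_head` shares its weights with `wte`, so it adds no parameters. The helper names are hypothetical.

```python
VOCAB, POSITIONS, HIDDEN, LAYERS = 50257, 1024, 768, 12


def linear_params(n_in, n_out):
    """Weight matrix plus bias vector."""
    return n_in * n_out + n_out


def layer_norm_params(dim):
    """Scale and shift vectors."""
    return 2 * dim


embeddings = VOCAB * HIDDEN + POSITIONS * HIDDEN  # wte + wpe

block = (
    layer_norm_params(HIDDEN)                # ln_1
    + linear_params(HIDDEN, 3 * HIDDEN)      # attn.c_attn (query, key, value)
    + linear_params(HIDDEN, HIDDEN)          # attn.c_proj
    + layer_norm_params(HIDDEN)              # ln_2
    + linear_params(HIDDEN, 4 * HIDDEN)      # mlp.c_fc
    + linear_params(4 * HIDDEN, HIDDEN)      # mlp.c_proj
)

total = embeddings + LAYERS * block + layer_norm_params(HIDDEN)  # + ln_f
print(f"{total:,} parameters")
```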


In [0]:
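from transformers import Trainer, TrainingArguments

# Define training arguments
# No evaluation strategy is set because no eval_dataset is passed to the Trainer;
# requesting per-epoch evaluation without one fails when evaluation runs.
training_args = TrainingArguments(
    output_dir="./potato",          # Directory for checkpoints
    learning_rate=2e-5,
    per_device_train_batch_size=2,  # Small batch to limit memory use
    num_train_epochs=1,
    weight_decay=0.01,
    save_total_limit=2,             # Keep only the two most recent checkpoints
    save_steps=10                   # Write a checkpoint every 10 optimizer steps
)

# Create Trainer (replaced in the next cell once the dataset has labels)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset
)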
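
To see what these settings imply, the sketch below works out the number of optimizer steps and checkpoint writes for this run. It assumes a single device and no gradient accumulation; `training_schedule` is a hypothetical helper, not a Transformers function.

```python
import math


def training_schedule(num_examples, batch_size, epochs, save_steps):
    """Return (total optimizer steps, number of checkpoint writes)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    checkpoint_writes = total_steps // save_steps
    return total_steps, checkpoint_writes


total_steps, checkpoint_writes = training_schedule(
    num_examples=10_000, batch_size=2, epochs=1, save_steps=10
)
print(total_steps, "steps,", checkpoint_writes, "checkpoint writes")
```

With `save_steps=10`, a full epoch writes roughly 500 checkpoints, even though `save_total_limit=2` keeps only the latest two on disk. A larger `save_steps` value would reduce that I/O.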



In [0]:
# Prepare the dataset for causal language modeling.
# The labels are a plain copy of input_ids: the model shifts them by one
# position internally when it computes the next-token loss.
def preprocess_data(examples):
    inputs = tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",
        max_length=128
    )
    # Note: padding positions are copied too, so they also count toward the loss
    inputs["labels"] = inputs["input_ids"].copy()
    return inputs

# Apply preprocessing
processed_dataset = tokenized_dataset.map(preprocess_data, batched=True)

# Define Trainer with processed dataset
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=processed_dataset,  # Use processed dataset with labels
    tokenizer=tokenizer  # Pass tokenizer to handle padding
)
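
The following pure-Python sketch illustrates how a causal language modeling loss treats labels: the prediction at position t is scored against the label at position t + 1, and labels equal to -100 are skipped. `mask_padding` and `causal_lm_loss` are hypothetical helpers written for illustration; they mirror the idea rather than reproduce the library's implementation. Replacing padded labels with -100 is a common refinement that the cell above does not apply.

```python
import math

IGNORE_INDEX = -100  # Label value that loss functions conventionally skip


def mask_padding(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with -100."""
    return [
        token if mask == 1 else IGNORE_INDEX
        for token, mask in zip(input_ids, attention_mask)
    ]


def causal_lm_loss(logits, labels):
    """Mean cross-entropy where logits[t] is scored against labels[t + 1]."""
    total, count = 0.0, 0
    for t in range(len(labels) - 1):
        target = labels[t + 1]
        if target == IGNORE_INDEX:
            continue  # Padded positions contribute nothing
        row = logits[t]
        peak = max(row)
        log_normalizer = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_normalizer - row[target]
        count += 1
    return total / count if count else float("nan")


# Three real tokens followed by two padding tokens, vocabulary of size 4
labels = mask_padding([1, 2, 3, 0, 0], [1, 1, 1, 0, 0])
uniform_logits = [[0.0] * 4 for _ in labels]
print(labels, causal_lm_loss(uniform_logits, labels))
```

A sequence that is entirely padding yields no scored positions at all, which is one reason to remove empty rows before applying this kind of masking.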


In [0]:
# WILL EVENTUALLY CRASH DUE TO MEMORY LIMITATION
trainer.train()

Epoch,Training Loss,Validation Loss


---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
File <command-4287111371344631>, line 2
      1 # WILL EVENTUALLY CRASH DUE TO MEMORY LIMITATION
----> 2 trainer.train()

File /databricks/python/lib/python3.10/site-packages/mlflow/utils/autologging_utils/safety.py:451, in safe_patch.<locals>.safe_patch_function(*args, **kwargs)
    436 if (
    437     active_session_failed
    438     or autologging_is_disabled(autologging_integration)
   (...)
    446     # skipped
    450     ):
--> 451         return original(*a

## Deploying Fine-Tuned Models

Once a model is fine-tuned, it can be deployed for inference. Databricks supports integration with MLflow for tracking and deploying machine learning models, including those fine-tuned for Generative AI tasks.

You can log the model with MLflow and serve it via APIs for production use cases.
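
A minimal sketch of that logging step is shown below, assuming an MLflow version that includes the `transformers` flavor. The function name `log_finetuned_model` and the artifact path `"gpt2-finetuned"` are hypothetical, and parameter names may differ between MLflow versions, so check the documentation for the version on your cluster.

```python
def log_finetuned_model(model, tokenizer, artifact_path="gpt2-finetuned"):
    """Log a fine-tuned causal LM and its tokenizer; return the MLflow run ID."""
    # Imported lazily so the function can be defined where MLflow is not installed
    import mlflow
    import mlflow.transformers

    with mlflow.start_run() as run:
        mlflow.transformers.log_model(
            transformers_model={"model": model, "tokenizer": tokenizer},
            artifact_path=artifact_path,  # Hypothetical artifact location
            task="text-generation",
        )
    return run.info.run_id


# Example call once training has finished:
# run_id = log_finetuned_model(trainer.model, tokenizer)
```

The returned run ID identifies the logged model, which can then be registered and served through MLflow.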
