<a href="https://colab.research.google.com/github/anuraglahon16/-A-Hands-On-Course-on-Deep-Learning/blob/master/Transformer_Workshop_Code_Temple_Examples_of_Different_Model_Architectures.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## BART (Encoder-Decoder Style Model)

BART (Bidirectional and Auto-Regressive Transformers), introduced in [this paper](https://arxiv.org/pdf/1910.13461v1.pdf), is an encoder-decoder model built on the standard architecture from the "Attention Is All You Need" paper. The authors make one simple modification: the activation function is changed from ReLU to GeLU.
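
To make the activation swap concrete, here is a minimal sketch comparing ReLU with the exact (erf-based) GELU. The helper names `relu` and `gelu` are ours for illustration and are not taken from the BART codebase.

```python
import math

def relu(x: float) -> float:
    """ReLU: pass positive inputs through, zero out negatives."""
    return max(0.0, x)

def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

# Unlike ReLU, GELU is smooth and lets small negative values through.
for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):+.4f}")
```

For large positive inputs the two functions nearly coincide; the difference shows up around zero, where GELU curves smoothly instead of having a hard corner.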

This model excels at a number of tasks, including but not limited to: Machine Translation, Summarization, Categorization of Input Sentences, and Question Answering.

We'll showcase BART with a Text Summarization fine-tuning task today.

In [7]:
!pip install rouge-score evaluate transformers accelerate -qU

In [8]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline, set_seed
from transformers import DataCollatorForSeq2Seq
from transformers import Seq2SeqTrainer
from transformers import Seq2SeqTrainingArguments

import datasets
from datasets import Dataset
from datasets import DatasetDict

First up, we'll load our model and tokenizer!

In [9]:
#device = "cuda" if torch.cuda.is_available() else "cpu"

model_ckpt = "facebook/bart-base"

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt)

We'll be using the `billsum` dataset from Hugging Face, which can be found [here](https://huggingface.co/datasets/billsum/viewer/default).

Each row contains a block of text from a legal bill, along with a plain-English summary.

In [10]:
from datasets import load_dataset

dataset = load_dataset("billsum", split="ca_test")

In [11]:
dataset

Dataset({
    features: ['text', 'summary', 'title'],
    num_rows: 1237
})

We'll create train/validation/test splits from a small subset of the data to fine-tune and evaluate our model.

In [12]:
import math

total_rows = 500
test_val_ratio = 0.2

val_rows = total_rows + math.floor(total_rows * test_val_ratio)
test_rows = val_rows + math.floor(total_rows * test_val_ratio)

subset_dataset = datasets.DatasetDict(
    {
        "train" : Dataset.from_dict(dataset[:total_rows]),
        "validation" : Dataset.from_dict(dataset[total_rows:val_rows]),
        "test" : Dataset.from_dict(dataset[val_rows:test_rows])
    }
)

In [13]:
subset_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'summary', 'title'],
        num_rows: 500
    })
    validation: Dataset({
        features: ['text', 'summary', 'title'],
        num_rows: 100
    })
    test: Dataset({
        features: ['text', 'summary', 'title'],
        num_rows: 100
    })
})
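
The slice boundaries used above can be sketched as a small helper. `split_bounds` is a hypothetical name introduced for illustration; it simply reproduces the arithmetic from the split cell.

```python
import math

def split_bounds(train_rows: int, holdout_ratio: float) -> tuple[int, int, int]:
    """Return the end indices of the train, validation, and test slices."""
    holdout_rows = math.floor(train_rows * holdout_ratio)
    val_end = train_rows + holdout_rows
    test_end = val_end + holdout_rows
    return train_rows, val_end, test_end

train_end, val_end, test_end = split_bounds(500, 0.2)
print(train_end, val_end, test_end)  # 500 600 700
```

Rows `[0:500]` become the training set, `[500:600]` the validation set, and `[600:700]` the test set, which matches the 500/100/100 row counts shown above.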

We need to preprocess our data into tokenized representations.

These tokenized representations are what the model will actually see during training!

In [14]:
max_input_length = 1024
max_target_length = 128

def preprocess_function(examples):
    model_inputs = tokenizer(
        examples["text"],
        max_length=max_input_length,
        truncation=True,
    )
    labels = tokenizer(
        examples["summary"], max_length=max_target_length, truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [15]:
tokenized_datasets = subset_dataset.map(preprocess_function, batched=True)

In [16]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
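
As a rough picture of what a seq2seq collator does for each batch, the sketch below pads `input_ids` with the pad token and pads `labels` with `-100`, so that padded label positions can be ignored by the loss. This is a simplified pure-Python stand-in, not the library implementation; `pad_batch`, the toy token IDs, and the `pad_token_id=1` value are assumptions for illustration.

```python
def pad_batch(batch, pad_token_id=1, label_pad_token_id=-100):
    """Right-pad every example in a batch to the longest length in that batch."""
    max_inputs = max(len(ex["input_ids"]) for ex in batch)
    max_labels = max(len(ex["labels"]) for ex in batch)
    padded = {"input_ids": [], "labels": []}
    for ex in batch:
        n_in, n_lab = len(ex["input_ids"]), len(ex["labels"])
        padded["input_ids"].append(ex["input_ids"] + [pad_token_id] * (max_inputs - n_in))
        # -100 marks label positions that should not contribute to the loss
        padded["labels"].append(ex["labels"] + [label_pad_token_id] * (max_labels - n_lab))
    return padded

batch = [
    {"input_ids": [0, 11, 12, 13, 2], "labels": [0, 21, 2]},
    {"input_ids": [0, 14, 2], "labels": [0, 22, 23, 24, 2]},
]
out = pad_batch(batch)
print(out["input_ids"])
print(out["labels"])
```

This is also why the metric function later in the notebook replaces `-100` with the pad token ID before decoding the labels.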

We can remove all unnecessary text columns.

In [17]:
tokenized_datasets = tokenized_datasets.remove_columns(dataset.column_names)

We'll set up an evaluation pipeline that will help us monitor our model's performance!

In [18]:
!pip install nltk -qU

In [19]:
import nltk
from nltk.tokenize import sent_tokenize
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [20]:
import evaluate

rouge_score = evaluate.load("rouge")

In [21]:
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred

    # Decode generated summaries into text
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    # Replace -100 in the labels as we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)

    # Decode reference summaries into text
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # ROUGE expects a newline after each sentence
    decoded_preds = ["\n".join(sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(sent_tokenize(label.strip())) for label in decoded_labels]

    # Compute ROUGE scores
    result = rouge_score.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    return {k: round(v, 4) for k, v in result.items()}
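
For intuition about what the ROUGE numbers measure, here is a minimal sketch of a ROUGE-1 style F1 score based on unigram overlap. It is a simplification under stated assumptions: it uses whitespace tokenization and no stemming, so its values will not match the `rouge` metric loaded above exactly. The function name `rouge1_f1` and the example sentences are ours.

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a prediction and a reference (no stemming)."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((pred_counts & ref_counts).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the bill amends the water act", "the bill amends the tax code")
print(round(score, 4))  # 0.6667
```

Four of the six unigrams overlap in each direction, so precision and recall are both 4/6 and the F1 score is about 0.67.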

Now we can finally get to training!

We're going to train with the `Seq2Seq` objective as we're trying to convert one long sequence into a shorter sequence.

In [22]:
batch_size = 8
num_train_epochs = 8
# Show the training loss with every epoch
logging_steps = len(tokenized_datasets["train"]) // batch_size
model_name = model_ckpt

args = Seq2SeqTrainingArguments(
    output_dir=f"{model_name}-finetuned-billsum",
    evaluation_strategy="epoch",
    learning_rate=5.6e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=num_train_epochs,
    predict_with_generate=True,
    logging_steps=logging_steps)

In [23]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,)

In [24]:
trainer.train()

You're using a BartTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum |
|---|---|---|---|---|---|---|
| 1 | 2.6089 | 2.004466 | 0.1737 | 0.1016 | 0.1572 | 0.1635 |
| 2 | 2.0595 | 1.897163 | 0.182 | 0.1091 | 0.1648 | 0.1732 |
| 3 | 1.7999 | 1.839696 | 0.1883 | 0.1092 | 0.1679 | 0.176 |
| 4 | 1.6338 | 1.861869 | 0.1872 | 0.1125 | 0.1682 | 0.1754 |
| 5 | 1.4844 | 1.834686 | 0.1846 | 0.106 | 0.1639 | 0.1728 |
| 6 | 1.3902 | 1.858965 | 0.1893 | 0.111 | 0.1686 | 0.1769 |
| 7 | 1.2919 | 1.857108 | 0.1827 | 0.1059 | 0.1637 | 0.171 |
| 8 | 1.2593 | 1.859874 | 0.1817 | 0.1035 | 0.1634 | 0.1716 |




TrainOutput(global_step=504, training_loss=1.682692120945643, metrics={'train_runtime': 775.8812, 'train_samples_per_second': 5.155, 'train_steps_per_second': 0.65, 'total_flos': 2438945832960000.0, 'train_loss': 1.682692120945643, 'epoch': 8.0})
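
The `global_step=504` in the output above can be reproduced with a little arithmetic. This sketch assumes a single device and no gradient accumulation, which is what the training arguments in this notebook imply; the variable names are ours.

```python
import math

train_examples = 500
batch_size = 8
num_train_epochs = 8

# Logging interval from the arguments cell: floor division drops the partial batch
logging_steps = train_examples // batch_size

# Optimizer steps per epoch, counting the final partial batch
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(logging_steps, steps_per_epoch, total_steps)  # 62 63 504
```

Because 500 is not a multiple of 8, each epoch takes 63 steps (62 full batches plus one batch of 4), giving 504 steps over 8 epochs.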

Now we can push our model to the Hugging Face Hub to test and play around with!

In [25]:
!pip install huggingface-hub -qU

In [28]:
#from huggingface_hub import notebook_login

#notebook_login()

In [30]:
#trainer.push_to_hub("ai-maker-space/Transformers-Workshop-BART-Summarization")

You can check out the final model [here](https://huggingface.co/ai-maker-space/Transformers-Workshop-BART-Summarization)!

## BERT (Encoder Only Architecture)

We'll be using BERT (found in [this paper]()) as our example of an Encoder-only transformer model.

BERT-style models excel at Sentiment Analysis, Question Answering, Text Prediction, and other language comprehension tasks.

The [20 Newsgroups dataset](http://qwone.com/~jason/20Newsgroups/) is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.

[Here are some details](https://scikit-learn.org/stable/datasets/real_world.html#newsgroups-dataset) about the dataset from Scikit Learn!

Let's load the data and get it into a usable format!

In [31]:
from sklearn.datasets import fetch_20newsgroups

train = fetch_20newsgroups(subset = "train")
test = fetch_20newsgroups(subset = "test")

Have a look around the data! Take note of things like data types, column names, and everything else!

Also take note of how many classes there are in our labels and mark it down in the cell below!

In [32]:
NUM_LABELS = 20

In [33]:
import pandas as pd

X, y = pd.Series(train["data"]), pd.Series(train["target"])
X_test, y_test = pd.Series(test["data"]), pd.Series(test["target"])

With our raw data loaded, the cell above converts it into `pd.Series` objects using pandas.

Now let's get the Hugging Face datasets ([documentation here](https://huggingface.co/docs/datasets/index)) library so we can convert our data into a more usable format.

In [34]:
train_df = pd.DataFrame({
    "text" : X,
    "label" : y
})

test_df = pd.DataFrame({
    "text" : X_test,
    "label" : y_test
})

train_ds = Dataset.from_pandas(train_df)
test_ds = Dataset.from_pandas(test_df)

Now we can cast our label columns to datasets.features.ClassLabel objects using class_encode_column! (documentation [here](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.class_encode_column))

In [35]:
train_ds = train_ds.class_encode_column("label")
test_ds = test_ds.class_encode_column("label")

To recap: we combined our separate series objects into a `pd.DataFrame` with `text` and `label` columns (for our Xs and ys respectively), and then loaded each `pd.DataFrame` into a `Dataset` object.

Next, we'll hold out 10% of the training set as a stratified validation split and keep the original test set for final evaluation.

In [36]:
data_dsd = train_ds.train_test_split(test_size=0.1, seed=19, stratify_by_column="label")

In [37]:
data_dsd['validation'] = data_dsd['test']
data_dsd['test'] = test_ds

In [38]:
data_dsd

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 10182
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7532
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1132
    })
})
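
To show what "stratified" means here, the sketch below splits a toy label list so that each class keeps roughly the same share in both parts. It is a pure-Python illustration, not the `datasets` implementation; the function name `stratified_split`, its rounding rule, and the toy labels are assumptions.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.1, seed=19):
    """Split indices so each label keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    train_idx, held_out_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_held_out = round(len(indices) * test_size)
        held_out_idx.extend(indices[:n_held_out])
        train_idx.extend(indices[n_held_out:])
    return train_idx, held_out_idx

labels = [0] * 80 + [1] * 20  # toy, imbalanced labels
train_idx, val_idx = stratified_split(labels)
print(Counter(labels[i] for i in val_idx))
```

With an 80/20 class mix, the 10-example held-out set contains 8 examples of class 0 and 2 of class 1, mirroring the original proportions.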

In this task, we'll be fine-tuning a simple classifier for our above data using Hugging Face's transformers library (documentation [here](https://huggingface.co/docs/transformers/index)) as well as the accelerate library (documentation [here](https://huggingface.co/docs/accelerate/index)).

Before we dive in, let's take a pit stop to discuss what fine-tuning is - in broad strokes.

- Fine-tuning is a transfer learning approach where a pre-trained machine learning model is further trained on new data, often to specialize in a certain task. This process can involve training the entire network or only a subset of it, with untrained layers remaining 'frozen'.

- This method is prevalent in Natural Language Processing (NLP) and convolutional neural networks. In the latter, early layers capturing lower-level features are typically frozen, while in NLP, large models like GPT-2 are fine-tuned for specific tasks, improving their performance. However, full fine-tuning can be computationally costly and might lead to overfitting.

- Although fine-tuning is commonly executed through supervised learning, it can also be done using weak supervision or reinforcement learning. For instance, language models like ChatGPT and Sparrow are fine-tuned using reinforcement learning from human feedback.
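
To illustrate what "frozen" means in practice, here is a toy sketch of a gradient step that skips frozen parameters. The parameter names, gradient values, and the `sgd_step` helper are all hypothetical; real frameworks expose the same idea through per-layer trainable flags.

```python
def sgd_step(params, grads, trainable, lr=0.1):
    """Apply one SGD update, leaving any frozen parameter unchanged."""
    return {
        name: value - lr * grads[name] if trainable[name] else value
        for name, value in params.items()
    }

params = {"encoder.weight": 0.50, "classifier.weight": 0.10}
grads = {"encoder.weight": 0.20, "classifier.weight": -0.30}
trainable = {"encoder.weight": False, "classifier.weight": True}  # freeze the encoder

updated = sgd_step(params, grads, trainable)
print(updated)
```

Only the classifier weight moves; the encoder weight keeps its pre-trained value even though a gradient was computed for it.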

Okay, so now that we have had a brief overview of what fine-tuning actually is - let's set ourselves up to do some!

In [39]:
bert_model_id = "distilbert-base-uncased"

In [40]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(bert_model_id)

In [41]:
MAX_LEN = 256

def preprocess_function(examples):
    return tokenizer(examples['text'], truncation=True, padding=True, max_length=MAX_LEN)

In [42]:
tokenized_text = data_dsd.map(preprocess_function, batched=True)

In [43]:
tokenized_text

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 10182
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 7532
    })
    validation: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 1132
    })
})
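
The new `input_ids` and `attention_mask` columns can be pictured with a small sketch of truncation and padding. This is a simplification: it pads to `max_length` (the real call with `padding=True` pads to the longest sequence in each batch) and it ignores special-token handling during truncation. The `encode` helper, the toy token IDs, and `pad_token_id=0` are assumptions for illustration.

```python
def encode(token_ids, max_length, pad_token_id=0):
    """Truncate to max_length, then right-pad, returning ids and an attention mask."""
    ids = token_ids[:max_length]
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_token_id] * n_pad,
        # 1 marks real tokens, 0 marks padding the model should ignore
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }

short = encode([101, 7592, 102], max_length=5)
long = encode([101, 1, 2, 3, 4, 5, 6, 102], max_length=5)
print(short)
print(long)
```

Short inputs are padded and masked, while long inputs are cut down to `max_length` with every position marked as a real token.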

In [44]:
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer, return_tensors="tf")

We'll want to include the attention_mask, input_ids, and label for each set - as well as shuffling the training set.

In [45]:
BATCH_SIZE = 16

tf_train_set = tokenized_text["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "label"],
    shuffle=True,
    batch_size=BATCH_SIZE,
    collate_fn=data_collator,
)

tf_validation_set = tokenized_text["validation"].to_tf_dataset(
    columns=["attention_mask", "input_ids","label"],
    shuffle=False,
    batch_size=BATCH_SIZE,
    collate_fn=data_collator,
    )

tf_test_set = tokenized_text["test"].to_tf_dataset(
    columns=["attention_mask", "input_ids","label"],
    shuffle=False,
    batch_size=BATCH_SIZE,
    collate_fn=data_collator,
    )

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


In [46]:
from transformers import create_optimizer

EPOCHS = 3
batches_per_epoch = len(tokenized_text["train"]) // BATCH_SIZE
total_train_steps = int(batches_per_epoch * EPOCHS)
optimizer, schedule = create_optimizer(init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps)
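
A minimal sketch of the schedule this sets up, assuming that `create_optimizer` with zero warmup steps decays the learning rate linearly from `init_lr` to 0 over `num_train_steps`. Treat the exact shape as an assumption; the function name `linear_decay_lr` is ours.

```python
def linear_decay_lr(step, init_lr=2e-5, total_steps=1908):
    """Learning rate that falls linearly from init_lr to 0 with no warmup."""
    remaining = max(0.0, 1.0 - step / total_steps)
    return init_lr * remaining

train_examples, batch_size, epochs = 10182, 16, 3
batches_per_epoch = train_examples // batch_size   # 636
total_train_steps = batches_per_epoch * epochs     # 1908

for step in (0, total_train_steps // 2, total_train_steps):
    print(step, linear_decay_lr(step, total_steps=total_train_steps))
```

Under that assumption the learning rate starts at 2e-5, is halved midway through training, and reaches zero at the final step.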

In [47]:
from transformers import TFAutoModelForSequenceClassification

my_bert = TFAutoModelForSequenceClassification.from_pretrained(bert_model_id, num_labels=NUM_LABELS)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_transform.bias']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

We're going to use the naive example of accuracy for this notebook - but feel free to use whatever metric you believe will work best.

In [48]:
my_bert.compile(optimizer=optimizer,  metrics=['accuracy'])
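
For reference, accuracy on a classifier's raw outputs boils down to taking the highest-scoring class per example and comparing it with the label. The sketch below uses made-up logits and a hypothetical helper name, `accuracy_from_logits`.

```python
def accuracy_from_logits(logits, labels):
    """Fraction of examples whose highest-scoring class matches the label."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

logits = [[2.0, 0.1, -1.0], [0.3, 0.2, 1.5], [0.0, 3.0, 0.5], [1.2, 0.9, 0.1]]
labels = [0, 2, 1, 1]
acc = accuracy_from_logits(logits, labels)
print(acc)  # 0.75
```

Three of the four predictions match their labels, so the accuracy is 0.75.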

In [49]:
%%time
my_bert.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=EPOCHS)

Epoch 1/3


ResourceExhaustedError: Graph execution error:

Detected at node tf_distil_bert_for_sequence_classification/distilbert/transformer/layer_._2/ffn/Gelu/truediv defined at (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/keras/src/activations.py", line 348, in gelu

failed to allocate memory
	 [[{{node tf_distil_bert_for_sequence_classification/distilbert/transformer/layer_._2/ffn/Gelu/truediv}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
 [Op:__inference_train_function_17594]

In [50]:
bert_loss, bert_acc = my_bert.evaluate(tf_test_set)



In [51]:
HUGGINGFACE_ACCT_NAME = "ai-maker-space"
MODEL_NAME = "Transformers-Workshop-BERT-NewsGroupClassification"

In [54]:
#my_bert.push_to_hub(f"{HUGGINGFACE_ACCT_NAME}/{MODEL_NAME}")
#tokenizer.push_to_hub(f"{HUGGINGFACE_ACCT_NAME}/{MODEL_NAME}")

## GPT-2 (Decoder-only Architecture)

Next up, and perhaps more importantly, we have our GPT-style models. These models are built on a decoder-only architecture and work in an autoregressive fashion: they generate tokens one at a time, each conditioned on the tokens that precede it.
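
The autoregressive loop can be sketched with a toy example. The score table below is entirely hypothetical and conditions only on the last token, whereas a real GPT-style model scores the next token from all preceding tokens; `generate` uses simple greedy decoding.

```python
# Hypothetical next-token scores; a real model computes these from all preceding tokens.
NEXT_TOKEN_SCORES = {
    "<bos>": {"hello": 0.9, "world": 0.1},
    "hello": {"world": 0.8, "<eos>": 0.2},
    "world": {"<eos>": 0.7, "hello": 0.3},
}

def generate(max_new_tokens=5):
    """Greedy autoregressive decoding: append the best next token, then repeat."""
    tokens = ["<bos>"]
    for _ in range(max_new_tokens):
        scores = NEXT_TOKEN_SCORES[tokens[-1]]  # toy model: condition on the last token only
        next_token = max(scores, key=scores.get)
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens[1:]

print(generate())  # ['hello', 'world']
```

Each generated token is fed back in as context for the next step, and generation stops at the end-of-sequence token or the token budget.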

You can read more about GPT-2 in [this paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf).

Decoder-only Architectures excel at text generation, language modeling, and creative writing.

We're going to be spending a lot of time on this style of architecture in our course, so we'll be zooming through this section!

We'll be leveraging a lyric dataset to fine-tune our GPT-2-small model: the `brunokreiner/genius-lyrics` dataset from the Hugging Face Hub.

In [55]:
lyric_dataset = load_dataset("brunokreiner/genius-lyrics")

In [56]:
lyric_dataset

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'id', 'lyrics', 'is_english', 'genres_list', 'popularity', 'release_date', 'artist_id', 'artist_name', 'artist_popularity', 'artist_followers', 'artist_picture_url'],
        num_rows: 480855
    })
})

In [57]:
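import math

total_rows = 500       # rows in the training split
test_val_ratio = 0.2   # validation and test each get 20% of total_rows

# Note: these two values are slice END indices, not row counts.
# Validation covers rows [500, 600) and test covers rows [600, 700).
val_rows = total_rows + math.floor(total_rows * test_val_ratio)
test_rows = val_rows + math.floor(total_rows * test_val_ratio)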

subset_dataset = datasets.DatasetDict(
    {
        "train" : Dataset.from_dict(lyric_dataset["train"][:total_rows]),
        "validation" : Dataset.from_dict(lyric_dataset["train"][total_rows:val_rows]),
        "test" : Dataset.from_dict(lyric_dataset["train"][val_rows:test_rows])
    }
)

In [58]:
subset_dataset

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'id', 'lyrics', 'is_english', 'genres_list', 'popularity', 'release_date', 'artist_id', 'artist_name', 'artist_popularity', 'artist_followers', 'artist_picture_url'],
        num_rows: 500
    })
    validation: Dataset({
        features: ['Unnamed: 0', 'id', 'lyrics', 'is_english', 'genres_list', 'popularity', 'release_date', 'artist_id', 'artist_name', 'artist_popularity', 'artist_followers', 'artist_picture_url'],
        num_rows: 100
    })
    test: Dataset({
        features: ['Unnamed: 0', 'id', 'lyrics', 'is_english', 'genres_list', 'popularity', 'release_date', 'artist_id', 'artist_name', 'artist_popularity', 'artist_followers', 'artist_picture_url'],
        num_rows: 100
    })
})
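The slicing arithmetic above is easy to misread, because `val_rows` and `test_rows` are end indices rather than counts. As a quick sanity check, here is a small sketch (the `split_bounds` helper is hypothetical, not part of the notebook) that spells out the boundaries and the resulting split sizes.

```python
import math


def split_bounds(train_rows, holdout_ratio):
    """Return (start, end) row ranges for contiguous train/validation/test slices."""
    holdout_rows = math.floor(train_rows * holdout_ratio)
    val_end = train_rows + holdout_rows
    test_end = val_end + holdout_rows
    return {
        "train": (0, train_rows),
        "validation": (train_rows, val_end),
        "test": (val_end, test_end),
    }


bounds = split_bounds(500, 0.2)
sizes = {name: end - start for name, (start, end) in bounds.items()}
print(bounds)
print(sizes)
```

The sizes match the `num_rows` values reported in the `DatasetDict` above: 500, 100, and 100.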

In [59]:
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments

gpt_model_id = "gpt2"

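tokenizer = AutoTokenizer.from_pretrained(gpt_model_id)
model = AutoModelForCausalLM.from_pretrained(gpt_model_id)
# GPT-2 ships without a padding token, so we register one. This grows the
# vocabulary by one entry; if padded batches are ever fed to the model, also call
# model.resize_token_embeddings(len(tokenizer)) so the new id has an embedding.
tokenizer.add_special_tokens({'pad_token': '[PAD]'})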

1

In [60]:
def tokenize_function(examples):
    return tokenizer(examples["lyrics"])

In [61]:
tokenized_datasets = subset_dataset.map(tokenize_function, batched=True, num_proc=1)

Token indices sequence length is longer than the specified maximum sequence length for this model (1189 > 1024). Running this sequence through the model will result in indexing errors


In [62]:
tokenized_datasets = tokenized_datasets.remove_columns(lyric_dataset["train"].column_names)

In [63]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 500
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 100
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 100
    })
})

In [64]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder. We could pad instead if the model supported it;
    # customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
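To see what `group_texts` does, here is the same logic run on a tiny hand-made batch. This is a sketch: `block_size` is passed as an argument instead of being read from a global, and the token ids are made up.

```python
def group_texts(examples, block_size):
    # Concatenate every sequence in the batch into one long list per column.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the tail that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    # Cut the long list into fixed-size blocks.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal language modeling the labels are the inputs themselves.
    result["labels"] = result["input_ids"].copy()
    return result


# Made-up token ids: three "songs" of lengths 3, 5, and 2 (10 tokens in total).
toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1, 1], [1, 1]],
}

grouped = group_texts(toy_batch, block_size=4)
print(grouped["input_ids"])
print(grouped["labels"])
```

Ten tokens with a block size of 4 yield two full blocks, and the last two tokens are dropped. Note that blocks can span song boundaries, and that the labels are an unshifted copy of the inputs: Hugging Face causal LM models shift the labels internally when computing the loss.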

In [65]:
# Use a quarter of the model's context window per training block (1024 // 4 = 256 for GPT-2).
block_size = tokenizer.model_max_length // 4

lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=1,
)

In [66]:
model_name = "lyric-gpt"

num_train_epochs = 30

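training_args = TrainingArguments(
    f"output/{model_name}",
    overwrite_output_dir=True,
    evaluation_strategy="epoch",
    learning_rate=1.00e-4,
    weight_decay=0.01,
    num_train_epochs=num_train_epochs,
    save_total_limit=10,
    save_strategy="epoch",
    save_steps=1,        # ignored while save_strategy is "epoch"
    report_to="none",    # the string "none" disables logging integrations; passing None falls back to the library default
    logging_steps=5,
    do_eval=True,
    eval_steps=1,        # ignored while evaluation_strategy is "epoch"
    load_best_model_at_end=True,
)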

In [67]:
trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"]
)

In [68]:
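from transformers import get_cosine_schedule_with_warmup

train_dataloader = trainer.get_train_dataloader()
# Note: len(train_dataloader) is the number of batches in ONE epoch, so the cosine
# schedule below decays to zero after a single epoch rather than over the whole run.
# Multiply by num_train_epochs if the decay should span every epoch.
num_train_steps = len(train_dataloader)
trainer.create_optimizer_and_scheduler(num_train_steps)
# Swap the default scheduler for a cosine schedule with no warmup.
trainer.lr_scheduler = get_cosine_schedule_with_warmup(
    trainer.optimizer,
    num_warmup_steps=0,
    num_training_steps=num_train_steps,
)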

trainer.model.config.task_specific_params['text-generation'] = {
                    'do_sample': True,
                    'min_length': 100,
                    'max_length': 200,
                    'temperature': 1.,
                    'top_p': 0.95,
                    }
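It helps to look at the shape of a cosine schedule to see why the value of `num_training_steps` matters. The function below is a plain-Python sketch of the learning-rate multiplier as we understand `get_cosine_schedule_with_warmup` to compute it with its default half-cycle setting; treat the exact formula as an assumption and check the library source if it matters. The `steps_per_epoch = 100` figure is made up for illustration.

```python
import math


def cosine_lr_factor(step, num_training_steps, num_warmup_steps=0, num_cycles=0.5):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear warmup from 0 to 1.
        return step / max(1, num_warmup_steps)
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * num_cycles * 2.0 * progress)))


steps_per_epoch = 100  # hypothetical value, for illustration only

# Schedule sized for a single epoch: the multiplier hits zero at the end of
# epoch 1 and climbs back to 1.0 by the end of epoch 2.
for step in (0, 50, 100, 150, 200):
    print(step, round(cosine_lr_factor(step, steps_per_epoch), 3))
```

When the schedule length equals one epoch, the learning rate does not stay at zero afterwards: the cosine keeps going, so the rate oscillates with a period of two epochs. Sizing the schedule as `steps_per_epoch * num_train_epochs` gives a single smooth decay instead.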

In [69]:
import torch
torch.cuda.empty_cache()

data = trainer.train()

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


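| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | 3.5339 | 3.197953 |
| 2 | 2.8989 | 3.196337 |
| 3 | 3.0206 | 3.238493 |
| 4 | 2.5322 | 3.315781 |
| 5 | 2.3109 | 3.386788 |
| 6 | 2.1483 | 3.528725 |
| 7 | 2.3554 | 3.622318 |
| 8 | 2.0529 | 3.784566 |
| 9 | 1.939 | 3.910752 |
| 10 | 1.6137 | 4.043099 |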


There were missing keys in the checkpoint model loaded: ['lm_head.weight'].
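The validation losses logged above are easier to interpret as perplexities. For a causal language model the reported loss is the mean per-token cross-entropy, so perplexity is simply `exp(loss)`. Here is a short sketch using the numbers from the training log:

```python
import math

# Validation loss per epoch, copied from the training log above.
val_loss = {
    1: 3.197953, 2: 3.196337, 3: 3.238493, 4: 3.315781, 5: 3.386788,
    6: 3.528725, 7: 3.622318, 8: 3.784566, 9: 3.910752, 10: 4.043099,
}

# Perplexity is the exponential of the mean cross-entropy loss.
perplexity = {epoch: math.exp(loss) for epoch, loss in val_loss.items()}
best_epoch = min(val_loss, key=val_loss.get)

print(f"best epoch: {best_epoch}")
for epoch, ppl in perplexity.items():
    print(f"epoch {epoch:>2}: perplexity {ppl:.1f}")
```

Validation loss bottoms out at epoch 2 and rises from there while training loss keeps falling, which is the usual signature of overfitting on a small dataset. Since `load_best_model_at_end=True` is set, the checkpoint with the lowest validation loss should be the one restored once training finishes.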


Now that we've trained our model - let's upload it to the Hugging Face Hub!

In [71]:
#trainer.push_to_hub("ai-maker-space/Transformers-Workshop-GPT-Generation")

In [73]:
#tokenizer.push_to_hub("ai-maker-space/Transformers-Workshop-GPT-Generation")

You can find the model [here](https://huggingface.co/ai-maker-space/Transformers-Workshop-GPT-Generation?text=I+am)!

Let's see how the generation works!

In [74]:
start = "I am"
num_sequences = 5
min_length = 100
max_length = 160
temperature = 1
top_p = 0.95
top_k = 50
repetition_penalty = 1.4

encoded_prompt = tokenizer(start, add_special_tokens=False, return_tensors="pt").input_ids
encoded_prompt = encoded_prompt.to(trainer.model.device)
output_sequences = trainer.model.generate(
    input_ids=encoded_prompt,
    max_length=max_length,
    min_length=min_length,
    temperature=float(temperature),
    top_p=float(top_p),
    top_k=int(top_k),
    do_sample=True,
    repetition_penalty=repetition_penalty,
    num_return_sequences=num_sequences,
)

generated_sequences = []

for generated_sequence in output_sequences:
    generated_sequence = generated_sequence.tolist()
    text = tokenizer.decode(generated_sequence, clean_up_tokenization_spaces=True, skip_special_tokens=True)
    generated_sequences.append(text.strip())

for generation in generated_sequences:
    print(generation)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


I am i want you baby and your hand on my shoulder now oh god we re falling apart this is how much freedom s been born of me but it doesn t matter what happens when the dust settled to begin with mister ooh hey why don d ya just know that if she were alive then he would have stayed put yeah said uh let us pray for all our lost souls who are losing hearts somehow get back home no more love nothing was ever too simple could be done right once again cause in these days every one knows there can only be so many angels  say a prayer today may still taste good hear some people cry praying yes please wait til tomorrow might even come tonight maybe sooner rather than later as an angel like uhh ahah haaha haha hoo boy did things start out alright couldn
I am the light my god bless me for coming home to you and i can t believe that this world is empty of all life s troubles because everything stops with what it feels leaving behind a place where nothing seems atypical but maybe some will have tea
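The `top_k` and `top_p` arguments used in the generation call shape which tokens can be sampled at each step. The sketch below illustrates the idea on a made-up four-token distribution. It is a simplified illustration, not the Transformers implementation: the `top_k_top_p_filter` helper and the probabilities are hypothetical.

```python
import random


def top_k_top_p_filter(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix whose mass reaches top_p."""
    # Top-k: keep only the k most probable tokens and renormalize.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    ranked = [(token, p / total) for token, p in ranked]

    # Top-p (nucleus): keep tokens until their cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize the survivors so they form a valid distribution again.
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Made-up next-token distribution.
next_token_probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
filtered = top_k_top_p_filter(next_token_probs, top_k=3, top_p=0.75)
print(filtered)

# Sampling then happens only among the surviving tokens.
rng = random.Random(0)
sampled = rng.choices(list(filtered), weights=list(filtered.values()))[0]
print(sampled)
```

Here top-k drops `d`, and top-p then drops `c`, leaving only `a` and `b` as candidates. Lowering either value makes generations more conservative, while raising them allows more surprising tokens through.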

# Attention is All You Need, Right?

We'll begin by looking at how the basic Transformer block is set up in PyTorch.

Before importing our dependencies, let's look at the classic diagram of the Transformer from the paper ["Attention is All You Need"](https://arxiv.org/pdf/1706.03762.pdf).

![img](https://i.imgur.com/4pA8cS6.png)

## Multi-Head Attention

The first step to creating the transformer is straightforward enough: We need to code up that Attention mechanism!

We need two components to make this happen:

1. Scaled Dot-Product Attention
2. Multi-Head Attention

Let's look at their respective images from the paper!

![img](https://i.imgur.com/1Sp9EXp.png)

The basic idea is as follows:

We allow different Attention Heads to attend to different parts of the sequence with different representation subspaces.

All those words to say that each of our Attention Heads will care about different things throughout the course of training - as the old adage goes: Many heads are better than one!
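For reference, the two diagrams correspond to the following formulas from the paper, where $d_k$ is the per-head key dimension, $h$ is the number of heads, and the $W$ matrices are learned projections:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
$$

$$
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,W^{O},
\qquad
\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q},\, KW_i^{K},\, VW_i^{V})
$$

Dividing by $\sqrt{d_k}$ keeps the dot products from growing with the head dimension, which would otherwise push the softmax into regions with very small gradients.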

In [75]:
import torch
import torch.nn as nn
import torch.optim as optim
import math
import copy

Now we can create our MultiHeadAttention Module!

In [76]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        # Input Dimension of Model
        self.d_model = d_model

        # Number of Heads (h)
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        # Q Linear Layer
        self.W_q = nn.Linear(d_model, d_model)

        # K Linear Layer
        self.W_k = nn.Linear(d_model, d_model)

        # V Linear Layer
        self.W_v = nn.Linear(d_model, d_model)

        # Output Linear Layer
        self.W_o = nn.Linear(d_model, d_model)

    ### Left Side of the Above Image
    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)
        attn_probs = torch.softmax(attn_scores, dim=-1)
        output = torch.matmul(attn_probs, V)
        return output

    ### Right Side of the Above Image
    def split_heads(self, x):
        # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_k)
        batch_size, seq_length, d_model = x.size()
        return x.view(batch_size, seq_length, self.num_heads, self.d_k).transpose(1, 2)

    def combine_heads(self, x):
        # (batch, num_heads, seq_len, d_k) -> (batch, seq_len, d_model)
        batch_size, _, seq_length, d_k = x.size()
        return x.transpose(1, 2).contiguous().view(batch_size, seq_length, self.d_model)

    def forward(self, Q, K, V, mask=None):
        Q = self.split_heads(self.W_q(Q))
        K = self.split_heads(self.W_k(K))
        V = self.split_heads(self.W_v(V))

        attn_output = self.scaled_dot_product_attention(Q, K, V, mask)
        output = self.W_o(self.combine_heads(attn_output))
        return output
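To check our understanding of the `scaled_dot_product_attention` method, here is a dependency-free sketch of the same computation for a single head, using nested lists instead of tensors. The input values and the causal mask are made up for illustration, and this version also returns the attention probabilities so we can inspect them.

```python
import math


def softmax(row):
    # Subtract the max for numerical stability before exponentiating.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = len(K[0])
    # scores[i][j] = (Q[i] . K[j]) / sqrt(d_k)
    scores = [
        [sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in K]
        for q_row in Q
    ]
    if mask is not None:
        # Positions where the mask is 0 get a very negative score, so softmax sends them to ~0.
        scores = [
            [s if keep else -1e9 for s, keep in zip(s_row, m_row)]
            for s_row, m_row in zip(scores, mask)
        ]
    probs = [softmax(row) for row in scores]
    # Each output row is a probability-weighted average of the rows of V.
    output = [
        [sum(p * v_row[j] for p, v_row in zip(p_row, V)) for j in range(len(V[0]))]
        for p_row in probs
    ]
    return output, probs


# Made-up inputs: 3 tokens, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
# Causal mask: token i may only attend to tokens 0..i.
causal_mask = [[1, 0, 0], [1, 1, 0], [1, 1, 1]]

output, probs = scaled_dot_product_attention(Q, K, V, causal_mask)
for row in probs:
    print([round(p, 3) for p in row])
```

With the causal mask, the first token can only attend to itself, so its output is exactly the first row of `V`. This is the same masking trick a decoder uses to stop positions from looking at future tokens.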