### Original Course

This course is adapted from the Hugging Face NLP Course to work on Amazon SageMaker in a workshop environment.

If you are interested in seeing the original course, see here:
- Original Course:  https://huggingface.co/learn/nlp-course/chapter0/1?fw=pt
- Original Course Github: https://github.com/huggingface/course
- Original Course Apache 2.0 License: https://github.com/huggingface/course/blob/main/LICENSE

Although this repository is released under the MIT-0 license, the examples adapted from the HuggingFace course use multiple models: distilbert-base-uncased-finetuned-sst-2-english, facebook/bart-large-mnli, gpt2, distilgpt2, distilroberta-base, dbmdz/bert-large-cased-finetuned-conll03-english, distilbert-base-cased-distilled-squad, sshleifer/distilbart-cnn-12-6, Helsinki-NLP/opus-mt-fr-en and bert-base-uncased. Models from the HuggingFace course are licensed under the Apache 2.0 license.

## STEP 0 - ENVIRONMENT SETUP

In [2]:
!pip install "transformers[torch,sentencepiece]" datasets evaluate
# NOTE - ensure you specify the correct instance type [ml.g4dn.xlarge] using the drop-down in the top-right corner


In [3]:
# Import required packages (duplicate and unused imports removed)
import evaluate
import numpy as np
import torch
from datasets import load_dataset
from transformers import (
    AutoModel,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    BertConfig,
    BertModel,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
    pipeline,
)

## SECTION 1 - Using HuggingFace Transformers

To begin, we want to see what we can do with the Transformers library in Python.

Below are a series of examples of different tasks you can perform using the HuggingFace Transformers pipeline() function. Each use case loads a pre-trained model suited to that task, starting with distilbert-base-uncased-finetuned-sst-2-english for sentiment analysis.

### Use Case 1 - Sentiment Analysis and Text Classification

In [9]:
# Define the model and model version that we want to use for this use case
chosen_model = "distilbert-base-uncased-finetuned-sst-2-english"

# Example 1 - Perform sentiment analysis
classifier = pipeline(task = "sentiment-analysis", model = chosen_model)
classifier("I've been waiting for a HuggingFace course my whole life.")

[{'label': 'POSITIVE', 'score': 0.9598050713539124}]

In [10]:
# Example 2 - Perform sentiment analysis on multiple input strings
classifier(
    ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"]
)

[{'label': 'POSITIVE', 'score': 0.9598050713539124},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

In [12]:
# Define the model and model version that we want to use for this use case
chosen_model = "facebook/bart-large-mnli"

# Example 3 - Classify an input based on custom output labels 
classifier = pipeline(task = "zero-shot-classification", model = chosen_model)
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.8445996046066284, 0.1119738295674324, 0.04342665523290634]}

### Use Case 2 - Generate Text

In [13]:
# Define the model and model version that we want to use for this use case
chosen_model = "gpt2"

# Example 4 - Perform ChatGPT-like text generation from a given text prompt
generator = pipeline(task = "text-generation", model = chosen_model)
generator("In this course, we will teach you how to")

  "You have modified the pretrained model configuration to control generation. This is a"
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to identify people who have an issue for which someone has no answer, the way that most people have a conversation about their personal experience and what they feel can hurt and impact their world at large.'}]

In [14]:
# Define the model and model version that we want to use for this use case
chosen_model = "distilgpt2"

# Example 5 - Select a specific model for generating text, and constrain the length of responses and the number of returned sequences
generator = pipeline(task = "text-generation", model = chosen_model)
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to build software and technologies that will allow us for better collaboration and greater self-management, co-operation'},
 {'generated_text': 'In this course, we will teach you how to build a beautiful and beautiful art. I hope you will enjoy the course! -MichaelBenson'}]

### Use Case 3 - Fill in Missing Data

In [15]:
# Define the model and model version that we want to use for this use case
chosen_model = "distilroberta-base"

# Example 6 - Take a sentence with missing words (indicated by '<mask>'), and determine the top 2 most likely words to fill in that gap
unmasker = pipeline(task = "fill-mask", model = chosen_model)
unmasker("This course will teach you all about <mask> models.", top_k=2)

[{'score': 0.196197971701622,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This course will teach you all about mathematical models.'},
 {'score': 0.04052729904651642,
  'token': 38163,
  'token_str': ' computational',
  'sequence': 'This course will teach you all about computational models.'}]

### Use Case 4 - Named Entity Recognition

In [19]:
# Define the model and model version that we want to use for this use case
chosen_model = "dbmdz/bert-large-cased-finetuned-conll03-english"

# Example 7 - Identify the key entities (persons, locations or organizations) within a sentence
ner = pipeline(task = "ner", aggregation_strategy="simple", model = chosen_model)
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

[{'entity_group': 'PER',
  'score': 0.9981694,
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.9796019,
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': 0.9932106,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]

### Use Case 5 - Answering Questions

In [20]:
# Define the model and model version that we want to use for this use case
chosen_model = "distilbert-base-cased-distilled-squad"

# Example 8 - Give a model a context and a question, and produce an answer to the question based on the given context
question_answerer = pipeline(task = "question-answering", model = chosen_model)
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)

{'score': 0.6949755549430847, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}

### Use Case 6 - Text Summarization

In [21]:
# Define the model and model version that we want to use for this use case
chosen_model = "sshleifer/distilbart-cnn-12-6"

# Example 9 - Summarize a block of text
summarizer = pipeline(task = "summarization", model = chosen_model)
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of 
    graduates in traditional engineering disciplines such as mechanical, civil, 
    electrical, chemical, and aeronautical engineering declined, but in most of 
    the premier American universities engineering curricula now concentrate on 
    and encourage largely the study of engineering science. As a result, there 
    are declining offerings in engineering subjects dealing with infrastructure, 
    the environment, and related issues, and greater concentration on high 
    technology subjects, largely supporting increasingly complex scientific 
    developments. While the latter is important, it should not be at the expense 
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other 
    industrial countries in Europe and Asia, continue to encourage and advance 
    the teaching of engineering. Both China and India, respectively, graduate 
    six and eight times as many traditional engineers as does the United States. 
    Other industrial countries at minimum maintain their output, while America 
    suffers an increasingly serious decline in the number of engineering graduates 
    and a lack of well-educated engineers.
"""
)

[{'summary_text': ' America has changed dramatically during recent years . The number of engineering graduates in the U.S. has declined in traditional engineering disciplines such as mechanical, civil,    electrical, chemical, and aeronautical engineering . Rapidly developing economies such as China and India continue to encourage and advance the teaching of engineering .'}]

### Use Case 7 - Language Translation

In [22]:
# Define the model and model version that we want to use for this use case
chosen_model = "Helsinki-NLP/opus-mt-fr-en"

# Example 10 - Translate from French to English
translator = pipeline(task = "translation", model = chosen_model)
translator("Ce cours est produit par Hugging Face.")



[{'translation_text': 'This course is produced by Hugging Face.'}]

## SECTION 2 - How HuggingFace Transformers Work Under the Hood

Now that we have seen some of the core capabilities of transformers, we will go through a lower-level exploration of how these models work.

The first thing to understand about NLP pipelines is that when we make a request using the pipeline() function call (e.g. Example 1 in Section 1), the following steps are completed under the hood:
1. TOKENIZATION - Take a raw input string (e.g. "I've been waiting for a HuggingFace course my whole life.") and translate it into a numeric representation using a process called "Tokenization" (e.g. [101, 1045, 1005, 2310, 2042, 3403, ...])
2. COMPUTATION - Use the model to perform our computation task (e.g. sentiment analysis)
3. POST-PROCESSING - Take the mathematical output of the model (e.g. the logits [-1.5607, 1.6123]) and translate that into a human-readable format (e.g. POSITIVE: 95.98%)

To get a sense of how this is done, we will implement each step below.

### Step 1 - Preprocessing

In [29]:
# Set starting checkpoint and select your tokenization model
# NOTE - a 'checkpoint' is a saved set of model weights; here, it is the name of a pre-trained model hosted on the HuggingFace Hub
model_checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [30]:
# Tokenize raw inputs (break each string into word and sub-word tokens, and convert those tokens into integer representations)
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}


#### Attention Masking 
Now, it is worth talking briefly about what an 'Attention Mask' is.

When you perform inference using a transformer model on multiple strings of text (such as the strings above), you need to pass in sequences that are all of the same length, so shorter sequences are padded with filler tokens (the zeros at the end of the second row of 'input_ids' above).

However, the model cannot tell on its own which parts of this matrix are padding.

So, for the model to identify which components are padding and which are not, it uses an 'Attention Mask' matrix, where each real token in the sequence is represented by a 1 and every padding token is represented by a 0.
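
As a rough illustration of the idea above, the following sketch derives an attention mask from the padded 'input_ids' printed earlier. It is plain Python rather than the tokenizer's own implementation, and the helper name `build_attention_mask` is made up for this example. It assumes that 0 is the padding ID, which matches the zeros in the printed output.

```python
# Padded token IDs copied from the tokenizer output above
input_ids = [
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102],
    [101, 1045, 5223, 2023, 2061, 2172, 999, 102, 0, 0, 0, 0, 0, 0, 0, 0],
]

PAD_TOKEN_ID = 0  # assumption: 0 is the padding ID, as in the output above


def build_attention_mask(batch, pad_token_id=PAD_TOKEN_ID):
    """Return 1 for every real token and 0 for every padding token."""
    return [[0 if token_id == pad_token_id else 1 for token_id in row] for row in batch]


attention_mask = build_attention_mask(input_ids)
for row in attention_mask:
    print(row)
```

In practice the tokenizer builds the mask while it pads, so it never has to guess which IDs are padding; deriving the mask from the IDs, as done here, only works when no real token shares the padding ID.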



### Step 2 - Invoke the model

In [31]:
# Update checkpoint
model_checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(model_checkpoint)

Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['pre_classifier.bias', 'classifier.bias', 'classifier.weight', 'pre_classifier.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [32]:
# Perform inference on the model (i.e. generate desired outputs by passing in the inputs)
outputs = model(**inputs)

In [33]:
# Inspect model dimensions
print(outputs.last_hidden_state.shape)

torch.Size([2, 16, 768])


NOTE - 'last_hidden_state.shape' tells us 3 things about the model output:
1. Batch size - Number of sequences processed at a time (2 in our example, due to the 2 sequences of text we passed in)
2. Sequence length - Length of the numerical representation of each sequence (the length of the tokenized array passed in - 16 in our example)
3. Hidden size - Dimension of the vector the model produces for each token (768 in our example, i.e. the width of the hidden layers in the underlying neural network)

In [34]:
# Load the same checkpoint with a sequence classification head, then show the shape of its output values
model_checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)
outputs = model(**inputs)
print(outputs.logits.shape)

torch.Size([2, 2])


In this case, the two dimensions represent:
1. Number of input sequences (in this case 2 - for the 2 sentences we passed into the model)
2. Number of output labels (in this case 2 - NEGATIVE or POSITIVE)

In [35]:
# View raw model outputs
print(outputs.logits)

tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)


### Step 3 - Postprocessing the Output

Now that we have the raw output values from the model, we need to translate them into a human-readable format.

In [21]:
# Perform softmax transformation on output variables
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[4.0195e-02, 9.5981e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)


In [20]:
# Convert model ids to labels
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

This tells us that 0 corresponds to a negative sentiment and 1 corresponds to a positive sentiment.

Therefore, we can read the output matrix to mean the following:
1. First sentence: Probability of Negative Sentiment: 0.0402, Probability of Positive Sentiment: 0.9598
2. Second sentence: Probability of Negative Sentiment: 0.9995, Probability of Positive Sentiment: 0.0005
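
To make the post-processing step concrete, here is a minimal sketch that reproduces it in plain Python, using the logits and the 'id2label' mapping printed above. The `softmax` helper is written out by hand for illustration and is not the PyTorch implementation, although it should produce matching values.

```python
import math

# Raw logits copied from the model output above (one row per sentence)
logits = [
    [-1.5607, 1.6123],
    [4.1692, -3.3464],
]

# Label mapping copied from model.config.id2label above
id2label = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(row):
    """Convert a list of logits into probabilities that sum to 1."""
    largest = max(row)
    exps = [math.exp(value - largest) for value in row]  # subtract the max for numerical stability
    total = sum(exps)
    return [value / total for value in exps]


results = []
for row in logits:
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)  # index of the most likely label
    results.append({"label": id2label[best], "score": probs[best]})

for result in results:
    print(result)
```

The labels and scores should line up with what the pipeline() call returned for the same two sentences in Section 1.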

## SECTION 3 - Fine-Tuning a Pre-Trained Model

Now that we have seen what these models can do and how they work, we will have a go at fine-tuning a pre-trained language model!

To begin with, we will load a dataset that we can use to fine-tune our pre-trained model.

For the purposes of this workshop, we will download the Microsoft Research Paraphrase Corpus (MRPC - one of the 10 'GLUE' academic datasets used to measure model performance) from the HuggingFace dataset repo.

### Dataset Preparation

In [22]:
# MRPC Dataset
raw_datasets = load_dataset("glue", "mrpc")
raw_datasets

Found cached dataset glue (/root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


  0%|          | 0/3 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

As we can see, the dataset is already broken up into train, validation and test sets.

On top of that, each row in the dataset contains 4 elements:
1. A sentence
2. A second sentence, which may or may not be a paraphrase of the first
3. A label indicating whether the two sentences are paraphrases (1) or not (0)
4. An index for each pair of sentences

In [23]:
# Example row from training dataset
raw_train_dataset = raw_datasets["train"]
raw_train_dataset[0]

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0}

In [23]:
# Now that we have the dataset, let's tokenize it
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
tokenized_datasets

Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-734ebcebc978e191.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-a419d1b08f06595c.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-e6b21efb6ccfc370.arrow


DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})

### Train Model

In [24]:
# Now we move on to fine-tuning our model.

# To begin, we need to create a container for all the hyperparameters the Trainer will use for training and evaluation
training_args = TrainingArguments("test-trainer")

In [25]:
# Next, we define our model
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [26]:
# Define the parameters we will use for the model training
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

In [27]:
# Train the model (this can take a while)
trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 500  | 0.5171        |
| 1000 | 0.3051        |


TrainOutput(global_step=1377, training_loss=0.3437755116536813, metrics={'train_runtime': 216.4853, 'train_samples_per_second': 50.83, 'train_steps_per_second': 6.361, 'total_flos': 406183858377360.0, 'train_loss': 0.3437755116536813, 'epoch': 3.0})
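
The step count and throughput figures in this output can be reproduced with some quick arithmetic. The sketch below assumes the 'TrainingArguments' defaults of a batch size of 8 per device and 3 training epochs, running on a single device; the batch size is not shown in the output above, so treat it as an assumption.

```python
import math

num_train_rows = 3668     # size of the MRPC training split shown earlier
batch_size = 8            # assumption: default per-device batch size, single device
num_epochs = 3            # matches 'epoch': 3.0 in the output above
train_runtime = 216.4853  # seconds, copied from the output above

steps_per_epoch = math.ceil(num_train_rows / batch_size)  # the last batch is only partly full
total_steps = steps_per_epoch * num_epochs

samples_per_second = num_train_rows * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 2), round(steps_per_second, 3))
```

If you change the batch size or the number of epochs in 'TrainingArguments', the total number of steps changes accordingly, which is useful to know when estimating how long a training run will take.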

### Model Evaluation

In [28]:
# View predictions
predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)

(408, 2) (408,)


In [29]:
# Transform predictions into human readable format
preds = np.argmax(predictions.predictions, axis=-1)

In [30]:
# Pull standard evaluation metrics for the MRPC dataset to evaluate the model
metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)

{'accuracy': 0.8504901960784313, 'f1': 0.8967851099830796}

### How to Interpret Text Model Fine-Tuning
Whereas it is easy to see the immediate impact of fine-tuning with image data (you can see the model produce images of things it previously could not), measuring the performance improvement from text fine-tuning can be less obvious.

A good starting point is to use industry-standard metrics against industry-standard datasets. For example, the model we just trained shows an accuracy of ~85.05% and an F1 score (a combined measure of precision and recall) of ~89.68% on the MRPC validation set (your results may vary slightly). Given that the original BERT paper reported an F1 score of 88.9 for this task, we can say that fine-tuning on this dataset brought the model to a comparable level of quality for this specific benchmark.
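
For readers less familiar with these metrics, the sketch below computes accuracy, precision, recall and F1 by hand. F1 is the harmonic mean of precision and recall. The labels and predictions are made up for illustration and are not taken from the MRPC run above.

```python
# Hypothetical labels and predictions, made up for illustration (1 = paraphrase, 0 = not)
references = [1, 1, 1, 1, 0, 0, 1, 0]
predictions = [1, 1, 0, 1, 0, 1, 1, 0]

pairs = list(zip(predictions, references))
true_pos = sum(1 for p, r in pairs if p == 1 and r == 1)
false_pos = sum(1 for p, r in pairs if p == 1 and r == 0)
false_neg = sum(1 for p, r in pairs if p == 0 and r == 1)

accuracy = sum(1 for p, r in pairs if p == r) / len(pairs)
precision = true_pos / (true_pos + false_pos)  # share of predicted positives that were right
recall = true_pos / (true_pos + false_neg)     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(accuracy, precision, recall, f1)
```

Because F1 ignores true negatives, it can differ noticeably from accuracy on datasets where one label is much more common than the other, which is one reason both numbers are reported for MRPC.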

However, you will likely also need to consider additional metrics and measurement methods for your specific use case. For example, if you are trying to get your model to identify particular terms in lease documents across multiple state jurisdictions, you might consider calculating F1 scores for each jurisdiction separately rather than in aggregate. Furthermore, for this lease document example, you would need to build training and testing datasets from data specific to your use case.


In [31]:
# A more complete set of evaluation metrics
def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

In [32]:
# If we want to compute these metrics as part of the training process, we can declare it within the 'trainer' object next time we run it
training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

NOTE - if you would like to learn more about how to perform model fine-tuning without relying on the Trainer class (i.e. writing the training loop yourself in PyTorch), see this link: https://huggingface.co/learn/nlp-course/chapter3/4?fw=pt

## APPENDIX - Loading and Configuring Models

Now that we have seen what the model pipeline looks like under the hood, we will learn about how to load and configure different transformer models.

For our purposes, we will use the BERT model from Google.

In [33]:
# Building the config
config = BertConfig()

# Building the model from the config
model = BertModel(config)

# View BERT configuration:
print(config)

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.28.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}



This describes the different components of the model.

While it is beyond the scope of this workshop to explain each element, I recommend reading into each of them further if you are interested.
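
To give a feel for how these settings translate into model size, the sketch below estimates the number of trainable parameters implied by the configuration above. It is a back-of-the-envelope calculation that assumes the standard BERT layout (embeddings, 12 encoder layers and a pooling layer); it is not a call into the library. In the notebook, you could compare the result with `sum(p.numel() for p in model.parameters())`.

```python
# Values copied from the BertConfig printed above
hidden_size = 768
intermediate_size = 3072
num_hidden_layers = 12
vocab_size = 30522
max_position_embeddings = 512
type_vocab_size = 2

layer_norm = 2 * hidden_size  # one weight and one bias per hidden unit


def linear(n_in, n_out):
    """Parameter count of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


# Word, position and token-type embedding tables, followed by a layer norm
embeddings = (vocab_size + max_position_embeddings + type_vocab_size) * hidden_size + layer_norm

# Each encoder layer: self-attention (query, key, value, output) and a feed-forward block
attention = 4 * linear(hidden_size, hidden_size) + layer_norm
feed_forward = (
    linear(hidden_size, intermediate_size)
    + linear(intermediate_size, hidden_size)
    + layer_norm
)
encoder = num_hidden_layers * (attention + feed_forward)

# Pooling layer applied on top of the encoder output
pooler = linear(hidden_size, hidden_size)

total_parameters = embeddings + encoder + pooler
print(f"{total_parameters:,}")
```

Under these assumptions the estimate comes out at roughly 109 million parameters, with most of them sitting in the encoder layers and the word embedding table.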

Note, however, that a model built from a config in this way is randomly initialized; it has not been trained yet.

In [34]:
# If we don't want to start with random starting values and instead start
# with pre-trained weights, we can do the following:
model = BertModel.from_pretrained("bert-base-cased")

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [35]:
# Once we have a model, we can save it to our local directory as follows
model.save_pretrained("saved_model")

# You will now see a new directory called 'saved_model' containing your model in your file directory

## APPENDIX - Tokenizers

We talked before about how you need to turn words and sentences into integers through the process of Tokenization for the model to accept the input.

There are different approaches you can take to tokenization, including the following:

In [36]:
# Example 1 - Word-based tokenization (splits on spaces)
raw_text = "Jim Henson was a puppeteer"

tokenized_text = raw_text.split()
print(tokenized_text)

['Jim', 'Henson', 'was', 'a', 'puppeteer']


In [37]:
# Example 2 - Character-based tokenization (split on each character)
tokenized_text = list(raw_text)
print(tokenized_text)

['J', 'i', 'm', ' ', 'H', 'e', 'n', 's', 'o', 'n', ' ', 'w', 'a', 's', ' ', 'a', ' ', 'p', 'u', 'p', 'p', 'e', 't', 'e', 'e', 'r']


There are many other methods, including:
1. Byte-level BPE, used in GPT-2
2. WordPiece, used in BERT
3. SentencePiece/Unigram, used in several multilingual models

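You should choose the tokenization method most appropriate for your use case. When working with a pre-trained model, this means using the tokenizer that matches the model checkpoint, since the model was trained on that tokenizer's vocabulary.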
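
To show how a subword method differs from the word-based and character-based examples above, here is a simplified sketch of WordPiece-style tokenization, which repeatedly takes the longest piece of a word found in the vocabulary. The vocabulary `toy_vocab` and the function `wordpiece_tokenize` are made up for this example; a real BERT tokenizer uses a learned vocabulary of roughly 30,000 entries and may split these words differently.

```python
# Toy vocabulary, made up for illustration.
# The '##' prefix marks a piece that continues a word rather than starting one.
toy_vocab = {"jim", "hen", "##son", "was", "a", "puppet", "##eer", "[UNK]"}


def wordpiece_tokenize(word, vocab, unknown_token="[UNK]"):
    """Split one word into the longest matching vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it appears in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unknown_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


raw_text = "Jim Henson was a puppeteer"
tokens = []
for word in raw_text.lower().split():
    tokens.extend(wordpiece_tokenize(word, toy_vocab))
print(tokens)
```

The appeal of this approach is that rare words such as 'puppeteer' can be built from smaller known pieces, so the vocabulary stays manageable without falling back to single characters.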

## APPENDIX - Padding

As we saw earlier in Section 2, when you pass inputs into a model they need to all be of the same length, even if all the input strings themselves are not the same length on their own.

The process of adding filler data that the model ignores is called 'padding', and we can see an example of it below.

In [38]:
# Suppose you have a set of 2 inputs of different lengths like so 
batched_ids = [
    [200, 200, 200],
    [200, 200]
]

In [39]:
# We can choose to manually set a value
padding_id = 100

batched_ids = [
    [200, 200, 200],
    [200, 200, padding_id],
]

In [40]:
# Or we can use a default padding value from the tokenizer function we set earlier
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]
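
Rather than padding by hand, you would normally do this programmatically. The sketch below is a simplified stand-in for what a padding collator does, not the library's implementation: the function name `pad_batch` is hypothetical, and the padding ID of 0 is an assumption based on the 'pad_token_id' shown in the BERT configuration earlier.

```python
def pad_batch(batch, pad_token_id):
    """Pad every sequence with pad_token_id up to the length of the longest one."""
    max_length = max(len(sequence) for sequence in batch)
    padded = [sequence + [pad_token_id] * (max_length - len(sequence)) for sequence in batch]
    # Mark real tokens with 1 and padding with 0 so the model can ignore the filler
    attention_mask = [[1] * len(sequence) + [0] * (max_length - len(sequence)) for sequence in batch]
    return padded, attention_mask


batched_ids = [
    [200, 200, 200],
    [200, 200],
]
padded_ids, attention_mask = pad_batch(batched_ids, pad_token_id=0)  # assumption: padding ID of 0
print(padded_ids)
print(attention_mask)
```

Padding only to the longest sequence in each batch, rather than to a fixed maximum length, keeps batches as small as possible; this is the idea behind the 'DataCollatorWithPadding' used in Section 3.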