In [1]:
%pip install --upgrade torch==2.1.0 torchvision==0.16 transformers sagemaker sentencepiece accelerate datasets

Collecting torch==2.1.0
  Downloading torch-2.1.0-cp310-cp310-manylinux1_x86_64.whl.metadata (25 kB)
Collecting torchvision==0.16
  Downloading torchvision-0.16.0-cp310-cp310-manylinux1_x86_64.whl.metadata (6.6 kB)
Collecting transformers
  Downloading transformers-4.46.3-py3-none-any.whl.metadata (44 kB)
Collecting sagemaker
  Downloading sagemaker-2.235.2-py3-none-any.whl.metadata (16 kB)
Collecting sentencepiece
  Downloading sentencepiece-0.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.7 kB)
Collecting accelerate
  Downloading accelerate-1.1.1-py3-none-any.whl.metadata (19 kB)
Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch==2.1.0)
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch==2.1.0)
  Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.

|Topic|URL(s)|
|--|--|
|Foundation models available on Amazon SageMaker|https://docs.aws.amazon.com/sagemaker/latest/dg/jumpstart-foundation-models.html|
|List of pretrained models|https://sagemaker.readthedocs.io/en/stable/doc_utils/pretrainedmodels.html|
|SageMaker APIs|<ul><li>https://aws.amazon.com/developer/tools/</li><li>https://sagemaker.readthedocs.io/en/stable/overview.html</li></ul>|
|Fine-tuning models using the JumpStart Estimator|https://docs.aws.amazon.com/sagemaker/latest/dg/jumpstart-foundation-models-use-python-sdk-estimator-class.html|
|Fine-tuning models using domain adaptation|https://docs.aws.amazon.com/sagemaker/latest/dg/jumpstart-foundation-models-fine-tuning-domain-adaptation.html|
|Fine-tuning models using instruction prompts|https://docs.aws.amazon.com/sagemaker/latest/dg/jumpstart-foundation-models-fine-tuning-instruction-based.html|
|Hugging Face page for the DistilRoBERTa model used in this notebook|https://huggingface.co/distilbert/distilroberta-base|


For this experiment we use `distilbert/distilroberta-base`, a distilled version of RoBERTa-base that can be fine-tuned and is available on the Hugging Face Hub: https://huggingface.co/distilbert/distilroberta-base

**As always, read the model's usage considerations, bias, risks, limitations, and training details before using it.**

In [3]:
model_id = "distilbert/distilroberta-base"

In [4]:
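import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Download the tokenizer that matches the checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
# NOTE: the RoBERTa tokenizer already defines a dedicated <pad> token; this line reuses </s> (EOS) for padding instead
tokenizer.pad_token = tokenizer.eos_token
# Optional: load the masked-LM model directly instead of going through a pipeline
#model = AutoModelForMaskedLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True)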

tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/480 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

In [24]:
# Use a pipeline as a high-level helper
from transformers import pipeline
unmask_pipe = pipeline("fill-mask", model=model_id)

Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


##### The following examples use masked text, where `<mask>` is a placeholder for a "fill in the blank" answer. Note how poor the answers are without fine-tuning. Refer to the model card on Hugging Face for details about the training corpus.

In [35]:
input_text = "<mask> is the largest country in the world by area"
unmask_pipe(input_text, top_k=5)

[{'score': 0.14858654141426086,
  'token': 8481,
  'token_str': 'China',
  'sequence': 'China is the largest country in the world by area'},
 {'score': 0.08127991855144501,
  'token': 31004,
  'token_str': 'Brazil',
  'sequence': 'Brazil is the largest country in the world by area'},
 {'score': 0.06448844075202942,
  'token': 11015,
  'token_str': 'India',
  'sequence': 'India is the largest country in the world by area'},
 {'score': 0.049340859055519104,
  'token': 33133,
  'token_str': 'Turkey',
  'sequence': 'Turkey is the largest country in the world by area'},
 {'score': 0.037434160709381104,
  'token': 20700,
  'token_str': 'Pakistan',
  'sequence': 'Pakistan is the largest country in the world by area'}]
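
The `score` values in this output are probabilities. As far as I understand the fill-mask pipeline, it applies a softmax to the model's vocabulary logits at the `<mask>` position and returns the `top_k` most probable tokens. The sketch below reproduces that step on a toy example. The four-token vocabulary and the logits are made-up values for illustration, not model output, and `top_k_from_logits` is a hypothetical helper name.

```python
import math

def top_k_from_logits(logits, vocab, k=5):
    """Softmax over vocabulary logits, then keep the k most probable tokens."""
    peak = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return [{"token_str": tok, "score": p} for tok, p in ranked[:k]]

# Hypothetical toy vocabulary and logits (assumed values, not taken from the model)
toy_vocab = ["China", "Brazil", "India", "Turkey"]
toy_logits = [2.0, 1.4, 1.2, 0.9]

toy_top = top_k_from_logits(toy_logits, toy_vocab, k=2)
for entry in toy_top:
    print(entry["token_str"], round(entry["score"], 3))
```

Because the softmax runs over the whole vocabulary, the five scores returned by the pipeline do not sum to 1. A low top score, as in the output above, means the probability mass is spread over many candidate tokens.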

In [37]:
input_text = "<mask> is the largest city in the world by area"
unmask_pipe(input_text, top_k=5)

[{'score': 0.2082318216562271,
  'token': 23122,
  'token_str': 'London',
  'sequence': 'London is the largest city in the world by area'},
 {'score': 0.1920369565486908,
  'token': 32826,
  'token_str': 'Paris',
  'sequence': 'Paris is the largest city in the world by area'},
 {'score': 0.02755577303469181,
  'token': 20983,
  'token_str': 'Manchester',
  'sequence': 'Manchester is the largest city in the world by area'},
 {'score': 0.027195390313863754,
  'token': 30358,
  'token_str': 'Toronto',
  'sequence': 'Toronto is the largest city in the world by area'},
 {'score': 0.024872828274965286,
  'token': 41384,
  'token_str': 'Moscow',
  'sequence': 'Moscow is the largest city in the world by area'}]
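
The raw result dictionaries are verbose. A small helper can reduce them to token and score pairs that are easier to scan. This is a minimal sketch: `summarize` is a hypothetical name, and `sample_results` holds two entries copied from the output above so the snippet runs without the model.

```python
# Two entries copied from the pipeline output above
sample_results = [
    {"score": 0.2082318216562271, "token": 23122, "token_str": "London",
     "sequence": "London is the largest city in the world by area"},
    {"score": 0.1920369565486908, "token": 32826, "token_str": "Paris",
     "sequence": "Paris is the largest city in the world by area"},
]

def summarize(results, digits=3):
    """Return (token, rounded score) pairs from a fill-mask result list."""
    return [(r["token_str"].strip(), round(r["score"], digits)) for r in results]

summary = summarize(sample_results)
print(summary)
```

In the notebook the same helper can wrap a live call, for example `summarize(unmask_pipe(input_text, top_k=5))`.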

##### The following cell asks the same question with the mask moved to the end of the sentence. Notice how the answers change.

In [38]:
input_text = "The largest city in the world by area is <mask>"
unmask_pipe(input_text, top_k=5)

[{'score': 0.07826175540685654,
  'token': 6798,
  'token_str': ' Dubai',
  'sequence': 'The largest city in the world by area is Dubai'},
 {'score': 0.06888768821954727,
  'token': 12275,
  'token_str': ' Istanbul',
  'sequence': 'The largest city in the world by area is Istanbul'},
 {'score': 0.0451258048415184,
  'token': 16680,
  'token_str': ' Jakarta',
  'sequence': 'The largest city in the world by area is Jakarta'},
 {'score': 0.04360025003552437,
  'token': 4612,
  'token_str': ' Barcelona',
  'sequence': 'The largest city in the world by area is Barcelona'},
 {'score': 0.03441775590181351,
  'token': 5729,
  'token_str': ' Mumbai',
  'sequence': 'The largest city in the world by area is Mumbai'}]
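
One way to quantify how much the answers change between the two phrasings is to compare the sets of predicted tokens. The sketch below uses the token strings from the two outputs above. `candidate_set` and `jaccard` are hypothetical helper names. Note that tokens predicted mid-sentence carry a leading space (`' Dubai'`), because the byte-level BPE vocabulary stores the preceding space as part of the token, so the helper strips whitespace before comparing.

```python
def candidate_set(results):
    """Collect predicted tokens, ignoring the leading space that mid-sentence tokens carry."""
    return {r["token_str"].strip() for r in results}

def jaccard(a, b):
    """Jaccard similarity of two sets: 1.0 means identical, 0.0 means disjoint."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Token strings copied from the two outputs above
mask_first = [{"token_str": t} for t in ["London", "Paris", "Manchester", "Toronto", "Moscow"]]
mask_last = [{"token_str": t} for t in [" Dubai", " Istanbul", " Jakarta", " Barcelona", " Mumbai"]]

shared = candidate_set(mask_first) & candidate_set(mask_last)
overlap = jaccard(candidate_set(mask_first), candidate_set(mask_last))
print(shared, overlap)
```

For these two prompts the top-5 candidate sets share no city at all, even though both sentences ask the same question.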

In [39]:
input_text = "United States of America has <mask> states"
unmask_pipe(input_text, top_k=5)

[{'score': 0.04338939115405083,
  'token': 4034,
  'token_str': ' 47',
  'sequence': 'United States of America has 47 states'},
 {'score': 0.03533728048205376,
  'token': 4059,
  'token_str': ' 46',
  'sequence': 'United States of America has 46 states'},
 {'score': 0.024327823892235756,
  'token': 2766,
  'token_str': ' 49',
  'sequence': 'United States of America has 49 states'},
 {'score': 0.023394925519824028,
  'token': 3550,
  'token_str': ' 44',
  'sequence': 'United States of America has 44 states'},
 {'score': 0.022955384105443954,
  'token': 2843,
  'token_str': ' 38',
  'sequence': 'United States of America has 38 states'}]
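
For prompts with a known answer, a quick check is whether the expected token appears in the top-k list at all. The following is a minimal sketch with a hypothetical helper, `answer_rank`, applied to the token strings from the output above.

```python
def answer_rank(results, expected):
    """Return the 1-based rank of `expected` among the predictions, or None if absent."""
    for rank, r in enumerate(results, start=1):
        if r["token_str"].strip() == expected:
            return rank
    return None

# Token strings copied from the output above
us_states = [{"token_str": t} for t in [" 47", " 46", " 49", " 44", " 38"]]

print(answer_rank(us_states, "50"))  # the correct answer is not among the top 5
```

Running the same check after fine-tuning gives a simple before and after comparison for a fixed set of prompts.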

In [41]:
input_texts = ["Influenza is caused by <mask>", "Strep throat is caused by <mask> bacteria",
               "Acetaminophen can be used to treat <mask>", "Example of an antibiotic is <mask>",
               "<mask> is an antibacterial drug"]
unmask_pipe(input_texts, top_k=5)

[[{'score': 0.03974457457661629,
   'token': 16968,
   'token_str': ' vaccines',
   'sequence': 'Influenza is caused by vaccines'},
  {'score': 0.03954038396477699,
   'token': 21717,
   'token_str': ' viruses',
   'sequence': 'Influenza is caused by viruses'},
  {'score': 0.03611234575510025,
   'token': 24994,
   'token_str': ' mosquitoes',
   'sequence': 'Influenza is caused by mosquitoes'},
  {'score': 0.02466738037765026,
   'token': 25393,
   'token_str': ' pesticides',
   'sequence': 'Influenza is caused by pesticides'},
  {'score': 0.0235974732786417,
   'token': 28848,
   'token_str': ' genetics',
   'sequence': 'Influenza is caused by genetics'}],
 [{'score': 0.3872624337673187,
   'token': 8731,
   'token_str': ' gut',
   'sequence': 'Strep throat is caused by gut bacteria'},
  {'score': 0.22904224693775177,
   'token': 39475,
   'token_str': ' intestinal',
   'sequence': 'Strep throat is caused by intestinal bacteria'},
  {'score': 0.02543472684919834,
   'token': 11190,
  

In [42]:
input_text = "<mask> built the great wall of China"
unmask_pipe(input_text, top_k=5)

[{'score': 0.08012279868125916,
  'token': 47458,
  'token_str': 'Lenin',
  'sequence': 'Lenin built the great wall of China'},
 {'score': 0.06752075254917145,
  'token': 12375,
  'token_str': 'Who',
  'sequence': 'Who built the great wall of China'},
 {'score': 0.04769080877304077,
  'token': 8481,
  'token_str': 'China',
  'sequence': 'China built the great wall of China'},
 {'score': 0.0431637316942215,
  'token': 39858,
  'token_str': 'Whoever',
  'sequence': 'Whoever built the great wall of China'},
 {'score': 0.025495178997516632,
  'token': 1185,
  'token_str': 'You',
  'sequence': 'You built the great wall of China'}]

In [43]:
input_text = "<mask> is the largest ocean in the world"
unmask_pipe(input_text, top_k=5)

[{'score': 0.0636025071144104,
  'token': 41496,
  'token_str': 'Ocean',
  'sequence': 'Ocean is the largest ocean in the world'},
 {'score': 0.06237642839550972,
  'token': 243,
  'token_str': 'It',
  'sequence': 'It is the largest ocean in the world'},
 {'score': 0.053864073008298874,
  'token': 34526,
  'token_str': 'Earth',
  'sequence': 'Earth is the largest ocean in the world'},
 {'score': 0.05324801430106163,
  'token': 5860,
  'token_str': ' Ocean',
  'sequence': ' Ocean is the largest ocean in the world'},
 {'score': 0.047932833433151245,
  'token': 37697,
  'token_str': 'Sea',
  'sequence': 'Sea is the largest ocean in the world'}]
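
In the output above, `'Ocean'` and `' Ocean'` are two distinct vocabulary entries (token ids 41496 and 5860) that differ only in a leading space. One possible way to treat them as the same answer is to strip whitespace and sum their scores. This is an assumption about how one might post-process results, not something the pipeline does, and `merge_whitespace_variants` is a hypothetical helper name.

```python
from collections import defaultdict

def merge_whitespace_variants(results):
    """Sum the scores of tokens that differ only in surrounding whitespace."""
    merged = defaultdict(float)
    for r in results:
        merged[r["token_str"].strip()] += r["score"]
    # Highest combined score first
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)

# Scores and token strings copied from the output above
ocean_results = [
    {"token_str": "Ocean", "score": 0.0636025071144104},
    {"token_str": "It", "score": 0.06237642839550972},
    {"token_str": "Earth", "score": 0.053864073008298874},
    {"token_str": " Ocean", "score": 0.05324801430106163},
    {"token_str": "Sea", "score": 0.047932833433151245},
]

merged = merge_whitespace_variants(ocean_results)
for token, score in merged:
    print(token, round(score, 4))
```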

In [46]:
input_texts = ["<mask> discovered Xrays",
               "The name of the scientist who discovered the phenomenon of X-ray was <mask>",
               "The physicist responsible for the discovery of Xrays was <mask>"]
unmask_pipe(input_texts, top_k=5)

[[{'score': 0.2434065043926239,
   'token': 30726,
   'token_str': ' Newly',
   'sequence': ' Newly discovered Xrays'},
  {'score': 0.2107158601284027,
   'token': 3862,
   'token_str': ' newly',
   'sequence': ' newly discovered Xrays'},
  {'score': 0.1592417061328888,
   'token': 45095,
   'token_str': 'Previously',
   'sequence': 'Previously discovered Xrays'},
  {'score': 0.04660941660404205,
   'token': 38386,
   'token_str': 'Recently',
   'sequence': 'Recently discovered Xrays'},
  {'score': 0.022596264258027077,
   'token': 4030,
   'token_str': 'New',
   'sequence': 'New discovered Xrays'}],
 [{'score': 0.20618940889835358,
   'token': 4727,
   'token_str': ' unknown',
   'sequence': 'The name of the scientist who discovered the phenomenon of X-ray was unknown'},
  {'score': 0.12664254009723663,
   'token': 22292,
   'token_str': ' withheld',
   'sequence': 'The name of the scientist who discovered the phenomenon of X-ray was withheld'},
  {'score': 0.10497443377971649,
   'to

In [47]:
input_text = "The atomic weight of selenium is <mask>"
unmask_pipe(input_text, top_k=5)

[{'score': 0.04041153937578201,
  'token': 4727,
  'token_str': ' unknown',
  'sequence': 'The atomic weight of selenium is unknown'},
 {'score': 0.018839875236153603,
  'token': 2284,
  'token_str': ' increasing',
  'sequence': 'The atomic weight of selenium is increasing'},
 {'score': 0.015256152488291264,
  'token': 4276,
  'token_str': ' zero',
  'sequence': 'The atomic weight of selenium is zero'},
 {'score': 0.012496164068579674,
  'token': 20910,
  'token_str': ' decreasing',
  'sequence': 'The atomic weight of selenium is decreasing'},
 {'score': 0.00959750171750784,
  'token': 36334,
  'token_str': ' negligible',
  'sequence': 'The atomic weight of selenium is negligible'}]

In [48]:
input_text = "The deepest point in the Pacific ocean is <mask>"
unmask_pipe(input_text, top_k=5)

[{'score': 0.058124203234910965,
  'token': 27593,
  'token_str': ' Antarctica',
  'sequence': 'The deepest point in the Pacific ocean is Antarctica'},
 {'score': 0.04617989435791969,
  'token': 16974,
  'token_str': ' underwater',
  'sequence': 'The deepest point in the Pacific ocean is underwater'},
 {'score': 0.04363410174846649,
  'token': 1666,
  'token_str': ' ...',
  'sequence': 'The deepest point in the Pacific ocean is ...'},
 {'score': 0.024700311943888664,
  'token': 734,
  'token_str': '...',
  'sequence': 'The deepest point in the Pacific ocean is...'},
 {'score': 0.022545766085386276,
  'token': 1174,
  'token_str': '…',
  'sequence': 'The deepest point in the Pacific ocean is…'}]

In [52]:
input_text = "The most common high temperature deformation mechanism in aluminium alloy \
composites is due to <mask> boundary sliding"
unmask_pipe(input_text, top_k=5)

[{'score': 0.14769397675991058,
  'token': 17210,
  'token_str': ' thermal',
  'sequence': 'The most common high temperature deformation mechanism in aluminium alloy composites is due to thermal boundary sliding'},
 {'score': 0.03588949888944626,
  'token': 12194,
  'token_str': ' vertical',
  'sequence': 'The most common high temperature deformation mechanism in aluminium alloy composites is due to vertical boundary sliding'},
 {'score': 0.017547065392136574,
  'token': 4084,
  'token_str': ' surface',
  'sequence': 'The most common high temperature deformation mechanism in aluminium alloy composites is due to surface boundary sliding'},
 {'score': 0.01499862689524889,
  'token': 25490,
  'token_str': ' horizontal',
  'sequence': 'The most common high temperature deformation mechanism in aluminium alloy composites is due to horizontal boundary sliding'},
 {'score': 0.014807412400841713,
  'token': 4204,
  'token_str': ' metal',
  'sequence': 'The most common high temperature deformati