# Homework 5, Large-Language Models (LLMs)

## Part a: Assessing Prompt Response
One of the most common applications of language models is the question-answering task. Suppose we want to identify three inorganic materials with the lowest thermal conductivity, as this property plays a crucial role in device design across various fields—from electronics to nuclear reactors. Using Transformers, write a script that searches for and prints three inorganic materials with the lowest reported thermal conductivity reported so far. Then, compare the results with experimentally measured values for those materials and analyze the findings.

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer
# Phi-3-mini-4k-instruct LLM
model = AutoModelForCausalLM.from_pretrained(
"microsoft/Phi-3-mini-4k-instruct",
device_map="cuda",
torch_dtype="auto",
trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/967 [00:00<?, ?B/s]

configuration_phi3.py: 0.00B [00:00, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- configuration_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


modeling_phi3.py: 0.00B [00:00, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- modeling_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
`torch_dtype` is deprecated! Use `dtype` instead!


model.safetensors.index.json: 0.00B [00:00, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/2.67G [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/181 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

added_tokens.json:   0%|          | 0.00/306 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/599 [00:00<?, ?B/s]

We provide the prompt, tokenize it, those tokens will be passed to the model, which generates its output.

The next cell may take a few minutes to run.

In [2]:
prompt = "Name three inorganic materials that have shown the lowest thermal conductivity by far"

# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
# Generate the text
generation_output = model.generate(
  input_ids=input_ids,
  max_new_tokens=300,  # can play with this value
  use_cache=False)

# Print the output
print(tokenizer.decode(generation_output[0]))


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Name three inorganic materials that have shown the lowest thermal conductivity by far.

Thermal conductivity is a measure of a material's ability to conduct heat. Inorganic materials with low thermal conductivity are often used as insulators. Here are three inorganic materials known for their low thermal conductivity:

1. **Aerogel (Silica Aerogel)**: Aerogels are a group of ultralight materials derived from a gel in which the liquid component of the gel has been replaced with a gas. Silica aerogel, in particular, is known for its extremely low thermal conductivity, which can be as low as 0.014 W/(m·K) at room temperature. This makes it one of the best insulating materials available.

2. **Polyurethane Foam**: While not a traditional inorganic material, polyurethane foam is often used in conjunction with inorganic materials to enhance insulation. It has a low thermal conductivity, typically around 0.025 W/(m·K), making it an effective insulator when used in building applications.

3. *

How well does the answer agree with materials science knowledge?

## Part b: Tokenization
Tokenization is a crucial step in applying language models, as different tokenizers can segment the same text in various ways. Suppose we have the following passage related to materials science and wish to tokenize it using the tokenizers listed below.

After testing these tokenizers, which one would you recommend? Please elaborate on your choice with supporting reasons.

In [3]:
#Comparing Trained LLM Tokenizers
from transformers import AutoModelForCausalLM, AutoTokenizer

def show_tokens(sentence, tokenizer_name):
    my_tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    out_toekns = my_tokenizer.tokenize(sentence)
    print(f'{tokenizer_name} tokenizes my sentence to the following tokens:\n {out_toekns}')

text = """Among the hybrid organic–inorganic perovskites CH3NH3PbX3, the triiodide specimen (CH3NH3PbI3) with a bandgap of 1.6 eV is still the material of choice for solar energy applications."""
#model_list = ["bert-base-uncased","bert-base-cased","gpt2",
#              "google/flan-t5-small","Xenova/gpt-4","bigcode/starcoder2-15b",
#              "facebook/galactica-1.3b","microsoft/Phi-3-mini-4k-instruct",
#              "roberta-base","t5-small","google/byt5-small",
#             "microsoft/mpnet-base","t5-small","google/flan-t5-base"]
model_list = ["gpt2"]
for imodel in model_list:
  show_tokens(text,imodel)

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

gpt2 tokenizes my sentence to the following tokens:
 ['Among', 'Ġthe', 'Ġhybrid', 'Ġorganic', 'âĢĵ', 'in', 'organic', 'Ġper', 'ov', 'sk', 'ites', 'ĠCH', '3', 'NH', '3', 'P', 'b', 'X', '3', ',', 'Ġthe', 'Ġtri', 'iod', 'ide', 'Ġspecimen', 'Ġ(', 'CH', '3', 'NH', '3', 'P', 'b', 'I', '3', ')', 'Ġwith', 'Ġa', 'Ġband', 'gap', 'Ġof', 'Ġ1', '.', '6', 'Ġe', 'V', 'Ġis', 'Ġstill', 'Ġthe', 'Ġmaterial', 'Ġof', 'Ġchoice', 'Ġfor', 'Ġsolar', 'Ġenergy', 'Ġapplications', '.']


## Part c: Masked Language Models
Masked Language Models (MLMs) are widely used to predict masked tokens and can also be adapted for text completion tasks. In materials science, this approach can be applied by providing the language model with material properties, allowing it to predict potential applications. For instance, if we inform a matBERT model that Bi₂Te₃ has a high Seebeck coefficient and low thermal conductivity while masking its application, the model might predict that Bi₂Te₃ is a promising candidate for thermoelectric applications.

Following this approach, choose a material from your own research, describe its properties, and prompt matBERT to predict its possible application. Then, analyze the model’s response and verify whether the predicted application is accurate.

In [5]:
!pwd
!mkdir MATBERT
!ls -ltr
%cd MATBERT/
!pwd
!ls -ltr
mypath = "/content/MATBERT"
!mkdir -p {mypath}/matbert-base-cased
!mkdir -p {mypath}/matbert-base-uncased
!curl -# -o {mypath}/matbert-base-cased/config.json https://cedergroup-share.s3-us-west-2.amazonaws.com/public/MatBERT/model_2Mpapers_cased_30522_wd/config.json
!curl -# -o {mypath}/matbert-base-cased/vocab.txt https://cedergroup-share.s3-us-west-2.amazonaws.com/public/MatBERT/model_2Mpapers_cased_30522_wd/vocab.txt
!curl -# -o {mypath}/matbert-base-cased/pytorch_model.bin https://cedergroup-share.s3-us-west-2.amazonaws.com/public/MatBERT/model_2Mpapers_cased_30522_wd/pytorch_model.bin
!curl -# -o {mypath}/matbert-base-uncased/config.json https://cedergroup-share.s3-us-west-2.amazonaws.com/public/MatBERT/model_2Mpapers_uncased_30522_wd/config.json
!curl -# -o {mypath}/matbert-base-uncased/vocab.txt https://cedergroup-share.s3-us-west-2.amazonaws.com/public/MatBERT/model_2Mpapers_uncased_30522_wd/vocab.txt
!curl -# -o {mypath}/matbert-base-uncased/pytorch_model.bin https://cedergroup-share.s3-us-west-2.amazonaws.com/public/MatBERT/model_2Mpapers_uncased_30522_wd/pytorch_model.bin
from transformers import BertTokenizerFast
tokenizer = BertTokenizerFast.from_pretrained(mypath + '/matbert-base-cased', do_lower_case=False)
tokenizer_bert = BertTokenizerFast.from_pretrained('bert-base-cased', do_lower_case=False)

/content/MATBERT
total 4
drwxr-xr-x 2 root root 4096 Nov 24 21:08 MATBERT
/content/MATBERT/MATBERT
/content/MATBERT/MATBERT
total 0
######################################################################## 100.0%
######################################################################## 100.0%
######################################################################## 100.0%
######################################################################## 100.0%
######################################################################## 100.0%
######################################################################## 100.0%


tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

In [6]:
from transformers import BertForMaskedLM, BertTokenizerFast, pipeline

my_model = BertForMaskedLM.from_pretrained(mypath + '/matbert-base-cased')
my_tokenizer = BertTokenizerFast.from_pretrained(mypath + '/matbert-base-cased', do_lower_case=False)
fill_mask = pipeline(
    "fill-mask",
    model=my_model,
    tokenizer=my_tokenizer
)
# Can change the two lines below, while keeping a "[MASK]" inside the string.
my_candidate = 'Bi2Te3'
fill_mask( my_candidate + 'possesses a high Seebeck coefficient and a low thermal conductivity which make it a promising candidate for [MASK] applications.')

Device set to use cuda:0


[{'score': 0.9161185622215271,
  'token': 12737,
  'token_str': 'thermoelectric',
  'sequence': 'Bi2Te3possesses a high Seebeck coefficient and a low thermal conductivity which make it a promising candidate for thermoelectric applications.'},
 {'score': 0.02452981472015381,
  'token': 6776,
  'token_str': 'TE',
  'sequence': 'Bi2Te3possesses a high Seebeck coefficient and a low thermal conductivity which make it a promising candidate for TE applications.'},
 {'score': 0.00728849321603775,
  'token': 3711,
  'token_str': 'electronic',
  'sequence': 'Bi2Te3possesses a high Seebeck coefficient and a low thermal conductivity which make it a promising candidate for electronic applications.'},
 {'score': 0.004953564144670963,
  'token': 4164,
  'token_str': 'device',
  'sequence': 'Bi2Te3possesses a high Seebeck coefficient and a low thermal conductivity which make it a promising candidate for device applications.'},
 {'score': 0.004639696795493364,
  'token': 5744,
  'token_str': 'practical