# Phi-3

Testing a performant model. A basic "only quote the text" prompt already works pretty well as a baseline.

- Used the synthesis of aspirin as the example document. Might be a good idea to make up a fake chemistry recipe instead?
- Adjusted the prompt until I found an example where the answer is not 100% verbatim from the text.

In [8]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

set_seed(1729)

In [12]:
device = "cuda" if torch.cuda.is_available() else "cpu"

In [13]:
MODEL_NAME = "microsoft/Phi-3-mini-4k-instruct"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map=device,
    torch_dtype="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [20]:
template = """Below is a document, followed by a query which can be answered by reading the document. Answer the query using without rephrasing and using the same words as the original document. Do not reformulate or explain your answer.

## DOCUMENT
{document}

## QUERY
{query}
"""

In [29]:
document = """1. Place 2.0 g (0.015 mole) of salicylic acid in a 125-mL Erlenmeyer flask.
2. Add 5 mL (0.05 mole) of acetic anhydride, followed by 5 drops of conc. H2SO4 (use a dropper, H2SO4 is highly corrosive) and swirl the flask gently until the salicylic acid dissolves.
3. Heat the flask gently on the steam bath for at least 10 minutes.
4. Allow the flask to cool to room temperature. If acetylsalicylic acid does not begin to crystallize out, scratch the walls of the flask with a glass rod. Cool the mixture slightly in an ice bath until crystallization is completed. The product will appear as a solid mass when crystallization is completed.
5. Add 50 mL of water and cool the mixture in an ice bath. Do not add the water until crystal formation is complete.
6. Vacuum filter the product using a Buchner funnel. You can use some of the filtrate to rinse the Erlenmeyer flask if necessary.
7. Rinse the crystals several times with small portions (5 mL) of cold water and air dry the crystals on a Buchner funnel by suction until the crystals appear to be free of solvent. Test this crude product for the presence of unreacted salicylic acid using the ferric chloride test. Record the weight of the crude solid which probably contains water.
8. Stir the crude solid with 25 mL of a saturated aqueous sodium bicarbonate solution in a 150 mL beaker until all signs of reaction have ceased (evolution of CO2 ceases).
9. Filter the solution through a Buchner funnel to remove any insoluble impurities or polymers that may have been formed. Wash the beaker and the funnel with 5 to 10 mL of water.
10. Carefully pour the filtrate with stirring, a small amount at a time, into an ice cold HCl solution (ca 3.5 mL of conc. HCl in 10 mL of water) in a 150-mL beaker and cool the mixture in an ice bath. Make sure that the resulting solution is acidic (blue litmus paper) and that the aspirin has completely precipitated out.
11. Filter the solid by suction and wash the crystals 3X with 5 mL of cold water each. Remove all the liquid from the crystals by pressing with a clean stopper or cork. Air dry the crystals and transfer them to a watch glass to dry. Test a small amount of the product for the presence of unreacted salicylic acid using the ferric chloride solution.
12. When the product is completely dry, weigh the product, determine its melting point (lit mp 135-136 °C) and calculate the percentage yield.
13. Dissolve the final product in a minimum amount (no more than 2-3 mL) of hot ethyl acetate in a 25 mL Erlenmeyer flask. Make sure that the product is completely dissolved while gently and continuously heating on a steam bath.
14. Cool the solution to room temperature and then in a ice-bath. Collect the product by vacuum filtration and rinse out of the flask with a few milliliters of cold petroleum ether.
15. When the product is completely dry, weigh its weight, determine its melting point (lit mp 135 °C) and calculate the percentage yield of this recrystallized product. Calculate the % recovery of recrystallized material from crude material. Submit the crystalline sample in a small vial with proper labeling to your instructor
"""

# --------


query = "How should I wash the crystals?"

# ---------


prompt = template.format(document=document, query=query)

messages = [ 
    {"role": "system", "content": "You are helpful asssistant that finds answers to questions and queries in documents."}, 
    {"role": "user", "content": prompt}, 
] 

generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "temperature": 0.0,
    "do_sample": False,  # greedy decoding, so the output is deterministic
}

In [30]:
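from transformers import pipeline

# Assumed definition: the cell that created `pipe` is not shown in this export.
# A text-generation pipeline applies the model's chat template to `messages`.
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

output = pipe(messages, **generation_args)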

print(output[0]['generated_text'])

 Wash the crystals several times with small portions (5 mL) of cold water.
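
The baseline answer above reads well but is not a verbatim quote: the document says "Rinse", the model says "Wash". A minimal sketch of how that could be checked automatically, assuming a plain substring test after whitespace normalization is good enough. `normalize` and `is_verbatim` are hypothetical helper names, and `source` is only step 7 of the document, shortened to its first sentence.

```python
import re


def normalize(text: str) -> str:
    """Collapse runs of whitespace so line breaks do not cause false negatives."""
    return re.sub(r"\s+", " ", text).strip()


def is_verbatim(answer: str, source: str) -> bool:
    """Return True if the answer appears in the source as one contiguous span."""
    return normalize(answer) in normalize(source)


# First sentence of step 7 from the document above (excerpt only).
source = (
    "7. Rinse the crystals several times with small portions (5 mL) of cold water "
    "and air dry the crystals on a Buchner funnel by suction until the crystals "
    "appear to be free of solvent."
)
# Baseline model output from the cell above.
answer = " Wash the crystals several times with small portions (5 mL) of cold water."

print(is_verbatim(answer, source))                              # False
print(is_verbatim("Rinse the crystals several times", source))  # True
```

This works on characters, not tokens, so it is only a post-hoc check and does not constrain generation.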


# world's worst trie

In [31]:
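def create_trie(context_tokens: list[int]) -> dict[int, dict]:
    """Build a suffix trie so that every contiguous token span of the context is a path from the root."""
    trie = {}
    # Inserting every suffix costs O(n^2) time and memory, which is fine for a short document.
    for start in range(len(context_tokens)):
        node = trie
        for token in context_tokens[start:]:
            if token not in node:
                node[token] = {}
            node = node[token]
    return trie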

In [32]:
from functools import reduce

In [33]:
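def valid_next_tokens(trie: dict[int, dict], prefix: list[int]) -> list[int]:
    """Return the tokens that can follow `prefix` in the context; empty if `prefix` is not a span of it."""
    # Walk the trie along the prefix; a missing key falls through to an empty node.
    return list(reduce(lambda d, k: d.get(k, {}), prefix, trie).keys())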
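
A quick sanity check of the two helpers on a toy token sequence, to see what the trie allows. The token ids are made up for illustration and the functions are repeated so the snippet runs on its own.

```python
from functools import reduce


def create_trie(context_tokens: list[int]) -> dict[int, dict]:
    trie = {}
    for start in range(len(context_tokens)):
        node = trie
        for token in context_tokens[start:]:
            if token not in node:
                node[token] = {}
            node = node[token]
    return trie


def valid_next_tokens(trie: dict[int, dict], prefix: list[int]) -> list[int]:
    return list(reduce(lambda d, k: d.get(k, {}), prefix, trie).keys())


# Made-up token ids; the span [1, 2] occurs twice with different continuations.
tokens = [1, 2, 3, 1, 2, 4]
trie = create_trie(tokens)

print(sorted(valid_next_tokens(trie, [])))      # [1, 2, 3, 4]  any token can start a span
print(sorted(valid_next_tokens(trie, [1, 2])))  # [3, 4]        both continuations of [1, 2]
print(valid_next_tokens(trie, [2, 4]))          # []            end of the context reached
print(valid_next_tokens(trie, [4, 1]))          # []            [4, 1] never occurs in the context
```

An empty result means the only way forward is whatever the caller adds on top, which in the logits processor is the EOS token.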

In [34]:
from transformers.generation import LogitsProcessor, LogitsProcessorList

In [36]:
class ExtractiveGeneration(LogitsProcessor):
    def __init__(self, input_start_len: int, context_tokens: list[int], eos_token_id: int | list[int]) -> None:
        self.trie = create_trie(context_tokens)
        self.input_start_len = input_start_len
        self.eos_token_id = eos_token_id
        if not isinstance(self.eos_token_id, list):
            self.eos_token_id = [self.eos_token_id]

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        # Keep only the tokens generated so far, one row per beam / batch item.
        beam_prefixes = input_ids[:, self.input_start_len :]

        for i, prefix in enumerate(beam_prefixes):
            # Tokens that continue some span of the document, plus EOS so generation can stop.
            options = valid_next_tokens(self.trie, prefix.tolist())
            options.extend(self.eos_token_id)
            options = torch.tensor(options, dtype=torch.long, device=input_ids.device)
            # Set every other vocabulary entry to -inf so it can never be selected.
            mask = torch.isin(torch.arange(scores[i].numel(), device=input_ids.device), options)
            scores[i][~mask] = float("-inf")
        return scores
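
To see the effect of the masking without loading a model, here is a torch-free sketch of the same idea: a greedy loop where a stand-in scoring function plays the role of the model. Everything here is an assumption for illustration (`allowed_tokens`, `mask_scores`, `constrained_greedy`, `toy_scores`, the 10-token vocabulary, `EOS = 0`); it is not how `model.generate` is implemented, only the constraint it ends up enforcing.

```python
from functools import reduce


def create_trie(tokens: list[int]) -> dict[int, dict]:
    trie = {}
    for start in range(len(tokens)):
        node = trie
        for token in tokens[start:]:
            node = node.setdefault(token, {})
    return trie


def allowed_tokens(trie: dict, prefix: list[int], eos_id: int) -> set[int]:
    """Tokens that continue a span of the context, plus EOS so decoding can stop."""
    node = reduce(lambda d, k: d.get(k, {}), prefix, trie)
    return set(node) | {eos_id}


def mask_scores(scores: list[float], allowed: set[int]) -> list[float]:
    """Set the score of every disallowed token to -inf."""
    return [s if i in allowed else float("-inf") for i, s in enumerate(scores)]


def constrained_greedy(score_fn, trie: dict, eos_id: int, max_new_tokens: int) -> list[int]:
    generated: list[int] = []
    for _ in range(max_new_tokens):
        masked = mask_scores(score_fn(generated), allowed_tokens(trie, generated, eos_id))
        best = max(range(len(masked)), key=masked.__getitem__)
        if best == eos_id:
            break
        generated.append(best)
    return generated


EOS = 0                  # hypothetical EOS id
context = [5, 6, 7, 8]   # made-up "document" tokens
trie = create_trie(context)


def toy_scores(prefix: list[int]) -> list[float]:
    # Stand-in for the model: it always likes token 9 best, which is not in the context.
    scores = [0.0] * 10
    scores[9], scores[6], scores[7], scores[EOS] = 5.0, 4.0, 3.0, 1.0
    return scores


print(constrained_greedy(toy_scores, trie, EOS, max_new_tokens=5))  # [6, 7]
```

Unconstrained, this toy model would emit token 9 at every step. With the mask it starts at 6, has to follow the context to 7, and then stops because EOS outscores the only remaining continuation (8).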

In [37]:
inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)


In [38]:
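# Tokenize the document without special tokens so that BOS does not end up in the trie as a valid token.
extractive_generation = ExtractiveGeneration(inputs.shape[-1], tokenizer(document, add_special_tokens=False)["input_ids"], tokenizer.eos_token_id)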

In [39]:
logits_processor = LogitsProcessorList([extractive_generation])

In [53]:
response = model.generate(
    inputs,  # reuse the encoded prompt whose length was passed to ExtractiveGeneration
    #max_new_tokens=100,
    max_new_tokens=50,
    #eos_token_id=tokenizer.eos_token_id,
    logits_processor=logits_processor,
    #num_beams=3,
)

In [54]:
tokenizer.decode(response[0, :])

'Below is a document, followed by a query which can be answered by reading the document. Answer the query using without rephrasing and using the same words as the original document. Do not reformulate or explain your answer.\n\n## DOCUMENT\n1. Place 2.0 g (0.015 mole) of salicylic acid in a 125-mL Erlenmeyer flask.\n2. Add 5 mL (0.05 mole) of acetic anhydride, followed by 5 drops of conc. H2SO4 (use a dropper, H2SO4 is highly corrosive) and swirl the flask gently until the salicylic acid dissolves.\n3. Heat the flask gently on the steam bath for at least 10 minutes.\n4. Allow the flask to cool to room temperature. If acetylsalicylic acid does not begin to crystallize out, scratch the walls of the flask with a glass rod. Cool the mixture slightly in an ice bath until crystallization is completed. The product will appear as a solid mass when crystallization is completed.\n5. Add 50 mL of water and cool the mixture in an ice bath. Do not add the water until crystal formation is complete.\

In [55]:
print(tokenizer.decode(response[0, inputs.shape[-1]:]))


11. Filter the solid by suction and wash the crystals 3X with 5 mL of cold water each. Remove all the liquid from the crystals by pressing with a clean stopper or cork.
