## Fine-tune an LLM for Reference Parsing 


- For this proof of concept we use only 1000 examples (800 for training) and train for only 3 epochs. 

- To do: 

     - build a better training dataset 
     - come up with a better way to check for possible hallucinations 
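
One possible starting point for the hallucination check, given here only as a rough sketch: strip the html-like tags from the model output and compare what remains with the input reference string, ignoring whitespace. The helper names (`strip_tags`, `is_faithful`) and the sample strings are hypothetical, and the check is deliberately strict; a reference that legitimately contains angle-bracket markup would need a narrower tag pattern.

```python
import re

# Matches html-like tags such as <author>, </container-title> or <URL>.
TAG_PATTERN = re.compile(r"</?[A-Za-z][A-Za-z0-9_-]*>")


def strip_tags(tagged: str) -> str:
    """Remove html-like tags and keep the text between them."""
    return TAG_PATTERN.sub("", tagged)


def normalize(text: str) -> str:
    """Drop all whitespace so spacing differences are ignored."""
    return re.sub(r"\s+", "", text)


def is_faithful(ref_string: str, tagged: str) -> bool:
    """True if the tagged output, with tags removed, reproduces the input."""
    return normalize(strip_tags(tagged)) == normalize(ref_string)


ref = "... 2012. CUHSO 22-1 Completo. Cultura - Hombre - Sociedad CUHSO. 221."
faithful = ("<title>...</title> <year>2012</year>. "
            "<container-title>CUHSO 22-1 Completo</container-title>. "
            "<genre>Cultura - Hombre - Sociedad</genre> "
            "<publisher>CUHSO</publisher>. <volume>22</volume><issue>1</issue>.")
invented = ("<author>Rawal, V. & Bansal, V.</author> <year>2021</year>. "
            "<title>The Land Question in Contemporary Rural India</title>")

print(is_faithful(ref, faithful))  # True
print(is_faithful(ref, invented))  # False
```

A softer variant could report a similarity ratio instead of a strict boolean, so that truncated outputs are distinguished from invented ones.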
 
 


## 1. Prepare dataset 

In [1]:
import pandas as pd 
from sklearn.model_selection import train_test_split

from datasets import Dataset

import textwrap

import random 


In [2]:
## import dataset 

xfile = './dataset/reference_parsing.zip'



# for testing 
t0 = pd.read_csv(xfile, sep = '\t') # , nrows=5)
t1 = t0.sample(n = 1000, random_state = 632) 


print('dataset:', len(t0))

print('sample:', len(t1))


t1.head()

dataset: 10000
sample: 1000


Unnamed: 0,record_nr,doi,ref_type,ref_style,ref_string,ref_annotated
9522,9522,10.1007/978-3-642-23623-5_36,chapter,acta-anaesthesiologica-scandinavica,"Parthasarathy V, Hatt C, Stankovic Z, Raval A,...","<author>Parthasarathy V, Hatt C, Stankovic Z, ..."
1082,1082,10.1049/IP-VIS:20050217,article-journal,ambio,"Li, M.-B., G.-B. Huang, P. Saratchandran, and ...","<author>Li, M.-B., G.-B. Huang, P. Saratchandr..."
6107,6107,10.1016/S0959-8049(11)71809-3,article-journal,acta-universitatis-agriculturae-et-silvicultur...,"De luliis, F., Russo, I., Di Trapani, M. C., C...","<author>De luliis, F., Russo, I., Di Trapani, ..."
6466,6466,10.1117/12.367560,paper-conference,administrative-science-quarterly,"Dong, M., X. Chen, and H. Deng1999 “<title&rt;...","<author>Dong, M., X. Chen, and H. Deng</author..."
1946,1946,10.1021/BI200600K,article-journal,acta-scientiae-veterinariae,"Kallio, P., P. Patrikainen, J.-P. Suomela, P. ...","<author>Kallio, P., P. Patrikainen, J.-P. Suom..."


In [8]:
## create train, test and validation dataset 

# First split: Separate out the training set (80%) and a temporary set (20%)
train_df, temp_df = train_test_split(t1, test_size=0.20, random_state=1234)

# Second split: divide the temporary set into validation (80% of temp_df) and test (20% of temp_df) sets
val_df, test_df = train_test_split(temp_df, test_size=0.20, random_state=1234)


print('training dataset:', len(train_df))
print('validation:', len(val_df))
print('test:', len(test_df))

training dataset: 800
validation: 160
test: 40
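
The sizes printed above can be reproduced with plain arithmetic. The sketch below assumes that the held-out share of each split is rounded up, which is how scikit-learn's `train_test_split` is understood to size it; the function name `two_stage_split_sizes` is hypothetical.

```python
import math


def two_stage_split_sizes(n_rows: int, first_test_size: float, second_test_size: float):
    """Return (train, validation, test) sizes for two chained splits.

    Assumption: the held-out part of each split is rounded up.
    """
    n_temp = math.ceil(n_rows * first_test_size)   # rows set aside by the first split
    n_train = n_rows - n_temp
    n_test = math.ceil(n_temp * second_test_size)  # held-out part of the second split
    n_val = n_temp - n_test
    return n_train, n_val, n_test


print(two_stage_split_sizes(1000, 0.20, 0.20))  # (800, 160, 40)
```

Because `test_size=0.20` is used in both calls, the test set ends up as the smaller part of the temporary set (40 rows) and the validation set as the larger part (160 rows).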


## 2. Prepare model 

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.


## load the 4-bit TinyLlama checkpoint from Unsloth (the chat checkpoint is kept below as a commented alternative) 
xmodel_id = 'unsloth/tinyllama-bnb-4bit' 
#xmodel_id = 'TinyLlama/TinyLlama-1.1B-Chat-v1.0'

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = xmodel_id, 
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit
)


##### Add a special pad token 

The pad_token is a small but essential part of fine-tuning language models. 

It helps with 

- ensuring that variable-length sequences can be processed efficiently in fixed-size batches 

- maintaining the integrity of the model's attention mechanism. Padding tokens carry no meaningful information, so the model should not attend to them. The pad_token is used to build an attention mask that tells the model to ignore these positions during the attention calculation.
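
To make the two points above concrete, here is a minimal pure-Python sketch of what padding a batch involves. It is not the tokenizer's implementation; `pad_batch` is a hypothetical helper and the token ids are made up.

```python
from typing import List, Tuple


def pad_batch(sequences: List[List[int]], pad_id: int,
              side: str = "right") -> Tuple[List[List[int]], List[List[int]]]:
    """Pad token-id sequences to equal length and build the attention mask.

    Mask value 1 marks a real token, 0 marks a padding position to ignore.
    """
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        pads, ones, zeros = [pad_id] * n_pad, [1] * len(seq), [0] * n_pad
        if side == "left":
            input_ids.append(pads + seq)
            attention_mask.append(zeros + ones)
        else:
            input_ids.append(seq + pads)
            attention_mask.append(ones + zeros)
    return input_ids, attention_mask


ids, mask = pad_batch([[5, 6, 7], [8]], pad_id=0, side="left")
print(ids)   # [[5, 6, 7], [0, 0, 8]]
print(mask)  # [[1, 1, 1], [0, 0, 1]]
```

Left padding keeps the last real token at the end of every row, which is usually what is wanted when generating text from a batch of prompts.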



In [None]:
## verify whether the tokenizer already has a pad token 
# this tokenizer has a pad token 
tokenizer.special_tokens_map


In [None]:
### Add a special pad token to the tokenizer if it does not have one 

### The lines below stay commented out because this tokenizer already has a pad token 

PAD_TOKEN = '<pad>'
#tokenizer.add_special_tokens({'pad_token': PAD_TOKEN})
#tokenizer.padding_side = 'left'
# see the special-token map of the tokenizer 
print(tokenizer.special_tokens_map)
# resize the model embeddings to match the new vocabulary size 
#model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=8)


#### count tokens 

## Chat Template 

The attribute **tokenizer.chat_template** holds the model's chat template, if there is one, and makes it easier to format conversations consistently. 

Use the following command to find out whether the tokenizer has a chat template:

```
print(tokenizer.chat_template)
```

If a model has a chat_template, it should be used. 
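
As an illustration of what a chat template produces, the sketch below renders a list of messages into a single string. It only mimics the `<|user|>` ... `</s>` layout that appears in the rendered prompts of this notebook; the real template is a Jinja string applied by `tokenizer.apply_chat_template`, and `render_chat` is a hypothetical stand-in.

```python
from typing import Dict, List


def render_chat(messages: List[Dict[str, str]], eos_token: str = "</s>") -> str:
    """Render messages as '<|role|>' + newline + content + eos + newline.

    Assumption: this layout approximates the TinyLlama chat format.
    """
    parts = []
    for message in messages:
        parts.append(f"<|{message['role']}|>\n{message['content']}{eos_token}\n")
    return "".join(parts)


messages = [
    {"role": "user", "content": "please parse this reference"},
    {"role": "assistant", "content": "<author>Doe, J.</author>"},
]
print(render_chat(messages))
```

Training examples contain both turns, while inference prompts stop after the user turn so that the model writes the assistant turn itself.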


In [None]:
from transformers import AutoTokenizer
# borrow the chat template from the TinyLlama chat tokenizer 
xTokenizer_chat = AutoTokenizer.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v1.0')
tokenizer.chat_template = xTokenizer_chat.chat_template

In [None]:
## the tokenizer used here now has a chat template 
print(tokenizer.chat_template)


## Inference before fine-tuning 

In [None]:
## randomly select one record 
xrecord = random.sample(train_df.to_dict(orient = 'records'), 1)[0]


## prepare the prompt 

xPrompt_inference_user_template = '''In the following bibliographic reference, wrap each reference with <REFERENCE> at the beginning and </REFERENCE> at the end. Keep also the part of text which is not in references ###Text: {text}'''


xPrompt_inference = xPrompt_inference_user_template.format(text = xrecord.get('ref_string'))

# print the prompt 
print(textwrap.fill(xPrompt_inference, width=100))

print('\n### expected answers \n subject:' + xrecord.get('ref_annotated') ) 



In [None]:
## use template for inference 

messages = [{"role": "user", "content": xPrompt_inference }]

#tokenize using chat_template 
tokenized_chat = tokenizer.apply_chat_template(messages, 
                                               tokenize=True, 
                                               add_generation_prompt=False, 
                                               return_tensors="pt").to('cuda')

xtokens_length = tokenized_chat.shape[1]

# generate (max_new_tokens must be an integer) 
outputs = model.generate(tokenized_chat, max_new_tokens=int(xtokens_length * 1.5)) 
print(tokenizer.decode(outputs[0]))



# Fine-tuning steps 

## Step 1: Prepare chat template 

In [None]:
## create a function to format the records in desired format 

def format_record_training(xrecord):
    '''
    function to create training example 
    '''
    
    xPrompt_training_user = '''please parse following bibliographic reference into its main components using html-like tags ###reference: {text}'''
    
    xmessage_user = xPrompt_training_user.format(text =  xrecord.get('ref_string'))  
    xmessage_assistant  = xrecord.get('ref_annotated')
    
    
    messages = [{"role": "user", "content": xmessage_user },
                {"role": "assistant", "content": xmessage_assistant  },
                   ]

    xprompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt = False)
        
        
    return xprompt

    
def format_record_inference(xrecord):
    '''
    function to create inference example : 
    no response from the assistant 
    '''
    
    xPrompt_training_user = '''please parse following bibliographic reference into its main components using html-like tags ###reference: {text}'''
    
    xmessage_user = xPrompt_training_user.format(text =  xrecord.get('ref_string'))  
    
    
    messages = [{"role": "user", "content": xmessage_user }]

    xprompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt = False)
        
        
    return xprompt    
    
    
# get a random record 
xrecord = random.sample(train_df.to_dict(orient = 'records'), 1)[0]

print ('## for training')
print (format_record_training(xrecord))

print ('## for inference')
print (format_record_inference(xrecord))


In [None]:
## apply the function to the training and validation set 
train_df['text'] = [format_record_training(xrecord)  for xrecord in train_df.to_dict(orient = 'records')]
val_df['text'] = [format_record_training(xrecord)  for xrecord in val_df.to_dict(orient = 'records')]

#for test set use the inference_format 
test_df['text'] = [format_record_inference(xrecord)  for xrecord in test_df.to_dict(orient = 'records')]



In [None]:
## create dataset 

In [None]:
dataset = {'train': Dataset.from_pandas(train_df, preserve_index = False),
           'val':   Dataset.from_pandas(val_df,   preserve_index = False),
           'test':  Dataset.from_pandas(test_df,  preserve_index = False),
          }


dataset 

## Step 2: LoRA configuration

In [None]:
model = FastLanguageModel.get_peft_model(model,
                                         r = 128, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
                                         target_modules = ["q_proj", "k_proj", 
                                                           "v_proj", "o_proj",
                                                           "gate_proj", "up_proj", 
                                                           "down_proj"],
                                         lora_alpha = 128,
                                         lora_dropout = 0, # Currently only supports dropout = 0
                                         bias = "none",    # Currently only supports bias = "none"
                                         use_gradient_checkpointing = False, # With Unsloth, we can turn this off!
                                         random_state = 68,
                                         use_rslora = False,  # We support rank stabilized LoRA
                                         loftq_config = None, # And LoftQ
                                        )
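
To get a feel for what `r = 128` means in terms of trainable weights, the sketch below counts the LoRA parameters added to the seven target modules. The layer dimensions are assumptions about TinyLlama-1.1B (hidden size 2048, MLP size 5632, key/value projection width 256, 22 layers) and are not stated in this notebook; each adapted linear layer adds `r * (in_features + out_features)` weights.

```python
# Assumed TinyLlama-1.1B dimensions (not taken from this notebook).
HIDDEN = 2048
INTERMEDIATE = 5632
KV_DIM = 256      # assumed: 4 key/value heads of 64 dimensions each
N_LAYERS = 22

# (in_features, out_features) of every targeted projection in one layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(r: int, shapes=TARGET_SHAPES, n_layers: int = N_LAYERS) -> int:
    """LoRA adds A (r x in) and B (out x r) to each adapted linear layer."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * n_layers


r, lora_alpha = 128, 128
print("trainable LoRA parameters:", lora_param_count(r))
print("scaling factor alpha / r:", lora_alpha / r)
```

With `use_rslora = False` the update is scaled by `lora_alpha / r`, so setting both to 128 gives a scaling factor of 1.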


In [None]:
### check the model specifications 
model.config.pad_token_id = tokenizer.pad_token_id 
model.config

## Step 3: Training configuration 

In [None]:
# based on config

from trl import SFTTrainer
from transformers import TrainingArguments
from transformers.utils import logging
logging.set_verbosity_info()


#xpath_checkpoints = "outputs_tinyllama_class_1"
#xpath_checkpoints = "outputs_tinyllama_class_full"

xpath_checkpoints = "outputs_tinyllama_refparsing_1"

xtrain_args = TrainingArguments(output_dir=xpath_checkpoints,
                                overwrite_output_dir=True,
                                
                                ### strategy 
                                max_steps=-1,
                                num_train_epochs=3 ,  
                                gradient_accumulation_steps=4,                                
                                per_device_eval_batch_size=4, # 4
                                per_device_train_batch_size=4, # 4
                                gradient_checkpointing=True,
                                #gradient_checkpointing_kwargs={"use_reentrant": False},
                                
                                evaluation_strategy="steps",
                                save_strategy="epoch", ##every epoch "no",

                                learning_rate=1e-4,
                                lr_scheduler_type="cosine",
                                warmup_ratio=0.1,
                                optim="adamw_torch",
                                
                                
                                #precision                                 
                                fp16 = not torch.cuda.is_bf16_supported(),
                                bf16 = torch.cuda.is_bf16_supported(),
                                
                                # logging 
                                log_level="info",
                                logging_steps=10,
                                logging_strategy="steps",
                                
                                #report_to="tensorboard",
                                
                                save_safetensors=True,
                                
                                seed=68)
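
The settings above imply a specific number of optimizer steps and a specific learning-rate curve. The sketch below works them out for the 800 training examples, assuming a single GPU and assuming the usual linear-warmup-then-cosine shape; the function names are hypothetical and the exact rounding inside the trainer may differ between library versions.

```python
import math


def training_steps(n_examples: int, per_device_batch: int, grad_accum: int,
                   epochs: int, n_devices: int = 1):
    """Return (optimizer steps per epoch, total optimizer steps)."""
    batches_per_epoch = math.ceil(n_examples / (per_device_batch * n_devices))
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch, updates_per_epoch * epochs


def cosine_lr(step: int, total_steps: int, warmup_steps: int, peak_lr: float) -> float:
    """Linear warmup to peak_lr, then cosine decay towards zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


per_epoch, total = training_steps(n_examples=800, per_device_batch=4,
                                  grad_accum=4, epochs=3)
warmup = math.ceil(total * 0.1)   # warmup_ratio = 0.1
print(per_epoch, total, warmup)   # 50 150 15

for step in (0, warmup, total // 2, total):
    print(step, cosine_lr(step, total, warmup, peak_lr=1e-4))
```

The effective batch size is 4 x 4 = 16 examples per optimizer step, and with `logging_steps=10` the run logs 15 times in total.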





In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from transformers.utils import logging
logging.set_verbosity_info()

trainer = SFTTrainer(model = model,
                     args = xtrain_args,
                     train_dataset = dataset['train'],
                     eval_dataset = dataset['val'],
                     dataset_text_field='text',
                     max_seq_length = max_seq_length 
                     )




In [None]:
trainer_stats = trainer.train()

In [None]:

print('training time in minutes:', trainer_stats.metrics.get('train_runtime')/60)

trainer_stats.metrics  ## this training took about 30 minutes 

In [None]:
## save model 

In [None]:
model.save_pretrained("refparser_tinyllama_lora_model_1") # Local saving


## Inference on the newly trained model 

In [None]:
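# use the test set: its 'text' column holds inference-style prompts without the expected answer 
# (the 'val' prompts were built with format_record_training and already contain the assistant response) 
xtest_samples = dataset['test'][0:10]
#xtest_samples = dataset['val'][0:10]
xprompts = xtest_samples.get('text')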



In [None]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference


for xnr, xprompt in enumerate(xprompts):


    inputs = tokenizer(xprompt, return_tensors = 'pt', padding = True).to('cuda')

    outputs = model.generate(**inputs, max_new_tokens = 400, use_cache = True)
    xresponse = tokenizer.batch_decode(outputs)
    
    print(xresponse[0].split('</s>')[1])
    print('-------------')
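
Splitting on `</s>` and taking the second piece works when the decoded text has exactly the expected shape. A slightly more defensive alternative is sketched below; `extract_response` is a hypothetical helper, and the `<|assistant|>` and `</s>` markers are assumed to be the ones this tokenizer emits.

```python
def extract_response(decoded: str, assistant_tag: str = "<|assistant|>",
                     eos: str = "</s>") -> str:
    """Return the text after the first assistant tag, up to the next EOS.

    Returns an empty string when no assistant tag is present.
    """
    _, found, tail = decoded.partition(assistant_tag)
    if not found:
        return ""
    return tail.split(eos)[0].strip()


decoded = "<s> <|user|>\nparse this</s> \n<|assistant|>\n<author>Doe, J.</author></s>"
print(extract_response(decoded))             # <author>Doe, J.</author>
print(repr(extract_response("no marker")))   # ''
```

A truncated generation without a closing `</s>` is returned as far as it goes, which makes cut-off outputs easy to spot.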

### Merge and save full model 

In [None]:
# merge the LoRA weights into the base model and save the result in 16-bit 
model.save_pretrained_merged(save_directory = 'refparser_tinyllama_1',
                             tokenizer= tokenizer,
                             save_method = 'merged_16bit',
                             push_to_hub = False)
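
Conceptually, merging folds the low-rank update into the original weight matrix, `W_merged = W + (lora_alpha / r) * B @ A`, so that the saved model no longer needs the adapter. The toy example below shows that formula on tiny matrices in plain Python. It is only the standard LoRA formulation under stated assumptions; what `save_pretrained_merged` does internally (for example handling the 4-bit weights) is not shown here.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, lora_a, lora_b, lora_alpha, r):
    """Return W + (lora_alpha / r) * (B @ A)."""
    scale = lora_alpha / r
    delta = matmul(lora_b, lora_a)  # (out x r) @ (r x in) -> (out x in)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


w = [[1.0, 0.0], [0.0, 1.0]]   # toy base weight, out x in = 2 x 2
lora_a = [[1.0, 2.0]]          # A: r x in = 1 x 2
lora_b = [[0.5], [0.0]]        # B: out x r = 2 x 1
merged = merge_lora(w, lora_a, lora_b, lora_alpha=2, r=1)
print(merged)  # [[2.0, 2.0], [0.0, 1.0]]
```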

In [None]:
print('OK')

## Test with the saved model 

(after restarting the kernel) 

In [1]:
from transformers import AutoTokenizer, AutoModelForCausalLM


xmodel_merged = 'refparser_tinyllama_1'


xtokenizer = AutoTokenizer.from_pretrained(xmodel_merged)

xmodel = AutoModelForCausalLM.from_pretrained(xmodel_merged, device_map="auto")



In [17]:
print('test:', len(test_df))

xrecords = test_df.to_dict(orient = 'records')

xrec = xrecords[0]

xrec 

test: 40


{'record_nr': 159,
 'doi': '10.1017/S0021911800033787',
 'ref_type': 'article-journal',
 'ref_style': 'american-journal-of-orthodontics-and-dentofacial-orthopedics',
 'ref_string': 'Ananda P. A Checklist of Indonesian Serials in the Cornell University Library (1945–1970). Compiled by Yvonne Thung and John M. Echols. Ithaca: Cornell University Southeast Asia Program Data Paper No. 89, 1973. vi, 215 pp. $7.00Indonesian Monographs: A Catalogue of Monographs Publications 1945–1968. A Collection of more than 7000 titles mainly concerning the Social Sciences from Cornell University Libraries on Microfiche. Zug, Switzerland: Inter Documentation Company, 1974. (Bibliotheca Asiatica, 10). iv, 154 pp. Sfr 40 (cloth), Sfr 16 (microfiche).. J of Asian Stud 1975;3501:176–7. Available at: http://dx.doi.org/10.1017/s0021911800033787.',
 'ref_annotated': '<author>Ananda P</author>. <title>A Checklist of Indonesian Serials in the Cornell University Library (1945–1970). Compiled by Yvonne Thung and John

In [11]:
def format_record_inference(xrecord):
    '''
    function to create inference example : 
    no response from the assistant 
    '''
    
    xPrompt_training_user = '''please parse following bibliographic reference into its main components using html-like tags ###reference: {text}'''
    
    xmessage_user = xPrompt_training_user.format(text =  xrecord.get('ref_string'))  
    
    
    messages = [{"role": "user", "content": xmessage_user }]

    xprompt = xtokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt = False)
        
        
    return xprompt    

In [16]:
xprompt = format_record_inference(xrec)
xprompt 

'<|user|>\nplease parse following bibliographic reference into its main components using html-like tags ###reference: Ananda P. A Checklist of Indonesian Serials in the Cornell University Library (1945–1970). Compiled by Yvonne Thung and John M. Echols. Ithaca: Cornell University Southeast Asia Program Data Paper No. 89, 1973. vi, 215 pp. $7.00Indonesian Monographs: A Catalogue of Monographs Publications 1945–1968. A Collection of more than 7000 titles mainly concerning the Social Sciences from Cornell University Libraries on Microfiche. Zug, Switzerland: Inter Documentation Company, 1974. (Bibliotheca Asiatica, 10). iv, 154 pp. Sfr 40 (cloth), Sfr 16 (microfiche).. J of Asian Stud 1975;3501:176–7. Available at: http://dx.doi.org/10.1017/s0021911800033787.</s>\n'

In [None]:
## test with other data 

' Ananda P. A Checklist of Indonesian Serials in the Cornell University Library (1945–1970). Compiled by Yvonne Thung and John M. Echols. Ithaca: Cornell University Southeast Asia Program Data Paper No. 89, 1973. vi, 215 pp. $7.00Indonesian Monographs: A Catalogue of Monographs Publications 1945–1968. A Collection of more than 7000 titles mainly concerning the Social Sciences from Cornell University Libraries on Microfiche. Zug, Switzerland: Inter Documentation Company, 1974. (Bibliotheca Asiatica, 10). iv, 154 pp. Sfr 40 (cloth), Sfr 16 (microfiche).. J of Asian Stud 1975;3501:176–7. Available at: http://dx.doi.org/10.1017/s0021911800033787.</s>\n'

In [15]:
inputs = xtokenizer(xprompt, return_tensors = 'pt', padding = True).to('cuda')
outputs = xmodel.generate(**inputs, max_new_tokens = 400, use_cache = False)
xresponse = xtokenizer.batch_decode(outputs)
    
#xresponse
print(xresponse[0].split('</s>')[1])

 
<|assistant|>
<author>Rawal, V. & Bansal, V.</author> <year>2021</year>. <title>The Land Question in Contemporary Rural India</title>


In [21]:
xprompts = [format_record_inference(xrecord) for xrecord in xrecords]

for xprompt in xprompts:
    
    inputs = xtokenizer(xprompt, return_tensors = 'pt', padding = True).to('cuda')
    outputs = xmodel.generate(**inputs, max_new_tokens = 400, use_cache = False)
    xresponse = xtokenizer.batch_decode(outputs)
    
    #xresponse
    
    print('REF_STRING')
    print(xprompt.split('###reference:')[1])
    print('TAGGED')
    print(xresponse[0].split('</s>')[1])
    print('--------')
    print('\n')

REF_STRING
 Ananda P. A Checklist of Indonesian Serials in the Cornell University Library (1945–1970). Compiled by Yvonne Thung and John M. Echols. Ithaca: Cornell University Southeast Asia Program Data Paper No. 89, 1973. vi, 215 pp. $7.00Indonesian Monographs: A Catalogue of Monographs Publications 1945–1968. A Collection of more than 7000 titles mainly concerning the Social Sciences from Cornell University Libraries on Microfiche. Zug, Switzerland: Inter Documentation Company, 1974. (Bibliotheca Asiatica, 10). iv, 154 pp. Sfr 40 (cloth), Sfr 16 (microfiche).. J of Asian Stud 1975;3501:176–7. Available at: http://dx.doi.org/10.1017/s0021911800033787.</s>

TAGGED
 
<|assistant|>
Ananda P. <title>A Checklist of Indonesian Serials in the Cornell University Library (1945–1970)</title>. <container-title>Compiled by Yvonne Thung and John M. Echols</container-title>. <venue>Ithaca: Cornell University Southeast Asia Program Data Paper No. 89</venue>, <year>1973</year>. <page>vi, 215 pp</page

REF_STRING
 Bozionellou, V., 2004, “Trastuzumab Administration Can Effectively Target Chemotherapy-Resistant Cytokeratin-19 Messenger RNA-Positive Tumor Cells in the Peripheral Blood and Bone Marrow of Patients With Breast Cancer,” Clinical Cancer Research, 1024, 8185–8194.</s>

TAGGED
 
<|assistant|>
<author>Bozionellou, V.</author>, <year>2004</year>, <title>“Trastuzumab Administration Can Effectively Target Chemotherapy-Resistant Cytokeratin-19 Messenger RNA-Positive Tumor Cells in the Peripheral Blood and Bone Marrow of Patients With Breast Cancer,”</title> <container-title>Clinical Cancer Research</container-title>, <volume>10</volume><issue>24</issue>, <page>8185–8194</page>.
--------


REF_STRING
 HANNAWAY, P.. 1991. January. “375 Additional observations on asthma deaths (ADS) in Massachusetts 1974–1988”. Journal of Allergy and Clinical Immunology 87 1 January: 233. doi:10.1016/0091-6749(91)91658-g. http://dx.doi.org/10.1016/0091-6749(91)91658-g.</s>

TAGGED
 
<|assistant|>
<aut

REF_STRING
 ... 2012. CUHSO 22-1 Completo. Cultura - Hombre - Sociedad CUHSO. 221.</s>

TAGGED
 
<|assistant|>
<title>...</title> <year>2012</year>. <container-title>CUHSO 22-1 Completo</container-title>. <genre>Cultura - Hombre - Sociedad</genre> <publisher>CUHSO</publisher>. <volume>22</volume><issue>1</issue>.
--------


REF_STRING
 Zehnder, Adalbert. 2016. “Die einen können’s, die anderen nicht”. kma - Das Gesundheitswirtschaftsmagazin 1601: 8–8. http://dx.doi.org/10.1055/s-0036-1575852.</s>

TAGGED
 
<|assistant|>
<author>Zehnder, Adalbert</author>. <year>2016</year>. <title>“Die einen können’s, die anderen nicht”</title>. <container-title>kma - Das Gesundheitswirtschaftsmagazin</container-title> <volume>16</volume><issue>01</issue>: <page>8–8</page>. <URL>http://dx.doi.org/10.1055/s-0036-1575852</URL>.
--------


REF_STRING
 The 20 Non-Negotiable Characteristics of Higher Performing School Systems: Aligning District Practices to Support High-Quality Instruction. 2011. PsycEXTRA D