### Loading the model checkpoint using fairseq extensions

In [None]:
from pathlib import Path

In [None]:
qa_model_checkpoint_path = Path.cwd().joinpath(
    'models', 'QA-PubMedQA-BioGPT', "checkpoint.pt")

Having looked at the architecture, let us first load the model directly with fairseq. Later we will decide whether to load it with transformers and possibly convert it to the Hugging Face format.

In [None]:
qa_model_checkpoint_path

In [None]:
from biogpt_model.transformer_lm_prompt import TransformerLanguageModelPrompt

In [None]:
data_path = Path.cwd().joinpath('datasets', 'biogpt', 'pqal_qcl_ansis-bin')
bpe_code_path = data_path.parent.joinpath('raw', 'bpecodes')
assert data_path.exists()
assert bpe_code_path.exists()

In [None]:
model_fairseq = TransformerLanguageModelPrompt.from_pretrained(
    qa_model_checkpoint_path.parent,
    "checkpoint.pt",
    str(data_path),
    tokenizer="moses",
    bpe="fastbpe",
    bpe_codes=str(bpe_code_path),
)

In [None]:
model_fairseq.cfg.get("generation")

In [None]:
from pprint import pprint

In [None]:
contexts = [
    'Programmed cell death (PCD) is the regulated death of cells within an organism. The lace plant (Aponogeton madagascariensis) produces perforations in its leaves through PCD. The leaves of the plant consist of a latticework of longitudinal and transverse veins enclosing areoles. PCD occurs in the cells at the center of these areoles and progresses outwards, stopping approximately five cells from the vasculature. The role of mitochondria during PCD has been recognized in animals; however, it has been less studied during PCD in plants.',
    'The following paper elucidates the role of mitochondrial dynamics during developmentally regulated PCD in vivo in A. madagascariensis. A single areole within a window stage leaf (PCD is occurring) was divided into three areas based on the progression of PCD; cells that will not undergo PCD (NPCD), cells in early stages of PCD (EPCD), and cells in late stages of PCD (LPCD). Window stage leaves were stained with the mitochondrial dye MitoTracker Red CMXRos and examined. Mitochondrial dynamics were delineated into four categories (M1-M4) based on characteristics including distribution, motility, and membrane potential (ΔΨm). A TUNEL assay showed fragmented nDNA in a gradient over these mitochondrial stages. Chloroplasts and transvacuolar strands were also examined using live cell imaging. The possible importance of mitochondrial permeability transition pore (PTP) formation during PCD was indirectly examined via in vivo cyclosporine A (CsA) treatment. This treatment resulted in lace plant leaves with a significantly lower number of perforations compared to controls, and that displayed mitochondrial dynamics similar to that of non-PCD cells.'
]

In [None]:
contexts[0]

In [None]:
question = "Do mitochondria play a role in remodelling lace plant leaves during programmed cell death?"

In [None]:
prompt = f"question: {question} context: {' '.join(contexts)}"

In [None]:
pprint(prompt)

In [None]:
# Encode the full prompt (question plus context) rather than the question alone
source_tokens = model_fairseq.encode(prompt)

In [None]:
from transformers import set_seed

In [None]:
set_seed(42)

In [None]:
source_tokens

In [None]:
generated_text = model_fairseq.generate([source_tokens],
                                        beam=5,
                                        min_len=100,
                                        max_len_a=512,
                                        max_len_b=512,
                                        temperature=0.25,
                                        sampling_topk=50,
                                        sampling_topp=0.95,
                                        sampling=True,)[0]

In [None]:
model_fairseq.max_positions

In [None]:
len(source_tokens)

In [None]:
generated_text[0]['tokens']

In [None]:
output_text = model_fairseq.decode(generated_text[0]['tokens'])

In [None]:
pprint(output_text)

The prompt model seems to be working, but it always returns `learned` tokens. What am I missing here?

I am not sure why the model generates this kind of output; it may be because of the data. It is worth checking what went wrong with the model.
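
One way to inspect the answer on its own is to strip the repeated tokens from the decoded text. The snippet below is only a minimal sketch: it assumes the unwanted tokens have the form `learned` followed by a number, and both the helper name `strip_placeholders` and the sample string are made up for illustration (they are not real model output).

```python
import re

# Assumption: the repeated tokens look like "learned1", "learned2", ...
PLACEHOLDER = re.compile(r"\blearned\d+\b")


def strip_placeholders(text: str) -> str:
    """Remove learnedN tokens and collapse the whitespace they leave behind."""
    without_tokens = PLACEHOLDER.sub(" ", text)
    return " ".join(without_tokens.split())


# Hypothetical decoded string, for illustration only
sample = "learned1 learned2 learned3 the answer is yes."
print(strip_placeholders(sample))
```

In the notebook this could be applied to `output_text` after decoding, which leaves the model and the generation settings untouched.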

### Uploading the model to Hugging Face Transformers

In [None]:
model_fairseq.state_dict().keys()

This is the model architecture. The next step is to convert it to the Hugging Face format.
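
The raw list of keys is long, so a grouped summary can make the architecture easier to read before attempting a conversion. The following is a rough sketch: `summarise_keys` is a hypothetical helper, and the key names in `example_keys` are illustrative stand-ins rather than the actual output of `model_fairseq.state_dict().keys()`.

```python
from collections import Counter


def summarise_keys(keys, depth=2):
    """Count parameter names by their first `depth` dot-separated components."""
    return Counter(".".join(key.split(".")[:depth]) for key in keys)


# Illustrative key names only; in the notebook, pass
# model_fairseq.state_dict().keys() instead.
example_keys = [
    "decoder.embed_tokens.weight",
    "decoder.layers.0.self_attn.k_proj.weight",
    "decoder.layers.0.self_attn.v_proj.weight",
    "decoder.layers.1.self_attn.k_proj.weight",
]

for prefix, count in summarise_keys(example_keys).items():
    print(prefix, count)
```

Raising `depth` shows per-layer detail, while `depth=1` gives only the top-level modules.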

Will start here tomorrow!

https://github.com/huggingface/transformers/blob/main/src/transformers/models/biogpt/convert_biogpt_original_pytorch_checkpoint_to_pytorch.py

### Model Conversion to HF

In [None]:
from pathlib import Path

In [None]:
model_path = Path.cwd().joinpath('datasets', 'biogpt', 'raw')

In [None]:
assert model_path.exists(), "Model path does not exist"

In [None]:
biogpt_qa_hf_path = Path.cwd().joinpath('models', 'bio-gpt-qa')

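###### Convert me to a Python cell to execute
from transformers.models.biogpt.convert_biogpt_original_pytorch_checkpoint_to_pytorch import convert_biogpt_checkpoint_to_pytorch
convert_biogpt_checkpoint_to_pytorch(biogpt_checkpoint_path=model_path,
                                     pytorch_dump_folder_path=biogpt_qa_hf_path)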

Now we have a problem: the fairseq model has a vocabulary size of 42384, while the model's embedding layer has 42393 rows. It looks like 9 extra words were added to the embedding layer: learned1, learned2, learned3, ... and learned9. These are the words that the model always generates before giving the final answer.
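
The arithmetic behind this mismatch can be written down as a small check. This is a sketch under the assumptions stated above: the two sizes are the ones quoted in this notebook, the `learned` prefix and the numbering from 1 follow the description here, and `extra_token_names` is a hypothetical helper rather than part of fairseq or transformers.

```python
def extra_token_names(vocab_size, embedding_rows, prefix="learned"):
    """Name the embedding rows that have no entry in the dictionary.

    Assumes the extra rows are numbered from 1, as described in the text.
    """
    extra = embedding_rows - vocab_size
    if extra < 0:
        raise ValueError("embedding has fewer rows than the vocabulary")
    return [f"{prefix}{i}" for i in range(1, extra + 1)]


# Sizes quoted in this notebook
names = extra_token_names(vocab_size=42384, embedding_rows=42393)
print(len(names), names[0], names[-1])
```

If the converted tokenizer does not know these names, the extra embedding rows cannot be produced from text input, which is worth checking before comparing the two models.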

Let us see how the model performs.

In [None]:
from transformers import BioGptForCausalLM, BioGptTokenizer

In [None]:
tokenizer = BioGptTokenizer.from_pretrained(biogpt_qa_hf_path)
bio_gpt_model = BioGptForCausalLM.from_pretrained(biogpt_qa_hf_path)

In [None]:
prompt

In [None]:
tokenized_text = tokenizer.encode(prompt, return_tensors="pt")

In [None]:
tokenized_text

In [None]:
pprint(prompt)

In [None]:
generate_tokens = bio_gpt_model.generate(tokenized_text,
                                         num_beams=5,
                                         do_sample=True,
                                         top_k=50,
                                         top_p=0.95,
                                         max_length=512)

In [None]:
pprint(tokenizer.decode(generate_tokens[0], skip_special_tokens=True))

In [None]:
bio_gpt_model.push_to_hub("BioGPT-Large-QA-PubMedQA")

This is where we stop today. I will come back tomorrow to learn why the prompt is working.

In [None]:
tokenizer.push_to_hub("BioGPT-Large-QA-PubMedQA")