<a target="_blank" href="https://colab.research.google.com/github/abhiwebshar/llm_lab_tutorials/blob/main/Llama_3_8b_faster_finetuning.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

This notebook is meant to be run on a Colab T4 GPU.

To install Unsloth on your own computer, follow the installation instructions on Unsloth's GitHub page [here](https://github.com/unslothai/unsloth#installation-instructions---conda).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), and [how to save it](#Save) (e.g. for Llama.cpp).




  - Installation of required libraries for training.

In [None]:
%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers trl peft accelerate bitsandbytes

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
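The `dtype = None` setting leaves the precision choice to auto-detection. As a rough sketch of the rule stated in the comment (float16 on T4/V100, bfloat16 on Ampere and newer), the hypothetical helper below maps a CUDA compute capability to a dtype name. `pick_dtype` is an illustration only and is not part of Unsloth or PyTorch; in real code you would ask PyTorch directly, for example with `torch.cuda.is_bf16_supported()`.

```python
def pick_dtype(compute_capability):
    """Return a half-precision dtype name for a CUDA compute capability.

    Simplifying assumption: bfloat16 needs Ampere or newer (capability >= 8.0).
    """
    major, _minor = compute_capability
    return "bfloat16" if major >= 8 else "float16"


print(pick_dtype((7, 5)))  # Tesla T4 has compute capability 7.5
print(pick_dtype((8, 0)))  # A100 (Ampere) has compute capability 8.0
```

On the T4 used here this picks float16, which matches the `Bfloat16 = FALSE` line in the Unsloth banner.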


In [None]:

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)


config.json:   0%|          | 0.00/1.20k [00:00<?, ?B/s]

==((====))==  Unsloth: Fast Llama patching release 2024.5
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.0+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.26.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/172 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/464 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


### Inference before fine-tuning

In [None]:
sample_input = """Sample Type / Medical Specialty:
Cardiovascular / Pulmonary.

Sample Name:
Peripheral Effusion - Consult.

Description:
Peripheral effusion on the CAT scan.  The patient is a 70-year-old Caucasian female with prior history of lung cancer, status post upper lobectomy.  She was recently diagnosed with recurrent pneumonia and does have a cancer on the CAT scan, lung cancer with metastasis.

REASON FOR CONSULT:
Peripheral effusion on the CAT scan.

HISTORY OF PRESENT ILLNESS:
The patient is a 70-year-old Caucasian female with prior history of lung cancer, status post upper lobectomy.  She was recently diagnosed with recurrent pneumonia and does have a cancer on the CAT scan, lung cancer with metastasis.  The patient had a visiting nurse for Christmas and started having abdominal pain, nausea and vomiting for which, she was admitted.  She had a CAT scan of the abdomen done, showed moderate pericardial effusion for which cardiology consult was requested.  She had an echo done, which shows moderate pericardial effusion with early tamponade.  The patient has underlying shortness of breath because of COPD, emphysema and chronic cough.  However, denies any dizziness, syncope, presyncope, palpitation.  Denies any prior history of coronary artery disease.

ALLERGIES:
No known drug allergies.

MEDICATIONS:
At this time, she is on hydromorphone p.r.n., erythromycin, ceftriaxone, calcium carbonate, Ambien.  She is on oxygen and nebulizer.

PAST MEDICAL HISTORY:
History of COPD, emphysema, pneumonia, and lung cancer.

PAST SURGICAL HISTORY:
Hip surgery and resection of the lung cancer 10 years ago.

SOCIAL HISTORY:
Still smokes, but less than before.  Drinks socially.

FAMILY HISTORY:
Noncontributory.

REVIEW OF SYSTEMS:
Denies any syncope, presyncope, palpitations, shortness of breath, cough, nausea, vomiting, or diarrhea.

PHYSICAL EXAMINATION:
GENERAL  The patient is comfortable not in any distress.VITAL SIGNS  Blood pressure 121/79, Pulse rate 94, respiratory rate 19, and temperature 97.6.HEENT  Atraumatic and normocephalic.NECK  Supple.  No JVD.  No carotid bruit.CHEST  Breath sounds vesicular.  Clear on auscultation.HEART  PMI could not be localized.  S2 and S2 regular.  No S3, no S4.  No murmur.ABDOMEN  Soft and nontender.  Positive bowel sounds.EXTREMITIES  No cyanosis, clubbing, or edema.  Pulse 2+.CNS  Alert, awake, and oriented x3.EKG shows normal sinus rhythm, low voltage.

LABORATORY DATA:
White cell count 7.3, hemoglobin 12.9, hematocrit 38.1, and platelet at 322,000.  Sodium 135, potassium 5, BUN 6, creatinine 1.2, glucose 71, alkaline phosphatase 263, total protein 5.3, lipase 414, and amylase 57.

DIAGNOSTIC STUDIES:
Chest x-ray shows left upper lobe airspace disease consistent with pneumonia _______.  CT abdomen showed diffuse replacement of the _______ metastasis, hepatomegaly, perihepatic ascites, moderate pericardial effusion, small left _______ sigmoid diverticulosis.

ASSESSMENT:
1.  Moderate peripheral effusion with early tamponade, probably secondary to lung cancer.2.  Lung cancer with metastasis most likely.3.  Pneumonia.4.  COPD.

PLAN:
We will get CT surgery consult for pericardial window.  Continue present medication. """

In [None]:
#Let's add the instruction to our data
sample_instruction = """You are a professional clinician.
You have been given a doctor's note after a patient's visit.
Your task is to extract the main diagnoses from the note.
    Note:
        1. Focus on the assessment, plan, impression, recommendation or similar sections.
        2. Exclude any items that are negated or ruled out.
        3. Do not include any extraneous information.
        4. Do not include the phrase \"diagnosis\" in your search.

"""
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        sample_instruction, # instruction
         sample_input, # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")
input_length = inputs['input_ids'].shape[1]
outputs = model.generate(**inputs, max_new_tokens = 256, use_cache = True)
print(tokenizer.decode(outputs.squeeze()[input_length:], skip_special_tokens=True).strip())

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Peripheral Effusion - Consult. 
1.  Moderate peripheral effusion with early tamponade, probably secondary to lung cancer.2.  Lung cancer with metastasis most likely.3.  Pneumonia.4.  COPD.


The output is not structured.

GPT-4 output for the same input:
1. Moderate peripheral effusion with early tamponade, secondary to lung cancer.
2. Lung cancer with metastasis.
3. Pneumonia.
4. Chronic Obstructive Pulmonary Disease (COPD).


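We will train a LoRA adapter to get structured output and results aligned with GPT-4.

We now add LoRA adapters so that only a small fraction of the parameters needs to be updated (the training log further down reports 41,943,040 trainable parameters, roughly 0.5% of the 8B model).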

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.5 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
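To see where the trainable-parameter count comes from, here is a small back-of-the-envelope sketch. It assumes the standard Llama-3-8B layer shapes (hidden size 4096, MLP width 14336, 8 key/value heads of dimension 128, 32 layers), which are not printed in this notebook, and the usual LoRA construction of two low-rank matrices per adapted projection.

```python
# Llama-3-8B shapes (assumed from the public model config, not from this notebook).
HIDDEN = 4096          # hidden_size
INTERMEDIATE = 14336   # intermediate_size (MLP width)
KV_DIM = 1024          # 8 key/value heads x 128 head_dim (grouped-query attention)
NUM_LAYERS = 32
RANK = 16              # r passed to get_peft_model

# (in_features, out_features) of each projection listed in target_modules.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    # LoRA adds A (rank x in) and B (out x rank) next to the frozen weight.
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
trainable = per_layer * NUM_LAYERS
print(f"{trainable:,} trainable parameters")
```

Under these assumptions the total is 41,943,040, the same number the trainer reports before training starts. Doubling `r` doubles this count.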


<a name="Data"></a>
### Data Prep

We will use the dataset prepared from narrative notes.

In [3]:
!pip install -q colorama

In [4]:
import pandas as pd
from colorama import Fore

In [1]:
!wget https://raw.githubusercontent.com/abhiwebshar/llm_lab_tutorials/main/data/clean_data_for_training.csv

--2024-06-03 14:13:28--  https://raw.githubusercontent.com/abhiwebshar/llm_lab_tutorials/main/data/clean_data_for_training.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 370037 (361K) [text/plain]
Saving to: ‘clean_data_for_training.csv’


2024-06-03 14:13:28 (22.4 MB/s) - ‘clean_data_for_training.csv’ saved [370037/370037]



In [5]:
df = pd.read_csv('clean_data_for_training.csv')

In [6]:
df.head()

Unnamed: 0,File,Input,Output
0,samplecaf1.txt,Sample Type / Medical Specialty: \nGastroenter...,"```json\n{\n ""diagnoses"": [\n ""Chron..."
1,samplee897.txt,Sample Type / Medical Specialty: \nOphthalmolo...,"```json\n{\n ""diagnoses"": [\n ""Catar..."
2,sample66d5.txt,Sample Type / Medical Specialty: \nSurgery. \n...,"```json\n{\n ""diagnoses"": [\n ""Left ..."
3,sampleb608.txt,Sample Type / Medical Specialty: \nCardiovascu...,"```json\n{\n ""diagnoses"": [\n ""Moder..."
4,sample9dcd.txt,Sample Type / Medical Specialty: \nEndocrinolo...,"```json\n{\n ""diagnoses"": [\n ""Acqui..."


In [None]:
for row in df.sample(3).itertuples(index=False):
  print(Fore.BLUE + f"Input: {row.Input}")
  print(Fore.GREEN +f"Output: {row.Output}")
  print("*"*50)

Input: Sample Type / Medical Specialty:
Pediatrics - Neonatal. 

Sample Name: 
Chest Closure. 

Description: 
Delayed primary chest closure.  Open chest status post modified stage 1 Norwood operation.  The patient is a newborn with diagnosis of hypoplastic left heart syndrome who 48 hours prior to the current procedure has undergone a modified stage 1 Norwood operation. 

PROCEDURE: 
Delayed primary chest closure. 

INDICATIONS: 
The patient is a newborn with diagnosis of hypoplastic left heart syndrome who 48 hours prior to the current procedure has undergone a modified stage 1 Norwood operation.  Given the magnitude of the operation and the size of the patient (2.5 kg), we have elected to leave the chest open to facilitate postoperative management.  He is now taken back to the operative room for delayed primary chest closure. 

PREOP DX: 
Open chest status post modified stage 1 Norwood operation. 

POSTOP DX: 
Open chest status post modified stage 1 Norwood operation. 

ANESTHE

  - We prepared the data from narrative notes.
  - Around 100 files were sampled, and the target output for each was generated with GPT-4. The cleaned training file contains 92 rows.

In [None]:
#Let's add the instruction to our data
instruction = """You are a professional clinician.
You have been given a doctor's note after a patient's visit.
Your task is to extract the main diagnoses from the note and provide them in a structured format.
Instructions:

Focus on the assessment, plan, impression, recommendation or similar sections.
Exclude any items that are negated or ruled out.
Do not include any extraneous information.
Do not include the phrase "diagnosis" in your search.
Provide only the names of the diagnoses, without any additional details.

Please provide the output in the following strucutured format:
{{
    "diagnoses": [
        "<Diagnosis 1>",
        "<Diagnosis 2>",
        ...
    ]
}}
"""
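One detail worth knowing about the doubled braces in the instruction: `str.format` only collapses `{{` and `}}` when they are part of the template string. Here the instruction is passed as an argument to `alpaca_prompt.format(...)`, so the doubled braces reach the model verbatim, which is what the inference output later in the notebook shows. The snippet below is a minimal illustration with a shortened, hypothetical template.

```python
# Shortened stand-in for the Alpaca template (illustration only).
template = "### Instruction:\n{}\n\n### Response:\n{}"

# Doubled braces inside an *argument* are inserted verbatim.
instruction_arg = 'Format:\n{{\n    "diagnoses": []\n}}'
prompt = template.format(instruction_arg, "")
print(prompt)

# They only collapse to single braces when they are part of the template itself.
collapsed = '{{"diagnoses": []}}'.format()
print(collapsed)
```

Since training and inference use the same instruction text, the model still sees a consistent prompt. If you prefer single braces in the prompt, write them as single braces in the instruction string.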

In [None]:
df['instruction'] = instruction

In [None]:
from datasets import Dataset, load_dataset


In [None]:
dataset = Dataset.from_pandas(df)

In [None]:
dataset  # Hugging Face Dataset format

Dataset({
    features: ['File', 'Input', 'Output', 'instruction'],
    num_rows: 92
})

In [None]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["Input"]
    outputs      = examples["Output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass

dataset = dataset.map(formatting_prompts_func, batched = True,)

Map:   0%|          | 0/92 [00:00<?, ? examples/s]
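To make the contract of the mapping function concrete, here is a self-contained sketch that runs the same idea on a toy batch. With `batched = True`, the function receives a dict mapping each column name to a list of values and must return a dict of lists. The `EOS_TOKEN` string and the toy rows are stand-ins chosen for illustration; in the notebook the real value comes from `tokenizer.eos_token`.

```python
EOS_TOKEN = "<|end_of_text|>"  # assumption: stand-in for tokenizer.eos_token

template = "### Instruction:\n{}\n\n### Input:\n{}\n\n### Response:\n{}"


def format_batch(examples):
    # `examples` maps each column name to a list with one entry per row.
    texts = [
        template.format(instruction, note, answer) + EOS_TOKEN
        for instruction, note, answer in zip(
            examples["instruction"], examples["Input"], examples["Output"]
        )
    ]
    return {"text": texts}


# Hypothetical rows, not taken from the training file.
toy_batch = {
    "instruction": ["Extract the diagnoses.", "Extract the diagnoses."],
    "Input": ["ASSESSMENT: Pneumonia.", "ASSESSMENT: COPD."],
    "Output": ['{"diagnoses": ["Pneumonia"]}', '{"diagnoses": ["COPD"]}'],
}

formatted = format_batch(toy_batch)
print(formatted["text"][0])
```

Every training text ends with the EOS token, so the model learns to stop once the response is complete.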

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We train for 60 steps here (with 92 examples and an effective batch size of 8, that is several passes over the data). For a single full pass, set `num_train_epochs=1` and remove `max_steps`, because a given `max_steps` overrides `num_train_epochs`. Unsloth also supports TRL's `DPOTrainer`!

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)


Map (num_proc=2):   0%|          | 0/92 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
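The relationship between steps, batch size, and passes over the data can be checked with a few lines of arithmetic. This sketch only uses the values set in the trainer cell above and the dataset size of 92 rows.

```python
import math

num_examples = 92
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 60

# One optimizer step consumes batch_size x accumulation_steps examples (single GPU).
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
examples_seen = max_steps * effective_batch_size
passes_over_data = examples_seen / num_examples

# Smallest number of optimizer steps that touches every example at least once.
steps_for_one_pass = math.ceil(num_examples / effective_batch_size)

print(f"Effective batch size: {effective_batch_size}")
print(f"Examples seen in {max_steps} steps: {examples_seen}")
print(f"Approximate passes over the data: {passes_over_data:.1f}")
print(f"Steps needed for one pass: {steps_for_one_pass}")
```

So 60 steps correspond to a little over five passes over this small dataset; the trainer banner rounds this up and reports 6 epochs.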


In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
5.594 GB of memory reserved.


In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 92 | Num Epochs = 6
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 41,943,040


Step,Training Loss
1,1.8659
2,1.7183
3,1.8103
4,1.8236
5,1.778
6,1.7245
7,1.5581
8,1.5662
9,1.4021
10,1.3382



In [None]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

1917.2298 seconds used for training.
31.95 minutes used for training.
Peak reserved memory = 9.609 GB.
Peak reserved memory for training = 4.015 GB.
Peak reserved memory % of max memory = 65.155 %.
Peak reserved memory for training % of max memory = 27.224 %.


<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!

In [None]:
sample_input = """Sample Type / Medical Specialty:
SOAP / Chart / Progress Notes.

Sample Name:
Wasp Sting - SOAP.

Description:
Comes in complaining that he was stung by a Yellow Jacket Wasp yesterday and now has a lot of swelling in his right hand and right arm.

SUBJECTIVE:
He is a 29-year-old white male who is a patient of Dr. XYZ and he comes in today complaining that he was stung by a Yellow Jacket Wasp yesterday and now has a lot of swelling in his right hand and right arm.  He says that he has been stung by wasps before and had similar reactions.  He just said that he wanted to catch it early before he has too bad of a severe reaction like he has had in the past.  He has had a lot of swelling, but no anaphylaxis-type reactions in the past; no shortness of breath or difficultly with his throat feeling like it is going to close up or anything like that in the past; no racing heart beat or anxiety feeling, just a lot of localized swelling where the sting occurs.

OBJECTIVE:
Vitals  His temperature is 98.4.  Respiratory rate is 18.  Weight is 250 pounds.Extremities  Examination of his right hand and forearm reveals that he has an apparent sting just around his wrist region on his right hand on the medial side as well as significant swelling in his hand and his right forearm; extending up to the elbow.  He says that it is really not painful or anything like that.  It is really not all that red and no signs of infection at this time.

ASSESSMENT:
Wasp sting to the right wrist area.PLAN1.  Solu-Medrol 125 mg IM X 1.2.  Over-the-counter Benadryl, ice and elevation of that extremity.3.  Follow up with Dr. XYZ if any further evaluation is needed. """



In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        instruction, # instruction
         sample_input, # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 256, use_cache = True)
tokenizer.batch_decode(outputs)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


['<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nYou are a professional clinician.\nYou have been given a doctor\'s note after a patient\'s visit.\nYour task is to extract the main diagnoses from the note and provide them in a structured format.\nInstructions:\n\nFocus on the assessment, plan, impression, recommendation or similar sections.\nExclude any items that are negated or ruled out.\nDo not include any extraneous information.\nDo not include the phrase "diagnosis" in your search.\nProvide only the names of the diagnoses, without any additional details.\n\nPlease provide the output in the following strucutured format:\n{{\n    "diagnoses": [\n        "<Diagnosis 1>",\n        "<Diagnosis 2>",\n       ...\n    ]\n}}\n\n\n### Input:\nSample Type / Medical Specialty: \nSOAP / Chart / Progress Notes. \n\nSample Name: \nWasp Sting - SOAP.

You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the whole output!

In [None]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
         instruction, # instruction
         sample_input, # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
You are a professional clinician.
You have been given a doctor's note after a patient's visit.
Your task is to extract the main diagnoses from the note and provide them in a structured format.
Instructions:

Focus on the assessment, plan, impression, recommendation or similar sections.
Exclude any items that are negated or ruled out.
Do not include any extraneous information.
Do not include the phrase "diagnosis" in your search.
Provide only the names of the diagnoses, without any additional details.

Please provide the output in the following strucutured format:
{{
    "diagnoses": [
        "<Diagnosis 1>",
        "<Diagnosis 2>",
       ...
    ]
}}


### Input:
Sample Type / Medical Specialty: 
SOAP / Chart / Progress Notes. 

Sample Name: 
Wasp Sting - SOAP. 

Description: 
Comes in complai

In [None]:
test_input = """Sample Type / Medical Specialty:
Cardiovascular / Pulmonary.

Sample Name:
Peripheral Effusion - Consult.

Description:
Peripheral effusion on the CAT scan.  The patient is a 70-year-old Caucasian female with prior history of lung cancer, status post upper lobectomy.  She was recently diagnosed with recurrent pneumonia and does have a cancer on the CAT scan, lung cancer with metastasis.

REASON FOR CONSULT:
Peripheral effusion on the CAT scan.

HISTORY OF PRESENT ILLNESS:
The patient is a 70-year-old Caucasian female with prior history of lung cancer, status post upper lobectomy.  She was recently diagnosed with recurrent pneumonia and does have a cancer on the CAT scan, lung cancer with metastasis.  The patient had a visiting nurse for Christmas and started having abdominal pain, nausea and vomiting for which, she was admitted.  She had a CAT scan of the abdomen done, showed moderate pericardial effusion for which cardiology consult was requested.  She had an echo done, which shows moderate pericardial effusion with early tamponade.  The patient has underlying shortness of breath because of COPD, emphysema and chronic cough.  However, denies any dizziness, syncope, presyncope, palpitation.  Denies any prior history of coronary artery disease.

ALLERGIES:
No known drug allergies.

MEDICATIONS:
At this time, she is on hydromorphone p.r.n., erythromycin, ceftriaxone, calcium carbonate, Ambien.  She is on oxygen and nebulizer.

PAST MEDICAL HISTORY:
History of COPD, emphysema, pneumonia, and lung cancer.

PAST SURGICAL HISTORY:
Hip surgery and resection of the lung cancer 10 years ago.

SOCIAL HISTORY:
Still smokes, but less than before.  Drinks socially.

FAMILY HISTORY:
Noncontributory.

REVIEW OF SYSTEMS:
Denies any syncope, presyncope, palpitations, shortness of breath, cough, nausea, vomiting, or diarrhea.

PHYSICAL EXAMINATION:
GENERAL  The patient is comfortable not in any distress.VITAL SIGNS  Blood pressure 121/79, Pulse rate 94, respiratory rate 19, and temperature 97.6.HEENT  Atraumatic and normocephalic.NECK  Supple.  No JVD.  No carotid bruit.CHEST  Breath sounds vesicular.  Clear on auscultation.HEART  PMI could not be localized.  S2 and S2 regular.  No S3, no S4.  No murmur.ABDOMEN  Soft and nontender.  Positive bowel sounds.EXTREMITIES  No cyanosis, clubbing, or edema.  Pulse 2+.CNS  Alert, awake, and oriented x3.EKG shows normal sinus rhythm, low voltage.

LABORATORY DATA:
White cell count 7.3, hemoglobin 12.9, hematocrit 38.1, and platelet at 322,000.  Sodium 135, potassium 5, BUN 6, creatinine 1.2, glucose 71, alkaline phosphatase 263, total protein 5.3, lipase 414, and amylase 57.

DIAGNOSTIC STUDIES:
Chest x-ray shows left upper lobe airspace disease consistent with pneumonia _______.  CT abdomen showed diffuse replacement of the _______ metastasis, hepatomegaly, perihepatic ascites, moderate pericardial effusion, small left _______ sigmoid diverticulosis.

ASSESSMENT:
1.  Moderate peripheral effusion with early tamponade, probably secondary to lung cancer.2.  Lung cancer with metastasis most likely.3.  Pneumonia.4.  COPD.

PLAN:
We will get CT surgery consult for pericardial window.  Continue present medication. """

In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
         instruction, # instruction
         test_input, # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
You are a professional clinician.
You have been given a doctor's note after a patient's visit.
Your task is to extract the main diagnoses from the note and provide them in a structured format.
Instructions:

Focus on the assessment, plan, impression, recommendation or similar sections.
Exclude any items that are negated or ruled out.
Do not include any extraneous information.
Do not include the phrase "diagnosis" in your search.
Provide only the names of the diagnoses, without any additional details.

Please provide the output in the following strucutured format:
{{
    "diagnoses": [
        "<Diagnosis 1>",
        "<Diagnosis 2>",
       ...
    ]
}}


### Input:
Sample Type / Medical Specialty: 
Cardiovascular / Pulmonary. 

Sample Name: 
Peripheral Effusion - Consult. 

Description: 
Periphe

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
import getpass  # read the token without echoing it
import os
os.environ["HF_WRITE_TOKEN"] = getpass.getpass("HF WRITE TOKEN:")

HF WRITE TOKEN:··········


In [None]:
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")


('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.json')

In [None]:
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

In [None]:
model.push_to_hub("abhiwebshar/lora_model", token = os.environ["HF_WRITE_TOKEN"]) # Online saving
tokenizer.push_to_hub("abhiwebshar/lora_model", token = os.environ["HF_WRITE_TOKEN"]) # Online saving

README.md:   0%|          | 0.00/578 [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/168M [00:00<?, ?B/s]

Saved model to https://huggingface.co/abhiwebshar/lora_model


The cell below reloads the LoRA adapters we just saved, ready for inference. Change `True` to `False` to skip it:

In [None]:
if True:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "abhiwebshar/lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference


==((====))==  Unsloth: Fast Llama patching release 2024.5
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.0+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.26.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Unsloth 2024.5 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


In [None]:




# Re-run the example from the pre-fine-tuning test (test_input) with the reloaded adapters.
inputs = tokenizer(
[
    alpaca_prompt.format(
        instruction, # instruction
        test_input,  # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")
input_length = inputs['input_ids'].shape[1]

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs.squeeze()[input_length:], skip_special_tokens=True).strip())

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


```json
{
    "diagnoses": [
        "Moderate peripheral effusion with early tamponade",
        "Lung cancer with metastasis",
        "Pneumonia",
        "COPD"
    ]
}
```
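Because the fine-tuned model wraps its answer in a Markdown `json` fence, downstream code has to strip that wrapper before parsing. The helper below is a minimal sketch of one way to do this with the standard library; `parse_diagnoses` is a hypothetical name, and the function returns `None` rather than raising when the reply is not valid JSON.

```python
import json
import re

FENCE = "`" * 3  # the Markdown code-fence marker

# Matches an optional fence (with or without the "json" tag) around the payload.
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, flags=re.DOTALL)


def parse_diagnoses(generated_text):
    """Return the "diagnoses" list from a model reply, or None if it cannot be parsed."""
    match = FENCED_BLOCK.search(generated_text)
    payload = match.group(1) if match else generated_text
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        return None
    diagnoses = data.get("diagnoses") if isinstance(data, dict) else None
    return diagnoses if isinstance(diagnoses, list) else None


# Hypothetical reply in the same shape as the model output above.
reply = FENCE + 'json\n{\n    "diagnoses": ["Pneumonia", "COPD"]\n}\n' + FENCE
print(parse_diagnoses(reply))
```

Returning `None` on failure makes it easy to count how often the model produces unparseable output when you evaluate on more notes.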


### Before and After Fine-tuning Comparison

**Output before fine-tuning**


Peripheral Effusion - Consult.
1.  Moderate peripheral effusion with early tamponade, probably secondary to lung cancer.2.  Lung cancer with metastasis most likely.3.  Pneumonia.4.  COPD.

**Output after fine-tuning**

```json
{
    "diagnoses": [
        "Moderate peripheral effusion with early tamponade",
        "Lung cancer with metastasis",
        "Pneumonia",
        "COPD"
    ]
}
```
**GPT-4 output**

1. Moderate peripheral effusion with early tamponade, secondary to lung cancer.
2. Lung cancer with metastasis.
3. Pneumonia.
4. Chronic Obstructive Pulmonary Disease (COPD).
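One simple, admittedly crude way to put a number on this comparison is exact matching after light normalization. The sketch below uses only the two outputs shown above; the `normalize` helper and the choice of exact string matching are assumptions made for illustration, not part of the original evaluation.

```python
import re

fine_tuned = [
    "Moderate peripheral effusion with early tamponade",
    "Lung cancer with metastasis",
    "Pneumonia",
    "COPD",
]
gpt4_lines = [
    "1. Moderate peripheral effusion with early tamponade, secondary to lung cancer.",
    "2. Lung cancer with metastasis.",
    "3. Pneumonia.",
    "4. Chronic Obstructive Pulmonary Disease (COPD).",
]


def normalize(item):
    # Strip list numbering and the trailing period, then lowercase.
    item = re.sub(r"^\s*\d+\.\s*", "", item)
    return item.rstrip(". ").lower()


ours = {normalize(d) for d in fine_tuned}
reference = {normalize(d) for d in gpt4_lines}
exact = sorted(ours & reference)
print(f"Exact matches: {len(exact)} of {len(reference)} -> {exact}")
```

Exact matching under-counts agreement: "COPD" and its spelled-out form refer to the same diagnosis but do not match as strings, so a fuzzier comparison would be needed for a fair score.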

You can also use Hugging Face's `AutoPeftModelForCausalLM` from the `peft` library. Only use this if you do not have `unsloth` installed. It can be very slow, since downloading `4bit` models is not supported, and Unsloth's **inference is 2x faster**.