<a href="https://colab.research.google.com/github/daniel-jyc/MLOps_Final_Project/blob/main/MLOps_Final_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip -q install neptune
!pip -q install trl bitsandbytes accelerate peft datasets transformers


In [None]:
import os
import json
import pandas as pd
import torch
from datasets import load_dataset
from pprint import pprint
from getpass import getpass

from sklearn.model_selection import train_test_split

import neptune
import accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, BitsAndBytesConfig, get_cosine_schedule_with_warmup
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM
from peft import LoraConfig, get_peft_model, AutoPeftModelForCausalLM

from google.colab import drive
drive.mount('/content/drive')



Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Data Preprocessing

In [None]:
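# Paths to the SQuAD v2-style JSON files on Google Drive
# NOTE: the training file name is assumed to mirror the test file name; adjust it to your Drive layout
train_json_file_path = '/content/drive/MyDrive/Data/train_webmd_squad_v2_consec.json'
test_json_file_path = '/content/drive/MyDrive/Data/test_webmd_squad_v2_consec.json'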

# Open the file and load its contents
with open(train_json_file_path, 'r') as json_file:
    train_data = json.load(json_file)

with open(test_json_file_path, 'r') as json_file:
    test_data = json.load(json_file)

In [None]:
def extract_text_from_json(data):
  """Flatten a SQuAD-style dict into one row per (question, answer) pair."""
  rows = []

  # Iterate over topics
  for topic_data in data['data']:
      # Iterate over paragraphs
      for paragraph in topic_data.get('paragraphs', []):
          # Iterate over questions
          for qa in paragraph.get('qas', []):
              input_text = qa.get('question', '')

              # Iterate over answers; questions without answers produce no rows
              for answer in qa.get('answers', []):
                  rows.append({
                      'Topic': topic_data,
                      'Question': input_text,
                      'Answer': answer.get('text', '')
                  })

  # Build the DataFrame once from all collected rows
  dataframe = pd.DataFrame(rows, columns=['Topic', 'Question', 'Answer'])

  return dataframe
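
The flattening step can be checked on a tiny hand-made input. The sketch below is pandas-free and uses a hypothetical miniature of the SQuAD v2 layout (`sample` and `flatten_qa` are illustrative names, not part of the notebook); it only shows which rows the nested loops produce.

```python
# Hypothetical miniature of the SQuAD v2 layout assumed by the flattening step.
sample = {
    "data": [
        {
            "title": "example-topic",
            "paragraphs": [
                {
                    "context": "Example context.",
                    "qas": [
                        {"question": "Q1?", "answers": [{"text": "A1"}, {"text": "A1b"}]},
                        {"question": "Q2?", "answers": []},  # no answers: yields no row
                    ],
                }
            ],
        }
    ]
}


def flatten_qa(data):
    """Return one (question, answer) tuple per answer in a SQuAD-style dict."""
    pairs = []
    for topic in data.get("data", []):
        for paragraph in topic.get("paragraphs", []):
            for qa in paragraph.get("qas", []):
                for answer in qa.get("answers", []):
                    pairs.append((qa.get("question", ""), answer.get("text", "")))
    return pairs


pairs = flatten_qa(sample)
print(pairs)
```

A question with several answers contributes several rows, and a question with an empty `answers` list contributes none.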

In [None]:
def process_and_save_dataset(dataset, split):
  prompt_template = """### Question:
  {Input}

  ### Answer:
  """

  num_train_df = len(dataset["Question"])
  finetuning_dataset = []
  for i in range(num_train_df):
    question = dataset["Question"][i]
    answer = dataset["Answer"][i]
    text_with_prompt_template = prompt_template.format(Input=question)
    finetuning_dataset.append({"text": text_with_prompt_template + answer})

  with open(f'/content/drive/MyDrive/Data/dataset_{split}.json', 'w') as outfile:
    for obj in finetuning_dataset:
        json.dump(obj, outfile)
        outfile.write('\n')

  return
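
As a rough illustration of what ends up in each output file, the sketch below builds one record with the same prompt text and writes it in JSON Lines form (one JSON object per line). It writes to an in-memory buffer rather than Drive, and `build_record` and `write_jsonl` are hypothetical helper names used only here.

```python
import io
import json

# Same prompt text as in process_and_save_dataset, including the two-space
# indentation that the triple-quoted string picks up inside the function body.
PROMPT_TEMPLATE = "### Question:\n  {Input}\n\n  ### Answer:\n  "


def build_record(question, answer):
    """Combine a question and answer into a single training text."""
    return {"text": PROMPT_TEMPLATE.format(Input=question) + answer}


def write_jsonl(records, handle):
    """Write each record as one JSON object per line."""
    for record in records:
        json.dump(record, handle)
        handle.write("\n")


buffer = io.StringIO()
records = [build_record("What is X?", "X is an example.")]
write_jsonl(records, buffer)

lines = buffer.getvalue().splitlines()
parsed = json.loads(lines[0])
print(parsed["text"])
```

Because every line is an independent JSON object, the file can later be read with `load_dataset("json", ...)`, as done in the training section.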

In [None]:
train_df = extract_text_from_json(train_data)
test_df = extract_text_from_json(test_data)
train_df.head(10)

|  | Topic | Question | Answer |
|---|---|---|---|
| 0 | {'title': 'https://www.webmd.com/eye-health/un... | What surgical techniques are used to treat gla... | If the glaucoma does not respond to medication... |
| 1 | {'title': 'https://www.webmd.com/eye-health/un... | What are the best ways to treat glaucoma? | Both drugs and surgery have high rates of succ... |
| 2 | {'title': 'https://www.webmd.com/eye-health/un... | What should you know about treating open-angle... | That is why it's so important to have your eye... |
| 3 | {'title': 'https://www.webmd.com/eye-health/un... | Is surgery for glaucoma dangerous? | Before giving your consent, always ask the sur... |
| 4 | {'title': 'https://www.webmd.com/eye-health/un... | How is acute closed-angle glaucoma treated? | Acute angle-closure glaucoma is different from... |
| 5 | {'title': 'https://www.webmd.com/cancer/bladde... | What do the letters of bladder cancer stages m... | It's based on the following three key pieces o... |
| 6 | {'title': 'https://www.webmd.com/ibd-crohns-di... | Which foods usually cause trouble for people w... | You may want to cut these out for a while and ... |
| 7 | {'title': 'https://www.webmd.com/ibd-crohns-di... | What should I eat after surgery for my ulcerat... | If you have an operation for your UC, your doc... |
| 8 | {'title': 'https://www.webmd.com/ibd-crohns-di... | What food choices can help my ulcerative coli... | Try to switch from full-fat to low-fat dairy. ... |
| 9 | {'title': 'https://www.webmd.com/prostate-canc... | When should I call my doctor after treatment f... | When you go home from the hospital after prost... |


In [None]:
process_and_save_dataset(train_df, 'train')
process_and_save_dataset(test_df, 'test')

## Training

### Load the preprocessed and saved datasets

In [None]:
# dataset = load_dataset("json", data_files={'train':["/content/drive/MyDrive/Data/dataset_train.json"],'test':["/content/drive/MyDrive/Data/dataset_test.json"]})
dataset_train = load_dataset("json", data_files="/content/drive/MyDrive/Data/dataset_train.json")
dataset_test = load_dataset("json", data_files="/content/drive/MyDrive/Data/dataset_test.json")
dataset_train = dataset_train['train']
dataset_test = dataset_test['train']
dataset_train, dataset_test

(Dataset({
     features: ['text'],
     num_rows: 19989
 }),
 Dataset({
     features: ['text'],
     num_rows: 2614
 }))

### Load Model and set configs

In [None]:
# Read the Neptune API token interactively instead of hard-coding it in the notebook
os.environ["NEPTUNE_API_TOKEN"] = getpass("Enter your Neptune API token: ")
os.environ["NEPTUNE_PROJECT"] = "sakai030/MLOps-Final"

In [None]:
from huggingface_hub import login
login(token=getpass("Enter your Hugging Face access token: "))

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
model_name = 'meta-llama/Llama-2-7b-hf'

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant= True,
)

model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config = bnb_config, use_cache=False, device_map={"": 0})
tokenizer = AutoTokenizer.from_pretrained(model_name)

tokenizer.pad_token = tokenizer.unk_token
tokenizer.padding_side = "right"

In [None]:
response_template = " ### Answer:\n "
response_template_ids = tokenizer.encode(response_template)[1:5]
collator = DataCollatorForCompletionOnlyLM(response_template_ids, tokenizer=tokenizer)
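
The idea behind a completion-only collator is that the loss is computed on the answer tokens only: every label up to and including the response template is replaced by an ignore index. The sketch below illustrates that idea on toy token ids in plain Python. It is not TRL's implementation; `find_subsequence`, `mask_prompt_tokens`, and the token ids are hypothetical, and masking the whole example when the template is missing is an assumption of this sketch.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def find_subsequence(sequence, pattern):
    """Return the start index of the first occurrence of pattern, or -1."""
    for start in range(len(sequence) - len(pattern) + 1):
        if sequence[start:start + len(pattern)] == pattern:
            return start
    return -1


def mask_prompt_tokens(input_ids, response_ids):
    """Copy input_ids to labels and mask everything through the response template."""
    labels = list(input_ids)
    start = find_subsequence(input_ids, response_ids)
    if start == -1:
        # Template not found: this sketch masks the whole example.
        return [IGNORE_INDEX] * len(labels)
    end = start + len(response_ids)
    labels[:end] = [IGNORE_INDEX] * end
    return labels


# Toy ids: 1 = BOS, 11-13 = question, 90-93 = response template, 21-22 = answer, 2 = EOS
toy_input_ids = [1, 11, 12, 13, 90, 91, 92, 93, 21, 22, 2]
toy_response_ids = [90, 91, 92, 93]
labels = mask_prompt_tokens(toy_input_ids, toy_response_ids)
print(labels)
```

If the template ids never match the tokenized text, no answer tokens are trained on, so it is worth checking that `response_template_ids` actually occurs in a tokenized training example.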

In [None]:
peft_config = LoraConfig(
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    r=16,
    bias="none",
    task_type="CAUSAL_LM"
)
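
These LoRA settings imply a fixed number of trainable parameters. The following back-of-the-envelope sketch assumes the publicly documented Llama-2-7B dimensions (hidden size 4096, 32 decoder layers, square attention projections), none of which are stated in this notebook; the helper name is hypothetical.

```python
# Assumed Llama-2-7B dimensions (not taken from the notebook).
HIDDEN_SIZE = 4096
NUM_LAYERS = 32

# Values from the LoraConfig above.
TARGET_MODULES = ["q_proj", "v_proj", "k_proj", "o_proj"]
RANK = 16
LORA_ALPHA = 32


def lora_params_per_module(in_features, out_features, rank):
    """LoRA adds A (rank x in_features) and B (out_features x rank) per adapted layer."""
    return rank * in_features + out_features * rank


per_module = lora_params_per_module(HIDDEN_SIZE, HIDDEN_SIZE, RANK)
total_params = per_module * len(TARGET_MODULES) * NUM_LAYERS
size_mb_fp32 = total_params * 4 / 1e6  # 4 bytes per float32 weight
scaling = LORA_ALPHA / RANK  # factor applied to the low-rank update

print(total_params, size_mb_fp32, scaling)
```

Under these assumptions the adapter holds about 16.8M parameters, roughly 67 MB in float32, which is in line with the `adapter_model.safetensors` size reported when the model is pushed to the Hub.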

In [None]:
# Define training arguments
args = TrainingArguments(
    output_dir = "/content/drive/MyDrive/Data/results_llama2",
    num_train_epochs = 1,
    per_device_train_batch_size = 4,
    gradient_accumulation_steps= 2,
    optim = "paged_adamw_8bit",
    lr_scheduler_type = "constant",
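    warmup_ratio = 0.03,  # fraction of training steps; only takes effect with a warmup-aware scheduler such as "constant_with_warmup"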
    logging_steps = 200,
    save_steps= 200,
    save_strategy = "steps",
    evaluation_strategy= "steps",
    learning_rate = 2e-5,
    fp16 = True,
    do_eval= True,
    load_best_model_at_end = True,
    report_to = 'neptune'
)


# Define SFTTrainer arguments
max_seq_length = 512

trainer = SFTTrainer(
    model=model,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    packing=False, ## packing = False for collator
    dataset_text_field="text",
    args=args,
    train_dataset=dataset_train,
    eval_dataset=dataset_test,
    data_collator=collator,
)

In [None]:
trainer.train()

https://app.neptune.ai/sakai030/MLOps-Final/e/MLOP-19


You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss | Validation Loss |
|---|---|---|
| 200 | 1.4585 | 1.442051 |
| 400 | 1.4121 | 1.421003 |
| 600 | 1.3898 | 1.409043 |
| 800 | 1.4234 | 1.404199 |
| 1000 | 1.3988 | 1.40131 |
| 1200 | 1.3904 | 1.399114 |
| 1400 | 1.3959 | 1.398233 |
| 1600 | 1.3778 | 1.396958 |
| 1800 | 1.3933 | 1.395183 |
| 2000 | 1.3793 | 1.394845 |


Shutting down background jobs, please wait a moment...
Done!
Waiting for the remaining 6 operations to synchronize with Neptune. Do not kill this process.
All 6 operations synced, thanks for waiting!
Explore the metadata in the Neptune app:
https://app.neptune.ai/sakai030/MLOps-Final/e/MLOP-19/metadata


TrainOutput(global_step=2499, training_loss=1.3982599326351635, metrics={'train_runtime': 3289.2383, 'train_samples_per_second': 6.077, 'train_steps_per_second': 0.76, 'total_flos': 1.5066859592997274e+17, 'train_loss': 1.3982599326351635, 'epoch': 1.0})
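
The reported `global_step` can be reproduced from the dataset size and the batch settings. This is a small arithmetic sketch, assuming a single GPU and that optimizer steps per epoch are the number of batches floor-divided by the gradient accumulation steps.

```python
import math

num_examples = 19989      # rows in dataset_train
per_device_batch = 4      # per_device_train_batch_size
grad_accum = 2            # gradient_accumulation_steps
num_epochs = 1            # num_train_epochs

# Batches the dataloader yields per epoch (the last batch may be smaller).
batches_per_epoch = math.ceil(num_examples / per_device_batch)

# One optimizer step is taken every grad_accum batches.
optimizer_steps = (batches_per_epoch // grad_accum) * num_epochs

effective_batch_size = per_device_batch * grad_accum

print(batches_per_epoch, optimizer_steps, effective_batch_size)
```

The result, 2499 optimizer steps with an effective batch size of 8, agrees with `global_step=2499` in the `TrainOutput` above.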

## Save Model

In [None]:
from huggingface_hub import login
login(token=getpass("Enter your Hugging Face access token: "))

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
trainer.model.push_to_hub("Danieljyc/Llama-2")
tokenizer.push_to_hub("Danieljyc/Llama-2")

adapter_model.safetensors:   0%|          | 0.00/67.1M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Danieljyc/Llama-2/commit/a5eca5bcbddda9f1448f14ab7d732edaffff1a93', commit_message='Upload tokenizer', commit_description='', oid='a5eca5bcbddda9f1448f14ab7d732edaffff1a93', pr_url=None, pr_revision=None, pr_num=None)

## Inference

In [None]:
from huggingface_hub import login
login(token=getpass("Enter your Hugging Face access token: "))

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
# base_model, new_model = 'meta-llama/Llama-2-7b-hf', 'Danieljyc/Llama-2'
base_model, new_model = 'mistralai/Mistral-7B-v0.1', 'Danieljyc/Mistral_7B_llr'

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

finetuned_model = AutoPeftModelForCausalLM.from_pretrained(
    new_model,
    low_cpu_mem_usage=True,
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config,
    device_map={"":0}
)
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

In [None]:
def generate_inference(question):
  prompt_template = f"""### Question:
  {question}

  ### Answer:
  """

  device = "cuda:0"
  inputs = tokenizer(prompt_template, return_tensors="pt").to(device)
  outputs = finetuned_model.generate(**inputs, max_new_tokens=128, eos_token_id=tokenizer.eos_token_id, pad_token_id=tokenizer.pad_token_id)

  text = tokenizer.decode(outputs[0], skip_special_tokens=True)
  print(text)
  return
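
Since the decoded text contains the prompt followed by the generated answer, it can be convenient to isolate the answer. A minimal sketch, assuming the `### Answer:` marker appears once in the decoded text; `extract_answer` is a hypothetical helper, not part of the notebook.

```python
ANSWER_MARKER = "### Answer:"


def extract_answer(generated_text, marker=ANSWER_MARKER):
    """Return the text after the first marker, or the whole text if it is absent."""
    _, found, tail = generated_text.partition(marker)
    return tail.strip() if found else generated_text.strip()


sample_output = "### Question:\n  How is X treated?\n\n  ### Answer:\n  With rest and fluids."
answer = extract_answer(sample_output)
print(answer)
```

A helper like this could be applied to the decoded string before printing when only the answer is of interest.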

In [None]:
questions = ["What types of exercise are best for people with asthma?","How is obsessive-compulsive disorder diagnosed?",
             "When are you more likely to get a blood clot?","How should you lift objects to prevent back pain?",
             "How can you be smart with antibiotics?"]

In [None]:
generate_inference(questions[0])

### Question:
  What types of exercise are best for people with asthma?

  ### Answer:
  30 minutes of moderate exercise, such as walking, swimming, or biking, 3 to 4 times a week. If you have severe asthma, you may need to exercise less often. Talk to your doctor about what's best for you. If you have asthma, you may be able to exercise without any problems. But if you have asthma symptoms, such as coughing, wheezing, or shortness of breath, stop exercising and take your rescue inhaler. If you have severe asthma, you may need to take your rescue inhaler before you exercise.


In [None]:
generate_inference(questions[1])

### Question:
  How is obsessive-compulsive disorder diagnosed?

  ### Answer:
  1. Obsessions: Recurrent and persistent thoughts, urges, or images that are experienced as intrusive and inappropriate and that cause anxiety or distress. 2. Compulsions: Repetitive behaviors (such as hand washing, ordering, checking, or cleaning) or mental acts (such as praying, counting, or repeating words silently) that the person feels driven to perform in response to an obsession or according to rules that must be applied rigidly. The behaviors or mental acts are aimed at preventing or reducing distress or preventing some dreaded event or situation; however, these behaviors or mental acts


In [None]:
generate_inference(questions[2])

### Question:
  When are you more likely to get a blood clot?

  ### Answer:
  1. You're more likely to get a blood clot if you're overweight or obese. That's because fat cells make a substance called adiponectin that helps keep your blood from clotting. If you're overweight, you have less of it. 2. If you're pregnant, you're more likely to get a blood clot. That's because your blood flows more slowly during pregnancy. 3. If you're a woman, you're more likely to get a blood clot during your period. That's because your blood flows more slowly


In [None]:
generate_inference(questions[3])

### Question:
  How should you lift objects to prevent back pain?

  ### Answer:
  1. Lift with your legs, not your back. Bend your knees and keep your back straight. 2. Don't twist your body while you lift. 3. Keep the object close to your body. 4. Don't lift anything that's too heavy for you. 5. Don't lift and twist at the same time. 6. Don't lift and reach at the same time. 7. Don't lift and carry at the same time. 8. Don't lift and hold at the same time. 9. Don't lift and pull at


In [None]:
generate_inference(questions[4])

### Question:
  How can you be smart with antibiotics?

  ### Answer:
  1. Be smart with antibiotics. Antibiotics are powerful drugs that can cure bacterial infections. But they don't work on viruses, like the flu or a cold. And if you take them when you don't need them, you can make yourself sicker. You can also make the bacteria in your body more resistant to antibiotics. That means they won't work as well when you really need them. So don't ask for antibiotics when you have a cold or the flu. And don't take them if your doctor says you don't need them.
