# Introduction

The LLM to be fine-tuned on the [SleepQA](https://github.com/IvaBojic/SleepQA/) dataset:

[https://huggingface.co/roneneldan/TinyStories-33M](https://huggingface.co/roneneldan/TinyStories-33M)


The fine-tuned model can be found at:

[https://huggingface.co/ahmedshahriar/SleepQA-TinyStories](https://huggingface.co/ahmedshahriar/SleepQA-TinyStories)

# Libraries

In [1]:
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets
!pip install -q evaluate
# !pip install -q einops


In [2]:
!nvidia-smi

Sun Aug  6 05:15:10 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   39C    P8     9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
!nvidia-smi -L

GPU 0: Tesla T4 (UUID: GPU-2616da5a-15d6-9d3d-0626-cdada0a24f10)


In [None]:
!lscpu |grep 'Model name'

Model name:                      Intel(R) Xeon(R) CPU @ 2.00GHz


In [None]:
# number of cores per socket
!lscpu | grep 'Core(s) per socket:'

Core(s) per socket:              1


In [None]:
# number of threads per core
!lscpu | grep 'Thread(s) per core'

Thread(s) per core:              2


In [None]:
# total memory available to the runtime
!free -h --si | awk  '/Mem:/{print $2}'

12G


In [103]:
import re
import gc
import json
import math
import random
import torch

import collections
import numpy as np
import pandas as pd

from ast import literal_eval
from tqdm.auto import tqdm

from transformers import (pipeline, AutoTokenizer, AutoModelForCausalLM,
                          TrainingArguments, Trainer, DataCollatorForLanguageModeling,
                          )
from datasets import load_dataset
from evaluate import load, evaluator

from sklearn.metrics import f1_score

SEED=42

def setting_seed(seed):
  # Seed NumPy, Python's random module, and PyTorch for reproducibility
  np.random.seed(seed)
  random.seed(seed)
  torch.manual_seed(seed)

setting_seed(SEED)

In [4]:
df_sleep_train = pd.read_csv("https://raw.githubusercontent.com/IvaBojic/SleepQA/main/data/training/sleep-train.csv",
                 delimiter="\t",
                 header=None,
                 names=['question', 'answer'],
                #  converters={"answer": lambda x: x.strip('"[]"')}
                             )

In [None]:
df_sleep_train.answer

0               academic performance, behavior, and mood.
1       can reveal whether someone's sleep problems mi...
2       two pressure settings - inhalation positive ai...
3          a firmer mattress and a pillow with a low loft
4                                   sleeping on the right
                              ...                        
3995    when the airflow from breathing causes floppy ...
3996    shift workers who experience swd symptoms for ...
3997    mostly during the second half of the sleep period
3998    they experienced an increase followed by a dec...
3999    a decreased risk of stroke, heart attack, hear...
Name: answer, Length: 4000, dtype: object

In [None]:
df_sleep_train.answer.apply(literal_eval).map(len).max()

1

In [5]:
# https://huggingface.co/docs/datasets/v1.1.2/loading_datasets.html#csv-files

base_url = "https://raw.githubusercontent.com/IvaBojic/SleepQA/main/data/training/"

dataset_sleep_raw = load_dataset("csv", data_files={"train": base_url + "sleep-train.csv",
                                          "validation": base_url + "sleep-dev.csv",
                                          "test": base_url + "sleep-test.csv"},
                       delimiter="\t",
                       header=None,
                       names=['question', 'answer'],
                       converters={"answer": lambda x: x.strip('"[]"')}
                      )

Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/155k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/20.6k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/19.5k [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/3 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]
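
The `converters` argument above removes the list wrapper from each answer with `str.strip`, while the earlier pandas check parses the same column with `literal_eval`. A minimal sketch of the difference, assuming a raw cell looks like the hypothetical `raw` string below (a one-element list literal):

```python
from ast import literal_eval

# Hypothetical raw cell, assumed to be a one-element list literal.
raw = '["a firmer mattress and a pillow with a low loft"]'

# literal_eval parses the text into a real Python list.
parsed = literal_eval(raw)

# str.strip removes any of the characters ", [ and ] from both ends.
stripped = raw.strip('"[]"')

print(parsed)
print(stripped)
```

Note that `strip` works on characters rather than structure, so an answer that itself starts or ends with a quote or bracket would lose that character as well.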

In [6]:
dataset_sleep_raw

DatasetDict({
    train: Dataset({
        features: ['question', 'answer'],
        num_rows: 4000
    })
    validation: Dataset({
        features: ['question', 'answer'],
        num_rows: 500
    })
    test: Dataset({
        features: ['question', 'answer'],
        num_rows: 500
    })
})

In [7]:
dataset_sleep_with_ctx = load_dataset("json", data_files={"train": base_url + "sleep-train.json",
                                          "validation": base_url + "sleep-dev.json",
                                          "test": base_url + "sleep-test.json"
                                          },
                      #  delimiter="\t",
                      #  header=None,
                      #  names=['question', 'answers'],
                      #  converters={"answers": lambda x: x.strip('"[]"')}
                      )

Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/2.01M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/245k [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/3 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

In [8]:
dataset_sleep_with_ctx

DatasetDict({
    train: Dataset({
        features: ['question', 'positive_ctxs', 'negative_ctxs', 'answers'],
        num_rows: 4000
    })
    validation: Dataset({
        features: ['question', 'positive_ctxs', 'negative_ctxs', 'answers'],
        num_rows: 500
    })
    test: Dataset({
        features: ['question', 'positive_ctxs', 'negative_ctxs', 'answers'],
        num_rows: 0
    })
})

In [9]:
dataset_sleep_raw['train']['question'][:5], dataset_sleep_raw['train']['answer'][:5]

(['what can lack of sleep in children impact?',
  'what is the purpose of the light sensor in the watch?',
  'how many pressure settings do bipap machines have?',
  'what do stomach sleepers tend to require?',
  'what can increase pressure on internal organs?'],
 ['academic performance, behavior, and mood.',
  "can reveal whether someone's sleep problems might be due to an overly bright bedroom or insufficient light during the day",
  'two pressure settings - inhalation positive airway pressure (ipap) and exhalation positive airway pressure (epap) - that allow for lower pressure levels during exhalation.',
  'a firmer mattress and a pillow with a low loft',
  'sleeping on the right'])

In [10]:
# Define a function to concatenate question and answer
def concat_question_answer(data):
    concatenated = "Given the question delimited by triple backticks \
                    ```{" + data['question'] + "}```, what is the answer? \
                    Answer: {" + data['answer'] + "}"
    return {"text": concatenated}

# Apply the function to create the new column
dataset_sleep = dataset_sleep_raw.map(concat_question_answer)

Map:   0%|          | 0/4000 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]
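
Because the backslash continuations sit inside the string literal, the indentation of the continued lines becomes part of the prompt, which is why the generations further down contain long runs of spaces. A minimal sketch of an alternative that uses adjacent string literals instead; the helper name `build_prompt` is hypothetical, and a model trained on the original prompt would need retraining to match this format:

```python
TICKS = "`" * 3  # triple backticks, built this way to keep the listing readable

def build_prompt(question, answer=None):
    """Build the QA prompt without the indentation whitespace."""
    prompt = (
        "Given the question delimited by triple backticks "
        + TICKS + "{" + question + "}" + TICKS
        + ", what is the answer? "
        "Answer:"
    )
    # Append the completion only when building training text.
    if answer is not None:
        prompt += " {" + answer + "}"
    return prompt

print(build_prompt("what is a power nap?"))
```

Calling `build_prompt(question)` without an answer gives the inference-time prompt, so training and generation share one template.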

In [11]:
# Loading model and tokenizer

# https://huggingface.co/roneneldan/TinyStories-33M

model_name = "roneneldan/TinyStories-33M"
model_prefix = "TinyStories"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to("cuda")

Downloading (…)okenizer_config.json:   0%|          | 0.00/722 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/798k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/438 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/968 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/291M [00:00<?, ?B/s]

In [12]:
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))

Using pad_token, but it is not set yet.


In [13]:
# prepare the data for training
def prepare_train_data(data):
    # prompt + completion
    text_input = data['text']
    # tokenize the input (prompt + completion) text
    tokenized_input = tokenizer(text_input,
                                return_tensors='pt',
                                # max_length,
                                # padding=True,

                                padding='max_length',
                                truncation=True,
                                max_length=1024

                                # truncation='only_first',
                                # max_length=512
                                )
    # generative models: labels are the same as the input
    tokenized_input['labels'] = tokenized_input['input_ids']
    return tokenized_input

dataset_sleep_tokenized = dataset_sleep.map(prepare_train_data,
                                     batched=True,
                                     remove_columns=dataset_sleep["train"].column_names)

Map:   0%|          | 0/4000 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]
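
With `padding='max_length'`, most of each 1024-token row is padding, and copying `input_ids` into `labels` means the padded positions also count toward the loss. The cross-entropy loss used by the causal LM heads in `transformers` skips label positions set to `-100`, so a common variation is to mask padding with that value. A minimal sketch on plain lists, assuming the tokenizer is called without `return_tensors` so that it returns lists; `mask_padding_labels` is a hypothetical helper, not part of the original notebook:

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss

def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids to labels, replacing padded positions with -100."""
    return [
        [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]

# Toy batch: 50257 stands in for the [PAD] id (an assumption for illustration).
input_ids = [[10, 11, 12, 50257, 50257], [20, 21, 50257, 50257, 50257]]
attention_mask = [[1, 1, 1, 0, 0], [1, 1, 0, 0, 0]]
labels = mask_padding_labels(input_ids, attention_mask)
print(labels)
```

Inside `prepare_train_data`, the equivalent step would pass `tokenized_input['input_ids']` and `tokenized_input['attention_mask']` to this helper and store the result under `'labels'`.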

In [14]:
dataset_sleep_tokenized

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 4000
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 500
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 500
    })
})

In [15]:
len(dataset_sleep_tokenized["train"]['input_ids']), len(dataset_sleep['train'])

(4000, 4000)

In [16]:
training_arguments = TrainingArguments(
    'SleepQA-'+model_prefix,
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    num_train_epochs=2,
    weight_decay=0.01,
    fp16=True,
    optim="adafactor",
    gradient_accumulation_steps=4,
    gradient_checkpointing=True
)
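
These settings imply an effective batch size of 8 sequences per optimizer step. A quick check of the resulting number of optimizer steps, assuming a single GPU and the 4,000-row training split shown above:

```python
import math

train_rows = 4000
per_device_batch = 2
grad_accum = 4
epochs = 2
num_devices = 1  # assumption: a single GPU

# Sequences consumed per optimizer update.
effective_batch = per_device_batch * grad_accum * num_devices

# Mini-batches per epoch, then optimizer updates per epoch.
batches_per_epoch = math.ceil(train_rows / (per_device_batch * num_devices))
steps_per_epoch = batches_per_epoch // grad_accum
total_steps = steps_per_epoch * epochs

print(effective_batch, steps_per_epoch, total_steps)  # 8 500 1000
```

With the default `logging_steps` of 500, a run of 1000 steps logs the training loss twice, at steps 500 and 1000.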

# Training

In [17]:
%%time

# The operator 'aten::cumsum.out' is not currently supported on the MPS
# https://stackoverflow.com/a/72416727/11105356

trainer = Trainer(
    model = model,
    args = training_arguments,
    train_dataset=dataset_sleep_tokenized["train"], # .select(range(100))
    eval_dataset=dataset_sleep_tokenized["validation"], # .select(range(50))
    # data_collator=data_collator,
)

trainer.train()
trainer.save_model()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


Step,Training Loss
500,0.2045
1000,0.1011


CPU times: user 13min 8s, sys: 4.63 s, total: 13min 12s
Wall time: 14min 12s


# Evaluation Training

In [18]:
%%time

trainer.evaluate()

CPU times: user 15.5 s, sys: 77 ms, total: 15.6 s
Wall time: 15.8 s


{'eval_loss': 0.10811787098646164,
 'eval_runtime': 15.8423,
 'eval_samples_per_second': 31.561,
 'eval_steps_per_second': 3.977,
 'epoch': 2.0}

In [19]:
# perplexity
# https://huggingface.co/docs/transformers/v4.31.0/en/tasks/language_modeling#train

eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.4f}")
# Perplexity: 1.1142

Perplexity: 1.1142
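
Perplexity here is the exponential of the mean evaluation loss. Since the sequences are padded to 1024 tokens and the labels include the padded positions, that mean is taken over mostly padding, which can make the figure look better than the model's quality on real text. A toy illustration with made-up per-token losses (all numbers below are hypothetical):

```python
import math

seq_len = 1024
real_tokens = 40   # hypothetical number of non-padding tokens
loss_real = 2.5    # hypothetical mean loss on real tokens
loss_pad = 0.01    # hypothetical mean loss on padding tokens

pad_tokens = seq_len - real_tokens
mean_all = (real_tokens * loss_real + pad_tokens * loss_pad) / seq_len

ppl_all = math.exp(mean_all)    # perplexity over every position
ppl_real = math.exp(loss_real)  # perplexity over real tokens only

print(f"{ppl_all:.4f} vs {ppl_real:.4f}")
```

In this toy case the perplexity over all positions stays close to 1 even though the perplexity on the real tokens is above 12.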


In [35]:
dataset_sleep_tokenized['validation'].select(range(5))

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 5
})

In [20]:
trainer.args

TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=True,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'fsdp_min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=4,
gradient_checkpointing=True,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_privat

In [74]:
import gc

gc.collect()
torch.cuda.empty_cache()

# Evaluation Text Generation

## Original Model

In [24]:
model_old = AutoModelForCausalLM.from_pretrained("roneneldan/TinyStories-33M",
                                         low_cpu_mem_usage=True).to("cpu")
prompt = 'Given the question delimited by triple backticks \
          ```{what do stomach sleepers tend to require}```, what is the answer? \
          Answer:'
generator = pipeline('text-generation',
                     model=model_old,
                     tokenizer=tokenizer,
                     do_sample=False)
result = generator(prompt)
print(result)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Input length of input_ids is 51, but `max_length` is set to 20. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.


[{'generated_text': 'Given the question delimited by triple backticks           ```{what do stomach sleepers tend to require}```, what is the answer?           Answer: Yes'}]


## Fine Tuned Model

In [110]:
# tokenizer = AutoModelForCausalLM.from_pretrained("AutoModelForCausalLM")
model_sleep = AutoModelForCausalLM.from_pretrained("SleepQA-TinyStories",
                                          low_cpu_mem_usage=True).to("cpu")

prompt = 'Given the question delimited by triple backticks \
          ```{what do stomach sleepers tend to require}```, what is the answer? \
          Answer:'

generator = pipeline('text-generation',
                      model=model_sleep,
                      tokenizer=tokenizer,
                      do_sample=False)
result = generator(prompt, max_length=128)
display(result)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Given the question delimited by triple backticks           ```{what do stomach sleepers tend to require}```, what is the answer?           Answer: {a person's sleep}}}}}}}}"}]
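
The generated text repeats the prompt and ends in a run of closing braces, so the answer has to be separated out before it can be compared with a reference. A minimal sketch, assuming the `Answer: {...}` format used in the training prompt; `extract_answer` is a hypothetical helper:

```python
import re

def extract_answer(generated_text):
    """Return the text inside the first {...} after the last 'Answer:'."""
    tail = generated_text.rsplit("Answer:", 1)[-1]
    match = re.search(r"\{([^{}]*)\}", tail)
    if match:
        return match.group(1).strip()
    # Fall back to the raw tail with stray braces and spaces removed.
    return tail.strip(" {}")

sample = "what is the answer?           Answer: {a person's sleep}}}}}}}"
print(extract_answer(sample))
```

The fallback branch covers outputs without braces, such as the base model's `Answer: Yes`.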

### Push To Hub

In [None]:
from huggingface_hub import login, Repository, get_full_repo_name

login()

In [None]:
model_name = "SleepQA-TinyStories"
repo_name = get_full_repo_name(model_name)
display(repo_name)

tokenizer.save_pretrained("SleepQA-TinyStories")
model_sleep.push_to_hub("SleepQA-TinyStories")
tokenizer.push_to_hub("SleepQA-TinyStories")

### Sample QA

In [109]:
prompt = 'what do stomach sleepers tend to require?'

generator = pipeline('text-generation',
                      model=model_sleep,
                      tokenizer=tokenizer,
                      do_sample=False)
result = generator(prompt, max_length=128)
display(result)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'what do stomach sleepers tend to require?”                                          Answer: {a few hours of sleep}}}}'}]

##### Train Set

In [115]:
# https://huggingface.co/learn/nlp-course/chapter5/3#slicing-and-dicing-our-data

sample_train_data = dataset_sleep['train'].shuffle(seed=SEED).select(range(10))

# Build the pipeline once and reuse it for every example
generator = pipeline('text-generation',
                      model=model_sleep,
                      tokenizer=tokenizer,
                      do_sample=False)

for example in sample_train_data:
  prompt = 'Given the question delimited by triple backticks \
          ```{'+example['question']+'}```, what is the answer? \
          Answer:'
  true_ans = example['answer']

  result = generator(prompt, max_length=128)
  display(result, '\n\n', 'True Answer:', true_ans)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Given the question delimited by triple backticks           ```{what events may affect our dreams?}```, what is the answer?           Answer: {a person's sleep-wake cycle}}}}}}"}]

'\n\n'

'True Answer:'

'major life changes, such as pregnancy or trauma'

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Given the question delimited by triple backticks           ```{what materials are preferred for intimate activities?}```, what is the answer?           Answer: {sleepers}}}}}}}}'}]

'\n\n'

'True Answer:'

'materials with a bouncier feel like latex and coils'

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Given the question delimited by triple backticks           ```{what should you do if you are struggling with getting good rest?}```, what is the answer?           Answer: {a condition of your sleep}}}}}}'}]

'\n\n'

'True Answer:'

'take an inventory of your entire pre-sleep routine.'

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Given the question delimited by triple backticks           ```{what is a power nap?}```, what is the answer?           Answer: {a type of sleep}}}}}}}}'}]

'\n\n'

'True Answer:'

'a short daytime nap of 30 minutes or less intended to boost energy levels'

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Given the question delimited by triple backticks           ```{what are excellent options for sleepers looking for something soft yet durable?}```, what is the answer?           Answer: {the mattress}}}}}}}}'}]

'\n\n'

'True Answer:'

'bamboo and cotton sheets'

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Given the question delimited by triple backticks           ```{what are hypnic jerks?}```, what is the answer?           Answer: {a sleep apnea (cnea (cexcex.}}}}}}'}]

'\n\n'

'True Answer:'

'sudden, involuntary muscle jerks you may experience as you fall asleep.'

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Given the question delimited by triple backticks           ```{what is the flight or fight response triggered in response to?}```, what is the answer?           Answer: {a person's sleep-wake cycle}}}}}}}"}]

'\n\n'

'True Answer:'

'in response to stress'

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Given the question delimited by triple backticks           ```{how many adults report using sleep aids a few times each week?}```, what is the answer?           Answer: {seven to 20 hours}}}}}}}'}]

'\n\n'

'True Answer:'

'around 8%'

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Given the question delimited by triple backticks           ```{what should your new mattress come with?}```, what is the answer?           Answer: {a mattress} and soft mattress}}}}}}'}]

'\n\n'

'True Answer:'

'a warranty that covers manufacturing and workmanship defects'

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Given the question delimited by triple backticks           ```{how does fragmented sleep affect memory?}```, what is the answer?           Answer: {sleep-up sleep apnea (osanea (a) sleep apnea (a) sleep apnea (a) sleep apnea (a) sleep apnea (a) and wake up at the same time every day}'}]

'\n\n'

'True Answer:'

'negatively affect memory'

##### Validation Set

In [116]:
# https://huggingface.co/learn/nlp-course/chapter5/3#slicing-and-dicing-our-data

sample_val_data = dataset_sleep['validation'].shuffle(seed=SEED).select(range(10))

# Build the pipeline once and reuse it for every example
generator = pipeline('text-generation',
                      model=model_sleep,
                      tokenizer=tokenizer,
                      do_sample=False)

for example in sample_val_data:
  prompt = 'Given the question delimited by triple backticks \
          ```{'+example['question']+'}```, what is the answer? \
          Answer:'
  true_ans = example['answer']

  result = generator(prompt, max_length=128)
  display(result, '\n\n', 'True Answer:', true_ans)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Given the question delimited by triple backticks           ```{how does a stuffy nose during sleep lead to snoring?}```, what is the answer?           Answer: {a type of nose-like and firmness that helps you wake up in the morning}}}}'}]

'\n\n'

'True Answer:'

'by reducing the flow of air through the airway and causing the airway to collapse'

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Given the question delimited by triple backticks           ```{what should you to if suffer from heartburn?}```, what is the answer?           Answer: {a condition that involves a condition of a condition in the brain, as well as a mattress}}}}}}'}]

'\n\n'

'True Answer:'

'avoid fried, spicy, or acidic foods close to bedtime'

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Given the question delimited by triple backticks           ```{why do auto cpap machines less likely to wake a sleeper?}```, what is the answer?           Answer: {because the sleeper's sleep.} and other symptoms}}}}}"}]

'\n\n'

'True Answer:'

"they do not blow out air too forcefully for the sleeper's current sleep stage"

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Given the question delimited by triple backticks           ```{what are over-the-counter sleep aids are composed of?}```, what is the answer?           Answer: {the sleep-wake}}}}}}}'}]

'\n\n'

'True Answer:'

'antihistamines'

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Given the question delimited by triple backticks           ```{how long are body pillows?}```, what is the answer?           Answer: {a period of time}}}}}}}}'}]

'\n\n'

'True Answer:'

'20 by 54 inches'

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Given the question delimited by triple backticks           ```{why are hybrid models usually easier to move on than similarly built foam mattresses?}```, what is the answer?           Answer: {because of the airway caused by a lack of sleep}}}}}'}]

'\n\n'

'True Answer:'

'the coils in hybrid mattresses add bounce'

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Given the question delimited by triple backticks           ```{what do high-quality materials mean for bedsheets?}```, what is the answer?           Answer: {the materials}}}}}}}}'}]

'\n\n'

'True Answer:'

'soft, durable, and breathable sheets'

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Given the question delimited by triple backticks           ```{what age do most children stop napping by?}```, what is the answer?           Answer: {a person's sleep} and sleep}}}}}"}]

'\n\n'

'True Answer:'

'by seven years of age'

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Given the question delimited by triple backticks           ```{how fast does memory foam return to its original shape?}```, what is the answer?           Answer: {the body's natural sleep quality}}}}}}}"}]

'\n\n'

'True Answer:'

'on average, around 5-10 seconds.'

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Given the question delimited by triple backticks           ```{how does sleep talking affect a person's sleep?}```, what is the answer?           Answer: {sleep}}}}}}}"}]

'\n\n'

'True Answer:'

"it doesn't usually have a major effect on the person's sleep"

##### Test Set

In [117]:
# https://huggingface.co/learn/nlp-course/chapter5/3#slicing-and-dicing-our-data

sample_test_data = dataset_sleep['test'].shuffle(seed=SEED).select(range(10))

# Build the pipeline once and reuse it for every example
generator = pipeline('text-generation',
                      model=model_sleep,
                      tokenizer=tokenizer,
                      do_sample=False)

for example in sample_test_data:
  prompt = 'Given the question delimited by triple backticks \
          ```{'+example['question']+'}```, what is the answer? \
          Answer:'
  true_ans = example['answer']

  result = generator(prompt, max_length=128)
  display(result, '\n\n', 'True Answer:', true_ans)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Given the question delimited by triple backticks           ```{what is talalay latex typically reserved for in mattresses?}```, what is the answer?           Answer: {a condition and high-quality sleep-related levels of sleep}}}}}'}]

'\n\n'

'True Answer:'

'comfort layers in mattresses'

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Given the question delimited by triple backticks           ```{what happens to memory foam when coming into contact with body heat and pressure?}```, what is the answer?           Answer: {the body temperature and pressure}}}}}}}'}]

'\n\n'

'True Answer:'

'memory foam will slowly conform to meet the shape of the body'

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Given the question delimited by triple backticks           ```{how can sleep be used to cope with the stressors of a person's life?}```, what is the answer?           Answer: {the risk of a person's sleep}}}}}"}]

'\n\n'

'True Answer:'

'achieving better sleep is one way to cope with its stressors'

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Given the question delimited by triple backticks           ```{what would doing away with the siesta allow?}```, what is the answer?           Answer: {a condition that comes from the brain}}}}}}'}]

'\n\n'

'True Answer:'

'many workers to end their workdays earlier'

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Given the question delimited by triple backticks           ```{what can delay the onset of melatonin?}```, what is the answer?           Answer: {a condition that causes the sleep-related sleepers, and the sleep-wake cycle}}}}'}]

'\n\n'

'True Answer:'

'bright lights from electronic screens and even household lighting'

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Given the question delimited by triple backticks           ```{what impacts the price-point of a platform bed?}```, what is the answer?           Answer: {the bed's a barrier} and the higher the platform}}}}"}]

'\n\n'

'True Answer:'

'the quality of the materials, manufacturing location, brand, and size all impact the price-point.'

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Given the question delimited by triple backticks           ```{what does 'weekend migraine' refer to?}```, what is the answer?           Answer: {a type of sleep apnea (osa}}}}}}"}]

'\n\n'

'True Answer:'

'migraines that commonly occur in individuals sleeping in on weekends to make up for lost sleep during the week'

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Given the question delimited by triple backticks           ```{when should you consider delaying the transition from a crib to a bed for a toddler?}```, what is the answer?           Answer: {when you're having a bedtime.}}}}}}"}]

'\n\n'

'True Answer:'

'if your toddler is in the middle of potty training, or another big transition like starting daycare or a family move'

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Given the question delimited by triple backticks           ```{what is talalay latex?}```, what is the answer?           Answer: {a condition that is found to be a part of the night)}}}}}}}'}]

'\n\n'

'True Answer:'

'a softer, airier form of latex produced through the talalay process'

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Given the question delimited by triple backticks           ```{what does sleep affect?}```, what is the answer?           Answer: {a person's sleep disorder, such as sleep}, and sleep}}}}}"}]

'\n\n'

'True Answer:'

'various aspects of your overall health, from your mood to your immune system'
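
Comparisons like the ones above can be scored automatically once the answer is separated from the prompt. A minimal sketch of SQuAD-style exact match and token-level F1 in plain Python; the function names are hypothetical, and the normalization (lowercasing, dropping punctuation and articles) is an assumption modelled on the usual SQuAD evaluation:

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, truth):
    """1 if the normalized strings are identical, else 0."""
    return int(normalize(prediction) == normalize(truth))

def token_f1(prediction, truth):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    if not pred_tokens or not true_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == true_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)

pred = "the body temperature and pressure"
truth = "memory foam will slowly conform to meet the shape of the body"
print(exact_match(pred, truth), round(token_f1(pred, truth), 3))
```

Token-level F1 gives partial credit for overlapping words, which is more informative than exact match when generated answers rarely reproduce the reference verbatim.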

# Evaluation QA

In [120]:
'max answer length', max((len(l) for l in dataset_sleep_with_ctx['train']['answers']))

# for example in dataset_sleep_with_ctx['validation']:
#     context = example['positive_ctxs']
#     print(context[0]['text'])

('max answer length', 1)

## Train

In [122]:
%%time

# helper function
# https://github.com/fastforwardlabs/ff14_blog/blob/master/_notebooks/2020-06-09-Evaluating_BERT_on_SQuAD.ipynb

# these functions are heavily influenced by the HF squad_metrics.py script
def normalize_text(s):
    """Removing articles and punctuation, and standardizing whitespace are all typical text processing steps."""
    import string, re

    def remove_articles(text):
        regex = re.compile(r"\b(a|an|the)\b", re.UNICODE)
        return re.sub(regex, " ", text)

    def white_space_fix(text):
        return " ".join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return "".join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))

def compute_exact_match(prediction, truth):
    return int(normalize_text(prediction) == normalize_text(truth))

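# Load the base tokenizer and wrap the fine-tuned checkpoint in a question-answering pipeline.
# The checkpoint has no QA head, so `qa_outputs` is newly initialized (see the warning in the output).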
tokenizer = AutoTokenizer.from_pretrained("roneneldan/TinyStories-33M")
qa_pipeline = pipeline("question-answering",
                       model="SleepQA-TinyStories",  # distilbert-base-cased-distilled-squad SleepQA-palmyra-small
                       tokenizer=tokenizer, device='cuda')

predicted_answers = []
true_answers = []

# Evaluate the model on the train split
for example in dataset_sleep_with_ctx['train']:
    # question = 'Given the question delimited by triple backticks \
    #       ```{'+ example['question'] +'}```, what is the answer? \
    #       Answer:'

    question = example['question']
    context = example['positive_ctxs'][0]['text'] + example['negative_ctxs'][0]['text']
    true_answer = example['answers'][0]


    # Tokenize the input
    # inputs = tokenizer.encode_plus(question, context, return_tensors='pt')

    # outputs = model_sleep(**inputs)
    # answer_start = torch.argmax(outputs[0])  # get the most likely beginning of answer with the argmax of the score
    # answer_end = torch.argmax(outputs[1]) + 1

    # answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(inputs['input_ids'][0][answer_start:answer_end]))


    # Make predictions
    # print(question, context)
    answer = qa_pipeline(question, context)

    # Append predicted and true answers
    predicted_answers.append(answer["answer"])
    true_answers.append(true_answer)

    # Print the predicted answer

    # display(f"Question: {question}")
    # display(f"Context: {context}")
    # display(f"Predicted Answer: {answer['answer']}\n")
    # display(f"True Answer: {true_answer}\n")


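# Calculate F1 score
# Note: with average="macro" over whole normalized answer strings, each distinct string is
# treated as a class label. This is a classification-style F1, not the SQuAD token-overlap F1.
f1 = f1_score([normalize_text(ans) for ans in true_answers], [normalize_text(ans) for ans in predicted_answers], average="macro")

# Calculate exact match (EM) score using the helper defined above
exact_match = sum(compute_exact_match(pred, true) for true, pred in zip(true_answers, predicted_answers)) / len(true_answers)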

display(f"F1 Score: {f1:.4f}")
display(f"Exact Match (EM) Score: {exact_match:.4f}")

Some weights of GPTNeoForQuestionAnswering were not initialized from the model checkpoint at SleepQA-TinyStories and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


'F1 Score: 0.0012'

'Exact Match (EM) Score: 0.0027'
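
The F1 reported above compares whole answer strings as if they were class labels, so a prediction that shares most of its words with the true answer still scores zero. SQuAD-style evaluation instead measures token overlap between prediction and truth. A minimal sketch of that variant, modeled on the approach in the HF `squad_metrics.py` script that the helper comments reference; `compute_token_f1` and `_normalize` are illustrative names, and `_normalize` is a compact restatement of `normalize_text` so the block runs on its own.

```python
import re
import string
from collections import Counter

_PUNCTUATION = set(string.punctuation)


def _normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def compute_token_f1(prediction: str, truth: str) -> float:
    """Token-overlap F1 between a predicted answer and a reference answer."""
    pred_tokens = _normalize(prediction).split()
    truth_tokens = _normalize(truth).split()

    # If either side is empty, F1 is 1 only when both are empty
    if not pred_tokens or not truth_tokens:
        return float(pred_tokens == truth_tokens)

    # Multiset intersection counts tokens shared by both answers
    num_same = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if num_same == 0:
        return 0.0

    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


partial = compute_token_f1(
    "the talalay process",
    "a softer, airier form of latex produced through the talalay process",
)
```

Averaging `compute_token_f1` over all (prediction, true answer) pairs gives a split-level score that rewards partial matches.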

## Validation

In [123]:
%%time

# helper function
# https://github.com/fastforwardlabs/ff14_blog/blob/master/_notebooks/2020-06-09-Evaluating_BERT_on_SQuAD.ipynb

# these functions are heavily influenced by the HF squad_metrics.py script
def normalize_text(s):
    """Removing articles and punctuation, and standardizing whitespace are all typical text processing steps."""
    import string, re

    def remove_articles(text):
        regex = re.compile(r"\b(a|an|the)\b", re.UNICODE)
        return re.sub(regex, " ", text)

    def white_space_fix(text):
        return " ".join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return "".join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))

def compute_exact_match(prediction, truth):
    return int(normalize_text(prediction) == normalize_text(truth))

# Load pre-trained question answering model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("roneneldan/TinyStories-33M")
qa_pipeline = pipeline("question-answering",
                       model="SleepQA-TinyStories",  # distilbert-base-cased-distilled-squad SleepQA-palmyra-small
                       tokenizer=tokenizer, device='cuda')

predicted_answers = []
true_answers = []

# Evaluate the model on the validation split
for example in dataset_sleep_with_ctx['validation']:
    # question = 'Given the question delimited by triple backticks \
    #       ```{'+ example['question'] +'}```, what is the answer? \
    #       Answer:'

    question = example['question']
    context = example['positive_ctxs'][0]['text'] + example['negative_ctxs'][0]['text']
    true_answer = example['answers'][0]


    # Tokenize the input
    # inputs = tokenizer.encode_plus(question, context, return_tensors='pt')

    # outputs = model_sleep(**inputs)
    # answer_start = torch.argmax(outputs[0])  # get the most likely beginning of answer with the argmax of the score
    # answer_end = torch.argmax(outputs[1]) + 1

    # answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(inputs['input_ids'][0][answer_start:answer_end]))


    # Make predictions
    # print(question, context)
    answer = qa_pipeline(question, context)

    # Append predicted and true answers
    predicted_answers.append(answer["answer"])
    true_answers.append(true_answer)

    # Print the predicted answer

    # display(f"Question: {question}")
    # display(f"Context: {context}")
    # display(f"Predicted Answer: {answer['answer']}\n")
    # display(f"True Answer: {true_answer}\n")


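# Calculate F1 score
# Note: with average="macro" over whole normalized answer strings, each distinct string is
# treated as a class label. This is a classification-style F1, not the SQuAD token-overlap F1.
f1 = f1_score([normalize_text(ans) for ans in true_answers], [normalize_text(ans) for ans in predicted_answers], average="macro")

# Calculate exact match (EM) score using the helper defined above
exact_match = sum(compute_exact_match(pred, true) for true, pred in zip(true_answers, predicted_answers)) / len(true_answers)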

display(f"F1 Score: {f1:.4f}")
display(f"Exact Match (EM) Score: {exact_match:.4f}")

Some weights of GPTNeoForQuestionAnswering were not initialized from the model checkpoint at SleepQA-TinyStories and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


'F1 Score: 0.0000'

'Exact Match (EM) Score: 0.0000'
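
The Train and Validation cells differ only in the split they iterate over, so the loop could be factored into one function that takes the examples, a prediction callable, and the metrics to average. A sketch under stated assumptions: `evaluate_split`, `stub_predict`, and the toy examples are hypothetical; the field names (`question`, `positive_ctxs`, `negative_ctxs`, `answers`) follow the cells above; and the two contexts are joined with a space here, whereas the original cells concatenate them directly.

```python
from typing import Callable, Dict, Iterable, Mapping


def evaluate_split(
    examples: Iterable[Mapping],
    predict: Callable[[str, str], str],
    metrics: Mapping[str, Callable[[str, str], float]],
) -> Dict[str, float]:
    """Average each metric(prediction, truth) over the given examples."""
    totals = {name: 0.0 for name in metrics}
    count = 0
    for example in examples:
        question = example["question"]
        # Space-joined so the last word of one passage does not fuse with the next
        context = example["positive_ctxs"][0]["text"] + " " + example["negative_ctxs"][0]["text"]
        truth = example["answers"][0]
        prediction = predict(question, context)
        for name, metric in metrics.items():
            totals[name] += metric(prediction, truth)
        count += 1
    return {name: (total / count if count else 0.0) for name, total in totals.items()}


# Toy examples and a stub predictor standing in for the real dataset and pipeline
toy_examples = [
    {
        "question": "what is talalay latex?",
        "positive_ctxs": [{"text": "Talalay latex is a softer form of latex."}],
        "negative_ctxs": [{"text": "Sleep affects mood."}],
        "answers": ["a softer form of latex"],
    },
    {
        "question": "what does sleep affect?",
        "positive_ctxs": [{"text": "Sleep affects mood."}],
        "negative_ctxs": [{"text": "Talalay latex is soft."}],
        "answers": ["mood"],
    },
]


def stub_predict(question: str, context: str) -> str:
    # Always answers "mood"; a real run would call the QA pipeline here
    return "mood"


scores = evaluate_split(toy_examples, stub_predict, {"exact_match": lambda p, t: float(p == t)})
```

In the notebook, `predict` could be `lambda q, c: qa_pipeline(question=q, context=c)["answer"]` and the metric could be `compute_exact_match`. Note that the warning printed by both cells says `qa_outputs` was newly initialized, so the span-prediction head is untrained, which is likely why both splits score near zero regardless of how the loop is organized.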