<a href="https://colab.research.google.com/github/dapopov-st/RagOverArXiv/blob/main/GenerateQaDatasetWithZephyr3B.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Generate QA Pairs with Zephyr-3b

- Takes a very long time (3-5 minutes per passage) on a T4 and often produces rambling QA pairs.
- The longer context window reduces the need for slicing, so it is possibly a bit simpler to implement. However, the model would not return (q, a) tuples in spite of being prompted to, so the QA pairs would need to be postprocessed with regex.
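
The regex postprocessing mentioned above could look like the following minimal sketch. It assumes the model labels its output with `Question:` and `Answer:`. Only the `Question:` label is visible in the truncated outputs in this notebook, so the `Answer:` label, the `QA_PATTERN` name, and the `extract_qa_pairs` helper are all assumptions, not something the notebook confirms.

```python
import re

# Assumed output layout: "Question: ... Answer: ...", possibly repeated.
QA_PATTERN = re.compile(
    r"Question:\s*(?P<question>.+?)\s*Answer:\s*(?P<answer>.+?)(?=\s*Question:|\Z)",
    re.DOTALL,
)

def extract_qa_pairs(generated: str) -> list[tuple[str, str]]:
    """Return (question, answer) tuples found in the assistant turn of a decoded sequence."""
    # Keep only the text after the last assistant marker, so the echoed prompt is ignored.
    _, _, reply = generated.rpartition("<|assistant|>")
    reply = reply.replace("<|endoftext|>", "")
    return [
        (m["question"].strip(), m["answer"].strip())
        for m in QA_PATTERN.finditer(reply)
    ]

# Hypothetical decoded sequence, made up for illustration only.
sample = (
    "<|user|>\nGenerate a question-answer tuple for the following text: ...<|endoftext|>\n"
    "<|assistant|>\n"
    "Question: What is LoRA?\nAnswer: A low-rank adaptation method.\n"
    "Question: Why use it?\nAnswer: Fewer trainable parameters.<|endoftext|>"
)
pairs = extract_qa_pairs(sample)
print(pairs)
```

If the model emits several questions followed by one summary, as noted for the abstract at the end of this notebook, this pattern would pair only the last question with that summary, so the result would still need a manual check.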

In [None]:
!pip install -q transformers==4.35.2 accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 trl==0.4.7 datasets==2.10.1 wandb==0.16.0


In [None]:
!pip install ctransformers[cuda]



In [None]:
!pip install pdfminer.six



In [None]:
import os
from google.colab import drive
drive.mount('/content/drive/')

output_dir = '/content/drive/MyDrive/PdfRag/finetune_output_dir'
logging_dir = '/content/drive/MyDrive/PdfRag/finetune_logging_dir'
index_dir = '/content/drive/MyDrive/PdfRag/finetune_index_dir'
papers_dir = '/content/drive/MyDrive/PdfRag/clusterofstars'
data_dir = '/content/drive/MyDrive/PdfRag/data'
%cd /content/drive/MyDrive/PdfRag
!ls ./clusterofstars

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).
/content/drive/MyDrive/PdfRag
'Atlas:  Few-shot Learning with Retrieval Augmented Language Models.pdf'
 ChainOfThought
'Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.pdf'
'Chain-of-Verification Reduces Hallucination in Large Language Models.pdf'
'DSPy:  Compiling Declarative Language Model Calls into Self-Improving Pipelines.pdf'
'Evaluating Large Language Models Trained on Code.pdf'
 ExcludeSurveysAndLitReviews
'FlashAttention:  Fast and Memory-Efficient Exact Attention with IO-Awareness.pdf'
'From Sparse to Dense:  GPT-4 Summarization with Chain of Density Prompting.pdf'
'Graph of Thoughts: Solving Elaborate Problems with Large Language Models.pdf'
'In-Context Retrieval-Augmented Language Models.pdf'
'Least-to-Most Prompting Enables Complex Reasoning in Large Language Models.pdf'
'Leveraging Passage Retrieval with Generative Models

In [None]:
from huggingface_hub import notebook_login

notebook_login()


In [None]:
import sys,gc,traceback
import torch
def clean_ipython_hist():
    # Code in this function mainly copied from IPython source
    if not 'get_ipython' in globals(): return
    ip = get_ipython()
    user_ns = ip.user_ns
    ip.displayhook.flush()
    pc = ip.displayhook.prompt_count + 1
    for n in range(1, pc): user_ns.pop('_i'+repr(n),None)
    user_ns.update(dict(_i='',_ii='',_iii=''))
    hm = ip.history_manager
    hm.input_hist_parsed[:] = [''] * pc
    hm.input_hist_raw[:] = [''] * pc
    hm._i = hm._ii = hm._iii = hm._i00 =  ''



def clean_tb():
    # h/t Piotr Czapla
    if hasattr(sys, 'last_traceback'):
        traceback.clear_frames(sys.last_traceback)
        delattr(sys, 'last_traceback')
    if hasattr(sys, 'last_type'): delattr(sys, 'last_type')
    if hasattr(sys, 'last_value'): delattr(sys, 'last_value')

def clean_mem():
    clean_tb()
    clean_ipython_hist()
    gc.collect()
    torch.cuda.empty_cache()
clean_mem()

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
import json

from llama_index import SimpleDirectoryReader
from llama_index.node_parser import SimpleNodeParser
from llama_index.schema import MetadataMode

In [None]:
TRAIN_FILES = ['./clusterofstars/LoRA: Low-Rank Adaptation of Large Language Models.pdf']
VAL_FILES = ['./clusterofstars/QLORA: Efficient Finetuning of Quantized LLMs.pdf']

TRAIN_CORPUS_FPATH = './data/train_corpus.json'
VAL_CORPUS_FPATH = './data/val_corpus.json'

In [None]:
from pdfminer.high_level import extract_text
import re
import pandas as pd
text = extract_text(TRAIN_FILES[0])

In [None]:
def mostly_numbers(text, threshold=0.5):
    # Count digits plus characters typical of numeric tables: parentheses, ±, /, periods, and spaces
    numbers = re.findall(r'[\d()±/. ]', text)
    percentage = len(numbers) / len(text) if text else 1
    return percentage >= threshold
def starts_with_figure_or_table(text):
  pattern = r"^(Figure|Table)\s+\d+\s*:"
  return bool(re.match(pattern, text))

def ends_with_period_or_number(text):
    """...or dash"""
    pattern = r"[.\d-]$"
    return bool(re.search(pattern, text))

def join_hyphenated_words(text):
    pattern = r"([a-z]+)-\n\s*([a-z]+)"
    return re.sub(pattern, r'\1\2', text)

def ignore_footnote(text):
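    """Return True if the text contains a footnote-like line (a digit directly followed by a capitalized sentence).

    Works OK. TODO: also trim the footnote text that carries over onto the next line."""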
    pattern = r"\n\d[A-Z][^.]*\."
    return bool(re.search(pattern, text))


from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
from pdfminer.pdfpage import PDFPage
# PDFResourceManager is defined in pdfminer.pdfinterp
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.layout import LTTextBoxHorizontal
def parse_paragraphs(min_len_to_keep=50):
  """Collect body paragraphs of TRAIN_FILES[0], from the abstract up to the references."""
  # Paragraphs go into a list: DataFrame.append was removed in pandas 2.0
  paragraphs = []
  #Create resource manager
  rsrcmgr = PDFResourceManager()
  # Set parameters for analysis.
  laparams = LAParams()
  # Create a PDF page aggregator object.
  device = PDFPageAggregator(rsrcmgr, laparams=laparams)
  interpreter = PDFPageInterpreter(rsrcmgr, device)
  seen_abstract = False
  buffer = ''
  # The context manager closes the PDF even on the early return below
  with open(TRAIN_FILES[0], 'rb') as document:
    for page in PDFPage.get_pages(document):
      interpreter.process_page(page)
      # receive the LTPage object for the page.
      layout = device.get_result()

      for element in layout:
        if not isinstance(element, LTTextBoxHorizontal): continue
        text = join_hyphenated_words(element.get_text())

        # Stop at the reference list
        if "REFERENCES\n" in text.upper(): return pd.DataFrame({'abstract': paragraphs})

        # Skip everything before the abstract
        if not seen_abstract and 'ABSTRACT' not in text.upper(): continue
        seen_abstract = True

        if (mostly_numbers(text) or starts_with_figure_or_table(text) or text.startswith('*')
            or ignore_footnote(text)):
          continue
        elif (text[-1] == '-' or text[-1]==':' or len(text)<10
              or not ends_with_period_or_number(text)):
          # Headings and fragments are held back and prepended to the next full paragraph
          buffer += text+" "
        else:
          if buffer:
            text = buffer + " "+ text
            buffer = ''
          if paragraphs and paragraphs[-1].endswith('-'): paragraphs[-1] += text
          elif len(text)>min_len_to_keep: paragraphs.append(text.strip())
  return pd.DataFrame({'abstract': paragraphs})
df=parse_paragraphs()

In [None]:
df

Unnamed: 0,abstract
0,ABSTRACT\n An important paradigm of natural l...
1,INTRODUCTION\n Many applications in natural l...
2,Many sought to mitigate this by adapting only ...
3,often introduce inference latency (Houlsby et ...
4,We take inspiration from Li et al. (2018a); Ag...
...,...
58,"To answer these questions, we project W onto t..."
59,∆Wq Wq\n Random ∆Wq Wq\n Random\n ||U (cid:62...
60,We draw several conclusions from Table 7. Firs...
61,8 CONCLUSION AND FUTURE WORK\n Fine-tuning en...
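
Row 59 above shows table residue slipping past the filters, including a `(cid:62` marker, which appears to be what pdfminer emits for a glyph it cannot map to Unicode. One possible extra filter is sketched below. The `CID_MARKER` pattern, the `looks_like_residue` helper, and the sample strings are hypothetical and not part of the original pipeline.

```python
import re

# pdfminer-style placeholder for an unmapped glyph, e.g. "(cid:62)"
CID_MARKER = re.compile(r"\(cid:\d+\)")

def looks_like_residue(paragraph: str, max_markers: int = 0) -> bool:
    """Return True when the paragraph holds more than `max_markers` (cid:N) markers."""
    return len(CID_MARKER.findall(paragraph)) > max_markers

# Hypothetical paragraphs for illustration; the second one imitates table residue.
paragraphs = [
    "We draw several conclusions from Table 7.",
    "Random x (cid:62) y Random",
]
kept = [p for p in paragraphs if not looks_like_residue(p)]
print(kept)
```

On the DataFrame itself, the same check could be applied with `df[~df['abstract'].map(looks_like_residue)]`.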


In [None]:
MODEL_PATH = "stabilityai/stablelm-zephyr-3b"

In [None]:
import transformers
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM
# device_map is a model-loading argument, not a config argument, so it is not passed here
model_config = AutoConfig.from_pretrained(MODEL_PATH, trust_remote_code=True)

In [None]:
model_config

StableLMEpochConfig {
  "_name_or_path": "stabilityai/stablelm-zephyr-3b",
  "architectures": [
    "StableLMEpochForCausalLM"
  ],
  "auto_map": {
    "AutoConfig": "stabilityai/stablelm-zephyr-3b--configuration_stablelm_epoch.StableLMEpochConfig",
    "AutoModelForCausalLM": "stabilityai/stablelm-zephyr-3b--modeling_stablelm_epoch.StableLMEpochForCausalLM"
  },
  "bos_token_id": 0,
  "eos_token_id": 0,
  "hidden_act": "silu",
  "hidden_size": 2560,
  "initializer_range": 0.02,
  "intermediate_size": 6912,
  "max_position_embeddings": 4096,
  "model_type": "stablelm_epoch",
  "norm_eps": 1e-05,
  "num_attention_heads": 32,
  "num_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "rope_pct": 0.25,
  "rope_theta": 10000,
  "rotary_scaling_factor": 1.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.35.2",
  "use_cache": true,
  "vocab_size": 50304
}

In [None]:
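# NOTE: from_config builds the architecture with randomly initialized weights; it does not load
# the pretrained checkpoint. The pretrained weights are loaded with from_pretrained in the next cell.
model = transformers.AutoModelForCausalLM.from_config(model_config)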
model.eval()
tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_PATH)

The repository for stabilityai/stablelm-zephyr-3b contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co/stabilityai/stablelm-zephyr-3b.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


A new version of the following files was downloaded from https://huggingface.co/stabilityai/stablelm-zephyr-3b:
- modeling_stablelm_epoch.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('stabilityai/stablelm-zephyr-3b')
model = AutoModelForCausalLM.from_pretrained(
    'stabilityai/stablelm-zephyr-3b',
    trust_remote_code=True,
    device_map="auto"
)

prompt = [{'role': 'user', 'content': """Generate a question-answer tuple for
            the following text We take inspiration from Li et al. (2018a); Aghajanyan et al. (2020) which show that the learned
            over-parametrized models in fact reside on a low intrinsic dimension. We hypothesize that the
            change in weights during model adaptation also has a low “intrinsic rank”, leading to our proposed
            Low-Rank Adaptation (LoRA) approach. LoRA allows us to train some dense layers in a neural
            network indirectly by optimizing rank decomposition matrices of the dense layers’ change during
            adaptation instead, while keeping the pre-trained weights frozen, as shown in Figure 1. Using GPT-3
            175B as an example, we show that a very low rank (i.e., r in Figure 1 can be one or two) sufﬁces even
            when the full rank (i.e., d) is as high as 12,288, making LoRA both storage- and compute-efﬁcient."""}]
inputs = tokenizer.apply_chat_template(
    prompt,
    add_generation_prompt=True,
    return_tensors='pt'
)

tokens = model.generate(
    inputs.to(model.device),
    max_new_tokens=1024,
    temperature=0.8,
    do_sample=True
)

print(tokenizer.decode(tokens[0], skip_special_tokens=False))

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


<|user|>
Generate a question-answer tuple for
            the following text We take inspiration from Li et al. (2018a); Aghajanyan et al. (2020) which show that the learned
            over-parametrized models in fact reside on a low intrinsic dimension. We hypothesize that the
            change in weights during model adaptation also has a low “intrinsic rank”, leading to our proposed
            Low-Rank Adaptation (LoRA) approach. LoRA allows us to train some dense layers in a neural
            network indirectly by optimizing rank decomposition matrices of the dense layers’ change during
            adaptation instead, while keeping the pre-trained weights frozen, as shown in Figure 1. Using GPT-3
            175B as an example, we show that a very low rank (i.e., r in Figure 1 can be one or two) sufﬁces even
            when the full rank (i.e., d) is as high as 12,288, making LoRA both storage- and compute-efﬁcient.<|endoftext|>
<|assistant|>
Question: How does the Low-Rank Ad

In [None]:
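def generate_qa_tokens(text,model=model,tokenizer=tokenizer):
  """Generate a QA pair for `text` and print the full decoded sequence (prompt included)."""
  # NOTE: `content` is never sent to the model; only `prompt` below is used,
  # so the ("your question", "your answer") format instruction has no effect.
  content = f"""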
  <|user|>
  You are an astute professor making in-depth questions
  and answers for an upcoming exam for a graduate paper reading group
  on Natural Language Processing. You would generate an in-depth question/answer tuple
  in the format of ("your question", "your answer")
  Now generate a question/answer tuple for the following text:
  {text}
  <|assistant|>

  """

  prompt = [{'role': 'user', 'content': f"""Generate a question-answer tuple for
              the following text: {text}."""}]
  inputs = tokenizer.apply_chat_template(
      prompt,
      add_generation_prompt=True,
      return_tensors='pt'
  )

  tokens = model.generate(
      inputs.to(model.device),
      max_new_tokens=1024,
      temperature=0.8,
      do_sample=True
  )

  print(tokenizer.decode(tokens[0], skip_special_tokens=False))


In [None]:
text = """
We take inspiration from Li et al. (2018a); Aghajanyan et al. (2020) which show that the learned
over-parametrized models in fact reside on a low intrinsic dimension. We hypothesize that the
change in weights during model adaptation also has a low “intrinsic rank”, leading to our proposed
Low-Rank Adaptation (LoRA) approach. LoRA allows us to train some dense layers in a neural
network indirectly by optimizing rank decomposition matrices of the dense layers’ change during
adaptation instead, while keeping the pre-trained weights frozen, as shown in Figure 1. Using GPT-3
175B as an example, we show that a very low rank (i.e., r in Figure 1 can be one or two) sufﬁces even
when the full rank (i.e., d) is as high as 12,288, making LoRA both storage- and compute-efﬁcient.
"""
generate_qa_tokens(text)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


<|user|>
Generate a question-answer tuple for
              the following text: 
We take inspiration from Li et al. (2018a); Aghajanyan et al. (2020) which show that the learned
over-parametrized models in fact reside on a low intrinsic dimension. We hypothesize that the
change in weights during model adaptation also has a low “intrinsic rank”, leading to our proposed
Low-Rank Adaptation (LoRA) approach. LoRA allows us to train some dense layers in a neural
network indirectly by optimizing rank decomposition matrices of the dense layers’ change during
adaptation instead, while keeping the pre-trained weights frozen, as shown in Figure 1. Using GPT-3
175B as an example, we show that a very low rank (i.e., r in Figure 1 can be one or two) sufﬁces even
when the full rank (i.e., d) is as high as 12,288, making LoRA both storage- and compute-efﬁcient.
.<|endoftext|>
<|assistant|>
Question: How does the Low-Rank Adaptation (LoRA) approach preserve the effectiveness of over-parametrized model

In [None]:
text="""
An important paradigm of natural language processing consists of large-scale pretraining on general domain data and adaptation to particular tasks or domains. As
we pre-train larger models, full ﬁne-tuning, which retrains all model parameters,
becomes less feasible. Using GPT-3 175B as an example – deploying independent instances of ﬁne-tuned models, each with 175B parameters, is prohibitively
expensive. We propose Low-Rank Adaptation, or LoRA, which freezes the pretrained model weights and injects trainable rank decomposition matrices into each
layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Compared to GPT-3 175B ﬁne-tuned with Adam,
LoRA can reduce the number of trainable parameters by 10,000 times and the
GPU memory requirement by 3 times. LoRA performs on-par or better than ﬁnetuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters, a higher training throughput, and, unlike adapters,
no additional inference latency. We also provide an empirical investigation into
rank-deﬁciency in language model adaptation, which sheds light on the efﬁcacy of
LoRA. We release a package that facilitates the integration of LoRA with PyTorch
models and provide our implementations and model checkpoints for RoBERTa,
DeBERTa, and GPT-2 at https://github.com/microsoft/LoRA.
"""
generate_qa_tokens(text)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


<|user|>
Generate a question-answer tuple for
              the following text: 
An important paradigm of natural language processing consists of large-scale pretraining on general domain data and adaptation to particular tasks or domains. As
we pre-train larger models, full ﬁne-tuning, which retrains all model parameters,
becomes less feasible. Using GPT-3 175B as an example – deploying independent instances of ﬁne-tuned models, each with 175B parameters, is prohibitively
expensive. We propose Low-Rank Adaptation, or LoRA, which freezes the pretrained model weights and injects trainable rank decomposition matrices into each
layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Compared to GPT-3 175B ﬁne-tuned with Adam,
LoRA can reduce the number of trainable parameters by 10,000 times and the
GPU memory requirement by 3 times. LoRA performs on-par or better than ﬁnetuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3

- For the abstract, the question/answer output was really a batch of questions plus a summary of the abstract. It took 5 minutes to process, which is not scalable.
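
As a rough back-of-the-envelope check of the scalability concern, the sketch below multiplies the 62 passages parsed from the LoRA paper (rows 0 to 61 of the DataFrame shown earlier) by the 3-5 minutes per passage noted at the top. It assumes every passage takes about as long as the ones timed here, which may not hold.

```python
N_PASSAGES = 62               # rows 0..61 of the parsed DataFrame
MINUTES_PER_PASSAGE = (3, 5)  # observed range on a T4, from the notes at the top

# Total generation time for one paper, in hours, at each end of the range
low, high = (N_PASSAGES * minutes / 60 for minutes in MINUTES_PER_PASSAGE)
print(f"{low:.1f} to {high:.1f} hours for one paper")
```

Under these assumptions a single paper takes roughly 3 to 5 hours of T4 time.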
