#### This notebook demonstrates how to load the Llama 3 language model with 8-bit quantization using Hugging Face's Transformers library and use it to generate QA pairs from PDF documents.

Key steps include:
1. Installing and importing necessary libraries (`transformers` for model handling and `bitsandbytes` for quantization).
2. Initializing the tokenizer and model with 8-bit quantization to optimize memory usage while maintaining performance.
3. Automatically distributing the model across available devices (e.g., GPUs) to leverage computational resources effectively.
4. Loading the sharded model checkpoint with `low_cpu_mem_usage=True` to limit peak CPU memory when handling large model files.

This setup is particularly useful for running large language models on limited hardware resources.
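
As a rough sanity check on the memory claim, weight storage scales with the number of bytes per parameter. The sketch below assumes about 8 billion parameters (inferred from the model name) and counts weights only, ignoring activations, the KV cache, and any quantization overhead. `weight_memory_gb` is a hypothetical helper written for this illustration, not part of any library.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 8e9  # assumption: roughly 8 billion parameters

for label, bits in [("float32", 32), ("float16", 16), ("int8", 8)]:
    print(f"{label}: ~{weight_memory_gb(NUM_PARAMS, bits):.0f} GB")
```

Under these assumptions, 8-bit weights need about half the memory of 16-bit weights, which is what lets the model fit on a single mid-range GPU.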

In [None]:
!pip install bitsandbytes datasets transformers peft accelerate tqdm pandas PyMuPDF


Collecting bitsandbytes
  Downloading bitsandbytes-0.43.3-py3-none-manylinux_2_24_x86_64.whl.metadata (3.5 kB)
Collecting datasets
  Downloading datasets-2.20.0-py3-none-any.whl.metadata (19 kB)
Collecting peft
  Downloading peft-0.12.0-py3-none-any.whl.metadata (13 kB)
Collecting PyMuPDF
  Downloading PyMuPDF-1.24.9-cp310-none-manylinux2014_x86_64.whl.metadata (3.4 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting requests>=2.32.2 (from datasets)
  Downloading requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.5.0,>=2023.1

In [None]:
import os
import glob
import re
import pandas as pd
from tqdm import tqdm
import fitz  # PyMuPDF
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig

In [None]:
# Load configuration
SOURCE_DIRECTORY = '/content/drive/MyDrive/Colab Notebooks/WIP_kitectrics/CNSC'
chunk_size = 2000
chunk_overlap = 0
HF_TOKEN = os.environ.get("HF_TOKEN")  # read the token from the environment; never hard-code access tokens in a notebook
model_name = "meta-llama/Meta-Llama-3-8B"

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Initialize the Llama 3 model with 8-bit quantization from Hugging Face
# `token` replaces the deprecated `use_auth_token` argument
tokenizer = AutoTokenizer.from_pretrained(model_name, token=HF_TOKEN)
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map='auto',  # Automatically distribute model to available devices
    low_cpu_mem_usage=True,  # Load weights shard by shard to limit peak CPU RAM
    token=HF_TOKEN,
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/73.0 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


config.json:   0%|          | 0.00/654 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/177 [00:00<?, ?B/s]

In [None]:
# Read the raw text of every PDF in a directory (no cleaning is applied here)
def extract_text_from_pdfs(directory):
    text_data = []
    for pdf_file in glob.glob(os.path.join(directory, "*.pdf")):
        print(f"Reading PDF: {pdf_file}")
        # The context manager closes the document once its pages are read
        with fitz.open(pdf_file) as doc:
            text = "".join(page.get_text() for page in doc)
        text_data.append(text)
    return text_data
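
Text extracted from PDFs usually carries layout artifacts such as words hyphenated across line breaks and runs of blank lines. A minimal cleaning sketch is shown below; `clean_text` is a hypothetical helper, not part of PyMuPDF, and its rules are assumptions that may need tuning for these documents. In particular, the de-hyphenation rule also merges genuine compounds that happen to break at a line end.

```python
import re


def clean_text(raw: str) -> str:
    """Normalize text extracted from a PDF (hypothetical helper)."""
    # Rejoin words hyphenated across a line break: "radia-\ntion" -> "radiation"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Collapse runs of spaces and tabs into a single space
    text = re.sub(r"[ \t]+", " ", text)
    # Drop the single space left on either side of a line break
    text = re.sub(r" ?\n ?", "\n", text)
    # Collapse three or more newlines into one paragraph break
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


sample = "Radia-\ntion   protection \n\n\n\nrequires dosimetry.  "
print(repr(clean_text(sample)))
# → 'Radiation protection\n\nrequires dosimetry.'
```

Such a helper could be applied to each document's text before chunking.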

In [None]:
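# Split a document into fixed-size character chunks
def chunk_text(text, chunk_size=2000, chunk_overlap=0):
    # The window must advance on every step, otherwise the loop would never end
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks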
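
To make the effect of the overlap parameter concrete, here is a self-contained sketch of the same sliding-window idea on a short string. `sliding_chunks` is a hypothetical stand-in with the same windowing logic as the chunking function above, written out so the example runs on its own.

```python
def sliding_chunks(text: str, size: int, overlap: int) -> list[str]:
    """Fixed-size character windows; each starts `size - overlap` after the last."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


print(sliding_chunks("abcdefghij", size=4, overlap=0))
# → ['abcd', 'efgh', 'ij']
print(sliding_chunks("abcdefghij", size=4, overlap=2))
# → ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

With an overlap of 0, as configured in this notebook, a sentence that straddles a chunk boundary is split between two prompts. A small overlap trades some repeated text for keeping such sentences intact in at least one chunk.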

In [None]:
# Prompt template for using Llama 3 as a QA generator; the final pair is a one-shot example
qa_prompt_template = """
You are an AI assistant tasked with generating question-answer pairs for fine-tuning a large language model (LLM).
Each question should be formatted as <question>Your question here</question> and each answer as <answer>Your answer here</answer>.
Use only the information provided in the context to create relevant and contextually accurate questions and answers.

Context: {context}

Q&A:

<question>What is the purpose of Dosimetry in radiation protection?</question>
<answer>The purpose of Dosimetry in radiation protection is to accurately measure and assess the radiation doses received by individuals working in environments where they may be exposed to radiation.</answer>
"""

In [None]:
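# Generate QA pairs for the first 3 chunks
def generate_qa_pairs(chunks):
    qa_pairs = []
    for chunk in tqdm(chunks[:3], desc="Generating QA pairs for first 3 chunks"):
        try:
            prompt = qa_prompt_template.format(context=chunk)
            # Send inputs to the device that device_map='auto' chose for the model
            inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
            outputs = model.generate(
                **inputs,
                max_new_tokens=1024,  # budget for generated tokens only; max_length also counts the prompt
                num_return_sequences=1,
                do_sample=True,
                temperature=0.7,  # lowered from 1.8, which tends to produce incoherent text
                top_p=0.90,
                top_k=50,
                pad_token_id=tokenizer.eos_token_id,  # set explicitly to avoid the pad_token_id warning
            )
            # Decode only the newly generated tokens so the prompt's one-shot
            # example is not collected as if the model had produced it
            prompt_length = inputs["input_ids"].shape[1]
            generated_text = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)

            # Keep the non-empty lines of the generated Q&A text
            qa_lines = [line for line in generated_text.split('\n') if line.strip()]
            if qa_lines:
                qa_pairs.extend(qa_lines)
            else:
                print("No Q&A found in the generated text.")
        except Exception as e:
            print(f"Error generating QA for chunk: {e}")
    return qa_pairs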
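
The prompt asks for `<question>` and `<answer>` tags, typically on separate lines, so the raw lines returned above still have to be paired up. One way to do that is a regular expression over the whole generated text, sketched below. `QA_PATTERN` and `parse_qa_pairs` are hypothetical names, and the pattern assumes each answer directly follows its question.

```python
import re

# A question body may not contain another "<question>" tag, so an unanswered
# question cannot swallow the pair that follows it.
QA_PATTERN = re.compile(
    r"<question>((?:(?!<question>).)*?)</question>\s*<answer>(.*?)</answer>",
    re.DOTALL,
)


def parse_qa_pairs(generated: str) -> list[tuple[str, str]]:
    """Return (question, answer) tuples for every well-formed tagged pair."""
    return [(q.strip(), a.strip()) for q, a in QA_PATTERN.findall(generated)]


sample = """
<question>An unanswered question</question>
<question>What does a dosimeter measure?</question>
<answer>The radiation dose received by the wearer.</answer>
"""
print(parse_qa_pairs(sample))
# → [('What does a dosimeter measure?', 'The radiation dose received by the wearer.')]
```

Returning tuples rather than tagged strings makes it straightforward to load the pairs into a table or write them out for fine-tuning.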

In [None]:
# Main execution
if __name__ == "__main__":
    text_data = extract_text_from_pdfs(SOURCE_DIRECTORY)
    print(f"Extracted text from {len(text_data)} PDFs")  # Debugging statement
    all_qa_lines = []

    for text in text_data:
        chunks = chunk_text(text, chunk_size=chunk_size, chunk_overlap=chunk_overlap)
        all_qa_lines.extend(generate_qa_pairs(chunks))

    # Post-processing: keep only questions that are followed by an answer.
    # The two tags sit on separate lines, so match across the joined text
    # instead of requiring both tags on a single line.
    qa_pattern = re.compile(
        r"<question>((?:(?!<question>).)*?)</question>\s*<answer>(.*?)</answer>",
        re.DOTALL,
    )
    complete_qa_pairs = qa_pattern.findall("\n".join(all_qa_lines))
    print(f"Found {len(complete_qa_pairs)} complete QA pairs")  # Debugging statement

    # Print the first 20 QA pairs for debugging
    print("Generated QA pairs for the first 3 chunks of each PDF:")
    for i, (question, answer) in enumerate(complete_qa_pairs[:20]):
        print(f"{i+1}. Q: {question.strip()}")
        print(f"   A: {answer.strip()}")

Reading PDF: /content/drive/MyDrive/Colab Notebooks/WIP_kitectrics/CNSC/REGDOC-1_4_1__Class_II_Nuclear_Facilities_and_Prescribed_Equipment.pdf
Reading PDF: /content/drive/MyDrive/Colab Notebooks/WIP_kitectrics/CNSC/REGDOC-1-1-1-site-evaluation-site-preparation-for-new-reactor-facilities-v1-2.pdf
Extracted text from 2 PDFs


Generating QA pairs for first 3 chunks:   0%|          | 0/3 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Generating QA pairs for first 3 chunks:  33%|███▎      | 1/3 [05:27<10:55, 327.69s/it]Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Generating QA pairs for first 3 chunks:  67%|██████▋   | 2/3 [06:44<03:00, 180.05s/it]Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Generating QA pairs for first 3 chunks: 100%|██████████| 3/3 [07:14<00:00, 144.79s/it]
Generating QA pairs for first 3 chunks:   0%|          | 0/3 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Generating QA pairs for first 3 chunks:  33%|███▎      | 1/3 [00:00<00:01,  1.84it/s]Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Generating QA pairs for first 3 chunks:  67%|██████▋   | 2/3 [06:05<03:34, 214.72s/it]Setting `pad_token_id` to `eos_token_id`:128001 for

Incomplete QA pair found: <question>What is the purpose of Dosimetry in radiation protection?</question>
Incomplete QA pair found: <answer>The purpose of Dosimetry in radiation protection is to accurately measure and assess the radiation doses received by individuals working in environments where they may be exposed to radiation.</answer>
Incomplete QA pair found:     
Incomplete QA pair found: # Answer should mention:
Incomplete QA pair found: Measure of total cumulative effects of the action or radiation or radioactivity.
Incomplete QA pair found: Can be total effective body (whole-body) dose equivalent or equivalent organ.
Incomplete QA pair found: Individuals who must provide and wear dosimeter badge for work within areas where dose may 
Incomplete QA pair found: exceed the controlled exposure limits must comply.
Incomplete QA pair found: Determine worker dose limits or control values; used in occupational radiation protection; or control program.
Incomplete QA pair found: Determin


