### Resources:
- https://colab.research.google.com/github/RSNA/AI-Deep-Learning-Lab-2023/blob/main/sessions/nlp-text-classification/RSNA23_ACR_contrast_manual_chat.ipynb#scrollTo=bFcBcigjKEuj

### Installing libraries & functions

In [None]:
%%capture
!pip uninstall numpy -y
!pip install numpy==1.25
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.2.11 --force-reinstall --upgrade --no-cache-dir
!pip install -U \
    llama-index==0.8.69.post2 \
    huggingface-hub==0.19.3 \
    transformers==4.35.2 \
    pypdf==3.17.1 \
    torch

In [1]:
from llama_index import VectorStoreIndex, ServiceContext
from llama_index.readers import PDFReader
from llama_index.embeddings import HuggingFaceEmbedding
from llama_index.llms import LlamaCPP
from llama_index.llms.llama_utils import messages_to_prompt, completion_to_prompt

from pathlib import Path

### Downloading relevant textbooks & loading the PDF:
https://drive.google.com/drive/u/1/folders/12mL45XMDRSxhkgMH_PIeQAAsAtbv-X2W

In [2]:
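# Textbook PDFs are expected in a "pdfs" folder inside the current working directory
pdf_folder_path = Path.cwd() / "pdfs"

# Parse the textbook into a list of Document objects
documents = PDFReader().load_data(pdf_folder_path / "General - Brant _ Helms - Fundamentals of Diagnostic Radiology (4e).pdf")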

### Modeling - Embedding + LLM models

In [3]:
embed_model = HuggingFaceEmbedding(model_name="ncbi/MedCPT-Article-Encoder")

LlamaIndex is a data framework that lets LLM-based applications ingest, structure, and access private or domain-specific data. llama.cpp (used here through `llama-cpp-python`) lets us run a quantized Llama model on most laptops and PCs. LlamaIndex's `LlamaCPP` wrapper connects the two, so the local model plugs directly into indexing and querying, which is especially useful for RAG.

In [5]:
model_url = "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf"

llm = LlamaCPP(
    # You can pass in the URL to a GGUF model to download it automatically
    model_url=model_url,
    # optionally, you can set the path to a pre-downloaded model instead of model_url
    model_path=None,
    temperature=0.1,
    max_new_tokens=512,
    # llama2 has a context window of 4096 tokens, but we set it lower to allow for some wiggle room
    context_window=2048,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__()
    # set to at least 1 to use GPU
    model_kwargs={"n_gpu_layers": 36},
    # transform inputs into Llama2 format
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True
)

Downloading url https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf to path /Users/andychen2121/Library/Caches/llama_index/models/llama-2-7b-chat.Q4_K_M.gguf
total size (MB): 4081.0


3892it [02:12, 29.44it/s]                                                       
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /Users/andychen2121/Library/Caches/llama_index/models/llama-2-7b-chat.Q4_K_M.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q6_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_k.weight q4_K     [  
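
The `context_window` and `max_new_tokens` settings above share one token budget: whatever is reserved for the answer is not available for the prompt. The arithmetic can be sketched as below. This is a simplified illustration, not LlamaIndex's exact accounting; `prompt_budget` and `chunks_that_fit` are hypothetical helper names, and the 1000-token chunk is an assumed example value.

```python
def prompt_budget(context_window: int, max_new_tokens: int) -> int:
    """Tokens left for the prompt (question + retrieved text) after
    reserving room for the generated answer."""
    if max_new_tokens >= context_window:
        raise ValueError("max_new_tokens must be smaller than context_window")
    return context_window - max_new_tokens


def chunks_that_fit(budget: int, chunk_tokens: int, overhead_tokens: int = 0) -> int:
    """How many retrieved chunks of `chunk_tokens` fit in one prompt,
    after subtracting `overhead_tokens` for the question and template."""
    usable = max(budget - overhead_tokens, 0)
    return usable // chunk_tokens


# Values taken from the LlamaCPP call above
budget = prompt_budget(context_window=2048, max_new_tokens=512)
print(budget)  # 1536

# Assumed example: chunks of roughly 1000 tokens each
print(chunks_that_fit(budget, chunk_tokens=1000))  # 1
```

Under these assumptions only one large chunk fits per call, so raising `context_window` toward the model's 4096-token limit, or using smaller chunks, leaves more room for retrieved text.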

Defining the service context: the service context bundles the embedding model (loaded from Hugging Face) and the LLM (the Llama model) so that indexing and querying share the same settings.
- LLM: reads the retrieved text and generates the answer for our QA functionality
- embedding model: creates numerical representations (embeddings) of text chunks and queries
- chunk size: maximum size, in tokens, of each piece the documents are split into before embedding - *worth experimenting with*

In [6]:
service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model,
    chunk_size=1000
)
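
To make the effect of `chunk_size` concrete, here is a minimal sketch of fixed-size chunking with overlap. It is an illustration only, not LlamaIndex's actual splitter (which also respects sentence boundaries); `chunk_tokens` is a hypothetical helper, and whitespace-separated words stand in for real tokenizer tokens.

```python
def chunk_tokens(tokens, chunk_size, overlap=0):
    """Split a token list into windows of at most `chunk_size` tokens,
    where consecutive windows share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Assumption: words approximate tokens for this toy example
words = "the quick brown fox jumps over the lazy dog again".split()
chunks = chunk_tokens(words, chunk_size=4, overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

Smaller chunks give more precise retrieval hits but less context per hit; larger chunks do the opposite and consume more of the LLM's context window.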

VectorStoreIndex: stores the embedding vector of every document chunk, generated by running the embedding model over our documents. From it we create a query engine that retrieves the k chunks most similar to a query and passes them to the LLM.

In [13]:
# TO EXPLORE: can VectorStoreIndex intake better formatted document per Max Vogel's textbook scraper?

In [7]:
index = VectorStoreIndex.from_documents(
    documents, service_context=service_context
)

# Another hyperparameter to experiment with is `similarity_top_k`. This is the number of chunks of text that will be retrieved from the `VectorStoreIndex` for each query.
query_engine = index.as_query_engine(similarity_top_k=3)
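
What `similarity_top_k` controls can be sketched with a toy retrieval step: score every stored chunk vector against the query vector and keep the k best. This sketch assumes cosine similarity as the scoring function and uses made-up 2-dimensional vectors; `cosine_similarity` and `top_k` are hypothetical helpers, not LlamaIndex functions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k):
    """Return (chunk_index, score) pairs for the k most similar chunks,
    best match first."""
    scored = [(cosine_similarity(query_vec, vec), i)
              for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, score) for score, i in scored[:k]]


# Toy embeddings (assumed values); real embeddings have hundreds of dimensions
toy_chunks = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]]
hits = top_k([1.0, 0.1], toy_chunks, k=2)
for index, score in hits:
    print(index, round(score, 3))
```

A larger k gives the LLM more material to draw on but also more text to fit into its context window.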

### Testing

In [8]:
response = query_engine.query("How is radiation dose minimized in diagnostic imaging?")
print(response)


llama_print_timings:        load time =  5974.68 ms
llama_print_timings:      sample time =   232.53 ms /   337 runs   (    0.69 ms per token,  1449.26 tokens per second)
llama_print_timings: prompt eval time =  8981.72 ms /  1711 tokens (    5.25 ms per token,   190.50 tokens per second)
llama_print_timings:        eval time =  8316.23 ms /   336 runs   (   24.75 ms per token,    40.40 tokens per second)
llama_print_timings:       total time = 17963.39 ms
Llama.generate: prefix-match hit

llama_print_timings:        load time =  5974.68 ms
llama_print_timings:      sample time =   169.22 ms /   241 runs   (    0.70 ms per token,  1424.22 tokens per second)
llama_print_timings: prompt eval time =  5198.20 ms /  1744 tokens (    2.98 ms per token,   335.50 tokens per second)
llama_print_timings:        eval time =  5946.76 ms /   240 runs   (   24.78 ms per token,    40.36 tokens per second)
llama_print_timings:       total time = 11626.66 ms
Llama.generate: prefix-match hit


  Thank you for providing additional context! Based on the information provided, here is a refined answer to the original question:
Radiation dose minimization in diagnostic imaging is crucial to reduce the risks associated with ionizing radiation exposure. Here are some strategies that healthcare providers can use to minimize radiation dose in diagnostic imaging, taking into account the new context provided:
1. Choose the appropriate imaging modality: Select the imaging modality that best suits the clinical question and patient age, size, and pathology, taking into account any specific requirements for the examination, such as the need for a normal V/Q scan or perfusion lung scan. For example, ultrasound or MRI may be preferred over CT for certain examinations, especially in pediatric patients.
2. Use the lowest possible dose of radiation: Select the lowest mAs setting that produces diagnostic quality images, taking into account any specific requirements for the examination and the pa


llama_print_timings:        load time =  5974.68 ms
llama_print_timings:      sample time =   359.49 ms /   512 runs   (    0.70 ms per token,  1424.22 tokens per second)
llama_print_timings: prompt eval time =  1015.47 ms /   369 tokens (    2.75 ms per token,   363.38 tokens per second)
llama_print_timings:        eval time = 11585.74 ms /   511 runs   (   22.67 ms per token,    44.11 tokens per second)
llama_print_timings:       total time = 13658.39 ms


In [9]:
sources = response.get_formatted_sources(length=1000)
print(sources)

> Source (Doc id: dece6c86-b214-460b-a7ea-860460ed76a0): 46.    Fazel   R  ,   Krumholz   HM  ,   Wang   Y  , et al.  Exposure to low-dose ionizing radi-
ation from medical imaging procedures .  N Eng J Med   2009 ; 361 : 849 – 857 . 
   47.    Balter   S  ,   Hopewell   JW  ,   Miller   DL  , et al.  Fluoroscopically guided interven-
tional procedures: a review of the radiation effects on patients’ skin and hair .  Radiology   2010 ; 254 : 326 – 341 . 
   48.    Little   MP  ,   Wakeford   R  ,   Tawn   EJ  , et al.  Risks associated with low doses and 
low rates of ionizing radiation: why linearity may be (almost) the best we can do .  Radiology   2009 ; 251 : 6 – 12 . 
   49.    Strzelczyk   J  ,   Damilakis   J  ,   Marx   MV  ,   Macura   KJ  .  Facts and controversies 
about radiation exposure, part 2: low-level exposures and cancer risk . 
 J Am Coll Radiol   2007 ; 4 : 32 – 39 . 
   50.    Brenner   D  ,   Elliston   C  ,   Hall   E  ,   Berdon   W  .  Estimated risk of radiati