Every PDF in the current set contains a section titled "Interpretation and conclusions of economic evidence", so we locate that section by its heading. The References section always follows it, so extraction stops at the References heading.

TODO:
For PDFs without an "Interpretation and conclusions of economic evidence" section, we need an NLP method to determine where the relevant text starts and ends.

In [None]:
import pdfplumber
import re
from fuzzywuzzy import process

def extract_section_from_pdf(pdf_path, target_section="Interpretation and conclusions of economic evidence", start_page=20):
    """
    Extracts the section with a name similar to 'target_section' from the given PDF, starting from 'start_page'.
    Works even if the section name is slightly different.
    """
    text = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_num in range(start_page - 1, len(pdf.pages)):  # Adjust for 0-based index
            text.append(pdf.pages[page_num].extract_text())

    full_text = "\n".join(filter(None, text))  # Remove None values

    # Extract all section headers using a regex pattern for headings (e.g., "B.x.x Section Name")
    section_pattern = r"(B\.\d+\.\d+.*?)\n"
    section_matches = re.findall(section_pattern, full_text)

    if not section_matches:
        print("No section headers found.")
        return None

    # Find the best match for the target section
    best_match, score = process.extractOne(target_section, section_matches)

    if score < 80:  # Adjust threshold based on quality
        print(f"No close match found for '{target_section}'. Closest match: '{best_match}' (Score: {score})")
        return None

    print(f"Extracting section: {best_match}")

    # Find start position of matched section
    start_pos = full_text.find(best_match)

    # Find the References heading (pattern "B.<n> References") to determine the end position
    next_match = re.search(r"B\.\d+\s+References", full_text[start_pos + len(best_match):])

    if next_match:
        end_pos = start_pos + len(best_match) + next_match.start()
        extracted_text = full_text[start_pos:end_pos]
    else:
        extracted_text = full_text[start_pos:]  # Extract to the end if no References heading is found

    return extracted_text

# Example usage:
pdf_path = "10 Typical Committee Papers/committee-papers-Baricitinib-not recommend.pdf"
# Start at page 20 to skip the table of contents, where every section title also appears
extracted_text = extract_section_from_pdf(pdf_path, start_page=20)

if extracted_text:
    print(extracted_text)

Extracting section: B.3.15 Interpretation and conclusions of economic evidence
B.3.15 Interpretation and conclusions of economic evidence
Summary of cost-effectiveness evidence
The cost-effectiveness of baricitinib in severe AA was evaluated versus ‘watch and wait’, the
most clinically relevant comparator for this population, due to a lack of high-quality clinical data
for another comparator. In the probabilistic base case, baricitinib was associated with higher
costs (£***** per patient over a lifetime horizon) and higher benefits (**** QALYs per patient over
a lifetime horizon) compared with ‘watch and wait’. The base case probabilistic ICER was
£29,111 per QALY gained and did not differ meaningfully from the deterministic ICER (£29,395
per QALY gained). In absolute terms, base case probabilistic results suggested that baricitinib
4mg was associated with a total cost of £******, of which £****** related to drug acquisition. These
were partially compensated by cost savings due to redu
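
The heading search and the stop-at-References logic can be tried on a toy string without opening a PDF. This is a minimal sketch: `sample_text` is made up for illustration, and the standard-library `difflib.SequenceMatcher` stands in for the `fuzzywuzzy` matcher used above, so the similarity scores are not on the same 0-100 scale.

```python
import re
from difflib import SequenceMatcher

# Made-up text standing in for what pdfplumber would return (illustration only)
sample_text = (
    "B.3.14 Validation\n"
    "Some validation text.\n"
    "B.3.15 Interpretation and conclusions of economic evidence\n"
    "Summary of cost-effectiveness evidence\n"
    "Body of the section.\n"
    "B.4 References\n"
    "1. Some reference.\n"
)

target = "Interpretation and conclusions of economic evidence"

# Same heading pattern as in extract_section_from_pdf
headers = re.findall(r"(B\.\d+\.\d+.*?)\n", sample_text)

def similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1] (stand-in for fuzzywuzzy)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Pick the heading closest to the target section name
best_header = max(headers, key=lambda h: similarity(target, h))

# Slice from the matched heading up to the References heading
start = sample_text.find(best_header)
end_match = re.search(r"B\.\d+\s+References", sample_text[start:])
end = start + end_match.start() if end_match else len(sample_text)
section = sample_text[start:end]

print(best_header)
print(section)
```

Note that "B.4 References" is not picked up as a candidate heading, because the heading pattern requires two numeric levels after "B.".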

1. Extract pdf section
2. Run pdf section through NLP
3. Extract website section from website (committee)
4. Run website section using NLP
5. Use summarized website section to test accuracy of pdf section
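
Step 5 could be sketched as follows, assuming both the PDF section and the website section have already been reduced to a 1-5 score per appraisal. The function name `compare_scores` and the score lists are hypothetical and exist only to illustrate the comparison; they are not results from our data.

```python
def compare_scores(pdf_scores, web_scores):
    """Compare two equal-length lists of 1-5 scores, one pair per appraisal."""
    if len(pdf_scores) != len(web_scores) or not pdf_scores:
        raise ValueError("Score lists must be non-empty and of equal length")
    diffs = [abs(p - w) for p, w in zip(pdf_scores, web_scores)]
    n = len(diffs)
    return {
        "exact_agreement": sum(d == 0 for d in diffs) / n,   # identical scores
        "within_one": sum(d <= 1 for d in diffs) / n,        # differ by at most 1
        "mean_abs_diff": sum(diffs) / n,                     # average gap
    }

# Hypothetical scores, for illustration only
pdf_scores = [2, 4, 5, 3]
web_scores = [2, 5, 5, 1]
print(compare_scores(pdf_scores, web_scores))
```

Treating the website-derived score as the reference is itself an assumption; agreement between the two only tells us the PDF pipeline is consistent with the committee text, not that either score is correct.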

### Only run the next part with GPU!
A few models we could look at:
1. MedAlpaca-7b
2. OpenBioLLM-8B
3. LLaMA 2 base model (Matt has gotten Meta's verification)

We tried running these on Google Colab's T4 GPU but struggled, even though we only chose the smaller (<10B-parameter) models (OpenBioLLM also has a 70B model that reportedly works better than ChatGPT). With our current GPU resources and VRAM, running these models is especially difficult. The next step is to consider using ChatGPT instead, to get around the limited computational power.
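
A back-of-the-envelope estimate helps explain why the T4 is tight. The sketch below counts only the memory needed to hold the weights (parameters times bytes per parameter) and ignores activations, the KV cache, and framework overhead, so real usage is higher. The parameter counts are nominal figures taken from the model names, and 16 GB is the T4's nominal memory; Colab typically exposes slightly less.

```python
def weight_memory_gb(n_params, bytes_per_param):
    """Approximate memory (GB, with 1 GB = 1e9 bytes) needed just to store the weights."""
    return n_params * bytes_per_param / 1e9

T4_MEMORY_GB = 16  # nominal memory of an NVIDIA T4

# Nominal parameter counts taken from the model names (assumption)
models = {"7B model": 7e9, "8B model": 8e9, "70B model": 70e9}
precisions = {"fp16": 2, "int8": 1}  # bytes per parameter

for name, n_params in models.items():
    for precision, nbytes in precisions.items():
        needed = weight_memory_gb(n_params, nbytes)
        verdict = "weights fit" if needed < T4_MEMORY_GB else "weights do not fit"
        print(f"{name} in {precision}: ~{needed:.0f} GB ({verdict} in {T4_MEMORY_GB} GB)")
```

Even when the weights fit, as with a 7B model in fp16, little memory is left for the prompt and generation, which is consistent with the difficulties described above and is the reason 8-bit loading is used in the next cell.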

In [None]:
# Setting up drive
from google.colab import drive
drive.mount('/content/drive')

model_path = "/content/drive/MyDrive/medalpaca-7b"

# Use the next part to download the model
#!huggingface-cli download medalpaca/medalpaca-7b --cache-dir $model_path

from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "/content/drive/MyDrive/medalpaca-7b/models--medalpaca--medalpaca-7b/snapshots/fbb41b75d5a46ba405d496db1083a6f1d3df72a2" # change this accordingly
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", torch_dtype="auto", load_in_8bit=True)

print("Model loaded successfully from", model_dir)

In [None]:
from transformers import pipeline

# Create a text generation pipeline
text_generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Truncate the input text to at most 500 tokens so the prompt stays short
tokenized_input = tokenizer(extracted_text, truncation=True, max_length=500)
# Drop special tokens (e.g. BOS) so they do not appear inside the prompt
truncated_text = tokenizer.decode(tokenized_input["input_ids"], skip_special_tokens=True)

# Improved prompt that forces a number response
final_prompt = f"""
Question: Based on the economic evidence provided, rate the cost-effectiveness of this treatment on a scale from 1 to 5. 
1 = Not cost-effective, 5 = Highly cost-effective.
Answer ONLY with a single number (1, 2, 3, 4, or 5). Do NOT provide explanations or any other text.

Example 1:
Economic Evidence: The ICER is above the standard threshold, making the treatment unlikely to be cost-effective.
Answer: 2

Example 2:
Economic Evidence: The treatment provides high QALY gains and remains cost-effective under multiple scenarios.
Answer: 5

Now, based on the economic evidence below, provide your answer.

Economic Evidence:
{truncated_text}

Answer:
"""

# Generate a short response (at most 10 new tokens); by default the returned text includes the prompt
raw_response = text_generator(final_prompt, max_new_tokens=10, do_sample=True, temperature=0.2, top_p=0.9)[0]["generated_text"]
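
Because the pipeline returns the prompt together with the completion by default, the score still has to be pulled out of `raw_response`. The helper below is a minimal sketch of one way to do that; `parse_score` is a hypothetical name, and it assumes the model's answer follows the last "Answer:" marker in the text.

```python
import re

def parse_score(generated_text, marker="Answer:"):
    """Return the first standalone digit 1-5 after the last marker, or None if absent."""
    # Keep only what follows the final "Answer:", i.e. the model's completion
    completion = generated_text.rsplit(marker, 1)[-1]
    match = re.search(r"\b([1-5])\b", completion)
    return int(match.group(1)) if match else None

# Toy strings shaped like the prompt above (illustration only)
print(parse_score("Example 1:\nAnswer: 2\n\nEconomic Evidence:\n...\n\nAnswer:\n 4"))
print(parse_score("Economic Evidence:\n...\n\nAnswer:\n unclear"))
```

If the model keeps generating further "Example" blocks of its own, the last "Answer:" may no longer be the one we asked for, so the parsed value should be spot-checked against the raw text.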

In [None]:
# from transformers import pipeline
# pl = pipeline("text-generation", model=model, tokenizer=tokenizer) 

# def chunk_text(text, max_length=512, overlap=50):
#     tokens = tokenizer.encode(text)
#     chunks = []
#     for i in range(0, len(tokens), max_length - overlap):
#         chunk = tokens[i:i+max_length]
#         # Ensure last sentence is included in the next chunk for continuity
#         if i > 0:
#             chunk = tokenizer.encode(chunks[-1].split(".")[-1]) + chunk
#         chunks.append(tokenizer.decode(chunk))
#     return chunks

# # Split the extracted section into 512-token chunks with 50-token overlap
# chunks = chunk_text(extracted_text)

# # Summarize each chunk individually
# summaries = []
# for chunk in chunks:
#     prompt = f"Summarize the cost-effectiveness findings in this section:\n\n{chunk}"
#     summary = text_generator(prompt, max_length=64, do_sample=True)[0]["generated_text"]
#     summaries.append(summary)

# # Combine all summaries into a single text
# combined_summary = " ".join(summaries)

# # Ask the model to make a final decision
# final_prompt = f"Based on the following economic evaluation summary, determine if the treatment is cost-effective:\n\n{combined_summary}\n\nAnswer with a value between 1 to 5, with 5 being the most cost-effective."

# final_response = text_generator(final_prompt, max_length=100, do_sample=False)[0]["generated_text"]

# print("Final Cost-Effectiveness Verdict:")
# print(final_response)


In [None]:
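from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load Microsoft Phi-2 model
model_name = "microsoft/phi-2"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype="auto")

# Create a text-generation pipeline (generation length is set per call via max_new_tokens)
text_generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Construct a short and effective prompt
# Note: Phi-2 has a 2048-token context window, so a long extracted_text may need truncating first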
final_prompt = f"""

Question: Based on the economic evidence provided, rate the cost-effectiveness of this treatment on a scale from 1 to 5. 
1 = Not cost-effective, 5 = Highly cost-effective.
- Answer with a number between 1 and 5.
- Do NOT repeat any text.
- Provide only a number.

Example 1:
Economic Evidence: The ICER is above the standard threshold, making the treatment unlikely to be cost-effective.
Answer: 2

Example 2:
Economic Evidence: The treatment provides high QALY gains and remains cost-effective under multiple scenarios.
Answer: 5

Now, based on the economic evidence below, provide your answer.

Economic Evidence:
{extracted_text}

Answer:
"""

# Generate response (limit to 10 new tokens)
raw_response = text_generator(final_prompt, max_new_tokens=10, do_sample=True, temperature=0.2, top_p=0.9)[0]["generated_text"]

# generated_text contains the prompt followed by the model's answer
print("Model response:", raw_response)


In [None]:
!pip install openai

import openai
import os

# Set your OpenAI API key
os.environ["OPENAI_API_KEY"] = "ENTER CHATGPT API KEY HERE"


In [None]:
def get_cost_effectiveness_score(retrieved_text):
    """
    Sends the retrieved economic evaluation text to gpt-4o-mini and asks it to rate
    cost-effectiveness on a scale of 1-5, followed by a one-sentence explanation.
    """
    client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are an expert in health economics and NICE guidelines."},
            {"role": "user", "content": f"""
            Based on the following economic evaluation, rate the cost-effectiveness of the treatment from 1 to 5:
            1 = Not cost-effective
            5 = Highly cost-effective

            {retrieved_text}

            Start your answer with a single number (1, 2, 3, 4, or 5), then give a one-sentence explanation.
            """}
        ],
        max_tokens=60,  # Room for the number plus one short sentence
        temperature=0.1  # Low temperature for near-deterministic output
    )

    # Extract response
    gpt_output = response.choices[0].message.content.strip()

    return gpt_output

# Example call
cost_effectiveness_score = get_cost_effectiveness_score(extracted_text)

# Print the model's answer (score followed by a short explanation)
print(cost_effectiveness_score)
