In [57]:
# read in the entire pdf and extract the text

import pandas as pd
from openai import OpenAI
import os
import json
from dotenv import load_dotenv
import pdfplumber

# Constants
# Raw string so the backslashes in the Windows path are not read as escape sequences
INPUT_FILE = r"G:\My Drive\Master IS\MasterThesis\Literature\Maybe_Relevant\Abdullahi2024_Retrieval-Based Diagnostic Decision SupportMixed Methods Study.pdf"
OPENAI_MODEL = 'gpt-4o-mini'
load_dotenv()
# Instantiate the OpenAI client
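# The key is read from the custom OPENAI_APIKEY variable here; api_key can only be
# omitted when the key is stored under the client's default name, OPENAI_API_KEY
client = OpenAI(api_key=os.getenv('OPENAI_APIKEY'))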


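# Read the text of every page in a PDF file
def read_pdf(file_path):
    with pdfplumber.open(file_path) as pdf:
        # extract_text() may yield None for a page without a text layer
        # (depending on the pdfplumber version), so fall back to ''
        pages = [page.extract_text() or '' for page in pdf.pages]
    # Join with newlines so words do not fuse across page breaks
    return '\n'.join(pages)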


pdf_text = read_pdf(INPUT_FILE)
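
The prompt used in this notebook asks the model to cite page numbers, but `read_pdf` joins the page texts without recording where each page starts. A minimal sketch of one way to keep page boundaries visible is shown here. The helper name `join_pages_with_markers` and the `[Page N]` marker format are assumptions for illustration and are not part of the original notebook or of pdfplumber. The markers count PDF pages from 1, which may differ from the page numbers printed in the journal.

```python
from typing import Iterable, Optional


def join_pages_with_markers(page_texts: Iterable[Optional[str]]) -> str:
    """Join per-page texts, prefixing each page with a '[Page N]' marker.

    Hypothetical helper: the marker format is an assumption.
    Pages without extractable text (None) are kept as empty pages
    so that the numbering stays aligned with the PDF.
    """
    parts = []
    for number, text in enumerate(page_texts, start=1):
        parts.append(f"[Page {number}]\n{text or ''}")
    return "\n\n".join(parts)


# Plain strings stand in for the results of page.extract_text()
sample = join_pages_with_markers(["Title page", None, "Methods"])
print(sample)
```

With pdfplumber, the helper could be called as `join_pages_with_markers(page.extract_text() for page in pdf.pages)` inside the `with pdfplumber.open(...)` block.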



In [58]:
# print the text
print(pdf_text)

JMIR MEDICAL INFORMATICS Abdullahi et al
Original Paper
Retrieval-Based Diagnostic Decision Support: Mixed Methods
Study
Tassallah Abdullahi1, MSc; Laura Mercurio2, MD; Ritambhara Singh1,3, PhD; Carsten Eickhoff4, PhD
1Department of Computer Science, Brown University, Providence, RI, United States
2Departments of Pediatrics & Emergency Medicine, Alpert Medical School, Brown University, Providence, RI, United States
3Center for Computational Molecular Biology, Brown University, Providence, RI, United States
4School of Medicine, University of Tübingen, Tübingen, Germany
Corresponding Author:
Carsten Eickhoff, PhD
School of Medicine
University of Tübingen
Schaffhausenstr, 77
Tübingen, 72072
Germany
Phone: 49 7071 29 843
Email: carsten.eickhoff@uni-tuebingen.de
Abstract
Background: Diagnostic errors pose significant health risks and contribute to patient mortality. With the growing accessibility
of electronic health records, machine learning models offer a promising avenue for enhancing di

In [59]:
# Prompt template
PROMPT_TEMPLATE = '''
You are assisting in finding relevant information in the content of a paper that I provide below. The task is to identify any content related to requirements for AI-based knowledge/information retrieval systems. If the section or paper is not about IE/IR or a similar topic, it is not relevant. The objective is to identify passages in the text that can contribute to any of the following predefined requirements:
Organizational-Centric
1. Increased Productivity & Efficiency (e.g., the solution enables time savings, ROI, or similar.)
2. Performance Indicators / Measurability / Validation Metrics (Metrics/Test Data to validate solution.)
3. Organizational Acceptance (Any factors to achieve company acceptance like change management.)
4. Cost-Effectiveness (Related to a positive cost-benefit ratio.)
User-Centric
5. User-Friendliness & Accessibility (e.g., intuitive access for all user skill levels.)
6. Natural Language Use (Enable conversational interactions in everyday language.)
7. Contextual Relevance & Awareness (Deliver precise, domain-specific answers tailored to context and user request.)
8. Feedback Mechanism (e.g., involve stakeholders or users in initial planning or continuous improvement.)
9. Traceability & Transparency (e.g., show sources and reasoning behind results to foster trust.)
10. Reasonable Response Times (Provide timely answers for an efficient user experience.)
Data-Centric
11. Data Integration / Handling Data Variety (Combine structured and unstructured data seamlessly / from different sources.)
12. Data Quality and Actuality (Able to handle up-to-date and changing data.)
Tech-Centric
13. Integration Capability / Modularity (e.g., can be integrated and used within any workflow or system / API design.)
14. Adaptability & Scalability (Support evolving demands and facilitate expansions.)
15. Security (e.g., mentions any security measures or standards.)
Legal and Ethical
16. Regulatory Compliance (Adhere to legal standards like GDPR or internal governance.)
17. Fairness & Bias (Ensure unbiased data handling and algorithmic outputs / mention anything about discrimination.)
18. Sustainability (e.g., minimize environmental impact or ensure long-term viability.)

If found, please provide it like this (example data only): 

(R2) Performance Indicators / Measurability / Validation: 
The paper discusses the validation of a retrieval system, indicating the need for metrics to assess its effectiveness. Suitable test data is also needed as a quality benchmark to assist engineers and stakeholders with validation and expectation management (p. 195).

(R7) Context Relevance: 
The paper describes the failure points of retrieval systems and the need for a system designed to generate contextually relevant and accurate information. It describes how requests for unavailable content, or questions that are related to the content but have no answer in it, can fool the system into giving a confusing response (p. 197).

(R9) Traceability:
The AITutor system proposed in the paper allows students to verify answers via a returned source list, indicating that the system is required to support users by giving insight into how it arrived at its answer (p. 196).

(R11) Data Integration / Handling Data Variety:
The paper mentions the need to process domain knowledge captured as artifacts in different formats, indicating the integration of various data types (p. 194). It shows the need for proper data handling, e.g., if the information is stored in a certain format such as a table within the document. 

(R12) Data Quality and Actuality:
The paper highlights the importance of maintaining accurate and up-to-date information for reliable insights in an information retrieval system (p. 195).

(R13) Integration Capability / Modularity:
The AITutor system discussed is integrated into the university's learning management system, showcasing its capability to integrate with existing workflows or systems (p. 196).

(R14) Adaptability & Scalability:
The paper discusses the need for retrieval systems to adapt to evolving demands and the importance of continuous adaptation (p. 198).

(R16) Regulatory Compliance: 
The study discusses compliance with privacy regulations (e.g., GDPR) and the need to address these issues to ensure compliance with legal and ethical standards (p. 8).

(R18) Sustainability:
The paper discusses the environmental impact of such solutions and suggests minimizing their footprint (p. 195). 

The requirements do not necessarily have to appear in the exact wording above, but should be clearly transferable to the predefined requirements. It is also fine to find none of the requirements in the text. Usually, no single text covers every requirement. A single insight can also apply to multiple requirements.

The extracted paper's text: 
{text}
'''
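
The template is filled with `str.format` later in this notebook. That works as long as `{text}` is the only pair of braces in the template, because `str.format` treats every `{` and `}` as part of a replacement field. If a JSON example were ever added to the template, the call would raise an error. A minimal sketch of a more tolerant substitution follows. The helper name `build_prompt` and the optional `max_chars` cap are assumptions for illustration. The cap is a crude character limit, not a token count.

```python
from typing import Optional


def build_prompt(template: str, text: str, max_chars: Optional[int] = None) -> str:
    """Insert `text` into `template` at the literal '{text}' placeholder.

    Hypothetical helper. Unlike str.format, any other braces in the
    template are left untouched. If max_chars is given, the paper text
    is cut to that many characters before it is inserted.
    """
    if "{text}" not in template:
        raise ValueError("template has no '{text}' placeholder")
    if max_chars is not None:
        text = text[:max_chars]
    return template.replace("{text}", text)


# A template with literal braces that would break str.format
demo_template = 'Return JSON like {"id": "R2"}.\nPaper text:\n{text}'
print(build_prompt(demo_template, "Some long paper text", max_chars=9))
```

The same call pattern would apply to the notebook's own template, for example `build_prompt(PROMPT_TEMPLATE, pdf_text)`.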


In [60]:
# Prepare the prompt
prompt = PROMPT_TEMPLATE.format(text=pdf_text)

response = client.chat.completions.create(
    model=OPENAI_MODEL,
    messages=[
        {"role": "user", "content": prompt}
    ],
    n=1,
    temperature=0.1,
)
reply = response.choices[0].message.content

# Output the response
print(reply)
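
The reply is free text that follows the `(R<n>) Title:` pattern requested in the prompt, so it can be split into records for later analysis. The following is a minimal sketch, assuming the model keeps to that pattern. The function name `parse_reply`, the regular expressions, and the shortened sample reply are illustrative assumptions and not part of the original notebook. A reply that deviates from the pattern would need extra handling.

```python
import re

# Heading such as "(R2) Performance Indicators:" on its own line
HEADING = re.compile(r"^\(R(\d+)\)[ \t]*(.+?):[ \t]*$", re.MULTILINE)
# Page reference such as "(p. 10)" or "(p.10)"
PAGE = re.compile(r"\(p\.\s*(\d+)\)")


def parse_reply(reply):
    """Split a reply into one dict per '(R<n>) Title:' section."""
    records = []
    matches = list(HEADING.finditer(reply))
    for i, match in enumerate(matches):
        # A section runs until the next heading or the end of the reply
        end = matches[i + 1].start() if i + 1 < len(matches) else len(reply)
        body = reply[match.end():end].strip()
        records.append({
            "requirement": int(match.group(1)),
            "title": match.group(2).strip(),
            "finding": body,
            "pages": [int(p) for p in PAGE.findall(body)],
        })
    return records


# Shortened sample in the format requested by the prompt
sample_reply = """Here are the relevant extracts:

(R1) Increased Productivity & Efficiency:  
The framework can lead to time savings (p. 1).

(R2) Performance Indicators / Measurability / Validation Metrics:  
Performance is evaluated with mean reciprocal rank (p. 10).
"""

records = parse_reply(sample_reply)
for record in records:
    print(record["requirement"], record["title"], record["pages"])
```

The resulting list of dictionaries can be passed to `pd.DataFrame(records)` or written out with `json.dumps(records)`, which would put the notebook's `pandas` and `json` imports to use.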

Here are the relevant extracts from the provided paper that align with the predefined requirements for AI-based knowledge/information retrieval systems:

(R1) Increased Productivity & Efficiency:  
The study emphasizes that the CliniqIR framework enhances diagnostic quality by leveraging existing electronic health records and literature, which can lead to time savings in the diagnostic process and potentially improve return on investment (ROI) for healthcare systems (p. 1).

(R2) Performance Indicators / Measurability / Validation Metrics:  
The paper discusses the evaluation of CliniqIR's performance using metrics such as mean reciprocal rank (MRR) to assess its effectiveness in predicting diagnoses, indicating the importance of measurable outcomes for validation (p. 10).

(R3) Organizational Acceptance:  
The authors highlight the need for systems like CliniqIR to be user-friendly and transparent to foster trust among clinicians, which is crucial for organizational acceptance and int