<a href="https://colab.research.google.com/github/edward-honour/PDFtoJSON-OpenAI/blob/main/PDFtoJSONTool.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Install the Required Libraries with pip**

For this demo we are using:

*   openai - to call the LLM
*   pdfminer.six - to extract text from a PDF.


In [1]:
!pip install openai pdfminer.six

Collecting pdfminer.six
  Downloading pdfminer_six-20250506-py3-none-any.whl.metadata (4.2 kB)
Downloading pdfminer_six-20250506-py3-none-any.whl (5.6 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.6/5.6 MB 36.9 MB/s eta 0:00:00
Installing collected packages: pdfminer.six
Successfully installed pdfminer.six-20250506


**Import the OpenAI API Key from Colab Secrets**

We read the OpenAI API key from the secrets in Google Colab, store it in 'api_key', and then create a 'client' object that later cells use.


In [2]:
from google.colab import userdata
import openai

api_key = userdata.get("OpenAI")

if not api_key:
    raise ValueError("❌ OpenAI API key not found. Add a secret named 'OpenAI' in the Colab Secrets panel first.")

client = openai.OpenAI(api_key=api_key)
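
The cell above only works inside Colab. As a minimal sketch, assuming you also want to run the notebook elsewhere, the key lookup could fall back to an environment variable. The helper name `load_openai_key` is hypothetical and not part of the original notebook; `OPENAI_API_KEY` is the variable the openai library reads by default.

```python
import os

def load_openai_key(secret_name: str = "OpenAI", env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key from Colab Secrets, else from an environment variable."""
    try:
        # google.colab is only importable inside a Colab runtime.
        from google.colab import userdata
        key = userdata.get(secret_name)
    except Exception:
        # Not in Colab, the secret is missing, or notebook access was not granted.
        key = None

    # Fall back to the environment when Colab did not provide a key.
    key = key or os.environ.get(env_var)
    if not key:
        raise ValueError(
            f"OpenAI API key not found in Colab secret '{secret_name}' or ${env_var}."
        )
    return key
```

The result can be passed to the client exactly as before: `client = openai.OpenAI(api_key=load_openai_key())`.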

**Upload the PDF File to Process**

We use the files module from google.colab to upload the PDF we want to process.


In [3]:
from google.colab import files

uploaded = files.upload()  # Upload a health report PDF
pdf_filename = list(uploaded.keys())[0]
print(f"📄 Uploaded: {pdf_filename}")


Saving SampleHealthReport.pdf to SampleHealthReport.pdf
📄 Uploaded: SampleHealthReport.pdf
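
`files.upload()` returns a dictionary that maps each file name to its raw bytes, so the upload can be sanity-checked before any text extraction. The sketch below is an optional addition, not part of the original notebook; the helper name `looks_like_pdf` is hypothetical, and it only checks the standard `%PDF-` header marker at the start of the file.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes begin with the PDF header marker '%PDF-'."""
    return data.startswith(b"%PDF-")

# In the notebook this would be called as:
#     looks_like_pdf(uploaded[pdf_filename])
```

This is a cheap guard against uploading the wrong file type; it does not prove the PDF is well formed or contains extractable text.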


**Define the Function Schema**

The function schema is like a contract or blueprint that tells GPT:

“You are not just generating text. You are calling a function named extract_patient_info, and here’s what the input should look like.”

The "parameters" object follows JSON Schema conventions. The schema includes:

*   **"name"**: the name of the virtual function GPT is calling
*   **"description"**: helps the model understand what the function is for
*   **"parameters"**: specifies the expected structure of the output (a JSON object)
*   **"properties"** (inside "parameters"): lists all fields we want from the report
*   **"required"** (inside "parameters"): tells GPT which fields are mandatory


In [4]:
patient_info_function = {
    "name": "extract_patient_info",
    "description": "Extracts key personal fields from a health report.",
    "parameters": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "patient_id": {"type": "string"},
            "mrn": {"type": ["string", "null"]},
            "birth_date": {"type": "string", "format": "date"},
            "age": {"type": "integer"},
            "gender": {"type": "string"},
            "address": {"type": "string"},
            "phone": {
                "type": "object",
                "properties": {
                    "home": {"type": ["string", "null"]},
                    "work": {"type": ["string", "null"]},
                    "cell": {"type": ["string", "null"]}
                }
            },
            "email": {"type": "string"},
            "provider": {"type": "string"},
            "referring_provider": {"type": "string"},
            "print_date": {"type": "string", "format": "date"}
        },
        "required": ["name", "patient_id", "birth_date", "age", "gender", "address", "email"]
    }
}
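
The "required" list is guidance to the model rather than a guarantee, so it can be useful to check a result against it after the call. A minimal sketch, using only the standard library; the helper name `missing_required` and the small demo schema are illustrative and not part of the original notebook.

```python
from typing import List

def missing_required(function_schema: dict, data: dict) -> List[str]:
    """List the top-level required fields that are absent from `data`."""
    required = function_schema.get("parameters", {}).get("required", [])
    return [field for field in required if field not in data]

# Tiny demo schema with the same shape as a function schema.
demo_schema = {
    "name": "demo",
    "parameters": {
        "type": "object",
        "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
        "required": ["name", "age"],
    },
}

print(missing_required(demo_schema, {"name": "HONOUR, EDWARD"}))
```

This only checks that the keys are present. It does not validate types or formats; a full JSON Schema validator would be needed for that.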

**Define PDF and OpenAI Functions**

We use the **extract_text** function from pdfminer.six to get the text from the PDF that was uploaded.


In [5]:
from pdfminer.high_level import extract_text

def extract_text_from_pdf(path: str) -> str:
    return extract_text(path)

pdf_text = extract_text_from_pdf(pdf_filename)
print("✅ PDF text extracted. Sample preview:\n")
print(pdf_text[:1000])


✅ PDF text extracted. Sample preview:

PATIENT CHART  
HONOUR, EDWARD
123 S MAIN STREET CHICAGO IL  60611 
O: (312) 555-1212(preferred)

DOB: 8/15/1966  AGE: 58 yrs.  Acct#: 148

DEMOGRAPHICS

NAME: HONOUR, EDWARD
PATIENT ID/#: 148
MRN:
BIRTH DATE: 8/15/1966
AGE: 58 yrs.
GENDER: M
ADDRESS: 123 S MAIN STREET

CHICAGO IL  60611

Home:
Work:
Cell: (202) 555-1212
EMAIL: ED@EXAMPLE.COM
PROVIDER: SMITH, JOHN, MD
REFERRING PROVIDER: JOHN SMITH
STE A
3196 E ELM STREET
CHICAGO IL  60611
(773) 555-1212

ALLERGIES

No data on file

MEDICATION DETAIL

Current:
SIG: epinephrine 0.15 mg/0.3 mL injection auto-injector, 0 days, Dispense #2 Each, 0 Refills, Directions: administer 0.15mg 
subcutaneously as needed
8/8/2023 SMITH,JOHN,MD
----------------------------------
Current:
SIG: lisinopril 20 mg oral tablet, 90 days, Dispense #90 Tablet, 0 Refills, Directions: Pt to take 1 tablet po a day
2/9/2024 SMITH,JOHN,MD
----------------------------------
Current:
SIG: finasteride 1 mg oral tablet, 60 days, 
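
The preview shows trailing spaces and stray blank lines left over from the PDF layout. As an optional sketch (not something the original notebook does), the text could be tidied before it is sent to the model. The helper name `tidy_pdf_text` is hypothetical; it leaves spacing inside a line untouched so that values such as addresses are not altered.

```python
import re

def tidy_pdf_text(text: str) -> str:
    """Strip trailing whitespace per line and collapse runs of blank lines to one."""
    lines = [line.rstrip() for line in text.splitlines()]
    cleaned = "\n".join(lines)
    # Three or more consecutive newlines mean two or more blank lines; keep one.
    return re.sub(r"\n{3,}", "\n\n", cleaned).strip()
```

Whether this helps depends on the document; for short reports the raw text usually works as is.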

**Extract Data**

We use gpt-4o-mini to extract the data. Forcing the function call makes the model return its answer as JSON arguments that follow function_schema.


In [6]:
import json

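def call_extraction(text: str, function_schema: dict) -> dict:
    """Send the report text to the model and return the extracted fields as a dict."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.0,  # keep the extraction as repeatable as possible
        messages=[
            {"role": "system", "content": "You are a medical data extractor."},
            {"role": "user", "content": text}
        ],
        functions=[function_schema],
        # Force the model to call our function instead of replying in free text.
        function_call={"name": function_schema["name"]}
    )
    # The arguments arrive as a JSON string, so parse them into a dict.
    args = response.choices[0].message.function_call.arguments
    return json.loads(args)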

try:
    patient_info = call_extraction(pdf_text, patient_info_function)
    print("✅ Patient Info Extracted:\n")
    print(json.dumps(patient_info, indent=2))
except Exception as e:
    print("❌ Extraction failed:", str(e))



✅ Patient Info Extracted:

{
  "name": "HONOUR, EDWARD",
  "patient_id": "148",
  "birth_date": "1966-08-15",
  "age": 58,
  "gender": "M",
  "address": "123 S MAIN STREET, CHICAGO IL 60611",
  "phone": {
    "home": "(312) 555-1212",
    "cell": "(202) 555-1212"
  },
  "email": "ED@EXAMPLE.COM",
  "provider": "SMITH, JOHN, MD",
  "referring_provider": "JOHN SMITH",
  "print_date": "2025-05-13"
}
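
The `functions` and `function_call` parameters used above are marked as deprecated in the Chat Completions API in favor of `tools` and `tool_choice`. A minimal sketch of how the same schema could be wrapped for the newer form; the helper name `to_tools_payload` is hypothetical and not part of the original notebook.

```python
def to_tools_payload(function_schema: dict) -> dict:
    """Build `tools` and `tool_choice` keyword arguments from a function schema."""
    return {
        "tools": [{"type": "function", "function": function_schema}],
        # Force the model to call this specific function.
        "tool_choice": {
            "type": "function",
            "function": {"name": function_schema["name"]},
        },
    }

# Small stand-in schema so the sketch runs on its own.
demo_function = {
    "name": "extract_patient_info",
    "parameters": {"type": "object", "properties": {"name": {"type": "string"}}},
}
payload = to_tools_payload(demo_function)
```

With this form the payload would be passed as `client.chat.completions.create(model=..., messages=..., **payload)`, and the JSON arguments are read from `response.choices[0].message.tool_calls[0].function.arguments` instead of `message.function_call.arguments`.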
