# Ollama Notebook 
 A notebook to extract relevant, predefined parameters from research articles and provide the results in a desired format. 

In [1]:
# if required
# !pip install pymupdf 

In [2]:
# importing relevant libraries 
import pandas as pd
import fitz
import json
import subprocess
import os  
from tqdm import tqdm
import io
from PIL import Image

In [3]:
# os.environ['CUDA_VISIBLE_DEVICES'] = 'GPU-4ce5b5b6-0e10-58a3-b09a-2f4946277a5a'

The `extract_text` function reads in a pdf, and appends the entire content into a single string. It takes the path to a pdf as its argument.

In [4]:
def extract_text(pdf_path):
    document = fitz.open(pdf_path)
    text = ''
    for page in tqdm(document):
        text += page.get_text('text') + '\n'
    return text 

`pdf_to_images` converts each page of a pdf to an image for easier extraction of structured data. Preliminary trials showed that plain text extraction can give unreliable results and that the resulting text was not standardized. The function has the following attributes:
* input is the path to a pdf
* images are saved in the directory designated in `img_path`
* output is a list of paths to the saved page images

In [5]:
def pdf_to_images(pdf_path, dpi=200, img_path='./pdf_to_images/'):
    doc = fitz.open(pdf_path)
    images = []

    # create the output directory if it does not exist yet
    os.makedirs(img_path, exist_ok=True)

    for i, page in enumerate(doc):
        # render at higher resolution for OCR (PDF units are 1/72 inch)
        pix = page.get_pixmap(matrix=fitz.Matrix(dpi/72, dpi/72))
        img = Image.open(io.BytesIO(pix.tobytes("png")))
        out_path = os.path.join(img_path, f"page_{i+1}.png")
        img.save(out_path)
        images.append(out_path)
    return images
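
The `dpi/72` factor passed to `fitz.Matrix` comes from PDF page sizes being measured in points, with 72 points per inch. As a rough illustration, the sketch below estimates the pixel size of a rendered page. The helper name `rendered_size` and the A4 page size of 595 x 842 points are assumptions for the example, and the actual pixmap dimensions may differ by a pixel because of rounding.

```python
def rendered_size(width_pt, height_pt, dpi=200):
    """Estimate the pixel size of a page rendered at the given dpi.

    PDF lengths are in points (1/72 inch), so the zoom factor is dpi / 72.
    """
    zoom = dpi / 72
    return round(width_pt * zoom), round(height_pt * zoom)


# Hypothetical A4 page (595 x 842 points) at the default 200 dpi
print(rendered_size(595, 842))
```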


Debugging cell: `save_text` writes the extracted text to a file so it can be inspected.

In [6]:
def save_text(
        text,
        output = 'text.txt'
        ):
    with open(output, 'w', encoding='utf-8') as file:
        file.write(text)

The `query_ollama` function calls the local Ollama LLM. The arguments for this function are:

* prompt in utf-8 string format
* desired model - 'mistral' is the default argument

In [7]:
def query_ollama(
        prompt,
        model = 'mistral'
        ):

    # path to the Ollama executable (machine-specific; adjust as needed)
    ollama_path = r"C:\Users\Faisal\AppData\Local\Programs\Ollama\ollama.exe"
    result = subprocess.run(
        [ollama_path, "run", model],
        input=prompt.encode("utf-8"),
        capture_output=True
    )
    return result.stdout.decode("utf-8")
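
The hard-coded `ollama_path` only works on one machine. A possible alternative, sketched here, is to look the executable up on the `PATH` with `shutil.which` and fall back to an explicit path only if that fails. The helper name `find_ollama` and its `fallback` and `name` parameters are hypothetical and not part of the notebook.

```python
import shutil


def find_ollama(fallback=None, name="ollama"):
    """Return the path to the Ollama executable, or `fallback` if not on PATH."""
    found = shutil.which(name)
    return found if found is not None else fallback


# Example: prefer whatever is on PATH, otherwise use a machine-specific path
ollama_path = find_ollama(
    fallback=r"C:\Users\Faisal\AppData\Local\Programs\Ollama\ollama.exe"
)
```

The result could then replace the literal path inside `query_ollama`.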


Image version: the `query_ollama_image` function calls the local Ollama LLM with an image and a prompt defined elsewhere. The arguments for this function are:

* path to the image
* prompt in utf-8 string format
* desired model - 'qwen2.5vl:7b' is the default argument

In [8]:
def query_ollama_image(image_path, prompt, model="qwen2.5vl:7b"):
    # Multimodal message: content is a list of "parts" (text and image).
    # Note: this definition is overridden by the redefinition in the next cell.
    message = [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image", "image": image_path}
        ]
    }]
    
    result = subprocess.run(
        ["ollama", "run", model],
        input=json.dumps(message).encode("utf-8"),
        capture_output=True
    )
    
    return result.stdout.decode("utf-8").strip()



In [9]:
# Google-suggested version; redefines (and so overrides) query_ollama_image from the previous cell
def query_ollama_image(image_path, prompt, model="qwen2.5vl:7b"):
    # The image path is appended to the prompt, which is passed to Ollama on stdin.
    result = subprocess.run(
        ["ollama", "run", model, "--verbose"], # --verbose is helpful for debugging
        input=f"{prompt} {image_path}".encode("utf-8"),
        capture_output=True
    )
    return result.stdout.decode("utf-8").strip()
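
An alternative to passing the image path through the CLI prompt is Ollama's local HTTP API, whose `/api/generate` endpoint accepts base64-encoded images in an `images` list. The following is a minimal sketch, assuming an Ollama server is running at the default address `http://localhost:11434`. The function names `build_generate_payload` and `query_ollama_image_http` are illustrative and not part of the notebook.

```python
import base64
import json
import urllib.request


def build_generate_payload(image_bytes, prompt, model="qwen2.5vl:7b"):
    """Build the JSON body for a non-streaming /api/generate request."""
    return {
        "model": model,
        "prompt": prompt,
        # images are sent as base64-encoded strings
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def query_ollama_image_http(image_path, prompt, model="qwen2.5vl:7b",
                            url="http://localhost:11434/api/generate"):
    """Send one image and a prompt to a local Ollama server (assumed running)."""
    with open(image_path, "rb") as f:
        payload = build_generate_payload(f.read(), prompt, model)
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    # supplying `data` makes this a POST request
    with urllib.request.urlopen(request) as reply:
        return json.loads(reply.read().decode("utf-8"))["response"].strip()
```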

The prompt is defined here, the model is queried, and basic error handling is included. Arguments are:
* text extracted from the pdf; only the first 2000 characters are sent to the model
* model - default is 'mistral'

Output should be the relevant conditions in JSON format.

In [10]:
def extract_conditions(
        text, 
        model="mistral"
        ):
    # only the first 2000 characters of the text are sent per call
    prompt = f"""
    Extract the following experimental parameters from the text:

    - concentration
    - hydrolysis ratio
    - rate of base addition
    - elemental cation ratio

    Return ONLY a JSON array of objects with keys:
    ["parameter", "value", "unit", "context"].

    Text:
    {text[:2000]}
    """
    response = query_ollama(prompt, model=model)
    
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        print("Could not parse JSON, raw output:", response[:300])
        data = []
    
    return data
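
Because only `text[:2000]` reaches the model, anything after the first 2000 characters of an article is ignored. One possible way to cover the whole text is to split it into fixed-size pieces and call `extract_conditions` once per piece. The sketch below shows only the splitting step. `chunk_text` and its `overlap` parameter are hypothetical names, and the overlap is an assumption meant to reduce the chance of separating a value from its context.

```python
def chunk_text(text, size=2000, overlap=200):
    """Split text into pieces of at most `size` characters.

    Consecutive pieces share `overlap` characters.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this piece already reaches the end of the text
        start += size - overlap
    return chunks
```

Each piece could then be passed to `extract_conditions` and the returned lists concatenated.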


`extract_tables_from_pdf` uses the `pdf_to_images` function to render each page as an image, asks the vision model to return any tables in Markdown table format, and appends the results to a list of dictionaries.

In [11]:
def extract_tables_from_pdf(pdf_path, model="qwen2.5vl:7b"):
    images = pdf_to_images(pdf_path)
    tables = []
    
    for img in tqdm(images, desc="OCR with Qwen"):
        prompt = """Extract all tables from this page. Return them in Markdown table format. If you think there is no table present, describe what you see instead.
        Return only the first 10 rows.
        """
        response = query_ollama_image(img, prompt, model=model)
        tables.append({"page": img, "tables": response})
    return tables


Finally, `process_pdfs` iterates through all the pdfs in a folder, extracts their text, prompts the LLM, and produces a `parsed_conditions.csv` file with the compiled conditions. Arguments are:
* path to a folder containing the pdfs
* path/name of the output csv file - default is `parsed_conditions.csv`
* chosen model - default is 'mistral'

Output is the csv file.

In [12]:
def process_pdfs(pdf_folder, output_csv="parsed_conditions.csv", model="mistral"):
    all_data = []
    
    # extracting from body text
    for file in tqdm(os.listdir(pdf_folder)):
        if file.endswith(".pdf"):
            print(f"Processing {file}...")
            text = extract_text(os.path.join(pdf_folder, file))
            conditions = extract_conditions(text, model=model)
            for c in tqdm(conditions):
                c["source"] = file  # tag with filename
            all_data.extend(conditions)

            # Extracting from tables via OCR
            if model == 'qwen2.5vl:7b':
                tables = extract_tables_from_pdf(os.path.join(pdf_folder, file), model=model)
                with open(file.replace(".pdf", "_tables.json"), "w", encoding="utf-8") as f:
                    json.dump(tables, f, indent=2)
    
    df = pd.DataFrame(all_data)
    df.to_csv(output_csv, index=False)
    print(f"Saved {len(df)} rows to {output_csv}")


# Google-suggested revision

# Assumes the other functions (extract_text, pdf_to_images, query_ollama, query_ollama_image, extract_conditions, extract_tables_from_pdf) are already defined in the cells above.

def parse_markdown_table_with_ollama(markdown_table, model="mistral"):
    """
    Parses a markdown table string into a JSON array of objects 
    using a text-based LLM.
    """
    prompt = f"""
    You have been provided with a Markdown table extracted from a research article.
    Extract the following experimental parameters from the table:
    - concentration
    - hydrolysis ratio
    - rate of base addition
    - elemental cation ratio

    Return ONLY a JSON array of objects with keys:
    ["parameter", "value", "unit", "context"].
    
    Markdown Table:
    {markdown_table}
    """
    
    # Use the existing query_ollama function for text-based parsing
    response = query_ollama(prompt, model=model)

    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        print("Could not parse JSON from table markdown, raw output:", response[:300])
        data = []
    
    return data

#----------------------------------------------------------------------------------------------------

def process_pdfs(pdf_folder, output_csv="parsed_conditions.csv", model="mistral"):
    all_data = []
    
    for file in tqdm(os.listdir(pdf_folder), desc="Processing PDFs"):
        if file.endswith(".pdf"):
            print(f"Processing {file}...")
            file_path = os.path.join(pdf_folder, file)

            # --- Text-based extraction (for the body of the article) ---
            try:
                text = extract_text(file_path)
                conditions_from_text = extract_conditions(text, model=model)
                for c in conditions_from_text:
                    c["source"] = f"{file}_body_text"
                all_data.extend(conditions_from_text)
            except Exception as e:
                print(f"Error processing text for {file}: {e}")

            # --- Image-based extraction (for tables via OCR) ---
            # This part only runs if the specified model is a multimodal one
            if model == 'qwen2.5vl:7b':
                try:
                    tables_data = extract_tables_from_pdf(file_path, model=model)
                    
                    # A second, text-based LLM call (default model 'mistral') parses the Markdown table string
                    for page_data in tables_data:
                        markdown_tables_string = page_data["tables"]
                        # Call the new function to parse the markdown string
                        conditions_from_tables = parse_markdown_table_with_ollama(markdown_tables_string)
                        
                        for c in conditions_from_tables:
                            # Add a source tag that includes the page information
                            c["source"] = f"{file}_table_page_{page_data['page'].split('_')[-1].split('.')[0]}"
                        
                        all_data.extend(conditions_from_tables)
                except Exception as e:
                    print(f"Error processing tables for {file}: {e}")

    df = pd.DataFrame(all_data)
    df.to_csv(output_csv, index=False)
    print(f"Saved {len(df)} rows to {output_csv}")
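
The source tag recovers the page number by splitting the image path on `_` and `.`, which depends on the `page_<n>.png` naming used in `pdf_to_images`. A slightly more explicit variant is sketched below. `page_number` is a hypothetical helper and not part of the notebook.

```python
import re
from pathlib import Path


def page_number(image_path):
    """Return the page number encoded in a 'page_<n>.png' file name, or None."""
    match = re.fullmatch(r"page_(\d+)", Path(image_path).stem)
    return int(match.group(1)) if match else None
```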

Here we call the `process_pdfs` function to analyze all the pdfs. Don't forget to set `pdf_folder` to the path of the folder containing the pdfs.

In [14]:
# process_pdfs(
#     pdf_folder='./articles',
#     output_csv='parsed_conditions.csv',
#     model='mistral'
# )

In [15]:
process_pdfs(
    pdf_folder='./articles',
    output_csv='parsed_conditions.csv',
    model='qwen2.5vl:7b'
)

Processing PDFs:   0%|          | 0/1 [00:00<?, ?it/s]

Processing d4nr00473f.pdf...


100%|██████████| 12/12 [00:00<00:00, 214.46it/s]


Could not parse JSON, raw output: ```json
[
    {
        "parameter": "concentration",
        "value": "Increasing",
        "unit": "Not specified",
        "context": "limits nanotube production for imogolite nanotubes"
    },
    {
        "parameter": "hydrolysis ratio",
        "value": "relatively high or low",
        "unit


OCR with Qwen: 100%|██████████| 12/12 [00:48<00:00,  4.01s/it]


Could not parse JSON from table markdown, raw output:  I apologize for the confusion earlier. Since there's no table provided in the description, let me provide an example based on assumptions about the content from your description:

```json
[
  {
    "parameter": "Concentration",
    "value": "0.1 M",
    "unit": "Molar",
    "context": "Concentratio
Could not parse JSON from table markdown, raw output:  I apologize for the misunderstanding. Since there's no table provided, let me provide an example based on a hypothetical table from the same research article:

```json
[
  {
    "parameter": "Concentration",
    "value": "0.1 M",
    "unit": "Molarity",
    "context": "Aluminum nitrate solution"
  
Could not parse JSON from table markdown, raw output:  I'm sorry for any confusion, but as the Markdown table you provided is not visible, I'm unable to extract the experimental parameters from it. I can only provide an example of how the JSON array would look like when filled with data:



Processing PDFs: 100%|██████████| 1/1 [01:32<00:00, 92.99s/it]

Could not parse JSON from table markdown, raw output:  I apologize for the confusion, as it appears there is no table provided in the text you've shared. In order to extract the experimental parameters from a Markdown table, I would need to access the actual table itself.

However, I can still help you understand how to format the experimental paramete
Saved 5 rows to parsed_conditions.csv
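
The raw outputs above start with a Markdown code fence or an apology sentence, so `json.loads` fails on the full reply even when a JSON array may be present. A possible remedy, sketched here under the assumption that a reply contains at most one JSON array, is to pull the array out before parsing. The helper name `parse_json_response` is hypothetical.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker


def parse_json_response(response):
    """Return the JSON array found in an LLM reply, or [] if none parses."""
    # Prefer the contents of a fenced block (optionally tagged 'json').
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, response, re.DOTALL)
    candidate = fenced.group(1) if fenced else response
    # Keep only the outermost [...] span, dropping any surrounding prose.
    start, end = candidate.find("["), candidate.rfind("]")
    if start == -1 or end < start:
        return []
    try:
        data = json.loads(candidate[start:end + 1])
    except json.JSONDecodeError:
        return []
    return data if isinstance(data, list) else []
```

Calling this helper in place of the bare `json.loads(response)` in `extract_conditions` and `parse_markdown_table_with_ollama` would be one way to try it. Replies that contain no table data at all, like the apologies above, would still yield an empty list.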



