# Ollama Notebook
A notebook that extracts relevant, predefined parameters from research articles and provides the results in a desired format.

In [4]:
# if required
# !pip install pymupdf 
# !pip install python-docx



In [5]:
# importing relevant libraries 
import pandas as pd
import fitz
import json
import subprocess
import os  
from tqdm import tqdm
import io
from PIL import Image
from docx import Document

In [7]:
# os.environ['CUDA_VISIBLE_DEVICES'] = 'GPU-4ce5b5b6-0e10-58a3-b09a-2f4946277a5a'

The `extract_text` function opens a PDF and concatenates the text of every page into a single string. Its only argument is the path to the PDF.

In [8]:
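def extract_text(pdf_path):
    # concatenate the plain text of every page, separated by newlines
    text = ''
    with fitz.open(pdf_path) as document:  # closes the file when done
        for page in tqdm(document):
            text += page.get_text('text') + '\n'
    return text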

`pdf_to_images` renders each page of a PDF as a PNG image for easier extraction of structured data. In preliminary trials, plain text extraction gave inconsistent results and the resulting text was not in a standardized format. The function has the following attributes:
* Input is the path to a PDF
* Images are saved in the directory designated in `img_path`
* Output is a list of the file paths of the saved page images

In [9]:
def pdf_to_images(pdf_path, dpi=200, img_path='./pdf_to_images/'):
    doc = fitz.open(pdf_path)
    images = []

    # create the output directory if it does not exist yet
    os.makedirs(img_path, exist_ok=True)

    for i, page in enumerate(doc):
        # render at higher resolution for OCR (PDF units are 1/72 inch)
        pix = page.get_pixmap(matrix=fitz.Matrix(dpi/72, dpi/72))
        img = Image.open(io.BytesIO(pix.tobytes("png")))
        # os.path.join also works when img_path has no trailing slash
        out_path = os.path.join(img_path, f"page_{i+1}.png")
        img.save(out_path)
        images.append(out_path)
    return images
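
Two details of `pdf_to_images` are worth spelling out: the `dpi/72` zoom factor, and the fact that every PDF writes to the same `page_N.png` names, so images from one PDF are overwritten by the next when the default `img_path` is reused. The sketch below is only an illustration; `rendered_size` and `image_dir_for` are hypothetical helpers that are not part of the notebook.

```python
import os

def rendered_size(width_pt, height_pt, dpi=200):
    """Pixel size of a page rendered with a zoom of dpi/72.

    PDF page dimensions are given in points (1 point = 1/72 inch).
    """
    zoom = dpi / 72
    return round(width_pt * zoom), round(height_pt * zoom)

def image_dir_for(pdf_path, img_root="./pdf_to_images"):
    """Return a per-PDF output directory: <img_root>/<pdf name without .pdf>."""
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    return os.path.join(img_root, stem)

# A US Letter page is 612 x 792 points
size = rendered_size(612, 792, dpi=200)
folder = image_dir_for("articles/d4nr00473f.pdf", img_root="out")
```

Passing the result of `image_dir_for(pdf_path)` as `img_path` would keep the page images of different PDFs apart.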


Debugging cell: `save_text` writes a string to a text file so that the extracted text can be inspected.

In [10]:
def save_text(
        text,
        output = 'text.txt'
        ):
    with open(output, 'w', encoding='utf-8') as file:
        file.write(text)

The `query_ollama` function calls the local Ollama LLM through its command-line executable. The arguments for this function are:

* prompt as a string (it is encoded to UTF-8 before being passed to Ollama)
* desired model; 'mistral' is the default

In [11]:
def query_ollama(
        prompt,
        model = 'mistral'
        ):

    # path to the Ollama executable (machine-specific, adjust to your install)
    ollama_path = r"C:\Users\Faisal\AppData\Local\Programs\Ollama\ollama.exe"
    # the prompt is piped to `ollama run <model>` on stdin
    result = subprocess.run(
        [ollama_path, "run", model],
        input=prompt.encode("utf-8"),
        capture_output=True
    )
    return result.stdout.decode("utf-8")
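
`query_ollama` returns whatever is on stdout, so a failed call (missing model, Ollama not running) shows up only as an empty string. A minimal sketch of a stricter variant follows. It assumes `ollama` can be found on the `PATH`; `run_cli` and `query_ollama_checked` are hypothetical names, and the last lines use the Python interpreter as a stand-in command so the cell runs without Ollama.

```python
import shutil
import subprocess
import sys

def run_cli(cmd, prompt, timeout=600):
    """Pipe `prompt` to `cmd` on stdin and return its decoded stdout.

    Raises RuntimeError if the process exits with a non-zero status.
    """
    result = subprocess.run(
        cmd,
        input=prompt.encode("utf-8"),
        capture_output=True,
        timeout=timeout,  # raises subprocess.TimeoutExpired if exceeded
    )
    if result.returncode != 0:
        err = result.stderr.decode("utf-8", errors="replace").strip()
        raise RuntimeError(f"{cmd[0]} exited with {result.returncode}: {err}")
    return result.stdout.decode("utf-8", errors="replace")

def query_ollama_checked(prompt, model="mistral"):
    # hypothetical variant of query_ollama that looks up `ollama` on PATH
    ollama = shutil.which("ollama")
    if ollama is None:
        raise FileNotFoundError("ollama executable not found on PATH")
    return run_cli([ollama, "run", model], prompt)

# Stand-in command: upper-cases stdin, so no Ollama install is needed here
echo_upper = [sys.executable, "-c", "import sys; print(sys.stdin.read().upper())"]
reply = run_cli(echo_upper, "hello")
```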


In [12]:
TEMPLATE_PROMPT = """
You are an expert in extracting synthesis and calcination conditions from scientific papers.  
Given the following text (excerpt of a paper), fill in the template below.  
Only include information explicitly mentioned in the text. If something is not reported, leave it blank.  

Template:

Synthesis conditions:
Ni state: 
Ni source: 
Ni concentration: 
Base: 
Chelating agent: 
pH: 
Temp: 
Atmosphere: 
Pressure: 
Ni flowrate: 
Chelating agent flowrate: 
OH flowrate: 
Mixing method: 
Form: 

Calcination conditions:
Li source: 
Ni to Li ratio: 
Final temp: 
Ramp rate: 
Duration: 
Atmosphere: 
Pressure: 
Intermediate annealing: 
Form: 
Furnace type: 

Observations:
Rock salt structure observed at: 
Layered LiNiO2 observed at: 
Decomposition observed at: 
Intermediate phases: 

Text to analyze:
{paper_text}
""".strip()


In [13]:
def build_prompt_from_template(text, max_chars=8000):
    # limit to avoid context overflow on local models
    return TEMPLATE_PROMPT.format(paper_text=text[:max_chars])
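
`build_prompt_from_template` keeps only the first `max_chars` characters, so anything later in the paper never reaches the model. One possible alternative, sketched here under assumptions, is to split the text into overlapping windows and prompt once per window. `chunk_text` and its `overlap` parameter are hypothetical and not part of the notebook.

```python
def chunk_text(text, max_chars=8000, overlap=500):
    """Split `text` into windows of at most `max_chars` characters.

    Consecutive windows share `overlap` characters, so a sentence cut at
    one boundary appears whole in the next window.
    """
    if max_chars <= overlap:
        raise ValueError("max_chars must be larger than overlap")
    step = max_chars - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks

# Tiny demonstration with a 20-character string
demo = "".join(str(i % 10) for i in range(20))
windows = chunk_text(demo, max_chars=10, overlap=2)
```

Each window could then be passed to `TEMPLATE_PROMPT.format(paper_text=window)`; how to merge the per-window answers is left open.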


Image version: the `query_ollama_image` function calls the local Ollama LLM with an image and a prompt that is defined elsewhere. The arguments for this function are:

* path to the image file
* prompt as a string (it is encoded to UTF-8 before being passed to Ollama)
* desired model; 'qwen2.5vl:7b' is the default

In [14]:
def query_ollama_image(image_path, prompt, model="qwen2.5vl:7b"):
    # Proper multimodal schema: content is a list of "parts"
    message = [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image", "image": image_path}
        ]
    }]
    
    result = subprocess.run(
        ["ollama", "run", model],
        input=json.dumps(message).encode("utf-8"),
        capture_output=True
    )
    
    return result.stdout.decode("utf-8").strip()
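
An alternative to piping the message through the command line is Ollama's local HTTP API, which accepts images as base64 strings. The sketch below assumes the Ollama server is running on its default address `http://localhost:11434`; `build_image_payload` and `query_ollama_image_http` are hypothetical names and have not been run against the models used here.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint

def build_image_payload(image_path, prompt, model="qwen2.5vl:7b"):
    """Build the JSON body for a single non-streaming generate request."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "model": model,
        "prompt": prompt,
        "images": [image_b64],  # images are sent as base64-encoded strings
        "stream": False,        # one JSON object instead of a token stream
    }

def query_ollama_image_http(image_path, prompt, model="qwen2.5vl:7b", timeout=600):
    # hypothetical HTTP counterpart of query_ollama_image
    body = json.dumps(build_image_payload(image_path, prompt, model)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["response"].strip()
```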



In [15]:
def strip_code_fences(s: str) -> str:
    s = s.strip()
    if s.startswith("```"):
        # content of the first fenced block; an unclosed fence still yields parts[1]
        block = s.split("```")[1]
        # drop a language tag (e.g. "json") attached to the opening fence
        first, sep, rest = block.partition("\n")
        if sep and first.strip().isalnum():
            block = rest
        return block.strip()
    return s

The prompt is built from the template, the model is queried, and any Markdown code fences are stripped from the response. Arguments are:
* text extracted from the PDF, truncated to the first `max_chars` characters (default 8000)
* model; the default is 'mistral'

The output is the filled-in template as plain text.

In [16]:
def extract_filled_template(text, model="mistral", max_chars=8000):
    prompt = build_prompt_from_template(text, max_chars=max_chars)
    response = query_ollama(prompt, model=model)
    return strip_code_fences(response)
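
The filled template comes back as `Key: value` lines grouped under three section headings. If structured output is wanted (for JSON or a table), a parser along the following lines could be a starting point. This is a minimal sketch: `parse_filled_template` is a hypothetical helper, the example values are made up, and real model output may deviate from the template.

```python
import json

SECTION_NAMES = ("Synthesis conditions", "Calcination conditions", "Observations")

def parse_filled_template(filled):
    """Turn 'Key: value' lines into {section: {key: value}}."""
    parsed = {}
    section = None
    for line in filled.splitlines():
        line = line.strip().lstrip("-* ")  # tolerate bullet markers
        if ":" not in line:
            continue
        # split at the first colon only, so values like "1:1.05" stay intact
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key in SECTION_NAMES and not value:
            section = key
            parsed[section] = {}
        elif section is not None:
            parsed[section][key] = value
    return parsed

# Made-up example response, for illustration only
example = """Synthesis conditions:
Ni source: NiSO4
pH: 11

Calcination conditions:
Ni to Li ratio: 1:1.05
Atmosphere:
"""
result = parse_filled_template(example)
as_json = json.dumps(result, indent=2)
```

Keys repeated across sections, such as `Atmosphere`, stay separate because each value is stored under its section.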


Uses the `pdf_to_images` function to render each page as an image, then asks the vision model to extract any tables in Markdown table format. Results are appended to a list of dictionaries, one per page.

In [17]:
def extract_tables_from_pdf(pdf_path, model="qwen2.5vl:7b"):
    images = pdf_to_images(pdf_path)
    tables = []
    
    for img in tqdm(images, desc="OCR with Qwen"):
        prompt = """Extract all tables from this page. Return them in Markdown table format. If you think there is no table present, describe what you see instead.
        Return only the first 10 rows.
        """
        response = query_ollama_image(img, prompt, model=model)
        tables.append({"page": img, "tables": response})
    return tables
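
Each entry in the returned list holds the model's raw response, which may mix a Markdown table with free text. A minimal sketch for turning the first table in such a response into row dictionaries is shown below. `parse_markdown_table` is a hypothetical helper, the sample text is made up, and it assumes the model emits pipe-delimited rows that start with `|`.

```python
def parse_markdown_table(markdown):
    """Return the first Markdown table in `markdown` as a list of row dicts."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if line.startswith("|"):
            inner = line[1:-1] if line.endswith("|") else line[1:]
            rows.append([cell.strip() for cell in inner.split("|")])
        elif rows:
            break  # the first table has ended
    if len(rows) < 2:
        return []
    header, body = rows[0], rows[1:]

    def is_separator(row):
        # a |---|:---:| style row contains only dashes, colons and spaces
        return all(cell and set(cell) <= set("-: ") for cell in row)

    return [dict(zip(header, row)) for row in body if not is_separator(row)]

# Made-up response text, for illustration only
sample = """Here is the table:

| Sample | Temp (C) |
|--------|----------|
| A      | 700      |
| B      | 750      |
"""
table = parse_markdown_table(sample)
```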


Finally, `process_pdfs_to_word` iterates through all the PDFs in a folder, extracts their text, prompts the LLM, and produces a `parsed_conditions.docx` file with the compiled conditions, one page per PDF. Arguments are:
* path to a folder containing the PDFs
* path/name of the output Word file; the default is `parsed_conditions.docx`
* chosen model; the default is 'mistral'
* maximum number of characters of each paper passed to the model; the default is 8000

Output is the Word file.

In [18]:
def process_pdfs_to_word(
    pdf_folder,
    output_docx="parsed_conditions.docx",
    model="mistral",
    max_chars=8000
):
    doc = Document()

    for file in tqdm(os.listdir(pdf_folder), desc="Processing PDFs"):
        if not file.lower().endswith(".pdf"):
            continue

        pdf_path = os.path.join(pdf_folder, file)
        print(f"Processing {file}...")
        
        # 1) extract text from the PDF
        text = extract_text(pdf_path)

        # 2) run your template prompt against the text
        filled = extract_filled_template(text, model=model, max_chars=max_chars)
        if not filled.strip():
            filled = "No extraction produced for this paper."

        # 3) start a new page (except for the first entry)
        if doc.paragraphs:
            doc.add_page_break()

        # 4) heading + body
        doc.add_heading(f"{file}", level=1)

        # preserve line breaks by adding one paragraph per line
        for line in filled.splitlines():
            doc.add_paragraph(line)

    # 5) save once at the end
    doc.save(output_docx)
    print(f"Saved summaries to {output_docx}")
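
`process_pdfs_to_word` loops over `os.listdir(pdf_folder)`, whose order is arbitrary, and its progress bar counts every directory entry, including files that are skipped. A small sketch of collecting the PDFs up front is given below; `list_pdfs` is a hypothetical helper that is not part of the notebook.

```python
from pathlib import Path

def list_pdfs(pdf_folder):
    """Return the PDF files in `pdf_folder` as Path objects, sorted by name."""
    return sorted(
        p for p in Path(pdf_folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

Iterating over `tqdm(list_pdfs(pdf_folder))` would give a progress total that matches the number of PDFs and a reproducible page order in the Word file.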


Here we call the `process_pdfs_to_word` function to analyze all the PDFs. Don't forget to set `pdf_folder` to the path of the PDF folder.

In [20]:
process_pdfs_to_word(
    pdf_folder="./articles",
    output_docx="parsed_conditions.docx",
    model="mistral"  # or 'mixtral:8x7b-instruct' / 'llama3:8b-instruct'
)


Processing PDFs:   0%|          | 0/10 [00:00<?, ?it/s]

Processing Advanced Energy Materials - 2023 - Song - Li Ni Intermixing  The Real Origin of Lattice Oxygen Stability in Co‐Free Ni‐Rich.pdf...


100%|██████████| 12/12 [00:00<00:00, 394.42it/s]
Processing PDFs:  10%|█         | 1/10 [00:05<00:49,  5.52s/it]

Processing Advanced Materials - 2024 - Tayal - In Situ Insights into Cathode Calcination for Predictive Synthesis  Kinetic.pdf...


100%|██████████| 13/13 [00:00<00:00, 361.69it/s]
Processing PDFs:  20%|██        | 2/10 [00:12<00:52,  6.51s/it]

Processing An in situ structural study on the synthesis and decomposition of LiNiO2.pdf...


100%|██████████| 13/13 [00:00<00:00, 332.52it/s]
Processing PDFs:  30%|███       | 3/10 [00:18<00:43,  6.19s/it]

Processing d4nr00473f.pdf...


100%|██████████| 12/12 [00:00<00:00, 341.51it/s]
Processing PDFs:  40%|████      | 4/10 [00:24<00:36,  6.01s/it]

Processing hua-et-al-2020-chemical-and-structural-evolution-during-the-synthesis-of-layered-li(ni-co-mn)o2-oxides.pdf...


100%|██████████| 14/14 [00:00<00:00, 153.84it/s]
Processing PDFs:  50%|█████     | 5/10 [00:30<00:30,  6.13s/it]

Processing Monitoring the Formation of Nickel-Poor and Nickel-Rich Oxide Cathode Materials for Lithium-Ion Batteries with Synchrotron Radiation.pdf...


100%|██████████| 13/13 [00:00<00:00, 126.60it/s]
Processing PDFs:  60%|██████    | 6/10 [00:37<00:26,  6.56s/it]

Processing On the Sensitivity of the Ni-rich Layered Cathode Materials for Li-ion Batteries to the Different Calcination Conditions.pdf...


100%|██████████| 20/20 [00:00<00:00, 547.24it/s]
Processing PDFs:  70%|███████   | 7/10 [00:43<00:18,  6.08s/it]

Processing Phase Transformation Behavior and Stability of LiNiO2 Cathode Material for Li-Ion Batteries from In Situ Pressure and Gas Analysis and Operando X-Ray Diffraction.pdf...


100%|██████████| 23/23 [00:00<00:00, 468.78it/s]
Processing PDFs:  80%|████████  | 8/10 [00:48<00:12,  6.03s/it]

Processing Process design for calcination of nickel-based cathode materials by in situ characterization and multiscale modeling.pdf...


100%|██████████| 19/19 [00:00<00:00, 288.48it/s]
Processing PDFs:  90%|█████████ | 9/10 [00:54<00:05,  5.81s/it]

Processing Tuning Molten-Salt-Mediated Calcination in Promoting Single-Crystal Synthesis of Ni-Rich LiNixMnyCozO2 Cathode Materials.pdf...


100%|██████████| 16/16 [00:00<00:00, 183.05it/s]
Processing PDFs: 100%|██████████| 10/10 [01:00<00:00,  6.01s/it]

Saved summaries to parsed_conditions.docx



