**Introduction**

A local LLM+RAG chatbot with structured PDF ingestion: Word converts each PDF to DOCX, and pandoc then converts the DOCX to Markdown, which enables the use of LangChain's ParentDocumentRetriever with MarkdownTextSplitter.
It runs fine on my 64GB RAM laptop under WSL Ubuntu, with 32GB of RAM available to WSL.

Most PDF-to-text parsers do not provide layout information. Often even sentences are split by arbitrary CR/LFs, making it very difficult to find paragraph boundaries. This makes it hard to chunk the text and to attach long-running contextual information, such as section headers, to the passages when indexing/vectorizing PDFs for LLM applications such as retrieval-augmented generation (RAG).
Using Word+Pandoc, then ParentDocumentRetriever calling MarkdownTextSplitter chained with RecursiveCharacterTextSplitter, solves this problem by parsing PDFs along with their layout information.  
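
The DOCX-to-Markdown step can be scripted. The following is a minimal sketch, assuming pandoc is installed and on the `PATH` and that Word has already saved each PDF as a `.docx` file. The function names `pandoc_command` and `convert_folder` are hypothetical helpers for illustration and are not part of the notebook.

```python
import subprocess
from pathlib import Path


def pandoc_command(docx_path: Path, md_path: Path) -> list[str]:
    """Build the pandoc command line that converts one DOCX file to Markdown."""
    return ["pandoc", "-f", "docx", "-t", "markdown", "-o", str(md_path), str(docx_path)]


def convert_folder(docx_dir: Path, md_dir: Path, dry_run: bool = False) -> list[list[str]]:
    """Convert every .docx file in docx_dir and return the commands that were built.

    With dry_run=True the commands are only built, not executed.
    """
    md_dir.mkdir(parents=True, exist_ok=True)
    commands = []
    for docx_path in sorted(docx_dir.glob("*.docx")):
        # Keep the base name and swap only the final extension.
        md_path = md_dir / (docx_path.stem + ".md")
        cmd = pandoc_command(docx_path, md_path)
        commands.append(cmd)
        if not dry_run:
            subprocess.run(cmd, check=True)  # raises CalledProcessError if pandoc fails
    return commands
```

Calling `convert_folder(Path("docx"), Path("md"))` would fill the `md` folder that the indexing code reads from; both folder names here are placeholders.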

Replace every path with your own path structure.
In addition to LangChain and Chroma, this code uses the following open-source software:
 * Ollama with Wizardlm2 as the LLM and all-minilm as the embedding model. [Click here for Ollama website](https://ollama.com/)
Wizardlm2 and all-minilm are downloaded locally.


**Split text using Markdown but remove documents that have only formatting characters or not enough words**  
Splitting may result in chunks that contain only table line separators or very short sentences.
These are removed to make searches more relevant.

In [5]:
from langchain_text_splitters import MarkdownTextSplitter
from statistics import mean
from typing import List
import re

# Helper to strip Markdown from child chunks: the retriever returns the parent document,
# so the vector store does not need to hold Markdown tags.
# Note: this helper is not called anywhere in the code shown here.
def remove_punctuation_and_markdown(input_string):
    # Remove markdown formatting characters
    input_string = re.sub(r'[_*#|+]', '', input_string)
    # Remove punctuation
    input_string = re.sub(r'[^\w\s]', '', input_string)
    return input_string

class CleanMarkdownTextSplitter(MarkdownTextSplitter):
    """Split text along Markdown-formatted headings, keeping only chunks with meaningful content."""
    def _split_text(self, text: str, separators: List[str]) -> List[str]:
        """Split incoming text and return chunks."""
        final_chunks = []
        chunks=super()._split_text(text,separators)
        for chunk in chunks:
            words=chunk.split()
            content_len=len(words)
            if content_len>0:
                meanlength=mean(len(word) for word in words)
            else:
                meanlength=0
            if content_len>10 and meanlength>3:
                final_chunks.append(chunk)
        return final_chunks
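
The filtering rule can be checked in isolation, without LangChain. This is a small sketch that mirrors the thresholds used above (more than 10 words, mean word length above 3). The names `is_meaningful` and `strip_markdown` are illustrative and do not appear in the notebook.

```python
import re
from statistics import mean


def strip_markdown(text: str) -> str:
    """Apply the same two substitutions as remove_punctuation_and_markdown."""
    text = re.sub(r'[_*#|+]', '', text)   # Markdown formatting characters
    return re.sub(r'[^\w\s]', '', text)   # remaining punctuation


def is_meaningful(chunk: str, min_words: int = 10, min_mean_len: float = 3) -> bool:
    """Keep a chunk only if it has enough words and the words are long enough on average."""
    words = chunk.split()
    if not words:
        return False
    return len(words) > min_words and mean(len(w) for w in words) > min_mean_len


table_rule = "|---|---|---|---|"
short_line = "See Table 2."
sentence = ("Perinatal hypoxic-ischemic encephalopathy (HIE) remains "
            "a major cause of morbidity and mortality.")

print(is_meaningful(table_rule))  # False: a single token
print(is_meaningful(short_line))  # False: only three words
print(is_meaningful(sentence))    # True: twelve words with a long mean length
```

The word-count threshold drops separator rows and stray captions, while the mean-length threshold is aimed at chunks that are mostly single characters or symbols.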

In [6]:
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import MarkdownTextSplitter
from langchain.retrievers import ParentDocumentRetriever
from langchain_community.document_loaders import TextLoader
from langchain.storage import LocalFileStore
from langchain.storage._lc_store import create_kv_docstore
import os
from langchain_community.embeddings import OllamaEmbeddings

# Embedding model served by the local Ollama instance
ollama_ef = OllamaEmbeddings(
    model="all-minilm"
)

# MD splits
parent_splitter = MarkdownTextSplitter(chunk_size=5000,chunk_overlap = 200)
child_splitter = CleanMarkdownTextSplitter(chunk_size=500,chunk_overlap = 60)

md_folder_path = "/mnt/d/data/md"
print("Create Vector store")
vectorstore = Chroma(persist_directory="/mnt/d/data/HIE", embedding_function=ollama_ef)
# Instantiate the LocalFileStore with the root path
print("Create document store")
fs = LocalFileStore("/mnt/d/data/documentstore")
store = create_kv_docstore(fs)
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

print("Start splitting documents")
loaders = []
for i,filename in enumerate(os.listdir(md_folder_path)):
    if filename.endswith('.md'):
        print("load document",filename)
        md_path = os.path.join(md_folder_path, filename)
        loader=TextLoader(md_path) 
        doc=loader.load()
        retriever.add_documents(doc)
print("done")


Create Vector store
Create document store
Start splitting documents
load document 1-s2.0-S0169260721006672-main.md
load document 1-s2.0-S0987705310000080-main.md
load document 1-s2.0-S098770532030109X-am.md
load document 1-s2.0-S0987705320301477-am.md
load document 1-s2.0-S1388245715006136-main.md
load document 1-s2.0-S1388245715006215-main.md
load document 1-s2.0-S2405844021015140-main.md
load document 10.2478_prilozi-2022-0013.md
load document 12519_2023_Article_698.md
load document 2106.00061.md
load document 217656905.md
load document 70581176.md
load document ACI-Hypoxic-ischaemic-encephalopathy-in-newborns-recognition-monitoring-and-early-management.md
load document Acta Paediatrica - July 1955 - ENHORNING - An Experimental Study of the Human Fetus with Special Reference to Asphyxia.md
load document aeeg.md
load document Analyse quantitative et automatisée des EEG néonataux post-anoxiques.md
load document app7-parent-info.md
load document battisti_pediaelectrophysiology.md
load d
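
The indexing cell relies on the parent/child idea behind ParentDocumentRetriever: small child chunks are what get searched, but the larger parent chunk they came from is what gets returned. The sketch below illustrates that idea in plain Python. It is not LangChain's implementation: `ToyParentRetriever` and `split_fixed` are hypothetical names, the splitter cuts on word counts instead of Markdown structure, and a word-overlap score stands in for embedding similarity.

```python
from collections import Counter


def split_fixed(text: str, size: int) -> list[str]:
    """Toy splitter: cut text into consecutive pieces of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


class ToyParentRetriever:
    """Index small child chunks, but return the larger parent chunk they came from."""

    def __init__(self, parent_size: int = 40, child_size: int = 8):
        self.parent_size = parent_size
        self.child_size = child_size
        self.parents: dict[int, str] = {}          # stands in for the docstore
        self.children: list[tuple[int, str]] = []  # stands in for the vector store

    def add_document(self, text: str) -> None:
        for parent in split_fixed(text, self.parent_size):
            parent_id = len(self.parents)
            self.parents[parent_id] = parent
            for child in split_fixed(parent, self.child_size):
                self.children.append((parent_id, child))

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        q = Counter(query.lower().split())
        # Word-overlap score: a crude stand-in for embedding similarity.
        scored = sorted(
            self.children,
            key=lambda item: sum((Counter(item[1].lower().split()) & q).values()),
            reverse=True,
        )
        seen, results = set(), []
        for parent_id, _ in scored[:k]:
            if parent_id not in seen:  # several children may share one parent
                seen.add(parent_id)
                results.append(self.parents[parent_id])
        return results


toy = ToyParentRetriever(parent_size=6, child_size=3)
toy.add_document("alpha beta gamma delta epsilon zeta eta theta iota kappa lambda mu")
print(toy.retrieve("kappa", k=1))  # the whole second parent, not just the matching child
```

Searching small chunks keeps the match precise, while returning the parent gives the LLM enough surrounding context, which is the reason for the 500/5000 character split sizes chosen above.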

**Chatbot to interact with the previously loaded documents**

In [None]:
from langchain_core.prompts import PromptTemplate
from langchain_community.llms import Ollama
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.chains import RetrievalQA
from langchain_community.vectorstores import Chroma
import os
from IPython.display import Markdown, display
from langchain_text_splitters import MarkdownTextSplitter
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import LocalFileStore
from langchain.storage._lc_store import create_kv_docstore
from langchain_community.embeddings import OllamaEmbeddings

ollama_ef = OllamaEmbeddings(
    model="all-minilm"
)

# MD splits
parent_splitter = MarkdownTextSplitter(chunk_size=5000,chunk_overlap = 200)
child_splitter = MarkdownTextSplitter(chunk_size=500,chunk_overlap = 60)
vectorstore = Chroma(persist_directory="/mnt/d/data/HIE", embedding_function=ollama_ef)
# Instantiate the LocalFileStore with the root path
fs = LocalFileStore("/mnt/d/data/documentstore")
store = create_kv_docstore(fs)
llm = Ollama(model="wizardlm2", callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),num_ctx=4096,verbose=True)

while True:
    query = input("\n\nQuery: ")
    if query == "exit":
        break
    if query.strip() == "":
        continue

    # Prompt
    '''template = """Use the following pieces of context to answer the question at the end. Please follow the following rules:
    1. If you don't know the answer, don't try to make up an answer..
    2. If you find the answer, write the answer in a concise way
    3. Do not give references
    4. Use relevant table if available
    
    {context}
    Question: {question}
    Helpful Answer:"""'''
    # Prompt
    template = """Use the following pieces of context to answer the question at the end. Please follow the following rules:
    1. If you don't know the answer, don't try to make up an answer.
    2. If you find the answer, write the answer in a detailed way without references.
    
    {context}
    Question: {question}
    Helpful Answer:"""
    '''template = """Answer the question with details based only on the following context: {context}
     
     Question: {question}"""'''
    QA_CHAIN_PROMPT = PromptTemplate(
        input_variables=["context", "question"],
        template=template,
    )
    retriever = ParentDocumentRetriever(
        vectorstore=vectorstore,
        docstore=store,
        child_splitter=child_splitter,
        parent_splitter=parent_splitter,
        search_type="similarity", search_kwargs={"k": 6})

    qa_chain = RetrievalQA.from_chain_type(
        llm,
        retriever=retriever,
        chain_type_kwargs={"prompt": QA_CHAIN_PROMPT},return_source_documents=True
    )
    print("====================================")
    result = qa_chain.invoke({"query": query})
    print("\n\n Data used")
    for i,doc in enumerate(result.get("source_documents", [])):
        if "source" in doc.metadata:
            print(i+1,os.path.basename(doc.metadata["source"]))
        else:
            print("no source")

        display(Markdown(doc.page_content))



Query: What is infant HIE
 Infant Hypoxic-Ischemic Encephalopathy (HIE) is a condition that occurs when there is a lack of sufficient oxygen supply and/or blood flow to the brain of a newborn infant, particularly around the time of birth. This lack of oxygen and blood flow can lead to brain injury and potentially cause long-term neurological complications, including cerebral palsy, seizures, intellectual disabilities, and in severe cases, death. HIE is often associated with difficult or complicated labor and delivery, prematurity, and various medical conditions that can compromise the circulation of blood to the baby.

The severity of HIE is typically graded on a scale developed by Dr. Sarnat, ranging from mild to severe. The earlier diagnosis and intervention are crucial in improving outcomes for these infants. Treatment options include supportive care measures to maintain vital functions and ensure adequate brain perfusion, as well as neuroprotective and neurorestorative interventi

Low Apgar scores are associated with infant mortality, cerebral palsy (CP), Attention Deficit Hyperactivity Disorder (ADHD) and seizures. (56, 57) Birth asphyxia is the most common cause of a low Apgar score and although it was never intended to be used to assess for perinatal asphyxia, it formed an important part of the inclusion criteria for all of the TH randomised control trials (RCT). Its use as a screening tool was assessed by Hogan et al. who matched 183 infants with Apgar scores &lt;7 at 5 minutes of life with 183 control infants with Apgar scores of 9-10. In this group, 70% of those with an Apgar score of &lt;4 at 5 minutes of life developed HIE compared with 14% of those with an Apgar of 4-6 and 0% of infants with an Apgar score of 9-10. (58) However, low Apgar scores in isolation are not indicative of an acute hypoxic event or subsequent poor outcome. Ruth et al. found that 80% of infants born with an Apgar score of ≤7 at 5 minutes will have a normal outcome (59) and Natarajan et al. reported that 20% of infants with an Apgar score of 0 at 10 minutes will have intact survival at school age. (60)

A limitation of the original Apgar score is that it does not account for the level of intervention or resuscitation that an infant requires. To overcome this, other groups have attempted to build on Dr Apgar’s original scoring system to improve prediction of adverse neonatal outcome. (61, 62, 63) The Specified-Apgar score records the infant’s condition regardless of resuscitative measures, for example, a pink colour with or without oxygen administration. (61) The Expanded-Apgar additionally documents the interventions that the infant receives, for example, continuous positive airway pressure, and the Combined-Apgar includes both scores. (62, 63) Dalili et al. compared all four scores and found that the Combined-Apgar score had the highest sensitivity (97%) and specificity (99%) in predicting infants with perinatal asphyxia. (64) In addition, a Combined-Apgar score of &lt; 10 was associated with HIE (p=0.02) but it was unable to predict the severity or grade of HIE. (64) Although, the conventional Apgar score had the lowest sensitivity (81%) and specificity (81%), it remains the most widely used score in newborn evaluation. Apgar scores are not perfect in isolation but they certainly do succeed in achieving Dr Apgar’s goal of focusing attention on the newborn after delivery and considering the potential of HIE as a diagnosis when an infant is born in poor condition.

#### 1.2.2. Acid-base status 

There is currently no gold standard test for the diagnosis or detection of encephalopathy but it is generally accepted that a low pH (&lt;7.10) or an increased base deficit (&gt;12mmol) are associated with adverse outcome. (65, 66, 67) Although they are not sensitive or specific for outcome prediction, (68, 69) they are key features of perinatal asphyxia. Perinatal asphyxia may occur as a result of a single significant episode of hypoxia, for example following placental abruption, multiple intermittent short episodes during labour or following chronic hypoxia. (70) Perinatal asphyxia results in hypercarbia and thus a respiratory acidosis develops. If the asphyxia is prolonged, there is a switch to anaerobic metabolism and metabolic acidosis ensues. pH and base deficit (BD) are predominantly used as screening tools for encephalopathy and are included in the scoring systems for commencing TH. (71)

pH measurements are widely accessible and can be measured from the foetal scalp during labour giving an early estimation of the infant’s well-being. Low cord pH is significantly associated with neonatal morbidity and mortality (66) and infants with a pH of &lt;7.00 have a significantly higher risk of multi-organ morbidity and poor outcome. (72, 73) A recent study by Kelly et al. demonstrated a dose-dependent relationship between degree of acidosis and adverse outcome. At pH levels of 6.96.99, 6.8-6.89 and &lt;6.8, combined outcome of death or CP was 3%, 10% and 40% respectively. (74)

However pH has been shown to be both a poor predictor of HIE and a poor discriminator of severity of encephalopathy. (75, 76, 77) One study showed that of 103 infants with a pH &lt;7.16, only 19.4% had low Apgar scores (&lt;7 at 5 minutes) and only 1 had evidence of perinatal asphyxia. (75) Another showed that approximately one third of infants with brain injury due to asphyxia had no evidence of acidosis on arterial cord pH. (77)

2 ijms-21-01487.md


> [International Journal o f](http://www.mdpi.com/journal/ijms)
>
> [***Molecular Sciences***](http://www.mdpi.com/journal/ijms)
>
> *Review*

**Treatment of Neonatal Hypoxic-Ischemic Encephalopathy with Erythropoietin Alone, and Erythropoietin Combined with Hypothermia: History, Current Status, and Future Research**

# Dorothy E. Oorschot \*[,](https://orcid.org/0000-0003-0212-4456) Rachel J. Sizemore and Ashraf R. Amer

> Department of Anatomy, School of Biomedical Sciences, and the Brain Health Research Centre, University of Otago, Dunedin 9054, New Zealand; rachel.sizemore@otago.ac.nz (R.J.S.); ashraf.amer@otago.ac.nz (A.R.A.)
>
> **\*** Correspondence: dorothy.oorschot@otago.ac.nz; Tel.: +64-3-479-7379; Fax: +64-3-479-7254
>
> Received: 9 December 2019; Accepted: 16 February 2020; Published: 21 February 2020
>
> **Abstract:** Perinatal hypoxic-ischemic encephalopathy (HIE) remains a major cause of morbidity and mortality. Moderate hypothermia (33.5 <sup>◦</sup>C) is currently the sole established standard treatment. However, there are a large number of infants for whom this therapy is ineffective. This inspired global research to find neuroprotectants to potentiate the effect of moderate hypothermia. Here we examine erythropoietin (EPO) as a prominent candidate. Neonatal animal studies show that immediate, as well as delayed, treatment with EPO post-injury, can be neuroprotective and/or neurorestorative. The observed improvements of EPO therapy were generally not to the level of control uninjured animals, however. This suggested that combining EPO treatment with an adjunct therapeutic strategy should be researched. Treatment with EPO plus hypothermia led to less cerebral palsy in a non-human primate model of perinatal asphyxia, leading to clinical trials. A recent Phase II clinical trial on neonatal infants with HIE reported better 12-month motor outcomes for treatment with EPO plus hypothermia compared to hypothermia alone. Hence, the effectiveness of combined treatment with moderate hypothermia and EPO for neonatal HIE currently looks promising. The outcomes of two current clinical trials on neurological outcomes at 18–24 months-of-age, and at older ages, are now required. Further research on the optimal dose, onset, and duration of treatment with EPO, and critical consideration of the effect of injury severity and of gender, are also required.
>
> **Keywords:** erythropoietin; moderate hypothermia; perinatal hypoxic-ischemic encephalopathy; neonatal hypoxia-ischemia; anemia of prematurity

# Introduction

How to prevent brain damage due to hypoxic-ischemic encephalopathy (HIE) remains a question that needs an answer due to its serious sequels of neonatal death or severe intellectual, cognitive, and motor disabilities \[1\]. Perinatal HIE causes 23% of neonatal deaths \[2,3\] and affects 1.5–2 per 1000 births in developed countries, but the number affected increases to 26 per 1000 in resource-limited settings \[4\]. HIE involves a combination of decreased delivery of oxygen in the blood supply (i.e., hypoxia) and decreased blood flow (i.e., ischemia) to the brain.

Neonatal HIE can be subdivided into mild, moderate, and severe using modified Sarnat staging \[5\]. The Sarnat scale was introduced in 1976 \[6\]. In 1997, Thompson et al. developed a scoring system that was based on Sarnat and Sarnat (1976) but was simpler \[7\]. Nine symptoms were scored, including mental state, cranial nerve function (e.g., the ability to suck), motor ability and seizure activity. The outcome of treatment can depend on whether an infant experienced mild, moderate, or severe HIE. The goals of management of neonates affected by HIE are:

*Int. J. Mol. Sci.* **2020**, *21*, 1487; doi[:10.3390/ijms21041487](http://dx.doi.org/10.3390/ijms21041487) [www.mdpi.com/journal/ijms](http://www.mdpi.com/journal/ijms)

1.  Early identification, within 2–6 h of birth, of those at high risk. A high risk of HIE is likely in infants with fetal bradycardia (&lt;100 beats/minute), an Apgar score of five or less at 5 minutes \[8\], a cord blood pH of 7 or less, and/or a base deficit of 16 or more \[9\]. The Apgar score enables a quick and accurate assessment of the respiratory, cardio-circulatory, and neurological condition of the newborn. The Apgar score at 10 minutes correlates with poor outcomes following HIE \[10\].

2.  Adequate perfusion of the brain through supportive care. The supportive care can involve the provision of oxygen, volume expanders, ionotropes, diurectics, and antibiotics (see also Section 5).

3.  Amelioration of the process of ongoing brain injury through neuroprotective and neurorestorative interventions \[9,11\]. Neuroprotective interventions are delivered within 6 h of HIE, while neurorestorative interventions have a delayed onset.



Query: What is the history of therapeutic hypothermia starting in ancient times
 The history of therapeutic hypothermia, particularly as it relates to asphyxial brain injury (HIE), dates back to antiquity. The Ancient Egyptians, Greeks, and Romans were among the first civilizations to recognize the potential benefits of induced cooling for treating trauma and cerebral disturbances. This practice was based on empirical observations that cooler environments could mitigate the effects of injuries and illnesses.

Hippocrates, a renowned Greek physician, noted that infants exposed to the open air survived longer in colder weather, which suggested an early understanding of the protective effects of hypothermia against asphyxial events. This observation set the stage for future research into the benefits of cooling in the context of brain injury.

Centuries later, physiologists like Claude Bernard and William Edwards conducted experiments that further elucidated the effects of hypothermia o