Another use case: we have a set of DiagnosticReports and need to find out whether they contain
some particular information we are interested in.

The process:
1. Retrieve a set of DiagnosticReports, for example the ones last updated in 2021 or later.
2. Search the report texts for the RegEx "Metabolic".
3. Obtain a list of the reports that match.


In [1]:
from fhir_pyrate import Pirate, Miner
from typing import Dict, List
from fhir_pyrate.util.fhirobj import FHIRObj

search = Pirate(
    auth=None,
    base_url="http://hapi.fhir.org/baseDstu2",
    print_request_url=True,  # Set to True to print each request URL
    num_processes=1,
)
diagnostic_df = search.steal_bundles_to_dataframe(
    resource_type="DiagnosticReport",
    request_params={
        "_count": 100,
        "_lastUpdated": "ge2021",
    },
    fhir_paths=["text.status", "text.div"],
)
diagnostic_df



http://hapi.fhir.org/baseDstu2/DiagnosticReport?_count=100&_lastUpdated=ge2021


Query & Build DF:   0%|          | 0/1 [00:00<?, ?it/s]

Exception: Infix invoke should have arity 2

As you can see, the query fails with an exception. The reason is that `div` is a reserved
operator (integer division) in FHIRPath, so `text.div` is not parsed as a field access.
We need to use a processing function instead. This problem may be solved in the future,
but both [fhirpath-py](https://github.com/beda-software/fhirpath-py) and [FHIRPath.js](https://github.com/hl7/fhirpath.js/)
currently behave the same way.

In [5]:
def get_diagnostic_text(bundle: FHIRObj) -> List[Dict]:
    # Build one record per DiagnosticReport contained in the bundle
    records = []
    for entry in bundle.entry or []:
        resource = entry.resource
        records.append(
            {
                "diagnostic_report_id": resource.id,
                "report_status": resource.text.status,
                "report_text": resource.text.div,
            }
        )
    return records


diagnostic_df = search.steal_bundles_to_dataframe(
    resource_type="DiagnosticReport",
    request_params={
        "_count": 100,
        "_lastUpdated": "ge2021",
    },
    process_function=get_diagnostic_text,  # Use processing function
)
diagnostic_df

Query & Build DF: 100%|██████████| 1/1 [01:22<00:00, 82.03s/it]

http://hapi.fhir.org/baseDstu2/DiagnosticReport?_count=100&_lastUpdated=ge2021



Query & Build DF: 100%|██████████| 1/1 [00:00<00:00, 3269.14it/s]


Unnamed: 0,diagnostic_report_id,report_status,report_text
0,267070,generated,"<div xmlns=""http://www.w3.org/1999/xhtml""><div..."
1,267041,generated,"<div xmlns=""http://www.w3.org/1999/xhtml""><div..."
2,267148,generated,"<div xmlns=""http://www.w3.org/1999/xhtml""><div..."
3,267199,generated,"<div xmlns=""http://www.w3.org/1999/xhtml""><div..."
4,266996,generated,"<div xmlns=""http://www.w3.org/1999/xhtml""><div..."
5,267149,generated,"<div xmlns=""http://www.w3.org/1999/xhtml""><div..."
6,267069,generated,"<div xmlns=""http://www.w3.org/1999/xhtml""><div..."
7,266997,generated,"<div xmlns=""http://www.w3.org/1999/xhtml""><div..."
8,267179,generated,"<div xmlns=""http://www.w3.org/1999/xhtml""><div..."
9,266955,generated,"<div xmlns=""http://www.w3.org/1999/xhtml""><div..."


We now have all the DiagnosticReports and their contents. Next, we want to go through the
contents and look for the information we need.
We initialize the Miner with a RegEx describing what we are looking for, pass a decoding
function as a parameter, and select the number of processes.
The `decode_text` function holds the logic used to process each individual text.
For example, the texts may be encoded, or they may contain HTML code that has to be parsed.
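
To illustrate the encoded case mentioned above, here is a minimal sketch of a decoding function for texts that arrive as base64-encoded UTF-8. The function name `decode_base64_text` and the sample string are assumptions made for this illustration; they are not part of fhir_pyrate, and the reports used in this notebook are not encoded this way.

```python
import base64
import re


def decode_base64_text(text: str) -> str:
    # Hypothetical decoder: assumes the input is base64-encoded UTF-8
    return base64.b64decode(text).decode("utf-8")


# Build a sample encoded text so the example is self-contained
encoded = base64.b64encode("Basic Metabolic Panel".encode("utf-8")).decode("ascii")
decoded = decode_base64_text(encoded)

# After decoding, the text can be searched with the RegEx of interest
match = re.search("Metabolic", decoded)
print(decoded, match is not None)
```

A function with this shape (one string in, one string out) could be passed as `decode_text` in place of an HTML-parsing function.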

In [4]:
from bs4 import BeautifulSoup

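# Processing function applied to each individual text
def decode_text(text: str) -> str:
    # Parse the narrative XHTML and keep only the HAPI header text
    soup = BeautifulSoup(text, "html.parser")
    div = soup.find("div", {"class": "hapiHeaderText"})
    # Return an empty string if the header div is missing
    return div.text if div is not None else ""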


miner = Miner(target_regex="Metabolic", decode_text=decode_text, num_processes=1)
df_filtered = miner.nlp_on_dataframe(
    diagnostic_df,
    text_column_name="report_text",
    new_column_name="text_found",
)
df_filtered


Searching for Sentences with Metabolic: 100%|██████████| 48/48 [00:00<00:00, 1001624.84it/s]


Unnamed: 0,diagnostic_report_id,report_status,report_text,text_found_sentences,text_found
0,267070,generated,"<div xmlns=""http://www.w3.org/1999/xhtml""><div...",,False
1,267041,generated,"<div xmlns=""http://www.w3.org/1999/xhtml""><div...","[( , Basic, Metabolic, Panel)]",True
2,267148,generated,"<div xmlns=""http://www.w3.org/1999/xhtml""><div...","[( , Basic, Metabolic, Panel)]",True
3,267199,generated,"<div xmlns=""http://www.w3.org/1999/xhtml""><div...","[( , Basic, Metabolic, Panel)]",True
4,266996,generated,"<div xmlns=""http://www.w3.org/1999/xhtml""><div...","[( , Basic, Metabolic, Panel)]",True
5,267149,generated,"<div xmlns=""http://www.w3.org/1999/xhtml""><div...",,False
6,267069,generated,"<div xmlns=""http://www.w3.org/1999/xhtml""><div...","[( , Basic, Metabolic, Panel)]",True
7,266997,generated,"<div xmlns=""http://www.w3.org/1999/xhtml""><div...",,False
8,267179,generated,"<div xmlns=""http://www.w3.org/1999/xhtml""><div...","[( , Basic, Metabolic, Panel)]",True
9,266955,generated,"<div xmlns=""http://www.w3.org/1999/xhtml""><div...",,False


The result is a DataFrame with two new columns: `text_found` tells us whether the RegEx was
present in the text, and `text_found_sentences` contains the matching sentences.
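
The last step of the process, obtaining the list of interesting documents, comes down to filtering on the boolean column. The sketch below uses a small stand-in DataFrame (`df_example`, a name chosen for this illustration) with three rows copied from the output above, so that it runs without a FHIR server; with the real result you would filter `df_filtered` in the same way.

```python
import pandas as pd

# Stand-in for df_filtered, using three rows from the output shown above
df_example = pd.DataFrame(
    {
        "diagnostic_report_id": ["267070", "267041", "267148"],
        "text_found": [False, True, True],
    }
)

# Keep only the reports where the RegEx matched
interesting = df_example[df_example["text_found"]]
interesting_ids = interesting["diagnostic_report_id"].tolist()
print(interesting_ids)
```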