Merge dev into main (#118)
* Update to 3.8 because of numpy security update (#64)

* Update to 3.8 because of numpy security update

* Remove openpyxl because 3.7 is not supported anymore

* 33 convert basic bundle collection to generator (#60)

* Convert functions to return generators

* Update readme

* Add more tests

* Remove pyrohealth from tests because it is currently not working

* Add another type ignore for function

* Remove all fhirpathpy import tries

* Update example 4

* Change functions used in public tests

* Fix problems with trade_rows_for_dataframe_with_ref and mypy issues

* Update example 2

* Update example 1

* Fix tqdm text name

* Update example 2&4

* Update example 3

* Re-add none case to reset empty lists in fhirpaths

* Add history support (#74)

* 8 add handling for dataframe functions using  (#72)

* Remove initial reference name in df_constraints if the search parameter is _id

* Add error if there are None values in a DF constraint column

* Bump up version because of incompatibility

* Add pipe for system only if the first element of the df_constraints is a URL

* Update README.md

* Fix Error for Paging when the URL changes from HTTP to HTTPS (#78)

* Fix the bug by introducing a new variable that contains the domain

* Add forgotten regex group and add test

* Update README.md

* 75 allow specifying multiple query arguments in df constraints (#76)

* Add option to have multiple values for df_constraints keys

* Working now; I did not remember the expected structure, which also explains why mypy was not happy with it

* Fix error for history, since it always expected a string instead of a list of strings

* Add conversion to string when adding identifiers

* Add more tests

* Make tests less time-consuming

* Update README.md

* Update README.md

* Create CITATION.cff

* Update CITATION.cff

* Update CITATION.cff

* Update README.md

* Fix CITATION.cff

* Add version attribute to init (#80)

* Check for file existence and return a warning and None in fix_mapping_dataframe (#82)

* Update pyproject.toml

* Update CITATION.cff

* Fix readme error due to changes in code

* Update CITATION.cff

* Update CITATION.cff

* 83 allow turning off checks for dicom download (#85)

* Add turn-off-checks option

* Make the continue statement depend on the turn_off_checks variable

* Add resource name to the TQDM description (#88)

* 84 merge trade rows for dataframe and trade rows for dataframe with ref (#89)

* Merge trade_rows_for_dataframe and trade_rows_for_dataframe_with_ref

* Update version

* Modify tests

* Update notebooks

* 86 allow addition of any column to the trade rows for dataframe result (#90)

* Update version

* Add the option of adding any input column to the output DF of trade_rows_for_dataframe

* Add the option of renaming the columns

* Add space in warning

* Remove blank lines

* Make pre-commit up to date with pyproject

* Remove version 3.7 from automatic tests and add 3.10

* 91 add merge on option to steal bundles to dataframe (#92)

* Add merge_on to all dataframe functions

* Modify mypy command in github workflows

* Add directory for mypy check

* Fix file after wrong conflict resolving

* Fix readme

* Fix readme hyperlink

* Fix consistency in README

* Update CITATION.cff

* Update pyproject.toml

* Update authors in readme

* Update CITATION.cff

* Update poetry in github actions

* 95 fix docstring for with columns (#98)

* Fix link to part of readme

* Add docstring about with_columns

* Improve spacing in docstring

* Test for mypy

* Check if the error depends on 3.10

* Update packages and try again with 3.10

* Set 3.10.6 as py version

* Add type ignores

* Add type ignores

* Update version

* 100 force read action on  (#101)

* Force read request for IDs

* Update the bundle processing to go through the resources

* 96 add format options for dicomdownloader (#99)

* Add the possibility to always save in the study folder

* Add new formats to store the downloaded data

* Update README.md

* Update README.md

* Convert logging.warn to warnings when appropriate (#105)

* Standardize bundles for read and search (#106)

* Update CITATION.cff

* Update CITATION.cff

* 104 smarter caching (#107)

* First draft of http caching

* Improve docstrings

* Add a couple of todos

* Add retry option and custom create_key parameter for caching

* Sort request params to ensure same order for caching

* Update CITATION.cff

* 111 make current beta compatible with 0.1.0 (#112)

* Bump up version

* Add query_to_dataframe function to ensure compatibility with v0.1.0

* Fix text in examples

* Fix parameter inconsistency in query_to_dataframe

* Update tests with query_to_dataframe

* Adjust tests

* 94 efficiency problem with merge on (#113)

* Remove the merge on parameter and return one dataframe per resource

* Filter out none values directly from the returned records

* Modify bundle_to_dataframe to take the union of all processed bundles per resource

* Adjust tests

* Different outputs for query_to_dataframe

* Remove the always-return-dict overwrite, which may break everything

* Update pyproject.toml

* Make fhirpathpy input greedy (#116)

Co-authored-by: Giulia Baldini <Giulia.Baldini@uk-essen.de>

* Update pyproject.toml

---------

Co-authored-by: Giulia Baldini <Giulia.Baldini@uk-essen.de>
giuliabaldini and Giulia Baldini committed Mar 17, 2023
1 parent 1425313 commit a677efc
Showing 15 changed files with 3,181 additions and 1,780 deletions.
9 changes: 4 additions & 5 deletions .github/workflows/mypy-flake-test.yml
@@ -25,7 +25,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.7", "3.8", "3.9"]
python-version: ["3.8", "3.9", "3.10.6"]

# Steps represent a sequence of tasks that will be executed as part of the job
steps:
@@ -41,14 +41,13 @@
# Install poetry
- name: Install poetry
run: |
- curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/install-poetry.py | python -
- export PATH="/root/.local/bin:$PATH"
+ curl -sSL https://install.python-poetry.org | python3 -
poetry run pip install -U pip
poetry install -E all
- name: Run MyPy
run: |
- poetry run mypy --install-types --non-interactive fhir_pyrate/
- poetry run mypy --install-types --non-interactive tests/
+ poetry run mypy --install-types --non-interactive fhir_pyrate
+ poetry run mypy --install-types --non-interactive tests
- name: Run Flake8
run: |
poetry run flake8 fhir_pyrate/
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -5,7 +5,7 @@ repos:
- id: end-of-file-fixer
- id: trailing-whitespace
- repo: https://github.com/psf/black
- rev: 21.7b0
+ rev: 22.3.0
hooks:
- id: black
- repo: https://github.com/pycqa/isort
114 changes: 51 additions & 63 deletions README.md
@@ -1,5 +1,8 @@
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
-[![Supported Python version](https://img.shields.io/badge/python-3.7+-blue.svg)](https://www.python.org/downloads/release/python-370/)
+[![Supported Python version](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/release/python-380/)
+[![Stable Version](https://img.shields.io/pypi/v/fhir-pyrate?label=stable)](https://pypi.org/project/fhir-pyrate/)
+[![Pre-release Version](https://img.shields.io/github/v/release/UMEssen/fhir-pyrate?label=pre-release&include_prereleases&sort=semver)](https://pypi.org/project/fhir-pyrate/#history)
+[![DOI](https://zenodo.org/badge/456893108.svg)](https://zenodo.org/badge/latestdoi/456893108)

<!-- PROJECT LOGO -->
<br />
@@ -10,7 +13,7 @@
</div>

This package is meant to provide a simple abstraction to query and structure FHIR resources as
-pandas DataFrames.
+pandas DataFrames. Want to use R instead? Try out [fhircrackr](https://github.com/POLAR-fhiR/fhircrackr)!

There are four main classes:
* [Ahoy](https://github.com/UMEssen/FHIR-PYrate/blob/main/fhir_pyrate/ahoy.py): Authenticate on the FHIR API
@@ -53,8 +56,7 @@ Table of Contents:
* [sail_through_search_space](https://github.com/UMEssen/FHIR-PYrate/#sail_through_search_space)
* [trade_rows_for_bundles](https://github.com/UMEssen/FHIR-PYrate/#trade_rows_for_bundles)
* [bundles_to_dataframe](https://github.com/UMEssen/FHIR-PYrate/#bundles_to_dataframe)
-  * [query_to_dataframe](https://github.com/UMEssen/FHIR-PYrate/#query_to_dataframe)
-  * [trade_rows_for_dataframe](https://github.com/UMEssen/FHIR-PYrate/#trade_rows_for_dataframe)
+  * [***_dataframe](https://github.com/UMEssen/FHIR-PYrate/#_dataframe)
* [Miner](https://github.com/UMEssen/FHIR-PYrate/#miner)
* [DicomDownloader](https://github.com/UMEssen/FHIR-PYrate/#dicomdownloader)
* [Contributing](https://github.com/UMEssen/FHIR-PYrate/#contributing)
@@ -75,7 +77,7 @@ or using GitHub (always the newest version).
pip install git+https://github.com/UMEssen/FHIR-PYrate.git
```

-These two commands only install the packages needed for `Pirate`. If you also want to use the `Miner` or the `DicomDownloader`, then you need to install them as extra dependencies with
+These two commands only install the packages needed for **Pirate**. If you also want to use the **Miner** or the **DicomDownloader**, then you need to install them as extra dependencies with
```bash
pip install "fhir-pyrate[miner]" # only for miner
pip install "fhir-pyrate[downloader]" # only for downloader
@@ -105,7 +107,7 @@ and then run
poetry lock
```

-Also in poetry, the above only installs the packages for `Pirate`. If you also want to use the `Miner` or the `DicomDownloader`, then you need to install them as extra dependencies with
+Also in poetry, the above only installs the packages for **Pirate**. If you also want to use the **Miner** or the **DicomDownloader**, then you need to install them as extra dependencies with
```bash
poetry add "fhir-pyrate[miner]" # only for miner
poetry add "fhir-pyrate[downloader]" # only for downloader
@@ -146,8 +148,8 @@ from fhir_pyrate import Ahoy
auth = Ahoy(
username="your_username",
auth_method="password",
-    auth_url=auth-url, # The URL for authentication
-    refresh_url=refresh-url, # The URL to refresh the authentication
+    auth_url="auth-url", # Your URL for authentication
+    refresh_url="refresh-url", # Your URL to refresh the authentication token (if available)
)
```

@@ -173,35 +175,32 @@ auth = ...
# Init Pirate
search = Pirate(
auth=auth,
-    base_url=fhir-url, # e.g. "http://hapi.fhir.org/baseDstu2"
+    base_url="fhir-url", # e.g. "http://hapi.fhir.org/baseDstu2"
print_request_url=False, # If set to true, you will see all requests
)
```

The Pirate functions do one of three things:
-1. They run the query and collect the resources and store them in a list of bundles.
+1. They run the query and collect the resources and store them in a generator of bundles.
* `steal_bundles`: single process, no timespan to specify
-* `steal_bundles_for_timespan`: single process, timespan can be specified
-* `sail_through_search_space`: multiprocess, divide&conquer with many smaller timespans, uses `steal_bundles_for_timespan`
-* `trade_rows_for_bundles`: multiprocess, takes DataFrame as input and runs one query per row,
-  uses `steal_bundles`
-2. They take a list of bundles and build a DataFrame.
-* `bundles_to_dataframe`: multiprocess
+* `sail_through_search_space`: multiprocess, divide&conquer with many smaller timespans
+* `trade_rows_for_bundles`: multiprocess, takes DataFrame as input and runs one query per row
+2. They take a generator of bundles and build a DataFrame.
+* `bundles_to_dataframe`: multiprocess, builds the DataFrame from the bundles.
3. They are wrappers that combine the functionalities of 1&2, or that set some particular parameters.
-* `query_to_dataframe`: multiprocess, executes any function selected with `bundles_function`
-  (any of the functions in 1.) and then runs `bundles_to_dataframe` on the result.
-* `trade_rows_for_dataframe`: multiprocess, executes `steal_bundles`&`bundles_to_dataframe`
-  for each row of the DataFrame.
-
-| Name | Type | Multiprocessing | DF Input? | Output |
-|:---------------------------|:----:|:---------------:|:---------:|:--------------------------:|
-| steal_bundles | 1 | No | No | List of Bundles of FHIRObj |
-| steal_bundles_for_timespan | 1 | No | No | List of Bundles of FHIRObj |
-| sail_through_search_space | 1 | Yes | No | List of Bundles of FHIRObj |
-| trade_rows_for_bundles | 1 | Yes | Yes | List of Bundles of FHIRObj |
-| bundles_to_dataframe | 2 | Yes | No | DataFrame |
-| query_to_dataframe | 3 | Yes | Yes | DataFrame |
-| trade_rows_for_dataframe | 3 | Yes | Yes | DataFrame |
+* `steal_bundles_to_dataframe`: single process, executes `steal_bundles` and then runs `bundles_to_dataframe` on the result.
+* `sail_through_search_space_to_dataframe`: multiprocess, executes `sail_through_search_space` and then runs `bundles_to_dataframe` on the result.
+* `trade_rows_for_dataframe`: multiprocess, executes `trade_rows_for_bundles` and then runs `bundles_to_dataframe` on the result; it is also possible to add columns from the original DataFrame to the result.
+
+| Name | Type | Multiprocessing | DF Input? | Output |
+|:----------------------------------------|:----:|:---------------:|:---------:|:--------------------:|
+| steal_bundles | 1 | No | No | Generator of FHIRObj |
+| sail_through_search_space | 1 | Yes | No | Generator of FHIRObj |
+| trade_rows_for_bundles | 1 | Yes | Yes | Generator of FHIRObj |
+| bundles_to_dataframe | 2 | Yes | / | DataFrame |
+| steal_bundles_to_dataframe | 3 | No | No | DataFrame |
+| sail_through_search_space_to_dataframe | 3 | Yes | No | DataFrame |
+| trade_rows_for_dataframe | 3 | Yes | Yes | DataFrame |
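To make the split concrete, here is a minimal sketch of a Type 1 call next to its Type 3 facade; the public HAPI server URL and the anonymous `auth=None` are assumptions for illustration, not part of this diff:

```python
from fhir_pyrate import Pirate

# Public test server without authentication -- an assumption for this sketch.
search = Pirate(auth=None, base_url="http://hapi.fhir.org/baseR4")

# Type 1: returns a generator of FHIRObj bundles, nothing is flattened yet.
bundles = search.steal_bundles(
    resource_type="Patient",
    request_params={"_count": 10},
    num_pages=1,
)

# Type 3: the same query, with bundles_to_dataframe applied to the result.
df = search.steal_bundles_to_dataframe(
    resource_type="Patient",
    request_params={"_count": 10},
    num_pages=1,
)
```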


**BETA FEATURE**: It is also possible to cache the bundles using the `bundle_caching` parameter,
Expand All @@ -215,8 +214,7 @@ A toy request for ImagingStudy:
search = ...

# Make the FHIR call
-bundles = search.query_to_dataframe(
-    bundles_function=search.sail_through_search_space,
+bundles = search.sail_through_search_space_to_dataframe(
resource_type="ImagingStudy",
date_init="2021-04-01",
time_attribute_name="started",
@@ -230,26 +228,25 @@
The argument `request_params` is a dictionary that takes a string as key (the FHIR identifier) and anything as value.
If the value is a list or tuple, then all values will be used to build the request to the FHIR API.
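For instance, a hedged sketch of such a dictionary (generic FHIR search parameters, chosen only for illustration):

```python
request_params = {
    "status": "final",              # single value
    "code": ["55233-1", "8480-6"],  # list/tuple: every value is added to the request
    "_sort": "-date",
}
```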

-`query_to_dataframe` is a wrapper function. It collects the bundles that result from the
-`bundles_function` that was called and calls `bundles_to_dataframe`. In this case, we used
-sail_through_search_space.
+`sail_through_search_space_to_dataframe` is a wrapper function that directly converts the result of
+`sail_through_search_space` into a DataFrame.

#### [`sail_through_search_space`](https://github.com/UMEssen/FHIR-PYrate/blob/main/fhir_pyrate/pirate.py)

The `sail_through_search_space` function uses the multiprocessing module to speed up some queries.
The multiprocessing is done as follows:
The time frame is divided into multiple time spans (as many as there are processes) and each smaller
time frame is investigated simultaneously. This is why it is necessary to give a `date_init`
-and `date_end` param to the
-`sail_through_search_space` function. The default values are `date_init=2010-01-01` and today (the day
-when the query is performed)
-for `date_end`.
+and `date_end` param to the `sail_through_search_space` function.

+**Note** that if the `date_init` or `date_end` parameters are given as strings, they will be converted
+to `datetime.datetime` objects, so any non specified parameters (month, day or time) will be assumed
+according to the `datetime` workflow, and then converted to string according to the `time_format`
+specified in the **Pirate** constructor.
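A small standard-library illustration of that defaulting behaviour (Pirate's actual parsing code is not shown in this diff and may differ):

```python
import datetime

# Only year and month given: the day defaults to 1 and the time to 00:00.
dt = datetime.datetime.strptime("2021-04", "%Y-%m")
print(dt)                       # 2021-04-01 00:00:00
print(dt.strftime("%Y-%m-%d"))  # back to a string with a time_format of "%Y-%m-%d"
```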

A problematic aspect of the resources is that the date in which the resource was acquired is defined
using different attributes. Also, some resources use a fixed date, other use a time period.
You can specify the date attribute that you want to use with `time_attribute_name`.
In the following table you can see which resource attributes we use of each of the resources.
The default attribute is `_lastUpdated`.

The resources where the date is based on a period (such as `Encounter` or `Procedure`) may cause
duplicates in the multiprocessing because one entry may belong to multiple time spans that are
@@ -286,12 +283,12 @@ contains a bunch of different LOINC codes. Our `df_constraints` could look as follows:
df_constraints={"code": ("http://loinc.org", "loinc_code")}
```

-This function also uses multiprocessing, but differently from before, it will investigate the rows
+This function also uses multiprocessing, but differently from before, it will process the rows
of the DataFrame in parallel.
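Putting this together, a sketch of a per-row query; the input DataFrame and its column names are hypothetical:

```python
import pandas as pd

# Hypothetical input: one row per patient/LOINC pair.
input_df = pd.DataFrame(
    {"patient_id": ["p1", "p2"], "loinc_code": ["55233-1", "8480-6"]}
)

bundles = search.trade_rows_for_bundles(
    input_df,
    resource_type="Observation",
    df_constraints={
        "subject": "patient_id",                     # plain column reference
        "code": ("http://loinc.org", "loinc_code"),  # (system, column) pair, as above
    },
)
```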

#### [`bundles_to_dataframe`](https://github.com/UMEssen/FHIR-PYrate/blob/main/fhir_pyrate/pirate.py)

-The two functions described above return a list of `FHIRObj` bundles which can then be
+The two functions described above return a generator of `FHIRObj` bundles which can then be
converted to a `DataFrame` using this function.

The `bundles_to_dataframe` has three options on how to handle and extract the relevant information
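One of those options is a list of FHIRPath expressions, as also used in the example further below; a hedged sketch, reusing the `bundles` generator from a Type 1 call above:

```python
# The paths are illustrative, not taken from this diff.
df = search.bundles_to_dataframe(
    bundles=bundles,
    fhir_paths=[
        ("patient", "subject.reference"),
        ("status", "status"),
    ],
)
```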
@@ -365,8 +362,7 @@ instead (as in 2.).
pieces of information but for the same resource, the field will be only filled with the first
occurrence that is not None.
```python
-df = search.query_to_dataframe(
-    bundles_function=search.steal_bundles,
+df = search.steal_bundles_to_dataframe(
resource_type="DiagnosticReport",
request_params={
"_count": 1,
@@ -387,29 +383,20 @@ df = search.query_to_dataframe(
("code_abc", "code.coding.where(system = 'ABC').code"),
("code_def", "code.coding.where(system = 'DEF').code"),
],
-    stop_after_first_page=True,
+    num_pages=1,
)
```

+#### [`***_dataframe`](https://github.com/UMEssen/FHIR-PYrate/blob/main/fhir_pyrate/pirate.py)
+The `steal_bundles_to_dataframe`, `sail_through_search_space_to_dataframe` and `trade_rows_for_dataframe`
+are facade functions which retrieve the bundles and then run `bundles_to_dataframe`.

In case you are not sure whether we have collected the same entry multiple times
(i.e. when using multiprocessing in `sail_through_search_space` with a resource that uses a time
period), please use the `drop_duplicates` function from pandas. A list of column names for which
we do not want duplicates shall be passed as parameter and all duplicate rows will disappear.
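For example, with `patient_id` standing in for whatever identifying column your extraction produced:

```python
df = df.drop_duplicates(subset=["patient_id"])
```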

-#### [`query_to_dataframe`](https://github.com/UMEssen/FHIR-PYrate/blob/main/fhir_pyrate/pirate.py)
-This function is simply a wrapper that can be used to combine any function of Type 1 and
-`bundles_to_dataframe`. Look at [examples](examples) for some use cases.
-
-#### [`trade_rows_for_dataframe`](https://github.com/UMEssen/FHIR-PYrate/blob/main/fhir_pyrate/pirate.py)
-This function has an output similar to `query_to_dataframe` with
-`bundles_function=trade_rows_for_bundles`, but with two main differences:
-1. Here, the bundles are retrieved and the DataFrame is computed straight away. In
-`query_to_dataframe(bundles_function=trade_rows_for_bundles, ...)` first all the bundles are
-retrieved, and then they are converted into a DataFrame.
-2. If the `df_constraints` constraints are specified, they will end up in the final DataFrame.

+In `trade_rows_for_dataframe` you can also specify the `with_ref` parameter to also add the
+parameters specified in `df_constraints` as columns of the final DataFrame.
You can find an example in [Example 3](https://github.com/UMEssen/FHIR-PYrate/blob/main/examples/3-patients-for-condition.ipynb).
+Additionally, you can specify the `with_columns` parameter, which can add any columns from the original
+DataFrame. The columns can be either specified as a list of columns `[col1, col2, ...]` or as a
+list of tuples `[(new_name_for_col1, col1), (new_name_for_col2, col2), ...]`
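A sketch combining both parameters; the input columns `age` and `sex` are hypothetical:

```python
df = search.trade_rows_for_dataframe(
    input_df,
    resource_type="Observation",
    df_constraints={"subject": "patient_id"},
    with_ref=True,                    # keep the df_constraints values as columns
    with_columns=[("patient_age", "age"), ("patient_sex", "sex")],
)
```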

### [Miner](https://github.com/UMEssen/FHIR-PYrate/blob/main/fhir_pyrate/miner.py)

@@ -526,6 +513,7 @@ request. You can also simply open an issue with the tag "enhancement".
5. Open a Pull Request

## Authors and acknowledgment

This package was developed by the [SHIP-AI group at the Institute for Artificial Intelligence in Medicine](https://ship-ai.ikim.nrw/).

- [goku1110](https://github.com/goku1110): initial idea, development, logo & figures