Merge dev into main (#118)
* Update to 3.8 because of numpy security update (#64)

* Update to 3.8 because of numpy security update

* Remove openpyxl because 3.7 is not supported anymore

* 33 convert basic bundle collection to generator (#60)

* Convert functions to return generators

* Update readme

* Add more tests

* Remove pyrohealth from tests because it is currently not working

* Add another type ignore for function

* Remove all fhirpathpy import tries

* Update example 4

* Change functions used in public tests

* Fix problems with trade_rows_for_dataframe_with_ref and mypy issues

* Update example 2

* Update example 1

* Fix tqdm text name

* Update example 2&4

* Update example 3

* Re-add none case to reset empty lists in fhirpaths

* Add history support (#74)

* 8 add handling for dataframe functions using  (#72)

* Remove initial reference name in df_constraints if the search parameter is _id

* Add error if there are None values in a DF constraint column

* Bump up version because of incompatibility

* Add pipe for system only if the first element of the df_constraints is a URL

* Update README.md

* Fix Error for Paging when the URL changes from HTTP to HTTPS (#78)

* Fix the bug by introducing a new variable that contains the domain

* Add forgotten regex group and add test

* Update README.md

* 75 allow specifying multiple query arguments in df constraints (#76)

* Add option to have multiple values for df_constraints keys

* Working now; I did not remember the expected structure, which also explains why mypy was not happy with it

* Fix error for history, since it always expected a string instead of a list of strings

* Add conversion to string when adding identifiers

* Add more tests

* Make tests less time-consuming

* Update README.md

* Update README.md

* Create CITATION.cff

* Update CITATION.cff

* Update CITATION.cff

* Update README.md

* Fix CITATION.cff

* Add version attribute to init (#80)

* Check for file existence and return a warning and None in fix_mapping_dataframe (#82)

* Update pyproject.toml

* Update CITATION.cff

* Fix readme error due to changes in code

* Update CITATION.cff

* Update CITATION.cff

* 83 allow turning off checks for dicom download (#85)

* Add turn-off-checks option

* Make the continue statement depend on the turn_off_checks variable

* Add resource name to the TQDM description (#88)

* 84 merge trade rows for dataframe and trade rows for dataframe with ref (#89)

* Merge trade_rows_for_dataframe and trade_rows_for_dataframe_with_ref

* Update version

* Modify tests

* Update notebooks

* 86 allow addition of any column to the trade rows for dataframe result (#90)

* Update version

* Add the option of adding any input column to the output DF of trade_rows_for_dataframe

* Add the option of renaming the columns

* Add space in warning

* Remove blank lines

* Make pre-commit up to date with pyproject

* Remove version 3.7 from automatic tests and add 3.10

* 91 add merge on option to steal bundles to dataframe (#92)

* Add merge_on to all dataframe functions

* Modify mypy command in github workflows

* Add directory for mypy check

* Fix file after wrong conflict resolving

* Fix readme

* Fix readme hyperlink

* Fix consistency in README

* Update CITATION.cff

* Update pyproject.toml

* Update authors in readme

* Update CITATION.cff

* Update poetry in github actions

* 95 fix docstring for with columns (#98)

* Fix link to part of readme

* Add docstring about with_columns

* Improve spacing in docstring

* Test for mypy

* Check if the error depends on 3.10

* Update packages and try again with 3.10

* Set 3.10.6 as py version

* Add type ignores

* Add type ignores

* Update version

* 100 force read action on  (#101)

* Force read request for IDs

* Update the bundle processing to go through the resources

* 96 add format options for dicomdownloader (#99)

* Add the possibility to always save in the study folder

* Add new formats to store the downloaded data

* Update README.md

* Update README.md

* Convert logging.warn to warnings when appropriate (#105)

* Standardize bundles for read and search (#106)

* Update CITATION.cff

* Update CITATION.cff

* 104 smarter caching (#107)

* First draft of http caching

* Improve docstrings

* Add a couple of todos

* Add retry option and custom create_key parameter for caching

* Sort request params to ensure same order for caching

* Update CITATION.cff

* 111 make current beta compatible with 0.1.0 (#112)

* Bump up version

* Add query_to_dataframe function to ensure compatibility with v0.1.0

* Fix text in examples

* Fix parameter inconsistency in query_to_dataframe

* Update tests with query_to_dataframe

* Adjust tests

* 94 efficiency problem with merge on (#113)

* Remove the merge on parameter and return one dataframe per resource

* Filter out none values directly from the returned records

* Modify bundle_to_dataframe to take the union of all processed bundles per resource

* Adjust tests

* Different outputs for query_to_dataframe

* Remove the always-return-dict overwrite, which may break everything

* Update pyproject.toml

* Make fhirpathpy input greedy (#116)

Co-authored-by: Giulia Baldini <Giulia.Baldini@uk-essen.de>

* Update pyproject.toml

---------

Co-authored-by: Giulia Baldini <Giulia.Baldini@uk-essen.de>
giuliabaldini and Giulia Baldini committed Mar 17, 2023
1 parent 1425313 commit a677efc
Showing 15 changed files with 3,181 additions and 1,780 deletions.
9 changes: 4 additions & 5 deletions .github/workflows/mypy-flake-test.yml
@@ -25,7 +25,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.7", "3.8", "3.9"]
python-version: ["3.8", "3.9", "3.10.6"]

# Steps represent a sequence of tasks that will be executed as part of the job
steps:
@@ -41,14 +41,13 @@
# Install poetry
- name: Install poetry
run: |
- curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/install-poetry.py | python -
- export PATH="/root/.local/bin:$PATH"
+ curl -sSL https://install.python-poetry.org | python3 -
poetry run pip install -U pip
poetry install -E all
- name: Run MyPy
run: |
- poetry run mypy --install-types --non-interactive fhir_pyrate/
- poetry run mypy --install-types --non-interactive tests/
+ poetry run mypy --install-types --non-interactive fhir_pyrate
+ poetry run mypy --install-types --non-interactive tests
- name: Run Flake8
run: |
poetry run flake8 fhir_pyrate/
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -5,7 +5,7 @@ repos:
- id: end-of-file-fixer
- id: trailing-whitespace
- repo: https://github.com/psf/black
- rev: 21.7b0
+ rev: 22.3.0
hooks:
- id: black
- repo: https://github.com/pycqa/isort
114 changes: 51 additions & 63 deletions README.md
@@ -1,5 +1,8 @@
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
-[![Supported Python version](https://img.shields.io/badge/python-3.7+-blue.svg)](https://www.python.org/downloads/release/python-370/)
+[![Supported Python version](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/release/python-380/)
+[![Stable Version](https://img.shields.io/pypi/v/fhir-pyrate?label=stable)](https://pypi.org/project/fhir-pyrate/)
+[![Pre-release Version](https://img.shields.io/github/v/release/UMEssen/fhir-pyrate?label=pre-release&include_prereleases&sort=semver)](https://pypi.org/project/fhir-pyrate/#history)
+[![DOI](https://zenodo.org/badge/456893108.svg)](https://zenodo.org/badge/latestdoi/456893108)

<!-- PROJECT LOGO -->
<br />
@@ -10,7 +13,7 @@
</div>

This package is meant to provide a simple abstraction to query and structure FHIR resources as
-pandas DataFrames.
+pandas DataFrames. Want to use R instead? Try out [fhircrackr](https://github.com/POLAR-fhiR/fhircrackr)!

There are four main classes:
* [Ahoy](https://github.com/UMEssen/FHIR-PYrate/blob/main/fhir_pyrate/ahoy.py): Authenticate on the FHIR API
@@ -53,8 +56,7 @@ Table of Contents:
* [sail_through_search_space](https://github.com/UMEssen/FHIR-PYrate/#sail_through_search_space)
* [trade_rows_for_bundles](https://github.com/UMEssen/FHIR-PYrate/#trade_rows_for_bundles)
* [bundles_to_dataframe](https://github.com/UMEssen/FHIR-PYrate/#bundles_to_dataframe)
-  * [query_to_dataframe](https://github.com/UMEssen/FHIR-PYrate/#query_to_dataframe)
-  * [trade_rows_for_dataframe](https://github.com/UMEssen/FHIR-PYrate/#trade_rows_for_dataframe)
+  * [***_dataframe](https://github.com/UMEssen/FHIR-PYrate/#_dataframe)
* [Miner](https://github.com/UMEssen/FHIR-PYrate/#miner)
* [DicomDownloader](https://github.com/UMEssen/FHIR-PYrate/#dicomdownloader)
* [Contributing](https://github.com/UMEssen/FHIR-PYrate/#contributing)
@@ -75,7 +77,7 @@ or using GitHub (always the newest version).
pip install git+https://github.com/UMEssen/FHIR-PYrate.git
```

-These two commands only install the packages needed for `Pirate`. If you also want to use the `Miner` or the `DicomDownloader`, then you need to install them as extra dependencies with
+These two commands only install the packages needed for **Pirate**. If you also want to use the **Miner** or the **DicomDownloader**, then you need to install them as extra dependencies with
```bash
pip install "fhir-pyrate[miner]" # only for miner
pip install "fhir-pyrate[downloader]" # only for downloader
@@ -105,7 +107,7 @@ and then run
poetry lock
```

-Also in poetry, the above only installs the packages for `Pirate`. If you also want to use the `Miner` or the `DicomDownloader`, then you need to install them as extra dependencies with
+Also in poetry, the above only installs the packages for **Pirate**. If you also want to use the **Miner** or the **DicomDownloader**, then you need to install them as extra dependencies with
```bash
poetry add "fhir-pyrate[miner]" # only for miner
poetry add "fhir-pyrate[downloader]" # only for downloader
@@ -146,8 +148,8 @@ from fhir_pyrate import Ahoy
auth = Ahoy(
username="your_username",
auth_method="password",
-    auth_url=auth-url, # The URL for authentication
-    refresh_url=refresh-url, # The URL to refresh the authentication
+    auth_url="auth-url", # Your URL for authentication
+    refresh_url="refresh-url", # Your URL to refresh the authentication token (if available)
)
```

@@ -173,35 +175,32 @@ auth = ...
# Init Pirate
search = Pirate(
auth=auth,
-    base_url=fhir-url, # e.g. "http://hapi.fhir.org/baseDstu2"
+    base_url="fhir-url", # e.g. "http://hapi.fhir.org/baseDstu2"
print_request_url=False, # If set to true, you will see all requests
)
```

The Pirate functions do one of three things:
-1. They run the query and collect the resources and store them in a list of bundles.
+1. They run the query and collect the resources and store them in a generator of bundles.
* `steal_bundles`: single process, no timespan to specify
-* `steal_bundles_for_timespan`: single process, timespan can be specified
-* `sail_through_search_space`: multiprocess, divide&conquer with many smaller timespans, uses `steal_bundles_for_timespan`
-* `trade_rows_for_bundles`: multiprocess, takes DataFrame as input and runs one query per row,
-  uses `steal_bundles`
-2. They take a list of bundles and build a DataFrame.
-* `bundles_to_dataframe`: multiprocess
+* `sail_through_search_space`: multiprocess, divide&conquer with many smaller timespans
+* `trade_rows_for_bundles`: multiprocess, takes DataFrame as input and runs one query per row
+2. They take a generator of bundles and build a DataFrame.
+* `bundles_to_dataframe`: multiprocess, builds the DataFrame from the bundles.
3. They are wrappers that combine the functionalities of 1&2, or that set some particular parameters.
-* `query_to_dataframe`: multiprocess, executes any function selected with `bundles_function`
-  (any of the functions in 1.) and then runs `bundles_to_dataframe` on the result.
-* `trade_rows_for_dataframe`: multiprocess, executes `steal_bundles`&`bundles_to_dataframe`
-  for each row of the DataFrame.
-
-| Name | Type | Multiprocessing | DF Input? | Output |
-|:---------------------------|:----:|:---------------:|:---------:|:--------------------------:|
-| steal_bundles | 1 | No | No | List of Bundles of FHIRObj |
-| steal_bundles_for_timespan | 1 | No | No | List of Bundles of FHIRObj |
-| sail_through_search_space | 1 | Yes | No | List of Bundles of FHIRObj |
-| trade_rows_for_bundles | 1 | Yes | Yes | List of Bundles of FHIRObj |
-| bundles_to_dataframe | 2 | Yes | No | DataFrame |
-| query_to_dataframe | 3 | Yes | Yes | DataFrame |
-| trade_rows_for_dataframe | 3 | Yes | Yes | DataFrame |
+* `steal_bundles_to_dataframe`: single process, executes `steal_bundles` and then runs `bundles_to_dataframe` on the result.
+* `sail_through_search_space_to_dataframe`: multiprocess, executes `sail_through_search_space` and then runs `bundles_to_dataframe` on the result.
+* `trade_rows_for_dataframe`: multiprocess, executes `trade_rows_for_bundles` and then runs `bundles_to_dataframe` on the result; it is also possible to add columns from the original DataFrame to the result.
+
+| Name | Type | Multiprocessing | DF Input? | Output |
+|:----------------------------------------|:----:|:---------------:|:---------:|:--------------------:|
+| steal_bundles | 1 | No | No | Generator of FHIRObj |
+| sail_through_search_space | 1 | Yes | No | Generator of FHIRObj |
+| trade_rows_for_bundles | 1 | Yes | Yes | Generator of FHIRObj |
+| bundles_to_dataframe | 2 | Yes | / | DataFrame |
+| steal_bundles_to_dataframe | 3 | No | No | DataFrame |
+| sail_through_search_space_to_dataframe | 3 | Yes | No | DataFrame |
+| trade_rows_for_dataframe | 3 | Yes | Yes | DataFrame |
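To make the split concrete, here is a minimal sketch of a Type 1 call next to its Type 3 facade; the public HAPI server URL and the anonymous `auth=None` are assumptions for illustration, not part of this diff:

```python
from fhir_pyrate import Pirate

# Public test server without authentication -- an assumption for this sketch.
search = Pirate(auth=None, base_url="http://hapi.fhir.org/baseR4")

# Type 1: returns a generator of FHIRObj bundles, nothing is flattened yet.
bundles = search.steal_bundles(
    resource_type="Patient",
    request_params={"_count": 10},
    num_pages=1,
)

# Type 3: the same query, with bundles_to_dataframe applied to the result.
df = search.steal_bundles_to_dataframe(
    resource_type="Patient",
    request_params={"_count": 10},
    num_pages=1,
)
```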


**BETA FEATURE**: It is also possible to cache the bundles using the `bundle_caching` parameter,
Expand All @@ -215,8 +214,7 @@ A toy request for ImagingStudy:
search = ...

# Make the FHIR call
-bundles = search.query_to_dataframe(
-    bundles_function=search.sail_through_search_space,
+bundles = search.sail_through_search_space_to_dataframe(
resource_type="ImagingStudy",
date_init="2021-04-01",
time_attribute_name="started",
@@ -230,26 +228,25 @@
The argument `request_params` is a dictionary that takes a string as key (the FHIR identifier) and anything as value.
If the value is a list or tuple, then all values will be used to build the request to the FHIR API.
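For instance, a hedged sketch of such a dictionary (generic FHIR search parameters, chosen only for illustration):

```python
request_params = {
    "status": "final",              # single value
    "code": ["55233-1", "8480-6"],  # list/tuple: every value is added to the request
    "_sort": "-date",
}
```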

-`query_to_dataframe` is a wrapper function. It collects the bundles that result from the
-`bundles_function` that was called and calls `bundles_to_dataframe`. In this case, we used
-sail_through_search_space.
+`sail_through_search_space_to_dataframe` is a wrapper function that directly converts the result of
+`sail_through_search_space` into a DataFrame.

#### [`sail_through_search_space`](https://github.com/UMEssen/FHIR-PYrate/blob/main/fhir_pyrate/pirate.py)

The `sail_through_search_space` function uses the multiprocessing module to speed up some queries.
The multiprocessing is done as follows:
The time frame is divided into multiple time spans (as many as there are processes) and each smaller
time frame is investigated simultaneously. This is why it is necessary to give a `date_init`
-and `date_end` param to the
-`sail_through_search_space` function. The default values are `date_init=2010-01-01` and today (the day
-when the query is performed)
-for `date_end`.
+and `date_end` param to the `sail_through_search_space` function.

+**Note** that if the `date_init` or `date_end` parameters are given as strings, they will be converted
+to `datetime.datetime` objects, so any non specified parameters (month, day or time) will be assumed
+according to the `datetime` workflow, and then converted to string according to the `time_format`
+specified in the **Pirate** constructor.
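A small standard-library illustration of that defaulting behaviour (Pirate's actual parsing code is not shown in this diff and may differ):

```python
import datetime

# Only year and month given: the day defaults to 1 and the time to 00:00.
dt = datetime.datetime.strptime("2021-04", "%Y-%m")
print(dt)                       # 2021-04-01 00:00:00
print(dt.strftime("%Y-%m-%d"))  # back to a string with a time_format of "%Y-%m-%d"
```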

A problematic aspect of the resources is that the date in which the resource was acquired is defined
using different attributes. Also, some resources use a fixed date, other use a time period.
You can specify the date attribute that you want to use with `time_attribute_name`.
In the following table you can see which resource attributes we use of each of the resources.
The default attribute is `_lastUpdated`.

The resources where the date is based on a period (such as `Encounter` or `Procedure`) may cause
duplicates in the multiprocessing because one entry may belong to multiple time spans that are
@@ -286,12 +283,12 @@ contains a bunch of different LOINC codes. Our `df_constraints` could look as follows:
df_constraints={"code": ("http://loinc.org", "loinc_code")}
```

-This function also uses multiprocessing, but differently from before, it will investigate the rows
+This function also uses multiprocessing, but differently from before, it will process the rows
of the DataFrame in parallel.
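Putting this together, a sketch of a per-row query; the input DataFrame and its column names are hypothetical:

```python
import pandas as pd

# Hypothetical input: one row per patient/LOINC pair.
input_df = pd.DataFrame(
    {"patient_id": ["p1", "p2"], "loinc_code": ["55233-1", "8480-6"]}
)

bundles = search.trade_rows_for_bundles(
    input_df,
    resource_type="Observation",
    df_constraints={
        "subject": "patient_id",                     # plain column reference
        "code": ("http://loinc.org", "loinc_code"),  # (system, column) pair, as above
    },
)
```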

#### [`bundles_to_dataframe`](https://github.com/UMEssen/FHIR-PYrate/blob/main/fhir_pyrate/pirate.py)

-The two functions described above return a list of `FHIRObj` bundles which can then be
+The two functions described above return a generator of `FHIRObj` bundles which can then be
converted to a `DataFrame` using this function.

The `bundles_to_dataframe` has three options on how to handle and extract the relevant information
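One of those options is a list of FHIRPath expressions, as also used in the example further below; a hedged sketch, reusing the `bundles` generator from a Type 1 call above:

```python
# The paths are illustrative, not taken from this diff.
df = search.bundles_to_dataframe(
    bundles=bundles,
    fhir_paths=[
        ("patient", "subject.reference"),
        ("status", "status"),
    ],
)
```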
@@ -365,8 +362,7 @@ instead (as in 2.).
pieces of information but for the same resource, the field will be only filled with the first
occurrence that is not None.
```python
-df = search.query_to_dataframe(
-    bundles_function=search.steal_bundles,
+df = search.steal_bundles_to_dataframe(
resource_type="DiagnosticReport",
request_params={
"_count": 1,
@@ -387,29 +383,20 @@ df = search.query_to_dataframe(
("code_abc", "code.coding.where(system = 'ABC').code"),
("code_def", "code.coding.where(system = 'DEF').code"),
],
-    stop_after_first_page=True,
+    num_pages=1,
)
```

+#### [`***_dataframe`](https://github.com/UMEssen/FHIR-PYrate/blob/main/fhir_pyrate/pirate.py)
+The `steal_bundles_to_dataframe`, `sail_through_search_space_to_dataframe` and `trade_rows_for_dataframe`
+are facade functions which retrieve the bundles and then run `bundles_to_dataframe`.

In case you are not sure whether we have collected the same entry multiple times
(i.e. when using multiprocessing in `sail_through_search_space` with a resource that uses a time
period), please use the `drop_duplicates` function from pandas. A list of column names for which
we do not want duplicates shall be passed as parameter and all duplicate rows will disappear.
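For example, with `patient_id` standing in for whatever identifying column your extraction produced:

```python
df = df.drop_duplicates(subset=["patient_id"])
```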

-#### [`query_to_dataframe`](https://github.com/UMEssen/FHIR-PYrate/blob/main/fhir_pyrate/pirate.py)
-This function is simply a wrapper that can be used to combine any function of Type 1 and
-`bundles_to_dataframe`. Look at [examples](examples) for some use cases.
-
-#### [`trade_rows_for_dataframe`](https://github.com/UMEssen/FHIR-PYrate/blob/main/fhir_pyrate/pirate.py)
-This function has an output similar to `query_to_dataframe` with
-`bundles_function=trade_rows_for_bundles`, but with two main differences:
-1. Here, the bundles are retrieved and the DataFrame is computed straight away. In
-`query_to_dataframe(bundles_function=trade_rows_for_bundles, ...)` first all the bundles are
-retrieved, and then they are converted into a DataFrame.
-2. If the `df_constraints` constraints are specified, they will end up in the final DataFrame.

+In `trade_rows_for_dataframe` you can also specify the `with_ref` parameter to also add the
+parameters specified in `df_constraints` as columns of the final DataFrame.
You can find an example in [Example 3](https://github.com/UMEssen/FHIR-PYrate/blob/main/examples/3-patients-for-condition.ipynb).
+Additionally, you can specify the `with_columns` parameter, which can add any columns from the original
+DataFrame. The columns can be either specified as a list of columns `[col1, col2, ...]` or as a
+list of tuples `[(new_name_for_col1, col1), (new_name_for_col2, col2), ...]`
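A sketch combining both parameters; the input columns `age` and `sex` are hypothetical:

```python
df = search.trade_rows_for_dataframe(
    input_df,
    resource_type="Observation",
    df_constraints={"subject": "patient_id"},
    with_ref=True,                    # keep the df_constraints values as columns
    with_columns=[("patient_age", "age"), ("patient_sex", "sex")],
)
```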

### [Miner](https://github.com/UMEssen/FHIR-PYrate/blob/main/fhir_pyrate/miner.py)

@@ -526,6 +513,7 @@ request. You can also simply open an issue with the tag "enhancement".
5. Open a Pull Request

## Authors and acknowledgment

This package was developed by the [SHIP-AI group at the Institute for Artificial Intelligence in Medicine](https://ship-ai.ikim.nrw/).

- [goku1110](https://github.com/goku1110): initial idea, development, logo & figures