121 improve documentation (#122)

* Add logic to disable multiprocessing if processes=1
* Update examples
* Add explanation about multiple resources output
* Add example for multi dataframe output

---------

Co-authored-by: Giulia Baldini <Giulia.Baldini@uk-essen.de>
UMEssen · Mar 17, 2023 · ad3eefe
1 parent 004036b
Showing 6 changed files with 174 additions and 140 deletions.
diff --git a/README.md b/README.md
@@ -203,10 +203,12 @@ The Pirate functions do one of three things:
 | trade_rows_for_dataframe                |  3   |       Yes       |    Yes    |      DataFrame       |
 
 
-**BETA FEATURE**: It is also possible to cache the bundles using the `bundle_caching` parameter,
-which specifies a caching folder. This has not yet been tested extensively and does not have any
-cache invalidation mechanism.
-
+**CACHING**: It is also possible to cache the bundles using the `cache_folder` parameter.
+This unfortunately does not currently work with multiprocessing, but it saves a lot of time if you
+need to download a lot of data and you always make the same requests.
+You can also specify how long the cache should be valid with the `cache_expiry_time` parameter.
+Additionally, you can specify whether the requests should be retried using the `retry_requests`
+parameter. There is an example of this in the docstrings of the Pirate class.
 
 A toy request for ImagingStudy:
 
@@ -396,7 +398,65 @@ parameters specified in `df_constraints` as columns of the final DataFrame.
 You can find an example in [Example 3](https://github.com/UMEssen/FHIR-PYrate/blob/main/examples/3-patients-for-condition.ipynb).
 Additionally, you can specify the `with_columns` parameter, which can add any columns from the original
 DataFrame. The columns can be either specified as a list of columns `[col1, col2, ...]` or as a
-list of tuples `[(new_name_for_col1, col1), (new_name_for_col2, col2), ...]`
+list of tuples `[(new_name_for_col1, col1), (new_name_for_col2, col2), ...]`.
+
+Currently, whenever a column is completely empty (i.e., no resources
+have a corresponding value for that column), it is just removed from the DataFrame.
+This is to ensure that we output clean DataFrames when we are handling multiple resources.
+More on that in the following section.
+
+#### Note on Querying Multiple Resources
+
+Not all FHIR servers allow this (at least not the public ones that we have tried),
+but it is also possible to obtain multiple resources with just one query:
+```python
+search = ...
+result_dfs = search.steal_bundles_to_dataframe(
+    resource_type="ImagingStudy",
+    request_params={
+        "_lastUpdated": "ge2022-12",
+        "_count": "3",
+        "_include": "ImagingStudy:subject",
+    },
+    fhir_paths=[
+        "id",
+        "started",
+        ("modality", "modality.code"),
+        ("procedureCode", "procedureCode.coding.code"),
+        (
+            "study_instance_uid",
+            "identifier.where(system = 'urn:dicom:uid').value.replace('urn:oid:', '')",
+        ),
+        ("series_instance_uid", "series.uid"),
+        ("series_code", "series.modality.code"),
+        ("numberOfInstances", "series.numberOfInstances"),
+        ("family_first", "name[0].family"),
+        ("given_first", "name[0].given"),
+    ],
+    num_pages=1,
+)
+```
+In this case, a dictionary of DataFrames is returned, where the keys are the resource types.
+You can then select a single DataFrame with `result_dfs["ImagingStudy"]`
+or `result_dfs["Patient"]`.
+You can find an example of this in [Example 2](https://github.com/UMEssen/FHIR-PYrate/blob/main/examples/2-condition-to-imaging-study.ipynb)
+where the `ImagingStudy` resource is queried.
+
+In theory, it would be smarter to specify the resource name in front of the FHIRPaths,
+e.g. `ImagingStudy.series.uid` instead of `series.uid`, and for each DataFrame only return the
+corresponding attributes.
+However, we do not want to force the user to always specify the resource type, and in the current
+version the DataFrames
+coming from multiple resources initially have the same columns, because
+we cannot tell which resource each FHIRPath was intended for.
+Currently, we solve this by removing all columns that do not have any results.
+This means, however, that if you request an attribute for a specific resource and it
+is not found, the column will not appear.
+In the future, [we plan to do a smarter filtering of the FHIRPaths](https://github.com/UMEssen/FHIR-PYrate/issues/120),
+such that only the ones containing
+the actual resource name are kept if the resource name is specified in the path,
+and that a column full of `None`s is obtained in case no resource type is specified.
+
 
 ### [Miner](https://github.com/UMEssen/FHIR-PYrate/blob/main/fhir_pyrate/miner.py)
 

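Note on the README change above: the removal of completely empty columns can be pictured with a small, library-free sketch. This is not FHIR-PYrate's implementation. The `drop_empty_columns` helper and the sample rows are hypothetical, and plain lists of dictionaries stand in for the DataFrames. It only illustrates why a path such as `name[0].family` would presumably disappear from the `ImagingStudy` table while surviving in the `Patient` table.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def drop_empty_columns(rows: List[Row]) -> List[Row]:
    """Remove every key whose value is None in all rows (hypothetical helper)."""
    # Collect the keys that have at least one non-None value.
    keep = {key for row in rows for key, value in row.items() if value is not None}
    # Rebuild each row with only those keys, preserving the original key order.
    return [{k: v for k, v in row.items() if k in keep} for row in rows]


# Made-up rows: the same columns exist for both resource types before cleaning.
tables = {
    "ImagingStudy": [
        {"id": "s1", "started": "2022-12-01", "family_first": None},
        {"id": "s2", "started": None, "family_first": None},
    ],
    "Patient": [
        {"id": "p1", "started": None, "family_first": "Doe"},
    ],
}

# One cleaned table per resource type, analogous to the dictionary of DataFrames.
cleaned = {rtype: drop_empty_columns(rows) for rtype, rows in tables.items()}
```

A column that is only partly empty (here `started` for `ImagingStudy`) is kept, so individual `None` values can still appear in the output.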
diff --git a/examples/1-simple-json-to-df.ipynb b/examples/1-simple-json-to-df.ipynb
@@ -14,7 +14,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 2,
+   "execution_count": 4,
    "outputs": [],
    "source": [
     "from fhir_pyrate import Pirate\n",
@@ -65,28 +65,28 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 4,
+   "execution_count": 5,
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "http://hapi.fhir.org/baseDstu2/Observation?_id=86092\n"
+      "http://hapi.fhir.org/baseDstu2/Observation/86092\n"
      ]
     },
     {
      "name": "stderr",
      "output_type": "stream",
      "text": [
-      "Query & Build DF: 100%|██████████| 1/1 [00:00<00:00, 5882.61it/s]\n"
+      "Query & Build DF (Observation): 100%|██████████| 1/1 [00:00<00:00, 12372.58it/s]\n"
      ]
     },
     {
      "data": {
       "text/plain": "  resourceType     id meta_versionId               meta_lastUpdated status  \\\n0  Observation  86092              1  2018-11-19T12:59:31.238+00:00  final   \n\n                   category_coding_0_system category_coding_0_code  \\\n0  http://hl7.org/fhir/observation-category            vital-signs   \n\n  code_coding_0_system code_coding_0_code code_coding_0_display    code_text  \\\n0     http://loinc.org            29463-7           Body Weight  Body Weight   \n\n  subject_reference encounter_reference          effectiveDateTime  \\\n0     Patient/86079     Encounter/86090  2011-03-10T20:47:29-05:00   \n\n                      issued  valueQuantity_value valueQuantity_unit  \\\n0  2011-03-10T20:47:29-05:00             6.079781                 kg   \n\n         valueQuantity_system valueQuantity_code  \n0  http://unitsofmeasure.org/                 kg  ",
       "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>resourceType</th>\n      <th>id</th>\n      <th>meta_versionId</th>\n      <th>meta_lastUpdated</th>\n      <th>status</th>\n      <th>category_coding_0_system</th>\n      <th>category_coding_0_code</th>\n      <th>code_coding_0_system</th>\n      <th>code_coding_0_code</th>\n      <th>code_coding_0_display</th>\n      <th>code_text</th>\n      <th>subject_reference</th>\n      <th>encounter_reference</th>\n      <th>effectiveDateTime</th>\n      <th>issued</th>\n      <th>valueQuantity_value</th>\n      <th>valueQuantity_unit</th>\n      <th>valueQuantity_system</th>\n      <th>valueQuantity_code</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>Observation</td>\n      <td>86092</td>\n      <td>1</td>\n      <td>2018-11-19T12:59:31.238+00:00</td>\n      <td>final</td>\n      <td>http://hl7.org/fhir/observation-category</td>\n      <td>vital-signs</td>\n      <td>http://loinc.org</td>\n      <td>29463-7</td>\n      <td>Body Weight</td>\n      <td>Body Weight</td>\n      <td>Patient/86079</td>\n      <td>Encounter/86090</td>\n      <td>2011-03-10T20:47:29-05:00</td>\n      <td>2011-03-10T20:47:29-05:00</td>\n      <td>6.079781</td>\n      <td>kg</td>\n      <td>http://unitsofmeasure.org/</td>\n      <td>kg</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
      },
-     "execution_count": 4,
+     "execution_count": 5,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -120,28 +120,28 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 5,
+   "execution_count": 6,
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "http://hapi.fhir.org/baseDstu2/Observation?_count=1&_id=86092\n"
+      "http://hapi.fhir.org/baseDstu2/Observation/86092\n"
      ]
     },
     {
      "name": "stderr",
      "output_type": "stream",
      "text": [
-      "Query & Build DF: 100%|██████████| 1/1 [00:00<00:00, 1197.69it/s]\n"
+      "Query & Build DF (Observation): 100%|██████████| 1/1 [00:00<00:00, 1379.71it/s]\n"
      ]
     },
     {
      "data": {
       "text/plain": "      id          effectiveDateTime     value unit patient\n0  86092  2011-03-10T20:47:29-05:00  6.079781   kg   86079",
       "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>id</th>\n      <th>effectiveDateTime</th>\n      <th>value</th>\n      <th>unit</th>\n      <th>patient</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>86092</td>\n      <td>2011-03-10T20:47:29-05:00</td>\n      <td>6.079781</td>\n      <td>kg</td>\n      <td>86079</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
      },
-     "execution_count": 5,
+     "execution_count": 6,
      "metadata": {},
      "output_type": "execute_result"
     }