
Inconsistent encoding of some polymorphic types #1785

Closed

chgl opened this issue Apr 2, 2024 · 5 comments

Labels: bug (Something isn't working)

chgl (Collaborator) commented Apr 2, 2024

Describe the bug

Unset FHIR elements are sometimes encoded as NULL and sometimes as a complex type/struct with all of its fields set to NULL.

For example, none of the Condition resources have abatementAge set, yet the raw parquet files sometimes contain a NULL there and sometimes a {"id":null,"value":null,"value_scale":null,"comparator":null,"unit":null,"system":null,"code":null,"_value_canonicalized":{"value":null,"scale":null},"_code_canonicalized":null,"_fid":null}. The distribution is roughly 50:50 for a dataset of around 70,000 resources:

SELECT abatementAge, COUNT(*)
FROM delta.`s3a://pathling-warehouse/default/Condition.parquet` AS Condition
GROUP BY abatementAge;
abatementAge	count(1)
NULL	35000
{"id":null,"value":null,"value_scale":null,"comparator":null,"unit":null,"system":null,"code":null,"_value_canonicalized":{"value":null,"scale":null},"_code_canonicalized":null,"_fid":null}	37000

I noticed this because https://docs.profiling.ydata.ai/latest/ considers the NULL a missing value but treats the null-struct as non-null.

Here's another screenshot:

[screenshot omitted]

This is definitely something I can easily work around; I just thought you might be able to share some insight into why it is happening.
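The workaround I have in mind is roughly to normalise the all-NULL structs back to NULL after reading; a minimal PySpark sketch (the null-check on a few of the struct fields is my own shortcut, not anything Pathling provides):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Read the raw Parquet files (S3 credentials assumed to be configured elsewhere).
df = spark.read.parquet("s3a://pathling-warehouse/default/Condition.parquet")

# Treat a struct whose fields are all NULL the same as a NULL struct.
all_null = (
    F.col("abatementAge").isNotNull()
    & F.col("abatementAge.value").isNull()
    & F.col("abatementAge.code").isNull()
    & F.col("abatementAge.system").isNull()
    & F.col("abatementAge.unit").isNull()
)
df = df.withColumn(
    "abatementAge",
    F.when(all_null, F.lit(None)).otherwise(F.col("abatementAge")),
)

df.groupBy(F.col("abatementAge").isNull()).count().show()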

To Reproduce

Import a lot (?) of resources via $import and observe that, seemingly at random, some unset values are encoded as NULL and some as a null-struct.

Expected behavior

I would expect all values to be encoded consistently; that being said, the current behaviour isn't exactly wrong either.

@johngrimes (Member) commented:

Thanks for posting this @chgl.

This definitely seems quite strange, and a defect as described. We will work on reproducing this.

johngrimes added the bug label on Apr 8, 2024
@piotrszul (Collaborator) commented:

Hi @chgl,

Could you please describe in more detail how you produced the s3a://pathling-warehouse/default/Condition.parquet file? In particular, which version of Pathling and which API were used to create it, and from what input?

I tried to reproduce the issue by importing a large number of Condition resources, but that by itself does not seem to lead to this problem.
I did, however, manage to observe a difference between:

{"resourceType":"Condition","id":"xxx", "abatementAge":null}
and
{"resourceType":"Condition","id":"xxx"}

where the latter produces a NULL while the former produces an empty struct.

Could you please check whether your source data uses JSON nulls in some of the resources?

If this is not the case, would you be able to identify a pair of Condition resources (and I assume Observation resources) that have different encodings and send us their (anonymised) JSON representations?
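For example, a quick check of the source NDJSON for explicit top-level nulls could look something like this (the file name is just a placeholder):

import json

# Placeholder file name; point this at the NDJSON that was imported via $import.
with open("Condition.ndjson") as f:
    for line_no, line in enumerate(f, start=1):
        resource = json.loads(line)
        # Report resources that carry an explicit JSON null at the top level.
        null_keys = [key for key, value in resource.items() if value is None]
        if null_keys:
            print(f"line {line_no}: explicit nulls in {null_keys}")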

chgl (Collaborator, Author) commented Apr 11, 2024

Ah, it happens if a resource is merged, either via $import with mode set to merge or via update-as-create.

To reproduce:

services:
  minio:
    image: docker.io/bitnami/minio:2024.2.17-debian-12-r2@sha256:4d04a41f9d385d51ecd9be8dafca13fe9d56be2cc1c5ea8f98e6cfb235d87ae5
    environment:
      MINIO_ROOT_USER: "admin"
      # kics-scan ignore-line
      MINIO_ROOT_PASSWORD: "miniopass" # gitleaks:allow
      MINIO_DEFAULT_BUCKETS: "fhir"
    ports:
      - "9000:9000"
      - "127.0.0.1:9001:9001"

  pathling:
    image: docker.io/aehrc/pathling:6.4.2@sha256:9b8ee32d4b8bb40192d6bf25814492a616153a0df15d178c286db9ec80c1c85e
    environment:
      pathling.storage.warehouseUrl: s3://fhir
      pathling.import.allowableSources: s3://fhir/staging
      pathling.terminology.enabled: false
      pathling.terminology.serverUrl: http://localhost:8080/i-dont-exist
      fs.s3a.endpoint: "http://minio:9000"
      fs.s3a.aws.credentials.provider: "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider"
      fs.s3a.access.key: "admin"
      fs.s3a.secret.key: "miniopass"
      fs.s3a.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
      fs.s3a.path.style.access: "true"
    ports:
      - "8082:8080"
    depends_on:
      minio:
        condition: service_started

Save as compose.yaml and run

docker compose -f compose.yaml up

Using the following bundle (I don't believe the exact contents matter):

{
  "resourceType": "Bundle",
  "type": "transaction",
  "entry": [
    {
      "fullUrl": "Condition/c123",
      "resource": {
        "resourceType": "Condition",
        "id": "c123",
        "code": {
          "coding": [
            {
              "system": "http://fhir.de/CodeSystem/bfarm/icd-10-gm",
              "version": "2021",
              "code": "C72.0"
            }
          ]
        },
        "subject": {
          "reference": "Patient/p123"
        }
      },
      "request": {
        "method": "PUT",
        "url": "Condition/c123"
      }
    }
  ]
}

POST once:

curl -X POST -H "Content-Type: application/fhir+json" --data "@./bundle.json" http://localhost:8082/fhir

I used DuckDB to read the Parquet files:

duckdb
SET s3_region='us-east-1';
SET s3_access_key_id='admin';
SET s3_secret_access_key='miniopass';
SET s3_endpoint='localhost:9000';
SET s3_url_style='path';
SET s3_use_ssl=false;

SELECT id,abatementAge
  FROM "s3://fhir/default/Condition.parquet/*.parquet";
┌─────────┬─────────────────────────────────────────────────────────────────────┐
│   id    │                            abatementAge                             │
│ varchar │ struct(id varchar, "value" decimal(32,6), value_scale integer, co…  │
├─────────┼─────────────────────────────────────────────────────────────────────┤
│ c123    │                                                                     │
└─────────┴─────────────────────────────────────────────────────────────────────┘

The "empty" row means null

POST again:

curl -X POST -H "Content-Type: application/fhir+json" --data "@./bundle.json" http://localhost:8082/fhir
SELECT id,abatementAge
  FROM "s3://fhir/default/Condition.parquet/*.parquet";
┌─────────┬─────────────────────────────────────────────────────────────────────┐
│   id    │                            abatementAge                             │
│ varchar │ struct(id varchar, "value" decimal(32,6), value_scale integer, co…  │
├─────────┼─────────────────────────────────────────────────────────────────────┤
│ c123    │                                                                     │
│ c123    │ {'id': NULL, 'value': NULL, 'value_scale': NULL, 'comparator': NU…  │
└─────────┴─────────────────────────────────────────────────────────────────────┘

You can see that the most recent version now uses a struct with all-NULL fields, while the older version is just NULL.

@johngrimes (Member) commented:

Hi @chgl!

Good news, we were able to reproduce this problem.

It turns out that it is related to the spark.databricks.delta.schema.autoMerge.enabled configuration setting, which is currently set to true by default.

I believe you may be able to fix this problem simply by setting this to false.
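For anyone managing their own Spark session (for example when using the Pathling library directly rather than the server), a minimal sketch of applying the setting there; for the server deployment above it would need to be supplied through the server's configuration instead:

from pyspark.sql import SparkSession

# Sketch only: disable Delta schema auto-merge on a self-managed Spark session.
spark = (
    SparkSession.builder
    .config("spark.databricks.delta.schema.autoMerge.enabled", "false")
    .getOrCreate()
)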

If you are wondering why you are getting multiple rows for the same resource when querying with DuckDB, it is because these are Delta files. Delta is a specialisation of Parquet that includes history. Because you are updating these rows, old versions are being preserved.

These will be invisible to a Delta-aware query engine such as Spark (unless you explicitly query the history).
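For example, reading the table through the Delta reader in Spark (a sketch assuming the delta-spark package is available and the S3 settings from the compose file are configured) returns only the current version of each row:

from pyspark.sql import SparkSession

# Sketch: read the table as Delta rather than as raw Parquet, so superseded
# row versions kept in the table history are not returned.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

df = spark.read.format("delta").load("s3a://fhir/default/Condition.parquet")
df.select("id", "abatementAge").show(truncate=False)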

We're working on a new release which will change the default setting to fix this problem.

chgl (Collaborator, Author) commented Apr 19, 2024

Sounds good, thanks for the update @johngrimes !
