Inconsistent encoding of some polymorphic types #1785
Comments
Thanks for posting this @chgl. This definitely seems quite strange, and a defect as described. We will work on reproducing this.
Hi @chgl, could you please describe in more detail how you produced the s3a://pathling-warehouse/default/Condition.parquet file? I tried to reproduce the issue by importing a large number of Condition resources, but that by itself does not seem to be leading to this problem.
The latter produces a NULL, while the former produces an empty struct. Could you please check whether your source data uses JSON nulls in some of the resources? If that is not the case, would you be able to identify a pair of Condition resources (and, I assume, Observations) that have different encodings and send us the (anonymised) JSON representations of them?
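The distinction being asked about can be sketched quickly: at the JSON level, a field that is absent is not the same thing as a field carrying an explicit JSON null, and an encoder may treat the two differently. A minimal illustration using Python's standard json module (the resource snippets are hypothetical examples, not taken from the reporter's data):

```python
import json

# Two minimal, hypothetical Condition resources: one omits abatementAge
# entirely, the other carries an explicit JSON null for it.
absent = json.loads('{"resourceType": "Condition", "id": "a"}')
explicit_null = json.loads(
    '{"resourceType": "Condition", "id": "b", "abatementAge": null}'
)

# The parsed representations differ: the key is simply missing in the
# first resource, but present (with value None) in the second.
print("abatementAge" in absent)         # False
print("abatementAge" in explicit_null)  # True
print(explicit_null["abatementAge"])    # None
```

If the source data mixed these two forms, that would be one plausible way to end up with two different encodings of the same unset field.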
Ah, it happens if a resource is merged. To reproduce:

```yaml
services:
  minio:
    image: docker.io/bitnami/minio:2024.2.17-debian-12-r2@sha256:4d04a41f9d385d51ecd9be8dafca13fe9d56be2cc1c5ea8f98e6cfb235d87ae5
    environment:
      MINIO_ROOT_USER: "admin"
      # kics-scan ignore-line
      MINIO_ROOT_PASSWORD: "miniopass" # gitleaks:allow
      MINIO_DEFAULT_BUCKETS: "fhir"
    ports:
      - "9000:9000"
      - "127.0.0.1:9001:9001"
  pathling:
    image: docker.io/aehrc/pathling:6.4.2@sha256:9b8ee32d4b8bb40192d6bf25814492a616153a0df15d178c286db9ec80c1c85e
    environment:
      pathling.storage.warehouseUrl: s3://fhir
      pathling.import.allowableSources: s3://fhir/staging
      pathling.terminology.enabled: false
      pathling.terminology.serverUrl: http://localhost:8080/i-dont-exist
      fs.s3a.endpoint: "http://minio:9000"
      fs.s3a.aws.credentials.provider: "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider"
      fs.s3a.access.key: "admin"
      fs.s3a.secret.key: "miniopass"
      fs.s3a.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
      fs.s3a.path.style.access: "true"
    ports:
      - "8082:8080"
    depends_on:
      minio:
        condition: service_started
```

Save as compose.yaml and start the stack:

```shell
docker compose -f compose.yaml up
```

Using the following bundle (I don't believe the exact contents matter):

```json
{
  "resourceType": "Bundle",
  "type": "transaction",
  "entry": [
    {
      "fullUrl": "Condition/c123",
      "resource": {
        "resourceType": "Condition",
        "id": "c123",
        "code": {
          "coding": [
            {
              "system": "http://fhir.de/CodeSystem/bfarm/icd-10-gm",
              "version": "2021",
              "code": "C72.0"
            }
          ]
        },
        "subject": {
          "reference": "Patient/p123"
        }
      },
      "request": {
        "method": "PUT",
        "url": "Condition/c123"
      }
    }
  ]
}
```

POST once:

```shell
curl -X POST -H "Content-Type: application/fhir+json" --data "@./bundle.json" http://localhost:8082/fhir
```

I used the duckdb CLI to read the Parquet file:

```sql
SET s3_region='us-east-1';
SET s3_access_key_id='admin';
SET s3_secret_access_key='miniopass';
SET s3_endpoint='localhost:9000';
SET s3_url_style='path';
SET s3_use_ssl=false;
SELECT id, abatementAge
FROM "s3://fhir/default/Condition.parquet/*.parquet";
```
```
┌─────────┬─────────────────────────────────────────────────────────────────────┐
│   id    │                            abatementAge                             │
│ varchar │ struct(id varchar, "value" decimal(32,6), value_scale integer, co…  │
├─────────┼─────────────────────────────────────────────────────────────────────┤
│ c123    │                                                                     │
└─────────┴─────────────────────────────────────────────────────────────────────┘
```

The "empty" row means the value is NULL. POST again:

```shell
curl -X POST -H "Content-Type: application/fhir+json" --data "@./bundle.json" http://localhost:8082/fhir
```

```sql
SELECT id, abatementAge
FROM "s3://fhir/default/Condition.parquet/*.parquet";
```
```
┌─────────┬─────────────────────────────────────────────────────────────────────┐
│   id    │                            abatementAge                             │
│ varchar │ struct(id varchar, "value" decimal(32,6), value_scale integer, co…  │
├─────────┼─────────────────────────────────────────────────────────────────────┤
│ c123    │                                                                     │
│ c123    │ {'id': NULL, 'value': NULL, 'value_scale': NULL, 'comparator': NU…  │
└─────────┴─────────────────────────────────────────────────────────────────────┘
```

You can see that the most recent version now uses a struct with all of its fields set to NULL.
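A downstream consumer can normalise the two encodings so that a plain NULL and an all-NULL struct count the same way when profiling for missing values. A minimal sketch in Python, modelling rows as dicts (this mirrors the shape of the data above for illustration only; it is not Pathling's or DuckDB's actual in-memory representation):

```python
def is_effectively_null(value):
    """Treat None and a struct whose fields are all (recursively) None
    as equally missing."""
    if value is None:
        return True
    if isinstance(value, dict):
        return all(is_effectively_null(v) for v in value.values())
    return False

# The two encodings observed in the Parquet output above.
rows = [
    {"id": "c123", "abatementAge": None},   # first write: plain NULL
    {"id": "c123", "abatementAge": {        # second write: null-struct
        "id": None, "value": None, "value_scale": None,
        "comparator": None, "unit": None, "system": None, "code": None,
    }},
]

print([is_effectively_null(r["abatementAge"]) for r in rows])  # [True, True]
```

With a normalisation step like this, a profiler would report both rows as missing abatementAge, regardless of which encoding each row happened to receive.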
Hi @chgl! Good news, we were able to reproduce this problem. It turns out that it is related to one of the configuration settings, and I believe you may be able to fix this problem simply by changing that setting's value.

If you are wondering why you are getting multiple rows for the same resource when querying with DuckDB, it is because these are Delta files. Delta is a specialisation of Parquet that includes history. Because you are updating these rows, old versions are being preserved. These will be invisible to a Delta-aware query engine such as Spark (unless you explicitly query the history).

We're working on a new release which will change the default value of this setting to fix the problem.
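The Delta behaviour described above can be illustrated with a small sketch: a Delta-aware reader replays the table's JSON transaction log and only scans files that are still "live", whereas a plain Parquet glob (as in the DuckDB queries above) sees every file, including superseded versions. The log lines below are simplified, hypothetical examples of the _delta_log format, not real Pathling output:

```python
import json

# Simplified, hypothetical Delta transaction log entries: each commit
# adds new Parquet files and removes the ones it supersedes.
log_lines = [
    '{"add": {"path": "part-00000.parquet"}}',
    '{"remove": {"path": "part-00000.parquet"}}',  # superseded by an update
    '{"add": {"path": "part-00001.parquet"}}',
]

# Replay the log in order, tracking which files are currently live.
live_files = set()
for line in log_lines:
    action = json.loads(line)
    if "add" in action:
        live_files.add(action["add"]["path"])
    elif "remove" in action:
        live_files.discard(action["remove"]["path"])

print(sorted(live_files))  # ['part-00001.parquet']
```

A glob over `*.parquet` would read both part files (and thus both versions of the resource), while a log-replaying reader such as Spark with Delta support would only read the one live file.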
Sounds good, thanks for the update @johngrimes!
Describe the bug
Unset FHIR parameters are sometimes encoded as NULL and sometimes as a complex type (a struct with all of its values set to NULL).

For example, none of the Condition resources have abatementAge set, yet sometimes the raw Parquet files contain a NULL there and sometimes a {"id":null,"value":null,"value_scale":null,"comparator":null,"unit":null,"system":null,"code":null,"_value_canonicalized":{"value":null,"scale":null},"_code_canonicalized":null,"_fid":null}. The distribution is roughly 50:50 for a dataset of around 70,000 resources.

I noticed this because https://docs.profiling.ydata.ai/latest/ would consider the NULL a missing value but the null-struct as non-null. Here's another screenshot:

Definitely something I can easily work around; I just thought you might be able to share some insights into why it is happening.

To Reproduce
Import a lot (?) of resources via $import and observe that, seemingly randomly, some unset values are encoded as NULL and some as a null-struct.

Expected behavior
I would expect all values to be encoded consistently. That being said, it's definitely not wrong this way.