pdbufr does not parse data correctly. #25

Closed
meteoDaniel opened this issue Feb 14, 2021 · 4 comments

@meteoDaniel

meteoDaniel commented Feb 14, 2021

I know dealing with BUFR is a mess, and my first impression of pdbufr is good. But it cannot yet handle one of the main problems of observation data encoded in BUFR. My intention is to work on this issue together with you, to give the world a BUFR reader that is as good as cfgrib.

Take a look at the attached data and you will see that airTemperature is defined multiple times within one subset. (In my experience, the reports are separated into subsets.) So I thought I could use the pdbufr filter to access the right temperature.

df = pdbufr.read_bufr(
    'Z__C_EDZW_20210214100000_bda01,synop_bufr_GER_999999_999999__MW_536.bin',
    columns=('airTemperature'),
    filters={'heightOfSensorAboveLocalGroundOrDeckOfMarinePlatform': [2.0]},
    required_columns=False)

This results in an empty DataFrame. I also need to filter by timePeriod, but that does not work either.

In my own implementation

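import json
import os
import subprocess

import pandas as pd

# Dump the BUFR file as flat JSON with bufr_dump.
# local_bufr_file and SYNOP_DATA_KEY_MESSAGES are defined elsewhere.
parsed_bufr_data = subprocess.run(
    f"{os.environ['BUFR_DUMP_PATH']} -jf {local_bufr_file}",
    stdout=subprocess.PIPE,
    check=True,
    shell=True,
).stdout

# Decode the bytes, parse the JSON and load the messages into a DataFrame.
synop_df = pd.DataFrame(
    json.loads(
        parsed_bufr_data.decode("utf-8", errors="ignore")
    )[SYNOP_DATA_KEY_MESSAGES])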

I use bufr_dump, read its output as a bytes object, parse it as JSON and load it into a DataFrame.
Then I loop through the lines and store each timePeriod and heightOfSensor value so that I can map them to the measurements. The rule is that the most recent sensor information and/or time period information applies to the value. I guess this behaviour should be implemented behind the filter function, too.
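
The carry-forward rule described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not pdbufr's implementation: the function name `attach_context` and the flat list of `(key, value)` pairs are hypothetical stand-ins for the decoded content of one subset, and the sample values are made up.

```python
# Hypothetical sketch: carry the latest coordinate values forward onto each measurement.
CONTEXT_KEYS = (
    "heightOfSensorAboveLocalGroundOrDeckOfMarinePlatform",
    "timePeriod",
)


def attach_context(items, value_keys):
    """Return one record per measurement with the context valid at that point.

    `items` is an ordered list of (key, value) pairs from one subset (assumed layout).
    """
    context = {key: None for key in CONTEXT_KEYS}
    records = []
    for key, value in items:
        if key in context:
            # A new coordinate value replaces the previous one.
            context[key] = value
        elif key in value_keys:
            # Snapshot the context that is valid for this measurement.
            records.append({"key": key, "value": value, **context})
    return records


# Made-up sample subset: the same key appears twice with different sensor heights.
subset = [
    ("heightOfSensorAboveLocalGroundOrDeckOfMarinePlatform", 2.0),
    ("airTemperature", 273.5),
    ("heightOfSensorAboveLocalGroundOrDeckOfMarinePlatform", 0.05),
    ("timePeriod", -12),
    ("airTemperature", 271.0),
]
records = attach_context(subset, {"airTemperature"})
for record in records:
    print(record)
```

Each airTemperature record then carries the sensor height and time period that were in effect when it was read, which is the information needed to tell the 2 m value from the 0.05 m one.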

Why do I title this issue "pdbufr does not parse data correctly"?
-> It is not clear which airTemperature is parsed (2 m or 0.05 m), but from my point of view this information is mandatory to parse the data correctly.

Another point: during my investigation of eccodes+python and bufr_dump, I found that bufr_dump is much faster than using the eccodes Python interface (or what is suggested in the eccodes documentation).

@alexamici

Z__C_EDZW_20210214100000_bda01.synop_bufr_GER_999999_999999__MW_536.zip

@iainrussell
Member

Hi @meteoDaniel,

Many thanks for this report! I hope I have good news for you :)

First, it seems that you found an interesting behaviour in pdbufr: it expects 'proper' tuples as input. If you omit the trailing comma, as you do in columns=('airTemperature'), then Python passes a plain string rather than a one-element tuple, and pdbufr does not handle that properly. So change that line to columns=('airTemperature',) to get something generated by pdbufr! I'll add an issue to get that fixed.

The second point is that we have quite a large code refactor waiting to be released. If you're in a position to install pdbufr from git, I encourage you to do so and use the latest master branch. We used data very similar to yours to develop and test it, so we expect it to work with this version. When I do this, I get sensible results from your filter (I hand-checked a few using Metview's BUFR examiner, otherwise known as CodesUI in its standalone form).
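
The trailing-comma point is plain Python behaviour and can be checked without pdbufr. A minimal sketch (the variable names are just for illustration):

```python
# Parentheses alone do not create a tuple; the trailing comma does.
without_comma = ('airTemperature')
with_comma = ('airTemperature',)

print(type(without_comma).__name__)  # str
print(type(with_comma).__name__)     # tuple

# Iterating over the string yields single characters, not column names.
print(list(without_comma)[:3])       # ['a', 'i', 'r']
```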

I'd be interested to know if these tips allow you to get what you need from pdbufr.

Cheers,
Iain

@meteoDaniel
Author

@iainrussell thanks for this update. Filtering for the 2 m temperature, for example, works now.

I want to give you an update of my investigations:

  1. Filtering maximumWindGustSpeed by timePeriod is not supported (in the version I pulled from the repo yesterday).
df = pdbufr.read_bufr(
    'Z__C_EDZW_20210214100000_bda01,synop_bufr_GER_999999_999999__MW_536.bin',
    columns=('maximumWindGustSpeed', 'stationNumber', 'data_datetime'),
    filters={'timePeriod': [10.]},
    required_columns=False)
  2. To extract all the information correctly, I think the best strategy (please correct me if I am wrong) is to parse the BUFR file for metadata first:
meta_data = pdbufr.read_bufr(
    'Z__C_EDZW_20210214100000_bda01,synop_bufr_GER_999999_999999__MW_536.bin',
    columns=('latitude', 'longitude', 'stationNumber', 'stationOrSiteName', 'heightOfStationGroundAboveMeanSeaLevel', 'data_datetime'),
    required_columns=False)

And then read a value (here airTemperature) together with stationNumber, so that meta_data can be mapped to the measurements:

df = pdbufr.read_bufr(
    'Z__C_EDZW_20210214100000_bda01,synop_bufr_GER_999999_999999__MW_536.bin',
    columns=('airTemperature', 'stationNumber', 'data_datetime'),
    filters={'heightOfSensorAboveLocalGroundOrDeckOfMarinePlatform': 2.0},
    required_columns=False)

I think the point is that this strategy only works if I can match each measurement to the correct stationNumber.
What do you think?
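
As a rough sketch of the mapping step, the two results could be joined on their shared columns with pandas. The two frames below are synthetic stand-ins for the `meta_data` and `df` returned by `pdbufr.read_bufr` (all values are made up), and joining on `stationNumber` plus `data_datetime` is an assumption about which columns identify a report.

```python
import pandas as pd

# Synthetic stand-ins for the two pdbufr.read_bufr results (values are made up).
meta_data = pd.DataFrame({
    "stationNumber": [381, 400],
    "data_datetime": pd.to_datetime(["2021-02-14 10:00", "2021-02-14 10:00"]),
    "latitude": [52.5, 53.0],
    "longitude": [13.3, 10.0],
})
df = pd.DataFrame({
    "stationNumber": [381, 400],
    "data_datetime": pd.to_datetime(["2021-02-14 10:00", "2021-02-14 10:00"]),
    "airTemperature": [273.5, 272.1],
})

# Join measurements to station metadata on the shared key columns.
# validate="many_to_one" raises if the metadata keys are not unique.
merged = df.merge(
    meta_data,
    on=["stationNumber", "data_datetime"],
    how="left",
    validate="many_to_one",
)
print(merged)
```

The `validate` argument makes the join fail loudly when a key matches more than one metadata row, which is exactly the case where the mapping would be ambiguous.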

I have also tested other timePeriod variables, and they did not work either.

@iainrussell
Member

Hi @meteoDaniel,

Glad the fix is working!

For timePeriod, I think you may just have a typo? There is no timePeriod of 10.0 in the data (I believe), but there is -10, and putting that in the filter instead of 10 works for me. I can also query all the unique values of timePeriod like this:

import numpy as np
import pdbufr

# use a slice to represent a range of values
df = pdbufr.read_bufr(
    'Z__C_EDZW_20210214100000_bda01,synop_bufr_GER_999999_999999__MW_536.bin',
    columns=('maximumWindGustSpeed', 'stationNumber', 'data_datetime', 'timePeriod'),
    filters={'timePeriod': slice(-10000.0, 10000.0)},
    required_columns=False)
print(df)
# print the unique timePeriod values
print(np.unique(df.timePeriod))

and I get this:

-1800 
-360
-60
-30
-24
-12
-10
-1 
 0

Metview also agrees that these are all the values in the file.

For your larger question, I think I'm getting a bit lost in terms of what you want. I can indeed see that this is a complicated BUFR file, so it would be good to be able to handle it properly. Can you describe what information you'd like to retrieve from it please?

Many thanks!
Iain

@meteoDaniel
Author

Thanks a lot for your support. I will investigate pdbufr further later on.
