Sprint 3: 31 Jan - 4 Feb #23

Closed
2 of 7 tasks
backeb opened this issue Jan 24, 2022 · 17 comments
Comments

@backeb
Contributor

backeb commented Jan 24, 2022

Notes from sprint planning meeting can be found here: https://confluence.egi.eu/display/CSCALE/2022-01-20+Planning+the+next+Aquamonitor+sprint

Objectives

The overarching goal is to work towards running the Aquamonitor workflow on

  1. INCD compute, accessing data available at INCD
  2. INCD compute, accessing data remotely on CREODIAS
    Report on performance differences.

Make the data available for the use case

  • @jdries @MZICloudferro arrange access to CREODIAS object storage, ensure that cost can be reimbursed from CREODIAS VA allocation
  • @jdries configure CREODIAS layer for INCD instance so that the OpenEO workflow can access the data remotely from INCD
  • @zbenta @tiagofglip raise issue on EODAG GitHub re the download and unzip issue
  • @zbenta @tiagofglip @mariojmdavid provide object storage (swift/S3 interfaces), check integration with EGI Checkin.

Progress on Notebook (MVP)

  • @Jaapel Continue testing and improving Notebook using data for whole of Spain on Terrascope
  • @Jaapel Switch to CREODIAS backend and test.
  • @Jaapel Switch to INCD backend and test local data access performance.

cc @gena @Jaapel @gdonvito @mariojmdavid @jopina @jorge-lip

@jdries

jdries commented Jan 31, 2022

@Jaapel I was finally able to add an experimental feature to our backend so that overviews are used when you work at lower resolutions. Here is a very basic example; note the line where I set the 'experimental' feature flag, which is important to make it work. Everything runs against openeo-dev.vito.be:

    import openeo
    from openeo.processes import eq

    # connect to the dev backend mentioned above (authentication may differ per user)
    connection = openeo.connect("https://openeo-dev.vito.be").authenticate_oidc()

    rgb = connection.load_collection("TERRASCOPE_S2_TOC_V2",
            spatial_extent={'west': 3.758216409030558, 'east': 4.087806252, 'south': 51.291835566, 'north': 51.3927399},
            temporal_extent=["2020-03-11", "2020-03-15"], bands=['B04'],
            properties={"eo:cloud_cover": lambda cc: eq(cc, 50)})

    # enable the experimental overview-based reading
    rgb._pg.arguments['featureflags'] = {"experimental": True}

    # specify process graph: temporal minimum, resample to 80 m (EPSG:3857), download
    download = rgb.min_time().resample_spatial(resolution=80, projection=3857).download("/tmp/openeo-rgb-sen2cor-manyclouds-resampled.tiff")

Can you integrate this into your code and do a test run on a larger scale?

By the way, can you confirm that the full processing of Spain will also work at a lower resolution? This is quite important for getting a view on the data needs.

@Jaapel
Contributor

Jaapel commented Feb 1, 2022

@jdries I can try it tomorrow; today I worked on an example with all the data using the .resample method.

Do you know how both resampling and this new experimental feature work with masks or missing data? When upsampling, do NaN values affect the result?

@Jaapel
Contributor

Jaapel commented Feb 1, 2022

Also, caching DataCubes causes missing-metadata errors, as described here, which makes quick iteration on larger datasets difficult. Today's run took ~4 hours to complete. If you have some time this sprint, I can guide you through how I set it up!

@jdries

jdries commented Feb 2, 2022

@Jaapel upsampling can indeed take various approaches to NaN values, but when we speed things up by using the overviews in the native products, we can't control that anymore. Also, for the scene classification, I don't really know what was used to generate the overviews. It will be interesting to compare results.

We clearly need to work on load_result to simplify the caching, but this experimental use of overviews also has the potential to drastically reduce that 4-hour job duration.

@Jaapel
Contributor

Jaapel commented Feb 2, 2022

@jdries is there any place where I can find information about how the resampling / upsampling methods work with NaN / filtered values?

@jdries

jdries commented Feb 3, 2022

It seems that both openEO and GDAL explicitly mention how NODATA/valid pixels are treated, per resampling method:
https://gdal.org/programs/gdalwarp.html#cmdoption-gdalwarp-r
https://processes.openeo.org/#resample_spatial

I have been searching through Sentinel-2 docs, but unfortunately cannot find which resampling method is used to generate overviews.
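
In the meantime, here is a minimal sketch of pinning the resampling method explicitly on our side, so that at least the openEO resampling step has deterministic NODATA behaviour. It assumes the rgb cube from the example above and that the backend accepts the 'average' method:

    # Sketch only: request an explicit resampling method instead of the default
    # nearest neighbour, so the per-method NODATA handling documented above applies.
    resampled = rgb.resample_spatial(resolution=80, projection=3857, method="average")
    resampled.download("/tmp/openeo-rgb-average-80m.tiff")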

@Jaapel
Contributor

Jaapel commented Feb 3, 2022

This is great, @jdries! Let me see if I can improve the masking in the algorithm.

@backeb
Contributor Author

backeb commented Feb 4, 2022

Retrospective

Tops

Tips

  • ...

Review objectives

The overarching goal is to work towards running the Aquamonitor workflow on

  1. INCD compute, accessing data available at INCD
  2. INCD compute, accessing data remotely on CREODIAS
    Report on performance differences.

Make the data available for the use case

  • provide object storage (swift/S3 interfaces), check integration with EGI Check-in
    • didn't have time to test it yet
    • in principle it should work
    • need an access token to authenticate against the S3/swift endpoint via Check-in
    • check how to do this from Python with EGI Check-in etc. (see the sketch after this list)
  • raise issue on EODAG GitHub re the download and unzip issue
  • CREODIAS has files unzipped on object storage
    • but for downloads and transfers over HTTP they zip them
    • ❗ If we can sort out the VA amendment for CREODIAS, it would be much easier to get the data
  • arrange access to CREODIAS object storage, ensure that cost can be reimbursed from CREODIAS VA allocation
    • meeting took place and planning to test
  • configure CREODIAS layer for INCD instance so that the OpenEO workflow can access the data remotely from INCD
    • in progress
  • in terms of optimisation we can work with lower-resolution data, which has performance impacts for transfer, analysis etc.
    • outstanding questions about resampling
    • @Jaapel consider mailing Copernicus helpdesk
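
A minimal sketch of what Python-side access to the S3/swift endpoint could look like once credentials are sorted out. The endpoint URL and the way credentials are derived from an EGI Check-in token are assumptions, not the actual INCD setup:

    import boto3

    # All values below are placeholders: the real INCD endpoint and the way an
    # EGI Check-in access token maps to S3 credentials still need to be confirmed.
    s3 = boto3.client(
        "s3",
        endpoint_url="https://s3.incd.example.org",
        aws_access_key_id="ACCESS_KEY_FROM_CHECKIN",
        aws_secret_access_key="SECRET_KEY_FROM_CHECKIN",
    )

    # List a few objects to confirm that endpoint and credentials work.
    for obj in s3.list_objects_v2(Bucket="aquamonitor-test").get("Contents", [])[:5]:
        print(obj["Key"], obj["Size"])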

Progress on Notebook (MVP)

  • Continue testing and improving Notebook using data for whole of Spain on Terrascope
    • have results for half of Spain
    • updated visualisation
    • notebook in good state to test full dataset - main analysis ready for testing
    • working on optimisation
  • Switch to CREODIAS backend and test.
  • Switch to INCD backend and test local data access performance.
    • dependent on data available on INCD

Current objective = make the data available

other actions

follow up progress meeting

18 Feb 15h00 CET

@zbenta

zbenta commented Feb 18, 2022

We have recreated the STAC server to use an NFS-mounted PVC.
We have also recreated the spark-executor/driver to mount the same NFS-enabled PVC at the /opt/workdir/ path.
The Python script we created to download the data is running on the STAC server and is currently downloading the data into said NFS-enabled PVC.
We believe this solution is the best one to give the spark-executor/driver access to the downloaded products.
The remaining work is to somehow expose the data inside the spark-executor/driver pod as an existing collection, so that @Jaapel can use his Jupyter notebook to process the data.

@backeb
Contributor Author

backeb commented Feb 18, 2022

Progress meeting: 18 Feb

provide object storage (swift/S3 interfaces), check integration with EGI Checkin

Make data available on INCD

Update
#23 (comment)

  • recreated the STAC server to use an NFS-mounted PVC
  • recreated the spark-executor/driver to mount the same NFS-enabled PVC at the /opt/workdir/ path
  • the Python script that downloads the data is running on the STAC server and is currently downloading into said NFS-enabled PVC
  • we believe this solution is the best one to give the spark-executor/driver access to the downloaded products
  • downloading from the CREODIAS provider via the free access service (sequential downloads of zip files)

Next steps

  • enable access to the data inside the spark-executor/driver pod as an existing collection, so that @Jaapel can use his Jupyter notebook to process the data
    • So far the data is only downloaded, not connected to the STAC server
    • Still have issues with the EODAG STAC server implementation (can download data but not extract zip files; see the unzip sketch after this list)
    • Need STAC catalog to point OpenEO towards the data
      • Workaround discussed and dismissed as an option
    • ❗ The bottleneck is not setting up the STAC catalog, but ingesting and indexing the data is difficult 🤯
    • Options for STAC catalog to explore
      1. RESTO: https://github.com/jjrom/resto/blob/master/INSTALLATION.md
        • Deployment based on docker (easy to convert to K8s deployment)
        • RESTO will also need to index data
        • @sustr4 do you know how to ingest and index data in RESTO STAC catalog deployment?
        • @sustr4 check with CREODIAS who have a RESTO STAC catalog running in production
      2. Follow up with EODAG: extract downloaded STAC assets CS-SI/eodag#391
        • @zbenta carefully 😄 ask how long it might take for them to implement a solution for the unzipping
      3. Redeploy VITO catalog
        • @jdries to ask if possible and estimate effort required to make VITO catalog available
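
A rough sketch of the kind of post-download workaround discussed above: unpack the zipped products that the download job leaves behind, using only the standard library. The download directory is a placeholder:

    import zipfile
    from pathlib import Path

    # Placeholder path: wherever the download job writes the zipped products.
    download_dir = Path("/opt/workdir/downloads")

    for zip_path in download_dir.glob("*.zip"):
        target_dir = zip_path.with_suffix("")   # directory named after the archive
        if not target_dir.exists():
            with zipfile.ZipFile(zip_path) as zf:
                zf.extractall(target_dir)
            print(f"extracted {zip_path.name} -> {target_dir}")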

sort out the VA amendment for CREODIAS to get access to object storage

configure CREODIAS layer for INCD instance so that the OpenEO workflow can access the data remotely from INCD

  • In progress
  • @jdries to update at next meeting

in terms of optimisation we can work with lower-resolution data, which has performance impacts for transfer, analysis etc.

  • @Jaapel to check how resampling is done, i.e. how data is saved so we know at what scale we can pull data out.
  • Working on experimental approach developed by @jdries

Continue testing and improving Notebook using data for whole of Spain on Terrascope

  • 2 outstanding issues for @Jaapel to solve:
    • 1. Caching data in background
    • 2. Resampling and loading coarser resolution Sentinel-2 for optimisation

Switch to CREODIAS backend and test.

See above dependency

Switch to INCD backend and test local data access performance.

See above dependency

Next meeting

9 March 4-5pm

@backeb
Contributor Author

backeb commented Feb 18, 2022

Hi all,

I have been able to test remote S3 access to CreoDIAS in openEO, and got it to work.

The main next step is for INCD to get an S3 access key and secret key for use in the use case, but I guess we need to wait for the amendment to the VA?

After that, to go further:
INCD (Zacarias) will have to update openEO to the latest version. Quite a lot has changed since we did the initial deploy, and I still needed a small change to get it working.
Then we'll need to add a few environment variables for the connection to CreoDIAS:
    AWS_S3_ENDPOINT: "s3.cloudferro.com"
    AWS_DIRECT: "TRUE"
    AWS_ACCESS_KEY_ID: "THE KEY ID"
    AWS_SECRET_ACCESS_KEY: "SECRET"
    AWS_DEFAULT_REGION: "RegionOne"
    AWS_REGION: "RegionOne"
    AWS_HTTPS: "YES"
    AWS_VIRTUAL_HOSTING: "FALSE"

This will need to happen in a yaml file similar to this one:
https://github.com/Open-EO/openeo-geotrellis-kubernetes/blob/master/kubernetes/openeo.yaml
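
As a quick sanity check before touching the deployment yaml, something like the following could verify from Python that those settings give direct S3 read access; the bucket and object key are placeholders, and it assumes rasterio/GDAL is available:

    import rasterio

    # Same settings as above, passed as GDAL config options; the object key is a
    # placeholder and should point at a real Sentinel-2 band on CreoDIAS storage.
    with rasterio.Env(
        AWS_S3_ENDPOINT="s3.cloudferro.com",
        AWS_HTTPS="YES",
        AWS_VIRTUAL_HOSTING="FALSE",
        AWS_ACCESS_KEY_ID="THE KEY ID",
        AWS_SECRET_ACCESS_KEY="SECRET",
        AWS_DEFAULT_REGION="RegionOne",
    ):
        with rasterio.open("/vsis3/some-bucket/path/to/B04.jp2") as src:
            print(src.profile)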

After that, we should be able to use layers from CreoDIAS on INCD.

best regards,
Jeroen

@backeb
Contributor Author

backeb commented Mar 9, 2022

provide object storage (swift/S3 interfaces), check integration with EGI Checkin

  • more or less documented in EOSC Synergy

poster for Portugal Copernicus meeting (first national Copernicus conference) on 22/23 March

  • @mariojmdavid preparing poster
  • aim to share 10/11 March for people to comment

Make data available on INCD

  • WP2 working on implementing centralised STAC catalogue where providers can register their data
  • How it works
    • Central webservice
    • REST interface allowing to add new collections and metadata
    • Requires an indexing job to be run locally to index data at the provider and create the STAC metadata (a rough sketch follows this list)
    • Information sent to central STAC catalogue
    • The STAC catalogue then points to data on provider
    • @sustr4: CESNET and INFN still have hours available in WP3 - can they help with the development of the "indexing script"?
  • @jdries coordinate a brainstorming session about the indexing script
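
To make the "indexing script" discussion a bit more concrete, a rough sketch of what indexing one downloaded product into STAC metadata could look like with pystac. The identifiers, paths and coordinates are placeholders; a real job would parse them from the product metadata at the provider:

    from datetime import datetime
    import pystac

    # All identifiers, paths and coordinates below are placeholders.
    item = pystac.Item(
        id="S2A_MSIL2A_20200311_T29SND",
        geometry={"type": "Polygon",
                  "coordinates": [[[-9.0, 38.0], [-8.0, 38.0], [-8.0, 39.0],
                                   [-9.0, 39.0], [-9.0, 38.0]]]},
        bbox=[-9.0, 38.0, -8.0, 39.0],
        datetime=datetime(2020, 3, 11),
        properties={},
    )
    item.add_asset("B04", pystac.Asset(
        href="/opt/workdir/downloads/S2A_MSIL2A_20200311_T29SND/B04.jp2",
        media_type=pystac.MediaType.JPEG2000,
    ))

    catalog = pystac.Catalog(id="incd-sentinel2",
                             description="Locally downloaded Sentinel-2 products")
    catalog.add_item(item)
    catalog.normalize_and_save("/opt/workdir/stac",
                               catalog_type=pystac.CatalogType.SELF_CONTAINED)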

sort out the VA amendment for CREODIAS to get access to object storage

  • will be added to the amendment, but this is not yet certain
  • ❗ this is a blocker for credentials; we need project credentials for scaling

blockers

  • centralised STAC catalogue service
  • project credentials from CloudFerro related to VA amendment
  • ❗ ❗ cannot progress on use case with these blockers
  • @sustr4 please update on progress related to centralised STAC catalogue service
  • @cchatzikyriakou please advise on way forward re VA amendment and project credentials from CloudFerro
  • VA allocation system not particularly agile... maybe something to report on for the EC

configure CREODIAS layer for INCD instance so that the OpenEO workflow can access the data remotely from INCD

  • configured with own credentials and tested locally
  • need project credentials to scale - see above blocker

in terms of optimisation we can work with lower-resolution data (performance impacts for transfer, analysis etc.) & continue testing and improving the Notebook using data for the whole of Spain on Terrascope

  • @Jaapel work on resampling and loading coarser resolution Sentinel-2 for optimisation

reminder of objectives

  1. Switch to CREODIAS backend and test remote access from OpenEO.
  2. Switch to INCD backend and test local data access performance.
  3. report on performance differences

next steps

  • @jdries coordinate meeting with WP2 about requirements for centralised STAC catalogue. include @mariojmdavid @zbenta ++
  • @backeb discuss how to get project credentials from CloudFerro in the meantime (so we don't have to wait for the VA amendment)

Follow up meeting: 25 March, 12h00 CET

@sustr4

sustr4 commented Mar 9, 2022

Requires an indexing job to be run locally to index data at provider and create the STAC metadata

I know we don't get a fresh start, but shouldn't the data be registered in STAC by the downloader? I guess it knows what it just downloaded, right? We can think of a one-time solution to register what has already been downloaded before, but that would be a one-time hack.
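
If the downloader itself registers each product, one option (assuming the central catalogue ends up exposing the STAC API Transactions extension) would be a simple POST after every successful download; the URL and Item payload below are placeholders:

    import requests

    # Placeholder catalogue URL; assumes the central STAC API exposes the
    # Transactions extension (POST of an Item into a collection).
    stac_api = "https://stac.example.org"

    # Minimal STAC Item payload; a real downloader would build this from the
    # product it just fetched (e.g. with pystac, as in the sketch further up).
    item = {
        "type": "Feature",
        "stac_version": "1.0.0",
        "id": "S2A_MSIL2A_20200311_T29SND",
        "geometry": None,
        "properties": {"datetime": "2020-03-11T00:00:00Z"},
        "assets": {},
        "links": [],
    }

    resp = requests.post(f"{stac_api}/collections/sentinel-2-l2a/items", json=item)
    resp.raise_for_status()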

please update on progress related to centralised STAC catalogue service

User/Access management seems to be the greatest issue now. Would it be possible to set up an IP filter before we have proper access control? That would mean someone (INCD?) specifying IP addresses (ranges) that can access the catalogue. Just asking: It may not be needed in the end.

@mariojmdavid

mariojmdavid commented Mar 9, 2022 via email

@sebastian-luna-valero
Contributor

Should we close this one?

@backeb
Contributor Author

backeb commented Oct 25, 2022 via email

@sebastian-luna-valero
Contributor

Thanks!
