MetaFileNotFound because of missing root certificates #827

Closed
scherbinek opened this issue Dec 16, 2022 · 12 comments

@scherbinek

scherbinek commented Dec 16, 2022

Hey!

I got the same error as described in #678

Describe the bug

Traceback (most recent call last):
  File "/opt/airflow/dags/dwd_kl_daily.py", line 67, in <module>
    r1 = DwdObservationRequest(
  File "/home/airflow/.local/lib/python3.9/site-packages/wetterdienst/core/scalar/request.py", line 624, in all
    df = self._all().copy().reset_index(drop=True)
  File "/home/airflow/.local/lib/python3.9/site-packages/wetterdienst/provider/dwd/observation/api.py", line 561, in _all
    df = create_meta_index_for_climate_observations(dataset, self.resolution, period)
  File "/home/airflow/.local/lib/python3.9/site-packages/wetterdienst/provider/dwd/observation/metaindex.py", line 84, in create_meta_index_for_climate_observations
    meta_index = _create_meta_index_for_climate_observations(dataset, resolution, period)
  File "/home/airflow/.local/lib/python3.9/site-packages/wetterdienst/provider/dwd/observation/metaindex.py", line 142, in _create_meta_index_for_climate_observations
    meta_file = _find_meta_file(files_server, url, ["beschreibung", "txt"])
  File "/home/airflow/.local/lib/python3.9/site-packages/wetterdienst/provider/dwd/observation/metaindex.py", line 170, in _find_meta_file
    raise MetaFileNotFound(f"No meta file was found amongst the files at {url}.")
wetterdienst.exceptions.MetaFileNotFound: No meta file was found amongst the files at https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/daily/kl/recent/.

To Reproduce
Nothing special. Just a simple request which works locally on my computer.

from os import environ
environ['WD_CACHE_DISABLE'] = 'True'
from wetterdienst.provider.dwd.observation import DwdObservationRequest
from wetterdienst import Settings

Settings.cache_disable = True
r1 = DwdObservationRequest(
                  parameter=['climate_summary'],
                  resolution='daily',
                  period='recent'
).all()

Desktop (please complete the following information):

  • OS: apache/airflow:2.5.0-python3.9
  • Python-Version 3.9

Additional context
The script works perfectly fine on my local computer, but crashes with the above-mentioned error on a server instance within a Docker container of Apache Airflow. I already switched off the cache to avoid any issues. However, wetterdienst.info() refers to a location at /home/airflow/.cache/wetterdienst, which doesn't exist as a folder (and creating the wetterdienst folder did not solve it). The airflow log refers to wetterdienst.util.fsspec_monkeypatch - INFO - Dircache located at /root/.cache/wetterdienst which doesn't exist as folder (and wasn't solved by creating the wetterdienst folder).

It seems that fsspec tries to resolve a cache directory for parsing the metadata file from the URL, but receives an empty list of files, which leads to the error; it doesn't even seem to request the content of the URL. The dircache at /root/.cache/ seems misleading, as the process shouldn't be running as root. So my best guess is an authorization issue in a Linux-based context, related to the fsspec_monkeypatch cache.

I'll give it another try tomorrow, debug the issue, and share my results. I am thankful for any hints, though. Initially, I searched for an environment variable to override the fsspec cache location.
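To illustrate why an empty file listing surfaces as this particular exception, here is a minimal sketch. It is not wetterdienst's actual code: the function name `find_meta_file_sketch`, the stand-in exception class, and the example file names are assumptions made for illustration only. The argument list mirrors the `_find_meta_file(files_server, url, ["beschreibung", "txt"])` call visible in the traceback.

```python
class MetaFileNotFound(FileNotFoundError):
    """Stand-in for wetterdienst.exceptions.MetaFileNotFound (hypothetical)."""


def find_meta_file_sketch(files, url, strings):
    """Return the first file whose name contains all `strings`, else raise."""
    for file in files:
        name = file.split("/")[-1].lower()
        if all(s in name for s in strings):
            return file
    raise MetaFileNotFound(f"No meta file was found amongst the files at {url}.")


url = "https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/daily/kl/recent/"

# A populated listing (illustrative file names) yields the description file.
listing = [
    url + "KL_Tageswerte_Beschreibung_Stationen.txt",
    url + "tageswerte_KL_00001_akt.zip",
]
found = find_meta_file_sketch(listing, url, ["beschreibung", "txt"])

# An empty listing, whatever its cause, ends in the same exception.
try:
    find_meta_file_sketch([], url, ["beschreibung", "txt"])
    empty_listing_raised = False
except MetaFileNotFound:
    empty_listing_raised = True
```

Under these assumptions, the exception only says that nothing matched. It does not reveal whether the directory was really empty or whether the listing could not be fetched at all.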

Regards,
Marcel

@amotl
Member

amotl commented Dec 16, 2022

Dear @scherbinek,

thank you for the excellent report. @larsrinn recently reported a similar thing at #704: the cache control environment variables WD_CACHE_DISABLE and WD_CACHE_DIR, which were introduced with version 0.18.0 [1], would not be honored correctly.

However, when looking for them in the current state of the code base, I cannot find either of them. It looks like 9c7cee5 got lost somehow? Do you have any clue about it, @gutzbenj?

With kind regards,
Andreas.

Footnotes

  1. https://github.com/earthobservations/wetterdienst/blob/f6d82088891e00e57dcfa5d7c71e98993690713d/CHANGELOG.rst#0180-04052021

@amotl
Member

amotl commented Dec 16, 2022

Oh, the code is there, but because the prefix WD_ is handled in a separate line of code, I have not been able to spot it.

with self.env.prefixed("WD_"):
    # cache
    # for initial printout we need to work with _cache_disable and
    # Check out this: https://florimond.dev/en/posts/2018/10/reconciling-dataclasses-and-properties-in-python/
    self.cache_disable: bool = self.env.bool("CACHE_DISABLE", False)
    self.cache_dir: pathlib.Path = self.env.path(
        "CACHE_DIR", platformdirs.user_cache_dir(appname="wetterdienst")
    )
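
The effect of the separate prefix can be sketched with the standard library alone. This is not the settings code quoted above: the helper `read_prefixed_bool` is hypothetical, and the set of accepted truthy strings is an assumption, so the real parser may accept other values. It only shows why searching the code base for the full name `WD_CACHE_DISABLE` finds nothing, even though that is the variable being read.

```python
import os


def read_prefixed_bool(prefix, name, default=False, environ=os.environ):
    """Look up `prefix + name` and interpret it as a boolean (hypothetical helper)."""
    raw = environ.get(prefix + name)
    if raw is None:
        return default
    # Assumed set of truthy spellings; the real library may differ.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


# Use an explicit mapping here instead of mutating the real environment.
env = {"WD_CACHE_DISABLE": "True"}

# The code only mentions "CACHE_DISABLE"; the prefix is applied elsewhere.
cache_disable = read_prefixed_bool("WD_", "CACHE_DISABLE", environ=env)
unset_value = read_prefixed_bool("WD_", "SOME_UNSET_FLAG", environ=env)
```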

@amotl
Member

amotl commented Dec 16, 2022

Oh, and I also spotted this one. Not sure whether use_listings_cache=True is "always on" here, even when running with cache disabled?

real_cache_dir = os.path.join(Settings.cache_dir, "fsspec", key)
filesystem_real = HTTPFileSystem(use_listings_cache=True, client_kwargs=Settings.fsspec_client_kwargs)
if Settings.cache_disable or ttl is CacheExpiry.NO_CACHE:
    filesystem_effective = filesystem_real
else:
    filesystem_effective = WholeFileCacheFileSystem(
        fs=filesystem_real, cache_storage=real_cache_dir, expiry_time=ttl_value
    )

Edit: I've addressed this with GH-828, but I think this is only a cosmetic issue, and not responsible for any functional flaw.

@amotl
Member

amotl commented Dec 16, 2022

I've exercised your scenario using the following program, using Wetterdienst 0.50.0, on both macOS and within a Docker container.

#
# Synopsis:
#
#   docker run --rm -it python:3.10-bullseye bash
#   pip install wetterdienst
#   python example-827.py
#
import logging

from wetterdienst import Settings
from wetterdienst.provider.dwd.observation import DwdObservationRequest


logger = logging.getLogger(__name__)


def process():
    Settings.cache_disable = True
    r1 = DwdObservationRequest(
                      parameter=['climate_summary'],
                      resolution='daily',
                      period='recent'
    ).all()
    print(r1)


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    process()

Using Settings.cache_disable = True to turn off caching works perfectly well for me [1]. I am able to confirm that no directory has been created at either /Users/amo/Library/Caches/wetterdienst (macOS) or /root/.cache/wetterdienst (Linux/Docker) after running that program.

Maybe you can share more details about your Docker environment, as driven by Airflow? Are any special parameters or options being used?

Which versions of Wetterdienst and Docker are you running?

Footnotes

  1. so does environ['WD_CACHE_DISABLE'] = 'True'.

@amotl
Member

amotl commented Dec 16, 2022

Maybe it was really just an upstream error / fluke?

wetterdienst.exceptions.MetaFileNotFound: No meta file was found amongst the files at https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/daily/kl/recent/.

@amotl
Member

amotl commented Dec 16, 2022

The airflow log refers to wetterdienst.util.fsspec_monkeypatch - INFO - Dircache located at /root/.cache/wetterdienst which doesn't exist as folder (and wasn't solved by creating the wetterdienst folder).

That log message was misleading, it will be fixed with GH-828. Thank you.

@scherbinek
Author

Hi @amotl

Thank you for your detailed analysis and description. I also tested your Docker setup, including example-827.py, and can confirm that it works. It even works with my server setup locally, but throws the mentioned error on my server. Testing it step by step, locally and on my server, led me to the actual error.

The error seemed so simple that I curled the website on my server, and everything was fine. But I had not tried to curl the requested website https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/daily/kl/recent/ from within my Docker setup on the server.

airflow@653d5258586b:/opt/airflow/dags$ curl https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/daily/kl/recent/
curl: (77) error setting certificate verify locations: CAfile: /etc/ssl/certs/ca-certificates.crt CApath: /etc/ssl/certs

To use a secure connection, I use SSL certificates on my server and mount them into the Docker container as well. Locally, I had commented this out, as it runs on localhost.

volumes:
- ./dags:/opt/airflow/dags
- ./logs:/opt/airflow/logs
- ./plugins:/opt/airflow/plugins
- /var/run/docker.sock:/var/run/docker.sock
- /usr/share/pki/trust/anchors:/etc/ssl/certs

But since I only mount the server's /usr/share/pki/trust/anchors, which contains no CA certificates for verifying the SSL certificates of other websites, I can't request the website with default SSL verification. Thus, I only have to mount my server's certificates to a folder other than /etc/ssl/certs, because mounting them there hides all CA certificates of the Docker container.

- /usr/share/pki/trust/anchors:/usr/share/pki/trust/anchors

And it works. It was a tricky one, as I never thought that the URL couldn't be verified and that this would surface as the error mentioned initially. Additionally, it is not the only URL I request: I also use the Genesis API of the Statistisches Bundesamt, without problems.

Hopefully this issue serves as a hint for setups similar to mine. Sorry for any inconvenience, as the problem was self-inflicted. I like dealing with SSL and certificates even less now. The issue can be closed, unless you have any follow-up questions.
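
For anyone hitting something similar, a quick check from inside the container might look like the sketch below. It uses only Python's standard `ssl` module to report where the linked OpenSSL expects its CA bundle and whether those locations exist. The helper name `ca_locations` is made up for this example, and the paths Python reports are not necessarily the ones curl is configured with.

```python
import os
import ssl


def ca_locations():
    """Report OpenSSL's compiled-in CA locations and whether they exist."""
    paths = ssl.get_default_verify_paths()
    return {
        # Hard-coded bundle file and certificate directory of the linked OpenSSL.
        "cafile": (paths.openssl_cafile, os.path.isfile(paths.openssl_cafile)),
        "capath": (paths.openssl_capath, os.path.isdir(paths.openssl_capath)),
    }


for name, (location, exists) in ca_locations().items():
    print(f"{name}: {location} ({'found' if exists else 'missing'})")
```

If the bundle file is reported as missing, or the directory exists but holds only private certificates, HTTPS requests with default verification are likely to fail, as in the curl output above.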

Regards,
Marcel

@amotl
Member

amotl commented Dec 25, 2022

Hi @scherbinek,

thank you for your response, I am happy it works for you now. However, I will reopen this issue, because I would like to investigate if we should include the certifi package as a dependency, and if this would have improved the situation in your case.

With kind regards,
Andreas.

@amotl reopened this on Dec 25, 2022

@gutzbenj
Member

gutzbenj commented Feb 26, 2023

I think we can close this.

certifi is already indirectly among our dependencies (probably through fsspec/requests) and the issue can't be resolved by installing certifi but rather by linking it to system installed certificates.

@amotl
Member

amotl commented Mar 4, 2023

The issue can't be resolved by installing certifi but rather by linking it to system installed certificates.

I was about to agree, but wasn't fully convinced [1], so I just looked up the topic in the corresponding urllib3 and aiohttp documentation.

urllib3

It looks like there is an option to make urllib3 use the certificates from the certifi package, and it is well documented.

Unless otherwise specified urllib3 will try to load the default system certificate stores. The most reliable cross-platform method is to use the certifi package which provides Mozilla’s root certificate bundle.

Once you have certificates, you can create a PoolManager that verifies certificates when making requests:

>>> import certifi
>>> import urllib3
>>> http = urllib3.PoolManager(
...     cert_reqs='CERT_REQUIRED',
...     ca_certs=certifi.where()
... )

-- https://urllib3.readthedocs.io/en/stable/user-guide.html#certificate-verification

aiohttp

It looks like aiohttp does not document how to use certificates from certifi. Evaluate "aiohttp" with "certifi" has a corresponding example program; its gist is:

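import asyncio
import ssl
import aiohttp
import certifi

async def main():
    # Trust the CA bundle shipped with certifi instead of the system store.
    sslcontext = ssl.create_default_context(cafile=certifi.where())
    async with aiohttp.ClientSession() as session:
        async with session.get("https://www.hrw.org/", ssl=sslcontext) as response:
            print(response.status)

asyncio.run(main())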
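
Building on the two snippets above, a small helper could prefer the certifi bundle when the package is installed and otherwise fall back to the system store. This is only a sketch: `build_ssl_context` is a hypothetical name, and nothing here reflects how wetterdienst, fsspec, or aiohttp configure TLS internally.

```python
import ssl


def build_ssl_context(cafile=None):
    """Create a verifying SSL context, preferring certifi's bundle if available."""
    if cafile is None:
        try:
            import certifi  # optional dependency

            cafile = certifi.where()
        except ImportError:
            # Fall back to the system's default certificate store.
            cafile = None
    return ssl.create_default_context(cafile=cafile)


context = build_ssl_context()
print(context.verify_mode == ssl.CERT_REQUIRED)
```

Such a context could then be handed to aiohttp through the `ssl=` argument of `session.get`, as in the snippet above.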

Do you think we should carry that information forward to both the aiohttp and the fsspec projects, to improve their documentation and their internals?

Footnotes

  1. I mean, what would be the point of providing the certificates per Python package then, if you can't make Python actually use it?

@amotl changed the title from "MetaFileNotFound" to "MetaFileNotFound because of missing root certificates" on Mar 4, 2023

@gutzbenj
Member

gutzbenj commented Mar 4, 2023

Sure! But my honest opinion is: I've only seen this error once, on a managed machine at work, and if you get this error, probably nothing else works either.

Usually, if you install Python (and maybe requests afterwards), everything should work out of the box. If not, neither we nor aiohttp would be able to provide any help; rather, the user would have to make sure that the certificates on the machine are correctly installed.

@gutzbenj
Member

Closing this, as it is not related to anything on our end.
