ClientConnectorCertificateError raised when reading parquet file from GCS using pandas and aiohttp==3.7.0 #5128

Closed
sm-hawkfish opened this issue Oct 25, 2020 · 2 comments

@sm-hawkfish

🐞 Describe the bug
When reading parquet files from Google Cloud Storage using pandas and aiohttp==3.7.0, the following error is raised:

aiohttp.client_exceptions.ClientConnectorCertificateError: Cannot connect to host www.googleapis.com:443 ssl:True [SSLCertVerificationError: (1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:1108)')]

💡 To Reproduce

python3 -m venv venv
source venv/bin/activate

pip install pandas pyarrow fsspec gcsfs

Then attempt to read any file from GCS using pd.read_parquet("gs://..").

For example, the file gs://gcp-public-data-landsat/LC08/01/044/034/LC08_L1GT_044034_20130330_20170310_01_T2/LC08_L1GT_044034_20130330_20170310_01_T2_ANG.txt is publicly available, so it can serve as a test case. Even though it is not a valid parquet file, the call fails with the reported error before the file format is ever checked.

import pandas as pd

file_uri = "gs://gcp-public-data-landsat/LC08/01/044/034/LC08_L1GT_044034_20130330_20170310_01_T2/LC08_L1GT_044034_20130330_20170310_01_T2_ANG.txt"

pd.read_parquet(file_uri).head()

💡 Expected behavior

The file should download successfully (or, in the above test case, the call should fail with OSError: Could not open parquet input source ... Either the file is corrupted or this is not a parquet file.)

📋 Logs/tracebacks

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/steve/venv/lib/python3.8/site-packages/pandas/io/parquet.py", line 317, in read_parquet
    return impl.read(path, columns=columns, **kwargs)
  File "/Users/steve/venv/lib/python3.8/site-packages/pandas/io/parquet.py", line 141, in read
    result = self.api.parquet.read_table(
  File "/Users/steve/venv/lib/python3.8/site-packages/pyarrow/parquet.py", line 1607, in read_table
    dataset = _ParquetDatasetV2(
  File "/Users/steve/venv/lib/python3.8/site-packages/pyarrow/parquet.py", line 1439, in __init__
    if filesystem.get_file_info(path).is_file:
  File "pyarrow/_fs.pyx", line 438, in pyarrow._fs.FileSystem.get_file_info
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/_fs.pyx", line 1004, in pyarrow._fs._cb_get_file_info
  File "/Users/steve/venv/lib/python3.8/site-packages/pyarrow/fs.py", line 195, in get_file_info
    info = self.fs.info(path)
  File "/Users/steve/venv/lib/python3.8/site-packages/fsspec/asyn.py", line 121, in wrapper
    return maybe_sync(func, self, *args, **kwargs)
  File "/Users/steve/venv/lib/python3.8/site-packages/fsspec/asyn.py", line 100, in maybe_sync
    return sync(loop, func, *args, **kwargs)
  File "/Users/steve/venv/lib/python3.8/site-packages/fsspec/asyn.py", line 71, in sync
    raise exc.with_traceback(tb)
  File "/Users/steve/venv/lib/python3.8/site-packages/fsspec/asyn.py", line 55, in f
    result[0] = await future
  File "/Users/steve/venv/lib/python3.8/site-packages/gcsfs/core.py", line 781, in _info
    return await self._get_object(path)
  File "/Users/steve/venv/lib/python3.8/site-packages/gcsfs/core.py", line 576, in _get_object
    bucket, await self._call("GET", "b/{}/o/{}", bucket, key, json_out=True)
  File "/Users/steve/venv/lib/python3.8/site-packages/gcsfs/core.py", line 487, in _call
    async with self.session.request(
  File "/Users/steve/venv/lib/python3.8/site-packages/aiohttp/client.py", line 1083, in __aenter__
    self._resp = await self._coro
  File "/Users/steve/venv/lib/python3.8/site-packages/aiohttp/client.py", line 490, in _request
    conn = await self._connector.connect(
  File "/Users/steve/venv/lib/python3.8/site-packages/aiohttp/connector.py", line 528, in connect
    proto = await self._create_connection(req, traces, timeout)
  File "/Users/steve/venv/lib/python3.8/site-packages/aiohttp/connector.py", line 868, in _create_connection
    _, proto = await self._create_direct_connection(
  File "/Users/steve/venv/lib/python3.8/site-packages/aiohttp/connector.py", line 1023, in _create_direct_connection
    raise last_exc
  File "/Users/steve/venv/lib/python3.8/site-packages/aiohttp/connector.py", line 999, in _create_direct_connection
    transp, proto = await self._wrap_create_connection(
  File "/Users/steve/venv/lib/python3.8/site-packages/aiohttp/connector.py", line 948, in _wrap_create_connection
    raise ClientConnectorCertificateError(
aiohttp.client_exceptions.ClientConnectorCertificateError: Cannot connect to host www.googleapis.com:443 ssl:True [SSLCertVerificationError: (1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:1108)')]

📋 Your version of Python

$ python --version
Python 3.8.2

📋 Your version of the aiohttp/yarl/multidict distributions

$ python -m pip show aiohttp
Name: aiohttp
Version: 3.7.0
Summary: Async http client/server framework (asyncio)
Home-page: https://github.com/aio-libs/aiohttp
Author: Nikolay Kim
Author-email: fafhrd91@gmail.com
License: Apache 2
Location: /Users/steve/venv/lib/python3.8/site-packages
Requires: async-timeout, multidict, attrs, yarl, chardet
Required-by: gcsfs
$ python -m pip show multidict
Name: multidict
Version: 5.0.0
Summary: multidict implementation
Home-page: https://github.com/aio-libs/multidict
Author: Andrew Svetlov
Author-email: andrew.svetlov@gmail.com
License: Apache 2
Location: /Users/steve/venv/lib/python3.8/site-packages
Requires:
Required-by: yarl, aiohttp
$ python -m pip show yarl
Name: yarl
Version: 1.6.2
Summary: Yet another URL library
Home-page: https://github.com/aio-libs/yarl/
Author: Andrew Svetlov
Author-email: andrew.svetlov@gmail.com
License: Apache 2
Location: /Users/steve/venv/lib/python3.8/site-packages
Requires: idna, multidict
Required-by: aiohttp
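
The three `pip show` calls above can also be collapsed into one short script. This is a minimal sketch, assuming Python 3.8+ (for `importlib.metadata`); the helper name `installed_versions` is made up for illustration and is not part of aiohttp, yarl, or multidict.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is absent from the active environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # The three distributions the issue template asks about.
    for name, version in installed_versions(["aiohttp", "yarl", "multidict"]).items():
        print(f"{name}=={version}" if version else f"{name}: not installed")
```

Run inside the activated venv, this prints one `name==version` line per package, which is easier to paste into a report than the full `pip show` output.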

📋 Additional context

Downgrading aiohttp to 3.6.3 fixes the issue:

python3 -m venv venv
source venv/bin/activate

pip install pandas pyarrow fsspec gcsfs
pip freeze > bad.txt

rm -rf venv
python3 -m venv venv
source venv/bin/activate

pip install pandas pyarrow fsspec gcsfs aiohttp==3.6.3
pip freeze > good.txt

diff bad.txt good.txt

This gives:

1c1
< aiohttp==3.7.0
---
> aiohttp==3.6.3
13c13
< multidict==5.0.0
---
> multidict==4.7.6
27c27
< yarl==1.6.2
---
> yarl==1.5.1
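
The same comparison of the two freeze files can be done in Python, which avoids depending on shell-specific redirection syntax. This is a rough sketch only: `parse_freeze` and `freeze_diff` are hypothetical helper names, and it assumes every relevant line has the plain `name==version` form that `pip freeze` normally emits (other line types are skipped).

```python
def parse_freeze(text):
    """Map package name -> pinned version for each `name==version` line."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue  # skip blanks, comments, and entries that are not simple pins
        name, version = line.split("==", 1)
        pins[name.lower()] = version
    return pins


def freeze_diff(bad_text, good_text):
    """Return {name: (bad_version, good_version)} for packages whose pins differ."""
    bad, good = parse_freeze(bad_text), parse_freeze(good_text)
    return {
        name: (bad.get(name), good.get(name))
        for name in sorted(set(bad) | set(good))
        if bad.get(name) != good.get(name)
    }


# Only the pins that appear in the diff output above are used here.
bad_txt = "aiohttp==3.7.0\nmultidict==5.0.0\nyarl==1.6.2\n"
good_txt = "aiohttp==3.6.3\nmultidict==4.7.6\nyarl==1.5.1\n"

for name, (bad_version, good_version) in freeze_diff(bad_txt, good_txt).items():
    print(f"{name}: {bad_version} -> {good_version}")
```

To compare the actual files from the steps above, pass `Path("bad.txt").read_text()` and `Path("good.txt").read_text()` (from `pathlib`) instead of the inline strings.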

System Version: macOS 10.15.7 (19H2)
Kernel Version: Darwin 19.6.0

@evamaxfield

I can also confirm the same error for S3.

@asvetlov
Member

Fixed by #5118

justTheKai pushed a commit to justTheKai/sukuinote that referenced this issue Nov 8, 2020