Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Source File: ability to get HTTPS attachments #5537

Closed
muutech opened this issue Aug 19, 2021 · 22 comments · Fixed by #15269 or #15501
Closed

Source File: ability to get HTTPS attachments #5537

muutech opened this issue Aug 19, 2021 · 22 comments · Fixed by #15269 or #15501

Comments

@muutech
Copy link
Contributor

muutech commented Aug 19, 2021

Tell us about the problem you're trying to solve

We are trying to download an Excel file shared in a "ownCloud" or vía "transfer.sh". I think it does not work when response header is "Content-disposition: attachment". This header is very common and having this feature would be very nice and give access to public shared links from different platforms like Google Drive, etc.

Describe the solution you’d like

When we tried to use the connector File with HTTPS it works well for the same file if we upload it to a web server serving DIRECTLY the file, but, the same file, uploaded to an OwnCloud environment or to "transfer.sh" and trying to download it, airbyte says it is not a valid Excel file.

Traceback (most recent call last): File "/usr/local/lib/python3.7/site-packages/source_file/source.py", line 122, in discover streams = list(client.streams) File "/usr/local/lib/python3.7/site-packages/source_file/client.py", line 384, in streams "properties": self._stream_properties(), File "/usr/local/lib/python3.7/site-packages/source_file/client.py", line 372, in _stream_properties for df in df_list: File "/usr/local/lib/python3.7/site-packages/source_file/client.py", line 327, in load_dataframes yield reader(fp, **reader_options) File "/usr/local/lib/python3.7/site-packages/pandas/util/_decorators.py", line 299, in wrapper return func(*args, **kwargs) File "/usr/local/lib/python3.7/site-packages/pandas/io/excel/_base.py", line 336, in read_excel io = ExcelFile(io, storage_options=storage_options, engine=engine) File "/usr/local/lib/python3.7/site-packages/pandas/io/excel/_base.py", line 1056, in __init__ content=path_or_buffer, storage_options=storage_options File "/usr/local/lib/python3.7/site-packages/pandas/io/excel/_base.py", line 942, in inspect_excel_format stream.seek(0) File "/usr/local/lib/python3.7/site-packages/smart_open/http.py", line 263, in seek raise OSError OSError

URL is like this (none of them working but it is easy to upload something to transfer.sh):
https://transfer.sh/get/1HpZ3cN/test.xlsx (/get/ is the key to direct download)
https://www.xxxxxxcloud.com/drive/index.php/s/xxxxxxx/download?path=%2F&files=test.xlsx

The only difference we notice from using a direct http server and this other options is that in the response header it comes "Content-Disposition: attachment ..."

Describe the alternative you’ve considered or used

I have looked in smart_open for a related issue but could not find it...

@muutech muutech added the type/enhancement New feature or request label Aug 19, 2021
@sherifnada sherifnada added the area/connectors Connector related issues label Aug 25, 2021
@sherifnada
Copy link
Contributor

thanks @muutech ! would you be open to creating a PR to submit this feature?

@muutech
Copy link
Contributor Author

muutech commented Aug 25, 2021

We have opened an issue on smart_open, to see if they can help us with some light/guidance or we would need to use another library (it work with "requests"). piskvorky/smart_open#644

@muutech
Copy link
Contributor Author

muutech commented Aug 27, 2021

Ok, with great help from the people of smart_open, the problem seems to be that the library is strict with the optional HTTP header "Accept-ranges", so in some cases like using "owncloud", "gdrive", "transfer.sh" which does not respond with that header when airbyte calls "seek" it will fail.

They tell us they would release a new version with this check relaxed. So, when it is gets out, it would be apparently solved, just updating this library, do you see any problem or thing to have in mind when updating this library?

See detailed explanation on: piskvorky/smart_open#644 (comment)

@muutech
Copy link
Contributor Author

muutech commented Aug 28, 2021

smart_open 5.2.1 is already out, with this change. I've tested it, and now the error is different... any idea?

  File "/usr/local/lib/python3.7/site-packages/pandas/io/excel/_base.py", line 336, in read_excel
    io = ExcelFile(io, storage_options=storage_options, engine=engine)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/excel/_base.py", line 1056, in __init__
    content=path_or_buffer, storage_options=storage_options
  File "/usr/local/lib/python3.7/site-packages/pandas/io/excel/_base.py", line 958, in inspect_excel_format
    zf = zipfile.ZipFile(stream)  # type: ignore[arg-type]
  File "/usr/local/lib/python3.7/zipfile.py", line 1258, in __init__
    self._RealGetContents()
  File "/usr/local/lib/python3.7/zipfile.py", line 1350, in _RealGetContents
    raise BadZipFile("Truncated central directory")
zipfile.BadZipFile: Truncated central directory

Could be related to this? #5110

Thanks!

@sherifnada
Copy link
Contributor

this seems like a different issue @muutech -- the issue you linked is from the S3 source, which is different than the file source.

Can you share the inputs you provided to the file source that resulted in this error?

@muutech
Copy link
Contributor Author

muutech commented Sep 18, 2021

Sure.

format: "excel"
reader_options: ""
provider: "HTTPS: Public Web"
storage: "https"
url: an excel file loaded in transfer.sh (for instance): https://transfer.sh/get/8Rcu08/test.xlsx , you can use your own file

@grubberr
Copy link
Contributor

grubberr commented Aug 6, 2022

@muutech
In this PR #15269 I have fixed 1-st problem related to os.seek

About 2-nd problem it looks like something wrong with zip unpacking test.xlsx file
Can you provide that file for us for testing ?

@muutech
Copy link
Contributor Author

muutech commented Aug 8, 2022

@grubberr thanks.

It could be any test.xlsx but find attached the one I used, I think compression is made on the http transfer.

I tested uploading it to: https://transfer.sh/

test.xlsx

Thanks!

@grubberr
Copy link
Contributor

grubberr commented Aug 8, 2022

@muutech
about 2-nd problem related to transfer.sh and test.xlsx yes problem really exists
I did some research and found why this happened it's again our os.seek problem but a little bit different, let me describe:

  1. pd.read_excel is random access read, it do a lot of read, seek calls.
  2. All this seek and after read calls under the hood are separate HTTP requests with different bytes-ranges.
  3. I am pretty surprising that there is not caching, smart_open can request the same range a multiple times (maybe we can turn if on I need to check)
  4. Because a lot of HTTP requests, after some time we get HTTP Response: 429, Too Many Requests
  5. smart_open cannot handle it
    https://github.com/RaRe-Technologies/smart_open/blob/develop/smart_open/http.py#L314
    for upper level it looks like read call returned empty buffer

We need to think how to better handle it

@grubberr
Copy link
Contributor

grubberr commented Aug 9, 2022

piskvorky/smart_open#712

@grubberr grubberr linked a pull request Aug 10, 2022 that will close this issue
14 tasks
@grubberr
Copy link
Contributor

@muutech

I have published airbyte/source-file:0.2.17
can you please check transfer.sh problem on your side?

@grubberr
Copy link
Contributor

@muutech any updates on this ?

@muutech
Copy link
Contributor Author

muutech commented Aug 16, 2022

@muutech any updates on this ?

When setting up connection: Failed to fetch schema. Please try again. You can the URL if you want.
image

Airbyte version: 0.39.41-alpha
source-file version: 0.2.17

@grubberr
Copy link
Contributor

grubberr commented Aug 17, 2022

yes problem still exists for transfer.sh in discover stage
fix here #15698

@grubberr
Copy link
Contributor

@muutech thanks for bug, I have updated
can you please re-check source-file:0.2.18
?

@muutech
Copy link
Contributor Author

muutech commented Aug 17, 2022

great! it works... but local csvs started to fail, airbyte tell us it succeeded but you see the destination is empty (header ok but no rows). find log attached, i do not know if its related or not.
logs-82.txt

@grubberr
Copy link
Contributor

@muutech is it possible to get prueba.csv and config.json for this problematic case?

@muutech
Copy link
Contributor Author

muutech commented Aug 17, 2022

Sure!

image

Contents of prueba.csv:
COLUMNA1;COLUMNA2; HOLA;VECINO; HOLITA;VECINITO3;

@grubberr
Copy link
Contributor

@muutech
hm, I see that in prueba.csv file there is only 1 line which is interepeted by pandas as header only (no rows)
I have duplicate 1st line twice in prueba.csv file and I was able to catch "RECORD" object too:

cat prueba.csv 
COLUMNA1;COLUMNA2; HOLA;VECINO; HOLITA;VECINITO3;
COLUMNA1;COLUMNA2; HOLA;VECINO; HOLITA;VECINITO3;

$ python3 main.py read --config secrets/config.json --catalog  catalog.json
{"type": "LOG", "log": {"level": "INFO", "message": "Reading test (file://prueba.csv)..."}}
{"type": "RECORD", "record": {"stream": "test", "data": {"VECINITO3": "VECINITO3", "COLUMNA1": "COLUMNA1", "VECINO": "VECINO", " HOLITA": " HOLITA", " HOLA": " HOLA", "Unnamed: 6": NaN, "COLUMNA2": "COLUMNA2"}, "emitted_at": 1660804728000}}

@muutech
Copy link
Contributor Author

muutech commented Aug 18, 2022

Sorry, github get my CSV wrong:

COLUMNA1;COLUMNA2;
HOLA;VECINO;
HOLITA;VECINITO3;

In the logs it seems to get the record object OK, but when trying to make it to destination CSV or Postgresql it ends "successful" but with 0 rows.

thanks!

@grubberr
Copy link
Contributor

@muutech there was problem with "NaN" values in connector output
I have fixed in this PR #15768
can you please re-test source-file:0.2.19 ?

@muutech
Copy link
Contributor Author

muutech commented Aug 20, 2022

Tested! Now everything works. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants