[WIP] Excel & CSV query runner #2478

Open
wants to merge 7 commits into
base: master
from

Conversation

@deecay
Contributor

commented Apr 20, 2018

This PR adds Excel and CSV as data source types.

  • The data source name is the only configuration option for both data sources.
  • Specify the path to the Excel/CSV file (a local path or a URI) in the query editor.
  • You may pass parameters to the Excel/CSV parsing function.

Online Excel data file with parameters

  • https://www.unicef.org/sowc2012/pdfs/U5MR-rank_FINAL.xls|{'names': ['Country', 'Mortality', 'Rank'], 'usecols': [0, 1, 2], 'skiprows': 7, 'skipfooter':2}
  • The file path and parameters must be separated by '|' (pipe).
  • Parameters are given as a dictionary, such as {'names': ['Country', 'Mortality', 'Rank'], 'usecols': [0, 1, 2]}.
  • Reference for parameters: the pandas read_csv (CSV) and read_excel (Excel) documentation.
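
The pipe-separated syntax could be parsed roughly as follows. This is a minimal sketch, not the code in this PR: the function name parse_query is hypothetical, and it assumes that the first '|' separates the path from the parameters and that the parameters are a Python dict literal.

```python
import ast


def parse_query(query):
    """Split 'path|{params}' into (path, params). Hypothetical helper."""
    # Split on the first pipe only, so the parameter dict may contain '|'.
    path, sep, raw_params = query.partition("|")
    path = path.strip()
    if not path:
        raise ValueError("query must start with a file path or URL")
    if not sep or not raw_params.strip():
        return path, {}
    # literal_eval accepts only Python literals, so no arbitrary code runs.
    params = ast.literal_eval(raw_params.strip())
    if not isinstance(params, dict):
        raise ValueError("parameters must be a dictionary")
    return path, params
```

The returned dictionary could then be forwarded as keyword arguments, for example pandas.read_excel(path, **params), assuming pandas is installed.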

Same data without parameters (table is broken)

  • https://www.unicef.org/sowc2012/pdfs/U5MR-rank_FINAL.xls

CSV example using NASA dataset

  • https://data.nasa.gov/api/views/gh4g-9sfh/rows.csv?accessType=DOWNLOAD&bom=true&format=true|{'header': 1, 'names': ['name', 'id', 'nametype', 'recclass', 'mass', 'fall', 'year', 'reclat', 'reclong', 'GeoLocation']}

deecay added some commits Apr 17, 2018

@deecay deecay changed the title Excel query runner Excel & CSV query runner Sep 3, 2018

@deecay

Contributor Author

commented Sep 3, 2018

Need db-logo for Excel...

@arikfr
Member

left a comment

Thanks! Supporting both CSV and Excel is great.

My main concern with this implementation is the use of Pandas, as I always felt it's a "heavy" dependency that might introduce issues with setting up Redash for some users. Although I did look into it again and I'm no longer sure about this.

@jezdez do you happen to have any insight on Pandas requirements?

(and, @jezdez , of course it's another case for #2921 ;-))

@deecay

Contributor Author

commented Nov 20, 2018

@jezdez, any opinions? Maybe resort to requirements_excel_ds.txt?

@jezdez

Member

commented Nov 20, 2018

Yeah, installing Pandas and numpy is quite a lot for the purpose of reading and parsing Excel and CSV files alone. That's an additional ~23 MB (unpacked whl files on Linux for Python 2.7) for the Docker image and a higher maintenance burden given the high profile of Pandas and numpy for a relatively small, albeit useful feature.

Options forward:

  • find a different library to read and parse CSV/Excel files
  • ship the query runner as a separate package (#2921) and document how users can install it on demand (my preference)
  • accept the extra dependency

A positive point is that both Pandas and numpy are available as precompiled whl files, so it's literally just a matter of downloading them.
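
One way to make a heavy dependency optional is to enable the data source only when the library can be imported. The sketch below illustrates that idea under stated assumptions: ExcelQueryRunner and dependency_available are hypothetical names, and the class is a stand-in, not Redash's actual base class.

```python
import importlib.util


def dependency_available(module_name):
    """Return True if a top-level module can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None


class ExcelQueryRunner:
    """Hypothetical stand-in for a query runner class."""

    required_module = "pandas"

    @classmethod
    def enabled(cls):
        # Offer the data source only when its optional dependency is installed.
        return dependency_available(cls.required_module)
```

With this shape, users who never install Pandas simply do not see the data source, and nothing fails at startup.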

@denisov-vlad

Member

commented Nov 20, 2018

Pandas is a must-have for the Python query runner.

It may be heavy, but it does not require additional libraries installed via apt, unlike other data sources such as mssql.

@jezdez

Member

commented Nov 20, 2018

> Pandas is a must-have for the Python query runner.
>
> It may be heavy, but it does not require additional libraries installed via apt, unlike other data sources such as mssql.

That may or may not be so, it's out of scope of this pull request review though.

@deecay

Contributor Author

commented Nov 20, 2018

Option 2 is acceptable to me too. I will wait until #2921 is done.

xlrd would be the candidate for Option 1, but this involves a decent amount of 'reinventing the wheel' to get the nice features that Pandas has (skiprows, usecols, etc.). Without these features, a "perfectly formatted" Excel file would be required, which is rare, as the examples above show.
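
To give a sense of the work involved, here is a minimal sketch of re-implementing a few of those options by hand. It uses the standard library csv module rather than xlrd so that it stays self-contained; the function name read_rows is hypothetical, and only the option names (names, usecols, skiprows, skipfooter) are borrowed from Pandas.

```python
import csv
import io


def read_rows(text, names, usecols=None, skiprows=0, skipfooter=0):
    """Return a list of dicts, mimicking a few pandas-style options by hand."""
    rows = list(csv.reader(io.StringIO(text)))
    # Drop leading junk rows and trailing footer rows.
    end = len(rows) - skipfooter
    rows = rows[skiprows:end]
    result = []
    for row in rows:
        if usecols is not None:
            # Keep only the requested column positions.
            row = [row[i] for i in usecols]
        result.append(dict(zip(names, row)))
    return result
```

Even this small version leaves out type inference, header detection, and error handling for ragged rows, all of which Pandas already provides.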

@arikfr

Member

commented Nov 20, 2018

> Option 2 is acceptable to me too. I will wait until #2921 is done.

We don't have to extract all the query runners for you to be able to do this. But we will probably need to extract the query runner base classes and helper methods to their own package, so they can be used "externally". @jezdez , am I correct here?

> xlrd would be the candidate for Option 1, but this involves a decent amount of 'reinventing the wheel' to get the nice features that Pandas has (skiprows, usecols, etc.). Without these features, a "perfectly formatted" Excel file would be required, which is rare, as the examples above show.

I missed the added functionality. That's actually nice :) I would suggest a different "query syntax" though, let's use YAML here, so the query becomes:

url: https://www.unicef.org/sowc2012/pdfs/U5MR-rank_FINAL.xls
names:
  - Country
  - Mortality
  - Rank
usecols: [0, 1, 2]
skiprows: 7
skipfooter: 2
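
A query in this form could be split into the file location and the parser options along these lines. This is a sketch under assumptions: both function names are hypothetical, the 'url' key follows the example above, and YAML parsing relies on the third-party PyYAML package.

```python
def split_query_options(options):
    """Separate the file location from the remaining parser options."""
    if not isinstance(options, dict) or "url" not in options:
        raise ValueError("query must be a mapping with a 'url' key")
    kwargs = dict(options)  # copy so the caller's mapping is left untouched
    url = kwargs.pop("url")
    return url, kwargs


def parse_yaml_query(query_text):
    """Parse a YAML query string; requires the third-party PyYAML package."""
    import yaml  # imported here so the helper above works without PyYAML

    return split_query_options(yaml.safe_load(query_text))
```

The remaining keys could then be passed through unchanged, for example pandas.read_excel(url, **kwargs), assuming pandas is installed.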

@deecay

Contributor Author

commented Nov 20, 2018

@Arik, interesting idea. I'll take a look at the Yandex Metrica query runner first, and see what I can do about YAML.

@deecay

Contributor Author

commented Nov 30, 2018

Done with the yaml part.

@deecay deecay changed the title Excel & CSV query runner [WIP] Excel & CSV query runner Nov 30, 2018
