Table Dialect Spec #697
Comments
Very interesting proposal and generally big 👍
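Probably I should have put headerJoin into the Options to consider category as it's a very minor and rare case. At the same time, people asked for this option for tabulator many times, including for pilots, as there are a lot of Excel files with a "fancy" multiline header. So it's basically the way we join a multiline header row:
dialect = ExcelDialect(header_rows=[7, 8, 9], header_join='/')
with Table('excel.xlsx', dialect=dialect) as table:
    print(table.headers)
    # ['Current/Phase1/#', ...]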
OK, understood and very good to have concrete examples.
On Mon, Jul 27, 2020 at 12:00 PM roll ***@***.***> wrote:
Probably I should have put headerJoin into the Options to consider
category as it's a very minor and rare case. At the same time, people asked
for this option for tabulator many times including for pilots as there
are a lot of Excel files with a "fancy" multiline header like:
[image: excel]
<https://user-images.githubusercontent.com/557395/88529413-a54faf00-d008-11ea-90b2-1edb1b9b72a5.png>
So it's basically the way we join a multiline header row:
dialect = ExcelDialect(header_rows=[7, 8, 9], header_join='/')
with Table('excel.xlsx', dialect=dialect) as table:
    print(table.headers)
    # ['Current/Phase1/#', ...]
@roll do you want to start with a pull request to add this as a pattern?
Sure. I'll PR (can take some time though)
Json Table Dialect requires further discussion as many ways exist to encode tabular data in JSON.
See https://www.w3.org/TR/csv2json/ (CSVW) for an example of a specification that supports 1 (simple) and 3 (slightly complicated). A simplified form of this uses an object with property […]. So […].
Moreover, cells in JSON Tables and Excel Tables can have data types other than plain strings. Datatypes can be defined with columns as done in CSVW, but less complex (e.g. only string, number, logical).
Thanks, @nichtich! I think it should not be a blocker, as in specs like this we have the privilege of starting from a small core and extending it once other properties are discussed and justified.
Overview
For now, we have:
schema property that must be Tabular Schema
dialect property that must be CSV Dialect
It means that we have only two mechanisms to add tabular information to the resource, the schema and dialect properties:
schema: what is the data
dialect: how to extract the data
Maybe at some point this list can be extended, e.g. by providing a table filtering ability, but for now I think we can definitely generalize the dialect property. Instead of having it csv-only, we can have a general Table Dialect spec helping describe any tabular format details.
The proposed Table Dialect spec will create a nice symmetry with the already existing Table Schema spec. Here is a quick overview of the proposal. The spec is hierarchical, so e.g. Csv Table Dialect inherits all the props from Table Dialect.
Table Dialect
Core
Table Dialect spec will handle header management.
header (bool)
Whether the table has one or more header rows.
headerRows (int[])
An array of header row numbers. Can describe a multiline header.
headerJoin (str)
A string to concatenate a multiline header. Has no effect for a single row header.
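To make the headerRows / headerJoin semantics concrete, here is a minimal sketch in plain Python. It is not the tabulator or frictionless implementation: the function name join_header_rows is hypothetical, row numbers are assumed to be 1-based (as the Excel example in the comments suggests), empty cells are assumed to be skipped, and the default join string is an assumption since the proposal does not state one.

```python
from typing import Any, List, Sequence


def join_header_rows(
    rows: Sequence[Sequence[Any]],
    header_rows: Sequence[int],
    header_join: str = " ",  # assumed default; the proposal does not specify one
) -> List[str]:
    """Build a single header from one or more header rows.

    `header_rows` holds 1-based row numbers (an assumption of this sketch).
    Empty cells are skipped, so a single header row is returned unchanged.
    """
    picked = [rows[number - 1] for number in header_rows]
    width = max(len(row) for row in picked)
    headers = []
    for col in range(width):
        parts = []
        for row in picked:
            cell = row[col] if col < len(row) else None
            if cell is not None and str(cell).strip() != "":
                parts.append(str(cell).strip())
        headers.append(header_join.join(parts))
    return headers


rows = [
    ["Current", "Current"],
    ["Phase1", "Phase2"],
    ["#", "%"],
    [1, 2],
]
multi = join_header_rows(rows, header_rows=[1, 2, 3], header_join="/")
single = join_header_rows(rows, header_rows=[2], header_join="/")
print(multi)   # ['Current/Phase1/#', 'Current/Phase2/%']
print(single)  # ['Phase1', 'Phase2']
```

Note that this sketch does not forward-fill merged Excel cells; real software would need to decide how merged header cells are expanded before joining.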
Csv Table Dialect
It will support all the header options and the options below, which are standard for csv.
delimiter (str)
lineTerminator (str)
quoteChar (str)
doubleQuote (bool)
escapeChar (str)
nullSequence (str)
skipInitialSpace (bool)
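As a rough illustration of how software could consume these options, the sketch below maps a dialect descriptor onto Python's standard csv module. The helper name read_csv_with_dialect and the default values are assumptions of this sketch, not part of the proposal. nullSequence has no counterpart in the csv module, so it is handled by hand, and lineTerminator is ignored because csv.reader detects line endings on its own.

```python
import csv
import io
from typing import Any, Dict, List, Optional


def read_csv_with_dialect(text: str, dialect: Dict[str, Any]) -> List[List[Optional[str]]]:
    """Parse CSV text using camelCase dialect options (hypothetical helper)."""
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter=dialect.get("delimiter", ","),
        quotechar=dialect.get("quoteChar", '"'),
        doublequote=dialect.get("doubleQuote", True),
        escapechar=dialect.get("escapeChar"),
        skipinitialspace=dialect.get("skipInitialSpace", False),
    )
    # nullSequence is not a csv module option, so map matching cells to None here.
    null = dialect.get("nullSequence")
    return [
        [None if (null is not None and cell == null) else cell for cell in row]
        for row in reader
    ]


text = 'id; name\n1; "a;b"\n2; NA\n'
dialect = {"delimiter": ";", "skipInitialSpace": True, "nullSequence": "NA"}
parsed = read_csv_with_dialect(text, dialect)
print(parsed)  # [['id', 'name'], ['1', 'a;b'], ['2', None]]
```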
I propose the following changes to the current Csv Dialect spec:
skipInitialSpace=False by default to sync with Python/Pandas/JS/etc behaviour
caseSensitiveHeader, as I guess it should be an option for some infer function, but for general data description I'm not sure what it does
commentChar option, as partially its role will be handled by headerRows and, at the same time, there is a more functional skipRows supported by the software. In software, I've moved all the skip/pick/limit/offset_fields/rows functionality to a separate group called Table Query (or Table Discovery previously), which should probably exist only in software because we don't want to make ETL from the specs, although I think there are options to consider.
Excel Table Dialect
It will support all the header options and:
sheet (str|int)
String or integer to address an excel sheet e.g. 2 or Sheet 2.
Options to consider:
Json Table Dialect
It will support all the header options and:
keyed (bool)
Whether a source is keyed i.e. an array of dictionaries instead of an array of arrays.
keys (str[])
For a keyed source, an array of keys to use as a header row.
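The keyed and keys options could be interpreted roughly as follows. This is only a sketch under stated assumptions: json_to_rows is a hypothetical helper, missing keys are assumed to become None, and when keys is not given the keys of the first record are assumed to define the header.

```python
import json
from typing import Any, List, Optional, Sequence


def json_to_rows(
    text: str,
    keyed: bool = False,
    keys: Optional[Sequence[str]] = None,
) -> List[List[Any]]:
    """Normalize a JSON table into a list of rows with the header first."""
    data = json.loads(text)
    if not keyed:
        # Array of arrays: the data is already row-oriented.
        return [list(row) for row in data]
    if keys is None:
        # Assumption: fall back to the first record's keys, in order.
        keys = list(data[0].keys()) if data else []
    rows: List[List[Any]] = [list(keys)]
    for record in data:
        # Assumption: a key missing from a record yields None.
        rows.append([record.get(key) for key in keys])
    return rows


keyed_text = '[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]'
plain_text = '[["id", "name"], [1, "a"]]'
keyed_rows = json_to_rows(keyed_text, keyed=True, keys=["name", "id"])
plain_rows = json_to_rows(plain_text)
print(keyed_rows)  # [['name', 'id'], ['a', 1], ['b', 2]]
print(plain_rows)  # [['id', 'name'], [1, 'a']]
```

Passing keys explicitly fixes both the column order and the header row, which is useful because JSON objects do not have to list their properties in the same order.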
Options to consider:
dogs/data)
In conclusion, the idea is: