Table Dialect Spec #697

roll · 2020-07-27T08:16:45Z

Overview

For now, we have:

a resource describes a data entity
a resource can be tabular
- a tabular resource has schema property that must be Tabular Schema
- a tabular resource has dialect property that must be CSV Dialect

It means that we have only two mechanisms to add tabular information to the resource: schema and dialect properties:

schema: what is the data
dialect: how to extract the data

Maybe at some point this list can be extended e.g. providing table filtering ability etc but, as for now, I think we definitely can generalize the dialect property. Instead of having it csv-only we can have a general Table Dialect spec helping describe any tabular format details.

The proposed Table Dialect spec will create a nice symmetry with already existent Table Schema spec. Here is a quick overview of the proposal. The spec is hierarchical so e.g. Csv Table Dialect inherits all the props from Table Dialect.

Table Dialect

Core Table Dialect spec will handle header management.

header (bool)

default: true

Whether the table has a header row(s)

headerRows (int[])

default: [1]

An array of header row numbers. Can describe a multiline header.

headerJoin (str)

default: ' ' (one space)

A string to concatenate a multiline header. Has no effect for a single row header.

Csv Table Dialect

@amercader hints that we also need to re-review the CSVW spec in case we miss something - https://www.w3.org/TR/2015/REC-tabular-metadata-20151217/#dialect-descriptions

It will support all the header options and the options below which is standard for csv.

delimiter (str)

default: ,

lineTerminator (str)

default: \r\n

quoteChar (str)

default: ""

doubleQuote (bool)

default: true

escapeChar (str)

default: not set

nullSequence (str)

default: not set

skipInitialSpace (bool)

default: false

I propose the following changes to the current Csv Dialect spec:

make skipInitialSpace=False by default to sync with Python/Pandas/JS/etc behaviour
remove caseSensitiveHeader as I guess it should be an option for some infer function but for general data description I'm not sure what it does
review commentChar option as partially its role will be handled by headerRows and, at the same time, there is more functional skipRows supported by the software. In software, I've moved all the skip/pick/limit/offset_fields/rows functionality to a separate group called Table Query (or Table Discovery previously) which should probably exist only in software because we don't want to make ETL from the specs, although I think there are options to consider.

Excel Table Dialect

It will support all the header options and:

sheet (str|int)

default: 1

String or integer to address an excel sheet e.g. 2 or Sheet 2.

Options to consider:

fillMergedCells
preserveFormatting
adjustFloatingPointError

Json Table Dialect

It will support all the header options and:

keyed (bool)

default: false

Whether a source is keyed i.e. an array of dictionaries instead of an array of arrays.

keys (str[])

default: not set

For a keyed source, an array of keys to use as a header row.

Options to consider:

property (path to the data within json e.g. dogs/data)

In conclusion, the idea is:

csv is not the only tabular format; let's describe others, the most importantly Excel
to have one hierarchical spec which will help standardize different formats' dialects
new formats and properties addition should be considered based on users' demand and should happen gradually

The text was updated successfully, but these errors were encountered:

rufuspollock · 2020-07-27T09:23:56Z

Very interesting proposal and generally big 👍

headerRows - we def want something like this - i believe Multi-line header rows #681 is also about this
Not sure about headerJoin - can you provide more detail?
Like the idea of specializing the dialect to other cases

roll · 2020-07-27T10:00:03Z

Probably I should have put headerJoin into the Options to consider category as it's a very minor and rare case. At the same time, people asked for this option for tabulator many times including for pilots as there are a lot Excel files with "fancy" multiline header like:

So it's basically the way we join a multiline header row:

dialect = ExcelDialect(header_rows=[7, 8, 9], header_join='/')
with Table('excel.xlsx', dialect=dialect) as table:
    print(table.headers)
    # ['Current/Phase1/#', ...]

rufuspollock · 2020-07-27T10:07:39Z

OK, understood and very good to have concrete examples.

…

_________________________________________________________________________ *Datopian *| https://datopian.com | Open solutions for a data driven world *DataHub* | https://datahub.io | GitHub for data *CKAN * | http://ckan.org <https://ckan.org/> | The world's leading data portal solution President - +44 7795176976 - @rufuspollock

On Mon, Jul 27, 2020 at 12:00 PM roll ***@***.***> wrote: Probably I should have put headerJoin into the Options to consider category as it's a very minor and rare case. At the same time, people asked for this option for tabulator many times including for pilots as there are a lot Excel files with "fancy" multiline header like: [image: excel] <https://user-images.githubusercontent.com/557395/88529413-a54faf00-d008-11ea-90b2-1edb1b9b72a5.png> So it's basically the way we join a multiline header row: dialect = ExcelDialect(headerRows=[7, 8, 9], header_join='/') with Table('excel.xlsx', dialect=dialect) as table: print(table.headers) # ['Current/Phase1/#', ...] — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#697 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AABMDMWRQIS7TOVVBFH2K5LR5VF3FANCNFSM4PIQDXOQ> .

rufuspollock · 2021-03-31T16:00:45Z

@roll do you want to start with a pull request to add this as a pattern?

roll · 2021-04-04T05:41:03Z

Sure. I'll PR (can take some time though)

nichtich · 2022-12-18T19:35:31Z

Json Table Dialect requires further discussion as many ways exist to encode tabular data in JSON.

array of arrays
array of objects (aka keyed) so order of columns is unknown
object with property for rows and property for header

See https://www.w3.org/TR/csv2json/ (CSVW) for an example of a specification that supports 1 (simple) and 3 (slightly complicated). A simplified form of this uses an object with property rows for rows and property columns with an array of objects, each having property label at least.

So keyed is probably fine but keys is more complex.

Moreover cells in JSON Tables and Excel Tables can have data types other than plain strings.

Excel: text, number, logical, error.
JSON: any JSON data type except object and possibly array (string, number, boolean, null)

Datatypes can be defined with columns as done in CSVW but less complex (e.g. only string, number, logical).

roll · 2022-12-23T11:38:32Z

Thanks, @nichtich!

I think it should not be a blocker as in specs like this we have a privilege to start from a small core and extend once other properties are discussed and justified

roll added Table Dialect New Spec labels Jul 27, 2020

roll mentioned this issue Jan 7, 2021

dialect/commentChar is not respected during extraction frictionlessdata/frictionless-py#610

Closed

fjuniorr mentioned this issue Jun 10, 2022

Is there support for addressing worksheets within a spreadsheet? #793

Closed

nichtich mentioned this issue Dec 18, 2022

Extend support of CSV with CSV Dialect wader/fq#528

Open

roll mentioned this issue Jan 17, 2023

Migrating from v4 to v5 - how to skip blank rows frictionlessdata/frictionless-py#1375

Closed

roll added this to the v2 milestone Apr 7, 2023

roll changed the title ~~Proposal: Table Dialect Spec~~ Proposal: File Dialect Spec May 1, 2023

roll changed the title ~~Proposal: File Dialect Spec~~ Proposal: Table Dialect Spec Jul 16, 2023

fjuniorr mentioned this issue Dec 22, 2023

Write down versioning / changing rules #858

Open

roll removed the New Spec label Jan 3, 2024

roll changed the title ~~Proposal: Table Dialect Spec~~ Table Dialect Spec Jan 3, 2024

roll mentioned this issue Jan 3, 2024

Table Dialect should be able to handle more csv/excel cases #834

Closed

roll added the epic label Jan 25, 2024

roll self-assigned this Feb 21, 2024

roll mentioned this issue Feb 22, 2024

Table Dialect specification frictionlessdata/datapackage-v2-draft#41

Merged

roll added the proposal label Feb 22, 2024

roll closed this as completed in frictionlessdata/datapackage-v2-draft#41 Apr 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Table Dialect Spec #697

Table Dialect Spec #697

roll commented Jul 27, 2020 •

edited

Loading

rufuspollock commented Jul 27, 2020

roll commented Jul 27, 2020 •

edited

Loading

rufuspollock commented Jul 27, 2020 via email

rufuspollock commented Mar 31, 2021

roll commented Apr 4, 2021

nichtich commented Dec 18, 2022 •

edited

Loading

roll commented Dec 23, 2022

Table Dialect Spec #697

Table Dialect Spec #697

Comments

roll commented Jul 27, 2020 • edited Loading

Overview

Table Dialect

header (bool)

headerRows (int[])

headerJoin (str)

Csv Table Dialect

delimiter (str)

lineTerminator (str)

quoteChar (str)

doubleQuote (bool)

escapeChar (str)

nullSequence (str)

skipInitialSpace (bool)

Excel Table Dialect

sheet (str|int)

Json Table Dialect

keyed (bool)

keys (str[])

rufuspollock commented Jul 27, 2020

roll commented Jul 27, 2020 • edited Loading

rufuspollock commented Jul 27, 2020 via email

rufuspollock commented Mar 31, 2021

roll commented Apr 4, 2021

nichtich commented Dec 18, 2022 • edited Loading

roll commented Dec 23, 2022

roll commented Jul 27, 2020 •

edited

Loading

roll commented Jul 27, 2020 •

edited

Loading

nichtich commented Dec 18, 2022 •

edited

Loading