
Tabular data #58

Closed
davidread opened this issue Apr 17, 2018 · 72 comments

Comments

@davidread commented Apr 17, 2018

Category: Data

Challenge Owner

David Read, tech arch at MoJ's Analytical Platform. Background with GDS on data.gov.uk and the Better Use of Data team.

Short Description

Tabular data (e.g. CSV) is the most common data format but it is loosely defined and users would benefit from standardizing on the details.

This challenge is not about Excel workbooks or similar. It is about data that is primarily consumed by machines/software, rather than humans.

This challenge is not about metadata (e.g. schema / column types, licence) or validation. That's covered in challenge #40, and the options there, including CSV on the Web and Tabular Data Package, both put metadata in a separate file, so that is a separate conversation.

User Need

Off the top of my head:

  • As a data scientist I want to open the file directly in Python or R into a tabular data structure (e.g. DataFrame) without having to wrangle it (frictionless access to data) so that I can efficiently analyse it (see the sketch after this list)
  • As a developer I want to do simple processing of the data in a range of simple software (e.g. command-line bash, Javascript, Java)
  • As a citizen with low data literacy I want to open the file directly into Excel to view the table so that I can browse government data (secondary use case)
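
For the first need, a minimal sketch of what "frictionless" should mean in practice, assuming a well-formed CSV at a hypothetical URL:

import pandas as pd

# Hypothetical URL; with a well-formed CSV, this one line is all the
# "wrangling" a data scientist should ever need.
df = pd.read_csv("https://www.example.gov.uk/spend.csv")
print(df.head())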

Expected Benefits

We want to encourage government users and citizens to use government data more, for greater understanding and decision-making. There are plenty of barriers to this, including skills, tools, access, licensing etc, but one small but significant one is the proliferation of CSV variants. These often require users to do extra work (illustrated in the sketch after this list):

  • configure 'dialect', such as quote character, escape character, line ending
  • collapse multiple header rows into one, or create a missing one
  • character encoding conversion (happily this is covered by an existing government open standard)
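
For illustration, a sketch of the boilerplate this extra work means for a pandas user (file name and dialect details hypothetical):

import pandas as pd

# Hypothetical awkward file: every option here is extra work that a
# well-formed CSV would make unnecessary.
df = pd.read_csv(
    "awkward.csv",
    sep=";",                  # non-default separator ('dialect')
    quotechar="'",            # non-default quote character
    skiprows=2,               # collapse decorative header rows
    encoding="windows-1252",  # not the UTF-8 open standard
)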

Examples of bad tabular data:

Functional Needs

The functional needs that the proposal must address.

@edent (Contributor) commented Apr 17, 2018

Thanks for this David. Let me provide some background which might help with this challenge.

The Open Standards Board does not create standards. We look at what existing standards best meet our users' needs. This means that no one ever has to develop an interoperability library for a UK Gov-specific standard. In this case, we would look to see what CSV standards there are, and seek to agree on one.

At the moment, CSV is covered by our UTF-8 standard. That is, if it is a plain text file, written in Unicode, encoded with UTF-8, it meets our definition of "open".

We had previously looked at RFC4180 - but it does present some problems.

  • Requires CR/LF as line endings. Mac and Linux tend to produce just LF endings.
  • It specifies ASCII - although notes other encodings could be used.
  • It is not a living standard. The only updates have been around meta-data.

So, my question to you is - which bodies are working on CSV standardisation?

@davidread (Author) commented Apr 17, 2018

ODI have written guidelines for a tightly specified CSV to accompany their CSV Lint checker tool: https://csvlint.io/about#how_write_csv
No doubt there are others - let's hear more suggestions.

@edent (Contributor) commented Apr 17, 2018

It's interesting looking at the results on a couple of CSVs: https://csvlint.io/package/5ad5c19193ede3000900001d

If anyone knows of other validators and/or standards, we would be very interested.

@davidread (Author) commented Apr 17, 2018

Tool-wise, there's also the OKI Good Tables tool: http://goodtables.okfnlabs.org/
which seems pretty similar, but in Python rather than Ruby.

@edent I get that the Open Standards Board doesn't want to get into the process of developing a standard. But there would be some merit in agreeing standards amongst gov. Maybe we should instead aim to get something into the Tech Code of Practice?

@edent (Contributor) commented Apr 17, 2018

To clarify, we're happy to agree on an existing standard. But we won't create a new standard specifically for Government.

@MikeThacker1 commented Apr 17, 2018 (comment minimized)

@davidread (Author) commented Apr 17, 2018

@MikeThacker1 Maybe I've missed something here, but CSVW defines a schema file that is a partner for the CSV file. This challenge is about tightening up the CSV file itself.

@MikeThacker1 commented Apr 17, 2018 (comment minimized)

@edent (Contributor) commented Apr 18, 2018

@davidread Do you know how many CSVs hosted on GOV.UK don't meet your use cases?

I'm trying to understand the scale of the problem.

I've picked a random sample, and they all opened fine in LibreOffice and were read by Python/Pandas.

Do you have some examples of files which don't? Thanks.

@davidread (Author) commented Apr 18, 2018

I've extended "Examples of bad tabular data" to include some examples that we came across today.

I don't have numbers. GOV.UK doesn't put their downloads in their API - you'd need to scrape the HTML (ugh). data.gov.uk do have an API. I've simply not done the analysis.

The ONS example above is what sparked me to write this, and I think it is true for all their thousand-odd datasets on their newish site. A significant proportion of statistical analyses use ONS's data to adjust for inflation, for example.

Traditionally a statistician would open this data in Excel and copy/paste it into their calculations sheet. We're trying to persuade them to change the way stats are created to be a reproducible R/Python script, where the first line downloads the latest data from the source and it's dropped into a DataFrame.

Sure, R & Pandas have lots of auto-magic and options to deal with most of these annoyances, but not all. And even those, the best tools available, require tinkering to get it right every time. It's unnecessary friction. It discourages casual & exploratory use, which is at the root of all data work.

It seems such a simple thing to agree what makes a "good-enough" CSV and it would be a small but useful push in the right direction.

@nacnudus commented Apr 19, 2018

Another angle to take would be to add an automatic csvlint step when publishing CSVs to alert publishers to any problems.

@edent (Contributor) commented Apr 19, 2018

There are three distinct problems here:

  1. CSVs which are corrupt.
  2. CSVs which have poor structure / data.
  3. CSVs which contain data useful to a person reading the file, but annoying for automatic machine readability.

The first I agree could be solved by running a linter on upload to check for UTF-8, proper escapes, line endings, and similar.
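
As a sketch of the kind of checks such a lint step might run (a toy illustration, not a substitute for a full validator like csvlint or goodtables; file name hypothetical):

def lint_csv(path):
    """Rough lint pass: UTF-8 decoding, CRLF line endings, quoting."""
    problems = []
    raw = open(path, "rb").read()
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as err:
        return [f"not valid UTF-8: {err}"]
    # RFC 4180 expects CRLF; any LF left after removing CRLF pairs is bare.
    if "\n" in text.replace("\r\n", ""):
        problems.append("bare LF line endings found")
    # RFC 4180 escapes quotes by doubling them, so a valid file contains
    # an even number of double-quote characters.
    if text.count('"') % 2 == 1:
        problems.append("unbalanced double quotes")
    return problems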

The second can only be solved by contacting the producer of data and explaining the problem.

As for the third... I think this will be a difficult issue to solve satisfactorily. Many of the users of the datasets are happy to open in a spreadsheet app, then manually manipulate the data. The metadata will be actively helpful for them.

This could be solved by using CSV on the Web - but it would require a massive change to the way publishers work. I suspect most of them design a complex spreadsheet and then just hit "Save As CSV".

We have three users - the producer, the consumer, the computer. Making a mandatory change to "no metadata" CSVs places a burden on the producer and consumer for the benefit of automatic tools.

So, actions for all of us:

  • We should agree on an acceptable open-source CSV validator, see how many current CSVs would fail, and see if it can be integrated into the publishing process. @stevenjmesser to look into if this can be added to the publishing workflow.
  • @davidread can you contact some of the departments to explain the issue and see whether they can produce either suitable CSVs or a JSON API for you to connect to?
  • Let's gather some examples of CSVs published in 2018 which fail to meet our expectations so we can understand the range of problems.

@davidread (Author) commented Apr 19, 2018

@nacnudus I like it - this seems like a useful step, although the user would have to be amenable. I feel it really needs to be backed up by a standard, for two reasons:

  • to push any heel-dragging data publishers up to an acceptable level
  • to see if we can find consensus, by pooling bodies of expert opinion and user research from the likes of ONS, because we currently have different views

@davidread (Author) commented Apr 20, 2018

@edent I like your suggested actions. I'll look to get some more recent evidence and see if I can bring some departments with differing views into the discussion.

@rufuspollock commented May 4, 2018

Frictionless Data specs are directly relevant here.

There's a lot of tooling support for these, with libs in lots of languages and a very full-featured validator library and service: http://github.com/frictionlessdata/goodtables-py & https://goodtables.io/ (http://goodtables.okfnlabs.org/).

I note goodtables includes both structural checking (blank rows, columns etc) and validation against table schema.

We also have extensive tools for inferring schemas from CSVs e.g. infer method in https://github.com/frictionlessdata/datapackage-py and https://github.com/frictionlessdata/datapackage-js

I'd also note that Table Schema and Tabular Data Package were the original inspiration for the W3C work (and I was an editor / contributor on both specs), but Table Schema / Tabular Data Package are much simpler specs, which is reflected in the breadth of implementation support (e.g. Table Schema libs in multiple languages, Table Schema support in pandas, etc.).

@davidread (Author) commented May 8, 2018

@rufuspollock Let's not get distracted by schemas or tooling here - that's a big world and is covered by other challenges.

The focus of this challenge is simply to agree on a CSV standard that makes it easy to load using stock CSV libraries everywhere.

@danbri commented May 15, 2018

As a data scientist I want to open the file directly in Python or R into a tabular data structure (e.g. DataFrame) without having to wrangle it (frictionless access to data) so that I can efficiently analyse it

If [our hypothetical user] is going to call themselves a scientist, they ought to feel at least some obligation to read the footnotes, caveats, supporting documentation, and think about the less obvious characteristics of the data. That may sound harsh and I don't intend it meanly, but this is currently a huge problem with scaling up the modern machine learning scene to engage with more chaotic real world datasets.

@davidread (Author) commented May 16, 2018

@danbri I agree that supporting documentation should be available alongside the data, but that's not what I meant this to be about. I can see that "frictionless access to data" is suggestive of how it might be downloaded, but I mean it in terms of going from a CSV file on your local disk to a dataframe or equivalent data structure in common languages.

@nacnudus commented May 21, 2018

I agree with @davidread that this proposal is about the fact that common tabular data formats don't behave as well as, say, JSON, so we should standardise on one that does behave well.

@edent (Contributor) commented May 21, 2018

@davidread @nacnudus can you suggest a good standard that we could look at?

@davidread (Author) commented May 21, 2018

So, what about RFC4180?

I previously suggested the ODI CSV guidelines, but actually RFC4180 (which you suggested) covers most of these, such as each row being a record, requiring the same number of fields for every row, not allowing multiple header rows, and not allowing blank rows (see the sketch below).
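
A minimal sketch of checking those structural rules with Python's standard csv module (file name hypothetical):

import csv

# RFC 4180 structural checks: a single header row, the same number of
# fields in every record, and no blank rows.
with open("data.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))

header = rows[0]
for lineno, record in enumerate(rows[1:], start=2):
    if not record:
        print(f"line {lineno}: blank row")
    elif len(record) != len(header):
        print(f"line {lineno}: {len(record)} fields, expected {len(header)}")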

My one disagreement with RFC4180 is that the header row is optional - I had an annoyance with a header-less meteorological dataset before. But I don't see this as a big deal.

I'm actually ok with the Windows line endings, because they have the huge advantage of being compatible with Excel over many years. Excel is the most widely used piece of software in the data space, and only outputs Windows line endings. Arguably we cause a bigger issue for Excel users if we select mac/unix line endings than vice-versa. So Windows line endings seem best at meeting user needs to me, in this case. (The ODI guidelines agree too.) Happy to hear any other thoughts and reasoning.

RFC4180 allows the gov's standard of UTF8, so there seems no issue there.

You say RFC4180 is not a living standard, and IETF describe it as 'informational' rather than a standard. But I notice they did publish errata only last month. It's not changed much, but maybe it doesn't need to. And actually one of its advantages is that it's simple and has been around for decades.

The problems I've been talking about with CSVs are abuses compared to RFC4180. If we can agree on and bless this one then it could be a useful push to publishers to fall into line and make life easier for government data users.

@nacnudus commented May 23, 2018

Here's a recent paper that analyses a random sample of 80 CSV files from data.gov.uk, and considers RFC 4180 and some state-of-the-art CSV parsers in R and Python.

I would summarise the paper, but the whole thing seems relevant.

@edent (Contributor) commented May 23, 2018

OK. Let's get started with this challenge. I suggest that this be specifically for datasets published as "Machine Readable Data". That is, if users want to publish data as CSV and it is intended to be consumed by humans, that's fine.

@Lawrence-G please can you start the assessment of https://tools.ietf.org/html/rfc4180

@davidread please can you find someone from data.gov.uk and GOV.UK to help us understand whether a validator can be added to their publishing process?

@rufuspollock commented May 23, 2018

@davidread

The focus of this challenge is simply to agree on a CSV standard that makes it easy to load using stock CSV libraries everywhere.

OK, that's clearer to me. I'd suggest you'd still want to look at http://frictionlessdata.io/specs/csv-dialect/ because it will allow you to describe / validate a bunch more CSVs. Many CSVs in the wild don't limit themselves to compliance with https://tools.ietf.org/html/rfc4180, i.e. they don't always use "," as the separator, they have different line endings, etc.
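
For a flavour of what a dialect description covers, a sketch along the lines of the CSV Dialect spec (property names from my reading of the spec; treat the details as illustrative rather than authoritative):

# A dialect description is a small JSON-style object naming the CSV
# conventions a particular file actually uses, e.g.:
dialect = {
    "delimiter": ";",           # not everything uses ","
    "quoteChar": "\"",
    "lineTerminator": "\r\n",
    "header": True,             # the first row is a header
}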

@davidread @edent

@davidread please can you find someone from data.gov.uk and GOV.UK to help us understand whether a validator can be added to their publishing process?

I would also mention that goodtables has a structure (not schema) validator that helps you pick up a lot of common CSV structure bugs, e.g. missing headers, blank rows, missing columns, etc: https://github.com/frictionlessdata/goodtables-py

@Lawrence-G (Contributor) commented Jun 5, 2018

Our assessment of RFC 4180, using the CAMSS-based list of questions.

Formal specification

Q. 1. Does it address and aid interoperability between public administrations?

A. Yes
RFC 4180 should allow data in comma separated value (CSV) file format to be shared, processed and analysed more easily.

Q. 2. Does it address and aid the development of digital services in government?

A. Yes
Systems that use data should be easier to develop for tabular data that is published in a simple machine readable format.

Q. 3. Are the functional and non-functional requirements for the use and implementation of the specification clearly defined?

A. Yes
There are few requirements for this standard and they are well documented.

Q. 4. Is it possible to implement the specification across different domains?

A. Yes
Any domain where data is stored and exchanged as text in comma separated value (CSV) format.

Q. 5. Is it largely independent from products of single providers, either open source or proprietary?

A. Yes
This standard is independent of any supplier.

Q. 6. Is it largely independent from specific platforms?

A. Yes
Applies to text files in CSV format and is applicable to all platforms

Q. 7. Has the standard been written so that it can be delivered or used with more than one technology (for example XML and JSON)?

A. No
This standard only applies to CSV text files

Q. 8. Has the specification been sufficiently developed and existed long enough to overcome most of its initial problems?

A. Yes
Originally drafted in 2003

Q. 9. Are there existing or planned mechanisms to assess its conformity and implementation - for example conformity tests, certifications and plugfests?

A. Yes
There are several tools that can be used on and offline. Perhaps the best known is CSV lint - https://csvlint.io/ from the ODI

Q. 10. Does it have sufficient detail, consistency and completeness for the use and development of products?

A. Yes
Though scope is limited to tabular data

Implementation of the formal specification

Q. 11. Does it provide current implementation guidelines and documentation for the implementation of products?

A. Yes
The documentation has notes for implementation.

Q. 12. Does it provide a reference (or open source) implementation?

A. No
Not required as this is a simple standard to implement. There are examples available online that conform to the standard

Q. 13. Does it address backwards compatibility with previous versions?

A. Not Applicable
There are no previous versions of RFC 4180

Q. 14. Are the underlying technologies for implementing it proven, stable and clearly defined?

A. Yes
Underlying technologies are encoded text arranged in tabular form, e.g. UTF-8

Openness

Q. 15. Is information on the terms and policies for the establishment and operation of the standardisation organisation publicly available?

A. Yes
The information is available at http://www.ietf.org/about/

Q. 16. Is participation in the creation process of the formal specification open to all relevant stakeholders (such as organisations, companies or individuals)?

A. Yes
IETF have a formal review and approval process. See - https://www.ietf.org/about/participate/

Q. 17. Is information on the standardisation process publicly available?

A. Yes
See https://www.ietf.org/standards/process/

Q. 18. Is information on the decision-making process for approving formal specifications publicly available?

A. Yes
See https://www.ietf.org/standards/process/role-iesg-standards-process/

Q. 19. Are the formal specifications approved in a decision-making process which aims at reaching consensus?

A. Yes
“Any action made by an Area Director or the IESG may be made the subject of the conflict resolution mechanisms set out in RFC 2026, the primary aim of which is to resolve conflicts and move the IETF as a whole towards consensus.” See https://www.ietf.org/standards/process/appeals/

Q. 20. Are the formal specifications reviewed using a formal review process with all relevant external stakeholders (such as public consultation)?

A. Yes
Explained on this page: https://www.ietf.org/standards/process/informal/

Q. 21. Can all relevant stakeholders formally appeal or raise objections to the development and approval of formal specifications?

A. Yes
See the process here - https://datatracker.ietf.org/doc/rfc2026/

Q. 22. Is relevant documentation of the development and approval process of formal specifications publicly available (such as preliminary results and committee meeting notes)?

A. Yes
IETF drafts are published publicly at http://www.ietf.org/id-info/ Discussions take place online in archived mailing lists

Access to the formal specification

Q. 23. Is the documentation publicly available for implementation and use at zero or low cost?

A. Yes
RFC 4180 is free to view and download

Q. 24. Is the documentation of the intellectual property rights publicly available (is there a clear and complete set of licence terms)?

A. Yes
See here - https://www.rfc-editor.org/info/bcp79

Q. 25. Is it licensed on a royalty-free basis?

A. Yes
There are no royalties payable for use of RFC 4180

Versatility/flexibility of the proposed standard

Q. 26. Has the formal specification been used for different implementations by different vendors/suppliers?

A. Yes
It is difficult to be sure but it is probable.

Q. 27. Has the formal specification been used in different industries, business sectors or functions?

A. Yes
The standard is widely used where tabular data is stored and published

Q. 28. Has interoperability been demonstrated across different implementations by different vendors/suppliers?

A. Yes
CSV format aids interoperability across products from different vendors, e.g. Excel, Google Sheets. This standard's aim is to improve on this.

End user effect of the formal specification

Q. 29. Do the products that implement it have a significant market share of adoption?

A. Yes
Popular spreadsheet and database programs are able to support RFC 4180

Q. 30. Do the products that implement it target a broad spectrum of end-uses?

A. Yes
Use in all sectors, including government, industry and academia.

Q. 31. Does it have strong support from different interest groups?

A. Yes
Including the ODI and The National Archive.

Q. 32. Is there evidence that the adoption of it supports improving efficiency and effectiveness of organisational process?

A. Yes
It should do, but evidence is difficult to find.

Q. 33. Is there evidence that the adoption of it makes it easier to migrate between different solutions from different providers?

A. Yes
It should do, but evidence is difficult to find.

Q. 34. Is there evidence that the adoption of it positively impacts the environment?

A. No
It should have no impact on the environment

Q. 35. Is there evidence that the adoption of it positively impacts financial costs?

A. Yes
RFC 4180 will save time spent wrangling data by data scientists

Q. 36. Is there evidence that the adoption of it positively impacts security?

A. No
RFC 4180 should not have any impact on security. Note the security considerations in the RFC https://tools.ietf.org/html/rfc4180#page-5

Q. 37. Is there evidence that the adoption of it can be implemented alongside enterprise security technologies?

A. No
Though RFC 4180 is not concerned with security, there are no barriers expected to implementing it alongside enterprise security technologies.

Q. 38. Is there evidence that the adoption of it positively impacts privacy?

A. Not Applicable
No privacy implications

Q. 39. Is it largely compatible with related (not alternative) formal specifications in the same area of application?

A. Yes
CSV files can usually be exchanged between systems despite differences in formatting, though some manual transformation may be required.

Q. 40. Is there evidence that the adoption of it positively impacts accessibility and inclusion?

A. Not Applicable
No accessibility and inclusion implications

Maintenance of the formal specification

Q. 41. Does it have a defined maintenance organisation?

A. Yes
This is an IETF standard

Q. 42. Does the maintenance organisation provide sufficient finance and resource to control short-to-medium-term threats?

A. Yes
The organisation has a steady funding stream. IETF gets money from meeting fees and sponsorship, including from the Internet Society, which itself is funded through membership.

Q. 43. Does the maintenance organisation have a public statement on intention to transfer responsibility for maintenance of it, if the organisation were no longer able to continue?

A. No
No public statement found after extensive web search. We assume, therefore, that none exists. In the absence of a public statement, note that this standards body has been in continuous existence since 1986.

Q. 44. Does it have a defined maintenance and support process?

A. Yes
Maintained by the IETF (https://www.ietf.org/), with the process defined at https://www.ietf.org/about/standards-process.html

Q. 45. Does it have a defined policy for version management?

A. Yes
An RFC has a clear lineage, citing other RFCs which it obsoletes or which supersede it.

Related European standards

Q. 46. Is this an existing European standard or an identified technical specification in Europe? (Note: CEN, CENELEC or ETSI are the European standards bodies. Technical specifications provided by organisations other than CEN, CENELEC or ETSI can be under consideration to become a European standard or an identified technical specification in Europe.)

A. No

Q. 47. Does this specification or standard cover an area different from those already identified or currently under consideration as an identified European standard or specification?

A. Yes

@davidread (Author) commented Jun 5, 2018

One thing that RFC4180 doesn't address is whether or not to have a BOM at the start of the file to indicate UTF8 encoding. However I don't think we can or should standardize to have a BOM or not. Leave it up to the publisher to decide. I just wanted to note it here with my reasoning, and in case anyone has further thoughts:

There's a lot of discussion e.g. this one. For me the key points are:

For BOM:

  • best compatibility with Excel (which is the widest used piece of software in the data space). If a CSV doesn't have a BOM then any non-ASCII characters will appear mangled in Excel, unless you use Excel's "Import Text" wizard and specify UTF8 manually.

Against BOM:

  • aside from Excel, the majority of software doesn't understand it, and you end up with the first column heading prepended with "". In Pandas you need v0.20.3+ and pd.read_csv(..., encoding='utf-8-sig'), and in R do read.csv(..., fileEncoding = "UTF-8-BOM").
  • a BOM breaks ASCII compatibility, one of the advantages of UTF8.

So you have a choice between seamless use in Excel (with a BOM) or seamless use in everything else (without a BOM). And I think this should remain a choice for people publishing CSVs in government.
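
To make the trade-off concrete, a small sketch of how the BOM behaves in Python:

# The UTF-8 BOM is three bytes (EF BB BF) in front of the first heading.
data = "\ufeffheading,£\n1,2\n".encode("utf-8")

print(data[:3])                      # b'\xef\xbb\xbf'
print(data.decode("utf-8")[:1])      # '\ufeff' leaks into the first cell
print(data.decode("utf-8-sig")[:7])  # 'heading' - BOM stripped correctly

# Decoded as Windows-1252, those three bytes are the classic "" that
# gets prepended to a mangled column heading.
print(data[:3].decode("windows-1252"))  # ''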

@stevenjmesser commented Jun 12, 2018

please can you find someone from data.gov.uk and GOV.UK to help us understand whether a validator can be added to their publishing process?

Happy to help!

@frankieroberto commented May 17, 2019

@arnau good questions!

The problem with CSV files is that, as plain text files, there's nowhere to specify the character encoding once you've downloaded them. When you view them in a browser over HTTP, the HTTP header should in theory specify the character encoding as part of the MIME type (text/csv; charset=utf-8). However, in practice, very few servers are configured to do this.

The BOM was invented for UTF-16 files (where it's required), and serves no practical purpose in UTF8 files other than as an indicator of the character encoding.

Without the BOM, and with the character encoding either missing from the HTTP header (or missing because it's being loaded from a local filesystem), applications have to either use heuristics to guess at which encoding is being used, or ask the user (who may not know).

I've yet to discover any applications which process UTF8 CSV files but choke on the BOM – however they may exist? But there are plenty of applications, including Excel and most web browsers (when rendering CSV files directly) for which the presence of the BOM makes a difference.

I put together this test website a while back which you can use to check your browser (or application, if you download first): https://csv-encoding-test.herokuapp.com/

@davidread (Author) commented May 17, 2019

@frankieroberto Yes, Windows tools (Excel, PowerBI, Tableau) detect the BOM and handle it ok, but I really think BOMs do cause problems in the *nix & open source data analysis world, which is very important too. Even Python and R with their huge data science communities have lots of friction. By default you either get some awful error message, or you end up with the first column heading prepended with: "".

Examples:

  • In Pandas you need v0.20.3+ and pd.read_csv(..., encoding='utf-8-sig')
  • In R do read.csv(..., fileEncoding = "UTF-8-BOM")
  • @arnau gave the example of using cat, and of course there's usually lots of mileage in other *nix tools: grep, sort, wc -l etc. I expect there is a command-line tool to strip the BOM, though.
  • With csvkit you need to add an encoding parameter csvsql -e utf-8-sig

In all these cases, it's hard work figuring out what the "" is and how to get rid of it. Plenty of friction.

If you don't have a BOM, then Excel will display incorrectly anything non-ASCII, like those "£" signs. But maybe that's not too bad? (Maybe in a non-English-speaking country we'd have a different view.) If we have to choose between avoiding horrid error messages in the popular open source data analysis tools and avoiding the occasional symbol looking wrong in Excel, I think overall the former is better. But I'm very happy to hear more evidence that swings the balance.

Current conclusion: users will be happier without BOMs

This choice would come with the advantage that it is what the RFC says. But it does mean that when saving CSVs from Windows Excel, and there are non-ASCII characters, you can't simply use "Save As". Another step will be needed to remove the BOM. (Someone should probably write a good .xls->.csv converter!)
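
That extra step can be a one-liner; a sketch of stripping a leading BOM in Python (file names hypothetical):

# Decoding with 'utf-8-sig' drops a leading BOM if present; re-encoding
# as plain UTF-8 writes the file back without one.
with open("excel_export.csv", encoding="utf-8-sig", newline="") as src:
    text = src.read()
with open("no_bom.csv", "w", encoding="utf-8", newline="") as dst:
    dst.write(text)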

@nacnudus commented May 17, 2019

Based on the experiments below, I think the proposal should recommend no BOM.

  • Excel itself doesn't create files with a BOM, so an Excel->CSV->Excel round trip will never work. (@davidread this is not what you seem to have found)
  • The vast majority of CSV files are created by Excel, so people would ignore any recommendation to use a BOM.
  • Every GOV.UK csv file in a sample of ~50 that had a £ symbol had mangled it, so BOM or no BOM nothing could have read it.

I tried R read.csv() and readr::read_csv() and Python pandas.read_csv() and they worked whether or not the BOM was present.

Excel 2010 (on Windows 10, using the File Open dialogue) couldn't read a CSV containing £ correctly unless it had the BOM, but using Excel's "save as MS Dos .csv" mangles the £ and then creates a file without a BOM. So the Excel->CSV->Excel round trip doesn't work.

I opened a CSV with a BOM in Excel, then "save-as" defaulted to "unicode". I think it turned out to be UTF-16 but at any rate it defeated everything except Excel itself and Vim -- nothing else could interpret it.

I tried about 50 CSV files sampled from GOV.UK, and every file that contained £ failed, which suggests that the vast majority of CSV files are created using Excel.

@arnau commented May 20, 2019

@frankieroberto thanks for putting together the test website, really useful.

@davidread @nacnudus thanks, these results make me more confident that we should recommend no BOM.

Also, given how difficult it is to open or export CSV in Excel, perhaps it would be good to provide some guidance on how to do it? Probably better if it also describes other tools/methods like OpenOffice, Google Sheets, etc.?

As an aside, when you say "Excel", would you mind specifying the program version and OS version? And, ideally, the way you opened the CSV (e.g. drag and drop, open file).

@davidread (Author) commented May 20, 2019

Thanks very much, all. The BOM question is a real can of worms! Here's my summary:

For BOM:

  • Excel is pretty bad with UTF8 generally, but the BOM does help it somewhat

Against BOM:

  • R & Python libraries can read a BOM, although for the frictionless examples @nacnudus gave I still suspect this requires recent versions, and it doesn't apply to Python's standard csv library.
  • *nix command-line tools do not like BOMs.
  • Desktop tools like Tableau

I guess it comes down to the Excel users, and whether we see them as primary users of this machine-readable data, or just secondary. The research we did on DGU showed that Excel is by far the most used tool. I'm not clear if this was just for viewing the file (which I'd argue is just a nice-to-have for machine-readable data), or led to actual basic data work by business analysts, financial sector, managers, etc. Clearly Python/R/*nix and other stuff is what is used by technical professions - analysts / statisticians / data scientists / developers.

@arnau and others, what's your experience of users and their needs for machine-readable data?

@frankieroberto commented May 21, 2019

@davidread good summary. However I don’t think it’s just Excel which can use the BOM. Most web browsers can too – and I think they're an important use-case, as lots of people like to preview a CSV file before downloading it (to check what’s in it, or that they’ve found the right one). You don’t need a BOM to make the CSVs render in browsers, but it can be easier than making sure the files are served with the UTF8 character set in the MIME type.

Does anyone have R Studio or SPSS that can check whether they cope with the BOM or not?

And can you give any examples of UNIX tools not liking the BOM? I just tried grep on macOS, and it worked fine. Ruby seems to handle it too.

Based on this, I still think we should recommend a BOM, if possible (as well as the text/csv; charset=utf-8 content-type http header) – but acknowledge that this isn’t always possible, that there are pros and cons, and that it’s not part of the standard but is supplemental guidance.

@arnau commented May 21, 2019

@frankieroberto the one that I mentioned before is cat. When you concatenate two files with BOMs using a tool that does not deal with encoding (e.g. cat), the result is a file with a BOM in the middle, which is wrong. I'll try to find some time to test whether this behaviour generalises to other tools/languages for this operation specifically.

The one I use, xsv, handles it correctly, although it drops the BOM from the resulting file, which makes the result not ready for publication (if we require a BOM, that is).

@frankieroberto commented May 21, 2019

@arnau ah, that's interesting. Does cat do the same for UTF16 files?

The Wikipedia page on the BOM mentions "many pieces of software on Microsoft Windows such as Notepad treat the BOM as a required magic number rather than use heuristics".

I don’t think anyone is suggesting we require (or forbid) the BOM? This is just about guidance.

Ultimately, the decision is down to service teams, who may need to decide between prioritising Excel/web browser/desktop application users vs automated/unix pipeline users?

@davidread (Author) commented May 21, 2019

Yes I think you're right @frankieroberto, the outcome of the BOM discussion is just a recommendation accompanying the choice of standard, so holds less weight, but this discussion may help inform service teams.

I just did some playing with Excel 2016 for Mac (16.16.9):

  • opening a CSV without BOM - opens fine - just mangles the £. This is fixed if you use 'Text Import Wizard'.
  • opening a CSV with BOM - takes me to the import wizard where it detects UTF8 fine (I think it went to the wizard because it was such a small file to use heuristics to detect the column separator, so not a worry)
  • "save as CSV" - in the drop-down of "Common formats" it offers "CSV UTF-8 (Comma delimited)" and it gets a BOM. Tucked away in "Specialty formats" are also "Comma separated values", "Macintosh Comma Separated" and "MS-DOS Comma Separated" which all have no BOM but encodes the "£" as a single byte - i.e. ASCII8 encoded
  • round-trip with BOM (open a CSV with BOM and save it again) - works fine - it comes out with a BOM
  • round-trip without BOM (open a CSV without BOM and save it again) - had to use 'Text Import Wizard' to open, and can't find a way to save as UTF8 without BOM.

So in general Excel fared a bit better than what you got @nacnudus - I guess Excel has improved a bit between 2010 and 2016. Now that Excel has the new way to save UTF-8 with a BOM, it is working sensibly with a UTF-8 and a BOM. However Excel still doesn't read UTF-8 CSVs without a BOM, unless you use the Text Import Wizard. So @nacnudus I reckon this somewhat undermines the 'Excel is rubbish at UTF-8, whatever we do' argument.

@frankieroberto excellent find that grep is now fine about BOMs. I've tried sort, wc and cat (on Mac) and all just skip the BOM fine.

Latest summary

For BOM:

  • Excel is not great with UTF8 generally. Without a BOM, you need to use Text Import Wizard, otherwise non-ASCII characters get mangled.

A bit against BOM (accepts it fine, but drops it on output):

  • Most unix tools: grep, cat, sort, wc
  • Command-line utils: csvkit, xsv
  • R & Python libraries - read.csv, readr & pandas are frictionless reading a BOM, but saving requires an option. And Python's standard 'csv' library does need an option to read a BOM correctly.

@arnau commented May 22, 2019

@davidread you said:

[...] cat (on Mac) and all just skip the BOM fine.

Out of curiosity, what did you do that works fine? My test looks like this:

$ curl -o utf8_bom.csv https://csv-encoding-test.herokuapp.com/csv/is-utf8-with-bom/as-utf8
$ echo "" >> utf8_bom.csv
$ cat utf8_bom.csv utf8_bom.csv > mix.csv

Results in a broken utf-8 file:

$ bat -p mix.csv
<U+FEFF>CSV, £
<U+FEFF>CSV, £

Using MacOS 10.13.6, curl 7.54.0, default cat, default echo and bat to clearly show BOM.

Also, is anyone testing Excel on Windows? I don't have access to one.

@gheye commented May 22, 2019

Hi,

I am working as a data architect at GDS looking at Data Standards.

I would like to lend my support to the introduction of this standard.

My one comment is:

I would like to see it extended, in government, so that there is always a header row at the top of a file.

I understand for legacy applications and flows this may not be ideal at the moment.

Best Wishes,

Gareth H.

@rufuspollock commented May 22, 2019

Just to say I'm loving this thread and the detail re stuff like BOM - it is really useful.

Also, to understand: is this just about standardizing on a "CSV spec"? If so, is the Frictionless Data CSV Dialect spec https://frictionlessdata.io/specs/csv-dialect/ relevant here (even if just informative)? (I'm also thinking about whether CSV Dialect needs extending with a BOM field.)

@davidread (Author) commented May 22, 2019

@gheye Great! Yes I'm keen to encourage a header row. Very much a user need.

@ldodds mentions that CSVW takes a view on CSV best practices which mostly echoes comments here:

  • start with RFC4180 - we agree
  • also allow unix line endings - this would suit many of the techies and their default tools, but they are a fraction of the users, compared to the Excel users. I'm not clear yet whether Open Standards Board would allow guidance that contravenes the RFC, so I'm tempted to be agnostic for now.
  • default to UTF8 but allow other encodings - declaring a different encoding in the HTTP header only works if the CSV is used in-place "on the web" - rather incompatible with the average user that downloads it before loading it into Excel etc. (although I love the CSVW vision). Gov has agreed to adopt UTF8 only, so this isn't really an issue.

@arnau Ah, you're dead right - thanks for this. I was simply going by what was echoed to the terminal, but yes when you dump the output it's clear the BOM is not dropped. 'cat', 'grep' & 'sort' are being dumb about handling the BOM, treating it as part of the first cell. So basic unix tools don't like BOMs.

Latest summary

For BOM:

  • Excel is not great with UTF8 generally. Without a BOM, you need to use Text Import Wizard, otherwise non-ASCII characters get mangled.

A bit against BOM (aware of it, but has friction working with it):

  • Command-line utils: csvkit, xsv
  • R & Python libraries - read.csv, readr & pandas are frictionless reading a BOM, but saving requires an option. And Python's standard 'csv' library does need an option to read a BOM correctly.

Against BOM (not aware, lots of friction):

  • Most unix tools: grep, cat, sort, wc

@frankieroberto commented May 23, 2019

@davidread in the 'For BOM' section, can you add something like "Allows web browsers to use the correct encoding when it’s not specified in the HTTP header"?

@davidread (Author) commented May 23, 2019

I think [web browsers are] an important use-case, as lots of people like to preview a CSV file before downloading it (to check what’s in it, or that they’ve found the right one)

@frankieroberto I definitely agree with the user need to preview the CSV content. However I'm intrigued by you talking about displaying the file in the browser, because my experience is that when you click on a CSV link, the browser invites you to save it to disk and maybe open in Excel (depending on file associations), rather than displaying it in the browser. That's trying it with Chrome & FF on GOV.UK, data.gov.uk and Racial Facts and Figures, for example. When do you/users see it display in the browser?

@gheye commented May 23, 2019

Hi all,

I think that we should ideally get RFC4180 accepted as a baseline standard and then raise an issue to extend it.

We need a baseline first.

Perhaps get it accepted with known issues that require discussion/agreement.

Thanks.

Gareth H.

@frankieroberto commented May 23, 2019

@davidread Chrome and Firefox both seem to download all text/csv files, but Safari lets you view them in the browser (e.g. this one, which I found via data.gov.uk).

For https://csv-encoding-test.herokuapp.com I cheated slightly (for the sake of testing) by setting the content-type to text/plain, which then does render in all browsers (and they all seem to use the BOM, in the absence of an explicit encoding set within the content-type header).

So yeah, in short, it only really affects Safari. And probably there are better ways to preview CSV files.

@arnau commented Jun 5, 2019

@davidread did we get to any conclusion regarding BOM? Is there anything else we can help with to move this proposal forward?

@davidread (Author) commented Jun 5, 2019

Thanks @arnau. @frankieroberto I've added in the Safari point, and the latest summary is:

For BOM:

  • Excel is not great with UTF8 generally. Without a BOM, non-ASCII characters get mangled. (This can be overcome by using Text Import Wizard.)
  • Safari uses BOM to display CSVs when the header doesn't set the encoding

A bit against BOM (aware of it, but has friction working with it):

  • Command-line utils: csvkit, xsv
  • R & Python libraries - read.csv, readr & pandas are frictionless reading a BOM, but saving requires an option. And Python's standard 'csv' library does need an option to read a BOM correctly.

Against BOM (not aware, lots of friction):

  • Most unix tools: grep, cat, sort, wc

Thinking about user needs, I think we should discourage BOM use. This will minimize friction with the command-line tools and programming languages, most used by analysts, statisticians, data scientists etc. We choose this with the awareness that it is not ideal for Excel users, who will find any non-ASCII characters (e.g. "£") appear incorrectly, but this does not get in the way of them doing basic analysis characteristic of Excel users - calculate a total, draw a chart, filter rows etc. And where it is an issue they can use the 'Import Text' option instead.

I'm taking this to the board tomorrow. I'm very happy to debate this BOM conclusion further, in terms of the user needs. However I think the BOM decision is a small thing in the scheme of things.

@davidread (Author) commented Jun 11, 2019

I'm pleased to report that this has been accepted by the Open Standards Board! Thanks to everyone who contributed their experience and considered opinions to this discussion and supported this.

It will be confirmed by the Board but my understanding from the meeting is:

The board recommends RFC 4180 for publishing machine-readable tabular data. It will advise that:

  • Character encoding - ASCII & UTF-8 are the existing standards for character encoding
  • Line endings - LF (Unix-style) are also acceptable
  • Header rows - one header row is recommended (not zero)

I didn't push for discussion of allowing Byte Order Mark (BOM), following discussion of compatibility and user needs on this issue. So because it is not part of RFC4180 I conclude it is not part of the Open Standards Board recommendation.

The board decided this should be a "recommendation", rather than "mandatory", because the use of tabular data is a huge area and deciding where to draw the line, and enforcing it, is not a realistic task for the Open Standards Board.

The basic message is: CSV is a good choice for tabular data. And if you're publishing a CSV, publish it according to RFC4180 by default.
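
For illustration, a minimal sketch of writing a file along those lines (RFC 4180 structure, UTF-8 without a BOM, one header row; file name and columns hypothetical):

import csv

# csv.writer's default line terminator is CRLF, matching RFC 4180, and
# newline="" stops Python translating it.
with open("spend.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["region", "spend_gbp"])   # exactly one header row
    writer.writerow(["North East", "1200.50"])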

@davidread davidread closed this Jun 11, 2019

@nacnudus commented Jun 11, 2019

Fantastic news. Thank you @davidread for writing the proposal and marshalling it through to acceptance. Here's to clean, standard tabular data!

@Lawrence-G Lawrence-G added Recommended and removed in progress labels Jun 11, 2019

@andyjpb commented Jun 12, 2019

Thanks @davidread!
It's a great piece of work with important implications for the data ecosystem both inside and outside Government.
I'm really glad you got it through and I know that you've been working hard on it for a long time.

@Lawrence-G (Contributor) commented Jun 28, 2019 (comment minimized)

@Lawrence-G (Contributor) commented Jun 28, 2019

The first is for standards recommended for use and the other is for compulsory standards. The two categories, @frankieroberto, have existed in theory in the past, but until this year all were mandated by the OSB (including some where the proposal was to recommend), so we have had to create a second page to list them on. We are going to rework the standards list soon, as we feel the two pages of simple lists is probably not the best solution.

@frankieroberto commented Jun 28, 2019

@Lawrence-G ah, I see. So RFC 4180 is only recommended but not mandated?

Merging the two lists makes sense to me, but in the short term, is it worth linking to the recommended ones from the collection page: https://www.gov.uk/government/collections/open-standards-for-government-data-and-technology (unless it already is and I've missed it)? That's where I'd normally look for existing standards when developing a service.

@davidread (Author) commented Jun 28, 2019

@frankieroberto Yeah, the board decided this should be a "recommendation", rather than "mandatory", because the use of tabular data is a huge area. I made an attempt to define the scope of "use cases where CSV is best" in the proposal, and came across plenty of exceptions pretty quickly. Deciding where to draw that line between CSV and Excel/ODS/JSON/Parquet/Arrow/NetCDF etc., and enforcing it, is not a realistic task for the Open Standards Board. So the message is "CSV is a useful 'lowest common denominator' / general purpose tabular format" and "when you publish CSV, make sure it is RFC4180"
