I regularly run into CSV files containing zero-padded numbers that are not properly quoted. These can be identifiers, multi-digit codes, US ZIP codes, etc. For example:
person_id, us_zip, education_level
0001,00123,03
0002,01234,08
0003,12345,99
When read by software, these are interpreted as a numeric data type (rightfully so), which then drops the leading zeros, causing various issues (invalid values, data loss, mismatched codes, incorrect data type, etc.).
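To illustrate the loss, here is a minimal Python sketch. `naive_infer` is a hypothetical stand-in for a type-inferring reader (it is not taken from any particular library); it tries to parse each field as an integer and keeps the text otherwise.

```python
import csv
import io

raw = "person_id,us_zip,education_level\n0001,00123,03\n"

def naive_infer(value: str):
    """Hypothetical inference rule: parse as int if possible, else keep the text."""
    try:
        return int(value)
    except ValueError:
        return value

rows = list(csv.reader(io.StringIO(raw)))
parsed = [naive_infer(v) for v in rows[1]]
print(parsed)  # [1, 123, 3] -- the padding is gone and cannot be recovered
```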
The proper behaviour is to surround these with quotes, like:
person_id, us_zip, education_level
"0001","00123","03"
"0002","01234","08"
"0003","12345","99"
So... could such a quality-of-life utility be added to our favorite QSV toolkit? It would essentially parse a CSV file and either update it or create a new properly encoded version.
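As a rough idea of what such a utility might do, here is a Python sketch (not qsv code, and not a proposal for its actual interface). It assumes the first row is a header, treats "a zero followed by one or more digits" as the definition of zero-padded, and quotes every value in any column where at least one such value appears. The names `quote_zero_padded`, `_encode`, and `ZERO_PADDED` are made up for this example.

```python
import csv
import io
import re

# Assumed rule: a leading zero followed by at least one more digit.
ZERO_PADDED = re.compile(r"0\d+")

def _encode(field: str, force: bool) -> str:
    """Quote the field if forced or if CSV syntax requires it."""
    if (force and field) or any(c in field for c in ',"\r\n'):
        return '"' + field.replace('"', '""') + '"'
    return field

def quote_zero_padded(text: str) -> str:
    rows = list(csv.reader(io.StringIO(text, newline="")))
    if not rows:
        return ""
    header, data = rows[0], rows[1:]
    # First pass: find columns holding at least one zero-padded value.
    padded = {i for row in data for i, f in enumerate(row) if ZERO_PADDED.fullmatch(f)}
    # Second pass: re-emit, quoting every non-empty value in those columns.
    out = [",".join(_encode(f, False) for f in header)]
    for row in data:
        out.append(",".join(_encode(f, i in padded) for i, f in enumerate(row)))
    return "\n".join(out) + "\n"
```

Because the decision is made per column, this needs two passes over the data (or the whole file in memory, as here), which is the same cost mentioned below for inference routines.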
I unfortunately do not think there is a fix for this in tab-delimited or other delimited files (or fixed-width ASCII); the option there would be to convert to CSV (which such a utility could also take care of?).
Beyond this, parsers and other inference routines could be instructed to take this into account (which typically requires a double pass or top rows scan).
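A top-rows scan of that kind could look roughly like the Python sketch below. It is only an illustration under the same assumed definition of zero-padded (a zero followed by more digits); `text_columns` and its `sample_rows` parameter are hypothetical names, not part of any existing tool.

```python
import csv
import io
import re

# Assumed rule: a leading zero followed by at least one more digit.
ZERO_PADDED = re.compile(r"0\d+")

def text_columns(text: str, sample_rows: int = 1000) -> set[str]:
    """Return the columns whose sampled values include a zero-padded number."""
    reader = csv.DictReader(io.StringIO(text, newline=""))
    found = set()
    for n, row in enumerate(reader):
        if n >= sample_rows:
            break  # only the top rows are inspected
        found.update(
            name for name, value in row.items()
            if isinstance(value, str) and ZERO_PADDED.fullmatch(value)
        )
    return found
```

A reader could then keep the returned columns as strings and infer types for the rest. The trade-off is that a small `sample_rows` can miss a column whose first padded value appears later in the file, which is why a full first pass is the safer option.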
Thoughts?