-
-
Notifications
You must be signed in to change notification settings - Fork 292
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
csv.Sniffer().has_header()
is called when not needed and returns the wrong result
#739
Comments
Here's what I had in mind: def normalize_config(config, head, dialect='excel', skip_lines: int = 0):
"""Using the header line, convert the configuration field name lookups to int indexes.
Args:
config: A dict of Col types to string or indexes.
head: A string, some decent number of bytes of the head of the file.
dialect: A dialect definition to parse the header
skip_lines: Skip first x (garbage) lines of file.
Returns:
A pair of
A dict of Col types to integer indexes of the fields, and
a boolean, true if the file has a header.
Raises:
ValueError: If there is no header and the configuration does not consist
entirely of integer indexes.
"""
if all(isinstance(field, int)
for field_type, field in config.items()):
return config, False
# Skip garbage lines before sniffing the header
assert isinstance(skip_lines, int)
assert skip_lines >= 0
for _ in range(skip_lines):
head = head[head.find('\n')+1:]
index_config = config
has_header = csv.Sniffer().has_header(head)
if has_header:
header = next(csv.reader(io.StringIO(head), dialect=dialect))
field_map = {field_name.strip(): index
for index, field_name in enumerate(header)}
index_config = {}
for field_type, field in config.items():
if isinstance(field, str):
field = field_map[field]
index_config[field_type] = field
else:
raise ValueError("CSV config without header has non-index fields: "
"{}".format(config))
return index_config, has_header
This is a reasonable approach, but not foolproof. It's possible that some users are supplying all int field lookups even though the file has a header, and it's possible that they're relying on the current implementation of I think that's unlikely, but if it isn't, there's an easy workaround: they can increment |
+1 for this as I have csv files that depending on number of transactions and the particular order of them the sniffer returns unreliable results causing me to currently play withe csv files to get them to be read properly. What is being proposed here would effectively allow me a native method to override what the sniffer might think. Here is an example csv file where the sniffer returns incorrect results...
Note if I put double quotes around the Transaction Details column it will then read correctly. Below is an example of a similar file from the same bank that reads just fine without any messing around with it....
Was originally thinking I could somehow fix this by playing around with the dialect used but this is not possible as from my understanding the Sniffer function does not accept a parameter to tell it the dialect to use. In the end it would be great to be able to indicate that there is a header or not to bypass the Sniffer which can be unreliable in many cases. |
csv.Importer()
callsnormalize_config()
infile_date()
andextract()
. After skipping the requested number of lines,normalize_config()
callscsv.Sniffer().has_header()
:beancount/beancount/ingest/importers/csv.py
Line 397 in 79da2f9
I have a CSV file for which
has_header()
returnsTrue
when it should returnFalse
; here's the entire file:beancount
can't do anything about the fact thathas_header()
is returning the wrong result. But! But it can, in this case, avoid callinghas_header()
in the first place.Here's the description of
normalize_config()
:For this CSV file, the
config
field name lookups are already int indexes whennormalize_config()
is called:When the field name lookups are already all int indexes
normalize_config()
should, I think, bail early.(When
has_header()
returnsFalse
normalize_config()
asserts that all indexes are ints. Basically, it should perform that check before sniffing the file. If all indexes are intsnormalize_config()
can returnconfig, False
and avoidhas_header()
entirely.)The text was updated successfully, but these errors were encountered: