`csv.Sniffer().has_header()` is called when not needed and returns the wrong result #739

michaellenaghan · 2022-12-05T22:35:55Z

csv.Importer() calls normalize_config() in file_date() and extract(). After skipping the requested number of lines, normalize_config() calls csv.Sniffer().has_header():

beancount/beancount/ingest/importers/csv.py

Line 397 in 79da2f9

has_header = csv.Sniffer().has_header(head)

I have a CSV file for which has_header() returns True when it should return False; here's the entire file:

01/05/2022,E-TRANSFER ***b3y   ,,23.00,25.00

beancount can't do anything about the fact that has_header() is returning the wrong result. But! But it can, in this case, avoid calling has_header() in the first place.

Here's the description of normalize_config():

Using the header line, convert the configuration field name lookups to int indexes.

For this CSV file, the config field name lookups are already int indexes when normalize_config() is called:

           config = {
                csv.Col.DATE: 0,
                csv.Col.PAYEE: 1,
                csv.Col.AMOUNT_DEBIT: 2,
                csv.Col.AMOUNT_CREDIT: 3,
                csv.Col.BALANCE: 4,
            }

When the field name lookups are already all int indexes normalize_config() should, I think, bail early.

(When has_header() returns False normalize_config() asserts that all indexes are ints. Basically, it should perform that check before sniffing the file. If all indexes are ints normalize_config() can return config, False and avoid has_header() entirely.)

The text was updated successfully, but these errors were encountered:

michaellenaghan · 2022-12-05T22:50:57Z

Here's what I had in mind:

def normalize_config(config, head, dialect='excel', skip_lines: int = 0):
    """Using the header line, convert the configuration field name lookups to int indexes.

    Args:
      config: A dict of Col types to string or indexes.
      head: A string, some decent number of bytes of the head of the file.
      dialect: A dialect definition to parse the header
      skip_lines: Skip first x (garbage) lines of file.
    Returns:
      A pair of
        A dict of Col types to integer indexes of the fields, and
        a boolean, true if the file has a header.
    Raises:
      ValueError: If there is no header and the configuration does not consist
        entirely of integer indexes.
    """
    if all(isinstance(field, int)
            for field_type, field in config.items()):
        return config, False

    # Skip garbage lines before sniffing the header
    assert isinstance(skip_lines, int)
    assert skip_lines >= 0
    for _ in range(skip_lines):
        head = head[head.find('\n')+1:]

    index_config = config
    has_header = csv.Sniffer().has_header(head)
    if has_header:
        header = next(csv.reader(io.StringIO(head), dialect=dialect))
        field_map = {field_name.strip(): index
                     for index, field_name in enumerate(header)}
        index_config = {}
        for field_type, field in config.items():
            if isinstance(field, str):
                field = field_map[field]
            index_config[field_type] = field
    else:
        raise ValueError("CSV config without header has non-index fields: "
                             "{}".format(config))
    return index_config, has_header

Immediately check to see if all field lookups are ints. If they are, return the original config and False.
If they aren't, sniff for a header — and if a header isn't found, always raise an error. Because at that point we already know we have non-int field lookups.

This is a reasonable approach, but not foolproof. It's possible that some users are supplying all int field lookups even though the file has a header, and it's possible that they're relying on the current implementation of normalize_config() to skip the header they're already ignoring.

I think that's unlikely, but if it isn't, there's an easy workaround: they can increment skip_lines.

mondjef · 2023-02-26T14:52:08Z

+1 for this as I have csv files that depending on number of transactions and the particular order of them the sniffer returns unreliable results causing me to currently play withe csv files to get them to be read properly. What is being proposed here would effectively allow me a native method to override what the sniffer might think.

Here is an example csv file where the sniffer returns incorrect results...

Date, Transaction Details, Funds Out, Funds In 
02/20/2023,POS MERCHANDISE RETAILER  28,6.38,
02/20/2023,POS MERCHANDISE RETAILER,1.00,,60.35

Note if I put double quotes around the Transaction Details column it will then read correctly. Below is an example of a similar file from the same bank that reads just fine without any messing around with it....

Date, Transaction Details, Funds Out, Funds In 
02/27/2023, TRANSFER IN,,35.00,366.86

Was originally thinking I could somehow fix this by playing around with the dialect used but this is not possible as from my understanding the Sniffer function does not accept a parameter to tell it the dialect to use. In the end it would be great to be able to indicate that there is a header or not to bypass the Sniffer which can be unreliable in many cases.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`csv.Sniffer().has_header()` is called when not needed and returns the wrong result #739

`csv.Sniffer().has_header()` is called when not needed and returns the wrong result #739

michaellenaghan commented Dec 5, 2022

michaellenaghan commented Dec 5, 2022

mondjef commented Feb 26, 2023 •

edited

Loading

csv.Sniffer().has_header() is called when not needed and returns the wrong result #739

csv.Sniffer().has_header() is called when not needed and returns the wrong result #739

Comments

michaellenaghan commented Dec 5, 2022

michaellenaghan commented Dec 5, 2022

mondjef commented Feb 26, 2023 • edited Loading

`csv.Sniffer().has_header()` is called when not needed and returns the wrong result #739

`csv.Sniffer().has_header()` is called when not needed and returns the wrong result #739

mondjef commented Feb 26, 2023 •

edited

Loading