Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

csv.Sniffer().has_header() is called when not needed and returns the wrong result #739

Open
michaellenaghan opened this issue Dec 5, 2022 · 2 comments

Comments

@michaellenaghan
Copy link

csv.Importer() calls normalize_config() in file_date() and extract(). After skipping the requested number of lines, normalize_config() calls csv.Sniffer().has_header():

has_header = csv.Sniffer().has_header(head)

I have a CSV file for which has_header() returns True when it should return False; here's the entire file:

01/05/2022,E-TRANSFER ***b3y   ,,23.00,25.00

beancount can't do anything about the fact that has_header() is returning the wrong result. But! But it can, in this case, avoid calling has_header() in the first place.

Here's the description of normalize_config():

Using the header line, convert the configuration field name lookups to int indexes.

For this CSV file, the config field name lookups are already int indexes when normalize_config() is called:

           config = {
                csv.Col.DATE: 0,
                csv.Col.PAYEE: 1,
                csv.Col.AMOUNT_DEBIT: 2,
                csv.Col.AMOUNT_CREDIT: 3,
                csv.Col.BALANCE: 4,
            }

When the field name lookups are already all int indexes normalize_config() should, I think, bail early.

(When has_header() returns False normalize_config() asserts that all indexes are ints. Basically, it should perform that check before sniffing the file. If all indexes are ints normalize_config() can return config, False and avoid has_header() entirely.)

@michaellenaghan
Copy link
Author

Here's what I had in mind:

def normalize_config(config, head, dialect='excel', skip_lines: int = 0):
    """Using the header line, convert the configuration field name lookups to int indexes.

    Args:
      config: A dict of Col types to string or indexes.
      head: A string, some decent number of bytes of the head of the file.
      dialect: A dialect definition to parse the header
      skip_lines: Skip first x (garbage) lines of file.
    Returns:
      A pair of
        A dict of Col types to integer indexes of the fields, and
        a boolean, true if the file has a header.
    Raises:
      ValueError: If there is no header and the configuration does not consist
        entirely of integer indexes.
    """
    if all(isinstance(field, int)
            for field_type, field in config.items()):
        return config, False

    # Skip garbage lines before sniffing the header
    assert isinstance(skip_lines, int)
    assert skip_lines >= 0
    for _ in range(skip_lines):
        head = head[head.find('\n')+1:]

    index_config = config
    has_header = csv.Sniffer().has_header(head)
    if has_header:
        header = next(csv.reader(io.StringIO(head), dialect=dialect))
        field_map = {field_name.strip(): index
                     for index, field_name in enumerate(header)}
        index_config = {}
        for field_type, field in config.items():
            if isinstance(field, str):
                field = field_map[field]
            index_config[field_type] = field
    else:
        raise ValueError("CSV config without header has non-index fields: "
                             "{}".format(config))
    return index_config, has_header
  • Immediately check to see if all field lookups are ints. If they are, return the original config and False.
  • If they aren't, sniff for a header — and if a header isn't found, always raise an error. Because at that point we already know we have non-int field lookups.

This is a reasonable approach, but not foolproof. It's possible that some users are supplying all int field lookups even though the file has a header, and it's possible that they're relying on the current implementation of normalize_config() to skip the header they're already ignoring.

I think that's unlikely, but if it isn't, there's an easy workaround: they can increment skip_lines.

@mondjef
Copy link

mondjef commented Feb 26, 2023

+1 for this as I have csv files that depending on number of transactions and the particular order of them the sniffer returns unreliable results causing me to currently play withe csv files to get them to be read properly. What is being proposed here would effectively allow me a native method to override what the sniffer might think.

Here is an example csv file where the sniffer returns incorrect results...

Date, Transaction Details, Funds Out, Funds In 
02/20/2023,POS MERCHANDISE RETAILER  28,6.38,
02/20/2023,POS MERCHANDISE RETAILER,1.00,,60.35

Note if I put double quotes around the Transaction Details column it will then read correctly. Below is an example of a similar file from the same bank that reads just fine without any messing around with it....

Date, Transaction Details, Funds Out, Funds In 
02/27/2023, TRANSFER IN,,35.00,366.86

Was originally thinking I could somehow fix this by playing around with the dialect used but this is not possible as from my understanding the Sniffer function does not accept a parameter to tell it the dialect to use. In the end it would be great to be able to indicate that there is a header or not to bypass the Sniffer which can be unreliable in many cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants