[ML] find_file_structure not detecting CSV header with many long and highly variable field values #45047

Closed
droberts195 opened this issue Jul 31, 2019 · 2 comments · Fixed by #45099
Labels
:ml Machine learning

Comments

@droberts195
Contributor

elastic/kibana#42114 contains an example of a CSV file where the find_file_structure endpoint didn't detect that the first row contained the column names.

The explanation is:

    "First row is not unusual based on length test: [1347.0] and [count=313, min=1231.000000, average=4363.025559, max=8911.000000]",
    "First row is not unusual based on Levenshtein test [count=100, min=1871.000000, average=4357.270000, max=8512.000000] and [count=100, min=1711.000000, average=3648.230000, max=5914.000000]"

In other words:

  1. The first row length is 1347 and other rows vary in length between 1231 and 8911 characters.
  2. The average Levenshtein distance between the first row and each of the next 100 rows is 4357.27 while the average distance between 100 other pairs of rows is 3648.23.

The file in question contains AirBNB listings data. Some owners have written huge amounts about their properties and other owners have written very little, and the two current tests are confused by this.

To a human it's flagrantly obvious that the first row is a header row, so we should be able to improve this.

One idea is to extend the existing approach of _excluding_ the biggest difference, from:

    /**
     * Sum of the Levenshtein distances between corresponding elements
     * in the two supplied lists _excluding_ the biggest difference.
     * The reason the biggest difference is excluded is that sometimes
     * there's a "message" field that is much longer than any of the other
     * fields, varies enormously between rows, and skews the comparison.
     */

to exclude all fields that are over a certain length in any row, since such lengths suggest freeform text fields (the AirBNB data has more than one such field per row).
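A minimal sketch of this idea, not the code that was eventually merged: the class name `RowDistanceSketch`, its method names, and the threshold of 100 characters are all hypothetical, since the issue does not state a cutoff. It marks every column that holds a long value in any row, then sums the fieldwise Levenshtein distances over the remaining columns only.

```java
import java.util.List;

final class RowDistanceSketch {

    /** Hypothetical cutoff; the issue does not specify a value. */
    public static final int LONG_FIELD_THRESHOLD = 100;

    /** Classic two-row dynamic-programming Levenshtein distance. */
    public static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Marks each column in which at least one row has a value longer than the threshold. */
    public static boolean[] longFieldMask(List<List<String>> rows, int threshold) {
        if (rows.isEmpty()) {
            return new boolean[0];
        }
        int columns = rows.get(0).size();
        boolean[] mask = new boolean[columns];
        for (List<String> row : rows) {
            for (int c = 0; c < columns && c < row.size(); c++) {
                if (row.get(c).length() > threshold) {
                    mask[c] = true;
                }
            }
        }
        return mask;
    }

    /** Sum of fieldwise Levenshtein distances, skipping every masked (long) column. */
    public static int distanceExcludingLongFields(List<String> first, List<String> second, boolean[] mask) {
        int sum = 0;
        int n = Math.min(first.size(), second.size());
        for (int c = 0; c < n; c++) {
            if (c < mask.length && mask[c]) {
                continue; // likely a freeform text column
            }
            sum += levenshtein(first.get(c), second.get(c));
        }
        return sum;
    }
}
```

Computing the mask once over the whole sample, rather than per pair of rows, means the header-versus-row and row-versus-row comparisons ignore the same columns.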

Another idea would be to look at the number of distinct characters in each row. In the AirBNB data this could well notice a difference between the first row and others because the first row is all commas, lowercase letters and underscores whereas the other lines have many other characters.
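A rough sketch of the distinct-character idea follows. The class `DistinctCharsSketch` and the decision rule in `firstRowHasFewestDistinctChars` are assumptions made for illustration; the issue only proposes looking at the counts and does not say how they would be compared.

```java
import java.util.List;

final class DistinctCharsSketch {

    /** Number of distinct Unicode code points in a raw CSV row. */
    public static int distinctChars(String row) {
        return (int) row.codePoints().distinct().count();
    }

    /**
     * Hypothetical rule: the first row looks like a header if it uses
     * strictly fewer distinct characters than every other sampled row.
     */
    public static boolean firstRowHasFewestDistinctChars(List<String> rows) {
        if (rows.size() < 2) {
            return false; // nothing to compare against
        }
        int first = distinctChars(rows.get(0));
        for (String other : rows.subList(1, rows.size())) {
            if (distinctChars(other) <= first) {
                return false;
            }
        }
        return true;
    }
}
```

A strict "fewer than every row" rule is probably too brittle for real data; comparing against a distribution of counts, as the existing length test does, would be a more forgiving variant.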

@droberts195 droberts195 added the :ml Machine learning label Jul 31, 2019
@elasticmachine
Collaborator

Pinging @elastic/ml-core

@sophiec20
Contributor

Possibly look at the ratio of numeric characters in addition to excluding very lengthy fields.
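One way the numeric-character ratio could be computed, as a sketch only: `DigitRatioSketch` is a hypothetical name, and how the ratio would feed into header detection is left open because the comment does not say. The expectation is that a header row of column names has a ratio near zero while data rows containing ids, prices, and dates score higher.

```java
final class DigitRatioSketch {

    /** Fraction of the row's characters that are digits; 0.0 for an empty row. */
    public static double digitRatio(String row) {
        if (row.isEmpty()) {
            return 0.0;
        }
        long digits = row.chars().filter(Character::isDigit).count();
        return (double) digits / row.length();
    }
}
```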

droberts195 added a commit to droberts195/elasticsearch that referenced this issue Aug 1, 2019
When doing a fieldwise Levenshtein distance comparison
between CSV rows, this change ignores all fields that
have long values, not just the longest field.

This approach works better for CSV formats that have
multiple freeform text fields rather than just a single
"message" field.

Fixes elastic#45047
droberts195 added a commit that referenced this issue Aug 1, 2019
droberts195 added a commit that referenced this issue Aug 2, 2019