Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Row number in Excel files #684

Closed
obolman opened this issue Jun 10, 2022 · 6 comments
Closed

Row number in Excel files #684

obolman opened this issue Jun 10, 2022 · 6 comments

Comments

@obolman
Copy link

obolman commented Jun 10, 2022

When searching in Excel files, dngrep shows a line number but it's not the original row number in Excel (looks like it converts Excel to csv on the fly and shows the line number in that csv file.) Would be great to see the row number instead.

@doug24
Copy link
Contributor

doug24 commented Jun 10, 2022

Can you show an example? In the files I have tried the line number does match the row number in the Excel sheet.

@obolman
Copy link
Author

obolman commented Jun 14, 2022

See an example file attached as well as an image of search results.

Row number sample.xlsx
dngrep row numbers
.

@doug24
Copy link
Contributor

doug24 commented Jun 14, 2022

The issue/difference here is that there are cells with carriage return (newline) characters in the cells, and that is throwing off the row numbers.

I wrote something very similar to this about PDF files recently: Excel files are a real pain/problem for doing text searches. First thing to understand is dnGrep searches plain text - so when starting from documents like PDF, Word, Excel and PowerPoint, dnGrep first extracts plain text from the document as best it can. PDF is by far the most difficult. Excel is second. What does a plain text version of Excel look like? One option would be CSV, but instead dnGrep uses tab separated values for the columns and a newline between rows.

Using ordinary text like tabs and newlines makes the extracted text somewhat readable and searchable across cells and rows, but not down columns. Searching Excel this way tells you the search pattern exists in a row on sheet in a file, but has no indication of what column it is in. And maybe more importantly, it can be shown in dnGrep's result tree just like any other document.

But when a cell contains newlines, extra "rows" are created because there is no difference between a newline inside a cell and the newline used to designate a new row.

The easiest solution would be to replace any newlines found within a cell with a space character. This keeps all the text in the cell together, is still readable with some loss of formatting, and is and is compatible with Excel documents that don't contain newlines. I'm definitely leaning this way but appreciate anyone's comments.

A much more complicated change would be to extract and search each cell as a separate item, preserving embedded newlines. This loses the ability to search across cells, and I really don't know how to show this in the results tree (a row with a bunch of columns).

@obolman
Copy link
Author

obolman commented Jun 14, 2022

Thank you for a very detailed response! I understand the challenge now. Feel free to close the issue.

@doug24
Copy link
Contributor

doug24 commented Jun 26, 2022

fix added to v3.0.84.0

@doug24 doug24 closed this as completed Jul 3, 2022
@obolman
Copy link
Author

obolman commented Oct 11, 2022 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants