Row number in Excel files #684
Comments
Can you show an example? In the files I have tried the line number does match the row number in the Excel sheet. |
See an example file attached as well as an image of search results. |
The difference here is that some cells contain carriage return (newline) characters, and that is throwing off the row numbers. I wrote something very similar to this about PDF files recently:

Excel files are a real pain for doing text searches. The first thing to understand is that dnGrep searches plain text, so when starting from documents like PDF, Word, Excel and PowerPoint, dnGrep first extracts plain text from the document as best it can. PDF is by far the most difficult. Excel is second.

What does a plain text version of Excel look like? One option would be CSV, but instead dnGrep uses tab separated values for the columns and a newline between rows. Using ordinary text like tabs and newlines makes the extracted text somewhat readable and searchable across cells and rows, but not down columns. Searching Excel this way tells you the search pattern exists in a row on a sheet in a file, but gives no indication of what column it is in. And maybe more importantly, it can be shown in dnGrep's result tree just like any other document.

But when a cell contains newlines, extra "rows" are created because there is no difference between a newline inside a cell and the newline used to designate a new row. Every extracted line after such a cell is numbered higher than its real row in the sheet.

The easiest solution would be to replace any newlines found within a cell with a space character. This keeps all the text in the cell together, is still readable with some loss of formatting, and is compatible with Excel documents that don't contain newlines. I'm definitely leaning this way but appreciate anyone's comments.

A much more complicated change would be to extract and search each cell as a separate item, preserving embedded newlines. This loses the ability to search across cells, and I really don't know how to show this in the results tree (a row with a bunch of columns). |
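To make the row drift concrete, here is a minimal Python sketch of the tab-and-newline extraction described above, with and without the proposed replacement of embedded newlines by spaces. It is an illustration only, not dnGrep's actual code: the function names `rows_to_text` and `find_line_numbers`, the `flatten_newlines` flag, and the sample sheet are all hypothetical.

```python
def rows_to_text(rows, flatten_newlines=True):
    """Join cells with tabs and rows with newlines (hypothetical helper)."""
    lines = []
    for row in rows:
        cells = []
        for cell in row:
            text = "" if cell is None else str(cell)
            if flatten_newlines:
                # Replace CRLF first so it becomes one space, then lone CR and LF.
                text = text.replace("\r\n", " ").replace("\r", " ").replace("\n", " ")
            cells.append(text)
        lines.append("\t".join(cells))
    return "\n".join(lines)


def find_line_numbers(text, pattern):
    """Return the 1-based numbers of the lines that contain pattern."""
    return [i for i, line in enumerate(text.split("\n"), start=1) if pattern in line]


# Made-up sheet: row 2 has a cell with an embedded newline, row 3 holds the match.
sheet = [
    ["id", "note"],
    ["1", "first line\nsecond line"],
    ["2", "target"],
]

raw = rows_to_text(sheet, flatten_newlines=False)
flat = rows_to_text(sheet, flatten_newlines=True)

print(find_line_numbers(raw, "target"))   # [4]: off by one, the cell newline added a line
print(find_line_numbers(flat, "target"))  # [3]: matches the sheet row
```

With the embedded newline left in place, the match on sheet row 3 is reported on line 4. Once each embedded newline is replaced by a space, the extracted line number and the sheet row agree again, at the cost of losing the line break inside the cell.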
Thank you for a very detailed response! I understand the challenge now. Feel free to close the issue. |
fix added to v3.0.84.0 |
Hi Doug,
I can give you a sample file, but I don’t want to attach it on GitHub, where it would be publicly available. Is there a way to limit viewing on GitHub to you only? I can’t email it to you either.
Thanks.
From: Doug P ***@***.***>
Sent: Friday, June 10, 2022 10:26 AM
To: dnGrep/dnGrep ***@***.***>
Cc: Oleg Bolman ***@***.***>; Author ***@***.***>
Subject: Re: [dnGrep/dnGrep] Row number in Excel files (Issue #684)
Can you show an example? In the files I have tried the line number does match the row number in the Excel sheet.
|
When searching in Excel files, dnGrep shows a line number, but it's not the original row number in Excel (it looks like it converts the Excel file to CSV on the fly and shows the line number in that CSV file). It would be great to see the row number instead.