Files are being skipped from search when the filename has a Spanish accent #503

Closed
leitor79 opened this issue May 10, 2021 · 4 comments

Comments

@leitor79
Hi,

When I search for content in a folder with "paths to match" set to "", "." or ".pdf", the results list misses files that should match if the filename contains a character with a Spanish accent (e.g. "ó").

I've tested this with two copies of the same file, removing the accented character from the name of one of them, and dnGrep finds only that one.

The file is a PDF, just in case.

Regards,

@doug24
Contributor

doug24 commented May 10, 2021

I assume you are searching by asterisk pattern, because these aren't good regular expressions. But you aren't including any wildcards in the pattern either. * matches any number of characters, and ? matches a single character. So "*.pdf" would be used to find any file with a pdf extension, and "*.*" would be used to find any file name with any extension.

Can you try searching by one of these?
*.pdf
*.*
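
To illustrate the wildcard rules described above, here is a minimal sketch using Python's standard `fnmatch` module. It is only an approximation of asterisk-pattern matching in general, not dnGrep's own matcher, and the file names and the `matching` helper are made up for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, chosen only for illustration.
names = ["Eliseo Verón.pdf", "notes.txt", "README"]

def matching(pattern, candidates):
    """Return the candidates that match a shell-style wildcard pattern."""
    return [name for name in candidates if fnmatchcase(name, pattern)]

pdf_only = matching("*.pdf", names)  # '*' stands for any run of characters
any_ext = matching("*.*", names)     # any name that contains a dot
literal = matching(".pdf", names)    # no wildcard: only a file named exactly ".pdf"

print(pdf_only)
print(any_ext)
print(literal)
```

With these rules a pattern without a wildcard, such as ".pdf", only matches a file whose whole name is ".pdf", so it finds none of the sample files.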

@doug24 doug24 closed this as completed May 15, 2021
@leitor79
Author

Hi,
Thank you for your answer.
I'm pretty sure I've tried including wildcards; I've tried again, just in case. I've copied the same file with the same content in the same folder and renamed one of them to not have an accent. The accented file was not a match.

https://i.imgur.com/jz1k140.png

Regards,

@doug24 doug24 reopened this May 16, 2021
@doug24
Contributor

doug24 commented May 16, 2021

I did reproduce this bug. It is specific to pdf files, which I had not tested before. As I commented on #504, dnGrep searches plain text, so it uses plug-ins to convert binary formatted files like Word, Excel and PDF to text before searching.

The bug isn't actually in dnGrep, but in the pdftotext.exe application that dnGrep calls to extract text from pdf files. When calling pdftotext.exe dnGrep makes the call like this:

pdftotext.exe -layout -enc UTF-8 -bom "C:\testFiles\test\issue503\Eliseo Verón.pdf" "C:\Users\user\AppData\Local\Temp\dnGrep-lkwsbbww72mP\dnGREP-PDF\Eliseo Verón.txt"

But instead of creating the file "Eliseo Verón.txt", pdftotext.exe creates a file named "Eliseo Verón.txt", and dnGrep can't find the correct file to search.
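
One way a caller could sidestep this kind of output-name mismatch is to hand the converter ASCII-only temporary names and read the result back itself. The sketch below is an illustration in Python, not dnGrep's code and not necessarily how the issue was eventually fixed. The names `extract_text` and `run_pdftotext`, and the assumption that `pdftotext.exe` is on the PATH, are hypothetical; only the command-line flags are taken from the call quoted above.

```python
import shutil
import subprocess
import tempfile
import uuid
from pathlib import Path

def run_pdftotext(src, dst):
    """Hypothetical wrapper: assumes pdftotext.exe is on the PATH.

    The flags are the ones from the command line quoted in this thread.
    """
    subprocess.run(
        ["pdftotext.exe", "-layout", "-enc", "UTF-8", "-bom", str(src), str(dst)],
        check=True,
    )

def extract_text(pdf_path, convert=run_pdftotext):
    """Convert an ASCII-named copy of pdf_path and return the extracted text."""
    pdf_path = Path(pdf_path)
    # Note: the temp directory itself could still contain non-ASCII characters
    # (for example in a user name); this sketch only sanitizes the file names.
    with tempfile.TemporaryDirectory(prefix="pdf-extract-") as tmp:
        token = uuid.uuid4().hex  # hex digits only, so always ASCII
        safe_pdf = Path(tmp) / f"{token}.pdf"
        safe_txt = Path(tmp) / f"{token}.txt"
        shutil.copyfile(pdf_path, safe_pdf)
        convert(safe_pdf, safe_txt)
        # "utf-8-sig" drops the byte order mark that the -bom flag requests.
        return safe_txt.read_text(encoding="utf-8-sig")
```

Because the converter never sees the accented name, it does not matter how it encodes file names; the cost is one extra copy of the PDF per search.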

This bug appears to be in pdftotext version 4, but not in version 3. The dnGrep installer installs version 4 with the application, but you can overwrite it with the older version.

I attached pdftotext.exe version 3 to this note (see below). To use it, open this directory in Windows Explorer:
C:\Program Files\dnGREP\Plugins\PdfSearch
Rename the existing pdftotext.exe to pdftotext4.exe, and copy version 3 from the zip file into that directory.
Next, start dnGrep and open the Options dialog. Scroll down to the PDF section and remove the command line options (the default options only work with pdftotext version 4):

image

pdftotext.zip
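
The file swap described above can also be scripted. This is a minimal sketch in Python under some assumptions: the function name `swap_in_older_pdftotext` is hypothetical, the older executable has already been extracted from the zip file, and the script has write permission to the plug-in directory (which under C:\Program Files usually means running as administrator). Removing the command line options in the Options dialog still has to be done by hand.

```python
import shutil
from pathlib import Path

def swap_in_older_pdftotext(plugin_dir, older_exe):
    """Keep the installed pdftotext.exe as pdftotext4.exe and copy in older_exe."""
    plugin_dir = Path(plugin_dir)
    current = plugin_dir / "pdftotext.exe"
    backup = plugin_dir / "pdftotext4.exe"
    if backup.exists():
        # Refuse to overwrite a backup left by an earlier run.
        raise FileExistsError(f"{backup} already exists")
    current.rename(backup)               # keep version 4 under a new name
    shutil.copyfile(older_exe, current)  # put the older version in its place
    return backup
```

For the directory named above, the call would look like `swap_in_older_pdftotext(r"C:\Program Files\dnGREP\Plugins\PdfSearch", r"C:\Temp\pdftotext.exe")`, where the second path is a placeholder for wherever the zip file was extracted.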

@doug24
Contributor

doug24 commented Jun 27, 2021

Fixed in Release 2.9.345

@doug24 doug24 closed this as completed Jul 4, 2021