The Body comparison fails to find duplicates. #202

Open
iago-lito opened this issue Jan 2, 2024 · 6 comments

Comments

@iago-lito

I was surprised that "No duplicates were found." came up in pretty much any situation involving the Body comparison criterion. So I closed Thunderbird, went into its Mail folder, grabbed a raw archive mbox file and tried the following:

$ cat archive > dupes && cat archive >> dupes

So, this artefactual dupes mail file is twice the size of archive and contains only duplicates, right?
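That assumption can be sanity-checked on a toy file before involving Thunderbird at all. The following is only a minimal sketch, assuming a POSIX-like shell where `head -c` and `tail -c` are available; the scratch directory and the two dummy messages are hypothetical stand-ins for a real archive file.

```shell
#!/bin/sh
# Hypothetical scratch directory; nothing here touches a real Thunderbird profile.
workdir=$(mktemp -d)
archive="$workdir/archive"
dupes="$workdir/dupes"

# Toy stand-in for a raw mbox file: two dummy messages.
printf 'From - Mon Jan  1 00:00:00 2024\nSubject: one\n\nfirst body\n\n' > "$archive"
printf 'From - Mon Jan  1 00:00:01 2024\nSubject: two\n\nsecond body\n\n' >> "$archive"

# The concatenation described above: the file followed by itself.
cat "$archive" > "$dupes" && cat "$archive" >> "$dupes"

# Byte counts; the arithmetic expansion strips any padding wc may print.
size_archive=$(($(wc -c < "$archive")))
size_dupes=$(($(wc -c < "$dupes")))
echo "archive: $size_archive bytes, dupes: $size_dupes bytes"

# Each half of dupes should be byte-identical to archive.
halves_identical=no
if head -c "$size_archive" "$dupes" | cmp -s - "$archive" &&
   tail -c "$size_archive" "$dupes" | cmp -s - "$archive"; then
    halves_identical=yes
fi
echo "halves identical: $halves_identical"
```

This only confirms what is on disk. It says nothing about how Thunderbird indexes the file or retrieves the message bodies.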

Opening Thunderbird again, right-clicking on the new dupes mail folder and searching for duplicates yielded "No duplicates were found." again. I therefore suspect there is a bug in the Body comparison.

@eyalroz
Owner

eyalroz commented Jan 2, 2024

> So, this artefactual dupes mail file is twice the size of archive and contains only duplicates, right?

You're assuming TB properly recognizes the duplicated messages + meta-data as two distinct messages. That may not be the case. Also, TB may be failing to retrieve the message bodies properly when you manipulate mbox files like that.

Still, if you can send me a compressed mbox file you've generated this way (via email or even here with an attachment), with 2x2 messages, which is supposed to have 2 dupe sets of size 2, but is not found to have them - I could try to reproduce and work on a fix.

Please note that my availability until late this month is rather low.

Did you remove all other criteria?

@iago-lito
Author

> Did you remove all other criteria?

Not when I wrote the OP, but I have tested now with only Body selected and the same happens indeed.

> You're assuming TB properly recognizes the duplicated messages + meta-data as two distinct messages. That may not be the case. Also, TB may be failing to retrieve the message bodies properly when you manipulate mbox files like that.

FWIU, mbox files are just text files containing all messages in a ^From-separated sequence, so I think it does make sense to concatenate two files like this. I was also convinced when I saw that TB correctly interpreted the result.
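A rough way to check this outside of TB is to count the separator lines before and after concatenation. This is just a sketch under assumptions: the file names and the two dummy messages are hypothetical, and counting lines that start with `From ` is only reliable when body lines beginning with `From ` have been escaped (e.g. as `>From `), as mbox writers conventionally do.

```shell
#!/bin/sh
# Hypothetical scratch files, not the ones attached to this issue.
workdir=$(mktemp -d)
mbox="$workdir/toy"
doubled="$workdir/toy-doubled"

# Two dummy messages, each introduced by a "From " separator line.
printf 'From - Mon Jan  1 00:00:00 2024\nSubject: one\n\nfirst body\n\n' > "$mbox"
printf 'From - Mon Jan  1 00:00:01 2024\nSubject: two\n\nsecond body\n\n' >> "$mbox"

# Concatenate the file with itself.
cat "$mbox" "$mbox" > "$doubled"

# Count messages as the number of lines starting with "From ".
count_messages() {
    grep -c '^From ' "$1"
}

n_single=$(count_messages "$mbox")
n_doubled=$(count_messages "$doubled")
echo "messages in toy: $n_single"
echo "messages in toy-doubled: $n_doubled"
```

If the doubled file reports twice as many separators, the concatenation is at least structurally a valid sequence of messages, whatever TB then makes of it.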

> if you can send me a compressed mbox file you've generated this way

There you go. This is not compressed but very small. I have crafted a toy example from only two dummy messages. The second file is just the concatenation of twice the first file so it contains no extra information. This is a rather minimal example that I have been able to reproduce the bug with:

I would prefer that these two URLs not linger online for too long. Can you please tell me when you have the files on your side so I can remove them?

> Please note that my availability until late this month is rather low.

No worries, thank you for removedupes <3

@eyalroz
Owner

eyalroz commented Jan 4, 2024

I'll try to find time to look at this next week; if I haven't, please poke me again. With work, plus anti-war activities, plus other repositories of mine (cuda-api-wrappers), I'm kind of swamped.

@iago-lito
Author

Take your time :) Do you have the files on your side so I can take them offline?

@iago-lito
Author

Friendly ping @eyalroz, but maybe you're not out of the swamp yet..

@lkasdj9

lkasdj9 commented Jun 11, 2024

Joining iago-lito: same issue for a while now (115). Another friendly ping @eyalroz
