Comparing large files from different origins is impossible #319
Comments
These aren't outrageously large files. I think 20k persons have been compared before. Can you provide the command you used and how large (in file size) is each file?
Here's the command, on the latest Linux Mint:
./gedcom diff -left-gedcom=../MEGAsync/Untitled_6.ged -right-gedcom=../MEGAsync/fs.ged -output=diff.html
File sizes:
-rwxr-xr-x 1 enno enno 2734017 nov 27 15:36 fs.ged
And although most of the persons are the same, their data is not, because Untitled_6 has IDs generated by Gramps, and fs has IDs generated by RootsMagic. Most of the persons in fs.ged also have place names that were normalized by FamilySearch. In other words, although most of the persons are the same individuals, most have small differences, primarily because of the normalized place names, or because they have been edited by other FS members.
When I compare two GEDCOMs generated by Gramps, the program runs fine, although in that case the HTML is too big to load, so I can't use it either.
I'm a bit confused, because you wrote that you tested it with Ancestry, which also changes all IDs.
Interesting. I wonder if it's running out of memory because of the sheer size of the HTML generated and not the diff itself. Try running with the
./gedcom diff -left-gedcom=../MEGAsync/Untitled_6.ged -right-gedcom=../MEGAsync/fs.ged -output=diff.html -progress
You can also try adding
... and also
I tried all of them, but none really helped. On this system I did not get any out-of-memory errors. The program just slowed to a crawl, so I had to abort it. Here are my results:
enno@desktop-mate:
Please note that, as I wrote earlier, almost none of the persons are equal. More than half are the same individuals, but almost every person downloaded from FamilySearch has standardized place names, and none have the same attributes, simply because FamilySearch doesn't store that many. This is quite different from when you upload your own GEDCOM to Ancestry and make some modifications on-line. If you do that, downloaded persons have different IDs, but most of their attributes are the same.
In other words, I think that to serve my purpose, the program needs to be way more fuzzy than it actually is.
The default behavior is to compare every individual with every other individual (12k x 7k = 84m comparisons). Comparisons are made by taking into account the name, birth and death dates and producing a similarity number (0.0 - 1.0). Individuals that have a similarity higher than the threshold (default is 0.733, but you can override it with options) are considered "equal". If there are many pairs of individuals that match, the highest similarity is chosen.
The tool will try to use common IDs to reduce the product of the comparisons. This is useful when both sides come from the same source, or at least when IDs are maintained. However, if you say that these share no IDs, then it will always fall back to comparing all individuals. You can verify this is the case by seeing that the total number of comparisons (84m) does not change. If ID matches are found, this number will drop throughout the process.
I don't think this is a problem with how the tool works but rather just some memory leaking. You could try rebuilding from source and adding explicit
As you say, you won't get any out-of-memory errors because it will just start consuming swap space, which is ultra slow, but if you're willing to let it run overnight it will eventually finish...
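To make the matching strategy described in this comment concrete, here is a minimal Python sketch. It is not the tool's actual code (the tool itself is not written in Python). The `Person` record, the `uid` field, the `year_score` helper, and the weights inside `similarity` are all assumptions made up for illustration. Only the overall shape comes from the comment: pair by shared ID first, compare everything that is left, keep pairs above the 0.733 threshold, and prefer the highest similarity.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional

THRESHOLD = 0.733  # default threshold quoted in the comment above


@dataclass(frozen=True)
class Person:
    # Hypothetical record; a real GEDCOM individual carries far more data.
    pointer: str                      # file-local ID, differs between programs
    name: str
    birth_year: Optional[int] = None
    death_year: Optional[int] = None
    uid: Optional[str] = None         # shared ID, if both files preserve one


def year_score(a, b, max_gap=10):
    """1.0 for equal years, falling linearly to 0.0 at max_gap; 0.5 if unknown."""
    if a is None or b is None:
        return 0.5
    return max(0.0, 1.0 - abs(a - b) / max_gap)


def similarity(a, b):
    """Illustrative 0.0-1.0 score from name, birth and death (weights are assumed)."""
    name = SequenceMatcher(None, a.name.lower(), b.name.lower()).ratio()
    return (0.5 * name
            + 0.25 * year_score(a.birth_year, b.birth_year)
            + 0.25 * year_score(a.death_year, b.death_year))


def match(left, right, threshold=THRESHOLD):
    """Return (pairs, comparisons); each pair is (left, right, similarity)."""
    pairs, used_right, unmatched_left = [], set(), []

    # Step 1: pair individuals that share an ID, which shrinks the product.
    right_by_uid = {p.uid: p for p in right if p.uid}
    for l in left:
        r = right_by_uid.get(l.uid) if l.uid else None
        if r is not None and r.pointer not in used_right:
            pairs.append((l, r, 1.0))  # 1.0 marks an ID match, not a computed score
            used_right.add(r.pointer)
        else:
            unmatched_left.append(l)
    unmatched_right = [r for r in right if r.pointer not in used_right]

    # Step 2: compare every remaining left individual with every remaining right one.
    comparisons, candidates = 0, []
    for l in unmatched_left:
        for r in unmatched_right:
            comparisons += 1
            s = similarity(l, r)
            if s > threshold:
                candidates.append((s, l, r))

    # Step 3: when several candidates compete, the highest similarity wins.
    candidates.sort(key=lambda c: c[0], reverse=True)
    taken_left, taken_right = set(), set()
    for s, l, r in candidates:
        if l.pointer in taken_left or r.pointer in taken_right:
            continue
        pairs.append((l, r, s))
        taken_left.add(l.pointer)
        taken_right.add(r.pointer)
    return pairs, comparisons
```

With no shared IDs, `comparisons` is always `len(left) * len(right)`, which is the 12,000 x 7,000 = 84,000,000 figure mentioned above. Every ID match found in step 1 removes one individual from each side before the product is taken.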
I am now running a session comparing my current tree from Gramps with the one that I have on Ancestry, which has the same UIDs, and now I see more progress. It starts slow, with expected run time shown as multiple days, but the actual run time is 8 minutes.
When I compare a GEDCOM created by Gramps with another one created by RootsMagic, the program runs out of resources and crashes, consuming more than 6 GB of my 8 GB RAM.
The GEDCOM from Gramps has close to 12,000 persons, the one from RootsMagic about 7,000. About 6,000 of these are the same or similar.