
Allow alternative handlers for parse results #171

Open
frizbog opened this issue Oct 28, 2016 · 7 comments
frizbog (Owner) commented Oct 28, 2016

Currently, when a GEDCOM file is parsed, the entire file is loaded into memory and the object graph is returned to the caller. This works for small files and fast processors, but for large files, or on platforms with tight memory constraints (like phones), keeping all the objects in heap memory does not scale.

If callers could register a handler and receive each root-level record (Individual, Family, etc.) as it is parsed, gedcom4j could call that handler and pass it the parsed object. The caller could then decide to serialize the object, write it to a database, transform its contents to some other format, fiddle with the data, whatever it wants, without the object having to be stored in memory along with the entire contents of the file. The default behavior could/should remain what it does now, but giving the caller alternatives and choices is a good thing.
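A minimal sketch of what such a registration API might look like (the names `StreamingParser`, `registerHandler`, and `emit` are illustrative assumptions, not the actual gedcom4j API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: a parser that hands each root-level record to the
// registered handlers as it finishes parsing it, instead of accumulating
// the whole object graph on the heap.
class StreamingParser {
    private final List<Consumer<Object>> handlers = new ArrayList<>();

    void registerHandler(Consumer<Object> handler) {
        handlers.add(handler);
    }

    // Called internally as each root-level record (Individual, Family, ...)
    // finishes parsing; the parser need not retain the record afterwards.
    void emit(Object record) {
        for (Consumer<Object> h : handlers) {
            h.accept(record);
        }
    }
}
```

With no handler registered, the parser could fall back to the current behavior of accumulating everything into one object graph, preserving backward compatibility.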

frizbog (Owner, Author) commented Nov 20, 2016

More thinking on this: the first major difficulty that comes to mind here is cross-references between objects. Families have references to Individuals, which have references to Sources, which have references to Multimedia, and so on. If the parser returns a Family by itself, there needs to be some way to deal with the Individuals in the family, etc., so that the caller can resolve the cross-references. The xref fields and the *Reference classes recently introduced are probably the key here, but this will be one complicating factor.

haralduna commented Dec 7, 2016

The way I currently use gedcom4j is to loop through all individuals and build a new hash table of all the individuals, including related information such as photo links, person links, and copies of complete notes and source information. This gives one compact table that is fairly efficient, though I think it does include copies of some elements, like Sources.

If tables of all root elements were directly available, I could perhaps use them directly and serialize straight to disk, without any loss of information.
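One way this serialize-to-disk idea could look, sketched under the assumption that records implement `Serializable` (the class and method names are hypothetical, not part of gedcom4j):

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: persist each root-level record to its own file as it
// arrives, keyed by xref, so nothing has to stay on the heap.
class DiskRecordStore {
    private final Path dir;

    DiskRecordStore(Path dir) {
        try {
            this.dir = Files.createDirectories(dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private Path fileFor(String xref) {
        // Strip the @-signs so "@I1@" becomes a safe file name like "I1.ser"
        return dir.resolve(xref.replaceAll("[^A-Za-z0-9]", "") + ".ser");
    }

    void save(String xref, Serializable record) {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(fileFor(xref)))) {
            out.writeObject(record);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    Object load(String xref) {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(fileFor(xref)))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```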

frizbog (Owner, Author) commented Dec 7, 2016

Yes, that's the thought. If you wanted to take each root-level object and put it in a SQLite database, for example, keyed by the xref, you'd be able to do that.

@frizbog frizbog self-assigned this Dec 11, 2016
frizbog (Owner, Author) commented Dec 12, 2016

Unfortunately, while working through this, I have discovered that there will need to be some moderately significant changes to the object model. Specifically, rather than holding actual references from one object to a root-level object (such as a Family having references to the Individuals in it), the model will now have to keep a copy of the other item's xref instead, and the calling code will need to resolve xrefs to root-level items.

This is primarily due to forward references (parsing a family full of people before the people themselves have been parsed). When everything was done in memory, and that was the only way to do things, it was possible to do the lookups on the fly. Now that an item can reference an object that has not been instantiated, that won't work anymore, so properties like getFamily() will need to become getFamilyXref() or something similar.
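A sketch of that xref-based scheme with a hypothetical resolver (none of these class names are gedcom4j's actual model):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: the record keeps the other item's xref as a plain
// string, and a separate resolver maps xrefs to root-level records once
// they exist, which tolerates forward references during parsing.
class IndividualRecord {
    String xref;        // this record's own xref, e.g. "@I1@"
    String familyXref;  // was a direct Family reference; now just the xref
}

class XrefResolver {
    private final Map<String, Object> records = new HashMap<>();

    void put(String xref, Object record) {
        records.put(xref, record);
    }

    // Returns null for a forward reference whose record hasn't been stored yet.
    Object resolve(String xref) {
        return records.get(xref);
    }
}
```

The point of the indirection is that `resolve` could be backed by anything that can look up an xref: the in-memory map shown here, a disk store, or a database.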

frizbog (Owner, Author) commented Dec 12, 2016

Further impact: some methods currently in the model will need to move (e.g., getAncestors() will need to move out of Individual), because a given object in the model will no longer have direct references to the other objects; something else that can resolve xrefs to objects will need to perform this work.

It's very arguable that methods like this did not belong in the model in the first place, so architecturally I don't feel bad about it at all, but it is going to impact the API quite a bit and will be a whole lot less convenient.
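For illustration, a relocated getAncestors() might become a service that walks parent xrefs through a resolver function (a hypothetical sketch; `AncestryService` and `parentXrefs` are invented names, not gedcom4j API):

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch: ancestry traversal moved out of Individual into a
// service. "parentXrefs" maps an individual's xref to the xrefs of its
// parents, however those happen to be stored (memory, disk, database).
class AncestryService {
    private final Function<String, List<String>> parentXrefs;

    AncestryService(Function<String, List<String>> parentXrefs) {
        this.parentXrefs = parentXrefs;
    }

    // Breadth-first walk up the tree, collecting each ancestor's xref once.
    Set<String> getAncestors(String xref) {
        Set<String> ancestors = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>(parents(xref));
        while (!queue.isEmpty()) {
            String current = queue.poll();
            if (ancestors.add(current)) {
                queue.addAll(parents(current));
            }
        }
        return ancestors;
    }

    private List<String> parents(String xref) {
        List<String> p = parentXrefs.apply(xref);
        return p == null ? Collections.emptyList() : p;
    }
}
```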

frizbog (Owner, Author) commented Dec 12, 2016

This is going to be a huge deal, the more I dig. The validation framework does internal consistency checks and expects that all the data is available in an org.gedcom4j.model.Gedcom object, which may no longer be the case after this is set up. It also impacts the writer, which also expects a Gedcom object to be in memory.

I think I'm likely to need to make an interface-based replacement for the Gedcom object (the one that holds maps of all the objects) and start passing references to an implementation of it everywhere a Gedcom is currently expected. The default implementation would be the current in-memory, multi-hashmap-based object. Alternate implementations will need to be able to look up objects by xref from disk, a database, etc., in a way that's API-compatible with the in-memory Gedcom object.

frizbog (Owner, Author) commented Dec 12, 2016

Sorry about the stream of consciousness comments...

The more I think about this, the more I suspect I have been trying to solve the wrong problem entirely. It's not the parser, it's the model. The parser just stores to the model, and the validator and writer just pull from the model; none of them cares whether anything is in memory, only that the objects are somehow accessible for read/write.

If I extract an API from the model, and use the current code as the "in-memory" implementation of that API, and make the parser, validator, and writer all access the model through that API, then it won't matter where that data is: heap, disk, database, cloud, whatever. Alternate implementations of that model API would then be possible. Dependency injection could let you pick which model API implementation you want to use in your code, with the current in-memory behavior as the default implementation.
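A minimal sketch of that extraction, with the current in-memory behavior as the default implementation (the interface and method names here are invented for illustration, not gedcom4j's actual IGedcom API):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: an API extracted from the model. Parser, validator,
// and writer would all go through this interface; the in-memory map below
// is the default implementation, and disk/database/cloud-backed ones could
// be injected instead.
interface GedcomModel {
    void addIndividual(String xref, Object individual);

    Object getIndividual(String xref);
}

class InMemoryGedcomModel implements GedcomModel {
    private final Map<String, Object> individuals = new HashMap<>();

    @Override
    public void addIndividual(String xref, Object individual) {
        individuals.put(xref, individual);
    }

    @Override
    public Object getIndividual(String xref) {
        return individuals.get(xref);
    }
}
```

Because callers depend only on the interface, swapping the storage strategy becomes a wiring decision (e.g., via dependency injection) rather than an API change.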

frizbog pushed a commit that referenced this issue Jan 8, 2017
frizbog pushed a commit that referenced this issue Jan 9, 2017: "This is to make room for other implementations of the IGedcom interface"
Projects: v5.0.0 (In Progress)
No branches or pull requests · 2 participants