
Allow alternative handlers for parse results #171

Open
frizbog opened this issue Oct 28, 2016 · 7 comments
frizbog (Owner) commented Oct 28, 2016

Currently, when a GEDCOM file is parsed, the entire file is loaded into memory and the object graph is returned to the caller. This works for small files and fast processors, but for large files, or on platforms with tight memory constraints (like phones), keeping all the objects in heap memory does not scale.

If callers could register a handler and receive each root-level record (Individual, Family, etc.) as it is parsed, gedcom4j could call that handler and pass it the parsed object. The caller could then decide to serialize the object, write it to a database, transform its contents to some other format, fiddle with the data, whatever it wants, without the object having to be stored in memory along with the entire contents of the file. The default behavior could/should remain what it does now, but giving the caller alternatives and choices is a good thing.
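A minimal sketch of what such a registration API might look like (the names `StreamingParser`, `registerHandler`, and `emit` are illustrative assumptions, not the actual gedcom4j API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: a parser that hands each root-level record to the
// registered handlers as it finishes parsing it, instead of accumulating
// the whole object graph on the heap.
class StreamingParser {
    private final List<Consumer<Object>> handlers = new ArrayList<>();

    void registerHandler(Consumer<Object> handler) {
        handlers.add(handler);
    }

    // Called internally as each root-level record (Individual, Family, ...)
    // finishes parsing; the parser need not retain the record afterwards.
    void emit(Object record) {
        for (Consumer<Object> h : handlers) {
            h.accept(record);
        }
    }
}
```

With no handler registered, the parser could fall back to the current behavior of accumulating everything into one object graph, preserving backward compatibility.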

frizbog (Owner, Author) commented Nov 20, 2016

More thinking on this: the first major difficulty that comes to mind here is cross-references between objects. Families have references to Individuals, which have references to Sources, which have references to Multimedia, and so on. If the parser returns a Family by itself, there needs to be some way to deal with the Individuals in the family, etc., so that the caller can resolve the cross-references. The xref fields and the *Reference classes recently introduced are probably the key here, but this will be one complicating factor.

haralduna commented Dec 7, 2016

The way I currently use gedcom4j is to loop through all individuals and build a new hash table of all the individuals, including related information such as photo links, person links, and copies of complete notes and source information. This gives one compact table that is fairly efficient, though I think it does include copies of some elements, like Sources.

If tables of all root elements were directly available, I could perhaps use them directly and serialize straight to disk, without any loss of information.
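One way this serialize-to-disk idea could look, sketched under the assumption that records implement `Serializable` (the class and method names are hypothetical, not part of gedcom4j):

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: persist each root-level record to its own file as it
// arrives, keyed by xref, so nothing has to stay on the heap.
class DiskRecordStore {
    private final Path dir;

    DiskRecordStore(Path dir) {
        try {
            this.dir = Files.createDirectories(dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private Path fileFor(String xref) {
        // Strip the @-signs so "@I1@" becomes a safe file name like "I1.ser"
        return dir.resolve(xref.replaceAll("[^A-Za-z0-9]", "") + ".ser");
    }

    void save(String xref, Serializable record) {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(fileFor(xref)))) {
            out.writeObject(record);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    Object load(String xref) {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(fileFor(xref)))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```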

frizbog (Owner, Author) commented Dec 7, 2016

Yes, that's the thought. If you wanted to take each root-level object and put it in a SQLite database, for example, keyed by the xref, you'd be able to do that.

@frizbog frizbog self-assigned this Dec 11, 2016
frizbog (Owner, Author) commented Dec 12, 2016

Unfortunately, while working through this, I have discovered that there will need to be some moderately significant changes to the object model. Specifically, rather than holding actual references from one object to a root-level object (such as a Family having references to the Individuals in it), the model will now have to keep a copy of the other item's xref instead, and the calling code will need to resolve xrefs to root-level items.

This is primarily due to forward references (parsing a family full of people before the people themselves have been parsed). When everything was done in memory, and that was the only way to do things, it was possible to do the lookups on the fly. Now that an item can reference an object that has not been instantiated, that won't work anymore, so properties like getFamily() will need to become getFamilyXref() or something similar.
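A sketch of that xref-based scheme with a hypothetical resolver (none of these class names are gedcom4j's actual model):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: the record keeps the other item's xref as a plain
// string, and a separate resolver maps xrefs to root-level records once
// they exist, which tolerates forward references during parsing.
class IndividualRecord {
    String xref;        // this record's own xref, e.g. "@I1@"
    String familyXref;  // was a direct Family reference; now just the xref
}

class XrefResolver {
    private final Map<String, Object> records = new HashMap<>();

    void put(String xref, Object record) {
        records.put(xref, record);
    }

    // Returns null for a forward reference whose record hasn't been stored yet.
    Object resolve(String xref) {
        return records.get(xref);
    }
}
```

The point of the indirection is that `resolve` could be backed by anything that can look up an xref: the in-memory map shown here, a disk store, or a database.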

frizbog (Owner, Author) commented Dec 12, 2016

Further impact: some methods currently in the model will need to move (e.g., getAncestors() will need to move out of Individual), because a given object in the model will no longer have direct references to the other objects; something else that can resolve xrefs to objects will need to perform this work.

It's very arguable that methods like this did not belong in the model in the first place, so architecturally I don't feel bad about it at all, but it is going to impact the API quite a bit and will be a whole lot less convenient.
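For illustration, a relocated getAncestors() might become a service that walks parent xrefs through a resolver function (a hypothetical sketch; `AncestryService` and `parentXrefs` are invented names, not gedcom4j API):

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch: ancestry traversal moved out of Individual into a
// service. "parentXrefs" maps an individual's xref to the xrefs of its
// parents, however those happen to be stored (memory, disk, database).
class AncestryService {
    private final Function<String, List<String>> parentXrefs;

    AncestryService(Function<String, List<String>> parentXrefs) {
        this.parentXrefs = parentXrefs;
    }

    // Breadth-first walk up the tree, collecting each ancestor's xref once.
    Set<String> getAncestors(String xref) {
        Set<String> ancestors = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>(parents(xref));
        while (!queue.isEmpty()) {
            String current = queue.poll();
            if (ancestors.add(current)) {
                queue.addAll(parents(current));
            }
        }
        return ancestors;
    }

    private List<String> parents(String xref) {
        List<String> p = parentXrefs.apply(xref);
        return p == null ? Collections.emptyList() : p;
    }
}
```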

frizbog (Owner, Author) commented Dec 12, 2016

This is going to be a huge deal, the more I dig. The validation framework does internal consistency checks and expects that all the data is available in an org.gedcom4j.model.Gedcom object, which may no longer be the case after this is set up. It also impacts the writer, which also expects a Gedcom object to be in memory.

I think I'm likely to need to make an interface-based replacement for the Gedcom object (the one that holds maps of all the objects) and start passing references to an implementation of it everywhere a Gedcom is currently expected. The default implementation would be the current in-memory, multi-hashmap-based object. Alternate implementations will need to be able to look up objects by xref from disk, a database, etc., in a way that's API-compatible with the in-memory Gedcom object.

frizbog (Owner, Author) commented Dec 12, 2016

Sorry about the stream of consciousness comments...

The more I think about this, the more I suspect I have been trying to solve the wrong problem entirely. It's not the parser, it's the model. The parser just stores to the model, and the validator and writer just pull from the model; none of them cares whether anything is in memory, only that the objects are somehow accessible for read/write.

If I extract an API from the model, and use the current code as the "in-memory" implementation of that API, and make the parser, validator, and writer all access the model through that API, then it won't matter where that data is: heap, disk, database, cloud, whatever. Alternate implementations of that model API would then be possible. Dependency injection could let you pick which model API implementation you want to use in your code, with the current in-memory behavior as the default implementation.
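A minimal sketch of that extraction, with the current in-memory behavior as the default implementation (the interface and method names here are invented for illustration, not gedcom4j's actual IGedcom API):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: an API extracted from the model. Parser, validator,
// and writer would all go through this interface; the in-memory map below
// is the default implementation, and disk/database/cloud-backed ones could
// be injected instead.
interface GedcomModel {
    void addIndividual(String xref, Object individual);

    Object getIndividual(String xref);
}

class InMemoryGedcomModel implements GedcomModel {
    private final Map<String, Object> individuals = new HashMap<>();

    @Override
    public void addIndividual(String xref, Object individual) {
        individuals.put(xref, individual);
    }

    @Override
    public Object getIndividual(String xref) {
        return individuals.get(xref);
    }
}
```

Because callers depend only on the interface, swapping the storage strategy becomes a wiring decision (e.g., via dependency injection) rather than an API change.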

frizbog pushed a commit that referenced this issue Jan 8, 2017
frizbog pushed a commit that referenced this issue Jan 9, 2017: "This is to make room for other implementations of the IGedcom interface"
Projects: v5.0.0 (In Progress)
No branches or pull requests · 2 participants