GenericDocumentProperties multipleStringValues : append, or merge #116

Closed
benlabbe opened this issue Apr 19, 2021 · 3 comments

benlabbe commented Apr 19, 2021

Dear developers,

As a continuation of my work to restore the metadata index and the use cases that go with it in AMOSE, I have a question about GenericDocumentProperties in LIMA.

I'm looking for a definition / specification for the property of type multString // multipleStringValues, specifically used with the meta-data authPrprty.
Finding no documentation, I went straight to the C++ code for this use case. GenericDocumentProperties::addStringValue() is called from StructuredDocumentXMLParser + ContentStructuredDocument, and also from addProperty() in BoWXMLHandler.
When properties and values are added, duplicate values of multipleStringValues properties are currently not handled: new values are appended after the previous ones without any check.
I think merging / deduplication would be useful. It would mirror the behaviour of single-valued properties, where an existing value is simply overwritten.

To give a bit of context, my use case is the following:

  • I am preparing English documents from a batch of a thousand PDFs, each at least 50 pages long.
  • Some PDF documents hold author information as PDF metadata. Tika fetches this information for me.
  • For each input PDF in my process, I format the text content into an XML document with exactly one <engTEXT /> per PDF page.
  • And you guessed it: analyzeXmk and readMultFile show multiple identical values of authPrpty accumulated on a single DOCSET.DOC when it contains several <engTEXT /> text blocks.
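
To make the symptom concrete, here is a minimal sketch of the accumulation. It assumes the author value is added once per text block. `NaiveProperties`, `authorCountAfter`, the property name and the author value are hypothetical stand-ins for illustration, not the actual LIMA code.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the multi-valued part of
// GenericDocumentProperties; the real LIMA class is laid out differently.
class NaiveProperties {
public:
    // Appends without any check, as described in this issue.
    void addStringValue(const std::string& name, const std::string& value) {
        m_multipleStringValues[name].push_back(value);
    }

    // Number of values currently stored under `name`.
    std::size_t count(const std::string& name) const {
        auto it = m_multipleStringValues.find(name);
        return it == m_multipleStringValues.end() ? 0 : it->second.size();
    }

private:
    std::map<std::string, std::vector<std::string>> m_multipleStringValues;
};

// Simulates one document made of `pages` text blocks that all carry
// the same author (assumption: one addStringValue() call per block).
std::size_t authorCountAfter(std::size_t pages) {
    NaiveProperties props;
    for (std::size_t i = 0; i < pages; ++i) {
        props.addStringValue("author", "Jane Doe");
    }
    return props.count("author");
}
```

Under these assumptions, a 50-page document ends up with 50 identical author values on the same document.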

Proposal:

  • deduplicate values in GenericDocumentProperties::addStringValue()
  • hence StructuredDocumentXMLParser will no longer insert duplicate string values in the document
  • and existing .mult files holding duplicate values will be reinterpreted correctly on the fly by BoWXMLHandler::addProperty(). This avoids re-analyzing the whole existing corpus of .mult files. Only the indexing in AMOSE needs to be re-executed.

NB: for weightedProperties, GenericDocumentProperties::addWeightedPropValue() should also be modified to deduplicate.
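
The proposal above could look roughly like the following sketch. `DedupProperties` and its accessors are hypothetical and only reuse the two method names from this issue; the storage layout is an assumption. For weighted values, treating two entries with the same string as duplicates and keeping the first weight is also an assumption, since that case is not specified here.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical simplified model, not the actual LIMA class.
class DedupProperties {
public:
    using WeightedValue = std::pair<std::string, float>;

    // Insert `value` only if it is not already stored under `name`.
    void addStringValue(const std::string& name, const std::string& value) {
        auto& values = m_strings[name];
        if (std::find(values.begin(), values.end(), value) == values.end()) {
            values.push_back(value);
        }
    }

    // Assumption: same string means duplicate; the first weight is kept.
    void addWeightedPropValue(const std::string& name,
                              const WeightedValue& value) {
        auto& values = m_weighted[name];
        auto sameString = [&value](const WeightedValue& existing) {
            return existing.first == value.first;
        };
        if (std::find_if(values.begin(), values.end(), sameString) ==
            values.end()) {
            values.push_back(value);
        }
    }

    // Values stored under `name`, in insertion order (empty if unknown).
    const std::vector<std::string>& stringValues(const std::string& name) const {
        static const std::vector<std::string> empty;
        auto it = m_strings.find(name);
        return it == m_strings.end() ? empty : it->second;
    }

    const std::vector<WeightedValue>& weightedValues(const std::string& name) const {
        static const std::vector<WeightedValue> empty;
        auto it = m_weighted.find(name);
        return it == m_weighted.end() ? empty : it->second;
    }

private:
    std::map<std::string, std::vector<std::string>> m_strings;
    std::map<std::string, std::vector<WeightedValue>> m_weighted;
};
```

A linear scan keeps the insertion order of the values and is cheap for the short value lists expected here; a std::set would also deduplicate but would reorder the values.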

What do you think?
@kleag @romaricb

@benlabbe

Up? Any comments?

kleag commented Jun 18, 2021

Yes, it seems reasonable to me. I cannot see a use case where having several occurrences of the same property would be useful.

benlabbe added a commit that referenced this issue Jun 25, 2021
- prevent deduplication in GenericDocumentProperties::addStringValue()
- idem for addWeightedPropValue()
@benlabbe

Fixed in commit 329f1d6
