Dear developers,
As a continuation of my work to restore the metadata index and the use cases that go with it in AMOSE, I have a question about GenericDocumentProperties in LIMA.
I'm looking for a definition / specification of the property type multString / multipleStringValues, specifically as used with the metadata authPrprty.
In the absence of documentation, I went straight to the C++ code corresponding to the use case. I found that GenericDocumentProperties::addStringValue() is called from StructuredDocumentXMLParser + ContentStructuredDocument, but also from addProperty() in BoWXMLHandler.
When properties and values are added, duplicate values of multipleStringValues properties are currently not handled: new values are appended after the previous ones without any check.
I think merging / deduplication would be useful, in the same spirit as single-valued properties, where an existing value is overwritten rather than accumulated.
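To make the difference concrete, here is a minimal sketch of the behaviour described above. `ToyDocumentProperties`, `setStringValue()`, `countValues()` and `authorValuesAfter()` are hypothetical names used for illustration only; this is a simplified stand-in, not the actual LIMA class or its storage layout.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the behaviour described in this issue.
// This is NOT the actual LIMA GenericDocumentProperties class.
class ToyDocumentProperties {
public:
  // Single-valued property: a later value replaces the earlier one.
  void setStringValue(const std::string& name, const std::string& value) {
    m_stringValues[name] = value;
  }

  // Multi-valued property: every call appends, with no duplicate check.
  void addStringValue(const std::string& name, const std::string& value) {
    m_multipleStringValues[name].push_back(value);
  }

  // Number of values currently stored for a multi-valued property.
  std::size_t countValues(const std::string& name) const {
    auto it = m_multipleStringValues.find(name);
    return it == m_multipleStringValues.end() ? 0 : it->second.size();
  }

private:
  std::map<std::string, std::string> m_stringValues;
  std::map<std::string, std::vector<std::string>> m_multipleStringValues;
};

// Simulates one addStringValue() call per text block of the same document,
// each block carrying the same (made-up) author name.
inline std::size_t authorValuesAfter(std::size_t textBlocks) {
  ToyDocumentProperties props;
  for (std::size_t i = 0; i < textBlocks; ++i) {
    props.addStringValue("authPrprty", "Jane Doe");
  }
  return props.countValues("authPrprty");
}

// Usage sketch: three text blocks yield three identical stored values.
inline void demoAccumulation() {
  assert(authorValuesAfter(3) == 3);
}
```

With an append-only container, the number of stored values grows with the number of text blocks, even though the document has a single author.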
To give a bit of context, my use case is the following:
- I am preparing documents in English from a load of a thousand PDFs, each at least 50 pages long.
- Some PDF documents hold author information as PDF metadata. Tika fetches this info for me.
- For each input PDF document in my process, I have formatted the text content into an XML document with precisely one <engTEXT /> per PDF page.
- And, you guessed it, analyzeXmk and readMultFile show multiple identical values of authPrprty accumulated on a single DOCSET.DOC whenever that DOCSET.DOC contains several <engTEXT /> blocks of text content.
Proposal:
- deduplicate values in GenericDocumentProperties::addStringValue(),
- so that StructuredDocumentXMLParser no longer inserts duplicate string values into the document;
- in addition, existing .mult files that hold duplicate values will be reinterpreted correctly on the fly by BoWXMLHandler::addProperty(). This avoids re-analyzing the whole existing corpus of .mult files; only the indexing in AMOSE needs to be re-executed.
NB: for weightedProperties, GenericDocumentProperties::addWeightedPropValue() should also be modified to deduplicate.
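A rough sketch of what the proposal could look like, under stated assumptions: `DedupDocumentProperties` and its accessors are hypothetical, the real method signatures and storage in LIMA may differ, and the policy for a repeated weighted value (keep one entry, take the latest weight) is only one possible choice.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in, NOT the actual LIMA GenericDocumentProperties class.
class DedupDocumentProperties {
public:
  // Appends the value only if the property does not already hold it.
  // Returns true when the value was inserted.
  bool addStringValue(const std::string& name, const std::string& value) {
    std::vector<std::string>& values = m_multipleStringValues[name];
    if (std::find(values.begin(), values.end(), value) != values.end()) {
      return false;  // duplicate: leave the stored values untouched
    }
    values.push_back(value);
    return true;
  }

  // Assumed policy: a repeated value keeps a single entry and takes the
  // latest weight, similar to overwriting a single-valued property.
  bool addWeightedPropValue(const std::string& name, const std::string& value,
                            float weight) {
    std::vector<std::pair<std::string, float>>& values = m_weightedValues[name];
    auto it = std::find_if(values.begin(), values.end(),
                           [&value](const std::pair<std::string, float>& e) {
                             return e.first == value;
                           });
    if (it != values.end()) {
      it->second = weight;
      return false;
    }
    values.emplace_back(value, weight);
    return true;
  }

  std::size_t stringValueCount(const std::string& name) const {
    auto it = m_multipleStringValues.find(name);
    return it == m_multipleStringValues.end() ? 0 : it->second.size();
  }

  std::size_t weightedValueCount(const std::string& name) const {
    auto it = m_weightedValues.find(name);
    return it == m_weightedValues.end() ? 0 : it->second.size();
  }

private:
  std::map<std::string, std::vector<std::string>> m_multipleStringValues;
  std::map<std::string, std::vector<std::pair<std::string, float>>> m_weightedValues;
};
```

A linear search keeps the insertion order of the values and should be cheap as long as a property holds only a handful of them. Because the check sits in the add methods themselves, every caller benefits, including a reader that replays duplicate values from previously written files.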
What do you think?
@kleag @romaricb