This repository has been archived by the owner on Jul 3, 2023. It is now read-only.

ANY23-304 Add extractor for OpenIE #34

Merged
merged 14 commits into from
Aug 23, 2017

Conversation

lewismc
Member

@lewismc lewismc commented Feb 24, 2017

Hi Folks,
This issue is a rework of #33 which takes on board @ansell's comments to add the new extractor as a separate module as opposed to inside core.
A number of classes are cleaned up for JDK 1.8 compliance.
In addition, this new functionality augments the default configuration by introducing a confidence threshold of 0.5 for OpenIE extractions. Anything below this value is not converted into triples.
I run a test extraction on a reasonably challenging web page from the PO.DAAC, but right now I am not asserting anything.
As far as I can see this is working pretty well but some community review would go a long way.

@lewismc
Member Author

lewismc commented Feb 24, 2017

@ansell is it necessary to put this new module into plugins and have the new extractor implement ExtractorPlugin?

@ansell
Member

ansell commented Feb 24, 2017

I haven't looked at it recently. The META-INF/services should be enough on their own without the explicit plugin support but I can't recall whether there are any other differences that could affect usage.
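
For readers unfamiliar with the mechanism mentioned here, the following is a minimal sketch of discovery through a META-INF/services provider-configuration file and java.util.ServiceLoader. The interface and class names are hypothetical stand-ins and are not Any23's actual extractor types; a temporary directory plays the role of the plugin jar.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

final class ServiceDiscoverySketch {

    /** Hypothetical service interface standing in for an extractor factory. */
    public interface DemoExtractorFactory {
        String name();
    }

    /** Hypothetical provider; ServiceLoader needs a public no-arg constructor. */
    public static final class DemoOpenIEFactory implements DemoExtractorFactory {
        public DemoOpenIEFactory() { }

        @Override
        public String name() {
            return "demo-openie";
        }
    }

    /** Returns the names of all providers found via META-INF/services. */
    static List<String> discover() {
        try {
            // Simulate a plugin jar: a directory holding a provider-configuration
            // file named after the service type and listing the provider class.
            Path root = Files.createTempDirectory("plugin");
            Path services = Files.createDirectories(root.resolve("META-INF/services"));
            Files.write(services.resolve(DemoExtractorFactory.class.getName()),
                    DemoOpenIEFactory.class.getName().getBytes(StandardCharsets.UTF_8));

            List<String> names = new ArrayList<>();
            try (URLClassLoader loader = new URLClassLoader(
                    new URL[] { root.toUri().toURL() },
                    ServiceDiscoverySketch.class.getClassLoader())) {
                // ServiceLoader reads the configuration file and instantiates providers.
                for (DemoExtractorFactory f
                        : ServiceLoader.load(DemoExtractorFactory.class, loader)) {
                    names.add(f.name());
                }
            }
            return names;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(discover());
    }
}
```

If the configuration file is missing from the artifact, the loop simply finds no providers and no error is raised, which is why a packaging problem of this kind can go unnoticed.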

@lewismc
Member Author

lewismc commented Feb 24, 2017

OK, so implementing ExtractorPlugin is not necessary... none of the other plugins use this logic.
I'm trying to get it working via the CLI appassembler script, but no joy yet.

@ansell
Member

ansell commented Feb 24, 2017

The cli module may need the new module added as a dependency to pull it onto the classpath. Strangely enough, it appears as though none of the other plugins are cli dependencies.

@lewismc
Member Author

lewismc commented Feb 24, 2017

Yep, you're right. Bang on the money. I'll update the PR.

@lewismc
Member Author

lewismc commented Feb 28, 2017

Hi @ansell, this is now fixed.
After a good bit of debugging I discovered that some erroneous <resources> descriptions in the plugin pom.xml files meant that the META-INF/services directories were being filtered out of the generated .jar artifacts, so the ServiceLoader did not discover the providers registered there.
If you could pull the code and let me know how you get on it would be appreciated. This is working well for me.
One final thing to note: in the appassembler plugin definition in cli/pom.xml we increase the JVM memory arguments to 6000m, because OpenIE is pretty memory intensive.

@lewismc
Member Author

lewismc commented Feb 28, 2017

Unfortunately, because of the bug that filtered out the META-INF/services directories, the plugins for Any23 2.0 are not as useful as they should be: they cannot be dynamically discovered even when present on the classpath. We should potentially push Any23 2.1 once this patch is merged into master.

@lewismc
Member Author

lewismc commented Mar 1, 2017

PING... anyone that is able to provide a review? Would be very much appreciated.

@ansell
Member

ansell commented Mar 1, 2017

Tests failed for me with OOM:

[INFO] Compiling 1 source file to /home/mint/gitrepos/any23/openie/target/test-classes
[INFO] 
[INFO] --- maven-surefire-plugin:2.19.1:test (default-test) @ apache-any23-openie ---

-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Running org.apache.any23.openie.OpenIEExtractorTest
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/mint/.m2/repository/ch/qos/logback/logback-classic/1.1.2/logback-classic-1.1.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/mint/.m2/repository/org/slf4j/slf4j-log4j12/1.7.21/slf4j-log4j12-1.7.21.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [ch.qos.logback.classic.util.ContextSelectorStaticBinder]
Loading feature templates.
Loading models.
Loading lexica.
Loading configuration.
Loading feature templates.
Loading models.
Loading feature templates.
Loading models.
Loading lexica.
Loading feature templates.
Loading models.
Loading feature templates.
Loading models.
Loading lexica.
Loading feature templates.
Loading models.
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 20.977 sec <<< FAILURE! - in org.apache.any23.openie.OpenIEExtractorTest
testExtractFromHTMLDocument(org.apache.any23.openie.OpenIEExtractorTest)  Time elapsed: 20.282 sec  <<< ERROR!
java.lang.OutOfMemoryError: Java heap space
	at org.apache.any23.openie.OpenIEExtractorTest.extract(OpenIEExtractorTest.java:75)
	at org.apache.any23.openie.OpenIEExtractorTest.testExtractFromHTMLDocument(OpenIEExtractorTest.java:65)

Member

@ansell ansell left a comment

A few comments possibly related to memory usage and performance.

// instance.extr().arg2s().text() - object
for(Instance instance : listExtractions) {
final Configuration immutableConf = DefaultConfiguration.singleton();
if (instance.confidence() > Double.parseDouble(immutableConf.getProperty("any23.extraction.openie.confidence.threshold", "0.5"))) {

Member

The Double.parseDouble here should be done once and stored in a local variable
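
A minimal sketch of the suggested change, assuming java.util.Properties as a stand-in for the Any23 Configuration object and a hypothetical Scored class in place of the OpenIE Instance type. The property key and default are the ones shown in the diff above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

final class ThresholdHoistSketch {

    /** Hypothetical stand-in for an OpenIE extraction with a confidence score. */
    static final class Scored {
        final String text;
        final double confidence;

        Scored(String text, double confidence) {
            this.text = text;
            this.confidence = confidence;
        }
    }

    /** Keeps only the extractions whose confidence is above the configured threshold. */
    static List<Scored> keepConfident(List<Scored> extractions, Properties conf) {
        // Look up and parse the threshold once, before the loop, not per iteration.
        final double threshold = Double.parseDouble(
                conf.getProperty("any23.extraction.openie.confidence.threshold", "0.5"));

        List<Scored> kept = new ArrayList<>();
        for (Scored s : extractions) {
            if (s.confidence > threshold) {
                kept.add(s);
            }
        }
        return kept;
    }
}
```

With the strict comparison used in the diff, an extraction scoring exactly 0.5 is dropped along with everything below it.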

Member Author

I addressed this @ansell

for(Instance instance : listExtractions) {
final Configuration immutableConf = DefaultConfiguration.singleton();
if (instance.confidence() > Double.parseDouble(immutableConf.getProperty("any23.extraction.openie.confidence.threshold", "0.5"))) {
List<Argument> listArg2s = JavaConversions.seqAsJavaList(instance.extr().arg2s());

Member

I am not very familiar with Scala, but does this bring an Iterator-like element into memory completely instead of processing it in a streaming fashion? (Just trying to understand why the OOMs are occurring.)

Member Author

@ansell yes, having debugged the code pretty extensively by now, it reads the entire sequence into memory. For a bit of additional context, the Scala docs state:

Implicitly converts a Scala Seq to a Java List. The returned Java List is backed by the provided Scala Seq and any side-effects of using it via the Java interface will be visible via the Scala interface and vice versa. If the Scala Seq was previously obtained from an implicit or explicit call of asSeq(java.util.List) then the original Java List will be returned.

That being said, I haven't noticed the size of instance.extr().arg2s() being overly large so far.

My feeling is that the OOMs stem from loading the model(s), however I may be wrong. Over on the openie README it states "...openie requires substantial memory.".


public void extract(IRI uri, String filePath)
throws IOException, ExtractionException, TripleHandlerException {
ByteArrayOutputStream baos = new ByteArrayOutputStream();

Member

Writing to a file instead of ByteArrayOutputStream may alleviate some of the memory pressures.

Member Author

@ansell can you please elaborate? It's my understanding that we require the ByteArrayOutputStream as it acts as a parameter for each TripleHandler implementation, e.g. RDFXMLWriter(baos).
I would be happy to stream the extractions in an attempt to mitigate the OOM, however this would happen after the extraction, right? Not before, so I'm not sure how much memory we would be saving.

If you could clarify it would be appreciated. Thanks.

Member

ByteArrayOutputStream will hold all of the results in memory. It may be useful to create a temporary file and reference it as a FileOutputStream, which will have a fixed memory buffer before writing to disk. Just trying to work through the possible avenues where memory requirements can be managed. It may be useful to work through in a debugger to identify the large memory requirement and where that can be lowered, as hopefully the CLI can still be used on small machines after this pull request. Does the model load on every call to the CLI?
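
A minimal sketch of the temporary-file idea, using only JDK classes. The sample N-Triples line and the class name are illustrative assumptions; in the test under review the stream would be handed to a TripleHandler such as RDFXMLWriter instead of being written to directly.

```java
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class TempFileSinkSketch {

    /** Illustrative output line; real output would come from a triple writer. */
    static final byte[] LINE =
            "<urn:s> <urn:p> \"o\" .\n".getBytes(StandardCharsets.UTF_8);

    /** Writes LINE the given number of times to a temp file and returns the file size. */
    static long writeToTempFile(int lines) {
        try {
            Path tmp = Files.createTempFile("extraction", ".nt");
            // Only the fixed-size buffer lives on the heap; the rest goes to disk.
            try (OutputStream out =
                    new BufferedOutputStream(new FileOutputStream(tmp.toFile()))) {
                for (int i = 0; i < lines; i++) {
                    out.write(LINE);
                }
            }
            long size = Files.size(tmp);
            Files.delete(tmp);
            return size;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A ByteArrayOutputStream, by contrast, grows its internal array to hold every byte written, so heap use scales with the size of the output.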

Member Author

OK, I get you, and yes, this makes perfect sense.

Does the model load on every call to the CLI?

... yes... which I realize is far from ideal. A further issue is that, AFAIK, it will also load on every document, so this is a major limitation of the approach as it currently stands.
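
One common way to avoid reloading per document within a single JVM is the initialization-on-demand holder idiom, sketched below. The Model class is a hypothetical stand-in for the OpenIE model, and nothing here reflects how this pull request handles loading. It would not help across separate CLI invocations, since each one starts a fresh JVM.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class ModelHolderSketch {

    /** Counts how many times the (hypothetical) model was constructed. */
    static final AtomicInteger LOADS = new AtomicInteger();

    /** Hypothetical stand-in for a model that is expensive to load. */
    static final class Model {
        Model() {
            LOADS.incrementAndGet(); // imagine loading large model files here
        }

        String extract(String sentence) {
            return sentence.trim();
        }
    }

    /** The JVM initializes this class lazily, exactly once, and thread-safely. */
    private static final class Holder {
        static final Model INSTANCE = new Model();
    }

    /** Returns the shared model, loading it on first use only. */
    static Model model() {
        return Holder.INSTANCE;
    }
}
```

Every document processed in the same JVM then reuses the one instance, so the load cost is paid once per process.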

@lewismc
Member Author

lewismc commented Jul 27, 2017

Hi @ansell, I finally got around to addressing your comments. Just to refresh your memory, use of FileOutputStream (as opposed to ByteArrayOutputStream) within the OpenIEExtractorTest.java logic is faster, by around a quarter of a second or so.
Do you have any further comments on this patch?

@lewismc
Member Author

lewismc commented Aug 2, 2017

Will commit within next day or so if there are no objections.

@ansell
Member

ansell commented Aug 2, 2017

My main objections before were about the larger memory requirements for default use and not being able to run the tests without OOM in my mid-range development machine.

@ansell
Member

ansell commented Aug 2, 2017

Is it an optional plugin in the current setup, so that users with minimal memory available do not need to load it? I haven't had time to look through it, but I see there is a new openie module.

@lewismc
Member Author

lewismc commented Aug 5, 2017

Hi @ansell, yes, this is a separate module, however currently it always builds with the CLI module. I'm going to push an update which disables the module tests by default.

@lewismc
Member Author

lewismc commented Aug 23, 2017

Hi @ansell, in my last commit I've pushed a couple of (hopefully) satisfying additions, namely

  • removal of the openie module from the CLI (meaning that the OpenIE extractor is not executed by default during normal unit test execution)
  • addition of some class loading logic which improves the flexibility of extractor detection based upon the presence of the OpenIE extractor.

The openie tests are now not executed by default. This will dramatically reduce 1) the time the tests take, and 2) the memory required to execute them.
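
As a rough illustration of detecting an optional extractor by its presence on the classpath, here is a minimal sketch. It is not the class loading logic from the commit; the helper name is made up, and the missing class name in the usage note is hypothetical.

```java
final class OptionalExtractorSketch {

    /** Returns true if the named class can be found on the classpath. */
    static boolean isPresent(String className) {
        try {
            // initialize=false: only locate the class, do not run static initializers,
            // so a heavy optional component is not started just by probing for it.
            Class.forName(className, false,
                    OptionalExtractorSketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isPresent("java.lang.String"));
    }
}
```

A caller could register the optional extractor only when a probe such as isPresent("org.example.hypothetical.MissingExtractor") returns true, and carry on with the default set otherwise.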

Thanks for any final review.
Lewis

@asfgit asfgit merged commit c40b788 into apache:master Aug 23, 2017