ANY23-304 Add extractor for OpenIE #34

lewismc · 2017-02-24T01:32:40Z

Hi Folks,
This issue is a rework of #33 which takes on board @ansell's comments to add the new extractor as a separate module as opposed to inside of core.
There are a number of classes which are cleaned up for JDK1.8 compliance.
In addition, this new functionality augments the default configuration by introducing a threshold for OpenIE extractions of 0.5. Anything below this value is not converted into triples.
I run a test extraction on a reasonably challenging web page from the PO.DAAC, but right now I am not asserting anything.
As far as I can see this is working pretty well but some community review would go a long way.

lewismc · 2017-02-24T04:15:59Z

@ansell is it necessary to put this new module into plugins and have the new extractor implement ExtractorPlugin?

ansell · 2017-02-24T04:24:42Z

I haven't looked at it recently. The META-INF/services should be enough on their own without the explicit plugin support but I can't recall whether there are any other differences that could affect usage.
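
For context, here is a minimal sketch of the discovery mechanism referred to above, using only java.util.ServiceLoader from the JDK. The DemoExtractorFactory interface and the class names are hypothetical stand-ins, not Any23's actual API. ServiceLoader looks for a provider-configuration file named after the interface under META-INF/services on the classpath, so if that file is missing from a jar the loop simply finds nothing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

/** Hypothetical service interface for illustration only (not an Any23 type). */
interface DemoExtractorFactory {
    String name();
}

final class ServiceDiscoveryDemo {

    /** Returns the names of all providers that ServiceLoader can find on the classpath. */
    static List<String> discoverNames() {
        List<String> names = new ArrayList<>();
        // ServiceLoader reads META-INF/services/<fully-qualified interface name>
        // from every jar or directory on the classpath and instantiates each listed class.
        for (DemoExtractorFactory factory : ServiceLoader.load(DemoExtractorFactory.class)) {
            names.add(factory.name());
        }
        return names;
    }

    public static void main(String[] args) {
        // With no provider-configuration file on the classpath this list is empty.
        System.out.println("Discovered providers: " + discoverNames());
    }
}
```

A module that ships its own provider-configuration file is picked up this way as soon as its jar is on the classpath, without the consumer naming the implementation class.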

lewismc · 2017-02-24T04:48:43Z

OK so implementing ExtractorPlugin is not necessary... none of the other plugins use this logic.
I'm trying to get it working via the CLI appassembler script, but no joy yet.

ansell · 2017-02-24T05:12:36Z

The cli module may need the new module added as a dependency to pull it onto the classpath. Strangely enough, it appears as though none of the other plugins are cli dependencies.

lewismc · 2017-02-24T05:46:33Z

Yep, you're right. Bang on the money. I'll update the PR.

lewismc · 2017-02-28T05:00:41Z

Hi @ansell this is now fixed... if you could pull the code and let me know how you get on it would be appreciated.
After a good bit of debugging I discovered that some erroneous <resources> descriptions in the plugin pom.xml files meant that the META-INF/services directories were being filtered out of the generated .jar artifacts, so the ServiceLoader did not discover them.
This is working well for me.
One final thing to note: in the appassembler plugin definition in cli/pom.xml we increase the JVM memory argument to 6000m, because OpenIE is pretty memory intensive.

lewismc · 2017-02-28T05:02:33Z

Unfortunately, because the META-INF/services directories were being filtered out, the plugins for Any23 2.0 are not as useful as they should be: they cannot be dynamically discovered even when present on the classpath. We should potentially push Any23 2.1 once this patch is merged into master.

lewismc · 2017-03-01T19:12:27Z

PING... is anyone able to provide a review? It would be very much appreciated.

ansell · 2017-03-01T22:24:42Z

Tests failed for me with OOM:

[INFO] Compiling 1 source file to /home/mint/gitrepos/any23/openie/target/test-classes
[INFO] 
[INFO] --- maven-surefire-plugin:2.19.1:test (default-test) @ apache-any23-openie ---

-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Running org.apache.any23.openie.OpenIEExtractorTest
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/mint/.m2/repository/ch/qos/logback/logback-classic/1.1.2/logback-classic-1.1.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/mint/.m2/repository/org/slf4j/slf4j-log4j12/1.7.21/slf4j-log4j12-1.7.21.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [ch.qos.logback.classic.util.ContextSelectorStaticBinder]
Loading feature templates.
Loading models.
Loading lexica.
Loading configuration.
Loading feature templates.
Loading models.
Loading feature templates.
Loading models.
Loading lexica.
Loading feature templates.
Loading models.
Loading feature templates.
Loading models.
Loading lexica.
Loading feature templates.
Loading models.
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 20.977 sec <<< FAILURE! - in org.apache.any23.openie.OpenIEExtractorTest
testExtractFromHTMLDocument(org.apache.any23.openie.OpenIEExtractorTest)  Time elapsed: 20.282 sec  <<< ERROR!
java.lang.OutOfMemoryError: Java heap space
	at org.apache.any23.openie.OpenIEExtractorTest.extract(OpenIEExtractorTest.java:75)
	at org.apache.any23.openie.OpenIEExtractorTest.testExtractFromHTMLDocument(OpenIEExtractorTest.java:65)

ansell

A few comments possibly related to memory usage and performance.

ansell · 2017-03-01T22:28:00Z

openie/src/main/java/org/apache/any23/extractor/openie/OpenIEExtractor.java

+        // instance.extr().arg2s().text() - object
+        for(Instance instance : listExtractions) {
+            final Configuration immutableConf = DefaultConfiguration.singleton();
+            if (instance.confidence() > Double.parseDouble(immutableConf.getProperty("any23.extraction.openie.confidence.threshold", "0.5"))) {


The Double.parseDouble here should be done once and stored in a local variable

I addressed this @ansell
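
A minimal sketch of the suggested change: read and parse the threshold once, then reuse the local value inside the loop. The property key and the "0.5" default come from the diff above. The Scored class is a hypothetical stand-in for the OpenIE Instance type, and java.util.Properties stands in for Any23's Configuration, so treat this as an illustration rather than the committed code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

final class ConfidenceFilterSketch {

    /** Property key as it appears in the reviewed diff. */
    static final String KEY = "any23.extraction.openie.confidence.threshold";

    /** Hypothetical stand-in for an OpenIE extraction with a confidence score. */
    static final class Scored {
        final String text;
        final double confidence;

        Scored(String text, double confidence) {
            this.text = text;
            this.confidence = confidence;
        }
    }

    /** Keeps only extractions whose confidence is strictly above the configured threshold. */
    static List<Scored> aboveThreshold(List<Scored> extractions, Properties conf) {
        // Parse once, outside the loop, instead of once per extraction.
        final double threshold = Double.parseDouble(conf.getProperty(KEY, "0.5"));
        List<Scored> kept = new ArrayList<>();
        for (Scored s : extractions) {
            if (s.confidence > threshold) {
                kept.add(s);
            }
        }
        return kept;
    }
}
```

Because the comparison is strict, an extraction scoring exactly the threshold is dropped, which matches the `>` used in the diff.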

ansell · 2017-03-01T22:29:08Z

openie/src/main/java/org/apache/any23/extractor/openie/OpenIEExtractor.java

+        for(Instance instance : listExtractions) {
+            final Configuration immutableConf = DefaultConfiguration.singleton();
+            if (instance.confidence() > Double.parseDouble(immutableConf.getProperty("any23.extraction.openie.confidence.threshold", "0.5"))) {
+                List<Argument> listArg2s = JavaConversions.seqAsJavaList(instance.extr().arg2s());


I am not very familiar with Scala, but does this bring an Iterator-like element into memory completely instead of processing it in a streaming fashion? (Just trying to understand why the OOMs are occurring.)

@ansell yes, having debugged the code pretty extensively by now, it reads the entire sequence into memory. You can see the Scala docs here; for a bit of additional context, they state:

Implicitly converts a Scala Seq to a Java List. The returned Java List is backed by the provided Scala Seq and any side-effects of using it via the Java interface will be visible via the Scala interface and vice versa. If the Scala Seq was previously obtained from an implicit or explicit call of asSeq(java.util.List) then the original Java List will be returned.

This being said, I haven't noticed the size of instance.extr().arg2s() being overly large so far.

My feeling is that the OOMs stem from loading the model(s), however I may be wrong. Over on the openie README it states "...openie requires substantial memory."

ansell · 2017-03-01T22:30:12Z

openie/src/test/java/org/apache/any23/openie/OpenIEExtractorTest.java

+
+    public void extract(IRI uri, String filePath) 
+      throws IOException, ExtractionException, TripleHandlerException {
+      ByteArrayOutputStream baos = new ByteArrayOutputStream();


Writing to a file instead of ByteArrayOutputStream may alleviate some of the memory pressures.

@ansell can you please elaborate? It's my understanding that we require the ByteArrayOutputStream as it acts as a parameter for each TripleHandler implementation, e.g. RDFXMLWriter(baos).
I would be happy to stream the extractions as an attempt to mitigate OOM, however this would be after the extraction, right? Not before, therefore I'm not sure how much memory we would be saving.

If you could clarify it would be appreciated. Thanks.

ByteArrayOutputStream will hold all of the results in memory. It may be useful to create a temporary file and reference it as a FileOutputStream, which will have a fixed memory buffer before writing to disk. Just trying to work through the possible avenues where memory requirements can be managed. It may be useful to work through in a debugger to identify the large memory requirement and where that can be lowered, as hopefully the CLI can still be used on small machines after this pull request. Does the model load on every call to the CLI?

OK I get you and yes this makes perfect sense.
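
The temporary-file approach described above can be sketched roughly as follows. The class name, method name and file prefix are hypothetical; the point is only that a buffered FileOutputStream keeps a small fixed buffer in memory and spills the rest to disk, whereas a ByteArrayOutputStream grows with the output. In the test under review the stream would be handed to the writer in place of baos.

```java
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class TempFileSinkSketch {

    /**
     * Writes each line to a temporary file and returns its path.
     * Only the stream's internal buffer is held in memory at any time.
     */
    static Path writeLines(Iterable<String> lines) {
        try {
            Path tmp = Files.createTempFile("openie-extract", ".out");
            try (OutputStream out = new BufferedOutputStream(new FileOutputStream(tmp.toFile()))) {
                for (String line : lines) {
                    out.write((line + "\n").getBytes(StandardCharsets.UTF_8));
                }
            }
            return tmp;
        } catch (IOException e) {
            // Wrapped so callers in this sketch do not need a throws clause.
            throw new UncheckedIOException(e);
        }
    }
}
```

The caller is responsible for deleting the temporary file once the assertions on its contents are done.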

Does the model load on every call to the CLI?

... yes... which I realize is far from ideal. The issue with this as well is that it will load on every document AFAIK, so this is a major limitation of the approach as it currently stands.
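
One common way to avoid reloading an expensive resource for every document within a single JVM is a lazily initialised holder. The sketch below is an assumption about how that could look, not what this pull request does: Model is a hypothetical placeholder for the OpenIE model, and the counter exists only to make the single load observable. It does not help across separate CLI invocations, since each one starts a new JVM.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class ModelCacheSketch {

    /** Counts how many times the (hypothetical) model has been constructed. */
    static final AtomicInteger LOADS = new AtomicInteger();

    /** Hypothetical placeholder for an expensive-to-load model. */
    static final class Model {
        Model() {
            // Real model loading would happen here.
            LOADS.incrementAndGet();
        }
    }

    /** Initialization-on-demand holder: the JVM creates INSTANCE once, on first access. */
    private static final class Holder {
        static final Model INSTANCE = new Model();
    }

    /** Returns the shared model, loading it on the first call only. */
    static Model model() {
        return Holder.INSTANCE;
    }
}
```

Class initialisation is thread-safe in Java, so no explicit locking is needed for this pattern.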

lewismc · 2017-07-27T19:20:52Z

Hi @ansell, I finally got around to addressing your comments. Just to refresh your memory, use of FileOutputStream (as opposed to ByteArrayOutputStream) within the OpenIEExtractorTest.java logic is more performant, by around 1/4 second or so.
Do you have any further comments on this patch?

lewismc · 2017-08-02T02:26:38Z

Will commit within next day or so if there are no objections.

ansell · 2017-08-02T02:47:03Z

My main objections before were about the larger memory requirements for default use and not being able to run the tests without OOM on my mid-range development machine.

ansell · 2017-08-02T02:48:06Z

Is it an optional plugin in the current setup, so that users with minimal memory available do not need to load it? I haven't had time to look through it, but I see there is a new openie module.

lewismc · 2017-08-05T16:40:37Z

Hi @ansell, yes this is a separate module; however, currently it always builds with the CLI module. I'm going to push an update which disables the module tests by default.

lewismc · 2017-08-23T19:19:07Z

Hi @ansell, in my last commit I've pushed a couple of (hopefully) satisfying additions, namely

removal of the openie module from the CLI (meaning that the OpenIE extractor is not executed during normal unit test execution)
addition of some class loading logic which improves the flexibility of extractor detection based upon the presence of the OpenIE extractor.

OpenIE tests are now not executed by default... this will dramatically reduce 1) the time taken by the tests, and 2) the memory required to execute them.
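
Detection based on the presence of a class can be sketched with Class.forName, as an illustration of the idea rather than the code in this commit. The helper name is hypothetical, and the extractor class name in main is inferred from the file path in the review diff, so it may not match the final package naming.

```java
final class OptionalExtractorDetection {

    /** Returns true if the named class can be found on the classpath, without initialising it. */
    static boolean isPresent(String className) {
        try {
            // initialize=false avoids running static initialisers just to probe for presence.
            Class.forName(className, false, OptionalExtractorDetection.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class name assumed from the path shown in the review; adjust if the package differs.
        String candidate = "org.apache.any23.extractor.openie.OpenIEExtractor";
        System.out.println(candidate + " present: " + isPresent(candidate));
    }
}
```

Code that counts or registers extractors can branch on this check, so the same tests hold whether or not the optional module is on the classpath.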

Thanks for any final review.
Lewis

lewismc added 3 commits February 23, 2017 17:26

ANY23-304 Add extractor for OpenIE

2ecfbff

Add META-INF service discovery for openie

6871755

Make pom relative parents consistent

0910104

Create consistent package naming

01abf8f

lewismc added 3 commits February 27, 2017 09:56

ANY23-304 update package names and introduce Service Loading for Open…

2f54725

…IE module

Fix package naming in OpenIE TestCase

1bb96c4

Fix CLassLoading issues and test issues introduced with ANY23-274

89d1d85

ansell suggested changes Mar 1, 2017

View reviewed changes

lewismc added 5 commits March 1, 2017 17:54

ANY23-304 Address comments from ansell

1b0c5ff

Resolve all documentation conflicts

a2d07fc

ANY23-304 merge with master branch

d4008bc

ANY23-304 increase number of extractors found

6d5c39e

ANY23-304 implement temporary file reader within test logic

b39d220

ANY23-304 Add extractor for OpenIE

ef14614

ANY23-304 skip tests in openie module

c40b788

asfgit merged commit c40b788 into apache:master Aug 23, 2017
