Conversation
@ansell is it necessary to put this new module into
I haven't looked at it recently. The META-INF/services should be enough on their own without the explicit plugin support, but I can't recall whether there are any other differences that could affect usage.
OK, so implementing ExtractorPlugin is not necessary... none of the other plugins use this logic.
The cli module may need the new module added as a dependency to pull it onto the classpath. Strangely enough, it appears as though none of the other plugins are cli dependencies.
Yep, you're right. Bang on the money. I'll update the PR.
Hi @ansell, this is now fixed... if you could pull the code and let me know how you get on, it would be appreciated.
Unfortunately... due to the bugs regarding the
PING... is anyone able to provide a review? It would be very much appreciated.
Tests failed for me with OOM:
A few comments possibly related to memory usage and performance.
```java
// instance.extr().arg2s().text() - object
for (Instance instance : listExtractions) {
    final Configuration immutableConf = DefaultConfiguration.singleton();
    if (instance.confidence() > Double.parseDouble(immutableConf.getProperty("any23.extraction.openie.confidence.threshold", "0.5"))) {
```
The `Double.parseDouble` here should be done once and stored in a local variable outside of the loop.
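As a minimal sketch of the suggestion, with a hypothetical `getProperty` standing in for `DefaultConfiguration.singleton().getProperty(...)` and hypothetical confidence values standing in for `instance.confidence()`, the parse can be hoisted out of the loop:

```java
public class ThresholdHoistSketch {

    // Hypothetical stand-in for DefaultConfiguration.singleton().getProperty(key, default).
    static String getProperty(String key, String defaultValue) {
        return defaultValue;
    }

    public static void main(String[] args) {
        // Parse the threshold once, before the loop, instead of on every iteration.
        final double threshold = Double.parseDouble(
                getProperty("any23.extraction.openie.confidence.threshold", "0.5"));

        // Hypothetical confidences standing in for instance.confidence() values.
        double[] confidences = { 0.3, 0.6, 0.9 };
        int kept = 0;
        for (double confidence : confidences) {
            if (confidence > threshold) {
                kept++;
            }
        }
        System.out.println(kept); // prints 2
    }
}
```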
I addressed this @ansell
```java
for (Instance instance : listExtractions) {
    final Configuration immutableConf = DefaultConfiguration.singleton();
    if (instance.confidence() > Double.parseDouble(immutableConf.getProperty("any23.extraction.openie.confidence.threshold", "0.5"))) {
        List<Argument> listArg2s = JavaConversions.seqAsJavaList(instance.extr().arg2s());
```
I am not very familiar with Scala, but does this bring an Iterator-like element into memory completely, instead of processing it in a streaming fashion? (Just trying to understand why the OOMs are occurring.)
@ansell yes, having debugged the code pretty extensively by now: it reads the entire sequence into memory. For a bit of additional context, the Scala docs state:

> Implicitly converts a Scala Seq to a Java List. The returned Java List is backed by the provided Scala Seq and any side-effects of using it via the Java interface will be visible via the Scala interface and vice versa. If the Scala Seq was previously obtained from an implicit or explicit call of asSeq(java.util.List) then the original Java List will be returned.

That being said, I haven't noticed the size of `instance.extr().arg2s()` being overly large so far.
My feeling is that the OOMs are stemming from loading the model(s); however, I may be wrong. The openie README states "...openie requires substantial memory.".
```java
public void extract(IRI uri, String filePath)
        throws IOException, ExtractionException, TripleHandlerException {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
```
Writing to a file instead of a `ByteArrayOutputStream` may alleviate some of the memory pressure.
@ansell can you please elaborate? It's my understanding that we require the `ByteArrayOutputStream` as it acts as a parameter for each `TripleHandler` implementation, e.g. `RDFXMLWriter(baos)`.
I would be happy to stream the extractions as an attempt to mitigate against OOM; however, this would happen after the extraction, right? Not before, so I'm not sure how much memory we would be saving.
If you could clarify, it would be appreciated. Thanks.
`ByteArrayOutputStream` will hold all of the results in memory. It may be useful to create a temporary file and reference it via a `FileOutputStream`, which has a fixed memory buffer before writing to disk. I'm just trying to work through the possible avenues where memory requirements can be managed. It may also be useful to step through in a debugger to identify the large memory requirement and where it can be lowered, as hopefully the CLI can still be used on small machines after this pull request. Does the model load on every call to the CLI?
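A rough sketch of the temp-file approach, assuming only that the writer accepts any `OutputStream` (the file prefix and payload here are illustrative, not from the PR):

```java
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class TempFileOutputSketch {
    public static void main(String[] args) throws IOException {
        // A temporary file: BufferedOutputStream keeps only a fixed-size buffer
        // in memory and spills the rest to disk, unlike ByteArrayOutputStream,
        // which accumulates the entire result on the heap.
        File tmp = File.createTempFile("openie-extract-", ".rdf");
        tmp.deleteOnExit();
        try (OutputStream out = new BufferedOutputStream(new FileOutputStream(tmp))) {
            // Illustrative payload; in the PR this would be the TripleHandler's output.
            out.write("<rdf:RDF/>".getBytes(StandardCharsets.UTF_8));
        }
        System.out.println(Files.size(tmp.toPath())); // prints 10
    }
}
```

Since constructors like `RDFXMLWriter` take an `OutputStream` parameter, the swap should be transparent to the handler code; only the place that later reads the bytes back needs to change.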
OK, I get you, and yes, this makes perfect sense.

> Does the model load on every call to the CLI?

... yes... which I realize is far from ideal. The issue is that it will load on every document AFAIK, so this is a major limitation of the approach as it currently sits.
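One common way to avoid the per-document reload is lazy static caching; this is a sketch with a hypothetical `loadModel()` standing in for the expensive OpenIE model load (it only helps across documents within one JVM run, not across separate CLI invocations):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ModelCacheSketch {
    // Counts loads so the sketch can demonstrate the model is loaded only once.
    static final AtomicInteger loadCount = new AtomicInteger();

    // Hypothetical expensive load, standing in for the real OpenIE model setup.
    static Object loadModel() {
        loadCount.incrementAndGet();
        return new Object();
    }

    private static Object cachedModel;

    // Load lazily on first use, then reuse the same instance for every document.
    static synchronized Object getModel() {
        if (cachedModel == null) {
            cachedModel = loadModel();
        }
        return cachedModel;
    }

    public static void main(String[] args) {
        Object first = getModel();
        Object second = getModel();
        System.out.println(first == second); // prints true
        System.out.println(loadCount.get()); // prints 1
    }
}
```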
Hi @ansell, I finally got around to addressing your comments. Just to refresh your memory: using `FileOutputStream` (as opposed to `ByteArrayOutputStream`) within the OpenExtractorTest.java logic is more performant, by around a quarter of a second or so.
Will commit within the next day or so if there are no objections.
My main objections before were about the larger memory requirements for default use, and not being able to run the tests without OOM on my mid-range development machine.
Is it an optional plugin in the current setup, so that users with minimal memory available are not forced to load it? I haven't had time to look through it, but I see there is a new openie module.
Hi @ansell yes this is a separate module however currently it always builds with CLI module. I'm going to push an update which disables the module tests by default. |
Hi @ansell, in my last commit I've pushed a couple of (hopefully) satisfying additions. Namely, the openie tests are no longer executed by default... this will dramatically reduce 1) the time taken to run the tests, and 2) the memory required to execute them. Thanks for any final review.
Hi Folks,
This issue is a rework of #33, which takes on board @ansell's comments to add the new extractor as a separate module, as opposed to inside of core.
There are a number of classes which are cleaned up for JDK1.8 compliance.
In addition, this new functionality augments the default configuration by introducing a threshold for OpenIE extractions of 0.5. Anything below this value is not converted into triples.
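The threshold is read from the configuration key visible in the diff; as an illustrative (hypothetical override, not from the PR) properties fragment:

```properties
# Extractions with confidence at or below this value are not converted to triples.
# The PR's default is 0.5; a stricter value such as 0.6 keeps fewer, higher-confidence triples.
any23.extraction.openie.confidence.threshold=0.6
```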
I ran a test extraction on a reasonably challenging webpage from the PO.DAAC, but right now I am not asserting anything.
As far as I can see this is working pretty well but some community review would go a long way.