Time normalization config in eidos.conf and reference.conf #443

Closed
bgyori opened this issue Sep 21, 2018 · 22 comments

Comments

@bgyori
Contributor

bgyori commented Sep 21, 2018

I've been trying to configure Eidos to use the time normalization feature and I'm running into some issues. There are three issues, but they are related, so I'm putting them all here.

First, I am wondering if some of the differences in eidos.conf and reference.conf are on purpose or not.

  1. In reference.conf timeNormModelPath is set to

timeNormModelPath = /org/clulab/wm/eidos/english/models/timenorm_model.hdf5

whereas in eidos.conf it is set to

timeNormModelPath = /org/clulab/wm/eidos/models/timenorm_model.hdf5

I think between the two, the latter is the better default setting since timenorm_model.hdf5 is part of the repo at org/clulab/wm/eidos/models/timenorm_model.hdf5. Should I update the default reference.conf to use this path?

  2. Another inconsistency between the two conf files is that in reference.conf

useTimeNorm = false

is set but there is no useTimeNorm row in eidos.conf. Would it make sense to include the same row with the same default value in eidos.conf as well?

  3. Now, using the settings as follows:

timeNormModelPath = /org/clulab/wm/eidos/models/timenorm_model.hdf5
...
useTimeNorm = true

and running

java -Xmx12G -cp /Users/ben/tmp/eidos/target/scala-2.12/eidos-assembly-0.2.2-SNAPSHOT.jar org.clulab.wm.eidos.apps.ExtractFromDirectory /Users/ben/tmp/eidos/docs /Users/ben/tmp/eidos/docs

I get

15:22:16.328 [scala-execution-context-global-11] INFO  e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
15:22:17.500 [scala-execution-context-global-11] INFO  e.s.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [1.2 sec].
jar:file:/Users/ben/tmp/eidos/target/scala-2.12/eidos-assembly-0.2.2-SNAPSHOT.jar!/org/clulab/wm/eidos/models/timenorm_model.hdf5
Exception in thread "main" java.nio.file.FileSystemNotFoundException
	at com.sun.nio.zipfs.ZipFileSystemProvider.getFileSystem(ZipFileSystemProvider.java:171)
	at com.sun.nio.zipfs.ZipFileSystemProvider.getPath(ZipFileSystemProvider.java:157)
	at java.nio.file.Paths.get(Paths.java:143)
	at org.clulab.wm.eidos.EidosSystem$LoadableAttributes$.apply(EidosSystem.scala:129)
	at org.clulab.wm.eidos.EidosSystem.<init>(EidosSystem.scala:153)
	at org.clulab.wm.eidos.apps.ExtractFromDirectory$.delayedEndpoint$org$clulab$wm$eidos$apps$ExtractFromDirectory$1(ExtractFromDirectory.scala:14)
	at org.clulab.wm.eidos.apps.ExtractFromDirectory$delayedInit$body.apply(ExtractFromDirectory.scala:9)
	at scala.Function0.apply$mcV$sp(Function0.scala:34)
	at scala.Function0.apply$mcV$sp$(Function0.scala:34)
	at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
	at scala.App.$anonfun$main$1$adapted(App.scala:76)
	at scala.collection.immutable.List.foreach(List.scala:389)
	at scala.App.main(App.scala:76)
	at scala.App.main$(App.scala:74)
	at org.clulab.wm.eidos.apps.ExtractFromDirectory$.main(ExtractFromDirectory.scala:9)
	at org.clulab.wm.eidos.apps.ExtractFromDirectory.main(ExtractFromDirectory.scala)

Note that I added a debug print: println(timeNormResource) on EidosSystem.scala line 127 to produce this line

jar:file:/Users/ben/tmp/eidos/target/scala-2.12/eidos-assembly-0.2.2-SNAPSHOT.jar!/org/clulab/wm/eidos/models/timenorm_model.hdf5

I also confirmed by browsing the jar file itself that /org/clulab/wm/eidos/models/timenorm_model.hdf5 is at the specified location within the JAR.
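A note on the stack trace above: `Paths.get(URI)` on a `jar:` URI only succeeds when a zip file system for that archive is already open, which is consistent with the `FileSystemNotFoundException` raised from `ZipFileSystemProvider.getFileSystem`. The sketch below reproduces the behaviour with a throwaway archive and shows one possible workaround, copying the entry out to a regular file. The class name, method names, and archive contents are hypothetical stand-ins; this is not the Eidos code.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystem;
import java.nio.file.FileSystemNotFoundException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.Collections;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

class JarResourcePathDemo {
    /** Builds a tiny archive with one entry and returns a jar: URI pointing at that entry. */
    static URI makeJarEntryUri(String entryName) throws IOException {
        Path archive = Files.createTempFile("demo", ".jar");
        archive.toFile().deleteOnExit();
        try (ZipOutputStream zip = new ZipOutputStream(Files.newOutputStream(archive))) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write("stand-in model bytes".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return URI.create("jar:" + archive.toUri() + "!/" + entryName);
    }

    /** Returns true if Paths.get(uri) throws because no zip file system is open for the archive. */
    static boolean failsWithoutOpenFileSystem(URI uri) {
        try {
            Paths.get(uri);
            return false;
        } catch (FileSystemNotFoundException e) {
            return true;
        }
    }

    /** Opens the archive as a file system, then copies the entry to a regular temp file. */
    static Path extractToTempFile(URI uri) throws IOException {
        Path target = Files.createTempFile("model", ".hdf5");
        target.toFile().deleteOnExit();
        try (FileSystem fs = FileSystems.newFileSystem(uri, Collections.<String, Object>emptyMap())) {
            // With the file system open, the same Paths.get(uri) call now resolves.
            Files.copy(Paths.get(uri), target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        URI uri = makeJarEntryUri("models/timenorm_model.hdf5");
        System.out.println("Fails without open file system: " + failsWithoutOpenFileSystem(uri));
        System.out.println("Extracted copy: " + extractToTempFile(uri));
    }
}
```

Extracting to a regular file is useful when the consumer of the model needs an ordinary file path rather than a stream.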

Thanks for your help!

@kwalcock
Member

kwalcock commented Sep 21, 2018 via email

@bgyori
Contributor Author

bgyori commented Sep 21, 2018

Thanks, so if I understand correctly, if I put the hdf5 file outside the JAR and reference it by its absolute path it should work. Let me try that and I'll report back!

@kwalcock
Member

Yes, that's it. Also please be advised that we are working on related performance issues. Don't plan to run lots of files with timenorm on.

@bgyori
Contributor Author

bgyori commented Sep 21, 2018

Alright, that seems to have worked as far as the file path goes. In particular,

java -Xmx12G -cp /Users/ben/tmp/eidos/target/scala-2.12/eidos-assembly-0.2.2-SNAPSHOT.jar org.clulab.wm.eidos.apps.ExtractFromDirectory /Users/ben/tmp/eidos/docs /Users/ben/tmp/eidos/docs

works as expected, and timexes show up in the output JSON-LD.

However, the other reading mode we have been using, which reads snippets of text directly by creating an instance of EidosSystem and calling its extractFromText method (from Python), gives me this error:

JavaException: JVM exception occurred: Text 'nullT00:00:00' could not be parsed at index 0

Any clues what might be behind this?

@kwalcock
Member

kwalcock commented Sep 21, 2018

Still working on it...
The program ExtractFromDirectory uses extractFromText, so it should be working in general. Can you send a specific sentence that's a problem and/or the important part of code? Thanks. It doesn't seem that @EgoLaparra is online to respond.

@bgyori
Contributor Author

bgyori commented Sep 24, 2018

I think I have a guess: from Python I'm passing scala.Some(None) as the fourth argument which is the documentCreationTime. I thought passing in None would be adequate because None is defined as the default argument, and ExtractFromDirectory doesn't specify this argument:

val annotatedDocuments = Seq(reader.extractFromText(text))

With some experimentation, I found that if I change the argument to scala.Some('2018'), I get this error:

JavaException: JVM exception occurred: Text '2018T00:00:00' could not be parsed at index 4

@EgoLaparra
Contributor

For the moment, the DocTime must be in YYYY-MM-DD format. Try passing something like scala.Some('2018-09-24').
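The two error messages reported above are what `java.time.LocalDateTime.parse` produces when a date string is suffixed with `T00:00:00` before parsing. That suffixing step is an assumption inferred from the messages, not taken from the Eidos source, and the class and method names below are hypothetical. A minimal sketch of why `null` and `2018` fail while `2018-09-24` works:

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeParseException;

class DocTimeParseDemo {
    /** Assumed conversion: append midnight and parse as an ISO local date-time. */
    static LocalDateTime toDocTime(String date) {
        // A null reference concatenates as the text "null", giving "nullT00:00:00".
        return LocalDateTime.parse(date + "T00:00:00");
    }

    /** Returns the index at which parsing failed, or -1 if the string parsed. */
    static int errorIndex(String date) {
        try {
            toDocTime(date);
            return -1;
        } catch (DateTimeParseException e) {
            return e.getErrorIndex();
        }
    }

    public static void main(String[] args) {
        String[] samples = {null, "2018", "2018-09-24"};
        for (String date : samples) {
            System.out.println(date + " -> error index " + errorIndex(date));
        }
    }
}
```

Under this assumption, a Python `None` wrapped in `scala.Some` would reach the parser as a null reference, which would explain the `nullT00:00:00` text.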

@bgyori
Contributor Author

bgyori commented Sep 24, 2018

Thanks @EgoLaparra, that worked! Let me test it some more and then I'll close this issue.

@EgoLaparra
Contributor

By the way, what happens if you don't pass the fourth argument?

@bgyori
Contributor Author

bgyori commented Sep 24, 2018

Complicated... The Java-Python bridge called jnius that allows us to use Eidos programmatically at all is not really meant to be used with Scala. Java methods don't have default arguments (instead, you define the method multiple times with different sets of arguments), so jnius thinks this method needs 5 arguments and errors if you call it with fewer. This is what prompted e.g. this line:
https://github.com/clulab/eidos/blob/master/src/main/scala/org/clulab/wm/eidos/EidosSystem.scala#L32
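For readers unfamiliar with the Java idiom mentioned here, the sketch below shows how overloads can stand in for default arguments, so that a caller coming through a Java bridge can use a shorter argument list. The class is hypothetical and the method body is a stand-in that only reports its arguments; only the parameter list mirrors the `extractFromText` signature quoted in this thread.

```java
import java.util.Optional;

class ExtractFromTextOverloads {
    // Full form, with the same five parameters as the Scala signature in this thread.
    static String extractFromText(String text, boolean keepText, boolean cagRelevantOnly,
                                  Optional<String> documentCreationTime, Optional<String> filename) {
        // Stand-in body: report which values the call ended up with.
        return String.format("text=%s keepText=%b cagRelevantOnly=%b dct=%s filename=%s",
                text, keepText, cagRelevantOnly,
                documentCreationTime.orElse("<none>"), filename.orElse("<none>"));
    }

    // One-argument overload: fills in the defaults explicitly.
    static String extractFromText(String text) {
        return extractFromText(text, true, true, Optional.empty(), Optional.empty());
    }

    // Two-argument overload: text plus a document creation time.
    static String extractFromText(String text, String documentCreationTime) {
        return extractFromText(text, true, true,
                Optional.ofNullable(documentCreationTime), Optional.empty());
    }

    public static void main(String[] args) {
        System.out.println(extractFromText("Rainfall decreased last week."));
        System.out.println(extractFromText("Rainfall decreased last week.", "2018-09-24"));
    }
}
```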

@EgoLaparra
Contributor

I see. In any case, we need the actual creation time of the document to get correct normalizations for expressions like "last week". The parser cannot infer it from the text, so when no DocTime is passed, it uses the current date as the reference.
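To illustrate why the reference date matters, here is a rough sketch that resolves "last week" as the Monday-to-Sunday week before a given anchor date. This is a simplified assumption about what such a normalization means, not the timenorm algorithm, and the class and method names are hypothetical.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.temporal.TemporalAdjusters;

class LastWeekDemo {
    /** Returns {start, end} of the Monday-to-Sunday week preceding the week of the anchor date. */
    static LocalDate[] lastWeek(LocalDate anchor) {
        LocalDate start = anchor.minusWeeks(1)
                .with(TemporalAdjusters.previousOrSame(DayOfWeek.MONDAY));
        return new LocalDate[] {start, start.plusDays(6)};
    }

    public static void main(String[] args) {
        // Anchored on the document creation time, the interval is fixed.
        LocalDate[] fromDct = lastWeek(LocalDate.parse("2018-09-24"));
        System.out.println("DCT anchor:   " + fromDct[0] + " to " + fromDct[1]);

        // Anchored on the current date, the interval drifts with the day the code runs.
        LocalDate[] fromToday = lastWeek(LocalDate.now());
        System.out.println("Today anchor: " + fromToday[0] + " to " + fromToday[1]);
    }
}
```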

@kwalcock
Member

The fourth argument as in filename: Option[String]= None to EidosSystem.annotate? It should be OK. It is only used for the document id which is probably only used for the JSON-LD output.

@bgyori
Contributor Author

bgyori commented Sep 24, 2018

Well if you count from 1, not 0, then the 4th argument is documentCreationTime which we discussed above:

def extractFromText(text: String, keepText: Boolean = true, cagRelevantOnly: Boolean = true,
                      documentCreationTime: Option[String] = None, filename: Option[String] = None)

@kwalcock
Member

I can still count, but maybe it's time for trifocals :-) I didn't realize you were both talking about the same thing.

@bgyori
Contributor Author

bgyori commented Sep 24, 2018

Thanks, looks like this is working!

@bgyori bgyori closed this as completed Sep 24, 2018
@kwalcock
Member

@EgoLaparra, I think you'll want to change from

def extractFromText(text: String, keepText: Boolean = true, cagRelevantOnly: Boolean = true,
                      documentCreationTime: Option[String] = None, filename: Option[String] = None)

to

def extractFromText(text: String, keepText: Boolean = true, cagRelevantOnly: Boolean = true,
                      documentCreationTime: Option[LocalDateTime] = None, filename: Option[String] = None)

Neither Eidos nor EidosDocument is in a good position to decide what kind of string is being passed; both should let whatever reads or produces the string take care of that. In reading these 17k documents I find that the "creation date" comes in multiple formats, and it's not efficient to parse them and convert them to the kind of string that is needed (e.g., eight digits, with dashes, without time) only to have them parsed again.

@EgoLaparra
Contributor

What about letting the parser deal with these strings? Eidos could pass whatever it finds, even if the format is not the correct one, and the temporal parser would decide whether it can create a DCT or should set it as undefined.

@kwalcock
Member

That sounds interesting. Perhaps if it is passed a string, it could convert it to an Option[LocalDateTime] and call the other function. Right now the conversion process is on the fragile side. I haven't been watching your timenorm project to know if you have made the update that includes what you want to be used in this large run. Be sure to let me know. Thanks.
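One way the idea discussed here could look: a tolerant converter that tries a few date formats and yields an empty result, standing in for an undefined DCT, when none match. The format list and all names are assumptions for illustration; neither the Eidos nor the timenorm implementation is shown.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

class LenientDocTime {
    // Date-only formats to try: with dashes (2018-09-24) and eight digits (20180924).
    private static final List<DateTimeFormatter> DATE_ONLY = Arrays.asList(
            DateTimeFormatter.ISO_LOCAL_DATE,
            DateTimeFormatter.BASIC_ISO_DATE);

    /** Converts a raw creation-date string to a date-time, or empty if no format matches. */
    static Optional<LocalDateTime> parseDocTime(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        String s = raw.trim();
        try {
            // Full ISO date-time, e.g. 2018-09-24T10:15:30.
            return Optional.of(LocalDateTime.parse(s));
        } catch (DateTimeParseException ignored) {
            // Fall through to the date-only formats.
        }
        for (DateTimeFormatter format : DATE_ONLY) {
            try {
                // Date-only strings are taken to mean midnight at the start of that day.
                return Optional.of(LocalDate.parse(s, format).atStartOfDay());
            } catch (DateTimeParseException ignored) {
                // Try the next format.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        for (String raw : new String[] {"2018-09-24", "20180924", "2018-09-24T10:15:30", "2018"}) {
            System.out.println(raw + " -> " + parseDocTime(raw));
        }
    }
}
```

A string-accepting entry point could call a converter like this once and hand the resulting optional date-time to the typed method, so the string is parsed in a single place.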

@kwalcock
Member

@EgoLaparra, are we any closer on what needs to be delivered on this large run that needs to work overnight and get sent away? For the metadata files should I expect that there are some without matching text files? I need to double check, but it seemed that there were both texts without metadata and metadata without texts.

@EgoLaparra
Contributor

Yes, we are closer. I have changed the parser and EidosDocument so that the DCT can be handled in any format, even if it is wrong. I still need to run some tests to make sure that everything is working properly.
And yes, the document collection on the FAO site has changed since I retrieved the PDFs, so this kind of thing can happen.

@EgoLaparra
Contributor

@kwalcock, I have created a pull request to kwalcock-timeTime with these changes.

@MihaiSurdeanu
Contributor

Thanks @EgoLaparra and @kwalcock!
This integration is very important. Please prioritize this work.
