-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Spanish grammar for TimeNorm #68
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! I'm sure @bethard will have comments. It looks like this repo is configured for Travis, but I don't think that is still working, so this might need to be tested locally.
standard normalizations | ||
*/ | ||
|
||
def main(lang:String, in_file:String, out_file:String) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's all nice and neat. Someone might be concerned with the Python conventions, though. This line needs an = for Scala 2.13.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, thanks!
the list with all the normalizations*/ | ||
|
||
// Select the parser for the desired grammar depending on the language | ||
val parser = lang match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should change the compiler settings so that there is a complaint about the unexhaustive match.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As before, all improvements are welcome!
|
||
// Process the DCT depending on the presence of a time reference | ||
val pattern = "T".r | ||
val anchor = pattern.findFirstIn(dct_timex) match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe dct_timex.contains('T')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
val anchor = pattern.findFirstIn(dct_timex) match { | ||
// Get the anchor timespan if time is specified | ||
case Some(_) => | ||
val dct = dct_timex.split("T") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would a DateTimeFormatter
make more sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly. Thanks!
var norm_counter = 0 | ||
|
||
// Iterate over timex list and get each timex, gold and norm set | ||
for (i <- 0 to timex_list.length - 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for (i <- timex_list.indices) {
can make life easier here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
|
||
// If this is a timex, write the data and sum a gold value | ||
if (timex != "") { | ||
writer.write(s"${timex}\t${gold}\t${norm}\n") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since it is a PrintWriter, println() would work well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, thanks!
if (gold != "-" && norm == gold) {norm_counter += 1} | ||
} | ||
// If this is a doc separator, write a newline | ||
else {writer.write(s"\n")} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto. writer.println()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK!
gold_list += "" | ||
} | ||
} | ||
return (timex_list.toList, gold_list.toList) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like we're not writing return
. Can they be removed throughout?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I appreciate any improvement for a better performance :)
def get_content(in_file:String) : (List[String], List[String]) = { | ||
/**Obtains the content from the input file as timex and value lists*/ | ||
|
||
// Turn input file to a list of lines | ||
val content = Source.fromFile(in_file).getLines.toList | ||
// Obtain the standard line length from the first DCT | ||
val std_line_length = content(0).split("\t").length | ||
|
||
val timex_list = ListBuffer[String]() | ||
val gold_list = ListBuffer[String]() | ||
|
||
for (line <- content) { | ||
// If line is not a doc separator (indicated by empty string): | ||
if (line != "") { | ||
// If line length equals the standard, get the timex and its gold value | ||
if (line.split("\t").length == std_line_length) { | ||
timex_list += line.split("\t").head | ||
gold_list += line.split("\t").last | ||
} | ||
// If this is a detected timex absent in the evaluation corpus, get the | ||
// timex but append "" as gold normalization | ||
else { | ||
timex_list += line.split("\t").head | ||
gold_list += "-" | ||
} | ||
} | ||
// Otherwise, add empty strings to mark end of document timexes | ||
else { | ||
timex_list += "" | ||
gold_list += "" | ||
} | ||
} | ||
return (timex_list.toList, gold_list.toList) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, it depends on the file, but this method may amount to
def get_content(in_file:String) : (List[String], List[String]) = {
/**Obtains the content from the input file as timex and value lists*/
val source = Source.fromFile(in_file)
val content = source.getLines.toList.map { line =>
val split = line.split("\t")
(split.lift(0).getOrElse(""), split.lift(1).getOrElse(""))
}.unzip
source.close()
content
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much for the comments! Feel free to make any improvement you consider appropriate
standard normalizations | ||
*/ | ||
|
||
def main(lang:String, in_file:String, out_file:String) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, thanks!
gold_list += "" | ||
} | ||
} | ||
return (timex_list.toList, gold_list.toList) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I appreciate any improvement for a better performance :)
def get_content(in_file:String) : (List[String], List[String]) = { | ||
/**Obtains the content from the input file as timex and value lists*/ | ||
|
||
// Turn input file to a list of lines | ||
val content = Source.fromFile(in_file).getLines.toList | ||
// Obtain the standard line length from the first DCT | ||
val std_line_length = content(0).split("\t").length | ||
|
||
val timex_list = ListBuffer[String]() | ||
val gold_list = ListBuffer[String]() | ||
|
||
for (line <- content) { | ||
// If line is not a doc separator (indicated by empty string): | ||
if (line != "") { | ||
// If line length equals the standard, get the timex and its gold value | ||
if (line.split("\t").length == std_line_length) { | ||
timex_list += line.split("\t").head | ||
gold_list += line.split("\t").last | ||
} | ||
// If this is a detected timex absent in the evaluation corpus, get the | ||
// timex but append "" as gold normalization | ||
else { | ||
timex_list += line.split("\t").head | ||
gold_list += "-" | ||
} | ||
} | ||
// Otherwise, add empty strings to mark end of document timexes | ||
else { | ||
timex_list += "" | ||
gold_list += "" | ||
} | ||
} | ||
return (timex_list.toList, gold_list.toList) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
the list with all the normalizations*/ | ||
|
||
// Select the parser for the desired grammar depending on the language | ||
val parser = lang match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As before, all improvements are welcome!
|
||
// Process the DCT depending on the presence of a time reference | ||
val pattern = "T".r | ||
val anchor = pattern.findFirstIn(dct_timex) match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
val anchor = pattern.findFirstIn(dct_timex) match { | ||
// Get the anchor timespan if time is specified | ||
case Some(_) => | ||
val dct = dct_timex.split("T") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly. Thanks!
var norm_counter = 0 | ||
|
||
// Iterate over timex list and get each timex, gold and norm set | ||
for (i <- 0 to timex_list.length - 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
|
||
// If this is a timex, write the data and sum a gold value | ||
if (timex != "") { | ||
writer.write(s"${timex}\t${gold}\t${norm}\n") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, thanks!
if (gold != "-" && norm == gold) {norm_counter += 1} | ||
} | ||
// If this is a doc separator, write a newline | ||
else {writer.write(s"\n")} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK!
I won't have a chance to look at this before the end of June, so @kwalcock has my permission to review, clean up, and merge as deemed appropriate. |
@NGEscribano, do you happen to have a file meeting the description of |
Sure! I just added a "datasets" directory with the train and test sets from TempEval-3 formatted for this normalization task. This test file should work. |
@NGEscribano, your code has been incorporated into #70. Please take a look at it if you can and let me know if you have concerns. I don't know whether you would want it to be merged back into your repo. |
This PR is being closed in favor of #70, which includes all the commits from this one except for changes to README.md. Since this PR originates from a fork, it was easier to do that than to modify the fork so that the modifications would show up here under #68. So, the changes have been incorporated, just not with this PR. |
No description provided.