Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spanish grammar for TimeNorm #68

Closed
wants to merge 12 commits into from
Closed

Conversation

NGEscribano
Copy link

No description provided.

Copy link
Member

@kwalcock kwalcock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! I'm sure @bethard will have comments. It looks like this repo is configured for Travis, but I don't think that is still working, so this might need to be tested locally.

README.md Show resolved Hide resolved
standard normalizations
*/

def main(lang:String, in_file:String, out_file:String) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's all nice and neat. Someone might be concerned with the Python conventions, though. This line needs an = for Scala 2.13.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, thanks!

the list with all the normalizations*/

// Select the parser for the desired grammar depending on the language
val parser = lang match {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should change the compiler settings so that there is a complaint about the unexhaustive match.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As before, all improvements are welcome!


// Process the DCT depending on the presence of a time reference
val pattern = "T".r
val anchor = pattern.findFirstIn(dct_timex) match {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe dct_timex.contains('T')

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

val anchor = pattern.findFirstIn(dct_timex) match {
// Get the anchor timespan if time is specified
case Some(_) =>
val dct = dct_timex.split("T")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would a DateTimeFormatter make more sense?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly. Thanks!

var norm_counter = 0

// Iterate over timex list and get each timex, gold and norm set
for (i <- 0 to timex_list.length - 1) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for (i <- timex_list.indices) { can make life easier here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!


// If this is a timex, write the data and sum a gold value
if (timex != "") {
writer.write(s"${timex}\t${gold}\t${norm}\n")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it is a PrintWriter, println() would work well.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, thanks!

if (gold != "-" && norm == gold) {norm_counter += 1}
}
// If this is a doc separator, write a newline
else {writer.write(s"\n")}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto. writer.println().

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK!

gold_list += ""
}
}
return (timex_list.toList, gold_list.toList)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we're not writing return. Can they be removed throughout?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate any improvement for a better performance :)

Comment on lines +43 to +75
def get_content(in_file:String) : (List[String], List[String]) = {
/**Obtains the content from the input file as timex and value lists*/

// Turn input file to a list of lines
val content = Source.fromFile(in_file).getLines.toList
// Obtain the standard line length from the first DCT
val std_line_length = content(0).split("\t").length

val timex_list = ListBuffer[String]()
val gold_list = ListBuffer[String]()

for (line <- content) {
// If line is not a doc separator (indicated by empty string):
if (line != "") {
// If line length equals the standard, get the timex and its gold value
if (line.split("\t").length == std_line_length) {
timex_list += line.split("\t").head
gold_list += line.split("\t").last
}
// If this is a detected timex absent in the evaluation corpus, get the
// timex but append "" as gold normalization
else {
timex_list += line.split("\t").head
gold_list += "-"
}
}
// Otherwise, add empty strings to mark end of document timexes
else {
timex_list += ""
gold_list += ""
}
}
return (timex_list.toList, gold_list.toList)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, it depends on the file, but this method may amount to

  def get_content(in_file:String) : (List[String], List[String]) = {
    /**Obtains the content from the input file as timex and value lists*/

    val source = Source.fromFile(in_file)
    val content = source.getLines.toList.map { line =>
      val split = line.split("\t")
      (split.lift(0).getOrElse(""), split.lift(1).getOrElse(""))
    }.unzip
    source.close()
    content
  }

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Copy link
Author

@NGEscribano NGEscribano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for the comments! Feel free to make any improvement you consider appropriate

standard normalizations
*/

def main(lang:String, in_file:String, out_file:String) {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, thanks!

gold_list += ""
}
}
return (timex_list.toList, gold_list.toList)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate any improvement for a better performance :)

Comment on lines +43 to +75
def get_content(in_file:String) : (List[String], List[String]) = {
/**Obtains the content from the input file as timex and value lists*/

// Turn input file to a list of lines
val content = Source.fromFile(in_file).getLines.toList
// Obtain the standard line length from the first DCT
val std_line_length = content(0).split("\t").length

val timex_list = ListBuffer[String]()
val gold_list = ListBuffer[String]()

for (line <- content) {
// If line is not a doc separator (indicated by empty string):
if (line != "") {
// If line length equals the standard, get the timex and its gold value
if (line.split("\t").length == std_line_length) {
timex_list += line.split("\t").head
gold_list += line.split("\t").last
}
// If this is a detected timex absent in the evaluation corpus, get the
// timex but append "" as gold normalization
else {
timex_list += line.split("\t").head
gold_list += "-"
}
}
// Otherwise, add empty strings to mark end of document timexes
else {
timex_list += ""
gold_list += ""
}
}
return (timex_list.toList, gold_list.toList)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

the list with all the normalizations*/

// Select the parser for the desired grammar depending on the language
val parser = lang match {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As before, all improvements are welcome!


// Process the DCT depending on the presence of a time reference
val pattern = "T".r
val anchor = pattern.findFirstIn(dct_timex) match {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

val anchor = pattern.findFirstIn(dct_timex) match {
// Get the anchor timespan if time is specified
case Some(_) =>
val dct = dct_timex.split("T")
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly. Thanks!

var norm_counter = 0

// Iterate over timex list and get each timex, gold and norm set
for (i <- 0 to timex_list.length - 1) {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!


// If this is a timex, write the data and sum a gold value
if (timex != "") {
writer.write(s"${timex}\t${gold}\t${norm}\n")
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, thanks!

if (gold != "-" && norm == gold) {norm_counter += 1}
}
// If this is a doc separator, write a newline
else {writer.write(s"\n")}
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK!

@bethard
Copy link
Collaborator

bethard commented May 16, 2023

I won't have a chance to look at this before the end of June, so @kwalcock has my permission to review, clean up, and merge as deemed appropriate.

@kwalcock
Copy link
Member

@NGEscribano, do you happen to have a file meeting the description of "timex [type] gold_value" tab-separated format that can be used for testing?

@NGEscribano
Copy link
Author

Sure! I just added a "datasets" directory with the train and test sets from TempEval-3 formatted for this normalization task. This test file should work.

@kwalcock
Copy link
Member

@NGEscribano, your code has been incorporated into #70. Please take a look at it if you can and let me know if you have concerns. I don't know whether you would want it to be merged back into your repo.

@kwalcock
Copy link
Member

This PR is being closed in favor of #70, which includes all the commits from this one except for changes to README.md. Since this PR originates from a fork, it was easier to do that than to modify the fork so that the modifications would show up here under #68. So, the changes have been incorporated, just not with this PR.

@kwalcock kwalcock closed this May 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants