StringIn takes a very long time with longer strings #33
Comments
I wonder if it's a consequence of the way StringIn is implemented? A lot of
Could you repeat the benchmark moving the initialization (construction of
The lookup array implementation lives in the trie here:
On Fri, Aug 7, 2015 at 12:00 AM, tannerezell notifications@github.com
|
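It looks like the time is spent when parse is called, rather than when the parser is set up:
import java.text.SimpleDateFormat
import java.util.Date

sealed trait Val
case class Var(value : String) extends Val
// Times a block, evaluating the by-name argument exactly once.
def time[R](unit : => R) : R = {
val startTime = new Date()
val result = unit
val endTime = new Date()
val dateDiff = new SimpleDateFormat("mm:ss:SS").format(new Date(endTime.getTime - startTime.getTime))
println (s"took $dateDiff to parse")
result // was `unit`, which evaluated the block a second time
}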
val vars = List("\r\n", "\n", "hl", "SA", "avxavxavxavxavxavx").toSeq
val variables : Parser[Var] = time { P ( StringIn(vars:_*).! ).map(Var) }
val Result.Success(myVars, _) = time { variables.parse("SA") }
println (s"myVars = $myVars")
and the result:
That seems crazy, right? |
Initialization is lazy IIRC; what if you call parse over and over? On Fri, Aug 7, 2015 at 5:14 AM, tannerezell notifications@github.com
|
val vars = List("\r\n", "\n", "hl", "SA", "avxavxavxavxavxavx").toSeq
val variables : Parser[Var] = time { P ( StringIn(vars:_*).! ).map(Var) }
val Result.Success(myVars, _) = time { variables.parse("SA") }
val Result.Success(myVars2, _) = time { variables.parse("SA") }
val Result.Success(myVars3, _) = time { variables.parse("SA") }
val Result.Success(myVars4, _) = time { variables.parse("SA") }
println (s"myVars = $myVars")
Seems like it is slow only the first time; after that it's pretty fast. |
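The "slow only on the first call" pattern is what lazy initialization looks like from the outside. A minimal sketch in plain Scala, assuming a hypothetical `LazyMatcher` class (it is not fastparse's StringIn, and `builds` exists only to make the effect observable):

```scala
object LazyInitDemo {
  // Hypothetical stand-in for a matcher whose lookup structure is expensive to build.
  final class LazyMatcher(words: Seq[String]) {
    var builds = 0 // counts how many times the structure was built

    // Built on first access only, then cached for every later call.
    private lazy val table: Set[String] = {
      builds += 1
      words.toSet
    }

    def contains(s: String): Boolean = table.contains(s)
  }

  def main(args: Array[String]): Unit = {
    val m = new LazyMatcher(Seq("hl", "SA"))
    println(s"builds after construction: ${m.builds}")
    m.contains("SA")
    m.contains("hl")
    println(s"builds after two lookups: ${m.builds}")
  }
}
```

When benchmarking something like this, timing the first call separately from the later ones shows whether the cost is in setup or in the lookups themselves.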
I tried my hand at writing an alternate 'StringIn' implementation:
case class ContainsIn(str : Seq[String]) extends Parser[Unit] {
val sortedList = str.sortBy(_.length).reverse
override def parseRec(cfg: ParseCtx, index: Int): Mutable[Unit] = {
// Candidates are sorted longest first, so the first hit is the longest match.
sortedList.find(p => cfg.input.startsWith(p, index)) match {
case Some(p) => success(cfg.success, (), index + p.length, Nil, cut = false)
case None => fail(cfg.failure, index)
}
}
override def toString = {
s"ContainsIn(${sortedList.map(literalize(_)).mkString(", ")})"
}
}
For the most part it works and is very fast. For grins and giggles I tried swapping it in place of StringIn and ran the tests; it was very unhappy, to say the least. I ran it against the examples in the documentation and against a couple of variations, and it seems to work as I would expect. Maybe you can provide some insight into why this implementation breaks the tests. |
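The idea behind ContainsIn can be tried out without any fastparse internals. The following is a minimal sketch, assuming a hypothetical helper `matchLength` that is not part of the library: it checks the candidates longest first, so that "abc" wins over "ab" when both match at the same position.

```scala
object LongestFirst {
  /** Length of the longest candidate that matches `input` at `index`, if any. */
  def matchLength(candidates: Seq[String], input: String, index: Int): Option[Int] = {
    // Sort by descending length; a real parser would do this once, at construction.
    val longestFirst = candidates.sortBy(-_.length)
    longestFirst.find(c => input.startsWith(c, index)).map(_.length)
  }

  def main(args: Array[String]): Unit = {
    println(matchLength(Seq("a", "ab", "abc"), "xabcd", 1))
    println(matchLength(Seq("a", "ab"), "xyz", 0))
  }
}
```

Each call scans the whole candidate list in the worst case, so this trades lookup speed on large candidate sets for having almost no setup cost.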
I don't know why it wouldn't work off the top of my head; you'll have to go and minimize the problem |
I have been experimenting with a mixed approach, on pull request #54. However this is not ideal, as detailed in this review comment. The real challenge is to have a solution which is efficient for long-running parsers and not too costly on initialisation for one-shot parsers. |
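One way to picture that trade-off is the sketch below. It is purely illustrative and is not the code from pull request #54: the class `Matcher`, its `threshold` parameter and the `indexBuilt` flag are all hypothetical. It scans the candidates linearly for the first few calls and only builds an index once the matcher has been used often enough to justify the setup cost.

```scala
object AdaptiveMatcher {
  /** Hypothetical matcher: linear scan at first, indexed lookup after `threshold` calls. */
  final class Matcher(words: Seq[String], threshold: Int) {
    // Longest first, so the first hit is the longest match (empty strings are dropped).
    private val longestFirst = words.filter(_.nonEmpty).sortBy(-_.length)
    private var calls = 0
    var indexBuilt = false // exposed only so the behaviour can be observed

    // Candidates grouped by first character; built on first use, order within groups kept.
    private lazy val byFirstChar: Map[Char, Seq[String]] = {
      indexBuilt = true
      longestFirst.groupBy(_.charAt(0))
    }

    def matchLength(input: String, index: Int): Option[Int] = {
      calls += 1
      val candidates =
        if (calls <= threshold) longestFirst
        else if (index >= 0 && index < input.length)
          byFirstChar.getOrElse(input.charAt(index), Nil)
        else Nil
      candidates.find(c => input.startsWith(c, index)).map(_.length)
    }
  }
}
```

A one-shot parser never pays for the index, while a long-running one pays for it once; the right threshold would have to be measured rather than guessed.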
I don't understand the problem in detail and I don't understand what your solution is trying to do. For this sort of non-completely-trivial patch, you're going to need to explain what's going on =P |
90 seconds to compile StringIn with a 17-character string is way too long.
println("string length,time to compile StringIn (ms)")
for (subset <- "aaaaaaaaaaaaaaaaa".tails.toSeq.reverse) {
val start = System.currentTimeMillis()
StringIn(subset)
val end = System.currentTimeMillis()
println(s"${subset.size},${end - start}")
}
|
Yes it is. It's totally screwed and someone needs to fix it. Anyone want to chip in with an implementation of a Double Array Trie? I've never used one but it seems compact, fast to initialize and fast to query |
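For comparison, here is a minimal sketch of an ordinary hash-map-based trie. It is not a Double Array Trie and not fastparse's TrieNode; the names `SimpleTrie`, `Node`, `build` and `longestMatch` are made up for illustration. Construction touches each character of each word once, so a 17-character string should not be a problem for this kind of structure.

```scala
import scala.collection.mutable

object SimpleTrie {
  final class Node {
    val children = mutable.HashMap.empty[Char, Node]
    var terminal = false // true if a word ends at this node
  }

  /** Builds the trie by inserting every word one character at a time. */
  def build(words: Seq[String]): Node = {
    val root = new Node
    for (w <- words) {
      var node = root
      for (c <- w) node = node.children.getOrElseUpdate(c, new Node)
      node.terminal = true
    }
    root
  }

  /** Length of the longest word matching `input` at `index` (assumed >= 0), or -1. */
  def longestMatch(root: Node, input: String, index: Int): Int = {
    var node = root
    var i = index
    var best = if (root.terminal) 0 else -1
    var going = true
    while (going && i < input.length) {
      node.children.get(input.charAt(i)) match {
        case Some(next) =>
          node = next
          i += 1
          if (node.terminal) best = i - index
        case None =>
          going = false
      }
    }
    best
  }
}
```

A lookup walks at most as many nodes as the longest word has characters, regardless of how many words are stored.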
I'm not so sure you need a fresh implementation. How about the one from Apache Commons? |
That's plausible. We'd need to...
Nothing fundamentally difficult about this, someone just needs to do it. And I'm spending all day/week/month/year refactoring legacy python code so that person probably won't be me ^_^ |
This seems to make creation of the TrieNode much faster. It might fix com-lihaoyi#33 https://issues.scala-lang.org/browse/SI-4776
Tested on 0.2.1:
results in the following output:
This only happens on long strings; shorter ones are processed almost instantly. I've tried various combinations of the long string, and it always results in an exceedingly long parsing time.