Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

StringIn takes a very long time with longer strings #33

Closed
tannerezell opened this issue Aug 6, 2015 · 12 comments · Fixed by #70
Closed

StringIn takes a very long time with longer strings #33

tannerezell opened this issue Aug 6, 2015 · 12 comments · Fixed by #70

Comments

@tannerezell
Copy link

Tested on 0.2.1:

sealed trait Val
case class Var(value : String) extends Val
val startTime = new Date()
val vars = List("\r\n", "\n", "hl", "SA", "avxavxavxavxavxavx").toSeq 
val variables : Parser[Var] = P ( StringIn(vars:_*).! ).map(Var)
val Result.Success(myVars, _) = variables.parse("SA")
val endTime = new Date()
val dateDiff = new SimpleDateFormat("mm:ss").format(new Date(endTime.getTime - startTime.getTime))
println (s"myVars = $myVars")
println (s"took $dateDiff to parse")

results in the following output:

myVars = Var(SA)
took 03:11 to parse

This only happens on long strings, shorter ones process almost instantly. I've tried various combinations of the long string and it always results in an exceedingly long parsing time.

@lihaoyi
Copy link
Member

lihaoyi commented Aug 6, 2015

I wonder if it's a consequence of the way StringIn is implemented? A lot of
our bitsets are implemented as Char predicates, including those used in
StringIn, and those initialize a 65000-pointer long lookup array the first
time they're initialized. This may be causing the slowdown, but I can't
imagine it taking 3 minutes...

Could you repeat the benchmark moving the initialization (construction of
the parser object) outside of the timing block? I thought the
initialization would be relatively fast, but maybe it's not, and we should
reconsider this approach.

The lookup array implementation lives in the trie here:

https://github.com/lihaoyi/fastparse/blob/master/utils/shared/src/main/scala/fastparse/Utils.scala#L138-L183

On Fri, Aug 7, 2015 at 12:00 AM, tannerezell notifications@github.com
wrote:

Tested on 0.2.1:

sealed trait Valcase class Var(value : String) extends Valval startTime = new Date()val vars = List("\r\n", "\n", "hl", "SA", "avxavxavxavxavxavx").toSeq val variables : Parser[Var] = P ( StringIn(vars:_*).! ).map(Var)val Result.Success(myVars, _) = variables.parse("SA")val endTime = new Date()val dateDiff = new SimpleDateFormat("mm:ss").format(new Date(endTime.getTime - startTime.getTime))
println (s"myVars = $myVars")
println (s"took $dateDiff to parse")

results in the following output:

myVars = Var(SA)
took 03:11 to parse

This only happens on long strings, shorter ones process almost instantly.
I've tried various combinations of the long string and it always results in
an exceedingly long parsing time.


Reply to this email directly or view it on GitHub
#33.

@tannerezell
Copy link
Author

It looks like when parse is called, rather than when the parser is setup:

sealed trait Val
case class Var(value : String) extends Val

def time[R](unit : => R) : R = {
    val startTime = new Date()
    val result = unit
    val endTime = new Date()
    val dateDiff = new SimpleDateFormat("mm:ss:SS").format(new Date(endTime.getTime - startTime.getTime))
    println (s"took $dateDiff to parse")
    unit
  }


  val vars = List("\r\n", "\n", "hl", "SA", "avxavxavxavxavxavx").toSeq
  val variables : Parser[Var] = time { P ( StringIn(vars:_*).! ).map(Var) }
  val Result.Success(myVars, _) = time { variables.parse("SA") }
  println (s"myVars = $myVars")

and the result:

took 00:00:01 to parse
took 03:27:331 to parse
myVars = Var(SA)

That seems crazy right?

@lihaoyi
Copy link
Member

lihaoyi commented Aug 6, 2015

Initialization is lazy IIRC; what if you call parse over and over?

On Fri, Aug 7, 2015 at 5:14 AM, tannerezell notifications@github.com
wrote:

It looks like when parse is called, rather than when the parser is setup:

sealed trait Valcase class Var(value : String) extends Val
def time[R](unit : => R) : R = {
val startTime = new Date()
val result = unit
val endTime = new Date()
val dateDiff = new SimpleDateFormat("mm:ss:SS").format(new Date(endTime.getTime - startTime.getTime))
println (s"took $dateDiff to parse")
unit
}

val vars = List("\r\n", "\n", "hl", "SA", "avxavxavxavxavxavx").toSeq
val variables : Parser[Var] = time { P ( StringIn(vars:_*).! ).map(Var) }
val Result.Success(myVars, _) = time { variables.parse("SA") }
println (s"myVars = $myVars")

and the result:

took 00:00:01 to parse
took 03:27:331 to parse
myVars = Var(SA)

That seems crazy right?


Reply to this email directly or view it on GitHub
#33 (comment).

@tannerezell
Copy link
Author

  val vars = List("\r\n", "\n", "hl", "SA", "avxavxavxavxavxavx").toSeq
  val variables : Parser[Var] = time { P ( StringIn(vars:_*).! ).map(Var) }
  val Result.Success(myVars, _) = time { variables.parse("SA") }
  val Result.Success(myVars2, _) = time { variables.parse("SA") }
  val Result.Success(myVars3, _) = time { variables.parse("SA") }
  val Result.Success(myVars4, _) = time { variables.parse("SA") }
  println (s"myVars = $myVars")
took 00:00:00 to parse
took 03:08:530 to parse
took 00:00:00 to parse
took 00:00:00 to parse
took 00:00:00 to parse
myVars = Var(SA)

Seems like only the first time, then its pretty fast

@tannerezell
Copy link
Author

I tried my hand at writing an alternate 'StringIn' implementation:

case class ContainsIn(str : Seq[String]) extends Parser[Unit] {
    val sortedList = str.sortBy(_.length).reverse
    override def parseRec(cfg: ParseCtx, index: Int): Mutable[Unit] = {
      def length : Int = { sortedList foreach { p => if (cfg.input.startsWith(p, index)) return p.length }; -1 }
      if (length != -1) success(cfg.success, (), index + length + 0, Nil, cut = false)
      else fail(cfg.failure, index)
    }
    override def toString = {
      s"ContainsIn(${sortedList.map(literalize(_)).mkString(", ")})"
    }
  }

For the most part it works and is very very fast. For grins and giggles I tried to swap it in place of StringIn and ran the tests, it was very unhappy to say the least. I ran it against the examples in the documentation, ran it against a couple variations and it seems to work like I would expect. Maybe you can provide some insight as to why this implementation breaks on the tests.

@lihaoyi
Copy link
Member

lihaoyi commented Aug 12, 2015

I don't know why it wouldn't work off the top of my head; you'll have to go and minimize the problem

@Alain-Bearez
Copy link
Contributor

I have been experimenting with a mixed approach, on pull request #54. However this is not ideal, as detailed in this review comment.

The real challenge is to have a solution which is efficient for long running parsers and not to costly on initialisation for one-shot parsers.

@lihaoyi
Copy link
Member

lihaoyi commented Nov 11, 2015

I don't understand the problem in detail and I don't understand what your solution is trying to do. For this sort of non-completely-trivial patch, you're going to need to explain what's going on =P

@shawjef3
Copy link

90 seconds to compile StringIn with a 17 character string is way too long.

println("string length,time to compile StringIn (ms)")

for (subset <- "aaaaaaaaaaaaaaaaa".tails.toSeq.reverse) {

  val start = System.currentTimeMillis()

  StringIn(subset)

  val end = System.currentTimeMillis()

  println(s"${subset.size},${end - start}")
}
string length,time to compile StringIn (ms)
0,16
1,5
2,1
3,2
4,4
5,10
6,16
7,17
8,23
9,52
10,79
11,155
12,438
13,1274
14,3264
15,10069
16,29726
17,90797

@lihaoyi
Copy link
Member

lihaoyi commented Dec 16, 2015

Yes it is. It's totally screwed and someone needs to fix it.

Anyone want to chip in with an implementation of a Double Array Trie? I've never used one but it seems compact, fast to initialize and fast to query

@shawjef3
Copy link

I'm not so sure you need a fresh implementation. How about the one from Apache commons?

@lihaoyi
Copy link
Member

lihaoyi commented Dec 16, 2015

That's plausible. We'd need to...

  • Port it to Scala, or find an alternate implementation in Javascript so we can run it on Scala.js.
  • Make sure it's semi-rigorously unit tested itself, and the Fastparse/Scalaparse/Pythonparse tests all pass
  • Run some perf comparisons on both initialization (which we know is slow for status quo) and retrieval/matching (which I think the status quo does pretty well?) to see how it compares.

Nothing fundamentally difficult about this, someone just needs to do it. And I'm spending all day/week/month/year refactoring legacy python code so that person probably won't be me ^_^

rklaehn added a commit to rklaehn/fastparse that referenced this issue Dec 16, 2015
This seems to make creation of the TrieNode much faster. It might fix com-lihaoyi#33

https://issues.scala-lang.org/browse/SI-4776
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants