tokeniser-gramcheck-gt-desc.pmhfst is 211M #52

unhammer · 2022-03-10T14:51:29Z

In April 2020 it was 106M
In September 2018 it was 39M

Where did we go wrong?

snomos · 2022-03-10T20:48:18Z

@flammie did look into memory consumption for pmhfst files a while ago. Maybe he has some ideas.

flammie · 2022-03-11T05:50:30Z

Mm, there are some things that legitimately multiplied the automaton size, e.g. upcase in 5e0bdaf. There aren't too many other commits in the history, but many of them fill up the alphabet, and alphabet size can easily be a multiplier in tokeniser size. I was hoping list arcs would fix it a bit, but they weren't too effective. I think there might be a way to automate this with git bisect, especially if keeping the analyser_relabelled-blah size constant might reveal something more...

snomos · 2023-11-16T17:53:51Z

Is this issue something we want to keep open? @flammie's use of list arcs didn't help much, and my understanding is that the only thing left to do is a rewrite of parts of the hfst-pmatch code. In Karttunen's paper on pmatch, one of the features of the Xerox implementation is that it saves (disk and memory) space by storing reused FST constructs as references instead of copying them in at every instance. The same goes for some built-in text manipulation functions, such as uppercasing.

To me this indicates that although the Hfst implementation is true to the original in linguistic features, it is not true to it in the implementation details that affect memory consumption. And I believe this is a rather big omission on the Hfst part.

At the same time, it is a major effort to rewrite the code, so I suggest that for the time being we just accept the situation as it is and close this issue.

Any thoughts?

unhammer · 2023-11-18T19:05:16Z

Well, it would be interesting to try to bisect and find out which commits were responsible for the jumps in size: are they all necessary, or could there be some low-hanging fruit? OTOH if there aren't currently plans to run it locally on phones or combine it with other FSTs, then it's probably not a problem in practice, just an annoyance, so closing makes sense.
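
For anyone who wants to try the bisect idea, here is a minimal sketch of a size check that could serve as the test step for `git bisect run`. It is not part of the repository: the script name `size_check.py`, the function `classify`, and the choice of threshold are all assumptions made for illustration. It relies only on the exit-code convention of `git bisect run` (0 means good, 125 means skip this commit, any other code from 1 to 127 means bad).

```python
import os
import sys

# Exit codes understood by `git bisect run`:
# 0 = good, 125 = skip this commit, other codes in 1..127 = bad.
GOOD, BAD, SKIP = 0, 1, 125


def classify(path, threshold_bytes):
    """Return the bisect exit code for the file at `path` (hypothetical helper)."""
    if not os.path.isfile(path):
        # Nothing was built at this commit, so it cannot be judged.
        return SKIP
    return GOOD if os.path.getsize(path) <= threshold_bytes else BAD


# Hypothetical usage: python3 size_check.py <path-to-pmhfst> <threshold-in-MB>
if __name__ == "__main__" and len(sys.argv) == 3:
    threshold = int(float(sys.argv[2]) * 1024 * 1024)
    sys.exit(classify(sys.argv[1], threshold))
```

The build itself is left out on purpose, since the right build command depends on the checkout. The check would run after whatever step produces tokeniser-gramcheck-gt-desc.pmhfst at each commit, with a threshold somewhere between two of the sizes reported above (for example 150 for the jump from 106M to 211M). Repeating the bisect with a lower threshold would then isolate the earlier jump from 39M.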

snomos added the gramcheck label (Issues restricted to the grammar checker) · 2022-03-16
