Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tokeniser-gramcheck-gt-desc.pmhfst is 211M #52

Open
unhammer opened this issue Mar 10, 2022 · 4 comments
Open

tokeniser-gramcheck-gt-desc.pmhfst is 211M #52

unhammer opened this issue Mar 10, 2022 · 4 comments
Labels
gramcheck Issues restricted to the grammar checker

Comments

@unhammer
Copy link
Contributor

  • In April 2020 it was 106M
  • In September 2018 it was 39M

Where did we go wrong?

@snomos
Copy link
Member

snomos commented Mar 10, 2022

@flammie did look into memory consumption for pmhfst files a while ago. Maybe he has some ideas.

@flammie
Copy link
Contributor

flammie commented Mar 11, 2022

Mm, there are some things that legit multiplied the automaton size, eg. upcase in 5e0bdaf. There aren't too many other commits in the history, but many are filling up alphabet and alphabet size can easily be a multiplier in tokeniser size, I was hoping list arcs fix it a bit but it wasn't too effective. I think there might be a way to automate this with git bisect especially if keeping the analyser_relabelled-blah size constant might reveal something more...

@snomos snomos added the gramcheck Issues restricted to the grammar checker label Mar 16, 2022
@snomos
Copy link
Member

snomos commented Nov 16, 2023

Is this issue something we want to keep open? @flammie 's use of list arcs didn't help much, and my understanding is that the only thing left to do is a rewrite of parts of the hfst-pmatch code: in Karttunen's paper on pmatch, one of the features of the Xerox implementation is that it should save (disc and memory) space by storing reused FST constructs as references instead of copying in them in every instance. And the same goes for some built-in text manipulation functions, such as uppercasing.

To me this indicates that although the Hfst implementation is true to the original in linguistic features, it is not when it comes to implementation stuff that impacts memory consumption. And I believe this is a rather big omission on the Hfst part.

At the same time it is a major effort to rewrite the code, so I suggest that we for the time being just accepts the situation as it is, and close this issue.

Any thoughts?

@unhammer
Copy link
Contributor Author

unhammer commented Nov 18, 2023

Well, it would be interesting to try to bisect and find out what commits were responsible for the jumps in size – are they all necessary, or could there be some low-hanging fruit? OTOH if there aren't currently plans to run it locally on phones or combine with other fst's then it's probably not a problem in practice, just an annoyance, so closing makes sense.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
gramcheck Issues restricted to the grammar checker
Projects
None yet
Development

No branches or pull requests

3 participants