Use 70% less memory for TokenizedLines by using a different representation #6757

Merged
merged 55 commits into master from ns-less-memory-for-tokens on May 20, 2015

Conversation

9 participants
@nathansobo
Contributor

nathansobo commented May 12, 2015

Previously, we naively stored every token on every line as its own object, and made things even worse by storing an array of scopes on every token. This caused TokenizedLines to consume much more memory than is strictly necessary.

This PR switches TokenizedLine to a much more compact (but also more cryptic) representation. Instead of repeating the scopes on each token, we represent the entire tokenized line as a flat sequence that intermingles scope start and end tags with text, much like HTML.

Consider the following JavaScript fragment:

function () {}

Previously, we would store this as an array of tokens, each with an array of its containing scopes:

[
  {value: 'function', scopes: ['source.js', 'meta.function.js', 'storage.type.function.js']},
  {value: ' ', scopes: ['source.js', 'meta.function.js']},
  {value: '(', scopes: ['source.js', 'meta.function.js', 'punctuation.definition.parameters.begin.js']},
  {value: ')', scopes: ['source.js', 'meta.function.js', 'punctuation.definition.parameters.end.js']},
  {value: ' ', scopes: ['source.js', 'meta.function.js']},
  {value: '{', scopes: ['source.js', 'punctuation.section.scope.begin.js']},
  {value: '}', scopes: ['source.js', 'punctuation.section.scope.end.js']}
]

Now this is stored more like HTML, with strings intermingled with scope start / end tags.

// Note: This JSON is just for example purposes. Read below for the actual representation.

[
  {start: 'source.js'},
    {start: 'meta.function.js'},
      {start: 'storage.type.function.js'},
        "function",
      {end: 'storage.type.function.js'},
      " ",
      {start: 'punctuation.definition.parameters.begin.js'},
        "(",
      {end: 'punctuation.definition.parameters.begin.js'},
      {start: 'punctuation.definition.parameters.end.js'},
        ")",
      {end: 'punctuation.definition.parameters.end.js'},
    {end: 'meta.function.js'},
    " ",
    {start: 'punctuation.section.scope.begin.js'},
      "{",
    {end: 'punctuation.section.scope.begin.js'},
    {start: 'punctuation.section.scope.end.js'},
      "}",
    {end: 'punctuation.section.scope.end.js'},
  // note 'source.js' isn't closed yet
]

This may look more verbose, but it eliminates the need to repeat the same scopes on successive tokens in the same branch of the syntax tree. In actuality, the tags are encoded as a simple array of integers. Negative integers represent scope start and end tags: odd negative integers are start tags, and each even negative integer is the end tag for the odd integer preceding it. Positive integers represent runs of text and store the length of each run. The negative integers can be translated back to scope strings by consulting a lookup table on the grammar registry, and we retain the line's original text so the two pieces of data can be combined as needed.

So in reality, the stored representation looks like this:

{
  text: "function () {}",
  tags: [-1, -3, -5, 8, -6, 1, -7, 1, -8, -9, 1, -10, -4, 1, -11, 1, -12, -13, 1, -14]
}
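To make the encoding concrete, here is a sketch of a decoder that turns a {text, tags} pair back into the old-style token objects. This is an illustration only, not Atom's actual code; the function and parameter names are assumptions. It also shows where an "invalid scope end id" error would naturally be raised if the tags are unbalanced.

```javascript
// Illustrative sketch (not Atom's implementation): decode a tags array
// back into {value, scopes} tokens, given a table mapping odd negative
// ids to scope names.
function tokensFromTags(text, tags, scopeNamesById) {
  const tokens = [];
  const openScopeIds = [];
  let offset = 0;
  for (const tag of tags) {
    if (tag >= 0) {
      // Positive integers are the lengths of text runs.
      tokens.push({
        value: text.slice(offset, offset + tag),
        scopes: openScopeIds.map(id => scopeNamesById[id])
      });
      offset += tag;
    } else if (tag % 2 === -1) {
      // Odd negative integers open a scope.
      openScopeIds.push(tag);
    } else {
      // Even negative integers close the scope opened by (tag + 1).
      const popped = openScopeIds.pop();
      if (popped !== tag + 1) {
        throw new Error(
          `Encountered an invalid scope end id. Popped ${popped}, expected to pop ${tag + 1}.`);
      }
    }
  }
  return tokens;
}
```

Running this over the example above reproduces the seven tokens from the first listing, with 'source.js' left open at the end (a line's unclosed scopes carry over to the next line).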

Measurements

Here are some heap snapshots taken with benchmark/fixtures/huge.js in the Atom repository. This file is just a copy of jQuery 1.7.1, but has some good examples of lines with extremely complex scope structures.

First, on master:

[Heap snapshot screenshot: 2015-05-14 19:07:26]

TokenizedLine instances consume over 27MB. The largest instance is expanded in the screenshot, where you can see the vast majority of the memory for that line being consumed by the tokens array (~25KB). Repeat that for thousands of lines and you can see the problem.

Now, the same file on this branch:

[Heap snapshot screenshot: 2015-05-14 19:09:57]

Still more memory than I'd like to consume, but much less, around 8.4MB. You can see that the tags array takes up much less space than the tokens array, but still more than it seems like it should. I can save more with typed arrays, but only about 500KB in my experiment.

JS objects still seem to impose a frustrating amount of overhead. I'm starting to wonder if we want to drop some of our low-level storage structures down to C++, but I'd rather improve the structure of this layer before contemplating that. The TokenizedLine instances are also now much simpler objects, meaning they terminate in primitives much more quickly. I wonder if this helps the GC, in that the heap is simpler and there are fewer objects to trace. This is just speculation, however.

Steps

  • Get core specs passing with new representation
  • Eliminate usages of the tokens array shim in core, which creates a lot of garbage
  • Post some heap snapshot comparisons
  • Publish a new major revision of first-mate and link it in package.json
  • Ensure all package specs pass
  • Switch hot package paths off the tokens shim
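The "tokens array shim" in the checklist above can be pictured as a lazy property that rebuilds old-style token objects on demand. The sketch below is hypothetical (the names and structure are assumptions, not Atom's actual code); it just illustrates why accessing the shim on hot paths "creates a lot of garbage".

```javascript
// Hypothetical sketch: a backward-compat `tokens` getter on a line that
// now stores only {text, tags}. Every access rebuilds the token objects
// from scratch, so hot paths that read `line.tokens` repeatedly allocate
// a stream of short-lived garbage for the GC.
function defineTokensShim(line, buildTokens) {
  Object.defineProperty(line, 'tokens', {
    get() {
      // Rebuilt on each access; nothing is cached.
      return buildTokens(line.text, line.tags);
    }
  });
}
```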
@benogle

Contributor

benogle commented May 12, 2015

Are the scope name strings in a symbol table or something, or are the strings repeated?

@nathansobo

Contributor

nathansobo commented May 12, 2015

@benogle They're stored in a symbol table on the GrammarRegistry. In the tags arrays they are negative integers: odd for start scopes, even for end scopes. I could have used one universal symbol for the end scope, but I wanted to preserve a little information for debuggability purposes. We're not going to create so many scopes that using two ids per scope is a problem. Only the odd ids are actually stored in the symbol table.
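The two-ids-per-scope scheme described here can be sketched like this. This is a hypothetical illustration; the class and method names are assumptions, not the actual GrammarRegistry API.

```javascript
// Sketch of the symbol-table scheme (names assumed): each scope name
// gets an odd negative start id (-1, -3, -5, ...); its end id is
// startId - 1. Only the odd start ids are stored in the table.
class ScopeIdTable {
  constructor() {
    this.idsByName = new Map();
    this.namesById = new Map();
    this.nextStartId = -1;  // odd negative start ids: -1, -3, -5, ...
  }

  startIdFor(scopeName) {
    if (!this.idsByName.has(scopeName)) {
      const id = this.nextStartId;
      this.nextStartId -= 2;
      this.idsByName.set(scopeName, id);
      this.namesById.set(id, scopeName);  // only odd ids are stored
    }
    return this.idsByName.get(scopeName);
  }

  endIdFor(scopeName) {
    return this.startIdFor(scopeName) - 1;  // the matching even end id
  }

  nameFor(id) {
    // Even end ids resolve through their odd start id (id + 1),
    // which preserves a little debuggability per the comment above.
    return this.namesById.get(id % 2 === 0 ? id + 1 : id);
  }
}
```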

@maxbrunsfeld maxbrunsfeld referenced this pull request May 12, 2015

Closed

Find remaining performance bottlenecks for large files #6692

17 of 20 tasks complete

nathansobo added some commits May 6, 2015

Generate line HTML based on tags instead of tokens
This avoids creating a bunch of tokens as temporary objects since they
are no longer stored.

@nathansobo nathansobo force-pushed the ns-less-memory-for-tokens branch from efd70b8 to dc47369 May 13, 2015

nathansobo added a commit that referenced this pull request May 20, 2015

Merge pull request #6757 from atom/ns-less-memory-for-tokens
Use 70% less memory for TokenizedLines by using a different representation

@nathansobo nathansobo merged commit 0cd1f11 into master May 20, 2015

4 of 6 checks passed

  • continuous-integration/travis-ci/pr: The Travis CI build is in progress
  • continuous-integration/travis-ci/push: The Travis CI build is in progress
  • atom: Build #2064827 succeeded in 896s
  • atom-linux: Build #2064828 succeeded in 310s
  • atom-rpm: Build #2064829 succeeded in 437s
  • atom-win: Build #2064830 succeeded in 592s

@nathansobo nathansobo deleted the ns-less-memory-for-tokens branch May 20, 2015

@maxbrunsfeld

Contributor

maxbrunsfeld commented May 20, 2015

🎆

@winnieXY


winnieXY commented May 21, 2015

@nathansobo: I'm sadly still able to reproduce the error with LaTeX files. Every time I open a .tex file, I get:

Uncaught Error: Encountered an invalid scope end id. Popped -13, expected to pop -71

I'm currently running a build from git (commit 75d02cd). Any ideas what is going wrong here?

@as-cii

Member

as-cii commented May 21, 2015

I'm sadly still able to reproduce the error with latex files. I'm getting an "Uncaught Error: Encountered an invalid scope end id. Popped -13, expected to pop -71" every time I open a .tex file.

Thanks for reporting this issue, @winnieXY: your help is much appreciated! 🙇

Are you able to reproduce this consistently? If so, could you attach the steps here, please? Any clue to make this behavior repeatable would be really, really helpful!

Thanks!

@winnieXY


winnieXY commented May 21, 2015

Yes, I'm able to reproduce this consistently in the .tex files of my PhD thesis ;(
So far, though, I'm failing to write a minimal working example.

@winnieXY


winnieXY commented May 21, 2015

/usr/share/atom/resources/app.asar/src/tokenized-buffer.js:573

Error: Encountered an invalid scope end id. Popped -13, expected to pop -65.
at TokenizedBuffer.module.exports.TokenizedBuffer.scopesFromTags (/usr/share/atom/resources/app.asar/src/tokenized-buffer.js:573:21)
at TokenizedBuffer.module.exports.TokenizedBuffer.openScopesForRow (/usr/share/atom/resources/app.asar/src/tokenized-buffer.js:555:21)
at TokenizedBuffer.module.exports.TokenizedBuffer.tokenizeNextChunk (/usr/share/atom/resources/app.asar/src/tokenized-buffer.js:286:105)
at /usr/share/atom/resources/app.asar/src/tokenized-buffer.js:263:26
at /usr/share/atom/resources/app.asar/node_modules/underscore-plus/node_modules/underscore/underscore.js:666:47

@winnieXY


winnieXY commented May 21, 2015

I found another very annoying bug. Sometimes the cursor goes mad and text is inserted somewhere other than where the cursor appears. This is easily reproducible with soft wrap enabled and lines wrapped at 80 characters. If I type a sentence and add a \textsc{ at the end, which is then automatically wrapped onto the next line, the cursor is not in the position it should be.

@winnieXY


winnieXY commented May 21, 2015

This is easily reproducible with a very small .tex file. Do you have an email address to which I can send the minimal working example?

@mnquintana

Member

mnquintana commented May 21, 2015

@winnieXY You can always paste the example into a Gist

@nathansobo

Contributor

nathansobo commented May 21, 2015

There's something I missed in the changes to first-mate. Please do post an example or point me at an example repo. @atom/feedback, if anyone can mess around with LaTeX files and get a reproduction case, that would be great. @kevinsawicki, the release is going to be blocked until we can iron this out.

@mnquintana

Member

mnquintana commented May 21, 2015

@winnieXY What LaTeX package(s) do you have installed?

@winnieXY


winnieXY commented May 21, 2015

autocomplete-bibtex, latex, and language-latex; nothing else. There are two errors: first, the strange behaviour of the cursor; second, the error message popping up. The second one is not reproducible with a minimal working example, so I thought about sending one chapter of my PhD thesis to the person working on the fix. However, I do not want to make this public (as it is still very alpha and contains results which are not yet published).

@winnieXY


winnieXY commented May 21, 2015

The first error was already reported here: #6885. Maybe it is not related to this change; I don't know.

nathansobo added a commit that referenced this pull request May 21, 2015

Revert "Merge pull request #6757 from atom/ns-less-memory-for-tokens"
This reverts commit 0cd1f11, reversing
changes made to d75d202.

Conflicts:
	package.json

@nathansobo nathansobo restored the ns-less-memory-for-tokens branch May 21, 2015

@nathansobo

Contributor

nathansobo commented May 21, 2015

Okay, I've reverted the merge in 7cb0bc3. It seems there were at least a couple of issues I didn't detect; thanks for the rapid feedback on these. @winnieXY, can you email me the example that reproduces the issue at nathan@github.com? Thanks again, everyone. Sorry for the history pollution, but with this change it was tough to predict every corner case.

@kevinsawicki Just a heads-up that master is no longer blocked on this.

nathansobo added a commit that referenced this pull request May 21, 2015

nathansobo added a commit that referenced this pull request May 21, 2015

basarat added a commit to TypeStrong/atom-typescript that referenced this pull request May 28, 2015

@@ -580,7 +572,7 @@ describe "TokenizedBuffer", ->

  describe "when the selector matches a run of multiple tokens at the position", ->
    it "returns the range covered by all contiguous tokens (within a single line)", ->
-      expect(tokenizedBuffer.bufferRangeForScopeAtPosition('.function', [1, 18])).toEqual [[1, 6], [1, 28]]
+      expect(tokenizedBuffer.bufferRangeForScopeAtPosition('.meta.function', [1, 18])).toEqual [[1, 6], [1, 28]]


@ypresto

ypresto May 28, 2015

Contributor

I think this is a breaking change to selectors if the .function selector is no longer accepted and we must use .meta.function instead.

This caused a bug in a plugin, because it uses .regexp instead of .string.regexp:
klorenz/atom-regex-railroad-diagrams#42

Is this an intended change to the (experimental) API?


@ypresto

ypresto May 28, 2015

Contributor

Merge commit 0cd1f11 reproduces the bug, and its parent d75d202 does not.


@nathansobo

nathansobo May 28, 2015

Contributor

@ypresto You're right, sorry about that. I switched to using ScopeSelector from first-mate without recalling that its matching criteria are stricter. Can you switch to .string.regexp until I can get a fix in today?


@ypresto

ypresto May 28, 2015

Contributor

OK, I already PRed that repo ;).

I think it'd be better to provide backward compatibility (with a deprecation warning), at least when the dot prefix is used.
