Use new TextBuffer LanguageMode API #16087

maxbrunsfeld · 2017-11-03T00:20:40Z

This is a continuation of work started in #15713, with the overall goal of allowing Atom to use parsing systems other than TextMate grammars.

One Language Per Buffer

The main user-facing change introduced in this PR is that language modes (previously called TokenizedBuffers) now belong to a TextBuffer, not an individual TextEditor. This means that if have you have a file open in 3 different tabs, the file will only be parsed once instead of 3 times for the purpose of syntax highlighting. It also means that when you select a language for the file using the grammar selector, your choice will apply in all 3 places.

TextEditor API Changes

Previously, the way to configure an editor's language was to assign to the editor a TextMate Grammar, and the editor would internally construct a TokenizedBuffer which would use the first-mate APIs to parse the buffer.

Now, the TextEditor class is no longer responsible for constructing its parser directly; it relies on the buffer having a language mode which provides information about syntax highlighting, folding, commenting, and suggested indentation.

This means that the .getGrammar and .setGrammar methods don't really make sense any more, and neither does the grammar parameter to the TextEditor constructor. These APIs will continue to work for backwards-compatibility, but will be deprecated at some point.

Other API Changes

Other language-related APIs like TextEditorRegistry.setGrammarOverride were designed around the TextMate concept of scope names - strings like source.js that are intended to fit into TextMate's scope selector system. Though scope selectors will continue to be a concept in Atom, we will no longer assume that these selectors use the TextMate-style naming conventions. Where possible, I'd like to start referring to these strings as language ids, not scope names.

Since TextEditors no longer construct their own language modes, I've moved this functionality into the existing GrammarRegistry object exposed as atom.grammars. It is implemented in the following methods:

atom.grammars.assignLanguageMode(buffer, languageId) - Assign the buffer a language mode that uses the language with the given id. For textmate grammars, we use the scopeName (e.g. source.js) as an ID. Tree-sitter grammars will have different looking ids (probably just javascript, go, etc).
atom.grammars.autoAssignLanguageMode(buffer) - Assign the buffer a language mode that uses the most appropriate language, based on the buffer's file extension and content.
atom.grammars.maintainLanguageMode(buffer) - Assign the buffer a language mode and continue to update the language mode as additional language packages are loaded, or the buffer's file path changes.

The TextEditorRegistry is no longer responsible for anything to do with grammars. The grammar-related methods will continue to work but they will be deprecated sometime soon.

Remaining Tasks

Rename TokenizedBuffer to TextMateLanguageMode (I'm going to do this in a separate PR to keep this PR's diff more readable).
Track visibility correctly when a language mode is used by multiple editors

Next Steps

It should now be possible to integrate Tree-sitter parsers into Atom just by changing the GrammarRegistry and the PackageManager. The core editor no longer assumes anything about how its parser works. Once this is merged, I will begin doing this.

Signed-off-by: Nathan Sobo <nathan@github.com>

maxbrunsfeld · 2017-11-23T01:38:58Z

😓 Finally ✅ .

maxbrunsfeld · 2017-11-27T19:54:34Z

/cc @atom/maintainers - Curious if anyone has thoughts or opinions on the API changes described above (assignLanguageMode, autoAssignLanguageMode, maintainLanguageMode). Reasonable? Confusing?

winstliu · 2017-11-27T21:12:39Z

Where possible, we will identify languages by their common name, instead of assuming that they have a special scope name that is different from their regular name.

@Alhadis @lierdakil this change would mean that any TextMate grammars that are assigned programmatically are required to have a name value, even if they generally shouldn't be user-selectable. One example is the JS regexp grammars that are used for the find-and-replace fields. Do you have any concerns with that change? Since I think it makes sense to set grammars based on their names rather than the internal Grammar instances, I can potentially see a new hidden property or something similar being added to TextMate grammars that need names but should still be hidden from the Grammar Selector.

Alhadis · 2017-11-27T23:02:03Z

I'm fine with having a hidden property if that's what it takes.

I'm more concerned with the usage of a human-readable identifier instead of a scopeName property to uniquely identify a grammar instance...

maxbrunsfeld · 2017-11-28T00:44:20Z

I'm more concerned with the usage of a human-readable identifier instead of a scopeName property to uniquely identify a grammar instance...

It's definitely a fairly major change. My problem with the current paradigm is that the concept of scope names is kind of TextMate-specific. With most parsing systems, syntax tree nodes aren't named in this hierarchical scheme like entity.name.type.object.console.js. They just have a single name like identifier or function_definition. Similarly, programming languages are usually identified using their name, like ruby or javascript; strings like source.ruby and source.js are not particularly familiar to most people, and they're not even consistent with each other (should the name be abbreviated or not?).

Practically speaking, I think grammars' names should be just as unique as their scopeName (you would hope so, or else the Grammar Selector wouldn't be usable). The only issue that I can think of with using them is that the letter casing might not be obvious (C++ vs c++, JavaScript vs javascript), but I think that should be addressed by treating the names as case-insensitive.

I agree that some of the names like HTML (Ruby - ERB) are not very easy to guess. You wouldn't want to hardcode that string into your code. But scopes have a similar problem. It's not obvious at all that Objective-C is encoded as source.objc.

I think one solution is to fix some of the name properties. The ERB grammar should probably just be called Embedded Ruby.

Alhadis · 2017-11-28T00:59:23Z

My approach would be to introduce a new field simply called id. If not specified, it's derived from the existing scopeName, with the source. and text. prefixes hacked off: .source.js ⇒ js

Non-alphanumeric characters are replaced with dashes, so that .source.js.rails becomes js-rails instead.

Thoughts? :)

maxbrunsfeld · 2017-11-28T01:19:58Z

The word id (identifier) is pretty much a synonym of the word name, so I feel like that would just add to the confusion. There would then be three ways of referring to a grammar:

id: "ruby"
name: "Ruby"
scopeName: "source.ruby"

Can you elaborate on what concrete problem you're seeing with using the grammar's name to identify it?

maxbrunsfeld · 2017-11-28T02:07:45Z

The internal property seems reasonable to me. Should we also have the grammar selector exclude grammars that have injection selectors?

Alhadis · 2017-11-28T02:40:31Z

If a package author changes the name of a grammar, wouldn't that invalidate any serialised overrides or user config settings which reference the old name? Say, for example, an author decided to change "Fortran - Fixed Form" to "Fortran (Fixed form)" or even "FORTRAN77". There's less risk of this happening with an internal identifier, because authors have no incentive to change it. Human-readable identifiers, however, are more vulnerable to bikeshed-ish changes.

Alhadis · 2017-11-28T02:47:48Z

I also wouldn't filter out grammars simply because they contain an injectionSelector field. See atom/language-gfm#175.

damieng · 2017-11-28T03:47:04Z

We could go with same process as the atom config. i.e. the name is "fortranFixedForm" and if you want to later change how it's automatically reformatted for display then you use the optional title field to set it to "Fortran (Fixed Form)"

Alhadis · 2017-11-28T04:21:00Z

Possibly, but that still wouldn't address something like, say, Batchfile script -> Batchfile (if an author later decides the "script" part is extraneous. Or something like abbreviating a name the author later decides is too lengthy and looks better in a contracted form...

Alhadis · 2017-11-28T04:35:15Z

Ideally, it'd be great if identifiers matched whatever followed language- in the names of Atom's core language packages. E.g., language-gfm is gfm instead of GitHub-flavoured Markdown, et al.

I'm all for having a SSOT, and I'm all for getting rid of the weird, nonsensical scope-name conventions used by TextMate (which have never made sense to me; the source./text. inconsistencies drive me nuts). So I'm certainly not protesting this PR... however, I sense we could go about this differently and promote the use of a single, explicit identifier to represent grammars internally.

strings like source.ruby and source.js are not particularly familiar to most people, and they're not even consistent with each other (should the name be abbreviated or not?)

Personally, I feel the id should match whatever is used as the last scope in a grammar's patterns. So for JavaScript, that would be js, because that's what's affixed to the scope-list of each successful match (keyword.operator.js, etc).

damieng · 2017-11-28T05:14:35Z

v1

name: batchfileScript

v2

name: batchfileScript
title: Batchfile

What am I missing?

maxbrunsfeld · 2017-11-28T05:17:17Z

There's less risk of this happening with an internal identifier, because authors have no incentive to change it.

I see your point. I guess name has taken on a certain usage pattern in the TextMate grammars because it's not used internally as an identifier in existing systems.

@damieng I think the issue is that the name field on grammars is currently used for the more human readable version (similar to the title in your proposal).

Alhadis · 2017-11-28T05:26:42Z

I guess name has taken on a certain usage pattern in the TextMate grammars because it's not used internally as an identifier in existing systems.

Even that's inconsistent, though. The same field-name is used for describing the scopes of successful matches. E.g., the name property of a patterns entry is semantically different to the name field used by the grammar itself.

It's all a horrid mess, in all honesty...

maxbrunsfeld · 2017-11-28T19:21:03Z

@Alhadis @damieng @50Wliu I've updated the PR to remove the use of language names in assignLanguageMode. The method now takes a parameter called languageId. For TextMate grammars, this will just be the scopeName, so that today's grammars will continue to be identified primarily by their scopeName, exactly as before.

Tree-sitter grammars will have their own ids that will look a bit different: javascript, ruby, c++, fortran77.

Thanks to @as-cii for helping come up with this safer path forward. Does this seem reasonable to everyone? I think it is less risky and disruptive than what I originally proposed.

We will still wait to tokenize a buffer until it is visible in *some* tab, but we no longer enable and disable tokenization as different editors become visible.

Alhadis · 2017-11-28T22:12:25Z

LGTM.

Summary: setGrammar is being deprecated soon, see atom/atom#16087 Based on the changes recommended in the pull request changing the used API for setting language modes. Gating the change to just atom 1.24 and above. Reviewed By: tjfryan Differential Revision: D6776948 fbshipit-source-id: 29553915de77f9fa876b439bdb986cb3332139d5

maxbrunsfeld added 4 commits November 2, 2017 11:56

Inline unnecessary addBufferAtIndex method

b7ae57b

🎨 project-spec.js

1a7ea33

Add grammar registry as an injected dependency of Project

0d6c746

Start work on switching to new language mode API

140a783

maxbrunsfeld force-pushed the mb-use-language-mode-api branch from 4397f71 to 140a783 Compare November 3, 2017 07:12

maxbrunsfeld added 2 commits November 3, 2017 11:48

Don't require that language modes have a getGrammar method

bfffde9

Assign a language mode in text editor component specs

e6c3891

maxbrunsfeld force-pushed the mb-use-language-mode-api branch 2 times, most recently from 0460484 to 78c2e45 Compare November 4, 2017 00:33

maxbrunsfeld added 2 commits November 6, 2017 09:36

Iterate on GrammarRegistry APIs

351f96d

Signed-off-by: Nathan Sobo <nathan@github.com>

Make several languageMode APIs optional in TextEditor

113b563

maxbrunsfeld force-pushed the mb-use-language-mode-api branch from 78c2e45 to 113b563 Compare November 6, 2017 17:36

maxbrunsfeld added 4 commits November 6, 2017 09:36

⬆️ text-buffer

44be612

Serialize text editors' tab lenghts

17c90ea

Merge branch 'master' into mb-use-language-mode-api

4ec1d85

Resubscribe to grammar events after clearing grammar registry

96e4018

maxbrunsfeld requested review from matthewwithanm and removed request for matthewwithanm November 7, 2017 00:09

maxbrunsfeld added 3 commits November 6, 2017 17:07

Make getNonWordCharacters take a position

3d36455

Handle grammars with no name

52f7099

Move largeFileMode logic to TokenizedBuffer

fe6b385

maxbrunsfeld force-pushed the mb-use-language-mode-api branch from 9765ddd to 316bcda Compare November 7, 2017 01:07

maxbrunsfeld added 3 commits November 6, 2017 17:18

Pass tabLength to suggestedIndent methods for now

70cca84

Use assignLanguageMode in TextEditor spec

20b0fc6

Pass tabLength to getFoldableRangeContainingPoint

de27d4e

maxbrunsfeld force-pushed the mb-use-language-mode-api branch from 316bcda to de27d4e Compare November 7, 2017 01:19

Merge branch 'master' into mb-use-language-mode-api

653ecd7

maxbrunsfeld force-pushed the mb-use-language-mode-api branch from 81b862f to 653ecd7 Compare November 15, 2017 20:52

maxbrunsfeld added 2 commits November 15, 2017 17:15

Use assignLanguageMode in TextEditorComponent spec

91d4f53

📝 Fix comment on GrammarRegistry class

00bea43

Remove handling of old grammarOverridesByPath serialization key

bffdeb4

maxbrunsfeld requested a review from winstliu November 27, 2017 19:01

⬆️ text-buffer

a1fc5ef

Make assignLanguageMode take a language id instead of a language name

e08091f

maxbrunsfeld added 2 commits November 28, 2017 12:12

Replace TokenizedBuffer.setVisible with simpler .startTokenizing

61ac366

We will still wait to tokenize a buffer until it is visible in *some* tab, but we no longer enable and disable tokenization as different editors become visible.

Ensure stateStore isConnected for .shouldPromptToSave tests

74a2609

maxbrunsfeld merged commit ce8f300 into master Nov 28, 2017

maxbrunsfeld deleted the mb-use-language-mode-api branch November 28, 2017 21:49

maxbrunsfeld mentioned this pull request Dec 1, 2017

Ensure that all opened editors' buffers are added to the project #16306

Merged

mupchrch mentioned this pull request Mar 2, 2018

Syntax highlighting broken mupchrch/split-diff#141

Closed

Use new TextBuffer LanguageMode API #16087

Use new TextBuffer LanguageMode API #16087

Uh oh!

Conversation

maxbrunsfeld commented Nov 3, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

One Language Per Buffer

TextEditor API Changes

Other API Changes

Remaining Tasks

Next Steps

Uh oh!

maxbrunsfeld commented Nov 23, 2017

Uh oh!

maxbrunsfeld commented Nov 27, 2017

Uh oh!

winstliu commented Nov 27, 2017

Uh oh!

Alhadis commented Nov 27, 2017

Uh oh!

maxbrunsfeld commented Nov 28, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Alhadis commented Nov 28, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

maxbrunsfeld commented Nov 28, 2017

Uh oh!

maxbrunsfeld commented Nov 28, 2017

Uh oh!

Alhadis commented Nov 28, 2017

Uh oh!

Alhadis commented Nov 28, 2017

Uh oh!

damieng commented Nov 28, 2017

Uh oh!

Alhadis commented Nov 28, 2017

Uh oh!

Alhadis commented Nov 28, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

damieng commented Nov 28, 2017

v1

v2

Uh oh!

maxbrunsfeld commented Nov 28, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Alhadis commented Nov 28, 2017

Uh oh!

maxbrunsfeld commented Nov 28, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Alhadis commented Nov 28, 2017

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

maxbrunsfeld commented Nov 3, 2017 •

edited

Loading

maxbrunsfeld commented Nov 28, 2017 •

edited

Loading

Alhadis commented Nov 28, 2017 •

edited

Loading

Alhadis commented Nov 28, 2017 •

edited

Loading

maxbrunsfeld commented Nov 28, 2017 •

edited

Loading

maxbrunsfeld commented Nov 28, 2017 •

edited

Loading