New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use new TextBuffer LanguageMode API #16087

Merged
merged 53 commits into from Nov 28, 2017

Conversation

Projects
None yet
4 participants
@maxbrunsfeld
Contributor

maxbrunsfeld commented Nov 3, 2017

Depends on atom/text-buffer#260

This is a continuation of work started in #15713, with the overall goal of allowing Atom to use parsing systems other than TextMate grammars.

One Language Per Buffer

The main user-facing change introduced in this PR is that language modes (previously called TokenizedBuffers) now belong to a TextBuffer, not an individual TextEditor. This means that if have you have a file open in 3 different tabs, the file will only be parsed once instead of 3 times for the purpose of syntax highlighting. It also means that when you select a language for the file using the grammar selector, your choice will apply in all 3 places.

TextEditor API Changes

Previously, the way to configure an editor's language was to assign to the editor a TextMate Grammar, and the editor would internally construct a TokenizedBuffer which would use the first-mate APIs to parse the buffer.

Now, the TextEditor class is no longer responsible for constructing its parser directly; it relies on the buffer having a language mode which provides information about syntax highlighting, folding, commenting, and suggested indentation.

This means that the .getGrammar and .setGrammar methods don't really make sense any more, and neither does the grammar parameter to the TextEditor constructor. These APIs will continue to work for backwards-compatibility, but will be deprecated at some point.

Other API Changes

Other language-related APIs like TextEditorRegistry.setGrammarOverride were designed around the TextMate concept of scope names - strings like source.js that are intended to fit into TextMate's scope selector system. Though scope selectors will continue to be a concept in Atom, we will no longer assume that these selectors use the TextMate-style naming conventions. Where possible, I'd like to start referring to these strings as language ids, not scope names.

Since TextEditors no longer construct their own language modes, I've moved this functionality into the existing GrammarRegistry object exposed as atom.grammars. It is implemented in the following methods:

  • atom.grammars.assignLanguageMode(buffer, languageId) - Assign the buffer a language mode that uses the language with the given id. For textmate grammars, we use the scopeName (e.g. source.js) as an ID. Tree-sitter grammars will have different looking ids (probably just javascript, go, etc).

  • atom.grammars.autoAssignLanguageMode(buffer) - Assign the buffer a language mode that uses the most appropriate language, based on the buffer's file extension and content.

  • atom.grammars.maintainLanguageMode(buffer) - Assign the buffer a language mode and continue to update the language mode as additional language packages are loaded, or the buffer's file path changes.

The TextEditorRegistry is no longer responsible for anything to do with grammars. The grammar-related methods will continue to work but they will be deprecated sometime soon.

Remaining Tasks

  • Rename TokenizedBuffer to TextMateLanguageMode (I'm going to do this in a separate PR to keep this PR's diff more readable).
  • Track visibility correctly when a language mode is used by multiple editors

Next Steps

It should now be possible to integrate Tree-sitter parsers into Atom just by changing the GrammarRegistry and the PackageManager. The core editor no longer assumes anything about how its parser works. Once this is merged, I will begin doing this.

@maxbrunsfeld maxbrunsfeld requested review from matthewwithanm and removed request for matthewwithanm Nov 7, 2017

@maxbrunsfeld maxbrunsfeld requested a review from 50Wliu Nov 27, 2017

@maxbrunsfeld

This comment has been minimized.

Show comment
Hide comment
@maxbrunsfeld

maxbrunsfeld Nov 27, 2017

Contributor

/cc @atom/maintainers - Curious if anyone has thoughts or opinions on the API changes described above (assignLanguageMode, autoAssignLanguageMode, maintainLanguageMode). Reasonable? Confusing?

Contributor

maxbrunsfeld commented Nov 27, 2017

/cc @atom/maintainers - Curious if anyone has thoughts or opinions on the API changes described above (assignLanguageMode, autoAssignLanguageMode, maintainLanguageMode). Reasonable? Confusing?

@50Wliu

This comment has been minimized.

Show comment
Hide comment
@50Wliu

50Wliu Nov 27, 2017

Member

Where possible, we will identify languages by their common name, instead of assuming that they have a special scope name that is different from their regular name.

@Alhadis @lierdakil this change would mean that any TextMate grammars that are assigned programmatically are required to have a name value, even if they generally shouldn't be user-selectable. One example is the JS regexp grammars that are used for the find-and-replace fields. Do you have any concerns with that change? Since I think it makes sense to set grammars based on their names rather than the internal Grammar instances, I can potentially see a new hidden property or something similar being added to TextMate grammars that need names but should still be hidden from the Grammar Selector.

Member

50Wliu commented Nov 27, 2017

Where possible, we will identify languages by their common name, instead of assuming that they have a special scope name that is different from their regular name.

@Alhadis @lierdakil this change would mean that any TextMate grammars that are assigned programmatically are required to have a name value, even if they generally shouldn't be user-selectable. One example is the JS regexp grammars that are used for the find-and-replace fields. Do you have any concerns with that change? Since I think it makes sense to set grammars based on their names rather than the internal Grammar instances, I can potentially see a new hidden property or something similar being added to TextMate grammars that need names but should still be hidden from the Grammar Selector.

@Alhadis

This comment has been minimized.

Show comment
Hide comment
@Alhadis

Alhadis Nov 27, 2017

I'm fine with having a hidden property if that's what it takes.

I'm more concerned with the usage of a human-readable identifier instead of a scopeName property to uniquely identify a grammar instance...

Alhadis commented Nov 27, 2017

I'm fine with having a hidden property if that's what it takes.

I'm more concerned with the usage of a human-readable identifier instead of a scopeName property to uniquely identify a grammar instance...

@maxbrunsfeld

This comment has been minimized.

Show comment
Hide comment
@maxbrunsfeld

maxbrunsfeld Nov 28, 2017

Contributor

I'm more concerned with the usage of a human-readable identifier instead of a scopeName property to uniquely identify a grammar instance...

It's definitely a fairly major change. My problem with the current paradigm is that the concept of scope names is kind of TextMate-specific. With most parsing systems, syntax tree nodes aren't named in this hierarchical scheme like entity.name.type.object.console.js. They just have a single name like identifier or function_definition. Similarly, programming languages are usually identified using their name, like ruby or javascript; strings like source.ruby and source.js are not particularly familiar to most people, and they're not even consistent with each other (should the name be abbreviated or not?).

Practically speaking, I think grammars' names should be just as unique as their scopeName (you would hope so, or else the Grammar Selector wouldn't be usable). The only issue that I can think of with using them is that the letter casing might not be obvious (C++ vs c++, JavaScript vs javascript), but I think that should be addressed by treating the names as case-insensitive.

I agree that some of the names like HTML (Ruby - ERB) are not very easy to guess. You wouldn't want to hardcode that string into your code. But scopes have a similar problem. It's not obvious at all that Objective-C is encoded as source.objc.

I think one solution is to fix some of the name properties. The ERB grammar should probably just be called Embedded Ruby.

Contributor

maxbrunsfeld commented Nov 28, 2017

I'm more concerned with the usage of a human-readable identifier instead of a scopeName property to uniquely identify a grammar instance...

It's definitely a fairly major change. My problem with the current paradigm is that the concept of scope names is kind of TextMate-specific. With most parsing systems, syntax tree nodes aren't named in this hierarchical scheme like entity.name.type.object.console.js. They just have a single name like identifier or function_definition. Similarly, programming languages are usually identified using their name, like ruby or javascript; strings like source.ruby and source.js are not particularly familiar to most people, and they're not even consistent with each other (should the name be abbreviated or not?).

Practically speaking, I think grammars' names should be just as unique as their scopeName (you would hope so, or else the Grammar Selector wouldn't be usable). The only issue that I can think of with using them is that the letter casing might not be obvious (C++ vs c++, JavaScript vs javascript), but I think that should be addressed by treating the names as case-insensitive.

I agree that some of the names like HTML (Ruby - ERB) are not very easy to guess. You wouldn't want to hardcode that string into your code. But scopes have a similar problem. It's not obvious at all that Objective-C is encoded as source.objc.

I think one solution is to fix some of the name properties. The ERB grammar should probably just be called Embedded Ruby.

@Alhadis

This comment has been minimized.

Show comment
Hide comment
@Alhadis

Alhadis Nov 28, 2017

My approach would be to introduce a new field simply called id. If not specified, it's derived from the existing scopeName, with the source. and text. prefixes hacked off: .source.jsjs

Non-alphanumeric characters are replaced with dashes, so that .source.js.rails becomes js-rails instead.

Thoughts? :)

Alhadis commented Nov 28, 2017

My approach would be to introduce a new field simply called id. If not specified, it's derived from the existing scopeName, with the source. and text. prefixes hacked off: .source.jsjs

Non-alphanumeric characters are replaced with dashes, so that .source.js.rails becomes js-rails instead.

Thoughts? :)

@maxbrunsfeld

This comment has been minimized.

Show comment
Hide comment
@maxbrunsfeld

maxbrunsfeld Nov 28, 2017

Contributor

The word id (identifier) is pretty much a synonym of the word name, so I feel like that would just add to the confusion. There would then be three ways of referring to a grammar:

id: "ruby"
name: "Ruby"
scopeName: "source.ruby"

Can you elaborate on what concrete problem you're seeing with using the grammar's name to identify it?

Contributor

maxbrunsfeld commented Nov 28, 2017

The word id (identifier) is pretty much a synonym of the word name, so I feel like that would just add to the confusion. There would then be three ways of referring to a grammar:

id: "ruby"
name: "Ruby"
scopeName: "source.ruby"

Can you elaborate on what concrete problem you're seeing with using the grammar's name to identify it?

@maxbrunsfeld

This comment has been minimized.

Show comment
Hide comment
@maxbrunsfeld

maxbrunsfeld Nov 28, 2017

Contributor

The internal property seems reasonable to me. Should we also have the grammar selector exclude grammars that have injection selectors?

Contributor

maxbrunsfeld commented Nov 28, 2017

The internal property seems reasonable to me. Should we also have the grammar selector exclude grammars that have injection selectors?

@Alhadis

This comment has been minimized.

Show comment
Hide comment
@Alhadis

Alhadis Nov 28, 2017

If a package author changes the name of a grammar, wouldn't that invalidate any serialised overrides or user config settings which reference the old name? Say, for example, an author decided to change "Fortran - Fixed Form" to "Fortran (Fixed form)" or even "FORTRAN77". There's less risk of this happening with an internal identifier, because authors have no incentive to change it. Human-readable identifiers, however, are more vulnerable to bikeshed-ish changes.

Alhadis commented Nov 28, 2017

If a package author changes the name of a grammar, wouldn't that invalidate any serialised overrides or user config settings which reference the old name? Say, for example, an author decided to change "Fortran - Fixed Form" to "Fortran (Fixed form)" or even "FORTRAN77". There's less risk of this happening with an internal identifier, because authors have no incentive to change it. Human-readable identifiers, however, are more vulnerable to bikeshed-ish changes.

@Alhadis

This comment has been minimized.

Show comment
Hide comment
@Alhadis

Alhadis Nov 28, 2017

I also wouldn't filter out grammars simply because they contain an injectionSelector field. See atom/language-gfm#175.

Alhadis commented Nov 28, 2017

I also wouldn't filter out grammars simply because they contain an injectionSelector field. See atom/language-gfm#175.

@damieng

This comment has been minimized.

Show comment
Hide comment
@damieng

damieng Nov 28, 2017

Contributor

We could go with same process as the atom config. i.e. the name is "fortranFixedForm" and if you want to later change how it's automatically reformatted for display then you use the optional title field to set it to "Fortran (Fixed Form)"

Contributor

damieng commented Nov 28, 2017

We could go with same process as the atom config. i.e. the name is "fortranFixedForm" and if you want to later change how it's automatically reformatted for display then you use the optional title field to set it to "Fortran (Fixed Form)"

@Alhadis

This comment has been minimized.

Show comment
Hide comment
@Alhadis

Alhadis Nov 28, 2017

Possibly, but that still wouldn't address something like, say, Batchfile script -> Batchfile (if an author later decides the "script" part is extraneous. Or something like abbreviating a name the author later decides is too lengthy and looks better in a contracted form...

Alhadis commented Nov 28, 2017

Possibly, but that still wouldn't address something like, say, Batchfile script -> Batchfile (if an author later decides the "script" part is extraneous. Or something like abbreviating a name the author later decides is too lengthy and looks better in a contracted form...

@Alhadis

This comment has been minimized.

Show comment
Hide comment
@Alhadis

Alhadis Nov 28, 2017

Ideally, it'd be great if identifiers matched whatever followed language- in the names of Atom's core language packages. E.g., language-gfm is gfm instead of GitHub-flavoured Markdown, et al.

I'm all for having a SSOT, and I'm all for getting rid of the weird, nonsensical scope-name conventions used by TextMate (which have never made sense to me; the source./text. inconsistencies drive me nuts). So I'm certainly not protesting this PR... however, I sense we could go about this differently and promote the use of a single, explicit identifier to represent grammars internally.

strings like source.ruby and source.js are not particularly familiar to most people, and they're not even consistent with each other (should the name be abbreviated or not?)

Personally, I feel the id should match whatever is used as the last scope in a grammar's patterns. So for JavaScript, that would be js, because that's what's affixed to the scope-list of each successful match (keyword.operator.js, etc).

Alhadis commented Nov 28, 2017

Ideally, it'd be great if identifiers matched whatever followed language- in the names of Atom's core language packages. E.g., language-gfm is gfm instead of GitHub-flavoured Markdown, et al.

I'm all for having a SSOT, and I'm all for getting rid of the weird, nonsensical scope-name conventions used by TextMate (which have never made sense to me; the source./text. inconsistencies drive me nuts). So I'm certainly not protesting this PR... however, I sense we could go about this differently and promote the use of a single, explicit identifier to represent grammars internally.

strings like source.ruby and source.js are not particularly familiar to most people, and they're not even consistent with each other (should the name be abbreviated or not?)

Personally, I feel the id should match whatever is used as the last scope in a grammar's patterns. So for JavaScript, that would be js, because that's what's affixed to the scope-list of each successful match (keyword.operator.js, etc).

@damieng

This comment has been minimized.

Show comment
Hide comment
@damieng

damieng Nov 28, 2017

Contributor

v1

name: batchfileScript

v2

name: batchfileScript
title: Batchfile

What am I missing?

Contributor

damieng commented Nov 28, 2017

v1

name: batchfileScript

v2

name: batchfileScript
title: Batchfile

What am I missing?

@maxbrunsfeld

This comment has been minimized.

Show comment
Hide comment
@maxbrunsfeld

maxbrunsfeld Nov 28, 2017

Contributor

There's less risk of this happening with an internal identifier, because authors have no incentive to change it.

I see your point. I guess name has taken on a certain usage pattern in the TextMate grammars because it's not used internally as an identifier in existing systems.

@damieng I think the issue is that the name field on grammars is currently used for the more human readable version (similar to the title in your proposal).

Contributor

maxbrunsfeld commented Nov 28, 2017

There's less risk of this happening with an internal identifier, because authors have no incentive to change it.

I see your point. I guess name has taken on a certain usage pattern in the TextMate grammars because it's not used internally as an identifier in existing systems.

@damieng I think the issue is that the name field on grammars is currently used for the more human readable version (similar to the title in your proposal).

@Alhadis

This comment has been minimized.

Show comment
Hide comment
@Alhadis

Alhadis Nov 28, 2017

I guess name has taken on a certain usage pattern in the TextMate grammars because it's not used internally as an identifier in existing systems.

Even that's inconsistent, though. The same field-name is used for describing the scopes of successful matches. E.g., the name property of a patterns entry is semantically different to the name field used by the grammar itself.

It's all a horrid mess, in all honesty...

Alhadis commented Nov 28, 2017

I guess name has taken on a certain usage pattern in the TextMate grammars because it's not used internally as an identifier in existing systems.

Even that's inconsistent, though. The same field-name is used for describing the scopes of successful matches. E.g., the name property of a patterns entry is semantically different to the name field used by the grammar itself.

It's all a horrid mess, in all honesty...

@maxbrunsfeld

This comment has been minimized.

Show comment
Hide comment
@maxbrunsfeld

maxbrunsfeld Nov 28, 2017

Contributor

@Alhadis @damieng @50Wliu I've updated the PR to remove the use of language names in assignLanguageMode. The method now takes a parameter called languageId. For TextMate grammars, this will just be the scopeName, so that today's grammars will continue to be identified primarily by their scopeName, exactly as before.

Tree-sitter grammars will have their own ids that will look a bit different: javascript, ruby, c++, fortran77.

Thanks to @as-cii for helping come up with this safer path forward. Does this seem reasonable to everyone? I think it is less risky and disruptive than what I originally proposed.

Contributor

maxbrunsfeld commented Nov 28, 2017

@Alhadis @damieng @50Wliu I've updated the PR to remove the use of language names in assignLanguageMode. The method now takes a parameter called languageId. For TextMate grammars, this will just be the scopeName, so that today's grammars will continue to be identified primarily by their scopeName, exactly as before.

Tree-sitter grammars will have their own ids that will look a bit different: javascript, ruby, c++, fortran77.

Thanks to @as-cii for helping come up with this safer path forward. Does this seem reasonable to everyone? I think it is less risky and disruptive than what I originally proposed.

maxbrunsfeld added some commits Nov 28, 2017

Replace TokenizedBuffer.setVisible with simpler .startTokenizing
We will still wait to tokenize a buffer until it is visible in *some*
tab, but we no longer enable and disable tokenization as different
editors become visible.

@maxbrunsfeld maxbrunsfeld merged commit ce8f300 into master Nov 28, 2017

3 checks passed

ci/circleci Your tests passed on CircleCI!
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@maxbrunsfeld maxbrunsfeld deleted the mb-use-language-mode-api branch Nov 28, 2017

@Alhadis

This comment has been minimized.

Show comment
Hide comment
@Alhadis

Alhadis commented Nov 28, 2017

LGTM.

facebook-github-bot added a commit to facebook/nuclide that referenced this pull request Jan 25, 2018

Fix diff view auto scrolling in atom 1.24.0-beta3
Summary:
setGrammar is being deprecated soon, see atom/atom#16087

Based on the changes recommended in the pull request changing the used API for setting language modes. Gating the change to just atom 1.24 and above.

Reviewed By: tjfryan

Differential Revision: D6776948

fbshipit-source-id: 29553915de77f9fa876b439bdb986cb3332139d5

hansonw added a commit to facebook-atom/atom-ide-ui that referenced this pull request Jan 26, 2018

Fix diff view auto scrolling in atom 1.24.0-beta3
Summary:
setGrammar is being deprecated soon, see atom/atom#16087

Based on the changes recommended in the pull request changing the used API for setting language modes. Gating the change to just atom 1.24 and above.

Reviewed By: tjfryan

Differential Revision: D6776948

fbshipit-source-id: 29553915de77f9fa876b439bdb986cb3332139d5

@mupchrch mupchrch referenced this pull request Mar 2, 2018

Closed

Syntax highlighting broken #141

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment