Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parser: rewrite type parser #9208

Merged
merged 4 commits into from
May 18, 2020
Merged

Conversation

makenowjust
Copy link
Contributor

@makenowjust makenowjust commented Apr 30, 2020

Problem

For some reasons, the parser for types and generic type path literals is known as one of the most complex part of the parser. (Another one is do ... end block parsing.)

  • too much sharing: the parser for a single type, type arguments, proc type arguments, tuple element types and others are unified to just one method. This code-sharing is excessive. They are not share syntax really, so there are many conditions and branches, thus it is hard to follow code flow.
  • historical naming: a type path (or part ot this) is called as ident for historical reason. However ident means different today.

As the above reasons, gazillion bugs live in there unfortunately. For instance, they are valid type restriction for now:

  • *Foo (orphan splat)
  • (Foo, Bar) | Baz (comma + union type)
  • {(Foo, Bar), Baz} (comma + tuple type)

So, refining type grammar and rewriting type parser is necessary.

Type Grammar

The following is complete type grammar which is supported by this PR.

bare-proc-type := union-type ("," union-type)* proc-type-output?

union-type := atomic-type-with-suffix ("|" atomic-type-with-suffix)*

atomic-type-with-suffix := atomic-type type-suffix*

atomic-type := "_"                                                            # underscore type
             / "self"                                                         # self type
             / "self?"                                                        # nilable self type
             / typeof                                                         # typeof type
             / generic                                                        # generic type
             / "{" union-type ("," union-type)* ","? "}"                      # tuple type
             / "{" named-type-list "}"                                        # named tuple type
             / proc-type-output                                               # proc type without input types
             / "(" (union-type ("," union-type)* ","?)? ")" proc-type-output  # proc type
             / "(" bare-proc-type ")"

proc-type-output := "->" union-type?

generic := path type-args?

path := "::"? CONST ("::" CONST)*

type-args := "(" "*"? type-arg ("," "*"? type-arg)* ","? ")"
           / "(" bare-proc-type ")"
           / "(" named-type-args ")"

named-type-args := key ":" bare-proc-type ("," key ":" bare-proc-type)* ","?

type-arg := NUMBER / sizeof / instance-sizeof / offsetof / union-type

type-suffix := "." "class"       # class type
             / "*"               # pointer type
             / "**"              # double-pointer type
             / "?"               # nilable type
             / "[" type-arg "]"  # static array type

typeof := ...
sizeof := ...
instance-sizeof := ...
offsetof := ...

I think this grammar behaves as we expect.
(Repeatedly the type parser is broken, so we cannot know the current right grammar. I tried to observe the parser behavior and reject unexpected ones.)

At least, modifying stdlib and all specs for this grammar compatibility was not needed.

Fixes

Rewriting type parser is done in this PR, then some bugs are fixed.

Reject orphan splat

Now, splat (*) can appear only in type arguments (proc input type arguments also).

The following example is syntax error by this PR:

def foo(x : *Int32)
end

Disable auto-flatting comma list

Currently {(Foo, Bar), Baz} is parsed to {Foo, Bar, Baz}. However no one wants this behavior. In addition (Foo, Bar) | Baz is valid for near reason. They should be disabled.

The following example is syntax error by this PR:

def foo(x : {(Int32, Int32), Int32}, y : (Int32, Int32) | Int32)
end

And, some minor bugs are fixed:

  • Format (self)? restriction correctly.
  • Unify parsed AST for self? and A?. (Current compiler generates self | ::Nil and ::Union(A, ::Nil). I think the latter is correct AST because Generic has question flag to represent nilable type.)

/cc @asterite

@jwoertink
Copy link
Contributor

Does this still allow for the star after a type like in libs?

fun open_device = OpenDevice(chr : Char*) : UInt8*

@asterite
Copy link
Member

My guess is yes because CI passes and that syntax is in our code.

@makenowjust
Copy link
Contributor Author

@jwoertink This represents pointer type and it is allowed of course. In type grammar, prefix * means splat, and suffix * means pointer type.

I think this PR fix does not break any existing codes.

Copy link
Contributor

@RX14 RX14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good! Since it passes CI I expect it to have few if any regressions.

@straight-shoota straight-shoota added kind:refactor topic:compiler:parser kind:bug A bug in the code. Does not apply to documentation, specs, etc. topic:lang:type-system labels May 1, 2020
@oprypin
Copy link
Member

oprypin commented May 1, 2020

I remembered the infra that tries to build external Crystal projects with the new changes (that I think @bcardiff owned). Should that be involved?

@jhass
Copy link
Member

jhass commented May 1, 2020

https://github.com/bcardiff/test-ecosystem

I think he runs that as part of the release process, but sure, no harm in doing it on this branch too.

@straight-shoota straight-shoota added this to the 0.35.0 milestone May 1, 2020
Copy link
Member

@bcardiff bcardiff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how I spotted a typo. 😅

Thanks for the BNF!

src/compiler/crystal/syntax/parser.cr Outdated Show resolved Hide resolved
@makenowjust
Copy link
Contributor Author

To try bcardiff/test-ecosystem, I ran distribution-scripts against this branch. However it was failed on building shards due to Log module changes.

How should I do? or could anyone try it?

@straight-shoota
Copy link
Member

You should try again, crystal-lang/shards#379 probably fixed this.

@bcardiff
Copy link
Member

bcardiff commented May 6, 2020

@makenowjust nightlies are stuck until a new version of shards is releases.

The easiest way to use test-ecosystem today would be to use the local target in the makefile. You can use LOCAL_CRYSTAL and LOCAL_SHARDS env variables to indicate the path where the compiler and shards binary are available.

make clean local LOCAL_CRYSTAL=~/Projects/crystal/master/bin LOCAL_SHARDS=~/Projects/crystal/shards/bin

That way you can use your working copy of both projects (or even use the current shards binary)

@bcardiff
Copy link
Member

bcardiff commented May 6, 2020

if things stuck again ping me and I'll check what can I do.

It's hard to automate things in test-ecosystem. In every release I need to change the content in 10-clone-repos.sh to point to forked branches with overrided dependencies. If the scope is to check if master of those projects work, things are doable, but IME is hardly the case. for example checking PR:8893 db drivers requires some update, and that mean forking lots of things later. Shards overrides will be helpful of course.

@straight-shoota
Copy link
Member

straight-shoota commented May 6, 2020

I'm pretty sure this change is fine, though. Perhaps there might be some rare glitch that hasn't shown up on the radar yet. But this might be easier to spot when this PR is merged and in nightly.

@jhass
Copy link
Member

jhass commented May 7, 2020

I agree, if you have troubles getting it running I think we can do without!

@asterite
Copy link
Member

asterite commented May 7, 2020

If all specs pass, this can be merged. This isn't a language feature that's changing or a breaking change. The compiler and std use some C bindings too. I think merging this is fine. We can revert it if it breaks things just before releasing the next version.

@asterite
Copy link
Member

asterite commented May 7, 2020

Actually, I'll review this today.

it_parses "def foo(var : self?); end", Def.new("foo", [Arg.new("var", restriction: Crystal::Union.new([Self.new, Path.global("Nil")] of ASTNode))])
it_parses "def foo(var : self?); end", Def.new("foo", [Arg.new("var", restriction: Crystal::Generic.new(Path.global("Union"), [Self.new, Path.global("Nil")] of ASTNode))])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this change? Same for other place where we now produce a Generic instead of Union. I think Union is simpler...?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generic has question property to represent suffix ?. (see here). So I think it is better than using Union.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thing is that we have code in restrictions.cr for Union and I think it should work the same as for Generic, but I'm not sure. So changing this might indeed break someones code because we are changing behavior.

Copy link
Contributor Author

@makenowjust makenowjust May 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure too. But your concern is semantic analyzer's bug, and this is also parser's bug fix.
For instance, currently Foo(Bar?)? is parsed to Generic.new(Path.global("Union"), [Generic.new(Path.new("Foo"), [Union.new([Path.new("Bar"), Path.global("Nil")])]), Path.global("Nil")]) (inner Bar? uses Union and outer Foo(...)? uses Generic.)

When such a bug is found really, we can fix analyzer correctly. Leaving a certain bug sounds not good.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, let's wait and see if bugs happen then.

@makenowjust
Copy link
Contributor Author

Who can merge this PR...

@asterite If possible, could you merge or approve this?

@asterite
Copy link
Member

asterite commented May 8, 2020

Sorry, I won't have time to review this. But if others feel confident this is okay, please merge.

I'm still hesitant about the Union -> Generic change, I think representing it with Union is much simpler.

@makenowjust
Copy link
Contributor Author

Aha, mixing Generic and Union makes sense in this point:

  • Generic is used on expression context.
  • Union is used on type context.

So I will try to keep this behavior.

@asterite Is this okay?

@asterite
Copy link
Member

asterite commented May 8, 2020

@makenowjust I think so. It's just those changes in the parser that didn't need to change. Before they were Union and now they are Generic. My only comment in this PR.

makenowjust and others added 2 commits May 13, 2020 14:44
For some reasons, the parser for types and generic type path literals is
known as one of the most complex part of the parser.
(Another one is `do ... end` block parsing.)

- **too much sharing**: the parser for a single type, type arguments,
  proc type arguments, tuple element types and others are unified to just one method.
  This code-sharing is excessive. They are not share syntax really, so there are many
  conditions and branches, thus it is hard to follow code flow.
- **historical naming**: a type path (or part ot this) is called as `ident` for historical reason.
  However `ident` means different today.

As the above reasons, gazillion bugs live in there unfortunately.
For instance, they are valid type restriction for now:

  - `*Foo` (orphan splat)
  - `(Foo, Bar) | Baz` (comma + union type)
  - `{(Foo, Bar), Baz}` (comma + tuple type)

This commit refines type grammer and rewrite parsers.
Co-authored-by: Brian J. Cardiff <bcardiff@gmail.com>
@makenowjust
Copy link
Contributor Author

makenowjust commented May 13, 2020

Now Generic and Union are used properly. Please review this 🙏

Note that 3235714 introduces a tiny breaking change. See the following. This code can be parsed by the current compiler (Of course it is failed on semantic phase.).

class Foo < Bar?
  include Baz?
end

After 3235714, it becomes parsing error. I think any problems are not happened by this change because every such code is not compilable.

@asterite
Copy link
Member

@makenowjust That's great, thank you!

I'll try to review it this week if I can.

@asterite
Copy link
Member

Well, sorry again, I don't think I'll have time to review this 😓

@bcardiff
Copy link
Member

FYI test-ecosystem is happy with this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind:bug A bug in the code. Does not apply to documentation, specs, etc. kind:refactor topic:compiler:parser topic:lang:type-system
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants