[RFC] Regex revamp #323

Closed
asterite opened this issue Dec 30, 2014 · 10 comments

@asterite
Member

Ruby has the =~ operator for doing regular-expression matching:

"foo" =~ /o+/ #=> 1

The operator is defined in Object, returning nil, and only redefined in String and Regexp.

There are some things I don't like about this:

  1. The =~ operator has no intuitive name: Ruby calls it "pattern match", but that isn't obvious from how it looks (I usually think of case ... when ... end when somebody says "pattern match").
  2. Programmers have to learn new syntax for something that could perfectly well be a method name (match?), and it's only used for String and Regexp.
  3. It returns the index of the match instead of a MatchData. Then you have to use the $~, $1, etc. magic variables.
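
A small Ruby sketch of the behaviour described in point 3. The method name index_and_captures is only illustrative; the rest is standard Ruby:

```ruby
# Shows what =~ returns in Ruby and where the match details end up.
def index_and_captures(text)
  index = text =~ /(o+)/        # Integer index of the match, or nil
  return nil if index.nil?
  # The match details are only reachable through the magic variables.
  [index, $~[0], $1]
end

p index_and_captures("foo")     # [1, "oo", "oo"]
p index_and_captures("bar")     # nil
```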

Granted, once you learn it you get used to it and it feels kind of comfortable, but is Regexp really that important for it to have an operator just for itself, and the $~, $1 magic variables?

I'd like to do this:

  1. Remove the =~ operator. Instead, add a String#match method (similar to Regex#match) that returns a MatchData if successful and nil if not.
  2. Remove the $~, $1, etc. magic variables. They might be thread-local, but to make them fiber-local or method-local we would have to add some tricks or magic to the compiler (or a way to make a method define variables in an outer scope, which sounds very magical and unintuitive), whereas removing them and using a match result would be much simpler and easier to understand.
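
Ruby already has String#match, so the return value proposed in point 1 can be sketched there. This is a Ruby illustration, not Crystal code, and first_group is a hypothetical helper name:

```ruby
# String#match returns a MatchData on success and nil on failure,
# so no magic variable is needed to reach the captured group.
def first_group(text, pattern)
  match = text.match(pattern)   # MatchData or nil
  match && match[1]
end

p first_group("some_group1", /some_(group\d)/)   # "group1"
p first_group("other", /some_(group\d)/)         # nil
```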

The only downside I see is that this wouldn't work:

case some_string
when /some_(group1)/
  # how to access group1 without $1?
when /some_(group2)/
  # how to access group2 without $1?
end

Instead, one would have to write:

if match = some_string.match(/some_(group1)/)
  # match[1]
elsif match = some_string.match(/some_(group2)/)
  # match[1]
end

which is a bit more verbose, but lacks all magic. And I don't think that kind of code is common enough to justify those magic features.

What do you think?

As a side note, I'd also like to remove the $? magic variable, and then we would be free of them.

@jhass
Member

jhass commented Dec 30, 2014

I think I agree about removing =~. Ruby's case calls Regexp#===, and I think Crystal should do that too; just aliasing Regexp#match to Regexp#=== should work.
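
For context, this is how it plays out in Ruby today: each when clause calls Regexp#=== on the case subject, and a successful match also sets $~ and $1. A minimal sketch (classify is an illustrative name):

```ruby
# `when /regex/` is evaluated as `/regex/ === text` in Ruby.
def classify(text)
  case text
  when /some_(group1)/ then "first: #{$1}"
  when /some_(group2)/ then "second: #{$1}"
  else "no match"
  end
end

p(/o+/ === "foo")              # true
p classify("some_group2")      # "second: group2"
p classify("nothing")          # "no match"
```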

@vendethiel
Contributor

Yes, yes, and for the last one: yes!

I don't actually think it makes sense for String to have regexp methods defined on it. I'd love to kill magic variables. You'd think case match = xx would work, though?

@asterite
Member Author

@vendethiel Yes, I was thinking more about something like this:

case some_string
when match = /some_(group1)/
  # match[1]
when match = /some_(group2)/
  # match[1]
end

But case exp = some.complex.exp could also work and be shorter. The only "problem" is that it is currently used in the compiler in some places, with the meaning "assign some.complex.exp to exp and then do the case with that". It's useful when you want to match against an expression's type and capture that variable at the same time:

case exp = some.complex.expression
when Foo
  # compiler knows exp is Foo
when Bar
  # compiler knows exp is Bar
end

But the change would be to move the assignment before the case, which is not a big deal. In an if, an assignment is more useful because it can appear in one part of an && or in an elsif.
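
The "assign, then dispatch on the value" shape of case exp = ... also exists in Ruby, which makes it easy to try out. Note the assumption: Ruby does no compile-time type narrowing, so this sketch only shows the runtime dispatch, and describe is a hypothetical name:

```ruby
# The case subject is assigned to `value` and then matched by class.
def describe(raw)
  case value = Integer(raw, exception: false) || raw
  when Integer then "integer #{value + 1}"
  when String  then "string #{value.upcase}"
  end
end

p describe("41")     # "integer 42"
p describe("abc")    # "string ABC"
```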

(But maybe the when match = /regex/ is clearer, although more verbose)

@asterite
Member Author

Another possibility would be to keep the =~ operator but still remove the magic variables. Argh, it's so hard to make choices :-P

@asterite
Member Author

We brainstormed a bit about this. We imagined what some code would look like without the magic $1 and $~ variables. For example, this method would look like this:

private def process_handler(handler)
  flag = handler.flag
  block = handler.block

  case
  when match = flag.match /--(\S+)\s+\[\S+\]/
    process_double_flag("--#{match[1]}", block)
  when match = flag.match /--(\S+)(\s+|\=)(\S+)?/
    process_double_flag("--#{match[1]}", block, true)
  when flag.match /--\S+/
    process_flag_presence(flag, block)
  when flag.match /-(.)\s*\[\S+\]/
    process_single_flag(flag[0 .. 1], block)
  when flag =~ /-(.)\s+\S+/, flag =~ /-(.)\s+/, flag =~ /-(.)\S+/
    process_single_flag(flag[0 .. 1], block, true)
  else
    process_flag_presence(flag, block)
  end
end

To us, it looks much less readable (although it obviously gets the job done). $1 and $~ do look useful and make code more readable.

So, we were thinking of ways to keep these magic variables while at the same time making them concurrent-safe. The ideas that we had are:

  1. Hard-wire in the compiler that some expressions involving regexes define a local variable named $~. So for example if you write case foo; when /some_regex/; end the compiler would rewrite it as if $~ = foo.match(/some_regex/).
  2. Allow methods to be marked with an attribute that says "this method defines this magic variable". So for example String#=~ would be marked with @[Magic("$~")] (we would probably make String#=~ return a MatchData, and also String#===), so the compiler would know that it needs to assign that call to a local $~ variable.
  3. Similar to 2, but instead of using an attribute one would define def =~(pattern, out $~) and in the method you could assign to $~. The compiler would declare $~ before invoking the call and pass a pointer to it, somehow. In that way =~ could still return an index, and $~ would define a MatchData.

The other alternatives are to keep them as global variables (maybe thread-local, but fiber-local would be a bit hard to implement, I guess), or to remove them completely from the language.

All of the above decisions apply to $! as well.

We are just putting these ideas here so we can know what you think about these. And also maybe you have a better idea of how to "fix" this. We didn't decide anything yet :-)

What do you think?

@vendethiel
Contributor

That seems a bit scary. (I'm not sure who the "you" is, though, but I'll take it as Crystal users in general :P)
I think I'd like a MatchData being returned better, but I think you've agreed on that.

I'm mostly against global variables being used here. Magical variables are a lesser evil, imho.
I'll quote Perl6 here, because I think its grammar/actions model is quite good.
They just have a "last match" variable "$/" that their ~~ smart-matching operator uses (=~ was removed because 1) they thought it was bad and didn't mean anything 2) it conflicts with $a = ~$b # unary stringify).

It's really a mix of things that make it easier for them. Their "switch" statement will use said smart-matching feature, so each case can just be a regexp, and you can reap the fruits of the ... well, of the smart-matching :

given "string" {
  # "when" operates on the topic, so, "given"'s operand
  when /hey: .* here/ {
    say "I found {$/.Str}";
  }
}

But really, I'm not sure it even matters that much. Do you want Crystal to be a full scripting language (as in, crystal -e to do awk-like things)?

@asterite changed the title from "Regex revamp" to "[RFC] Regex revamp" on Feb 10, 2015

@ysbaddaden
Contributor

I'm not fond of magic global variables, but Ruby made me love $1, $2, ... when dealing with conditional regexp matches. That being said, having them be real globals sounds crazy. Yes, $ means it's global, but I expect them to be local to the current scope, as in solution 1 (if I understand correctly).
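
For reference, Ruby's own $~ already behaves as frame-local rather than truly global: a match performed inside a called method does not overwrite the caller's $~. A minimal Ruby sketch (method names are illustrative):

```ruby
def inner_match
  "abc" =~ /b/
  $~[0]                     # "b", visible only in this method's frame
end

def outer_match
  "xyz" =~ /z/
  inner = inner_match       # runs another match in a different frame
  [inner, $~[0]]            # the caller's $~ still refers to "z"
end

p outer_match               # ["b", "z"]
```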

I don't see the need for $! and $@ (at all). As for $? with a better std/core lib, it should be avoidable.

@asterite
Member Author

Well, we're happy to let you know that we have found a way to keep $~, $1 and $? in a thread-safe way, with just a little magic. This magic could be extended later for other symbols, but I don't think other symbols are as used in Ruby (or, say, real programs) as these.

The idea is that you can assign to $~ and $?, but this defines the variable in the caller. For example:

def foo
  $~ = "hey"
end

foo
puts $~ #=> "hey"

The way this is implemented is by passing a hidden pointer to foo with the data for $~, which is assigned inside foo. So it actually ends up being something like this (but the compiler does it without an AST rewrite):

def foo(hidden_ptr)
  hidden_ptr.value = "hey"
end

$~ :: MatchData?
foo(pointerof($~))
puts $~.not_nil! #=> "hey"

The not_nil! call is placed automatically by the compiler because $~ might not be assigned to, or might be nil. This doesn't matter much, really, because when you use $~ you almost always guard it with a condition for a regex match.
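
The hidden-pointer idea can be imitated in plain Ruby to see the data flow, with no compiler involved. This is only an analogy, not how the Crystal compiler implements it: a one-element array stands in for the pointer, and match_into and caller_side are hypothetical names.

```ruby
# Plays the role of a method that "assigns $~ in the caller":
# it writes the match result into storage owned by the caller.
def match_into(slot, text, pattern)
  slot[0] = pattern.match(text)      # MatchData or nil
  !slot[0].nil?
end

def caller_side(text)
  last_match = [nil]                 # caller declares the storage...
  if match_into(last_match, text, /h(e+)y/)   # ...and passes it along
    last_match[0][1]                 # guarded by the condition, so not nil
  end
end

p caller_side("heeey")   # "eee"
p caller_side("nope")    # nil
```

Because every caller owns its own slot, two threads or fibers calling match_into at the same time never share the result, which is the property the proposal is after.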

$? would use the same trick, allowing you to invoke backtick or system, while also getting the process status.

Finally, $1 is rewritten by the compiler to $~.not_nil![1], etc.

We would assign to these variables in String#=~, Regex#===, etc.

We believe these magic variables, once learned, make code much easier to read and write (although they might look cryptic at first). Text processing and command execution are very common tasks, so it's nice to have good support for them. And all of this is guaranteed to be concurrency-safe (for both threads and fibers). And, at the syntax level, nothing was really added.

We hope you like it! :-)

@asterite
Member Author

I'm closing this because of the above decision.

@brandondrew

@asterite It was not entirely clear from your final comments whether you decided to keep Ruby's =~ operator (since the comment seemed to focus on magic variables instead), but I'm glad to see it's still there, at least in the latest version that I just installed:

puts "They kept the Ruby syntax" if "this" =~ /is/

I hope it stays in, and I'd like to explain why I think keeping it is the best choice, and the best thing for Crystal.

I think you're going to face decisions like this over and over, that amount to the question

We can see a way of doing things that is an improvement on Ruby's way of doing things—should we do it the better way or the Ruby way?

Personally, I think the answer should either be "the Ruby way" or "both", except in cases where you need to diverge from Ruby in order to meet primary (or critical) goals of Crystal, such as being a compiled language, and keeping compilation times down to a reasonably fast speed, and mapping language types to native machine types, and so forth.

In the case of using =~ or .match, you could have both, where =~ maintains the behavior of Ruby, and where .match returns a MatchData object. I'm not saying you should have both, but it's an option.

The value I see in maintaining Ruby syntax is that the closer you are to Ruby, the more Ruby developers will adopt Crystal. They will see Crystal as basically Ruby that compiles and is super-fast (even though that's not entirely an accurate description once you dig into the details: there's more to Crystal than just Ruby that compiles, but that's an easy way to describe it to new developers). The more you diverge, the more people will say "well, it's a little bit like Ruby, but it's really an entirely different language that just borrows ideas from Ruby", and it will be perceived more like Elixir—friendly to Rubyists, but still very different. As things stand, I think Crystal is very easy for a Rubyist to learn, although there are things you must learn. I see the changes as good, and I'm very excited about Crystal, but the more it diverges the more I'll be just a little bit disappointed, and a little bit more with each divergence. And I think there are thousands of people who will have similar feelings.

So, in short: I think there is great value in maintaining as much compatibility with Ruby as is reasonable. Any changes should bring incredible gains, instead of just minor or debatable ones. And it's also possible to have it both ways, if the new way is really that much better than the old way.

asterite referenced this issue Aug 19, 2016
The Global AST node is still around, and code associated to it too,
because it's used internally by the compiler to store pre-compiled
regexes. The special $~ and $! are also represented as globals,
in the syntax. We can improve this in the future.