Regexes #390

Closed
FroMage opened this Issue Aug 28, 2012 · 45 comments

FroMage commented Aug 28, 2012

In #382 we talked a lot about regexes, and how we might want to put something like that in the base language, to support single-quoted literals.

@gavinking says that the PCRE (Perl Compatible Regular Expressions) syntax used by every other language is too confusing and wants to restrict it to a subset. It looks like the subset he uses is considerably smaller than mine, though.

I do agree that the syntax is confusing for those who don't know it, but on the other hand it's shared by 100% of the languages that use regexes, give or take a few features that some might not support. So there's a good chance people will already know the syntax.

On the other hand, the authors of the Perl 5 regexes rewrote regex support entirely in Perl 6, for many of the same reasons that we don't consider Perl 5 regexes readable and friendly.

See https://github.com/perlpilot/perl6-docs/blob/master/intro/p6-regex-intro.pod for a quick intro on those new regexes. I find them a lot better myself, and that's a lot coming from me who's been mostly fluent in Perl 5 regexes since 1998. It's a lot more readable and powerful. So please read this and compare the feature list with the Perl 5 syntax you already know.

Now, there's another thing I want to touch on, which is that regexes are a lot better if they're part of the language itself. With proper operator and literal support. That's what Perl does, and Ruby, JavaScript and (to a lesser extent) Python do it too.

I always found it alien that Java didn't support regexes in the language, because that means silly quoting, no support for non-string values or expressions in the pattern, and a generally confusing API.

I think we should consider something like Perl 6 regexes in Ceylon. Not as a library but as a language feature.

To translate some of the examples from the Perl 6 doc linked above into Ceylon-ish:

value pattern = "ab*c";
value choice = {"one", "two", "three"};
value re = / $pattern $choice /;
// equivalent to
value re = /  "ab*c" [ one | two | three ] /;
// assign from within the regex
variable String foo;
value re = / $foo := [ <[A-Z]+[0-9]>**4 ] / ;
// matching
value result = / ( abc* ) /.match("asd abcccdef");
assert result[0] == "asd abcccdef";
assert result[1] == "abccc";
// replacing
value result = / ( abc* ) /.replace("asd abcccdef", "$1");
assert result == "abccc";
// modifiers via annotations on the regex literal
value result = caseInsensitive / ( abc* ) /.match("ABC abc");
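
For comparison, this is roughly what the matching, replacing and modifier examples look like with the library approach in Java today. It is only a sketch: the class and method names (`RegexLibraryDemo`, `firstMatch`) are made up for illustration. Note that in `java.util.regex` group 0 is the text matched by the whole pattern rather than the whole input, so the values differ from the asserts in the Ceylon-ish sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexLibraryDemo {
    // Returns {whole match, first capture group}, or null when nothing matches.
    // (Hypothetical helper, not part of any library.)
    static String[] firstMatch(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(0), m.group(1) };
    }

    public static void main(String[] args) {
        Pattern abc = Pattern.compile("(abc*)");

        // matching
        String[] groups = firstMatch(abc, "asd abcccdef");
        System.out.println(groups[0]); // text matched by the whole pattern
        System.out.println(groups[1]); // text matched by the first group

        // replacing: $1 in the replacement refers to the first group
        System.out.println(abc.matcher("asd abcccdef").replaceAll("<$1>"));

        // modifiers are flags passed to compile(), not annotations
        Pattern ci = Pattern.compile("(abc*)", Pattern.CASE_INSENSITIVE);
        System.out.println(firstMatch(ci, "ABC abc")[0]);
    }
}
```

The pattern is an ordinary string here, so nothing about it is checked until `Pattern.compile` runs.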

And the cool bit about the grammars:

grammar Calc {

    regex expr {
        <term> '*' <expr> |
        <term> '/' <expr> |
        <term>
    }

    regex term {
        <factor> '+' <term> |
        <factor> '-' <term> |
        <factor>
    }

    regex factor { <digit>+ |  '(' <expr> ')' }
}
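
To make concrete what such a grammar buys you, here is a hand-written Java recognizer for the same three rules. It is a minimal sketch under my own assumptions: `CalcRecognizer` is a made-up name, it only answers "does the whole input match?", and it keeps the rules exactly as written above (so `+` and `-` bind tighter than `*` and `/`).

```java
final class CalcRecognizer {
    private final String input;
    private int pos = 0;

    private CalcRecognizer(String input) {
        this.input = input;
    }

    // True when the entire input is derivable from the `expr` rule.
    static boolean matches(String input) {
        CalcRecognizer r = new CalcRecognizer(input);
        return r.expr() && r.pos == input.length();
    }

    // expr: <term> '*' <expr> | <term> '/' <expr> | <term>
    private boolean expr() {
        if (!term()) return false;
        if (peek('*') || peek('/')) {
            pos++;
            return expr();
        }
        return true;
    }

    // term: <factor> '+' <term> | <factor> '-' <term> | <factor>
    private boolean term() {
        if (!factor()) return false;
        if (peek('+') || peek('-')) {
            pos++;
            return term();
        }
        return true;
    }

    // factor: <digit>+ | '(' <expr> ')'
    private boolean factor() {
        if (peek('(')) {
            pos++;
            if (!expr() || !peek(')')) return false;
            pos++;
            return true;
        }
        int start = pos;
        while (pos < input.length() && Character.isDigit(input.charAt(pos))) {
            pos++;
        }
        return pos > start;
    }

    private boolean peek(char c) {
        return pos < input.length() && input.charAt(pos) == c;
    }

    public static void main(String[] args) {
        System.out.println(matches("(1+2)*3")); // true
        System.out.println(matches("1+"));      // false
    }
}
```

That is around forty lines of code for what the grammar expresses in three rules.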

It looks very alien, and it's effectively a DSL within Ceylon, but it's so much more powerful than not having support in the language, and a lot more readable IMO than Perl 5 regexes.

Anyways, WDYT?

FroMage Aug 28, 2012

Member

WTF? Seconds after I post this I get this spam from O'Reilly about mastering regexes? http://post.oreilly.com/form/oreilly/viewhtml/9z1zet5j192bp5m7d0qv7r3r18amqascuicnoo3uk28?imm_mid=092e31&cmp=em-code-books-videos-reg-expresssions-direct

One of github, wikipedia or the perl 6 spec site sold me off!

quintesse Aug 28, 2012

Member

Don't forget Google ;)

quintesse Aug 28, 2012

Member

So I get your wish to include regexes, but I also think they're a little overrated. Sure, they are useful in a lot of cases, but useful enough to increase the complexity of the base language by such an amount? If escaping is the only major worry, we could probably come up with a special quoting mechanism that ignores all the escape characters and just passes the exact string as it's encountered in the source code.

If on the other hand you want something that the parser can validate... why stop with regexes? Why not JSON and XML and a whole bunch of other useful constructs? Why not go one step further and support pluggable parsers?

FroMage Aug 28, 2012

Member

Yeah, I know. It's just that I don't care about JSON and XML literals. Regex literals are so much more powerful than storing them in stupid strings.

quintesse Aug 28, 2012

Member

Maybe they're more powerful, but you hardly ever do anything intelligent with those regexes: the compiler and IDE might check for obvious syntax errors, but that's about it. (Although admittedly the new regexp format seems to offer more possibilities in that respect.)

FroMage Aug 28, 2012

Member

Well, with them the compiler validates every other single-quoted literal ;)

quintesse Aug 28, 2012

Member

Well, "validate" is a big word. Sometimes you might be able to do some checking that covers 100% of the cases, but in the case of dates, for example, you'll only be able to do some basic syntax checking; you could still easily write nonsense (or have to write some pretty complex regexes). I just don't see how that's worth it all. Do we have a list of real-life examples where having regexp-checked literals will really make a big difference?

FroMage Aug 28, 2012

Member

Having regex literals allows us to mingle code with the regex, such as callbacks, assignments, references and stuff. You can't do the same with regexes as a String literal.

quintesse Aug 28, 2012

Member

At compile time? Are you serious?

FroMage Aug 28, 2012

Member

Check out my example above. In Perl regexes you can use code as parts of your regex, for more powerful matching, and for actions. I don't see why we couldn't do that too.

quintesse Aug 28, 2012

Member

Well yes, but our latest discussion was about single-quoted literals, which means evaluating the regexp at compile time, which would mean either a) having a more limited subset of the regexp functionality available for them or b) being able to somehow run code at compile time. (And if the answer would somehow surprisingly be "b", then why not go the extra mile and just make pluggable parsers?)

FroMage Aug 28, 2012

Member

Arg, yes, OK I see what you mean. Running code at compile-time. Mmmm… Not sure if that's a security risk or not, after all macros do it. But it does add a layer of complexity. Though we could restrict single-quoted type regexes to not contain any dynamic stuff.

quintesse Aug 28, 2012

Member

Ok, so we're now talking about two different things:

a) adding full-fledged reg-exp support to the language, and

b) having a sub-set of that available for single quoted literals

Of course if you go for "a" there might not be a problem anymore doing "b".

Still, personally, although I think regexp support is cool, I'm not sure it's worth the added complexity. I'd rather have some full-fledged ANTLR-like parser support and be able to do some really cool stuff (the extended Perl 6 support looks similar, though, so maybe that could be "it"; I'll have to read it a bit more thoroughly).

gavinking Aug 28, 2012

Member

> Arg, yes, OK I see what you mean. Running code at compile-time. Mmmm…

FTR, originally I had thought that the validation for single-quoted literal format would be done by a method. But I've never really got comfortable with the idea, and that's the only reason that the current proposal calls for use of regex-like validation.

gavinking Aug 28, 2012

Member

> I do agree that the syntax is confusing for those who don't know it, but on the other hand it's shared by 100% of the languages that use regexes, give or take a few features that some might not support.

This argument is considerably undermined by the fact that the Perl 6 regex syntax you just linked to is such a major departure from the traditional Unix / Perl 5 syntax.

FTR, I had never seen this stuff before, and it does actually go some way to addressing the confusion of "what is a metacharacter" that is one of the most broken things about the traditional regex syntax. See my post here:

#382 (comment)

OTOH, it doesn't address my other big objections:

  • we would need conforming implementations in Java and probably in JavaScript
  • we would need a proper specification as part of the Ceylon language spec

These problems aren't about syntax, but about semantics, i.e. that Perl regexes are simply a much too bloated and overcomplex solution to the very narrow range of problems they solve. Proof: ANTLR is a much simpler language that can do much, much, much more than Perl regexes can. ANTLR makes regexes look like a sad joke.

FroMage Aug 28, 2012

Member

> This argument is considerably undermined by the fact that the Perl 6 regex syntax you just linked to is such a major departure from the traditional Unix / Perl 5 syntax.

Well, yes. What I meant was that if we had to depart from the Perl 5 syntax, which is common across every language, then we should look at the Perl 6 one. It does address many issues.

> we would need conforming implementations

Indeed, we'd have to write one.

> we would need a proper specification as part of the Ceylon language spec

Yeah, that's a task in itself. That and the implementation.

As for ANTLR being more powerful than regexes, I hope so, because it was written to parse languages, while regexes were written to match patterns in strings and files, which is why they're so much more concise. With the additions that Perl 6 made with grammars and rules, I'm not so sure that ANTLR is still so much more powerful.

gavinking Aug 28, 2012

Member

@FroMage: sorry, but I didn't completely follow why you consider:

\ab*c\

better than:

re('ab*c')

Is it because of the interpolation? Well, FTR, I assume we will support interpolation for single-quoted strings just like we support it for double-quoted strings, for example:

re('(' word ')*')

> I always found it alien that Java didn't support regexes in the language, because that means silly quoting,

Not a problem for us, because we have single-quoted literals.

> no support for non-string values or expressions in the pattern,

I am working on the assumption that we will have these in single-quoted literals.

> and a generally confusing API.

This doesn't seem to be very relevant to us.

P.S. Perhaps irrelevant, but I think \ is a terrible quote character and immediately makes things less readable.

gavinking Aug 28, 2012

Member

> As for ANTLR being more powerful than regexes, I hope so because it was written to parse languages, while regexes were written to match patterns in strings and files, which is why they're so much more concise.

Are they really more concise?! I dispute that. The only reason they might look superficially less verbose is that you don't need as many quotes in a regex, and you can't use whitespace or define subrules. And to my way of thinking, these are terrible things.

FroMage Aug 28, 2012

Member

> but I didn't completely follow why you consider \ab*c\ better than re('ab*c')

First, it's /ab*c/ ;)

It's better because the quoting rules are different. In Java you get silly things like "\\w". That might be irrelevant for our single-quoted literals.
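
To spell out the quoting problem in Java (a small illustration; `EscapingDemo` is just a name I picked): the compiler consumes one level of backslashes before the regex engine ever sees the pattern, so every regex backslash has to be doubled in source.

```java
import java.util.regex.Pattern;

final class EscapingDemo {
    // The regex \w+ has to be written "\\w+" in Java source.
    static final Pattern WORD = Pattern.compile("\\w+");

    // Matching one literal backslash takes four backslashes in source:
    // the string literal "\\\\" is the two-character regex \\ .
    static final Pattern BACKSLASH = Pattern.compile("\\\\");

    public static void main(String[] args) {
        System.out.println(WORD.matcher("hello_42").matches()); // true
        System.out.println(BACKSLASH.matcher("a\\b").find());   // true
        System.out.println("\\w+".length());                    // 3
    }
}
```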

> Well, FTR, I assume we will support interpolation for single-quoted strings just like we support it for double-quoted strings.

But that interpolation doesn't let us embed code in the pattern, or arrays, because we'd first convert those values to Strings, no?

> I am working on the assumption that we will have these in single-quoted literals

Can you give an example?

> and you can't use whitespace or define subrules

WDYM? In Perl 6 regexes whitespace is non-matching by default (was a modifier in Perl 5), and you can define subrules. Even in Perl 5 you could name subrules within the regex and refer to them later on. In Perl 6 you can define them outside of the regex.

gavinking Aug 28, 2012

Member

> But that interpolation doesn't let us embed code in the pattern, or arrays, because we'd first convert those values to Strings, no?

You mean that I would have to write:

value alts = { "foo", "bar", "baz" };
value regex = re('(' alternatives(alts) ')*');

I guess I don't see that as a really big deal...

OK, I suppose one issue I can see with my idea of doing interpolation in single-quoted literals is that compile-time validation becomes a much more difficult problem. Hrm....
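
For reference, the `alternatives(alts)` helper in the snippet above is hypothetical; a rough Java equivalent could look like the sketch below. It assumes the alternatives are plain strings that should match literally, which is why each one goes through `Pattern.quote`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class Alternatives {
    // Hypothetical helper: turns a list of plain strings into a regex
    // fragment that matches any one of them literally.
    static String alternatives(List<String> options) {
        return options.stream()
                .map(Pattern::quote)               // escape metacharacters
                .collect(Collectors.joining("|"));
    }

    public static void main(String[] args) {
        List<String> alts = Arrays.asList("foo", "bar", "baz");
        Pattern regex = Pattern.compile("(" + alternatives(alts) + ")*");
        System.out.println(regex.matcher("foobarbaz").matches()); // true
        System.out.println(regex.matcher("fooqux").matches());    // false
    }
}
```

Because the list is only known at run time, a compiler could check the literal parts of the pattern but not the interpolated part.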

gavinking Aug 28, 2012

Member

> WDYM? In Perl 6 regexes whitespace is non-matching by default (was a modifier in Perl 5), and you can define subrules.

OK, fine, but what I'm saying is that if you actually go and take full advantage of this, you will wind up with regexes that are no more compact than an ANTLR grammar. I'm saying that the perceived "conciseness" is a direct result of the fact that regexes are usually written to be a totally unreadable string of mush.

FroMage Aug 28, 2012

Member

Perhaps. The other problem is Java regexes don't support pattern naming though they do support the non-significant whitespace flag. Oh and hey they support look-ahead and look-behind. Cool.
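
A quick sketch of those two Java features together, for anyone who hasn't used them. The class name, pattern and sample input are mine; `Pattern.COMMENTS` is the flag that makes whitespace in the pattern insignificant and allows `#` comments.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class JavaRegexFeatures {
    // With Pattern.COMMENTS, whitespace in the pattern is ignored and
    // everything from # to the end of the line is a comment.
    static final Pattern USD_AMOUNT = Pattern.compile(
            "(?<=\\$)    # look-behind: preceded by a dollar sign\n"
          + "\\d+        # the digits we actually want\n"
          + "(?=\\sUSD)  # look-ahead: followed by whitespace and USD\n",
            Pattern.COMMENTS);

    // Returns the first amount written like "$40 USD", or null if there is none.
    static String firstUsdAmount(String input) {
        Matcher m = USD_AMOUNT.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstUsdAmount("paid $15 EUR and $40 USD")); // 40
    }
}
```

The look-around parts constrain the match without becoming part of it, so only the digits are returned.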

RossTate Aug 28, 2012

Member

This reminds me a lot of what I had in mind for #83. In general, try to design our syntax so that these things can be expressed concisely using a library rather than something built into the language. For example, 'ab' * repeat('c') or something for a regex pattern. If you want to capture values, then use stuff like in #83.
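
A minimal Java sketch of that library style, to make the idea concrete. Everything here (`Rx`, `lit`, `repeat`, `then`, `or`) is a made-up API, and since Java has no operator overloading, `then` stands in for the `*` operator. The combinators simply build up regex source text.

```java
import java.util.regex.Pattern;

final class Rx {
    private final String source; // regex source text for this fragment

    private Rx(String source) {
        this.source = source;
    }

    // Matches the given text literally.
    static Rx lit(String text) {
        return new Rx(Pattern.quote(text));
    }

    // Zero or more repetitions of a fragment.
    static Rx repeat(Rx inner) {
        return new Rx("(?:" + inner.source + ")*");
    }

    // Concatenation: this fragment followed by the next one.
    Rx then(Rx next) {
        return new Rx(source + next.source);
    }

    // Alternation, grouped so it can be safely concatenated.
    Rx or(Rx other) {
        return new Rx("(?:" + source + "|" + other.source + ")");
    }

    Pattern compile() {
        return Pattern.compile(source);
    }

    public static void main(String[] args) {
        // The library-style counterpart of the regex ab*c... er, abc*:
        Pattern p = Rx.lit("ab").then(Rx.repeat(Rx.lit("c"))).compile();
        System.out.println(p.matcher("abccc").matches()); // true
        System.out.println(p.matcher("ac").matches());    // false
    }
}
```

The pattern is assembled when the program runs, so nothing is checked at compile time beyond ordinary type checking of the combinator calls.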

gavinking Aug 28, 2012

Member

> For example, 'ab' * repeat('c')

Well, this is the parser-combinator approach. Parser combinators are cool—though they can't do some things that every parser compiler does do—but in the context of single-quoted strings, it seems to me that they run into the same problem of essentially letting you run arbitrary code at compile time.

RossTate Aug 28, 2012

Member

I am confused. No arbitrary code runs at compile time in my example. A pattern matcher is generated at execution time. I overloaded * so that it concatenates two patterns. Ideally there would be some convenient syntax for memoizing this so that a new pattern matcher isn't being created every time, but I'm not requiring that.
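
On the memoizing point: without any special syntax, one way to avoid rebuilding the matcher on every use is a small cache keyed by the pattern source. This is only a sketch in Java with a made-up `PatternCache` class; it relies on compiled `java.util.regex.Pattern` instances being immutable and safe to share.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

final class PatternCache {
    private static final Map<String, Pattern> CACHE = new ConcurrentHashMap<>();

    // Compiles each distinct regex source once and reuses the result.
    static Pattern get(String regex) {
        return CACHE.computeIfAbsent(regex, Pattern::compile);
    }

    public static void main(String[] args) {
        Pattern first = get("ab(?:c)*");
        Pattern second = get("ab(?:c)*");
        System.out.println(first == second);                  // true, same instance
        System.out.println(first.matcher("abcc").matches());  // true
    }
}
```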

gavinking Aug 28, 2012

Member

> I am confused. No arbitrary code runs at compile time in my example. A pattern matcher is generated at execution time.

The main problem we're trying to solve is compile-time validation of single-quoted literal strings.

RossTate Aug 28, 2012

Member

Okay, I misunderstood what single-quoted literals are, and now I'm trying to understand what y'all intend them to be. Unfortunately the documentation is not informative at all:

> Single-quoted strings are used to express literal values for user-defined types

???

gavinking Aug 31, 2012

Member

So I've just figured out how you can make regexes work against interpolated strings. Imagine the interpolated string:

"I have the character " char ", the integer, " int ", and the float " float "."

This would match the pattern:

Pattern(' "I have the character " _ ", the integer, " integer ", and the float " float "." ')

Where:

  • _ matches any character, or an interpolated expression of type Character,
  • the rule integer matches a well-formed ceylon integer literal or an interpolated expression of type Integer, and
  • the rule float matches a well-formed ceylon floating point literal or an interpolated expression of type Float.

So, for a more realistic example, the following pattern would match dates:

Pattern(' integer "/" integer "/" integer ')

And could be used to validate shit as complex as:

date('1/' month '/' year '')
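
One way to read this proposal, sketched in Java under heavy assumptions: the compiler sees an interpolated literal as a list of parts (literal text, or a hole with a known static type) and checks that list against the pattern without evaluating anything. All names below (`TemplateCheck`, `Part`, `Elem`) are invented, only the `integer` rule is modelled, and a quoted literal in the pattern is assumed not to span two text parts.

```java
import java.util.Arrays;
import java.util.List;

final class TemplateCheck {
    // One piece of an interpolated literal: text, or a hole with a static type.
    static final class Part {
        final String text;   // non-null for literal text
        final Class<?> type; // non-null for an interpolated expression
        private Part(String text, Class<?> type) { this.text = text; this.type = type; }
        static Part text(String s) { return new Part(s, null); }
        static Part hole(Class<?> t) { return new Part(null, t); }
    }

    // One pattern element: a quoted literal, or the built-in rule `integer`.
    static final class Elem {
        final String literal; // null means "the integer rule"
        private Elem(String literal) { this.literal = literal; }
        static Elem lit(String s) { return new Elem(s); }
        static Elem integer() { return new Elem(null); }
    }

    static boolean matches(List<Elem> pattern, List<Part> parts) {
        int p = 0;   // index of the current part
        int off = 0; // offset inside the current text part
        for (Elem e : pattern) {
            while (p < parts.size() && exhausted(parts.get(p), off)) { p++; off = 0; }
            if (p == parts.size()) return false;
            Part part = parts.get(p);
            if (part.text == null) {
                // A hole is accepted only by the integer rule, and only if typed Integer.
                if (e.literal != null || part.type != Integer.class) return false;
                p++;
            } else if (e.literal != null) {
                if (!part.text.startsWith(e.literal, off)) return false;
                off += e.literal.length();
            } else {
                int start = off;
                while (off < part.text.length() && Character.isDigit(part.text.charAt(off))) off++;
                if (off == start) return false;
            }
        }
        // Anything left over must be fully consumed (or empty) text.
        while (p < parts.size() && exhausted(parts.get(p), off)) { p++; off = 0; }
        return p == parts.size();
    }

    private static boolean exhausted(Part part, int off) {
        return part.text != null && off == part.text.length();
    }

    public static void main(String[] args) {
        // Pattern(' integer "/" integer "/" integer ')
        List<Elem> datePattern = Arrays.asList(
                Elem.integer(), Elem.lit("/"), Elem.integer(), Elem.lit("/"), Elem.integer());
        // date('1/' month '/' year '') with month and year of type Integer
        List<Part> literal = Arrays.asList(
                Part.text("1/"), Part.hole(Integer.class), Part.text("/"),
                Part.hole(Integer.class), Part.text(""));
        System.out.println(matches(datePattern, literal)); // true
    }
}
```

Only the types of the holes are inspected, never their values, which is what keeps the check possible at compile time.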

chochos Aug 31, 2012

Member

Well, since you bring up dates... let's stick to ISO format for now, yyyy-mm-dd, so the first number can be an integer alright, but the second one has to be an integer between 1 and 12 and the last one has to be between 1 and 31, in order to validate at least the ranges (ignoring for now that the max number of days varies from month to month).
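
To illustrate the gap between range checking and real validation, here is a small Java comparison. It is a sketch: the class and method names are mine, and it assumes two-digit months and days as in ISO 8601. The regex enforces the 1..12 and 1..31 ranges but still accepts February 31st, while `java.time.LocalDate.parse` rejects it.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.regex.Pattern;

final class IsoDateCheck {
    // yyyy-mm-dd with month 01..12 and day 01..31, checked purely syntactically.
    static final Pattern ISO_DATE =
            Pattern.compile("\\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\\d|3[01])");

    static boolean looksLikeDate(String s) {
        return ISO_DATE.matcher(s).matches();
    }

    // Full validation, including month lengths and leap years.
    static boolean isRealDate(String s) {
        try {
            LocalDate.parse(s); // ISO yyyy-MM-dd with strict resolution
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(looksLikeDate("2012-02-31")); // true
        System.out.println(isRealDate("2012-02-31"));    // false
    }
}
```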

RossTate Aug 31, 2012

Member

So I'm still confused as to what you're trying to accomplish here. Given your most recent example, it seems like the stuff in #83 is along the lines of what you're trying to accomplish.

Parser<Integer> integer = ...;
Parser<Integer,Integer,Integer> datePattern1 = (integer "/" integer "/" integer).map((d,m,y) => (y,m,d));
Parser<Integer,Integer,Integer> datePattern2 = integer "-" integer "-" integer;

What am I misunderstanding?

Member

RossTate commented Aug 31, 2012

So I'm still confused as to what you're trying to accomplish here. Given your most recent example, it seems like the stuff in #83 is along the lines of what you're trying to accomplish.

Parser<Integer> integer = ...;
Parser<Integer,Integer,Integer> datePattern1 = (integer "/" integer "/" integer).map((d,m,y) => (y,m,d));
Parser<Integer,Integer,Integer> datePattern2 = integer "-" integer "-" integer;

What am I misunderstanding?
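
For readers unfamiliar with the parser-combinator style sketched in the snippet above, here is a small runnable approximation in Python. It is not the API from #83; Parser, seq, integer, date_pattern1 and date_pattern2 are names chosen for this sketch only, and map here assumes the parser yields a tuple that gets unpacked into the function's arguments.

```python
import re


class Parser:
    """Wraps run(text, pos) -> (value, new_pos), or None on failure."""

    def __init__(self, run):
        self.run = run

    def map(self, f):
        """Transform the parsed tuple by unpacking it into f."""
        def run(text, pos):
            result = self.run(text, pos)
            if result is None:
                return None
            value, new_pos = result
            return f(*value), new_pos
        return Parser(run)

    def parse(self, text):
        """Parse the whole text; return the value, or None on failure."""
        result = self.run(text, 0)
        if result is None or result[1] != len(text):
            return None
        return result[0]


_DIGITS = re.compile(r"[0-9]+")


def _run_integer(text, pos):
    m = _DIGITS.match(text, pos)
    return (int(m.group()), m.end()) if m else None


integer = Parser(_run_integer)


def seq(*items):
    """Sequence parsers and separator strings; separators are discarded."""
    def run(text, pos):
        values = []
        for item in items:
            if isinstance(item, str):
                if not text.startswith(item, pos):
                    return None
                pos += len(item)
            else:
                result = item.run(text, pos)
                if result is None:
                    return None
                value, pos = result
                values.append(value)
        return tuple(values), pos
    return Parser(run)


# d/m/y input reordered to (y, m, d), and y-m-d input kept as-is.
date_pattern1 = seq(integer, "/", integer, "/", integer).map(lambda d, m, y: (y, m, d))
date_pattern2 = seq(integer, "-", integer, "-", integer)
```

Both parsers produce a (year, month, day) tuple, so date_pattern1.parse("31/8/2012") and date_pattern2.parse("2012-08-31") give the same value. Unlike the pattern proposal, this approach runs library code to do the matching.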

gavinking commented Aug 31, 2012

@chochos Sure, in reality we would probably want to let you write stuff like

Pattern(' integer(1..31) "-" integer(1..12) "-" integer(0..9999) ') 

but I'm trying to keep it a bit simple for now.
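
To make the range-constrained rule concrete, here is a minimal Python sketch of what integer(1..31) "-" integer(1..12) "-" integer(0..9999) could mean when applied to literal text. The helper names ranged_integer, literal, sequence and is_valid_date are hypothetical, and the sketch says nothing about how interpolated expressions (whose values are unknown at compile time) would be handled.

```python
import re

_DIGITS = re.compile(r"[0-9]+")


def ranged_integer(lo, hi):
    """Matcher for a run of digits whose value lies in lo..hi inclusive."""
    def match(text, pos):
        m = _DIGITS.match(text, pos)
        if m is None or not lo <= int(m.group()) <= hi:
            return None
        return m.end()
    return match


def literal(s):
    """Matcher for the exact string s."""
    def match(text, pos):
        return pos + len(s) if text.startswith(s, pos) else None
    return match


def sequence(*matchers):
    """Apply matchers one after another; fail as soon as one fails."""
    def match(text, pos=0):
        for m in matchers:
            pos = m(text, pos)
            if pos is None:
                return None
        return pos
    return match


# Same order as the pattern above: day, month, year.
date = sequence(
    ranged_integer(1, 31), literal("-"),
    ranged_integer(1, 12), literal("-"),
    ranged_integer(0, 9999),
)


def is_valid_date(text):
    """True if the whole text matches the date pattern."""
    return date(text) == len(text)
```

This rejects inputs such as "32-12-2012" or "1-20-500000", but still accepts "30-02-2012", since per-month day limits are not expressed by simple ranges.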

gavinking commented Aug 31, 2012

> So I'm still confused as to what you're trying to accomplish here.

Ceylon is a language which lets you freely and naturally mix code, data, and metadata with authoring support. That means we need the ability to define embedded minilanguages (i.e. data formats) for stuff like dates, times, cron expressions, regexes, etc, and have compile-time syntactic validation and even autocompletion.

quintesse commented Aug 31, 2012

How could it validate stuff like date('1/' month '/' year '')? Isn't it supposed to be a literal, which means compile-time validation?

gavinking commented Aug 31, 2012

@quintesse I don't understand your question. It would be the typechecker validating the string template expression against the pattern. The typechecker knows what type the interpolated expressions have.

quintesse commented Sep 1, 2012

So we're only validating them to be structurally more or less similar to a date? date('1/' 20 '/' 500000 '') would be seen as valid then I guess, so is it worth the trouble, because it doesn't seem to add much?

gavinking commented Sep 1, 2012

@quintesse I have no idea what you're trying to say. What does "structurally" mean? How is "structurally" different to "syntactically"?

FroMage commented Sep 1, 2012

Trying to understand what you are proposing. This is a pattern matching syntax where characters have to be in double quotes, integers are matched by a rule named integer, rather than [0-9]+, etc.?

That's a pattern matching syntax you're proposing here right? Just making sure I understand what you're talking about.

gavinking commented Sep 1, 2012

> Trying to understand what you are proposing. This is a pattern matching syntax where characters have to be in double quotes, integers are matched by a rule named integer, rather than [0-9]+, etc.?

Right, it's essentially the syntax of an ANTLR lexer grammar, with some built-in rules.

RossTate commented Sep 2, 2012

Could someone explain in detail what y'all are trying to accomplish? I'd like to contribute, but I feel like there is some prior context that y'all are working in that I'm not informed about.

gavinking commented Sep 2, 2012

@RossTate Like I replied in the other thread: the problem is that we want to be able to embed literal dates, times, regexes, cron patterns, URLs, file system paths, CSS stuff, etc, in a bit of declarative code and have the format validated at compile time. This is an open-ended list of formats, where libraries can add their own formats / mini-languages.

RossTate commented Sep 2, 2012

You want to be able to do all that without allowing user/library code to be run at compile time?!

gavinking commented Sep 2, 2012

> You want to be able to do all that without allowing user/library code to be run at compile time?!

Right. This option is especially good for the IDE, which might be able to use the pattern in interesting ways.

FroMage commented Sep 3, 2012

If this pattern matching syntax is indeed not going to be our regex syntax, then I have no objections to it at all. We just have to make sure that whatever we end up using will work for any literal we can think of.

Can it be used to validate URIs?

The RFC (http://tools.ietf.org/html/rfc3986#appendix-B) gives ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))? as a way to split a URI, though not to validate it. http://stackoverflow.com/questions/30847/regex-to-validate-uris proposes /^([a-z0-9+.-]+):(?://(?:((?:[a-z0-9-._~!$&'()*+,;=:]|%[0-9A-F]{2})*)@)?((?:[a-z0-9-._~!$&'()*+,;=]|%[0-9A-F]{2})*)(?::(\d*))?(/(?:[a-z0-9-._~!$&'()*+,;=:@/]|%[0-9A-F]{2})*)?|(/?(?:[a-z0-9-._~!$&'()*+,;=:@]|%[0-9A-F]{2})+(?:[a-z0-9-._~!$&'()*+,;=:@/]|%[0-9A-F]{2})*)?)(?:\?((?:[a-z0-9-._~!$&'()*+,;=:/?@]|%[0-9A-F]{2})*))?(?:#((?:[a-z0-9-._~!$&'()*+,;=:/?@]|%[0-9A-F]{2})*))?$/i.

Note that this is a particularly good example both of how insane regexes can get when written this way, and of why being able to support named sub-rules is key (they are not used in that example for some reason).
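
To illustrate the point about named sub-rules, here is a small Python sketch that uses the Appendix B regex quoted above to split a URI, and then rebuilds the same regex from named pieces. The function and constant names are made up for this sketch, and, as noted above, this only splits a URI reference; it does not validate it.

```python
import re

# The splitting regex from RFC 3986, Appendix B, with numbered groups.
RFC3986_SPLIT = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri(uri):
    """Split a URI reference using the numbered groups 2, 4, 5, 7 and 9."""
    m = RFC3986_SPLIT.match(uri)  # every part is optional, so this always matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


# The same regex assembled from named sub-rules (Python named groups).
SCHEME = r"(?P<scheme>[^:/?#]+)"
AUTHORITY = r"(?P<authority>[^/?#]*)"
PATH = r"(?P<path>[^?#]*)"
QUERY = r"(?P<query>[^#]*)"
FRAGMENT = r"(?P<fragment>.*)"

NAMED_SPLIT = re.compile(
    rf"^(?:{SCHEME}:)?(?://{AUTHORITY})?{PATH}(?:\?{QUERY})?(?:#{FRAGMENT})?"
)


def split_uri_named(uri):
    """Split a URI reference using the named groups."""
    return NAMED_SPLIT.match(uri).groupdict()
```

For the RFC's own example, http://www.ics.uci.edu/pub/ietf/uri/#Related, both functions yield scheme http, authority www.ics.uci.edu, path /pub/ietf/uri/, no query, and fragment Related. The named version is arguably easier to read and to extend, which is the argument for sub-rules made above.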

gavinking commented Nov 17, 2012

We're not going to have time to do regexes in Ceylon 1.0.

gavinking commented Oct 19, 2014

#902 is a much better option than stupid broken regexes.
