Decide whether Parsimonious is for Unicode, bytestrings, or both #31

erikrose opened this Issue May 9, 2013 · 0 comments



erikrose commented May 9, 2013

First, we should probably stop supporting the re.L flag; it's unreliable and worse than re.U, as observes.

To simplify things and make the API work uniformly across Python 2 and 3, I propose we adopt the convention from Python 3's re module: grammars defined in Unicode can match only Unicode strings, and those defined in bytestrings can match only bytestrings. We drop support for the re.U flag and let Unicode-ness be determined at Grammar construction time by the type of string passed in. We could support re.A, but I'd be content to make people spell out explicitly what they mean by \s, \w, and \d. (What about '\b'?)
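For reference, here is a small sketch of the Python 3 re behavior that the proposal borrows. It uses only the standard library and is not Parsimonious code; the variable names are made up for illustration.

```python
import re

# str pattern: \w is Unicode-aware by default in Python 3.
text_word = re.compile(r"\w+")
# bytes pattern: \w covers only ASCII word characters.
bytes_word = re.compile(rb"\w+")
# str pattern with re.A (re.ASCII): \w is restricted to ASCII.
ascii_word = re.compile(r"\w+", re.ASCII)

sample = "héllo"

unicode_match = text_word.match(sample).group()
bytes_match = bytes_word.match(sample.encode("utf-8")).group()
ascii_match = ascii_word.match(sample).group()


def mixing_raises():
    """Return True if a str pattern refuses to match a bytes subject."""
    try:
        text_word.match(b"hello")
    except TypeError:
        return True
    return False


# \b follows the same notion of "word character" as \w, so it
# changes meaning along with the Unicode/ASCII mode.
boundary_unicode = re.search(r"\bé", "é") is not None
boundary_ascii = re.search(r"\bé", "é", re.ASCII) is not None
```

The last two lines bear on the '\b' question: its behavior is tied to whatever \w means in the current mode, so it would probably need the same explicit treatment as \s, \w, and \d.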

To support the naive use of grammars, we can try to promote bytestrings to Unicode if an attempt is made to parse them with a Unicode grammar. But people defining grammars should know better.
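A rough sketch of what that promotion might look like, written for Python 3. The helper name promote_for_grammar, the grammar_is_unicode flag, and the UTF-8 default are all assumptions for illustration, not existing Parsimonious API. Refusing Unicode input for a bytestring grammar is also an assumption, chosen to mirror Python 3's re.

```python
def promote_for_grammar(text, grammar_is_unicode, encoding="utf-8"):
    """Return text in the string type the grammar expects.

    Hypothetical helper: bytestrings are decoded when the grammar
    is Unicode; Unicode text is never silently encoded for a
    bytestring grammar.
    """
    if grammar_is_unicode and isinstance(text, bytes):
        # Naive-use convenience: assume an encoding and decode.
        return text.decode(encoding)
    if not grammar_is_unicode and isinstance(text, str):
        # Assumed policy: guessing an encoding here would hide bugs.
        raise TypeError("cannot parse Unicode text with a bytestring grammar")
    return text
```

The assumed encoding is the weak point: any default will be wrong for somebody, which is one more reason for grammar authors to pass the right type in the first place.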

Remember to address ParseError.line() and column(), which at the moment assume '\n' will be a bytestring in Python 2 and a Unicode string in Python 3.
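One possible way to make the line and column calculations independent of string type is to pick the newline literal from the type of the text being parsed. This is a sketch for Python 3 with hypothetical free functions (line_of, column_of, _newline_for), not the current ParseError implementation.

```python
def _newline_for(text):
    """Return a newline of the same string type as text."""
    return b"\n" if isinstance(text, bytes) else "\n"


def line_of(text, pos):
    """1-based line number of offset pos in text."""
    return text.count(_newline_for(text), 0, pos) + 1


def column_of(text, pos):
    """1-based column number of offset pos in text."""
    # rfind returns -1 when there is no earlier newline, which
    # conveniently makes the first line's columns come out 1-based.
    last_newline = text.rfind(_newline_for(text), 0, pos)
    return pos - last_newline
```

The same functions then work for both kinds of input, e.g. line_of("ab\ncd", 4) and line_of(b"ab\ncd", 4) agree.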

erikrose added a commit that referenced this issue Jun 6, 2013

WIP. Try to figure out our Unicode/bytestring story. Go a bit too nuts with the unicode=True bits. Ref #31.

erikrose added this to the 1.0 milestone Jul 14, 2014
