anarchodin/pct-coding

pct-coding — A set of functions to deal with percent-encoded data

Percent-encoding is a part of the IETF URI standard, RFC3986. It is usually viewed—and is arguably primarily defined—as an escaping mechanism for URIs to permit characters such as ’/’ to appear without taking on their normal syntactic significance. The percent-encoding mechanism has also, historically, been used to embed non-ASCII characters from various single-byte encodings into URIs. This further developed into the use of percent-encoded UTF-8 sequences embedded in URIs, a usage formally specified in RFC3987. Dealing with URIs as anything more than simple strings more or less requires coping with this mechanism, whether it is to escape the characters not allowed to appear in URIs or to translate a substring of a URI or IRI into useful form.

There is a complication, however. Sequences of percent-encoded bytes do not have to be valid UTF-8, in either URIs or IRIs. The simple way to deal with this problem is to ignore it: decode everything as UTF-8 anyway and, if it contains something invalid, just barf. This works for most situations. Another approach is to translate IRIs to URIs and never try to go the other way. The underlying problem is that percent-encoding doesn’t encode character strings; it encodes arbitrary binary data.

Me, I wanted to normalise arbitrary IRIs, which involves decoding the valid UTF-8 sequences, but leaving invalid bytes encoded. I found no existing percent-encoding tool that made this possible. So I wrote one. It exports three functions and one constant. They all have what I believe to be decent docstrings, but here’s a summary anyway:

pct-decode takes a percent-encoded string and turns it into an octet-vector. It also takes two keyword arguments, :encoding and :reserved. :encoding controls how characters not allowed to appear directly in URIs are coded into bytes. It uses Babel, so babel:list-character-encodings tells you what you can use. The default is UTF-8, which produces IRIs. :reserved is a mechanism for normalisation: It takes a sequence of characters whose percent-encodings should not be decoded, but left in the byte sequence as-is.
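To make the :reserved idea concrete, here is a minimal sketch in plain Common Lisp. It is not the library's code: `sketch-decode-reserving` is a hypothetical name, the input is assumed to be ASCII-only with well-formed %XX triplets, and there is no :encoding handling at all.

```lisp
(defun sketch-decode-reserving (string reserved)
  "Percent-decode STRING into a vector of octets. A %XX triplet that encodes
a character found in RESERVED is copied through as three literal bytes.
Assumes STRING is ASCII-only and every % is followed by two hex digits."
  (let ((octets (make-array (length string)
                            :element-type '(unsigned-byte 8)
                            :fill-pointer 0))
        (i 0))
    (loop while (< i (length string))
          do (let ((ch (char string i)))
               (if (char= ch #\%)
                   (let ((octet (parse-integer string :start (+ i 1)
                                                      :end (+ i 3)
                                                      :radix 16)))
                     (if (find (code-char octet) reserved)
                         ;; Reserved: keep the "%", and both hex digits, as bytes.
                         (loop for j from i below (+ i 3)
                               do (vector-push (char-code (char string j)) octets))
                         ;; Otherwise the triplet collapses to the byte it names.
                         (vector-push octet octets))
                     (incf i 3))
                   (progn
                     ;; Unescaped ASCII characters stand for their own code.
                     (vector-push (char-code ch) octets)
                     (incf i)))))
    octets))

;; "%2F" survives as the bytes of "%2F"; "%C3%A9" becomes two raw bytes.
(sketch-decode-reserving "a%2Fb%C3%A9" "/")
```

With an empty reserved string, every triplet is decoded, including bytes such as %FF that are not valid UTF-8. That is why the result is an octet-vector rather than a string.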

pct-encode takes an octet-vector and turns it into a percent-encoded string. It also takes three keyword arguments, :iri, :reserved and :ignore-existing. :iri controls whether or not to attempt to reconstitute UTF-8 sequences into Unicode characters in the resulting string. It defaults to t, which makes the output valid as an IRI but not as a URI. :reserved is essentially the opposite of the decoding version: It’s a sequence of characters that should be percent-encoded in the resulting string, rather than appearing directly. It defaults to +uri-reserved+. Finally, :ignore-existing leaves any sequence in the input that is already percent-encoded untouched in the result string.
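The sketch below shows what reconstituting valid UTF-8 while leaving invalid bytes encoded can look like. It is an illustration under stated assumptions, not the library's implementation. Both function names are hypothetical, :reserved and :ignore-existing are ignored entirely, and it assumes a Lisp whose characters cover all of Unicode (SBCL, for instance).

```lisp
(defun sketch-utf8-at (octets i)
  "Return the code point and byte length of a valid UTF-8 sequence starting
at index I of OCTETS, or NIL if the bytes there are not valid UTF-8."
  (let* ((b0 (aref octets i))
         (len (cond ((< b0 #x80) 1)
                    ((<= #xC2 b0 #xDF) 2)
                    ((<= #xE0 b0 #xEF) 3)
                    ((<= #xF0 b0 #xF4) 4)))
         (code (case len
                 (1 b0)
                 (2 (logand b0 #x1F))
                 (3 (logand b0 #x0F))
                 (4 (logand b0 #x07)))))
    (when (and len (<= (+ i len) (length octets)))
      ;; Fold in the continuation bytes, each of which must look like 10xxxxxx.
      (loop for j from (1+ i) below (+ i len)
            for b = (aref octets j)
            do (if (= (logand b #xC0) #x80)
                   (setf code (logior (ash code 6) (logand b #x3F)))
                   (return-from sketch-utf8-at nil)))
      ;; Reject overlong forms, surrogates and values beyond U+10FFFF.
      (when (and (>= code (aref #(0 #x80 #x800 #x10000) (1- len)))
                 (not (<= #xD800 code #xDFFF))
                 (<= code #x10FFFF))
        (values code len)))))

(defun sketch-octets-to-iri (octets)
  "Render OCTETS as a string: valid UTF-8 becomes characters, the rest %XX."
  (with-output-to-string (out)
    (let ((i 0))
      (loop while (< i (length octets))
            do (multiple-value-bind (code len) (sketch-utf8-at octets i)
                 (cond ((and code (/= code #x25))
                        (write-char (code-char code) out)
                        (incf i len))
                       (t
                        ;; Invalid byte, or a literal percent sign.
                        (format out "%~2,'0X" (aref octets i))
                        (incf i))))))))

;; The bytes of "café" followed by a stray #xFF: the é is reconstituted,
;; and the #xFF stays percent-encoded.
(sketch-octets-to-iri #(99 97 102 195 169 255))
```

A real encoder also has to escape reserved characters and ASCII characters that may not appear in a URI at all. The sketch only handles the percent sign and invalid bytes.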

pct-normalize uses the two prior functions in a specific setup to normalise the provided string. It accepts :encoding with the semantics of decode, :iri with the semantics of encode, and :reserved with its own: A character, allowed in URIs, found in the sequence passed will neither be decoded nor encoded by the routine. That is: Reserved characters appearing directly in the source string will appear directly in the result, and those appearing encoded in the source string will remain encoded in the result. Its defaults are similar to those of the other functions: assume characters represent their UTF-8 sequences, return IRIs, and leave the URI reserved characters alone.
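The rule that reserved characters keep whatever form they arrived in can be illustrated with a deliberately simplified, ASCII-only sketch. `sketch-normalize-ascii` is a hypothetical name, not the library's function. It only decodes triplets that name RFC 3986 unreserved characters, never encodes anything, and assumes well-formed %XX triplets.

```lisp
(defun sketch-normalize-ascii (string &key (reserved ":/?#[]@!$&'()*+,;="))
  "Decode %XX triplets in STRING that name an unreserved ASCII character
(letters, digits, and - . _ ~) unless that character is in RESERVED.
Everything else, encoded or not, is copied through exactly as found."
  (with-output-to-string (out)
    (let ((i 0))
      (loop while (< i (length string))
            do (cond ((char= (char string i) #\%)
                      (let* ((octet (parse-integer string :start (+ i 1)
                                                          :end (+ i 3)
                                                          :radix 16))
                             (ch (code-char octet)))
                        (if (and (< octet 128)
                                 (or (alphanumericp ch) (find ch "-._~"))
                                 (not (find ch reserved)))
                            ;; Safe to show directly.
                            (write-char ch out)
                            ;; Reserved or non-ASCII: leave the triplet alone.
                            (write-string string out :start i :end (+ i 3))))
                      (incf i 3))
                     (t
                      ;; Characters appearing directly stay direct.
                      (write-char (char string i) out)
                      (incf i)))))))

;; The encoded slash stays encoded, the bare slash stays bare,
;; and the needlessly encoded "A" is decoded.
(sketch-normalize-ascii "a%2Fb/c%41")
```

Passing a character in :reserved pins its encoded form too, so `(sketch-normalize-ascii "%41" :reserved "A")` leaves the input unchanged.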

+uri-reserved+ is a string containing the characters reserved in URIs: colon, slash, question mark, hash, square brackets, commercial at, exclamation mark, dollar sign, ampersand, apostrophe (well, ASCII’s approximation thereto, anyway), parentheses, asterisk, plus, comma, semicolon, and the equals sign.
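For reference, the same set written out as a string, in the order RFC 3986 lists them (gen-delims, then sub-delims). The names below are hypothetical stand-ins, and the order of characters in the library's actual constant may differ.

```lisp
;; Stand-in for the library's constant: RFC 3986 gen-delims, then sub-delims.
(defparameter *sketch-uri-reserved* ":/?#[]@!$&'()*+,;="
  "The eighteen characters RFC 3986 classifies as reserved.")

(defun sketch-reserved-p (char)
  "True if CHAR is one of the RFC 3986 reserved characters."
  (and (find char *sketch-uri-reserved*) t))

(sketch-reserved-p #\/)   ; a delimiter, so reserved
(sketch-reserved-p #\a)   ; a letter, so not reserved
```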

About

Percent-coding for arbitrary binary data in Common Lisp.
