2 First Step: CSV Parser

David Dufresne edited this page Nov 26, 2015 · 5 revisions

Quick Tour

SwiftParsec provides, among other things, primitive parsers, combinator parsers, a few operators and basic String parsers. The primitive parsers can be used to modify other parsers by mapping their result, adapting error messages, etc. Combinators combine simple parsers to build sophisticated parsers. Operators help to keep the syntax simple and clear. And among the string parsers we find commonly used character matching functionality.

CSV Parsing

Following is an example of a comma-separated values (CSV) parser. The CSV format is defined in RFC 4180. It consists of data records represented by lines, each made of one or more fields separated by commas. Sophisticated CSV implementations permit special characters such as newlines, commas and double quotes inside fields, provided the fields containing them are enclosed in " (double quote) characters. An embedded double quote is represented by a pair of consecutive double quotes.

let noneOf = StringParser.noneOf

let quotedChars = noneOf("\"") <|>
    StringParser.string("\"\"").attempt *>
    GenericParser(result: "\"")

let character = StringParser.character

let quote = character("\"")
let quotedField = quote *> quotedChars.many.stringValue <*
    (quote <?> "quote at end of field")

let field = quotedField <|> noneOf("\r\n,\n\r").many.stringValue
let record = field.separatedBy(character(","))

let endOfLine = StringParser.crlf.attempt <|>
    (character("\n") *> character("\r")).attempt <|>
    character("\n") <|>
    character("\r") <?> "end of line"

let csv = record.separatedBy(endOfLine)

Let's analyse this parser line by line.

let noneOf = StringParser.noneOf

First the noneOf function of the StringParser type is assigned to a constant to save a bit of typing. StringParser is a type alias for GenericParser<String, (), Character>. Every parser is defined by 3 type parameters:

  • Stream: The type of the stream to parse.
  • UserState: The user supplied state passed around the combined parsers. It can be set to void if not used.
  • Result: The type of result the parser will return upon success.

let quotedChars = noneOf("\"") <|>
    StringParser.string("\"\"").attempt *>
    GenericParser(result: "\"")

Next we build the parser that will apply to characters enclosed within quotes. The noneOf function returns a parser that accepts any item not contained in the passed string. Following it is the '<|>' operator (pronounced 'choice'). It first tries the parser on the left; if that parser fails without consuming any input, it tries the parser on the right. In our example the parser on the right is defined using the StringParser.string function, the attempt primitive parser, the '*>' operator and a parser that always succeeds with the supplied result.

The parser returned by the string function matches the supplied string and returns it as its result. Its behaviour is modified by combining it with the attempt parser, which is used whenever arbitrary lookahead is needed. In this case, even if StringParser.string("\"\"") consumes only one double quote and then fails, it will pretend that it didn't consume anything. Often the attempt parser is used in conjunction with the '<|>' operator to allow the right parser to be executed even if the left parser failed after consuming some input (as we will see a bit further on).

But in this case, why use the attempt parser on the right side of the '<|>' operator? It seems there is no other parser to try afterwards. The reason is that the quotedChars parser will be applied repeatedly by the many combinator, and we don't want the parser to stop because it partly consumed some input and then failed. Used in conjunction with the many combinator it will give something like noneOf("\"") <|> StringParser.string("\"\"").attempt *> GenericParser(result: "\"") <|> noneOf("\"") <|> StringParser.string("\"\"").attempt *> GenericParser(result: "\"") <|> ... There is no need to use attempt with noneOf("\"") because it will never partly consume some input and fail. It either consumes one character or nothing.

The sequencing operator '*>' discards the value of the first parser and returns the value of the second parser. As a mnemonic, the result of the parser pointed to by the '>' is kept. So here we try to parse two double quotes and, if we succeed, we return one double quote.
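
To make the interplay between '<|>' and attempt more concrete, here is a toy model written in plain Swift. It is only a sketch of the idea described above: Reply, ToyParser, string, attempt, choice and describe are hypothetical names invented for this illustration, and the code makes no claim to mirror how SwiftParsec is actually implemented.

```swift
// Toy model for illustration only; none of these types belong to SwiftParsec.
enum Reply<T> {
    case success(T, rest: Substring)
    case failure(consumedInput: Bool)
}

struct ToyParser<T> {
    let parse: (Substring) -> Reply<T>
}

// Matches `expected` one character at a time, so it can fail
// after having already consumed part of the input.
func string(_ expected: String) -> ToyParser<String> {
    ToyParser { input in
        var rest = input
        var consumed = false
        for ch in expected {
            guard rest.first == ch else { return .failure(consumedInput: consumed) }
            rest = rest.dropFirst()
            consumed = true
        }
        return .success(expected, rest: rest)
    }
}

// attempt: any failure is reported as if no input had been consumed.
func attempt<T>(_ parser: ToyParser<T>) -> ToyParser<T> {
    ToyParser { input in
        let reply = parser.parse(input)
        if case .failure = reply { return .failure(consumedInput: false) }
        return reply
    }
}

// choice: the right parser is tried only when the left one
// fails without having consumed any input.
func choice<T>(_ left: ToyParser<T>, _ right: ToyParser<T>) -> ToyParser<T> {
    ToyParser { input in
        let reply = left.parse(input)
        if case .failure(consumedInput: false) = reply { return right.parse(input) }
        return reply
    }
}

func describe<T>(_ reply: Reply<T>) -> String {
    switch reply {
    case .success(let value, _):
        return "matched \(value)"
    case .failure(let consumedInput):
        return consumedInput ? "failed after consuming input" : "failed without consuming input"
    }
}

let input = Substring("ac")

// Without attempt: "ab" fails after consuming "a", so "ac" is never tried.
let plain = choice(string("ab"), string("ac"))
print(describe(plain.parse(input)))        // failed after consuming input

// With attempt: the failure looks as if nothing was consumed, so "ac" is tried.
let withAttempt = choice(attempt(string("ab")), string("ac"))
print(describe(withAttempt.parse(input)))  // matched ac

// The situation from quotedChars: a lone quote is not an escaped quote.
let lone = Substring("\",")
print(describe(string("\"\"").parse(lone)))           // failed after consuming input
print(describe(attempt(string("\"\"")).parse(lone)))  // failed without consuming input
```

The last two lines show the case discussed for quotedChars: without attempt, the failed match on a single closing quote would count as consumed input.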

let character = StringParser.character

Again we save ourselves a bit of typing by assigning StringParser.character to a constant. This parser matches the passed character.

let quote = character("\"")
let quotedField = quote *> quotedChars.many.stringValue <*
    (quote <?> "quote at end of field")

Here we build a parser that will apply to fields defined using double quotes. First we parse a double quote and discard the result. Then the quotedChars parser is applied as many times as possible, until it fails, using the many combinator. many applies the combined parser, zero or more times, and returns an array of the returned values of the combined parser. So the result will be an array of Characters, but we want a String. That's why stringValue is used to convert the array of characters to a String. Lastly, another double quote is parsed and discarded. '<*' does the same thing as '*>' except that it keeps the value on its left. Also the '<?>' operator is used to replace the default error message with our own. This way if the parser fails at this point it will provide a more meaningful message, something like ' expecting quote at end of field ' instead of ' expecting """ '
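
For comparison, the same steps can be written by hand in plain Swift. This is a rough sketch rather than anything SwiftParsec provides: readQuotedField is a hypothetical helper, and it returns nil where the real parser would report an error message.

```swift
// Hand-written equivalent of the quotedField logic, for comparison only.
// `readQuotedField` is a hypothetical helper, not a SwiftParsec function.
func readQuotedField(_ input: Substring) -> (field: String, rest: Substring)? {
    var rest = input
    guard rest.first == "\"" else { return nil }   // opening quote, discarded
    rest = rest.dropFirst()

    var characters: [Character] = []               // what `many` would collect
    while let ch = rest.first {
        if ch != "\"" {
            characters.append(ch)                  // ordinary character
            rest = rest.dropFirst()
        } else if rest.dropFirst().first == "\"" {
            characters.append("\"")                // "" stands for one quote
            rest = rest.dropFirst(2)
        } else {
            break                                  // lone quote: end of the field
        }
    }

    guard rest.first == "\"" else { return nil }   // expecting quote at end of field
    return (String(characters), rest.dropFirst())  // [Character] -> String
}

if let result = readQuotedField(Substring("\"John \"\"Richard\"\"\",000-000-0000")) {
    print(result.field)  // John "Richard"
    print(result.rest)   // ,000-000-0000
}
```

The combinator version expresses the same loop, the same escape rule and the same closing check in a single expression, and it produces a descriptive error instead of nil.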

let field = quotedField <|> noneOf("\r\n,\n\r").many.stringValue

Now we can define a parser that will parse quoted fields and 'regular' fields. Using the '<|>' operator we first try to parse a quoted field, then a regular field. A regular field can contain any character that is not a comma, a carriage return or a newline (\r\n is also included because it is considered a single character in a Swift String, i.e. "\r\n".count == 1). We do not need to use the attempt combinator with quotedField because a regular field cannot contain a double quote. It means that when quotedField fails after having consumed some input, the CSV data is malformed.
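
The remark about "\r\n" can be checked with the standard library alone. The short snippet below uses no SwiftParsec API; it only shows how Swift groups these characters, which is why the string passed to noneOf lists "\r\n" in addition to "\n" and "\r".

```swift
let crlf = "\r\n"
print(crlf.count)                 // 1: CR followed by LF forms one Character
print(crlf.unicodeScalars.count)  // 2: it is still two Unicode scalars
print("\n\r".count)               // 2: LF followed by CR stays two Characters

// The argument given to noneOf therefore holds four distinct Characters.
let excluded = Array("\r\n,\n\r")
print(excluded.count)             // 4: "\r\n", ",", "\n" and "\r"

// A lone "\r" does not compare equal to the combined "\r\n" Character.
let carriageReturn: Character = "\r"
print(Array("a\r\nb").contains(carriageReturn))  // false
```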

let record = field.separatedBy(character(","))

A record is simply made of fields separated by commas. This is defined by the parser field.separatedBy(character(",")) that will parse zero or more fields separated by commas. This parser will return an array of Strings, each string corresponding to a field.

let endOfLine = StringParser.crlf.attempt <|>
    (character("\n") *> character("\r")).attempt <|>
    character("\n") <|>
    character("\r") <?> "end of line"

There is one record per line, so we have to create a parser that will match all possible endings of a line. To do so we build a parser that will match any combination of the two characters '\r' and '\n'. crlf parses a carriage return followed by a newline.

let csv = record.separatedBy(endOfLine)

We finish with a parser that will apply to zero or more records separated by the endOfLine parser. The final parser will return an array containing arrays of strings ([[String]]).

Remark: Do not forget to add import SwiftParsec at the top of your file. Also, to test the code in a playground you might have to wrap it in a function, or add type annotations to a few variables, to prevent it from crashing (for now generics + type inference == flimsy swiftc, but I am sure it will be fixed). Something like this:

func csvParser() -> GenericParser<String, (), [[String]]> {
    ...
    return record.separatedBy(endOfLine)
}

Passing the following input as csvData to csvParser().test(csvData):

Last Name,First Name,Phone
Appleseed,Johnny,123-456-7890
Doe,John,234-567-8901
Doe,Jane,345-678-9012
Roe,"John ""Richard""",000-000-0000
Roe,"Jane, J.",111-111-1111

will return:

[
    ["Last Name", "First Name", "Phone"],
    ["Appleseed", "Johnny", "123-456-7890"],
    ["Doe", "John", "234-567-8901"],
    ["Doe", "Jane", "345-678-9012"],
    ["Roe", "John \"Richard\"", "000-000-0000"],
    ["Roe", "Jane, J.", "111-111-1111"]
]

In this example we have built a robust CSV parser with only 15 lines of code. It shows that learning how SwiftParsec works is worth the time. And we have barely scratched the surface of what the library can do (this is where your jaw should drop). If you want to deepen your knowledge of SwiftParsec you can read the tutorial Digging Deeper - JSON Parser.