Decide on metadata header (or line format) for ABIF #6
Comments
Well thought through! I am in favor of the one-pass parser option, with the structure simple enough so that no additional parsing libraries such as YAML or JSON are necessary. Therefore, I see the benefit of a clear order a3 (e-mail-style headers), b3 (candidate mappings), b1 interspersed with b2 (ballot groupings with comments). The only use-case for interspersing metadata with ballots would IMHO be designating special ballot classes (such as ballots with pending recount); if that is desired, the headers (metadata) would have to start with an alphabetical character to distinguish them from ballots, and the "global" headers affecting the whole ABIF file would still have to come before the first ballot line.
My intended preference is to allow using your ABIF format to supply only the vote-count numbers and the letters A, B, C, etc. as placeholders for the candidates, and using a case number (or serial number if that's the wording you prefer) that associates this case with a separate file that contains (in any format!) candidate names, political parties, election title, and whatever other metadata is desired. This allows your format to be used in real elections where interested voters can verify that the winner-calculation software has absolutely no knowledge of who the candidates are. That way anyone can verify that the calculation software does not have any bias -- such as political party names or even whether a candidate name contains accented characters. Also, for my purposes, it allows my calculation software to ignore the character encoding and just assume ASCII (not Unicode, etc.). FYI I use my Dashrep programming language to handle text, and that's where the candidate names, etc. are. But I need a unique number that associates the two software paths (instead of relying on filenames, which is not reliable in real elections). As another benefit, when discussing specific cases on the EM forum, the case number can be changed when the vote numbers change, which avoids having to ask "did you mean this set of vote counts or this other set of vote counts?" |
My experience with hybrid (multi-) file formats is, diplomatically speaking, mixed: accompanying files get lost or misnamed, etc. A good example of this is the ESRI shapefile format, the dubious de facto standard for spatial data. I would try to avoid that if at all possible.
All of the important data should be in a single file. I think it might look something like ABIF 0.001 BTW, There should be a way of preserving optional subdivision data in the ballots section, but I think it will be clearer how to do this when more of the format has been worked out. |
As I've been thinking about this, I realize that the first four bytes of every line should be special. The table I'm envisioning is below. Note that periods ("
I also believe that the first line of the file should be different than every subsequent line, though I still think it should be possible to have nothing but ballot bundle lines. There's a lot to respond to here in the comments from the past 24 hours (thank you, everyone!). This comment isn't a full reply to the ideas raised by other folks, but it was inspired by thinking about a response. There is a lot more to do.... |
I believe this issue is becoming the meta-issue for the answer to the question "what is the structure of an ABIF file?" I would like very few requirements, but here are the requirements that I would like for the v1.0 version of this specification:
The specification may include an easily-recognized (and optional) version string as the first line of the document, which may or may not include option information and/or well-structured metadata about the file. For example, this would allow for the "case number" (as @cpsolver calls it) or "serial number" (as I would currently prefer to call it). One of us may also want to create an issue dedicated to this optional first line, pulling in bits from this issue. I recently created issue #8 to discuss candidate token declarations. As I stated over there:
What I also think: we can make life easier on implementors by having a soft requirement for line ordering, thus having implicit sections. As much as I like RFC-5322-formatted headers from a simplicity and flexibility perspective, I suspect they are too flexible for this new format. My hunch is that adopting unmodified RFC-5322-formatted header lines will make it difficult to expand the format later, since the lines can begin with pretty much any ASCII alphabetic character. I like NDJSON, with a soft requirement that there be only one key-value pair per line (and maybe optional comments at the end of the line), and a strongly-recommended ordering. So, for example, we could allow for this to be a valid ABIF file:
...and we could declare that to be equivalent to the following version with more comments and whitespace:
The rationale for the "soft requirements" is to provide implementors guidance on what will always be "safe", while at the same time not choosing between a couple of different obvious paths until implementors (and common usage) force us to choose one or the other for a future version. We may (in the future) want to allow for easy concatenation of many ABIF files with lines with sections that have candidate declarations, then ballot bundles, then more candidate declarations, and maybe some metadata interspersed for various reasons. But we can make soft requirements of ordering so that we don't have to expend a lot of mental energy on every possible combination of weird loopholes in the spec, and give ourselves the rationalization to tell implementors who try to exploit loopholes that "you really shouldn't have implemented that in the first place", and encourage folks to alert us when they are planning to exploit a loophole or extend the format in weird ways. More to think about....
I agree with Robla's bullet points except for bare bundle. Some identification of the election should be a minimum. Parsers would need to guess from the first ballot line. For example Vote::Count needs to translate equal ranking rcv ballots to range ballots, if the first ballot looked like a normal rcv ballot the data type would be guessed wrong. If only one line has an '=' is it a typo or did only one voter give an equal ranking? Enforcing some metadata can also help humans writing the files by making it easier for validators to detect errors. If a choices list is required, then for example any mistyped entries would be undefined choices. Developers will want to use the choice_identifiers as keys, allowing the [long_description] in the data is asking for trouble. If the choices list is going to be optional, when skipped the data should still be required to be legal as identifier strings. Also humans writing these files by hand want less mandatory metadata to save typing, but using short identifiers instead of long strings that require multi-key combos to enter will save typing, so this isn't really a burden on human file creators. P.S. If I wrote Vote::Count again I would represent rcv ballots as range (with negative scores to invert the order) instead of having two different internal data structures. |
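To make the P.S. above concrete, here is a minimal sketch of representing a ranked ballot as a range ballot with negative scores. It is only an illustration, not how Vote::Count works: the function name `ranked_to_scores` and the "list of tiers" input shape are assumptions made for this example, and unranked choices are simply left out because the thread does not say how they should be scored.

```python
def ranked_to_scores(tiers):
    """Map a ranked ballot to range-style scores (hypothetical helper).

    `tiers` lists the ballot from most to least preferred; each tier is a
    list of choice identifiers that share the same rank. Rank n becomes
    score -n, so a higher score still means "more preferred".
    """
    scores = {}
    for rank, tier in enumerate(tiers, start=1):
        for choice in tier:
            if choice in scores:
                raise ValueError(f"choice {choice!r} appears more than once")
            scores[choice] = -rank
    return scores


# A ballot ranking A first, then B and C equally, then D.
ballot = [["A"], ["B", "C"], ["D"]]
print(ranked_to_scores(ballot))  # {'A': -1, 'B': -2, 'C': -2, 'D': -3}
```

With this representation an equal ranking is just two choices with the same score, so a single internal data structure can hold both kinds of ballot.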
@brainbuz wrote:
I learned a lot working with the W3C back in the late 1990s and early 2000s. A few things:
There are plenty of use cases (including the use cases I would like to implement one day) that do NOT require any of the metadata you are describing. We have been discussing election methods on the EM-list for 25 years without providing a lot of metadata that you suggest should be a requirement. I disagree. I think this should be a complete, valid ABIF file:
Most people who have studied electoral reform long enough probably know what that election is, and I'm guessing you do too @brainbuz. For others reading this that don't get the reference, see any of the Wikipedia articles that use the "Tennessee voting example". While there's plenty of interesting metadata that could be included, it's not necessary to communicate the point of the electoral example. I believe that a fully compatible ABIF implementation should be able to read that example. Do implementations need to be fully compatible in order to be useful? No, of course not. There are almost no implementations of SVG (for example) that are able to pass the SVG test suite with 100% compatibility, and those implementations may not be the most useful implementations. Moreover, there are many very useful SVG implementations (e.g. Inkscape) that read and write SVG files that include extra metadata not specified by the SVG specification. That's okay! SVG has been pretty darn useful without rigid adherence to a specification. Moreover, XML remains a useful specification, even though XHTML no longer gets the attention that it once did, and mainstream browsers seem to prefer HTML5. Back to the subject of ABIF: I think it would be perfectly valid and useful for someone to write an ABIF linter that recommends the addition of metadata, and perhaps even requires the metadata if given a command line switch. That would help implementors needing that metadata in the files they use. However, I don't think the base specification should require an identification of the election. It would be fine for your implementation to require an election identifier, but I don't want to try requiring that the people who build authoring tools (and people who write ABIF by hand) put an election identifier if they don't want to.
If the parser has to be able to determine the ballot type from the first line, then the default allow_equal_ranking for scored should be yes and for rcv should be no. How can the parser tell if the ballots are approval or plurality from the first line?
Short answer: the parser won't know, and that's okay. Longer answer: the user of the software will know what they want to do with the ABIF file. Sometimes they will use a command line switch or the user interface to indicate that they want a plurality (or "choose-one") count. Then the software may encounter a ballot with more than one candidate listed. The implementor of the software should choose what should happen with that ballot based on the rules of the election in that locality. Are those spoiled ballots? Should the ballots be recounted using approval voting? It's not our decision (as the creators of ABIF) to make. |
In votelib, I (at least provisionally) chose to parse both approval and plurality ballots as approval ballots, and only determine at the end that the ballots are plurality by checking that all of them contain at most one option. Otherwise, I encountered no problems in determining ballot types from each line independently. |
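The "decide at the end" approach described above can be sketched as follows. This is not votelib's actual API: the function names and the representation of a ballot as a set of option tokens are assumptions made for illustration.

```python
def read_as_approval(ballots):
    """Treat every ballot as an approval ballot: a set of approved options."""
    return [frozenset(options) for options in ballots]


def is_plurality(approval_ballots):
    """True when every ballot approves at most one option.

    Only after all ballots have been read can we say whether the file
    could also be interpreted as plurality ("choose-one") ballots.
    """
    return all(len(ballot) <= 1 for ballot in approval_ballots)


parsed = read_as_approval([["A"], ["B"], [], ["A"]])
print(is_plurality(parsed))  # True

parsed = read_as_approval([["A"], ["A", "C"]])
print(is_plurality(parsed))  # False
```

What to do when the check returns False (spoiled ballots, an approval count, or something else) stays with the application and the rules of the election, as discussed earlier in the thread.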
For headerless files it will need to be up to the application and user to decide what to do. In Vote::Count for example, I can add attributes to the new ABIF import library for providing/overriding the header data. For the optional Metadata I'm going to suggest: _# One Line Header Format _# Multi Line Header Format If the file will contain nested precinct/division data I suggest something like Note 1: should 'rcv_equal' be the ballot type or should it be ballot_type:rcv allow_equal:true? I prefer the first. |
The issue of whether or not two or more candidates can be ranked at the same preference level is specific to the method being used to count the ballots. This ballot format standard is just for the ballot information. There are many, many different ways to count ranked ballots. The best methods allow "equal rankings." In fact the IRV and STV methods promoted by FairVote are the only counting methods I can think of that impose that limitation. Let's not go down the rabbit hole of including information that is specific to the counting method. Of course optional comments must always be allowed, so such restrictions can be specified as comments when they are relevant. |
An objective which has been proposed and to which no one has objected is that "Implementations must ... must not use comments to change the semantics of the machine-readable portions of the file." Knowing whether the ballots were cast under a rule that allowed or denied equal rankings is an important semantic.
What @cpsolver seems to be saying is that the rules of the election-specific constraints are outside of the scope of the .abif machine-readable portion, and I concur. @brainbuz 's assertion seems to challenge that: "Knowing whether the ballots were cast under a rule that allowed or denied equal rankings is an important semantic." I disagree that it's important in the context of a machine interpretation of an ABIF file. We're trying to communicate what ballot the voters cast, not all of the possible ballots they could have cast, or the laws constraining them in a particular election. BTW, @brainbuz : using a combination of the backtick (`) char and triple backtick (```) will help make the examples you provide in your June 18th comment more readable, since you'll be able to use the hash sign ( My current inclination is to use a variation of option
...but this wouldn't be valid:
That would simplify parsing for those that don't want to rely on a full JSON parser to interpret each of the metadata lines, and would allow implementations to associate one key with each line. We can also restrict values to a subset of valid values, and possibly make this invalid:
I'm not sure just how much we should restrict the range of valid JSON, if we restrict it. Regardless, I prefer using JSON key value pairs to using RFC-822/5322-style key value pairs, since it gives us an obvious path forward if we choose to expand beyond simple key-value pairs. More to think about.... |
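As a rough sketch of the "one key per metadata line" idea discussed above, the check below accepts a line only if it is a JSON object with exactly one key. The function name `parse_metadata_line` and the optional `scalars_only` restriction are assumptions for illustration; the thread has not settled which values, if any, should be disallowed, and this sketch does not handle trailing comments.

```python
import json


def parse_metadata_line(line, scalars_only=False):
    """Return (key, value) for a metadata line holding one JSON key-value pair.

    Raises ValueError if the line is not valid JSON, is not an object, or
    has more than one key. `scalars_only` is a hypothetical restriction that
    additionally rejects nested lists and objects as values.
    """
    obj = json.loads(line)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(obj, dict) or len(obj) != 1:
        raise ValueError("expected a JSON object with exactly one key")
    (key, value), = obj.items()
    if scalars_only and isinstance(value, (dict, list)):
        raise ValueError("nested values are not allowed in this profile")
    return key, value


print(parse_metadata_line('{"title": "Tennessee example"}'))
# ('title', 'Tennessee example')
```

This still leans on a JSON parser; an implementation that wants to avoid one would need the restricted subset to be defined tightly enough to parse by hand.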
I think some use cases would help answer these questions. For real elections, I think using the NIST standard or some other existing open clean standard should be encouraged and facilitated. Having folks in this group add to the implementations of that standard (and/or identify issues for future revisions) might well be a much more beneficial contribution than adding yet another format with idiosyncratic metadata approaches. The use case we see in election methods discussions is already clear in this discussion. Are there other use cases that are common enough and likely enough to be implemented, to warrant the extra complexity of the metadata discussed here? |
The metadata is already being discussed as optional.
Since we're looking at a readable text format and have rejected having the entire file in JSON/YAML, I would prefer to stick with text key/value pairs. If there is a solid case for something JSON-like, I'd rather put the whole metadata block in a JSON object, which can then be handed off to a JSON parser.
@nealmcb wrote:
Yup, you're right. I filed this issue in anticipation of a problem that we arguably don't have yet, and thus this conversation may be a violation of the YAGNI principle. That said, I also want to facilitate full interoperability with other formats, per this part of your comment:
When you refer to "the NIST standard", I'm guessing you're referring to the "Cast Vote Records Common Data Format Specification Version 1.0", a 94-page PDF published by the National Institute of Standards and Technology (NIST), a division of the United States Department of Commerce. NIST certainly is a respectable organization, and recommendations published by the organization seem likely to gain adoption over time as government bodies mandate use of the NIST specification (like federal, state and local agencies in the United States) . The government imprimatur bodes well for NIST's CVR specification. However, I'm writing this as someone who was heavily involved with the World Wide Web Consortium (W3C) during the rise of wikis and wiki formats (and I seem to recall attending a W3C meeting hosted at/near NIST's headquarters in Gaithersburg). Wiki formats are often criticized as being "idiosyncratic" and "difficult to work with", but I think SGML and XML are also idiosyncratic and difficult to work with. More generally, interoperability is difficult because other people are idiosyncratic and difficult to work with. 😃 I'm grateful that the people who invented wiki formats like Markdown, MediaWiki's wikitext, asciidoc and others didn't listen to people who called on them to give up their efforts and follow "the standard". Note that I am also grateful for John MacFarlane's tireless efforts on pandoc and CommonMark, and I'm hopeful that CommonMark gains the imprimatur of the most commonly recognized "official" standards bodies. I also realize that we're going to be needing I see ABIF as the "wiki-like" counterpart to NIST's CVR format. With any luck, we can form a community of developers that more-or-less agree on the "correct" format for @nealmcb - I'm guessing you've heard of the IETF's mantra of "rough consensus and running code". I suspect I should be spending less time writing words about ABIF, and more time writing code implementing ABIF. 
Would you be willing to help write code for an implementation of a two-way converter between ABIF and NIST's CVR? Are there any implementations of NIST's CVR format that have been published as open source? Is there a reference test suite that's been published?
Thanks, Rob. I've been embedded in both the wiki world (as a participant in the first wiki, c2, and a friend of Ward Cunningham, the inventor of wikis, and a Wikipedia editor since 2002) and the IETF / OASIS / NIST worlds of standards. Good points, though it seems like a very challenging proposition to me to achieve valid, lossless two-way mappings, especially when we really want to always work with digitally signed CVRs in the world of official elections. The NIST format has important momentum behind it, though it does not follow my notion and urging over the years to be very clear about what is required for the important use cases like auditing, and to insist on at least one wire format as a must-implement case. That makes interoperability more challenging than it already is. At any rate, another person very interested in this space is @raylutz. We've both been discussing this stuff a bit (though it is off topic) at the Common Data Format Research Group: Common Data Format (CDF) Ballot Styles Subgroup | NIST.
Re implementations of the NIST CDFs for elections, I think the best source right now is from John Dziurlaj, chair of the Common Data Format Research Group working with NIST:
I think conversion of CVR to ABIF is much more important than the other direction. CVR is a heavier format to facilitate Election Administration and (Voting) Machine to (Tabulating) Machine copying of data. Raw data from elections would be coming in CVR, which will need to be converted. I've created issue #14 for work on the metadata dictionary, which will be a better place to discuss how to map data from other formats. The people who I imagine would want to go ABIF to CVR are likely to be the vendors who sell equipment to elections officials and who are looking to use data sets that become publicly available in ABIF for validating their implementations. They will want to get the ABIF data to match their voting machine's version of CVR, so they will be writing their own converters anyway.
@nealmcb wrote:
Cool! So you'll be able to namedrop and make obscure 1990s tech references that some people will be too young to understand. 👴 I'm just kidding about purposefully going overboard with antique acronyms. We should probably take our reminiscing elsewhere, such as a "
Time to get super nerdy now. I think one seldom-understood aspect of git is its use of Merkle trees to build a unique commit hash. SHA-1 may not be the best choice for a modern version control system if one was starting from scratch in 2021, but it was a pretty good choice for the time. Should ABIF files have a cryptographic hash? I'm not sure....
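For what it's worth, here is a small sketch of what a content hash for an ABIF file could look like. Everything in it is an assumption rather than a proposal from this thread: the function name `abif_digest`, the choice of SHA-256, and the decision to normalize line endings before hashing so that the same ballots copied through email on different platforms get the same digest.

```python
import hashlib


def abif_digest(text):
    """Return a hex SHA-256 digest of ABIF text (hypothetical convention).

    Line endings are normalized to "\n" first, and the text is encoded as
    UTF-8, so CRLF and LF copies of the same file hash identically.
    """
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


unix_copy = "# example\n[Sue Ye (蘇業)]: SY\n"
windows_copy = unix_copy.replace("\n", "\r\n")
print(abif_digest(unix_copy) == abif_digest(windows_copy))  # True
```

A digest like this could also stand in for a "case number" when two people need to confirm they are talking about the same set of vote counts, though that use is only a guess at how the ideas in this thread might combine.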
Yup. To be clear, I don't anticipate that ABIF is going to take the world by storm, and supplant NIST's CVR format. NIST needs a marketing department. Michael Lewis's book "The Fifth Risk" explains why. In short, I suspect that most people don't understand the incredibly important work that NIST does. I just want to come up with a text format that is succinct and easy to copy/paste between plain-text emails (and easy to express in a few lines of text, for a reasonable definition of "few").
Is Common Data Format (CDF) a layer on top of XML and JSON, or is it something else? Judging from the Wikipedia article, I'm guessing that CDF actually predates XML and JSON, which scares me a lot. I looked into ASN.1 many years ago, so I'm not surprised that XML (and later JSON) ended up subsuming many of the use cases of ASN.1. I think we should take discussion of the metadata format (or data model) over to issue #14. I think we should take the discussion of the overall data model for ABIF over to issue #15. It seems that this issue (issue #6) may have run its course, but I'm going to keep this open until we have greater consensus about the ABIF metadata text format.
CDF is a generic term and does not imply a particular serialization. The NIST 1500 series CDFs support JSON and XML from a shared UML Data Model. Details on its mapping are available here.
A couple of weeks ago (in May 2021), @cpsolver wrote a message to the EM-list that I'm only now getting around to responding to (indirectly). See "Re: [EM] Ballot Data Format" by VoteFair on 2021-06-06 for more.
In the email, he suggests the following:
I think it's inevitable that we're going to need to figure out how to allow for custom metadata outside of comments. One thing that I love about the old email standards (and in particular, RFC 822) is how simple the rules were for distinguishing between the header (with the metadata about the email) and the body (which contained the message, which could be pretty much ANYTHING).
The following message is vaguely compatible with RFC 822:
I suspect my example above has a few problems of non-compliance with RFC 822, and probably also has problems with the updated specs (RFC 5322 and RFC 6854). Still, the format hasn't changed much; in fact, it still uses US-ASCII rather than UTF-8, and most developers who have done much with email will recognize the example as something vaguely compatible with RFC 822.
Note that there are many arbitrary headers in the top portion of the example, and that the order seems a bit random. My hope for ABIF is that we would do something very similar. I realize now that my proposed headers on some of the test cases for ABIF (as I write this on June 13) don't seem to allow a lot of room for expansion.
There are many ways I can see for solving this problem:
- Lines with [0-9] as the first line character correspond to a ballot grouping
- Lines with "#" as the first line character correspond to a comment
- Lines with "[" as the first line character correspond to an ABIF mapping line (like "[Sue Ye (蘇業)]: SY")
- Lines with "{" as the first line character correspond to a valid NDJSON line. Arbitrary metadata can be placed inside of JSON dictionaries, which most parsers MAY ignore.

My current preference is option "c", because I think writing parsers will be easier if all of the metadata is declared at the top of the file, but I also want to keep the option to have metadata and comments down in the body of the document. I also think that it should be safe for authors to add spaces and tabs at the beginning of the line, and have those stripped out by parsers. I'd also like to make it reasonably easy to write a single-pass parser for ABIF files, which becomes much easier if the candidate mappings (described in "b3." above) are handled as part of "header" handling, so that there are no surprise candidate token declarations in the body.
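The first-character rules above lend themselves to a very small single-pass classifier. The sketch below is only an illustration under stated assumptions: the names `classify_line` and `scan`, the "blank" and "unknown" categories, and the idea of reporting mapping lines that appear after the first ballot line are not part of any agreed specification. It looks only at the first character after leading spaces and tabs, and does not parse the rest of the line.

```python
def classify_line(raw):
    """Classify one line by its first character, ignoring leading spaces/tabs."""
    line = raw.lstrip(" \t")
    if not line.strip():
        return "blank"
    first = line[0]
    if first in "0123456789":
        return "ballot"      # ballot grouping
    if first == "#":
        return "comment"
    if first == "[":
        return "mapping"     # candidate mapping line
    if first == "{":
        return "metadata"    # NDJSON line
    return "unknown"


def scan(text):
    """Single pass over the text.

    Returns the kind of every line, plus the 1-based numbers of mapping
    lines that show up after the first ballot line ("surprise" declarations).
    """
    kinds, late_mappings = [], []
    seen_ballot = False
    for number, raw in enumerate(text.splitlines(), start=1):
        kind = classify_line(raw)
        if kind == "ballot":
            seen_ballot = True
        elif kind == "mapping" and seen_ballot:
            late_mappings.append(number)
        kinds.append(kind)
    return kinds, late_mappings
```

A linter could turn the late mapping lines into warnings, while a strict parser could reject them, which keeps the ordering a "soft requirement" in the sense discussed elsewhere in this thread.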
Thoughts?