Decide on metadata header (or line format) for ABIF #6

Open
robla opened this issue Jun 14, 2021 · 24 comments

Comments

@robla
Contributor

robla commented Jun 14, 2021

A couple of weeks ago (in May 2021), @cpsolver wrote a message to the EM-list that I'm only now getting around to responding to (indirectly). See "Re: [EM] Ballot Data Format" by VoteFair on 2021-06-06 for more.

In the email, he suggests the following:

A case number allows the ballot data to be processed through separate
vote-counting software while the metadata -- such as precinct number,
political-party affiliations, etc. -- can follow a different path and be
re-joined to produce the published results.

In particular, my vote-counting software focuses on the numbers/counts,
and I use different software (written in my Dashrep programming
language) to process the text info.

The use of a case number also has other benefits.

I think it's inevitable that we're going to need to figure out how to allow for custom metadata outside of comments. One thing that I love about the old email standards (and in particular, RFC 822) is how simple the rules were for distinguishing between the header (with the metadata about the email) and the body (which contained the message, which could be pretty much ANYTHING).

The following message is vaguely compatible with RFC 822:

Hyphen-separate-field-1: Random-ish characters, terminated by CRLF
Hyphen-separate-field-3: Even more random-ish characters, terminated by CRLF
Hyphen-separate-field-2: More random-ish characters, terminated by another CRLF
From: Random name with random characters <email-address@example.com>
Subject: Does anyone remember RFC 822?
To: The world <world@example.com>
Date: Today-ish
Hyphen-separate-field-4: Oh, yeah, here's another header, terminated by another CRLF

This is my email ode to RFC 822!  業業業業whee業業業業wheeee!!!!!!!

Did I mention this: whee!  Oh, yeah, and 業!  ña, ña, ña!

I suspect my example above has a few problems of non-compliance with RFC 822, and probably also has problems with the updated specs (RFC 5322 and RFC 6854). Still, the format hasn't changed much; in fact, it still uses US-ASCII rather than UTF-8, and most developers who have done much with email will recognize the example as something vaguely compatible with RFC 822.

Note that there are many arbitrary headers in the top portion of the example, and that the order seems a bit random. My hope for ABIF is that we would do something very similar. I realize now that my proposed headers on some of the test cases for ABIF (as I write this on June 13) don't seem to allow a lot of room for expansion.
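
As a rough illustration of the header/body split described above, Python's standard-library `email` module can parse a message shaped like the example. The sample text is a shortened copy of that example, and nothing here is part of any ABIF proposal; it only shows how little a parser needs in order to separate arbitrary headers from an arbitrary body.

```python
from email import message_from_string

# A shortened version of the RFC 822-style example above.
RAW = (
    "Hyphen-separate-field-1: Random-ish characters\n"
    "From: Random name <email-address@example.com>\n"
    "Subject: Does anyone remember RFC 822?\n"
    "Date: Today-ish\n"
    "\n"
    "This is my email ode to RFC 822!\n"
    "Did I mention this: whee!\n"
)

msg = message_from_string(RAW)

# Header fields come back in file order; unknown field names are accepted.
headers = msg.items()

# Everything after the first blank line is the body, colons and all.
body = msg.get_payload()

for name, value in headers:
    print(f"{name} = {value}")
print(body)
```

The line "Did I mention this: whee!" contains a colon but is not treated as a header, because it comes after the blank line.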

There are many ways I can see to solve this problem:

  • a. Create a way of having a mandatory body, and an optional header in all ABIF files
    • a1. Create a way of expressing the header as valid JSON (allowing for newlines), and a way of delimiting between JSON and an ABIF-body section
    • a2. Create a way of expressing the header as valid YAML (allowing for newlines and following YAML whitespace rules), and create a way of delimiting between YAML and an ABIF-body section
    • a3. Create a way of attaching a valid RFC 5322 header to the top of the file, with a blank newline as the delimiter between the RFC-5322-formatted header and the ABIF body
    • a4. Create some other header format
  • b. Create rules for having a variety of line types in ABIF which can be recognized and routed according to their first character. The following sub-options are NOT mutually exclusive
    • b1. Have [0-9] as the first line character correspond to a ballot grouping
    • b2. Have "#" as the first line character correspond to a comment
    • b3. Have open square bracket ([) correspond to an ABIF mapping line (like "[Sue Ye (蘇業)]: SY")
    • b4. Have open squirrelly bracket ({) correspond to a valid NDJSON line. Arbitrary metadata can be placed inside of JSON dictionaries, which most parsers MAY ignore.
    • b5. Allow all b1 through b4 to occur in any order in a valid ABIF file
  • c. Some combination of the "a" and "b" above

My current preference is option "c", because I think writing parsers will be easier if all of the metadata is declared at the top of the file, but I also want to keep the option to have metadata and comments down in the body of the document. I also think that it should be safe for authors to add spaces and tabs at the beginning of the line, and have those stripped out by parsers. I'd also like to make it reasonably easy to write a single-pass parser for ABIF files, which becomes much easier if the candidate mappings (described in "b3." above) are handled as part of "header" handling, so that there are no surprise candidate token declarations in the body.

Thoughts?

@simberaj

Well thought through!

I am in favor of the one-pass parser option, with the structure simple enough so that no additional parsing libraries such as YAML or JSON are necessary. Therefore, I see the benefit of a clear order: a3 (e-mail-style headers), b3 (candidate mappings), b1 interspersed with b2 (ballot groupings with comments). The only use-case for interspersing metadata with ballots would IMHO be designating special ballot classes (such as ballots with pending recount); if that is desired, the headers (metadata) would have to start with an alphabetical character to distinguish them from ballots, and the "global" headers affecting the whole ABIF file would still have to come before the first ballot line.

@cpsolver

My intended preference is to allow using your ABIF format to supply only the vote-count numbers and the letters A, B, C, etc. as placeholders for the candidates, and using a case number (or serial number if that's the wording you prefer) that associates this case with a separate file that contains (in any format!) candidate names, political parties, election title, and whatever other metadata is desired.

This allows your format to be used in real elections where interested voters can verify that the winner-calculation software has absolutely no knowledge of who the candidates are. That way anyone can verify that the calculation software does not have any bias -- such as political party names or even whether a candidate name contains accented characters.

Also, for my purposes, it allows my calculation software to ignore the character encoding and just assume ASCII (not Unicode, etc.). FYI I use my Dashrep programming language to handle text, and that's where the candidate names, etc. are. But I need a unique number that associates the two software paths (instead of relying on filenames, which is not reliable in real elections).

As another benefit, when discussing specific cases on the EM forum, the case number can be changed when the vote numbers change, which avoids having to ask "did you mean this set of vote counts or this other set of vote counts?"

@simberaj

My experience with hybrid (multi-) file formats is, diplomatically speaking, mixed - accompanying files get lost or misnamed, etc. A good example of this is the ESRI shapefile format, the dubious de facto standard for spatial data. I would try to avoid that if at all possible.
The lack of bias verification can AFAIK easily be done by preprocessing the ABIF file by stripping metadata and replacing candidates with placeholder strings. I don't see non-ASCII characters as much of a problem nowadays - most languages now have built-in support for Unicode and UTF-8 reading/writing.
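
A minimal sketch of that preprocessing step, assuming the input contains only bare ballot bundle lines of the form `count:Name>Name=Name` (no scores, comments, or metadata). The function name `anonymize` and the candidate names in the usage note are made up for illustration.

```python
import re
from string import ascii_uppercase

def anonymize(lines):
    """Replace candidate names with A, B, C... in order of first appearance.

    Returns the rewritten lines and the name-to-placeholder mapping.
    This sketch supports at most 26 candidates.
    """
    mapping = {}
    out = []
    for line in lines:
        count, _, ranking = line.strip().partition(":")
        # Keep the ">" and "=" separators by capturing them in the split,
        # so names sit at even indexes and separators at odd indexes.
        parts = re.split(r"([>=])", ranking)
        for i in range(0, len(parts), 2):
            name = parts[i].strip()
            if name not in mapping:
                mapping[name] = ascii_uppercase[len(mapping)]
            parts[i] = mapping[name]
        out.append(f"{count}:{''.join(parts)}")
    return out, mapping
```

Calling `anonymize(["42:Alice>Bob", "26:Bob>Alice"])` yields the lines `42:A>B` and `26:B>A`; the mapping can be kept in a separate file and re-joined after counting.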

@brainbuz
Contributor

All of the important data should be in a single file. I think it might look something like

ABIF 0.001
election_date: 2021-06-15
election_name: best acronym meaning for ABIF
...
=choices
.... # more kv data for the choices
=ballots
...

BTW, There should be a way of preserving optional subdivision data in the ballots section, but I think it will be clearer how to do this when more of the format has been worked out.

@robla
Contributor Author

robla commented Jun 15, 2021

As I've been thinking about this, I realize that the first four bytes of every line should be special. The table I'm envisioning is below. Note that periods (".") correspond to "unknown character" in this particular table.

| First four characters in the line | Interpretation |
| --- | --- |
| "[..." | "[" is the beginning of a quoted candidate token. Assume that every character afterward is a UTF-8 character making up the quoted candidate token. How the candidate token is interpreted depends on what follows the candidate token. |
| "a" through "z" (or "A" through "Z") followed by three letters | "a" is the beginning of a bare candidate token, as is every other English lowercase letter ([a-z]) or every English uppercase letter ([A-Z]). All valid bare candidate token characters ([A-Z] and [a-z]) that follow are also part of the bare candidate token. How the candidate token is interpreted depends on what follows the candidate token. |
| "0" through "9", followed by an arbitrary number of digits, followed by a colon (":") | This line is the beginning of a "ballot bundle", which is a bundle of identically scored/ranked ballots. Everything after the colon in the line expresses the scoring (or ranking) for each candidate in the bundle. |
| "# .." ("#" followed by " ", followed by any two characters) | "#" is the beginning of a comment (as it is anywhere on the line). |
| "{..." | Optional context-dependent metadata. This would indicate that the line will be JSON, probably in a format like NDJSON. |
| whitespace (spaces or tabs) | Whitespace at the beginning of any line (except for the first line) can be discarded when parsing. The first non-whitespace character in a line is interpreted as above. |
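
The table above can be sketched as a small line classifier. This is only an illustration under a few simplifying assumptions: it treats any leading "#" as a comment (not only "#" followed by a space), it strips leading whitespace on every line including the first, and the label names it returns are invented for this example.

```python
import re

# One or more digits followed by a colon marks a ballot bundle.
BUNDLE = re.compile(r"\d+:")

def classify(line: str) -> str:
    """Return a label for an ABIF-style line based on how it begins."""
    s = line.lstrip(" \t")          # leading spaces and tabs are discarded
    if not s.strip():
        return "blank"
    if s[0] == "#":
        return "comment"
    if s[0] == "[":
        return "quoted-candidate-token"
    if s[0] == "{":
        return "json-metadata"
    if BUNDLE.match(s):
        return "ballot-bundle"
    if s[0].isascii() and s[0].isalpha():
        return "bare-candidate-token"
    return "unknown"
```

Because each line is classified on its own, a parser built this way never needs to look back at earlier lines to decide what kind of line it is reading.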

I also believe that the first line of the file should be different than every subsequent line, though I still think it should be possible to have nothing but ballot bundle lines.

There's a lot to respond to here in the comments from the past 24 hours (thank you, everyone!). This comment isn't a full reply to the ideas raised by other folks, but it was inspired by thinking about a response. There is a lot more to do....

@robla
Contributor Author

robla commented Jun 16, 2021

I believe this issue is becoming the meta-issue for the answer to the question "what is the structure of an ABIF file?" I would like very few requirements, but here are the requirements that I would like for the v1.0 version of this specification:

  • A file consisting only of ballot bundle lines is a valid ABIF file
  • Implementations must recognize human-readable comments, and must not use comments to change the semantics of the machine-readable portions of the file.
  • Each machine-readable line should be human readable outside of the context of the specific ABIF file it is included in. Not perfectly so, but more so than with single lines out of other formats (like CSV files).
  • It must be reasonably simple to combine and split ABIF files.
  • Parsing should be easy, and preferably possible with a single-pass parser

The specification may include an easily-recognized (and optional) version string as the first line of the document, which may or may not include option information and/or well-structured metadata about the file. For example, this would allow for the "case number" (as @cpsolver calls it) or "serial number" (as I would currently prefer to call it). One of us may also want to create an issue dedicated to this optional first line, pulling in bits from this issue.

I recently created issue #8 to discuss candidate token declarations. As I stated over there:

I believe that we should make it possible to infer the section that a line is in from the first character of the line, and have a convention (rather than a requirement) of using comments to delimit sections for readability for now. We may want to make sections more explicit in the near future, but my hunch is that having line-based section identification will force us to make the line formats we design more robust (and human readable) and will also encourage more robust implementations without too much burden.

What I also think: we can make life easier on implementors by having a soft requirement for line ordering, thus having implicit sections. As much as I like RFC-5322-formatted headers from a simplicity and flexibility perspective, I suspect they are too flexible for this new format. My hunch is that adopting unmodified RFC-5322-formatted header lines will make it difficult to expand the format later, since the lines can begin with pretty much any ASCII alphabetic character.

I like NDJSON, with a soft requirement that there be only one key-value pair per line (and maybe optional comments at the end of the line), and a strongly-recommended ordering. So, for example, we could allow for this to be a valid ABIF file:

@ABIF-{"SerialNumber":001}
{"Date": "2021-06-15T18:33:59-07:00"}
{"Software": "ABIFomatic 0.8"}
{"PrecinctZipcode": "12345"}
=DGM:[Doña García Márquez]
=SBJ:[Steven B. Jensen]
=SY:[Sue Ye (蘇業)]
=AM:[Adam Muñoz]
27:DGM/5>SBJ/2>SY/1>AM/0
26:SBJ/5>DGM/3=SY/3>AM/1
24:SY/5>DGM/2=AM/2>SBJ/1
23:AM/5>SY/3>DGM/1>SBJ/0

...and we could declare that to be equivalent to the following version with more comments and whitespace:

@ABIF - {"SerialNumber" : 001}   # Batch 1 of 7

#################
#  Optional metadata
{"Date": "2021-06-15T18:33:59-07:00"}
{"Software": "ABIFomatic 0.8"}             # "ABIFomatic" was formerly referred
                                            # to as "ABIForoni"
{"PrecinctZipcode": "12345"}

#################
# Candidates
=DGM: [Doña García Márquez]
=SBJ: [Steven B. Jensen]    # dropped out 3 days before the election
=SY:  [Sue Ye (蘇業)]
=AM:  [Adam Muñoz]

#################
#  Ballot bundles
27: DGM/5 > SBJ/2 >  SY/1 > AM/0
26: SBJ/5 > DGM/3 =  SY/3 > AM/1
24:  SY/5 > DGM/2 =  AM/2 > SBJ/1
23:  AM/5 >  SY/3 > DGM/1 > SBJ/0
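
One way to test that kind of equivalence is to normalize both versions and compare the results. The sketch below drops comments and blank lines and removes whitespace everywhere except inside `"..."` strings and `[...]` candidate names. Those rules are an assumption drawn from the two examples above, not a settled part of the format, and the sketch does not handle escaped quotes.

```python
def normalize_line(line: str) -> str:
    """Strip a trailing comment and all whitespace outside quotes/brackets."""
    out = []
    closer = None   # character that ends the current protected span, if any
    for ch in line:
        if closer is not None:
            out.append(ch)          # inside "..." or [...]: keep everything
            if ch == closer:
                closer = None
        elif ch == "#":
            break                   # the rest of the line is a comment
        elif ch == '"':
            closer = '"'
            out.append(ch)
        elif ch == "[":
            closer = "]"
            out.append(ch)
        elif not ch.isspace():
            out.append(ch)
    return "".join(out)

def normalize(text: str) -> list:
    """Normalize every line and drop the ones that end up empty."""
    return [n for n in map(normalize_line, text.splitlines()) if n]

compact = '=SBJ:[Steven B. Jensen]\n27:DGM/5>SBJ/2>SY/1>AM/0\n'
verbose = (
    '#################\n'
    '=SBJ: [Steven B. Jensen]    # dropped out 3 days before the election\n'
    '\n'
    '27: DGM/5 > SBJ/2 >  SY/1 > AM/0\n'
)
print(normalize(compact) == normalize(verbose))
```

Both inputs normalize to the same two lines, so the comparison prints `True`.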

The rationale for the "soft requirements" is to provide implementors guidance on what will always be "safe", while at the same time not choosing between a couple of different obvious paths until implementors (and common usage) force us to choose one or the other for a future version. We may (in the future) want to allow for easy concatenation of many ABIF files with lines with sections that have candidate declarations, then ballot bundles, then more candidate declarations, and maybe some metadata interspersed for various reasons. But we can make soft requirements of ordering so that we don't have to expend a lot of mental energy on every possible combination of weird loopholes in the spec, and give ourselves the rationalization to tell implementors who try to exploit loopholes that "you really shouldn't have implemented that in the first place", and encourage folks to alert us when they are planning to exploit a loophole or extend the format in weird ways.

More to think about....

@brainbuz
Contributor

I agree with Robla's bullet points except for bare bundle.

Some identification of the election should be a minimum. Parsers would need to guess from the first ballot line. For example Vote::Count needs to translate equal ranking rcv ballots to range ballots, if the first ballot looked like a normal rcv ballot the data type would be guessed wrong. If only one line has an '=' is it a typo or did only one voter give an equal ranking?

Enforcing some metadata can also help humans writing the files by making it easier for validators to detect errors. If a choices list is required, then for example any mistyped entries would be undefined choices.

Developers will want to use the choice_identifiers as keys; allowing the [long_description] in the data is asking for trouble. If the choices list is going to be optional, when skipped the data should still be required to be legal as identifier strings. Also, humans writing these files by hand want less mandatory metadata to save typing, but using short identifiers instead of long strings that require multi-key combos to enter will save typing, so this isn't really a burden on human file creators.

P.S. If I wrote Vote::Count again I would represent rcv ballots as range (with negative scores to invert the order) instead of having two different internal data structures.

@robla
Contributor Author

robla commented Jun 17, 2021

@brainbuz wrote:

Some identification of the election should be a minimum. Parsers would need to guess from the first ballot line. For example Vote::Count needs to translate equal ranking rcv ballots to range ballots, if the first ballot looked like a normal rcv ballot the data type would be guessed wrong. If only one line has an '=' is it a typo or did only one voter give an equal ranking?

I learned a lot working with the W3C back in the late 1990s and early 2000s. A few things:

  1. Standards bodies don't have the ability to enforce standards. This is why the W3C calls them "Recommendations" rather than "Standards"
  2. "Compliance to the standard" isn't much of a market disincentive for not complying with the standard.
  3. Don't overcomplicate the specification. XML is a great example of a specification that became too complicated to provide a stable foundation for XHTML. That's what led to the WHAT-WG forming and creating HTML5.
  4. I don't want to set up a standards body. It's too much work. I would prefer to write a clear specification that becomes a standard because it has lots of great test cases (where the meaning of the files in the test suite is clear) and many developers want to implement the specification.

There are plenty of use cases (including the use cases I would like to implement one day) that do NOT require any of the metadata you are describing. We have been discussing election methods on the EM-list for 25 years without providing a lot of the metadata that you suggest should be a requirement. I disagree. I think this should be a complete, valid ABIF file:

42:Memphis>Nashville>Chattanooga>Knoxville
26:Nashville>Chattanooga>Knoxville>Memphis
15:Chattanooga>Knoxville>Nashville>Memphis 
17:Knoxville>Chattanooga>Nashville>Memphis

Most people who have studied electoral reform long enough probably know what that election is, and I'm guessing you do too, @brainbuz. For others reading this that don't get the reference, see any of the Wikipedia articles that use the "Tennessee voting example". While there's plenty of interesting metadata that could be included, it's not necessary to communicate the point of the electoral example. I believe that a fully compatible ABIF implementation should be able to read that example.
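
To show how little is needed to read a file like that, here is a minimal sketch that parses the four bare bundle lines and tallies first preferences. The function name `parse_bundle` and the nested-list shape it returns are choices made for this example only. Counting every candidate in a tied top tier at full weight is also an assumption, and it does not come into play for this data.

```python
from collections import Counter

TENNESSEE = """\
42:Memphis>Nashville>Chattanooga>Knoxville
26:Nashville>Chattanooga>Knoxville>Memphis
15:Chattanooga>Knoxville>Nashville>Memphis
17:Knoxville>Chattanooga>Nashville>Memphis
"""

def parse_bundle(line):
    """Split "count:A>B=C>D" into (count, [[A], [B, C], [D]])."""
    count, _, ranking = line.strip().partition(":")
    tiers = [[name.strip() for name in tier.split("=")]
             for tier in ranking.split(">")]
    return int(count), tiers

bundles = [parse_bundle(l) for l in TENNESSEE.splitlines() if l.strip()]

# Tally the top-ranked candidate(s) of each bundle, weighted by its count.
first_choices = Counter()
for count, tiers in bundles:
    for name in tiers[0]:
        first_choices[name] += count

print(first_choices.most_common())
```

No header, candidate list, or ballot-type declaration is required for this to work; what to do with the parsed bundles is left to the counting software.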

Do implementations need to be fully compatible in order to be useful? No, of course not. There are almost no implementations of SVG (for example) that are able to pass the SVG test suite with 100% compatibility, and those implementations may not be the most useful implementations. Moreover, there are many very useful SVG implementations (e.g. Inkscape) that read and write SVG files that include extra metadata not specified by the SVG specification. That's okay! SVG has been pretty darn useful despite the lack of rigid adherence to a specification.

Moreover, XML remains a useful specification, even though XHTML no longer gets the attention that it once did, and mainstream browsers seem to prefer HTML5.

Back to the subject of ABIF: I think it would be perfectly valid and useful for someone to write an ABIF linter that recommends the addition of metadata, and perhaps even requires the metadata if given a command line switch. That would help implementors needing that metadata in the files they use. However, I don't think the base specification should require an identification of the election. It would be fine for your implementation to require an election identifier, but I don't want to try requiring that the people who build authoring tools (and people who write ABIF by hand) put an election identifier if they don't want to.

@brainbuz
Contributor

If the parser has to be able to determine the ballot type from the first line, then the default allow_equal_ranking for scored should be yes and for rcv should be no. How can the parser tell if the ballots are approval or plurality from the first line?

@robla
Contributor Author

robla commented Jun 17, 2021

how can the parser tell if the ballots are approval or plurality from the first line?

Short answer: the parser won't know, and that's okay. Longer answer: the user of the software will know what they want to do with the ABIF file. Sometimes they will use a command line switch or the user interface to indicate that they want a plurality (or "choose-one") count. Then the software may encounter a ballot with more than one candidate listed. The implementor of the software should choose what should happen with that ballot based on the rules of the election in that locality. Are those spoiled ballots? Should the ballots be recounted using approval voting? It's not our decision (as the creators of ABIF) to make.

@simberaj

In votelib, I (at least provisionally) chose to parse both approval and plurality ballots as approval ballots, and only determine at the end that the ballots are plurality by checking that all of them contain at most one option. Otherwise, I encountered no problems in determining ballot types from each line independently.
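
The "decide at the end" idea can be sketched in a few lines. This is not votelib's API or data model; the `(count, set_of_options)` shape and the function name `looks_like_plurality` are assumptions made for this illustration.

```python
def looks_like_plurality(ballots):
    """True if every ballot names at most one option.

    `ballots` is an iterable of (count, set_of_options) pairs, all of
    which were parsed as approval ballots in the first place.
    """
    return all(len(options) <= 1 for _count, options in ballots)

# Hypothetical parsed data: the first set has a ballot approving two options.
approval = [(10, {"A", "B"}), (7, {"B"}), (3, set())]
choose_one = [(10, {"A"}), (7, {"B"}), (3, set())]

print(looks_like_plurality(approval))
print(looks_like_plurality(choose_one))
```

A single ballot with two options is enough to make the whole set approval-style, which is why the check has to wait until every line has been read.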

@brainbuz
Contributor

brainbuz commented Jun 19, 2021

For headerless files it will need to be up to the application and user to decide what to do. In Vote::Count for example, I can add attributes to the new ABIF import library for providing/overriding the header data.

For the optional Metadata I'm going to suggest:

_# One Line Header Format
_# for the one line header the key and value should be separated by a colon without space for easier regex matching
_# the one line header also doesn't have room to support lists and shouldn't have them
ABIF ballot_type:rcv_equal ...

_# Multi Line Header Format
_# with one key/value per line, space can be allowed after the colon for readability.
ABIF
ballot_type: rcv
election_name: [2021 New York City Democratic District Attorney Primary]
election_date: 2021-06-22 # I would like to specify YYYY-MM-DD so all dates are in same format
=choices ... # whatever the final format for lists is, but I prefer choices to candidates
111:.... # lines with ballot data will always begin with a number, headers will always begin with an alpha character

If the file will contain nested precinct/division data I suggest something like
!division: BRONX_PRECINCT_41 # there's been no discussion of this yet, I just picked ! for this example.
.... lines from BRONX_PRECINCT_41
!division: QUEENS_PRECINCT_6
.... lines from QUEENS_PRECINCT_6

Note 1: should 'rcv_equal' be the ballot type or should it be ballot_type:rcv allow_equal:true? I prefer the first.
Note 2: I went with lowercase for all the header information except the ABIF declaration, but I don't think there's any decision made yet on the casing rules.
Note 3: github thinks # at the beginning of a line is a markdown heading.
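
A rough sketch of how a reader for the multi-line header format above might work, assuming the rules stated in the example: the file starts with `ABIF`, header lines begin with a letter, ballot lines begin with a digit, and `!division:` lines group the ballots that follow. The ballot lines in the sample are invented, since the example above elides them, and `=choices` lines are skipped because their format is undecided.

```python
def parse_sketch(text):
    """Return (headers, ballots_by_division) for the multi-line format."""
    headers = {}
    ballots = {}        # division name -> list of raw ballot lines
    division = None     # stays None until the first "!division:" line
    lines = text.splitlines()
    if not lines or lines[0].split()[:1] != ["ABIF"]:
        raise ValueError("expected the file to start with 'ABIF'")
    for raw in lines[1:]:
        # Drop trailing comments; assumes "#" never appears inside a value.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        if line.startswith("!division:"):
            division = line.split(":", 1)[1].strip()
        elif line[0].isdigit():
            ballots.setdefault(division, []).append(line)
        elif line[0].isalpha():
            key, _, value = line.partition(":")
            headers[key.strip()] = value.strip()
        # Other line types (such as "=choices") are ignored in this sketch.
    return headers, ballots

SAMPLE = """\
ABIF
ballot_type: rcv
election_date: 2021-06-22 # YYYY-MM-DD
!division: BRONX_PRECINCT_41
12:A>B
!division: QUEENS_PRECINCT_6
7:B>A
"""
headers, ballots = parse_sketch(SAMPLE)
print(headers)
print(ballots)
```

Ballot lines that appear before any `!division:` line end up under the key `None`, which keeps files without divisions readable by the same code.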

@cpsolver

The issue of whether or not two or more candidates can be ranked at the same preference level is specific to the method being used to count the ballots. This ballot format standard is just for the ballot information.

There are many, many different ways to count ranked ballots. The best methods allow "equal rankings." In fact the IRV and STV methods promoted by FairVote are the only counting methods I can think of that impose that limitation.

Let's not go down the rabbit hole of including information that is specific to the counting method. Of course optional comments must always be allowed, so such restrictions can be specified as comments when they are relevant.

@brainbuz
Contributor

An objective which has been proposed and to which no one has objected is that "Implementations must ... must not use comments to change the semantics of the machine-readable portions of the file."

Knowing whether the ballots were cast under a rule that allowed or denied equal rankings is an important semantic.

@robla
Contributor Author

robla commented Jun 20, 2021

What @cpsolver seems to be saying is that election-specific rules and constraints are outside of the scope of the machine-readable portion of an .abif file, and I concur. @brainbuz's assertion seems to challenge that: "Knowing whether the ballots were cast under a rule that allowed or denied equal rankings is an important semantic." I disagree that it's important in the context of a machine interpretation of an ABIF file. We're trying to communicate what ballots the voters cast, not all of the possible ballots they could have cast, or the laws constraining them in a particular election.

BTW, @brainbuz : using a combination of the backtick (`) char and triple backtick (```) will help make the examples you provide in your June 18th comment more readable, since you'll be able to use the hash sign (#) at the beginning of a line.

My current inclination is to use a variation of option b4 outlined in my original comment ("Have open squirrelly bracket ({) correspond to a valid NDJSON line") as the way of including arbitrary metadata. It seems each line should only have one key-value pair, so for example this would be valid:

{"county":"king"}
{"city":"Seattle"}
{"congressional_district": 7}
{"precinct": "36-1686"}

...but this wouldn't be valid:

{"county":"king", "city":"Seattle", "congressional_district": 7, "precinct": "36-1686"}

That would simplify parsing for those that don't want to rely on a full JSON parser to interpret each of the metadata lines, and would allow implementations to associate one key with each line. We can also restrict values to a subset of valid values, and possibly make this invalid:

{"CountyCityDistrictPrecinct":["King", "Seattle", 7, "36-1686"]}

I'm not sure just how much we should restrict the range of valid JSON, if we restrict it. Regardless, I prefer using JSON key value pairs to using RFC-822/5322-style key value pairs, since it gives us an obvious path forward if we choose to expand beyond simple key-value pairs.
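
A minimal sketch of the "one key-value pair per line" restriction, using Python's standard `json` module. Limiting values to strings, numbers, and booleans is just one possible choice for the value restriction floated above, and the function name `parse_metadata_line` is made up for this example.

```python
import json

SCALARS = (str, int, float, bool)

def parse_metadata_line(line):
    """Parse one "{...}" line and return its single (key, value) pair.

    Raises ValueError if the line is not a JSON object with exactly one
    key, or if the value is not a plain scalar.
    """
    obj = json.loads(line)      # json.JSONDecodeError is a ValueError
    if not isinstance(obj, dict) or len(obj) != 1:
        raise ValueError("expected exactly one key-value pair per line")
    (key, value), = obj.items()
    if not isinstance(value, SCALARS):
        raise ValueError("expected a string, number, or boolean value")
    return key, value

print(parse_metadata_line('{"county":"king"}'))
print(parse_metadata_line('{"congressional_district": 7}'))
```

Under these rules the multi-key line and the list-valued line above are both rejected. Note also that a strict JSON parser rejects numbers with leading zeros, so a value such as `001` would need to be written as a string or as `1`.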

More to think about....

@nealmcb

nealmcb commented Jun 21, 2021

I think some use cases would help answer these questions.

For real elections, I think using the NIST standard or some other existing open clean standard should be encouraged and facilitated. Having folks in this group add to the implementations of that standard (and/or identify issues for future revisions) might well be a much more beneficial contribution than adding yet another format with idiosyncratic metadata approaches.

The use case we see in election methods discussions is already clear in this discussion.
In terms of formats for that purpose, I think the simpler the better, and bare ballot bundles would seem to me to handle most of that need. After all, that's what we've been doing for a long time.

Are there other use cases that are common enough and likely enough to be implemented, to warrant the extra complexity of the metadata discussed here?

@brainbuz
Contributor

The metadata is already being discussed as optional.
If you have an archive of data from many elections that you want to use in analysis, having metadata in the files is going to be convenient.

@brainbuz
Contributor

Since we're looking at a readable text format and have rejected having the entire file in JSON/YAML, I would prefer to stick with text key-value pairs. If there is a solid case for something JSON-like, I'd rather put the whole metadata block in a JSON object, which can then be handed off to a JSON parser.

@robla
Contributor Author

robla commented Jun 21, 2021

@nealmcb wrote:

I think some use cases would help answer these questions.

Yup, you're right. I filed this issue in anticipation of a problem that we arguably don't have yet, and thus this conversation may be a violation of the YAGNI principle. That said, I also want to facilitate full interoperability with other formats, per this part of your comment:

For real elections, I think using the NIST standard or some other existing open clean standard should be encouraged and facilitated.

When you refer to "the NIST standard", I'm guessing you're referring to the "Cast Vote Records Common Data Format Specification Version 1.0", a 94-page PDF published by the National Institute of Standards and Technology (NIST), a division of the United States Department of Commerce. NIST certainly is a respectable organization, and recommendations published by the organization seem likely to gain adoption over time as government bodies (like federal, state and local agencies in the United States) mandate use of the NIST specification. The government imprimatur bodes well for NIST's CVR specification.

However, I'm writing this as someone who was heavily involved with the World Wide Web Consortium (W3C) during the rise of wikis and wiki formats (and I seem to recall attending a W3C meeting hosted at/near NIST's headquarters in Gaithersburg). Wiki formats are often criticized as being "idiosyncratic" and "difficult to work with", but I think SGML and XML are also idiosyncratic and difficult to work with. More generally, interoperability is difficult because other people are idiosyncratic and difficult to work with. 😃 I'm grateful that the people who invented wiki formats like Markdown, MediaWiki's wikitext, asciidoc and others didn't listen to people who called on them to give up their efforts and follow "the standard". Note that I am also grateful for John MacFarlane's tireless efforts on pandoc and CommonMark, and I'm hopeful that CommonMark gains the imprimatur of the most commonly recognized "official" standards bodies. I also realize that we're going to be needing pandoc for many years to come, since GitHub is not likely to give up on GFM anytime soon. 😒

I see ABIF as the "wiki-like" counterpart to NIST's CVR format. With any luck, we can form a community of developers that more-or-less agree on the "correct" format for .abif files, and we can come up with a valid, lossless two-way mapping of .abif files to (and from) NIST's CVR format. I think having a way of embedding machine-readable structured text in .abif files is going to make that job a lot easier.

@nealmcb - I'm guessing you've heard of the IETF's mantra of "rough consensus and running code". I suspect I should be spending less time writing words about ABIF, and more time writing code implementing ABIF. Would you be willing to help write code for an implementation of a two-way converter between ABIF and NIST's CVR? Are there any implementations of NIST's CVR format that have been published as open source? Is there a reference test suite that's been published?

@nealmcb

nealmcb commented Jun 22, 2021

Thanks, Rob. I've been embedded in both the wiki world (as a participant in the first wiki, c2, and a friend of Ward Cunningham, the inventor of Wikis, and a wikipedia editor since 2002) and the IETF / OASIS / NIST worlds of standards.

Good points, though it seems like a very challenging proposition to me to achieve valid, lossless two-way mappings, especially when we really want to always work with digitally signed CVRs in the world of official elections.

The NIST format has important momentum behind it, though it does not follow my notion and urging over the years to be very clear about what is required for the important use cases like auditing, and to insist on at least one wire format as a must-implement case. That makes interoperability more challenging than it already is.

At any rate, another person very interested in this space is @raylutz. We've both been discussing this stuff a bit (though it is off topic) at the Common Data Format Research Group: Common Data Format (CDF) Ballot Styles Subgroup | NIST.

@nealmcb

nealmcb commented Jun 22, 2021

Re implementations of the NIST CDFs for elections, I think the best source right now is from John Dziurlaj, chair of the Common Data Format Research Group working with NIST:

@brainbuz
Contributor

I think conversion of CVR to ABIF is much more important than the other direction. CVR is a heavier format to facilitate Election Administration and (Voting) Machine to (Tabulating) Machine copying of data. Raw data from elections would be coming in CVR, which will need to be converted. I've created issue #14 for work on the metadata dictionary, which will be a better place to discuss how to map data from other formats. The people who I imagine would want to go from ABIF to CVR are likely to be the vendors who sell equipment to elections officials and who are looking to use data sets that become publicly available in ABIF for validating their implementations. They will want to get the ABIF data to match their voting machine's version of CVR, so they will be writing their own converters anyway.

@robla
Contributor Author

robla commented Jun 22, 2021

@nealmcb wrote:

Thanks, Rob. I've been embedded in both the wiki world (as a participant in the first wiki, c2, and a friend of Ward Cunningham, the inventor of wikis, and a Wikipedia editor since 2002) and the IETF / OASIS / NIST worlds of standards.

Cool! So you'll be able to namedrop and make obscure 1990s tech references that some people will be too young to understand. 👴 I'm just kidding about purposefully going overboard with antique acronyms. We should probably take our reminiscing elsewhere, such as a "which standards body should we present ABIF to? (or should we?)" (note: this issue hasn't been filed yet....)

Good points, though it seems like a very challenging proposition to me to achieve valid, lossless two-way mappings, especially when we really want to always work with digitally signed CVRs in the world of official elections.

Time to get super nerdy now. I think one of the seldom-understood aspects of git is its use of Merkle trees to build a unique commit hash. SHA-1 may not be the best choice for a modern version control system if one were starting from scratch in 2021, but it was a pretty good choice for the time. Should ABIF files have a cryptographic hash? I'm not sure....
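To make the hashing question a bit more concrete, here is a minimal Python sketch. The first function reproduces how git derives the object ID for a single blob (the leaf level of its Merkle structure): SHA-1 over a short header plus the raw bytes. Tree and commit objects are then hashed over the IDs of the objects they reference, which is not shown here. The second function, `content_digest`, is purely hypothetical: it assumes an ABIF hash would be SHA-256 over the text after normalizing line endings, which nobody has proposed or agreed on. The sample ballot line is also just an illustration, not a statement about the final ABIF syntax.

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """Return the object ID git assigns to a blob with these contents.

    Git hashes the header b"blob <size>\\0" followed by the raw bytes.
    """
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


def content_digest(text: str) -> str:
    """Hypothetical digest for an ABIF-like text file (an assumption, not a spec).

    CRLF line endings are normalized to LF first, so that a file pasted
    through an email client hashes the same as the original.
    """
    normalized = text.replace("\r\n", "\n")
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    sample = "27:A>B>C\n"  # illustrative ballot line only
    print(git_blob_id(sample.encode("utf-8")))
    print(content_digest(sample))
```

The normalization step is the interesting design choice: a format meant to survive copy/paste through plain-text email probably cannot hash raw bytes the way git does, because line endings and trailing whitespace tend to change in transit.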

The NIST format has important momentum behind it, though it does not follow my notion and urging over the years to be very clear about what is required for the important use cases like auditing, and to insist on at least one wire format as a must-implement case. That makes interoperability more challenging than it already is.

Yup. To be clear, I don't anticipate that ABIF is going to take the world by storm, and supplant NIST's CVR format. NIST needs a marketing department. Michael Lewis's book "The Fifth Risk" explains why. In short, I suspect that most people don't understand the incredibly important work that NIST does. I just want to come up with a text format that is succinct and easy to copy/paste between plain-text emails (and easy to express in a few lines of text, for a reasonable definition of "few").

At any rate, another person very interested in this space is @raylutz. We've both been discussing this stuff a bit (though it is off topic) at the Common Data Format Research Group: Common Data Format (CDF) Ballot Styles Subgroup | NIST.

Is Common Data Format (CDF) a layer on top of XML and JSON, or is it something else? Judging from the Wikipedia article, I'm guessing that CDF actually predates XML and JSON, which scares me a lot. I looked into ASN.1 many years ago, so I'm not surprised that XML (and later JSON) ended up subsuming many of the use cases of ASN.1.

I think we should take discussion of the metadata format (or data model) over to issue #14. I think we should take the discussion of the overall data model for ABIF over to issue #15. It seems that this issue (issue #6) may have run its course, but I'm going to keep this open until we have greater consensus about the ABIF metadata text format.

@JDziurlaj

CDF is a generic term and does not imply a particular serialization. The NIST 1500 series CDFs support JSON and XML from a shared UML Data Model. Details on its mapping are available here.
