
[Request for discussion] How agencies should inventory their software #116

Closed
konklone opened this issue Apr 8, 2016 · 11 comments

@konklone (Contributor) commented Apr 8, 2016

(I’m Eric, an engineer at 18F, an office in the U.S. General Services Administration (GSA) that provides in-house digital services consulting for the federal government. I’m commenting on behalf of 18F; we’re an open source team and happy to share our thoughts and experiences. This comment represents only the views of 18F, not necessarily those of the GSA or its Chief Information Officer.)

The Implementation section of the policy asks agencies to inventory their open- and closed-source software projects, so that OMB and the public can increase the discoverability of agency software. This seems very similar to what agencies do with their datasets, in support of M-13-13 and Project Open Data.

Many agencies don't have existing inventory processes in place, and agencies manage their enterprise data inventories in a variety of mostly manual ways. Our experience with these data inventories is that they are often out of date and incomplete.

Given that, we think cognitive simplicity and automation, for project owners and agency staff managing inventory data, will be key to getting complete and timely inventory data. In other words, we should make lives easier for publishers, even at the cost of inconveniencing consumers, so that consumers end up with better overall data.

There are a variety of ways you could accomplish this. We describe a couple of ways below, but would welcome discussion on the best way to achieve this.

One way is to have agencies list the places where their software projects can be found, rather than a single list of all of their projects. These places would be expected to have a machine-readable way to list those projects -- they could be GitLab or GitHub accounts, RSS or Atom feeds maintained directly by an agency, or feeds in an OMB-designed schema.

This resembles how sitemaps work today, where your initial sitemap may just be an index of links to other sitemaps.

One simple way to represent an index like this might be:

{
  "hosts": [{
    "url": "https://github.com/18F",
    "format": "github"
  },{
    "url": "https://code.gsa.gov/feed/",
    "format": "rss"
  },{
    "url": "https://code.cio.gov/repos.json",
    "format": "omb"
  }]
}

OMB or the public would then need to "walk" each of these places using format-specific adapters that use (in the above example) the GitHub API, an RSS parser, and an OMB-specific parser in turn. (In practice, the only way to inventory closed-source projects would be agency-hosted data, not via GitHub -- so closed GitHub repositories would need to be inventoried elsewhere.)
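(For illustration only: a minimal sketch of a harvester that dispatches on the format field of an index like the one above. The file name, the record fields, and the assumption that the OMB format is a plain JSON list are all hypothetical, and real use of the GitHub API would need pagination and authentication.)

import json
import requests    # assumes the requests library is available
import feedparser  # assumes the feedparser library is available

def walk_host(host):
    """Yield one record per project from a single inventory host, based on its declared format."""
    if host["format"] == "github":
        # GitHub's public API: GET /orgs/{org}/repos (pagination omitted for brevity)
        org = host["url"].rstrip("/").rsplit("/", 1)[-1]
        for repo in requests.get(f"https://api.github.com/orgs/{org}/repos").json():
            yield {"name": repo["name"], "url": repo["html_url"]}
    elif host["format"] == "rss":
        for entry in feedparser.parse(host["url"]).entries:
            yield {"name": entry.title, "url": entry.link}
    elif host["format"] == "omb":
        # assumes the OMB-designed schema is itself a JSON list of project records
        yield from requests.get(host["url"]).json()

with open("index.json") as f:  # the index document shown above
    index = json.load(f)

inventory = [project for host in index["hosts"] for project in walk_host(host)]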

This approach has some clear limitations. OMB and the public would be limited by the data fields available in the software systems used by agencies, and they would have to employ a more sophisticated system in order to discover every project across a variety of formats. There's also some additional complexity inherent in using a "two-tiered" inventory system, as compared to simply having agencies produce a single large list of repositories.

However, this would reduce the burden that creating new open source projects places on agencies and their developers to essentially nothing, and would reduce the burden on agency inventory maintainers to documenting only closed source work. This is proportionate to the level of ease and fluidity that OMB should want agencies to have regarding open source code, and would be an acknowledgment by OMB that they don't want the inventory process to be a major burden on agencies. In addition, the burden of using format-specific adapters to walk different services could be mitigated if the tools OMB uses to walk agency inventories are made open source and straightforward for others to use.

Alternatively, OMB could ask agencies to provide a single simple JSON list of all open- and closed-source software projects, but provide tools (potentially simple in-browser tools) to help agencies make use of their existing GitHub/GitLab/RSS feeds to generate that single list. By comparison to the above, this approach would make OMB's life (and the public's life) easier when walking over agency inventories, but would add the burden to OMB of providing maintained tools that help walk different feeds of software projects. Agency inventory maintainers would have to do more work when updating their inventories, though not at any additional frequency.
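(Again purely for illustration, a single flat inventory under this alternative might look something like the following; the field names and projects are made up, not a proposed schema.)

{
  "projects": [{
    "name": "example-open-project",
    "repositoryURL": "https://github.com/example-agency/example-open-project",
    "openSource": true
  },{
    "name": "Example internal claims system",
    "repositoryURL": null,
    "openSource": false
  }]
}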

@philipashlock

(I'm Phil Ashlock, the Chief Architect at Data.gov which is also operated within the U.S. General Services Administration (GSA). As part of my role at Data.gov, I've worked with OMB and agencies to shape and implement the Open Data Policy and maintain Project Open Data (especially the metadata schema) which this policy has been heavily modeled on. This comment represents only my views, not necessarily those of the GSA or its Chief Information Officer.)

The approach described by Eric is a hybrid between an inventory and a list of existing inventories. While it makes it clear that agencies will need to implement an inventory mechanism to document their closed source projects, it suggests agencies won't need to worry about the inventory process for projects that can be automatically inventoried by systems like GitHub.

We should remember that systems like GitHub don't magically create the metadata we'd want to include in these inventories. This information needs to be updated and maintained just as manually as if it were entered into any other system. While one could argue that this hybrid or bifurcated process reduces the burden on the agency, I think you could argue that the amount of work to enter and maintain the primary source of information is the same, but that it prevents agencies from actively engaging in the management of a complete inventory. This approach prevents agencies from creating a usable, comprehensive, and complete inventory of their own software projects, which would help them better plan internally, keep track of ongoing work, and avoid duplicated effort. Instead it asks OMB or the public to assemble a complete inventory for the agencies, as if they shouldn't be bothered to manage or make use of such an effort. This is not meant to be a compliance exercise, and it's not just for the public benefit. At its core, this process should help agencies understand what they're doing in a more holistic way and make decisions accordingly.

This hybrid approach also asks OMB or the public to be responsible for adapting these separate inventory systems. This means we will build tools and process around these separate inventory systems rather than work toward a common standard that would actually make it easier to automate the flow of the information from the source of data entry into these cohesive agency-wide inventories. Ironically this means we're effectively promoting a strategy of vendor lock-in for a policy focused on just the opposite.

Both the open data policy and this open source policy require this inventory metadata to be entered at some point, and the challenges with manual data entry cited for the open data policy will be just the same with this policy however we implement it -- hybrid or not. With the open data policy, some agencies already had effective data management and inventory processes in place, and the policy simply meant outputting the results of those workflows in the metadata standard established by the policy. However, many agencies had no such process in place, so the policy forced them to establish that metadata entry for the very first time. It will be just the same for this policy. While we're fortunate that code management platforms are seeing wider use across government, there are surely numerous projects, both open and closed, that will either need to be migrated to such a platform or have this metadata entered manually for the first time to meet the goals of this policy.

I wholeheartedly agree that we should automate the management of this metadata as much as possible, but agencies will need to be responsible for ensuring this happens and we should work toward common standards to make the process as streamlined and brainless as possible from any source. Ultimately agencies will still need to actively think about what makes up their inventory. Otherwise, we're not asking them to take advantage of the practical benefits of open source and see what's already out there within their own organization or document anything well enough for others to do the same.

I highly recommend the alternative approach Eric suggested. Agencies should be producing complete, comprehensive inventories, and OMB should help ensure this process can be as streamlined as possible when taking advantage of existing platforms. We should also be engaging platforms like GitHub to ensure that their API and their approaches to automatically generating assets like README files can help feed this standardized metadata process as well. We took this approach with the Open Data Policy, working in the public with the broader community, and now almost all the major data inventory systems implement the same open standard. Much of the credit for that strategy and for architecting Project Open Data in general goes to @benbalter :)

@JJediny commented Apr 12, 2016

I'm John Jediny, Chief Data Engineer at Data.gov, working with @philipashlock. +1 to his comments above...

I'll add:

The simplest approach, I agree, would be to start with a single README (using a YAML/MD format) posted on any publicly accessible website and/or git repo (GitHub/GitLab/Bitbucket) that can be periodically pinged via a registered URL/URI to a central catalog. These same files, if they implement an established shared core schema, could be used as single entries in (or as a collection of entries within) a static website, which can also be used to generate a consolidated JSON file (e.g., as a collection or _data folder/file in Jekyll). Because YAML and JSON are interoperable formats, you can compile many YAML files into one JSON file, and conversely decompose one JSON file back into many YAML files. Both formats underpin much of modern code configuration/automation and work as the basis for dynamic or static APIs, because they are among the few formats that can represent a one-to-many (nested) hierarchy within a single flat file.
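(A minimal sketch of that YAML-to-JSON compilation step, assuming PyYAML is available; the file layout and field contents here are hypothetical.)

import glob
import json
import yaml  # PyYAML, assumed available

# Gather the per-project YAML files (the _data/projects/ layout is hypothetical)
projects = []
for path in sorted(glob.glob("_data/projects/*.yml")):
    with open(path) as f:
        projects.append(yaml.safe_load(f))

# Emit one consolidated JSON inventory; the reverse (one JSON file back into
# many YAML files) is just as straightforward because the formats are interoperable
with open("projects.json", "w") as f:
    json.dump({"projects": projects}, f, indent=2)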

I suggest the Project Open Source team adopt an approach similar to the distributed generation model of Project Open Data, with centralized validation and cataloging. Here are some related efforts that highlight a similar approach we are attempting for registering new data:

This approach provides the most options and the highest level of interoperability:

  1. Post the YAML/MD file on any public website, register, ping.
  2. Add the YAML/MD file to a static site, compile into a single JSON file that can be harvested/parsed/merged.
  3. Use existing repos, CMS, etc. to map/extend their data model to conform to a common standard.

However, this is all predicated on establishing said standard/spec/schema per #117.

@philipashlock commented Apr 12, 2016

A few other case studies:

The European Union has taken this same approach of federated repositories with a common metadata schema (ADMS), but I think the schema could be simplified and made less abstract than how they use it for "interoperability solutions" and I also think more effort could be made to incorporate commonly used repository management platforms like GitHub as part of their approach. Nevertheless, it's worth noting that the EU has already implemented this same federated approach for open source repositories for public administrations. For more info see https://joinup.ec.europa.eu/catalogue/repository

The civic.json convention used by some in the civic hacking space, particularly Code for America Brigades, is another relevant case study, but I'm not sure it fits into default workflows as seamlessly as things like README files (and the ways platforms like GitHub help create them). That said, perhaps we could all just agree on a common template for a README file, especially if we could help platforms (or organizations on platforms) set templates for them the same way GitHub now allows repo owners to create templates for issues and pull requests. Just a thought. Though, we'd also probably want them to stay in sync with metadata managed elsewhere in these platforms. In any case, here are some references for the civic.json convention:

@alexrollin commented Apr 18, 2016

I'm Alex, an Enterprise Architect, and would like to endorse the solutions shared by @konklone, @JJediny, and @philipashlock.

I agree that cognitive simplicity and automation are the keys to a successful foundation for what will surely evolve over time.

An approach that uses an existing schema to create a structured data source code inventory will be easier and less expensive to implement while maintaining flexibility for future development.

@andrewhoppin commented Apr 18, 2016

(I'm Andrew, currently leading an open source enterprise software line of business, and formerly the NY Senate CIO, where we launched the first ever government GitHub repository in 2009; opinions are my own.)

Regarding Software Inventory for Accountability Mechanisms (Section 6), I concur that auditing the 20% of code to be released as OSS should be facilitated through a public inventory of the software -- proprietary and OSS alike -- that is deployed by Agencies. This would be more effective than simply monitoring open source code repositories, and could potentially enable the 20% requirement to be defined in terms of contracts or dollars, not just lines of code, per #176 and #47. As much as possible, the process and metadata standards used by US Project Open Data to mandate and assist Agencies with publishing a data.json feed of their data should be leveraged and extended for Project Open Source, yielding a new public dataset to assist with Accountability. It's a design pattern. As a further benefit, government procurement officers and vendors alike would have a rich new source of data to help them work towards delivering better value to the government.

@pjdufour commented Apr 18, 2016

(I am Patrick Dufour, a Humanitarian Information Specialist & Data Engineer with about 5 years experience working with government contracting and open source software. My comments only represent my views.)

I support a YAML approach for encoding metadata about each source code repo, as it is the simplest and most scalable approach. An about.yml approach and toolkit could certainly be socialized more (officially as part of the policy, or unofficially as part of suggested best practices). I have two related comments.

First, some software (language) ecosystems and package managers are tuned for a specific documentation language. Regretfully, there will not be a way to zero out the need to duplicate some minimal documentation in both the native documentation language and in YAML. For example, Python and PyPI (https://pypi.python.org/) natively support reStructuredText (http://www.sphinx-doc.org/en/stable/rest.html), but do not support Markdown as well. If going down the about.yml approach, language-specific instructions should be provided, such as adding about.yml to a MANIFEST.in file for correctly packaging Python/PyPI code.
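(For example, a Python package could carry the metadata file in its source distribution with a one-line MANIFEST.in entry; about.yml is just the hypothetical file name from above.)

include about.yml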

Second, if a metadata requirement is added, the policy should explicitly state that the metadata requirement should not be construed as requiring consolidation of a project's related code into one monolithic codebase. Modern best practices for modularizing code should still be followed. Modular code provides the best value-add, as monolithic codebases provide a layer of obfuscation that incentivizes vendor lock-in.

@andrewhoppin

+1 on not allowing room for misinterpretation regarding consolidation @pjdufour ; good catch.

@JJediny commented May 1, 2016

+1 on @pjdufour's comment on repo(s) -> project. I would caveat that by saying it holds so long as some requirements are met: all child repos of a project are registered via a URI, either by linking in a README or, more appropriately, as whatever becomes the equivalent of distributions (re: project-open-data), and all child repos link back to their parent project in their README.

@jbjonesjr

I'm Jamie Jones, a Solutions Engineer at GitHub :octocat: supporting the Federal Government. I'm a big fan of this thing called GitHub, and a big fan of sharing code and making it open source. The opinions within are my own, but the ideas often come from many conversations I've had around the community (OSS, Govt, and commercial). Before coming to GitHub, I wrote software for the govt, and would have loved to not make YAMA (Yet Another Map App) but instead be able to reuse some better code.


cross-posting from #117 (comment)

Instead of defining metadata in a file, you could let the repository provide some of this information for you (and of course a plug for the advantages of using a modern version control system as well to house these projects ;) ). Much like using pre-existing packaging formats (package.json, pom.xml, etc), the more details that can be inferred without requiring the users to do anything, the better. As the Government slowly migrates their projects to a centralized or managed instance as defined in the Policy, limiting rework for existing projects should be a goal.

Some examples of how the GitHub API can provide more details for you:

### Repo host
Implied by GitHub URL, provided by our API: https://developer.github.com/v3/repos/#get

### Title
Provided by GitHub UI/API: https://developer.github.com/v3/repos/#get

### License
Provided by GitHub UI/API: https://developer.github.com/v3/licenses/

### Language
Provided by GitHub UI/API: https://developer.github.com/v3/repos/#list-languages

### Last update
Provided by GitHub API: https://developer.github.com/v3/repos/#get

### Keywords
Pull general information about a project from the description or the README with keyword matching (APIs: https://developer.github.com/v3/repos/#get and https://developer.github.com/v3/repos/contents/#get-the-readme)
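(A minimal sketch of assembling those fields from the endpoints above; the owner/repo names are hypothetical, and keyword extraction from the README is left out.)

import requests  # assumed available

API = "https://api.github.com"
OWNER, REPO = "example-agency", "example-repo"  # hypothetical repository

repo = requests.get(f"{API}/repos/{OWNER}/{REPO}").json()
record = {
    "repo_host": repo["html_url"],                           # Repo host
    "title": repo["name"],                                   # Title
    "license": (repo.get("license") or {}).get("spdx_id"),   # License (may be absent)
    "languages": list(requests.get(f"{API}/repos/{OWNER}/{REPO}/languages").json()),
    "last_update": repo["pushed_at"],                         # Last update
}

# Raw README text, which could feed keyword matching for the Keywords field
readme = requests.get(f"{API}/repos/{OWNER}/{REPO}/readme",
                      headers={"Accept": "application/vnd.github.v3.raw"}).text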

@skybristol

I'm all for many of the concepts discussed here and especially the idea from @jbjonesjr on letting the repos speak for themselves.

I will point out one interesting dynamic that we are addressing in at least one Fed agency (USGS) that comes from an older OMB directive - M-10-23. Depending on whether you interpret that as applying to software projects, there's a case to be made that we should provide an access option for Gov-produced software on a government server somewhere even if we use a third party distributor like GitHub as our collaborative space. We're advocating a best practice of cloning or otherwise snapshotting code to a .gov host (publicly accessible USGS-hosted instance of Bitbucket in our case) and letting people know they have that alternative when visiting a third party locale. We're also saying that the .gov hosted option for accessing the code should point to GitHub, CRAN, or other venues if those are where the actual project is being conducted. And in our case in USGS, we have the added dynamic of some of our software projects containing scientific interpretation and requiring review and approval before release; in which case the flow is either to only our own hosted option or from there out to a third party once the software is releasable for distribution (and not necessarily contribution).

There's also an interesting argument to be made that if a Government agency pays for something like a GitHub account, that could constitute an "official" government-provided option (similar to hosting a GitLab instance on a gov-funded commercial cloud provider).

At any rate, the spirit of M-10-23 is good. Let people know that they are getting official government software from an official source, and do due diligence to make sure that we've got some longevity if a third party distributor goes belly up. But it might create an interesting inventory and monitoring challenge if we have multiple access points for the same projects/codes - that could also probably be worked out by software, letting the repos speak for themselves.

@mattbailey0 (Contributor)

This discussion has moved forward significantly at GSA/code-gov-web#41
