Technical Progress General Inquiry #4

Closed
sgoggins opened this issue Jan 23, 2017 · 6 comments

Comments

@sgoggins
Member

sgoggins commented Jan 23, 2017

Hi everyone:

@bkeepers @howderek @wingr especially:

I am writing to give you an overview of our technical progress and to give advance notice that we may later request accelerated or privileged access to the GitHub API. There are headings below so you can scan.

BACKGROUND (hopefully we all share this by now, for the most part):

Looking at changes over time in GitHub repositories will be essential to the aims of our project: understanding their health and sustainability. We hypothesize the following (and, based on preliminary work, we think there is a reasonable likelihood that we are right):

  1. H1: There is a relationship between derivable indicators of repository activity on GitHub and the type of organization governing the project.
  2. H2: There is a relationship between derivable indicators of repository activity on GitHub and performance, as perceived from the perspective of various stakeholders.
  3. H3: Different stakeholders (owners, contributors, users, regulators, etc.) will be influenced by different combinations of indicators.

I think these follow, as lower-level operational tests, from our research questions:

  1. How and to what extent are community health and sustainability indicators identifiable from GitHub open source community data?
  2. What are dominant genres of community based on health and sustainability indicators, and how and to what extent are health and sustainability indicators different between these communities?
  3. How and to what extent are health and sustainability indicators understood by community owners and other stakeholders?
  4. How and to what extent do health and sustainability indicators change over time as communities evolve to include increased membership, new governance structures, and support from foundations?

TECHNICAL APPROACH:

Here, to some extent, we are looking to Brandon and Rowan to validate that we are not missing any key concepts or attributes of the available resources from GitHub. In particular, if there are limitations in the data archives and torrents we are referencing, those would be good to be aware of.

  1. We are doing our indicator development against GHTorrent and the GitHub Archive.

    1. Since data about deleted repositories and users may play a role in our research, it's necessary to use archives of GitHub data as opposed to the timestamp information included in GitHub API requests.
    2. From our initial exploration, it appears that two projects will meet our needs: GHTorrent and GitHub Archive.
    3. GHTorrent provides a SQL database of metadata created from the events stream, and GitHub Archive archives those events themselves.
    4. There is a lot of overlap between the datasets, but both are needed. We also need a fast interface to the data, such as the SQL database that GHTorrent populates.
  2. Once indicators are mature enough to evaluate (estimated 4-6 weeks), we will need more current information to validate them with project stakeholders, who will likely recall last week's events better than those from a month or two ago. We think more current indicators will also be more compelling for GitHub users generally. To that end:

    1. The data we use will need to become quite “up to date”. What is the best strategy?
      1. Daily dumps provided by the GitHub Archive to fill in the gaps between the SQL backups provided by GHTorrent and the realtime data provided by the GitHub API?
      2. Privileged API Access?
      3. Both?
      4. Other?
    2. Ideally, we would like to demonstrate indicators and provide an indicator exploration site with the hope of prototyping a system that could be used to gain wider evaluation of the indicators (from GitHub’s ecosystem).
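
To make the "fill in the gaps" option concrete, here is a minimal sketch, assuming the archive events arrive as newline-delimited JSON records with `type`, `repo.name`, and `created_at` fields (our understanding of the GitHub Archive event format, not something confirmed in this thread). The function names are hypothetical. It keeps only events newer than the last SQL snapshot and counts activity per repository and event type:

```python
import json
from collections import Counter
from datetime import datetime, timezone


def parse_timestamp(value):
    """Parse an ISO 8601 UTC timestamp such as '2017-01-23T10:00:00Z'."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def events_after_snapshot(lines, snapshot_cutoff):
    """Yield archive events created strictly after the SQL snapshot cutoff.

    `lines` is any iterable of JSON strings, one event per line (assumed format).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        event = json.loads(line)
        if parse_timestamp(event["created_at"]) > snapshot_cutoff:
            yield event


def activity_by_repo(events):
    """Count events per (repository name, event type) pair."""
    counts = Counter()
    for event in events:
        counts[(event["repo"]["name"], event["type"])] += 1
    return counts
```

The resulting counts could then be added to the same per-repository totals derived from the SQL snapshot, so that an indicator covers both the archived period and the most recent days.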

Perhaps this is too much for an email and a call is warranted? But I thought I would start here!

Thanks!

@sgoggins
Member Author

sgoggins commented Feb 2, 2017

Hey @bkeepers and @wingr : A few thoughts on our approach here would be most welcome. :)

@GeorgLink
Member

> How and to what extent are health and sustainability indicators understood by community owners and other stakeholders?

Just a note: The discussion on developing an understanding of the indicators occurs in the HealthIndicators repository.

@bkeepers

bkeepers commented Feb 5, 2017

Hey @sgoggins, sorry for the delay here. I just wanted to give you a heads up that I'm at FOSDEM right now and it'll be a few more days before I get a chance to reply.

@wingr

wingr commented Feb 7, 2017

@sgoggins likewise sorry for the delayed response.

Looking over the technical section, your approach looks sound to me. I also believe that GHTorrent and GitHub Archive are going to be your best sources of information, although @bkeepers knows a little more about the public data sources than I do. I believe that you should be able to get what you need from the GitHub Archive without needing privileged API access, and that it should provide you with enough up-to-date information since it is updated hourly.
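
As a rough illustration of working with hourly archive files, the sketch below lists one file URL per hour in a time window and decodes a downloaded file. The host `https://data.gharchive.org` and the `YYYY-MM-DD-H.json.gz` naming (hour not zero-padded) are assumptions about how the archive is published and may differ from what was in place when this was written; the function names are hypothetical.

```python
import gzip
import json
from datetime import datetime, timedelta


def hourly_archive_urls(start, end, base_url="https://data.gharchive.org"):
    """List one archive URL per hour in [start, end), truncating start to the hour."""
    current = start.replace(minute=0, second=0, microsecond=0)
    urls = []
    while current < end:
        # Assumed naming scheme: YYYY-MM-DD-H.json.gz with an unpadded hour.
        urls.append(f"{base_url}/{current:%Y-%m-%d}-{current.hour}.json.gz")
        current += timedelta(hours=1)
    return urls


def decode_archive(payload):
    """Decode gzipped, newline-delimited JSON bytes into a list of event dicts."""
    text = gzip.decompress(payload).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```

Each URL could be fetched with any HTTP client and the response body passed to `decode_archive`; the download step is left out here so the sketch stays independent of network access.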

There are also a number of scripts and wrapper code that people have created to help pull data from these sources that you can find by Googling.

A team I work closely with is in the process of trying to get better documentation around using these public data sets, so ping me with questions or challenges that you encounter and I will pass them along and try to help where I can.

@bkeepers

bkeepers commented Mar 5, 2017

Hey @sgoggins, I agree with @wingr that GHTorrent and GitHub Archive are going to be your best sources of information.

As for keeping the information up to date, I don't have any great advice at the moment, but if this is still a challenge, I could connect you with some folks I know who keep an internal copy of the GHTorrent data set to see how they do it.

How's everything going with regard to access to the data?

@howderek
Contributor

We got this worked out!

sgoggins pushed a commit that referenced this issue Dec 28, 2020
Pull most recent dev
sgoggins pushed a commit that referenced this issue Jan 4, 2021
sgoggins pushed a commit that referenced this issue Jul 13, 2021
sgoggins pushed a commit that referenced this issue Dec 13, 2021
sgoggins pushed a commit that referenced this issue Apr 14, 2022