Skip to content
How do tech companies rank amongst themselves when it comes to github.com activity?
Branch: master
Clone or download
Latest commit 4d7098b Mar 11, 2019
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
.github 🤲 baseline open sourcing files Feb 20, 2019
bin
db 🚜 small fix for top adobe contributors, and added top repos adobe con… Aug 10, 2018
src Revert IBM / Red Hat merging Mar 11, 2019
.gitignore
CODE_OF_CONDUCT.md 🤲 baseline open sourcing files Feb 20, 2019
LICENSE 🤲 baseline open sourcing files Feb 20, 2019
README.md 📝 final readme tweaks Feb 20, 2019
package-lock.json
package.json 📝 documentation updates! Feb 20, 2019

README.md

Tracking Open Source Contributors

Build a(n improved) ranking of companies-as-contributors-to-public-GitHub (based on this blog post).

Too Long; Didn't Read

Pretty graphs over here.

Why?

The user-to-company association in the ranking blog post that inspired us is not ideal: it uses the email associated to a git config, and if the domain to the email is NOT of a public mail provider (gmail, yahoo, etc), it assumes it's a company. That's not a great way of going about it because not many people use their company's e-mail in the git config they use with their public GitHub.com account.

To make that association better, this project cross-reference GitHub.com activity, which is tracked via githubarchive.org data (and is freely available as a dataset in Google BigQuery) with GitHub.com user profiles. We pull the company field from user's profiles and store those in a periodically-updated (currently monthly) database that we then copy over into BigQuery.

Features

  • Leverages githubarchive.org's freely available dataset on Google BigQuery to track public user activity on GitHub.
  • A GitHub.com REST API crawler that pulls users' company associations (based on their public profile), that we then store in a database (and periodically update).
  • Tracking and visualizing GitHub contributors from tech companies' activity over time in a spreadsheet.

Implementation

We have a BigQuery project with relevant supporting tables and queries. If you'd like access, contact @filmaj (via an issue in this repo or on twitter). This project contains:

  1. A database table tracking user-company associations (currently done in an Adobe IT managed MySQL DB). Fields include GitHub username, company field, fingerprint (ETag value as reported from GitHub, as a cache-buster). We synchronize the MySQL DB with BigQuery every now and then using a command this program provides.
  2. Another table tracks GitHub usernames active over a certain time period.
  3. For each active user identified in (2), we pound the GitHub REST API to pull user profile info, and drop the company field from that info into the DB table described in (1).

How Are Companies Tracked?

Check out the src/util/companies.js file. How it works:

  1. There is a "catch-all" regular expression (🤡) that tries to match on known tech company names.
  2. If a match is detected, then we try to map that back to a nicer label for a company name. Note that multiple expressions from the company catch-all may map to a single company (e.g. AWS, AMZN and Amazon all map back to Amazon).

JSON Cache

This project will "cache" the results stored in the MySQL DB locally in a JSON file in order to avoid doing round trips to the DB. At the time of this writing this file is roughly ~330MB and tracking 6.25 million user profiles.

TODO

  1. Describe how to use bigquery in conjunction with this repo.
  2. Real-time visualization of the data.
  3. Tests.

Requirements

  • Node.js 9+
  • a BigQuery account, and a bigquery.json file is needed in the root of the repo, which contains the credentials for access to Google Cloud BigQuery. More info on how to set this file up is available on BigQuery docs.
  • a oauth.token file is needed in the root of the repo, which contains GitHub.com personal access tokens, one per line, which we will use to get data from api.github.com. In my version of this file, I have several tokens (thanks to all my nice friends who graciously granted me one) as there is a maximum of 5,000 calls per hour to the GitHub REST API.
  • a MySQL database to store user-company associations. Currently using an Adobe-IT-managed instance: hostname leopardprdd, database name, table name and username are all GHUSERCO, running on port 3323. @filmaj has the password. The schema for this table is under the usercompany.sql file.

Doing The Thing

$ npm install
$ npm link

At this point you should be able to run the CLI and provide it subcommands:

Updating MySQL DB of User-Company Affiliations

This command will pull the rows from a bigquery table containing github.com usernames, pull user profile information for each user from the GitHub.com REST API and store the result of the company field (and the ETag) in a MySQL DB table.

$ node --max-old-space-size=8192 bin/oss.js update-db <bigquery-table-of-user-activity> <json-cache-file>

Running this command and pointing it to a bigquery table containing ~1.5 million github.com usernames, on last run (Feb 2018), took about 6 days.

Writing DB to JSON Cache

This command simply dumps the contents of the user-company affiliations stored in MySQL into a local file.

$ node --max-old-space-size=8192 bin/oss.js db-to-json <json-cache-file>

On last run (Feb 2018), this command took a few minutes to complete.

Uploading Results Back to BigQuery

This command will push the the contents of the JSON file back up to BigQuery. This command will delete the table you specify before pushing up the results.

$ node --max-old-space-size=8192 bin/oss.js json-to-bigquery <json-cache-file> <bigquery-table-of-user-company-affiliations>

On last run (Feb 2018), this command took a few minutes to complete.

Putting It All Together

If you're still with me here: wow, thanks for sticking it out. How all of this fits together:

  1. Run the incremental user activity query on BigQuery, and store the result in a new table. I usually run this on a monthly basis, but you are free to use whatever time interval you wish.
  2. Run this program's update-db command, specifying the bigquery table name you created in (1), to get the latest company affiliations for the users identified in (1) stored in your MySQL DB. This usually takes days. You have been warned.
  3. Run this program's db-to-json command to make sure you have the latest user-company affiliations stored locally in a JSON file. This should only take a few minutes.
  4. Run this program's json-to-bigquery command to send these affiliations up to bigquery. Note that the table you specify to store these affiliations in, if it already exists, will be deleted. This should only take a few minutes.
  5. Run the contributor-count, repo-count and stars-accrued query on BigQuery, and store the result in a new table. This query will look at all github activity over the time period you specify (top of the query) and correlate it with the user-company affiliations table we created in (4). Make sure you use the correct table name for the user-company affiliations in the query (search for JOIN). BigQuery is awesome so this should never take more than a minute, though do keep an eye on your bill as, well, money goes fast ;)
  6. Bask in sweet, sweet data.

Contributing

Firstly, check out our contribution guidelines. Secondly, there are probably way better ways of doing this! For example, I've noticed that the company field info is somewhat available directly in BigQuery, so probably the whole "use a MySQL DB" thing is dumb. I'm grateful for any help 🙏.

You can’t perform that action at this time.