Best way to determine organization diversity for a project #15

abuhman · 2017-02-08T20:14:05Z

I created a few queries earlier that count the number of organizations or companies (they are separate concepts in the data) with pull requests on a project.

However, I have been looking more into some of the GHTorrent tables today and found that a user can be a member of multiple organizations. So if user Jane made a pull request, and Jane is a member of 4 different organizations, do we want to count 4 organizations towards diversity for that single pull request? Organizations have unique ids, so it is clear which users are members of which organizations (as compared to companies, described below). Organizations may not match one-to-one with real-world companies. For example, "Google" is a separate organization from "Google Drive", "Google Page Speed", and "Google Cloud Platform".

There is another field we can use to determine such diversity, the "company" field. Each user can have only one company. However, looking at the data in this field, I think users can type whatever they want into it. Thus, there are users whose company is "Google Inc.", others whose company is "Google, Inc.", and also those whose company is "Google".

What do others think about how to decide which user is a member of which organization or company?

Should we be using organizations or companies? I think organizations may be better due to unique ids. It is clear which users are members of which organizations, less so with companies. However, neither seems to match one-to-one with companies as we would normally think of them.
If a user is a member of multiple organizations, do we count them all when a user contributes to a project?
If a user is determined to not be a member of an organization or company (through whatever method we use) do we assume they are independent and count independent users as an organization/company?
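
To make the multi-membership question concrete, here is a minimal sketch comparing two counting rules: every organization of the author gets a full count, or each pull request carries a weight of 1 split across the author's organizations. The user names, organization names, and the "(independent)" bucket are all hypothetical; none of this comes from the GHTorrent data itself.

```python
from collections import defaultdict

# Hypothetical toy data: which organizations each user belongs to.
memberships = {
    "jane": {"org-a", "org-b", "org-c", "org-d"},
    "raj": {"org-a"},
    "lee": set(),  # no known organization
}

# Hypothetical pull requests on one project, as (pr_id, author) pairs.
pull_requests = [(1, "jane"), (2, "raj"), (3, "lee"), (4, "jane")]

# Assumed bucket for users with no organization; one possible answer to
# the "do we count independents" question, not a settled decision.
INDEPENDENT = "(independent)"


def org_weights(pull_requests, memberships, fractional):
    """Return a weight per organization across all pull requests.

    fractional=False: every organization of the author gets a full count.
    fractional=True: each pull request has a total weight of 1, split
    evenly across the author's organizations.
    """
    weights = defaultdict(float)
    for _pr_id, author in pull_requests:
        orgs = memberships.get(author) or {INDEPENDENT}
        share = 1.0 / len(orgs) if fractional else 1.0
        for org in orgs:
            weights[org] += share
    return dict(weights)


full = org_weights(pull_requests, memberships, fractional=False)
split = org_weights(pull_requests, memberships, fractional=True)
print(len(full), sum(full.values()), sum(split.values()))
```

With full counting, the total weight exceeds the number of pull requests whenever an author belongs to several organizations; with fractional counting, the total always equals the number of pull requests.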

howderek · 2017-02-08T20:26:38Z

The way I understood diversity, I believe it is the percentage of a codebase that comes from outside of a company, so it won't be attainable through SQL.

If we care about the diversity of commits, we can use a NOT IN (subquery that finds a company's orgs) to make sure a user doesn't belong to any organization that has the name of a company.

Here's an example of how you might find all orgs for a company:

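-- The join treats each organization as a row in users (org_id = users.id).
SELECT DISTINCT organization_members.org_id, users.name
FROM organization_members
JOIN users
  ON organization_members.org_id = users.id
-- Single quotes keep the string literal portable across SQL dialects.
WHERE users.name LIKE '%Google%';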

I think we care about the number of contributors and scope of contributions outside of those orgs, regardless of the orgs they belong to.
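
As a rough illustration of the NOT IN idea, the sketch below runs it against an in-memory SQLite database. The three tables are simplified stand-ins loosely modeled on the query above, and every row is made up; the real GHTorrent schema has more columns and may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, type TEXT);
CREATE TABLE organization_members (org_id INTEGER, user_id INTEGER);
CREATE TABLE commits (id INTEGER PRIMARY KEY, author_id INTEGER);

-- Made-up rows: two organizations (type 'ORG') and three people.
INSERT INTO users VALUES
  (1, 'Google', 'ORG'), (2, 'Google Cloud Platform', 'ORG'),
  (10, 'alice', 'USR'), (11, 'bob', 'USR'), (12, 'carol', 'USR');
INSERT INTO organization_members VALUES (1, 10), (2, 11);
INSERT INTO commits (author_id) VALUES (10), (10), (11), (12), (12), (12);
""")

# Count commits whose author is not a member of any organization whose
# name matches the pattern.
query = """
SELECT COUNT(*) FROM commits
WHERE author_id NOT IN (
    SELECT om.user_id
    FROM organization_members AS om
    JOIN users AS org ON om.org_id = org.id
    WHERE org.name LIKE ?
)
"""
outside = conn.execute(query, ("%Google%",)).fetchone()[0]
total = conn.execute("SELECT COUNT(*) FROM commits").fetchone()[0]
print(outside, total)
```

One caveat with NOT IN: if the subquery can return NULL user ids, the outer comparison matches nothing, so a NOT EXISTS form may be safer on real data.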

GeorgLink · 2017-02-08T20:37:20Z

Help me clarify the concepts and their implications for GitHub repositories.

Organizations appear to be constructs within GitHub that own shared repositories.

Can someone who is not a member of an organization, get commit rights on a repository?
- Thus, do all maintainers have to be members of the organization (will we ever see diversity of maintainers)?
Can members of an organization be employed by different companies (diversity within organization)?

I think companies will be better for determining diversity because it reflects the employer of a contributor (we might have to do some clustering of similar company names).
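
A minimal sketch of what a first pass at clustering similar company names could look like, assuming simple normalization (lowercasing, dropping punctuation, stripping trailing legal suffixes) is enough for the common cases. The suffix list and function names are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical list of legal suffixes to strip; extend as needed.
SUFFIXES = {"inc", "incorporated", "llc", "ltd", "corp", "corporation", "gmbh", "co"}


def normalize_company(raw):
    """Lowercase, replace punctuation with spaces, strip trailing suffixes."""
    words = re.sub(r"[^\w ]+", " ", raw.lower()).split()
    while words and words[-1] in SUFFIXES:
        words.pop()
    return " ".join(words)


def cluster_companies(names):
    """Group raw company strings by their normalized form."""
    clusters = defaultdict(list)
    for name in names:
        clusters[normalize_company(name)].append(name)
    return dict(clusters)


clusters = cluster_companies(["Google Inc.", "Google, Inc.", "Google", "google "])
print(clusters)
```

This only merges spelling variants of the same string. It will not decide whether "Google" and "Google Cloud Platform" count as one employer; that still needs a manually maintained mapping or a fuzzier matching step.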

abuhman · 2017-02-08T22:20:48Z

Georg:

Looking through the data, I do see a few examples where it appears that someone has commit rights but is not a member of the organization that owns the repo.

I suspect that organization members can be employed by different companies, but the queries I have written to confirm it are taking a while to run. I'll update when I have that information for sure.

Do you know a good way to do clustering of company names? Or should we create an issue/task for it?

Derek:
Okay, do you know of a data source that allows us to tell which companies are related to what parts of the code? Is that source something we need to be looking for?

I'm not sure how we would bring organizations together for an arbitrary name we may not be aware of. So if we specifically know we are looking for Google, we can do %Google%, but if we are going just based on a repo url I'm not sure how we could do that.

In general:
Is it useful to our project to know about diversity when it comes to commits, pull requests, etc.? Is the amount of code from different sources the only metric we will want?

Is it useful to know the number of organizations or companies contributing to a project? Or do we mainly want how many/how much of the contributions come from users that are not members of the organization that owns the project?

howderek · 2017-02-08T22:27:58Z

@abuhman We can get the number of additions/deletions from the API and compare between in the orgs and out of the orgs.
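
A minimal sketch of that comparison, assuming we already have per-commit additions and deletions plus a set of organization members. The record shape and the names below are hypothetical and not an actual API response format.

```python
# Hypothetical per-commit statistics; field names are assumptions.
contributions = [
    {"author": "alice", "additions": 120, "deletions": 30},
    {"author": "bob", "additions": 40, "deletions": 10},
    {"author": "carol", "additions": 60, "deletions": 40},
    {"author": "dave", "additions": 0, "deletions": 100},
]

# Hypothetical members of the organization that owns the repository.
org_members = {"alice", "bob"}


def outside_share(contributions, org_members):
    """Fraction of changed lines (additions + deletions) from non-members."""
    inside = outside = 0
    for c in contributions:
        changed = c["additions"] + c["deletions"]
        if c["author"] in org_members:
            inside += changed
        else:
            outside += changed
    total = inside + outside
    return outside / total if total else 0.0


share = outside_share(contributions, org_members)
print(share)
```

Counting additions plus deletions treats a large deletion the same as a large addition; whether that is the right notion of "amount of code" is an open choice.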

abuhman · 2017-02-08T22:28:49Z

Okay, great, thanks. I will start looking into that.

GeorgLink · 2017-02-08T22:40:08Z

@abuhman, thank you for investigating.

The ratio of maintainers who are not part of the organization is interesting.

> Do you know a good way to do clustering of company names? Or should we create an issue/task for it?

I do not. Maybe something along the lines of what @howderek offered: LIKE "%Google%"?
Worst case, we can go through manually and maintain a list of synonyms.

> In general:
> Is it useful to our project to know about diversity when it comes to commits, pull requests, etc.? Is the amount of code from different sources the only metric we will want?
>
> Is it useful to know the number of organizations or companies contributing to a project? Or do we mainly want how many/how much of the contributions come from users that are not members of the organization that owns the project?

Yes, we do want more than diversity in code.
We might add to and probably clarify the current list of potential diversity indicators:

Contributor Diversity
Contribution Diversity
Pull Request Diversity

germonprez · 2017-02-09T12:09:28Z

Hi all,

The diversity metric was one that came from folks at the Linux Foundation. The premise was with respect to the code base. When a company releases internal IP as an OSS project (see HP with FOSSOLOGY or NYSETechnologies with OpenMAMA), the code base will naturally be from that company. As foundations broker these projects, they'd like to see broader diversity in the code base over time. @howderek's suggestion is a great one. I do think that @GeorgLink also suggests a way to think about code base diversity a bit more broadly.

There was also the metric that came from Saucelabs that was about building diversity in the maintainers -- looking at ways to determine path/time to becoming a core member of a community. This was a bit different to me but might fall in the same general area.

sgoggins · 2017-02-14T19:34:49Z

Using Matt's background as a guide, @abuhman, @howderek, and @GeorgLink: I think what diversity means, IMHO, is a ratio of commits/repo-commits and commits/org-commits (or whatever the other measure is). Knowing these by individual and collectively will help us know the extent to which an organization is largely controlled by a small, non-diverse set of folks.

Contributor is, I think, a committer/issue creator/pull requestor combo grouping.
Pull request is specific. You probably also want metrics for the other component parts of "contributor" ...

We are aiming, I think, to tease out projects where:

People contribute only to that project
Projects are mostly the product of a small set of people
Cases where the small set of people only contribute to one project

And, of course, we want to identify projects with greater diversity.
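
A minimal sketch of the commits/repo-commits ratio and of the "small set of people" check, assuming we have one author name per commit. The function names, the top-k measure, and the toy data are hypothetical choices, not an agreed definition.

```python
from collections import Counter


def commit_shares(commit_authors):
    """Map each author to their fraction of the repository's commits."""
    counts = Counter(commit_authors)
    total = sum(counts.values())
    return {author: n / total for author, n in counts.items()}


def top_k_share(commit_authors, k):
    """Fraction of commits made by the k most active authors."""
    counts = Counter(commit_authors)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for _author, n in counts.most_common(k)) / total


def single_project_contributors(authors_by_project):
    """Return authors who appear in exactly one project."""
    seen = Counter()
    for authors in authors_by_project.values():
        seen.update(set(authors))
    return {author for author, n in seen.items() if n == 1}


# Made-up data: one author name per commit.
repo_commits = ["ann"] * 6 + ["ben"] * 3 + ["cy"]
shares = commit_shares(repo_commits)
top2 = top_k_share(repo_commits, 2)
only_one = single_project_contributors(
    {"proj-x": repo_commits, "proj-y": ["ann", "dee"]}
)
print(shares, top2, sorted(only_one))
```

A high top-k share with a small k suggests a project that is mostly the product of a few people; the same functions could be applied to issues or pull requests instead of commits.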

germonprez · 2017-02-15T13:43:40Z

Diversity is the thing here at the LFLS. The other diversity metric that showed up was geographic diversity (if that can be determined).

howderek · 2017-02-15T15:22:18Z

@germonprez Geographic diversity can be inferred! GHTorrent includes the location information people put on their page. We can use the Google Maps API to turn those vague location names into more detailed location information including coordinates.

GeorgLink · 2017-02-15T15:26:34Z

Another diversity metric is gender diversity. Bitergia reported using first names and combining them with geographic probabilities to estimate a likely gender.

howderek · 2017-02-15T15:29:45Z

@GeorgLink I'm totally ready :) https://github.com/howderek/name-gender-csv

GeorgLink · 2017-02-15T15:44:55Z

Awesome!

ChristianCme · 2017-02-15T16:24:36Z

In the GHTorrent schema, there are fields for latitude, longitude, country code, city, and state. However, I could not find any information on how those fields are populated.

In the msr14 data set there is only a location field. About 45% of users have info in it, with values ranging from a city and state to "Interwebz".

sgoggins · 2017-02-15T16:28:57Z

Gender diversity can be determined partly by names. We should look for a "bag of names" that includes probabilities that different names indicate specific genders. Somebody has likely done this. Somebody, possibly @GeorgLink, might want to follow this thread up and down citation-wise: http://eprints.qut.edu.au/8014/1/8014.pdf. If we measure gender, we need to be systematic and methodical. We also need to make our algorithm and "bag of names" transparent and available as part of this or another public GitHub repository.

I point this out because gender issues in computing are epidemic, and we need to be careful not to build metrics that reinforce current biases and gender equity issues in computing. Algorithms have values, and ours need to be explicit and, I think, forward-thinking. I suggest we vet our approach with a panel of people including Katie Siek at Indiana and Irina Shklovski at IT University in Copenhagen (two people I know will be willing to help us).
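
To show what a transparent "bag of names" lookup might look like, here is a minimal sketch that returns probabilities rather than a single label and declines to answer when evidence is thin. The counts are made up for illustration, and the threshold and function name are hypothetical; a real table would come from a published, documented source.

```python
# Made-up counts for illustration only, not real statistics.
NAME_COUNTS = {
    "alex": {"female": 40, "male": 60},
    "maria": {"female": 990, "male": 10},
    "sam": {"female": 3, "male": 2},
}

# Assumed minimum number of observations before reporting anything.
MIN_SAMPLES = 50


def name_probabilities(first_name, counts=NAME_COUNTS, min_samples=MIN_SAMPLES):
    """Return {label: probability}, or None when there is too little evidence."""
    row = counts.get(first_name.strip().lower())
    if not row:
        return None
    total = sum(row.values())
    if total < min_samples:
        return None
    return {label: n / total for label, n in row.items()}


print(name_probabilities("Maria"))
print(name_probabilities("Sam"))
```

Keeping the output as a probability, with an explicit "unknown" result, makes the uncertainty visible in any aggregate metric. A table of this shape can only reflect the categories its source recorded, which is one of the limitations a review panel would need to weigh.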

ChristianCme · 2017-02-15T16:35:36Z

As for data, here is a set created using info from Social Security for US names and similar organizations for UK names. https://github.com/OpenGenderTracking/globalnamedata

Edit: I didn't notice the data @howderek linked to earlier.

ChristianCme · 2017-02-20T17:50:31Z

I added some simple queries that associate locations with repo members, and another one that counts commits that come from a location. Both could be useful in creating different heat maps.

sgoggins · 2017-10-06T15:21:23Z

We have initiated this development and will, if we can, hit it in the next release.

sgoggins · 2018-02-27T21:43:33Z

Lots of discussion here, so let's put this on the road map, @howderek @ChristianCme @ccarterlandis
Thanks!
Sean

howderek · 2018-10-12T04:18:16Z

This discussion has moved to the CHAOSS D&I working group

sgoggins added this to To Do in Front End Developer Update Release 0.4.0 Oct 6, 2017

sgoggins added this to the 0.4.0: Red Pumpkin milestone Oct 6, 2017

sgoggins assigned sgoggins, howderek and ChristianCme Oct 6, 2017

sgoggins added the feature-request Request for a new feature in Augur label Oct 6, 2017

sgoggins moved this from To Do to In Progress in Front End Developer Update Release 0.4.0 Dec 17, 2017

sgoggins removed this from the 0.4.0: Red Pumpkin milestone Dec 18, 2017

sgoggins added this to In progress in Major Road Map for Metrics Dec 18, 2017

howderek closed this as completed Oct 12, 2018

sgoggins pushed a commit that referenced this issue Jan 4, 2021

Merge pull request #15 from malkrc/gophers-frontend

4cdbb3e

Gophers frontend
