Filter only on projects with open-source licenses #6

Open
abitrolly opened this issue Jan 23, 2020 · 11 comments

@abitrolly
Contributor

abitrolly commented Jan 23, 2020

(Housekeeping: I moved the original issue written here by @abitrolly into #8)

Enhance the OSCI algorithm to filter only projects with open-source licenses.
This will require some external datasets.
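
A minimal sketch of what such a filter could look like, assuming license data is already available as a mapping from repository name to SPDX identifier. The `OPEN_SOURCE_SPDX` allowlist and the function name are illustrative assumptions, not part of OSCI:

```python
# Hypothetical allowlist; a real one would likely follow the OSI-approved list.
OPEN_SOURCE_SPDX = {"MIT", "Apache-2.0", "GPL-3.0", "BSD-3-Clause", "MPL-2.0"}


def filter_open_source(repo_names, licenses, allowed=OPEN_SOURCE_SPDX):
    """Keep only repositories whose known license is in the allowlist.

    repo_names: iterable of "owner/repo" strings.
    licenses:   dict mapping "owner/repo" -> SPDX id (None if unknown).
    """
    # Repositories with a missing or unknown license are dropped.
    return [name for name in repo_names if licenses.get(name) in allowed]


repos = ["a/one", "b/two", "c/three"]
known = {"a/one": "MIT", "b/two": None}
print(filter_open_source(repos, known))  # ['a/one']
```

Dropping repositories with unknown licenses is a design choice in this sketch; counting them separately might be preferable while license coverage is incomplete.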

@EmbeddAlex
Contributor

That's a good idea. We were thinking about external datasets. We want to add information about repository licenses, but GHArchive.org does not contain it. As an option, we can cache the data in a DB using the GitHub API, which limits the number of requests (5000 requests per hour). Do you already have some thoughts on how to do this better?

@abitrolly
Contributor Author

abitrolly commented Jan 27, 2020

First I would count all repositories that companies have committed to. Maybe there are fewer than 50000 and an external job can gather and publish this data daily. It could even be a service for companies to track license changes.
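
The counting step could be sketched roughly as follows. This assumes events shaped like GH Archive `PushEvent` records (`type`, `repo.id`, `payload.commits[].author.email`) and a hypothetical set of company email domains; how OSCI actually attributes commits to companies is not shown here:

```python
def count_company_repos(events, company_domains):
    """Count distinct repositories that received at least one pushed commit
    whose author email belongs to one of the given domains."""
    repo_ids = set()
    for event in events:
        if event.get("type") != "PushEvent":
            continue
        commits = event.get("payload", {}).get("commits", [])
        for commit in commits:
            email = commit.get("author", {}).get("email", "")
            domain = email.rsplit("@", 1)[-1].lower()
            if domain in company_domains:
                repo_ids.add(event["repo"]["id"])
                break  # one matching commit is enough for this repo
    return len(repo_ids)


sample_events = [
    {"type": "PushEvent", "repo": {"id": 1},
     "payload": {"commits": [{"author": {"email": "dev@example.com"}}]}},
    {"type": "PushEvent", "repo": {"id": 1},
     "payload": {"commits": [{"author": {"email": "other@example.com"}}]}},
    {"type": "PushEvent", "repo": {"id": 2},
     "payload": {"commits": [{"author": {"email": "someone@gmail.com"}}]}},
    {"type": "WatchEvent", "repo": {"id": 3}, "payload": {}},
]
print(count_company_repos(sample_events, {"example.com"}))  # 1
```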

@abitrolly
Contributor Author

Another way is to patch GHArchive to parse the data. I haven't yet seen what kinds of events it receives, but the license is present on repo objects in event responses: https://developer.github.com/v3/activity/events/types/
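
A rough sketch of reading the license from an event, assuming a `PullRequestEvent` whose payload carries a full repository object at `payload.pull_request.base.repo` with a `license.spdx_id` field. Whether other event types expose the license is not confirmed here, so the function simply returns `None` when the field is missing:

```python
def license_from_event(event):
    """Return the SPDX id found on the base repo of a pull request event,
    or None if the event does not carry license information."""
    payload = event.get("payload") or {}
    pull_request = payload.get("pull_request") or {}
    repo = (pull_request.get("base") or {}).get("repo") or {}
    license_info = repo.get("license") or {}  # license may be null
    return license_info.get("spdx_id")


pr_event = {
    "type": "PullRequestEvent",
    "payload": {"pull_request": {"base": {"repo": {
        "full_name": "octocat/hello-world",
        "license": {"key": "mit", "spdx_id": "MIT"},
    }}}},
}
print(license_from_event(pr_event))               # MIT
print(license_from_event({"type": "PushEvent"}))  # None
```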

@abitrolly
Contributor Author

Yet another way is to add license info to https://release-monitoring.org/ and dump information from there. This one is limited to versioned projects that have enough value to be monitored and packaged.

@EmbeddAlex
Contributor

I can help with that. For example, at the moment we have ~2600457 unique repository IDs for 2020. On average we get information about ~1026296 unique repositories, most of which repeat every day.

> First I would count all repositories that companies have committed to. Maybe there are fewer than 50000 and an external job can gather and publish this data daily. It could even be a service for companies to track license changes.
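
To put these repository counts next to the 5000 requests per hour limit mentioned earlier in the thread, here is a back-of-the-envelope calculation. It assumes one API request per repository and a single token, both simplifications:

```python
import math

RATE_LIMIT_PER_HOUR = 5000  # limit quoted earlier in the thread


def hours_to_fetch(repo_count, per_hour=RATE_LIMIT_PER_HOUR):
    """Hours needed to issue one request per repository, rounded up."""
    return math.ceil(repo_count / per_hour)


for label, count in [("all 2020 repos", 2_600_457), ("average set", 1_026_296)]:
    hours = hours_to_fetch(count)
    print(f"{label}: {hours} h (~{hours / 24:.1f} days)")
# all 2020 repos: 521 h (~21.7 days)
# average set: 206 h (~8.6 days)
```

Under these assumptions a full refresh cannot finish within a day, which is why caching results and only re-checking repositories that are new or stale looks necessary.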

@EmbeddAlex
Contributor

EmbeddAlex commented Jan 27, 2020

> Another way is to patch GHArchive to parse the data. I haven't seen yet what kind of events it receives, but license is present on repo objects in event responses https://developer.github.com/v3/activity/events/types/

This endpoint returns a response with the license type once you fill in the :owner/:repo fields:
https://api.github.com/repos/:owner/:repo/license
GHArchive uses the following endpoint:
https://api.github.com/events
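
A small sketch of calling the license endpoint with only the standard library. The helper names are illustrative, the `token` argument is optional, and the `license.spdx_id` field is what the response is expected to contain; error handling (for example, an HTTP 404 when no license is detected) is left out:

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def license_url(owner, repo):
    """Build the license endpoint URL for one repository."""
    return f"{API_ROOT}/repos/{owner}/{repo}/license"


def spdx_id_from_response(body):
    """Pull the SPDX id out of a decoded license response, or None."""
    return (body.get("license") or {}).get("spdx_id")


def fetch_spdx_id(owner, repo, token=None):
    """Fetch the SPDX id for a repository; raises urllib.error.HTTPError
    on non-2xx responses."""
    request = urllib.request.Request(
        license_url(owner, repo),
        headers={"Accept": "application/vnd.github+json"},
    )
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request) as response:
        return spdx_id_from_response(json.load(response))


# Example (performs a network request, counts against the rate limit):
# print(fetch_spdx_id("octocat", "hello-world"))
```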

@abitrolly
Contributor Author

For a first approximation it is then easier to start with https://console.cloud.google.com/marketplace/details/github/github-repos (kaggle tutorial) and then https://libraries.io/data. But maybe resorting to http://ghtorrent.org/ is the easiest way.
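
For the BigQuery option, a lookup could be sketched as below. It assumes the public `github_repos` dataset exposes a `licenses` table with `repo_name` and `license` columns, and that the `google-cloud-bigquery` client library and GCP credentials are available; the function names are illustrative:

```python
LICENSES_TABLE = "bigquery-public-data.github_repos.licenses"


def build_license_query(repo_names):
    """Build a parameterised query returning (repo_name, license) rows."""
    sql = (
        f"SELECT repo_name, license FROM `{LICENSES_TABLE}` "
        "WHERE repo_name IN UNNEST(@repos)"
    )
    return sql, list(repo_names)


def fetch_licenses(repo_names):
    """Return a dict of repo_name -> license key for the given repositories."""
    # Imported lazily so the query builder works without the client installed.
    from google.cloud import bigquery

    sql, repos = build_license_query(repo_names)
    config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ArrayQueryParameter("repos", "STRING", repos)]
    )
    rows = bigquery.Client().query(sql, job_config=config).result()
    return {row.repo_name: row.license for row in rows}


# Example (requires GCP credentials and incurs query costs):
# print(fetch_licenses(["octocat/hello-world"]))
```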

@patrickstephens1
Member

This thread mixes two features, so I would like to separate them. Let's use this issue for the feature of finding license information and adding it to our algorithm. I'll open a separate issue to track the feature for adding location information.

@patrickstephens1 changed the title from "Local OSCI rating" to "Filter only on projects with open-source licenses" on Feb 11, 2020
@patrickstephens1
Member

By the way, the BigQuery dataset on GCP says "Last modified 20 Mar 2019, 22:03:20". It can still be useful for some test queries.

@abitrolly
Contributor Author

@patrickstephens1 maybe it is possible to send a letter to the new owners of GitHub asking whether they plan to fix the publishing of this dataset?

@patrickstephens1
Member

@abitrolly I'll dig into it. I can see this dataset on GCP was created by Google back in 2016, when GitHub was an independent company. Now that GitHub is owned by Microsoft, it might be a different situation!
