## Efficiently crawl data from GitHub's GraphQL API

What the current data discovery pipeline does:

1. Crawl a target keyword (`uw-madison`) with the search API.
1. Return these selected fields:
    - repo's name
    - repo's description
    - repo's owner (login account)
    - repo's url
    - repo's creation timestamp
    - repo's timestamp of the last push to the default branch
    - repo's total number of stars as of now
    - repo's total number of commits on the default branch as of now
    - repo's readme

Note to self:

- Utilize GraphQL for Efficiency: GraphQL allows you to request exactly the data you need, reducing unnecessary data transfer.
- Public Schema for Data Structure: This [public GH GraphQL schema](https://docs.github.com/en/graphql/overview/public-schema) is quite comprehensive.
- Implement Graceful Retry for Rate Limits: GitHub's GraphQL API enforces rate limits, so requests should go through a retry mechanism that backs off when a limit is reached instead of failing. Refer to the rate limits [documentation](https://docs.github.com/en/graphql/overview/rate-limits-and-node-limits-for-the-graphql-api) for guidance.
- Batch Queries by Year due to Max Results Limit: A single search query returns at most 1,000 results. To work around this, segment the queries by year (or another logical division) so that each segment stays under the limit and all desired data can still be retrieved.
- Review and Adjust the Base Query as Necessary: [Existing query](ospo_stats/query.py) may need some changes, especially for handling data updates efficiently. Given the brief runtime of 3 minutes for the entire crawl, the current setup seems fine.
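
The year batching and retry notes above can be sketched as follows. This is a minimal illustration, not the code in `ospo_stats`: `yearly_search_queries` and `call_with_retry` are hypothetical helper names, the `created:` date-range qualifier is assumed to be how the search is segmented, and `RuntimeError` stands in for whatever error the real client raises when it is rate limited.

```python
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


def yearly_search_queries(keyword: str, start_year: int, end_year: int) -> Iterator[str]:
    """Yield one search string per calendar year (hypothetical helper).

    Assumption: results are segmented with GitHub's `created:` date-range
    qualifier so that each year stays under the 1,000-result cap.
    """
    for year in range(start_year, end_year + 1):
        yield f"{keyword} created:{year}-01-01..{year}-12-31"


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying with exponential backoff (hypothetical helper).

    Assumption: a rate-limited request surfaces as RuntimeError; swap in the
    exception type of the HTTP client that is actually used.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except RuntimeError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...
    raise AssertionError("unreachable")


if __name__ == "__main__":
    for q in yearly_search_queries("uw-madison", 2010, 2012):
        print(q)
```

A real crawler would probably prefer the wait time reported by the API over a fixed backoff schedule; the fixed schedule here only keeps the sketch short.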



In [1]:
import logging
import pandas as pd
import altair as alt

from ospo_stats.parser import load
from ospo_stats.gh import discover_repos
logging.basicConfig(level=logging.INFO)

In [2]:
# Around 3 minutes crawl time from 2010 to 2024
# Run once
# discover_repos("uw-madison", 2010, 2024, overwrite=True, output_dir="data/discovery")

## Visualizations

- Cumulative repos by year
- Cumulative stars by year? (Not implemented yet)
- Cumulative commits by year? (Not implemented yet)

### Calculate cumulative metrics by year

In [3]:
df = load("data/discovery")
df['year'] = pd.to_datetime(df['created_at']).dt.year

cumulative_repos = df.groupby('year').name.nunique().cumsum()
plot_df = cumulative_repos.reset_index(name='n')

plot_df.head(5)

INFO:root:data/discovery/repos_2011.json
INFO:root:https://github.com/aaronb/xv6 had no README file
INFO:root:https://github.com/khazelton/math801f11 had no README file
INFO:root:data/discovery/repos_2012.json
INFO:root:https://github.com/jklukas/uwthesis had no README file
INFO:root:https://github.com/bploeckelman/cs559-project1 had no README file
INFO:root:https://github.com/bploeckelman/cs559-project3 used a lowercase README file
INFO:root:https://github.com/UWMadisonUcomm/uwmadison_events-wp had no README file
INFO:root:https://github.com/nmillin/uw_madison_omega had no README file
INFO:root:https://github.com/bploeckelman/cs559-project2 had no README file
INFO:root:https://github.com/schnottus/679_P2 had no README file
INFO:root:https://github.com/listrophy/programming-happiness had no README file
INFO:root:https://github.com/kah2011/isthmuswebsite had no README file
INFO:root:https://github.com/adammaus/CS761-Project had no README file
INFO:root:https://github.com/mrkline/painter

Unnamed: 0,year,n
0,2011,2
1,2012,25
2,2013,47
3,2014,84
4,2015,139


### Plot cumulative metrics by year

In [4]:
plot = alt.Chart(plot_df).mark_line(point=True).encode(
    x='year:O', 
    y='n:Q',
).properties(
    title="Yearly Growth of UW–Madison's Open-Source Repositories on GitHub",
    width=600,
    height=400
)

plot

### Somewhat interesting statistics

- Top-10 most starred repos
- Top-10 most committed repos

In [5]:
df.sort_values('stars', ascending=False).head(10)

Unnamed: 0,name,description,owner,url,created_at,pushed_at,stars,issues,commits,readme,year
513,stat453-deep-learning-ss20,STAT 453: Intro to Deep Learning @ UW-Madison ...,rasbt,https://github.com/rasbt/stat453-deep-learning...,2020-01-20T23:20:22Z,2020-05-01T22:34:04Z,545,2,84,# STAT 453: Introduction to Deep Learning and ...,2020
694,stat453-deep-learning-ss21,STAT 453: Intro to Deep Learning @ UW-Madison ...,rasbt,https://github.com/rasbt/stat453-deep-learning...,2021-01-28T03:46:42Z,2022-02-03T22:57:35Z,398,4,44,# stat453-deep-learning-ss21\nSTAT 453: Intro ...,2021
514,stat451-machine-learning-fs20,STAT 451: Intro to Machine Learning @ UW-Madis...,rasbt,https://github.com/rasbt/stat451-machine-learn...,2020-08-06T23:42:45Z,2020-12-03T05:43:06Z,366,1,32,[![Binder](https://mybinder.org/badge_logo.svg...,2020
830,Kaggle-UWMGIT,code for kaggle: UW-Madison GI Tract Image Seg...,CarnoZhao,https://github.com/CarnoZhao/Kaggle-UWMGIT,2022-05-08T07:39:51Z,2022-10-18T17:34:22Z,87,7,437,### Introduction\n\nHello!\n\nBelow you can fi...,2022
375,prepare-tax-return,UW-Madison CS 硕士生报税经历,mzj14,https://github.com/mzj14/prepare-tax-return,2019-03-19T00:38:09Z,2019-03-24T15:31:33Z,49,0,8,# prepare-tax-return\n作为UW-Madison CS 硕士生的个人报税...,2019
264,madgrades.com,Frontend for visualizing UW Madison course gra...,Madgrades,https://github.com/Madgrades/madgrades.com,2018-02-27T08:07:01Z,2024-01-10T09:00:30Z,47,10,121,# madgrades.com [![Build and deploy to prod](h...,2018
49,JuliaWorkshop,Materials for a workshop on Julia programming ...,dmbates,https://github.com/dmbates/JuliaWorkshop,2014-06-17T14:46:20Z,2019-06-10T17:29:01Z,31,0,50,JuliaWorkshop\n=============\n\nMaterials for ...,2014
370,UW-Madison-CS540-Introduction-to-AI,UW-Madison CS540: Introduction to Artificial I...,learlinian,https://github.com/learlinian/UW-Madison-CS540...,2019-02-10T00:55:59Z,2021-10-09T08:51:56Z,24,0,48,# UW-Madison CS540: Introduction to Artificial...,2019
371,uw-madison-datacience-club-talk-oct2019,Slides and code for the talk at UW-Madison's D...,rasbt,https://github.com/rasbt/uw-madison-datacience...,2019-10-11T02:44:06Z,2019-10-11T02:49:31Z,20,0,2,# uw-madison-datacience-club-talk-oct2019\nSli...,2019
268,Intro-to-MRI,This is Jupyter notebook/python code developed...,kmjohnson3,https://github.com/kmjohnson3/Intro-to-MRI,2018-11-05T16:05:57Z,2024-03-04T23:23:29Z,19,0,41,# Intro-to-MRI\nThis is Jupyter notebook/pytho...,2018


- Most of the popular repos are course materials.
- We need some way to categorize repos, e.g., an LLM over the readme / description.

### Top-10 most committed repos

In [6]:
df.sort_values('commits', ascending=False).head(10)

Unnamed: 0,name,description,owner,url,created_at,pushed_at,stars,issues,commits,readme,year
1234,WisconsinDiamondQubitLab,Codebase to run the Wisconsin Diamond Qubit La...,GardillA,https://github.com/GardillA/WisconsinDiamondQu...,2023-07-13T16:58:49Z,2023-07-18T03:42:14Z,0,0,3180,This codebase is based off of the codebase ori...,2023
50,2014-08-25-wisc,Software Carpentry Boot Camp at UW-Madison,UW-Madison-ACI,https://github.com/UW-Madison-ACI/2014-08-25-wisc,2014-07-29T02:25:37Z,2014-08-15T18:45:16Z,3,0,1519,Software Carpentry Bootcamps\n================...,2014
319,2018-15-08-UW-Madison,,mkamenet3,https://github.com/mkamenet3/2018-15-08-UW-Mad...,2018-08-15T22:07:58Z,2018-08-15T22:12:57Z,0,0,1315,# workshop-template\n\nThis repository is [Sof...,2018
274,2018-06-04-uwmadison-dc,Website for June 2018 Data Carpentry workshop ...,UW-Madison-ACI,https://github.com/UW-Madison-ACI/2018-06-04-u...,2018-05-01T17:57:52Z,2018-08-29T11:48:25Z,0,0,1160,"## Data Carpentry @ UW Madison\n### June 4-5, ...",2018
279,2018-06-06-uwmadison-swc,Website for June 2018 Software Carpentry works...,UW-Madison-ACI,https://github.com/UW-Madison-ACI/2018-06-06-u...,2018-05-01T17:58:43Z,2018-06-05T21:19:48Z,0,0,1147,# workshop-template\n\nThis repository is [Sof...,2018
87,2016-01-14-uwmadison,http://uw-madison-aci.github.io/2016-01-14-uwm...,UW-Madison-ACI,https://github.com/UW-Madison-ACI/2016-01-14-u...,2015-11-24T20:08:35Z,2016-01-15T21:48:46Z,1,0,632,# workshop-template\n\nThis repository is [Sof...,2015
826,Scavenge,Search for food you need at local food pantrie...,Scavenge-UW,https://github.com/Scavenge-UW/Scavenge,2021-02-13T00:52:25Z,2021-05-01T04:23:23Z,2,97,620,# Scavenge\nSearch for food you need at local ...,2021
86,2015-08-26-uw-madison,,UW-Madison-ACI,https://github.com/UW-Madison-ACI/2015-08-26-u...,2015-07-30T20:10:53Z,2015-08-25T17:28:51Z,0,1,588,# workshop-template\n\nThis repository is [Sof...,2015
88,2016-01-11-uwmadison,http://uw-madison-aci.github.io/2016-01-11-uwm...,UW-Madison-ACI,https://github.com/UW-Madison-ACI/2016-01-11-u...,2015-11-24T14:44:21Z,2016-01-14T17:47:12Z,0,0,555,=======\n\nThis repository is the [Data Carpen...,2015
105,uw-madison-wp-2015,,uwmadisoncals,https://github.com/uwmadisoncals/uw-madison-wp...,2015-05-18T15:27:02Z,2019-04-12T16:55:46Z,0,11,555,\nA UW-Madison Wordpress Theme\n===\n\nThis is...,2015


In [7]:
print(f"we have a total of {len(df)} repos, and on average, they have {df.commits.mean():.1f} commits")

we have a total of 1299 repos, and on average, they have 35.8 commits


- I looked at one of the top repos: many of its commits come from CI bots. We need to filter out those bot commits.
- Activeness can be defined as the number of commits in the past year. We need historical commit data for it. Estimated crawl time = 1.1 calls * 3 s * 1299 repos ≈ 71 minutes < 2 hours... not as bad as I thought.
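
A rough sketch of the two notes above, filtering bot commits and counting commits in the past year. It assumes commit records are dicts with the `committed_at`, `committer_name` and `committer_email` fields that appear in the parsed commit history later in this notebook. `is_bot` and `recent_human_commits` are hypothetical helpers, and matching on a `[bot]` marker is only a guess at how CI bots show up in the data.

```python
from datetime import datetime, timedelta, timezone

# Assumption: bot accounts carry a "[bot]" marker in the committer name or
# email (e.g. "dependabot[bot]"); real data may need a longer list.
BOT_MARKERS = ("[bot]",)


def is_bot(commit: dict) -> bool:
    """Return True if the committer name or email contains a bot marker."""
    text = f"{commit.get('committer_name', '')} {commit.get('committer_email', '')}".lower()
    return any(marker in text for marker in BOT_MARKERS)


def recent_human_commits(commits: list[dict], now: datetime, days: int = 365) -> int:
    """Count non-bot commits made within `days` days before `now`."""
    cutoff = now - timedelta(days=days)
    count = 0
    for c in commits:
        # Timestamps look like "2024-02-22T18:42:10Z"; make the UTC offset explicit.
        committed = datetime.fromisoformat(c["committed_at"].replace("Z", "+00:00"))
        if committed >= cutoff and not is_bot(c):
            count += 1
    return count


if __name__ == "__main__":
    sample = [
        {"committed_at": "2024-02-22T18:42:10Z", "committer_name": "A Person", "committer_email": "a@example.com"},
        {"committed_at": "2024-02-20T00:00:00Z", "committer_name": "dependabot[bot]", "committer_email": "bot@example.com"},
    ]
    print(recent_human_commits(sample, now=datetime(2024, 3, 1, tzinfo=timezone.utc)))
```

Passing `now` explicitly keeps the count reproducible when the crawl and the analysis run on different days.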

## Future plans / discussion



## Historical details (Work in progress)

Not going to pull all data for now. Need to discuss what to pull.

In [8]:
from ospo_stats.gh import get_stargazers, get_commits
from ospo_stats.parser import parse_stargazers, parse_commits

### Example of crawling historical star history in a repo

In [9]:
gs = get_stargazers("ad-freiburg", "qlever")
gs = [parse_stargazers(g) for g in gs]
pd.DataFrame(gs).head(5)


INFO:root:Obtained stargazers: 100 / 252
INFO:root:Obtained stargazers: 200 / 252
INFO:root:Obtained stargazers: 252 / 252


Unnamed: 0,starred_at,user
0,2015-07-16T12:46:16Z,hello009-commits
1,2016-01-20T16:37:51Z,titsuki
2,2016-07-28T13:05:09Z,Buchhold
3,2016-12-22T12:45:22Z,PawelMarc
4,2017-01-10T21:48:18Z,diegoesteves
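
With per-star timestamps like these, the "cumulative stars by year" chart listed under Visualizations could be derived roughly as below. This is a sketch only: it assumes each parsed stargazer is a dict with a `starred_at` ISO timestamp, as in the table above, and `cumulative_stars_by_year` is a hypothetical helper that does not exist in `ospo_stats`.

```python
from collections import Counter
from itertools import accumulate


def cumulative_stars_by_year(stargazers: list[dict]) -> dict[int, int]:
    """Map each year that has at least one star to the running total of stars."""
    # The first four characters of an ISO timestamp are the year.
    per_year = Counter(int(s["starred_at"][:4]) for s in stargazers)
    years = sorted(per_year)
    totals = accumulate(per_year[y] for y in years)
    return dict(zip(years, totals))


if __name__ == "__main__":
    rows = [
        {"starred_at": "2015-07-16T12:46:16Z"},
        {"starred_at": "2016-01-20T16:37:51Z"},
        {"starred_at": "2016-07-28T13:05:09Z"},
        {"starred_at": "2016-12-22T12:45:22Z"},
        {"starred_at": "2017-01-10T21:48:18Z"},
    ]
    print(cumulative_stars_by_year(rows))
```

Years without any new star are simply absent from the result, so a plot would need to fill those gaps if a continuous x-axis is wanted.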


### Example of crawling historical commit history in a repo

In [10]:
cs = get_commits("jasonlo", "funsearch")
cs = [parse_commits(c) for c in cs]
pd.DataFrame(cs).head(5)

INFO:root:Obtained commits: 51 / 51


Unnamed: 0,committed_at,url,additions,deletions,committer_name,committer_email
0,2024-02-22T18:42:10Z,https://github.com/JasonLo/funsearch/commit/97...,66,50,Jason Lo,lcmjlo@gmail.com
1,2024-02-22T17:23:06Z,https://github.com/JasonLo/funsearch/commit/02...,18,0,Jason Lo,lcmjlo@gmail.com
2,2024-02-22T17:08:18Z,https://github.com/JasonLo/funsearch/commit/52...,441,407,Jason Lo,lcmjlo@gmail.com
3,2024-02-22T17:05:22Z,https://github.com/JasonLo/funsearch/commit/31...,155,115,Jason Lo,lcmjlo@gmail.com
4,2024-02-22T07:08:17Z,https://github.com/JasonLo/funsearch/commit/71...,331,289,Jason Lo,lcmjlo@gmail.com


### To-dos

- Add community metric (health %)
- Get accurate historical data for stars and commits instead of relying on the current totals.
- Get some inspiration from: https://r-universe.dev/search/
- Clarify what detailed information is needed for outreach. Exemplary OSS repos?
- Categorize repo type (e.g., course content, software) from the repo's `description` and `readme` with an LLM.