## Efficiently crawl data from GitHub's GraphQL API

Note to self:

- Utilize GraphQL for Efficiency: GraphQL allows you to request exactly the data you need, reducing unnecessary data transfer.
- Public Schema for Data Structure: The [public GitHub GraphQL schema](https://docs.github.com/en/graphql/overview/public-schema) is quite comprehensive.
- Implement Graceful Retry for Rate Limits: GitHub's GraphQL API enforces rate limits, so implement a retry mechanism that waits as long as those limits require before sending the next request. Refer to the [rate limits documentation](https://docs.github.com/en/graphql/overview/rate-limits-and-node-limits-for-the-graphql-api) for guidance.
- Batch Queries by Year due to Max Results Limit: A single query can return at most 1,000 results. To work around this, segment the queries by year (or another logical division) so that each segment stays under the limit and all desired data is retrieved.
- Review and Adjust the Base Query as Necessary: The [existing query](ospo_stats/query.py) may need some changes, especially to handle data updates efficiently. Since the entire crawl takes only about 3 minutes, the current setup seems fine for now.
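
A minimal sketch of the retry idea from the rate-limit note above. It is not the code in `ospo_stats`; the names `seconds_to_wait` and `call_with_retry` are hypothetical, and `send` stands for any function that performs one request and returns the status, headers, and decoded JSON body. It assumes the `retry-after`, `x-ratelimit-remaining`, and `x-ratelimit-reset` response headers described in the rate limits documentation, and falls back to exponential backoff when none of them apply.

```python
import time
from typing import Callable, Mapping, Optional, Tuple

# (HTTP status, response headers, decoded JSON body)
Response = Tuple[int, Mapping[str, str], dict]


def seconds_to_wait(status: int, headers: Mapping[str, str], has_errors: bool,
                    now: float, attempt: int) -> Optional[float]:
    """Return the pause before the next try, or None if the response is usable."""
    if status == 200 and not has_errors:
        return None
    h = {k.lower(): v for k, v in headers.items()}
    if "retry-after" in h:
        # Assumption: the server says how many seconds to wait.
        return float(h["retry-after"])
    if h.get("x-ratelimit-remaining") == "0" and "x-ratelimit-reset" in h:
        # Assumption: the reset header is a UTC epoch timestamp in seconds.
        return max(0.0, float(h["x-ratelimit-reset"]) - now)
    # Anything else: exponential backoff, capped at one minute.
    return min(60.0, 2.0 ** attempt)


def call_with_retry(send: Callable[[], Response], max_attempts: int = 5,
                    sleep: Callable[[float], None] = time.sleep,
                    clock: Callable[[], float] = time.time) -> dict:
    """Call `send` until it returns a clean response or attempts run out."""
    for attempt in range(max_attempts):
        status, headers, payload = send()
        wait = seconds_to_wait(status, headers, "errors" in payload, clock(), attempt)
        if wait is None:
            return payload
        if attempt < max_attempts - 1:
            sleep(wait)
    raise RuntimeError(f"gave up after {max_attempts} attempts")
```

Note that this sketch retries on any GraphQL `errors` payload, including errors unrelated to rate limits (such as a malformed query); a real crawler would probably want to inspect the error type before retrying.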



In [1]:
from ospo_stats.gh import crawl

# Takes around 3 minutes for 2010 to 2024; stores all raw data in the `data/` dir
# Run once only
# crawl("uw-madison", 2010, 2024, overwrite=True)
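
The year arguments to `crawl` correspond to the batching-by-year note above. As a rough illustration of that idea, the sketch below builds one search string per calendar year using GitHub's `created:START..END` search qualifier. The helper name `yearly_search_queries` is hypothetical, treating the end year as inclusive is an assumption, and the actual query built in `ospo_stats/query.py` may look different.

```python
from typing import List


def yearly_search_queries(keyword: str, start_year: int, end_year: int) -> List[str]:
    """Build one search string per calendar year (end year assumed inclusive)."""
    return [
        f"{keyword} created:{year}-01-01..{year}-12-31"
        for year in range(start_year, end_year + 1)
    ]


queries = yearly_search_queries("uw-madison", 2010, 2024)
print(len(queries))  # 15
print(queries[0])    # uw-madison created:2010-01-01..2010-12-31
```

If a single year ever matches more than 1,000 repositories, the same approach can be applied with narrower windows (for example, by month).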

## Visualizations

- Cumulative repos by year
- Cumulative stars by year? This is not the number of stars received each year. It is the total number of stars each repository has now, with repositories grouped by the year they were created. TODO: Somewhat misleading, may need a fix.
- Cumulative commits by year? Similar to the above: the total number of commits each repository has now, grouped by repo creation year. TODO: Somewhat misleading, may need a fix.
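
To make the caveat concrete, here is a toy example with made-up numbers (not crawl results). It uses the same `created_at` and `stars` column names as the real data. All of a repository's current stars are credited to its creation year, and years in which no repository was created do not appear at all unless the index is filled in explicitly.

```python
import pandas as pd

# Made-up data: three repositories and their star counts as of today.
toy = pd.DataFrame({
    "name": ["a", "b", "c"],
    "created_at": [
        "2019-03-01T00:00:00Z",
        "2019-09-01T00:00:00Z",
        "2021-01-15T00:00:00Z",
    ],
    "stars": [100, 5, 20],
})
toy["year"] = pd.to_datetime(toy["created_at"]).dt.year

# Today's stars, summed by creation year, then accumulated.
cumulative = toy.groupby("year").stars.sum().cumsum()

# 2020 has no row because no toy repo was created that year;
# reindex + forward-fill carries the running total across the gap.
filled = cumulative.reindex(range(2019, 2022)).ffill()

print(cumulative)
print(filled)
```

Repo `a` may have earned most of its 100 stars in 2023, but the chart would still show them under 2019, which is why the metric can mislead.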

### Calculate cumulative metrics by year

In [2]:
import pandas as pd
import altair as alt
from ospo_stats.parser import load

df = load("data/")
df['year'] = pd.to_datetime(df['created_at']).dt.year

cumulative_repos = df.groupby('year').name.nunique().cumsum()
plot_df_repo = cumulative_repos.reset_index(name='n')

cumulative_stars = df.groupby('year').stars.sum().cumsum()
plot_df_stars = cumulative_stars.reset_index(name='stars')

cumulative_commits = df.groupby('year').commits.sum().cumsum()
plot_df_commits = cumulative_commits.reset_index(name='commits')

plot_df = plot_df_repo.merge(plot_df_stars, on='year').merge(plot_df_commits, on='year').melt('year', var_name='metric')
plot_df.sample(5)

Unnamed: 0,year,metric,value
1,2012,n,25
29,2012,commits,1771
13,2024,n,1266
4,2015,n,139
10,2021,n,823


### Plot cumulative metrics by year

In [3]:
plot = alt.Chart(plot_df).mark_line(point=True).encode(
    x='year:O', 
    y='value:Q',
    color='metric:N',
    column='metric:N'
).properties(
    title='Open-source repositories affiliated with UW–Madison',
    width=600,
    height=400
).resolve_scale(y='independent')

plot

### Somewhat interesting statistics

- Top 10 most-starred repos
- Top 10 repos by commit count

In [4]:
df.sort_values('stars', ascending=False).head(10)

Unnamed: 0,name,url,description,created_at,pushed_at,stars,issues,commits,readme,year
513,stat453-deep-learning-ss20,https://github.com/rasbt/stat453-deep-learning...,STAT 453: Intro to Deep Learning @ UW-Madison ...,2020-01-20T23:20:22Z,2020-05-01T22:34:04Z,545,2,84,# STAT 453: Introduction to Deep Learning and ...,2020
694,stat453-deep-learning-ss21,https://github.com/rasbt/stat453-deep-learning...,STAT 453: Intro to Deep Learning @ UW-Madison ...,2021-01-28T03:46:42Z,2022-02-03T22:57:35Z,398,4,44,# stat453-deep-learning-ss21\nSTAT 453: Intro ...,2021
514,stat451-machine-learning-fs20,https://github.com/rasbt/stat451-machine-learn...,STAT 451: Intro to Machine Learning @ UW-Madis...,2020-08-06T23:42:45Z,2020-12-03T05:43:06Z,366,1,32,[![Binder](https://mybinder.org/badge_logo.svg...,2020
830,Kaggle-UWMGIT,https://github.com/CarnoZhao/Kaggle-UWMGIT,code for kaggle: UW-Madison GI Tract Image Seg...,2022-05-08T07:39:51Z,2022-10-18T17:34:22Z,87,7,437,### Introduction\n\nHello!\n\nBelow you can fi...,2022
375,prepare-tax-return,https://github.com/mzj14/prepare-tax-return,UW-Madison CS 硕士生报税经历,2019-03-19T00:38:09Z,2019-03-24T15:31:33Z,49,0,8,# prepare-tax-return\n作为UW-Madison CS 硕士生的个人报税...,2019
264,madgrades.com,https://github.com/Madgrades/madgrades.com,Frontend for visualizing UW Madison course gra...,2018-02-27T08:07:01Z,2024-01-10T09:00:30Z,47,10,121,# madgrades.com [![Build and deploy to prod](h...,2018
49,JuliaWorkshop,https://github.com/dmbates/JuliaWorkshop,Materials for a workshop on Julia programming ...,2014-06-17T14:46:20Z,2019-06-10T17:29:01Z,31,0,50,JuliaWorkshop\n=============\n\nMaterials for ...,2014
370,UW-Madison-CS540-Introduction-to-AI,https://github.com/learlinian/UW-Madison-CS540...,UW-Madison CS540: Introduction to Artificial I...,2019-02-10T00:55:59Z,2021-10-09T08:51:56Z,24,0,48,# UW-Madison CS540: Introduction to Artificial...,2019
371,uw-madison-datacience-club-talk-oct2019,https://github.com/rasbt/uw-madison-datacience...,Slides and code for the talk at UW-Madison's D...,2019-10-11T02:44:06Z,2019-10-11T02:49:31Z,20,0,2,# uw-madison-datacience-club-talk-oct2019\nSli...,2019
295,command-line-interpreter,https://github.com/amanchadha/command-line-int...,Unix Shell in C | Support for built-in command...,2018-12-19T06:39:28Z,2018-12-31T23:18:56Z,19,0,15,﻿Command Line Interpreter (Shell)\n===========...,2018


In [5]:
df.sort_values('commits', ascending=False).head(10)

Unnamed: 0,name,url,description,created_at,pushed_at,stars,issues,commits,readme,year
1234,WisconsinDiamondQubitLab,https://github.com/GardillA/WisconsinDiamondQu...,Codebase to run the Wisconsin Diamond Qubit La...,2023-07-13T16:58:49Z,2023-07-18T03:42:14Z,0,0,3180,This codebase is based off of the codebase ori...,2023
50,2014-08-25-wisc,https://github.com/UW-Madison-ACI/2014-08-25-wisc,Software Carpentry Boot Camp at UW-Madison,2014-07-29T02:25:37Z,2014-08-15T18:45:16Z,3,0,1519,Software Carpentry Bootcamps\n================...,2014
318,2018-15-08-UW-Madison,https://github.com/mkamenet3/2018-15-08-UW-Mad...,,2018-08-15T22:07:58Z,2018-08-15T22:12:57Z,0,0,1315,# workshop-template\n\nThis repository is [Sof...,2018
274,2018-06-04-uwmadison-dc,https://github.com/UW-Madison-ACI/2018-06-04-u...,Website for June 2018 Data Carpentry workshop ...,2018-05-01T17:57:52Z,2018-08-29T11:48:25Z,0,0,1160,"## Data Carpentry @ UW Madison\n### June 4-5, ...",2018
279,2018-06-06-uwmadison-swc,https://github.com/UW-Madison-ACI/2018-06-06-u...,Website for June 2018 Software Carpentry works...,2018-05-01T17:58:43Z,2018-06-05T21:19:48Z,0,0,1147,# workshop-template\n\nThis repository is [Sof...,2018
87,2016-01-14-uwmadison,https://github.com/UW-Madison-ACI/2016-01-14-u...,http://uw-madison-aci.github.io/2016-01-14-uwm...,2015-11-24T20:08:35Z,2016-01-15T21:48:46Z,1,0,632,# workshop-template\n\nThis repository is [Sof...,2015
826,Scavenge,https://github.com/Scavenge-UW/Scavenge,Search for food you need at local food pantrie...,2021-02-13T00:52:25Z,2021-05-01T04:23:23Z,2,97,620,# Scavenge\nSearch for food you need at local ...,2021
86,2015-08-26-uw-madison,https://github.com/UW-Madison-ACI/2015-08-26-u...,,2015-07-30T20:10:53Z,2015-08-25T17:28:51Z,0,1,588,# workshop-template\n\nThis repository is [Sof...,2015
88,2016-01-11-uwmadison,https://github.com/UW-Madison-ACI/2016-01-11-u...,http://uw-madison-aci.github.io/2016-01-11-uwm...,2015-11-24T14:44:21Z,2016-01-14T17:47:12Z,0,0,555,=======\n\nThis repository is the [Data Carpen...,2015
105,uw-madison-wp-2015,https://github.com/uwmadisoncals/uw-madison-wp...,,2015-05-18T15:27:02Z,2019-04-12T16:55:46Z,0,11,555,\nA UW-Madison Wordpress Theme\n===\n\nThis is...,2015


## Future plans / discussion

1. Get some inspiration from https://r-universe.dev/search/
2. What information is needed for outreach? Exemplary OSS repos?
3. Categorize repos by their `description` and `readme` with an LLM.