Crawling GitHub data for https://github.com/anvaka/pm/
- Make sure Redis is installed and running on the default port
- Register a GitHub token
and set it into
- Install the crawler:
git clone https://github.com/anvaka/ghcrawl
cd ghcrawl
npm i
Now we are ready to index.
Find all users with more than 2 followers
This uses the GitHub search API to go through all users on GitHub who have more than two followers. At the moment there are more than 400k such users.
Each search request can return up to 100 records per page, which gives us
400,000 / 100 = 4,000 requests to make. The search API is rate limited to 30
requests per minute, which means the indexing will take
4,000 / 30 ≈ 133 minutes, or more than two hours:
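The estimate above can be reproduced with a few lines of JavaScript. This is only a back-of-the-envelope sketch: the helper name `estimateSearchCrawl` is hypothetical and not part of ghcrawl, and the inputs (400,000 users, 100 records per page, 30 requests per minute) are the figures quoted above.

```javascript
// Back-of-the-envelope estimate for a paged, rate-limited crawl.
// Hypothetical helper for illustration; not part of ghcrawl.
function estimateSearchCrawl(totalRecords, perPage, requestsPerMinute) {
  // A partially filled last page still costs one request.
  const requests = Math.ceil(totalRecords / perPage);
  const minutes = requests / requestsPerMinute;
  return { requests, minutes, hours: minutes / 60 };
}

const estimate = estimateSearchCrawl(400000, 100, 30);
console.log(estimate.requests);            // 4000
console.log(Math.round(estimate.minutes)); // 133
console.log(estimate.hours.toFixed(1));    // 2.2
```

Because the user count is a lower bound ("more than 400k"), the result is a lower bound as well.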
Find all followers
Now that we have all users who have more than two followers, let's index
those followers. The bad news: we have to make one request per user.
The good news: the rate limit is 5,000 requests per hour, which gives us the
estimated amount of work:
400,000 / 5,000 = 80, so more than 80 hours of work:
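The same arithmetic also shows how far apart requests would need to be spaced to stay under the hourly limit. A minimal sketch, assuming the figures above (400,000 users, 5,000 requests per hour); the helper names `estimateHours` and `minDelayMs` are hypothetical and not part of ghcrawl.

```javascript
// Hypothetical helpers for illustration; not part of ghcrawl.

// Hours needed when every user costs exactly one request.
function estimateHours(userCount, requestsPerHour) {
  return userCount / requestsPerHour;
}

// Smallest delay between requests (in milliseconds) that keeps
// a steady crawl at or below the hourly limit.
function minDelayMs(requestsPerHour) {
  return (60 * 60 * 1000) / requestsPerHour;
}

const hours = estimateHours(400000, 5000);
const delay = minDelayMs(5000);
console.log(hours); // 80
console.log(delay); // 720
```

In other words, a crawler that issues one request roughly every 0.72 seconds would use the full hourly budget without exceeding it.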
Time to get the graph
Now that we have all users indexed, we can construct the graph:
node makeFollowersGraph.js > github.dot
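For readers unfamiliar with the DOT format that the `github.dot` file name suggests, here is a rough sketch of how follower relations could be written out as a DOT digraph. The function `toDot`, the shape of its input, and the edge direction are all assumptions made for illustration; the actual output of makeFollowersGraph.js may differ.

```javascript
// Hypothetical sketch; makeFollowersGraph.js may format its output differently.
// `followersByUser` maps a user login to the logins that follow that user.
function toDot(followersByUser) {
  const lines = ['digraph G {'];
  for (const [user, followers] of Object.entries(followersByUser)) {
    for (const follower of followers) {
      // Edge direction (follower -> followed user) is an assumption.
      lines.push(`  "${follower}" -> "${user}";`);
    }
  }
  lines.push('}');
  return lines.join('\n');
}

const dot = toDot({ alice: ['bob', 'carol'], bob: ['alice'] });
console.log(dot);
```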
Convert graph to binary format:
node --max-old-space-size=4096 ./toBinary.js
Then use ngraph.native for faster graph layout.