Web crawler for OSS projects 📚

ZJU-SEC/GitHunter

GitHunter

GitHunter is a tiny yet powerful crawler infrastructure for collecting OSS projects on GitHub. It queries the GitHub search API and persists the data into a PostgreSQL database.

Check here to see what data is collected.

⚙️ Prerequisite

  • Docker
  • Golang
  • PostgreSQL

💡 Dockerized PostgreSQL

To run a dockerized PostgreSQL, check this.

Start a PostgreSQL container with a command like the following:

$ docker run \
  --name postgres -d \
  --restart unless-stopped \
  -e POSTGRES_USER=ZJU-SEC \
  -e POSTGRES_PASSWORD=<YOUR DB PASSWORD> \
  -e POSTGRES_DB=GitHunter \
  -p 5432:5432 postgres

📄 Make the Configurations

Create a config.ini file based on config.ini.tmpl. The configuration options are specified below:

Name          Type     In   Description
WORKER        integer  APP  Maximum number of parallel workers
QUEUE_SIZE    integer  APP  Maximum number of queued tasks
LANGUAGE      string   APP  Targeted programming language
MIN_STAR      integer  APP  Minimum number of stars a repository must have
GITHUB_TOKEN  string   WEB  GitHub token used to raise the API rate limit
TRYOUT        integer  WEB  Maximum number of retries when requesting a page
HOST          string   DB   Database host address
USER          string   DB   Database user name
PASSWORD      string   DB   Database user password
DBNAME        string   DB   Database name
PORT          integer  DB   Database port
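
As an illustration of how these options fit together, the sketch below parses a sample configuration with a small standard-library INI reader. It assumes the "In" column names the INI section (`[APP]`, `[WEB]`, `[DB]`). All values in `sample` are placeholders, and `parseINI` is a hypothetical helper, not the parser GitHunter uses. config.ini.tmpl remains the authoritative layout.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sample lists every option from the specification above.
// Section names are an assumption; values are placeholders.
const sample = `
[APP]
WORKER = 8
QUEUE_SIZE = 64
LANGUAGE = Go
MIN_STAR = 100

[WEB]
GITHUB_TOKEN = <YOUR GITHUB TOKEN>
TRYOUT = 3

[DB]
HOST = localhost
USER = ZJU-SEC
PASSWORD = <YOUR DB PASSWORD>
DBNAME = GitHunter
PORT = 5432
`

// parseINI reads "KEY = VALUE" pairs grouped under "[SECTION]" headers
// and returns them as section -> key -> value.
func parseINI(text string) (map[string]map[string]string, error) {
	cfg := map[string]map[string]string{}
	section := ""
	sc := bufio.NewScanner(strings.NewReader(text))
	for n := 1; sc.Scan(); n++ {
		line := strings.TrimSpace(sc.Text())
		switch {
		case line == "" || strings.HasPrefix(line, "#") || strings.HasPrefix(line, ";"):
			continue // skip blank lines and comments
		case strings.HasPrefix(line, "[") && strings.HasSuffix(line, "]"):
			section = strings.TrimSpace(line[1 : len(line)-1])
			cfg[section] = map[string]string{}
		default:
			key, value, ok := strings.Cut(line, "=")
			if !ok || section == "" {
				return nil, fmt.Errorf("line %d: expected KEY = VALUE inside a section", n)
			}
			cfg[section][strings.TrimSpace(key)] = strings.TrimSpace(value)
		}
	}
	return cfg, sc.Err()
}

func main() {
	cfg, err := parseINI(sample)
	if err != nil {
		panic(err)
	}
	// Integer options arrive as strings and must be converted explicitly.
	workers, err := strconv.Atoi(cfg["APP"]["WORKER"])
	if err != nil {
		panic(err)
	}
	fmt.Println("workers:", workers, "language:", cfg["APP"]["LANGUAGE"])
}
```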

🛠️ Build

$ go build GitHunter

🚀 Run

To crawl the repositories' metadata:

$ ./GitHunter crawl

To clone the repositories:

$ ./GitHunter clone