Key concepts

Borges producer

A standalone process that reads repository URLs (from RabbitMQ or a file) and schedules those repositories for fetching.

Borges consumer

A standalone process that takes URLs from RabbitMQ, clones the remote repository, and pushes it to the appropriate Rooted Repository in storage (local filesystem or HDFS). Downloaded repositories are packed into siva files, so you don't need to run Borges packer (described below) on them.

Borges packer

A standalone process that takes repository paths (or URLs) from a file and packs them into siva files (as a Rooted Repository) in the given output directory.

Rooted Repository

A rooted repository is a bare Git repository that stores all objects from all repositories that share a common history, that is, they have the same initial commit. It is stored using the Siva file format.
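The grouping described above can be sketched in a few lines of Go. This is only an illustration, not how borges implements it: the `Repo` type, the `GroupByRoot` function, and the commit hashes are all hypothetical. It assumes the root (parentless) commits of each repository are already known.

```go
package main

import (
	"fmt"
	"sort"
)

// Repo is a hypothetical, simplified view of a repository: an identifier
// plus the hashes of its root (parentless) commits.
type Repo struct {
	ID    string
	Roots []string
}

// GroupByRoot maps each root commit hash to the sorted IDs of the
// repositories that contain it. Each map entry stands for one rooted
// repository. A repository with several roots appears under each of them.
func GroupByRoot(repos []Repo) map[string][]string {
	groups := make(map[string][]string)
	for _, r := range repos {
		for _, root := range r.Roots {
			groups[root] = append(groups[root], r.ID)
		}
	}
	for _, ids := range groups {
		sort.Strings(ids)
	}
	return groups
}

func main() {
	// Made-up repositories and hashes, for illustration only.
	repos := []Repo{
		{ID: "upstream", Roots: []string{"aaa111"}},
		{ID: "fork", Roots: []string{"aaa111"}},
		{ID: "merged", Roots: []string{"aaa111", "bbb222"}},
	}

	groups := GroupByRoot(repos)

	// Print the groups in a stable order.
	roots := make([]string, 0, len(groups))
	for root := range groups {
		roots = append(roots, root)
	}
	sort.Strings(roots)
	for _, root := range roots {
		fmt.Printf("%s: %v\n", root, groups[root])
	}
}
```

Here `upstream` and `fork` land in the same group because they share an initial commit, while `merged`, which has two roots, shows up in two groups.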

Root Repository explanatory diagram

Rooted repositories have a few particularities that you should know to work with them effectively:

  • They have no HEAD reference.
  • All references are of the following form: {REFERENCE_NAME}/{REMOTE_NAME}. For example, the reference refs/heads/master of the remote foo would be refs/heads/master/foo. The remote name in rooted repositories generated by borges is always the id (in UUID form) of the repository in the PostgreSQL database.
  • Each remote represents a repository that shares the common history of the rooted repository. A remote can have multiple endpoints.
  • In the repository config, each remote section contains an isfork setting, which can be either true or false. It indicates whether the repository is a fork or the original. Note: this does not work with Packer, and the results may contain false positives and false negatives due to missing information until all available repositories are fetched, so use it with caution.
  • A rooted repository is simply a repository with all the git objects that are reachable from a root commit. That means a repository with multiple roots may be split across several rooted repositories instead of being in just one.
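To make the {REFERENCE_NAME}/{REMOTE_NAME} naming scheme concrete, here is a minimal Go sketch that builds and splits such names. The helpers `RootedRefName` and `SplitRootedRef` are hypothetical and not part of borges, and the UUID is made up. The split assumes the remote name contains no slash, which holds when it is a UUID.

```go
package main

import (
	"fmt"
	"strings"
)

// RootedRefName builds a reference name of the form
// {REFERENCE_NAME}/{REMOTE_NAME}.
func RootedRefName(ref, remote string) string {
	return ref + "/" + remote
}

// SplitRootedRef splits a rooted reference name at its last slash into the
// original reference name and the remote name. It assumes the remote name
// itself contains no slash.
func SplitRootedRef(rooted string) (ref, remote string, err error) {
	i := strings.LastIndex(rooted, "/")
	if i <= 0 || i == len(rooted)-1 {
		return "", "", fmt.Errorf("not a rooted reference name: %q", rooted)
	}
	return rooted[:i], rooted[i+1:], nil
}

func main() {
	// A made-up repository id, for illustration only.
	remote := "0164f1a6-5b9c-4e0a-9d5f-2c8a1b7e3d44"

	rooted := RootedRefName("refs/heads/master", remote)
	fmt.Println(rooted)

	ref, id, err := SplitRootedRef(rooted)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("reference:", ref)
	fmt.Println("remote:", id)
}
```

Splitting at the last slash rather than the first matters because the reference name itself contains slashes.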

Dependencies

Consumer and Producer run independently, communicating through a RabbitMQ instance and storing repository metadata in PostgreSQL.
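The producer/consumer split can be pictured with a small in-memory Go sketch. It is only an analogy: a buffered channel stands in for RabbitMQ, there is no PostgreSQL, and the `Job` type, the `produce` and `consume` functions, and the URLs are all hypothetical rather than taken from borges.

```go
package main

import "fmt"

// Job is a hypothetical queue message carrying a repository URL.
type Job struct {
	URL string
}

// produce enqueues one job per URL and closes the queue when it is done.
func produce(urls []string, queue chan<- Job) {
	for _, u := range urls {
		queue <- Job{URL: u}
	}
	close(queue)
}

// consume drains the queue and returns the URLs in the order received.
// A real consumer would clone each repository and push it to storage here.
func consume(queue <-chan Job) []string {
	var processed []string
	for job := range queue {
		processed = append(processed, job.URL)
	}
	return processed
}

func main() {
	// The channel plays the role of the message queue between the two sides.
	queue := make(chan Job, 8)

	// Made-up URLs, for illustration only.
	urls := []string{
		"https://example.com/org/repo-a.git",
		"https://example.com/org/repo-b.git",
	}

	go produce(urls, queue)
	for _, u := range consume(queue) {
		fmt.Println("processed", u)
	}
}
```

The two sides share nothing but the queue, which is what lets them run as independent processes.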

Packer does not need a RabbitMQ or PostgreSQL instance and is not meant to be used as a pipeline; that is what the consumer and producer are for.

Read the borges package godoc for further details on how borges archives repositories.