Commit

gh-53: simplified data model and architecture to only convey relevant concepts
mjpitz committed Aug 6, 2020
1 parent b2fa391 commit f1139ff
Showing 4 changed files with 41 additions and 126 deletions.
61 changes: 13 additions & 48 deletions content/en/docs/concepts/architecture.md
@@ -6,60 +6,25 @@ aliases:
- /docs/architecture/
---

This page serves as documentation of the open source architecture for the deps.cloud system.

## Overview
The following diagram illustrates the general system deployed on top of a Kubernetes cluster.

![arch](/images/arch.png)

While there are several components that make up the ecosystem, each of them serves its own purpose.
### Actors

### Components
[User CLI](https://github.com/depscloud/cli) represents a single type of consumer.
The command line interface (CLI) allows individuals to explore data stored in deps.cloud.
Other types of clients include processes written using one of our SDKs.

[Gateway](https://github.com/depscloud/gateway) is the face of the API services.
It provides a RESTful HTTP interface to the backing gRPC services.

[Tracker](https://github.com/depscloud/tracker) provides several APIs for navigating the graph of information.
This service leverages other storage systems such as SQLite or MySQL to store the graph data.

[Extractor](https://github.com/depscloud/extractor) is responsible for looking at different manifest files and extracting dependency information from them.
This mechanism is easily pluggable to support a large range of different manifest files.

The [indexer](https://github.com/depscloud/indexer) is responsible for fetching repository information, cloning and crawling it, leveraging the extractor and tracker where appropriate.

The [command line interface](https://github.com/depscloud/cli) or CLI provides end users with an easy ability to query the API.
See the [CLI docs](/docs/cli/) for more information.

## Design Decisions

As this system was built out, several key decisions were made along the way.
In this section, I capture several frequently asked questions and document the rationale behind the answers.

### _How should services communicate?_

There are many different ways services can communicate.
REST and [gRPC](https://grpc.io) are simply two of them.
Of the many options available, gRPC came with several benefits.
These include, but are not limited to:

* contractual API definitions using [Protocol Buffers](https://developers.google.com/protocol-buffers)
* support for multi-language systems
* built-in client-side load balancing and health checking

In the end, I decided to leverage gRPC.
That decision has had a great impact on the ecosystem.
It allowed parts to be prototyped in one language, and rewritten when they didn't scale.
Adding REST support was easy with the help of the [grpc-gateway](https://github.com/grpc-ecosystem/grpc-gateway) project.

### _How should the data be stored?_
It provides both RESTful and gRPC interfaces to clients of the system.
Not all functionality is available over the RESTful interface.

When you think of a dependency graph, it's easy to jump to the conclusion that you should use one of the existing [graph databases](https://en.wikipedia.org/wiki/Graph_database) out there.
However, when working with folks in the open source community, it's hard to find people with prior experience with graph databases.
Most people are still more familiar with things like [MySQL](https://www.mysql.com/) or [MongoDB](https://www.mongodb.com/).
[Tracker](https://github.com/depscloud/tracker) provides several APIs for navigating the graph.
This service leverages systems such as SQLite, MySQL, or PostgreSQL to store the graph data.

Knowing this layer of the stack was likely to be swapped out with Company X's preferred store, I wanted it to be pluggable.
So the first implementation was on top of an SQL system.
From there, we were able to extract a simple service interface.
This makes it easy to swap the storage technology out for different solutions.
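
As a rough illustration of the pluggable storage idea described above, the sketch below defines a small storage interface and a toy in-memory backend. The names `GraphStore`, `InMemoryStore`, `put`, `list_items`, and `count_items` are hypothetical and are not taken from the tracker's actual service definition.

```python
from typing import Dict, List, Protocol, Tuple


class GraphStore(Protocol):
    """Assumed minimal interface that any storage backend would implement."""

    def put(self, kind: str, key: str, data: str) -> None: ...

    def list_items(self, kind: str) -> List[str]: ...


class InMemoryStore:
    """Toy backend; a SQL or NoSQL backend could expose the same methods."""

    def __init__(self) -> None:
        self._items: Dict[Tuple[str, str], str] = {}

    def put(self, kind: str, key: str, data: str) -> None:
        self._items[(kind, key)] = data

    def list_items(self, kind: str) -> List[str]:
        return sorted(key for (k, key) in self._items if k == kind)


def count_items(store: GraphStore, kind: str) -> int:
    # Callers depend only on the interface, not on a concrete backend.
    return len(store.list_items(kind))


store = InMemoryStore()
store.put("module", "msha", "{}")
store.put("source", "ssha", "{}")
module_count = count_items(store, "module")
```

Because `count_items` only relies on the interface, swapping `InMemoryStore` for another backend would not require changes to the calling code.
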
[Extractor](https://github.com/depscloud/extractor) extracts dependency information from manifest files.
This mechanism is easily pluggable to support a large range of manifest files.

For more information on the data layer, see the [Data Model](/docs/data-model/) documentation.
The [indexer](https://github.com/depscloud/indexer) crawls repositories looking for manifest files.
When it discovers manifests, the contents are extracted, stored, and indexed.
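
The crawl step described above can be sketched as a directory walk that collects files whose names look like manifests. This is only an illustration: `MANIFEST_NAMES` is an assumed subset of file names, and `find_manifests` is a hypothetical helper, not the indexer's actual implementation.

```python
import os
import tempfile
from typing import List

# Assumed subset of manifest file names; the real indexer may recognize more.
MANIFEST_NAMES = {"package.json", "go.mod", "pom.xml", "composer.json"}


def find_manifests(root: str) -> List[str]:
    """Return paths (relative to root) of files whose name looks like a manifest."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name in MANIFEST_NAMES:
                rel = os.path.relpath(os.path.join(dirpath, name), root)
                found.append(rel.replace(os.sep, "/"))
    return sorted(found)


# Build a tiny fake repository to crawl.
repo = tempfile.mkdtemp()
os.makedirs(os.path.join(repo, "services", "api"))
for rel in ("package.json", "README.md", "services/api/go.mod"):
    with open(os.path.join(repo, *rel.split("/")), "w") as fh:
        fh.write("")

manifests = find_manifests(repo)
```

In the real system, each discovered manifest would then be handed to the extractor and the results stored through the tracker.
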
77 changes: 13 additions & 64 deletions content/en/docs/concepts/data-model.md
@@ -6,84 +6,33 @@ aliases:
- /docs/data-model/
---

This page serves as documentation of the open source data model for the deps.cloud system.
The backing data model for deps.cloud is a graph.
Graphs contain two types of data: nodes and edges.
Nodes often represent entities such as people, places, or things.
Edges often represent relationships between two entities.

## Logical Model
## Overview

The logical model is the user facing representation of the data in the system.
It is defined using [protocol buffers](https://developers.google.com/protocol-buffers/).
The complete schema can be found in the [API](https://github.com/depscloud/api) repository.
To summarize, there are four distinct entities in the deps.cloud database.
The following illustrates the various nodes and edges in the deps.cloud ecosystem.

![data-model](/images/data-model.png)

### Nodes

*Sources* represent origins for information.
These can be source control systems like GitHub, GitLab, or Bitbucket.
Or they can be artifact repositories like JFrog Artifactory or Sonatype Nexus.
Sources are keyed by their URL and are represented as nodes in the dependency graph.

*Modules* represent libraries or applications in the dependency graph.
Modules are extracted from [manifest files](/docs/manifests/).
These are the components extracted from [manifest files](/docs/concepts/manifests/).
They are keyed by all their data, and are represented as nodes in the dependency graph.

### Edges

*Manages* represents the relationship between a *source* and a *module*.
It contains information about how a given module is managed, such as the toolchain.

*Depends* represents the relationship between two *modules*.
It contains information about how the modules depend on one another.
This includes things like version constraints, scopes, and a reference to the source.
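
To make the nodes and edges more concrete, here is a minimal sketch of the four entities as Python dataclasses. The field names (`toolchain`, `version_constraint`, `scopes`, and so on) are assumptions for illustration; the authoritative schema is the protocol buffer definition in the API repository. The `example.com/lib` module is made up.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Source:
    url: str  # sources are keyed by their URL


@dataclass(frozen=True)
class Module:
    language: str
    organization: str
    module: str


@dataclass(frozen=True)
class Manages:
    source: Source
    module: Module
    toolchain: str  # assumed field describing how the module is managed


@dataclass(frozen=True)
class Depends:
    dependent: Module
    dependency: Module
    version_constraint: str
    scopes: Tuple[str, ...]
    source: Source  # the source in which this dependency was observed


def dependencies_of(module: Module, edges: List[Depends]) -> List[Module]:
    """Follow outgoing depends edges from a module."""
    return [e.dependency for e in edges if e.dependent == module]


repo = Source(url="https://github.com/depscloud/api")
api = Module(language="go", organization="github.com", module="depscloud/api")
lib = Module(language="go", organization="example.com", module="lib")  # made up

manages = Manages(source=repo, module=api, toolchain="vgo")
depends = Depends(api, lib, version_constraint="v1.0.0", scopes=("direct",), source=repo)

deps = dependencies_of(api, [depends])
```
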

This data can be visualized as such:

![data-model](/images/data-model.png)

## Database Schema

The database schema was inspired by [EdgeStore](https://youtu.be/VZ-zJEWi-Vo?t=588) at [Dropbox](https://dropbox.tech/infrastructure/reintroducing-edgestore).
With a few modifications, we were able to successfully model a dependency graph.
Below, you will find a copy of a create table statement for MySQL.

```mysql
CREATE TABLE IF NOT EXISTS `dts_graphdata` (
`graph_item_type` varchar(55) NOT NULL,
`k1` char(64) NOT NULL,
`k2` char(64) NOT NULL,
`k3` varchar(64) NOT NULL,
`encoding` tinyint DEFAULT NULL,
`graph_item_data` text,
`last_modified` datetime DEFAULT NULL,
`date_deleted` datetime DEFAULT NULL,
PRIMARY KEY (`graph_item_type`,`k1`,`k2`,`k3`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
```

This schema is able to represent a dependency graph with the help of a few simple rules.

1. When `k1 == k2`, the row represents a node in the graph
2. When `k1 != k2`, the row represents an edge between `k1` and `k2`
3. `k3` allows for multiple edges to exist between nodes, but is restricted to one per source

To help make this more concrete, consider the following simplified table:

```
| graph_item_type | k1 | k2 | k3 | encoding | graph_item_data |
|-----------------|--------|--------|--------|----------|-----------------|
| depends | msha | osha | ssha | 1 | { ... } |
| manages | ssha | msha | | 1 | { ... } |
| module | msha | msha | | 1 | { ... } |
| module | osha | osha | | 1 | { ... } |
| source | ssha | ssha | | 1 | { ... } |
```

The following statements can be made about the data shown in the table.

* Source `ssha` manages module `msha`.
* Module `msha` depends on module `osha` for source `ssha`.
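
The rules and the simplified table above can be sketched in a few lines of Python. The `Row` type and the helper names are hypothetical; only the column names and the sample keys come from the table.

```python
from typing import List, NamedTuple


class Row(NamedTuple):
    graph_item_type: str
    k1: str
    k2: str
    k3: str


def is_node(row: Row) -> bool:
    # Rule 1: k1 == k2 means the row is a node; otherwise it is an edge (rule 2).
    return row.k1 == row.k2


def describe_edge(row: Row) -> str:
    text = f"{row.k1} -[{row.graph_item_type}]-> {row.k2}"
    if row.k3:
        # Rule 3: k3 distinguishes edges between the same pair of nodes.
        text += f" (recorded for {row.k3})"
    return text


rows: List[Row] = [
    Row("depends", "msha", "osha", "ssha"),
    Row("manages", "ssha", "msha", ""),
    Row("module", "msha", "msha", ""),
    Row("module", "osha", "osha", ""),
    Row("source", "ssha", "ssha", ""),
]

nodes = [r.k1 for r in rows if is_node(r)]
edges = [describe_edge(r) for r in rows if not is_node(r)]
```

Here `nodes` holds the three node keys and `edges` holds one description per edge row, mirroring the two statements listed above.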

## Swappable Storage Engines

While only SQL support is available today, it's possible to support NoSQL systems too.
This is made possible by the simplified schema and data abstraction layer.
Current database support:

* [SQLite](https://www.sqlite.org/)
* [MySQL](https://www.mysql.com/)
* [Postgres](https://postgresql.org/)
28 changes: 14 additions & 14 deletions content/en/docs/concepts/manifests.md
@@ -11,19 +11,19 @@ These dependencies come in many shapes and forms, but the most common dependency
Libraries are packages containing common code that is shared between projects.

Using this information, deps.cloud is able to build a knowledge graph.
The table below demonstrates how this information is extracted from various manifests.
The table below demonstrates how to interpret the information extracted from various manifests.
Since there is no standardization across languages, extraction may vary between implementations.

| Manifest File | Language | System | Example | Organization | Module |
|---------------------------------|----------|------------|-----------------------------|--------------------|------------------|
| `bower.json` | `node` | `bower` | `@depscloud/api` | `depscloud` | `api` |
| `build.gradle, settings.gradle` | `java` | `gradle` | `com.google.guava:guava` | `com.google.guava` | `guava` |
| `Cargo.toml`                    | `rust`   | `cargo`    | `bytes`                     | `_`                | `bytes`          |
| `composer.json` | `php` | `composer` | `symfony/console` | `symfony` | `console` |
| `Godeps.json` | `go` | `godeps` | `github.com/depscloud/api` | `github.com` | `depscloud/api` |
| `go.mod` | `go` | `vgo` | `github.com/depscloud/api` | `github.com` | `depscloud/api` |
| `Gopkg.toml` | `go` | `gopkg` | `github.com/depscloud/api` | `github.com` | `depscloud/api` |
| `ivy.xml` | `java` | `ivy` | `com.google.guava;guava` | `com.google.guava` | `guava` |
| `package.json` | `node` | `npm` | `@depscloud/api` | `depscloud` | `api` |
| `pom.xml` | `java` | `maven` | `com.google.guava;guava` | `com.google.guava` | `guava` |
| `vendor.conf` | `go` | `vendor` | `github.com/depscloud/api` | `github.com` | `depscloud/api` |
| Manifest File | Example | Language | System | Organization | Module |
|---------------------------------|-----------------------------|----------|------------|--------------------|------------------|
| `bower.json` | `@depscloud/api` | `node` | `bower` | `depscloud` | `api` |
| `build.gradle, settings.gradle` | `com.google.guava:guava` | `java` | `gradle` | `com.google.guava` | `guava` |
| `Cargo.toml`                    | `bytes`                     | `rust`   | `cargo`    | `_`                | `bytes`          |
| `composer.json` | `symfony/console` | `php` | `composer` | `symfony` | `console` |
| `Godeps.json` | `github.com/depscloud/api` | `go` | `godeps` | `github.com` | `depscloud/api` |
| `go.mod` | `github.com/depscloud/api` | `go` | `vgo` | `github.com` | `depscloud/api` |
| `Gopkg.toml` | `github.com/depscloud/api` | `go` | `gopkg` | `github.com` | `depscloud/api` |
| `ivy.xml` | `com.google.guava;guava` | `java` | `ivy` | `com.google.guava` | `guava` |
| `package.json` | `@depscloud/api` | `node` | `npm` | `depscloud` | `api` |
| `pom.xml` | `com.google.guava;guava` | `java` | `maven` | `com.google.guava` | `guava` |
| `vendor.conf` | `github.com/depscloud/api` | `go` | `vendor` | `github.com` | `depscloud/api` |
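
As a rough sketch of how the Example column maps to the Organization and Module columns, the hypothetical helper below reproduces the rows of the table. It is not the extractor's actual logic, and its behavior for names that do not appear in the table (for example, an unscoped node package) is an assumption.

```python
from typing import Tuple


def split_name(language: str, name: str) -> Tuple[str, str]:
    """Split a dependency name into (organization, module) as in the table."""
    if language == "java":
        # Gradle examples use ':' while the ivy and maven examples use ';'.
        sep = ":" if ":" in name else ";"
        organization, _, module = name.partition(sep)
        return organization, module
    if language == "node":
        # Scoped packages carry a leading '@' that is not part of the organization.
        name = name.lstrip("@")
    if "/" in name:
        # Split on the first '/' only, so Go paths keep the rest as the module.
        organization, _, module = name.partition("/")
        return organization, module
    # No organization present: the table uses '_' as a placeholder.
    return "_", name


examples = [
    ("node", "@depscloud/api"),
    ("java", "com.google.guava:guava"),
    ("rust", "bytes"),
    ("php", "symfony/console"),
    ("go", "github.com/depscloud/api"),
]
results = [split_name(language, name) for language, name in examples]
```
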
1 change: 1 addition & 0 deletions content/en/docs/contributing/_index.md
@@ -11,6 +11,7 @@ Be sure to consult all the resources below prior to contributing.
The deps.cloud project leverages a public [GitHub Project](https://github.com/orgs/depscloud/projects/1) for tracking its work items.
For newcomers, there's a section that provides some detail around how to get started developing on the project.
If you want to submit an issue, you can open one up under the [deps.cloud](https://github.com/depscloud/deps.cloud/issues/new) project.
A triager will take care of moving it to the right project.

## Contributor Agreements
