Add architecture section #111

Open · wants to merge 1 commit into master
44 changes: 44 additions & 0 deletions _data/sidebars/home_sidebar.yml
@@ -46,6 +46,50 @@ entries:
url: /basics/dockerhub.html
output: web

- title: Architecture
output: web
subfolderitems:

- title: Overview
url: /architecture/architecture.html
output: web

- title: Components
url: /architecture/components.html
output: web

- title: Perceval
url: /architecture/perceval.html
output: web

- title: Graal
url: /architecture/graal.html
output: web

- title: King Arthur
url: /architecture/kingarthur.html
output: web

- title: SortingHat & HatStall
url: /architecture/sortinghatstall.html
output: web

- title: ELK & Cereslib
url: /architecture/elkceres.html
output: web

- title: Manuscripts
url: /architecture/manuscripts.html
output: web

- title: Sigils, Kidash & Kibiter
url: /architecture/dashboards.html
output: web

- title: Mordred
url: /architecture/mordred.html
output: web

- title: Perceval
output: web
subfolderitems:
67 changes: 67 additions & 0 deletions architecture/architecture.md
@@ -0,0 +1,67 @@
# Overview

The overall structure of GrimoireLab is summarized in the figure below. Its core is composed of four components that take care of extracting software development data, storing it, managing contributor identities, and analyzing (visualizing) the data obtained. Additionally, a component to orchestrate all the others is also available. The details of each component are described in the next sections.

![](../assets/grimoirelab-all.png)

*Overview of GrimoireLab. Software data is extracted, processed and visualized via four components, one of them dedicated to managing identities. A further component performs analysis over a set of target data sources by setting up and orchestrating the other components.*

## Data retrieval

Data retrieval is handled by three tools.

- [Perceval](https://github.com/chaoss/grimoirelab-perceval) is designed to deal only with fetching the data, so that it can be optimized for that task. Usually, it works as a library, providing a uniform Python API to access software development repositories.

- [Graal](https://github.com/chaoss/grimoirelab-graal) complements Perceval by collecting data from the source code of Git repositories. It provides a mechanism to plug in third party source code analysis tools and libraries.

- [King Arthur](https://github.com/chaoss/grimoirelab-kingarthur) schedules and runs Perceval and Graal executions at scale through distributed queues.
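The idea of a uniform API can be pictured with a small stdlib-only sketch (illustrative only, not Perceval's actual classes): every backend yields its items wrapped in a common metadata envelope, so consumers can process documents from any data source in the same way.

```python
import hashlib
import time


class Backend:
    """Toy Perceval-style backend: subclasses yield raw items and
    fetch() wraps each one in a uniform metadata envelope."""

    def __init__(self, origin):
        self.origin = origin

    def fetch(self):
        for item in self.fetch_items():
            yield {
                'backend_name': type(self).__name__,
                'origin': self.origin,
                'uuid': hashlib.sha1(repr(item).encode()).hexdigest(),
                'timestamp': time.time(),
                'data': item,  # the raw document, as retrieved
            }

    def fetch_items(self):
        raise NotImplementedError


class ToyGit(Backend):
    """Pretends to fetch commits from a Git repository."""

    def fetch_items(self):
        yield {'commit': 'abc123', 'Author': 'Jane Doe <jane@example.com>'}


items = list(ToyGit('http://example.com/repo.git').fetch())
print(items[0]['backend_name'], items[0]['data']['commit'])
```

Because every document carries the same envelope (`backend_name`, `origin`, `uuid`, `data`), downstream components such as the data storage can stay agnostic of the concrete data source.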

## Identities management

Identity management is a component that handles contributor information and enables analyses where identities, organizations and affiliations are first-class citizens.

Depending on the kind of repository from which the data is retrieved, identities can come in different formats: commit signatures (e.g., full names and email addresses) in Git repositories, email addresses, GitHub or Slack usernames, etc. Furthermore, a given person may use several identities even in the same data source, and in different kinds of data sources. In some cases, an identity can be shared by several contributors (e.g., support email addresses in forums).

SortingHat and HatStall are the tools used in GrimoireLab for managing identities.

- [SortingHat](https://github.com/chaoss/grimoirelab-sortinghat) maintains a relational database with identities and related information, including the origin of each identity, for traceability. In the usual pipeline, the data storage component feeds SortingHat with identities found in data sources. SortingHat uses heuristics-based algorithms to merge identity data, and sends the unified data back to the data storage component.
- [HatStall](https://github.com/chaoss/grimoirelab-hatstall) is a web application which provides an intuitive graphical interface to perform operations over a SortingHat database.
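As a simplified illustration of such identity unification (stdlib-only, not SortingHat's actual algorithm or schema), identities sharing an email address can be grouped under a single profile:

```python
from collections import defaultdict

# Identities as they might appear in different data sources; the
# field names are illustrative, not SortingHat's schema.
identities = [
    {'source': 'git', 'name': 'Jane Doe', 'email': 'jane@example.com'},
    {'source': 'github', 'name': 'jdoe', 'email': 'jane@example.com'},
    {'source': 'slack', 'name': 'Jane D.', 'email': 'jane.d@example.org'},
]


def merge_by_email(identities):
    """Group identities sharing an email address: one simple heuristic.
    Real matchers also compare full names, usernames, etc."""
    merged = defaultdict(list)
    for identity in identities:
        merged[identity['email']].append(identity)
    return dict(merged)


profiles = merge_by_email(identities)
# the git and github identities share an email, so they merge
print(len(profiles['jane@example.com']))  # 2
```

Note the third identity stays separate: heuristics can only merge what the data supports, which is why HatStall's manual review interface matters.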

## Data storage

GrimoireLab pipelines usually involve storing the retrieved data, with two main goals: allowing for repeating analysis without needing new data retrieval, and having pre-processed data available, suitable for visualization and analytics. For the first goal, a raw database is maintained, with a copy of all JSON documents produced by Perceval. For the second, an enriched database, which is a flattened summary of the raw documents, is produced and stored.
The tools of the data storage component, ELK and Cereslib, are described below.

- [ELK](https://github.com/chaoss/grimoirelab-elk) is the tool interacting with the database. The design underlying ELK consists of a feeder that collects the JSON documents produced by the data retrieval component. Next, the documents are stored as the raw database. Dumps of this raw data can be easily created to make any analysis reproducible, or just to conveniently perform the analytics with technologies beyond the ones provided by GrimoireLab.

  Then, the raw data is enriched by including identity information and attributes not directly available in the original data. For example, pair programming information is added to Git data when it can be extracted from commit messages, and the time to solve (i.e., close or merge) an issue or a pull request is added to GitHub data. The data obtained is finally stored as flat JSON documents, embedding references to the raw documents for traceability.

- [Cereslib](https://github.com/chaoss/grimoirelab-cereslib) is a tool that aims at simplifying data processing and is tailored to GrimoireLab data. It provides interfaces to access ELK raw and enriched data and manipulate it (e.g., split and filter) to produce additional insights about software development.
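The raw-to-enriched step can be sketched as follows. This is a trimmed, stdlib-only illustration rather than ELK's actual schema, although the raw fields mimic the shape of Perceval's Git backend output: a nested raw document is flattened and extended with derived attributes.

```python
# A (heavily trimmed) raw document, shaped like the output of the
# Git data retrieval backend.
raw = {
    'uuid': '0f1a...',
    'origin': 'https://github.com/chaoss/grimoirelab-perceval.git',
    'data': {
        'commit': 'dc78c254e464ff334892e0448a23e4cfbfc637a3',
        'Author': 'Jane Doe <jane@example.com>',
        'message': 'Fix parser',
        'files': [{'file': 'perceval/backend.py', 'added': '10', 'removed': '2'}],
    },
}


def enrich(raw_doc):
    """Flatten a raw commit into an enriched document, keeping a
    reference to the raw uuid for traceability."""
    data = raw_doc['data']
    name, _, email = data['Author'].partition(' <')
    return {
        'uuid': raw_doc['uuid'],  # link back to the raw document
        'origin': raw_doc['origin'],
        'hash': data['commit'],
        'author_name': name.strip(),
        'author_email': email.rstrip('>'),
        'files': len(data['files']),
        'lines_added': sum(int(f['added']) for f in data['files']),
        'lines_removed': sum(int(f['removed']) for f in data['files']),
    }


enriched = enrich(raw)
print(enriched['author_email'], enriched['lines_added'])  # jane@example.com 10
```

The flat document is what visualization tools query; the `uuid` keeps every enriched record traceable back to its raw source.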

## Analytics

The analytics component is in charge of presenting the data via static reports and dynamic dashboards. The tools participating in the generation of such artifacts are described below.

- **Reports** are generated by [Manuscripts](https://github.com/chaoss/grimoirelab-manuscripts), a tool that queries the GrimoireLab data storage and produces template-based documents, ready to be delivered to decision-makers, who can in this way easily identify relevant aspects of their projects.

- **Dashboard** creation involves three tools:

- [Sigils](https://github.com/chaoss/grimoirelab-sigils) is a set of predefined widgets (e.g., visualizations and charts) available as JSON documents.

- [Kidash](https://github.com/chaoss/grimoirelab-kidash) is a tool to import and export widgets to and from Kibiter.

- [Kibiter](https://github.com/chaoss/grimoirelab-kibiter) is a downstream of [Kibana](https://github.com/elastic/kibana) that binds the Sigils widgets to the GrimoireLab data, thus providing web-based dashboards for actionable inspection, drill down, and filtering of the software development data retrieved.

## Orchestration

The orchestration component takes care of coordinating the process leading to the dashboards.

[SirMordred](https://github.com/chaoss/grimoirelab-sirmordred) is the tool that enables the user to easily run GrimoireLab: it retrieves data from software repositories, produces raw and enriched data, loads predefined widgets and generates dashboards.

SirMordred relies on the `setup.cfg` and `projects.json` files, which have been designed to keep sensitive data separate from data that can be publicly shared. Thus, the setup file includes credentials and tokens to access the GrimoireLab components and software repositories, while the projects file includes the information about the projects to analyse. Both files are described below.

- `setup.cfg` holds the configuration driving all the processes underlying GrimoireLab. It is composed of sections that define general settings, such as which components to activate and where to store the logs, as well as the location and credentials for ELK, SortingHat and Kibiter, which can be protected to prevent undesired access. Furthermore, it also includes sections to set up the parameters used by the data retrieval component to access the software repositories (e.g., GitHub tokens, Gerrit username) and fetch their data.

- `projects.json` enables users to list the projects to analyse, divided by data source, such as Git repositories, GitHub and GitLab issue trackers and Slack channels. Furthermore, it also allows adding meta information to group projects together; this structure is reflected in the dashboards.
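As an illustration, a minimal pair of files might look like the following sketch. The project name and credentials are hypothetical, and only a small subset of the available sections and keys is shown; consult the SirMordred documentation for the full format.

```ini
; setup.cfg (fragment, illustrative)
[general]
short_name = Example
logs_dir = /tmp/logs

[projects]
projects_file = projects.json

[es_collection]
url = http://localhost:9200

[sortinghat]
host = localhost
user = root
password = secret
database = shdb

[git]
raw_index = git_raw
enriched_index = git_enriched
```

```json
{
    "example-project": {
        "git": ["https://github.com/chaoss/grimoirelab-perceval.git"],
        "github": ["https://github.com/chaoss/grimoirelab-perceval"]
    }
}
```

Keeping credentials in `setup.cfg` and only repository lists in `projects.json` is what allows the projects file to be shared publicly.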
14 changes: 14 additions & 0 deletions architecture/components.md
@@ -0,0 +1,14 @@
# Components

This section provides details about the GrimoireLab tools, which are highlighted in the figure below.

![](../assets/grimoirelab-all-details.png)

- [Perceval](./perceval.html)
- [Graal](./graal.html)
- [King Arthur](./kingarthur.html)
- [SortingHat & HatStall](./sortinghatstall.html)
- [ELK and Cereslib](./elkceres.html)
- [Manuscripts](./manuscripts.html)
- [Sigils, Kidash & Kibiter](./dashboards.html)
- [Mordred](./mordred.html)
1 change: 1 addition & 0 deletions architecture/dashboards.md
@@ -0,0 +1 @@
TODO
1 change: 1 addition & 0 deletions architecture/elkceres.md
@@ -0,0 +1 @@
TODO
115 changes: 115 additions & 0 deletions architecture/graal.md
@@ -0,0 +1,115 @@
## Graal

Graal complements the data extracted with Perceval by providing insights about the source code (e.g., code complexity, licenses).

Graal leverages the incremental functionalities provided by Perceval and enhances its logic for handling Git repositories in order to process their source code. The overall view of Graal and its connection with Perceval is summarized in the figure below: the Git backend creates a local mirror of a Git repository (local or remote) and fetches its commits in chronological order. Several parameters are available to control the execution; for instance, *from_date* and *to_date* select commits authored after and before a given date, *branches* fetches commits only from specific branches, and *latest_items* returns only those commits which are new since the last fetch operation.

Graal extends the Git backend by enabling the creation of a working tree (and its pruning), which allows performing checkout operations that are not possible on a Git mirror. Furthermore, it also includes additional parameters used to drive the analysis: filtering files and directories in or out of the repository (*in_paths* and *out_paths*), setting the *entrypoint*, and defining the *details* level of the analysis (useful when analyzing large software projects).

![](../assets/graal.png)
*Overview of Graal*

Following the philosophy of Perceval, the output of the Git backend execution is a list of JSON documents (one per commit). Graal intercepts each document, replaces some metadata information (e.g., backend name, category) and enables the user to perform the following steps: (i) filter, (ii) analyze and (iii) post-process, which are described below.

- **Filter.**
The filtering is used to select or discard commits based on the information available in the JSON document and/or via the Graal parameters (e.g., the commits authored by a given user or targeting a given software component). For any selected commit, Graal executes a checkout on the working tree using the commit hash, thus setting the state of the working tree at that given revision. The default built-in behavior of this step is to select all commits.

- **Analyze.**
The analysis takes the document and the current working tree and enables the user to set up an ad-hoc source code analysis by plugging in existing tools through system calls or their Python interfaces, when possible. The results of the analysis are parsed and manipulated by the user and then automatically embedded in the JSON document. In this step, the user can rely on some predefined functionalities of Graal to deal with the repository snapshot (e.g., listing files, creating archives). By default, this step does not perform any analysis, so the input document is returned as it is.

- **Post-process.**
In the final step, the inflated JSON document can be optionally processed to alter (e.g., renaming, removing) its attributes, thus granting the user complete control
over the output of Graal executions. The built-in behavior of this step keeps all attributes as they are.
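The three steps can be pictured with a stdlib-only sketch (illustrative; this is not Graal's actual code): each commit document flows through filter, analyze and post-process callbacks whose defaults match the built-in behaviors described above.

```python
def run_pipeline(commits, filter_fn=None, analyze_fn=None, post_fn=None):
    """Mimic Graal's per-commit flow: select, analyze, post-process.
    Defaults: keep every commit, add no analysis, keep all attributes."""
    filter_fn = filter_fn or (lambda c: True)
    analyze_fn = analyze_fn or (lambda c: {})
    post_fn = post_fn or (lambda c: c)

    for commit in commits:
        if not filter_fn(commit):                 # (i) filter
            continue
        commit['analysis'] = analyze_fn(commit)   # (ii) analyze
        yield post_fn(commit)                     # (iii) post-process


commits = [{'hash': 'abc', 'author': 'jane'}, {'hash': 'def', 'author': 'john'}]
out = list(run_pipeline(
    commits,
    filter_fn=lambda c: c['author'] == 'jane',    # select Jane's commits
    analyze_fn=lambda c: {'loc': 42},             # stand-in analysis result
))
print(out)  # [{'hash': 'abc', 'author': 'jane', 'analysis': {'loc': 42}}]
```

In real Graal backends, the analyze step would check out the commit in the working tree and call an external tool instead of returning a constant.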

### Backends
Several backends have been developed to assess the genericity of Graal. Those backends leverage source code analysis tools, whose executions are triggered via system calls or their Python interfaces. Currently, the backends mostly target Python code; however, other backends can be easily developed to cover other programming languages. The currently available backends are:
- **CoCom** gathers data about code complexity (e.g., cyclomatic complexity, LOC) from projects written in popular programming languages such as C/C++, Java, Scala, JavaScript, Ruby, Python, Lua and Golang. It leverages [Cloc](http://cloc.sourceforge.net/) and [Lizard](https://github.com/terryyin/lizard). The tool can be executed at `file` and `repository` levels, activated by passing the corresponding category: `code_complexity_lizard_file` or `code_complexity_lizard_repository`.
- **CoDep** extracts package and class dependencies of a Python module and serializes them as JSON structures, composed of edges and nodes, thus easing the bridging with front-end technologies for graph visualizations. It combines [PyReverse](https://pypi.org/project/pyreverse/) and [NetworkX](https://networkx.github.io/).
- **CoQua** retrieves code quality insights, such as checks on code line length, well-formed variable names, unused imported modules and code clones. It uses [PyLint](https://www.pylint.org/) and [Flake8](http://flake8.pycqa.org/en/latest/index.html). The tools can be activated by passing the corresponding category: `code_quality_pylint` or `code_quality_flake8`.
- **CoVuln** scans the code to identify security vulnerabilities such as potential SQL and Shell injections, hard-coded passwords and weak cryptographic key size. It relies on [Bandit](https://github.com/PyCQA/bandit).
- **CoLic** scans the code to extract license & copyright information. It currently supports [Nomos](https://github.com/fossology/fossology/tree/master/src/nomos) and [ScanCode](https://github.com/nexB/scancode-toolkit). They can be activated by passing the corresponding category: `code_license_nomos`, `code_license_scancode`, or `code_license_scancode_cli`.
- **CoLang** gathers insights about the code language distribution of a Git repository. It relies on the [Linguist](https://github.com/github/linguist) and [Cloc](http://cloc.sourceforge.net/) tools. They can be activated by passing the corresponding category: `code_language_linguist` or `code_language_cloc`.

## Graal in action
This section describes how to install and use Graal, highlighting its main features.

### Installation

Graal is being developed and tested mainly on GNU/Linux platforms. Thus it is very likely that it will work out of the box on any Linux-like (or Unix-like) platform, provided the right version of Python is available. The listing below shows how to install and uninstall Graal on your system. Currently, the only way of installing Graal consists of cloning the GitHub repository hosting the [tool](https://github.com/chaoss/grimoirelab-graal) and using the setup script, while uninstalling the tool can be easily achieved by relying on *pip*.

```bash
# To install, run:
git clone https://github.com/chaoss/grimoirelab-graal
cd grimoirelab-graal
python3 setup.py build
python3 setup.py install

# To uninstall, run:
pip3 uninstall graal
```

### Use

Once installed, Graal can be used as a stand-alone program or Python library. We showcase these two types of executions below.

#### Stand-alone program
Using Graal as a stand-alone program does not require much effort, only some basic knowledge of GNU/Linux shell commands. The listing below shows how easy it is to fetch code complexity information from a Git repository. As can be seen, the CoCom backend requires the URL where the repository is located (https://github.com/chaoss/grimoirelab-perceval) and the local path where to mirror the repository (/tmp/graal-cocom). The JSON documents produced are then redirected to the file graal-cocom.test. The remaining messages in the listing are printed for the user during the execution.

Interesting optional arguments are *from-date*, which is inherited from Perceval and allows fetching commits from a given date; *worktree-path*, which sets the path of the working tree; and *details*, which enables a fine-grained analysis by returning complexity information for methods/functions.

```bash
graal cocom https://github.com/chaoss/grimoirelab-perceval --git-path /tmp/graal-cocom > graal-cocom.test
[2018-05-30 18:22:35,643] - Starting the quest for the Graal.
[2018-05-30 18:22:39,958] - Git worktree /tmp/... created!
[2018-05-30 18:22:39,959] - Fetching commits: ...
[2018-05-31 04:51:56,111] - Git worktree /tmp/... deleted!
[2018-05-31 04:51:56,112] - Fetch process completed: ...
[2018-05-31 04:51:56,112] - Quest completed.
```

#### Python Library
Graal's functionalities can be embedded in Python scripts. Again, the effort of using Graal is minimal; in this case the user only needs some knowledge of Python scripting. The listing below shows how to use Graal in a script. The `graal.backends.core.cocom` module is imported at the beginning of the file, then the `repo_uri` and `repo_dir` variables are set to the URI of the Git repository and the local path where to mirror it. These variables are used to initialize a CoCom class object. In the last line of the script, the commits inflated with the result of the analysis are retrieved using the fetch method. The fetch method inherits its arguments from Perceval, thus it optionally accepts two Datetime objects to gather only those commits authored after and before a given date, a list of branches to focus on specific development activities, and a flag to collect only the commits available after the last execution.

```python
#! /usr/bin/env python3
from graal.backends.core.cocom import CoCom

# URL of the Git repo to analyze
repo_uri = 'http://github.com/chaoss/grimoirelab-perceval'
# directory where to mirror the repo
repo_dir = '/tmp/graal-cocom'

# CoCom object initialization
cc = CoCom(uri=repo_uri, gitpath=repo_dir)
# fetch all commits
commits = [commit for commit in cc.fetch()]
```

## Example
TODO