
Backend and crawler for the OSS catalog of Developers Italia


Overview: how the crawler works

The crawler finds and retrieves the publiccode.yml files from the organizations registered on GitHub/Bitbucket/GitLab and listed in the whitelist. It then creates the YAML files used by the Jekyll build chain to generate the static pages of developers.italia.it.

Dependencies and other related software

These are the dependencies and some useful tools used in conjunction with the crawler.

  • Elasticsearch 6.8.7 for storing the data. Elasticsearch must be active and ready to accept connections before the crawler is started (see the readiness check after this list)

  • Kibana 6.8.7 for internal data visualization (optional)

  • Prometheus for collecting metrics (optional; currently supported but not used in production)
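
Since the crawler expects Elasticsearch to be reachable before it starts, it can be useful to wait for the cluster explicitly. The following is only a sketch and assumes Elasticsearch listens on the default port 9200 on localhost:

    # wait until the cluster reports at least yellow health before launching the crawler
    until curl -fsS "http://localhost:9200/_cluster/health?wait_for_status=yellow&timeout=5s" > /dev/null; do
        echo "Stalling for Elasticsearch..."
        sleep 5
    done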

Tools

This is the list of tools used in the repository:

Setup and deployment processes

The crawler can either run directly on the target machine, or it can be deployed as a Docker container, possibly using an orchestrator such as Kubernetes.

Up to now, the crawler and its dependencies have run as Docker containers on a virtual machine. Elasticsearch and Kibana have been deployed using a fork of the main project that bundles Search Guard. This setup is still deployed in production and is what this README calls the legacy deployment process.

With the idea of making the legacy installation more scalable and reliable, the code has recently been refactored. This README refers to this approach as the new deployment process. It uses the official versions of Elasticsearch and Kibana and deploys the Docker containers on top of Kubernetes using Helm charts. While the crawler has its own Helm chart, Elasticsearch and Kibana are deployed through their official Helm charts. In the new deployment process, the docker-compose.yml file is only used to bring up a local development and test environment.
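
For the Kubernetes part of the new deployment process, Elasticsearch and Kibana come from their official Helm charts, while the crawler ships with its own chart. A minimal sketch with Helm 3 could look like the following; the chart path and release names are assumptions and are not taken from this repository:

    # add the official Elastic chart repository
    helm repo add elastic https://helm.elastic.co
    helm repo update

    # install the official Elasticsearch and Kibana charts
    helm install elasticsearch elastic/elasticsearch
    helm install kibana elastic/kibana

    # install the crawler from its own chart (hypothetical local path)
    helm install crawler ./helm-charts/crawler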

This section starts by describing how to build and run the crawler directly on a target machine; the procedure is the same one automated in the Dockerfile, and a consolidated shell sketch follows the step list. The Docker deployment procedures, both legacy and new, are described afterwards.

Manually configure and build the crawler

  • cd crawler

  • Fill the domains.yml file with the configuration values (i.e. the basic auth tokens for each host)

  • Rename the config.toml.example file to config.toml and fill the variables

NOTE: the application also supports environment variables as a substitute for the config.toml file. Environment variables take priority over the values in the configuration file.

  • Build the crawler binary: make

  • Start the crawler: bin/crawler crawl whitelist/*.yml

  • Configure the crontab as desired
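
Putting the steps together, a minimal build-and-run session might look like the following; this is only a sketch, and the configuration values still have to be filled in by hand:

    cd crawler

    # configuration: start from the shipped example, then edit the values
    cp config.toml.example config.toml
    # edit config.toml (or export the equivalent environment variables)
    # edit domains.yml and add the basic auth tokens for each host

    # build the crawler binary
    make

    # run a full crawl over every whitelist file
    bin/crawler crawl whitelist/*.yml

    # example crontab entry (edit with crontab -e; the path is hypothetical):
    # 0 2 * * * cd /opt/developers-italia-backend/crawler && bin/crawler crawl whitelist/*.yml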

Run the crawler

  • bin/crawler updateipa downloads IPA data and writes them into Elasticsearch

  • bin/crawler download-whitelist downloads organizations and repositories from the onboarding portal repository and saves them to a whitelist file
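
Taken together with the crawl command above, a typical refresh sequence might look like the following; this is a sketch and assumes the binary has already been built and that config.toml points at a running Elasticsearch:

    # refresh the whitelist from the onboarding portal
    bin/crawler download-whitelist

    # download the IPA data and load it into Elasticsearch
    bin/crawler updateipa

    # crawl every organization and repository in the whitelist
    bin/crawler crawl whitelist/*.yml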

Docker: the legacy deployment process

This section describes how to set up and deploy the crawler following the legacy deployment process.

  • Rename .env-search-guard.example to .env and adapt its variables as needed

  • Rename elasticsearch-searchguard/config/searchguard/sg_internal_users.yml.example to elasticsearch-searchguard/config/searchguard/sg_internal_users.yml and insert the correct passwords. Hashed passwords can be generated with:

    docker exec -t -i developers-italia-backend_elasticsearch elasticsearch-searchguard/plugins/search-guard-6/tools/hash.sh -p <password>
  • Insert the kibana password in kibana-searchguard/config/kibana.yml

  • Configure the Nginx proxy for the elasticsearch host with the following directives:

    limit_req_zone $binary_remote_addr zone=elasticsearch_limit:10m rate=10r/s;
    
    server {
        ...
        location / {
            limit_req zone=elasticsearch_limit burst=20 nodelay;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
            proxy_pass http://localhost:9200;
            proxy_ssl_session_reuse off;
            proxy_cache_bypass $http_upgrade;
            proxy_redirect off;
        }
    }
    
  • You might need to run sysctl -w vm.max_map_count=262144 and make the setting permanent in /etc/sysctl.conf in order to start Elasticsearch, as documented in the official Elasticsearch documentation (see the example after this list)

  • Start Docker: make up
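
To apply the vm.max_map_count setting mentioned above and make it survive a reboot, something like the following can be used:

    # apply the value to the running kernel
    sudo sysctl -w vm.max_map_count=262144

    # persist it across reboots
    echo "vm.max_map_count=262144" | sudo tee -a /etc/sysctl.conf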

Docker: the new deployment process

The repository has a Dockerfile, which is also used to build the production image, and a docker-compose.yml file to facilitate local deployment.

The containers declared in the docker-compose.yml file rely on some environment variables that should be declared in a .env file. The .env.example file provides example values. Before proceeding with the build, copy .env.example to .env and adjust the environment variables as needed.

To build the crawler container, download its dependencies and start them all, run:

docker-compose up [-d] [--build]

where:

  • -d runs the containers in the background

  • --build forces the images to be rebuilt

To destroy the containers, use:

docker-compose down
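
As a quick reference, a full local bring-up and tear-down could therefore look like this sketch; the actual variable names are listed in .env.example:

    # configure the environment used by docker-compose
    cp .env.example .env
    # edit .env and adjust the variables as needed

    # build and start the whole stack in the background
    docker-compose up -d --build

    # tear everything down when finished
    docker-compose down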

Xpack

By default the system, and specifically Elasticsearch, doesn't make use of X-Pack, so no passwords or certificates are required. To achieve this, the Elasticsearch container mounts this configuration file. This makes things work out of the box, but it is not appropriate for production environments.

An alternative configuration file that enables X-Pack is available here. In order to use it, you should:

At this point you can bring up the environment with docker-compose.
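
With X-Pack security enabled, anonymous requests are rejected, so a quick way to verify the setup is to query the cluster with the built-in elastic user. This is only a sketch: it assumes the default port and that the password is available in the ELASTIC_PASSWORD environment variable, which is not necessarily how this repository wires it up:

    # with X-Pack security enabled this request must be authenticated
    curl -u "elastic:${ELASTIC_PASSWORD}" "http://localhost:9200/_cluster/health?pretty"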

Troubleshooting Q/A

  • The Docker logs suggest that the Elasticsearch container needs more virtual memory and the startup hangs at Stalling for Elasticsearch...

    Increase container virtual memory: https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html#docker-cli-run-prod-mode

  • When building the crawler image with make, a fatal memory error occurs: "fatal error: out of memory"

    You probably need to increase the memory assigned to the Docker machine: docker-machine stop && VBoxManage modifyvm default --cpus 2 && VBoxManage modifyvm default --memory 2048 && docker-machine start

See also

Authors

Developers Italia is a project by AgID and the Italian Digital Team, which developed the crawler and maintains this repository.
