This tool allows you to quickly schedule Repo Crawler jobs on any standard Linux system.
Additional info can be found here
You'll need to clone this git repository (or download and extract it) onto a basic Linux system and run the install.sh script:

```sh
git clone https://github.com/cyral-quickstart/quickstart-crawler-express.git
cd quickstart-crawler-express
sudo ./install.sh
```
Configuration can be done in a few simple steps:

- Log in to your Control Plane
- Get an API Key
  - From the bottom left, select `API Access Keys`
  - Select the `+` to add a new key
  - Give it a name and select the following permissions:
    - View Datamaps
    - Modify Policies
    - View Sidecars and Repositories
    - Modify Sidecars and Repositories
    - Repo Crawler
  - Save the produced ID/Secret
- Set up a Data Repo
  - If you haven't already, add a Data Repo (from the bottom-left menu)
- SSH to the instance you installed the Crawler on
- Run `crawler`
  - Configure the control plane information
  - Configure the repo
  - Configure Data / Account jobs
  - Run
Once the job has run successfully, you can verify that it reported its results by going to Data Repos > Your Repo > Data Map > Auto Updates.
- **Control Plane Configuration** - The information required to communicate results back to the control plane.
- **Repo Configuration** - Corresponds to the Data Repo configuration on the control plane, where the results will be pushed.
- **Database Discovery Jobs** - Configuration for the specific databases on the Data Repo to scan for data classification.
- **Local Account Discovery Jobs** - Scans the Data Repo for any defined local accounts and populates them in the Control Plane under that Data Repo.
- **Worker ID** - A unique ID used to track which crawler ran a job.
By default, all config files end up in `~/.local/cyral`.
File | Description |
---|---|
`controlplane.env` | Contains all of the control plane connection info |
`<repo name>/repo.config.env` | Contains the repo configuration (one `<repo name>` directory per repo) |
`<repo name>/<db name>.env` | Contains the DB name for data classification discovery |
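As a sketch, here is what that layout might look like after configuring a single repo with one database discovery job. The repo name (`my-postgres-repo`) and database name (`customers`) below are placeholders for illustration, not values the crawler produces:

```shell
# Illustrative config layout under ~/.local/cyral for one repo
# named "my-postgres-repo" with one database "customers"
# (both names are placeholders)
mkdir -p ~/.local/cyral/my-postgres-repo
touch ~/.local/cyral/controlplane.env
touch ~/.local/cyral/my-postgres-repo/repo.config.env
touch ~/.local/cyral/my-postgres-repo/customers.env

# List the resulting files
(cd ~/.local/cyral && find . -type f | sort)
```

In a real install these files are created for you by the `crawler` configuration steps above; this only shows where to look for them.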
There are a few environment variables that can be used for on-demand runs to help diagnose errors or modify how the crawler runs.
Variable | Description |
---|---|
`CRAWLER_LOG_LEVEL` | This can be set to `trace` to increase the logging level. Default is `info` |
`CRAWLER_NETWORK_MODE` | This can be used to control the network mode for the crawler container. Default is `host` |
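For example, a one-off diagnostic run could be launched with the log level raised to `trace`. The snippet below demonstrates the shell fallback-to-default pattern for these variables; the `crawler` invocation itself is shown only as a comment, since it assumes the tool installed by `install.sh`:

```shell
# Fall back to the documented defaults when the variables are unset
CRAWLER_LOG_LEVEL="${CRAWLER_LOG_LEVEL:-info}"
CRAWLER_NETWORK_MODE="${CRAWLER_NETWORK_MODE:-host}"
echo "log level: $CRAWLER_LOG_LEVEL, network mode: $CRAWLER_NETWORK_MODE"

# A diagnostic on-demand run would then look like:
#   CRAWLER_LOG_LEVEL=trace crawler
```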