General-purpose crawler (initially created for GlusterFS)
Go

Crawler

A fast (hopefully) crawler for the GlusterFS brick backend. It can be used with any filesystem.

Install

Download the binary from the GitHub releases page and copy it to /usr/local/bin (or any other directory that is in PATH).

wget https://github.com/aravindavk/crawler/releases/download/v0.8/crawler
sudo cp crawler /usr/local/bin/
sudo chmod +x /usr/local/bin/crawler

If you have Go installed, you can install from source instead (make sure $GOPATH/bin is in PATH):

go get github.com/aravindavk/crawler

Usage

usage: crawler.py [-h] [--dirs DIRS] [--workers WORKERS] [--mtime MTIME]
                  [--missing MISSING] [--ignore IGNORE] [--outfile OUTFILE]
                  PATH

Crawler for GlusterFS brick backend

positional arguments:
  PATH               Crawl Start Path

optional arguments:
  -h, --help         show this help message and exit
  --dirs DIRS        Directories list
  --workers WORKERS  Number of Workers
  --mtime MTIME      mtime filter, gets files modified after this time
  --missing MISSING  Check missing files in this Dir/Mount
  --ignore IGNORE    Ignore this directory
  --outfile OUTFILE  Output File

Basic usage: start a crawl from a given directory

crawler <CRAWL_DIR>
crawler /exports/bricks/b1

Specify the number of workers used for the crawl (default is 4)

crawler -workers <NUM> <CRAWL_DIR>
crawler -workers 2 /exports/bricks/b1
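
How the workers share the crawl is not described here. The following is only a rough sketch of one way to bound concurrency in Go, not necessarily what crawler.go does: every directory gets its own goroutine, and a semaphore channel limits how many directories are read at once. The names crawl, visit and sem are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"sync"
)

// crawl returns every path found under root. At most `workers`
// directories are read at the same time. (Hypothetical helper.)
func crawl(root string, workers int) []string {
	if workers < 1 {
		workers = 1 // an unbuffered semaphore would block forever
	}
	var (
		mu    sync.Mutex
		wg    sync.WaitGroup
		paths []string
	)
	sem := make(chan struct{}, workers) // caps concurrent ReadDir calls

	var visit func(dir string)
	visit = func(dir string) {
		defer wg.Done()

		sem <- struct{}{} // take a worker slot
		entries, err := os.ReadDir(dir)
		<-sem // release it before queueing subdirectories
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			return
		}
		for _, e := range entries {
			p := filepath.Join(dir, e.Name())
			mu.Lock()
			paths = append(paths, p)
			mu.Unlock()
			if e.IsDir() { // false for symlinks, so links are not followed
				wg.Add(1)
				go visit(p)
			}
		}
	}

	wg.Add(1)
	go visit(root)
	wg.Wait()

	sort.Strings(paths) // goroutine scheduling order is not deterministic
	return paths
}

func main() {
	root := "."
	if len(os.Args) > 1 {
		root = os.Args[1]
	}
	for _, p := range crawl(root, 4) {
		fmt.Println(p)
	}
}
```

Releasing the slot before descending keeps a deep tree from holding every slot while waiting on its children.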

Ignore a directory. Only one directory is supported for now; ignoring multiple files/dirs may be added in the future. The ignore path is relative to <CRAWL_DIR>.

crawler -ignore <DIR> <CRAWL_DIR>
crawler -ignore .glusterfs /exports/bricks/b1
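
As an illustration of the relative ignore path, here is a minimal sketch using the standard library walker. It is an assumption about the behaviour, not the tool's actual code; shouldSkip and walkIgnoring are hypothetical names.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// shouldSkip reports whether path is the ignored directory, where
// ignore is given relative to root. (Hypothetical helper.)
func shouldSkip(root, ignore, path string) bool {
	if ignore == "" {
		return false
	}
	rel, err := filepath.Rel(root, path)
	return err == nil && rel == filepath.Clean(ignore)
}

// walkIgnoring visits everything under root except the ignored
// directory and its contents.
func walkIgnoring(root, ignore string, visit func(path string)) error {
	return filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() && shouldSkip(root, ignore, path) {
			return filepath.SkipDir // do not descend into this directory
		}
		visit(path)
		return nil
	})
}

func main() {
	err := walkIgnoring(".", ".glusterfs", func(p string) { fmt.Println(p) })
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```

Because the comparison is made on the path relative to the root, only the top-level .glusterfs is skipped; a directory with the same name deeper in the tree is still crawled.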

List only files modified after the given time (a Unix timestamp in seconds)

crawler -mtime <TS> <CRAWL_DIR>
crawler -mtime 1422649923 /exports/bricks/b1
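
A sketch of what such an mtime filter could look like in Go. Whether the comparison is strict and whether symlinks are followed are assumptions here; modifiedAfter is a hypothetical name.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// modifiedAfter reports whether path was modified strictly after ts,
// where ts is a Unix timestamp in seconds.
// Assumption: Lstat is used, so symlinks are not followed.
func modifiedAfter(path string, ts int64) (bool, error) {
	info, err := os.Lstat(path)
	if err != nil {
		return false, err
	}
	return info.ModTime().After(time.Unix(ts, 0)), nil
}

func main() {
	const ts = 1422649923 // timestamp from the example above
	for _, path := range os.Args[1:] {
		ok, err := modifiedAfter(path, ts)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		if ok {
			fmt.Println(path)
		}
	}
}
```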

Specify an output file instead of printing to the screen

crawler -output <OUTFILE> <CRAWL_DIR>
crawler -output output.txt /exports/bricks/b1

List only the files that are missing from the given mount/dir

crawler -missing <MOUNT/DIR PATH> <CRAWL_DIR>
crawler -missing /mnt/volumes/gv2 /exports/bricks/b1
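
One plausible way to implement the check, shown only as a sketch: map each crawled path to the same relative path under the mount and test whether it exists. The helper names counterpart and isMissing, and the sample file dir1/file1, are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// counterpart maps a path under root to the same relative path
// under mount. (Hypothetical helper.)
func counterpart(root, mount, path string) (string, error) {
	rel, err := filepath.Rel(root, path)
	if err != nil {
		return "", err
	}
	return filepath.Join(mount, rel), nil
}

// isMissing reports whether path (under root) has no counterpart
// under mount. Errors other than "not exist" are returned as-is.
func isMissing(root, mount, path string) (bool, error) {
	target, err := counterpart(root, mount, path)
	if err != nil {
		return false, err
	}
	_, err = os.Lstat(target)
	if errors.Is(err, fs.ErrNotExist) {
		return true, nil
	}
	return false, err
}

func main() {
	// Paths from the example above; dir1/file1 is made up.
	missing, err := isMissing("/exports/bricks/b1", "/mnt/volumes/gv2",
		"/exports/bricks/b1/dir1/file1")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Println("missing:", missing)
}
```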

List only the files/dirs that are missing from the given mount/dir and were modified after a given time

crawler -mtime <TS> -missing <MOUNT/DIR PATH> <CRAWL_DIR>
crawler -mtime 1422649923 -missing /mnt/volumes/gv2 /exports/bricks/b1

If you already know which directories need to be crawled, pass them in a list file. The crawler can then avoid a stat call on each file, and the crawl is limited to those directories. Entries are relative to <CRAWL_DIR>, so include the root directory (".") in the list if it should be crawled too.

crawler -dirs <DIRECTORIES_LIST_FILE> <CRAWL_DIR>
crawler -dirs ~/mydirs.txt /exports/bricks/b1
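
The format of the list file is not documented here. The sketch below assumes one relative directory per line and shows why a known list avoids per-file stat calls: each listed directory is read once and nothing needs to be probed to find subdirectories. parseDirList and listEntries are hypothetical names, not the tool's code.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// parseDirList splits the list file content into directory names,
// one per line (assumed format), dropping blank lines.
func parseDirList(content string) []string {
	var dirs []string
	for _, line := range strings.Split(content, "\n") {
		if line = strings.TrimSpace(line); line != "" {
			dirs = append(dirs, line)
		}
	}
	return dirs
}

// listEntries reads each listed directory once, without recursing.
// os.ReadDir normally needs no per-file stat (it may fall back to
// lstat when the filesystem does not report entry types).
func listEntries(root string, dirs []string) ([]string, error) {
	var paths []string
	for _, dir := range dirs {
		entries, err := os.ReadDir(filepath.Join(root, dir))
		if err != nil {
			return paths, err
		}
		for _, e := range entries {
			paths = append(paths, filepath.Join(root, dir, e.Name()))
		}
	}
	return paths, nil
}

func main() {
	dirs := []string{"."} // default: only the root directory
	if len(os.Args) > 1 {
		content, err := os.ReadFile(os.Args[1])
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			return
		}
		dirs = parseDirList(string(content))
	}
	paths, err := listEntries(".", dirs)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
	for _, p := range paths {
		fmt.Println(p)
	}
}
```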

Source

git clone https://github.com/aravindavk/crawler.git

Blogs

  1. crawler - A utility to Crawl GlusterFS brick backend