DRAT Statistics

Chris Mattmann edited this page Jul 20, 2017 · 3 revisions

DRAT (Distributed Release Audit Tool) Statistics

What

This is a simple utility, written in Python, that uses DRAT to scan multiple code repositories sequentially, collects statistics for each one, and writes them to both Apache Solr (the "statistics" core) and a user-defined directory.


What Statistics

  • Crawl start time
  • Crawl end time
  • Index start time
  • Index end time
  • Mapper start time
  • Mapper end time
  • Reducer start time
  • Reducer end time
  • Notes (count from RatAggregator)
  • Binaries (count from RatAggregator)
  • Archives (count from RatAggregator)
  • Standards (count from RatAggregator)
  • Apache (count from RatAggregator)
  • Generated (count from RatAggregator)
  • Unknown (count from RatAggregator)
  • Mimetypes (count from "drat" core by doing a facet on "mimetype")

All license types are stored as "license_*" fields and MIME types as "mime_*" fields.
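The MIME type facet described above can be sketched roughly as follows. This is an illustration only, not the code in dratstats.py: the helper names `mimetype_facet_url` and `facet_to_mime_fields` and the base URL are hypothetical, and it is an assumption that each MIME type is simply prefixed with "mime_" without further normalization. It relies on Solr's default JSON response layout, in which facet counts for a field arrive as a flat list of alternating values and counts.

```python
from urllib.parse import urlencode


def mimetype_facet_url(solr_base, core="drat"):
    """Build a select URL that facets on the "mimetype" field (hypothetical helper)."""
    params = {
        "q": "*:*",            # match every document in the core
        "rows": 0,             # only the facet counts are needed, not the documents
        "facet": "true",
        "facet.field": "mimetype",
        "wt": "json",
    }
    return f"{solr_base.rstrip('/')}/{core}/select?{urlencode(params)}"


def facet_to_mime_fields(flat_counts):
    """Turn Solr's flat [value, count, value, count, ...] list into mime_* fields.

    Assumption: the field name is "mime_" followed by the raw MIME type.
    """
    values = flat_counts[0::2]
    counts = flat_counts[1::2]
    return {f"mime_{value}": count for value, count in zip(values, counts)}
```

For instance, `facet_to_mime_fields(["text/plain", 12, "image/png", 3])` yields one `mime_*` entry per MIME type, which could then be merged into the document sent to the "statistics" core.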


Why

DRAT runs on a single code repository and generates output for it. But what if a large number of repositories must be scanned and their individual statistics recorded? This utility can be used for such large-scale tasks. Storing the results in a Solr core makes it possible to explore and visualize the statistics through function and facet queries.


How To Use

  1. Set the following environment variables:
  2. Run the script as below:
python dratstats.py <path to list of repository URLs> <path to output directory>

The arguments are as follows:

  • Path to a flat file containing the list of repositories to traverse, one repository per line. Each line begins with the absolute path to a source code repository; in the examples below, the path is followed by what appear to be a name, a repository URL, and a short description. The entries below reference the Apache Tika and Apache Nutch codebases on a local file system.
/apacheSvn/tika ApacheTika http://github.com/apache/tika.git The digital babel fish.
/apacheSvn/nutch ApacheNutch http://github.com/apache/nutch.git The open source web crawler.

A sample repos.txt file is available.
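A minimal sketch of reading one line of such a file is shown below. It assumes the four whitespace-separated fields seen in the example entries (path, name, URL, description) and treats everything after the path as optional; the `RepoEntry` type and `parse_repo_line` function are hypothetical names for illustration, not part of dratstats.py.

```python
from typing import NamedTuple


class RepoEntry(NamedTuple):
    path: str         # absolute path to the repository on the local file system
    name: str         # short name, e.g. "ApacheTika" (assumed optional)
    url: str          # repository URL (assumed optional)
    description: str  # free-text description, may contain spaces (assumed optional)


def parse_repo_line(line):
    """Split a repos.txt line into at most four fields; the last keeps its spaces."""
    parts = line.strip().split(None, 3)
    if not parts:
        raise ValueError("empty repository line")
    parts += [""] * (4 - len(parts))  # pad missing optional fields
    return RepoEntry(*parts)
```

Splitting with a maximum of three cuts keeps the description intact even though it contains spaces.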

  • Path to the output directory to which the contents of ${DRAT_HOME}/data are copied for each repository. Each folder in the output directory is named after its repository path using the following convention:
    • The first character, ‘/’, is removed.
    • Every remaining ‘/’ is replaced with ‘_’.
    • The current timestamp is appended. For example, the output folder for the ‘/apacheSvn/tika’ repository could be named apacheSvn_tika_2016-01-15T23:14:39Z.
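The naming convention above can be sketched as a small function. This is an illustration under stated assumptions rather than the script's actual code: `output_folder_name` is a hypothetical name, the timestamp is assumed to be UTC (suggested by the trailing "Z" in the example), and the leading character is only dropped when it is a ‘/’.

```python
from datetime import datetime, timezone


def output_folder_name(repo_path, now=None):
    """Derive an output folder name from a repository path and a timestamp."""
    # Assumption: timestamps are taken in UTC and formatted like 2016-01-15T23:14:39Z.
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    # Drop the leading '/', then replace every remaining '/' with '_'.
    base = repo_path[1:] if repo_path.startswith("/") else repo_path
    return base.replace("/", "_") + "_" + stamp
```

Passing `now` explicitly makes the function easy to check against the example in the text; leaving it out uses the current time.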