Testing Basis Technology Software with Cloudera Search
The following table shows the versions of CDH and RBL-JE used to test this software.
|CDH 5 (5.1)||2.3.0|
These directions presume that you are using the Cloudera Starter VM for CDH 5.
First, install the contents of this git repository.
- Log in as the user cloudera.
- cd ~cloudera
- mkdir basis
- cd basis
- git clone https://github.com/bt-kaglidden/basis-cloudera-tests.git
Next, install RBL-JE. You can either unpack the contents of the RBL-JE package yourself or run the script ~cloudera/basis/basis-cloudera-tests/install-rblje.sh. Here is an example of running the installation script. sudo is required because the package is installed under /opt:
sudo install-rblje.sh -r /opt/rblje-2.3.0 -s rbl-je-2.3.0.zip -l rlp-license.xml
Note that “/opt/rblje-2.3.0” in the example above is the root directory of the RBL-JE installation. The rblje-*.sh scripts, described below, set the variable RBLJE_ROOT to this value. If you install RBL-JE somewhere else, edit the rblje-*.sh scripts to match.
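If you would rather not edit each script by hand, one option is to let an environment variable override the default. The following is a minimal sketch under that assumption, not the code the rblje-*.sh scripts actually contain; the function name check_rblje_root is hypothetical.

```shell
# Hypothetical sketch: use RBLJE_ROOT from the environment if it is set,
# otherwise fall back to the example installation path used above.
RBLJE_ROOT="${RBLJE_ROOT:-/opt/rblje-2.3.0}"

# Hypothetical helper: report whether the given RBL-JE root directory exists.
check_rblje_root() {
    if [ ! -d "$1" ]; then
        echo "RBL-JE root not found: $1" >&2
        return 1
    fi
    return 0
}
```

A script could then call `check_rblje_root "$RBLJE_ROOT" || exit 1` before starting an index job, so that a wrong path fails immediately rather than partway through the job.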
Running the tests
Scripts are provided that run map-reduce index jobs. These scripts are named using this convention:
rblje-<corpus-name>.sh, where <corpus-name> indicates the data that will be indexed.
If you want to undo what these scripts have done, run undo-solr-collection.sh, passing it rblje-<corpus-name>.
The <language>-plain-text corpora (e.g. eng-plain-text), where <language> is a three-letter language code, contain plain text files in the given language.
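The naming convention can be sketched as two small shell helpers. The function names are hypothetical and exist only for illustration; whether a script exists for a given corpus depends on what the repository provides.

```shell
# Hypothetical helper: name of the index script for a corpus,
# following the rblje-<corpus-name>.sh convention.
rblje_script_name() {
    printf 'rblje-%s.sh\n' "$1"
}

# Hypothetical helper: the matching undo command, which takes
# rblje-<corpus-name> (without the .sh suffix) as its argument.
rblje_undo_command() {
    printf 'undo-solr-collection.sh rblje-%s\n' "$1"
}

rblje_script_name eng-plain-text
rblje_undo_command eng-plain-text
```

For the eng-plain-text corpus, the helpers print the script name rblje-eng-plain-text.sh and the undo command undo-solr-collection.sh rblje-eng-plain-text.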
Here is an example of indexing English documents:
- cd ~cloudera/basis/basis-cloudera-tests
To undo this, i.e. remove the resultant collection from Solr and clean up intermediate files, run:
Note that the rblje scripts that copy data from the documents directory include commands that copy files not provided in this git repository. These commands refer to files in corpora that are proprietary to Basis. They are harmless when the files are absent.
Viewing Test Results
The test scripts load data into Solr collections. These can be viewed in the Solr Admin UI at this URL:
Preparing your own Plain Text files for indexing
Use the helper script prepare-plain-text.sh to add your own text files for testing. It prepends the name of each file to that file's contents, in the format expected by the morphlines configuration files ~cloudera/basis/basis-cloudera-tests/config/rblje-<corpus-name>-morphlines.conf.
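To illustrate the idea of prepending a file name, here is a minimal sketch. It is not the actual prepare-plain-text.sh. It assumes that the file's base name goes on its own first line, which may differ from the format the morphlines configuration really expects, and the function name prepend_filename is hypothetical.

```shell
# Hypothetical sketch, not the real prepare-plain-text.sh.
# Assumption: the base name of the source file is written as the first
# line of the output, followed by the original contents unchanged.
prepend_filename() {
    src="$1"
    dest="$2"
    {
        basename "$src"
        cat "$src"
    } > "$dest"
}
```

For example, `prepend_filename notes/doc1.txt prepared/doc1.txt` would write a copy of doc1.txt whose first line is `doc1.txt`. Check the real script and the morphlines configuration for the exact format before relying on this.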