Files used to configure and test Basis Technology software with Cloudera Search

Testing Basis Technology Software with Cloudera Search

Introduction

This document describes how to integrate and test Basis Technology’s Rosette Base Linguistics for Java (RBL-JE) with Cloudera Search.

Compatibility

The following table shows the versions of Cloudera CDH and RBL-JE used for these tests.

Cloudera       RBL-JE
CDH 5 (5.1)    2.3.0

Installation

These directions presume that you are using the Cloudera Starter VM for CDH 5.

First, install the contents of this git repository.

  1. Log in as the user cloudera.
  2. cd ~cloudera
  3. mkdir basis
  4. cd basis
  5. git clone https://github.com/bt-kaglidden/basis-cloudera-tests.git

Next, install RBL-JE. You can either unpack the contents of the RBL-JE package or run the script ~cloudera/basis/basis-cloudera-tests/install-rblje.sh. Here is an example usage of this installation script. Note that sudo is used as we are installing the package in /opt:

sudo install-rblje.sh -r /opt/rblje-2.3.0 -s rbl-je-2.3.0.zip -l rlp-license.xml

Note that “/opt/rblje-2.3.0” in the example above is the root directory of the RBL-JE installation. The rblje-*.sh scripts, described below, set the variable RBLJE_ROOT to this value. If you install RBL-JE somewhere else, edit the rblje-*.sh scripts to match.
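If RBL-JE lives somewhere other than /opt/rblje-2.3.0, the edit can be scripted. The following is a minimal sketch, not part of this repository: the function name set_rblje_root is hypothetical, and it assumes each script assigns the variable on a line beginning with "RBLJE_ROOT=", which has not been checked against the actual scripts.

```shell
#!/bin/sh
# Hypothetical helper: rewrite the RBLJE_ROOT assignment in every rblje-*.sh
# script found in a directory.
# Assumption: each script sets the variable on a line starting "RBLJE_ROOT=".
# The new root must not contain '|' or '&', which are special to this sed call.
set_rblje_root() {
    dir=$1
    new_root=$2
    for script in "$dir"/rblje-*.sh; do
        [ -f "$script" ] || continue
        # Write to a temporary file first so a failed sed never truncates the script.
        # Copying back with cat keeps the script's existing permissions.
        sed "s|^RBLJE_ROOT=.*|RBLJE_ROOT=$new_root|" "$script" > "$script.tmp" &&
            cat "$script.tmp" > "$script"
        rm -f "$script.tmp"
    done
}
```

For example, set_rblje_root ~cloudera/basis/basis-cloudera-tests /usr/local/rblje-2.3.0 would point the scripts at an installation under /usr/local. Check the result by hand before running any indexing job.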

Running the tests

Scripts are provided that run MapReduce indexing jobs. These scripts are named using this convention:

rblje-<corpus-name>.sh, where <corpus-name> indicates the data that will be indexed.

If you want to undo what these scripts have done, run undo-solr-collection.sh, passing it rblje-<corpus-name>.

The <language>-plain-text corpora (e.g. eng-plain-text), where <language> is a three-letter language code, contain plain text files in the given language.
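The naming convention can be taken apart with plain shell parameter expansion. This sketch is illustrative only; the function names undo_argument and corpus_language are hypothetical and do not appear in the repository.

```shell
#!/bin/sh
# Hypothetical helpers that follow the rblje-<corpus-name>.sh convention.

# Script file name -> the argument expected by undo-solr-collection.sh.
# Example: ./rblje-eng-plain-text.sh -> rblje-eng-plain-text
undo_argument() {
    basename "$1" .sh
}

# rblje-<language>-plain-text -> the three-letter language code.
# Example: rblje-eng-plain-text -> eng
corpus_language() {
    name=${1#rblje-}              # drop the leading "rblje-"
    printf '%s\n' "${name%%-*}"   # keep everything before the first "-"
}
```

With these, corpus_language "$(undo_argument ./rblje-spa-plain-text.sh)" prints spa.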

Here is an example of indexing English documents:

  1. cd ~cloudera/basis/basis-cloudera-tests
  2. ./rblje-eng-plain-text.sh

To undo this, i.e. remove the resultant collection from Solr and clean up intermediate files, run:

./undo-solr-collection.sh rblje-eng-plain-text
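To index several corpora in one pass, the per-language scripts can be run in a loop. This is a sketch under assumptions rather than a supported workflow: the function run_corpora is hypothetical, and it assumes each script can be run from the repository directory with no arguments, as in the English example above.

```shell
#!/bin/sh
# Hypothetical helper: run rblje-<language>-plain-text.sh for each language
# code given, from inside the directory that holds the scripts.
# Stops at the first script that fails; skips scripts that are missing.
run_corpora() {
    dir=$1
    shift
    for lang in "$@"; do
        script="$dir/rblje-$lang-plain-text.sh"
        if [ -x "$script" ]; then
            # Subshell so the caller's working directory is left unchanged.
            (cd "$dir" && "./rblje-$lang-plain-text.sh") || return 1
        else
            echo "skipping $lang: $script not found or not executable" >&2
        fi
    done
}
```

For example, run_corpora ~cloudera/basis/basis-cloudera-tests eng spa zho runs the three scripts shipped in this repository. Each resulting collection can then be removed separately with undo-solr-collection.sh.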

Note that the rblje scripts that copy data from the documents directory include commands that copy files not provided in this git repository. These commands refer to files in corpora that are proprietary to Basis. Leaving these commands in place is harmless.

Viewing Test Results

The test scripts load data into Solr collections. These can be viewed in the Solr Admin UI at this URL:

http://quickstart.cloudera:8983/solr/#/
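A collection can also be checked from the command line through Solr's standard select handler. The sketch below assumes that the collection is named after the script that created it (for example rblje-eng-plain-text), which this document implies but does not state; the function names are hypothetical.

```shell
#!/bin/sh
# Base URL of the Solr instance on the Cloudera VM; override via SOLR_BASE.
SOLR_BASE=${SOLR_BASE:-http://quickstart.cloudera:8983/solr}

# Build a query URL that matches every document but returns no rows,
# so the JSON response carries only the header and the numFound count.
solr_count_url() {
    printf '%s/%s/select?q=*:*&rows=0&wt=json\n' "$SOLR_BASE" "$1"
}

# Fetch the response for a collection (requires curl and a running Solr).
# Assumption: the collection name equals rblje-<corpus-name>.
count_docs() {
    curl -s "$(solr_count_url "$1")"
}
```

Running count_docs rblje-eng-plain-text after the English indexing job should return a JSON document whose response section includes a numFound field.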

Preparing your own Plain Text files for indexing

Use the helper script prepare-plain-text.sh to add your own text files for testing. It prepends each file's name to that file's contents, in the format expected by the morphlines configuration files ~cloudera/basis/basis-cloudera-tests/config/rblje-<corpus-name>-morphlines.conf.