SIP 1

kimballa edited this page Sep 14, 2010 · 9 revisions
SIP 1
Title Providing multiple entry-points into Sqoop
Author Aaron Kimball (aaron at cloudera dot com)
Created April 16, 2010
Status Accepted
Discussion http://github.com/cloudera/sqoop/issues/issue/7
Implementation kimballa/sqoop@faf04bc4

Abstract

This SIP proposes breaking the monolithic “sqoop” program entry point into a number of separate programs (e.g., sqoop-import and sqoop-export), to provide users with a clearer set of arguments that affect each operation.

Problem statement

The single “sqoop” program takes a large number of arguments which control all aspects of its operation. This has now grown into several different operations. This single program can:

  • generate code
  • import from an RDBMS to HDFS
  • export from HDFS to an RDBMS
  • manipulate Hive metadata and interact with HDFS files
  • query databases for metadata (e.g., --list-tables)

As the number of different operations supported by sqoop grows, additional arguments will be required to handle each of these new operations. Furthermore, as each operation provides the user with more fine-grained control, the number of operation-specific arguments will increase. It is growing difficult to know which arguments one should set to effect a particular operation.

Therefore, I propose that we break out operations into separate program entry points (e.g., sqoop-import, sqoop-export, sqoop-codegen, etc).

Specification

User-facing changes

An abstract class called SqoopTool represents a single operation that can be performed. It has a run() method that returns a status integer, where 0 is success and non-zero is a failure code. A static, global mapping from strings to SqoopTool instances determines what “subprogram” to run.

The main bin/sqoop program will accept as its first positional argument the name of the subprogram to run. e.g., bin/sqoop import --table ... --connect .... This will dispatch to the SqoopTool implementation bound to the string "import".

A help subprogram will list all available tools.

For convenience, users will be provided with git-style bin/sqoop-tool programs (e.g., bin/sqoop-import, bin/sqoop-export, etc).

API changes

Currently the entire set of arguments passed to Sqoop are manually processed in the SqoopOptions class. This class also validates the arguments, prints the usage message for Sqoop, as well as stores the argument values.

Different subprograms will require different (but overlapping) sets of arguments. Rather than create several different FooOptions classes which redundantly store argument values, the single SqoopOptions should remain as the store for all program initialization state. However, configuration of the SqoopOptions will be left to the SqoopTool implementations. Each SqoopTool should expose a configureOptions() method which configures a Commons-CLI Options object to retrieve arguments from the command line. It then generates the appropriate SqoopOptions based on the actual arguments in its applyOptions() method. Finally, these values are validated in the validateOptions() method; mutually exclusive arguments, arguments with out-of-range values, etc. are signalled here.

Common methods from an abstract base class (BaseSqoopTool) will configure, apply and verify all options which are common to all Sqoop programs (e.g., the connect string, username, and password for the database).

Clarifications

  • The current Sqoop program will perform possibly multiple operations in series. e.g., the default control-flow will both generate code as well as perform an import. By splitting Sqoop into multiple subprograms, it should still be possible for one subprogram to invoke another.
  • Options are also currently extracted using Hadoop’s GenericOptionsParser as applied through ToolRunner. Sqoop will still use ToolRunner to process all generic options before yielding to the internal options parsing logic. Due to subtleties in how ToolRunner interacts with some argument parsing, it is recommended to run Sqoop through the public static int Sqoop.runSqoop(Sqoop sqoop, String [] argv) method which captures some arguments which ToolRunner and GenericOptionsParser would inadvertently strip out (specifically, the "--" argument which separates normal arguments from subprogram arguments would be lost). runSqoop() then uses ToolRunner.run() in the usual way. If subprogram arguments are not required, then ToolRunner can be used directly.

Listings

The proposed SqoopTool API can be summarized thusly:

import org.apache.hadoop.sqoop.cli.ToolOptions;

public abstract class SqoopTool {

  /** Main body of code to run the tool. */  
  public abstract int run(SqoopOptions options);

  /** Configure the command-line arguments we expect to receive.
   * ToolOptions is a wrapper around org.apache.commons.cli.Options
   * that contains some additional information about option families.
   */
  public void configureOptions(ToolOptions opts);

  /** Generate the SqoopOptions containing actual argument values from
   * the Options extracted into a CommandLine. 
   * @param in the CLI CommandLine that contain the user's arguments.
   * @param out the SqoopOptions with all fields applied.
   */
  public void applyOptions(CommandLine in, SqoopOptions out);

  /** Print a usage message for this tool to the console.
   * @param options the configured options for this tool.
   */
  public void printHelp(ToolOptions options);

  /**
   * Validates options and ensures that any required options are
   * present and that any mutually-exclusive options are not selected.
   * @throws InvalidOptionsException if there's a problem.
   */
  public void validateOptions(SqoopOptions options)
      throws InvalidOptionsException;
}

This uses the Apache Commons CLI argument parsing library. The plan is to use v1.2 (the current stable release), as this is already included with and used by Hadoop.

Compatibility Issues

This is an incompatible change. Existing program arguments to Sqoop will cease to function as expected. For example, we currently run imports and exports like this:

bin/sqoop --connect jdbc://someserver/db --table sometable # import
bin/sqoop --connect jdbc://someserver/db --table othertable --export-dir /otherdata # export

Both of these commands will return an error after this change, because we have not specified the subprogram to run. After the conversion, these operations would be effected with:

bin/sqoop import --connect jdbc://someserver/db --table sometable
bin/sqoop export --connect jdbc://someserver/db --table othertable --export-dir /otherdata

As Sqoop has not had a formal “release,” I believe this incompatibility to be acceptable. As this proposes to restructure the argument-parsing semantics of Sqoop, I do not believe it would be in the best interest of the project to maintain the existing argument structure for backwards-compatibility reasons.

Unversioned releases of Sqoop have been bundled with releases of Cloudera’s Distribution for Hadoop. The version of Sqoop packaged in the still-beta CDH3 would be affected by this change. The version of Sqoop packaged in the now-stable CDH2 would not see this backported.

Test Plan

A large number of the presently-available unit tests are end-to-end tests of a Sqoop subprogram; they actually pass an array of arguments to the Sqoop.main() method to control its execution. These existing tests cover all functionality. The functionality of Sqoop is not expected to change as a result of this refactoring. Therefore, the existing tests will have their argument lists updated to the new argument sets; when all tests pass, then all currently-available functionality should be preserved.

Discussion

Please provide feedback and comments at http://github.com/cloudera/sqoop/issues/issue/7