Title: Public API for Sqoop v1.0.0
Author: Aaron Kimball (aaron at cloudera dot com)
Created: May 14, 2010
Status: Accepted


This SIP defines the public API to be exposed in the first release of Sqoop. The org.apache.hadoop.sqoop.lib package contains the public API relied upon by external clients of Sqoop. Generated code produced by Sqoop depends on these modules. Clients of imported data may also rely on additional modules specified here.

Problem statement

To deal with the unique table schemas of each database a Sqoop user imports, Sqoop’s current design requires that it generate a per-table class. This class is used to interact with the data after it is imported to Hadoop: data can be stored in SequenceFiles, which require this class to deserialize records. Subsequent re-exports of the data rely on this class to push records back to the RDBMS. The generated class also includes support for parsing text-based representations of the data.

This class, however, relies on reusable code modules provided with Sqoop. These code modules are all placed in the org.apache.hadoop.sqoop.lib package. Clients of generated code must be able to rely on previously-generated code to work with later versions of Sqoop. While code regeneration is possible, Sqoop users should see the lib package as the most stable API provided by Sqoop.

Sqoop also provides a file format for large object data; while large objects can be manipulated in the context of their encapsulating records (e.g., through BlobRef or ClobRef references to the data), the large object file store may be inspected directly.

This SIP defines the official “surface area” of the public packages which will be maintained. In order to ensure that future versions remain backwards compatible, some existing class definitions must be modified. It is hoped that these sorts of “breaking changes” will occur only before incrementing the major version number (1.0, 2.0, etc.), and are thus infrequent disruptions to Sqoop users. Sqoop clients who target only the APIs specified may be confident that their programs will work properly with all subsequent Sqoop releases in the 1.0 series (in accordance with the compatibility and deprecation policy specified in SIP-2).


lib package

As of May 14, 2010, the lib package contains the following classes:

  • BigDecimalSerializer
  • BlobRef
  • ClobRef
  • FieldFormatter
  • JdbcWritableBridge
  • LargeObjectLoader
  • LobRef
  • LobSerializer
  • RecordParser
  • TaskId

and the following interface:

  • SqoopRecord

Classes generated by Sqoop fulfill the interface of SqoopRecord. The first change necessary in this package is to transform SqoopRecord from an interface into an abstract class. This way, subsequent releases in the 1.0 series can introduce additional methods required by SqoopRecords along with a default implementation for previously-generated clients.
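
The motivation for this change can be illustrated with a minimal sketch. A Java interface cannot gain a new method in a later release without breaking every class that already implements it, whereas an abstract class can add a method together with a default body. The names below (RecordBase, EmployeeRecord, fieldValues, describe) are hypothetical stand-ins chosen for illustration; they are not Sqoop's actual classes or methods.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the proposed abstract SqoopRecord (not Sqoop's real class).
abstract class RecordBase {
    // Present in the first release: every generated class must implement it.
    public abstract List<String> fieldValues();

    // Imagined as added in a later 1.x release. Because the base type is an
    // abstract class, the method ships with a default body, so classes that
    // were generated earlier inherit it without being regenerated.
    public String describe() {
        return String.join(",", fieldValues());
    }
}

// Stand-in for a class generated against the first release.
// It implements only fieldValues() and knows nothing about describe().
class EmployeeRecord extends RecordBase {
    private final int id;
    private final String name;

    EmployeeRecord(int id, String name) {
        this.id = id;
        this.name = name;
    }

    @Override
    public List<String> fieldValues() {
        List<String> values = new ArrayList<>();
        values.add(Integer.toString(id));
        values.add(name);
        return values;
    }
}
```

In this sketch, calling describe() on an EmployeeRecord works even though EmployeeRecord was written before describe() existed, which is the compatibility property the 1.0 series relies on.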

The TaskId class is improperly placed in this package. This class is Sqoop-internal and should be moved to the util package.

We should add a class called DelimiterSet which encapsulates the parameters that control how fields are delimited: the field terminator, the record terminator, the escape character, the enclosing character, and whether the enclosing character is optional. This would allow sets of delimiters to be manipulated easily. The SqoopRecord class could then be extended with a toString(DelimiterSet) method that allows users to format output with delimiters other than the ones specified at code generation time.
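
One possible shape for such a class is sketched below. This is only an illustration under stated assumptions: the class name DelimiterSetSketch, its field names, the formatField and formatRecord helpers, and the PairRecord class are all hypothetical, and the escaping rule shown (prefix the escape and enclosing characters with the escape character) is a simplification rather than Sqoop's actual behavior.

```java
// Hypothetical sketch of the proposed DelimiterSet; all names are assumptions.
final class DelimiterSetSketch {
    final char fieldTerminator;
    final char recordTerminator;
    final char escapeChar;
    final char enclosingChar;
    final boolean encloseOptional; // if true, enclose only fields that need it

    DelimiterSetSketch(char fieldTerminator, char recordTerminator,
                       char escapeChar, char enclosingChar, boolean encloseOptional) {
        this.fieldTerminator = fieldTerminator;
        this.recordTerminator = recordTerminator;
        this.escapeChar = escapeChar;
        this.enclosingChar = enclosingChar;
        this.encloseOptional = encloseOptional;
    }

    // Formats one field, enclosing it when required or when it contains a terminator.
    String formatField(String value) {
        boolean enclose = !encloseOptional
                || value.indexOf(fieldTerminator) >= 0
                || value.indexOf(recordTerminator) >= 0;
        StringBuilder sb = new StringBuilder();
        if (enclose) {
            sb.append(enclosingChar);
        }
        for (char c : value.toCharArray()) {
            // Simplified rule: escape the escape and enclosing characters.
            if (c == escapeChar || c == enclosingChar) {
                sb.append(escapeChar);
            }
            sb.append(c);
        }
        if (enclose) {
            sb.append(enclosingChar);
        }
        return sb.toString();
    }

    // Joins formatted fields with the field terminator and ends the record.
    String formatRecord(String... fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                sb.append(fieldTerminator);
            }
            sb.append(formatField(fields[i]));
        }
        return sb.append(recordTerminator).toString();
    }
}

// Hypothetical record showing how a toString(DelimiterSet)-style method could delegate.
class PairRecord {
    private final String key;
    private final String value;

    PairRecord(String key, String value) {
        this.key = key;
        this.value = value;
    }

    String toString(DelimiterSetSketch delimiters) {
        return delimiters.formatRecord(key, value);
    }
}
```

Bundling the five parameters into one immutable object means a caller can pass a single argument to choose, say, comma-separated or tab-separated output for the same record.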

LobRef is an abstract base class that encapsulates common code in BlobRef and ClobRef. The constructors for LobRef are marked as protected. Clients of Sqoop should not subclass LobRef directly.

Classes in the lib package may depend on classes elsewhere in Sqoop’s implementation. Clients should not do so directly.

io package

Clients of Sqoop who have imported large objects into HDFS may have large object files holding their data; this file format is defined in SIP-3. The large objects may be manipulated by iterating over their encapsulating records and calling {B,C}lobRef.getDataStream(), which will retrieve the data for a large object from its underlying store. However, the large objects may also be directly retrieved from their underlying LobFile storage.

The LobFile class is considered part of the public API. Clients of Sqoop may depend on the LobFile.Writer and LobFile.Reader APIs. Clients should never instantiate subclasses of Writer and Reader directly; instead, they should obtain instances through LobFile's static factory methods, such as LobFile.create() for a Writer. The underlying concrete Writer and Reader implementation classes are considered private.
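
The arrangement described above (public abstract Writer and Reader types, private concrete implementations, and static factory methods) can be sketched as follows. This is an in-memory illustration only, not the LobFile implementation: LobStoreSketch, its method signatures, the V0Writer and V0Reader classes, and in particular the newReader factory name are assumptions made for this example.

```java
import java.util.List;

// Hypothetical in-memory stand-in for the LobFile pattern (not Sqoop's real class).
final class LobStoreSketch {
    private LobStoreSketch() { } // no instances; static factories only

    // Public abstract types that clients program against.
    abstract static class Writer {
        abstract void append(byte[] lob);
    }

    abstract static class Reader {
        abstract boolean next();    // advance to the next object, if any
        abstract byte[] current();  // object at the current position
    }

    // Factory for writers, playing the role that LobFile.create() plays.
    static Writer create(List<byte[]> store) {
        return new V0Writer(store);
    }

    // Hypothetical reader factory; this name is an assumption of the sketch.
    static Reader newReader(List<byte[]> store) {
        return new V0Reader(store);
    }

    // Private concrete implementations: free to change between releases.
    private static final class V0Writer extends Writer {
        private final List<byte[]> store;

        V0Writer(List<byte[]> store) {
            this.store = store;
        }

        @Override
        void append(byte[] lob) {
            store.add(lob.clone()); // defensive copy of the caller's buffer
        }
    }

    private static final class V0Reader extends Reader {
        private final List<byte[]> store;
        private int index = -1; // positioned before the first object

        V0Reader(List<byte[]> store) {
            this.store = store;
        }

        @Override
        boolean next() {
            if (index + 1 < store.size()) {
                index++;
                return true;
            }
            return false;
        }

        @Override
        byte[] current() {
            return store.get(index);
        }
    }
}
```

Because client code only ever names the abstract types and the factories, the concrete classes can be replaced (for example, to support a new file format version) without breaking clients.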

To allow users to verify the compression formats available in LobFiles, the CodecMap.getCodecNames() method is also public.

Entry-points to Sqoop

This specification does not define a programmatic entry point to Sqoop. The only method of org.apache.hadoop.sqoop.Sqoop considered stable is its main() method; all others are currently internal. This restriction will be relaxed in a future specification, allowing programmatic client interaction with Sqoop.

Base package

The base package in Sqoop is currently org.apache.hadoop.sqoop. To reflect Sqoop’s migration from an Apache Hadoop subproject to its own project, the class hierarchy should be moved to com.cloudera.sqoop.

Compatibility Issues

Changing SqoopRecord from an interface to an abstract class will cause existing generated code to break. Such a change is expected prior to the 1.0.0 release. This is the last interface in Sqoop; once it is transitioned to an abstract class, subsequent changes to the SqoopRecord API should be backwards-compatible.

Test Plan

The changes required to implement this specification are minimal; the existing unit test suite should cover all necessary testing.


Please provide feedback and comments at