Eclectic-ORC is a Java object writer for creating ORC files by simply annotating your class files as necessary. The framework uses runtime code generation to create a fast customized ORC writer taking care of all the low-level details.
- Declarative Schema Definition
- Annotated column specification (use @Orc or JPA @Column annotations)
Download the eclectic-orc jar from Maven central:
<dependency>
    <groupId>com.eclecticlogic</groupId>
    <artifactId>eclectic-orc</artifactId>
    <version>1.0.9</version>
</dependency>
Minimum dependencies that you need to provide in your application:
- Java 8 or above (the design leverages method references and lambdas extensively)
- slf4j (over logback or log4j) v1.7.23 or higher
Consider a simple class that you want to serialize to an ORC file:
public class Student {
    int year;
    String name;

    public String getName() {
        ...
    }

    public int getYear() {
        ...
    }
    ...
}
To write a collection of Students to an ORC file, you first have to provide a schema definition. The eclectic-orc library makes doing this trivial:
import com.eclecticlogic.orc.Factory;
import com.eclecticlogic.orc.Schema;
...
public void schemaSetup() {
    Schema<Student> schema = Factory.createSchema(Student.class)
            .column(Student::getName) //
            .column(Student::getYear);
}
The above schema definition implicitly does three things:
- It defines the order of the columns (first name then year)
- It defines the data types of the columns (String, int)
- It defines the names of the columns (name, year)
The library allows you to customize aspects of the schema. Let us start with column names. If you want the year column to be called graduationYear, simply change the schema column definition.
Schema<Student> schema = Factory.createSchema(Student.class)
        .column(Student::getName) //
        .column("graduationYear", Student::getYear);
You can also define columns based on properties of other classes that are referenced. Suppose the Student class references a Club class as shown below:
public class Club {
    String name;

    public String sanitizedClubName() {
        return ...
    }
}

public class Student {
    Club club;

    public Club getClub() {
        return club;
    }
}
You can reference the club name in your schema definition by chaining the call as getClub().sanitizedClubName(). The astute reader will have noticed that sanitizedClubName() is not a JavaBean-compliant getter. That is fine: eclectic-orc does not restrict you to JavaBean getters. Any method that takes no parameters and returns a non-void type can be used for a column definition. A schema incorporating the above definition would look like this:
Schema<Student> schema = Factory.createSchema(Student.class)
        .column(Student::getName) //
        .column("graduationYear", Student::getYear)
        .column(it -> it.getClub().sanitizedClubName());
We've now defined a third column of type String and given it the implicit name "sanitizedClubName". Of course, just like before, you can choose to change the name to something else. The same definition in Groovy could be written as:
Schema schema = Factory.createSchema(Student)
        .column { it.name }
        .column('graduationYear') { it.year }
        .column { it.club.sanitizedClubName() }
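As a mental model for what the three-column schema above declares (column order, column names, and one accessor per column), here is a minimal sketch in plain Java with no eclectic-orc dependency. It is not how the library is implemented, which relies on runtime code generation. The ColumnSketch helper, the constructors on the stand-in Student and Club classes, and the body of sanitizedClubName() are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Stand-ins for the README's domain classes; constructors are added for the demo.
class Club {
    private final String name;
    Club(String name) { this.name = name; }
    // Hypothetical body: the README elides what "sanitized" means.
    String sanitizedClubName() { return name.trim(); }
}

class Student {
    private final String name;
    private final int year;
    private final Club club;
    Student(String name, int year, Club club) {
        this.name = name;
        this.year = year;
        this.club = club;
    }
    String getName() { return name; }
    int getYear() { return year; }
    Club getClub() { return club; }
}

// Hypothetical helper, not part of eclectic-orc.
class ColumnSketch {
    // LinkedHashMap keeps insertion order, which models the column order.
    static Map<String, Function<Student, Object>> columns() {
        Map<String, Function<Student, Object>> cols = new LinkedHashMap<>();
        cols.put("name", Student::getName);
        cols.put("graduationYear", Student::getYear);
        // Any no-argument, non-void method works, including chained calls.
        cols.put("sanitizedClubName", s -> s.getClub().sanitizedClubName());
        return cols;
    }

    // Applies every accessor in declaration order to build one row.
    static List<Object> toRow(Student student) {
        List<Object> row = new ArrayList<>();
        for (Function<Student, Object> accessor : columns().values()) {
            row.add(accessor.apply(student));
        }
        return row;
    }
}
```

Calling ColumnSketch.toRow(student) yields the values in the same order as the columns were declared, which is the property the schema definition guarantees for the written file.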
To write a collection of Student objects, we simply create an OrcHandle reference, configure it, open it to get an OrcWriter reference and write our collection.
import org.apache.hadoop.fs.Path;

// First get an OrcHandle reference.
OrcHandle<Student> handle = Factory.createWriter(schema);
// Customize it by calling one of the withXYZ() methods. This is optional as defaults are provided.
// Create an OrcWriter by calling open.
Path path = new Path("/home/kabram/temp/dp/graduate.orc");
OrcWriter<Student> writer = handle.open(path);
List<Student> students = ...
// The write method may be called multiple times if you are retrieving objects in batches.
writer.write(students);
writer.close();
In simple cases, the above code can be written as:
Factory.createWriter(schema) //
        .open(new Path("/home/kabram/temp/dp/graduate.orc")) //
        .write(students) //
        .close();
The following data types are supported in the current release:
- Java primitive types - boolean, char, byte, short, int, long, float, double. These map to their corresponding counterparts with the exception of char, which maps to varchar(1). The exception for char exists because AWS Athena is currently unable to handle char column types.
- BigDecimal mapping to ORC Decimal type.
- LocalDate mapping to ORC Date type.
- Date, LocalDateTime, ZonedDateTime mapping to ORC Timestamp type unless there is either a JPA @Temporal or @OrcTemporal annotation that defines the TemporalType (or OrcTemporalType) as DATE.
- String mapping to ORC string type.
- Any derivative of Iterable mapping to ORC List type, currently supporting only simple types as the member. See below for how to use lists.
The following data types are not supported in the current release:
- Binary data type.
- Map
- Union
- Sub-structures (Struct within your table, map of structs, list of structs, etc.)
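The two lists above can be summarized as a lookup table. The sketch below is only an illustration in plain Java, not eclectic-orc's internal representation. The OrcTypeTable class and its orcTypeFor method are hypothetical, the type labels are descriptive strings, and the names used for the primitive counterparts (tinyint, smallint, bigint) are an assumption based on ORC's usual schema naming. The @Temporal/@OrcTemporal override for date-like types is not modeled.

```java
import java.math.BigDecimal;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.ZonedDateTime;
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical lookup table summarizing the documented default mappings.
class OrcTypeTable {
    static final Map<Class<?>, String> DEFAULTS = new LinkedHashMap<>();
    static {
        DEFAULTS.put(boolean.class, "boolean");
        DEFAULTS.put(char.class, "varchar(1)"); // documented exception (AWS Athena)
        DEFAULTS.put(byte.class, "tinyint");    // assumed ORC name
        DEFAULTS.put(short.class, "smallint");  // assumed ORC name
        DEFAULTS.put(int.class, "int");
        DEFAULTS.put(long.class, "bigint");     // assumed ORC name
        DEFAULTS.put(float.class, "float");
        DEFAULTS.put(double.class, "double");
        DEFAULTS.put(BigDecimal.class, "decimal");
        DEFAULTS.put(LocalDate.class, "date");
        DEFAULTS.put(Date.class, "timestamp");
        DEFAULTS.put(LocalDateTime.class, "timestamp");
        DEFAULTS.put(ZonedDateTime.class, "timestamp");
        DEFAULTS.put(String.class, "string");
    }

    // Returns the default label, or fails for types the README lists as unsupported.
    static String orcTypeFor(Class<?> javaType) {
        String direct = DEFAULTS.get(javaType);
        if (direct != null) {
            return direct;
        }
        if (Iterable.class.isAssignableFrom(javaType)) {
            return "list"; // any derivative of Iterable
        }
        throw new IllegalArgumentException("Unsupported type: " + javaType.getName());
    }
}
```

In this sketch, OrcTypeTable.orcTypeFor(java.util.Map.class) throws an IllegalArgumentException, mirroring the fact that maps are not supported in the current release.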
To specify the number of characters for a String column type, simply use the @Orc annotation. If the framework finds an existing JPA @Column annotation, it will use the length property of that as well. If both annotations are present, the @Orc annotation takes precedence. The @Orc annotation is only supported on methods.
public class Student {
    String name;

    @Orc(length = 50)
    public String getName() {
        return name;
    }
}
You can also specify the precision and scale of the BigDecimal data type by using the JPA @Column or @Orc annotations. By default, the precision is 38 and scale is 10. This can be changed via annotation:
public class Employee {
    BigDecimal salary;

    @Orc(precision = 10, scale = 2)
    public BigDecimal getSalary() {
        return salary;
    }
}
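To make precision and scale concrete: precision is the total number of significant digits and scale is the number of digits after the decimal point, so precision 10 with scale 2 leaves room for 8 digits before the decimal point. The sketch below checks this with java.math.BigDecimal alone. DecimalSketch and fits are hypothetical names, and how the writer treats a value that does not fit is not covered here.

```java
import java.math.BigDecimal;

// Hypothetical helper showing what a (precision, scale) pair can hold.
class DecimalSketch {
    // True when value can be represented as decimal(precision, scale) without losing digits.
    static boolean fits(BigDecimal value, int precision, int scale) {
        BigDecimal stripped = value.stripTrailingZeros();
        if (stripped.scale() > scale) {
            return false; // too many fractional digits
        }
        // Number of digits to the left of the decimal point.
        int integerDigits = stripped.precision() - stripped.scale();
        return integerDigits <= precision - scale;
    }
}
```

For example, with precision 10 and scale 2, the value 12345678.90 fits, while 123456789.00 (nine integer digits) and 1.234 (three fractional digits) do not.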
There may be times you want to write a data type that is not a supported type. For example, you may have a birthday property that only records the year and month using the java.time.YearMonth class. You can handle these column types by defining a type converter, a class that implements the Converter interface. In our example, to convert YearMonth to LocalDate, defaulting to the first day of the month, we could write:
public class YearMonthConverter implements Converter<YearMonth, LocalDate> {

    @Override
    public Class<LocalDate> getConvertedClass() {
        return LocalDate.class;
    }

    @Override
    public LocalDate convert(YearMonth yearMonth) {
        return yearMonth.atDay(1);
    }
}
We can now annotate the YearMonth accessor with the @OrcConverter annotation:
public class Employee {
    YearMonth birthday;

    @OrcConverter(YearMonthConverter.class)
    public YearMonth getBirthday() {
        ...
    }
}
...
Schema<Employee> schema = Factory.createSchema(Employee.class) //
        .column(Employee::getBirthday); // This is now a LocalDate data type.
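The same pattern applies to any other unsupported type. As a second, self-contained illustration, the sketch below converts a java.util.UUID to a String. So that it compiles without the library on the classpath, it declares a local stand-in for the Converter interface containing only the two methods shown above; the real interface may differ. UuidConverter is a hypothetical name.

```java
import java.util.UUID;

// Local stand-in for the library's Converter interface, limited to the
// two methods visible in this README.
interface Converter<S, T> {
    Class<T> getConvertedClass();
    T convert(S source);
}

// Hypothetical converter: stores a UUID as its canonical 36-character string.
class UuidConverter implements Converter<UUID, String> {

    @Override
    public Class<String> getConvertedClass() {
        return String.class;
    }

    @Override
    public String convert(UUID id) {
        return id.toString();
    }
}
```

An accessor returning UUID would then presumably be annotated with @OrcConverter(UuidConverter.class), in the same way as the getBirthday() accessor above.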
Java Enums require special handling to convert them to a specific data type. There are three ways to handle enums.
- Do nothing: If your schema column is an Enum derivative, then the column will be treated as a String with the name() method being called to get the value.
- Annotation: Annotate a custom enum method with @Orc. If you have a method in your Enum class that provides the value you would like to store, you can add the @Orc annotation to it.
- Converter: Annotate your accessor method that returns an Enum with @OrcConverter specifying a converter that takes your enum and returns a supported data-type.
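The three options differ only in which value ends up in the column. The sketch below shows those values side by side for a hypothetical Grade enum, using plain Java without the library annotations; the comments indicate where @Orc or @OrcConverter would go in a real project. Grade, getCode and EnumColumnSketch are assumptions made for illustration.

```java
// Hypothetical enum with a custom value method.
enum Grade {
    FRESHMAN("FR"), SOPHOMORE("SO");

    private final String code;

    Grade(String code) {
        this.code = code;
    }

    // Option 2: in a real project this method would carry the @Orc annotation.
    String getCode() {
        return code;
    }
}

class EnumColumnSketch {
    // Option 1 (do nothing): the column value is the enum constant's name.
    static String defaultValue(Grade grade) {
        return grade.name();
    }

    // Option 2 (annotation): the column value comes from the annotated enum method.
    static String annotatedValue(Grade grade) {
        return grade.getCode();
    }

    // Option 3 (converter): a converter maps the enum to any supported type,
    // here a 1-based int chosen purely as an example.
    static int convertedValue(Grade grade) {
        return grade.ordinal() + 1;
    }
}
```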
Eclectic-orc supports creation of list columns that can hold a single scalar data type. To include a list column in the schema definition, annotate the accessor method with the @OrcList annotation. Strictly speaking, any derivative of java.lang.Iterable is supported. The @OrcList annotation requires you to specify the Class of the entries of the Iterable, because that type information is lost at runtime due to type erasure. You also need to specify the average number of entries you expect to see in the list. This is a technical implementation detail due to the way lists are stored in ORC files. Finally, there is a converter attribute you can use to convert each item of the Iterable to a different type. Note: If you annotate the list accessor with @OrcConverter, you will be converting the List/Iterable itself into some other data type.
If your Iterable consists of Enum instances, the existing strategy for enums is automatically used - using an enum method annotated with @Orc or calling name().
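The type-erasure point behind the @OrcList entry class can be seen with the JDK alone. In the sketch below, two lists with different element types share the same runtime class, so a list instance by itself does not reveal what it holds; once the entry class is supplied explicitly, entries can be checked against it. ErasureDemo and its methods are hypothetical names and do not reflect how the library processes lists.

```java
import java.util.ArrayList;
import java.util.List;

class ErasureDemo {
    // Both lists share one runtime class; the element type is not recorded on the instance.
    static boolean sameRuntimeClass() {
        List<String> names = new ArrayList<>();
        List<Integer> years = new ArrayList<>();
        return names.getClass() == years.getClass();
    }

    // With an explicitly supplied entry class, each entry can be verified.
    static boolean allEntriesAre(Iterable<?> items, Class<?> entryClass) {
        for (Object item : items) {
            if (!entryClass.isInstance(item)) {
                return false;
            }
        }
        return true;
    }
}
```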
If your collection member class has no method that returns the column value you need (that is, the value must be computed on the fly from existing methods in the class), you can create a delegate class. The delegate accepts the collection member class as a constructor parameter and implements the computation; you then use the delegate's method in the column definition.
Schema<Student> schema = Factory.createSchema(Student.class)
        .withDelegate(StudentDelegate.class)
        .delegatedColumn("someProperty", StudentDelegate::getLastFirstName)
        ...
The StudentDelegate class would be something like this:
class StudentDelegate {
    Student delegate;

    StudentDelegate(Student delegate) {
        this.delegate = delegate;
    }

    String getLastFirstName() {
        return delegate.getLastName() + ", " + delegate.getFirstName();
    }
}
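To see the delegate pattern end to end without the library, the sketch below wraps each record in a delegate and reads the computed value from it. The Student stand-in with first and last name fields, and the DelegateSketch helper, are assumptions for illustration; how eclectic-orc instantiates and calls the delegate internally is not shown in this README.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in record class with the two getters the delegate relies on.
class Student {
    private final String firstName;
    private final String lastName;

    Student(String firstName, String lastName) {
        this.firstName = firstName;
        this.lastName = lastName;
    }

    String getFirstName() { return firstName; }
    String getLastName() { return lastName; }
}

// Delegate that computes a value the record class does not expose directly.
class StudentDelegate {
    private final Student delegate;

    StudentDelegate(Student delegate) {
        this.delegate = delegate;
    }

    String getLastFirstName() {
        return delegate.getLastName() + ", " + delegate.getFirstName();
    }
}

// Hypothetical helper: one delegate per record, one computed value per delegate.
class DelegateSketch {
    static List<String> lastFirstNames(List<Student> students) {
        List<String> values = new ArrayList<>();
        for (Student student : students) {
            values.add(new StudentDelegate(student).getLastFirstName());
        }
        return values;
    }
}
```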
- Reverted usage of JOOR and brought back Javassist since JOOR cannot handle fat-jar that spring boot generates.
- Temporary fix for compiler classpath issue with JOOR.
- Switch to JOOR for runtime compilation (better support for Java 9+)
- Fixed bug in array allocation for list columns.
- Added delegate concept for computed columns.
- Bug fix in bootstrap - incorrectly caching instance instead of class.
- Bug fix in OrcWriter.withOptions() method.
- Initial release