Bulk CSV loading through map reduce

Phoenix v 2.1 provides support for loading CSV data into a new/existing Phoenix table using Hadoop Map-Reduce. This provides a means of bulk loading CSV data in parallel through map-reduce, yielding better performance in comparison with the existing psql csv loader.

####Sample input CSV data:

12345, John, Doe
67890, Mary, Poppins

####Compatible Phoenix schema to hold the above CSV data:

 CREATE TABLE ns.example (
    my_pk bigint not null,
    m.first_name varchar(50),
    m.last_name varchar(50) 
    CONSTRAINT pk PRIMARY KEY (my_pk))

Row Key	Column Family (m)
my_pk BIGINT	first_name VARCHAR(50)	last_name VARCHAR(50)
12345	John	Doe
67890	Mary	Poppins

####How to run?

1- Please make sure that Hadoop cluster is working correctly and you are able to run any job like this.

2- Copy latest phoenix-[version].jar to hadoop/lib folder on each node or add it to Hadoop classpath.

3- Run the bulk loader job using the script /bin/csv-bulk-loader.sh as below:

./csv-bulk-loader.sh <option value>

<option>  <value>
-i        CSV data file path in hdfs (mandatory)
-s        Phoenix schema name (mandatory if not default)
-t        Phoenix table name (mandatory)
-sql      Phoenix create table sql file path (mandatory)
-zk       Zookeeper IP:<port> (mandatory)
-mr       MapReduce Job Tracker IP:<port> (mandatory)
-hd       HDFS NameNode IP:<port> (mandatory)
-o        Output directory path in hdfs (optional)
-idx      Phoenix index table name (optional, not yet supported)
-error    Ignore error while reading rows from CSV ? (1-YES | 0-NO, default-1) (optional)
-help     Print all options (optional)

Example

./csv-bulk-loader.sh -i hdfs://server:9000/mydir/data.csv -s ns -t example -sql ~/Documents/createTable.sql -zk server:2181 -hd hdfs://server:9000 -mr server:9001

This would create the phoenix table "ns.example" as specified in createTable.sql and will then load the CSV data from the file "data.csv" located in HDFS into the table.

Notes

You must provide an explicit column family name in your CREATE TABLE statement for your non primary key columns, as the default column family used by Phoenix is treated specially by HBase because it starts with an underscore.
The current bulk loader does not support the migration of index related data yet. So, if you have created your phoenix table with index, please use the psql CSV loader.
In case you want to further optimize the map-reduce performance, please refer to the current map-reduce optimization params in the file "src/main/config/csv-bulk-load-config.properties". In case you modify this list, please re-build the phoenix jar and re-run the job as described above.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bulk CSV loading through map reduce

Notes

Clone this wiki locally