Salesforce2hadoop allows you to import data from Salesforce and put it in HDFS (or your local filesystem), serialised as Avro. Despite its boring name, it's a powerful tool that helps you get all relevant data of your business in one place. It only needs access to the Salesforce API using a username/password combination, and the Enterprise WSDL of your Salesforce Organisation.
- Choose the type(s) of records you want to import
- Data types are preserved by looking at the Enterprise WSDL of your Salesforce Organisation
- Data is stored in Avro format, providing great compatibility with a number of tools
- Do a complete import of your data, or incrementally import only the records that have been changed since your last import
- Salesforce2hadoop keeps track for you of the last time each record type was imported.
- Stores data into any filesystem that Hadoop/KiteSDK supports. Can be HDFS but also a local filesystem.
- Built with the help of KiteSDK, Salesforce WSC and our own wsdl2avro library.
- Built for the JVM, so works on any system that has Java 7 or greater installed
More background information: http://dandydev.net/blog/getting-salesforce-data-into-hadoop
You can find compiled binaries here on Github. Just download, unpack and you're good to go.
$ git clone https://github.com/datadudes/salesforce2hadoop.git $ sbt assembly
The resulting jar file can be found in the
sf2hadoop is a command line application that is really simple to use. If you have Java 7 or higher installed, you
can just use
java -jar sf2hadoop.jar to run the application. To see what options are available, run:
$ java -jar sf2hadoop.jar --help
In order for salesforce2hadoop to understand the structure of your Salesforce data, it has to read the Enterprise WSDL of your Salesforce organisation. You can find out here how to generate and download it for your organisation.
sf2hadoop has 2 commands for importing data from Salesforce: init and update.
Initial data import
Use init to do an initial data import from Salesforce. For each record type a dataset will be created in HDFS (or your local filesystem) and a full import of the desired record types will be done. This can take some time. You can do an inital import like this:
$ java -jar sf2hadoop.jar init -u <salesforce-username> -p <salesforce-password> -b /base/path -w /path/to/enterprise.wsdl -s /path/to/state-file recordtype1 recordtype2 ...
As you can see, it needs your Salesforce username & password to login to the Salesforce API to fetch your data.
Data import directory
It also needs a basePath where it will store all the imported data. This must be in the form of a URI that Hadoop can understand. Currently salesforce2hadoop supports storing data in either HDFS or your local filesystem. URIs for these options have the following format:
hdfs://hostname-of-namenode:port/path/to/dirPort can be left out. Example URI when running
sf2hadoopon the namenode of your Hadoop cluster:
hdfs://localhost/imports/salesforce, this will store all data on HDFS in the
- Local filesystem:
file:///path/to/dirThis will store all data on your local filesystem in the
Imported data will be stored under the provided base path. Data for each record type will be stored under its own directory, which is the record type in lowercase.
sf2hadoop with the types of records you want to import from Salesforce. Specify the types by their Salesforce API names. Examples:
Opportunity, etc. Custom record types usually end with
__c, for example:
Salesforce2hadoop will keep track of what record types have been imported at what point in time. This allows you to switch to incremental imports after the initial import. To make this possible, it will save the name of each record type that is imported, together with an import date. When doing an incremental update of the data (see below), it will read back the import states for each record type, and only request data from Salesforce that has been created/updated since that moment. After the incremental import, it will update the import date for each record type that was imported.
The statefile can be stored either in HDFS or on your local filesystem. As with the basePath, you have to specify the path to the statefile as a URI, following the guidelines above. The statefile will be created automatically, including all non-existing parent directories.
It is advised to store the statefile on HDFS, so you can run
sf2hadoop from any machine, without having to worry if
the proper statefile is present.
Use the update command to do an incremental update of record types that have previously been imported:
$ java -jar sf2hadoop.jar update -u <salesforce-username> -p <salesforce-password> -b /base/path -w /path/to/enterprise.wsdl -s /path/to/state-file recordtype1 recordtype2 ...
The parameters are the same as with the init command.
Record types that have not been previously initialized (no initial import has been done), will be skipped, and the output will tell you so.
Due to the limitations of the Salesforce API,
sf2hadoop can only go back 30 days in the past when doing an incremental
update. Trying an incremental import over a longer timespan, might result in errors or incomplete data, so it is advised
to do an update at least monthly.
Here are some recipies to make the most out of salesforce2hadoop.
Import data into Hive/Impala
Once you've imported Salesforce data into HDFS using
sf2hadoop, you can then create a Hive table backed by the
imported data (in Avro format) by running the following command in the Hive shell:
CREATE EXTERNAL TABLE <tablename> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' LOCATION '/base/path/recordtype_in_lowercase' TBLPROPERTIES ('avro.schema.url'='hdfs:///base/path/recordtype_in_lowercase/.metadata/schema.avsc');
This table will also be available in Impala (you might have to do a
INVALIDATE METADATA for the table to show up).
The reason to create it in Hive instead of in Impala directly, is that Hive can infer the table's schema from the Avro
schema for you.
Update Impala table after incremental import
Each time you do an incremental import of data for which you have created a Hive/Impala table, you have to tell Impala
that new data is available by running the following command in the Impala shell:
Some random TODOs:
- Allow creating Hive tables directly by using the facilities provided by KiteSDK
- Support other Hadoop-compliant filesystems, like S3
You're more than welcome to create issues for any bugs you find and ideas you have. Contributions in the form of pull requests are also very much appreciated!
Salesforce2hadoop was created with passion by: