
Bulk_Data_Load


The DataLoader utility may be used to create and/or load RDF data into a local database instance. Directories are processed recursively. Data files may be compressed with zip or gzip, but the loader does not support multiple data files within a single archive.
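For example, when a dataset is split across several files, compress each file individually rather than packing them into one archive (the file names below are placeholders):

gzip /opt/data/upload/part-0001.nt
gzip /opt/data/upload/part-0002.nt

The loader reads the resulting .nt.gz files directly; a single zip containing both files would not be accepted.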

Command line:

java -cp *:*.jar com.bigdata.rdf.store.DataLoader [-quiet][-closure][-verbose][-namespace namespace] propertyFile (fileOrDir)*
Parameter       Definition
-quiet          Suppress all stdout messages.
-verbose        Show additional messages detailing the load performance.
-closure        Compute the RDF(S)+ closure.
-namespace      The namespace of the KB instance.
propertyFile    The configuration file for the database instance.
fileOrDir       Zero or more files or directories containing the data to be loaded.

Examples:

1. Load all files from the /opt/data/upload/ directory using the /opt/data/upload/journal.properties properties file:

java -cp *:*.jar com.bigdata.rdf.store.DataLoader /opt/data/upload/journal.properties /opt/data/upload/

2. Load the archive /opt/data/data.nt.gz into a specified namespace, using the /opt/data/upload/journal.properties properties file:

java -cp *:*.jar com.bigdata.rdf.store.DataLoader -namespace someNameSpace /opt/data/upload/journal.properties /opt/data/data.nt.gz
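3. The options can also be combined. For illustration (the namespace and paths here are placeholders), load a directory into a specified namespace with additional performance messages:

java -cp *:*.jar com.bigdata.rdf.store.DataLoader -verbose -namespace someNameSpace /opt/data/upload/journal.properties /opt/data/upload/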

If you are loading data with inferencing enabled, a temporary file will be created to compute the delta in entailments. This temporary file can grow very large when loading a big data set, which may cause a "no space left on device" error and interrupt the load. To avoid this, it is strongly recommended to set the DataLoader.Options.CLOSURE property to ClosureEnum.None in the properties file:

com.bigdata.rdf.store.DataLoader.closure=None
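In context, the properties file might look roughly like the following minimal sketch. Only the closure line above is required for this recipe; the journal path and the other property values are illustrative assumptions and should be checked against your own configuration:

# Journal file location (illustrative path).
com.bigdata.journal.AbstractJournal.file=/opt/data/upload/blazegraph.jnl
# Triples mode with RDFS axioms so that entailments can be computed later (illustrative choice).
com.bigdata.rdf.store.AbstractTripleStore.quads=false
com.bigdata.rdf.store.AbstractTripleStore.axiomsClass=com.bigdata.rdf.axioms.RdfsAxioms
# Defer closure computation until after the bulk load.
com.bigdata.rdf.store.DataLoader.closure=None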

You may need to increase the Java heap size to match the data size. In most cases 6 GB is enough (add the Java parameter -Xmx6g). Also be careful about setting the heap above 8 GB, as larger heaps increase garbage-collector pressure.

Then load the data using the DataLoader and pass it the -closure option:

java -Xmx6g -cp *:*.jar com.bigdata.rdf.store.DataLoader -closure /opt/data/upload/journal.properties /opt/data/upload/

The DataLoader will not perform incremental truth maintenance during the load. Once the load is complete, it computes all entailments at once. This is the "database-at-once" closure; it does not use a temporary store to compute the delta in entailments, so the temporary store will not "eat your disk".
