More on xloader
afs committed Dec 9, 2021
1 parent 547d454 commit e52d3c0574242cb2d75539f8401886879f95675e
Showing 3 changed files with 70 additions and 37 deletions.
@@ -13,7 +13,7 @@ title: TDB Command-line Utilities
- [TDB Commands](#tdb-commands)
- [Store description](#store-description)
- [tdbloader](#tdbloader)
- [tdbloader2](#tdbloader2)
- [TDB xloader](#tdb-xloader)
- [tdbquery](#tdbquery)
- [tdbdump](#tdbdump)
- [tdbstats](#tdbstats)
@@ -98,10 +98,37 @@ are loaded into the dataset according to the name or the default graph.
Bulk loader and index builder. Performs bulk load operations more
efficiently than simply reading RDF into a TDB-backed model.

### tdb.xloader

`tdb1.xloader` and `tdb2.xloader` are bulk loaders for very large data for TDB1
and TDB2.

See [TDB xloader](./tdb-xloader.html) for more information. These loaders only
work on Linux and Mac OS/X since they rely on some Unix system utilities.

### `tdbquery`

Invoke a SPARQL query on a store. Use `--time` for timing
information. The store is attached on each run of this command so
timing includes some overhead not present in a running system.

Details about query execution can be obtained -- see notes on the
[TDB Optimizer](optimizer.html#investigating-what-is-going-on).

### `tdbdump`

Dump the store in
[N-Quads](http://www.w3.org/TR/n-quads/)
format.

### `tdbstats`

Produce statistics for the dataset. See the
[TDB Optimizer description](optimizer.html#statistics-rule-file).

### `tdbloader2`

Bulk loader and index builder. Faster than `tdbloader` but only works
on Linux and Mac OS/X since it relies on some Unix system utilities.
*This has been replaced by [TDB xloader](./tdb-xloader.html).*

This bulk loader can only be used to create a database. It may
overwrite existing data. It accepts the `--loc` argument and a
@@ -130,23 +157,3 @@ If you are building a large dataset (i.e. gigabytes of input data) you may
wish to have the [PipeViewer](http://www.ivarch.com/programs/pv.shtml)
tool installed on your system as this will provide extra progress information
during the indexing phase of the build.

### `tdbquery`

Invoke a SPARQL query on a store. Use `--time` for timing
information. The store is attached on each run of this command so
timing includes some overhead not present in a running system.

Details about query execution can be obtained -- see notes on the
[TDB Optimizer](optimizer.html#investigating-what-is-going-on).

### `tdbdump`

Dump the store in
[N-Quads](http://www.w3.org/TR/n-quads/)
format.

### tdbstats

Produce statistics for the dataset. See the
[TDB Optimizer description](optimizer.html#statistics-rule-file).
@@ -4,6 +4,7 @@ title: TDB FAQs

## FAQs

- [What are TDB1 and TDB2?](#tdb1-tdb2)
- [Does TDB support Transactions?](#transactions)
- [Can I share a TDB dataset between multiple applications?](#multi-jvm)
- [What is the *Impossibly Large Object* exception?](#impossibly-large-object)
@@ -18,6 +19,15 @@ title: TDB FAQs
- [What is the *Unable to check TDB lock owner, the lock file contents appear to be for a TDB2 database. Please try loading this location as a TDB2 database* error?](#tdb2-lock)
- [My question isn't answered here?](#not-answered)

<a name="tdb1-tdb2"></a>
## TDB1 and TDB2

TDB2 is a later generation of database for Jena. It is more robust and can
handle large update transactions.

These are different database systems: they have different on-disk file formats,
and databases for one are not compatible with the other database engine.

<a name="transactions"></a>
## Does TDB support transactions?

@@ -37,11 +47,11 @@ transactionally.
## Can I share a TDB dataset between multiple applications?

Multiple applications, running in multiple JVMs, using the same
file databases is **not** supported and has a high risk of data corruption. Once corrupted a database cannot be repaired
file databases is **not** supported and has a high risk of data corruption. Once corrupted, a database cannot be repaired
and must be rebuilt from the original source data. Therefore there **must** be a single JVM
controlling the database directory and files.

From 1.1.0 onwards TDB includes automatic prevention of multi-JVM usage which prevents this under most circumstances and helps
TDB includes automatic prevention of multi-JVM usage which prevents this under most circumstances and helps
protect your data from corruption.

If you wish to share a TDB dataset between applications use our [Fuseki](../fuseki2/) component which provides a
@@ -77,11 +87,22 @@ As noted above to resolve this problem you **must** rebuild your database from t
be repaired. This is why we **strongly** recommend you use [transactions](tdb_transactions.html) since this protects your dataset against
corruption.

## What is `tdb.xloader`?

`tdb1.xloader` and `tdb2.xloader` are bulk loaders for very large datasets that
take several hours to load.

See [TDB xloader](./tdb-xloader.html) for more information.

<a name="tdbloader-vs-tdbloader2"></a>
## What is the difference between `tdbloader` and `tdbloader2`?

`tdbloader2` has been replaced by `tdb1.xloader` and `tdb2.xloader` for TDB1 and TDB2 respectively.


`tdbloader` and `tdbloader2` differ in how they build databases.


`tdbloader` is Java based and uses the same TDB APIs that you would use in your own Java code to perform the data load. The advantage of this is that
it supports incremental loading of data into a TDB database. The downside is that the loader will be slower for initial database builds.

@@ -7,16 +7,19 @@ is stability and reliability for long running loading, running on modest and

xloader is not a replacement for regular TDB1 and TDB2 loaders.

"tdb1.xloader" was called "tdbloader2" and has some improvements.
There are two scripts to load data using the xloader subsystem.

"tdb1.xloader", which was called "tdbloader2" and has some improvements.

It is not as fast as the other TDB loaders on datasets where the general loaders
work without encountering progressive slowdown.

The xloaders for TDB1 and TDB2 are not identical. The TDB2 is more capable; it
is based on the same design approach with further refinements to building the
node table and to reduce the total amount of temporary file space used.
The xloaders for TDB1 and TDB2 are not identical. The TDB2 xloader is more
capable; it is based on the same design approach, with further refinements to
building the node table and to reducing the total amount of temporary file space
used.

The xloader does not run on MS Windows. It uses and external sort program from
The xloader does not run on MS Windows. It uses an external sort program from
unix - `sort(1)`.

The xloader only builds a fresh database from empty.
@@ -30,22 +33,24 @@ or

`tdb1.xloader --loc DIRECTORY` FILE...

Additioally, there is an argument `--tmpdir` to use a different directory for
Additionally, there is an argument `--tmpdir` to use a different directory for
temporary files.

`FILE` is any RDF syntax supported by Jena.
`FILE` is any RDF syntax supported by Jena. Syntax is determined by file
extension and can include an additional ".gz" or ".bz2" for compressed files.
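
As an illustration of the naming convention only, the sketch below shows which
part of a file name carries the syntax extension once an optional compression
suffix is dropped. The function `syntax_ext` is a hypothetical helper written
for this example; it is not part of Jena and does not reproduce Jena's own
detection code.

```shell
# syntax_ext FILE: print the extension that would select the RDF syntax,
# after dropping one optional ".gz" or ".bz2" suffix.
# Hypothetical helper for illustration; not a Jena command.
syntax_ext() {
    name=${1##*/}                 # strip any directory part
    case "$name" in
        *.gz)  name=${name%.gz} ;;
        *.bz2) name=${name%.bz2} ;;
    esac
    case "$name" in
        *.*) printf '%s\n' "${name##*.}" ;;
        *)   return 1 ;;          # no extension left to inspect
    esac
}

syntax_ext data.nt.gz        # prints "nt"
syntax_ext /data/graph.ttl   # prints "ttl"
```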

### Advice

`xloader` uses a lot of temporary disk space.

To avoid a load failing due to a syntax or other data error, it is advisable to
run `riot --check` on the data first. Parsing is faster than loading.

If desired, the data can be converted to [RDF Thrift](../io/rdf-binary.html) at
this stage by adding `--stream rdf-thrift` to the riot checking run.
Parsing RDF Thrift is faster than parsing N-Triples although the bulk of the loading process is not limited by parser speed.
The TDB databases will take up a lot of disk space and, in addition,
`xloader` uses a significant amount of temporary disk space during loading.

If desired, the data can be converted to [RDF Thrift](../io/rdf-binary.html) at
this stage by adding `--stream rdf-thrift` to the riot checking run. Parsing
RDF Thrift is faster than parsing N-Triples although the bulk of the loading
process is not limited by parser speed.

Do not capture the bulk loader output in a file on the same disk as the database
or temporary directory; it slows loading down.
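
The advice above might be put together as in the following sketch: check the
data first, then load it with the temporary directory on a separate disk. All
paths are hypothetical placeholders, and the `run` wrapper is an assumption of
this example (it prints each command by default so the script can be tried
without touching any data). Only `riot --check`, `--loc` and `--tmpdir` come
from the text.

```shell
#!/bin/sh
# Sketch of a checked bulk load. All paths are hypothetical placeholders.
DATA="data.nt.gz"            # input file; syntax taken from the extension
DB="/disk1/DB2"              # database directory, built fresh from empty
TMP="/disk2/xloader-tmp"     # temporary files for the loader

# run CMD ARGS...: print the command when DRY_RUN is 1 (the default here),
# otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Parse-check first; only start the load if the check succeeds.
run riot --check "$DATA" &&
    run tdb2.xloader --loc "$DB" --tmpdir "$TMP" "$DATA"
```

Setting `DRY_RUN=0` executes the commands instead of printing them. For a real
run, any file that captures the loader output should live on a different disk
from `DB` and `TMP`.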
