53 changes: 53 additions & 0 deletions docs/build-docs.sh
@@ -0,0 +1,53 @@
#!/bin/bash
# Builds the Antora docs site with the current git commit stamped on the home page.
# Usage: ./build-docs.sh
# Output: target/site/
#
# To publish to the tika-site SVN repo:
# ./build-docs.sh --publish /path/to/tika-site/publish

set -euo pipefail
cd "$(dirname "$0")"

COMMIT=$(git rev-parse --short HEAD)
DATE=$(date -u +%Y-%m-%d)

# Inject commit into playbook, build, restore
sed -i "/tika-stable-version/a\\ git-commit: '${COMMIT} (${DATE})'" antora-playbook.yml
trap 'git checkout antora-playbook.yml' EXIT

# Pass remaining args to Maven (filter out our --publish flag)
PUBLISH_DIR=""
MVN_ARGS=()
while [[ $# -gt 0 ]]; do
case $1 in
--publish)
PUBLISH_DIR="${2:?--publish requires a directory argument}"
shift 2
;;
*)
MVN_ARGS+=("$1")
shift
;;
esac
done

# Guard against "unbound variable" under set -u when MVN_ARGS is empty (bash < 4.4)
../mvnw antora:antora ${MVN_ARGS[@]+"${MVN_ARGS[@]}"}

echo "Site built at: target/site/"
echo "Commit: ${COMMIT} (${DATE})"

if [[ -n "${PUBLISH_DIR}" ]]; then
# Flatten: skip the 'tika/' component directory so URLs are /docs/4.0.0-SNAPSHOT/
# Copy UI assets one level above docs/ since HTML uses ../../_/ relative paths
DOCS_DIR="${PUBLISH_DIR}/docs"
mkdir -p "${DOCS_DIR}"
cp -r target/site/tika/* "${DOCS_DIR}/"
# No trailing slash on the destination: with an existing _/ a repeat publish would nest _/_/
cp -r target/site/_ "${PUBLISH_DIR}/"
# Fix the root redirect to match flattened layout
sed 's|tika/||g' target/site/index.html > "${DOCS_DIR}/index.html"
sed 's|/docs/tika/|/docs/|g' target/site/sitemap.xml > "${DOCS_DIR}/sitemap.xml"
cp target/site/404.html "${DOCS_DIR}/"
cp target/site/search-index.js "${DOCS_DIR}/"
echo "Published to: ${DOCS_DIR}/"
fi
@@ -221,8 +221,8 @@ Training performs two passes:
2. **Calibration pass** — re-scores training sentences to compute per-language
μ and σ (Welford's online algorithm), stored for z-score computation at runtime.

The corpus can be in Wikipedia dump format (`corpusDir/{code}/sentences.txt`)
or flat format (`corpusDir/{code}` with one sentence per line).
The corpus can be in Wikipedia dump format (`corpusDir/\{code}/sentences.txt`)
or flat format (`corpusDir/\{code}` with one sentence per line).
Use `--max-per-lang N` (default 500,000) to cap sentences per language.
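
The per-language μ and σ in the calibration pass come from Welford's online
algorithm, which needs only a single streaming pass over the scores. A minimal
sketch for reference -- the class and method names here are illustrative, not
Tika's actual API, and the sketch uses the sample (n−1) variance:

```java
// Welford's online algorithm: streaming mean and standard deviation.
public class Welford {
    private long n = 0;
    private double mean = 0.0;
    private double m2 = 0.0; // sum of squared deviations from the running mean

    public void add(double score) {
        n++;
        double delta = score - mean;
        mean += delta / n;
        m2 += delta * (score - mean); // second factor uses the updated mean
    }

    public double mean() { return mean; }

    // Sample standard deviation; 0 until at least two samples are seen.
    public double sigma() { return n > 1 ? Math.sqrt(m2 / (n - 1)) : 0.0; }
}
```

At runtime a sentence score `s` for a language would then be normalized as
`(s - mean()) / sigma()` to produce the z-score.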

== Evaluation Tools
@@ -111,7 +111,7 @@ curl -s -X PUT -H "Accept: application/json" -T testPDF.pdf http://localhost:999

*Expected:* JSON object with metadata only (no content).

=== Test 8: PUT /meta/{field}
=== Test 8: PUT /meta/\{field}

[source,bash]
----
@@ -379,7 +379,7 @@ The following endpoints were tested and verified working:
|`/tika/xml` |PUT |PASS
|`/tika/json` |PUT |PASS
|`/meta` |PUT |PASS
|`/meta/{field}` |PUT |PASS
|`/meta/\{field}` |PUT |PASS
|`/rmeta` |PUT |PASS
|`/rmeta/text` |PUT |PASS
|`/language/stream` |PUT |PASS
@@ -363,7 +363,7 @@ new file, and remove the old binary.

Results on the https://github.com/facebookresearch/flores[FLORES-200] dev set
(204 test languages, 997 sentences each). All scores are macro-averaged F1.
Raw eval output: xref:advanced/flores-eval-20260320.txt[flores-eval-20260320.txt].
Raw eval output: link:{attachmentsdir}/advanced/flores-eval-20260320.txt[flores-eval-20260320.txt].

==== Coverage-adjusted accuracy (each detector on its own supported languages)

5 changes: 2 additions & 3 deletions docs/modules/ROOT/pages/advanced/language-detection.adoc
@@ -287,7 +287,7 @@ numeric check, keeping the language detection hot path fast.

The language detector draws on several well-established techniques.

[bibliography]
[bibliography%unordered]
- [[[cavnar1994]]] W. B. Cavnar and J. M. Trenkle,
"N-Gram-Based Text Categorization,"
in _Proceedings of the Third Annual Symposium on Document Analysis and
@@ -344,8 +344,7 @@ The language detector draws on several well-established techniques.
Current models (v7+) use Wikipedia dumps as the primary corpus. +
https://aclanthology.org/L12-1154/

- [[[jacob2018]]] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard,
H. Adam, and D. Kalenichenko,
- [[[jacob2018]]] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko,
"Quantization and Training of Neural Networks for Efficient
Integer-Arithmetic-Only Inference,"
in _Proceedings of the IEEE Conference on Computer Vision and Pattern
2 changes: 1 addition & 1 deletion docs/modules/ROOT/pages/developers/index.adoc
@@ -20,7 +20,7 @@ with custom parsers, detectors, and other components.

== Topics

* xref:serialization.adoc[Serialization and Configuration] - JSON configuration,
* xref:developers/serialization.adoc[Serialization and Configuration] - JSON configuration,
@TikaComponent annotation, and creating custom components

== Coming Soon
4 changes: 4 additions & 0 deletions docs/modules/ROOT/pages/index.adoc
@@ -41,3 +41,7 @@ xref:using-tika/index.adoc[Using Tika] to choose your integration method.

Apache Tika is an Apache Software Foundation project, formerly a subproject of Apache Lucene.

ifdef::git-commit[]
[.small]#Built from commit: `{git-commit}`#
endif::[]

52 changes: 42 additions & 10 deletions docs/modules/ROOT/pages/maintainers/site.adoc
@@ -41,6 +41,18 @@ mvn antora:antora

The generated site will be at `docs/target/site/`.

To stamp the build with the current commit hash (shown on the home page),
add `git-commit` to the attributes in `antora-playbook.yml`:

[source,yaml]
----
asciidoc:
attributes:
git-commit: 'abc1234'
----

Alternatively, pass the attribute on the command line if your build invocation supports overriding playbook attributes.

=== Previewing the Site

**Option 1: Python HTTP server (recommended)**
@@ -99,7 +111,29 @@ Documentation versions are managed through Git branches with the `docs/` prefix.

The playbook (`antora-playbook.yml`) is configured to build all `docs/*` branches automatically.

=== Publishing a New Release
=== Publishing to the Site

Use `build-docs.sh` with the `--publish` flag to build and copy to the site SVN checkout:

[source,bash]
----
cd docs
./build-docs.sh --publish /path/to/tika-site/publish

# Then in the SVN checkout:
cd /path/to/tika-site
svn add publish/docs publish/_ --force
svn commit -m "Publish 4.0.0-SNAPSHOT docs"
----

This builds the Antora site, stamps the git commit on the home page, and copies
the output to the site with the correct directory layout:

* `publish/docs/4.0.0-SNAPSHOT/` -- the documentation pages
* `publish/_/` -- CSS, JS, fonts (shared across versions)
* `publish/docs/index.html` -- redirect to latest version

=== Publishing a Release

When releasing a new version (e.g., 4.0.0):

@@ -116,14 +150,13 @@ sed -i "s/4.0.0-SNAPSHOT/4.0.0/" docs/antora.yml
git commit -am "Set docs version to 4.0.0"
git push origin docs/4.0.0

# 4. Build the site
# 4. Build and publish
cd docs
mvn antora:antora
./build-docs.sh --publish /path/to/tika-site/publish

# 5. Publish to SVN
cp -r target/site/* ~/tika-site/4.x/
cd ~/tika-site
svn add 4.x --force
# 5. Commit to SVN
cd /path/to/tika-site
svn add publish/docs publish/_ --force
svn commit -m "Publish 4.0.0 docs"
----

@@ -145,9 +178,8 @@ git push origin docs/4.0.0

# 4. Rebuild and republish
cd docs
mvn antora:antora
cp -r target/site/* ~/tika-site/4.x/
cd ~/tika-site
./build-docs.sh --publish /path/to/tika-site/publish
cd /path/to/tika-site
svn commit -m "Update 4.0.0 docs"
----

72 changes: 67 additions & 5 deletions docs/modules/ROOT/pages/migration-to-4x/migrating-to-4x.adoc
@@ -120,10 +120,16 @@ WARNING: The configuration options for `PDFParser` and `TesseractOCRParser` have changed
significantly in 4.x. The automatic converter will migrate your parameter names, but you
should review the updated documentation to ensure your configuration is optimal.

See:
See the xref:configuration/index.adoc[Configuration] section for full details, including:

* xref:configuration/parsers/pdf-parser.adoc[PDFParser Configuration] - Updated options for PDF parsing
* xref:configuration/parsers/tesseract-ocr-parser.adoc[TesseractOCRParser Configuration] - Updated OCR options
* xref:configuration/parsers/pdf-parser.adoc[PDFParser Configuration]
* xref:configuration/parsers/tesseract-ocr-parser.adoc[TesseractOCRParser Configuration]
* xref:configuration/parsers/tess4j-parser.adoc[Tess4J OCR (In-Process) Configuration]
* xref:configuration/parsers/vlm-parsers.adoc[VLM Parsers (Claude, Gemini, OpenAI)]
* xref:configuration/parsers/external-parser.adoc[External Parser (ffmpeg, exiftool, etc.)]

For the general serialization model and how JSON configuration works, see
xref:developers/serialization.adoc[Serialization and Configuration].

=== Full Configuration Example

@@ -147,8 +153,64 @@ a full table of changes and code migration examples.

== API Changes

// TODO: Document API changes
=== TikaConfig replaced by TikaLoader

`TikaConfig` has been removed. Use `TikaLoader` from `tika-serialization` instead.

**3.x:**
[source,java]
----
TikaConfig config = new TikaConfig(getClass().getClassLoader());
Parser parser = config.getParser();
Detector detector = config.getDetector();
AutoDetectParser autoDetect = new AutoDetectParser(config);
----

**4.x:**
[source,java]
----
// Default configuration (SPI-discovered components)
TikaLoader loader = TikaLoader.loadDefault(getClass().getClassLoader());

// Or from a JSON config file
TikaLoader loader = TikaLoader.load(Path.of("tika-config.json"));

// Access components
Parser parser = loader.loadParsers();
Detector detector = loader.loadDetectors();
Parser autoDetect = loader.loadAutoDetectParser();
ParseContext context = loader.loadParseContext();
----

NOTE: `TikaLoader` is in the `tika-serialization` module. Add `tika-serialization`
as a dependency if you were previously only depending on `tika-core`.
See xref:developers/serialization.adoc[Serialization and Configuration] for
the full `TikaLoader` API.

For simple use cases, the `Tika` facade and `DefaultParser` still work without
`TikaLoader`:

[source,java]
----
// Simple facade (unchanged from 3.x)
Tika tika = new Tika();
String text = tika.parseToString(file);

// Direct parser use (unchanged from 3.x)
Parser parser = new DefaultParser();
----

=== ExternalParser

The legacy `ExternalParser` and `CompositeExternalParser` have been removed.
External parsers must now be explicitly configured via JSON. See
xref:configuration/parsers/external-parser.adoc[External Parser Configuration]
for details.

== Deprecations and Removals

// TODO: Document deprecated and removed features
* `TikaConfig` -- replaced by `TikaLoader`
* `CompositeExternalParser` -- external parsers now require explicit JSON configuration
* `ExternalParsersFactory` and XML-based external parser auto-discovery
* DOM-based OOXML extractors (`XWPFWordExtractorDecorator`, `XSLFPowerPointExtractorDecorator`)
-- SAX-based extractors are now the only implementation
2 changes: 1 addition & 1 deletion docs/modules/ROOT/pages/pipes/unpack-config.adoc
@@ -175,7 +175,7 @@ This limits extraction to 100MB total.

== Key Base Strategies

`DEFAULT`:: Output key is `{containerKey}-{embeddedIdPrefix}{id}{suffix}`
`DEFAULT`:: Output key is `\{containerKey}-\{embeddedIdPrefix}\{id}\{suffix}`
`CUSTOM`:: Output key uses `emitKeyBase` as the prefix.
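
To make the `DEFAULT` pattern concrete, a sketch with hypothetical values --
none of these are Tika's actual defaults, only an illustration of how the four
parts concatenate:

```java
// Hypothetical inputs illustrating the DEFAULT key pattern
// {containerKey}-{embeddedIdPrefix}{id}{suffix}
String containerKey = "inputs/report.pdf"; // key of the container document
String embeddedIdPrefix = "embedded-";     // illustrative prefix
int id = 3;                                // embedded-document counter
String suffix = ".bin";                    // illustrative suffix

String key = containerKey + "-" + embeddedIdPrefix + id + suffix;
// -> "inputs/report.pdf-embedded-3.bin"
```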

== Safety Limits