Mirror of Apache Tika
tika-app TIKA-2687 -- remove code used to generate test files Jul 13, 2018
tika-batch TIKA-2692 -- minimal upgrades to allow building w Java 11-ea Jul 26, 2018
tika-bundle TIKA-2710 - Change Tika OSGi Execution Environment to 1.8 Aug 18, 2018
tika-core improve xml reading Aug 3, 2018
tika-deployment Update snapcraft.yaml Aug 16, 2017
tika-dl TIKA-2672 -- remove hard coded input dimensions Aug 14, 2018
tika-dotnet Prep pom.xmls for release - remove all SCM tags except for tika-paren… Jan 25, 2016
tika-eval TIKA-2695 -- upgrade Lucene to something more modern Aug 9, 2018
tika-example TIKA-2695 -- upgrade Lucene to something more modern Aug 9, 2018
tika-java7 TIKA-2535 -- fix test in tika-java7 to handle new sis StoreTypeDetector Jan 23, 2018
tika-langdetect TIKA-2660 -- enable building with Java 10 Jun 14, 2018
tika-nlp TIKA-2692 -- minimal upgrades to allow building w Java 11-ea Jul 26, 2018
tika-parent TIKA-2707 -- upgrade to commons-compress 1.18 Aug 16, 2018
tika-parsers TIKA-2667 upgrade jmatio Aug 14, 2018
tika-serialization TIKA-2662 add a streaming writer for the RecursiveParserWrapper Jun 7, 2018
tika-server TIKA-2692 -- minimal upgrades to pass ossindex-maven module -- except… Jul 26, 2018
tika-translate TIKA-2634 upgrade jackson to 2.9.5 Apr 19, 2018
tika-xmp TIKA-1974 -- remove deprecated metadata properties/keys for Tika 2.0 Jan 26, 2018
.gitattributes TIKA-431: Tika currently misuses the HTTP Content-Encoding header, an… Jul 8, 2012
.gitignore Ignore vim temp files Mar 13, 2018
CHANGES.txt Add info about TIKA-2683 fix Jul 18, 2018
HEADER.txt Add svn:eol-style Oct 2, 2009
KEYS update key with signatures May 30, 2017
LICENSE.txt TIKA-2341 -- upgrade commons-compress to 1.14, added capabilities for… Jun 1, 2017
NOTICE.txt added reference to PRONOM / TNA and the Open Government License to NO… Sep 16, 2017
README.md Note on Java 7, and suggest new users just download the binaries Jan 23, 2018
assembly.xml not sure why pom.xml.releaseBackup files are now included after last … Jul 8, 2017
pom.xml TIKA-2600 -- remove md5 checksum, and switch sha-1 to sha-512 for rel… Mar 7, 2018

README.md

Welcome to Apache Tika http://tika.apache.org/

Apache Tika(TM) is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.

Tika is a project of the Apache Software Foundation.

Apache Tika, Tika, Apache, the Apache feather logo, and the Apache Tika project logo are trademarks of The Apache Software Foundation.

Getting Started

Pre-built binaries of the Apache Tika standalone applications are available from http://tika.apache.org/download.html. Pre-built binaries of all the Tika jars can be fetched from Maven Central or your favourite Maven mirror.

Tika is based on Java 7 and uses the Maven 3 build system. To build Tika from source, use the following command in this directory:

mvn clean install

The build consists of a number of components, including a standalone runnable jar that you can use to try out Tika features. You can run it like this:

java -jar tika-app/target/tika-app-*.jar --help
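The wildcard in the command above is expanded by the shell. If you launch the jar from Java instead, there is no shell to do that, so the jar has to be located explicitly. The following is a minimal sketch of one way to do it, not part of Tika itself; the class name TikaAppLauncher and its two helper methods are hypothetical names chosen for illustration, and it assumes it is run from the top-level source directory after a successful build.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Tika: finds the built tika-app jar and runs it.
public class TikaAppLauncher {

    /** Returns the files in targetDir whose names match tika-app-*.jar, sorted by path. */
    public static List<Path> findAppJars(Path targetDir) throws IOException {
        List<Path> jars = new ArrayList<>();
        // The glob plays the role of the shell wildcard in the command above.
        try (DirectoryStream<Path> stream =
                 Files.newDirectoryStream(targetDir, "tika-app-*.jar")) {
            for (Path jar : stream) {
                jars.add(jar);
            }
        }
        Collections.sort(jars);
        return jars;
    }

    /** Builds the argument list for: java -jar <jar> <args...> */
    public static List<String> buildCommand(Path jar, String... args) {
        List<String> command = new ArrayList<>(Arrays.asList("java", "-jar", jar.toString()));
        command.addAll(Arrays.asList(args));
        return command;
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        Path target = Paths.get("tika-app", "target");
        if (!Files.isDirectory(target)) {
            System.err.println("No " + target + " directory; run 'mvn clean install' first.");
            return;
        }
        List<Path> jars = findAppJars(target);
        if (jars.size() != 1) {
            // More than one match is ambiguous, so report it instead of guessing.
            System.err.println("Expected exactly one tika-app jar, found: " + jars);
            return;
        }
        // Run the jar with --help and pass its output straight through to this console.
        Process process = new ProcessBuilder(buildCommand(jars.get(0), "--help"))
                .inheritIO()
                .start();
        System.exit(process.waitFor());
    }
}
```

The sketch refuses to pick a jar when the pattern matches more than one file, since the build directory may contain several jars with the same prefix.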

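Contributing via GitHub

To contribute a patch, follow these instructions. Installing hub is not strictly required, but is recommended: the git fork and git pull-request commands in steps 9 and 11 are hub commands and assume that git is aliased to hub.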

0. Download and install hub from hub.github.com
1. File a JIRA issue for your fix at https://issues.apache.org/jira/browse/TIKA
- you will get an issue ID of the form TIKA-xxx, where xxx is the issue number.
2. git clone http://github.com/apache/tika.git
3. cd tika
4. git checkout -b TIKA-xxx
5. edit files
6. git status (make sure it shows what files you expected to edit)
7. git add <files>
8. git commit -m "fix for TIKA-xxx contributed by <your username>"
9. git fork
10. git push -u <your git username> TIKA-xxx
11. git pull-request
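
The placeholders in the steps above (TIKA-xxx, <files>, <your username>) all derive from three inputs: the JIRA issue number, your GitHub username, and the files you changed. As a rough illustration of how they fit together, the sketch below prints the command sequence with the placeholders filled in. The class TikaPatchCommands is a hypothetical helper, not part of Tika; it only builds strings and does not run git.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Tika: fills in the placeholders of the patch workflow.
public class TikaPatchCommands {

    /** Returns the workflow commands for the given issue number, username and files. */
    public static List<String> commands(String issueNumber, String githubUser, List<String> files) {
        if (!issueNumber.matches("\\d+")) {
            throw new IllegalArgumentException("Issue number must be digits only: " + issueNumber);
        }
        String branch = "TIKA-" + issueNumber;

        StringBuilder add = new StringBuilder("git add");
        for (String file : files) {
            add.append(' ').append(file);
        }

        List<String> cmds = new ArrayList<>();
        cmds.add("git clone http://github.com/apache/tika.git");
        cmds.add("cd tika");
        cmds.add("git checkout -b " + branch);
        cmds.add("git status");
        cmds.add(add.toString());
        cmds.add("git commit -m \"fix for " + branch + " contributed by " + githubUser + "\"");
        cmds.add("git fork");          // hub command: assumes git is aliased to hub
        cmds.add("git push -u " + githubUser + " " + branch);
        cmds.add("git pull-request");  // hub command: assumes git is aliased to hub
        return cmds;
    }

    public static void main(String[] args) {
        if (args.length < 3) {
            System.err.println("usage: TikaPatchCommands <issue number> <github user> <file>...");
            return;
        }
        List<String> files = Arrays.asList(args).subList(2, args.length);
        for (String cmd : commands(args[0], args[1], files)) {
            System.out.println(cmd);
        }
    }
}
```

The push in step 10 targets a remote named after your username because hub creates a remote with that name when it forks the repository.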

License (see also LICENSE.txt)

Collective work: Copyright 2011 The Apache Software Foundation.

Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to You under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Apache Tika includes a number of subcomponents with separate copyright notices and license terms. Your use of these subcomponents is subject to the terms and conditions of the licenses listed in the LICENSE.txt file.

Export control

This distribution includes cryptographic software. The country in which you currently reside may have restrictions on the import, possession, use, and/or re-export to another country, of encryption software. BEFORE using any encryption software, please check your country's laws, regulations and policies concerning the import, possession, or use, and re-export of encryption software, to see if this is permitted. See http://www.wassenaar.org/ for more information.

The U.S. Government Department of Commerce, Bureau of Industry and Security (BIS), has classified this software as Export Commodity Control Number (ECCN) 5D002.C.1, which includes information security software using or performing cryptographic functions with asymmetric algorithms. The form and manner of this Apache Software Foundation distribution makes it eligible for export under the License Exception ENC Technology Software Unrestricted (TSU) exception (see the BIS Export Administration Regulations, Section 740.13) for both object code and source code.

The following provides more details on the included cryptographic software:

Apache Tika uses the Bouncy Castle generic encryption libraries for extracting text content and metadata from encrypted PDF files. See http://www.bouncycastle.org/ for more details on Bouncy Castle.

Mailing Lists

Discussion about Tika takes place on the following mailing lists:

Notifications of all code changes are sent to the following mailing list:

The mailing lists are open to anyone and publicly archived.

You can subscribe to a mailing list by sending a message to [LIST]-subscribe@tika.apache.org (for example user-subscribe@...). To unsubscribe, send a message to [LIST]-unsubscribe@tika.apache.org. For more instructions, send a message to [LIST]-help@tika.apache.org.
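
The three control addresses follow one pattern: the list name, a hyphen, the action, then @tika.apache.org. The short sketch below spells that pattern out. The class TikaMailingLists and its Action enum are hypothetical names used only for illustration, and "user" is the list name taken from the example above.

```java
import java.util.Locale;

// Hypothetical helper, not part of Tika: builds mailing list control addresses.
public class TikaMailingLists {

    private static final String DOMAIN = "tika.apache.org";

    /** The three actions described in the text above. */
    public enum Action { SUBSCRIBE, UNSUBSCRIBE, HELP }

    /** Returns [LIST]-[action]@tika.apache.org for the given list name. */
    public static String controlAddress(String list, Action action) {
        if (list == null || list.isEmpty()) {
            throw new IllegalArgumentException("List name must not be empty");
        }
        return list + "-" + action.name().toLowerCase(Locale.ROOT) + "@" + DOMAIN;
    }

    public static void main(String[] args) {
        for (Action action : Action.values()) {
            System.out.println(controlAddress("user", action));
        }
    }
}
```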

Issue Tracker

If you encounter errors in Tika or want to suggest an improvement or a new feature, please visit the Tika issue tracker at https://issues.apache.org/jira/browse/TIKA. There you can also find the latest information on known issues, recent bug fixes and enhancements.