
bitsgalore/tikadetect


About

Simple demo script that shows how the Apache Tika API can be called from Python to do mime type detection. The Java API is accessed through PyJnius.

Adapted from:

http://www.hackzine.org/using-apache-tika-from-python-with-jnius.html

Note that this is not intended as a production-ready tool! My main reason for writing it was to get more familiar with PyJnius and the Tika API. So far I've only managed to test it using Python 2.7 running under Linux Mint. Other platforms may work ... or maybe not!

##Installation of PyJnius and its dependencies

This script uses PyJnius for accessing the Tika Java classes. I found several guides on the installation of PyJnius and its dependencies, and none of them quite worked for me (python-dev in particular isn't explicitly mentioned anywhere). After some experimentation the following did the trick for me under Linux Mint (I haven't tried under Windows yet):

###Step 1: install Cython

sudo apt-get install cython

###Step 2: install python-dev

sudo apt-get install python-dev

###Step 3: clone & install pyjnius

git clone https://github.com/kivy/pyjnius.git
cd pyjnius
sudo python setup.py install

###Step 4: download & install Apache Tika

Download the latest runnable jar from:

https://tika.apache.org/download.html

Then save it wherever you prefer.

Done!

##Configuration

Open config.py in a text editor and update tikaJar to the location of the Tika JAR on your system (see above).
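As a rough illustration, config.py might look like the sketch below. Only the variable name tikaJar comes from this README; the path is a hypothetical placeholder, and the jar_is_usable helper is an assumption added here as a quick sanity check, not part of the script.

```python
import os

# Hypothetical location: replace with wherever you saved the Tika JAR.
tikaJar = "/home/user/tika/tika-app.jar"


def jar_is_usable(path):
    """Return True if path looks like a JAR and points to an existing file.

    Illustrative helper only; the actual script may not perform this check.
    """
    return path.lower().endswith(".jar") and os.path.isfile(path)
```

Running jar_is_usable(tikaJar) before starting the scan gives an early, readable failure instead of a Java class-loading error later on.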

##Command line use

###Usage

python tikadetect.py [-h] [--magiconly] directory

This results in a recursive scan of directory and all its subdirectories. Output is written to stdout, using the following format for each analysed file:

/path/to/file.ext: mimetype
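The recursive scan and the output format can be sketched as follows. This is a minimal illustration, not the script's actual code: the names scan, format_line and guess_by_extension are hypothetical, and guess_by_extension is a standard-library stand-in (extension-based only) used here so the sketch runs without Tika.

```python
import mimetypes
import os


def scan(directory, detect):
    """Yield (path, mimetype) for every file under directory, recursively.

    detect is any callable that maps a file path to a mime type string.
    """
    for root, _dirs, files in os.walk(directory):
        for name in sorted(files):
            path = os.path.join(root, name)
            yield path, detect(path)


def format_line(path, mimetype):
    """Format one result as '/path/to/file.ext: mimetype'."""
    return "{}: {}".format(path, mimetype)


def guess_by_extension(path):
    """Stand-in detector based on the filename extension only (NOT Tika)."""
    return mimetypes.guess_type(path)[0] or "application/octet-stream"
```

A loop such as `for path, mimetype in scan(".", guess_by_extension): print(format_line(path, mimetype))` then writes one line per file to stdout.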

###Positional arguments

directory: directory that will be analysed

###Optional arguments

-h, --help: show help message and exit

--magiconly: establish mimetype from magic bytes only (ignoring filename extension)
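A command line interface with these arguments could be defined with argparse roughly as below. The argument names and help texts follow this README; the function name build_parser and the description string are assumptions, and the real script may be organised differently.

```python
import argparse


def build_parser():
    """Build a parser matching: tikadetect.py [-h] [--magiconly] directory."""
    parser = argparse.ArgumentParser(
        prog="tikadetect.py",
        description="Mime type detection using Apache Tika")
    parser.add_argument(
        "directory",
        help="directory that will be analysed")
    parser.add_argument(
        "--magiconly",
        action="store_true",
        help="establish mimetype from magic bytes only "
             "(ignoring filename extension)")
    return parser
```

For example, `build_parser().parse_args(["--magiconly", "/some/dir"])` yields an object whose magiconly attribute is True and whose directory attribute is "/some/dir". The -h/--help option is added by argparse automatically.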

Note that by default mimetype detection is done using a combination of magic bytes and filename extensions (the latter can be disabled using the --magiconly switch).

##Documentation of Tika methods

See this link (describes Tika 1.5), and have a look at the detect methods (which are called in the script):

https://tika.apache.org/1.5/api/org/apache/tika/Tika.html
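To give an idea of how the detect methods can be reached through PyJnius, here is a minimal sketch. It assumes PyJnius, a Java runtime and the Tika JAR are available, and that passing a java.io.File uses both content and filename while passing an InputStream uses content only, as the linked Tika documentation describes. The function name make_detector is hypothetical, and this is not necessarily how the script itself is written.

```python
import os


def make_detector(tika_jar, magic_only=False):
    """Return a function that maps a file path to a mime type string."""
    # The classpath has to be set before jnius is imported for the first
    # time, because the JVM is started at that point.
    os.environ["CLASSPATH"] = tika_jar
    from jnius import autoclass

    Tika = autoclass("org.apache.tika.Tika")
    File = autoclass("java.io.File")
    FileInputStream = autoclass("java.io.FileInputStream")
    tika = Tika()

    def detect(path):
        if not magic_only:
            # detect(File): magic bytes combined with the file name
            return tika.detect(File(path))
        # detect(InputStream): magic bytes only, no file name available
        stream = FileInputStream(path)
        try:
            return tika.detect(stream)
        finally:
            stream.close()

    return detect
```

Calling `make_detector("/path/to/tika-app.jar")` (a placeholder path) returns a function that can be applied to each file found during the scan; passing `magic_only=True` mirrors the --magiconly switch.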
