# Extracting table contents with Apache Tika

#### [Apache Tika](https://tika.apache.org/) is a Java-based content analysis toolkit.

> The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.

Tika is distributed as a runnable Java jar file (`tika-app`), so it needs a Java runtime to execute.

We can call Tika directly from the command line via [`subprocess`](https://docs.python.org/3/library/subprocess.html) in Python, capturing and parsing the output from `stdout` and `stderr`.

The Tika arguments we show here are:
* `-m`: print only the file metadata
* `-t`: print the contents of the file as plain text
* `-J`: print both metadata and contents of the file (and of any embedded files) as JSON
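
The flags above can be wrapped in a small helper so that each call only names the flags and the file. This is a minimal sketch, not part of Tika itself: `tika_command` and `run_tika` are hypothetical names, and the jar path is assumed to be the one used in the cells of this notebook.

```python
import subprocess

# Assumed location of the Tika app jar; adjust to your own setup.
TIKA_JAR = 'jars/tika-app-3.1.0.jar'


def tika_command(path, *flags, jar=TIKA_JAR):
    """Build the argument list for one Tika invocation (hypothetical helper)."""
    return ['java', '-jar', jar, *flags, path]


def run_tika(path, *flags, jar=TIKA_JAR):
    """Run Tika on `path` and return its stdout as text."""
    result = subprocess.run(
        tika_command(path, *flags, jar=jar),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError instead of returning empty output
    )
    return result.stdout
```

For example, `run_tika('samples/saintmarc-hd_20250213.pdf', '-m')` would issue the same command as the metadata cell for that file. Passing `check=True` is a design choice: a missing jar or a Java error then fails loudly rather than producing an empty string.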

Regular expressions can then be used to parse and post-process the output from the Tika executable.
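
As one illustration of such post-processing, the `key: value` lines that Tika prints for metadata can be collected into a dictionary. This is a sketch under the assumption that every metadata entry sits on one line and that the key ends at the first colon followed by a space; `parse_metadata` is a hypothetical helper name. Keys can repeat (for example `X-TIKA:Parsed-By`), so each key maps to a list.

```python
import re
from collections import defaultdict

# A key (which may itself contain colons, e.g. "pdf:charsPerPage"),
# then ": ", then the value. The lazy key match stops at the first ": ".
META_LINE = re.compile(r'^(\S+?): (.*)$')


def parse_metadata(text):
    """Parse Tika metadata output into a dict mapping each key to a list of values."""
    metadata = defaultdict(list)
    for line in text.splitlines():
        match = META_LINE.match(line)
        if match:
            key, value = match.groups()
            metadata[key].append(value.strip())
    return dict(metadata)


# Sample lines copied from the metadata output shown in this notebook.
sample = """Content-Length: 38429
Content-Type: application/pdf
X-TIKA:Parsed-By: org.apache.tika.parser.DefaultParser
X-TIKA:Parsed-By: org.apache.tika.parser.pdf.PDFParser
pdf:charsPerPage: 512
"""
parsed = parse_metadata(sample)
print(parsed['pdf:charsPerPage'])   # ['512']
print(len(parsed['X-TIKA:Parsed-By']))  # 2
```

Lines that do not fit the pattern, such as a value that wraps onto a second line, are skipped rather than guessed at.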

In [1]:
!ls -la samples

total 996
drwxr-xr-x 2 so_olliphant so_olliphant   4096 Apr  1 01:24 .
drwxr-xr-x 6 so_olliphant so_olliphant   4096 Apr  2 05:50 ..
-rw-r--r-- 1 so_olliphant so_olliphant 186261 Mar 29 01:59 Press_release_car_registrations_February_2025.pdf
-rw-r--r-- 1 so_olliphant so_olliphant 591185 Apr  1 01:24 Press_release_car_registrations_February_2025.pdf.png
-rw-r--r-- 1 so_olliphant so_olliphant  38429 Feb 13 06:53 saintmarc-hd_20250213.pdf
-rw-r--r-- 1 so_olliphant so_olliphant  78716 Apr  1 00:36 saintmarc-hd_20250213.pdf.png
-rw-r--r-- 1 so_olliphant so_olliphant  52238 Mar 29 02:00 saintmarc-hd_20250313.pdf
-rw-r--r-- 1 so_olliphant so_olliphant  53209 Apr  1 00:45 saintmarc-hd_20250313.pdf.png


----

In [2]:
import subprocess

### Saint-marc HD PDF for 2025-Jan monthly sales information (`saintmarc-hd_20250213.pdf`)

![Saint-marc HD PDF for 2025-Jan monthly sales information](samples/saintmarc-hd_20250213.pdf.png "Saint-marc HD PDF for 2025-Jan monthly sales information")

* Tika reports 512 characters of text in this PDF (`pdf:charsPerPage: 512`).
* Note that Tika only extracts the PDF text; all document formatting is generally lost.

In [3]:
# -m prints only the metadata of the PDF dated 2025-02-13
args = [
    'java',
    '-jar',
    'jars/tika-app-3.1.0.jar',
    '-m',
    'samples/saintmarc-hd_20250213.pdf'
]
result = subprocess.run(args, capture_output=True, text=True)

print("Saint-marc HD PDF for 2025-Feb\n")
print(result.stdout)
#print(result.stderr)

Saint-marc HD PDF for 2025-Feb

Content-Length: 38429
Content-Type: application/pdf
X-TIKA:Parsed-By: org.apache.tika.parser.DefaultParser
X-TIKA:Parsed-By: org.apache.tika.parser.pdf.PDFParser
X-TIKA:versionCount: 0
access_permission:assemble_document: true
access_permission:can_modify: true
access_permission:can_print: true
access_permission:can_print_faithful: true
access_permission:extract_content: true
access_permission:extract_for_accessibility: true
access_permission:fill_in_form: true
access_permission:modify_annotations: true
dc:creator: ãâ ä
dc:format: application/pdf; version=1.7
dc:title:!ò
ÅHP(	.xlsx
dcterms:created: 2025-02-01T01:02:28Z
dcterms:modified: 2025-02-01T01:02:28Z
pdf:PDFVersion: 1.7
pdf:charsPerPage: 512
pdf:docinfo:created: 2025-02-01T01:02:28Z
pdf:docinfo:creator: ãâ ä
pdf:docinfo:modified: 2025-02-01T01:02:28Z
pdf:docinfo:producer: Microsoft: Print To PDF
pdf:docinfo:title:!ò
ÅHP(	.xlsx
pdf:encrypted: false
pdf:eofOffsets: 38429
pdf:hasCollection: false
pdf

### Saint-marc HD PDF for 2025-Feb monthly sales information (`saintmarc-hd_20250313.pdf`)

![Saint-marc HD PDF for 2025-Feb monthly sales information](samples/saintmarc-hd_20250313.pdf.png "Saint-marc HD PDF for 2025-Feb monthly sales information")

* <span style="background-color:#aaffff;">_Text in this PDF is 0 chars!_</span>
* This PDF was created from an image, so it has no text layer and Tika cannot return any text (`pdf:charsPerPage: 0`).
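
The zero-character case can be detected from the metadata alone, before any attempt to parse the contents. This is a minimal sketch, assuming the metadata output carries one `pdf:charsPerPage` line per page (an assumption based on the outputs in this notebook); `total_chars` and `has_text_layer` are hypothetical helper names.

```python
import re

# Matches lines such as "pdf:charsPerPage: 512" in Tika's metadata output.
CHARS_PER_PAGE = re.compile(r'^pdf:charsPerPage: (\d+)$', re.MULTILINE)


def total_chars(metadata_text):
    """Sum the per-page character counts reported by Tika."""
    return sum(int(n) for n in CHARS_PER_PAGE.findall(metadata_text))


def has_text_layer(metadata_text):
    """Return True if Tika reported at least one extractable character."""
    return total_chars(metadata_text) > 0


# Excerpts modelled on the two Saint-marc HD metadata outputs.
text_pdf = "pdf:PDFVersion: 1.7\npdf:charsPerPage: 512\npdf:encrypted: false\n"
image_pdf = "pdf:PDFVersion: 1.7\npdf:charsPerPage: 0\npdf:encrypted: false\n"
print(has_text_layer(text_pdf), has_text_layer(image_pdf))  # True False
```

A check like this lets a pipeline route image-only PDFs to a different tool instead of silently storing empty text.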

In [4]:
# OK, now the metadata (-m) of the PDF dated 2025-03-13
args = [
    'java',
    '-jar',
    'jars/tika-app-3.1.0.jar',
    '-m',
    'samples/saintmarc-hd_20250313.pdf'
]
result = subprocess.run(args, capture_output=True, text=True)

print("Saint-marc HD PDF for 2025-Mar\n")
print(result.stdout)
#print(result.stderr)

Saint-marc HD PDF for 2025-Mar

Content-Length: 52238
Content-Type: application/pdf
X-TIKA:Parsed-By: org.apache.tika.parser.DefaultParser
X-TIKA:Parsed-By: org.apache.tika.parser.pdf.PDFParser
X-TIKA:versionCount: 0
access_permission:assemble_document: true
access_permission:can_modify: true
access_permission:can_print: true
access_permission:can_print_faithful: true
access_permission:extract_content: true
access_permission:extract_for_accessibility: true
access_permission:fill_in_form: true
access_permission:modify_annotations: true
dc:format: application/pdf; version=1.7
pdf:PDFVersion: 1.7
pdf:charsPerPage: 0
pdf:encrypted: false
pdf:eofOffsets: 52238
pdf:hasCollection: false
pdf:hasMarkedContent: false
pdf:hasXFA: false
pdf:hasXMP: false
pdf:incrementalUpdateCount: 0
pdf:unmappedUnicodeCharsPerPage: 0
resourceName: saintmarc-hd_20250313.pdf
xmpTPg:NPages: 1



----

### ACEA Press Release, 2025-Feb

![ACEA Press Release, 2025-Feb](samples/Press_release_car_registrations_February_2025.pdf.png "ACEA Press Release, 2025-Feb")

* In the table on page 3, in the row for the Romania data, notice that <span style="background-color:#aaffff;">_the 3 blank values for Plug-in Hybrid are missing entirely from the Tika-parsed content, with no placeholder left where they were!_</span>
* We request both the metadata and the plain-text contents as JSON (`-t -J`), then split the contents on `\n\n\n\n`, which separates the pages in this output.

In [5]:
# Multi-page PDF from ACEA, target table on page 3
args = [
    'java',
    '-jar',
    'jars/tika-app-3.1.0.jar',
    '-t',
    '-J',
    'samples/Press_release_car_registrations_February_2025.pdf'
]
result = subprocess.run(args, capture_output=True, text=True)

#print("ACEA PDF for 2025-Feb\n")
#print(result.stdout)
#print(result.stderr)

In [6]:
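import json

# -J emits a JSON list; the first entry describes the PDF itself
docs = json.loads(result.stdout)

content = docs[0]['X-TIKA:content']

# pages are separated by four newlines in this output
pages = content.strip().split('\n\n\n\n')
print(pages[2])  # page 3 holds the target table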

 

www.acea.auto         Page 3 of 6 
 

 

NEW CAR REGISTRATIONS BY MARKET AND POWER SOURCE  

MONTHLY2 

 
 

 
 
1 Includes full and mild hybrids 
2 Includes fuel-cell electric vehicles, natural gas vehicles, LPG, E85/ethanol, and other fuels 

February February % change February February % change February February % change February February % change February February % change February February % change February February % change

2025 2024 25/24 2025 2024 25/24 2025 2024 25/24 2025 2024 25/24 2025 2024 25/24 2025 2024 25/24 2025 2024 25/24

Austria 4,233 3,322 +27.4 1,613 1,335 +20.8 5,549 4,691 +18.3 0 0 5,736 6,527 -12.1 2,488 4,135 -39.8 19,619 20,010 -2.0

Belgium 13,040 9,385 +38.9 3,070 8,385 -63.4 5,383 4,282 +25.7 267 415 -35.7 17,280 18,918 -8.7 1,121 2,337 -52.0 40,161 43,722 -8.1

Bulgaria 126 122 +3.3 34 31 +9.7 105 73 +43.8 0 0 2,781 2,868 -3.0 348 510 -31.8 3,394 3,604 -5.8

Croatia 53 50 +6.0 140 94 +48.9 1,629 1,455 +12.0 101 110 -8.2 1,644 1,898 -13.4 678 923 -26.5
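
Rows with dropped cells can at least be flagged by counting values. This is a sketch, assuming a complete row carries 21 values (7 column groups of 2025, 2024 and % change, as the header lines of the output suggest); `parse_row` and `EXPECTED_VALUES` are hypothetical names, and the two sample rows are copied from the output above.

```python
import re

EXPECTED_VALUES = 7 * 3  # assumption: 7 column groups x (2025, 2024, % change)

# A numeric cell such as "4,233", "0", "+27.4" or "-2.0".
NUMBER = re.compile(r'^[+-]?\d[\d,]*(?:\.\d+)?$')


def parse_row(line):
    """Split a table row into (market, values); return None for non-row lines."""
    tokens = line.split()
    i = 0
    # Leading non-numeric tokens form the market name.
    while i < len(tokens) and not NUMBER.match(tokens[i]):
        i += 1
    name, values = ' '.join(tokens[:i]), tokens[i:]
    if not name or not values or not all(NUMBER.match(v) for v in values):
        return None
    return name, values


rows = [
    "Austria 4,233 3,322 +27.4 1,613 1,335 +20.8 5,549 4,691 +18.3 0 0 5,736 6,527 -12.1 2,488 4,135 -39.8 19,619 20,010 -2.0",
    "Belgium 13,040 9,385 +38.9 3,070 8,385 -63.4 5,383 4,282 +25.7 267 415 -35.7 17,280 18,918 -8.7 1,121 2,337 -52.0 40,161 43,722 -8.1",
]
report = []
for line in rows:
    market, values = parse_row(line)
    report.append((market, len(values), EXPECTED_VALUES - len(values)))
print(report)  # [('Austria', 20, 1), ('Belgium', 21, 0)]
```

A short count shows that a cell was dropped but not which column it belonged to, so the row still cannot be realigned from the text alone.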

----

## Conclusion

* Apache Tika is useful for pulling metadata and raw text out of many file types, but turning that text into structured data requires post-processing (regular expressions, etc.).
* However, Apache Tika alone is not always enough: the image-only Saint-marc HD PDF (`saintmarc-hd_20250313.pdf`) yields no text at all, and the ACEA 2025-Feb table loses its blank cells.