### Apache-Tika

 - [Docker Apache-Tika with Python](https://medium.com/analytics-vidhya/data-extraction-from-pdf-documents-using-apache-tika-and-python-b56e4bc79245)
 - [python-tika package](https://medium.com/@justinboylantoomey/fast-text-extraction-with-python-and-tika-41ac34b0fe61)
   - [and another](https://cbrownley.wordpress.com/2016/06/26/parsing-pdfs-in-python-with-tika/)
 - [configure python-tika](https://stackoverflow.com/a/48153647/4538066) to work with local tika server

In [1]:
import requests
import os

**envs**

In [100]:
# notebook settings: Tika server URL, working directory, sample file names
tika = "http://localhost:9998/"
path = os.getcwd()
files = os.listdir(os.path.join(path,'../pdf-samples'))

print(files)

['normal.pdf', 'scanned.pdf']


In [101]:
path

'C:\\Github\\training.data_eng\\apache-tika\\notebooks'

In [102]:
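# NOTE: `files` was listed from ../pdf-samples, but this joins onto `path` (the notebooks directory)
test_file = os.path.join(path,files[0])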
test_file

'C:\\Github\\training.data_eng\\apache-tika\\notebooks\\normal.pdf'
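The listing above reads `../pdf-samples`, while `test_file` is joined onto the notebooks directory, so the resulting path does not point at the sample itself. A minimal sketch of one way to keep the directory attached to each file name; `list_samples` is a hypothetical helper, not part of the original notebook:

```python
from pathlib import Path


def list_samples(directory, suffix=".pdf"):
    """Return full paths of the files in `directory` whose extension is `suffix`.

    Hypothetical helper: returning whole paths (not bare names) means the
    directory cannot get lost when a path is built later on.
    """
    folder = Path(directory)
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == suffix
    )
```

Calling `list_samples(os.path.join(path, '../pdf-samples'))` would then give paths that can be opened directly.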

### tika details

In [103]:
# connect to Tika
print(requests.get(tika+'tika').text)

This is Tika Server (Apache Tika 1.22). Please PUT



In [104]:
# show detectors 
print(requests.get(tika+'detectors', headers={'Accept': 'application/json'}).text)

{"children":[{"composite":false,"name":"org.apache.tika.detect.OverrideDetector"},{"composite":false,"name":"org.apache.tika.parser.microsoft.POIFSContainerDetector"},{"composite":false,"name":"org.apache.tika.parser.pkg.ZipContainerDetector"},{"composite":false,"name":"org.gagravarr.tika.OggDetector"},{"composite":false,"name":"org.apache.tika.mime.MimeTypes"}],"composite":true,"name":"org.apache.tika.detect.DefaultDetector"}
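The detector response is a small tree: a composite detector with a `children` list. A sketch of flattening it into a list of class names, using a shortened copy of the JSON shown above (`detector_names` is an illustrative name):

```python
import json


def detector_names(node):
    """Collect the `name` of a detector node and of all its children, depth first."""
    names = [node["name"]]
    for child in node.get("children", []):
        names.extend(detector_names(child))
    return names


# Shortened copy of the response printed above (two of the five children).
sample = json.loads(
    '{"children":['
    '{"composite":false,"name":"org.apache.tika.detect.OverrideDetector"},'
    '{"composite":false,"name":"org.apache.tika.mime.MimeTypes"}],'
    '"composite":true,"name":"org.apache.tika.detect.DefaultDetector"}'
)

for name in detector_names(sample):
    print(name)
```

With a live server, `detector_names(response.json())` would work the same way on the full response.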


### [using Docker tika](https://cwiki.apache.org/confluence/display/tika/TikaJAXRS#TikaJAXRS-UsingprebuiltDockerimage)

All services that take files use HTTP "PUT" requests. When "PUT" is used, the original file must be sent in the request body without any additional encoding (do not use multipart/form-data or other containers).

Additionally, TikaResource, Metadata and RecursiveMetadata Services accept POST multipart/form-data requests, where the original file is sent as a single attachment.

Information services (e.g. defined mimetypes, defined parsers, etc.) work with HTTP "GET" requests.

You may optionally specify the content type in the "Content-Type" header. If you do not specify a MIME type, Tika will use its detectors to guess it.
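A sketch of building the request headers with an optional "Content-Type" hint. It assumes the hint is guessed from the file extension with the standard-library `mimetypes` module; `tika_headers` is a hypothetical helper. When no type can be guessed, the header is left out so Tika's own detectors decide.

```python
import mimetypes


def tika_headers(filename, accept="application/json"):
    """Build request headers, adding Content-Type only when it can be guessed."""
    headers = {"Accept": accept}
    guessed, _encoding = mimetypes.guess_type(filename)
    if guessed is not None:
        # A hint only; omitting it lets Tika detect the type itself.
        headers["Content-Type"] = guessed
    return headers


print(tika_headers("scanned.pdf"))
```

The returned dict can be passed as `headers=` to `requests.put`, with the file's raw bytes as `data`.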

You may specify an additional identifier in the URL after the resource name, like "/tika/my-file-i-sent-to-tika-resource" for the "/tika" resource. Tika Server uses this name only for logging, so you may put a file name, UUID, or any other identifier there (do not forget to URL-encode any special characters).
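A sketch of appending a URL-encoded identifier to a resource URL with `urllib.parse.quote`; `resource_url` and the example file name are illustrative only.

```python
from urllib.parse import quote


def resource_url(base, resource, identifier=None):
    """Join base URL and resource name, optionally adding an encoded identifier."""
    url = base.rstrip("/") + "/" + resource.strip("/")
    if identifier:
        # safe="" also encodes "/" so the identifier stays a single path segment.
        url += "/" + quote(identifier, safe="")
    return url


print(resource_url("http://localhost:9998/", "tika", "my report (v2).pdf"))
```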

Resources may return the following HTTP codes:

    200 OK - request completed successfully
    204 No Content - request completed successfully, result is empty
    422 Unprocessable Entity - unsupported mime-type, encrypted document, etc.
    500 Error - error while processing document
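The status codes listed above can be turned into a small dispatch. This is a sketch only: `TikaError` and `interpret` are hypothetical names, and any code outside the documented list is treated as unexpected.

```python
class TikaError(Exception):
    """Raised when the server reports that a document could not be handled."""


def interpret(status_code, body):
    """Map a documented status code to a result or an exception."""
    if status_code == 200:
        return body          # request completed, body holds the result
    if status_code == 204:
        return ""            # request completed, but the result is empty
    if status_code == 422:
        raise TikaError("unsupported mime-type or encrypted document")
    if status_code == 500:
        raise TikaError("error while processing document")
    raise TikaError(f"unexpected status {status_code}")
```

With `requests`, this could be called as `interpret(response.status_code, response.text)`.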

In [105]:
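# NOTE: not a raw string, so '\t' and '\a' are read as tab and bell characters
# (visible in the path reprs further down); r'...' would keep the backslashes literal
dpath = 'C:\Github\training.data_eng\apache-tika\pdf-samples'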

In [106]:
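# get file metadata
# NOTE: the second positional argument of requests.put is `data`; a str is sent verbatim,
# so this uploads the path text, not the PDF bytes (consistent with text/plain in the result)
response = requests.put(tika+'meta',
                        os.path.join(dpath,files[1]),
                        headers={'Accept': 'application/json'}
                       )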

print(f'{response.status_code} : {response.reason}')
response.json()

200 : OK


{'Content-Encoding': 'ISO-8859-1',
 'Content-Type': 'text/plain; charset=ISO-8859-1',
 'X-Parsed-By': ['org.apache.tika.parser.DefaultParser',
  'org.apache.tika.parser.csv.TextAndCSVParser'],
 'language': 'en'}
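The `text/plain` content type and `TextAndCSVParser` in this result are what one would expect if the request body was the path string rather than the PDF itself. A sketch of reading the file's bytes and sanity-checking them before upload; `load_pdf_bytes` is a hypothetical helper, and it assumes the `%PDF-` header sits at the very start of the file, which is the usual case.

```python
from pathlib import Path

PDF_MAGIC = b"%PDF-"


def load_pdf_bytes(path):
    """Read a file in binary mode and check that it starts like a PDF."""
    payload = Path(path).read_bytes()
    if not payload.startswith(PDF_MAGIC):
        # Guards against uploading something that is not the document itself.
        raise ValueError(f"{path} does not start with a PDF header")
    return payload
```

The bytes returned here can be passed as `data=` to `requests.put(tika + 'meta', ...)`, so that Tika receives the document rather than its file name.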

In [116]:
# extract the document text (tika/main resource)
response = requests.put(tika+'tika/main',
                        os.path.join(dpath,files[1]),
                        #headers={'Accept': 'text/plain'}
                       )

print(f'{response.status_code} : {response.reason}')

200 : OK


In [117]:
response.text

''

In [110]:
dp = os.path.join(dpath,files[1])  # os.path.join already returns a str
dp
#!curl -T "{dp}" {tika}meta

'C:\\Github\training.data_eng\x07pache-tika\\pdf-samples\\scanned.pdf'

In [94]:
os.path.join(dpath,files[1])

'C:\\Github\training.data_eng\x07pache-tika\\pdf-samples\\scanned.pdf'
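The `\t` and `\x07` in the repr above come from writing a Windows path as an ordinary string literal, where `\t` and `\a` are escape sequences. A short sketch of the difference, using a shortened, made-up path so that only those two escapes are involved; `PureWindowsPath` is used so the example behaves the same on any operating system.

```python
from pathlib import PureWindowsPath

# Ordinary literal: '\t' becomes a tab and '\a' a bell character.
plain = 'C:\training\apache-tika'

# Raw literal: every backslash is kept as written.
raw = r'C:\training\apache-tika'

print(repr(plain))
print(repr(raw))

# Building further paths from the raw string keeps the separators intact.
sample = PureWindowsPath(raw) / 'pdf-samples' / 'scanned.pdf'
print(sample)
```

Forward slashes (`'C:/training/apache-tika'`) are another option that Python accepts for Windows paths.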