# How to Parse a Multilingual PDF File (Chinese + English) Using Python 3 and Tika

Main Reference:  https://github.com/chrismattmann/tika-python

## Installing Tika

You can run the following command: `pip install tika`
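Tika-python drives the Apache Tika server, which is a Java program, so a Java runtime has to be available on the machine. The check below is a minimal sketch, assuming the runtime is exposed as a `java` executable on `PATH`; the helper name `java_available` is hypothetical and not part of Tika.

```python
import shutil


def java_available(executable="java"):
    """Return True if the given executable can be found on PATH."""
    return shutil.which(executable) is not None


# Report whether a Java runtime appears to be installed.
print("Java found on PATH:", java_available())
```

If this prints `False`, installing a Java runtime before parsing is likely needed.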

In [15]:
# !pip install tika

Collecting tika
  Downloading https://files.pythonhosted.org/packages/10/75/b566e446ffcf292f10c8d84c15a3d91615fe3d7ca8072a17c949d4e84b66/tika-1.19.tar.gz
Collecting requests (from tika)
  Downloading https://files.pythonhosted.org/packages/51/bd/23c926cd341ea6b7dd0b2a00aba99ae0f828be89d72b2190f27c11d4b7fb/requests-2.22.0-py2.py3-none-any.whl (57kB)
Collecting idna<2.9,>=2.5 (from requests->tika)
  Downloading https://files.pythonhosted.org/packages/14/2c/cd551d81dbe15200be1cf41cd03869a46fe7226e7450af7a6545bfc474c9/idna-2.8-py2.py3-none-any.whl (58kB)
Collecting chardet<3.1.0,>=3.0.2 (from requests->tika)
  Downloading https://files.pythonhosted.org/packages/bc/a9/01ffebfb562e4274b6487b4bb1ddec7ca55ec7510b22e4c51f14098443b8/chardet-3.0.4-py2.py3-none-any.whl (133kB)

## Set the PDF file path

The example PDF is included in this Git repository.

In [None]:
pdfFile = "How to Extract Words from PDFs with Python.pdf"

Initialize Tika and the Tika parser as described in the documentation.

In [17]:
import tika
tika.initVM()
from tika import parser

Start parsing the PDF file and print its metadata and content.

In [18]:
parsed = parser.from_file(pdfFile)
print(parsed["metadata"])
print(parsed["content"])

2019-07-05 04:24:09,793 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-server-1.19.jar to /tmp/tika-server.jar.
2019-07-05 04:24:41,711 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-server-1.19.jar.md5 to /tmp/tika-server.jar.md5.
2019-07-05 04:24:42,380 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...


{'Author': 'ElleryL', 'Content-Type': 'application/pdf', 'Creation-Date': '2019-07-05T03:40:18Z', 'Last-Modified': '2019-07-05T03:40:18Z', 'Last-Save-Date': '2019-07-05T03:40:18Z', 'X-Parsed-By': ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.pdf.PDFParser'], 'X-TIKA:parse_time_millis': '1306', 'access_permission:assemble_document': 'true', 'access_permission:can_modify': 'true', 'access_permission:can_print': 'true', 'access_permission:can_print_degraded': 'true', 'access_permission:extract_content': 'true', 'access_permission:extract_for_accessibility': 'true', 'access_permission:fill_in_form': 'true', 'access_permission:modify_annotations': 'true', 'created': '2019-07-05T03:40:18Z', 'creator': 'ElleryL', 'date': '2019-07-05T03:40:18Z', 'dc:creator': 'ElleryL', 'dc:format': 'application/pdf; version=1.5', 'dc:language': 'zh-HK', 'dcterms:created': '2019-07-05T03:40:18Z', 'dcterms:modified': '2019-07-05T03:40:18Z', 'language': 'zh-HK', 'meta:author': 'ElleryL', 'meta
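The parse result is a plain dictionary, so individual fields can be read defensively. The sketch below uses a small stand-in dictionary instead of the real `parsed` result so that it runs without Tika. The helper name `describe_parsed` is hypothetical, and the guard for a missing `content` value is an assumption about PDFs that yield no extractable text.

```python
def describe_parsed(parsed):
    """Return (language, text) from a Tika-style result dictionary."""
    metadata = parsed.get("metadata") or {}
    # Assumption: "content" may be None when no text could be extracted.
    text = (parsed.get("content") or "").strip()
    language = metadata.get("language", "unknown")
    return language, text


# Stand-in for the real `parsed` result, shaped like the output above.
sample = {
    "metadata": {"language": "zh-HK", "Content-Type": "application/pdf"},
    "content": "\n\nHow to Extract Words from PDFs \n\nwith Python \n",
}

language, text = describe_parsed(sample)
print(language)
print(text[:40])
```

With the real result, the call would be `describe_parsed(parsed)`.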

Let's print the whole result as formatted JSON:

In [22]:
import json
print(json.dumps(parsed, indent=4))

{
    "status": 200,
    "content": "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nHow to Extract Words from PDFs \n\nwith Python \n\nAs I mentioned in my previous article: How to Connect to \n\nGoogle Sheets with Python, I\u2019ve been working with a client \n\nto help them parse through hundreds of PDF files to extract \n\nkeywords in order to make them searchable. \n\nPart of solving the problem was figuring out how to extract \n\ntextual data from all these PDF files. You might be \n\nsurprised to learn that it\u2019s not that simple. You see, PDFs \n\nare a proprietary format by Adobe that come with their own \n\nlittle quirks when it comes to automating the process of \n\nextracting information from each file. \n\nAdobe\uff1a\u5275\u610f\u3001\u884c\u92b7\u548c\u6587\u4ef6\u7ba1\u7406\u89e3\u6c7a\u65b9\u6848 \n\nhttps://www.adobe.com/hk_zh/ \n\n \n\nAdobe \u6b63\u5728\u900f\u904e\u6578\u4f4d\u9ad4\u9a57\u6539\u8b8a\u4e16\u754c\u3002\u6211
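By default, `json.dumps` escapes non-ASCII characters, which is why the Chinese text in the dump above appears as `\uXXXX` sequences. Passing `ensure_ascii=False` keeps the characters readable. The sketch below uses a short stand-in dictionary, with one phrase copied from the output above, rather than the full `parsed` result.

```python
import json

# Stand-in for `parsed`; the escapes below decode to the Chinese phrase
# that follows "Adobe" in the extracted content.
sample = {
    "status": 200,
    "content": "Adobe \u6b63\u5728\u900f\u904e\u6578\u4f4d\u9ad4\u9a57\u6539\u8b8a\u4e16\u754c",
}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(sample, indent=4)

# With ensure_ascii=False the Chinese characters are written as-is.
readable = json.dumps(sample, indent=4, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode to the same data; only the presentation differs.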

### The PDF pages as images, for easy reference:

![Page 1](001.jpg)
![Page 2](002.jpg)

### Points to note:

1. As you can see, each hyperlink's URL is placed after the text.
2. The output does not show which link belongs to which text.
3. Chinese text is displayed correctly.
4. Line breaks are mostly respected, but not all of them.
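The points above suggest some light post-processing of the extracted text. The following is a minimal sketch using only the standard library: it collapses runs of blank lines and collects URLs with a simple regular expression. The function names `tidy_text` and `find_urls` are hypothetical, the URL pattern is a rough heuristic, and the sample string is a shortened stand-in for `parsed["content"]`.

```python
import re

# Heuristic: a URL is "http://" or "https://" followed by non-space characters.
URL_PATTERN = re.compile(r"https?://\S+")


def tidy_text(content):
    """Strip trailing spaces per line and collapse blank-line runs into one."""
    lines = [line.rstrip() for line in content.splitlines()]
    text = "\n".join(lines)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def find_urls(content):
    """Return all URL-like substrings in the order they appear."""
    return URL_PATTERN.findall(content)


# Shortened stand-in for the extracted content shown earlier.
sample = (
    "\n\n\nHow to Extract Words from PDFs \n\nwith Python \n\n"
    "https://www.adobe.com/hk_zh/ \n\n \n\nAdobe"
)

print(tidy_text(sample))
print(find_urls(sample))
```

This does not recover which text a link was attached to; it only lists the URLs that appear in the extracted content.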