Permalink
Switch branches/tags
Nothing to show
Find file Copy path
1280 lines (1279 sloc) 50.8 KB
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Mining OCR PDFs - Using *pdftabextract* to liberate tabular data from scanned documents\n",
"\n",
"This is an example on how to use *pdftabextract* for data mining in scanned and OCR-processed documents with rather complex tables and/or few \"optical features\" like column or row borders, which often make it impossible to use like tools like [Tabula](http://tabula.technology/).\n",
"\n",
"In this example, only a table from a single page will extracted for demonstration purposes. For a full example that covers several pages, see the `catalog_30s.py` script. There are also some more demonstrations in the `examples` directory.\n",
"\n",
"The page that we will process looks like this:\n",
"\n",
"![PDF Excerpt p. 3](data/ALA1934_RR-excerpt.pdf-3_1.png)\n",
"\n",
"The page has been scanned and processed with Optical Character Recognition (OCR) software like *ABBYY FineReader* or *tesseract* and produced a \"sandwich\" PDF with the scanned document image and the recognized text boxes. Although some software, like FineReader allows to extract tables, this often fails and some more effort in order to liberate the data is necessary. We can use *pdftabextract* together with some other other tools for this.\n",
"\n",
"We will use a combination of the following tools in order to reach our goal:\n",
"\n",
"* the `pdftohtml` command from *poppler-utils* to extract the texts and scanned images from the PDF\n",
"* *pdf2xml-viewer* to inspect the text boxes and the generated table grid (more on that later)\n",
"* `pdftabextract` to write a script that estimates the positions of columns and rows, generates a table grid and fits the text boxes into this grid in order to extract the tabular data\n",
"\n",
"The following steps will be performed and described in detail:\n",
"\n",
"1. Extract the scanned page images and generate an XML with the OCR texts of the PDF with `pdftohtml`\n",
"2. View the text boxes and scanned pages with *pdf2xml-viewer*\n",
"3. Load the XML describing the pages and text boxes\n",
"4. Detect straight lines in the scanned pages, find out a possible page skew or rotation and fix it\n",
"5. Detect clusters of vertical lines for identifying the columns of a table\n",
"6. Find out the row positions of the table by analyzing the y-coordinates' distribution of text boxes\n",
"7. Create a grid of columns and lines\n",
"8. Match the text boxes into the grid and hence extract the tabular data in order to export it as Excel and CSV file"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Extract the scanned page images and generate an XML with the OCR texts of the PDF with `pdftohtml`\n",
"\n",
"The tool `pdftohtml` is part of the software package *[poppler-utils](https://poppler.freedesktop.org/)*. It's available on most Linux distributions and also for OSX via Homebrew or MacPorts. Using the argument `-xml`, it will generate an XML file in *pdf2xml* format with from a sandwich PDF:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Page-1\n",
"Page-2\n",
"Page-3\n",
"Page-4\n"
]
}
],
"source": [
"!cd data/ && pdftohtml -c -hidden -xml ALA1934_RR-excerpt.pdf ALA1934_RR-excerpt.pdf.xml"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"ALA1934_RR-excerpt.pdf\r\n",
"ALA1934_RR-excerpt.pdf-1_1.png\r\n",
"ALA1934_RR-excerpt.pdf-2_1.png\r\n",
"ALA1934_RR-excerpt.pdf-3_1.png\r\n",
"ALA1934_RR-excerpt.pdf-4_1.png\r\n",
"ALA1934_RR-excerpt.pdf.xml\r\n"
]
}
],
"source": [
"!ls -1 data/"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\r\n",
"<!DOCTYPE pdf2xml SYSTEM \"pdf2xml.dtd\">\r\n",
"\r\n",
"<pdf2xml producer=\"poppler\" version=\"0.41.0\">\r\n",
"<page number=\"1\" position=\"absolute\" top=\"0\" left=\"0\" height=\"1261\" width=\"892\">\r\n",
"\t<fontspec id=\"0\" size=\"23\" family=\"Times\" color=\"#000000\"/>\r\n",
"\t<fontspec id=\"1\" size=\"41\" family=\"Times\" color=\"#000000\"/>\r\n",
"\t<fontspec id=\"2\" size=\"20\" family=\"Times\" color=\"#000000\"/>\r\n",
"\t<fontspec id=\"3\" size=\"22\" family=\"Times\" color=\"#000000\"/>\r\n",
"\t<fontspec id=\"4\" size=\"7\" family=\"Times\" color=\"#000000\"/>\r\n",
"\t<fontspec id=\"5\" size=\"7\" family=\"Times\" color=\"#000000\"/>\r\n",
"\t<fontspec id=\"6\" size=\"13\" family=\"Times\" color=\"#000000\"/>\r\n",
"\t<fontspec id=\"7\" size=\"5\" family=\"Times\" color=\"#000000\"/>\r\n",
"\t<fontspec id=\"8\" size=\"13\" family=\"Times\" color=\"#000000\"/>\r\n",
"\t<fontspec id=\"9\" size=\"4\" family=\"Times\" color=\"#000000\"/>\r\n",
"\t<fontspec id=\"10\" size=\"4\" family=\"Times\" color=\"#000000\"/>\r\n",
"\t<fontspec id=\"11\" size=\"5\" family=\"Times\" color=\"#000000\"/>\r\n",
"\t<fontspec id=\"12\" size=\"7\" family=\"Times\" color=\"#000000\"/>\r\n",
"\t<fontspec id=\"13\" size=\"7\" family=\"Times\" color=\"#000000\"/>\r\n",
"\t<fontspec id=\"14\" size=\"16\" family=\"Times\" color=\"#000000\"/>\r\n",
"\t<fontspec id=\"15\" size=\"11\" family=\"Times\" color=\"#000000\"/>\r\n",
"<image top=\"0\" left=\"0\" width=\"893\" height=\"1262\" src=\"ALA1934_RR-excerpt.pdf-1_1.png\"/>\r\n",
"<text top=\"43\" left=\"545\" width=\"14\" height=\"26\" font=\"0\">\\</text>\r\n",
"<text top=\"102\" left=\"298\" width=\"353\" height=\"41\" font=\"1\"><b>Deutsches Reich</b></text>\r\n",
"<text top=\"175\" left=\"363\" width=\"234\" height=\"21\" font=\"2\">Politische Zeitungen </text>\r\n",
"<text top=\"209\" left=\"436\" width=\"83\" height=\"22\" font=\"3\"><b>Anhalt</b></text>\r\n",
"<text top=\"248\" left=\"499\" width=\"8\" height=\"8\" font=\"4\">'S</text>\r\n",
"<text top=\"256\" left=\"490\" width=\"16\" height=\"8\" font=\"4\">„ s</text>\r\n",
"<text top=\"262\" left=\"492\" width=\"15\" height=\"8\" font=\"4\">Ö rj</text>\r\n",
"<text top=\"251\" left=\"520\" width=\"85\" height=\"8\" font=\"4\">A n z e i g e n t e i l</text>\r\n"
]
}
],
"source": [
"!head -n 30 data/ALA1934_RR-excerpt.pdf.xml"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We generated an XML which consists of several `<page>` elements, containing an `<image>` (the \"background\" image, i.e. the scanned page) and several text boxes (`<text>`) with coordinates and the respective text box value. The images themselves have also been extracted from the PDF."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. View the text boxes and scanned pages with *pdf2xml-viewer*\n",
"\n",
"You can download the [pdf2xml-viewer from its github page](https://github.com/WZBSocialScienceCenter/pdf2xml-viewer). It basically consists of an HTML page that allows you to inspect an XML file in pdf2xml format in your browser. A copy of it resides also in the directory of this example.\n",
"\n",
"Change to the directory where pdf2xml-viewer resides (where its *index.html* or *pdf2xml-viewer.html* file is). You should also copy the extracted XML file and images to this location. Now let's start up a minimal local webserver. This can be done very easily with Python:\n",
"\n",
"With Python 2.x: `python -m SimpleHTTPServer 8080`\n",
"\n",
"Or with Python 3: `python3 -m http.server 8080 --bind 127.0.0.1`\n",
"\n",
"Now you open your browser and go to the address http://127.0.0.1:8080. The viewer shows up and you can now enter the\n",
"file name of your file to load (it must be relative to the directory in which pdf2xml-viewer resides)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"!python3 -m http.server 8080 --bind 127.0.0.1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When you execute the above line you can start up pdf2xml-viewer by visiting http://127.0.0.1:8080/pdf2xml-viewer.html. **However, please note that this will prevent further code execution in this Notebook as long as the minimal webserver is running.** So it's generally a better idea to execute this in a separate Terminal window."
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"In pdf2xml-viewer you will now be able to browse through the pages. In the background, you can see the scanned image page and on top of that are the text boxes that were detected during OCR:\n",
"\n",
"![pdf2xml-viewer p.3 displayed](nb_images/pdf2xml-viewer-page.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Load the XML describing the pages and text boxes\n",
"\n",
"We can now start to use `pdftabextract` in Python code in order to load the XML file. By now, you should have installed pdftabextract via *pip* in the Terminal with the command `pip install pdftabextract`.\n",
"\n",
"Let's define some constants first that we will need throughout the script."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"DATAPATH = 'data/'\n",
"OUTPUTPATH = 'generated_output/'\n",
"INPUT_XML = 'ALA1934_RR-excerpt.pdf.xml'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can load the the XML, parse it and have a look at the third page (the page from which we later want to extract the data)."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import os\n",
"from pdftabextract.common import read_xml, parse_pages\n",
"\n",
"# Load the XML that was generated with pdftohtml\n",
"xmltree, xmlroot = read_xml(os.path.join(DATAPATH, INPUT_XML))\n",
"\n",
"# parse it and generate a dict of pages\n",
"pages = parse_pages(xmlroot)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dict_keys(['number', 'width', 'texts', 'image', 'xmlnode', 'height'])"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pages[3].keys()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each page consists of an *xmlnode* which points to the original XML page element, a page number, the page dimensions, an image (the scanned page) and the text boxes:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"number 3\n",
"width 892\n",
"height 1261\n",
"image ALA1934_RR-excerpt.pdf-3_1.png\n",
"the first three text boxes:\n",
"[{'bottom': 107.0,\n",
" 'bottomleft': array([ 359., 107.]),\n",
" 'bottomright': array([ 404., 107.]),\n",
" 'height': 14,\n",
" 'left': 359.0,\n",
" 'right': 404.0,\n",
" 'top': 93.0,\n",
" 'topleft': array([ 359., 93.]),\n",
" 'topright': array([ 404., 93.]),\n",
" 'value': 'Baden',\n",
" 'width': 45,\n",
" 'xmlnode': <Element 'text' at 0x7f0227e4b688>},\n",
" {'bottom': 104.0,\n",
" 'bottomleft': array([ 737., 104.]),\n",
" 'bottomright': array([ 745., 104.]),\n",
" 'height': 13,\n",
" 'left': 737.0,\n",
" 'right': 745.0,\n",
" 'top': 91.0,\n",
" 'topleft': array([ 737., 91.]),\n",
" 'topright': array([ 745., 91.]),\n",
" 'value': '3',\n",
" 'width': 8,\n",
" 'xmlnode': <Element 'text' at 0x7f0227e4b6d8>},\n",
" {'bottom': 144.0,\n",
" 'bottomleft': array([ 58., 144.]),\n",
" 'bottomright': array([ 76., 144.]),\n",
" 'height': 8,\n",
" 'left': 58.0,\n",
" 'right': 76.0,\n",
" 'top': 136.0,\n",
" 'topleft': array([ 58., 136.]),\n",
" 'topright': array([ 76., 136.]),\n",
" 'value': 'Ein\\xad',\n",
" 'width': 18,\n",
" 'xmlnode': <Element 'text' at 0x7f0227e4b778>}]\n"
]
}
],
"source": [
"from pprint import pprint\n",
"\n",
"p_num = 3\n",
"p = pages[p_num]\n",
"\n",
"print('number', p['number'])\n",
"print('width', p['width'])\n",
"print('height', p['height'])\n",
"print('image', p['image'])\n",
"print('the first three text boxes:')\n",
"pprint(p['texts'][:3])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The text boxes in `p['texts']` contain the coordinates and dimensions of each text box as well as the content (`value`) and a reference to the original XML node."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Detect straight lines in the scanned pages, find out a possible page skew or rotation and fix it\n",
"\n",
"We can see on the scanned page image, that columns and table headers are marked with straight lines. We can detect these in order to a) find out if and how much the page is skewed or rotated and b) later use the detected vertical lines to estimate the position of the columns in the table.\n",
"\n",
"In order to do so, we can use the image processing module (`pdftabextract.imgproc`) which uses [OpenCV's Hough transform](http://docs.opencv.org/2.4/doc/tutorials/imgproc/imgtrans/hough_lines/hough_lines.html) to detect the lines."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"page 3: detecting lines in image file 'data/ALA1934_RR-excerpt.pdf-3_1.png'...\n",
"> found 69 lines\n"
]
}
],
"source": [
"import numpy as np\n",
"from pdftabextract import imgproc\n",
"\n",
"# get the image file of the scanned page\n",
"imgfilebasename = p['image'][:p['image'].rindex('.')]\n",
"imgfile = os.path.join(DATAPATH, p['image'])\n",
"\n",
"print(\"page %d: detecting lines in image file '%s'...\" % (p_num, imgfile))\n",
"\n",
"# create an image processing object with the scanned page\n",
"iproc_obj = imgproc.ImageProc(imgfile)\n",
"\n",
"# calculate the scaling of the image file in relation to the text boxes coordinate system dimensions\n",
"page_scaling_x = iproc_obj.img_w / p['width'] # scaling in X-direction\n",
"page_scaling_y = iproc_obj.img_h / p['height'] # scaling in Y-direction\n",
"\n",
"# detect the lines\n",
"lines_hough = iproc_obj.detect_lines(canny_kernel_size=3, canny_low_thresh=50, canny_high_thresh=150,\n",
" hough_rho_res=1,\n",
" hough_theta_res=np.pi/500,\n",
" hough_votes_thresh=round(0.2 * iproc_obj.img_w))\n",
"print(\"> found %d lines\" % len(lines_hough))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The important thing is that we created an `ImageProc` instance using the scanned page image file that is referred to in the `image` key of the page `p`. ImageProc will identify the dimensions of the image file which allows us to calculate the scaling between the image dimensions and the text boxes' coordinate system. It is important to understand that the coordinate system in image space has a different scaling than the coordinate system used to position the text boxes. For example, the image could be scanned with a size of 1000x3000 pixels (`iproc_obj.img_w` by `iproc_obj.img_h`) while the text boxes of a page are positioned on a canvas of size 500x1500 units (`p['width']` by `p['height']`).\n",
"\n",
"Now the most crucial step is to detect the lines and use the right parameters for the image processing step. At first, our image will be converted into a binary image with white pixels marking the edges in the image (i.e. the regions with a big change in color intensity) and black pixels showing regions of homogenous color, i.e. low change in color intensity. To achieve this, the [Canny algorithm](http://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_imgproc/py_canny/py_canny.html) is used. The kernel size is the size of the [Gaussian Filter](https://en.wikipedia.org/wiki/Canny_edge_detector#Gaussian_filter) used to smooth the image and remove pixel noise. The low and high thresholds are used for detecting \"strong\" and \"weak\" edge pixels during [Hysteresis Thresholding](https://en.wikipedia.org/wiki/Canny_edge_detector#Double_threshold).\n",
"\n",
"After this, the actual Hough transform is taking place. All white pixels (i.e. edge pixels) are transformed into \"Hough space\" which is a descrete approximation of a [polar coordinate system](https://en.wikipedia.org/wiki/Polar_coordinate_system) with *theta* being the polar angle and *rho* being the distance from the origin or \"pole\". This Hough space can be seen as a 2D map with *theta* being on the y-axis and *rho* on the x-axis. The map must have a certain resolution or size which is given by `hough_rho_res` and `hough_theta_res`. During Hough transform, each edge pixel *p* is then converted to polar coordinate space and *n = PI / `hough_theta_res`* lines (in the above example 500 lines) are simulated to pass through that pixel in *n* different degrees. This forms a sinusoid which will intersect with other sinusoids when other edge pixels are on the same straight line (because this pixels share the same angle or *theta* and distance or *rho* of this straight line). So for each edge pixel, a number of potential lines going through that pixel is simulated and then accumulated on the Hough space map, a process which is called *voting*. The more intersections appear on a certain spot in the Hough map the higher is the *vote*. All votes that are above a certain threshold here given by `hough_votes_thresh` are then considered being straight lines.\n",
"\n",
"So `hough_theta_res` defines the minimum line angle that can be detected (in our case: *PI/500 = 0.36°*) and `hough_rho_res` the minimum distance delta. `hough_votes_thresh` is the minimum number of intersections (and hence about the number of edge pixels on a straight line) in the Hough map for a straight line to be detected as such.\n",
"\n",
"We should now have a look at the detected lines in order to verify that we chose the right parameters. But at first, let's define a helper function for that because we will need it another time."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"import cv2\n",
"\n",
"# helper function to save an image \n",
"def save_image_w_lines(iproc_obj, imgfilebasename):\n",
" img_lines = iproc_obj.draw_lines(orig_img_as_background=True)\n",
" img_lines_file = os.path.join(OUTPUTPATH, '%s-lines-orig.png' % imgfilebasename)\n",
" \n",
" print(\"> saving image with detected lines to '%s'\" % img_lines_file)\n",
" cv2.imwrite(img_lines_file, img_lines)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"> saving image with detected lines to 'generated_output/ALA1934_RR-excerpt.pdf-3_1-lines-orig.png'\n"
]
}
],
"source": [
"save_image_w_lines(iproc_obj, imgfilebasename)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is our image with the detected lines:\n",
"\n",
"![image after Hough transform](generated_output/ALA1934_RR-excerpt.pdf-3_1-lines-orig.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We see that the horizontal lines (green) are not straight, whereas the vertical lines (red) are. This means that the page is skewed so that the text boxes in the left are slightly shifted down in relation to those on the right of the same row (this can also be seen in pdf2xml-viewer). In such a dense table, this would bring serious problems during row detection, so we will need to fix this. The method `find_rotation_or_skew` can be used for that."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"> deskewing in direction 'sy' by 0.719995°\n"
]
}
],
"source": [
"from math import radians, degrees\n",
"\n",
"from pdftabextract.common import ROTATION, SKEW_X, SKEW_Y\n",
"from pdftabextract.geom import pt\n",
"from pdftabextract.textboxes import rotate_textboxes, deskew_textboxes\n",
"\n",
"# find rotation or skew\n",
"# the parameters are:\n",
"# 1. the minimum threshold in radians for a rotation to be counted as such\n",
"# 2. the maximum threshold for the difference between horizontal and vertical line rotation (to detect skew)\n",
"# 3. an optional threshold to filter out \"stray\" lines whose angle is too far apart from the median angle of\n",
"# all other lines that go in the same direction (no effect here)\n",
"rot_or_skew_type, rot_or_skew_radians = iproc_obj.find_rotation_or_skew(radians(0.5), # uses \"lines_hough\"\n",
" radians(1),\n",
" omit_on_rot_thresh=radians(0.5))\n",
"\n",
"# rotate back or deskew text boxes\n",
"needs_fix = True\n",
"if rot_or_skew_type == ROTATION:\n",
" print(\"> rotating back by %f°\" % -degrees(rot_or_skew_radians))\n",
" rotate_textboxes(p, -rot_or_skew_radians, pt(0, 0))\n",
"elif rot_or_skew_type in (SKEW_X, SKEW_Y):\n",
" print(\"> deskewing in direction '%s' by %f°\" % (rot_or_skew_type, -degrees(rot_or_skew_radians)))\n",
" deskew_textboxes(p, -rot_or_skew_radians, rot_or_skew_type, pt(0, 0))\n",
"else:\n",
" needs_fix = False\n",
" print(\"> no page rotation / skew found\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that a skew in direction *sy* (meaning vertical direction) is detected. To fix this, `deskew_textboxes` is called with our page `p` as first parameter, then the negated skew (negated in order to \"skew back\"), the skew type (direction) and the origin point about which the deskewing is done (top left corner).\n",
"\n",
"We fixed the text boxes in the XML, but we should also fix the actual lines that were detected. This can be done with the method `apply_found_rotation_or_skew`. Additionally, we save the image with the repaired lines."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"> saving image with detected lines to 'generated_output/ALA1934_RR-excerpt.pdf-3_1-repaired-lines-orig.png'\n"
]
}
],
"source": [
"if needs_fix:\n",
" # rotate back or deskew detected lines\n",
" lines_hough = iproc_obj.apply_found_rotation_or_skew(rot_or_skew_type, -rot_or_skew_radians)\n",
"\n",
" save_image_w_lines(iproc_obj, imgfilebasename + '-repaired')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![image with repaired lines](generated_output/ALA1934_RR-excerpt.pdf-3_1-repaired-lines-orig.png)\n",
"\n",
"As we can see, only the horizontal lines are straightened. Please note that the deskewing is not applied to the original image because this is not necessary for our further processing and hence the repaired lines are now a bit off from the table borders in the original image.\n",
"\n",
"We should also save the corrected text boxes' XML so that we can inspect it in pdf2xml-viewer:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"saving repaired XML file to 'generated_output/ALA1934_RR-excerpt.pdf.repaired.xml'...\n"
]
}
],
"source": [
"# save repaired XML (i.e. XML with deskewed textbox positions)\n",
"output_files_basename = INPUT_XML[:INPUT_XML.rindex('.')]\n",
"repaired_xmlfile = os.path.join(OUTPUTPATH, output_files_basename + '.repaired.xml')\n",
"\n",
"print(\"saving repaired XML file to '%s'...\" % repaired_xmlfile)\n",
"xmltree.write(repaired_xmlfile)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Detect clusters of vertical lines for identifying the columns of a table\n",
"\n",
"We now want to identify the columns of the table on our page. We already obtained the vertical lines by the means of image processing. Now as there are always many close lines detected for single column separator line, we need to cluster these sets of close lines so that we can later calculate the column positions.\n",
"\n",
"We can use the `find_clusters` method to which we need to pass the direction of lines that we want to cluster (vertical lines) and a clustering method. Here, we will use a simple technique that orders the lines and arranges them to separate clusters when the gap between them is too big (i.e. exceeds `dist_thresh`). We use the half of the minimum column width for this threshold. We can measure this minimum column width in advance with a graphics editor like *GIMP*.\n",
"You can also use other, more sophisticated clustering techniques here too, which are implemented in `pdftabextract.clustering`, e.g. hierarchical clustering. But usually this approach here is sufficient."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"> found 17 clusters\n",
"> saving image with detected vertical clusters to 'generated_output/ALA1934_RR-excerpt.pdf-3_1-vertical-clusters.png'\n"
]
},
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from pdftabextract.clustering import find_clusters_1d_break_dist\n",
"\n",
"MIN_COL_WIDTH = 60 # minimum width of a column in pixels, measured in the scanned pages\n",
"\n",
"# cluster the detected *vertical* lines using find_clusters_1d_break_dist as simple clustering function\n",
"# (break on distance MIN_COL_WIDTH/2)\n",
"# additionally, remove all cluster sections that are considered empty\n",
"# a cluster is considered empty when the number of text boxes in it is below 10% of the median number of text boxes\n",
"# per cluster section\n",
"vertical_clusters = iproc_obj.find_clusters(imgproc.DIRECTION_VERTICAL, find_clusters_1d_break_dist,\n",
" remove_empty_cluster_sections_use_texts=p['texts'], # use this page's textboxes\n",
" remove_empty_cluster_sections_n_texts_ratio=0.1, # 10% rule\n",
" remove_empty_cluster_sections_scaling=page_scaling_x, # the positions are in \"scanned image space\" -> we scale them to \"text box space\"\n",
" dist_thresh=MIN_COL_WIDTH/2)\n",
"print(\"> found %d clusters\" % len(vertical_clusters))\n",
"\n",
"# draw the clusters\n",
"img_w_clusters = iproc_obj.draw_line_clusters(imgproc.DIRECTION_VERTICAL, vertical_clusters)\n",
"save_img_file = os.path.join(OUTPUTPATH, '%s-vertical-clusters.png' % imgfilebasename)\n",
"print(\"> saving image with detected vertical clusters to '%s'\" % save_img_file)\n",
"cv2.imwrite(save_img_file, img_w_clusters)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we see the detected clusters marked with different colors:\n",
"\n",
"![vertical clusters of p.3](generated_output/ALA1934_RR-excerpt.pdf-3_1-vertical-clusters.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we simply calculate the centers of the clusters, by default taking the median of the cluster values. We also need to divide by the page scaling because the cluster positions are in image space but we need the column positions in \"text box space\"."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"found 17 column borders:\n",
"[ 39.16765831 92.43709677 350.3668275 401.12338442 425.85806452\n",
" 452.11451613 478.01129032 503.5483871 528.36612903 554.55839054\n",
" 579.45175965 604.34512877 630.55830948 658.97309748 690.66941713\n",
" 716.2070073 740.26100487]\n"
]
}
],
"source": [
"from pdftabextract.clustering import calc_cluster_centers_1d\n",
"\n",
"page_colpos = np.array(calc_cluster_centers_1d(vertical_clusters)) / page_scaling_x\n",
"print('found %d column borders:' % len(page_colpos))\n",
"print(page_colpos)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Find out the row positions of the table by analyzing the y-coordinates' distribution of text boxes\n",
"\n",
"For the row or line positions in the table, we can't rely on optical features because there are no row separator lines. However, we can exploit the distinctive distribution of y-coordinates of the text boxes and cluster them in order to get an estimation of the row positions. Another challenge is to find the start and the end of the table. We can do so by analyzing the specific contents of the text boxes for hints that the first row starts (it must start with a population number on the left) or that the text box is the table footer (it has something to do with \"Ala Anzeigen\").\n",
"\n",
"We start by finding text boxes of \"usual\" height in the first two columns, because the text boxes in these columns are most reliable in determining the row position."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"# right border of the second column\n",
"col2_rightborder = page_colpos[2]\n",
"\n",
"# calculate median text box height\n",
"median_text_height = np.median([t['height'] for t in p['texts']])\n",
"\n",
"# get all texts in the first two columns with a \"usual\" textbox height\n",
"# we will only use these text boxes in order to determine the line positions because they are more \"stable\"\n",
"# otherwise, especially the right side of the column header can lead to problems detecting the first table row\n",
"text_height_deviation_thresh = median_text_height / 2\n",
"texts_cols_1_2 = [t for t in p['texts']\n",
" if t['right'] <= col2_rightborder\n",
" and abs(t['height'] - median_text_height) <= text_height_deviation_thresh]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next we get the text boxes' top and bottom border positions, cluster them, and calculate the cluster centers."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"number of line positions: 100\n"
]
}
],
"source": [
"from pdftabextract.clustering import zip_clusters_and_values\n",
"from pdftabextract.textboxes import border_positions_from_texts, split_texts_by_positions, join_texts\n",
"from pdftabextract.common import all_a_in_b, DIRECTION_VERTICAL\n",
"\n",
"# get all textboxes' top and bottom border positions\n",
"borders_y = border_positions_from_texts(texts_cols_1_2, DIRECTION_VERTICAL)\n",
"\n",
"# break into clusters using half of the median text height as break distance\n",
"clusters_y = find_clusters_1d_break_dist(borders_y, dist_thresh=median_text_height/2)\n",
"clusters_w_vals = zip_clusters_and_values(clusters_y, borders_y)\n",
"\n",
"# for each cluster, calculate the median as center\n",
"pos_y = calc_cluster_centers_1d(clusters_w_vals)\n",
"pos_y.append(p['height'])\n",
"\n",
"print('number of line positions:', len(pos_y))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now try to find the first table row by checking when we find the first text box that contains a string which matches a typical row pattern."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"208.80321434348292"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import re\n",
"\n",
"# a (possibly malformed) population number + space + start of city name\n",
"pttrn_table_row_beginning = re.compile(r'^[\\d Oo][\\d Oo]{2,} +[A-ZÄÖÜ]')\n",
"\n",
"# 1. try to find the top row of the table\n",
"texts_cols_1_2_per_line = split_texts_by_positions(texts_cols_1_2, pos_y, DIRECTION_VERTICAL,\n",
" alignment='middle',\n",
" enrich_with_positions=True)\n",
"\n",
"# go through the texts line per line\n",
"for line_texts, (line_top, line_bottom) in texts_cols_1_2_per_line:\n",
" line_str = join_texts(line_texts)\n",
" if pttrn_table_row_beginning.match(line_str): # check if the line content matches the given pattern\n",
" top_y = line_top\n",
" break\n",
"else:\n",
" top_y = 0\n",
"\n",
"top_y"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We find the last table row by checking if we find a \"big\" text box in the lower page area that matches certain words."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1089.8901693310179"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# hints for a footer text box\n",
"words_in_footer = ('anzeige', 'annahme', 'ala')\n",
"\n",
"# 2. try to find the bottom row of the table\n",
"min_footer_text_height = median_text_height * 1.5\n",
"min_footer_y_pos = p['height'] * 0.7\n",
"# get all texts in the lower 30% of the page that have are at least 50% bigger than the median textbox height\n",
"bottom_texts = [t for t in p['texts']\n",
" if t['top'] >= min_footer_y_pos and t['height'] >= min_footer_text_height]\n",
"bottom_texts_per_line = split_texts_by_positions(bottom_texts,\n",
" pos_y + [p['height']], # always down to the end of the page\n",
" DIRECTION_VERTICAL,\n",
" alignment='middle',\n",
" enrich_with_positions=True)\n",
"# go through the texts at the bottom line per line\n",
"page_span = page_colpos[-1] - page_colpos[0]\n",
"min_footer_text_width = page_span * 0.8\n",
"for line_texts, (line_top, line_bottom) in bottom_texts_per_line:\n",
" line_str = join_texts(line_texts)\n",
" has_wide_footer_text = any(t['width'] >= min_footer_text_width for t in line_texts)\n",
" # check if there's at least one wide text or if all of the required words for a footer match\n",
" if has_wide_footer_text or all_a_in_b(words_in_footer, line_str):\n",
" bottom_y = line_top\n",
" break\n",
"else:\n",
" bottom_y = p['height']\n",
"\n",
"bottom_y"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"> page 3: 93 lines between [208.803214, 1089.890169]\n"
]
}
],
"source": [
"# finally filter the line positions so that only the lines between the table top and bottom are left\n",
"page_rowpos = [y for y in pos_y if top_y <= y <= bottom_y]\n",
"\n",
"print(\"> page %d: %d lines between [%f, %f]\" % (p_num, len(page_rowpos), top_y, bottom_y))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Create a grid of columns and lines\n",
"\n",
"From the column and row positions that we detected, we can now generate a \"page grid\" which should resemble the table layout as close as possible. We then save the grid information as JSON file so that we can display it in pdf2xml-viewer."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"> page 3: grid with 92 rows, 16 columns\n"
]
}
],
"source": [
"from pdftabextract.extract import make_grid_from_positions\n",
"\n",
"grid = make_grid_from_positions(page_colpos, page_rowpos)\n",
"n_rows = len(grid)\n",
"n_cols = len(grid[0])\n",
"print(\"> page %d: grid with %d rows, %d columns\" % (p_num, n_rows, n_cols))\n"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 39.16765831, 208.80321434],\n",
" [ 92.43709677, 217.83563592]])"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# the page grid is a 2D matrix that contains the rectangular table cells\n",
"# this is the top left table cell:\n",
"grid[0][0]"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"saving page grids JSON file to 'generated_output/ALA1934_RR-excerpt.pdf.pagegrids_p3_only.json'\n"
]
}
],
"source": [
"from pdftabextract.common import save_page_grids\n",
"\n",
"page_grids_file = os.path.join(OUTPUTPATH, output_files_basename + '.pagegrids_p3_only.json')\n",
"print(\"saving page grids JSON file to '%s'\" % page_grids_file)\n",
"save_page_grids({p_num: grid}, page_grids_file)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We should now evaluate our page grid using pdf2xml-viewer. At first, we will need to load the XML of the repaired (deskewed) pages and then the JSON file containing the page grid information for page 3. When we select page 3 we should see an output like this:\n",
"\n",
"![page grids of p.3 in pdf2xml-viewer](nb_images/pdf2xml-viewer-pagegrid.png)\n",
"\n",
"We can see that the rows and columns were detected correctly (decrease the background image visibility for better display) and the deskewed text boxes mostly lie in the correct table cells. So the only thing left now is to finally extract the data by fitting the text boxes in the correct table cells."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 8. Match the text boxes into the grid and hence extract the tabular data in order to export it as Excel and CSV file\n",
"\n",
"We can use `fit_texts_into_grid` to fit the text boxes into the grid and then transform it to a [pandas](http://pandas.pydata.org/) *DataFrame*."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>col01</th>\n",
" <th>col02</th>\n",
" <th>col03</th>\n",
" <th>col04</th>\n",
" <th>col05</th>\n",
" <th>col06</th>\n",
" <th>col07</th>\n",
" <th>col08</th>\n",
" <th>col09</th>\n",
" <th>col10</th>\n",
" <th>col11</th>\n",
" <th>col12</th>\n",
" <th>col13</th>\n",
" <th>col14</th>\n",
" <th>col15</th>\n",
" <th>col16</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>157 000</td>\n",
" <td>K a r l s r u h e i . B . , Generalanzei...</td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td></td>\n",
" <td>d eu tsch la n d ..............................</td>\n",
" <td>23 300</td>\n",
" <td>7</td>\n",
" <td>22</td>\n",
" <td>12</td>\n",
" <td>7</td>\n",
" <td>N</td>\n",
" <td>90</td>\n",
" <td>50</td>\n",
" <td>c</td>\n",
" <td></td>\n",
" <td>420 280</td>\n",
" <td>5040</td>\n",
" <td>+</td>\n",
" <td>25</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td></td>\n",
" <td>B e i gleichz. A u fg . in B ez.-A u sg ....</td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td></td>\n",
" <td>u. K in zig -B ote, H ardt-Anzeiger, K eM ,...</td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td></td>\n",
" <td>8 R p f, R ekl. mm SO l i p f Siehe A nze...</td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td></td>\n",
" <td>K aärlsruher&lt;B ürgerzeitung......................</td>\n",
" <td></td>\n",
" <td>1</td>\n",
" <td>50</td>\n",
" <td>4</td>\n",
" <td>*</td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td>D</td>\n",
" <td></td>\n",
" <td>295 186</td>\n",
" <td></td>\n",
" <td>+</td>\n",
" <td>54</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td></td>\n",
" <td>» V i S . R M 9 6 ,—</td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td></td>\n",
" <td>K arlsruh eü-'Tagbatt . . . S iehe A nzei...</td>\n",
" <td>13 000</td>\n",
" <td>7</td>\n",
" <td>22</td>\n",
" <td>12</td>\n",
" <td>6</td>\n",
" <td>N</td>\n",
" <td>68</td>\n",
" <td>30</td>\n",
" <td>B</td>\n",
" <td>420</td>\n",
" <td>2 87</td>\n",
" <td>5040</td>\n",
" <td>+</td>\n",
" <td>25</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td></td>\n",
" <td>R esidenZ-A nzeiger . . . . Siehe A nzeige...</td>\n",
" <td>7 300</td>\n",
" <td>6</td>\n",
" <td>22</td>\n",
" <td>12</td>\n",
" <td>4</td>\n",
" <td>N</td>\n",
" <td>96</td>\n",
" <td>25</td>\n",
" <td>B</td>\n",
" <td>450</td>\n",
" <td>2 92</td>\n",
" <td>5400</td>\n",
" <td>+</td>\n",
" <td>25</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td></td>\n",
" <td>R h e i n - R ö t e ....................... Si...</td>\n",
" <td>1 450</td>\n",
" <td>6</td>\n",
" <td>46</td>\n",
" <td>6</td>\n",
" <td>5</td>\n",
" <td>N</td>\n",
" <td>96</td>\n",
" <td>15</td>\n",
" <td>B</td>\n",
" <td>450</td>\n",
" <td>2 92</td>\n",
" <td>2700</td>\n",
" <td>+</td>\n",
" <td>25</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" col01 col02 col03 col04 \\\n",
"0 157 000 K a r l s r u h e i . B . , Generalanzei... \n",
"1 d eu tsch la n d .............................. 23 300 7 \n",
"2 B e i gleichz. A u fg . in B ez.-A u sg .... \n",
"3 u. K in zig -B ote, H ardt-Anzeiger, K eM ,... \n",
"4 8 R p f, R ekl. mm SO l i p f Siehe A nze... \n",
"5 K aärlsruher<B ürgerzeitung...................... 1 \n",
"6 » V i S . R M 9 6 ,— \n",
"7 K arlsruh eü-'Tagbatt . . . S iehe A nzei... 13 000 7 \n",
"8 R esidenZ-A nzeiger . . . . Siehe A nzeige... 7 300 6 \n",
"9 R h e i n - R ö t e ....................... Si... 1 450 6 \n",
"\n",
" col05 col06 col07 col08 col09 col10 col11 col12 col13 col14 col15 col16 \n",
"0 \n",
"1 22 12 7 N 90 50 c 420 280 5040 + 25 \n",
"2 \n",
"3 \n",
"4 \n",
"5 50 4 * D 295 186 + 54 \n",
"6 \n",
"7 22 12 6 N 68 30 B 420 2 87 5040 + 25 \n",
"8 22 12 4 N 96 25 B 450 2 92 5400 + 25 \n",
"9 46 6 5 N 96 15 B 450 2 92 2700 + 25 "
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from pdftabextract.extract import fit_texts_into_grid, datatable_to_dataframe\n",
"\n",
"datatable = fit_texts_into_grid(p['texts'], grid)\n",
" \n",
"df = datatable_to_dataframe(datatable)\n",
"\n",
"df.head(n=10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Of course you would now usually do a lot of parsing, data clean up and validation because with this kind of messy data it will never be 100% perfect. Still this is a good basis to work on. We can export the data now as CSV and Excel:"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"saving extracted data to 'generated_output/ALA1934_RR-excerpt.pdf-p3_only.csv'\n",
"saving extracted data to 'generated_output/ALA1934_RR-excerpt.pdf-p3_only.xlsx'\n"
]
}
],
"source": [
"csv_output_file = os.path.join(OUTPUTPATH, output_files_basename + '-p3_only.csv')\n",
"print(\"saving extracted data to '%s'\" % csv_output_file)\n",
"df.to_csv(csv_output_file, index=False)\n",
"\n",
"excel_output_file = os.path.join(OUTPUTPATH, output_files_basename + '-p3_only.xlsx')\n",
"print(\"saving extracted data to '%s'\" % excel_output_file)\n",
"df.to_excel(excel_output_file, index=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}