# Parsing PDF Files

**Portable Document Format ([PDF](https://acrobat.adobe.com/au/en/products/about-adobe-pdf.html))**, invented by Adobe, is "*a file format used to present and exchange documents reliably, independent of software, hardware, or operating system.*" It is a great format for representing digital documents since each PDF file encapsulates a complete description of the layout of the original document (i.e., the text, fonts, graphics, and other meta information of the document). However, it’s a document representation format and **not a machine-readable data format** like CSV, JSON, or XML. Unfortunately, much real-world data is stored in PDF files, particularly the data published by some government agencies and financial institutions. 
Here we would also like to point out that **if you can <font color='red'>avoid having to extract
data from PDFs</font>, you should**.

For data analysis, PDF is not a preferred storage or presentation format. However, sometimes we do not have any other choice. Throughout this chapter, you are going to learn two different ways of scraping data from PDFs with examples. We will cover how to write your own Python scripts, how to use some existing tools, and finally how to save the parsed data into a CSV file.

The example used in this chapter is "[Table 2: Nutrition](http://www.unicef.org/sowc2014/numbers/documents/english/EN-FINAL%20Table%202.pdf)" from UNICEF's report on [The State of the World's Children](http://www.unicef.org/sowc2014/numbers/) for 2014. Click the link to download the PDF file, named "EN-FINAL Table 2.pdf", and save it in the same folder as this notebook. It is the same data as that used in the previous chapter, but in PDF format. The following screenshot shows what the first page of the PDF file looks like. 
![](./EN_FINAL_Table_2_page_1.jpg)

PDFs are more difficult to work with than Excel files because different PDFs can have different formats that are unpredictable. For those curious why it is so difficult to extract data from PDFs, you might be interested in reading the story from [ProPublica](https://www.propublica.org/nerds/item/heart-of-nerd-darkness-why-dollars-for-docs-was-so-difficult) (Read Section "PDFs Considered Harmful" 📖 ). There are many ways of extracting data from PDFs. Just to name a few, here is a list of tools:

* [pdfminer](http://www.unixuser.org/~euske/python/pdfminer/): A tool for extracting text, images, object coordinates, metadata from PDF documents. It includes a PDF converter and an extensible PDF parser. 
* [pdftables](https://github.com/chrisdev/pdftables): A tool for extracting tables from PDF files. It uses pdfminer to get information on the locations of text elements. Each row in the table is extracted and stored in a list.
* [slate](https://pypi.python.org/pypi/slate): A Python package that simplifies the process of extracting text from PDF files. It is a small Python module that wraps pdfminer's API.
* [PyPDF2](http://mstamy2.github.io/PyPDF2/): A Python library built for manipulating PDFs, such as extracting document information, splitting, merging, and cropping pages, etc.
* [Tabula](http://tabula.technology/): A simple tool for extracting data tables out of PDF files. It is quite simple to use.

Try to find more tools on the Internet! Note that you should search for PDF parsing tools that are capable of extracting data from PDFs, as some parsing tools are not suitable for data extraction.

Besides these tools, you can also scrape data from PDF files with many programming languages, such as Python. Searching online will turn up tutorials, documentation, and blog posts on the topic, such as 
* [Get Started With Scraping – Extracting Simple Tables from PDF Documents](http://schoolofdata.org/2013/06/18/get-started-with-scraping-extracting-simple-tables-from-pdf-documents/) 📖 . It discusses how to use pdftohtml to extract tables from PDFs.

In this chapter we will demonstrate how to use pdfminer and pdftables to extract data tables from the downloaded PDF file and save the extracted data into a CSV file.
You are also required to try Tabula on the same PDF file as an exercise.
* * *

## 1. Scraping data from PDFs with PDFMiner
We start with a crude approach that: 
1. first converts the PDF to text, 
2. and then extracts data from the text using, for example, regular expressions. 

This approach is better if you have a very large PDF file or a series of PDF files that corresponds to a set of consistent documents. We will also show the drawbacks of this approach later in this section. 


### 1.1 Converting PDF to Text
To convert the downloaded PDF file to a text file, we are going to use **`pdf2txt.py`**, a command that comes with pdfminer. Let's install pdfminer. In your command line window, type either of the following commands:
```shell
    pip install pdfminer
```
or 
```shell
    conda install -c https://conda.anaconda.org/hargup pdfminer
```

If you do not have *pip* or another Python package manager installed, you can also download the pdfminer package directly from its website, and install it using the Makefile as follows:
```shell
    make install
```

Now we have `pdfminer` installed and are ready to convert our PDF to text by running the following command:
```shell
    pdf2txt.py -o en_final_table_2.txt EN_FINAL_Table_2.pdf
```
The `-o` option gives the name of the text file we want to create, and the final argument is the PDF file that we want to convert. After running the above command, we have a text version of the PDF file, i.e., `en_final_table_2.txt`.
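
If you prefer to stay inside Python, the same command can be launched with the standard `subprocess` module. The sketch below is only an illustration: the helper names `build_pdf2txt_command` and `convert_pdf_to_text` are our own, and it assumes that `pdf2txt.py` is on your `PATH` and can be executed directly (which may not hold on every system).

```python
import shutil
import subprocess

def build_pdf2txt_command(pdf_path, txt_path):
    """Return the argument list for the pdf2txt.py call shown above."""
    return ["pdf2txt.py", "-o", txt_path, pdf_path]

def convert_pdf_to_text(pdf_path, txt_path):
    """Run pdf2txt.py if it can be found; return True on success."""
    if shutil.which("pdf2txt.py") is None:
        # The tool is not on PATH, so there is nothing to run.
        return False
    result = subprocess.run(build_pdf2txt_command(pdf_path, txt_path))
    return result.returncode == 0

# Show the command that would be executed for our example file.
print(build_pdf2txt_command("EN_FINAL_Table_2.pdf", "en_final_table_2.txt"))
```

Keeping the command construction in its own function makes it easy to check the arguments without actually running the conversion.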

Take a moment to skim the txt file and the original PDF file, and have a comparison. What do you find? 

The text file is quite messy. All the tables have been converted into text form, and the nice table layout shown in the PDF file is lost. Now how can we extract data tables and reconstruct the layout? In the following section, you will learn how to gradually develop a Python script for scraping data from our converted text file. 

First, let's read the new text file into Python.

In [21]:
pdfTxtFile = './en_final_table_2.txt'
# The text file is assumed to be UTF-8 encoded; the with-statement closes it for us
with open(pdfTxtFile, 'r', encoding='utf-8') as pdf_txt:
    # loop over all the lines
    for line in pdf_txt:
        # repr() is a built-in Python function that returns a string
        # containing a printable representation of an object.
        print(repr(line))

'TABLE 2     NUTRITION\n'
'\n'
'Countries  \n'
'and areas\n'
'Afghanistan \n'
'Albania \n'
'Algeria \n'
'Andorra \n'
'Angola \n'
'Antigua and Barbuda \n'
'Argentina \n'
'Armenia \n'
'Australia \n'
'Austria \n'
'Azerbaijan \n'
'Bahamas \n'
'Bahrain \n'
'Bangladesh \n'
'Barbados \n'
'Belarus \n'
'Belgium \n'
'Belize \n'
'Benin \n'
'Bhutan \n'
'Bolivia (Plurinational \n'
'   State of) \n'
'Bosnia and Herzegovina \n'
'Botswana \n'
'Brazil \n'
'Brunei Darussalam \n'
'Bulgaria \n'
'Burkina Faso \n'
'Burundi \n'
'Cabo Verde \n'
'Cambodia \n'
'Cameroon \n'
'Canada \n'
'Central African Republic \n'
'Chad \n'
'Chile \n'
'China \n'
'Colombia \n'
'Comoros \n'
'Congo \n'
'\n'
'Low  \n'
'\n'
'birthweight  \n'
'\n'
'(%) \n'
'\n'
'2008–2012*\n'
'\n'
'–  \n'
'4  \n'
'6 x \n'
'–  \n'
'12 x \n'
'5 x \n'
'7  \n'
'8  \n'
'7 x \n'
'7 x \n'
'10 x \n'
'11 x \n'
'–  \n'
'22 x \n'
'12  \n'
'4 x \n'
'–  \n'
'11  \n'
'15 x \n'
'10  \n'
'\n'
'6  \n'
'3  \n'
'13 x \n'
'8  \n'
'–  \n'
'9  \n'
'14  \n'
'13  \n'
'6 x 

'–  \n'
'66  \n'
'\n'
'–  \n'
'77 x \n'
'55  \n'
'58 x \n'
'23  \n'
'–  \n'
'–  \n'
'–  \n'
'24 x \n'
'–  \n'
'13  \n'
'26  \n'
'54  \n'
'82  \n'
'–  \n'
'26 x \n'
'\n'
'48 x \n'
'–  \n'
'15  \n'
'35  \n'
'–  \n'
'–  \n'
'–  \n'
'–  \n'
'–  \n'
'61  \n'
'77  \n'
'–  \n'
'68  \n'
'46  \n'
'–  \n'
'53 x \n'
'47  \n'
'–  \n'
'–  \n'
'\n'
'–  \n'
'–  \n'
'65  \n'
'\n'
'–  \n'
'43 x \n'
'18  \n'
'–  \n'
'9  \n'
'–  \n'
'–  \n'
'–  \n'
'3  \n'
'–  \n'
'3  \n'
'4  \n'
'16  \n'
'–  \n'
'–  \n'
'5  \n'
'\n'
'27  \n'
'–  \n'
'–  \n'
'13  \n'
'15  \n'
'4 x \n'
'–  \n'
'–  \n'
'–  \n'
'36 x \n'
'13  \n'
'12  \n'
'17  \n'
'19  \n'
'–  \n'
'–  \n'
'20  \n'
'–  \n'
'3  \n'
'\n'
'–  \n'
'–  \n'
'3  \n'
'\n'
'–  \n'
'16 x \n'
'5  \n'
'–  \n'
'4  \n'
'–  \n'
'–  \n'
'–  \n'
'–  \n'
'–  \n'
'1  \n'
'1  \n'
'4  \n'
'–  \n'
'–  \n'
'1  \n'
'\n'
'7  \n'
'–  \n'
'–  \n'
'2  \n'
'2  \n'
'–  \n'
'–  \n'
'–  \n'
'–  \n'
'–  \n'
'3  \n'
'2  \n'
'3  \n'
'5  \n'
'–  \n'
'–  \n'
'4  \n'
'–  \n'
'–  \n'
'\n'
'–  \n'

'Thailand \n'
'The former Yugoslav \n'
'   Republic of Macedonia \n'
'Timor-Leste \n'
'Togo \n'
'Tonga \n'
'Trinidad and Tobago \n'
'Tunisia \n'
'Turkey \n'
'Turkmenistan \n'
'Tuvalu \n'
'Uganda \n'
'Ukraine \n'
'United Arab Emirates \n'
'United Kingdom \n'
'United Republic \n'
'   of Tanzania \n'
'United States \n'
'Uruguay \n'
'Uzbekistan \n'
'Vanuatu \n'
'\n'
'Low  \n'
'\n'
'birthweight  \n'
'\n'
'(%) \n'
'\n'
'2008–2012*\n'
'\n'
'–  \n'
'11  \n'
'8 x \n'
'7 x \n'
'–  \n'
'13 x \n'
'–  \n'
'–  \n'
'–  \n'
'–  \n'
'17 x \n'
'9  \n'
'–  \n'
'14  \n'
'9  \n'
'–  \n'
'–  \n'
'10  \n'
'10 x \n'
'7  \n'
'\n'
'6  \n'
'12 x \n'
'11  \n'
'3 x \n'
'10  \n'
'7  \n'
'11  \n'
'4 x \n'
'6 x \n'
'12  \n'
'4 x \n'
'6  \n'
'8 x \n'
'\n'
'8  \n'
'8 x \n'
'9  \n'
'5 x \n'
'10 x \n'
'\n'
'Early initiation  \n'
'of breastfeeding \n'
'\n'
'(%)\n'
'\n'
'–  \n'
'45  \n'
'–  \n'
'–  \n'
'–  \n'
'75 x \n'
'26 x \n'
'61 x \n'
'–  \n'
'–  \n'
'80 x \n'
'–  \n'
'–  \n'
'45  \n'
'55  \n'
'–  \n'
'–  \n'
'46  \n'

'\n'
'\x0c'


The above code reads the text file **line-by-line** and prints each line. Notice that we have converted each line into a printable representation of a string object using Python's built-in function, **`repr()`**, as it will help us discover patterns that can be used to extract the data tables.
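
To see why `repr()` is useful here, the short sketch below prints a few hand-written sample strings (they are not read from the file). With a plain `print`, the last three would all look like blank output; with `repr()`, the differences become visible.

```python
samples = ["Zimbabwe \n", " \n", "\n", "\x0c"]

for text in samples:
    # repr() shows trailing spaces, newlines and the form-feed character explicitly
    print(repr(text))

visible = [repr(text) for text in samples]
```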

### 1.2 Collecting all the country names
We start by collecting all the country names, because the country names are going to **be the unique identifier of each record** in our final dataset, i.e., the index of a pandas DataFrame. To do so, let's open up the text file, i.e., `en_final_table_2.txt`, and search for blocks of text that contain country names. Can you identify any pattern?

We can find the following patterns that are consistent for all blocks of text that contain country names.

* Country names start after the line containing **"and areas"**. For example,
    ```
        3 'Countries  \n'
        4 'and areas\n' <––
        5 'Afghanistan \n'
        6 'Albania \n'
        7 'Algeria \n'
        8 'Andorra \n'
    ```
* The last country name in the name block is **followed by a line containing just a new line character (`\n`)**. For example,
    ```
        41 'China \n'
        42 'Colombia \n'
        43 'Comoros \n'
        44 'Congo \n'
        45 '\n'      <––
        46 'Low  \n'
    ```

Thus, to extract the country names, we need to create a Boolean variable **to indicate the start and end of each name block**. This Boolean variable should be set to `True` when we hit the "and areas" line, and to `False` when we hit the line containing only a new line character. We then update our Python script with the Boolean variable accordingly.

In [22]:
pdfTxtFile = './en_final_table_2.txt'
isCountryName = False

with open(pdfTxtFile, 'r', encoding='utf-8') as pdf_txt:
    for line in pdf_txt:
        if isCountryName:
            print(repr(line))

        # A line starting with 'and areas' marks the start of a country list,
        # so we set isCountryName to True.
        if line.startswith('and areas'):
            isCountryName = True

        # If we are inside a country list and the line is equal to
        # a new line character '\n', we have hit the end of the list.
        elif isCountryName and line == '\n':
            isCountryName = False

'Afghanistan \n'
'Albania \n'
'Algeria \n'
'Andorra \n'
'Angola \n'
'Antigua and Barbuda \n'
'Argentina \n'
'Armenia \n'
'Australia \n'
'Austria \n'
'Azerbaijan \n'
'Bahamas \n'
'Bahrain \n'
'Bangladesh \n'
'Barbados \n'
'Belarus \n'
'Belgium \n'
'Belize \n'
'Benin \n'
'Bhutan \n'
'Bolivia (Plurinational \n'
'   State of) \n'
'Bosnia and Herzegovina \n'
'Botswana \n'
'Brazil \n'
'Brunei Darussalam \n'
'Bulgaria \n'
'Burkina Faso \n'
'Burundi \n'
'Cabo Verde \n'
'Cambodia \n'
'Cameroon \n'
'Canada \n'
'Central African Republic \n'
'Chad \n'
'Chile \n'
'China \n'
'Colombia \n'
'Comoros \n'
'Congo \n'
'\n'
'Cook Islands \n'
'Costa Rica \n'
'Côte d’Ivoire \n'
'Croatia \n'
'Cuba \n'
'Cyprus \n'
'Czech Republic \n'
'Democratic People’s \n'
'   Republic of Korea \n'
'Democratic Republic \n'
'   of the Congo \n'
'Denmark \n'
'Djibouti \n'
'Dominica \n'
'Dominican Republic \n'
'Ecuador \n'
'Egypt \n'
'El Salvador \n'
'Equatorial Guinea \n'
'Eritrea \n'
'Estonia \n'
'Ethiopia \n'
'Fiji \n'
'Finl

Now, when we run the above script, we get what looks like all the lines with country names returned. However, if we look closely at the output, we will find that our script does not parse the lines with country names adequately. The following issues can be identified:

1. Line breaks with or without white spaces. For example, at the end of the output, you will find 
    ```
        'Viet Nam \n'
        'Yemen \n'
        'Zambia \n'
        'Zimbabwe \n'
        ' \n'
        '\n'
    ```
    The script we have written so far neither excludes the lines equal to `'\n'` nor handles the lines containing only white spaces. Note that line breaks, as shown above, are difficult to find with the naked eye. That is why we used `repr()` to print out each line. 
2. Countries with names spreading over more than one line. For example,
   ```
       'Bolivia (Plurinational \n'
       '   State of) \n'
   ```
3. All the country names end with `'\n'`, and some country names contain special characters such as the typographic apostrophe `’` (U+2019, whose UTF-8 bytes are `\xe2\x80\x99`), for example
    ```
        'Democratic People’s \n'
        '   Republic of Korea \n'
    ```
    We need to clean those names to make them readable.

First, we start by excluding all the lines that are equal to `'\n'` or that contain only white spaces followed by `'\n'`. Here we choose to use regular expressions:
```python
    import re
    reg = re.compile(r"^\s*$")
    for line in pdf_txt:
        is_blank = reg.match(line) is not None  # True for '\n', ' \n', ...
```
This regular expression matches all lines that are empty or contain only white space characters. Inserting this code into our script, we have
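
Before plugging the pattern into the script, it can be checked against a few hand-written sample lines. The samples below are made up for illustration; the name `blank` is our own.

```python
import re

blank = re.compile(r"^\s*$")

samples = ["\n", " \n", "   \n", "Zambia \n", "   State of) \n"]

# Map each sample to whether the pattern treats it as a blank line.
is_blank = {s: blank.match(s) is not None for s in samples}
for s, flag in is_blank.items():
    print(repr(s), flag)
```

Only the first three samples count as blank; an indented continuation line such as `'   State of) \n'` still contains text and is therefore kept.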

#### (1) Exclude the lines equal to `'\n'` or `'  \n'`

In [50]:
import re
reg = re.compile(r"^\s*$")  # matches lines that are empty or contain only white space

pdfTxtFile = './en_final_table_2.txt'
isCountryName = False
with open(pdfTxtFile, 'r', encoding='utf-8') as pdf_txt:
    for line in pdf_txt:
        # Print the country names and skip the white-space-only lines
        if isCountryName and reg.match(line) is None:
            print(repr(line))
        # Switch on: the lines that follow 'and areas' are country names
        if line.startswith('and areas'):
            isCountryName = True
        # Switch off: a white-space-only line ends the country list
        elif isCountryName and reg.match(line) is not None:
            isCountryName = False

'Afghanistan \n'
'Albania \n'
'Algeria \n'
'Andorra \n'
'Angola \n'
'Antigua and Barbuda \n'
'Argentina \n'
'Armenia \n'
'Australia \n'
'Austria \n'
'Azerbaijan \n'
'Bahamas \n'
'Bahrain \n'
'Bangladesh \n'
'Barbados \n'
'Belarus \n'
'Belgium \n'
'Belize \n'
'Benin \n'
'Bhutan \n'
'Bolivia (Plurinational \n'
'   State of) \n'
'Bosnia and Herzegovina \n'
'Botswana \n'
'Brazil \n'
'Brunei Darussalam \n'
'Bulgaria \n'
'Burkina Faso \n'
'Burundi \n'
'Cabo Verde \n'
'Cambodia \n'
'Cameroon \n'
'Canada \n'
'Central African Republic \n'
'Chad \n'
'Chile \n'
'China \n'
'Colombia \n'
'Comoros \n'
'Congo \n'
'Cook Islands \n'
'Costa Rica \n'
'Côte d’Ivoire \n'
'Croatia \n'
'Cuba \n'
'Cyprus \n'
'Czech Republic \n'
'Democratic People’s \n'
'   Republic of Korea \n'
'Democratic Republic \n'
'   of the Congo \n'
'Denmark \n'
'Djibouti \n'
'Dominica \n'
'Dominican Republic \n'
'Ecuador \n'
'Egypt \n'
'El Salvador \n'
'Equatorial Guinea \n'
'Eritrea \n'
'Estonia \n'
'Ethiopia \n'
'Fiji \n'
'Finland \

#### (2) Merge country names that spread over more than one line

To resolve the second issue in the list, let's look at all the country names that spread over two lines. 
```
'Bolivia (Plurinational \n'
'   State of) \n'
'Democratic People’s \n'
'   Republic of Korea \n'
'Democratic Republic \n'
'   of the Congo \n'
'Lao People’s \n'
'   Democratic Republic \n'
'Micronesia (Federated \n'
'   States of) \n'
'Saint Vincent and \n'
'   the Grenadines \n'
'The former Yugoslav \n'
'   Republic of Macedonia \n'
'United Republic \n'
'   of Tanzania \n'
'Venezuela (Bolivarian \n'
'   Republic of) \n'
```
There is a consistent pattern: the second line of each of those names **starts with a few white spaces**. To find all the lines starting with white spaces, we can use the following regular expression
```python
    re.match(r"^\s+", line) is not None
```
The regular expression matches strings that **start with 1 or more white spaces**. Now you should see the difference between `'*'` (zero or more) and `'+'` (one or more) in regular expressions. 
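
The difference between the two quantifiers can be seen on the two halves of a split name. This is a small sketch on hand-written samples, not part of the scraping script.

```python
import re

samples = ["Bolivia (Plurinational \n", "   State of) \n"]

results = []
for s in samples:
    star = re.match(r"^\s*", s) is not None  # zero or more: matches every line
    plus = re.match(r"^\s+", s) is not None  # one or more: needs leading white space
    results.append((star, plus))
    print(repr(s), star, plus)
```

Because `^\s*` also matches the empty prefix, it cannot tell the two lines apart; `^\s+` matches only the indented continuation line.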

However, this regular expression can only identify every second line in the above list. **Our final goal is to merge**, for example, 'Bolivia (Plurinational \n' and '   State of) \n' **into one line**. To do so, we will create a variable, called `previous_line`, to **temporarily store** 'Bolivia (Plurinational \n' before we hit '   State of) \n'.
The updated script is as follows.

In [51]:
import re
reg = re.compile(r"^\s*$")  # matches lines that are empty or contain only white space
pdfTxtFile = './en_final_table_2.txt'
isCountryName = False
previous_line = ''  # cache the preceding line

with open(pdfTxtFile, 'r', encoding='utf-8') as pdf_txt:
    for line in pdf_txt:
        if isCountryName and reg.match(line) is None:  # not a white-space-only line
            if re.match(r"^\s+", line) is not None:    # starts with one or more white spaces
                line = ''.join([previous_line, line])  # join it to the preceding line
            print(repr(line))

        if line.startswith('and areas'):  # start of a country list
            isCountryName = True
        elif isCountryName and reg.match(line) is not None:  # white-space-only line ends the list
            isCountryName = False
        previous_line = line  # cache the line right before the next one

'Afghanistan \n'
'Albania \n'
'Algeria \n'
'Andorra \n'
'Angola \n'
'Antigua and Barbuda \n'
'Argentina \n'
'Armenia \n'
'Australia \n'
'Austria \n'
'Azerbaijan \n'
'Bahamas \n'
'Bahrain \n'
'Bangladesh \n'
'Barbados \n'
'Belarus \n'
'Belgium \n'
'Belize \n'
'Benin \n'
'Bhutan \n'
'Bolivia (Plurinational \n'
'Bolivia (Plurinational \n   State of) \n'
'Bosnia and Herzegovina \n'
'Botswana \n'
'Brazil \n'
'Brunei Darussalam \n'
'Bulgaria \n'
'Burkina Faso \n'
'Burundi \n'
'Cabo Verde \n'
'Cambodia \n'
'Cameroon \n'
'Canada \n'
'Central African Republic \n'
'Chad \n'
'Chile \n'
'China \n'
'Colombia \n'
'Comoros \n'
'Congo \n'
'Cook Islands \n'
'Costa Rica \n'
'Côte d’Ivoire \n'
'Croatia \n'
'Cuba \n'
'Cyprus \n'
'Czech Republic \n'
'Democratic People’s \n'
'Democratic People’s \n   Republic of Korea \n'
'Democratic Republic \n'
'Democratic Republic \n   of the Congo \n'
'Denmark \n'
'Djibouti \n'
'Dominica \n'
'Dominican Republic \n'
'Ecuador \n'
'Egypt \n'
'El Salvador \n'
'Equatorial Gu

After joining the previous line with the current line, we have not yet removed the previous line from the printout. Next we will remove those redundant lines and store all the country names in a list, `countryNames`.

In [52]:
import pprint
import re
reg = re.compile(r"^\s*$")

pdfTxtFile = './en_final_table_2.txt'
isCountryName = False
previous_line = ''
countryNames = []  # for storing the country names
with open(pdfTxtFile, 'r', encoding='utf-8') as pdf_txt:
    for line in pdf_txt:
        if isCountryName and reg.match(line) is None:
            if re.match(r"^\s+", line) is not None:
                line = ''.join([previous_line, line])
                del countryNames[-1]       # delete the first half of the name
                countryNames.append(line)  # add the merged name
            else:
                countryNames.append(line)

        if line.startswith('and areas'):
            isCountryName = True
        elif isCountryName and reg.match(line) is not None:
            isCountryName = False
        previous_line = line
pprint.pprint(countryNames)

['Afghanistan \n',
 'Albania \n',
 'Algeria \n',
 'Andorra \n',
 'Angola \n',
 'Antigua and Barbuda \n',
 'Argentina \n',
 'Armenia \n',
 'Australia \n',
 'Austria \n',
 'Azerbaijan \n',
 'Bahamas \n',
 'Bahrain \n',
 'Bangladesh \n',
 'Barbados \n',
 'Belarus \n',
 'Belgium \n',
 'Belize \n',
 'Benin \n',
 'Bhutan \n',
 'Bolivia (Plurinational \n   State of) \n',
 'Bosnia and Herzegovina \n',
 'Botswana \n',
 'Brazil \n',
 'Brunei Darussalam \n',
 'Bulgaria \n',
 'Burkina Faso \n',
 'Burundi \n',
 'Cabo Verde \n',
 'Cambodia \n',
 'Cameroon \n',
 'Canada \n',
 'Central African Republic \n',
 'Chad \n',
 'Chile \n',
 'China \n',
 'Colombia \n',
 'Comoros \n',
 'Congo \n',
 'Cook Islands \n',
 'Costa Rica \n',
 'Côte d’Ivoire \n',
 'Croatia \n',
 'Cuba \n',
 'Cyprus \n',
 'Czech Republic \n',
 'Democratic People’s \n   Republic of Korea \n',
 'Democratic Republic \n   of the Congo \n',
 'Denmark \n',
 'Djibouti \n',
 'Dominica \n',
 'Dominican Republic \n',
 'Ecuador \n',
 'Egypt \n',
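
The merging step can also be exercised in isolation on a small in-memory sample. The function below is a simplified sketch of the same idea (the name `merge_continuations` and the sample list are our own); it treats any line that begins with white space as the continuation of the previous name.

```python
def merge_continuations(lines):
    """Join each indented continuation line onto the name before it."""
    names = []
    for line in lines:
        if line[:1].isspace() and names:
            # Continuation: extend the previous entry instead of adding a new one.
            names[-1] = names[-1] + line
        else:
            names.append(line)
    return names

sample = ['Bhutan \n', 'Bolivia (Plurinational \n', '   State of) \n', 'Botswana \n']
merged = merge_continuations(sample)
print(merged)
```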
 

#### (3) Clean country names that end with `'\n'` or contain special characters

We have collected all the country names from the text version of our PDF file. The total number of countries is 197. Now, we are going to do some **cleaning to resolve the last issue**. Data cleaning will be explained in greater detail in Module 3. For now, we will just clean up the country names, as they are not easy to read. We wrap the cleaning code into a Python function as follows.
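```python
    def clean(line):
        line = line.strip('\n')  # remove leading and trailing '\n'
        line = line.strip()      # remove leading and trailing white spaces
        line = line.replace('\u2019', "'")  # replace the typographic apostrophe
        return line
```
The first line in the function removes both the leading and the trailing new line characters, `'\n'`. The second line removes the leading and trailing white spaces. The third line replaces the typographic apostrophe `’` with a plain `'`. Note that when the file is read as UTF-8 text in Python 3, this apostrophe is the single character U+2019 rather than the byte sequence `\xe2\x80\x99`, so the replacement has to target that character. Now insert the `clean` function into the FOR-loop.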

In [55]:
import pprint
import re
reg = re.compile(r"^\s*$")

def clean(line):
    line = line.strip('\n')  # remove leading and trailing '\n'
    line = line.strip()      # remove leading and trailing white spaces
    line = line.replace('\u2019', "'")  # typographic apostrophe (U+2019) -> plain apostrophe
    return line

pdfTxtFile = './en_final_table_2.txt'
isCountryName = False
previous_line = ''
countryNames = []
with open(pdfTxtFile, 'r', encoding='utf-8') as pdf_txt:
    for line in pdf_txt:
        if isCountryName and reg.match(line) is None:
            if re.match(r"^\s+", line) is not None:
                line = ' '.join([clean(previous_line), clean(line)])
                del countryNames[-1]
                countryNames.append(line)
            else:
                countryNames.append(clean(line))

        if line.startswith('and areas'):
            isCountryName = True
        elif isCountryName and reg.match(line) is not None:
            isCountryName = False
        previous_line = line
pprint.pprint(countryNames)

['Afghanistan',
 'Albania',
 'Algeria',
 'Andorra',
 'Angola',
 'Antigua and Barbuda',
 'Argentina',
 'Armenia',
 'Australia',
 'Austria',
 'Azerbaijan',
 'Bahamas',
 'Bahrain',
 'Bangladesh',
 'Barbados',
 'Belarus',
 'Belgium',
 'Belize',
 'Benin',
 'Bhutan',
 'Bolivia (Plurinational State of)',
 'Bosnia and Herzegovina',
 'Botswana',
 'Brazil',
 'Brunei Darussalam',
 'Bulgaria',
 'Burkina Faso',
 'Burundi',
 'Cabo Verde',
 'Cambodia',
 'Cameroon',
 'Canada',
 'Central African Republic',
 'Chad',
 'Chile',
 'China',
 'Colombia',
 'Comoros',
 'Congo',
 'Cook Islands',
 'Costa Rica',
 "Côte d'Ivoire",
 'Croatia',
 'Cuba',
 'Cyprus',
 'Czech Republic',
 "Democratic People's Republic of Korea",
 'Democratic Republic of the Congo',
 'Denmark',
 'Djibouti',
 'Dominica',
 'Dominican Republic',
 'Ecuador',
 'Egypt',
 'El Salvador',
 'Equatorial Guinea',
 'Eritrea',
 'Estonia',
 'Ethiopia',
 'Fiji',
 'Finland',
 'France',
 'Gabon',
 'Gambia',
 'Georgia',
 'Germany',
 'Ghana',
 'Greece',
 'Gre
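
The cleaning step can be checked on its own with a few hand-written sample lines. The sketch below uses a hypothetical helper, `clean_name`, and made-up samples. It assumes the file was read as UTF-8 text in Python 3, where the typographic apostrophe is the single character U+2019 rather than the three bytes `\xe2\x80\x99`, so the replacement has to target that character.

```python
def clean_name(line):
    """Strip surrounding white space and normalise the apostrophe."""
    return line.strip().replace('\u2019', "'")

samples = ['Congo \n', '   Republic of Korea \n', 'Democratic People\u2019s \n']
cleaned = [clean_name(s) for s in samples]
print(cleaned)
```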

Finally, we have successfully extracted the names of 197 countries and stored them in a list. Next we are going to extract all the columns.

In [56]:
len(countryNames)

197

### 1.3 Extracting all the table columns

Extracting the table columns is not as easy as collecting the country names. Scanning the text file, you will find that you cannot simply create a Boolean variable, as there are **no patterns that can be used to identify the start and the end of each column** on each PDF page. All columns on the same PDF page are stacked together and interleaved with either text or line breaks. 

How can we extract the data in columns and recover the table structure? 

Fortunately, it looks like `pdf2txt.py` extracted the data from our PDF file in a **column-wise** way. **Each cell in our PDF was extracted as one line** in the text file; and cells were stacked according to the linear layout of the table on each page. If we are able to **extract all the cell values in order** and **know the number of records** and **the number of columns on each page**, we might then be able to unstack all the cells and put them into a tabular format. 
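
The unstacking idea can be sketched on a tiny made-up example before we apply it to the real file. The function name `unstack` and the sample values are assumptions for illustration: given a flat list in which the cells of each column follow one another, the cell in row `r` and column `c` sits at position `c * num_records + r`.

```python
def unstack(cells, num_records, num_columns):
    """Turn a column-by-column list of cells into a list of rows."""
    if len(cells) != num_records * num_columns:
        raise ValueError("unexpected number of cells")
    rows = []
    for r in range(num_records):
        # Cell (r, c) sits at position c * num_records + r in the flat list.
        rows.append([cells[c * num_records + r] for c in range(num_columns)])
    return rows

# Made-up sample: 2 columns of 3 records each, stacked column by column
flat = ['–', '4', '6 x', '–', '66', '–']
table = unstack(flat, num_records=3, num_columns=2)
for row in table:
    print(row)
```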

Getting the number of columns in the PDF table is easy. **Manually counting the number of columns**, you will get the number 12. 

Next, let's **count the number of records on each PDF page**. Since we have written the script to collect all the country names from each PDF page, it should be easy to compute the number of records on each page by inserting a count variable into the same script. Let the count variable be `numRec`. When the script hits the line starting with `'and areas'`, we set `numRec` to zero. We then increase `numRec` by one every time we successfully retrieve a name until the script hits the end of the name block. Let's insert this logic into the script and save the counts in a list, `recordsPerPage`.

In [62]:
import re
reg = re.compile(r"^\s*$")

def clean(line):
    line = line.strip('\n')
    line = line.strip()
    line = line.replace('\u2019', "'")  # typographic apostrophe (U+2019) -> plain apostrophe
    return line

pdfTxtFile = './en_final_table_2.txt'
isCountryName = False
previous_line = ''
countryNames = []
numRec = 0           # number of records found so far on the current page
recordsPerPage = []  # number of records on each page

with open(pdfTxtFile, 'r', encoding='utf-8') as pdf_txt:
    for line in pdf_txt:
        if isCountryName and reg.match(line) is None:
            if re.match(r"^\s+", line) is not None:
                line = ' '.join([clean(previous_line), clean(line)])
                del countryNames[-1]
                countryNames.append(line)
                numRec -= 1  # the first half of this name was already counted
            else:
                countryNames.append(clean(line))
            numRec += 1  # a record is successfully added

        if line.startswith('and areas'):
            isCountryName = True
            numRec = 0  # restart counting, as a new page begins
        elif isCountryName and reg.match(line) is not None:
            isCountryName = False
            recordsPerPage.append(numRec)  # end of the name block: save the count
        previous_line = line

print(recordsPerPage)

[39, 38, 38, 39, 38, 5]
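
The counting logic can be tried out on a small in-memory sample as well. The sketch below is a simplified variant (the name `count_records` and the sample lines are our own): it only counts, and it skips indented continuation lines instead of adjusting the count afterwards.

```python
def count_records(lines):
    """Count the names between each 'and areas' line and the next blank line."""
    counts = []
    in_block = False
    count = 0
    for line in lines:
        if line.startswith('and areas'):
            in_block, count = True, 0
        elif in_block and line.strip() == '':
            # A white-space-only line closes the name block.
            in_block = False
            counts.append(count)
        elif in_block and not line[:1].isspace():
            # Indented lines continue the previous name, so they are not counted.
            count += 1
    return counts

sample = ['and areas\n', 'Bhutan \n', 'Bolivia (Plurinational \n', '   State of) \n', '\n',
          'and areas\n', 'Viet Nam \n', ' \n']
counts = count_records(sample)
print(counts)
```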


To collect all the cell values from the text file, we will use regular expressions. The values that each cell can take are
```
    '6  \n'
    '6 x \n'  
    '12 x \n'
    '39 x,y \n'
    '96 x,y\n'
    '76 y \n'
    '100 \n'
    '100 x\n'
    '90 w \n'
    '–  \n'
```
The patterns in the above values are easy to spot: a cell holds either an en dash or a number with 1, 2, or 3 digits, optionally followed by `'x'`, `'x,y'`, `'y'`, or `'w'`. Taking white spaces into account, we generate the following regular expression to encode all the patterns.
```python
    regx = re.compile(r"^(\d{1,3}|–)\s?(x|x,y|y|w)?\s*$")
```
* `^(\d{1,3}|–)`: The matched line should start with an **en dash** `'–'` or a **number** with 1 to 3 digits
* `\s?`: The dash or number may be followed by **at most one** white space character
* `(x|x,y|y|w)?`: The matched line should contain **none or one** of the elements in the parentheses
* `\s*$`: The matched line should end with **zero or more** white space characters.
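
Before running the pattern over the whole file, it can be sanity-checked against the sample values listed above and against a few lines that should not count as cell values. The list of non-values is made up from lines that appear in the text file; the variable names are our own.

```python
import re

cell = re.compile(r"^(\d{1,3}|–)\s?(x|x,y|y|w)?\s*$")

values = ['6  \n', '6 x \n', '12 x \n', '39 x,y \n', '96 x,y\n',
          '76 y \n', '100 \n', '100 x\n', '90 w \n', '–  \n']
non_values = ['Congo \n', '2008–2012*\n', '(%) \n', '\n']

all_values_match = all(cell.match(v) is not None for v in values)
any_non_value_matches = any(cell.match(v) is not None for v in non_values)
print(all_values_match, any_non_value_matches)
```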

The following script will print out all the cell values that match these patterns.

```python
    import re
    regx = re.compile(r"^(\d{1,3}|–)\s?(x|x,y|y|w)?\s*$")
    pdfTxtFile = './en_final_table_2.txt'

    with open(pdfTxtFile, 'r', encoding='utf-8') as pdf_txt:
        for line in pdf_txt:
            if regx.match(line) is not None:
                print(repr(line))
```

However, this script will also extract '2' in the following lines:
```
    'T\n'
    'A\n'
    'B\n'
    'L\n'
    'E\n'
    '2\n'
```
**`'2'` following `'E'` is not a cell value.**  We need to exclude it in order to keep rows and columns properly aligned.  
Checking whether the line preceding `'2'` is equal to `'E\n'` will solve this problem.  
Similar to the method we used to handle country names that spread over two lines, 
we introduce a string variable, `previous_line`, to cache the preceding line.  
Thus, if the preceding line is `'E\n'`, the following line equal to `'2\n'` will be excluded. We add the following 
condition to the IF statement

```python
    re.match(r"^E\s*$", previous_line) is None
```

So, the updated script will be

```python
    import re
    regx = re.compile(r"^(\d{1,3}|–)\s?(x|x,y|y|w)?\s*$")
    pdfTxtFile = './en_final_table_2.txt'
    previous_line = ''
    with open(pdfTxtFile, 'r', encoding='utf-8') as pdf_txt:
        for line in pdf_txt:
            # keep cell values, but skip the '2' that follows 'E' in 'TABLE 2'
            if regx.match(line) is not None and re.match(r"^E\s*$", previous_line) is None:
                print(repr(line))
            previous_line = line
```

Now, we can merge all the scripts that we have written so far and generate the final script 
for scraping data tables from the PDF file.  
In the following merged script, the part that collects country names and counts the number of 
records on each page is wrapped in a Python function, called `extract`.  
This function takes the text file as input and outputs two lists, one for country names, and another for record counts.  
The extracted data is going to be stored in a dictionary, where keys are column indices and values 
are lists of cell values in individual columns.

In [64]:
import re

def clean(line):
    """
        Clean extra '\n', whitespaces, and special characters
    """
    line = line.strip('\n')
    line = line.strip() 
    line = line.replace('\xe2\x80\x99', '\'')
    return line

def extract(pdfTxtFile):
    """
        Collecting all the country names and counting the number
        of records, i.e., countries, on each page. 
    """
    reg = re.compile("^\s*$")# pattern for string with all whitespaces
    isCountryName = False
    countryNames = []
    recordsPerPage = []
    numRec = 0
    previous_line =''

    pdf_txt = open(pdfTxtFile, 'r')
    for line in pdf_txt:
        if isCountryName and reg.match(line) == None: 
            #print repr(line)
            if re.match(r"^\s+", line) != None:
                line = ' '.join([clean(previous_line), clean(line)])
                del countryNames[-1] #delete the previous line
                countryNames.append(line)
                numRec -= 1
            else:
                countryNames.append(clean(line))
            numRec += 1

        if line.startswith('and areas'):
            isCountryName = True
            numRec = 0
        elif isCountryName and reg.match(line) !=None:
            isCountryName = False
            recordsPerPage.append(numRec)
        previous_line = line
    return countryNames,recordsPerPage

pdfTxtFile = './en_final_table_2.txt'
# All country names have been parsed into countryNames 
countryNames, recordsPerPage = extract(pdfTxtFile)
# Here we are going to extract the data values of each column
regx = re.compile(r"^(\d{1,3}|–)\s?(x|x,y|y|w)?\s*$")
totalNumCols = 12
# initialise variables
pageNum = -1
numRecs = 0
colIdx = 0

# Python dictionary used to store all the data
data = {}
for i in range(totalNumCols):
    data[i] = []  # one list per column, 12 in total

idx = 0
previous_line = ''
with open(pdfTxtFile, 'r') as pdf_txt:
    for line in pdf_txt:
        if line.startswith('and areas'):
            pageNum += 1
            numRecs = recordsPerPage[pageNum]
            colIdx = 0
            idx = 0
        if (regx.match(line) is not None
                and re.match(r"^E\s*$", previous_line) is None
                and colIdx < totalNumCols):
            line = line.strip()  # cleaning
            data[colIdx].append(line)
            idx += 1
            if idx % numRecs == 0:  # this column on this page has been fully parsed
                colIdx += 1
        previous_line = line

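To see what the cell-value pattern used above accepts, it can be tried on a few lines in isolation. The following is a minimal sketch: the sample lines are made up for illustration and are not copied from the real text file.

```python
import re

# Same pattern as in the script above: a 1-3 digit number or an en dash,
# optionally followed by a footnote flag (x, "x,y", y or w).
cell_pattern = re.compile(r"^(\d{1,3}|–)\s?(x|x,y|y|w)?\s*$")

# Hypothetical sample lines, invented for this illustration.
samples = ["94 y\n", "–\n", "11\n", "32 x,y\n", "Zimbabwe\n", "2014\n"]

# Keep only the lines the pattern recognises as cell values.
matches = [s.strip() for s in samples if cell_pattern.match(s)]
print(matches)
```

Country names and four-digit numbers such as years are rejected, which is why the pattern can pick cell values out of the surrounding text.
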
### 1.4 Storing data in CSV format 

The final step of scraping data from PDFs is to store the extracted data in a machine readable format. Here 
we are going to store the data in CSV format using Pandas.

In [66]:
import pandas as pd
df = pd.DataFrame(data, index = countryNames)
df.to_csv('finish.csv')

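Before building the DataFrame, it can help to confirm that every column list holds exactly one value per country, since a mismatch is a sign that some page was not parsed as expected. The sketch below uses small made-up stand-ins for `countryNames` and `data`; the helper name `mismatched_columns` is hypothetical.

```python
# Made-up stand-ins for the `countryNames` list and the `data` dictionary.
country_names = ["Country A", "Country B", "Country C"]
columns = {0: ["1", "2", "3"], 1: ["4", "5"]}  # column 1 is one value short

def mismatched_columns(columns, n_rows):
    """Return {column index: length} for columns whose length differs from n_rows."""
    return {idx: len(vals) for idx, vals in columns.items() if len(vals) != n_rows}

bad = mismatched_columns(columns, len(country_names))
print(bad)
```

An empty result means all columns line up with the list of country names.
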
Scraping data from PDFs using `pdf2txt.py` is crude: you need to go over the text dozens of times 
**to manually identify patterns** and then encode these patterns as regular expressions.  

If you check the CSV file, 
you will find that the script **does not correctly extract the table on the last page of our PDF**. The patterns
we found while extracting cell values do not apply to the text extracted from the last page, where 
`pdf2txt.py` has stacked the columns in arbitrary order. However, one can imagine that a tool 
that can **make use of the location information of the text elements** would solve this problem. 

Besides the above approach, there are multiple ways of scraping data from PDFs that utilise the meta information
encapsulated in a PDF. We will walk through some of them in the following sections.
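
To make the idea of using location information concrete, here is a minimal sketch that rebuilds table rows from text elements with known positions. It is an illustration only, not how any particular library works: the `(x, y, text)` tuples, their coordinates, and the helper `rows_from_positions` are all invented, and it assumes that a larger `y` means higher up on the page.

```python
from collections import defaultdict

# Hypothetical text elements as (x, y, text); the coordinates are invented.
elements = [
    (300, 700, "65"), (50, 700, "Zimbabwe"), (200, 700, "11"),
    (200, 720, "6"), (50, 720, "Bolivia"), (300, 720, "64"),
]

def rows_from_positions(elements, y_tolerance=2):
    """Group elements into rows by y position, then order each row by x."""
    rows = defaultdict(list)
    for x, y, text in elements:
        key = round(y / y_tolerance)  # crude bucketing of nearly equal y values
        rows[key].append((x, text))
    # Assumption: larger y is higher on the page, so sort buckets in descending order.
    return [[text for _, text in sorted(cells)]
            for _, cells in sorted(rows.items(), reverse=True)]

table = rows_from_positions(elements)
print(table)
```

Even though the elements arrive in arbitrary order, sorting by position recovers each record with its cells in the right columns.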
* * *

## 2. Scraping data from PDFs with  pdftables
 
After scratching our heads at the complexity of the `pdf2txt.py` approach, we started searching for
other tools or libraries that **make use of information on the locations of text elements in a PDF document**. We came across a Python library called `pdftables`. In this section, you will learn how to use pdftables to extract data from our PDF file. To install this library, use the following command:
```shell
    pip install pdftables.six
```
Note that installing pdftables might **downgrade your numpy version**, which could cause Pandas to fail. In this case,
you need to upgrade your numpy after installing pdftables. We should mention that the drawback of using pdftables 
is that its developers do not maintain proper documentation. Hence you might <font color='red'>need to look at the source code to figure out</font> the functions that you are going to use. Nevertheless, it is a good tool for extracting data tables from
PDFs. You will eventually find that the **all-in-one function** that you are going to use to get the data is
```python
    pdftables.get_tables()
```

In this section, we will use the same PDF file as we used in the previous section to demonstrate how to use pdftables to scrape all the tables from that PDF file. Let's start with loading our PDF with the `get_tables()` function.

In [7]:
import pdftables

ModuleNotFoundError: No module named 'python'

In [1]:
from pdftables import get_tables
pdfFile = './EN_FINAL_Table_2.pdf'
pdfobj = open(pdfFile, 'rb')
tables = get_tables(pdfobj)

ImportError: cannot import name 'PDFObjectNotFound'

The above script will take a couple of seconds to load our PDF. The `get_tables()` function returns each page as its
own table; each table is a list of rows, and each row is in turn a list of column values. You can use the following Python code to print out each row in each table:
```python
    for table in tables:
        for row in table:
            print(row)
```

In [None]:
for table in tables:
    for row in table[:10]:
        print (row)
    print ('==========================\n')

What did you find? 

All the titles are included in the first five lists, and they are very messy. For simplicity,
we do not extract the column titles with a Python script here. Instead, we assume that we can set up the 
title list manually by eyeballing the original PDF. We can also see that the country rows start from the sixth list
and that those rows are quite clean. To exclude the first five lists in each table, we can use list slicing in 
the FOR loop over rows:
``` python
    for row in table[5:]:
```

Similarly, if we print out the last 10 rows of each table,

In [None]:
for table in tables:
    for row in table[-10:]:
        print (row)
    print ('==========================\n')

Again, what did you find?

In the first five tables, the last country row is always followed by a row similar to
```
['39      THE STATE OF', 'THE WORLD\xe2\x80\x99', 'S CHILDREN', '2014 IN', 'NU', 'MBERS', '', '', '', '', '', '', '', '']
```
Therefore, our script should stop collecting country rows when it hits a row like the one above. In the FOR loop over rows, we should have something like
```python
    if 'THE STATE OF' in row[0]:
        break
```

Unfortunately, the above pattern does not apply to the last table, which needs special treatment. If we look at the 
original PDF file, we will see that the last country row to be collected is the row for 'Zimbabwe'. It appears in
the last table as
```
['Zimbabwe', '11', '65', '31', '', '86', '20', '10', '2', '32', '3', '6', '61', '94 y']
```
Thus, we can put another IF statement to check if the first string in the list is 
equal to 'Zimbabwe'. If it is, then we stop collecting country rows after collecting the current row. 
```python
    if row[0] == 'Zimbabwe':
        print(row)
        break
```
Let's insert this logic into the FOR loop over rows.

In [None]:
for table in tables:
    for row in table[5:]:
        if 'THE STATE' in row[0]:
            break
        if row[0] == 'Zimbabwe':
            print (row)
            break
        print (row)
    print ('==========================\n')

Running the above script does return all the country rows. However, it also returns split rows such as
```
['Bolivia (Plurinational', '', '', '', '', '', '', '', '', '', '', '', '', '']
['State of)', '6', '64', '60', '', '83', '40', '4', '1', '27', '1', '9', '41', '89 y']
```
This is similar to what we found earlier in Section 1, when we were handling country names spread over two lines.
We want to solve this problem programmatically, based on what we have learnt so far. Since '–' is used to indicate missing data in our PDF, a genuine record never consists of empty cells only. So if the first element of the row is a non-empty string
and all the following elements are empty, this row must contain the first part of a country name. The code below checks a single cell (`row[2]`) as a shortcut for this test. Before we skip this
row, we need to use a variable (say `first_name`) to cache the first part, as we need to merge it with the
corresponding second part.
The code should look like
```python
    if row[2] == '':
        first_name = row[0]
        continue
```
Since these country names spread over two consecutive rows, we add the following IF statement to join the two parts 
of a country name:
```python
    if first_name != '':
        row[0] = '{} {}'.format(first_name, row[0])
        first_name = ''
```
Now, we put the two IF statements into the FOR loop over rows.

In [None]:
first_name = ''

for table in tables:
    for row in table[5:]:
        if row[2] == '':
            first_name = row[0]
            continue
        if first_name != '':
            row[0] = '{} {}'.format(first_name, row[0])
            first_name = ''
        if 'THE STATE OF' in row[0]:
            break
        if row[0] == 'Zimbabwe':
            print (row)
            break
        print (row)

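The same logic can also be wrapped in a function and checked on a small hand-made table, which is a convenient way to test the rules without loading the PDF. This is a sketch under stated assumptions: the function name `collect_country_rows` and its parameters are hypothetical, and the toy table only imitates the shape of the rows shown above, shortened to four cells.

```python
def collect_country_rows(table, header_rows=5, last_country="Zimbabwe"):
    """Hypothetical helper: return the cleaned country rows of one page's table."""
    rows = []
    first_name = ""
    for row in table[header_rows:]:
        if row[2] == "":               # only part of a name: cache it and move on
            first_name = row[0]
            continue
        if first_name != "":           # second half of a split country name
            row = ["{} {}".format(first_name, row[0])] + row[1:]
            first_name = ""
        if "THE STATE OF" in row[0]:   # page footer: no more countries on this page
            break
        rows.append(row)
        if row[0] == last_country:     # final record of the last table
            break
    return rows

# Toy table: five placeholder header rows followed by shortened sample rows.
toy_table = [["header"] * 4 for _ in range(5)] + [
    ["Bolivia (Plurinational", "", "", ""],
    ["State of)", "6", "64", "60"],
    ["Zimbabwe", "11", "65", "31"],
    ["trailing notes", "a", "b", "c"],
]
result = collect_country_rows(toy_table)
for row in result:
    print(row)
```

Unlike the loop above, this version builds a new list for a merged row instead of modifying the row in place, so the input table is left unchanged.
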
We have now extracted all the country rows from the six tables. Next, we are going to store them in a pandas
DataFrame. There are multiple ways of creating a DataFrame; here we create one by passing a dictionary of objects.

In [None]:
import pandas as pd

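data = {}
first_name = ''
for table in tables:
    for row in table[5:]:
        if row[2] == '':
            first_name = row[0]  # cache the first part of a split country name
            continue
        if first_name != '':
            row[0] = '{} {}'.format(first_name, row[0])
            first_name = ''
        if row[0] == 'Zimbabwe':
            data[row[0]] = row[1:]
            break
        if 'THE STATE' in row[0]:
            break
        data[row[0]] = row[1:] 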
data = pd.DataFrame(data)
data

When you run the script, you will find that the fourth row is empty. The number of rows is supposed to be 12, as there are 12 columns in our PDF. The empty row needs to be dropped, which can easily be done with the `drop()` function of a
pandas DataFrame, i.e.,
```python
    data.drop(3, axis=0)
```
The last step is to transpose the DataFrame so that each row is a record for a country, and save the data into a CSV file.

In [None]:
data = data.drop(3, axis=0)
data = data.T
data.columns = range(12)
data.to_csv('./en_final_table_2_2.csv', sep='\t')

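Instead of hard-coding the label of the empty row, it could also be located programmatically. The sketch below does this on a toy DataFrame with made-up values and then transposes it; the names `empty_rows` and `cleaned` are chosen for illustration.

```python
import pandas as pd

# Toy frame shaped like the one above: one column per country,
# one row per PDF column. The values are made up.
data = pd.DataFrame({
    "Country A": ["11", "65", "31", "", "86"],
    "Country B": ["6", "64", "60", "", "83"],
})

empty_rows = (data == "").all(axis=1)      # True where every cell is an empty string
cleaned = data.loc[~empty_rows].T          # drop those rows, then one row per country
cleaned.columns = range(cleaned.shape[1])  # renumber the columns from 0
print(cleaned)
```

This keeps working if the position of the empty row changes, at the cost of also dropping any other row that happens to be entirely empty.
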
## 3. Summary

PDF is one of the hard-to-parse formats that you will encounter. 
In this chapter, we have learnt how to scrape data tables from PDFs using the following Python libraries: 
* pdfminer - converts a PDF into text, so you can parse the text file by finding patterns and writing regular
    expressions
* pdftables - uses pdfminer to find both text elements and their locations, and puts aligned elements in a list.

* * *

## 4. Exercise 
1. [Tabula](http://tabula.technology/) is an open source tool that is specifically designed for scraping data within tables from PDFs and saving the data into a CSV file. It is well suited to a small PDF like the one we used in this chapter, 
so please download Tabula and try it on our PDF file. 
