
# Extract information from a PDF document using [Tabula](https://pypi.org/project/tabula-py/)

tabula-py is not preinstalled in most Python environments, so install it with the command below. It also needs a Java runtime, because it wraps the Java library tabula-java.

In [1]:
!pip install tabula-py

Collecting tabula-py
  Downloading https://files.pythonhosted.org/packages/cf/29/d6cb0d77ef46d84d35cffa09cf42c73b373aea664d28604eab6818f8a47c/tabula_py-2.2.0-py3-none-any.whl (11.7MB)
Collecting distro
  Downloading https://files.pythonhosted.org/packages/25/b7/b3c4270a11414cb22c6352ebc7a83aaa3712043be29daa05018fd5a5c956/distro-1.5.0-py2.py3-none-any.whl
Installing collected packages: distro, tabula-py
Successfully installed distro-1.5.0 tabula-py-2.2.0
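
tabula-py wraps the Java library tabula-java, so the pip package alone is not enough: a `java` executable has to be on the PATH. A minimal check, where `java_available` is a hypothetical helper name introduced only for this sketch:

```python
import shutil


def java_available():
    """Return True if a `java` executable is found on the PATH."""
    return shutil.which("java") is not None


if not java_available():
    print("Java not found: install a Java runtime before calling tabula.read_pdf")
```

The check only confirms that an executable named `java` exists; it does not verify the Java version.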


I'll demonstrate how to extract the table on page 21 of the Ministry of Health's list of FAQs on Covid-19, which can be found [here](https://www.infosihat.gov.my/images/media_sihat/lain_lain/pdf/SOALAN%20LAZIM%20COVID-19.pdf). The code below expects the PDF to be in the working directory under the file name `SOALAN LAZIM COVID-19.pdf`.

In [30]:
import tabula

# Read page 21 of the PDF into a list of DataFrames (one per detected table)
sample_list = tabula.read_pdf("SOALAN LAZIM COVID-19.pdf", pages='21')

sample_list

Got stderr: Dec 01, 2020 5:10:45 AM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>
Dec 01, 2020 5:10:47 AM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



[       NEGERI   BIL                        HOSPITAL SARINGAN
 0      Perlis   1.0          Hospital Tuanku Fauziah, Kangar
 1         NaN   NaN                                      NaN
 2       Kedah   2.0    Hospital Sultanah Bahiyah, Alor Setar
 3         NaN   NaN                                      NaN
 4         NaN   3.0  Hospital Sultan Abdul Halim, Sg. Petani
 5         NaN   NaN                                      NaN
 6         NaN   4.0                           Hospital Kulim
 7         NaN   NaN                                      NaN
 8         NaN   5.0       Hospital Sultanah Maliha, Langkawi
 9         NaN   NaN                                      NaN
 10        NaN   6.0                           Hospital Jitra
 11        NaN   NaN                                      NaN
 12        NaN   7.0                    Hospital Kuala Nerang
 13        NaN   NaN                                      NaN
 14        NaN   8.0                             Hospital Yan
 15     

Next, take the first DataFrame from the list and remove the empty spacer rows.

In [31]:
import numpy as np 

df = sample_list[0]

# Keep only rows that have a hospital name (this drops the all-NaN spacer rows)
df = df[df['HOSPITAL SARINGAN'].notna()]
df

Unnamed: 0,NEGERI,BIL,HOSPITAL SARINGAN
0,Perlis,1.0,"Hospital Tuanku Fauziah, Kangar"
2,Kedah,2.0,"Hospital Sultanah Bahiyah, Alor Setar"
4,,3.0,"Hospital Sultan Abdul Halim, Sg. Petani"
6,,4.0,Hospital Kulim
8,,5.0,"Hospital Sultanah Maliha, Langkawi"
10,,6.0,Hospital Jitra
12,,7.0,Hospital Kuala Nerang
14,,8.0,Hospital Yan
16,,9.0,Hospital Sik
18,,10.0,Hospital Baling
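
The filter above keeps rows where `HOSPITAL SARINGAN` is present. If the goal is only to discard rows that are empty in every column, `dropna(how="all")` is a possible alternative. The frame below is a hand-typed stand-in for the first rows of `sample_list[0]`, not the full extraction:

```python
import numpy as np
import pandas as pd

# Stand-in for the first rows of the extracted table (values copied from the output above).
sample = pd.DataFrame({
    "NEGERI": ["Perlis", np.nan, "Kedah", np.nan, np.nan],
    "BIL": [1.0, np.nan, 2.0, np.nan, 3.0],
    "HOSPITAL SARINGAN": [
        "Hospital Tuanku Fauziah, Kangar",
        np.nan,
        "Hospital Sultanah Bahiyah, Alor Setar",
        np.nan,
        "Hospital Sultan Abdul Halim, Sg. Petani",
    ],
})

# Drop only the rows in which every column is NaN.
cleaned = sample.dropna(how="all")
print(cleaned)
```

Row 4 survives even though its `NEGERI` is NaN, because `BIL` and `HOSPITAL SARINGAN` are filled.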


In [32]:
# Replace the remaining NaNs with empty strings
result = df.fillna('')
result

Unnamed: 0,NEGERI,BIL,HOSPITAL SARINGAN
0,Perlis,1.0,"Hospital Tuanku Fauziah, Kangar"
2,Kedah,2.0,"Hospital Sultanah Bahiyah, Alor Setar"
4,,3.0,"Hospital Sultan Abdul Halim, Sg. Petani"
6,,4.0,Hospital Kulim
8,,5.0,"Hospital Sultanah Maliha, Langkawi"
10,,6.0,Hospital Jitra
12,,7.0,Hospital Kuala Nerang
14,,8.0,Hospital Yan
16,,9.0,Hospital Sik
18,,10.0,Hospital Baling
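
The empty `NEGERI` cells most likely come from the state name being printed once per group in the PDF table (an assumption based on the output, where Kedah is followed by several hospitals). If every row should carry its state, forward-filling is one option in place of blanking. A sketch on a hand-typed stand-in frame:

```python
import numpy as np
import pandas as pd

# Stand-in for the filtered table (values copied from the output above).
sample = pd.DataFrame({
    "NEGERI": ["Perlis", "Kedah", np.nan, np.nan],
    "BIL": [1.0, 2.0, 3.0, 4.0],
    "HOSPITAL SARINGAN": [
        "Hospital Tuanku Fauziah, Kangar",
        "Hospital Sultanah Bahiyah, Alor Setar",
        "Hospital Sultan Abdul Halim, Sg. Petani",
        "Hospital Kulim",
    ],
})

filled = sample.copy()
# Propagate the last seen state name down to the rows that lack one.
filled["NEGERI"] = filled["NEGERI"].ffill()
# BIL is a running number, so show it as an integer instead of 1.0, 2.0, ...
filled["BIL"] = filled["BIL"].astype(int)
print(filled)
```

Forward-filling is only correct if the rows are still in their original PDF order.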


To export the result as a CSV file and download it from Colab:

In [None]:
result.to_csv("result.csv")
from google.colab import files
files.download("result.csv")
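
`google.colab` can only be imported inside Colab; elsewhere, `to_csv` on its own writes the file. The sketch below also passes `index=False`, an optional choice that leaves out the non-contiguous row index (0, 2, 4, ...). `result_demo` is a small stand-in for `result`:

```python
import io

import pandas as pd

# Stand-in for `result` (values copied from the output above).
result_demo = pd.DataFrame(
    {
        "NEGERI": ["Perlis", "Kedah", ""],
        "BIL": [1.0, 2.0, 3.0],
        "HOSPITAL SARINGAN": [
            "Hospital Tuanku Fauziah, Kangar",
            "Hospital Sultanah Bahiyah, Alor Setar",
            "Hospital Sultan Abdul Halim, Sg. Petani",
        ],
    },
    index=[0, 2, 4],
)

# Write to an in-memory buffer here; pass a file name such as "result.csv" to write to disk.
buffer = io.StringIO()
result_demo.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Hospital names that contain a comma are quoted automatically, so the file still parses into three columns.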

# Extract information from an image using Google Vision API's [OCR](https://cloud.google.com/vision/docs/ocr) (Optical Character Recognition)

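Start by defining two helper functions based on the Google Vision API's [OCR](https://cloud.google.com/vision/docs/ocr) samples: `implicit`, which makes an authenticated request by listing storage buckets, and `detect_text`, which returns the text annotations for an image. The key file referenced by `GOOGLE_APPLICATION_CREDENTIALS` can be obtained by creating a service account key using this [method](https://cloud.google.com/docs/authentication/production).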

In [None]:
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="OCRproject-135c7e667aa9.json"

def implicit():
    from google.cloud import storage

    # If you don't specify credentials when constructing the client, the
    # client library will look for credentials in the environment.
    storage_client = storage.Client()

    # Make an authenticated API request
    buckets = list(storage_client.list_buckets())
    print(buckets)

def detect_text(path):
    """Detects text in the file."""
    from google.cloud import vision
    import io
    client = vision.ImageAnnotatorClient()

    with io.open(path, 'rb') as image_file:
        content = image_file.read()

    # google-cloud-vision 2.0+ exposes Image directly;
    # older releases used vision.types.Image instead.
    image = vision.Image(content=content)

    response = client.text_detection(image=image)
    texts = response.text_annotations

    return texts
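
A missing or mistyped key file only surfaces once a client is created. A small pre-check can make that failure easier to read. This is a sketch, not part of the Vision API: `credentials_path` is a hypothetical helper, and its `env` parameter exists only so the lookup can be tried without touching the real environment.

```python
import os


def credentials_path(env=os.environ):
    """Return the configured key-file path, or raise if it is unset or missing."""
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Key file not found: {path}")
    return path


# Typical use, before creating any client:
#     credentials_path()
```

The helper checks only that the file exists, not that the key inside it is valid.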

Install the google-cloud-vision library if it is not already available in your environment.

In [None]:
!pip install google-cloud-vision

For this demonstration, I will extract the electricity usage table from the sample bill image [here](https://www.tnb.com.my/assets/images/bill_with_sstv2.png). The code expects the image to be in the working directory as `bill_with_sstv2.png`.

In [None]:
import matplotlib.pyplot as plt
import cv2
%matplotlib inline

tnb_text = detect_text("bill_with_sstv2.png")
tnb_text

[locale: "ms"
description: "BIL ELEKTRIK ANDA\nTENAGA\nNASIONAL\nNo. Akaun : 220001234512\n: 1002000\n: RM350.00\nTERIMA KASIH\nNo. Kontrak\nKerana\nDeposit\nNo. Invois\nMembayar Dalam\nTempoh 30 Hari\n: 10001234\nAli bin Abu\n3\nTNB Careline\n1-300-88-5454\nE23A - 201 Sek 3\nWangsa Maju\n53300\nWP Kuala Lumpur\nTarikh Bil\nJumlahPerlu Dibayar RM311.90\n03 Okt 2018\nAmaun\nBayar Sebelum\nTunggakan\nCaj Semasa\nPenggenapan\nRM\n0.00\nTerima kasih\nRM 311.88\n0,02\nRM\nJumlah Bil\nRM\n311.90\n01.11.2018\nAmaun\n52685\nTarikh\n5\nBil Terdahulu\nRM\n02.08.2018\nBayaran Akhir\nRM\n526.85\n30.08.2018\nJenis Bacaan\nBacaan Sebenar\nTempoh Bil : 03.09.2018 - 03.10.2018 (31 Hari)\nTarif\nFaktor Prorata\n:A: Kediaman\n1,00000\nBlok Tarif (KWh)\n200\n100\n300\n300\nBlok Prorata (kWh)\n200\n100\n300\n130\nKadar (RM)\n0.218\nAmaun (RM)\n43.60\n33.40\n154.80\n70.98\n8\n0.334\n0.516\n0.546\nJumlah\n730\n302.78\nTidak Kena\nST\nKena\nST\nKeterangan\nJumlah\nKegunaan kWh\nKegunaan\nkWh\n600\n130\n730\n

The first annotation holds the full detected text, so take only that element and convert it to a string.

In [None]:
tnb_raw = str(tnb_text[0])
tnb_raw

'locale: "ms"\ndescription: "BIL ELEKTRIK ANDA\\nTENAGA\\nNASIONAL\\nNo. Akaun : 220001234512\\n: 1002000\\n: RM350.00\\nTERIMA KASIH\\nNo. Kontrak\\nKerana\\nDeposit\\nNo. Invois\\nMembayar Dalam\\nTempoh 30 Hari\\n: 10001234\\nAli bin Abu\\n3\\nTNB Careline\\n1-300-88-5454\\nE23A - 201 Sek 3\\nWangsa Maju\\n53300\\nWP Kuala Lumpur\\nTarikh Bil\\nJumlahPerlu Dibayar RM311.90\\n03 Okt 2018\\nAmaun\\nBayar Sebelum\\nTunggakan\\nCaj Semasa\\nPenggenapan\\nRM\\n0.00\\nTerima kasih\\nRM 311.88\\n0,02\\nRM\\nJumlah Bil\\nRM\\n311.90\\n01.11.2018\\nAmaun\\n52685\\nTarikh\\n5\\nBil Terdahulu\\nRM\\n02.08.2018\\nBayaran Akhir\\nRM\\n526.85\\n30.08.2018\\nJenis Bacaan\\nBacaan Sebenar\\nTempoh Bil : 03.09.2018 - 03.10.2018 (31 Hari)\\nTarif\\nFaktor Prorata\\n:A: Kediaman\\n1,00000\\nBlok Tarif (KWh)\\n200\\n100\\n300\\n300\\nBlok Prorata (kWh)\\n200\\n100\\n300\\n130\\nKadar (RM)\\n0.218\\nAmaun (RM)\\n43.60\\n33.40\\n154.80\\n70.98\\n8\\n0.334\\n0.516\\n0.546\\nJumlah\\n730\\n302.78\\nTidak K

Separate the values by splitting on the delimiter, which is the escaped newline `\\n` in the string form.

In [None]:
tnb_list = tnb_raw.split("\\n")
tnb_list

['locale: "ms"\ndescription: "BIL ELEKTRIK ANDA',
 'TENAGA',
 'NASIONAL',
 'No. Akaun : 220001234512',
 ': 1002000',
 ': RM350.00',
 'TERIMA KASIH',
 'No. Kontrak',
 'Kerana',
 'Deposit',
 'No. Invois',
 'Membayar Dalam',
 'Tempoh 30 Hari',
 ': 10001234',
 'Ali bin Abu',
 '3',
 'TNB Careline',
 '1-300-88-5454',
 'E23A - 201 Sek 3',
 'Wangsa Maju',
 '53300',
 'WP Kuala Lumpur',
 'Tarikh Bil',
 'JumlahPerlu Dibayar RM311.90',
 '03 Okt 2018',
 'Amaun',
 'Bayar Sebelum',
 'Tunggakan',
 'Caj Semasa',
 'Penggenapan',
 'RM',
 '0.00',
 'Terima kasih',
 'RM 311.88',
 '0,02',
 'RM',
 'Jumlah Bil',
 'RM',
 '311.90',
 '01.11.2018',
 'Amaun',
 '52685',
 'Tarikh',
 '5',
 'Bil Terdahulu',
 'RM',
 '02.08.2018',
 'Bayaran Akhir',
 'RM',
 '526.85',
 '30.08.2018',
 'Jenis Bacaan',
 'Bacaan Sebenar',
 'Tempoh Bil : 03.09.2018 - 03.10.2018 (31 Hari)',
 'Tarif',
 'Faktor Prorata',
 ':A: Kediaman',
 '1,00000',
 'Blok Tarif (KWh)',
 '200',
 '100',
 '300',
 '300',
 'Blok Prorata (kWh)',
 '200',
 '100',
 '300',
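
The split above works on the string form of the annotation, which is why the first element still carries the `locale` prefix and why the delimiter is an escaped `\\n`. The annotation also exposes its text through a `description` field, which holds real newline characters. A sketch of that alternative, using a `SimpleNamespace` stand-in because a real annotation requires an API call:

```python
from types import SimpleNamespace

# Stand-in for tnb_text[0]; only the first few lines of the bill are reproduced here.
first_annotation = SimpleNamespace(
    locale="ms",
    description="BIL ELEKTRIK ANDA\nTENAGA\nNASIONAL\nNo. Akaun : 220001234512",
)

# The field contains real newlines, so split on "\n" and not on the escaped "\\n".
lines = first_annotation.description.split("\n")
print(lines)
```

With this approach the start and end of the list differ from the string-based split, so any hard-coded positions should be re-checked before reuse.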

Identify the slices of the list that contain the data I am looking for. The positions below are specific to this OCR output and will differ for other images.

In [None]:
block = tnb_list[58:63]
block

['Blok Tarif (KWh)', '200', '100', '300', '300']

In [None]:
prorated = tnb_list[63:68]
prorated

['Blok Prorata (kWh)', '200', '100', '300', '130']

In [None]:
rate = tnb_list[68:70] + tnb_list[76:79]
rate

['Kadar (RM)', '0.218', '0.334', '0.516', '0.546']

In [None]:
amount = tnb_list[70:75]
amount

['Amaun (RM)', '43.60', '33.40', '154.80', '70.98']
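
Hard-coded positions such as `58:63` break as soon as the OCR output shifts by one item. Looking up each header label and taking the items that follow it is a slightly more robust sketch. `values_after` is a hypothetical helper, and `tokens` is an excerpt of `tnb_list` copied from the output above:

```python
# Excerpt of the OCR token list, in the order shown earlier.
tokens = [
    'Blok Tarif (KWh)', '200', '100', '300', '300',
    'Blok Prorata (kWh)', '200', '100', '300', '130',
    'Kadar (RM)', '0.218',
    'Amaun (RM)', '43.60', '33.40', '154.80', '70.98',
    '8', '0.334', '0.516', '0.546', 'Jumlah',
]


def values_after(items, label, count):
    """Return the label plus the `count` items that follow its first occurrence."""
    start = items.index(label)
    return items[start:start + count + 1]


block = values_after(tokens, 'Blok Tarif (KWh)', 4)
prorated = values_after(tokens, 'Blok Prorata (kWh)', 4)
amount = values_after(tokens, 'Amaun (RM)', 4)
print(block, prorated, amount, sep="\n")
```

The rate column is interleaved with the amount column in this OCR output, so it still needs the manual stitching used above.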

Turn the lists into a table, then promote the first row to column headers.

In [None]:
import pandas as pd
compiled = pd.DataFrame([block, prorated, rate, amount]).T
compiled

Unnamed: 0,0,1,2,3
0,Blok Tarif (KWh),Blok Prorata (kWh),Kadar (RM),Amaun (RM)
1,200,200,0.218,43.60
2,100,100,0.334,33.40
3,300,300,0.516,154.80
4,300,130,0.546,70.98


In [None]:
compiled.columns = compiled.iloc[0]

Finally, drop the first row, which now duplicates the column headers. Here's the final result.

In [None]:
compiled = compiled.drop([0])
compiled

Unnamed: 0,Blok Tarif (KWh),Blok Prorata (kWh),Kadar (RM),Amaun (RM)
1,200,200,0.218,43.60
2,100,100,0.334,33.40
3,300,300,0.516,154.80
4,300,130,0.546,70.98
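
Every value in the table is still text, because it came from OCR. A possible follow-up is to convert the columns to numbers and use the arithmetic of the bill as a sanity check: prorated kWh times rate should reproduce each amount. The sketch below rebuilds the table from the four lists in one step (an alternative to transposing and promoting the first row) and names it `table` to keep it apart from `compiled`:

```python
import numpy as np
import pandas as pd

block = ['Blok Tarif (KWh)', '200', '100', '300', '300']
prorated = ['Blok Prorata (kWh)', '200', '100', '300', '130']
rate = ['Kadar (RM)', '0.218', '0.334', '0.516', '0.546']
amount = ['Amaun (RM)', '43.60', '33.40', '154.80', '70.98']

# Use each list's first item as the column name and the rest as its values.
table = pd.DataFrame({col[0]: col[1:] for col in (block, prorated, rate, amount)})

# Convert every column from text to numbers.
table = table.apply(pd.to_numeric)

# Sanity check: prorated kWh x rate should match the billed amount per block.
expected = table['Blok Prorata (kWh)'] * table['Kadar (RM)']
amounts_match = np.allclose(expected, table['Amaun (RM)'])
total = round(table['Amaun (RM)'].sum(), 2)
print(amounts_match, total)
```

The total can be compared with the value that follows `Jumlah` in the OCR text; a mismatch would point to a misread digit.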
