<a href="https://colab.research.google.com/github/atlas-github/abs_digital/blob/master/Extracting_text_from_images_PDFs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Extract information from a PDF document using [Tabula](https://pypi.org/project/tabula-py/)

tabula-py usually isn't preinstalled in most environments, so install the library using the command below.

In [None]:
!pip install tabula-py

I'll be demonstrating how to extract information from page 4 of Maybank's Annual Report 2019, which can be found [here](https://www.maybank.com/en/investor-relations/reporting-events/reports/annual-reports.page).

In [3]:
import tabula

# Read page 4 of the PDF into a list of DataFrames (one per table found)
sample_list = tabula.read_pdf("Maybank Annual Report 2019 - Financial Statements (English).pdf", pages='4')

sample_list

[                                           Unnamed: 0  ...                Bank
 0                                                 NaN  ...           FY 31 Dec
 1                                                 NaN  ...           2018 2019
 2                     OPERATING RESULTS (RM’ million)  ...                 NaN
 3                                   Operating revenue  ...      26,681  26,906
 4         Pre-provisioning operating profit (“PPOP”)1  ...       9,491  10,283
 5                                    Operating profit  ...        8,748  8,415
 6                    Profit before taxation and zakat  ...        8,748  8,415
 7   Profit attributable to equity holders of the Bank  ...        7,308  7,279
 8   KEY STATEMENTS OF FINANCIAL POSITION DATA (RM’...  ...                 NaN
 9                                        Total assets  ...    456,613  464,360
 10                   Financial investments portfolio2  ...    121,354  126,286
 11                      Loans, advances

Next, take the first DataFrame from the list and promote the row containing the years to the column header.

In [4]:
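df = sample_list[0]      # first table found on the page
df = df.drop([0])        # drop the "FY 31 Dec" row
df.columns = df.iloc[0]  # use the row of years as the column header
df = df.drop([1])        # drop that row now that it is the header
#df[1, 0] = "Five-Year Group Financial Summary"
df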

1,NaN,2015,2016,2017,2018,2019,2018 2019
2,OPERATING RESULTS (RM’ million),,,,,,
3,Operating revenue,40556.0,44658.0,45580,47320,52845,"26,681 26,906"
4,Pre-provisioning operating profit (“PPOP”)1,10953.0,11686.0,11911,12416,13179,"9,491 10,283"
5,Operating profit,8940.0,8671.0,9883,10803,10856,"8,748 8,415"
6,Profit before taxation and zakat,9152.0,8844.0,10098,10901,11014,"8,748 8,415"
7,Profit attributable to equity holders of the Bank,6836.0,6743.0,7521,8113,8198,"7,308 7,279"
8,KEY STATEMENTS OF FINANCIAL POSITION DATA (RM’...,,,,,,
9,Total assets,708345.0,735956.0,765302,806992,834413,"456,613 464,360"
10,Financial investments portfolio2,122166.0,130902.0,154373,177952,192830,"121,354 126,286"
11,"Loans, advances and financing",453493.0,477775.0,485584,507084,513420,"230,367 226,589"
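
The header-promotion steps can also be tried without the PDF. Below is a minimal sketch on a hand-built frame; `raw` and its values are a hypothetical miniature of what Tabula returned, not the actual extraction.

```python
import pandas as pd

# Hypothetical miniature of the Tabula output: the real header sits in row 1.
raw = pd.DataFrame(
    {
        "Unnamed: 0": [None, None, "Operating revenue", "Operating profit"],
        "Group": ["FY 31 Dec", "2019", "52845", "10856"],
        "Bank": ["FY 31 Dec", "2018 2019", "26,681 26,906", "8,748 8,415"],
    }
)

table = raw.drop(index=[0])    # discard the "FY 31 Dec" row
table.columns = table.iloc[0]  # the first remaining row becomes the header
table = table.drop(index=[1])  # remove the row that is now the header
table.columns.name = None      # clear the leftover row label shown in the header corner

print(table)
```

Assigning a row to `df.columns` carries that row's index label along as the name of the column index, which is why a stray `1` shows up in the top-left corner of the output above.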


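Next, clean the extracted data: split the combined "2018 2019" column into separate Bank columns and give every column a descriptive name.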

In [48]:
import pandas as pd

#split the last column on the first " " (single space) into two columns
bank = df["2018 2019"].str.split(" ", n = 1, expand = True)

#combine the two dataframes
result = pd.concat([df, bank], axis = 1)

#rename the dataframe headers
result.columns = result.columns.fillna("Five-year Group Financial Summary")
result = result.rename(columns = {"2015": "2015_Group", "2016": "2016_Group", "2017": "2017_Group", "2018": "2018_Group", "2019": "2019_Group", 0: "2018_Bank", 1: "2019_Bank"})

#drop the extra column
result = result.drop(columns = ["2018 2019"])

#replace NaN with empty cells
result = result.fillna("")
result

Unnamed: 0,Five-year Group Financial Summary,2015_Group,2016_Group,2017_Group,2018_Group,2019_Group,2018_Bank,2019_Bank
2,OPERATING RESULTS (RM’ million),,,,,,,
3,Operating revenue,40556.0,44658.0,45580,47320,52845,26681,26906
4,Pre-provisioning operating profit (“PPOP”)1,10953.0,11686.0,11911,12416,13179,9491,10283
5,Operating profit,8940.0,8671.0,9883,10803,10856,8748,8415
6,Profit before taxation and zakat,9152.0,8844.0,10098,10901,11014,8748,8415
7,Profit attributable to equity holders of the Bank,6836.0,6743.0,7521,8113,8198,7308,7279
8,KEY STATEMENTS OF FINANCIAL POSITION DATA (RM’...,,,,,,,
9,Total assets,708345.0,735956.0,765302,806992,834413,456613,464360
10,Financial investments portfolio2,122166.0,130902.0,154373,177952,192830,121354,126286
11,"Loans, advances and financing",453493.0,477775.0,485584,507084,513420,230367,226589
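
Because `str.split` returns strings, the two Bank columns are text rather than numbers. If the values still carry thousands separators, as in the "26,681 26,906" cells earlier, they need converting before any arithmetic. Here is a minimal sketch of one way to do that, assuming a comma is always the thousands separator; the `bank` series reuses two values from the table above plus a hypothetical missing cell.

```python
import pandas as pd

# Two values copied from the extracted column, plus a hypothetical empty cell.
bank = pd.Series(["26,681 26,906", "9,491 10,283", None], name="2018 2019")

# n=1 splits on the first space only; expand=True returns one column per part.
parts = bank.str.split(" ", n=1, expand=True)
parts.columns = ["2018_Bank", "2019_Bank"]

# Strip the thousands separators, then convert; missing cells become NaN.
numeric = parts.apply(
    lambda col: pd.to_numeric(col.str.replace(",", "", regex=False))
)

print(numeric)
```

This conversion should run before `fillna("")`, since filling with empty strings turns numeric columns back into text.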


To export the result as a CSV file and download it from Colab, run the cell below.

In [None]:
from google.colab import files

#write the table to a CSV file, then download it from the Colab runtime
result.to_csv("result.csv")
files.download("result.csv")
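
By default `to_csv` also writes the row index as an unnamed first column. A minimal sketch of leaving it out, using a hypothetical one-row stand-in for `result` and an in-memory buffer instead of a file so it runs outside Colab too:

```python
import io

import pandas as pd

# Hypothetical one-row stand-in for the cleaned table.
result = pd.DataFrame(
    {
        "Five-year Group Financial Summary": ["Operating revenue"],
        "2019_Group": [52845],
    }
)

buffer = io.StringIO()
result.to_csv(buffer, index=False)  # index=False omits the row-index column

print(buffer.getvalue())
```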

#Extract information from an image using Google Vision API's [OCR](https://cloud.google.com/vision/docs/ocr) (Optical Character Recognition)

In [37]:
[s.strip() for s in words.split('  ') if s]

AttributeError: ignored
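
The traceback is cut off, so the cause is not visible here. An `AttributeError` on this line usually means `words` is not a string (for example, a list has no `.split` method). Below is a minimal sketch that handles both cases; the sample text is hypothetical, since the cell that defines `words` is not shown.

```python
# Hypothetical OCR output; in the notebook, `words` comes from a cell not shown here.
words = "Operating revenue  Operating profit  Total assets"

if isinstance(words, str):
    # Split on double spaces and drop empty or whitespace-only pieces.
    tokens = [s.strip() for s in words.split("  ") if s.strip()]
else:
    # Assume an iterable of items; convert each to text before stripping.
    tokens = [str(s).strip() for s in words if str(s).strip()]

print(tokens)
```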