# PDF Files Processing Pipeline
This notebook presents a complete pipeline for transforming annual reports from PDF files into a tabular dataset. The transformed dataset is intended for downstream natural language processing tasks such as text classification and topic modelling.

# Table of Contents

* [1. Required Libraries](#1.)
* [2. Parsing PDF](#2.)
    * [2.1. Parsing PDF Files in Batch](#2.1.)
    * [2.2. Aggregate Text Files](#2.2.)
    * [2.3. File Update](#2.3.)
    * [2.4. Read Json File into Pandas Dataframe](#2.4.)
    * [2.5. Table of Contents (TOC) Parsing](#2.5.)
        * [2.5.1. Expand Page List](#2.5.1.)
        * [2.5.2. Locate TOC Page](#2.5.2.)
        * [2.5.3. Extract TOC](#2.5.3.)
        * [2.5.4. Examples of Manual Editing](#2.5.4.)
            * [a) Change the whole TOC manually directly](#2.5.4.a.)
            * [b) Change the parser mode and matcher mode](#2.5.4.b.)
        * [2.5.5. Assign Headings to Pages](#2.5.5.)


# 1. Required Libraries <a class="anchor" id="1."></a>
The core module pdfparser is built upon the following libraries:
* tika (Java 7+ is required for this library)
* pytesseract (tesseract is required for this library)
* beautifulsoup
* pdf2image
* opencv
* pandas
* numpy 

Other libraries needed are:
* os
* json
* re

In [1]:
import pdfparser as Parser 
import os
import json
import pandas as pd
import numpy as np
import re
import warnings
warnings.filterwarnings('ignore')
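
Since tika and pytesseract depend on external programs (Java and tesseract), it can help to confirm that both are reachable before starting a long parsing run. The following is a minimal sketch; the helper name `missing_tools` is hypothetical and not part of pdfparser, and it only checks that the executables are on PATH, not their versions.

```python
import shutil

# External programs the Python wrappers rely on (executable names as usually found on PATH).
REQUIRED_TOOLS = {
    "java": "needed by tika",
    "tesseract": "needed by pytesseract",
}

def missing_tools(tools=REQUIRED_TOOLS):
    """Return the subset of tools that cannot be found on PATH."""
    return {name: why for name, why in tools.items() if shutil.which(name) is None}

for name, why in missing_tools().items():
    print(f"{name} not found on PATH ({why})")
```

If nothing is printed, both executables were found.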


# 2. Parsing PDF <a class="anchor" id="2."></a>

## 2.1. Parsing PDF Files in Batch <a class="anchor" id="2.1."></a>

Put the PDF files in the main folder (here, the Annual_Report folder). The following code loops through every PDF file in that folder and attempts to parse it: the parser takes the raw PDF and extracts its text content. Because parsing can take some time, the parser also writes the text of each PDF to a txt file so that progress is saved.

In [2]:
# the folder that holds all initial annual reports
parent_path = r".\Data\Reports\Annual_Report"  # raw string keeps the backslashes literal
all_path = os.listdir(parent_path)  # list of file names in the folder
pdf_list = [file for file in all_path if file.endswith('.pdf')]
os.chdir(parent_path)

#Parse the PDF files and write page contents into text files
# the txt files will be stored under the parent_path  
for file_path in pdf_list:
    print(file_path)
    Parser.pdfparser(file_path,delete_existing=True)

Barclays_Plc_Annual_Report_2016.pdf
parsing text-formatted pdf file
parsing finished:Barclays_Plc_Annual_Report_2016
Barclays_Plc_Annual_Report_2017.pdf
parsing text-formatted pdf file
parsing finished:Barclays_Plc_Annual_Report_2017
Barclays_Plc_Annual_Report_2018.pdf
parsing text-formatted pdf file
parsing finished:Barclays_Plc_Annual_Report_2018
Barclays_Plc_Annual_Report_2019.pdf
parsing text-formatted pdf file
parsing finished:Barclays_Plc_Annual_Report_2019
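
Because the parser saves one text file per PDF, an interrupted batch can in principle be resumed by parsing only the PDFs that have no text file yet. The sketch below assumes that the text file for `report.pdf` is named `report_text.txt` (inferred from the `file_name` values shown later in this notebook); `pending_pdfs` is a hypothetical helper, not a pdfparser function.

```python
from pathlib import Path

def pending_pdfs(folder, text_suffix="_text.txt"):
    """List PDFs in `folder` that have no matching text file yet.

    Assumes the text file for `report.pdf` is named `report_text.txt`;
    adjust `text_suffix` if the parser names its output differently.
    """
    folder = Path(folder)
    return sorted(
        pdf.name
        for pdf in folder.glob("*.pdf")
        if not (folder / (pdf.stem + text_suffix)).exists()
    )
```

The returned names could then be passed to the parsing loop in place of `pdf_list`.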


## 2.2. Aggregate Text Files <a class="anchor" id="2.2."></a>
The text files will be dumped into one json file. This step makes it easier to transfer data.

In [3]:
all_text=[]
# re-list the folder: all_path was built before the txt files were written
txt_list = [file for file in os.listdir('.') if file.endswith('.txt')]
for file_path in txt_list:
    # Reads the text file and stores data to dictionary
    all_text.append(Parser.text_to_dict(file_path)) 
with open("raw_text_all.json","w") as f:
    json.dump(all_text,f)
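
To make the layout of the json file concrete, here is a small round-trip sketch. The field names mirror the columns that appear later in this notebook; the record itself is made up for illustration, and the exact set of keys returned by `text_to_dict` is an assumption.

```python
import json

# Hypothetical record: one dictionary per report, with one string per page in raw_text.
record = {
    "file_name": "Example_Co_Annual_Report_2020_text",
    "bookmark": [],
    "raw_text": ["page one text", "page two text"],
    "source_type": "Text",
}

payload = json.dumps([record])   # a list of dictionaries, as written to raw_text_all.json
restored = json.loads(payload)

assert restored == [record]      # the round trip is lossless for this shape
print(len(restored[0]["raw_text"]))  # number of pages in the first report
```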

## 2.3. File Update <a class="anchor" id="2.3."></a>
In case we have new files to process, we can put them in a separate folder to avoid extracting data from the processed files again.

In [4]:
#New annual report updates
os.chdir("../../..")
parent_path = r".\Data\Reports\Annual_Report_Updates"
all_path = os.listdir(parent_path)
pdf_list = [file for file in all_path if file.endswith('.pdf')]
os.chdir(parent_path)
for file_path in pdf_list:
    Parser.pdfparser(file_path,delete_existing=True)

all_text=[]
# re-list the folder so that txt files written by the loop above are included
txt_list = [file for file in os.listdir('.') if file.endswith('.txt')]
print(len(txt_list))
for file_path in txt_list:
    all_text.append(Parser.text_to_dict(file_path))
with open("raw_text_all.json","w") as f:
    json.dump(all_text,f)

parsing image-formatted pdf file
Folder of Aspen_Insurance_UK_Ltd_Annual_Report_2019.pdf already exsits
Delete the exisiting folder
parsing finished:Aspen_Insurance_UK_Ltd_Annual_Report_2019
1


## 2.4. Read Json File into Pandas Dataframe <a class="anchor" id="2.4."></a>
To facilitate downstream text analysis and machine learning tasks, we can now read the transformed json file into a pandas dataframe.

In [5]:
# The main json file
os.chdir("../../..")
parent_path = r".\Data\Reports\Annual_Report"
os.chdir(parent_path)
# orient is passed by keyword; recent pandas versions no longer accept it positionally
df = pd.read_json('raw_text_all.json', orient='records')

# Generate 4 more columns to help us evaluate the results

df['row']=df.index
df['total_page'] = df.raw_text.apply(lambda x: len(x))

# File-name structure may vary, so the Parser does not generate the firm_name and year columns itself
df['firm_name'] = df.file_name.apply(lambda x:
                               re.findall(r'.+(?=_Annual_Report)',x)[0])
df['year'] = df.file_name.apply(lambda x:
                         re.findall(r'.{4}(?=_text)',x)[0])


#Updated file

os.chdir("../../..")
parent_path = r".\Data\Reports\Annual_Report_Updates"
os.chdir(parent_path)
df_update = pd.read_json('raw_text_all.json', orient='records')
df_update['row']=df_update.index
df_update['total_page'] = df_update.raw_text.apply(lambda x: len(x))
df_update['firm_name'] = df_update.file_name.apply(lambda x:
                               re.findall(r'.+(?=_Annual_Report)',x)[0])
df_update['year'] = df_update.file_name.apply(lambda x:
                         re.findall(r'.{4}(?=_text)',x)[0])


# Update df with df_update (must reset column row with new index)
df_new = df[~df['file_name'].isin(df_update['file_name'])].copy()
df_new = pd.concat([df_new,df_update],ignore_index=True,sort=False)
df_new['row'] = df_new.index 
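
The two lookahead patterns above can be tried in isolation. The sketch below uses the same idea with two deliberate changes that are not in the original cell: `\d{4}` instead of `.{4}` for the year, and `None` instead of an `IndexError` when a name does not follow the expected pattern. `split_file_name` is a hypothetical helper.

```python
import re

def split_file_name(file_name):
    """Return (firm_name, year) from names like 'Firm_Annual_Report_2019_text'."""
    firm = re.search(r".+(?=_Annual_Report)", file_name)
    year = re.search(r"\d{4}(?=_text)", file_name)
    # None signals a name that does not follow the expected pattern.
    return (firm.group(0) if firm else None, year.group(0) if year else None)

print(split_file_name("Barclays_Plc_Annual_Report_2016_text"))  # ('Barclays_Plc', '2016')
print(split_file_name("unexpected_name"))                       # (None, None)
```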

## 2.5. Table of Contents (TOC) Parsing <a class="anchor" id="2.5."></a>

Having a table of contents simplifies navigation through the reports, although it is not necessary for every NLP task. This section demonstrates how to extract one with our pdfparser module.

### 2.5.1. Expand Page List <a class="anchor" id="2.5.1."></a>
Previously, the pages of each file were stored in a list. We expand this list so that each page has its own row, which makes the following steps easier.

In [6]:
#Expand dataframe
#each row of the dataframe now contains only the information of one page

df_expanded = Parser.expand_pagelist(df_new)
df_expanded.head()

| | file_name | bookmark | raw_text | source_type | total_page | firm_name | year | page_number |
|---|---|---|---|---|---|---|---|---|
| 0 | Barclays_Plc_Annual_Report_2016_text | [] | \nBarclays PLC\nAnnual Report 2016\n\nBuilding... | Text | 380 | Barclays_Plc | 2016 | 1 |
| 1 | Barclays_Plc_Annual_Report_2016_text | [] | \nThe Detailed Report\nWithin the Annual Repor... | Text | 380 | Barclays_Plc | 2016 | 2 |
| 2 | Barclays_Plc_Annual_Report_2016_text | [] | \nhome.barclays/annualreport Barclays PLC Annu... | Text | 380 | Barclays_Plc | 2016 | 3 |
| 3 | Barclays_Plc_Annual_Report_2016_text | [] | \n02 • Barclays PLC Annual Report 2016 home.ba... | Text | 380 | Barclays_Plc | 2016 | 4 |
| 4 | Barclays_Plc_Annual_Report_2016_text | [] | \nhome.barclays/annualreport Barclays PLC Annu... | Text | 380 | Barclays_Plc | 2016 | 5 |
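
The expansion step can be pictured in plain Python as follows. This is only a sketch of the idea (one record per report becomes one record per page, numbered from 1 as in the output above); it is not the implementation of `Parser.expand_pagelist`, and `expand_pages` is a hypothetical name.

```python
def expand_pages(records):
    """Turn one record per report into one record per page (1-based page_number)."""
    rows = []
    for record in records:
        # Everything except the page list is copied unchanged onto each page row.
        shared = {k: v for k, v in record.items() if k != "raw_text"}
        for page_number, text in enumerate(record["raw_text"], start=1):
            rows.append({**shared, "raw_text": text, "page_number": page_number})
    return rows

reports = [{"file_name": "Example_text", "raw_text": ["cover", "contents", "body"]}]
for row in expand_pages(reports):
    print(row["page_number"], row["raw_text"])
```

In pandas, `DataFrame.explode('raw_text')` performs the same list-to-rows step; the page numbers would still need to be added separately.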


### 2.5.2. Locate TOC Page <a class="anchor" id="2.5.2."></a>
To parse the TOC of a file, we first need to locate it. The pdfparser does this based on the text structure and a set of standard heading examples. If the pdfparser cannot find the TOC, you can always enter the actual page number yourself.

In [7]:
# Read the gold-standard headings.
# They are used to improve the parsing results,
# especially for image PDFs. You can add more headings as you like.
os.chdir("../../..")
standard_headings = Parser.headings_preprocessor(r'.\Data\standard_headings.xlsx')

# locate toc page, the final results are in the column 'toc_page'
df_toc = Parser.find_toc_page(df_expanded,standard_headings)

In [8]:
# If the TOC page number is NaN, check the file manually and enter the page number
for index,row in df_toc[pd.isna(df_toc['toc_page'])].iterrows():
    print(index)
    manual_check = input('''Please input the page where the Table of Contents occurs;
                         if no Table of Contents exists, press Enter directly:''')
    try:
        df_toc.at[index,'toc_page'] = int(manual_check)
    except ValueError:
        # empty or non-numeric input: leave toc_page as NaN
        pass
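
To illustrate what locating a TOC page "based on the text structure and standard heading examples" could look like, here is a simple scoring heuristic. It is an assumption-laden sketch, not the method used by `Parser.find_toc_page`: the regular expression, the `min_score` threshold, and the function names are all hypothetical.

```python
import re

# Lines such as "Strategic report ..... 2" or "04 Strategic report".
TOC_LINE = re.compile(r"^\s*(\d{1,3}\s+\S.*|\S.*?[\s.]+\d{1,3})\s*$")

def toc_score(page_text, standard_headings):
    """Count TOC-like lines plus standard headings mentioned on the page."""
    lines = [line for line in page_text.splitlines() if line.strip()]
    numbered = sum(bool(TOC_LINE.match(line)) for line in lines)
    lowered = page_text.lower()
    known = sum(heading in lowered for heading in standard_headings)
    return numbered + known

def guess_toc_page(pages, standard_headings, min_score=3):
    """Return the 1-based page with the highest score, or None if all are weak."""
    scores = [toc_score(text, standard_headings) for text in pages]
    best = max(range(len(scores)), key=scores.__getitem__, default=None)
    if best is None or scores[best] < min_score:
        return None
    return best + 1

pages = [
    "Annual Report 2020\nBuilding the future",
    "Contents\nStrategic report 2\nDirectors report 6\nFinancial statements 13",
    "The year was good.\nWe grew.",
]
headings = ["strategic report", "directors report", "financial statements"]
print(guess_toc_page(pages, headings))  # 2
```

Returning `None` for weak scores mirrors the NaN case handled by the manual prompt above.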

### 2.5.3. Extract TOC <a class="anchor" id="2.5.3."></a>

In [9]:
# Create a new column to store potential headings
df_expanded = Parser.extract_potential_headings(df_expanded)

# Auto matching
df_toc_final = Parser.auto_parser_matcher(df_expanded,df_toc,standard_headings)
df_toc_final.head()

| file_name | source_type | total_page | firm_name | year | toc_page | toc_headings_candidate | method | indexed_page_dict |
|---|---|---|---|---|---|---|---|---|
| Barclays_Plc_Annual_Report_2016_text | Text | 380 | Barclays_Plc | 2016 | 2.0 | [(2, chairman letter), (4, chief executive rev... | loo_par_auto | {4: (2, 'chairman letter'), 6: (4, 'chief exec... |
| Barclays_Plc_Annual_Report_2017_text | Text | 328 | Barclays_Plc | 2017 | 3.0 | [(2, chairman letter), (4, chief executive rev... | str_par_str_mat | {4: (2, 'chairman letter'), 6: (4, 'chief exec... |
| Barclays_Plc_Annual_Report_2018_text | Text | 364 | Barclays_Plc | 2018 | 3.0 | [(48, directors report), (93, people), (99, re... | str_par_loo_mat | {5: (1000, 'strategic report'), 6: (1000, 'gov... |
| Barclays_Plc_Annual_Report_2019_text | Text | 344 | Barclays_Plc | 2019 | 3.0 | [(2, business profile), (4, chairman introduct... | str_par_auto | {4: (2, 'business profile'), 6: (4, 'chairman ... |
| Aspen_Insurance_UK_Ltd_Annual_Report_2019_text | Image | 58 | Aspen_Insurance_UK_Ltd | 2019 | 2.0 | [(1000, the company), (1000, strategic report)... | str_par_loo_mat | {3: (1000, 'the company'), 4: (1000, 'strategi... |
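
In the first row of the output above, each key of `indexed_page_dict` is larger than the page number inside its tuple by the same amount (4 vs 2, 6 vs 4). This suggests that the keys are positions in the PDF and the tuples hold the printed page number with its heading. Under that assumption, and assuming a constant offset, the mapping can be sketched as below; `index_by_pdf_page` is a hypothetical helper and the second TOC entry is made up.

```python
def index_by_pdf_page(toc_entries, offset):
    """Map PDF page -> (printed page, heading), assuming a constant offset."""
    return {printed + offset: (printed, heading) for printed, heading in toc_entries}

# (printed page, heading) pairs; the second entry is illustrative only.
toc = [(2, "chairman letter"), (10, "strategic report")]
print(index_by_pdf_page(toc, offset=2))
# {4: (2, 'chairman letter'), 12: (10, 'strategic report')}
```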


### 2.5.4. Examples of Manual Editing <a class="anchor" id="2.5.4."></a>
Where the automatic results are not satisfactory, we can edit them manually. Two examples follow:

#### a) Change the whole TOC manually directly <a class="anchor" id="2.5.4.a."></a>

In [None]:
#example only
# file_name =  '.\Data\Reports\Annual_Report_Updates\Aspen_Insurance_UK_Ltd_Annual_Report_2019_text'
# dict_ = {2:(2,'strategic report'),6:(6,'directors report'),
#        7:(7,'statement of directors responsibilities'),
#        8:(8,'independent auditors report'),
#        13:(13,'financial statements'),
#        17:(17,'notes to financial statement'),
#        39:(39,'risk related disclosure'),
#        47:(47,'others')}

# df_toc_final = Parser.manual_editor(df_expanded,df_toc_final,file_name,standard_headings,parser_auto=dict_) 
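
A hand-written TOC dictionary is easy to get wrong, so a quick sanity check before passing it on may be worthwhile. The sketch below assumes the same `{page: (page, heading)}` layout as the commented example; `check_manual_toc` is a hypothetical helper, not part of pdfparser.

```python
def check_manual_toc(toc, total_page):
    """Return a list of problems found in a manually written TOC dictionary."""
    problems = []
    for pdf_page, value in toc.items():
        if not (isinstance(value, tuple) and len(value) == 2):
            problems.append(f"{pdf_page}: value should be a (page, heading) tuple")
            continue
        if not 1 <= pdf_page <= total_page:
            problems.append(f"{pdf_page}: outside 1..{total_page}")
        if not str(value[1]).strip():
            problems.append(f"{pdf_page}: empty heading")
    return problems

manual_toc = {2: (2, "strategic report"), 6: (6, "directors report"), 47: (47, "others")}
print(check_manual_toc(manual_toc, total_page=58))  # []
```

An empty list means no problems were detected by these particular checks.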

#### b) Change the parser mode and matcher mode <a class="anchor" id="2.5.4.b."></a>

In [None]:
# example only: change the parser mode and matcher mode for image pdf
# file_name = '.\Data\Reports\Annual_Report\Barclays_Plc_Annual_Report_2016_text'
# df_toc_final2 = Parser.manual_editor(df_expanded,df_toc_final,file_name,standard_headings,parser='str',matcher='loose')
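
The difference between a strict and a loose matcher can be illustrated with the standard library's `difflib`. This is one plausible reading of the two modes, not pdfparser's actual matching logic; the function name, the `cutoff` value, and the sample headings are assumptions.

```python
from difflib import get_close_matches

def match_heading(candidate, standard_headings, mode="strict", cutoff=0.8):
    """Return the matching standard heading, or None.

    'strict' requires equality after normalisation; 'loose' also accepts the
    closest standard heading whose similarity ratio is at least `cutoff`.
    """
    text = " ".join(candidate.lower().split())  # lower-case and collapse whitespace
    if text in standard_headings:
        return text
    if mode == "loose":
        close = get_close_matches(text, standard_headings, n=1, cutoff=cutoff)
        return close[0] if close else None
    return None

standards = ["strategic report", "directors report", "financial statements"]
print(match_heading("Strategic  Report", standards))                # strategic report
print(match_heading("strateglc rep0rt", standards))                 # None
print(match_heading("strateglc rep0rt", standards, mode="loose"))   # strategic report
```

A loose matcher of this kind tolerates OCR errors, at the cost of occasionally accepting a wrong heading.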

### 2.5.5. Assign Headings to Pages  <a class="anchor" id="2.5.5."></a>
Now we can assign headings to the corresponding pages according to the extracted TOC.

In [10]:
df_expanded = Parser.retrieve_headings(df_expanded,df_toc_final)
df_sections = df_expanded.loc[:,['file_name','firm_name','year','source_type',
                                 'total_page','page_number','raw_text','headings']].copy()
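
Conceptually, assigning headings means giving every page the heading of the most recent section that starts on or before it. The following sketch shows that idea with `bisect`; it assumes a `{pdf_page: (printed_page, heading)}` dictionary like `indexed_page_dict` and is not the implementation of `Parser.retrieve_headings`. The second heading in the sample is a placeholder.

```python
from bisect import bisect_right

def heading_for_pages(toc_by_pdf_page, total_page):
    """Give each page the heading of the latest section starting on or before it."""
    starts = sorted(toc_by_pdf_page)
    result = {}
    for page in range(1, total_page + 1):
        i = bisect_right(starts, page)
        # Pages before the first section start (cover, TOC) get no heading.
        result[page] = toc_by_pdf_page[starts[i - 1]][1] if i else None
    return result

toc = {4: (2, "chairman letter"), 6: (4, "hypothetical next section")}
print(heading_for_pages(toc, total_page=7))
```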