# Extracting data from PDF documents

## Text-Based PDF

### Extracting text from a PDF using PyPDF2

In [2]:
%pip install pypdf2

Collecting pypdf2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: pypdf2
Successfully installed pypdf2-3.0.1
Note: you may need to restart the kernel to use updated packages.


In [8]:
from PyPDF2 import PdfReader

# Open the PDF and collect the text of every page into one string
reader = PdfReader('./documents/text.pdf')
text = ""

for page in reader.pages:
    text += page.extract_text()

print(text)

USENIX Example P aper
Pekka Nik ander
Aalto Univ ersit yJane-Ellen Long
USENIX A ssociation
Abstrac t
This isanexample foraUSENIX paper ,intheform
ofanHTML/ CSS template. Being heavily self-ref-
erential, this template illustrates the features in-
cluded inthis template. Itisexpec ted that the
prospec tiveauthors using HTML/ CSS would create
anewdocument based onthis template, remo ve
the c ontent, and start writing their paper .
Notethat inthis template, youmay haveamul-
ti-paragraph abstrac t.However,that itisnotnec-
essarily agood prac tice.Trytokeep your abstrac t
inone paragraph, and remember that theoptimal
length f or an abstrac t is 200-300 w ords.
1Introduc tion
For the purposes ofUSENIX conferenc epublica-
tions, theauthors, nottheUSENIX staff ,aresolely
responsible forthecontent and format ting oftheir
paper .The purpose ofthis template istohelp
those authors that want touseHTML/ CSS towrite
their papers. This template has been prepared by
HåkonWium Lie, and isbased onaguide

### Extracting the same text from the PDF using PDFMiner

In [11]:
%pip install pdfminer.six

Collecting pdfminer.six
  Downloading pdfminer.six-20240706-py3-none-any.whl.metadata (4.1 kB)
Collecting cryptography>=36.0.0 (from pdfminer.six)
  Using cached cryptography-43.0.3-cp39-abi3-win_amd64.whl.metadata (5.4 kB)
Collecting cffi>=1.12 (from cryptography>=36.0.0->pdfminer.six)
  Using cached cffi-1.17.1-cp313-cp313-win_amd64.whl.metadata (1.6 kB)
Collecting pycparser (from cffi>=1.12->cryptography>=36.0.0->pdfminer.six)
  Using cached pycparser-2.22-py3-none-any.whl.metadata (943 bytes)
Downloading pdfminer.six-20240706-py3-none-any.whl (5.6 MB)

In [14]:
from pdfminer.high_level import extract_text

# Extract text from the PDF file
text = extract_text("./documents/text.pdf")
print(text)

USENIX Example Paper

Pekka Nikander
Aalto University

Jane-Ellen Long
USENIX Association

Abstract
This is an example for a USENIX paper, in the form
of an HTML/CSS template. Being heavily self-ref-
erential, this template illustrates the features in-
cluded in this template. It is expected that the
prospective authors using HTML/CSS would create
a new document based on this template, remove
the content, and start writing their paper.

Note that in this template, you may have a mul-
ti-paragraph abstract. However, that it is not nec-
essarily a good practice. Try to keep your abstract
in one paragraph, and remember that the optimal
length for an abstract is 200-300 words.

1 Introduction
For the purposes of USENIX conference publica-
tions, the authors, not the USENIX staff, are solely
responsible for the content and formatting of their
paper. The purpose of this template is to help
those authors that want to use HTML/CSS to write
their papers. This template has been prepared by
Håkon

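Comparing the two outputs, the PDFMiner extraction is far cleaner than the PyPDF2 one: words are no longer split or fused (`P aper`, `isanexample`), and the author block and paragraphs are separated properly. The reason is that this PDF has a complex layout rather than a simple one, which PyPDF2 handles poorly.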
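Both outputs still carry the end-of-line hyphenation of the original layout (for example `self-ref-` followed by `erential`). The following is a minimal post-processing sketch, not part of either library: the helper name `dehyphenate` is hypothetical, and it assumes that a hyphen directly before a line break always marks a split word, which would wrongly join a genuine compound that happens to break at its hyphen.

```python
import re


def dehyphenate(text):
    """Join words that were split with a hyphen at the end of a line."""
    # A word character, a hyphen, a newline, then another word character:
    # drop the hyphen and the newline so the two halves become one word.
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


# Two lines taken from the extracted abstract above
sample = (
    "of an HTML/CSS template. Being heavily self-ref-\n"
    "erential, this template illustrates the features in-\n"
    "cluded in this template."
)
print(dehyphenate(sample))
```

Applied to the full extraction, this would be `text = dehyphenate(text)` after the call to `extract_text`.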

Let us compare the two libraries we have used to extract text from PDF files.

`PyPDF2`

- Best for: Extracting simple, text-based content.
- Advantages: Lightweight and easy to use.
- Limitations: Struggles with complex layouts or scanned PDFs.

`PDFMiner`

- Best for: Detailed control over text extraction.
- Advantages: Supports layout analysis and font styles.
- Limitations: Slower and more complex to use than PyPDF2.

## Table Extraction from PDF

### Using tabula

In [None]:
%pip install tabula-py

Note: you may need to restart the kernel to use updated packages.


In [33]:
import tabula
import pandas as pd

# Read every table in the PDF; add pandas_options={"header": None} if the first row should not be used as the header
tables = tabula.read_pdf("./documents/table.pdf", pages="all")
df = pd.DataFrame(tables[0])
df

Unnamed: 0.1,STATION\rCODE,Unnamed: 0,LOCATIONS,Unnamed: 1,STATE,Unnamed: 2,Min\rTEMPERATURE\roC,Unnamed: 3,Max\rTEMPERATURE\roC,Mean\rTEMPERATURE\roC,...,Min FECAL\rCOLIFORM\r(MPN/100ml),Max FECAL\rCOLIFORM\r(MPN/100ml),Unnamed: 6,Mean\rFECAL\rCOLIFORM\r(MPN/100ml),Unnamed: 7,Min TOTAL\rCOLIFORM\r(MPN/100ml),Unnamed: 8,Max TOTAL\rCOLIFORM\r(MPN/100ml),Unnamed: 9,Mean\rTOTAL\rCOLIFORM\r(MPN/100ml)
0,1898,,"PETROL PUMP OPP.\rHERO CYCLE,\rLUDHIANA",,PUNJAB,,,,,,...,,,,,,,,,,
1,1900,,"GURCHAARAN SINGH\rHAIBOWAL DAIRY\rCOMPLEX, LUD...",,PUNJAB,,,,,,...,,,,,,,,,,
2,1901,,"DUSSHERA GROUND\rINDUSTRIAL ESTATE,\rLUDHIANA",,PUNJAB,,,,,,...,,,,,,,,,,
3,1902,,"SHUKLA TEA STAL\rPOINT, LUDHIANA",,PUNJAB,,,,,,...,,,,,,,,,,
4,1903,,"PUNJAB\rAGRICULTUREAL\rUNIVERSITY,\rLUDHIANA",,PUNJAB,,,,,,...,,,,,,,,,,
5,2917,,"NEAR HARMANDIR\rSAHEB, AMRITSAR,\rPUNJAB",,PUNJAB,,,,,,...,,,,,,,,,,
6,2918,,"DERA BASSI, PUNJAB",,PUNJAB,,,,,,...,,,,,,,,,,
7,2920,,"HAMIRA VILLAGE,\rPUNJAB",,PUNJAB,,,,,,...,,,,,,,,,,
8,2921,,"HAMIRA VILLAGE,\rPUNJAB",,PUNJAB,,,,,,...,,,,,,,,,,
9,2922,,"LEATHER COMPLEX,\rJALANDHAR, PUNJAB",,PUNJAB,,,,,,...,,,,,,,,,,


In [21]:
df.columns # get the names of the columns 

Index(['STATION\rCODE', 'Unnamed: 0', 'LOCATIONS', 'Unnamed: 1', 'STATE',
       'Unnamed: 2', 'Min\rTEMPERATURE\roC', 'Unnamed: 3',
       'Max\rTEMPERATURE\roC', 'Mean\rTEMPERATURE\roC', 'Min\rpH', 'Max\rpH',
       'Mean\rpH', 'Unnamed: 4', 'Min\rCONDUCTIVITY\r(?mhos/cm)',
       'Max\rCONDUCTIVITY\r(?mhos/cm)', 'Mean\rCONDUCTIVITY\r(?mhos/cm)',
       'Min\rB.O.D.\r(mg/l)', 'Unnamed: 5', 'Max\rB.O.D.\r(mg/l)',
       'Mean\rB.O.D.\r(mg/l)', 'Min\rNITRATE-\rN+\rNITRITE-\rN (mg/l)',
       'Max\rNITRATE-\rN+\rNITRITE-\rN (mg/l)',
       'Mean\rNITRATE-\rN+\rNITRITE-\rN (mg/l)',
       'Min FECAL\rCOLIFORM\r(MPN/100ml)', 'Max FECAL\rCOLIFORM\r(MPN/100ml)',
       'Unnamed: 6', 'Mean\rFECAL\rCOLIFORM\r(MPN/100ml)', 'Unnamed: 7',
       'Min TOTAL\rCOLIFORM\r(MPN/100ml)', 'Unnamed: 8',
       'Max TOTAL\rCOLIFORM\r(MPN/100ml)', 'Unnamed: 9',
       'Mean\rTOTAL\rCOLIFORM\r(MPN/100ml)'],
      dtype='object')

In [22]:
df.info() # summary of the DataFrame; confirms that the 'Unnamed' columns have 0 non-null values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28 entries, 0 to 27
Data columns (total 34 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   STATION\rCODE                       28 non-null     int64  
 1   Unnamed: 0                          0 non-null      float64
 2   LOCATIONS                           28 non-null     object 
 3   Unnamed: 1                          0 non-null      float64
 4   STATE                               28 non-null     object 
 5   Unnamed: 2                          0 non-null      float64
 6   Min\rTEMPERATURE\roC                8 non-null      float64
 7   Unnamed: 3                          0 non-null      float64
 8   Max\rTEMPERATURE\roC                8 non-null      float64
 9   Mean\rTEMPERATURE\roC               8 non-null      float64
 10  Min\rpH                             28 non-null     float64
 11  Max\rpH                             28 non-null     float64
 12  Mean\rpH                            28 non-null     float64
 13  Unnamed: 4            
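The same check can be made programmatically instead of reading the `df.info()` listing. This is a small sketch assuming pandas is installed; the `sample` frame is a hypothetical miniature of the tabula output, not the real table.

```python
import pandas as pd

# Hypothetical miniature of the tabula output: two real columns and one empty one
sample = pd.DataFrame({
    "STATION\rCODE": [1898, 1900],
    "Unnamed: 0": [None, None],
    "LOCATIONS": ["HAMIRA VILLAGE", "DERA BASSI"],
})

# Columns that tabula could not name
unnamed = [col for col in sample.columns if "Unnamed" in col]

# True only if every "Unnamed" column is completely empty
all_empty = bool(sample[unnamed].isna().all().all())
print(all_empty)

# Alternative: drop every all-null column, whatever its name
cleaned = sample.dropna(axis=1, how="all")
print(list(cleaned.columns))
```

Dropping by emptiness rather than by name avoids discarding an "Unnamed" column that unexpectedly holds data.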

In [34]:
import re

# Drop the empty "Unnamed" columns that tabula creates
df = df.drop(columns=[col for col in df.columns if re.search(r'Unnamed', col)])

# Clean the remaining names: remove '?' and turn carriage returns into spaces
df = df.rename(columns=lambda col: re.sub(r'\r', ' ', re.sub(r'[?]', '', col)))

df.columns

Index(['STATION CODE', 'LOCATIONS', 'STATE', 'Min TEMPERATURE oC',
       'Max TEMPERATURE oC', 'Mean TEMPERATURE oC', 'Min pH', 'Max pH',
       'Mean pH', 'Min CONDUCTIVITY (mhos/cm)', 'Max CONDUCTIVITY (mhos/cm)',
       'Mean CONDUCTIVITY (mhos/cm)', 'Min B.O.D. (mg/l)', 'Max B.O.D. (mg/l)',
       'Mean B.O.D. (mg/l)', 'Min NITRATE- N+ NITRITE- N (mg/l)',
       'Max NITRATE- N+ NITRITE- N (mg/l)',
       'Mean NITRATE- N+ NITRITE- N (mg/l)', 'Min FECAL COLIFORM (MPN/100ml)',
       'Max FECAL COLIFORM (MPN/100ml)', 'Mean FECAL COLIFORM (MPN/100ml)',
       'Min TOTAL COLIFORM (MPN/100ml)', 'Max TOTAL COLIFORM (MPN/100ml)',
       'Mean TOTAL COLIFORM (MPN/100ml)'],
      dtype='object')

In [35]:
# optionally, rename the columns to shorter, more readable names
df.columns = ['Station Code',
              'Locations', 'State', 'Min Temp(C)', 'Max Temp(C)', 'Mean Temp(C)',
              'Min pH', 'Max pH', 'Mean pH',
              'Min Conductivity', 'Max Conductivity', 'Mean Conductivity',
              'Min BOD', 'Max BOD', 'Mean BOD',
              'Min Nitrate', 'Max Nitrate','Mean Nitrate',
              'Min FC', 'Max FC', 'Mean FC',
              'Min TC', 'Max TC', 'Mean TC']

df.reset_index(drop=True, inplace=True)
df.columns

Index(['Station Code', 'Locations', 'State', 'Min Temp(C)', 'Max Temp(C)',
       'Mean Temp(C)', 'Min pH', 'Max pH', 'Mean pH', 'Min Conductivity',
       'Max Conductivity', 'Mean Conductivity', 'Min BOD', 'Max BOD',
       'Mean BOD', 'Min Nitrate', 'Max Nitrate', 'Mean Nitrate', 'Min FC',
       'Max FC', 'Mean FC', 'Min TC', 'Max TC', 'Mean TC'],
      dtype='object')
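Assigning a list to `df.columns` relies on the columns being in exactly the expected order. As a hedged alternative, the short names can be derived from the cleaned names themselves. The dictionaries and the `short_name` helper below are hypothetical and only cover the column names shown above.

```python
# Hypothetical abbreviations for the measured quantities
ABBREVIATIONS = {
    "TEMPERATURE oC": "Temp(C)",
    "CONDUCTIVITY (mhos/cm)": "Conductivity",
    "B.O.D. (mg/l)": "BOD",
    "NITRATE- N+ NITRITE- N (mg/l)": "Nitrate",
    "FECAL COLIFORM (MPN/100ml)": "FC",
    "TOTAL COLIFORM (MPN/100ml)": "TC",
}

# Identifier columns that carry no Min/Max/Mean prefix
FIXED = {
    "STATION CODE": "Station Code",
    "LOCATIONS": "Locations",
    "STATE": "State",
}


def short_name(col):
    """Map a cleaned column name to its short, readable form."""
    if col in FIXED:
        return FIXED[col]
    # Split "Min TEMPERATURE oC" into the prefix and the measured quantity
    prefix, _, quantity = col.partition(" ")
    # Unknown quantities (for example "pH") are kept unchanged
    return f"{prefix} {ABBREVIATIONS.get(quantity, quantity)}"


print(short_name("Mean NITRATE- N+ NITRITE- N (mg/l)"))
```

With pandas, the helper can be passed directly to `rename`, for example `df = df.rename(columns=short_name)`, so the mapping does not depend on column position.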

In [36]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28 entries, 0 to 27
Data columns (total 24 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Station Code       28 non-null     int64  
 1   Locations          28 non-null     object 
 2   State              28 non-null     object 
 3   Min Temp(C)        8 non-null      float64
 4   Max Temp(C)        8 non-null      float64
 5   Mean Temp(C)       8 non-null      float64
 6   Min pH             28 non-null     float64
 7   Max pH             28 non-null     float64
 8   Mean pH            28 non-null     float64
 9   Min Conductivity   25 non-null     float64
 10  Max Conductivity   25 non-null     float64
 11  Mean Conductivity  25 non-null     float64
 12  Min BOD            7 non-null      float64
 13  Max BOD            7 non-null      float64
 14  Mean BOD           7 non-null      float64
 15  Min Nitrate        28 non-null     float64
 16  Max Nitrate        28 non-nu

In [37]:
df.head(10)

Unnamed: 0,Station Code,Locations,State,Min Temp(C),Max Temp(C),Mean Temp(C),Min pH,Max pH,Mean pH,Min Conductivity,...,Mean BOD,Min Nitrate,Max Nitrate,Mean Nitrate,Min FC,Max FC,Mean FC,Min TC,Max TC,Mean TC
0,1898,"PETROL PUMP OPP.\rHERO CYCLE,\rLUDHIANA",PUNJAB,,,,6.7,7.0,6.9,1260.0,...,,1.8,7.2,4.5,,,,,,
1,1900,"GURCHAARAN SINGH\rHAIBOWAL DAIRY\rCOMPLEX, LUD...",PUNJAB,,,,6.9,7.4,7.2,593.0,...,,1.4,3.3,2.4,,,,,,
2,1901,"DUSSHERA GROUND\rINDUSTRIAL ESTATE,\rLUDHIANA",PUNJAB,,,,6.7,6.8,6.8,896.0,...,,1.4,4.3,2.9,,,,,,
3,1902,"SHUKLA TEA STAL\rPOINT, LUDHIANA",PUNJAB,,,,6.7,6.7,6.7,1273.0,...,,1.6,4.2,2.9,,,,,,
4,1903,"PUNJAB\rAGRICULTUREAL\rUNIVERSITY,\rLUDHIANA",PUNJAB,,,,6.8,7.2,7.0,749.0,...,,1.6,3.8,2.7,,,,,,
5,2917,"NEAR HARMANDIR\rSAHEB, AMRITSAR,\rPUNJAB",PUNJAB,,,,7.0,7.0,7.0,1470.0,...,,3.0,3.0,3.0,,,,,,
6,2918,"DERA BASSI, PUNJAB",PUNJAB,,,,6.8,7.1,7.0,1504.0,...,,5.3,7.8,6.6,,,,,,
7,2920,"HAMIRA VILLAGE,\rPUNJAB",PUNJAB,,,,6.9,6.9,6.9,2634.0,...,,8.0,8.0,8.0,,,,,,
8,2921,"HAMIRA VILLAGE,\rPUNJAB",PUNJAB,,,,8.1,8.1,8.1,409.0,...,,1.2,1.2,1.2,,,,,,
9,2922,"LEATHER COMPLEX,\rJALANDHAR, PUNJAB",PUNJAB,,,,6.9,6.9,6.9,1189.0,...,,4.6,4.6,4.6,,,,,,


In [38]:
# Replace '\r' with '' in the 'Locations' column
df['Locations'] = df['Locations'].str.replace(r'\r', '', regex=True)
df.head(10)


Unnamed: 0,Station Code,Locations,State,Min Temp(C),Max Temp(C),Mean Temp(C),Min pH,Max pH,Mean pH,Min Conductivity,...,Mean BOD,Min Nitrate,Max Nitrate,Mean Nitrate,Min FC,Max FC,Mean FC,Min TC,Max TC,Mean TC
0,1898,"PETROL PUMP OPP.HERO CYCLE,LUDHIANA",PUNJAB,,,,6.7,7.0,6.9,1260.0,...,,1.8,7.2,4.5,,,,,,
1,1900,"GURCHAARAN SINGHHAIBOWAL DAIRYCOMPLEX, LUDHIANA",PUNJAB,,,,6.9,7.4,7.2,593.0,...,,1.4,3.3,2.4,,,,,,
2,1901,"DUSSHERA GROUNDINDUSTRIAL ESTATE,LUDHIANA",PUNJAB,,,,6.7,6.8,6.8,896.0,...,,1.4,4.3,2.9,,,,,,
3,1902,"SHUKLA TEA STALPOINT, LUDHIANA",PUNJAB,,,,6.7,6.7,6.7,1273.0,...,,1.6,4.2,2.9,,,,,,
4,1903,"PUNJABAGRICULTUREALUNIVERSITY,LUDHIANA",PUNJAB,,,,6.8,7.2,7.0,749.0,...,,1.6,3.8,2.7,,,,,,
5,2917,"NEAR HARMANDIRSAHEB, AMRITSAR,PUNJAB",PUNJAB,,,,7.0,7.0,7.0,1470.0,...,,3.0,3.0,3.0,,,,,,
6,2918,"DERA BASSI, PUNJAB",PUNJAB,,,,6.8,7.1,7.0,1504.0,...,,5.3,7.8,6.6,,,,,,
7,2920,"HAMIRA VILLAGE,PUNJAB",PUNJAB,,,,6.9,6.9,6.9,2634.0,...,,8.0,8.0,8.0,,,,,,
8,2921,"HAMIRA VILLAGE,PUNJAB",PUNJAB,,,,8.1,8.1,8.1,409.0,...,,1.2,1.2,1.2,,,,,,
9,2922,"LEATHER COMPLEX,JALANDHAR, PUNJAB",PUNJAB,,,,6.9,6.9,6.9,1189.0,...,,4.6,4.6,4.6,,,,,,
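Note that replacing `\r` with an empty string glues neighbouring words together in the output above (`OPP.HERO`, `SINGHHAIBOWAL`). A small variation, assuming word boundaries should be kept, replaces each carriage return with a space instead. The helper name `normalize_location` is hypothetical.

```python
import re


def normalize_location(value):
    """Replace carriage returns with spaces and collapse repeated whitespace."""
    spaced = value.replace("\r", " ")
    return re.sub(r"\s+", " ", spaced).strip()


# Raw value of the first row as extracted by tabula
raw = "PETROL PUMP OPP.\rHERO CYCLE,\rLUDHIANA"
print(normalize_location(raw))
```

On the DataFrame this could be applied with `df['Locations'] = df['Locations'].map(normalize_location)`.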


Now that we have a clean DataFrame, we can analyse the data using pandas!
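As a minimal sketch of such an analysis, assuming pandas is installed, the snippet below works on three rows copied from the table above (only a few columns are kept, so the numbers do not describe the full dataset).

```python
import pandas as pd

# Three rows copied from the extracted table (subset of columns)
sample = pd.DataFrame({
    "Station Code": [1898, 1900, 1901],
    "State": ["PUNJAB", "PUNJAB", "PUNJAB"],
    "Mean pH": [6.9, 7.2, 6.8],
    "Min Conductivity": [1260.0, 593.0, 896.0],
})

# Station with the highest mean pH in the sample
highest_ph = sample.loc[sample["Mean pH"].idxmax(), "Station Code"]

# Average mean pH per state
mean_ph_by_state = sample.groupby("State")["Mean pH"].mean()

print(highest_ph)
print(mean_ph_by_state)
```

The same two lines work unchanged on the full `df`, since it uses the same column names.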