##### $\hspace{15pt}$ **Filename: 3_tableExtractionFromPDF.ipynb**
##### $\hspace{1.5pt}$ **Date Created: October 8, 2023**
##### **Date Modified: February 9, 2024**
##### $\rule{10.5in}{1pt}$
##### **Extract tabular data from a PDF document using [tabula-py](https://pypi.org/project/tabula-py/) and [camelot-py](https://pypi.org/project/camelot-py/).**

##### **This notebook reads `sampleDocument.pdf`, which is available in this [Google Drive folder](https://drive.google.com/drive/folders/1vd_N3wW9G5ok81lv_PC5H4JCfnukLWur?usp=sharing). Before running the notebook in Colab, either change the path so that it points to your copy of the file, or create the subfolder `/Colab Notebooks/002_tableExtractionFromPDF` in your Google Drive and copy the file into it.**
##### $\rule{10.5in}{1pt}$

##### Install `tabula-py`, `ghostscript`, and `camelot-py`.

In [None]:
!pip install -q tabula-py ghostscript camelot-py

##### Load modules and packages.

In [2]:
from google.colab import drive
import camelot
import pandas as pd
import tabula

##### Mount Google Drive to Colab.

In [3]:
drive.mount("/content/gdrive")

Mounted at /content/gdrive


##### Set the path to access the file needed by this notebook.

In [4]:
# Absolute path, so the lookup does not depend on the current working directory.
path = "/content/gdrive/MyDrive/Colab Notebooks/002_tableExtractionFromPDF/"
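Because the notebook depends on the file being in the expected Drive folder, it can help to fail early with a clear message when it is not. The following is a minimal sketch, not part of the original notebook; `resolve_pdf` is a hypothetical helper name, and the default filename is taken from the text above.

```python
from pathlib import Path


def resolve_pdf(folder, filename="sampleDocument.pdf"):
    """Return the full path to the PDF, or raise if the file is missing.

    `resolve_pdf` is an illustrative helper, not a tabula-py or camelot-py API.
    """
    pdf_path = Path(folder) / filename
    if not pdf_path.is_file():
        raise FileNotFoundError(
            f"{pdf_path} not found; copy the file there or change the folder"
        )
    return pdf_path
```

Calling `resolve_pdf(path)` right after the path is set would surface a wrong folder before either extraction library is invoked.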

$\hspace{1in}$

###### **tabula-py**

##### Extract the table on page 4 of the PDF document in Google Drive.

In [5]:
# read_pdf returns a list of DataFrames, one per table detected on the page.
tables = tabula.read_pdf(path + "sampleDocument.pdf", pages=4)
tables[0]



Unnamed: 0,Variable,Category,Value
0,Demographic/socioeconomic factors,,
1,Gender,"Female, No. (%)",3653 (61)
2,Age,Mean (SD),77 (11)
3,,"White (non-Hispanic), No. (%)",3753 (62)
4,,"Black (non-Hispanic), No. (%)",1791 (30)
5,,"Hispanic/Latino, No. (%)",191 (3)
6,Race/ethnicity,,
7,,"Other, No. (%)",140 (2)
8,,"Asian/Pacific Islander, No. (%)",99 (2)
9,,"Unknown, No. (%)",51 (1)
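In the output above, each `Value` cell packs two numbers into one string, for example a count with its percentage, or a mean with its SD, as the `Category` column indicates. A minimal sketch of splitting such cells is shown below; `parse_value` and `VALUE_PATTERN` are hypothetical names introduced for illustration, and the pattern assumes the cells look like the ones printed above.

```python
import re

# Matches strings such as "3653 (61)" or "77 (11)": a number followed by a
# second number in parentheses.
VALUE_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*\((\d+(?:\.\d+)?)\)\s*$")


def parse_value(cell):
    """Split 'main (secondary)' into two floats; return None if no match."""
    if not isinstance(cell, str):
        return None  # empty cells arrive as NaN (a float), not as strings
    match = VALUE_PATTERN.match(cell)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))
```

Applied to the extracted frame, this could look like `tables[0]["Value"].map(parse_value)`, which leaves `None` in the rows that hold only a section or group label.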


$\hspace{1in}$

###### **camelot-py**

##### Extract the same table from page 4 using the `stream` flavor, restricted to a region of the page.

In [6]:
# table_areas: "x1,y1,x2,y2" = top-left and bottom-right corners in PDF coordinates.
tables = camelot.read_pdf(path + "sampleDocument.pdf", pages="4", flavor="stream", table_areas=["0, 570, 600, 90"])
tables[0].df

Unnamed: 0,0,1,2
0,Variable,Category,Value
1,,Demographic/socioeconomic factors,
2,Gender,"Female, No. (%)",3653 (61)
3,Age,Mean (SD),77 (11)
4,,"White (non-Hispanic), No. (%)",3753 (62)
5,,"Black (non-Hispanic), No. (%)",1791 (30)
6,,"Hispanic/Latino, No. (%)",191 (3)
7,Race/ethnicity,,
8,,"Other, No. (%)",140 (2)
9,,"Asian/Paciﬁc Islander, No. (%)",99 (2)
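The last row of this output contains the typographic ligature `ﬁ` (U+FB01) in "Paciﬁc", whereas the tabula-py output above shows plain "Pacific". One way to make the two results comparable is Unicode NFKC normalization, which replaces the ligature with the letters "fi". This is a sketch of one possible post-processing step, not something either library does for you; `normalize_text` is a hypothetical helper name.

```python
import unicodedata


def normalize_text(cell):
    """Apply NFKC normalization to strings; pass other values through."""
    if isinstance(cell, str):
        # NFKC expands compatibility characters such as the "fi" ligature.
        return unicodedata.normalize("NFKC", cell)
    return cell
```

It could be applied one column at a time, for instance `tables[0].df[1].map(normalize_text)`. Note that NFKC also folds other compatibility characters (full-width forms, superscript digits), which may or may not be wanted for a given document.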


$\hspace{1in}$