<a href="https://colab.research.google.com/github/adhang/learn-data-science/blob/main/Scraping%20Tables%20from%20Clean%20PDF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Scraping Tables from Clean PDF
Author: Adhang Muntaha Muhammad

[GitHub](https://github.com/adhang)
[LinkedIn](https://www.linkedin.com/in/adhangmuntaha/)
___
Datasets saved as CSV or Microsoft Excel files are easy to work with because the data is already in tabular format. But what if the dataset is saved in a less convenient format, such as PDF?

Based on what surrounds the tabular data, I divide PDF files into 2 types:
- **Clean PDF**: the PDF file contains only tabular data
- **Dirty PDF**: the PDF file contains tabular data plus other objects (such as images, paragraphs, etc.) outside the tables

In this notebook, I will write down the steps for scraping tables from a clean PDF file.

In an upcoming project, I'll provide steps for scraping tables from a dirty PDF file.

**Contents**
- Required Libraries
  - Installing Libraries
  - Importing Libraries
- Datasets
- Scraping: Single Table - Single Page
- Scraping: Multiple Tables - Single Page
  - Read as Independent Tables
  - Read as Single Table
- Scraping: Multiple Pages
  - Read One by One
  - Read Specific or Ranged Page
  - Read All Pages
- Scraping: Based on Extraction Mode
  - Lattice Mode
  - Stream Mode

# Required Libraries
- `tabula-py` - to scrape tables from PDF files (it is a Python wrapper of tabula-java, so Java must be installed)
- `pandas` - to do some data processing


## Installing Libraries
Using Anaconda prompt
```
conda install tabula-py
conda install pandas
```
Using pip
```
pip install tabula-py
pip install pandas
```
Using notebook (inline)
```
!pip install tabula-py
!pip install pandas
```

Since I'm using Google Colab, I'll install them inline in a notebook cell.

In [1]:
!pip install tabula-py
!pip install pandas



## Importing Libraries

In [2]:
import tabula as tb
import pandas as pd

# Datasets
In this project, I use my own datasets:
- `clean_single_table.pdf` - a clean PDF file containing a single table on one page
- `clean_multi_table.pdf` - a clean PDF file containing multiple tables (3) on one page
- `clean_multi_page.pdf` - a clean PDF file containing multiple pages (3)

Besides that, I use a PDF from [chezou](https://github.com/chezou)'s tabula-py repository for the stream mode example. You can check it [here](https://github.com/chezou/tabula-py/raw/master/tests/resources/data.pdf).

If you want to use my datasets, you can download them from my [GitHub](https://github.com/adhang/learn-data-science).

In [3]:
file_path_1 = 'https://github.com/adhang/learn-data-science/raw/main/dataset/clean_single_table.pdf'
file_path_2 = 'https://github.com/adhang/learn-data-science/raw/main/dataset/clean_multi_table.pdf'
file_path_3 = 'https://github.com/adhang/learn-data-science/raw/main/dataset/clean_multi_page.pdf'
file_path_4 = 'https://github.com/chezou/tabula-py/raw/master/tests/resources/data.pdf'

# Scraping: Single Table - Single Page
When you want to read a single table on a single page, you can use this syntax:

```
table = tb.read_pdf(path, pages=x)
```
- `path` - the path of your PDF file
- `x` - which page your table is on

In [4]:
# clean_single_table.pdf
single_table = tb.read_pdf(file_path_1, pages=1)
single_table

Got stderr: Feb 10, 2022 11:22:33 AM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



[   ID               Name  QZ    HW    LR    PB  Total
 0   1   Agatsuma Zenitsu  21  31.7  24.9  16.3     94
 1   2         Amano Hina  23  29.2  24.6  17.0     93
 2   3        Asagiri Gen  24  29.2  24.6  15.8     93
 3   4  Hashibira Inosuke  20  31.4  24.4  16.2     92
 4   5        Ideale Zora  24  31.4  25.0   0.0     80
 5   6   Ishigami Byakuya  25  32.0  24.6   0.0     82
 6   7      Kamado Nezuko  19  28.9  24.4  17.0     89
 7   8    Kamado Tanjirou  23  33.0  25.0  16.7     97
 8   9    Kanroji Mitsuri  18  32.2  24.6  16.2     90
 9  10     Kochou Shinobu  19  31.0  24.6  15.8     90]

`tb.read_pdf()` returns a list of DataFrames, one per table found, so a single table still has to be accessed by index.

In [5]:
type(single_table)

list

In [6]:
single_table[0]

Unnamed: 0,ID,Name,QZ,HW,LR,PB,Total
0,1,Agatsuma Zenitsu,21,31.7,24.9,16.3,94
1,2,Amano Hina,23,29.2,24.6,17.0,93
2,3,Asagiri Gen,24,29.2,24.6,15.8,93
3,4,Hashibira Inosuke,20,31.4,24.4,16.2,92
4,5,Ideale Zora,24,31.4,25.0,0.0,80
5,6,Ishigami Byakuya,25,32.0,24.6,0.0,82
6,7,Kamado Nezuko,19,28.9,24.4,17.0,89
7,8,Kamado Tanjirou,23,33.0,25.0,16.7,97
8,9,Kanroji Mitsuri,18,32.2,24.6,16.2,90
9,10,Kochou Shinobu,19,31.0,24.6,15.8,90


In [7]:
type(single_table[0])

pandas.core.frame.DataFrame

# Scraping: Multiple Tables - Single Page
To read multiple tables on a single page, we can use the `multiple_tables` parameter.
```
table = tb.read_pdf(path, pages=x, multiple_tables=True)
```

- `multiple_tables = True` - read the tables as independent tables
- `multiple_tables = False` - read the tables as a single table (merge all tables into a single table)

## Read as Independent Tables

In [8]:
# clean_multi_table.pdf
multi_table_independent = tb.read_pdf(file_path_2, pages=1, multiple_tables=True)
multi_table_independent

Got stderr: Feb 10, 2022 11:22:35 AM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



[   ID               Name  QZ    HW    LR    PB  Total
 0   1   Agatsuma Zenitsu  21  31.7  24.9  16.3     94
 1   2         Amano Hina  23  29.2  24.6  17.0     93
 2   3        Asagiri Gen  24  29.2  24.6  15.8     93
 3   4  Hashibira Inosuke  20  31.4  24.4  16.2     92
 4   5        Ideale Zora  24  31.4  25.0   0.0     80
 5   6   Ishigami Byakuya  25  32.0  24.6   0.0     82
 6   7      Kamado Nezuko  19  28.9  24.4  17.0     89
 7   8    Kamado Tanjirou  23  33.0  25.0  16.7     97
 8   9    Kanroji Mitsuri  18  32.2  24.6  16.2     90
 9  10     Kochou Shinobu  19  31.0  24.6  15.8     90,
    ID               Name  QZ    HW    LR    PB  Total
 0  11   Miyamizu Mitsuha  19  31.4  25.0  16.2     91
 1  12   Morishima Hodaka  21  30.9  24.9  17.0     94
 2  13               Nero  25  29.4  24.7  16.2     95
 3  14  Novachrono Julius  23  31.5  24.7  17.0     96
 4  15     Ogawa Yuzuriha  23  32.8  24.7  16.2     96,
    ID               Name  Total
 0  16         Ooki Taiju     

In [9]:
len(multi_table_independent)

3

As you can see, there are 3 table headers, and the list has a length of 3. That's because the `clean_multi_table.pdf` dataset contains 3 tables on a single page.

Let's see each table (dataframe).

In [10]:
# first table
multi_table_independent[0]

Unnamed: 0,ID,Name,QZ,HW,LR,PB,Total
0,1,Agatsuma Zenitsu,21,31.7,24.9,16.3,94
1,2,Amano Hina,23,29.2,24.6,17.0,93
2,3,Asagiri Gen,24,29.2,24.6,15.8,93
3,4,Hashibira Inosuke,20,31.4,24.4,16.2,92
4,5,Ideale Zora,24,31.4,25.0,0.0,80
5,6,Ishigami Byakuya,25,32.0,24.6,0.0,82
6,7,Kamado Nezuko,19,28.9,24.4,17.0,89
7,8,Kamado Tanjirou,23,33.0,25.0,16.7,97
8,9,Kanroji Mitsuri,18,32.2,24.6,16.2,90
9,10,Kochou Shinobu,19,31.0,24.6,15.8,90


In [11]:
# second table
multi_table_independent[1]

Unnamed: 0,ID,Name,QZ,HW,LR,PB,Total
0,11,Miyamizu Mitsuha,19,31.4,25.0,16.2,91
1,12,Morishima Hodaka,21,30.9,24.9,17.0,94
2,13,Nero,25,29.4,24.7,16.2,95
3,14,Novachrono Julius,23,31.5,24.7,17.0,96
4,15,Ogawa Yuzuriha,23,32.8,24.7,16.2,96


In [12]:
# third table
multi_table_independent[2]

Unnamed: 0,ID,Name,Total
0,16,Ooki Taiju,93
1,17,Papittson Charmy,91
2,18,Rengoku Kyoujurou,95
3,19,Saionji Ukyou,96
4,20,Senkuu,95


## Read as Single Table

In [13]:
# clean_multi_table.pdf
multi_table_merged = tb.read_pdf(file_path_2, pages=1, multiple_tables=False)
multi_table_merged

Got stderr: Feb 10, 2022 11:22:37 AM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



[    ID               Name     QZ    HW    LR    PB  Total
 0    1   Agatsuma Zenitsu     21  31.7  24.9  16.3     94
 1    2         Amano Hina     23  29.2  24.6    17     93
 2    3        Asagiri Gen     24  29.2  24.6  15.8     93
 3    4  Hashibira Inosuke     20  31.4  24.4  16.2     92
 4    5        Ideale Zora     24  31.4    25     0     80
 5    6   Ishigami Byakuya     25    32  24.6     0     82
 6    7      Kamado Nezuko     19  28.9  24.4    17     89
 7    8    Kamado Tanjirou     23    33    25  16.7     97
 8    9    Kanroji Mitsuri     18  32.2  24.6  16.2     90
 9   10     Kochou Shinobu     19    31  24.6  15.8     90
 10  ID               Name     QZ    HW    LR    PB  Total
 11  11   Miyamizu Mitsuha     19  31.4    25  16.2     91
 12  12   Morishima Hodaka     21  30.9  24.9    17     94
 13  13               Nero     25  29.4  24.7  16.2     95
 14  14  Novachrono Julius     23  31.5  24.7    17     96
 15  15     Ogawa Yuzuriha     23  32.8  24.7  16.2     

From the output, you may notice that a repeated table header is counted as a record (row). That's because all the tables on the page are merged into a single table.

In [14]:
multi_table_merged[0]

Unnamed: 0,ID,Name,QZ,HW,LR,PB,Total
0,1,Agatsuma Zenitsu,21,31.7,24.9,16.3,94
1,2,Amano Hina,23,29.2,24.6,17,93
2,3,Asagiri Gen,24,29.2,24.6,15.8,93
3,4,Hashibira Inosuke,20,31.4,24.4,16.2,92
4,5,Ideale Zora,24,31.4,25,0,80
5,6,Ishigami Byakuya,25,32,24.6,0,82
6,7,Kamado Nezuko,19,28.9,24.4,17,89
7,8,Kamado Tanjirou,23,33,25,16.7,97
8,9,Kanroji Mitsuri,18,32.2,24.6,16.2,90
9,10,Kochou Shinobu,19,31,24.6,15.8,90


There are some `NaN` values at index 16-21 (in columns other than `ID`). That's because the third table doesn't have the same number of columns as the first and second tables.
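One way to clean up a merged table like this is to drop the rows that repeat the header and then convert the columns back to numbers. The sketch below does not call `tabula`; it uses a small hand-built DataFrame (a few rows copied from the output above, stored as text) as a stand-in for `multi_table_merged[0]`. The names `merged` and `cleaned` are my own, and I'm assuming the merged columns were read as text because of the header rows.

```python
import pandas as pd

# Toy stand-in for a merged table; values copied from the rows shown above.
# Everything is text, as it would be when header rows are mixed into the data.
merged = pd.DataFrame({
    "ID": ["9", "10", "ID", "11", "12"],
    "Name": ["Kanroji Mitsuri", "Kochou Shinobu", "Name",
             "Miyamizu Mitsuha", "Morishima Hodaka"],
    "Total": ["90", "90", "Total", "91", "94"],
})

# A repeated header row is a row whose ID cell equals the column label.
is_header = merged["ID"] == "ID"
cleaned = merged.loc[~is_header].reset_index(drop=True)

# Convert the text columns back to numbers.
cleaned["ID"] = pd.to_numeric(cleaned["ID"])
cleaned["Total"] = pd.to_numeric(cleaned["Total"])

print(cleaned)
```

The same filter should work on the real merged DataFrame as long as the header text matches the column name exactly.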

# Scraping: Multiple Pages
To read multiple pages, we have some options:
- Read pages one by one
- Read pages by specific page number or range
- Read all pages, but make sure the PDF file only contains tabular data (tables)

## Read One by One

In [15]:
# clean_multi_page.pdf
multi_page_1 = tb.read_pdf(file_path_3, pages=1)
multi_page_2 = tb.read_pdf(file_path_3, pages=2)
multi_page_3 = tb.read_pdf(file_path_3, pages=3)

Got stderr: Feb 10, 2022 11:22:39 AM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>

Got stderr: Feb 10, 2022 11:22:40 AM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>

Got stderr: Feb 10, 2022 11:22:43 AM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



In [16]:
# table on first page
multi_page_1[0]

Unnamed: 0,ID,Name,QZ,HW,LR,PB,Total
0,1,Agatsuma Zenitsu,21,31.7,24.9,16.3,94
1,2,Amano Hina,23,29.2,24.6,17.0,93
2,3,Asagiri Gen,24,29.2,24.6,15.8,93
3,4,Hashibira Inosuke,20,31.4,24.4,16.2,92
4,5,Ideale Zora,24,31.4,25.0,0.0,80
5,6,Ishigami Byakuya,25,32.0,24.6,0.0,82
6,7,Kamado Nezuko,19,28.9,24.4,17.0,89
7,8,Kamado Tanjirou,23,33.0,25.0,16.7,97
8,9,Kanroji Mitsuri,18,32.2,24.6,16.2,90
9,10,Kochou Shinobu,19,31.0,24.6,15.8,90


In [17]:
# table on second page
multi_page_2[0]

Unnamed: 0,ID,Name,QZ,HW,LR,PB,Total
0,11,Miyamizu Mitsuha,19,31.4,25.0,16.2,91
1,12,Morishima Hodaka,21,30.9,24.9,17.0,94
2,13,Nero,25,29.4,24.7,16.2,95
3,14,Novachrono Julius,23,31.5,24.7,17.0,96
4,15,Ogawa Yuzuriha,23,32.8,24.7,16.2,96


In [18]:
# table on third page
multi_page_3[0]

Unnamed: 0,ID,Name,Total
0,16,Ooki Taiju,93
1,17,Papittson Charmy,91
2,18,Rengoku Kyoujurou,95
3,19,Saionji Ukyou,96
4,20,Senkuu,95


To combine them along the rows, use `pd.concat()` with `axis=0`.

In [19]:
# list of dataframes
table_list = [multi_page_1[0], multi_page_2[0], multi_page_3[0]]

# combine it
multi_page_concat = pd.concat(table_list, axis=0)
multi_page_concat

Unnamed: 0,ID,Name,QZ,HW,LR,PB,Total
0,1,Agatsuma Zenitsu,21.0,31.7,24.9,16.3,94
1,2,Amano Hina,23.0,29.2,24.6,17.0,93
2,3,Asagiri Gen,24.0,29.2,24.6,15.8,93
3,4,Hashibira Inosuke,20.0,31.4,24.4,16.2,92
4,5,Ideale Zora,24.0,31.4,25.0,0.0,80
5,6,Ishigami Byakuya,25.0,32.0,24.6,0.0,82
6,7,Kamado Nezuko,19.0,28.9,24.4,17.0,89
7,8,Kamado Tanjirou,23.0,33.0,25.0,16.7,97
8,9,Kanroji Mitsuri,18.0,32.2,24.6,16.2,90
9,10,Kochou Shinobu,19.0,31.0,24.6,15.8,90


You can see that there are some `NaN` values. That's because the table columns don't match: the third table only has the `ID`, `Name`, and `Total` columns. Where the column names are the same, the values are stored in that column.
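To make the column matching easier to see, here is a minimal sketch with two tiny hand-built DataFrames instead of the real pages (the names `page_1` and `page_3` are my own, and the values are copied from the tables above). It also passes `ignore_index=True`, which is not used in the notebook, so the combined rows get a fresh 0..n-1 index instead of each table keeping its own.

```python
import pandas as pd

# Two rows from the first page (with a QZ column) ...
page_1 = pd.DataFrame({
    "ID": [1, 2],
    "Name": ["Agatsuma Zenitsu", "Amano Hina"],
    "QZ": [21, 23],
    "Total": [94, 93],
})

# ... and two rows from the third page (ID, Name, and Total only).
page_3 = pd.DataFrame({
    "ID": [16, 17],
    "Name": ["Ooki Taiju", "Papittson Charmy"],
    "Total": [93, 91],
})

# Columns are aligned by name; missing cells are filled with NaN.
combined = pd.concat([page_1, page_3], axis=0, ignore_index=True)

print(combined)
print(combined.dtypes)
```

Because `NaN` is a floating-point value, the `QZ` column is converted from integers to floats, which is why values such as `21.0` appear in the combined output.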

## Read Specific or Ranged Page
We can set the `pages` parameter to specific pages or a range of pages, using either a string or a list.
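The string form and the list form describe the same set of pages. As an illustration only, the hypothetical helper below (`expand_pages` is my own name, not part of `tabula-py`) expands a page string into the equivalent list of page numbers. It only handles plain numbers and `start-end` ranges, not the special value `'all'`.

```python
def expand_pages(spec):
    """Expand a page string such as '1-2,3' into a list of page numbers."""
    pages = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            # A range such as '1-2' includes both ends.
            start, end = part.split("-")
            pages.extend(range(int(start), int(end) + 1))
        else:
            pages.append(int(part))
    return pages


print(expand_pages("1-2,3"))
```

So `pages='1-2,3'` and `pages=[1, 2, 3]` ask for the same three pages.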

In [20]:
# clean_multi_page.pdf
# read multiple pages using string
multi_page_range = tb.read_pdf(file_path_3, pages='1-2,3')
multi_page_range

Got stderr: Feb 10, 2022 11:22:46 AM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



[   ID               Name  QZ    HW    LR    PB  Total
 0   1   Agatsuma Zenitsu  21  31.7  24.9  16.3     94
 1   2         Amano Hina  23  29.2  24.6  17.0     93
 2   3        Asagiri Gen  24  29.2  24.6  15.8     93
 3   4  Hashibira Inosuke  20  31.4  24.4  16.2     92
 4   5        Ideale Zora  24  31.4  25.0   0.0     80
 5   6   Ishigami Byakuya  25  32.0  24.6   0.0     82
 6   7      Kamado Nezuko  19  28.9  24.4  17.0     89
 7   8    Kamado Tanjirou  23  33.0  25.0  16.7     97
 8   9    Kanroji Mitsuri  18  32.2  24.6  16.2     90
 9  10     Kochou Shinobu  19  31.0  24.6  15.8     90,
    ID               Name  QZ    HW    LR    PB  Total
 0  11   Miyamizu Mitsuha  19  31.4  25.0  16.2     91
 1  12   Morishima Hodaka  21  30.9  24.9  17.0     94
 2  13               Nero  25  29.4  24.7  16.2     95
 3  14  Novachrono Julius  23  31.5  24.7  17.0     96
 4  15     Ogawa Yuzuriha  23  32.8  24.7  16.2     96,
    ID               Name  Total
 0  16         Ooki Taiju     

In [21]:
# combine it
multi_page_range_concat = pd.concat(multi_page_range, axis=0)
multi_page_range_concat

Unnamed: 0,ID,Name,QZ,HW,LR,PB,Total
0,1,Agatsuma Zenitsu,21.0,31.7,24.9,16.3,94
1,2,Amano Hina,23.0,29.2,24.6,17.0,93
2,3,Asagiri Gen,24.0,29.2,24.6,15.8,93
3,4,Hashibira Inosuke,20.0,31.4,24.4,16.2,92
4,5,Ideale Zora,24.0,31.4,25.0,0.0,80
5,6,Ishigami Byakuya,25.0,32.0,24.6,0.0,82
6,7,Kamado Nezuko,19.0,28.9,24.4,17.0,89
7,8,Kamado Tanjirou,23.0,33.0,25.0,16.7,97
8,9,Kanroji Mitsuri,18.0,32.2,24.6,16.2,90
9,10,Kochou Shinobu,19.0,31.0,24.6,15.8,90


In [22]:
# clean_multi_page.pdf
# read multiple pages using list
multi_page_range = tb.read_pdf(file_path_3, pages=[1,2,3])
multi_page_range

Got stderr: Feb 10, 2022 11:22:49 AM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



[   ID               Name  QZ    HW    LR    PB  Total
 0   1   Agatsuma Zenitsu  21  31.7  24.9  16.3     94
 1   2         Amano Hina  23  29.2  24.6  17.0     93
 2   3        Asagiri Gen  24  29.2  24.6  15.8     93
 3   4  Hashibira Inosuke  20  31.4  24.4  16.2     92
 4   5        Ideale Zora  24  31.4  25.0   0.0     80
 5   6   Ishigami Byakuya  25  32.0  24.6   0.0     82
 6   7      Kamado Nezuko  19  28.9  24.4  17.0     89
 7   8    Kamado Tanjirou  23  33.0  25.0  16.7     97
 8   9    Kanroji Mitsuri  18  32.2  24.6  16.2     90
 9  10     Kochou Shinobu  19  31.0  24.6  15.8     90,
    ID               Name  QZ    HW    LR    PB  Total
 0  11   Miyamizu Mitsuha  19  31.4  25.0  16.2     91
 1  12   Morishima Hodaka  21  30.9  24.9  17.0     94
 2  13               Nero  25  29.4  24.7  16.2     95
 3  14  Novachrono Julius  23  31.5  24.7  17.0     96
 4  15     Ogawa Yuzuriha  23  32.8  24.7  16.2     96,
    ID               Name  Total
 0  16         Ooki Taiju     

In [23]:
# combine it
multi_page_range_concat = pd.concat(multi_page_range, axis=0)
multi_page_range_concat

Unnamed: 0,ID,Name,QZ,HW,LR,PB,Total
0,1,Agatsuma Zenitsu,21.0,31.7,24.9,16.3,94
1,2,Amano Hina,23.0,29.2,24.6,17.0,93
2,3,Asagiri Gen,24.0,29.2,24.6,15.8,93
3,4,Hashibira Inosuke,20.0,31.4,24.4,16.2,92
4,5,Ideale Zora,24.0,31.4,25.0,0.0,80
5,6,Ishigami Byakuya,25.0,32.0,24.6,0.0,82
6,7,Kamado Nezuko,19.0,28.9,24.4,17.0,89
7,8,Kamado Tanjirou,23.0,33.0,25.0,16.7,97
8,9,Kanroji Mitsuri,18.0,32.2,24.6,16.2,90
9,10,Kochou Shinobu,19.0,31.0,24.6,15.8,90


## Read All Pages
To read all pages, we can set `pages='all'`.

In [24]:
# clean_multi_page.pdf
multi_page_all = tb.read_pdf(file_path_3, pages='all')
multi_page_all

Got stderr: Feb 10, 2022 11:22:52 AM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



[   ID               Name  QZ    HW    LR    PB  Total
 0   1   Agatsuma Zenitsu  21  31.7  24.9  16.3     94
 1   2         Amano Hina  23  29.2  24.6  17.0     93
 2   3        Asagiri Gen  24  29.2  24.6  15.8     93
 3   4  Hashibira Inosuke  20  31.4  24.4  16.2     92
 4   5        Ideale Zora  24  31.4  25.0   0.0     80
 5   6   Ishigami Byakuya  25  32.0  24.6   0.0     82
 6   7      Kamado Nezuko  19  28.9  24.4  17.0     89
 7   8    Kamado Tanjirou  23  33.0  25.0  16.7     97
 8   9    Kanroji Mitsuri  18  32.2  24.6  16.2     90
 9  10     Kochou Shinobu  19  31.0  24.6  15.8     90,
    ID               Name  QZ    HW    LR    PB  Total
 0  11   Miyamizu Mitsuha  19  31.4  25.0  16.2     91
 1  12   Morishima Hodaka  21  30.9  24.9  17.0     94
 2  13               Nero  25  29.4  24.7  16.2     95
 3  14  Novachrono Julius  23  31.5  24.7  17.0     96
 4  15     Ogawa Yuzuriha  23  32.8  24.7  16.2     96,
    ID               Name  Total
 0  16         Ooki Taiju     

In [25]:
# combine it
multi_page_all_concat = pd.concat(multi_page_all, axis=0)
multi_page_all_concat

Unnamed: 0,ID,Name,QZ,HW,LR,PB,Total
0,1,Agatsuma Zenitsu,21.0,31.7,24.9,16.3,94
1,2,Amano Hina,23.0,29.2,24.6,17.0,93
2,3,Asagiri Gen,24.0,29.2,24.6,15.8,93
3,4,Hashibira Inosuke,20.0,31.4,24.4,16.2,92
4,5,Ideale Zora,24.0,31.4,25.0,0.0,80
5,6,Ishigami Byakuya,25.0,32.0,24.6,0.0,82
6,7,Kamado Nezuko,19.0,28.9,24.4,17.0,89
7,8,Kamado Tanjirou,23.0,33.0,25.0,16.7,97
8,9,Kanroji Mitsuri,18.0,32.2,24.6,16.2,90
9,10,Kochou Shinobu,19.0,31.0,24.6,15.8,90


# Scraping: Based on Extraction Mode
There are two parameters to set the extraction mode:
- `lattice` (boolean, optional) - Force PDF to be extracted using lattice-mode extraction (if there are ruling lines separating each cell, as in a PDF of an Excel spreadsheet). Default `False`.
- `stream` (boolean, optional) - Force PDF to be extracted using stream-mode extraction (if there are no ruling lines separating each cell, so the columns are separated only by whitespace). Default `False`.
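When you don't know in advance which mode fits a file, one possible approach is to try lattice mode first and fall back to stream mode. The sketch below is only an idea, not something `tabula-py` provides: `read_with_fallback` and `fake_reader` are hypothetical names, and I'm assuming lattice mode returns an empty list when it finds no ruled table, which is worth checking on your own files. The reader is passed in as an argument so the sketch runs without Java or a PDF file.

```python
def read_with_fallback(reader, path, pages=1):
    """Try lattice mode first; fall back to stream mode if no table is found."""
    tables = reader(path, pages=pages, lattice=True)
    if len(tables) > 0:
        return tables, "lattice"
    return reader(path, pages=pages, stream=True), "stream"


# Stand-in reader that mimics a PDF without ruling lines:
# lattice mode finds nothing, stream mode finds one table.
def fake_reader(path, pages=1, lattice=False, stream=False):
    return [] if lattice else ["table found by stream mode"]


tables, mode = read_with_fallback(fake_reader, "example.pdf")
print(mode, tables)
```

In the notebook, you would pass `tb.read_pdf` as the `reader` argument instead of `fake_reader`.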

## Lattice Mode

In [26]:
# clean_single_table.pdf
table_lattice = tb.read_pdf(file_path_1, pages=1, lattice=True)
table_lattice[0]

Got stderr: Feb 10, 2022 11:22:56 AM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Unnamed: 0,ID,Name,QZ,HW,LR,PB,Total
0,1,Agatsuma Zenitsu,21,31.7,24.9,16.3,94
1,2,Amano Hina,23,29.2,24.6,17.0,93
2,3,Asagiri Gen,24,29.2,24.6,15.8,93
3,4,Hashibira Inosuke,20,31.4,24.4,16.2,92
4,5,Ideale Zora,24,31.4,25.0,0.0,80
5,6,Ishigami Byakuya,25,32.0,24.6,0.0,82
6,7,Kamado Nezuko,19,28.9,24.4,17.0,89
7,8,Kamado Tanjirou,23,33.0,25.0,16.7,97
8,9,Kanroji Mitsuri,18,32.2,24.6,16.2,90
9,10,Kochou Shinobu,19,31.0,24.6,15.8,90


## Stream Mode

In [27]:
# this time, I use @chezou dataset
table_stream = tb.read_pdf(file_path_4, pages=1, stream=True)
table_stream[0].head()

Got stderr: Feb 10, 2022 11:22:58 AM org.apache.pdfbox.pdmodel.font.PDType1Font <init>
Feb 10, 2022 11:22:59 AM org.apache.pdfbox.pdmodel.font.PDType1Font <init>



Unnamed: 0.1,Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
2,Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
3,Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
4,Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2


In [28]:
# same @chezou dataset, but without forcing stream mode
table_no_stream = tb.read_pdf(file_path_4, pages=1)
table_no_stream[0].head()

Got stderr: Feb 10, 2022 11:23:03 AM org.apache.pdfbox.pdmodel.font.PDType1Font <init>
Feb 10, 2022 11:23:03 AM org.apache.pdfbox.pdmodel.font.PDType1Font <init>



Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear
0,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4
1,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4
2,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4
3,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3
4,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3

Comparing the two outputs, the result without `stream=True` is missing the car name column and the `carb` column, so for this file stream mode gives the more complete table.


# Closing
I think that's all I can write for now. Next, I will explain how to scrape tables from a dirty PDF file. Thank you and see you!