# Extracting PDF metadata with Python

Adapted from various sources, in particular 'ReportLab - PDF Processing with Python' ([leanpub.com/reportlab](https://leanpub.com/reportlab)).

## Installation

Install `PyPDF4` using `pip`:

In [2]:
!pip install PyPDF4

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting PyPDF4
  Downloading PyPDF4-1.27.0.tar.gz (63 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: PyPDF4
  Building wheel for PyPDF4 (setup.py) ... done
  Created wheel for PyPDF4: filename=PyPDF4-1.27.0-py3-none-any.whl size=61227 sha256=0bda6a79cb3960b55ba82c5dd601177ea815634fb3697e1eb9e39361ad218d09
  Stored in directory: /root/.cache/pip/wheels/83/cc/14/cb307e5c99235c4497c7895cdb60b4f7ba2a738b6a5fc0d423
Successfully built PyPDF4
Installing collected packages: PyPDF4
Successfully installed PyPDF4-1.27.0
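
To confirm from Python that the installation worked, the installed version can be looked up with the standard library. This is a minimal sketch; the helper name `installed_version` is made up for illustration and is not part of PyPDF4.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string pip reported above, or None if PyPDF4 is absent
print(installed_version("PyPDF4"))
```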


## PDF import

Import the `PdfFileReader` class from `PyPDF4`. This class lets us read a PDF and extract data from it.

In [3]:
from PyPDF4 import PdfFileReader

As a sample PDF, I downloaded `reportlab-sample.pdf` from https://leanpub.com/reportlab and uploaded it to the current directory. (Reading it directly from a URL would also work, but only if the URL points at the PDF file itself.)
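
One way to check that a downloaded or uploaded file really is a PDF, rather than for example an HTML landing page saved under a `.pdf` name, is to look at its first bytes: PDF files normally begin with the marker `%PDF-`. The sketch below assumes the marker sits at the very start of the file; the function name `looks_like_pdf` is hypothetical.

```python
def looks_like_pdf(file_path):
    """Return True if the file starts with the PDF header marker '%PDF-'."""
    with open(file_path, "rb") as f:
        # Only the first five bytes are needed for the check
        return f.read(5) == b"%PDF-"
```

Calling `looks_like_pdf('reportlab-sample.pdf')` before handing the file to a PDF reader gives an early, cheap sanity check.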

In [7]:
!pwd
!ls *.pdf

/content
reportlab-sample.pdf


Set `path` to the sample file:

In [31]:
path = 'reportlab-sample.pdf'

To get the document's metadata (title, author, and so on) into a data frame, use the pandas library:
1. Import the necessary modules
2. `open` the file for reading as binary (`'rb'`)
3. Read the file with the `PdfFileReader` class
4. Extract the document information with the `getDocumentInfo` method
5. Put the information into a dictionary named `data`
6. Turn the dictionary into a data frame `df` with `DataFrame`


In [28]:
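import pandas as pd
import PyPDF4

# Open the PDF in binary mode; the with block closes the file afterwards
with open(path, 'rb') as pdf_file:
    pdf_reader = PyPDF4.PdfFileReader(pdf_file)

    # Document information (title, author, ...) stored in the PDF
    document_info = pdf_reader.getDocumentInfo()

    # One-element lists, so each key becomes a column with a single row
    data = {
        'Title': [document_info.title],
        'Author': [document_info.author],
        'Creator': [document_info.creator],
        'Subject': [document_info.subject],
        'Producer': [document_info.producer],
    }

df = pd.DataFrame(data)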

Show the data frame as a table:

In [29]:
print(df)

                                    Title            Author  \
0  ReportLab - PDF Processing with Python  Michael Driscoll   

                       Creator Subject       Producer  
0  LaTeX with hyperref package    None  XeTeX 0.99999  
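
The `Subject` column shows `None`, presumably because the sample PDF has no subject entry. When several PDFs are processed, some may lack other fields too, so it can help to build each row in a way that tolerates missing entries. The sketch below assumes the document information behaves like a dictionary keyed by PDF names such as `/Title` (as PyPDF4's `DocumentInformation` appears to); the names `FIELDS` and `info_to_record` are hypothetical, and a plain dict stands in for the real object.

```python
# Assumed mapping from PDF Info-dictionary keys to column names
FIELDS = {
    "/Title": "Title",
    "/Author": "Author",
    "/Creator": "Creator",
    "/Subject": "Subject",
    "/Producer": "Producer",
}


def info_to_record(info):
    """Build a flat dict of metadata, using None for entries that are absent."""
    return {
        column: (info[key] if key in info else None)
        for key, column in FIELDS.items()
    }


# Stand-in for a document information object; a plain dict is used here
sample_info = {
    "/Title": "ReportLab - PDF Processing with Python",
    "/Author": "Michael Driscoll",
}
record = info_to_record(sample_info)
print(record["Title"])
print(record["Subject"])  # None, because the key is absent
```

A list of such records, one per file, can be passed to `pd.DataFrame` to get one row per PDF.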
