# Synopsis

We will explore how to extract information from digitally generated PDF files. Because these files already contain the actual text, we do not need any OCR.

# Words to remember

**PyPDF2**

**tabula-py**

**pdf2image**




# Read libraries

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

from colorama import Back, Fore, Style
from copy import copy, deepcopy
from pathlib import Path
from sys import path

path.append( str(Path.cwd().parent) )

## Install some necessary packages

In [None]:
conda install PyPDF2

In [None]:
conda install tabula-py

In [None]:
conda install pdf2image

In [None]:
conda install -c conda-forge poppler

In [None]:
import pdf2image
import tabula

import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
import pandas as pd

from matplotlib.patches import Circle
from PyPDF2 import PdfReader
from pylab import imread, imshow

from Amaral_libraries.my_stats import half_frame
from Amaral_libraries.my_image_library import grayscale_zoom

In [None]:
my_fontsize = 15
data_folder = Path.cwd() / 'Data' / 'Muller_2012'
results_folder = Path.cwd() / 'Generated_data'

# Load PDF files

We will work with the article:

> Christopher Muller, "Northward Migration and the Rise of Racial Disparity
in American Incarceration, 1880–1950", *American Journal of Sociology* **118** (2), 281–326 (September 2012).

Besides the full 46-page article, we also have access to single pages with particular types of information (figures, tables, equations).

We will also create PNG images from the PDF files using `pdf2image`.


In [None]:
article = data_folder / 'Muller_2012-Northward_Migration_and_the_Rise_of_Racial.pdf'

files = sorted( list( data_folder.glob('*page*.pdf') ) )

page_images = []
print("The single page PDF files in Data folder are:\n")
for i, file in enumerate( files ):
    # Print the file name itself; a hard-coded slice of the full path only works on one machine
    print(f"{i:>2} .../{file.name}")
    image = pdf2image.convert_from_path(file, 500)
    file_name = str(file).replace('.pdf', '.png')
    image[0].save(file_name, 'PNG')
    
    page_images.append(file_name)
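
The loop above builds the PNG file name with a string replacement. As a minimal sketch of a slightly safer alternative, `pathlib` can swap only the final extension. The helper name `png_path_for` and the example path are made up for illustration.

```python
from pathlib import Path

def png_path_for(pdf_path):
    """Return the path of the PNG that sits next to `pdf_path`."""
    # with_suffix only changes the final extension, so a '.pdf' that
    # happens to appear earlier in the path is left untouched.
    return Path(pdf_path).with_suffix('.png')

# Hypothetical path, for illustration only.
example = png_path_for('Data/some.pdf.folder/article_page_7.pdf')
print(example)
```

The returned `Path` can be passed directly to `image[0].save(...)`, or wrapped in `str(...)` if a string is preferred.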
    

## Display a single page

You can change the value of *n* to {0, 1, 2, 3}.

In [None]:
n = 0
print(f".../{Path(page_images[n]).name}")
page = imread( page_images[n] )

fig = plt.figure( figsize = (10, 24) )
ax = fig.add_subplot(111)

ax.imshow(page);

# Extracting text

We first open the article for reading and processing. For this, we will use the `PdfReader` class from `PyPDF2`.

We open the file in mode `rb`: `r` because we only need to read it, and `b` because a PDF must be read as bytes, not characters.

In [None]:
reader = PdfReader(open(article, 'rb'))

print( reader.pages[0] )


In [None]:
page = reader.pages[0]
print(page.get_contents())

print()
print(page.extract_text())

<br>

Note that the header and footer information appears at the start of the string:

> AJS Volume 118 Number 2 (September 2012): 281–326 281/H170152012 by The University of Chicago. All rights reserved.
0002-9602/2012/11802-0001\$10.00

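Note also that some fragments, like `/H17015`, are likely character codes that the extraction could not map to text, an issue with encodings that are not `utf-8`. Here the code probably stands in for the © symbol.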
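
One way to spot this kind of debris is a regular expression. The sketch below assumes that the codes always look like the single example above (a slash, the letter `H`, and five digits). That pattern is a guess from one sample, not a documented format.

```python
import re

# Text copied from the extracted header shown above.
sample = ("AJS Volume 118 Number 2 (September 2012): 281–326 "
          "281/H170152012 by The University of Chicago. All rights reserved.")

# Assumed pattern: a slash, the letter H, and exactly five digits.
GLYPH_CODE = re.compile(r'/H\d{5}')

def find_glyph_codes(text):
    """Return every substring that looks like an unmapped glyph code."""
    return GLYPH_CODE.findall(text)

print(find_glyph_codes(sample))
```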

<br>

# Pages with images



In [None]:
n = 2
print(f".../{Path(page_images[n]).name}")
page = imread( page_images[n] )

fig = plt.figure( figsize = (10, 24) )
ax = fig.add_subplot(111)

ax.imshow(page);

## Extract text

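Let's focus on a single page, `reader.pages[6]`, for simplicity, and extract the text as before. The index is zero-based, so this is the seventh page of the PDF, which carries the printed page number 287.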

In [None]:
n = 6
page = reader.pages[n]

print(page.get_contents())

print()
print(page.extract_text())

<br>

Note the header and footer at the start of the string again:

> Racial Disparity in Incarceration
>
> 287

They appear to be separated by a newline...


In [None]:
page.extract_text()[:120]
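
If the header lines are indeed separated by newlines, they can be peeled off before further processing. This is a minimal sketch, assuming exactly two header lines as on this page. The function name `split_running_header` and the body text in the sample are placeholders.

```python
def split_running_header(text, n_header_lines=2):
    """Split extracted page text into (header_lines, body)."""
    parts = text.split('\n', n_header_lines)
    # Fewer lines than expected: treat everything as body.
    if len(parts) <= n_header_lines:
        return [], text
    return parts[:n_header_lines], parts[n_header_lines]

# Header lines as shown above; the body is a made-up placeholder.
sample = "Racial Disparity in Incarceration\n287\nPlaceholder body text."
header, body = split_running_header(sample)
print(header)
print(body)
```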

## Extract image files

The `page` object that we obtain from `reader` has an `images` attribute.

In [None]:
imgs = page.images
print(f"There are {len(imgs)} images in page {n}.\n")


print(f"The information for the first figure is:  {imgs[0]}")
# print(type(imgs[0]))
print()

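# Path.stem drops the folder and the '.png' extension on any operating system
name = Path(page_images[2]).stem + "_image_0.jpg"

results_folder.mkdir(exist_ok=True)   # make sure the output folder exists
file_name = results_folder / name
print(f"We will save it at:\n\t.../{results_folder.name}/{name}")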
print()

with open(file_name, "wb") as fp:
    fp.write(imgs[0].data)
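
The cell above saves only the first image and assumes it is a JPEG. A hedged sketch for saving every image on a page follows. It assumes each image object exposes `name` and `data` attributes, as the objects in `page.images` appear to do. The helper `save_images` is hypothetical, and the demo uses stand-in objects so that it runs without a PDF.

```python
import tempfile
from pathlib import Path
from types import SimpleNamespace

def save_images(images, out_dir, prefix):
    """Write each image's bytes into out_dir and return the saved paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for i, image in enumerate(images):
        # Reuse the embedded image's own extension instead of assuming JPEG;
        # fall back to '.bin' when the name has none.
        suffix = Path(image.name).suffix or '.bin'
        target = out_dir / f"{prefix}_image_{i}{suffix}"
        target.write_bytes(image.data)
        saved.append(target)
    return saved

# Stand-in objects (not real PyPDF2 images) so the sketch runs without a PDF.
fake_images = [SimpleNamespace(name='Im0.jpg', data=b'abc'),
               SimpleNamespace(name='Im1.png', data=b'xyz')]

with tempfile.TemporaryDirectory() as tmp:
    saved = save_images(fake_images, Path(tmp) / 'Generated_data', 'page')
    saved_names = [p.name for p in saved]
    first_bytes = saved[0].read_bytes()

print(saved_names)
```

With the notebook's objects, the call would look like `save_images(page.images, results_folder, 'page')`.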
    


**We can then read it from the file at any time for later processing.**

In [None]:
img0 = imread(file_name) 

fig = plt.figure( figsize = (10, 10) )
ax = fig.add_subplot(111)

ax.imshow(img0);

# Pages with tables

Another type of information that frequently appears in documents is tabular data.

`PyPDF2` does not extract tables from documents, so we will use a different package: `tabula-py`, which is imported as `tabula`.

In [None]:
n = 3
print(f".../{Path(page_images[n]).name}")
page7 = imread( page_images[n] )

fig = plt.figure( figsize = (10, 24) )
ax = fig.add_subplot(111)

ax.imshow(page7);

## Extract text

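Let's focus on `reader.pages[7]` alone, for simplicity, and extract the text as before. With zero-based indexing this is the eighth page of the PDF, which carries the printed page number 288.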

In [None]:
page = reader.pages[7]

print(page.get_contents())

print()
print(page.extract_text())


<br>

As before, the header and footer appear at the top. Unfortunately, there is no separation between the page number and the start of the text on the page:

> American Journal of Sociology
>
> 288TABLE 1

In [None]:
page.extract_text()[:120]

This is not great: it would take a lot of processing to extract the tabular data from this text...
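
The fused page number, at least, is easy to undo when we know which number to expect. The sketch below assumes that `reader.pages[i]` carries the printed number `281 + i`, which is consistent with the two pages seen so far (287 at index 6, 288 at index 7). The helper name is made up.

```python
FIRST_PRINTED_PAGE = 281  # from the citation: pages 281–326

def strip_page_number(line, page_index):
    """Remove the printed page number fused to the start of `line`."""
    printed = str(FIRST_PRINTED_PAGE + page_index)
    if line.startswith(printed):
        return line[len(printed):]
    # Leave the line alone when it does not start with the expected number.
    return line

print(strip_page_number("288TABLE 1", page_index=7))
```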

## Using tabula-py

`tabula` reads a PDF and returns the tables it finds as a list of `pandas` `DataFrame`s.

In [None]:
list_df = tabula.read_pdf( page_images[n].replace('.png', '.pdf'), pages = 1,
                           encoding = 'utf-8')

df = list_df[0]

**Recall what our table looks like**

In [None]:
fig = plt.figure( figsize = (10, 24) )
ax = fig.add_subplot(111)

ax.imshow(page7[400:1600, 300:2800]);

Let's first see what columns we got...

In [None]:
df.columns

OK, so many column titles are missing. 


In [None]:
df['Unnamed: 0']

In [None]:
df['Native']

It is also clear that, in some cases, two columns of data in the table were read into a single `DataFrame` column.

Two factors are responsible for these failures:

> The first is that the text encoding in the PDF does not map cleanly to `utf-8`, so the minus signs get garbled.
>
> The second is that some of the columns do not have a **flat structure**.


This means that we need to do some further processing in order to extract the data.

The first column is the one with the list of time periods. Each entry is likely coded as a string.

In [None]:
print(type(df.loc[2,'Unnamed: 0']))
print(df.loc[2,'Unnamed: 0'])
df.loc[:,'Unnamed: 0']

<br>

Let's extract the time periods and add them to a **new and clean** `DataFrame`.

In [None]:
my_columns = ['Time periods']
time_periods = []
for i in range(2, 9):
    temp = df.loc[i,'Unnamed: 0']
    value = temp.split()[0]
    time_periods.append(value)

print(time_periods)

clean_df = pd.DataFrame(time_periods, columns = my_columns) 
clean_df

Moving on to other columns. Recall from above that:

> some contain data from multiple columns
>
> `-` is extracted as `!`
>
> numbers are printed with commas to make it easier for human eyes to read, but that is not something that `int` knows how to operate on.

We need to:

> manually create the correct column names, and split the data, 
> 
> replace `!` with `-`, 
>
> remove `,` from numbers.

In [None]:
my_columns.append('Native Nonwhites - North')
my_columns.append('Native Nonwhites - South')
data1 = []
data2 = []
for i in range(2, 9):
    temp = df.loc[i,'Native']
    val1, val2 = temp.split()
    data1.append( int(val1.replace('!', '-').replace(',', '')) )
    data2.append( int(val2.replace('!', '-').replace(',', '')) )
                 

print(data1)
print(data2)

<br> 

## Re-factoring

We can make the code above into a function that will generate the data to be added to `clean_df`

In [None]:
def clean_data_from_tabula(df, column_name, start_index, end_index):
    """
    Split a tabula column that holds two values per cell into two lists,
    replacing `!` with `-` and removing `,` from numbers.
    
    inputs:
        df - dataframe returned by tabula
        column_name - str with name of the column to be processed
        start_index - int, first row of df to process
        end_index - int, row of df at which to stop (exclusive)
        
    returns:
        data1 - list of int
        data2 - list of int
        
    """
    data1 = []
    data2 = []
    for i in range(start_index, end_index):
        temp = df.loc[i,column_name]
        val1, val2 = temp.split()
        data1.append( int(val1.replace('!', '-').replace(',', '')) )
        data2.append( int(val2.replace('!', '-').replace(',', '')) )
        
    return data1, data2
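
The same cleaning can also be written without an explicit loop, using the `pandas` string methods. This is only a sketch of an alternative, not the approach used in the rest of the notebook. The function name `split_two_columns` and the sample values are made up.

```python
import pandas as pd

def split_two_columns(series):
    """Turn cells like '12,345 !678' into two integer columns."""
    cleaned = (series.str.replace('!', '-', regex=False)   # restore minus signs
                     .str.replace(',', '', regex=False))   # drop thousands separators
    # Split on whitespace into separate columns, then convert to integers.
    return cleaned.str.split(expand=True).astype(int)

# Made-up values that mimic the format of the cells returned by tabula.
raw = pd.Series(["12,345 !678", "!9 1,000"])
pairs = split_two_columns(raw)
print(pairs)
```

Applied to the notebook's data, this would be called on a slice such as `df.loc[2:, 'Native']`, with the two resulting columns assigned to `clean_df`.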

In [None]:
data1, data2 = clean_data_from_tabula(df, 'Native', 2, len(df))

clean_df['Native Nonwhites - North'] = data1
clean_df['Native Nonwhites - South'] = data2

clean_df

In [None]:
data1, data2 = clean_data_from_tabula(df, 'Unnamed: 2', 2, len(df))

clean_df['Native Whites - North'] = data1
clean_df['Native Whites - South'] = data2

clean_df

In [None]:
data1, data2 = clean_data_from_tabula(df, 'Foreign', 2, len(df))

clean_df['Foreign Whites - North'] = data1
clean_df['Foreign Whites - South'] = data2

## Checking our work

In [None]:
fig = plt.figure( figsize = (10, 24) )
ax = fig.add_subplot(111)

ax.imshow(page7[400:1600, 300:2800]);

In [None]:
clean_df

**Everything looks good!**

<br>

# Pages with equations

Equations and mathematical symbols are another type of information that frequently appears in documents.

`PyPDF2` does not extract either from documents, so we would need to use a different package.

In [None]:
n = 1
print(f".../{Path(page_images[n]).name}")
page10 = imread( page_images[n] )

fig = plt.figure( figsize = (10, 24) )
ax = fig.add_subplot(111)

ax.imshow(page10);

## Extract text

Let's focus on the tenth page of the PDF alone (`reader.pages[9]`, since the index is zero-based), for simplicity, and extract the text as before.

In [None]:
page = reader.pages[9]
print(page.get_contents())

print()
text_in_page = page.extract_text()
print(text_in_page)

<br>

Clearly some things are getting messed up...

In [None]:
print(text_in_page[33:350].split())


The presence of italic font, symbols, and equations really messes up the text extraction. This probably has less to do with `utf8` itself, which can represent mathematical symbols, than with how the PDF stores these glyphs: the embedded fonts may not provide a mapping from their glyph codes to Unicode characters, so `PyPDF2` cannot recover the intended text.

A possible solution for handling this type of information is to select portions of the page image and process them with other software. [Mathpix](https://mathpix.com/ocr) is a web service that extracts `LaTeX` formatted equations from images.
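
Selecting a portion of a page image amounts to slicing the array returned by `imread`. The sketch below shows a small helper that clamps the requested region to the image bounds. The name `crop_region`, the blank stand-in image, and the coordinates are illustrative assumptions.

```python
import numpy as np

def crop_region(image, top, bottom, left, right):
    """Return image[top:bottom, left:right], clamped to the image bounds."""
    height, width = image.shape[:2]
    top, bottom = max(0, top), min(height, bottom)
    left, right = max(0, left), min(width, right)
    return image[top:bottom, left:right]

# Blank stand-in for a page image (rows, columns, colour channels).
blank_page = np.zeros((1000, 800, 3))
region = crop_region(blank_page, 400, 1600, 300, 2800)
print(region.shape)
```

The cropped array could then be saved with `plt.imsave` and passed to whichever equation-recognition tool is chosen.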