# PDF Image Extraction Experimental Notebook

Before running this notebook, make sure to build the pdf-content-extraction Docker image:
```
     $ cd docker && ./build
```


During the experimental setup, all PDF files from the dataset are copied to a temporary
folder named `pdfs`, and each file is renamed according to its PDF-KEY in
`pdf-image-experimental-setup.json`.

Inside the JSON file, you will find the following structure:
```
PDF-KEY: {
  'dataset_path': 'dataset/path_to_pdf_file/.../pdf_file.pdf',
  'number_of_original_figures': n <int>,
  'original_figures': {'fig1': 'dataset/path_to_pdf_figure_1/fig1.<ext>',
                       'fig2': 'dataset/path_to_pdf_figure_2/fig2.<ext>',
                       ...
                       'figN': 'dataset/path_to_pdf_figure_N/figN.<ext>'}
}
```
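
As a quick sanity check on this structure, the sketch below verifies that one entry has the three fields listed above and that `number_of_original_figures` agrees with the number of items in `original_figures`. The helper name `validate_entry` and the example entry are hypothetical and only serve as an illustration; they are not part of the experiment code.

```python
def validate_entry(key, entry):
    """Return a list of problems found in one setup entry (empty if none)."""
    problems = []
    for field in ('dataset_path', 'number_of_original_figures', 'original_figures'):
        if field not in entry:
            problems.append(f'{key}: missing field {field!r}')
    if not problems:
        declared = entry['number_of_original_figures']
        listed = len(entry['original_figures'])
        if declared != listed:
            problems.append(f'{key}: declares {declared} figures but lists {listed}')
    return problems

# Hypothetical entry, for illustration only (paths are made up)
example = {
    'dataset_path': 'dataset/example/doc.pdf',
    'number_of_original_figures': 2,
    'original_figures': {'fig1': 'dataset/example/fig1.png',
                         'fig2': 'dataset/example/fig2.png'},
}
print(validate_entry('example-key', example))
```

Running the same check over every entry of the loaded JSON before the experiment starts would surface malformed entries early.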


This notebook uses the script `run.sh` to extract the images from a PDF file.


In [4]:
# Import libraries
from glob import glob
import shutil
import json
import subprocess
import os
from tqdm.notebook import tqdm

In [2]:
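def extract_image(pdf_file, output_path='.'):
    """Run ./run.sh on pdf_file and return the script's stdout as text."""
    # Pass the arguments as a list so paths with spaces are not split by a shell
    cmd = ['./run.sh', pdf_file, output_path]
    result = subprocess.run(cmd, stdout=subprocess.PIPE)
    return result.stdout.decode('utf-8')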
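
`extract_image` returns whatever the script prints, even when the script fails. A possible variant that raises on failure is sketched below. It assumes that `run.sh` exits with a non-zero status when extraction fails, which is not confirmed here; the name `extract_image_checked` and the `script` parameter are hypothetical additions.

```python
import subprocess

def extract_image_checked(pdf_file, output_path='.', script='./run.sh'):
    """Run the extraction script and raise if it exits with a non-zero status.

    Assumption: the script signals failure through its exit code.
    """
    result = subprocess.run([script, pdf_file, output_path],
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    if result.returncode != 0:
        # Include stderr so the cause of the failure is visible in the notebook
        message = result.stderr.decode('utf-8', errors='replace')
        raise RuntimeError(
            f'{script} failed for {pdf_file} (exit code {result.returncode}): {message}')
    return result.stdout.decode('utf-8')
```

Capturing `stderr` separately keeps the returned text identical to what `extract_image` returns on success.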

## Testing the extraction on sample files

In [3]:
# Create the test folder (no error if it already exists)
os.makedirs('test/', exist_ok=True)
# Run simple tests
sample_1 = extract_image('sample_1.pdf','test/')
print(sample_1)
sample_2 = extract_image('sample_2.pdf','test/')
print(sample_2)

sample_1.pdf
/INPUT/sample_1.pdf
Extraction Done!

sample_2.pdf
/INPUT/sample_2.pdf
Extraction Done!



# Experiment setup

In [None]:
# Load the json file to a variable
with open('pdf-image-experimental-setup.json','r') as json_pointer:
    exp_setup = json.load(json_pointer)

In [None]:
# Create the pdfs folder (no error if it already exists)
os.makedirs('pdfs', exist_ok=True)

In [None]:
# Copy all PDF files to the temporary folder, renamed by their PDF-KEY
pdfs = []
total_docs = len(exp_setup)
for keyname, value in tqdm(exp_setup.items(), total=total_docs):
    pdf_keyname = f'pdfs/{keyname}.pdf'
    src_name = value['dataset_path']
    shutil.copy(src_name,pdf_keyname)
    pdfs.append(pdf_keyname)
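
`shutil.copy` raises as soon as it meets a `dataset_path` that does not exist, which stops the copy partway through. A minimal sketch for listing such entries beforehand follows; the helper name `find_missing_sources` is hypothetical, and it assumes every entry has a `dataset_path` field as described in the JSON structure.

```python
import os

def find_missing_sources(setup):
    """Return the keys whose 'dataset_path' does not point to an existing file."""
    return [key for key, value in setup.items()
            if not os.path.isfile(value['dataset_path'])]
```

Calling `find_missing_sources(exp_setup)` before the copy loop gives the list of keys to fix or skip.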

# Run Method

In [None]:
# Extract figures from the PDFs in ./pdfs
for pdf_index, pdf in enumerate(pdfs,start=1):
    extract_image(pdf)
    print(f"Extracting {pdf_index} / {total_docs}",end='\r',flush=True)
print('\nDone')