## Install dependencies

In [None]:
%pip install beautifulsoup4==4.12.3 grobid-client-python==0.0.8 lxml==5.3.0

## Extract from the sample PDF

In [None]:
import multiprocessing
import os

from bs4 import BeautifulSoup
from grobid_client.grobid_client import GrobidClient

Before creating the client, make sure that the GROBID server is up and running:

```sh
docker compose up grobid
```
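
To confirm that the server is reachable before creating the client, one option is to poll GROBID's `/api/isalive` health endpoint. The following is a minimal sketch using only the standard library; the helper names `isalive_url` and `grobid_is_alive` and the default timeout are illustrative assumptions, not part of `grobid_client`.

```python
import urllib.request


def isalive_url(base_url):
    """Build the health-check URL for a GROBID base URL (hypothetical helper)."""
    return base_url.rstrip("/") + "/api/isalive"


def grobid_is_alive(base_url, timeout=5.0):
    """Return True if the server answers the health check with HTTP 200.

    Connection failures, timeouts, HTTP errors, and malformed URLs all
    yield False instead of raising.
    """
    try:
        with urllib.request.urlopen(isalive_url(base_url), timeout=timeout) as response:
            return response.status == 200
    except (OSError, ValueError):
        # URLError and HTTPError are OSError subclasses; ValueError covers bad URLs.
        return False
```

Calling `grobid_is_alive("http://localhost:8070")` before constructing the client gives an early, explicit signal if the container has not finished starting.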

In [None]:
grobid_server = os.environ.get("GROBID_SERVICE_URL", "http://localhost:8070")
n = 2 * multiprocessing.cpu_count()  # Concurrent requests; cpu_count() already counts logical cores
output = "../data/grobid"
pdfs = "../aicacia/extraction/example/pdf"

client = GrobidClient(grobid_server=grobid_server)

In [None]:
client.process("processFulltextDocument", pdfs, output=output, n=n)

## Parsing the TEI output with Beautiful Soup

In [None]:
# GROBID writes its TEI output as UTF-8, so read it with an explicit encoding
with open("../data/grobid/sample.grobid.tei.xml", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "lxml-xml")

### Extracting text

In [None]:
title_stmt = soup.find("titleStmt")
title_stmt.title.text

'Bi-criteria Algorithm for Scheduling Jobs on Cluster Platforms'

In [None]:
abstract = soup.find("abstract")
print(abstract.text)


We describe in this paper a new method for building an efficient algorithm for scheduling jobs in a cluster. Jobs are considered as parallel tasks (PT) which can be scheduled on any number of processors. The main feature is to consider two criteria that are optimized together. These criteria are the makespan and the weighted minimal average completion time (minsum). They are chosen for their complementarity, to be able to represent both user-oriented objectives and system administrator objectives.We propose an algorithm based on a batch policy with increasing batch sizes, with a smart selection of jobs in each batch. This algorithm is assessed by intensive simulation results, compared to a new lower bound (obtained by a relaxation of ILP) of the optimal schedules for both criteria separately. It is currently implemented in an actual real-size cluster platform.



In [None]:
# Each <div> under <text><body> holds one section of the paper
divs = soup.find("text").body.find_all("div")
first_paragraph = divs[0].p.text
first_paragraph_title = divs[0].head.text

print(first_paragraph_title, end="\n\n")
print(first_paragraph)

INTRODUCTION 1.1 Cluster computing

The last few years have been characterized by huge technological changes in the area of parallel and distributed computing. Today, powerful machines are available at low price everywhere in the world. The main visible line of such changes is the large spreading of clusters which consist in a collection of tens or hundreds of standard almost identical processors connected together by a high speed interconnection network [6]. The next natural step is the extension to local sets of clusters or to geographically distant grids [10].
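
The same pattern can be extended to every section. The sketch below assumes each section is a `<div>` with an optional `<head>` and one or more `<p>` children, as in the cell above; the function name `extract_sections` and the inline TEI fragment are hypothetical stand-ins, not GROBID output.

```python
from bs4 import BeautifulSoup

# Hypothetical, hand-written TEI fragment used only to illustrate the loop.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div><head>Introduction</head><p>First paragraph.</p><p>Second paragraph.</p></div>
      <div><p>A section without a heading.</p></div>
    </body>
  </text>
</TEI>"""


def extract_sections(soup):
    """Return one dict per <div> in the body, with its title and paragraphs."""
    body = soup.find("body")
    if body is None:
        return []
    sections = []
    for div in body.find_all("div"):
        # Only direct children, so nested elements are not attributed twice.
        head = div.find("head", recursive=False)
        paragraphs = [
            p.get_text(" ", strip=True) for p in div.find_all("p", recursive=False)
        ]
        sections.append(
            {
                "title": head.get_text(" ", strip=True) if head else None,
                "paragraphs": paragraphs,
            }
        )
    return sections


sections = extract_sections(BeautifulSoup(SAMPLE_TEI, "lxml-xml"))
```

Checking for a missing `<head>` avoids the `AttributeError` that `div.head.text` would raise on a section without a heading.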


### Extracting metadata

In [None]:
# TODO(jason.prasad): metadata extraction is still in progress. GROBID can optionally
# consolidate extracted metadata against Crossref: https://grobid.readthedocs.io/en/latest/Consolidation/
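
Until consolidation is set up, some metadata can be read directly from the TEI header. The sketch below is one possible starting point, not the intended final approach: it assumes author names sit under `<analytic>` as `<persName>` elements with `<forename>` and `<surname>` children, and it uses a hypothetical hand-written fragment and a hypothetical helper name, `extract_authors`.

```python
from bs4 import BeautifulSoup

# Hypothetical TEI header fragment; real GROBID output carries more fields.
SAMPLE_HEADER = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <sourceDesc>
        <biblStruct>
          <analytic>
            <author><persName><forename type="first">Ada</forename><surname>Lovelace</surname></persName></author>
            <author><persName><forename type="first">Alan</forename><forename type="middle">M</forename><surname>Turing</surname></persName></author>
          </analytic>
        </biblStruct>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>"""


def extract_authors(soup):
    """Return author names as 'Forename [Middle] Surname' strings."""
    analytic = soup.find("analytic")
    if analytic is None:
        return []
    names = []
    for pers in analytic.find_all("persName"):
        # Name parts are collected in document order.
        parts = [t.get_text(strip=True) for t in pers.find_all(["forename", "surname"])]
        names.append(" ".join(part for part in parts if part))
    return names


authors = extract_authors(BeautifulSoup(SAMPLE_HEADER, "lxml-xml"))
```

Restricting the search to `<analytic>` keeps authors of cited works, which appear elsewhere in the TEI document, out of the result.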