## Install dependencies

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
%pip install -e ../aicacia/extraction aicacia-document-exporter==0.1.4

## Extract from TEI files and populate documents

In [None]:
!mkdir -p ../data/grobid && cp ../aicacia/extraction/example/grobid/sample.grobid.tei.xml ../data/grobid/

In [None]:
import sqlite3

from aicacia_document_exporter.Document import (Document, DocumentSection,
                                                MediaType)
from aicacia_document_exporter.SimpleFileDocumentExporter import \
    SimpleFileDocumentExporter
from aicacia_extraction.grobid import TEIDocument

### Using TEIDocument from aicacia_extraction

In [None]:
tei_document = TEIDocument("../data/grobid/sample.grobid.tei.xml")

In [None]:
tei_document.title

'Bi-criteria Algorithm for Scheduling Jobs on Cluster Platforms'

In [None]:
tei_document.sections[0]

Section(title='INTRODUCTION 1.1 Cluster computing', text='The last few years have been characterized by huge technological changes in the area of parallel and distributed computing. Today, powerful machines are available at low price everywhere in the world. The main visible line of such changes is the large spreading of clusters which consist in a collection of tens or hundreds of standard almost identical processors connected together by a high speed interconnection network [6]. The next natural step is the extension to local sets of clusters or to geographically distant grids [10]. In the last issue of the Top500 ranking (from November 2003 [1]), 52 networks of workstations (NOW) of different kinds were listed and 123 entries are clusters sold either by IBM, HP or Dell. Looking at previous rankings we can see that this number (within the Top500) approximately doubled each year. This democratization of clusters calls for new practical administration tools. Even if more and more appli

In [None]:
tei_document.figures[0]

Figure(title='Figure 1 :', label='1', description='Figure 1: Job submission in clusters.')
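Section text can run to thousands of characters, so a short preview is often easier to scan than the raw repr. The following is a minimal sketch, not part of `aicacia_extraction`: `preview_sections` is a hypothetical helper that only assumes each item has the `title` and `text` attributes visible in the output above, and `FakeSection` is a stand-in used purely for illustration.

```python
import textwrap
from collections import namedtuple


def preview_sections(sections, width=80):
    """Return one shortened 'title: text' line per section."""
    lines = []
    for section in sections:
        line = f"{section.title}: {section.text}"
        # shorten() collapses whitespace and truncates on word boundaries.
        lines.append(textwrap.shorten(line, width=width, placeholder="..."))
    return lines


# Stand-in objects for illustration only; not the library's Section class.
FakeSection = namedtuple("FakeSection", ["title", "text"])
demo_sections = [FakeSection("Intro", "word " * 50)]
demo_preview = preview_sections(demo_sections, width=40)
print(demo_preview)
```

In the notebook, the same helper could be called as `preview_sections(tei_document.sections)`.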

### Using the DocumentExporter from aicacia_document_exporter

Make sure the SQLite database file `../data/db/document.db` exists before running the next cell, for example by running `sqlite3 document.db` inside `../data/db`.
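The database file can also be prepared from Python instead of the `sqlite3` command line. This is a minimal sketch using only the standard library; `ensure_sqlite_file` is a hypothetical helper name, and it assumes that an empty SQLite file in an existing directory is all the exporter needs.

```python
import sqlite3
from pathlib import Path


def ensure_sqlite_file(db_path):
    """Create the parent directory and an empty SQLite file if missing."""
    path = Path(db_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Opening a connection creates the file; close it right away.
    sqlite3.connect(str(path)).close()
    return path
```

For this notebook the call would be `ensure_sqlite_file("../data/db/document.db")`.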

In [None]:
# The abstract comes first, so it starts at offset 0.
abstract_section = DocumentSection(
    tei_document.abstract, MediaType.TEXT, 0, metadata={"semantic_position": "abstract"}
)
document_sections = [abstract_section]
# Running character offset at which the next section starts.
offset = len(abstract_section.content)

for section in tei_document.sections:
    # Each body section stores its title and text as one content string.
    document_section = DocumentSection(
        f"{section.title} {section.text}",
        MediaType.TEXT,
        offset,
        metadata={"semantic_position": "body"},
    )
    document_sections.append(document_section)
    offset += len(document_section.content)

# Write the assembled document to the SQLite file.
with SimpleFileDocumentExporter("../data/db/document.db") as exporter:
    exporter.insert([Document(title=tei_document.title, sections=document_sections)])

In [None]:
# Viewing the inserted document
con = sqlite3.connect("../data/db/document.db")
cur = con.cursor()
cur.execute("SELECT * FROM docs").fetchall()

[('dce5d711-a850-4728-94bb-2a38939fed5f',
  None,
  None,
  '{"title": "Bi-criteria Algorithm for Scheduling Jobs on Cluster Platforms", "sections": [{"content": "We describe in this paper a new method for building an efficient algorithm for scheduling jobs in a cluster. Jobs are considered as parallel tasks (PT) which can be scheduled on any number of processors. The main feature is to consider two criteria that are optimized together. These criteria are the makespan and the weighted minimal average completion time (minsum). They are chosen for their complementarity, to be able to represent both user-oriented objectives and system administrator objectives.We propose an algorithm based on a batch policy with increasing batch sizes, with a smart selection of jobs in each batch. This algorithm is assessed by intensive simulation results, compared to a new lower bound (obtained by a relaxation of ILP) of the optimal schedules for both criteria separately. It is currently implemented in an

In [None]:
con.close()
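In the query output above, the last column of each `docs` row holds the document as a JSON string with `title` and `sections` keys, where each section carries a `content` field. The following sketch decodes such a string into a compact summary. `summarize_doc_json` is a hypothetical helper, and the key names are taken from the visible output rather than from a documented schema, so treat them as assumptions.

```python
import json


def summarize_doc_json(payload):
    """Decode a stored document JSON string into a small summary dict."""
    doc = json.loads(payload)
    sections = doc.get("sections", [])
    return {
        "title": doc.get("title"),
        "section_count": len(sections),
        "total_chars": sum(len(s.get("content", "")) for s in sections),
    }


# Tiny made-up payload for illustration; not data from the database.
example_payload = json.dumps(
    {"title": "T", "sections": [{"content": "abc"}, {"content": "de"}]}
)
example_summary = summarize_doc_json(example_payload)
print(example_summary)
```

With a row fetched before the connection is closed, the call would look like `summarize_doc_json(row[3])`, assuming the JSON sits in the fourth column as shown in the output.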