# 1. Overview and definitions

## a. Definition and objectives

### 1. What is the purpose of indexing data?

The number of documents available in a collection can make retrieving information difficult. Without an index, the only way to access a particular document is to check every document one by one. This is called a *sequential search*. As you can imagine, this is not a very efficient method: the more documents there are, the longer the search takes. To overcome this problem, it is possible to create indexes on various fields of a document. A field corresponds to a part of a document (its title, its publication date, its text, *etc.*). Conceptually, an index associates the value of a field with the location of the document in the system. For example, in a library, it is much more efficient to consult the catalog to find out where a particular book is located than to scan through all the books until the relevant one is found.
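The difference between a sequential search and an index lookup can be sketched in a few lines of Python. This is only an illustration: the toy collection, the function names and the use of a list position as the "location" of a document are assumptions made for the example, not part of any particular system.

```python
# Toy collection: the "location" of a document is its position in the list.
documents = [
    {"title": "Moby Dick", "year": 1851},
    {"title": "Les Misérables", "year": 1862},
    {"title": "Dracula", "year": 1897},
]


def sequential_search(docs, field, value):
    """Check every document one by one; the cost grows with the collection."""
    return [pos for pos, doc in enumerate(docs) if doc[field] == value]


def build_index(docs, field):
    """Associate each value of `field` with the locations of matching documents."""
    index = {}
    for pos, doc in enumerate(docs):
        index.setdefault(doc[field], []).append(pos)
    return index


# The index is built once, then reused for every query on the title field.
title_index = build_index(documents, "title")

# Both approaches return the same location, but the index lookup
# does not need to scan the whole collection.
print(sequential_search(documents, "title", "Dracula"))
print(title_index.get("Dracula", []))
```

Building the index has a cost of its own (time and memory), which only pays off when the same field is queried repeatedly.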

Indexing data is a crucial part of archival work as it is the basis for the public use of historical sources. 
An example of the indexing process is the Universal Decimal Classification (UDC). Created at the end of the 19th century, it is a hierarchical system that classifies information into 9 main classes representing human knowledge (social sciences, mathematics, philosophy, *etc.*). These classes are divided into 10 subparts that can themselves be further divided. The UDC also integrates special characters allowing for more precise queries. For example, the query **17:7** concerns documents about Ethics (category 17) in relation to Arts (category 7).
This system was used in the creation of the Mundaneum in Belgium, an institution whose objective was to collect and archive knowledge. It contains about 12 million cards, classified according to the UDC. It is said to be the first search engine.

![Mundaneum](img/mundaneum.jpg)
CC-BY Mark Wathieu

Using digital systems to archive documents brings a paradigm shift, both for archivists and for end users. The quantity of available information keeps growing. On the one hand, this is extremely useful, as it allows researchers to *easily* work on massive collections of documents. On the other hand, it brings new difficulties, challenges and pitfalls.

## b. Tools and methodology

When indexing a large collection of documents, it is necessary to think about what should actually be indexed. Remember that indexing is a trade-off: it costs disk space and processing power in exchange for faster searches. Thus, indexing too much data can eventually hurt the search capabilities of a system that is not ready to handle many indexes.

Another thing to think about is the end users' needs. It is important to evaluate these needs beforehand so as not to waste processing power on indexes that will never be used. Let's take the NewsEye project as an example. One of its use cases is to highlight individual words in newspaper pages. The pipeline used in this project generates, for every newspaper page, an XML file containing the position of every word on the page. One might think it is a good idea to index individual words with their positions so as to ease their retrieval. However, because there are many pages and because each page can contain up to 10,000 words, the number of indexed words would easily reach hundreds of millions, even billions. This is not desirable, as it would dramatically increase the size of the index and thus slow down the entire system. A solution to this problem is not to index individual words, but rather to parse the XML file and extract the words and their positions only when needed (after a particular user query).
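The "parse when needed" approach could look like the following minimal sketch. The XML layout used here (a `page` element containing `word` elements with `x`, `y`, `w` and `h` attributes) is a hypothetical format invented for the example; the actual files produced by the NewsEye pipeline may be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical page file: element and attribute names are assumptions.
PAGE_XML = """
<page id="p1">
  <word x="10" y="20" w="40" h="12">Solr</word>
  <word x="55" y="20" w="30" h="12">index</word>
  <word x="10" y="40" w="40" h="12">solr</word>
</page>
""".strip()


def find_word_positions(page_xml, query):
    """Return the bounding boxes (x, y, w, h) of every occurrence of `query`."""
    root = ET.fromstring(page_xml)
    boxes = []
    for word in root.iter("word"):
        # Case-insensitive comparison; a real system would likely normalize more.
        if (word.text or "").lower() == query.lower():
            boxes.append(tuple(int(word.get(key)) for key in ("x", "y", "w", "h")))
    return boxes


# Called only after a user query has selected this page.
print(find_word_positions(PAGE_XML, "solr"))
```

With this design, the index only needs to tell us which pages match a query; the word positions are read from the page file at display time, so they never have to be stored in the index.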

During this course, we will learn how to use Solr, a software platform for indexing and retrieving data. Based on Apache Lucene, it can ingest large quantities of data and make them available for search through an API. The principles we will study in this course also apply to other indexing software.
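To give a first idea of what searching through an API means, here is a minimal sketch that builds the URL of a Solr search request. It assumes a Solr instance running locally on the default port 8983 and a core named `newspapers`; the core name, the `title` field and the helper function are hypothetical and only serve the example.

```python
from urllib.parse import urlencode


def build_select_url(base_url, core, query, rows=10):
    """Build the URL of a search request against Solr's /select handler."""
    params = {
        "q": query,    # the search query, e.g. "field:value"
        "rows": rows,  # maximum number of results to return
        "wt": "json",  # ask for a JSON response
    }
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"


# "newspapers" and "title" are hypothetical core and field names.
url = build_select_url("http://localhost:8983", "newspapers", "title:Dracula")
print(url)
```

Sending this URL with any HTTP client (a browser, `curl`, or `urllib.request`) would return the matching documents, provided a Solr server with such a core is actually running.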