# A concise introduction to LiLa

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#General-principles" data-toc-modified-id="General-principles-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>General principles</a></span></li><li><span><a href="#The-architecture-of-LiLa" data-toc-modified-id="The-architecture-of-LiLa-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>The architecture of LiLa</a></span><ul class="toc-item"><li><span><a href="#Lemmas-(and-hypolemmas)" data-toc-modified-id="Lemmas-(and-hypolemmas)-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Lemmas (and hypolemmas)</a></span></li><li><span><a href="#Lexical-resources" data-toc-modified-id="Lexical-resources-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Lexical resources</a></span></li></ul></li><li><span><a href="#Some-bibliography" data-toc-modified-id="Some-bibliography-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Some bibliography</a></span></li></ul></div>

This file provides some theoretical background for the LiLa collection. If you're eager to see the Python library in action, jump to the [Working with pylila]() notebook!

## General principles

[LiLa](https://lila-erc.eu/) is a collection of linguistic resources for Latin published using Semantic Web standards and interconnected according to the [Linked-Open-Data](https://en.wikipedia.org/wiki/Linked_data) paradigm.

LiLa brings corpora, lexicons and NLP tools together. The guiding principle of LiLa's architecture is simple:
* lexicons describe *words* (technically, entries, which can be multi-word expressions)
* corpora and digital texts are made of tokens, which are *occurrences of words*
* NLP tools process tokens; they are often chained into pipelines, which usually include lemmatization (see below) as one of the first steps.

The process that brings them together is the NLP task known as [Lemmatization](https://en.wikipedia.org/wiki/Lemmatisation). For a morphologically rich language like Latin, this task means reducing the forms of a word's inflectional paradigm to a base form, conventionally the citation form. For example, the form *rosas* (accusative plural of a first-declension noun) is traced back to the nominative singular *rosa*. This is also the form that you would use to look the word up in a dictionary. In other words, the *lemma* is also the form that lexicons use to index their entries.

For these reasons, the lemma form is the perfect point for the 3 types of resources to meet, because:
* entries in lexicons are indexed using lemmas;
* corpus tokens can be lemmatized, i.e. tokens can be associated with lemmas;
* lemmatizers output lemmatized texts (i.e. texts where tokens are associated with their lemmas).
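
The three points above can be sketched with a toy example. This is only an illustration in plain Python, not the `pylila` API: the lemma identifiers, the lexicon glosses and the helper names are all made up for the sketch.

```python
from collections import defaultdict

# Toy lemma bank: lemma identifier -> citation form.
# These identifiers are hypothetical; they are not real LiLa URIs.
LEMMA_BANK = {"lemma:rosa": "rosa", "lemma:amo": "amo"}

# Toy lexicon: entries are indexed by lemma identifier.
LEXICON = {
    "lemma:rosa": "rose (noun, first declension)",
    "lemma:amo": "to love (verb, first conjugation)",
}

# Toy lemmatizer output: each token is paired with its lemma identifier.
LEMMATIZED_TEXT = [
    ("rosas", "lemma:rosa"),
    ("amat", "lemma:amo"),
    ("rosae", "lemma:rosa"),
]


def tokens_by_lemma(lemmatized):
    """Group the tokens of a lemmatized text under their lemma identifier."""
    index = defaultdict(list)
    for token, lemma_id in lemmatized:
        index[lemma_id].append(token)
    return dict(index)


def describe(lemma_id):
    """Collect everything that meets at one lemma: form, entry, tokens."""
    return {
        "lemma": LEMMA_BANK[lemma_id],
        "entry": LEXICON.get(lemma_id),
        "tokens": tokens_by_lemma(LEMMATIZED_TEXT).get(lemma_id, []),
    }


print(describe("lemma:rosa"))
```

Because both the lexicon and the lemmatized text point at the same lemma identifier, a single lookup is enough to move from a dictionary entry to its occurrences in a text, and back.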

## The architecture of LiLa

### Lemmas (and hypolemmas)

According to these principles, the core of LiLa is the [LiLa Lemma Bank](http://lila-erc.eu/data/id/lemma/LemmaBank), i.e. a collection of Latin wordforms that are potentially usable to lemmatize corpora and index lexical entries.

Currently the Lemma Bank includes about 200k forms, divided into two classes:
* Lemmas: the ordinary forms used to lemmatize.
* Hypolemmas: a sub-class of the general Lemma class used for potentially ambiguous cases.

To understand what hypolemmas are, consider a couple of English examples. Would you lemmatize the form *quickly* under the adverb *quickly* or the adjective *quick*? The superlative *best* under *best* or under *good*? *Flying* (as in *flying machine*) under the adjective *flying* or the verb *fly*? Each of these solutions makes sense, and we cannot know *a priori* which one the builders of the corpora or lexicons that we want to link have adopted.

That's why we introduced hypolemmas! Hypolemmas are forms that can be interpreted either as forms of an inflectional paradigm *or* as lemmas in their own right. Hypolemmas are lemmas, but they are also connected to the "hyperlemma" that they can be lemmatized under. As in the English examples, typical hypolemmas are:
* participles (hypolemmas of the main verb)
* de-adjectival adverbs (like *quickly* from *quick*)
* superlative/comparative adjectives
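
The relation between a hypolemma and its hyperlemma can be sketched as a small data structure. The class and function names below are hypothetical and chosen for this illustration only; they do not reflect how LiLa or `pylila` actually store lemmas. The sketch reuses the English examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lemma:
    """An ordinary lemma: a citation form plus a part-of-speech label."""
    form: str
    pos: str


@dataclass(frozen=True)
class Hypolemma(Lemma):
    """A lemma that can also be lemmatized under another lemma."""
    hyperlemma: Lemma


def candidate_lemmas(lemma):
    """Return every form under which an occurrence could be lemmatized."""
    if isinstance(lemma, Hypolemma):
        return [lemma.form, lemma.hyperlemma.form]
    return [lemma.form]


quick = Lemma("quick", "ADJ")
quickly = Hypolemma("quickly", "ADV", quick)

# A corpus that lemmatizes "quickly" under "quick" and a lexicon that has
# a separate entry for "quickly" can both be matched through this link.
print(candidate_lemmas(quickly))
print(candidate_lemmas(quick))
```

Making `Hypolemma` a subclass of `Lemma` mirrors the statement that hypolemmas are lemmas too: anything that accepts a lemma also accepts a hypolemma.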

Following the LOD paradigm, lemmas (and obviously, hypolemmas) have a unique identifier (URI) that can be looked up on the web. This is an example: http://lila-erc.eu/data/id/lemma/86833.

Lemmas also have other properties and relations, which will be illustrated while we show how to explore the Lemma Bank using `pylila`.

### Lexical resources

Most of the properties of Lemmas are defined using [Ontolex](https://www.w3.org/2016/05/ontolex/), a W3C (de facto) standard ontology for describing lexical resources.

In fact, LiLa's lemmas are defined as a sub-class of Ontolex's [Forms](https://www.w3.org/2016/05/ontolex/#forms). One of the most important properties of forms is that they can serve as the [canonical forms](https://www.w3.org/2016/05/ontolex/#canonicalForm) of [Lexical Entries](https://www.w3.org/2016/05/ontolex/#lexical-entries).

This modeling strategy allows us both to publish lexical resources as LOD (using Ontolex to model the lexicographic information) and to connect them to our Lemma Bank. Potentially, all resources based on Ontolex can be linked to LiLa!
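
As a rough sketch of this idea, the snippet below represents a few statements as plain (subject, predicate, object) tuples. The `example.org` entry and lemma URIs are invented for the illustration, and the choice of which entries point at lemma `86833` is arbitrary; only the Ontolex namespace and the LiLa lemma URI pattern come from the actual resources.

```python
# Namespace of the Ontolex vocabulary and base URI of LiLa lemmas.
ONTOLEX = "http://www.w3.org/ns/lemon/ontolex#"
LILA_LEMMA = "http://lila-erc.eu/data/id/lemma/"

CANONICAL_FORM = ONTOLEX + "canonicalForm"

# Hypothetical entries from two independent lexicons (made-up URIs).
ENTRY_A = "http://example.org/lexicon-a/entry/1"
ENTRY_B = "http://example.org/lexicon-b/entry/42"
ENTRY_C = "http://example.org/lexicon-b/entry/43"

# Each triple states that a lexical entry has a given lemma as its
# canonical form. Two entries share the same lemma in the Lemma Bank.
TRIPLES = [
    (ENTRY_A, CANONICAL_FORM, LILA_LEMMA + "86833"),
    (ENTRY_B, CANONICAL_FORM, LILA_LEMMA + "86833"),
    (ENTRY_C, CANONICAL_FORM, "http://example.org/lemma/other"),
]


def entries_for_lemma(triples, lemma_uri):
    """Return the lexical entries whose canonical form is the given lemma."""
    return [
        subject
        for subject, predicate, obj in triples
        if predicate == CANONICAL_FORM and obj == lemma_uri
    ]


print(entries_for_lemma(TRIPLES, LILA_LEMMA + "86833"))
```

The two lexicons never refer to each other; they become connected only because their entries point at the same lemma URI.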

Currently, LiLa is linked to 8 lexical resources (see below).

## Some bibliography

* Most exhaustive paper: https://doi.org/10.4454/ssl.v58i1.277
* On corpora: see [this paper](https://zenodo.org/record/6664693#.YrCBeexBxfU)
* Complete list of the project publications: [here](https://lila-erc.eu/output/)