# <center>Getting Started with CLTK</center>

<center>Dr. W.J.B. Mattingly</center>

<center>Smithsonian Data Science Lab and United States Holocaust Memorial Museum</center>

<center>April 2022</center>

## Covered in this Chapter

1) <br>
2) <br>
3) <br>

## Introduction

## Retrieving a Text

Although CLTK can retrieve text corpora for each language, we will start by working with data stored locally. What we are about to do can also be found in the <a href="https://github.com/cltk/cltk/blob/master/notebooks/CLTK%20Demonstration.ipynb">demonstration notebook in the CLTK repository</a>. In the current repository, we provide a collection of texts.
Let's look at an excerpt from Livy.

First, we need to load the text and store it in a variable.

In [1]:
with open("texts/lat-livy.txt") as f:
    livy_full = f.read()

Excellent! Now, let's look at what we have.

In [2]:
print("Text snippet:", livy_full[:200])
print("Character count:", len(livy_full))
print("Approximate token count:", len(livy_full.split()))

Text snippet: Iam primum omnium satis constat Troia capta in ceteros saevitum esse Troianos, duobus, Aeneae Antenorique, et vetusti iure hospitii et quia pacis reddendaeque Helenae semper auctores fuerant, omne ius
Character count: 921462
Approximate token count: 129799


We say "approximate token count" because a token is any unit that has a syntactic function in the text. This means that a token is not only a word but also, for example, a punctuation mark. The count is "approximate" because the `split()` method turns a string into a list by splitting on whitespace by default. In other words, the exact number of tokens is higher, since punctuation marks stay attached to the words next to them.
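To make the gap between the two counts concrete, here is a minimal sketch that compares whitespace splitting with a rough regular-expression tokenizer on the opening words of the snippet printed above. The regex is an assumption chosen purely for illustration; it is not how CLTK or Stanza tokenizes, and the variable names are hypothetical.

```python
import re

# Opening words of the Livy snippet printed above.
sample = (
    "Iam primum omnium satis constat Troia capta in ceteros saevitum esse "
    "Troianos, duobus, Aeneae Antenorique, et vetusti iure hospitii"
)

# Whitespace splitting keeps punctuation glued to the preceding word.
whitespace_tokens = sample.split()

# Rough illustrative tokenizer (an assumption, not CLTK's method):
# either a run of word characters or a single non-space symbol.
regex_tokens = re.findall(r"\w+|[^\w\s]", sample)

print("Whitespace tokens:", len(whitespace_tokens))
print("Regex tokens:", len(regex_tokens))
print(whitespace_tokens[11], "vs", regex_tokens[11:13])
```

The three commas in the sample account for the difference between the two counts, which is the effect described above at the scale of a single clause.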


## Using the CLTK Pipeline

CLTK is designed specifically for natural language processing of ancient and medieval languages. To make the most of the library, we first need to import its pipeline class.

In [3]:
from cltk import NLP

If the code above ran without raising an error, we have successfully imported the NLP class from CLTK. This class lets us create a CLTK pipeline. To do so, however, we need to know the language in which the text was written. Livy was a Roman author, so the language is Latin, and its code is "lat".

In [4]:
# Load the default Latin pipeline
cltk_nlp = NLP(language="lat")

‎𐤀 CLTK version '1.0.25'.
Pipeline for language 'Latin' (ISO: 'lat'): `LatinNormalizeProcess`, `LatinStanzaProcess`, `LatinEmbeddingsProcess`, `StopsProcess`, `LatinNERProcess`, `LatinLexiconProcess`.


It is conventional to name an NLP object either `nlp` (the spaCy convention) or `cltk_nlp` (used here for CLTK). One reason to distinguish between the two is that you may have two separate NLP pipelines in your workflow, so it helps to make clear which object is your CLTK pipeline. You can name this object whatever you like, but sticking to these conventions will make your code easier to understand.

The output of this cell provides key information about our pipeline, including the individual pipes, or processes, that are applied to the input (the text):
- LatinNormalizeProcess
- LatinStanzaProcess
- LatinEmbeddingsProcess
- StopsProcess
- LatinNERProcess
- LatinLexiconProcess

We will cover each of these in depth later in this notebook. For now, simply understand that as your text moves through the NLP class, it moves through a pipeline of different processes. The sequence matters, as some processes rely on the output of earlier pipes.

If we wish to remove a pipe from the pipeline, we can use `.pipeline.processes.pop(INDEX)`, where `INDEX` is the position of the pipe to remove. Let's see what this looks like in practice.

In [5]:
cltk_nlp.pipeline.processes.pop(-1)
print(cltk_nlp.pipeline.processes)

[<class 'cltk.alphabet.processes.LatinNormalizeProcess'>, <class 'cltk.dependency.processes.LatinStanzaProcess'>, <class 'cltk.embeddings.processes.LatinEmbeddingsProcess'>, <class 'cltk.stops.processes.StopsProcess'>, <class 'cltk.ner.processes.LatinNERProcess'>]


By using pop at -1, we are removing the final pipe. One reason for wishing to do this may be speed. The LatinLexiconProcess is one of the more time-consuming pipes in the pipeline and may not be necessary for your workflow which just needs to use the NER pipe.
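Removing a pipe by position works, but it depends on knowing where the pipe sits in the list. As a hedged alternative, the sketch below filters a list of classes by name. The classes here are empty stand-ins that only copy the names printed above, and `remove_process` is a hypothetical helper, not part of CLTK.

```python
# Empty stand-in classes that mimic only the *names* of the processes
# printed above; they are not the real CLTK classes.
class LatinNormalizeProcess: ...
class LatinStanzaProcess: ...
class LatinEmbeddingsProcess: ...
class StopsProcess: ...
class LatinNERProcess: ...
class LatinLexiconProcess: ...

processes = [
    LatinNormalizeProcess,
    LatinStanzaProcess,
    LatinEmbeddingsProcess,
    StopsProcess,
    LatinNERProcess,
    LatinLexiconProcess,
]


def remove_process(process_list, name):
    """Hypothetical helper: return a new list without the class called `name`."""
    return [p for p in process_list if p.__name__ != name]


trimmed = remove_process(processes, "LatinLexiconProcess")
print([p.__name__ for p in trimmed])
```

In a real session, one would presumably pass `cltk_nlp.pipeline.processes` to the helper and assign the result back to that attribute. That step is an assumption and is not executed here.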

## The CLTK Doc Object

Now that we have set up our pipeline, let's analyze a text. To do so, we will create a CLTK Doc object. If you are already familiar with spaCy or other NLP libraries, this should sound familiar. The Doc object holds the data about the text. Before we examine the Doc object, let's create one. First, let's shorten the Livy text.


In [6]:
livy = livy_full[:len(livy_full) // 12]
print("Approximate token count:", len(livy.split()))

Approximate token count: 10905


Now that we have shortened Livy, let's create the CLTK Doc object. To do this, we call the `analyze` method of our `cltk_nlp` object and pass in one argument: the text, `livy`. If this is your first time running it, you may be prompted to download the Stanza models. Type "Y" to download them.

In [7]:
cltk_doc = cltk_nlp.analyze(text=livy)

CLTK message: This part of the CLTK depends upon the Stanza NLP library.
CLTK message: Allow download of Stanza models to ``C:\Users\wma22/stanza_resources/la/tokenize/ittb.pt``? [Y/n] 


 y


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.3.0.json:   0%|   …

2022-04-20 08:53:02 INFO: Downloading these customized packages for language: la (Latin)...
| Processor | Package |
-----------------------
| tokenize  | ittb    |
| pos       | ittb    |
| lemma     | ittb    |
| depparse  | ittb    |
| pretrain  | ittb    |



Downloading https://huggingface.co/stanfordnlp/stanza-la/resolve/v1.3.0/models/tokenize/ittb.pt:   0%|        …

Downloading https://huggingface.co/stanfordnlp/stanza-la/resolve/v1.3.0/models/pos/ittb.pt:   0%|          | 0…

Downloading https://huggingface.co/stanfordnlp/stanza-la/resolve/v1.3.0/models/lemma/ittb.pt:   0%|          |…

Downloading https://huggingface.co/stanfordnlp/stanza-la/resolve/v1.3.0/models/depparse/ittb.pt:   0%|        …

Downloading https://huggingface.co/stanfordnlp/stanza-la/resolve/v1.3.0/models/pretrain/ittb.pt:   0%|        …

2022-04-20 08:53:59 INFO: Finished downloading models and saved to C:\Users\wma22\stanza_resources.


This part of the CLTK depends upon models from the CLTK project.
Do you want to download 'https://github.com/cltk/lat_models_cltk' to '~/cltk_data/lat'? [Y/n] 


 y


Now that all the models are downloaded, our pipeline should have completed its processing of the text. Let's start examining the Doc object a bit more closely. Let's first examine what type of object it is.

In [8]:
print(type(cltk_doc))

<class 'cltk.core.data_types.Doc'>


Notice that it is a class defined by CLTK itself: specifically, a `Doc` object.

## Doc Object Accessors

The Doc object contains what the CLTK calls "accessors". If you are familiar with spaCy syntax, these function rather like spaCy attributes. Each contains a specific piece of data. In some instances, this will be at the token level (e.g. `tokens`, `lemmata`, `pos`). In other cases, it will be at the sentence level (e.g. `sentences`, `sentences_strings`, `sentences_tokens`). This allows you to parse the Doc object in several different ways. Let's take a look at all the accessors that are available to us from the Latin pipeline.

In [15]:
accessors = [x for x in dir(cltk_doc) if not x.startswith("__")]
for a in accessors:
    print(a)

_get_words_attribute
embeddings
embeddings_model
language
lemmata
morphosyntactic_features
normalized_text
pipeline
pos
raw
sentence_embeddings
sentences
sentences_strings
sentences_tokens
stanza_doc
stems
tokens
tokens_stops_filtered
words
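
The listing above still includes `_get_words_attribute`, because the filter only excludes names that start with two underscores. A minimal sketch of a stricter filter follows. It uses a small hypothetical stand-in class rather than a real CLTK Doc, and `public_names` is an illustrative helper name.

```python
class DocStandIn:
    """Hypothetical stand-in with a few attribute names copied from the listing above."""

    def __init__(self):
        self.raw = "Iam primum omnium"
        self.tokens = ["Iam", "primum", "omnium"]
        self.language = "lat"

    def _get_words_attribute(self, attribute):
        # Private helper, present only to show that it gets filtered out.
        return attribute


def public_names(obj):
    """Return attribute names that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


names = public_names(DocStandIn())
print(names)
```

Applied to a real Doc object, the same filter would presumably hide the private helper while leaving the public accessors in place.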


Let's now examine some of these a bit more closely. Each will have a header so that you can use the navigation in the textbook (on the right of the screen) to navigate more easily.

### Raw

The raw accessor is no different from the plain text object that we passed to the pipeline. Its index, therefore, functions just as the text input does. Let's take a look.

In [18]:
print(cltk_doc.raw[:20])

Iam primum omnium sa


### Token

The token accessor, however, is fundamentally different. This accessor contains all sequential tokens in the text. Let's take a look at the first 20.

In [10]:
print(cltk_doc.tokens[:20])

['Iam', 'primum', 'omnium', 'satis', 'constat', 'Troia', 'capta', 'in', 'ceteros', 'saevitum', 'esse', 'Troianos', ',', 'duobus', ',', 'Aeneae', 'Antenorique', ',', 'et', 'vetusti']


Notice how not only words are separated out in the output above, but also punctuation marks. This is what makes processing a text so powerful: we can analyze it at the token level.

### Lemmata

Like the token accessor, the lemmata accessor also functions at the token level. Unlike the token accessor, however, the lemmata contains the lemma forms of each token. Note "capta" above is now replaced with its lemma: "capio".

In [11]:
print(cltk_doc.lemmata[:20])

['Iam', 'primus', 'omnis', 'satis', 'consto', 'Troia', 'capio', 'in', 'ceterus', 'saevitum', 'sum', 'Troianos', ',', 'duo', ',', 'Aeneae', 'Antenorique', ',', 'et', 'vetusti']
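
Because the two accessors are aligned position by position, tokens and lemmata can be paired with `zip`. The sketch below hard-codes the first ten items of the two outputs above so that it runs without CLTK; in a real session the lists would come from `cltk_doc.tokens` and `cltk_doc.lemmata`.

```python
# First ten tokens and lemmata, copied from the outputs above.
tokens = ['Iam', 'primum', 'omnium', 'satis', 'constat',
          'Troia', 'capta', 'in', 'ceteros', 'saevitum']
lemmata = ['Iam', 'primus', 'omnis', 'satis', 'consto',
           'Troia', 'capio', 'in', 'ceterus', 'saevitum']

# Keep only the positions where lemmatization changed the surface form.
changed = [(token, lemma) for token, lemma in zip(tokens, lemmata) if token != lemma]

for token, lemma in changed:
    print(f"{token} -> {lemma}")
```

Pairing the lists this way is a quick check of which forms the lemmatizer actually altered and which it passed through unchanged.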


### POS

The `pos` accessor also functions at the token level. `pos` is short for *part of speech* (*pars orationis* in Latin), that is, the grammatical category of a word. This accessor is very common in NLP libraries.


In [12]:
print(cltk_doc.pos[:20])

['ADV', 'ADJ', 'PRON', 'ADV', 'VERB', 'NOUN', 'VERB', 'ADP', 'PRON', 'VERB', 'AUX', 'NOUN', 'PUNCT', 'NUM', 'PUNCT', 'NOUN', 'VERB', 'PUNCT', 'CCONJ', 'NOUN']
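
A list of tags like this lends itself to simple frequency counts. As a small sketch, the code below tallies the twenty tags printed above with `collections.Counter`. The list is hard-coded so the example runs on its own; in a real session it would be `cltk_doc.pos`.

```python
from collections import Counter

# The first twenty POS tags, copied from the output above.
pos_tags = ['ADV', 'ADJ', 'PRON', 'ADV', 'VERB', 'NOUN', 'VERB', 'ADP',
            'PRON', 'VERB', 'AUX', 'NOUN', 'PUNCT', 'NUM', 'PUNCT', 'NOUN',
            'VERB', 'PUNCT', 'CCONJ', 'NOUN']

# Count how often each tag occurs.
pos_counts = Counter(pos_tags)

for tag, count in pos_counts.most_common():
    print(tag, count)
```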


### Words

The words accessor may seem on the surface to resemble the token accessor, but it is quite different. It contains all the metadata about each word, and it functions rather like a spaCy Token object. Let's take a look at the seventh word in the text, "capta".

In [23]:
print(cltk_doc.words[6])

Word(index_char_start=None, index_char_stop=None, index_token=6, index_sentence=0, string='capta', pos=verb, lemma='capio', stem=None, scansion=None, xpos='L2|modM|tem4|grp1|casA|gen2', upos='VERB', dependency_relation='acl', governor=5, features={Aspect: [perfective], Case: [nominative], Degree: [positive], Gender: [feminine], Number: [singular], Tense: [past], VerbForm: [participle], Voice: [passive]}, category={F: [neg], N: [neg], V: [pos]}, stop=False, named_entity=False, syllables=None, phonetic_transcription=None, definition=None)


Note that unlike the token accessor, the words accessor allows us to see all metadata relevant to this individual word. We can access each of these features as well. Let's say I was interested in knowing its part-of-speech. I can access that data like so:

In [24]:
print(cltk_doc.words[6].pos)

verb


Now we know it is a verb. What if we wanted to know its voice? We could access its features.

In [25]:
print(cltk_doc.words[6].features)

{Aspect: [perfective], Case: [nominative], Degree: [positive], Gender: [feminine], Number: [singular], Tense: [past], VerbForm: [participle], Voice: [passive]}


And from here we can navigate this dictionary to the "Voice" key.

In [27]:
print(cltk_doc.words[6].features["Voice"])

[passive]


And we can see that it is passive. We can access all these features equally easily.

In [31]:
print("Number:", cltk_doc.words[6].features["Number"])
print("Tense:", cltk_doc.words[6].features["Tense"])
print("VerbForm:", cltk_doc.words[6].features["VerbForm"]) 
print("Voice:", cltk_doc.words[6].features["Voice"])

Number: [singular]
Tense: [past]
VerbForm: [participle]
Voice: [passive]


The words accessor is one of the more powerful aspects of the CLTK pipeline. I encourage you to spend a bit of time exploring what is available to you from the words accessor with your own text.
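One common use of word-level metadata is filtering, for example collecting the lemma of every verb. The sketch below illustrates the idea with a simplified, hypothetical `WordStandIn` dataclass whose field names (`string`, `upos`, `lemma`) follow the printout above. It is not the real CLTK `Word` class, and the four entries are copied from the earlier outputs.

```python
from dataclasses import dataclass


@dataclass
class WordStandIn:
    """Hypothetical, simplified stand-in for a CLTK Word object."""
    string: str
    upos: str
    lemma: str


# Values copied from the tokens, pos, and lemmata outputs shown earlier.
words = [
    WordStandIn("constat", "VERB", "consto"),
    WordStandIn("Troia", "NOUN", "Troia"),
    WordStandIn("capta", "VERB", "capio"),
    WordStandIn("in", "ADP", "in"),
]

# Keep the lemma of every word tagged as a verb.
verb_lemmata = [w.lemma for w in words if w.upos == "VERB"]
print(verb_lemmata)
```

With a real Doc, the same comprehension would presumably iterate over `cltk_doc.words` instead of the hand-built list.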

### Sentence Tokens

Unlike the previous accessors, the `sentences_tokens` accessor allows us to analyze the Doc object at the sentence level. This lets us parse a text sentence by sentence, which is difficult to do reliably with plain Python string methods. The `split(".")` approach separates a text at every ".", including wherever a "." marks an abbreviation. In Latin, as in English, this makes the approach unreliable. The CLTK pipeline, however, allows us to parse ancient and medieval languages effectively at the sentence level.

In [13]:
print(cltk_doc.sentences_tokens[:2])

[['Iam', 'primum', 'omnium', 'satis', 'constat', 'Troia', 'capta', 'in', 'ceteros', 'saevitum', 'esse', 'Troianos', ',', 'duobus', ',', 'Aeneae', 'Antenorique', ',', 'et', 'vetusti', 'iure', 'hospitii', 'et', 'quia', 'pacis', 'reddendaeque', 'Helenae', 'semper', 'auctores', 'fuerant', ',', 'omne', 'ius', 'belli', 'Achiuos', 'abstinuisse', ';'], ['casibus', 'deinde', 'variis', 'Antenorem', 'cum', 'multitudine', 'Enetum', ',', 'qui', 'seditione', 'ex', 'Paphlagonia', 'pulsi', 'et', 'sedes', 'et', 'ducem', 'rege', 'Pylaemene', 'ad', 'Troiam', 'amisso', 'quaerebant', ',', 'venisse', 'in', 'intimum', 'maris', 'Hadriatici', 'sinum', ',', 'Euganeisque', 'qui', 'inter', 'mare', 'Alpesque', 'incolebant', 'pulsis', 'Enetos', 'Troianosque', 'eas', 'tenuisse', 'terras', '.']]
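
To see why splitting on "." struggles with abbreviations, consider the short sketch below. The two Latin sentences are invented for this illustration and do not come from the Livy text; they contain the abbreviated praenomina "Cn." and "M.".

```python
# Two invented sentences with abbreviated praenomina (illustration only).
text = "Cn. Pompeius Romam venit. M. Tullius Cicero orationem habuit."

# Naive approach: split at every period and drop empty fragments.
naive_sentences = [part.strip() for part in text.split(".") if part.strip()]

print(len(naive_sentences))
for fragment in naive_sentences:
    print(repr(fragment))
```

The naive split produces four fragments for two sentences, because each abbreviation is treated as a sentence boundary.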


## Conclusion

This chapter introduced the main features of the NLP class: we saw how to build the processing pipeline and how to pass a text through it. In the next chapter, we will look more closely at named entity recognition.