# Automatic annotation of classical languages

Some introductory words...

In this session we will use Stanza, developed by the Stanford NLP group, to annotate the text, and spaCy to visualize the dependencies.

Stanza's performance on the different UD treebanks is available on the [project website](https://stanfordnlp.github.io/stanza/performance.html).



## Quick introduction to Jupyter Notebooks

First, a Jupyter Notebook is made of cells. The borders of a cell appear when you click on it.
A cell can contain:
- Text in Markdown (like this cell)
- Code in Python (like most of the cells of this notebook)

The lines shown in green inside the code cells (they all begin with a `#`) are comments. Everything else is code. We divided the code into separate cells so you can run it piece by piece.
To **run a cell**, click on it and then press the “play” button at its upper left (alternatively, press `Shift + Enter`). Depending on what the code does, you might see some output below the cell.

### Things to consider

- Be patient: the installation might take a few seconds, and downloading the models might take up to a couple of minutes.
- Be aware of the order: if you try to run the cell that displays the dependencies without first running the installation, the module imports and the analysis pipeline, it is not going to work. You only need to install and download things once, but if, for instance, you change the text you want to analyse, you need to rerun the cell that contains the text and then the analysis cells below it.


In [None]:
#Import the library
import stanza

In [None]:
#Download the Ancient Greek models
stanza.download(lang='grc', package='proiel') #this one is the default model if we don't specify a package
stanza.download(lang='grc', package='perseus')

In [None]:
#Let's put some text in a variable
text = "Ἐὰν ᾖς φιλομαθής, ἔσει πολυμαθής"

In [None]:
#First, we initialize a pipeline, which preloads and chains up a series of processors.
#Each of these processors performs an NLP task (e.g., lemmatization, dependency parsing, etc.)
nlp = stanza.Pipeline(lang='grc', package="proiel")
#And we can already pass our text to the pipeline for it to be annotated
doc = nlp(text)

In [None]:
#To print the annotation, we loop over each sentence and over each word of each sentence.
#Here we print the lemmata (with an arrow in between so the output is more legible):
for sentence in doc.sentences:
  for word in sentence.words:
    print(word.text, "->", word.lemma)

In [None]:
#Now let's print both the part of speech and the lemma
for sentence in doc.sentences:
  for word in sentence.words:
    print(word.text, " lemma:", word.lemma, " PoS:", word.pos)

In [None]:
#Morphological features
for sentence in doc.sentences:
  for word in sentence.words:
    print(word.text, word.feats)
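
The `feats` value printed above is a single string in the Universal Dependencies format: `Name=Value` pairs separated by `|`, or `None` when a word has no features. A minimal sketch of splitting such a string into a dictionary follows. The helper name `parse_feats` and the sample string are illustrative assumptions: they are not part of Stanza and not output of the model.

```python
def parse_feats(feats):
    """Turn a UD feature string such as 'Case=Nom|Number=Sing' into a dict."""
    # An empty value can show up as None, or as "_" in CoNLL-U files.
    if not feats or feats == "_":
        return {}
    # Each pair looks like "Name=Value"; split only on the first "=".
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# Hand-written sample string (hypothetical, not model output)
sample = "Case=Nom|Gender=Masc|Number=Sing"
print(parse_feats(sample))
print(parse_feats(None))
```

With the pipeline above, the same helper could be called as `parse_feats(word.feats)` inside the loop, which makes it easy to look up a single feature such as `Case`.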

In [None]:
#If we print the dependencies, we get a list of (head, relation, dependent) triples in which each word is shown in JSON-like form.
#For each word, we see all its properties, including the "head" (the id of its governing word) and the "deprel", the relation between the word and its head.
for sentence in doc.sentences:
  print(sentence.dependencies)
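
In that output, `head` is the 1-based `id` of the governing word within the same sentence, and `0` marks the root. The sketch below resolves each `head` number to the text of the head word. It works on plain dictionaries with the keys `id`, `text`, `head` and `deprel`; the function name `describe_dependencies` and the toy English sentence are assumptions made for illustration (the analysis is hand-written, not parser output).

```python
def describe_dependencies(words):
    """Return one 'dependent <-deprel- head' line per word."""
    # Map each word id to its text; id 0 stands for the artificial root.
    text_by_id = {w["id"]: w["text"] for w in words}
    text_by_id[0] = "ROOT"
    return [f'{w["text"]} <-{w["deprel"]}- {text_by_id[w["head"]]}' for w in words]


# Hand-annotated toy sentence (hypothetical, not output of any model)
toy_sentence = [
    {"id": 1, "text": "She", "head": 2, "deprel": "nsubj"},
    {"id": 2, "text": "reads", "head": 0, "deprel": "root"},
    {"id": 3, "text": "books", "head": 2, "deprel": "obj"},
]
for line in describe_dependencies(toy_sentence):
    print(line)
```

For a Stanza sentence, `sentence.to_dict()` should provide dictionaries with these keys, although multi-word tokens may need extra handling.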

In [None]:
#Visualization of the dependencies using spaCy
#First, we import the required packages

import spacy
from spacy import displacy
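#Note: StanzaLanguage belongs to spacy-stanza versions before 1.0; from 1.0 on, the package provides spacy_stanza.load_pipeline instead
from spacy_stanza import StanzaLanguage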
#And we initialize the pipeline
snlp = stanza.Pipeline(lang="grc", package='proiel')
nlp = StanzaLanguage(snlp)

In [None]:
#We pass our text through the pipeline (remember: the variable text was declared a few cells above)
doc = nlp(text)
displacy.render(doc, style="dep", jupyter=True)

In [None]:
#Now we are going to do the same using the perseus model
snlp = stanza.Pipeline(lang="grc", package='perseus')
nlp = StanzaLanguage(snlp)
doc = nlp(text)
displacy.render(doc, style="dep", jupyter=True)
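
Two dependency diagrams can be hard to compare by eye. A minimal sketch for listing the positions where two analyses disagree follows, assuming each analysis has been reduced to a list of `(text, head, deprel)` tuples with the same tokenization. The function name `differing_words` and the sample tuples are hypothetical placeholders, not output of either model.

```python
def differing_words(analysis_a, analysis_b):
    """List the 1-based positions where two analyses of the same words disagree."""
    # Assumption: both analyses split the text into the same words.
    if len(analysis_a) != len(analysis_b):
        raise ValueError("the two analyses have different numbers of words")
    return [
        (position, a, b)
        for position, (a, b) in enumerate(zip(analysis_a, analysis_b), start=1)
        if a != b
    ]


# Hand-written placeholder analyses as (text, head, deprel) tuples
first = [("She", 2, "nsubj"), ("reads", 0, "root"), ("books", 2, "obj")]
second = [("She", 2, "nsubj"), ("reads", 0, "root"), ("books", 2, "obl")]
print(differing_words(first, second))
```

Such tuples could be built from a Stanza document (not from the spaCy one used for the diagrams) with `[(w.text, w.head, w.deprel) for s in doc.sentences for w in s.words]`.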