# Exploratory data analysis and preprocessing

In this notebook, I perform exploratory data analysis (EDA) of the EvaLatin dataset. Then, using what I've learnt about the data, I preprocess it for the modeling stage.

## Exploratory data analysis

In this section, I want to find insights that I can leverage in the modeling stage. This analysis is structured in four sections, each focusing on a different level of the data:

1. Dataset
2. Forms
3. POS
4. Lemmata

In [None]:
%load_ext blackcellmagic
%matplotlib inline
import os
import re
import unicodedata
from collections import OrderedDict
import pathlib
import fileinput
import pyconll
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from segments import Profile, Tokenizer


from filenames import ROOT, RAW_EVALATIN_TRAINING_DATA_DIR, PROCESSED_EVALATIN_POS_TRAIN_DATA

os.chdir(ROOT)
BLUE = sns.color_palette()[0]

In [None]:
# Read in all data into a single pyconll CoNLL structure
path = pathlib.Path(RAW_EVALATIN_TRAINING_DATA_DIR)
f = fileinput.input(path.glob("*.conllu"))
conll = pyconll.unit.conll.Conll(f)

# Read in all data into a pandas DataFrame
data = []
for sentence in conll:
    for token in sentence:
        d = {"form": token.form, "lemma": token.lemma, "pos": token.upos}
        data.append(d)
df = pd.DataFrame(data)

### Dataset

#### Summary

- There are 14,399 sentences in the training data.
- There are 259,645 tokens in the training data. The official guidelines say 259,646, but I'm not too worried about this discrepancy.
- Most (75%) sentences have under 24 tokens, with the average having 18. The vast majority (95%) of sentences have at most 40 tokens.
- There are no punctuation or end of sentence markers.

#### Notes

- Both the POS and lemmatization tasks make most sense at the sentence level, although you could try type-level approaches.
- The dataset is sizeable but not huge, so it could be worth investigating external unlabelled data, external labelled data and data augmentation methods.
- Sentence lengths aren't too long, so models forgetting context is not a pressing concern.
- We'll have to add beginning and end of sentence markers.
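
The sentence-length figures above can be checked with quantiles and a simple coverage ratio. The following is a minimal sketch on made-up lengths; `toy_lengths` and `max_len` are hypothetical names, and the real values come from the corpus.

```python
import pandas as pd

# Toy sentence lengths (hypothetical; not taken from the EvaLatin data).
toy_lengths = pd.Series([3, 5, 8, 12, 18, 21, 24, 30, 40, 75])

# 75th and 95th percentiles of sentence length.
quantiles = toy_lengths.quantile([0.75, 0.95])

# Share of sentences that are no longer than a candidate cutoff.
max_len = 40
coverage = (toy_lengths <= max_len).mean()

print(quantiles)
print(f"{coverage:.0%} of sentences have at most {max_len} tokens")
```

Taking the mean of a boolean Series gives the proportion of `True` values directly, which is a compact alternative to `value_counts(normalize=True)`.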

In [None]:
# How many sentences are there?
len(conll)

In [None]:
# How many tokens are there?
sum(map(len, conll))

In [None]:
# What is the distribution of sentence length?
sentence_lengths = pd.Series([len(s) for s in conll])
sentence_lengths.describe()

In [None]:
# Now we show sentence length visually
plt.figure(figsize=(12, 8))
plt.xlim((0, 100))
sns.distplot(sentence_lengths, bins=300, kde=False)

In [None]:
# This shows the percent of sentences with at most 40 tokens.
(sentence_lengths <= 40).value_counts(normalize=True) * 100

### Forms

N.B. This counts characters not graphemes.
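
To make the distinction concrete, here is a small sketch of how a character count can differ from a grapheme count. It uses only the standard library; the example strings are illustrative and not taken from the data.

```python
import unicodedata

# The same visible letter, once as a single precomposed code point and once as
# a base letter followed by a combining accent.
composed = unicodedata.normalize("NFC", "e\u0301")
decomposed = unicodedata.normalize("NFD", composed)
print(len(composed), len(decomposed))

# A Latin digraph such as "qu" can be treated as one grapheme by a tokenization
# profile, but len() still counts it as two characters.
print(len("quod"))
```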

#### Summary

- There are 43,767 unique forms in the training data, of which more than half (24,376) only appear once. The vast majority (90%) of forms appear at most 7 times in the training data.
- Most forms have at most 8 characters, with the average form having around 6. The vast majority (95%) of words have at most 10 characters.
- There are 126 different characters in the training data.
- These characters fall into one of four classes:
    - Latin
    - Greek
    - Full stop
    - Other
- Over 98% of the Latin characters are lower case.
- Full stops are used in four different ways:
    - In abbreviations of proper nouns (following the regex `[A-Z].*\.`)
    - In lacunae (following the regex `\.\.\.`)
    - For the noun "salus", almost always preceded by "suus".
    - Other abbreviations, whose full form is not found elsewhere in the sentence.
- Capitalization is used in four different ways:
    - As the first character of a sentence
    - As the first character of a proper noun (abbreviated and not)
    - In Roman numerals
    - In "HS"
- About half the forms that end with "-que" are the clitic "-que".

#### Notes

- The large number of forms, and especially the large number of hapax legomena, suggests the need to include character-based methods.
- The large number of forms with few examples in the training data suggests that the test data will also contain many infrequent forms. This lends further support to character-based and context-based methods.
- We can massively reduce the size of the character vocabulary by focusing on Latin characters.
- If we're just focusing on Latin characters, we could halve the character vocabulary again by keeping only lower case characters. However, capital letters are a strong signal for proper nouns (abbreviated or not).
- We could replace all Greek words with a single Greek character.
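
The argument about hapax legomena depends on whether they are counted per type or per token. A minimal sketch of both views on a toy token list follows; `tokens` is hypothetical, and the real counts come from `df["form"]`.

```python
from collections import Counter

# Hypothetical toy tokens, not taken from the corpus.
tokens = ["et", "arma", "et", "uirum", "cano", "et", "arma", "troiae"]

counts = Counter(tokens)
hapaxes = [form for form, n in counts.items() if n == 1]

# Type-level share: fraction of distinct forms that occur exactly once.
hapax_type_share = len(hapaxes) / len(counts)
# Token-level share: fraction of running tokens covered by those forms.
hapax_token_share = len(hapaxes) / len(tokens)

print(hapax_type_share, hapax_token_share)
```

A high type-level share with a low token-level share would mean rare forms dominate the vocabulary but not the running text, which is the situation where character-based representations tend to help most.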

In [None]:
# What is the distribution of number of characters per form?
word_lengths = df["form"].str.len()
word_lengths.describe()

In [None]:
# This shows the percent of words with at most 10 characters.
(word_lengths <= 10).value_counts(normalize=True) * 100

In [None]:
# What is the character set used in the forms?
chars = pd.Series(list("".join(df["form"].values)))
chars.value_counts()

In [None]:
# What classes of characters are there?
char_class = lambda ch: unicodedata.name(ch).split()[0]
pd.Series(chars.unique()).apply(char_class).value_counts()

In [None]:
# What is the distribution of upper and lower Latin characters?
is_latin = lambda ch: char_class(ch) == "LATIN"
case = lambda ch: "upper" if ch.isupper() else "lower"
chars_df = chars.to_frame("char")
chars_df["latin"] = chars_df["char"].apply(is_latin)
chars_df["case"] = chars_df["char"].apply(case)
chars_df[chars_df["latin"]]["case"].value_counts(normalize=True)

In [None]:
# How are full stops used?
full_stop_pattern = re.compile(r"\.")
has_full_stop = df["form"].str.contains(full_stop_pattern)
df[has_full_stop]["pos"].value_counts()

In [None]:
# abbreviations
abbreviation_pattern = re.compile(r"[A-Z].*\.")
is_abbreviation = df["form"].str.contains(abbreviation_pattern)
df[is_abbreviation]["pos"].value_counts()

In [None]:
# lacunae
lacuna_pattern = re.compile(r"\.\.\.")
is_lacuna = df["form"].str.contains(lacuna_pattern)
df[is_lacuna]

In [None]:
# salus
is_salus = df["lemma"] == "salus"
pd.Series(
    [df.loc[i - 1]["form"] for i in df[has_full_stop & is_salus].index]
).value_counts()

In [None]:
# remaining
df[has_full_stop & ~is_abbreviation & ~is_lacuna & ~is_salus]["form"].value_counts()

In [None]:
# How is capitalization used?
capital_pattern = re.compile(r"[A-Z]")
has_capital_letter = df["form"].str.contains(capital_pattern)
df[has_capital_letter]

In [None]:
# How many unique forms are there?
len(df["form"].unique())

In [None]:
# How many hapax legomena are there?
word_counts = df["form"].value_counts()
(word_counts == 1).sum()

In [None]:
(word_counts <= 7).value_counts(normalize=True)

In [None]:
# How many forms have the clitic "-que"?
que_form = df["form"].str.endswith("que")
df[que_form]["lemma"].str.endswith("que").value_counts(normalize=True)

### POS

#### Summary

- There are 15 different POS tags, just as the official guideline states.
- Nouns and verbs are by far the most frequent POS tags.
- By frequency, there are three classes of POS tags:
    - NOUN and VERB are in the most frequent class, each accounting for around 23% of all tokens and together totalling over 45%.
    - The next class consists of ADJ, ADV, PRON, DET, CCONJ, ADP, PROPN, SCONJ and PART, each of which accounts for 1-8% of tokens.
    - The last class consists of AUX, NUM, X and INTJ tags, which each account for less than 1% of tokens.
- NOUN, VERB, ADJ and PRON need root, morphology and syntactic context to identify them.
- ADV, DET, ADP and CCONJ are more tied to a particular form/root.
- PART covers only the negatives "non", "ne" and "haud", but these forms can also be SCONJ or ADV.
- AUX are forms of sum or eo, as the official guide mentions.
- NUM are cardinal numbers or Roman numerals.
- INTJ can largely be distinguished by their form, which comes from a small set of forms (most are "O" or "hercule"). However, there is still variation ("age", "malus"). Context should identify them. Most of the "O" INTJ are sentence-initial.

#### Notes

- As a baseline, if you just guessed NOUN for each token, you'd have an accuracy of 23%.
- The kinds of information useful for identifying the POS are the root, the inflectional morphology, the derivational morphology, the syntactic context and the linear order in the sentence.
- Having contextual models is important.
- Having word-type representations is important for those tags that are strongly lexical (e.g. CCONJ, DET, ADV, INTJ).
- Having character-based models is important for those with morphology.
- Having sentence position information will help identify some tags (e.g. INTJ), so we should include beginning and end of sentence markers.
- Add end of word markers to help model suffixes.
- In summary, the model input should combine start/end sentence markers, start/end word markers, characters and word-type representations.
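
The majority-class baseline mentioned in the notes can be computed as follows. This is a sketch on invented tags; `gold` is a hypothetical list, and the 23% figure above comes from the real data, not from this example.

```python
from collections import Counter

# Hypothetical gold tags for a handful of tokens.
gold = ["NOUN", "VERB", "NOUN", "ADJ", "CCONJ", "NOUN", "VERB", "ADP"]

# Predict the single most frequent tag for every token.
majority_tag, majority_count = Counter(gold).most_common(1)[0]
predictions = [majority_tag] * len(gold)

accuracy = sum(p == g for p, g in zip(predictions, gold)) / len(gold)
print(majority_tag, accuracy)
```

The accuracy of this baseline always equals the relative frequency of the most common tag, so it can be read straight off the tag distribution.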

In [None]:
# How many POS tags are there?
len(df["pos"].unique())

In [None]:
# What is the distribution of POS tags?
df["pos"].value_counts(normalize=True) * 100

In [None]:
# Now we show it visually with raw counts
order = df["pos"].value_counts().index
plt.figure(figsize=(12, 8))
sns.countplot(x="pos", data=df, order=order, color=BLUE)

In [None]:
# What do NOUN look like?
# Not much here beyond normal Latin morphology
noun = df["pos"] == "NOUN"
noun_counts = df[noun]["form"].value_counts()
noun_counts.head()

In [None]:
# What do VERB look like?
# Not much here beyond normal Latin morphology
verb = df["pos"] == "VERB"
verb_counts = df[verb]["form"].value_counts()
verb_counts.head()

In [None]:
# What do ADJ look like?
# Not much here beyond normal Latin morphology
adj = df["pos"] == "ADJ"
adj_counts = df[adj]["form"].value_counts()
adj_counts.head()

In [None]:
# What do ADV look like?
adv = df["pos"] == "ADV"
adv_counts = df[adv]["form"].value_counts()
adv_counts.head()

In [None]:
# What do PRON look like?
pron = df["pos"] == "PRON"
pron_counts = df[pron]["form"].value_counts()
pron_counts.head()

In [None]:
# What do DET look like?
det = df["pos"] == "DET"
det_counts = df[det]["form"].value_counts()
det_counts.head()

In [None]:
# What do CCONJ look like?
cconj = df["pos"] == "CCONJ"
cconj_counts = df[cconj]["form"].str.lower().value_counts()
cconj_counts.head()

In [None]:
# What do ADP look like?
adp = df["pos"] == "ADP"
adp_counts = df[adp]["form"].str.lower().value_counts()
adp_counts.head()

In [None]:
# What do PART look like?
part = df["pos"] == "PART"
part_counts = df[part]["form"].str.lower().value_counts()
part_counts.head()

In [None]:
negatives = ["non", "ne", "haud"]
df[df["form"].isin(negatives)]["pos"].value_counts()

In [None]:
# What do PROPN look like?
propn = df["pos"] == "PROPN"
df[propn]["form"].value_counts().head(20)

In [None]:
# What do AUX look like?
aux = df["pos"] == "AUX"
aux_counts = df[aux]["form"].str.lower().value_counts()
aux_counts.head()

In [None]:
# What do NUM look like?
num = df["pos"] == "NUM"
num_counts = df[num]["form"].str.lower().value_counts()
num_counts.head()

In [None]:
# What do INTJ look like?
intj = df["pos"] == "INTJ"
df[intj]["form"].str.lower().value_counts()

In [None]:
# How many O's begin a sentence?
len([sentence for sentence in conll if sentence[0].form.lower() == "o"])

In [None]:
# What do X look like?
X = df["pos"] == "X"
df[X]["lemma"].value_counts()

### Lemmatization

#### Summary

- There are 9,623 unique lemmata.
- sum, qui and et each account for over 2% of tokens.
- The top 20 lemmata account for almost a quarter of all tokens.
- In a third of the training tokens, the lemma is identical to the form.


#### Notes
- Lacunae can be lemmatized with the regular expression `\.\.`
- Greek words can be lemmatized with the `is_greek_word` function.
- The Roman numeral regex isn't working well enough at the moment to use. It produces more false positives than true positives in the training data, although it produces no genuine false negatives.
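
To illustrate why the Roman numeral pattern over-matches, the sketch below applies it to a few ordinary Latin forms. `loose` is the pattern used later in this notebook; `strict` is an assumption of mine, not something tested on the data: it drops `re.IGNORECASE` and rejects the empty string. It may introduce false negatives if numerals ever appear in lower case, so it would need checking against the training data.

```python
import re

# The pattern from this notebook, matched case-insensitively.
loose = re.compile(
    r"^M{0,4}(CM|CD|D?C{0,4})(XC|XL|L?X{0,4})(IX|I[VU]|[VU]?I{0,4})$", re.IGNORECASE
)
# Assumed variant: case-sensitive, with a lookahead that requires at least one character.
strict = re.compile(
    r"^(?=.)M{0,4}(CM|CD|D?C{0,4})(XC|XL|L?X{0,4})(IX|I[VU]|[VU]?I{0,4})$"
)

# "cui", "di" and "mi" are ordinary words built only from numeral letters.
for form in ["XIV", "III", "cui", "di", "mi", "amor"]:
    print(form, bool(loose.match(form)), bool(strict.match(form)))
```

Because every group in the pattern is optional, the loose version also matches the empty string, which is worth guarding against even if empty forms never occur.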

In [None]:
# How many lemmata are there?
len(df["lemma"].unique())

In [None]:
# What is the distribution of lemmata?
(df["lemma"].value_counts(normalize=True) * 100).head(20)

In [None]:
# What is the distribution of lemma?
# Now we show it visually with raw counts for the top N lemmata
N = 20
order = df["lemma"].value_counts().iloc[:N].index
plt.figure(figsize=(12, 8))
sns.countplot(x="lemma", data=df, order=order, color=BLUE)

In [None]:
(df["lemma"].value_counts(normalize=True) * 100).iloc[:20].sum()

In [None]:
# What is the distribution of number of POS tags per lemma?
num_pos_per_lemma = df.groupby("lemma")["pos"].nunique().to_frame("count")
num_pos_per_lemma["count"].describe()

In [None]:
# How often are the form and lemma identical?
(df["form"].str.lower() == df["lemma"]).value_counts(normalize=True)

In [None]:
# Lacunae: does the lemma "uox_lacunosa" coincide exactly with forms containing ".."?
is_lacuna = df["lemma"] == "uox_lacunosa"
has_two_periods = df["form"].str.contains(r"\.\.")
(is_lacuna == has_two_periods).all()

In [None]:
# Greek words
def is_greek_char(ch):
    return char_class(ch) == "GREEK"


def is_greek_word(word):
    return any(map(is_greek_char, word))


# Does the `is_greek_word` function find only Greek words?
df[df["form"].apply(is_greek_word)]["lemma"].value_counts()

In [None]:
# Does the `is_greek_word` function miss any Greek words?
(df[~df["form"].apply(is_greek_word)]["lemma"] == "uox_graeca").value_counts()

In [None]:
# Roman numerals
roman_numeral_pattern = re.compile(
    r"^M{0,4}(CM|CD|D?C{0,4})(XC|XL|L?X{0,4})(IX|I[VU]|[VU]?I{0,4})$", re.IGNORECASE
)
is_really_roman_numeral = df["lemma"] == "numerus_romanus"
is_predicted_roman_numeral = df["form"].str.match(roman_numeral_pattern)
df[~is_really_roman_numeral & is_predicted_roman_numeral].head()

In [None]:
df[~is_really_roman_numeral & is_predicted_roman_numeral][
    "form"
].str.lower().value_counts()

## Preprocessing

In [None]:
# I replace some types of forms (e.g. Greek words) with placeholder markers because the specifics of their forms don't
# matter for these tasks. I also insert start and end markers for word and sentence boundaries. For convenience, each
# marker should be a single character that does not appear in the cleaned text, so I have chosen Greek letters. Greek
# letters from the original text are removed before any of these markers are added.

GREEK_TOKEN = "α"
LACUNA_TOKEN = "β"
PROPN_ABBREVIATION_TOKEN = "γ"
START_WORD = "δ"
END_WORD = "ε"
START_SENTENCE = "ζ"
END_SENTENCE = "η"
GRAPHEME_SEPARATOR = "-"

def remove_other_chars(word):
    return "".join([ch for ch in word if char_class(ch) in ["LATIN", "GREEK", "FULL"]])

def replace_greek_word(word):
    if is_greek_word(word):
        return GREEK_TOKEN
    return word

def replace_salus(word):
    if word == "s.":
        return "salus"
    return word

def replace_lacuna(word):
    match = re.search(r"\.\.", word)
    if match:
        return LACUNA_TOKEN
    return word

def replace_propn_abbreviation(word):
    match = re.match(r"[A-Z].*\.", word)
    if match:
        return PROPN_ABBREVIATION_TOKEN
    return word

def replace_full_stop(word):
    return word.replace(".", "")


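def replace_j(word):
    # Normalize "j" to "i" (assumed intent: deleting the letter would corrupt the word)
    return word.replace("j", "i").replace("J", "I")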

def clean(word):
    word = remove_other_chars(word)
    word = replace_greek_word(word)
    word = replace_salus(word)
    word = replace_lacuna(word)
    word = replace_propn_abbreviation(word)
    word = replace_full_stop(word)
    word = replace_j(word)
    word = word.lower()  # might not want to do this
    return word


cleaned = list(df["form"].apply(clean))

#### Grapheme tokenization

In [None]:
# Create grapheme tokenization profile
text = " ".join(cleaned)
profile = Profile.from_text(text)
profile.column_labels.remove("frequency")
profile.graphemes.pop(" ")
for key in ["ch", "qu", "th", "rh", "ph", "gn"]:
    profile.graphemes[key] = OrderedDict([("mapping", key)])
    profile.graphemes.move_to_end(key, last=False)
# The profile contains non-ASCII characters, so write it explicitly as UTF-8
with open("src/profile.prf", "w", encoding="utf-8") as file:
    file.write(str(profile))
tokenizer = Tokenizer("src/profile.prf")
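
As an illustration of what the profile is meant to achieve, here is a plain-Python sketch of greedy left-to-right segmentation that prefers the listed digraphs. It is a stand-in for explanation only, not the `segments` implementation; `segment` and `DIGRAPHS` are hypothetical names.

```python
DIGRAPHS = ("ch", "qu", "th", "rh", "ph", "gn")


def segment(word, multigraphs=DIGRAPHS):
    """Split a word into graphemes, taking a listed digraph whenever one starts here."""
    graphemes = []
    i = 0
    while i < len(word):
        pair = word[i : i + 2]
        if pair in multigraphs:
            graphemes.append(pair)
            i += 2
        else:
            graphemes.append(word[i])
            i += 1
    return graphemes


print("-".join(segment("philosophus")))  # ph-i-l-o-s-o-ph-u-s
print("-".join(segment("magnus")))  # m-a-gn-u-s
```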

In [None]:
# Prepare data for POS tagging
WORD_TAG_DELIMITER = "/"
WORD_DELIMITER = "\t"

lines = []
for sentence in conll:
    line = []
    for token in sentence:
        form = tokenizer(clean(token.form), segment_separator=GRAPHEME_SEPARATOR)
        form = GRAPHEME_SEPARATOR.join([START_WORD, form, END_WORD])  # add in start/end word boundaries
        pos = token.upos
        instance = form + WORD_TAG_DELIMITER + pos
        line.append(instance)
    # add in start/end sentence boundaries
    line[0] = START_SENTENCE + GRAPHEME_SEPARATOR + line[0]
    line[-1] = line[-1].split(WORD_TAG_DELIMITER)[0] + GRAPHEME_SEPARATOR + END_SENTENCE + WORD_TAG_DELIMITER + pos
    lines.append(WORD_DELIMITER.join(line))
# The boundary markers are Greek letters, so write the file explicitly as UTF-8
with open(PROCESSED_EVALATIN_POS_TRAIN_DATA, "w", encoding="utf-8") as file:
    file.write("\n".join(lines))
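
For reference, the resulting line format can be read back as shown below. This is a minimal sketch: `parse_line` is a hypothetical helper, and the example line is invented to match the format produced above (tab-separated `form/TAG` instances, with graphemes joined by `-`).

```python
START_WORD, END_WORD = "δ", "ε"
START_SENTENCE, END_SENTENCE = "ζ", "η"
GRAPHEME_SEPARATOR, WORD_TAG_DELIMITER, WORD_DELIMITER = "-", "/", "\t"
MARKERS = {START_WORD, END_WORD, START_SENTENCE, END_SENTENCE}


def parse_line(line):
    """Return (graphemes, tag) pairs for one line, dropping the boundary markers."""
    pairs = []
    for instance in line.split(WORD_DELIMITER):
        form, tag = instance.rsplit(WORD_TAG_DELIMITER, 1)
        graphemes = [g for g in form.split(GRAPHEME_SEPARATOR) if g not in MARKERS]
        pairs.append((graphemes, tag))
    return pairs


# A hypothetical two-token sentence in the output format.
line = "ζ-δ-a-r-m-a-ε/NOUN\tδ-c-a-n-o-ε-η/VERB"
print(parse_line(line))
```

Placeholder tokens such as the Greek-word or lacuna markers are kept as ordinary graphemes here; only the word and sentence boundary markers are stripped.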