# Analysis of the Slovenian data for the PARSEME shared task 1.1 on verbal MWE identification

## Problem statement

Training with the winning system [traversal](https://github.com/kawu/traversal) produces a model, but that model does not work on the *test data*.

The model is trained following the instructions in [README.md](https://github.com/kawu/traversal#training). The training data (`train` and `dev`) and the test data are taken from the official [LINDAT/CLARIN Repository Home](https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-2842).

A possible explanation is a problem in how the Slovenian `.cupt` files were generated. The Slovenian files in both editions of the `PARSEME shared task` have no `UPOS` tags and deliberately use `JOS` tags instead. The `CoNLL-U` specification, and consequently `cupt`, which is an extension of that format, defines the fourth column as `UPOS: Universal part-of-speech tag.`, that is, one of [17 tags](https://universaldependencies.org/u/pos/index.html), which must be written in upper case.
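
The column constraint described above can be checked mechanically. The following is a minimal sketch and not part of the original analysis: the names `UD_V2_UPOS`, `upos_of` and `has_valid_upos` and the two sample token lines are hypothetical, and the tag set is assumed to be the UD v2 inventory from the page linked above.

```python
# Hypothetical constant: the 17 UD v2 tags from the page linked above
UD_V2_UPOS = frozenset(
    "ADJ ADP ADV AUX CCONJ DET INTJ NOUN NUM PART PRON PROPN PUNCT SCONJ SYM VERB X".split()
)


def upos_of(line):
    """Return the 4th tab-separated column (UPOS) of a CoNLL-U/cupt token line."""
    return line.split("\t")[3]


def has_valid_upos(line):
    """True if the UPOS column holds one of the 17 upper-case UD v2 tags."""
    return upos_of(line) in UD_V2_UPOS


# Made-up token lines with the 11 cupt columns
good = "1\tpes\tpes\tNOUN\tSomei\t_\t2\tnsubj\t_\t_\t*"
bad = "1\tpes\tpes\tSomei\t_\t_\t2\tnsubj\t_\t_\t*"
print(has_valid_upos(good), has_valid_upos(bad))  # True False
```

A line that carries a language-specific tag in the fourth column, as in `bad`, fails the check even though it is otherwise well formed.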

## Proposed work

First we will check whether other languages also use their own `POS` tags instead of `UPOS` tags. If so, we know that the exception is limited to the Slovenian files, so with the help of an auxiliary

## Notes

To test the code, follow these instructions:

* the text files should be in the subfolder `data/1.0` (POSIX) or `data\1.0` (Windows).
  If they are not, please adjust the `PATH_TO_FILES` variable accordingly.
* install the following libraries for `Python3`:
    - `jupyterlab`
    - `pyconll`
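
Before running the cells, the expected folder layout can be verified with a few lines of Python. This is only a sketch: the function name `missing_data_folders` is hypothetical, and the `1.1` folder is assumed because the later cells read from it as well.

```python
import os


def missing_data_folders(root="data", editions=("1.0", "1.1")):
    """Return the expected edition folders that do not exist under root."""
    expected = [os.path.join(root, edition) for edition in editions]
    return [folder for folder in expected if not os.path.isdir(folder)]


missing = missing_data_folders()
if missing:
    print("Missing folders:", ", ".join(missing))
else:
    print("Data layout looks fine.")
```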

In [1]:
import os

import pyconll

In [2]:
# Constant variables
# ------------------
PATH_TO_FILES = 'data'

UPOS_TAGS_V2 = { 
    # https://universaldependencies.org/u/pos/index.html
    "ADJ": "adjective",
    "ADP": "adposition",
    "ADV": "adverb",
    "AUX": "auxiliary",
    "CCONJ": "coordinating conjunction",
    "DET": "determiner",
    "INTJ": "interjection",
    "NOUN": "noun",
    "NUM": "numeral",
    "PART": "particle",
    "PRON": "pronoun",
    "PROPN": "proper noun",
    "PUNCT": "punctuation",
    "SCONJ": "subordinating conjunction",
    "SYM": "symbol",
    "VERB": "verb",
    "X": "other"
}

UPOS_TAGS_V1 = {
    # http://universaldependencies.org/docsv1/u/pos/index.html
    "ADJ": "adjective",
    "ADP": "adposition",
    "ADV": "adverb",
    "AUX": "auxiliary verb",
    "CONJ": "coordinating conjunction",
    "DET": "determiner",
    "INTJ": "interjection",
    "NOUN": "noun",
    "NUM": "numeral",
    "PART": "particle",
    "PRON": "pronoun",
    "PROPN": "proper noun",
    "PUNCT": "punctuation",
    "SCONJ": "subordinating conjunction",
    "SYM": "symbol",
    "VERB": "verb",
    "X": "other"
}

In [9]:
def parse_file(path_to_file, upos_tags=()):
    """Collect the tokens whose UPOS tag is not in `upos_tags`, using the
    pyconll library for conllu files or an ad hoc parser for cupt files.
    
    Note
    ----
    The cupt parser assumes that the columns are:
    ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC PARSEME:MWE

    Parameters
    ----------
    path_to_file: str
        A path to the conllu or cupt file.
    
    upos_tags: collection of str
        The accepted UPOS tags, either V1, V2 or both.
    
    Returns
    -------
    list
        A list of tuples with the file path, the sentence start line (conllu)
        or line index (cupt), the token form and the offending tag.
    """
    # Get the path, filename and extension
    filename, ext = os.path.splitext(path_to_file)

    found_tags = []
    
    # Stays True only while no token has been seen with an accepted tag
    # or with the "_" placeholder
    file_is_blank = True
    
    # TODO: remove duplicate logic in the generation of the report
    if ext == ".conllu":
        conllu = pyconll.load_from_file(path_to_file)
        for sentence in conllu:
            for token in sentence:
                # Read the raw UPOS column: "_" marks a missing tag,
                # which pyconll parses as None
                upos_tag = token.conll().split("\t")[3]
                if token.upos not in upos_tags and upos_tag != "_":
                    report = (
                        path_to_file,
                        sentence.start_line_number,
                        token.form,
                        token.upos
                    )
                    found_tags.append(report)
                else:
                    file_is_blank = False
    else:
        # Read the whole file and close the handle before parsing
        with open(path_to_file) as handle:
            lines = handle.read().splitlines()
        for index, line in enumerate(lines):
            if not line or line.startswith("#"):
                continue
            line_as_list = line.split("\t")
            upos_tag = line_as_list[3].strip()
            if upos_tag not in upos_tags and upos_tag != "_":
                report = (
                    path_to_file,
                    index,
                    line_as_list[1],
                    upos_tag
                )
                found_tags.append(report)
            else:
                file_is_blank = False
    if file_is_blank:
        report = (path_to_file, "File has no UPOS tags.")
        found_tags.append(report)
    return found_tags

        
def check_upos_tags(path, schema=None):
    """Check a folder of conllu/cupt files, or a single file, and return the
    places where the UPOS tags do not conform to the Universal POS schema.
    
    Notes
    -----
    It checks against V1 and V2 of the UD UPOS tag set.
    For the purpose of this task, the only difference is the POS tag
    for a coordinating conjunction.
    
    # http://universaldependencies.org/docsv1/u/pos/index.html
    # https://universaldependencies.org/u/pos/index.html
    
    Parameters
    ----------
    path: str or list
        A path name or a list of path parts to join.
    schema: str
        Version one ("v1") or two ("v2"). If none is given, both schemas are checked.

    Returns
    -------
    list
        One report tuple for every token whose tag is not a valid UPOS tag.
    """
    path = os.path.join(*path) if isinstance(path, list) else path
    # Accept either a folder or a single file
    if os.path.isdir(path):
        folder, all_files = path, os.listdir(path)
    else:
        folder, single_file = os.path.split(path)
        all_files = [single_file]
    files_to_check = [file for file in all_files if file.endswith((".conllu", ".cupt"))]

    if not files_to_check:
        raise ValueError("No CoNLL-U or cupt file(s) to check.")

    upos_tags_v1 = list(UPOS_TAGS_V1.keys())
    upos_tags_v2 = list(UPOS_TAGS_V2.keys())

    if schema == "v1":
        upos_tags = upos_tags_v1
    elif schema == "v2":
        upos_tags = upos_tags_v2
    else:
        upos_tags = set(upos_tags_v1 + upos_tags_v2)

    tags = []

    for file in files_to_check:
        found = parse_file(os.path.join(folder, file), upos_tags=upos_tags)
        tags += found
    return tags

In [4]:
# Search for languages where UPOS tags are not being used
path = os.path.join(PATH_TO_FILES, "1.0")
found = []

for folder in os.listdir(path=path):
    if not os.path.isdir(os.path.join(path, folder)):
        continue
    try:
        wrong_tags = check_upos_tags(os.path.join(path, folder))
        found += wrong_tags
    except Exception:
        continue
# Each report starts with the file path, so this is the set of flagged files
languages = {report[0] for report in found}
languages

{'data/1.0/FR/test.conllu',
 'data/1.0/FR/train.conllu',
 'data/1.0/HU/test.conllu',
 'data/1.0/HU/train.conllu',
 'data/1.0/IT/test.conllu',
 'data/1.0/IT/train.conllu',
 'data/1.0/RO/test.conllu',
 'data/1.0/RO/train.conllu',
 'data/1.0/SL/test.conllu',
 'data/1.0/SL/train.conllu'}

In [5]:
# Compare the tags
for file in languages:    
    c = pyconll.load_from_file(file)
    upos = []
    for sentence in c:
        for token in sentence:
            upos_tag = token.conll().split("\t")[3]
            upos.append(upos_tag)
    print(f'{file}: {", ".join(set(upos))}')

data/1.0/IT/test.conllu: I, B, V, T, F, D, X, C, E, P, A, R, S, N
data/1.0/RO/test.conllu: Pp1-po, Dd3mso---e, Pd3mpo, Ds1fp-s, Dd3-po---o, SCOLON, Pi3mpr, Tifso, Pp1-pr, RPAR, Ncmprn, Ncfsry, Moms-ly, Dd3mso---o, Vmip1p, Afpfsry, Mcfsrln, Vmg, Ps3---sf, Tdmsr, Pd3fpr, Dd3fso---o, Ps3fsrs, Ncmpoy, Mo-s-r, Dw3fso, Rg, Mc-p-l, Mcfsoln, Ncmsrn, Ncfpon, Pi3mso, Qs, Npfsry, Pi3-sr, Vmii1p, M, Afpmpry, QUOT, Vmip2s, Ds3mp-s, Dh1mp, Ds3ms-s, Pd3fpo, Ncfsvy, Vmii1s, Momprln, Di3mpr, Dd3fsr, STAR, Dd3fpr, Ds3---sm, Afpmsrn, Pp1-sr, Tdfso, Vmip3s, Vmmp2p, Mcfprln, Mofsrln, Di3fpo, Npmsrn, Pz3msr, Tsmpr, Di3mso, Afpfsoy, Momprly, Px2-pr, Dd3msr, Mlfpo, Mffsrln, Qn, Di3-sr, Mofsoly, Afpfpry, Pi3msr, Tsfpr, Cc, Dd3fsr---e, Afpfpon, Dd3fpr---o, Tdfpo, Pw3--r, Pp3-so, Np, Pd3-po, Pp3-po, Mcmsoln, SLASH, Tdmpo, Pp3mpr, Vmil3p, Ps3ms-s, Px3--o, Dw3msr, Vmis3s, Mofs-ly, Sp, Mcfp-l, Pw3mso, Ncfsoy, Vmsp1p, Afpmprn, Ds1ms-s, Ncfpry, Di3fso, Mcmp-l, Ps1fsrp, Afpfsrn, Dd3fpr---e, Dw3mso, Afpmpoy, Ps3---p, A

In [10]:
# Now the same analysis for 1.1
# -----------------------------
# NOTE: for the purpose of this notebook we'll use simple code.
# I have written a library to read and write `cupt` and `conllu`
# files, but I have to add some good tests before releasing it.
# Search for languages where UPOS tags are not being used
path = os.path.join(PATH_TO_FILES, "1.1")
found = []

for folder in os.listdir(path=path):
    if not os.path.isdir(os.path.join(path, folder)):
        continue
    try:
        wrong_tags = check_upos_tags(os.path.join(path, folder))
        found += wrong_tags
    except Exception:
        continue
# Each report starts with the file path, so this is the set of flagged files
languages = {report[0] for report in found}
languages

{'data/1.1/EU/dev.cupt',
 'data/1.1/EU/test.blind.cupt',
 'data/1.1/EU/test.cupt',
 'data/1.1/EU/train.cupt',
 'data/1.1/HU/dev.cupt',
 'data/1.1/HU/test.blind.cupt',
 'data/1.1/HU/test.cupt',
 'data/1.1/HU/train.cupt',
 'data/1.1/IT/dev.cupt',
 'data/1.1/IT/test.blind.cupt',
 'data/1.1/IT/test.cupt',
 'data/1.1/IT/train.cupt',
 'data/1.1/TR/dev.cupt',
 'data/1.1/TR/test.blind.cupt',
 'data/1.1/TR/test.cupt',
 'data/1.1/TR/train.cupt'}

In [12]:
# The Slovenian data is missing because its UPOS column holds only
# blanks ("_"), as demonstrated in this quick example
files = [
    'data/1.1/SL/train.cupt',
    'data/1.1/RO/train.cupt',
    'data/1.1/IT/train.cupt',
    'data/1.1/EU/train.cupt',
    'data/1.1/HU/train.cupt',
    'data/1.1/FR/train.cupt'
]
all_tags = []

for file in files:
    tags = []
    with open(file) as handle:
        lines = handle.read().splitlines()
    for line in lines:
        if not line or line.startswith("#"):
            continue
        line_as_list = line.split("\t")
        upos_tag = line_as_list[3]
        tags.append(upos_tag)
    all_tags.append([file, set(tags)])
    
for lang in all_tags:
    print(f'File: {lang[0]}')
    print('------------------------')
    print(f'{",".join(lang[1])}')
    print()

File: data/1.1/SL/train.cupt
------------------------
_

File: data/1.1/RO/train.cupt
------------------------
ADJ,PUNCT,DET,PROPN,NUM,CCONJ,ADP,SYM,INTJ,AUX,NOUN,ADV,PART,X,SCONJ,VERB,PRON

File: data/1.1/IT/train.cupt
------------------------
I,V,_,T,F,D,N,X,E,C,P,A,R,S,B

File: data/1.1/EU/train.cupt
------------------------
ADJ,BST,ADL,HAOS,IZE,PRON,IOR,INTJ,PRT,PROPN,NUM,SYM,AUX,PART,SNB,ADB,AUR,ADT,CONJ,PUNCT,PUNT,LOT,LAB,VERB,DET,ITJ,NOUN,ADV,X,SIG,ADI

File: data/1.1/HU/train.cupt
------------------------
O,?,V,;,P,....,-,N,,,...,I,",T,],C,:,§,[,Y,M,.,R,%,(,),X,+,A,Z,S

File: data/1.1/FR/train.cupt
------------------------
ADJ,DET,PUNCT,PROPN,NUM,CCONJ,ADP,_,INTJ,SYM,AUX,NOUN,ADV,PART,X,SCONJ,VERB,PRON



## Analysis results

As we can see, Romanian (`RO`) and Slovenian (`SL`) use `XPOS` tags in place of `UPOS`, whereas French, Hungarian and Italian have their own variants of UPOS tags.

Even so, in edition [1.1](https://gitlab.com/parseme/sharedtask-data/tree/master/1.1) none of these languages except Slovenian shows the same problem: Slovenian is the only one whose tag list is empty.

The next step is therefore to fix the Slovenian files: copy the `XPOS` tags from the `UPOS` column into the `XPOS` column, and derive `UPOS` from the first (two) letters of the `XPOS` tags.
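
The planned repair can be sketched on a single token line. This is a minimal illustration under stated assumptions: the function name `repair_line`, the reduced mapping `FIRST_LETTER_TO_UPOS` and the sample line are hypothetical, and only the first letter of the tag is consulted.

```python
# Hypothetical, reduced mapping from the first letter of a JOS tag to UPOS
FIRST_LETTER_TO_UPOS = {"S": "NOUN", "G": "VERB", "P": "ADJ", "R": "ADV"}


def repair_line(line, mapping=FIRST_LETTER_TO_UPOS):
    """Move the tag found in the UPOS column to XPOS and derive a UPOS tag."""
    columns = line.split("\t")
    jos_tag = columns[3]
    if jos_tag == "_":
        # Nothing to move: leave the line as it is
        return line
    columns[4] = jos_tag                        # XPOS column
    columns[3] = mapping.get(jos_tag[:1], "_")  # UPOS column, "_" if unknown
    return "\t".join(columns)


# Made-up token line with a JOS tag sitting in the UPOS column
sample = "1\tpes\tpes\tSomei\t_\t_\t2\tnsubj\t_\t_\t*"
print(repair_line(sample).split("\t")[3:5])  # ['NOUN', 'Somei']
```

Lines whose fourth column is already `_` are returned unchanged, so the sketch only covers files where the language-specific tag was stored in the `UPOS` column.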

In [4]:
# JOS to UPOS V1 mapping
JOS_TO_UPOS = {
    "S": "NOUN",
    "G": "VERB",
    "P": "ADJ",
    "R": "ADV",
    "Z": "PRON",
    "K": "NUM",
    "D": "ADP",  # https://universaldependencies.org/u/pos/all.html#adp-adposition
    "V": {
        "p": "CONJ",
        "d": "SCONJ"
    },
    "L": "PART",
    "M": "INTJ",
    "O": "_",
    "N": "X"
}

In [13]:
def lookup_upos_tag(word, jos_tag):
    """Get the UPOS tag for a given JOS token.
    
    Note:
        The implementation is very crude.
        
    Parameters
    ----------
    word: str
        The word in its original form
    
    jos_tag: str
        An XPOS tag for the Slovenian language
    
    Returns
    -------
    str
        The V1 UPOS tag for a given word, or "_" if no mapping is known
    """
    if word in [".", ",", "(", ")"]:
        return "PUNCT"
    # Slicing instead of indexing avoids an IndexError on short or empty tags
    upos_tag = JOS_TO_UPOS.get(jos_tag[:1], "_")
    # Exception for conjunctions: the second letter selects the subtype
    # http://nl.ijs.si/jos/msd/html-sl/msd.C.html
    if isinstance(upos_tag, dict):
        upos_tag = upos_tag.get(jos_tag[1:2], "_")
    return upos_tag

# Open files
path_to_sl = os.path.join(PATH_TO_FILES, "1.1", "SL")
files = {}
for file in os.listdir(path=path_to_sl):
    filename, ext = os.path.splitext(file)
    if ext != ".cupt":
        continue
    with open(os.path.join(path_to_sl, file)) as handle:
        files[filename] = handle.read().splitlines()
print(list(files.keys()))

['dev', 'test', 'test_with_upos', 'train_with_upos', 'dev_with_upos', 'test.blind', 'train', 'test.blind_with_upos']


In [78]:
# Add UPOS tags with a simple loop
for file_name, lines in files.items():
    for index, line in enumerate(lines):
        # Skip blank lines and comments
        if not line or line.startswith("#"):
            continue
        line_as_list = line.split("\t")
        upos = lookup_upos_tag(line_as_list[1], line_as_list[4])
        line_as_list[3] = upos
        lines[index] = "\t".join(line_as_list)
    with open(os.path.join(path_to_sl, file_name + "_with_upos.cupt"), "w") as f:
        for row in lines:
            f.write(row + "\n")

In [79]:
# Test the written files
cupt = open("data/1.1/SL/test.cupt").readlines()
with_upos = open("data/1.1/SL/test_with_upos.cupt").readlines()
print(f'Original len: {len(cupt)}, with upos: {len(with_upos)}')

Original len: 46506, with upos: 46506


In [80]:
# Some more tests
cupt_vid_len = len([line for line in cupt if line and not line.startswith("#") and "VID" in line])
with_upos_vid_len = len([line for line in with_upos if line and not line.startswith("#") and "VID" in line])
cupt_vid_len == with_upos_vid_len

True