# Get CoNLL-U format

This notebook draws a random sample from the sentences collected via the ACL Anthology and arXiv APIs, and sets up code for writing those sentences in CoNLL-U format with POS tags and dependency relations.

CoNLL-U format documentation: https://universaldependencies.org/format.html

#### Get a random sample from the collected data for initial tests

In [1]:
import random

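# Load the collected sentences; each file is closed as soon as its lines are read
with open('../preprocessed_data/acl_anthology_sentences.txt') as acl_file:
    random_acl_sample = random.sample(acl_file.readlines(), 13)
with open('../preprocessed_data/arxiv_sentences.txt') as arxiv_file:
    random_arxiv_sample = random.sample(arxiv_file.readlines(), 13)

# Write the 26 sampled sentences (13 per source) to one file, one sentence per line
with open("../preprocessed_data/random_sample.txt", "w") as file:
    for sample in random_acl_sample + random_arxiv_sample:
        file.write(sample)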
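
Because `random.sample` draws a different sample on every run, initial tests are easier to compare when the draw is seeded. The following is a minimal sketch of that idea, not part of the original notebook: the helper name `sample_lines`, the seed value, and the `demo` list are all assumptions made for illustration.

```python
import random

def sample_lines(lines, k, seed=0):
    """Return up to k randomly chosen lines; the same seed yields the same sample."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    lines = list(lines)
    return rng.sample(lines, min(k, len(lines)))  # never ask for more lines than exist

# Hypothetical stand-in for the lines of a sentence file
demo = [f"sentence {n}\n" for n in range(100)]
first = sample_lines(demo, 13, seed=42)
second = sample_lines(demo, 13, seed=42)
print(first == second)  # prints: True
```

Capping `k` at the number of available lines avoids the `ValueError` that `random.sample` raises when a file holds fewer than 13 sentences.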

#### Get CoNLL-U format

The output file follows the CoNLL-U guidelines. Each sentence starts with two comment lines giving the sentence id and the text, followed by one token per line with the following ten tab-separated fields:

**ID**: Word index, integer starting at 1 for each new sentence; may be a range for multiword tokens; may be a decimal number for empty nodes (decimal numbers can be lower than 1 but must be greater than 0). Here, the token's position in the sentence, counted from 1.
<br>
**FORM**: Word form or punctuation symbol. From spaCy token.text
<br>
**LEMMA**: Lemma or stem of word form. From spaCy token.lemma_
<br>
**UPOS**: Universal part-of-speech tag. From spaCy token.pos_
<br>
**XPOS**: Optional language-specific (or treebank-specific) part-of-speech / morphological tag; underscore if not available. From spaCy token.tag_ (fine-grained POS)
<br>
**FEATS**: List of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available. From spaCy token.morph
<br>
**HEAD**: Head of the current word, which is either a value of ID or zero (0). Computed as spaCy token.head.i + 1, since token.i is the 0-based index of a token within its parent document; the root token gets 0.
<br>
**DEPREL**: Universal dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one. From spaCy token.dep_, which labels the root as ROOT.
<br>
**DEPS**: Enhanced dependency graph in the form of a list of head-deprel pairs. Currently set to an underscore (_) to indicate that no information is provided.
<br>
**MISC**: Any other annotation. In this case, a flag (IsKeyword=Yes or IsKeyword=No) indicating whether the lemma is one of the AI entity keywords.
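
To make the column layout concrete, here is a minimal sketch that splits one token line into the ten named fields. The helper `parse_token_line` and the example row are illustrative assumptions: the row was written by hand to match the layout described above and is not taken from the notebook's output.

```python
# Column names in the order defined by the CoNLL-U format
CONLLU_FIELDS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
                 "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]

def parse_token_line(line):
    """Split one tab-separated token line into a dict keyed by field name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(CONLLU_FIELDS):
        raise ValueError(f"expected {len(CONLLU_FIELDS)} columns, got {len(values)}")
    return dict(zip(CONLLU_FIELDS, values))

# Hand-written example row (hypothetical, for illustration only)
example = "\t".join(["1", "Models", "model", "NOUN", "NNS",
                     "Number=Plur", "2", "nsubj", "_", "IsKeyword=Yes"])
row = parse_token_line(example)
print(row["LEMMA"], row["HEAD"])  # prints: model 2
```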

In [51]:
def count_roots(doc):
    """Count the tokens in a spaCy Doc whose dependency label is ROOT."""

    root_counter = 0

    for token in doc:
        if token.dep_ == 'ROOT':
            root_counter += 1

    return root_counter

In [49]:
def add_sentence_to_file(doc,sent_id,file):

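    # single-token keywords only, since each one is compared with a single token's lemma;
    # no surrounding spaces, because a lemma never contains them and would never match
    keywords = ['AI', 'LM', 'LLM', 'LLMs', 'GPT', 'chatGPT', 'model', 'system', 'algorithm']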

    firstrow = "# sent_id = {sent_id}".format(sent_id = sent_id)
    secondrow = "# text = {text}".format(text = doc)
    file.write(firstrow+'\n')
    file.write(secondrow+'\n')

    for i,token in enumerate(doc):
            
        if token.morph:
            token_morph = str(token.morph)
        else:
            token_morph = '_'

        if token.dep_ == 'ROOT':
            token_head = '0'
        else:
            token_head = str(token.head.i+1)

        if token.lemma_ in keywords:
            token_is_keyword = 'IsKeyword=Yes'
        else:
            token_is_keyword = 'IsKeyword=No'
                
        token_info = [str(i+1), token.text, token.lemma_, token.pos_, token.tag_, 
                      token_morph, token_head, token.dep_, '_', token_is_keyword] # add underscore to DEPS column
        tokenrow = '\t'.join(token_info)
            
        file.write(tokenrow+'\n')
            
    file.write('\n')
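
The `IsKeyword` flag in the function above relies on an exact string match between `token.lemma_` and the keyword list, so casing and stray whitespace decide the outcome. One possible variation, shown here as an assumption rather than as what the notebook does, is to normalise both sides before comparing. The names `KEYWORDS` and `keyword_flag` are hypothetical.

```python
# Lower-cased keyword set (assumption: case-insensitive matching is acceptable here)
KEYWORDS = {"ai", "lm", "llm", "llms", "gpt", "chatgpt", "model", "system", "algorithm"}

def keyword_flag(lemma, keywords=KEYWORDS):
    """Return the MISC value for a lemma, ignoring case and surrounding whitespace."""
    normalised = lemma.strip().lower()
    return "IsKeyword=Yes" if normalised in keywords else "IsKeyword=No"

for lemma in ["model", "AI", "ChatGPT", "parser"]:
    print(lemma, keyword_flag(lemma))
```

Lower-casing trades some precision for recall: it also matches any token whose lemma happens to be a lower-case `ai` or `lm`, so the flagged rows are worth spot-checking.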

In [50]:
import spacy

nlp = spacy.load("en_core_web_md")

# open both files in one with-statement so the input file is closed as well
with open("../preprocessed_data/random_sample.txt","r") as infile, open("../data/random_sample.txt","w") as outfile:

    sent_id = 0 # sentence counter, incremented for each sentence written to the file
    
    for sentence in infile:

        sentence = sentence.strip() # remove newline
        doc = nlp(sentence) # each line was stored as one sentence (a sent of a spaCy Doc) during collection
        num_of_roots = count_roots(doc) # keep only sentences with exactly 1 root; re-parsing a line may still yield more
        
        if num_of_roots == 1:
            sent_id += 1 # enumeration starts at 1
            add_sentence_to_file(doc,sent_id,outfile) # add conll format (minus DEPS) for each sentence
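
Once the file is written, a quick structural check can catch malformed rows before the data is used further. The sketch below is one way to do that under stated assumptions: `check_conllu_block` is a hypothetical helper, and the sentence block is hand-written to mirror the layout produced above, not copied from the real output.

```python
def check_conllu_block(block):
    """Return a list of problems found in one sentence block (empty list means OK)."""
    problems = []
    heads = []
    # Token lines are the non-empty lines that are not '#' comments
    token_lines = [l for l in block.splitlines() if l and not l.startswith("#")]
    for expected_id, line in enumerate(token_lines, start=1):
        cols = line.split("\t")
        if len(cols) != 10:
            problems.append(f"token {expected_id}: expected 10 columns, got {len(cols)}")
            continue
        if cols[0] != str(expected_id):
            problems.append(f"token {expected_id}: ID is {cols[0]}")
        heads.append(cols[6])
    if heads.count("0") != 1:
        problems.append(f"expected exactly 1 root, found {heads.count('0')}")
    return problems

# Hand-written block (hypothetical) in the same layout as the generated file
good_block = "\n".join([
    "# sent_id = 1",
    "# text = Models learn.",
    "\t".join(["1", "Models", "model", "NOUN", "NNS", "Number=Plur", "2", "nsubj", "_", "IsKeyword=Yes"]),
    "\t".join(["2", "learn", "learn", "VERB", "VBP", "Tense=Pres|VerbForm=Fin", "0", "ROOT", "_", "IsKeyword=No"]),
    "\t".join(["3", ".", ".", "PUNCT", ".", "PunctType=Peri", "2", "punct", "_", "IsKeyword=No"]),
])
print(check_conllu_block(good_block))  # prints: []
```

To apply it to the real file, the file contents could be split on blank lines and each block passed to the helper in turn.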