This notebook shows how the part-of-speech tags were added using the RFTagger. The RFTagger is not installed in the Docker container, so the code cannot be executed here. The run had to be restarted manually several times and took around 27 hours in total. The statements were processed in batches so that it was visible which statements had already been tagged. The batch boundaries were always set manually, so that I could check beforehand whether the previous batch had succeeded.

In [None]:
from subprocess import check_output, run
from nltk.tokenize import sent_tokenize, word_tokenize
import os
import pandas as pd
import re
from pandarallel import pandarallel
from pathlib import Path

In [None]:
pandarallel.initialize(progress_bar=False)

Run the make command to build the RFTagger.

In [None]:
run(["make"], cwd="RFTagger/src")

In [None]:
def run_batches(df,starting_index, max_len):
    
    if starting_index > max_len:
        print("max_len must not be less than starting_index")
        return
    
    if max_len < 2000:
        print("Please specify a number larger than 2000")
        return
    
    # the run should be done in 1000 piece batches
    chosen_index = list(range(starting_index,starting_index+1000))
    
    # the 1000 piece batches are stored in separate .csv files with the naming convention
    # batch_<starting_index>_<ending_index>.csv (batch_0_999.csv)
    file_name = get_filename(chosen_index)
    
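    while chosen_index[-1] <= (max_len):
        # copy the slice so that adding a column does not write into a view of df
        df_current = df.iloc[chosen_index].copy()
        # the text to tag is taken from the second column (position 1)
        df_current["tagged"] = df_current.parallel_apply(
            lambda row: tag("test_{}".format(row.name),row.iloc[1]),axis=1)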
        
        df_current.to_csv(file_name,index=False)
        new_index = [x+1000 for x in chosen_index]
        
        chosen_index = new_index
        file_name = get_filename(chosen_index)
    

def get_filename(index):
    return "batch_{start}_{end}.csv".format(start=index[0],end=index[-1])

def tag(filename, text):
    # write one token per line and separate sentences by an empty line
    with open("RFTagger/{}".format(filename),"w") as file:
        file.write("\n\n".join("\n".join(word_tokenize(sentence, language='german')) for sentence in sent_tokenize(text, language='german')))
    
    res = check_output(["src/rft-annotate", "lib/german.par", filename], cwd="RFTagger").decode("utf-8").split("\n")

    # remove the temporary input file again
    os.remove("RFTagger/{}".format(filename))
    
    return ' '.join(res)

def contains(text,tag):
    regexp = re.compile(r'{}'.format(tag))
    return bool(regexp.search(text))
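
The `tag` function above writes its input file with one token per line and an empty line between sentences. The following is a minimal sketch of that layout only. It uses hand-tokenized sentences so that it runs without NLTK or the RFTagger, and `to_tagger_input` is a hypothetical helper name that does not appear in the original notebook.

```python
def to_tagger_input(sentences):
    """Join tokenized sentences: one token per line, empty line between sentences."""
    return "\n\n".join("\n".join(tokens) for tokens in sentences)


# Hand-tokenized example; in the notebook the tokens come from NLTK.
example = [["Das", "ist", "gut", "."], ["Danke", "!"]]
print(to_tagger_input(example))
```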

In [None]:
df = pd.read_csv("../../data/protocols/all_parsed.csv")

This is just an example. All batches were created like this, which took around 27 hours in total.

In [None]:
run_batches(df, 60000, 63000)
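
To see which files a call like this produces, the loop condition of `run_batches` can be replayed without a DataFrame or the tagger. This is only a sketch that mirrors the index arithmetic of the function above; `batch_ranges` and its `size` parameter are hypothetical names introduced for illustration.

```python
def batch_ranges(starting_index, max_len, size=1000):
    """Return (first, last) row positions of every batch the loop would write."""
    ranges = []
    first = starting_index
    # Same stop condition as run_batches: the last position must not exceed max_len.
    while first + size - 1 <= max_len:
        ranges.append((first, first + size - 1))
        first += size
    return ranges


for first, last in batch_ranges(60000, 63000):
    print("batch_{}_{}.csv".format(first, last))
```

For the call above this gives three files, from batch_60000_60999.csv to batch_62000_62999.csv. Rows from position 63000 onward are not covered by the loop.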

The last 909 entries were completed as follows:

In [None]:
df_short = df.tail(909).copy()

In [None]:
df_short["tagged"] = df_short.parallel_apply(lambda row: tag("test_{}".format(row.name),row.iloc[1]),axis=1)

The batches were stored as shown here, with the start and end index in the file name.

In [None]:
df_short.to_csv("batch_63000_63909.csv",index=False)
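
Because every file name carries its start and end index, the next starting index after an interrupted run can be read off the existing files. The sketch below assumes the naming convention batch_<starting_index>_<ending_index>.csv and that the batches were written without gaps. `next_start_index` is a hypothetical helper, not part of the original workflow, in which the boundaries were set by hand.

```python
import re

# Matches the naming convention batch_<starting_index>_<ending_index>.csv
BATCH_NAME = re.compile(r"batch_(\d+)_(\d+)\.csv$")


def next_start_index(filenames):
    """Return the position after the highest batch end found, or 0 if there is none."""
    ends = [int(m.group(2)) for m in map(BATCH_NAME.search, filenames) if m]
    return max(ends) + 1 if ends else 0


# Example file listing; names that do not follow the convention are ignored.
done = ["batch_0_999.csv", "batch_1000_1999.csv", "notes.txt"]
print(next_start_index(done))
```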

The batches were concatenated as shown below. This code cannot be executed because the path does not exist.

In [None]:
files = Path(PATH_TO_FOLDER_WITH_BATCHES).glob("*.csv")

In [None]:
dfs_batches = list()

for f in files:
    data = pd.read_csv(f)
    dfs_batches.append(data)

In [None]:
df_final = pd.concat(dfs_batches, ignore_index=True)

In [None]:
df_final.to_csv("all_tagged.csv",index=False)
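
`Path.glob` does not promise any particular order, and a plain alphabetical sort would place batch_10000_10999.csv before batch_2000_2999.csv. If the rows of the concatenated file should follow the original order, the files can be sorted by their numeric start index first. This is a sketch under the assumption that all files follow the batch_<starting_index>_<ending_index>.csv convention; `batch_start` is a hypothetical helper.

```python
from pathlib import Path


def batch_start(path):
    """Extract the numeric start index from batch_<start>_<end>.csv."""
    return int(Path(path).stem.split("_")[1])


names = ["batch_10000_10999.csv", "batch_0_999.csv", "batch_2000_2999.csv"]
ordered = sorted(names, key=batch_start)
print(ordered)
```

The same key function works on the `Path` objects returned by `glob`, for example `sorted(files, key=batch_start)`.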