## Preparing Data for Validation of DDB Tagger
---

This notebook contains the pipeline to create a dataset for human validation of the DDB Tagger. It samples sentences from a broad range of texts in the Danish Gigaword Corpus (DAGW) and tags one target word per sentence using the DDB Tagger. The sampled sentences, their target words, and the tags are stored in a dataframe that human annotators can use to validate the tagger's output. The pipeline consists of the following steps:
 
#### 0 Import Dependencies
- Load the Danish language model (spaCy: `da_core_news_lg`) for sentence splitting, tokenization, and POS tagging
- Load the DDB Tagger
 
#### 1 Defining Inputs for Sampling of Sentences and Target Words
- <u>Categories</u>: Defines the categories and sections of the DAGW corpus to sample from (based on the dataset as retrieved on 20/02/2022)
   - The following datasets were excluded:
       - CONVERSATION: NAAT
       - SOCIAL MEDIA: General Discussions, Parliament Elections
       - WEB
       - WIKI&BOOKS: Danish Literature, Gutenberg, WikiSource, Johannes V. Jensen, Religious Texts
       - NEWS: DanAvis
       - OTHER
- <u>Subset Size</u>: The number of characters read from the start of each sampled file. Defaults to 1000 characters.
- <u>Context Size</u>: The number of tokens that must appear both before and after the target within the same sentence. Defaults to 10 tokens.
- <u>Target POS</u>: The POS tag that the target word must have. Defaults to `NOUN`.
 
#### 2 Sampling Sentences and Target Words for Defined Categories
The following steps are performed to retrieve target tokens and sentences and create the validation data (here described using the default inputs):<br>
For each of the defined categories:
 
- Create a list of files across all sections belonging to the given category
- Until 20 targets/sentences have been found for the given category:
    - Sample a random file from the list of files and remove it from the list (to avoid double sampling)
    - Read the first 1000 characters of the file
    - Split the text into sentences, excluding those that contain line breaks
    - For each sentence:
        - Split the sentence into tokens, excluding space and punctuation
        - If the number of tokens is large enough to contain a target word and 10 context tokens before and after the target (21 tokens in total):
            - For each token, check if:
                - Token is longer than 1 character
                - Token occurs only once in the given sentence (to avoid confusion when tagging)
                - Token has the required POS tag (`NOUN`)
                - Token has 10 tokens before and 10 tokens after it in the sentence
                - Token does not occur in the validation data yet
                - Sentence of the token does not occur in the validation data yet
            - If the above requirements are fulfilled:
                - Tag the sentence of the token using the DDB Tagger
                - Retrieve the tags of the target token in the sentence
                - If the target token has 4+ tags:
                    - Highlight the token in its sentence
                    - Add information (token, sentence, tags) to the validation data
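
The sentence-internal token checks and the highlighting step listed above can be sketched without spaCy or the tagger. This is a minimal illustration under stated assumptions, not the notebook's implementation: `eligible_indices`, `highlight`, and the toy sentence are hypothetical, the tokens are assumed to be already stripped of punctuation and whitespace, the checks against previously sampled data are left out, and "10 tokens before" is read as a zero-based index of at least `context_size`.

```python
from typing import List, Tuple


def eligible_indices(tokens: List[Tuple[str, str]],
                     context_size: int = 10,
                     target_pos: str = "NOUN") -> List[int]:
    """Return the positions of tokens that pass the sentence-internal checks.

    `tokens` is a list of (text, pos) pairs (hypothetical input format).
    """
    texts = [text for text, _ in tokens]
    hits = []
    for i, (text, pos) in enumerate(tokens):
        if (len(text) > 1                                # longer than one character
                and texts.count(text) == 1               # unique within the sentence
                and pos == target_pos                    # required POS tag
                and i >= context_size                    # enough tokens before
                and len(tokens) - 1 - i >= context_size):  # enough tokens after
            hits.append(i)
    return hits


def highlight(sentence: str, start: int, length: int) -> str:
    """Wrap the span sentence[start:start+length] in >> and << markers."""
    end = start + length
    return sentence[:start] + ">>" + sentence[start:end] + "<<" + sentence[end:]


# Toy sentence with 21 tokens: only the noun in the middle has enough context
toy = [(f"w{i}", "DET") for i in range(21)]
toy[3] = ("bil", "NOUN")    # a noun, but too close to the sentence start
toy[10] = ("hus", "NOUN")   # a noun with 10 tokens on each side

print(eligible_indices(toy))               # [10]
print(highlight("Et hus er rødt", 3, 3))   # Et >>hus<< er rødt
```

With the default context size, a sentence of exactly 21 tokens has a single eligible position, the middle one, which is why longer sentences yield more candidates.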
 
#### 3 Processing Output
- Create a dataset with all information (`validation_data_full.csv`) and a reduced dataset for rating, which contains less information and an empty column for the ratings (`validation_data_rating.csv`).

---

### 0 Importing Dependencies 

In [1]:
# basics
import os, sys
import pandas as pd
import random

# danish language model
import spacy
nlp = spacy.load("da_core_news_lg")

# tagger
sys.path.append("..")
from src.DDB_tagger import DDB_tagger
Tagger = DDB_tagger(da_model="spacy")

### 1 Defining Inputs for Sampling of Sentences and Target Words

In [2]:
# Define categories and sektions of DAGW to sample files from
categories = {"LEGAL": ["retsinformationdk", "skat", "retspraksis"],
              "SOCIAL MEDIA": ["hest"],
              "CONVERSATION": ["opensub", "ft", "ep", "spont"],
              "WIKI&BOOKS": ["wiki", "wikibooks"],
              "NEWS": ["tv2r"]}

# Define subset size (top n characters of a sampled file to read)
subset_size = 1000

# Define context size (n tokens before and after target in the same sentence)
context_size = 10

# Define POS tag of target words
target_pos = "NOUN"

### 2 Sampling Sentences and Target Words for Defined Categories

In [3]:
random.seed(1)
list_of_dicts = []

# Loop over categories and related sektions
for category, sektions in categories.items():
    
    # --- GET FILES OF CATEGORY ---

    category_files = []
    # Loop over sektions
    for sektion in sektions:
        # Get path of directory for sektion (named so it does not shadow the built-in dir)
        sektion_dir = f"../../DAGW/sektioner/{sektion}/"
        # Define prefix of the files in the sektion
        prefix = f"{sektion}_"
        # Get all filepaths that start with the prefix from the directory
        files = sorted(os.path.join(sektion_dir, name) for name in os.listdir(sektion_dir) if name.startswith(prefix))
        # Append sektion files to category files
        category_files.extend(files)

    # Print number of files which were found for category
    print("\n--------\n")
    print(f"CATEGORY: {category} - found {len(category_files)} files to sample from.")

    # --- SAMPLE 20 SENTENCES FOR CATEGORY ---

    category_sentences = 0
    while category_sentences < 20:
        
        # Sample a random file from the category files
        file = random.sample(category_files, 1)[0]
        # Remove the file to avoid sampling it again
        category_files.remove(file)
        # Read only the first n characters of the file and close it afterwards
        with open(file, "r", encoding="utf-8") as f:
            text = f.read(subset_size)

        # --- RETRIEVE APPROPRIATE SENTENCES AND TARGETS ---
        
        # Split sentences of text
        sentences = [str(sent) for sent in nlp(text).sents if "\n" not in str(sent)]
        
        # Loop over sentences
        for sent in sentences:
            
            # Parse the sentence once and keep text, POS tag and character offset
            # of every token (excluding punctuation and space)
            token_pos_idx = [(token.text, token.pos_, token.idx) for token in nlp(sent) if not token.is_punct and not token.is_space]
            # Get only the token texts of the sentence
            tokens = [text for text, _, _ in token_pos_idx]
            
            # If the sentence is long enough to contain a target with sufficient context
            if len(tokens) >= (context_size + 1 + context_size):
                
                # Loop over the tokens in the sentence
                for idx, token_tuple in enumerate(token_pos_idx):
                    
                    # Save the info from the token
                    target, pos, start_idx = token_tuple
                    
                    # If the token fulfils list of requirements: 
                        # Category still needs more targets/sentences
                    if (category_sentences < 20 and
                        # Token is longer than a single character
                        len(target) > 1 and
                        # Token only occurs once in the sentence
                        tokens.count(target) == 1 and
                        # Token has POS tag
                        pos == target_pos and
                        # Enough context before token (idx is zero-based, so idx tokens precede it)
                        idx >= context_size and 
                        # Enough context after token
                        idx < len(token_pos_idx) - context_size and 
                        # Target not in sampled sentences/targets
                        target not in [d["TARGET"] for d in list_of_dicts] and 
                        # Sentence not in sampled sentences/targets
                        sent not in [d["SENT_ORIGINAL"] for d in list_of_dicts]):

                        # --- TAG TARGET TOKEN IF FULFILLING REQUIREMENTS ---

                        # Tag the sentence
                        sent_tagged = Tagger.tag_text(sent, only_top3_results=False, only_tagged_results=True)
                        # Get only the tags of the target token
                        target_tagged = sent_tagged[sent_tagged["TOKEN"] == target].reset_index()

                        # --- USE TARGET TOKEN IN VALIDATION DATA IF FULFILLING REQUIREMENTS ---

                        # If the target token was returned by the tagger and has 4 or more tags
                        if not target_tagged.empty and target_tagged.at[0, "DDB4+"] != "-":

                            # Highlight the token in the sentence for rating
                            sent_highlight = sent[:start_idx] + ">>" + sent[start_idx:start_idx+len(target)] + "<<" + sent[start_idx+len(target):]

                            # Create a dictionary with all the info
                            target_dict = {"TARGET": target,
                                           "SENT_ORIGINAL": sent,
                                           "SENT_HIGHLIGHT": sent_highlight,
                                           "CATEGORY": category,
                                           "FILE": file,
                                           "DDB1": target_tagged.at[0, "DDB1"],
                                           "DDB2": target_tagged.at[0, "DDB2"], 
                                           "DDB3": target_tagged.at[0, "DDB3"],
                                           "DDB4+": target_tagged.at[0, "DDB4+"]}

                            # Append the dictionary to all dictionaries
                            list_of_dicts.append(target_dict)
                            # Add count to number of category sentences
                            category_sentences += 1
                            # Print continuous count of number of sentences found for category
                            print(f"- Number of targets/sentences found: {category_sentences}", end='\r')


--------

CATEGORY: LEGAL - found 83201 files to sample from.
- Number of targets/sentences found: 20
--------

CATEGORY: SOCIAL MEDIA - found 14498 files to sample from.
- Number of targets/sentences found: 20
--------

CATEGORY: CONVERSATION - found 38181 files to sample from.
- Number of targets/sentences found: 20
--------

CATEGORY: WIKI&BOOKS - found 427497 files to sample from.
- Number of targets/sentences found: 20
--------

CATEGORY: NEWS - found 49137 files to sample from.
- Number of targets/sentences found: 20
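
The cell above draws one path with `random.sample` and then calls `list.remove`, which scans the list on every draw. For categories with several hundred thousand files, one alternative is to shuffle a copy of the list once and pop paths from its end. The following is a minimal sketch of that idea; `draw_files` and the toy file names are hypothetical and not part of the notebook, and because it consumes random numbers differently it would not reproduce the sample recorded above.

```python
import random


def draw_files(paths, seed=1):
    """Yield every path exactly once, in random order."""
    rng = random.Random(seed)   # local generator, leaves the global seed untouched
    pool = list(paths)          # copy, so the caller's list is not modified
    rng.shuffle(pool)
    while pool:
        yield pool.pop()        # removing from the end takes constant time


# Toy file names (invented for illustration)
files = [f"wiki_{i}" for i in range(5)]
drawn = list(draw_files(files))

print(sorted(drawn) == sorted(files))   # True: every file is drawn exactly once
```

A generator also ends cleanly once all files are used up, whereas `random.sample` raises a `ValueError` when asked to draw from an empty list.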

### 3 Processing Output

In [4]:
# PROCESSING OUTPUT
df_full = pd.DataFrame(list_of_dicts)
df_rating = df_full.drop(["SENT_ORIGINAL", "FILE", "CATEGORY"], axis=1)
df_rating["RATING"] = ""

In [5]:
# CHECKING FOR DUPLICATES
print(len(set(df_full["SENT_ORIGINAL"].values)), len(set(df_full["SENT_HIGHLIGHT"].values)))

100 100
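
Comparing the two counts confirms that no sentence was sampled twice, but it would not show which value repeats if the numbers differed, and it does not cover the `TARGET` column, which the sampling loop also keeps unique. A small helper along the following lines could report offending values directly. It is only a sketch: `find_duplicates` and the toy rows are illustrative assumptions, and the function expects a list of dictionaries shaped like `list_of_dicts`.

```python
from collections import Counter


def find_duplicates(records, key):
    """Return {value: count} for every value of `key` that occurs more than once."""
    counts = Counter(record[key] for record in records)
    return {value: n for value, n in counts.items() if n > 1}


# Toy rows (invented for illustration)
rows = [
    {"TARGET": "hus", "SENT_ORIGINAL": "s1"},
    {"TARGET": "bil", "SENT_ORIGINAL": "s2"},
    {"TARGET": "hus", "SENT_ORIGINAL": "s3"},
]

print(find_duplicates(rows, "TARGET"))          # {'hus': 2}
print(find_duplicates(rows, "SENT_ORIGINAL"))   # {}
```

An empty result for both keys corresponds to the `100 100` printed above.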


In [6]:
# SAVING DATAFRAMES
df_full.to_csv("validation_data_full.csv")
df_rating.to_csv("validation_data_rating.csv")