<a id="Noise"></a>

# Re3d Dataset: Adding Noise

The Re3d Dataset was compiled for Named Entity Recognition (NER) on the subject of national defense. Its sources include:

* Australian Department of Foreign Affairs
* BBC Online
* CENTCOM
* Delegation of the European Union to Syria
* UK Government
* US State Department
* Wikipedia

This notebook explores how to add noise to the ground truth labels.

In [31]:
from pathlib import Path
import pandas as pd
import joblib
from tqdm import tqdm

In [32]:
ROOT_DIR = Path('notebooks/noise.ipynb').resolve().parents[2]
DATA_DIR = ROOT_DIR / "data"
PREPARED_DIR = DATA_DIR / "prepared"
NOISE_DIR = PREPARED_DIR / "noise"

In [33]:
from ast import literal_eval


df = pd.read_csv(PREPARED_DIR / "master.csv")
df["tags"] = df["tags"].apply(literal_eval)
df.head()

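| Unnamed: 0 | sentence_num | word | start_idx | end_idx | tags | single_tag | POS |
|---|---|---|---|---|---|---|---|
| 0 | 0 | This | 0 | 4 | [B-Temporal] | B-Temporal | PRON |
| 1 | 0 | week | 5 | 9 | [I-Temporal] | I-Temporal | NOUN |
| 2 | 0 | sees | 10 | 14 | [O] | O | VERB |
| 3 | 0 | the | 15 | 18 | [O] | O | PRON |
| 4 | 0 | start | 19 | 24 | [O] | O | VERB |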


In [34]:
from rlner.utils import SentenceGetter

In [35]:
getter = SentenceGetter(df)
sentences = getter.sentences
print(f"There are {len(sentences)} total sentences")

There are 928 total sentences


We need to split the dataset into train, validation, and test subsets. We will apply varying amounts of noise to the train subset only, leaving the validation and test subsets unchanged so that evaluation is always against clean labels.

In [36]:
import random

random.Random(42).shuffle(sentences)

total_sentences = len(sentences)
val_idx = int(total_sentences * 0.8)
test_idx = val_idx + int(total_sentences * 0.1)

train_sentences = sentences[:val_idx]
val_sentences = sentences[val_idx: test_idx]
test_sentences = sentences[test_idx:]

In [37]:
print(f"Number of Train Sentences: {len(train_sentences)}")
print(f"Number of Validation Sentences: {len(val_sentences)}")
print(f"Number of Test Sentences: {len(test_sentences)}")

Number of Train Sentences: 742
Number of Validation Sentences: 92
Number of Test Sentences: 94
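
The test subset ends up slightly larger than the validation subset because both cut points are truncated with `int()` and the test subset takes whatever remains. A minimal sketch of that arithmetic, using a hypothetical helper `split_sizes` that is not part of `rlner`:

```python
def split_sizes(total, train_frac=0.8, val_frac=0.1):
    """Return (train, val, test) sizes using the same truncating arithmetic as the split above.

    Hypothetical helper for illustration only.
    """
    val_idx = int(total * train_frac)            # end of the train slice
    test_idx = val_idx + int(total * val_frac)   # end of the validation slice
    return val_idx, test_idx - val_idx, total - test_idx


print(split_sizes(928))  # (742, 92, 94)
```

Whatever the fractions lose to truncation is absorbed by the last slice, so the three sizes always sum to the total.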


In [38]:
from rlner.noise import add_noise
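
The internals of `add_noise` are not shown in this notebook. As a rough illustration of what label noise can look like, here is a sketch under stated assumptions: each sentence is assumed to be a list of `(word, tag)` tuples, and the hypothetical function `flip_tags` replaces each tag with a different one with probability `percentage`. It is a stand-in for illustration, not the `rlner` implementation.

```python
import random


def flip_tags(sentences, percentage, tag_set, seed=0):
    """Hypothetical label-noise sketch: swap each tag with probability `percentage`.

    Assumes sentences are lists of (word, tag) tuples. Not the rlner implementation.
    """
    rng = random.Random(seed)
    noisy = []
    for sentence in sentences:
        noisy_sentence = []
        for word, tag in sentence:
            if rng.random() < percentage:
                # Choose among the other tags so a flip always changes the label
                tag = rng.choice([t for t in tag_set if t != tag])
            noisy_sentence.append((word, tag))
        noisy.append(noisy_sentence)
    return noisy


toy = [[("This", "B-Temporal"), ("week", "I-Temporal"), ("sees", "O")]]
tags = ["O", "B-Temporal", "I-Temporal"]

print(flip_tags(toy, 0.0, tags))  # labels unchanged
print(flip_tags(toy, 1.0, tags))  # every label replaced by a different one
```

The sketch builds new lists rather than editing in place, so the clean training sentences stay available for the next noise level.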

In [12]:
def main() -> None:
    # Load master data
    df = pd.read_csv(PREPARED_DIR / "master.csv")
    df["tags"] = df["tags"].apply(literal_eval)

    # Gather as sequences
    getter = SentenceGetter(df)
    sentences = getter.sentences

    # Split into train, val, test
    random.Random(42).shuffle(sentences)
    total_sentences = len(sentences)
    val_idx = int(total_sentences * 0.8)
    test_idx = val_idx + int(total_sentences * 0.1)

    train_sentences = sentences[:val_idx]
    val_sentences = sentences[val_idx: test_idx]
    test_sentences = sentences[test_idx:]

    # Make sure the output directory (and any missing parents) exists, then save val/test
    NOISE_DIR.mkdir(parents=True, exist_ok=True)

    with open(PREPARED_DIR / "validation.joblib", "wb") as fp:
        joblib.dump(val_sentences, fp, compress=3)

    with open(PREPARED_DIR / "test.joblib", "wb") as fp:
        joblib.dump(test_sentences, fp, compress=3)

    # Apply noise and save
    noisy_percentages = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

    for percentage in tqdm(noisy_percentages, desc="Noise Percentages"):
        noisy_sentences = add_noise(train_sentences, percentage)

        with open(NOISE_DIR / f"noise_{percentage}.joblib", "wb") as fp:
            joblib.dump(noisy_sentences, fp, compress=3)
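
Because the file name is built from the float itself, each noise level is written as `noise_0.0.joblib`, `noise_0.1.joblib`, and so on. The sketch below lists the file names the loop in `main` would produce; the `noise_dir` path is a placeholder for illustration, not necessarily the resolved `NOISE_DIR` on your machine.

```python
from pathlib import Path

# Placeholder directory for illustration; the notebook derives NOISE_DIR from ROOT_DIR
noise_dir = Path("data") / "prepared" / "noise"

noisy_percentages = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

# Same f-string pattern as in main()
expected_files = [noise_dir / f"noise_{p}.joblib" for p in noisy_percentages]

for path in expected_files:
    print(path.name)
```

Any downstream code that loads these files needs to format the percentage the same way, for example `0.0` rather than `0`.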