# Data Validation

After retrieving sentences that match our patterns through SPIKE's API, we want to verify the following before tagging the data:

1. Sentences are not too short (titles, for example).
2. Captures make sense (no non-alphabetical results, etc.).
3. SPIKE is capture-oriented: it returns one match per set of captures, so a sentence with more than one capture appears as several matches. We'd like to merge matches that share the same sentence. For example:
   a. sent 1: [David Bowie] and Freddie Mercury
   b. sent 2: David Bowie and [Freddie Mercury]

   These should be merged into a single sentence in which both names carry the musician label.
4. Similarly, for non-musicians we'd like to ignore the captures and look only at the NER results.
5. `'s` is not part of the entity.
6. Sentences in the train set do not appear in the test/dev sets.

and so on...
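
The merging step described in the list can be sketched as follows. This is a minimal illustration, not the notebook's implementation: `merge_matches` is a hypothetical helper, and it assumes that two matches with identical token lists come from the same sentence.

```python
from collections import defaultdict

def merge_matches(matches, label="musician"):
    """Group SPIKE-style matches by sentence and collect every capture span."""
    spans_by_sentence = defaultdict(set)
    words_by_sentence = {}
    for match in matches:
        # Assumption: identical token sequences mean the same sentence.
        key = tuple(match["words"])
        words_by_sentence[key] = match["words"]
        span = match["captures"][label]
        spans_by_sentence[key].add((span["first"], span["last"]))
    return [
        {"words": words_by_sentence[key], "spans": sorted(spans)}
        for key, spans in spans_by_sentence.items()
    ]

# Two matches for one sentence, each with a different capture.
matches = [
    {"words": ["David", "Bowie", "and", "Freddie", "Mercury"],
     "captures": {"musician": {"first": 0, "last": 1}}},
    {"words": ["David", "Bowie", "and", "Freddie", "Mercury"],
     "captures": {"musician": {"first": 3, "last": 4}}},
]
merged = merge_matches(matches)
print(merged[0]["spans"])  # [(0, 1), (3, 4)]
```

Each merged record keeps one copy of the tokens and a sorted list of inclusive `(first, last)` spans, all of which would receive the musician label.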

In [4]:
import string
import json
import pandas as pd
import glob

### Extract dev/test sentences

In [2]:
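def remove_tags(sentence):
    # Drop the tag suffix (everything from '-[' onward) from each token,
    # then normalize punctuation so sentences compare cleanly.
    tokens = [t.split('-[', 1)[0] for t in sentence.split()]
    return clean_punct(" ".join(tokens))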

def clean_punct(sentence):
    # Strip ASCII punctuation, then collapse any run of whitespace to one space.
    s = sentence.translate(str.maketrans('', '', string.punctuation))
    return " ".join(s.split())

test_path = '../data/musicians_dataset/test.txt'
dev_path = '../data/musicians_dataset/dev.txt'
with open(test_path, 'r') as ft, open(dev_path, 'r') as fd:
    test_set = [remove_tags(sent.strip()) for sent in ft]
    dev_set = [remove_tags(sent.strip()) for sent in fd]
# A set makes the membership test in the filtering loop O(1) per sentence.
dev_and_test = set(dev_set + test_set)
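
The overlap check only works if both sides are normalized the same way. Here is a minimal sketch of that idea, assuming tagged tokens look like `Bowie-[MUS]` (the tag format is a guess inferred from the `'-['` split; `normalize` and the sample sentences are hypothetical):

```python
import string

def normalize(sentence):
    # Drop a trailing "-[TAG]" from each token, strip ASCII punctuation,
    # and collapse whitespace so that formatting differences do not matter.
    tokens = [t.split("-[", 1)[0] for t in sentence.split()]
    text = " ".join(tokens).translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

# A held-out (dev/test) line with tags, and an untagged candidate for training.
held_out = {normalize("David-[MUS] Bowie-[MUS] was born in London .")}
candidate_words = ["David", "Bowie", "was", "born", "in", "London", "."]

is_leak = normalize(" ".join(candidate_words)) in held_out
print(is_leak)  # True
```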

In [31]:
def sentence_is_not_too_short(sentence):
    # 'words' is a token list, so the threshold is in tokens, not characters.
    return len(sentence['words']) > 50

def capture_is_not_non_alphabetical(sentence):
    # Accept the capture only if it contains at least one letter
    # (upper- or lowercase).
    tokens = sentence["words"]
    first = sentence['captures']['musician']['first']
    last = sentence['captures']['musician']['last']
    capture_tokens = tokens[first:last + 1]  # 'last' is inclusive
    return any(ch.isalpha() for ch in "".join(capture_tokens))

In [25]:
positives = []
negatives = []

for file in glob.glob('../data/spike_matches/**/*.json', recursive=True):
    with open(file, "r") as f:
        j = json.load(f)
        print(len(j))
        for sent in j:
            # start validations:
            if sentence_is_not_too_short(sent) and capture_is_not_non_alphabetical(sent):
                clean_sent = clean_punct(" ".join(sent["words"]))
                if clean_sent not in dev_and_test:
                    if 'positive' in file:
                        positives.append(sent)
                    else:
                        negatives.append(sent)
        print(len(positives))


1000
998
1000
1998
1000
2996
1000
3985
1000
4985
5000
4985


All in all, 15 sentences from the test/dev sets appeared in the training data and were removed.
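
The loop above applies three checks in sequence but does not record which one rejected a sentence. Below is a minimal sketch of tallying rejections by reason. `rejection_reason`, the `min_tokens` threshold, and the sample records are hypothetical and only for illustration.

```python
from collections import Counter

def rejection_reason(record, held_out, min_tokens=5):
    """Return the name of the first failed check, or None if the record passes."""
    words = record["words"]
    if len(words) < min_tokens:  # hypothetical threshold
        return "too_short"
    span = record["captures"]["musician"]
    capture = "".join(words[span["first"]:span["last"] + 1])
    if not any(ch.isalpha() for ch in capture):
        return "non_alphabetical_capture"
    if " ".join(words) in held_out:
        return "in_dev_or_test"
    return None

held_out = {"David Bowie was born in London"}
records = [
    {"words": ["Queen"], "captures": {"musician": {"first": 0, "last": 0}}},
    {"words": ["In", "1975", ",", "42", "toured", "Europe", "."],
     "captures": {"musician": {"first": 3, "last": 3}}},
    {"words": ["David", "Bowie", "was", "born", "in", "London"],
     "captures": {"musician": {"first": 0, "last": 1}}},
    {"words": ["Freddie", "Mercury", "sang", "with", "Queen", "."],
     "captures": {"musician": {"first": 0, "last": 1}}},
]

reasons = [rejection_reason(r, held_out) for r in records]
counts = Counter(r for r in reasons if r is not None)
print(counts)
```

A per-reason count like this makes it possible to confirm how many removals were due to dev/test overlap rather than the length or capture checks.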

In [26]:
positives[0]

{'words': ['Jorge',
  'Luis',
  'Prats',
  'Soca',
  '(',
  'born',
  '3',
  'July',
  '1956',
  ')',
  'is',
  'a',
  'Cuban',
  'pianist',
  'living',
  'in',
  'Spain',
  '.'],
 'captures': {'musician': {'first': 0, 'last': 3}},
 'sentence_index': 4448,
 'highlights': [{'first': 10, 'last': 10}, {'first': 13, 'last': 13}],
 'entities': [{'first': 0,
   'last': 3,
   'label': 'PERSON',
   'priority': 0,
   'source': None}]}
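
With records shaped like the one above, the rule that `'s` is not part of the entity can be sketched as a span adjustment. This is an illustrative helper, not code from the notebook: `trim_possessive` is a hypothetical name, and it assumes the tokenizer emits the possessive marker as its own token.

```python
def trim_possessive(words, first, last):
    """Shrink an inclusive [first, last] span so it does not end in a possessive marker."""
    possessive_tokens = {"'s", "\u2019s", "'"}  # assumption: marker is a separate token
    while last > first and words[last] in possessive_tokens:
        last -= 1
    return first, last

words = ["David", "Bowie", "'s", "album", "was", "released", "in", "1977", "."]
print(trim_possessive(words, 0, 2))  # (0, 1)
```

The returned pair can be written back to `captures['musician']['first']` and `['last']` before tagging.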