# Notebook Overview

This notebook translates the multi-labeled scenario text into training, testing, and validation datasets for machine learning. The division normally corresponds to an 80/10/10 split, respectively.

Although the BERT-based model can be trained with the HuggingFace API on word-token sequences, its evaluation assumes complete scenario texts. We therefore split the dataset by randomly selecting whole scenarios rather than individual sentences, even though sentence-level selection could provide more sample diversity in training. The CRF-based model requires sentence-token sequences, so each scenario's token sequence in the dataset is organized by sentence. The CRF-based model also requires part-of-speech (POS) tags, in addition to the tokenized words, as input to feature development.
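
As a rough illustration of what the CRF input might look like, the sketch below turns one sentence of the `[word, POS, tag]` triples that this notebook produces into per-token feature dictionaries. The `token_features` helper, the feature names, and the toy sentence are assumptions made for illustration; the feature set actually used by the CRF-based model is not shown in this notebook.

```python
def token_features(sent, i):
    """Build a feature dict for token i of a sentence of [word, pos, tag] triples."""
    word, pos, _ = sent[i]
    features = {
        'word.lower': word.lower(),
        'word.istitle': word.istitle(),
        'pos': pos,
    }
    # neighboring POS tags give the model some local context
    if i > 0:
        features['prev.pos'] = sent[i - 1][1]
    else:
        features['BOS'] = True  # beginning of sentence
    if i < len(sent) - 1:
        features['next.pos'] = sent[i + 1][1]
    else:
        features['EOS'] = True  # end of sentence
    return features

# toy sentence (hypothetical), in the [word, pos, tag] layout
sentence = [['I', 'PRP', 'O'], ['search', 'VBP', 'O'], ['recipes', 'NNS', 'B-SIM']]
X = [token_features(sentence, i) for i in range(len(sentence))]
y = [tag for _, _, tag in sentence]
```

Because the features look at neighboring tokens, sentence boundaries matter here, which is one reason the tokens are stored per sentence.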

In [1]:
import json

# load the unprocessed, labeled scenarios
with open('scenarios-labeled.json') as f:
    scenarios = json.load(f)
scenarios[0]

{'text': "From this screen, I like to search for anything from recipes, to home decor, to people, etc., just depending on my mood. To get to this screen all I had to do was tap on the little magnifying glass next to the image of the home icon, and this is where you would search for whatever you like, kind of like how it works with google, but more customized to your preferences and trends. To find whatever I'm looking for, I click on the search bar to type a word or phrase. I try to keep it brief. For example, if I'm looking for a pasta recipe for dinner, I'll type pasta dinner recipe, then select the phrase that matches what I'm looking for. I then rummage through the available pins and decide which one I like the most before saving the pin to the board that most closely matches the kind of query I made. ",
 'scenario_id': 'MAS-G-0001',
 'app_url': 'https://apps.apple.com/us/app/pinterest/id429047995',
 'labels': [[53, 60, 'SIM'],
  [65, 75, 'SIM'],
  [80, 86, 'SIM'],
  [359, 370, 'SI

In [2]:
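# remove overlapping labels, retaining the longest label

def find_longest_overlap(label1, label2):
    # returns 0 if the spans do not overlap, -1 if label1 is longer,
    # and 1 otherwise (label2 is longer, or both have the same length)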
    span1 = set(range(label1[0], label1[1]))
    span2 = set(range(label2[0], label2[1]))

    if len(span1.intersection(span2)) == 0:
        return 0
    elif len(span1) > len(span2):
        return -1
    else:
        return 1

# find the index of the shorter label in each overlapping pair
deleted_labels = []
for scenario in scenarios:
    labels = scenario['labels']
    overlaps = set()
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            overlap = find_longest_overlap(labels[i], labels[j])
            if overlap == 0:
                continue
            elif overlap < 0:
                overlaps.add(j)
            else:
                overlaps.add(i)
    
    # delete short overlapping labels from the list
    for i in sorted(overlaps, reverse=True):
        deleted_labels.append(labels[i][2])
        del labels[i]
        
print('Deleted %i overlapping labels' % len(deleted_labels))

Deleted 377 overlapping labels
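
To see how the pairwise rule behaves, the sketch below applies the same comparison to two toy label lists. The `drop_shorter_overlaps` helper and the toy spans are hypothetical; the helper only wraps the loop above so it can be run on a single list.

```python
def find_longest_overlap(label1, label2):
    # same rule as the cell above: 0 = no overlap, -1 = label1 longer, 1 = otherwise
    span1 = set(range(label1[0], label1[1]))
    span2 = set(range(label2[0], label2[1]))
    if not span1 & span2:
        return 0
    return -1 if len(span1) > len(span2) else 1

def drop_shorter_overlaps(labels):
    # hypothetical helper: returns a new list without the shorter overlapping labels
    overlaps = set()
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            overlap = find_longest_overlap(labels[i], labels[j])
            if overlap < 0:
                overlaps.add(j)
            elif overlap > 0:
                overlaps.add(i)
    return [label for k, label in enumerate(labels) if k not in overlaps]

# a short span nested inside a longer one: the nested span is dropped
nested = drop_shorter_overlaps([[0, 10, 'SIM'], [5, 8, 'QUE'], [20, 25, 'COM']])
# two overlapping spans of equal length: the earlier one is dropped
tied = drop_shorter_overlaps([[0, 5, 'SIM'], [3, 8, 'QUE']])

print(nested)  # [[0, 10, 'SIM'], [20, 25, 'COM']]
print(tied)    # [[3, 8, 'QUE']]
```

Note that with equal-length spans the rule keeps the later label in the list, since the comparison only tests whether the first span is strictly longer.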


In [3]:
import spacy, nltk

# load the spaCy NLP pipeline for POS tagging and sentence segmentation
nlp = spacy.load("en_core_web_sm")

def get_label(token, labels):
    start = token.idx
    end = token.idx + len(token.text)
    for label in labels:
        if label[0] <= start and end <= label[1]:
            return label[2]
    return 'O'
    
tokenized = []
for scenario in scenarios:
    labels = scenario['labels']
    
    # parse scenario text and tokenize using the BIO-format
    doc = nlp(scenario['text'])
    tokenized.append(scenario.copy())
    tokenized[-1]['tokens'] = []
    last_label = 'O'
    sent = []
    for token in doc:
        if token.is_sent_start:
            sent = []
            tokenized[-1]['tokens'].append(sent)
                      
        new_label = get_label(token, labels)
        if new_label != 'O':
            if new_label != last_label:
                last_label = new_label
                new_label = 'B-' + new_label
            else:
                new_label = 'I-' + new_label
        else:
            last_label = new_label

        sent.append([token.text, token.tag_, new_label])

In [4]:
print('Converted %i scenarios and %i sentences.' % (len(tokenized), sum([len(d['tokens']) for d in tokenized])))

Converted 300 scenarios and 2664 sentences.
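
The BIO tagging in cell [3] can be tried on a toy sentence without loading spaCy. The sketch below is an assumption-laden stand-in: it replaces the spaCy tokenizer with a simple regular expression, and `bio_tags`, the toy text, and the toy spans are hypothetical. The tagging rule itself (a token takes a label only if it lies entirely inside a labeled span, and `B-` is emitted only when the label type changes) follows the cell above.

```python
import re

def get_label(start, end, labels):
    # a token is labeled only if it falls entirely inside a labeled span
    for label_start, label_end, name in labels:
        if label_start <= start and end <= label_end:
            return name
    return 'O'

def bio_tags(text, labels):
    tagged = []
    last_label = 'O'
    # regex tokenizer used here in place of spaCy (an assumption for this sketch)
    for match in re.finditer(r'\w+|[^\w\s]', text):
        label = get_label(match.start(), match.end(), labels)
        if label == 'O':
            tag = 'O'
        elif label != last_label:
            tag = 'B-' + label
        else:
            tag = 'I-' + label
        last_label = label
        tagged.append((match.group(), tag))
    return tagged

text = 'I search for pasta dinner recipe ideas'
one_span = bio_tags(text, [[13, 32, 'SIM']])
two_spans = bio_tags(text, [[13, 18, 'SIM'], [19, 32, 'SIM']])
print(one_span)
```

In this sketch `two_spans` comes out identical to `one_span`: because the rule compares only the label type of the previous token, two spans of the same type with no unlabeled token between them are tagged as one continuous span.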


In [8]:
import random
from collections import Counter

def report_frequencies(dataset):
    counter = {dataset_name:Counter() for dataset_name in dataset.keys()}

    # tally tag frequencies per dataset
    total_sent = 0
    total_token = 0
    total_counter = Counter()
    for dataset_name, data in dataset.items():
        for scenario in data:
            for tokens in scenario['tokens']:
                total_sent += 1
                for word, pos, tag in tokens:
                    counter[dataset_name][tag] += 1
                    total_counter[tag] += 1
                    total_token += 1

    # the scenario share is measured against the total number of scenarios
    total_scenario = sum(len(data) for data in dataset.values())
    for dataset_name, data in dataset.items():
        token_count = sum(counter[dataset_name].values())
        print('\nDataset %s: %i scenarios (%0.3f), %i tokens (%0.3f):' % (
            dataset_name.upper(),
            len(data),
            len(data) / total_scenario,
            token_count,
            token_count / total_token
        ))
        labels = sorted(counter[dataset_name].keys())
        for label in labels:
            print('%s: %i (%0.3f)' % (
                label.rjust(10),
                counter[dataset_name][label],
                counter[dataset_name][label] / total_counter[label]
            ))

dataset = {}
# toggle: True draws a new random split, False reloads a saved split
if True:
    total_scenarios = len(tokenized)
    index1 = int(total_scenarios * 0.80)
    index2 = int(index1 + (total_scenarios * 0.10))

    # shuffle the scenarios so that each split is a random sample
    shuffled = tokenized.copy()
    random.shuffle(shuffled)

    # divide randomized scenarios into train, test, validation splits
    dataset = {
        'train': shuffled[0:index1],
        'test': shuffled[index1:index2],
        'validation': shuffled[index2:],
    }
else:
    # reuse a previously saved split, keyed by scenario_id
    with open('deployed_split_ids.json') as f:
        split_ids = json.load(f)
    dataset = {'train': [], 'test': [], 'validation': []}
    for scenario in tokenized:
        if scenario['scenario_id'] in split_ids['train_ids']:
            dataset['train'].append(scenario)
        elif scenario['scenario_id'] in split_ids['test_ids']:
            dataset['test'].append(scenario)
        elif scenario['scenario_id'] in split_ids['validation_ids']:
            dataset['validation'].append(scenario)

report_frequencies(dataset)


Dataset TRAIN: 240 scenarios (0.800), 45202 tokens (0.807):
     B-COM: 235 (0.773)
     B-QUE: 265 (0.782)
     B-SIM: 2816 (0.798)
     I-COM: 1332 (0.782)
     I-QUE: 1684 (0.808)
     I-SIM: 771 (0.787)
         O: 38099 (0.809)

Dataset TEST: 30 scenarios (0.100), 5462 tokens (0.097):
     B-COM: 34 (0.112)
     B-QUE: 36 (0.106)
     B-SIM: 336 (0.095)
     I-COM: 195 (0.114)
     I-QUE: 180 (0.086)
     I-SIM: 92 (0.094)
         O: 4589 (0.097)

Dataset VALIDATION: 30 scenarios (0.100), 5367 tokens (0.096):
     B-COM: 35 (0.115)
     B-QUE: 38 (0.112)
     B-SIM: 377 (0.107)
     I-COM: 177 (0.104)
     I-QUE: 220 (0.106)
     I-SIM: 117 (0.119)
         O: 4403 (0.093)
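
Since `random.shuffle` is called without a seed, each run of the cell above draws a different split. The sketch below shows one way the split could be made repeatable and its scenario ids recorded. The `split_scenarios` helper, the seed value, and the toy scenarios are hypothetical; the `train_ids`, `test_ids`, and `validation_ids` keys mirror the ones read from `deployed_split_ids.json` above, but how that file was actually produced is not shown in this notebook.

```python
import json
import random

def split_scenarios(scenarios, seed=42, train=0.80, test=0.10):
    # a private, seeded generator keeps the shuffle repeatable
    rng = random.Random(seed)
    shuffled = list(scenarios)
    rng.shuffle(shuffled)
    index1 = int(len(shuffled) * train)
    index2 = int(index1 + len(shuffled) * test)
    return {
        'train': shuffled[:index1],
        'test': shuffled[index1:index2],
        'validation': shuffled[index2:],
    }

# toy scenarios (hypothetical ids) standing in for the tokenized list
toy = [{'scenario_id': 'TOY-%04i' % i} for i in range(20)]
toy_dataset = split_scenarios(toy)

# record which scenario landed in which split
split_ids = {
    name + '_ids': [s['scenario_id'] for s in data]
    for name, data in toy_dataset.items()
}
serialized = json.dumps(split_ids)
```

Writing `serialized` to a file would allow the same split to be reloaded later by scenario id.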


In [9]:
# save the splits; the with-block closes the file once the data is written
with open('../models/scenarios-training-2.json', 'w') as f:
    json.dump(dataset, f)
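
The saved structure keeps each scenario's tokens grouped by sentence, so it can be read either per sentence or per scenario. The sketch below shows both views on a toy split; the `crf_sentences` and `scenario_sequence` helpers and the toy data are hypothetical and are not the loaders used by the downstream models.

```python
def crf_sentences(split):
    """Flatten a split into a list of sentences, each a list of [word, pos, tag]."""
    return [sent for scenario in split for sent in scenario['tokens']]

def scenario_sequence(scenario):
    """Join a scenario's sentences into one word list and one tag list."""
    words, tags = [], []
    for sent in scenario['tokens']:
        for word, pos, tag in sent:
            words.append(word)
            tags.append(tag)
    return words, tags

# toy split (hypothetical) with one scenario made of two sentences
toy_split = [{
    'scenario_id': 'TOY-0001',
    'tokens': [
        [['I', 'PRP', 'O'], ['search', 'VBP', 'O']],
        [['Pasta', 'NN', 'B-SIM'], ['recipes', 'NNS', 'I-SIM']],
    ],
}]

sentences = crf_sentences(toy_split)
words, tags = scenario_sequence(toy_split[0])
```

The sentence view matches the sentence-token sequences described for the CRF-based model, while the scenario view keeps each scenario whole, as assumed by the BERT-based evaluation.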