# Partition Data

We need to process a large amount of new annotation data in Snorkel.
The problem is that Snorkel extracts a large amount of structured information about every sentence and n-gram it computes.
Working on all data at once would therefore exhaust the available memory quite quickly.
To work around this, we split the new data into smaller chunks and process each chunk individually.
However, to initialize the model in each database, we also need to copy the base data into each new data chunk.

In [None]:
from os import listdir, makedirs
from os.path import exists
import json
import random
import math
from shutil import copyfile

We created a list of the files we actually need to train and test Snorkel, so only the required annotated files are copied.
Now we load the lists of files that we are going to partition in the next steps.

In [None]:
file_list = listdir('data/new_data_and_sosci/')
# new (unannotated) files: everything that does not start with 'sent'
new_files = [x for x in file_list if not x.startswith('sent')]
# fixed seed so the shuffle is repeatable for the same directory listing
random.seed(42)
random.shuffle(new_files)
# annotated base files start with 'sent'
train_files = [x for x in file_list if x.startswith('sent')]
train_annotation = [x for x in listdir('data/sosci') if x.endswith('.ann')]
with open('sosci_train_dev_test_split.json', 'r') as sosci_data_json:
    train_dev_test_split = json.load(sosci_data_json)
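
The layout of `sosci_train_dev_test_split.json` is not shown in this notebook. The later cells only rely on it mapping the keys `train`, `devel`, and `test` to collections of file names without the `.txt` extension. The following is a minimal sketch under that assumption; the file names and the `check_split` helper are hypothetical and only illustrate a sanity check that no file is assigned to more than one set.

```python
import json

# Hypothetical example content; the real file names are not shown in this notebook.
example_split = {
    "train": ["sent_a", "sent_b"],
    "devel": ["sent_c"],
    "test": ["sent_d"],
}


def check_split(split):
    """Return the file names that appear in more than one of the three sets."""
    seen = {}
    duplicates = set()
    for key in ("train", "devel", "test"):
        for name in split[key]:
            if name in seen and seen[name] != key:
                duplicates.add(name)
            seen[name] = key
    return duplicates


# Round-trip through JSON to mimic loading the file from disk.
loaded = json.loads(json.dumps(example_split))
overlap = check_split(loaded)
```

An empty result means the three sets are disjoint, which is what the copy logic further down assumes.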

Next we perform the actual split of the silver standard into a fixed number of buckets and copy the files directly.

In [None]:
output_name = 'sosci_ssc_' 
BUCKETS_TO_BUILD = 64
BUCKETS_TO_PROCESS = 1

In [None]:
# ceil ensures every file lands in a bucket; the last buckets may be smaller
num_per_sample = math.ceil(len(new_files)/BUCKETS_TO_BUILD)
for x in range(BUCKETS_TO_PROCESS):
    # exist_ok avoids a FileExistsError when this cell is re-run
    makedirs('data/{}{}'.format(output_name, x), exist_ok=True)
    files_in_sample = new_files[x*num_per_sample:(x+1)*num_per_sample]
    for f in files_in_sample:
        copyfile('data/new_data_and_sosci/{}'.format(f), 'data/{}{}/{}'.format(output_name,x,f))
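
To see how the slicing distributes files, the index arithmetic can be looked at in isolation. The sketch below is not part of the pipeline; `bucket_slices` is a hypothetical helper and the count of 1000 files is an assumed example value. It shows that with a ceil-based bucket size every file is covered, while the last buckets can be smaller or even empty.

```python
import math


def bucket_slices(n_files, n_buckets):
    """Return one (start, stop) index pair per bucket, using ceil-sized buckets."""
    size = math.ceil(n_files / n_buckets)
    return [
        (min(i * size, n_files), min((i + 1) * size, n_files))
        for i in range(n_buckets)
    ]


# Assumed example: 1000 files split into 64 buckets.
slices = bucket_slices(1000, 64)
sizes = [stop - start for start, stop in slices]
```

Python list slicing clamps out-of-range indices in the same way, so the slice expression in the cell above never raises an error for a trailing bucket; it just yields fewer files or none.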

Now we also need to copy the annotated files into all created buckets.

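Here we use a flag, `train_or_test_set`, to distinguish between training and testing.
If we only want to train the model, we do not need to import the evaluation data, because it would only cost us memory.
When the flag is set to `'test'`, the files of the `devel` split are copied in addition to the training files; files of the `test` split are never copied.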

In [None]:
# 'test': also copy the development files; any other value: training files only
train_or_test_set = 'test'

#makedirs('data/{}{}'.format(output_name, '0'))
for x in range(BUCKETS_TO_PROCESS):
    for f in train_files:
        file_name = f.split('.txt')[0]
        if file_name in train_dev_test_split['train']:
            copyfile('data/sosci/{}'.format(f), 'data/{}{}/{}'.format(output_name,x,f))
        elif file_name in train_dev_test_split['devel']:
            if train_or_test_set == 'test':
                copyfile('data/sosci/{}'.format(f), 'data/{}{}/{}'.format(output_name,x,f))
        elif file_name in train_dev_test_split['test']:
            continue
        else:
            print("Error: File {} was not in File split. This should not happen.".format(file_name))           
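
The same branching is repeated for the annotation folder further down, so it can help to state the selection rule once. The following is a sketch of that rule as a standalone function; `should_copy` and the example split are hypothetical names used for illustration and are not part of the notebook.

```python
def should_copy(file_name, split, mode):
    """Decide whether an annotated file is copied.

    Mirrors the branching above: training files are always copied,
    development files only when mode == 'test', test files never.
    Raises KeyError for a name that is in none of the three sets.
    """
    if file_name in split['train']:
        return True
    if file_name in split['devel']:
        return mode == 'test'
    if file_name in split['test']:
        return False
    raise KeyError(file_name)


# Hypothetical split used only to exercise the function.
example_split = {'train': ['sent_a'], 'devel': ['sent_b'], 'test': ['sent_c']}
decisions = {
    name: should_copy(name, example_split, 'test')
    for name in ('sent_a', 'sent_b', 'sent_c')
}
```

Raising an exception instead of printing is a design choice of this sketch; the cell above only prints a message and continues.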

We can import our annotations into Snorkel directly from BRAT format.
For that we need to pass the annotation files along with the base text files.
We create this data structure in a separate folder, because it would otherwise cause trouble when importing the plain text data.

In [None]:
annotation_dir = 'data/{}annotation'.format(output_name)
if not exists(annotation_dir):
    makedirs(annotation_dir)
for f in train_files:
    file_name = f.split('.txt')[0]
    ann_name = file_name + '.ann'
    if file_name in train_dev_test_split['train']:
        copyfile('data/sosci/{}'.format(f), '{}/{}'.format(annotation_dir, f))
        copyfile('data/sosci/{}'.format(ann_name), '{}/{}'.format(annotation_dir, ann_name))
    elif file_name in train_dev_test_split['devel']:
        if train_or_test_set == 'test':
            copyfile('data/sosci/{}'.format(f), '{}/{}'.format(annotation_dir, f))
            copyfile('data/sosci/{}'.format(ann_name), '{}/{}'.format(annotation_dir, ann_name))
    elif file_name in train_dev_test_split['test']:
        continue
    else:
        print("Error: File {} was not in File split. This should not happen.".format(file_name))
copyfile('data/sosci/annotation.conf', '{}/annotation.conf'.format(annotation_dir))
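
BRAT standoff annotation pairs each `.txt` file with an `.ann` file of the same base name, so a quick check of the annotation folder is to look for base names that have only one of the two. The sketch below is an optional sanity check, not part of the original pipeline; `unpaired_files` and the example file names are hypothetical.

```python
from os.path import splitext


def unpaired_files(file_names):
    """Return base names that have a .txt without an .ann, or vice versa."""
    txt = {splitext(f)[0] for f in file_names if f.endswith('.txt')}
    ann = {splitext(f)[0] for f in file_names if f.endswith('.ann')}
    # symmetric difference: names present in exactly one of the two sets
    return sorted(txt ^ ann)


# Hypothetical folder listing; annotation.conf is ignored by the check.
example = ['sent_a.txt', 'sent_a.ann', 'sent_b.txt', 'annotation.conf']
missing = unpaired_files(example)
```

In the notebook this could be applied to the real folder with `unpaired_files(listdir('data/sosci_ssc_annotation'))`, assuming `output_name` is left at `'sosci_ssc_'`.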