# Data Preparation

## 2. Split Text

The clean data contains a little over 150,000 phrase pairs and some of the pairs toward the end of the file are very long.

The complexity of the model increases with the number of examples, length of phrases, and size of the vocabulary.

We will simplify the problem by reducing the dataset to the first 10,000 examples in the file; these will be the shortest phrases in the dataset.

Further, we will shuffle those examples, then take the first 9,000 as examples for training and the remaining 1,000 to test the fitted model.

In [1]:
from pickle import load
from pickle import dump
from numpy.random import rand
from numpy.random import shuffle

In [2]:
# load a clean dataset
def load_clean_sentences(fname):
    # use a context manager so the file handle is closed after loading
    with open(fname, 'rb') as f:
        return load(f)

In [3]:
# save a list of clean sentences to file
def save_clean_data(sentences, fname):
    # use a context manager so the file is flushed and closed after writing
    with open(fname, 'wb') as f:
        dump(sentences, f)
    print('Saved: %s' % fname)

In [4]:
# load dataset
raw_dataset = load_clean_sentences('english-german.pkl')

In [7]:
# reduce dataset size
n_sentences = 10000
dataset = raw_dataset[:n_sentences, :]

In [8]:
# random shuffle
shuffle(dataset)

In [9]:
# split into train/test
train, test = dataset[:9000], dataset[9000:]
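
The shuffle and split above rely on the global NumPy random state and hard-coded indices, so the split differs from run to run. As a minimal sketch, assuming a small stand-in array rather than the real dataset, the same step could be made repeatable with a seeded generator. The helper name `split_pairs` and its `train_fraction` and `seed` parameters are hypothetical and not part of the original code.

```python
import numpy as np

def split_pairs(pairs, train_fraction=0.9, seed=1):
    """Shuffle the rows of an (n, 2) array and split them into train/test.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    rng = np.random.default_rng(seed)   # seeded generator for a repeatable shuffle
    shuffled = pairs.copy()             # leave the caller's array untouched
    rng.shuffle(shuffled)               # shuffles rows (axis 0) in place
    n_train = int(round(len(shuffled) * train_fraction))
    return shuffled[:n_train], shuffled[n_train:]

# toy stand-in for the real phrase pairs (assumption: same [english, german] layout)
toy = np.array([
    ['im helping you', 'ich helfe dir'],
    ['it wasnt ours', 'es war nicht unseres'],
    ['he sketched an apple', 'er zeichnete einen apfel'],
    ['that was loud', 'das war laut'],
    ['whats wrong', 'was ist das problem'],
    ['i truly loved her', 'ich liebte sie wirklich'],
    ['tom is unmerciful', 'tom ist umbarmherzig'],
    ['ill stop by later', 'ich komme spater vorbei'],
    ['do you use aftershave', 'benutzen sie aftershave'],
    ['im over eighteen', 'ich bin uber'],
])

train, test = split_pairs(toy, train_fraction=0.9, seed=1)
print(train.shape, test.shape)
```

With the real data, `split_pairs(dataset)` would give the same 9,000/1,000 proportions as the hard-coded indices, provided `dataset` holds 10,000 rows.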

In [10]:
# save
save_clean_data(dataset, 'english-german-both.pkl')
save_clean_data(train, 'english-german-train.pkl')
save_clean_data(test, 'english-german-test.pkl')

Saved: english-german-both.pkl
Saved: english-german-train.pkl
Saved: english-german-test.pkl


In [11]:
train[0:10]

array([['im helping you', 'ich helfe dir'],
       ['it wasnt ours', 'es war nicht unseres'],
       ['tom walked into the bar', 'tom ging in die kneipe'],
       ['he has an eye for art', 'er hat einen blick fur kunst'],
       ['do you use aftershave', 'benutzen sie aftershave'],
       ['i said drop your weapon',
        'ich sagte lassen sie die waffe fallen'],
       ['ill stop by later', 'ich komme spater vorbei'],
       ['do you happen to know tom', 'kennen sie zufalligerweise tom'],
       ['he sketched an apple', 'er zeichnete einen apfel'],
       ['tom has ocd', 'tom leidet an einer zwangserkrankung']],
      dtype='<U370')

In [12]:
test[0:10]

array([['i truly loved her', 'ich liebte sie wirklich'],
       ['are you a good dancer', 'sind sie ein guter tanzer'],
       ['do you know that person', 'kennst du diese person'],
       ['im over eighteen', 'ich bin uber'],
       ['whats wrong', 'was ist das problem'],
       ['i decided to try again',
        'ich habe beschlossen es noch einmal zu versuchen'],
       ['tom is unmerciful', 'tom ist umbarmherzig'],
       ['are you writing a letter', 'schreibst du einen brief'],
       ['that was loud', 'das war laut'],
       ['when can i visit you', 'wann kann ich euch besuchen']],
      dtype='<U370')

Running the example creates three new files: english-german-both.pkl, which contains all of the train and test examples and which we can use to define the parameters of the problem, such as maximum phrase lengths and the vocabulary, and english-german-train.pkl and english-german-test.pkl, which hold the train and test datasets.
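
To illustrate what "parameters of the problem" could mean in practice, here is a rough sketch that computes the maximum phrase length and vocabulary size for each language. It assumes simple whitespace tokenization and uses a three-row stand-in array instead of loading english-german-both.pkl. The helper names `max_length` and `vocabulary` are hypothetical.

```python
import numpy as np

def max_length(phrases):
    """Length, in whitespace-separated words, of the longest phrase."""
    return max(len(p.split()) for p in phrases)

def vocabulary(phrases):
    """Set of distinct whitespace-separated words across all phrases."""
    return {word for p in phrases for word in p.split()}

# stand-in for the combined dataset (assumption: column 0 is English, column 1 is German)
pairs = np.array([
    ['im helping you', 'ich helfe dir'],
    ['it wasnt ours', 'es war nicht unseres'],
    ['that was loud', 'das war laut'],
])
eng, ger = pairs[:, 0], pairs[:, 1]

print('English: max length %d, vocabulary size %d' % (max_length(eng), len(vocabulary(eng))))
print('German: max length %d, vocabulary size %d' % (max_length(ger), len(vocabulary(ger))))
```

Computing these values on the combined file rather than on the training file alone means that phrases in the test set cannot exceed the maximum length or contain words missing from the vocabulary.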