# Preprocess Notebook

This notebook preprocesses the data. The input is a .tmx (Translation Memory eXchange) file. The current steps are:
1. Extract lines
2. Store the surplus (text left over after the last complete line of a chunk)
3. Pair lines
4. Clear punctuation
5. Sample the dataset

## Imports
Below are all the libraries used in this notebook.

In [1]:
import unicodedata
import re
import pickle
import copy

import nltk

## Cleaning the data
The clean function covers all cleaning steps. It reads the .tmx file in chunks, extracts the lines the model needs, strips the surrounding tags, and pairs the lines in English-Dutch order. It then writes the pairs to a pickle for later use. A pickle-reading function with an optional limiter is also included.
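
As a minimal sketch of the extraction step, the snippet below applies the same regular expression and slice offsets to a made-up fragment. The fragment and the names `sample`, `tagged`, and `leftover` are illustrative assumptions; the layout of the real file is assumed to be `<tuv xml:lang="xx"><seg>...</seg>`, which is what the offsets 15, 17, 24, and -6 rely on.

```python
import re

# Hypothetical TMX fragment: one complete translation unit and one cut-off line
sample = (
    '<tu><tuv xml:lang="en"><seg>Life is a game of chance.</seg></tuv>'
    '<tuv xml:lang="nl"><seg>Het leven is een kansspel.</seg></tuv></tu>'
    '<tu><tuv xml:lang="en"><seg>Trunc'
)

# Same pattern as in getLines: a <tuv ...> line up to and including </seg>
matches = list(re.finditer(r"<tuv.+?(?=</seg>)</seg>", sample))

# Characters 15-16 hold the language code, 24 up to the closing tag hold the text
tagged = [(m.group()[15:17], m.group()[24:-6]) for m in matches]

# Whatever follows the last complete line is carried over to the next chunk
leftover = sample[matches[-1].end():]

print(tagged)
print(leftover)
```

The cut-off third line produces no match, so it ends up in `leftover` and can be completed once the next chunk has been read.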

In [2]:
def clean(input_location: str, output_location: str, chunk_size: int = 1048576):
    """ Reads the input file in chunks, extracts and pairs the lines, and
    pickles the pairs of each chunk to data/<output_location>. chunk_size is
    the number of characters per read, which roughly equals the number of
    bytes. The default is about one megabyte. """
    
    waiting_single = [[], []]              # Lines and tags still waiting to be paired
    extra = " "                            # Text left after the last complete line
    
    # The with block closes both files even if an exception is raised
    with open(input_location, "r") as input_file, \
            open(f"data/{output_location}", "wb") as output_file:
        chunk = input_file.read(chunk_size)
        while chunk:
            lines, extra = getLines(extra + chunk)
            if lines:
                waiting_single, pairs = pair(waiting_single, lines)
                pickle.dump(pairs, output_file)
            chunk = input_file.read(chunk_size)
        # pair() releases a pair only when the next line arrives, so a
        # complete final pair is still waiting here and is written last
        if "".join(waiting_single[1]) == "ennl":
            pickle.dump([copy.copy(waiting_single[0])], output_file)
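
The chunked read with a carried-over remainder can be sketched on its own. This is a simplified illustration, not the notebook's implementation: the generator `iter_segments`, the pattern `SEGMENT`, and the in-memory stream are all hypothetical, and the chunk size is tiny on purpose so that every segment is split across reads.

```python
import io
import re

# Simplified pattern for this sketch: only the text between <seg> tags
SEGMENT = re.compile(r"<seg>(.+?)</seg>")

def iter_segments(stream, chunk_size):
    """Yield segment texts from a stream that is read in fixed-size chunks."""
    carry = ""  # incomplete text left over from the previous chunk
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer = carry + chunk
        consumed = 0
        for match in SEGMENT.finditer(buffer):
            yield match.group(1)
            consumed = match.end()
        # Keep everything after the last complete segment for the next round
        carry = buffer[consumed:]

stream = io.StringIO("<seg>one</seg><seg>two</seg><seg>three</seg>")
segments = list(iter_segments(stream, chunk_size=7))
print(segments)  # → ['one', 'two', 'three']
```

Because the remainder is prepended to the next chunk, a segment that straddles a chunk boundary is still found once its closing tag arrives.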
        
       

In [3]:
def pair(temp: list, lines: list) -> tuple:
    """ Takes temp, a [lines, tags] list holding the lines still waiting to be
    paired, and lines, a list of (tag, line) tuples. Returns the updated temp
    and the list of completed [en, nl] pairs. """
    
    results = list()
    for tag, line in lines:
        if len(temp[0]) == 2 and f"{temp[1][0]}{temp[1][1]}" != "ennl":
            # If the pair is wrongly ordered
            print(temp)
            print(results)
            raise Exception("Line mismatch occurred")
        elif len(temp[0]) == 2:
            # If the pair is correct
            results.append(copy.copy(temp[0]))
            temp = [[],[]]
        # The part which adds a line to be paired
        temp[0].append(line)
        temp[1].append(tag)
    return (temp, results)
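
The pairing idea can also be written so that a pair is released as soon as its second line arrives. The sketch below is a hypothetical variant (`pair_tagged` is not part of the notebook) that shows how an unmatched line is carried from one chunk to the next.

```python
def pair_tagged(pending, tagged_lines):
    """Hypothetical helper: group (tag, text) tuples into [en, nl] pairs.

    pending holds at most one unmatched (tag, text) tuple from earlier calls.
    """
    pairs = []
    for tag, text in tagged_lines:
        pending = pending + [(tag, text)]
        if len(pending) == 2:
            tags = pending[0][0] + pending[1][0]
            if tags != "ennl":
                raise ValueError(f"Line mismatch: {pending}")
            pairs.append([pending[0][1], pending[1][1]])
            pending = []
    return pending, pairs

# First chunk ends on an English line without its Dutch partner
pending, pairs = pair_tagged([], [("en", "hello"), ("nl", "hallo"), ("en", "bye")])
# The next chunk supplies the missing Dutch line
pending_after, more_pairs = pair_tagged(pending, [("nl", "doei")])
print(pairs, more_pairs)
```

Releasing eagerly means nothing complete is left waiting at the end of the file, at the cost of differing from the pair function above, which releases a pair only when the following line is seen.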
  

In [4]:
def getLines(temp_str: str) -> tuple:
    """ Takes a string and extracts the <tuv> lines that hold a segment.
    Returns the (tag, text) tuples and the string left after the last line. """
    
    matches = list(re.finditer(r"<tuv.+?(?=</seg>)</seg>", temp_str))
    # If there are no lines to find, pass the string on as a whole
    if not matches:
        return ([], temp_str)
    # Everything after the last matched line is kept for the next chunk.
    # This is an empty string when the chunk ends exactly on a line.
    remnants = temp_str[matches[-1].end():]
    # Keeps the language tag and the text, assuming each line starts
    # with <tuv xml:lang="xx"><seg>
    lines = [(m.group()[15:17], m.group()[24:-6]) for m in matches]
    return (lines, remnants)


## Pickle Functions

In [5]:
def readPickle(pickle_off: str, limiter: int = -1) -> list:
    """ Reads the pickle either chunk by chunk, up to limiter chunks, or all
    chunks at once when limiter is negative. """
    objects = []
    with open(pickle_off, "rb") as file_stream:
        while limiter != 0:
            try:
                objects.append(pickle.load(file_stream))
                limiter -= 1
            except EOFError:  # When there is nothing else to read
                break
    return objects
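
These functions rely on the fact that several `pickle.dump` calls on one file produce a sequence of objects that repeated `pickle.load` calls return one by one until `EOFError` is raised. A minimal sketch with an in-memory buffer, where `iter_pickled` and the sample chunks are illustrative names and data:

```python
import io
import pickle

# Write three separate objects to one stream, as clean does per chunk
buffer = io.BytesIO()
for chunk in (["a", "b"], ["c"], ["d", "e", "f"]):
    pickle.dump(chunk, buffer)

def iter_pickled(stream, limit=-1):
    """Yield pickled objects from stream; a negative limit reads them all."""
    while limit != 0:
        try:
            yield pickle.load(stream)
        except EOFError:  # nothing left to read
            return
        limit -= 1

buffer.seek(0)
first_two = list(iter_pickled(buffer, limit=2))
buffer.seek(0)
everything = list(iter_pickled(buffer))
print(first_two)   # → [['a', 'b'], ['c']]
print(everything)  # → [['a', 'b'], ['c'], ['d', 'e', 'f']]
```

A generator keeps only one chunk in memory at a time, which matters when the pickle is too large to load as a whole.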

In [10]:
def processPickle(pickle_off: str, process=lambda x:x, limiter = -1, store=False):
    """ Reads a pickle chunk by chunk and applies the process function to each chunk.
    If store is True, the results are collected in a list and returned; otherwise nothing is returned.
    The process function takes exactly one argument: the object read from the pickle. """
    
    objects = []
    with open(pickle_off, "rb") as file_stream:
        while limiter != 0:
            try:
                chunk = pickle.load(file_stream)
            except EOFError: # When there is nothing else to read
                break
            result = process(chunk)
            if store:
                objects.append(result)
            print("Chunk processed.")
            limiter -= 1
    if store:
        return objects

In [11]:
def getLen(lines: list)->int:
    """ Takes a list of lines, prints and returns the length. """
    l = len(lines)
    print(f"The length of this chunk is {l} lines")
    return l

In [13]:
# Sum the chunk lengths to get the total number of line pairs in a pickle
full_l = sum(processPickle("data/sample1.pkl",process = getLen, store = True))
print(f"The total length of pickle is {full_l} lines")

The length of this chunk is 14635 lines
Chunk processed.
The length of this chunk is 14605 lines
Chunk processed.
The length of this chunk is 13056 lines
Chunk processed.
The length of this chunk is 13187 lines
Chunk processed.
The length of this chunk is 13669 lines
Chunk processed.
The length of this chunk is 14008 lines
Chunk processed.
The length of this chunk is 13855 lines
Chunk processed.
The length of this chunk is 13441 lines
Chunk processed.
The length of this chunk is 13209 lines
Chunk processed.
The length of this chunk is 13928 lines
Chunk processed.
The length of this chunk is 8941 lines
Chunk processed.
The total length of pickle is 146534 lines


## Sampling

In [9]:
# The pattern to tokenize each sentence
pattern = r'''(?x)           # set flag to allow verbose regexps
     (?:[a-z]\.)+            # abbreviations, e.g. u.s.a. (the input is lowercased first)
   | [a-z0-9]+(?:[-'][a-z0-9]+)*[,:;']?  # words with optional internal hyphens or apostrophes and an optional trailing punctuation mark
'''

def cleanPunc(line: str) -> list:
    """ Takes a string, lowercases it, strips the accents, and tokenizes it, getting rid of unwanted characters. """
    return nltk.regexp_tokenize(unicodeToAscii(line.lower()), pattern)

# Turn a Unicode string to plain ASCII, thanks to
# https://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )
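
To see what the cleaning does to a single line, the sketch below repeats the steps with `re.findall` as a stand-in for `nltk.regexp_tokenize`. The assumption is that both return the same tokens for a pattern without capturing groups; the names `TOKEN`, `strip_accents`, and `tokenize` and the sample sentence are made up for illustration.

```python
import re
import unicodedata

# Same alternatives as the notebook's pattern, compiled in verbose mode
TOKEN = re.compile(r"""
     (?:[a-z]\.)+                          # abbreviations such as u.s.a.
   | [a-z0-9]+(?:[-'][a-z0-9]+)*[,:;']?    # words, optionally with one trailing mark
""", re.VERBOSE)

def strip_accents(text):
    """Decompose characters and drop the combining marks (category Mn)."""
    return "".join(
        c for c in unicodedata.normalize("NFD", text)
        if unicodedata.category(c) != "Mn"
    )

def tokenize(text):
    """Lowercase, strip accents, and keep only the tokens the pattern matches."""
    return TOKEN.findall(strip_accents(text.lower()))

# \u00e9 is an accented e, written as an escape to keep the source ASCII-only
tokens = tokenize("Caf\u00e9-owners in the U.S.A. say: it's 107.200 km/u!")
print(tokens)
# → ['cafe-owners', 'in', 'the', 'u.s.a.', 'say:', "it's", '107', '200', 'km', 'u']
```

Note that a trailing comma, colon, semicolon, or apostrophe stays attached to its word, while characters such as `.` between digits, `/`, and `!` are dropped.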

def samplingDataset(line_pairs: list, sampleCheck) -> list:
    """ Takes a list of [en, nl] line pairs and a check function that receives both tokenized lines.
    Keeps the pairs for which the check returns True, with the tokens joined by single spaces. """
    sample = list()
    for en, nl in line_pairs:
        temp_en = cleanPunc(en)
        temp_nl = cleanPunc(nl)
        if sampleCheck(temp_en, temp_nl):
            sample.append([" ".join(temp_en), " ".join(temp_nl)])
    return sample

def samplingMethod1(line: list, other: list = None) -> bool:
    """ Takes in a list of tokens and checks that it has the given words at the given
    indexes. The optional second list is ignored; it is accepted so that the function
    can be passed to samplingDataset, which calls the check with both token lists. """
    
    # This list is formed of lists of an index and the possible words that are supposed to be
    # in that index
    check_list = [[0, ["he", "she", "it"]], [1, ["is"]]]
    
    for index, words in check_list:
        if len(line) <= check_list[-1][0] or line[index] not in words:
            return False
    return True

def samplingMethod2(line: list, other: list = None) -> bool:
    """ Takes in a list of tokens and checks that it has the given words at the given
    indexes and is at most 10 tokens long. The optional second list is ignored; it is
    accepted so that the function can be passed to samplingDataset. """
    
    # This list is formed of lists of an index and the possible words that are supposed to be
    # in that index
    check_list = [[0, ["he", "she", "we", "i", "you"]], [1, ["is", "are", "am"]]]
    
    for index, words in check_list:
        length = len(line)
        if length > 10 or length <= check_list[-1][0] or line[index] not in words:
            return False
    return True

def samplingMethod3(line1: list, line2: list) -> bool:
    """ Takes in two token lists and checks that neither has more than 20 tokens. """
    
    return len(line1) <= 20 and len(line2) <= 20
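
The three sampling methods share one shape: required words at fixed positions plus a length limit. A possible generalization, offered only as a sketch, is a factory that builds such a check from its parameters. The name `make_check` and its arguments are hypothetical; the returned function takes both token lists, matching how samplingDataset calls its check.

```python
def make_check(allowed_by_index, max_len):
    """Build a check from {index: allowed words} and a maximum token count."""
    def check(en_tokens, nl_tokens):
        # Reject pairs in which either side is too long
        if len(en_tokens) > max_len or len(nl_tokens) > max_len:
            return False
        # Reject English lines too short to have all the checked positions
        if len(en_tokens) <= max(allowed_by_index, default=-1):
            return False
        return all(en_tokens[i] in words for i, words in allowed_by_index.items())
    return check

check = make_check({0: {"he", "she", "it"}, 1: {"is"}}, max_len=10)
print(check("he is here".split(), "hij is hier".split()))   # → True
print(check("they are here".split(), "ze zijn hier".split()))  # → False
```

With an empty dictionary the check reduces to a pure length filter, which is what samplingMethod3 does with a limit of 20.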
    

def pickleSample(output_location: str, sampleCheck=lambda en, nl: True):
    """ Takes in an output location and a function to apply sampling with. Uses the process pickle
    function to chunk the pickle and apply the sampling in each chunk. Writes the sample of each
    chunk to a new pickle. """
    # The with block makes sure the output is flushed and closed when sampling is done
    with open(f"data/{output_location}", "wb") as output_file:
        processPickle("data/en_nl_line_pairs.pkl",
                      process=lambda x:pickle.dump(samplingDataset(x, sampleCheck), output_file))
    

In [10]:
# Example usage
pickleSample("sample_not_a_sample.pkl", sampleCheck=samplingMethod3)

Chunk processed.
Chunk processed.
Chunk processed.
Chunk processed.
Chunk processed.
Chunk processed.
Chunk processed.
Chunk processed.
Chunk processed.
Chunk processed.
Chunk processed.


In [11]:
readPickle("data/sample_not_a_sample.pkl", limiter = 1)

[[['a killer asteroid streaking to our earth at sixty seven miles an hour',
   'een dodelijke asteroide snelt met 107 200 km u op de aarde af'],
  ['giant chimney like tower scorching hot, rising from the bottom of the ocean floor',
   'gloeiend hete, schoorsteenachtige torens die verrijzen vanaf de oceaanbodem'],
  ['a young man at the sea voyage shook the foundation of the world',
   "de zeereis van n jongeman die de wereld op z'n kop zette"],
  ['an expedition back through time into the heart of human ancestry',
   'we gaan terug in de tijd, op zoek naar de oorsprong van onze voorouders'],
  ['how species evolved why some species survive and others become extinct',
   'hoe soorten evolueerden waarom sommige soorten overleefden en andere uitstierven'],
  ['these are the greatest discoveries in the origin and evolution of life',
   'dit zijn de grootste ontdekkingen over de oorsprong en evolutie van t leven'],
  ['life is a game of chance', 'het leven is een kansspel'],
  ['the odds o

In [2]:
def makeCorpus(pickle_off: str, pickle_to: str, corpus_num: int, limiter = -1, exists = False):
    """ Collects the set of unique words in column corpus_num (0 for English, 1 for Dutch)
    of the pairs in pickle_off and pickles it to pickle_to. If exists is True, the set
    already stored in pickle_to is extended instead of starting from an empty one. """
    if exists:
        with open(pickle_to, "rb") as existing_stream:
            corpus = pickle.load(existing_stream)
    else:
        corpus = set()
    with open(pickle_off, "rb") as input_stream:
        while limiter != 0:
            try:
                data = pickle.load(input_stream)
            except EOFError:
                break
            for row in data:
                corpus.update(row[corpus_num].split())
            print("Chunk processed.")
            limiter -= 1
    with open(pickle_to, "wb") as output_stream:
        pickle.dump(corpus, output_stream)
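
What makeCorpus collects is the set of unique words (a vocabulary) in one column of the pairs. A small in-memory sketch of the same idea, with a hypothetical helper `build_vocabulary` and two made-up pairs:

```python
# Illustrative pairs in the [en, nl] layout produced by the sampling step
pairs = [
    ["life is a game of chance", "het leven is een kansspel"],
    ["life is short", "het leven is kort"],
]

def build_vocabulary(line_pairs, column):
    """Return the unique whitespace-separated words in one column of the pairs."""
    return {word for row in line_pairs for word in row[column].split()}

en_vocab = build_vocabulary(pairs, 0)
nl_vocab = build_vocabulary(pairs, 1)
print(len(en_vocab), len(nl_vocab))  # → 7 6
```

Because the result is a set, words that occur in several lines, such as "life" and "is" here, are counted once.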

In [8]:
makeCorpus("data/sample_shorty.pkl", "data/en_shorty.pkl", 0, limiter = 8)
print("Second Corpus Started")
makeCorpus("data/sample_shorty.pkl", "data/nl_shorty.pkl", 1, limiter = 8)

Chunk processed.
Chunk processed.
Chunk processed.
Chunk processed.
Chunk processed.
Chunk processed.
Chunk processed.
Chunk processed.
Second Corpus Started
Chunk processed.
Chunk processed.
Chunk processed.
Chunk processed.
Chunk processed.
Chunk processed.
Chunk processed.
Chunk processed.


In [9]:
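with open("data/en_shorty.pkl", "rb") as en_file, open("data/nl_shorty.pkl", "rb") as nl_file:
    a = pickle.load(en_file)
    b = pickle.load(nl_file)
print(len(a))  # Number of unique English words
print(len(b))  # Number of unique Dutch words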

1926
1840
