# Preprocess Notebook

This notebook preprocesses the data. The input is a .tmx file. The current steps are:
1. Extract lines
2. Store surplus (the unfinished text left at the end of each chunk)
3. Pair lines
4. Clear punctuation
5. Sample the dataset

## Imports
Below are all the libraries used in this notebook.

In [2]:
import re
import pickle
import copy

import nltk

## Cleaning the data
The functions below cover all cleaning steps. They read the .tmx file in chunks, extract the lines the model needs, strip the surrounding tags, and pair the lines in English-Dutch order. The pairs are then written to a pickle for later use. A pickle-reading function with an optional limiter is also included.
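
To make the extraction step concrete, here is a minimal sketch on a hand-written fragment. The fragment and the name `sample_tmx` are assumptions for illustration. The fixed offsets only work if every line starts with exactly `<tuv xml:lang="xx"><seg>`, so a file with extra attributes would need different slicing.

```python
import re

# Hypothetical fragment, assumed to follow the <tuv xml:lang="xx"><seg>...</seg> layout
sample_tmx = (
    '<tu><tuv xml:lang="en"><seg>You are a gem.</seg></tuv>\n'
    '<tuv xml:lang="nl"><seg>Je bent een juweeltje.</seg></tuv></tu>'
)

# Lazily match from each <tuv up to the first </seg> that follows it
raw_lines = re.findall(r"<tuv.+?</seg>", sample_tmx)

# Characters 15-16 hold the language code, the text starts at index 24,
# and the last 6 characters are the closing </seg> tag
tagged = [(line[15:17], line[24:-6]) for line in raw_lines]
print(tagged)
```

Each tuple holds the language tag and the bare text, which is the shape the pairing step expects.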

In [1]:
def clean(input_location: str, output_location: str, chunk_size: int = 1048576):
    """ Reads a .tmx file in chunks, extracts and pairs the lines, and pickles
    each chunk's pairs into data/<output_location>. chunk_size is the number of
    characters per read, which roughly corresponds to the number of bytes.
    The default is about one megabyte. """

    waiting_single = [[], []]              # Lines (and their tags) still waiting to be paired
    extra = ""                             # Unfinished text left over from the previous chunk

    # The with statement closes both files even if an exception is raised
    with open(input_location, "r") as input_file, \
         open(f"data/{output_location}", "wb") as output_file:
        chunk = input_file.read(chunk_size)
        while chunk:
            lines, extra = getLines(extra + chunk)
            if lines:
                waiting_single, pairs = pair(waiting_single, lines)
                pickle.dump(pairs, output_file)
            chunk = input_file.read(chunk_size)

        # pair() only releases a pair once the next line arrives, so the
        # final pair is still waiting here and has to be written explicitly
        if len(waiting_single[0]) == 2 and waiting_single[1] == ["en", "nl"]:
            pickle.dump([waiting_single[0]], output_file)
        
       

In [None]:
def pair(temp: list, lines: list) -> tuple:
    """ Takes the carry-over from the previous call (a list of up to two unpaired
    lines and a list of their tags) and a list of (tag, line) tuples.
    Returns the new carry-over and the list of completed [en, nl] pairs. """
    
    results = list()
    for tag, line in lines:
        if len(temp[0]) == 2 and f"{temp[1][0]}{temp[1][1]}" != "ennl":
            # If the pair is wrongly ordered
            print(temp)
            print(results)
            raise Exception("Line mismatch occurred")
        elif len(temp[0]) == 2:
            # If the pair is correct
            results.append(copy.copy(temp[0]))
            temp = [[], []]
        # The part which adds a line to be paired
        temp[0].append(line)
        temp[1].append(tag)
    return (temp, results)
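
The carry-over idea can be sketched in isolation. The helper below is a hypothetical variant, not the notebook's function: `pair_tagged` and its `carry` argument are names made up for illustration, and unlike the cell above it releases a pair as soon as both halves are present.

```python
def pair_tagged(carry, tagged_lines):
    """Hypothetical variant of the pairing step: returns (new carry, complete pairs)."""
    pairs = []
    carry = list(carry)  # copy so the caller's list is not modified
    for item in tagged_lines:
        carry.append(item)
        if len(carry) == 2:
            (tag_a, text_a), (tag_b, text_b) = carry
            # Every pair must be an English line followed by a Dutch line
            if (tag_a, tag_b) != ("en", "nl"):
                raise ValueError(f"Line mismatch: {carry}")
            pairs.append([text_a, text_b])
            carry = []
    return carry, pairs

# The English line arrives in one batch and its Dutch partner in the next
carry, first = pair_tagged([], [("en", "you are a gem")])
carry, second = pair_tagged(carry, [("nl", "je bent een juweeltje"),
                                    ("en", "we are at the centre")])
print(first, second, carry)
```

Passing the carry-over back in on the next call is what lets a pair survive a chunk boundary.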
  

In [None]:
def getLines(temp_str: str) -> tuple:
    """ Takes a string and extracts the complete <tuv>...</seg> lines from it.
    Returns a list of (language tag, text) tuples and the string left after
    the last complete line. """
    
    matches = list(re.finditer(r"<tuv.+?</seg>", temp_str))
    # If there are no lines to find, pass the string back as a whole
    if not matches:
        return ([], temp_str)
    # Everything after the last complete line is carried over to the next chunk.
    # This is an empty string when the input ends exactly on a complete line
    remnants = temp_str[matches[-1].end():]
    # Keep only the language code and the text between <seg> and </seg>.
    # The fixed offsets assume every line starts with <tuv xml:lang="xx"><seg>
    lines = [(m.group()[15:17], m.group()[24:-6]) for m in matches]
    return (lines, remnants)
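
The reason for returning the leftover string is that a chunk can end in the middle of a segment. A minimal sketch of that situation follows. The helper `split_complete`, the sample text, and the tiny chunk size are all assumptions chosen to force a split.

```python
import re

def split_complete(buffer: str):
    """Hypothetical helper: return (complete segments, unfinished tail)."""
    matches = list(re.finditer(r"<tuv.+?</seg>", buffer))
    if not matches:
        return [], buffer
    return [m.group() for m in matches], buffer[matches[-1].end():]

first_segment = '<tuv xml:lang="en"><seg>We are at the centre.</seg>'
second_segment = '<tuv xml:lang="nl"><seg>En wij staan in het midden.</seg>'
text = first_segment + "</tuv>" + second_segment + "</tuv>"

found, leftover = [], ""
chunk_size = 40  # deliberately tiny so that segments straddle chunk boundaries
for start in range(0, len(text), chunk_size):
    # Prepend the unfinished tail of the previous chunk before searching again
    segments, leftover = split_complete(leftover + text[start:start + chunk_size])
    found.extend(segments)

print(len(found), repr(leftover))
```

Both segments are recovered even though neither fits inside a single 40-character chunk.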


## Pickle Functions

In [12]:
def readPickle(pickle_off: str, limiter = -1):
    """ Reads the pickle either in a limited number of chunks or all
    of them at once (the default, limiter = -1). """
    objects = []
    # The with statement closes the file even if unpickling fails
    with open(pickle_off, "rb") as file_stream:
        while limiter != 0:
            try:
                objects.append(pickle.load(file_stream))
                limiter -= 1
            except EOFError:
                break
    return objects
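
The read loop relies on the fact that several `pickle.dump` calls on one file produce a sequence of independent pickles, and that `pickle.load` raises `EOFError` once the stream is exhausted. A minimal sketch, using an in-memory `io.BytesIO` buffer as a stand-in for the file on disk (the buffer and the sample chunks are assumptions for illustration):

```python
import io
import pickle

buffer = io.BytesIO()            # in-memory stand-in for a file opened with "wb"
for chunk in (["a", "b"], ["c"], []):
    pickle.dump(chunk, buffer)   # each dump appends one independent pickle

buffer.seek(0)                   # rewind, as if the file were reopened with "rb"
loaded = []
while True:
    try:
        loaded.append(pickle.load(buffer))
    except EOFError:             # raised once there is nothing left to read
        break

print(loaded)
```

Each loaded object corresponds to one chunk, which is why the functions here return a list of lists.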

In [4]:
def processPickle(pickle_off: str, process=lambda x: x, limiter = -1, store=False):
    """ Reads a pickle chunk by chunk and applies the process function to each chunk.
    If store is True, the results are collected in a list and returned; otherwise nothing is returned.
    The process function must take exactly one argument: the object read from the pickle. """
    
    objects = []
    with open(pickle_off, "rb") as file_stream:
        while limiter != 0:
            try:
                chunk = pickle.load(file_stream)
            except EOFError: # When there is nothing else to read
                break
            # Called outside the try block so an EOFError raised by process is not swallowed
            result = process(chunk)
            if store:
                objects.append(result)
            limiter -= 1
    if store:
        return objects

In [3]:
def getLen(lines: list)->int:
    """ Takes a list of lines, prints and returns the length. """
    l = len(lines)
    print(f"The length of this chunk is {l} lines")
    return l

In [5]:
# In order to get the full length of a pickle, we do
full_l = sum(processPickle("data/sample0.pkl",process = getLen, store = True))
print(f"The total length of pickle is {full_l} lines")

The length of this chunk is 83064 lines
The length of this chunk is 75154 lines
The length of this chunk is 77409 lines
The length of this chunk is 95147 lines
The length of this chunk is 84284 lines
The length of this chunk is 81818 lines
The length of this chunk is 74158 lines
The length of this chunk is 71400 lines
The length of this chunk is 75132 lines
The length of this chunk is 77556 lines
The length of this chunk is 49404 lines
The total length of pickle is 844526 lines


## Sampling

In [16]:
# The pattern to tokenize each sentence
pattern = r'''(?x)           # set flag to allow verbose regexps
     (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
   | \w+(?:[-']\w+)*[,:;']?  # words with optional internal hyphens or apostrophes and one optional trailing mark
'''

def cleanPunc(line: str) -> list:
    """ Takes a string, lowercases it, and tokenizes it, dropping unwanted characters. """
    return nltk.regexp_tokenize(line.lower(), pattern)

def samplingDataset(line_pairs: list, sampleCheck) -> list:
    """ Takes a list of [en, nl] line pairs and a function that checks the tokenized English line.
    Returns the cleaned pairs whose English line passes the check. """
    sample = list()
    for en, nl in line_pairs:
        temp_en = cleanPunc(en)
        temp_nl = cleanPunc(nl)
        if sampleCheck(temp_en):
            sample.append([" ".join(temp_en), " ".join(temp_nl)])
    return sample

def samplingMethod1(line: list) -> bool:
    """ Takes in a list of tokens. Checks that each index in check_list holds one of the
    words allowed at that index. """
    
    # Each entry is an index and the words that are allowed at that index
    check_list = [[0, ["he", "she", "it"]], [1, ["is"]]]
    
    # The line must be long enough to reach the highest index
    if len(line) <= check_list[-1][0]:
        return False
    for index, words in check_list:
        if line[index] not in words:
            return False
    return True

def samplingMethod2(line: list) -> bool:
    """ Takes in a list of tokens. Checks that the line has at most 10 tokens and that each
    index in check_list holds one of the words allowed at that index. """
    
    # Each entry is an index and the words that are allowed at that index
    check_list = [[0, ["he", "she", "we", "i", "you"]], [1, ["is", "are", "am"]]]
    
    # The line must be short enough and still reach the highest index
    length = len(line)
    if length > 10 or length <= check_list[-1][0]:
        return False
    for index, words in check_list:
        if line[index] not in words:
            return False
    return True

def pickleSample(output_location: str, sampleCheck=lambda x: True):
    """ Takes in an output location and a function to sample with. Uses processPickle
    to read data/en_nl_line_pairs.pkl chunk by chunk, samples each chunk, and writes
    every chunk's sample to a new pickle in data/. """
    # The with statement makes sure the output is flushed and closed
    with open(f"data/{output_location}", "wb") as output_file:
        processPickle("data/en_nl_line_pairs.pkl",
                      process=lambda x: pickle.dump(samplingDataset(x, sampleCheck), output_file))
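
The tokenizing and checking steps can be tried on single sentences without any pickle files. The sketch below makes two assumptions: `re.findall` stands in for `nltk.regexp_tokenize` so the example runs without NLTK, and `starts_with` is a hypothetical generalisation of the sampling methods, not a function from this notebook.

```python
import re

# Same pattern as in the cell above
pattern = r'''(?x)           # set flag to allow verbose regexps
     (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
   | \w+(?:[-']\w+)*[,:;']?  # words with optional internal hyphens
'''

def tokenize(line: str) -> list:
    # Assumption: re.findall stands in for nltk.regexp_tokenize here
    return re.findall(pattern, line.lower())

def starts_with(tokens: list, allowed: list, max_len: int = 10) -> bool:
    """Hypothetical check: token i must be one of the words in allowed[i]."""
    if len(tokens) > max_len or len(tokens) < len(allowed):
        return False
    return all(token in words for token, words in zip(tokens, allowed))

allowed = [{"he", "she", "we", "i", "you"}, {"is", "are", "am"}]
kept = tokenize("You are a gem.")
dropped = tokenize("They are late!")
print(kept, starts_with(kept, allowed))
print(dropped, starts_with(dropped, allowed))
```

Because the line is lowercased before tokenizing, the abbreviation branch of the pattern never matches, and a trailing comma stays attached to its word.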
    

In [17]:
# Example usage
pickleSample("sample1.pkl", sampleCheck=samplingMethod2)

In [18]:
readPickle("data/sample1.pkl")

[[['we are standing on it', 'we staan erop'],
  ['we are expecting to find water', 'dit hadden we niet verwacht'],
  ['we are at the centre', 'en wij staan in het midden'],
  ['you are exactly what i need', 'je bent precies wat ik nodig heb'],
  ['i am sorry, noel', 'het spijt me, noel'],
  ['you are a gem', 'je bent een juweeltje'],
  ['you are very welcome, elizabeth', 'graag gedaan, elizabeth'],
  ['i am trust me i am', 'kan ik aan, geloof me'],
  ['we are louis and jessica huang we live in orlando',
   'wij zijn louis en jessica huang, we wonen in orlando'],
  ['i am the year of the dog', 'ik ben het jaar van de hond'],
  ['we are all our own leaders', 'wij zijn allemaal onze eigen leiders'],
  ['i am not a follower', 'volg mij ik ben geen volger'],
  ['i am real hungry for bark', 'ik heb echt trek in schors'],
  ['you are totally deflecting i am who i am', 'u dwaalt af'],
  ['you are who you are', 'ik ben wie ik ben u bent wie u bent'],
  ['we are so doing this', 'we gaan het doen