# Preprocessing

In this notebook we are going to perform the work needed for using the data with our models.
This includes the following steps:

1. Analyzing the data
2. Preprocessing the data
3. Writing a DataLoader to ensure it can be used with our models

## 1. The Dataset

We are going to use a classic dataset: The [Tab-delimited Bilingual Sentence Pairs](http://www.manythings.org/anki/) from the [Tatoeba project](https://tatoeba.org/en).

Since I am a native German speaker (and therefore can check the results the fastest in), we are going to use the Eng-Deu dataset.

In [14]:
import pandas as pd
import numpy as np
import re
import unicodedata

In [15]:
data = pd.read_csv('data/deu-eng/deu.txt', sep='\t', names=['en', 'de', 'license'])
data.iloc[np.r_[100:102, 500:502, 10000:10002]] # display some parts of the data

Unnamed: 0,en,de,license
100,We try.,Wir versuchen es.,CC-BY 2.0 (France) Attribution: tatoeba.org #2...
101,We won.,Wir haben gewonnen.,CC-BY 2.0 (France) Attribution: tatoeba.org #2...
500,It's his.,Es ist seins.,CC-BY 2.0 (France) Attribution: tatoeba.org #2...
501,It's hot.,Es ist heiß.,CC-BY 2.0 (France) Attribution: tatoeba.org #4...
10000,They were busy.,Sie waren beschäftigt.,CC-BY 2.0 (France) Attribution: tatoeba.org #3...
10001,They were dead.,Sie waren tot.,CC-BY 2.0 (France) Attribution: tatoeba.org #2...


## 2. Preprocessing

In this section we want to preprocess our data. More specifically, we are going to tokenize our vocabulary (= all words from our corpus) so that we have a unique token for each possible identifier.

~Instead of implementing various methods for splitting words, removing non-ascii chars, etc. we are just using nltk's `word_tokenize` and sklearn's `CountVectorizer` to perform the tasks~

We are implementing our own `Tokenizer`-class! This gives us more flexibility on which strings to include and how to retrieve data lateron.

In [16]:
# import nltk
# nltk.download('punkt') # -> uncomment this cell if you haven't downloaded the stopwords yet!
# from nltk.tokenize import word_tokenize

# from sklearn.feature_extraction.text import CountVectorizer
# from sklearn.feature_extraction import DictVectorizer

import torch
from torch.utils.data import Dataset

In [17]:
class Tokenizer:
    
    def __init__(self):
        self.word_vector = {
            'sos': 0, 
            'eos': 1
        }
        self.n_words = 2
        
    def add_string(self, string:str):
        """adds this string to the word_vector"""
        
        tokens = self.tokenize_string(string)
        for token in tokens:
            self.word_vector[token] = self.n_words
            self.n_words += 1
            
    def add_data(self, data:pd.Series):
        strings = data.values.ravel()
        for s in strings:
            self.add_string(s)
            
    def tokenize_string(self, string):
        clean_s = self.clean_string(string)
        
        # just split 
        return clean_s.split(" ")
    
    
    @staticmethod
    def to_ascii(string):
        return ''.join(
            character for character in unicodedata.normalize('NFD', string) if unicodedata.category(character) != 'Mn'
        )
    
    @staticmethod
    def clean_string(string):
        s = Tokenizer.to_ascii(string.lower().strip())
        s = re.sub(r"([.!?])", r" \1", s) # adds space before certain punctuation marks
        s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
        return s
        
    def __getitem__(self, index):
        # clean the index too
        c_index = Tokenizer.clean_string(index).strip()
        if c_index in self.word_vector:
            return self.word_vector[c_index]
        else:
            return None

### PyTorch's Dataset-class

Since the main goal of our project is to learn PyTorch more, we are going to use the `Dataset`-class provided by the PyTorch Framework.

Taking the intial dataframe as parameter, this class / object allows us to preprocess each string the same way and access our training samples in a fast and clean way.

Bonus points: It works really well with PyTorch's `DataLoader` which we'll see in a while.

In [21]:
class TatoebaDataset(Dataset):
    
    def __init__(self, df:pd.DataFrame, max_length=None, langs=('en', 'de'), debug=False):
        self.debug = debug
        self.df = df.copy(deep=True)
        self.langs = langs
        
        # filter by max length
        self.df = self.df.loc[df[langs[0]].str.len() <= max_length, :]
        
        # clean all strings -> add 'clean'-versions of language data!
        self.df.loc[:, langs[0] + '_clean'] = self.df.loc[:, langs[0]].apply(Tokenizer.clean_string)
        self.df.loc[:, langs[1] + '_clean'] = self.df[langs[1]].apply(Tokenizer.clean_string)
        
        # create the tokenizers for both languages
        self.t0 = Tokenizer()
        self.t0.add_data(self.df[langs[0] + '_clean'])
        self.t1 = Tokenizer()
        self.t1.add_data(self.df[langs[1] + '_clean'])
    
    def __getitem__(self, index):
        if index > self.__len__():
            return None
        
        # get sentences in both languages from line <index>
        s0, s1 = self.df.loc[index, [self.langs[0] + '_clean', self.langs[1] + '_clean']]
        
        # vectorize sentences
        if self.debug: print(f" >> {self.langs[0]}: {s0}, {self.langs[1]} {s1}")
        if self.debug: print(f" >> Tokenized: {s0.split(' ')}")
        
        v0 = [self.t0[s] for s in s0.split(" ") if s != ""] + self.t0['EOS']
        v1 = [self.t1[s] for s in s1.split(" ") if s != ""] + self.t1['EOS']
        
        # transform to tensors, .view(-1,1) just ensures that we have 1-d vectors
        t0 = torch.tensor(v0).view(-1,1)
        t1 = torch.tensor(v1).view(-1,1)
        
        return t0, t1
    
    def __len__(self):
        return self.df.shape[0]
        pass

    def vocab_size(self):
        return self.t0.n_words, self.t1.n_words
    pass

In [19]:
# SHOWCASE

# init the object with the raw-data we have from our csv
ttds = TatoebaDataset(data, max_length=10) # limit to 10 words

# the sentences at the following index are:
idx = 450
print(f"Index: {idx}:\nEng: {data.iloc[idx, 0]}\nDeu: {data.iloc[idx, 1]}\n" + "-"*80)

# the dataset-object directly returns tensors that we can use for training
eng_tensor, deu_tensor = ttds[idx]

print(f"Eng: {eng_tensor}\nDeu: {deu_tensor}")

Index: 450:
Eng: I use it.
Deu: Ich benutze es.
--------------------------------------------------------------------------------
Eng: tensor([[2913],
        [1835],
        [3889],
        [3952]])
Deu: tensor([[3997],
        [1516],
        [4306],
        [4438]])


This already concludes Part 1 where we implement the first part of our pipeline: Actually getting the data!