# Preprocessing

In this notebook we prepare the data so that it can be fed to our models.
This involves the following steps:

1. Analyzing the data
2. Preprocessing the data
3. Writing a DataLoader to ensure it can be used with our models

## Dataset

We are going to use a classic dataset: the [Tab-delimited Bilingual Sentence Pairs](http://www.manythings.org/anki/) from the [Tatoeba project](https://tatoeba.org/en).

Since I am a native German speaker (and can therefore check the results most quickly), we are going to use the Eng-Deu dataset.

In [106]:
import pandas as pd
import numpy as np
import re

In [2]:
data = pd.read_csv('data/deu-eng/deu.txt', sep='\t', names=['en', 'de', 'license'])
data.iloc[np.r_[100:102, 500:502, 10000:10002]] # display some parts of the data

| | en | de | license |
|---|---|---|---|
| 100 | We try. | Wir versuchen es. | CC-BY 2.0 (France) Attribution: tatoeba.org #2... |
| 101 | We won. | Wir haben gewonnen. | CC-BY 2.0 (France) Attribution: tatoeba.org #2... |
| 500 | It's his. | Es ist seins. | CC-BY 2.0 (France) Attribution: tatoeba.org #2... |
| 501 | It's hot. | Es ist heiß. | CC-BY 2.0 (France) Attribution: tatoeba.org #4... |
| 10000 | They were busy. | Sie waren beschäftigt. | CC-BY 2.0 (France) Attribution: tatoeba.org #3... |
| 10001 | They were dead. | Sie waren tot. | CC-BY 2.0 (France) Attribution: tatoeba.org #2... |
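
For the "Analyzing the data" step, a quick look at sentence lengths is a reasonable starting point. The following is only a minimal sketch: it uses the six pairs from the preview above instead of the full file, and the helper name `length_stats` is made up for illustration.

```python
from statistics import mean

# Six pairs copied from the preview above; the full file is not loaded here.
pairs = [
    ("We try.", "Wir versuchen es."),
    ("We won.", "Wir haben gewonnen."),
    ("It's his.", "Es ist seins."),
    ("It's hot.", "Es ist heiß."),
    ("They were busy.", "Sie waren beschäftigt."),
    ("They were dead.", "Sie waren tot."),
]


def length_stats(sentences):
    """Return (min, mean, max) of the whitespace-separated word counts."""
    counts = [len(s.split()) for s in sentences]
    return min(counts), mean(counts), max(counts)


en_stats = length_stats(en for en, _ in pairs)
de_stats = length_stats(de for _, de in pairs)
print("en:", en_stats)
print("de:", de_stats)
```

On the full dataset the same function could be applied to the `en` and `de` columns to pick a sensible length cut-off.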


## Tokenizing

Instead of implementing our own methods for splitting words, removing non-ASCII characters, etc., we let NLTK do the work.

In [127]:
import nltk
# nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer

In [5]:
test =  "Hello! How are you doing?✅"
print(word_tokenize(test))

['Hello', '!', 'How', 'are', 'you', 'doing', '?', '✅']
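
Note that `word_tokenize` keeps the `✅` as a token of its own. If such symbols should be dropped during tokenization, a plain regular expression is one option. This is a sketch using only the standard library, not what the notebook uses later; `simple_tokenize` and its pattern are assumptions made for illustration.

```python
import re

# Runs of ASCII letters, or one of the three punctuation marks we care about.
TOKEN_RE = re.compile(r"[A-Za-z]+|[.!?]")


def simple_tokenize(text):
    """Split text into word and punctuation tokens, dropping everything else."""
    return TOKEN_RE.findall(text)


tokens = simple_tokenize("Hello! How are you doing?✅")
print(tokens)
```

A pattern this simple also splits contractions such as "It's" into `It` and `s`, which may or may not be acceptable.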


In [6]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import DictVectorizer

In [7]:
cv = CountVectorizer(strip_accents='ascii')

In [8]:
en_sentences = data.loc[:, ['en']]

In [9]:
cv.fit(en_sentences.values.ravel())

CountVectorizer(strip_accents='ascii')

In [10]:
cv.transform(['Hello! How are you doing?']).shape

(1, 16334)

In [11]:
cv.vocabulary_.get('hello')

6895

In [13]:
import torch

In [14]:
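sentence = "hello how are you doing"
# note: torch.Tensor(...) builds a float32 tensor, hence the trailing dots in the output;
# torch.tensor(...) would keep the integer dtype one usually wants for vocabulary indices
torch.Tensor([cv.vocabulary_[s] for s in sentence.split(" ")]).view(-1,1)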

tensor([[ 6895.],
        [ 7174.],
        [  992.],
        [16285.],
        [ 4537.]])

In [18]:
cv.vocabulary_['hoi']

0

In [101]:
cv.transform(['hoi'])

<1x16334 sparse matrix of type '<class 'numpy.int64'>'
	with 0 stored elements in Compressed Sparse Row format>
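
The last cell shows that `transform` silently returns an empty row for a word the vectorizer has never seen, while indexing a plain dict with a missing key raises a `KeyError`. One common way to deal with this is to reserve an index for unknown words. The sketch below uses a hand-built dictionary; `UNK`, `build_vocab` and `encode` are hypothetical names and not part of scikit-learn.

```python
UNK = "<unk>"  # placeholder token for out-of-vocabulary words (an assumption)


def build_vocab(sentences):
    """Map every lowercased word to an integer index; index 0 is reserved for UNK."""
    vocab = {UNK: 0}
    for sentence in sentences:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(sentence, vocab):
    """Turn a sentence into a list of indices, falling back to UNK."""
    return [vocab.get(word, vocab[UNK]) for word in sentence.lower().split()]


vocab = build_vocab(["hello how are you doing"])
ids = encode("hoi how are you", vocab)
print(ids)
```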

In [29]:
from torch.utils.data import Dataset

In [199]:
class TatoebaDataset(Dataset):
    
    def __init__(self, df:pd.DataFrame, max_length=10, langs=('en', 'de'), debug=False):
        self.debug = debug
        self.df = df.copy(deep=True)
        
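        # keep only rows whose source sentence is at most max_length characters long;
        # .copy() avoids pandas' SettingWithCopyWarning in the assignments below
        self.df = self.df.loc[self.df[langs[0]].str.len() <= max_length, :].copy()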
        
        # clean all strings
        self.df.loc[:, langs[0]] = self.df.loc[:, langs[0]].apply(self.clean_string)
        self.df.loc[:, langs[1]] = self.df[langs[1]].apply(self.clean_string)
        
        # create CountVectorizers for both languages
        # tokens are whole words (including single letters) or one of the
        # punctuation marks kept by clean_string; an unescaped "." would match any character
        token_pattern = r"(?u)\b\w+\b|[.!?]"
        # self.tokenizer = RegexpTokenizer(r"(?u)\b\w\w+\b|!|\?|.|\S+") # will be used later
        self.cv0 = CountVectorizer(strip_accents='ascii', token_pattern=token_pattern).fit(self.df[langs[0]].values.ravel())
        self.cv1 = CountVectorizer(strip_accents='ascii', token_pattern=token_pattern).fit(self.df[langs[1]].values.ravel())
    
    def __getitem__(self, index):
        # get sentences in both languages from line <index>
        s0, s1 = self.df.iloc[index, 0:2]
        
        # vectorize sentences
        if self.debug: print(f" >> Eng: {s0}, Deu {s1}")
        if self.debug: print(f" >>Tokenized: {s0.split(' ')}")
        v0 = [self.cv0.vocabulary_[s.lower()] for s in s0.split(" ") if s != ""]
        v1 = [self.cv1.vocabulary_[s.lower()] for s in s1.split(" ") if s != ""]
        
        # transform to tensors
        t0 = torch.tensor(v0).view(-1,1)
        t1 = torch.tensor(v1).view(-1,1)
        
        return t0, t1
    
    def __len__(self):
        return self.df.shape[0]
    
    @staticmethod
    def clean_string(string):
        """Cleans a given string so it can be vectorized"""
        string = re.sub(r"([.!?])", r" \1", string)
        string = re.sub(r"[^a-zA-Z.!?]+", r" ", string) # what to keep in the strings
        
        return string
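
One thing to keep in mind about `clean_string`: the second regular expression replaces every character outside `a-zA-Z.!?` with a space, so German letters such as `ß` or `ä` are lost ("heiß" becomes "hei"). A possible variant, shown here only as a sketch and not as the notebook's method, transliterates those letters first. The name `clean_string_translit` and the `ß` to `ss` mapping are assumptions.

```python
import re
import unicodedata


def clean_string_translit(string):
    """Like clean_string, but transliterates accented letters instead of dropping them."""
    string = string.replace("ß", "ss")  # NFKD does not decompose ß, so map it by hand
    # split accented letters into base letter + combining mark, then drop the marks
    string = unicodedata.normalize("NFKD", string)
    string = string.encode("ascii", "ignore").decode("ascii")
    string = re.sub(r"([.!?])", r" \1", string)       # separate punctuation
    string = re.sub(r"[^a-zA-Z.!?]+", r" ", string)   # what to keep in the strings
    return string.strip()


print(clean_string_translit("Es ist heiß."))
print(clean_string_translit("Sie waren beschäftigt."))
```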

In [200]:
ttds = TatoebaDataset(data)

In [202]:
idx = 250
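# print the cleaned pair as stored in the dataset (the original `data` is left untouched by the deep copy)
print(f"Index: {idx}: Eng: {ttds.df.iloc[idx, 0]} | Deu: {ttds.df.iloc[idx, 1]}")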

Index: 250: Eng: Keep it . | Deu: Behalt es !


In [204]:
ttds[0][0].shape

torch.Size([2, 1])
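
Each item has its own length (here 2 tokens), so sentences cannot be stacked into a batch directly. For the remaining DataLoader step, the usual approach is to pad every sequence in a batch to the length of the longest one, for example inside a `collate_fn` passed to `torch.utils.data.DataLoader`. The sketch below shows only the padding logic on plain Python lists; `pad_batch` is a hypothetical helper, and it assumes that index 0 is reserved for padding, which is not the case for the `CountVectorizer` vocabularies above.

```python
PAD = 0  # assumed padding index; the vocabularies above do not reserve it


def pad_batch(sequences, pad_value=PAD):
    """Right-pad all index sequences to the length of the longest one.

    Returns the padded sequences and the original lengths, which a model
    can use to ignore the padded positions.
    """
    max_len = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]
    lengths = [len(seq) for seq in sequences]
    return padded, lengths


batch, lengths = pad_batch([[5, 9], [7, 3, 8, 2], [4]])
print(batch)
print(lengths)
```

In a real `collate_fn` the padded lists would be converted to tensors for both languages before being returned.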