# Preprocessing

In this notebook we prepare the data so that it can be used with our models.
This includes the following steps:

1. Analyzing the data
2. Preprocessing the data
3. Writing a DataLoader to ensure it can be used with our models

## Dataset

We are going to use a classic dataset: the [Tab-delimited Bilingual Sentence Pairs](http://www.manythings.org/anki/) from the [Tatoeba project](https://tatoeba.org/en).

Since I am a native German speaker and can therefore check the results most quickly in German, we are going to use the Eng-Deu dataset.

In [4]:
import pandas as pd
import numpy as np

In [30]:
data = pd.read_csv('data/deu-eng/deu.txt', sep='\t', names=['en', 'de', 'license'])
data.iloc[np.r_[100:102, 500:502, 10000:10002]] # display some parts of the data

| | en | de | license |
|---|---|---|---|
| 100 | We try. | Wir versuchen es. | CC-BY 2.0 (France) Attribution: tatoeba.org #2... |
| 101 | We won. | Wir haben gewonnen. | CC-BY 2.0 (France) Attribution: tatoeba.org #2... |
| 500 | It's his. | Es ist seins. | CC-BY 2.0 (France) Attribution: tatoeba.org #2... |
| 501 | It's hot. | Es ist heiß. | CC-BY 2.0 (France) Attribution: tatoeba.org #4... |
| 10000 | They were busy. | Sie waren beschäftigt. | CC-BY 2.0 (France) Attribution: tatoeba.org #3... |
| 10001 | They were dead. | Sie waren tot. | CC-BY 2.0 (France) Attribution: tatoeba.org #2... |
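
As a first pass at step 1 (analyzing the data), the sketch below loads a tab-delimited sample and looks at sentence lengths and missing values. The three sample lines are hypothetical stand-ins for `deu.txt`, and `quoting=csv.QUOTE_NONE` is an assumption rather than something the notebook uses: it tells pandas to treat quote characters as ordinary text, which may matter because a field that starts with a quote character can otherwise be parsed as a quoted field.

```python
import csv
import io

import pandas as pd

# Hypothetical three-line sample in the same tab-delimited layout as deu.txt.
sample = (
    "We try.\tWir versuchen es.\tCC-BY 2.0\n"
    'He said "no".\tEr sagte „nein“.\tCC-BY 2.0\n'
    "It's hot.\tEs ist heiß.\tCC-BY 2.0\n"
)

df = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    names=["en", "de", "license"],
    quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary text
)

# Whitespace-separated word counts per sentence, as a rough length measure.
df["en_len"] = df["en"].str.split().str.len()
df["de_len"] = df["de"].str.split().str.len()

print(df[["en_len", "de_len"]].describe())
print("missing values per column:", df.isna().sum().to_dict())
```

On the full dataset, the same length columns could be used to pick a maximum sequence length or to filter out unusually long pairs.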


## Tokenizing

Instead of implementing various methods for splitting words, removing non-ASCII characters, etc., we just let nltk do all the work.

In [14]:
import nltk
# nltk.download('punkt')
from nltk.tokenize import word_tokenize

In [19]:
test =  "Hello! How are you doing?✅"
print(word_tokenize(test))
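# str() on bytes keeps the b'...' wrapper, hence the stray tokens in the second output
print(word_tokenize(str(test.encode('ascii', 'ignore'))))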

['Hello', '!', 'How', 'are', 'you', 'doing', '?', '✅']
["b'Hello", '!', 'How', 'are', 'you', 'doing', '?', "'"]
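
The `b'` and `'` tokens in the second output are leftovers from turning a bytes object into a string. A minimal sketch of an alternative, using only the standard library: normalize to NFKD so accented letters split into a base letter plus a combining mark, then drop whatever is not ASCII and decode back to `str`. The helper name `to_ascii` is hypothetical and not part of the notebook.

```python
import unicodedata


def to_ascii(text):
    """Decompose accented characters, then drop everything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("Hello! How are you doing?✅"))
print(to_ascii("Sie waren beschäftigt."))
print(to_ascii("Es ist heiß."))
```

Note that this folding is lossy for German: `ä` becomes `a`, and `ß` has no decomposition and is dropped entirely, so it is probably better suited to the English side of the data.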


In [39]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import DictVectorizer

In [22]:
cv = CountVectorizer(strip_accents='ascii')

In [23]:
en_sentences = data.loc[:, ['en']]

In [34]:
cv.fit(en_sentences.values.ravel())

CountVectorizer(strip_accents='ascii')

In [38]:
cv.transform(['Hello! How are you doing?']).shape

(1, 16334)
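
The shape reads as one row per input sentence and one column per vocabulary term, so the second number is the size of the fitted English vocabulary. The sketch below shows the same behavior on a hypothetical four-sentence corpus (`toy_cv` and `corpus` are illustrative names). It also shows that the default token pattern of `CountVectorizer` only keeps tokens of two or more characters, so words such as "I" and "a" do not get a column.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the English sentences.
corpus = ["We try.", "We won.", "It is hot.", "I am a student."]

toy_cv = CountVectorizer(strip_accents="ascii")
toy_cv.fit(corpus)

# vocabulary_ maps each lowercased term to its column index (alphabetical order).
print(toy_cv.vocabulary_)

# One row per input sentence, one column per vocabulary term.
counts = toy_cv.transform(["We won, we won!"])
print(counts.shape)
print(counts.toarray())
```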

In [43]:
cv.vocabulary_.get('hello')

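In [49]:
import torch

In [53]:
sentence = "hello how are you doing"
# Look up each word's vocabulary index and stack the indices into a column vector.
# torch.Tensor yields floats; an embedding layer would need torch.tensor(..., dtype=torch.long).
torch.Tensor([cv.vocabulary_[s] for s in sentence.split(" ")]).view(-1,1)

tensor([[ 6895.],
        [ 7174.],
        [  992.],
        [16285.],
        [ 4537.]])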
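
For step 3, one possible shape of the DataLoader is sketched below. It is not the notebook's implementation. The toy `vocab`, the reserved `PAD` and `UNK` ids, and the names `encode`, `SentenceDataset` and `collate` are all assumptions made for illustration. With `cv.vocabulary_`, index 0 already belongs to a real word, so the indices would have to be shifted before reserving ids like this.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset

# Hypothetical toy vocabulary; in the notebook this role is played by cv.vocabulary_.
PAD, UNK = 0, 1
vocab = {"hello": 2, "how": 3, "are": 4, "you": 5, "doing": 6}


def encode(sentence, vocab):
    """Map a lowercased, whitespace-split sentence to a 1-D tensor of token ids."""
    ids = [vocab.get(word, UNK) for word in sentence.lower().split()]
    return torch.tensor(ids, dtype=torch.long)


class SentenceDataset(Dataset):
    """Wraps a list of sentences and encodes them on access."""

    def __init__(self, sentences, vocab):
        self.sentences = sentences
        self.vocab = vocab

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, index):
        return encode(self.sentences[index], self.vocab)


def collate(batch):
    """Pad variable-length id tensors to a common length within the batch."""
    return pad_sequence(batch, batch_first=True, padding_value=PAD)


dataset = SentenceDataset(
    ["hello how are you doing", "how are you", "hello there"], vocab
)
loader = DataLoader(dataset, batch_size=3, shuffle=False, collate_fn=collate)

batch = next(iter(loader))
print(batch.shape)  # (batch size, longest sentence in the batch)
print(batch)
```

Unlike the direct `cv.vocabulary_[s]` lookup, `vocab.get(word, UNK)` does not raise a `KeyError` for unseen words such as "there", and the integer dtype is what an embedding layer expects.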