# Preprocessing: Clean Up & Tokenize Questions

Break question titles into tokens and normalize them at the token level: spell out digits, expand negated contractions, and correct spelling.

## Imports

The `pygoose` utility package imports `numpy`, `pandas`, `matplotlib`, and a helper `kg` module into the root namespace.

In [1]:
from pygoose import *

In [2]:
import nltk

## Config

Automatically discover the paths to various data folders and compose the project structure.

In [3]:
project = kg.Project.discover()

## Load Data

Original question datasets.

In [4]:
df_train = pd.read_csv(project.data_dir + 'train.csv').fillna('none')
df_test = pd.read_csv(project.data_dir + 'test.csv').fillna('none')

Stopwords customized for the Quora dataset.

In [5]:
stopwords = set(kg.io.load_lines(project.aux_dir + 'stopwords.vocab'))

Pre-composed spelling correction dictionary.

In [6]:
spelling_corrections = kg.io.load_json(project.aux_dir + 'spelling_corrections.json')

## Load Tools

In [7]:
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
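
To make the effect of the `\w+` pattern concrete, here is a minimal sketch that uses the standard-library `re.findall` as a stand-in. It assumes that `RegexpTokenizer(r'\w+').tokenize` returns the non-overlapping matches of the pattern, as `re.findall` does; the sample sentence is made up.

```python
import re

# Stand-in for nltk.tokenize.RegexpTokenizer(r'\w+').tokenize
# (assumption: both return every non-overlapping match of the pattern).
def tokenize(text):
    return re.findall(r'\w+', text)

sample = "What's the best way to learn C++, really?"
tokens = tokenize(sample)
print(tokens)
# ['What', 's', 'the', 'best', 'way', 'to', 'learn', 'C', 'really']
```

Punctuation is dropped entirely and apostrophes split words in two, which is why contractions are expanded before tokenization further down.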

## Preprocess and tokenize questions

In [8]:
def translate(text, translation):
    for token, replacement in translation.items():
        text = text.replace(token, ' ' + replacement + ' ')
    text = text.replace('  ', ' ')
    return text
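
The helper above works on raw substrings rather than whole words, which matters for how it is used later. A small sketch with a made-up translation table (the `units` dictionary is hypothetical and not part of the notebook):

```python
def translate(text, translation):
    # Same logic as the cell above: substring replacement padded with spaces.
    for token, replacement in translation.items():
        text = text.replace(token, ' ' + replacement + ' ')
    text = text.replace('  ', ' ')
    return text

# Hypothetical translation table, for illustration only.
units = {'%': 'percent', '&': 'and'}

result = translate('r&d is 5% of sales', units)
print(repr(result))
# 'r and d is 5 percent of sales'

# Keys match inside words too, so the padding splits the word apart.
print(repr(translate('b2b', {'2': 'two'})))
# 'b two b'
```

The space padding keeps replacements from fusing with neighboring characters, and the final `replace` collapses the double spaces that the padding introduces.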

In [9]:
def spell_digits(text):
    translation = {
        '0': 'zero',
        '1': 'one',
        '2': 'two',
        '3': 'three',
        '4': 'four',
        '5': 'five',
        '6': 'six',
        '7': 'seven',
        '8': 'eight',
        '9': 'nine',
    }
    return translate(text, translation)

In [10]:
def expand_negations(text):
    # Irregular contractions must be handled before the generic "n't" rule.
    translation = {
        "can't": 'can not',
        "won't": 'will not',
        "shan't": 'shall not',
    }
    text = translate(text, translation)
    return text.replace("n't", " not")
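
The order of the two steps matters: irregular contractions have to be replaced before the generic `n't` rule runs. The sketch below contrasts the two orders using a reduced table; `expand_generic_only` and `expand_irregular_first` are hypothetical names introduced only for this comparison.

```python
def translate(text, translation):
    for token, replacement in translation.items():
        text = text.replace(token, ' ' + replacement + ' ')
    return text.replace('  ', ' ')

# Reduced table for illustration: forms that the generic rule would mangle.
IRREGULAR = {"can't": 'can not', "shan't": 'shall not'}

def expand_generic_only(text):
    # Hypothetical naive variant that skips the irregular table.
    return text.replace("n't", ' not')

def expand_irregular_first(text):
    # Same order as the cell above: irregular forms first, generic rule last.
    text = translate(text, IRREGULAR)
    return text.replace("n't", ' not')

sample = "i can't swim and she doesn't care"
print(expand_generic_only(sample))
# i ca not swim and she does not care
print(expand_irregular_first(sample))
# i can not swim and she does not care
```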

In [11]:
def correct_spelling(text):
    return ' '.join(
        spelling_corrections.get(token, token)
        for token in tokenizer.tokenize(text)
    )
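
Spelling correction is a whole-token dictionary lookup, so it only fires on exact matches and drops punctuation as a side effect of re-tokenizing. A minimal sketch, assuming a tiny hypothetical `corrections` dictionary in place of `spelling_corrections.json` and `re.findall` in place of the NLTK tokenizer:

```python
import re

# Hypothetical corrections; the real ones come from spelling_corrections.json.
corrections = {'teh': 'the', 'recieve': 'receive'}

def tokenize(text):
    # Stand-in for tokenizer.tokenize (assumed equivalent for plain strings).
    return re.findall(r'\w+', text)

def correct_spelling(text, corrections):
    return ' '.join(corrections.get(token, token) for token in tokenize(text))

fixed = correct_spelling('did you recieve teh letter, tehran?', corrections)
print(fixed)
# did you receive the letter tehran
```

Note that `tehran` is left alone even though it contains `teh`, unlike the substring-based `translate` helper.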

In [12]:
def get_question_tokens(question):
    question = question.lower()
    question = spell_digits(question)
    question = expand_negations(question)
    question = correct_spelling(question)
    
    # Re-tokenize the normalized text and append a trailing '.' token.
    tokens = tokenizer.tokenize(question.lower())
    tokens.append('.')
    return tokens
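
To show how the steps compose, here is a self-contained sketch of the same pipeline run on one made-up question. It is an approximation under stated assumptions: `re.findall` stands in for the NLTK tokenizer, the negation table is reduced, and `CORRECTIONS` holds a single hypothetical entry.

```python
import re

DIGITS = dict(zip('0123456789',
                  'zero one two three four five six seven eight nine'.split()))
NEGATIONS = {"can't": 'can not', "shan't": 'shall not'}  # reduced table
CORRECTIONS = {'programing': 'programming'}               # hypothetical entry

def translate(text, translation):
    for token, replacement in translation.items():
        text = text.replace(token, ' ' + replacement + ' ')
    return text.replace('  ', ' ')

def tokenize(text):
    # Stand-in for the RegexpTokenizer(r'\w+') used in the notebook.
    return re.findall(r'\w+', text)

def get_question_tokens(question):
    question = question.lower()
    question = translate(question, DIGITS)
    question = translate(question, NEGATIONS).replace("n't", ' not')
    question = ' '.join(CORRECTIONS.get(t, t) for t in tokenize(question))
    return tokenize(question) + ['.']

tokens = get_question_tokens("Why can't I learn programing in 30 days?")
print(tokens)
# ['why', 'can', 'not', 'i', 'learn', 'programming', 'in', 'three', 'zero', 'days', '.']
```

Numbers are spelled digit by digit, so `30` becomes `three zero` rather than `thirty`.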

In [13]:
def get_question_pair_tokens(row):
    return [get_question_tokens(row[0]), get_question_tokens(row[1])]

In [14]:
def remove_stopwords(pair):
    q1_tokens = [token for token in pair[0] if token not in stopwords]
    q2_tokens = [token for token in pair[1] if token not in stopwords]
    
    return [q1_tokens, q2_tokens]
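
A quick illustration of the filter on one tokenized pair, using a made-up stopword set instead of the contents of `stopwords.vocab`:

```python
# Hypothetical stopword set; the real one is loaded from stopwords.vocab.
stopwords = {'what', 'is', 'the', 'of'}

def remove_stopwords(pair):
    # Filter both questions of the pair with the same membership test.
    return [[token for token in tokens if token not in stopwords] for tokens in pair]

pair = [
    ['what', 'is', 'the', 'capital', 'of', 'france', '.'],
    ['capital', 'city', 'of', 'france', '.'],
]
filtered = remove_stopwords(pair)
print(filtered)
# [['capital', 'france', '.'], ['capital', 'city', 'france', '.']]
```

The trailing `'.'` token survives unless the stopword list happens to contain it.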

Tokenize the questions and correct spelling, keeping the stopwords (useful for neural models).

In [15]:
tokens_train = kg.jobs.map_batch_parallel(
    df_train[['question1', 'question2']].values,  # as_matrix() was removed in pandas 1.0
    item_mapper=get_question_pair_tokens,
    batch_size=1000,
)

Batches: 100%|██████████| 405/405 [00:02<00:00, 137.83it/s]


In [16]:
tokens_test = kg.jobs.map_batch_parallel(
    df_test[['question1', 'question2']].values,  # as_matrix() was removed in pandas 1.0
    item_mapper=get_question_pair_tokens,
    batch_size=1000,
)

Batches: 100%|██████████| 2346/2346 [00:29<00:00, 78.32it/s] 
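
The internals of `kg.jobs.map_batch_parallel` are not shown in this notebook. The sketch below is a hypothetical stand-in, assuming only that the function splits the input into batches of `batch_size`, applies `item_mapper` to every item, and returns the results in input order. The name `map_in_batches`, the thread pool, and the sample rows are all illustrative choices, not the library's implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def map_in_batches(items, item_mapper, batch_size=1000, max_workers=4):
    """Hypothetical stand-in: map items batch by batch, preserving order."""
    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

    def map_batch(batch):
        return [item_mapper(item) for item in batch]

    # pool.map yields results in the order of the submitted batches.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        mapped_batches = list(pool.map(map_batch, batches))

    # Flatten the per-batch results back into one list.
    return [result for batch in mapped_batches for result in batch]

rows = [('a b', 'c'), ('d', 'e f g'), ('h', 'i')]
pairs = map_in_batches(rows, lambda row: [row[0].split(), row[1].split()], batch_size=2)
print(pairs)
# [[['a', 'b'], ['c']], [['d'], ['e', 'f', 'g']], [['h'], ['i']]]
```

Under this assumption, the result is a list with one `[q1_tokens, q2_tokens]` entry per input row.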


Build an alternative token set, with stopwords removed.

In [17]:
tokens_train_no_stopwords = kg.jobs.map_batch_parallel(
    tokens_train,
    item_mapper=remove_stopwords,
    batch_size=1000,
)

Batches: 100%|██████████| 405/405 [00:03<00:00, 113.01it/s]


In [18]:
tokens_test_no_stopwords = kg.jobs.map_batch_parallel(
    tokens_test,
    item_mapper=remove_stopwords,
    batch_size=1000,
)

Batches: 100%|██████████| 2346/2346 [00:24<00:00, 96.31it/s] 


## Extract question vocabulary

In [19]:
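vocab = set()
# Flatten the pairs into a single list of questions; this avoids building
# a ragged NumPy array, which recent NumPy versions reject.
all_questions = [question for pair in tokens_train + tokens_test for question in pair]
for question in progressbar(all_questions):
    vocab.update(question)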

100%|██████████| 5500172/5500172 [00:11<00:00, 470564.09it/s]


In [20]:
vocab_no_stopwords = vocab - stopwords
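
The vocabulary is the union of all tokens across both questions of every pair, and the stopword-free variant is a plain set difference. A small sketch on made-up data:

```python
# Hypothetical tokenized pairs and stopwords, for illustration only.
token_pairs = [
    [['what', 'is', 'python', '.'], ['is', 'python', 'fast', '.']],
    [['why', 'learn', 'python', '.'], ['what', 'is', 'rust', '.']],
]
stopwords = {'what', 'is', 'why'}

vocab = set()
for pair in token_pairs:
    for question in pair:
        vocab.update(question)

vocab_no_stopwords = vocab - stopwords
print(sorted(vocab_no_stopwords))
# ['.', 'fast', 'learn', 'python', 'rust']
```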

## Save preprocessed data

Tokenized questions.

In [21]:
kg.io.save(tokens_train, project.preprocessed_data_dir + 'question_tokens_train.pickle')

In [22]:
kg.io.save(tokens_test, project.preprocessed_data_dir + 'question_tokens_test.pickle')

In [23]:
kg.io.save(tokens_train_no_stopwords, project.preprocessed_data_dir + 'question_tokens_train_no_stopwords.pickle')

In [24]:
kg.io.save(tokens_test_no_stopwords, project.preprocessed_data_dir + 'question_tokens_test_no_stopwords.pickle')
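
For readers without `pygoose`, the following sketch shows a round trip with the standard `pickle` module. It assumes that `kg.io.save` writes an ordinary pickle file, which the `.pickle` extension suggests but this notebook does not confirm; the sample data and file name are made up.

```python
import os
import pickle
import tempfile

# Hypothetical sample in the same shape as tokens_train: a list of question pairs.
tokens = [[['what', 'is', 'python', '.'], ['is', 'python', 'fast', '.']]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'question_tokens_sample.pickle')
    with open(path, 'wb') as f:
        pickle.dump(tokens, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == tokens)
# True
```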

Question vocabulary.

In [25]:
kg.io.save_lines(sorted(vocab), project.preprocessed_data_dir + 'question_tokens.vocab')

In [26]:
kg.io.save_lines(sorted(vocab_no_stopwords), project.preprocessed_data_dir + 'question_tokens_no_stopwords.vocab')

Ground truth.

In [27]:
kg.io.save(df_train['is_duplicate'].values, project.features_dir + 'y_train.pickle')