# Preprocessing: Clean Up & Tokenize Questions

Break each question into tokens and normalize the text along the way: spell out digits, expand negative contractions, correct spelling, and drop stopwords.

## Imports

The `pygoose` utility package imports `numpy`, `pandas`, `matplotlib`, and a helper `kg` module into the root namespace.

In [1]:
from pygoose import *

In [2]:
import nltk

## Config

In [3]:
project = kg.Project.discover()

## Load Data

Original question datasets.

In [4]:
df_train = pd.read_csv(project.data_dir + 'train.csv').fillna('none')
df_test = pd.read_csv(project.data_dir + 'test.csv').fillna('none')

Stopwords customized for the original dataset.

In [5]:
stopwords = set(kg.io.load_lines(project.aux_dir + 'stopwords.vocab'))

Pre-composed spelling correction dictionary.

In [6]:
spelling_corrections = kg.io.load_json(project.aux_dir + 'spelling_corrections.json')

## Load Tools

In [7]:
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
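To illustrate what the `\w+` pattern keeps and discards, here is a minimal sketch that uses `re.findall` as a stand-in for the NLTK tokenizer. The `tokenize` helper is hypothetical, and it assumes that `RegexpTokenizer` with its default settings returns every non-overlapping match of the pattern.

```python
import re

# Hypothetical stand-in for nltk.tokenize.RegexpTokenizer(r'\w+').tokenize,
# assuming the default mode returns all non-overlapping pattern matches.
def tokenize(text):
    return re.findall(r'\w+', text)

# Punctuation is dropped, and apostrophes split a word in two.
print(tokenize("What's the best way to learn C++?"))
# ['What', 's', 'the', 'best', 'way', 'to', 'learn', 'C']

print(tokenize("don't"))
# ['don', 't']
```

Because contractions are split at the apostrophe, any contraction handling has to happen on the raw string before tokenization.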

## Process

In [8]:
def translate(text, translation):
    for token, replacement in translation.items():
        text = text.replace(token, ' ' + replacement + ' ')
    text = text.replace('  ', ' ')
    return text
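The following sketch repeats `translate` with a small hypothetical mapping (`demo` is not part of the project) to show two properties of the helper: replacements are padded with spaces, and the final cleanup is a single pass that can leave some extra whitespace behind.

```python
def translate(text, translation):
    for token, replacement in translation.items():
        text = text.replace(token, ' ' + replacement + ' ')
    text = text.replace('  ', ' ')
    return text

# Hypothetical mapping, used only for this illustration.
demo = {'&': 'and', '%': 'percent'}

# Each replacement is padded, so a trailing space can remain.
print(repr(translate('r&d grew 5%', demo)))
# 'r and d grew 5 percent '

# A run of three spaces is reduced to two, not one, by the single pass.
print(repr(translate('a  & b', demo)))
# 'a  and b'
```

The leftover whitespace is harmless in this notebook because the text is tokenized with `\w+` afterwards, which ignores whitespace entirely.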

In [9]:
def spell_digits(text):
    translation = {
        '0': 'zero',
        '1': 'one',
        '2': 'two',
        '3': 'three',
        '4': 'four',
        '5': 'five',
        '6': 'six',
        '7': 'seven',
        '8': 'eight',
        '9': 'nine',
    }
    return translate(text, translation)

In [10]:
def expand_negations(text):
    translation = {
        "can't": 'can not',
        "won't": 'will not',
        "shan't": 'shall not',
    }
    text = translate(text, translation)
    return text.replace("n't", " not")
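The irregular contractions need their own dictionary entries because the generic `n't` rule alone mangles them. A minimal sketch of the generic rule applied on its own (`generic_only` is a hypothetical helper for this illustration):

```python
def generic_only(text):
    # Only the catch-all rule, without the irregular special cases.
    return text.replace("n't", " not")

for word in ["isn't", "doesn't", "can't", "won't", "shan't"]:
    print(word, '->', generic_only(word))
# isn't -> is not
# doesn't -> does not
# can't -> ca not
# won't -> wo not
# shan't -> sha not
```

Regular forms come out correctly, while the last three lose part of their stem. This is why the special cases are translated before the generic replacement runs.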

In [11]:
def correct_spelling(text):
    return ' '.join(
        spelling_corrections.get(token, token)
        for token in tokenizer.tokenize(text)
    )
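As a rough illustration of the dictionary lookup, the sketch below uses a hypothetical two-entry `corrections` dictionary and `re.findall` in place of the NLTK tokenizer (assumed to behave the same way for this pattern). It passes the dictionary as an argument only to keep the example self-contained.

```python
import re

# Hypothetical entries; the real dictionary is loaded from spelling_corrections.json.
corrections = {'wat': 'what', 'programing': 'programming'}

def correct_spelling(text, corrections):
    # Look each token up in the dictionary and fall back to the token itself.
    return ' '.join(corrections.get(t, t) for t in re.findall(r'\w+', text))

print(correct_spelling('wat is programing? Wat else?', corrections))
# what is programming Wat else
```

The lookup is an exact, case-sensitive match, so `Wat` is left untouched here. Lowercasing the question before this step avoids that. Punctuation also disappears, because the text is rebuilt from tokens only.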

In [12]:
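def get_question_tokens(question):
    # Normalize the lowercased raw string first: the `\w+` tokenizer
    # would otherwise split contractions at the apostrophe.
    question = question.lower()
    question = spell_digits(question)
    question = expand_negations(question)
    question = correct_spelling(question)

    # Tokenize the normalized text and drop stopwords.
    tokens = [
        token
        for token in tokenizer.tokenize(question)
        if token not in stopwords
    ]
    # Append '.' as the final token of every question.
    tokens.append('.')
    return tokens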
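To make the order of the steps concrete, here is a compact, self-contained sketch of the same pipeline. The stopword set, correction dictionary, and trimmed digit and negation tables are hypothetical stand-ins for the project files, and `re.findall` replaces the NLTK tokenizer under the assumption that both return the same matches.

```python
import re

# Hypothetical stand-ins for the project's auxiliary data.
STOPWORDS = {'i', 'why', 'the'}
CORRECTIONS = {'instal': 'install'}
DIGITS = {'2': 'two', '3': 'three'}     # trimmed table for the demo
NEGATIONS = {"can't": 'can not'}        # trimmed table for the demo

def tokenize(text):
    return re.findall(r'\w+', text)

def translate(text, translation):
    for token, replacement in translation.items():
        text = text.replace(token, ' ' + replacement + ' ')
    return text.replace('  ', ' ')

def get_question_tokens(question):
    question = question.lower()
    question = translate(question, DIGITS)
    question = translate(question, NEGATIONS).replace("n't", ' not')
    question = ' '.join(CORRECTIONS.get(t, t) for t in tokenize(question))
    tokens = [t for t in tokenize(question) if t not in STOPWORDS]
    tokens.append('.')
    return tokens

print(get_question_tokens("Why can't I instal Python 3?"))
# ['can', 'not', 'install', 'python', 'three', '.']
```

The question is lowercased, the digit is spelled out, the contraction is expanded, the misspelling is corrected, and the stopwords `why` and `i` are removed before the closing `.` token is appended.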

In [13]:
def get_question_pair_tokens(row):
    return [get_question_tokens(row[0]), get_question_tokens(row[1])]

In [14]:
tokens_train = kg.jobs.map_embarrassingly_parallel(
    df_train[['question1', 'question2']].values,
    get_question_pair_tokens,
    project,
)

Creating job ID: f3f20765-54e5-4a28-ad40-19577921eac6
Chunk 1/1: 100%|██████████| 404290/404290 [00:10<00:00, 37671.30it/s]


In [15]:
tokens_test = kg.jobs.map_embarrassingly_parallel(
    df_test[['question1', 'question2']].values,
    get_question_pair_tokens,
    project,
)

Creating job ID: 5a684547-2b3e-42ab-ae06-52a921fd485b
Chunk 1/1: 100%|██████████| 2345796/2345796 [01:01<00:00, 38062.87it/s]


## Save Preprocessed Data

In [16]:
kg.io.save(tokens_train, project.preprocessed_data_dir + 'question_tokens_train.pickle')

In [17]:
kg.io.save(tokens_test, project.preprocessed_data_dir + 'question_tokens_test.pickle')
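The saved objects are lists with one entry per row, and each entry holds two token lists (one per question). The sketch below shows that shape surviving a round trip through the standard `pickle` module. It assumes, based only on the `.pickle` file extension, that `kg.io.save` writes a standard pickle; the file name and token data are hypothetical.

```python
import os
import pickle
import tempfile

# Two hypothetical question pairs in the same shape as tokens_train.
tokens = [
    [['can', 'not', 'install', 'python', '.'], ['install', 'python', '.']],
    [['best', 'laptop', '.'], ['good', 'laptop', 'students', '.']],
]

# Write to a throwaway location rather than the project directory.
path = os.path.join(tempfile.mkdtemp(), 'question_tokens_demo.pickle')
with open(path, 'wb') as f:
    pickle.dump(tokens, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

# Each row unpacks into the token lists of question1 and question2.
q1_tokens, q2_tokens = restored[0]
print(q1_tokens)
# ['can', 'not', 'install', 'python', '.']
print(len(restored))
# 2
```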