# Preprocessing: Create a FastText Vector Database

Query a pretrained FastText model for the vector of every word in the vocabulary extracted from the question texts, and save the results.

## Imports

The `pygoose` utility package imports `numpy`, `pandas`, `matplotlib`, and a helper `kg` module into the root namespace.

In [21]:
from pygoose import *

In [22]:
import os
import subprocess

## Config

Automatically discover the paths to various data folders and compose the project structure.

In [23]:
project = kg.Project.discover()

Number of word embedding dimensions.

In [24]:
EMBEDDING_DIM = 300

Path to FastText executable.

In [25]:
FASTTEXT_EXECUTABLE = 'fasttext'

Path to the FastText binary model [pre-trained on Wikipedia](https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md).

In [47]:
PRETRAINED_MODEL_FILE = '/home/denys/kaggle-quora-question-pairs/data/aux/fasttext/wiki.en.bin'

Input vocab file (one word per line).

In [48]:
VOCAB_FILE = project.preprocessed_data_dir + 'tokens_lowercase_spellcheck_no_stopwords.vocab'

Vector output file (one vector per line).

In [49]:
OUTPUT_FILE = project.aux_dir + 'fasttext_vocab.vec'

## Save FastText metadata

Write a header line with the number of words and the embedding size so that `gensim` can read the file.

In [50]:
vocab = kg.io.load_lines(VOCAB_FILE)

In [51]:
len(vocab)

226268

In [52]:
# Derive the header from the loaded vocabulary instead of hard-coding it.
with open(OUTPUT_FILE, 'w') as f:
    print(len(vocab), EMBEDDING_DIM, file=f)
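
For reference, the word2vec text format that `gensim` reads starts with a `<word count> <dimensions>` header, followed by one `word v1 v2 ...` line per entry. The following is a minimal sketch of a parser for that layout. The helper name `parse_vec_lines` and the 3-dimensional toy vectors are assumptions made for illustration, not part of this project or of `gensim`.

```python
def parse_vec_lines(lines):
    """Parse word2vec-style text lines into (count, dim, {word: vector}).

    Hypothetical helper for illustration only.
    """
    lines = iter(lines)
    # The first line is the header: "<word count> <dimensions>".
    count, dim = (int(x) for x in next(lines).split())
    vectors = {}
    for line in lines:
        parts = line.split()  # split() also tolerates trailing whitespace
        if not parts:
            continue  # skip blank lines
        word, values = parts[0], [float(v) for v in parts[1:]]
        if len(values) != dim:
            raise ValueError('unexpected vector length for %r' % word)
        vectors[word] = values
    return count, dim, vectors


# Toy data, not real FastText output.
toy_lines = ['2 3', 'hello 0.1 0.2 0.3 ', 'world -0.4 0.5 0.6 ']
count, dim, vectors = parse_vec_lines(toy_lines)
```

The header only declares what the file should contain, so a reader still has to trust or verify that the rows below it match.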

## Query and save FastText vectors

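Replicate the shell command `fasttext print-word-vectors model.bin < words.txt >> vectors.vec`: feed the vocabulary file to FastText on standard input and append the resulting vectors to the output file, after the header.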

In [53]:
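# Stream the vocabulary to FastText on stdin and append the vectors after the header.
with open(VOCAB_FILE) as f_vocab, open(OUTPUT_FILE, 'a') as f_output:
    subprocess.run(
        [FASTTEXT_EXECUTABLE, 'print-word-vectors', PRETRAINED_MODEL_FILE],
        stdin=f_vocab,
        stdout=f_output,
        check=True,  # raise CalledProcessError if FastText exits with an error
    )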
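
Because the header is written before FastText runs, it may be worth confirming afterwards that the file really contains the declared number of rows and dimensions. The following is a minimal sketch of such a check. The helper name `count_vec_rows` is hypothetical, and the sketch assumes one whitespace-separated `word v1 v2 ...` row per line.

```python
def count_vec_rows(path):
    """Return (declared_count, declared_dim, actual_rows) for a .vec file.

    Hypothetical helper for illustration only.
    """
    with open(path) as f:
        # Header line: "<word count> <dimensions>".
        declared_count, declared_dim = (int(x) for x in f.readline().split())
        actual_rows = 0
        for line in f:
            parts = line.split()
            if not parts:
                continue  # ignore blank lines
            # Each row is a word followed by declared_dim numbers.
            if len(parts) - 1 != declared_dim:
                raise ValueError('row %d has %d values, expected %d'
                                 % (actual_rows + 1, len(parts) - 1, declared_dim))
            actual_rows += 1
    return declared_count, declared_dim, actual_rows
```

In this notebook it could be called as `count_vec_rows(OUTPUT_FILE)`. If the declared count and the actual row count differ, the header and the vectors are out of sync and the output file should probably be regenerated.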