# fastText

This short notebook shows how to classify text with a trained fastText model. The model, which was trained on full texts, is available at https://github.com/arXiv/arxiv-classifier/releases/download/ulmfit-models-v1.0/model-unbalanced-2M-100d-softmax.bin.xz. The download is xz-compressed, so it has to be decompressed to a `.bin` file before it can be loaded.
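
Since the release file ends in `.bin.xz` while the notebook loads a `.bin`, a decompression step is needed first. The following is a minimal sketch using only the standard library; the helper name `decompress_xz` and the assumption that the archive has already been downloaded next to the target path are illustrative, not part of the original notebook.

```python
import lzma
import shutil
from pathlib import Path


def decompress_xz(src, dst):
    """Decompress the .xz file at `src` into `dst`, streaming in chunks.

    Hypothetical helper: the file locations are whatever you pass in.
    """
    src, dst = Path(src), Path(dst)
    with lzma.open(src, 'rb') as f_in, open(dst, 'wb') as f_out:
        # copyfileobj streams the data, so the model never sits fully in memory
        shutil.copyfileobj(f_in, f_out)
    return dst


# Example call (assumes the archive was saved under /mnt/efs/fasttext):
# decompress_xz('/mnt/efs/fasttext/model-unbalanced-2M-100d-softmax.bin.xz',
#               '/mnt/efs/fasttext/model-unbalanced-2M-100d-softmax.bin')
```

The equivalent on the command line is `xz -d` on the downloaded file.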

In [1]:
from pathlib import Path

import fasttext

MODEL_PATH = Path('/mnt/efs/fasttext')
model_name = 'model-unbalanced-2M-100d-softmax.bin'

model = fasttext.load_model(str(MODEL_PATH / model_name))



In [2]:
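import re
import string

# Matches ASCII punctuation plus a few extra typographic symbols.
re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')

def tokenize(s):
    """Put spaces around punctuation, lowercase, and split on whitespace."""
    return re_tok.sub(r' \1 ', s).lower().split()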

def convert_to_fasttext(text):
    """Flatten text to a single line of space-separated tokens."""
    line = ' '.join(text.split('\n'))
    line = ' '.join(tokenize(line))
    return line
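
To make the effect of the preprocessing concrete, the sketch below restates the two helpers from the cell above so that it runs on its own, then applies them to a two-line input. The sample string is the notebook's own test sentence with a line break added for illustration.

```python
import re
import string

# Same pattern as in the cell above: ASCII punctuation plus extra symbols.
re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')


def tokenize(s):
    # Surround each punctuation mark with spaces, lowercase, split on whitespace.
    return re_tok.sub(r' \1 ', s).lower().split()


def convert_to_fasttext(text):
    # Join lines, then re-join the tokens with single spaces.
    line = ' '.join(text.split('\n'))
    return ' '.join(tokenize(line))


sample = "In this paper\nwe prove that P=NP."
print(convert_to_fasttext(sample))
# in this paper we prove that p = np .
```

Punctuation becomes separate tokens, everything is lowercased, and the line break disappears, so the result is a single line of text.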

In [3]:
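def predict(model, text, topk=5):
    """Return the top-k (label, probability) pairs for text."""
    text = convert_to_fasttext(text)
    labels, probs = model.predict(text, k=topk)
    # remove the '__label__' prefix that fastText puts on every label
    prefix = '__label__'
    labels = [label[len(prefix):] for label in labels]
    return list(zip(labels, probs))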
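
For a single input string, `model.predict` returns a pair of sequences: the labels, each carrying the `__label__` prefix, and the matching probabilities. The sketch below illustrates the post-processing without needing the model file. `StubModel`, `strip_prefix`, `predict_stub`, and the probabilities are hypothetical stand-ins invented for this illustration, not outputs of the real classifier.

```python
class StubModel:
    """Hypothetical stand-in that mimics the shape of fastText's predict output."""

    def predict(self, text, k=1):
        # Made-up values, for illustration only.
        labels = ('__label__math.OC', '__label__cs.MA', '__label__cs.LO')
        probs = (0.7, 0.2, 0.1)
        return labels[:k], probs[:k]


def strip_prefix(label, prefix='__label__'):
    """Drop the label prefix if present; leave other strings untouched."""
    return label[len(prefix):] if label.startswith(prefix) else label


def predict_stub(model, text, topk=5):
    labels, probs = model.predict(text, k=topk)
    return [(strip_prefix(label), float(p)) for label, p in zip(labels, probs)]


result = predict_stub(StubModel(), "some text", topk=2)
print(result)
# [('math.OC', 0.7), ('cs.MA', 0.2)]
```

Checking for the prefix before slicing keeps the helper safe if a label arrives without it, for example from a model trained with a different label prefix.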

In [4]:
predict(model, "In this paper we prove that P=NP.")

[('math.OC', 0.7141464352607727),
 ('cs.MA', 0.27467426657676697),
 ('q-fin.MF', 0.010656964965164661),
 ('cs.LO', 0.0002896013902500272),
 ('eess.SY', 0.00016926857642829418)]