fastText is an open-source library from Facebook Research for learning word embeddings and training text classifiers. It trains quickly on a standard CPU and performs well on many text classification datasets.

In [1]:
import fasttext

In [2]:
help(fasttext.FastText)

Help on module fasttext.FastText in fasttext:

NAME
    fasttext.FastText

DESCRIPTION
    # Copyright (c) 2017-present, Facebook, Inc.
    # All rights reserved.
    #
    # This source code is licensed under the MIT license found in the
    # LICENSE file in the root directory of this source tree.

FUNCTIONS
    cbow(*kargs, **kwargs)
    
    eprint(*args, **kwargs)
    
    load_model(path)
        Load a model given a filepath and return a model object.
    
    read_args(arg_list, arg_dict, arg_names, default_values)
    
    skipgram(*kargs, **kwargs)
    
    supervised(*kargs, **kwargs)
    
    tokenize(text)
        Given a string of text, tokenize it and return a list of tokens
    
    train_supervised(*kargs, **kwargs)
        Train a supervised model and return a model object.
        
        input must be a filepath. The input text does not need to be tokenized
        as per the tokenize function, but it must be preprocessed and encoded
        as UTF-8. You might wan

In [6]:
# Download the dataset manually: https://dl.fbaipublicfiles.com/fasttext/data/cooking.stackexchange.tar.gz
# Extract the archive so that cooking.stackexchange.txt is in the working directory.
model = fasttext.train_supervised(input="cooking.stackexchange.txt")

In [7]:
model.save_model("model_cooking.bin")

In [8]:
model.predict("Which baking dish is best to bake a banana bread ?")

(('__label__baking',), array([0.16714664]))

In [9]:
model.test("cooking.stackexchange.txt")

(15404, 0.15885484289794857, 0.06884812334702606)
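The call above evaluates the model on the same file it was trained on. A minimal sketch of holding out part of the data for evaluation is shown here; the function name, the 20% fraction, and the seed are illustrative assumptions, not part of fastText.

```python
import random


def split_train_valid(lines, valid_fraction=0.2, seed=0):
    """Shuffle labelled lines and hold out a fraction for validation.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    n_valid = int(len(shuffled) * valid_fraction)
    # The first n_valid shuffled lines become the validation set.
    return shuffled[n_valid:], shuffled[:n_valid]


example_lines = [f"__label__baking question {i}" for i in range(10)]
train_lines, valid_lines = split_train_valid(example_lines)
```

Each list could then be written to its own file (one example per line), with the training file passed to `fasttext.train_supervised(input=...)` and the validation file passed to `model.test(...)`.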

Instead of evaluating only the most probable label, we can evaluate the top 5 predicted labels by passing k=5:

In [10]:
model.test("cooking.stackexchange.txt", k=5)

(15404, 0.07066995585562191, 0.1531427606775083)
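The tuples returned by `model.test` are the number of examples, the precision at k, and the recall at k. The sketch below shows one way these two numbers can be computed, assuming micro-averaging over all examples; the function and the toy data are illustrative and are not fastText's own code. It also shows why precision tends to fall and recall tends to rise as k grows, as in the two results above.

```python
def precision_recall_at_k(predicted, gold, k):
    """Micro-averaged precision and recall over the top-k predictions.

    predicted: list of ranked label lists (best first), one per example
    gold:      list of true label lists, one per example
    """
    n_correct = n_predicted = n_gold = 0
    for pred_labels, gold_labels in zip(predicted, gold):
        top_k = pred_labels[:k]
        n_correct += len(set(top_k) & set(gold_labels))
        n_predicted += len(top_k)
        n_gold += len(gold_labels)
    return n_correct / n_predicted, n_correct / n_gold


# Toy data (made up for illustration)
predicted = [["baking", "bread", "equipment"], ["food-safety", "chicken", "baking"]]
gold = [["baking", "bread"], ["equipment"]]

p1, r1 = precision_recall_at_k(predicted, gold, k=1)
p3, r3 = precision_recall_at_k(predicted, gold, k=3)
```

With k=1 only one of the two predictions is correct, while with k=3 more true labels are recovered at the cost of more wrong guesses.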

In [11]:
model.predict("Why not put knives in the dishwasher?", k=5)

(('__label__food-safety',
  '__label__baking',
  '__label__bread',
  '__label__equipment',
  '__label__chicken'),
 array([0.07006533, 0.06888454, 0.03610465, 0.03176519, 0.02594336]))

In [None]:
# The model would likely improve by separating punctuation from words and converting the text to lower case:
# cat cooking.stackexchange.txt | sed -e "s/\([.\!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > cooking.preprocessed.txt
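For readers who prefer to stay in Python, the following sketch roughly mirrors the shell pipeline above: it puts spaces around the characters `. ! ? , ' / ( )` and lowercases the line. The function name is an assumption made for illustration.

```python
import re

# Characters that the sed expression surrounds with spaces.
_PUNCTUATION = re.compile(r"([.!?,'/()])")


def preprocess_line(line):
    """Separate punctuation from words and lowercase the whole line."""
    return _PUNCTUATION.sub(r" \1 ", line).lower()


cleaned = preprocess_line("__label__baking How's it going?")
```

Applying `preprocess_line` to every line of `cooking.stackexchange.txt` and writing the results to `cooking.preprocessed.txt` would give a file comparable to the one produced by the shell command.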

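We can improve the model by training for more epochs, raising the learning rate, and using word n-grams (sequences of up to N consecutive words) as additional features. The 'ova' (one-vs-all) loss trains an independent binary classifier for each label, which suits examples that carry several labels; when each example has a single label and there are many classes, the 'hs' (hierarchical softmax) loss can be used as a faster approximation of the default softmax.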

In [14]:
model = fasttext.train_supervised(input="cooking.stackexchange.txt", lr=0.5, epoch=200, wordNgrams=5, bucket=200000, dim=50, loss='ova')
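To make the `wordNgrams` setting more concrete, the sketch below enumerates the contiguous word sequences of length 1 up to N for a tokenized sentence. It only illustrates which sequences are considered; it is not how fastText stores them internally (the library hashes n-grams into a fixed number of buckets, which is what the `bucket` argument controls). The function name is an assumption.

```python
def word_ngrams(tokens, max_n):
    """Return all contiguous word n-grams of length 1..max_n, shortest first."""
    ngrams = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


bigram_features = word_ngrams(["bake", "banana", "bread"], max_n=2)
```

Here `bigram_features` holds the three single words followed by "bake banana" and "banana bread", so word order contributes to the features.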

In [15]:
model.predict("Which baking dish is best to bake a banana bread ?", k=-1, threshold=0.5)

(('__label__baking', '__label__bread', '__label__bananas'),
 array([1.00001001, 0.98832273, 0.95258415]))
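With k=-1 and a threshold, every label whose score reaches the threshold is returned. The same filtering can be sketched by hand on a (labels, scores) pair like the one above. The helper below is hypothetical; it also clips scores to 1.0, since the first score in the output is slightly above 1.

```python
def labels_above_threshold(labels, probs, threshold):
    """Keep (label, score) pairs whose score is at least `threshold`."""
    kept = []
    for label, prob in zip(labels, probs):
        prob = min(float(prob), 1.0)  # clip scores that slightly exceed 1
        if prob >= threshold:
            kept.append((label, prob))
    return kept


# Values copied from the prediction output above
labels = ("__label__baking", "__label__bread", "__label__bananas")
probs = [1.00001001, 0.98832273, 0.95258415]

confident = labels_above_threshold(labels, probs, threshold=0.97)
```

Raising the threshold trades recall for precision: at 0.97 only the baking and bread labels are kept.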