fastText

fastText - efficient text classification and representation learning - for Ruby

Installation

Add this line to your application’s Gemfile:

gem 'fasttext'
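
And then run:

bundle install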

Getting Started

fastText has two primary use cases: text classification and word representations.

Text Classification

Prep your data

# documents
x = [
  "text from document one",
  "text from document two",
  "text from document three"
]

# labels
y = ["ham", "ham", "spam"]

Use an array if a document has multiple labels
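
For example (the label values here are just illustrative):

y = [["ham", "work"], ["ham"], ["spam"]]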

Train a model

model = FastText::Classifier.new
model.fit(x, y)

Get predictions

model.predict(x)

Save the model to a file

model.save_model("model.bin")

Load the model from a file

model = FastText.load_model("model.bin")

Evaluate the model

model.test(x_test, y_test)

Get words and labels

model.words
model.labels

Use include_freq: true to get their frequency
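
For example:

model.words(include_freq: true)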

Compress the model - significantly reduces its size but sacrifices a little accuracy

model.quantize
model.save_model("model.ftz")

Word Representations

Prep your data

x = [
  "text from document one",
  "text from document two",
  "text from document three"
]

Train a model

model = FastText::Vectorizer.new
model.fit(x)

Get nearest neighbors

model.nearest_neighbors("asparagus")

Get analogies

model.analogies("berlin", "germany", "france")

Get a word vector

model.word_vector("carrot")

Get words

model.words

Save the model to a file

model.save_model("model.bin")

Load the model from a file

model = FastText.load_model("model.bin")

Use continuous bag-of-words

model = FastText::Vectorizer.new(model: "cbow")

Parameters

Text classification

FastText::Classifier.new(
  lr: 0.1,                    # learning rate
  dim: 100,                   # size of word vectors
  ws: 5,                      # size of the context window
  epoch: 5,                   # number of epochs
  min_count: 1,               # minimal number of word occurrences
  min_count_label: 1,         # minimal number of label occurrences
  minn: 0,                    # min length of char ngram
  maxn: 0,                    # max length of char ngram
  neg: 5,                     # number of negatives sampled
  word_ngrams: 1,             # max length of word ngram
  loss: "softmax",            # loss function {ns, hs, softmax, ova}
  bucket: 2000000,            # number of buckets
  thread: 3,                  # number of threads
  lr_update_rate: 100,        # change the rate of updates for the learning rate
  t: 0.0001,                  # sampling threshold
  label_prefix: "__label__"   # label prefix
  verbose: 2,                 # verbose
  pretrained_vectors: nil     # pretrained word vectors (.vec file)
)
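
For instance, a minimal sketch overriding a few of these defaults (the specific values are illustrative, not recommendations):

model = FastText::Classifier.new(epoch: 25, word_ngrams: 2)
model.fit(x, y)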

Word representations

FastText::Vectorizer.new(
  model: "skipgram",          # unsupervised fasttext model {cbow, skipgram}
  lr: 0.05,                   # learning rate
  dim: 100,                   # size of word vectors
  ws: 5,                      # size of the context window
  epoch: 5,                   # number of epochs
  min_count: 5,               # minimal number of word occurrences
  minn: 3,                    # min length of char ngram
  maxn: 6,                    # max length of char ngram
  neg: 5,                     # number of negatives sampled
  word_ngrams: 1,             # max length of word ngram
  loss: "ns",                 # loss function {ns, hs, softmax, ova}
  bucket: 2000000,            # number of buckets
  thread: 3,                  # number of threads
  lr_update_rate: 100,        # change the rate of updates for the learning rate
  t: 0.0001,                  # sampling threshold
  verbose: 2                  # verbosity level
)
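
For example, an illustrative sketch with larger vectors and shorter character n-grams:

model = FastText::Vectorizer.new(dim: 300, minn: 2, maxn: 5)
model.fit(x)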

Input Files

Input can be read directly from files

model.fit("train.txt")
model.test("test.txt")

Each line should be a document

text from document one
text from document two
text from document three

For text classification, lines should start with a list of labels prefixed with __label__

__label__ham text from document one
__label__ham text from document two
__label__spam text from document three
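
A line can carry multiple labels, each with the prefix (standard fastText format; labels here are illustrative):

__label__ham __label__important text from document one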

Pretrained Models

There are a number of pretrained models you can download

Language Identification

Download one of the pretrained models and load it

model = FastText.load_model("lid.176.ftz")

Get language predictions

model.predict("bon appétit")

rbenv

This library uses Rice to interface with the fastText C++ library. Rice and earlier versions of rbenv don’t play nicely together. If you encounter an error during installation, upgrade ruby-build and reinstall your Ruby version.

brew upgrade ruby-build
rbenv install [version]

History

View the changelog

Contributing

Everyone is encouraged to help improve this project. Here are a few ways you can help:

- Report bugs
- Fix bugs and submit pull requests
- Write, clarify, or fix documentation
- Suggest or add new features

To get started with development and testing:

git clone https://github.com/ankane/fasttext.git
cd fasttext
bundle install
rake compile
rake test