
multiline get_line / model.test / unit test
Summary: See title.

Reviewed By: kahne

Differential Revision: D6629548

fbshipit-source-id: 89e0b04097d54845f8c1264a3f1fa72678de9587
cpuhrsch authored and facebook-github-bot committed Jan 2, 2018
1 parent eeddd0d commit c952cb6
Showing 13 changed files with 482 additions and 202 deletions.
50 changes: 50 additions & 0 deletions .circleci/config.yml
@@ -34,6 +34,7 @@ jobs:
            . .circleci/setup_circleimg.sh
            . .circleci/python_test.sh
  "py353":
    docker:
      - image: circleci/python:3.5.3
@@ -67,6 +68,51 @@ jobs:
            . .circleci/setup_circleimg.sh
            . .circleci/python_test.sh
  "py361-pypi":
    docker:
      - image: circleci/python:3.6.1
    working_directory: ~/repo
    steps:
      - checkout
      - run:
          command: |
            . .circleci/setup_circleimg.sh
            . .circleci/pip_test.sh
  "py353-pypi":
    docker:
      - image: circleci/python:3.5.3
    working_directory: ~/repo
    steps:
      - checkout
      - run:
          command: |
            . .circleci/setup_circleimg.sh
            . .circleci/pip_test.sh
  "py346-pypi":
    docker:
      - image: circleci/python:3.4.6
    working_directory: ~/repo
    steps:
      - checkout
      - run:
          command: |
            . .circleci/setup_circleimg.sh
            . .circleci/pip_test.sh
  "py2713-pypi":
    docker:
      - image: circleci/python:2.7.13
    working_directory: ~/repo
    steps:
      - checkout
      - run:
          command: |
            . .circleci/setup_circleimg.sh
            . .circleci/pip_test.sh
  "gcc5":
    docker:
      - image: gcc:5
@@ -184,6 +230,10 @@ workflows:
- "py353"
- "py346"
- "py2713"
- "py361-pip"
- "py353-pip"
- "py346-pip"
- "py2713-pip"
- "gcc5"
- "gcc6"
- "gcc7"
12 changes: 12 additions & 0 deletions .circleci/pip_test.sh
@@ -0,0 +1,12 @@
#!/usr/bin/env bash
#
# Copyright (c) 2016-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree. An additional grant
# of patent rights can be found in the PATENTS file in the same directory.
#

sudo pip install --index-url https://test.pypi.org/simple/ fasttext
python runtests.py -u
145 changes: 41 additions & 104 deletions python/README.md
@@ -4,148 +4,85 @@

## Requirements

**fastText** builds on modern Mac OS and Linux distributions.
[fastText](https://fasttext.cc/) builds on modern Mac OS and Linux distributions.
Since it uses C\++11 features, it requires a compiler with good C++11 support.
These include:

* (gcc-4.8 or newer) or (clang-3.3 or newer)

You will need

* python 2.7 or newer
* numpy & scipy
* [Python](https://www.python.org/) version 2.7 or >=3.4
* [NumPy](http://www.numpy.org/) & [SciPy](https://www.scipy.org/)
* [pybind11](https://github.com/pybind/pybind11)

## Building fastTextpy
## Building fastText

In order to build `fastTextpy`, do the following:
The easiest way to get the latest version of fastText is to use [pip](https://pypi.python.org/pypi/fasttext).

```
$ python setup.py install
$ pip install fasttext
```

This will add the module fastTextpy to your python interpreter.
Depending on your system you might need to use 'sudo', for example

```
$ sudo python setup.py install
```
If you want to use the latest unstable release, you will need to build from source using setup.py.

Now you can import this library with

```
import fastText
```


## Examples

If you're already largely familiar with fastText you could skip this section
and take a look at the examples within the doc folder.

## Using models

First, you'll need to train a model with fastText. For example

```
./fasttext skipgram -input data/fil9 -output result/fil9
```

You can see more examples within the scripts in the [fastText repository](https://github.com/facebookresearch/fastText).

Next, you can load this model from Python and query it.
In general it is assumed that the reader already has good knowledge of fastText. For this consider the main [README](https://github.com/facebookresearch/fastText/blob/master/README.md) and in particular [the tutorials on our website](https://fasttext.cc/docs/en/supervised-tutorial.html).

```
from fastText import load_model
f = load_model('result/model.bin')
words, frequency = f.get_words()
subwords = f.get_subwords("Paris")
```

If you trained an unsupervised model, you can get word vectors with

```
vector = f.get_word_vector("London")
```
We recommend you look at the [examples within the doc folder](https://github.com/facebookresearch/fastText/tree/master/python/doc/examples).

If you trained a supervised model, you can get the top k labels and get their probabilities with
As with any package you can get help on any Python function using the help function.

```
k = 5
labels, probabilities = f.predict("I like this Product", k)
```

A more advanced application might look like this:

Getting the word vectors of all words:

```
words, frequency = f.get_words()
for w in words:
    print((w, f.get_word_vector(w)))
```

## Training models

Training a model is easy. For example
For example

```
from fastText import train_supervised
from fastText import train_unsupervised
model_unsup = train_unsupervised(
    input=<data>,
    epoch=1,
    model="cbow",
    thread=10
)
model_unsup.save_model(<path>)
model_sup = train_supervised(
    input=<labeled_data>,
    epoch=1,
    thread=10
)
```

You can then use the model objects just as exemplified above.
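
For example, a minimal sketch of querying the two models (assuming the placeholders above were replaced with real data files, so that `model_unsup` and `model_sup` exist as shown):

```
# Hypothetical follow-up to the training example above.
vector = model_unsup.get_word_vector("london")  # word vector from the unsupervised model
labels, probabilities = model_sup.predict("I like this product", 2)  # top-2 labels from the supervised model
```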

To get extended help on these functions use the python help functions.
>>> import fastText
>>> help(fastText.FastText)
For example
Help on module fastText.FastText in fastText:
```
Help on function train_unsupervised in module fastText.FastText:
NAME
fastText.FastText
train_unsupervised(input, model=u'skipgram', lr=0.05, dim=100, ws=5, epoch=5, minCount=5, minCountLabel=0, minn=3, maxn=6, neg=5, wordNgrams=1, loss=u'ns', bucket=2000000, thread=12, lrUpdateRate=100, t=0.0001, label=u'__label__', verbose=2, pretrainedVectors=u'', saveOutput=0)
Train an unsupervised model and return a model object.
DESCRIPTION
# Copyright (c) 2017-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree. An additional grant
# of patent rights can be found in the PATENTS file in the same directory.
input must be a filepath. The input text does not need to be tokenized
as per the tokenize function, but it must be preprocessed and encoded
as UTF-8. You might want to consult standard preprocessing scripts such
as tokenizer.perl mentioned here: http://www.statmt.org/wmt07/baseline.html
FUNCTIONS
load_model(path)
Load a model given a filepath and return a model object.
The input file must not contain any labels or use the specified label prefix
unless it is ok for those words to be ignored. For an example consult the
dataset pulled by the example script word-vector-example.sh, which is
part of the fastText repository.
tokenize(text)
Given a string of text, tokenize it and return a list of tokens
[...]
```

## Processing data
## IMPORTANT: Preprocessing data / encoding conventions

You can tokenize using the fastText Dictionary method readWord.
In general it is important to properly preprocess your data. In particular our example scripts in the [root folder](https://github.com/facebookresearch/fastText) do this.

This will give you a list of tokens split on the same whitespace characters that fastText splits on.
fastText assumes UTF-8 encoded text. All text must be [unicode for Python 2](https://docs.python.org/2/library/functions.html#unicode) and [str for Python 3](https://docs.python.org/3.5/library/stdtypes.html#textseq). The passed text will be [encoded as UTF-8 by pybind11](https://pybind11.readthedocs.io/en/master/advanced/cast/strings.html?highlight=utf-8#strings-bytes-and-unicode-conversions) before being passed to the fastText C++ library. This means it is important to use UTF-8 encoded text when building a model. On Unix-like systems you can convert text using [iconv](https://en.wikipedia.org/wiki/Iconv).
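
For example, a minimal sketch of reading UTF-8 text in a way that works on both Python 2 and Python 3 (the file path `data.txt` is a hypothetical placeholder; the model path is taken from the example above):

```
import io

from fastText import load_model, tokenize

f = load_model('result/model.bin')

# io.open with an explicit encoding yields unicode on Python 2 and str on Python 3,
# which is what the bindings expect before pybind11 re-encodes the text as UTF-8.
with io.open('data.txt', 'r', encoding='utf-8') as fin:
    for line in fin:
        for token in tokenize(line):
            vector = f.get_word_vector(token)
```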

It will also add the EOS character as necessary, which is exposed via fastText.EOS
fastText will tokenize (split text into pieces) based on the following ASCII characters (bytes). In particular, it is not aware of UTF-8 whitespace. We advise the user to convert UTF-8 whitespace / word boundaries into one of the following symbols as appropriate (see the sketch after this list).

The resulting text is then stored entirely in memory.
* space
* tab
* vertical tab
* carriage return
* formfeed
* the null character
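
As a sketch of such a conversion (the specific characters replaced here, such as the no-break space, are only illustrative assumptions about your data):

```
# -*- coding: utf-8 -*-
# Replace a few common UTF-8 whitespace characters with plain ASCII spaces
# before handing the text to fastText.
UTF8_WHITESPACE = [u'\u00a0', u'\u2009', u'\u3000']  # no-break, thin and ideographic spaces


def normalize_whitespace(text):
    for ch in UTF8_WHITESPACE:
        text = text.replace(ch, u' ')
    return text
```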

For example:
The newline character is used to delimit lines of text. In particular, the EOS token is appended to a line of text if a newline character is encountered. The only exception is if the number of tokens exceeds the MAX\_LINE\_SIZE constant as defined in the [Dictionary header](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.h). This means if you have text that is not separated by newlines, such as the [fil9 dataset](http://mattmahoney.net/dc/textdata), it will be broken into chunks of MAX\_LINE\_SIZE tokens and the EOS token is not appended.
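
A small sketch of how this surfaces through the bindings (the EOS symbol is exposed as fastText.EOS, as noted above):

```
import fastText

# A newline ends a line of text, so the EOS token should appear
# at the end of the token list for this line.
print(fastText.tokenize(u"the quick brown fox\n"))
```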

```
from fastText import tokenize
with open(<PATH>, 'r') as f:
tokens = tokenize(f.read())
```
The length of a token is the number of UTF-8 characters, obtained by considering the [leading two bits of a byte](https://en.wikipedia.org/wiki/UTF-8#Description) to identify [subsequent bytes of a multi-byte sequence](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc). Knowing this is especially important when choosing the minimum and maximum length of subwords. Further, the EOS token (as specified in the [Dictionary header](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.h)) is considered a character and will not be broken into subwords.
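
For instance, a sketch of inspecting subwords for a word containing a multi-byte UTF-8 character (assuming a model trained with default subword settings and saved at the path used earlier):

```
# -*- coding: utf-8 -*-
from fastText import load_model

f = load_model('result/model.bin')

# "Zürich" contains a multi-byte character; subword lengths are counted
# in UTF-8 characters, not bytes, when minn / maxn are applied.
print(f.get_subwords(u"Zürich"))
```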
