An NLP pipeline config is a JSON file that contains one required element, chainer:
{
"chainer": {
"in": ["x"],
"in_y": ["y"],
"pipe": [
...
],
"out": ["y_predicted"]
}
}
~deeppavlov.core.common.chainer.Chainer is a core concept of the DeepPavlov library: a chainer builds a pipeline from heterogeneous components (rule-based/ML/DL) and allows one to train or infer with the pipeline as a whole. Each component in the pipeline specifies its inputs and outputs as arrays of names, for example: "in": ["tokens", "features"] and "out": ["token_embeddings", "features_embeddings"], and you can chain the outputs of one component with the inputs of other components:
{
"class_name": "deeppavlov.models.preprocessors.str_lower:str_lower",
"in": ["x"],
"out": ["x_lower"]
},
{
"class_name": "nltk_tokenizer",
"in": ["x_lower"],
"out": ["x_tokens"]
},
Pipeline elements can be child classes of ~deeppavlov.core.models.component.Component or functions. Each ~deeppavlov.core.models.component.Component in the pipeline must implement the __call__ method and have a class_name parameter, which is either its registered codename or the full name of any Python class in the form "module_name:ClassName". It can also have any other parameters that mirror its __init__ method arguments; default values of __init__ arguments will be overridden with the config values during the initialization of a class instance.
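For illustration, here is a minimal sketch of a custom component; the exclamation_appender codename, the class itself and its suffix parameter are invented for this example:

from deeppavlov.core.common.registry import register
from deeppavlov.core.models.component import Component

@register('exclamation_appender')  # hypothetical codename for illustration
class ExclamationAppender(Component):
    def __init__(self, suffix: str = '!', **kwargs) -> None:
        # the default value of "suffix" can be overridden from the config,
        # e.g. with "suffix": "?"
        self.suffix = suffix

    def __call__(self, batch):
        # receives the values listed in "in" and returns the values listed in "out"
        return [utterance + self.suffix for utterance in batch]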
You can reuse components in the pipeline to process different parts of data with the help of id and ref parameters:
{
"class_name": "nltk_tokenizer",
"id": "tokenizer",
"in": ["x_lower"],
"out": ["x_tokens"]
},
{
"ref": "tokenizer",
"in": ["y"],
"out": ["y_tokens"]
},
As of version 0.1.0, every string value in a configuration file is interpreted as a format string where fields are evaluated from the metadata.variables element:
{
"chainer": {
"in": ["x"],
"pipe": [
{
"class_name": "my_component",
"in": ["x"],
"out": ["x"],
"load_path": "{MY_PATH}/file.obj"
},
{
"in": ["x"],
"out": ["y_predicted"],
"config_path": "{CONFIGS_PATH}/classifiers/intents_snips.json"
}
],
"out": ["y_predicted"]
},
"metadata": {
"variables": {
"MY_PATH": "/some/path",
"CONFIGS_PATH": "{DEEPPAVLOV_PATH}/configs"
}
}
}
The variable DEEPPAVLOV_PATH is always preset to the path of the deeppavlov Python module.
One can override configuration variables using environment variables with the DP_ prefix, so the environment variable DP_VARIABLE_NAME will override VARIABLE_NAME inside a configuration file. For example, setting DP_ROOT_PATH=/my_path/to/large_hard_drive will make most configs use this path for downloading and reading embeddings/models/datasets.
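A minimal sketch of applying such an override from Python, assuming the variable is set before DeepPavlov parses the config (exporting it in the shell before launching a script works the same way):

import os

# must be set before the configuration file is parsed; the path is illustrative
os.environ['DP_ROOT_PATH'] = '/my_path/to/large_hard_drive'

from deeppavlov import train_model  # subsequent train_model calls pick up the override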
There are two abstract classes for trainable components: ~deeppavlov.core.models.estimator.Estimator and ~deeppavlov.core.models.nn_model.NNModel.
An ~deeppavlov.core.models.estimator.Estimator is fit once on any data, with no batching or early stopping, so fitting can safely be done at the time of pipeline initialization. The fit method has to be implemented for each ~deeppavlov.core.models.estimator.Estimator. One example is ~deeppavlov.core.data.vocab.Vocab.
An ~deeppavlov.core.models.nn_model.NNModel requires more complex training: it can only be trained in a supervised mode (as opposed to an ~deeppavlov.core.models.estimator.Estimator, which can be trained in both supervised and unsupervised settings), and the process takes multiple epochs with periodic validation and logging. The ~deeppavlov.core.models.nn_model.NNModel.train_on_batch method has to be implemented for each ~deeppavlov.core.models.nn_model.NNModel.
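As an illustration, here is a hedged sketch of a custom Estimator; the token_counter codename and all of its logic are invented for this example, and persistence is stubbed out:

from deeppavlov.core.common.registry import register
from deeppavlov.core.models.estimator import Estimator

@register('token_counter')  # hypothetical codename for illustration
class TokenCounter(Estimator):
    """Counts token frequencies over the whole dataset in a single pass."""

    def __init__(self, save_path=None, load_path=None, **kwargs):
        super().__init__(save_path=save_path, load_path=load_path, **kwargs)
        self.counts = {}

    def fit(self, tokens_batch):
        # called once with the data listed in "fit_on", before training starts
        for tokens in tokens_batch:
            for token in tokens:
                self.counts[token] = self.counts.get(token, 0) + 1

    def __call__(self, tokens_batch):
        # replace every token with its dataset-wide frequency
        return [[self.counts.get(t, 0) for t in tokens] for tokens in tokens_batch]

    def save(self):  # persistence is omitted in this sketch
        pass

    def load(self):
        pass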
Training is triggered by the ~deeppavlov.train_model function.
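For example, a training run can be started from Python as follows (the config path is illustrative):

from deeppavlov import train_model

# trains the pipeline described in the config and returns the trained chainer
model = train_model('my_config.json')
model(['Hello, how are you?'])  # the resulting pipeline is ready for inference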
~deeppavlov.core.models.estimator.Estimators that are trained should also have a fit_on parameter, which contains a list of input parameter names. An ~deeppavlov.core.models.nn_model.NNModel should have the in_y parameter, which contains a list of ground truth answer names. For example:
[
{
"id": "classes_vocab",
"class_name": "default_vocab",
"fit_on": ["y"],
"level": "token",
"save_path": "vocabs/classes.dict",
"load_path": "vocabs/classes.dict"
},
{
"in": ["x"],
"in_y": ["y"],
"out": ["y_predicted"],
"class_name": "intent_model",
"save_path": "classifiers/intent_cnn",
"load_path": "classifiers/intent_cnn",
"classes_vocab": {
"ref": "classes_vocab"
}
}
]
The config for training the pipeline should have three additional elements: dataset_reader, dataset_iterator and train:
{
"dataset_reader": {
"class_name": ...,
...
},
"dataset_iterator": {
"class_name": ...,
...
},
"chainer": {
...
},
"train": {
...
}
}
A simplified version of the training pipeline contains two elements: dataset and train. The dataset element currently can be used for training on classification data in csv and json formats. You can find complete examples of how to use the simplified training pipeline in the intents_sample_csv.json <classifiers/intents_sample_csv.json> and intents_sample_json.json <classifiers/intents_sample_json.json> config files.
The train element can contain a class_name parameter that references a trainer class (the default value is nn_trainer <deeppavlov.core.trainers.NNTrainer>). All other parameters will be passed as keyword arguments to the trainer class's constructor.
"train": {
"class_name": "nn_trainer",
"metrics": [
"f1",
{
"name": "accuracy",
"inputs": ["y", "y_labels"]
},
{
"name": "roc_auc",
"inputs": ["y", "y_probabilities"]
}
],
...
}
Each metric can be described as a JSON object with name and inputs properties, where name is the registered name of a metric function and inputs is a list of parameter names from the chainer's inner memory that will be passed to the metric function. If a metric is described as a single string, this string is interpreted as a registered name. The default value of the inputs parameter is a concatenation of the chainer's in_y and out parameters.

The ~deeppavlov.core.data.dataset_reader.DatasetReader class reads data and returns it in a specified format. A concrete DatasetReader class should be inherited from this base class and registered with a codename:
from deeppavlov.core.common.registry import register
from deeppavlov.core.data.dataset_reader import DatasetReader

@register('dstc2_datasetreader')
class DSTC2DatasetReader(DatasetReader):
    def read(self, data_path: str, **kwargs) -> dict:
        # should return a dict with 'train', 'valid' and 'test' data
        ...
~deeppavlov.core.data.data_learning_iterator.DataLearningIterator forms the sets of data ('train', 'valid', 'test') needed for training/inference and divides them into batches. A concrete DataLearningIterator class should be registered and can be inherited from the deeppavlov.core.data.data_learning_iterator.DataLearningIterator class, which is a base class and can be used as a DataLearningIterator as well.
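A hedged sketch of such a subclass; the shuffled_iterator codename is invented for this example, and the split override is only a stub:

from deeppavlov.core.common.registry import register
from deeppavlov.core.data.data_learning_iterator import DataLearningIterator

@register('shuffled_iterator')  # hypothetical codename for illustration
class ShuffledIterator(DataLearningIterator):
    def split(self, *args, **kwargs):
        # self.train, self.valid and self.test already hold the DatasetReader
        # output; custom re-splitting of the data would go here
        pass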
~deeppavlov.core.data.data_fitting_iterator.DataFittingIterator iterates over the provided dataset without train/valid/test splitting and is useful for ~deeppavlov.core.models.estimator.Estimators that do not require training.
All components inherited from the ~deeppavlov.core.models.component.Component abstract class can be used for inference. The __call__ method should return the standard output of a component: for example, a tokenizer should return tokens, a NER recognizer should return recognized entities, and a bot should return an utterance. The particular format of the returned data should be defined in __call__.

Inference is triggered by the ~deeppavlov.core.commands.infer.interact_model function. There is no need for a separate JSON config for inference.
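For example, batch or interactive inference can be run on the same config used for training (the path is illustrative):

from deeppavlov import build_model
from deeppavlov.core.commands.infer import interact_model

model = build_model('my_config.json')       # load the pipeline for inference
model(['What is the weather like today?'])  # batch inference
interact_model('my_config.json')            # or chat with the model in a loop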
Each DeepPavlov model is determined by its configuration file. You can use existing config files or create your own, and you can also take a config file and modify the preprocessors/tokenizers/embedders/vectorizers in it. The components below have the same interface and are responsible for the same functions, therefore they can be used in the same parts of a config pipeline.

Here is a list of useful ~deeppavlov.core.models.component.Components aimed at preprocessing, postprocessing and vectorizing your data.
A preprocessor is a component that processes a batch of samples.
Already implemented universal preprocessors of tokenized texts (each sample is a list of tokens):
- ~deeppavlov.models.preprocessors.char_splitter.CharSplitter (registered as char_splitter) splits every token in a given batch of tokenized samples into a sequence of characters.
- ~deeppavlov.models.preprocessors.mask.Mask (registered as mask) returns a binary mask of corresponding length (padded up to the maximum length per batch).
- ~deeppavlov.models.preprocessors.russian_lemmatizer.PymorphyRussianLemmatizer (registered as pymorphy_russian_lemmatizer) performs lemmatization for the Russian language.
- ~deeppavlov.models.preprocessors.sanitizer.Sanitizer (registered as sanitizer) removes all combining characters, such as diacritical marks, from tokens.
Already implemented universal preprocessors of non-tokenized texts (each sample is a string):

- ~deeppavlov.models.preprocessors.dirty_comments_preprocessor.DirtyCommentsPreprocessor (registered as dirty_comments_preprocessor) preprocesses samples by converting them to lowercase, rephrasing English combinations with an apostrophe ', and collapsing repetitions of more than three identical symbols to two symbols.
- ~deeppavlov.models.preprocessors.str_lower.str_lower converts samples to lowercase.
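For example, str_lower can be called directly on a batch of strings:

from deeppavlov.models.preprocessors.str_lower import str_lower

str_lower(['Hello World!'])  # -> ['hello world!']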
Already implemented universal preprocessors of other types of features:

- ~deeppavlov.models.preprocessors.one_hotter.OneHotter (registered as one_hotter) performs a one-hot encoding operation on a batch of samples where each sample is an integer label or a list of integer labels (both can be combined in one batch). If the multi_label parameter is set to True, it returns one one-dimensional vector per sample with several elements equal to 1.
A tokenizer is a component that processes a batch of samples (each sample is a text string).
- ~deeppavlov.models.tokenizers.lazy_tokenizer.LazyTokenizer (registered as lazy_tokenizer) tokenizes using nltk.word_tokenize.
- ~deeppavlov.models.tokenizers.nltk_tokenizer.NLTKTokenizer (registered as nltk_tokenizer) tokenizes using tokenizers from nltk.tokenize, e.g. nltk.tokenize.wordpunct_tokenize (see the usage sketch after this list).
- ~deeppavlov.models.tokenizers.nltk_moses_tokenizer.NLTKMosesTokenizer (registered as nltk_moses_tokenizer) tokenizes and detokenizes using nltk.tokenize.moses.MosesTokenizer and nltk.tokenize.moses.MosesDetokenizer.
- ~deeppavlov.models.tokenizers.ru_sent_tokenizer.RuSentTokenizer (registered as ru_sent_tokenizer) is a rule-based tokenizer for the Russian language.
- ~deeppavlov.models.tokenizers.ru_tokenizer.RussianTokenizer (registered as ru_tokenizer) tokenizes or lemmatizes Russian texts using nltk.tokenize.toktok.ToktokTokenizer.
- ~deeppavlov.models.tokenizers.spacy_tokenizer.StreamSpacyTokenizer (registered as stream_spacy_tokenizer) tokenizes or lemmatizes texts with spaCy, using the en_core_web_sm model by default.
- ~deeppavlov.models.tokenizers.split_tokenizer.SplitTokenizer (registered as split_tokenizer) tokenizes using the string method split.
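A hedged usage sketch of NLTKTokenizer; the exact constructor signature may differ between library versions:

from deeppavlov.models.tokenizers.nltk_tokenizer import NLTKTokenizer

# the "tokenizer" argument is assumed to select a callable from nltk.tokenize
tokenizer = NLTKTokenizer(tokenizer='wordpunct_tokenize')
tokenizer(['Hello, world!'])  # -> [['Hello', ',', 'world', '!']]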
An embedder is a component that converts every token in a tokenized batch into a vector of a particular dimension (optionally, it returns a single vector per sample).
- ~deeppavlov.models.embedders.glove_embedder.GloVeEmbedder (registered as glove) reads an embedding file in GloVe format (the file starts with a number_of_words embeddings_dim line followed by word embedding_vector lines). If mean is set, it returns one vector per sample: the mean of the embedding vectors of its tokens.
- ~deeppavlov.models.embedders.fasttext_embedder.FasttextEmbedder (registered as fasttext) reads an embedding file in fastText format. If mean is set, it returns one vector per sample: the mean of the embedding vectors of its tokens (see the sketch after this list).
- ~deeppavlov.models.embedders.bow_embedder.BoWEmbedder (registered as bow) performs one-hot encoding of tokens using a pre-built vocabulary.
- ~deeppavlov.models.embedders.tfidf_weighted_embedder.TfidfWeightedEmbedder (registered as tfidf_weighted) accepts an embedder, a tokenizer (for detokenization; by default, tokens are joined with spaces), a TFIDF vectorizer or counter vocabulary, and, optionally, a tags vocabulary (to assign additional multiplicative weights to particular tags). If mean is set, it returns one vector per sample: the mean of the embedding vectors of its tokens.
- ~deeppavlov.models.embedders.elmo_embedder.ELMoEmbedder (registered as elmo) converts tokens to pre-trained contextual representations from large-scale bidirectional language models. See examples here.
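A hedged sketch of loading fastText embeddings; the file path is a placeholder, and the constructor arguments follow the description above:

from deeppavlov.models.embedders.fasttext_embedder import FasttextEmbedder

# "wiki.en.bin" is a placeholder path to a pre-trained fastText binary
embedder = FasttextEmbedder(load_path='wiki.en.bin', mean=True)
embedder([['how', 'are', 'you']])  # -> one averaged vector per sample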
A vectorizer is a component that converts a batch of text samples into a batch of vectors.
- ~deeppavlov.models.sklearn.sklearn_component.SklearnComponent (registered as sklearn_component) is a DeepPavlov wrapper for most sklearn estimators, vectorizers, etc. For example, to get a TFIDF vectorizer, one should set model_class to sklearn.feature_extraction.text:TfidfVectorizer and infer_method to transform in the config, and pass load_path, save_path and other sklearn model parameters (see the config sketch after this list).
- ~deeppavlov.models.vectorizers.hashing_tfidf_vectorizer.HashingTfIdfVectorizer (registered as hashing_tfidf_vectorizer) implements a hashing version of the usual TFIDF vectorizer. It creates a TFIDF matrix of size [n_documents X n_features(hash_size)] from a collection of documents.
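A sketch of such a config fragment, shown as a Python dict; the paths and the x_tfidf output name are illustrative:

tfidf_vectorizer = {
    "class_name": "sklearn_component",
    "model_class": "sklearn.feature_extraction.text:TfidfVectorizer",
    "infer_method": "transform",
    "save_path": "vectorizers/tfidf.pkl",
    "load_path": "vectorizers/tfidf.pkl",
    "in": ["x"],
    "fit_on": ["x"],
    "out": ["x_tfidf"]
}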