Annotators


Overview

Annotators are components (connectors/services) that annotate a given user's utterance.

An example of an annotator is NER (Named Entity Recognition), which may return a dictionary with tokens and tags keys:

{"tokens": ["Paris"], "tags": ["I-LOC"]}

Another example is the Sentiment Classification annotator. It can return a list of labels, e.g.:

["neutral", "speech"]

Available English Annotators

| Name | Requirements | Description |
|------|--------------|-------------|
| ASR | 40 MiB RAM | calculates the overall ASR confidence for a given utterance and grades it as either very low, low, medium, or high (for Amazon markup) |
| Badlisted words | 150 MiB RAM | detects words and phrases from the badlist |
| Combined classification | 1.5 GiB RAM, 3.5 GiB GPU | BERT-based model covering topic classification, dialog act classification, sentiment, toxicity, emotion, and factoid classification |
| COMeT Atomic | 2 GiB RAM, 1.1 GiB GPU | commonsense prediction model COMeT Atomic |
| COMeT ConceptNet | 2 GiB RAM, 1.1 GiB GPU | commonsense prediction model COMeT ConceptNet |
| Convers Evaluator Annotator | 1 GiB RAM, 4.5 GiB GPU | trained on Alexa Prize data from previous competitions; predicts whether a candidate response is interesting, comprehensible, on-topic, engaging, or erroneous |
| Entity Detection | 1.5 GiB RAM, 3.2 GiB GPU | extracts entities and their types from utterances |
| Entity Linking | 640 MiB RAM | finds Wikidata entity ids for the entities detected with Entity Detection |
| Entity Storer | 220 MiB RAM | a rule-based component that stores entities from the user's and socialbot's utterances if an opinion expression is detected with patterns or the MIDAS Classifier, and saves them to the dialogue state along with the detected attitude |
| Fact Random | 50 MiB RAM | returns random facts for a given entity (for entities from the user utterance) |
| Fact Retrieval | 7.4 GiB RAM, 1.2 GiB GPU | extracts facts from Wikipedia and wikiHow |
| Intent Catcher | 1.7 GiB RAM, 2.4 GiB GPU | classifies user utterances into a number of predefined intents, trained on a set of phrases and regular expressions |
| KBQA | 2 GiB RAM, 1.4 GiB GPU | answers the user's factoid questions based on the Wikidata knowledge base |
| MIDAS Classification | 1.1 GiB RAM, 4.5 GiB GPU | BERT-based model trained on a semantic-classes subset of the MIDAS dataset |
| MIDAS Predictor | 30 MiB RAM | BERT-based model trained on a semantic-classes subset of the MIDAS dataset |
| NER | 2.2 GiB RAM, 5 GiB GPU | extracts person names, names of locations, and organizations from uncased text |
| News API annotator | 80 MiB RAM | extracts the latest news about entities or topics using the GNews API; DeepPavlov Dream deployments utilize our own API key |
| Personality Catcher | 30 MiB RAM | |
| Rake keywords | 40 MiB RAM | extracts keywords from utterances with the help of the RAKE algorithm |
| Relative Persona Extractor | 50 MiB RAM | uses Sentence Ranker to rank persona sentences and selects the N_SENTENCES_TO_RETURN most relevant ones |
| Sentrewrite | 200 MiB RAM | rewrites the user's utterances by replacing pronouns with specific names that provide more useful information to downstream components |
| Sentseg | 1 GiB RAM | handles long and complex user utterances by splitting them into sentences and recovering punctuation |
| Spacy Nounphrases | 180 MiB RAM | extracts noun phrases using Spacy and filters out generic ones |
| Speech Function Classifier | | a hierarchical algorithm based on several linear models and a rule-based approach that predicts the speech functions described by Eggins and Slade |
| Speech Function Predictor | | yields probabilities of the speech functions that can follow the speech function predicted by the Speech Function Classifier |
| Spelling Preprocessing | 30 MiB RAM | pattern-based component that rewrites various colloquial expressions in a more formal conversational style |
| Topic recommendation | 40 MiB RAM | offers a topic for further conversation based on previously discussed topics and the user's preferences; the current version is based on Reddit personalities (see the Dream Report for Alexa Prize 4) |
| User Persona Extractor | 40 MiB RAM | determines which age category the user belongs to based on a set of keywords |
| Wiki Parser | 100 MiB RAM | extracts Wikidata triplets for the entities detected with Entity Linking |
| Wiki Facts | 1.7 GiB RAM | |

Available Russian Annotators

| Name | Requirements | Description |
|------|--------------|-------------|
| Badlisted words | 50 MiB RAM | detects obscene words based on lemmatization with pymorphy2 and a lookup in a dictionary of obscene words |
| Entity Detection | 3 GiB RAM | extracts (non-named) entities and determines their types for lowercased Russian text, based on a ruBERT (PyTorch) neural model |
| Entity Linking | 300 MiB RAM | finds Wikidata ids for the entities extracted with Entity Detection, based on a distilled ruBERT model |
| Intent Catcher | 1.8 GiB RAM, 5 GiB GPU | detects special user intents based on the multilingual Universal Sentence Encoder model |
| NER | 1.8 GiB RAM, 5 GiB GPU | extracts named entities from lowercased Russian text using a Conversational ruBERT (PyTorch) neural model |
| Sentseg | 2.4 GiB RAM, 5 GiB GPU | restores punctuation in lowercased Russian text using a ruBERT (PyTorch) neural model trained on Russian-language subtitles |
| Spacy Annotator | 250 MiB RAM | tokenizes text and annotates tokens using the spacy library with its "ru_core_news_sm" model |
| Spelling Preprocessing | 4.5 GiB RAM | corrects typos and grammatical errors using a Levenshtein-distance model; relies on a pretrained model from the DeepPavlov library |
| Toxic Classification | 1.9 GiB RAM, 1.3 GiB GPU | toxicity classifier from Skoltech for filtering user utterances |
| Wiki Parser | 100 MiB RAM | extracts Wikidata triplets for the entities extracted with Entity Detection |
| DialogRPT | 3.9 GiB RAM, 2.2 GiB GPU | estimates the probability that a reply will be liked by the user (updown), using the DialogRPT ranking model fine-tuned on top of the Russian DialoGPT generative model on comments from the Pikabu website |

Developing Own Annotator

TBD

Registering Own Annotator

A new annotator is registered by adding its configuration to docker-compose.yml, located in the root directory, and to pipeline_conf.json, located in the /agent subdirectory of the repository. After adding the annotator, build and run it with docker-compose -f docker-compose.yml up --build _annotator_name_, then rebuild and run the agent with docker-compose -f docker-compose.yml build agent followed by docker-compose -f docker-compose.yml up agent, as shown in the sketch below.
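As a rough sketch, registering a hypothetical annotator called my-annotator could look like this (the service name, build context, and port are illustrative assumptions; real entries in the repository carry additional deployment settings, and a matching entry in pipeline_conf.json must point the agent at the service's HTTP endpoint):

```yaml
# docker-compose.yml (root directory) -- hypothetical service entry
services:
  my-annotator:
    build:
      context: ./annotators/my_annotator
    ports:
      - 8150:8150
```

```sh
# build and run the new annotator
docker-compose -f docker-compose.yml up --build my-annotator
# rebuild the agent so it picks up the updated pipeline, then run it
docker-compose -f docker-compose.yml build agent
docker-compose -f docker-compose.yml up agent
```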

You must register an annotator before you can test it with your own copy of Deepy.

Provided Annotators

Currently, we provide two annotators for the basic distribution:

  • Emotion Classification
  • Spell Checking Classification

We plan to eventually ship them as part of the DeepPavlov Library and then re-add them to the Deepy distributions as native DeepPavlov Library components.

Resources