Annotators


Overview

Annotators are components (connectors/services) that annotate a given user's utterance.

An example of an annotator is NER (Named Entity Recognition), which may return a dictionary with tokens and tags keys:

{"tokens": ["Paris"], "tags": ["I-LOC"]}

Another example is the Sentiment Classification annotator. It can return a list of labels, e.g.:

["neutral", "speech"]

Available English Annotators

| Name | Requirements | Description |
|------|--------------|-------------|
| ASR | 40 MiB RAM | calculates the overall ASR confidence for a given utterance and grades it as either very low, low, medium, or high (for Amazon markup) |
| Badlisted words | 150 MiB RAM | detects words and phrases from the badlist |
| Combined classification | 1.5 GiB RAM, 3.5 GiB GPU | BERT-based model covering topic classification, dialog act classification, sentiment, toxicity, emotion, and factoid classification |
| COMeT Atomic | 2 GiB RAM, 1.1 GiB GPU | commonsense prediction model COMeT Atomic |
| COMeT ConceptNet | 2 GiB RAM, 1.1 GiB GPU | commonsense prediction model COMeT ConceptNet |
| Convers Evaluator Annotator | 1 GiB RAM, 4.5 GiB GPU | trained on Alexa Prize data from previous competitions; predicts whether a candidate response is interesting, comprehensible, on-topic, engaging, or erroneous |
| Entity Detection | 1.5 GiB RAM, 3.2 GiB GPU | extracts entities and their types from utterances |
| Entity Linking | 640 MiB RAM | finds Wikidata entity ids for the entities detected with Entity Detection |
| Entity Storer | 220 MiB RAM | a rule-based component that stores entities from the user's and socialbot's utterances if an opinion expression is detected with patterns or the MIDAS Classifier, and saves them to the dialogue state along with the detected attitude |
| Fact Random | 50 MiB RAM | returns random facts for a given entity (for entities from the user utterance) |
| Fact Retrieval | 7.4 GiB RAM, 1.2 GiB GPU | extracts facts from Wikipedia and wikiHow |
| Intent Catcher | 1.7 GiB RAM, 2.4 GiB GPU | classifies user utterances into a number of predefined intents, trained on a set of phrases and regular expressions |
| KBQA | 2 GiB RAM, 1.4 GiB GPU | answers the user's factoid questions based on the Wikidata knowledge base |
| MIDAS Classification | 1.1 GiB RAM, 4.5 GiB GPU | BERT-based model trained on a semantic-classes subset of the MIDAS dataset |
| MIDAS Predictor | 30 MiB RAM | BERT-based model trained on a semantic-classes subset of the MIDAS dataset |
| NER | 2.2 GiB RAM, 5 GiB GPU | extracts person names, names of locations, and organizations from uncased text |
| News API annotator | 80 MiB RAM | extracts the latest news about entities or topics using the GNews API; DeepPavlov Dream deployments utilize our own API key |
| Personality Catcher | 30 MiB RAM | |
| Rake keywords | 40 MiB RAM | extracts keywords from utterances with the help of the RAKE algorithm |
| Relative Persona Extractor | 50 MiB RAM | uses Sentence Ranker to rank persona sentences and selects the N_SENTENCES_TO_RETURN most relevant ones |
| Sentrewrite | 200 MiB RAM | rewrites the user's utterances by replacing pronouns with specific names that provide more useful information to downstream components |
| Sentseg | 1 GiB RAM | handles long and complex user utterances by splitting them into sentences and recovering punctuation |
| Spacy Nounphrases | 180 MiB RAM | extracts noun phrases using Spacy and filters out generic ones |
| Speech Function Classifier | | a hierarchical algorithm based on several linear models and a rule-based approach that predicts the speech functions described by Eggins and Slade |
| Speech Function Predictor | | yields probabilities of the speech functions that can follow the speech function predicted by the Speech Function Classifier |
| Spelling Preprocessing | 30 MiB RAM | pattern-based component that rewrites various colloquial expressions in a more formal conversational style |
| Topic recommendation | 40 MiB RAM | offers a topic for further conversation based on previously discussed topics and the user's preferences; the current version is based on Reddit personalities (see the Dream Report for Alexa Prize 4) |
| User Persona Extractor | 40 MiB RAM | determines which age category the user belongs to based on a set of keywords |
| Wiki Parser | 100 MiB RAM | extracts Wikidata triplets for the entities detected with Entity Linking |
| Wiki Facts | 1.7 GiB RAM | |

Available Russian Annotators

| Name | Requirements | Description |
|------|--------------|-------------|
| Badlisted words | 50 MiB RAM | detects obscene words based on lemmatization with pymorphy2 and a lookup in a dictionary of obscene words |
| Entity Detection | 3 GiB RAM | extracts (non-named) entities and determines their types for lowercased Russian text, based on a ruBERT (PyTorch) neural model |
| Entity Linking | 300 MiB RAM | finds Wikidata ids for the entities extracted with Entity Detection, based on a distilled ruBERT model |
| Intent Catcher | 1.8 GiB RAM, 5 GiB GPU | detects special user intents based on the multilingual Universal Sentence Encoder model |
| NER | 1.8 GiB RAM, 5 GiB GPU | extracts named entities from lowercased Russian text using a Conversational ruBERT (PyTorch) neural model |
| Sentseg | 2.4 GiB RAM, 5 GiB GPU | restores punctuation in lowercased Russian text using a ruBERT (PyTorch) neural model trained on Russian-language subtitles |
| Spacy Annotator | 250 MiB RAM | tokenizes text and annotates tokens using the spacy library with its "ru_core_news_sm" model |
| Spelling Preprocessing | 4.5 GiB RAM | corrects typos and grammatical errors using a Levenshtein-distance model; relies on a pretrained model from the DeepPavlov library |
| Toxic Classification | 1.9 GiB RAM, 1.3 GiB GPU | toxicity classifier from Skoltech for filtering user utterances |
| Wiki Parser | 100 MiB RAM | extracts Wikidata triplets for the entities extracted with Entity Detection |
| DialogRPT | 3.9 GiB RAM, 2.2 GiB GPU | estimates the probability that a reply will be liked by the user (updown), using the DialogRPT ranking model fine-tuned on top of the Russian DialoGPT generative model on comments from the Pikabu website |

Developing Own Annotator

TBD

Registering Own Annotator

A new annotator is registered by adding its configuration to docker-compose.yml, located in the root directory, and to pipeline_conf.json, located in the /agent subdirectory of the repository. After adding the annotator, build and run it with docker-compose -f docker-compose.yml up --build _annotator_name_, then rebuild and run the agent with docker-compose -f docker-compose.yml build agent followed by docker-compose -f docker-compose.yml up agent, as shown in the sketch below.
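As a rough sketch, registering a hypothetical annotator called my-annotator could look like this (the service name, build context, and port are illustrative assumptions; real entries in the repository carry additional deployment settings, and a matching entry in pipeline_conf.json must point the agent at the service's HTTP endpoint):

```yaml
# docker-compose.yml (root directory) -- hypothetical service entry
services:
  my-annotator:
    build:
      context: ./annotators/my_annotator
    ports:
      - 8150:8150
```

```sh
# build and run the new annotator
docker-compose -f docker-compose.yml up --build my-annotator
# rebuild the agent so it picks up the updated pipeline, then run it
docker-compose -f docker-compose.yml build agent
docker-compose -f docker-compose.yml up agent
```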

You must register an annotator before you can test it with your own copy of Deepy.

Provided Annotators

Currently, we provide two annotators for the basic distribution:

  • Emotion Classification
  • Spell Checking Classification

We plan to eventually ship them as part of the DeepPavlov Library and then re-add them to the Deepy distributions as native DeepPavlov Library components.

Resources