Annotators
Annotators are components (connectors/services) that annotate a given user's utterance.
An example of an annotator is NER: this annotator may return a dictionary with `tokens` and `tags` keys:

`{"tokens": ["Paris"], "tags": ["I-LOC"]}`

Another example is the Sentiment Classification annotator, which can return a list of labels, e.g.:

`["neutral", "speech"]`
Name | Requirements | Description |
---|---|---|
ASR | 40 MiB RAM | calculates overall ASR confidence for a given utterance and grades it as either very low, low, medium, or high (for Amazon markup) |
Badlisted words | 150 MiB RAM | detects words and phrases from the badlist |
Combined classification | 1.5 GiB RAM, 3.5 GiB GPU | BERT-based model including topic classification, dialog acts classification, sentiment, toxicity, emotion, factoid classification |
COMeT Atomic | 2 GiB RAM, 1.1 GiB GPU | Commonsense prediction models COMeT Atomic |
COMeT ConceptNet | 2 GiB RAM, 1.1 GiB GPU | Commonsense prediction models COMeT ConceptNet |
Convers Evaluator Annotator | 1 GiB RAM, 4.5 GiB GPU | is trained on the Alexa Prize data from the previous competitions and predicts whether the candidate response is interesting, comprehensible, on-topic, engaging, or erroneous |
Entity Detection | 1.5 GiB RAM, 3.2 GiB GPU | extracts entities and their types from utterances |
Entity Linking | 640 MiB RAM | finds Wikidata entity ids for the entities detected with Entity Detection |
Entity Storer | 220 MiB RAM | a rule-based component, which stores entities from the user's and socialbot's utterances if opinion expression is detected with patterns or MIDAS Classifier and saves them along with the detected attitude to dialogue state |
Fact Random | 50 MiB RAM | returns random facts for the given entity (for entities from user utterance) |
Fact Retrieval | 7.4 GiB RAM, 1.2 GiB GPU | extracts facts from Wikipedia and wikiHow |
Intent Catcher | 1.7 GiB RAM, 2.4 GiB GPU | classifies user utterances into a number of predefined intents which are trained on a set of phrases and regexps |
KBQA | 2 GiB RAM, 1.4 GiB GPU | answers user's factoid questions based on Wikidata KB |
MIDAS Classification | 1.1 GiB RAM, 4.5 GiB GPU | BERT-based model trained on a semantic classes subset of MIDAS dataset |
MIDAS Predictor | 30 MiB RAM | BERT-based model trained on a semantic classes subset of MIDAS dataset |
NER | 2.2 GiB RAM, 5 GiB GPU | extracts person names, names of locations, organizations from uncased text |
News API annotator | 80 MiB RAM | extracts the latest news about entities or topics using the GNews API. DeepPavlov Dream deployments utilize our own API key. |
Personality Catcher | 30 MiB RAM | |
Rake keywords | 40 MiB RAM | extracts keywords from utterances with the help of RAKE algorithm |
Relative Persona Extractor | 50 MiB RAM | annotator utilizing Sentence Ranker to rank persona sentences and select the N_SENTENCES_TO_RETURN most relevant sentences |
Sentrewrite | 200 MiB RAM | rewrites user's utterances by replacing pronouns with specific names that provide more useful information to downstream components |
Sentseg | 1 GiB RAM | allows us to handle long and complex user's utterances by splitting them into sentences and recovering punctuation |
Spacy Nounphrases | 180 MiB RAM | extracts nounphrases using Spacy and filters out generic ones |
Speech Function Classifier | | a hierarchical algorithm based on several linear models and a rule-based approach for the prediction of speech functions described by Eggins and Slade |
Speech Function Predictor | | yields probabilities of speech functions that can follow a speech function predicted by Speech Function Classifier |
Spelling Preprocessing | 30 MiB RAM | pattern-based component to rewrite different colloquial expressions to a more formal style of conversation |
Topic recommendation | 40 MiB RAM | offers a topic for further conversation using the information about the discussed topics and user's preferences. Current version is based on Reddit personalities (see Dream Report for Alexa Prize 4). |
User Persona Extractor | 40 MiB RAM | determines which age category the user belongs to based on some key words |
Wiki Parser | 100 MiB RAM | extracts Wikidata triplets for the entities detected with Entity Linking |
Wiki Facts | 1.7 GiB RAM | |
The Russian distribution provides the following annotators:

Name | Requirements | Description |
---|---|---|
Badlisted words | 50 MiB RAM | detects obscene words based on lemmatization with pymorphy2 and a lookup in a dictionary of obscene words |
Entity Detection | 3 GiB RAM | extracts entities (including non-named ones) and determines their types in lowercased Russian text, based on a ruBERT (PyTorch) neural model |
Entity Linking | 300 MiB RAM | finds Wikidata ids for the entities extracted with Entity Detection, based on a distilled ruBERT model |
Intent Catcher | 1.8 GiB RAM, 5 GiB GPU | detects special user intents based on the multilingual Universal Sentence Encoding model |
NER | 1.8 GiB RAM, 5 GiB GPU | extracts named entities from lowercased Russian text, based on a Conversational ruBERT (PyTorch) neural model |
Sentseg | 2.4 GiB RAM, 5 GiB GPU | restores punctuation in lowercased Russian text, based on a ruBERT (PyTorch) neural model trained on Russian subtitles |
Spacy Annotator | 250 MiB RAM | tokenizes text and annotates tokens using the spacy library and its “ru_core_news_sm” model |
Spelling Preprocessing | 4.5 GiB RAM | corrects typos and grammatical errors based on a Levenshtein-distance model; uses a pretrained model from the DeepPavlov library |
Toxic Classification | 1.9 GiB RAM, 1.3 GiB GPU | toxicity classifier from Skoltech used to filter out toxic user utterances |
Wiki Parser | 100 MiB RAM | extracts Wikidata triplets for the entities detected with Entity Detection |
DialogRPT | 3.9 GiB RAM, 2.2 GiB GPU | estimates the probability that a response will be liked by the user (updown), based on the DialogRPT ranking model fine-tuned from a Russian DialoGPT generative model on comments from the Pikabu website |
TBD
A new annotator is registered by adding its configuration to the `docker-compose.yml` file located in the root directory, and to `pipeline_conf.json` located in the `/agent` subdirectory of the repository. After adding the new annotator, build and run it with `docker-compose -f docker-compose.yml up --build _annotator_name_`, then build and run the agent with `docker-compose -f docker-compose.yml build agent` followed by `docker-compose -f docker-compose.yml up agent`.
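For illustration, a `pipeline_conf.json` annotator entry might look like the sketch below. The service name `my-annotator`, the port, and the formatter paths are hypothetical; when registering a real annotator, copy an existing entry from `pipeline_conf.json` and adapt it rather than relying on this sketch:

```json
{
  "connector": {
    "protocol": "http",
    "timeout": 1.0,
    "url": "http://my-annotator:8128/respond"
  },
  "dialog_formatter": "state_formatters.dp_formatters:last_utt_dialog",
  "response_formatter": "state_formatters.dp_formatters:simple_formatter_service",
  "state_manager_method": "add_annotation",
  "previous_services": ["annotators.spelling_preprocessing"]
}
```

The `url` host must match the service name declared for the annotator in `docker-compose.yml`, and `previous_services` lists the annotators that must run before this one.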
You must register an annotator before you can test it with your own copy of Deepy.
Currently, we provide two annotators for the basic distribution:
- Emotion Classification
- Spell Checking Classification
We plan to eventually ship them as part of the DeepPavlov Library and then re-add them to the Deepy distributions as native DeepPavlov Library components.
- Annotators @ ReadTheDocs