Interactive usage of Doccano for semisupervised learning and interactive machine teaching #6

fm444fm · 2018-09-26T09:41:41Z

Is it possible to use doccano interactively, i.e. have it exchange data with another service, for example a Python program (standalone or inside a Jupyter notebook), through an API?

The use case I am envisioning is to connect doccano to a spaCy NLP pipeline: the sequences annotated by spaCy are fed into doccano for review and annotation, allowing the user to correct mistakes in the assignments (e.g. NER or POS tags). Whenever the user confirms a sentence, it is sent back to the pipeline controller, which uses it for dynamic training of the internal model to improve its efficiency.

In other words, using doccano for semi-supervised learning with interactive machine teaching.

If yes, please explain briefly how to go about the integration and data exchange. So far I have only seen that it can export the annotations as CSV in IOB format.
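For reference, the IOB layout mentioned here can be illustrated with a minimal sketch that turns character-offset entity spans into per-token IOB tags. The whitespace tokenizer, the function name spans_to_iob, and the (start, end, label) tuple layout are assumptions made for illustration; this is not doccano's actual exporter.

```python
def spans_to_iob(text, spans):
    """Tag whitespace-separated tokens with IOB labels.

    spans is a list of (start_char, end_char, label) with end exclusive.
    A token is tagged only if it lies entirely inside a span.
    """
    tagged = []
    pos = 0
    for token in text.split():
        start = text.index(token, pos)  # character offset of this token
        end = start + len(token)
        pos = end
        tag = "O"
        for s, e, label in spans:
            if start >= s and end <= e:
                # first token of the span gets B-, the rest get I-
                tag = ("B-" if start == s else "I-") + label
                break
        tagged.append((token, tag))
    return tagged


if __name__ == "__main__":
    sentence = "Barack Obama visited Paris"
    entities = [(0, 12, "PERSON"), (21, 26, "GPE")]
    for token, tag in spans_to_iob(sentence, entities):
        print(token, tag)
```

The character offsets used here are the same kind of start/end positions that spaCy exposes as ent.start_char and ent.end_char.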

icoxfog417 · 2018-09-27T00:05:07Z

Thank you for the nice feature request! We intend doccano to support active learning and semi-supervised scenarios.

You can call the API because doccano has a REST API based architecture. However, we are currently refactoring the code and documentation, so please give us a little time to show an example 🙏

We are also considering a feature for importing a labeled dataset. It is easier than labeling dynamically through the API. Would this import feature be enough for your request?

API Based Labeling

Pros: Dynamically label the data and train the external model (online learning).
Cons: A little hard to implement due to security concerns (it requires external API calls), and the user has to launch an external model server.

Labeled Data Import

Pros: Easy to implement and easy to use.
Cons: Cannot support a dynamic (online) process.
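The dynamic option can be sketched as a polling loop: an external program repeatedly asks for confirmed examples and passes only the new ones to a model. Everything in this sketch is hypothetical. fetch_confirmed and update_model stand in for whatever the doccano API and the external model server would provide, and the "id" key is an assumed field.

```python
def online_labeling_loop(fetch_confirmed, update_model, rounds):
    """Poll for confirmed examples and feed only unseen ones to the model.

    fetch_confirmed: callable returning a list of dicts with an "id" key.
    update_model: callable receiving the list of newly confirmed examples.
    """
    seen = set()
    for _ in range(rounds):
        batch = [ex for ex in fetch_confirmed() if ex["id"] not in seen]
        if batch:
            update_model(batch)
            seen.update(ex["id"] for ex in batch)
    return seen


if __name__ == "__main__":
    confirmed = [{"id": 1, "text": "Obama", "label": "PERSON"}]
    updates = []
    online_labeling_loop(lambda: list(confirmed), updates.append, rounds=2)
    print(len(updates), "model update(s)")
```

With the import option there is no loop: the labeled data is loaded once and the model never sees later corrections.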

fm444fm · 2018-10-09T18:48:57Z

Thank you for the answers. The labeled data import is definitely easier, as you say, both to develop and to use. On the other hand, it would not offer as many opportunities as the other option: if doccano could be used for online/dynamic learning, I think it could become an even more interesting and flexible application.

jamesmf · 2018-10-15T02:19:11Z

If the API piece is tricky because of security and the inconvenience of needing a separate service, and the labeled data import is static, what about an admin-only page for applying a model to unlabeled examples and potentially also updating the model?

When creating a project, you could specify a model to associate with the project, perhaps using a module name (autolabeler_module="en_core_web_sm" for example)
When clicking an "apply_model" button on the project card, the app would pull all unlabeled documents, apply a specific function of the model, and save the result to annotations

Might be a middle-ground in terms of both functionality and implementation difficulty.
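A minimal sketch of the "module name" idea, assuming the named module exposes a load() function the way spaCy model packages such as en_core_web_sm do. The helper name load_autolabeler is hypothetical.

```python
import importlib


def load_autolabeler(module_name):
    """Import a module by name and return the result of its load() function.

    Any module used here is assumed to follow the spaCy model-package
    convention of exposing a callable named load.
    """
    module = importlib.import_module(module_name)
    loader = getattr(module, "load", None)
    if not callable(loader):
        raise ValueError(f"{module_name} has no callable load()")
    return loader()
```

A project setting like autolabeler_module="en_core_web_sm" could then be resolved with load_autolabeler(project_setting) when the button is clicked, provided the package is installed on the server.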

jamesmf · 2018-10-16T03:29:29Z

I implemented a Command class that lets you run python manage.py autolabel project_id and it loads a model and labels all of the documents in that project. In its current state it also adds new labels to the project if they don't exist (that should probably be optional)

The main problem with this implementation is that I don't have a great way to specify a document has been reviewed. So while I can add Annotations to every Document and specify that they are all manual=False (which lets me pull only the documents that were labeled manually for training), what I don't have is a way of changing those annotations from manual=False to reviewed or something similar.

I'm imagining that the simplest way to do that would be to either

add an endpoint for 'reviewed document' that gets called every time a user clicks past an autolabeled document
add a button (preferably shortcut-keyed) that specifies "this user has reviewed the automatic annotations and approved them"
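Either option amounts to tracking three states instead of a boolean. A minimal sketch of that idea in plain Python follows; the names Origin, Annotation, approve_document, and training_set are hypothetical and are not taken from the doccano models.

```python
from dataclasses import dataclass
from enum import Enum


class Origin(Enum):
    MANUAL = "manual"      # created by a human annotator
    AUTO = "auto"          # produced by a model, not yet checked
    REVIEWED = "reviewed"  # produced by a model, approved by a human


@dataclass
class Annotation:
    doc_id: int
    label: str
    origin: Origin


def approve_document(annotations, doc_id):
    """Mark every automatic annotation of one document as reviewed."""
    for ann in annotations:
        if ann.doc_id == doc_id and ann.origin is Origin.AUTO:
            ann.origin = Origin.REVIEWED


def training_set(annotations):
    """Keep only annotations that a human wrote or approved."""
    return [a for a in annotations if a.origin is not Origin.AUTO]
```

The "reviewed document" endpoint or the approval button would both end up calling something like approve_document for the current document.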

How does the feature/autolabeling branch intend on handling the difference between a manual=False annotation and a "reviewed" annotation?

Here's the implementation. Once I come up with a decent means of iterating on this, I'll PR something similar

import string

from django.core.management.base import BaseCommand, CommandError
from server.models import (Document, Project, Label,
                           SequenceAnnotation, User)

# if we want to add new labels, decide on some new colors for them
SOME_COLORS_TO_CHOOSE_FROM = ["#001f3f", "#0074D9", "#7FDBFF",
                              "#39CCCC", "#3D9970", "#2ECC40",
                              "#01FF70", "#FFDC00", "#FF4136",
                              "#FF851B", "#85144b", "#B10DC9",
                              "#111111"]


def get_new_shortcut(proj_id):
    """
    Since (project_id, shortcut_key) must be unique for a set of labels,
    we need to check what's taken and pick a unique shortcut key if we want
    to add a new label
    """
    labels = Label.objects.filter(project_id=proj_id)
    existing = {label.shortcut for label in labels}
    available = sorted(set(string.ascii_lowercase) - existing)
    if not available:
        raise CommandError("no free shortcut key left in project %s" % proj_id)
    return available[0]


def load_model(model_str):
    """
    Loads a model given an input string (could work differently)
    """
    if model_str == "en_core_web_md":
        # imported lazily so the command can be listed without the model
        import en_core_web_md
        return en_core_web_md.load()
    # "seed_data" was only a placeholder and is not implemented
    raise CommandError("unknown model: %s" % model_str)


class Command(BaseCommand):
    help = 'Loads a model and labels all documents in a given project'

    def add_arguments(self, parser):
        parser.add_argument('project_id', type=int)
        # nargs='?' makes the positional optional so the default applies
        parser.add_argument('model_str', nargs='?', default='en_core_web_md')

    def handle(self, *args, **options):
        """
        Loads a model, gets all documents in the given project, and calls that
        model on each document. It also creates new labels for entities that
        it finds that don't exist in your project.
        """
        project_id = options['project_id']
        model_str = options['model_str']
        # annotations are attributed to the 'admin' user, which must exist
        user = User.objects.get(username='admin')
        self.stdout.write("loading model")
        nlp_model = load_model(model_str)
        self.stdout.write("model loaded")
        project = Project.objects.get(pk=project_id)
        docs = Document.objects.filter(project_id=project_id)

        # keep track of next label color, next label shortcut
        labels_created = 0
        next_color = SOME_COLORS_TO_CHOOSE_FROM[labels_created]
        next_short = get_new_shortcut(project_id)
        for doc in docs:
            parsed = nlp_model(doc.text)
            for ent in parsed.ents:
                elabel = ent.label_
                estart = ent.start_char
                eend = ent.end_char
                proj_label, created = Label.objects.get_or_create(
                    text=elabel,
                    project=project,
                    defaults={'background_color': next_color,
                              'shortcut': next_short})
                # keep track of next label color, next label shortcut
                if created:
                    labels_created = (labels_created + 1) % len(SOME_COLORS_TO_CHOOSE_FROM)
                    next_color = SOME_COLORS_TO_CHOOSE_FROM[labels_created]
                    next_short = get_new_shortcut(project_id)
                # manual=False marks the annotation as model-generated
                seq_ann_args = dict(document=doc, user=user,
                                    label=proj_label, start_offset=estart,
                                    end_offset=eend, manual=False)
                SequenceAnnotation.objects.get_or_create(**seq_ann_args)

icoxfog417 · 2018-10-20T00:31:40Z

Thank you for the great suggestion. As you proposed, we considered a way to set a project-specific model.

One drawback of this approach is that the user has to decide on the model structure before starting annotation. You may think we could change the model during annotation (active learning), but there is research suggesting that datasets collected by active learning are not transferable (so the autolabeling feature is pending).

How transferable are the datasets collected by active learners?

Because of this, we plan to implement the simple approach first. Support for additional features will depend on future research.

DSLituiev · 2019-01-03T00:21:57Z

@icoxfog417 That piece of research is really interesting. However, practically speaking, IMO people will still use active learning even in the face of these considerations because it is much less expensive. I would ask you to reconsider including this pull request.

vivekverma239 · 2019-03-31T14:47:37Z

Hi, is anyone working on this feature?

Hironsan · 2019-05-07T06:21:25Z

doccano's internal processing is implemented through a Web API. Therefore, annotation from a program can be written as follows:

import requests


class Client(object):

    def __init__(self, entrypoint='http://127.0.0.1:8000', username=None, password=None):
        self.entrypoint = entrypoint
        self.client = requests.Session()
        self.client.auth = (username, password)

    def fetch_projects(self):
        url = f'{self.entrypoint}/v1/projects'
        response = self.client.get(url)
        return response

    def create_project(self, name, description, project_type):
        mapping = {'SequenceLabeling': 'SequenceLabelingProject',
                   'DocumentClassification': 'TextClassificationProject',
                   'Seq2seq': 'Seq2seqProject'}
        data = {
            'name': name,
            'project_type': project_type,
            'description': description,
            'guideline': 'Hello',
            'resourcetype': mapping[project_type]
        }
        url = f'{self.entrypoint}/v1/projects'
        response = self.client.post(url, json=data)
        return response.json()

    def fetch_documents(self, project_id):
        url = f'{self.entrypoint}/v1/projects/{project_id}/docs'
        response = self.client.get(url)
        return response.json()

    def add_document(self, project_id, text):
        data = {
            'text': text
        }
        url = f'{self.entrypoint}/v1/projects/{project_id}/docs'
        response = self.client.post(url, json=data)
        return response.json()

    def fetch_labels(self, project_id):
        url = f'{self.entrypoint}/v1/projects/{project_id}/labels'
        response = self.client.get(url)
        return response.json()

    def add_label(self, project_id, text):
        data = {
            'text': text
        }
        url = f'{self.entrypoint}/v1/projects/{project_id}/labels'
        response = self.client.post(url, json=data)
        return response.json()

    def fetch_annotations(self, project_id, doc_id):
        url = f'{self.entrypoint}/v1/projects/{project_id}/docs/{doc_id}/annotations'
        response = self.client.get(url)
        return response.json()

    def annotate(self, project_id, doc_id, data):
        url = f'{self.entrypoint}/v1/projects/{project_id}/docs/{doc_id}/annotations'
        response = self.client.post(url, json=data)
        return response.json()


if __name__ == '__main__':
    client = Client(username='username', password='password')
    project = client.create_project(name='NER project',
                                    description='example',
                                    project_type='SequenceLabeling')
    doc = client.add_document(project_id=project['id'],
                              text='Obama')
    label = client.add_label(project_id=project['id'],
                             text='PERSON')
    data = {
        'start_offset': 0,
        'end_offset': 5,
        'label': label['id'],
        'prob': 0.8
    }
    client.annotate(project_id=project['id'],
                    doc_id=doc['id'],
                    data=data)
    annotations = client.fetch_annotations(project_id=project['id'],
                                           doc_id=doc['id'])

In the future, we plan to enable automatic labeling from the web.
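To connect this client to a model, the model's predictions have to be turned into the payload shape used in the example above (start_offset, end_offset, label, prob). A minimal sketch, assuming predictions arrive as (start, end, label_text, prob) tuples and that label_ids maps label text to the ids returned by add_label or fetch_labels; the helper name to_annotation_payloads is hypothetical.

```python
def to_annotation_payloads(entities, label_ids):
    """Build one annotation payload per predicted entity.

    entities: list of (start_char, end_char, label_text, prob) tuples.
    label_ids: dict mapping label text to the project's label id.
    Entities whose label is not defined in the project are skipped.
    """
    payloads = []
    for start, end, label_text, prob in entities:
        if label_text not in label_ids:
            continue
        payloads.append({
            "start_offset": start,
            "end_offset": end,
            "label": label_ids[label_text],
            "prob": prob,
        })
    return payloads


if __name__ == "__main__":
    predicted = [(0, 5, "PERSON", 0.8), (10, 15, "UNKNOWN", 0.4)]
    print(to_annotation_payloads(predicted, {"PERSON": 1}))
```

Each resulting dict could then be passed as the data argument of client.annotate(project_id=..., doc_id=..., data=payload).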

Feature/admin run model

icoxfog417 · 2019-09-03T14:08:41Z

Continue the discussion on #6.

swicaksono · 2019-10-10T03:10:21Z

Hi, I implemented this service to automatically fetch any document that has already been annotated. But when I run the program, it always gets a 403 response. Why? This is quite strange, because when I open the Django REST endpoint in a browser, e.g. v1/projects, it does not return any error.

shuiruge · 2019-12-06T09:16:51Z

The 403 response is caused by the authorization. Simply replace the __init__ method with

    def __init__(self, entrypoint, username, password):
        self.entrypoint = entrypoint
        self.client = requests.Session()

        # authorization: exchange the credentials for a token
        response = self.client.post(f'{self.entrypoint}/v1/auth-token',
                                    data={'username': username,
                                          'password': password})
        response.raise_for_status()  # fail early on bad credentials
        token = response.json()['token']
        # send the token with every later request made through this session
        self.client.headers.update({'Authorization': f'Token {token}',
                                    'Accept': 'application/json'})

inspired by https://github.com/afparsons/doccano_api_client/blob/ad251cd05b49216b19a6319cb9ce1326d08ce484/doccano_api_client/__init__.py#L93
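A related safeguard: the client methods in the example call response.json() without looking at the status code, so an authorization failure such as the 403 only shows up indirectly. A minimal sketch of a helper that checks the status first; ApiError and parse_json_response are hypothetical names, and the helper only relies on the status_code, text, and json() members that requests.Response provides.

```python
class ApiError(Exception):
    """Raised when the server answers with a non-2xx status code."""


def parse_json_response(response):
    """Return the decoded JSON body, or raise ApiError with the status code."""
    if not 200 <= response.status_code < 300:
        raise ApiError(f"HTTP {response.status_code}: {response.text[:200]}")
    return response.json()
```

In the client, a line such as return response.json() could become return parse_json_response(response).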

icoxfog417 added the enhancement Improvement on existing feature label Sep 26, 2018

icoxfog417 mentioned this issue Oct 5, 2018

Loading data from API call #9

Closed

icoxfog417 mentioned this issue Oct 31, 2018

Is there any development plan of doccano? #26

Closed

Hironsan added feature request feature request for doccano and removed enhancement Improvement on existing feature labels Feb 12, 2019

icoxfog417 mentioned this issue May 9, 2019

Feature request: Add Auto-labeling #191

Closed

omriallouche referenced this issue in gong-io/doccano Jul 2, 2019

Merge pull request #6 from strelok2012/feature/admin-run-model

38f2428

Feature/admin run model

icoxfog417 closed this as completed Sep 3, 2019

rbagd mentioned this issue Oct 2, 2019

Any documentation about using as REST service #299

Closed

arianpasquali mentioned this issue May 13, 2020

Fail to login using api #771

Closed
