Interactive usage of Doccano for semisupervised learning and interactive machine teaching #6

Closed
fm444fm opened this issue Sep 26, 2018 · 11 comments
Labels
feature request feature request for doccano

Comments

@fm444fm

fm444fm commented Sep 26, 2018

Is it possible to use doccano interactively, i.e. have it exchange data with another service, for example a Python program (standalone or inside a Jupyter notebook), through an API?

The use case I am envisioning is to connect doccano to a spaCy NLP pipeline. The sequences annotated by spaCy would be fed into doccano for review, so the user can correct mistakes in the assignments (e.g. NER or POS tags). Whenever the user confirms a sentence, it is sent back to the pipeline controller, which uses it to train the internal model dynamically and improve its accuracy.

In other words, using doccano for semi-supervised learning with interactive machine teaching.

If yes, please explain briefly how to set up the integration and the data exchange. So far I have only seen it export the finished annotations as CSV in IOB format.

@icoxfog417 icoxfog417 added the enhancement Improvement on existing feature label Sep 26, 2018
@icoxfog417
Contributor

icoxfog417 commented Sep 27, 2018

Thank you for the nice feature request! We would like doccano to support active learning and semi-supervised workflows.

You can call the API, because doccano has a REST API based architecture. We are currently refactoring the code and documentation, so please give us a little time to provide an example 🙏

We are also considering a feature for importing a labeled dataset. That is easier than labeling dynamically through the API. Would this import feature be enough for your request?

API Based Labeling

  • Pros: Labels the data dynamically and trains the external model (online learning).
  • Cons: A little hard to implement because of security concerns (it requires calls to an external API), and the user has to run an external model server.

Labeled Data Import

  • Pros: Easy to implement and easy to use.
  • Cons: Cannot support a dynamic workflow.
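
For the import route, predictions could be serialised ahead of time and loaded as a labeled dataset. The following is a minimal sketch, assuming a JSON Lines layout with a `text` field and a `labels` list of `[start, end, label]` triples. The thread does not specify the import format, so both the layout and the `to_import_line` helper are hypothetical:

```python
import json


def to_import_line(text, spans):
    """Serialise one document and its predicted spans as a JSON line.

    `spans` is an iterable of (start_offset, end_offset, label) tuples.
    The field names are an assumption, not a confirmed doccano format.
    """
    labels = [[start, end, label] for start, end, label in spans]
    return json.dumps({"text": text, "labels": labels}, ensure_ascii=False)


if __name__ == "__main__":
    # One line per document; a file of such lines would form the dataset.
    print(to_import_line("Obama visited Paris",
                         [(0, 5, "PERSON"), (14, 19, "GPE")]))
```

Because every document is written once, before annotation starts, this matches the "static" trade-off listed above.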

@fm444fm
Author

fm444fm commented Oct 9, 2018

Thank you for the answers. The labeled data import is definitely easier, as you say, both to develop and to use. On the other hand, it would not offer as many opportunities as the API-based option: if doccano could be used for online/dynamic learning, I think it would become an even more interesting and flexible application.

@jamesmf

jamesmf commented Oct 15, 2018

If the API piece is tricky because of security and the inconvenience of needing a separate service, and the labeled data import is static, what about an admin-only page for applying a model to unlabeled examples and potentially also updating the model?

  • When creating a project, you could specify a model to associate with the project, perhaps using a module name (autolabeler_module="en_core_web_sm" for example)
  • When clicking an "apply_model" button on the project card, the app would pull all unlabeled documents, apply a specific function of the model, and save the result to annotations

Might be a middle-ground in terms of both functionality and implementation difficulty.
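
Resolving a model from a module name, as in the `autolabeler_module` idea above, might look roughly like this. It is a sketch only: `load_autolabeler` is a hypothetical helper, and it assumes the named package exposes a module-level `load()` function, as spaCy model packages such as en_core_web_sm do:

```python
import importlib


def load_autolabeler(module_name):
    """Import a model package by name and return its loaded pipeline.

    Assumes the package exposes a module-level `load()` function.
    """
    module = importlib.import_module(module_name)
    if not hasattr(module, "load"):
        raise ValueError(f"{module_name} does not expose load()")
    return module.load()
```

With a model package installed, `load_autolabeler("en_core_web_sm")` would return the pipeline that the "apply_model" step then calls on each unlabeled document.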

@jamesmf

jamesmf commented Oct 16, 2018

I implemented a Command class that lets you run python manage.py autolabel project_id and it loads a model and labels all of the documents in that project. In its current state it also adds new labels to the project if they don't exist (that should probably be optional)

The main problem with this implementation is that I don't have a great way to specify a document has been reviewed. So while I can add Annotations to every Document and specify that they are all manual=False (which lets me pull only the documents that were labeled manually for training), what I don't have is a way of changing those annotations from manual=False to reviewed or something similar.

I'm imagining that the simplest way to do that would be to either

  • add an endpoint for 'reviewed document' that gets called every time a user clicks past an autolabeled document
  • add a button (preferably shortcut-keyed) that specifies "this user has reviewed the automatic annotations and approved them"

How does the feature/autolabeling branch intend on handling the difference between a manual=False annotation and a "reviewed" annotation?
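
One way to picture the missing "reviewed" state is as a second flag next to `manual`. The sketch below is plain Python rather than a Django model, and the `reviewed` field, `approve` and `training_set` are all hypothetical names, not part of doccano:

```python
from dataclasses import dataclass


@dataclass
class AnnotationState:
    doc_id: int
    manual: bool            # True if a human created the annotation
    reviewed: bool = False  # hypothetical flag: a human approved it


def approve(annotations, doc_id):
    """Mark every automatic annotation on one document as reviewed."""
    for ann in annotations:
        if ann.doc_id == doc_id and not ann.manual:
            ann.reviewed = True


def training_set(annotations):
    """Keep annotations that a human either created or approved."""
    return [a for a in annotations if a.manual or a.reviewed]
```

Either proposal above (an endpoint hit when the user clicks past a document, or an explicit approve button) would end up calling something like `approve`.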

Here's the implementation. Once I come up with a decent means of iterating on this, I'll PR something similar

from django.core.management.base import BaseCommand, CommandError
from server.models import (Document, Project, Label,
                          SequenceAnnotation, User)
import en_core_web_md
import string

# if we want to add new labels, decide on some new colors for them
SOME_COLORS_TO_CHOOSE_FROM = ["#001f3f", "#0074D9", "#7FDBFF",
                              "#39CCCC", "#3D9970", "#2ECC40",
                              "#01FF70", "#FFDC00", "#FF4136",
                              "#FF851B", "#85144b", "#B10DC9",
                              "#111111"]

def get_new_shortcut(proj_id):
    """
    Since (project_id, shortcut_key) must be unique for a set of labels,
    we need to check what's taken and pick a unique shortcut key if we want
    to add a new label
    """
    labels = Label.objects.filter(project_id=proj_id)
    existing = {label.shortcut for label in labels}
    # sort so the choice is deterministic (set ordering is arbitrary)
    available = sorted(set(string.ascii_lowercase) - existing)
    if not available:
        raise CommandError("no free shortcut keys left for this project")
    return available[0]


def load_model(model_str):
    """
    Loads a model given an input string (could work differently)
    """
    if model_str == "en_core_web_md":
        import en_core_web_md
        return en_core_web_md.load()
    # "seed_data" was only a placeholder and is not implemented; fail
    # loudly instead of returning an undefined variable
    raise CommandError("unsupported model: %s" % model_str)


class Command(BaseCommand):
    help = 'Loads a model and labels all documents in a given project'

    def add_arguments(self, parser):
        parser.add_argument('project_id', type=int)
        # nargs='?' makes the positional optional so the default applies
        parser.add_argument('model_str', nargs='?', default='en_core_web_md')

    def handle(self, *args, **options):
        """
        Loads a model, gets all documents in the given project, and calls that
        model on each document. Optionally, it can create new labels for
        entities that it finds that don't exist in your project.
        """
        project_id = options['project_id']
        model_str = options['model_str']
        user = User.objects.get(username='admin')
        print("loading model")
        nlp_model = load_model(model_str)
        print("model loaded")
        project = Project.objects.get(pk=project_id)
        docs = Document.objects.filter(project_id=project_id)
        
        # keep track of next label color, next label shortcut
        labels_created = 0
        next_color = SOME_COLORS_TO_CHOOSE_FROM[labels_created]
        next_short = get_new_shortcut(project_id)
        for doc in docs:
            parsed = nlp_model(doc.text)
            for ent in parsed.ents:
                elabel = ent.label_
                estart = ent.start_char
                eend = ent.end_char
                proj_label, created = Label.objects.get_or_create(text=elabel,
                                                                  project=project,
                                                                  defaults={'background_color': next_color,
                                                                            'shortcut': next_short})
                # keep track of next label color, next label shortcut
                if created:
                    labels_created = (labels_created + 1) % len(SOME_COLORS_TO_CHOOSE_FROM)
                    next_color = SOME_COLORS_TO_CHOOSE_FROM[labels_created]
                    next_short = get_new_shortcut(project_id)
                seq_ann_args = dict(document=doc, user=user, 
                                    label=proj_label, start_offset=estart,
                                    end_offset=eend, manual=False)
                ann, c = SequenceAnnotation.objects.get_or_create(**seq_ann_args)

@icoxfog417
Contributor

Thank you for the great suggestion. As you proposed, we have considered a way to set a project-specific model.

One drawback of this approach is that the user has to decide on the model structure before starting annotation. You might think the model could be changed during annotation (active learning), but there is research suggesting that datasets collected by active learning do not transfer well to other models, so the autolabeling feature is pending.

How transferable are the datasets collected by active learners?

Because of this, we plan to implement the simple approach first. Support for additional features will depend on future research.

@DSLituiev
Copy link

@icoxfog417 That piece of research is really interesting. Practically speaking, however, IMO people will still use active learning even in the face of these considerations, because it is much less expensive. I would ask you to reconsider including this pull request.

@Hironsan Hironsan added feature request feature request for doccano and removed enhancement Improvement on existing feature labels Feb 12, 2019
@vivekverma239

Hi, is anyone working on this feature?

@Hironsan
Member

Hironsan commented May 7, 2019

Internally, doccano is built on a web API. Annotations can therefore be created from a program as follows:

import requests


class Client(object):

    def __init__(self, entrypoint='http://127.0.0.1:8000', username=None, password=None):
        self.entrypoint = entrypoint
        self.client = requests.Session()
        self.client.auth = (username, password)

    def fetch_projects(self):
        url = f'{self.entrypoint}/v1/projects'
        response = self.client.get(url)
        return response

    def create_project(self, name, description, project_type):
        mapping = {'SequenceLabeling': 'SequenceLabelingProject',
                   'DocumentClassification': 'TextClassificationProject',
                   'Seq2seq': 'Seq2seqProject'}
        data = {
            'name': name,
            'project_type': project_type,
            'description': description,
            'guideline': 'Hello',
            'resourcetype': mapping[project_type]
        }
        url = f'{self.entrypoint}/v1/projects'
        response = self.client.post(url, json=data)
        return response.json()

    def fetch_documents(self, project_id):
        url = f'{self.entrypoint}/v1/projects/{project_id}/docs'
        response = self.client.get(url)
        return response.json()

    def add_document(self, project_id, text):
        data = {
            'text': text
        }
        url = f'{self.entrypoint}/v1/projects/{project_id}/docs'
        response = self.client.post(url, json=data)
        return response.json()

    def fetch_labels(self, project_id):
        url = f'{self.entrypoint}/v1/projects/{project_id}/labels'
        response = self.client.get(url)
        return response.json()

    def add_label(self, project_id, text):
        data = {
            'text': text
        }
        url = f'{self.entrypoint}/v1/projects/{project_id}/labels'
        response = self.client.post(url, json=data)
        return response.json()

    def fetch_annotations(self, project_id, doc_id):
        url = f'{self.entrypoint}/v1/projects/{project_id}/docs/{doc_id}/annotations'
        response = self.client.get(url)
        return response.json()

    def annotate(self, project_id, doc_id, data):
        url = f'{self.entrypoint}/v1/projects/{project_id}/docs/{doc_id}/annotations'
        response = self.client.post(url, json=data)
        return response.json()


if __name__ == '__main__':
    client = Client(username='username', password='password')
    project = client.create_project(name='NER project',
                                    description='example',
                                    project_type='SequenceLabeling')
    doc = client.add_document(project_id=project['id'],
                              text='Obama')
    label = client.add_label(project_id=project['id'],
                             text='PERSON')
    data = {
        'start_offset': 0,
        'end_offset': 5,
        'label': label['id'],
        'prob': 0.8
    }
    client.annotate(project_id=project['id'],
                    doc_id=doc['id'],
                    data=data)
    annotations = client.fetch_annotations(project_id=project['id'],
                                           doc_id=doc['id'])

In the future, we plan to enable automatic labeling from the web.
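
To connect a model to the `annotate` call above, its predictions first have to be turned into payloads of the same shape as `data` in the example. A minimal sketch, assuming predictions arrive as `(start_offset, end_offset, label_text, prob)` tuples; `build_payloads` is a hypothetical helper, not part of doccano:

```python
def build_payloads(entities, label_ids):
    """Turn predicted entities into annotation payload dicts.

    `entities` holds (start_offset, end_offset, label_text, prob) tuples.
    `label_ids` maps label text to the id returned when the label was
    created. Entities whose label is unknown to the project are skipped.
    """
    payloads = []
    for start, end, label_text, prob in entities:
        if label_text not in label_ids:
            continue
        payloads.append({
            "start_offset": start,
            "end_offset": end,
            "label": label_ids[label_text],
            "prob": prob,
        })
    return payloads
```

Each returned dict could then be passed as `data` to `client.annotate(project_id=..., doc_id=..., data=...)`.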

@icoxfog417
Contributor

Continue the discussion on #6.

@swicaksono


Hi, I used the client from @Hironsan's comment to automatically fetch documents that are already annotated, but every request returns a 403 response. Why? This is strange, because opening the same Django REST endpoint (e.g. v1/projects) in a browser returns no error.

@shuiruge

shuiruge commented Dec 6, 2019


The 403 response is caused by authorization. Modify the client's __init__ method as follows:

    def __init__(self, entrypoint, username, password):
        self.entrypoint = entrypoint
        self.client = requests.Session()

        # authorization
        response = self.client.post(f'{self.entrypoint}/v1/auth-token',
                                    data={'username': username,
                                          'password': password})
        token = response.json()['token']
        self.client.headers.update({'Authorization': f'Token {token}',
                                    'Accept': 'application/json'})

inspired by https://github.com/afparsons/doccano_api_client/blob/ad251cd05b49216b19a6319cb9ce1326d08ce484/doccano_api_client/__init__.py#L93
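
If the login itself fails, `response.json()['token']` in the snippet above raises a bare KeyError. A slightly more defensive variant could separate the two steps, as in the sketch below. `extract_token` and `token_headers` are hypothetical helpers, and the assumption that a successful login returns HTTP 200 with a `token` field is inferred from the snippet, not confirmed by the thread:

```python
def extract_token(status_code, payload):
    """Return the token from a login response or fail with a clear error.

    `payload` is the decoded JSON body of the auth-token response.
    """
    if status_code != 200 or "token" not in payload:
        raise RuntimeError(f"login failed (HTTP {status_code}): {payload}")
    return payload["token"]


def token_headers(token):
    """Build the session headers used for token authentication."""
    return {"Authorization": f"Token {token}",
            "Accept": "application/json"}
```

Inside `__init__`, this would be used as `self.client.headers.update(token_headers(extract_token(response.status_code, response.json())))`.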

8 participants