
Import labeled dataset #11

Closed
ismaeIfm opened this issue Oct 11, 2018 · 30 comments
Labels
enhancement Improvement on existing feature

@ismaeIfm

ismaeIfm commented Oct 11, 2018

Feature request: import labeled datasets in BIO format, like:

SOCCER	O
-	O
JAPAN	B-LOC
GET	O
LUCKY	O
WIN	O
,	O
CHINA	B-PER
IN	O
SURPRISE	O
DEFEAT	O
.	O

Nadim	B-PER
Ladki	I-PER

AL-AIN	B-LOC
,	O
United	B-LOC
Arab	I-LOC
Emirates	I-LOC
1996-12-06 O
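For reference, a BIO file like the one above can be converted into doccano-style character-offset spans with a short script. This is only a sketch, not part of doccano: it assumes whitespace-separated token/tag pairs with blank lines between sentences, and does not check B-/I- label consistency.

```python
def bio_to_spans(lines):
    """Convert BIO-tagged "token<TAB>tag" lines into character-offset spans.

    Blank lines (sentence boundaries) are skipped, so sentences end up joined
    by single spaces. Returns (text, spans) where each span is
    (start, end, label) indexing into the reconstructed text.
    """
    tokens = []
    for line in lines:
        if line.strip():
            tokens.append(line.split())

    text, spans = "", []
    open_start = open_label = None
    prev_end = 0
    for token, tag in tokens:
        start = len(text)
        text += token + " "
        if tag.startswith("B-"):
            if open_label:
                spans.append((open_start, prev_end, open_label))
            open_start, open_label = start, tag[2:]
        elif not tag.startswith("I-"):  # an "O" tag closes any open entity
            if open_label:
                spans.append((open_start, prev_end, open_label))
            open_start = open_label = None
        prev_end = start + len(token)
    if open_label:
        spans.append((open_start, prev_end, open_label))
    return text.rstrip(), spans
```

The resulting spans line up with the JSON-based import format discussed later in this thread.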

Btw, I love your tool; thanks for making it open source.

@icoxfog417 icoxfog417 added the enhancement Improvement on existing feature label Oct 11, 2018
@icoxfog417
Contributor

Thank you for your feature request! We have already heard requests to import annotated datasets.
We are now considering how to implement this feature, so please wait a little!

@ismaeIfm
Author

Also, I think it's important to have a flag for verified annotations, similar to the labelImg feature.

@icoxfog417 icoxfog417 changed the title Import dataset Import labeled dataset Oct 20, 2018
@feng-1985

How do we import a labeled translation dataset? Some labeled data contains wrong labels, so it would be useful to be able to edit those labels.

@serzh
Contributor

serzh commented Dec 19, 2018

Hey guys, what is the status of importing labeled data? It looks like a very important feature.

@BrambleXu
Contributor

BrambleXu commented Dec 19, 2018

This feature will progress after the JSON export feature is implemented. Such labeled data could then be imported as a JSON file.

@DSLituiev DSLituiev mentioned this issue Jan 3, 2019
@aribornstein

aribornstein commented Jan 12, 2019

+1 on this feature. It would be great to have the following two capabilities:

  1. Import a list of pre-annotated sequence data so that annotators can validate it and make changes.
  2. Import a list of tags so that users don't need to input labels manually; this would allow the tool to be used for more complex annotation tasks such as POS, SRL, co-reference, etc.

@aribornstein

aribornstein commented Jan 12, 2019

I saw you have an open branch with a template, but it hasn't been updated in 5 months. Is this something you'd be open to contributions for, or are you working on a release? https://github.com/chakki-works/doccano/tree/feature/auto_labeling

@BrambleXu
Contributor

BrambleXu commented Jan 16, 2019

@aribornstein Right now I am not working on this feature. I am planning to implement it using the metadata feature proposed by @serzh, so it has to wait until #55 is merged.

@BrambleXu
Contributor

Here I share my thoughts about this feature, taken from #57 (comment). If you have any advice, please share it with us.

How to link labels to document

In order to link annotations with documents, we can make use of the metadata feature (#55). If a user imports data with an external_id, it will be included in the export file.

Input file may look like this:

import.json

{"text": "EU rejects German call to boycott British lamb.", "external_id": 1}

and the exported file will look like this:

output.json

{"doc_id": 2023, "text": "EU rejects German call to boycott British lamb.", "labels": ["news"], "username": "root", "metadata": {"external_id": 1}}
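As a sketch of the round trip from the consumer's side (assuming the import/export formats shown above), the exported JSON lines can be joined back onto the original corpus by external_id:

```python
import json

def merge_annotations(original_records, exported_lines):
    """Attach doccano-exported labels back onto an external corpus.

    original_records: the dicts that were imported, each with a unique
    external_id. exported_lines: JSON lines in the export format sketched
    above, where the external_id comes back inside the metadata object.
    """
    by_id = {rec["external_id"]: rec for rec in original_records}
    for line in exported_lines:
        exported = json.loads(line)
        ext_id = exported["metadata"]["external_id"]
        by_id[ext_id]["labels"] = exported.get("labels", [])
    return original_records
```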

Input and output example for document classification task

If a user wants to import data with labels, the user has to provide an external_id.
For a document classification task, the JSON file should look like this:

import.json

{"text": "EU rejects German call to boycott British lamb.", "external_id": 1, "labels": ["label1", "label2"]}

In the annotation process, we add label3 to this document.
output.json

{"doc_id": 2023, "text": "EU rejects German call to boycott British lamb.", "labels": ["label1", "label2", "label3"], "username": "root", "metadata": {"external_id": 1}}

Input and output example for sequence labeling task

Just as a simple demo, I only show the JSON format example; CSV follows the same idea.

For a sequence labeling task, the imported annotation key should be named entities.
import.json

{"text": "EU rejects German call to boycott British lamb.", "external_id": 1, "entities": [[0, 3, "ORG"], [11, 17, "GPE"]]}

In the annotation process, we annotate British as GPE.

output.json

{"doc_id": 2023, "text": "EU rejects German call to boycott British lamb.", "entities": [[0, 3, "ORG"], [11, 17, "GPE"], [34, 41, "GPE"]], "username": "root", "metadata": {"external_id": 1}}

The key names labels and entities should be the same as in the output format.
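Since the entities triples are character offsets into text, an imported file can be sanity-checked before it touches the database. A minimal sketch, assuming the [start, end, label] layout above:

```python
def validate_entities(record):
    """Resolve each [start, end, label] triple against the record's text.

    Raises ValueError on inverted or out-of-range offsets so that bad rows
    fail loudly before anything is written to the database. Returns the
    resolved (start, end, surface_text) tuples for inspection.
    """
    text = record["text"]
    resolved = []
    for start, end, label in record.get("entities", []):
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"bad offsets [{start}, {end}] for label {label!r}")
        resolved.append((start, end, text[start:end]))
    return resolved
```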

About the shortcut

Right now, each label has to be given a shortcut when the label is created. I am working on #73 so that shortcuts become optional. In that case, if a user imports data with labels, those labels can have no shortcut. This will be helpful for importing labeled data.

Import labeled dataset implementation process

metadata feature -> optional shortcut enhancement -> import labeled dataset

@DSLituiev

DSLituiev commented Jan 28, 2019

@BrambleXu
Let me comment from my experience of implementing sequence annotation import in this branch on my fork.

First, there is a design compatibility issue with the current master branch.
particularly bulk import in view.py, L131. Bulk import as currently implemented is good and easy for Document-only content, but it becomes complicated when one has to import linked annotations (which also involves creating labels on the fly). There are several ways to accelerate that, e.g. using transactions, but it does not seem that recent design changes can accommodate any of those tricks. Another way is to run the bulk import and store the IDs, but that requires accumulating annotations in memory, which is not an elegant solution.
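The transaction idea can be illustrated outside Django with plain sqlite3 (a simplified sketch, not doccano's actual schema): inserting each document inside one transaction and reading its primary key back immediately lets annotation rows reference it without accumulating the whole batch in memory.

```python
import sqlite3

def import_labeled(conn, records):
    """Insert documents and their annotations inside a single transaction.

    Each document's primary key is read back via cursor.lastrowid right
    after the insert, so its annotation rows can reference it directly.
    """
    cur = conn.cursor()
    with conn:  # one transaction for the whole import
        for rec in records:
            cur.execute("INSERT INTO document(text) VALUES (?)", (rec["text"],))
            doc_id = cur.lastrowid
            cur.executemany(
                "INSERT INTO annotation(document_id, start_offset, end_offset, label)"
                " VALUES (?, ?, ?, ?)",
                [(doc_id, s, e, lbl) for s, e, lbl in rec.get("entities", [])],
            )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE document(id INTEGER PRIMARY KEY, text TEXT)")
conn.execute(
    "CREATE TABLE annotation(id INTEGER PRIMARY KEY, document_id INTEGER,"
    " start_offset INTEGER, end_offset INTEGER, label TEXT)"
)
import_labeled(conn, [{"text": "EU rejects German call.", "entities": [[0, 2, "ORG"]]}])
```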

Second, as for the format, I am using a dictionary to store entities, because it is cleaner and JSON allows it, so why not. I am also adding 'title' to be able to track document external IDs.

{'title': 'doc1',
 'text': 'EU rejects German call to boycott British lamb.',
 'seq_annotation': [
     {'start': 0, 'end': 2, 'label': 'place'},
     {'start': 11, 'end': 17, 'label': 'place'},
     {'start': 34, 'end': 41, 'label': 'place'}
 ]
}

@BrambleXu
Contributor

@DSLituiev

First, accumulating annotations in memory is indeed not an elegant solution, but the first step is important. We might find a more elegant way while developing.

As for the second point about the format, list and dictionary both work well. We use a list because it matches the format used by spaCy.

@JoshuaPostel

@BrambleXu, will you please elaborate on:

If a user wants to import data with some labels, the user has to provide an external_id.

As I understand it, the metadata feature is primarily meant for external use (see #55 comment):

The main use-case is to be able to match exported data with your existing corpus by providing external_id in the imported file, although there are other use-cases.

Depending on a user-supplied value to link a document and annotation sounds riskier than using the document ForeignKey in the annotation tables.

Notes while trying to implement feature using ForeignKey

I also ran into some of the bulk_create concerns raised by @DSLituiev. The bulk_create documentation states:

If the model’s primary key is an AutoField it does not retrieve and set the primary key attribute, as save() does, unless the database backend supports it (currently PostgreSQL).

  1. A workaround I have attempted is to replace all models.AutoField with models.UUIDField in 0001_initial.py; however, I am running into issues getting models.UUIDField and sqlite3 to work together.

  2. The other approach would be to use transactions like @DSLituiev suggested. Replacing bulk_creates with this approach looks promising to me.

Do either of these approaches look acceptable? If so, let me know and I will continue to work in that direction.
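For the first workaround, the key point is that UUID primary keys can be generated client-side before the bulk insert, so the document-to-annotation links are known up front. A database-agnostic sketch, with plain dicts standing in for model instances:

```python
import uuid

def prepare_bulk_rows(records):
    """Pre-assign UUID primary keys so annotations can reference their document.

    Because the ids are generated before anything touches the database, the
    document/annotation links survive a bulk insert even on backends where
    bulk_create does not return AutoField primary keys.
    """
    documents, annotations = [], []
    for rec in records:
        doc_id = str(uuid.uuid4())
        documents.append({"id": doc_id, "text": rec["text"]})
        for start, end, label in rec.get("entities", []):
            annotations.append(
                {"document_id": doc_id, "start": start, "end": end, "label": label}
            )
    return documents, annotations
```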

@BrambleXu
Contributor

@JoshuaPostel

If a user wants to import data with some labels, the user has to provide an external_id.

This means using external_id to link a document with its corresponding labels. In my understanding, the document ForeignKey you mentioned is project, right? If so, how do you know which document has which labels?

When we import data, text is saved to the Document class and labels are saved to the Label class. But this saving is done by bulk_create, so we don't know which labels belong to which document. That is where external_id comes in. In the DocumentAnnotation class, we can use external_id to link Label.id to Document.id, so we know which document has which labels.

[Image: Models Structure]

Right now only the Document class has the metadata field (containing external_id). You could consider adding an external_id attribute to the Label class, so you can first bulk_create the Labels and Documents, and then link them together according to the task type.

To sum up: first, use bulk_create to create the Document and Label objects respectively; then use external_id to link Labels to Documents for a specific task (DocumentAnnotation and so on).

As for your two approaches to an implementation with ForeignKey, the second is better than the first.
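The linking step described above can be sketched in plain Python, with hypothetical row dicts standing in for the Django objects after bulk_create:

```python
def link_by_external_id(documents, label_rows):
    """Build DocumentAnnotation-style link rows by joining on external_id.

    documents and label_rows are dicts as they might look after bulk_create:
    'id' is the database primary key, 'external_id' came from the import file.
    """
    doc_by_ext = {doc["external_id"]: doc["id"] for doc in documents}
    return [
        {"document_id": doc_by_ext[row["external_id"]], "label_id": row["id"]}
        for row in label_rows
    ]
```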

@DSLituiev

So if I understand correctly, @BrambleXu, you suggest storing labels separately from the documents, right?

My assumption, and I believe @JoshuaPostel's as well, was that the labels are stored together with the documents. This way one does not need to keep track of an external_id / foreign key.

Though I believe that having something like external_id or document_id is generally good practice, storing labels separately from docs might be less desirable from the user's perspective.

@BrambleXu
Contributor

BrambleXu commented Feb 14, 2019

@DSLituiev

Yes, this is just my approach to implementing this feature; my idea is to change the model structure as little as possible.

Of course, it would be great to store labels and documents together without external_id. But this might be hard to achieve with bulk_create without a big change to the model structure. You can see that the Label class is related to the DocumentAnnotation and SequenceAnnotation task classes, and also to the Project class.

If we only consider this feature, it might be good to store labels and documents together. But considering the later implementation of new tasks (like relation annotation), is it really a good choice to change the model structure? Right now I cannot give a better solution due to my lack of data-structure design experience. That's why I choose to change the model as little as possible.

I am not against your approach, and I am also looking forward to seeing a better implementation.

@BrambleXu
Contributor

To those who are interested in contributing to this feature:

metadata feature -> optional shortcut enhancement -> import labeled dataset

The metadata feature and the optional shortcut enhancement are both implemented. As for the import labeled dataset feature, I won't work on it given its priority. Anyone who is interested in contributing to this feature is welcome.

This feature is a little 'big': it needs frontend work (extracting data from the database and showing the imported labels on the web page; Vue, HTML) and backend work (saving labels and documents to the database and linking labels to documents; Django). So I recommend that developers working on this feature first implement the backend part (as a PR). We can help implement the frontend part. Of course, it would be wonderful to see both sides finished in one PR.

@JoshuaPostel

My apologies, I should have been more explicit about how I have attempted to link document and annotation via the document ForeignKey. Hopefully the pseudocode below demonstrates that it is possible to link the annotation to the appropriate document and label without the use of metadata.

# modify json_to_documents to return annotations if 'label' is a field
documents, annotations = DataUpload.json_to_documents(project, file)

for document, annotation in zip(documents, annotations):
    document.id = uuid.uuid4()
    document.save()
    try:
        label = Label.objects.get(
            project=document.project,
            text=annotation.text)
    except Label.DoesNotExist:
        # see @DSLituiev's approach in #57
        label = create_label(...)
        label.save()
    annotation.document = document
    annotation.label = label
    annotation.save()

I am still working through the details of how to implement this efficiently, but I believe it is possible with transactions, since bulk_create itself is just a specific use of transaction.atomic (see the source code).

If this feature can be implemented using only the internal model, I think it should be (even if it requires a bit more work up front). I can see the metadata approach introducing many bugs down the road. For example, what happens if a user uploads two different documents with the same external_id? The table keys already manage these sorts of issues, and we can avoid adding more work for the user.

@JoshuaPostel

Regarding your comment about the model structure, @BrambleXu:

If we just consider this feature, it might be good to store labels and documents together. But considering later implementation for the new task (like relation annotation), is it really a good choice to change the model structure? Right now I can not give a better solution due to lack of data structure design experience. That's why I choose to change the model as less as possible.

I think the current model is OK. I imagine storing document and label together might cause issues for @serzh's work on polymorphism (#66).

@BrambleXu
Contributor

@JoshuaPostel
Thanks for your explanation. Your approach is clear to me now. I will ask @serzh about the status of #66.

The external_id is only for the advanced use case of linking to external data, and those users are not supposed to upload different documents with the same external_id, because they know exactly what they are doing.

@BrambleXu
Contributor

We have dug into the implementation details too much. It is time to step back and reconsider the purpose of this import-labeled-data feature.

Here I want to take a survey: for which use cases do you need this feature?

  1. Just confirming the labeled data by eye?
  2. Correcting wrongly labeled data?
  3. You need to label a very large dataset, which is impossible to do by hand. So you train a classifier with a little labeled data and predict the unlabeled data, then upload the predicted labels for correction, then download the corrected labeled data to train a better classifier, back and forth until you get good performance. This use case no longer needs to be considered once the auto-labeling feature is implemented.
  4. Augmenting the current dataset with other labeled data. This can be done manually without much effort.

If you have another use case, please tell us. I want to know which use case is the most suitable for this feature. This is very important for determining whether or not the feature is worth so much effort.

@JoshuaPostel

My use case would be 3, but I do not think the auto-labeling feature would be sufficient. My understanding of #18 is that it will only support spaCy models. The models in my use case would be difficult to incorporate.

Doccano is an outstanding annotation tool that could be useful for the multitude of human-in-the-loop NLP models that are beyond the scope of spaCy. Strong input/output functionality would expand Doccano's use cases.

@DSLituiev

DSLituiev commented Feb 15, 2019

@BrambleXu in my case it is a human-in-the-loop application (3). I am not sure I understand what you mean by 4, or what the difference is between 1 and 2.
For me, as for @JoshuaPostel, spaCy is far from doing what I need. For this purpose, having sklearn-style calls integrated would be great, i.e. a user provides a Python file with a defined model class:

# within modelfile.py
class MyModel:
    ...

model = MyModel(parameter=124)

Backend imports that class

from modelfile import model

and uses it to retrain the model on the new data after every 50 newly annotated instances (model.fit()), and updates the inference on the not-yet-manually-labelled set (model.predict()).
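That contract only needs sklearn-style fit/predict methods. A minimal sketch (the modelfile.py hook and retraining cadence are the proposal above, not an existing doccano API; the model here is a trivial stand-in):

```python
from collections import Counter

# Stand-in for the user-supplied class in the proposed modelfile.py; anything
# exposing sklearn-style fit()/predict() would satisfy the contract.
class MajorityLabelModel:
    """Trivial baseline that always predicts the most frequent training label."""

    def __init__(self):
        self.majority = None

    def fit(self, texts, labels):
        # Retrain from scratch on the current pool of annotated instances.
        self.majority = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, texts):
        return [self.majority for _ in texts]

model = MajorityLabelModel()
model.fit(["good match", "bad call", "great win"], ["pos", "neg", "pos"])
predictions = model.predict(["unseen text"])
```

The backend would call fit() after each batch of new annotations and predict() on the remaining unlabeled pool, exactly as described above.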

@machakux

machakux commented Feb 18, 2019

I would like to be able to import labelled datasets to review them, correct wrongly labelled data, continue labelling a partially labelled dataset, or add labelled data to an existing project (mostly use cases 1 and 2).

I think storing documents together with labels might simplify things.

If it is decided to store annotations together with the document, the document model could be something like:

class Document(models.Model):
    project = models.ForeignKey(Project, related_name='documents', on_delete=models.CASCADE)
    text = models.TextField()
    labels = models.TextField() #  or ManyToManyField() or ArrayField()
    annotations = models.TextField()   # or ManyToManyField() or JSONField() 
    seq2seq_annotations = models.TextField()  # or ManyToManyField()  or ArrayField()
    metadata = models.TextField(default='{}')  #  or JSONField()
    # ...

Django has several third-party packages like https://github.com/dmkoch/django-jsonfield which can provide a bit more flexible data structures. And if you are using PostgreSQL, Django has native built-in fields for JSON, arrays and more; see https://docs.djangoproject.com/en/dev/ref/contrib/postgres/fields/ .

Assuming the basic functionality does not involve updating existing documents, the import will not need to account for an external_id, although users might be allowed to upload one as metadata for their own future reference.

Allowing users to update existing documents through bulk upload could be limited to the admin interface or command-line interface, as advanced functionality for users who are sure of what they are doing. Here users could be allowed to provide the real id field (the object's primary key): if an object with the provided id already exists in the database it will be updated, otherwise a new object will be created.
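The update-or-create behaviour described here maps naturally onto an upsert. A sqlite3 sketch with a hypothetical document table (the ON CONFLICT clause needs SQLite 3.24+):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE document(id INTEGER PRIMARY KEY, text TEXT)")

def upsert_documents(conn, rows):
    """Insert each (id, text) row, updating the text if the id already exists."""
    with conn:
        conn.executemany(
            "INSERT INTO document(id, text) VALUES (?, ?)"
            " ON CONFLICT(id) DO UPDATE SET text = excluded.text",
            rows,
        )

upsert_documents(conn, [(1, "first version")])
upsert_documents(conn, [(1, "corrected version"), (2, "new doc")])
```

In Django the equivalent would be update_or_create per row, or bulk_create with update_conflicts on newer versions.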

If documents and annotations are stored together, it may also be easier to utilize existing tools like django-import-export, especially for imports via the admin interface.

@ismaeIfm
Author

Same as @machakux:

I would like to be able import labelled datasets to review, correct wrongly labeled data, continue labelling a partially labeled dataset or to add labelled data to an existing project (mostly use cases 1,2).

I'm also against using the spaCy format; doccano should have a very simple, generic input and output. I think a particular format can easily be handled in a preprocessing step. I've always thought of doccano as the labelImg for text.

Hironsan added a commit that referenced this issue Feb 26, 2019
Corresponding to issue #11
Hironsan added a commit that referenced this issue Feb 26, 2019
Corresponding to issue #11
phoerious added a commit to phoerious/doccano that referenced this issue Feb 26, 2019
@phoerious

I created a pull request that implements re-import of existing data with annotations.

phoerious added a commit to phoerious/doccano that referenced this issue Feb 26, 2019
phoerious added a commit to phoerious/doccano that referenced this issue Feb 26, 2019
Hironsan added a commit that referenced this issue Mar 1, 2019
phoerious added a commit to phoerious/doccano that referenced this issue Mar 5, 2019
@lfoppiano

Just a comment about this feature: the import of an existing dataset should also work with already tokenised input data.

This way I can generate the annotations automatically with any tool and keep the same tokenisation while correcting the annotations.

{
  "spans": [{"start": 58, "end": 72, "token_start": 16, "token_end": 25, "label": "supercon"}],
  "tokens": [{"text": "We", "start": 0, "end": 2, "id": 0}, {"text": " ", "start": 2, "end": 3, "id": 1}],
  "text": "We have measured the [...]"
}

What do you think?
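A quick consistency check for that layout could verify that each span's character offsets line up with its token boundaries (a sketch over the fields shown, assuming token ids are unique within a record):

```python
def check_span_alignment(record):
    """Verify each span's character offsets match its token boundaries.

    Assumes the layout above: spans carry start/end character offsets plus
    token_start/token_end ids, and each token carries its own offsets and id.
    """
    tokens_by_id = {tok["id"]: tok for tok in record["tokens"]}
    for span in record["spans"]:
        first = tokens_by_id[span["token_start"]]
        last = tokens_by_id[span["token_end"]]
        if span["start"] != first["start"] or span["end"] != last["end"]:
            return False
    return True
```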

@phoerious

phoerious commented Mar 6, 2019

Until doccano has a built-in feature that does anything with tokens, you can simply put them in the metadata, and with my patch above they will be imported (and later properly exported again) alongside the text and annotations.

@Hironsan
Member

Hironsan commented Mar 12, 2019

I have thoroughly redesigned the APIs and models and added support for importing labeled datasets.

Task × format support is as follows:

| Task | Plain | CSV | JSON | CoNLL |
| --- | --- | --- | --- | --- |
| Text Classification | ○ | ○ (single label) | ○ | X |
| Sequence Labeling | ○ | X | ○ | ○ |
| Seq2seq | ○ | ○ | ○ | X |

We can confirm the detailed format on the upload page:

[Image: upload page with detailed format examples]

This is not a perfect feature; it is the first step.
There are some bugs and performance problems, so your opinions and feedback are welcome.

Thank you for your feedback and contribution.

@JoshuaPostel

@Hironsan thank you very much for your hard work implementing this feature, PR #110 looks great!

@JoshuaPostel

If it is of any use, below is a summary of my tests uploading labeled classification documents in JSON format.

  • 4K documents took ~10 seconds to upload
  • 40K documents took ~1.5 minutes to upload
  • 400K documents took ~13 minutes to upload

During this time, all other users and URLs hang (I don't have much experience with this, but I believe it is due to threading). These times are roughly twice as long as when I tested @phoerious's approach in #97.

I think the times above are reasonable given the scope/workflow of Doccano and do not need further optimization presently.

Projects
v1.0.0 · To Do

Successfully merging a pull request may close this issue.