
Import labeled dataset #11

Closed
ismaeIfm opened this issue Oct 11, 2018 · 30 comments
Labels
enhancement Improvement on existing feature

@ismaeIfm

ismaeIfm commented Oct 11, 2018

Feature request: import labeled datasets in BIO format, like:

SOCCER	O
-	O
JAPAN	B-LOC
GET	O
LUCKY	O
WIN	O
,	O
CHINA	B-PER
IN	O
SURPRISE	O
DEFEAT	O
.	O

Nadim	B-PER
Ladki	I-PER

AL-AIN	B-LOC
,	O
United	B-LOC
Arab	I-LOC
Emirates	I-LOC
1996-12-06 O
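For reference, a BIO file like the one above can be converted into doccano-style character-offset spans with a short script. This is only a sketch, not part of doccano: it assumes whitespace-separated token/tag pairs with blank lines between sentences, and does not check B-/I- label consistency.

```python
def bio_to_spans(lines):
    """Convert BIO-tagged "token<TAB>tag" lines into character-offset spans.

    Blank lines (sentence boundaries) are skipped, so sentences end up joined
    by single spaces. Returns (text, spans) where each span is
    (start, end, label) indexing into the reconstructed text.
    """
    tokens = []
    for line in lines:
        if line.strip():
            tokens.append(line.split())

    text, spans = "", []
    open_start = open_label = None
    prev_end = 0
    for token, tag in tokens:
        start = len(text)
        text += token + " "
        if tag.startswith("B-"):
            if open_label:
                spans.append((open_start, prev_end, open_label))
            open_start, open_label = start, tag[2:]
        elif not tag.startswith("I-"):  # an "O" tag closes any open entity
            if open_label:
                spans.append((open_start, prev_end, open_label))
            open_start = open_label = None
        prev_end = start + len(token)
    if open_label:
        spans.append((open_start, prev_end, open_label))
    return text.rstrip(), spans
```

The resulting spans line up with the JSON-based import format discussed later in this thread.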

Btw, I love your tool; thanks for making it open source.

@icoxfog417 icoxfog417 added the enhancement Improvement on existing feature label Oct 11, 2018
@icoxfog417
Contributor

Thank you for your feature request! We have already heard requests to import annotated datasets.
We are now considering how to implement this feature, so please wait a little!

@ismaeIfm
Author

Also, I think it's important to have a flag for verified annotations, similar to the labelImg feature.

@icoxfog417 icoxfog417 changed the title Import dataset Import labeled dataset Oct 20, 2018
@feng-1985

How do we import a labeled translation dataset? Some labeled data contains wrong labels, so it would be useful to be able to edit those labels.

@serzh
Contributor

serzh commented Dec 19, 2018

Hey guys, what is the status of importing labeled data? It looks like a very important feature.

@BrambleXu
Contributor

BrambleXu commented Dec 19, 2018

This feature will progress after the JSON export feature is implemented. Such labeled data could then be imported as a JSON file.

@DSLituiev DSLituiev mentioned this issue Jan 3, 2019
@aribornstein

aribornstein commented Jan 12, 2019

+1 on this feature. It would be great to have the following two capabilities:

  1. Import a list of pre-annotated sequence data so that annotators can validate it and make changes.
  2. Import a list of tags so that users don't need to input labels manually; this would allow the tool to be used for more complex annotation tasks such as POS, SRL, co-reference, etc.

@aribornstein

aribornstein commented Jan 12, 2019

I saw you have an open branch with a template, but it hasn't been updated in 5 months. Is this something you'd be open to contributions for, or are you working on a release? https://github.com/chakki-works/doccano/tree/feature/auto_labeling

@BrambleXu
Contributor

BrambleXu commented Jan 16, 2019

@aribornstein Right now I am not working on this feature. I am planning to implement it using the metadata feature proposed by @serzh, so it has to wait until #55 is merged.

@BrambleXu
Contributor

Here I share my thoughts about this feature, taken from #57 (comment). If you have any advice, please share it with us.

How to link labels to document

In order to link annotations with documents, we can make use of the metadata feature (#55). If a user imports data with an external_id, it will be included in the export file.

Input file may look like this:

import.json

{"text": "EU rejects German call to boycott British lamb.", "external_id": 1}

and the exported file will look like this:

output.json

{"doc_id": 2023, "text": "EU rejects German call to boycott British lamb.", "labels": ["news"], "username": "root", "metadata": {"external_id": 1}}
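As a sketch of the round trip from the consumer's side (assuming the import/export formats shown above), the exported JSON lines can be joined back onto the original corpus by external_id:

```python
import json

def merge_annotations(original_records, exported_lines):
    """Attach doccano-exported labels back onto an external corpus.

    original_records: the dicts that were imported, each with a unique
    external_id. exported_lines: JSON lines in the export format sketched
    above, where the external_id comes back inside the metadata object.
    """
    by_id = {rec["external_id"]: rec for rec in original_records}
    for line in exported_lines:
        exported = json.loads(line)
        ext_id = exported["metadata"]["external_id"]
        by_id[ext_id]["labels"] = exported.get("labels", [])
    return original_records
```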

Input and output example for document classification task

If a user wants to import data with labels, the user has to provide an external_id.
For a document classification task, the JSON file should look like this:

import.json

{"text": "EU rejects German call to boycott British lamb.", "external_id": 1, "labels": ["label1", "label2"]}

In the annotation process, we add label3 to this document.
output.json

{"doc_id": 2023, "text": "EU rejects German call to boycott British lamb.", "labels": ["label1", "label2", "label3"], "username": "root", "metadata": {"external_id": 1}}

Input and output example for sequence labeling task

Just as a simple demo, I only show the JSON format example; CSV follows the same idea.

For a sequence labeling task, the imported annotation key should be named entities.
import.json

{"text": "EU rejects German call to boycott British lamb.", "external_id": 1, "entities": [[0, 3, "ORG"], [11, 17, "GPE"]]}

In the annotation process, we annotate British as GPE.

output.json

{"doc_id": 2023, "text": "EU rejects German call to boycott British lamb.", "entities": [[0, 3, "ORG"], [11, 17, "GPE"], [34, 41, "GPE"]], "username": "root", "metadata": {"external_id": 1}}

The key names labels and entities should be the same as in the output format.
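Since the entities triples are character offsets into text, an imported file can be sanity-checked before it touches the database. A minimal sketch, assuming the [start, end, label] layout above:

```python
def validate_entities(record):
    """Resolve each [start, end, label] triple against the record's text.

    Raises ValueError on inverted or out-of-range offsets so that bad rows
    fail loudly before anything is written to the database. Returns the
    resolved (start, end, surface_text) tuples for inspection.
    """
    text = record["text"]
    resolved = []
    for start, end, label in record.get("entities", []):
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"bad offsets [{start}, {end}] for label {label!r}")
        resolved.append((start, end, text[start:end]))
    return resolved
```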

About the shortcut

Right now, each label has to be given a shortcut when the label is created. I am working on #73 so that shortcuts become optional. In that case, if a user imports data with labels, those labels can have no shortcut. This will be helpful for importing labeled data.

Import labeled dataset implementation process

metadata feature -> optional shortcut enhancement -> import labeled dataset

@DSLituiev

DSLituiev commented Jan 28, 2019

@BrambleXu
Let me comment from my experience of implementing sequence annotation import in this branch on my fork.

First, there is a design compatibility issue with the current master branch.
particularly bulk import in view.py, L131. Bulk import as currently implemented is good and easy for Document-only content, but it becomes complicated when one has to import linked annotations (which also involves creating labels on the fly). There are several ways to accelerate that, e.g. using transactions, but it does not seem that recent design changes can accommodate any of those tricks. Another way is to run the bulk import and store the IDs, but that requires accumulating annotations in memory, which is not an elegant solution.
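The transaction idea can be illustrated outside Django with plain sqlite3 (a simplified sketch, not doccano's actual schema): inserting each document inside one transaction and reading its primary key back immediately lets annotation rows reference it without accumulating the whole batch in memory.

```python
import sqlite3

def import_labeled(conn, records):
    """Insert documents and their annotations inside a single transaction.

    Each document's primary key is read back via cursor.lastrowid right
    after the insert, so its annotation rows can reference it directly.
    """
    cur = conn.cursor()
    with conn:  # one transaction for the whole import
        for rec in records:
            cur.execute("INSERT INTO document(text) VALUES (?)", (rec["text"],))
            doc_id = cur.lastrowid
            cur.executemany(
                "INSERT INTO annotation(document_id, start_offset, end_offset, label)"
                " VALUES (?, ?, ?, ?)",
                [(doc_id, s, e, lbl) for s, e, lbl in rec.get("entities", [])],
            )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE document(id INTEGER PRIMARY KEY, text TEXT)")
conn.execute(
    "CREATE TABLE annotation(id INTEGER PRIMARY KEY, document_id INTEGER,"
    " start_offset INTEGER, end_offset INTEGER, label TEXT)"
)
import_labeled(conn, [{"text": "EU rejects German call.", "entities": [[0, 2, "ORG"]]}])
```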

Second, as for the format, I am using a dictionary to store entities, because it is cleaner and JSON allows it, so why not. I am also adding 'title' to be able to track document external IDs.

{'title': 'doc1',
 'text': 'EU rejects German call to boycott British lamb.',
 'seq_annotation': [
     {'start': 0, 'end': 2, 'label': 'place'},
     {'start': 11, 'end': 17, 'label': 'place'},
     {'start': 34, 'end': 41, 'label': 'place'}
 ]
}

@BrambleXu
Contributor

@DSLituiev

First, accumulating annotations in memory is indeed not an elegant solution, but the first step is important. We might find a more elegant way while developing.

As for the second point about the format, list and dictionary both work well. We use a list because it matches the format used by spaCy.

@JoshuaPostel

@BrambleXu, will you please elaborate on:

If a user wants to import data with some labels, the user has to provide an external_id.

As I understand it, the metadata feature is primarily meant for external use (see #55 comment):

The main use-case is to be able to match exported data with your existing corpus by providing external_id in the imported file, although there are other use-cases.

Depending on a user-supplied value to link a document and annotation sounds riskier than using the document ForeignKey in the annotation tables.

Notes while trying to implement feature using ForeignKey

I also ran into some of the bulk_create concerns raised by @DSLituiev. The bulk_create documentation states:

If the model’s primary key is an AutoField it does not retrieve and set the primary key attribute, as save() does, unless the database backend supports it (currently PostgreSQL).

  1. A workaround I have attempted is to replace all models.AutoField with models.UUIDField in 0001_initial.py; however, I am running into issues getting models.UUIDField and sqlite3 to work together.

  2. The other approach would be to use transactions like @DSLituiev suggested. Replacing bulk_creates with this approach looks promising to me.

Do either of these approaches look acceptable? If so, let me know and I will continue to work in that direction.
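For the first workaround, the key point is that UUID primary keys can be generated client-side before the bulk insert, so the document-to-annotation links are known up front. A database-agnostic sketch, with plain dicts standing in for model instances:

```python
import uuid

def prepare_bulk_rows(records):
    """Pre-assign UUID primary keys so annotations can reference their document.

    Because the ids are generated before anything touches the database, the
    document/annotation links survive a bulk insert even on backends where
    bulk_create does not return AutoField primary keys.
    """
    documents, annotations = [], []
    for rec in records:
        doc_id = str(uuid.uuid4())
        documents.append({"id": doc_id, "text": rec["text"]})
        for start, end, label in rec.get("entities", []):
            annotations.append(
                {"document_id": doc_id, "start": start, "end": end, "label": label}
            )
    return documents, annotations
```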

@BrambleXu
Contributor

@JoshuaPostel

If a user wants to import data with some labels, the user has to provide an external_id.

This means using external_id to link a document with its corresponding labels. In my understanding, the document ForeignKey you mentioned is project, right? If so, how do you know which document has which labels?

When we import data, text is saved to the Document class and labels are saved to the Label class. But this saving is done by bulk_create, so we don't know which labels belong to which document. That is where external_id comes in. In the DocumentAnnotation class, we can use external_id to link Label.id to Document.id, so we know which document has which labels.

[Image: Models Structure]

Right now only the Document class has the metadata field (containing external_id). You could consider adding an external_id attribute to the Label class, so you can first bulk_create the Labels and Documents, and then link them together according to the task type.

To sum up: first, use bulk_create to create the Document and Label objects respectively; then use external_id to link Labels to Documents for a specific task (DocumentAnnotation and so on).

As for your two approaches to an implementation with ForeignKey, the second is better than the first.
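The linking step described above can be sketched in plain Python, with hypothetical row dicts standing in for the Django objects after bulk_create:

```python
def link_by_external_id(documents, label_rows):
    """Build DocumentAnnotation-style link rows by joining on external_id.

    documents and label_rows are dicts as they might look after bulk_create:
    'id' is the database primary key, 'external_id' came from the import file.
    """
    doc_by_ext = {doc["external_id"]: doc["id"] for doc in documents}
    return [
        {"document_id": doc_by_ext[row["external_id"]], "label_id": row["id"]}
        for row in label_rows
    ]
```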

@DSLituiev

So if I understand correctly, @BrambleXu, you suggest storing labels separately from the documents, right?

My assumption, and I believe @JoshuaPostel's as well, was that the labels are stored together with the documents. This way one does not need to keep track of an external_id / foreign key.

Though I believe that having something like external_id or document_id is generally good practice, storing labels separately from docs might be less desirable from the user's perspective.

@BrambleXu
Contributor

BrambleXu commented Feb 14, 2019

@DSLituiev

Yes, this is just my approach to implementing this feature; my idea is to change the model structure as little as possible.

Of course, it would be great to store labels and documents together without external_id. But this might be hard to achieve with bulk_create without a big change to the model structure. You can see that the Label class is related to the DocumentAnnotation and SequenceAnnotation task classes, and also to the Project class.

If we only consider this feature, it might be good to store labels and documents together. But considering the later implementation of new tasks (like relation annotation), is it really a good choice to change the model structure? Right now I cannot give a better solution due to my lack of data-structure design experience. That's why I choose to change the model as little as possible.

I am not against your approach, and I am also looking forward to seeing a better implementation.

@BrambleXu
Contributor

To those who are interested in contributing to this feature:

metadata feature -> optional shortcut enhancement -> import labeled dataset

The metadata feature and the optional shortcut enhancement are both implemented. As for the import labeled dataset feature, I won't work on it given its priority. Anyone who is interested in contributing to this feature is welcome.

This feature is a little 'big': it needs frontend work (extracting data from the database and showing the imported labels on the web page; Vue, HTML) and backend work (saving labels and documents to the database and linking labels to documents; Django). So I recommend that developers working on this feature first implement the backend part (as a PR). We can help implement the frontend part. Of course, it would be wonderful to see both sides finished in one PR.

@JoshuaPostel

My apologies, I should have been more explicit about how I have attempted to link document and annotation via the document ForeignKey. Hopefully the pseudocode below demonstrates that it is possible to link the annotation to the appropriate document and label without the use of metadata.

# modify json_to_documents to return annotations if 'label' is a field
documents, annotations = DataUpload.json_to_documents(project, file)

for document, annotation in zip(documents, annotations):
    document.id = uuid.uuid4()
    document.save()
    try:
        label = Label.objects.get(
            project=document.project,
            text=annotation.text)
    except Label.DoesNotExist:
        # see @DSLituiev's approach in #57
        label = create_label(...)
        label.save()
    annotation.document = document
    annotation.label = label
    annotation.save()

I am still working through the details of how to implement this efficiently, but I believe it is possible with transactions, since bulk_create itself is just a specific use of transaction.atomic (see the source code).

If this feature can be implemented using only the internal model, I think it should be (even if it requires a bit more work up front). I can see the metadata approach introducing many bugs down the road. For example, what happens if a user uploads two different documents with the same external_id? The table keys already manage these sorts of issues, and we can avoid adding more work for the user.

@JoshuaPostel

Regarding your comment about the model structure, @BrambleXu:

If we just consider this feature, it might be good to store labels and documents together. But considering later implementation for the new task (like relation annotation), is it really a good choice to change the model structure? Right now I can not give a better solution due to lack of data structure design experience. That's why I choose to change the model as less as possible.

I think the current model is OK. I imagine storing document and label together might cause issues for @serzh's work on polymorphism (#66).

@BrambleXu
Contributor

@JoshuaPostel
Thanks for your explanation. Your approach is clear to me now. I will ask @serzh about the status of #66.

The external_id is only for the advanced use case of linking to external data, and those users are not supposed to upload different documents with the same external_id, because they know exactly what they are doing.

@BrambleXu
Contributor

We have dug into the implementation details too much. It is time to step back and reconsider the purpose of this import-labeled-data feature.

Here I want to take a survey: for which use cases do you need this feature?

  1. Just confirming the labeled data by eye?
  2. Correcting wrongly labeled data?
  3. You need to label a very large dataset, which is impossible to do by hand. So you train a classifier with a little labeled data and predict the unlabeled data, then upload the predicted labels for correction, then download the corrected labeled data to train a better classifier, back and forth until you get good performance. This use case no longer needs to be considered once the auto-labeling feature is implemented.
  4. Augmenting the current dataset with other labeled data. This can be done manually without much effort.

If you have another use case, please tell us. I want to know which use case is the most suitable for this feature. This is very important for determining whether or not the feature is worth so much effort.

@JoshuaPostel

My use case would be 3, but I do not think the auto-labeling feature would be sufficient. My understanding of #18 is that it will only support spaCy models. The models in my use case would be difficult to incorporate.

Doccano is an outstanding annotation tool that could be useful for the multitude of human-in-the-loop NLP models that are beyond the scope of spaCy. Strong input/output functionality would expand Doccano's use cases.

@DSLituiev

DSLituiev commented Feb 15, 2019

@BrambleXu in my case it is a human-in-the-loop application (3). I am not sure I understand what you mean by 4, or what the difference is between 1 and 2.
For me, as for @JoshuaPostel, spaCy is far from doing what I need. For this purpose, having sklearn-style calls integrated would be great, i.e. a user provides a Python file with a defined model class:

# within modelfile.py
class MyModel:
    ...

model = MyModel(parameter=124)

Backend imports that class

from modelfile import model

and uses it to retrain the model on the new data after every 50 newly annotated instances (model.fit()), and updates the inference on the not-yet-manually-labelled set (model.predict()).
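That contract only needs sklearn-style fit/predict methods. A minimal sketch (the modelfile.py hook and retraining cadence are the proposal above, not an existing doccano API; the model here is a trivial stand-in):

```python
from collections import Counter

# Stand-in for the user-supplied class in the proposed modelfile.py; anything
# exposing sklearn-style fit()/predict() would satisfy the contract.
class MajorityLabelModel:
    """Trivial baseline that always predicts the most frequent training label."""

    def __init__(self):
        self.majority = None

    def fit(self, texts, labels):
        # Retrain from scratch on the current pool of annotated instances.
        self.majority = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, texts):
        return [self.majority for _ in texts]

model = MajorityLabelModel()
model.fit(["good match", "bad call", "great win"], ["pos", "neg", "pos"])
predictions = model.predict(["unseen text"])
```

The backend would call fit() after each batch of new annotations and predict() on the remaining unlabeled pool, exactly as described above.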

@machakux

machakux commented Feb 18, 2019

I would like to be able to import labelled datasets to review them, correct wrongly labelled data, continue labelling a partially labelled dataset, or add labelled data to an existing project (mostly use cases 1 and 2).

I think storing documents together with labels might simplify things.

If it is decided to store annotations together with the document, the document model could be something like:

class Document(models.Model):
    project = models.ForeignKey(Project, related_name='documents', on_delete=models.CASCADE)
    text = models.TextField()
    labels = models.TextField() #  or ManyToManyField() or ArrayField()
    annotations = models.TextField()   # or ManyToManyField() or JSONField() 
    seq2seq_annotations = models.TextField()  # or ManyToManyField()  or ArrayField()
    metadata = models.TextField(default='{}')  #  or JSONField()
    # ...

Django has several third-party packages like https://github.com/dmkoch/django-jsonfield which can provide a bit more flexible data structures. And if you are using PostgreSQL, Django has native built-in fields for JSON, arrays and more; see https://docs.djangoproject.com/en/dev/ref/contrib/postgres/fields/ .

Assuming the basic functionality does not involve updating existing documents, the import will not need to account for an external_id, although users might be allowed to upload one as metadata for their own future reference.

Allowing users to update existing documents through bulk upload could be limited to the admin interface or command-line interface, as advanced functionality for users who are sure of what they are doing. Here users could be allowed to provide the real id field (the object's primary key): if an object with the provided id already exists in the database it will be updated, otherwise a new object will be created.
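The update-or-create behaviour described here maps naturally onto an upsert. A sqlite3 sketch with a hypothetical document table (the ON CONFLICT clause needs SQLite 3.24+):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE document(id INTEGER PRIMARY KEY, text TEXT)")

def upsert_documents(conn, rows):
    """Insert each (id, text) row, updating the text if the id already exists."""
    with conn:
        conn.executemany(
            "INSERT INTO document(id, text) VALUES (?, ?)"
            " ON CONFLICT(id) DO UPDATE SET text = excluded.text",
            rows,
        )

upsert_documents(conn, [(1, "first version")])
upsert_documents(conn, [(1, "corrected version"), (2, "new doc")])
```

In Django the equivalent would be update_or_create per row, or bulk_create with update_conflicts on newer versions.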

If documents and annotations are stored together, it may also be easier to utilize existing tools like django-import-export, especially for imports via the admin interface.

@ismaeIfm
Author

Same as @machakux:

I would like to be able import labelled datasets to review, correct wrongly labeled data, continue labelling a partially labeled dataset or to add labelled data to an existing project (mostly use cases 1,2).

I'm also against using the spaCy format; doccano should have a very simple, generic input and output. I think a particular format can easily be handled in a preprocessing step. I've always thought of doccano as the labelImg for text.

Hironsan added a commit that referenced this issue Feb 26, 2019
Corresponding to issue #11
Hironsan added a commit that referenced this issue Feb 26, 2019
Corresponding to issue #11
phoerious added a commit to phoerious/doccano that referenced this issue Feb 26, 2019
@phoerious

I created a pull request that implements re-import of existing data with annotations.

phoerious added a commit to phoerious/doccano that referenced this issue Feb 26, 2019
phoerious added a commit to phoerious/doccano that referenced this issue Feb 26, 2019
Hironsan added a commit that referenced this issue Mar 1, 2019
phoerious added a commit to phoerious/doccano that referenced this issue Mar 5, 2019
@lfoppiano

Just a comment about this feature: the import of an existing dataset should also work with already tokenised input data.

This way I can generate the annotations automatically with any tool and keep the same tokenisation while correcting the annotations.

{
  "spans": [{"start": 58, "end": 72, "token_start": 16, "token_end": 25, "label": "supercon"}],
  "tokens": [{"text": "We", "start": 0, "end": 2, "id": 0}, {"text": " ", "start": 2, "end": 3, "id": 1}],
  "text": "We have measured the [...]"
}

What do you think?
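A quick consistency check for that layout could verify that each span's character offsets line up with its token boundaries (a sketch over the fields shown, assuming token ids are unique within a record):

```python
def check_span_alignment(record):
    """Verify each span's character offsets match its token boundaries.

    Assumes the layout above: spans carry start/end character offsets plus
    token_start/token_end ids, and each token carries its own offsets and id.
    """
    tokens_by_id = {tok["id"]: tok for tok in record["tokens"]}
    for span in record["spans"]:
        first = tokens_by_id[span["token_start"]]
        last = tokens_by_id[span["token_end"]]
        if span["start"] != first["start"] or span["end"] != last["end"]:
            return False
    return True
```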

@phoerious

phoerious commented Mar 6, 2019

Until doccano has a built-in feature that does anything with tokens, you can simply put them in the metadata, and with my patch above they will be imported (and later properly exported again) alongside the text and annotations.

@Hironsan
Member

Hironsan commented Mar 12, 2019

I have thoroughly redesigned the APIs and models and added support for importing labeled datasets.

Task × format support is as follows:

| Task | Plain | CSV | JSON | CoNLL |
| --- | --- | --- | --- | --- |
| Text Classification | ○ | ○ (single label) | ○ | X |
| Sequence Labeling | ○ | X | ○ | ○ |
| Seq2seq | ○ | ○ | ○ | X |

We can confirm the detailed format on the upload page:

[Image: upload page with detailed format examples]

This is not a perfect feature; it is the first step.
There are some bugs and performance problems, so your opinions and feedback are welcome.

Thank you for your feedback and contribution.

@JoshuaPostel

@Hironsan thank you very much for your hard work implementing this feature, PR #110 looks great!

@JoshuaPostel

If it is of any use, below is a summary of my tests uploading labeled classification documents in JSON format.

  • 4K documents took ~10 seconds to upload
  • 40K documents took ~1.5 minutes to upload
  • 400K documents took ~13 minutes to upload

During this time, all other users and URLs hang (I don't have much experience with this, but I believe it is due to threading). These times are roughly twice as long as when I tested @phoerious's approach in #97.

I think the times above are reasonable given the scope/workflow of Doccano and do not need further optimization presently.

Projects
v1.0.0 · To Do

Successfully merging a pull request may close this issue.