Import labeled dataset #11
Thank you for your feature request! We have already heard requests to import annotated datasets.
Also, I think it's important to have a flag for verified annotations, similar to the labelImg feature.
How can I import a labeled translation dataset? Some labeled data contains wrong labels, so it would be useful to be able to edit those labels.
Hey guys, what is the status of importing labeled data? It looks like a very important feature.
This feature will progress after the JSON export feature is implemented. Such labeled data could then be imported as a JSON file.
+1 on this feature. It would be great to have the following two capabilities:
I saw you have an open branch with a template, but it hasn't been updated in 5 months. Is this something you'd be open to contributions for, or are you working on a release? https://github.com/chakki-works/doccano/tree/feature/auto_labeling
@aribornstein Right now I am not working on this feature. I am planning to implement it using the metadata feature proposed by @serzh, so it has to wait until #55 is merged.
Here I share my thoughts about this feature, taken from #57 (comment). If you have any advice, please share it with us.

**How to link labels to documents**

In order to link annotations with documents, we can make use of the metadata feature (#55). If a user imports data with an `external_id`, it will be contained in the export file. The input file may look like this:

```json
{"text": "EU rejects German call to boycott British lamb.", "external_id": 1}
```

and the exported file will look like this:

```json
{"doc_id": 2023, "text": "EU rejects German call to boycott British lamb.", "labels": ["news"], "username": "root", "metadata": {"external_id": 1}}
```

**Input and output example for the document classification task**

If a user wants to import data with some labels, the user has to provide a `labels` key:

```json
{"text": "EU rejects German call to boycott British lamb.", "external_id": 1, "labels": ["label1", "label2"]}
```

In the annotation process, we add more labels, and the exported file will look like this:

```json
{"doc_id": 2023, "text": "EU rejects German call to boycott British lamb.", "labels": ["label1", "label2", "label3"], "username": "root", "metadata": {"external_id": 1}}
```

**Input and output example for the sequence labeling task**

Just for a simple demo, I only show the JSON format; the CSV format follows the same idea. For the sequence labeling task, the imported annotation key should be named `entities`:

```json
{"text": "EU rejects German call to boycott British lamb.", "external_id": 1, "entities": [[0, 3, "ORG"], [11, 17, "GPE"]]}
```

In the annotation process, we add more entity annotations, and the exported file will look like this:

```json
{"doc_id": 2023, "text": "EU rejects German call to boycott British lamb.", "entities": [[0, 3, "ORG"], [11, 17, "GPE"], [34, 41, "GPE"]], "username": "root", "metadata": {"external_id": 1}}
```

**About the shortcut**

Right now, each label has to be assigned a shortcut when the label is created. I am working on #73 so that the shortcut becomes optional. In that case, if the user imports data with labels, those labels can have no shortcut. This will be helpful for importing labeled data.

**Import labeled dataset implementation process**
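To make the proposed JSONL import format concrete, here is a minimal parsing and validation sketch. This is my own illustration, not doccano's actual importer; `parse_labeled_line` is a hypothetical helper:

```python
import json

def parse_labeled_line(line):
    """Parse one JSONL line of the proposed sequence labeling
    import format and validate the entity character offsets."""
    record = json.loads(line)
    text = record["text"]
    for start, end, label in record.get("entities", []):
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"bad span ({start}, {end}) for label {label!r}")
    return record

line = ('{"text": "EU rejects German call to boycott British lamb.", '
        '"external_id": 1, "entities": [[0, 3, "ORG"], [11, 17, "GPE"]]}')
record = parse_labeled_line(line)
print(record["entities"])  # the validated entity spans
```

A check like this at upload time would reject malformed spans before they ever reach the database.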
@BrambleXu First, there is a design compatibility issue with the current master branch. Second, as to the format, I am using a dictionary to store entities because it is cleaner and JSON allows it, so why not. I am also adding 'title' to be able to track document external IDs.
First, accumulating annotations in memory is indeed not an elegant solution, but the first step is important. We might find a more elegant way while developing. As for the second point about the format, both a list and a dictionary work well. We use a list because it matches the format used by spaCy.
@BrambleXu, will you please elaborate on:
As I understand it, the metadata feature is primarily meant for external use (see #55 comment):
Depending on a user-supplied value to link a document and an annotation sounds riskier than using the internal model. I made some notes while trying to implement the feature using `metadata`.
This means using `external_id`. When we import data, the text would be saved to the `Document` model. Right now, only the `Document` class has the `metadata` field (containing `external_id`). To sum up: first, use the `external_id` stored in `metadata` to link imported annotations to documents.
So if I understand correctly, @BrambleXu, you suggest storing labels separately from the documents, right? My assumption, and I believe @JoshuaPostel's as well, was that the labels are stored together with the documents. This way one does not need to keep track of an 'external_id' / foreign key. Though I believe that having something like external_id or document_id is generally good practice, storing labels separately from documents might be less desirable from the user's perspective.
Yes, this is just my approach to implementing this feature, and my intent is to avoid changing the model structure as far as possible. Of course, it would be great to store labels and documents together without an `external_id`. If we consider only this feature, storing labels and documents together might be good. But considering later implementations for new tasks (like relation annotation), is it really a good choice to change the model structure? Right now I cannot give a better solution due to my lack of data-structure design experience; that's why I chose to change the model as little as possible. I am not against your approach, and I am also looking forward to seeing a better implementation.
To those who are interested in contributing to this feature:

This feature is a little 'big'. It needs frontend work (extracting data from the database and showing the imported labels on the web page; Vue, HTML) and backend work (saving labels and documents to the database and linking labels to documents; Django). So I recommend that developers working on this feature first implement the backend part (as a PR). We can help implement the frontend part. Of course, it would be wonderful to see both sides finished in one PR.
My apologies, I should have been more explicit about how I have attempted to link documents and annotations:

```python
# modify json_to_documents to return annotations if 'label' is a field
documents, annotations = DataUpload.json_to_documents(project, file)
for document, annotation in zip(documents, annotations):
    document.id = uuid.uuid4()
    document.save()
    try:
        label = Label.objects.get(
            project=document.project,
            text=annotation.text)
    except Label.DoesNotExist:
        # see @DSLituiev's approach in #57
        label = create_label(...)
        label.save()
    annotation.document = document.id
    annotation.label = label.id
    annotation.save()
```

I am still working through the details of how to implement this efficiently, but I believe it is possible with transactions. If this feature can be implemented using only the internal model, I think it should be (even if it requires a bit more work up front). I can see the metadata approach introducing many bugs down the road. For example, what happens if the user uploads two different documents with the same `external_id`?
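The duplicate-`external_id` concern could be caught before anything touches the database. A minimal pre-import check, as a sketch only (`find_duplicate_external_ids` is a hypothetical helper, not part of doccano):

```python
import json
from collections import Counter

def find_duplicate_external_ids(jsonl_lines):
    """Return the external_ids that appear on more than one imported
    document, so the upload can be rejected (or warned about) early."""
    counts = Counter(
        json.loads(line).get("external_id") for line in jsonl_lines
    )
    return sorted(eid for eid, n in counts.items()
                  if eid is not None and n > 1)

lines = [
    '{"text": "doc one", "external_id": 1}',
    '{"text": "doc two", "external_id": 2}',
    '{"text": "doc three", "external_id": 1}',
]
print(find_duplicate_external_ids(lines))  # [1]
```

Running such a check up front turns a silent data-integrity bug into an explicit validation error at upload time.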
In regards to your comment about model structure @BrambleXu:
I think the current model is ok. I imagine storing |
We have dug into the implementation details too much. It is time to step back and reconsider the purpose of this feature. Here I want to take a survey: for which use case do you need this feature?

If you have another use case, please tell us. I want to know which use case is the most suitable for this feature. This will be very important for determining whether or not this feature is worth so much effort.
My use case would be 3, but I do not think the auto-labeling feature would be sufficient. My understanding of #18 is that it will only support spaCy models, and the models in my use case would be difficult to incorporate. Doccano is an outstanding annotation tool that could be useful for the multitude of human-in-the-loop NLP models that are beyond the scope of spaCy. Strong input/output functionality would expand the use cases of Doccano.
@BrambleXu in my case it is the human-in-the-loop application (3). I am not sure I get what you mean by 4, or what the difference is between 1 and 2.

The backend imports that class and uses it to retrain the model with new data after every 50 newly annotated instances.
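The retrain-every-50-annotations loop described above can be sketched with a simple counter. This is my own illustration; the class and callback names are not from any real codebase:

```python
class RetrainScheduler:
    """Invoke a retrain callback after every `batch_size` new annotations."""

    def __init__(self, retrain, batch_size=50):
        self.retrain = retrain        # e.g. wraps model fitting on new data
        self.batch_size = batch_size
        self.pending = 0              # annotations seen since last retrain

    def on_annotation(self, annotation):
        self.pending += 1
        if self.pending >= self.batch_size:
            self.retrain()
            self.pending = 0

runs = []
scheduler = RetrainScheduler(retrain=lambda: runs.append("retrained"),
                             batch_size=50)
for i in range(120):
    scheduler.on_annotation(i)
print(len(runs))  # 2 retrains triggered after 120 annotations
```

In a real deployment the retrain call would typically be pushed onto a background queue so annotators never wait on model fitting.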
I would like to be able to import labeled datasets to review, correct wrongly labeled data, continue labeling a partially labeled dataset, or add labeled data to an existing project (mostly use cases 1 and 2). I think storing documents together with labels might simplify things. If it is decided to store annotations together with the document, the document model could be something like:

```python
class Document(models.Model):
    project = models.ForeignKey(Project, related_name='documents', on_delete=models.CASCADE)
    text = models.TextField()
    labels = models.TextField()  # or ManyToManyField() or ArrayField()
    annotations = models.TextField()  # or ManyToManyField() or JSONField()
    seq2seq_annotations = models.TextField()  # or ManyToManyField() or ArrayField()
    metadata = models.TextField(default='{}')  # or JSONField()
    # ...
```

Django has several third-party packages like https://github.com/dmkoch/django-jsonfield which can be used to provide somewhat more flexible data structures. And if you are using PostgreSQL, Django has native/built-in fields for JSON, arrays, and more; see https://docs.djangoproject.com/en/dev/ref/contrib/postgres/fields/ .

Assuming the basic functionality does not involve updating existing documents, the import will not need to account for an existing document id. Allowing users to update existing documents through bulk upload could be limited to the admin interface or a command-line interface, as advanced functionality for users who are sure of what they are doing. There, users could be allowed to provide the real document ids.

If documents and annotations are stored together, it may also be easier to utilize existing tools like django-import-export, especially for imports via the admin interface.
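One caveat with `TextField`-backed JSON (as opposed to a real `JSONField`): the application must serialize and deserialize the data itself. A minimal sketch of that round trip using the standard `json` module (the helper names are mine, for illustration only):

```python
import json

def pack_metadata(data):
    """Serialize a metadata dict for storage in a plain TextField."""
    return json.dumps(data, ensure_ascii=False)

def unpack_metadata(stored):
    """Deserialize TextField contents back into a dict.

    Empty or missing values fall back to the model's '{}' default,
    mirroring `metadata = models.TextField(default='{}')` above.
    """
    return json.loads(stored or '{}')

stored = pack_metadata({"external_id": 1})
print(unpack_metadata(stored))  # {'external_id': 1}
```

A native `JSONField` makes this transparent and additionally allows querying inside the JSON, which is why the PostgreSQL fields linked above are attractive.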
Same as @machakux:
I'm also against using the spaCy format; doccano should have a very simple, generic input and output. I think a particular format can easily be handled in a preprocessing step. I've always thought of doccano as the labelImg for text.
I created a pull request that implements re-import of existing data with annotations. |
Just a comment about this feature: the import of an existing dataset should also work with already-tokenized input data. That way I can generate the annotations automatically with any tool and keep the same tokenization while correcting the annotations.

```json
{
  "spans": [{"start": 58, "end": 72, "token_start": 16, "token_end": 25, "label": "supercon"}],
  "tokens": [{"text": "We", "start": 0, "end": 2, "id": 0}, {"text": " ", "start": 2, "end": 3, "id": 1}],
  "text": "We have measured the [...]"
}
```

What do you think?
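The `token_start`/`token_end` values in such a format can be derived from the character offsets, so a tool would not need to store them redundantly. A hedged sketch (`align_span_to_tokens` is my own hypothetical helper):

```python
def align_span_to_tokens(span, tokens):
    """Given a character-offset span and a token list (each token a dict
    with 'start', 'end', 'id'), return the (first, last) token ids the
    span overlaps."""
    covering = [t["id"] for t in tokens
                if t["start"] < span["end"] and t["end"] > span["start"]]
    if not covering:
        raise ValueError("span does not overlap any token")
    return min(covering), max(covering)

tokens = [{"text": "We",   "start": 0, "end": 2, "id": 0},
          {"text": " ",    "start": 2, "end": 3, "id": 1},
          {"text": "have", "start": 3, "end": 7, "id": 2}]
print(align_span_to_tokens({"start": 0, "end": 7}, tokens))  # (0, 2)
```

Conversely, if an imported span's edges do not line up with token boundaries, the importer could reject it, which preserves the external tool's tokenization exactly as requested above.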
Until doccano has some built-in feature that does anything with tokens, you can simply put them in the metadata; they will be imported (and later properly exported again) alongside the text and annotations with my patch from above.
I thoroughly redesigned the APIs and models and added support for labeled dataset import. The task × format matrix is as follows:

You can confirm the detailed format on the upload page. This is not a perfect feature; it is the first step. Thank you for your feedback and contributions.
If it is of any use, below is a summary of my tests uploading labeled classification documents in JSON format.

During this time, all other users and URLs hang (I don't have much experience with this, but I believe it is due to threading). These times are roughly twice as long as when I tested @phoerious's approach in #97. I think the times above are reasonable given the scope/workflow of Doccano and do not need further optimization at present.
Feature request: import labeled datasets in BIO format, like:

By the way, I love your tool; thanks for making it open source!
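Converting BIO-tagged tokens into doccano-style `[start, end, label]` character offsets could be done as a preprocessing step, as suggested earlier in this thread. A minimal sketch, assuming space-joined tokens (`bio_to_entities` is a hypothetical helper, not a doccano API):

```python
def bio_to_entities(tokens, tags):
    """Convert parallel token/BIO-tag lists into doccano-style
    [start, end, label] character-offset entities over the
    space-joined text."""
    entities, text, offset = [], "", 0
    open_ent = None                    # currently open [start, end, label]
    for token, tag in zip(tokens, tags):
        start, end = offset, offset + len(token)
        if tag.startswith("B-"):
            open_ent = [start, end, tag[2:]]
            entities.append(open_ent)
        elif tag.startswith("I-") and open_ent:
            open_ent[1] = end          # extend the open entity
        else:
            open_ent = None            # "O" tag closes any open entity
        text += token + " "
        offset = end + 1               # +1 for the joining space
    return text.rstrip(), entities

text, ents = bio_to_entities(
    ["EU", "rejects", "German", "call"],
    ["B-ORG", "O", "B-MISC", "O"])
print(ents)  # [[0, 2, 'ORG'], [11, 17, 'MISC']]
```

This keeps doccano's input format simple and generic while still accommodating BIO corpora such as CoNLL-style data.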