Update/doccano 1.5.5 (#14)

* Fixing Data Annotation Issues

  When uploading datasets, the code uses a `bulk_create` to upload Examples and Labels. It then filters the data from the database based on when it was created. However, [Django doesn't enforce the list order when calling filter](https://stackoverflow.com/questions/7163640/what-is-the-default-order-of-a-list-returned-from-a-django-filter-call) unless ordering is specified. The previous behavior mismatched labels and examples. When this was shown in the UI, the data would show labels for incorrect examples (i.e. a label for message #2 would be shown on message #1).

  This fix enforces that the data is returned in the order it was inserted so that the data, label pair is as expected.

* move later to copy files in Dockerfile.prod
* fix client-side types about comment as backend returns
* add annotation link in commentList page
* Add admin interface for AutoLabelingConfigs.

  Solves doccano#1423
  Thanks to @uklft for the idea.

* Sort imports
* Return a Response with a status if the task is not yet ready.
* Remove unneeded query

  Bulk create returns the created objects in the same order as they have been added. In Postgres, the query was wrong, because ordering was not guaranteed.

* Remove unneed import
* removing debugging statement
* iss1348: fix colors when importing labels

  Signed-off-by: Dimid Duchovny <dimidd@localize.city>

* Updated various dependency and image versions
* Python version pinning fix
* update cloudformation template to modify the sample env file, now that all the config params are stored in environment variables as per commit 5728636
* show a check button for annotators
* filter by role in the confirm API
* add a property to the ExampleState model
* separate confirm status for each role or user
* fix flake8
* fix TestExampleStateConfirmCollaborative
* fix isort
* move ExampleSerializer tests to test_document.py
* add tests
* Sequence labelling: fix background color in dark mode
* add confirmed count to statistics api response
* receive confirmed count value in frontend statistics models
* make progress data per role
* show progress of each role
* not display legend of bar-chart
* Increase the allowed max length for uploaded dataset filepath
* Bump django from 3.2.4 to 3.2.5

  Bumps [django](https://github.com/django/django) from 3.2.4 to 3.2.5.
  - [Release notes](https://github.com/django/django/releases)
  - [Commits](django/django@3.2.4...3.2.5)

  ---
  updated-dependencies:
  - dependency-name: django
    dependency-type: direct:production
  ...

  Signed-off-by: dependabot[bot] <support@github.com>

* Add EntityEditor
* Fix flake8 warnings
* Update Dockerfiles
* Add v-annotator
* Update ner demo
* Update sequence labeling page
* Support RTL in sequence labeling
* Update index.md
* Update package
* Add fields to SequenceLabelingProject
* Update serializer in ProjectDetail
* Enable to handle allowOverlapping and graphemeMode option in sequence labeling page
* Enable to create project with allowOverlapping and graphemeMode option
* Remove unused import
* Update v-annotator to fix the problem

  The problem occurred when the user changes the state of RTL. Once the state changes, the entities are visually disappeared.

* Show shortcut key on menu
* Add explanation for nested mode
* Add explanation for grapheme mode
* Update shortcut on menu
* Update package version
* Enable to pass grapheme-mode to EntityEditor.vue
* Add explanation for project creation
* Support doccano init on windows
* Fix cli
* Add dependency, fix doccano#1481
* Update cli, fix doccano#1408
* Add explanation on create user, close doccano#1410
* Update faq, close doccano#1496
* Remove old tests
* Update test config
* Update components, fix doccano#1541
* Add test for FormGuideline component
* Update the name of test case
* Apply linter
* Update eslint config
* Update docker-compose.dev.yml, fix doccano#1536
* Change example id from auto field to uuid field
* Update import method of urls
* Add test cases for ingest classification data
* Move test data
* Rename classification.jsonl
* Fix CoNLLDataset
* Add test cases for ingesting sequence labeling data
* Refactor test_tasks.py
* Move test data
* Add test cases for ingesting seq2seq data
* Update test cases for ingesting data to check mapping
* Improve error handling for jsonl parser
* Improve error handling for json parser
* Improve error handling for excel parser
* Add csv test case
* Add conll test case
* Change doc/example id type from number to string
* Update order of examples
* Revert primary key change
* Add migration file
* Update task queue command to support windows
* Create FUNDING.yml
* Update README.md
* Update compose files, fix doccano#1546
* Update CsvWriter, fix doccano#1497
* Sort exported labels, fix doccano#1466
* Add keyboard shortcut back to accept button
* Add how to use PostgreSQL
* Assign label colors automatically
* Add a test case for generating color function
* Fix typo: injest -> ingest
* Add PostgreSQL related env in docker compose mode
* Update README.md
* Add a validator to the text field
* Enable to ingest lines without errors even if an exception occurs during parsing
* Fix TextLineDataset to raise exception
* Enable to delete relation if one of the entities are deleted
* Update Span model
* Add a migration
* Refactor CoNLLDataset
* Enable to return line number of exception occured
* Update Cleaner to change error the message by the project type
* Install mdi font
* Set icons locally
* Support offline font
* Remove font awesome script
* Add a demo image to show it in offline environment
* Fix speech to text demo
* Remove unused scripts
* Update publish-image.yml
* Enable to list all labels
* Fix unique constraint
* Add clean up after closing menu
* Update the way of clean up selected items
* Wrap by nexttick
* Update Dockerfile to change the default value of DEBUG, fix doccano#1457
* Update cleanup method
* Update unique constraint of Span
* Handle unique constraint exception
* Add try/catch to update/delete method
* Show number of deleting rows only in confirm dialog, resolve doccano#1077
* Speed up fetching comment

Co-authored-by: zanussbaum <zanussbaum@gmail.com>
Co-authored-by: youichiro <cinnamon416@gmail.com>
Co-authored-by: ayanamizuta <ayanamizuta832@gmail.com>
Co-authored-by: Roland Szabo <rolisz@gmail.com>
Co-authored-by: Dimid Duchovny <dimidd@localize.city>
Co-authored-by: rcarew@xelerance.com <rcarew@xelerance.com>
Co-authored-by: Dale Evans <dale.evans@mycanadapayday.com>
Co-authored-by: Colin Darie <colin@darie.eu>
Co-authored-by: Yosua Michael M <yosua.maranatha@grabtaxi.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Hironsan <light.tree.1.13@gmail.com>
Co-authored-by: Hiroki Nakayama <hiroki.nakayama.py@gmail.com>
Co-authored-by: Talha Oz <oztalha@users.noreply.github.com>
Co-authored-by: Fynn Schmitt-Ulms <fynnsu@outlook.com>
Co-authored-by: Zader Zheng <yumaoshu@gmail.com>
Co-authored-by: Gerhard Haß <gerhard.hass@neofonie.de>
ghontolux · Jan 25, 2022 · b490247
1 parent cdc31e2
commit b490247
Showing 92 changed files with 1,205 additions and 702 deletions.
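
Note on the ordering fix described in the commit message: the sketch below illustrates why pairing labels with the list returned by `bulk_create` is safe, while pairing them with an unordered query result is not. It is a stand-in written in plain Python, not doccano or Django code; `FakeTable`, `filter_unordered`, and `pair_labels` are hypothetical names introduced only for this illustration.

```python
import random


class FakeTable:
    """Hypothetical stand-in for a database table (not Django)."""

    def __init__(self):
        self.rows = []

    def bulk_create(self, objs):
        # Like the behavior the commit message describes, return the
        # created objects in the same order they were passed in.
        self.rows.extend(objs)
        return list(objs)

    def filter_unordered(self, seed):
        # A query without an explicit ordering may return rows in any
        # order; a seeded shuffle simulates that here.
        rows = list(self.rows)
        random.Random(seed).shuffle(rows)
        return rows


def pair_labels(examples, labels):
    # Pair the i-th example with the i-th label.
    return list(zip(examples, labels))


table = FakeTable()
labels = ["label1", "label2", "label3"]
created = table.bulk_create(["msg1", "msg2", "msg3"])

# Safe: the returned list keeps insertion order, so pairs line up.
pairs = pair_labels(created, labels)

# Unsafe: the same rows come back, but their order is not guaranteed.
refetched = table.filter_unordered(seed=0)
```

In real Django code the alternative is to add an explicit `order_by(...)` to the query; which of the two approaches each part of this commit uses should be checked against the diff itself.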
diff --git a/.github/workflows/publish-image.yml b/.github/workflows/publish-image.yml
@@ -5,10 +5,9 @@ on:
     - cron: '0 10 * * *' # everyday at 10am
   push:
     branches:
-      - '**'
+      - master
     tags:
       - 'v*.*.*'
-  pull_request:
 
 jobs:
   docker:

diff --git a/Dockerfile b/Dockerfile
@@ -57,7 +57,7 @@ RUN chown -R doccano:doccano .
 VOLUME /data
 ENV DATABASE_URL="sqlite:////data/doccano.db"
 
-ENV DEBUG="True"
+ENV DEBUG="False"
 ENV SECRET_KEY="change-me-in-production"
 ENV PORT="8000"
 ENV WORKERS="2"

diff --git a/README.md b/README.md
@@ -68,6 +68,49 @@ doccano task
 
 Go to <http://127.0.0.1:8000/>.
 
+By default, sqlite3 is used for the default database. If you want to use PostgreSQL, install the additional dependency:
+
+```bash
+pip install 'doccano[postgresql]'
+```
+
+Create an .env file with variables in the following format, each on a new line:
+
+```bash
+POSTGRES_USER=doccano
+POSTGRES_PASSWORD=doccano
+POSTGRES_DB=doccano
+```
+
+Then, pass it to docker run with the --env-file flag:
+
+```bash
+docker run --rm -d \
+    -p 5432:5432 \
+    -v postgres-data:/var/lib/postgresql/data \
+    --env-file .env \
+    postgres:13.3-alpine
+```
+
+And set `DATABASE_URL` environment variable:
+
+```bash
+# Please replace each variable.
+DATABASE_URL=postgres://${POSTGRES_USER}:${POSTGRES_PASSWORD}@localhost:5432/${POSTGRES_DB}?sslmode=disable
+```
+
+Now run the command as before:
+
+```bash
+doccano init
+doccano createuser --username admin --password pass
+doccano webserver --port 8000
+
+# In another terminal.
+# Don't forget to set DATABASE_URL
+doccano task
+```
+
 ### Docker
 
 As a one-time setup, create a Docker container as follows:
@@ -107,12 +150,22 @@ _Note for Windows developers:_ Be sure to configure git to correctly handle line
 git clone https://github.com/doccano/doccano.git --config core.autocrlf=input
 ```
 
-Set the superuser account credentials in the `./config/env.example` file:
+Then, create an `.env` file with variables in the following format(see [./config/.env.example](https://github.com/doccano/doccano/blob/master/config/.env.example)):
 
 ```plain
+# platform settings
 ADMIN_USERNAME=admin
 ADMIN_PASSWORD=password
 ADMIN_EMAIL=admin@example.com
+
+# rabbit mq settings
+RABBITMQ_DEFAULT_USER=doccano
+RABBITMQ_DEFAULT_PASS=doccano
+
+# database settings
+POSTGRES_USER=doccano
+POSTGRES_PASSWORD=doccano
+POSTGRES_DB=doccano
 ```
 
 #### Production

diff --git a/backend/api/migrations/0018_alter_label_background_color.py b/backend/api/migrations/0018_alter_label_background_color.py
@@ -0,0 +1,19 @@
+# Generated by Django 3.2.8 on 2021-11-17 05:56
+
+import api.models
+from django.db import migrations, models
+
+
+class Migration(migrations.Migration):
+
+    dependencies = [
+        ('api', '0017_example_uuid'),
+    ]
+
+    operations = [
+        migrations.AlterField(
+            model_name='label',
+            name='background_color',
+            field=models.CharField(default=api.models.generate_random_hex_color, max_length=7),
+        ),
+    ]
diff --git a/backend/api/migrations/0019_auto_20211124_0506.py b/backend/api/migrations/0019_auto_20211124_0506.py
@@ -0,0 +1,30 @@
+# Generated by Django 3.2.8 on 2021-11-24 05:06
+
+from django.db import migrations, models
+import django.db.models.expressions
+
+
+class Migration(migrations.Migration):
+
+    dependencies = [
+        ('api', '0018_alter_label_background_color'),
+    ]
+
+    operations = [
+        migrations.AlterUniqueTogether(
+            name='span',
+            unique_together=set(),
+        ),
+        migrations.AddConstraint(
+            model_name='span',
+            constraint=models.CheckConstraint(check=models.Q(('start_offset__gte', 0)), name='startOffset >= 0'),
+        ),
+        migrations.AddConstraint(
+            model_name='span',
+            constraint=models.CheckConstraint(check=models.Q(('end_offset__gte', 0)), name='endOffset >= 0'),
+        ),
+        migrations.AddConstraint(
+            model_name='span',
+            constraint=models.CheckConstraint(check=models.Q(('start_offset__lt', django.db.models.expressions.F('end_offset'))), name='start < end'),
+        ),
+    ]
diff --git a/backend/api/migrations/0020_merge_20211203_1558.py b/backend/api/migrations/0020_merge_20211203_1558.py
@@ -0,0 +1,14 @@
+# Generated by Django 3.2.9 on 2021-12-03 15:58
+
+from django.db import migrations
+
+
+class Migration(migrations.Migration):
+
+    dependencies = [
+        ('api', '0018_merge_20211110_1607'),
+        ('api', '0019_auto_20211124_0506'),
+    ]
+
+    operations = [
+    ]
diff --git a/backend/api/models.py b/backend/api/models.py
@@ -1,3 +1,4 @@
+import random
 import string
 import uuid
 from typing import Literal
@@ -106,6 +107,10 @@ def is_task_of(self, task: Literal['text', 'image', 'speech']):
         return task == 'image'
 
 
+def generate_random_hex_color():
+    return f'#{random.randint(0, 0xFFFFFF):06x}'
+
+
 class Label(models.Model):
     text = models.CharField(max_length=100, db_index=True)
     prefix_key = models.CharField(
@@ -131,7 +136,7 @@ class Label(models.Model):
         on_delete=models.CASCADE,
         related_name='labels'
     )
-    background_color = models.CharField(max_length=7, default='#209cee')
+    background_color = models.CharField(max_length=7, default=generate_random_hex_color)
     text_color = models.CharField(max_length=7, default='#ffffff')
     created_at = models.DateTimeField(auto_now_add=True, db_index=True)
     updated_at = models.DateTimeField(auto_now=True)
@@ -289,18 +294,36 @@ class Span(Annotation):
     start_offset = models.IntegerField()
     end_offset = models.IntegerField()
 
-    def clean(self):
-        if self.start_offset >= self.end_offset:
-            raise ValidationError('start_offset > end_offset')
+    def validate_unique(self, exclude=None):
+        allow_overlapping = getattr(self.example.project, 'allow_overlapping', False)
+        is_collaborative = self.example.project.collaborative_annotation
+        if allow_overlapping:
+            super().validate_unique(exclude=exclude)
+            return
+
+        overlapping_span = Span.objects.exclude(id=self.id).filter(example=self.example).filter(
+            models.Q(start_offset__gte=self.start_offset, start_offset__lt=self.end_offset) |
+            models.Q(end_offset__gt=self.start_offset, end_offset__lte=self.end_offset) |
+            models.Q(start_offset__lte=self.start_offset, end_offset__gte=self.end_offset)
+        )
+        if is_collaborative:
+            if overlapping_span.exists():
+                raise ValidationError('This overlapping is not allowed in this project.')
+        else:
+            if overlapping_span.filter(user=self.user).exists():
+                raise ValidationError('This overlapping is not allowed in this project.')
+
+    def save(self, force_insert=False, force_update=False, using=None,
+             update_fields=None):
+        self.full_clean()
+        super().save(force_insert, force_update, using, update_fields)
 
     class Meta:
-        unique_together = (
-            'example',
-            'user',
-            'label',
-            'start_offset',
-            'end_offset'
-        )
+        constraints = [
+            models.CheckConstraint(check=models.Q(start_offset__gte=0), name='startOffset >= 0'),
+            models.CheckConstraint(check=models.Q(end_offset__gte=0), name='endOffset >= 0'),
+            models.CheckConstraint(check=models.Q(start_offset__lt=models.F('end_offset')), name='start < end')
+        ]
 
 
 class EntitySpan(Annotation):
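
Note on the overlap query in `Span.validate_unique`: the three `Q` clauses can be read as a predicate on half-open intervals. The sketch below restates them in plain Python (no Django) and checks that, when both spans satisfy `start < end` as the new check constraints require, the three clauses together match the usual two-comparison overlap test. The function names are hypothetical.

```python
from itertools import combinations


def overlaps_like_query(new_start, new_end, start, end):
    """Mirror the three Q clauses; (start, end) is an existing span."""
    # Existing span starts inside the new span.
    starts_inside = new_start <= start < new_end
    # Existing span ends inside the new span.
    ends_inside = new_start < end <= new_end
    # Existing span covers the new span entirely.
    covers = start <= new_start and end >= new_end
    return starts_inside or ends_inside or covers


def overlaps_simple(new_start, new_end, start, end):
    # Standard overlap test for half-open intervals.
    return start < new_end and new_start < end


# Every valid span (start < end) with offsets in 0..6.
spans = list(combinations(range(7), 2))
equivalent = all(
    overlaps_like_query(a, b, c, d) == overlaps_simple(a, b, c, d)
    for (a, b) in spans
    for (c, d) in spans
)

adjacent = overlaps_like_query(0, 1, 1, 2)   # spans that only touch
identical = overlaps_like_query(0, 1, 0, 1)  # the same span twice
```

Under this reading, adjacent spans such as `[0, 1)` and `[1, 2)` are accepted, while the duplicated `[0, 1, "LOC"]` entry in the new `example_overlapping.jsonl` test file counts as an overlap.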

diff --git a/backend/api/tasks.py b/backend/api/tasks.py
@@ -10,9 +10,9 @@
 from .models import Example, Label, Project, EntitySpan
 from .views.download.factory import create_repository, create_writer
 from .views.download.service import ExportApplicationService
-from .views.upload.exception import FileParseException
-from .views.upload.factory import (get_data_class, get_dataset_class,
-                                   get_label_class)
+from .views.upload.exception import FileParseException, FileParseExceptions
+from .views.upload.factory import (create_cleaner, get_data_class,
+                                   get_dataset_class, get_label_class)
 from .views.upload.utils import append_field
 
 logger = get_task_logger(__name__)
@@ -89,7 +89,7 @@ def create(self, examples, user, project):
 
 
 @shared_task
-def injest_data(user_id, project_id, filenames, format: str, **kwargs):
+def ingest_data(user_id, project_id, filenames, format: str, **kwargs):
     project = get_object_or_404(Project, pk=project_id)
     user = get_object_or_404(get_user_model(), pk=user_id)
     response = {'error': []}
@@ -110,6 +110,7 @@ def injest_data(user_id, project_id, filenames, format: str, **kwargs):
         label_class=Label,
         annotation_class=project.get_annotation_class()
     )
+    cleaner = create_cleaner(project)
     while True:
         try:
             example = next(it)
@@ -118,6 +119,13 @@ def injest_data(user_id, project_id, filenames, format: str, **kwargs):
         except FileParseException as err:
             response['error'].append(err.dict())
             continue
+        except FileParseExceptions as err:
+            response['error'].extend(list(err))
+            continue
+        try:
+            example.clean(cleaner)
+        except FileParseException as err:
+            response['error'].append(err.dict())
 
         buffer.add(example)
         if buffer.is_full():
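
Note on the ingest loop: the pattern in this hunk is to keep pulling from the iterator after a parse error, recording the error instead of aborting the upload. The sketch below shows that pattern in isolation. It is not doccano code: `LineParser`, `ingest`, and this simplified `FileParseException` are hypothetical, and the `StopIteration` branch is an assumption, since the hunk does not show how the real loop terminates.

```python
class FileParseException(Exception):
    """Simplified stand-in for the exception used in the diff."""

    def __init__(self, line_num, message):
        super().__init__(message)
        self.line_num = line_num
        self.message = message

    def dict(self):
        return {'line': self.line_num, 'message': self.message}


class LineParser:
    """Hypothetical iterator that raises on a bad line but stays usable.

    A generator function would not work here: once an exception escapes
    a generator, it is finished. A class with __next__ can keep going.
    """

    def __init__(self, lines):
        self._lines = enumerate(lines, start=1)

    def __iter__(self):
        return self

    def __next__(self):
        # StopIteration from the inner iterator ends the outer loop.
        line_num, line = next(self._lines)
        if not line.strip():
            raise FileParseException(line_num, 'empty line')
        return line.strip()


def ingest(lines):
    response = {'error': []}
    examples = []
    it = iter(LineParser(lines))
    while True:
        try:
            example = next(it)
        except StopIteration:
            break
        except FileParseException as err:
            # Record the error and move on to the next line.
            response['error'].append(err.dict())
            continue
        examples.append(example)
    return examples, response


examples, response = ingest(['exampleA', '', 'exampleB'])
```

Here the empty second line is reported with its line number while the two valid lines are still ingested.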

diff --git a/backend/api/tests/api/test_annotation.py b/backend/api/tests/api/test_annotation.py
@@ -1,7 +1,7 @@
 from rest_framework import status
 from rest_framework.reverse import reverse
 
-from ...models import DOCUMENT_CLASSIFICATION, Category
+from ...models import DOCUMENT_CLASSIFICATION, SEQUENCE_LABELING, Category
 from .utils import (CRUDMixin, make_annotation, make_doc, make_label,
                     make_user, prepare_project)
 
@@ -79,11 +79,17 @@ def test_denies_unauthenticated_user_to_annotate(self):
 class TestAnnotationDetail(CRUDMixin):
 
     def setUp(self):
-        self.project = prepare_project(task=DOCUMENT_CLASSIFICATION)
+        self.project = prepare_project(task=SEQUENCE_LABELING)
         self.non_member = make_user()
         doc = make_doc(self.project.item)
         label = make_label(self.project.item)
-        annotation = make_annotation(task=DOCUMENT_CLASSIFICATION, doc=doc, user=self.project.users[0])
+        annotation = make_annotation(
+            task=SEQUENCE_LABELING,
+            doc=doc,
+            user=self.project.users[0],
+            start_offset=0,
+            end_offset=1
+        )
         self.data = {'label': label.id}
         self.url = reverse(viewname='annotation_detail', args=[self.project.item.id, doc.id, annotation.id])
 

diff --git a/backend/api/tests/api/test_comment.py b/backend/api/tests/api/test_comment.py
@@ -50,7 +50,7 @@ def setUp(self):
     def test_allows_project_member_to_list_comments(self):
         for member in self.project.users:
             response = self.assert_fetch(member, status.HTTP_200_OK)
-            self.assertEqual(len(response.data), 1)
+            self.assertEqual(response.data['count'], 1)
 
     def test_denies_non_project_member_to_list_comments(self):
         self.assert_fetch(self.non_member, status.HTTP_403_FORBIDDEN)
@@ -70,7 +70,7 @@ def test_allows_project_member_to_delete_comments(self):
         for member in self.project.users:
             self.assert_bulk_delete(member, status.HTTP_204_NO_CONTENT)
             response = self.client.get(self.url)
-            self.assertEqual(len(response.data), 0)
+            self.assertEqual(response.data['count'], 0)
 
     def test_denies_non_project_member_to_delete_comments(self):
         self.assert_fetch(self.non_member, status.HTTP_403_FORBIDDEN)

diff --git a/backend/api/tests/api/utils.py b/backend/api/tests/api/utils.py
@@ -49,7 +49,8 @@ def make_project(
         task: str,
         users: List[str],
         roles: List[str] = None,
-        collaborative_annotation=False):
+        collaborative_annotation=False,
+        **kwargs):
     create_default_roles()
 
     # create users.
@@ -70,7 +71,8 @@ def make_project(
         _model=project_model,
         project_type=task,
         users=users,
-        collaborative_annotation=collaborative_annotation
+        collaborative_annotation=collaborative_annotation,
+        **kwargs
     )
 
     # assign roles to the users.
@@ -111,18 +113,18 @@ def make_auto_labeling_config(project):
     return mommy.make('AutoLabelingConfig', project=project)
 
 
-def make_annotation(task, doc, user):
+def make_annotation(task, doc, user, **kwargs):
     annotation_model = {
         DOCUMENT_CLASSIFICATION: 'Category',
         SEQUENCE_LABELING: 'Span',
         SEQ2SEQ: 'TextLabel',
         SPEECH2TEXT: 'TextLabel',
         ENTITY_RECOGNITION: 'EntitySpan'
     }.get(task)
-    return mommy.make(annotation_model, example=doc, user=user)
+    return mommy.make(annotation_model, example=doc, user=user, **kwargs)
 
 
-def prepare_project(task: str = 'Any', collaborative_annotation=False):
+def prepare_project(task: str = 'Any', collaborative_annotation=False, **kwargs):
     return make_project(
         task=task,
         users=['admin', 'approver', 'annotator'],
@@ -131,7 +133,8 @@ def prepare_project(task: str = 'Any', collaborative_annotation=False):
             settings.ROLE_ANNOTATION_APPROVER,
             settings.ROLE_ANNOTATOR,
         ],
-        collaborative_annotation=collaborative_annotation
+        collaborative_annotation=collaborative_annotation,
+        **kwargs
     )
 
 

diff --git a/backend/api/tests/data/seq2seq/example.csv b/backend/api/tests/data/seq2seq/example.csv
@@ -1,5 +1,5 @@
 text,label
+,label2
 exampleA,label1
 exampleB,
-,label2
 ,
diff --git a/backend/api/tests/data/sequence_labeling/example_overlapping.jsonl b/backend/api/tests/data/sequence_labeling/example_overlapping.jsonl
@@ -0,0 +1 @@
+{"text": "exampleA", "label": [[0, 1, "LOC"], [0, 1, "LOC"]], "meta": {"wikiPageID": 1}}