Doccano is not exporting data set after SequenceLabeling #1724

Closed
Daremitsu1 opened this issue Mar 8, 2022 · 5 comments · Fixed by #1799
Labels
bug Something isn't working question Further information is requested

@Daremitsu1

Hello,

I have annotated a dataset using Doccano Sequence Labelling, but it is not exporting the dataset. It exports only when I restrict the export to approved documents, and even then the downloaded file is blank. I tried Chrome and Firefox on Windows and the problem remained.

  • Operating System: Windows 10
  • Python Version Used: 3.8
  • When you install doccano: today
  • How did you install doccano (Heroku button etc): command line
@Tradunsky

Cannot reproduce on the latest doccano docker version.
I created a Sequence Labelling project and annotated one sample with the admin role, then exported the dataset without ticking the "Exporting only annotated samples" checkbox. Here is what I had in the file I downloaded:
image

@Hironsan Hironsan added bug Something isn't working question Further information is requested labels Mar 17, 2022
@Daremitsu1
Author

> Cannot reproduce on the latest doccano docker version. I created Sequence Labelling project and annotated one sample as admin role, then exported the dataset without clicking on Exporting only annotated samples checkmark, here is what I had in the file I downloaded: image

I am not using any Docker version. I tried to export the dataset after annotating it, but nothing is exported in Chrome, Firefox, or Edge.

@Hironsan Hironsan added this to To do in v1.7.0 Mar 22, 2022
@dhirajsuvarna
Contributor

Yes, this is an issue.
After clicking the "Export" button, the file gets exported to C:\Users\<username>\doccano\media with the filename all.jsonl.

However, the data in it is truncated to only 23 lines. doccano is not exporting more than 23 lines!

@Hironsan
Member

I can export a long file (more than 23 lines).

In any case, a reproducible procedure is needed to fix the problem.

@dhirajsuvarna
Contributor

dhirajsuvarna commented Mar 28, 2022

@Hironsan
I found the cause of the issue in my case.

It exports only 23 lines because it hits an error on the 24th line:

50e367] raised unexpected: UnicodeEncodeError('charmap', '{"id": 640, "data": "2000 2ml 22G x 1 1/4\\" (0.7 x 30mm) 1234567 YYYY-MM-DD YYYY-MM-DD 1000022514403 80% 5) 4) 3) 2) 1) Ctra. Mequinenza, s/n. -22520- Fraga, Spain +34 974 470900 Ctra. Mequinenza, s/n. -22520- Fraga, Spain Becton Dickinson, S.A. 碧迪公司 400-821-3091 348 458 307728 20163152896", "label": [[9, 34, "SIZE-DESCRIPTION"]], "FileName": "1000022514403 307728 CL.ai"}\r\n', 246, 250, 'character maps to <undefined>')

Below is the full traceback:

Traceback (most recent call last):
  File "C:\Users\10347298\.conda\envs\annotationenv\lib\site-packages\celery\app\trace.py", line 451, in trace_task
    R = retval = fun(*args, **kwargs)
  File "C:\Users\10347298\.conda\envs\annotationenv\lib\site-packages\celery\app\trace.py", line 734, in __protected_call__
    return self.run(*args, **kwargs)
  File "C:\Users\10347298\.conda\envs\annotationenv\lib\site-packages\backend\data_export\celery_tasks.py", line 19, in export_dataset
    filepath = service.export(export_approved)
  File "C:\Users\10347298\.conda\envs\annotationenv\lib\site-packages\backend\data_export\pipeline\services.py", line 12, in export
    filepath = self.writer.write(records)
  File "C:\Users\10347298\.conda\envs\annotationenv\lib\site-packages\backend\data_export\pipeline\writers.py", line 42, in write
    f.write(f"{line}\n")
  File "C:\Users\10347298\.conda\envs\annotationenv\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 246-249: character maps to <undefined>
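
The failure can be reproduced outside doccano. Below is a minimal sketch, assuming only that the export file was opened with the Windows default code page cp1252 (as the `encodings\cp1252.py` frame in the traceback suggests). The sample record is shortened from the one in the log, and the helper name `can_encode` is hypothetical, not part of doccano:

```python
import json

# Shortened from the failing record in the log above (labels omitted).
record = {"id": 640, "data": "Becton Dickinson, S.A. 碧迪公司", "label": []}

# ensure_ascii=False keeps the Chinese characters unescaped, matching the
# raw characters visible in the error message.
line = json.dumps(record, ensure_ascii=False)


def can_encode(text: str, encoding: str) -> bool:
    """Return True if `text` can be encoded with `encoding` (hypothetical helper)."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(can_encode(line, "cp1252"))  # False: cp1252 has no mapping for these characters
print(can_encode(line, "utf-8"))   # True: UTF-8 can encode any Unicode text
```

This would also explain the truncation: every record before the first non-cp1252 character is written, and the export task aborts at the first record that cannot be encoded.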

Probable solution:

In the file doccano/backend/data_export/pipeline/writers.py, change the open() call to pass the utf-8 encoding explicitly:

f = open(filename, mode="a", encoding='utf-8')
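
To illustrate why the explicit encoding matters: without it, `open()` in text mode uses the locale's preferred encoding, which is cp1252 on many Windows installs. The following is a minimal sketch, not doccano's actual writer; the function name `append_jsonl`, the sample records, and the temporary file path are assumptions made for illustration:

```python
import json
import os
import tempfile


def append_jsonl(filename: str, records: list) -> None:
    """Append records as JSON Lines, always encoding the file as UTF-8."""
    with open(filename, mode="a", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


records = [
    {"id": 1, "data": "plain ASCII text", "label": []},
    {"id": 2, "data": "Becton Dickinson, S.A. 碧迪公司", "label": []},
]

# Write to a throwaway directory so the sketch does not touch real data.
path = os.path.join(tempfile.mkdtemp(), "all.jsonl")
append_jsonl(path, records)

# Read the file back with the same encoding to confirm nothing was lost.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(len(loaded))  # 2
```

Anything that later reads the exported file should also open it with `encoding="utf-8"`, otherwise the same locale default applies on the reading side.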

I have created a PR for this - #1754
This is my first PR, let me know if I have done anything wrong.

@Hironsan Hironsan moved this from To do to In progress in v1.7.0 Apr 13, 2022
v1.7.0 automation moved this from In progress to Done Apr 25, 2022
4 participants