multimodal label-studio export reader & doc #2615

Merged: 9 commits into autogluon:master, Jan 12, 2023

MountPOTATO
Contributor

This tool helps users transform annotation data exported from the data labeling platform Label-Studio (https://labelstud.io/) into a pandas DataFrame for AutoGluon multimodal input. This lets users build a Label-Studio to AutoGluon workflow: label the data in Label-Studio, then feed it to AutoGluon with a few lines of code that adjust the data.
So far three task templates are available: image classification (image), named entity recognition (text), and a user-customized template. Other templates are WIP.
Documentation for this feature is attached to this PR.
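
As a rough illustration of the transformation described above, the following sketch flattens a Label-Studio "JSON" export for an image-classification project into rows that could be passed to pandas.DataFrame. It does not use the reader added in this PR: the helper name flatten_image_classification, the output column names "image" and "label", and the sample record are assumptions made for illustration only.

```python
import json


def flatten_image_classification(tasks, data_key="image"):
    """Turn Label-Studio "JSON" export tasks into flat {image, label} rows.

    Assumes each task has a "data" dict and an "annotations" list whose
    "result" entries carry the selected class under value["choices"].
    """
    rows = []
    for task in tasks:
        image = task.get("data", {}).get(data_key)
        for annotation in task.get("annotations", []):
            for result in annotation.get("result", []):
                choices = result.get("value", {}).get("choices", [])
                if choices:
                    # Keep the first choice; single-label classification is assumed.
                    rows.append({"image": image, "label": choices[0]})
    return rows


# Hypothetical export content, trimmed to the fields used above.
export_text = """
[
  {"id": 1,
   "data": {"image": "/data/upload/1/cat.jpg"},
   "annotations": [{"result": [{"type": "choices",
                                "value": {"choices": ["cat"]}}]}]}
]
"""
rows = flatten_image_classification(json.loads(export_text))
print(rows)
```

The resulting list of dicts can be handed to pandas.DataFrame(rows) to obtain a table with one row per annotated image.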

Description of changes:

  • add from_labelstudio.py to autogluon/multimodal/src/autogluon/multimodal/utils
  • add a documentation folder label-studio-export-reader to autogluon/examples/automm

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@github-actions

Job PR-2615-51ee391 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-2615/51ee391/index.html

Contributor

@bryanyzhu left a comment

Thanks for the PR, left a few comments.

@sxjscience Please let us know what else is missing from current PR, thank you.

@@ -0,0 +1,173 @@
# Label-Studio Export file reader

Contributor

Please use _ in file name, e.g., LabelStudio_export_file_reader.md


params:
- path: str: the path of the exported file
- data_columns: list[str]: the key/column names of the data
Contributor

Can you give a specific example of what data_columns and label_columns look like?

Contributor Author

I will document the specific usage of these two params in the docs and reference the doc link in the code in the next commit. Is that acceptable, or should I also provide examples in the source code as a comment?
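
One way to picture the two parameters, sketched without the reader itself: data_columns names the keys that hold the input data in each exported record, and label_columns names the keys that hold the annotation. The records below follow the flat JSON-MIN style, but the key names "image" and "choice" depend on the labeling config and are assumptions here, as is the helper select_columns.

```python
def select_columns(records, data_columns, label_columns):
    """Keep only the requested data and label keys from each export record."""
    wanted = list(data_columns) + list(label_columns)
    rows = []
    for record in records:
        missing = [key for key in wanted if key not in record]
        if missing:
            # Fail loudly rather than silently producing partial rows.
            raise KeyError(f"record {record.get('id')} lacks keys: {missing}")
        rows.append({key: record[key] for key in wanted})
    return rows


# Hypothetical JSON-MIN style records for an image-classification project.
records = [
    {"id": 1, "image": "/data/upload/1/cat.jpg", "choice": "cat", "annotator": 1},
    {"id": 2, "image": "/data/upload/1/dog.jpg", "choice": "dog", "annotator": 1},
]
rows = select_columns(records, data_columns=["image"], label_columns=["choice"])
print(rows)
```

With this layout, data_columns would be ["image"] and label_columns would be ["choice"]; bookkeeping keys such as "id" and "annotator" are dropped.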

if len(label_studio_json) == 0:
    raise ValueError("ERROR: empty export file")

if "annotations" in label_studio_json[0]:
Contributor

Why label_studio_json[0]? How many elements are there in label_studio_json? Usually we want to avoid hard coded stuff, like index 0, unless there is good reason.

Contributor Author

When exporting annotations from Label-Studio, "JSON" and "JSON-MIN" are two different options that share the file extension ".json"; both contain a list of annotation dicts. (Examples can be seen at https://labelstud.io/guide/export.html#Label-Studio-JSON-format-of-annotated-tasks and https://labelstud.io/guide/export.html#JSON-MIN; JSON-MIN is basically a simplified version of JSON.)
This line checks whether the export file is "JSON" or "JSON-MIN". Currently I use a simple check on whether the first element (an annotation) has the key "annotations", which exists in JSON but not in JSON-MIN. I am still looking for a better way to distinguish them, and any suggestions are welcome.
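
A possible variant of this check, sketched under the same assumption that only "JSON" records carry an "annotations" key, inspects every record instead of only index 0 and rejects files that mix both shapes. The function name detect_export_format and the returned strings are hypothetical and not part of this PR.

```python
def detect_export_format(records):
    """Return "JSON" or "JSON-MIN" for a parsed Label-Studio export list."""
    if not isinstance(records, list) or len(records) == 0:
        raise ValueError("empty or non-list export file")
    # Assumption: the "annotations" key appears in JSON records only.
    has_annotations = ["annotations" in record for record in records]
    if all(has_annotations):
        return "JSON"
    if not any(has_annotations):
        return "JSON-MIN"
    raise ValueError("export mixes records with and without 'annotations'")


print(detect_export_format([{"id": 1, "data": {}, "annotations": []}]))
print(detect_export_format([{"id": 1, "image": "a.jpg", "choice": "cat"}]))
```

Checking all records costs one pass over the list and avoids depending on whatever happens to be first in the file.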


else:
    split_lst = s.split("/")
    if split_lst[2] == "local-files":
Contributor

Again, how many elements are there in split_lst and what does split_lst[2] represent?
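
One way to avoid the positional index, as a sketch only: parse the reference with urllib.parse and look for the "local-files" path segment by name. This assumes that local-storage references look like /data/local-files/?d=<relative path>, which is my understanding of the Label-Studio convention and should be verified against real exports. The helper name resolve_local_file is hypothetical.

```python
from urllib.parse import parse_qs, urlparse


def resolve_local_file(s):
    """Return the path carried in the "d" query parameter of a local-files
    reference, or None when s is not such a reference."""
    parsed = urlparse(s)
    segments = [segment for segment in parsed.path.split("/") if segment]
    if "local-files" not in segments:
        return None
    # Assumption: the file path is stored in the "d" query parameter.
    values = parse_qs(parsed.query).get("d")
    return values[0] if values else None


print(resolve_local_file("/data/local-files/?d=images/cat.jpg"))
print(resolve_local_file("/data/upload/1/cat.jpg"))
```

Matching the segment by name works whether or not the string carries a host prefix, so the code does not need to know how many elements the split produces.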

from autogluon.multimodal.utils import LabelStudioReader

# initialize LabelStudioReader with default localhost host
ls=LabelStudioReader()
Contributor

Needs extra whitespace, e.g., ls = LabelStudioReader(). There are also many instances like this below; please add whitespace accordingly.



"""
Usage:
Collaborator

Try to put the docstring inside “LabelStudioReader” class.

@sxjscience
Collaborator

Actually, should we rename utils/from_labelstudio.py to utils/labelstudio.py ?

@github-actions

github-actions bot commented Jan 5, 2023

Job PR-2615-f886e8e is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-2615/f886e8e/index.html

Contributor

@bryanyzhu left a comment

LGTM, we can fill in more details later. @zhiqiangdon @cheungdaven @FANGAreNotGnu @suzhoum @yongxinw Please help review this PR. It enables AG to train on data labeled with LabelStudio, which is quite a useful feature.

Collaborator

@sxjscience left a comment

LGTM

Contributor

@zhiqiangdon left a comment

LGTM. Awesome feature!

@sxjscience sxjscience merged commit 21d3ad7 into autogluon:master Jan 12, 2023