Add dataset: brill_iconclass #30

Closed
1 task done
davanstrien opened this issue Jul 7, 2022 · 12 comments
Labels
dataset Dataset to be added

Comments

@davanstrien
Collaborator

A URL for this dataset

https://labs.brill.com/ictestset/

Dataset description

A test dataset and challenge to apply machine learning to collections described with the Iconclass classification system.

Iconclass is a metadata standard used by some LAM institutions. This dataset is of particular interest for the following reasons:

  • the dataset includes images relevant to the LAM domain (art history)
  • some images contain multiple iconclass labels whilst others contain a single one
  • the Iconclass schema itself poses an interesting machine learning challenge since the notation of the schema is broken down into parts. This means that whilst the dataset can be used as an image classification dataset, it is likely to benefit from more bespoke sequence prediction approaches.
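
To make the "broken down into parts" point concrete, here is a minimal sketch of how a notation could be turned into a token sequence for a sequence prediction model. It assumes a simplified reading of Iconclass notations (one token per character, with any parenthesised group such as a bracketed name or key kept whole); the function names are hypothetical and this is not an official Iconclass parser.

```python
import re
from typing import List

# Assumption: a parenthesised group is one unit; every other
# non-whitespace character is its own token.
TOKEN_PATTERN = re.compile(r"\([^)]*\)|\S")


def split_notation(notation: str) -> List[str]:
    """Split a notation string into a flat list of tokens."""
    return TOKEN_PATTERN.findall(notation)


def notation_prefixes(notation: str) -> List[str]:
    """Return every prefix of the token sequence, shortest first.

    Under the simplifying assumption above, each prefix is treated as
    one step up the hierarchy, which gives a coarse-to-fine label path.
    """
    tokens = split_notation(notation)
    return ["".join(tokens[:end]) for end in range(1, len(tokens) + 1)]
```

With this split, a model could predict one token at a time instead of choosing among every full notation as a separate class.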

Dataset modality

Image

Dataset licence

Creative Commons Public Domain Dedication and Certification

Other licence

No response

How can you access this data

As a download from a repository/website

Confirm the dataset has an open licence

  • To the best of my knowledge, this dataset is accessible via an open licence

Contact details for data custodian

posthumus@brill.com

davanstrien added the "candidate-dataset" label (Proposed dataset to be added) on Jul 7, 2022
davanstrien changed the title from "Add dataset: [brill_iconclass]" to "Add dataset: brill_iconclass" on Jul 7, 2022
davanstrien added the "dataset" label (Dataset to be added) on Jul 7, 2022
@cakiki
Member

cakiki commented Jul 8, 2022

Looks great! Removing the candidate tag 😃

cakiki removed the "candidate-dataset" label (Proposed dataset to be added) on Jul 8, 2022
@davanstrien
Collaborator Author

#self-assign

@davanstrien
Collaborator Author

#ready-for-review

github-actions bot added the "ready for review" label (Issue ready to be reviewed by maintainers) on Jul 12, 2022
@davanstrien
Collaborator Author

I have written a script for loading this one: https://huggingface.co/datasets/biglam/cultural_heritage_metadata_accuracy

@albertvillanova I didn't make this one streaming. I did have a version that supported streaming but it was quite a bit slower to load. It's possible I missed something obvious though.

I used the following to generate examples in the streaming version:

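    import json
    from zipfile import ZipFile

    from PIL import Image

    def _generate_examples(self, download_dir):
        # download_dir is the path to the downloaded zip archive
        with ZipFile(download_dir) as myzip:
            # data.json maps each image path in the archive to its labels
            with myzip.open("data.json") as json_file:
                data = json.load(json_file)
            for row, (filepath, labels) in enumerate(data.items()):
                # open the image straight from the archive without extracting it
                image = Image.open(myzip.open(filepath))
                yield row, {"image": image, "label": labels}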

My own feeling is that streaming is less important for this one, but if I've missed an obvious way of supporting streaming, I'm happy to hear it!

@epoz

epoz commented Jul 14, 2022

BTW, I am not affiliated with Brill any more, so my contact address should be updated. You can use info@iconclass.org for testset related matters.

@epoz

epoz commented Jul 14, 2022

Is it an idea to add the core data of the Iconclass system as a dataset?

@davanstrien
Collaborator Author

BTW, I am not affiliated with Brill any more, so my contact address should be updated. You can use info@iconclass.org for testset related matters.

I will update that 🙂

@davanstrien
Collaborator Author

Is it an idea to add the core data of the Iconclass system as a dataset?

That would be great. I had originally planned to also add a configuration of this dataset that had the 'translation' of the Iconclass labels, i.e. turning the Iconclass code into the associated description. I know there used to be a Python library that allowed for these queries, but I think it's no longer maintained? Adding the core data would be nice as its own dataset and could also potentially be used as a way of doing this 'translation'.
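
As a rough illustration of what that 'translation' could look like once the core data is available as a plain mapping, here is a minimal sketch. The function name `describe` and the two-entry table are hypothetical; the entries are illustrative stand-ins, not the official Iconclass texts, and the fallback to the longest known prefix is an assumption about what would be useful rather than how any existing library behaves.

```python
from typing import Dict, Optional

# Illustrative entries only; real descriptions would come from the
# Iconclass core data.
EXAMPLE_DESCRIPTIONS: Dict[str, str] = {
    "2": "Nature",
    "25F": "animals",
}


def describe(notation: str, descriptions: Dict[str, str]) -> Optional[str]:
    """Return the description for `notation`, or for its longest known prefix.

    Returns None when no prefix of the notation is in the table.
    """
    for end in range(len(notation), 0, -1):
        text = descriptions.get(notation[:end])
        if text is not None:
            return text
    return None
```

Falling back to a prefix means a notation missing from the table still gets the description of its nearest known ancestor instead of nothing.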

@epoz

epoz commented Jul 14, 2022

Ouch, yes. That Python library was also made by me, but it has terrible (read: non-existent) documentation.
I would love to update it, but there have been other things clamouring for attention.

If there is interest in using it, that will galvanise me to give it some spit-and-polish.

@davanstrien
Collaborator Author

Ouch, yes. That Python library was also made by me, but it has terrible (read: non-existent) documentation. I would love to update it, but there have been other things clamouring for attention.

If there is interest in using it, that will galvanise me to give it some spit-and-polish.

That would be great, and I'm happy to offer some help with that if useful. It's possible some of the functionality could be replicated by having the underlying data available on the Hub, but some areas might be better served by a specific library.

@epoz

epoz commented Jul 15, 2022

I have updated the location of this testset; it is now at:
https://iconclass.org/testset/

@davanstrien
Collaborator Author

I have updated the location of this testset, it is now on: iconclass.org/testset

Thanks for letting me know — just updated the URLs.

davanstrien removed the "ready for review" label (Issue ready to be reviewed by maintainers) on Jul 21, 2022
3 participants