Editing annotations and images in migrated datasets #672

Open
spikelynch opened this issue Apr 13, 2022 · 5 comments
@spikelynch
Collaborator

Testing steps

  1. Spun up master (i.e. same as production weedai)
  2. Created a small dataset - weedcoco.json and four images
  3. Updated weedai to dataset-versions-frontend
  4. Migrated the repository to ocfl using the weedcoco/repo/repository.py tool
  5. Tried to edit the dataset, uploading a new weedcoco.json which referred to one additional image

On initialising the edit, got a set of warnings because the images in the original dataset did not have mappings from hashed filenames to original filenames in Redis. This is expected: none of the migrated datasets will have Redis mappings.

No match found for 216c6388-92c3-4fab-9ff2-ca40503dea34/0ba53efa94d454025b5d.jpg
No match found for 216c6388-92c3-4fab-9ff2-ca40503dea34/0cbb11d6b65d175c16bc.jpg
No match found for 216c6388-92c3-4fab-9ff2-ca40503dea34/0c0d4f9eb31a36c9c49c.jpg
No match found for 216c6388-92c3-4fab-9ff2-ca40503dea34/5a60b66cd6771f5c982d.jpg
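The warnings above can be sketched as a lookup that misses for every image in a migrated dataset. This is an illustrative sketch, not the real weedcoco API: the function name, the dict standing in for a Redis GET, and the key format are all assumptions.

```python
from typing import Optional


def original_filename(redis_map: dict, upload_id: str,
                      hashed_name: str) -> Optional[str]:
    """Look up the original filename for a hashed image name.

    `redis_map` is a plain dict standing in for a Redis GET; migrated
    repositories have no entries at all, so every lookup misses and a
    warning is emitted.
    """
    key = f"{upload_id}/{hashed_name}"
    original = redis_map.get(key)
    if original is None:
        print(f"No match found for {key}")
    return original


# Migrated dataset: empty mapping, so the lookup warns and returns None.
original_filename({}, "216c6388-92c3-4fab-9ff2-ca40503dea34",
                  "0ba53efa94d454025b5d.jpg")
```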

On uploading the new weedcoco.json, the frontend reported five missing images (not one, as expected).

On uploading the five images, got the following error while creating thumbnails:

Traceback (most recent call last):
  File "/code/weedid/tasks.py", line 144, in update_index_and_thumbnails
    thumbnailing(Path(thumbnails_dir), Path(repository_dir), upload_id)
  File "/usr/local/lib/python3.7/site-packages/weedcoco/index/thumbnailing.py", line 80, in thumbnailing
    coco_by_filename[filename],
[2022-04-13 01:24:29,301: WARNING/ForkPoolWorker-4] KeyError: '0cbb11d6b65d175c16bc-.jpg'

This means that the thumbnailer can't find the annotations for an image.
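The failing lookup can be reconstructed roughly as follows. This is an illustrative sketch, not the actual weedcoco.index.thumbnailing code: the thumbnailer groups annotations by filename, and a strict `coco_by_filename[filename]` indexing raises KeyError when the filename it reconstructs isn't in the index, whereas a defensive `.get()` would let it skip the image instead.

```python
from typing import Optional


def annotations_for(coco_by_filename: dict, filename: str) -> Optional[list]:
    """Return the annotations indexed under `filename`, or None if absent.

    A strict coco_by_filename[filename] raises KeyError on a miss,
    which is the failure in the traceback above; .get() degrades
    gracefully so the thumbnailer can skip the image.
    """
    return coco_by_filename.get(filename)


index = {"0cbb11d6b65d175c16bc.jpg": ["annotation-1"]}
# The thumbnailer asked for a name with a stray '-', which isn't indexed:
annotations_for(index, "0cbb11d6b65d175c16bc-.jpg")  # returns None
```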

Need to think through how the editing code will work for datasets without original image filenames (i.e. all of the existing datasets), or whether we just rule out editing them somehow. I think it should be possible to make it work for them.

@spikelynch spikelynch self-assigned this Apr 13, 2022
@spikelynch
Collaborator Author

A v2 of the object was created in the ocfl, but it has extra versions of the four original images, because of the issue where the same image uploaded twice is not bitwise identical (the EXIF-stripping problem).

It shouldn't have let me upload five images. What I need to work out is how to get it to recognise that if it already has four images, and I give it a weedcoco.json referring to those four plus an extra one, it should only ask for the one new image.
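The intended behaviour can be sketched as a set difference between the filenames the uploaded weedcoco.json refers to and the files the repository already holds. The helper name and inputs are hypothetical, not the actual stepper code.

```python
def missing_images(referenced: set, existing: set) -> list:
    """Return only the referenced images the repository doesn't hold yet."""
    return sorted(set(referenced) - set(existing))


existing = {"0ba53efa94d454025b5d.jpg", "0cbb11d6b65d175c16bc.jpg",
            "0c0d4f9eb31a36c9c49c.jpg", "5a60b66cd6771f5c982d.jpg"}
referenced = existing | {"newimage.jpg"}

# With four of the five referenced images already present, only one
# image should be requested from the user:
missing_images(referenced, existing)  # -> ["newimage.jpg"]
```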

In the first test I didn't do this - I was using a weedcoco.json from my laptop, which had filenames like 'image1.jpg' rather than hashes.

@spikelynch
Collaborator Author

Testing run 2:

  1. Spun up master (i.e. same as production weedai)
  2. Created a small dataset - weedcoco.json and four images
  3. Downloaded the dataset as a zipfile from the master weedai frontend
  4. Updated weedai to dataset-versions-frontend
  5. Migrated the repository to ocfl using the weedcoco/repo/repository.py tool
  6. With the dataset downloaded in step 3 above, modified the weedcoco.json to add a new image file with an annotation
  7. Edited the dataset in weedai, uploading the modified weedcoco.json and the new image file

This still complained about not remapping the images, but successfully indexed the new dataset with v2.

The ocfl deduplication issue #665 seems to have come back - in the resulting ocfl there are nine image files, with each of the four images from the first dataset appearing twice.

Currently trying to track down where/when the original four files acquired changes in the editing process - I may be working on a branch which doesn't have the bug fix for that.

@spikelynch
Collaborator Author

Have now tested it with the ocfl dedupe bugfix in place, and it's not duplicating the original images.

@spikelynch
Collaborator Author

Here is a diagram of the editing process where the original dataset was migrated from the pre-ocfl repository, and doesn't have mappings back to the original filenames.

My original diagram showed how it works if the original dataset was post-ocfl:
#660 (comment)

The problems I was having arise if the updated weedcoco.json refers to files using their original filenames. The system has no way to map these to the hashed image files, so it calculates the wrong number of missing images, and can get missing key exceptions when it tries to thumbnail them.

A way to avoid this is to base the updated weedcoco.json on a zipfile downloaded from WeedAI, because that version has the hashed image names. Here's how that looks:

[diagram: editing_versions_migrated]

*When the Redis map is built, it's going to be a mix of references to old pre-ocfl hashes and new unhashed image names, e.g.:

  • "image+annotationhash1.jpg" -> "oldimagehash1.jpg"
  • "image+annotationhash2.jpg" -> "oldimagehash2.jpg"
  • "image+annotationhash3.jpg" -> "image-3.jpg"

One way to avoid all this complication would be to restrict whether existing datasets can be edited, or in what ways they can be edited, i.e.:

  • old-style datasets can have the metadata and agcontext updated, but not annotations or images
  • old-style datasets can't be edited at all through the frontend
  • we allow updating annotations and images, but provide guidance for the user and catch error conditions

If the user edits a dataset and uploads a weedcoco.json which doesn't have mappings in redis, that means it's an old-style dataset. The upload stepper could then check to see if the weedcoco.json image filenames (which will be old hashes) match the filenames in the dataset. If they do, then the update should work ok.
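The proposed stepper check can be sketched as follows. This is a hypothetical helper, not the actual upload-stepper code: if none of the filenames in the uploaded weedcoco.json match files already in the dataset, the user is probably working from original (pre-hash) filenames and should be pointed at the hashed weedcoco.json instead.

```python
def classify_edit(weedcoco_filenames: list, dataset_filenames: list) -> str:
    """Decide whether an old-style dataset edit can proceed.

    If any uploaded filename matches a file already in the dataset, the
    user is working from the old hashed names and the update should work;
    if none match, suggest downloading the hashed weedcoco.json first.
    """
    if set(weedcoco_filenames) & set(dataset_filenames):
        return "ok"
    return "needs-hashed-weedcoco"


classify_edit(["oldhash1.jpg", "new.jpg"],
              ["oldhash1.jpg", "oldhash2.jpg"])  # -> "ok"
classify_edit(["image1.jpg", "image2.jpg"],
              ["oldhash1.jpg", "oldhash2.jpg"])  # -> "needs-hashed-weedcoco"
```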

If they don't, it means they're using a weedcoco.json with the original filenames. We could give the user a message like "to update this dataset, use this weedcoco.json as a starting point" and provide a link to download the weedcoco.json which has hashed filenames.

@spikelynch spikelynch changed the title Test migration failed Editing annotations and images in migrated datasets May 5, 2022
@spikelynch
Collaborator Author

[diagram: editing_migrated]

Diagram of the three scenarios - the one which causes problems is the third, where the user uploads a weedcoco.json where none of the referenced images are in the existing dataset.

The most likely cause of this is the error I hit: uploading an edited version of the weedcoco.json which uses the filenames from before they were hashed (and which are unrecoverable because there's no Redis mapping for them).

It would be possible to trigger the same condition by uploading a dataset, then changing all of the images and weedcoco.json refs to different filenames, and uploading that - but I think that's a perverse case and one we can discount in this release.
