Editing annotations and images in migrated datasets #672

Open
spikelynch opened this issue Apr 13, 2022 · 5 comments
@spikelynch
Collaborator

Testing steps

  1. Spun up master (i.e. same as production weedai)
  2. Created a small dataset - weedcoco.json and four images
  3. Updated weedai to dataset-versions-frontend
  4. Migrated the repository to ocfl using the weedcoco/repo/repository.py tool
  5. Tried to edit the dataset, uploading a new weedcoco.json which referred to one additional image

On initialising the edit, got a set of warnings because the images in the original dataset did not have mappings from hashed filenames to original filenames in Redis. This is expected: none of the migrated datasets will have Redis mappings.

No match found for 216c6388-92c3-4fab-9ff2-ca40503dea34/0ba53efa94d454025b5d.jpg
No match found for 216c6388-92c3-4fab-9ff2-ca40503dea34/0cbb11d6b65d175c16bc.jpg
No match found for 216c6388-92c3-4fab-9ff2-ca40503dea34/0c0d4f9eb31a36c9c49c.jpg
No match found for 216c6388-92c3-4fab-9ff2-ca40503dea34/5a60b66cd6771f5c982d.jpg
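The warnings above can be sketched as a lookup that misses for every image in a migrated dataset. This is an illustrative sketch, not the real weedcoco API: the function name, the dict standing in for a Redis GET, and the key format are all assumptions.

```python
from typing import Optional


def original_filename(redis_map: dict, upload_id: str,
                      hashed_name: str) -> Optional[str]:
    """Look up the original filename for a hashed image name.

    `redis_map` is a plain dict standing in for a Redis GET; migrated
    repositories have no entries at all, so every lookup misses and a
    warning is emitted.
    """
    key = f"{upload_id}/{hashed_name}"
    original = redis_map.get(key)
    if original is None:
        print(f"No match found for {key}")
    return original


# Migrated dataset: empty mapping, so the lookup warns and returns None.
original_filename({}, "216c6388-92c3-4fab-9ff2-ca40503dea34",
                  "0ba53efa94d454025b5d.jpg")
```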

On uploading the new weedcoco.json, the frontend reported five missing images (not one, as expected).

On uploading the five images, got the following error while creating thumbnails:

Traceback (most recent call last):
  File "/code/weedid/tasks.py", line 144, in update_index_and_thumbnails
    thumbnailing(Path(thumbnails_dir), Path(repository_dir), upload_id)
  File "/usr/local/lib/python3.7/site-packages/weedcoco/index/thumbnailing.py", line 80, in thumbnailing
    coco_by_filename[filename],
[2022-04-13 01:24:29,301: WARNING/ForkPoolWorker-4] KeyError: '0cbb11d6b65d175c16bc-.jpg'

This means that the thumbnailer can't find the annotations for an image.
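The failing lookup can be reconstructed roughly as follows. This is an illustrative sketch, not the actual weedcoco.index.thumbnailing code: the thumbnailer groups annotations by filename, and a strict `coco_by_filename[filename]` indexing raises KeyError when the filename it reconstructs isn't in the index, whereas a defensive `.get()` would let it skip the image instead.

```python
from typing import Optional


def annotations_for(coco_by_filename: dict, filename: str) -> Optional[list]:
    """Return the annotations indexed under `filename`, or None if absent.

    A strict coco_by_filename[filename] raises KeyError on a miss,
    which is the failure in the traceback above; .get() degrades
    gracefully so the thumbnailer can skip the image.
    """
    return coco_by_filename.get(filename)


index = {"0cbb11d6b65d175c16bc.jpg": ["annotation-1"]}
# The thumbnailer asked for a name with a stray '-', which isn't indexed:
annotations_for(index, "0cbb11d6b65d175c16bc-.jpg")  # returns None
```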

Need to think through how the editing code will work for datasets without original image filenames (i.e. all of the existing datasets), or whether we just rule out editing them somehow. I think it should be possible to make it work for them.

@spikelynch spikelynch self-assigned this Apr 13, 2022
@spikelynch
Collaborator Author

A v2 of the object was created in the ocfl, but it has extra versions of the four original images, because of the issue where the same image uploaded twice is not bitwise identical (the EXIF-stripping problem).

It shouldn't have let me upload five images. What I need to work out is how to get it to recognise that if it already has four images, and I give it a weedcoco.json referring to those four plus an extra one, it should only ask for the one new image.
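The intended behaviour can be sketched as a set difference between the filenames the uploaded weedcoco.json refers to and the files the repository already holds. The helper name and inputs are hypothetical, not the actual stepper code.

```python
def missing_images(referenced: set, existing: set) -> list:
    """Return only the referenced images the repository doesn't hold yet."""
    return sorted(set(referenced) - set(existing))


existing = {"0ba53efa94d454025b5d.jpg", "0cbb11d6b65d175c16bc.jpg",
            "0c0d4f9eb31a36c9c49c.jpg", "5a60b66cd6771f5c982d.jpg"}
referenced = existing | {"newimage.jpg"}

# With four of the five referenced images already present, only one
# image should be requested from the user:
missing_images(referenced, existing)  # -> ["newimage.jpg"]
```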

In the first test I didn't do this - I was using a weedcoco.json from my laptop, which had filenames like 'image1.jpg' rather than hashes.

@spikelynch
Collaborator Author

Testing run 2:

  1. Spun up master (i.e. same as production weedai)
  2. Created a small dataset - weedcoco.json and four images
  3. Downloaded the dataset as a zipfile from the master weedai frontend
  4. Updated weedai to dataset-versions-frontend
  5. Migrated the repository to ocfl using the weedcoco/repo/repository.py tool
  6. With the dataset downloaded in step 3 above, modified the weedcoco.json to add a new image file with an annotation
  7. Edited the dataset in weedai, uploading the modified weedcoco.json and the new image file

This still complained about not remapping the images, but successfully indexed the new dataset with v2.

The ocfl deduplication issue #665 seems to have come back - in the resulting ocfl there are nine image files, with each of the four images from the first dataset appearing twice.

Currently trying to track down where/when the original four files acquired changes in the editing process - I may be working on a branch which doesn't have the bug fix for that.

@spikelynch
Collaborator Author

Have now tested it with the ocfl dedupe bugfix in place, and it's not duplicating the original images.

@spikelynch
Collaborator Author

Here is a diagram of the editing process where the original dataset was migrated from the pre-ocfl repository, and doesn't have mappings back to the original filenames.

My original diagram showed how it works if the original dataset was post-ocfl:
#660 (comment)

The problems I was having arise if the updated weedcoco.json refers to files using their original filenames. The system has no way to map these to the hashed image files, so it calculates the wrong number of missing images, and can get missing key exceptions when it tries to thumbnail them.

A way to avoid this is to base the updated weedcoco.json on a zipfile downloaded from WeedAI, because that version has the hashed image names. Here's how that looks:

[diagram: editing_versions_migrated]

*When the Redis map is built, it's going to be a mix of references to old pre-ocfl hashes and new unhashed image names, e.g.:

  • "image+annotationhash1.jpg" -> "oldimagehash1.jpg"
  • "image+annotationhash2.jpg" -> "oldimagehash2.jpg"
  • "image+annotationhash3.jpg" -> "image-3.jpg"

One way to avoid all this complication would be to restrict whether existing datasets can be edited, or in what ways they can be edited, i.e.:

  • old-style datasets can have the metadata and agcontext updated, but not annotations or images
  • old-style datasets can't be edited at all through the frontend
  • we allow updating annotations and images, but provide guidance for the user and catch error conditions

If the user edits a dataset and uploads a weedcoco.json which doesn't have mappings in redis, that means it's an old-style dataset. The upload stepper could then check to see if the weedcoco.json image filenames (which will be old hashes) match the filenames in the dataset. If they do, then the update should work ok.
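The proposed stepper check can be sketched as follows. This is a hypothetical helper, not the actual upload-stepper code: if none of the filenames in the uploaded weedcoco.json match files already in the dataset, the user is probably working from original (pre-hash) filenames and should be pointed at the hashed weedcoco.json instead.

```python
def classify_edit(weedcoco_filenames: list, dataset_filenames: list) -> str:
    """Decide whether an old-style dataset edit can proceed.

    If any uploaded filename matches a file already in the dataset, the
    user is working from the old hashed names and the update should work;
    if none match, suggest downloading the hashed weedcoco.json first.
    """
    if set(weedcoco_filenames) & set(dataset_filenames):
        return "ok"
    return "needs-hashed-weedcoco"


classify_edit(["oldhash1.jpg", "new.jpg"],
              ["oldhash1.jpg", "oldhash2.jpg"])  # -> "ok"
classify_edit(["image1.jpg", "image2.jpg"],
              ["oldhash1.jpg", "oldhash2.jpg"])  # -> "needs-hashed-weedcoco"
```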

If they don't, it means they're using a weedcoco.json with the original filenames. We could give the user a message like "to update this dataset, use this weedcoco.json as a starting point" and provide a link to download the weedcoco.json which has hashed filenames.

@spikelynch spikelynch changed the title Test migration failed Editing annotations and images in migrated datasets May 5, 2022
@spikelynch
Collaborator Author

[diagram: editing_migrated]

Diagram of the three scenarios - the one which causes problems is the third, where the user uploads a weedcoco.json where none of the referenced images are in the existing dataset.

The most likely cause of this is the error I hit: uploading an edited version of the weedcoco.json which uses the filenames from before they were hashed (and which are unrecoverable because there's no Redis mapping for them).

It would be possible to trigger the same condition by uploading a dataset, then changing all of the images and weedcoco.json refs to different filenames, and uploading that - but I think that's a perverse case and one we can discount in this release.
