Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TIKA-2630: Wrong height and width metadata for JPEG images #255

Merged
merged 3 commits into from Dec 2, 2019

Conversation

@dameikle
Copy link
Member

dameikle commented Oct 30, 2018

  • Added extraction of image height/width from ExifSubIFDDirectory for compressed images
  • Include directory name as key qualifier for Exif directories to avoid clashes
- Added extraction of image height/width from ExifSubIFDDirectory for compressed images
- Include directory name as key qualifier for Exif directories to avoid clashes
@dameikle dameikle requested a review from tballison Oct 30, 2018
@dameikle

This comment has been minimized.

Copy link
Member Author

dameikle commented Oct 30, 2018

Hey @tballison - given the key name clashes for the Exif metadata I am proposing to add the directory name as the qualifier, hence the request for a review.

I was tempted to do this for all directories to make it clean but worry about downstream code that rely on the current values in 1.x stream.

I also thought about not doing it, but without doing it for at least Exif, we will continue to give the wrong value here without some logic to have a key hierarchy in CopyUnknownFieldsHandler.

@tballison

This comment has been minimized.

Copy link
Contributor

tballison commented Oct 30, 2018

I don't know enough about exif to be useful on this. If there are only a few standard directories (say, ExifSubIFDDirectory or ExifIFD0Descriptor), could we make those static property prefixes or static properties? Given that I don't know what we'll find in exif, I'm very hesitant about using the literal directory name. If we do go with the literal directory name in some cases, we should prefix that. WDYT?

@dameikle

This comment has been minimized.

Copy link
Member Author

dameikle commented Oct 30, 2018

@tballison Thanks for looking over it - my main worry was changing the behaviour. There is a fixed structure that one should expect with data included in them only if written, so we could make them static property prefixes. I think the directory name is safe in this instance given how the format works and agree the prefix route - its easy enough to make this fixed against those directories. I'll go with that.

@tballison

This comment has been minimized.

Copy link
Contributor

tballison commented Oct 30, 2018

Sorry, @dameikle, I should have addressed the change in behavior...the actual reason you asked for a second pair of eyes. 😆 I agree that changes in behavior are bad. IIUC, though, we'd be fixing what we're currently doing, which is over-writing info, right?

If we did something like this in branch_1x:

    if (directory instanceof ExifDirectoryBase) {
        metadata.set(directory.getName() + ":" + tag.getTagName(), tag.getDescription());
        metadata.set(tag.getTagName(), tag.getDescription());
    } else {
        metadata.set(EXIF_ROOT + ":" + tag.getTagName(), tag.getDescription());
        metadata.set(tag.getTagName(), tag.getDescription());
    }

That would maintain the same current (wrong) over-writing behavior, and introduce new tag names. I have no idea what an appropriate prefix would be for EXIF_ROOT...something static and documented and appropriate.

@dameikle

This comment has been minimized.

Copy link
Member Author

dameikle commented Nov 9, 2018

I was thinking something similar but stopped as it is not always going to Exif metadata that is processed by the Extractor. I am beginning to think it might be the right thing to actually fix the overwriting, unless you completely disagree.

@tballison

This comment has been minimized.

Copy link
Contributor

tballison commented Nov 10, 2018

I'm ok with correcting bad behavior. :D

bq. going to Exif metadata that is processed by the Extractor
What do you mean by this exactly? In that portion of the code, it might not be Exif metadata?

@dameikle

This comment has been minimized.

Copy link
Member Author

dameikle commented Nov 10, 2018

Yes, it could be one of different types of metadata directories - Exif, IPTC, XMP, ICC, etc. The challenge is really that Exif uses the same tag name in different directories, so unless keyed it overwrites.

@saitho

This comment has been minimized.

Copy link

saitho commented May 24, 2019

What's the status on this? :)

@tballison

This comment has been minimized.

Copy link
Contributor

tballison commented Oct 23, 2019

Unless there are objections, let's put this in 1.23?

@dameikle

This comment has been minimized.

Copy link
Member Author

dameikle commented Oct 27, 2019

@tballison - I agree, let's go with this one

@tballison tballison merged commit 90880e1 into apache:branch_1x Dec 2, 2019
tballison added a commit that referenced this pull request Dec 2, 2019
* TIKA-2630:
- Added extraction of image height/width from ExifSubIFDDirectory for compressed images
- Include directory name as key qualifier for Exif directories to avoid clashes

* TIKA-2630: Tidied up code

# Conflicts:
#	tika-parsers/src/test/java/org/apache/tika/parser/rtf/RTFParserTest.java
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants
You can’t perform that action at this time.