Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extract more standard metadata from binary files #78754

Merged
merged 9 commits into from
Nov 23, 2021

Conversation

dadoonet
Copy link
Member

@dadoonet dadoonet commented Oct 6, 2021

Until now, we have been extracted a few number of fields from the binary files sent to the ingest attachment plugin:

  • content,
  • title,
  • author,
  • keywords,
  • date,
  • content_type,
  • content_length,
  • language.

Tika has a list of more standard properties which can be extracted:

  • modified,
  • format,
  • identifier,
  • contributor,
  • coverage,
  • modifier,
  • creator_tool,
  • publisher,
  • relation,
  • rights,
  • source,
  • type,
  • description,
  • print_date,
  • metadata_date,
  • latitude,
  • longitude,
  • altitude,
  • rating,
  • comments

This commit exposes those new fields.

Related to #22339.

@dadoonet dadoonet marked this pull request as ready for review October 6, 2021 12:38
@martijnvg martijnvg added the :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP label Oct 6, 2021
@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label Oct 6, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@martijnvg martijnvg added >enhancement and removed Team:Data Management Meta label for data/management team labels Oct 6, 2021
@dakrone dakrone requested a review from masseyke October 14, 2021 15:25
@masseyke
Copy link
Member

@elasticmachine run elasticsearch-ci/part-1

@masseyke
Copy link
Member

@elasticmachine run elasticsearch-ci/rest-compatibility

@masseyke
Copy link
Member

@elasticmachine run elasticsearch-ci/bwc

@masseyke
Copy link
Member

@dadoonet the code looks good to me, but I think some of the counters need to be updated for the integration tests. I've just rerun the 3 failing builds because the results had aged out.

@dadoonet
Copy link
Member Author

but I think some of the counters need to be updated for the integration tests

Ha! I forgot about the integration tests! :)

Should be ok now.

Until now, we have been extracted a few number of fields from the binary files sent to the ingest attachment plugin:

* `content`,
* `title`,
* `author`,
* `keywords`,
* `date`,
* `content_type`,
* `content_length`,
* `language`.

Tika has a list of more standard properties which can be extracted:

* `modified`,
* `format`,
* `identifier`,
* `contributor`,
* `coverage`,
* `modifier`,
* `creator_tool`,
* `publisher`,
* `relation`,
* `rights`,
* `source`,
* `type`,
* `description`,
* `print_date`,
* `metadata_date`,
* `latitude`,
* `longitude`,
* `altitude`,
* `rating`,
* `comments`

This commit exposes those new fields.

Related to elastic#22339.
@masseyke
Copy link
Member

@elasticmachine run elasticsearch-ci/rest-compatibility

@dadoonet
Copy link
Member Author

@masseyke What do you think should be fixed on my end to pass the rest-compatibility test?

@masseyke
Copy link
Member

@dadoonet I'm not sure. I don't know why the test is failing the way it is because it appears you have fixed it. I am about to re-pull your branch and run the test locally.

@dadoonet
Copy link
Member Author

I have no idea about what the rest-compatibility is doing but may be it's because new fields have been added so when you compare a previous version with the current, it looks like a breaking change?
May be it needs to be "annotated" as breaking somewhere?

@masseyke
Copy link
Member

Yeah you are right that the test sees this as a breaking change since the fields weren't in 7.x. And it doesn't look like we have the version information of the requesting node in the AttachmentProcessor so we could not selectively filter out fields (which is how we handle this in other places). I'm trying to figure out what the best way forward is. It might be as you say to just declare this a breaking change (and verify that it does not actually break anything).

As we added new fields, this test is failing the bwc tests.

We can not have access to `RestApiVersion` in the `AttachmentProcessor` so we can't decide to produce or not the fields depending on the version.

As the change is trivial and not removing any existing field, we could skip this regression test.
@dadoonet
Copy link
Member Author

I updated the tests to skip bwc tests with 7.x versions. Let see if it's ok.

The other option would be to change the code and find a way to propagate the REST API version to the AttachmentProcessor and do not emit new fields when the version is lower than 8.x.

@dadoonet
Copy link
Member Author

@masseyke Do you know what should I do to fix the test?

# Conflicts:
#	plugins/ingest-attachment/src/test/java/org/elasticsearch/ingest/attachment/AttachmentProcessorTests.java
@masseyke
Copy link
Member

@dadoonet sorry for the slow reply. I'm going to take a look at this today. But I think that the problem is that it is actually a breaking change for backwards compatibility so it might be tricky to fix.

@dadoonet
Copy link
Member Author

So I could make this configurable? Like having a setting which activates the new feature?

@masseyke
Copy link
Member

Yeah I think it'll have to be something like that. I believe I had looked previously, and we don't have access to the version of the calling server like we do in a lot of REST calls. That's the way we usually get around this kind of thing -- if the version of the caller is detected as 7.x, we return what 7.x expects to see.

@masseyke
Copy link
Member

@elasticmachine run elasticsearch-ci/bwc

@masseyke masseyke added v8.1.0 auto-backport-and-merge Automatically create backport pull requests and merge when ready labels Nov 22, 2021
Copy link
Member

@masseyke masseyke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

@dadoonet dadoonet merged commit 564ff9d into elastic:master Nov 23, 2021
@dadoonet dadoonet self-assigned this Nov 23, 2021
@dadoonet
Copy link
Member Author

Thanks!

@dadoonet dadoonet deleted the pr/tika-more-props branch November 23, 2021 09:35
@dadoonet
Copy link
Member Author

It has not been backported to 8.0 branch. Is there anything to do to trigger the backport action?

weizijun added a commit to weizijun/elasticsearch that referenced this pull request Nov 24, 2021
* upstream/master: (29 commits)
  Fix typo (elastic#80925)
  Increase docker compose timeouts for CI builds
  TSDB: fix error without feature flag (elastic#80945)
  [DOCS] Relocate `index.mapping.dimension_fields.limit` setting docs (elastic#80964)
  Explicit write methods for always-missing values (elastic#80958)
  TSDB: move TimeSeriesModeIT to yaml tests (elastic#80933)
  [ML] Removing temporary debug (elastic#80956)
  Remove unused ConnectTransportException#node (elastic#80944)
  Reinterpret dots in field names as object structure (elastic#79922)
  Remove obsolete typed legacy index templates (elastic#80937)
  Remove unnecessary shuffle in unassigned shards allocation. (elastic#65172)
  TSDB: Tests for nanosecond timeprecision timestamp just beyond the limit (elastic#80932)
  Cleanup SLM History Item .equals (elastic#80938)
  Rework breaking changes for new structure (elastic#80907)
  [DOCS] Fix elasticsearch-reset-password typo (elastic#80919)
  [ML] No need to use parent task client when internal infer delegates (elastic#80905)
  Fix shadowed vars pt6 (elastic#80899)
  add ignore info (elastic#80924)
  Fix several potential circuit breaker leaks in Aggregators (elastic#79676)
  Extract more standard metadata from binary files (elastic#78754)
  ...
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Nov 24, 2021
* upstream/master: (319 commits)
  Fix typo (elastic#80925)
  Increase docker compose timeouts for CI builds
  TSDB: fix error without feature flag (elastic#80945)
  [DOCS] Relocate `index.mapping.dimension_fields.limit` setting docs (elastic#80964)
  Explicit write methods for always-missing values (elastic#80958)
  TSDB: move TimeSeriesModeIT to yaml tests (elastic#80933)
  [ML] Removing temporary debug (elastic#80956)
  Remove unused ConnectTransportException#node (elastic#80944)
  Reinterpret dots in field names as object structure (elastic#79922)
  Remove obsolete typed legacy index templates (elastic#80937)
  Remove unnecessary shuffle in unassigned shards allocation. (elastic#65172)
  TSDB: Tests for nanosecond timeprecision timestamp just beyond the limit (elastic#80932)
  Cleanup SLM History Item .equals (elastic#80938)
  Rework breaking changes for new structure (elastic#80907)
  [DOCS] Fix elasticsearch-reset-password typo (elastic#80919)
  [ML] No need to use parent task client when internal infer delegates (elastic#80905)
  Fix shadowed vars pt6 (elastic#80899)
  add ignore info (elastic#80924)
  Fix several potential circuit breaker leaks in Aggregators (elastic#79676)
  Extract more standard metadata from binary files (elastic#78754)
  ...
masseyke added a commit to masseyke/elasticsearch that referenced this pull request Nov 29, 2021
Until now, we have been extracted a few number of fields from the binary files sent to the ingest attachment plugin:

* `content`,
* `title`,
* `author`,
* `keywords`,
* `date`,
* `content_type`,
* `content_length`,
* `language`.

Tika has a list of more standard properties which can be extracted:

* `modified`,
* `format`,
* `identifier`,
* `contributor`,
* `coverage`,
* `modifier`,
* `creator_tool`,
* `publisher`,
* `relation`,
* `rights`,
* `source`,
* `type`,
* `description`,
* `print_date`,
* `metadata_date`,
* `latitude`,
* `longitude`,
* `altitude`,
* `rating`,
* `comments`

This commit exposes those new fields.

Related to elastic#22339.

Co-authored-by: Keith Massey <keith.massey@elastic.co>
masseyke added a commit that referenced this pull request Nov 29, 2021
Until now, we have been extracted a few number of fields from the binary files sent to the ingest attachment plugin:

* `content`,
* `title`,
* `author`,
* `keywords`,
* `date`,
* `content_type`,
* `content_length`,
* `language`.

Tika has a list of more standard properties which can be extracted:

* `modified`,
* `format`,
* `identifier`,
* `contributor`,
* `coverage`,
* `modifier`,
* `creator_tool`,
* `publisher`,
* `relation`,
* `rights`,
* `source`,
* `type`,
* `description`,
* `print_date`,
* `metadata_date`,
* `latitude`,
* `longitude`,
* `altitude`,
* `rating`,
* `comments`

This commit exposes those new fields.

Related to #22339.

Co-authored-by: Keith Massey <keith.massey@elastic.co>

Co-authored-by: David Pilato <david@pilato.fr>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport-and-merge Automatically create backport pull requests and merge when ready :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP >enhancement v8.0.0-rc1 v8.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants