Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prefix protocol metadata #789

Merged
merged 8 commits into from
Mar 3, 2020
Merged

Prefix protocol metadata #789

merged 8 commits into from
Mar 3, 2020

Conversation

jnioche
Copy link
Contributor

@jnioche jnioche commented Feb 20, 2020

fixes #776
@sebastian-nagel is that what you had in mind?

…ccidentally overwritten; fixes #776

- prefix "etag" metadata key in HTTP protocol implementations
@sebastian-nagel
Copy link
Contributor

Looks good. What about the ETag header? Should also get prefixed by default. See 582ead1 ?

@sebastian-nagel
Copy link
Contributor

One remark (haven't verified whether this is a problem or not): what happens if there is already some metadata stored in the status index from a prior fetch (eg. "protocol.etag" or just "etag") and now a fetch is made without the "ETag" header in the response? Is the prior value left unchanged? Should it be cleared/removed? - The prefix would allow to safely clear old protocol metadata.

@sebastian-nagel
Copy link
Contributor

Handling of cookies also requires changes regarding the prefixed metadata. Here cleaning up the metadata would break handling of cookies: the set-cookie value must be preserved unless a new one is sent by the server. I'll update branch 776 regarding ETag and cookies.

@sebastian-nagel
Copy link
Contributor

Also, the WARC writer requires changes as it reads the HTTP headers stored in response metadata.

…ccidentally overwritten; fixes #776

- prefix "set-cookie" metadata key when read in HTTP protocol implementations
- add note in default configuration about protocol metadata prefixes and
  metadata.persist and metadata.transfer
- implement method in Metadata to insert/update of metadata with prefix
…ccidentally overwritten; fixes #776

- WARC module: access protocol metadata using the configured prefix
…ccidentally overwritten; fixes #776

- read from metadata using the protocol metadata prefix:
  * HTTP Content-Type
  * info whether content payload has been trimmed during fetch
@sebastian-nagel
Copy link
Contributor

Also the parsers required changes to read HTTP Content-Type and status whether content is trimmed.

@jnioche
Copy link
Contributor Author

jnioche commented Feb 26, 2020

thanks @sebastian-nagel , will have a look tomorrow

@jnioche
Copy link
Contributor Author

jnioche commented Feb 27, 2020

instead of changing all the key names (which makes the code a bit more difficult to read) - what about adding getFirstValue(key, prefix) to metadata?

@jnioche
Copy link
Contributor Author

jnioche commented Feb 27, 2020

as a general observation, this is quite a few changes for an issue which is largely hypothetical, but let's see if we can get the right balance

@sebastian-nagel
Copy link
Contributor

getFirstValue(key, prefix)

To retrieve the cookies also getValues(key, prefix) is required.

I totally agree about the amount of changes. I also didn't expect that many. However, it isn't hypothetical regarding the last-modified date format which is still an open issue (#777) - all because the field is (also) filled from an external server and there are some limitations on the date format (see #492).

@jnioche
Copy link
Contributor Author

jnioche commented Feb 27, 2020

@sebastian-nagel have added the 2 methods to Metadata.
Need to get my head around putAll(Metadata m, String prefix, boolean cleanup)

@sebastian-nagel
Copy link
Contributor

use getValue methods with prefix

+1 looks good

@jnioche
Copy link
Contributor Author

jnioche commented Mar 2, 2020

@sebastian-nagel do we really need _public void putAll(Metadata m, String prefix, boolean cleanup)?

we're only using it iin 2 places and with false.

I'll add a default value for protocol.md.prefix in the default config of 'protocol.'; anyone upgrading on an existing crawl will have to override it to an empty string - I'll make this clear in the release notes. Leaving it to an empty string by default means no one will set it even for new crawls. What do you think?

@sebastian-nagel
Copy link
Contributor

Not necessarily. The the discussion above why I've added the cleanup option and why it cannot be used. The other motivation was to avoid unnecessary String concatenations if the prefix is empty.

I'm in course to run a test crawl with the changes applied, I'll let you know about the results later today.

@jnioche
Copy link
Contributor Author

jnioche commented Mar 2, 2020

cleanup -> found your previous comment. Let's remove that method then.

test-> thanks, I'll wait to hear from you before I merge

@sebastian-nagel
Copy link
Contributor

Hi @jnioche, the test crawl went well. Passing of protocol metadata to parsers, WARC writers seems to work. Unfortunately, I forgot to set http.use.cookies = true in the config - but cookies are saved in ES status index in metadata ("protocol%2Eset-cookie").

@jnioche jnioche merged commit b19bf71 into master Mar 3, 2020
@jnioche jnioche deleted the 776 branch March 3, 2020 14:18
@jnioche
Copy link
Contributor Author

jnioche commented Mar 3, 2020

Merged. Thanks @sebastian-nagel

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Prefix protocol metadata to avoid that internal metadata fields are accidentally overwritten
2 participants