-
Notifications
You must be signed in to change notification settings - Fork 257
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prefix protocol metadata #789
Conversation
…ccidentally overwritten; fixes #776
…ccidentally overwritten; fixes #776 - prefix "etag" metadata key in HTTP protocol implementations
Looks good. What about the ETag header? Should also get prefixed by default. See 582ead1 ? |
One remark (haven't verified whether this is a problem or not): what happens if there is already some metadata stored in the status index from a prior fetch (eg. "protocol.etag" or just "etag") and now a fetch is made without the "ETag" header in the response? Is the prior value left unchanged? Should it be cleared/removed? - The prefix would allow to safely clear old protocol metadata. |
Handling of cookies also requires changes regarding the prefixed metadata. Here cleaning up the metadata would break handling of cookies: the set-cookie value must be preserved unless a new one is sent by the server. I'll update branch 776 regarding ETag and cookies. |
Also, the WARC writer requires changes as it reads the HTTP headers stored in response metadata. |
…ccidentally overwritten; fixes #776 - prefix "set-cookie" metadata key when read in HTTP protocol implementations - add note in default configuration about protocol metadata prefixes and metadata.persist and metadata.transfer - implement method in Metadata to insert/update of metadata with prefix
…ccidentally overwritten; fixes #776 - WARC module: access protocol metadata using the configured prefix
…ccidentally overwritten; fixes #776 - read from metadata using the protocol metadata prefix: * HTTP Content-Type * info whether content payload has been trimmed during fetch
Also the parsers required changes to read HTTP Content-Type and status whether content is trimmed. |
thanks @sebastian-nagel , will have a look tomorrow |
instead of changing all the key names (which makes the code a bit more difficult to read) - what about adding getFirstValue(key, prefix) to metadata? |
as a general observation, this is quite a few changes for an issue which is largely hypothetical, but let's see if we can get the right balance |
To retrieve the cookies also getValues(key, prefix) is required. I totally agree about the amount of changes. I also didn't expect that many. However, it isn't hypothetical regarding the last-modified date format which is still an open issue (#777) - all because the field is (also) filled from an external server and there are some limitations on the date format (see #492). |
@sebastian-nagel have added the 2 methods to Metadata. |
+1 looks good |
@sebastian-nagel do we really need _public void putAll(Metadata m, String prefix, boolean cleanup)? we're only using it iin 2 places and with false. I'll add a default value for protocol.md.prefix in the default config of 'protocol.'; anyone upgrading on an existing crawl will have to override it to an empty string - I'll make this clear in the release notes. Leaving it to an empty string by default means no one will set it even for new crawls. What do you think? |
Not necessarily. The the discussion above why I've added the cleanup option and why it cannot be used. The other motivation was to avoid unnecessary String concatenations if the prefix is empty. I'm in course to run a test crawl with the changes applied, I'll let you know about the results later today. |
cleanup -> found your previous comment. Let's remove that method then. test-> thanks, I'll wait to hear from you before I merge |
Hi @jnioche, the test crawl went well. Passing of protocol metadata to parsers, WARC writers seems to work. Unfortunately, I forgot to set http.use.cookies = true in the config - but cookies are saved in ES status index in metadata ("protocol%2Eset-cookie"). |
Merged. Thanks @sebastian-nagel |
fixes #776
@sebastian-nagel is that what you had in mind?