Proquest redirects preventing access events from being registered #666

Open
chryslovelace opened this issue Nov 7, 2022 · 2 comments

@chryslovelace

We have an issue where access events for Proquest are not being registered due to their use of redirects. Here are some snippets of sessions that demonstrate this issue:

RoEEmPk8iBS5zWn [29/Nov/2021:22:41:55 -0500] "GET https://www.proquest.com:443/docview/1475116173?pq-origsite=primo&accountid=14709 HTTP/1.1" 302 0
RoEEmPk8iBS5zWn [29/Nov/2021:22:41:56 -0500] "GET https://www.proquest.com:443/intermediateredirectforezproxy HTTP/1.1" 302 0
RoEEmPk8iBS5zWn [29/Nov/2021:22:41:56 -0500] "GET https://www.proquest.com:443/intermediateredirectforezproxy/advanced HTTP/1.1" 200 1798

rVw0khgPBiBPJeX [29/Nov/2021:08:38:53 -0500] "GET https://www.proquest.com:443/docview/1819126361?pq-origsite=primo HTTP/1.1" 302 0
rVw0khgPBiBPJeX [29/Nov/2021:08:38:53 -0500] "GET https://www.proquest.com:443/intermediateredirectforezproxy HTTP/1.1" 302 0
rVw0khgPBiBPJeX [29/Nov/2021:08:38:53 -0500] "GET https://www.proquest.com:443/intermediateredirectforezproxy/advanced HTTP/1.1" 200 1782

KnObbZigVCjacdA [27/Nov/2021:23:50:49 -0500] "GET https://www.proquest.com:443/docview/1295901959?pq-origsite=primo&accountid=14709 HTTP/1.1" 302 0
KnObbZigVCjacdA [27/Nov/2021:23:50:50 -0500] "GET https://www.proquest.com:443/intermediateredirectforezproxy HTTP/1.1" 302 0
KnObbZigVCjacdA [27/Nov/2021:23:50:50 -0500] "GET https://www.proquest.com:443/intermediateredirectforezproxy/advanced HTTP/1.1" 200 1798

The URL in the first line of each snippet includes the document id, and proquest/parser.js looks like it should match this URL format, but these lines are presumably being ignored because of the 302 status and/or the empty response body. The actual content is delivered by the third request, but by then the id is no longer present in the URL, so the access event can't be properly registered.
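To make the problem concrete, here is a minimal sketch of pulling the document id out of the first request's URL. `extractDocviewId` is a hypothetical helper written for illustration; it is not taken from proquest/parser.js or from the ezPAARSE API.

```javascript
'use strict';

// Hypothetical helper, not part of ezPAARSE or proquest/parser.js.
// Returns the numeric id from a ProQuest /docview/<id> URL, or null
// when the URL cannot be parsed or carries no such id.
function extractDocviewId(rawUrl) {
  let url;
  try {
    url = new URL(rawUrl);
  } catch (err) {
    return null; // not a parseable absolute URL
  }
  const match = /^\/docview\/(\d+)/.exec(url.pathname);
  return match ? match[1] : null;
}
```

Applied to the log lines above, only the first request of each session yields an id; the two `intermediateredirectforezproxy` requests return `null`.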

In some previous correspondence, our organization asked whether multiple lines could be combined to determine an access event, and the response was that it was not possible. Is this still the case given this issue? If combining lines is still not possible, is there a way for the initial request here to count as the access event, so that those identifiers can be extracted?

@tporquet
Contributor

tporquet commented Dec 8, 2022

Hello, and sorry for the delayed response.
Those lines are indeed ignored by ezPAARSE by default.
You could configure ezPAARSE not to ignore lines with a 302 status, but this is a global parameter; see: https://ezpaarse-project.github.io/ezpaarse/configuration/parametres.html#ezpaarse-filter-status
We are thinking about allowing that feature on a per-parser basis (instead of as a global parameter) to keep the processing load as low as possible (in a typical log file, we filter out 90-95% of the log lines).

@tporquet
Contributor

tporquet commented Dec 8, 2022

As for your second question about combining multiple lines to determine an access event, which is obviously linked to the 302 situation, we are also considering either:

  • making specific parsers able to process specific 302 lines containing identifiers and to generate access events from that line alone ("simple" solution)
  • or modifying our default mechanism (where each log line is considered on its own when generating an access event) so that multiple log lines can be cached and linked to generate one access event (this solution is obviously heavier and will be considered only if the simple solution is not acceptable).
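The second option could look roughly like the sketch below. It is only an illustration of the idea, not ezPAARSE code: `linkRedirects`, the shape of the input objects (`session`, `url`, `status`, `size`) and the `docId` field are all assumptions, and the log lines are assumed to be already split into fields, with the first field of each line used as the session identifier.

```javascript
'use strict';

// Hypothetical sketch of linking several log lines into one access event.
// Input: already-parsed log lines, in chronological order, shaped as
//   { session: string, url: string, status: number, size: number }
function linkRedirects(lines) {
  const pending = new Map(); // session -> document id seen on a 302 line
  const events = [];

  for (const line of lines) {
    const path = new URL(line.url).pathname;
    const doc = /^\/docview\/(\d+)/.exec(path);

    if (line.status === 302 && doc) {
      // Remember the id; the content itself arrives in a later request.
      pending.set(line.session, doc[1]);
      continue;
    }

    const isLandingPage = path.startsWith('/intermediateredirectforezproxy');
    if (line.status === 200 && isLandingPage && pending.has(line.session)) {
      // Content delivered: emit one event carrying the cached id.
      events.push({
        session: line.session,
        docId: pending.get(line.session),
        size: line.size,
      });
      pending.delete(line.session);
    }
  }
  return events;
}
```

The per-session map is what makes this heavier than the "simple" solution: state has to be kept between lines, and a real implementation would also need to expire ids that are never followed by a 200 response.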

NB: The only use case where we keep a memory of a previous access event is the COUNTER deduplication algorithm, where we filter out access events if the same resource is accessed by the same user session or user id within a short timespan (10 to 30 seconds, depending on the resource format).
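As a rough illustration of that kind of time-window filtering, here is a minimal sketch. It is not ezPAARSE's actual deduplication code: `createDeduplicator` and the event fields (`user`, `resource`, `timestamp` in seconds) are hypothetical, and a single fixed window is assumed rather than one that varies by resource format.

```javascript
'use strict';

// Hypothetical sketch of time-window deduplication, not ezPAARSE's algorithm.
// Events are expected in chronological order, with `timestamp` in seconds.
function createDeduplicator(windowSeconds) {
  const lastSeen = new Map(); // "user|resource" -> timestamp of last access

  return function isDuplicate(event) {
    const key = `${event.user}|${event.resource}`;
    const previous = lastSeen.get(key);
    lastSeen.set(key, event.timestamp);
    // Duplicate when the same user hit the same resource within the window.
    return previous !== undefined && event.timestamp - previous <= windowSeconds;
  };
}
```

With a 10-second window, a second access to the same resource by the same session 5 seconds after the first would be flagged as a duplicate, while one arriving minutes later would not.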
