
Service provider: ListIdentifierHandler (?) truncating resumptionToken #170

Open
Ajmma opened this issue Aug 10, 2023 · 7 comments
Labels
question (Further information is requested), service-provider (Related to Service Provider implementation)

Comments

@Ajmma

Ajmma commented Aug 10, 2023

Hello, I've run into a weird issue during a full-range identifier harvest of http://dlibra.umcs.lublin.pl:

io.gdcc.xoai.serviceprovider.exceptions.InvalidOAIResponse: OAI responded with code: badResumptionToken
        at io.gdcc.xoai.serviceprovider.parsers.ListIdentifiersParser.hasNext(ListIdentifiersParser.java:41)
        at io.gdcc.xoai.serviceprovider.handler.ListIdentifierHandler.nextIteration(ListIdentifierHandler.java:64)
        at io.gdcc.xoai.serviceprovider.lazy.ItemIterator.hasNext(ItemIterator.java:31)

I investigated this exception and found that the resumptionToken gets truncated at some point during processing.
The values assigned to the resumptionToken variable in the ListIdentifierHandler class look like this:

2261E6AEAC7E55ECC864C955C7231E63ListIdentifiers1691656059787_DL_LAST_ITEM_50_DL_METADATA_mets
2261E6AEAC7E55ECC864C955C7231E63ListIdentifiers1691656059787_DL_LAST_ITEM_100_DL_METADATA_mets
...
2261E6AEAC7E55ECC864C955C7231E63ListIdentifiers1691656059787_DL_LAST_ITEM_1500_DL_METADATA_mets
2261E6AEAC7E55ECC864C955C7231E63ListIdentifiers1691656059787_DL_LAST_ITEM_1550_DL_METADATA_mets
2261E6AEAC7E55ECC864C955C7231E63ListIdentifiers1691656059787_DL_LAST_ITEM_1600_DL_

The resumptionToken is truncated every time at the same point: the token with last item = 1600.

The resumption token returned by this source for 1600 looks like this:
<resumptionToken completeListSize="45695" cursor="1550" expirationDate="2023-08-10T11:12:09Z">4ECCFC571D6632484E8D04ECFF3214A3ListIdentifiers1691656753766_DL_LAST_ITEM_1600_DL_METADATA_mets</resumptionToken>

Could you check it out? I would be grateful.

@poikilotherm
Member

poikilotherm commented Aug 10, 2023

Hi there, thanks for opening the bug report.

To make the process easier: are you using the library on its own or within Dataverse?
Could you provide a simple reproducer so one can see the error and dig in from there?

Thank you!

poikilotherm added the labels bug (Something isn't working) and service-provider (Related to Service Provider implementation) on Aug 10, 2023
@Ajmma
Author

Ajmma commented Aug 11, 2023

Sorry for the short description of my problem.

I'm using the library on its own, not within Dataverse.

I'm trying to harvest all identifiers from a specific digital library that offers the OAI-PMH protocol.
The first URL, which retrieves the first batch of identifiers, looks like this:
http://dlibra.umcs.lublin.pl/dlibra/oai-pmh-repository.xml?verb=ListIdentifiers&until=2023-08-11T23%3A59%3A59Z&metadataPrefix=mets

The response I get from the OAI endpoint looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="http://dlibra.umcs.lublin.pl/style/common/xsl/oai-style.xsl"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" 
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
         http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
	<responseDate>2023-08-11T14:09:07Z</responseDate>
	<request metadataPrefix="mets" verb="ListIdentifiers" until="2023-08-11T23:59:59Z">
	http://bc.umcs.pl/oai-pmh-repository.xml</request>
	<ListIdentifiers>
	
	<header>
		<identifier>oai:bc.umcs.pl:630</identifier>
	    <datestamp>2009-04-29T13:12:12Z</datestamp>
		  <setSpec>dLibraDigitalLibrary:CulturalHeritage</setSpec> 	      
		  <setSpec>dLibraDigitalLibrary:CulturalHeritage:Journals</setSpec> 	      
		  <setSpec>dLibraDigitalLibrary</setSpec> 	    
    </header>
	<header>
		<identifier>oai:bc.umcs.pl:606</identifier>
	    <datestamp>2009-04-29T11:41:18Z</datestamp>
		  <setSpec>dLibraDigitalLibrary:CulturalHeritage</setSpec> 	      
		  <setSpec>dLibraDigitalLibrary:CulturalHeritage:Journals</setSpec> 	      
		  <setSpec>dLibraDigitalLibrary</setSpec> 	    
    </header>
		  
	... 47 headers later ...
	
	<header>
		<identifier>oai:bc.umcs.pl:615</identifier>
	    <datestamp>2009-04-29T11:45:29Z</datestamp>
		  <setSpec>dLibraDigitalLibrary:CulturalHeritage</setSpec>
 	      <setSpec>dLibraDigitalLibrary:CulturalHeritage:Journals</setSpec> 	      
		  <setSpec>dLibraDigitalLibrary</setSpec> 	    
	</header>
  <resumptionToken completeListSize="45695" cursor="0" expirationDate="2023-08-11T14:39:08Z">25A85CC8B70F1723A26EF05A6F2C10A6ListIdentifiers1691755748361_DL_LAST_ITEM_50_DL_METADATA_mets</resumptionToken>	
  </ListIdentifiers>
</OAI-PMH>

(You can probably get a similar response by pasting the URL into a browser.)

The resumption tokens differ only in the number after LAST_ITEM:
F889BBA1B1F73EE9FD88BBF0F39ABDA7ListIdentifiers1691757140380_DL_LAST_ITEM_50_DL_METADATA_mets
F889BBA1B1F73EE9FD88BBF0F39ABDA7ListIdentifiers1691757140380_DL_LAST_ITEM_100_DL_METADATA_mets
etc...
That number is simply the value of the cursor attribute on the next resumptionToken (if the current resumptionToken ends in _LAST_ITEM_50, the next resumptionToken will have cursor = 50).

After I get my batch of identifiers and the resumptionToken for continuing the harvest, I prepare another URL, this time with the resumptionToken parameter, which looks like this: http://dlibra.umcs.lublin.pl/dlibra/oai-pmh-repository.xml?verb=ListIdentifiers&resumptionToken=F889BBA1B1F73EE9FD88BBF0F39ABDA7ListIdentifiers1691757140380_DL_LAST_ITEM_50_DL_METADATA_mets

With it I get the next batch of identifiers.
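
As a minimal sketch of the follow-up request described above: in OAI-PMH, resumptionToken is an exclusive argument, so the continuation URL carries only verb and resumptionToken. The class and method names here are hypothetical and are not part of xoai; the token is URL-encoded defensively, which leaves this particular token unchanged because it contains only letters, digits and underscores.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the xoai API.
final class ResumptionUrlBuilder {

    // Builds the continuation URL: only verb and resumptionToken are sent,
    // because resumptionToken is an exclusive argument in OAI-PMH.
    static String nextPageUrl(String baseUrl, String verb, String resumptionToken) {
        return baseUrl
                + "?verb=" + URLEncoder.encode(verb, StandardCharsets.UTF_8)
                + "&resumptionToken=" + URLEncoder.encode(resumptionToken, StandardCharsets.UTF_8);
    }
}
```

Comparing the token in the generated URL with the token in the previous response is a quick way to see whether anything was lost before the request was built.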

Processing goes fine, but when the app reaches the identifier list whose resumptionToken ends in _LAST_ITEM_1600, the resumptionToken is truncated. The shortened resumptionToken is passed as a parameter to my own CustomOaiClient (the same happens with JdkHttpOaiClient). I then create the URL from the parameters like this:

@Override
public InputStream execute(Parameters parameters) throws OAIRequestException {
        final HttpGet request = new HttpGet(parameters.toUrl(baseUrl)); 

and with that URL I request more identifiers. The URL looks like this: http://dlibra.umcs.lublin.pl/dlibra/oai-pmh-repository.xml?verb=ListIdentifiers&resumptionToken=F889BBA1B1F73EE9FD88BBF0F39ABDA7ListIdentifiers1691757140380_DL_LAST_ITEM_1600_DL_

The METADATA_mets part is missing, and because of that I get the io.gdcc.xoai.serviceprovider.exceptions.InvalidOAIResponse: OAI responded with code: badResumptionToken exception.

I debugged this to check whether I was accidentally truncating the token while preparing the request.
I thought it might have something to do with URL parameter encoding, but this specific resumptionToken is already truncated before the parameters get encoded.

This resumptionToken is already truncated here:


After that, this text variable is assigned to the resumptionToken variable, and that resumptionToken is passed as a parameter to the next request. It's hard to tell what is going on here that could truncate this specific resumptionToken. As far as I can tell, the token is neither significantly longer than the others nor different in any other way. That's why I'm asking whether you could tell me what might be going wrong here, or whether this is simply a bug.

The token arrives correctly in the OAI-PMH response itself. It is only the extraction of that token from the OAI response into Java code that does something unexpected.
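
The thread does not confirm a cause, so the following is only one possible explanation, sketched under assumptions. With a StAX XMLStreamReader, the text of a single element may be delivered in several CHARACTERS events (for example when it straddles an internal buffer boundary), so code that reads only the first event via getText() can end up with a partial value. Whether xoai reads the token this way is an assumption; the class below is hypothetical and only shows a way to read an element's complete text with the standard javax.xml.stream API.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of the xoai API.
final class ResumptionTokenReader {

    // Returns the complete text of the first <resumptionToken> element, or null if there is none.
    static String readToken(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newFactory();
        // Ask the parser to merge adjacent character data into a single event.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "resumptionToken".equals(reader.getLocalName())) {
                    // getElementText() concatenates all character data up to the end tag,
                    // so a value split across several events is still returned whole.
                    return reader.getElementText().trim();
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }
}
```

Feeding the raw response for the page ending in _LAST_ITEM_1600 through a reader like this and comparing the result with the value seen in ListIdentifierHandler could help narrow down where the truncation happens.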

@Ajmma
Author

Ajmma commented Aug 30, 2023

Hi guys,

Hope you're doing well. I wanted to draw your attention back to the GitHub issue I raised, which seems to have slipped off the radar. Your insights and expertise on this matter would be really helpful in making progress.

Looking forward to your involvement in resolving this. Thanks!

@poikilotherm
Member

Sorry, we were/are pretty busy with preparing Dataverse 6.0, so there were no cycles left to address this.

If you want to dive into this on your own, please feel free to give it a go! PRs much appreciated!

Personally, I'd try to do a recording of the HTTP data exchange and put it into WireMock, so we can test something and also keep the test around for the future.
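
The replay idea can be illustrated without any extra dependency. The sketch below is not WireMock: it swaps in the JDK's built-in com.sun.net.httpserver.HttpServer to serve one recorded response body from a local port, so a harvester can be pointed at it in a test. The class name, the /oai path and the fetch helper are assumptions made for illustration.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical test helper: replays one recorded OAI-PMH response over local HTTP.
final class CannedOaiServer implements AutoCloseable {
    private final HttpServer server;

    CannedOaiServer(String recordedXml) throws IOException {
        byte[] body = recordedXml.getBytes(StandardCharsets.UTF_8);
        // Port 0 lets the operating system pick a free port.
        server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/oai", exchange -> {
            exchange.getResponseHeaders().add("Content-Type", "text/xml; charset=UTF-8");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
    }

    // Base URL a harvester under test would be configured with.
    URI baseUri() {
        return URI.create("http://127.0.0.1:" + server.getAddress().getPort() + "/oai");
    }

    // Convenience method to check what the server actually returns.
    static String fetch(URI uri) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString(StandardCharsets.UTF_8));
        return response.body();
    }

    @Override
    public void close() {
        server.stop(0);
    }
}
```

A real reproducer would need one recorded body per resumptionToken, which is the part a recording tool such as WireMock automates.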

@pdurbin
Member

pdurbin commented Sep 1, 2023

I don't have much to add except that I'm linking to this issue from the intro of a new doc I wrote. 😄

And, yes @Ajmma I'd like to echo that you are very welcome to make a pull request! ❤️

@poikilotherm
Member

Let me add a quick comment that I spent a few cycles yesterday, trying to create a reproducer. I wasn't really able to pin down repeatable fail conditions. Will push soon, so someone can play around with it some more.

poikilotherm added the label question (Further information is requested) and removed the label bug (Something isn't working) on Sep 15, 2023
@poikilotherm
Member

poikilotherm commented Sep 15, 2023

Here's something to play around with: https://github.com/gdcc/xoai/tree/170-reproducer

I played with different combos of parameters, but was not able to reliably reproduce the problem in https://github.com/gdcc/xoai/blob/170-reproducer/xoai-service-provider/src/test/java/io/gdcc/xoai/serviceprovider/reproducers/Issue170IT.java

@Ajmma if you can provide a combo that works, please let me know!
