ResumptionToken

completeListSize

background: This section pertains to the optional attribute completeListSize in the resumptionToken of an oai-pmh ListRecords request.

I've also inquired of this issue on stackoverflow

The way we page through chunks of records when a client is harvesting from the MST is to sort by recordId and then add the previous highest recordId to the subsequent queries.

select straight_join 
  r.record_id,
  r.oai_datestamp,
  r.format_id,
  r.status,
  x.xml,
  max(u.date_updated) as date_updated
from 
  ${REPO_NAME}.record_updates u force index (idx_${REPO_NAME}_record_updates_record_id),
  ${REPO_NAME}.records r IGNORE index (idx_${REPO_NAME}_records_format_id),
  ${REPO_NAME}.records_xml x,
  ${REPO_NAME}.record_sets rs ignore index (idx_${REPO_NAME}_record_sets_set_id)
where r.record_id = x.record_id 
  and (r.record_id > ${RECORD_ID} or ${RECORD_ID} is null)
  and r.record_id = u.record_id
  and (u.date_updated > ${START_DATE} or ${START_DATE} is null)
  and u.date_updated <= ${END_DATE}
  and r.format_id = ${FORMAT_ID}
  and r.record_id = rs.record_id
  and rs.set_id = ${SET_ID}
group by u.record_id
order by u.record_id 
limit 5

this query is very fast (even as you page through millions of records):

$ export REPO_NAME=marctoxctransformation; export RECORD_ID=null; export START_DATE=null; export END_DATE=\'2020-01-01\'; export FORMAT_ID=7; export SET_ID=10;
$ echo "the_above_query" | mysql -u root --password=password

this command, however, is not fast

devuser@xcmst > date; echo "select straight_join 
   count(distinct u.record_id)
from 
  ${REPO_NAME}.record_updates u force index (idx_${REPO_NAME}_record_updates_record_id),
  ${REPO_NAME}.records r IGNORE index (idx_${REPO_NAME}_records_format_id),
  ${REPO_NAME}.records_xml x,
  ${REPO_NAME}.record_sets rs ignore index (idx_${REPO_NAME}_record_sets_set_id)
where r.record_id = x.record_id 
  and (r.record_id > ${RECORD_ID} or ${RECORD_ID} is null)
  and r.record_id = u.record_id
  and (u.date_updated > ${START_DATE} or ${START_DATE} is null)
  and u.date_updated <= ${END_DATE}
  and r.format_id = ${FORMAT_ID}
  and r.record_id = rs.record_id
  and rs.set_id = ${SET_ID}
" | mysql -u root --password=pass; date
Fri Dec 17 13:46:21 EST 2010
count(distinct u.record_id)
11772117
Fri Dec 17 13:50:56 EST 2010

The purpose of this query is to fill in the completeListSize attribute below

$ curl -s 'http://128.151.244.135:8080/OAIToolkit/oai-request.do?verb=ListRecords&metadataPrefix=marcxml' | xmllint --format - | tail -n 3
    <resumptionToken cursor="0" completeListSize="5828154">137|5000</resumptionToken>
  </ListRecords>
</OAI-PMH>

currently, the MST-0.3.0-RC2 excludes this optional attribute:

$ curl -s "http://128.151.244.146:8080/MetadataServicesToolkit/st/marctoxctransformation/oaiRepository?verb=ListRecords&metadataPrefix=xc"  | xmllint --format - | tail -n 3
    <resumptionToken>10|1000</resumptionToken>
  </ListRecords>
</OAI-PMH>

There are a few different options solutions to deal w/ this:

continue to exclude it * this is just sad and no one wants this
run a background thread which will report it when it finally gets it * this would be find for automated processes. It's still somewhat sad, though.
the initial request will take a loooong time * IMO, this is not a real solution
speed up the query (taking a guess if necessary) * I started down this route (which is the rest of this wiki page), but I think it's not a full solution. It needs to be combined with #2 above and possibly the record_counts tables (to optimize requests for everything)

I am currently exploring option #4 and this is the algorithm I'm working on: Keep track of how many records you've already gotten during this harvest (in the resumptionToken) - will be referred to below as numAlreadyHarvested.

if all records match the query
then you can just do a "select count(1) from records;"*** how to determine this? if at least one record doesn't match the input format
then false This demonstrates a strategy that lets you ask the question w/out worrying about the query taking a long time. This strategy is assumed to be used for all similar questions below, but I won't repeat for each step (although it's assumed to be implemented):
- preliminary query to make sure it isn't going to be super slow.
  if rows is > 100
  then false``` explain select count(*) from ${REPO_NAME}.records where format_id <> ${FORMAT_ID}

        ***if the following query > 0 <br />then false**```
select count(*)
from ${REPO_NAME}.records
where format_id <> ${FORMAT_ID}

  ***if at least one record doesn't have the input set as its one and only set <br />then false*****if there are any sets that don't match the input set <br />then false**```

select count(*) from ${REPO_NAME}.record_sets where set_id <> ${SET_ID}

        ***if the # of record\_sets doesn't match the # of records <br />then false**```
select 1 from dual where (select count(*) from records)=(select count(*) from record_sets)

      ***note: this means you shouldn't really use more than one set*** check record\_updates table outside of date
    ***if any records fall outside the date\_range <br /> then false**```

select count(*) from ${REPO_NAME}.record_updates where (${START_DATE} is not null and date_updated < ${START_DATE}) or (${END_DATE} is not null and date_updated > ${END_DATE})

        ***else (all of the above were false), then truth**1.**if no records match the query <br />then return 0*** how to determine?
      * if any one of the following things are true (this is the opposite of the above because above ALL of those queries had to be false.  In this scenario, any one of these queries can be true.)
      ***if no record matches the input format*** This is demonstrates the above strategy again, but checking for the opposite:
        * preliminary query to make sure it isn't going to be super slow.**<br />**if rows is < 100**<br />then continue w/ next query**```
explain select count(*)
from ${REPO_NAME}.records
where format_id = ${FORMAT_ID}

    ***if the following query is 0 <br />then truth**```

select count(*) from ${REPO_NAME}.records where format_id = ${FORMAT_ID}

      ***if no records have the input set as its set <br />then truth**```
select count(*)
from ${REPO_NAME}.record_sets
where set_id = ${SET_ID}

  ***if no records fall w/in the date\_range <br /> then truth**```

select count(*) from ${REPO_NAME}.record_updates where (${START_DATE} is null or date_updated > ${START_DATE}) and (${END_DATE} is null or date_updated < ${END_DATE})

  1. else if the actual query will return less than X records (**using startingId in the query**)
    * how to determine this?
      * this query is less than configurable amount (default: 1 million)

explain select straight_join count(distinct u.record_id) from ${REPO_NAME}.record_updates u force index (idx_${REPO_NAME}record_updates_record_id), ${REPO_NAME}.records r IGNORE index (idx${REPO_NAME}records_format_id), ${REPO_NAME}.records_xml x, ${REPO_NAME}.record_sets rs ignore index (idx${REPO_NAME}_record_sets_set_id) where r.record_id = x.record_id and (r.record_id > ${RECORD_ID} or ${RECORD_ID} is null) and r.record_id = u.record_id and (u.date_updated > ${START_DATE} or ${START_DATE} is null) and u.date_updated <= ${END_DATE} and r.format_id = ${FORMAT_ID} and r.record_id = rs.record_id and rs.set_id = ${SET_ID}

  1. else take a guess
    * determine % of records that have the input set

select explain select count(1) from ${REPO_NAME}.record_sets where set_id=${SET_ID}

    * determine % of records that have the input format

explain select count(1) from ${REPO_NAME}.records where format_id=${FORMAT_ID}

    * determine % of records that fall w/in the date range

explain select count(*) from ${REPO_NAME}.record_updates where (${START_DATE} is null or date_updated > ${START_DATE}) and (${END_DATE} is null or date_updated < ${END_DATE})

    * determine avg number of updates per record**following query is untested**```
select ((select count(*) from record_updates))/((select count(*) from records))

* guess = (% records with the format) X (% of records with the set) X (% of records in date range) X (avg # of updates per record)
  * If the guess is < than 1 million + numAlreadyHarvested, then use 1 million + numAlreadyHarvested**

tests

params for all tests

all tests return 2 records per request.
all tests use 3 as the param by which to guess

data set #	format 1	format 2	set 1	set 2	dr 1	dr 2	updates per rec
1	10	0	10	0	10	0	1
2	10	0	0	0	10	0	1
3	10	0	5	5	10	0	1
4	10	0	0	0	5	5	1
5	5	5	0	0	10	0	1
6	5	5	5	5	5	5	3

req #	fmt	set	date range
1	fmt1	none	none
2	fmt1	none	dr1
3	fmt1	none	dr2
5	fmt1	set1	none
6	fmt1	set2	none
7	fmt2	none	none
8	fmt1	set1	dr1

expected results of tests

test #	data set #	req #	completeListSize
1	1	1	10-
2	1	2	10
3	1	3	0
4	1	5	10-
5	1	6	0
6	1	7	0
7	2	1	10-
8	2	5	0
9	3	1	10-
10	3	5	5?,5-
11	4	1	10
12	4	2	5?,5-
13	5	1	5?,5-
14	6	2	2.5?,5-
15	6	8	1.25?,5-

Downloads
Installing the Toolkit
- Hardware Requirements
- Installing 3rd Party Tools
- Installing the Metadata Services Toolkit
  - In Windows
  - In Unix
- Configuring
  - Configuring the MST
  - Configuring Tomcat
  - Configuring MySQL
    - MySQL Permissions
    - MySQL Configurations
  - Configuring server
- Starting the MST
  - In Windows
  - In Unix
- Uninstalling and Reinstalling the MST
- Upgrading the MST
- Useful Info
Using the Toolkit
Services
- What is a service?
- What are Configuration 1 and Configuration 2?
- XC MARCXML Normalization
- MARCXML to XC Transformation
- DC to XC Transformation
  - Mappings
  - Example Input and Output Records
- MARC Aggregation
- Multiple Instances of the Same Service
  - How to install multiple instances of the same service
- Harvesting from an MST Service
  - How to harvest from an MST Service
How To Implement a Service
- Quick and Dirty Tutorial
- Details on the process method
- Testing your service
- AdvancedFeatures
- Contribute to a core service
About the XC Schema
MST Frequently Asked Questions
Performance Results
- RecordBreakdown
- MySQLCustomizations
Release Notes
Next Coding Period Summary
Glossary
Developer ScratchPad
- ServerChart
- Transformation 1.0
  - TransformationDocumentationNotes
  - new
    - TransformationDocumentation
  - old
    - AdditionalWorksAndExpressions
    - Transformation Service Documentation
    - TransformationServiceSteps
    - XcRoleTranslationTable
- AggregationServices
  - MarcAggregation
    - MySQL Tuning for MAS
    - Scratch Pad
  - TransformationTwoPointOh
  - old
    - FirstIteration
    - PriorDesign
- PackagingMST
- 1.0 Decisions
  - RepositoryUpdatesDeletes
  - RecordCountProblems
  - UIChanges
  - ServiceUpdates
  - LogsUI
- ReleaseWork
- QuickInstallNotes
- MST Implementation Details
  - OaiIdIndexAlgorithm
  - CacheDetails
  - MessageHandling
  - ServiceTests
  - ProcessingStepsExplained
  - ResumptionToken->completeListSize
  - UpdateDelete
  - OaiPmhImpl
- record counts
  - RecordCountsOnePtTwoPtOne
  - in production
  - how to log and display
  - RecordCountsOnePtZero
  - RecordCountTestRestarted
  - UrRecordCounts
  - RecordCountTesting
  - TransformationWackiness
- OaiImplementation
- Testing
  - randys-30
  - RegressionTests
- QuickRef
- UnicodeNormalization
- LoggingHelp
- CodeFormatPolicy
- SvnBranchingStrategy
- MultipleEclipseWorkspaces
- DeleteReaddServiceForRetest
- FileHarvests
- CharsetEncodingWithEric
- DrupalSolrOptimization
  - WorkPlan
  - MetricsForAssessment
  - IdeasForImprovement
  - RandomNotes
Wiki en español
- Servicios
  - Qué es un Servicio de Metadatos?
  - Servicio de Normalización XC MARCXML
  - Servicio de Transformación MARCXML a Esquema XC
    - Introducción al Servicio de Transformación
    - Modificaciones en el Servicio de Transformación
  - Servicio de Agregación MARC

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ResumptionToken

completeListSize

tests

Clone this wiki locally