Improving CSW harvest #28

Closed
valentinedwv opened this issue May 19, 2017 · 6 comments

@valentinedwv
Contributor

USGS ScienceBase is a large collection.
I tried harvesting it twice. The harvest crashed at 32k and at 245k records, out of about 6,000k.
We need new techniques for harvesting large collections:

  • a way to pass in a custom filter parameter; ScienceBase has "collections", which could be used to harvest smaller sets
  • a harvest that is resumable/restartable at a given record count
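
To illustrate the second point, here is a minimal sketch of a harvest loop that can be restarted at a record count. It is not the harvester's implementation. `fetch_page`, `store`, and the checkpoint file format are hypothetical names introduced for this example; `fetch_page(start, size)` is assumed to wrap a CSW GetRecords call using `startPosition` and `maxRecords`.

```python
import json
import os
import tempfile
from typing import Callable, List


def load_checkpoint(path: str) -> int:
    """Return the next startPosition to request (1-based), or 1 if no checkpoint exists."""
    if not os.path.exists(path):
        return 1
    with open(path, "r", encoding="utf-8") as fh:
        return int(json.load(fh)["next_start"])


def save_checkpoint(path: str, next_start: int) -> None:
    """Write the checkpoint to a temp file, then rename it, so a crash cannot truncate it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump({"next_start": next_start}, fh)
    os.replace(tmp, path)


def harvest(
    fetch_page: Callable[[int, int], List[str]],  # hypothetical: wraps GetRecords
    store: Callable[[str], None],                 # hypothetical: writes one record
    checkpoint_path: str,
    page_size: int = 100,
) -> int:
    """Harvest from the last checkpoint onward; return the next unread startPosition."""
    start = load_checkpoint(checkpoint_path)
    while True:
        records = fetch_page(start, page_size)
        if not records:
            break  # assumption: an empty page means the end of the collection
        for record in records:
            store(record)
        start += len(records)
        # Persist progress only after the whole page has been stored.
        save_checkpoint(checkpoint_path, start)
    return start
```

If `fetch_page` raises partway through, the checkpoint still points at the first page that was not fully stored, so calling `harvest` again with the same checkpoint path continues from there instead of from record 1.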

https://my.usgs.gov/confluence/display/sciencebase/Catalog+Services

moving issue from catalog to here:
Esri/geoportal-server-catalog#67

@mhogeweg
Member

Does this also happen when harvesting into a local folder?

@valentinedwv
Contributor Author

valentinedwv commented May 19, 2017

I was harvesting to both a folder and a server.
With 6 million records, I was going to rewrite the folder output to break it into blocks of ~1k records (or make an S3 store endpoint).
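
A rough sketch of the ~1k-block idea, assuming each record has a running index and a filesystem-safe identifier. `shard_path` and the zero-padded block naming are hypothetical and are not part of the harvester's FOLDER output.

```python
from pathlib import PurePosixPath


def shard_path(root: str, index: int, record_id: str, block_size: int = 1000) -> PurePosixPath:
    """Map the index-th harvested record (0-based) to root/<block>/<record_id>.xml.

    Assumption: record_id contains no path separators.
    """
    if index < 0:
        raise ValueError("index must be non-negative")
    if block_size < 1:
        raise ValueError("block_size must be positive")
    block = index // block_size  # records 0..999 go to block 0, 1000..1999 to block 1, ...
    return PurePosixPath(root) / f"{block:06d}" / f"{record_id}.xml"
```

With 6 million records and a block size of 1000 this yields about 6000 subfolders of at most 1000 files each, rather than one folder holding every file.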

@valentinedwv
Contributor Author

I assume it's a problem with the connection to the CSW server.

19-May-2017 12:36:53.488 INFO [HARVESTING] com.esri.geoportal.harvester.support.ProgressLogger.printStatusLog Harvesting of PROCESS:: status: working, title: PROCESSOR: DEFAULT[], SOURCE: CSW[csw-host-url=https://www.sciencebase.gov/catalog/csw, cred-username=, cred-password=*****, csw-profile-id=urn:ogc:CSW:2.0.2:HTTP:APISO:SCIENCBASE], DESTINATIONS: [GPT[gpt-host-url=http://localhost:8080/geoportal, cred-username=gptadmin, cred-password=*****, gpt-cleanup=false], FOLDER[folder-root-folder=/opt/tomcat/webapps/metadata/, folder-cleanup=false]], INCREMENTAL: false, IGNOREROBOTSTXT: true progress: 141500
19-May-2017 12:38:28.398 SEVERE [HARVESTING] com.esri.geoportal.harvester.engine.defaults.DefaultProcessor$DefaultProcess.lambda$new$43 Error harvesting of PROCESSOR: DEFAULT[], SOURCE: CSW[csw-host-url=https://www.sciencebase.gov/catalog/csw, cred-username=, cred-password=*****, csw-profile-id=urn:ogc:CSW:2.0.2:HTTP:APISO:SCIENCBASE], DESTINATIONS: [GPT[gpt-host-url=http://localhost:8080/geoportal, cred-username=gptadmin, cred-password=*****, gpt-cleanup=false], FOLDER[folder-root-folder=/opt/tomcat/webapps/metadata/, folder-cleanup=false]], INCREMENTAL: false, IGNOREROBOTSTXT: true
 com.esri.geoportal.harvester.api.ex.DataInputException: Error reading data.
        at com.esri.geoportal.harvester.csw.CswBroker$CswIterator.next(CswBroker.java:179)
        at com.esri.geoportal.harvester.engine.defaults.DefaultProcessor$DefaultProcess.lambda$new$43(DefaultProcessor.java:136)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.http.client.HttpResponseException: Not Found
        at com.esri.geoportal.commons.csw.client.impl.Client.readMetadata(Client.java:155)
        at com.esri.geoportal.harvester.csw.CswBroker$CswIterator.next(CswBroker.java:174)
        ... 2 more

19-May-2017 12:38:28.398 SEVERE [HARVESTING] com.esri.geoportal.harvester.support.ErrorLogger.logError Error processing task: PROCESS:: status: working, title: PROCESSOR: DEFAULT[], SOURCE: CSW[csw-host-url=https://www.sciencebase.gov/catalog/csw, cred-username=, cred-password=*****, csw-profile-id=urn:ogc:CSW:2.0.2:HTTP:APISO:SCIENCBASE], DESTINATIONS: [GPT[gpt-host-url=http://localhost:8080/geoportal, cred-username=gptadmin, cred-password=*****, gpt-cleanup=false], FOLDER[folder-root-folder=/opt/tomcat/webapps/metadata/, folder-cleanup=false]], INCREMENTAL: false, IGNOREROBOTSTXT: true | Error reading data.
 com.esri.geoportal.harvester.api.ex.DataInputException: Error reading data.
        at com.esri.geoportal.harvester.csw.CswBroker$CswIterator.next(CswBroker.java:179)
        at com.esri.geoportal.harvester.engine.defaults.DefaultProcessor$DefaultProcess.lambda$new$43(DefaultProcessor.java:136)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.http.client.HttpResponseException: Not Found
        at com.esri.geoportal.commons.csw.client.impl.Client.readMetadata(Client.java:155)
        at com.esri.geoportal.harvester.csw.CswBroker$CswIterator.next(CswBroker.java:174)
        ... 2 more

19-May-2017 12:38:28.399 INFO [HARVESTING] com.esri.geoportal.harvester.support.ReportLogger.completed Completed processing task: PROCESS:: status: completed, title: PROCESSOR: DEFAULT[], SOURCE: CSW[csw-host-url=https://www.sciencebase.gov/catalog/csw, cred-username=, cred-password=*****, csw-profile-id=urn:ogc:CSW:2.0.2:HTTP:APISO:SCIENCBASE], DESTINATIONS: [GPT[gpt-host-url=http://localhost:8080/geoportal, cred-username=gptadmin, cred-password=*****, gpt-cleanup=false], FOLDER[folder-root-folder=/opt/tomcat/webapps/metadata/, folder-cleanup=false]], INCREMENTAL: false, IGNOREROBOTSTXT: true
19-May-2017 12:38:28.399 INFO [HARVESTING] com.esri.geoportal.harvester.support.ReportStatistics.completed Harvesting of PROCESS:: status: completed, title: PROCESSOR: DEFAULT[], SOURCE: CSW[csw-host-url=https://www.sciencebase.gov/catalog/csw, cred-username=, cred-password=*****, csw-profile-id=urn:ogc:CSW:2.0.2:HTTP:APISO:SCIENCBASE], DESTINATIONS: [GPT[gpt-host-url=http://localhost:8080/geoportal, cred-username=gptadmin, cred-password=*****, gpt-cleanup=false], FOLDER[folder-root-folder=/opt/tomcat/webapps/metadata/, folder-cleanup=false]], INCREMENTAL: false, IGNOREROBOTSTXT: true completed at Fri May 19 12:38:28 UTC 2017. No. succeded: 283135, no. failed: 2

@valentinedwv
Contributor Author

One of the failures is a server issue. The request dies at record 166666.

https://www.sciencebase.gov/catalog/csw

<csw:GetRecords
xmlns:csw="http://www.opengis.net/cat/csw/2.0.2"
maxRecords="1"
startPosition="166666"

outputFormat="application/xml"
outputSchema="http://www.isotc211.org/2005/gmd"
resultType="results" service="CSW" version="2.0.2">
    <csw:Query typeNames="csw:Record">
        <csw:ElementSetName>full</csw:ElementSetName>
        <csw:Constraint version="1.1.0">
            <ogc:Filter xmlns:ogc="http://www.opengis.net/ogc" xmlns="http://www.opengis.net/ogc"
            xmlns:gml="http://www.opengis.net/gml">
                <ogc:PropertyIsLike escape="" singleChar="_" wildCard="%">
                    <ogc:PropertyName>AnyText</ogc:PropertyName>
                    <ogc:Literal>well</ogc:Literal>
                </ogc:PropertyIsLike>
            </ogc:Filter>
        </csw:Constraint>
    </csw:Query>
</csw:GetRecords>
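
For probing other positions or search terms, the request above can be generated rather than edited by hand. The sketch below mirrors the posted request as-is; `build_get_records` is a hypothetical helper, not part of the harvester or its CSW client. It only builds the XML body and leaves sending it to whatever HTTP client is in use.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

CSW_NS = "http://www.opengis.net/cat/csw/2.0.2"
OGC_NS = "http://www.opengis.net/ogc"

# Same structure as the request posted above, with the varying parts as placeholders.
TEMPLATE = """<csw:GetRecords xmlns:csw="{csw}" xmlns:ogc="{ogc}"
    maxRecords="{max_records}" startPosition="{start}"
    outputFormat="application/xml"
    outputSchema="http://www.isotc211.org/2005/gmd"
    resultType="results" service="CSW" version="2.0.2">
  <csw:Query typeNames="csw:Record">
    <csw:ElementSetName>full</csw:ElementSetName>
    <csw:Constraint version="1.1.0">
      <ogc:Filter>
        <ogc:PropertyIsLike escape="" singleChar="_" wildCard="%">
          <ogc:PropertyName>AnyText</ogc:PropertyName>
          <ogc:Literal>{literal}</ogc:Literal>
        </ogc:PropertyIsLike>
      </ogc:Filter>
    </csw:Constraint>
  </csw:Query>
</csw:GetRecords>"""


def build_get_records(start: int, max_records: int = 1, any_text: str = "well") -> str:
    """Return a GetRecords body for the given startPosition and AnyText literal."""
    if start < 1 or max_records < 1:
        raise ValueError("start and max_records must be >= 1")
    return TEMPLATE.format(
        csw=CSW_NS,
        ogc=OGC_NS,
        max_records=int(max_records),
        start=int(start),
        literal=escape(any_text),  # escape &, <, > so the body stays well-formed
    )


# Parse the generated body to confirm it is well-formed XML.
root = ET.fromstring(build_get_records(166666))
print(root.get("startPosition"), root.get("maxRecords"))
```

Stepping `start` through the positions around 166666 with `max_records=1` is one way to check whether a single record triggers the server error.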

@pandzel-zz

Pull request #72 provides the ability to define an 'AnyText' literal for any CSW input broker.

@zguo
Collaborator

zguo commented Apr 5, 2019

The search text filter is now implemented in the harvester.

@zguo closed this as completed Apr 5, 2019