backup solr: should save all of /data/solr, not just the index #4
Conversation
The `catalog_search.ipynb` notebook (https://pavics.ouranos.ca/jupyter/user/public/lab/tree/tutorial-notebooks/catalog_search.ipynb) was failing with this error:

owslib.wps.WPSException : {'code': 'NoApplicableCode', 'locator': 'None', 'text': 'Process error: method=wps_pavicsearch.py._handler, line=254, msg=Traceback (most recent call last):\n File "/usr/local/lib/python2.7/dist-packages/pavics_datacatalog-0.6.11-py2.7.egg/pavics_datacatalog/wps_processes/wps_pavicsearch.py", line 251, in _handler\n output_format=output_format)\n File "/usr/local/lib/python2.7/dist-packages/pavics/catalog.py", line 973, in pavicsearch\n r.raise_for_status()\n File "/usr/lib/python2.7/dist-packages/requests/models.py", line 840, in raise_for_status\n raise HTTPError(http_error_msg, response=self)\nHTTPError: 400 Client Error: Bad Request for url: http://pavics.ouranos.ca:8983/solr/birdhouse/select?start=0&rows=10&q=*&fq=variable:%22tasmin%22&fq=project:%22CMIP5%22&fq=experiment:%22rcp85%22&fq=frequency:%22day%22&fl=*,score&fq=type:File&sort=id+asc&wt=json&indent=true\n'}

Interestingly, the canarie monitoring of the Catalog service was working fine. It turns out the file `/data/solr/birdhouse/conf/managed-schema` was important. Diff of that `managed-schema` file against a working one from CRIM:

```diff
$ diff /data/solr/solr/birdhouse/conf/managed-schema /tmp/good-file
1c1
< <?xml version="1.0" encoding="UTF-8"?>
---
> <?xml version="1.0" encoding="UTF-8"?>
48a49,51
> <field name="dataset_id" type="string" stored="true"/>
> <field name="datetime_max" type="date" stored="true"/>
> <field name="datetime_min" type="date" stored="true"/>
50a54
> <field name="fileserver_url" type="string" stored="true"/>
55a60
> <field name="latest" type="boolean" stored="true"/>
58a64
> <field name="replica" type="boolean" stored="true"/>
63a70
> <field name="type" type="string" stored="true"/>
```

The good file has a few more fields! Replaced the bad file with the good file and `catalog_search.ipynb` works again. Will launch the crawler again to really refresh the data, but at least now the Catalog service is working.
@davidcaron if you migrate more servers, use this updated backup script to avoid breaking the Catalog service again. |
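For context, the fix this issue tracks is to back up the whole `/data/solr` tree (configs such as `conf/managed-schema` included) instead of only the index. Below is a minimal sketch of that idea; the paths and archive destination are assumptions, and this is not the actual updated backup script referenced above:

```python
# Archive all of /data/solr (configs included), not just the index directory.
import datetime
import tarfile

SOLR_DATA = "/data/solr"                      # assumed host path of the Solr data volume
stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
archive = "/backup/solr-%s.tar.gz" % stamp    # assumed backup destination

with tarfile.open(archive, "w:gz") as tar:
    tar.add(SOLR_DATA, arcname="solr")
print("wrote", archive)
```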
Crawler re-launched:

$ curl --include "http://boreas.ouranos.ca:8086/pywps?service=WPS&request=execute&version=1.0.0&identifier=pavicrawler&storeExecuteResponse=true&status=true&DataInputs="
HTTP/1.1 200 OK
Date: Wed, 22 Jan 2020 21:09:48 GMT
Server: Apache/2.4.18 (Ubuntu)
Content-Length: 1010
Vary: Accept-Encoding
Content-Type: text/xml; charset=utf-8
<?xml version="1.0" encoding="UTF-8"?>
<wps:ExecuteResponse xmlns:wps="http://www.opengis.net/wps/1.0.0" xmlns:ows="http://www.opengis.net/ows/1.1" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.opengis.net/wps/1.0.0 ../wpsExecute_response.xsd" service="WPS" version="1.0.0" xml:lang="en-US" serviceInstance="http://localhost/wps?request=GetCapabilities&amp;service=WPS" statusLocation="https://pavics.ouranos.ca/wpsoutputs/catalog/85e8b1d8-3d5b-11ea-829a-0242ac120012.xml">
<wps:Process wps:processVersion="0.1">
<ows:Identifier>pavicrawler</ows:Identifier>
<ows:Title>PAVICS Crawler</ows:Title>
<ows:Abstract>Crawl thredds server and write metadata to SOLR database.</ows:Abstract>
</wps:Process>
<wps:Status creationTime="2020-01-22T21:09:48Z">
<wps:ProcessAccepted percentCompleted="0">PyWPS Process pavicrawler accepted</wps:ProcessAccepted>
</wps:Status>
</wps:ExecuteResponse>

Status location: https://pavics.ouranos.ca/wpsoutputs/catalog/85e8b1d8-3d5b-11ea-829a-0242ac120012.xml

$ curl --include https://pavics.ouranos.ca/wpsoutputs/catalog/85e8b1d8-3d5b-11ea-829a-0242ac120012.xml
HTTP/1.1 200 OK
Server: nginx/1.13.6
Date: Wed, 22 Jan 2020 21:12:39 GMT
Content-Type: text/xml
Content-Length: 994
Last-Modified: Wed, 22 Jan 2020 21:09:49 GMT
Connection: keep-alive
ETag: "5e28ba1d-3e2"
Accept-Ranges: bytes
<?xml version="1.0" encoding="UTF-8"?>
<wps:ExecuteResponse xmlns:wps="http://www.opengis.net/wps/1.0.0" xmlns:ows="http://www.opengis.net/ows/1.1" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.opengis.net/wps/1.0.0 ../wpsExecute_response.xsd" service="WPS" version="1.0.0" xml:lang="en-US" serviceInstance="http://localhost/wps?request=GetCapabilities&amp;service=WPS" statusLocation="https://pavics.ouranos.ca/wpsoutputs/catalog/85e8b1d8-3d5b-11ea-829a-0242ac120012.xml">
<wps:Process wps:processVersion="0.1">
<ows:Identifier>pavicrawler</ows:Identifier>
<ows:Title>PAVICS Crawler</ows:Title>
<ows:Abstract>Crawl thredds server and write metadata to SOLR database.</ows:Abstract>
</wps:Process>
<wps:Status creationTime="2020-01-22T21:09:49Z">
<wps:ProcessStarted percentCompleted="10">Calling pavicrawler</wps:ProcessStarted>
</wps:Status>
</wps:ExecuteResponse> |
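For reference, the manual polling of the status document done with curl below can also be scripted; a minimal sketch that re-fetches the statusLocation until the asynchronous WPS job reports success or failure (plain requests and string matching, no owslib required):

```python
# Poll the WPS statusLocation until the job finishes; the XML switches from
# ProcessAccepted/ProcessStarted to ProcessSucceeded or ProcessFailed.
import time
import requests

status_url = ("https://pavics.ouranos.ca/wpsoutputs/catalog/"
              "85e8b1d8-3d5b-11ea-829a-0242ac120012.xml")

while True:
    xml = requests.get(status_url).text
    if "ProcessSucceeded" in xml or "ProcessFailed" in xml:
        print(xml)
        break
    time.sleep(30)  # still accepted/started, check again later
```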
Oh crap, re-crawl failed! @davidcaron any quick hint?

$ curl --include https://pavics.ouranos.ca/wpsoutputs/catalog/85e8b1d8-3d5b-11ea-829a-0242ac120012.xml
HTTP/1.1 200 OK
Server: nginx/1.13.6
Date: Wed, 22 Jan 2020 21:23:57 GMT
Content-Type: text/xml
Content-Length: 2912
Last-Modified: Wed, 22 Jan 2020 21:17:19 GMT
Connection: keep-alive
ETag: "5e28bbdf-b60"
Accept-Ranges: bytes
<?xml version="1.0" encoding="UTF-8"?>
<wps:ExecuteResponse xmlns:wps="http://www.opengis.net/wps/1.0.0" xmlns:ows="http://www.opengis.net/ows/1.1" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.opengis.net/wps/1.0.0 ../wpsExecute_response.xsd" service="WPS" version="1.0.0" xml:lang="en-US" serviceInstance="http://localhost/wps?request=GetCapabilities&amp;service=WPS" statusLocation="https://pavics.ouranos.ca/wpsoutputs/catalog/85e8b1d8-3d5b-11ea-829a-0242ac120012.xml">
<wps:Process wps:processVersion="0.1">
<ows:Identifier>pavicrawler</ows:Identifier>
<ows:Title>PAVICS Crawler</ows:Title>
<ows:Abstract>Crawl thredds server and write metadata to SOLR database.</ows:Abstract>
</wps:Process>
<wps:Status creationTime="2020-01-22T21:17:19Z">
<wps:ProcessFailed>
<wps:ExceptionReport>
<ows:Exception exceptionCode="NoApplicableCode" locator="None">
<ows:ExceptionText>Process error: method=wps_pavicrawler.py._handler, line=146, msg=Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/pavics_datacatalog-0.6.11-py2.7.egg/pavics_datacatalog/wps_processes/wps_pavicrawler.py", line 144, in _handler
headers=headers, verify=self.verify)
File "/usr/local/lib/python2.7/dist-packages/pavics/catalog.py", line 476, in pavicrawler
headers=headers, verify=verify)
File "/usr/local/lib/python2.7/dist-packages/pavics/catalog.py", line 280, in thredds_crawler
verify=verify):
File "/usr/local/lib/python2.7/dist-packages/threddsclient/client.py", line 33, in crawl
for ds in crawl(ref.url, skip, depth - 1, **kwargs):
File "/usr/local/lib/python2.7/dist-packages/threddsclient/client.py", line 33, in crawl
for ds in crawl(ref.url, skip, depth - 1, **kwargs):
File "/usr/local/lib/python2.7/dist-packages/threddsclient/client.py", line 33, in crawl
for ds in crawl(ref.url, skip, depth - 1, **kwargs):
File "/usr/local/lib/python2.7/dist-packages/threddsclient/client.py", line 33, in crawl
for ds in crawl(ref.url, skip, depth - 1, **kwargs):
File "/usr/local/lib/python2.7/dist-packages/threddsclient/client.py", line 28, in crawl
cat = read_url(url, skip, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/threddsclient/client.py", line 52, in read_url
return read_xml(req.text, url)
File "/usr/local/lib/python2.7/dist-packages/threddsclient/client.py", line 73, in read_xml
raise ValueError("Does not appear to be a Thredds catalog")
ValueError: Does not appear to be a Thredds catalog
</ows:ExceptionText>
</ows:Exception>
</wps:ExceptionReport>
</wps:ProcessFailed>
</wps:Status> |
Not sure... Check the user has the permissions to access thredds, the information is in So the |
The crawler seems to crawl up to a certain depth... and at some point gets something that it expects to be a thredds document but is not... |
THREDDS is now configured to serve *.txt files as well. Could that be an issue? |
Not impossible... One way to be sure would be to build a custom image of the catalog that logs every request it makes when crawling. |
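A lighter alternative to rebuilding the image is to turn on wire-level logging for the HTTP stack the crawler already uses; a minimal sketch, assuming the crawl is reproduced from a python shell with the same threddsclient package (the catalog URL is illustrative):

```python
# Log every HTTP request the crawler makes: requests/urllib3 emit DEBUG records
# such as "Starting new HTTPS connection" and "GET /thredds/catalog/... 200".
import logging
logging.basicConfig(level=logging.DEBUG)

from threddsclient import crawl

for ds in crawl("https://pavics.ouranos.ca/thredds/catalog.xml", depth=1):
    print(ds.url)
```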
*.txt files on Thredds probably did not cause that. On my test server, I have this kind of dataset:

$ tree
.
├── testdata
│ ├── secure
│ │ ├── tasmax_Amon_MPI-ESM-MR_rcp45_r1i1p1_200601-200612.nc
│ │ ├── tasmax_Amon_MPI-ESM-MR_rcp45_r1i1p1_200701-200712.nc
│ │ ├── tasmax_Amon_MPI-ESM-MR_rcp45_r2i1p1_200601-200612.nc
│ │ └── TEST.txt
│ └── TEST.txt
├── TEST.txt
└── wps_outputs

Crawling worked fine and found the 3 .nc files. There are these errors in the Catalog service, but it looks like they are harmless:

$ docker exec catalog bash -c 'tail -f /var/log/apache2/*'
(...)
syntax error, unexpected WORD_STRING, expecting WORD_WORD
context: Error { code = 500; message = "java.io.EOFException: Reading /pavics-data/testdata/TEST.txt at 5 file length = 5"^;};
syntax error, unexpected WORD_STRING, expecting WORD_WORD
context: Error { code = 500; message = "java.io.EOFException: Reading /pavics-data/testdata/secure/TEST.txt at 5 file length = 5"^;}; Let's hope it's just a glitch, I'll retry the crawling again. |
New crawler status location: https://pavics.ouranos.ca/wpsoutputs/catalog/0b9c06e4-3d8a-11ea-b543-0242ac120012.xml

I enabled debug logging on the Catalog service this time. Hope to get more hints if it fails. |
Same error again :( Will continue investigation tomorrow. |
Absolutely nothing useful in the debug logs. I guess I will have to patch the docker image for more useful logs. |
@davidcaron So on my test server that only has the 3 .nc files above, the resulting |
I hacked up the Catalog container with this change bird-house/threddsclient@master...bird-house:debug-crawl-failure and managed to get this more useful error:

ValueError: u'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/catalog/birdhouse/wps_outputs/hummingbird/e5c0b950-3277-11ea-b357-0242ac120010/catalog.xml': Does not appear to be a Thredds catalog, xml=u'<?xml version="1.0" encoding="utf-8"?>\n<ExceptionReport version="1.0.0"\n xmlns="http://www.opengis.net/ows/1.1"\n xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"\n xsi:schemaLocation="http://www.opengis.net/ows/1.1 http://schemas.opengis.net/ows/1.1.0/owsExceptionReport.xsd">\n <Exception exceptionCode=" NoApplicableCode" locator="NotAcceptable">\n <ExceptionText>Request failed: HTTPConnectionPool(host='pavics.ouranos.ca', port=8083): Max retries exceeded with url: /twitcher/ows/proxy/thredds/catalog/birdhouse/wps_outputs/hummingbird/e5c0b950-3277-11ea-b357-0242ac120010/catalog.xml (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fad01ec5b90>: Failed to establish a new connection: [Errno -3] Try again'))</ExceptionText>\n </Exception>\n</ExceptionReport>'

Looks like the transmission is cut during the xml file body transfer. Also, it's weird we are parsing stuff under wps_outputs in the first place. All the urls are under Twitcher, which could possibly explain the transmission cut (does the amount of data transferred exceed Twitcher's capacity?). Will try to remove Twitcher and have the Catalog directly hit Thredds. |
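The kind of change that surfaces this context is small; below is a minimal sketch (an assumption, not the actual linked bird-house/threddsclient patch) of wrapping the threddsclient read so the parse error carries the offending URL and the body that came back:

```python
# Sketch only: threddsclient.client.read_xml raises
# ValueError("Does not appear to be a Thredds catalog"); this wrapper re-raises
# it with the catalog URL and the raw response body attached for debugging.
import requests
from threddsclient.client import read_xml

def read_url_with_context(url, **kwargs):
    resp = requests.get(url, **kwargs)
    try:
        return read_xml(resp.text, url)
    except ValueError as err:
        raise ValueError(u"{0!r}: {1}, xml={2!r}".format(url, err, resp.text))
```

With that context, the failure above immediately shows that the "catalog" being parsed is actually an OWS ExceptionReport returned in place of the THREDDS XML.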
@tlvu I remind you that wps_outputs is a shared docker volume between all wps providers and thredds. This way thredds can provide file/opendap/wms access facilities. So, being part of the birdhouse catalog, it is indeed parsed by the crawling process. |
I think Blaise had done something about this (splitting user files vs source files). |
But if I'm right, output files should be deleted after some time, so they should not be indexed... Malleefowl has a function to persist output files (https://github.com/Ouranosinc/malleefowl/blob/pavics-dev/malleefowl/processes/wps_persist.py), which was called in some workflows or accessed via a frontend option, and was required before indexing output files. |
So I think that the "full"crawling option has been shortsighted as it assumes a fresh volume of a new deployment. |
When I try to run the crawler from a simple python environment (doing only the following):

from threddsclient import crawl
url = 'https://pavics.ouranos.ca/thredds/catalog/birdhouse/wps_outputs/hummingbird/catalog.xml'
for n, ds in enumerate(crawl(url, depth=1)):
    print(n, ds, ds.url)

It takes 10 minutes to run for all the hummingbird wps_outputs. Also, notice I'm not passing through twitcher. (Edit: That might not be entirely true, because the urls returned by thredds are passing through twitcher.) The crawler sometimes finishes, sometimes stops at a different dataset every time with a connection error... |
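Since the failures are intermittent, one simple mitigation to test from that same python environment is to retry the whole traversal; a minimal sketch, assuming the connection errors are transient (`crawl` is the same threddsclient function used above):

```python
# Retry the whole crawl a few times; this does not resume mid-traversal,
# it simply re-runs it when a transient connection error kills it.
import time
from threddsclient import crawl

def crawl_with_retries(url, depth=1, attempts=3, pause=30):
    for attempt in range(1, attempts + 1):
        try:
            return [(str(ds), ds.url) for ds in crawl(url, depth=depth)]
        except Exception as err:  # connection errors surface as requests/urllib3 exceptions
            print("attempt %d failed: %s" % (attempt, err))
            time.sleep(pause)
    raise RuntimeError("crawl still failing after %d attempts" % attempts)

datasets = crawl_with_retries(
    'https://pavics.ouranos.ca/thredds/catalog/birdhouse/wps_outputs/hummingbird/catalog.xml')
print(len(datasets), "datasets found")
```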
I am stumped. I sort of (not 100% sure) removed Twitcher, then the front Nginx in front of Thredds (basically more or less undoing this PR https://github.com/Ouranosinc/PAVICS/pull/162, which might not be enough to completely remove Twitcher/Nginx in front), see diff update-catalog-config...debug-catalog-crawl-failure, and still have the same error "ValueError Does not appear to be a Thredds catalog, xml".

Note the hostname and port changes: it tries "https://pavics.ouranos.ca/thredds/catalog" but ends up with "Request failed: HTTPConnectionPool(host='boreas.ouranos.ca', port=8083): Max retries exceeded with url: /thredds/catalog".

This error seems to occur only on Thredds with a lot of data. On my test server with 3 .nc files and 3 .txt files the crawl works fine. Can CRIM try the crawl on your side, on a big and a small Thredds server?

The debugging changes above are done directly on our production Boreas since I am not able to reproduce the problem somewhere else. I still made sure Jenkins and the Canarie monitoring are still OK.

Full error:

ValueError: u'https://pavics.ouranos.ca/thredds/catalog/birdhouse/wps_outputs/hummingbird/c61c6948-3c2f-11ea-a46b-0242ac120014/catalog.xml': Does not appear to be a Thredds catalog, xml=u'<?xml version="1.0" encoding="utf-8"?>\n<ExceptionReport version="1.0.0"\n xmlns="http://www.opengis.net/ows/1.1"\n xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"\n xsi:schemaLocation="http://www.opengis.net/ows/1.1 http://schemas.opengis.net/ows/1.1.0/owsExceptionReport.xsd">\n <Exception exceptionCode="NoApplicableCode" locator="NotAcceptable">\n <ExceptionText>Request failed: HTTPConnectionPool(host='boreas.ouranos.ca', port=8083): Max retries exceeded with url: /thredds/catalog/birdhouse/wps_outputs/hummingbird/c61c6948-3c2f-11ea-a46b-0242ac120014/catalog.xml (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fad01d7b3d0>: Failed to establish a new connection: [Errno -3] Try again'))</ExceptionText>\n </Exception>\n</ExceptionReport>'

Full status location for reference:

curl --include https://pavics.ouranos.ca/wpsoutputs/catalog/65d8b050-3ebe-11ea-89dd-0242ac120012.xml
|
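One detail worth noting in both tracebacks: "[Errno -3] Try again" is `socket.EAI_AGAIN`, a temporary name-resolution failure, not a refused or cut TCP connection. A quick sanity check that could be run from inside the catalog container (hostnames taken from the errors above; the port is illustrative):

```python
# Check that the hostnames appearing in the crawl errors resolve from inside
# the container; EAI_AGAIN ("[Errno -3] Try again") would show up as gaierror here.
import socket

for host in ("pavics.ouranos.ca", "boreas.ouranos.ca"):
    try:
        infos = socket.getaddrinfo(host, 80)
        print(host, "->", sorted({info[4][0] for info in infos}))
    except socket.gaierror as err:
        print(host, "-> resolution failed:", err)
```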
If the problem is related to text files, one option would be for THREDDS to index only
|
catalog: use public hostname in config when using self-signed SSL behind real SSL from pagekite

Fix magpie connection error like:

```
<ows:ExceptionText>Process error: method=wps_pavicrawler.py._handler, line=146, msg=Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/pavics_datacatalog-0.6.11-py2.7.egg/pavics_datacatalog/wps_processes/wps_pavicrawler.py", line 125, in _handler
    verify=self.verify)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 523, in post
    return self.request('POST', url, data=data, json=json, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 480, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 588, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/adapters.py", line 447, in send
    raise SSLError(e, request=request)
SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:590)
</ows:ExceptionText>
```

The Thredds url needs to use the public hostname too so the path recorded in Solr is the good public one. The wms_alternate_server, not sure what it impacts, but it looks like it might be useful so change it too. Fix needed to investigate the crawling problem in #4 (comment)
Text files in the Thredds catalog are not the root cause. I just removed the Thredds config that exposes text files in the catalog and still get that same error "ValueError: u'https://pavics.ouranos.ca/thredds/catalog/birdhouse/wps_outputs/hummingbird/f5858240-4c08-11e9-a17f-0242ac12000d/catalog.xml': Does not appear to be a Thredds catalog".

$ git diff
diff --git a/birdhouse/config/thredds/catalog.xml.template b/birdhouse/config/thredds/catalog.xml.template
index 7d97b36..7ada4b5 100644
--- a/birdhouse/config/thredds/catalog.xml.template
+++ b/birdhouse/config/thredds/catalog.xml.template
@@ -22,9 +22,6 @@
<filter>
<include wildcard="*.nc" />
<include wildcard="*.ncml" />
- <include wildcard="*.txt" />
- <include wildcard="*.md" />
- <include wildcard="*.rst" />
</filter>
</datasetScan> |
I have the Catalog access Thredds directly via internal docker networking instead of using the external network (PAVICS_FQDN) b165d1a and the crawl has been running for 20 mins uninterrupted; the previous longest run was about 10 mins only. I think I am onto something here: maybe some strict firewall rules or network denial-of-service protection is interfering with the crawl, since the crawl makes a huge number of network connections. This would explain why none of the other servers is able to reproduce the problem, since the protection would be on the public Boreas only. |
Crawler finished without error (way too fast) and did not insert anything into Solr. But at least we got over the network problem crawling Thredds.

$ curl https://pavics.ouranos.ca/wpsoutputs/catalog/05f2d1a6-413a-11ea-9d58-0242ac120008.xml
<?xml version="1.0" encoding="UTF-8"?>
<wps:ExecuteResponse xmlns:wps="http://www.opengis.net/wps/1.0.0" xmlns:ows="http://www.opengis.net/ows/1.1" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.opengis.net/wps/1.0.0 ../wpsExecute_response.xsd" service="WPS" version="1.0.0" xml:lang="en-US" serviceInstance="http://localhost/wps?request=GetCapabilities&amp;service=WPS" statusLocation="https://pavics.ouranos.ca/wpsoutputs/catalog/05f2d1a6-413a-11ea-9d58-0242ac120008.xml">
<wps:Process wps:processVersion="0.1">
<ows:Identifier>pavicrawler</ows:Identifier>
<ows:Title>PAVICS Crawler</ows:Title>
<ows:Abstract>Crawl thredds server and write metadata to SOLR database.</ows:Abstract>
</wps:Process>
<wps:Status creationTime="2020-01-27T19:50:29Z">
<wps:ProcessSucceeded>PyWPS Process PAVICS Crawler finished</wps:ProcessSucceeded>
</wps:Status>
<wps:ProcessOutputs>
<wps:Output>
<ows:Identifier>crawler_result</ows:Identifier>
<ows:Title>PAVICS Crawler Result</ows:Title>
<ows:Abstract>Crawler result as a json.</ows:Abstract>
<wps:Reference href="https://pavics.ouranos.ca/wpsoutputs/catalog/05f2d1a6-413a-11ea-9d58-0242ac120008/solr_result_2020-01-27T19:50:28Z_.json" mimeType="application/json" encoding="" schema=""/>
</wps:Output>
</wps:ProcessOutputs>
</wps:ExecuteResponse>
$ curl https://pavics.ouranos.ca/wpsoutputs/catalog/05f2d1a6-413a-11ea-9d58-0242ac120008/solr_result_2020-01-27T19:50:28Z_.json
{"responseHeader": {"status": 0, "QTime": 0, "Nquery": 0}} |
A crawl has been running for 4 hours; this looks promising. Note this is when bypassing all external networks and using only the internal docker network between the Catalog and Thredds 83c8391...641c648

curl https://pavics.ouranos.ca/wpsoutputs/catalog/d35b666e-42be-11ea-827b-0242ac120016.xml

<?xml version="1.0" encoding="UTF-8"?>
<wps:ExecuteResponse xmlns:wps="http://www.opengis.net/wps/1.0.0" xmlns:ows="http://www.opengis.net/ows/1.1" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.opengis.net/wps/1.0.0 ../wpsExecute_response.xsd" service="WPS" version="1.0.0" xml:lang="en-US" serviceInstance="http://localhost/wps?request=GetCapabilities&amp;service=WPS" statusLocation="https://pavics.ouranos.ca/wpsoutputs/catalog/d35b666e-42be-11ea-827b-0242ac120016.xml">
<wps:Process wps:processVersion="0.1">
<ows:Identifier>pavicrawler</ows:Identifier>
<ows:Title>PAVICS Crawler</ows:Title>
<ows:Abstract>Crawl thredds server and write metadata to SOLR database.</ows:Abstract>
</wps:Process>
<wps:Status creationTime="2020-01-29T17:43:15Z">
<wps:ProcessStarted percentCompleted="10">Calling pavicrawler</wps:ProcessStarted>
</wps:Status>
</wps:ExecuteResponse> |
Crawl failed again, this time a connection problem to Solr.

curl https://pavics.ouranos.ca/wpsoutputs/catalog/d35b666e-42be-11ea-827b-0242ac120016.xml
|
Tagged |
Just to close on this crawling issue, @moulab88 and I finally found 2 root causes.

1 - Catalog was choking when crawling Thredds because there was a gigantic 244G folder under wps_outputs that probably timed out the connection between the Catalog and Thredds while Thredds was generating the catalog.xml of that folder. We removed that folder.

2 - Catalog was unable to connect to Solr due to an out-of-date DNS config on the Boreas host. New config was deployed.

So the full crawl finally worked and took 2 days to complete. Mourad started the crawl Friday morning and it finished Sunday at 10:59 AM.

curl --include "https://pavics.ouranos.ca/wpsoutputs/catalog/ea24a6fe-4f6f-11ea-a3e1-0242ac120015.xml"
curl --include https://pavics.ouranos.ca/wpsoutputs/catalog/ea24a6fe-4f6f-11ea-a3e1-0242ac120015/solr_result_2020-02-16T14:59:08Z_.json
|
And also 12702 files/sub-directories under this directory. |
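For future reference, oversized wps_outputs folders like that one can be spotted before crawling; a minimal sketch, assuming the shared volume is mounted at `/data/wps_outputs` (adjust the path to the actual deployment):

```python
# Report total size and file count of each first-level folder under wps_outputs,
# to catch another 244G / 12702-file monster before the crawler chokes on it.
import os

root = "/data/wps_outputs"  # assumed mount point of the shared volume
for entry in sorted(os.listdir(root)):
    path = os.path.join(root, entry)
    if not os.path.isdir(path):
        continue
    total = count = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            fp = os.path.join(dirpath, name)
            if os.path.isfile(fp):
                total += os.path.getsize(fp)
                count += 1
    print("%8.1f GB  %6d files  %s" % (total / 1e9, count, entry))
```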
…ins-failure-after-new-crawl catalog_search.ipynb: fix jenkins failure after new crawl Make the query much more precise by adding "institute:CCCma,model:CanESM2". Previous query was returning 200+ results after new crawl triggered in bird-house/birdhouse-deploy#4 (comment). Now we seems to have duplicate result, "cccma" and "CCCMA". @huard did we rename "CCCMA" to "cccma" on Thredds? New working Jenkins run: http://jenkins.ouranos.ca/job/PAVICS-e2e-workflow-tests/job/master/480/console Jenkins error fixed: ``` 00:35:48 _____ pavics-sdi-master/docs/source/notebooks/catalog_search.ipynb::Cell 1 _____ 00:35:48 Notebook cell execution failed 00:35:48 Cell 1: Cell outputs differ 00:35:48 00:35:48 Input: 00:35:48 resp = wps.pavicsearch(constraints="variable:tasmin,project:CMIP5,experiment:rcp85,frequency:day", limit=10, type="File") 00:35:48 [result, files] = resp.get(asobj=True) 00:35:48 files 00:35:48 00:35:48 Traceback: 00:35:48 mismatch 'text/plain' 00:35:48 00:35:48 assert reference_output == test_output failed: 00:35:48 00:35:48 "['https://pa...21001231.nc']" == "['https://pa...20101130.nc']" 00:35:48 Skipping 61 identical leading characters in diff, use -v to show 00:35:48 - birdhouse/CCCMA/CanESM2/rcp85/day/atmos/r5i1p1/tasmin/tasmin_day_CanESM2_rcp85_r5i1p1_20060101-21001231.nc', 00:35:48 ? ^^^ ^^^^^^^^^ ^ ^^^^ ^ ^ ^ 00:35:48 + birdhouse/cmip5/MRI/rcp85/day/atmos/r1i1p1/tasmin/tasmin_day_MRI-CGCM3_rcp85_r1i1p1_20960101-21001231.nc', 00:35:48 ? ^^^^^^ ^^ ^ ++++ ^^ ^ ^ ^ 00:35:48 - 'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/birdhouse/CCCMA/CanESM2/rcp85/day/atmos/r2i1p1/tasmin/tasmin_day_CanESM2_rcp85_r2i1p1_20060101-21001231.nc', 00:35:48 ? ^^^ ^ ^^^ ^ ^ ^^^ ^ ^ -- ^^ 00:35:48 + 'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/birdhouse/cmip5/MPI-M/MPI-ESM-LR/rcp85/day/atmos/r1i1p1/tasmin/tasmin_day_MPI-ESM-LR_rcp85_r1i1p1_21300101-21391231.nc', 00:35:48 ? ^^^^^^ ^^^^ ^^^^ ^^^ ^ ^^^^ ^^^ ^ ++ ^^ 00:35:48 + 'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/birdhouse/cmip5/MPI-M/MPI-ESM-LR/rcp85/day/atmos/r1i1p1/tasmin/tasmin_day_MPI-ESM-LR_rcp85_r1i1p1_21810101-21891231.nc', 00:35:48 + 'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/birdhouse/cmip5/MOHC/HadGEM2-ES/rcp85/day/atmos/r4i1p1/tasmin/tasmin_day_HadGEM2-ES_rcp85_r4i1p1_20951201-21001130.nc', 00:35:48 + 'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/birdhouse/cmip5/MOHC/HadGEM2-CC/rcp85/day/atmos/r1i1p1/tasmin/tasmin_day_HadGEM2-CC_rcp85_r1i1p1_20451201-20501130.nc', 00:35:48 + 'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/birdhouse/cmip5/MOHC/HadGEM2-ES/rcp85/day/atmos/r1i1p1/tasmin/tasmin_day_HadGEM2-ES_rcp85_r1i1p1_20401201-20451130.nc', 00:35:48 - 'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/birdhouse/CCCMA/CanESM2/rcp85/day/atmos/r3i1p1/tasmin/tasmin_day_CanESM2_rcp85_r3i1p1_20060101-21001231.nc', 00:35:48 ? ^^^ ^ ^^^ ^ ^^^ ^ - - ^ 00:35:48 + 'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/birdhouse/cmip5/MPI-M/MPI-ESM-LR/rcp85/day/atmos/r3i1p1/tasmin/tasmin_day_MPI-ESM-LR_rcp85_r3i1p1_20400101-20491231.nc', 00:35:48 ? ^^^^^^ ^^^^ ^^^^ ^^^ ^^^^ ^^^ + ^^ 00:35:48 - 'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/birdhouse/CCCMA/CanESM2/rcp85/day/atmos/r4i1p1/tasmin/tasmin_day_CanESM2_rcp85_r4i1p1_20060101-21001231.nc', 00:35:48 - 'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/birdhouse/CCCMA/CanESM2/rcp85/day/atmos/r1i1p1/tasmin/tasmin_day_CanESM2_rcp85_r1i1p1_20060101-21001231.nc'] 00:35:48 ? 
^^ ^^^^^ -- ^ ^ -- --- - ^ ^ ^ 00:35:48 + 'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/birdhouse/cmip5/MOHC/HadGEM2-ES/rcp85/day/atmos/r1i1p1/tasmin/tasmin_day_HadGEM2-ES_rcp85_r1i1p1_20151201-20201130.nc', 00:35:48 ? +++++++++ ^^^^^^ ^^ ^ ^^^^^^ +++ + ^ ^ ^^ 00:35:48 + 'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/birdhouse/cmip5/MOHC/HadGEM2-ES/rcp85/day/atmos/r1i1p1/tasmin/tasmin_day_HadGEM2-ES_rcp85_r1i1p1_22191201-22291130.nc', 00:35:48 + 'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/birdhouse/cmip5/MOHC/HadGEM2-CC/rcp85/day/atmos/r3i1p1/tasmin/tasmin_day_HadGEM2-CC_rcp85_r3i1p1_20051201-20101130.nc'] 00:35:48 00:35:48 00:35:48 _____ pavics-sdi-master/docs/source/notebooks/catalog_search.ipynb::Cell 2 _____ 00:35:48 Notebook cell execution failed 00:35:48 Cell 2: Cell outputs differ 00:35:48 00:35:48 Input: 00:35:48 result['response']['docs'][0] 00:35:48 00:35:48 Traceback: 00:35:48 mismatch 'text/plain' 00:35:48 00:35:48 assert reference_output == test_output failed: 00:35:48 00:35:48 "{'cf_standar...21001231.nc'}" == "{'cf_standar...21001231.nc'}" 00:35:48 Skipping 56 identical leading characters in diff, use -v to show 00:35:48 - birdhouse/CCCMA/CanESM2/rcp85/day/atmos/r5i1p1/tasmin/tasmin_day_CanESM2_rcp85_r5i1p1_20060101-21001231.nc', 00:35:48 ? ^^^ ^^^^^^^^^ ^ ^^^^ ^ ^ ^ 00:35:48 + birdhouse/cmip5/MRI/rcp85/day/atmos/r1i1p1/tasmin/tasmin_day_MRI-CGCM3_rcp85_r1i1p1_20960101-21001231.nc', 00:35:48 ? ^^^^^^ ^^ ^ ++++ ^^ ^ ^ ^ 00:35:48 'replica': False, 00:35:48 - 'wms_url': 'https://pavics.ouranos.ca/twitcher/ows/proxy/ncWMS2/wms?SERVICE=WMS&REQUEST=GetCapabilities&VERSION=1.3.0&DATASET=outputs/CCCMA/CanESM2/rcp85/day/atmos/r5i1p1/tasmin/tasmin_day_CanESM2_rcp85_r5i1p1_20060101-21001231.nc', 00:35:48 ? ^^^^^^^^^^^^^ ^ ^^^^^^^ ^^^^^^^^^ 00:35:48 + 'wms_url': 'https://pavics.ouranos.ca/twitcher/ows/proxy/ncWMS2/wms?SERVICE=WMS&REQUEST=GetCapabilities&VERSION=1.3.0&DATASET=outputs/cmip5/MRI/rcp85/day/atmos/r1i1p1/tasmin/tasmin_day_MRI-CGCM3_rcp85_r1i1p1_20960101-21001231.nc', 00:35:48 ? ^^^^^^^^^ ^ ^^^^^^^^^ ^^^^^^^^^ 00:35:48 'keywords': ['air_temperature', 00:35:48 'day', 00:35:48 'application/netcdf', 00:35:48 'tasmin', 00:35:48 'thredds', 00:35:48 'CMIP5', 00:35:48 'rcp85', 00:35:48 - 'CanESM2', 00:35:48 - 'CCCma'], 00:35:48 + 'MRI-CGCM3', 00:35:48 + 'MRI'], 00:35:48 - 'dataset_id': 'CCCMA.CanESM2.rcp85.day.atmos.r5i1p1.tasmin', 00:35:48 ? ^^^ ^^^^^^^^^ ^ 00:35:48 + 'dataset_id': 'cmip5.MRI.rcp85.day.atmos.r1i1p1.tasmin', 00:35:48 ? ^^^^^^ ^^ ^ 00:35:48 'datetime_max': 'DATE_TIME_TZ', 00:35:48 - 'id': '29186a2db2230376', 00:35:48 + 'id': '0035405c47cd3a2f', 00:35:48 'subject': 'Birdhouse Thredds Catalog', 00:35:48 'category': 'thredds', 00:35:48 - 'opendap_url': 'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/birdhouse/CCCMA/CanESM2/rcp85/day/atmos/r5i1p1/tasmin/tasmin_day_CanESM2_rcp85_r5i1p1_20060101-21001231.nc', 00:35:48 ? ^^^ ^^^^^^^^^ ^ ^^^^ ^ ^ ^ 00:35:48 + 'opendap_url': 'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/birdhouse/cmip5/MRI/rcp85/day/atmos/r1i1p1/tasmin/tasmin_day_MRI-CGCM3_rcp85_r1i1p1_20960101-21001231.nc', 00:35:48 ? ^^^^^^ ^^ ^ ++++ ^^ ^ ^ ^ 00:35:48 - 'title': 'tasmin_day_CanESM2_rcp85_r5i1p1_20060101-21001231.nc', 00:35:48 ? ^^^^ ^ ^ ^ 00:35:48 + 'title': 'tasmin_day_MRI-CGCM3_rcp85_r1i1p1_20960101-21001231.nc', 00:35:48 ? 
++++ ^^ ^ ^ ^ 00:35:48 'variable_palette': ['default'], 00:35:48 'variable_min': [0], 00:35:48 'variable_long_name': ['Daily Minimum Near-Surface Air Temperature'], 00:35:48 - 'source': 'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/catalog.xml', 00:35:48 + 'source': 'https://pavics.ouranos.ca//twitcher/ows/proxy/thredds/catalog.xml', 00:35:48 ? + 00:35:48 'datetime_min': 'DATE_TIME_TZ', 00:35:48 'score': 1.0, 00:35:48 'variable_max': [1], 00:35:48 'units': ['K'], 00:35:48 - 'resourcename': 'birdhouse/CCCMA/CanESM2/rcp85/day/atmos/r5i1p1/tasmin/tasmin_day_CanESM2_rcp85_r5i1p1_20060101-21001231.nc', 00:35:48 ? ^^^ ^^^^^^^^^ ^ ^^^^ ^ ^ ^ 00:35:48 + 'resourcename': 'birdhouse/cmip5/MRI/rcp85/day/atmos/r1i1p1/tasmin/tasmin_day_MRI-CGCM3_rcp85_r1i1p1_20960101-21001231.nc', 00:35:48 ? ^^^^^^ ^^ ^ ++++ ^^ ^ ^ ^ 00:35:48 'type': 'File', 00:35:48 - 'catalog_url': 'https://pavics.ouranos.ca/thredds/catalog/birdhouse/CCCMA/CanESM2/rcp85/day/atmos/r5i1p1/tasmin/catalog.xml?dataset=birdhouse/CCCMA/CanESM2/rcp85/day/atmos/r5i1p1/tasmin/tasmin_day_CanESM2_rcp85_r5i1p1_20060101-21001231.nc', 00:35:48 + 'catalog_url': 'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/catalog/birdhouse/cmip5/MRI/rcp85/day/atmos/r1i1p1/tasmin/catalog.xml?dataset=birdhouse/cmip5/MRI/rcp85/day/atmos/r1i1p1/tasmin/tasmin_day_MRI-CGCM3_rcp85_r1i1p1_20960101-21001231.nc', 00:35:48 'experiment': 'rcp85', 00:35:48 'last_modified': 'DATE_TIME_TZ', 00:35:48 'content_type': 'application/netcdf', 00:35:48 - '_version_': 1599589044577107972, 00:35:48 + '_version_': 1658705770170023939, 00:35:48 'variable': ['tasmin'], 00:35:48 - 'url': 'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/fileServer/birdhouse/CCCMA/CanESM2/rcp85/day/atmos/r5i1p1/tasmin/tasmin_day_CanESM2_rcp85_r5i1p1_20060101-21001231.nc', 00:35:48 ? ^^^ ^^^^^^^^^ ^ ^^^^ ^ ^ ^ 00:35:48 + 'url': 'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/fileServer/birdhouse/cmip5/MRI/rcp85/day/atmos/r1i1p1/tasmin/tasmin_day_MRI-CGCM3_rcp85_r1i1p1_20960101-21001231.nc', 00:35:48 ? ^^^^^^ ^^ ^ ++++ ^^ ^ ^ ^ 00:35:48 'project': 'CMIP5', 00:35:48 - 'institute': 'CCCma', 00:35:48 ? ^^^^^ 00:35:48 + 'institute': 'MRI', 00:35:48 ? ^^^ 00:35:48 'frequency': 'day', 00:35:48 - 'model': 'CanESM2', 00:35:48 + 'model': 'MRI-CGCM3', 00:35:48 'latest': True, 00:35:48 - 'fileserver_url': 'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/fileServer/birdhouse/CCCMA/CanESM2/rcp85/day/atmos/r5i1p1/tasmin/tasmin_day_CanESM2_rcp85_r5i1p1_20060101-21001231.nc'} 00:35:48 ? ^^^ ^^^^^^^^^ ^ ^^^^ ^ ^ ^ 00:35:48 + 'fileserver_url': 'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/fileServer/birdhouse/cmip5/MRI/rcp85/day/atmos/r1i1p1/tasmin/tasmin_day_MRI-CGCM3_rcp85_r1i1p1_20960101-21001231.nc'} 00:35:48 ? ^^^^^^ ^^ ^ ++++ ^^ ^ ^ ^ 00:35:48 ```