backup solr: should save all of /data/solr, not just the index #4

Merged
merged 1 commit into master from fix-solr-backup-script-missing-data on Jan 22, 2020

Conversation

@tlvu (Collaborator) commented Jan 22, 2020


The `catalog_search.ipynb`
(https://pavics.ouranos.ca/jupyter/user/public/lab/tree/tutorial-notebooks/catalog_search.ipynb)
notebook was failing with this error:

```
owslib.wps.WPSException : {'code': 'NoApplicableCode', 'locator': 'None', 'text': 'Process error: method=wps_pavicsearch.py._handler, line=254, msg=Traceback (most recent call last):\n  File "/usr/local/lib/python2.7/dist-packages/pavics_datacatalog-0.6.11-py2.7.egg/pavics_datacatalog/wps_processes/wps_pavicsearch.py", line 251, in _handler\n    output_format=output_format)\n  File "/usr/local/lib/python2.7/dist-packages/pavics/catalog.py", line 973, in pavicsearch\n    r.raise_for_status()\n  File "/usr/lib/python2.7/dist-packages/requests/models.py", line 840, in raise_for_status\n    raise HTTPError(http_error_msg, response=self)\nHTTPError: 400 Client Error: Bad Request for url: http://pavics.ouranos.ca:8983/solr/birdhouse/select?start=0&rows=10&q=*&fq=variable:%22tasmin%22&fq=project:%22CMIP5%22&fq=experiment:%22rcp85%22&fq=frequency:%22day%22&fl=*,score&fq=type:File&sort=id+asc&wt=json&indent=true\n'}
```

Interestingly, the Canarie monitoring of the Catalog service was still working fine.

It turns out the file `/data/solr/birdhouse/conf/managed-schema` was the important piece.

Diff of that `managed-schema` file against a working one from CRIM:

```diff
$ diff /data/solr/solr/birdhouse/conf/managed-schema /tmp/good-file
1c1
< <?xml version="1.0" encoding="UTF-8"?>
---
> <?xml version="1.0" encoding="UTF-8"?>
48a49,51
>   <field name="dataset_id" type="string" stored="true"/>
>   <field name="datetime_max" type="date" stored="true"/>
>   <field name="datetime_min" type="date" stored="true"/>
50a54
>   <field name="fileserver_url" type="string" stored="true"/>
55a60
>   <field name="latest" type="boolean" stored="true"/>
58a64
>   <field name="replica" type="boolean" stored="true"/>
63a70
>   <field name="type" type="string" stored="true"/>
```

The good file has a few more fields!

Replaced the bad file with the good file and the `catalog_search.ipynb`
notebook works again.

Will launch the crawler again to fully refresh the data, but at least the
Catalog service is working again.
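
For reference, a minimal sketch of the kind of backup the title asks for: archive the whole `/data/solr` tree (so `conf/managed-schema` and friends come along), not just the index files. The container name, backup destination, and the stop/start around the copy are assumptions for illustration, not the actual backup script in this repo.

```bash
#!/bin/sh
# Sketch only: snapshot all of /data/solr, not just the index.
# "solr" container name and /backup destination are assumptions.
set -e
docker stop solr                                     # quiesce Solr so the copy is consistent
tar czf "/backup/solr-$(date +%Y%m%d).tar.gz" -C /data solr
docker start solr
```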
@tlvu (Collaborator, Author) commented Jan 22, 2020

@davidcaron if you migrate more servers, use this updated backup script to avoid breaking the Catalog service again.

@tlvu (Collaborator, Author) commented Jan 22, 2020

Crawler re-launched:

$ curl --include "http://boreas.ouranos.ca:8086/pywps?service=WPS&request=execute&version=1.0.0&identifier=pavicrawler&storeExecuteResponse=true&status=true&DataInputs="
HTTP/1.1 200 OK
Date: Wed, 22 Jan 2020 21:09:48 GMT
Server: Apache/2.4.18 (Ubuntu)
Content-Length: 1010
Vary: Accept-Encoding
Content-Type: text/xml; charset=utf-8

<?xml version="1.0" encoding="UTF-8"?>
<wps:ExecuteResponse xmlns:wps="http://www.opengis.net/wps/1.0.0" xmlns:ows="http://www.opengis.net/ows/1.1" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.opengis.net/wps/1.0.0 ../wpsExecute_response.xsd" service="WPS" version="1.0.0" xml:lang="en-US" serviceInstance="http://localhost/wps?request=GetCapabilities&amp;amp;service=WPS" statusLocation="https://pavics.ouranos.ca/wpsoutputs/catalog/85e8b1d8-3d5b-11ea-829a-0242ac120012.xml">
    <wps:Process wps:processVersion="0.1">
        <ows:Identifier>pavicrawler</ows:Identifier>
        <ows:Title>PAVICS Crawler</ows:Title>
        <ows:Abstract>Crawl thredds server and write metadata to SOLR database.</ows:Abstract>
        </wps:Process>
    <wps:Status creationTime="2020-01-22T21:09:48Z">
        <wps:ProcessAccepted percentCompleted="0">PyWPS Process pavicrawler accepted</wps:ProcessAccepted>
        </wps:Status>
</wps:ExecuteResponse>
```

Status location: https://pavics.ouranos.ca/wpsoutputs/catalog/85e8b1d8-3d5b-11ea-829a-0242ac120012.xml

```
$ curl --include https://pavics.ouranos.ca/wpsoutputs/catalog/85e8b1d8-3d5b-11ea-829a-0242ac120012.xml
HTTP/1.1 200 OK
Server: nginx/1.13.6
Date: Wed, 22 Jan 2020 21:12:39 GMT
Content-Type: text/xml
Content-Length: 994
Last-Modified: Wed, 22 Jan 2020 21:09:49 GMT
Connection: keep-alive
ETag: "5e28ba1d-3e2"
Accept-Ranges: bytes

<?xml version="1.0" encoding="UTF-8"?>
<wps:ExecuteResponse xmlns:wps="http://www.opengis.net/wps/1.0.0" xmlns:ows="http://www.opengis.net/ows/1.1" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.opengis.net/wps/1.0.0 ../wpsExecute_response.xsd" service="WPS" version="1.0.0" xml:lang="en-US" serviceInstance="http://localhost/wps?request=GetCapabilities&amp;amp;service=WPS" statusLocation="https://pavics.ouranos.ca/wpsoutputs/catalog/85e8b1d8-3d5b-11ea-829a-0242ac120012.xml">
    <wps:Process wps:processVersion="0.1">
        <ows:Identifier>pavicrawler</ows:Identifier>
        <ows:Title>PAVICS Crawler</ows:Title>
        <ows:Abstract>Crawl thredds server and write metadata to SOLR database.</ows:Abstract>
        </wps:Process>
    <wps:Status creationTime="2020-01-22T21:09:49Z">
        <wps:ProcessStarted percentCompleted="10">Calling pavicrawler</wps:ProcessStarted>
        </wps:Status>
</wps:ExecuteResponse>
```
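
The same launch can also be scripted from Python with OWSLib instead of raw curl; a rough sketch using the WPS endpoint from the curl above (not the deployed tooling, error handling omitted):

```python
# Rough OWSLib equivalent of the curl call above (sketch only).
from owslib.wps import WebProcessingService

wps = WebProcessingService('http://boreas.ouranos.ca:8086/pywps')
execution = wps.execute('pavicrawler', inputs=[])  # no DataInputs passed, mirroring the curl
print(execution.statusLocation)  # poll this XML document to follow the crawl
```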

@tlvu (Collaborator, Author) commented Jan 22, 2020

Oh crap, re-crawl failed! @davidcaron any quick hint?

```
$ curl --include https://pavics.ouranos.ca/wpsoutputs/catalog/85e8b1d8-3d5b-11ea-829a-0242ac120012.xml
HTTP/1.1 200 OK
Server: nginx/1.13.6
Date: Wed, 22 Jan 2020 21:23:57 GMT
Content-Type: text/xml
Content-Length: 2912
Last-Modified: Wed, 22 Jan 2020 21:17:19 GMT
Connection: keep-alive
ETag: "5e28bbdf-b60"
Accept-Ranges: bytes

<?xml version="1.0" encoding="UTF-8"?>
<wps:ExecuteResponse xmlns:wps="http://www.opengis.net/wps/1.0.0" xmlns:ows="http://www.opengis.net/ows/1.1" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.opengis.net/wps/1.0.0 ../wpsExecute_response.xsd" service="WPS" version="1.0.0" xml:lang="en-US" serviceInstance="http://localhost/wps?request=GetCapabilities&amp;amp;service=WPS" statusLocation="https://pavics.ouranos.ca/wpsoutputs/catalog/85e8b1d8-3d5b-11ea-829a-0242ac120012.xml">
    <wps:Process wps:processVersion="0.1">
        <ows:Identifier>pavicrawler</ows:Identifier>
        <ows:Title>PAVICS Crawler</ows:Title>
        <ows:Abstract>Crawl thredds server and write metadata to SOLR database.</ows:Abstract>
        </wps:Process>
    <wps:Status creationTime="2020-01-22T21:17:19Z">
        <wps:ProcessFailed>
            <wps:ExceptionReport>
                    <ows:Exception exceptionCode="NoApplicableCode" locator="None">
                            <ows:ExceptionText>Process error: method=wps_pavicrawler.py._handler, line=146, msg=Traceback (most recent call last):
  File &#34;/usr/local/lib/python2.7/dist-packages/pavics_datacatalog-0.6.11-py2.7.egg/pavics_datacatalog/wps_processes/wps_pavicrawler.py&#34;, line 144, in _handler
    headers=headers, verify=self.verify)
  File &#34;/usr/local/lib/python2.7/dist-packages/pavics/catalog.py&#34;, line 476, in pavicrawler
    headers=headers, verify=verify)
  File &#34;/usr/local/lib/python2.7/dist-packages/pavics/catalog.py&#34;, line 280, in thredds_crawler
    verify=verify):
  File &#34;/usr/local/lib/python2.7/dist-packages/threddsclient/client.py&#34;, line 33, in crawl
    for ds in crawl(ref.url, skip, depth - 1, **kwargs):
  File &#34;/usr/local/lib/python2.7/dist-packages/threddsclient/client.py&#34;, line 33, in crawl
    for ds in crawl(ref.url, skip, depth - 1, **kwargs):
  File &#34;/usr/local/lib/python2.7/dist-packages/threddsclient/client.py&#34;, line 33, in crawl
    for ds in crawl(ref.url, skip, depth - 1, **kwargs):
  File &#34;/usr/local/lib/python2.7/dist-packages/threddsclient/client.py&#34;, line 33, in crawl
    for ds in crawl(ref.url, skip, depth - 1, **kwargs):
  File &#34;/usr/local/lib/python2.7/dist-packages/threddsclient/client.py&#34;, line 28, in crawl
    cat = read_url(url, skip, **kwargs)
  File &#34;/usr/local/lib/python2.7/dist-packages/threddsclient/client.py&#34;, line 52, in read_url
    return read_xml(req.text, url)
  File &#34;/usr/local/lib/python2.7/dist-packages/threddsclient/client.py&#34;, line 73, in read_xml
    raise ValueError(&#34;Does not appear to be a Thredds catalog&#34;)
ValueError: Does not appear to be a Thredds catalog
</ows:ExceptionText>
                    </ows:Exception>
            </wps:ExceptionReport>
        </wps:ProcessFailed>
        </wps:Status>
```

@davidcaron commented:
Not sure...

Check that the user has permission to access Thredds; the information is in `config/catalog/catalog.cfg`.

So the `magpie_user` must have permission to access `thredds_host`.
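
A quick manual check of that access path (anonymous, so it only verifies public permissions; a full check would need the catalog's `magpie_user` credentials):

```
$ curl --include "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/catalog.xml"
# expect HTTP 200 and a <catalog ...> document, not an <ExceptionReport>
```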

@davidcaron commented:
The crawler seems to crawl up to a certain depth... and at some point gets something that it expects to be a Thredds document but is not...

@huard (Collaborator) commented Jan 22, 2020

THREDDS is now configured to serve *.txt files as well. Could that be an issue?

@davidcaron commented:
Not impossible... One way to be sure would be to build a custom image of the catalog that logs every request it makes when crawling.
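
A lighter-weight alternative to a custom image might be to turn on urllib3 debug logging inside the existing catalog process, which makes every outgoing connection visible; a sketch (logger names match the old requests stack seen in the tracebacks above):

```python
# Sketch: log every HTTP connection the crawler opens.
import logging

logging.basicConfig(level=logging.DEBUG)
logging.getLogger("urllib3").setLevel(logging.DEBUG)
# older requests versions vendor urllib3 under this name (see the tracebacks above)
logging.getLogger("requests.packages.urllib3").setLevel(logging.DEBUG)
```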

@tlvu tlvu merged commit b7c01f6 into master Jan 22, 2020
@tlvu tlvu deleted the fix-solr-backup-script-missing-data branch January 22, 2020 23:37
@tlvu (Collaborator, Author) commented Jan 23, 2020

The *.txt files on Thredds probably did not cause this.

On my test server, I have this kind of dataset:

```
$ tree
.
├── testdata
│   ├── secure
│   │   ├── tasmax_Amon_MPI-ESM-MR_rcp45_r1i1p1_200601-200612.nc
│   │   ├── tasmax_Amon_MPI-ESM-MR_rcp45_r1i1p1_200701-200712.nc
│   │   ├── tasmax_Amon_MPI-ESM-MR_rcp45_r2i1p1_200601-200612.nc
│   │   └── TEST.txt
│   └── TEST.txt
├── TEST.txt
└── wps_outputs
```

Crawling worked fine and found the 3 .nc files.

There are these errors in the Catalog service logs, but they look harmless:

```
$ docker exec catalog bash -c 'tail -f /var/log/apache2/*'

(...)

syntax error, unexpected WORD_STRING, expecting WORD_WORD
context: Error { code = 500; message = "java.io.EOFException: Reading /pavics-data/testdata/TEST.txt at 5 file length = 5"^;};
syntax error, unexpected WORD_STRING, expecting WORD_WORD
context: Error { code = 500; message = "java.io.EOFException: Reading /pavics-data/testdata/secure/TEST.txt at 5 file length = 5"^;};
```

Let's hope it's just a glitch, I'll retry the crawling again.

@tlvu (Collaborator, Author) commented Jan 23, 2020

New crawler status location: https://pavics.ouranos.ca/wpsoutputs/catalog/0b9c06e4-3d8a-11ea-b543-0242ac120012.xml

I enabled debug logging on the Catalog service this time. Hope to get more hints if it fails.

@tlvu (Collaborator, Author) commented Jan 23, 2020

> New crawler status location: https://pavics.ouranos.ca/wpsoutputs/catalog/0b9c06e4-3d8a-11ea-b543-0242ac120012.xml

Same error again :( Will continue investigation tomorrow.

@tlvu (Collaborator, Author) commented Jan 23, 2020

> I enabled debug logging on the Catalog service this time. Hope to get more hints if it fails.

Absolutely nothing useful in the debug logs. I guess I will have to patch the docker image for more useful logs.

@tlvu (Collaborator, Author) commented Jan 23, 2020

@davidcaron So on my test server, which only has the 3 .nc files above, the resulting managed-schema is exactly the same as the one you gave me. Could it be that the number of .nc files does not impact the content of that managed-schema file?

@tlvu (Collaborator, Author) commented Jan 23, 2020

I hacked up the Catalog container with this change bird-house/threddsclient@master...bird-house:debug-crawl-failure and managed to get this more useful error:

```
ValueError: u'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/catalog/birdhouse/wps_outputs/hummingbird/e5c0b950-3277-11ea-b357-0242ac120010/catalog.xml': Does not appear to be a Thredds catalog, xml=u'<?xml version="1.0" encoding="utf-8"?>\n<ExceptionReport version="1.0.0"\n xmlns="http://www.opengis.net/ows/1.1"\n xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"\n xsi:schemaLocation="http://www.opengis.net/ows/1.1 http://schemas.opengis.net/ows/1.1.0/owsExceptionReport.xsd">\n <Exception exceptionCode="NoApplicableCode" locator="NotAcceptable">\n <ExceptionText>Request failed: HTTPConnectionPool(host='pavics.ouranos.ca', port=8083): Max retries exceeded with url: /twitcher/ows/proxy/thredds/catalog/birdhouse/wps_outputs/hummingbird/e5c0b950-3277-11ea-b357-0242ac120010/catalog.xml (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fad01ec5b90>: Failed to establish a new connection: [Errno -3] Try again'))</ExceptionText>\n </Exception>\n</ExceptionReport>'
```

Looks like the transmission is cut during the XML body transfer. Also, it's weird that we are parsing stuff under wps_outputs in the first place.

All the URLs go through Twitcher, which could possibly explain the transmission cut (does the amount of data transferred exceed Twitcher's capacity?). Will try to remove Twitcher and have the Catalog hit Thredds directly.
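
(For context, my reading of the debug change linked above, reconstructed from the stock message in the earlier traceback versus the richer one here; not a verbatim quote of the branch: it makes `read_xml()` report the offending URL and body.)

```diff
 # threddsclient/client.py, read_xml() -- reconstruction, not the actual branch diff
-        raise ValueError("Does not appear to be a Thredds catalog")
+        raise ValueError("%r: Does not appear to be a Thredds catalog, xml=%r"
+                         % (baseurl, xml))
```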

@dbyrns (Collaborator) commented Jan 24, 2020

@tlvu I remind you that wps_outputs is a docker volume shared between all the WPS providers and Thredds. This way Thredds can provide file/OPeNDAP/WMS access to process outputs. So, being part of the birdhouse catalog, it is indeed parsed by the crawling process.
Having said that, I think we should review this in the catalog exploration task so that we do not mix source and processed data.

@huard (Collaborator) commented Jan 24, 2020

I think Blaise had done something about this (splitting user files vs source files).

@dbyrns (Collaborator) commented Jan 24, 2020

But if I'm right, output files should be deleted after some time, so they should not be indexed... Malleefowl has a function to persist output files (https://github.com/Ouranosinc/malleefowl/blob/pavics-dev/malleefowl/processes/wps_persist.py) that was called in some workflows or accessed via a frontend option, and it was required before indexing output files.

@dbyrns (Collaborator) commented Jan 24, 2020

So I think the "full" crawling option was shortsighted, as it assumes the fresh volume of a new deployment.

@davidcaron commented Jan 24, 2020

The thredds/catalog/birdhouse/wps_outputs/hummingbird folder contains a lot of txt files (~12000), and David Huard said they were added to Thredds recently. So maybe they were just skipped before, which would explain why the problem is new.

When I try to run the crawler from a simple Python environment (doing only `conda install -c conda-forge threddsclient`):

```python
from threddsclient import crawl

url = 'https://pavics.ouranos.ca/thredds/catalog/birdhouse/wps_outputs/hummingbird/catalog.xml'
for n, ds in enumerate(crawl(url, depth=1)):
    print(n, ds, ds.url)
```

It takes 10 minutes to run for all the hummingbird wps_outputs.

Also, notice I'm not passing through Twitcher. (Edit: that might not be entirely true, because the URLs returned by Thredds do go through Twitcher.)

The crawler sometimes finishes, and sometimes stops at a different dataset every time with a connection error...
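
In case it helps make the local reproduction more deterministic, a crude retry wrapper around the same snippet (my own sketch, not something the stack does today; `crawl()` is a generator, so each retry restarts from the top of the catalog):

```python
import time

from requests.exceptions import ConnectionError
from threddsclient import crawl

url = 'https://pavics.ouranos.ca/thredds/catalog/birdhouse/wps_outputs/hummingbird/catalog.xml'

for attempt in range(3):
    try:
        for n, ds in enumerate(crawl(url, depth=1)):
            print(n, ds, ds.url)
        break  # full pass completed without a transient failure
    except (ConnectionError, ValueError) as exc:  # threddsclient surfaces some failures as ValueError
        print("crawl attempt %d failed: %s" % (attempt, exc))
        time.sleep(30)
```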

@tlvu (Collaborator, Author) commented Jan 24, 2020

I am stumped. I sort of (not completely sure) removed Twitcher, then the front Nginx in front of Thredds (basically more or less undoing this PR https://github.com/Ouranosinc/PAVICS/pull/162, which might not be enough to completely remove Twitcher/Nginx in front), see diff update-catalog-config...debug-catalog-crawl-failure, and I still get the same error "ValueError: Does not appear to be a Thredds catalog, xml".

Note the hostname and port changes: it tries "https://pavics.ouranos.ca/thredds/catalog" but ends up with "Request failed: HTTPConnectionPool(host='boreas.ouranos.ca', port=8083): Max retries exceeded with url: /thredds/catalog".

This error seems to occur only on a Thredds with a lot of data. On my test server with 3 .nc files and 3 .txt files the crawl works fine. Can CRIM try the crawl on your side, on both a big and a small Thredds server?

The debugging changes above are done directly on our production Boreas since I am not able to reproduce the problem anywhere else. I made sure Jenkins and the Canarie monitoring are still OK.

Full error:

```
ValueError: u'https://pavics.ouranos.ca/thredds/catalog/birdhouse/wps_outputs/hummingbird/c61c6948-3c2f-11ea-a46b-0242ac120014/catalog.xml': Does not appear to be a Thredds catalog, xml=u'<?xml version="1.0" encoding="utf-8"?>\n<ExceptionReport version="1.0.0"\n xmlns="http://www.opengis.net/ows/1.1"\n xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"\n xsi:schemaLocation="http://www.opengis.net/ows/1.1 http://schemas.opengis.net/ows/1.1.0/owsExceptionReport.xsd">\n <Exception exceptionCode="NoApplicableCode" locator="NotAcceptable">\n <ExceptionText>Request failed: HTTPConnectionPool(host='boreas.ouranos.ca', port=8083): Max retries exceeded with url: /thredds/catalog/birdhouse/wps_outputs/hummingbird/c61c6948-3c2f-11ea-a46b-0242ac120014/catalog.xml (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fad01d7b3d0>: Failed to establish a new connection: [Errno -3] Try again'))</ExceptionText>\n </Exception>\n</ExceptionReport>'
```

Full status location for reference:

```
$ curl --include https://pavics.ouranos.ca/wpsoutputs/catalog/65d8b050-3ebe-11ea-89dd-0242ac120012.xml

HTTP/1.1 200 OK
Server: nginx/1.13.6
Date: Fri, 24 Jan 2020 15:44:32 GMT
Content-Type: text/xml
Content-Length: 3977
Last-Modified: Fri, 24 Jan 2020 15:32:28 GMT
Connection: keep-alive
ETag: "5e2b0e0c-f89"
Accept-Ranges: bytes

<?xml version="1.0" encoding="UTF-8"?>
<wps:ExecuteResponse xmlns:wps="http://www.opengis.net/wps/1.0.0" xmlns:ows="http://www.opengis.net/ows/1.1" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.opengis.net/wps/1.0.0 ../wpsExecute_response.xsd" service="WPS" version="1.0.0" xml:lang="en-US" serviceInstance="http://localhost/wps?request=GetCapabilities&amp;amp;service=WPS" statusLocation="https://pavics.ouranos.ca/wpsoutputs/catalog/65d8b050-3ebe-11ea-89dd-0242ac120012.xml">
    <wps:Process wps:processVersion="0.1">
        <ows:Identifier>pavicrawler</ows:Identifier>
        <ows:Title>PAVICS Crawler</ows:Title>
        <ows:Abstract>Crawl thredds server and write metadata to SOLR database.</ows:Abstract>
        </wps:Process>
    <wps:Status creationTime="2020-01-24T15:32:28Z">
        <wps:ProcessFailed>
            <wps:ExceptionReport>
                    <ows:Exception exceptionCode="NoApplicableCode" locator="None">
                            <ows:ExceptionText>Process error: method=wps_pavicrawler.py._handler, line=146, msg=Traceback (most recent call last):
  File &#34;/usr/local/lib/python2.7/dist-packages/pavics_datacatalog-0.6.11-py2.7.egg/pavics_datacatalog/wps_processes/wps_pavicrawler.py&#34;, line 144, in _handler
    headers=headers, verify=self.verify)
  File &#34;/usr/local/lib/python2.7/dist-packages/pavics/catalog.py&#34;, line 476, in pavicrawler
    headers=headers, verify=verify)
  File &#34;/usr/local/lib/python2.7/dist-packages/pavics/catalog.py&#34;, line 280, in thredds_crawler
    verify=verify):
  File &#34;/usr/local/lib/python2.7/dist-packages/threddsclient/client.py&#34;, line 35, in crawl
    for ds in crawl(ref.url, skip, depth - 1, **kwargs):
  File &#34;/usr/local/lib/python2.7/dist-packages/threddsclient/client.py&#34;, line 35, in crawl
    for ds in crawl(ref.url, skip, depth - 1, **kwargs):
  File &#34;/usr/local/lib/python2.7/dist-packages/threddsclient/client.py&#34;, line 35, in crawl
    for ds in crawl(ref.url, skip, depth - 1, **kwargs):
  File &#34;/usr/local/lib/python2.7/dist-packages/threddsclient/client.py&#34;, line 35, in crawl
    for ds in crawl(ref.url, skip, depth - 1, **kwargs):
  File &#34;/usr/local/lib/python2.7/dist-packages/threddsclient/client.py&#34;, line 30, in crawl
    cat = read_url(url, skip, **kwargs)
  File &#34;/usr/local/lib/python2.7/dist-packages/threddsclient/client.py&#34;, line 55, in read_url
    return read_xml(req.text, url)
  File &#34;/usr/local/lib/python2.7/dist-packages/threddsclient/client.py&#34;, line 77, in read_xml
    % (baseurl, xml)))
ValueError: u&#39;https://pavics.ouranos.ca/thredds/catalog/birdhouse/wps_outputs/hummingbird/c61c6948-3c2f-11ea-a46b-0242ac120014/catalog.xml&#39;: Does not appear to be a Thredds catalog, xml=u&#39;&lt;?xml version=&#34;1.0&#34; encoding=&#34;utf-8&#34;?&gt;\n&lt;ExceptionReport version=&#34;1.0.0&#34;\n    xmlns=&#34;http://www.opengis.net/ows/1.1&#34;\n    xmlns:xsi=&#34;http://www.w3.org/2001/XMLSchema-instance&#34;\n    xsi:schemaLocation=&#34;http://www.opengis.net/ows/1.1 http://schemas.opengis.net/ows/1.1.0/owsExceptionReport.xsd&#34;&gt;\n    &lt;Exception exceptionCode=&#34;NoApplicableCode&#34; locator=&#34;NotAcceptable&#34;&gt;\n        &lt;ExceptionText&gt;Request failed: HTTPConnectionPool(host=&amp;#x27;boreas.ouranos.ca&amp;#x27;, port=8083): Max retries exceeded with url: /thredds/catalog/birdhouse/wps_outputs/hummingbird/c61c6948-3c2f-11ea-a46b-0242ac120014/catalog.xml (Caused by NewConnectionError(&amp;#x27;&amp;lt;urllib3.connection.HTTPConnection object at 0x7fad01d7b3d0&amp;gt;: Failed to establish a new connection: [Errno -3] Try again&amp;#x27;))&lt;/ExceptionText&gt;\n    &lt;/Exception&gt;\n&lt;/ExceptionReport&gt;&#39;
</ows:ExceptionText>
                    </ows:Exception>
            </wps:ExceptionReport>
        </wps:ProcessFailed>
        </wps:Status>
</wps:ExecuteResponse>
```

@huard (Collaborator) commented Jan 24, 2020

If the problem is related to text files, one option would be for THREDDS to index only the following (a rough filter sketch follows the list):

  • LICENSE.{txt,md,rst}
  • README.{txt,md,rst}
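
A rough sketch of what that could look like in `birdhouse/config/thredds/catalog.xml.template`, reusing the `datasetScan` filter syntax quoted later in this thread (the `README.*` / `LICENSE.*` wildcards are my guess at how to express the suggestion, not a tested config):

```xml
      <filter>
        <include wildcard="*.nc" />
        <include wildcard="*.ncml" />
        <include wildcard="README.*" />
        <include wildcard="LICENSE.*" />
      </filter>
```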

tlvu added a commit that referenced this pull request Jan 27, 2020
catalog: use public hostname in config when using self-signed SSL behind real SSL from pagekite

Fix magpie connection error like:

```
<ows:ExceptionText>Process error: method=wps_pavicrawler.py._handler, line=146, msg=Traceback (most recent call last):
  File &#34;/usr/local/lib/python2.7/dist-packages/pavics_datacatalog-0.6.11-py2.7.egg/pavics_datacatalog/wps_processes/wps_pavicrawler.py&#34;, line 125, in _handler
    verify=self.verify)
  File &#34;/usr/lib/python2.7/dist-packages/requests/sessions.py&#34;, line 523, in post
    return self.request(&#39;POST&#39;, url, data=data, json=json, **kwargs)
  File &#34;/usr/lib/python2.7/dist-packages/requests/sessions.py&#34;, line 480, in request
    resp = self.send(prep, **send_kwargs)
  File &#34;/usr/lib/python2.7/dist-packages/requests/sessions.py&#34;, line 588, in send
    r = adapter.send(request, **kwargs)
  File &#34;/usr/lib/python2.7/dist-packages/requests/adapters.py&#34;, line 447, in send
    raise SSLError(e, request=request)
SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:590)
</ows:ExceptionText>
```

The Thredds URL needs to use the public hostname too, so the path recorded in Solr is
the correct public one.

Not sure what wms_alternate_server impacts, but it looks like it might be
useful, so change it too.

Fix needed to investigate the crawling problem in #4 (comment)
@tlvu (Collaborator, Author) commented Jan 27, 2020

Text files in the Thredds catalog are not the root cause. I just removed the Thredds config that exposes text files in the catalog and I still get that same error "ValueError: u'https://pavics.ouranos.ca/thredds/catalog/birdhouse/wps_outputs/hummingbird/f5858240-4c08-11e9-a17f-0242ac12000d/catalog.xml': Does not appear to be a Thredds catalog".

```diff
$ git diff
diff --git a/birdhouse/config/thredds/catalog.xml.template b/birdhouse/config/thredds/catalog.xml.template
index 7d97b36..7ada4b5 100644
--- a/birdhouse/config/thredds/catalog.xml.template
+++ b/birdhouse/config/thredds/catalog.xml.template
@@ -22,9 +22,6 @@
       <filter>
         <include wildcard="*.nc" />
         <include wildcard="*.ncml" />
-        <include wildcard="*.txt" />
-        <include wildcard="*.md" />
-        <include wildcard="*.rst" />
       </filter>
 
     </datasetScan>
```

@tlvu (Collaborator, Author) commented Jan 27, 2020

I have the Catalog access Thredds directly via internal docker networking instead of going through the external network (PAVICS_FQDN), see b165d1a, and the crawl has been running for 20 minutes uninterrupted; the previous longest run was only about 10 minutes.

I think I am onto something here: maybe some strict firewall rules or denial-of-service protection is interfering with the crawl, since the crawl makes a huge number of network connections. This would explain why none of the other servers is able to reproduce the problem, since the protection would be on the public Boreas only.
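
Purely as an illustration of the change (the real diff is in b165d1a; the service name and port below are assumptions), the idea is to point the catalog's `thredds_host` at the Thredds container over the internal docker network instead of at the public hostname:

```
# config/catalog/catalog.cfg -- illustration only, not the actual diff
# before: crawl requests leave and re-enter through the public network
#   thredds_host = https://pavics.ouranos.ca/twitcher/ows/proxy/thredds
# after: talk to the Thredds container directly on the docker network
#   thredds_host = http://thredds:8080/thredds
```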

@tlvu (Collaborator, Author) commented Jan 27, 2020

Crawler finished without error (way too fast) and did not insert anything into Solr. But at least we got past the network problem crawling Thredds.

```
$ curl https://pavics.ouranos.ca/wpsoutputs/catalog/05f2d1a6-413a-11ea-9d58-0242ac120008.xml
<?xml version="1.0" encoding="UTF-8"?>
<wps:ExecuteResponse xmlns:wps="http://www.opengis.net/wps/1.0.0" xmlns:ows="http://www.opengis.net/ows/1.1" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.opengis.net/wps/1.0.0 ../wpsExecute_response.xsd" service="WPS" version="1.0.0" xml:lang="en-US" serviceInstance="http://localhost/wps?request=GetCapabilities&amp;amp;service=WPS" statusLocation="https://pavics.ouranos.ca/wpsoutputs/catalog/05f2d1a6-413a-11ea-9d58-0242ac120008.xml">
    <wps:Process wps:processVersion="0.1">
        <ows:Identifier>pavicrawler</ows:Identifier>
        <ows:Title>PAVICS Crawler</ows:Title>
        <ows:Abstract>Crawl thredds server and write metadata to SOLR database.</ows:Abstract>
        </wps:Process>
    <wps:Status creationTime="2020-01-27T19:50:29Z">
        <wps:ProcessSucceeded>PyWPS Process PAVICS Crawler finished</wps:ProcessSucceeded>
        </wps:Status>
        <wps:ProcessOutputs>
                <wps:Output>
            <ows:Identifier>crawler_result</ows:Identifier>
            <ows:Title>PAVICS Crawler Result</ows:Title>
            <ows:Abstract>Crawler result as a json.</ows:Abstract>
            <wps:Reference href="https://pavics.ouranos.ca/wpsoutputs/catalog/05f2d1a6-413a-11ea-9d58-0242ac120008/solr_result_2020-01-27T19:50:28Z_.json" mimeType="application/json" encoding="" schema=""/>
                </wps:Output>
        </wps:ProcessOutputs>
</wps:ExecuteResponse>

$ curl https://pavics.ouranos.ca/wpsoutputs/catalog/05f2d1a6-413a-11ea-9d58-0242ac120008/solr_result_2020-01-27T19:50:28Z_.json
{"responseHeader": {"status": 0, "QTime": 0, "Nquery": 0}}

@tlvu (Collaborator, Author) commented Jan 29, 2020

A crawl has been running for 4 hours; this looks promising. Note this is when bypassing all external networks and using only the internal docker network between the Catalog and Thredds: 83c8391...641c648

```
curl https://pavics.ouranos.ca/wpsoutputs/catalog/d35b666e-42be-11ea-827b-0242ac120016.xml

<?xml version="1.0" encoding="UTF-8"?>
<wps:ExecuteResponse xmlns:wps="http://www.opengis.net/wps/1.0.0" xmlns:ows="http://www.opengis.net/ows/1.1" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.opengis.net/wps/1.0.0 ../wpsExecute_response.xsd" service="WPS" version="1.0.0" xml:lang="en-US" serviceInstance="http://localhost/wps?request=GetCapabilities&amp;amp;service=WPS" statusLocation="https://pavics.ouranos.ca/wpsoutputs/catalog/d35b666e-42be-11ea-827b-0242ac120016.xml">
    <wps:Process wps:processVersion="0.1">
        <ows:Identifier>pavicrawler</ows:Identifier>
        <ows:Title>PAVICS Crawler</ows:Title>
        <ows:Abstract>Crawl thredds server and write metadata to SOLR database.</ows:Abstract>
        </wps:Process>
    <wps:Status creationTime="2020-01-29T17:43:15Z">
        <wps:ProcessStarted percentCompleted="10">Calling pavicrawler</wps:ProcessStarted>
        </wps:Status>
</wps:ExecuteResponse>
```

@tlvu (Collaborator, Author) commented Jan 30, 2020

Crawl failed again, this time with a connection problem to Solr.

```
curl https://pavics.ouranos.ca/wpsoutputs/catalog/d35b666e-42be-11ea-827b-0242ac120016.xml

<?xml version="1.0" encoding="UTF-8"?>
<wps:ExecuteResponse xmlns:wps="http://www.opengis.net/wps/1.0.0" xmlns:ows="http://www.opengis.net/ows/1.1" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.opengis.net/wps/1.0.0 ../wpsExecute_response.xsd" service="WPS" version="1.0.0" xml:lang="en-US" serviceInstance="http://localhost/wps?request=GetCapabilities&amp;amp;service=WPS" statusLocation="https://pavics.ouranos.ca/wpsoutputs/catalog/d35b666e-42be-11ea-827b-0242ac120016.xml">
    <wps:Process wps:processVersion="0.1">
        <ows:Identifier>pavicrawler</ows:Identifier>
        <ows:Title>PAVICS Crawler</ows:Title>
        <ows:Abstract>Crawl thredds server and write metadata to SOLR database.</ows:Abstract>
        </wps:Process>
    <wps:Status creationTime="2020-01-30T04:11:10Z">
        <wps:ProcessFailed>
            <wps:ExceptionReport>
                    <ows:Exception exceptionCode="NoApplicableCode" locator="None">
                            <ows:ExceptionText>Process error: method=wps_pavicrawler.py._handler, line=146, msg=Traceback (most recent call last):
  File &#34;/usr/local/lib/python2.7/dist-packages/pavics_datacatalog-0.6.11-py2.7.egg/pavics_datacatalog/wps_processes/wps_pavicrawler.py&#34;, line 144, in _handler
    headers=headers, verify=self.verify)
  File &#34;/usr/local/lib/python2.7/dist-packages/pavics/catalog.py&#34;, line 493, in pavicrawler
    doc[&#39;title&#39;], doc[&#39;dataset_id&#39;]))
  File &#34;/usr/local/lib/python2.7/dist-packages/pavics/catalog.py&#34;, line 976, in pavicsearch
    r = requests.get(solr_search_url)
  File &#34;/usr/lib/python2.7/dist-packages/requests/api.py&#34;, line 67, in get
    return request(&#39;get&#39;, url, params=params, **kwargs)
  File &#34;/usr/lib/python2.7/dist-packages/requests/api.py&#34;, line 53, in request
    return session.request(method=method, url=url, **kwargs)
  File &#34;/usr/lib/python2.7/dist-packages/requests/sessions.py&#34;, line 480, in request
    resp = self.send(prep, **send_kwargs)
  File &#34;/usr/lib/python2.7/dist-packages/requests/sessions.py&#34;, line 588, in send
    r = adapter.send(request, **kwargs)
  File &#34;/usr/lib/python2.7/dist-packages/requests/adapters.py&#34;, line 437, in send
    raise ConnectionError(e, request=request)
ConnectionError: HTTPConnectionPool(host=&#39;pavics.ouranos.ca&#39;, port=8983): Max retries exceeded with url: /solr/birdhouse/select?start=0&amp;rows=1000&amp;q=tasmax_day_MPI-ESM-LR_rcp85_r1i1p1_na10kgrid_qm-moving-50bins-detrend_2003.nc%20AND%20testdata.ouranos.cb-oura-1.0_rechunk.MPI-ESM-LR.rcp85.day.tasmax&amp;fl=*,score&amp;sort=id+asc&amp;wt=json&amp;indent=true (Caused by NewConnectionError(&#39;&lt;requests.packages.urllib3.connection.HTTPConnection object at 0x7f9ea2d26350&gt;: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution&#39;,))
</ows:ExceptionText>
                    </ows:Exception>
            </wps:ExceptionReport>
        </wps:ProcessFailed>
        </wps:Status>
</wps:ExecuteResponse>
```

@tlvu (Collaborator, Author) commented Feb 3, 2020

Tagged 1.7.0 since the last tag from the old PAVICS repo was 1.6.13. Migrating to a public repo is not a small change, so it is worth bumping the minor version instead of a patch version.

@tlvu (Collaborator, Author) commented Feb 17, 2020

Just to close on this crawling issue, @moulab88 and I finally found 2 root causes.

1 - The Catalog was choking when crawling Thredds because there was a gigantic 244G folder under wps_outputs that probably timed out the connection between the Catalog and Thredds while Thredds was generating the catalog.xml for that folder. We removed that folder (a quick way to spot such folders is sketched below).

2 - The Catalog was unable to connect to Solr due to an out-of-date DNS config on the Boreas host. A new config was deployed.
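
For the record, a quick way to spot that kind of oversized folder before re-crawling (generic shell; the host path of the wps_outputs volume is an assumption):

```
$ du -sh /data/wps_outputs/* | sort -h | tail -5
```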

So the full crawl finally worked and took 2 days to complete. Mourad started the crawl Friday morning and it finished Sunday at 10:59 AM.

curl --include "https://pavics.ouranos.ca/wpsoutputs/catalog/ea24a6fe-4f6f-11ea-a3e1-0242ac120015.xml"

HTTP/1.1 200 OK
Server: nginx/1.13.6
Date: Mon, 17 Feb 2020 21:14:55 GMT
Content-Type: text/xml
Content-Length: 1461
Last-Modified: Sun, 16 Feb 2020 14:59:10 GMT
Connection: keep-alive
ETag: "5e4958be-5b5"
Accept-Ranges: bytes

<?xml version="1.0" encoding="UTF-8"?>
<wps:ExecuteResponse xmlns:wps="http://www.opengis.net/wps/1.0.0" xmlns:ows="http://www.opengis.net/ows/1.1" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.opengis.net/wps/1.0.0 ../wpsExecute_response.xsd" service="WPS" version="1.0.0" xml:lang="en-US" serviceInstance="http://localhost/wps?request=GetCapabilities&amp;amp;service=WPS" statusLocation="https://pavics.ouranos.ca/wpsoutputs/catalog/ea24a6fe-4f6f-11ea-a3e1-0242ac120015.xml">
    <wps:Process wps:processVersion="0.1">
        <ows:Identifier>pavicrawler</ows:Identifier>
        <ows:Title>PAVICS Crawler</ows:Title>
        <ows:Abstract>Crawl thredds server and write metadata to SOLR database.</ows:Abstract>
        </wps:Process>
    <wps:Status creationTime="2020-02-16T14:59:10Z">
        <wps:ProcessSucceeded>PyWPS Process PAVICS Crawler finished</wps:ProcessSucceeded>
        </wps:Status>
        <wps:ProcessOutputs>
                <wps:Output>
            <ows:Identifier>crawler_result</ows:Identifier>
            <ows:Title>PAVICS Crawler Result</ows:Title>
            <ows:Abstract>Crawler result as a json.</ows:Abstract>
            <wps:Reference href="https://pavics.ouranos.ca/wpsoutputs/catalog/ea24a6fe-4f6f-11ea-a3e1-0242ac120015/solr_result_2020-02-16T14:59:08Z_.json" mimeType="application/json" encoding="" schema=""/>
                </wps:Output>
        </wps:ProcessOutputs>
</wps:ExecuteResponse>

curl --include https://pavics.ouranos.ca/wpsoutputs/catalog/ea24a6fe-4f6f-11ea-a3e1-0242ac120015/solr_result_2020-02-16T14:59:08Z_.json

HTTP/1.1 200 OK
Server: nginx/1.13.6
Date: Mon, 17 Feb 2020 21:20:27 GMT
Content-Type: application/json
Content-Length: 65
Last-Modified: Sun, 16 Feb 2020 14:59:08 GMT
Connection: keep-alive
ETag: "5e4958bc-41"
Accept-Ranges: bytes

{"responseHeader": {"status": 0, "QTime": 58978, "Nquery": 1214}}

@moulab88 (Collaborator) commented:
There were also 12702 files/sub-directories under that directory.

tlvu added a commit to Ouranosinc/pavics-sdi that referenced this pull request Feb 24, 2020
…ins-failure-after-new-crawl

catalog_search.ipynb: fix jenkins failure after new crawl

Make the query much more precise by adding "institute:CCCma,model:CanESM2".

The previous query was returning 200+ results after the new crawl triggered in bird-house/birdhouse-deploy#4 (comment).

Now we seem to have duplicate results, "cccma" and "CCCMA". @huard did we rename "CCCMA" to "cccma" on Thredds?

New working Jenkins run: http://jenkins.ouranos.ca/job/PAVICS-e2e-workflow-tests/job/master/480/console

Jenkins error fixed:

```
00:35:48  _____ pavics-sdi-master/docs/source/notebooks/catalog_search.ipynb::Cell 1 _____
00:35:48  Notebook cell execution failed
00:35:48  Cell 1: Cell outputs differ
00:35:48
00:35:48  Input:
00:35:48  resp = wps.pavicsearch(constraints="variable:tasmin,project:CMIP5,experiment:rcp85,frequency:day", limit=10, type="File")
00:35:48  [result, files] = resp.get(asobj=True)
00:35:48  files
00:35:48
00:35:48  Traceback:
00:35:48   mismatch 'text/plain'
00:35:48
00:35:48   assert reference_output == test_output failed:
00:35:48
00:35:48    "['https://pa...21001231.nc']" == "['https://pa...20101130.nc']"
00:35:48    Skipping 61 identical leading characters in diff, use -v to show
00:35:48    - birdhouse/CCCMA/CanESM2/rcp85/day/atmos/r5i1p1/tasmin/tasmin_day_CanESM2_rcp85_r5i1p1_20060101-21001231.nc',
00:35:48    ?           ^^^ ^^^^^^^^^                  ^                        ^^^^ ^        ^       ^
00:35:48    + birdhouse/cmip5/MRI/rcp85/day/atmos/r1i1p1/tasmin/tasmin_day_MRI-CGCM3_rcp85_r1i1p1_20960101-21001231.nc',
00:35:48    ?           ^^^^^^ ^^                  ^                       ++++ ^^ ^        ^       ^
00:35:48    -  'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/birdhouse/CCCMA/CanESM2/rcp85/day/atmos/r2i1p1/tasmin/tasmin_day_CanESM2_rcp85_r2i1p1_20060101-21001231.nc',
00:35:48    ?                                                                        ^^^ ^ ^^^   ^                  ^                       ^^^   ^        ^       --       ^^
00:35:48    +  'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/birdhouse/cmip5/MPI-M/MPI-ESM-LR/rcp85/day/atmos/r1i1p1/tasmin/tasmin_day_MPI-ESM-LR_rcp85_r1i1p1_21300101-21391231.nc',
00:35:48    ?                                                                        ^^^^^^ ^^^^ ^^^^   ^^^                  ^                       ^^^^   ^^^        ^      ++        ^^
00:35:48    +  'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/birdhouse/cmip5/MPI-M/MPI-ESM-LR/rcp85/day/atmos/r1i1p1/tasmin/tasmin_day_MPI-ESM-LR_rcp85_r1i1p1_21810101-21891231.nc',
00:35:48    +  'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/birdhouse/cmip5/MOHC/HadGEM2-ES/rcp85/day/atmos/r4i1p1/tasmin/tasmin_day_HadGEM2-ES_rcp85_r4i1p1_20951201-21001130.nc',
00:35:48    +  'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/birdhouse/cmip5/MOHC/HadGEM2-CC/rcp85/day/atmos/r1i1p1/tasmin/tasmin_day_HadGEM2-CC_rcp85_r1i1p1_20451201-20501130.nc',
00:35:48    +  'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/birdhouse/cmip5/MOHC/HadGEM2-ES/rcp85/day/atmos/r1i1p1/tasmin/tasmin_day_HadGEM2-ES_rcp85_r1i1p1_20401201-20451130.nc',
00:35:48    -  'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/birdhouse/CCCMA/CanESM2/rcp85/day/atmos/r3i1p1/tasmin/tasmin_day_CanESM2_rcp85_r3i1p1_20060101-21001231.nc',
00:35:48    ?                                                                        ^^^ ^ ^^^   ^                                          ^^^   ^                 -      - ^
00:35:48    +  'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/birdhouse/cmip5/MPI-M/MPI-ESM-LR/rcp85/day/atmos/r3i1p1/tasmin/tasmin_day_MPI-ESM-LR_rcp85_r3i1p1_20400101-20491231.nc',
00:35:48    ?                                                                        ^^^^^^ ^^^^ ^^^^   ^^^                                          ^^^^   ^^^                +        ^^
00:35:48    -  'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/birdhouse/CCCMA/CanESM2/rcp85/day/atmos/r4i1p1/tasmin/tasmin_day_CanESM2_rcp85_r4i1p1_20060101-21001231.nc',
00:35:48    -  'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/birdhouse/CCCMA/CanESM2/rcp85/day/atmos/r1i1p1/tasmin/tasmin_day_CanESM2_rcp85_r1i1p1_20060101-21001231.nc']
00:35:48    ?                                                                         ^^ ^^^^^  --                                          ^ ^  --                ---     -   ^ ^    ^
00:35:48    +  'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/birdhouse/cmip5/MOHC/HadGEM2-ES/rcp85/day/atmos/r1i1p1/tasmin/tasmin_day_HadGEM2-ES_rcp85_r1i1p1_20151201-20201130.nc',
00:35:48    ?                                                                        +++++++++ ^^^^^^ ^^                                            ^ ^^^^^^                   +++     +  ^ ^    ^^
00:35:48    +  'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/birdhouse/cmip5/MOHC/HadGEM2-ES/rcp85/day/atmos/r1i1p1/tasmin/tasmin_day_HadGEM2-ES_rcp85_r1i1p1_22191201-22291130.nc',
00:35:48    +  'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/birdhouse/cmip5/MOHC/HadGEM2-CC/rcp85/day/atmos/r3i1p1/tasmin/tasmin_day_HadGEM2-CC_rcp85_r3i1p1_20051201-20101130.nc']
00:35:48
00:35:48
00:35:48  _____ pavics-sdi-master/docs/source/notebooks/catalog_search.ipynb::Cell 2 _____
00:35:48  Notebook cell execution failed
00:35:48  Cell 2: Cell outputs differ
00:35:48
00:35:48  Input:
00:35:48  result['response']['docs'][0]
00:35:48
00:35:48  Traceback:
00:35:48   mismatch 'text/plain'
00:35:48
00:35:48   assert reference_output == test_output failed:
00:35:48
00:35:48    "{'cf_standar...21001231.nc'}" == "{'cf_standar...21001231.nc'}"
00:35:48    Skipping 56 identical leading characters in diff, use -v to show
00:35:48    - birdhouse/CCCMA/CanESM2/rcp85/day/atmos/r5i1p1/tasmin/tasmin_day_CanESM2_rcp85_r5i1p1_20060101-21001231.nc',
00:35:48    ?           ^^^ ^^^^^^^^^                  ^                        ^^^^ ^        ^       ^
00:35:48    + birdhouse/cmip5/MRI/rcp85/day/atmos/r1i1p1/tasmin/tasmin_day_MRI-CGCM3_rcp85_r1i1p1_20960101-21001231.nc',
00:35:48    ?           ^^^^^^ ^^                  ^                       ++++ ^^ ^        ^       ^
00:35:48       'replica': False,
00:35:48    -  'wms_url': 'https://pavics.ouranos.ca/twitcher/ows/proxy/ncWMS2/wms?SERVICE=WMS&REQUEST=GetCapabilities&VERSION=1.3.0&DATASET=outputs/CCCMA/CanESM2/rcp85/day/atmos/r5i1p1/tasmin/tasmin_day_CanESM2_rcp85_r5i1p1_20060101-21001231.nc',
00:35:48    ?                                                                                                                                        ^^^^^^^^^^^^^                  ^                       ^^^^^^^        ^^^^^^^^^
00:35:48    +  'wms_url': 'https://pavics.ouranos.ca/twitcher/ows/proxy/ncWMS2/wms?SERVICE=WMS&REQUEST=GetCapabilities&VERSION=1.3.0&DATASET=outputs/cmip5/MRI/rcp85/day/atmos/r1i1p1/tasmin/tasmin_day_MRI-CGCM3_rcp85_r1i1p1_20960101-21001231.nc',
00:35:48    ?                                                                                                                                        ^^^^^^^^^                  ^                       ^^^^^^^^^        ^^^^^^^^^
00:35:48       'keywords': ['air_temperature',
00:35:48        'day',
00:35:48        'application/netcdf',
00:35:48        'tasmin',
00:35:48        'thredds',
00:35:48        'CMIP5',
00:35:48        'rcp85',
00:35:48    -   'CanESM2',
00:35:48    -   'CCCma'],
00:35:48    +   'MRI-CGCM3',
00:35:48    +   'MRI'],
00:35:48    -  'dataset_id': 'CCCMA.CanESM2.rcp85.day.atmos.r5i1p1.tasmin',
00:35:48    ?                 ^^^ ^^^^^^^^^                  ^
00:35:48    +  'dataset_id': 'cmip5.MRI.rcp85.day.atmos.r1i1p1.tasmin',
00:35:48    ?                 ^^^^^^ ^^                  ^
00:35:48       'datetime_max': 'DATE_TIME_TZ',
00:35:48    -  'id': '29186a2db2230376',
00:35:48    +  'id': '0035405c47cd3a2f',
00:35:48       'subject': 'Birdhouse Thredds Catalog',
00:35:48       'category': 'thredds',
00:35:48    -  'opendap_url': 'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/birdhouse/CCCMA/CanESM2/rcp85/day/atmos/r5i1p1/tasmin/tasmin_day_CanESM2_rcp85_r5i1p1_20060101-21001231.nc',
00:35:48    ?                                                                                       ^^^ ^^^^^^^^^                  ^                        ^^^^ ^        ^       ^
00:35:48    +  'opendap_url': 'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/birdhouse/cmip5/MRI/rcp85/day/atmos/r1i1p1/tasmin/tasmin_day_MRI-CGCM3_rcp85_r1i1p1_20960101-21001231.nc',
00:35:48    ?                                                                                       ^^^^^^ ^^                  ^                       ++++ ^^ ^        ^       ^
00:35:48    -  'title': 'tasmin_day_CanESM2_rcp85_r5i1p1_20060101-21001231.nc',
00:35:48    ?                        ^^^^ ^        ^       ^
00:35:48    +  'title': 'tasmin_day_MRI-CGCM3_rcp85_r1i1p1_20960101-21001231.nc',
00:35:48    ?                       ++++ ^^ ^        ^       ^
00:35:48       'variable_palette': ['default'],
00:35:48       'variable_min': [0],
00:35:48       'variable_long_name': ['Daily Minimum Near-Surface Air Temperature'],
00:35:48    -  'source': 'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/catalog.xml',
00:35:48    +  'source': 'https://pavics.ouranos.ca//twitcher/ows/proxy/thredds/catalog.xml',
00:35:48    ?                                      +
00:35:48       'datetime_min': 'DATE_TIME_TZ',
00:35:48       'score': 1.0,
00:35:48       'variable_max': [1],
00:35:48       'units': ['K'],
00:35:48    -  'resourcename': 'birdhouse/CCCMA/CanESM2/rcp85/day/atmos/r5i1p1/tasmin/tasmin_day_CanESM2_rcp85_r5i1p1_20060101-21001231.nc',
00:35:48    ?                             ^^^ ^^^^^^^^^                  ^                        ^^^^ ^        ^       ^
00:35:48    +  'resourcename': 'birdhouse/cmip5/MRI/rcp85/day/atmos/r1i1p1/tasmin/tasmin_day_MRI-CGCM3_rcp85_r1i1p1_20960101-21001231.nc',
00:35:48    ?                             ^^^^^^ ^^                  ^                       ++++ ^^ ^        ^       ^
00:35:48       'type': 'File',
00:35:48    -  'catalog_url': 'https://pavics.ouranos.ca/thredds/catalog/birdhouse/CCCMA/CanESM2/rcp85/day/atmos/r5i1p1/tasmin/catalog.xml?dataset=birdhouse/CCCMA/CanESM2/rcp85/day/atmos/r5i1p1/tasmin/tasmin_day_CanESM2_rcp85_r5i1p1_20060101-21001231.nc',
00:35:48    +  'catalog_url': 'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/catalog/birdhouse/cmip5/MRI/rcp85/day/atmos/r1i1p1/tasmin/catalog.xml?dataset=birdhouse/cmip5/MRI/rcp85/day/atmos/r1i1p1/tasmin/tasmin_day_MRI-CGCM3_rcp85_r1i1p1_20960101-21001231.nc',
00:35:48       'experiment': 'rcp85',
00:35:48       'last_modified': 'DATE_TIME_TZ',
00:35:48       'content_type': 'application/netcdf',
00:35:48    -  '_version_': 1599589044577107972,
00:35:48    +  '_version_': 1658705770170023939,
00:35:48       'variable': ['tasmin'],
00:35:48    -  'url': 'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/fileServer/birdhouse/CCCMA/CanESM2/rcp85/day/atmos/r5i1p1/tasmin/tasmin_day_CanESM2_rcp85_r5i1p1_20060101-21001231.nc',
00:35:48    ?                                                                                    ^^^ ^^^^^^^^^                  ^                        ^^^^ ^        ^       ^
00:35:48    +  'url': 'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/fileServer/birdhouse/cmip5/MRI/rcp85/day/atmos/r1i1p1/tasmin/tasmin_day_MRI-CGCM3_rcp85_r1i1p1_20960101-21001231.nc',
00:35:48    ?                                                                                    ^^^^^^ ^^                  ^                       ++++ ^^ ^        ^       ^
00:35:48       'project': 'CMIP5',
00:35:48    -  'institute': 'CCCma',
00:35:48    ?                ^^^^^
00:35:48    +  'institute': 'MRI',
00:35:48    ?                ^^^
00:35:48       'frequency': 'day',
00:35:48    -  'model': 'CanESM2',
00:35:48    +  'model': 'MRI-CGCM3',
00:35:48       'latest': True,
00:35:48    -  'fileserver_url': 'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/fileServer/birdhouse/CCCMA/CanESM2/rcp85/day/atmos/r5i1p1/tasmin/tasmin_day_CanESM2_rcp85_r5i1p1_20060101-21001231.nc'}
00:35:48    ?                                                                                               ^^^ ^^^^^^^^^                  ^                        ^^^^ ^        ^       ^
00:35:48    +  'fileserver_url': 'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/fileServer/birdhouse/cmip5/MRI/rcp85/day/atmos/r1i1p1/tasmin/tasmin_day_MRI-CGCM3_rcp85_r1i1p1_20960101-21001231.nc'}
00:35:48    ?                                                                                               ^^^^^^ ^^                  ^                       ++++ ^^ ^        ^       ^
00:35:48
```
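
For reference, a hedged sketch of the tightened notebook query described in the commit message above, based on the `wps.pavicsearch` call in the failing cell (the birdy client setup and the catalog WPS URL are assumptions, not copied from the notebook):

```python
from birdy import WPSClient

# Hypothetical endpoint; the notebook builds its own `wps` client.
wps = WPSClient("https://pavics.ouranos.ca/twitcher/ows/proxy/catalog/pywps")

# Same query as the failing cell, pinned to one institute/model so the top
# results stay stable across crawls.
resp = wps.pavicsearch(
    constraints="variable:tasmin,project:CMIP5,experiment:rcp85,frequency:day,"
                "institute:CCCma,model:CanESM2",
    limit=10,
    type="File",
)
[result, files] = resp.get(asobj=True)
```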