WDC GetSeriesCatalogForBox3 provides richer response and more search parameters than GetSeriesCatalogForBox2 #1931

Open
emiliom opened this issue May 31, 2017 · 18 comments

@emiliom
Contributor

emiliom commented May 31, 2017

@kdeloach, I think you (and I) have so far not done much testing on the WDC GetSeriesCatalogForBox3 API query. My "series" queries have focused on GetSeriesCatalogForBox2 (eg, this notebook). But it turns out that GetSeriesCatalogForBox3 has two important advantages:

  • It includes more query parameters (from controlled vocabularies): sampleMedium, dataType and valueType. Out of those, probably only sampleMedium is actually valuable for our likely use cases
  • More importantly, the response is considerably richer in information! That took me by surprise. It includes organization (data service) names, IDs and descriptions, plus other information. So we can do and show more with the response, though it'll also be more verbose.

I ran a test, and GetSeriesCatalogForBox3 seems to work fine, though probably with the same current bugs and keyword limitations as in GetSeriesCatalogForBox2.

cc @aufdenkampe

@emiliom emiliom added the BigCZ label May 31, 2017
@kdeloach
Contributor

Thanks for looking into this. Do we want to display series or sites? Our current implementation uses GetSitesInBox2 combined with GetServicesInBox2 (Source). It doesn't look like it would be difficult to use GetSeriesCatalogForBox3 instead, though.

@emiliom
Contributor Author

emiliom commented May 31, 2017

Ah, I didn't fully catch that. Got it.

Using GetSitesInBox2 combined with GetServicesInBox2 for now seems ok. @aufdenkampe and I should first discuss what should be displayed in the results; depending on what we decide, a series response may make more sense.

BTW, for future consideration: it occurred to me that currently there are only 94 "data services" in WDC (ie, the maximum total that can be returned by GetServicesInBox2), and those change rarely. Maybe a future strategy for better search responsiveness could be to get all services in the US lower 48 (based on a GetServicesInBox2 query) and cache them once a day, then just use that cache together with the actual query results from GetSitesInBox2.

@kdeloach
Contributor

kdeloach commented Jun 1, 2017

Because there are so few services, the GetServicesInBox2 request is very fast. For the moment, this isn't a performance bottleneck. But to avoid making unnecessary requests, and for the reasons you described, it may be a good idea to cache the results anyway. I'll create an issue for this.
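The once-a-day caching idea discussed above could be sketched roughly as follows. This is only an illustration, not the implementation tracked in the follow-up issue: `DailyCache`, its `fetch` callable, and the injectable `clock` are hypothetical names, and the actual GetServicesInBox2 request is left to whatever function is passed in.

```python
import time
from typing import Callable, Optional


class DailyCache:
    """Hold the result of one expensive call for a fixed time-to-live."""

    def __init__(self, fetch: Callable[[], list],
                 ttl_seconds: float = 24 * 60 * 60,
                 clock: Callable[[], float] = time.time):
        self._fetch = fetch          # hypothetical: performs the GetServicesInBox2 request
        self._ttl = ttl_seconds      # default: refresh once per day
        self._clock = clock          # injectable so the cache can be tested without waiting
        self._value: Optional[list] = None
        self._fetched_at: Optional[float] = None

    def get(self) -> list:
        """Return the cached services, refetching only when the TTL has expired."""
        now = self._clock()
        expired = self._fetched_at is None or now - self._fetched_at >= self._ttl
        if expired:
            self._value = self._fetch()
            self._fetched_at = now
        return self._value
```

Each search would then call `cache.get()` and merge the cached services with the per-query results, so the services request is made at most once per TTL period.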

@kdeloach
Contributor

kdeloach commented Jun 1, 2017

One advantage I see in using GetSeriesCatalogForBox3 is that the results contain conceptKeyword which we can use for client-side filtering. Neither GetServicesInBox2 nor GetSitesInBox2 exposes this field.

@emiliom
Contributor Author

emiliom commented Jun 1, 2017

One advantage I see in using GetSeriesCatalogForBox3 is that the results contain conceptKeyword which we can use for client-side filtering.

Just for reference, this is a broader difference between the GetSeriesCatalogForBox* queries and your current approach; it's not a unique feature of GetSeriesCatalogForBox3 per se.

A GetSeriesCatalogForBox* request that doesn't specify a keyword will get multiple series per site. Client-side processing could be used to make the response more user friendly (I think) by grouping the different series into a single site "dataset" record. But that's getting into details that Anthony and I should probably discuss first.

@aufdenkampe
Member

​Yes! I agree with Emilio that we want to focus on the information in the GetSeriesCatalogForBox*, but group by Sitename. He and I should discuss this first, and soon.

The hierarchy of this information model here is that:

  • A Service can have many Sites
  • A Site can have many Series
  • A Series has one and only one Variable (VarCode & VarName), along with its associated metadata (beginDate, samplemedium, Speciation, MethodDesc, etc.).

We want to organize our returns by Sites, but display (and search) the info on all the Series at each Site.
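For readers who prefer code, the Service → Site → Series hierarchy above can be written down as a minimal set of Python dataclasses. The class and attribute names here are illustrative (they mirror, but are not, the WDC field names), and only a few of the per-series metadata fields are included.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Series:
    """One variable measured at one site, with its own metadata."""
    var_code: str
    var_name: str
    begin_date: str = ""
    sample_medium: str = ""


@dataclass
class Site:
    """A site can have many series."""
    location: str
    site_name: str
    series: List[Series] = field(default_factory=list)


@dataclass
class Service:
    """A service can have many sites."""
    serv_code: str
    sites: List[Site] = field(default_factory=list)


def variables_at(site: Site) -> List[str]:
    """List the variable names of every series recorded at a site."""
    return [s.var_name for s in site.series]
```

Results would be organized as `Site` objects, while display and search walk the `series` list of each site.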

@emiliom
Contributor Author

emiliom commented Jun 2, 2017

FYI, especially for Anthony: I've updated the notebook gist I mentioned at the start of this issue to include examples from both GetSeriesCatalogForBox2 and GetSeriesCatalogForBox3.

@aufdenkampe
Member

aufdenkampe commented Jun 7, 2017

@kdeloach and @ajrobbins , I had a conversation with @emiliom on Friday where we carefully explored the best approach for searching CUAHSI Water Data Center (WDC) and compiling the output provided to the user. We used these resources to inform our discussion:

The approach below resolves the open question about whether to use GetSitesInBox2 or GetSeriesCatalogForBox3. See #1858 and #1931 for more details. This will allow us to move forward on the CUAHSI WDC #1945. @emiliom can add where necessary.

A. These are the GET requests that Azavea should use:

  1. Azavea should run a GetSeriesCatalogForBox2 GET request each time a user does a dataset search

    • xmin, xmax, ymin, ymax as required arguments
    • beginDate, endDate as optional arguments
    • don't use additional arguments
    • Note: we presently prefer this over GetSeriesCatalogForBox3 because we're worried about query response time, and Box2 seems to provide just the right amount of metadata. (Note: While GetSeriesCatalogForBox2 does not have VariableUnitsAbbrev and GetSeriesCatalogForBox3 does, we're confident this will not be a problem because we only need units info once we fetch the actual data values. Let's reevaluate after we start fetching data.)
  2. Azavea should run a GetServicesInBox2 GET request once per day (or week?) for the entire world, and save the results to be merged with the returned results from GetSeriesCatalogForBox2 GET requests. This is captured in issue #1932, "Cache CUAHSI GetServicesInBox2 results".

  3. Azavea should NOT run any GetSitesInBox2* GET requests, because all relevant site info is included in each GetSeriesCatalogForBox2 GET request.
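As a rough sketch of request A.1, the query string could be assembled like this. The parameter names come from the list above; the base URL path (`/webservices/hiscentral.asmx`) and the function name are assumptions that should be checked against the WDC catalog documentation. It is also possible that the service expects further parameters to be present (for example as empty strings), which this sketch does not handle.

```python
from urllib.parse import urlencode

# Assumed base URL of the WDC catalog web service; verify before relying on it.
HIS_CENTRAL = "http://hiscentral.cuahsi.org/webservices/hiscentral.asmx"


def series_catalog_url(xmin, xmax, ymin, ymax, begin_date=None, end_date=None):
    """Build a GetSeriesCatalogForBox2 GET URL for a bounding box.

    The bounding box is required; beginDate and endDate are added only
    when given. No other arguments are sent.
    """
    params = {"xmin": xmin, "xmax": xmax, "ymin": ymin, "ymax": ymax}
    if begin_date is not None:
        params["beginDate"] = begin_date
    if end_date is not None:
        params["endDate"] = end_date
    return f"{HIS_CENTRAL}/GetSeriesCatalogForBox2?{urlencode(params)}"
```

The function only builds the URL; issuing the request and parsing the XML response are left out on purpose.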

B. Azavea should develop a means to combine and filter GET request results in the following ways.

  1. Combine/merge all records by site ('location' code) from a GetSeriesCatalogForBox2 GET request.
    a. These fields, below, will be the same for each SeriesRecord with the same 'location' code

    • ServCode
    • ServURL
    • location
    • Sitename
    • latitude
    • longitude

    b. These fields, below, will be different for each SeriesRecord with the same 'location' code, and will be grouped by VarCode

    • VarCode
    • VarName
    • beginDate
    • endDate
    • ValueCount
    • datatype
    • valuetype
    • samplemedium
    • timeunits
    • conceptKeyword
    • genCategory
    • TimeSupport
  2. Append each of these new SiteRecords (created above in B1) with the associated ServicesRecord metadata that was saved from A2, above, using the ServCode / ServiceID (there's a 1:1 map for these).
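Steps B.1 and B.2 might look something like the sketch below, assuming each SeriesRecord has already been parsed into a dict keyed by the field names listed above. The function name, the `series` and `service` keys on the output, and the `services_by_code` lookup (cached GetServicesInBox2 records keyed by ServCode) are all hypothetical.

```python
SITE_FIELDS = ("ServCode", "ServURL", "location", "Sitename",
               "latitude", "longitude")
SERIES_FIELDS = ("VarCode", "VarName", "beginDate", "endDate", "ValueCount",
                 "datatype", "valuetype", "samplemedium", "timeunits",
                 "conceptKeyword", "genCategory", "TimeSupport")


def group_series_by_site(series_records, services_by_code=None):
    """Merge flat SeriesRecords into one SiteRecord per 'location' code."""
    services_by_code = services_by_code or {}
    sites = {}  # location code -> SiteRecord; dicts keep insertion order
    for rec in series_records:
        loc = rec["location"]
        if loc not in sites:
            # B.1.a: site-level fields are identical for every series at a site
            site = {name: rec.get(name) for name in SITE_FIELDS}
            site["series"] = {}
            # B.2: attach the cached service metadata for this ServCode, if any
            site["service"] = services_by_code.get(rec.get("ServCode"))
            sites[loc] = site
        # B.1.b: series-level fields vary per record and are grouped by VarCode
        series = {name: rec.get(name) for name in SERIES_FIELDS}
        sites[loc]["series"].setdefault(rec.get("VarCode"), []).append(series)
    return list(sites.values())
```

Keeping a list per VarCode avoids silently dropping a record if the same variable appears more than once at a site.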

C. Later on, Azavea should develop a client-side means for filtering SiteRecords via a free-text search of all the terms in all the fields of the combined, hierarchical Site+Service(s)+Series result. Described in #1936.
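One simple way to approach the free-text filter in C is to flatten every value in a combined record into a single lowercase string and require that all query terms appear in it. The sketch below is in Python for consistency with the notebook, although the real filter would presumably live in the client; the function names and the nested dict/list record shape are assumptions.

```python
def iter_text(value):
    """Yield every scalar found in nested dicts/lists as a lowercase string."""
    if isinstance(value, dict):
        for item in value.values():
            yield from iter_text(item)
    elif isinstance(value, (list, tuple)):
        for item in value:
            yield from iter_text(item)
    elif value is not None:
        yield str(value).lower()


def filter_sites(site_records, query):
    """Keep the records whose combined text contains every query term."""
    terms = query.lower().split()
    matches = []
    for site in site_records:
        haystack = " ".join(iter_text(site))
        if all(term in haystack for term in terms):
            matches.append(site)
    return matches
```

This is plain substring matching with AND semantics; an empty query returns every record.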

D. We will likely want to develop constructors to build (and expose) user-friendly URLs for Services and selected Sites, which resolves the questions in #1859.

  1. Service URL. The ServURL from GetServicesInBox2 does not provide a web-friendly URL (e.g. for ServCode = "NWISDV", ServURL = "http://hydroportal.cuahsi.org/nwisdv/cuahsi_1_1.asmx"). However, a friendly URL can easily be constructed from ServiceID (which has a 1:1 map with ServCode) by following this pattern: http://hiscentral.cuahsi.org/pub_network.aspx?n=1
  2. Site URLs can be similarly constructed for some Services, such as https://waterdata.usgs.gov/nwis/uv/?site_no=14113000, when ServCode = "NWISDV" and location = "NWISDV:14113000".
    * We would create a Site URL constructor for a handful of important Services, such as USGS NWIS and Data.EnviroDIY. Let's start with ServCode = "NWISDV" as an example. We will not explore Site URLs from other services for now, but the list will expand in time.
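The two URL constructors in D could be sketched as below. The URL patterns are the ones quoted above; the function names and the template table are hypothetical, and only NWISDV is registered, since that is the one service named as a starting point.

```python
from typing import Optional

SERVICE_URL = "http://hiscentral.cuahsi.org/pub_network.aspx?n={service_id}"

# Site URL templates keyed by ServCode; more services can be added over time.
SITE_URL_TEMPLATES = {
    "NWISDV": "https://waterdata.usgs.gov/nwis/uv/?site_no={site_no}",
}


def service_page_url(service_id: int) -> str:
    """Human-readable service page built from the ServiceID."""
    return SERVICE_URL.format(service_id=service_id)


def site_page_url(serv_code: str, location: str) -> Optional[str]:
    """Site page URL, or None for services without a registered template."""
    template = SITE_URL_TEMPLATES.get(serv_code)
    if template is None:
        return None
    # A location code looks like "NWISDV:14113000": ServCode, colon, site number.
    prefix, _, site_no = location.partition(":")
    if prefix != serv_code or not site_no:
        return None
    return template.format(site_no=site_no)
```

Returning `None` for unknown services lets the caller fall back to showing the site name without a link.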

kdeloach pushed a commit that referenced this issue Jun 12, 2017
Replace `GetSitesInBox2` with `GetSeriesCatalogForBox2` based on
feedback from #1931.

Connects #1858
kdeloach pushed a commit that referenced this issue Jun 13, 2017
Replace `GetSitesInBox2` with `GetSeriesCatalogForBox2` based on
feedback from #1931.

Connects #1858
@kdeloach
Contributor

These changes have been implemented in PR #1959. Check out the screenshots to compare the differences.

Notes:

GetSeriesCatalogForBox2 produces a greater volume of results, but the amount of metadata available hasn't increased much compared to using GetSitesInBox2. The only fields common to series records are: ServCode, ServURL, Sitename, location, latitude, longitude, beginDate, and endDate. These are the fields we will expose from our API.

We still don't have access to these fields for each resource:

  • author
  • created date (currently, this field is populated with beginDate-- not sure if this is correct)
  • updated date

This is a known issue, but the beginDate and endDate filters don't seem to do anything. I get the same results no matter which dates I try.

We can dynamically generate URLs for each resource, if necessary. However, we don't need to generate URLs for services, since that is already available from the ServiceDescriptionURL field from GetServicesInBox2.

@emiliom
Contributor Author

emiliom commented Jun 20, 2017

GetSeriesCatalogForBox2 produces a greater volume of results,

Yes, that's expected, as Anthony has mentioned above.

but the amount of metadata available hasn't increased much, compared to using GetSitesInBox2

There's much more metadata coming in! See @aufdenkampe's comment above. Maybe what you mean is that there isn't much more metadata within the subset defined in your common dataset record metadata? Assuming my interpretation is correct, I guess that would be true because that dataset record metadata did not encompass the additional information Anthony listed in B.1.b of that comment above (except for beginDate and endDate).

We still don't have access to these fields for each resource:

  • author
  • created date (currently, this field is populated with beginDate-- not sure if this is correct)
  • updated date

These do not exist in the WDC response, per se. But that should be ok.

Depending on how author is used, the service provider (derived from ServCode and the results of GetServicesInBox2) could be used for it.

This is a known issue, but the beginDate and endDate filters don't seem to do anything. I get the same results no matter which dates I try.

Ok. I guess this depends on the roll-out of the fixed WDC Catalog API, which hadn't been released as of June 7.

@emiliom
Contributor Author

emiliom commented Jun 20, 2017

With Kevin gone, I don't know if @rajadain is now automatically pinged. So I'm pinging him here.

@rajadain
Member

Thanks for pinging me @emiliom, I'll subscribe to all issues created so far so that I'm notified. I'll go through the discussion and respond here shortly.

@emiliom
Contributor Author

emiliom commented Jun 20, 2017

After reading through the comments in #1959, I think it's clearer that we have a misunderstanding about what Anthony's and my intent was. It's clear that in that PR, the metadata specific to a "series" (which is sort of a synonym for "variable") is thrown out.

Anthony and I will submit a much more specific request/recommendation for what should be shown in the WDC dataset record boxes on the UI.

@rajadain
Member

Thanks, we'll wait for that.

@emiliom
Contributor Author

emiliom commented Jun 21, 2017

Just a couple of references, for future use:

  • A Jupyter notebook I wrote (and updated in early June) that illustrates calls to and responses from GetSeriesCatalogForBox2 and GetSeriesCatalogForBox3
  • A sample link to the human-readable (though not terribly nice) page for a WDC service (USGS NWISDV): http://hiscentral.cuahsi.org/pub_network.aspx?n=1, where the n value is the service or source ID, SourceId. The information on this page should be the same information available from a GetServices* request (eg, GetServicesInBox2); Anthony mentioned this in his long comment.

@aufdenkampe
Member

aufdenkampe commented Jun 26, 2017

@rajadain, we just created the Sample_WDC_Site_Record_BiGCZPortal_SearchResult Google Doc to provide an example record to display.

In brief, output should look like this:

NWISDV:14113000
KLICKITAT RIVER NEAR PITT, WA. Observations on SurfaceWater, Air.
Variables: Discharge, stream – Temperature, air
From U.S. Geological Survey (USGS) NWISDV web service.
Date range for site: 1909-07-01 to 2017-01-26.

Which would be constructed from this set of responses:

<location>
<Sitename>[https://waterdata.usgs.gov/nwis/uv/?site_no=14113000]. Observations on <samplemedium 1>[, samplemedium 2, samplemedium 3, …]. 
Variables: <conceptKeyword 1> - <conceptKeyword 2> - …
From [<GSERV:SourceOrg> - <ServCode>](<GSERV:http://hiscentral.cuahsi.org/pub_network.aspx?n=SourceId>) web service.
Date range for site: <series Min beginDate> to <series Max endDate>.

Please see the Google Doc for better formatting and additional info.
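To make the template above concrete, here is a minimal sketch of how the plain-text version of such a record might be assembled (links omitted). It assumes the series for one site are available as dicts keyed by the WDC field names, that dates are ISO-formatted strings (so string comparison orders them correctly), and that the organization name comes from the matching GetServicesInBox2 record. The function names are hypothetical.

```python
def unique(values):
    """Drop empty values and duplicates while keeping first-seen order."""
    seen = []
    for value in values:
        if value and value not in seen:
            seen.append(value)
    return seen


def format_site_record(location, sitename, serv_code, source_org, series):
    """Render one site and its series as the five-line summary."""
    media = unique(s.get("samplemedium") for s in series)
    keywords = unique(s.get("conceptKeyword") for s in series)
    # Assumes ISO date strings, for which min()/max() give earliest/latest.
    begin = min(s["beginDate"] for s in series)
    end = max(s["endDate"] for s in series)
    return "\n".join([
        location,
        f"{sitename}. Observations on {', '.join(media)}.",
        f"Variables: {' – '.join(keywords)}",
        f"From {source_org} {serv_code} web service.",
        f"Date range for site: {begin} to {end}.",
    ])
```

Deduplicating samplemedium and conceptKeyword keeps the summary short when several series at a site share the same medium or keyword.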

@rajadain
Member

Thanks @aufdenkampe. I just tried out GetSeriesCatalogForBox3 and GetServicesInBox2, and was able to confirm that the data format you suggest is derivable.

Since this set of information isn't a lot, were you thinking this to be in the "list" view or the "detail" view? It could very well fit in a list view.

I was unable to find examples listing multiple conceptKeywords or samplemediums, at least in the samples used in @emiliom's Jupyter notebook. Is separating them with a "–" common practice, or simply an alternative given that "," is included in the value?

And, just to confirm, the site name links should only be generated for those that have ServCode = NWISDV? Other ServCodes I encountered were GLDAS_NOAH, NLDAS_NOAH, and MOPEX.

One potential use of "variable linking" would be to filter results by the clicked variable. Other ideas may present themselves as we progress further along the implementation.

@emiliom
Contributor Author

emiliom commented Jun 27, 2017

I was unable to find examples listing multiple conceptKeywords or samplemediums, at least in the samples used in @emiliom's Jupyter notebook.

Try this: NWISUV:01474500. It has a suite of water quality sensors (pH, oxygen, turbidity), in addition to discharge, plus rainfall. It should give you a diverse set of results for testing. Plus it's in the Azavea neighborhood: USGS 01474500 Schuylkill River at Philadelphia, PA. You can browse it in "my" Monitor-My-Watershed pilot application:
http://www.wikiwatershed-vs.org/Explorer?action=oiw:fixed_platform:USGS_01474500

If you'd like to examine it using my Jupyter notebook, these request parameters worked for me:

bbox1 = (39.9, -75.2, 40.0, -75.1)
keyword = ''
start_date = '01/01/2016'
end_date   = '12/31/2016'

Is separating them with a "–" common practice, or simply an alternative given that "," is included in the value?

Neither here nor there. It just seems more obvious than a comma, plus some of the "variable" (conceptKeywords) strings include commas.

And, just to confirm, the site name links should only be generated for those that have ServCode = NWISDV?

Yes, but I believe you should also add NWISUV, and possibly NWISGW (both USGS services). No other services, for now.
