> Note: This is a new version of a notebook that I originally worked up in a more experimental locale. It now operates using the [pymodelcat](https://github.com/usgs-biolab/pymodelcat) package that I started to house this work and uses a forked [sciencebasepy](https://github.com/skybristol/sciencebasepy/tree/weblink_structured_data_crawler) to handle web link annotation.

Now that we have a [container](https://www.sciencebase.gov/catalog/item/5e8de96182cee42d134687cc) for model items in ScienceBase that we can operate against, it opens up all kinds of interesting possibilities for "robots" to do some work for us. This notebook explores what we might be able to get from the web links added into the mix. Theoretically, those represent a wealth of meaty material to help flesh out the model catalog as a useful resource. We can write some code that can go do some gathering, and then decide if there is anything that we can consistently bring back in to flesh out the items. Because we're doing this in code and this whole thing is experimental, we can be reasonably safe in writing information back into ScienceBase for use and evaluation. We just need to keep track of the parts of ScienceBase Items where we want to "cede control to the robots" and the parts we want to manage in some other way.

There are several different strategies for leveraging links to gather more information. One interesting approach would be to simply index everything we find on the linked landing pages and even spider into their contents like any search engine. I've proposed in the past that ScienceBase could do this writ large, providing directed search engine functionality to go after linked content in meaningful ways.

For this exercise, I'm focusing on a couple ways of fishing for structured metadata. This little robot is essentially the machine we talk about when we say things about "machine-readable" metadata or other content. A couple of potential strategies occur to me:

* Content negotiation - Some web pages and applications accessible over HTTP enable content negotiation, which is a method for the requestor to negotiate the structure or substance of the content that is returned in a response. It can also do things like specify languages that content should be returned in. This is all based on what the server/application providing the content will actually support. You can't force the system to give you something it isn't prepared to deliver. Systems like ScienceBase and some other USGS platforms do support content negotiation with a couple different machine readable options, so it's worth trying to see what we might be able to use.
* Structured metadata - There are a whole variety of ways that web systems have worked out to embed structured, machine-readable metadata within the HTML content delivered through web pages and web apps. Many of these coalesce around the schema.org set of content specifications to give us some degree of consistency to work from. One cool thing about these techniques is that they can be implemented right within the primary vehicle for web content delivery - the web page viewed by humans. Perhaps the coolest thing from a data science perspective is that the specifications and encoding methods really encourage explicit semantics (meaning we don't have to guess about what words mean), use of persistent identifiers and linking to associated registries (so we don't have to guess about disambiguating things), and linking between concepts that relate (so we can go after a wealth of information and tie it all together into a network/graph).
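To make the structured metadata idea a little more concrete, here is a minimal sketch that pulls schema.org JSON-LD blocks out of a page using only the Python standard library. JSON-LD is just one of several embedding methods, the sample HTML is made up for illustration, and the `JsonLdParser` and `extract_jsonld` names are hypothetical; this is not how the notebook's own tooling does the work.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content can arrive in several chunks, so buffer it
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.documents.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks rather than fail the whole page


def extract_jsonld(html):
    """Return a list of JSON-LD documents embedded in an HTML string."""
    parser = JsonLdParser()
    parser.feed(html)
    return parser.documents


# Made-up page content, for illustration only
sample_html = """
<html><head>
<title>Example Model</title>
<script type="application/ld+json">
{"@context": "https://schema.org",
 "@type": "SoftwareApplication",
 "name": "Example Groundwater Model"}
</script>
</head><body></body></html>
"""

for doc in extract_jsonld(sample_html):
    print(doc["@type"], "-", doc["name"])
```

The explicit `@type` and `@context` keys are what make this kind of content attractive: the robot does not have to guess what a given name or description refers to.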

Of these two, structured metadata probably offers the most promise in terms of richness of content, consistency in resulting data, and overall ease of use. Unfortunately, the uptake of these methods in USGS is pretty abysmal, including in ScienceBase where we've not updated our use of schema.org metadata in 8 years or so.

As a last resort, we could fall back on painful web scraping methods where we essentially parse HTML content and try to extract useful information. These are painful because each one would be a mostly custom affair that is probably more trouble than it's worth. I do show a small bit of that here just for demonstration purposes.

In [10]:
from pymodelcat.catbuilder import Catbuilder
from IPython.display import display
import pandas as pd
import qgrid

Most of the functional logic to operate this robot is now housed in the Catbuilder class of the pymodelcat Python package. We instantiate the class as the cb object here and then operate the various functions. I'll explain what the functions do, but please reference the package code as needed.

In [2]:
cb = Catbuilder()

In this codeblock, I use a get_models function to retrieve the model items in the model catalog. Right now, this returns a scaled down data structure that has just the stuff we care about working with most. We'll need to work up some options in this function for future use if we want to continue having a bit of a model-specific abstraction on the ScienceBase Item model via this API.

> After migrating the link annotation code into my fork of sciencebasepy, I reworked the function to essentially work against each ScienceBase Item and its links as opposed to an earlier iteration where I pulled out all unique URLs from the entire collection and then ran those. After thinking through how this system would likely operate in production, working item by item seemed the best overall approach. We would ultimately want to establish some type of registry and associated API for every link check the system ever runs that would be decoupled from the links themselves. The registry would be checked by our code processes (likely operated as lambdas) and only run fresh periodically to check for new information based on some business rules.
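As a rough illustration of the registry idea, the sketch below keeps an in-memory record of when each URL was last checked and applies one simple business rule to decide whether a fresh check is due. Everything here is hypothetical: the `LinkCheckRegistry` class, the 30-day interval, and the example dates are assumptions, and a real registry would sit behind an API with persistent storage.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical business rule: re-check a link at most once every 30 days
RECHECK_INTERVAL = timedelta(days=30)


class LinkCheckRegistry:
    """In-memory stand-in for a registry of link checks, keyed by URL."""

    def __init__(self):
        self._last_checked = {}

    def record(self, url, checked_at):
        """Note that a URL was checked at the given time."""
        self._last_checked[url] = checked_at

    def is_due(self, url, now, interval=RECHECK_INTERVAL):
        """A URL is due if it was never checked or the interval has elapsed."""
        last = self._last_checked.get(url)
        return last is None or now - last >= interval


registry = LinkCheckRegistry()
now = datetime(2020, 5, 1, tzinfo=timezone.utc)  # fixed time for a repeatable example
registry.record("https://water.usgs.gov/software/FourPt/", now - timedelta(days=45))
registry.record("https://water.usgs.gov/ogw/modflow-owhm/", now - timedelta(days=3))

urls = [
    "https://water.usgs.gov/software/FourPt/",
    "https://water.usgs.gov/ogw/modflow-owhm/",
    "https://example.org/never-checked",  # placeholder URL
]
due = [url for url in urls if registry.is_due(url, now)]
print(due)
```

Keeping the registry keyed by URL rather than by item is what decouples the checks from the links themselves, so the same URL appearing on several items would only be fetched once per interval.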

At this point, I'm not concerned about what type of link we're dealing with. I can again come back and work on using the title of the web links (e.g., "Model Reference Link") to make decisions on what to do with what I find. In looking through the links, however, and the information that does come back on some of them, we might need to do a little more work to better classify just what these links mean when it comes to using content from their pages in a meaningful way (more on that later). To give a sense of what's returned here, I output a couple of examples.

In [3]:
models = cb.get_models(fields="title,webLinks")
print(len(models))
display(models[:5])

108


[{'id': '5e8e569482cee42d1348c1d5',
  'title': 'FourPt',
  'webLinks': [{'type': 'webLink',
    'typeLabel': 'Web Link',
    'uri': 'https://water.usgs.gov/software/FourPt/',
    'rel': 'related',
    'title': 'Model Reference Link',
    'hidden': False}]},
 {'id': '5e8e5abe82cee42d1348c1fd',
  'title': 'MODFLOW-OWHM',
  'webLinks': [{'type': 'webLink',
    'typeLabel': 'Web Link',
    'uri': 'https://water.usgs.gov/ogw/modflow-owhm/',
    'rel': 'related',
    'title': 'Model Reference Link',
    'hidden': False}]},
 {'id': '5e8de97682cee42d134687d9',
  'title': 'COAWST',
  'webLinks': [{'type': 'webLink',
    'typeLabel': 'Web Link',
    'uri': 'https://www.usgs.gov/center-news/coupled-ocean-atmosphere-waves-sediment-transport-coawst-modeling-system-training?qt-news_science_products=2#qt-news_science_products',
    'rel': 'related',
    'title': 'Model Reference Link',
    'hidden': False}]},
 {'id': '5e8e59f682cee42d1348c1f6',
  'title': 'MODFE',
  'webLinks': [{'type': 'webLink',
 

With our list of model items, we can now go out and figure out if we have anything interesting to work with. After working up the link examination process in the context of ScienceBase, I set it up to return an "annotation" data structure as a new key in each webLink for a given item. That gives us all the information we have to evaluate right inline with our other webLink information.

The main sciencebasepy.Weblinks class has a set of functions that operate together in a configurable way on webLink records through the following logical components to gather potentially useful data.

* Content negotiation is a whole topic in its own right that could use some careful consideration and more advanced methods than I use here. As I said above, structured metadata holds a lot more interest and potential. In the current iteration, I check for either an application/xhtml+xml or application/json content structure. If valid XML is returned, I use a nice metadata parsing abstraction ([gis-metadata-parser](https://github.com/consbio/gis-metadata-parser)) put together by folks at the Conservation Biology Institute to grab up useful metadata properties from CSDGM, ISO, or ArcGIS metadata. If JSON is available, I just bring back the whole structure (this will need work).
* For the structured metadata part, I've found the [extruct](https://github.com/scrapinghub/extruct) package from the ScrapingHub folks to be one of the most reliable, but it doesn't deal with quite all of the derivations folks have employed on embedding structured metadata in web pages. In this instance, I just try for anything we can get to evaluate for use.
* The final thing I threw in here, as something of a Hail Mary, is a basic web page meta scraper. It parses HTML content using the Python BeautifulSoup package and returns the page title and any named meta tags with content. This can sometimes yield a reasonable description depending on a lot of factors. This is a really terrible way to try and go about things for any kind of consistency given the vagaries of content management systems, legacy content, and all kinds of factors. But it's some other stuff to look at. I put this into a function that could be built upon further if it turns out this is actually a reasonable source to think about. There are also some more robust alternatives to this like the ScrapingHub AutoExtract API that we could think about.
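To give a sense of what the meta scraper in the last item boils down to, here is a minimal sketch that returns the page title and any meta tags that have content. It swaps in the standard library's `html.parser` in place of BeautifulSoup so that it runs without extra dependencies, and it is not the code in sciencebasepy. The `scrape_meta` name, the sample page, and the choice to also accept `property` attributes (where Open Graph tags such as `og:title` usually live) are assumptions.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect the <title> text and meta tags that carry a content attribute."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Assumption: treat Open Graph "property" attributes like "name"
            key = attrs.get("name") or attrs.get("property")
            content = attrs.get("content")
            if key and content:
                self.meta[key] = content

    def handle_data(self, data):
        if self._in_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False


def scrape_meta(html):
    """Return {'title': ..., 'meta': {...}} for an HTML string."""
    parser = MetaTagParser()
    parser.feed(html)
    return {"title": parser.title.strip(), "meta": parser.meta}


# Made-up page content, for illustration only
sample_page = """
<html><head>
<title>Example Model Landing Page</title>
<meta name="description" content="A made-up description for illustration.">
<meta property="og:title" content="Example Open Graph Title">
<meta name="viewport">
</head><body><p>Body text is ignored.</p></body></html>
"""

print(scrape_meta(sample_page))
```

Note that nothing in the output says what `description` actually describes; that is the convention-over-semantics problem with meta tags in a nutshell.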

Running this kind of process in a big loop isn't a great method for some eventual production application. There are all kinds of considerations in doing this kind of thing in terms of having our robots be polite web crawlers, not freaking system administrators out with too many "weird" requests, and optimizing to check back in routinely over time for updated information. We could certainly parallelize this in a number of ways from launching Lambdas on the cloud to multithreading, and that would pull together data nice and quick. For now, working the relatively small number of links through in a loop is fine for demonstration and evaluation purposes.
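One small piece of being a polite crawler is spacing out requests to the same host. The sketch below shows one way that could look; the `PoliteScheduler` class and the two-second default delay are hypothetical and not part of pymodelcat or sciencebasepy. The clock and sleep functions are injectable so the logic can be exercised without actually waiting.

```python
import time
from urllib.parse import urlparse


class PoliteScheduler:
    """Enforce a minimum delay between successive requests to the same host."""

    def __init__(self, min_delay=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}  # host -> time of the most recent request

    def wait_turn(self, url):
        """Block until the URL's host may be contacted; return seconds waited."""
        host = urlparse(url).netloc
        waited = 0.0
        last = self._last_request.get(host)
        if last is not None:
            remaining = self.min_delay - (self._clock() - last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last_request[host] = self._clock()
        return waited


# Usage: call wait_turn() immediately before each fetch, e.g.
#   scheduler = PoliteScheduler(min_delay=2.0)
#   scheduler.wait_turn(web_link["uri"])
#   ...fetch the link...
```

Because the delay is tracked per host, links spread across many different servers would still move along quickly, while a run of links to a single server gets throttled.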

> Note: If you are fiddling with this and want to see what things look like before trying every URL, add something like ```[:5]``` to the end of the "models" list in the function to only send a limited number of items through the process.

For our purposes at this point, we are really just trying to figure out what's useful out of this gathered information. I built a gathering process into Catbuilder that will run a list of model items and return either the ScienceBase Items with included annotation or flatten everything out and return a Pandas dataframe that might be useful for examination.

In [4]:
%%time
annotated_models = cb.annotate_model_links(models=models, output_format="python")

CPU times: user 45.1 s, sys: 1.04 s, total: 46.2 s
Wall time: 5min 7s


Now's the fun part: analyzing the data and seeing if there's anything we want to use. Well, it would be fun, except our results aren't actually all that consistent or robust.

> We have very few cases where we were able to pull any useful structured metadata in at all, so our most promising route for a consistent and powerful method is stymied by the fact that most of the systems behind these links haven't implemented that method. As a little bit of a side tangent, someone really ought to be encouraging this type of thing across the USGS and perhaps leading by example. Beyond supporting what we're trying to do here, these methods would go a long way to improving how USGS content presents itself on the web, how we might influence search rankings, and how we might influence the various knowledge graph efforts toward recognizing USGS as an authority on some subjects.

Content negotiation doesn't give us a whole lot of results either with a couple of notable exceptions. Somewhat by design, anything that points at a DOI link should respond to some type of accept header that will give us DOI metadata as a response.
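For the DOI case, a content negotiation request is just an ordinary HTTP request with an Accept header naming the format we would like back. The sketch below builds such a request against the doi.org resolver asking for CSL JSON, which is one of the media types DOI content negotiation supports. The function names and the placeholder DOI are hypothetical, this is not the code in sciencebasepy, and `fetch_doi_metadata` needs network access and a real DOI, so it is defined here but not called.

```python
import json
import urllib.request

# Citation Style Language JSON, a media type supported by DOI content negotiation
CSL_JSON = "application/vnd.citationstyles.csl+json"


def build_doi_request(doi, accept=CSL_JSON):
    """Build (but do not send) a content negotiation request for a DOI."""
    url = f"https://doi.org/{doi}"
    return urllib.request.Request(url, headers={"Accept": accept})


def fetch_doi_metadata(doi, timeout=10):
    """Send the request and parse the JSON response (requires network access)."""
    with urllib.request.urlopen(build_doi_request(doi), timeout=timeout) as response:
        return json.load(response)


# Placeholder DOI for illustration only; it will not resolve
request = build_doi_request("10.1234/example")
print(request.full_url)
print(request.get_header("Accept"))
```

The server is free to ignore the header, so any production version would still need to check the Content-Type of the response before trusting that it got what it asked for.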

As clunky as it is, meta tag scraping may still be a viable option to try to work from (if we can avoid encouraging bad behavior). Meta tags are really not designed for modern robots as they do not really have capacity for explicit semantics to understand what the intent of the meta tags should be. It's all a matter of convention and usage within a particular context, and that has to be unraveled and dealt with in some fashion.

Beyond the information structure, some of the research questions I would want to pursue include the following:

* Can we legitimately use anything we bring back in these processes in a meaningful way to add more depth to our model catalog?
* Can we use titles from any of these sources to provide more than model acronyms or short names in our ScienceBase Items?
* Do any of the descriptions make sense to serve as an abstract for the concept of the model as cataloged, or are they for some specific aspect of the modeling system?
* Does our initial notional way of type classifying web links (e.g., Model Reference Link) help us in determining what information we can use?
* How might the extended information from model related assets like model output data or software code be encoded into the model items to add value without adding confusion?

To help facilitate looking through the results and set us up for the eventual capability we will want to build onto this, I set up a link_miner() function in Catbuilder. It is really crude at this point, and I know there are more efficient ways of working through the content. I focused on a couple of the more straightforward and content-rich types of structured schema.org content, XML metadata from content negotiation, and the higher level meta tags. The link_miner provides what is essentially a simple table containing a reference back to the model item in ScienceBase from which the links come, the links and link types, and then the type of information, where it comes from in terms of an extraction method, and the content. Some of the content is pretty straightforward in terms of text strings, while for others, I left the basic raw content which could require further processing. For instance, there are lists of author names, which we could attempt to do some further work on.

To facilitate looking at this information, I pull the list of dictionaries from the link_miner for every distinct piece of potentially useful information into a Pandas dataframe and then use the qgrid package to show the data in a filterable table. One thing you can do with this is to filter on "link_classification" to show Model Reference Links only and then filter info_type to select the most likely candidates for improved titles (title, name, and og:title are probably the ones to look at for this use). You can also dump the dataframe to CSV or Excel or some other format for use with other tools.
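The same filter can also be applied in code rather than through the grid. The sketch below works on a plain list of dictionaries shaped like the link_miner output so that it runs without any dependencies. The `link_classification` and `info_type` keys come from the description above, while the `content` key, the sample records, and the `candidate_titles` function are assumptions made for illustration.

```python
# info_type values most likely to hold an improved title
TITLE_INFO_TYPES = {"title", "name", "og:title"}


def candidate_titles(records, link_classification="Model Reference Link"):
    """Keep records from the given link type whose info_type looks like a title."""
    return [
        record
        for record in records
        if record.get("link_classification") == link_classification
        and record.get("info_type") in TITLE_INFO_TYPES
    ]


# Made-up records; the "content" key is a hypothetical column name
sample_records = [
    {"link_classification": "Model Reference Link", "info_type": "og:title",
     "content": "Example Model: A Longer Descriptive Title"},
    {"link_classification": "Model Reference Link", "info_type": "description",
     "content": "A made-up description."},
    {"link_classification": "Some Other Link", "info_type": "title",
     "content": "Title from a different kind of link"},
]

for record in candidate_titles(sample_records):
    print(record["info_type"], "->", record["content"])
```

The filtered list can be handed to `pd.DataFrame` in the same way as the full set of mined records if a table view is still wanted.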

In [5]:
mined_link_info = list()
for model in annotated_models:
    mined_link_info.extend(cb.link_miner(model, output_type="python"))
df_mined_link_info = pd.DataFrame(mined_link_info)

In [9]:
qgrid.show_grid(df_mined_link_info)

QgridWidget(grid_options={'fullWidthRows': True, 'syncColumnCellResize': True, 'forceFitColumns': True, 'defau…

# Next Steps
There's lots more to do here, likely focusing in on the link_miner() function. This is something that we could also eventually abstract up to a higher level with sciencebasepy once we know enough about what we are looking for, where we can likely find the most useful information, and develop whatever tests are necessary to validate that gleaned information is indeed useful. It will be interesting to study just how much context matters in terms of the overall utility of the information in these different kinds of structured metadata content methods. My guess is that it will matter a good deal, and we may need different profiles for different kinds of content or sources of content to aid in determining the best use of the information contained. This is also where the schema.org methods should begin to become much more important (when implemented correctly) as they do help to force a certain degree of foresight into how information will be perceived when taken out of context by pushing for encoding of explicit context into the schema itself.