### Schemas

Federally-hosted schemas? (Where are we getting these objects?)

Linkrot

(The processing code is in the "XML Schema Identification and Statistics" IPython notebook.)


Of 509,608 parsed XML responses, 124,769 contained a schema reference.

We have three sets of schema data: the references as pulled from the XML (noting that the @schemaLocation and @noNamespaceSchemaLocation attributes can hold multiple, space-delimited references), the references as individual URIs (split on whitespace), and a list of federally-hosted schemas.

Note: these were automatically extracted but underwent some minor manual cleanup (in some cases, the space was obviously missing).
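
As a rough illustration of the packed versus unpacked sets, the sketch below pulls both attributes from a document with the standard library's ElementTree and splits each value on whitespace. It is not the extraction code actually used here; the function name `extract_schema_refs` and the sample document (with its example.org namespace) are hypothetical.

```python
import xml.etree.ElementTree as ET

# XML Schema instance namespace, which qualifies both attributes
XSI = 'http://www.w3.org/2001/XMLSchema-instance'
SCHEMA_ATTRS = ('schemaLocation', 'noNamespaceSchemaLocation')


def extract_schema_refs(xml_text):
    """Return (packed, unpacked) schema references found in xml_text."""
    root = ET.fromstring(xml_text)
    packed, unpacked = [], []
    for elem in root.iter():
        for attr in SCHEMA_ATTRS:
            value = elem.get('{%s}%s' % (XSI, attr))
            if value and value.strip():
                packed.append(value.strip())    # the attribute as written
                unpacked.extend(value.split())  # one entry per token
    return packed, unpacked


# hypothetical sample document
sample = (
    '<record xmlns="http://example.org/ns" '
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xsi:schemaLocation="http://example.org/ns ../gmd/gmd.xsd">'
    '<title>sample</title></record>'
)
packed, unpacked = extract_schema_refs(sample)
```

Splitting on whitespace keeps every token, so for @schemaLocation the namespace names land in the unpacked list alongside the schema locations.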

In [1]:
%matplotlib inline
import pandas as pd
import json as js  # name conflict with sqla
import sqlalchemy as sqla
from sqlalchemy.orm import sessionmaker
from IPython.display import display

In [2]:
# load the connection settings for the rds
with open('../local/big_rds.conf', 'r') as f:
    conf = js.load(f)

# our connection
engine = sqla.create_engine(conf.get('connection'))

In [8]:
# the original, extracted schema references
with open('outputs/unique_packed_schemas_sorted.txt', 'r') as f:
    original_schemas = [s.strip() for s in f.readlines() if s.strip()]  # skip blank lines
original_schemas

['../../../OGC/gml/3.2.1/gml.xsd',
 '../../../OGC/iso/19139/20070417/gmd/gmd.xsd',
 '../../../OGC/om/1.0.1/om.xsd',
 '../../../OGC/sweCommon/1.0.2/swe.xsd',
 '../../../general-component-schema.xsd',
 '../../appinfo/2.0/appinfo.xsd',
 '../../common/1.0/common.xsd',
 '../../dei/2012/dei-2012-01-31.xsd',
 '../../general-component-schema.xsd',
 '../../gml/3.1.1/base/gml.xsd',
 '../../structures/2.0/structures.xsd',
 '../../xlink/1.0.0/xlinks.xsd',
 '../common/UBL-CommonBasicComponents-2.0.xsd',
 '../common/UBL-CommonExtensionComponents-2.0.xsd',
 '../common/ws_common_1_0b.xsd',
 '../fgdc-std-001-1998/fgdc-std-001-1998-ann.xsd',
 '../gco/gco.xsd',
 '../general-component-schema.xsd',
 '../gmd/gmd.xsd',
 '../napm/napmCharacterString.xsd',
 '../schema/frame.xsd',
 './aggregateTypes.xsd',
 './basicTypes.xsd',
 './encoding.xsd',
 './simpleTypes.xsd',
 './types.xsd',
 '/data /finance/disclosure/schema/AdminFine.xsd',
 '/data /finance/disclosure/schema/Leadership.xsd',
 '/data /finance/disclosure/

**1,960 unique references.**

In [5]:
# the references split by whitespace
with open('outputs/unique_schemas_sorted.txt', 'r') as f:
    unique_schemas = [s.strip() for s in f.readlines() if s.strip()]  # skip blank lines

unique_schemas

['../../../ISO_19136_Schemas/gml.xsd',
 '../../../OGC/gml/3.2.1/gml.xsd',
 '../../../OGC/iso/19139/20070417/gmd/gmd.xsd',
 '../../../OGC/om/1.0.1/om.xsd',
 '../../../OGC/sweCommon/1.0.2/swe.xsd',
 '../../../general-component-schema.xsd',
 '../../appinfo/2.0/appinfo.xsd',
 '../../common/1.0/common.xsd',
 '../../dei/2012/dei-2012-01-31.xsd',
 '../../gco/gco.xsd',
 '../../general-component-schema.xsd',
 '../../gml/3.1.1/base/gml.xsd',
 '../../gml/gml.xsd',
 '../../gmx/gmx.xsd',
 '../../schema/mets.xsd',
 '../../schema/mods-3-0.xsd',
 '../../schema/odl.xsd',
 '../../schema/xlink.xsd',
 '../../structures/2.0/structures.xsd',
 '../../xlink/1.0.0/xlinks.xsd',
 '../../xlink/xlinks.xsd',
 '../AOML/AEML.xsd',
 '../AOML/AOML.xsd',
 '../common/UBL-CommonBasicComponents-2.0.xsd',
 '../common/UBL-CommonExtensionComponents-2.0.xsd',
 '../common/ws_common_1_0b.xsd',
 '../fgdc-std-001-1998/fgdc-std-001-1998-ann.xsd',
 '../gco/gco.xsd',
 '../general-component-schema.xsd',
 '../gmd/gmd.xsd',
 '../itsoutp

From our unique, space-delimited list, we have **2,366 strings**. Not all of them refer to externally resolvable objects; some are relative paths, FTP links, and so on. File names or relative paths account for **363** (15% of the unpacked strings).
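
A minimal sketch of how the unpacked strings could be sorted into those groups, assuming Python 3 and the standard library's `urlparse`. The function name `classify_reference` and its labels are hypothetical, and this is not necessarily how the counts above were produced; the sample references are taken from the lists shown earlier.

```python
from collections import Counter
from urllib.parse import urlparse


def classify_reference(ref):
    """Label a schema reference by how (or whether) it could be resolved."""
    scheme = urlparse(ref.strip()).scheme.lower()
    if scheme in ('http', 'https'):
        return 'http'
    if scheme == 'ftp':
        return 'ftp'
    if not scheme:
        return 'local'  # bare file name, relative path, or absolute path
    return 'other'      # any other scheme-like prefix


refs = [
    '../gmd/gmd.xsd',
    './types.xsd',
    'ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem.xsd',
    'http://aviationweather.gov/adds/schema/taf1_2.xsd',
]
counts = Counter(classify_reference(r) for r in refs)
```

Only the `http` group is a candidate for a status-code check; the `local` group can never resolve outside the system it was authored on.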

A note on schemas and XML validation.

Command line validation tools such as the [Apache Xerces PParse](http://xml.apache.org/xerces-c-new/pparse.html) require that the schema reference(s) be included in the XML itself, so there's no option to point to a schema URL or file at run time. A schema reference needs to be injected into the XML before executing PParse.

This is also true of more integrated platforms, such as Oxygen. You need to assign a schema to an XML document before it can validate. This action inserts the reference as @schemaLocation or @noNamespaceSchemaLocation. 

So you need the reference *in* the XML for validation. But it doesn't need to be a URL. 
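
To make "injecting a reference" concrete, here is a small sketch using the standard library's ElementTree. The function name `inject_schema_reference`, the sample document, and the `./schemas/record.xsd` path are all hypothetical; this illustrates the idea and is not what PParse or Oxygen do internally.

```python
import xml.etree.ElementTree as ET

XSI = 'http://www.w3.org/2001/XMLSchema-instance'
ET.register_namespace('xsi', XSI)  # serialize the attribute with the xsi prefix


def inject_schema_reference(xml_text, schema_ref, namespace=None):
    """Set a schema reference on the root element and return the new XML."""
    root = ET.fromstring(xml_text)
    if namespace:
        # namespaced documents take "namespace location" pairs
        root.set('{%s}schemaLocation' % XSI, '%s %s' % (namespace, schema_ref))
    else:
        root.set('{%s}noNamespaceSchemaLocation' % XSI, schema_ref)
    return ET.tostring(root, encoding='unicode')


# hypothetical document and local schema path
patched = inject_schema_reference('<record><title>sample</title></record>',
                                  './schemas/record.xsd')
```

The reference here is a local path, which is enough for a validator running on the same machine. Note that ElementTree may rewrite namespace prefixes in documents that declare their own, so a production version would need more care.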

It isn't uncommon, when developing a schema or an XML document, to point to a local XSD for validation. It is often faster: validation against externally-hosted schemas has fairly significant latency, with response times of seconds.

The issue arises when the XML is then published through some platform without the local path being changed to an appropriate URL. Those files can't be validated without modification (injecting a schema reference), and it isn't always clear which schema should be used. In addition, schemas change over time, so it's important that the reference match the one used for the document at the time of publication; otherwise a newer or simply different schema reference might be injected and the validation process returns errors.

And we can see, in these sets, different versions of the schemas identified in the URLs.

In [9]:
# federally hosted schemas
with open('outputs/federal_schemas_sorted.txt', 'r') as f:
    federal_schemas = [s.strip() for s in f.readlines() if s.strip()]  # skip blank lines

federal_schemas

['ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem.xsd',
 'ftp://ftp.ncddc.noaa.gov/pub/Metadata/Online_ISO_Training/Intro_to_ISO/schemas/ISObio/schema.xsd',
 'http://api.echo.nasa.gov/echo/wsdl/EchoForms.xsd',
 'http://aviationweather.gov/adds/schema/aircraftreport1_0.xsd',
 'http://aviationweather.gov/adds/schema/airsigmet1_1.xsd',
 'http://aviationweather.gov/adds/schema/gairmet1_0.xsd',
 'http://aviationweather.gov/adds/schema/pirep1_2.xsd',
 'http://aviationweather.gov/adds/schema/taf1_2.xsd',
 'http://data.nodc.noaa.gov/coris/data/CoRIS/fgdc_schema_coris/fgdc-std-001-1998.xsd',
 'http://data.usgs.gov/nggdpp/NGGDPPMetadataSample_v2.xsd',
 'http://earthquake.usgs.gov/eqcenter/shakemap/xml/schemas/shakemap.xsd',
 'http://earthquake.usgs.gov/shakemap/xml/schemas/shakemap.xsd',
 'http://echo.nasa.gov/v9/echoforms',
 'http://fgdcxml.sourceforge.net/schema/fgdc-std-012-2002/fgdc-std-012-2002.xsd',
 'http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/',
 'http://gcmd.gsfc.nasa.gov/Aboutu

**206 identified as federally-hosted schemas**

In [None]:
# linkrot in the set of unique schema URLs (post-split)
sql = """
select round(status_code, -2) as code,
	count(round(status_code, -2)) as num
from schema_status
group by code
order by code;
"""

df = pd.read_sql(sql, engine)
df
# with pd.option_context('display.max_colwidth', 1000):
#     display(df[:10])

In [4]:
# as a percent of the total number of schemas (2365 unique items)
sql = """
select round(status_code, -2) as code,
	round(count(round(status_code, -2)) / 2365. * 100.0, 2) as pct
from schema_status
group by code
order by code;
"""

df = pd.read_sql(sql, engine)
df

|   | code | pct |
|---|------|-------|
| 0 | 200 | 37.59 |
| 1 | 300 | 15.69 |
| 2 | 400 | 20.63 |
| 3 | 500 | 3.64 |
| 4 | 900 | 22.45 |


Only 53% are accessible (the 200 and 300 buckets combined), bearing in mind that 300 of the references in the 900 bucket are local references and so not resolvable anyway.
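
One caveat on the bucketing: in most SQL dialects `round(status_code, -2)` rounds to the nearest hundred rather than truncating, so a code in the upper half of its class (451, say) would be counted in the next bucket up. Whether any such codes occur in `schema_status` isn't shown here. A minimal sketch of truncating instead, using a hypothetical list of status codes rather than the real table:

```python
from collections import Counter


def bucket_percentages(status_codes):
    """Percent of codes per hundreds bucket, truncating (404 -> 400)."""
    if not status_codes:
        return {}
    buckets = Counter(code // 100 * 100 for code in status_codes)
    total = float(len(status_codes))
    return {bucket: round(count / total * 100.0, 2)
            for bucket, count in sorted(buckets.items())}


# hypothetical status codes, not taken from schema_status
codes = [200, 200, 301, 404, 451, 503, 900, 900]
pcts = bucket_percentages(codes)
```

With integer division, 451 stays in the 400 bucket, whereas rounding to the nearest hundred would move it to 500.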

Let's run a round of counting for federally-hosted ones.

(Just to note, we care about those, some of them more than others, because they are so commonly linked to in the EO-related metadata. While the ISO schemas are hosted on a non-fed server, for a long time that system didn't include a root schema to link the set together for use in something like PParse. One of the NGDC schemas served as this root schema. Also, during the rounds of FGDC to ISO updates, example XSLTs for that process were shared within the community. Those XSLTs often contained an NGDC schema reference, so that became a default.)

In [14]:
from collections import defaultdict
from math import floor

with open('outputs/federal_schema_linkrot.csv', 'r') as f:
    fschemas = [s.strip().split(',') for s in f.readlines() if s.strip()]

# bucket each status code by its hundreds class (404 -> 400)
fed_stats = defaultdict(int)
for schema, status, vdate in fschemas:
    # floor() returns a float under Python 2, hence the 200.0 keys in the output
    code = floor(int(status) / 100.) * 100
    fed_stats[code] += 1

for k, v in fed_stats.items():
    print('{0}: {1} schemas'.format(k, v))

200.0: 98 schemas
400.0: 16 schemas
900.0: 12 schemas
300.0: 78 schemas


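The federally-hosted schemas aren't as likely to be unavailable: 176 of the 204 counted here (86%) fall in the 200 or 300 buckets, compared with 53% of the full set. (The chunk of redirects is almost entirely due to two systems, each with a large number of schemas referenced, multiple versions, etc.)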