### ISO Specific Information

Mostly related to how version is represented and rolecode details. 

Are the rolecode URLs still available? How many are from an unexpected source? 

Do any responses reference code lists from different sources? 

How many responses (or host systems) contain a reference to a codelist that's no longer available? 

And what informational content will be affected, i.e., are the spatial reference blocks more likely to reference a dead codelist link?


How are ISO versions represented? Is it clear enough to understand how it might be structured?

So, no. What matters for parsing and reading the XML is the schema definition, not the version string. SMAP is interesting in that it has its own mimetype (unnecessary) but, I think, no custom schema: it is a standardized way of using ISO elements for SMAP that reduces variability. It should just be treated as ISO and validated that way; knowing it's SMAP doesn't change anything *if* a client is built to the spec rather than to the server (haha. hahaha.). The version string does give a bit of a clue about where other, similar uses of ISO show up. INSPIRE is different, if I have that right: it adds a namespace, so a record is validated both as ISO and against the INSPIRE structure.


In [1]:
%matplotlib inline
import pandas as pd
import json as js  # name conflict with sqla
import sqlalchemy as sqla
from sqlalchemy.orm import sessionmaker
from IPython.display import display

In [2]:
# grab the clean text from the rds
with open('../local/big_rds.conf', 'r') as f:
    conf = js.loads(f.read())

# our connection
engine = sqla.create_engine(conf.get('connection'))

In [None]:
df = pd.read_sql(sql, engine)
with pd.option_context('display.max_colwidth', 1000):
    display(df[:10])

### Versions

Where to find ISO version information:



| Standard Name                                                       |  Version                                                                | 
|---------------------------------------------------------------------|-------------------------------------------------------------------------| 
| //*/metadataStandardName/CharacterString                            |  //*/metadataStandardVersion/CharacterString                            | 
| //*/seriesMetadata/MD_Metadata/metadataStandardName/CharacterString |  //*/seriesMetadata/MD_Metadata/metadataStandardVersion/CharacterString | 
| //*/seriesMetadata/MI_Metadata/metadataStandardName/CharacterString |  //*/seriesMetadata/MI_Metadata/metadataStandardVersion/CharacterString | 
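The lookups in the table can be sketched in plain Python. This is a minimal illustration, not the extraction code used to build `iso_versions`: the function names and the sample record are hypothetical, it is written for Python 3, and it matches on local element names so the unqualified paths above work regardless of prefix. It takes whatever text sits under the element, which in the table above is a `CharacterString`.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit('}', 1)[-1]


def find_versions(xml_text):
    """Return one (standard_name, standard_version) pair per MD/MI_Metadata element."""
    root = ET.fromstring(xml_text)
    pairs = []
    for elem in root.iter():
        if local_name(elem.tag) not in ('MD_Metadata', 'MI_Metadata'):
            continue
        values = {}
        # direct children only, so a record nested in seriesMetadata is reported separately
        for child in elem:
            name = local_name(child.tag)
            if name in ('metadataStandardName', 'metadataStandardVersion'):
                text = ''.join(child.itertext()).strip()
                values[name] = text or None
        pairs.append((values.get('metadataStandardName'),
                      values.get('metadataStandardVersion')))
    return pairs


# hypothetical minimal record, for illustration only
SAMPLE = """<gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd"
                 xmlns:gco="http://www.isotc211.org/2005/gco">
  <gmd:metadataStandardName><gco:CharacterString>ISO 19115</gco:CharacterString></gmd:metadataStandardName>
  <gmd:metadataStandardVersion><gco:CharacterString>2003</gco:CharacterString></gmd:metadataStandardVersion>
</gmd:MD_Metadata>"""

print(find_versions(SAMPLE))  # [('ISO 19115', '2003')]
```

A record that omits either element comes back with `None` in that slot, which keeps missing version info visible instead of silently dropping the record.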


This is just the version info for single metadata records (not the DS proper, but the MI/MD contained by a DS). [Speculation] This bit of information is not often included in the CSW or OAI-PMH getrecords outputs (those never come through as valid ISO, and are certainly not complete ISO compared to what can be accessed from other parts of those larger portal systems). Whether that matters depends on whether we think CSW or OAI-PMH is functional (slow, so painfully slow in current implementations). They do not look like first-class dev efforts: the system must have them, but they are not key to it, so they are not well built and not well maintained. But, back to one of the earlier points: we did not validate any of those internal ISO records (and we do have some) and, as noted previously, the specs for XML validation allow for nested validation but that is not implemented. PParse, part of the Apache Xerces package, is (it has been a while since I went through that evaluation) one of the few free options available, and it does not support that kind of validation.

In [7]:
# the set of unique version strings

sql = """
select distinct standard_name, standard_version
from iso_versions
order by standard_name, standard_version;
"""

df = pd.read_sql(sql, engine)
with pd.option_context('display.max_rows', 75):
    display(df[:61])

Unnamed: 0,standard_name,standard_version
0,ANZLIC Metadata Profile: An Australian/New Zea...,1.1
1,DM - Regole tecniche RNDT,10 novembre 2011
2,Gemini,2.1
3,GEMINI,2.1
4,INSPIRE Implementing Rules for Metadata,1.2
5,ISO19115,1.0
6,ISO19115,2003/Cor.1:2006
7,ISO 19115,2003
8,ISO 19115,ISO 19139 / DKRZ ISO Simple Profile V1.0
9,ISO 19115,Nederlandse metadata profiel op ISO 19115 voor...


In [11]:
# let's go with pct of all iso per version string, out of 19,689
sql = """
with j as
(
	with i as
	(
		select d.response_id, jsonb_array_elements(d.identity::jsonb) ident
		from identities d
		where d.identity is not null
	), a as (
		select count(*) as total
		from responses r join i on i.response_id = r.id
		where i.ident->>'protocol' = 'ISO'
	)

	select r.id, a.total
	from responses r join i on i.response_id = r.id
		natural inner join a 
	where i.ident->>'protocol' = 'ISO'
), v as 
(
	select response_id,
		standard_name || '; ' || standard_version as the_version
	from iso_versions
)

select v.the_version, count(v.response_id) as num,
	round(count(v.response_id) / max(j.total)::numeric * 100.0, 2) as pct_of_iso
from responses r join v on v.response_id = r.id
	join j on j.id = r.id
group by v.the_version
order by pct_of_iso DESC;
"""

df = pd.read_sql(sql, engine)
with pd.option_context('display.max_rows', 55, 'display.max_colwidth', 500):
    display(df[:54])
    
# the sum is 200 shy 
# TODO: verify that count (not from versions unless there're duplicate entries?)

Unnamed: 0,the_version,num,pct_of_iso
0,ISO 19115-2 Geographic Information - Metadata - Part 2: Extensions for Imagery and Gridded Data; ISO 19115-2:2009(E),6666,33.86
1,ISO 19115 Geographic Information - Metadata; 2009-02-15,2516,12.78
2,ISO 19115-2 Geographic Information - Metadata Part 2 Extensions for imagery and gridded data; ISO 19115-2:2009(E),1503,7.63
3,ISO19115; 2003/Cor.1:2006,1470,7.47
4,ISO-USGIN; 1.2,1241,6.3
5,ISO 19115; 2003,1179,5.99
6,ISO-NAP-USGIN; 1.1.4,1146,5.82
7,ISO 19115-2 Geographic Information - Metadata Part 2 Extensions for Imagery and Gridded Data; ISO 19115-2:2009(E),1074,5.45
8,MEDIN Discovery Metadata Standard; Version 2.3.5,411,2.09
9,ISO 19139/19119 Metadata for Web Services; 2005,399,2.03
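Rows 0, 2 and 7 above differ only in casing and punctuation. One possible way to see how much of the spread is cosmetic is to collapse those differences before counting. The sketch below is an assumption about how one might do that, not something the queries above do; `normalize_version` and `merge_counts` are hypothetical names, and the code is written for Python 3.

```python
import re
from collections import Counter


def normalize_version(label):
    """Collapse case, punctuation and whitespace differences in a version label."""
    label = label.lower()
    label = re.sub(r'[^a-z0-9]+', ' ', label)  # punctuation and runs of spaces -> one space
    return label.strip()


def merge_counts(rows):
    """rows: iterable of (the_version, num); returns counts keyed by normalized label."""
    merged = Counter()
    for label, num in rows:
        merged[normalize_version(label)] += num
    return merged


# the three 19115-2 variants from the output above
rows = [
    ('ISO 19115-2 Geographic Information - Metadata - Part 2: Extensions for Imagery and Gridded Data; ISO 19115-2:2009(E)', 6666),
    ('ISO 19115-2 Geographic Information - Metadata Part 2 Extensions for imagery and gridded data; ISO 19115-2:2009(E)', 1503),
    ('ISO 19115-2 Geographic Information - Metadata Part 2 Extensions for Imagery and Gridded Data; ISO 19115-2:2009(E)', 1074),
]
merged = merge_counts(rows)
print(len(merged), sum(merged.values()))  # 1 9243
```

This deliberately does not merge labels that differ in spacing inside a token (`ISO19115` vs `ISO 19115`); whether those should be treated as the same standard is a judgment call this sketch leaves open.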


In [None]:
# pct of the iso responses, ordered by number of responses for the version string,
# per host

sql = """
with j as
(
	with i as
	(
		select d.response_id, jsonb_array_elements(d.identity::jsonb) ident
		from identities d
		where d.identity is not null
	), a as (
		select r.host, count(r.id) as num_per_host
		from responses r join i on i.response_id = r.id
		where i.ident->>'protocol' = 'ISO'
		group by r.host
	)

	select r.id, r.host, a.num_per_host
	from responses r join i on i.response_id = r.id
		join a on a.host = r.host
	where i.ident->>'protocol' = 'ISO'
), v as 
(
	select response_id,
		standard_name || '; ' || standard_version as the_version
	from iso_versions
)

select v.the_version, j.host, --count(j.id) as num_iso_responses,
	round(count(j.id) / max(j.num_per_host)::numeric * 100.0, 2) as pct_of_iso
from responses r join v on v.response_id = r.id
	join j on j.id = r.id
group by v.the_version, j.host
--order by j.host, pct_of_iso DESC;
order by count(v.response_id) DESC;
"""
    
df = pd.read_sql(sql, engine)

In [8]:
with pd.option_context('display.max_rows', 150):
    display(df[:145])

Unnamed: 0,the_version,host,pct_of_iso
0,ISO 19115-2 Geographic Information - Metadata ...,catalog.data.gov,76.81
1,ISO 19115 Geographic Information - Metadata; 2...,meta.geo.census.gov,100.0
2,ISO-USGIN; 1.2,repository.stategeothermaldata.org,51.09
3,ISO 19115; 2003,www.usgs.gov,99.83
4,ISO-NAP-USGIN; 1.1.4,repository.stategeothermaldata.org,48.91
5,ISO 19115-2 Geographic Information - Metadata ...,data.noaa.gov,86.41
6,ISO 19115-2 Geographic Information - Metadata ...,www.ngdc.noaa.gov,99.7
7,ISO 19115 Geographic Information - Metadata; 2...,catalog.data.gov,16.13
8,ISO 19115-2 Geographic Information - Metadata ...,inport.nmfs.noaa.gov,100.0
9,ISO19115; 2003/Cor.1:2006,opendata.euskadi.eus,100.0


In [12]:
sql = """
with j as
(
	with i as
	(
		select d.response_id, jsonb_array_elements(d.identity::jsonb) ident
		from identities d
		where d.identity is not null
	), a as (
		select r.host, count(r.id) as num_per_host
		from responses r join i on i.response_id = r.id
		where i.ident->>'protocol' = 'ISO'
		group by r.host
	)

	select r.id, r.host, a.num_per_host
	from responses r join i on i.response_id = r.id
		join a on a.host = r.host
	where i.ident->>'protocol' = 'ISO'
), v as 
(
	select response_id,
		standard_name || '; ' || standard_version as the_version
	from iso_versions
)

select v.the_version, j.host, --count(j.id) as num_iso_responses,
	round(count(j.id) / max(j.num_per_host)::numeric * 100.0, 2) as pct_of_iso
from responses r join v on v.response_id = r.id
	join j on j.id = r.id
group by v.the_version, j.host
--order by j.host, pct_of_iso DESC;
order by j.host, pct_of_iso DESC;
"""
    
df = pd.read_sql(sql, engine)
with pd.option_context('display.max_rows', 150, 'display.max_colwidth', 400):
    display(df[:145])

Unnamed: 0,the_version,host,pct_of_iso
0,ISO 19115-2; ISO 19115-2:2009(E),arcticlcc.org,100.0
1,ISO 19115-2 Geographic Information - Metadata Part 2 Extensions for Imagery and Gridded Data; ISO 19115-2:2009(E),bluehub.jrc.ec.europa.eu,100.0
2,ISO 19115-2 Geographic Information - Metadata - Part 2: Extensions for Imagery and Gridded Data; ISO 19115-2:2009(E),catalog.data.gov,76.81
3,ISO 19115 Geographic Information - Metadata; 2009-02-15,catalog.data.gov,16.13
4,ISO 19115:2003/19139; 1.0,catalog.data.gov,1.72
5,ISO 19115 Geographic information - Metadata; ISO 19115:2003(E),catalog.data.gov,1.51
6,ISO 19115 Geographic Information - Metadata; ISO 19115,catalog.data.gov,1.31
7,ISO-USGIN; 1.2,catalog.data.gov,0.9
8,ISO 19115-2 Geographic Information - Metadata Part 2 Extensions for imagery and gridded data; ISO 19115-2:2009(E),catalog.data.gov,0.76
9,ISO 19115 Geographic information - Metadata - Converted from Data.gov legacy DMS format.; ISO 19115:2003(E),catalog.data.gov,0.45


### Codelists


In [13]:
# how many unique codelist references (before the hash only)

sql = """
with a as
(
	select replace(left(codelist, strpos(codelist, '#')), '#', '') as codelist
	from codelists
)
select distinct codelist
from a
where codelist != ''
order by codelist;
"""

df = pd.read_sql(sql, engine)
with pd.option_context('display.max_rows', 60, 'display.max_colwidth', 400):
    display(df[:52])

Unnamed: 0,codelist
0,file://Agi-s000001/afdelingen/GSMS/Extern/Applicaties_en_Tools/Databeheer/Metadata/GeoMDE_22/resources/customCodelists_new.xml
1,file://appconf.intranet.rws.nl/Appconf/metadatatools/Metadatamaker/config_bestanden/customCodelists.xml
2,file://Shr-ipvw-gpr001/metadatatools/Metadatamaker/config_bestanden/customCodelists.xml
3,http://adl.brs.gov.au/anrdl/resources/codeList/codeList20120313.xml
4,http://asdd.ga.gov.au/asdd/profileinfo/GAScopeCodeList.xml
5,http://asdd.ga.gov.au/asdd/profileinfo/gmxCodelists.xml
6,http://aws2.caris.com/sfs/schemas/iso/19139/20070417/resources/Codelist/gmxCodelists.xml
7,http://aws2.caris.com/sfs/schemas/iso/19139/20070417/resources/Codelist/ML_gmxCodelists.xml
8,http://aws2.caris.com/sfs/schemas/iso/19139/CARIS_20120814/resources/Codelist/gmxCodelists.xml
9,http://aws2.caris.com/sfs/schemas/iso/19139/CARIS_20120814/resources/Codelist/srvCodelists.xml


In [18]:
# how many of those are http or https links?
# AND some are casing diffs only, so lowercase all
sql = """
with a as
(
	select lower(replace(left(codelist, strpos(codelist, '#')), '#', '')) as codelist
	from codelists
)
select distinct codelist
from a
where codelist ilike 'http%%'
order by codelist;
"""

df = pd.read_sql(sql, engine)
with pd.option_context('display.max_rows', 60, 'display.max_colwidth', 400):
    display(df[:46])

Unnamed: 0,codelist
0,http://adl.brs.gov.au/anrdl/resources/codelist/codelist20120313.xml
1,http://asdd.ga.gov.au/asdd/profileinfo/gascopecodelist.xml
2,http://asdd.ga.gov.au/asdd/profileinfo/gmxcodelists.xml
3,http://aws2.caris.com/sfs/schemas/iso/19139/20070417/resources/codelist/gmxcodelists.xml
4,http://aws2.caris.com/sfs/schemas/iso/19139/20070417/resources/codelist/ml_gmxcodelists.xml
5,http://aws2.caris.com/sfs/schemas/iso/19139/caris_20120814/resources/codelist/gmxcodelists.xml
6,http://aws2.caris.com/sfs/schemas/iso/19139/caris_20120814/resources/codelist/srvcodelists.xml
7,http://data.daff.gov.au/anrdl/resources/codelist/codelist20120313.xml
8,http://geobrain.laits.gmu.edu/catalogxml/resources/codelist.xml
9,http://mdtranslator.adiwg.org/api/codelists?format=xml
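For reference, the string handling in the two queries above (`left` up to the `#`, drop the `#`, lowercase) can be mirrored in Python 3 roughly as follows. `codelist_base` is a hypothetical helper, and the `#MD_ScopeCode` fragment in the usage line is an illustrative assumption, not a value taken from the data.

```python
def codelist_base(ref):
    """Return the lowercased part of a codelist reference before the first '#'.

    Mirrors lower(replace(left(ref, strpos(ref, '#')), '#', '')): strpos() returns 0
    when there is no '#', so left() yields '' and the reference is filtered out.
    """
    head, sep, _fragment = ref.partition('#')
    if not sep:
        return ''
    return head.lower()


print(codelist_base('http://asdd.ga.gov.au/asdd/profileinfo/GAScopeCodeList.xml#MD_ScopeCode'))
# http://asdd.ga.gov.au/asdd/profileinfo/gascopecodelist.xml
```

Lowercasing is useful for counting distinct codelists, but the path part of a URL can be case-sensitive, so the lowercased form is not necessarily a URL the server will answer.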


In [20]:
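# and a super quick linkrot check
# caveat: df holds the lowercased URLs from the previous cell; URL paths can be
# case-sensitive, so some of the 404s may come from the lowercasing, not linkrot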
import requests

codes = []
for d in df.itertuples():
    url = d[1]
    try:
        rsp = requests.get(url, timeout=30)
        status_code = rsp.status_code
    except Exception as ex:
        print url, ex
        status_code = 900
    
    codes.append((url, status_code))
    
    
    

http://adl.brs.gov.au/anrdl/resources/codelist/codelist20120313.xml ('Connection aborted.', gaierror(8, 'nodename nor servname provided, or not known'))
http://www.ec.gc.ca/data_donnees/standards/schemas/napec ('Connection aborted.', error(54, 'Connection reset by peer'))
http://www.tc211.org/iso19139/resources/codelist.xml ('Connection aborted.', gaierror(8, 'nodename nor servname provided, or not known'))
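A Python 3 variant of the same check is sketched below. It swaps `requests` for the standard library's `urllib` so it carries no third-party dependency, keeps the 900 sentinel from the cell above for connection-level failures, and takes the fetch function as a parameter so the loop can be exercised without network access. The function names are hypothetical.

```python
import http.client
import urllib.error
import urllib.request

UNREACHABLE = 900  # same sentinel the cell above uses when no HTTP response arrives


def fetch_status(url, timeout=30):
    """Return the HTTP status for a GET, or UNREACHABLE if the request fails outright."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as rsp:
            return rsp.status
    except urllib.error.HTTPError as ex:
        return ex.code  # 4xx/5xx responses still carry a status code
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        return UNREACHABLE  # DNS failure, reset, timeout, malformed URL


def check_links(urls, fetch=fetch_status):
    """Check each distinct URL once and return a list of (url, status) pairs."""
    seen = {}
    for url in urls:
        if url not in seen:
            seen[url] = fetch(url)
    return list(seen.items())


# offline usage with a stand-in fetcher (hypothetical URLs)
fake = {'http://a.example/x': 200, 'http://b.example/y': 404}
print(check_links(['http://a.example/x', 'http://b.example/y', 'http://a.example/x'],
                  fetch=lambda u: fake[u]))
# [('http://a.example/x', 200), ('http://b.example/y', 404)]
```

Feeding this the codelist URLs in their original casing, rather than the lowercased ones, would separate real linkrot from case-sensitivity misses.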


In [22]:
from collections import defaultdict
statuses = defaultdict(int)
for url, code in codes:
    print code, url
    statuses[code] += 1

900 http://adl.brs.gov.au/anrdl/resources/codelist/codelist20120313.xml
404 http://asdd.ga.gov.au/asdd/profileinfo/gascopecodelist.xml
404 http://asdd.ga.gov.au/asdd/profileinfo/gmxcodelists.xml
200 http://aws2.caris.com/sfs/schemas/iso/19139/20070417/resources/codelist/gmxcodelists.xml
200 http://aws2.caris.com/sfs/schemas/iso/19139/20070417/resources/codelist/ml_gmxcodelists.xml
200 http://aws2.caris.com/sfs/schemas/iso/19139/caris_20120814/resources/codelist/gmxcodelists.xml
200 http://aws2.caris.com/sfs/schemas/iso/19139/caris_20120814/resources/codelist/srvcodelists.xml
404 http://data.daff.gov.au/anrdl/resources/codelist/codelist20120313.xml
404 http://geobrain.laits.gmu.edu/catalogxml/resources/codelist.xml
200 http://mdtranslator.adiwg.org/api/codelists?format=xml
404 http://nap.geogratis.gc.ca/metadata/register/napmetadataregister.xml
404 http://schemas.opengis.net/iso/19139/20070417/resources/codelist/gmxcodelists.xml
404 http://seadata.bsh.de/isocodelists/sdncodelists/csrcod

In [23]:
statuses

defaultdict(<type 'int'>, {200: 14, 404: 25, 900: 3})

In [24]:
len(codes)

42

In [28]:
for k, v in statuses.iteritems():
    print k, '{:.2f}%'.format(v / float(len(codes)) * 100.)

200 33.33%
404 59.52%
900 7.14%
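The tally and percentage steps above can be folded into one helper. This is a small Python 3 sketch with a hypothetical function name; the usage line rebuilds the status counts reported above (14, 25 and 3 out of 42) with placeholder URLs.

```python
from collections import Counter


def status_shares(codes):
    """codes: list of (url, status); returns {status: percent of all checked URLs}."""
    counts = Counter(status for _url, status in codes)
    total = sum(counts.values())
    return {status: round(100.0 * n / total, 2) for status, n in counts.items()}


# placeholder URLs, real counts from the output above
codes = [('u', 200)] * 14 + [('u', 404)] * 25 + [('u', 900)] * 3
print(status_shares(codes))  # {200: 33.33, 404: 59.52, 900: 7.14}
```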


#### So how many responses contain an inaccessible codelist?

In [31]:
sql = """
select count(distinct r.response_id)
from response_codelists r join codelist_statuses c on c.codelist = r.codelist
where c.status != 200;
"""
df = pd.read_sql(sql, engine)
df

Unnamed: 0,count
0,16129


So 16,129 out of 19,689 ISO records (16,129 / 19,689 ≈ 81.9%, call it 82%) have at least one reference to a downed codelist.

Question: Do you want to know what the downed codelists affect (role codes or progress codes or ??)? 

On to answering that question. Starting with which elements (the terminal element tag names).

In [6]:
# list of elements with the number of responses containing at least one reference
# for that element pointing to one of the linkrot codelists

sql = """
with c as 
(
	select id, codelist
	from codelist_statuses
	where status > 200
), x as
(
	select r.id as response_id, r.host, r.metadata_age,
		lower(replace(left(i.codelist, strpos(i.codelist, '#')), '#', '')) as codelist,
		i.xpath, string_to_array(i.xpath, '/') as xpath_arr
	from codelists i join responses r on r.source_url_sha = i.file
)
select count(distinct x.response_id) as num_responses, 
	-- x.host, 
	-- x.metadata_age::date, 
	-- c.codelist,
	-- x.xpath, 
	x.xpath_arr[array_upper(xpath_arr, 1)] as last_x
from x join c on c.codelist = x.codelist
--order by x.codelist
group by last_x
order by num_responses DESC;
"""
    
df = pd.read_sql(sql, engine)
df

Unnamed: 0,num_responses,last_x
0,15079,CI_RoleCode
1,12001,MD_ScopeCode
2,9950,CI_DateTypeCode
3,8229,MD_MaintenanceFrequencyCode
4,7061,MD_KeywordTypeCode
5,6657,CI_OnLineFunctionCode
6,4828,MD_CharacterSetCode
7,3792,DS_AssociationTypeCode
8,3614,MD_RestrictionCode
9,3145,MD_ProgressCode
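The grouping in the query above takes the last step of each xpath (`xpath_arr[array_upper(xpath_arr, 1)]`) and counts distinct responses per step. A Python 3 sketch of the same idea follows; the function name and the sample rows are hypothetical and only illustrate the shape of the input.

```python
from collections import defaultdict


def responses_per_element(rows):
    """rows: iterable of (response_id, xpath); count distinct responses per last path step."""
    seen = defaultdict(set)
    for response_id, xpath in rows:
        last = xpath.split('/')[-1]  # terminal element tag name
        seen[last].add(response_id)  # a set, so each response counts once per element
    counts = {name: len(ids) for name, ids in seen.items()}
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# illustrative rows: response 1 has two role codes, response 2 has one of each
rows = [
    (1, 'MD_Metadata/contact/CI_ResponsibleParty/role/CI_RoleCode'),
    (1, 'MD_Metadata/identificationInfo/MD_DataIdentification/pointOfContact/CI_ResponsibleParty/role/CI_RoleCode'),
    (2, 'MD_Metadata/contact/CI_ResponsibleParty/role/CI_RoleCode'),
    (2, 'MD_Metadata/hierarchyLevel/MD_ScopeCode'),
]
print(responses_per_element(rows))  # [('CI_RoleCode', 2), ('MD_ScopeCode', 1)]
```

Because a response is counted once per element no matter how many role codes it carries, these numbers are per-response, which is why they can be lower than per-reference counts for the same element.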


In [5]:
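# and by the full unqualified xpath
# note: count() is not distinct here, so num_responses counts references
# (rows) per xpath, not distinct responses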
sql = """
with c as 
(
	select id, codelist
	from codelist_statuses
	where status > 200
), x as
(
	select r.id as response_id, r.host, r.metadata_age,
		lower(replace(left(i.codelist, strpos(i.codelist, '#')), '#', '')) as codelist,
		i.xpath, string_to_array(i.xpath, '/') as xpath_arr
	from codelists i join responses r on r.source_url_sha = i.file
)
select count(x.response_id) as num_responses, 
	x.xpath
from x join c on c.codelist = x.codelist
group by x.xpath
order by num_responses DESC;
"""
    
df = pd.read_sql(sql, engine)
    
with pd.option_context('display.max_colwidth', 1000, 'display.max_rows', 310):
    display(df[:309])

Unnamed: 0,num_responses,xpath
0,15556,MI_Metadata/identificationInfo/MD_DataIdentification/descriptiveKeywords/MD_Keywords/thesaurusName/CI_Citation/citedResponsibleParty/CI_ResponsibleParty/role/CI_RoleCode
1,14236,MI_Metadata/identificationInfo/MD_DataIdentification/descriptiveKeywords/MD_Keywords/thesaurusName/CI_Citation/citedResponsibleParty/CI_ResponsibleParty/contactInfo/CI_Contact/onlineResource/CI_OnlineResource/function/CI_OnLineFunctionCode
2,13728,MI_Metadata/identificationInfo/MD_DataIdentification/citation/CI_Citation/citedResponsibleParty/CI_ResponsibleParty/role/CI_RoleCode
3,13530,MI_Metadata/dataQualityInfo/DQ_DataQuality/lineage/LI_Lineage/source/LI_Source/sourceCitation/CI_Citation/citedResponsibleParty/CI_ResponsibleParty/role/CI_RoleCode
4,11844,MD_Metadata/identificationInfo/MD_DataIdentification/pointOfContact/CI_ResponsibleParty/role/CI_RoleCode
5,11692,MI_Metadata/identificationInfo/MD_DataIdentification/descriptiveKeywords/MD_Keywords/type/MD_KeywordTypeCode
6,8895,MI_Metadata/distributionInfo/MD_Distribution/distributor/MD_Distributor/distributorTransferOptions/MD_DigitalTransferOptions/onLine/CI_OnlineResource/function/CI_OnLineFunctionCode
7,8378,MD_Metadata/identificationInfo/MD_DataIdentification/citation/CI_Citation/date/CI_Date/dateType/CI_DateTypeCode
8,7991,MD_Metadata/identificationInfo/MD_DataIdentification/descriptiveKeywords/MD_Keywords/thesaurusName/CI_Citation/date/CI_Date/dateType/CI_DateTypeCode
9,7768,MI_Metadata/identificationInfo/MD_DataIdentification/aggregationInfo/MD_AggregateInformation/aggregateDataSetName/CI_Citation/citedResponsibleParty/CI_ResponsibleParty/role/CI_RoleCode
