How many different ways do we find to note the ISO standard and version?

So for any XML response, check for the version.

####Notes

This does not first identify a response as ISO, so any XML that contains one of the xpath structures is included. That catches ISO records embedded in something like a CSW GetRecords response. It also extracts only the first value, so if the response *is* from a CSW and each ISO record within it has its own version definition, we aren't capturing the others (that would be odd, but I wouldn't rule it out).
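To make the "first value only" caveat concrete, here is a minimal sketch using the standard library's `xml.etree.ElementTree` rather than `semproc`. The CSW-like wrapper document and the helpers `local_name`, `all_texts` and `first_text` are made up for illustration. They only approximate a namespace-agnostic lookup and are not what `extract_item` actually does.

```python
import xml.etree.ElementTree as ET

# Hypothetical CSW-like wrapper holding two ISO records with different versions.
CSW_LIKE = """<GetRecordsResponse xmlns:gmd="http://www.isotc211.org/2005/gmd"
                    xmlns:gco="http://www.isotc211.org/2005/gco">
  <gmd:MD_Metadata>
    <gmd:metadataStandardName><gco:CharacterString>ISO 19115</gco:CharacterString></gmd:metadataStandardName>
    <gmd:metadataStandardVersion><gco:CharacterString>2003</gco:CharacterString></gmd:metadataStandardVersion>
  </gmd:MD_Metadata>
  <gmd:MD_Metadata>
    <gmd:metadataStandardName><gco:CharacterString>ISO 19115-2</gco:CharacterString></gmd:metadataStandardName>
    <gmd:metadataStandardVersion><gco:CharacterString>2009</gco:CharacterString></gmd:metadataStandardVersion>
  </gmd:MD_Metadata>
</GetRecordsResponse>"""


def local_name(tag):
    # ElementTree spells namespaced tags as '{uri}name'; keep only 'name'.
    return tag.rsplit('}', 1)[-1]


def all_texts(root, parent_name, child_name):
    # Every child_name text directly under a parent_name element,
    # in document order and ignoring namespaces.
    found = []
    for parent in root.iter():
        if local_name(parent.tag) != parent_name:
            continue
        for child in parent:
            if local_name(child.tag) == child_name and child.text:
                found.append(child.text.strip())
    return found


def first_text(root, parent_name, child_name):
    # Mimics the "take the first hit" behaviour described in the notes.
    texts = all_texts(root, parent_name, child_name)
    return texts[0] if texts else ''


root = ET.fromstring(CSW_LIKE)
first_version = first_text(root, 'metadataStandardVersion', 'CharacterString')
all_versions = all_texts(root, 'metadataStandardVersion', 'CharacterString')
```

In this sketch `first_version` only reflects the first embedded record, while `all_versions` shows that the wrapper holds a second, different version that a first-value extraction would miss.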

In [2]:
%reload_ext autoreload
%autoreload 2

import os
import glob
import json
from lxml import etree
from semproc.rawresponse import RawResponse
from semproc.xml_utils import extract_item

def convert_header_list(headers):
    return dict(
        (k.strip().lower(), v.strip()) for k, v in (
            h.split(':', 1) for h in headers)
    )
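
As a quick check of what `convert_header_list` returns, here is a small sketch with made-up header lines (the sample values are assumptions, not taken from the harvested responses). The function is repeated so the block runs on its own.

```python
def convert_header_list(headers):
    return dict(
        (k.strip().lower(), v.strip()) for k, v in (
            h.split(':', 1) for h in headers)
    )


# Illustrative header lines only; note the colons inside the second value.
sample_headers = [
    'Content-Type: text/xml; charset=UTF-8',
    'Last-Modified: Tue, 22 Sep 2015 10:15:00 GMT',
]
parsed = convert_header_list(sample_headers)

# Keys are lower-cased, which is why the later lookup uses 'content-type'.
content_type = parsed.get('content-type', '')
```

Splitting on the first colon only (`split(':', 1)`) keeps values that contain colons intact, such as the timestamp above. A header line with no colon at all would raise a `ValueError` when unpacked into `k, v`.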

In [3]:
files = glob.glob('/Users/sparky/Documents/solr_responses/solr_20150922_docs/*.json')

In [4]:
xpaths = [
    (['//*', 'metadataStandardName', 'CharacterString'], ['//*', 'metadataStandardVersion', 'CharacterString']),
    (['//*', 'seriesMetadata', 'MD_Metadata', 'metadataStandardName', 'CharacterString'], ['//*', 'seriesMetadata', 'MD_Metadata', 'metadataStandardVersion', 'CharacterString']),
    (['//*', 'seriesMetadata', 'MI_Metadata', 'metadataStandardName', 'CharacterString'], ['//*', 'seriesMetadata', 'MI_Metadata', 'metadataStandardVersion', 'CharacterString'])
]

In [5]:
versions = set()
responses_containing_version = set()
for f in files:
    with open(f, 'r') as g:
        data = json.loads(g.read())
    
    response = data.get('raw_content')
    headers = convert_header_list(data.get('response_headers', []))
    content_type = headers.get('content-type', '')
    
    rr = RawResponse(response, content_type)
    try:
        content = rr.clean_raw_content()
    except Exception:
        # report responses that could not be cleaned and move on
        print 'FAILED ', f
        continue

    if rr.datatype != 'xml':
        continue
    
    try:
        xml = etree.fromstring(content)
    except Exception:
        # skip content that does not parse as XML
        continue
        
    for std_name, std_vsn in xpaths:
        sn = extract_item(xml, std_name)
        sv = extract_item(xml, std_vsn)
        
        if sn or sv:
            versions.add('Standard: {0}, Version: {1}'.format(sn, sv))
            responses_containing_version.add('File: {0}, Standard: {1}, Version: {2}'.format(f.split('/')[-1], sn, sv))
    

FAILED  /Users/sparky/Documents/solr_responses/solr_20150922_docs/14a548cd708b8b3c1b74c47a57dd033bfaf5a7ed.json
FAILED  /Users/sparky/Documents/solr_responses/solr_20150922_docs/2b727b1afb9381062a3f5dd05802e56ce21eeee8.json
FAILED  /Users/sparky/Documents/solr_responses/solr_20150922_docs/3354a2bd8c334dddcb82c4cf50bc18d8619db25e.json
FAILED  /Users/sparky/Documents/solr_responses/solr_20150922_docs/3459af68928a619ac5cc332e251ff72a62bd10d9.json
FAILED  /Users/sparky/Documents/solr_responses/solr_20150922_docs/51bfa74c90c5cacbc46b20cd87abf86aa3a958a3.json
FAILED  /Users/sparky/Documents/solr_responses/solr_20150922_docs/58cc1b77ccf6848cf6f66d3ad75e32efd2df9ad9.json
FAILED  /Users/sparky/Documents/solr_responses/solr_20150922_docs/6269c3680e2d851de9bcd4e9f140783925b77890.json
FAILED  /Users/sparky/Documents/solr_responses/solr_20150922_docs/64aa05e70dda49093c464c1973599aa25d20ecac.json
FAILED  /Users/sparky/Documents/solr_responses/solr_20150922_docs/76d89ea59e67c23c8b16dcc65079410937eb52

In [7]:

versions


{'Standard: ANZLIC Metadata Profile: An Australian/New Zealand Profile of AS/NZS ISO 19115:2005, Geographic information - Metadata, Version: 1.1',
 'Standard: DM - Regole tecniche RNDT, Version: 10 novembre 2011',
 'Standard: GEMINI, Version: 2.1',
 'Standard: Gemini, Version: 2.1',
 'Standard: INSPIRE Implementing Rules for Metadata, Version: 1.2',
 'Standard: ISO 19115 (UK GEMINI), Version: 1.0 (2.2)',
 'Standard: ISO 19115 Geographic Information - Metadata, Version: 2009-02-15',
 'Standard: ISO 19115 Geographic Information - Metadata, Version: ISO 19115',
 'Standard: ISO 19115 Geographic information - Metadata - Converted from Data.gov legacy DMS format., Version: ISO 19115:2003(E)',
 'Standard: ISO 19115 Geographic information - Metadata, Version: ISO 19115:2003(E)',
 'Standard: ISO 19115 Geographic information Metadata; WISE Metadata profile, Version: 2003/Cor.1:2006',
 'Standard: ISO 19115, Version: 2003',
 'Standard: ISO 19115, Version: ISO 19139 / DKRZ ISO Simple Profile V1.0',

In [6]:
with open('outputs/iso_versions_by_response.txt', 'w') as f:
    f.write('\n'.join(responses_containing_version))

In [8]:
len(responses_containing_version)

25371

In [9]:
for v in versions:
    print v

Standard: WMO Core Metadata Profile, Version: 1.0 (draft)
Standard: MEDIN Discovery Metadata Standard, Version: Version 2.3.5
Standard: ISO 19115 Geographic information - Metadata - Converted from Data.gov legacy DMS format., Version: ISO 19115:2003(E)
Standard: MEDIN Discovery Metadata Standard, Version: Version 2.3.7
Standard: ISO 19115/2003/Cor.1:2006, Version: GDI-Vlaanderen Best Practices - versie 1.0
Standard: ISO 19115-2 Geographic Information - Metadata - Part 2: Extensions for Imagery and Gridded Data, Version: ISO 19115-2:2013(E)
Standard: ISO 19115-2 Geographic Information - Metadata - Part 2: Extensions for Imagery and Gridded Data with Biological Extensions, Version: ISO 19115-2:2009(E)
Standard: ISO 19115 Geographic information - Metadata, Version: ISO 19115:2003(E)
Standard: ISO 19115-2 Geographic Information - Metadata Part 2 Extensions for imagery and gridded data, Version: ISO 19115-2:2009(E)
Standard: Gemini, Version: 2.1
Standard: North American Profile of ISO 19115

###Counts by Version String

From our saved documents file, aggregate on the unique strings.
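
Before reaching for pandas, the same aggregation can be sketched with the standard library. This is a minimal illustration, assuming the `File: ..., Standard: ..., Version: ...` line format written earlier. The sample lines and the helper `version_string` are hypothetical.

```python
from collections import Counter

# Made-up lines in the same shape as outputs/iso_versions_by_response.txt.
sample_lines = [
    'File: aaa.json, Standard: ISO-USGIN, Version: 1.2\n',
    'File: bbb.json, Standard: Gemini, Version: 2.1\n',
    'File: ccc.json, Standard: ISO-USGIN, Version: 1.2\n',
    '\n',
]


def version_string(line):
    # Drop the file name, then fold standard and version into 'name; version'.
    parts = line.strip().split('.json, ', 1)
    if len(parts) < 2:
        return 'Unknown'
    return parts[1].replace('Standard: ', '', 1).replace(', Version: ', '; ')


# Count each distinct version string, skipping blank lines.
counts = Counter(version_string(l) for l in sample_lines if l.strip())
```

`counts.most_common()` then lists the version strings from most to least frequent, which is the same information the `groupby` produces.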

In [17]:
import pandas as pd

def repack_response(r):
    # File: aa990c18d59cac186291423719efab1b7d.json, Standard: ISO-USGIN, Version: 1.2
    parts = r.split('.json, ')
    rsp = parts[0].replace('File: ', '').strip()
    vsn = parts[1].strip().replace('Standard: ', '').replace(', Version: ', '; ').strip() if len(parts) > 1 else 'Unknown'
    return (rsp, vsn)

with open('outputs/iso_versions.txt', 'r') as f:
    versions = f.readlines()

# strip the standard/version tags, skipping blank lines
versions = [v.strip().replace('Standard: ', '').replace(', Version: ', '; ') for v in versions if v.strip()]

with open('outputs/iso_versions_by_response.txt', 'r') as f:
    responses = f.readlines()
responses = [repack_response(r) for r in responses if r.strip()]

df = pd.DataFrame(responses, columns=['SHA_ID', 'Version String'])
df[:10]

,SHA_ID,Version String
0,f8dc5daa990c18d59cac186291423719efab1b7d,ISO-USGIN; 1.2
1,e388d8863c30b175feb849b8e135386f2262f080,ISO 19115-2 Geographic Information - Metadata ...
2,34089aa49c76eab03fac09a7ee0f4dd7e02023e3,ISO 19115 Geographic Information - Metadata; I...
3,62d7febe3717636690c1cc5313ca7df06a91c5ef,ISO 19115-2 Geographic Information - Metadata ...
4,bef6893cdacbcab8b13eaeaf460344c1770ceee3,ISO 19115 Geographic Information - Metadata; 2...
5,a3462cc918040ebbd780b4abc5a115e262b51736,ISO-USGIN; 1.2
6,9ad65b200f636f40c4be4efa1208a7d136b7ba59,ISO 19115-2 Geographic Information - Metadata ...
7,15f20a4fd3a5f6208bc1149ae7d18d622bd608bd,ISO 19115 (UK GEMINI); 1.0 (2.2)
8,821ce5bf41e19731b97a01cbeedf31d616d0567f,ISO-USGIN; 1.2
9,ce867d1a914bc82b5cc566c14a1b41474b453dc4,ISO-NAP-USGIN; 1.1.4


In [6]:
# and the aggregation by version string
df.groupby(['Version String']).count()

Version String,SHA_ID
"ANZLIC Metadata Profile: An Australian/New Zealand Profile of AS/NZS ISO 19115:2005, Geographic information - Metadata; 1.1",124
DM - Regole tecniche RNDT; 10 novembre 2011,2
GEMINI; 2.1,25
Gemini; 2.1,1
INSPIRE Implementing Rules for Metadata; 1.2,36
ISO 19115 (UK GEMINI); 1.0 (2.2),318
ISO 19115 Geographic Information - Metadata; 2009-02-15,2516
ISO 19115 Geographic Information - Metadata; ISO 19115,333
ISO 19115 Geographic information - Metadata - Converted from Data.gov legacy DMS format.; ISO 19115:2003(E),86
ISO 19115 Geographic information - Metadata; ISO 19115:2003(E),208


In [16]:
# and storing that output
with open('outputs/iso_version_counts.csv', 'w') as f:
    f.write('Version|Count\n')
    # group on the column name (not a one-item list) so vsn is a plain string
    for vsn, vsn_data in df.groupby('Version String'):
        f.write('{0}|{1}\n'.format(vsn, len(vsn_data)))