From the current Solr set as of 2015-09-22T21:00.

Initial run to determine:

1. How many of the parsable XML responses in the set include a schema location?
2. What are the unique schemas?
3. How many are federally-hosted?

TODOs:

- aggregate by protocol (post-identification)

In [1]:
import os
import json
import glob
from lxml import etree

In [2]:
# parsing code
# note: there's some older cruft to deal with
# related to cdata, encodings, etc.

_total_responses = 677510
_doc_dir = '/Users/sparky/Documents/solr_responses/solr_20150922_docs/'
_xpaths = [
    ['//*', '@schemaLocation'],
    ['//*', '@noNamespaceSchemaLocation']
]

def generate_localname_xpath(tags):
    unchangeds = ['*', '..', '.', '//*']
    return '/'.join(
        ['%s*[local-name()="%s"]' % ('@' if '@' in t else '', t.replace('@', ''))
         if t not in unchangeds else t for t in tags])


def extract_attribs(elem, tags):
    # return the matched attribute values with runs of whitespace collapsed
    e = extract_elems(elem, tags)
    if isinstance(e, list):
        return [' '.join(m.split()) for m in e]
    return [' '.join(e.split())]


def extract_elems(elem, tags):
    xp = generate_localname_xpath(tags)
    return elem.xpath(xp)


def _clean_content(response):
    response = response.replace('\\\n', '').replace('\r\n', '').replace('\\r', '').replace('\\n', '').replace('\n', '')
    response = response.replace('\\\t', '').replace('\\t', '').replace('\t', '')
    # this is likely useless (mostly issues in the json)
    response = response.replace('\\\\ufffd', '').replace('\\\ufffd', '').replace('\\ufffd', '').replace('\ufffd', '')
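    # caution: json.loads returns unicode under Python 2, so .decode() first
    # encodes implicitly as ASCII; non-ASCII content raises here and the
    # response ends up counted as failed
    response = response.decode('utf-8', errors='replace').encode('unicode_escape')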
    return response


def _parse_content(response):
    parser = etree.XMLParser()
    return etree.fromstring(response, parser=parser)


def prep_content(filename):
    with open(filename, 'r') as f:
        data = json.loads(f.read())
    response = data.get('raw_content', '')
    response = _clean_content(response)
    return _parse_content(response)
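
For reference, the `_xpaths` pairs expand into namespace-agnostic XPath expressions. The snippet below is a standalone restatement of the `generate_localname_xpath` logic (stdlib only, written to run under Python 2 or 3) that prints the strings produced for the two pairs; it is an illustration, not a replacement for the cell above.

```python
# Standalone copy of the helper's logic, for illustration only.
def generate_localname_xpath(tags):
    unchangeds = ['*', '..', '.', '//*']
    parts = []
    for t in tags:
        if t in unchangeds:
            # axis/wildcard steps pass through untouched
            parts.append(t)
        else:
            # match on local-name() so the namespace prefix is ignored
            prefix = '@' if '@' in t else ''
            parts.append('%s*[local-name()="%s"]' % (prefix, t.replace('@', '')))
    return '/'.join(parts)


schema_xpath = generate_localname_xpath(['//*', '@schemaLocation'])
no_ns_xpath = generate_localname_xpath(['//*', '@noNamespaceSchemaLocation'])
print(schema_xpath)
print(no_ns_xpath)
```

Because the match is on `local-name()`, a `schemaLocation` attribute is found whatever prefix the document binds to the XMLSchema-instance namespace, and on any element, not just the root.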


In [3]:
# gather the schemas, noting that the extraction
# method does *not* split packed values (failovers),
# i.e., the ="schema_a.xsd schema_b.xsd" situation,
# so we need to handle that afterwards

parsed_responses = 0
failed_responses = []
unique_schemas = set()
packed_unique_schemas = set()
responses_with_a_schema = 0
for f in glob.glob(_doc_dir + '*.json'):
    try:
        xml = prep_content(f)
        parsed_responses += 1
    except Exception as ex:
        failed_responses.append((f, ex))
        continue
    
    schemas = []
    for xp in _xpaths:
        schemas += extract_attribs(xml, xp)

    if not schemas:
        continue

    # keep the attribute values as found (packed)
    packed_unique_schemas.update(schemas)

    # split packed values on whitespace into individual entries
    schemas = [a for s in schemas for a in s.split()]
    unique_schemas.update(schemas)
    responses_with_a_schema += 1
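
The packed versus split distinction can be shown on a single value. This is a sketch with a made-up `schemaLocation` string (the URLs are placeholders, not taken from the Solr set). The whitespace split used above yields every token; since the XML Schema specification defines `xsi:schemaLocation` as namespace/location pairs, the tokens can also be paired up. The `pair_schema_location` helper is hypothetical and not part of this notebook.

```python
def pair_schema_location(value):
    """Split a packed xsi:schemaLocation value into tokens and
    (namespace, location) pairs.

    Hypothetical helper: an odd trailing token is paired with None.
    """
    tokens = value.split()
    pairs = []
    for i in range(0, len(tokens), 2):
        location = tokens[i + 1] if i + 1 < len(tokens) else None
        pairs.append((tokens[i], location))
    return tokens, pairs


# placeholder value with irregular whitespace, as often seen in the wild
packed = ('http://example.com/ns/a   http://example.com/schemas/a.xsd\n'
          '  http://example.com/ns/b http://example.com/schemas/b.xsd')

normalized = ' '.join(packed.split())  # what extract_attribs would keep
tokens, pairs = pair_schema_location(packed)
print(normalized)
print(tokens)
print(pairs)
```

Under that reading, the split list would mix namespace URIs with schema locations, which may be worth keeping in mind when interpreting the unique count.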

In [4]:
parsed_responses

509608

In [5]:
responses_with_a_schema

124769

In [6]:
len(failed_responses)

167902

In [8]:
len(unique_schemas)

2382

In [9]:
len(packed_unique_schemas)

1960
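
The first three counts above can be cross-checked against `_total_responses` and turned into proportions. A small sketch using only the numbers reported in this notebook; the variable names are illustrative.

```python
total_responses = 677510
parsed_responses = 509608
failed_count = 167902
responses_with_a_schema = 124769

# parsed and failed should account for every response in the set
accounted = parsed_responses + failed_count
print(accounted == total_responses)

pct_parsed = 100.0 * parsed_responses / total_responses
pct_with_schema = 100.0 * responses_with_a_schema / parsed_responses
print('parsed: %.1f%%' % pct_parsed)
print('with a schema location (of parsed): %.1f%%' % pct_with_schema)
```

So roughly three quarters of the responses parsed as XML, and about a quarter of those carry a schema location.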

In [7]:
with open('outputs/unique_schemas.txt', 'w') as f:
    f.write('\n'.join(list(unique_schemas)))
with open('outputs/unique_packed_schemas.txt', 'w') as f:
    f.write('\n'.join(list(packed_unique_schemas)))

And here we pause for a bit to do some manual cleanup. After that, we need to deduplicate each of our now three schema lists again (unique, unique packed, federal).

Note: unique is the schema location split on whitespace (each schema listed separately), unique packed is the schema location value as is, and federal is the list of federally-hosted schemas.

So re-open, set, sort and save.

In [3]:
with open('outputs/unique_schemas.txt', 'r') as f:
    # splitlines drops the line endings, so a final line without a
    # trailing newline still dedupes and is not glued to its neighbor
    lines = f.read().splitlines()

unique = sorted(set(lines))
with open('outputs/unique_schemas_sorted.txt', 'w') as f:
    f.write('\n'.join(unique))

print 'uniques: ', len(unique)

with open('outputs/unique_packed_schemas.txt', 'r') as f:
    lines = f.read().splitlines()

unique = sorted(set(lines))
with open('outputs/unique_packed_schemas_sorted.txt', 'w') as f:
    f.write('\n'.join(unique))

print 'packed uniques: ', len(unique)

with open('outputs/federal_schemas.txt', 'r') as f:
    lines = f.read().splitlines()

unique = sorted(set(lines))
with open('outputs/federal_schemas_sorted.txt', 'w') as f:
    f.write('\n'.join(unique))

print 'feds: ', len(unique)


uniques:  2366
packed uniques:  1960
feds:  206
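
The federal list was put together during the manual cleanup, and the criteria are not recorded here. As a rough aid for that kind of pass, candidates could be flagged by hostname suffix. The sketch below assumes that `.gov` and `.mil` hosts approximate "federally-hosted"; that suffix rule, the `is_federal` name, and the sample URLs are all assumptions for illustration.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2

# assumption: these suffixes approximate "federally-hosted"
FEDERAL_SUFFIXES = ('.gov', '.mil')


def is_federal(location):
    # relative paths and bare file names have no hostname
    host = urlparse(location).hostname or ''
    return host.lower().endswith(FEDERAL_SUFFIXES)


# placeholder values, not from the Solr set
samples = [
    'http://www.example.gov/schemas/a.xsd',
    'http://example.com/schemas/b.xsd',
    'schemas/c.xsd',
]
flags = [is_federal(s) for s in samples]
print(flags)
```

Since `.gov` also covers state and local hosts, a suffix check like this would only narrow the list; a manual review would still be needed.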


A few more stats from the unique, unpacked list:

Relative paths or simple file names: 363
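
How the relative paths were counted is not shown in this notebook. One way to reproduce such a count, sketched under the assumption that "relative" means no URL scheme and no network location, uses the standard URL parser. The `is_relative` helper and the sample values are hypothetical.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2


def is_relative(location):
    """True when the value has neither a URL scheme nor a network location."""
    parts = urlparse(location)
    return not parts.scheme and not parts.netloc


# placeholder values, not from the Solr set
samples = [
    'http://example.com/schemas/a.xsd',
    '../schemas/b.xsd',
    'c.xsd',
    '//example.com/schemas/d.xsd',
]
relative = [s for s in samples if is_relative(s)]
print(relative)
```

Applied to the lines of `outputs/unique_schemas_sorted.txt`, `len(relative)` would give a figure comparable to the one above, though edge cases such as Windows drive paths would parse as having a scheme and need separate handling.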