The OAI-PMH Blog (http://ekvv.uni-bielefeld.de/blog/baseoai/) has a reasonably well-structured registry of OAI-PMH servers noting things like the framework used, identifier usage, etc. 

This notebook does some basic parsing and exploratory data analysis (EDA) on that registry's Atom feed to extract:

1. the repository name and URL
2. the framework used
3. use of dc:identifiers
4. whether the URL has changed (at least to flag it)
5. whether it comes out of OpenDOAR

In [35]:
import os
from lxml import etree
import json
from bs4 import BeautifulSoup
import HTMLParser
from itertools import chain

with open('oaipmh_blog_alle.atom', 'r') as f:
    text = f.read()

xml = etree.fromstring(text)

hparse = HTMLParser.HTMLParser()


In [7]:
# some little xml helpers
def generate_localname_xpath(tags):
    unchangeds = ['*', '..', '.', '//*']
    return '/'.join(
        ['%s*[local-name()="%s"]' % ('@' if '@' in t else '', t.replace('@', ''))
         if t not in unchangeds else t for t in tags])

def extract_attrib(elem, tags):
    e = extract_elem(elem, tags)
    return e.strip() if e else ''


def extract_attribs(elem, tags):
    # use the plural helper: every matching attribute value, not just the first
    es = extract_elems(elem, tags)
    return [m.strip() for m in es]


def extract_item(elem, tags):
    e = extract_elem(elem, tags)
    return e.text.strip() if e is not None and e.text else ''


def extract_items(elem, tags):
    es = extract_elems(elem, tags)
    return [e.text.strip() for e in es if e is not None and e.text]


def extract_elems(elem, tags):
    xp = generate_localname_xpath(tags)
    return elem.xpath(xp)


def extract_elem(elem, tags):
    xp = generate_localname_xpath(tags)
    return next(iter(elem.xpath(xp)), None)
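
To see what these helpers hand to `xpath()`, here is a minimal sketch that re-declares `generate_localname_xpath` so it runs on its own (written to work under Python 3 as well) and prints the expressions it builds. The two tag lists are illustrative assumptions about the feed layout, not paths used elsewhere in the notebook.

```python
def generate_localname_xpath(tags):
    # same logic as the helper above: match on local-name() so that
    # namespace prefixes can be ignored; '@name' becomes an attribute step
    unchangeds = ['*', '..', '.', '//*']
    return '/'.join(
        '%s*[local-name()="%s"]' % ('@' if '@' in t else '', t.replace('@', ''))
        if t not in unchangeds else t
        for t in tags)

# hypothetical tag lists, chosen only to show both element and attribute steps
entry_path = generate_localname_xpath(['//*', 'entry'])
href_path = generate_localname_xpath(['link', '@href'])

print(entry_path)
print(href_path)
```

The first expression is the same one used to collect the entries in the next cell; the second shows how an `@`-prefixed tag turns into an attribute match.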

In [8]:
# note: not dealing with namespacing at all.

entries = xml.xpath('//*/*[local-name()="entry"]')
len(entries)

200
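
For comparison, this is roughly what dealing with the namespace would look like. It is a sketch under assumptions: it uses the standard library's `xml.etree.ElementTree` instead of lxml so that it has no third-party dependencies, and the tiny feed, its titles and its `example.org` links are made up for illustration. Only the Atom namespace URI is real.

```python
import xml.etree.ElementTree as ET

# the Atom namespace, bound to a prefix of our own choosing
ATOM_NS = {'atom': 'http://www.w3.org/2005/Atom'}

# made-up two-entry feed, standing in for the real export
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>sample feed</title>
  <entry>
    <title>First repository has been migrated</title>
    <link rel="alternate" href="http://example.org/entry/1"/>
  </entry>
  <entry>
    <title>New OAI-PMH Repository: Second repository</title>
    <link rel="alternate" href="http://example.org/entry/2"/>
  </entry>
</feed>"""

root = ET.fromstring(SAMPLE_FEED)

# namespace-aware lookups: the prefix is resolved through ATOM_NS
sample_entries = root.findall('atom:entry', ATOM_NS)
titles = [e.findtext('atom:title', namespaces=ATOM_NS) for e in sample_entries]
links = [e.find("atom:link[@rel='alternate']", ATOM_NS).get('href')
         for e in sample_entries]

print(len(sample_entries))
print(titles)
print(links)
```

The `local-name()` approach is shorter when the feed mixes namespaces or when they are not known in advance; the explicit mapping is stricter and will not pick up an `entry` element from some other vocabulary.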

Let's talk about the patterns for a minute.

For the titles:

Many things are "{title of the service} has ...".

New services are marked as "New OAI-PMH Repository: ".

Because the blog is clearly written by hand, there is no guarantee that the phrase before "has" is only the name of the repository. Nor is it always clear whether a "migrated" post refers to a new URL or a new framework.
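
The title patterns can be sketched as follows. This is an illustration, not the parser used in this notebook: `split_title` is a hypothetical helper, the sample titles are invented, and it checks the "New OAI-PMH Repository:" prefix first and splits only on the standalone word "has".

```python
import re

HAS_WORD = re.compile(r'\bhas\b')       # "has" as a whole word only
NEW_PREFIX = 'New OAI-PMH Repository:'

def split_title(title):
    """Return (name, event) for one blog post title."""
    title = title.strip()
    if title.startswith(NEW_PREFIX):
        return title[len(NEW_PREFIX):].strip(), 'new'
    parts = HAS_WORD.split(title, maxsplit=1)
    if len(parts) == 2:
        return parts[0].strip(), parts[1].strip()
    # neither pattern applies: keep the title, leave the event empty
    return title, ''

# invented titles covering the two patterns plus a non-matching one
sample_titles = [
    'Shasta Archive, USA has been migrated',
    'New OAI-PMH Repository: Example Server, Germany',
    'An unrelated announcement',
]
for t in sample_titles:
    print(split_title(t))

# a plain substring split cuts the first name apart at "S|has|ta"
naive_parts = sample_titles[0].split('has')
print(naive_parts)
```

The last two lines show why matching on the bare substring is fragile: any repository name that happens to contain "has" is split in the wrong place.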


For the text content:

For a new link, it's "the new basicurl is"; for a new entry, it's "the basicurl is" (however, we can get that link from the link/@rel=alternate element).

The framework used can be grokked from the text, but the phrasing varies often enough that pattern matching will not catch every entry.
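
A minimal sketch of these content patterns, assuming sentences shaped like the invented samples below. `read_line` and the three regular expressions are hypothetical and only cover the phrasings described above; real posts will need more cases.

```python
import re

# "... uses <framework>."
USES = re.compile(r'\buses\s+(?P<framework>.+?)\.?\s*$')
# "... from <old> to <new>."
MIGRATED = re.compile(r'\bfrom\s+(?P<old>.+?)\s+to\s+(?P<new>.+?)\.?\s*$')
# "the basicurl is <url>" or "the new basicurl is <url>"
BASEURL = re.compile(r'\b(?P<new>new\s+)?basicurl is\s+(?P<url>\S+)')

def read_line(line):
    """Collect whatever the three patterns find in one line of post text."""
    found = {}
    m = MIGRATED.search(line)
    if m:
        found['framework'] = m.group('new')
        found['previous_framework'] = m.group('old')
    else:
        m = USES.search(line)
        if m:
            found['framework'] = m.group('framework')
    m = BASEURL.search(line)
    if m:
        found['url'] = m.group('url').rstrip('.')
        found['url_changed'] = m.group('new') is not None
    return found

# invented sentences, one per pattern
sample_lines = [
    'The repository uses DSpace 1.8.',
    'The repository has been migrated from Opus 3 to Opus 4.',
    'The new basicurl is http://example.org/oai',
    'The basicurl is http://example.org/oai.',
]
for line in sample_lines:
    print(read_line(line))
```

Only the sentence-final period is dropped, so version numbers such as "1.8" survive intact.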

In [46]:
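import re

oais = []

def parse_title(text):
    # split on the standalone word "has" only, so names that merely contain
    # "has" are left intact; the surrounding whitespace stays in the parts
    parts = re.split(r'(?<=\s)has(?=\s)', text.strip(), 1)
    if len(parts) == 2:
        # let's not call it a triple
        return parts[0], parts[1]
    elif 'New OAI-PMH ' in text: 
        return text.replace('New OAI-PMH Repository:', '').strip(), 'new registry'
    else:
        print '?', text
        return text.strip(), ''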

def parse_content(text):
    text = hparse.unescape(text)
    # name the parser explicitly instead of letting bs4 pick one
    soup = BeautifulSoup(text, 'lxml')
    lines = soup.text.split('\n')
    
    # we can ignore some of the paragraph blocks
    # given that we want to know that it's a new link
    # and what framework it came from (or moved to)
    
    info = {}
    for line in lines:
        if 'repository' in line and 'uses' in line:
            # drop only the sentence-final period so version numbers keep their dots
            info['repo'] = line.strip().split('uses')[-1].strip().rstrip('.')
        elif 'repository' in line and ' from ' in line and ' to ' in line:
            # padded with spaces: a bare 'to' also matches inside "repository"
            info['repo'] = line.strip().split(' to ')[-1].strip().rstrip('.')
            info['repo_changed'] = True
        
        if 'OpenDOAR' in line:
            info['source'] = 'OpenDOAR'
        
        # crude check: any mention of identifiers counts
        if 'identifiers' in line:
            info['dc:identifiers'] = True
    
    return info

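# only the first two entries for now, while the parsing is being worked out
for entry in entries[:2]:
    title = entry.xpath('./*[local-name()="title"]/text()')
    # the entry's rel="alternate" link (in the output below this is the blog post URL)
    oai_link = entry.xpath('./*[local-name()="link" and @rel="alternate"]/@href')
    
    # get the content which will be escaped html
    content = entry.xpath('./*[local-name()="content"]/text()')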
    
    name, event = parse_title(title[0])
    info = parse_content(content[0])
    
    oais.append(dict(chain.from_iterable((
        {
            "name": name, 
            "link": oai_link[0],
            "event": event
        }.items(),
        info.items()
    ))))
    


In [47]:
oais

[{'event': ' been migrated',
  'link': 'http://ekvv.uni-bielefeld.de/blog/baseoai/entry/publication_server_of_berlin_brandenburgischen',
  'name': 'Publication Server of Berlin-Brandenburgischen Akademie der Wissenschaften, Germany ',
  'repo': u'Opus 4 now',
  'source': 'OpenDOAR'},
 {'event': 'new registry',
  'link': 'http://ekvv.uni-bielefeld.de/blog/baseoai/entry/new_oai_pmh_repository_dokumentenserver1',
  'name': 'Dokumentenserver Klimawandel  of Climate Service Center/HZG , Germany',
  'repo': u'Opus 4'}]

So we have a list of dicts; let's make it pretty and countable.
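
Before reaching for pandas, here is a rough standard-library sketch of the counting step. It uses `collections.Counter` as a stand-in (not the pandas approach planned for the next cell) on the two records from the output above, with whitespace trimmed. `normalize_framework` is a hypothetical helper, and treating a trailing "now" as leftover phrasing is an assumption.

```python
from collections import Counter

# the two parsed records from the output above, whitespace trimmed
records = [
    {'name': 'Publication Server of Berlin-Brandenburgischen Akademie der Wissenschaften, Germany',
     'event': 'been migrated', 'repo': 'Opus 4 now', 'source': 'OpenDOAR'},
    {'name': 'Dokumentenserver Klimawandel  of Climate Service Center/HZG , Germany',
     'event': 'new registry', 'repo': 'Opus 4'},
]

def normalize_framework(value):
    # assumption: a trailing "now" is leftover phrasing, not part of the name
    value = value.strip()
    if value.endswith(' now'):
        value = value[:-len(' now')]
    return value

# count frameworks as parsed, then again after normalizing
raw_counts = Counter(r.get('repo', 'unknown') for r in records)
framework_counts = Counter(
    normalize_framework(r.get('repo', 'unknown')) for r in records)
# records without a 'source' key are counted separately
source_counts = Counter(r.get('source', 'not stated') for r in records)

print(raw_counts)
print(framework_counts)
print(source_counts)
```

Even with two records the raw counts split one framework across two labels, so some normalization of the parsed values is likely needed before any counts mean much. With pandas, `pd.DataFrame(oais)` followed by `value_counts()` on a column would give the same kind of tally.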

In [None]:
# in pandas