The OAI-PMH Blog (http://ekvv.uni-bielefeld.de/blog/baseoai/) has a reasonably well-structured registry of OAI-PMH servers noting things like the framework used, identifier usage, etc. 

This is some basic parsing and EDA on that registry to get:

1. the repository name and URL
2. the framework used
3. use of dc:identifiers
4. changes to the URL (at least to know)
5. whether it comes out of OpenDOAR

Note that this is not a robust parser: the wording of the posts is inconsistent, so the extraction is approximate.

In [35]:
import os
from lxml import etree
import json
from bs4 import BeautifulSoup
import HTMLParser
from itertools import chain

with open('oaipmh_blog_alle.atom', 'r') as f:
    text = f.read()

xml = etree.fromstring(text)

hparse = HTMLParser.HTMLParser()


In [7]:
# some little xml helpers
def generate_localname_xpath(tags):
    unchangeds = ['*', '..', '.', '//*']
    return '/'.join(
        ['%s*[local-name()="%s"]' % ('@' if '@' in t else '', t.replace('@', ''))
         if t not in unchangeds else t for t in tags])

def extract_attrib(elem, tags):
    e = extract_elem(elem, tags)
    return e.strip() if e else ''


def extract_attribs(elem, tags):
    # use the plural helper: we want every match, not just the first one
    es = extract_elems(elem, tags)
    return [m.strip() for m in es]


def extract_item(elem, tags):
    e = extract_elem(elem, tags)
    return e.text.strip() if e is not None and e.text else ''


def extract_items(elem, tags):
    es = extract_elems(elem, tags)
    return [e.text.strip() for e in es if e is not None and e.text]


def extract_elems(elem, tags):
    xp = generate_localname_xpath(tags)
    return elem.xpath(xp)


def extract_elem(elem, tags):
    xp = generate_localname_xpath(tags)
    return next(iter(elem.xpath(xp)), None)

In [8]:
# note: not dealing with namespacing at all.

entries = xml.xpath('//*/*[local-name()="entry"]')
len(entries)

200
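
As an aside on the namespace shortcut above, here is a minimal sketch of the namespace-aware alternative using only the standard library. The tiny feed is made up for illustration (the real input is oaipmh_blog_alle.atom), and it assumes the feed uses the standard Atom namespace.

```python
import xml.etree.ElementTree as ET

# Made-up two-entry feed, only for illustration.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>Repo A has been migrated</title></entry>
  <entry><title>New OAI-PMH Repository: Repo B</title></entry>
</feed>"""

# Prefix-to-URI map; the 'atom' prefix is an arbitrary local choice.
ATOM = {'atom': 'http://www.w3.org/2005/Atom'}

root = ET.fromstring(SAMPLE)

# Same idea as the local-name() xpath, but with the namespace spelled out.
entries = root.findall('atom:entry', ATOM)
titles = [e.findtext('atom:title', default='', namespaces=ATOM) for e in entries]

print(len(entries), titles)
```

The local-name() approach is shorter and tolerates an unexpected namespace; the explicit map is stricter, which may or may not be what you want for a one-off scrape.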

Let's talk about the patterns for a minute.

For the titles:

Many things are "{title of the service} has ...".

New services are marked as "New OAI-PMH Repository: ".

Because the blog is clearly written by hand, the phrase before "has" is not always just the name of the repository. Nor is it always clear whether a "migrated" post refers to a new URL or a new framework.
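
The title patterns could be matched roughly like this. It is a sketch in Python 3, not the parser used in this notebook: `split_title` is a hypothetical helper, the "have" variant is my addition, and the "Purchase College" title is made up to show why matching the whole word "has" matters.

```python
import re

# Split on the whole word "has" (or "have"), never on a substring.
_VERB = re.compile(r'\s+(?:has|have)\s+')
# "New ... Repository: <name>" style titles.
_NEW = re.compile(r'^New\b[^:]*repository[^:]*:\s*', re.IGNORECASE)


def split_title(title):
    """Return (name, event) for a post title; event is '' if unrecognized."""
    title = ' '.join(title.split())  # collapse runs of whitespace
    parts = _VERB.split(title, maxsplit=1)
    if len(parts) == 2:
        return parts[0], parts[1]
    match = _NEW.match(title)
    if match:
        return title[match.end():], 'new registry'
    return title, ''


for t in ['Purchase College Repository has been migrated',
          'New OAI-PMH Repository: Repo B',
          'UJDigispace of Univ. Johannesburg changed the OAI-PMH basicurl']:
    print(split_title(t))
```

A plain `'has' in text` check followed by `split('has')` would cut the first title inside "Purchase"; the word-boundary split avoids that while still leaving the free-form titles unparsed.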


For the text content:

For a new link, it's "the new basicurl is"; for a new entry, it's "the basicurl is" (however, we can get that link from the link/@rel=alternate element).

The framework used can be grokked but it also changes patterns often enough.
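
For the "basicurl" phrasing, one option is a single case-insensitive pattern plus a small link extractor. This is a hedged sketch in Python 3 using only the standard library; `basicurl_from_line` and `FirstHref` are hypothetical names, and the sample line and example.org URL are invented.

```python
import re
from html.parser import HTMLParser

# Matches "the basicurl is:", "the new basicurl is:",
# "the oai-pmh basicurl is:" and "the oa basicurl is:".
BASICURL = re.compile(r'the\s+(?:(new|oai-pmh|oa)\s+)?basicurl\s+is:',
                      re.IGNORECASE)


class FirstHref(HTMLParser):
    """Remember the href of the first <a> tag that is fed in."""

    def __init__(self):
        super().__init__()
        self.href = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'a' and not self.href:
            self.href = dict(attrs).get('href') or ''


def basicurl_from_line(line):
    """Return (url, is_new); url is '' when no phrase or no link is found."""
    match = BASICURL.search(line)
    if not match:
        return '', False
    parser = FirstHref()
    parser.feed(line)
    parser.close()
    return parser.href, (match.group(1) or '').lower() == 'new'


sample = 'The new basicurl is: <a href="http://example.org/oai?verb=Identify">link</a>'
print(basicurl_from_line(sample))
```

Returning the "new" flag alongside the URL keeps the distinction between a changed link and a first-time entry, which the prose above describes but a bare URL would lose.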

In [145]:
oais = []

def parse_title(text):
    if 'has' in text:
        parts = text.strip().split('has')
        # let's not call it a triple
        return ' '.join(parts[0].strip().split()), parts[1].strip()
    elif 'New' in text and 'repository' in text.lower(): 
        return ' '.join(text.split(':')[-1].strip().split()), 'new registry'
    else:
        print '?', text
        return text.strip(), ''

def parse_link(text):
    # little soup parser just to get the basicurl
    # this should be unescaped, etc, etc
    soup = BeautifulSoup(text)
    try:
        return soup.find_all('a')[0]['href']
    except Exception as ex:
        print text
        print ex
        return ''
    
def _clean(text):
    # this is just annoying
    replaces = ['.', 'now', '<br/>', '</p>']
    for r in replaces:
        text = text.replace(r, '')
    return text
    
def parse_content(text):
    soup = BeautifulSoup(text)
    lines = hparse.unescape(soup.text).split('\n')
    
    # we can ignore some of the paragraph blocks
    # given that we want to know that it's a new link
    # and what framework it came from (or moved to)
    
    info = {
        "repo": "",
        "repo_changed": False,
        "source": "",
        "dc:identifiers": False,
        "link": ""
    }
    
    for line in lines:
        if 'repository' in line and 'uses' in line:
            # also some html junk in here
            info['repo'] = _clean(line.strip().split('uses')[-1]).strip()
        elif 'repository' in line and 'from' in line and 'to' in line:
            info['repo'] = _clean(line.strip().split('to')[-1]).strip()
            info['repo_changed'] = True
        
        if 'OpenDOAR' in line:
            info['source'] = unicode('OpenDOAR')
        
        if 'identifiers' in line:
            info['dc:identifiers'] = True
            
        if 'the new basicurl is:' in line.lower() or 'the basicurl is:' in line.lower() \
            or 'the oai-pmh basicurl is:' in line.lower() or 'the oa basicurl is:' in line.lower():
            info['link'] = parse_link(line)
            
    return info

for entry in entries:
    title = entry.xpath('./*[local-name()="title"]/text()')
    # bah. wrong link.
    # oai_link = entry.xpath('./*[local-name()="link" and @rel="alternate"]/@href')
    
    # get the content which will be escaped html
    content = entry.xpath('./*[local-name()="content"]')
    
    name, event = parse_title(title[0])
    info = parse_content(etree.tostring(content[0]))
    
    oais.append(dict(chain.from_iterable((
        {
            "name": unicode(name), 
            "event": unicode(event)
        }.items(),
        info.items()
    ))))
    


? New OAI-PMH system: Research Index FOX of the Leuphana University Lüneburg, Germany
? New OA journal platform: SEDinst International Journals, The Science and Education Development Institutea
? New OA Journal: Otolaryngology online journal
? The Caltech Environmental Quality Laboratory Technical Reports have abondened their OAI-PMH interface
? UJDigispace of Univ. Johannesburg changed the OAI-PMH basicurl
? OA repository Linnaeus University Publications re-organized
? Brandeis University Digital Collections have moved to Brandeis Institutional Repository
? OAI-PMH interface available for Publications from Karolinska Institutet 
? Norwegian Brage repositoires with improved OAI-PMH interface
? OACIS of Tokyo University of Marine Science and Technology (TUMSAT) is accessible via OAI-PMH
? Migrated repository: RiuNet with new url and new basicurl
? ElAr, Open Electronic Archive of Kharkov National University of Radioelectronics is accessible via OAI-PMH now
? Colecciones Digitales Unimin

In [146]:
oais[:3]

[{'dc:identifiers': False,
  'event': u'been migrated',
  'link': 'http://edoc.bbaw.de/oai?verb=Identify',
  'name': u'Publication Server of Berlin-Brandenburgischen Akademie der Wissenschaften, Germany',
  'repo': u'Opus 4',
  'repo_changed': False,
  'source': u'OpenDOAR'},
 {'dc:identifiers': False,
  'event': u'new registry',
  'link': 'http://edoc.sub.uni-hamburg.de/klimawandel/oai?verb=Identify',
  'name': u'Dokumentenserver Klimawandel of Climate Service Center/HZG , Germany',
  'repo': u'Opus 4',
  'repo_changed': False,
  'source': ''},
 {'dc:identifiers': False,
  'event': u'been migrated to FreiDok plus',
  'link': 'https://www.freidok.uni-freiburg.de/oai/oai2.php?verb=Identify',
  'name': u'The FreiDok repository',
  'repo': u'a new platform software',
  'repo_changed': False,
  'source': ''}]

So we have a list of dicts; let's make it pretty and countable.

In [147]:
# in pandas
import pandas as pd

df = pd.DataFrame(oais)
df

Unnamed: 0,dc:identifiers,event,link,name,repo,repo_changed,source
0,False,been migrated,http://edoc.bbaw.de/oai?verb=Identify,Publication Server of Berlin-Brandenburgischen...,Opus 4,False,OpenDOAR
1,False,new registry,http://edoc.sub.uni-hamburg.de/klimawandel/oai...,Dokumentenserver Klimawandel of Climate Servic...,Opus 4,False,
2,False,been migrated to FreiDok plus,https://www.freidok.uni-freiburg.de/oai/oai2.p...,The FreiDok repository,a new platform software,False,
3,False,been migrated,http://eprints.teiwm.gr/cgi/oai2?verb=Identify,"Library of ΤΕΙ of Western Macedonia, Greece",EPrints,False,
4,False,changed the OAI-PMH basicurl,http://recherche.archives.somme.fr/oai_pmh.cgi...,Mémoires de la Somme Archives en ligne,,False,
5,True,improved the OAI-PMH interface,http://repository.unimilitar.edu.co/oai/reques...,Repositorio Documental UMNG of Universidad Mil...,DSpace,False,OpenDOAR
6,False,been migrated to DSpace at National Taiwan Nor...,http://dspace.lib.ntnu.edu.tw/oai/request?verb...,National Taiwan Normal University Repository,,False,OpenDOAR
7,False,changed the OAI-PMH basicurl,http://www.oceandocs.org/oai/request?verb=Iden...,OceanDocs,,False,OpenDOAR
8,False,been migrated,http://phka.bsz-bw.de/oai?verb=Identify,"OPUS-PHKA, the Publication Server of Karlsruhe...",Opus 4,False,OpenDOAR
9,False,switched the OAI-PMH basicurl again,,"Massey Research Online, the institutional repo...",DSpace,False,OpenDOAR


In [149]:
df.to_csv('oaipmh_blog_alle.csv', sep='|', encoding='utf-8')

### Some preliminary results 

(Apologies for not doing this in pandas; that is a to-do for later, along with some tweaking of the wonky URL extraction. Take the numbers with a grain of salt regarding "correct" parsing.)

The feed describes 200 OAI-PMH events - new repository, changes to the access URL, etc. 

Rough numbers for the platforms used (roughly half of the 200 entries included a recognizable value):

DSpace: 56
Eprints: 11
Opus: 10
Alfresco: 1
ContentDM: 1
DigitalCommons: 1
Fedora: 2
Goobi: 1
Islandora: 1
JaDoX: 1
Keystone: 2
MyCoRe: 2
OJS: 4
Pure: 1
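
A tally like the one above could be produced along these lines. This is only a sketch under assumptions: the input is the 'repo' column of the ten rows displayed earlier (not the full 200), and the `KNOWN` list and the substring matching in `normalize` are my own rough choices, not how the numbers above were actually counted.

```python
from collections import Counter

# 'repo' values of the ten rows shown in the DataFrame output above.
REPO_VALUES = ['Opus 4', 'Opus 4', 'a new platform software', 'EPrints', '',
               'DSpace', '', '', 'Opus 4', 'DSpace']

# Hypothetical list of platform names to look for (taken from the tally).
KNOWN = ['DSpace', 'EPrints', 'Opus', 'Alfresco', 'ContentDM',
         'DigitalCommons', 'Fedora', 'Goobi', 'Islandora', 'JaDoX',
         'Keystone', 'MyCoRe', 'OJS', 'Pure']


def normalize(value):
    """Map a free-text 'repo' value to a known platform name, else None."""
    lowered = value.lower()
    for name in KNOWN:
        if name.lower() in lowered:
            return name
    return None


# Count only the values that map to a known platform.
counts = Counter(n for n in map(normalize, REPO_VALUES) if n)

for name, n in counts.most_common():
    print('%s: %d' % (name, n))
```

Substring matching folds "Opus 4" into "Opus" but can also produce false positives on free text, so the unmatched values are worth eyeballing separately.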

79 entries were pulled from the OpenDOAR listings.

I am missing half of the links.

