# Migrating Breadcrumbs

<pre class="yaml">
date: 2017-12-29
tags: [publishing, breadcrumbs, ipython]
summary: migrating an old drupal blog
published: True
</pre>

_Breadcrumbs_ was the blog of [DIG][], the Decentralized Information Group at MIT CSAIL.

[DIG]: http://dig.csail.mit.edu/

In a [2015 #microformats chat][2015], I discovered that it was down:

> DanC> grr... the blog is down. http://dig.csail.mit.edu/breadcrumbs/node/228  
> "Unable to connect to database server"  
> _DanC verifies that he has an export of his work there..._  
> DanC> interesting... my backup is evidently python pickles of XMLRPC responses from the API of that CMS (drupal?)  
>     >>> x['dateCreated']
>     <DateTime '20080306T17:00:05' at 7f20e8aef5f0>
>     >>> x['dateCreated'].__class__
>     <class xmlrpclib.DateTime at 0x7f20e444eef0>

[2015]: http://logs.glob.uno/?c=freenode%23microformats&s=20+Jun+2015&e=20+Jun+2015#c81549

The backup is a directory of files, each named by its post number:

In [1]:
def _numbered_files(pattern='[0-9]*',
                    breadcrumbs='/home/connolly/sites/breadcrumbs'):
    from pathlib import Path
    return Path(breadcrumbs).glob(pattern)

breadcrumbs_bak = list(_numbered_files())
sorted(int(f.parts[-1]) for f in breadcrumbs_bak)[:10]

[4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
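
Converting each name to `int` matters because file names otherwise sort as text. Here is a minimal sketch of the difference, assuming Python 3 and using made-up file names in a temporary directory rather than the real backup:

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in for the backup directory; the names are made up.
with tempfile.TemporaryDirectory() as tmp:
    backup = Path(tmp)
    for name in ['4', '10', '228', 'README']:
        (backup / name).touch()

    # '[0-9]*' matches names that start with a digit, so README is skipped.
    names = [f.name for f in backup.glob('[0-9]*')]

lexical = sorted(names)                   # text order: '10' sorts before '4'
numeric = sorted(int(n) for n in names)   # post-number order
print(lexical)
print(numeric)
```

The text sort puts `'10'` ahead of `'4'`, while the integer sort gives the order a reader of the blog would expect.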

Each file is a pickled XML-RPC response, keyed here by post number; 228 is the post from the chat above:

In [2]:
import pickle

breadcrumbs_xmlrpc = {}
for f in breadcrumbs_bak:
    with f.open('rb') as fp:  # close each file once it is loaded
        breadcrumbs_xmlrpc[int(f.parts[-1])] = pickle.load(fp)
x = breadcrumbs_xmlrpc[228]
x['title'], x['dateCreated'], x['dateCreated'].__class__

('hAudio for microformats mixtapes, in progress',
 <DateTime '20080306T17:00:05' at 7fa8242a5320>,
 <class xmlrpclib.DateTime at 0x7fa82427cf58>)
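
The `dateCreated` value is an XML-RPC `DateTime`, which wraps a compact ISO 8601 string such as `20080306T17:00:05`. As a rough sketch of one way to turn it into a standard `datetime`, the string can be parsed directly. The helper name `to_datetime` is an assumption for illustration, not part of the migration code:

```python
from datetime import datetime

try:
    from xmlrpc.client import DateTime  # Python 3
except ImportError:
    from xmlrpclib import DateTime      # Python 2, as in the pickles


def to_datetime(stamp):
    """Parse the string form of an XML-RPC DateTime (hypothetical helper)."""
    # str() on a DateTime yields its wrapped value, e.g. '20080306T17:00:05'.
    return datetime.strptime(str(stamp), '%Y%m%dT%H:%M:%S')


created = to_datetime(DateTime('20080306T17:00:05'))
print(created.isoformat())
```

Parsing the string this way yields a naive `datetime` with the same wall-clock fields and does not depend on the local timezone.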

## MadMode blog pages

Each post becomes a Markdown page with a short metadata header, filed under the year it was created:

In [3]:
from __future__ import print_function  # __future__ imports must come first
from collections import OrderedDict
from sys import stderr


class BlogWriter(object):
    def __init__(self, pages):
        self._pages = pages

    def addPage(self, body, title, date, tags, published, slug):
        datestr = date.isoformat()
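        # Pass pairs rather than keyword arguments: keyword order is
        # not preserved in Python 2, so the header fields could be shuffled.
        headings = OrderedDict([
            ('title', repr(title)),
            ('date', datestr[:10]),
            ('tags', "[%s]" % ', '.join("'%s'" % tag for tag in tags)),
            ('published', published)])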
        header = '\n'.join(["%s: %s" % (k, v) for k, v in headings.iteritems()])
        yyyy = datestr[:4]
        page = (self._pages / yyyy / slug).with_suffix('.md')
        print("addPage: ", page, tags, file=stderr)
        with page.open('wb') as out:
            out.write(header)
            out.write('\n\n')
            out.write(body.encode('utf-8'))

def _madmode():
    from pathlib import Path

    return BlogWriter(Path('/home/connolly/sites') / 'madmode-blog' / 'pages')

mmwr = _madmode()
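
To make the page format concrete, here is a small sketch that mirrors the header and path logic of `addPage` without writing any files. The function name `front_matter` and the second tag are assumptions for illustration; the title and date are those of post 228:

```python
from datetime import datetime
from pathlib import PurePosixPath


def front_matter(title, date, tags, published):
    """Build a header in the same shape as BlogWriter.addPage (sketch)."""
    fields = [
        ('title', repr(title)),
        ('date', date.isoformat()[:10]),
        ('tags', '[%s]' % ', '.join("'%s'" % tag for tag in tags)),
        ('published', published),
    ]
    return '\n'.join('%s: %s' % (k, v) for k, v in fields)


when = datetime(2008, 3, 6, 17, 0, 5)
header = front_matter('hAudio for microformats mixtapes, in progress',
                      when, ['breadcrumbs', 'microformats'], True)

# Pages are filed by year, with a zero-padded post number in the slug.
page = (PurePosixPath('pages') / str(when.year)
        / ('breadcrumbs_%04d' % 228)).with_suffix('.md')

print(page)
print(header)
```

The header holds the same four fields that open this page: title, date, tags, and published.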

In [4]:
from time import mktime
from datetime import datetime
import re


def drupal2md(body):
    body = body.split('</title>', 1)[1]  # remove redundant title
    body = body.replace('\r', '')  # unix newlines
    return body


def findTags(body):
    tags = []
    for txt in body.split('<'):
        if txt.startswith('a '):
            txt = txt[len('a '):]
            attrs = {}
            while '=' in txt and not txt.startswith('>'):
                name, txt = txt.split('=', 1)
                name = name.strip()
                txt = txt.strip()
                _, value, txt = txt.split('"', 2)
                attrs[name] = value
                txt = txt.strip()
            href = attrs.get('href', '')
            if 'tag' in attrs.get('rel', '') or 'del.icio.us' in href:
                if href.endswith('/'):
                    href = href[:-1]
                tags.append(href.split('/')[-1])
    return tags


for postid, item in sorted(breadcrumbs_xmlrpc.items()):
    print(postid, item['title'], file=stderr)
    dt = datetime.fromtimestamp(mktime(item['dateCreated'].timetuple()))
    tags = ['breadcrumbs'] + findTags(item['content'])
    mmwr.addPage(drupal2md(item['content']), title=item['title'], date=dt,
                 tags=tags,
                 published=True, slug='breadcrumbs_%04d' % postid)

4 On OpenID and comment policies
addPage:  /home/connolly/sites/madmode-blog/pages/2005/breadcrumbs_0004.md ['breadcrumbs']
5 little burst of PAW demo hacking
addPage:  /home/connolly/sites/madmode-blog/pages/2005/breadcrumbs_0005.md ['breadcrumbs']
6 DIG blog wish list
addPage:  /home/connolly/sites/madmode-blog/pages/2005/breadcrumbs_0006.md ['breadcrumbs', 'connolly']
7 Fire at Southampton... hope everything's alright soon
addPage:  /home/connolly/sites/madmode-blog/pages/2005/breadcrumbs_0007.md ['breadcrumbs']
8 Sourceforge is the place... to sell soap?
addPage:  /home/connolly/sites/madmode-blog/pages/2005/breadcrumbs_0008.md ['breadcrumbs']
9 Reflecting blog structure into the Semantic Web with SIOC?
addPage:  /home/connolly/sites/madmode-blog/pages/2005/breadcrumbs_0009.md ['breadcrumbs']
10 I'd rather be...
addPage:  /home/connolly/sites/madmode-blog/pages/2005/breadcrumbs_0010.md ['breadcrumbs']
11 PHP angst
addPage:  /home/connolly/sites/madmode-blog/pages/2005/breadcrumbs_0
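
`findTags` splits on `<` and `"` by hand, which assumes double-quoted attributes. As an alternative sketch (not what this migration used), the standard-library `html.parser` in Python 3 can do the attribute parsing. The names `TagLinks` and `find_tags` are hypothetical, and the sample markup is made up rather than taken from the posts:

```python
from html.parser import HTMLParser


class TagLinks(HTMLParser):
    """Collect the last path segment of rel="tag" or del.icio.us links."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.tags = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attrs = dict(attrs)
        # Attribute values can be None when an attribute has no value.
        href = attrs.get('href') or ''
        rel = attrs.get('rel') or ''
        if 'tag' in rel or 'del.icio.us' in href:
            # Drop trailing slashes, then keep the final path segment.
            self.tags.append(href.rstrip('/').split('/')[-1])


def find_tags(body):
    parser = TagLinks()
    parser.feed(body)
    parser.close()
    return parser.tags


sample = ('<p>See <a href="http://example.org/tag/microformats/" rel="tag">'
          'microformats</a> and <a href=\'http://example.org/about\'>about</a>.</p>')
print(find_tags(sample))
```

The matching rule is the same as in `findTags` (a `rel` containing `tag`, or a del.icio.us link); only the attribute parsing is delegated to the library, so single-quoted attributes are handled as well.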

## PyData Tools

Loading the same responses into a pandas `DataFrame` gives a quick overview of the fields:

In [5]:
import pandas as pd
dict(pandas=pd.__version__)

{'pandas': u'0.17.1'}

In [6]:
items = pd.DataFrame.from_records(breadcrumbs_xmlrpc.values())
items.postid = items.postid.astype(int)
items = items.set_index('postid')
print(items.dtypes)
items[['title', 'dateCreated']].sort_values('dateCreated').head()

content              object
dateCreated          object
description          object
link                 object
mt_allow_comments     int64
mt_convert_breaks    object
permaLink            object
title                object
userid               object
dtype: object


| postid | title | dateCreated |
|---|---|---|
| 4 | On OpenID and comment policies | 20051024T23:28:49 |
| 5 | little burst of PAW demo hacking | 20051026T20:12:18 |
| 6 | DIG blog wish list | 20051026T20:14:27 |
| 7 | Fire at Southampton... hope everything's alrig... | 20051031T11:59:08 |
| 9 | Reflecting blog structure into the Semantic We... | 20051031T13:18:51 |


In [7]:
items.loc[[228], ['title', 'dateCreated']]

| postid | title | dateCreated |
|---|---|---|
| 228 | hAudio for microformats mixtapes, in progress | 20080306T17:00:05 |
