## Notebook revisions as tidy data.

This notebook creates a convenience function that loads the revision history of a Jupyter notebook into a `pandas.DataFrame`.

In [191]:
    from git import Blob, Repo
    from pandas import Series, read_json, to_datetime, concat

`expand` transforms `gitpython` objects into `pandas.Series` objects.

In [229]:
    def expand(x):
        # Keep public, lowercase-initial, non-callable attributes; skip the one-shot data_stream.
        return Series({
            name: getattr(x, name)
            for name in dir(x)
            if (name not in {'data_stream'}) and name[0].islower()
            and not callable(getattr(x, name))})

    assert isinstance(Blob.data_stream, property)
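
As a rough illustration of the filter in `expand`, the sketch below applies the same rule to a stand-in object. `FakeBlob` and its attributes are hypothetical and only mimic the shape of a `gitpython` object; no repository is needed.

```python
from pandas import Series


def expand(x):
    """Collect public, lowercase-initial, non-callable attributes of x."""
    return Series({
        name: getattr(x, name)
        for name in dir(x)
        if name not in {"data_stream"} and name[0].islower()
        and not callable(getattr(x, name))
    })


class FakeBlob:
    # Hypothetical stand-in, not a gitpython class.
    path = "notebook.ipynb"
    size = 42
    _private = "hidden"          # skipped: does not start with a lowercase letter

    def data_stream(self):       # skipped by name
        return None

    def hexdigest(self):         # skipped: callable
        return "abc"


row = expand(FakeBlob())
print(row)
```

Only `path` and `size` survive the filter, so each object becomes one flat row of plain values.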
    
We ignore the `data_stream` property because it holds the file contents of the revision as a stream, and that stream can only be read once.  _It took an unfortunate amount of time to discover this._
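
The read-once behaviour is easy to reproduce with any file-like object. The sketch below uses `io.BytesIO` as a stand-in, which is an assumption: it is not the stream type `gitpython` returns, but it shows why touching a stream early leaves nothing for a later reader.

```python
import io

# Stand-in for a revision's contents; the JSON payload is made up.
stream = io.BytesIO(b'{"cells": []}')

first = stream.read()    # consumes the whole stream
second = stream.read()   # nothing left to read

print(first, second)
```

If `expand` read the stream while collecting attributes, a later JSON parse of the same stream would see empty input.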

`get_history` will return a formatted pandas dataframe.

In [213]:
    def get_history(repo, path)->'DataFrame':
        # One row per commit touching `path`: blob attributes plus the commit and blob objects.
        # `concat` replaces `Series.append`, which was removed in pandas 2.0.
        df = concat([
            concat([expand(c.tree / path), Series(
                [c, c.tree / path], ['commit', 'blob']
            )])
            for c in Repo(repo).iter_commits(paths=path)], axis=1).T

        # Parse each revision's notebook JSON, then expand the commit attributes.
        df = df.blob.apply(lambda x: read_json(x.data_stream, typ='series')).join(df)
        df = df.commit.apply(expand).join(df, lsuffix='_')
        # Stack the cells so that each (revision, cell position) pair becomes a row.
        df = df.cells.apply(Series).stack().apply(Series).pipe(
            lambda df_: df_.reset_index(-1, drop=True).join(df, rsuffix='_').set_index(df_.index))
        df.index = df.index.set_names('id', level=1)
        df.committed_datetime = df.committed_datetime.pipe(to_datetime, utc=True)
        return df.reset_index(0, drop=True).set_index(
            'committed_datetime', append=True).reorder_levels((1,0))
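
The reshaping step is the densest part of `get_history`: a column holding one list of cell dicts per revision becomes one row per cell. The sketch below reproduces that shape on made-up data. Note that it swaps in `Series.explode` for the `apply(Series).stack()` chain, so it is an alternative route to a similar layout rather than the code above; the `revisions` frame and its values are hypothetical.

```python
from pandas import DataFrame

# Hypothetical stand-in for two notebook revisions.
revisions = DataFrame({
    "hexsha": ["aaa", "bbb"],
    "cells": [
        [{"cell_type": "markdown", "source": ["# Title"]}],
        [{"cell_type": "markdown", "source": ["# Title"]},
         {"cell_type": "code", "source": ["1 + 1"]}],
    ],
})

# One row per cell; the index repeats the revision label.
exploded = revisions.cells.explode()

# Expand each cell dict into columns, keeping the repeated index.
cells = DataFrame(exploded.tolist(), index=exploded.index)

# Position of each cell within its revision.
cells["id"] = cells.groupby(level=0).cumcount()

# Attach the revision-level columns and index by (revision, id).
tidy = (
    cells.join(revisions.drop(columns="cells"))
    .set_index("id", append=True)
)
print(tidy)
```

Each row now describes a single cell at a single revision, which is the tidy layout the heading refers to.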

The test below operates on a file in the `deathbeds` project.  We assert that the notebook components are expanded.  A dataframe is returned so it can be reused in an interactive session.

In [214]:
    def _test_deathbeds(): 
        df = get_history('..', 'deathbeds/2018-06-19-String-Node-Transformer.ipynb')
        assert all(x in df.columns for x in {'cells', 'metadata'}), """The dataframe didn't expand correctly"""
        return df

## A Sample view.

In [220]:
    Ø = __name__ == '__main__'
    df = Ø and _test_deathbeds()    
    Ø and df.sample(3)

committed_datetime,id,cell_type,metadata,source,execution_count,outputs,author,author_tz_offset,authored_date,authored_datetime,binsha_,...,link_mode,mime_type,mode,name,path,repo,size,type,commit,blob
2018-08-14 03:45:00+00:00,23,markdown,{},[### GraphViz Example],,,tonyfast,14400,1534218300,2018-08-13 23:45:00-04:00,b'|\xa0\xdds;0\x93\xf9;\xce\xae\x8fe\xd1`$I\xf...,...,40960,text/plain,33188,2018-06-19-String-Node-Transformer.ipynb,deathbeds/2018-06-19-String-Node-Transformer.i...,"<git.Repo ""C:\Users\deathbeds\deathbeds.github...",9790,blob,7ca0dd733b3093f93bceae8f65d1602449f532b8,c18d13b6aa860ee437584d049547022b7f168dac
2018-08-19 17:50:19+00:00,18,code,{},[ class DoctestString(StrTokenTransformer):...,9.0,[],Tony Fast,14400,1534701019,2018-08-19 13:50:19-04:00,b'\xdd\x0e\xbc\xc7V\xb1d\xb1\xa1\xf0b3\xb4\x1a...,...,40960,text/plain,33188,2018-06-19-String-Node-Transformer.ipynb,deathbeds/2018-06-19-String-Node-Transformer.i...,"<git.Repo ""C:\Users\deathbeds\deathbeds.github...",10276,blob,dd0ebcc756b164b1a1f06233b41a18cbb67bf7d3,c5ee562d54227081b260dfe0f089f02dcf827236
2018-08-18 00:23:08+00:00,29,markdown,{},[# What other replacements could you imagine?],,,Tony Fast,14400,1534551788,2018-08-17 20:23:08-04:00,b'\xbf\x05\xd3c\xb2\x8b<a@\x85\x1f}\xc1\xc7\x9...,...,40960,text/plain,33188,2018-06-19-String-Node-Transformer.ipynb,deathbeds/2018-06-19-String-Node-Transformer.i...,"<git.Repo ""C:\Users\deathbeds\deathbeds.github...",10293,blob,bf05d363b28b3c6140851f7dc1c79a7491e5fb67,7b8bdc1b59a42b82b512d2ed7adaed585176d419


The dataframe contains the following columns.

In [223]:
    Ø and __import__("IPython").display.Markdown(' '.join(f"`{x}`" for x in df.columns ))

`cell_type` `metadata` `source` `execution_count` `outputs` `author` `author_tz_offset` `authored_date` `authored_datetime` `binsha_` `committed_date` `committer` `committer_tz_offset` `conf_encoding` `default_encoding` `encoding` `env_author_date` `env_committer_date` `gpgsig` `hexsha_` `message` `name_rev` `parents` `repo_` `size_` `stats` `summary` `tree` `type_` `cells` `metadata_` `nbformat` `nbformat_minor` `abspath` `binsha` `executable_mode` `file_mode` `hexsha` `link_mode` `mime_type` `mode` `name` `path` `repo` `size` `type` `commit` `blob`
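
The trailing underscores in names such as `binsha_`, `hexsha_` and `metadata_` come from the `lsuffix`/`rsuffix` arguments to `DataFrame.join`, which rename columns that exist on both sides. A small sketch with made-up frames (the column values are hypothetical):

```python
from pandas import DataFrame

# Both frames have a `hexsha` column, as commits and blobs do.
commit_attrs = DataFrame({"hexsha": ["c1"], "message": ["init"]})
blob_attrs = DataFrame({"hexsha": ["b1"], "path": ["nb.ipynb"]})

# lsuffix renames only the overlapping column on the left-hand frame.
joined = commit_attrs.join(blob_attrs, lsuffix="_")
print(list(joined.columns))
```

Here the commit's `hexsha` becomes `hexsha_` while the blob's keeps its name, matching the pattern in the column list above.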

## Developer

In [228]:
    if __name__ == '__main__':
        !ipython -m pytest -- -c ../tox.ini 2018-08-25-Notebook-git-histories-as-dataframes.ipynb

platform win32 -- Python 3.6.5, pytest-3.5.1, py-1.5.3, pluggy-0.6.0
Matplotlib: 2.2.2
Freetype: 2.8.1
rootdir: C:\Users\deathbeds\deathbeds.github.io\deathbeds, inifile: ../tox.ini
plugins: xdist-1.22.5, testmon-0.9.12, remotedata-0.2.1, parallel-0.0.2, openfiles-0.3.0, mpl-0.9, localserver-0.4.1, forked-0.2, doctestplus-0.1.3, arraydiff-0.2, hypothesis-3.66.16, importnb-0.5.0
collected 1 item

2018-08-25-Notebook-git-histories-as-dataframes.ipynb .                  [100%]

