Software version repositories contain a huge amount of evolutionary data. Mining these repositories is a common way to gain insight into how a software product is developed. The data needs some preprocessing first, though, to avoid drawing false conclusions.

In this notebook, I show you how to read a Git repository into a Pandas DataFrame.

# Idea
The main idea is to use an existing Git library that provides the necessary (and hopefully efficient) access to a Git repository in Python.

In [1]:
import pandas as pd
import git

repo = git.Repo(r'C:\dev\repos\spring-petclinic', odbt=git.GitCmdObjectDB)
commits = pd.DataFrame(repo.iter_commits('master'), columns=['data'])
commits.head()

Unnamed: 0,data
0,410923f52a18ec2452b0a5ac47c7974f0f1d96ab
1,eddc72cfa8ec1f010b3b21e80e661d8d8ba33cf1
2,d77f31c96e9e82d6245d1e244dd5a91b444295a0
3,2f3e035c60551cb5caf2005b9caea8126a59a52e
4,1a6572d1ac0c7659d9243405074f3f19f9a93328


Our <tt>data</tt> column now contains all the commits as GitPython <tt>Commit</tt> objects:

In [2]:
last_commit = commits.loc[0, 'data']
type(last_commit)

git.objects.commit.Commit

The <tt>Commit</tt> objects are now our entry point for retrieving further data:

In [3]:
print(last_commit.__doc__)

Wraps a git Commit object.

    This class will act lazily on some of its attributes and will query the
    value on demand only if it involves calling the git binary.


It provides a huge variety of data on demand:

In [4]:
last_commit.__slots__

('tree',
 'author',
 'authored_date',
 'author_tz_offset',
 'committer',
 'committed_date',
 'committer_tz_offset',
 'message',
 'parents',
 'encoding',
 'gpgsig')

E.g. basic data like the commit message:

In [5]:
last_commit.message

"Downgrade Cobertura to enable 'mvn site'\n"

Or the date of the commit:

In [6]:
last_commit.committed_datetime

datetime.datetime(2016, 7, 28, 21, 3, 3, tzinfo=<git.objects.util.tzoffset object at 0x0000023C07A7E710>)
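
The value above carries its own UTC offset. Once many such values with different offsets sit in one column, Pandas may keep them as plain Python objects instead of a datetime dtype. The following is a minimal sketch of normalizing them to UTC; the two timestamps are synthetic stand-ins (the first mirrors the commit shown above, the second is made up), and the variable names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

import pandas as pd

# Synthetic stand-ins for commit timestamps with different UTC offsets.
# The first mirrors the commit shown above; the second is made up.
plus_two = timezone(timedelta(hours=2))
plus_one = timezone(timedelta(hours=1))
committed = pd.Series([
    datetime(2016, 7, 28, 21, 3, 3, tzinfo=plus_two),
    datetime(2016, 1, 15, 9, 30, 0, tzinfo=plus_one),
])

# utc=True converts every value to UTC, so the result has one datetime dtype.
committed_utc = pd.to_datetime(committed, utc=True)
print(committed_utc)
```

With a single timezone-aware dtype, the `.dt` accessor becomes available for time-based grouping or filtering.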

In [7]:
last_commit.author.name

'feststelltaste'

In [8]:
last_commit.author.email

'feststelltaste@googlemail.com'

Or file statistics about the commit:

In [9]:
last_commit.stats.files

{'pom.xml': {'deletions': 2, 'insertions': 4, 'lines': 6}}

Let's check how fast this is by retrieving the authors of all commits:

In [10]:
%%time
commits['author'] = commits['data'].apply(lambda x: x.author.name)
print(len(commits))
commits.head()

Wall time: 0 ns
477


Unnamed: 0,data,author
0,410923f52a18ec2452b0a5ac47c7974f0f1d96ab,feststelltaste
1,eddc72cfa8ec1f010b3b21e80e661d8d8ba33cf1,Antoine Rey
2,d77f31c96e9e82d6245d1e244dd5a91b444295a0,Antoine Rey
3,2f3e035c60551cb5caf2005b9caea8126a59a52e,Antoine Rey
4,1a6572d1ac0c7659d9243405074f3f19f9a93328,Antoine Rey


OK, it seems that this isn't measurable (but we only have 477 entries). Let's go further and retrieve some more data.
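
A reading of 0 ns probably says more about the resolution of the clock behind `%%time` than about the work itself. As a rough sketch, a short operation can be timed with `time.perf_counter`, which uses the highest-resolution clock available. The helper `time_call` and the stand-in workload are hypothetical; in the notebook, the `apply` call would take the workload's place. IPython's `%%timeit` magic is another option.

```python
import time


def time_call(func, repeat=5):
    """Return the shortest wall-clock duration (in seconds) of `repeat` calls."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in workload with made-up data, roughly the size of the commit list.
names = ["feststelltaste", "Antoine Rey"] * 250
duration = time_call(lambda: [name.upper() for name in names])
print(f"best of 5 runs: {duration * 1e6:.1f} microseconds")
```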

In [11]:
%%time
commits['email'] = commits['data'].apply(lambda x: x.author.email)
commits['committed_date'] = commits['data'].apply(lambda x: x.committed_datetime)
commits['message'] = commits['data'].apply(lambda x: x.message)
commits['sha'] = commits['data'].apply(lambda x: str(x))
commits.head()

Wall time: 0 ns


Unnamed: 0,data,author,email,committed_date,message,sha
0,410923f52a18ec2452b0a5ac47c7974f0f1d96ab,feststelltaste,feststelltaste@googlemail.com,2016-07-28 21:03:03+02:00,Downgrade Cobertura to enable 'mvn site'\n,410923f52a18ec2452b0a5ac47c7974f0f1d96ab
1,eddc72cfa8ec1f010b3b21e80e661d8d8ba33cf1,Antoine Rey,antoine.rey@gmail.com,2016-07-09 12:06:26+02:00,Upgrade Spring IO Platform to 2.0.6\n,eddc72cfa8ec1f010b3b21e80e661d8d8ba33cf1
2,d77f31c96e9e82d6245d1e244dd5a91b444295a0,Antoine Rey,antoine.rey@gmail.com,2016-07-06 19:00:32+02:00,Fix Jetty 9 startup\n,d77f31c96e9e82d6245d1e244dd5a91b444295a0
3,2f3e035c60551cb5caf2005b9caea8126a59a52e,Antoine Rey,antoine.rey@gmail.com,2016-07-06 18:18:40+02:00,The maven-war-plugin does not failed on missin...,2f3e035c60551cb5caf2005b9caea8126a59a52e
4,1a6572d1ac0c7659d9243405074f3f19f9a93328,Antoine Rey,antoine.rey@gmail.com,2016-07-06 18:04:23+02:00,Replace web.xml by PetclinicInitializer\n,1a6572d1ac0c7659d9243405074f3f19f9a93328


Dead easy and fast, but what about the modified files in the <tt>commit.stats</tt> object?

In [12]:
%%time
stats = pd.DataFrame(commits['data'].apply(
    lambda x: pd.Series(x.stats.files)).stack()).reset_index(level=1)
stats = stats.rename(columns={ 'level_1' : 'filename', 0 : 'stats_modifications'})
stats.head()

Wall time: 20.7 s
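
Part of that time may go into creating one `pd.Series` per commit, although fetching the statistics from Git is likely to cost something too. As an alternative sketch, the nested dictionaries can be flattened in plain Python first and handed to Pandas once. The dictionary `files_per_commit` below is a synthetic stand-in: the first two entries copy values shown in this notebook (with shortened hashes), the third is made up. With real data it could presumably be built from each commit's `stats.files`.

```python
import pandas as pd

# Synthetic stand-in for a mapping of commit hash -> commit.stats.files.
files_per_commit = {
    "410923f": {"pom.xml": {"insertions": 4, "deletions": 2, "lines": 6}},
    "eddc72c": {"pom.xml": {"insertions": 2, "deletions": 2, "lines": 4}},
    "0000000": {  # made-up commit touching two files
        "a.txt": {"insertions": 1, "deletions": 0, "lines": 1},
        "b.txt": {"insertions": 3, "deletions": 3, "lines": 6},
    },
}

# One flat dict per (commit, file) pair, built in ordinary Python.
records = [
    {"sha": sha, "filename": filename, **numbers}
    for sha, files in files_per_commit.items()
    for filename, numbers in files.items()
]

# A single DataFrame construction instead of one Series per commit.
file_stats = pd.DataFrame.from_records(records)
print(file_stats)
```

This yields one row per commit and file, with `insertions`, `deletions` and `lines` as ordinary numeric columns, so no second unpacking step is needed.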


In [13]:
commits = commits.join(stats)
commits.head()

Unnamed: 0,data,author,email,committed_date,message,sha,filename,stats_modifications
0,410923f52a18ec2452b0a5ac47c7974f0f1d96ab,feststelltaste,feststelltaste@googlemail.com,2016-07-28 21:03:03+02:00,Downgrade Cobertura to enable 'mvn site'\n,410923f52a18ec2452b0a5ac47c7974f0f1d96ab,pom.xml,"{'deletions': 2, 'lines': 6, 'insertions': 4}"
1,eddc72cfa8ec1f010b3b21e80e661d8d8ba33cf1,Antoine Rey,antoine.rey@gmail.com,2016-07-09 12:06:26+02:00,Upgrade Spring IO Platform to 2.0.6\n,eddc72cfa8ec1f010b3b21e80e661d8d8ba33cf1,pom.xml,"{'deletions': 2, 'lines': 4, 'insertions': 2}"
2,d77f31c96e9e82d6245d1e244dd5a91b444295a0,Antoine Rey,antoine.rey@gmail.com,2016-07-06 19:00:32+02:00,Fix Jetty 9 startup\n,d77f31c96e9e82d6245d1e244dd5a91b444295a0,src/main/webapp/WEB-INF/jetty-web.xml,"{'deletions': 0, 'lines': 7, 'insertions': 7}"
3,2f3e035c60551cb5caf2005b9caea8126a59a52e,Antoine Rey,antoine.rey@gmail.com,2016-07-06 18:18:40+02:00,The maven-war-plugin does not failed on missin...,2f3e035c60551cb5caf2005b9caea8126a59a52e,pom.xml,"{'deletions': 0, 'lines': 1, 'insertions': 1}"
4,1a6572d1ac0c7659d9243405074f3f19f9a93328,Antoine Rey,antoine.rey@gmail.com,2016-07-06 18:04:23+02:00,Replace web.xml by PetclinicInitializer\n,1a6572d1ac0c7659d9243405074f3f19f9a93328,readme.md,"{'deletions': 2, 'lines': 4, 'insertions': 2}"


In [14]:
stats_modifications = pd.DataFrame(commits['stats_modifications'].apply(
    lambda x: pd.Series(x)).stack()).reset_index(level=1)
stats_modifications = stats_modifications.rename(columns={ 'level_1' : 'change_type', 0 : 'lines'})
stats_modifications.head()



Unnamed: 0,change_type,lines
0,deletions,2.0
0,insertions,4.0
0,lines,6.0
1,deletions,2.0
1,insertions,2.0


In [15]:
commits = commits.join(stats_modifications)
commits.head()

Unnamed: 0,data,author,email,committed_date,message,sha,filename,stats_modifications,change_type,lines
0,410923f52a18ec2452b0a5ac47c7974f0f1d96ab,feststelltaste,feststelltaste@googlemail.com,2016-07-28 21:03:03+02:00,Downgrade Cobertura to enable 'mvn site'\n,410923f52a18ec2452b0a5ac47c7974f0f1d96ab,pom.xml,"{'deletions': 2, 'lines': 6, 'insertions': 4}",deletions,2.0
0,410923f52a18ec2452b0a5ac47c7974f0f1d96ab,feststelltaste,feststelltaste@googlemail.com,2016-07-28 21:03:03+02:00,Downgrade Cobertura to enable 'mvn site'\n,410923f52a18ec2452b0a5ac47c7974f0f1d96ab,pom.xml,"{'deletions': 2, 'lines': 6, 'insertions': 4}",insertions,4.0
0,410923f52a18ec2452b0a5ac47c7974f0f1d96ab,feststelltaste,feststelltaste@googlemail.com,2016-07-28 21:03:03+02:00,Downgrade Cobertura to enable 'mvn site'\n,410923f52a18ec2452b0a5ac47c7974f0f1d96ab,pom.xml,"{'deletions': 2, 'lines': 6, 'insertions': 4}",lines,6.0
1,eddc72cfa8ec1f010b3b21e80e661d8d8ba33cf1,Antoine Rey,antoine.rey@gmail.com,2016-07-09 12:06:26+02:00,Upgrade Spring IO Platform to 2.0.6\n,eddc72cfa8ec1f010b3b21e80e661d8d8ba33cf1,pom.xml,"{'deletions': 2, 'lines': 4, 'insertions': 2}",deletions,2.0
1,eddc72cfa8ec1f010b3b21e80e661d8d8ba33cf1,Antoine Rey,antoine.rey@gmail.com,2016-07-09 12:06:26+02:00,Upgrade Spring IO Platform to 2.0.6\n,eddc72cfa8ec1f010b3b21e80e661d8d8ba33cf1,pom.xml,"{'deletions': 2, 'lines': 4, 'insertions': 2}",insertions,2.0


In [16]:
commits.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5800291 entries, 0 to 476
Data columns (total 10 columns):
data                   object
author                 object
email                  object
committed_date         object
message                object
sha                    object
filename               object
stats_modifications    object
change_type            object
lines                  float64
dtypes: float64(1), object(9)
memory usage: 486.8+ MB
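
The jump from 477 commits to 5,800,291 rows is consistent with how `DataFrame.join` treats repeated index labels: every left row with a given label is paired with every right row carrying the same label. This is an interpretation of the output above, not something the notebook states. The toy frames below are hypothetical and only illustrate the effect.

```python
import pandas as pd

# Toy frames: label 0 appears twice on the left and three times on the right.
left = pd.DataFrame(
    {"filename": ["a.txt", "b.txt", "pom.xml"]},
    index=[0, 0, 1],
)
right = pd.DataFrame(
    {"change_type": ["deletions", "insertions", "lines", "lines"]},
    index=[0, 0, 0, 1],
)

joined = left.join(right)

# Label 0 yields 2 * 3 = 6 rows, label 1 yields 1 * 1 = 1 row.
print(len(left), len(right), len(joined))
```

Joining on a key that is unique per row, or building the long format in one step, would avoid this multiplication.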


In [19]:
commits.groupby('sha').count().sum()

data                   5800291
author                 5800291
email                  5800291
committed_date         5800291
message                5800291
filename               5800290
stats_modifications    5800290
change_type            5800290
lines                  5800290
dtype: int64