[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/chaoss/wg-gmd/master?filepath=implementations/Code_Changes-Git.ipynb)
# Code_Changes-Git

This is the reference implementation for Code_Changes,
a metric specified by the
[GMD Working Group](https://github.com/chaoss/wg-gmd) of the
[CHAOSS project](https://chaoss.community).
This implementation is specific to Git repositories.

See [README.md](README.md) to find out how to run this notebook (and others in this directory).

The implementation is described in two parts (see below):

* Retrieving data from the data source
* Class for computing Code_Changes

Some more auxiliary information in this notebook:

* Examples of the use of the implementation
* Examples of how to check for specific peculiarities of git commits

## Retrieving data from the data source

From the command line, run Perceval on the Git repositories to analyze.
This produces a file with JSON documents for all their commits,
one per line (`git-commits.json`).

As an example, we will use three Git repositories:
Perceval, SortingHat, and a fork of SortingHat.
To get data from your preferred repositories, change the URLs
(for example, you can use `https://github.com/elastic/elasticsearch-docker`
or `https://github.com/git/git`):

```
$ perceval git --json-line http://github.com/chaoss/grimoirelab-perceval > git-commits.json
[2019-01-28 21:05:45,461] - Sir Perceval is on his quest.
[2019-01-28 21:05:48,229] - Fetching commits: 'http://github.com/chaoss/grimoirelab-perceval' git repository from 1970-01-01 00:00:00+00:00 to 2100-01-01 00:00:00+00:00; all branches
[2019-01-28 21:05:49,727] - Fetch process completed: 1320 commits fetched
[2019-01-28 21:05:49,728] - Sir Perceval completed his quest.
$ perceval git --json-line http://github.com/chaoss/grimoirelab-sortinghat >> git-commits.json
...
[2019-01-28 21:07:27,169] - Fetch process completed: 635 commits fetched
[2019-01-28 21:07:27,169] - Sir Perceval completed his quest.
$ perceval git --json-line http://github.com/jgbarah-chaoss/grimoirelab-sortinghat >> git-commits.json
...
[2019-01-28 23:58:47,068] - Fetch process completed: 567 commits fetched
[2019-01-28 23:58:47,068] - Sir Perceval completed his quest.
```
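Before computing the metric, it can help to check what ended up in the file. The following is a minimal sketch, not part of the reference implementation: it counts the JSON documents per `origin` field (the same field the class below uses as the repository name). The helper name `commits_per_origin` and the sample URLs and hashes are made up for illustration.

```python
import io
import json
from collections import Counter

def commits_per_origin(lines):
    """Count JSON documents per 'origin' (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        item = json.loads(line)
        counts[item['origin']] += 1
    return counts

# Tiny in-memory stand-in for git-commits.json (made-up URLs and hashes)
sample = io.StringIO(
    '{"origin": "http://example.org/a.git", "data": {"commit": "aaa"}}\n'
    '{"origin": "http://example.org/a.git", "data": {"commit": "bbb"}}\n'
    '{"origin": "http://example.org/b.git", "data": {"commit": "aaa"}}\n'
)
per_origin = commits_per_origin(sample)
print(per_origin)

# With the real file:
# with open('git-commits.json') as commits_file:
#     print(commits_per_origin(commits_file))
```

With the three repositories above, the per-origin counts should match the "commits fetched" figures that Perceval logged.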

## Class for computing Code_Changes-Git

This implementation uses data retrieved as described above.
The implementation is encapsulated in the `Code_Changes` class,
which gets all commits for a set of repositories.

```python
import json
import datetime

import pandas as pd

class Code_Changes:
    """Class for Code_Changes for Git repositories.

    Objects are instantiated by specifying a file with the
    commits obtained by Perceval from a set of repositories.

    :param path: Path to file with one Perceval JSON document per line
    """

    @staticmethod
    def _summary(repo, cdata):
        """Compute a summary of a commit, suitable as a row in a dataframe"""

        summary = {
            'repo': repo,
            'hash': cdata['commit'],
            'author': cdata['Author'],
            'author_date': datetime.datetime.strptime(cdata['AuthorDate'],
                                                      "%a %b %d %H:%M:%S %Y %z"),
            'commit': cdata['Commit'],
            'commit_date': datetime.datetime.strptime(cdata['CommitDate'],
                                                      "%a %b %d %H:%M:%S %Y %z"),
            'files_no': len(cdata['files'])
        }
        # Number of files with an 'action' field (files actually touched)
        summary['files_action'] = sum(1 for file in cdata['files']
                                      if 'action' in file)
        # Commits with a 'Merge' field are treated as merge commits
        summary['merge'] = 'Merge' in cdata
        return summary

    def __init__(self, path):
        """Initializes self.df, the dataframe with one row per commit.
        """

        commits = []
        with open(path) as commits_file:
            for line in commits_file:
                commit = json.loads(line)
                commits.append(self._summary(repo=commit['origin'],
                                             cdata=commit['data']))
        # DataFrame.append() was removed in pandas 2.0: build the dataframe
        # directly from the list of summaries instead
        self.df = pd.DataFrame(commits,
                               columns=['repo', 'hash', 'author', 'author_date',
                                        'commit', 'commit_date',
                                        'files_no', 'files_action',
                                        'merge'])
        self.df['author_date'] = pd.to_datetime(self.df['author_date'], utc=True)
        self.df['commit_date'] = pd.to_datetime(self.df['commit_date'], utc=True)

    def total_count(self):
        """Count rows in the dataframe (one per commit and repository)"""

        return len(self.df.index)

    def count(self, since=None, until=None, empty=True, merge=True, date='author_date'):
        """Count number of commits (unique hashes)

        :param since: Period start (inclusive)
        :param until: Period end (exclusive)
        :param empty: Include empty commits
        :param merge: Include merge commits
        :param  date: Kind of date ('author_date' or 'commit_date')
        """

        df = self.df
        if since:
            df = df[df[date] >= since]
        if until:
            df = df[df[date] < until]
        if not empty:
            df = df[df['files_action'] != 0]
        if not merge:
            df = df[~df['merge'].astype(bool)]
        return df['hash'].nunique()

    def by_month(self):
        """Count rows by year and month of the author date"""

        return (self.df['author_date']
                .groupby([self.df.author_date.dt.year.rename('year'),
                          self.df.author_date.dt.month.rename('month')])
                .agg('count'))
```


Method `count()` implements `Count` aggregation for `Code_Changes`.
It accepts parameters specified for the general metric:
    
* Period of time: `since` and `until`

It accepts parameters specified for the specific case of Git:
    
* Include merge commits: `merge`
* Include empty commits: `empty`
* Kind of date: `date`
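
The effect of these parameters can be seen on a toy dataframe. This is a minimal sketch, assuming a dataframe shaped like `Code_Changes.df`; the hashes and dates are made up. It uses the same comparisons as `count()`, so `since` is inclusive and `until` is exclusive. Explicit UTC timestamps are used for the period bounds to keep the comparison with the timezone-aware column unambiguous.

```python
import pandas as pd

# Toy stand-in for Code_Changes.df (hashes and dates are made up)
df = pd.DataFrame({
    'hash': ['a1', 'b2', 'c3', 'd4'],
    'author_date': pd.to_datetime(
        ['2017-12-31 23:59:59', '2018-01-01 00:00:00',
         '2018-06-30 23:59:59', '2018-07-01 00:00:00'], utc=True),
    'files_action': [1, 0, 2, 1],
    'merge': [False, True, False, False],
})

since = pd.Timestamp('2018-01-01', tz='UTC')
until = pd.Timestamp('2018-07-01', tz='UTC')

# Same comparisons as count(): since is inclusive, until is exclusive
in_period = df[(df['author_date'] >= since) & (df['author_date'] < until)]
period_hashes = sorted(in_period['hash'])

# empty=False drops commits with no file actions
non_empty = in_period[in_period['files_action'] != 0]
# merge=False drops merge commits
non_merge = in_period[~in_period['merge']]

print("In period:", period_hashes)
print("Non-empty in period:", non_empty['hash'].nunique())
print("Non-merge in period:", non_merge['hash'].nunique())
```

The commit dated exactly `2018-01-01 00:00:00` falls inside the period, while the one dated exactly `2018-07-01 00:00:00` falls outside it.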

## Examples of use of the implementation

```python
changes = Code_Changes('git-commits.json')
print("Code changes total count:", changes.total_count())
print("Code changes count all period:", changes.count())
print("Code changes count from 2018-01-01 to 2018-07-01:",
      changes.count(since="2018-01-01", until="2018-07-01"))
print("Code changes count from 2018-01-01 to 2018-07-01 (no merge commits):",
      changes.count(since="2018-01-01", until="2018-07-01", merge=False))
print("Code changes count from 2018-01-01 to 2018-07-01 (no empty commits):",
      changes.count(since="2018-01-01", until="2018-07-01", empty=False))
```

```
Code changes total count: 2522
Code changes count all period: 1963
Code changes count from 2018-01-01 to 2018-07-01: 437
Code changes count from 2018-01-01 to 2018-07-01 (no merge commits): 317
Code changes count from 2018-01-01 to 2018-07-01 (no empty commits): 317
```
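
The total count (2522) is larger than the count for the whole period (1963) because `total_count()` counts rows, one per commit and repository, while `count()` counts unique hashes. A commit present in both a repository and its fork contributes two rows but only one hash. The sketch below illustrates this on a made-up dataframe; the repository names and hashes are hypothetical.

```python
import pandas as pd

# Made-up rows: the fork shares commits a1 and b2 with upstream
rows = pd.DataFrame({
    'repo': ['upstream', 'upstream', 'fork', 'fork', 'fork'],
    'hash': ['a1', 'b2', 'a1', 'b2', 'c3'],
})

total_rows = len(rows.index)             # what total_count() measures
unique_hashes = rows['hash'].nunique()   # what count() measures

# Hashes that appear in more than one row
shared = sorted(rows[rows.duplicated('hash', keep=False)]['hash'].unique())

print("Rows:", total_rows)
print("Unique hashes:", unique_hashes)
print("Shared hashes:", shared)
```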


## Examples showing peculiarities of git commits

Let's prepare a dictionary, `commits`, with all commits retrieved,
by reading the `git-commits.json` file.

```python
import json

commits = {}
with open('git-commits.json') as commits_file:
    for line in commits_file:
        commit = json.loads(line)
        commits[commit['data']['commit']] = commit
print("Total number of commits:", len(commits))
```

```
Total number of commits: 1963
```


### Naive count of commits

Let's compute the number of commits the easiest way, by counting all of them:

```python
code_commits = len(commits)
print("Code Commits (naive):", code_commits)
```

```
Code Commits (naive): 1963
```


### Ignoring empty commits

Empty commits are those that touch no file (for example, most merge commits). We can find them by looking at the list of files involved in the commit and checking that none of them has an `action` field (`action` identifies the action performed on the file, such as modification or creation):

```python
code_commits = 0
for commit in commits.values():
    for file in commit['data']['files']:
        if 'action' in file:
            code_commits += 1
            break

print("Code Commits (non-empty):", code_commits)
```

```
Code Commits (non-empty): 1615
```
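
The same check can be written as a small predicate. This is a sketch only, not part of the reference implementation: the helper name `is_empty`, the `file` key, and the `'M'` action value in the toy data are assumptions made for illustration.

```python
def is_empty(commit):
    """True when no file entry of the commit has an 'action' field
    (hypothetical helper)."""
    return not any('action' in file for file in commit['data']['files'])

# Toy commits (made-up hashes; 'file' key and 'M' value are illustrative)
toy_commits = {
    'a1': {'data': {'files': [{'file': 'x.py', 'action': 'M'}]}},
    'b2': {'data': {'files': []}},
    'c3': {'data': {'files': [{'file': 'y.py'}]}},
}

non_empty = sum(1 for commit in toy_commits.values() if not is_empty(commit))
print("Code Commits (non-empty):", non_empty)
```

A commit with an empty file list and a commit whose files carry no `action` field are both treated as empty, which matches the loop above.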


### Only non-merge commits

Now, instead of filtering out empty commits, let's filter out merge commits. These involve no real coding: they only merge commits from different branches (for example, after a pull request).

```python
code_commits = 0
for commit in commits.values():
    if 'Merge' not in commit['data']:
        code_commits += 1

print("Code Commits (non-merge):", code_commits)
```

```
Code Commits (non-merge): 1615
```
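
In Git, a merge commit is a commit with more than one parent, so the `parents` list offers a second way to detect merges. The sketch below compares both checks on made-up commit data. Whether the two always agree on real Perceval output is an assumption worth verifying on your own data; the helper names and the value of the `Merge` field are hypothetical.

```python
def is_merge_by_field(cdata):
    """Merge check used above: presence of the 'Merge' field."""
    return 'Merge' in cdata

def is_merge_by_parents(cdata):
    """Alternative check (assumption): more than one parent."""
    return len(cdata.get('parents', [])) > 1

# Made-up commit data; the 'Merge' value is illustrative only
toy = [
    {'commit': 'a1', 'parents': []},
    {'commit': 'b2', 'parents': ['a1']},
    {'commit': 'c3', 'parents': ['a1']},
    {'commit': 'd4', 'parents': ['b2', 'c3'], 'Merge': 'b2 c3'},
]

disagreements = [cdata['commit'] for cdata in toy
                 if is_merge_by_field(cdata) != is_merge_by_parents(cdata)]
non_merge = sum(1 for cdata in toy if not is_merge_by_parents(cdata))

print("Commits where the two checks disagree:", disagreements)
print("Code Commits (non-merge):", non_merge)
```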


### Only commits in master

In this case, we will consider only commits in the master branch:

```python
# Find commits in master branch.
# Start by adding the head of master to an empty todo set. Then loop until
# the todo set is empty: take a commit from the todo set, add it to the
# master set, and go backwards (finding parents), adding them to the todo set.

todo = set()
for commit_id, commit in commits.items():
    if 'HEAD -> refs/heads/master' in commit['data']['refs']:
        todo.add(commit_id)

master = set()
while len(todo) > 0:
    current = todo.pop()
    master.add(current)
    for parent in commits[current]['data']['parents']:
        if parent not in master:
            todo.add(parent)

code_commits = len(master)

print("Code Commits (master branch):", code_commits)
```

```
Code Commits (master branch): 1913
```
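
The traversal above can be checked on a small commit graph. This is a sketch under stated assumptions: the function name `ancestors` and the toy graph are made up, and the graph is given as a plain mapping from commit id to its list of parents rather than as Perceval documents.

```python
def ancestors(heads, parents_of):
    """Return the heads plus every commit reachable by following parents."""
    todo = set(heads)
    seen = set()
    while todo:
        current = todo.pop()
        seen.add(current)
        # Queue only parents not visited yet, so each commit is handled once
        todo.update(parent for parent in parents_of[current]
                    if parent not in seen)
    return seen

# Toy history: d merges b and c (both children of a);
# e is on a branch that was never merged
parents_of = {
    'a': [],
    'b': ['a'],
    'c': ['a'],
    'd': ['b', 'c'],
    'e': ['a'],
}

reachable = ancestors(['d'], parents_of)
print("Commits reachable from d:", sorted(reachable))
```

Commit `e` is not reachable from `d`, so it would not be counted as part of the branch whose head is `d`.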


### Only non-empty commits in master

Now, let's consider only those non-empty commits that you can find in the master branch. Run the next snippet after running the previous one, so that master has the right collection of commits.

```python
code_commits = 0
for commit_id in master:
    commit = commits[commit_id]
    for file in commit['data']['files']:
        if 'action' in file:
            code_commits += 1
            break

print("Code Commits (non-empty in master branch):", code_commits)
```

```
Code Commits (non-empty in master branch): 1572
```
