In this notebook, we demonstrate how we built our simple baseline model, as well as how it was evaluated.

In [1]:
import json
import os
import string

import scipy.sparse
from nltk.translate.bleu_score import sentence_bleu
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Location and file suffixes of the NNGen dataset splits
BASE_DATASET_PATH = '../nngen/data/'
DIFF_FILE_SUFFIX = '.diff'
COMMIT_FILE_SUFFIX = '.msg'

### Data Processing and Loading
We use the NNGen dataset to train our model. We preprocess the commit messages to remove punctuation and digits and to convert them to lowercase.
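
As a rough illustration of the normalization described above, the sketch below applies the same three rules (lowercasing, removing punctuation, removing digits) to every token of a message. The `normalize_message` helper and the sample message are hypothetical and are not part of the notebook's pipeline.

```python
import string

# Translation table that deletes punctuation and digits in a single pass
_DROP_CHARS = str.maketrans('', '', string.punctuation + string.digits)

def normalize_message(message):
    """Hypothetical helper: lowercase each token and strip punctuation and digits."""
    tokens = (token.lower().translate(_DROP_CHARS) for token in message.split())
    # Tokens made up only of punctuation or digits become empty and are dropped
    return ' '.join(token for token in tokens if token)

sample = "Updated URL of install-doc 2!"
normalized = normalize_message(sample)
print(normalized)
```

Building the translation table once avoids recreating it for every token, which may matter when normalizing a full dataset split.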

In [2]:
def preprocess_word(word):
    """Lowercase a token and strip punctuation and digits from it."""
    word = word.lower()
    word = word.translate(str.maketrans('', '', string.punctuation))
    word = word.translate(str.maketrans('', '', string.digits))
    return word

def read_data(split_name):
    """Read one dataset split and return (diffs, commit messages), one entry per line."""
    diff_path = os.path.join(BASE_DATASET_PATH, split_name + DIFF_FILE_SUFFIX)
    commit_path = os.path.join(BASE_DATASET_PATH, split_name + COMMIT_FILE_SUFFIX)
    with open(diff_path, 'r') as diff_file, open(commit_path, 'r') as commit_file:
        diff_lines = [diff.strip() for diff in diff_file]
        # Strip trailing newlines so the messages can later be joined with '\n'
        commit_lines = [line.strip() for line in commit_file]
    return diff_lines, commit_lines

train_diff, train_commit = read_data('cleaned.train')
valid_diff, valid_commit = read_data('cleaned.valid')
test_diff, test_commit = read_data('cleaned.test')


In [3]:
print(f"train_diff: {len(train_diff)}")
print(f"valid_diff: {len(valid_diff)}")
print(f"test_diff: {len(test_diff)}")
print(train_diff[42])
print(train_commit[42])

train_diff: 22112
valid_diff: 2511
test_diff: 2521
mmm a / INSTALL <nl> ppp b / INSTALL <nl> For installation instructions see the manual in the docs subdirectory <nl> - or online at < http : / / grails . codehaus . org / Installation > . <nl> + or online at < http : / / grails . org / Installation > . <nl>
updated url of install doc 



### Baseline Model
For the baseline, we use the NNGen model [(Paper)](https://dl.acm.org/doi/pdf/10.1145/3238147.3238190?casa_token=PQjtlNRBvJgAAAAA:dGLvlol87sT5a8biu2oEV9g5HWucpTiHaPZma8Iy1T3DNWCPQEEvupzyQK7mtg7WYRfn2SB_xSlu), which is a simple similarity-based statistical model that doesn't require any training.

Given a new diff, NNGen first finds the training diff that is most similar to it at the token level, then simply outputs that training diff's commit message as the generated commit message.
1. The approach first extracts diffs from the training set.
2. The training diffs and the new diff are represented as vectors in the form of “bags of words”.
3. The cosine similarity between the new diff vector and each training diff vector is calculated.
4. The top-k training diffs with the highest similarity scores are picked.
5. The BLEU-4 score between the new diff and each of the top-k training diffs is computed. The training diff with the highest BLEU-4 score is regarded as the nearest neighbor of the new diff.
6. Finally, the approach outputs the reference message of the nearest neighbor as the final result.
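
Steps 2 to 4 above can be sketched in plain Python on toy data. This is only an illustration under simplifying assumptions: the helper names (`bag_of_words`, `cosine`, `top_k_similar`) and the toy diffs are hypothetical, whitespace tokenization stands in for the vectorizer used in this notebook, and the BLEU-4 re-ranking of step 5 is left out.

```python
from collections import Counter
import math

def bag_of_words(diff):
    """Step 2: represent a diff as token counts (whitespace tokenization)."""
    return Counter(diff.split())

def cosine(a, b):
    """Step 3: cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k_similar(new_diff, train_diffs, k=2):
    """Step 4: indices of the k training diffs most similar to new_diff."""
    new_vec = bag_of_words(new_diff)
    scores = [cosine(new_vec, bag_of_words(d)) for d in train_diffs]
    ranked = sorted(range(len(train_diffs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Toy data, made up for illustration only
toy_train_diffs = [
    "- old url + new url",
    "+ add unit test",
    "- fix typo + fix typos in readme",
]
toy_new_diff = "- old link + new url"

candidates = top_k_similar(toy_new_diff, toy_train_diffs, k=2)
print(candidates)
```

In the full approach, the candidates returned here would then be re-ranked by BLEU-4 against the new diff, and the commit message of the best one would be emitted.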

To reproduce this algorithm, we adapted code from [(this repo)](https://github.com/vladislavneon/autogenerating-commit-messages/blob/master/nngen/).

In [4]:
## This is a one time run cell. Once the assets are generated, proceed from the following cells
vectorizer = CountVectorizer(token_pattern=r'\S+', stop_words=[''], min_df=8)
bow_matrix = vectorizer.fit_transform(train_diff)

In [5]:
# Cast indices to built-in int so they are JSON-serializable (they may be NumPy integers)
vocabulary = {token: int(index) for token, index in vectorizer.vocabulary_.items()}
print(len(vectorizer.vocabulary_))
print(f"Vocabulary size: {len(vocabulary)}")
# Save the bag-of-words matrix and the vocabulary so later cells can reload them
scipy.sparse.save_npz('train_bow_matrix.npz', bow_matrix)
with open('train_vocabulary.json', 'w') as ouf:
    ouf.write(json.dumps(vocabulary, sort_keys=True, indent=4))

3844
Vocabulary size: 3844


In [7]:
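def find_nearest_neighbor(simi, diffs, test_diff, candidate: int = 5) -> int:
    """Find the nearest neighbor using cosine similarity and BLEU score."""
    # Indices of the `candidate` most similar training diffs, most similar first
    candidates = simi.argsort()[-candidate:][::-1]
    max_score = 0
    max_idx = 0  # index 0 is returned if no candidate scores above 0
    for j in candidates:
        # Re-rank by BLEU: training diff as reference, test diff as hypothesis
        score = sentence_bleu([diffs[j].split()], test_diff.split())
        if score > max_score:
            max_score = score
            max_idx = j
    return max_idx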

In [8]:
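# Load the vocabulary and bag-of-words matrix saved earlier
with open('train_vocabulary.json', 'r') as inf:
    vocabulary = json.load(inf)
# Note: '<nl>' is filtered as a stop word here, whereas the training matrix was built with stop_words=['']
vectorizer = CountVectorizer(vocabulary=vocabulary, token_pattern=r'\S+', stop_words=['<nl>'])
analyzer = vectorizer.build_analyzer()
train_bow_matrix = scipy.sparse.load_npz('train_bow_matrix.npz')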

# convert test data to bow matrix and calculate cosine similarity
test_bow_matrix = vectorizer.transform(test_diff)
similarities = cosine_similarity(test_bow_matrix, train_bow_matrix)

In [None]:
# generate commit messages based on a nearest neighbor search
test_msgs = []
for idx, test_simi in enumerate(similarities):
    if (idx + 1) % 500 == 0:
        print(idx+1)
    max_idx = find_nearest_neighbor(test_simi, train_diff, test_diff[idx], candidate=5)
    test_msgs.append(train_commit[max_idx])
# Strip each message so the output file has exactly one message per line
with open('nn_test_msgs.txt', 'w') as ouf:
    ouf.write('\n'.join(msg.strip() for msg in test_msgs))

### Evaluation
The model is evaluated using the BLEU-4 score, a standard metric for evaluating the quality of machine translation systems. We use a Perl script from [(this repo)](https://github.com/karpathy/neuraltalk/tree/master/eval).
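
To give an intuition for what BLEU-4 measures, here is a simplified sentence-level sketch: the geometric mean of the modified 1- to 4-gram precisions, multiplied by a brevity penalty. It is an assumption-laden illustration, not the Perl script used below; that script may tokenize differently and aggregate statistics over the whole test set. The `sentence_bleu4` function and the example messages are hypothetical, and no smoothing is applied, so any missing n-gram order yields a score of 0.

```python
from collections import Counter
import math

def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu4(reference, hypothesis):
    """Simplified, unsmoothed BLEU-4 for one reference and one hypothesis."""
    ref, hyp = reference.split(), hypothesis.split()
    log_precisions = []
    for n in range(1, 5):
        hyp_counts = ngram_counts(hyp, n)
        ref_counts = ngram_counts(ref, n)
        # Clip each hypothesis n-gram count by its count in the reference
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize hypotheses that are shorter than the reference
    brevity = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity * math.exp(sum(log_precisions) / 4)

reference_msg = "updated url of install doc"
generated_msg = "updated url of install guide"
score = sentence_bleu4(reference_msg, generated_msg)
print(round(score, 4))
```

A generated message identical to the reference scores 1.0, while a message that shares no 4-gram with the reference scores 0 under this unsmoothed variant.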

In [None]:
!./bleu.perl <PATH_TO_REFERENCE> < nn_test_msgs.txt