# Evaluating Style for Commit Message Generation

### Notebook for initial dataset exploration

After the initial meeting on Tuesday, 31st October 2022

#### Load data and display first values

In [1]:
import pandas as pd

data = pd.read_csv("results.csv")
data.head()

| | hash | diff | message | author_email | author_name | committer_email | committer_name | project | split |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1de640cc59b4b3030447d567b3c99c50777bd760 | `a/setup.py b/setup.py\nindex <HASH>..<HASH> 1...` | setup: Detect if wheel and twine installed | gcushen@users.noreply.github.com | George Cushen | gcushen@users.noreply.github.com | George Cushen | gcushen_mezzanine-api | train |
| 1 | c1cce6fe5e49df5546c30a662fd141d41f4fc389 | `a/Builder.php b/Builder.php\nindex <HASH>..<H...` | [Builder] Adding root page in any case | g.passault@gmail.com | Gregwar | g.passault@gmail.com | Gregwar | Gregwar_Slidey | train |
| 2 | 2f7d97d15ea41f4112e74429617c5daad740d7cc | `a/web.go b/web.go\nindex <HASH>..<HASH> 10064...` | Added web.Urlencode method | hoisie@gmail.com | Michael Hoisie | hoisie@gmail.com | Michael Hoisie | hoisie_web | train |
| 3 | 6470cb3411381c95b6fc8d53f002cd50edd10ec3 | `a/spec/controllers/socializer/memberships/con...` | Prefer single-quoted strings when you don't ne... | tom.pietschker@acmetechnologygroup.com | Tom Pietschker | tom.pietschker@acmetechnologygroup.com | Tom Pietschker | socializer_socializer | train |
| 4 | 5e1725b604be5c3edb415725e33c86976e8cf8a2 | `a/demosys/view/screenshot.py b/demosys/view/s...` | Bug: Screenshot data should use the FBOs viewp... | eforselv@gmail.com | einarf | eforselv@gmail.com | einarf | Contraz_demosys-py | train |


#### Get first overview on data

In [2]:
data.describe()

| | hash | diff | message | author_email | author_name | committer_email | committer_name | project | split |
|---|---|---|---|---|---|---|---|---|---|
| count | 1665091 | 1665091 | 1665091 | 1664259 | 1664949 | 1664285 | 1664937 | 1665091 | 1665091 |
| unique | 1665091 | 1659398 | 1605673 | 219575 | 195197 | 169316 | 150739 | 71532 | 3 |
| top | 1de640cc59b4b3030447d567b3c99c50777bd760 | `a/tests/input/logictree_test.py b/tests/input...` | `Apply fixes from StyleCI (#<I>)` | michele.simionato@gmail.com | Michele Simionato | noreply@github.com | GitHub | saltstack_salt | train |
| freq | 1 | 7 | 1307 | 4991 | 5077 | 96629 | 96659 | 17501 | 1165564 |


We see that the dataset contains 1,665,091 entries.

There are 1,605,673 unique commit messages, created by 169,316 different committers (counted by email address) across 71,532 different projects.
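
These figures can also be computed directly from the columns instead of being read off `describe()`. A minimal sketch on a toy frame: the rows are invented for illustration and only the column names match the dataset.

```python
import pandas as pd

# Toy stand-in for the real data; all values are invented for illustration.
toy = pd.DataFrame(
    {
        "message": ["Fix typo", "Add tests", "Fix typo", "Bump version"],
        "committer_email": ["a@example.com", "b@example.com", None, "a@example.com"],
        "project": ["p1", "p1", "p2", "p3"],
    }
)

summary = {
    "entries": len(toy),
    "unique_messages": toy["message"].nunique(),
    # nunique() ignores missing values by default.
    "unique_committers": toy["committer_email"].nunique(),
    "unique_projects": toy["project"].nunique(),
}
print(summary)
```

Note that `committer_email` has fewer non-null values than there are rows in the real data, so the committer count excludes commits without a committer email.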

#### Tokenize and generate Vocabulary

In [3]:
import nltk

# word_tokenize needs the NLTK tokenizer models to be available locally.
data["tokenized_message"] = data["message"].apply(nltk.tokenize.word_tokenize)
tokenized_messages = data["tokenized_message"].tolist()
# Flatten the per-message token lists into a single token stream.
tokenized_messages_flat = [token for message in tokenized_messages for token in message]
total_vocab = nltk.lm.Vocabulary(tokenized_messages_flat)
print("The vocabulary contains {vocab_count} tokens without removing stopwords.".format(vocab_count=len(total_vocab.counts)))

The vocabulary contains 799001 tokens without removing stopwords.


In [17]:
tokenized_messages[:20] # TODO: Also look at original message

[['setup', ':', 'Detect', 'if', 'wheel', 'and', 'twine', 'installed'],
 ['[', 'Builder', ']', 'Adding', 'root', 'page', 'in', 'any', 'case'],
 ['Added', 'web.Urlencode', 'method'],
 ['Prefer',
  'single-quoted',
  'strings',
  'when',
  'you',
  'do',
  "n't",
  'need',
  'string',
  'interpolation',
  'or',
  'special',
  'symbols',
  '.'],
 ['Bug',
  ':',
  'Screenshot',
  'data',
  'should',
  'use',
  'the',
  'FBOs',
  'viewport',
  'Ultra',
  'wide',
  'screens',
  'seems',
  'to',
  'allocate',
  'a',
  'wider',
  'buffer',
  '.'],
 ['Ensure',
  'extra_disk_data',
  'is',
  'skipped',
  'if',
  'nil',
  'This',
  'commit',
  'skips',
  'over',
  'adding',
  'any',
  'extra_disk_data',
  'to',
  'the',
  'storage',
  'controller',
  'data',
  'structure',
  'in',
  'case',
  'it',
  "'s",
  'nil',
  '.'],
 ['invitation',
  'send',
  'prompt',
  'shows',
  'after',
  'a',
  'successful',
  'response'],
 ['Update',
  'admin/javascript/lang/de_DE.js',
  'fixed',
  'typo',
  'in',
  

In [4]:
import numpy as np

tokenized_messages_length = [len(sublist) for sublist in tokenized_messages]
average_length = np.mean(tokenized_messages_length)
print("The average commit message contains {average_length:.3f} Tokens.".format(average_length = average_length))

The average commit message contains 14.315 Tokens.


#### Stop word removal

In [5]:
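from nltk.corpus import stopwords
# Note: this rebinds the name `stopwords` from the corpus reader to a plain set.
stopwords = set(stopwords.words('english'))
# The membership test is case-sensitive; the NLTK stop word list is lowercase.
tokenized_messages_flat_without_stopwords = [item for item in tokenized_messages_flat if item not in stopwords]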
vocab = nltk.lm.Vocabulary(tokenized_messages_flat_without_stopwords)
print("The vocabulary contains {vocab_count} tokens when removing stopwords.".format(vocab_count = len(vocab.counts)))

The vocabulary contains 798831 tokens when removing stopwords.


In [19]:
len(stopwords)

179

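Stop word removal shrinks the vocabulary by only 170 entries (from 799,001 to 798,831), i.e. 170 of the 179 stop words occur as tokens in the commit messages.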
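
The stop word check above is case-sensitive, so capitalized forms such as `The` stay in the vocabulary. Whether that matters for the style analysis is an open question. A minimal sketch of the vocabulary-size comparison with and without case folding, using a tiny hard-coded stop list and token stream instead of the NLTK corpus and the real data:

```python
# Tiny illustrative stop list and token stream (not the NLTK list or the real data).
stop_words = {"the", "a", "to", "in"}
tokens = ["Fix", "the", "bug", "in", "The", "parser", "a", "A", "bug"]

vocab_all = set(tokens)
# Case-sensitive filter, as in the notebook cell.
vocab_case_sensitive = {t for t in vocab_all if t not in stop_words}
# Case-folded filter: compare the lowercased token against the stop list.
vocab_case_folded = {t for t in vocab_all if t.lower() not in stop_words}

removed_case_sensitive = len(vocab_all) - len(vocab_case_sensitive)
removed_case_folded = len(vocab_all) - len(vocab_case_folded)
print(removed_case_sensitive, removed_case_folded)
```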

### Exploration with spaCy

We take a subset of the first 100,000 messages because the full dataset cannot be processed locally.

In [6]:
subset_size = 100000

messages = data["message"][:subset_size].tolist()
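
Slicing the first rows keeps whatever order the CSV happens to have. If that order is not random (an assumption, since the notebook does not say how `results.csv` is sorted), a seeded random sample is a possible alternative. A minimal sketch on a toy frame with invented messages:

```python
import pandas as pd

# Toy stand-in for the real message column.
toy = pd.DataFrame({"message": [f"commit {i}" for i in range(1000)]})

subset_size = 100
# Sampling is without replacement by default; random_state makes it reproducible.
sample = toy["message"].sample(n=subset_size, random_state=0).tolist()
print(len(sample), len(set(sample)))
```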

In [7]:
import spacy

nlp = spacy.load("en_core_web_sm")

In [9]:
tokens = []
token_tags = []
cats = []
ents = []
vectors = []

# Mood analysis is not really valuable here: the imperative mood cannot be detected
# this way, and polarity is only sometimes given (and then only negative).
imperative_count = []


def token_filter(token):
    return not token.is_stop and token.is_alpha

docs = nlp.pipe(messages)

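for doc in docs:
    # Filtered-out tokens are recorded as None, so None also shows up in the counts.
    tokens.extend([token.lemma_ if token_filter(token) else None for token in doc])
    token_tags.extend([token.tag_ for token in doc])
    # doc.cats is a dict of category scores; extending a list with it adds its keys.
    cats.extend(doc.cats)
    ents.extend([ent.lemma_ for ent in doc.ents])
    vectors.append(doc.vector)
    # Despite the name, this collects each token's morphological analysis, not a count.
    imperative_count.extend([token.morph for token in doc])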


In [10]:
print("20 most common (lemmatized) tokens in first {subset_size} messages:".format(subset_size = subset_size))

from collections import Counter

tokens_count = Counter(tokens)
# Skip the first entry, presumably None (the placeholder for filtered-out tokens).
tokens_count.most_common(21)[1:]

20 most common (lemmatized) tokens in first 100000 messages:


[('fix', 20749),
 ('add', 17329),
 ('test', 11266),
 ('remove', 7423),
 ('use', 6156),
 ('update', 5640),
 ('change', 5470),
 ('method', 5208),
 ('error', 5186),
 ('url', 3951),
 ('file', 3932),
 ('check', 3768),
 ('version', 3606),
 ('issue', 3549),
 ('set', 3423),
 ('support', 3040),
 ('bug', 3003),
 ('return', 2833),
 ('Fix', 2655),
 ('code', 2644)]
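
The list contains both `fix` and `Fix`, so the lemmas are counted case-sensitively. A minimal sketch of folding case before counting, shown on a hand-written lemma list rather than the spaCy output:

```python
from collections import Counter

# Hand-written stand-in for the lemma list; None marks filtered-out tokens.
lemmas = ["fix", "Fix", None, "add", "fix", None, "Add", "test"]

# Drop the None placeholders and lowercase each lemma before counting.
folded_count = Counter(lemma.lower() for lemma in lemmas if lemma is not None)
print(folded_count.most_common(2))
```

Dropping the `None` placeholders at this point also removes the need to skip the first entry of `most_common`.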

In [11]:
print("Most common token tags in first {subset_size} messages:".format(subset_size = subset_size))

token_tags_count = Counter(token_tags)
token_tags_count.most_common(10)

Most common token tags in first 100000 messages:


[('NN', 294339),
 ('IN', 137675),
 ('VB', 94535),
 ('NNP', 90266),
 ('DT', 75444),
 ('JJ', 72476),
 ('NNS', 70804),
 ('XX', 62388),
 ('RB', 51032),
 ('_SP', 46894)]

In [26]:
spacy.explain("JJ")

'adjective (English), other noun-modifier (Chinese)'

In [12]:
print("Spacy finds the following categories: \n" + str(cats))
print("(Not expected to find any categories)")

Spacy finds the following categories: 
[]
(Not expected to find any categories)


In [13]:
ents_count = Counter(ents)
print("20 most common (lemmatized) entities in first {subset_size} messages:".format(subset_size = subset_size))
ents_count.most_common(20)

20 most common (lemmatized) entities in first 100000 messages:


[('first', 867),
 ('fix', 716),
 ('one', 666),
 ('fix #', 580),
 ('1', 477),
 ('2', 473),
 ('#', 447),
 ('0', 433),
 ('3', 429),
 ('api', 395),
 ('two', 370),
 ('Python', 321),
 ('API', 294),
 ('doc', 293),
 ('CI', 251),
 ('PHP', 236),
 ('improve', 234),
 ('second', 223),
 ('json', 200),
 ('zero', 193)]

In [14]:
from spacy import displacy

print("Structure of first commit message as an example")
displacy.render(nlp(messages[0]))

Structure of first commit message as an example


## Ideas / Questions for Next Steps

- Group by committers
- Imperative mood: not yet detectable with spaCy
- Are more explorations required?
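
One heuristic that could be tried for the imperative-mood item: treat a message as imperative when its first alphabetic token carries the base-form verb tag `VB`, which is among the most frequent tags above. This is an untested assumption, not a validated detector, and the tagger may well label sentence-initial verbs such as `Fix` as nouns. The sketch works on `(text, tag)` pairs so that it runs without a model; the function name `looks_imperative` is hypothetical and the tags are written by hand.

```python
def looks_imperative(tagged_tokens):
    """Return True if the first alphabetic token is tagged as a base-form verb (VB)."""
    for text, tag in tagged_tokens:
        if text.isalpha():
            return tag == "VB"
    return False


# Hand-written tags for illustration; with spaCy these pairs would come from
# [(token.text, token.tag_) for token in nlp(message)].
examples = {
    "Add web.Urlencode method": [("Add", "VB"), ("web.Urlencode", "NNP"), ("method", "NN")],
    "Added web.Urlencode method": [("Added", "VBD"), ("web.Urlencode", "NNP"), ("method", "NN")],
    "[Builder] Adding root page": [("[", "-LRB-"), ("Builder", "NNP"), ("]", "-RRB-"), ("Adding", "VBG")],
}
results = {message: looks_imperative(tags) for message, tags in examples.items()}
print(results)
```

Prefixes such as `[Builder]` or `setup:` would need to be stripped first, since the heuristic only looks at the first alphabetic token.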

Style:
- Styleformer is only suited to generation, not to style detection
- How can we find out about different styles?
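
As a possible starting point for grouping by committers and comparing styles, simple surface features can be aggregated per `committer_email`. The two features below (message length in characters and whether the message starts with an uppercase letter) are illustrative assumptions, not an established style measure, and the rows are invented.

```python
import pandas as pd

# Toy stand-in for the real data; all values are invented for illustration.
toy = pd.DataFrame(
    {
        "committer_email": ["a@example.com", "a@example.com", "b@example.com"],
        "message": ["Fix typo", "Add tests", "fixed the build again"],
    }
)

# Per-message surface features.
features = toy.assign(
    length=toy["message"].str.len(),
    starts_upper=toy["message"].str[0].str.isupper(),
)

# Aggregate the features per committer.
per_committer = features.groupby("committer_email").agg(
    commits=("message", "size"),
    mean_length=("length", "mean"),
    share_starts_upper=("starts_upper", "mean"),
)
print(per_committer)
```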