# K-Means Clustering Approach

#### Load Data and spaCy Model

In [1]:
import pandas as pd

Run this cell to load the full data set:

In [2]:
data = pd.read_pickle("data/02_All_Decreased_Filesize.pkl")

Or run this cell instead to load the Frequent Committers subset (it overwrites `data`):

In [3]:
data = pd.read_pickle("data/03_Subset_Frequent_Committers.pkl")

In [4]:
data.head(3)

Unnamed: 0,message,committer_email,project
0,Fixed an error happening when the memory stats...,michele.simionato@gmail.com,gem_oq-engine
1,Updated setup.py [skip CI],michele.simionato@gmail.com,micheles_decorator
2,Fixed an exposure test [skip hazardlib],michele.simionato@gmail.com,gem_oq-engine


In [5]:
import spacy

nlp = spacy.load("en_core_web_sm")

#### Take a Subset

In [6]:
subset_size = 100000

messages = data["message"][:subset_size].tolist()

#### Build Document Vectors

In [7]:
# Run the spaCy pipeline over all messages in batches and keep one vector per message.
vectors = [doc.vector for doc in nlp.pipe(messages)]

## Clustering the Subset with K-Means

#### Training

In [8]:
from sklearn.cluster import KMeans

# Fit k-means with 100 clusters and assign every message vector to its nearest center.
kmeans = KMeans(n_clusters=100)
kmeans_prediction = kmeans.fit_predict(vectors)
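
As a quick sanity check after fitting, it can help to look at how many samples land in each cluster. The sketch below is illustrative only: it uses random stand-in vectors and a small number of clusters rather than the notebook's spaCy vectors, and the `toy_*` names are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in data: 300 random vectors with an arbitrarily chosen dimension of 32.
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(300, 32))

# Fit k-means and get one cluster label per vector.
toy_kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
toy_labels = toy_kmeans.fit_predict(toy_vectors)

# Count how many vectors were assigned to each cluster.
cluster_sizes = np.bincount(toy_labels, minlength=toy_kmeans.n_clusters)
print(cluster_sizes)
```

Very small or very large clusters in such a count are a hint that the chosen number of clusters may not fit the data well.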

#### Evaluation

In [9]:
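# Import under an alias so the result does not shadow the function when the cell is re-run.
from utils.k_means import k_means_summary as build_k_means_summary

# Pass the same slice of the data that was clustered.
k_means_summary = build_k_means_summary(kmeans_prediction, kmeans.n_clusters, data[:subset_size])
k_means_summary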

Unnamed: 0,Number of Messages,Number of different Committers,Average number of commits per different Committer,Most common committer,Number of different Projects,Average number of commits per different Project,Most common project
0,395.0,21.0,18.809524,"('avwu@qq.com', 213)",44.0,8.977273,"('avwo_whistle', 209)"
1,1290.0,41.0,31.463415,"('taylor@laravel.com', 603)",112.0,11.517857,"('laravel_framework', 389)"
2,1268.0,46.0,27.565217,"('aaron.patterson@gmail.com', 120)",173.0,7.329480,"('saltstack_salt', 165)"
3,869.0,42.0,20.690476,"('P.Rudiger@ed.ac.uk', 65)",148.0,5.871622,"('saltstack_salt', 81)"
4,597.0,29.0,20.586207,"('postmodern.mod3@gmail.com', 194)",114.0,5.236842,"('ronin-ruby_ronin', 104)"
...,...,...,...,...,...,...,...
95,770.0,41.0,18.780488,"('jmettraux@gmail.com', 170)",154.0,5.000000,"('floraison_flor', 137)"
96,458.0,40.0,11.450000,"('igor.kroitor@gmail.com', 88)",89.0,5.146067,"('ccxt_ccxt', 88)"
97,944.0,40.0,23.600000,"('thatch45@gmail.com', 183)",144.0,6.555556,"('saltstack_salt', 301)"
98,915.0,35.0,26.142857,"('avwu@qq.com', 181)",81.0,11.296296,"('avwo_whistle', 179)"
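
A per-cluster summary with these columns could be computed roughly as follows. This is a minimal sketch, not the implementation in `utils.k_means`: the function name `summarize_clusters` is hypothetical, and it assumes the data frame has `committer_email` and `project` columns as shown in `data.head(3)`. The average number of commits per committer is the cluster size divided by the number of distinct committers, which matches the figures in the table (for example 395 / 21 ≈ 18.81 for cluster 0).

```python
from collections import Counter

import pandas as pd


def summarize_clusters(labels, frame):
    """Build one summary row per cluster label (hypothetical helper)."""
    rows, index = [], []
    # Attach the labels as a column and iterate over the clusters in label order.
    for label, group in frame.assign(cluster=labels).groupby("cluster"):
        committers = Counter(group["committer_email"])
        projects = Counter(group["project"])
        index.append(label)
        rows.append({
            "Number of Messages": len(group),
            "Number of different Committers": len(committers),
            "Average number of commits per different Committer": len(group) / len(committers),
            "Most common committer": committers.most_common(1)[0],
            "Number of different Projects": len(projects),
            "Average number of commits per different Project": len(group) / len(projects),
            "Most common project": projects.most_common(1)[0],
        })
    return pd.DataFrame(rows, index=index)


# Toy usage with made-up rows.
toy_frame = pd.DataFrame({
    "committer_email": ["a@example.com", "a@example.com", "b@example.com", "c@example.com"],
    "project": ["p", "p", "p", "q"],
})
toy_summary = summarize_clusters([0, 0, 0, 1], toy_frame)
print(toy_summary)
```

Collecting the rows in a list and building the data frame once avoids assigning into individual cells of an existing frame.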


In [10]:
k_means_summary.describe()

Unnamed: 0,Number of Messages,Number of different Committers,Average number of commits per different Committer,Number of different Projects,Average number of commits per different Project
count,100.0,100.0,100.0,100.0,100.0
mean,781.45,33.35,27.612337,98.34,12.362748
std,450.961383,13.955467,20.89371,57.472562,17.862529
min,83.0,2.0,4.588235,2.0,2.26087
25%,437.75,25.0,17.55947,45.5,5.30696
50%,771.5,39.5,22.765625,99.0,6.833621
75%,1038.5,44.0,29.932292,145.0,11.351687
max,2508.0,49.0,142.0,229.0,142.0


In [11]:
from utils.k_means import print_k_means_classes

print_k_means_classes(kmeans_prediction, kmeans.n_clusters, data[:subset_size])


________________ Class 0 ________________

___
1) 
Rounding avg lon, lat

- - - 
Committer: michele.simionato@gmail.com
Project:   gem_oq-engine
___
2) 
client/interfaces: Adds SendEvent function signature

- - - 
Committer: thomas.parrott@canonical.com
Project:   lxc_lxd
___
3) 
shared/util: IsFalse description

- - - 
Committer: thomas.parrott@canonical.com
Project:   lxc_lxd
___
4) 
lxd: Adds type field to instance API output

- - - 
Committer: thomas.parrott@canonical.com
Project:   lxc_lxd
___
5) 
device: Links gpu device

- - - 
Committer: thomas.parrott@canonical.com
Project:   lxc_lxd
___
6) 
lxd: Enable gorilla UseEncodedPath

- - - 
Committer: thomas.parrott@canonical.com
Project:   lxc_lxd
___
7) 
client/lxd: log websocket URL

- - - 
Committer: thomas.parrott@canonical.com
Project:   lxc_lxd
___
8) 
device/gpu: Updates unix function usage

- - - 
Committer: thomas.parrott@canonical.com
Project:   lxc_lxd
___
9) 
client/interfaces: Adds network peer management function defi

A k-means model was trained on the document vectors that spaCy computes for each commit message.

This clustering approach detects some commonalities among the commit messages. For example, in one cluster every message contains the tag `<I>`; in others, every message describes a fix; some clusters always contain a URL or a file path; and message length and sentence structure are sometimes similar within a cluster.

Because messages in the same cluster share such surface features, one could argue that they are also of approximately equal quality.
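
The commonalities described above can be checked per cluster rather than by reading examples. The following sketch is one possible way to do that; the helper name `cluster_feature_rates` and the chosen features (URL present, a "fix" keyword, message length) are assumptions made for illustration.

```python
import pandas as pd


def cluster_feature_rates(messages, labels):
    """Return, per cluster, the share of messages with a URL or a 'fix' keyword
    and the mean message length in characters (hypothetical helper)."""
    frame = pd.DataFrame({"message": messages, "cluster": labels})
    frame["has_url"] = frame["message"].str.contains(r"https?://", regex=True)
    frame["has_fix"] = frame["message"].str.contains(
        r"\bfix(?:ed|es)?\b", case=False, regex=True
    )
    frame["n_chars"] = frame["message"].str.len()
    # The mean of a boolean column is the fraction of True values.
    return frame.groupby("cluster")[["has_url", "has_fix", "n_chars"]].mean()


# Toy usage with made-up messages and labels.
toy_messages = [
    "Fixed an exposure test",
    "Fix typo in README",
    "See https://example.com/issue",
    "Docs at http://example.org",
]
rates = cluster_feature_rates(toy_messages, [0, 0, 1, 1])
print(rates)
```

A rate close to 0 or 1 for a feature within a cluster suggests that the cluster has picked up on that feature.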

Idea: Use the "elbow method" to choose a suitable number of cluster centers instead of fixing it at 100.
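
A minimal sketch of the elbow method, assuming scikit-learn's `KMeans` and its `inertia_` attribute (the within-cluster sum of squared distances). The data here is synthetic (`make_blobs` with 4 centers) and only stands in for the spaCy document vectors; the helper name `inertia_curve` is hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def inertia_curve(X, k_values, random_state=0):
    """Fit k-means once per k and return the inertia of each fit."""
    inertias = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        model.fit(X)
        inertias.append(model.inertia_)
    return inertias


# Synthetic stand-in for the document vectors (assumption: 4 true clusters).
X, _ = make_blobs(n_samples=400, centers=4, n_features=8, random_state=0)
k_values = list(range(1, 9))
inertias = inertia_curve(X, k_values)

for k, inertia in zip(k_values, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

Plotting inertia against k, the "elbow" is the point after which adding more clusters reduces the inertia only slightly. For the 100,000 commit-message vectors, the same loop would run over a much wider range of k and would take correspondingly longer.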

https://github.com/Hassaan-Elahi/Writing-Styles-Classification-Using-Stylometric-Analysis