# K-Means Clustering Approach

#### Load Data and spaCy Model

In [1]:
import pandas as pd

Execute this block for all data:

In [2]:
data = pd.read_pickle("../data/02_All_Decreased_Filesize.pkl")

Execute this block for the Frequent Committers Subset:

In [3]:
data = pd.read_pickle('../data/03_Subset_Frequent_Committers.pkl')

In [4]:
data.head(3)

Unnamed: 0,message,author_email,project
0,Fixed an error happening when the memory stats...,michele.simionato@gmail.com,gem_oq-engine
1,Updated setup.py [skip CI],michele.simionato@gmail.com,micheles_decorator
2,Fixed an exposure test [skip hazardlib],michele.simionato@gmail.com,gem_oq-engine


In [5]:
import spacy

nlp = spacy.load("en_core_web_sm")

#### Take a Subset

In [6]:
# Taking a subset has no effect when training on the frequent committer dataset, since it has fewer than 100000 samples

subset_size = 100000

messages = data["message"][:subset_size].tolist()

#### Build Document Vectors

In [7]:
# nlp.pipe processes the messages in batches; each resulting Doc exposes one fixed-length vector
vectors = [doc.vector for doc in nlp.pipe(messages)]

## Clustering the Subset with K-Means

#### Training

In [8]:
from sklearn.cluster import KMeans

# Cluster the document vectors into 100 groups
kmeans = KMeans(n_clusters=100)
kmeans.fit(vectors)
kmeans_prediction = kmeans.predict(vectors)

#### Evaluation

In [9]:
#import warnings
#warnings.filterwarnings('ignore')

from utils.k_means import k_means_summary

k_means_summary = k_means_summary(kmeans_prediction, kmeans.n_clusters, data)
k_means_summary

Unnamed: 0,Number of Messages,Number of different Authors,Average number of commits per different Author,Most common Author,Number of different Projects,Average number of commits per different Project,Most common project
0,1295.0,37.0,35.000000,"('thomas.parrott@canonical.com', 682)",77.0,16.818182,"('lxc_lxd', 751)"
1,284.0,33.0,8.606061,"('igor.kroitor@gmail.com', 97)",60.0,4.733333,"('ccxt_ccxt', 97)"
2,827.0,40.0,20.675000,"('marijnh@gmail.com', 100)",117.0,7.068376,"('saltstack_salt', 79)"
3,1173.0,39.0,30.076923,"('none@none', 318)",115.0,10.200000,"('vmalloc_dessert', 137)"
4,977.0,28.0,34.892857,"('github@contao.org', 483)",42.0,23.261905,"('contao_contao', 483)"
...,...,...,...,...,...,...,...
95,731.0,41.0,17.829268,"('P.Rudiger@ed.ac.uk', 179)",132.0,5.537879,"('pyviz_holoviews', 169)"
96,1157.0,42.0,27.547619,"('zacharyspector@gmail.com', 134)",152.0,7.611842,"('LogicalDash_LiSE', 134)"
97,419.0,39.0,10.743590,"('moodler', 45)",102.0,4.107843,"('moodle_moodle', 50)"
98,718.0,42.0,17.095238,"('thomas.parrott@canonical.com', 130)",115.0,6.243478,"('lxc_lxd', 138)"


In [17]:
print(f"There are {len(data['author_email'].unique())} different authors.")
print(f"There are {len(data['project'].unique())} different projects.")

There are 42 different authors.
There are 774 different projects.


In [10]:
k_means_summary.describe()

Unnamed: 0,Number of Messages,Number of different Authors,Average number of commits per different Author,Number of different Projects,Average number of commits per different Project
count,100.0,100.0,100.0,100.0,100.0
mean,683.25,29.4,26.928562,92.99,11.005705
std,390.359233,11.88752,25.278546,51.304891,20.3751
min,96.0,1.0,6.038462,6.0,2.532258
25%,401.0,22.75,16.996429,43.75,5.119243
50%,641.5,33.5,22.569444,96.0,6.306936
75%,869.75,39.0,30.27297,130.5,9.584318
max,2352.0,42.0,240.6,217.0,200.5


In [11]:
from utils.k_means import print_k_means_classes

print_k_means_classes(kmeans_prediction, kmeans.n_clusters, data[:subset_size])


________________ Class 0 ________________

___
1) 
Added an API /extract/asset_tags

___
2) 
Fixed calc/disagg_test.py

___
3) 
Added a test for extract/composite_risk_model.attrs

___
4) 
Fixed misprint in extract/losses_by_asset

___
5) 
Fixed datadir for the user openquake to /opt/openquake/oqdata

___
6) 
Managed case of missing calculation in /v1/calc/ID/status [skip hazardlib]

___
7) 
Fixed extract/realizations with collect_rlzs=True

___
8) 
lxd/db/storage/volumes: Set Snapshot: true in StorageVolumeArgs returned from GetLocalStoragePoolVolumeSnapshotsWithType

___
9) 
lxd/cluster/heartbeat: Removes deprecated Raft field from heartbeat

___
10) 
lxd/storage/drivers/driver/lvm/volumes: Allow unsafe shrinking when allowUnsafeResize is enabled

Also skip GPT header move.

_________________

Number of messages in this class: 1295
Most common author:
('thomas.parrott@canonical.com', 682)
Most common project:
('lxc_lxd', 751)



________________ Class 1 ________________

___
1) 
Fix

A k-means model was trained on the document vectors that spaCy computes for each commit message.

This clustering approach can detect some commonalities between the commit messages. For example, in one cluster all messages include the tag `<I>`, in others something is always being fixed, some always contain a URL or a file path, and sometimes the messages in a cluster also share a similar length and sentence structure.

For this reason, one could argue that the messages in one resulting k-means cluster should be of approximately equal quality.
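To check such commonalities beyond reading the printed samples, one could measure how often a surface pattern occurs in each cluster. The following is only a sketch: the three regular expressions and the function name `pattern_share_per_cluster` are hypothetical choices for illustration and are not part of the `utils.k_means` helpers used above.

```python
import re
from collections import defaultdict

# Hypothetical surface patterns, chosen only for illustration.
PATTERNS = {
    "has_url": re.compile(r"https?://\S+"),
    "has_path": re.compile(r"\S+/\S+"),
    "starts_with_fix": re.compile(r"^fix", re.IGNORECASE),
}


def pattern_share_per_cluster(messages, labels):
    """Return {cluster: {pattern name: fraction of messages that match}}."""
    counts = defaultdict(lambda: defaultdict(int))
    sizes = defaultdict(int)
    for message, label in zip(messages, labels):
        sizes[label] += 1
        for name, pattern in PATTERNS.items():
            if pattern.search(message):
                counts[label][name] += 1
    return {
        label: {name: counts[label][name] / size for name in PATTERNS}
        for label, size in sizes.items()
    }


# Toy input with made-up cluster labels; in the notebook one would pass
# `messages` and `kmeans_prediction` instead.
toy_messages = [
    "Fixed calc/disagg_test.py",
    "Added an API /extract/asset_tags",
    "Fix",
    "Updated setup.py [skip CI]",
]
toy_labels = [0, 0, 1, 1]
shares = pattern_share_per_cluster(toy_messages, toy_labels)
```

A cluster in which one pattern has a share close to 1.0 would support the observation that the clustering groups messages by such surface features.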

## Finding the best number of clusters

Idea: Use the "Elbow method" to calculate the best fitting number of cluster centers.

https://github.com/Hassaan-Elahi/Writing-Styles-Classification-Using-Stylometric-Analysis

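This method plots the inertia, i.e. the sum of the squared distances from each point to its assigned cluster centroid, against the number of clusters k. Since the inertia keeps decreasing as k grows, the goal is not to minimize it but to find the "elbow": the value of k beyond which adding more clusters yields only a small further reduction.

One can argue that this k offers a good trade-off between cluster compactness and the number of clusters.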
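The quantity tracked here, the sum of squared distances from each point to its closest centroid, can be computed by hand on a tiny example. This is a minimal sketch with made-up 2D points; `compute_inertia` is a hypothetical helper name, whereas in the notebook scikit-learn reports this value as `kmeans.inertia_`.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def compute_inertia(points, centroids):
    """Sum over all points of the squared distance to the closest centroid."""
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)


# Made-up points forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# One centroid at the overall mean versus one centroid per group.
inertia_one_cluster = compute_inertia(points, [(5.0, 1.0)])
inertia_two_clusters = compute_inertia(points, [(0.0, 1.0), (10.0, 1.0)])
```

Going from one to two centroids reduces the value sharply in this example, which illustrates why the inertia alone always favors a larger k.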

In [None]:
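inertia = []
K = range(100, 1000, 100)  # candidate values for k: 100, 200, ..., 900

# takes about 30s per iteration on M1 MacBook Air
for k in K:
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(vectors)
    inertia.append(kmeans.inertia_)  # sum of squared distances to the closest centroid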

In [None]:
import matplotlib.pyplot as plt

plt.plot(K, inertia)
plt.xlabel('k')
plt.ylabel('Inertia')
plt.title('Finding the optimal value for k')
plt.show()
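Reading the elbow off the plot is subjective, so it can help to locate it numerically as well. The sketch below uses one simple geometric heuristic: after scaling both axes to [0, 1], it picks the k whose point lies farthest from the straight line connecting the first and the last point of the curve. The function name `find_elbow` is hypothetical and the inertia values are synthetic placeholders, not results from this notebook.

```python
def find_elbow(ks, inertias):
    """Return the k whose point is farthest from the line through the end points.

    Assumes at least two distinct k values and a non-constant inertia curve.
    """
    ks = list(ks)
    x_span = ks[-1] - ks[0]
    y_min, y_max = min(inertias), max(inertias)
    y_span = y_max - y_min

    # Scale both axes to [0, 1] so that neither dominates the distance.
    xs = [(k - ks[0]) / x_span for k in ks]
    ys = [(v - y_min) / y_span for v in inertias]

    # Line through the first and last point, written as a*x + b*y + c = 0.
    a = ys[-1] - ys[0]
    b = xs[0] - xs[-1]
    c = xs[-1] * ys[0] - xs[0] * ys[-1]

    # The normalizing factor sqrt(a^2 + b^2) is constant and can be skipped.
    distances = [abs(a * x + b * y + c) for x, y in zip(xs, ys)]
    return ks[distances.index(max(distances))]


# Synthetic, monotonically decreasing values for illustration only.
example_ks = range(100, 1000, 100)
example_inertias = [9000, 6000, 4500, 3900, 3600, 3400, 3250, 3150, 3100]
elbow_k = find_elbow(example_ks, example_inertias)
```

In the notebook, the call would be `find_elbow(K, inertia)`. The result depends on the sampled range of k, so it should be treated as a starting point rather than a definitive answer.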