# K-Means Clustering Approach

#### Load Data and spaCy Model

Execute this block for all data:

In [14]:
import pandas as pd

In [15]:
data = pd.read_csv("results.csv")

Execute this block for the Frequent Committers Subset:

In [16]:
data = pd.read_pickle('Frequent_Committer_Subset')

In [17]:
data.head(3)

Unnamed: 0,message,committer_email,project
0,Fixed an error happening when the memory stats...,michele.simionato@gmail.com,gem_oq-engine
1,Updated setup.py [skip CI],michele.simionato@gmail.com,micheles_decorator
2,Fixed an exposure test [skip hazardlib],michele.simionato@gmail.com,gem_oq-engine


In [18]:
import spacy

nlp = spacy.load("en_core_web_sm")

#### Take a Subset

In [19]:
subset_size = 100000

messages = data["message"][:subset_size].tolist()

#### Build Document Vectors

In [20]:
# Run the spaCy pipeline over all messages and keep one
# fixed-length vector (doc.vector) per commit message.
vectors = [doc.vector for doc in nlp.pipe(messages)]

## Clustering the Subset with K-Means

#### Training

In [21]:
from sklearn.cluster import KMeans

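# Cluster the document vectors into 100 groups.
kmeans = KMeans(n_clusters=100)
kmeans.fit(vectors)
# Cluster label for each message, in the same order as `messages`.
kmeans_prediction = kmeans.predict(vectors)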

#### Evaluation

In [22]:
from collections import Counter

for label in range(kmeans.n_clusters):
    print("\n________________ Class " + str(label) + " ________________\n")
    class_counter = 1
    committer_emails = []
    projects = []
    print_examples = True
    for i, message in enumerate(messages):
        if class_counter == 11:
            print_examples = False
        if kmeans_prediction[i] == label:
            committer_emails.append(data["committer_email"][i])
            projects.append(data["project"][i])
            if print_examples:
                print("___")
                print(str(class_counter) + ") ")
                print(messages[i] + "\n")
                print("- - - ")
                print("Committer: " + str(data["committer_email"][i]))
                print("Project:   " + str(data["project"][i]))
            class_counter += 1
    print("_________________")
    print()
    print("Number of messages in this class: " + str(class_counter - 1))
    print("Most common committers:")
    print(Counter(committer_emails).most_common(2))
    print("Most common project:")
    print(Counter(projects).most_common(1)[0])
    print()
    print()


________________ Class 0 ________________

___
1) 
Cleaned up /tmp in datastore test_read

- - - 
Committer: michele.simionato@gmail.com
Project:   gem_oq-engine
___
2) 
Only display menu when user has access.

- - - 
Committer: crynobone@gmail.com
Project:   orchestral_control
___
3) 
Handle binding ImagineInterface as well.

- - - 
Committer: crynobone@gmail.com
Project:   orchestral_imagine
___
4) 
remove register installer, no longer applicable.

- - - 
Committer: crynobone@gmail.com
Project:   orchestral_foundation
___
5) 
Ignore search by identifier if it's empty.

- - - 
Committer: crynobone@gmail.com
Project:   laravie_authen
___
6) 
Use v-cloak until page is ready.

- - - 
Committer: crynobone@gmail.com
Project:   orchestral_foundation
___
7) 
Load relationship when available.

- - - 
Committer: crynobone@gmail.com
Project:   orchestral_model
___
8) 
Check if method exists.

- - - 
Committer: crynobone@gmail.com
Project:   orchestral_testbench-core
___
9) 
Add fullname valida

A k-means model was trained on the document vectors that spaCy computes for each commit message.

This clustering approach can detect some commonalities among the commit messages. For example, in one cluster all messages contain the tag `<I>`; in others, every message describes something being fixed; some clusters consist of messages that contain a URL or a file path; and in some clusters the messages have similar lengths and sentence structure.
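Observations like these can be checked numerically rather than by reading examples. The following is a minimal sketch, not part of the original notebook: the function name `pattern_share_per_cluster`, the regular expressions in `PATTERNS`, and the toy messages are all assumptions made for illustration. It computes, for each cluster, the fraction of messages that match each pattern.

```python
import re
from collections import defaultdict

# Hypothetical patterns for illustration; the notebook does not define them.
PATTERNS = {
    "tag_I": re.compile(r"<I>"),
    "url": re.compile(r"https?://\S+"),
    "fix": re.compile(r"\bfix(e[sd])?\b", re.IGNORECASE),
}


def pattern_share_per_cluster(messages, labels, patterns=PATTERNS):
    """Return {cluster_label: {pattern_name: fraction of messages matching}}."""
    counts = defaultdict(lambda: defaultdict(int))
    sizes = defaultdict(int)
    for message, label in zip(messages, labels):
        sizes[label] += 1
        for name, pattern in patterns.items():
            if pattern.search(message):
                counts[label][name] += 1
    return {
        label: {name: counts[label][name] / size for name in patterns}
        for label, size in sizes.items()
    }


# Toy input, made up for this sketch (not taken from the dataset).
toy_messages = [
    "Fixed <I> in parser",
    "Fixes crash, see https://example.com/issue",
    "Add <I> support",
    "Update docs",
]
toy_labels = [0, 0, 1, 1]
shares = pattern_share_per_cluster(toy_messages, toy_labels)
```

In the notebook, the same function could presumably be called as `pattern_share_per_cluster(messages, kmeans_prediction)`; a cluster with a share close to 1.0 for a pattern would support the corresponding observation.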

Because the messages within a cluster share such surface features, one could argue that the messages in a given k-means cluster are of approximately equal quality.

Idea: Use the "elbow method" to estimate a suitable number of cluster centers instead of fixing it at 100.
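A rough sketch of how the elbow could be located once inertia values are available. The helper `pick_elbow`, the candidate values of k, and the inertia numbers are assumptions for illustration only, not results from this dataset. The heuristic used here (the point farthest below the line joining the first and last points of the curve) is one common way to pick the bend, not the only one.

```python
def pick_elbow(ks, inertias):
    """Pick the k at the 'elbow' of a decreasing inertia curve.

    Heuristic: rescale both axes to [0, 1] and return the k whose point
    lies farthest below the straight line joining the first and last points.
    """
    if len(ks) != len(inertias) or len(ks) < 3:
        raise ValueError("need at least three (k, inertia) pairs")
    k_span = ks[-1] - ks[0]
    inertia_span = inertias[0] - inertias[-1]
    if k_span <= 0 or inertia_span <= 0:
        raise ValueError("ks must increase and inertia must decrease overall")

    best_k, best_gap = ks[0], 0.0
    for k, inertia in zip(ks, inertias):
        x = (k - ks[0]) / k_span                   # 0 at first k, 1 at last k
        y = (inertia - inertias[-1]) / inertia_span  # 1 at first k, 0 at last k
        gap = 1.0 - x - y  # proportional to the distance below the line x + y = 1
        if gap > best_gap:
            best_k, best_gap = k, gap
    return best_k


# Made-up inertia values for illustration; real ones must be measured.
candidate_ks = [10, 25, 50, 100, 200]
made_up_inertias = [1000.0, 600.0, 400.0, 340.0, 300.0]
chosen_k = pick_elbow(candidate_ks, made_up_inertias)
```

With scikit-learn, the inertia for each candidate k could be obtained as `KMeans(n_clusters=k).fit(vectors).inertia_`. Fitting on a smaller sample of the vectors first may keep the run time manageable.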

https://github.com/Hassaan-Elahi/Writing-Styles-Classification-Using-Stylometric-Analysis