# Develop a Prototype Featureset as Style Embedding

Goal: Model the style of committers by building a custom style embedding from features such as message length, polarity, and others that were explored earlier.

### Load Data

In [1]:
import pandas as pd

data = pd.read_pickle('data/03_Subset_Frequent_Committers.pkl')
data.head(3)

| | message | committer_email | project |
|---|---|---|---|
| 0 | Fixed an error happening when the memory stats... | michele.simionato@gmail.com | gem_oq-engine |
| 1 | Updated setup.py [skip CI] | michele.simionato@gmail.com | micheles_decorator |
| 2 | Fixed an exposure test [skip hazardlib] | michele.simionato@gmail.com | gem_oq-engine |


### Construct First Feature Set

A first feature set uses only the length of each message and the number of period characters (`.`) it contains. Its main purpose is to work out the implementation.

In [2]:
import numpy as np

feature_set = np.array([[len(message), message.count(".")] for message in data["message"]])

This feature set can be extended with many more features.
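As a rough sketch of such an extension, the snippet below adds two more columns to the length and period count used above. The helper name `extract_features` and the extra features (token count, share of uppercase characters) are illustrative assumptions, not features taken from the earlier exploration.

```python
import numpy as np

def extract_features(message):
    """Hypothetical helper: map one commit message to a numeric feature vector."""
    n_chars = len(message)
    n_words = len(message.split())
    n_upper = sum(ch.isupper() for ch in message)
    return [
        n_chars,                                # message length in characters
        message.count("."),                     # number of period characters
        n_words,                                # whitespace-separated tokens (assumed feature)
        n_upper / n_chars if n_chars else 0.0,  # share of uppercase characters (assumed feature)
    ]

# Two messages from the data preview, used here as a tiny stand-in for data["message"].
messages = ["Updated setup.py [skip CI]", "Fixed an exposure test [skip hazardlib]"]
feature_set_extended = np.array([extract_features(m) for m in messages], dtype=float)
```

Keeping the per-message logic in one function makes it easy to add or drop a feature without touching the rest of the pipeline.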

### Normalize

In [3]:
feature_set_normed = feature_set / feature_set.max(axis=0)

This normalization only works for non-negative data. To handle negative values as well, subtract the column minimum before scaling (min-max scaling).
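A minimal sketch of a normalization that also handles negative values by using the column minimum is shown here. The function name `min_max_normalize` and the guard for constant columns are assumptions added for illustration.

```python
import numpy as np

def min_max_normalize(features):
    """Scale every column to [0, 1]; hypothetical helper, not part of the notebook."""
    features = np.asarray(features, dtype=float)
    col_min = features.min(axis=0)
    col_range = features.max(axis=0) - col_min
    # Avoid division by zero for columns where all values are identical.
    col_range[col_range == 0] = 1.0
    return (features - col_min) / col_range

# Toy data with a negative column to show that the scaling still lands in [0, 1].
example = np.array([[-2.0, 10.0], [0.0, 20.0], [2.0, 30.0]])
normed = min_max_normalize(example)
```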

### Calculate Distance Matrix

For now, only a subset of the first 1000 messages is used because the computation is expensive.

In [4]:
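# Pairwise Euclidean distances between the first 1000 feature vectors.
# Note: this uses the unnormalized feature_set, not feature_set_normed.
subset = feature_set[:1000]
distance_matrix = np.array([
    [np.linalg.norm(feat_vector - compare_feat_vector) for compare_feat_vector in subset]
    for feat_vector in subset
])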
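The nested loop calls `np.linalg.norm` once per pair, which is slow in pure Python. A possible vectorized alternative based on NumPy broadcasting is sketched below. The function name `pairwise_euclidean` is an assumption, and the intermediate array of shape `(n, n, d)` means memory use grows quadratically with the number of messages.

```python
import numpy as np

def pairwise_euclidean(vectors):
    """Return the full matrix of Euclidean distances between all rows (hypothetical helper)."""
    vectors = np.asarray(vectors, dtype=float)
    # Broadcasting produces every pairwise difference at once: shape (n, n, d).
    diff = vectors[:, None, :] - vectors[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Toy vectors chosen so that the distances are easy to check by hand.
toy_vectors = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
toy_distances = pairwise_euclidean(toy_vectors)
```

If SciPy is available, `scipy.spatial.distance.cdist(vectors, vectors)` should give the same matrix without materializing the intermediate difference array.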

In [5]:
distance_matrix

array([[  0.        , 122.00409829, 109.        , ..., 126.        ,
        124.        ,  56.        ],
       [122.00409829,   0.        ,  13.03840481, ...,   4.12310563,
          2.23606798,  66.00757532],
       [109.        ,  13.03840481,   0.        , ...,  17.        ,
         15.        ,  53.        ],
       ...,
       [126.        ,   4.12310563,  17.        , ...,   0.        ,
          2.        ,  70.        ],
       [124.        ,   2.23606798,  15.        , ...,   2.        ,
          0.        ,  68.        ],
       [ 56.        ,  66.00757532,  53.        , ...,  70.        ,
         68.        ,   0.        ]])

Open question: how can such a large distance matrix be evaluated?
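One possible way to approach this question, offered only as a sketch, is to compare the mean distance between messages of the same committer with the mean distance between messages of different committers. If the embedding captures style, the first value should tend to be smaller. The function name `within_between_mean_distance` is hypothetical, and the toy labels stand in for `data["committer_email"]`.

```python
import numpy as np

def within_between_mean_distance(distance_matrix, labels):
    """Mean distance for same-label pairs vs. different-label pairs (hypothetical helper)."""
    distance_matrix = np.asarray(distance_matrix, dtype=float)
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]        # True where both rows share a label
    off_diagonal = ~np.eye(len(labels), dtype=bool)  # exclude each message paired with itself
    within = distance_matrix[same & off_diagonal].mean()
    between = distance_matrix[~same].mean()
    return within, between

toy_matrix = np.array([[0.0, 1.0, 4.0],
                       [1.0, 0.0, 5.0],
                       [4.0, 5.0, 0.0]])
toy_labels = ["a", "a", "b"]
within, between = within_between_mean_distance(toy_matrix, toy_labels)
```

Note that either mean is undefined (NaN) if no pair of that kind exists, for example when every label occurs only once.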

### Train K-Means

In [6]:
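from sklearn.cluster import KMeans

# Cluster the normalized feature vectors into 100 groups.
kmeans = KMeans(n_clusters=100)
kmeans.fit(feature_set_normed)
# Assign every message to its nearest cluster center.
kmeans_prediction = kmeans.predict(feature_set_normed)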

### Evaluate K-Means

In [7]:
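from utils import k_means

# Call the helper through its module so that the result variable
# does not shadow the imported function when the cell is re-run.
k_means_summary = k_means.k_means_summary(kmeans_prediction, kmeans.n_clusters, data)
k_means_summary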

SettingWithCopyWarning, raised repeatedly by the assignments listed below (output truncated):

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  k_means_summary['Number of different Committers'][label] = int(len(commiter_emails_count))
  k_means_summary['Average number of commits per different Committer'][label] = float(np.mean(list(commiter_emails_count.values())))
  k_means_summary['Most common committer'][label] = commiter_emails_count.most_common(1)[0]

| | Number of Messages | Number of different Committers | Average number of commits per different Committer | Most common committer | Number of different Projects | Average number of commits per different Project | Most common project |
|---|---|---|---|---|---|---|---|
| 0 | 2885.0 | 49.0 | 58.877551 | ('jaraco@jaraco.com', 407) | 234.0 | 12.329060 | ('saltstack_salt', 276) |
| 1 | 340.0 | 39.0 | 8.717949 | ('mark@mark-story.com', 53) | 75.0 | 4.533333 | ('cakephp_cakephp', 52) |
| 2 | 4267.0 | 49.0 | 87.081633 | ('michele.simionato@gmail.com', 321) | 286.0 | 14.919580 | ('saltstack_salt', 393) |
| 3 | 78.0 | 25.0 | 3.120000 | ('mark@mark-story.com', 9) | 27.0 | 2.888889 | ('saltstack_salt', 17) |
| 4 | 1750.0 | 47.0 | 37.234043 | ('michele.simionato@gmail.com', 255) | 154.0 | 11.363636 | ('gem_oq-engine', 282) |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 33.0 | 13.0 | 2.538462 | ('palehose@gmail.com', 9) | 15.0 | 2.200000 | ('saltstack_salt', 13) |
| 96 | 83.0 | 27.0 | 3.074074 | ('ingo@silverstripe.com', 17) | 30.0 | 2.766667 | ('saltstack_salt', 15) |
| 97 | 104.0 | 25.0 | 4.160000 | ('thomas.parrott@canonical.com', 12) | 31.0 | 3.354839 | ('saltstack_salt', 18) |
| 98 | 217.0 | 23.0 | 9.434783 | ('michele.simionato@gmail.com', 135) | 24.0 | 9.041667 | ('gem_oq-engine', 158) |
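The warnings above come from chained indexing of the form `k_means_summary[column][label] = value` inside the helper. The source of `utils.k_means` is not shown here, so the following is not its implementation. It is only a hypothetical sketch of how a similar per-cluster summary could be built with `groupby`, which avoids chained assignment altogether. The function name `summarize_clusters` and the toy data are assumptions.

```python
import pandas as pd

def summarize_clusters(labels, data):
    """Per-cluster statistics built with groupby (hypothetical alternative helper)."""
    labeled = data.assign(cluster=labels)  # returns a copy, the input frame stays unchanged
    grouped = labeled.groupby("cluster")
    return pd.DataFrame({
        "Number of Messages": grouped.size(),
        "Number of different Committers": grouped["committer_email"].nunique(),
        "Number of different Projects": grouped["project"].nunique(),
        "Most common committer": grouped["committer_email"].agg(
            lambda emails: emails.value_counts().idxmax()
        ),
    })

# Toy stand-in for the real data frame and the K-Means labels.
toy_data = pd.DataFrame({
    "message": ["Fix a", "Fix b", "Add c"],
    "committer_email": ["a@example.com", "a@example.com", "b@example.com"],
    "project": ["p", "q", "p"],
})
toy_summary = summarize_clusters([0, 0, 1], toy_data)
```

Building all columns from one grouped object keeps the summary in a single pass and leaves no element-wise assignments that could trigger the warning.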


In [8]:
k_means_summary.describe()

| | Number of Messages | Number of different Committers | Average number of commits per different Committer | Number of different Projects | Average number of commits per different Project |
|---|---|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| mean | 781.45 | 32.46 | 17.83051 | 83.44 | 5.621311 |
| std | 1230.788516 | 14.141721 | 25.31955 | 79.484109 | 4.295692 |
| min | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 25% | 57.0 | 22.0 | 2.653846 | 25.0 | 2.340476 |
| 50% | 235.0 | 36.5 | 6.440798 | 56.5 | 4.26758 |
| 75% | 682.0 | 45.25 | 15.17345 | 111.0 | 7.638716 |
| max | 4958.0 | 49.0 | 103.291667 | 309.0 | 17.519435 |


In [9]:
from utils.k_means import print_k_means_classes

print_k_means_classes(kmeans_prediction, kmeans.n_clusters, data)


________________ Class 0 ________________

___
1) 
Change .read_df to return strings and not bytes

- - - 
Committer: michele.simionato@gmail.com
Project:   gem_oq-engine
___
2) 
Small fix to ucerf_test.NO_SHARED_DIR [skip CI]

- - - 
Committer: michele.simionato@gmail.com
Project:   gem_oq-engine
___
3) 
Added method GeoPackager.read_all [ci skip]

- - - 
Committer: michele.simionato@gmail.com
Project:   gem_oq-engine
___
4) 
Raised some minimum dependencies in setup.py

- - - 
Committer: michele.simionato@gmail.com
Project:   gem_oq-engine
___
5) 
Preserving the argument order in sap.Script

- - - 
Committer: michele.simionato@gmail.com
Project:   gem_oq-engine
___
6) 
Not storing src.nsites in the SourceFilter

- - - 
Committer: michele.simionato@gmail.com
Project:   gem_oq-engine
___
7) 
Fixed memory issue in nrcan<I>_site_term.py

- - - 
Committer: michele.simionato@gmail.com
Project:   gem_oq-engine
___
8) 
Fixed a serious bug in ProbabilityMap.__or__

- - - 
Committer: michele.