
# REARRANGE: An Effort Estimation Approach for Software Clustering-based Remodularisation

## Abstract

Software clustering is often used as a remodularisation technique: it suggests refactoring operations that can improve the internal quality of a software system. This project aims to provide an end-to-end pipeline that helps developers carry out those refactoring activities through the following steps:


1. Estimate the effort needed to convert the current project structure to the suggested clustering result. (**REARRANGE**)

## Introduction

This repo contains the scripts and notebooks used in REARRANGE. This README shows the general flow of the end-to-end pipeline. For more details and the full implementation, refer to the specific notebooks.

## REARRANGE Experiment Design
1. Software Systems
2. Identifying Refactoring Operations
3. Proxy Measure for Refactoring Operations
4. Data Preparation
5. Building and Training Models
6. Model Validation
7. Estimation Techniques

In [8]:
import json
import pandas as pd
import numpy as np
import networkx as nx
import jellyfish
import os
import shutil
import subprocess
import requests
from github import Github
from git import Repo
from scipy.cluster.hierarchy import dendrogram, linkage
from matplotlib import pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn import preprocessing
from sklearn.cluster import AgglomerativeClustering
from zipfile import ZipFile
from filecmp import dircmp
import configparser

## 1. Software Systems 

**REARRANGE 01 - Crawl Github Commits.ipynb**

The main function of this notebook is to crawl the GitHub commits for each release of every software system in the selected dataset and obtain the following data.

**Inputs**: 
1. Project Name

**Outputs**: 
1. Project Release Name
2. Project Release Commit SHA

A sample of the data is given below.

In [5]:
github_commits_df = pd.read_csv('volatile_projects_complete_links_limit10_filtered.csv')
github_commits_df.head()

| | project_name | project_link | version_name | commit | timestamp |
|---|---|---|---|---|---|
| 0 | Dbeaver | https://github.com/dbeaver/dbeaver | 21.1.4 | Commit(sha="113a0a672f277a6e8181757a0c54f92d42... | 29/7/2021 11:08 |
| 1 | Dbeaver | https://github.com/dbeaver/dbeaver | 21.1.3 | Commit(sha="4430459a3fe06c6140aa40b71ddc41ddf8... | 15/7/2021 8:06 |
| 2 | Dbeaver | https://github.com/dbeaver/dbeaver | 21.1.2 | Commit(sha="b0693d44048a9c50e750b6df69cfe83fcb... | 2/7/2021 13:34 |
| 3 | Dbeaver | https://github.com/dbeaver/dbeaver | 21.1.1 | Commit(sha="073dfc26c7a065f5d5abf18be8cce8258a... | 18/6/2021 13:50 |
| 4 | Dbeaver | https://github.com/dbeaver/dbeaver | 21.1.0 | Commit(sha="17ce2d14317b1160ec9480da549028d182... | 28/5/2021 5:16 |
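The `commit` column stores the string form of a commit object rather than a bare SHA, and the timestamps are day-first. A minimal sketch of how such rows could be tidied and paired with their previous release, assuming the `Commit(sha="...")` format seen in the sample. The helper names, version labels, and SHAs below are hypothetical placeholders, not values from the dataset.

```python
import re
from datetime import datetime

# Matches the representation seen in the sample: Commit(sha="<hex digits>")
SHA_PATTERN = re.compile(r'Commit\(sha="([0-9a-f]+)"\)')


def extract_sha(commit_repr):
    """Return the hex SHA inside a Commit(sha="...") string, or None."""
    match = SHA_PATTERN.search(commit_repr)
    return match.group(1) if match else None


def parse_timestamp(text):
    """Parse the day-first timestamps used in the CSV, e.g. 29/7/2021 11:08."""
    return datetime.strptime(text, "%d/%m/%Y %H:%M")


# Hypothetical rows mirroring the CSV columns (placeholder SHAs).
rows = [
    {"version_name": "v3", "commit": 'Commit(sha="' + "a" * 40 + '")',
     "timestamp": "29/7/2021 11:08"},
    {"version_name": "v2", "commit": 'Commit(sha="' + "b" * 40 + '")',
     "timestamp": "15/7/2021 8:06"},
    {"version_name": "v1", "commit": 'Commit(sha="' + "c" * 40 + '")',
     "timestamp": "2/7/2021 13:34"},
]

# Normalise every row, then sort newest release first.
releases = sorted(
    (
        {
            "version": row["version_name"],
            "sha": extract_sha(row["commit"]),
            "released": parse_timestamp(row["timestamp"]),
        }
        for row in rows
    ),
    key=lambda release: release["released"],
    reverse=True,
)

# Pair each release with the one before it: (version, current SHA, previous SHA).
release_pairs = [
    (current["version"], current["sha"], previous["sha"])
    for current, previous in zip(releases, releases[1:])
]
```

The oldest release has no predecessor, so it appears only as the "previous" side of a pair.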


## 2. Identifying Refactoring Operations 

**REARRANGE 02 - Crawl Refactoring Miner Data.ipynb**

The main function of this notebook is to run Refactoring Miner for each project, version and commit, in order to obtain all refactoring details between the current commit and the previous commit.


**Inputs**:
1. Project Name
2. Release Name
3. Commit SHA (Current Release)
4. Commit SHA (Previous Release)

**Outputs**:
1. Number of Refactorings
2. Type of Refactoring
3. Location of Refactoring (Previous Location)
    * Start Line Number
    * End Line Number
4. Location of Refactoring (New Location)
    * Start Line Number
    * End Line Number
    
A sample of the data is given below.

In [11]:
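refactoring_miner_filename = 'raw_refactoringMiner/Okhttp/Okhttp_parent-5.0.0-alpha.2.json'
print(refactoring_miner_filename)
# Use a context manager so the file is closed after loading
with open(refactoring_miner_filename) as f:
    refactoring_miner = json.load(f)
# Print the first commit that contains at least one refactoring
for commit in refactoring_miner['commits']:
    if len(commit['refactorings']) > 0:
        print(commit)
        break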

raw_refactoringMiner/Okhttp/Okhttp_parent-5.0.0-alpha.2.json
{'repository': 'https://github.com/square/okhttp', 'sha1': '3e331c108905a97fa9718b40844ddc1356fc86b5', 'url': 'https://github.com/square/okhttp/commit/3e331c108905a97fa9718b40844ddc1356fc86b5', 'refactorings': [{'type': 'Move Class', 'description': 'Move Class okhttp3.mockwebserver.CustomDispatcherTest moved to mockwebserver3.CustomDispatcherTest', 'leftSideLocations': [{'filePath': 'mockwebserver/src/test/java/okhttp3/mockwebserver/CustomDispatcherTest.java', 'startLine': 31, 'endLine': 98, 'startColumn': 0, 'endColumn': 2, 'codeElementType': 'TYPE_DECLARATION', 'description': 'original type declaration', 'codeElement': 'okhttp3.mockwebserver.CustomDispatcherTest'}], 'rightSideLocations': [{'filePath': 'mockwebserver/src/test/java/mockwebserver3/CustomDispatcherTest.java', 'startLine': 31, 'endLine': 98, 'startColumn': 0, 'endColumn': 2, 'codeElementType': 'TYPE_DECLARATION', 'description': 'moved type declaration', 'codeEle
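The outputs listed for this step (number, type, and location of refactorings) can be derived from one commit entry of that JSON. A minimal sketch, assuming the field names visible in the sample output (`refactorings`, `type`, `leftSideLocations`, `rightSideLocations`, `startLine`, `endLine`); the function name and the cut-down sample commit are illustrative only.

```python
from collections import Counter


def summarise_refactorings(commit):
    """Summarise one commit entry from the Refactoring Miner JSON."""
    refactorings = commit["refactorings"]

    def line_span(locations):
        # Inclusive line count per location; overlapping locations are
        # counted more than once in this simple version.
        return sum(loc["endLine"] - loc["startLine"] + 1 for loc in locations)

    return {
        "num_refactorings": len(refactorings),
        "types": Counter(r["type"] for r in refactorings),
        "previous_lines": sum(line_span(r["leftSideLocations"]) for r in refactorings),
        "new_lines": sum(line_span(r["rightSideLocations"]) for r in refactorings),
    }


# Cut-down commit with the same shape as the sample output above.
sample_commit = {
    "sha1": "3e331c108905a97fa9718b40844ddc1356fc86b5",
    "refactorings": [
        {
            "type": "Move Class",
            "leftSideLocations": [{"startLine": 31, "endLine": 98}],
            "rightSideLocations": [{"startLine": 31, "endLine": 98}],
        }
    ],
}

summary = summarise_refactorings(sample_commit)
```

For the sample, lines 31 to 98 inclusive give 68 lines on each side.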

## 3. Proxy Measure for Refactoring Operations & 4. Data Preparation

**REARRANGE 03 - Merge Data (Github Commit, Refactoring Miner, Depends, CKMetrics).ipynb**

The main functions of this notebook are to:
1. Merge the data from the previous two notebooks with dependency features (Depends) and software features (CKMetrics).
2. Calculate the proxy measure for refactoring operations, given by refactoring LOC / total LOC in the commit.
3. Calculate the effort estimated by other software estimation models:
    * COCOMOII
    * GeneticP
    * SoftwareMaintenance
    * Fuzzy
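The proxy measure in step 2 can be sketched as below. This is an illustration under stated assumptions, not the notebook's exact code: the function names are hypothetical, the cap at 1.0 is an assumption, and scaling the commit's `time_taken` by the refactoring share is inferred from the column name `refactoring_perc_time_taken`. The input numbers are chosen for illustration.

```python
def refactoring_share(refactoring_loc, commit_line_changed):
    """Fraction of the lines changed in a commit that belong to refactorings."""
    if commit_line_changed <= 0:
        return 0.0
    # Assumption: cap at 1.0 in case refactored lines exceed changed lines.
    return min(1.0, refactoring_loc / commit_line_changed)


def refactoring_effort(time_taken, share):
    """Portion of the commit's time attributed to refactoring (assumed linear)."""
    return time_taken * share


# Illustrative commit: 344 refactored lines out of 1000 changed lines.
share = refactoring_share(344, 1000)    # 0.344
effort = refactoring_effort(9, share)   # approximately 3.096
```

A linear split is the simplest possible attribution; it treats every changed line as costing the same amount of time.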


**Inputs**:
1. Github Commit Data
2. Refactoring Miner Data
3. Depends Data
4. CKMetrics Data

**Outputs**:
1. Main DataFrame
    
A sample of the data is given below.

In [18]:
effort_estimation_df = pd.read_csv('Effort_Estimation_Results_3E_v2/Okhttp.csv')
effort_estimation_df.head()

| | kmean_label | time_taken_mean | time_taken_min | time_taken_max | time_taken_q10 | time_taken_q20 | time_taken_q25 | time_taken_q30 | time_taken_q40 | time_taken_q50 | ... | actual_num_of_classes_touched_min | actual_num_of_classes_touched_max | actual_num_of_classes_touched_std | commit_line_changed | refactoring_perc | refactoring_perc_time_taken | cocomoII_time_taken | geneticP_time_taken | softwareMaintenance_time_taken | fuzzy_time_taken |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 25.302839 | 1.0 | 167.0 | 1.0 | 2.0 | 3.0 | 5.0 | 8.0 | 11.0 | ... | 2 | 40 | 15.649814 | 1182 | 0.060068 | 1.0 | 611.61408 | 364.998402 | 7713.92 | 1176.156883 |
| 1 | 5 | 25.302839 | 1.0 | 167.0 | 1.0 | 2.0 | 3.0 | 5.0 | 8.0 | 11.0 | ... | 39 | 82 | 21.733231 | 1000 | 0.344 | 3.096 | 517.44 | 308.94336 | 6520.0 | 1052.384959 |
| 2 | 5 | 25.302839 | 1.0 | 167.0 | 1.0 | 2.0 | 3.0 | 5.0 | 8.0 | 11.0 | ... | 2 | 2 | | 338 | 1.0 | 146.0 | 174.89472 | 104.602433 | 2177.28 | 511.569022 |
| 3 | 5 | 25.302839 | 1.0 | 167.0 | 1.0 | 2.0 | 3.0 | 5.0 | 8.0 | 11.0 | ... | 1 | 8 | 2.366432 | 14 | 1.0 | 1.0 | 7.24416 | 4.336286 | 51.84 | 61.567253 |
| 4 | 7 | 24.024272 | 1.0 | 148.0 | 2.0 | 4.0 | 5.0 | 6.5 | 10.0 | 14.0 | ... | 2 | 48 | 19.605194 | 358 | 0.360335 | 1.081006 | 185.24352 | 110.78618 | 2308.48 | 531.504377 |


In [19]:
for column in effort_estimation_df:
    print(column)

kmean_label
time_taken_mean
time_taken_min
time_taken_max
time_taken_q10
time_taken_q20
time_taken_q25
time_taken_q30
time_taken_q40
time_taken_q50
time_taken_q60
time_taken_q70
time_taken_q75
time_taken_q80
time_taken_q90
sha
name
email
date
login
message
parent_sha
parent_date
time_taken
contains_refactoring
project_name
commit_compared_with
cbo_mean
cbo_min
cbo_max
cbo_std
wmc_mean
wmc_min
wmc_max
wmc_std
dit_mean
dit_min
dit_max
dit_std
rfc_mean
rfc_min
rfc_max
rfc_std
lcom_mean
lcom_min
lcom_max
lcom_std
totalMethods_mean
totalMethods_min
totalMethods_max
totalMethods_std
staticMethods_mean
staticMethods_min
staticMethods_max
staticMethods_std
publicMethods_mean
publicMethods_min
publicMethods_max
publicMethods_std
privateMethods_mean
privateMethods_min
privateMethods_max
privateMethods_std
protectedMethods_mean
protectedMethods_min
protectedMethods_max
protectedMethods_std
defaultMethods_mean
defaultMethods_min
defaultMethods_max
defaultMethods_std
abstractMethods_mean
abstractMe

## 5. Building and Training Models 

**REARRANGE 04 - Model Building.ipynb**

The main function of this notebook is to build a machine learning model with H2O AutoML that predicts ``refactoring_perc_time_taken_log`` from the following features.

'cbo_mean','cbo_min','cbo_max', 'cbo_std',
 'wmc_mean','wmc_min','wmc_max', 'wmc_std',
 'dit_mean','dit_min','dit_max', 'dit_std',
 'rfc_mean', 'rfc_min', 'rfc_max', 'rfc_std',
 'lcom_mean', 'lcom_min', 'lcom_max', 'lcom_std',
 'totalMethods_mean', 'totalMethods_min', 'totalMethods_max', 'totalMethods_std',
 'staticMethods_mean', 'staticMethods_min', 'staticMethods_max', 'staticMethods_std',
 'publicMethods_mean', 'publicMethods_min', 'publicMethods_max', 'publicMethods_std',
 'privateMethods_mean', 'privateMethods_min', 'privateMethods_max', 'privateMethods_std',
 'protectedMethods_mean', 'protectedMethods_min', 'protectedMethods_max', 'protectedMethods_std',
 'defaultMethods_mean', 'defaultMethods_min', 'defaultMethods_max', 'defaultMethods_std',
 'abstractMethods_mean', 'abstractMethods_min', 'abstractMethods_max', 'abstractMethods_std',
 'finalMethods_mean', 'finalMethods_min', 'finalMethods_max', 'finalMethods_std',
 'synchronizedMethods_mean', 'synchronizedMethods_min', 'synchronizedMethods_max', 'synchronizedMethods_std',
 'totalFields_mean', 'totalFields_min', 'totalFields_max', 'totalFields_std',
 'staticFields_mean', 'staticFields_min', 'staticFields_max', 'staticFields_std',
 'publicFields_mean', 'publicFields_min', 'publicFields_max', 'publicFields_std',
 'privateFields_mean', 'privateFields_min', 'privateFields_max', 'privateFields_std',
 'protectedFields_mean', 'protectedFields_min', 'protectedFields_max', 'protectedFields_std',
 'defaultFields_mean', 'defaultFields_min', 'defaultFields_max', 'defaultFields_std',
 'finalFields_mean', 'finalFields_min', 'finalFields_max', 'finalFields_std',
 'synchronizedFields_mean', 'synchronizedFields_min', 'synchronizedFields_max', 'synchronizedFields_std',
 'nosi_mean', 'nosi_min', 'nosi_max', 'nosi_std',
 'loc_mean', 'loc_min', 'loc_max','loc_std',
 'returnQty_mean', 'returnQty_min', 'returnQty_max', 'returnQty_std',
 'loopQty_mean', 'loopQty_min', 'loopQty_max', 'loopQty_std',
 'comparisonsQty_mean', 'comparisonsQty_min', 'comparisonsQty_max', 'comparisonsQty_std',
 'tryCatchQty_mean','tryCatchQty_min', 'tryCatchQty_max', 'tryCatchQty_std',
 'parenthesizedExpsQty_mean','parenthesizedExpsQty_min', 'parenthesizedExpsQty_max', 'parenthesizedExpsQty_std',
 'stringLiteralsQty_mean', 'stringLiteralsQty_min', 'stringLiteralsQty_max', 'stringLiteralsQty_std',
 'numbersQty_mean', 'numbersQty_min', 'numbersQty_max', 'numbersQty_std',
 'assignmentsQty_mean', 'assignmentsQty_min', 'assignmentsQty_max', 'assignmentsQty_std',
 'mathOperationsQty_mean', 'mathOperationsQty_min', 'mathOperationsQty_max', 'mathOperationsQty_std',
 'variablesQty_mean', 'variablesQty_min', 'variablesQty_max', 'variablesQty_std',
 'maxNestedBlocks_mean', 'maxNestedBlocks_min', 'maxNestedBlocks_max', 'maxNestedBlocks_std',
 'anonymousClassesQty_mean', 'anonymousClassesQty_min', 'anonymousClassesQty_max', 'anonymousClassesQty_std',
 'subClassesQty_mean', 'subClassesQty_min', 'subClassesQty_max', 'subClassesQty_std',
 'lambdasQty_mean', 'lambdasQty_min', 'lambdasQty_max', 'lambdasQty_std',
 'uniqueWordsQty_mean', 'uniqueWordsQty_min', 'uniqueWordsQty_max', 'uniqueWordsQty_std',
 'modifiers_mean', 'modifiers_min', 'modifiers_max', 'modifiers_std',
 'num_dependency_mean', 'num_line_affected_mean'
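The list above follows a regular pattern: each of 40 class-level metrics appears with four aggregates (`mean`, `min`, `max`, `std`), plus two dependency features. A short sketch that rebuilds the same names programmatically; the variable names and the `missing_features` helper are illustrative, not taken from the notebook.

```python
# Base metric names, in the order they appear in the list above.
METRICS = [
    "cbo", "wmc", "dit", "rfc", "lcom",
    "totalMethods", "staticMethods", "publicMethods", "privateMethods",
    "protectedMethods", "defaultMethods", "abstractMethods", "finalMethods",
    "synchronizedMethods",
    "totalFields", "staticFields", "publicFields", "privateFields",
    "protectedFields", "defaultFields", "finalFields", "synchronizedFields",
    "nosi", "loc", "returnQty", "loopQty", "comparisonsQty", "tryCatchQty",
    "parenthesizedExpsQty", "stringLiteralsQty", "numbersQty",
    "assignmentsQty", "mathOperationsQty", "variablesQty", "maxNestedBlocks",
    "anonymousClassesQty", "subClassesQty", "lambdasQty", "uniqueWordsQty",
    "modifiers",
]
AGGREGATES = ["mean", "min", "max", "std"]
EXTRA_FEATURES = ["num_dependency_mean", "num_line_affected_mean"]
TARGET = "refactoring_perc_time_taken_log"

# 40 metrics x 4 aggregates + 2 extra features = 162 predictors.
features = [f"{metric}_{agg}" for metric in METRICS for agg in AGGREGATES]
features += EXTRA_FEATURES


def missing_features(columns):
    """Return the expected predictors that are absent from `columns`."""
    available = set(columns)
    return [name for name in features if name not in available]
```

Calling `missing_features(effort_estimation_df.columns)` before training is one way to catch a renamed or dropped column early.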


**Inputs**:
1. Software Features (Predictors)
2. refactoring_perc_time_taken_log (Target Variable)

**Outputs**:
1. Machine Learning Model (H2O AutoML)
    
Details of the model are given below.

In [22]:
import h2o
h2o.init()
model_path = "models/EffortEstimationModelv3/Log_Regression_GBM_grid__1_AutoML_20220228_154246_model_3"
model = h2o.load_model(model_path)
model

Checking whether there is an H2O instance running at http://localhost:54321 . connected.


* H2O_cluster_uptime: 15 secs
* H2O_cluster_timezone: Asia/Kuala_Lumpur
* H2O_data_parsing_timezone: UTC
* H2O_cluster_version: 3.32.1.7
* H2O_cluster_version_age: 11 months and 3 days !!!
* H2O_cluster_name: H2O_from_python_tanji_1qzstq
* H2O_cluster_total_nodes: 1
* H2O_cluster_free_memory: 7.984 Gb
* H2O_cluster_total_cores: 12
* H2O_cluster_allowed_cores: 12


Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  GBM_grid__1_AutoML_20220228_154246_model_3


Model Summary: 


| | number_of_trees | number_of_internal_trees | model_size_in_bytes | min_depth | max_depth | mean_depth | min_leaves | max_leaves | mean_leaves |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 33.0 | 33.0 | 3454.0 | 2.0 | 3.0 | 2.575757 | 3.0 | 4.0 | 3.666667 |




ModelMetricsRegression: gbm
** Reported on train data. **

MSE: 3.2044038310962666
RMSE: 1.7900848670094573
MAE: 1.424412915308594
RMSLE: NaN
Mean Residual Deviance: 3.2044038310962666

ModelMetricsRegression: gbm
** Reported on cross-validation data. **

MSE: 3.8037588457541194
RMSE: 1.9503227542522594
MAE: 1.5519355816853435
RMSLE: NaN
Mean Residual Deviance: 3.8037588457541194

Cross-Validation Metrics Summary: 


| | metric | mean | sd | cv_1_valid | cv_2_valid | cv_3_valid | cv_4_valid | cv_5_valid |
|---|---|---|---|---|---|---|---|---|
| 0 | mae | 1.5520502 | 0.08237011 | 1.4910178 | 1.467426 | 1.6308649 | 1.5229915 | 1.647951 |
| 1 | mean_residual_deviance | 3.80427 | 0.3442388 | 3.4929738 | 3.5808723 | 4.040876 | 3.6165812 | 4.2900476 |
| 2 | mse | 3.80427 | 0.3442388 | 3.4929738 | 3.5808723 | 4.040876 | 3.6165812 | 4.2900476 |
| 3 | r2 | 0.17276023 | 0.03720194 | 0.19935311 | 0.19155143 | 0.121441506 | 0.20593041 | 0.14552468 |
| 4 | residual_deviance | 3.80427 | 0.3442388 | 3.4929738 | 3.5808723 | 4.040876 | 3.6165812 | 4.2900476 |
| 5 | rmse | 1.9488872 | 0.08738271 | 1.8689499 | 1.8923193 | 2.0101929 | 1.9017311 | 2.071243 |
| 6 | rmsle | | 0.0 | | | | | |



Scoring History: 


| | timestamp | duration | number_of_trees | training_rmse | training_mae | training_deviance |
|---|---|---|---|---|---|---|
| 0 | 2022-02-28 15:43:08 | 1.653 sec | 0.0 | 2.146777 | 1.667285 | 4.608652 |
| 1 | 2022-02-28 15:43:08 | 1.673 sec | 5.0 | 2.014709 | 1.580553 | 4.059051 |
| 2 | 2022-02-28 15:43:08 | 1.693 sec | 10.0 | 1.932624 | 1.522236 | 3.735036 |
| 3 | 2022-02-28 15:43:08 | 1.717 sec | 15.0 | 1.881267 | 1.490755 | 3.539164 |
| 4 | 2022-02-28 15:43:08 | 1.742 sec | 20.0 | 1.851262 | 1.472042 | 3.42717 |
| 5 | 2022-02-28 15:43:08 | 1.771 sec | 25.0 | 1.823754 | 1.451458 | 3.32608 |
| 6 | 2022-02-28 15:43:08 | 1.792 sec | 30.0 | 1.801777 | 1.435145 | 3.2464 |
| 7 | 2022-02-28 15:43:08 | 1.805 sec | 33.0 | 1.790085 | 1.424413 | 3.204404 |



Variable Importances: 


| | variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|---|
| 0 | uniqueWordsQty_min | 1156.941528 | 1.0 | 0.268673 |
| 1 | loc_min | 555.740234 | 0.480353 | 0.129058 |
| 2 | cbo_std | 502.424957 | 0.43427 | 0.116676 |
| 3 | privateFields_std | 247.56366 | 0.213981 | 0.057491 |
| 4 | finalFields_std | 219.819809 | 0.190001 | 0.051048 |
| 5 | nosi_min | 205.712997 | 0.177808 | 0.047772 |
| 6 | staticFields_mean | 192.080521 | 0.166024 | 0.044606 |
| 7 | nosi_mean | 106.552544 | 0.092098 | 0.024744 |
| 8 | subClassesQty_mean | 103.979286 | 0.089874 | 0.024147 |
| 9 | finalFields_mean | 102.416138 | 0.088523 | 0.023784 |



See the whole table with table.as_data_frame()
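The model output reports two slightly different cross-validation RMSE values (1.9503 in the metrics block and 1.9489 in the summary table). A short worked check, using only the numbers printed above, suggests why: the first is the square root of the reported cross-validation MSE, while the second is the mean of the five per-fold RMSE values. The interpretation that the first figure is computed on the combined holdout predictions is our reading of H2O's behaviour, not something stated in the notebook.

```python
from math import sqrt
from statistics import mean, stdev

# Per-fold values copied from the Cross-Validation Metrics Summary.
cv_mae = [1.4910178, 1.467426, 1.6308649, 1.5229915, 1.647951]
cv_mse = [3.4929738, 3.5808723, 4.040876, 3.6165812, 4.2900476]

# The "mean" and "sd" columns are the mean and sample standard
# deviation across the five folds.
mae_mean = mean(cv_mae)
mae_sd = stdev(cv_mae)

# Per-fold RMSE is the square root of per-fold MSE.
fold_rmse = [sqrt(mse) for mse in cv_mse]
rmse_mean_of_folds = mean(fold_rmse)

# RMSE derived from the single cross-validation MSE in the metrics block.
rmse_from_cv_mse = sqrt(3.8037588457541194)

print(round(mae_mean, 4), round(mae_sd, 4))
print(round(rmse_mean_of_folds, 4), round(rmse_from_cv_mse, 4))
```

Either figure is a reasonable summary; what matters when comparing models is to use the same one throughout.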




## 6. Model Validation & 7. Estimation Techniques

**REARRANGE 05 - Model Validation.ipynb**

The main functions of this notebook are to:
1. Validate the Machine Learning Model built in the previous notebook as a baseline model.
    * MAE
    * SA
    * RE*
2. Compare the performance of the model against other software estimation models.
    * COCOMOII
    * GeneticP
    * SoftwareMaintenance
    * Fuzzy
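The three validation measures can be sketched as follows, using definitions commonly given in the effort-estimation literature: MAE is the mean absolute error, SA (Standardised Accuracy) compares the model's MAE with the MAE of random guessing, and RE* is the variance of the residuals divided by the variance of the actual values. These definitions and the function names are assumptions; the notebook's implementation may differ (for example, SA is sometimes reported as a percentage).

```python
from statistics import mean, pvariance


def mae(actual, predicted):
    """Mean absolute error."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def mae_random_guess(actual):
    """Expected MAE when each case is predicted with another case's actual
    value, computed exactly over all ordered pairs i != j."""
    n = len(actual)
    total = sum(
        abs(actual[i] - actual[j])
        for i in range(n)
        for j in range(n)
        if i != j
    )
    return total / (n * (n - 1))


def standardised_accuracy(actual, predicted):
    """SA = 1 - MAE / MAE of random guessing (higher is better)."""
    return 1 - mae(actual, predicted) / mae_random_guess(actual)


def re_star(actual, predicted):
    """RE* = var(residuals) / var(actual) (lower is better)."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    return pvariance(residuals) / pvariance(actual)


# Toy data for illustration only.
actual = [1, 2, 3, 4]
predicted = [1.5, 2, 2.5, 4]

scores = {
    "MAE": mae(actual, predicted),
    "SA": standardised_accuracy(actual, predicted),
    "RE*": re_star(actual, predicted),
}
```

Applying the same three functions to the `cocomoII_time_taken`, `geneticP_time_taken`, `softwareMaintenance_time_taken` and `fuzzy_time_taken` columns would put all models on a common scale.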

Requirements of a baseline model:
1. Be simple to describe, implement, and interpret.
2. Be deterministic in its outcomes.
3. Be applicable to mixed qualitative and quantitative data.
4. Offer some explanatory information regarding the prediction by representing generalised properties of the underlying data.
5. Have no parameters within the modelling process that require tuning.
6. Be publicly available via a reference implementation and associated environment for execution.
7. Generally be more accurate than a random guess or an estimate based purely on the distribution of the response variable.
8. Be robust to different data splits and validation methods.
9. Not be expensive to apply.
10. Offer comparable performance to standard methods.

**Inputs**:
1. Training Data
2. Testing Data
3. Machine Learning Model

**Outputs**:
1. Validation Results