The purpose of this notebook is to convert the data (.csv files) from the long-lived Java projects into new .csv files. The new files contain the ACR, ACDIF and ACDEN metrics, along with their means and medians. The notebook also compiles all projects into two .csv files: one with all the raw metrics and one with all the averaged metrics.
PS: The geometric mean computed here is unnecessary and is not the one we used in our paper.


In [1]:
#These are all the necessary imports for this notebook. Make sure you have all of the dependencies installed.
import pandas as pd

from scipy.stats.mstats import gmean

The first step is to load our data from the original .csv files. These were produced by extracting data from the projects using BOHR (https://github.com/wendellmfm/bohr) and JMetriX (https://github.com/lincolnrocha/JMetriX).

In [2]:
data_atoms_lang = pd.read_csv(r'.\Data\reports\commons-lang-all.csv', sep=';')
data_atoms_dbcp = pd.read_csv(r'.\Data\reports\dbcp-all.csv', sep=';')
data_atoms_struts = pd.read_csv(r'.\Data\reports\struts-all.csv', sep=';')
data_atoms_codec = pd.read_csv(r'.\Data\reports\commons-codec-all.csv', sep=';')
data_atoms_bcel = pd.read_csv(r'.\Data\reports\commons-bcel-all.csv', sep=';')
data_atoms_compress = pd.read_csv(r'.\Data\reports\commons-compress-all.csv', sep=';')
data_atoms_configuration = pd.read_csv(r'.\Data\reports\commons-configuration-all.csv', sep=';')
data_atoms_net = pd.read_csv(r'.\Data\reports\commons-net-all.csv', sep=';')
data_atoms_freemarker = pd.read_csv(r'.\Data\reports\freemarker-all.csv', sep=';')
data_atoms_vfs = pd.read_csv(r'.\Data\reports\commons-vfs-all.csv', sep=';')
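
The paths above use Windows-style backslashes, so they only resolve on Windows. A minimal sketch of a portable loader follows, assuming the same semicolon-separated reports in one directory. The names `REPORT_FILES` and `load_reports` are hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd

# Hypothetical mapping from a short project key to its report file name.
REPORT_FILES = {
    "lang": "commons-lang-all.csv",
    "dbcp": "dbcp-all.csv",
    "struts": "struts-all.csv",
    "codec": "commons-codec-all.csv",
    "bcel": "commons-bcel-all.csv",
    "compress": "commons-compress-all.csv",
    "configuration": "commons-configuration-all.csv",
    "net": "commons-net-all.csv",
    "freemarker": "freemarker-all.csv",
    "vfs": "commons-vfs-all.csv",
}


def load_reports(reports_dir, files=REPORT_FILES):
    """Read each listed report into a dict of DataFrames keyed by project."""
    reports_dir = Path(reports_dir)
    return {
        key: pd.read_csv(reports_dir / file_name, sep=";")
        for key, file_name in files.items()
    }
```

With this helper, `load_reports(Path("Data") / "reports")` would build the same frames on any operating system, because `pathlib` joins path parts with the platform's separator.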

These functions compute the new metrics and define how the new .csv files are created.

In [3]:
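def csv_preparation(data, name):
    # Atoms per thousand lines of code.
    data['Number of Atoms per LoC (10^-3)'] = data['N.Atoms']*1000/data['LoC']
    # Fraction of all classes that contain atoms.
    data['Atom Diffusion'] = data['Classes w/ Atoms']/data['Classes Total']
    # Atoms per class that contains atoms.
    data['Atom Density'] = data['N.Atoms']/data['Classes w/ Atoms']
    data['Project'] = name
    # Reverse the row order.
    data = data.iloc[::-1]
    # Move the 'Project' column (index 20) to the front.
    data = data.iloc[:,[20,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19]]
    return data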
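
To make the three ratios concrete, here is a small worked example on invented numbers. It only assumes the four input columns that csv_preparation reads. The values are made up for illustration. The three derived columns are presumably the ACR, ACDIF and ACDEN metrics mentioned at the top, in that order, but that mapping is an assumption.

```python
import pandas as pd

# Toy data with invented values, for illustration only.
toy = pd.DataFrame({
    "Classes Total": [100, 120],
    "Classes w/ Atoms": [25, 30],
    "LoC": [10000, 12000],
    "N.Atoms": [50, 90],
})

# Atoms per thousand lines of code: 50 * 1000 / 10000 = 5.0 for the first row.
toy["Number of Atoms per LoC (10^-3)"] = toy["N.Atoms"] * 1000 / toy["LoC"]
# Fraction of classes that contain atoms: 25 / 100 = 0.25 for the first row.
toy["Atom Diffusion"] = toy["Classes w/ Atoms"] / toy["Classes Total"]
# Atoms per class that contains atoms: 50 / 25 = 2.0 for the first row.
toy["Atom Density"] = toy["N.Atoms"] / toy["Classes w/ Atoms"]

print(toy[["Number of Atoms per LoC (10^-3)", "Atom Diffusion", "Atom Density"]])
```

In the second row the atom count grows faster than the code base, so the per-LoC value and the density both rise while the diffusion stays at 0.25.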

In [4]:
def new_csv_creation (data, name):
    new_data = []

    mean_class_total_before = data[data["Period"] == "Before CI/CD"]["Classes Total"].mean()
    mean_class_total_after = data[data["Period"] == "After CI/CD"]["Classes Total"].mean()  
    mean_class_atoms_before = data[data["Period"] == "Before CI/CD"]["Classes w/ Atoms"].mean()
    mean_class_atoms_after = data[data["Period"] == "After CI/CD"]["Classes w/ Atoms"].mean()  
    mean_atoms_before = data[data["Period"] == "Before CI/CD"]["N.Atoms"].mean() 
    mean_atoms_after = data[data["Period"] == "After CI/CD"]["N.Atoms"].mean()
    mean_loc_before = data[data["Period"] == "Before CI/CD"]["LoC"].mean() 
    mean_loc_after = data[data["Period"] == "After CI/CD"]["LoC"].mean()
    mean_num_per_loc_before = data[data["Period"] == "Before CI/CD"]["Number of Atoms per LoC (10^-3)"].mean() 
    mean_num_per_loc_after = data[data["Period"] == "After CI/CD"]["Number of Atoms per LoC (10^-3)"].mean()
    mean_diffusion_before = data[data["Period"] == "Before CI/CD"]["Atom Diffusion"].mean() 
    mean_diffusion_after = data[data["Period"] == "After CI/CD"]["Atom Diffusion"].mean()
    mean_density_before = data[data["Period"] == "Before CI/CD"]["Atom Density"].mean() 
    mean_density_after = data[data["Period"] == "After CI/CD"]["Atom Density"].mean()

    median_class_total_before = data[data["Period"] == "Before CI/CD"]["Classes Total"].median()
    median_class_total_after = data[data["Period"] == "After CI/CD"]["Classes Total"].median()  
    median_class_atoms_before = data[data["Period"] == "Before CI/CD"]["Classes w/ Atoms"].median()
    median_class_atoms_after = data[data["Period"] == "After CI/CD"]["Classes w/ Atoms"].median() 
    median_atoms_before = data[data["Period"] == "Before CI/CD"]["N.Atoms"].median() 
    median_atoms_after = data[data["Period"] == "After CI/CD"]["N.Atoms"].median()
    median_loc_before = data[data["Period"] == "Before CI/CD"]["LoC"].median() 
    median_loc_after = data[data["Period"] == "After CI/CD"]["LoC"].median()
    median_num_per_loc_before = data[data["Period"] == "Before CI/CD"]["Number of Atoms per LoC (10^-3)"].median() 
    median_num_per_loc_after = data[data["Period"] == "After CI/CD"]["Number of Atoms per LoC (10^-3)"].median()
    median_diffusion_before = data[data["Period"] == "Before CI/CD"]["Atom Diffusion"].median() 
    median_diffusion_after = data[data["Period"] == "After CI/CD"]["Atom Diffusion"].median()
    median_density_before = data[data["Period"] == "Before CI/CD"]["Atom Density"].median() 
    median_density_after = data[data["Period"] == "After CI/CD"]["Atom Density"].median()

    gmean_class_total_before = gmean(data[data["Period"] == "Before CI/CD"]["Classes Total"])
    gmean_class_total_after = gmean(data[data["Period"] == "After CI/CD"]["Classes Total"])  
    gmean_class_atoms_before = gmean(data[data["Period"] == "Before CI/CD"]["Classes w/ Atoms"])
    gmean_class_atoms_after = gmean(data[data["Period"] == "After CI/CD"]["Classes w/ Atoms"])  
    gmean_atoms_before = gmean(data[data["Period"] == "Before CI/CD"]["N.Atoms"]) 
    gmean_atoms_after = gmean(data[data["Period"] == "After CI/CD"]["N.Atoms"])
    gmean_loc_before = gmean(data[data["Period"] == "Before CI/CD"]["LoC"]) 
    gmean_loc_after = gmean(data[data["Period"] == "After CI/CD"]["LoC"])
    gmean_num_per_loc_before = gmean(data[data["Period"] == "Before CI/CD"]["Number of Atoms per LoC (10^-3)"])
    gmean_num_per_loc_after = gmean(data[data["Period"] == "After CI/CD"]["Number of Atoms per LoC (10^-3)"])
    gmean_diffusion_before = gmean(data[data["Period"] == "Before CI/CD"]["Atom Diffusion"]) 
    gmean_diffusion_after = gmean(data[data["Period"] == "After CI/CD"]["Atom Diffusion"])
    gmean_density_before = gmean(data[data["Period"] == "Before CI/CD"]["Atom Density"]) 
    gmean_density_after = gmean(data[data["Period"] == "After CI/CD"]["Atom Density"])

    new_data.append([name, "Before CI/CD", "Mean", mean_class_total_before, mean_class_atoms_before, mean_loc_before, mean_atoms_before, mean_num_per_loc_before, mean_diffusion_before, mean_density_before])
    new_data.append([name, "After CI/CD", "Mean", mean_class_total_after, mean_class_atoms_after, mean_loc_after, mean_atoms_after, mean_num_per_loc_after, mean_diffusion_after, mean_density_after])
    new_data.append([name, "Before CI/CD", "Median", median_class_total_before, median_class_atoms_before, median_loc_before, median_atoms_before, median_num_per_loc_before, median_diffusion_before, median_density_before])
    new_data.append([name, "After CI/CD", "Median", median_class_total_after, median_class_atoms_after, median_loc_after, median_atoms_after, median_num_per_loc_after, median_diffusion_after, median_density_after])
    new_data.append([name, "Before CI/CD", "Geo-Mean", gmean_class_total_before, gmean_class_atoms_before, gmean_loc_before, gmean_atoms_before, gmean_num_per_loc_before, gmean_diffusion_before, gmean_density_before])
    new_data.append([name, "After CI/CD", "Geo-Mean", gmean_class_total_after, gmean_class_atoms_after, gmean_loc_after, gmean_atoms_after, gmean_num_per_loc_after, gmean_diffusion_after, gmean_density_after])

    df = pd.DataFrame(new_data)
    new_columns = data.columns.delete([7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17])
    df.columns = new_columns
    return df
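
The per-period means and medians can also be computed with a groupby. The following is a minimal sketch of that alternative, not the notebook's own method. It assumes a 'Period' column as used above, the function name `summarize_by_period` is hypothetical, and it leaves out the geometric mean.

```python
import pandas as pd


def summarize_by_period(data, metric_columns):
    """Return one row per (Period, Statistic) with the mean and median of each metric."""
    # sort=False keeps the periods in their order of first appearance.
    grouped = data.groupby("Period", sort=False)[metric_columns]
    parts = []
    for label, frame in (("Mean", grouped.mean()), ("Median", grouped.median())):
        frame = frame.reset_index()
        frame.insert(1, "Statistic", label)
        parts.append(frame)
    return pd.concat(parts, ignore_index=True)


# Invented values, for illustration only.
demo = pd.DataFrame({
    "Period": ["Before CI/CD", "Before CI/CD", "After CI/CD", "After CI/CD", "After CI/CD"],
    "N.Atoms": [10, 20, 30, 50, 100],
})
out = summarize_by_period(demo, ["N.Atoms"])
print(out)
```

The design trade-off is that the groupby version handles any number of metric columns and period labels without one variable per combination, at the cost of a different row order than the hand-built list above.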

From here until almost the end, the notebook goes through every long-lived Java project, calculates the metrics, organizes the data, and creates the .csv files needed by the other notebooks.
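
The same per-project steps could also be written as a loop. This is a minimal sketch under stated assumptions: `process_projects` is a hypothetical helper, and its `prepare` and `summarize` arguments stand in for csv_preparation and new_csv_creation. Writing the results to disk is left out.

```python
def process_projects(raw_frames, prepare, summarize):
    """Apply prepare and then summarize to every project.

    raw_frames maps a project name to its raw data. Returns two dicts keyed
    by project name: the prepared data and the summaries.
    """
    prepared, summaries = {}, {}
    for name, frame in raw_frames.items():
        prepared[name] = prepare(frame, name)
        summaries[name] = summarize(prepared[name], name)
    return prepared, summaries
```

A call might look like `process_projects({"Commons-Lang": data_atoms_lang}, csv_preparation, new_csv_creation)`, with one dictionary entry per project.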

Commons-Lang

In [5]:
data_lang_prepared = csv_preparation(data_atoms_lang, "Commons-Lang")

In [6]:
new_data_lang = new_csv_creation(data_lang_prepared, "Commons-Lang")

In [7]:
data_lang_prepared.to_csv(r".\Data\data_atoms_lang.csv", index=False)

In [8]:
new_data_lang.to_csv(r".\Data\mean_median_lang.csv", index=False)

Commons-DBCP

In [9]:
data_dbcp_prepared = csv_preparation(data_atoms_dbcp, 'Commons-DBCP')

In [10]:
new_data_dbcp = new_csv_creation(data_dbcp_prepared, 'Commons-DBCP')

In [11]:
data_dbcp_prepared.to_csv(r".\Data\data_atoms_dbcp.csv", index=False)

In [12]:
new_data_dbcp.to_csv(r".\Data\mean_median_dbcp.csv", index=False)

Struts (Without Plugins)

In [13]:
data_struts_prepared = csv_preparation(data_atoms_struts, 'Struts')

In [14]:
new_data_struts = new_csv_creation(data_struts_prepared, 'Struts')

In [15]:
data_struts_prepared.to_csv(r".\Data\data_atoms_struts.csv", index=False)

In [16]:
new_data_struts.to_csv(r".\Data\mean_median_struts.csv", index=False)

Commons-Codec

In [17]:
data_codec_prepared = csv_preparation(data_atoms_codec, 'Commons-Codec')

In [18]:
new_data_codec = new_csv_creation(data_codec_prepared, 'Commons-Codec')

In [19]:
data_codec_prepared.to_csv(r".\Data\data_atoms_codec.csv", index=False)

In [20]:
new_data_codec.to_csv(r".\Data\mean_median_codec.csv", index=False)

Commons-bcel

In [21]:
data_bcel_prepared = csv_preparation(data_atoms_bcel, 'Commons-Bcel')

In [22]:
new_data_bcel = new_csv_creation(data_bcel_prepared, 'Commons-Bcel')

In [23]:
data_bcel_prepared.to_csv(r".\Data\data_atoms_bcel.csv", index=False)

In [24]:
new_data_bcel.to_csv(r".\Data\mean_median_bcel.csv", index=False)

Commons-Compress

In [25]:
data_compress_prepared = csv_preparation(data_atoms_compress, 'Commons-Compress')

In [26]:
new_data_compress = new_csv_creation(data_compress_prepared, 'Commons-Compress')

In [27]:
data_compress_prepared.to_csv(r".\Data\data_atoms_compress.csv", index=False)

In [28]:
new_data_compress.to_csv(r".\Data\mean_median_compress.csv", index=False)

Commons-Configuration

In [29]:
data_configuration_prepared = csv_preparation(data_atoms_configuration, 'Commons-Configuration')

In [30]:
new_data_configuration = new_csv_creation(data_configuration_prepared, 'Commons-Configuration')

In [31]:
data_configuration_prepared.to_csv(r".\Data\data_atoms_configuration.csv", index=False)

In [32]:
new_data_configuration.to_csv(r".\Data\mean_median_configuration.csv", index=False)

Commons-Net

In [33]:
data_net_prepared = csv_preparation(data_atoms_net, 'Commons-Net')

In [34]:
new_data_net = new_csv_creation(data_net_prepared, 'Commons-Net')

In [35]:
data_net_prepared.to_csv(r".\Data\data_atoms_net.csv", index=False)

In [36]:
new_data_net.to_csv(r".\Data\mean_median_net.csv", index=False)

Freemarker

In [37]:
data_freemarker_prepared = csv_preparation(data_atoms_freemarker, 'Freemarker')

In [38]:
new_data_freemarker = new_csv_creation(data_freemarker_prepared, 'Freemarker')

In [39]:
data_freemarker_prepared.to_csv(r".\Data\data_atoms_freemarker.csv", index=False)

In [40]:
new_data_freemarker.to_csv(r".\Data\mean_median_freemarker.csv", index=False)

Commons-vfs

In [41]:
data_vfs_prepared = csv_preparation(data_atoms_vfs, 'Commons-Vfs')

In [42]:
new_data_vfs = new_csv_creation(data_vfs_prepared, 'Commons-Vfs')

In [43]:
data_vfs_prepared.to_csv(r".\Data\data_atoms_vfs.csv", index=False)

In [44]:
new_data_vfs.to_csv(r".\Data\mean_median_vfs.csv", index=False)

Finally, we create .csv files with all the data, in both raw and averaged form.

In [45]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
df = pd.concat([
    data_lang_prepared,
    data_dbcp_prepared,
    data_struts_prepared,
    data_codec_prepared,
    data_bcel_prepared,
    data_compress_prepared,
    data_configuration_prepared,
    data_net_prepared,
    data_freemarker_prepared,
    data_vfs_prepared,
])
df.to_csv(r".\Data\projects_java.csv", index=False)

In [46]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
df = pd.concat([
    new_data_lang,
    new_data_dbcp,
    new_data_struts,
    new_data_codec,
    new_data_bcel,
    new_data_compress,
    new_data_configuration,
    new_data_net,
    new_data_freemarker,
    new_data_vfs,
])
df.to_csv(r".\Data\projects_java_mean_median.csv", index=False)