# **Integrate Data**

In this notebook we combine the tables obtained in the previous sections (selection, cleaning and construction) into a single table that holds, for each developer, the attributes needed to define their profile.

In [99]:
import pandas as pd
import numpy as np
from datetime import datetime

### COMMITS_FREQUENCY

In [100]:
numberCommits = pd.read_csv("../../data/interim/DataPreparation/ConstructData/NUMBER_COMMITS.csv")
numberCommits.head()

Unnamed: 0,committer,numberCommits
0,-l,27
1,1028332163,14
2,A. J. David Bosschaert,432
3,A195882,1
4,A744013,5


### FIXED_ISSUES

In [101]:
fixedIssues = pd.read_csv("../../data/interim/DataPreparation/ConstructData/FIXED_ISSUES.csv").iloc[:,1:]
fixedIssues.head()

Unnamed: 0,committer,SZZIssues,SonarIssues,JiraIssues
0,Carsten Ziegeler,560.0,655.0,0.0
1,Josh Elser,452.0,68.0,0.0
2,Felix Meschberger,431.0,0.0,0.0
3,Richard S. Hall,409.0,87.0,0.0
4,Guillaume Nodet,390.0,194.0,0.0


In [102]:
fixedIssues = fixedIssues.rename(columns={'SZZIssues':'fixedSZZIssues','SonarIssues':'fixedSonarIssues','JiraIssues':'fixedJiraIssues'})
fixedIssues.head()

Unnamed: 0,committer,fixedSZZIssues,fixedSonarIssues,fixedJiraIssues
0,Carsten Ziegeler,560.0,655.0,0.0
1,Josh Elser,452.0,68.0,0.0
2,Felix Meschberger,431.0,0.0,0.0
3,Richard S. Hall,409.0,87.0,0.0
4,Guillaume Nodet,390.0,194.0,0.0


### INDUCED_ISSUES

In [103]:
inducedIssues = pd.read_csv("../../data/interim/DataPreparation/ConstructData/INDUCED_ISSUES.csv").iloc[:,1:]
inducedIssues.head()

Unnamed: 0,committer,SZZIssues,SonarIssues
0,Richard S. Hall,61.0,99.0
1,Gary D. Gregory,40.0,399.0
2,Sebastian Bazley,33.0,710.0
3,Eric C. Newton,31.0,420.0
4,Keith Turner,22.0,303.0


In [104]:
inducedIssues = inducedIssues.rename(columns={'SZZIssues':'inducedSZZIssues','SonarIssues':'inducedSonarIssues'})
inducedIssues.head()

Unnamed: 0,committer,inducedSZZIssues,inducedSonarIssues
0,Richard S. Hall,61.0,99.0
1,Gary D. Gregory,40.0,399.0
2,Sebastian Bazley,33.0,710.0
3,Eric C. Newton,31.0,420.0
4,Keith Turner,22.0,303.0


---

After reading these three tables, we can start by joining them according to the attribute `committer` (which identifies the developers). First we join `COMMITS_FREQUENCY` with `FIXED_ISSUES`:

In [105]:
dataFrame = pd.merge(numberCommits, fixedIssues,  how='outer', left_on=['committer'], right_on = ['committer'])

In [106]:
len(dataFrame.committer.unique())

2459

In [107]:
print(dataFrame.shape)
dataFrame

(2459, 5)


Unnamed: 0,committer,numberCommits,fixedSZZIssues,fixedSonarIssues,fixedJiraIssues
0,-l,27.0,,,
1,1028332163,14.0,,,
2,A. J. David Bosschaert,432.0,51.0,0.0,0.0
3,A195882,1.0,0.0,1.0,0.0
4,A744013,5.0,0.0,4.0,0.0
...,...,...,...,...,...
2454,geojins,,0.0,0.0,1.0
2455,sabhyankar_impala_741e,,0.0,0.0,1.0
2456,jain.samit@gmail.com,,0.0,0.0,1.0
2457,yuppie-flu,,0.0,0.0,1.0


After the merge, some of the attributes have missing values. A missing `numberCommits` means that the developer has no recorded commits, so we can replace these `NaN`s with $0$. The other attributes count the issues fixed by the developer, so a missing value there means that the person did not fix any issue. Therefore, we can replace those `NaN`s with zeros too.

In [108]:
dataFrame = dataFrame.fillna(0.0)
dataFrame

Unnamed: 0,committer,numberCommits,fixedSZZIssues,fixedSonarIssues,fixedJiraIssues
0,-l,27.0,0.0,0.0,0.0
1,1028332163,14.0,0.0,0.0,0.0
2,A. J. David Bosschaert,432.0,51.0,0.0,0.0
3,A195882,1.0,0.0,1.0,0.0
4,A744013,5.0,0.0,4.0,0.0
...,...,...,...,...,...
2454,geojins,0.0,0.0,0.0,1.0
2455,sabhyankar_impala_741e,0.0,0.0,0.0,1.0
2456,jain.samit@gmail.com,0.0,0.0,0.0,1.0
2457,yuppie-flu,0.0,0.0,0.0,1.0
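
The outer join and zero-filling step above can be sketched on toy data. This is a minimal illustration, not the notebook's actual pipeline: the frames `commits` and `fixed`, the committer names and the values are all made up, and `indicator=True` is an optional extra that records which side each row came from.

```python
import pandas as pd

# Toy stand-ins for NUMBER_COMMITS and FIXED_ISSUES (hypothetical values).
commits = pd.DataFrame({"committer": ["alice", "bob"], "numberCommits": [27, 14]})
fixed = pd.DataFrame({"committer": ["bob", "carol"], "fixedSZZIssues": [3.0, 1.0]})

# indicator=True adds a `_merge` column: left_only, right_only or both.
merged = pd.merge(commits, fixed, how="outer", on="committer", indicator=True)

# Fill only the count columns, so no other column is touched by accident.
count_cols = ["numberCommits", "fixedSZZIssues"]
merged[count_cols] = merged[count_cols].fillna(0)
print(merged)
```

Filling an explicit list of columns, instead of the whole frame, keeps the replacement limited to attributes where zero really means "none".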


Now we join the Data Frame obtained in the previous join with the table `INDUCED_ISSUES`:

In [109]:
dataFrame = pd.merge(dataFrame, inducedIssues,  how='outer', left_on=['committer'], right_on = ['committer'])

In [110]:
print(dataFrame.shape)
dataFrame.head()

(2460, 7)


Unnamed: 0,committer,numberCommits,fixedSZZIssues,fixedSonarIssues,fixedJiraIssues,inducedSZZIssues,inducedSonarIssues
0,-l,27.0,0.0,0.0,0.0,0.0,2.0
1,1028332163,14.0,0.0,0.0,0.0,,
2,A. J. David Bosschaert,432.0,51.0,0.0,0.0,1.0,0.0
3,A195882,1.0,0.0,1.0,0.0,0.0,1.0
4,A744013,5.0,0.0,4.0,0.0,0.0,3.0


For the same reason as before, we fill the missing values with zeros:

In [111]:
dataFrame = dataFrame.fillna(0)
dataFrame.head()

Unnamed: 0,committer,numberCommits,fixedSZZIssues,fixedSonarIssues,fixedJiraIssues,inducedSZZIssues,inducedSonarIssues
0,-l,27.0,0.0,0.0,0.0,0.0,2.0
1,1028332163,14.0,0.0,0.0,0.0,0.0,0.0
2,A. J. David Bosschaert,432.0,51.0,0.0,0.0,1.0,0.0
3,A195882,1.0,0.0,1.0,0.0,0.0,1.0
4,A744013,5.0,0.0,4.0,0.0,0.0,3.0


After joining these three tables we have 2460 developers:

In [112]:
print(len(dataFrame.committer.unique()))

2460
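
Comparing the row count with the number of unique committers, as done above, checks that each developer appears once. A possible complement, sketched here on made-up data, is the `validate` argument of `pd.merge`, which raises an error when a key is duplicated. The frames `left` and `right` and their values are hypothetical.

```python
import pandas as pd

# Hypothetical tables; `right` contains a duplicated committer on purpose.
left = pd.DataFrame({"committer": ["alice", "bob"], "numberCommits": [27, 14]})
right = pd.DataFrame({"committer": ["bob", "bob"], "inducedSZZIssues": [1, 2]})

# validate="one_to_one" makes the merge fail if a key repeats on either side.
try:
    pd.merge(left, right, how="outer", on="committer", validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

# After removing the duplicate, the merge passes and rows match developers.
clean = pd.merge(
    left,
    right.drop_duplicates("committer"),
    how="outer",
    on="committer",
    validate="one_to_one",
)
rows_match_developers = len(clean) == clean["committer"].nunique()
print(duplicate_detected, rows_match_developers)
```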


---

### TIME_IN_PROJECT

In [113]:
timeInProject = pd.read_csv("../../data/interim/DataPreparation/ConstructData/TIME_IN_PROJECT.csv").iloc[:,1:]
timeInProject.head()

Unnamed: 0,projectID,committer,time
0,accumulo,Adam Fuchs,124052833.0
1,accumulo,Adam J. Shook,1404.0
2,accumulo,Andrew L. Farris,13455828.0
3,accumulo,Benson Margulies,30412048.0
4,accumulo,Bill Havanki,18912456.0


In [114]:
timeInProject = timeInProject.rename(columns={'time':'timeInProject'})
timeInProject.head()

Unnamed: 0,projectID,committer,timeInProject
0,accumulo,Adam Fuchs,124052833.0
1,accumulo,Adam J. Shook,1404.0
2,accumulo,Andrew L. Farris,13455828.0
3,accumulo,Benson Margulies,30412048.0
4,accumulo,Bill Havanki,18912456.0


In [115]:
timeInProject.groupby(['projectID', 'committer']).size().shape

(1587,)

In [116]:
timeInProject.groupby(['committer']).size().shape

(1016,)

As we can see above, some committers appear in more than one project (1587 project/committer pairs but only 1016 committers), so we compute the mean time in project by `committer`:

In [117]:
timeInProject = timeInProject.groupby(['committer']).mean().iloc[1:,:]
timeInProject

Unnamed: 0_level_0,timeInProject
committer,Unnamed: 1_level_1
-l,4235880.0
1028332163,77939.0
A. J. David Bosschaert,173937105.0
A195882,0.0
A744013,351970.0
...,...
Łukasz Gajowy,44232208.0
성준영,0.0
“Erin,70778.0
吴雪山,0.0
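
The per-committer averaging can be illustrated with a small made-up table. This is only a sketch: the rows are invented, and selecting the numeric column explicitly is a choice made here because recent pandas versions may refuse to average a text column such as `projectID`.

```python
import pandas as pd

# Hypothetical rows: alice works in two projects, bob in one.
time_in_project = pd.DataFrame({
    "projectID": ["accumulo", "zookeeper", "accumulo"],
    "committer": ["alice", "alice", "bob"],
    "timeInProject": [100.0, 300.0, 50.0],
})

# Average only the numeric column, producing one row per committer.
per_committer = (
    time_in_project.groupby("committer", as_index=False)["timeInProject"].mean()
)
print(per_committer)
```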


Now we have one row per developer, so we can join the table with the previous Data Frame using the attribute `committer` on both sides:

In [118]:
dataFrame = pd.merge(dataFrame, timeInProject,  how='outer', left_on=['committer'], right_on = ['committer'])

In [119]:
print(dataFrame.shape)
dataFrame.head()

(2460, 8)


Unnamed: 0,committer,numberCommits,fixedSZZIssues,fixedSonarIssues,fixedJiraIssues,inducedSZZIssues,inducedSonarIssues,timeInProject
0,-l,27.0,0.0,0.0,0.0,0.0,2.0,4235880.0
1,1028332163,14.0,0.0,0.0,0.0,0.0,0.0,77939.0
2,A. J. David Bosschaert,432.0,51.0,0.0,0.0,1.0,0.0,173937105.0
3,A195882,1.0,0.0,1.0,0.0,0.0,1.0,0.0
4,A744013,5.0,0.0,4.0,0.0,0.0,3.0,351970.0


In [120]:
dataFrame.isna().sum()

committer                0
numberCommits            0
fixedSZZIssues           0
fixedSonarIssues         0
fixedJiraIssues          0
inducedSZZIssues         0
inducedSonarIssues       0
timeInProject         1445
dtype: int64

As we can see, there are many `NaN` values in the `timeInProject` attribute. A missing value means that the developer did not make any commit, so their time in the project could not be computed; we therefore replace these `NaN`s with zeros as well:

In [121]:
dataFrame = dataFrame.fillna(0)
dataFrame.head()

Unnamed: 0,committer,numberCommits,fixedSZZIssues,fixedSonarIssues,fixedJiraIssues,inducedSZZIssues,inducedSonarIssues,timeInProject
0,-l,27.0,0.0,0.0,0.0,0.0,2.0,4235880.0
1,1028332163,14.0,0.0,0.0,0.0,0.0,0.0,77939.0
2,A. J. David Bosschaert,432.0,51.0,0.0,0.0,1.0,0.0,173937105.0
3,A195882,1.0,0.0,1.0,0.0,0.0,1.0,0.0
4,A744013,5.0,0.0,4.0,0.0,0.0,3.0,351970.0


After joining this new table we have 2460 developers:

In [122]:
len(dataFrame.committer.unique())

2460

---

### JIRA_ISSUES_time

From this table we are interested in the type, priority and resolution time of the issues a developer works on, so we use the attribute `assignee` to identify the developers.

In [123]:
jiraIssues = pd.read_csv("../../data/interim/DataPreparation/ConstructData/JIRA_ISSUES_time.csv").iloc[:,1:]
jiraIssues.head()

Unnamed: 0,projectID,key,creationDate,resolutionDate,type,priority,assignee,reporter,resolutionTime
0,commons-exec,EXEC-108,2018-09-18 11:15:58+00:00,2019-07-07 10:32:12+00:00,Bug,Major,not-assigned,natanieljr,7007.270556
1,commons-exec,EXEC-107,2018-07-04 12:09:47+00:00,2019-07-07 10:32:12+00:00,New Feature,Major,not-assigned,stefanreich,8830.373611
2,commons-exec,EXEC-106,2018-03-06 11:32:51+00:00,2019-07-07 10:32:12+00:00,Improvement,Major,not-assigned,sebb,11710.989167
3,commons-exec,EXEC-105,2018-02-16 13:47:10+00:00,2019-07-07 10:32:12+00:00,Wish,Trivial,not-assigned,IP,12140.750556
4,commons-exec,EXEC-104,2017-08-04 11:57:39+00:00,2019-07-07 10:32:12+00:00,Bug,Major,not-assigned,krichter,16846.575833


Before joining this table with the previous ones, we have to aggregate by developer (`assignee`). For the categorical attributes, we first have to transform them into binary attributes:

In [124]:
dum = pd.get_dummies(jiraIssues[["type", 'priority']], prefix=['type', 'priority'])
dum

Unnamed: 0,type_Bug,type_Dependency upgrade,type_Documentation,type_Epic,type_Improvement,type_New Feature,type_Question,type_Story,type_Sub-task,type_Task,type_Technical task,type_Test,type_Wish,priority_Blocker,priority_Critical,priority_Major,priority_Minor,priority_Trivial
0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0
2,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1
4,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
66340,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0
66341,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
66342,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
66343,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0


In [125]:
TypePriority = jiraIssues[['assignee']].join(dum)
TypePriority

Unnamed: 0,assignee,type_Bug,type_Dependency upgrade,type_Documentation,type_Epic,type_Improvement,type_New Feature,type_Question,type_Story,type_Sub-task,type_Task,type_Technical task,type_Test,type_Wish,priority_Blocker,priority_Critical,priority_Major,priority_Minor,priority_Trivial
0,not-assigned,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,not-assigned,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0
2,not-assigned,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0
3,not-assigned,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1
4,not-assigned,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
66340,mahadev,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0
66341,fpj,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
66342,mahadev,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
66343,fpj,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0


We remove the rows that have not been assigned, since they carry no information about any developer:

In [126]:
TypePriority = TypePriority[TypePriority.assignee!='not-assigned'].reset_index().iloc[:,1:]
TypePriority

Unnamed: 0,assignee,type_Bug,type_Dependency upgrade,type_Documentation,type_Epic,type_Improvement,type_New Feature,type_Question,type_Story,type_Sub-task,type_Task,type_Technical task,type_Test,type_Wish,priority_Blocker,priority_Critical,priority_Major,priority_Minor,priority_Trivial
0,sgoeschl,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,sgoeschl,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,sgoeschl,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0
3,sgoeschl,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0
4,sgoeschl,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
46647,mahadev,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0
46648,fpj,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
46649,mahadev,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
46650,fpj,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0


Now we can aggregate by developer, adding the values:

In [127]:
TypePriority = TypePriority.groupby(["assignee"]).sum()
TypePriority

Unnamed: 0_level_0,type_Bug,type_Dependency upgrade,type_Documentation,type_Epic,type_Improvement,type_New Feature,type_Question,type_Story,type_Sub-task,type_Task,type_Technical task,type_Test,type_Wish,priority_Blocker,priority_Critical,priority_Major,priority_Minor,priority_Trivial
assignee,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
397090770,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0
A1YCEUL3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
AaronLeeIV,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
AdamWesterman,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
AlexKbit,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zmanji,41.0,0.0,0.0,3.0,0.0,0.0,0.0,12.0,0.0,30.0,0.0,0.0,0.0,5.0,4.0,57.0,19.0,1.0
zookeeperatcabot,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
zsilver,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
zsombor,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,0.0
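
The three steps above (binarize, drop unassigned rows, sum per developer) can be sketched on a toy table. The issues below are invented, and `dtype=int` is an assumption added so the dummy columns are integers regardless of the pandas version.

```python
import pandas as pd

# Hypothetical issues; one of them has no assignee.
issues = pd.DataFrame({
    "assignee": ["alice", "alice", "not-assigned", "bob"],
    "type": ["Bug", "Improvement", "Bug", "Bug"],
    "priority": ["Major", "Major", "Minor", "Trivial"],
})

# Drop the placeholder assignee, then one-hot encode both categorical columns.
assigned = issues[issues["assignee"] != "not-assigned"]
dummies = pd.get_dummies(
    assigned[["type", "priority"]], prefix=["type", "priority"], dtype=int
)

# Summing the 0/1 indicators per assignee gives issue counts per category.
counts = dummies.groupby(assigned["assignee"]).sum()
print(counts)
```

Each cell of `counts` is the number of issues of that type or priority assigned to the developer.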


The attribute `resolutionTime` can be aggregated by `assignee` computing the mean value:

In [128]:
resolutionTime = jiraIssues.loc[:,['assignee','resolutionTime']]
resolutionTime = resolutionTime.groupby(["assignee"]).mean()
resolutionTime

Unnamed: 0_level_0,resolutionTime
assignee,Unnamed: 1_level_1
397090770,1003.802639
A1YCEUL3,1.442778
AaronLeeIV,47.307222
AdamWesterman,0.361667
AlexKbit,7142.560000
...,...
zmanji,2594.802380
zookeeperatcabot,64359.943333
zsilver,4.728889
zsombor,1349.754259


And now we can join this attribute with the ones previously aggregated:

In [129]:
jiraIssues = resolutionTime.join(TypePriority)
jiraIssues

Unnamed: 0_level_0,resolutionTime,type_Bug,type_Dependency upgrade,type_Documentation,type_Epic,type_Improvement,type_New Feature,type_Question,type_Story,type_Sub-task,type_Task,type_Technical task,type_Test,type_Wish,priority_Blocker,priority_Critical,priority_Major,priority_Minor,priority_Trivial
assignee,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1
397090770,1003.802639,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0
A1YCEUL3,1.442778,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
AaronLeeIV,47.307222,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
AdamWesterman,0.361667,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
AlexKbit,7142.560000,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zmanji,2594.802380,41.0,0.0,0.0,3.0,0.0,0.0,0.0,12.0,0.0,30.0,0.0,0.0,0.0,5.0,4.0,57.0,19.0,1.0
zookeeperatcabot,64359.943333,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
zsilver,4.728889,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
zsombor,1349.754259,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,0.0


In [130]:
jiraIssues = jiraIssues.reset_index().rename(columns={'assignee':'committer'})
jiraIssues

Unnamed: 0,committer,resolutionTime,type_Bug,type_Dependency upgrade,type_Documentation,type_Epic,type_Improvement,type_New Feature,type_Question,type_Story,type_Sub-task,type_Task,type_Technical task,type_Test,type_Wish,priority_Blocker,priority_Critical,priority_Major,priority_Minor,priority_Trivial
0,397090770,1003.802639,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0
1,A1YCEUL3,1.442778,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,AaronLeeIV,47.307222,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,AdamWesterman,0.361667,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,AlexKbit,7142.560000,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1505,zmanji,2594.802380,41.0,0.0,0.0,3.0,0.0,0.0,0.0,12.0,0.0,30.0,0.0,0.0,0.0,5.0,4.0,57.0,19.0,1.0
1506,zookeeperatcabot,64359.943333,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1507,zsilver,4.728889,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1508,zsombor,1349.754259,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,0.0


Having the table aggregated by developer, we can merge it with the Data Frame:

In [131]:
dataFrame = pd.merge(dataFrame, jiraIssues, how='left', left_on=['committer'], right_on = ['committer'])
print(dataFrame.shape)
dataFrame.head()

(2460, 27)


Unnamed: 0,committer,numberCommits,fixedSZZIssues,fixedSonarIssues,fixedJiraIssues,inducedSZZIssues,inducedSonarIssues,timeInProject,resolutionTime,type_Bug,...,type_Sub-task,type_Task,type_Technical task,type_Test,type_Wish,priority_Blocker,priority_Critical,priority_Major,priority_Minor,priority_Trivial
0,-l,27.0,0.0,0.0,0.0,0.0,2.0,4235880.0,,,...,,,,,,,,,,
1,1028332163,14.0,0.0,0.0,0.0,0.0,0.0,77939.0,,,...,,,,,,,,,,
2,A. J. David Bosschaert,432.0,51.0,0.0,0.0,1.0,0.0,173937105.0,,,...,,,,,,,,,,
3,A195882,1.0,0.0,1.0,0.0,0.0,1.0,0.0,,,...,,,,,,,,,,
4,A744013,5.0,0.0,4.0,0.0,0.0,3.0,351970.0,,,...,,,,,,,,,,


The missing values that we get can be filled with zeros because they all refer to quantities:

In [132]:
dataFrame.isna().sum()

committer                    0
numberCommits                0
fixedSZZIssues               0
fixedSonarIssues             0
fixedJiraIssues              0
inducedSZZIssues             0
inducedSonarIssues           0
timeInProject                0
resolutionTime             951
type_Bug                   951
type_Dependency upgrade    951
type_Documentation         951
type_Epic                  951
type_Improvement           951
type_New Feature           951
type_Question              951
type_Story                 951
type_Sub-task              951
type_Task                  951
type_Technical task        951
type_Test                  951
type_Wish                  951
priority_Blocker           951
priority_Critical          951
priority_Major             951
priority_Minor             951
priority_Trivial           951
dtype: int64

In [133]:
dataFrame = dataFrame.fillna(0.0)

With this new table we still have 2460 developers:

In [134]:
len(dataFrame.committer.unique())

2460

---

### GIT_COMMITS_CHANGES_clean


Once more, some of the attributes of this table are categorical, so we first have to binarize them in order to aggregate:

In [135]:
gitCommitsChanges = pd.read_csv("../../data/interim/DataPreparation/CleanData/GIT_COMMITS_CHANGES_clean.csv").iloc[:,2:]
print(gitCommitsChanges.shape)
gitCommitsChanges.head()

(891711, 4)


Unnamed: 0,commitHash,changeType,linesAdded,linesRemoved
0,e0880e263e4bf8662ba3848405200473a25dfc9f,ModificationType.ADD,196,0
1,e0880e263e4bf8662ba3848405200473a25dfc9f,ModificationType.ADD,22,0
2,e0880e263e4bf8662ba3848405200473a25dfc9f,ModificationType.ADD,87,0
3,e0880e263e4bf8662ba3848405200473a25dfc9f,ModificationType.ADD,167,0
4,e0880e263e4bf8662ba3848405200473a25dfc9f,ModificationType.ADD,96,0


In [136]:
dum = pd.get_dummies(gitCommitsChanges[["changeType"]])
dum = dum.rename(columns={'changeType_ModificationType.ADD':'ADD', 'changeType_ModificationType.DELETE':'DELETE', 'changeType_ModificationType.MODIFY':'MODIFY', 'changeType_ModificationType.RENAME':'RENAME', 'changeType_ModificationType.UNKNOWN':'UNKNOWN'})
dum

Unnamed: 0,ADD,DELETE,MODIFY,RENAME,UNKNOWN
0,1,0,0,0,0
1,1,0,0,0,0
2,1,0,0,0,0
3,1,0,0,0,0
4,1,0,0,0,0
...,...,...,...,...,...
891706,0,0,1,0,0
891707,0,0,1,0,0
891708,0,0,1,0,0
891709,0,0,1,0,0


In [137]:
Lines = gitCommitsChanges[["commitHash",'linesAdded','linesRemoved']]
gitCommitsChanges = pd.concat([Lines,dum], axis=1)
gitCommitsChanges

Unnamed: 0,commitHash,linesAdded,linesRemoved,ADD,DELETE,MODIFY,RENAME,UNKNOWN
0,e0880e263e4bf8662ba3848405200473a25dfc9f,196,0,1,0,0,0,0
1,e0880e263e4bf8662ba3848405200473a25dfc9f,22,0,1,0,0,0,0
2,e0880e263e4bf8662ba3848405200473a25dfc9f,87,0,1,0,0,0,0
3,e0880e263e4bf8662ba3848405200473a25dfc9f,167,0,1,0,0,0,0
4,e0880e263e4bf8662ba3848405200473a25dfc9f,96,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...
891706,616d19716a999312d6d4a47bcfc9935f4e9e6efe,1,3,0,0,1,0,0
891707,616d19716a999312d6d4a47bcfc9935f4e9e6efe,1,3,0,0,1,0,0
891708,616d19716a999312d6d4a47bcfc9935f4e9e6efe,2,8,0,0,1,0,0
891709,616d19716a999312d6d4a47bcfc9935f4e9e6efe,13,7,0,0,1,0,0


Now we can aggregate by commit, summing the change-type indicators and averaging the line counts:

In [138]:
gitCommitsChanges = gitCommitsChanges.groupby(['commitHash']).agg({'ADD':'sum', 'DELETE':'sum', 'MODIFY':'sum', 'RENAME':'sum', 'UNKNOWN':'sum', 'linesAdded':'mean', 'linesRemoved':'mean'})
gitCommitsChanges

Unnamed: 0_level_0,ADD,DELETE,MODIFY,RENAME,UNKNOWN,linesAdded,linesRemoved
commitHash,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
00016b9ca1063feea341c28528d2e53f0f64bf5f,0.0,0.0,1.0,0.0,0.0,7.000000,1.000000
0001afa3f7ea5a419098d1d6b9e9d0c25dd144b3,2.0,0.0,2.0,0.0,0.0,42.000000,0.250000
0001f90914b418859eb9fa86903e89a793e48e9b,0.0,0.0,5.0,0.0,0.0,57.200000,4.800000
0003160b08bd55b32eae0049048a6164d5b02d32,1.0,0.0,0.0,0.0,0.0,0.000000,0.000000
000333c5ec2f3a9568a8b56020faf81421c4a515,0.0,0.0,14.0,0.0,0.0,10.785714,7.714286
...,...,...,...,...,...,...,...
fffe0e0f1e2275899db903a925d68af53fd32dac,0.0,0.0,25.0,0.0,0.0,1.560000,0.960000
fffe3f91155871a3e24312977dffac9196c148d8,0.0,0.0,2.0,0.0,0.0,4.500000,4.500000
ffffc9f11a21d21458d4a8c4414201c5447098bc,0.0,0.0,5.0,0.0,0.0,3.800000,4.200000
ffffe87d10e35d450cd3acbbeb06461c6bf19554,0.0,0.0,3.0,0.0,0.0,6.000000,6.000000
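
A minimal sketch of this mixed aggregation, on invented data: the change-type indicators are summed, so they become counts of changed files per commit, while the line counts are averaged over the files of the commit. The hashes `c1` and `c2` and all values are hypothetical.

```python
import pandas as pd

# Hypothetical file-level changes: commit c1 touches two files, c2 one.
changes = pd.DataFrame({
    "commitHash": ["c1", "c1", "c2"],
    "changeType": ["ADD", "MODIFY", "MODIFY"],
    "linesAdded": [10, 30, 4],
    "linesRemoved": [0, 2, 6],
})

# One 0/1 column per change type.
flags = pd.get_dummies(changes["changeType"], dtype=int)

# A different aggregation per column: sum the flags, average the lines.
per_commit = (
    pd.concat([changes[["commitHash", "linesAdded", "linesRemoved"]], flags], axis=1)
    .groupby("commitHash")
    .agg({"ADD": "sum", "MODIFY": "sum", "linesAdded": "mean", "linesRemoved": "mean"})
)
print(per_commit)
```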


To be able to aggregate by developer, we first have to join this table with the table `GIT_COMMITS`, which has `commitHash` (as this table) and `committer`:

In [139]:
gitCommits = pd.read_csv("../../data/interim/DataPreparation/CleanData/GIT_COMMITS_clean.csv")
gitCommits.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,projectID,commitHash,author,committer,committerDate
0,0,0,accumulo,e0880e263e4bf8662ba3848405200473a25dfc9f,Keith Turner,Keith Turner,2011-10-04T00:46:07Z
1,1,1,accumulo,e8774c5ec3a35e042f320540b5f7e66ebd2d9e87,Billie Rinaldi,Billie Rinaldi,2011-10-04T16:57:13Z
2,2,2,accumulo,2032ebbd0ed90734da39ca238bbd10dee24d0030,Keith Turner,Keith Turner,2011-10-04T18:39:18Z
3,3,3,accumulo,de297d4932e08625a5df146f0802041bb5aeb892,Billie Rinaldi,Billie Rinaldi,2011-10-04T19:31:01Z
4,4,4,accumulo,34efaae87639a83b60fdb7274de4b45051025a3a,Billie Rinaldi,Billie Rinaldi,2011-10-05T17:19:06Z


In [140]:
gitCommitsChanges = pd.merge(gitCommits, gitCommitsChanges, how='left', left_on=['commitHash'], right_on = ['commitHash'])
gitCommitsChanges

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,projectID,commitHash,author,committer,committerDate,ADD,DELETE,MODIFY,RENAME,UNKNOWN,linesAdded,linesRemoved
0,0,0,accumulo,e0880e263e4bf8662ba3848405200473a25dfc9f,Keith Turner,Keith Turner,2011-10-04T00:46:07Z,1376.0,0.0,0.0,0.0,0.0,202.684593,0.000000
1,1,1,accumulo,e8774c5ec3a35e042f320540b5f7e66ebd2d9e87,Billie Rinaldi,Billie Rinaldi,2011-10-04T16:57:13Z,1.0,1.0,4.0,27.0,0.0,1.575758,29.666667
2,2,2,accumulo,2032ebbd0ed90734da39ca238bbd10dee24d0030,Keith Turner,Keith Turner,2011-10-04T18:39:18Z,0.0,0.0,1.0,0.0,0.0,1.000000,1.000000
3,3,3,accumulo,de297d4932e08625a5df146f0802041bb5aeb892,Billie Rinaldi,Billie Rinaldi,2011-10-04T19:31:01Z,0.0,0.0,1.0,0.0,0.0,891.000000,1.000000
4,4,4,accumulo,34efaae87639a83b60fdb7274de4b45051025a3a,Billie Rinaldi,Billie Rinaldi,2011-10-05T17:19:06Z,0.0,0.0,4.0,0.0,0.0,1.000000,2.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
140648,140682,140682,zookeeper,cc900a3b05bc31a237753680c8b00dc5866df4b2,Brian Nixon,Norbert Kalmar,2019-07-15T14:15:03Z,0.0,0.0,7.0,0.0,0.0,15.142857,0.000000
140649,140683,140683,zookeeper,1c83846615701e88749690f06993a6e77452b83c,Ivan Yurchenko,Andor Molnar,2019-07-15T14:46:48Z,3.0,0.0,9.0,0.0,0.0,43.166667,7.333333
140650,140684,140684,zookeeper,f873dcf10e222e220732ab27cc6fc8c0ff0beec6,Andor Molnar,Norbert Kalmar,2019-07-16T09:21:14Z,0.0,0.0,1.0,0.0,0.0,3.000000,3.000000
140651,140685,140685,zookeeper,a6c36b69cc72d7d67e392dab5360007d6f737bef,maoling,Andor Molnar,2019-07-17T13:42:32Z,0.0,0.0,6.0,0.0,0.0,25.500000,39.666667


Once these two tables are merged, we can group by committer, computing the mean of the values:

In [141]:
gitCommitsChanges = gitCommitsChanges[['committer', 'ADD', 'DELETE', 'MODIFY', 'RENAME', 'UNKNOWN', 'linesAdded', 'linesRemoved']]

gitCommitsChanges = gitCommitsChanges.groupby(['committer']).mean()
gitCommitsChanges

Unnamed: 0_level_0,ADD,DELETE,MODIFY,RENAME,UNKNOWN,linesAdded,linesRemoved
committer,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
(no author),,,,,,,
-l,0.111111,1.111111,4.000000,0.407407,0.0,13.634957,14.141744
1028332163,0.000000,0.000000,8.307692,0.000000,0.0,6.843590,4.105769
A. J. David Bosschaert,0.595349,0.111628,2.527907,0.393023,0.0,20.001483,6.916947
A195882,1.000000,0.000000,1.000000,0.000000,0.0,58.000000,26.500000
...,...,...,...,...,...,...,...
Łukasz Gajowy,0.664234,0.036496,2.423358,0.751825,0.0,27.043863,7.846810
성준영,0.000000,0.000000,1.000000,0.000000,0.0,1.000000,1.000000
“Erin,1.000000,0.000000,0.500000,0.000000,0.0,179.250000,0.000000
吴雪山,0.000000,0.000000,1.000000,0.000000,0.0,1.000000,0.000000
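
The left join followed by a per-committer mean can be sketched as below. The data are invented; the point of the example is that a committer whose commits have no recorded changes ends up with `NaN` in every column, which is one way an all-`NaN` row like the first one above can arise.

```python
import pandas as pd

# Hypothetical commits; c3 has no entry in the per-commit statistics.
commits = pd.DataFrame({
    "commitHash": ["c1", "c2", "c3"],
    "committer": ["alice", "alice", "bob"],
})
per_commit = pd.DataFrame({
    "commitHash": ["c1", "c2"],
    "MODIFY": [1, 3],
    "linesAdded": [20.0, 4.0],
})

# The left join keeps every commit; missing statistics become NaN.
joined = pd.merge(commits, per_commit, how="left", on="commitHash")

# The mean skips NaN values, but a group with only NaN stays NaN.
per_committer = joined.groupby("committer")[["MODIFY", "linesAdded"]].mean()
print(per_committer)
```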


Now that we have the table `GIT_COMMITS_CHANGES` aggregated by committer, we can join it with the Data Frame using the attribute `committer`:

In [142]:
dataFrame = pd.merge(dataFrame, gitCommitsChanges, how='left', left_on=['committer'], right_on = ['committer'])
dataFrame.head()

Unnamed: 0,committer,numberCommits,fixedSZZIssues,fixedSonarIssues,fixedJiraIssues,inducedSZZIssues,inducedSonarIssues,timeInProject,resolutionTime,type_Bug,...,priority_Major,priority_Minor,priority_Trivial,ADD,DELETE,MODIFY,RENAME,UNKNOWN,linesAdded,linesRemoved
0,-l,27.0,0.0,0.0,0.0,0.0,2.0,4235880.0,0.0,0.0,...,0.0,0.0,0.0,0.111111,1.111111,4.0,0.407407,0.0,13.634957,14.141744
1,1028332163,14.0,0.0,0.0,0.0,0.0,0.0,77939.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,8.307692,0.0,0.0,6.84359,4.105769
2,A. J. David Bosschaert,432.0,51.0,0.0,0.0,1.0,0.0,173937105.0,0.0,0.0,...,0.0,0.0,0.0,0.595349,0.111628,2.527907,0.393023,0.0,20.001483,6.916947
3,A195882,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,58.0,26.5
4,A744013,5.0,0.0,4.0,0.0,0.0,3.0,351970.0,0.0,0.0,...,0.0,0.0,0.0,0.25,0.0,1.5,1.0,0.0,6.45,3.7


As in previous cases, the missing values that appear after the merge can be replaced by zeros:

In [143]:
dataFrame = dataFrame.fillna(0.0)

---

### REFACTORING_MINER_bug

In [144]:
refactoringMinerBug = pd.read_csv("../../data/interim/DataPreparation/ConstructData/REFACTORING_MINER_bug.csv").iloc[:,1:]
refactoringMinerBug.head()

Unnamed: 0,projectID,commitHash,refactoringType,bug
0,accumulo,4093a3015d6b789888077e317e535df4c8102e5d,Extract Method,False
1,accumulo,123bd993cff822e02242197a24f47ee36bfa3744,Extract Variable,False
2,accumulo,8c04c6ae5e5ba1432e40684428338ce68431766b,Extract Variable,False
3,accumulo,812f18b4534ae1eec41845a70a53adb783e77d61,Rename Variable,False
4,accumulo,eac6c062b586196d32b7770d7052148acaf3c276,Extract Method,False


As in the previous case, we first have to binarize the categorical attributes:

In [145]:
dum = pd.get_dummies(refactoringMinerBug[['refactoringType', 'bug']])
dum

Unnamed: 0,bug,refactoringType_Change Package,refactoringType_Extract And Move Method,refactoringType_Extract Class,refactoringType_Extract Interface,refactoringType_Extract Method,refactoringType_Extract Subclass,refactoringType_Extract Superclass,refactoringType_Extract Variable,refactoringType_Inline Method,...,refactoringType_Push Down Attribute,refactoringType_Push Down Method,refactoringType_Rename Attribute,refactoringType_Rename Class,refactoringType_Rename Method,refactoringType_Rename Package,refactoringType_Rename Parameter,refactoringType_Rename Variable,refactoringType_Replace Attribute,refactoringType_Replace Variable With Attribute
0,False,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,False,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,False,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,False,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,False,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31987,False,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
31988,False,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
31989,False,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
31990,False,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [146]:
commitHash = refactoringMinerBug[["commitHash"]]
refactoringMinerBug = pd.concat([commitHash,dum], axis=1)
refactoringMinerBug

Unnamed: 0,commitHash,bug,refactoringType_Change Package,refactoringType_Extract And Move Method,refactoringType_Extract Class,refactoringType_Extract Interface,refactoringType_Extract Method,refactoringType_Extract Subclass,refactoringType_Extract Superclass,refactoringType_Extract Variable,...,refactoringType_Push Down Attribute,refactoringType_Push Down Method,refactoringType_Rename Attribute,refactoringType_Rename Class,refactoringType_Rename Method,refactoringType_Rename Package,refactoringType_Rename Parameter,refactoringType_Rename Variable,refactoringType_Replace Attribute,refactoringType_Replace Variable With Attribute
0,4093a3015d6b789888077e317e535df4c8102e5d,False,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,123bd993cff822e02242197a24f47ee36bfa3744,False,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,8c04c6ae5e5ba1432e40684428338ce68431766b,False,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,812f18b4534ae1eec41845a70a53adb783e77d61,False,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,eac6c062b586196d32b7770d7052148acaf3c276,False,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31987,2d5dd1da4d144cd1ab76edda05e4faa2d6f368e3,False,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
31988,2d5dd1da4d144cd1ab76edda05e4faa2d6f368e3,False,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
31989,344a30792bb30430a5949fa20ae69872c42394e0,False,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
31990,a1c481ceca909e32ec49ff9738b5355eb1c367a7,False,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


And now we can group by commit computing the sum of the values:

In [147]:
refactoringMinerBug = refactoringMinerBug.groupby(['commitHash']).sum()
refactoringMinerBug

Unnamed: 0_level_0,bug,refactoringType_Change Package,refactoringType_Extract And Move Method,refactoringType_Extract Class,refactoringType_Extract Interface,refactoringType_Extract Method,refactoringType_Extract Subclass,refactoringType_Extract Superclass,refactoringType_Extract Variable,refactoringType_Inline Method,...,refactoringType_Push Down Attribute,refactoringType_Push Down Method,refactoringType_Rename Attribute,refactoringType_Rename Class,refactoringType_Rename Method,refactoringType_Rename Package,refactoringType_Rename Parameter,refactoringType_Rename Variable,refactoringType_Replace Attribute,refactoringType_Replace Variable With Attribute
commitHash,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
000c48dcee0a4c164687c19ad59fb762b96e5042,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
0010369ccf1cdf25e10ed2fd3a080edaf374d0ed,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
00131775cf82db598a0cda06bb36c67cb3602a81,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
00195d2543eb347cc3669a4ac89e98da0bc4dca4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
001db0530d1ad937a0d6ea6dded9b70b2cbe2cff,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ffea3c2835c76293a06b6b5306df08d26a7b9261,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
fff4bdfef9d8f6267177f6dba38691ec9bd7bcb0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
fff93f39c2da623845a15c5c888d5c9806ea19c8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
fffd2c55895e4eb03ae0796d3ce9c4f75b9b34c3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
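
To illustrate why summing one-hot columns per commit yields counts, here is a minimal sketch on made-up data (the hashes and refactoring types are illustrative, not taken from the dataset). It also builds the same counts with `pd.crosstab` for comparison.

```python
import pandas as pd

# Toy data: one row per detected refactoring (values are illustrative only).
toy = pd.DataFrame({
    "commitHash": ["a1", "a1", "a1", "b2"],
    "refactoringType": ["Rename Method", "Rename Method", "Extract Method", "Rename Class"],
})

# One-hot encode the type, then sum per commit: each cell becomes a count.
dummies = pd.get_dummies(toy["refactoringType"], prefix="refactoringType", dtype=int)
per_commit = pd.concat([toy[["commitHash"]], dummies], axis=1).groupby("commitHash").sum()

# pd.crosstab counts the same (commit, type) pairs directly; its columns lack the prefix.
counts = pd.crosstab(toy["commitHash"], toy["refactoringType"])
print(per_commit)
```

Both tables hold the same numbers; the one-hot route is convenient here because the dummy columns already exist in the notebook.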


As in the case of the previous tables, we have to first join this table with `GIT_COMMITS` in order to have the attribute `committer`:

In [148]:
refactoringMinerBug = pd.merge(refactoringMinerBug, gitCommits, how='left', left_on=['commitHash'], right_on = ['commitHash'])
refactoringMinerBug = pd.concat([refactoringMinerBug[['committer']], refactoringMinerBug.iloc[:,:-4]], axis=1)
refactoringMinerBug

Unnamed: 0.2,committer,commitHash,bug,refactoringType_Change Package,refactoringType_Extract And Move Method,refactoringType_Extract Class,refactoringType_Extract Interface,refactoringType_Extract Method,refactoringType_Extract Subclass,refactoringType_Extract Superclass,...,refactoringType_Rename Attribute,refactoringType_Rename Class,refactoringType_Rename Method,refactoringType_Rename Package,refactoringType_Rename Parameter,refactoringType_Rename Variable,refactoringType_Replace Attribute,refactoringType_Replace Variable With Attribute,Unnamed: 0,Unnamed: 0.1
0,,000c48dcee0a4c164687c19ad59fb762b96e5042,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,,
1,Oleg Kalnichevski,0010369ccf1cdf25e10ed2fd3a080edaf374d0ed,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,131403.0,131403.0
2,David Leangen,00131775cf82db598a0cda06bb36c67cb3602a81,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,126553.0,126553.0
3,Kenneth Knowles,00195d2543eb347cc3669a4ac89e98da0bc4dca4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,48576.0,48576.0
4,Marcel Offermans,001db0530d1ad937a0d6ea6dded9b70b2cbe2cff,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,117206.0,117206.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11693,Kenneth Knowles,ffea3c2835c76293a06b6b5306df08d26a7b9261,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,59548.0,59548.0
11694,Oleg Kalnichevski,fff4bdfef9d8f6267177f6dba38691ec9bd7bcb0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,133297.0,133297.0
11695,,fff93f39c2da623845a15c5c888d5c9806ea19c8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,,
11696,Dan Halperin,fffd2c55895e4eb03ae0796d3ce9c4f75b9b34c3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,52018.0,52018.0
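
Some rows in the output above have no `committer` after the left join. Since `groupby` drops rows whose key is missing by default, those commits would not contribute to the per-committer totals. A minimal sketch, on toy data with hypothetical names, of how the `indicator` option of `pd.merge` can be used to count such rows:

```python
import pandas as pd

# Toy stand-ins for the per-commit table and GIT_COMMITS (illustrative values).
per_commit = pd.DataFrame({"commitHash": ["a1", "b2", "c3"], "bug": [0.0, 1.0, 0.0]})
git_commits = pd.DataFrame({"commitHash": ["a1", "c3"], "committer": ["Alice", "Bob"]})

# indicator=True adds a `_merge` column; for a left join it is 'both' or 'left_only'.
merged = pd.merge(per_commit, git_commits, how="left", on="commitHash", indicator=True)

# Rows marked 'left_only' found no matching commit on the right-hand side.
unmatched = merged[merged["_merge"] == "left_only"]
print(len(unmatched), "commit(s) without a committer")
```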


Now we can aggregate by committer computing the sum of the values:

In [149]:
refactoringMinerBug = refactoringMinerBug.groupby(['committer']).sum()
refactoringMinerBug

Unnamed: 0_level_0,bug,refactoringType_Change Package,refactoringType_Extract And Move Method,refactoringType_Extract Class,refactoringType_Extract Interface,refactoringType_Extract Method,refactoringType_Extract Subclass,refactoringType_Extract Superclass,refactoringType_Extract Variable,refactoringType_Inline Method,...,refactoringType_Rename Attribute,refactoringType_Rename Class,refactoringType_Rename Method,refactoringType_Rename Package,refactoringType_Rename Parameter,refactoringType_Rename Variable,refactoringType_Replace Attribute,refactoringType_Replace Variable With Attribute,Unnamed: 0,Unnamed: 0.1
committer,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
A. J. David Bosschaert,0.0,0.0,1.0,0.0,1.0,13.0,0.0,2.0,0.0,3.0,...,0.0,12.0,19.0,3.0,0.0,0.0,0.0,0.0,7788583.0,7788583.0
A744013,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,69577.0,69577.0
Aaron Dossett,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,35118.0,35118.0
Abraham Fine,0.0,0.0,2.0,0.0,0.0,2.0,0.0,0.0,2.0,0.0,...,0.0,0.0,1.0,0.0,1.0,1.0,0.0,2.0,561467.0,561467.0
Adrian Crum,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,101487.0,101487.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
vvarma,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,122847.0,122847.0
wtanaka.com,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,53102.0,53102.0
xiliu,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,59811.0,59811.0
zmanji@apache.org,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,80923.0,80923.0
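
The output above still contains `Unnamed: 0` and `Unnamed: 0.1` columns, which look like row indices saved in an earlier CSV and whose sums carry no meaning. One possible way to drop them before aggregating is sketched below on toy data (the table and its values are illustrative assumptions):

```python
import pandas as pd

# Toy table mimicking a merge result that carries a stray index column.
toy = pd.DataFrame({
    "committer": ["Alice", "Alice", "Bob"],
    "bug": [0.0, 1.0, 0.0],
    "Unnamed: 0": [131403.0, 126553.0, 48576.0],
})

# Drop every column whose name starts with "Unnamed" before aggregating.
stray = [c for c in toy.columns if c.startswith("Unnamed")]
per_committer = toy.drop(columns=stray).groupby("committer").sum()
print(per_committer)
```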


Since we have the values aggregated by developer, we can join this table with the Data Frame using the attribute `committer`, and then fill the new missing values with zeros:

In [150]:
dataFrame = pd.merge(dataFrame, refactoringMinerBug, how='left', left_on=['committer'], right_on = ['committer'])
dataFrame.head()

Unnamed: 0.2,committer,numberCommits,fixedSZZIssues,fixedSonarIssues,fixedJiraIssues,inducedSZZIssues,inducedSonarIssues,timeInProject,resolutionTime,type_Bug,...,refactoringType_Rename Attribute,refactoringType_Rename Class,refactoringType_Rename Method,refactoringType_Rename Package,refactoringType_Rename Parameter,refactoringType_Rename Variable,refactoringType_Replace Attribute,refactoringType_Replace Variable With Attribute,Unnamed: 0,Unnamed: 0.1
0,-l,27.0,0.0,0.0,0.0,0.0,2.0,4235880.0,0.0,0.0,...,,,,,,,,,,
1,1028332163,14.0,0.0,0.0,0.0,0.0,0.0,77939.0,0.0,0.0,...,,,,,,,,,,
2,A. J. David Bosschaert,432.0,51.0,0.0,0.0,1.0,0.0,173937105.0,0.0,0.0,...,0.0,12.0,19.0,3.0,0.0,0.0,0.0,0.0,7788583.0,7788583.0
3,A195882,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,,,,,,,,,,
4,A744013,5.0,0.0,4.0,0.0,0.0,3.0,351970.0,0.0,0.0,...,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,69577.0,69577.0


In [151]:
dataFrame = dataFrame.fillna(0.0)
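
Calling `fillna` on the whole Data Frame also replaces any missing values that were already present in older columns. If that is not wanted, the fill can be restricted to the columns that the merge just added. A minimal sketch on toy data (table and column values are illustrative):

```python
import pandas as pd

# Toy profile table and a per-committer table to attach (illustrative values).
profile = pd.DataFrame({"committer": ["Alice", "Bob"], "numberCommits": [5.0, 2.0]})
refactorings = pd.DataFrame({"committer": ["Alice"], "refactoringType_Rename Method": [3.0]})

merged = pd.merge(profile, refactorings, how="left", on="committer")

# Fill only the columns that came from the right-hand table.
new_cols = [c for c in refactorings.columns if c != "committer"]
merged[new_cols] = merged[new_cols].fillna(0.0)
print(merged)
```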

---

### SONAR_MEASURES_difference

In [152]:
sonarMeasures = pd.read_csv("../../data/interim/DataPreparation/ConstructData/SONAR_MEASURES_difference.csv").iloc[:,1:]
sonarMeasures.head()

Unnamed: 0,commitHash,projectID,functions,commentLinesDensity,complexity,functionComplexity,duplicatedLinesDensity,violations,blockerViolations,criticalViolations,...,minorViolations,codeSmells,bugs,vulnerabilities,cognitiveComplexity,ncloc,sqaleIndex,sqaleDebtRatio,reliabilityRemediationEffort,securityRemediationEffort
0,e0880e263e4bf8662ba3848405200473a25dfc9f,accumulo,17295.0,6.2,43137.0,2.5,17.6,18314.0,142.0,893.0,...,9889.0,17012.0,464.0,838.0,39453.0,203873.0,212384.0,3.5,7322.0,9505.0
1,e8774c5ec3a35e042f320540b5f7e66ebd2d9e87,accumulo,0.0,0.0,0.0,0.0,0.0,-145.0,0.0,0.0,...,1.0,-25.0,-120.0,0.0,0.0,-917.0,-184.0,0.0,-241.0,0.0
2,2032ebbd0ed90734da39ca238bbd10dee24d0030,accumulo,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,de297d4932e08625a5df146f0802041bb5aeb892,accumulo,0.0,0.0,0.0,0.0,0.0,146.0,0.0,0.0,...,0.0,26.0,120.0,0.0,0.0,885.0,185.0,0.0,241.0,0.0
4,34efaae87639a83b60fdb7274de4b45051025a3a,accumulo,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,-4.0,0.0,0.0,0.0,0.0


First of all, we merge this table with `GIT_COMMITS` using `commitHash` in order to have the attribute `committer`:

In [153]:
gitCommits = pd.read_csv("../../data/interim/DataPreparation/CleanData/GIT_COMMITS_clean.csv")[['commitHash', 'committer']]
sonarMeasures = pd.merge(sonarMeasures, gitCommits, how='left', on='commitHash').iloc[:,2:]
sonarMeasures

Unnamed: 0,functions,commentLinesDensity,complexity,functionComplexity,duplicatedLinesDensity,violations,blockerViolations,criticalViolations,infoViolations,majorViolations,...,codeSmells,bugs,vulnerabilities,cognitiveComplexity,ncloc,sqaleIndex,sqaleDebtRatio,reliabilityRemediationEffort,securityRemediationEffort,committer
0,17295.0,6.2,43137.0,2.5,17.6,18314.0,142.0,893.0,80.0,7310.0,...,17012.0,464.0,838.0,39453.0,203873.0,212384.0,3.5,7322.0,9505.0,Keith Turner
1,0.0,0.0,0.0,0.0,0.0,-145.0,0.0,0.0,0.0,-146.0,...,-25.0,-120.0,0.0,0.0,-917.0,-184.0,0.0,-241.0,0.0,Billie Rinaldi
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Keith Turner
3,0.0,0.0,0.0,0.0,0.0,146.0,0.0,0.0,0.0,146.0,...,26.0,120.0,0.0,0.0,885.0,185.0,0.0,241.0,0.0,Billie Rinaldi
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,-4.0,0.0,0.0,0.0,0.0,Billie Rinaldi
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
55620,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,fpj
55621,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,fpj
55622,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,fpj
55623,6.0,0.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,4.0,43.0,93.0,0.0,0.0,0.0,fpj


In [154]:
list(sonarMeasures)

['functions',
 'commentLinesDensity',
 'complexity',
 'functionComplexity',
 'duplicatedLinesDensity',
 'violations',
 'blockerViolations',
 'criticalViolations',
 'infoViolations',
 'majorViolations',
 'minorViolations',
 'codeSmells',
 'bugs',
 'vulnerabilities',
 'cognitiveComplexity',
 'ncloc',
 'sqaleIndex',
 'sqaleDebtRatio',
 'reliabilityRemediationEffort',
 'securityRemediationEffort',
 'committer']

Then we aggregate by committer: the two density measures (`commentLinesDensity` and `duplicatedLinesDensity`) are averaged, and all the other measures are summed:

In [155]:
sonarMeasures_committer = sonarMeasures.groupby(['committer']).agg({'functions':'sum', 'commentLinesDensity':'mean', 
'complexity':'sum', 'functionComplexity':'sum', 'duplicatedLinesDensity':'mean', 'violations':'sum', 'blockerViolations':'sum',
 'criticalViolations':'sum','infoViolations':'sum','majorViolations':'sum','minorViolations':'sum','codeSmells':'sum',
 'bugs':'sum','vulnerabilities':'sum','cognitiveComplexity':'sum','ncloc':'sum','sqaleIndex':'sum',
 'sqaleDebtRatio':'sum','reliabilityRemediationEffort':'sum','securityRemediationEffort':'sum'}).reset_index()
sonarMeasures_committer

Unnamed: 0,committer,functions,commentLinesDensity,complexity,functionComplexity,duplicatedLinesDensity,violations,blockerViolations,criticalViolations,infoViolations,...,minorViolations,codeSmells,bugs,vulnerabilities,cognitiveComplexity,ncloc,sqaleIndex,sqaleDebtRatio,reliabilityRemediationEffort,securityRemediationEffort
0,Adam Fuchs,1401.0,0.032692,1855.0,-0.6,-7.115385e-02,2497.0,-23.0,38.0,586.0,...,106.0,1780.0,733.0,-16.0,-1259.0,15267.0,13920.0,-0.2,1137.0,-395.0
1,Adrian Crum,2.0,-0.100000,29.0,0.0,1.750000e-01,23.0,0.0,0.0,0.0,...,-1.0,3.0,0.0,20.0,53.0,549.0,72.0,-0.1,0.0,600.0
2,Adrian Nistor,0.0,0.000000,0.0,0.0,0.000000e+00,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Alex Karasulu,-152.0,0.010638,-496.0,-0.1,-2.765957e-02,-155.0,-2.0,-15.0,-7.0,...,-20.0,-152.0,-4.0,1.0,-646.0,-749.0,-2697.0,-0.3,-21.0,10.0
4,Alex Yarmula,0.0,0.000000,0.0,0.0,0.000000e+00,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
327,kpaul,0.0,0.000000,1.0,0.0,0.000000e+00,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,3.0,3.0,0.0,0.0,0.0,0.0
328,markt,2.0,0.000000,2.0,0.0,0.000000e+00,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0
329,pjack,228.0,0.066667,448.0,0.0,3.558407e-19,214.0,2.0,0.0,9.0,...,95.0,206.0,7.0,1.0,226.0,1900.0,2867.0,0.2,30.0,10.0
330,root,2448.0,0.066667,4440.0,0.0,6.666667e-02,1967.0,0.0,53.0,15.0,...,955.0,1929.0,1.0,37.0,3626.0,34952.0,37755.0,-0.1,5.0,520.0
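
The aggregation dictionary above can also be built programmatically, which keeps the choice of function per column in one place. A minimal sketch on a toy table with only three of the measures (values are illustrative; `mean_cols` and `agg_spec` are hypothetical helper names):

```python
import pandas as pd

# Toy per-commit measures (illustrative values; only three of the columns).
toy = pd.DataFrame({
    "committer": ["Alice", "Alice", "Bob"],
    "ncloc": [10.0, -4.0, 7.0],
    "commentLinesDensity": [0.2, 0.0, -0.1],
    "duplicatedLinesDensity": [1.0, 3.0, 0.0],
})

# The density columns are averaged; every other measure is summed.
mean_cols = {"commentLinesDensity", "duplicatedLinesDensity"}
agg_spec = {c: ("mean" if c in mean_cols else "sum") for c in toy.columns if c != "committer"}

per_committer = toy.groupby("committer").agg(agg_spec).reset_index()
print(per_committer)
```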


Once we have the table aggregated by committer, we can merge it with the Data Frame using the attribute `committer`. We also fill the newly created missing values with zeros:

In [156]:
dataFrame = pd.merge(dataFrame, sonarMeasures_committer, how='left', on='committer')
dataFrame = dataFrame.fillna(0)
dataFrame.head()

Unnamed: 0,committer,numberCommits,fixedSZZIssues,fixedSonarIssues,fixedJiraIssues,inducedSZZIssues,inducedSonarIssues,timeInProject,resolutionTime,type_Bug,...,minorViolations,codeSmells,bugs,vulnerabilities,cognitiveComplexity,ncloc,sqaleIndex,sqaleDebtRatio,reliabilityRemediationEffort,securityRemediationEffort
0,-l,27.0,0.0,0.0,0.0,0.0,2.0,4235880.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1028332163,14.0,0.0,0.0,0.0,0.0,0.0,77939.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,A. J. David Bosschaert,432.0,51.0,0.0,0.0,1.0,0.0,173937105.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,A195882,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,A744013,5.0,0.0,4.0,0.0,0.0,3.0,351970.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


---

### SONAR_ISSUES_time

We begin by taking the attributes from the table that we are interested in:

In [157]:
sonarIssues = pd.read_csv("../../data/interim/DataPreparation/ConstructData/SONAR_ISSUES_time.csv").iloc[:,4:]
sonarIssues = pd.concat([sonarIssues[['creationCommitHash']], sonarIssues.iloc[:,2:-2], sonarIssues[['closeTime']]], axis=1)
sonarIssues.head()

Unnamed: 0,creationCommitHash,type,severity,debt,closeTime
0,d3416d3a25b16da3d18b3849522fa96183918e5b,CODE_SMELL,MAJOR,20min,138827.054722
1,d3416d3a25b16da3d18b3849522fa96183918e5b,CODE_SMELL,MINOR,1min,57200.685278
2,d3416d3a25b16da3d18b3849522fa96183918e5b,CODE_SMELL,MAJOR,30min,138827.054722
3,d3416d3a25b16da3d18b3849522fa96183918e5b,CODE_SMELL,MINOR,1min,57200.685278
4,d3416d3a25b16da3d18b3849522fa96183918e5b,CODE_SMELL,MINOR,1min,138827.054722


The first thing we have to do with this table is transform the values of the attribute `debt` so that it can be used in the later modelling section. Its values are strings expressing time in minutes or hours, depending on the unit suffix: for example `20min` or `1h`, but also combined forms such as `1h20min`. To convert these strings to floats (in hours) we use the `Timedelta` constructor from the pandas library as follows:

In [158]:
debtSec = sonarIssues.debt.apply(pd.Timedelta)
debtHour = debtSec.apply(lambda x: x.seconds/3600 + x.days*24)
debtHour

0          0.333333
1          0.016667
2          0.500000
3          0.016667
4          0.016667
             ...   
1532441    0.333333
1532442    0.033333
1532443    0.166667
1532444    0.166667
1532445    0.166667
Name: debt, Length: 1532446, dtype: float64
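
As an alternative sketch of the same conversion, the whole column can be parsed at once with `pd.to_timedelta` and converted through `total_seconds()`, which avoids two Python-level `apply` calls. The sample strings are illustrative, chosen to match the formats described above:

```python
import pandas as pd

# Illustrative debt strings in the formats described above.
debt = pd.Series(["20min", "1h", "1h20min"])

# Parse the whole column at once, then convert seconds to hours.
debt_hours = pd.to_timedelta(debt).dt.total_seconds() / 3600
print(debt_hours)
```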

In [159]:
sonarIssues[['debt']] = debtHour.to_frame()
sonarIssues.head()

Unnamed: 0,creationCommitHash,type,severity,debt,closeTime
0,d3416d3a25b16da3d18b3849522fa96183918e5b,CODE_SMELL,MAJOR,0.333333,138827.054722
1,d3416d3a25b16da3d18b3849522fa96183918e5b,CODE_SMELL,MINOR,0.016667,57200.685278
2,d3416d3a25b16da3d18b3849522fa96183918e5b,CODE_SMELL,MAJOR,0.5,138827.054722
3,d3416d3a25b16da3d18b3849522fa96183918e5b,CODE_SMELL,MINOR,0.016667,57200.685278
4,d3416d3a25b16da3d18b3849522fa96183918e5b,CODE_SMELL,MINOR,0.016667,138827.054722


This table also has some categorical attributes that have to be binarized before they can be aggregated:

In [160]:
dum = pd.get_dummies(sonarIssues[['type', 'severity']])
dum

Unnamed: 0,type_BUG,type_CODE_SMELL,type_VULNERABILITY,severity_BLOCKER,severity_CRITICAL,severity_INFO,severity_MAJOR,severity_MINOR
0,0,1,0,0,0,0,1,0
1,0,1,0,0,0,0,0,1
2,0,1,0,0,0,0,1,0
3,0,1,0,0,0,0,0,1
4,0,1,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...
1532441,0,1,0,0,0,0,1,0
1532442,0,1,0,0,0,0,0,1
1532443,0,1,0,0,0,0,0,1
1532444,0,1,0,0,0,0,0,1
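
The output above shows 0/1 integers. Since pandas 2.0, `get_dummies` returns boolean columns by default, so passing `dtype=int` is one way to get the same representation across versions. A minimal sketch on toy data (values are illustrative):

```python
import pandas as pd

# Toy issues with the two categorical attributes (illustrative values).
toy = pd.DataFrame({
    "type": ["CODE_SMELL", "BUG", "CODE_SMELL"],
    "severity": ["MAJOR", "MINOR", "MINOR"],
})

# dtype=int forces 0/1 columns regardless of the pandas version's default.
dum = pd.get_dummies(toy[["type", "severity"]], dtype=int)
print(dum.columns.tolist())
```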


In [161]:
sonarIssues = pd.concat([sonarIssues[['creationCommitHash','debt','closeTime']], dum], axis=1)
sonarIssues.head()

Unnamed: 0,creationCommitHash,debt,closeTime,type_BUG,type_CODE_SMELL,type_VULNERABILITY,severity_BLOCKER,severity_CRITICAL,severity_INFO,severity_MAJOR,severity_MINOR
0,d3416d3a25b16da3d18b3849522fa96183918e5b,0.333333,138827.054722,0,1,0,0,0,0,1,0
1,d3416d3a25b16da3d18b3849522fa96183918e5b,0.016667,57200.685278,0,1,0,0,0,0,0,1
2,d3416d3a25b16da3d18b3849522fa96183918e5b,0.5,138827.054722,0,1,0,0,0,0,1,0
3,d3416d3a25b16da3d18b3849522fa96183918e5b,0.016667,57200.685278,0,1,0,0,0,0,0,1
4,d3416d3a25b16da3d18b3849522fa96183918e5b,0.016667,138827.054722,0,1,0,0,0,0,0,1


In [162]:
closeTime = sonarIssues.loc[:,['creationCommitHash','closeTime']]
closeTime = closeTime.groupby(["creationCommitHash"]).mean()
closeTime

Unnamed: 0_level_0,closeTime
creationCommitHash,Unnamed: 1_level_1
0001f90914b418859eb9fa86903e89a793e48e9b,12757.497963
0010369ccf1cdf25e10ed2fd3a080edaf374d0ed,17235.954444
0010fcad01ce4ac5051a61bec349bd6ac397c994,44169.537222
001484f7f4144b7c85cfd419aa6e892ffc65b751,1703.955556
001720ab4ce466d667c13e3874e702028e653279,5589.060000
...,...
ffe57d1369153323981007fb7f1078ad7e2886f0,2480.662222
ffe60557caf6e83f6704d0819bfd3f09b3db0d20,5020.214090
fff00791ed30f338ffc085f099be5517dc359261,73649.223056
fff09e8cb504067b385e05722c30df06906a405b,12471.290056


Now we aggregate by commit the `debt` attribute previously computed:

In [163]:
Debt = sonarIssues[['creationCommitHash','debt']]
Debt = Debt.groupby(['creationCommitHash']).sum()
Debt

Unnamed: 0_level_0,debt
creationCommitHash,Unnamed: 1_level_1
0001f90914b418859eb9fa86903e89a793e48e9b,2.700000
0010369ccf1cdf25e10ed2fd3a080edaf374d0ed,0.083333
0010fcad01ce4ac5051a61bec349bd6ac397c994,4.250000
001484f7f4144b7c85cfd419aa6e892ffc65b751,0.166667
001720ab4ce466d667c13e3874e702028e653279,0.066667
...,...
ffe57d1369153323981007fb7f1078ad7e2886f0,0.083333
ffe60557caf6e83f6704d0819bfd3f09b3db0d20,78.333333
fff00791ed30f338ffc085f099be5517dc359261,2.583333
fff09e8cb504067b385e05722c30df06906a405b,22.500000


We also aggregate by commit the categorical attributes previously binarized:

In [164]:
TypesSeverities = pd.concat([sonarIssues[['creationCommitHash']], sonarIssues.iloc[:,3:]], axis=1)
TypesSeverities = TypesSeverities.groupby(['creationCommitHash']).sum()
TypesSeverities

Unnamed: 0_level_0,type_BUG,type_CODE_SMELL,type_VULNERABILITY,severity_BLOCKER,severity_CRITICAL,severity_INFO,severity_MAJOR,severity_MINOR
creationCommitHash,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0001f90914b418859eb9fa86903e89a793e48e9b,0.0,12.0,0.0,0.0,0.0,0.0,2.0,10.0
0010369ccf1cdf25e10ed2fd3a080edaf374d0ed,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
0010fcad01ce4ac5051a61bec349bd6ac397c994,0.0,9.0,0.0,0.0,0.0,2.0,7.0,0.0
001484f7f4144b7c85cfd419aa6e892ffc65b751,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
001720ab4ce466d667c13e3874e702028e653279,0.0,2.0,0.0,0.0,0.0,0.0,0.0,2.0
...,...,...,...,...,...,...,...,...
ffe57d1369153323981007fb7f1078ad7e2886f0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
ffe60557caf6e83f6704d0819bfd3f09b3db0d20,0.0,94.0,0.0,0.0,0.0,0.0,94.0,0.0
fff00791ed30f338ffc085f099be5517dc359261,3.0,11.0,0.0,0.0,0.0,0.0,14.0,0.0
fff09e8cb504067b385e05722c30df06906a405b,0.0,15.0,0.0,0.0,0.0,0.0,15.0,0.0


We concatenate the attributes that we have previously aggregated by commit:

In [165]:
sonarIssues = pd.concat([Debt, closeTime, TypesSeverities], axis=1)
sonarIssues

Unnamed: 0_level_0,debt,closeTime,type_BUG,type_CODE_SMELL,type_VULNERABILITY,severity_BLOCKER,severity_CRITICAL,severity_INFO,severity_MAJOR,severity_MINOR
creationCommitHash,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
0001f90914b418859eb9fa86903e89a793e48e9b,2.700000,12757.497963,0.0,12.0,0.0,0.0,0.0,0.0,2.0,10.0
0010369ccf1cdf25e10ed2fd3a080edaf374d0ed,0.083333,17235.954444,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
0010fcad01ce4ac5051a61bec349bd6ac397c994,4.250000,44169.537222,0.0,9.0,0.0,0.0,0.0,2.0,7.0,0.0
001484f7f4144b7c85cfd419aa6e892ffc65b751,0.166667,1703.955556,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
001720ab4ce466d667c13e3874e702028e653279,0.066667,5589.060000,0.0,2.0,0.0,0.0,0.0,0.0,0.0,2.0
...,...,...,...,...,...,...,...,...,...,...
ffe57d1369153323981007fb7f1078ad7e2886f0,0.083333,2480.662222,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
ffe60557caf6e83f6704d0819bfd3f09b3db0d20,78.333333,5020.214090,0.0,94.0,0.0,0.0,0.0,0.0,94.0,0.0
fff00791ed30f338ffc085f099be5517dc359261,2.583333,73649.223056,3.0,11.0,0.0,0.0,0.0,0.0,14.0,0.0
fff09e8cb504067b385e05722c30df06906a405b,22.500000,12471.290056,0.0,15.0,0.0,0.0,0.0,0.0,15.0,0.0
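
The three separate group-bys followed by a concatenation can also be expressed as a single `agg` call with one function per column. The sketch below uses a toy issue-level table (values are illustrative, and only two of the one-hot columns are included):

```python
import pandas as pd

# Toy issue-level table (illustrative values).
issues = pd.DataFrame({
    "creationCommitHash": ["a1", "a1", "b2"],
    "debt": [0.5, 0.25, 1.0],
    "closeTime": [100.0, 300.0, 50.0],
    "type_BUG": [1, 0, 0],
    "severity_MAJOR": [1, 1, 0],
})

# Sum debt and every one-hot column; average closeTime.
spec = {c: "sum" for c in issues.columns if c != "creationCommitHash"}
spec["closeTime"] = "mean"

per_commit = issues.groupby("creationCommitHash").agg(spec)
print(per_commit)
```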


Now, we have to merge this table with the `GIT_COMMITS` table using `creationCommitHash` and `commitHash` in order to have the attribute `committer`:

In [166]:
sonarIssues2 = pd.merge(sonarIssues, gitCommits, how='left', left_on=['creationCommitHash'], right_on=['commitHash'])
sonarIssues2

Unnamed: 0,debt,closeTime,type_BUG,type_CODE_SMELL,type_VULNERABILITY,severity_BLOCKER,severity_CRITICAL,severity_INFO,severity_MAJOR,severity_MINOR,commitHash,committer
0,2.700000,12757.497963,0.0,12.0,0.0,0.0,0.0,0.0,2.0,10.0,0001f90914b418859eb9fa86903e89a793e48e9b,Santhosh Kumar
1,0.083333,17235.954444,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0010369ccf1cdf25e10ed2fd3a080edaf374d0ed,Oleg Kalnichevski
2,4.250000,44169.537222,0.0,9.0,0.0,0.0,0.0,2.0,7.0,0.0,0010fcad01ce4ac5051a61bec349bd6ac397c994,Scott Sanders
3,0.166667,1703.955556,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,001484f7f4144b7c85cfd419aa6e892ffc65b751,Bill Farner
4,0.066667,5589.060000,0.0,2.0,0.0,0.0,0.0,0.0,0.0,2.0,001720ab4ce466d667c13e3874e702028e653279,Oleg Kalnichevski
...,...,...,...,...,...,...,...,...,...,...,...,...
19221,0.083333,2480.662222,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,ffe57d1369153323981007fb7f1078ad7e2886f0,Vadim Gritsenko
19222,78.333333,5020.214090,0.0,94.0,0.0,0.0,0.0,0.0,94.0,0.0,ffe60557caf6e83f6704d0819bfd3f09b3db0d20,Eric C. Newton
19223,2.583333,73649.223056,3.0,11.0,0.0,0.0,0.0,0.0,14.0,0.0,fff00791ed30f338ffc085f099be5517dc359261,Craig R. McClanahan
19224,22.500000,12471.290056,0.0,15.0,0.0,0.0,0.0,0.0,15.0,0.0,fff09e8cb504067b385e05722c30df06906a405b,Colm O Heigeartaigh


As before, we aggregate the `closeTime` attribute by `committer`:

In [167]:
closeTime = sonarIssues2.loc[:,['committer','closeTime']]
closeTime = closeTime.groupby(["committer"]).mean()
closeTime

Unnamed: 0_level_0,closeTime
committer,Unnamed: 1_level_1
(no author),47061.911846
-l,22617.766389
A195882,809.977238
A744013,821.388563
Aaron Dossett,640.382545
...,...
root,605.878369
sanjay-patel-1991,16.971944
sposetti,32092.525736
svimal2106,23277.632500


We also aggregate the `debt` attribute by `committer`:

In [168]:
Debt = sonarIssues2[['committer','debt']]
Debt = Debt.groupby(['committer']).sum()
Debt

Unnamed: 0_level_0,debt
committer,Unnamed: 1_level_1
(no author),1329.316667
-l,0.833333
A195882,1.933333
A744013,17.133333
Aaron Dossett,3.833333
...,...
root,9.650000
sanjay-patel-1991,0.583333
sposetti,5.433333
svimal2106,0.400000


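We also aggregate the categorical variables previously binarized. Note that the positional slice `iloc[:,2:-3]` stops before `severity_MINOR`, so that column does not appear in the aggregated table: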

In [169]:
TypesSeverities = pd.concat([sonarIssues2[['committer']], sonarIssues2.iloc[:,2:-3]], axis=1)
TypesSeverities = TypesSeverities.groupby(['committer']).sum()
TypesSeverities

Unnamed: 0_level_0,type_BUG,type_CODE_SMELL,type_VULNERABILITY,severity_BLOCKER,severity_CRITICAL,severity_INFO,severity_MAJOR
committer,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
(no author),133.0,4723.0,140.0,356.0,29.0,2.0,2475.0
-l,0.0,2.0,0.0,0.0,0.0,0.0,2.0
A195882,0.0,18.0,0.0,0.0,0.0,7.0,8.0
A744013,14.0,78.0,3.0,0.0,25.0,0.0,24.0
Aaron Dossett,0.0,19.0,0.0,0.0,0.0,0.0,19.0
...,...,...,...,...,...,...,...
root,0.0,71.0,0.0,0.0,0.0,0.0,21.0
sanjay-patel-1991,0.0,16.0,0.0,0.0,0.0,0.0,0.0
sposetti,0.0,51.0,2.0,0.0,5.0,0.0,9.0
svimal2106,0.0,3.0,0.0,0.0,0.0,0.0,0.0
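
Positional slices such as `iloc[:,2:-3]` are easy to get off by one when columns are added or reordered. A possible alternative, sketched here on a toy table with a similar layout (values are illustrative), is to select the one-hot columns by their name prefix:

```python
import pandas as pd

# Toy per-commit table with a similar column layout (illustrative values).
toy = pd.DataFrame({
    "debt": [0.5, 1.0],
    "closeTime": [10.0, 20.0],
    "type_BUG": [1.0, 0.0],
    "severity_MAJOR": [0.0, 2.0],
    "severity_MINOR": [3.0, 1.0],
    "commitHash": ["a1", "b2"],
    "committer": ["Alice", "Alice"],
})

# Pick the one-hot columns by prefix instead of by position.
onehot_cols = [c for c in toy.columns if c.startswith(("type_", "severity_"))]
types_severities = toy.groupby("committer")[onehot_cols].sum()
print(types_severities)
```

With this selection every `type_` and `severity_` column is kept, including `severity_MINOR`.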


Then we concatenate the attributes to obtain the table aggregated by `committer`. This table can then be merged with the Data Frame using the `committer` attribute:

In [170]:
sonarIssues2 = pd.concat([Debt, closeTime, TypesSeverities], axis=1)
sonarIssues2['committer'] = sonarIssues2.index
sonarIssues2 = sonarIssues2.iloc[:,:-1]
sonarIssues2

Unnamed: 0_level_0,debt,closeTime,type_BUG,type_CODE_SMELL,type_VULNERABILITY,severity_BLOCKER,severity_CRITICAL,severity_INFO,severity_MAJOR
committer,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
(no author),1329.316667,47061.911846,133.0,4723.0,140.0,356.0,29.0,2.0,2475.0
-l,0.833333,22617.766389,0.0,2.0,0.0,0.0,0.0,0.0,2.0
A195882,1.933333,809.977238,0.0,18.0,0.0,0.0,0.0,7.0,8.0
A744013,17.133333,821.388563,14.0,78.0,3.0,0.0,25.0,0.0,24.0
Aaron Dossett,3.833333,640.382545,0.0,19.0,0.0,0.0,0.0,0.0,19.0
...,...,...,...,...,...,...,...,...,...
root,9.650000,605.878369,0.0,71.0,0.0,0.0,0.0,0.0,21.0
sanjay-patel-1991,0.583333,16.971944,0.0,16.0,0.0,0.0,0.0,0.0,0.0
sposetti,5.433333,32092.525736,0.0,51.0,2.0,0.0,5.0,0.0,9.0
svimal2106,0.400000,23277.632500,0.0,3.0,0.0,0.0,0.0,0.0,0.0


In [171]:
dataFrame = pd.merge(dataFrame, sonarIssues2, how='left', on='committer')
dataFrame.head()

Unnamed: 0,committer,numberCommits,fixedSZZIssues,fixedSonarIssues,fixedJiraIssues,inducedSZZIssues,inducedSonarIssues,timeInProject,resolutionTime,type_Bug,...,securityRemediationEffort,debt,closeTime,type_BUG,type_CODE_SMELL,type_VULNERABILITY,severity_BLOCKER,severity_CRITICAL,severity_INFO,severity_MAJOR
0,-l,27.0,0.0,0.0,0.0,0.0,2.0,4235880.0,0.0,0.0,...,0.0,0.833333,22617.766389,0.0,2.0,0.0,0.0,0.0,0.0,2.0
1,1028332163,14.0,0.0,0.0,0.0,0.0,0.0,77939.0,0.0,0.0,...,0.0,,,,,,,,,
2,A. J. David Bosschaert,432.0,51.0,0.0,0.0,1.0,0.0,173937105.0,0.0,0.0,...,0.0,,,,,,,,,
3,A195882,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,1.933333,809.977238,0.0,18.0,0.0,0.0,0.0,7.0,8.0
4,A744013,5.0,0.0,4.0,0.0,0.0,3.0,351970.0,0.0,0.0,...,0.0,17.133333,821.388563,14.0,78.0,3.0,0.0,25.0,0.0,24.0


Again, the missing values can be replaced with zeros:

In [172]:
dataFrame = dataFrame.fillna(0.0)
dataFrame.head()

Unnamed: 0,committer,numberCommits,fixedSZZIssues,fixedSonarIssues,fixedJiraIssues,inducedSZZIssues,inducedSonarIssues,timeInProject,resolutionTime,type_Bug,...,securityRemediationEffort,debt,closeTime,type_BUG,type_CODE_SMELL,type_VULNERABILITY,severity_BLOCKER,severity_CRITICAL,severity_INFO,severity_MAJOR
0,-l,27.0,0.0,0.0,0.0,0.0,2.0,4235880.0,0.0,0.0,...,0.0,0.833333,22617.766389,0.0,2.0,0.0,0.0,0.0,0.0,2.0
1,1028332163,14.0,0.0,0.0,0.0,0.0,0.0,77939.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,A. J. David Bosschaert,432.0,51.0,0.0,0.0,1.0,0.0,173937105.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,A195882,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,1.933333,809.977238,0.0,18.0,0.0,0.0,0.0,7.0,8.0
4,A744013,5.0,0.0,4.0,0.0,0.0,3.0,351970.0,0.0,0.0,...,0.0,17.133333,821.388563,14.0,78.0,3.0,0.0,25.0,0.0,24.0


---

## **Save the final Data Frame:**

This is the final list of attributes included in the Data Frame:

In [173]:
list(dataFrame)

['committer',
 'numberCommits',
 'fixedSZZIssues',
 'fixedSonarIssues',
 'fixedJiraIssues',
 'inducedSZZIssues',
 'inducedSonarIssues',
 'timeInProject',
 'resolutionTime',
 'type_Bug',
 'type_Dependency upgrade',
 'type_Documentation',
 'type_Epic',
 'type_Improvement',
 'type_New Feature',
 'type_Question',
 'type_Story',
 'type_Sub-task',
 'type_Task',
 'type_Technical task',
 'type_Test',
 'type_Wish',
 'priority_Blocker',
 'priority_Critical',
 'priority_Major',
 'priority_Minor',
 'priority_Trivial',
 'ADD',
 'DELETE',
 'MODIFY',
 'RENAME',
 'UNKNOWN',
 'linesAdded',
 'linesRemoved',
 'bug',
 'refactoringType_Change Package',
 'refactoringType_Extract And Move Method',
 'refactoringType_Extract Class',
 'refactoringType_Extract Interface',
 'refactoringType_Extract Method',
 'refactoringType_Extract Subclass',
 'refactoringType_Extract Superclass',
 'refactoringType_Extract Variable',
 'refactoringType_Inline Method',
 'refactoringType_Inline Variable',
 'refactoringType_Move And

We now save this Data Frame to a CSV file:

In [174]:
dataFrame.to_csv('../../data/interim/DataPreparation/DATA_FRAME.csv', header=True)
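
By default `to_csv` also writes the row index, which comes back as an `Unnamed: 0` column when the file is read again. If that extra column is not wanted downstream, `index=False` is one option. A minimal sketch using in-memory buffers instead of real files (the toy table is illustrative):

```python
import io
import pandas as pd

toy = pd.DataFrame({"committer": ["Alice", "Bob"], "numberCommits": [5.0, 2.0]})

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
cols_default = pd.read_csv(io.StringIO(with_index.getvalue())).columns.tolist()

# index=False: only the real columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
cols_clean = pd.read_csv(io.StringIO(without_index.getvalue())).columns.tolist()

print(cols_default, cols_clean)
```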