# **SONAR_ISSUES**

This notebook covers the cleaning of the attributes of the table `SONAR_ISSUES`.

First, we import the libraries we need and then read the corresponding CSV file.

In [2]:
import pandas as pd
import numpy as np

In [3]:
sonarIssues = pd.read_csv("../../../data/interim/DataPreparation/SelectData/SONAR_ISSUES_select.csv").iloc[:,1:]
print(sonarIssues.shape)
sonarIssues.head()

(1941508, 9)


Unnamed: 0,projectID,creationDate,closeDate,creationCommitHash,closeCommitHash,type,severity,debt,author
0,commons-daemon,2003-09-04T23:28:19Z,,d3416d3a25b16da3d18b3849522fa96183918e5b,,CODE_SMELL,MAJOR,20min,yoavs@apache.org
1,commons-daemon,2003-09-04T23:28:19Z,2010-03-15T08:09:26Z,d3416d3a25b16da3d18b3849522fa96183918e5b,6cbc872eb202dfc27f2eb59b02d953c3deca32c8,CODE_SMELL,MINOR,1min,yoavs@apache.org
2,commons-daemon,2003-09-04T23:28:19Z,2010-03-15T08:09:26Z,d3416d3a25b16da3d18b3849522fa96183918e5b,6cbc872eb202dfc27f2eb59b02d953c3deca32c8,CODE_SMELL,MINOR,1min,yoavs@apache.org
3,commons-daemon,2003-09-04T23:28:19Z,2010-03-15T08:09:26Z,d3416d3a25b16da3d18b3849522fa96183918e5b,6cbc872eb202dfc27f2eb59b02d953c3deca32c8,CODE_SMELL,MINOR,1min,yoavs@apache.org
4,commons-daemon,2003-09-04T23:28:19Z,2010-03-15T08:09:26Z,d3416d3a25b16da3d18b3849522fa96183918e5b,6cbc872eb202dfc27f2eb59b02d953c3deca32c8,CODE_SMELL,MINOR,1min,yoavs@apache.org


We define two functions that, given two lists, return their intersection and their symmetric difference (the values that appear in only one of the two lists), respectively.

In [4]:
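def intersection(l1, l2):
    # Values of l1 that also appear in l2 (order and duplicates of l1 are kept).
    temp = set(l2)
    l3 = [value for value in l1 if value in temp]
    return l3

def difference(li1, li2):
    # Symmetric difference: values present in exactly one of the two lists.
    return list(set(li1) - set(li2)) + list(set(li2) - set(li1))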
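As a quick illustration of what the two helpers return, here is a minimal sketch on two made-up lists of row indices (the values of `a` and `b` are hypothetical, not taken from the data). The helpers are redefined so that the snippet runs on its own.

```python
def intersection(l1, l2):
    # Values of l1 that also appear in l2, in the order of l1.
    lookup = set(l2)
    return [value for value in l1 if value in lookup]

def difference(li1, li2):
    # Symmetric difference: values present in exactly one of the two lists.
    return list(set(li1) - set(li2)) + list(set(li2) - set(li1))

a = [0, 3, 5, 8]  # toy row indices, invented for this example
b = [3, 5, 9]

common = intersection(a, b)
only_one_side = sorted(difference(a, b))  # sorted, because sets are unordered

print(common)         # [3, 5]
print(only_one_side)  # [0, 8, 9]
```

Note that `difference` is symmetric: swapping its arguments returns the same set of values, possibly in a different order.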

Next, for each attribute, we treat the missing values.

#### projectID

In [5]:
len(sonarIssues.projectID.unique())

33

In [6]:
projectIDNan = list(np.where(sonarIssues.projectID.isna()))[0]
len(projectIDNan)

0

#### creationDate

In [7]:
len(sonarIssues.creationDate.unique())

19741

In [8]:
creationDateNan = list(np.where(sonarIssues.creationDate.isna()))[0]
len(creationDateNan)

0

#### closeDate

In [9]:
print(len(sonarIssues.closeDate.unique()))
closeDateNan = list(np.where(sonarIssues.closeDate.isna()))[0]
print(len(closeDateNan))
len(closeDateNan)/sonarIssues.shape[0]

16069
124457


0.06410326406071981

#### creationCommitHash

In [10]:
len(sonarIssues.creationCommitHash.unique())

19749

In [11]:
creationCommitHashNan = list(np.where(sonarIssues.creationCommitHash.isna()))[0]
len(creationCommitHashNan)

0

#### closeCommitHash

In [12]:
len(sonarIssues.closeCommitHash.unique())

16070

In [13]:
closeCommitHashNan = list(np.where(sonarIssues.closeCommitHash.isna()))[0]
print(len(closeCommitHashNan))
len(closeCommitHashNan)/sonarIssues.shape[0]

127832


0.0658416035370444

#### type

In [14]:
print(len(sonarIssues.type.unique()))
typeNan = list(np.where(sonarIssues.type.isna()))[0]
len(typeNan)

3


0

#### severity

In [15]:
print(len(sonarIssues.severity.unique()))
severityNan = list(np.where(sonarIssues.severity.isna()))[0]
len(severityNan)

5


0

#### debt

In [16]:
print(len(sonarIssues.debt.unique()))
debtNan = list(np.where(sonarIssues.debt.isna()))[0]
len(debtNan)

376


24088

#### author

In [17]:
print(len(sonarIssues.author.unique()))
authorNan = list(np.where(sonarIssues.author.isna()))[0]
print(len(authorNan))
len(authorNan)/sonarIssues.shape[0]

567
387753


0.1997174361372706

---

So we have 4 attributes with missing values: `closeDate`, `closeCommitHash`, `debt` and `author`.
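The per-attribute checks above could probably be condensed into a single summary table. The following is a minimal sketch on a made-up frame (`toy` and its values are hypothetical); it uses `nunique(dropna=False)` so that a missing value counts as one distinct value, as `len(column.unique())` does.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the issues table; the values are invented.
toy = pd.DataFrame({
    "projectID": ["p1", "p1", "p2", "p2"],
    "debt": ["20min", np.nan, "1min", "1min"],
    "author": [np.nan, np.nan, "a@example.org", "b@example.org"],
})

summary = pd.DataFrame({
    "unique": toy.nunique(dropna=False),   # distinct values, NaN included
    "missing": toy.isna().sum(),           # number of missing values
    "missing_ratio": toy.isna().mean(),    # share of missing values
})
print(summary)
```

Columns whose `missing` count is greater than zero are the ones that need treatment.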

#### closeDate and closeCommitHash

We intersect the row indices of the missing values of each attribute.

In [18]:
print(len(closeDateNan))
print(len(closeCommitHashNan))
inter = intersection(closeDateNan, closeCommitHashNan)
len(inter)

124457
127832


124457

Then, we compute the difference between the row indices of the missing values of each attribute.

In [19]:
diff = difference(closeCommitHashNan, closeDateNan)
len(diff)

3375
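The intersection has the same size as `closeDateNan` (124457), so every row without a `closeDate` also lacks a `closeCommitHash`, while 3375 rows (127832 - 124457) lack only the hash. The same comparison can be sketched with boolean masks instead of index lists; the frame `toy` below is a made-up example, not the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the two columns; the values are invented.
toy = pd.DataFrame({
    "closeDate": ["2010-03-15T08:09:26Z", np.nan, np.nan, "2011-01-01T00:00:00Z"],
    "closeCommitHash": ["6cbc", np.nan, np.nan, np.nan],
})

date_missing = toy["closeDate"].isna()
hash_missing = toy["closeCommitHash"].isna()

both_missing = int((date_missing & hash_missing).sum())       # intersection
only_hash_missing = int((hash_missing & ~date_missing).sum()) # hash missing, date present
only_date_missing = int((date_missing & ~hash_missing).sum()) # date missing, hash present

print(both_missing, only_hash_missing, only_date_missing)  # 2 1 0
```

The masks stay aligned with the frame, so they can be reused directly to select or drop the corresponding rows.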

#### debt

In [20]:
print(len(sonarIssues.debt.unique()))
debtNan = list(np.where(sonarIssues.debt.isna())[0])
print(len(debtNan))
len(debtNan)/sonarIssues.shape[0]

376
24088


0.01240685075724643

After that, we remove the rows with missing values in the `debt` attribute. The 3375 rows that are missing only `closeCommitHash` are kept, as the resulting shape shows (1941508 - 24088 = 1917420).

In [21]:
sonarIssues = sonarIssues.drop(debtNan).reset_index()
sonarIssues.shape

(1917420, 10)
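The table goes from 9 to 10 columns here because `reset_index()` stores the old row labels in a new `index` column. A minimal sketch of the difference between `reset_index()` and `reset_index(drop=True)`, on a made-up frame (`toy` is hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"debt": ["20min", np.nan, "1min"]})

# Positions of the missing values; they double as labels only because
# the frame still has its default 0..n-1 index.
rows_to_drop = list(np.where(toy.debt.isna())[0])

kept_old_index = toy.drop(rows_to_drop).reset_index()              # adds an 'index' column
dropped_old_index = toy.drop(rows_to_drop).reset_index(drop=True)  # discards the old labels

print(list(kept_old_index.columns))     # ['index', 'debt']
print(list(dropped_old_index.columns))  # ['debt']
```

Keeping the old labels is harmless as long as the extra column is removed before saving, which this notebook does with `.iloc[:,1:]` near the end.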

Moreover, we replace the missing values of the `closeCommitHash` attribute with 'not-resolved'.

In [22]:
sonarIssues = sonarIssues.fillna({'closeCommitHash': 'not-resolved'})

Now, we try to complete the `author` and `closeDate` columns using the `creationCommitHash` of this table and the `commitHash` attribute of the `GIT_COMMITS` table.

In [23]:
gitCommits = pd.read_csv("../../../data/interim/DataPreparation/CleanData/GIT_COMMITS_clean.csv").iloc[:,1:]
gitCommits.shape
gitCommits.head()

Unnamed: 0,Unnamed: 0.1,projectID,commitHash,author,committer,committerDate
0,0,accumulo,e0880e263e4bf8662ba3848405200473a25dfc9f,Keith Turner,Keith Turner,2011-10-04T00:46:07Z
1,1,accumulo,e8774c5ec3a35e042f320540b5f7e66ebd2d9e87,Billie Rinaldi,Billie Rinaldi,2011-10-04T16:57:13Z
2,2,accumulo,2032ebbd0ed90734da39ca238bbd10dee24d0030,Keith Turner,Keith Turner,2011-10-04T18:39:18Z
3,3,accumulo,de297d4932e08625a5df146f0802041bb5aeb892,Billie Rinaldi,Billie Rinaldi,2011-10-04T19:31:01Z
4,4,accumulo,34efaae87639a83b60fdb7274de4b45051025a3a,Billie Rinaldi,Billie Rinaldi,2011-10-05T17:19:06Z


We fill the missing values of `closeDate` with the timestamp of the last commit of the corresponding project.

In [24]:
lastTimestamp = gitCommits.loc[:,['projectID', 'committerDate']].groupby(['projectID']).max()
lastTimestamp.head()

Unnamed: 0_level_0,committerDate
projectID,Unnamed: 1_level_1
accumulo,2019-07-18T15:21:42Z
ambari,2019-07-17T12:12:16Z
atlas,2019-07-19T11:18:34Z
aurora,2019-06-24T22:51:26Z
batik,2019-07-05T10:10:47Z


In [25]:
closeDateNan = list(np.where(sonarIssues.closeDate.isna()))[0]
sonarIssues_notresolved = sonarIssues.iloc[closeDateNan,:]
sonarIssues_notresolved = pd.merge(sonarIssues_notresolved, lastTimestamp, how='left', on='projectID')
sonarIssues_notresolved

Unnamed: 0,index,projectID,creationDate,closeDate,creationCommitHash,closeCommitHash,type,severity,debt,author,committerDate
0,0,commons-daemon,2003-09-04T23:28:19Z,,d3416d3a25b16da3d18b3849522fa96183918e5b,not-resolved,CODE_SMELL,MAJOR,20min,yoavs@apache.org,2019-07-07T10:31:36Z
1,13,commons-daemon,2003-09-04T23:28:19Z,,d3416d3a25b16da3d18b3849522fa96183918e5b,not-resolved,CODE_SMELL,MAJOR,30min,yoavs@apache.org,2019-07-07T10:31:36Z
2,18,commons-daemon,2003-09-04T23:28:19Z,,d3416d3a25b16da3d18b3849522fa96183918e5b,not-resolved,CODE_SMELL,MINOR,1min,yoavs@apache.org,2019-07-07T10:31:36Z
3,20,commons-daemon,2003-09-04T23:28:19Z,,d3416d3a25b16da3d18b3849522fa96183918e5b,not-resolved,CODE_SMELL,MINOR,1min,yoavs@apache.org,2019-07-07T10:31:36Z
4,21,commons-daemon,2003-09-04T23:28:19Z,,d3416d3a25b16da3d18b3849522fa96183918e5b,not-resolved,CODE_SMELL,MINOR,1min,yoavs@apache.org,2019-07-07T10:31:36Z
...,...,...,...,...,...,...,...,...,...,...,...
121854,1941503,commons-cli,2017-07-29T14:36:30Z,,3bc9b84d3ae252800eb234a7b41981a9fef8696d,not-resolved,CODE_SMELL,MINOR,10min,rubin@raaftech.com,2019-07-07T10:13:26Z
121855,1941504,commons-cli,2017-07-29T14:36:30Z,,3bc9b84d3ae252800eb234a7b41981a9fef8696d,not-resolved,CODE_SMELL,MINOR,10min,rubin@raaftech.com,2019-07-07T10:13:26Z
121856,1941505,commons-cli,2017-07-29T14:36:30Z,,3bc9b84d3ae252800eb234a7b41981a9fef8696d,not-resolved,CODE_SMELL,MINOR,10min,rubin@raaftech.com,2019-07-07T10:13:26Z
121857,1941506,commons-cli,2018-02-26T17:23:40Z,,b0024d482050a08efc36c3cabee37c0af0e57a10,not-resolved,CODE_SMELL,MAJOR,20min,deep.alexander@gmail.com,2019-07-07T10:13:26Z


In [26]:
sonarIssues_notresolved = sonarIssues_notresolved.loc[:,['projectID', 'creationDate', 'creationCommitHash', 'closeCommitHash', 'type', 'severity', 'debt', 'author', 'committerDate']].rename(columns={'committerDate': 'closeDate'})
sonarIssues_notresolved

Unnamed: 0,projectID,creationDate,creationCommitHash,closeCommitHash,type,severity,debt,author,closeDate
0,commons-daemon,2003-09-04T23:28:19Z,d3416d3a25b16da3d18b3849522fa96183918e5b,not-resolved,CODE_SMELL,MAJOR,20min,yoavs@apache.org,2019-07-07T10:31:36Z
1,commons-daemon,2003-09-04T23:28:19Z,d3416d3a25b16da3d18b3849522fa96183918e5b,not-resolved,CODE_SMELL,MAJOR,30min,yoavs@apache.org,2019-07-07T10:31:36Z
2,commons-daemon,2003-09-04T23:28:19Z,d3416d3a25b16da3d18b3849522fa96183918e5b,not-resolved,CODE_SMELL,MINOR,1min,yoavs@apache.org,2019-07-07T10:31:36Z
3,commons-daemon,2003-09-04T23:28:19Z,d3416d3a25b16da3d18b3849522fa96183918e5b,not-resolved,CODE_SMELL,MINOR,1min,yoavs@apache.org,2019-07-07T10:31:36Z
4,commons-daemon,2003-09-04T23:28:19Z,d3416d3a25b16da3d18b3849522fa96183918e5b,not-resolved,CODE_SMELL,MINOR,1min,yoavs@apache.org,2019-07-07T10:31:36Z
...,...,...,...,...,...,...,...,...,...
121854,commons-cli,2017-07-29T14:36:30Z,3bc9b84d3ae252800eb234a7b41981a9fef8696d,not-resolved,CODE_SMELL,MINOR,10min,rubin@raaftech.com,2019-07-07T10:13:26Z
121855,commons-cli,2017-07-29T14:36:30Z,3bc9b84d3ae252800eb234a7b41981a9fef8696d,not-resolved,CODE_SMELL,MINOR,10min,rubin@raaftech.com,2019-07-07T10:13:26Z
121856,commons-cli,2017-07-29T14:36:30Z,3bc9b84d3ae252800eb234a7b41981a9fef8696d,not-resolved,CODE_SMELL,MINOR,10min,rubin@raaftech.com,2019-07-07T10:13:26Z
121857,commons-cli,2018-02-26T17:23:40Z,b0024d482050a08efc36c3cabee37c0af0e57a10,not-resolved,CODE_SMELL,MAJOR,20min,deep.alexander@gmail.com,2019-07-07T10:13:26Z


Then, we concatenate the Sonar issues that have been resolved with the unresolved ones, whose `closeDate` was missing and has just been filled in.

In [27]:
sonarIssues_resolved = sonarIssues.drop(closeDateNan)
sonarIssues = pd.concat([sonarIssues_resolved, sonarIssues_notresolved], sort=False).sort_index().reset_index().iloc[:,1:]
sonarIssues

Unnamed: 0,index,projectID,creationDate,closeDate,creationCommitHash,closeCommitHash,type,severity,debt,author
0,,commons-daemon,2003-09-04T23:28:19Z,2019-07-07T10:31:36Z,d3416d3a25b16da3d18b3849522fa96183918e5b,not-resolved,CODE_SMELL,MAJOR,20min,yoavs@apache.org
1,1.0,commons-daemon,2003-09-04T23:28:19Z,2010-03-15T08:09:26Z,d3416d3a25b16da3d18b3849522fa96183918e5b,6cbc872eb202dfc27f2eb59b02d953c3deca32c8,CODE_SMELL,MINOR,1min,yoavs@apache.org
2,,commons-daemon,2003-09-04T23:28:19Z,2019-07-07T10:31:36Z,d3416d3a25b16da3d18b3849522fa96183918e5b,not-resolved,CODE_SMELL,MAJOR,30min,yoavs@apache.org
3,2.0,commons-daemon,2003-09-04T23:28:19Z,2010-03-15T08:09:26Z,d3416d3a25b16da3d18b3849522fa96183918e5b,6cbc872eb202dfc27f2eb59b02d953c3deca32c8,CODE_SMELL,MINOR,1min,yoavs@apache.org
4,,commons-daemon,2003-09-04T23:28:19Z,2019-07-07T10:31:36Z,d3416d3a25b16da3d18b3849522fa96183918e5b,not-resolved,CODE_SMELL,MINOR,1min,yoavs@apache.org
...,...,...,...,...,...,...,...,...,...,...
1917415,1941495.0,commons-cli,2017-06-23T11:04:59Z,2017-07-28T15:29:53Z,4f17a89ad04bcf718aeac43d202f8c261ce0b796,e420dd2bebd532abf36d12916358652998e20834,CODE_SMELL,MAJOR,20min,rubin@raaftech.com
1917416,1941496.0,commons-cli,2017-06-23T11:04:59Z,2017-07-28T15:29:53Z,4f17a89ad04bcf718aeac43d202f8c261ce0b796,e420dd2bebd532abf36d12916358652998e20834,CODE_SMELL,MINOR,2min,rubin@raaftech.com
1917417,1941497.0,commons-cli,2017-06-23T11:04:59Z,2017-07-28T15:29:53Z,4f17a89ad04bcf718aeac43d202f8c261ce0b796,e420dd2bebd532abf36d12916358652998e20834,CODE_SMELL,MINOR,10min,rubin@raaftech.com
1917418,1941498.0,commons-cli,2017-06-23T11:04:59Z,2017-07-28T15:29:53Z,4f17a89ad04bcf718aeac43d202f8c261ce0b796,e420dd2bebd532abf36d12916358652998e20834,CODE_SMELL,MINOR,10min,rubin@raaftech.com
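The split, merge and concatenate steps above could also be expressed as an in-place fill, which keeps the original row order. This is only a sketch on made-up frames (`issues`, `commits` and their values are hypothetical), not the method used in this notebook. It relies on the fact that timestamps written in the same ISO 8601 UTC format sort chronologically as plain strings, which is also why `max()` works on the string column above.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for SONAR_ISSUES and GIT_COMMITS; the values are invented.
issues = pd.DataFrame({
    "projectID": ["p1", "p1", "p2"],
    "closeDate": [np.nan, "2010-03-15T08:09:26Z", np.nan],
})
commits = pd.DataFrame({
    "projectID": ["p1", "p1", "p2"],
    "committerDate": [
        "2019-07-01T00:00:00Z",
        "2019-07-07T10:31:36Z",
        "2019-07-07T10:13:26Z",
    ],
})

# Last commit timestamp per project (a Series indexed by projectID).
last_ts = commits.groupby("projectID")["committerDate"].max()

# Look up each row's project and fill only the missing closeDate values.
issues["closeDate"] = issues["closeDate"].fillna(issues["projectID"].map(last_ts))
print(issues)
```

Rows that already had a `closeDate` are left untouched, and no re-sorting or re-indexing is needed afterwards.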


Next, we fill in the missing values of the `author` attribute.

In [28]:
sonarIssues.groupby(['author']).size().reset_index().rename(columns={0:'count'})

Unnamed: 0,author,count
0,a195882@wvdi1404-10074.oa2.aeth.aetna.com,18
1,a760104@aetna.com,2010
2,aaron.dossett@target.com,19
3,abaranchuk@hortonworks.com,117
4,abaranchuk@hortonworks.con,4
...,...,...
558,zhouyunqing@google.com,66
559,zmanji@apache.org,1608
560,zmanji@gmail.com,7
561,zmanji@twitter.com,47


In [29]:
df1 = gitCommits[['commitHash', 'committer']]
df2 = (sonarIssues[['creationCommitHash', 'author']]).rename(columns={'creationCommitHash': 'commitHash'})
df2.head()

Unnamed: 0,commitHash,author
0,d3416d3a25b16da3d18b3849522fa96183918e5b,yoavs@apache.org
1,d3416d3a25b16da3d18b3849522fa96183918e5b,yoavs@apache.org
2,d3416d3a25b16da3d18b3849522fa96183918e5b,yoavs@apache.org
3,d3416d3a25b16da3d18b3849522fa96183918e5b,yoavs@apache.org
4,d3416d3a25b16da3d18b3849522fa96183918e5b,yoavs@apache.org


In [30]:
merge = pd.merge(df1, df2, on='commitHash', how='inner').drop_duplicates()
merge.head()

Unnamed: 0,commitHash,committer,author
0,e0880e263e4bf8662ba3848405200473a25dfc9f,Keith Turner,kturner@apache.org
284520,228dd5b26313a9ce158712df1acabbd193c5ef98,Keith Turner,kturner@apache.org
284527,82fc84f1331ade3f17a721fa0b86cd63b73746a2,Billie Rinaldi,billie@apache.org
284549,0218f14a6f938528c44a33afc917289fc175d87d,Billie Rinaldi,billie@apache.org
284554,527a100ef5de9c51fb17b0b340f036ee4cd98590,Eric C. Newton,ecn@apache.org


In [31]:
print(len(merge.committer.unique()))
print(len(merge.author.unique()))

415
564


In [32]:
pairs = merge.groupby(['committer', 'author']).size().reset_index().rename(columns={0:'count'})
pairs

Unnamed: 0,committer,author,count
0,(no author),geirm@apache.org,1
1,-l,maxim@apache.org,2
2,A195882,a195882@wvdi1404-10074.oa2.aeth.aetna.com,1
3,A744013,hbutani@hortonworks.com,1
4,A744013,markwatd@aetna.com,3
...,...,...,...
2022,tbeerbower,ncole@hortonworks.com,3
2023,tbeerbower,rnettleton@hortonworks.com,1
2024,tbeerbower,sgunturi@hortonworks.com,1
2025,tbeerbower,swagle@hortonworks.com,2


In [33]:
index1 = list(np.where(pairs.committer.value_counts()==1))[0]
pairs.committer.value_counts()

Dan Halperin        82
Davor Bonaci        72
Kenneth Knowles     72
Thomas Groh         69
Kenn Knowles        62
                    ..
Kelly Westbrooks     1
Bella Robinson       1
Guido Casper         1
Mario Ivankovits     1
Daniel Savarese      1
Name: committer, Length: 414, dtype: int64

In [34]:
committer_1 = (pairs.committer.value_counts())[index1].index

In [35]:
index2 = list(np.where(pairs.author.value_counts()==1))[0]
pairs.author.value_counts()

mahadev@apache.org         26
bchambers@google.com       21
jhurley@hortonworks.com    20
ncole@hortonworks.com      20
jitendra@apache.org        20
                           ..
nsochele@apache.org         1
wickman@twitter.com         1
sandymac@apache.org         1
fx@apache.org               1
piotr.turski@gmail.com      1
Name: author, Length: 563, dtype: int64

In [36]:
author_1 = (pairs.author.value_counts())[index2].index

In [37]:
index_author_1 = pairs.loc[pairs['author'].isin(author_1)].index

In [38]:
index_committer_1 = pairs.loc[pairs['committer'].isin(committer_1)].index

In [39]:
inter_pairs = intersection(index_author_1, index_committer_1)
len(inter_pairs)

153

In [40]:
pairs_unique = pairs.loc[inter_pairs]
pairs_unique

Unnamed: 0,committer,author,count
2,A195882,a195882@wvdi1404-10074.oa2.aeth.aetna.com,1
6,Aaron Dossett,aaron.dossett@target.com,2
18,Adrian Crum,adrianc@apache.org,1
72,Alex Karasulu,akarasulu@apache.org,21
78,Alfred Nathaniel,anathaniel@apache.org,4
...,...,...,...
1887,arpitgupta,arpit@hortonworks.com,3
1937,billh,billh@apache.org,45
1945,dlaha,dlaha@unknown,1
1962,henrib <>,henri.biestro@l-hbiestro.koeos.lan,1


In [41]:
print(pairs_unique.committer.value_counts())
pairs_unique.author.value_counts()

Carl Hall          1
Simone Gianni      1
Trygve Laugstol    1
Dennis Fusaro      1
Guido Casper       1
                  ..
Jonathan Boulle    1
A195882            1
henrib <>          1
Jeremy Quinn       1
Leszek Gawron      1
Name: committer, Length: 153, dtype: int64


antonio@apache.org                    1
henri.biestro@l-hbiestro.koeos.lan    1
michaelm@apache.org                   1
odiachenko@hortonworks.com            1
imario@apache.org                     1
                                     ..
mchucarroll@twopensource.com          1
mvdb@apache.org                       1
tbennett@apache.org                   1
cam@apache.org                        1
enver@apache.org                      1
Name: author, Length: 153, dtype: int64
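The cells above keep only the (committer, author) pairs in which the committer appears with a single author and the author appears with a single committer. A minimal sketch of the same selection with `groupby(...).transform("nunique")`, on made-up pairs (names and addresses are hypothetical):

```python
import pandas as pd

# Toy list of distinct (committer, author) pairs; the values are invented.
pairs = pd.DataFrame({
    "committer": ["Ann", "Bob", "Bob", "Cid", "Dee"],
    "author": ["ann@x.org", "bob@x.org", "bob@y.org", "dee@x.org", "dee@x.org"],
})

# For every row: how many distinct authors its committer has, and vice versa.
authors_per_committer = pairs.groupby("committer")["author"].transform("nunique")
committers_per_author = pairs.groupby("author")["committer"].transform("nunique")

# Keep the pairs that are unambiguous in both directions.
one_to_one = pairs[(authors_per_committer == 1) & (committers_per_author == 1)]
print(one_to_one)
```

Here only the pair for `Ann` survives: `Bob` maps to two addresses, and `dee@x.org` is shared by two committers.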

After that, we take the rows that contain these authors and committers.

In [42]:
committers = list(pairs_unique.committer)
authors = list(pairs_unique.author)

In [43]:
merge[merge.author.isna()]

Unnamed: 0,commitHash,committer,author
390415,eafe0661f57dd5fb5604439354affe8c43f07500,Billie Rinaldi,
390522,f264f882d0d28f7d70d5f0c0deb162c7717025e7,Keith Turner,
394436,b76b45627a84120a5da5c5b1e79ee67514552968,Billie Rinaldi,
394452,01667e6341ff780009819b34cafd2b2ea7bb4dc0,Keith Turner,
394557,e9316a9426f6680c077c1c5521bf3e5f78e885dd,Adam Fuchs,
...,...,...,...
1866195,83c2c001091e6b5d15dc887e08e778b80328d9f4,Oleg Kalnichevski,
1909767,f26d63c53a80272c6ce1ec0e77fd6a9cceeb894f,Marc Giger,
1909867,18b0fde1f8a5c7de811bc8ec3a886890d31276b9,Colm O Heigeartaigh,
1911104,bce01afb9495317f66936bdbdadfe8ffc096b533,Colm O hEigeartaigh,


In [44]:
merge2 = pd.merge(merge, pairs_unique, on='committer', how='inner')
merge2 = merge2[['commitHash', 'committer', 'author_y']].rename(columns={'author_y': 'author', 'commitHash': 'creationCommitHash'})
merge2 = merge2.drop_duplicates()
merge2.head()

Unnamed: 0,creationCommitHash,committer,author
0,1725ec3a6f8a7f56fb167e7b24686b1848520aa9,Bill Slacum,ujustgotbilld@apache.org
1,89e282fb559225ff144328bf2295f94185c048e9,Owen O'Malley,omalley@apache.org
2,c1987ed5f9593624d8d224623793e58d355718ad,Owen O'Malley,omalley@apache.org
3,d1435b837ba4a465b5925510e8b643c94915b571,Owen O'Malley,omalley@apache.org
4,ac57063579204a0fecd1c08e45b80b857a8e5cd0,Owen O'Malley,omalley@apache.org


In [45]:
print(merge2.shape)
print(len(merge2.creationCommitHash.unique()))

(2897, 3)
2897


We now have one author for each of these commit hashes. We use this mapping to fill in the rows of the `SONAR_ISSUES` table that had no author.

In [46]:
prova2 = merge2[['creationCommitHash', 'author']]
# Build a {creationCommitHash: author} lookup dictionary.
dictionary = prova2.set_index('creationCommitHash')['author'].to_dict()

In [47]:
sonarIssues.author = sonarIssues.author.fillna(sonarIssues.creationCommitHash.map(dictionary))
sonarIssues

Unnamed: 0,index,projectID,creationDate,closeDate,creationCommitHash,closeCommitHash,type,severity,debt,author
0,,commons-daemon,2003-09-04T23:28:19Z,2019-07-07T10:31:36Z,d3416d3a25b16da3d18b3849522fa96183918e5b,not-resolved,CODE_SMELL,MAJOR,20min,yoavs@apache.org
1,1.0,commons-daemon,2003-09-04T23:28:19Z,2010-03-15T08:09:26Z,d3416d3a25b16da3d18b3849522fa96183918e5b,6cbc872eb202dfc27f2eb59b02d953c3deca32c8,CODE_SMELL,MINOR,1min,yoavs@apache.org
2,,commons-daemon,2003-09-04T23:28:19Z,2019-07-07T10:31:36Z,d3416d3a25b16da3d18b3849522fa96183918e5b,not-resolved,CODE_SMELL,MAJOR,30min,yoavs@apache.org
3,2.0,commons-daemon,2003-09-04T23:28:19Z,2010-03-15T08:09:26Z,d3416d3a25b16da3d18b3849522fa96183918e5b,6cbc872eb202dfc27f2eb59b02d953c3deca32c8,CODE_SMELL,MINOR,1min,yoavs@apache.org
4,,commons-daemon,2003-09-04T23:28:19Z,2019-07-07T10:31:36Z,d3416d3a25b16da3d18b3849522fa96183918e5b,not-resolved,CODE_SMELL,MINOR,1min,yoavs@apache.org
...,...,...,...,...,...,...,...,...,...,...
1917415,1941495.0,commons-cli,2017-06-23T11:04:59Z,2017-07-28T15:29:53Z,4f17a89ad04bcf718aeac43d202f8c261ce0b796,e420dd2bebd532abf36d12916358652998e20834,CODE_SMELL,MAJOR,20min,rubin@raaftech.com
1917416,1941496.0,commons-cli,2017-06-23T11:04:59Z,2017-07-28T15:29:53Z,4f17a89ad04bcf718aeac43d202f8c261ce0b796,e420dd2bebd532abf36d12916358652998e20834,CODE_SMELL,MINOR,2min,rubin@raaftech.com
1917417,1941497.0,commons-cli,2017-06-23T11:04:59Z,2017-07-28T15:29:53Z,4f17a89ad04bcf718aeac43d202f8c261ce0b796,e420dd2bebd532abf36d12916358652998e20834,CODE_SMELL,MINOR,10min,rubin@raaftech.com
1917418,1941498.0,commons-cli,2017-06-23T11:04:59Z,2017-07-28T15:29:53Z,4f17a89ad04bcf718aeac43d202f8c261ce0b796,e420dd2bebd532abf36d12916358652998e20834,CODE_SMELL,MINOR,10min,rubin@raaftech.com
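The fill above combines `map` and `fillna`: the hash of each row is looked up in the dictionary, and the result is used only where `author` is missing. A minimal sketch on made-up data (`issues`, `lookup` and their values are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the issues table; the values are invented.
issues = pd.DataFrame({
    "creationCommitHash": ["h1", "h2", "h3"],
    "author": ["a@example.org", np.nan, np.nan],
})

# Toy stand-in for the hash/author table built from the unique pairs.
lookup = pd.DataFrame({
    "creationCommitHash": ["h1", "h2"],
    "author": ["other@example.org", "b@example.org"],
})
hash_to_author = lookup.set_index("creationCommitHash")["author"].to_dict()

# Existing authors are kept; hashes absent from the dictionary stay missing.
issues["author"] = issues["author"].fillna(
    issues["creationCommitHash"].map(hash_to_author)
)
print(issues)
```

The first row keeps its original author even though the dictionary holds a different value for `h1`, and the third row remains missing because `h3` has no entry.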


Then, we remove the rows with missing values in the `author` attribute.

In [48]:
print(sonarIssues.shape)
sonarIssues = sonarIssues.dropna(subset=['author'])
sonarIssues.shape

(1917420, 10)


(1532446, 10)

---

In [49]:
sonarIssues = sonarIssues.iloc[:,1:]
sonarIssues

Unnamed: 0,projectID,creationDate,closeDate,creationCommitHash,closeCommitHash,type,severity,debt,author
0,commons-daemon,2003-09-04T23:28:19Z,2019-07-07T10:31:36Z,d3416d3a25b16da3d18b3849522fa96183918e5b,not-resolved,CODE_SMELL,MAJOR,20min,yoavs@apache.org
1,commons-daemon,2003-09-04T23:28:19Z,2010-03-15T08:09:26Z,d3416d3a25b16da3d18b3849522fa96183918e5b,6cbc872eb202dfc27f2eb59b02d953c3deca32c8,CODE_SMELL,MINOR,1min,yoavs@apache.org
2,commons-daemon,2003-09-04T23:28:19Z,2019-07-07T10:31:36Z,d3416d3a25b16da3d18b3849522fa96183918e5b,not-resolved,CODE_SMELL,MAJOR,30min,yoavs@apache.org
3,commons-daemon,2003-09-04T23:28:19Z,2010-03-15T08:09:26Z,d3416d3a25b16da3d18b3849522fa96183918e5b,6cbc872eb202dfc27f2eb59b02d953c3deca32c8,CODE_SMELL,MINOR,1min,yoavs@apache.org
4,commons-daemon,2003-09-04T23:28:19Z,2019-07-07T10:31:36Z,d3416d3a25b16da3d18b3849522fa96183918e5b,not-resolved,CODE_SMELL,MINOR,1min,yoavs@apache.org
...,...,...,...,...,...,...,...,...,...
1917415,commons-cli,2017-06-23T11:04:59Z,2017-07-28T15:29:53Z,4f17a89ad04bcf718aeac43d202f8c261ce0b796,e420dd2bebd532abf36d12916358652998e20834,CODE_SMELL,MAJOR,20min,rubin@raaftech.com
1917416,commons-cli,2017-06-23T11:04:59Z,2017-07-28T15:29:53Z,4f17a89ad04bcf718aeac43d202f8c261ce0b796,e420dd2bebd532abf36d12916358652998e20834,CODE_SMELL,MINOR,2min,rubin@raaftech.com
1917417,commons-cli,2017-06-23T11:04:59Z,2017-07-28T15:29:53Z,4f17a89ad04bcf718aeac43d202f8c261ce0b796,e420dd2bebd532abf36d12916358652998e20834,CODE_SMELL,MINOR,10min,rubin@raaftech.com
1917418,commons-cli,2017-06-23T11:04:59Z,2017-07-28T15:29:53Z,4f17a89ad04bcf718aeac43d202f8c261ce0b796,e420dd2bebd532abf36d12916358652998e20834,CODE_SMELL,MINOR,10min,rubin@raaftech.com


Finally, we save the cleaned table to a new CSV file.

In [50]:
sonarIssues.to_csv('../../../data/interim/DataPreparation/CleanData/SONAR_ISSUES_clean.csv', header=True)
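By default `to_csv` also writes the row index, which is why the files read in this notebook come back with an `Unnamed: 0` column that is removed with `.iloc[:,1:]`. If downstream notebooks do not rely on that column, passing `index=False` would avoid it. A minimal sketch using in-memory buffers and a made-up frame (`df` is hypothetical):

```python
import io

import pandas as pd

df = pd.DataFrame({"projectID": ["commons-cli"], "debt": ["10min"]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with = list(pd.read_csv(with_index).columns)

# With index=False only the data columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)     # ['Unnamed: 0', 'projectID', 'debt']
print(cols_without)  # ['projectID', 'debt']
```

Changing this in the real pipeline would require removing the matching `.iloc[:,1:]` calls wherever the file is read back.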