# **JIRA_ISSUES**

This notebook covers the cleaning of the attributes of the table `JIRA_ISSUES`.

First, we import the libraries we need and then read the corresponding CSV file.

In [12]:
import pandas as pd
import numpy as np

In [13]:
jiraIssues = pd.read_csv("../../../data/interim/DataPreparation/SelectData/JIRA_ISSUES_select.csv").iloc[:,1:]
print(jiraIssues.shape)
jiraIssues.head()

(67427, 8)


Unnamed: 0,projectID,key,creationDate,resolutionDate,type,priority,assignee,reporter
0,commons-exec,EXEC-108,2018-09-18T11:15:58.000+0000,,Bug,Major,,natanieljr
1,commons-exec,EXEC-107,2018-07-04T12:09:47.000+0000,,New Feature,Major,,stefanreich
2,commons-exec,EXEC-106,2018-03-06T11:32:51.000+0000,,Improvement,Major,,sebb
3,commons-exec,EXEC-105,2018-02-16T13:47:10.000+0000,,Wish,Trivial,,IP
4,commons-exec,EXEC-104,2017-08-04T11:57:39.000+0000,,Bug,Major,,krichter


We define a function that returns, given two lists, their intersection.

In [14]:
def intersection(l1, l2):
  temp = set(l2)
  l3 = [value for value in l1 if value in temp]
  return l3
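
As a quick check of this helper, the sketch below runs it on two small made-up lists. Because only `l2` is converted to a set, the result keeps the order of `l1` and any duplicates it contains. The names `sample_a` and `sample_b` and their values are illustrative and do not come from the data.

```python
def intersection(l1, l2):
    # Build a set from l2 so each membership test is O(1) on average.
    temp = set(l2)
    # Keep the elements of l1 (in order, duplicates included) that also occur in l2.
    return [value for value in l1 if value in temp]


# Hypothetical sample lists, for illustration only.
sample_a = [3, 1, 4, 1, 5]
sample_b = [1, 5, 9]
common = intersection(sample_a, sample_b)
print(common)
```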

Next, for each attribute, we treat the missing values.
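
Before going attribute by attribute, a one-line overview of the missing values per column can be useful. The sketch below shows the idea on a small made-up frame called `toy` (a hypothetical stand-in for `jiraIssues`); the same `isna().sum()` call should apply to the real table.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for jiraIssues; the values are made up.
toy = pd.DataFrame({
    "projectID": ["p1", "p1", "p2"],
    "resolutionDate": ["2016-01-11T18:01:01.000+0000", np.nan, np.nan],
    "assignee": ["alice", np.nan, "bob"],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```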

#### projectID

In [15]:
len(jiraIssues.projectID.unique())

33

In [16]:
projectID_nan = list(np.where(jiraIssues.projectID.isna()))[0]
len(projectID_nan)

0

#### key

In [17]:
len(jiraIssues.key.unique())

67427

In [18]:
key_nan = list(np.where(jiraIssues.key.isna()))[0]
len(key_nan)

0

#### creationDate

In [19]:
creationDate_nan = list(np.where(jiraIssues.creationDate.isna()))[0]
len(creationDate_nan)

0

#### resolutionDate

In [20]:
resolutionDate_nan = list(np.where(jiraIssues.resolutionDate.isna()))[0]
len(resolutionDate_nan)

10705

---

These missing values indicate that the issue has not been resolved yet, so we can replace each `NaN` with the timestamp of the last commit of the corresponding project. To find that timestamp, we have to relate this table to `GIT_COMMITS`.

In [21]:
jiraIssues_notresolved = jiraIssues.iloc[resolutionDate_nan,:]
jiraIssues_notresolved

Unnamed: 0,projectID,key,creationDate,resolutionDate,type,priority,assignee,reporter
0,commons-exec,EXEC-108,2018-09-18T11:15:58.000+0000,,Bug,Major,,natanieljr
1,commons-exec,EXEC-107,2018-07-04T12:09:47.000+0000,,New Feature,Major,,stefanreich
2,commons-exec,EXEC-106,2018-03-06T11:32:51.000+0000,,Improvement,Major,,sebb
3,commons-exec,EXEC-105,2018-02-16T13:47:10.000+0000,,Wish,Trivial,,IP
4,commons-exec,EXEC-104,2017-08-04T11:57:39.000+0000,,Bug,Major,,krichter
...,...,...,...,...,...,...,...,...
67403,zookeeper,ZOOKEEPER-24,2008-06-10T21:30:20.000+0000,,New Feature,Major,breed,phunt
67405,zookeeper,ZOOKEEPER-22,2008-06-10T21:27:49.000+0000,,New Feature,Major,mahadev,phunt
67413,zookeeper,ZOOKEEPER-14,2008-06-10T21:14:24.000+0000,,Bug,Major,breed,phunt
67415,zookeeper,ZOOKEEPER-12,2008-06-10T21:11:36.000+0000,,Bug,Major,,phunt


In [22]:
gitCommits = pd.read_csv("../../../data/interim/DataPreparation/SelectData/GIT_COMMITS_select.csv").iloc[:,[1,-1]]
gitCommits.head()

Unnamed: 0,projectID,committerDate
0,accumulo,2011-10-04T00:46:07Z
1,accumulo,2011-10-04T16:57:13Z
2,accumulo,2011-10-04T18:39:18Z
3,accumulo,2011-10-04T19:31:01Z
4,accumulo,2011-10-05T17:19:06Z


In [23]:
lastTimestamp = gitCommits.groupby(['projectID']).max()
lastTimestamp.head()

projectID,committerDate
accumulo,2019-07-18T15:21:42Z
ambari,2019-07-17T12:12:16Z
atlas,2019-07-19T11:18:34Z
aurora,2019-06-24T22:51:26Z
batik,2019-07-05T10:10:47Z


In [24]:
jiraIssues_notresolved = pd.merge(jiraIssues_notresolved, lastTimestamp, how='left', on='projectID')
jiraIssues_notresolved.head()

Unnamed: 0,projectID,key,creationDate,resolutionDate,type,priority,assignee,reporter,committerDate
0,commons-exec,EXEC-108,2018-09-18T11:15:58.000+0000,,Bug,Major,,natanieljr,2019-07-07T10:32:12Z
1,commons-exec,EXEC-107,2018-07-04T12:09:47.000+0000,,New Feature,Major,,stefanreich,2019-07-07T10:32:12Z
2,commons-exec,EXEC-106,2018-03-06T11:32:51.000+0000,,Improvement,Major,,sebb,2019-07-07T10:32:12Z
3,commons-exec,EXEC-105,2018-02-16T13:47:10.000+0000,,Wish,Trivial,,IP,2019-07-07T10:32:12Z
4,commons-exec,EXEC-104,2017-08-04T11:57:39.000+0000,,Bug,Major,,krichter,2019-07-07T10:32:12Z


In [25]:
# Drop the empty resolutionDate column (position 3) and use committerDate in its place.
jiraIssues_notresolved = jiraIssues_notresolved.iloc[:,[0,1,2,4,5,6,7,8]].rename(columns={'committerDate': 'resolutionDate'})
jiraIssues_notresolved

Unnamed: 0,projectID,key,creationDate,type,priority,assignee,reporter,resolutionDate
0,commons-exec,EXEC-108,2018-09-18T11:15:58.000+0000,Bug,Major,,natanieljr,2019-07-07T10:32:12Z
1,commons-exec,EXEC-107,2018-07-04T12:09:47.000+0000,New Feature,Major,,stefanreich,2019-07-07T10:32:12Z
2,commons-exec,EXEC-106,2018-03-06T11:32:51.000+0000,Improvement,Major,,sebb,2019-07-07T10:32:12Z
3,commons-exec,EXEC-105,2018-02-16T13:47:10.000+0000,Wish,Trivial,,IP,2019-07-07T10:32:12Z
4,commons-exec,EXEC-104,2017-08-04T11:57:39.000+0000,Bug,Major,,krichter,2019-07-07T10:32:12Z
...,...,...,...,...,...,...,...,...
10700,zookeeper,ZOOKEEPER-24,2008-06-10T21:30:20.000+0000,New Feature,Major,breed,phunt,2019-07-19T13:08:30Z
10701,zookeeper,ZOOKEEPER-22,2008-06-10T21:27:49.000+0000,New Feature,Major,mahadev,phunt,2019-07-19T13:08:30Z
10702,zookeeper,ZOOKEEPER-14,2008-06-10T21:14:24.000+0000,Bug,Major,breed,phunt,2019-07-19T13:08:30Z
10703,zookeeper,ZOOKEEPER-12,2008-06-10T21:11:36.000+0000,Bug,Major,,phunt,2019-07-19T13:08:30Z


Then, we concatenate the Jira issues that have been resolved with the ones that had a missing value in this attribute.

In [26]:
jiraIssues_resolved = jiraIssues.drop(resolutionDate_nan)
print(jiraIssues_resolved.shape)
jiraIssues_resolved.head()

(56722, 8)


Unnamed: 0,projectID,key,creationDate,resolutionDate,type,priority,assignee,reporter
8,commons-exec,EXEC-100,2016-01-11T16:45:23.000+0000,2016-01-11T18:01:01.000+0000,Task,Minor,sgoeschl,sgoeschl
10,commons-exec,EXEC-98,2016-01-08T21:40:01.000+0000,2016-01-09T00:59:19.000+0000,Bug,Major,sgoeschl,sgoeschl
11,commons-exec,EXEC-97,2015-11-17T14:50:47.000+0000,2015-11-17T14:53:13.000+0000,Bug,Major,,TimBarham
15,commons-exec,EXEC-93,2015-03-18T13:28:02.000+0000,2015-09-29T18:14:28.000+0000,Bug,Major,,sadovnikov
16,commons-exec,EXEC-92,2015-03-17T00:55:04.000+0000,2016-01-08T21:28:07.000+0000,Bug,Major,sgoeschl,belugabehr


In [27]:
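# The merge gave jiraIssues_notresolved a fresh 0..n-1 index, so sort_index()
# interleaves both subsets rather than restoring the original row order.
jiraIssues = pd.concat([jiraIssues_resolved, jiraIssues_notresolved], sort=False).sort_index().reset_index(drop=True)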
jiraIssues

Unnamed: 0,projectID,key,creationDate,resolutionDate,type,priority,assignee,reporter
0,commons-exec,EXEC-108,2018-09-18T11:15:58.000+0000,2019-07-07T10:32:12Z,Bug,Major,,natanieljr
1,commons-exec,EXEC-107,2018-07-04T12:09:47.000+0000,2019-07-07T10:32:12Z,New Feature,Major,,stefanreich
2,commons-exec,EXEC-106,2018-03-06T11:32:51.000+0000,2019-07-07T10:32:12Z,Improvement,Major,,sebb
3,commons-exec,EXEC-105,2018-02-16T13:47:10.000+0000,2019-07-07T10:32:12Z,Wish,Trivial,,IP
4,commons-exec,EXEC-104,2017-08-04T11:57:39.000+0000,2019-07-07T10:32:12Z,Bug,Major,,krichter
...,...,...,...,...,...,...,...,...
67422,zookeeper,ZOOKEEPER-5,2008-06-09T23:43:48.000+0000,2008-10-17T00:24:34.000+0000,New Feature,Major,mahadev,mahadev
67423,zookeeper,ZOOKEEPER-4,2008-06-09T16:42:38.000+0000,2008-09-09T21:09:01.000+0000,Bug,Major,fpj,breed
67424,zookeeper,ZOOKEEPER-3,2008-06-09T16:39:34.000+0000,2009-11-18T17:48:01.000+0000,Bug,Trivial,mahadev,breed
67425,zookeeper,ZOOKEEPER-2,2008-06-09T16:34:31.000+0000,2008-08-25T21:13:14.000+0000,Bug,Major,fpj,breed
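
After the concatenation, `resolutionDate` holds two timestamp layouts: the Jira one (for example `2008-10-17T00:24:34.000+0000`) and the Git one (for example `2019-07-07T10:32:12Z`). The notebook keeps them as strings. If a later step needs real datetimes, one possible approach, shown here only as a sketch and not as part of the original cleaning, is to parse each value separately into UTC. The names `mixed` and `parsed` are illustrative.

```python
import pandas as pd

# Two values copied from the outputs of this notebook, one in each layout.
mixed = pd.Series(["2019-07-07T10:32:12Z", "2016-01-11T18:01:01.000+0000"])

# Parse each value on its own so the two layouts do not have to match.
parsed = mixed.map(lambda value: pd.to_datetime(value, utc=True))
print(parsed)
```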


In [28]:
resolutionDate_nan = list(np.where(jiraIssues.resolutionDate.isna()))[0]
len(resolutionDate_nan)

0
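
The split, merge, and concatenate steps above can also be written as a single lookup followed by `fillna`, which leaves the row order and index untouched. The following is a minimal sketch on made-up data, not the method used in this notebook; `issues`, `commits`, and `last_commit` are hypothetical stand-ins for `jiraIssues`, `gitCommits`, and `lastTimestamp`.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for jiraIssues and gitCommits; all values are made up.
issues = pd.DataFrame({
    "projectID": ["p1", "p1", "p2"],
    "key": ["P1-2", "P1-1", "P2-1"],
    "resolutionDate": [np.nan, "2016-01-11T18:01:01.000+0000", np.nan],
})
commits = pd.DataFrame({
    "projectID": ["p1", "p1", "p2"],
    "committerDate": ["2011-10-04T00:46:07Z", "2019-07-18T15:21:42Z", "2019-07-05T10:10:47Z"],
})

# Latest commit per project; ISO 8601 strings in one layout sort chronologically.
last_commit = commits.groupby("projectID")["committerDate"].max()

# Look up each issue's project, then fill only the missing resolution dates.
issues["resolutionDate"] = issues["resolutionDate"].fillna(issues["projectID"].map(last_commit))
print(issues)
```

Rows that already have a resolution date are left as they are, and a project without any commit would simply keep its `NaN`.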

#### type

In [29]:
len(jiraIssues.type.unique())

13

In [30]:
type_nan = list(np.where(jiraIssues.type.isna()))[0]
len(type_nan)

0

#### priority

In [31]:
len(jiraIssues.priority.unique())

6

In [32]:
priority_nan = list(np.where(jiraIssues.priority.isna()))[0]
len(priority_nan)

1082

We remove these rows because we cannot obtain this information from anywhere else. After removing them, 66,345 rows remain (67,427 − 1,082).

In [33]:
jiraIssues = jiraIssues.drop(priority_nan)
jiraIssues.shape

(66345, 8)
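
Note that `np.where` returns row positions while `drop` expects index labels; the two coincide here only because the index was reset just before. A label-independent way to express the same removal is `dropna` with `subset`. The sketch below illustrates it on a made-up frame `toy`, a hypothetical stand-in for `jiraIssues`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for jiraIssues; the values are made up.
toy = pd.DataFrame({
    "key": ["A-1", "A-2", "A-3", "A-4"],
    "priority": ["Major", np.nan, "Minor", np.nan],
})

# Drop every row whose priority is missing, whatever its index label is.
cleaned = toy.dropna(subset=["priority"])
print(cleaned.shape)
```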

#### assignee

In [34]:
len(jiraIssues.assignee.unique())

1510

In [35]:
assignee_nan = list(np.where(jiraIssues.assignee.isna()))[0]
len(assignee_nan)

19693

These missing values indicate that the issue has not been assigned to anyone, so we can relabel the `NaNs` as `not-assigned`.

In [36]:
jiraIssues.assignee = jiraIssues.assignee.fillna('not-assigned')
jiraIssues.assignee

0        not-assigned
1        not-assigned
2        not-assigned
3        not-assigned
4        not-assigned
             ...     
67422         mahadev
67423             fpj
67424         mahadev
67425             fpj
67426           phunt
Name: assignee, Length: 66345, dtype: object

In [37]:
assignee_nan = list(np.where(jiraIssues.assignee.isna()))[0]
len(assignee_nan)

0

#### reporter

In [38]:
len(jiraIssues.reporter.unique())

10409

In [39]:
reporter_nan = list(np.where(jiraIssues.reporter.isna()))[0]
len(reporter_nan)

0

---

Finally, we save the cleaned table to a new CSV file.

In [40]:
jiraIssues.to_csv('../../../data/interim/DataPreparation/CleanData/JIRA_ISSUES_clean.csv', header=True)
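
By default `to_csv` also writes the index as an unnamed first column, which is why the input files in this project are read with `.iloc[:,1:]`. The sketch below, using a made-up frame and in-memory buffers instead of real files, shows the difference between the default and `index=False`; whether to switch depends on how downstream notebooks read the file.

```python
import io
import pandas as pd

# Made-up frame; in-memory buffers stand in for files on disk.
toy = pd.DataFrame({"key": ["A-1", "A-2"], "assignee": ["not-assigned", "bob"]})

# Default: the index is written as an extra, unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index, header=True)
with_index.seek(0)
columns_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, header=True, index=False)
without_index.seek(0)
columns_without_index = list(pd.read_csv(without_index).columns)

print(columns_with_index)
print(columns_without_index)
```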