<a href="https://colab.research.google.com/github/ggp6101/DSCI644_Group1/blob/main/DSCI644_Notebook01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Step 1: Initial exploration and preparation of the Dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
from google.colab import drive
drive.mount('/content/drive')
# Change the line below for your path to the .csv file.
data = pd.read_csv('drive/Shared drives/DSCI644_Group1/Code_review.csv')
data.shape

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


(3838, 11)

In [None]:
data.head()

Unnamed: 0,id,url,subject,description,Category,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10
0,openstack%2Frally~master~I9da0124d5a644fccb6e6...,https://review.opendev.org/240219,Refactoring log utils,Refactoring log utils * Moved log functions f...,testing,,,,,,
1,openstack%2Frally~master~I9da0124d5a644fccb6e6...,https://review.opendev.org/240219,Refactoring log utils,Refactoring log utils * Moved log functions f...,objective,,,,,,
2,zuul%2Fzuul~master~Icbe206db6bcbaaf78a3d89997f...,https://review.opendev.org/223063,(WIP) Refactor for better connection testing,(WIP) Refactor for better connection testing ...,testing,,,,,,
3,zuul%2Fzuul~master~Icbe206db6bcbaaf78a3d89997f...,https://review.opendev.org/223063,(WIP) Refactor for better connection testing,(WIP) Refactor for better connection testing ...,objective,,,,,,
4,openstack%2Fhorizon~master~I5d2272a0abb521ddb9...,https://review.opendev.org/142839,Refactor project instance test,Refactor project instance test Refactoring th...,testing,,,,,,


We only need the subject, description, and category columns, because the goal of the project is to predict the category from the subject and description.

In [None]:
# Select subject, description, and category columns. Drop any rows with NaNs.
data = data[["subject","description","Category"]].copy()
data.rename(columns = {'Category':'category'}, inplace = True)
print(data.shape)
data = data.dropna()
print(data.shape)

(3838, 3)
(3836, 3)


In [None]:
data.category.value_counts()

objective      1549
quality         937
testing         745
integration     440
refactoring     165
Name: category, dtype: int64

There are five categories, and they are not balanced, so this is a multiclass classification problem with imbalanced data. These counts match the counts given in our Excel file.
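To put a number on the imbalance, the counts printed above can be turned into proportions. This is a minimal sketch that re-enters the reported counts by hand; the names `counts`, `proportions`, and `imbalance_ratio` are illustrative and not part of the notebook. On the real frame, `data.category.value_counts(normalize=True)` should give the same proportions.

```python
import pandas as pd

# Category counts copied from the value_counts() output above.
counts = pd.Series(
    {"objective": 1549, "quality": 937, "testing": 745,
     "integration": 440, "refactoring": 165},
    name="category",
)

# Share of rows per category, and the ratio of the largest to the smallest class.
proportions = counts / counts.sum()
imbalance_ratio = counts.max() / counts.min()

print(proportions.round(3))
print(f"largest/smallest class ratio: {imbalance_ratio:.1f}")
```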

### How many UNIQUE subject/description pairs are there?

In [None]:
subjects_descriptions = np.array(data[["subject","description"]]).astype('str')
uniquearr=np.unique(subjects_descriptions, axis=0)
uniquearr.shape

(1706, 2)

The data frame has 3836 rows after dropping NaNs, but only 1706 unique subject/description pairs. The same pair can therefore be assigned multiple labels, for example both 'testing' and 'objective', which makes this a multi-label classification problem as well. This article may be a helpful guide: [Article on multi-label classification](https://towardsdatascience.com/journey-to-the-center-of-multi-label-classification-384c40229bff)
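The same check can be done without leaving pandas. The sketch below uses a small made-up frame (`toy`) as a stand-in for `data`; the values are invented for illustration. It counts the unique pairs with `drop_duplicates` and then counts how many pairs carry more than one distinct label.

```python
import pandas as pd

# Toy stand-in for the notebook's `data` frame (values are made up).
toy = pd.DataFrame({
    "subject": ["A", "A", "B", "C", "C", "C"],
    "description": ["A one", "A one", "B two", "C three", "C three", "C three"],
    "category": ["testing", "objective", "quality",
                 "objective", "objective", "quality"],
})

# Unique (subject, description) pairs.
unique_pairs = toy[["subject", "description"]].drop_duplicates()

# Number of distinct labels per pair; more than one means the pair is multi-label.
labels_per_pair = toy.groupby(["subject", "description"])["category"].nunique()
multi_label_pairs = int((labels_per_pair > 1).sum())

print(len(unique_pairs), multi_label_pairs)
```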

### Now make a dataframe where we add a column of lists containing each of the categories that a description has been assigned to.

In [None]:
pd.options.mode.chained_assignment = None  # default='warn'

uniquedf = pd.DataFrame(uniquearr, columns = ['subject','description'])
uniquedf['category_list'] = np.empty((len(uniquedf), 0)).tolist()

for i in range(len(uniquedf)):
  mask1 = (data.subject == uniquedf.subject[i])
  mask2 = (data.description == uniquedf.description[i])
  stage = data[mask1 & mask2]
  uniquedf.category_list[i].extend(stage.category)
uniquedf.tail()

Unnamed: 0,subject,description,category_list
1701,xenapi: refactor spawn to prep for more code s...,xenapi: refactor spawn so more code can be sha...,"[objective, quality, objective]"
1702,xenapi: refactor volumeops attach,xenapi: refactor volumeops attach Refactor th...,"[quality, objective]"
1703,xenapi: refactor: move RawTGZImage to common,xenapi: refactor: move RawTGZImage to common ...,"[quality, objective, objective]"
1704,xenapi: refactor: move UpdateGlanceImage to co...,xenapi: refactor: move UpdateGlanceImage to co...,"[refactoring, objective, objective]"
1705,zmq: Refactor test case shared code,zmq: Refactor test case shared code A number ...,"[testing, quality, objective]"
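The row-by-row loop above could probably be replaced by a single `groupby`. This is a sketch on a made-up frame (`toy`), not the notebook's actual data. It assumes that grouping on the raw columns yields the same pairs as `np.unique` on the string-converted array, which should hold once NaNs are dropped; the row order of the result may differ.

```python
import pandas as pd

# Toy stand-in for the notebook's `data` frame (values are made up).
toy = pd.DataFrame({
    "subject": ["A", "A", "B", "C", "C", "C"],
    "description": ["A one", "A one", "B two", "C three", "C three", "C three"],
    "category": ["testing", "objective", "quality",
                 "objective", "objective", "quality"],
})

# Collect every category assigned to each (subject, description) pair into a list.
category_lists = (
    toy.groupby(["subject", "description"])["category"]
       .agg(list)
       .reset_index(name="category_list")
)

print(category_lists)
```

Within each group the labels keep the order they had in the original frame, so repeated labels such as two 'objective' entries are preserved.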


### Now add a column for each category. This column will count how many times a given category was assigned to that description.

In [None]:
# Count how often each category appears in each row's category_list.
# Assigning whole columns avoids element-wise chained assignment.
categories = ['objective', 'quality', 'testing', 'integration', 'refactoring']
for cat in categories:
  uniquedf[cat] = uniquedf.category_list.apply(lambda labels: labels.count(cat))
uniquedf = uniquedf.astype({"subject":'string', "description":'string'})
uniquedf = uniquedf.astype({cat: 'int' for cat in categories})
uniquedf.tail()

Unnamed: 0,subject,description,category_list,objective,quality,testing,integration,refactoring
1701,xenapi: refactor spawn to prep for more code s...,xenapi: refactor spawn so more code can be sha...,"[objective, quality, objective]",2,1,0,0,0
1702,xenapi: refactor volumeops attach,xenapi: refactor volumeops attach Refactor th...,"[quality, objective]",1,1,0,0,0
1703,xenapi: refactor: move RawTGZImage to common,xenapi: refactor: move RawTGZImage to common ...,"[quality, objective, objective]",2,1,0,0,0
1704,xenapi: refactor: move UpdateGlanceImage to co...,xenapi: refactor: move UpdateGlanceImage to co...,"[refactoring, objective, objective]",2,0,0,0,1
1705,zmq: Refactor test case shared code,zmq: Refactor test case shared code A number ...,"[testing, quality, objective]",1,1,1,0,0
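As a cross-check on the per-category counts, `pd.crosstab` can build the same kind of count table directly from the long-format rows, without the intermediate list column. This is a sketch on a made-up frame (`toy`); the `reindex` step is there because a category that never occurs in the data would otherwise be missing from the columns.

```python
import pandas as pd

categories = ["objective", "quality", "testing", "integration", "refactoring"]

# Toy stand-in for the notebook's `data` frame (values are made up).
toy = pd.DataFrame({
    "subject": ["A", "A", "B", "C", "C", "C"],
    "description": ["A one", "A one", "B two", "C three", "C three", "C three"],
    "category": ["testing", "objective", "quality",
                 "objective", "objective", "quality"],
})

# Rows: one per (subject, description) pair. Columns: one per category.
counts = pd.crosstab([toy["subject"], toy["description"]], toy["category"])

# Make sure all five categories appear as columns, in a fixed order.
counts = counts.reindex(columns=categories, fill_value=0)

print(counts)
```

The grand total of the table equals the number of input rows, which is the same sanity check the notebook applies to the column sums.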


Check that the column sums still match the counts from the original data frame and from the Excel file.

In [None]:
uniquedf[['objective','quality','testing','integration','refactoring']].sum(axis=0)

objective      1549
quality         937
testing         745
integration     440
refactoring     165
dtype: int64

Note from the data frame above that some descriptions have the same label applied more than once. This is probably because, in the original paper, each of these themes had sub-themes that could be assigned, so a description might be assigned to more than one sub-theme within the 'objective' theme, say. Since we only care whether a description belongs to a theme, we can set counts greater than one to one.

In [None]:
uniquedf.loc[uniquedf.objective > 1, 'objective'] = 1
uniquedf.loc[uniquedf.quality > 1, 'quality'] = 1
uniquedf.loc[uniquedf.testing > 1, 'testing'] = 1
uniquedf.loc[uniquedf.integration > 1, 'integration'] = 1
uniquedf.loc[uniquedf.refactoring > 1, 'refactoring'] = 1
uniquedf.tail()

Unnamed: 0,subject,description,category_list,objective,quality,testing,integration,refactoring
1701,xenapi: refactor spawn to prep for more code s...,xenapi: refactor spawn so more code can be sha...,"[objective, quality, objective]",1,1,0,0,0
1702,xenapi: refactor volumeops attach,xenapi: refactor volumeops attach Refactor th...,"[quality, objective]",1,1,0,0,0
1703,xenapi: refactor: move RawTGZImage to common,xenapi: refactor: move RawTGZImage to common ...,"[quality, objective, objective]",1,1,0,0,0
1704,xenapi: refactor: move UpdateGlanceImage to co...,xenapi: refactor: move UpdateGlanceImage to co...,"[refactoring, objective, objective]",1,0,0,0,1
1705,zmq: Refactor test case shared code,zmq: Refactor test case shared code A number ...,"[testing, quality, objective]",1,1,1,0,0
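The five capping statements above can also be written as one call to `DataFrame.clip`. This is a sketch on a small made-up count table (`toy_counts`), with `label_cols` as an illustrative name for the list of label columns.

```python
import pandas as pd

label_cols = ["objective", "quality", "testing", "integration", "refactoring"]

# Toy count table (values are made up); counts above 1 mark repeated labels.
toy_counts = pd.DataFrame({
    "objective":   [2, 1, 0],
    "quality":     [1, 0, 3],
    "testing":     [0, 0, 1],
    "integration": [0, 1, 0],
    "refactoring": [0, 0, 0],
})

# Cap every count at 1 so each column becomes a 0/1 membership indicator.
binary = toy_counts.copy()
binary[label_cols] = binary[label_cols].clip(upper=1)

print(binary)
```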


In [None]:
#See what the counts are now that duplicates have been removed, and also check what the class balance is.
print(uniquedf[['objective','quality','testing','integration','refactoring']].sum(axis=0))
print(uniquedf[['objective','quality','testing','integration','refactoring']].sum(axis=0)/len(uniquedf))

objective      1327
quality         917
testing         740
integration     433
refactoring     162
dtype: int64
objective      0.777843
quality        0.537515
testing        0.433763
integration    0.253810
refactoring    0.094959
dtype: float64


### Now we can delete the category_list column. We can also delete the subject column for now because it is repeated as the first part of the description entry.

In [None]:
cleaned_df = uniquedf.drop(['subject', 'category_list'], axis=1)
cleaned_df.head()

Unnamed: 0,description,objective,quality,testing,integration,refactoring
0,(WIP) Refactor for better connection testing ...,1,0,1,0,0
1,(refactor) Refactor Ansible for standard-conta...,1,1,0,0,0
2,- switch to testtools - remove pep8 warnings -...,1,0,1,1,0
3,A minor refactor in wsgi.py A minor refactor ...,1,0,0,0,1
4,a minor refactor in wsgi.py a minor refactor ...,1,0,0,0,1
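Before relying on the subject being repeated at the start of the description, it may be worth measuring how often that actually holds. The sketch below uses a made-up frame (`toy`), and `share_with_prefix` is an illustrative name. On the real data the check would be run on `uniquedf` before the subject column is dropped.

```python
import pandas as pd

# Toy stand-in for `uniquedf` (values are made up).
toy = pd.DataFrame({
    "subject": ["Refactor log utils", "Fix typo"],
    "description": ["Refactor log utils  Moved log functions", "Corrected a spelling error"],
})

# For each row, test whether the description begins with the subject text.
starts_with_subject = [
    desc.startswith(subj)
    for subj, desc in zip(toy["subject"], toy["description"])
]

# Fraction of rows where dropping `subject` loses no text.
share_with_prefix = sum(starts_with_subject) / len(toy)
print(share_with_prefix)
```

If the share is below 1 on the real data, the rows that fail the check would lose their subject text when the column is dropped.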


### This should now be a cleaned and prepared data frame, ready for Phase 2.