# Eclipse Defect Prediction

This notebook investigates defect prediction using the Eclipse 2.0a dataset from the PROMISE 2007 paper, Predicting Defects for Eclipse. The dataset is available on the web page [Eclipse bug data](https://www.st.cs.uni-saarland.de/softevo/bug-data/eclipse/). See also the [revised paper](https://www.st.cs.uni-saarland.de/softevo/bug-data/eclipse/promise2007-dataset-20a.pdf) which uses the version of the dataset used here.

In [1]:
# Import libraries
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, classification_report

In [2]:
# Use fixed random seed, so that notebook always produces the same results
# We will use the random_state argument to set the seed in appropriate function calls
myseed = 1

## Load Data

In this section, we load CSV files containing the software metrics and defect data for Eclipse versions 2.0, 2.1, and 3.0, add a binary column has_defects derived from the count of post-release defects so that we can use classification models, and keep only the columns used in the paper. We also check for missing data.

In [3]:
files_20 = pd.read_csv('data/eclipse-metrics-files-2.0.csv', sep=';')
files_21 = pd.read_csv('data/eclipse-metrics-files-2.1.csv', sep=';')
files_30 = pd.read_csv('data/eclipse-metrics-files-3.0.csv', sep=';')

In [4]:
# How many rows (samples) and columns (features + label) are in the data?
display(files_20.shape)
display(files_21.shape)
display(files_30.shape)

(6729, 202)

(7888, 202)

(10593, 202)

In [5]:
# Replace post column which counts post-release defects with has_defects binary column for classification
files_20['has_defects'] = files_20['post'] > 0
files_21['has_defects'] = files_21['post'] > 0
files_30['has_defects'] = files_30['post'] > 0
files_20['has_defects'] = files_20['has_defects'].astype(int)
files_21['has_defects'] = files_21['has_defects'].astype(int)
files_30['has_defects'] = files_30['has_defects'].astype(int)

In [6]:
# Discard plugin and filename labels, keep only predictors used in paper
files_20 = files_20.filter(regex='^(pre|has_defects|ACD|FOUT|MLOC|NBD|NOI|NOM|NSF|NSM|PAR|TLOC|VG)')
files_21 = files_21.filter(regex='^(pre|has_defects|ACD|FOUT|MLOC|NBD|NOI|NOM|NSF|NSM|PAR|TLOC|VG)')
files_30 = files_30.filter(regex='^(pre|has_defects|ACD|FOUT|MLOC|NBD|NOI|NOM|NSF|NSM|PAR|TLOC|VG)')
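The two preparation steps above can be tried on a small scale. The sketch below uses a toy data frame with made-up values; only the column names mimic the Eclipse CSV layout, and the shortened regular expression is an assumption for illustration.

```python
import pandas as pd

# Toy frame with hypothetical values; column names mimic the Eclipse CSV layout
toy = pd.DataFrame({
    'plugin': ['org.a', 'org.b', 'org.c'],
    'filename': ['A.java', 'B.java', 'C.java'],
    'pre': [0, 3, 1],
    'post': [0, 2, 0],
    'TLOC': [79, 398, 49],
    'VG_max': [3, 23, 2],
})

# Binary label: 1 if the file has at least one post-release defect, else 0
toy['has_defects'] = (toy['post'] > 0).astype(int)

# filter(regex=...) keeps columns whose names match the pattern; '^' anchors the
# match at the start of the name, so 'plugin', 'filename', and 'post' are dropped
kept = toy.filter(regex='^(pre|has_defects|TLOC|VG)')
print(list(kept.columns))
print(kept['has_defects'].tolist())
```

Because the pattern is a prefix match, a single entry such as VG keeps every column that starts with it (VG_avg, VG_max, VG_sum in the real data).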

In [7]:
# Examine a random sample of the data
files_20.sample(10)

Unnamed: 0,pre,ACD,FOUT_avg,FOUT_max,FOUT_sum,MLOC_avg,MLOC_max,MLOC_sum,NBD_avg,NBD_max,...,NSM_max,NSM_sum,PAR_avg,PAR_max,PAR_sum,TLOC,VG_avg,VG_max,VG_sum,has_defects
5956,0,0.0,2.307692,8.0,30.0,3.692308,13.0,48.0,1.461538,3.0,...,12.0,12.0,1.461538,3.0,19.0,79.0,1.615385,3.0,21.0,0
2740,1,0.0,0.583333,6.0,7.0,1.583333,7.0,19.0,1.0,1.0,...,0.0,0.0,0.583333,2.0,7.0,49.0,1.083333,2.0,13.0,0
2242,0,0.0,1.058824,3.0,18.0,3.764706,17.0,64.0,1.176471,3.0,...,3.0,3.0,0.882353,4.0,15.0,107.0,1.705882,5.0,29.0,0
4683,0,0.0,6.210526,24.0,118.0,17.789474,73.0,338.0,2.368421,6.0,...,0.0,0.0,0.631579,2.0,12.0,398.0,6.947368,23.0,132.0,1
3525,0,0.0,5.0,9.0,20.0,7.5,12.0,30.0,1.5,2.0,...,0.0,0.0,1.75,3.0,7.0,48.0,2.0,3.0,8.0,0
2340,18,0.0,0.004,2.0,2.0,0.016,8.0,8.0,0.004,2.0,...,500.0,500.0,2.12,13.0,1060.0,736.0,0.004,2.0,2.0,1
4710,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,21.0,0.0,0.0,0.0,0
2808,2,0.0,3.975,26.0,159.0,10.775,60.0,431.0,2.0,6.0,...,0.0,0.0,0.675,6.0,27.0,520.0,4.3,23.0,172.0,1
6228,0,2.0,9.428571,21.0,66.0,13.142857,30.0,92.0,2.428571,4.0,...,0.0,0.0,1.142857,2.0,8.0,135.0,3.857143,7.0,27.0,0
6271,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,7.0,0.0,0.0,0.0,0


In [8]:
# Check for missing data. Counts should all be zero.
display(files_20.isna().sum().sum())
display(files_21.isna().sum().sum())
display(files_30.isna().sum().sum())

0

0

0

## Exploratory Data Analysis

We first look at the distribution of each predictor, then examine the correlations between them.

In [9]:
files_20.hist(bins=30, figsize=(20,16));

In [10]:
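# We can use pandas style options to produce a heatmap of correlations. High correlations are red.
# Styler.set_precision() was removed in pandas 2.0; .format(precision=2) is the replacement (pandas 1.3 or later).
corr_20 = files_20.corr()
corr_20.style.background_gradient(cmap='coolwarm').format(precision=2)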

Unnamed: 0,pre,ACD,FOUT_avg,FOUT_max,FOUT_sum,MLOC_avg,MLOC_max,MLOC_sum,NBD_avg,NBD_max,NBD_sum,NOI,NOM_avg,NOM_max,NOM_sum,NSF_avg,NSF_max,NSF_sum,NSM_avg,NSM_max,NSM_sum,PAR_avg,PAR_max,PAR_sum,TLOC,VG_avg,VG_max,VG_sum,has_defects
pre,1.0,0.36,0.28,0.41,0.47,0.21,0.29,0.4,0.23,0.35,0.45,-0.11,0.24,0.34,0.36,0.068,0.081,0.082,0.083,0.087,0.088,0.044,0.19,0.2,0.42,0.16,0.21,0.37,0.34
ACD,0.36,1.0,0.33,0.44,0.44,0.26,0.33,0.35,0.24,0.32,0.36,-0.13,0.15,0.26,0.3,0.0067,0.025,0.026,-0.0069,-0.0023,-0.0022,-0.0045,0.14,0.1,0.37,0.12,0.13,0.26,0.12
FOUT_avg,0.28,0.33,1.0,0.75,0.51,0.84,0.58,0.4,0.69,0.62,0.3,-0.37,0.13,0.15,0.16,0.0098,0.016,0.017,0.0025,0.0051,0.0052,0.22,0.26,0.083,0.38,0.69,0.45,0.34,0.26
FOUT_max,0.41,0.44,0.75,1.0,0.7,0.68,0.83,0.62,0.5,0.61,0.52,-0.29,0.33,0.38,0.39,0.034,0.047,0.047,0.021,0.025,0.025,0.13,0.31,0.17,0.62,0.58,0.66,0.57,0.31
FOUT_sum,0.47,0.44,0.51,0.7,1.0,0.42,0.56,0.87,0.34,0.52,0.85,-0.2,0.61,0.71,0.73,0.062,0.082,0.083,0.053,0.06,0.06,0.094,0.34,0.35,0.87,0.41,0.51,0.85,0.33
MLOC_avg,0.21,0.26,0.84,0.68,0.42,1.0,0.75,0.49,0.75,0.68,0.33,-0.37,0.15,0.16,0.16,0.024,0.03,0.03,0.0016,0.0041,0.0041,0.28,0.32,0.097,0.46,0.9,0.65,0.43,0.28
MLOC_max,0.29,0.33,0.58,0.83,0.56,0.75,1.0,0.7,0.47,0.61,0.51,-0.26,0.35,0.37,0.38,0.045,0.055,0.055,0.022,0.025,0.025,0.15,0.34,0.18,0.67,0.69,0.88,0.64,0.29
MLOC_sum,0.4,0.35,0.4,0.62,0.87,0.49,0.7,1.0,0.36,0.54,0.9,-0.19,0.65,0.73,0.74,0.07,0.088,0.089,0.049,0.056,0.056,0.12,0.38,0.35,0.98,0.48,0.69,0.96,0.34
NBD_avg,0.23,0.24,0.69,0.5,0.34,0.75,0.47,0.36,1.0,0.84,0.37,-0.67,0.16,0.18,0.19,-0.031,-0.025,-0.024,-0.015,-0.011,-0.011,0.29,0.35,0.074,0.35,0.77,0.43,0.34,0.27
NBD_max,0.35,0.32,0.62,0.61,0.52,0.68,0.61,0.54,0.84,1.0,0.57,-0.54,0.35,0.38,0.4,0.0043,0.015,0.015,0.027,0.033,0.033,0.21,0.42,0.19,0.55,0.69,0.55,0.53,0.35


Eliminating one member of highly correlated pairs can improve model quality. We will look into that later after building our initial models.
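One way to find candidate pairs is to scan the upper triangle of the absolute correlation matrix. The sketch below runs on synthetic data, and the 0.9 cutoff is an arbitrary assumption rather than a value taken from the paper.

```python
import numpy as np
import pandas as pd

# Synthetic data: 'b' is nearly a scaled copy of 'a', 'c' is independent noise
rng = np.random.default_rng(1)
n = 200
a = rng.normal(size=n)
demo = pd.DataFrame({
    'a': a,
    'b': a * 2.0 + rng.normal(scale=0.01, size=n),
    'c': rng.normal(size=n),
})

threshold = 0.9  # hypothetical cutoff for "highly correlated"
corr = demo.corr().abs()

# Keep only the strict upper triangle so each pair appears once
# and the diagonal (self-correlation of 1.0) is skipped
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()
high_pairs = pairs[pairs > threshold]
print(high_pairs)
```

Applied to corr_20, the same steps would list pairs such as MLOC_sum and TLOC, whose correlation in the table above is 0.98.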

## Format data for scikit-learn

In [11]:
# Format data for sklearn: create vector of response variable (label) and matrix of features
y_20 = files_20['has_defects'].values                  # y is response vector
X_20 = files_20.drop('has_defects', axis=1).values     # X is array of predictors(features)
y_21 = files_21['has_defects'].values                  # y is response vector
X_21 = files_21.drop('has_defects', axis=1).values     # X is array of predictors(features)
y_30 = files_30['has_defects'].values                  # y is response vector
X_30 = files_30.drop('has_defects', axis=1).values     # X is array of predictors(features)

In [12]:
# Check shape of y to verify it's a vector with the same number of samples as the data frame
y_20.shape

(6729,)

In [13]:
# Check shape of X to verify it's an array with 28 columns for features and same number of rows as y
X_20.shape

(6729, 28)

## Building a logistic regression model

In [14]:
# Fit model using Eclipse 2.0 data
model_20 = LogisticRegression(random_state=myseed)
model_20.fit(X_20, y_20)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=1, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [15]:
# Predict defect status on files of each Eclipse version
y_20_pred = model_20.predict(X_20)
y_21_pred = model_20.predict(X_21)
y_30_pred = model_20.predict(X_30)

In [16]:
# Confusion matrix with counts of false positives and false negatives for 2.0 -> 2.1 prediction
cm_20 = confusion_matrix(y_21, y_21_pred)
tn, fp, fn, tp = cm_20.ravel()
display(cm_20)
display(fp)
display(fn)

array([[6870,  164],
       [ 696,  158]])

164

696
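Accuracy, precision, and recall can all be derived from the four cells of the confusion matrix. As a worked example, the sketch below plugs in the counts shown above for the 2.0 -> 2.1 prediction, using the standard definitions of each metric.

```python
# Counts copied from the confusion matrix above (train on 2.0, predict 2.1)
tn, fp, fn, tp = 6870, 164, 696, 158

# Accuracy: share of all files classified correctly
accuracy = (tp + tn) / (tp + tn + fp + fn)
# Precision: share of files predicted defective that really are defective
precision = tp / (tp + fp)
# Recall: share of truly defective files that the model found
recall = tp / (tp + fn)

print(round(accuracy, 3), round(precision, 3), round(recall, 3))
```

The high accuracy mostly reflects the large number of true negatives; recall shows that the model finds fewer than one in five defective files.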

In [17]:
# Accuracy metrics for predicting defect files in each Eclipse version after training on Eclipse 2.0
model_20_accuracy = {
    'e20': round(accuracy_score(y_20, y_20_pred), 3),
    'e21': round(accuracy_score(y_21, y_21_pred), 3),
    'e30': round(accuracy_score(y_30, y_30_pred), 3)
}
model_20_accuracy

{'e20': 0.876, 'e21': 0.891, 'e30': 0.862}

In [18]:
# Precision metrics for predicting defect files in each Eclipse version after training on Eclipse 2.0
model_20_precision = {
    'e20': round(precision_score(y_20, y_20_pred), 3),
    'e21': round(precision_score(y_21, y_21_pred), 3),
    'e30': round(precision_score(y_30, y_30_pred), 3)
}
model_20_precision

{'e20': 0.693, 'e21': 0.491, 'e30': 0.624}

In [19]:
# Recall metrics for predicting defect files in each Eclipse version after training on Eclipse 2.0
model_20_recall = {
    'e20': round(recall_score(y_20, y_20_pred), 3),
    'e21': round(recall_score(y_21, y_21_pred), 3),
    'e30': round(recall_score(y_30, y_30_pred), 3)
}  
model_20_recall

{'e20': 0.256, 'e21': 0.185, 'e30': 0.166}

Compare the performance results to the Eclipse 2.0 results in Table 4 of the paper.

## Data Scaling

In this section, we will examine the effect of scaling the data on model performance.

In [20]:
from sklearn.preprocessing import MinMaxScaler, RobustScaler

In [21]:
# Min Max Scaler scales original data set to range [0,1]
mms_20 = MinMaxScaler()
mms_20.fit(X_20)
X_20_mms = mms_20.transform(X_20)
X_21_mms = mms_20.transform(X_21)
X_30_mms = mms_20.transform(X_30)

In [22]:
display((X_20_mms.min(), X_20_mms.max()))
display((X_21_mms.min(), X_21_mms.max()))
display((X_30_mms.min(), X_30_mms.max()))

(0.0, 1.0)

(0.0, 1.7368421052631584)

(0.0, 2.333333333333333)
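The maxima above 1 for versions 2.1 and 3.0 are expected: the scaler learned each column's minimum and maximum from Eclipse 2.0 only, so any later value beyond the 2.0 range maps outside [0, 1]. A minimal sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[0.0], [5.0], [10.0]])   # hypothetical training feature
new = np.array([[15.0]])                    # value outside the training range

scaler = MinMaxScaler().fit(train)          # learns min=0 and max=10 from train only
train_scaled = scaler.transform(train)
new_scaled = scaler.transform(new)

print(train_scaled.ravel())                 # training values land in [0, 1]
print(new_scaled.ravel())                   # (15 - 0) / (10 - 0) = 1.5
```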

In [23]:
# Build logistic regression model on min-max scaled data
model_20 = LogisticRegression(random_state=myseed)
model_20.fit(X_20_mms, y_20)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=1, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [24]:
# Predict defect status on files of each Eclipse version with scaled data
y_20_pred = model_20.predict(X_20_mms)
y_21_pred = model_20.predict(X_21_mms)
y_30_pred = model_20.predict(X_30_mms)

In [25]:
# Accuracy metrics for predicting defect files in each Eclipse version after training on Eclipse 2.0
model_20_accuracy = {
    'e20': round(accuracy_score(y_20, y_20_pred), 3),
    'e21': round(accuracy_score(y_21, y_21_pred), 3),
    'e30': round(accuracy_score(y_30, y_30_pred), 3)
} 
# Precision metrics for predicting defect files in each Eclipse version after training on Eclipse 2.0
model_20_precision = {
    'e20': round(precision_score(y_20, y_20_pred), 3),
    'e21': round(precision_score(y_21, y_21_pred), 3),
    'e30': round(precision_score(y_30, y_30_pred), 3)
}
# Recall metrics for predicting defect files in each Eclipse version after training on Eclipse 2.0
model_20_recall = {
    'e20': round(recall_score(y_20, y_20_pred), 3),
    'e21': round(recall_score(y_21, y_21_pred), 3),
    'e30': round(recall_score(y_30, y_30_pred), 3)
}
print(model_20_accuracy)
print(model_20_precision)
print(model_20_recall)

{'e20': 0.873, 'e21': 0.894, 'e30': 0.863}
{'e20': 0.723, 'e21': 0.529, 'e30': 0.658}
{'e20': 0.198, 'e21': 0.169, 'e30': 0.154}


## Random Forest Classifier

Let's build a defect prediction model for Eclipse using the random forest algorithm and compare its performance with that of the logistic regression model. We will use the min-max scaled data.

In [26]:
from sklearn.ensemble import RandomForestClassifier 

In [27]:
# Build random forest model on min-max scaled data
rf_20 = RandomForestClassifier(random_state=myseed)
rf_20.fit(X_20_mms, y_20)



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=1, verbose=0, warm_start=False)

In [28]:
# Predict defect status on files of each Eclipse version with scaled data
rf_20_pred = rf_20.predict(X_20_mms)
rf_21_pred = rf_20.predict(X_21_mms)
rf_30_pred = rf_20.predict(X_30_mms)

In [29]:
# Accuracy metrics for predicting defect files in each Eclipse version after training on Eclipse 2.0
rf_20_accuracy = {
    'e20': round(accuracy_score(y_20, rf_20_pred), 3),
    'e21': round(accuracy_score(y_21, rf_21_pred), 3),
    'e30': round(accuracy_score(y_30, rf_30_pred), 3)
}
# Precision metrics for predicting defect files in each Eclipse version after training on Eclipse 2.0
rf_20_precision = {
    'e20': round(precision_score(y_20, rf_20_pred), 3),
    'e21': round(precision_score(y_21, rf_21_pred), 3),
    'e30': round(precision_score(y_30, rf_30_pred), 3)
}
# Recall metrics for predicting defect files in each Eclipse version after training on Eclipse 2.0
rf_20_recall = {
    'e20': round(recall_score(y_20, rf_20_pred), 3),
    'e21': round(recall_score(y_21, rf_21_pred), 3),
    'e30': round(recall_score(y_30, rf_30_pred), 3)
}
print(rf_20_accuracy)
print(rf_20_precision)
print(rf_20_recall)

{'e20': 0.987, 'e21': 0.872, 'e30': 0.852}
{'e20': 1.0, 'e21': 0.356, 'e30': 0.499}
{'e20': 0.908, 'e21': 0.231, 'e30': 0.21}


## Cross Validation

When measuring model performance, we are interested primarily in performance on new data. Performance on training data overestimates model performance on new data. Note the 100% precision of the random forest model above when tested on its training data (Eclipse 2.0), compared to the under 50% precision when measured on later versions of Eclipse.

It is important that test data not be used in any way to train the classifier. Any use of the test data during training leaks information from the test data, resulting in better measured performance than we would expect on new data. This is why we fit the min-max data scaling transformer above on the Eclipse 2.0 data and applied it to all versions of Eclipse.
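One common way to keep preprocessing from touching the test data is to bundle the scaler and the classifier in a scikit-learn Pipeline, so that fitting the pipeline fits both steps on the training rows only. The sketch below uses synthetic data from make_classification; the sample sizes and class weights are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic, imbalanced stand-in for the defect data
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.85, 0.15], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

# fit() runs MinMaxScaler.fit on X_train only, then trains the classifier
pipe = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)

# The scaler's learned minima come from the training rows, not the full data
scaler = pipe.named_steps['minmaxscaler']
uses_train_only = np.allclose(scaler.data_min_, X_train.min(axis=0))
test_accuracy = pipe.score(X_test, y_test)
print(uses_train_only, round(test_accuracy, 3))
```

Passing such a pipeline to cross_validate refits the scaler inside every fold, which avoids leaking fold statistics into training.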

When there is a large amount of data, we can separate the data into large training and test sets. However, when the amount of data is limited, we would like to use as much of the data as possible for training the model. We would also like to limit the impact of chance in dividing the data, so that a model isn't accidentally trained on very easy or difficult examples, causing our performance estimates to be substantially different from model performance on new data.

Cross-validation is a technique for addressing these concerns. In k-fold cross-validation, we split the data into k equal-sized partitions (folds). We train k models, each one using k-1 folds for training and the remaining fold for testing, then compute model performance as the average performance of the k models. To further reduce the impact of randomness, we can repeat cross-validation multiple times and average across all models. When classes are imbalanced (a relatively small percentage of files contain defects), we can apply stratified cross-validation, which ensures that the proportion of each class is approximately the same in each fold, i.e. if 10% of files are defective, then about 10% of the files in each fold are defective.
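The effect of stratification is easy to check directly. The sketch below builds a hypothetical label vector with 10% positives and measures the positive rate in each test fold produced by StratifiedKFold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 100 samples, 10% positive
y = np.array([1] * 10 + [0] * 90)
X = np.zeros((len(y), 1))   # feature values do not affect how folds are formed

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
fold_rates = []
for train_idx, test_idx in skf.split(X, y):
    # Fraction of positive labels in this test fold
    fold_rates.append(y[test_idx].mean())

print(fold_rates)
```

Each of the five folds receives 2 of the 10 positives and 18 of the 90 negatives, so every fold preserves the 10% rate.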

In [30]:
from sklearn.model_selection import cross_validate

In [31]:
# Create random forest model object, then perform 10-fold cross validation (stratified is default setting)
rf_model = RandomForestClassifier(random_state=myseed)
mymetrics = ('accuracy', 'precision', 'recall')
scores = cross_validate(rf_model, X_20_mms, y_20, scoring=mymetrics, cv=10, return_train_score=False)



In [32]:
print( round(np.mean(scores['test_accuracy']), 3) )
print( round(np.mean(scores['test_precision']), 3) )
print( round(np.mean(scores['test_recall']), 3) )

0.853
0.485
0.212


Note how different these results are compared to using the training data as test data.

Next, let's apply 10-fold cross validation repeated 10 times to see how stable our results are. Calling cross_validate with cv=10 performs only a single round of splits, so here we generate the repeated splits with RepeatedStratifiedKFold and write the training loop ourselves.

In [33]:
from sklearn.model_selection import RepeatedStratifiedKFold

In [34]:
n_folds = 10
n_reps = 10
rskf = RepeatedStratifiedKFold(n_splits=n_folds, n_repeats=n_reps, random_state=myseed)

In [35]:
accuracy_scores = np.zeros(n_reps*n_folds)
precision_scores = np.zeros(n_reps*n_folds)
recall_scores = np.zeros(n_reps*n_folds)

i = 0
for train_index, test_index in rskf.split(X_20, y_20):
    rf_model = RandomForestClassifier()
    rf_model.fit(X_20[train_index], y_20[train_index])
    y_pred = rf_model.predict(X_20[test_index])
    accuracy_scores[i] = accuracy_score(y_20[test_index], y_pred)
    precision_scores[i] = precision_score(y_20[test_index], y_pred)
    recall_scores[i] = recall_score(y_20[test_index], y_pred)
    i = i + 1







In [36]:
accuracy_scores

array([0.86498516, 0.8768546 , 0.88724036, 0.87388724, 0.88112927,
       0.87053571, 0.87202381, 0.86607143, 0.85714286, 0.8764881 ,
       0.87240356, 0.86498516, 0.8768546 , 0.8768546 , 0.88410104,
       0.8764881 , 0.86309524, 0.8735119 , 0.86607143, 0.875     ,
       0.88575668, 0.87240356, 0.86053412, 0.87240356, 0.8692422 ,
       0.85863095, 0.87053571, 0.87202381, 0.86755952, 0.87202381,
       0.86350148, 0.87240356, 0.87240356, 0.85905045, 0.87518574,
       0.86755952, 0.87946429, 0.89583333, 0.86607143, 0.88095238,
       0.87982196, 0.87833828, 0.87388724, 0.85905045, 0.87518574,
       0.875     , 0.86309524, 0.89136905, 0.87053571, 0.87202381,
       0.86053412, 0.87091988, 0.85311573, 0.86646884, 0.87369985,
       0.8735119 , 0.84970238, 0.8735119 , 0.8735119 , 0.88392857,
       0.86498516, 0.8694362 , 0.87091988, 0.8620178 , 0.87221397,
       0.8735119 , 0.85714286, 0.86755952, 0.8735119 , 0.86011905,
       0.87537092, 0.87537092, 0.86350148, 0.8694362 , 0.87667

In [37]:
print( round(np.mean(accuracy_scores), 3), "+/-", round(np.std(accuracy_scores), 3) )
print( round(np.mean(precision_scores), 3), "+/-", round(np.std(precision_scores), 3) )
print( round(np.mean(recall_scores), 3), "+/-", round(np.std(recall_scores), 3) )

0.872 +/- 0.008
0.622 +/- 0.062
0.3 +/- 0.047
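The loop above can also be written more compactly, because the cv argument of cross_validate accepts a splitter object as well as an integer. The sketch below passes a RepeatedStratifiedKFold instance; it runs on synthetic data with smaller fold and repeat counts, which are assumptions chosen to keep the example quick.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

# Synthetic, imbalanced stand-in for the Eclipse 2.0 data
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.8, 0.2], random_state=1)

# 5 folds repeated 2 times gives 10 train/test splits
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=1)
model = RandomForestClassifier(n_estimators=20, random_state=1)
metrics = ('accuracy', 'precision', 'recall')
scores = cross_validate(model, X, y, scoring=metrics, cv=rskf)

# One score per split; report mean and standard deviation as in the loop above
for name in metrics:
    values = scores['test_' + name]
    print(name, round(np.mean(values), 3), '+/-', round(np.std(values), 3))
```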


In [38]:
# Compare non-repeated cross-validation: cross_validate() above vs. RepeatedStratifiedKFold with a single repetition
n_reps = 1
n_folds = 10

accuracy_scores = np.zeros(n_reps*n_folds)
precision_scores = np.zeros(n_reps*n_folds)
recall_scores = np.zeros(n_reps*n_folds)

rskf = RepeatedStratifiedKFold(n_splits=n_folds, n_repeats=n_reps, random_state=myseed)

i = 0
for train_index, test_index in rskf.split(X_20, y_20):
    rf_model = RandomForestClassifier(random_state=myseed)
    rf_model.fit(X_20[train_index], y_20[train_index])
    y_pred = rf_model.predict(X_20[test_index])
    accuracy_scores[i] = accuracy_score(y_20[test_index], y_pred)
    precision_scores[i] = precision_score(y_20[test_index], y_pred)
    recall_scores[i] = recall_score(y_20[test_index], y_pred)
    i = i + 1



In [39]:
print( round(np.mean(accuracy_scores), 3), "+/-", round(np.std(accuracy_scores), 3) )
print( round(np.mean(precision_scores), 3), "+/-", round(np.std(precision_scores), 3) )
print( round(np.mean(recall_scores), 3), "+/-", round(np.std(recall_scores), 3) )

0.874 +/- 0.009
0.638 +/- 0.063
0.306 +/- 0.054


### Logistic Regression with Robust Scaled Data

In [40]:
# Robust Scaler centers data on the median and scales by the interquartile range,
# so outliers have little influence on the scaling (the outliers themselves remain in the data).

rs_20 = RobustScaler()
rs_20.fit(X_20)
X_20_rs = rs_20.transform(X_20)
X_21_rs = rs_20.transform(X_21)
X_30_rs = rs_20.transform(X_30)

In [41]:
display((X_20_rs.min(), X_20_rs.max()))
display((X_21_rs.min(), X_21_rs.max()))
display((X_30_rs.min(), X_30_rs.max()))

(-1.8749999999999998, 1049.0)

(-1.8749999999999998, 1126.0)

(-1.8749999999999998, 1254.0)
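The large maxima above show that robust scaling does not clip outliers; it only keeps them from distorting the center and scale applied to the other values. A small sketch with a made-up feature containing one outlier:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical feature values with a single large outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

rs = RobustScaler().fit(x)
x_scaled = rs.transform(x)

print(rs.center_, rs.scale_)   # median = 3, interquartile range = 4 - 2 = 2
print(x_scaled.ravel())        # each value becomes (x - 3) / 2
```

The four typical values map to the range -1 to 0.5, while the outlier stays far away at 48.5.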

In [42]:
# Build logistic regression model on robust scaled data
model_20 = LogisticRegression(random_state=myseed)
model_20.fit(X_20_rs, y_20)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=1, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [43]:
# Predict defect status on files of each Eclipse version with scaled data
y_20_pred = model_20.predict(X_20_rs)
y_21_pred = model_20.predict(X_21_rs)
y_30_pred = model_20.predict(X_30_rs)

In [44]:
# Accuracy metrics for predicting defect files in each Eclipse version after training on Eclipse 2.0
model_20_accuracy = {
    'e20': round(accuracy_score(y_20, y_20_pred), 3),
    'e21': round(accuracy_score(y_21, y_21_pred), 3),
    'e30': round(accuracy_score(y_30, y_30_pred), 3)
}
# Precision metrics for predicting defect files in each Eclipse version after training on Eclipse 2.0
model_20_precision = {
    'e20': round(precision_score(y_20, y_20_pred), 3),
    'e21': round(precision_score(y_21, y_21_pred), 3),
    'e30': round(precision_score(y_30, y_30_pred), 3)
}
# Recall metrics for predicting defect files in each Eclipse version after training on Eclipse 2.0
model_20_recall = {
    'e20': round(recall_score(y_20, y_20_pred), 3),
    'e21': round(recall_score(y_21, y_21_pred), 3),
    'e30': round(recall_score(y_30, y_30_pred), 3)
}
print(model_20_accuracy)
print(model_20_precision)
print(model_20_recall)

{'e20': 0.876, 'e21': 0.89, 'e30': 0.861}
{'e20': 0.695, 'e21': 0.477, 'e30': 0.615}
{'e20': 0.262, 'e21': 0.185, 'e30': 0.165}


### Gradient Boosted Tree Model

In [45]:
from sklearn.ensemble import GradientBoostingClassifier

In [46]:
# Build gradient boosted tree model on min-max scaled data
gbt_20 = GradientBoostingClassifier(random_state=myseed)
gbt_20.fit(X_20_mms, y_20)

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              n_iter_no_change=None, presort='auto', random_state=1,
              subsample=1.0, tol=0.0001, validation_fraction=0.1,
              verbose=0, warm_start=False)

In [47]:
# Predict defect status on files of each Eclipse version with scaled data
gbt_20_pred = gbt_20.predict(X_20_mms)
gbt_21_pred = gbt_20.predict(X_21_mms)
gbt_30_pred = gbt_20.predict(X_30_mms)

In [48]:
# Accuracy metrics for predicting defect files in each Eclipse version after training on Eclipse 2.0
gbt_20_accuracy = {
    'e20': round(accuracy_score(y_20, gbt_20_pred), 3),
    'e21': round(accuracy_score(y_21, gbt_21_pred), 3),
    'e30': round(accuracy_score(y_30, gbt_30_pred), 3)
}
# Precision metrics for predicting defect files in each Eclipse version after training on Eclipse 2.0
gbt_20_precision = {
    'e20': round(precision_score(y_20, gbt_20_pred), 3),
    'e21': round(precision_score(y_21, gbt_21_pred), 3),
    'e30': round(precision_score(y_30, gbt_30_pred), 3)
}
# Recall metrics for predicting defect files in each Eclipse version after training on Eclipse 2.0
gbt_20_recall = {
    'e20': round(recall_score(y_20, gbt_20_pred), 3),
    'e21': round(recall_score(y_21, gbt_21_pred), 3),
    'e30': round(recall_score(y_30, gbt_30_pred), 3)
}
print(gbt_20_accuracy)
print(gbt_20_precision)
print(gbt_20_recall)

{'e20': 0.903, 'e21': 0.887, 'e30': 0.862}
{'e20': 0.856, 'e21': 0.45, 'e30': 0.594}
{'e20': 0.396, 'e21': 0.217, 'e30': 0.212}


In [49]:
# Compare with Random Forest results
print(rf_20_accuracy)
print(rf_20_precision)
print(rf_20_recall)

{'e20': 0.987, 'e21': 0.872, 'e30': 0.852}
{'e20': 1.0, 'e21': 0.356, 'e30': 0.499}
{'e20': 0.908, 'e21': 0.231, 'e30': 0.21}


In [50]:
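from sklearn.metrics import roc_auc_score
# y_20_pred holds the hard 0/1 predictions of the logistic regression model fit on robust scaled data
roc_auc_score(y_20, y_20_pred)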

0.6210368706719072

In [51]:
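# Caution: rf_20 was fit on min-max scaled data (X_20_mms) but is given unscaled X_20 here,
# which likely explains the near-chance score below; X_20_mms is the matching input.
# This call also overwrites the rf_20_pred computed earlier.
rf_20_pred = rf_20.predict(X_20)
roc_auc_score(y_20, rf_20_pred)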

0.49445068313681456
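ROC AUC and average precision are ranking metrics defined over continuous scores, so they are normally computed from predicted probabilities rather than from hard 0/1 labels. Passing labels, as in the cells above, reduces the curve to a single operating point. The sketch below compares both inputs on synthetic data; the data set and model settings are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the defect data
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.85, 0.15], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

labels = clf.predict(X_test)                # hard 0/1 predictions
probs = clf.predict_proba(X_test)[:, 1]     # estimated probability of class 1

auc_from_labels = roc_auc_score(y_test, labels)
auc_from_probs = roc_auc_score(y_test, probs)
ap_from_probs = average_precision_score(y_test, probs)

print('ROC AUC from labels:       ', round(auc_from_labels, 3))
print('ROC AUC from probabilities:', round(auc_from_probs, 3))
print('Average precision:         ', round(ap_from_probs, 3))
```

For the models in this notebook, the corresponding inputs would be the second column of predict_proba on the same scaled features each model was trained on.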

## On your own: turn in by e-mail before next class meeting

On your own, 

  1. Build a logistic regression model using robust scaled data and compare performance with min-max scaling. 
  2. Compare performance of logistic regression and random forest models using area under the curve metrics: average precision and ROC AUC.
  3. Build a [Gradient Boosted Tree](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html) classification model.
  4. Measure model performance against all 3 versions of Eclipse and compare with Logistic Regression and Random Forest results. Use accuracy, precision, recall, average precision, and ROC AUC metrics. Is it better/worse? Does it take longer to build the boosted tree models?
  5. Measure model performance using 10-fold cross validation repeated 10 times on Eclipse 2.0 and compare with the Random Forest results above. Is it better/worse? Does it take longer to build the boosted tree models?
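For the timing questions, one simple approach is to wrap each call to fit with time.perf_counter. The sketch below uses synthetic data, and the helper fit_seconds is a hypothetical name introduced here; in a notebook, the %%time cell magic is an alternative.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Synthetic stand-in for the Eclipse 2.0 features and labels
X, y = make_classification(n_samples=2000, n_features=20, random_state=1)


def fit_seconds(model, X, y):
    """Fit the model and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start


timings = {
    'random_forest': fit_seconds(
        RandomForestClassifier(n_estimators=10, random_state=1), X, y),
    'gradient_boosting': fit_seconds(
        GradientBoostingClassifier(random_state=1), X, y),
}
for name, seconds in timings.items():
    print(name, round(seconds, 3), 'seconds')
```

Timings vary between machines and runs, so repeating the measurement a few times gives a more reliable comparison.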

In [None]:
The random forest model is better for 2.0. Logistic regression is better for 2.1. The boosted tree model is better for 3.0.

Random forest is better than 10-fold cross validation. I do not know whether the boosted tree models take longer to build.