Goal of this notebook:

Perform feature selection on our dataset.

Strategy:

Iterate over each project and run feature selection on it.

In [1]:
import numpy as np
import pandas as pd
from IPython.display import display  # public import path; IPython.core.display is deprecated for this

pd.set_option('display.max_columns', None)
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
import warnings
import classifier_utils
warnings.filterwarnings("ignore")

In [2]:
non_features_columns = ["chunk_id", "line_start", "line_end", "line_separator", "kind_conflict", "url", "project"]
non_features_columns.extend(["project_user", "project_name", "path", "file_name", "sha", "leftsha", "rightsha", "basesha"])

In [3]:
selected_dataset = pd.read_csv("../../data/SELECTED_LABELLED_DATASET.csv")
projects = list(selected_dataset['project'].unique())

In [4]:
rf = RandomForestClassifier(random_state=99, n_jobs=5, n_estimators=400, max_features=0.3, min_samples_leaf=1)

# Tree-based feature selection

Uses the feature_importances_ attribute of the fitted Random Forest model to select the most important features. The threshold is the mean importance across all features, so features whose importance falls below that mean are discarded.
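
As an illustration of the mean-importance threshold, here is a minimal sketch using scikit-learn's SelectFromModel. It is not the code inside classifier_utils (which is not shown in this notebook); the synthetic data, the smaller forest, and the names `rf_demo` and `selector` are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for one project's feature matrix (assumption, not the real data).
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=99)

# Smaller forest than the notebook's `rf`, only to keep the example fast.
rf_demo = RandomForestClassifier(n_estimators=50, random_state=99)

# threshold="mean" keeps the features whose importance is >= the mean importance.
selector = SelectFromModel(rf_demo, threshold="mean").fit(X, y)

importances = selector.estimator_.feature_importances_
mask = selector.get_support()        # boolean mask of the kept features
X_selected = selector.transform(X)   # matrix reduced to the kept columns

print(f"kept {mask.sum()} of {X.shape[1]} features")
```

Because the importances of a random forest sum to 1, the mean is 1 / (number of features), so the threshold gets stricter in relative terms as fewer features dominate the model.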

In [5]:
import importlib
importlib.reload(classifier_utils)
results_tree, attributes_record_tree = classifier_utils.projects_feature_selection(projects, non_features_columns, rf, 'tree')

In [6]:
results_tree

Unnamed: 0,project,N,# attr.,# attr. fs,accuracy,accuracy_fs,improvement
0,Ramblurr__Anki-Android,759,129.0,33.0,0.736,0.739,0.01
1,apache__directory-server,652,96.0,18.0,0.942,0.929,-0.013
2,android__platform_frameworks_base,2460,566.0,54.0,0.817,0.817,0.0
3,freenet__fred,1012,134.0,47.0,0.678,0.68,0.006
4,alexo__wro4j,1368,107.0,30.0,0.59,0.581,-0.015
5,apache__lucene-solr,974,124.0,42.0,0.65,0.636,-0.022
6,getrailo__railo,572,90.0,32.0,0.703,0.71,0.023
7,atlasapi__atlas,782,124.0,41.0,0.666,0.656,-0.015
8,hibernate__hibernate-orm,716,131.0,35.0,0.598,0.599,0.004
9,CloudStack-extras__CloudStack-archive,1106,135.0,46.0,0.804,0.808,0.023


# Recursive feature elimination

First, the estimator is trained on the initial set of features and the importance of each feature is obtained. Then, the least important features are pruned from the current set of features. That procedure is repeated recursively on the pruned set until the desired number of features is reached.
Currently we discard one feature per step and use 5-fold cross-validation to compute the accuracy at each step.
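
The procedure described above can be sketched with scikit-learn's RFECV, which this notebook imports. This is a hedged illustration rather than the implementation in classifier_utils: the synthetic data, the smaller forest, and the names `rf_demo` and `rfecv` are assumptions. Note that RFECV does not take a fixed target size; it keeps the number of features with the best cross-validated score.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Synthetic stand-in for one project's feature matrix (assumption, not the real data).
X, y = make_classification(n_samples=200, n_features=12, n_informative=4, random_state=99)

# Smaller forest than the notebook's `rf`, only to keep the example fast.
rf_demo = RandomForestClassifier(n_estimators=25, random_state=99)

# step=1 drops one feature per iteration; cv=5 scores every subset size
# with 5-fold cross-validation, using accuracy as the metric.
rfecv = RFECV(estimator=rf_demo, step=1, cv=5, scoring="accuracy").fit(X, y)

print("features kept:", rfecv.n_features_)
print("selected mask:", rfecv.support_)
print("elimination ranking (1 = kept):", rfecv.ranking_)
```

With step=1 the cost grows with the number of features, since the forest is refit once per eliminated feature in every fold; a larger step trades precision for speed.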

In [7]:
import importlib
importlib.reload(classifier_utils)
results_recursive, attributes_record_rec = classifier_utils.projects_feature_selection(projects, non_features_columns, rf, 'recursive')

In [8]:
results_recursive

Unnamed: 0,project,N,# attr.,# attr. fs,accuracy,accuracy_fs,improvement
0,Ramblurr__Anki-Android,759,129.0,53.0,0.736,0.742,0.02
1,apache__directory-server,652,96.0,78.0,0.942,0.94,-0.002
2,android__platform_frameworks_base,2460,566.0,270.0,0.817,0.818,0.004
3,freenet__fred,1012,134.0,96.0,0.678,0.676,-0.003
4,alexo__wro4j,1368,107.0,59.0,0.59,0.588,-0.002
5,apache__lucene-solr,974,124.0,100.0,0.65,0.639,-0.017
6,getrailo__railo,572,90.0,15.0,0.703,0.706,0.012
7,atlasapi__atlas,782,124.0,90.0,0.666,0.671,0.015
8,hibernate__hibernate-orm,716,131.0,84.0,0.598,0.598,0.0
9,CloudStack-extras__CloudStack-archive,1106,135.0,88.0,0.804,0.806,0.009


# IGAR
Selects attributes based on the ranking of their information gain.
Information gain measures how well a feature separates the target classes. The greater the information gain, the more useful the feature is for classification.

Information Gain(attribute) = Entropy(class) - Entropy(class | attribute)

The algorithm takes an input value 'n' and selects the 'n' attributes with the greatest information gain. In this notebook we use n = 84, the value found in the notebook IGAR_tuning.ipynb.
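
To make the formula concrete, here is a small sketch that computes information gain for discrete features and keeps the top n. It is only an illustration: the function names are hypothetical, the toy arrays are made up, and continuous features would need to be binned first, which classifier_utils may handle differently.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(feature, labels):
    """Entropy(class) - Entropy(class | feature) for a discrete feature."""
    feature, labels = np.asarray(feature), np.asarray(labels)
    conditional = 0.0
    for value in np.unique(feature):
        subset = labels[feature == value]
        # Weight each subset's entropy by the share of rows it covers.
        conditional += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - conditional

def top_n_by_information_gain(columns, labels, n):
    """Return the n (name, gain) pairs with the highest information gain."""
    gains = {name: information_gain(values, labels) for name, values in columns.items()}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)[:n]

# Toy data (made up): one feature that determines the class, one that does not.
y_toy = [0, 0, 1, 1]
columns_toy = {"separates_classes": [0, 0, 1, 1], "uninformative": [0, 1, 0, 1]}

top = top_n_by_information_gain(columns_toy, y_toy, n=1)
print(top)
```

In the toy data the first feature removes all uncertainty about the class (gain of 1 bit), while the second leaves the class distribution unchanged within each value (gain of 0).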

In [9]:
import importlib
importlib.reload(classifier_utils)
results_IGAR, attributes_record_IGAR = classifier_utils.projects_feature_selection(projects, non_features_columns, rf, 'IGAR')

In [10]:
results_IGAR

Unnamed: 0,project,N,# attr.,# attr. fs,accuracy,accuracy_fs,improvement
0,Ramblurr__Anki-Android,759,129.0,84.0,0.736,0.743,0.025
1,apache__directory-server,652,96.0,84.0,0.942,0.933,-0.01
2,android__platform_frameworks_base,2460,566.0,84.0,0.817,0.817,-0.0
3,freenet__fred,1012,134.0,84.0,0.678,0.662,-0.023
4,alexo__wro4j,1368,107.0,84.0,0.59,0.583,-0.012
5,apache__lucene-solr,974,124.0,84.0,0.65,0.648,-0.003
6,getrailo__railo,572,90.0,84.0,0.703,0.717,0.047
7,atlasapi__atlas,782,124.0,84.0,0.666,0.675,0.027
8,hibernate__hibernate-orm,716,131.0,84.0,0.598,0.591,-0.012
9,CloudStack-extras__CloudStack-archive,1106,135.0,84.0,0.804,0.805,0.005


## Comparison

In [11]:
df_inner = pd.merge(results_tree, results_recursive, on='project', how='inner', suffixes=('_tree', '_rec'))
df_inner_igar = results_IGAR.add_suffix("_IGAR").rename(columns={"project_IGAR": "project"})
df_inner = pd.merge(df_inner, df_inner_igar, on='project', how='inner')
df_inner.to_csv('feature_selection_comparison.csv', index=False)

accuracy_inner = df_inner.filter(regex=("project|accuracy.*")).copy()
accuracy_inner['improvement_tree'] = accuracy_inner.apply(lambda x: classifier_utils.get_normalized_improvement(x['accuracy_fs_tree'], x['accuracy_tree']), axis=1)
accuracy_inner['improvement_rec'] = accuracy_inner.apply(lambda x: classifier_utils.get_normalized_improvement(x['accuracy_fs_rec'], x['accuracy_rec']), axis=1)
accuracy_inner['improvement_IGAR'] = accuracy_inner.apply(lambda x: classifier_utils.get_normalized_improvement(x['accuracy_fs_IGAR'], x['accuracy_IGAR']), axis=1)
accuracy_inner = accuracy_inner.round(3)
accuracy_inner

Unnamed: 0,project,accuracy_tree,accuracy_fs_tree,accuracy_rec,accuracy_fs_rec,accuracy_IGAR,accuracy_fs_IGAR,improvement_tree,improvement_rec,improvement_IGAR
0,Ramblurr__Anki-Android,0.736,0.739,0.736,0.742,0.736,0.743,0.011,0.023,0.027
1,apache__directory-server,0.942,0.929,0.942,0.94,0.942,0.933,-0.014,-0.002,-0.01
2,android__platform_frameworks_base,0.817,0.817,0.817,0.818,0.817,0.817,0.0,0.005,0.0
3,freenet__fred,0.678,0.68,0.678,0.676,0.678,0.662,0.006,-0.003,-0.024
4,alexo__wro4j,0.59,0.581,0.59,0.588,0.59,0.583,-0.015,-0.003,-0.012
5,apache__lucene-solr,0.65,0.636,0.65,0.639,0.65,0.648,-0.022,-0.017,-0.003
6,getrailo__railo,0.703,0.71,0.703,0.706,0.703,0.717,0.024,0.01,0.047
7,atlasapi__atlas,0.666,0.656,0.666,0.671,0.666,0.675,-0.015,0.015,0.027
8,hibernate__hibernate-orm,0.598,0.599,0.598,0.598,0.598,0.591,0.002,0.0,-0.012
9,CloudStack-extras__CloudStack-archive,0.804,0.808,0.804,0.806,0.804,0.805,0.02,0.01,0.005
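
The source of classifier_utils.get_normalized_improvement is not shown in this notebook. The improvement columns above are consistent with the following rule, which is an inference from the table values and not the confirmed implementation: a gain is divided by the remaining headroom (1 - baseline), and a loss is divided by the baseline. The function name below is hypothetical.

```python
def normalized_improvement(accuracy_fs, accuracy):
    """Hypothetical reconstruction, inferred from the comparison table.

    accuracy_fs: accuracy after feature selection
    accuracy:    baseline accuracy with all features
    """
    delta = accuracy_fs - accuracy
    if delta == 0:
        return 0.0
    if delta > 0:
        # Share of the possible improvement (up to 1.0) that was achieved.
        return delta / (1 - accuracy)
    # Share of the baseline accuracy that was lost.
    return delta / accuracy

# Rounded inputs taken from the first two rows of the table (tree method).
print(round(normalized_improvement(0.739, 0.736), 3))
print(round(normalized_improvement(0.929, 0.942), 3))
```

Under this rule a small absolute gain on an already accurate project counts for more than the same gain on a weak one. Rows computed from the rounded accuracies shown here can differ in the last digit from the table, which presumably used unrounded values.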


In [12]:
pd.DataFrame(attributes_record_rec, columns=['project', 'attribute', 'information_gain', 'method']).to_csv('attributes_record_rec.csv', index=False)

In [13]:
attributes_record = []
attributes_record.extend(attributes_record_tree)
attributes_record.extend(attributes_record_rec)
attributes_record.extend(attributes_record_IGAR)
attributes_record_df = pd.DataFrame(attributes_record, columns=['project', 'attribute', 'information_gain', 'method'])
attributes_record_df.to_csv('attributes_record.csv', index=False)

In [14]:
attributes_record = pd.read_csv('attributes_record.csv')

## Ranking of features selected by tree method

Counts the number of projects in which each feature was selected by the tree method.
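
The aggregation behind such a ranking can be sketched with a pandas groupby. This is an assumption about what classifier_utils.get_attribute_selection_ranking does, based only on its output columns; the function name `selection_ranking`, the project names, and the numbers in the toy records are all made up for illustration.

```python
import pandas as pd

# Toy records in the same shape as attributes_record.csv (values are made up).
records = pd.DataFrame(
    [
        ("proj_a", "chunkRelSize", 1.4, "tree"),
        ("proj_a", "fileSize", 1.1, "tree"),
        ("proj_b", "chunkRelSize", 1.2, "tree"),
        ("proj_b", "fileCC", 0.7, "tree"),
        ("proj_a", "fileSize", 1.0, "IGAR"),
    ],
    columns=["project", "attribute", "information_gain", "method"],
)

def selection_ranking(records, method):
    """Hypothetical helper: per-attribute summary for one selection method."""
    subset = records[records["method"] == method].copy()
    # Rank 1 = highest information gain within each project.
    subset["ranking"] = subset.groupby("project")["information_gain"].rank(
        ascending=False, method="min"
    )
    return (
        subset.groupby("attribute")
        .agg(
            count_selected=("project", "nunique"),
            average_information_gain=("information_gain", "mean"),
            average_ranking=("ranking", "mean"),
        )
        .reset_index()
    )

ranking = selection_ranking(records, "tree")
print(ranking.sort_values("average_information_gain", ascending=False))
```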

In [15]:
ranking_tree = classifier_utils.get_attribute_selection_ranking(attributes_record, 'tree')
ranking_tree.sort_values(['average_information_gain', 'count_selected'], ascending=False).head(50)

Unnamed: 0,attribute,count_selected,average_information_gain,average_ranking
17,chunkRelSize,19.0,1.396814,1.0
19,fileSize,19.0,1.090939,3.526316
25,Merge isolation time,14.0,1.046807,3.214286
24,Branching time,15.0,0.933865,3.8
2,right_lines_added,16.0,0.907216,5.8125
1,left_lines_removed,14.0,0.889584,6.928571
3,right_lines_removed,17.0,0.876816,4.823529
0,left_lines_added,17.0,0.811123,8.117647
28,Commits 1,14.0,0.782361,9.714286
42,Changed files 1,15.0,0.779832,9.866667


## Ranking of features selected by recursive method

Counts the number of projects in which each feature was selected by the recursive method.

In [16]:
ranking_recursive = classifier_utils.get_attribute_selection_ranking(attributes_record, 'recursive')
ranking_recursive.sort_values(['average_information_gain', 'count_selected'], ascending=False).head(50)

Unnamed: 0,attribute,count_selected,average_information_gain,average_ranking
19,chunkRelSize,19.0,1.418194,1.894737
21,fileSize,18.0,1.172272,4.277778
27,Merge isolation time,17.0,0.929518,3.705882
26,Branching time,17.0,0.928754,3.823529
1,left_lines_removed,18.0,0.84138,6.555556
3,right_lines_removed,17.0,0.81954,8.294118
2,right_lines_added,17.0,0.811537,9.411765
0,left_lines_added,18.0,0.790354,8.222222
17,fileCC,18.0,0.752752,9.888889
33,Changed files 1,17.0,0.738335,9.941176


## Ranking of features selected by IGAR method

Counts the number of projects in which each feature was selected by the IGAR method.

The information gain column is averaged across all projects.

In [17]:
import importlib
importlib.reload(classifier_utils)
ranking_IGAR = classifier_utils.get_attribute_selection_ranking(attributes_record, 'IGAR')
ranking_IGAR.sort_values(['average_information_gain', 'count_selected'], ascending=False).head(50)

Unnamed: 0,attribute,count_selected,average_information_gain,average_ranking
27,chunkRelSize,15.0,1.303072,2.133333
12,fileSize,15.0,1.004468,4.733333
3,Merge isolation time,19.0,0.780967,3.842105
1,Branching time,19.0,0.780469,3.894737
7,left_lines_removed,20.0,0.769164,6.2
2,right_lines_added,20.0,0.760997,8.45
9,right_lines_removed,19.0,0.7351,7.947368
0,left_lines_added,20.0,0.72294,7.85
14,Changed files 1,19.0,0.712724,9.0
10,fileCC,15.0,0.665897,10.066667
