Goal of this notebook:

Perform feature selection on our dataset.

Strategy:

Iterate over the projects and run each feature selection method on every project separately.

In [1]:
import warnings

import numpy as np
import pandas as pd
from IPython.display import display
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

import classifier_utils

# Show every column when displaying wide DataFrames.
pd.set_option('display.max_columns', None)
# Silence library warnings to keep the cell outputs readable.
warnings.filterwarnings("ignore")

In [2]:
non_features_columns = ["chunk_id", "line_start", "line_end", "line_separator", "kind_conflict", "url", "project"]
non_features_columns.extend(["project_user", "project_name", "path", "file_name", "sha", "leftsha", "rightsha", "basesha"])

In [3]:
selected_dataset = pd.read_csv("../../data/SELECTED_LABELLED_DATASET.csv")
projects = list(selected_dataset['project'].unique())

In [4]:
rf = RandomForestClassifier(random_state=99, n_jobs=5, n_estimators=100, max_features=0.3, min_samples_leaf=1)

# Tree-based feature selection

Uses the feature_importances_ attribute of the fitted Random Forest model to select the most important features. The threshold is the mean importance over all features: features whose importance falls below that mean are discarded.
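The thresholding rule described above can be sketched in a few lines. This is only an illustration: the helper name `select_above_mean` and the importance values are hypothetical, and the notebook does not show how `classifier_utils` implements the 'tree' method. scikit-learn offers the same rule through `SelectFromModel` with `threshold="mean"`, which keeps features whose importance is greater than or equal to the mean.

```python
from statistics import mean


def select_above_mean(importances):
    """Keep the features whose importance is at least the mean importance.

    `importances` maps a feature name to its importance score, for example
    the values of a fitted forest's feature_importances_ attribute.
    """
    threshold = mean(importances.values())
    return [name for name, value in importances.items() if value >= threshold]


# Hypothetical importances, for illustration only (not taken from the dataset).
example_importances = {
    "feature_a": 0.50,
    "feature_b": 0.30,
    "feature_c": 0.15,
    "feature_d": 0.05,
}
selected = select_above_mean(example_importances)
print(selected)
```

Because the threshold is relative to the mean, the number of surviving features differs per project, which matches the varying '# attr. fs' values reported for this method.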

In [5]:
import importlib
importlib.reload(classifier_utils)
results_tree, attributes_record_tree = classifier_utils.projects_feature_selection(projects, non_features_columns, rf, 'tree')

In [6]:
results_tree

Unnamed: 0,project,N,# attr.,# attr. fs,accuracy,accuracy_fs,improvement
0,Ramblurr__Anki-Android,759.0,129.0,33.0,0.739,0.735,-0.005
1,apache__directory-server,652.0,96.0,19.0,0.942,0.928,-0.015
2,android__platform_frameworks_base,2460.0,566.0,54.0,0.817,0.815,-0.002
3,freenet__fred,1012.0,134.0,47.0,0.673,0.677,0.012
4,alexo__wro4j,1368.0,107.0,30.0,0.575,0.591,0.038
5,apache__lucene-solr,974.0,124.0,42.0,0.641,0.643,0.005
6,elastic__elasticsearch,,,,,,
7,getrailo__railo,572.0,90.0,30.0,0.699,0.698,-0.0
8,atlasapi__atlas,782.0,124.0,42.0,0.674,0.647,-0.04
9,hibernate__hibernate-orm,716.0,131.0,35.0,0.599,0.596,-0.006


# Recursive feature elimination

First, the estimator is trained on the initial set of features and the importance of each feature is obtained. Then the least important features are pruned from the current set of features. This procedure is repeated recursively on the pruned set until the desired number of features is reached.
Currently we discard one feature per step and use 5-fold cross-validation to estimate the accuracy at each step.
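The elimination loop can be sketched as follows. This is a simplified, dependency-free illustration rather than the code used by `classifier_utils`: the function `recursive_elimination`, the callbacks `toy_importance` and `toy_score`, and the `weights` are all hypothetical. In the notebook, the role of the scoring callback would be played by the cross-validated accuracy; scikit-learn's `RFECV` (imported above) provides this behaviour, for instance with `step=1` and `cv=5`.

```python
def recursive_elimination(features, importance_fn, score_fn, step=1):
    """Drop the `step` least important features per round.

    importance_fn(subset) returns a dict feature -> importance for the subset.
    score_fn(subset) returns a score for the subset (higher is better).
    Returns the best-scoring subset seen and its score.
    """
    current = list(features)
    best_subset, best_score = list(current), score_fn(current)
    while len(current) > 1:
        importances = importance_fn(current)
        # Never drop the last remaining feature.
        n_drop = min(step, len(current) - 1)
        weakest = sorted(current, key=lambda f: importances[f])[:n_drop]
        current = [f for f in current if f not in weakest]
        score = score_fn(current)
        # On ties, prefer the smaller subset.
        if score >= best_score:
            best_subset, best_score = list(current), score
    return best_subset, best_score


# Toy setup (hypothetical): two useful features and two noise features.
weights = {"a": 3, "b": 2, "noise1": 0, "noise2": 0}


def toy_importance(subset):
    return {f: weights[f] for f in subset}


def toy_score(subset):
    # Reward useful features, penalise subset size slightly.
    return 10 * sum(weights[f] for f in subset) - len(subset)


best_subset, best_score = recursive_elimination(list(weights), toy_importance, toy_score)
print(best_subset, best_score)
```

Removing a single feature per step is the most thorough setting but also the slowest, since the estimator is refitted once per removed feature and per fold.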

In [7]:
import importlib
importlib.reload(classifier_utils)
results_recursive, attributes_record_rec = classifier_utils.projects_feature_selection(projects, non_features_columns, rf, 'recursive')

In [8]:
results_recursive

Unnamed: 0,project,N,# attr.,# attr. fs,accuracy,accuracy_fs,improvement
0,Ramblurr__Anki-Android,759.0,129.0,115.0,0.739,0.737,-0.003
1,apache__directory-server,652.0,96.0,60.0,0.942,0.94,-0.002
2,android__platform_frameworks_base,2460.0,566.0,308.0,0.817,0.818,0.009
3,freenet__fred,1012.0,134.0,31.0,0.673,0.675,0.006
4,alexo__wro4j,1368.0,107.0,72.0,0.575,0.583,0.017
5,apache__lucene-solr,974.0,124.0,108.0,0.641,0.658,0.048
6,elastic__elasticsearch,,,,,,
7,getrailo__railo,572.0,90.0,72.0,0.699,0.714,0.052
8,atlasapi__atlas,782.0,124.0,109.0,0.674,0.663,-0.017
9,hibernate__hibernate-orm,716.0,131.0,86.0,0.599,0.598,-0.002


# IGAR
Selects attributes based on the ranking of their information gain.
Information gain measures how well a feature separates the target classes. The greater the information gain, the more useful the feature is for classification.

Information Gain = Entropy(target) - Entropy(target | attribute)

The algorithm takes an input value 'n' and selects the 'n' attributes with the greatest information gain. In this notebook we use n = 65, the value found in the notebook IGAR_tuning.ipynb.
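As an illustration of the ranking step, the sketch below computes the information gain of categorical attributes and keeps the top 'n'. The function names and the toy data are hypothetical, and the sketch assumes discrete attribute values; continuous attributes would need to be discretised first, and the notebook does not show how `classifier_utils` handles this.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the labels minus their entropy conditioned on the attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def top_n_by_information_gain(columns, labels, n):
    """Return the names of the n attributes with the highest information gain."""
    gains = {name: information_gain(vals, labels) for name, vals in columns.items()}
    return sorted(gains, key=gains.get, reverse=True)[:n]


# Toy data (hypothetical): one attribute separates the classes, one does not.
labels = ["A", "A", "B", "B"]
columns = {"useless": [1, 0, 1, 0], "perfect": [1, 1, 0, 0]}
print(top_n_by_information_gain(columns, labels, 1))
```

Unlike the other two methods, a fixed 'n' yields the same number of selected attributes for every project, regardless of how many attributes the project started with.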

In [9]:
import importlib
importlib.reload(classifier_utils)
results_IGAR, attributes_record_IGAR = classifier_utils.projects_feature_selection(projects, non_features_columns, rf, 'IGAR')

In [10]:
results_IGAR

Unnamed: 0,project,N,# attr.,# attr. fs,accuracy,accuracy_fs,improvement
0,Ramblurr__Anki-Android,759.0,129.0,65.0,0.739,0.741,0.005
1,apache__directory-server,652.0,96.0,65.0,0.942,0.934,-0.008
2,android__platform_frameworks_base,2460.0,566.0,65.0,0.817,0.81,-0.008
3,freenet__fred,1012.0,134.0,65.0,0.673,0.684,0.033
4,alexo__wro4j,1368.0,107.0,65.0,0.575,0.574,-0.002
5,apache__lucene-solr,974.0,124.0,65.0,0.641,0.634,-0.01
6,elastic__elasticsearch,,,,,,
7,getrailo__railo,572.0,90.0,65.0,0.699,0.711,0.04
8,atlasapi__atlas,782.0,124.0,65.0,0.674,0.658,-0.023
9,hibernate__hibernate-orm,716.0,131.0,65.0,0.599,0.588,-0.018


## Comparison

In [11]:
df_inner = pd.merge(results_tree, results_recursive, on='project', how='inner', suffixes=('_tree', '_rec'))
df_inner_igar = results_IGAR.add_suffix("_IGAR").rename(columns={"project_IGAR": "project"})
df_inner = pd.merge(df_inner, df_inner_igar, on='project', how='inner')
df_inner.to_csv('feature_selection_comparison.csv', index=False)

accuracy_inner = df_inner.filter(regex=("project|accuracy.*")).copy()
accuracy_inner['improvement_tree'] = accuracy_inner.apply(lambda x: classifier_utils.get_normalized_improvement(x['accuracy_fs_tree'], x['accuracy_tree']), axis=1)
accuracy_inner['improvement_rec'] = accuracy_inner.apply(lambda x: classifier_utils.get_normalized_improvement(x['accuracy_fs_rec'], x['accuracy_rec']), axis=1)
accuracy_inner['improvement_IGAR'] = accuracy_inner.apply(lambda x: classifier_utils.get_normalized_improvement(x['accuracy_fs_IGAR'], x['accuracy_IGAR']), axis=1)
accuracy_inner = accuracy_inner.round(3)
accuracy_inner

Unnamed: 0,project,accuracy_tree,accuracy_fs_tree,accuracy_rec,accuracy_fs_rec,accuracy_IGAR,accuracy_fs_IGAR,improvement_tree,improvement_rec,improvement_IGAR
0,Ramblurr__Anki-Android,0.739,0.735,0.739,0.737,0.739,0.741,-0.005,-0.003,0.008
1,apache__directory-server,0.942,0.928,0.942,0.94,0.942,0.934,-0.015,-0.002,-0.008
2,android__platform_frameworks_base,0.817,0.815,0.817,0.818,0.817,0.81,-0.002,0.005,-0.009
3,freenet__fred,0.673,0.677,0.673,0.675,0.673,0.684,0.012,0.006,0.034
4,alexo__wro4j,0.575,0.591,0.575,0.583,0.575,0.574,0.038,0.019,-0.002
5,apache__lucene-solr,0.641,0.643,0.641,0.658,0.641,0.634,0.006,0.047,-0.011
6,elastic__elasticsearch,,,,,,,,,
7,getrailo__railo,0.699,0.698,0.699,0.714,0.699,0.711,-0.001,0.05,0.04
8,atlasapi__atlas,0.674,0.647,0.674,0.663,0.674,0.658,-0.04,-0.016,-0.024
9,hibernate__hibernate-orm,0.599,0.596,0.599,0.598,0.599,0.588,-0.005,-0.002,-0.018
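
The improvement columns in the table above are consistent with the following reading: a gain is expressed as a fraction of the remaining headroom (1 - baseline), and a loss as a fraction of the baseline accuracy. The sketch below is inferred from the displayed values only; the actual `classifier_utils.get_normalized_improvement` is not shown in this notebook, and the name `normalized_improvement` is hypothetical.

```python
def normalized_improvement(new_accuracy, baseline_accuracy):
    """Assumed definition, inferred from the table values.

    Gains are scaled by the room left for improvement, losses by the baseline.
    """
    difference = new_accuracy - baseline_accuracy
    if difference > 0:
        return difference / (1 - baseline_accuracy)
    if difference < 0:
        return difference / baseline_accuracy
    return 0.0


# alexo__wro4j, tree method: 0.575 -> 0.591
print(round(normalized_improvement(0.591, 0.575), 3))
# apache__directory-server, tree method: 0.942 -> 0.928
print(round(normalized_improvement(0.928, 0.942), 3))
```

Under this reading, the same absolute gain counts for more when the baseline accuracy is already high, because less headroom remains.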


In [12]:
pd.DataFrame(attributes_record_rec, columns=['project', 'attribute', 'information_gain', 'method']).to_csv('attributes_record_rec.csv', index=False)

In [13]:
attributes_record = []
attributes_record.extend(attributes_record_tree)
attributes_record.extend(attributes_record_rec)
attributes_record.extend(attributes_record_IGAR)
attributes_record_df = pd.DataFrame(attributes_record, columns=['project', 'attribute', 'information_gain', 'method'])
attributes_record_df.to_csv('attributes_record.csv', index=False)

## Ranking of features selected by tree method

Counts the number of projects in which each feature was selected by the tree method.
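
The aggregation behind such a ranking can be sketched as below. This is an assumption about what `classifier_utils.get_attribute_selection_ranking` computes, not its actual code: the function `selection_ranking` and the toy records are hypothetical, and the per-project rank is assumed to follow descending information gain. The records use the same four fields as `attributes_record`.

```python
from collections import defaultdict


def selection_ranking(records, method):
    """Aggregate (project, attribute, information_gain, method) records."""
    per_project = defaultdict(list)
    for project, attribute, gain, rec_method in records:
        if rec_method == method:
            per_project[project].append((attribute, gain))

    gains, ranks = defaultdict(list), defaultdict(list)
    for selected in per_project.values():
        # Rank 1 = highest information gain within the project (assumed).
        ordered = sorted(selected, key=lambda item: item[1], reverse=True)
        for position, (attribute, gain) in enumerate(ordered, start=1):
            gains[attribute].append(gain)
            ranks[attribute].append(position)

    return {
        attribute: {
            "count_selected": len(values),
            "average_information_gain": sum(values) / len(values),
            "average_ranking": sum(ranks[attribute]) / len(values),
        }
        for attribute, values in gains.items()
    }


# Toy records (hypothetical), for illustration only.
toy_records = [
    ("p1", "x", 1.0, "tree"),
    ("p1", "y", 0.5, "tree"),
    ("p2", "x", 2.0, "tree"),
    ("p2", "y", 3.0, "IGAR"),
]
toy_ranking = selection_ranking(toy_records, "tree")
print(toy_ranking)
```

Sorting by average information gain first, as the cell below does, lets attributes selected in a single project rise to the top, so 'count_selected' should be read alongside the average.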

In [14]:
ranking_tree = classifier_utils.get_attribute_selection_ranking(attributes_record, 'tree')
ranking_tree.sort_values(['average_information_gain', 'count_selected'], ascending=False).head(50)

Unnamed: 0,attribute,count_selected,average_information_gain,average_ranking
69,steve@hibernate.org,1.0,2.069798,12.0
10,keyword_remove,17.0,1.721044,7.058824
26,Merge isolation time,22.0,1.670107,2.727273
25,Branching time,20.0,1.667334,2.25
29,Changed files 1,20.0,1.648005,2.95
27,Commits 1,20.0,1.643616,4.15
8,keyword_update,16.0,1.63001,8.1875
11,keyword_use,18.0,1.62759,6.0
1,left_lines_removed,20.0,1.613181,5.3
6,keyword_bug,18.0,1.610705,7.277778


## Ranking of features selected by recursive method

Counts the number of projects in which each feature was selected by the recursive method.

In [15]:
ranking_recursive = classifier_utils.get_attribute_selection_ranking(attributes_record, 'recursive')
ranking_recursive.sort_values(['average_information_gain', 'count_selected'], ascending=False).head(50)

Unnamed: 0,attribute,count_selected,average_information_gain,average_ranking
365,alex.objelean@gmail.com,1.0,2.091704,6.0
473,steve@hibernate.org,1.0,2.069798,14.0
29,Merge isolation time,22.0,1.670107,3.0
28,Branching time,22.0,1.669547,3.272727
2,right_lines_added,22.0,1.640534,2.727273
32,Different devs,22.0,1.611124,7.181818
13,keyword_remove,22.0,1.61111,9.136364
37,Changed files 1,23.0,1.610289,3.695652
35,Commits 1,23.0,1.607341,4.608696
12,keyword_add,23.0,1.596992,5.521739


## Ranking of features selected by IGAR method

Counts the number of projects in which each feature was selected by the IGAR method.

The information gain column is averaged across projects.

In [16]:
import importlib
importlib.reload(classifier_utils)
ranking_IGAR = classifier_utils.get_attribute_selection_ranking(attributes_record, 'IGAR')
ranking_IGAR.sort_values(['average_information_gain', 'count_selected'], ascending=False).head(50)

Unnamed: 0,attribute,count_selected,average_information_gain,average_ranking
93,steve@hibernate.org,1.0,2.069798,14.0
92,michael@getrailo.org,1.0,1.588456,18.0
37,fileSize,25.0,1.509932,1.6
0,chunkRelSize,25.0,1.50977,1.52
3,Merge isolation time,25.0,1.503189,3.12
5,Branching time,25.0,1.502696,3.36
19,Changed files 1,25.0,1.498969,4.2
39,fileCC,25.0,1.496872,3.76
22,Commits 1,25.0,1.496374,4.96
16,keyword_add,25.0,1.485307,5.72
