Goal of this notebook:

Perform feature selection on our dataset.

Strategy:

Iterate over the projects and run feature selection on each one separately.

In [1]:
import warnings

import numpy as np
import pandas as pd
from IPython.display import display
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

import classifier_utils  # project-local helper module

# Show every column when displaying wide DataFrames.
pd.set_option('display.max_columns', None)
# Keep UserWarning messages out of the notebook output.
warnings.filterwarnings("ignore", category=UserWarning)

In [2]:
# Identifier and metadata columns that are not used as features.
non_features_columns = ["chunk_id", "line_start", "line_end", "line_separator", "kind_conflict", "url", "project"]
non_features_columns.extend(["project_user", "project_name", "path", "file_name", "sha", "leftsha", "rightsha", "basesha"])

In [3]:
selected_dataset = pd.read_csv("../../data/SELECTED_LABELLED_DATASET.csv")
projects = list(selected_dataset['project'].unique())

In [4]:
rf = RandomForestClassifier(random_state=99, n_jobs=5, n_estimators=100, max_features=0.3, min_samples_leaf=1)

# Tree-based feature selection

This method uses the feature_importances_ attribute of the fitted Random Forest model to rank the features. The mean importance across all features serves as the threshold: features whose importance falls below it are discarded.
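
One common way to get this kind of mean-threshold selection is scikit-learn's SelectFromModel. The sketch below only illustrates the idea on synthetic data. classifier_utils.projects_feature_selection is not shown in this notebook, so whether it uses SelectFromModel is an assumption, and the names rf_demo, selector and X_selected are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for one project's feature matrix (not the notebook's data).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=99
)

# Same kind of estimator as in the notebook, kept small for the demo.
rf_demo = RandomForestClassifier(n_estimators=100, max_features=0.3, random_state=99)

# threshold="mean" keeps the features whose importance is at least
# the mean of all feature importances.
selector = SelectFromModel(rf_demo, threshold="mean").fit(X, y)

importances = selector.estimator_.feature_importances_
mask = selector.get_support()        # boolean mask of the kept features
X_selected = selector.transform(X)   # reduced feature matrix

print(f"kept {mask.sum()} of {X.shape[1]} features")
```

The reduced matrix X_selected can then be used to re-evaluate the classifier and compare its accuracy with the one obtained on the full feature set.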

In [5]:
import importlib
importlib.reload(classifier_utils)
results_tree, count_tree = classifier_utils.projects_feature_selection(projects, non_features_columns, rf, 'tree')

In [6]:
results_tree

Unnamed: 0,project,N,# attr.,# attr. fs,accuracy,accuracy_fs,improvement
0,Ramblurr__Anki-Android,759.0,129.0,33.0,0.74,0.734,-0.009
1,apache__directory-server,652.0,96.0,19.0,0.939,0.925,-0.015
2,android__platform_frameworks_base,2460.0,566.0,54.0,0.818,0.815,-0.004
3,freenet__fred,1012.0,134.0,47.0,0.679,0.68,0.003
4,alexo__wro4j,1368.0,107.0,30.0,0.586,0.577,-0.015
5,apache__lucene-solr,974.0,124.0,42.0,0.65,0.639,-0.017
6,elastic__elasticsearch,,,,,,
7,getrailo__railo,572.0,90.0,30.0,0.71,0.71,-0.0
8,atlasapi__atlas,782.0,124.0,42.0,0.662,0.652,-0.015
9,hibernate__hibernate-orm,716.0,131.0,35.0,0.598,0.595,-0.005


# Recursive feature elimination

First, the estimator is trained on the initial set of features and the importance of each feature is obtained. Then the least important features are pruned from the current set of features. This procedure is repeated recursively on the pruned set until the desired number of features is reached.
Currently we discard 1 feature per step and use 5-fold cross-validation to compute the accuracy at each step.
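
The procedure described above matches what scikit-learn's RFECV does with step=1 and cv=5. The following is a minimal sketch on synthetic data, not the code inside classifier_utils (which is not shown here); the names rf_demo and rfecv are hypothetical and the forest is kept small so the example runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Synthetic stand-in for one project's feature matrix (not the notebook's data).
X, y = make_classification(
    n_samples=200, n_features=12, n_informative=4, random_state=99
)

rf_demo = RandomForestClassifier(n_estimators=30, max_features=0.3, random_state=99)

# step=1 drops one feature per iteration; cv=5 scores every candidate
# feature count with 5-fold cross-validation on accuracy.
rfecv = RFECV(estimator=rf_demo, step=1, cv=5, scoring="accuracy")
rfecv.fit(X, y)

X_selected = rfecv.transform(X)  # keeps only the selected features
print(f"selected {rfecv.n_features_} of {X.shape[1]} features")
```

With RFECV the number of features to keep is not fixed in advance: it is the feature count with the best cross-validated score.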

In [7]:
import importlib
importlib.reload(classifier_utils)
results_recursive, count_recursive = classifier_utils.projects_feature_selection(projects, non_features_columns, rf, 'recursive')

In [8]:
results_recursive

Unnamed: 0,project,N,# attr.,# attr. fs,accuracy,accuracy_fs,improvement
0,Ramblurr__Anki-Android,759.0,129.0,120.0,0.74,0.748,0.03
1,apache__directory-server,652.0,96.0,41.0,0.939,0.937,-0.002
2,android__platform_frameworks_base,2460.0,566.0,138.0,0.818,0.816,-0.002
3,freenet__fred,1012.0,134.0,37.0,0.679,0.685,0.018
4,alexo__wro4j,1368.0,107.0,101.0,0.586,0.593,0.018
5,apache__lucene-solr,974.0,124.0,107.0,0.65,0.651,0.003
6,elastic__elasticsearch,,,,,,
7,getrailo__railo,572.0,90.0,67.0,0.71,0.701,-0.012
8,atlasapi__atlas,782.0,124.0,114.0,0.662,0.652,-0.015
9,hibernate__hibernate-orm,716.0,131.0,127.0,0.598,0.592,-0.009


## Comparison

In [9]:
df_inner = pd.merge(results_tree, results_recursive, on='project', how='inner', suffixes=('_tree', '_rec'))

accuracy_inner = df_inner.filter(regex=("project|accuracy.*")).copy()
accuracy_inner['improvement_tree'] = accuracy_inner.apply(lambda x: classifier_utils.get_normalized_improvement(x['accuracy_fs_tree'], x['accuracy_tree']), axis=1)
accuracy_inner['improvement_rec'] = accuracy_inner.apply(lambda x: classifier_utils.get_normalized_improvement(x['accuracy_fs_rec'], x['accuracy_rec']), axis=1)
accuracy_inner = accuracy_inner.round(3)
accuracy_inner

Unnamed: 0,project,accuracy_tree,accuracy_fs_tree,accuracy_rec,accuracy_fs_rec,improvement_tree,improvement_rec
0,Ramblurr__Anki-Android,0.74,0.734,0.74,0.748,-0.008,0.031
1,apache__directory-server,0.939,0.925,0.939,0.937,-0.015,-0.002
2,android__platform_frameworks_base,0.818,0.815,0.818,0.816,-0.004,-0.002
3,freenet__fred,0.679,0.68,0.679,0.685,0.003,0.019
4,alexo__wro4j,0.586,0.577,0.586,0.593,-0.015,0.017
5,apache__lucene-solr,0.65,0.639,0.65,0.651,-0.017,0.003
6,elastic__elasticsearch,,,,,,
7,getrailo__railo,0.71,0.71,0.71,0.701,0.0,-0.013
8,atlasapi__atlas,0.662,0.652,0.662,0.652,-0.015,-0.015
9,hibernate__hibernate-orm,0.598,0.595,0.598,0.592,-0.005,-0.01
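
The helper classifier_utils.get_normalized_improvement is not shown in this notebook. The rounded values in the table above are consistent with the definition sketched below, which is an assumption and not the actual implementation: a gain is divided by the remaining headroom (1 - accuracy), and a loss is divided by the baseline accuracy. The function name normalized_improvement is hypothetical.

```python
def normalized_improvement(accuracy_fs, accuracy):
    """Assumed definition, inferred from the table; not the real helper.

    A gain is expressed as a fraction of the remaining headroom,
    a loss as a fraction of the baseline accuracy.
    """
    diff = accuracy_fs - accuracy
    if diff == 0:
        return 0.0
    if diff > 0:
        return diff / (1 - accuracy)
    return diff / accuracy


# Ramblurr__Anki-Android, recursive method: 0.74 -> 0.748
print(round(normalized_improvement(0.748, 0.74), 3))  # 0.031
# Ramblurr__Anki-Android, tree method: 0.74 -> 0.734
print(round(normalized_improvement(0.734, 0.74), 3))  # -0.008
```

Under this reading, the same absolute change counts for more when the baseline accuracy is already high, because little headroom is left.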


## Ranking of features selected by tree method

Number of projects in which each feature was selected by the tree method.
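
A count table of this shape can be built from per-project selections as sketched below. The dictionary selected_by_project and its contents are hypothetical toy data; how classifier_utils actually builds count_tree is not shown here.

```python
import pandas as pd

# Hypothetical toy input: the features selected for each project.
selected_by_project = {
    "project_a": ["fileCC", "fileSize", "chunkRelSize"],
    "project_b": ["fileCC", "chunkRelSize"],
    "project_c": ["fileCC", "chunkPosition"],
}

# Flatten the selections and count in how many projects each feature appears.
all_selected = [f for feats in selected_by_project.values() for f in feats]
counts = pd.Series(all_selected).value_counts().rename("Count").to_frame()

# Most frequently selected features first.
ranking = counts.sort_values("Count", ascending=False)
print(ranking)
```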

In [10]:
count_tree.sort_values(['Count'], ascending=False).head(50)

Unnamed: 0,Count
fileCC,25
chunk_right_rel_size,24
chunk_right_abs_size,24
chunk_left_rel_size,24
chunk_left_abs_size,24
fileSize,24
chunkRelSize,24
chunkAbsSize,24
chunkPosition,23
Merge isolation time,22


## Ranking of features selected by recursive method

Number of projects in which each feature was selected by the recursive method.

In [11]:
count_recursive.sort_values(['Count'], ascending=False).head(50)

Unnamed: 0,Count
chunkRelSize,25
chunk_left_rel_size,24
fileSize,24
fileCC,24
keyword_use,23
self_conflict_perc,23
Changed files 1,23
chunk_right_rel_size,23
chunk_right_abs_size,23
chunk_left_abs_size,23
