Goal of this notebook:

Explore Random Forest (XGBoost) hyperparameters to find the single combination that performs best across all projects.

Outputs: 
- Validation curves for each hyperparameter
- Best combination of Random Forest (XGBoost) hyperparameters

In [2]:
import numpy as np
import pandas as pd
from IPython.display import display  # IPython.core.display.display is deprecated
from matplotlib import pyplot as plt

pd.set_option('display.max_columns', None)
from xgboost import XGBRFClassifier
import warnings
import classifier_utils
warnings.filterwarnings("ignore", category=UserWarning)

from sklearn.impute import SimpleImputer
import math

In [3]:
non_features_columns = ["chunk_id", "line_start", "line_end", "line_separator", "kind_conflict", "url", "project"]
non_features_columns.extend(["project_user", "project_name", "path", "file_name", "sha", "leftsha", "rightsha", "basesha"])

In [4]:
selected_dataset = pd.read_csv("../../data/SELECTED_LABELLED_DATASET.csv")
projects = list(selected_dataset['project'].unique())

In [5]:
df_training = pd.read_csv("../../data/dataset-training.csv")
df_na = df_training[df_training.isna().any(axis=1)]

len(df_na) / len(df_training)

0.28144947636066214

### Base classifier

In [10]:
rf_xg = XGBRFClassifier(random_state=99, subsample=0.9, eval_metric='mlogloss')

In [11]:
result_rf_xg = classifier_utils.ProjectsResults(rf_xg, projects, non_features_columns)

In [12]:
report_rf_xg = result_rf_xg.get_report_df(include_overall=True)
report_rf_xg

Unnamed: 0,project,observations,observations (wt NaN),precision,recall,f1-score,accuracy,baseline (majority),improvement
0,CCI-MIT__XCoLab,5512,3757,0.971,0.974,0.972,0.974,0.573,0.94
1,apache__directory-server,845,652,0.934,0.934,0.934,0.934,0.512,0.865
2,jgralab__jgralab,2072,1802,0.819,0.822,0.813,0.822,0.491,0.65
3,Unidata__thredds,1154,950,0.897,0.905,0.896,0.905,0.777,0.575
4,apache__accumulo,4113,3148,0.827,0.839,0.831,0.839,0.635,0.56
5,TeamDev-Ltd__OpenFaces,2979,2859,0.966,0.97,0.967,0.97,0.938,0.506
6,getrailo__railo,815,572,0.675,0.691,0.681,0.691,0.378,0.503
7,CloudStack-extras__CloudStack-archive,1424,1106,0.705,0.711,0.695,0.711,0.437,0.486
8,apache__lucene-solr,1256,974,0.606,0.608,0.607,0.608,0.266,0.466
9,Ramblurr__Anki-Android,892,759,0.661,0.7,0.677,0.7,0.439,0.465
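
The `improvement` column is not defined in this notebook. Its values are consistent with a normalized improvement over the majority-class baseline, `(accuracy - baseline) / (1 - baseline)`. This is an inference from the numbers in the table, not something confirmed by the `classifier_utils` source, so the sketch below is only a plausible reading. The helper name `normalized_improvement` is hypothetical.

```python
def normalized_improvement(accuracy: float, baseline: float) -> float:
    """Share of the gap between the majority baseline and a perfect score that the model closes."""
    if baseline >= 1.0:
        raise ValueError("baseline must be below 1.0")
    return (accuracy - baseline) / (1.0 - baseline)


# (accuracy, baseline) pairs copied from the report table above.
rows = {
    "CCI-MIT__XCoLab": (0.974, 0.573),
    "apache__directory-server": (0.934, 0.512),
    "getrailo__railo": (0.691, 0.378),
}

# Assumed formula; compare the printed values with the table's improvement column.
improvements = {project: normalized_improvement(acc, base)
                for project, (acc, base) in rows.items()}
for project, value in improvements.items():
    print(f"{project}: {value:.3f}")
```

Small differences from the table are expected because the inputs used here are already rounded to three decimals.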


### Validation curves

- n_estimators (default 100) -> [10, 50, 100, 200, 400, 600, 800, 1000]
- colsample_bynode (fraction of features sampled at each split point; documented range (0, 1]) -> 0.1 to 1.0 with .1 step

#### n_estimators

n_estimators : int, default=100

The number of trees in the forest.
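
The internals of `classifier_utils.get_validation_curve_all` are not shown in this notebook. As a rough illustration of what a validation curve computes, the sketch below uses scikit-learn's `validation_curve` on synthetic data. It swaps in `RandomForestClassifier` for `XGBRFClassifier` and uses a shortened `n_estimators` grid, so it only illustrates the idea and does not reproduce the notebook's helper or its results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

# Synthetic stand-in data; the notebook's real features come from the project datasets.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=99)

param_range = [5, 20, 50]  # shortened n_estimators grid to keep the sketch fast

# For each value in param_range, fit on the training folds and score on the held-out fold.
train_scores, valid_scores = validation_curve(
    RandomForestClassifier(random_state=99),
    X, y,
    param_name="n_estimators",
    param_range=param_range,
    cv=3,
    scoring="accuracy",
)

# Rows are parameter values, columns are CV folds; average over folds to get one curve each.
mean_train = train_scores.mean(axis=1)
mean_valid = valid_scores.mean(axis=1)
for n, tr, va in zip(param_range, mean_train, mean_valid):
    print(f"n_estimators={n:3d}  train={tr:.3f}  validation={va:.3f}")
```

A widening gap between the training and validation means as the parameter grows is the usual sign of overfitting.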

### Overall

In [14]:
import importlib
importlib.reload(classifier_utils)
classifier_utils.get_validation_curve_all(projects, rf_xg, 'n_estimators',
                                        [10,50,100,200,400,600,800,1000],
                                        non_features_columns)

KeyboardInterrupt: 

### Per project

In [None]:
import importlib
importlib.reload(classifier_utils)
classifier_utils.plot_validation_curves(projects, rf_xg, 'n_estimators',
                                        [10,50,100,200,400,600,800,1000],
                                        non_features_columns)

#### colsample_bynode

colsample_bynode : float, default=0.8

Fraction of features (columns) randomly sampled at each split point, in the range (0, 1].

### Overall

In [None]:
import importlib
importlib.reload(classifier_utils)
classifier_utils.get_validation_curve_all(projects, rf_xg, 'colsample_bynode',
                                        # np.range does not exist; XGBoost documents the range as (0, 1]
                                        np.linspace(0.1, 1.0, 10),
                                        non_features_columns)

### Per project

In [None]:
import importlib
importlib.reload(classifier_utils)
classifier_utils.plot_validation_curves(projects, rf_xg, 'colsample_bynode',
                                        # np.range does not exist; XGBoost documents the range as (0, 1]
                                        np.linspace(0.1, 1.0, 10),
                                        non_features_columns)

### Tuning hyperparameters


Parameter ranges to explore, according to the validation curves:
- min_samples_leaf: 40, 60, 80
    - Increasing this parameter changes little. As it grows, the gap between the two curves tends to narrow.
- criterion: do not use
    - There is no visible difference between the two possible values (gini and entropy) for our data.
- max_depth: None, 2, 4, 12
    - The gap between the training and validation curves widens as this parameter grows, which indicates that the model overfits for larger values of max_depth.




In [14]:
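from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(random_state=99)  # assumed definition, matching the parameters printed below
print("Hyperparameters of Decision Tree:")
dt.get_params()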

Hyperparameters of Decision Tree:


{'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'presort': 'deprecated',
 'random_state': 99,
 'splitter': 'best'}

In [15]:
parameters = {'min_samples_leaf':[40,60,80],
              'max_depth':[None, 2, 4, 12]
                 }

In [16]:
results = classifier_utils.grid_search_all(projects, dt, parameters, non_features_columns)
results.sort_values(['gold_medals', 'silver_medals', 'bronze_medals', 'total_medals'], ascending=False)

Unnamed: 0,min_samples_leaf,max_depth,mean_accuracy,total_medals,gold_medals,silver_medals,bronze_medals
0,40,,0.724026,22,21,0,1
3,40,12.0,0.724026,22,21,0,1
8,80,,0.757681,10,6,0,4
11,80,12.0,0.757681,10,6,0,4
4,60,,0.727515,11,5,0,6
7,60,12.0,0.727515,11,5,0,6
2,40,4.0,0.805778,10,5,0,5
10,80,4.0,0.804348,8,5,0,3
6,60,4.0,0.793315,6,5,0,1
1,40,2.0,0.949622,4,4,0,0
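
The grid above has 3 x 4 = 12 combinations, of which the table shows ten. How `grid_search_all` builds the grid is not shown in this notebook. The row indices in the table are consistent with a plain Cartesian product over the `parameters` dict in insertion order, with the last key varying fastest. The sketch below illustrates that assumed enumeration and is not the helper's actual code.

```python
from itertools import product

parameters = {"min_samples_leaf": [40, 60, 80],
              "max_depth": [None, 2, 4, 12]}

# Cartesian product in dict insertion order: the last key (max_depth) varies fastest.
names = list(parameters)
combinations = [dict(zip(names, values))
                for values in product(*parameters.values())]

# Print each combination with its position, for comparison with the table's index column.
for index, combo in enumerate(combinations):
    print(index, combo)
```

Under this assumption, index 3 is `min_samples_leaf=40, max_depth=12` and index 10 is `min_samples_leaf=80, max_depth=4`, matching the corresponding table rows.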


## Comparing the best-parameter models with the base model

In [17]:
base_model = dt
model_1 = DecisionTreeClassifier(random_state=99, min_samples_leaf=40, max_depth=None)
model_2 = DecisionTreeClassifier(random_state=99, min_samples_leaf=80, max_depth=None)
model_3 = DecisionTreeClassifier(random_state=99, min_samples_leaf=60, max_depth=None)

In [18]:
models = [base_model, model_1, model_2, model_3]
models_names = ['base', 'model1', 'model2', 'model3']
import importlib
importlib.reload(classifier_utils)
comparison = classifier_utils.compare_models(models, models_names, projects, non_features_columns)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_result['model'] = None


In [19]:
comparison.filter(regex=("model|accuracy|precision|recall")).sort_values(['accuracy'], ascending=False)

Unnamed: 0,precision,recall,accuracy,model
0,0.76944,0.76904,0.76904,base
1,0.70628,0.73124,0.73124,model1
3,0.6844,0.72044,0.72044,model3
2,0.67284,0.71592,0.71592,model2
