# TODOs

In [1]:
from IPython.display import display, Markdown

display(Markdown('TODO.md'))

# Stuff to be done

- [x] Evaluate possible ML models
- [x] Add ML pipeline for GridSearch/RandomSearch with HyerOpt
- [x] Initialize code2vec/code2seq as input parameters 
- [x] Package new metric generator in a docker image for easier execution
- [ ] Update usage docs
- [x] Evaluate RFC, TCC, LCC methods [(from here)](https://github.com/mauricioaniche/ck)
- [x] Add shell scripts for running tools in docker container
- [x] Use mean embeddings as feature in classification
- [x] Evaluate current state
- [x] ~~Consider adding code2seq instead of code2vec for generating numerical representation of source code semantics~~ Not suitable for this use case
- [x] Adjust DataFrame to represent mean average vector for embeddings
- [x] Consider applying PCA to reduce the dimensionality of embedding vector
- [X] ~~Consider using the Mean Of a Embedding components as input~~
- [ ] Test hyperopt with kfold
- [ ] Determine final design pattern evaluation

# Design Pattern Recognition with Software Metrics

## Library/Package Imports
All required modules should be in the next cell to avoid scattered imports

In [2]:
# Ignore missing imports warnings in vs code
import pandas as pd
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from typing import Callable
import ipywidgets as widgets
from IPython.display import display, HTML
from typing import Optional, Dict, List
import numpy as np
from enum import Enum, auto
from constants import ClassMetricVectorConstants, get_label_column, get_metric_columns

In [3]:
# Common utility functions
def generate_subplot(df: pd.DataFrame, plot_func: Callable[[pd.DataFrame, str], go.Figure], subplot_width: int = 600, subplot_height: int = 2400) -> go.Figure:
    metric_columns = get_metric_columns()
    subplots = make_subplots(
        len(metric_columns), subplot_titles=metric_columns)
    for i, metric in enumerate(metric_columns):
        figure = plot_func(df, metric)
        subplots.add_trace(figure, row=i+1, col=1)
    subplots['layout'].update(height=subplot_height, width=subplot_width)
    return subplots


def generate_selectable_graph_for_metrics(df: pd.DataFrame, initial_plot_func: Callable[[], go.Figure], update_func: Callable[[go.Figure, pd.DataFrame, str], None], y_label: Optional[str] = None):
    metric_dropdown = widgets.Dropdown(options=get_metric_columns())
    fig = go.FigureWidget(initial_plot_func())

    def on_metric_changed(change):
        metric = change['new']
        with fig.batch_update():
            figure = fig.data[0]
            update_func(figure, df, metric)
            figure['name'] = metric
            label = y_label if y_label else ' '
            fig.update_layout(title=metric, bargap=0.5,
                              xaxis_title=metric, yaxis_title=label)

    metric_dropdown.observe(on_metric_changed, names='value')
    display(widgets.VBox([metric_dropdown, fig]))

## Generation of metrics

If the metrics are not yet generated, the following steps are required:

1. Make sure that `source_files.zip` is located in the current directory. The archive contains the actual zipped source code of the projects in [P-MArT](https://www.ptidej.net/tools/designpatterns/) and `pmart.xml` with descriptions of the micro architectures
2. Create a new virtual Python environment with `python -m venv .` in the current directory if not yet done
3. Activate the virtual environment ([refer here for the actual command to run](https://docs.python.org/3/library/venv.html#how-venvs-work))
4. Execute `python3 preprocess_source_files.py` to extract the source files from `source_files.zip` and move the source files described in `pmart.xml` into `dataset` directory. For more information run `python3 preprocess_source_files.py -h`.
    - Source files are structured as `<dataset_dir>/<design_pattern/micro_architecture_<id>`
    - Each micro architecture directory contains the following files:
        - `roles.csv`: Roles, entity names and role kind as described in `pmart.xml`
        - `projects.txt`: From which project the source files come from
        - The source files to be evaluated
5. 
    - **OLD**: Execute `python3 generate_source_file_metrics.py` to generate `metrics.csv`. For more information run `python3 generate_source_file_metrics.py`.
    - **NEW**: Execute `docker build --file docker/sourcefileparser.dockerfile . -t sourcefilerparser:latest` in the `project` directory to build the tool and run `docker run -v ./:/home/app/volume  -e DATASET_PATH=./dataset -e OUTPUT_CSV=./m.csv sourcefilerparser:latest` for metric generation

**NOTES**: 
- As the projects in this dataset are old and not all projects listed in P-MaRT are not accessible, some source files and their entries in `metrics` may be missing.
- The tool for generating the metrics was originally written with a Java Parser implemented Python only. This lead to parsing issues in some source files. As a result, the tool was rewritten as a Java project with a native parser. The original Python script is included for completeness.

## Overview about `metrics.csv`

In order to detect applied Gang Of Four design patterns in source code with machine learning strategies, we first need to transform the source file into a numerical representation that can be understood by a machine learning model.
This approach aims to solve this by generating numerical characteristics for each source file in the context of the regarded micro architecture. As there are several methods to define what metrics to include in the evaluation, the metrics as described [in this paper](../sources/JSEA-DP-2014.pdf):

- NOF: Number of fields
- NSF: Number of static fields
- NOM: Number of methods
- NSM: Number of static methods
- NOAM: Number of abstract methods
- NORM: Number of overridden methods
- NOPC: Number of private constrcutors
- NOOF: Number of object fields
- NCOF: Number of other classes with field of own type


In addition to these metrics, the following Chidamber & Kemerer object-oriented metrics were added to quantify the relation, coupling and cohesion between participants in a design pattern:

- FAN_IN: Number of input dependencies
- FAN_OUT: Number of output dependencies
- CBO: Coupling between objects
- NOC: Number of inheriting children
- RFC: Response for a class (number of unique method invocations in a class)
- TCC: Tight class cohesion (via direct connections between visible methods, two methods or their invocation trees access the same class variable)
- LCC: Low class cohesion

## Outlier Detection and Removal

As the dataset may contain a varied implementation of datasets, outlier detection and removal may be required to reduce the noise in the dataset. `sklearn` provides the some automatic and unsupervised approaches out of the box. The following are considered

**NOTE**: This list is subject to change

* Isolation Forest
* Local Outlier Factor

In [8]:
# Required imports for this section
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

### Isolation Forest

In [9]:
def apply_isolation_forest(df: pd.DataFrame):
    df_filtered = df.copy()
    isolation_forest = IsolationForest(contamination=0.1)
    df_filtered['outlier'] = isolation_forest.fit_predict(
        df_filtered[get_metric_columns()])
    df_filtered = df_filtered[df_filtered['outlier'] == 1]
    return df_filtered.drop(columns=['outlier'])

### Local Outlier Factor

In [10]:
def apply_local_outlier_factor(df: pd.DataFrame) -> pd.DataFrame:
    threshold = 0

    df_copy = df.copy()
    lof = LocalOutlierFactor(contamination=0.5)
    df_copy['outlier_score'] = lof.fit_predict(df_copy[get_metric_columns()])
    df_copy = df_copy[df_copy['outlier_score'] > threshold]
    return df_copy.drop(columns=('outlier_score'))

## Explorative Data Analysis of the Dataset

In [34]:
df = pd.read_csv('./metrics.csv')
df = df.dropna()
#df = apply_isolation_forest(df)
print(f'{df.shape[0]} rows were imported')

1060 rows were imported


In [16]:
df[ClassMetricVectorConstants.ROLE] = df[ClassMetricVectorConstants.ROLE].str.lower().str.strip()
df[ClassMetricVectorConstants.ROLE_KIND] = df[ClassMetricVectorConstants.ROLE_KIND].str.lower().str.strip()

In [None]:
# Check if columns in dataframe have expected types
df.dtypes

### Filter Dataframe entries by micro architecture

In [9]:
micro_arches = df[ClassMetricVectorConstants.MICRO_ARCHITECTURE].unique().tolist()

def view(micro_arch=''):
    cols = [ClassMetricVectorConstants.ROLE_KIND, ClassMetricVectorConstants.ENTITY] + get_metric_columns()
    display(df[df[ClassMetricVectorConstants.MICRO_ARCHITECTURE] == micro_arch]
            [cols], clear=True)


w = widgets.Dropdown(options=micro_arches)
widgets.interactive(view, micro_arch=w)

interactive(children=(Dropdown(description='micro_arch', options=('micro_arch_35', 'micro_arch_37', 'micro_arc…

### Corelation Between Columns
For each column we caclulate pairwaise the coefficient of corelation with other columns. The value of the coefficient can be interpreteted as:

- between -1.0 and 0: Negative correlation; a increase in one column expects a decrease in the other; the lower the bigger the impact
- equals 0: No correlation
- between 0 and 1: Postive correlation; a increase in one column causes an increase the other; the higher the bigger the impact

In [35]:
df_corr = df[get_metric_columns()].copy()
corr = df_corr.corr()
fig = go.Figure()
fig.add_trace(
    go.Heatmap(
        x=corr.columns,
        y=corr.index,
        z=np.array(corr),
        text=corr.values,
        texttemplate='%{text:.2f}'
    )
)
fig.show()


### Distribution of roles

In [95]:
temp = df.groupby([ClassMetricVectorConstants.ROLE]).size()
temp = temp.sort_values(ascending=False).reset_index()
px.bar(temp, x=ClassMetricVectorConstants.ROLE, y=0).update_layout(yaxis_title='count')

### Distribution of design patterns

In [93]:
df_binned_by_role = df.copy()
df_binned_by_role = df_binned_by_role.drop_duplicates(
    [ClassMetricVectorConstants.MICRO_ARCHITECTURE, ClassMetricVectorConstants.DESIGN_PATTERN]).reset_index()
df_binned_by_role = df_binned_by_role[ClassMetricVectorConstants.DESIGN_PATTERN].value_counts(
).reset_index()

fig = px.histogram(df_binned_by_role, x=ClassMetricVectorConstants.DESIGN_PATTERN, y='count')
fig.update_layout(xaxis_title='Design Pattern',
                  yaxis_title='Count of Design Pattern')

### Distribution for metrics

In [None]:
def initial_histogram():
    return go.Histogram(
        histfunc='count',
    )


def update_histogram(figure: go.Figure, df: pd.DataFrame, metric: str):
    figure['x'] = df[metric]


generate_selectable_graph_for_metrics(
    df, initial_histogram, update_histogram, 'count')

### Box Plots for metrics

In [None]:
def initial_histogram():
    return go.Box(
    )


def update_histogram(figure: go.Figure, df: pd.DataFrame, metric: str):
    figure['x'] = df[metric]


generate_selectable_graph_for_metrics(df, initial_histogram, update_histogram)

## Model Training

As design patterns can be considered as small scale appliances of software architecture, they consist of different entities with different relationships and roles to fulfill in the regarded design pattern. In order to detect design patterns, we first need to detect what kind of role a given Java class or entity it most likely corresponds to. To achieve this, machine learning model capable of classifying multiple labels should be considered. The extracted software metrics are the numerical inputs and the most likely roles in a design pattern are the result. 
As this falls in the area of supervised machine learning, initially the following models/techniques are to be considered:

**NOTE:** This list is subject to change 

* Support Vector Machines
* Tree Classifiers
* Ensemble Classifiers (e.g Random Forest Classifier)
* Custom Convoluted Network

In order to optimize the given results of a given model, first RandomGridSearch is applied to determine a range of values or selection for the hyperparameters while GridSearch is used to determine the most optimal available value or selection for the regarded hyperparameter.

In [63]:
# Required import for machine learning
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import f1_score, accuracy_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
import hpsklearn
import hyperopt
from dataclasses import dataclass, field
import joblib
from sklearn.model_selection import cross_val_score, KFold
import numpy as np
from sklearn.base import clone
from imblearn.over_sampling import RandomOverSampler, SMOTE
from collections import Counter
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

In [64]:
class Dataset:
    train: pd.DataFrame
    test: pd.DataFrame
    label_col: List[str]
    feature_cols: List[str]
    #roleKindEncoder: LabelEncoder
    roleEncoder: LabelEncoder
    dataset: pd.DataFrame
    
    def __init__(self, df: pd.DataFrame):
        df_copy = df.copy().dropna()
        self.label_col = get_label_column()
        self.feature_cols = get_metric_columns()
        self.roleEncoder = LabelEncoder()
        df_copy[ClassMetricVectorConstants.ROLE] = self.roleEncoder.fit_transform(df_copy[ClassMetricVectorConstants.ROLE])
        df_copy = df_copy[self.label_col + self.feature_cols]
        self.train, self.test = train_test_split(df_copy, test_size=0.15, stratify=df_copy[get_label_column()])

    @classmethod
    def top_k_design_patterns(cls, df: pd.DataFrame, k: int) -> "Dataset":
        top = get_top_k_labels(df, k)
        df_dataset = df[df[ClassMetricVectorConstants.DESIGN_PATTERN].isin(top)]
        return cls(df_dataset)
        

    def get_X_train(self):
        return self.train[self.feature_cols]
    
    def get_Y_train(self):
        return self.train[self.label_col].values.ravel()

    def get_X_test(self):
        return self.test[self.feature_cols]

    def get_Y_test(self):
        return self.test[self.label_col].values.ravel()


def get_top_k_labels(df: pd.DataFrame, k: int):
    df_binned_by_role = df.copy()
    df_binned_by_role = df_binned_by_role.drop_duplicates(
        [ClassMetricVectorConstants.MICRO_ARCHITECTURE, ClassMetricVectorConstants.DESIGN_PATTERN])
    df_binned_by_role = df_binned_by_role[ClassMetricVectorConstants.DESIGN_PATTERN].value_counts(
    ).sort_values(ascending=False).head(k)
    return df_binned_by_role.index.to_list()

In [103]:
dataset = Dataset.top_k_design_patterns(df, 4)
top_four = get_top_k_labels(df, 4)

d = df[df[ClassMetricVectorConstants.DESIGN_PATTERN].isin(top_four)].reset_index()
d = d.groupby([ClassMetricVectorConstants.DESIGN_PATTERN, ClassMetricVectorConstants.ROLE]).size()


d.sum()

295

### Support Vector Machines

In [98]:
def apply_svm(dataset: Dataset):
    X_train = dataset.get_X_train()
    y_train = dataset.get_Y_train()
    
    X_test = dataset.get_X_test()
    y_test = dataset.get_Y_test()

    standard_scaler = StandardScaler()
    X_train = standard_scaler.fit_transform(X_train)
    X_test = standard_scaler.fit_transform(X_test)

    svm_classifier = SVC(kernel='rbf', gamma=0.1, C=1.75)
    svm_classifier.fit(X_train, y_train)

    pred = svm_classifier.predict(X_test)
    return svm_classifier.score(X_test, y_test)

apply_svm(dataset)

0.4482758620689655

### Random Forest Classifier

In [99]:
def apply_random_forest(dataset: Dataset):
    X_train = dataset.get_X_train()
    y_train = dataset.get_Y_train()

    X_test = dataset.get_X_test()
    y_test = dataset.get_Y_test()

    random_forest_classifier = RandomForestClassifier(
        max_depth=30, random_state=1)
    random_forest_classifier.fit(X_train, y_train)

    pred = random_forest_classifier.predict(X_test)
    return random_forest_classifier.score(X_test, y_test)


apply_random_forest(dataset)

0.5172413793103449

### Get Best Possible Classifier with hyperopt-sklearn

In [82]:
def apply_hyperopt(dataset: Dataset, evals: int = 10):
    X_train = dataset.get_X_train()
    y_train = dataset.get_Y_train()

    X_test = dataset.get_X_test()
    y_test = dataset.get_Y_test()

    
    chosen_classifiers = [
        #TODO Add SVM + KNN to the mix
        hpsklearn.random_forest_classifier('random_forest'),
        hpsklearn.extra_trees_classifier('extra_trees'),
        hpsklearn.hist_gradient_boosting_classifier('gradient_boosting')
    ]

    p = 1 / len(chosen_classifiers)
    classifiers = hyperopt.hp.pchoice('cls', [(p, c) for c in chosen_classifiers])

    hyper_estimator = hpsklearn.HyperoptEstimator(
        classifier=classifiers,
        preprocessing=[],
        max_evals=evals,
        algo=hyperopt.tpe.suggest,
        trial_timeout=180,
        
    )

    hyper_estimator.fit(X_train, y_train)
    best_model = hyper_estimator.best_model()['learner']
    y_pred = best_model.predict(X_test)
    print(classification_report(y_test, y_pred))
    
    return hyper_estimator.score(X_test, y_test), best_model

apply_hyperopt(dataset, evals=30)
    

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 13.66trial/s, best loss: 0.8125]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  3.44trial/s, best loss: 0.71875]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  5.21trial/s, best loss: 0.71875]
 75%|████████████████████████████████████████████


X has feature names, but RandomForestClassifier was fitted without feature names


Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.


Recall is ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.


Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.


Recall is ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.


Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.


Recall is ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.



(0.27586206896551724,
 RandomForestClassifier(class_weight='balanced_subsample',
                        max_features=0.8760689265763116,
                        min_impurity_decrease=0.02, min_samples_leaf=25,
                        n_estimators=217, n_jobs=1, random_state=0,
                        verbose=False))

In [100]:
def get_best_iteration(df: pd.DataFrame, top_k: int, max_evals: int, k_split: int):
    dataset = Dataset.top_k_design_patterns(df, top_k)
    score, estimator = apply_hyperopt(dataset, evals=max_evals)
    unfitted_estimator = clone(estimator)
    cross_score = cross_val_score(unfitted_estimator, dataset.get_X_train(), dataset.get_Y_train(), cv=k_split, )
    return f'HyperOpt-Score: {score} Mean Cross Validation Score: {cross_score}'
    
    

get_best_iteration(df, 3, 50, 3)

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 21.49trial/s, best loss: 0.5357142857142857]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  8.36trial/s, best loss: 0.4285714285714286]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00,  1.01s/trial, best loss: 0.4285714285714286]
 75%|████████████████████████████████████████████


X has feature names, but ExtraTreesClassifier was fitted without feature names


Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.


Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.


Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.



'HyperOpt-Score: 0.68 Mean Cross Validation Score: [0.46808511 0.46808511 0.56521739]'