# Design Pattern Recognition with Software Metrics

## Library/Package Imports
All required modules are imported in the next cell to avoid scattered imports.

In [3]:
# Ignore missing imports warnings in vs code
import pandas as pd

<module 'pandas' from '/opt/conda/lib/python3.11/site-packages/pandas/__init__.py'>


## Generation of metrics

If the metrics are not yet generated, the following steps are required:

1. Make sure that `source_files.zip` is located in the current directory. The archive contains the zipped source code of the projects in [P-MArT](https://www.ptidej.net/tools/designpatterns/) and `pmart.xml`, which describes the micro architectures.
2. Create a new virtual Python environment with `python -m venv .` in the current directory if not yet done
3. Activate the virtual environment ([refer here for the actual command to run](https://docs.python.org/3/library/venv.html#how-venvs-work))
4. Execute `python3 preprocess_source_files.py` to extract the source files from `source_files.zip` and move the source files described in `pmart.xml` into the `dataset` directory. For more information, run `python3 preprocess_source_files.py -h`.
    - Source files are structured as `<dataset_dir>/<design_pattern>/micro_architecture_<id>`
    - Each micro architecture directory contains the following files:
        - `roles.csv`: Roles, entity names and role kind as described in `pmart.xml`
        - `projects.txt`: The project the source files come from
        - The source files to be evaluated
5. Execute `python3 generate_source_file_metrics.py` to generate `metrics.csv`. For more information, run `python3 generate_source_file_metrics.py -h`.
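
After step 4, it can be useful to check the extracted dataset before generating metrics. The following is a minimal sketch, not part of the project's scripts: the helper name `summarize_dataset` is hypothetical, and it assumes the layout described above (one directory per design pattern, one subdirectory per micro architecture, each holding `roles.csv`, `projects.txt` and the source files).

```python
from pathlib import Path

# Metadata files expected in every micro architecture directory (see step 4).
METADATA_FILES = {"roles.csv", "projects.txt"}


def summarize_dataset(dataset_dir):
    """Return one summary dict per micro architecture directory.

    Assumes the layout <dataset_dir>/<design_pattern>/<micro_architecture>/.
    """
    root = Path(dataset_dir)
    summaries = []
    for pattern_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for arch_dir in sorted(p for p in pattern_dir.iterdir() if p.is_dir()):
            # Collect all files below the micro architecture directory,
            # as paths relative to it.
            files = {
                p.relative_to(arch_dir).as_posix()
                for p in arch_dir.rglob("*")
                if p.is_file()
            }
            summaries.append(
                {
                    "design_pattern": pattern_dir.name,
                    "micro_architecture": arch_dir.name,
                    "missing_metadata": sorted(METADATA_FILES - files),
                    "source_files": sorted(files - METADATA_FILES),
                }
            )
    return summaries


if __name__ == "__main__":
    dataset = Path("dataset")  # assumed default output directory of step 4
    if dataset.is_dir():
        for entry in summarize_dataset(dataset):
            if entry["missing_metadata"] or not entry["source_files"]:
                print("incomplete:", entry)
```

Entries reported as incomplete are candidates for the missing source files mentioned in the note below.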

**NOTE**: As the projects in this dataset are old and not all projects listed in P-MArT are still accessible, some source files and their entries in `metrics.csv` may be missing.

## Overview of `metrics.csv`

To detect applied Gang of Four design patterns in source code with machine learning, we first need to transform each source file into a numerical representation that a machine learning model can process.
This notebook does so by generating numerical characteristics (metrics) for each source file in the context of its micro architecture. There are several ways to choose which metrics to include; here we use the metrics described [in this paper](../sources/JSEA-DP-2014.pdf):

- NOF: Number of fields
- NSF: Number of static fields
- NOM: Number of methods
- NSM: Number of static methods
- NOAM: Number of abstract methods
- NORM: Number of overridden methods
- NOPC: Number of private constructors
- NOOF: Number of object fields
- NCOF: Number of other classes with field of own type
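
To illustrate what "numerical representation" means here, the sketch below turns rows of metrics into a feature matrix and a label vector. It is an illustration under stated assumptions, not the notebook's actual pipeline: the helper `to_feature_matrix` and its `label_column` parameter are hypothetical. The column names are taken from the header of `metrics.csv` as shown in this notebook's output, which also contains `NOI` and `NOTC`. The two example rows are copied from that output.

```python
import pandas as pd

# Metric columns as they appear in the metrics.csv header.
METRIC_COLUMNS = [
    "NOF", "NSF", "NOM", "NSM", "NOI", "NOAM",
    "NORM", "NOPC", "NOTC", "NOOF", "NCOF",
]


def to_feature_matrix(metrics, label_column="design_pattern"):
    """Split a metrics DataFrame into a numeric matrix X and labels y."""
    missing = [c for c in METRIC_COLUMNS if c not in metrics.columns]
    if missing:
        raise KeyError(f"missing metric columns: {missing}")
    X = metrics[METRIC_COLUMNS].to_numpy(dtype=float)
    y = metrics[label_column].to_numpy()
    return X, y


# Two rows copied from the metrics.csv output (micro_arch_418).
example = pd.DataFrame(
    [
        ["product", "factorymethod", 1, 1, 3, 0, 1, 0, 3, 0, 0, 1, 0],
        ["concreteCreator", "factorymethod", 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0],
    ],
    columns=["role", "design_pattern"] + METRIC_COLUMNS,
)

X, y = to_feature_matrix(example)
print(X.shape)  # one row per source file, one column per metric
```

Whether the model should predict `design_pattern` or `role` is a design choice; passing `label_column="role"` selects the role instead.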

## Exploratory Data Analysis of the Dataset

In [5]:
df = pd.read_csv('./metrics.csv')
df

| Unnamed: 0 | role | role_kind | entity | design_pattern | micro_architecture | NOF | NSF | NOM | NSM | NOI | NOAM | NORM | NOPC | NOTC | NOOF | NCOF |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | product | Class | edu.rice.cs.drjava.platform.DefaultPlatform | factorymethod | micro_arch_418 | 1 | 1 | 3 | 0 | 1 | 0 | 3 | 0 | 0 | 1 | 0 |
| 1 | concreteProduct | AbstractClass | edu.rice.cs.drjava.platform.PlatformSupport | factorymethod | micro_arch_418 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | concreteCreator | Class | edu.rice.cs.drjava.platform.PlatformFactory | factorymethod | micro_arch_418 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | client | Class | edu.rice.cs.drjava.model.definitions.reducedmo... | iterator | micro_arch_401 | 1 | 0 | 22 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | aggregate | Class | edu.rice.cs.drjava.model.definitions.reducedmo... | iterator | micro_arch_401 | 4 | 0 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1681 | concreteProduct | Class | net.sourceforge.pmd.ast.JavaParser | factory_method | micro_arch_132 | 25 | 4 | 503 | 4 | 2 | 0 | 0 | 0 | 2 | 7 | 0 |
| 1682 | product | AbstractClass | javax.swing.text.PlainDocument | factory_method | micro_arch_2020 | 6 | 2 | 7 | 0 | 0 | 0 | 0 | 0 | 1 | 6 | 0 |
| 1683 | concreteProduct | Class | edu.rice.cs.drjava.model.definitions.Definitio... | factory_method | micro_arch_2020 | 16 | 6 | 58 | 3 | 1 | 0 | 0 | 0 | 0 | 5 | 0 |
| 1684 | creator | AbstractClass | javax.swing.text.DefaultEditorKit | factory_method | micro_arch_2020 | 56 | 56 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 56 | 0 |
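
The output above contains both `factorymethod` and `factory_method` in the `design_pattern` column, so a reasonable first exploration step is to count the rows per label. The sketch below does this on five rows copied from that output rather than on the full `metrics.csv`. Treating labels that differ only by underscores as the same pattern is an assumption, and the `design_pattern_normalized` column is introduced purely for illustration.

```python
import io

import pandas as pd

# Five rows copied from the output above; the full analysis would use metrics.csv.
SAMPLE_CSV = """\
,role,role_kind,entity,design_pattern,micro_architecture,NOF,NSF,NOM,NSM,NOI,NOAM,NORM,NOPC,NOTC,NOOF,NCOF
0,product,Class,edu.rice.cs.drjava.platform.DefaultPlatform,factorymethod,micro_arch_418,1,1,3,0,1,0,3,0,0,1,0
1,concreteProduct,AbstractClass,edu.rice.cs.drjava.platform.PlatformSupport,factorymethod,micro_arch_418,0,0,3,0,0,0,0,0,0,0,1
2,concreteCreator,Class,edu.rice.cs.drjava.platform.PlatformFactory,factorymethod,micro_arch_418,1,1,1,1,0,0,0,0,0,1,0
1682,product,AbstractClass,javax.swing.text.PlainDocument,factory_method,micro_arch_2020,6,2,7,0,0,0,0,0,1,6,0
1684,creator,AbstractClass,javax.swing.text.DefaultEditorKit,factory_method,micro_arch_2020,56,56,10,0,0,0,0,0,0,56,0
"""

# index_col=0 reads the unnamed first column as the row index.
sample = pd.read_csv(io.StringIO(SAMPLE_CSV), index_col=0)

# Rows per label exactly as stored.
raw_counts = sample["design_pattern"].value_counts()

# Assumption: labels that differ only by underscores name the same pattern.
sample["design_pattern_normalized"] = sample["design_pattern"].str.replace(
    "_", "", regex=False
)
normalized_counts = sample["design_pattern_normalized"].value_counts()

print(raw_counts)
print(normalized_counts)
```

If the two spellings do refer to the same pattern, normalizing them before training avoids splitting one class into two.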
