adding start of work for association analysis
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
vsoch committed Nov 18, 2021
1 parent 41e7179 commit 6f1a15d
Showing 13 changed files with 4,108 additions and 1,133 deletions.
5 changes: 5 additions & 0 deletions README.md
@@ -14,6 +14,8 @@ For the last point, the easiest thing to do is have the script time itself.

## Usage

Basic examples are provided below. A more extensive analysis is in [association-analysis](association-analysis)

### Dependencies

```bash
@@ -30,6 +32,8 @@ $ python compilerops.py gen g++

Will generate filtered [data/gpp_flags.json](data/gpp_flags.json)

**Important**: the first two times I ran the Monte Carlo and Tabu searches I included warning flags, and later removed these.
The original data (suffix `_warnings.json`) is included in the data folder.

### Running Models

@@ -92,3 +96,4 @@ $ tree data/results/tabu
And with 100 iterations we find some good combinations!

![data/results/tabu/2/gpp_flags_results.png](data/results/tabu/2/gpp_flags_results.png)

70 changes: 70 additions & 0 deletions association-analysis/README.md
@@ -0,0 +1,70 @@
# Association Analysis

> People don't know which flags to use in different scenarios.

This is a more scaled-up version of the original Monte Carlo simulation. For this small
analysis we have updated [montecarlo.py](montecarlo.py) so that it can run in parallel,
each time running some number of iterations over a different script. We have
also moved this processing into temporary directories to keep the repository
a bit neater.

0. Make it so montecarlo can run in parallel
1. Run Monte Carlo on all analyses here: https://github.com/sinairv/Cpp-Tutorial-Samples
2. Extract "features" of each with GoSmeagle (or similar, to get code structure) and turn them into matrices
3. Make associations between features and compile flags or compile time
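Step 3 above (associating features with compile flags or compile time) could be sketched as a simple correlation between token-count features and the best compile time found per program. This is a minimal sketch with made-up numbers; the feature names and values are hypothetical, not the repository's actual data:

```python
import pandas as pd

# Hypothetical token-count features per program (rows), e.g. produced by
# tokenizing each Prog.cpp, and the fastest compile time found (seconds).
features = pd.DataFrame(
    {
        "for_loops": [2, 0, 5, 1],
        "strings": [1, 3, 0, 2],
        "arrays": [4, 0, 2, 1],
    },
    index=["prog_a", "prog_b", "prog_c", "prog_d"],
)
compile_time = pd.Series([0.42, 0.31, 0.77, 0.35], index=features.index)

# Correlate each feature column with compile time: a rough signal for
# "programs with more of feature X tend to compile slower (or faster)"
associations = features.corrwith(compile_time)
print(associations.sort_values())
```

A real analysis would run association rule mining or regression over many programs; this just shows the shape of the matrices involved.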

## Usage

```bash
$ python -m venv env
$ source env/bin/activate
$ pip install -r requirements.txt
```

### Generate Flags

The flags should already have been generated one folder up in [data](../data). Note that we are using
the set without warnings (high 700s).

### Run Analysis!

Provide a path to the flags file and the filename to look for to compile; the root directory defaults to the present working directory:

```bash
$ python montecarlo.py run ../data/gpp_flags.json Prog.cpp
```

In practice, I found that using GNU parallel made more sense than managing worker processes in Python.
Here is how to test a single script:

```bash
$ python montecarlo-parallel.py run ../data/gpp_flags.json "./examples/sizeof Operator/Prog.cpp" --outdir-num 1 --num-iter 2000
```

And then to run everything using parallel (`apt-get install -y parallel`):

```bash
$ find ./examples -name "*Prog.cpp" | parallel -I% --max-args 1 python montecarlo-parallel.py run ../data/gpp_flags.json "%" --outdir-num 1 --num-iter 2000
```

There is a [run.sh](run.sh) script that I used, ultimately running it for a range of 0 through 29 (to generate 30 runs of the same predictions, 100 iterations each).

### Find common flags

After we've run this many times, we'd want to see some kind of signal of common flags across runs. We can calculate the percentage
of time that we see each flag for each result file.

```bash
$ python flag_popularity.py assess data/results/montecarlo-association-analysis
```
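The percentage calculation itself amounts to counting, across N runs, how often each flag appears in the fastest flag set and dividing by N. A minimal sketch with made-up flag sets (the real script reads them out of the JSON result files):

```python
from collections import Counter

# Fastest flag set found in each of four hypothetical runs
runs = [
    ["-O2", "-funroll-loops"],
    ["-O2", "-fomit-frame-pointer"],
    ["-O2", "-funroll-loops"],
    ["-O3"],
]

# Count flag occurrences across runs, then normalize by the run count
counts = Counter(flag for flags in runs for flag in flags)
popularity = {flag: n / len(runs) for flag, n in counts.items()}
print(popularity)  # "-O2" appears in 3 of 4 runs, so 0.75
```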

### Tokenize

Next, we would want to be able to say "It's common for code with arrays to have better performance when compiled with these flags." To
avoid the complexities of parsing assembly (or something like that), we instead tokenize
the CPP code and keep track of counts for the things that we find (e.g., strings, for loops, etc.).
To do that:

```bash
$ python create_token_features.py examples Prog.cpp
```
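The token counting can be approximated with a handful of regular expressions. A minimal sketch (the patterns here are illustrative and may differ from what create_token_features.py actually matches):

```python
import re

# Illustrative token patterns to count in C++ source
PATTERNS = {
    "for_loops": r"\bfor\s*\(",
    "while_loops": r"\bwhile\s*\(",
    "strings": r'"[^"\n]*"',
    "arrays": r"\w+\s*\[",
}


def count_tokens(source: str) -> dict:
    """Count occurrences of each token pattern in the source."""
    return {name: len(re.findall(pattern, source)) for name, pattern in PATTERNS.items()}


example = """
int main() {
    int values[3] = {1, 2, 3};
    for (int i = 0; i < 3; i++) {
        printf("%d\\n", values[i]);
    }
    return 0;
}
"""

counts = count_tokens(example)
print(counts)
```

Each program's counts become one row of the feature matrix used in the association step.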
1 change: 1 addition & 0 deletions association-analysis/examples
Submodule examples added at 29ac0f
111 changes: 111 additions & 0 deletions association-analysis/flag_popularity.py
@@ -0,0 +1,111 @@
#!/usr/bin/env python3

# This script does the following:
# 1. Loads in result files
# 2. Calculates the percentage of time we see each flag for each file type
# 3. Saves results to file

import argparse
import json
import os
import shlex
import subprocess
import sys
import time
from glob import glob

import numpy as np
import pandas

here = os.path.dirname(os.path.abspath(__file__))


def get_parser():
    parser = argparse.ArgumentParser(description="run")

    description = "Assess flag popularity"
    subparsers = parser.add_subparsers(
        help="actions",
        title="actions",
        description=description,
        dest="command",
    )
    assess = subparsers.add_parser("assess", help="run flag popularity")
    assess.add_argument(
        "results_dir", help="root of results directory with numbered subfolders"
    )
    return parser


def run_command(cmd):
    """
    Run a command, returning (output, return code, elapsed seconds).
    """
    cmd = shlex.split(cmd)
    start = time.time()
    try:
        proc = subprocess.Popen(cmd, stderr=subprocess.STDOUT, stdout=subprocess.PIPE)
        # Time the full run, including waiting for the process to finish
        output = proc.communicate()[0]
    except Exception:
        return "", 1, np.inf
    end = time.time()
    return output, proc.returncode, end - start

def read_json(filename):
    with open(filename, "r") as fd:
        content = json.loads(fd.read())
    return content

def assess_popularity(results_dirs):
    """
    Given a list of result directories, count how often each flag appears
    in the fastest result for each program, and return counts as a data frame.
    """
    flags = {}
    fails = {}
    # TODO should we count the number per run?
    for result_dir in results_dirs:
        for result_file in glob("%s/*.json" % result_dir):
            result_id = os.path.basename(result_file).split(".")[0]
            if result_id not in flags:
                flags[result_id] = {}
            result = read_json(result_file)
            # Results are sorted by time, so the last one is the fastest
            fastest = result["results"][-1]
            if fastest[0] != "Run success":
                fails[result_id] = result["filename"]
                continue
            # Count every flag in the fastest set (repeats across runs are expected)
            for flag in fastest[2]:
                if flag not in flags[result_id]:
                    flags[result_id][flag] = 0
                flags[result_id][flag] += 1
    # Rows are programs, columns are flags, values are counts across runs
    return pandas.DataFrame(flags).T.fillna(0)


def main():
    parser = get_parser()

    def help(return_code=0):
        parser.print_help()
        sys.exit(return_code)

    args, extra = parser.parse_known_args()
    if not args.command:
        help()

    # Load data
    if not args.results_dir or not os.path.exists(args.results_dir):
        sys.exit("%s missing or does not exist." % args.results_dir)

    # Numbered run directories, sorted numerically (ignore anything else)
    results_dirs = sorted(int(x) for x in os.listdir(args.results_dir) if x.isdigit())
    results_dirs = [os.path.join(args.results_dir, str(x)) for x in results_dirs]
    print("Found %s runs!" % len(results_dirs))
    df = assess_popularity(results_dirs)

    # Save the per-program flag counts (step 3 in the header comment)
    df.to_csv("flag_popularity.csv")


if __name__ == "__main__":
    main()
