Commit: adding start of work for association analysis
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Showing 13 changed files with 4,108 additions and 1,133 deletions.
# Association Analysis

> People don't know which flags to use in different scenarios.

This is a more scaled version of the original montecarlo simulation. For this small
analysis we have updated [montecarlo.py](montecarlo.py) so that it can run in parallel,
each time with a different script to run some number of iterations over. We have
also moved this processing to happen in temporary directories to keep the repository
a bit neater.

0. Make it so montecarlo can run in parallel
1. Run MonteCarlo on all analyses here: https://github.com/sinairv/Cpp-Tutorial-Samples
2. Extract "features" of each with GoSmeagle (or similar to get code structure), turn into matrices
3. Make associations between features and compile flags or compile time

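The association step (3) can be sketched with pandas. The feature and flag matrices below are made-up placeholders (the repository does not define their format yet); the idea is just to correlate per-program token counts with per-program flag choices.

```python
import pandas as pd

# Hypothetical feature matrix: programs x token counts (placeholder values)
features = pd.DataFrame(
    {"for_loops": [2, 0, 5], "arrays": [1, 3, 0]},
    index=["prog_a", "prog_b", "prog_c"],
)
# Hypothetical flag matrix: programs x 0/1 flag indicators (placeholder values)
flags = pd.DataFrame(
    {"-funroll-loops": [1, 0, 1], "-O3": [1, 1, 1]},
    index=features.index,
)

# Pearson correlation between each feature column and each flag column
assoc = pd.DataFrame(
    {flag: features.corrwith(flags[flag]) for flag in flags.columns}
)
print(assoc)
```

A flag that is always (or never) chosen has zero variance, so its correlations come out as NaN; a real analysis would need to handle that case.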
## Usage

```bash
$ python -m venv env
$ source env/bin/activate
$ pip install -r requirements.txt
```

### Generate Flags

The flags should already have been generated one folder up in [data](../data). Note that we are using
the set without warnings (high 700s).

### Run Analysis!

Provide a path to the flags and the filename to look for to compile (the root
directory defaults to the present working directory):

```bash
$ python montecarlo.py run ../data/gpp_flags.json Prog.cpp
```

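For intuition, one iteration of the Monte Carlo search might look roughly like the following. This is an illustrative sketch, not the actual montecarlo.py logic; the helper name and the `compiler` parameter are invented here.

```python
import random
import subprocess
import time

def try_flags(source, flags, n_sample=5, compiler="g++"):
    """Compile `source` with a random subset of `flags`.

    Returns (chosen flags, elapsed seconds, compiler return code).
    """
    chosen = random.sample(flags, min(n_sample, len(flags)))
    start = time.time()
    result = subprocess.run(
        [compiler, *chosen, source, "-o", "prog"], capture_output=True
    )
    elapsed = time.time() - start
    return chosen, elapsed, result.returncode
```

Repeating this many times and keeping the fastest successful compile is the essence of the approach.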
In practice, I found that using GNU parallel made more sense than managing
workers in Python. Here is how to test a single script:

```bash
$ python montecarlo-parallel.py run ../data/gpp_flags.json "./examples/sizeof Operator/Prog.cpp" --outdir-num 1 --num-iter 2000
```

And then to run using parallel (`apt-get install -y parallel`):

```bash
$ find ./examples -name "*Prog.cpp" | parallel -I% --max-args 1 python montecarlo-parallel.py run ../data/gpp_flags.json "%" --outdir-num 1 --num-iter 2000
```

There is a [run.sh](run.sh) script that I used, and ultimately ran it over the range 0 to 29 (to generate 30 runs of the same predictions for 100 iterations each).

### Find common flags

After we've run this many times, we'd want to see some kind of signal of common flags across runs. We can calculate the percentage
of the time that we see each flag in each result file.

```bash
$ python flag_popularity.py assess data/results/montecarlo-association-analysis
```

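The percentage calculation itself is simple; a minimal sketch (the function name is hypothetical, not the actual flag_popularity.py API):

```python
def flag_percentages(counts, n_runs):
    """counts: {flag: number of runs the flag appeared in}."""
    return {flag: 100.0 * seen / n_runs for flag, seen in counts.items()}

# e.g. a flag seen in 27 of 30 runs appears 90% of the time
print(flag_percentages({"-O3": 27, "-funroll-loops": 9}, n_runs=30))
# → {'-O3': 90.0, '-funroll-loops': 30.0}
```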
### Tokenize

Next, we would want to be able to say "it's common for code with arrays to have better performance when compiled with these flags." To
avoid the complexities of parsing assembly (or something like that), we are instead going to tokenize
the C++ code and keep track of counts for the things that we find (e.g., strings, for loops, etc.).
To do that:

```bash
$ python create_token_features.py examples Prog.cpp
```
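The tokenization can be as simple as counting pattern matches. A rough sketch follows; the patterns and feature names are assumptions, not necessarily what create_token_features.py does.

```python
import re

# Very rough C++ "token" patterns; real tokenization would be more careful
PATTERNS = {
    "for_loops": r"\bfor\s*\(",
    "while_loops": r"\bwhile\s*\(",
    "strings": r'"[^"\n]*"',
}

def count_tokens(source_code):
    """Count occurrences of each pattern in a source string."""
    return {name: len(re.findall(pat, source_code)) for name, pat in PATTERNS.items()}

example = 'for (int i = 0; i < n; i++) { std::cout << "hi"; }'
print(count_tokens(example))  # → {'for_loops': 1, 'while_loops': 0, 'strings': 1}
```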
Submodule examples added at 29ac0f
#!/usr/bin/env python3

# This script does the following:
# 1. Loads in results files
# 2. Calculates the percentage of time we see each flag for each file type
# 3. Saves results to file

import argparse
import json
import os
import shlex
import subprocess
import sys
import time
from glob import glob

import numpy as np
import pandas

here = os.path.dirname(os.path.abspath(__file__))


def get_parser():
    parser = argparse.ArgumentParser(description="run")

    description = "Assess flag popularity"
    subparsers = parser.add_subparsers(
        help="actions",
        title="actions",
        description=description,
        dest="command",
    )
    assess = subparsers.add_parser("assess", help="run flag popularity")
    assess.add_argument("results_dir", help="root of results directory with numbered subfolders")
    return parser


def run_command(cmd):
    """
    Run a command, returning (output, return code, elapsed seconds).
    """
    cmd = shlex.split(cmd)
    try:
        start = time.time()
        process = subprocess.Popen(cmd, stderr=subprocess.STDOUT, stdout=subprocess.PIPE)
        # Wait for the process to finish before stopping the timer
        output = process.communicate()[0]
        end = time.time()
    except Exception:
        return "", 1, np.inf
    return output, process.returncode, end - start


def read_json(filename):
    with open(filename, "r") as fd:
        content = json.loads(fd.read())
    return content


def assess_popularity(results_dirs):
    """
    Given a list of result directories, assess flag popularity.
    """
    flags = {}
    fails = {}
    # TODO should we count the number per run?
    for result_dir in results_dirs:
        results = glob("%s/*.json" % result_dir)
        for result in results:
            result_id = os.path.basename(result).split(".")[0]
            if result_id not in flags:
                flags[result_id] = {}
            result = read_json(result)
            # The last result is the fastest
            fastest = result["results"][-1]
            if fastest[0] != "Run success":
                fails[result_id] = result["filename"]
                continue
            # Count how many runs each flag appeared in for this result id
            for flag in fastest[2]:
                if flag not in flags[result_id]:
                    flags[result_id][flag] = 0
                flags[result_id][flag] += 1
    if fails:
        print("Found %s failed runs." % len(fails))
    # Rows are flags, columns are result ids; missing flags become zero counts
    return pandas.DataFrame(flags).fillna(0)


def main():
    parser = get_parser()

    def help(return_code=0):
        parser.print_help()
        sys.exit(return_code)

    args, extra = parser.parse_known_args()
    if not args.command:
        help()

    # Load data
    if not args.results_dir or not os.path.exists(args.results_dir):
        sys.exit("%s missing or does not exist." % args.results_dir)

    results_dirs = sorted([int(x) for x in os.listdir(args.results_dir)])
    results_dirs = [os.path.join(args.results_dir, str(x)) for x in results_dirs]
    print("Found %s runs!" % len(results_dirs))
    df = assess_popularity(results_dirs)
    # Save the counts to file (the filename here is an arbitrary choice)
    df.to_csv(os.path.join(here, "flag-popularity.csv"))


if __name__ == "__main__":
    main()