adding start of work for association analysis
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
vsoch committed Nov 18, 2021
1 parent 41e7179 commit 6f1a15d
Showing 13 changed files with 4,108 additions and 1,133 deletions.
5 changes: 5 additions & 0 deletions README.md
@@ -14,6 +14,8 @@ For the last point, the easiest thing to do is have the script time itself.

## Usage

Basic examples are provided below. A more extensive analysis is in [association-analysis](association-analysis)

### Dependencies

```bash
@@ -30,6 +32,8 @@ $ python compilerops.py gen g++

Will generate filtered [data/gpp_flags.json](data/gpp_flags.json)

**Important**: the first two times I ran the Monte Carlo and Tabu searches I included warning flags, and later removed these.
The original data (suffix `_warnings.json`) is included in the data folder.

### Running Models

@@ -92,3 +96,4 @@ $ tree data/results/tabu
And with 100 iterations we find some good combinations!

![data/results/tabu/2/gpp_flags_results.png](data/results/tabu/2/gpp_flags_results.png)

70 changes: 70 additions & 0 deletions association-analysis/README.md
@@ -0,0 +1,70 @@
# Association Analysis

> People don't know which flags to use in different scenarios.

This is a more scaled-up version of the original Monte Carlo simulation. For this small
analysis we have updated [montecarlo.py](montecarlo.py) so that it can run in parallel,
each time running some number of iterations over a different script. We have
also moved this processing into temporary directories to keep the repository
a bit neater.

0. Make it so montecarlo can run in parallel
1. Run Monte Carlo on all analyses here: https://github.com/sinairv/Cpp-Tutorial-Samples
2. Extract "features" of each with GoSmeagle (or similar, to get code structure) and turn them into matrices
3. Make associations between features and compile flags or compile time
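Step 3 above (associating features with compile flags or compile time) could be sketched as a simple correlation between token-count features and the best compile time found per program. This is a minimal sketch with made-up numbers; the feature names and values are hypothetical, not the repository's actual data:

```python
import pandas as pd

# Hypothetical token-count features per program (rows), e.g. produced by
# tokenizing each Prog.cpp, and the fastest compile time found (seconds).
features = pd.DataFrame(
    {
        "for_loops": [2, 0, 5, 1],
        "strings": [1, 3, 0, 2],
        "arrays": [4, 0, 2, 1],
    },
    index=["prog_a", "prog_b", "prog_c", "prog_d"],
)
compile_time = pd.Series([0.42, 0.31, 0.77, 0.35], index=features.index)

# Correlate each feature column with compile time: a rough signal for
# "programs with more of feature X tend to compile slower (or faster)"
associations = features.corrwith(compile_time)
print(associations.sort_values())
```

A real analysis would run association rule mining or regression over many programs; this just shows the shape of the matrices involved.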

## Usage

```bash
$ python -m venv env
$ source env/bin/activate
$ pip install -r requirements.txt
```

### Generate Flags

The flags should already have been generated one folder up in [data](../data). Note that we are using
the set without warnings (high 700s).

### Run Analysis!

Provide a path to the flags file and the filename to look for to compile; the root directory defaults to the present working directory:

```bash
$ python montecarlo.py run ../data/gpp_flags.json Prog.cpp
```

In practice, I found that using GNU parallel made more sense than managing worker processes in Python.
Here is how to test a single script:

```bash
$ python montecarlo-parallel.py run ../data/gpp_flags.json "./examples/sizeof Operator/Prog.cpp" --outdir-num 1 --num-iter 2000
```

And then to run everything using parallel (`apt-get install -y parallel`):

```bash
$ find ./examples -name "*Prog.cpp" | parallel -I% --max-args 1 python montecarlo-parallel.py run ../data/gpp_flags.json "%" --outdir-num 1 --num-iter 2000
```

There is a [run.sh](run.sh) script that I used, ultimately running it for a range of 0 through 29 (to generate 30 runs of the same predictions, 100 iterations each).

### Find common flags

After we've run this many times, we'd want to see some kind of signal of common flags across runs. We can calculate the percentage
of time that we see each flag for each result file.

```bash
$ python flag_popularity.py assess data/results/montecarlo-association-analysis
```
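The percentage calculation itself amounts to counting, across N runs, how often each flag appears in the fastest flag set and dividing by N. A minimal sketch with made-up flag sets (the real script reads them out of the JSON result files):

```python
from collections import Counter

# Fastest flag set found in each of four hypothetical runs
runs = [
    ["-O2", "-funroll-loops"],
    ["-O2", "-fomit-frame-pointer"],
    ["-O2", "-funroll-loops"],
    ["-O3"],
]

# Count flag occurrences across runs, then normalize by the run count
counts = Counter(flag for flags in runs for flag in flags)
popularity = {flag: n / len(runs) for flag, n in counts.items()}
print(popularity)  # "-O2" appears in 3 of 4 runs, so 0.75
```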

### Tokenize

Next, we would want to be able to say "It's common for code with arrays to have better performance when compiled with these flags." To
avoid the complexities of parsing assembly (or something like that), we instead tokenize
the CPP code and keep track of counts for the things that we find (e.g., strings, for loops, etc.).
To do that:

```bash
$ python create_token_features.py examples Prog.cpp
```
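The token counting can be approximated with a handful of regular expressions. A minimal sketch (the patterns here are illustrative and may differ from what create_token_features.py actually matches):

```python
import re

# Illustrative token patterns to count in C++ source
PATTERNS = {
    "for_loops": r"\bfor\s*\(",
    "while_loops": r"\bwhile\s*\(",
    "strings": r'"[^"\n]*"',
    "arrays": r"\w+\s*\[",
}


def count_tokens(source: str) -> dict:
    """Count occurrences of each token pattern in the source."""
    return {name: len(re.findall(pattern, source)) for name, pattern in PATTERNS.items()}


example = """
int main() {
    int values[3] = {1, 2, 3};
    for (int i = 0; i < 3; i++) {
        printf("%d\\n", values[i]);
    }
    return 0;
}
"""

counts = count_tokens(example)
print(counts)
```

Each program's counts become one row of the feature matrix used in the association step.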
1 change: 1 addition & 0 deletions association-analysis/examples
Submodule examples added at 29ac0f
111 changes: 111 additions & 0 deletions association-analysis/flag_popularity.py
@@ -0,0 +1,111 @@
#!/usr/bin/env python3

# This script does the following:
# 1. Loads in result files
# 2. Calculates the percentage of time we see each flag for each file type
# 3. Saves results to file

import argparse
import json
import os
import shlex
import subprocess
import sys
import time
from glob import glob

import numpy as np
import pandas

here = os.path.dirname(os.path.abspath(__file__))


def get_parser():
    parser = argparse.ArgumentParser(description="run")

    description = "Assess flag popularity"
    subparsers = parser.add_subparsers(
        help="actions",
        title="actions",
        description=description,
        dest="command",
    )
    assess = subparsers.add_parser("assess", help="run flag popularity")
    assess.add_argument(
        "results_dir", help="root of results directory with numbered subfolders"
    )
    return parser


def run_command(cmd):
    """
    Run a command, returning (output, return code, elapsed seconds).
    """
    cmd = shlex.split(cmd)
    start = time.time()
    try:
        proc = subprocess.Popen(cmd, stderr=subprocess.STDOUT, stdout=subprocess.PIPE)
        # Time the full run, including waiting for the process to finish
        output = proc.communicate()[0]
    except Exception:
        return "", 1, np.inf
    end = time.time()
    return output, proc.returncode, end - start

def read_json(filename):
    with open(filename, "r") as fd:
        content = json.loads(fd.read())
    return content

def assess_popularity(results_dirs):
    """
    Given a list of result directories, count how often each flag appears
    in the fastest result for each program, and return counts as a data frame.
    """
    flags = {}
    fails = {}
    # TODO should we count the number per run?
    for result_dir in results_dirs:
        for result_file in glob("%s/*.json" % result_dir):
            result_id = os.path.basename(result_file).split(".")[0]
            if result_id not in flags:
                flags[result_id] = {}
            result = read_json(result_file)
            # Results are sorted by time, so the last one is the fastest
            fastest = result["results"][-1]
            if fastest[0] != "Run success":
                fails[result_id] = result["filename"]
                continue
            # Count every flag in the fastest set (repeats across runs are expected)
            for flag in fastest[2]:
                if flag not in flags[result_id]:
                    flags[result_id][flag] = 0
                flags[result_id][flag] += 1
    # Rows are programs, columns are flags, values are counts across runs
    return pandas.DataFrame(flags).T.fillna(0)


def main():
    parser = get_parser()

    def help(return_code=0):
        parser.print_help()
        sys.exit(return_code)

    args, extra = parser.parse_known_args()
    if not args.command:
        help()

    # Load data
    if not args.results_dir or not os.path.exists(args.results_dir):
        sys.exit("%s missing or does not exist." % args.results_dir)

    # Numbered run directories, sorted numerically (ignore anything else)
    results_dirs = sorted(int(x) for x in os.listdir(args.results_dir) if x.isdigit())
    results_dirs = [os.path.join(args.results_dir, str(x)) for x in results_dirs]
    print("Found %s runs!" % len(results_dirs))
    df = assess_popularity(results_dirs)

    # Save the per-program flag counts (step 3 in the header comment)
    df.to_csv("flag_popularity.csv")


if __name__ == "__main__":
    main()
