37 changes: 37 additions & 0 deletions docs/jlearch/pipeline-training-usage.md
# Pipeline diagram

```mermaid
graph TD
Projects --> ContestEstimator
Selectors --> ContestEstimator
subgraph FeatureGeneration
ContestEstimator --> Tests
Tests --> Features
Tests --> Rewards
end


Features --> Data
Rewards --> Data

Data --> Models
Models --> NNRewardGuidedSelector --> UsualTestGeneration
```

# Training

Briefly:

* Get a dataset `D` by running `ContestEstimator` on several projects with several selectors.
* Train `model_0` on `D`.
* Repeat for several `iterations` (assume we are at step `i`):
  * Get a dataset `D'` by running `ContestEstimator` on several projects with `NNRewardGuidedSelector` backed by `model_i`.
  * $$D = D \cup D'$$
  * Train `model_{i+1}` on `D`.
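The loop above can be sketched as follows; `run_estimator` and `train` are hypothetical placeholders standing in for the real `ContestEstimator` runs and model training, not APIs from this repository:

```python
# Sketch of the iterative dataset-aggregation training loop described above.
# run_estimator and train are hypothetical placeholders, not real APIs.

def train_iteratively(projects, selectors, iterations, run_estimator, train):
    # Bootstrap dataset D from the baseline selectors.
    dataset = []
    for selector in selectors:
        dataset.extend(run_estimator(projects, selector))

    models = [train(dataset)]  # model_0

    for i in range(iterations):
        # Collect D' with the NN-guided selector backed by model_i ...
        new_data = run_estimator(projects, ("NN_REWARD_GUIDED_SELECTOR", models[-1]))
        # ... merge it into D (D = D ∪ D') and train model_{i+1} on the union.
        dataset.extend(new_data)
        models.append(train(dataset))
    return models
```

Note that each new model is trained on the whole accumulated dataset, not only on the data its predecessor generated.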

To do this, you should:
* Make sure the `java` command runs `Java 8` and `JAVA_HOME` points to a `Java 8` installation.
* Put the projects you want to learn on into the `contest_input/projects` folder, then list the classes you want to learn on in `contest_input/classes/<project name>/list` (if the list is empty, all classes from the project jar are taken).
* Run `pip install -r scripts/requirements.txt`. Whether to do this in a virtual environment is up to you.
* List the selectors in `scripts/selector_list` and the projects in `scripts/prog_list`.
* Run `./scripts/train_iteratively.sh`
65 changes: 65 additions & 0 deletions docs/jlearch/scripts.md
# How to use scripts
For each scenario, start from the root of the `UTBotJava` repository; this is `WORKDIR`.

The `PATH_SELECTOR` argument has the form `"PATH_SELECTOR_TYPE [PATH_SELECTOR_PATH for NN] [IS_COMBINED (false by default)] [ITERATIONS]"`.

Before start of work run:
```bash
./scripts/prepare.sh
```

It copies the contest resources into the `contest_input` folder and builds the project. The scripts run from the built jars, so if you change something in the code and want to re-run the scripts, you should run:
```bash
./gradlew clean build -x test
```

## To train a few iterations of your models
By default the features directory is `eval/features`; it must be created beforehand. To change it, you have to edit the source code of the scripts.

List the projects and selectors you want to train on in `scripts/prog_list` and `scripts/selector_list`. Training runs on all methods of all classes from `contest_input/classes/<project name>/list`.

Then just run:
```bash
./scripts/train_iteratively.sh <time_limit> <iterations> <output_dir> <python_command>
```
`<python_command>` is your command for `python3`. When execution finishes you will get the models of all iterations in the `<output_dir>` folder, plus features for each selector and project in `<features_dir>/<selector>/<project>` for each `selector` from `selector_list`, and in `<features_dir>/jlearch/<selector>/<prog>` for the models.

## To run Contest Estimator with coverage
Check that a `srcTestDir` entry for your project exists in the `build.gradle` of `utbot-junit-contest`. If it does not, add `build/output/test/<project>`.

Then just run:
```bash
./scripts/run_with_coverage.sh <project> <time_limit> <path_selector> <selector_alias>
```

When execution finishes you will get a JaCoCo report in the `eval/jacoco/<project>/<selector_alias>/` folder.

## To estimate quality
Just run:
```bash
./scripts/quality_analysis.sh <project> <selector_aliases, separated by comma>
```
It takes the coverage reports from the corresponding report folders (at `eval/jacoco/<project>/<alias>`) and generates charts in `$outputDir/<project>/<timestamp>.html`.
`outputDir` can be changed in `QualityAnalysisConfig`. The result file contains information about 3 metrics:
* $\frac{\sum_{c \in classSet} instCoverage(c)}{|classSet|}$
* $\frac{\sum_{c \in classSet} coveredInstructions(c)}{\sum_{c \in classSet} allInstructions(c)}$
* $\frac{\sum_{c \in classSet} branchCoverage(c)}{|classSet|}$

For each metric and each selector you will have:
* the value of the metric
* a chart with the median, $q_1$, $q_3$ and so on
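Given per-class instruction and branch counts, the three metrics above can be computed as in this sketch (the dictionary field names are illustrative assumptions, not the real report format):

```python
# Sketch: the three quality metrics, computed from per-class counts.
# Field names (covered_inst, total_inst, ...) are assumed for illustration.

def quality_metrics(classes):
    n = len(classes)
    # 1) mean per-class instruction coverage
    mean_inst = sum(c["covered_inst"] / c["total_inst"] for c in classes) / n
    # 2) overall instruction coverage: covered / total summed over all classes
    overall_inst = (sum(c["covered_inst"] for c in classes)
                    / sum(c["total_inst"] for c in classes))
    # 3) mean per-class branch coverage
    mean_branch = sum(c["covered_branch"] / c["total_branch"] for c in classes) / n
    return mean_inst, overall_inst, mean_branch
```

The first and second metrics differ in weighting: the first treats every class equally, while the second weights classes by their instruction count.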


## To scrape solution classes from Codeforces
Note: you can't scrape many classes at once, because the Codeforces API has a request limit.

This can be useful if you want to train Jlearch on classes that usually have no virtual functions but contain many algorithms, i.e. loops and conditions.

Just run:
```bash
python3 path/to/codeforces_scrapper.py --problem_count <val> --submission_count <val> --min_rating <val> --max_rating <val> --output_dir <val>
```

All arguments are optional. Default values: `100`, `10`, `0`, `1500`, `.`.

At the end you should get `submission_count` classes for each of `problem_count` problems with a rating between `min_rating` and `max_rating` in `output_dir`. Each class has the package `p<contest_id>.p<submission_id>`.
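The package naming maps directly onto the output directory layout; a small sketch of the resulting path (helper name is hypothetical):

```python
import os

# Sketch: where a scraped submission ends up, per the description above.
# submission_path is a hypothetical helper, not part of the scraper itself.
def submission_path(output_dir, contest_id, submission_id, class_name):
    # package p<contest_id>.p<submission_id> becomes this directory structure
    return os.path.join(output_dir, f"p{contest_id}", f"p{submission_id}",
                        f"{class_name}.java")
```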
35 changes: 35 additions & 0 deletions docs/jlearch/setup.md
# How to setup environment for experiments on Linux

* Clone repository, go to root
* `chmod +x ./scripts/*` and `chmod +x gradlew`.
* Set `Java 8` as the default and point `JAVA_HOME` to it.
For example:
  * Follow [this guide](https://sdkman.io/install) up to the `Windows installation` section
  * `sdk list java`
  * Find any `Java 8`
  * `sdk install <this java>`
  * `sdk use <this java>`
  * Check `java -version`
* `mkdir -p eval/features`
* `mkdir models`
* Set up the environment for `Python`.
For example:
  * `python3 -m venv /path/to/new/virtual/environment`
  * `source /path/to/venv/bin/activate`
  * Check `which python3`; it should point somewhere inside the `/path/to/venv` folder.
  * `pip install -r scripts/requirements.txt`
* `./scripts/prepare.sh`
* Optionally change `scripts/prog_list` to run on a smaller project, or delete some classes from `contest_input/classes/<project>/list`.

# Default settings and how to change them
* You can reduce the number of models via the `models` variable in `scripts/train_iteratively.sh`
* You can change the amount of required RAM in `run_contest_estimator.sh`: `16 gb` by default
* You can change `batch_size` or `device` in `train.py`: `4096` and `gpu` by default
* If you are completing the setup on a server, you will need to uncomment the tmp directory option in `run_contest_estimator.sh`

# Continue setup
* `scripts/train_iteratively.sh 30 2 models <your python3 command>`
* You should get the trained models in `models/`.
* `mkdir eval/jacoco`
* `./scripts/run_with_coverage.sh <any project (guava-26.0, for example)> 30 "NN_REWARD_GUIDED_SELECTOR path/to/model" some_alias`. `path/to/model` should be something like `models/nn32/0`, where `nn32` is the model type and `0` is the iteration number.
* You should get a JaCoCo report in `eval/jacoco/guava-26.0/some_alias/`
120 changes: 120 additions & 0 deletions scripts/codeforces_scrapper/codeforces_scrapper.py
import argparse
import os.path
import time

from urllib import request
import json
import bs4
import javalang

from codeforces import CodeforcesAPI


def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--problem_count", dest='problem_count', type=int, default=100)
    parser.add_argument("--submission_count", dest='submission_count', type=int, default=10)
    parser.add_argument("--min_rating", dest='min_rating', type=int, default=0)
    parser.add_argument("--max_rating", dest="max_rating", type=int, default=1500)
    parser.add_argument("--output_dir", dest="output_dir", type=str, default=".")
    return parser.parse_args()


args = get_args()


def check_json(answer):
    """Parse a Codeforces API answer; return its 'result' field on success."""
    values = json.loads(answer)

    if values['status'] == 'OK':
        return values['result']


def get_main_name(tree):
    """
    :return: name of the class that declares a public static main method
    """
    return next(klass.name for klass in tree.types
                if isinstance(klass, javalang.tree.ClassDeclaration)
                for m in klass.methods
                if m.name == 'main' and m.modifiers.issuperset({'public', 'static'}))


def save_source_code(contest_id, submission_id):
    """
    Parse the html page to find the source code of the submission and save it
    in a unique package p${contest_id}.p${submission_id}.
    If we reach the api request bound, we sleep for 5 minutes.
    """
    url = request.Request(f"http://codeforces.com/contest/{contest_id}/submission/{submission_id}")
    with request.urlopen(url) as req:
        soup = bs4.BeautifulSoup(req.read(), "html.parser")
    path = os.path.join(args.output_dir, f"p{contest_id}", f"p{submission_id}")
    if not os.path.exists(path):
        os.makedirs(path)
    code = ""
    for p in soup.find_all("pre", {"class": "program-source"}):
        code += p.get_text()
    tree = javalang.parse.parse(code)
    try:
        name = get_main_name(tree)
        with open(os.path.join(path, f"{name}.java"), 'w') as f:
            print(f"package p{contest_id}.p{submission_id};", file=f)
            f.write(code)
    except StopIteration:
        # No main class was parsed: the page most likely came back empty because
        # we hit the request bound, so back off for 5 minutes.
        print("Sleeping, because we reached the request bound")
        time.sleep(300)


def main():
    codeforces = "http://codeforces.com/api/"
    api = CodeforcesAPI()

    with request.urlopen(f"{codeforces}problemset.problems") as req:
        all_problems = check_json(req.read().decode('utf-8'))

    # Take up to problem_count problems whose rating lies in [min_rating, max_rating].
    problems = []
    cur_problem = 0
    for p in all_problems['problems']:
        if cur_problem >= args.problem_count:
            break
        if p.get('rating') is None:
            continue
        if p['rating'] < args.min_rating or p['rating'] > args.max_rating:
            continue
        cur_problem += 1
        problems.append({'contest_id': p['contestId'], 'index': p['index']})

    print(f"Got {len(problems)} problems: {problems[:1]}")

    # For each problem try to take submission_count accepted Java 8 submissions,
    # paging through the contest status page_size entries at a time.
    all_submission = 0
    for p in problems:
        cur_submission = 0
        iteration = 0
        page_size = 1000
        while cur_submission < args.submission_count:
            length = 0
            for s in api.contest_status(contest_id=p['contest_id'], from_=page_size * iteration + 1, count=page_size):
                if cur_submission >= args.submission_count:
                    break
                length += 1
                if s.problem.contest_id != p['contest_id'] or s.problem.index != p['index']:
                    continue
                if s.programming_language != "Java 8":
                    continue
                if s.verdict.name != "ok":
                    continue
                save_source_code(p['contest_id'], s.id)
                cur_submission += 1
                all_submission += 1
                print(f"Saved {all_submission} programs in total")
            iteration += 1
            if length == 0:
                break


if __name__ == "__main__":
    main()
9 changes: 9 additions & 0 deletions scripts/prepare.sh
#!/bin/bash

./gradlew clean build -x test

INPUT_FOLDER=contest_input

# Copy the resources folder into a distinct folder to allow other scripts to keep a specific project in the contest resources folder
mkdir -p $INPUT_FOLDER
cp -r utbot-junit-contest/src/main/resources/* $INPUT_FOLDER
1 change: 1 addition & 0 deletions scripts/prog_list
antlr
32 changes: 32 additions & 0 deletions scripts/quality_analysis.sh
#!/bin/bash

PROJECT=${1}
SELECTORS=${2}
STMT_COVERAGE=${3}
WORKDIR="."

# We set QualityAnalysisConfig by properties file
SETTING_PROPERTIES_FILE="$WORKDIR/utbot-analytics/src/main/resources/config.properties"
touch $SETTING_PROPERTIES_FILE
echo "project=$PROJECT" > "$SETTING_PROPERTIES_FILE"
echo "selectors=$SELECTORS" >> "$SETTING_PROPERTIES_FILE"

JAR_TYPE="utbot-analytics"
echo "JAR_TYPE: $JAR_TYPE"
LIBS_DIR=utbot-analytics/build/libs/
UTBOT_JAR="$LIBS_DIR$(ls -l $LIBS_DIR | grep $JAR_TYPE | awk '{print $9}')"
echo $UTBOT_JAR
MAIN_CLASS="org.utbot.QualityAnalysisKt"

if [[ -n $STMT_COVERAGE ]]; then
MAIN_CLASS="org.utbot.StmtCoverageReportKt"
fi



#Running the jar
COMMAND_LINE="java $JVM_OPTS -cp $UTBOT_JAR $MAIN_CLASS"

echo "COMMAND=$COMMAND_LINE"

$COMMAND_LINE
7 changes: 7 additions & 0 deletions scripts/requirements.txt
beautifulsoup4
javalang
numpy
pandas
requests
scikit_learn
torch