37 changes: 37 additions & 0 deletions docs/jlearch/pipeline-training-usage.md
# Pipeline diagram

```mermaid
graph TD
Projects --> ContestEstimator
Selectors --> ContestEstimator
subgraph FeatureGeneration
ContestEstimator --> Tests
Tests --> Features
Tests --> Rewards
end


Features --> Data
Rewards --> Data

Data --> Models
Models --> NNRewardGuidedSelector --> UsualTestGeneration
```

# Training

Briefly:

* Get a dataset `D` by running `ContestEstimator` on several projects with several selectors.
* Train `model_0` on `D`.
* Repeat for several `iterations` (assume we are at step `i`):
  * Get a dataset `D'` by running `ContestEstimator` on several projects with `NNRewardGuidedSelector` backed by `model_i`.
  * $$D = D \cup D'$$
  * Train `model_{i+1}` on `D`.
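The loop above can be sketched as follows; `run_estimator` and `train` are hypothetical placeholders standing in for the real `ContestEstimator` runs and model training, not APIs from this repository:

```python
# Sketch of the iterative dataset-aggregation training loop described above.
# run_estimator and train are hypothetical placeholders, not real APIs.

def train_iteratively(projects, selectors, iterations, run_estimator, train):
    # Bootstrap dataset D from the baseline selectors.
    dataset = []
    for selector in selectors:
        dataset.extend(run_estimator(projects, selector))

    models = [train(dataset)]  # model_0

    for i in range(iterations):
        # Collect D' with the NN-guided selector backed by model_i ...
        new_data = run_estimator(projects, ("NN_REWARD_GUIDED_SELECTOR", models[-1]))
        # ... merge it into D (D = D ∪ D') and train model_{i+1} on the union.
        dataset.extend(new_data)
        models.append(train(dataset))
    return models
```

Note that each new model is trained on the whole accumulated dataset, not only on the data its predecessor generated.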

To do this, you should:
* Make sure the `java` command runs `Java 8` and `JAVA_HOME` points to a `Java 8` installation.
* Put the projects you want to learn on into the `contest_input/projects` folder, then list the classes you want to learn on in `contest_input/classes/<project name>/list` (if the list is empty, all classes from the project jar are taken).
* Run `pip install -r scripts/requirements.txt`. Whether to do this in a virtual environment is up to you.
* List the selectors in `scripts/selector_list` and the projects in `scripts/prog_list`.
* Run `./scripts/train_iteratively.sh`
65 changes: 65 additions & 0 deletions docs/jlearch/scripts.md
# How to use scripts
For each scenario, start from the root of the `UTBotJava` repository; this is `WORKDIR`.

The `PATH_SELECTOR` argument has the form `"PATH_SELECTOR_TYPE [PATH_SELECTOR_PATH for NN] [IS_COMBINED (false by default)] [ITERATIONS]"`.

Before start of work run:
```bash
./scripts/prepare.sh
```

It copies the contest resources into the `contest_input` folder and builds the project. The scripts run from the built jars, so if you change something in the code and want to re-run the scripts, you should run:
```bash
./gradlew clean build -x test
```

## To train a few iterations of your models
By default the features directory is `eval/features`; it must be created beforehand. To change it, you have to edit the source code of the scripts.

List the projects and selectors you want to train on in `scripts/prog_list` and `scripts/selector_list`. Training runs on all methods of all classes from `contest_input/classes/<project name>/list`.

Then just run:
```bash
./scripts/train_iteratively.sh <time_limit> <iterations> <output_dir> <python_command>
```
`<python_command>` is your command for `python3`. When execution finishes you will get the models of all iterations in the `<output_dir>` folder, plus features for each selector and project in `<features_dir>/<selector>/<project>` for each `selector` from `selector_list`, and in `<features_dir>/jlearch/<selector>/<prog>` for the models.

## To run Contest Estimator with coverage
Check that a `srcTestDir` entry for your project exists in the `build.gradle` of `utbot-junit-contest`. If it does not, add `build/output/test/<project>`.

Then just run:
```bash
./scripts/run_with_coverage.sh <project> <time_limit> <path_selector> <selector_alias>
```

When execution finishes you will get a JaCoCo report in the `eval/jacoco/<project>/<selector_alias>/` folder.

## To estimate quality
Just run:
```bash
./scripts/quality_analysis.sh <project> <selector_aliases, separated by comma>
```
It takes the coverage reports from the corresponding report folders (at `eval/jacoco/<project>/<alias>`) and generates charts in `$outputDir/<project>/<timestamp>.html`.
`outputDir` can be changed in `QualityAnalysisConfig`. The result file contains information about 3 metrics:
* $\frac{\sum_{c \in classSet} instCoverage(c)}{|classSet|}$
* $\frac{\sum_{c \in classSet} coveredInstructions(c)}{\sum_{c \in classSet} allInstructions(c)}$
* $\frac{\sum_{c \in classSet} branchCoverage(c)}{|classSet|}$

For each metric and each selector you will have:
* the value of the metric
* a chart with the median, $q_1$, $q_3$ and so on
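Given per-class instruction and branch counts, the three metrics above can be computed as in this sketch (the dictionary field names are illustrative assumptions, not the real report format):

```python
# Sketch: the three quality metrics, computed from per-class counts.
# Field names (covered_inst, total_inst, ...) are assumed for illustration.

def quality_metrics(classes):
    n = len(classes)
    # 1) mean per-class instruction coverage
    mean_inst = sum(c["covered_inst"] / c["total_inst"] for c in classes) / n
    # 2) overall instruction coverage: covered / total summed over all classes
    overall_inst = (sum(c["covered_inst"] for c in classes)
                    / sum(c["total_inst"] for c in classes))
    # 3) mean per-class branch coverage
    mean_branch = sum(c["covered_branch"] / c["total_branch"] for c in classes) / n
    return mean_inst, overall_inst, mean_branch
```

The first and second metrics differ in weighting: the first treats every class equally, while the second weights classes by their instruction count.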


## To scrape solution classes from Codeforces
Note: you can't scrape many classes at once, because the Codeforces API has a request limit.

This can be useful if you want to train Jlearch on classes that usually have no virtual functions but contain many algorithms, i.e. loops and conditions.

Just run:
```bash
python3 path/to/codeforces_scrapper.py --problem_count <val> --submission_count <val> --min_rating <val> --max_rating <val> --output_dir <val>
```

All arguments are optional. Default values: `100`, `10`, `0`, `1500`, `.`.

At the end you should get `submission_count` classes for each of `problem_count` problems with a rating between `min_rating` and `max_rating` in `output_dir`. Each class has the package `p<contest_id>.p<submission_id>`.
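The package naming maps directly onto the output directory layout; a small sketch of the resulting path (helper name is hypothetical):

```python
import os

# Sketch: where a scraped submission ends up, per the description above.
# submission_path is a hypothetical helper, not part of the scraper itself.
def submission_path(output_dir, contest_id, submission_id, class_name):
    # package p<contest_id>.p<submission_id> becomes this directory structure
    return os.path.join(output_dir, f"p{contest_id}", f"p{submission_id}",
                        f"{class_name}.java")
```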
35 changes: 35 additions & 0 deletions docs/jlearch/setup.md
# How to setup environment for experiments on Linux

* Clone repository, go to root
* `chmod +x ./scripts/*` and `chmod +x gradlew`.
* Set `Java 8` as the default and point `JAVA_HOME` to it.
For example:
  * Follow [this guide](https://sdkman.io/install) up to the `Windows installation` section
  * `sdk list java`
  * Find any `Java 8`
  * `sdk install <this java>`
  * `sdk use <this java>`
  * Check `java -version`
* `mkdir -p eval/features`
* `mkdir models`
* Set up the environment for `Python`.
For example:
  * `python3 -m venv /path/to/new/virtual/environment`
  * `source /path/to/venv/bin/activate`
  * Check `which python3`; it should point somewhere inside the `/path/to/venv` folder.
  * `pip install -r scripts/requirements.txt`
* `./scripts/prepare.sh`
* Optionally change `scripts/prog_list` to run on a smaller project, or delete some classes from `contest_input/classes/<project>/list`.

# Default settings and how to change them
* You can reduce the number of models via the `models` variable in `scripts/train_iteratively.sh`
* You can change the amount of required RAM in `run_contest_estimator.sh`: `16 gb` by default
* You can change `batch_size` or `device` in `train.py`: `4096` and `gpu` by default
* If you are completing the setup on a server, you will need to uncomment the tmp directory option in `run_contest_estimator.sh`

# Continue setup
* `scripts/train_iteratively.sh 30 2 models <your python3 command>`
* You should get the trained models in `models/`.
* `mkdir eval/jacoco`
* `./scripts/run_with_coverage.sh <any project (guava-26.0, for example)> 30 "NN_REWARD_GUIDED_SELECTOR path/to/model" some_alias`. `path/to/model` should be something like `models/nn32/0`, where `nn32` is the model type and `0` is the iteration number.
* You should get a JaCoCo report in `eval/jacoco/guava-26.0/some_alias/`
120 changes: 120 additions & 0 deletions scripts/codeforces_scrapper/codeforces_scrapper.py
import argparse
import os.path
import time

from urllib import request
import json
import bs4
import javalang

from codeforces import CodeforcesAPI


def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--problem_count", dest='problem_count', type=int, default=100)
    parser.add_argument("--submission_count", dest='submission_count', type=int, default=10)
    parser.add_argument("--min_rating", dest='min_rating', type=int, default=0)
    parser.add_argument("--max_rating", dest="max_rating", type=int, default=1500)
    parser.add_argument("--output_dir", dest="output_dir", type=str, default=".")
    return parser.parse_args()


args = get_args()


def check_json(answer):
    """Parse a Codeforces API answer; return its 'result' field on success."""
    values = json.loads(answer)

    if values['status'] == 'OK':
        return values['result']


def get_main_name(tree):
    """
    :return: name of the class that declares a public static main method
    """
    return next(klass.name for klass in tree.types
                if isinstance(klass, javalang.tree.ClassDeclaration)
                for m in klass.methods
                if m.name == 'main' and m.modifiers.issuperset({'public', 'static'}))


def save_source_code(contest_id, submission_id):
    """
    Parse the html page to find the source code of the submission and save it
    in a unique package p${contest_id}.p${submission_id}.
    If we reach the api request bound, we sleep for 5 minutes.
    """
    url = request.Request(f"http://codeforces.com/contest/{contest_id}/submission/{submission_id}")
    with request.urlopen(url) as req:
        soup = bs4.BeautifulSoup(req.read(), "html.parser")
    path = os.path.join(args.output_dir, f"p{contest_id}", f"p{submission_id}")
    if not os.path.exists(path):
        os.makedirs(path)
    code = ""
    for p in soup.find_all("pre", {"class": "program-source"}):
        code += p.get_text()
    tree = javalang.parse.parse(code)
    try:
        name = get_main_name(tree)
        with open(os.path.join(path, f"{name}.java"), 'w') as f:
            print(f"package p{contest_id}.p{submission_id};", file=f)
            f.write(code)
    except StopIteration:
        # No main class was parsed: the page most likely came back empty because
        # we hit the request bound, so back off for 5 minutes.
        print("Sleeping, because we reached the request bound")
        time.sleep(300)


def main():
    codeforces = "http://codeforces.com/api/"
    api = CodeforcesAPI()

    with request.urlopen(f"{codeforces}problemset.problems") as req:
        all_problems = check_json(req.read().decode('utf-8'))

    # Take up to problem_count problems whose rating lies in [min_rating, max_rating].
    problems = []
    cur_problem = 0
    for p in all_problems['problems']:
        if cur_problem >= args.problem_count:
            break
        if p.get('rating') is None:
            continue
        if p['rating'] < args.min_rating or p['rating'] > args.max_rating:
            continue
        cur_problem += 1
        problems.append({'contest_id': p['contestId'], 'index': p['index']})

    print(f"Got {len(problems)} problems: {problems[:1]}")

    # For each problem try to take submission_count accepted Java 8 submissions,
    # paging through the contest status page_size entries at a time.
    all_submission = 0
    for p in problems:
        cur_submission = 0
        iteration = 0
        page_size = 1000
        while cur_submission < args.submission_count:
            length = 0
            for s in api.contest_status(contest_id=p['contest_id'], from_=page_size * iteration + 1, count=page_size):
                if cur_submission >= args.submission_count:
                    break
                length += 1
                if s.problem.contest_id != p['contest_id'] or s.problem.index != p['index']:
                    continue
                if s.programming_language != "Java 8":
                    continue
                if s.verdict.name != "ok":
                    continue
                save_source_code(p['contest_id'], s.id)
                cur_submission += 1
                all_submission += 1
                print(f"Saved {all_submission} programs in total")
            iteration += 1
            if length == 0:
                break


if __name__ == "__main__":
    main()
9 changes: 9 additions & 0 deletions scripts/prepare.sh
#!/bin/bash

./gradlew clean build -x test

INPUT_FOLDER=contest_input

# Copy the resources folder into a distinct folder to allow other scripts to keep a specific project in the contest resources folder
mkdir -p $INPUT_FOLDER
cp -r utbot-junit-contest/src/main/resources/* $INPUT_FOLDER
1 change: 1 addition & 0 deletions scripts/prog_list
antlr
32 changes: 32 additions & 0 deletions scripts/quality_analysis.sh
#!/bin/bash

PROJECT=${1}
SELECTORS=${2}
STMT_COVERAGE=${3}
WORKDIR="."

# We set QualityAnalysisConfig by properties file
SETTING_PROPERTIES_FILE="$WORKDIR/utbot-analytics/src/main/resources/config.properties"
touch $SETTING_PROPERTIES_FILE
echo "project=$PROJECT" > "$SETTING_PROPERTIES_FILE"
echo "selectors=$SELECTORS" >> "$SETTING_PROPERTIES_FILE"

JAR_TYPE="utbot-analytics"
echo "JAR_TYPE: $JAR_TYPE"
LIBS_DIR=utbot-analytics/build/libs/
UTBOT_JAR="$LIBS_DIR$(ls -l $LIBS_DIR | grep $JAR_TYPE | awk '{print $9}')"
echo $UTBOT_JAR
MAIN_CLASS="org.utbot.QualityAnalysisKt"

if [[ -n $STMT_COVERAGE ]]; then
MAIN_CLASS="org.utbot.StmtCoverageReportKt"
fi



#Running the jar
COMMAND_LINE="java $JVM_OPTS -cp $UTBOT_JAR $MAIN_CLASS"

echo "COMMAND=$COMMAND_LINE"

$COMMAND_LINE
7 changes: 7 additions & 0 deletions scripts/requirements.txt
beautifulsoup4
javalang
numpy
pandas
requests
scikit_learn
torch