Automated Completion of GitHub Workflows

Our work introduces GH-WCOM (GitHub Workflow COMpletion), a recommender system that provides automated suggestions for GitHub workflows by recommending appropriate actions. We use the Text-to-Text Transfer Transformer (T5) model as the foundation of our approach.

Pipeline Description

To build GH-WCOM, we relied on the pretrain-then-finetune paradigm: we first pre-train the T5 model and then fine-tune it for the auto-completion of GitHub workflows.

Pre-training

To pre-train (and fine-tune) a T5 small model, we need a new SentencePiece model that accommodates the expanded vocabulary introduced by the context-specific tokens naturally occurring in GitHub workflow files.

  • How to train a new SentencePiece model

    Pythonic way

    # Install the dependency first: pip install sentencepiece==0.1.96
    import sentencepiece as spm

    # Train the tokenizer on the entire pre-training corpus (all.txt),
    # keeping the workflow-specific special tokens as user-defined symbols.
    spm.SentencePieceTrainer.train('--input=all.txt --train_extremely_large_corpus=true --model_prefix=tokenizer-gh-action --vocab_size=32000 --bos_id=-1 --eos_id=1 --unk_id=2 --pad_id=0 --shuffle_input_sentence=true --character_coverage=1.0 --user_defined_symbols="<NL>,<PATH>,<PLH>,<V_NUMBER>,<FILE>,<URL>,<FOR_LATER_USE>,<dependency>,</dependency>" --max_sentence_length=20000')
    

    The new SentencePiece model has to be trained on the entire pre-training corpus. Our tokenizer is publicly available here.
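
    As a quick sanity check, you can load the trained model and tokenize a workflow fragment. This is a minimal sketch; the file name tokenizer-gh-action.model simply follows the --model_prefix used above.

    import sentencepiece as spm

    # Load the trained tokenizer (the file name follows --model_prefix above).
    sp = spm.SentencePieceProcessor()
    sp.load('tokenizer-gh-action.model')

    # Special tokens such as <NL> and <V_NUMBER> should surface as single
    # pieces thanks to --user_defined_symbols.
    print(sp.encode_as_pieces('steps: <NL> - uses: actions/checkout@<V_NUMBER>'))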

  • Set up a GCS Bucket 💡

    To set up a new GCS bucket for training and fine-tuning a T5 model, please follow the original guide provided by Google: https://cloud.google.com/storage/docs/quickstart-console. Then, by following the Jupyter notebook we provide for pre-training and fine-tuning the network, you should be able to set up the final environment.
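
    Alternatively, the bucket can be created programmatically with the google-cloud-storage client. This is only a sketch: the project ID, bucket name, and region below are placeholders, and the region should match your TPU's.

    # Requires: pip install google-cloud-storage (plus authenticated credentials).
    from google.cloud import storage

    # Placeholder project and bucket names; pick a region matching your TPU.
    client = storage.Client(project='your-gcp-project')
    bucket = client.create_bucket('your-gh-wcom-bucket', location='us-central1')
    print(f'Created bucket {bucket.name}')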

  • Datasets 📎

    The datasets for pre-training and fine-tuning the model are stored on GDrive here. Please note that the TF implementation needs TSV files to work properly, so make sure you pick the correct ones from our GDrive folder.
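
    Before training, you can verify that a split has the expected layout. The sketch below assumes the usual two-column format (input and target separated by a tab); the file name train.tsv is a placeholder.

    # Placeholder file name; we assume each line is "<input>\t<target>".
    with open('train.tsv', encoding='utf-8') as f:
        for i, line in enumerate(f):
            fields = line.rstrip('\n').split('\t')
            assert len(fields) == 2, f'line {i}: expected 2 tab-separated fields'
            if i < 3:
                print('INPUT :', fields[0][:80])
                print('TARGET:', fields[1][:80])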

  • Pre-training/Fine-tuning 💻

    To pre-train and then fine-tune T5, you can use the script we provide here.
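
    For orientation, a typical fine-tuning call with the t5 library looks like the sketch below. The bucket paths, TPU address, task name, and step count are all placeholders; the notebook we provide is the authoritative reference.

    import t5.models

    # Placeholder GCS paths and TPU address.
    model = t5.models.MtfModel(
        model_dir='gs://your-gh-wcom-bucket/finetuned_model',
        tpu='your-tpu-address',
        model_parallelism=1,
        batch_size=128,
        sequence_length={'inputs': 512, 'targets': 512},
    )

    # Resume from the pre-trained checkpoint and fine-tune on the
    # workflow-completion task (the task name is hypothetical and must be
    # registered with t5.data.TaskRegistry beforehand).
    model.finetune(
        mixture_or_task_name='gh_wcom_completion',
        pretrained_model_dir='gs://your-gh-wcom-bucket/pretrained_model',
        finetune_steps=100000,
    )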

  • Statistical Tests

    The code to replicate the statistical tests (i.e., McNemar and Wilcoxon) is available at the following links:

    As for the data needed to perform the tests, we make it available here.
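
    For reference, both tests can also be run in Python with statsmodels and scipy. The sketch below uses toy data; the actual inputs are the per-prediction outcomes and scores from the folder above.

    import numpy as np
    from scipy.stats import wilcoxon
    from statsmodels.stats.contingency_tables import mcnemar

    # Toy paired outcomes: 1 = correct prediction, 0 = wrong.
    a = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
    b = np.array([1, 0, 0, 1, 0, 0, 1, 0, 1, 0])

    # McNemar works on the 2x2 agreement table of the two models.
    table = [[np.sum((a == 1) & (b == 1)), np.sum((a == 1) & (b == 0))],
             [np.sum((a == 0) & (b == 1)), np.sum((a == 0) & (b == 0))]]
    print('McNemar p-value :', mcnemar(table, exact=True).pvalue)

    # Wilcoxon signed-rank test on paired per-example scores (toy values).
    print('Wilcoxon p-value:', wilcoxon([0.71, 0.64, 0.80, 0.55, 0.90, 0.62],
                                        [0.65, 0.60, 0.77, 0.52, 0.88, 0.61]).pvalue)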

  • Models 📊
  • Results: 📂 Click Me!
  • Additional: 📋

    Under Miscellaneous, you can find the code implementing the abstraction schema as well as additional files we used (e.g., an additional list of file extensions).
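
    To give an idea of what the abstraction schema does, the sketch below replaces concrete literals with the special tokens the tokenizer was trained with. The regexes are simplified stand-ins, not the exact rules implemented under Miscellaneous.

    import re

    # Simplified stand-ins for the real abstraction rules.
    ABSTRACTIONS = [
        (re.compile(r'https?://\S+'), '<URL>'),            # web links
        (re.compile(r'\bv?\d+(\.\d+)+\b'), '<V_NUMBER>'),  # version numbers
        (re.compile(r'\.?/[\w./-]+'), '<PATH>'),           # file-system paths
    ]

    def abstract(line: str) -> str:
        for pattern, token in ABSTRACTIONS:
            line = pattern.sub(token, line)
        return line

    print(abstract('- run: bash ./scripts/build.sh v1.2.3  # https://example.com'))
    # -> '- run: bash <PATH> <V_NUMBER>  # <URL>'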

  • Extra: 📋
    • The hyperparameter-tuning results, as well as the convergence of the models, are available here.
    • To navigate the replication package, click here.
