Automated Completion of GitHub Workflows

Our work introduces GH-WCOM (GitHub Workflow COMpletion), a recommender system that provides automated suggestions for GitHub workflows by recommending appropriate actions. We use the Text-to-Text Transfer Transformer (T5) model as the foundation of our approach.

Pipeline Description

To build GH-WCOM, we relied on the pretrain-then-finetune paradigm: we first pre-train the T5 model and then fine-tune it for the auto-completion of GitHub workflows.

Pre-training

To pre-train (and fine-tune) a T5 small model, we need a new SentencePiece model that accommodates the expanded vocabulary introduced by the context-specific tokens naturally occurring in GitHub workflow files.

  • How to train a new SentencePiece model

    Pythonic way

    # Install the dependency first: pip install sentencepiece==0.1.96
    import sentencepiece as spm

    # Train the tokenizer on the entire pre-training corpus (all.txt),
    # keeping the workflow-specific special tokens as user-defined symbols.
    spm.SentencePieceTrainer.train('--input=all.txt --train_extremely_large_corpus=true --model_prefix=tokenizer-gh-action --vocab_size=32000 --bos_id=-1 --eos_id=1 --unk_id=2 --pad_id=0 --shuffle_input_sentence=true --character_coverage=1.0 --user_defined_symbols="<NL>,<PATH>,<PLH>,<V_NUMBER>,<FILE>,<URL>,<FOR_LATER_USE>,<dependency>,</dependency>" --max_sentence_length=20000')
    

    The new SentencePiece model has to be trained on the entire pre-training corpus. Our tokenizer is publicly available here.
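
    As a quick sanity check, you can load the trained model and tokenize a workflow fragment. This is a minimal sketch; the file name tokenizer-gh-action.model simply follows the --model_prefix used above.

    import sentencepiece as spm

    # Load the trained tokenizer (the file name follows --model_prefix above).
    sp = spm.SentencePieceProcessor()
    sp.load('tokenizer-gh-action.model')

    # Special tokens such as <NL> and <V_NUMBER> should surface as single
    # pieces thanks to --user_defined_symbols.
    print(sp.encode_as_pieces('steps: <NL> - uses: actions/checkout@<V_NUMBER>'))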

  • Set up a GCS Bucket 💡

    To set up a new GCS bucket for training and fine-tuning a T5 model, please follow the original guide provided by Google: https://cloud.google.com/storage/docs/quickstart-console. Then, by following the Jupyter notebook we provide for pre-training and fine-tuning the network, you should be able to set up the final environment.
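
    Alternatively, the bucket can be created programmatically with the google-cloud-storage client. This is only a sketch: the project ID, bucket name, and region below are placeholders, and the region should match your TPU's.

    # Requires: pip install google-cloud-storage (plus authenticated credentials).
    from google.cloud import storage

    # Placeholder project and bucket names; pick a region matching your TPU.
    client = storage.Client(project='your-gcp-project')
    bucket = client.create_bucket('your-gh-wcom-bucket', location='us-central1')
    print(f'Created bucket {bucket.name}')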

  • Datasets 📎

    The datasets for pre-training and fine-tuning the model are stored on GDrive here. Please note that the TF implementation needs TSV files to work properly, so make sure you pick the correct ones from our GDrive folder.
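
    Before training, you can verify that a split has the expected layout. The sketch below assumes the usual two-column format (input and target separated by a tab); the file name train.tsv is a placeholder.

    # Placeholder file name; we assume each line is "<input>\t<target>".
    with open('train.tsv', encoding='utf-8') as f:
        for i, line in enumerate(f):
            fields = line.rstrip('\n').split('\t')
            assert len(fields) == 2, f'line {i}: expected 2 tab-separated fields'
            if i < 3:
                print('INPUT :', fields[0][:80])
                print('TARGET:', fields[1][:80])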

  • Pre-training/Fine-tuning 💻

    To pre-train and then fine-tune T5, you can use the script we provide here.
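
    For orientation, a typical fine-tuning call with the t5 library looks like the sketch below. The bucket paths, TPU address, task name, and step count are all placeholders; the notebook we provide is the authoritative reference.

    import t5.models

    # Placeholder GCS paths and TPU address.
    model = t5.models.MtfModel(
        model_dir='gs://your-gh-wcom-bucket/finetuned_model',
        tpu='your-tpu-address',
        model_parallelism=1,
        batch_size=128,
        sequence_length={'inputs': 512, 'targets': 512},
    )

    # Resume from the pre-trained checkpoint and fine-tune on the
    # workflow-completion task (the task name is hypothetical and must be
    # registered with t5.data.TaskRegistry beforehand).
    model.finetune(
        mixture_or_task_name='gh_wcom_completion',
        pretrained_model_dir='gs://your-gh-wcom-bucket/pretrained_model',
        finetune_steps=100000,
    )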

  • Statistical Tests

    The code to replicate the statistical tests (i.e., McNemar and Wilcoxon) is available at the following links:

    As for the data needed to perform the tests, we make it available here.
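
    For reference, both tests can also be run in Python with statsmodels and scipy. The sketch below uses toy data; the actual inputs are the per-prediction outcomes and scores from the folder above.

    import numpy as np
    from scipy.stats import wilcoxon
    from statsmodels.stats.contingency_tables import mcnemar

    # Toy paired outcomes: 1 = correct prediction, 0 = wrong.
    a = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
    b = np.array([1, 0, 0, 1, 0, 0, 1, 0, 1, 0])

    # McNemar works on the 2x2 agreement table of the two models.
    table = [[np.sum((a == 1) & (b == 1)), np.sum((a == 1) & (b == 0))],
             [np.sum((a == 0) & (b == 1)), np.sum((a == 0) & (b == 0))]]
    print('McNemar p-value :', mcnemar(table, exact=True).pvalue)

    # Wilcoxon signed-rank test on paired per-example scores (toy values).
    print('Wilcoxon p-value:', wilcoxon([0.71, 0.64, 0.80, 0.55, 0.90, 0.62],
                                        [0.65, 0.60, 0.77, 0.52, 0.88, 0.61]).pvalue)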

  • Models 📊
  • Results: 📂 Click Me!
  • Additional: 📋

    Under Miscellaneous, you can find the code implementing the abstraction schema as well as additional files we used (e.g., an additional list of file extensions).
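
    To give an idea of what the abstraction schema does, the sketch below replaces concrete literals with the special tokens the tokenizer was trained with. The regexes are simplified stand-ins, not the exact rules implemented under Miscellaneous.

    import re

    # Simplified stand-ins for the real abstraction rules.
    ABSTRACTIONS = [
        (re.compile(r'https?://\S+'), '<URL>'),            # web links
        (re.compile(r'\bv?\d+(\.\d+)+\b'), '<V_NUMBER>'),  # version numbers
        (re.compile(r'\.?/[\w./-]+'), '<PATH>'),           # file-system paths
    ]

    def abstract(line: str) -> str:
        for pattern, token in ABSTRACTIONS:
            line = pattern.sub(token, line)
        return line

    print(abstract('- run: bash ./scripts/build.sh v1.2.3  # https://example.com'))
    # -> '- run: bash <PATH> <V_NUMBER>  # <URL>'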

  • Extra: 📋
    • The hyperparameter-tuning results, as well as the convergence of the models, are available here.
    • To navigate the replication package, click here.
