<a href="https://colab.research.google.com/github/adrienpayong/codecommentgenerator/blob/main/Code_Comment_Generator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Code Comment Generator

In [None]:
!pip install -q git+https://github.com/huggingface/transformers.git

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for transformers (pyproject.toml) ... done


The goal of this section is to take some code as input and have the model generate comments for it. We will use Salesforce's CodeT5 model, in a checkpoint fine-tuned to summarize Java code.
As its name suggests, CodeT5 [1] builds on the T5 encoder-decoder architecture. Instead of treating source code like any other natural language (NL) text, it applies a new identifier-aware pretraining objective that capitalizes on code semantics. This contrasts with earlier code models, which rely on conventional pretraining methods.
The authors released two pretrained models: a base model with 220 million parameters and a small model with 60 million parameters. They also released all of their fine-tuning checkpoints through a public GCP bucket. Both pretrained models are also available on the Hugging Face Hub.

CodeT5 is a unified pretrained encoder-decoder transformer model. Its unified framework supports multitask learning and handles both code understanding and code generation tasks.
CodeT5 is pretrained sequentially with two objectives. For the first 100 epochs, the model is optimized with an identifier-aware denoising objective, which trains it to distinguish identifiers (such as variable names and function names) from programming language (PL) keywords (e.g., if, while). It is then optimized for a further 50 epochs with a bimodal dual generation objective, whose aim is to align the code and its NL descriptions more closely.
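To make the identifier-aware idea more concrete, here is a toy sketch in plain Python. It tags each token as identifier or not, then replaces every identifier with a sentinel token so that repeated occurrences of the same name share one sentinel. The regex tokenizer, the short keyword list, and the helper names `tag_tokens` and `mask_identifiers` are assumptions made for illustration; this is not the preprocessing code used by the CodeT5 authors.

```python
import re

# Small illustrative subset of Java keywords and literals (an assumption;
# a real tokenizer would use the language's full keyword list).
JAVA_KEYWORDS = {
    "public", "static", "void", "int", "boolean", "for", "if", "else",
    "while", "return", "break", "new", "true", "false",
}

# Words, integers, two-character comparison operators, then any other symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[=!<>]=|\S")


def tag_tokens(code):
    """Return (token, is_identifier) pairs for a code string."""
    tagged = []
    for tok in TOKEN_RE.findall(code):
        looks_like_name = tok[0].isalpha() or tok[0] == "_"
        tagged.append((tok, looks_like_name and tok not in JAVA_KEYWORDS))
    return tagged


def mask_identifiers(code):
    """Replace identifiers with sentinels; a repeated name reuses its sentinel."""
    sentinels = {}
    masked = []
    for tok, is_identifier in tag_tokens(code):
        if is_identifier:
            if tok not in sentinels:
                sentinels[tok] = f"<extra_id_{len(sentinels)}>"
            masked.append(sentinels[tok])
        else:
            masked.append(tok)
    return " ".join(masked), sentinels


masked, mapping = mask_identifiers("int num = 29; if (num % i == 0) flag = true;")
print(masked)
print(mapping)
```

During pretraining, the model would be asked to recover the original names from a masked input of this kind, which pushes it to reason about how each identifier is used rather than what it is called.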
Since this example downloads a model from a non-Hugging Face repository (at the time of writing this book, the fine-tuned model was not available on Hugging Face), we will run it in Google Colab.

In [None]:
!mkdir comment_model
%cd comment_model
!wget -O config.json https://storage.googleapis.com/sfr-codet5-data-research/pretrained_models/codet5_base/config.json

/content/comment_model
--2023-01-23 14:43:45--  https://storage.googleapis.com/sfr-codet5-data-research/pretrained_models/codet5_base/config.json
Resolving storage.googleapis.com (storage.googleapis.com)... 142.251.2.128, 142.250.141.128, 2607:f8b0:4023:c0b::80, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.251.2.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1422 (1.4K) [application/json]
Saving to: ‘config.json’


2023-01-23 14:43:45 (25.4 MB/s) - ‘config.json’ saved [1422/1422]



In [None]:
!ls

config.json  pytorch_model.bin


In [None]:
!wget -O pytorch_model.bin https://storage.googleapis.com/sfr-codet5-data-research/finetuned_models/summarize_java_codet5_base.bin

--2023-01-23 14:44:12--  https://storage.googleapis.com/sfr-codet5-data-research/finetuned_models/summarize_java_codet5_base.bin
Resolving storage.googleapis.com (storage.googleapis.com)... 142.251.2.128, 142.250.141.128, 2607:f8b0:4023:c0b::80, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.251.2.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 891651384 (850M) [application/macbinary]
Saving to: ‘pytorch_model.bin’


2023-01-23 14:44:20 (121 MB/s) - ‘pytorch_model.bin’ saved [891651384/891651384]
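Before loading the model, it can be worth confirming that both downloads landed in the folder, since an interrupted wget can leave a missing or empty file. The following is a minimal sketch; the helper name `missing_model_files` is hypothetical, and the file list assumes the two files fetched above are all that the local folder needs.

```python
from pathlib import Path

# The two files downloaded in the cells above.
REQUIRED_FILES = ("config.json", "pytorch_model.bin")


def missing_model_files(model_dir, required=REQUIRED_FILES):
    """Return the required file names that are absent or empty in model_dir."""
    root = Path(model_dir)
    missing = []
    for name in required:
        path = root / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


# An empty list means both files are present and non-empty.
print(missing_model_files("/content/comment_model"))
```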



In [None]:
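from transformers import RobertaTokenizer, T5ForConditionalGeneration
# Folder created earlier; it holds config.json and the fine-tuned pytorch_model.bin.
model_name_or_path = '/content/comment_model'
# The tokenizer is loaded from the Hugging Face Hub rather than from the local folder.
codeT5_tkn = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
# Load the Java summarization weights from the local folder.
mdl = T5ForConditionalGeneration.from_pretrained(model_name_or_path)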

Code for generating a comment from a Java source snippet

In [None]:
text = """ public static void main(String[] args) {

    int num = 29;
    boolean flag = false;
    for (int i = 2; i <= num / 2; ++i) {
    // condition for nonprime number
        if (num % i == 0) {
          flag = true;
          break;
         }
    }
if (!flag)
    System.out.println(num + " is a prime number.");
else
  System.out.println(num + " is not a prime number.");
} """

In [None]:
input_ids = codeT5_tkn(text, return_tensors="pt").input_ids
gen_ids = mdl.generate(input_ids, max_length=20)
print(codeT5_tkn.decode(gen_ids[0], skip_special_tokens=True))

A test program that checks if the number is a prime number.
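The three steps above (tokenize, generate, decode) can be wrapped in a small helper so they are easy to repeat for other snippets. This is only a convenience sketch: the function name `generate_comment` is hypothetical, and it assumes the tokenizer and model behave like the `codeT5_tkn` and `mdl` objects created earlier.

```python
def generate_comment(code, tokenizer, model, max_length=20, **generate_kwargs):
    """Tokenize code, run model.generate, and decode the first returned sequence."""
    input_ids = tokenizer(code, return_tensors="pt").input_ids
    gen_ids = model.generate(input_ids, max_length=max_length, **generate_kwargs)
    return tokenizer.decode(gen_ids[0], skip_special_tokens=True)


# Usage with the objects defined above (left commented out so the block
# stays self-contained):
# print(generate_comment(text, codeT5_tkn, mdl))
```

Any extra keyword arguments, such as num_beams, are passed straight through to generate.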


## Generating a comment for Google search code

In [None]:
text = """
String google = "http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=";
    String search = "stackoverflow";
    String charset = "UTF-8";
    URL url = new URL(google + URLEncoder.encode(search, charset));
    Reader reader = new InputStreamReader(url.openStream(), charset);
    GoogleResults results = new Gson().fromJson(reader, GoogleResults.class);
// Show title and URL of 1st result.
System.out.println(results.getResponseData().getResults().get(0).getTitle());
System.out.println(results.getResponseData().getResults().get(0).getUrl());
"""
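input_ids = codeT5_tkn(text, return_tensors="pt").input_ids
# Beam search that returns the 5 best sequences. Note: temperature only affects
# sampling, so it has no effect here unless do_sample=True is also passed.
gen_ids = mdl.generate(input_ids, max_length=50, temperature=0.2, num_beams=200, no_repeat_ngram_size=2, num_return_sequences=5)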

In [None]:
print(codeT5_tkn.decode(gen_ids[0], skip_special_tokens=True))

https://www. googleapis. com / ajax. services. search. web?v = 1. 0 &q = 123 Show title and URL of 1st result.


The last result may not look good, but it can be improved by tuning the generation parameters, which I leave to you to experiment with.
Finally, these pretrained models can also be fine-tuned for other programming languages, such as C and C++.
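One way to organize that experimentation is to run the same snippet under several named settings and compare every returned sequence, not just the first. The sketch below is a starting point under stated assumptions: the helper name `compare_settings` is hypothetical, and the values in `CANDIDATE_SETTINGS` are illustrative rather than tuned.

```python
def compare_settings(code, tokenizer, model, settings):
    """Run generation once per named settings dict; return {name: [decoded texts]}."""
    input_ids = tokenizer(code, return_tensors="pt").input_ids
    results = {}
    for name, kwargs in settings.items():
        gen_ids = model.generate(input_ids, **kwargs)
        results[name] = [
            tokenizer.decode(ids, skip_special_tokens=True) for ids in gen_ids
        ]
    return results


# Illustrative starting points only; none of these values are tuned.
CANDIDATE_SETTINGS = {
    "greedy": {"max_length": 50},
    "beam_5": {"max_length": 50, "num_beams": 5, "no_repeat_ngram_size": 2},
    "beam_5_top3": {"max_length": 50, "num_beams": 5, "num_return_sequences": 3},
    "sampling": {"max_length": 50, "do_sample": True, "temperature": 0.7, "top_p": 0.95},
}

# Usage with the objects defined above (left commented out so the block
# stays self-contained):
# for name, comments in compare_settings(text, codeT5_tkn, mdl, CANDIDATE_SETTINGS).items():
#     print(name, comments)
```

Smaller beam counts also run much faster than num_beams=200, which makes it practical to try several settings in one session.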