# Instructing GPT-2 to Generate Java Specific Code

To make this notebook reproducible, we list the technical details of the environment used to run the code.

Hardware:
- MSI Katana
- Intel Core i7-10750H
- NVIDIA GeForce RTX 3050 Laptop GPU

Software:
- Windows 10
- Python 3.9.6

# Autoregressive Fine-Tuning

In this section, we will fine-tune GPT-2, a pre-trained language model developed by OpenAI that is capable of generating coherent and contextually relevant text. GPT-2 uses the transformer architecture and has been trained on a diverse dataset of text from the internet. By fine-tuning this model, we aim to adapt it specifically to generate Java-like code that adheres to the syntax and structure of the Java programming language.

### Data Preparation

In [1]:
import pandas as pd
import numpy as np
import os

To obtain a model that readily generates Java-like code, we first scraped a substantial amount of Java code from public GitHub repositories, building a dataset rich in diverse Java code snippets. The dataset includes examples of classes, methods, loops, conditionals, and other key constructs of the Java language. This diversity should help the model generalize across different coding scenarios.

In [2]:
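def read_file(path):
  # Return the first line of the file that does not start with "<body>".
  try:
      with open(path) as f:
        for line in f:
          if not line.startswith("<body>"):
            return line
  except (OSError, UnicodeDecodeError):
    # Report entries that cannot be read (e.g. directories or undecodable files).
    print(path)
    return ""

  return ""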


def read_files(dir_path):
  path_list = os.listdir(dir_path)
  content = ""

  for path in path_list:
    content += read_file(dir_path + "/" + path)

  return content

In [3]:
dataset_path = "dataset"

In [4]:
text_data = read_files(dataset_path)

dataset/https__github.com_spring-projects_spring-boot_blob_main_spring-boot-project_spring-boot-autoconfigure_src_test_java_org_springframework_boot_autoconfigure_data_alt_elasticsearch_CityReactiveElasticsearchDbRepository.java
dataset/lines


For efficient training, we split this dataset into two text files: one for training and one for validation. Each file contains one Java code snippet per line, which keeps the data structure consistent. The split below is an 80/20 cut by character offset, so the snippet at the boundary may be divided between the two files. The training set is used to fine-tune the GPT-2 model, while the validation set is held out so that performance can be monitored and overfitting detected.

In [5]:
train_text = text_data[:int(0.8*len(text_data))]
test_text = text_data[int(0.8*len(text_data)):]
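
The slice above cuts `text_data` at a character offset. Since each snippet occupies one line, one possible alternative is to cut at a line boundary so that no snippet is divided between the two files. The sketch below is a minimal illustration of that idea; `split_lines` and its `train_fraction` parameter are hypothetical names that do not appear in the notebook.

```python
def split_lines(text, train_fraction=0.8):
    """Split text into train/test parts at a line boundary.

    Hypothetical helper: every line (one snippet per line) stays whole.
    """
    lines = text.splitlines(keepends=True)
    cut = int(train_fraction * len(lines))
    return "".join(lines[:cut]), "".join(lines[cut:])


# Toy corpus standing in for the scraped snippets.
sample = "class A {}\nclass B {}\nclass C {}\nclass D {}\nclass E {}\n"
train_part, test_part = split_lines(sample)
print(repr(train_part))
print(repr(test_part))
```

With five one-line snippets, the first four go to the training part and the last one to the test part.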

In [6]:
path = ""

with open(path + "train_text.txt", "w") as f:
  f.write(train_text)

with open(path + "test_text.txt", "w") as f:
  f.write(test_text)

We are now ready to fine-tune the GPT-2 model on our Java code dataset. To do so, we use the Hugging Face Transformers library, which provides easy-to-use interfaces for working with pre-trained language models and a straightforward way to fine-tune them on custom datasets. The model class we use is `GPT2LMHeadModel`, which is the GPT-2 model with a language modeling head. The `gpt2` checkpoint loaded below is the smallest GPT-2 variant, with roughly 124M parameters, and has been pre-trained on a large corpus of text data.

In [3]:
from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers import Trainer, TrainingArguments

import torch

In [4]:
def load_dataset(file_path, tokenizer, block_size = 128):
    dataset = TextDataset(
        tokenizer = tokenizer,
        file_path = file_path,
        block_size = block_size,
    )
    return dataset


def load_data_collator(tokenizer, mlm = False):
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=mlm,
    )
    return data_collator
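
To make the role of `block_size` concrete, the sketch below mimics how a long sequence of token ids can be cut into fixed-length training examples. It is a simplified illustration that operates on a plain list of integers, not the library's code; `chunk_token_ids` is a hypothetical name, and dropping the trailing partial block is an assumption of this sketch.

```python
def chunk_token_ids(token_ids, block_size=128):
    """Cut a flat list of token ids into consecutive blocks of block_size.

    Assumption of this sketch: a trailing block shorter than block_size
    is dropped, so every training example has the same length.
    """
    blocks = []
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        blocks.append(token_ids[start:start + block_size])
    return blocks


toy_ids = list(range(10))  # stand-in for the tokenized corpus
for block in chunk_token_ids(toy_ids, block_size=4):
    print(block)
```

In this toy run the ten ids yield two blocks of four and the last two ids are discarded. With `block_size = 128` and a batch size of 8, each optimizer step would see at most 1024 tokens.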

The training process feeds the model Java code snippets and trains it to predict the next token in the sequence. In doing so, the model learns the underlying patterns and structures of Java code, which enables it to generate new snippets that resemble the training data. This is an autoregressive training process: we show the model how Java code is written through examples from our dataset, without any task-specific objective.
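
The next-token objective can be illustrated on a toy example. The sketch below computes the average negative log-likelihood that a model assigns to each token given the preceding ones, which is the kind of quantity reported as `loss` during training. The probability table is invented for illustration and `next_token_loss` is a hypothetical helper; a real model derives these probabilities from its logits.

```python
import math


def next_token_loss(tokens, predict):
    """Average negative log-likelihood of each token given its prefix.

    predict(prefix) returns a dict mapping candidate next tokens to
    probabilities; position i is scored using tokens[:i] as context.
    """
    losses = []
    for i in range(1, len(tokens)):
        probs = predict(tokens[:i])
        losses.append(-math.log(probs[tokens[i]]))
    return sum(losses) / len(losses)


def toy_predict(prefix):
    # Invented probabilities: a "model" that only looks at the last token.
    table = {
        "public": {"class": 0.5, "static": 0.5},
        "class": {"Main": 0.25, "{": 0.75},
        "Main": {"{": 1.0},
    }
    return table[prefix[-1]]


loss = next_token_loss(["public", "class", "Main", "{"], toy_predict)
print(round(loss, 4))
```

The loss is low when the model puts high probability on the token that actually comes next, and training adjusts the weights to push this average down over the whole dataset.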

In [5]:
def train(train_file_path,model_name,
          output_dir,
          overwrite_output_dir,
          per_device_train_batch_size,
          num_train_epochs,
          save_steps):
  tokenizer = GPT2Tokenizer.from_pretrained(model_name)
  tokenizer.save_pretrained(output_dir)

  train_dataset = load_dataset(train_file_path, tokenizer)
  data_collator = load_data_collator(tokenizer)

  model = GPT2LMHeadModel.from_pretrained(model_name)
  
  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
  model.to(device)
  model.save_pretrained(output_dir)

  training_args = TrainingArguments(
          output_dir=output_dir,
          overwrite_output_dir=overwrite_output_dir,
          per_device_train_batch_size=per_device_train_batch_size,
          num_train_epochs=num_train_epochs,
          save_steps=save_steps,
      )
  
  print(model.device)
  trainer = Trainer(
          model=model,
          args=training_args,
          data_collator=data_collator,
          train_dataset=train_dataset,
  )

  trainer.train()
  trainer.save_model()

We can now proceed with the fine-tuning process. We run 5 epochs of training with a batch size of 8, which should be enough to capture the patterns in the Java code given the amount of data available. Note that `save_steps` is set to 0 below, so we rely on the final `trainer.save_model()` call, rather than on intermediate checkpoints, to store the fine-tuned model.

In [7]:
train_file_path = "train_text.txt"
model_name = 'gpt2'

output_dir = 'models/custom_gpt2'
overwrite_output_dir = True
per_device_train_batch_size = 8
num_train_epochs = 5
save_steps = 0

In [11]:
train(
    train_file_path=train_file_path,
    model_name=model_name,
    output_dir=output_dir,
    overwrite_output_dir=overwrite_output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    num_train_epochs=num_train_epochs,
    save_steps=save_steps
)



cuda:0


 19%|█▊        | 500/2695 [08:29<43:06,  1.18s/it]

{'loss': 1.3206, 'grad_norm': 2.9556143283843994, 'learning_rate': 4.072356215213358e-05, 'epoch': 0.93}


 37%|███▋      | 1000/2695 [23:05<32:16,  1.14s/it] 

{'loss': 0.978, 'grad_norm': 2.8398938179016113, 'learning_rate': 3.144712430426716e-05, 'epoch': 1.86}


 56%|█████▌    | 1500/2695 [34:42<31:18,  1.57s/it]

{'loss': 0.8559, 'grad_norm': 2.53639554977417, 'learning_rate': 2.2170686456400745e-05, 'epoch': 2.78}


 74%|███████▍  | 2000/2695 [48:35<20:01,  1.73s/it]

{'loss': 0.7798, 'grad_norm': 4.038917541503906, 'learning_rate': 1.2894248608534323e-05, 'epoch': 3.71}


 93%|█████████▎| 2500/2695 [1:03:11<05:43,  1.76s/it]

{'loss': 0.7252, 'grad_norm': 2.581815242767334, 'learning_rate': 3.6178107606679037e-06, 'epoch': 4.64}


100%|██████████| 2695/2695 [1:08:59<00:00,  1.54s/it]


{'train_runtime': 4139.586, 'train_samples_per_second': 5.201, 'train_steps_per_second': 0.651, 'train_loss': 0.91506334067718, 'epoch': 5.0}


As the log shows, training is computationally intensive: the five epochs took about 69 minutes on the laptop GPU described above. We therefore recommend running this notebook on a machine with a GPU. The last logged training loss is 0.7252 (epoch 4.64), down from 1.3206 at epoch 0.93, which suggests that the model is picking up Java code patterns. We check this in the next section.
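
The figures in the log can be cross-checked with a little arithmetic. The sketch below assumes the `Trainer` defaults that the training cell does not override, namely an initial learning rate of 5e-5 decayed linearly to zero with no warmup. Under that assumption it reproduces the logged learning rates and derives the number of steps per epoch. The helper name `linear_decay_lr` is hypothetical.

```python
import math


def linear_decay_lr(step, total_steps, initial_lr=5e-5):
    """Learning rate after `step` optimizer steps of linear decay to zero.

    Assumes no warmup and an initial rate of 5e-5.
    """
    return initial_lr * (total_steps - step) / total_steps


total_steps = 2695  # from the progress bar in the log
epochs = 5
batch_size = 8

steps_per_epoch = total_steps // epochs
print(steps_per_epoch)                    # optimizer steps per epoch
print(steps_per_epoch * batch_size)       # upper bound on 128-token blocks
print(linear_decay_lr(500, total_steps))  # compare with the first log entry
print(math.exp(-0.7252))                  # mean per-token probability at the last logged loss
```

The run has 539 steps per epoch, so the training file yields at most 4312 blocks of 128 tokens. A loss of 0.7252 corresponds to a geometric-mean probability of about 0.48 for the correct next token.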

# Model Evaluation

We can now test the model we have just trained. We provide the model with a prompt, which is an incomplete Java code snippet, and ask it to generate the next tokens to complete the code. By examining the generated output, we can evaluate the model's ability to generate Java-like code that follows the syntax and structure of the Java programming language.

In [8]:
def load_model(model_path):
    model = GPT2LMHeadModel.from_pretrained(model_path)
    return model


def load_tokenizer(tokenizer_path):
    tokenizer = GPT2Tokenizer.from_pretrained(tokenizer_path)
    return tokenizer


def generate_text(model_path, sequence, extra_length):
    model = load_model(model_path).to(torch.device("cuda"))
    model.requires_grad_(False)
    tokenizer = load_tokenizer(model_path)

    ids = tokenizer.encode(f'{sequence}', return_tensors='pt').to(torch.device("cuda"))
    final_outputs = model.generate(
        ids,
        do_sample=True,
        # `ids` has shape (1, prompt_length), so the prompt length is ids.shape[1], not len(ids)
        max_length=ids.shape[1] + extra_length,
        pad_token_id=model.config.eos_token_id,
        top_k=50,
        top_p=0.95,
    )

    print(tokenizer.decode(final_outputs[0], skip_special_tokens=True))
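
The `top_k=50` and `top_p=0.95` arguments restrict which tokens can be sampled at each step. The sketch below shows the idea on a toy distribution. It is a simplified illustration and not the library's implementation; `filter_top_k_top_p` is a hypothetical name and the probabilities are invented.

```python
def filter_top_k_top_p(probs, top_k, top_p):
    """Restrict a next-token distribution with top-k, then top-p (nucleus).

    probs maps tokens to probabilities. Simplified sketch: keep the top_k
    most likely tokens, then the smallest prefix of them whose cumulative
    (renormalized) probability reaches top_p, and renormalize the result.
    """
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    ranked = [(tok, p / total) for tok, p in ranked]

    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


# Invented distribution over the token that follows "public class Main".
toy = {"{": 0.5, "extends": 0.3, "implements": 0.15, "banana": 0.05}
filtered = filter_top_k_top_p(toy, top_k=3, top_p=0.8)
print(filtered)
```

Here `top_k=3` discards the least likely token and `top_p=0.8` then keeps only the two most likely ones. Sampling from the filtered distribution keeps some variety while avoiding very unlikely continuations.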

Feel free to modify the `sequence` variable to test the model yourself:

In [11]:
sequence = "public class"
generate_text(output_dir, sequence, extra_length=20)

public class EndpointRepository { private final WebEndpointWebServer host; @Override public String to


In [10]:
sequence = "public static void"
generate_text(output_dir, sequence, extra_length=20)

public static void main(String[] args) { ConfigurableObjectMapper config = new ConfigurableObject


In [12]:
sequence = "@Override"
generate_text(output_dir, sequence, extra_length=20)

@Override public static final Map<string, string>
 /* * Copyright 2012-2024 the original


In [13]:
sequence = "for (String link :"
generate_text(output_dir, sequence, extra_length=20)

for (String link : String) { this.link = link; } void update(HttpServlet


As we can see, the model generates text that resembles Java, although the output is not always syntactically correct. Has the model nevertheless learned the patterns of Java code? To find out, we compare its output with that of the original GPT-2 model on the same prompts. The comparison shows that the fine-tuned model has indeed learned some Java code patterns.

In [31]:
from transformers import pipeline, set_seed
generator = pipeline('text-generation', model='gpt2', device=torch.device("cuda"))



In [41]:
sequence = "public class"
generator(sequence, max_length=30, num_return_sequences=1)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'public class Bool is a constructor function that provides an argument that you can use to define new value types for your classes:\n\nclass Bool'}]

Another example follows:

In [65]:
sequence = "new Hash"

print("RePylot generation:")
generate_text(output_dir, sequence, extra_length=20)

print("\nGPT-2 generation:")
generator(sequence, max_length=30, num_return_sequences=1)

RePylot generation:


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


new HashMap<string, string>
 /* * Copyright 2012-2023 the original author or authors

GPT-2 generation:


[{'generated_text': "new Hash-Entry `initiate_hash_entry'); fn start(_ & mut self, hash: & mut T) -> Self { let"}]

### Improving the Dataset

The model we just trained shows some potential for generating Java-like code, but there is still room for improvement. In particular, it does not yet reproduce line breaks and indentation. Whitespace is not syntactically significant in Java (braces delimit the blocks), but consistent indentation is what makes the structure and hierarchy of the code readable, and it is what a user expects from generated code. This problem comes directly from the training data: the snippets in the first dataset are flattened to a single line and contain no indentation. To overcome this limitation, we rebuild the dataset so that each file keeps its original line breaks and indentation, which lets the model learn the usual layout of Java code.

In [7]:
dataset_path = "./autoreg_data"

In [73]:
def read_file(path):
  content = ""
  try:
      with open(path) as f:
        for line in f.readlines():
          content += line
  except:
    pass

  return content

In [None]:
text_data = read_files(dataset_path)

In [75]:
with open(path + "train_text_lines.txt", "w") as f:
  f.write(text_data)

We have now prepared a new dataset that preserves indentation in the Java code snippets. We use it to fine-tune the GPT-2 model once again, with the goal of improving the layout of the generated code. The training process is similar to the previous one, but we reduce the number of epochs to 2.5, since the model has already been trained on Java code patterns and we are only focusing on the indentation aspect.

In [14]:
train_file_path = "train_text_lines.txt"
model_name = 'models/custom_gpt2'

output_dir = 'models/custom_gpt2_10'
overwrite_output_dir = True
per_device_train_batch_size = 8
num_train_epochs = 2.5
save_steps = 0

In [9]:
train(
    train_file_path=train_file_path,
    model_name=model_name,
    output_dir=output_dir,
    overwrite_output_dir=overwrite_output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    num_train_epochs=num_train_epochs,
    save_steps=save_steps
)



cuda:0


 10%|█         | 500/4775 [11:31<1:51:16,  1.56s/it]

{'loss': 1.5515, 'grad_norm': 1.981380581855774, 'learning_rate': 4.4764397905759164e-05, 'epoch': 0.26}


 21%|██        | 1000/4775 [24:54<1:40:48,  1.60s/it]

{'loss': 1.3418, 'grad_norm': 1.7969576120376587, 'learning_rate': 3.9528795811518326e-05, 'epoch': 0.52}


 31%|███▏      | 1500/4775 [38:15<1:26:39,  1.59s/it]

{'loss': 1.2527, 'grad_norm': 2.2382631301879883, 'learning_rate': 3.429319371727749e-05, 'epoch': 0.79}


 42%|████▏     | 2000/4775 [51:30<1:14:11,  1.60s/it]

{'loss': 1.1795, 'grad_norm': 1.9175386428833008, 'learning_rate': 2.905759162303665e-05, 'epoch': 1.05}


 52%|█████▏    | 2500/4775 [1:04:55<1:01:33,  1.62s/it]

{'loss': 1.0933, 'grad_norm': 1.9383200407028198, 'learning_rate': 2.382198952879581e-05, 'epoch': 1.31}


 63%|██████▎   | 3000/4775 [1:18:30<48:08,  1.63s/it]  

{'loss': 1.0836, 'grad_norm': 1.7251787185668945, 'learning_rate': 1.8586387434554976e-05, 'epoch': 1.57}


 73%|███████▎  | 3500/4775 [1:31:56<34:16,  1.61s/it]

{'loss': 1.0489, 'grad_norm': 1.8688249588012695, 'learning_rate': 1.3350785340314136e-05, 'epoch': 1.83}


 84%|████████▍ | 4000/4775 [1:45:32<21:15,  1.65s/it]

{'loss': 1.0305, 'grad_norm': 2.081165075302124, 'learning_rate': 8.115183246073298e-06, 'epoch': 2.09}


 94%|█████████▍| 4500/4775 [1:58:56<07:22,  1.61s/it]

{'loss': 1.0004, 'grad_norm': 1.6340305805206299, 'learning_rate': 2.879581151832461e-06, 'epoch': 2.36}


100%|██████████| 4775/4775 [2:06:32<00:00,  1.59s/it]


{'train_runtime': 7592.838, 'train_samples_per_second': 5.03, 'train_steps_per_second': 0.629, 'train_loss': 1.1648007441815282, 'epoch': 2.5}


The samples below show that the model has learned the indentation patterns of Java code: the generated snippets now span several lines and are mostly indented consistently. This is a clear improvement over the previous model in terms of formatting and block structure, although the output is still not guaranteed to compile.

In [15]:
sequence = "public class Main implement"
generate_text('models/custom_gpt2_10', sequence, extra_length=70)

public class Main implement RequestForActionListener {

    @Override
    protected void request(
         Request request) {
          assertThat(request.getStatusLine().getStatusCode(), is(200));
       


In [16]:
sequence = "public static int"
generate_text('models/custom_gpt2_10', sequence, extra_length=120)

public static int indexNodes() {
        int i = 0;
        boolean hasNoKey = false;
        if(randomBoolean()) {
            hasNoKey = true;
         }

         char[] nodeName = randomCharArray(0, randomIntBetween(0, 100));

        // TODO: add additional


In [17]:
sequence = "ArrayList<String> list = "
generate_text('models/custom_gpt2_10', sequence, extra_length=50)

ArrayList<String> list =   new ArrayList<>();
        for(int i = 0; i < s.length; i++) {
           


# Instruction Tuning

Once our model has been trained on Java code snippets with proper indentation in an autoregressive manner, we can further fine-tune it to generate Java code snippets based on specific instructions. This process involves providing the model with a prompt that includes a specific instruction or requirement, and then asking it to generate Java code that fulfills that instruction. By fine-tuning the model on a set of instruction-code pairs, we can teach it to generate Java code that meets specific criteria or follows certain guidelines.

In this section, we fine-tune the model on a dataset of instruction-code pairs, where each pair consists of an instruction and the Java code snippet that fulfills it. Inspired by the way Copilot works, we decided that the most natural way for a user to specify these instructions is to write a comment in the code itself. Describing what the code should do in a comment is already common practice in software development.

The dataset used in this section contains a single instruction-code template, so that we can test the model's ability to learn one specific instruction and generate Java code that fulfills it. The instruction asks for a method that prints the text specified in the comment. An example of an instruction-code pair in the dataset is:

```java
// Create a method that prints "Hello, World!"
public void printHelloWorld() {
    System.out.println("Hello, World!");
} </s>
```

Here, `//` works as a start token for the instruction, and `</s>` works as an end token for the generated code. Once the model has produced the code requested by the instruction, it emits the end token to signal that the instruction has been fulfilled.
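
The notebook does not show how the instruction-code pairs were produced. As a rough illustration of the template, the sketch below builds one example from a message. The function `make_example` and its naming rule (the alphanumeric words of the message joined in camel case after `print`) are assumptions made for this sketch, not the authors' generator.

```python
import re


def make_example(message):
    """Build one training example: comment, method, end marker.

    Hypothetical generator; the method name is derived from the message
    by keeping its alphanumeric words and joining them in camel case.
    """
    words = re.findall(r"[A-Za-z0-9]+", message)
    method_name = "print" + "".join(w.capitalize() for w in words)
    return (
        f'// Create a method that prints "{message}"\n'
        f"public void {method_name}() {{\n"
        f'    System.out.println("{message}");\n'
        f"}} </s>"
    )


print(make_example("Hello, World!"))
```

Messages containing double quotes or backslashes would need escaping before being placed inside the Java string literal, which this sketch does not handle.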

In [37]:
train_file_path = "mock_data_instruct.jsonl"
model_name = 'models/custom_gpt2_10'

output_dir = 'models/custom_gpt2_instruct_comments'
overwrite_output_dir = True
per_device_train_batch_size = 8
num_train_epochs = 2.5
save_steps = 0

In [38]:
train(
    train_file_path=train_file_path,
    model_name=model_name,
    output_dir=output_dir,
    overwrite_output_dir=overwrite_output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    num_train_epochs=num_train_epochs,
    save_steps=save_steps
)



cuda:0


  0%|          | 0/103 [00:00<?, ?it/s]

{'train_runtime': 62.247, 'train_samples_per_second': 12.892, 'train_steps_per_second': 1.655, 'train_loss': 0.8876300552516307, 'epoch': 2.51}


After 2.5 epochs of training, let's test the model with some instructions.

**(*) NOTE:** We generated the instructions from a single comment template. We did this to test whether the model can infer the instruction from slightly different comments it has not seen before, relying on the generalization capabilities acquired during the base training. The model could also be trained with multiple instruction templates to generate more complex code snippets.

In [39]:
def generate_text(model_path, sequence, extra_length):
    model = load_model(model_path).to(torch.device("cuda"))
    model.requires_grad_(False)
    tokenizer = load_tokenizer(model_path)

    ids = tokenizer.encode(f'{sequence}', return_tensors='pt').to(torch.device("cuda"))
    final_outputs = model.generate(
        ids,
        do_sample=True,
        # `ids` has shape (1, prompt_length), so the prompt length is ids.shape[1], not len(ids)
        max_length=ids.shape[1] + extra_length,
        pad_token_id=model.config.eos_token_id,
        top_k=50,
        top_p=0.95,
    )

    return tokenizer.decode(final_outputs[0], skip_special_tokens=True)

In [51]:
sequence = "// A function printing transparente"
text = generate_text('models/custom_gpt2_instruct_comments', sequence, extra_length=70)
answer = text.split("</s>")[0]
print(answer)

// A function printing transparente
public void transparente() { System.out.println("transparente"); } 


In [99]:
sequence = "// A function that shows My Dog is Cute"
text = generate_text('models/custom_gpt2_instruct_comments', sequence, extra_length=70)
answer = text.split("</s>")[0]
answer

'// A function that shows My Dog is Cute! by Stephane Nicoll\npublic void myDotcalcute() { System.out.println("my dog is cutes"); } '

In [59]:
sequence = "// A function that prints Cayetano explaining NLP"
text = generate_text('models/custom_gpt2_instruct_comments', sequence, extra_length=70)
answer = text.split("</s>")[0]
print(answer)

// A function that prints Cayetano explaining NLP
public void Cayetano() { System.out.println("cayetano explaining nLP"); } 


In [63]:
sequence = "// Haz una función que imprima Cayetano put us a 10 on NLP"
text = generate_text('models/custom_gpt2_instruct_comments', sequence, extra_length=70)
answer = text.split("</s>")[0]
print(answer)

// Haz una función que imprima Cayetano put us a 10 on NLP
public void Cayetano() { System.out.println("cayetano put us a 10 on NLP"); } 


In [70]:
sequence = "// Haz una función que imprima Óscar and Ricardo did a great job!"
text = generate_text('models/custom_gpt2_instruct_comments', sequence, extra_length=70)
answer = text.split("</s>")[0]
print(answer)

// Haz una función que imprima Óscar and Ricardo did a great job!
public void ÓscarAndRico() { System.out.println("áscar and Ricardo did a great job!"); } 


In [83]:
sequence = "// A function showing to the user Hello!"
text = generate_text('models/custom_gpt2_instruct_comments', sequence, extra_length=70)
answer = text.split("</s>")[0]
print(answer)

// A function showing to the user Hello!
public void hello() { System.out.println("hello"); } 


In [96]:
sequence = "// Función que saque Hola mundo! por pantalla"
text = generate_text('models/custom_gpt2_instruct_comments', sequence, extra_length=70)
answer = text.split("</s>")[0]
print(answer)

// Función que saque Hola mundo! por pantalla!
public void humora() { System.out.println("hola mundo!"); } 


### Model Evaluation

The instruction-tuned model seems to work, but not equally well in all cases. As the next test shows, when asked to create a method that prints a string it has not seen before, the model generates the correct code only 55.1% of the time. This indicates that the model does not yet generalize well to unseen data.

In [157]:
import regex as re

In [166]:
def check_print(answer, word):
    inside = re.search(r"System\.out\.println\((.*)\)", answer)

    if inside:
        inside = inside.group(1).replace("\"", "").lower()

        if word.lower() in inside:
            return True
        
    return False
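
To make the scoring rule concrete, the sketch below repeats `check_print` using the standard-library `re` module (an assumption made here so that the block is self-contained) and applies it to three generations taken from the evaluation output further down. The check is lenient: it is case-insensitive and only requires the word to appear somewhere inside the printed string.

```python
import re


def check_print(answer, word):
    """True if `word` occurs (case-insensitively) in the println argument."""
    inside = re.search(r"System\.out\.println\((.*)\)", answer)
    if inside:
        printed = inside.group(1).replace('"', "").lower()
        return word.lower() in printed
    return False


samples = [
    ('public void ricardos() { System.out.println("ricardo"); }', "ricardo"),
    ('public void aGreat() { System.out.println("a great"); }', "great"),
    ('public void Oscar() { System.out.println("obraza"); }', "Oscar"),
]
results = [check_print(answer, word) for answer, word in samples]
for (answer, word), ok in zip(samples, results):
    print(word, ok)
```

The second sample passes even though the printed text is `a great` rather than `great`, so the reported accuracy measures whether the requested word is printed, not whether the output matches exactly.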

In [201]:
words = ["cayetano", "Oscar", "ricardo", "dog", "cute", "great", "job", "hello", "world", "pantalla", "mundo", "hola", "funcion", "imprime", "explicando", "transparente", "us", "NLP", "10", "put", "shows", "user", "explaining", "imprimir", "pantalla", "imprime", "explicando", "usuario", "muestra"]

In [168]:
count = 0

for word in words:
    print(f"Word: {word}")
    print("GPT2 generation:")
    sequence = f"// A function that prints {word}\n"
    text = generate_text('models/custom_gpt2_instruct_comments', sequence, extra_length=70)
    answer = text.split("</s>")[0]
    print(answer)

    valid_print = check_print(answer, word)
    print(f"Valid print: {valid_print}")
    if valid_print:
        count += 1

    print("\n")

Word: cayetano
GPT2 generation:
// A function that prints cayetano
public void acesecu() { System.out.println("acesecu"); } 
Valid print: False


Word: Oscar
GPT2 generation:
// A function that prints Oscar
public void Oscar() { System.out.println("obraza"); } 
Valid print: False


Word: ricardo
GPT2 generation:
// A function that prints ricardo
public void ricardos() { System.out.println("ricardo"); } 
Valid print: True


Word: dog
GPT2 generation:
// A function that prints dog
public void printIsafable() { System.out.println("isafable"); } 
Valid print: False


Word: cute
GPT2 generation:
// A function that prints cute
public void cute() { System.out.println("cadorado"); } 
Valid print: False


Word: great
GPT2 generation:
// A function that prints great
public void aGreat() { System.out.println("a great"); } 
Valid print: True


Word: job
GPT2 generation:
// A function that prints job
public void job() { System.out.println("job"); } 
Valid print: True


Word: hello
GPT2 generation:


In [3]:
acc = count/len(words)
print(f'The model generated {acc*100}% of the methods correctly')

The model generated 55.1% of the methods correctly


### Improving the Dataset

To improve the model's performance, we increase the number of samples in the dataset. We use a dictionary of around 20 thousand words, most of them English and a few Spanish, and combine these words to generate many more instructions with more complex prompts.
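
The notebook does not include the script that built the larger dataset. One way to combine dictionary words into longer messages is sketched below; `build_messages`, its parameters, and the tiny vocabulary are all hypothetical and only illustrate the idea.

```python
import random


def build_messages(vocabulary, count, max_words=4, seed=0):
    """Draw `count` messages, each made of 1..max_words distinct random words.

    Hypothetical sketch of combining dictionary words into longer prompts.
    """
    rng = random.Random(seed)
    messages = []
    for _ in range(count):
        size = rng.randint(1, max_words)
        messages.append(" ".join(rng.sample(vocabulary, size)))
    return messages


# Tiny stand-in for the dictionary of around 20 thousand words.
vocabulary = ["hello", "world", "hola", "mundo", "dog", "great", "job"]
for message in build_messages(vocabulary, count=3):
    print(f"// A function that prints {message}")
```

Fixing the seed makes the generated dataset reproducible, and each message could then be turned into a full instruction-code pair with the comment template described earlier.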

In [178]:
train_file_path = "grand_mock_data_instruct.jsonl"
model_name = 'models/custom_gpt2_instruct_comments'

output_dir = 'models/custom_gpt2_instruct_comments_improved'
overwrite_output_dir = True
per_device_train_batch_size = 8
num_train_epochs = 5
save_steps = 0

In [179]:
train(
    train_file_path=train_file_path,
    model_name=model_name,
    output_dir=output_dir,
    overwrite_output_dir=overwrite_output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    num_train_epochs=num_train_epochs,
    save_steps=save_steps
)



cuda:0


  0%|          | 0/1955 [00:00<?, ?it/s]

{'loss': 1.0146, 'grad_norm': 1.6153161525726318, 'learning_rate': 3.721227621483376e-05, 'epoch': 1.28}
{'loss': 0.8492, 'grad_norm': 1.2364546060562134, 'learning_rate': 2.442455242966752e-05, 'epoch': 2.56}
{'loss': 0.7789, 'grad_norm': 1.2446939945220947, 'learning_rate': 1.163682864450128e-05, 'epoch': 3.84}
{'train_runtime': 3449.962, 'train_samples_per_second': 4.53, 'train_steps_per_second': 0.567, 'train_loss': 0.8483959305316896, 'epoch': 5.0}


As a result, we obtain better generations: the model now generates the correct code 93.1% of the time, and the remaining 6.9% of cases are related to typos. This suggests that the model could learn even more complex instructions if we provided it with a more diverse dataset.

In [302]:
sequence = "// Announce We are going to pass PLN with flying colors!\n"
text = generate_text('models/custom_gpt2_instruct_comments_improved', sequence, extra_length=70)
answer = text.split("</s>")[0]
print(answer)

// Announce We are going to pass PLN with flying colors!
public void announceWeareGoingToPassPlNwithFlyingComesUnrighted() {
	System.out.println("We are going to pass Pln with flying colors!");
} 


In [300]:
sequence = "// Print En un lugar de la mancha de cuyo nombre no quiero acordarme, a sentence from El Quijote\n"
text = generate_text('models/custom_gpt2_instruct_comments_improved', sequence, extra_length=150)
answer = text.split("</s>")[0]
print(answer)

// Print En un lugar de la mancha de cuyo nombre no quiero acordarme, a sentence from El Quijote
public void en unlugarDeLaManchaNombreNoQuieroAclarmeAclotharme() {
	System.out.println("En un lugar de la Mancha de Cuyo nombre no Quiero Acordarme");
} 


In [257]:
sequence = "// Lo que quieras\n"
text = generate_text('models/custom_gpt2_instruct_comments_improved', sequence, extra_length=150)
answer = text.split("</s>")[0]
print(answer)

// Lo que quieras
public void loqueQuieras() {
	System.out.println("Loque Quieras");
} 


In [202]:
count = 0

for word in words:
    print(f"Word: {word}")
    print("GPT2 generation:")
    sequence = f"// A function that prints {word}\n"
    text = generate_text('models/custom_gpt2_instruct_comments_improved', sequence, extra_length=70)
    answer = text.split("</s>")[0]
    print(answer)

    valid_print = check_print(answer, word)
    print(f"Valid print: {valid_print}")
    if valid_print:
        count += 1

    print("\n")

Word: cayetano
GPT2 generation:
// A function that prints cayetano
public void cayetano() {
	System.out.println("Cayetano");
} 
Valid print: True


Word: Oscar
GPT2 generation:
// A function that prints Oscar
public void Oscar() {
	System.out.println("Oscar");
} 
Valid print: True


Word: ricardo
GPT2 generation:
// A function that prints ricardo
public void ricario() {
	System.out.println("Ricario");
} 
Valid print: False


Word: dog
GPT2 generation:
// A function that prints dog
public void dog() {
	System.out.println("Dog");
} 
Valid print: True


Word: cute
GPT2 generation:
// A function that prints cute
public void cute() {
	System.out.println("Cute");
} 
Valid print: True


Word: great
GPT2 generation:
// A function that prints great
public void great() {
	System.out.println("Great");
} 
Valid print: True


Word: job
GPT2 generation:
// A function that prints job
public void job() {
	System.out.println("Job");
} 
Valid print: True


Word: hello
GPT2 generation:
// A function that

In [271]:
acc = count/len(words)
print(f'The model predicts {round(acc*100,2)}% of the functions correctly')

The model predicts 93.1% of the functions correctly
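
Because `words` contains 29 prompts, each prompt accounts for roughly 3.4 percentage points, so the reported accuracies map to whole numbers of successes. The short check below is only arithmetic on the figures above; the success counts are inferred from the percentages and are not read from the notebook output.

```python
total = 29  # len(words) in the evaluation cells

# Candidate success counts inferred from the reported percentages.
accuracies = {successes: round(successes / total * 100, 2) for successes in (16, 27)}
for successes, accuracy in accuracies.items():
    print(successes, accuracy)
```

27 correct prompts out of 29 gives 93.1%, which means the improved model misses 2 prompts. 16 out of 29 gives 55.17%, the closest match to the 55.1% reported for the first instruction-tuned model.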
