## Validating code translation with CLDK

In this tutorial, you will use CLDK to translate code and check properties of the translated code. Along the way, you'll see how CLDK makes program analysis for this task quick and easy. By the end of this tutorial, you will have implemented a simple Java-to-Python code translator that also performs lightweight property checking on the translated code.

Specifically, you will learn how to perform the following tasks on a Java application to create LLM prompts for code translation and checking the translated code:

1. Create a new instance of the CLDK class.
2. Create an analysis object for the target Java application.
3. Iterate over all files in the application.
4. Iterate over all classes in a file.
5. Sanitize the class for prompting the LLM.
6. Create Tree-sitter-based Java and Python analysis objects.

In [None]:
%%bash
python3 -m venv .venv
source .venv/bin/activate
pip install -U -r requirements.txt

## Let's set up our LLM

We'll be using OpenRouter, so we'll load the API key from the environment variable `OPENROUTER_API`.

In [None]:
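## Import API keys
import os
from dotenv import load_dotenv

# Load environment variables from the .env file in the current working directory
load_dotenv(dotenv_path=os.path.join(os.getcwd(), ".env"), override=True)

api_key = os.getenv("OPENROUTER_API")
if not api_key:
    raise RuntimeError("OPENROUTER_API is not set; add it to your .env file.")

print("API keys loaded successfully.")
# Show only the first and last three characters so the key is not leaked in the output
print(f"API_KEY: {api_key[:3]}...{api_key[-3:]}")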

#### Let's create a simple prompting function

This function will take a prompt and return the response from the OpenRouter API.

In [None]:
from openai import OpenAI


def prompt(message: str) -> str:
    """
    Send a single user message to the model via OpenRouter and return its reply.
    """
    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key=os.getenv("OPENROUTER_API"),  # OpenRouter API key
    )
    completion = client.chat.completions.create(
        model="meta-llama/llama-3.2-3b-instruct:free", messages=[{"role": "user", "content": message}]
    )

    return completion.choices[0].message.content

def test_prompt():
    """
    Test function to check if the prompt function works correctly.
    """
    test_message = "What is the capital of France?"
    response = prompt(test_message)
    
    assert "Paris" in response, f"Expected response to contain 'Paris', but got '{response}'"

test_prompt()

## Translating Java code to Python

We'll start by downloading Apache Commons CLI for this tutorial.

In [None]:
%%bash
COMMONS=commons-cli-1.7.0  
wget https://github.com/apache/commons-cli/archive/refs/tags/rel/$COMMONS.zip -O $COMMONS.zip && \
unzip -o $COMMONS.zip && \
rm -f $COMMONS.zip 

Next, let's create another helper function that formulates the prompt for translating a Java class to Python.

In [None]:
def format_inst(code, focal_class, language):
    """
    Format the LLM instruction for translating the given class to Python.
    """
    inst = f"Translate the Java class `{focal_class}` below to Python.\n"
    inst += "Return the translated code inside a single ``` code block.\n"
    inst += f"```{language}\n"
    inst += code
    inst += "```" if code.endswith("\n") else "\n```"
    inst += "\n"
    return inst
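
To make the shape of the prompt concrete, here is a small self-contained sketch. It uses a condensed copy of `format_inst` (the instruction wording is shortened) and a hypothetical one-line `Greeter` class that is not part of Commons CLI. The point it illustrates is that the closing fence always ends up on its own line, whether or not the source ends with a newline.

```python
FENCE = "`" * 3  # three backticks, built this way so the snippet is easy to embed


def format_inst(code, focal_class, language):
    """Condensed copy of the helper above (instruction wording shortened)."""
    inst = f"Translate the Java class `{focal_class}` below to Python.\n"
    inst += f"{FENCE}{language}\n"
    inst += code
    # Put the closing fence on its own line whether or not the code ends with a newline
    inst += FENCE if code.endswith("\n") else "\n" + FENCE
    inst += "\n"
    return inst


# Hypothetical one-line Java class, used only for illustration
java_source = 'public class Greeter { String greet() { return "hi"; } }'
print(format_inst(code=java_source, focal_class="Greeter", language="java"))
```

The language tag after the opening fence tells the model which language the input is written in; the target language is stated in the instruction sentence.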

### Putting it all together

Now we combine the pieces to translate a class and check the result.

We go through all the classes in the application, and for the target class,
   1. We locate the Java source file of the class and read its content.
   2. We remove all comments from the source with `TreesitterJava` to keep the prompt short.
   3. We formulate the translation prompt and send it to the LLM.
   4. We analyze the translated Python code with `TreesitterPython` and compare its method and field counts with those of the original Java class.

> **NOTE:** For the sake of simplicity, we run the translation on a single class, but this filter can be removed to run this code over the entire application.
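
The prompt built by `format_inst` asks the model to answer inside a fenced code block, so the reply typically contains Markdown fences and sometimes surrounding prose. Passing that raw reply to a parser may distort the counts. The following is a minimal sketch, assuming a hypothetical helper named `extract_code` (it is not part of CLDK), that pulls out just the code from a reply.

```python
import re

FENCE = "`" * 3  # three backticks
# A fenced block: opening fence, optional language tag, newline, body, closing fence
_FENCED_BLOCK = re.compile(
    FENCE + r"[ \t]*([A-Za-z0-9_+-]*)[ \t]*\n(.*?)" + FENCE, re.DOTALL
)


def extract_code(reply: str, language: str = "python") -> str:
    """Return the code inside the reply's fenced block, or the reply itself if it has none."""
    blocks = _FENCED_BLOCK.findall(reply)
    if not blocks:
        return reply
    # Prefer a block tagged with the requested language; otherwise take the first block
    for tag, body in blocks:
        if tag.lower() == language:
            return body
    return blocks[0][1]


reply = f"Here is the translation:\n{FENCE}python\nclass A:\n    pass\n{FENCE}\nHope this helps."
print(extract_code(reply))
```

In the cell below, one way to apply this would be `translated_code = extract_code(prompt(message=inst))`, so that only the code reaches the Python analysis.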

In [None]:
target_class = "org.apache.commons.cli.GnuParser"

In [None]:
from cldk import CLDK
from cldk.analysis.commons.treesitter import TreesitterPython, TreesitterJava

analysis = CLDK(language="java").analysis(
    project_path="commons-cli-rel-commons-cli-1.7.0",  #  <-- the path to the project we downloaded a few cells ago.
    analysis_level="symbol table",  # <-- This is the default, no need to specify it explicitly.
)

# Go through all the classes in the application
for class_name in analysis.get_classes():

    if class_name == target_class:
        # Get the location of the Java class
        class_path = analysis.get_java_file(qualified_class_name=class_name)

        # Read the file content; skip the class if its source file cannot be located
        if not class_path:
            continue
        with open(class_path, "r", encoding="utf-8", errors="ignore") as f:
            class_body = f.read()

        # Sanitize the file content by removing comments
        sanitized_class = TreesitterJava().remove_all_comments(source_code=class_body)
        # ^^^^^^^^^^^^^^^^^^^
        # TreesitterJava API to remove comments

        # Create prompt for translating sanitized Java class to Python
        inst = format_inst(
            code=sanitized_class, focal_class=class_name.split(".")[-1], language="java"
        )

        print(f"Instruction:\n{inst}\n")
        print(
            "Translating Java code to Python. This can take from a few seconds to a few minutes, depending on where the model is hosted...\n"
        )

        # Prompt the model hosted on OpenRouter
        translated_code = prompt(message=inst)

        # Print translated code
        print(f"Translated Python code: \n{translated_code}\n")

        # Create a Tree-sitter Python instance for analyzing the translated Python code
        py_cldk = TreesitterPython()
        # Collect the methods, functions, and fields of the translated code
        all_methods = py_cldk.get_all_methods(module=translated_code)
        # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        all_functions = py_cldk.get_all_functions(module=translated_code)
        all_fields = py_cldk.get_all_fields(module=translated_code)

        # Check counts against method and field counts for Java code
        assert len(all_methods) + len(all_functions) == len(
            analysis.get_methods_in_class(qualified_class_name=class_name)
        ), f"Number of translated methods does not match in class {class_name}"
        # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        # Raise an exception if the number of translated methods does not match
        # the number of methods in the original Java class.

        print(
            f"Number of translated methods in class {class_name} is {len(all_methods) + len(all_functions)}"
        )
        if all_fields is not None:
            assert len(all_fields) == len(
                analysis.get_class(qualified_class_name=class_name).field_declarations
            ), f"Number of translated fields does not match in class {class_name}"

            print(
                f"Number of translated fields in class {class_name} is {len(all_fields)}"
            )
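
As an independent cross-check of the counts above, the translated code can also be inspected with Python's standard-library `ast` module instead of CLDK. The sketch below is an assumption-laden illustration: `count_python_members` is a hypothetical helper, and the sample class is made up rather than real model output. It has the side benefit that `ast.parse` raises `SyntaxError` when the translation is not valid Python.

```python
import ast


def count_python_members(source: str) -> dict:
    """Count class methods and module-level functions using the stdlib ast module."""
    tree = ast.parse(source)  # raises SyntaxError if the source is not valid Python
    func_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    # Module-level functions are direct children of the module body
    functions = [node for node in tree.body if isinstance(node, func_types)]
    # Methods are function definitions that sit directly in a class body
    methods = [
        item
        for node in ast.walk(tree)
        if isinstance(node, ast.ClassDef)
        for item in node.body
        if isinstance(item, func_types)
    ]
    return {"methods": len(methods), "functions": len(functions)}


# Made-up sample standing in for a translated class
sample = '''
class GnuParser:
    def __init__(self):
        self.tokens = []

    def flatten(self, options, arguments, stop_at_non_option):
        return arguments


def helper():
    return None
'''
print(count_python_members(sample))  # {'methods': 2, 'functions': 1}
```

Keep in mind that a strict count comparison can fail for legitimate translations: Python has no signature-based overloading, so a model may merge overloaded Java methods into one, and Java constructors usually become `__init__`.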