## CLDK Tutorial

We'll use CLDK and a large language model to perform three simple tasks:
1. Generate test cases for all the methods in a Java application.
2. Summarize the methods in a Python application.
3. Translate a Python application to Java.

In [1]:
%%bash
python -m venv .venv
source .venv/bin/activate
pip install -U -r requirements.txt

Collecting cldk==1.0.0 (from -r requirements.txt (line 1))
  Using cached cldk-1.0.0-py3-none-any.whl.metadata (14 kB)
Collecting OpenAI (from -r requirements.txt (line 2))
  Using cached openai-1.76.2-py3-none-any.whl.metadata (25 kB)
Collecting jupyter (from -r requirements.txt (line 3))
  Using cached jupyter-1.1.1-py2.py3-none-any.whl.metadata (2.0 kB)
Collecting python-dotenv (from -r requirements.txt (line 4))
  Using cached python_dotenv-1.1.0-py3-none-any.whl.metadata (24 kB)
Collecting clang==17.0.6 (from cldk==1.0.0->-r requirements.txt (line 1))
  Using cached clang-17.0.6-py3-none-any.whl.metadata (1.0 kB)
Collecting libclang==17.0.6 (from cldk==1.0.0->-r requirements.txt (line 1))
  Using cached libclang-17.0.6-py2.py3-none-macosx_11_0_arm64.whl.metadata (5.2 kB)
Collecting networkx<4.0.0,>=3.4.2 (from cldk==1.0.0->-r requirements.txt (line 1))
  Using cached networkx-3.4.2-py3-none-any.whl.metadata (6.3 kB)
Collecting pandas<3.0.0,>=2.2.3 (from cldk==1.0.0->-r requirement


## Let's set up our LLM

We'll be using OpenRouter, so we'll load the API key from the environment variable `OPENROUTER_API`.

In [None]:
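## Import API keys
import os
from dotenv import load_dotenv

# Load environment variables from the .env file (fall back to the current directory if PWD is unset)
load_dotenv(dotenv_path=os.path.join(os.getenv("PWD", os.getcwd()), ".env"), override=True)

# Fail early with a clear message instead of a TypeError when the key is missing
api_key = os.getenv("OPENROUTER_API")
if not api_key:
    raise RuntimeError("OPENROUTER_API is not set; add it to your .env file.")

print("OpenRouter API key loaded successfully.")
print("OPENROUTER_API:", api_key[:8] + "...")  # show only a short prefix of the secret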

#### Let's create a simple prompting function

This function will take a prompt and return the response from the OpenRouter API.

In [None]:
from openai import OpenAI


def prompt(message: str) -> str:
    """
    Send a single user message to the model on OpenRouter and return the reply text.
    """
    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key=os.getenv("OPENROUTER_API"),  # OpenRouter API key
    )
    completion = client.chat.completions.create(
        model="qwen/qwen3-8b:free", messages=[{"role": "user", "content": message}]
    )

    return completion.choices[0].message.content

def test_prompt():
    """
    Test function to check if the prompt function works correctly.
    """
    test_message = "What is the capital of France?"
    response = prompt(test_message)
    
    assert "Paris" in response, f"Expected response to contain 'Paris', but got '{response}'"

test_prompt()

## Summarize methods in a Java application

We'll start by downloading Apache Commons CLI for this tutorial.

In [None]:
%%bash
COMMONS=commons-cli-1.7.0  
wget https://github.com/apache/commons-cli/archive/refs/tags/rel/$COMMONS.zip -O $COMMONS.zip && \
unzip -o $COMMONS.zip && \
rm -f $COMMONS.zip 
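
On systems without `wget` or `unzip` (a stock macOS install, for example), the same download can be done with Python's standard library. The sketch below is an alternative under stated assumptions, not part of CLDK: the helper names `archive_url` and `download_and_extract` are hypothetical, the URL pattern is copied from the cell above, and the archive is assumed to contain a single top-level directory.

```python
import io
import urllib.request
import zipfile
from pathlib import Path


def archive_url(release: str) -> str:
    """Build the GitHub archive URL for a release tag (same pattern as the bash cell)."""
    return f"https://github.com/apache/commons-cli/archive/refs/tags/rel/{release}.zip"


def download_and_extract(release: str, dest: str = ".") -> Path:
    """Download the release archive into memory and unpack it under `dest`."""
    with urllib.request.urlopen(archive_url(release)) as response:
        payload = response.read()
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        archive.extractall(dest)
        # Assumption: the archive has one top-level directory; return its path.
        top_level = archive.namelist()[0].split("/")[0]
    return Path(dest) / top_level
```

Calling `download_and_extract("commons-cli-1.7.0")` should leave the project in the same directory that the `unzip` command creates, and the returned path can be passed on as the project path.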

Next, let's create another helper function to formulate the prompt for summarizing the methods in a Java application.

In [None]:
def format_inst(code, focal_method, focal_class, language):
    """
    Format the instruction for the given focal method and class.
    """
    inst = f"Question: Can you write a brief summary for the method `{focal_method}` in the class `{focal_class}` below?\n"

    inst += "\n"
    inst += f"```{language}\n"
    inst += code
    inst += "```" if code.endswith("\n") else "\n```"
    inst += "\n"
    return inst
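
To see what the model will actually receive, the sketch below restates `format_inst` so that it runs on its own (the code fence is built from a constant, which is equivalent) and applies it to a made-up class. `Greeter` and `greet` are illustrative placeholders, not part of commons-cli.

```python
FENCE = "`" * 3  # three backticks, built from a constant so this snippet embeds cleanly


def format_inst(code, focal_method, focal_class, language):
    """Equivalent to the helper above, repeated so this snippet is self-contained."""
    inst = f"Question: Can you write a brief summary for the method `{focal_method}` in the class `{focal_class}` below?\n"
    inst += "\n"
    inst += f"{FENCE}{language}\n"
    inst += code
    inst += FENCE if code.endswith("\n") else "\n" + FENCE
    inst += "\n"
    return inst


# Hypothetical focal class and method, used only for illustration.
sample_code = 'public class Greeter {\n    String greet(String name) { return "Hello, " + name; }\n}'
instruction = format_inst(sample_code, "greet(String)", "Greeter", "java")
print(instruction)
```

The printed prompt is the question line, a blank line, and then the class wrapped in a `java` code fence, which is the layout the summarization loop later sends to the model.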

Let's initialize CLDK with Java as the language

In [None]:
from cldk import CLDK

cldk = CLDK(language="java")

#### Generate analysis artifacts

##### What is CLDK analysis?
CLDK uses [CodeAnalyzer](https://github.com/codellm-devkit/codeanalyzer-java) (built with [WALA](https://github.com/wala/WALA) and [JavaParser](https://github.com/javaparser/javaparser)) as the Java analysis engine. CLDK supports different analysis levels: 1) symbol table, 2) call graph, 3) system dependency graph.

The analysis level can be selected using the `AnalysisLevel` enumerated type. For this example, we select the symbol-table analysis level, with CodeAnalyzer as the default analysis engine.

> **NOTE:** If the next cell raises a `CalledProcessError`, make sure you have a working Java installation! See the [**CLDK Documentation**](https://codellm-devkit.info/installing/#java-analysis) for how to set this up.
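
A quick preflight check can save a confusing traceback. The sketch below only tests whether a `java` executable on the `PATH` runs successfully, which is an assumption about what the analysis engine needs. The helper name `java_available` is hypothetical, and the linked documentation remains the authority on supported Java versions.

```python
import shutil
import subprocess


def java_available() -> bool:
    """Return True if a `java` executable is on PATH and runs successfully."""
    java = shutil.which("java")
    if java is None:
        return False
    try:
        # `java -version` exits with status 0 on a working installation.
        subprocess.run([java, "-version"], check=True, capture_output=True, timeout=30)
    except (subprocess.SubprocessError, OSError):
        return False
    return True


if not java_available():
    print("No working `java` found on PATH; install a JDK before running the analysis.")
```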

##### How to create an analysis object?

To create an analysis object, we call `cldk.analysis(...)` with the following parameters:
- `project_path`: The path to the project to be analyzed.
- `analysis_level`: The analysis level to be used. This can be one of the following: 
  - `AnalysisLevel.SYMBOL_TABLE`: For analyzing the symbol tables of the application with the analysis engine's JavaParser.
  - `AnalysisLevel.CALL_GRAPH`: To build the call graph of the application with the analysis engine's WALA.
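
The next notebook cell passes the plain string `"symbol table"` rather than an `AnalysisLevel` member. That is interchangeable only if the enum is string-valued, which is assumed here. The stand-in enum below is hypothetical (it is named `Level` so that it is not mistaken for CLDK's own definition, and the `"call graph"` value is a guess). It only illustrates general Python behavior: a member of a `str`-based `Enum` compares equal to its value, and the value can be looked up to recover the member.

```python
from enum import Enum


class Level(str, Enum):
    """Hypothetical stand-in for a string-valued analysis-level enum."""

    SYMBOL_TABLE = "symbol table"
    CALL_GRAPH = "call graph"


# A str-based member compares equal to its underlying string...
print(Level.SYMBOL_TABLE == "symbol table")  # True
# ...and the string can be looked up to recover the member.
print(Level("call graph") is Level.CALL_GRAPH)  # True
```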


In [None]:
# Setup analysis object
analysis = cldk.analysis(
    project_path="commons-cli-rel-commons-cli-1.7.0", #  <-- the path to the project we downloaded a few cells ago.
    analysis_level="symbol table",  # <-- This is the default, no need to specify it explicitly.
)

> **NOTE:** This will take a few seconds to run, as it will analyze the entire project. 
> The analysis pipeline involves the following steps:
>   1. **Dependency Resolution**: Maven or Gradle is used to resolve the dependencies of the project and download them to a local directory.
>   2. **Parsing**: The JavaParser library is used to parse the Java source code files and build an abstract syntax tree (AST) representation of the code.
>   3. **Type Resolution**: The JavaParser library is used to resolve the types of the variables and methods in the code, which is necessary for building the symbol table and call graph.
>   4. **Symbol Table Construction**: The symbol table is constructed from the AST, which includes information about the classes, methods, and variables in the code.
>   5. **Call Graph Construction**: The call graph is constructed using the WALA library, which analyzes the control flow of the program and builds a graph representation of the method calls. (*Not executed this time because we set `analysis_level="symbol table"`*)

### Sanitize the class for prompt composition

Instead of passing the entire class for summarization, we can pass a trimmed-down class that keeps only the focal method and everything it references: its callees, imports, fields, etc. This helps the LLM focus on the method and its dependencies rather than on the entire class. To illustrate, consider the following class:

```java
package com.ibm.org;
import A.B.C.D;
...
public class Foo {
    // code comment
    public void bar() {
        int a;
        a = baz();
        // do something
    }

    private int baz() {
        // do something
    }

    public String dummy(String a) {
        // do something
    }
}
```

Let's say we want to generate a summary for the method `bar`. To understand what it does, we add its callees to the prompt, which in this case is just `baz`. We remove the other methods, imports, comments, etc.

With CLDK, all of this takes a single call to the `sanitize_focal_class` API.
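
To make the idea concrete, here is a toy sketch of focal-class pruning over a hand-written call map. It is not how CLDK implements `sanitize_focal_class`, which operates on the source code itself. The dictionary `CALLS` and the function `methods_to_keep` are hypothetical and simply mirror the `Foo` example above.

```python
# Hypothetical call map for the `Foo` class above: method -> methods it calls.
CALLS = {
    "bar": ["baz"],
    "baz": [],
    "dummy": [],
}


def methods_to_keep(focal_method: str, calls: dict[str, list[str]]) -> set[str]:
    """Keep the focal method plus the methods it calls directly."""
    return {focal_method, *calls.get(focal_method, [])}


kept = methods_to_keep("bar", CALLS)
dropped = sorted(set(CALLS) - kept)

print(sorted(kept))  # ['bar', 'baz']
print(dropped)       # ['dummy']
```

For the focal method `bar`, the sketch keeps `bar` and `baz` and drops `dummy`, which is the pruning described above.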

### Putting it all together

Now that we have the analysis object, we can use it to generate the prompt for summarizing the methods in the Java application. We can use CLDK to

1. iterate over all the methods in all the classes,
2. sanitize them with the `sanitize_focal_class` API,
3. compose a prompt with the `format_inst(...)` function we wrote a few cells ago, and
4. call the `prompt(...)` function we wrote earlier to get the summary of each method.



> **NOTE:** For the sake of simplicity, we run the code summarization on a single class and method but this filter can be removed to run this code over the entire application.

In [None]:
target_class = "org.apache.commons.cli.GnuParser"
target_method = "flatten(Options, String[], boolean)"

In [None]:
# -----
# Import classes for type hinting (optional but recommended)
from cldk.utils.sanitization.java import TreesitterSanitizer
from cldk.models.java import JCallable
# -----


# Iterate over all classes in the application
for class_name in analysis.get_classes():
    if class_name == target_class:
        print(f"Class: {class_name}")
        # -----
        # The `get_java_file` method returns the path to the Java file for the given class name.
        class_file_path = analysis.get_java_file(qualified_class_name=class_name)
        # -----

        # Read code for the class
        with open(class_file_path, "r") as f:
            code_body = f.read()

        # -----
        # `tree_sitter_utils` is a utility class that provides methods for working directly with the source code string
        # -----
        tree_sitter_utils: TreesitterSanitizer = cldk.tree_sitter_utils(source_code=code_body)
        #                  ^^^^^^^^^^^^^^^^^^^
        #                  This is the TreesitterSanitizer object that we will use to sanitize the code.

        # Iterate over all methods in class
        for method in analysis.get_methods_in_class(qualified_class_name=class_name): # <-- This API takes the class name     
                                                                                      #     to get all the methods there.
            if method == target_method:
                # -----
                # Now we can get the pydantic object that corresponds to the method we are interested in
                # with the `get_method` API.
                # -----
                method_details: JCallable = analysis.get_method(
                #               ^^^^^^^^^
                #               This is the JCallable object that we will use to get the method details.
                    qualified_class_name=class_name, qualified_method_name=method
                )

                # Sanitize the class for analysis with respect to the target method
                sanitized_class = tree_sitter_utils.sanitize_focal_class(
                    method_details.declaration # <-- This is the method declaration string 
                )
                # The `sanitize_focal_class` method removes all the methods that the focal method does not call. It also
                # removes the unused imports, fields, nested classes, and other auxiliary declarations. The outcome is a
                # "cleaned" class that only contains the method we are interested in, and all its references.
                
                
                # Format the instruction for the given target method and class
                instruction = format_inst(
                    code=sanitized_class,
                    focal_method=method_details.declaration,
                    focal_class=class_name.split(".")[-1],
                    language="java",
                )

                print(f"Instruction:\n{instruction}\n")
                print(
                    "Generating the code summary; this can take from a few seconds to a few minutes, depending on where the model is hosted...\n"
                )

                # Prompt the model on OpenRouter
                llm_output = prompt(message=instruction)

                # Print the LLM output
                print(f"LLM Output:\n{llm_output}")