# Gen-AI Workshop: Automatic Detection of Misplaced Business Logic in Java

This notebook demonstrates using RAG, Agents, and Workflows to automatically detect Clean Architecture violations in Java code.

**Focus:** Identify misplaced business logic (e.g., in controllers, repositories, entities) and explain violations.

**Tech Stack:** 
- Python, OpenAI GPT-4.1-nano
- sentence-transformers (embeddings)
- FAISS (vector search)
- LangChain (agents, workflows)

In [1]:
# Installation of dependencies
%pip install -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


## IMPORTANT: Restart Kernel Now

**After running the cell above, you MUST restart the kernel before continuing:**

1. Click **Kernel** → **Restart Kernel** in the menu
2. Or use the keyboard shortcut (typically pressing `0` twice in command mode)
3. Then continue with the cells below

This is required for the newly installed packages (especially `mcp`) to be available for import.

## Setup: Install Dependencies

**Before running the notebook, install required packages:**

```bash
pip install -r requirements.txt
```

**Installed packages:**
- `sentence-transformers` - Text embedding generation
- `faiss-cpu` - Fast similarity search
- `openai` - OpenAI API client
- `langchain` - Agent and workflow framework
- `langchain-openai` - OpenAI integration for LangChain
- `langchain-community` - Additional LangChain tools
- `mcp` - Model Context Protocol

**Note:** First execution will download the sentence transformer model (~90MB).
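
After restarting the kernel, a quick import check can confirm the environment is ready. The sketch below is a convenience, not part of the workshop code: the helper name `missing_modules` is hypothetical, and the list holds the import names that correspond to the packages above (`faiss-cpu` is imported as `faiss`).

```python
import importlib.util

# Import names for the packages listed above (faiss-cpu is imported as "faiss").
REQUIRED_MODULES = [
    "sentence_transformers",
    "faiss",
    "openai",
    "langchain",
    "langchain_openai",
    "langchain_community",
    "mcp",
]

def missing_modules(names):
    """Return the top-level module names that cannot be located in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules (install them, then restart the kernel):", ", ".join(missing))
else:
    print("All required modules are importable.")
```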

In [2]:
import re
import os
import numpy as np
from sentence_transformers import SentenceTransformer
import faiss
from openai import OpenAI
from langchain_openai import ChatOpenAI
from langchain.agents import Tool, initialize_agent, AgentType
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain, SequentialChain, TransformChain



In [3]:
# Read the OpenAI API key from the api-key.txt file
try:
    with open('api-key.txt', 'r') as f:
        OPENAI_API_KEY = f.read().strip()
    print("API key loaded from the api-key.txt file.")
except FileNotFoundError:
    raise FileNotFoundError(
        "Error: 'api-key.txt' not found.\n"
        "Please create a file named 'api-key.txt' in the project root directory "
        "containing the OpenAI API key provided and re-run this cell."
    )

API key loaded from the api-key.txt file.
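
Reading the key from a file works for the workshop; outside it, an environment variable is a common alternative. The sketch below assumes a hypothetical helper, `load_api_key`, that checks the `OPENAI_API_KEY` environment variable first and falls back to `api-key.txt`. It is one possible arrangement, not part of the notebook.

```python
import os
from pathlib import Path

def load_api_key(path="api-key.txt", env_var="OPENAI_API_KEY"):
    """Return the API key from the environment if set, otherwise from a file."""
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    key_file = Path(path)
    if not key_file.is_file():
        raise FileNotFoundError(
            f"Set {env_var} or create '{path}' containing the OpenAI API key."
        )
    key = key_file.read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"'{path}' is empty.")
    return key
```

Keeping the lookup in one function means later cells do not need to know where the key came from.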


In [4]:
# Load Clean Architecture knowledge base from knowledge-base directory
def load_text_file(filepath):
    """Load text file and return its content."""
    try:
        with open(filepath, 'r', encoding='utf-8') as f:
            content = f.read()
        return content.replace('\u200b', '').replace('\ufeff', '')
    except FileNotFoundError:
        raise FileNotFoundError(f"Error: File not found: {filepath}")

# Load all knowledge base markdown files
kb_files = [
    'knowledge-base/01-layering-principles.md',
    'knowledge-base/02-controller-layer.md',
    'knowledge-base/03-service-layer.md',
    'knowledge-base/04-repository-layer.md',
    'knowledge-base/05-entity-layer.md',
    'knowledge-base/06-anti-patterns-overview.md'
]

# Combine all knowledge base files into single corpus
KB_MARKDOWN = ""
for kb_file in kb_files:
    content = load_text_file(kb_file)
    KB_MARKDOWN += f"\n\n# Source: {kb_file}\n\n{content}"

print("Knowledge base loaded from:")
for kb_file in kb_files:
    print(f"  - {kb_file}")
print(f"\nTotal knowledge base size: {len(KB_MARKDOWN)} characters")

Knowledge base loaded from:
  - knowledge-base/01-layering-principles.md
  - knowledge-base/02-controller-layer.md
  - knowledge-base/03-service-layer.md
  - knowledge-base/04-repository-layer.md
  - knowledge-base/05-entity-layer.md
  - knowledge-base/06-anti-patterns-overview.md

Total knowledge base size: 55076 characters


In [5]:
# Load leaky code samples from dummy-project directory
LEAKY_SAMPLES = {
    "application": load_text_file('dummy-project/LeakyDemoApplication.java'),
    "order_entity": load_text_file('dummy-project/Order.java'),
    "order_controller": load_text_file('dummy-project/OrderController.java'),
    "order_repository": load_text_file('dummy-project/OrderRepository.java')
}

print("Leaky code samples loaded from dummy-project:")
for key in LEAKY_SAMPLES.keys():
    print(f"  - {key}")

print("\nNote: These are intentionally leaky examples for violation detection practice.")

Leaky code samples loaded from dummy-project:
  - application
  - order_entity
  - order_controller
  - order_repository

Note: These are intentionally leaky examples for violation detection practice.


In [6]:
# Initialize RAG components: Sentence transformer and FAISS index
print("Initializing RAG components...")

# Load embedding model (downloads on first run)
model = SentenceTransformer('all-MiniLM-L6-v2')
print("Sentence transformer model loaded (all-MiniLM-L6-v2)")

# Split knowledge base into chunks (by double newlines = paragraphs)
chunks = re.split(r'\n\s*\n', KB_MARKDOWN.strip())
print(f"Knowledge base split into {len(chunks)} chunks")

# Generate embeddings for all chunks
embeddings = model.encode(chunks)
print(f"Generated embeddings with dimension {embeddings.shape[1]}")

# Create FAISS index for similarity search
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(np.array(embeddings))
print(f"FAISS index created with {index.ntotal} vectors")

print("\nRAG setup complete. Ready for semantic retrieval.")

Initializing RAG components...
Sentence transformer model loaded (all-MiniLM-L6-v2)
Knowledge base split into 404 chunks
Generated embeddings with dimension 384
FAISS index created with 404 vectors

RAG setup complete. Ready for semantic retrieval.
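
To make the indexing step less of a black box, here is a dependency-free sketch of the two ideas it relies on: splitting text on blank lines, and exact nearest-neighbor search by squared L2 distance, which is what a flat L2 index computes by brute force. The function names and the toy 2-D vectors are illustrative assumptions; the notebook itself uses 384-dimensional embeddings and FAISS.

```python
import re

def split_into_chunks(markdown):
    """Split text into paragraph chunks on one or more blank lines."""
    return [c for c in re.split(r'\n\s*\n', markdown.strip()) if c.strip()]

def l2_search(vectors, query, top_k=3):
    """Exact search: rank every vector by squared L2 distance to the query."""
    scored = sorted(
        (sum((a - b) ** 2 for a, b in zip(vec, query)), i)
        for i, vec in enumerate(vectors)
    )
    return [i for _, i in scored[:top_k]]

# Toy example: three "embeddings" in 2-D instead of 384-D.
toy_chunks = split_into_chunks("Controllers map HTTP.\n\nServices hold rules.\n\nRepositories persist.")
toy_vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
nearest = l2_search(toy_vectors, [0.9, 0.9], top_k=2)
print([toy_chunks[i] for i in nearest])
```

Because every vector is compared with the query, results are exact; the cost grows linearly with the number of chunks, which is negligible at a few hundred entries.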


In [7]:
def retrieve_relevant_rules(query, top_k=3):
    """
    Core retrieval function: Embed query, fetch top-k relevant chunks from knowledge base.
    
    Args:
        query (str): Input query (typically Java code or architectural question)
        top_k (int): Number of relevant chunks to retrieve (default: 3)
        
    Returns:
        str: Concatenated relevant rule chunks from knowledge base
    """
    query_embedding = model.encode([query])
    _, indices = index.search(np.array(query_embedding), top_k)
    # FAISS pads results with -1 when fewer than top_k vectors are available; skip those
    relevant = "\n\n".join([chunks[i] for i in indices[0] if i >= 0])
    return relevant.replace('\u200b', '').replace('\ufeff', '')

# Test retrieval with sample query
test_query = "business logic in repository layer"
test_result = retrieve_relevant_rules(test_query)
print(f"Test retrieval for query: '{test_query}'")
print(f"Retrieved {len(test_result)} characters from knowledge base\n")
print("Sample output (first 400 chars):")
print(test_result[:400], "...\n")
print("Retrieval function working correctly")

Test retrieval for query: 'business logic in repository layer'
Retrieved 352 characters from knowledge base

Sample output (first 400 chars):
**Problems:**
- Business rules (eligibility, discount) in repository
- Data transformations based on business logic
- Filtering based on business conditions
- Repository knows too much about business domain

## Violation: Business Logic in Repository

Repositories can use database-level operations for performance, but must not include business logic. ...

Retrieval function working correctly


## Helper Functions for File Generation

These functions are used throughout the notebook to:
1. Prepare output directories for fixed code
2. Infer Java filenames from class definitions
3. Generate corrected Java code using LLM with architecture rules

In [8]:
# Helper functions: prepare fixed directory, infer Java filename, create corrected Java via LLM
import pathlib
import shutil
import re

def prepare_fixed_dir(path: str = 'dummy-project/fixed'):
    """
    Prepare output directory for fixed Java files.
    Creates directory if it doesn't exist, removes existing files if it does.
    
    Args:
        path (str): Path to the fixed files directory
        
    Returns:
        pathlib.Path: Path object for the prepared directory
    """
    d = pathlib.Path(path)
    if d.exists():
        shutil.rmtree(d)
    d.mkdir(parents=True, exist_ok=True)
    return d

def infer_java_filename(code: str, fallback: str) -> str:
    """
    Infer Java filename from code by finding the primary type name.
    Looks for class, interface, or enum declarations.
    
    Args:
        code (str): Java source code
        fallback (str): Fallback filename if no type declaration found
        
    Returns:
        str: Inferred filename (e.g., "OrderController.java")
    """
    # Try to find the primary type name: class|interface|enum Name
    m = re.search(r'\b(class|interface|enum)\s+([A-Z][A-Za-z0-9_]*)', code)
    if m:
        return f"{m.group(2)}.java"
    return fallback

def make_fixed_java(filename: str, code: str, rules: str, model: str = 'gpt-4.1-nano') -> str:
    """
    Generate corrected Java code using LLM with architecture rules as context.
    
    Args:
        filename (str): Java filename to determine layer-specific refactoring hints
        code (str): Original Java source code with violations
        rules (str): Relevant architecture rules from knowledge base
        model (str): OpenAI model to use (default: gpt-4.1-nano)
        
    Returns:
        str: Corrected Java source code
    """
    role = ('controller' if filename.lower().endswith('controller.java') else
            'repository' if filename.lower().endswith('repository.java') else
            'entity' if filename.lower().endswith('order.java') else 'java')
    system = (
        'You are a senior Java/Spring reviewer. Refactor the given file to comply with Clean Architecture.\n'
        f'Keep the same package and imports. Remove misplaced business logic from the {role}.\n'
        'Controllers: only HTTP mapping, DTO mapping, delegate to OrderService.\n'
        'Repositories: only persistence interfaces/CRUD, no domain computations.\n'
        'Entities: plain domain with fields/getters/setters, no I/O or service/repo calls.\n'
        'If delegation is needed, call an OrderService (assume it exists); do not inline logic.\n'
        'Return ONLY the corrected Java file content.'
    )
    user = (
        f'Relevant architecture rules:\n{rules}\n\n'
        f'File name: {filename}\n\n'
        f'Original Java file:\n{code}\n'
    )
    client = OpenAI(api_key=OPENAI_API_KEY)
    rsp = client.chat.completions.create(
        model=model,
        messages=[{"role":"system","content":system},{"role":"user","content":user}],
    )
    return rsp.choices[0].message.content

print("Helper functions defined:")
print("  - prepare_fixed_dir(): Prepare output directory")
print("  - infer_java_filename(): Extract filename from code")
print("  - make_fixed_java(): Generate corrected code via LLM")

Helper functions defined:
  - prepare_fixed_dir(): Prepare output directory
  - infer_java_filename(): Extract filename from code
  - make_fixed_java(): Generate corrected code via LLM
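
Even with an instruction to return only the file content, chat models sometimes wrap their answer in a Markdown code fence, which would then end up inside the written `.java` file. The sketch below shows one way to guard against that. `strip_code_fence` is a hypothetical helper, not part of the notebook, and it only handles a single fence wrapping the entire response.

```python
import re

# Three backticks, built programmatically so this example contains no literal fence.
FENCE = "`" * 3

_WRAPPED = re.compile(
    r"\A\s*" + FENCE + r"[A-Za-z0-9_+-]*[ \t]*\n(.*?)\n?" + FENCE + r"\s*\Z",
    re.DOTALL,
)

def strip_code_fence(text: str) -> str:
    """Return the body of a response wrapped in one Markdown fence, else the text unchanged."""
    match = _WRAPPED.match(text)
    return match.group(1) if match else text

wrapped = FENCE + "java\npublic class Order {}\n" + FENCE
print(strip_code_fence(wrapped))
```

One possible use would be to apply it to the return value of `make_fixed_java()` before writing the file.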


---

# Section 1: RAG (Retrieval-Augmented Generation)

**Goal:** Build a RAG pipeline to retrieve relevant architecture rules and use an LLM to detect violations.

**Why RAG?**
- Augments LLM with domain-specific Clean Architecture knowledge
- Ensures analysis references concrete rules and patterns
- Improves accuracy by grounding responses in retrieved context

**Workflow:**
1. **Retrieve:** Semantic search for relevant rules based on code
2. **Augment:** Inject retrieved rules into LLM prompt
3. **Generate:** LLM analyzes code against rules, detects violations
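
The three steps above can be sketched as a single function. To keep the example runnable without an API key, the retriever and the model are passed in as plain callables; `analyze_with_rag`, `build_prompt`, `fake_retrieve`, and `fake_llm` are hypothetical names used only for illustration. In the notebook, the corresponding pieces are `retrieve_relevant_rules` and the chat completion call.

```python
def build_prompt(java_code, rules):
    """Augment: place the retrieved rules next to the code in one prompt."""
    return (
        f"Java Code to Analyze:\n{java_code}\n\n"
        f"Relevant Architecture Rules:\n{rules}\n\n"
        "Task: Identify all Clean Architecture violations in this code."
    )

def analyze_with_rag(java_code, retrieve, llm):
    rules = retrieve(java_code)              # 1. Retrieve
    prompt = build_prompt(java_code, rules)  # 2. Augment
    return llm(prompt)                       # 3. Generate

# Stand-ins so the sketch runs offline (assumptions, not real components).
def fake_retrieve(query):
    return "Controllers must delegate business rules to a service."

def fake_llm(prompt):
    return f"(model output for a {len(prompt)}-character prompt)"

print(analyze_with_rag("class OrderController {}", fake_retrieve, fake_llm))
```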

**Hands-on:**
- Analyze leaky code samples and observe violations detected
- Experiment with different code snippets

In [9]:
# Select sample for RAG analysis
# Options: "order_controller", "order_repository", "order_entity", "application"
sample_name = "order_controller"
java_code = LEAKY_SAMPLES[sample_name]

print(f"Analyzing: {sample_name} (leaky code from dummy-project)")
print("=" * 70)
print("Code snippet (first 600 chars):")
print(java_code[:600], "...\n")

# Retrieve relevant architecture rules using semantic search
relevant_rules = retrieve_relevant_rules(java_code)
print("\nRetrieved relevant architecture rules:")
print("=" * 70)
print(relevant_rules[:700], "...\n")
print(f"Total retrieved content: {len(relevant_rules)} characters")

Analyzing: order_controller (leaky code from dummy-project)
Code snippet (first 600 chars):
package com.example.leakydemo;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;
import java.util.List;

@RestController
public class OrderController {

    @Autowired
    private OrderRepository orderRepository;

    // Business logic leakage: Controller handling business rules like approval checks
    @GetMapping("/orders/eligible")
    public List<Order> getEligibleOrders() {
        List<Order> eligibleOrders = orderRepository.findEligibleForDiscount();
   ...


Retrieved relevant architecture rules:
    @GetMapping("/orders/eligible")
    public List<Order> getEligibleOrders() {
        List<Order> orders = orderRepository.findAll();

    @GetMapping("/orders/eligible")
    public List<Order> getEligibleOrders() {
        return orderService.getEligibleOrde

In [10]:
# Augment LLM with retrieved rules for violation analysis
client = OpenAI(api_key=OPENAI_API_KEY)

# Note: gpt-4.1-nano is a real model - do not change
response = client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[
        {
            "role": "system", 
            "content": (
                "You are a Java architecture expert specializing in Clean Architecture. "
                "Analyze code using the provided architecture rules to detect misplaced business logic violations."
            )
        },
        {
            "role": "user", 
            "content": (
                f"Java Code to Analyze:\n{java_code}\n\n"
                f"Relevant Architecture Rules:\n{relevant_rules}\n\n"
                f"Task: Identify all Clean Architecture violations in this code.\n\n"
                f"For each violation, provide:\n"
                f"1. Exact location (class, method, line number if visible)\n"
                f"2. Type of violation (e.g., 'Business logic in controller')\n"
                f"3. Why it violates Clean Architecture principles\n"
                f"4. Impact on maintainability and testability\n"
                f"5. How to fix it (move to which layer)\n\n"
                f"Reference specific rules from the provided architecture rules."
            )
        }
    ]
)

print("RAG-Enhanced Analysis:")
print("=" * 70)
print(response.choices[0].message.content)

RAG-Enhanced Analysis:
Let's analyze the provided code and identify all Clean Architecture violations based on the given architecture rules.

---

### 1. OrderController Class

**Location:**  
- Class: `com.example.leakydemo.OrderController`  
- Method: `public List<Order> getEligibleOrders()`

---

#### Violation 1: Business Logic Leakage — Approving Orders and Discount Calculation

**Description:**  
The controller retrieves eligible orders from the repository and then applies business-specific logic—for example, applying a discount or VIP treatment based on order total:

```java
for (Order order : eligibleOrders) {
    if (order.getTotal() > 500) {
        order.setTotal(order.getTotal() * 0.95); // VIP discount
    }
}
```

**Type of Violation:**  
- Business logic (discount application and VIP approval) incorrectly placed in the controller.

**Why it violates Clean Architecture:**  
- Business rules (like VIP discounts, approval criteria) should reside within the domain layer or a

In [11]:
# Write fixed file for current RAG sample (generic filename inference)
fixed_dir = prepare_fixed_dir()
fallback = f"{sample_name}.java"
src_filename = infer_java_filename(java_code, fallback)
fixed_code = make_fixed_java(src_filename, java_code, relevant_rules)
(fixed_dir / src_filename).write_text(fixed_code, encoding='utf-8')
print(f"Wrote fixed file: {(fixed_dir / src_filename).resolve()}")

Wrote fixed file: C:\Users\david.arnold\GenAI\genai-clean-arch-workshop\dummy-project\fixed\OrderController.java


**Exercise:**

1. **Try Different Samples:**
   ```python
   sample_name = "order_repository"  # or "order_entity"
   ```
   Re-run the previous cells to analyze different violation types.

2. **Adjust Retrieval:**
   - Modify `top_k` parameter in `retrieve_relevant_rules()` (try 5 or 10)
   - Does more context improve analysis quality or introduce noise?

3. **Custom Code Analysis:**
   ```python
   java_code = """
   // Paste your own Java code here
   """
   relevant_rules = retrieve_relevant_rules(java_code)
   # Then run LLM analysis
   ```

---

# Section 2: Agents (ReAct Framework)

**Goal:** Create an autonomous agent that reasons about when to retrieve rules and how to analyze code step-by-step.

**Why Agents?**
- **Autonomy:** Agent decides if/when to use the retrieval tool
- **Reasoning:** Breaks down complex analysis into logical steps
- **Flexibility:** Handles multi-file or contextual analysis

**ReAct Pattern:** 
- **Reason (Thought):** Agent thinks about what to do next
- **Act (Action):** Agent uses a tool (e.g., RetrieveArchitectureRules)
- **Observe (Observation):** Agent sees tool output
- **Repeat:** Continue until reaching final answer
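
The loop above can be illustrated without an LLM. The sketch below replays a scripted list of model replies, parses each one for an `Action:` / `Action Input:` pair or a `Final Answer:`, and feeds the tool output back as an observation. It is a simplified illustration of the pattern, not LangChain's actual parser or executor; `parse_step`, `run_react`, and the scripted replies are all hypothetical.

```python
import re

ACTION_RE = re.compile(r"Action:\s*(.+?)\s*\nAction Input:\s*(.+)", re.DOTALL)

def parse_step(reply):
    """Return ('final', answer), ('action', tool, tool_input), or ('invalid',)."""
    if "Final Answer:" in reply:
        return ("final", reply.split("Final Answer:", 1)[1].strip())
    m = ACTION_RE.search(reply)
    if m:
        return ("action", m.group(1).strip(), m.group(2).strip().strip('"'))
    return ("invalid",)

def run_react(replies, tools, max_steps=5):
    """Replay scripted replies; return (final_answer_or_None, observations)."""
    observations = []
    for reply in replies[:max_steps]:
        step = parse_step(reply)
        if step[0] == "final":
            return step[1], observations
        if step[0] == "action" and step[1] in tools:
            observations.append(tools[step[1]](step[2]))   # Act, then Observe
        else:
            observations.append("Invalid format or unknown tool")
    return None, observations

demo_tools = {"RetrieveArchitectureRules": lambda q: f"rules about: {q}"}
scripted = [
    'Thought: I need the rules.\nAction: RetrieveArchitectureRules\nAction Input: "repository layer"',
    "Thought: I have enough.\nFinal Answer: Business logic found in repository.",
]
answer, seen = run_react(scripted, demo_tools)
print(answer)
print(seen)
```

A reply that writes the tool call as a function call on the `Action:` line, with no separate `Action Input:` line, falls into the `invalid` branch of this sketch.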

**Builds on RAG:** Wraps `retrieve_relevant_rules` as a tool the agent can call autonomously.

**Hands-on:**
- Observe agent's reasoning process (`verbose=True` shows thoughts)
- See how it decides to use the retrieval tool
- Experiment with different prompts

In [12]:
# Wrap retrieval function as an agent tool
tools = [
    Tool(
        name="RetrieveArchitectureRules",
        func=retrieve_relevant_rules,
        description=(
            "Retrieve Clean Architecture rules, anti-patterns, and violation examples "
            "for analyzing Java code. Input should be Java code or a description of "
            "the architectural concern. Returns relevant rules from the knowledge base."
        )
    )
]

print("Agent tools defined:")
for tool in tools:
    print(f"  - Tool: {tool.name}")
    print(f"    Description: {tool.description[:100]}...")

Agent tools defined:
  - Tool: RetrieveArchitectureRules
    Description: Retrieve Clean Architecture rules, anti-patterns, and violation examples for analyzing Java code. In...


In [13]:
# Initialize ReAct agent with tools
llm = ChatOpenAI(model="gpt-4.1-nano", api_key=OPENAI_API_KEY)

agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,
    handle_parsing_errors=True
)

print("Agent initialized successfully")
print("  - Agent type: ZERO_SHOT_REACT_DESCRIPTION")
print("  - Verbose mode: ON (reasoning will be visible)")
print("  - Error handling: Enabled")
print("\nAgent will now reason step-by-step using the ReAct pattern.")

Agent initialized successfully
  - Agent type: ZERO_SHOT_REACT_DESCRIPTION
  - Verbose mode: ON (reasoning will be visible)
  - Error handling: Enabled

Agent will now reason step-by-step using the ReAct pattern.




In [14]:
# Select sample for agent analysis
agent_sample_name = "order_repository"
agent_code = LEAKY_SAMPLES[agent_sample_name]

# Craft prompt to encourage tool use and step-by-step reasoning
agent_prompt = (
    f"Analyze the following Java repository interface for Clean Architecture violations. "
    f"First, use the RetrieveArchitectureRules tool to get relevant rules about repositories. "
    f"Then, identify all violations step-by-step.\n\n"
    f"Java Code:\n{agent_code}"
)

print(f"Running agent analysis on: {agent_sample_name}")
print("=" * 70)
print("Watch the agent's reasoning process below:\n")

result = agent.run(agent_prompt)

print("\n" + "=" * 70)
print("Agent's Final Analysis:")
print("=" * 70)
print(result)



Running agent analysis on: order_repository
Watch the agent's reasoning process below:



> Entering new AgentExecutor chain...
Action: RetrieveArchitectureRules
Action Input: "Java repository interface, Clean Architecture violations, focusing on repositories, data access layer, and business logic leakage"
Observation: This document catalogs the most frequent violations of Clean Architecture principles found in Java applications. Understanding these patterns helps identify and prevent architectural erosion.

| Violation | Symptom | Solution |
|-----------|---------|----------|
| God Class | One class does everything | Split into separate layers |
| Leaky Abstraction | Layer knows too much about others | Use interfaces, hide implementation |
| Fat Controller | Business logic in controller | Move to service layer |
| Transaction Script | Procedural code everywhere | Object-oriented services |
| Repository Violation | Business logic in repo | Keep r

In [15]:
# Demonstrate agent handling multiple file context
print("Agent Analysis: Multiple Files from dummy-project")
print("=" * 70)

multi_file_prompt = (
    f"I have a Spring Boot application with potential architecture violations. "
    f"Analyze these three files and identify which layers are violating Clean Architecture:\n\n"
    f"1. Order Controller:\n{LEAKY_SAMPLES['order_controller']}\n\n"
    f"2. Order Repository:\n{LEAKY_SAMPLES['order_repository']}\n\n"
    f"3. Order Entity:\n{LEAKY_SAMPLES['order_entity']}\n\n"
    f"For each file, identify violations and explain their impact on maintainability."
)

print("Agent will analyze all three files...\n")
multi_result = agent.run(multi_file_prompt)

print("\n" + "=" * 70)
print("Multi-File Analysis Result:")
print("=" * 70)
print(multi_result)

Agent Analysis: Multiple Files from dummy-project
Agent will analyze all three files...



> Entering new AgentExecutor chain...
Thought: To accurately identify the architectural violations in each of the provided files, I will analyze how each layer interacts with others, whether business logic is misplaced, and if any data access or presentation concerns leak into inappropriate layers. I will use the retrieve tool to find relevant rules about layering principles, separation of concerns, and the recommended placement of business logic.

Action: RetrieveArchitectureRules("Clean Architecture layering principles and common violations regarding controllers, repositories, and entities in Java Spring applications", top_k=3)
Observation: Invalid Format: Missing 'Action Input:' after 'Action:'
Thought:Question: I have a Spring Boot application with potential architecture violations. Analyze these three files and identify which layers are violating Clean 

**Exercise:**

1. **Add Custom Tool:**
   ```python
   def suggest_fix(code_description):
       return "Move business logic to service layer. Create OrderService class."
   
   tools.append(Tool(
       name="SuggestFix",
       func=suggest_fix,
       description="Suggest refactoring approach for violations"
   ))
   # Re-initialize agent with new tools
   ```

2. **Different Prompts:**
   - "Prioritize violations by severity (critical, major, minor)"
   - "Find only controller violations, ignore other layers"
   - "Explain which violations would fail a code review"

3. **Test Agent Limits:**
   - Provide non-Java code (Python, JavaScript) - what happens?
   - Ask simple questions ("What is Clean Architecture?") - does agent still call tools?
   - Very long code snippets - does reasoning quality degrade?

4. **Observe Reasoning:**
   - Count how many times agent calls RetrieveArchitectureRules
   - Does it retrieve rules for each file separately in multi-file analysis?
   - When does agent decide it has enough information?

---

# Section 3: Workflows (Deterministic Pipelines)

**Goal:** Orchestrate a fixed, predictable sequence of steps: Retrieve → Analyze → Output.

**Why Workflows?**
- **Deterministic:** The same input always triggers the same sequence of operations (the LLM's wording may still vary between runs)
- **Production-ready:** Suitable for CI/CD integration
- **Debuggable:** Easy to trace execution with verbose logging
- **Consistent:** Every code sample analyzed the same way

**Workflow Steps:**
1. **TransformChain (Retrieval):** Fetch relevant rules based on input code
2. **LLMChain (Analysis):** Analyze code with retrieved rules, generate violation report
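
Conceptually, a sequential workflow threads a dictionary of named values through a fixed list of steps, each step reading some keys and adding new ones. The sketch below shows that idea in plain Python with stub steps; `run_pipeline`, `retrieval_step`, and `analysis_step` are hypothetical stand-ins, not LangChain classes.

```python
def run_pipeline(steps, inputs):
    """Run steps in order; each step receives the state and returns new keys to merge."""
    state = dict(inputs)
    for step in steps:
        state.update(step(state))
    return state

def retrieval_step(state):
    # Stub for the retrieval step: reads "code", produces "rules".
    return {"rules": f"rules relevant to {len(state['code'])} characters of code"}

def analysis_step(state):
    # Stub for the analysis step: reads "code" and "rules", produces "analysis".
    return {"analysis": f"Checked code against: {state['rules']}"}

result_state = run_pipeline([retrieval_step, analysis_step], {"code": "class Order {}"})
print(result_state["analysis"])
```

The order of the list is the whole control flow, which is what makes this style easy to trace compared with an agent that chooses its own next step.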

**Builds on RAG:** Uses retrieval from Section 1, chains it with LLM analysis.

**vs Agents:** Workflows lack autonomy but guarantee predictable execution.

**Hands-on:**
- Run workflow on different samples
- Observe verbose logging showing each step
- Compare deterministic workflow vs agent flexibility

In [16]:
# Initialize LLM for workflow chains
llm = ChatOpenAI(model="gpt-4.1-nano", api_key=OPENAI_API_KEY)
print("LLM initialized for workflow chains")

LLM initialized for workflow chains


In [17]:
# Chain 1: Retrieval step - wraps retrieval function as TransformChain
def transform_retrieval(inputs):
    """
    Transform function for retrieval chain.
    Takes code as input, retrieves relevant rules from knowledge base.
    Note: Returns only 'rules' to avoid key duplication in SequentialChain.
    """
    code = inputs["code"]
    rules = retrieve_relevant_rules(code)
    return {"rules": rules}

retrieval_chain = TransformChain(
    input_variables=["code"],
    output_variables=["rules"],
    transform=transform_retrieval
)

print("Retrieval chain created (TransformChain)")
print("  - Input: code")
print("  - Output: rules")
print("  - Function: retrieve_relevant_rules() via transform")

Retrieval chain created (TransformChain)
  - Input: code
  - Output: rules
  - Function: retrieve_relevant_rules() via transform


In [18]:
# Chain 2: Analysis step - LLM analyzes code with retrieved rules
analysis_prompt = PromptTemplate.from_template(
    "You are a Java architecture expert analyzing code for Clean Architecture violations.\n\n"
    "Java Code:\n{code}\n\n"
    "Relevant Architecture Rules:\n{rules}\n\n"
    "Task:\n"
    "1. List all violations with exact locations (class, method, line)\n"
    "2. Explain why each violates Clean Architecture principles\n"
    "3. Cite specific rules from the provided architecture rules\n"
    "4. Describe impact on maintainability, testability, and scalability\n"
    "5. Provide refactoring recommendations (which layer should contain the logic)\n\n"
    "Format your analysis clearly with numbered sections for each violation."
)

analysis_chain = LLMChain(
    llm=llm,
    prompt=analysis_prompt,
    output_key="analysis"
)

print("Analysis chain created (LLMChain)")
print("  - Input: code, rules")
print("  - Output: analysis")
print("  - Prompt: Structured violation analysis with citations")

Analysis chain created (LLMChain)
  - Input: code, rules
  - Output: analysis
  - Prompt: Structured violation analysis with citations




In [19]:
# Compose full workflow: Retrieval → Analysis
workflow = SequentialChain(
    chains=[retrieval_chain, analysis_chain],
    input_variables=["code"],
    output_variables=["analysis"],
    verbose=True
)

print("Workflow created with SequentialChain")
print("  Step 1: TransformChain (retrieval)")
print("  Step 2: LLMChain (analysis)")
print("  - Verbose mode: ON (execution logs will be shown)")
print("\nWorkflow ready for execution")

Workflow created with SequentialChain
  Step 1: TransformChain (retrieval)
  Step 2: LLMChain (analysis)
  - Verbose mode: ON (execution logs will be shown)

Workflow ready for execution


In [20]:
# Execute workflow on leaky controller
workflow_sample_name = "order_controller"
workflow_code = LEAKY_SAMPLES[workflow_sample_name]

print(f"Executing workflow on: {workflow_sample_name} (leaky code)")
print("=" * 70)

result = workflow({"code": workflow_code})

print("\n" + "=" * 70)
print("Workflow Output:")
print("=" * 70)
print(result["analysis"])

Executing workflow on: order_controller (leaky code)


> Entering new SequentialChain chain...





> Finished chain.

Workflow Output:
### 1. Violations with Exact Locations

#### Violation 1: Business logic leakage in `OrderController`
- **Class:** `OrderController` (com.example.leakydemo.OrderController)
- **Method:** `getEligibleOrders()`
- **Line:** Starting from line 14 onwards
  - **Details:**
    - `for (Order order : eligibleOrders)`
    - `if (order.getTotal() > 500)`
    - `order.setTotal(order.getTotal() * 0.95);`
  
This loop performs business decision-making (applying discounts based on total order value) directly within the controller.

#### Violation 2: Business validation in `createOrder` method
- **Class:** Presumably in same controller or unspecified class (but assuming a controller based on the snippet)
- **Method:** `createOrder(@RequestBody Order order)`
- **Line:** Inside method, around line 16-20
  - **Details:**
    - `if (order.getTotal() < 0)`
    - `throw new IllegalArgumentException("Negative total");`

This validation is business logic (checking

In [21]:
# Run workflow on all leaky samples for comprehensive analysis
print("Batch Workflow Execution: All Leaky Samples from dummy-project")
print("=" * 70)

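batch_results = {}
failed_samples = []  # names of samples whose analysis raised an exception

# Use a dedicated loop variable so the earlier `sample_name` selection is not overwritten
for batch_sample_name, code in LEAKY_SAMPLES.items():
    # The application class is excluded from the batch analysis
    if batch_sample_name == "application":
        continue

    print(f"\nAnalyzing: {batch_sample_name}")
    print("-" * 70)

    try:
        result = workflow({"code": code})
        batch_results[batch_sample_name] = result["analysis"]
        print(f"Analysis complete for {batch_sample_name}")
        print("Summary (first 400 chars):")
        print(result["analysis"][:400], "...\n")
    except Exception as e:
        print(f"Error analyzing {batch_sample_name}: {e}")
        batch_results[batch_sample_name] = f"Error: {e}"
        failed_samples.append(batch_sample_name)

print("\n" + "=" * 70)
print("Batch Execution Complete")
# Count only samples that did not raise, so failures are not reported as successes
print(f"Successfully analyzed {len(batch_results) - len(failed_samples)} of {len(batch_results)} files")
print("\nAll results stored in batch_results dictionary")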

Batch Workflow Execution: All Leaky Samples from dummy-project

Analyzing: order_entity
----------------------------------------------------------------------


> Entering new SequentialChain chain...

> Finished chain.
Analysis complete for order_entity
Summary (first 400 chars):
**Analysis of Clean Architecture Violations in Provided Code**

---

### 1. Violations List with Exact Locations

| # | Violation Description                                              | Class             | Method                   | Line Number(s) |
|---|----------------------------------------------------------------------|-------------------|--------------------------|----------------|
| 1 |  ...


Analyzing: order_controller
----------------------------------------------------------------------


> Entering new SequentialChain chain...

> Finished chain.
Analysis complete for order_controller
Summary (first 400 chars):
Certainly. Below is a structured analysis of each vio

---

# Section 4: Generate Corrected Java Files

This section takes the analyzed leaky code samples and generates corrected versions that comply with Clean Architecture principles.

**Two approaches available:**
1. **Standard Workflow (4.1):** Uses LLM directly with architecture rules to refactor code
2. **MCP-Based Workflow (4.2):** Uses Model Context Protocol server (optional, advanced)

**Output:** Corrected Java files written to `dummy-project/fixed/`

## 4.1: Write Fixed Files (Standard Workflow)

Generate corrected Java files using the helper functions defined earlier.
The directory is recreated (existing files are removed) before the new corrected files are written.

In [22]:
# Prepare output directory dummy-project/fixed (create or clean)
fixed_dir = prepare_fixed_dir('dummy-project/fixed')
print(f"Prepared fixed directory: {fixed_dir.resolve()}")

Prepared fixed directory: C:\Users\david.arnold\GenAI\genai-clean-arch-workshop\dummy-project\fixed


In [23]:
# Generate fixed versions for all leaky files and write them to dummy-project/fixed
generated_paths = []

filename_map = {
    'application': 'LeakyDemoApplication.java',
    'order_controller': 'OrderController.java',
    'order_repository': 'OrderRepository.java',
    'order_entity': 'Order.java'
}

print("Generating corrected Java files...")
print("-" * 70)

for name, code in LEAKY_SAMPLES.items():
    src_filename = filename_map.get(name, f"{name}.java")
    print(f"Processing: {src_filename}")
    
    rules = retrieve_relevant_rules(code, top_k=5)
    fixed_code = make_fixed_java(src_filename, code, rules)
    
    out_path = (fixed_dir / src_filename)
    out_path.write_text(fixed_code, encoding='utf-8')
    generated_paths.append(str(out_path))
    print(f"  Written: {out_path.name}")

print("\n" + "=" * 70)
print("File Generation Complete")
print(f"Total files generated: {len(generated_paths)}")
print("\nGenerated files:")
for p in generated_paths:
    print(f"  - {p}")

Generating corrected Java files...
----------------------------------------------------------------------
Processing: LeakyDemoApplication.java
  Written: LeakyDemoApplication.java
Processing: Order.java
  Written: Order.java
Processing: OrderController.java
  Written: OrderController.java
Processing: OrderRepository.java
  Written: OrderRepository.java

File Generation Complete
Total files generated: 4

Generated files:
  - dummy-project\fixed\LeakyDemoApplication.java
  - dummy-project\fixed\Order.java
  - dummy-project\fixed\OrderController.java
  - dummy-project\fixed\OrderRepository.java


In [24]:
# Verification: List all generated files in the fixed directory
print("\nVerification: Files in dummy-project/fixed/")
print("=" * 70)

fixed_files = sorted(fixed_dir.glob('*.java'))
if fixed_files:
    print(f"Total Java files: {len(fixed_files)}\n")
    for file_path in fixed_files:
        file_size = file_path.stat().st_size
        print(f"  {file_path.name:30s} ({file_size:5d} bytes)")
    print("\nAll corrected files successfully generated")
else:
    print("Warning: No files found in fixed directory")


Verification: Files in dummy-project/fixed/
Total Java files: 4

  LeakyDemoApplication.java      ( 3565 bytes)
  Order.java                     (  985 bytes)
  OrderController.java           (  557 bytes)
  OrderRepository.java           (  388 bytes)

All corrected files successfully generated
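The listing above confirms that the files exist and are non-empty, but not that their contents look like Java. A lightweight heuristic check could be sketched as below. `java_sanity_issues` is a hypothetical helper, not part of the notebook, and its brace count is deliberately naive: it ignores braces inside strings and comments.

```python
import re

# Matches a top-level type declaration such as "class Order" or "interface OrderRepository".
TYPE_DECL = re.compile(r"\b(class|interface|enum|record)\s+[A-Za-z_]\w*")


def java_sanity_issues(source: str) -> list[str]:
    """Return a list of heuristic problems found in generated Java source."""
    issues = []
    if "```" in source:
        # LLM responses sometimes wrap code in Markdown fences.
        issues.append("contains a Markdown code fence")
    if not TYPE_DECL.search(source):
        issues.append("no class/interface/enum/record declaration found")
    if source.count("{") != source.count("}"):
        issues.append("unbalanced curly braces")
    return issues
```

It could be applied to each path in `fixed_files` via `file_path.read_text(encoding='utf-8')`. An empty list means only that none of these heuristics fired, not that the file compiles.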


## 4.2: MCP-Based Workflow (Optional)

**Model Context Protocol (MCP):** Advanced approach using a separate server process.

**Requirements:**
- MCP server implementation in the `mcp-fixer-api/` directory
- `mcp` Python package (already installed via requirements.txt)
- `OPENAI_API_KEY` environment variable set
- **Kernel must have been restarted after initial installation**

**Note:** This section is optional and can be skipped if the MCP server is not available.
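The requirements above can be checked before any subprocess is started. The sketch below uses only the standard library. `mcp_preflight` is a hypothetical helper name, and the default directory is taken from the requirement list.

```python
import importlib.util
import os
from pathlib import Path


def mcp_preflight(server_dir: str = "mcp-fixer-api") -> list[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems = []
    if not Path(server_dir).is_dir():
        problems.append(f"server directory not found: {server_dir}")
    # find_spec locates the package without importing it.
    if importlib.util.find_spec("mcp") is None:
        problems.append("Python package 'mcp' is not importable")
    if not os.environ.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    return problems
```

If the returned list is non-empty, the MCP cells can be skipped and the files from 4.1 used instead.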

In [25]:
# Export OPENAI_API_KEY from api-key.txt for the MCP subprocess
import os
import pathlib

key_file = pathlib.Path('api-key.txt')
if key_file.exists():
    api_key = key_file.read_text(encoding='utf-8').strip()
    if api_key:
        os.environ['OPENAI_API_KEY'] = api_key
        print("OPENAI_API_KEY set for MCP subprocess")
    else:
        print("Warning: api-key.txt is empty")
else:
    print('Warning: api-key.txt not found. Set OPENAI_API_KEY in environment manually.')

OPENAI_API_KEY set for MCP subprocess


In [26]:
# Call MCP tool and write results - Jupyter-compatible async version
try:
    import asyncio
    import json
    
    # Try different import paths for different mcp versions
    Session = None
    stdio_client = None
    
    # Strategy 1: Original import path (mcp < 1.0)
    try:
        from mcp.client.session import Session
        from mcp.client.stdio import stdio_client
        print("Using MCP import strategy 1 (legacy)")
    except ImportError:
        pass
    
    # Strategy 2: New import path (mcp >= 1.0)
    if Session is None:
        try:
            from mcp import ClientSession as Session
            from mcp.client.stdio import stdio_client
            print("Using MCP import strategy 2 (new API)")
        except ImportError:
            pass
    
    # Strategy 3: Alternative paths
    if Session is None:
        try:
            from mcp.client import Session, stdio_client
            print("Using MCP import strategy 3 (flat structure)")
        except ImportError:
            pass
    
    if Session is None or stdio_client is None:
        raise ImportError(
            "Could not import MCP components from any known import path.\n"
            "The mcp package structure may have changed.\n"
            "Consider checking: https://github.com/modelcontextprotocol/python-sdk"
        )

    MCP_WORKDIR = 'mcp-fixer-api'
    MCP_CMD = ['python', '-m', 'app.mcp_server']

    async def call_mcp_server_async(samples_dict):
        """Call MCP server asynchronously - Jupyter compatible."""
        env = os.environ.copy()
        if not env.get('OPENAI_API_KEY'):
            raise RuntimeError('OPENAI_API_KEY is not set')
        
        async with stdio_client(command=MCP_CMD, cwd=MCP_WORKDIR, env=env) as (read, write):
            async with Session(read, write) as session:
                await session.initialize()
                result = await session.call_tool('fix_from_json', arguments=samples_dict)
                if not result or getattr(result[0], 'type', '') != 'text':
                    raise RuntimeError(f'Unexpected MCP response: {result}')
                return json.loads(result[0].text)

    samples = {
        'application': LEAKY_SAMPLES['application'],
        'order_controller': LEAKY_SAMPLES['order_controller'],
        'order_repository': LEAKY_SAMPLES['order_repository'],
        'order_entity': LEAKY_SAMPLES['order_entity'],
    }

    print("Calling MCP server to generate fixed files...")
    
    # Use await directly in Jupyter (top-level await is supported)
    fixed_map = await call_mcp_server_async(samples)
    
    mcp_output_dir = prepare_fixed_dir('dummy-project/fixed-mcp')
    for filename, content in fixed_map.items():
        (mcp_output_dir / filename).write_text(content, encoding='utf-8')
    
    print(f'\nMCP-generated files written to: {mcp_output_dir.resolve()}')
    print(f"Total files: {len(fixed_map)}")
    
except ImportError as e:
    print(f"MCP package import failed: {str(e)}")
    print("\nTroubleshooting steps:")
    print("  1. Check which import strategy, if any, was reported above")
    print("  2. Verify kernel was restarted after installation")
    print("  3. Try: pip install --upgrade mcp")
    print("  4. Check MCP SDK docs: https://github.com/modelcontextprotocol/python-sdk")
    print("\nNote: Standard workflow (4.1) already generated fixed files successfully.")
except FileNotFoundError:
    print("MCP server directory not found. Ensure mcp-fixer-api/ exists.")
    print("The MCP server must be implemented separately.")
except Exception as e:
    print(f"MCP execution failed: {str(e)}")
    print(f"Error type: {type(e).__name__}")
    import traceback
    print("\nFull traceback:")
    traceback.print_exc()
    print("\nThis is optional - standard workflow already generated fixed files.")

Using MCP import strategy 2 (new API)
Calling MCP server to generate fixed files...
MCP execution failed: stdio_client() got an unexpected keyword argument 'command'
Error type: TypeError

Full traceback:

This is optional - standard workflow already generated fixed files.


Traceback (most recent call last):
  File "C:\Users\david.arnold\AppData\Local\Temp\ipykernel_3688\2913149528.py", line 69, in <module>
    fixed_map = await call_mcp_server_async(samples)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\david.arnold\AppData\Local\Temp\ipykernel_3688\2913149528.py", line 51, in call_mcp_server_async
    async with stdio_client(command=MCP_CMD, cwd=MCP_WORKDIR, env=env) as (read, write):
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\david.arnold\AppData\Local\Programs\Python\Python312\Lib\contextlib.py", line 322, in helper
    return _AsyncGeneratorContextManager(func, args, kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\david.arnold\AppData\Local\Programs\Python\Python312\Lib\contextlib.py", line 105, in __init__
    self.gen = func(*args, **kwds)
               ^^^^^^^^^^^^^^^^^^^
TypeError: stdio_client() got an unexpected keyword argument 'command'
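
The `TypeError` shows that the installed `stdio_client` does not accept `command`, `cwd` or `env` as keyword arguments. In recent releases of the MCP Python SDK, `stdio_client` appears to take a single `StdioServerParameters` object, and `call_tool` returns a result object whose blocks are in `.content` rather than a plain list. The sketch below adapts the call under those assumptions. `build_server_kwargs` and `call_fixer_tool` are hypothetical names, the `cwd` field may be missing in older SDK versions, and the `fix_from_json` tool and its JSON text response are taken from the cell above, not verified against the server.

```python
import json
import os


def build_server_kwargs(cmd, workdir, env):
    """Split a command list into the fields StdioServerParameters is assumed to expect."""
    return {
        "command": cmd[0],        # executable, e.g. "python"
        "args": list(cmd[1:]),    # remaining arguments, e.g. ["-m", "app.mcp_server"]
        "cwd": workdir,           # assumption: supported by the installed SDK version
        "env": dict(env),
    }


async def call_fixer_tool(samples_dict, cmd, workdir):
    """Start the MCP server over stdio, call the tool, and return the parsed JSON."""
    # Imported lazily so this cell still loads when mcp is not installed.
    from mcp import ClientSession, StdioServerParameters
    from mcp.client.stdio import stdio_client

    params = StdioServerParameters(**build_server_kwargs(cmd, workdir, os.environ))
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool("fix_from_json", arguments=samples_dict)
            blocks = result.content
            if not blocks or getattr(blocks[0], "type", "") != "text":
                raise RuntimeError(f"Unexpected MCP response: {result}")
            return json.loads(blocks[0].text)
```

In the notebook this would be invoked as `fixed_map = await call_fixer_tool(samples, MCP_CMD, MCP_WORKDIR)`. Passing `sys.executable` instead of the literal `'python'` as the first command element would help ensure the server runs in the same environment as the kernel.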
