2 changes: 1 addition & 1 deletion LICENSE
@@ -1,6 +1,6 @@
MIT License

Copyright (c) 2025 Angular
Copyright (c) 2025 Google LLC

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
147 changes: 113 additions & 34 deletions README.md
@@ -1,36 +1,57 @@
# Web Codegen Scorer

This project is a tool designed to assess the quality of front-end code generated by Large Language Models (LLMs).
**Web Codegen Scorer** is a tool for evaluating the quality of web code generated by Large Language
Models (LLMs).

## Documentation directory
You can use this tool to make evidence-based decisions relating to AI-generated code. For example:

- [Environment config reference](./docs/environment-reference.md)
- [How to set up a new model?](./docs/model-setup.md)
* 🔄 Iterate on a system prompt to find the most effective instructions for your project.
* ⚖️ Compare the quality of code produced by different models.
* 📈 Monitor generated code quality over time as models and agents evolve.

Web Codegen Scorer is different from other code benchmarks in that it focuses specifically on _web_
code and relies primarily on well-established measures of code quality.

## Features

* ⚙️ Configure your evaluations with different models, frameworks, and tools.
* ✍️ Specify system instructions and add MCP servers.
* 📋 Use built-in checks for build success, runtime errors, accessibility, security, LLM rating, and
coding best practices. (More built-in checks coming soon!)
* 🔧 Automatically attempt to repair issues detected during code generation.
* 📊 View and compare results with an intuitive report viewer UI.

## Setup

1. **Install the package:**
1. **Install the package:**

```bash
npm install -g web-codegen-scorer
```

2. **Set up your API keys:**
In order to run an eval, you have to specify an API keys for the relevant providers as environment variables:
2. **Set up your API keys:**

In order to run an eval, you have to specify API keys for the relevant providers as
environment variables:

```bash
export GEMINI_API_KEY="YOUR_API_KEY_HERE" # If you're using Gemini models
export OPENAI_API_KEY="YOUR_API_KEY_HERE" # If you're using OpenAI models
export ANTHROPIC_API_KEY="YOUR_API_KEY_HERE" # If you're using Anthropic models
```

3. **Run an eval:**
You can run your first eval using our Angular example with the following command:

You can run your first eval using our Angular example with the following command:

```bash
web-codegen-scorer eval --env=angular-example
```

4. (Optional) **Set up your own eval:**
If you want to set up a custom eval, instead of using our built-in examples, you can run the following
command which will guide you through the process:

If you want to set up a custom eval, instead of using our built-in examples, you can run the
following command, which will guide you through the process:

```bash
web-codegen-scorer init
@@ -40,54 +61,112 @@ web-codegen-scorer init

You can customize the `web-codegen-scorer eval` script with the following flags:

- `--env=<path>` (alias: `--environment`): (**Required**) Specifies the path from which to load the environment config.
- Example: `web-codegen-scorer eval --env=foo/bar/my-env.js`
- `--env=<path>` (alias: `--environment`): (**Required**) Specifies the path from which to load the
environment config.
- Example: `web-codegen-scorer eval --env=foo/bar/my-env.js`

- `--model=<name>`: Specifies the model to use when generating code. Defaults to the value of `DEFAULT_MODEL_NAME`.
- Example: `web-codegen-scorer eval --model=gemini-2.5-flash --env=<config path>`
- `--model=<name>`: Specifies the model to use when generating code. Defaults to the value of
`DEFAULT_MODEL_NAME`.
- Example: `web-codegen-scorer eval --model=gemini-2.5-flash --env=<config path>`

- `--runner=<name>`: Specifies the runner to use to execute the eval. Supported runners are `genkit` (default) or `gemini-cli`.
- `--runner=<name>`: Specifies the runner to use to execute the eval. Supported runners are
`genkit` (default) or `gemini-cli`.
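- Example: `web-codegen-scorer eval --runner=gemini-cli --env=<config path>`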

- `--local`: Runs the script in local mode for the initial code generation request. Instead of calling the LLM, it will attempt to read the initial code from a corresponding file in the `.llm-output` directory (e.g., `.llm-output/todo-app.ts`). This is useful for re-running assessments or debugging the build/repair process without incurring LLM costs for the initial generation.
- **Note:** You typically need to run `web-codegen-scorer eval` once without `--local` to generate the initial files in `.llm-output`.
- The `web-codegen-scorer eval:local` script is a shortcut for `web-codegen-scorer eval --local`.
- `--local`: Runs the script in local mode for the initial code generation request. Instead of
calling the LLM, it will attempt to read the initial code from a corresponding file in the
`.web-codegen-scorer/llm-output` directory (e.g., `.web-codegen-scorer/llm-output/todo-app.ts`).
This is useful for re-running assessments or debugging the build/repair process without incurring
LLM costs for the initial generation.
- **Note:** You typically need to run `web-codegen-scorer eval` once without `--local` to
generate the initial files in `.web-codegen-scorer/llm-output`.
- The `web-codegen-scorer eval:local` script is a shortcut for
`web-codegen-scorer eval --local`.
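- Example: `web-codegen-scorer eval --local --env=<config path>`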

- `--limit=<number>`: Specifies the number of application prompts to process. Defaults to `5`.
- Example: `web-codegen-scorer eval --limit=10 --env=<config path>`
- Example: `web-codegen-scorer eval --limit=10 --env=<config path>`

- `--output-directory=<name>` (alias: `--output-dir`): Specifies which directory to output the generated code under which is useful for debugging. By default the code will be generated in a temporary directory.
- Example: `web-codegen-scorer eval --output-dir=test-output --env=<config path>`
- `--output-directory=<name>` (alias: `--output-dir`): Specifies the directory in which to output
the generated code, which is useful for debugging. By default, the code will be generated in a
temporary directory.
- Example: `web-codegen-scorer eval --output-dir=test-output --env=<config path>`

- `--concurrency=<number>`: Sets the maximum number of concurrent AI API requests. Defaults to `5` (as defined by `DEFAULT_CONCURRENCY` in `src/config.ts`).
- Example: `web-codegen-scorer eval --concurrency=3 --env=<config path>`
- `--concurrency=<number>`: Sets the maximum number of concurrent AI API requests. Defaults to `5`
(as defined by `DEFAULT_CONCURRENCY` in `src/config.ts`).
- Example: `web-codegen-scorer eval --concurrency=3 --env=<config path>`

- `--report-name=<name>`: Sets the name for the generated report directory. Defaults to a timestamp (e.g., `2023-10-27T10-30-00-000Z`). The name will be sanitized (non-alphanumeric characters replaced with hyphens).
- Example: `web-codegen-scorer eval --report-name=my-custom-report --env=<config path>`
- `--report-name=<name>`: Sets the name for the generated report directory. Defaults to a
timestamp (e.g., `2023-10-27T10-30-00-000Z`). The name will be sanitized (non-alphanumeric
characters replaced with hyphens).
- Example: `web-codegen-scorer eval --report-name=my-custom-report --env=<config path>`

- `--rag-endpoint=<url>`: Specifies a custom RAG (Retrieval-Augmented Generation) endpoint URL. The URL must contain a `PROMPT` substring, which will be replaced with the user prompt.
- Example: `web-codegen-scorer eval --rag-endpoint="http://localhost:8080/my-rag-endpoint?query=PROMPT" --env=<config path>`
- `--rag-endpoint=<url>`: Specifies a custom RAG (Retrieval-Augmented Generation) endpoint URL. The
URL must contain a `PROMPT` substring, which will be replaced with the user prompt.
- Example:
`web-codegen-scorer eval --rag-endpoint="http://localhost:8080/my-rag-endpoint?query=PROMPT" --env=<config path>`

- `--prompt-filter=<name>`: String used to filter which prompts should be run. By default a random sample (controlled by `--limit`) will be taken from the prompts in the current environment. Setting this can be useful for debugging a specific prompt.
- Example: `web-codegen-scorer eval --prompt-filter=tic-tac-toe --env=<config path>`
- `--prompt-filter=<name>`: String used to filter which prompts should be run. By default, a random
sample (controlled by `--limit`) will be taken from the prompts in the current environment.
Setting this can be useful for debugging a specific prompt.
- Example: `web-codegen-scorer eval --prompt-filter=tic-tac-toe --env=<config path>`

- `--skip-screenshots`: Whether to skip taking screenshots of the generated app. Defaults to `false`.
- Example: `web-codegen-scorer eval --skip-screenshots --env=<config path>`
- `--skip-screenshots`: Whether to skip taking screenshots of the generated app. Defaults to
`false`.
- Example: `web-codegen-scorer eval --skip-screenshots --env=<config path>`

- `--labels=<label1> <label2>`: Metadata labels that will be attached to the run.
- Example: `web-codegen-scorer eval --labels my-label another-label --env=<config path>`
- Example: `web-codegen-scorer eval --labels my-label another-label --env=<config path>`

- `--mcp`: Whether to start an MCP server for the evaluation. Defaults to `false`.
- Example: `web-codegen-scorer eval --mcp --env=<config path>`
- Example: `web-codegen-scorer eval --mcp --env=<config path>`

- `--help`: Prints out usage information about the script.
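
Several of these flags can be combined in a single invocation. The sketch below uses only flags
documented above, with the example environment, model, and values from this README; substitute your
own:

```bash
web-codegen-scorer eval \
  --env=angular-example \
  --model=gemini-2.5-flash \
  --limit=10 \
  --concurrency=3 \
  --report-name=my-custom-report \
  --labels my-label another-label
```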

### Additional configuration options

- [Environment config reference](./docs/environment-reference.md)
- [How to set up a new model?](./docs/model-setup.md)

## Local development

If you've cloned this repo and want to work on the tool, you have to install its dependencies by running `pnpm install`.
If you've cloned this repo and want to work on the tool, you have to install its dependencies by
running `pnpm install`.
Once they're installed, you can run the following commands:

* `pnpm run release-build` - Builds the package in the `dist` directory for publishing to npm.
* `pnpm run eval` - Runs an eval from source.
* `pnpm run report` - Runs the report app from source.
* `pnpm run init` - Runs the init script from source.
* `pnpm run format` - Formats the source code using Prettier.
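
For example, a typical local workflow might look like the following sketch (it assumes the `eval`
script forwards extra CLI flags, such as `--env`, to the underlying command):

```bash
# Install dependencies after cloning the repo
pnpm install

# Run an eval from source (assumes the script forwards the --env flag)
pnpm run eval --env=angular-example

# Inspect the results in the report viewer, also run from source
pnpm run report
```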

## FAQ

### Who built this tool?

This tool is built by the Angular team at Google.

### Does this tool only work for Angular code or Google models?

No! You can use this tool with any web library or framework (or none at all) as well as any model.

### Why did you build this tool?

As more and more developers reach for LLM-based tools to create and modify code, we wanted to be
able to empirically measure the effect of different factors on the quality of generated code. While
many LLM coding benchmarks exist, we found that these were often too broad and didn't measure the
specific quality metrics we cared about.

In the absence of such a tool, we found that many developers based their judgements about codegen
with different models, frameworks, and tools on loosely structured trial-and-error. In contrast,
Web Codegen Scorer gives us a platform to measure codegen across different configurations with
consistency and repeatability.

### Will you add more features over time?

Yes! We plan to expand both the number of built-in checks and the variety of codegen scenarios.

Our roadmap includes:

* Including _interaction testing_ in the rating, to ensure the generated code performs any requested
behaviors.
* Measuring Core Web Vitals.
* Measuring the effectiveness of LLM-driven edits on an existing codebase.