From 6f0fcb331b34472e73ddf7c77aaaf1019fadd966 Mon Sep 17 00:00:00 2001
From: Jeremy Elbourn
Date: Mon, 15 Sep 2025 08:47:13 -0700
Subject: [PATCH] Polish documentation for open-source release

* Expand the README with additional intro and FAQ
* Change some bulleted lists to tables
* Minor grammar edits and formatting throughout
---
 LICENSE                       |   2 +-
 README.md                     | 147 ++++++++++++++++++++++++++--------
 docs/environment-reference.md |  88 ++++++++++----------
 docs/model-setup.md           |   8 +-
 package.json                  |  14 +++-
 5 files changed, 174 insertions(+), 85 deletions(-)

diff --git a/LICENSE b/LICENSE
index 322654b..9eb3c39 100644
--- a/LICENSE
+++ b/LICENSE
@@ -1,6 +1,6 @@
MIT License

-Copyright (c) 2025 Angular
+Copyright (c) 2025 Google LLC

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
diff --git a/README.md b/README.md
index f98bda0..05ab175 100644
--- a/README.md
+++ b/README.md
@@ -1,21 +1,39 @@
# Web Codegen Scorer

-This project is a tool designed to assess the quality of front-end code generated by Large Language Models (LLMs).
+**Web Codegen Scorer** is a tool for evaluating the quality of web code generated by Large Language
+Models (LLMs).

-## Documentation directory
+You can use this tool to make evidence-based decisions relating to AI-generated code. For example:

-- [Environment config reference](./docs/environment-reference.md)
-- [How to set up a new model?](./docs/model-setup.md)
+* 🔄 Iterate on a system prompt to find the most effective instructions for your project.
+* ⚖️ Compare the quality of code produced by different models.
+* 📈 Monitor generated code quality over time as models and agents evolve.
+
+Web Codegen Scorer is different from other code benchmarks in that it focuses specifically on _web_
+code and relies primarily on well-established measures of code quality.
+
+## Features
+
+* ⚙️ Configure your evaluations with different models, frameworks, and tools.
+* ✍️ Specify system instructions and add MCP servers.
+* 📋 Use built-in checks for build success, runtime errors, accessibility, security, LLM rating, and
+  coding best practices. (More built-in checks coming soon!)
+* 🔧 Automatically attempt to repair issues detected during code generation.
+* 📊 View and compare results with an intuitive report viewer UI.

## Setup

-1. **Install the package:**
+1. **Install the package:**
+
```bash
npm install -g web-codegen-scorer
```

-2. **Set up your API keys:**
-In order to run an eval, you have to specify an API keys for the relevant providers as environment variables:
+2. **Set up your API keys:**
+
+   In order to run an eval, you have to specify API keys for the relevant providers as
+   environment variables:
+
```bash
export GEMINI_API_KEY="YOUR_API_KEY_HERE" # If you're using Gemini models
export OPENAI_API_KEY="YOUR_API_KEY_HERE" # If you're using OpenAI models
@@ -23,14 +41,17 @@ export ANTHROPIC_API_KEY="YOUR_API_KEY_HERE" # If you're using Anthropic models
```

3. **Run an eval:**
-You can run your first eval using our Angular example with the following command:
+
+   You can run your first eval using our Angular example with the following command:
+
```bash
web-codegen-scorer eval --env=angular-example
```

4. 
(Optional) **Set up your own eval:**
-If you want to set up a custom eval, instead of using our built-in examples, you can run the following
-command which will guide you through the process:
+
+   If you want to set up a custom eval, instead of using our built-in examples, you can run the
+   following command, which will guide you through the process:

```bash
web-codegen-scorer init
```

@@ -40,50 +61,75 @@

## Configuration

You can customize the `web-codegen-scorer eval` script with the following flags:

-- `--env=` (alias: `--environment`): (**Required**) Specifies the path from which to load the environment config.
-  - Example: `web-codegen-scorer eval --env=foo/bar/my-env.js`
+- `--env=` (alias: `--environment`): (**Required**) Specifies the path from which to load the
+  environment config.
+  - Example: `web-codegen-scorer eval --env=foo/bar/my-env.js`

-- `--model=`: Specifies the model to use when generating code. Defaults to the value of `DEFAULT_MODEL_NAME`.
-  - Example: `web-codegen-scorer eval --model=gemini-2.5-flash --env=`
+- `--model=`: Specifies the model to use when generating code. Defaults to the value of
+  `DEFAULT_MODEL_NAME`.
+  - Example: `web-codegen-scorer eval --model=gemini-2.5-flash --env=`

-- `--runner=`: Specifies the runner to use to execute the eval. Supported runners are `genkit` (default) or `gemini-cli`.
+- `--runner=`: Specifies the runner to use to execute the eval. Supported runners are
+  `genkit` (default) or `gemini-cli`.

-- `--local`: Runs the script in local mode for the initial code generation request. Instead of calling the LLM, it will attempt to read the initial code from a corresponding file in the `.llm-output` directory (e.g., `.llm-output/todo-app.ts`). This is useful for re-running assessments or debugging the build/repair process without incurring LLM costs for the initial generation.
-  - **Note:** You typically need to run `web-codegen-scorer eval` once without `--local` to generate the initial files in `.llm-output`.
-  - The `web-codegen-scorer eval:local` script is a shortcut for `web-codegen-scorer eval --local`.
+- `--local`: Runs the script in local mode for the initial code generation request. Instead of
+  calling the LLM, it will attempt to read the initial code from a corresponding file in the
+  `.web-codegen-scorer/llm-output` directory (e.g., `.web-codegen-scorer/llm-output/todo-app.ts`).
+  This is useful for re-running assessments or debugging the build/repair process without incurring
+  LLM costs for the initial generation.
+  - **Note:** You typically need to run `web-codegen-scorer eval` once without `--local` to
+    generate the initial files in `.web-codegen-scorer/llm-output`.
+  - The `web-codegen-scorer eval:local` script is a shortcut for
+    `web-codegen-scorer eval --local`.

- `--limit=`: Specifies the number of application prompts to process. Defaults to `5`.
-  - Example: `web-codegen-scorer eval --limit=10 --env=`
+  - Example: `web-codegen-scorer eval --limit=10 --env=`

-- `--output-directory=` (alias: `--output-dir`): Specifies which directory to output the generated code under which is useful for debugging. By default the code will be generated in a temporary directory.
-  - Example: `web-codegen-scorer eval --output-dir=test-output --env=`
+- `--output-directory=` (alias: `--output-dir`): Specifies the directory in which to output the
+  generated code, which is useful for debugging. By default, the code will be generated in a
+  temporary directory. 
+  - Example: `web-codegen-scorer eval --output-dir=test-output --env=`

-- `--concurrency=`: Sets the maximum number of concurrent AI API requests. Defaults to `5` (as defined by `DEFAULT_CONCURRENCY` in `src/config.ts`).
-  - Example: `web-codegen-scorer eval --concurrency=3 --env=`
+- `--concurrency=`: Sets the maximum number of concurrent AI API requests. Defaults to `5`
+  (as defined by `DEFAULT_CONCURRENCY` in `src/config.ts`).
+  - Example: `web-codegen-scorer eval --concurrency=3 --env=`

-- `--report-name=`: Sets the name for the generated report directory. Defaults to a timestamp (e.g., `2023-10-27T10-30-00-000Z`). The name will be sanitized (non-alphanumeric characters replaced with hyphens).
-  - Example: `web-codegen-scorer eval --report-name=my-custom-report --env=`
+- `--report-name=`: Sets the name for the generated report directory. Defaults to a
+  timestamp (e.g., `2023-10-27T10-30-00-000Z`). The name will be sanitized (non-alphanumeric
+  characters replaced with hyphens).
+  - Example: `web-codegen-scorer eval --report-name=my-custom-report --env=`

-- `--rag-endpoint=`: Specifies a custom RAG (Retrieval-Augmented Generation) endpoint URL. The URL must contain a `PROMPT` substring, which will be replaced with the user prompt.
-  - Example: `web-codegen-scorer eval --rag-endpoint="http://localhost:8080/my-rag-endpoint?query=PROMPT" --env=`
+- `--rag-endpoint=`: Specifies a custom RAG (Retrieval-Augmented Generation) endpoint URL. The
+  URL must contain a `PROMPT` substring, which will be replaced with the user prompt.
+  - Example:
+    `web-codegen-scorer eval --rag-endpoint="http://localhost:8080/my-rag-endpoint?query=PROMPT" --env=`

-- `--prompt-filter=`: String used to filter which prompts should be run. By default a random sample (controlled by `--limit`) will be taken from the prompts in the current environment. Setting this can be useful for debugging a specific prompt.
-  - Example: `web-codegen-scorer eval --prompt-filter=tic-tac-toe --env=`
+- `--prompt-filter=`: String used to filter which prompts should be run. By default, a random
+  sample (controlled by `--limit`) will be taken from the prompts in the current environment.
+  Setting this can be useful for debugging a specific prompt.
+  - Example: `web-codegen-scorer eval --prompt-filter=tic-tac-toe --env=`

-- `--skip-screenshots`: Whether to skip taking screenshots of the generated app. Defaults to `false`.
-  - Example: `web-codegen-scorer eval --skip-screenshots --env=`
+- `--skip-screenshots`: Whether to skip taking screenshots of the generated app. Defaults to
+  `false`.
+  - Example: `web-codegen-scorer eval --skip-screenshots --env=`

- `--labels= `: Metadata labels that will be attached to the run.
-  - Example: `web-codegen-scorer eval --labels my-label another-label --env=`
+  - Example: `web-codegen-scorer eval --labels my-label another-label --env=`

- `--mcp`: Whether to start an MCP for the evaluation. Defaults to `false`.
-  - Example: `web-codegen-scorer eval --mcp --env=`
+  - Example: `web-codegen-scorer eval --mcp --env=`

- `--help`: Prints out usage information about the script.

+### Additional configuration options
+
+- [Environment config reference](./docs/environment-reference.md)
+- [How to set up a new model?](./docs/model-setup.md)
+
## Local development

-If you've cloned this repo and want to work on the tool, you have to install its dependencies by running `pnpm install`.
+If you've cloned this repo and want to work on the tool, you have to install its dependencies by
+running `pnpm install`. 
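+For example, assuming you're starting from a fresh clone of the repository, the initial setup might
+look like this:
+
+```bash
+# Clone the repository and move into it (repository location shown for illustration).
+git clone https://github.com/angular/web-codegen-scorer.git
+cd web-codegen-scorer
+
+# Install the tool's dependencies with pnpm.
+pnpm install
+```
+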
Once they're installed, you can run the following commands:

* `pnpm run release-build` - Builds the package in the `dist` directory for publishing to npm.
@@ -91,3 +137,36 @@ Once they're installed, you can run the following commands:
* `pnpm run report` - Runs the report app from source.
* `pnpm run init` - Runs the init script from source.
* `pnpm run format` - Formats the source code using Prettier.
+
+## FAQ
+
+### Who built this tool?
+
+This tool is built by the Angular team at Google.
+
+### Does this tool only work for Angular code or Google models?
+
+No! You can use this tool with any web library or framework (or none at all) as well as any model.
+
+### Why did you build this tool?
+
+As more and more developers reach for LLM-based tools to create and modify code, we wanted to be
+able to empirically measure the effect of different factors on the quality of generated code. While
+many LLM coding benchmarks exist, we found that these were often too broad and didn't measure the
+specific quality metrics we cared about.
+
+In the absence of such a tool, we found that many developers based their judgments about codegen
+with different models, frameworks, and tools on loosely structured trial-and-error. In contrast, Web
+Codegen Scorer gives us a platform to measure codegen across different configurations with
+consistency and repeatability.
+
+### Will you add more features over time?
+
+Yes! We plan to expand both the number of built-in checks and the variety of codegen scenarios.
+
+Our roadmap includes:
+
+* Including _interaction testing_ in the rating, to ensure the generated code performs any requested
+  behaviors.
+* Measuring Core Web Vitals.
+* Measuring the effectiveness of LLM-driven edits on an existing codebase.
diff --git a/docs/environment-reference.md b/docs/environment-reference.md
index 00fae60..b177ce5 100644
--- a/docs/environment-reference.md
+++ b/docs/environment-reference.md
@@ -1,7 +1,7 @@
# Environment configuration reference

Environments are configured by creating a `config.js` that exposes an object that satisfies the
-`EnvironmentConfig` interface. This document covers all the possible options in `EnvironmentConfig`
+`EnvironmentConfig` interface. This document covers all options in `EnvironmentConfig`
and what they do.

## Required properties

These properties all have to be specified in order for the environment to functi

### `displayName`

-Human-readable name that will be shown in eval reports about this environment.
+Human-readable name that is shown in eval reports about this environment.

### `id`

-Unique ID for the environment. If ommitted, one will be generated from the `displayName`.
+Unique ID for the environment. If omitted, one is generated from the `displayName`.

### `clientSideFramework`

-ID of the client-side framework that the environment will be running, for example `angular`.
+ID of the client-side framework that the environment runs, for example `angular`.

### `ratings`

-An array defining the ratings that will be executed as a part of the evaluation.
-The ratings determine what score that will be assigned to the test run.
-Currently we support the following types of ratings:
+An array defining the ratings that are executed as a part of the evaluation.
+The ratings determine the score assigned to the test run.
+Currently, the tool supports the following built-in ratings:

-- `PerBuildRating` - assigns a score based on the build result of the generated code, e.g.
-  "Does it build on the first run?" 
or "Does it build after X repair attempts?" -- `PerFileRating` - assigns a score based on the content of individual files generated by the LLM. - Can be run either against all file types by setting the `filter` to - `PerFileRatingContentType.UNKNOWN` or against specific files. -- `LLMBasedRating` - rates the generated code by asking an LLM to assign a score to it, - e.g. "Does this app match the specified prompts?" +| Rating Name | Description | +|------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `PerBuildRating` | Assigns a score based on the build result of the generated code, e.g. "Does it build on the first run?" or "Does it build after X repair attempts?" | +| `PerFileRating` | Assigns a score based on the content of individual files generated by the LLM. Can be run either against all file types by setting the `filter` to
`PerFileRatingContentType.UNKNOWN` or against specific files. |
+| `LLMBasedRating` | Rates the generated code by asking an LLM to assign a score to it, e.g. "Does this app match the specified prompts?" |

### `packageManager`

@@ -62,29 +60,25 @@
This is useful when evaluating confidential code.

### `skipInstall`

-Whether to skip installing dependencies during the eval run. This can be useful if you've already
-ensured that all dependencies are installed through something like pnpm workspaces.
+Whether to skip installing dependencies during the eval run. This is useful if you've already
+installed dependencies through something like pnpm workspaces.

### Prompt templating

-Prompts are typically stored in `.md` files. We support the following template syntax inside of
-these files in order to augment the prompt and reduce boilerplate:
+Prompts are typically stored in `.md` files. The tool supports the following template syntax inside
+of these files in order to augment the prompt and reduce boilerplate:

-- `{{> embed file='../path/to/file.md' }}` - embeds the content of the specified file in the
-  current one.
-- `{{> contextFiles '**/*.foo' }}` - specifies files that should be passed to the LLM as context
-  when the prompt is executed. Should be a comma-separated string of glob pattern **within** the
-  environments project code. E.g. `{{> contextFiles '**/*.ts, **/*.html' }}` will pass all `.ts`
-  and `.html` files as context.
-- `{{CLIENT_SIDE_FRAMEWORK_NAME}}` - insert the name of the client-side framework of the current
-  environment.
-- `{{FULL_STACK_FRAMEWORK_NAME}}` - insert the name of the full-stack framework of the current
-  environment.
+| Helper / Variable | Description |
+|------------------------------------------|------------------------------------------------------------------|
+| `{{> embed file='../path/to/file.md' }}` | Embeds the content of the specified file in the current one. |
+| `{{> contextFiles '**/*.foo' }}` | Specifies files that should be passed to the LLM as context when the prompt is executed. Should be a comma-separated string of glob patterns **within** the environment's project code. E.g. `{{> contextFiles '**/*.ts, **/*.html' }}` passes all `.ts` and `.html` files as context. |
+| `{{CLIENT_SIDE_FRAMEWORK_NAME}}` | Inserts the name of the client-side framework of the current environment. |
+| `{{FULL_STACK_FRAMEWORK_NAME}}` | Inserts the name of the full-stack framework of the current environment. |

### Prompt-specific ratings

-If you want to run a set of ratings against a specific prompt, you can set an object literal
-in the `executablePrompts` array, instead of a string:
+If you want to run a set of ratings against a specific prompt, set an object literal in the
+`executablePrompts` array, instead of a string:

```ts
executablePrompts: [
@@ -101,10 +95,12 @@ executablePrompts: [

### Multi-step prompts

-Multi-step prompts are prompts meant to evaluate workflows made up of one or more stages.
-Steps execute one after another **inside the same directory**, but are rated individually and
-snapshots after each step are stored in the final report. 
You can create a multi-step prompt by
-passing an instrance of the `MultiStepPrompt` class into the `executablePrompts` array, for example:
+**Multi-step prompts** evaluate workflows composed of one or more stages.
+Steps execute one after another **inside the same directory**, but are rated individually. The tool
+takes snapshots after each step and includes them in the final report. You can create a multi-step
+prompt by passing an instance of the `MultiStepPrompt` class into the `executablePrompts` array,
+for example:

```ts
executablePrompts: [
@@ -142,34 +138,36 @@ run against it.

## Optional properties

-These properties aren't required for the environment to run, but can be used to configure it further.
+These properties aren't required for the environment to run, but can be used to configure it
+further.

### `sourceDirectory`

-Project into which the LLM-generated files will be placed, built, executed and evaluated.
-Can be an entire project or a handful of files that will be merged with the
+Directory into which the LLM-generated files are written, built, executed, and evaluated.
+Can be an entire project or a handful of files to be merged with the
`projectTemplate` ([see below](#projecttemplate))

### `projectTemplate`

Used for reducing the boilerplate when setting up an environment, `projectTemplate` specifies the
-path of the project template that will be merged together with the files from `sourceDirectory` to
-create the final project structure that the evaluation will run against.
+path of a project template to be merged together with the files from `sourceDirectory`, creating
+the final project structure against which the evaluation runs.

-For example, if the config has `projectTemplate: './templates/angular', sourceDirectory: './project'`,
-the eval runner will copy the files from `./templates/angular` into the output directory
-and then apply the files from `./project` on top of them, merging directories and replacing
+For example, if the config has
+`projectTemplate: './templates/angular', sourceDirectory: './project'`,
+the eval runner copies the files from `./templates/angular` into the output directory
+and then applies the files from `./project` on top of them, merging directories and replacing
overlapping files.

### `fullStackFramework`

-Name of the full-stack framework that is used in the evaluation, in addition to the
-`clientSideFramework`. If omitted, the `fullStackFramework` will be set to the same value as
+Name of the full-stack framework that is used in the evaluation, in addition to the
+`clientSideFramework`. If omitted, the `fullStackFramework` is set to the same value as
the `clientSideFramework`.

### `mcpServers`

-IDs of Model Context Protocol servers that will be started and exposed to the LLM as a part of
+IDs of Model Context Protocol (MCP) servers that are started and exposed to the LLM as a part of
the evaluation.

### `buildCommand`
diff --git a/docs/model-setup.md b/docs/model-setup.md
index 4648d4e..5ce305c 100644
--- a/docs/model-setup.md
+++ b/docs/model-setup.md
@@ -1,9 +1,11 @@
-# How to setup up a new LLM?
+# How to set up a new LLM?

If you want to test out a model that isn't yet available in the runner, you can add support for it
by following these steps:

-1. Ensure that the provider of the model is supported by Genkit.
-2. Find the provider for the model in `runner/codegen/genkit/providers`. 
If the provider hasn't been implemented yet, do so by creating a new `GenkitModelProvider` and adding it to the `MODEL_PROVIDERS` in `runner/genkit/models.ts`.
+1. Ensure that the provider of the model is supported by [Genkit](https://genkit.dev/).
+2. Find the provider for the model in `runner/codegen/genkit/providers`. If the provider hasn't been
+   implemented yet, do so by creating a new `GenkitModelProvider` and adding it to the
+   `MODEL_PROVIDERS` in `runner/genkit/models.ts`.
3. Add your model to the `GenkitModelProvider` configs.
4. Done! 🎉 You can now run your model by passing `--model=`.
diff --git a/package.json b/package.json
index 9d238d6..ef1edb9 100644
--- a/package.json
+++ b/package.json
@@ -11,10 +11,20 @@
    "format": "prettier --write \"runner/**/*.ts\" \"report-app/**/*.ts\" \"*.json\"",
    "check-format": "prettier --check \"runner/**/*.ts\" \"report-app/**/*.ts\" \"*.json\""
  },
-  "keywords": [],
+  "keywords": [
+    "codegen",
+    "code generation",
+    "benchmark",
+    "llm",
+    "evaluation",
+    "web",
+    "web development",
+    "code quality",
+    "prompt engineering"
+  ],
  "author": "",
  "license": "MIT",
-  "description": "",
+  "description": "Web Codegen Scorer is a tool for evaluating the quality of web code generated by Large Language Models (LLMs).",
  "type": "module",
  "bugs": {
    "url": "https://github.com/angular/web-codegen-scorer/issues"