
facebookexperimental/semcode

Semcode - Semantic Code Search

Semcode is a semantic code search tool for C/C++ codebases. It indexes your code and lets you search for functions, types, and code patterns using both exact matches and semantic similarity.

Features

  • Fast indexing of C/C++ codebases using Tree-sitter
  • Interactive query interface with comprehensive command set
  • Call graph analysis with forward/reverse traversal
  • Type and macro discovery with detailed structural information
  • Diff analysis for understanding code changes and their impact
  • Pattern matching
  • MCP server for integration with AI code tools
  • Git integration for incremental scans of new commits

While semcode provides both a query tool and an MCP server, the primary use case is via the MCP server. It gives AI code tools the ability to quickly find context about the kernel, and generally makes them more effective.

The MCP server can also be used by Kernel AI Review Prompts

Future features

  • Semantic search -- exists today but largely untested
  • Git commit history indexing for finding future and past versions of functions/types
  • lore.kernel.org index via git for searching mailing list archives

Quick Start

Dependencies

Install protobuf-devel from your distro

**Rust:**
Install from [rustup.rs](https://rustup.rs/)

### Build

```bash
# Clone and build
git clone <repository-url>
cd semcode
cargo build --release
```

Binaries end up in target/release.

Basic Usage

# Index a codebase

cd linux
semcode-index -s .

This assumes you have a Linux kernel git repository. The semcode database is
written to linux/.semcode.db, inside the source directory.

# Start interactive query tool
semcode

You can index git ranges as well:

semcode-index -s . --git v6.14..v6.15

Once a git range is indexed, you can either pass the --git arguments to individual commands in the query tool, or git checkout some_sha and run the query tool. In the second case it picks up the current HEAD and returns results against it.

Type help in the interactive shell for complete command documentation. Here are the most common commands:

Function and Macro Search:

func truncates the list of calls/callers by default; -v gives you everything.

semcode> func printk                      # Find function by name
semcode> f EXPORT_SYMBOL                  # Find macro (short form)
semcode> function mutex_lock              # Find function (long form)
semcode> function btrfs_search.*          # Find function regex

Type and Typedef Search:

semcode> type struct task_struct          # Find struct by name
semcode> ty size_t                        # Find typedef (short form)
semcode> type pthread_mutex_t             # Find type by name

Call Graph Analysis:

These commands all accept -v to show paths and git file SHAs.

semcode> callers mutex_lock               # Show what calls mutex_lock
semcode> calls schedule                   # Show what schedule calls
semcode> callchain kmalloc                # Show complete call graph

Semantic Search (requires vectors):

semcode> vgrep btrfs search slot

Data Export:

semcode> dump-functions functions.json    # Export all functions
semcode> df funcs.json                    # Export functions (short form)
semcode> dump-types types.json            # Export all types
semcode> dt types.json                    # Export types (short form)
semcode> dump-typedefs typedefs.json      # Export all typedefs
semcode> dump-macros macros.json          # Export all macros
semcode> tables                           # Show available data tables

General Commands:

semcode> help                             # Show complete help with examples
semcode> h                                # Show help (short form)
semcode> quit                             # Exit the program
semcode> q                                # Exit (short form)

More Usage guides

Setting up semcode with claude: docs/claude-semcode-setup.md

Doing patch review in the kernel: docs/claude-patch-review.md

Data Storage

Semcode uses LanceDB to store:

  • Functions with signatures, bodies, call relationships, and optional embeddings
  • Types (structs, unions, enums) with field information
  • Typedefs with underlying type mappings
  • Macros (function-like only, for better signal-to-noise ratio)

Configuration

Performance Tuning

Use the -j flag to control parallelism. The default is to try to saturate your CPUs.

Proxy Support

The model setup script honors standard proxy environment variables:

export HTTP_PROXY=http://proxy.company.com:8080
export HTTPS_PROXY=http://proxy.company.com:8080
python scripts/direct_download.py
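
To check which proxy settings a Python process would see after those exports, a small sketch like the one below can help. It uses only the standard library and is not taken from scripts/direct_download.py; the helper name `proxy_for` is hypothetical, and the download script may resolve proxies through a different library.

```python
import os
import urllib.request


def proxy_for(scheme):
    """Return the proxy URL the standard library would use for a scheme, or None."""
    # getproxies_environment() scans <scheme>_proxy variables, upper or lower case.
    return urllib.request.getproxies_environment().get(scheme)


if __name__ == "__main__":
    for scheme in ("http", "https"):
        print(scheme, "->", proxy_for(scheme))
```

If a scheme maps to None, the corresponding variable is not visible to the process, which usually means the export happened in a different shell.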

MCP (Model Context Protocol) Server

Semcode includes an MCP server that exposes its query functionality to AI agents like Claude.

The MCP server provides these tools:

Basic Search:

  • find_function - Find functions and macros by exact name
  • find_type - Find types, structs, and typedefs by exact name

Call Graph Analysis:

  • find_callers - Find all functions that call a specific function
  • find_callees - Find all functions called by a specific function
  • find_callchain - Show complete call chain for a function
  • diff_functions - Extract and list functions from a unified diff
  • grep_functions - Regex search through function bodies
  • vgrep_functions - Vector search through function bodies

MCP Configuration

Check your Claude documentation for the details, but semcode is set up for one semcode-mcp server per Claude instance.

cd linux
semcode-index -s .
claude --mcp-config mcp-config.json

func btrfs_search_slot

mcp-config.json:

{
  "mcpServers": {
    "semcode": {
      "command": "/some_path/semcode-mcp"
    }
  }
}

See examples/mcp-config.json for an example file

Verify MCP Tools Are Available

Ask Claude to list available tools:

User: "What semcode tools do you have access to?"
User: "What semcode-myproject tools are available?" # For specific server

Configuration Options

Required:

  • command: Absolute path to semcode-mcp binary

Optional:

  • args: --database with the path to your indexed database, or --git for the repo
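
As a minimal sketch of how the required and optional keys might combine, the Python snippet below generates a config file body. It assumes args is a list of separate argument strings inside the same server entry, which is the common MCP client convention but is not confirmed by this README; `build_mcp_config` is a hypothetical helper and both paths are placeholders.

```python
import json


def build_mcp_config(command, database=None, server_name="semcode"):
    """Build an mcp-config.json structure with the required command and optional args."""
    server = {"command": command}
    if database is not None:
        # Assumption: args is a list of separate argv strings.
        server["args"] = ["--database", database]
    return {"mcpServers": {server_name: server}}


if __name__ == "__main__":
    config = build_mcp_config("/some_path/semcode-mcp", database="/path/to/your.db")
    print(json.dumps(config, indent=2))
```

Leaving database unset produces the same structure as the mcp-config.json shown earlier, with only the command key.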

Manual Testing

You can test the MCP server outside of Claude:

# Start the MCP server manually
./bin/semcode-mcp --database /path/to/your.db

# It will wait for JSON-RPC input on stdin
# Press Ctrl+C to exit
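
As a rough illustration of what that stdin input could look like, the Python sketch below builds the three messages an MCP client typically sends first (initialize, the initialized notification, and tools/list), one JSON object per line. The protocol version string, the client name, and the helper name `handshake_lines` are assumptions for illustration and are not taken from semcode; check the MCP specification and semcode-mcp itself for what is actually accepted.

```python
import json

# Assumed MCP protocol revision; semcode-mcp may expect a different one.
PROTOCOL_VERSION = "2024-11-05"


def handshake_lines(client_name="manual-test"):
    """Return newline-delimited JSON-RPC 2.0 messages for a minimal MCP session."""
    messages = [
        {
            "jsonrpc": "2.0",
            "id": 1,
            "method": "initialize",
            "params": {
                "protocolVersion": PROTOCOL_VERSION,
                "capabilities": {},
                "clientInfo": {"name": client_name, "version": "0.0.1"},
            },
        },
        # Notifications carry no "id" because no response is expected.
        {"jsonrpc": "2.0", "method": "notifications/initialized"},
        {"jsonrpc": "2.0", "id": 2, "method": "tools/list"},
    ]
    return [json.dumps(m) for m in messages]


if __name__ == "__main__":
    for line in handshake_lines():
        print(line)
```

Saved under a hypothetical name such as handshake.py, the output could be piped into the manually started server. Whether a reply arrives before stdin closes depends on the server implementation.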

Security Notes:

  • The MCP server operates in read-only mode
  • It only accesses the pre-indexed database files
  • All queries are logged to stderr for debugging

Model Setup (for vector search)

Note: this is mostly untested, and not required for general usage. I've been trying nomic v2 and running it through model2vec to make things faster without a GPU. There are scripts:

scripts/direct_download.py
scripts/nomic2vec.py

Take the resulting model and move it to ~/.cache/semcode/model2vec.

Then do an index run: semcode-index -s .

Then do a vector run: semcode-index -s . --vectors

Then you can do vector searches with the vgrep command either in the query tool or MCP.

Model Storage

This is only relevant for vector searching

Models are stored in: ~/.cache/semcode/models/

License

Licensed under either of

  • Apache License, Version 2.0 (LICENSE-APACHE)
  • MIT license (LICENSE-MIT)

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.
