DataRedactor

A Ruby gem with a C extension for high-performance regex-based redaction of sensitive data from strings.

What it does

DataRedactor scans text for sensitive patterns and replaces matches with [REDACTED]. It uses a C extension backed by POSIX regex.h so the heavy lifting happens outside the Ruby VM, making it fast enough for large payloads.

Usage

require "data_redactor"

text = "User CF is RSSMRA85M01H501Z and key is AKIAIOSFODNN7EXAMPLE"
DataRedactor.redact(text)
# => "User CF is [REDACTED] and key is [REDACTED]"

Filtering by tag

Every pattern belongs to one tag. Use only: to redact a subset, or except: to skip one.

DataRedactor.tags
# => [:credentials, :financial, :tax_id, :national_id, :contact, :network, :travel, :other]

# Only redact API keys / tokens / private keys
DataRedactor.redact(text, only: [:credentials])

# Redact everything except contact info (emails, phone numbers)
DataRedactor.redact(text, except: [:contact])

# Single symbol works too
DataRedactor.redact(text, only: :financial)

Passing an unknown tag raises DataRedactor::UnknownTagError. Passing both only: and except: raises ArgumentError.

Configurable placeholder

By default every match is replaced with [REDACTED]. Use the placeholder: keyword to change this:

# Plain string — any replacement text
DataRedactor.redact(text, placeholder: "***")
DataRedactor.redact(text, placeholder: "")

# Tagged — embeds the pattern's tag name so you know what was redacted
DataRedactor.redact(text, placeholder: :tagged)
# "user@example.com"  → "[REDACTED:CONTACT]"
# "AKIAIOSFODNN7EXAMPLE" → "[REDACTED:CREDENTIALS]"
# "DE89370400440532013000" → "[REDACTED:FINANCIAL]"

# Hash — deterministic 4-hex suffix of the matched value
# Same value always produces the same token — useful for correlating
# redactions across log lines without leaking the original.
DataRedactor.redact(text, placeholder: :hash)
# "user@example.com"  → "[CONTACT_3d7a]"
# "user@example.com"  → "[CONTACT_3d7a]"  (same every time)
# "other@example.com" → "[CONTACT_91fc]"  (different value, different hash)

All three modes compose with only: and except::

DataRedactor.redact(text, only: :contact, placeholder: :tagged)

Scan / dry-run mode

DataRedactor.scan returns every match alongside the redacted string — useful for auditing, tuning false positives, and compliance pipelines:

result = DataRedactor.scan("User AKIAIOSFODNN7EXAMPLE logged in from 192.168.1.1")
# => {
#   redacted: "User [REDACTED] logged in from [REDACTED]",
#   matches: [
#     { tag: :credentials, name: "aws_access_key_id", value: "AKIAIOSFODNN7EXAMPLE", start: 5,  length: 20 },
#     { tag: :network,     name: "ipv4",              value: "192.168.1.1",          start: 35, length: 11 }
#   ]
# }

# :start and :length are byte offsets into the original string
m = result[:matches].first
original_text.byteslice(m[:start], m[:length])  # => "AKIAIOSFODNN7EXAMPLE"

# Accepts the same tag filters as redact
DataRedactor.scan(text, only: :credentials)
DataRedactor.scan(text, except: :network)

Custom patterns

Teams often have internal IDs that the gem can't ship. Register them at boot:

# String (POSIX ERE) or Regexp — both accepted
DataRedactor.add_pattern(name: "employee_id", regex: "EMP-[0-9]{6}")
DataRedactor.add_pattern(name: "ticket_ref",  regex: /TICKET-[A-Z]{2}[0-9]{4}/, boundary: true)

# Custom patterns are tagged :custom by default; pass any built-in tag to group differently
DataRedactor.add_pattern(name: "internal_key", regex: "INT-[A-Z]{3}", tag: :credentials)

DataRedactor.redact(text)                         # runs all patterns including custom
DataRedactor.redact(text, only: [:custom])         # only user patterns
DataRedactor.redact(text, only: [:custom, :credentials]) # mix

DataRedactor.custom_patterns   # => [{name:, source:, tag:, boundary:}, ...]
DataRedactor.remove_pattern("employee_id")
DataRedactor.clear_custom_patterns!               # mostly for test suites

Regex rules — patterns must be POSIX ERE (the same engine used for built-ins). Not supported: \d, \s, \w, \b, lookahead/lookbehind, non-greedy quantifiers, named groups. Violations raise DataRedactor::InvalidPatternError at registration time, never at redaction time. Use [0-9] instead of \d, [[:space:]] instead of \s, etc.

boundary: true — wraps the pattern with (^|[^0-9A-Za-z])(PATTERN)([^0-9A-Za-z]|$) so it only fires when the token is not embedded in a longer alphanumeric string. Incompatible with patterns that contain capture groups.

Detected patterns (49 total)

Cloud & API secrets

#	Pattern	Example
0	AWS Access Key ID	`AKIAIOSFODNN7EXAMPLE`
1	AWS Secret Access Key	40-character base64 string
5	Google API Key	`AIzaSyXXXX...`
6	GitHub Personal Access Token	`github_pat_XXXX...`
7	Slack Webhook URL	`https://hooks.slack.com/services/T.../B.../...`
8	Stripe Secret Key	`sk_live_XXXX...`
9	PEM Private Key header	`-----BEGIN RSA PRIVATE KEY-----`
13	Scaleway Access Key	`SCW12345ABCDE6789FGHIJ`
14	UUID v4 / Scaleway Secret Key	`550e8400-e29b-41d4-a716-446655440000`

Travel documents

#	Pattern	Example
2	Italian Codice Fiscale (basic)	`RSSMRA85M01H501Z`
3	Passport — letter prefix + digits	`AB1234567`
4	Passport — 9 consecutive digits ¹	`123456789`
22	Italian Codice Fiscale (omocodia)	`RSSMRALPMNLH5LMZ`

Payment & network

#	Pattern	Example
11	Credit card — Visa, Mastercard, Amex, Discover, JCB	`4111111111111111`
12	IPv4 address	`192.168.1.100`

IBANs

#	Country	Example
10	Italy	`IT60X0542811101000000123456`
15	France	`FR7630006000011234567890189`
16	Germany	`DE89370400440532013000`
17	Spain	`ES9121000418450200051332`
18	Netherlands	`NL91ABNA0417164300`
19	Belgium	`BE68539007547034`
20	Portugal	`PT50000201231234567890154`
21	Ireland	`IE29AIBK93115212345678`
28	Sweden	`SE4550000000058398257466`
29	Denmark	`DK5000400440116243`
30	Norway	`NO9386011117947`
31	Finland	`FI2112345600000785`
37	Poland	`PL61109010140000071219812874`
38	Austria	`AT611904300234573201`
39	Switzerland	`CH9300762011623852957`
40	Czechia	`CZ6508000000192000145399`
41	Hungary	`HU42117730161111101800000000`
42	Romania	`RO49AAAA1B31007593840000`

National personal identifiers

#	Country	Type	Example
23	France	NIR / Social Security ¹	`185126203450342`
24	Spain	DNI ¹	`12345678Z`
25	Spain	NIE	`X1234567L`
26	Netherlands	BSN ¹	`123456789`
27	Poland	PESEL ¹	`85121612345`
32	Belgium	National Number ¹	`85121612345`
33	Sweden	Personnummer ¹	`850101-1234`
34	Denmark	CPR Number ¹	`010185-1234`
35	Norway	Fødselsnummer ¹	`01018512345`
36	Finland	HETU ¹	`010185-123A`
43	Poland	PESEL (alt slot) ¹	`90010112345`
44	Austria	Abgabenkontonummer ¹	`123456789`
45	Switzerland	AHV Number ¹	`756.1234.5678.90`
46	Czechia	Rodné číslo ¹	`856121/1234`
47	Hungary	Tax ID ¹	`8012345678`
48	Romania	CNP ¹	`1850101123456`

¹ Word-boundary protected — these patterns are wrapped with (^|[^0-9A-Za-z])(PATTERN)([^0-9A-Za-z]|$) at compile time so they do not fire when the digit sequence appears inside a longer alphanumeric token.

Directory structure

redactor/
├── data_redactor.gemspec
├── Gemfile
├── Rakefile
├── lib/
│   ├── data_redactor.rb          # Ruby entry point, loads the .so
│   └── data_redactor/
│       └── version.rb
├── ext/
│   └── data_redactor/
│       ├── extconf.rb         # Checks for C headers, generates Makefile
│       └── data_redactor.c       # C extension: regex compilation + redaction
└── spec/
    └── data_redactor_spec.rb     # RSpec tests (61 examples, one per pattern)

Requirements

Ruby >= 2.7
A C compiler (gcc or clang)
POSIX regex.h (standard on Linux and macOS)

Setup

bundle install

Compile the C extension

bundle exec rake compile

This runs extconf.rb via rake-compiler, which generates a Makefile and compiles data_redactor.c into a .so shared library placed under lib/data_redactor/.

Run the tests

bundle exec rake spec

Or compile and test in one step:

bundle exec rake

How it works

At load time, Init_data_redactor compiles all 49 regex patterns once using regcomp (POSIX ERE) and stores them as static regex_t structs. Patterns marked as boundary-wrapped are expanded with wrap_boundary() before compilation.
DataRedactor.redact(text) receives a Ruby String, converts it to a C char* via StringValueCStr, and runs each compiled pattern in sequence on a working buffer.
For each pattern, replace_all_matches iterates using regexec, copies non-matching segments to a fresh output buffer, and inserts [REDACTED] in place of each match. For boundary-wrapped patterns, regexec is called with nmatch=4 and sub-match groups [1]/[3] identify the boundary characters so they are preserved verbatim.
The output buffer is grown with realloc as needed. After all patterns are applied the result is returned as a Ruby String via rb_str_new_cstr. All intermediate malloc/strdup allocations are explicitly freed.

Memory management

All C-side buffers are heap-allocated with malloc/strdup and freed before the function returns. The only Ruby-managed allocation is the final return value from rb_str_new_cstr. No Ruby objects are created mid-processing, so GC cannot collect anything out from under the C code.

Versioning

This project follows Semantic Versioning 2.0.0. Until 1.0.0, minor versions may introduce breaking changes; from 1.0.0 onward, breaking changes will only land in major versions. See CHANGELOG.md for the release history.

License

Released under the MIT License.

Known limitations

Pattern ordering matters — patterns run sequentially. An early broad pattern (e.g. the 9-digit passport) may consume digits that a later pattern (e.g. credit card) depends on. Boundary wrapping mitigates this for pure-digit patterns.
AWS Secret Key (pattern 1) — 40 consecutive base64 characters is a broad match. It can produce false positives in base64-encoded content such as embedded images or binary blobs.
Duplicate digit patterns — several national ID formats share the same digit-length (11 digits: PESEL, Norwegian Fødselsnummer, Belgian National Number). They are kept as separate slots for clarity but the practical effect is that any 11-digit boundary-delimited number will be redacted.

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
.github/workflows		.github/workflows
ext/data_redactor		ext/data_redactor
lib		lib
spec		spec
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
Gemfile		Gemfile
Gemfile.lock		Gemfile.lock
LICENSE		LICENSE
Rakefile		Rakefile
TODO.md		TODO.md
data_redactor.gemspec		data_redactor.gemspec
readme.md		readme.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DataRedactor

What it does

Usage

Filtering by tag

Configurable placeholder

Scan / dry-run mode

Custom patterns

Detected patterns (49 total)

Cloud & API secrets

Travel documents

Payment & network

IBANs

National personal identifiers

Directory structure

Requirements

Setup

Compile the C extension

Run the tests

How it works

Memory management

Versioning

License

Known limitations

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

DataRedactor

What it does

Usage

Filtering by tag

Configurable placeholder

Scan / dry-run mode

Custom patterns

Detected patterns (49 total)

Cloud & API secrets

Travel documents

Payment & network

IBANs

National personal identifiers

Directory structure

Requirements

Setup

Compile the C extension

Run the tests

How it works

Memory management

Versioning

License

Known limitations

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages