A Ruby gem with a C extension for high-performance regex-based redaction of sensitive data from strings.
DataRedactor scans text for sensitive patterns and replaces matches with [REDACTED]. It uses a C extension backed by POSIX regex.h so the heavy lifting happens outside the Ruby VM, making it fast enough for large payloads.
require "data_redactor"
text = "User CF is RSSMRA85M01H501Z and key is AKIAIOSFODNN7EXAMPLE"
DataRedactor.redact(text)
# => "User CF is [REDACTED] and key is [REDACTED]"
Every pattern belongs to one tag. Use only: to redact a subset, or except: to skip one.
DataRedactor.tags
# => [:credentials, :financial, :tax_id, :national_id, :contact, :network, :travel, :other]
# Only redact API keys / tokens / private keys
DataRedactor.redact(text, only: [:credentials])
# Redact everything except contact info (emails, phone numbers)
DataRedactor.redact(text, except: [:contact])
# Single symbol works too
DataRedactor.redact(text, only: :financial)
Passing an unknown tag raises DataRedactor::UnknownTagError. Passing both only: and except: raises ArgumentError.
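The filter rules above can be pictured with a small standalone sketch. It does not load the gem: TagFilterSketch, its UnknownTagError class, and resolve are hypothetical names, and only the tag list and the two error conditions are taken from the text above. How the gem resolves filters internally may differ.

```ruby
# Hypothetical stand-in for the tag filtering rules; not the gem's implementation.
module TagFilterSketch
  TAGS = %i[credentials financial tax_id national_id contact network travel other].freeze

  class UnknownTagError < StandardError; end

  # Returns the tags that would stay active for a given only:/except: pair.
  def self.resolve(only: nil, except: nil)
    raise ArgumentError, "pass only: or except:, not both" if only && except

    requested = Array(only || except) # a single symbol becomes a one-element list
    unknown = requested - TAGS
    raise UnknownTagError, "unknown tag(s): #{unknown.inspect}" unless unknown.empty?

    return TAGS.dup if requested.empty?
    only ? TAGS & requested : TAGS - requested
  end
end

TagFilterSketch.resolve(only: :financial)    # => [:financial]
TagFilterSketch.resolve(except: [:contact])  # every tag except :contact
```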
By default every match is replaced with [REDACTED]. Use the placeholder: keyword to change this:
# Plain string — any replacement text
DataRedactor.redact(text, placeholder: "***")
DataRedactor.redact(text, placeholder: "")
# Tagged — embeds the pattern's tag name so you know what was redacted
DataRedactor.redact(text, placeholder: :tagged)
# "user@example.com" → "[REDACTED:CONTACT]"
# "AKIAIOSFODNN7EXAMPLE" → "[REDACTED:CREDENTIALS]"
# "DE89370400440532013000" → "[REDACTED:FINANCIAL]"
# Hash — deterministic 4-hex suffix of the matched value
# Same value always produces the same token — useful for correlating
# redactions across log lines without leaking the original.
DataRedactor.redact(text, placeholder: :hash)
# "user@example.com" → "[CONTACT_3d7a]"
# "user@example.com" → "[CONTACT_3d7a]" (same every time)
# "other@example.com" → "[CONTACT_91fc]" (different value, different hash)
All three modes compose with the only: and except: filters:
DataRedactor.redact(text, only: :contact, placeholder: :tagged)
DataRedactor.scan returns every match alongside the redacted string, which is useful for auditing, tuning false positives, and compliance pipelines:
text = "User AKIAIOSFODNN7EXAMPLE logged in from 192.168.1.1"
result = DataRedactor.scan(text)
# => {
#   redacted: "User [REDACTED] logged in from [REDACTED]",
#   matches: [
#     { tag: :credentials, name: "aws_access_key_id", value: "AKIAIOSFODNN7EXAMPLE", start: 5, length: 20 },
#     { tag: :network, name: "ipv4", value: "192.168.1.1", start: 41, length: 11 }
#   ]
# }
# :start and :length are byte offsets into the original string
m = result[:matches].first
text.byteslice(m[:start], m[:length]) # => "AKIAIOSFODNN7EXAMPLE"
# Accepts the same tag filters as redact
DataRedactor.scan(text, only: :credentials)
DataRedactor.scan(text, except: :network)
Teams often have internal IDs that the gem can't ship. Register them at boot:
# String (POSIX ERE) or Regexp — both accepted
DataRedactor.add_pattern(name: "employee_id", regex: "EMP-[0-9]{6}")
DataRedactor.add_pattern(name: "ticket_ref", regex: /TICKET-[A-Z]{2}[0-9]{4}/, boundary: true)
# Custom patterns are tagged :custom by default; pass any built-in tag to group differently
DataRedactor.add_pattern(name: "internal_key", regex: "INT-[A-Z]{3}", tag: :credentials)
DataRedactor.redact(text) # runs all patterns including custom
DataRedactor.redact(text, only: [:custom]) # only user patterns
DataRedactor.redact(text, only: [:custom, :credentials]) # mix
DataRedactor.custom_patterns # => [{name:, source:, tag:, boundary:}, ...]
DataRedactor.remove_pattern("employee_id")
DataRedactor.clear_custom_patterns! # mostly for test suites
Regex rules: patterns must be POSIX ERE (the same engine used for built-ins). Not supported: \d, \s, \w, \b, lookahead/lookbehind, non-greedy quantifiers, named groups. Violations raise DataRedactor::InvalidPatternError at registration time, never at redaction time. Use [0-9] instead of \d, [[:space:]] instead of \s, and [[:alnum:]_] instead of \w.
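If existing patterns were written with Perl-style shorthands, a small pre-registration helper can rewrite the ones that have a direct POSIX equivalent. The sketch below is an assumption, not part of the gem: to_posix_ere and POSIX_EQUIVALENTS are hypothetical names, and the checks are deliberately naive (they do not parse escapes or bracket expressions, so \d inside [...] would be rewritten incorrectly).

```ruby
# Hypothetical helper; the gem itself only validates patterns, it does not rewrite them.
POSIX_EQUIVALENTS = {
  "\\d" => "[0-9]",
  "\\s" => "[[:space:]]",
  "\\w" => "[[:alnum:]_]"
}.freeze

# Rewrites \d, \s and \w into POSIX bracket classes.
# Raises for \b, (?...) constructs and non-greedy quantifiers, which have
# no simple ERE rewrite.
def to_posix_ere(source)
  if source.match?(/\\b|\(\?|[*+?}]\?/)
    raise ArgumentError, "no POSIX ERE equivalent for a construct in #{source.inspect}"
  end
  source.gsub(/\\[dsw]/) { |shorthand| POSIX_EQUIVALENTS.fetch(shorthand) }
end

to_posix_ere('EMP-\d{6}') # => "EMP-[0-9]{6}"
```

The returned string could then be passed as the regex: argument of add_pattern; anything the helper rejects needs a manual rewrite.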
boundary: true — wraps the pattern with (^|[^0-9A-Za-z])(PATTERN)([^0-9A-Za-z]|$) so it only fires when the token is not embedded in a longer alphanumeric string. Incompatible with patterns that contain capture groups.
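The wrapping can be tried out in plain Ruby. This is a sketch under assumptions: wrap_boundary mirrors the expansion quoted above, redact_bounded is a hypothetical helper, and Ruby's own regex engine stands in for POSIX regex.h (in Ruby, ^ and $ anchor at line boundaries).

```ruby
# Ruby rendering of the wrapper described above; the gem does this in C.
def wrap_boundary(pattern)
  "(^|[^0-9A-Za-z])(#{pattern})([^0-9A-Za-z]|$)"
end

# Replaces only group 2 and re-emits the boundary characters from groups 1 and 3.
# A capture group inside the pattern would shift these numbers, which is why
# the option is incompatible with patterns that contain groups.
def redact_bounded(text, pattern, placeholder = "[REDACTED]")
  text.gsub(Regexp.new(wrap_boundary(pattern))) { "#{$1}#{placeholder}#{$3}" }
end

redact_bounded("id 123456789 ok", "[0-9]{9}")  # => "id [REDACTED] ok"
redact_bounded("ref A1234567890", "[0-9]{9}")  # => "ref A1234567890" (embedded, left alone)
```

In this sketch the trailing boundary character is consumed by the match, so two tokens separated by a single character would need extra handling; whether the C code has the same property is not stated here.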
| # | Pattern | Example |
|---|---|---|
| 0 | AWS Access Key ID | AKIAIOSFODNN7EXAMPLE |
| 1 | AWS Secret Access Key | 40-character base64 string |
| 5 | Google API Key | AIzaSyXXXX... |
| 6 | GitHub Personal Access Token | github_pat_XXXX... |
| 7 | Slack Webhook URL | https://hooks.slack.com/services/T.../B.../... |
| 8 | Stripe Secret Key | sk_live_XXXX... |
| 9 | PEM Private Key header | -----BEGIN RSA PRIVATE KEY----- |
| 13 | Scaleway Access Key | SCW12345ABCDE6789FGHIJ |
| 14 | UUID v4 / Scaleway Secret Key | 550e8400-e29b-41d4-a716-446655440000 |
| # | Pattern | Example |
|---|---|---|
| 2 | Italian Codice Fiscale (basic) | RSSMRA85M01H501Z |
| 3 | Passport — letter prefix + digits | AB1234567 |
| 4 | Passport — 9 consecutive digits ¹ | 123456789 |
| 22 | Italian Codice Fiscale (omocodia) | RSSMRALPMNLH5LMZ |
| # | Pattern | Example |
|---|---|---|
| 11 | Credit card — Visa, Mastercard, Amex, Discover, JCB | 4111111111111111 |
| 12 | IPv4 address | 192.168.1.100 |
| # | Country | Example |
|---|---|---|
| 10 | Italy | IT60X0542811101000000123456 |
| 15 | France | FR7630006000011234567890189 |
| 16 | Germany | DE89370400440532013000 |
| 17 | Spain | ES9121000418450200051332 |
| 18 | Netherlands | NL91ABNA0417164300 |
| 19 | Belgium | BE68539007547034 |
| 20 | Portugal | PT50000201231234567890154 |
| 21 | Ireland | IE29AIBK93115212345678 |
| 28 | Sweden | SE4550000000058398257466 |
| 29 | Denmark | DK5000400440116243 |
| 30 | Norway | NO9386011117947 |
| 31 | Finland | FI2112345600000785 |
| 37 | Poland | PL61109010140000071219812874 |
| 38 | Austria | AT611904300234573201 |
| 39 | Switzerland | CH9300762011623852957 |
| 40 | Czechia | CZ6508000000192000145399 |
| 41 | Hungary | HU42117730161111101800000000 |
| 42 | Romania | RO49AAAA1B31007593840000 |
| # | Country | Type | Example |
|---|---|---|---|
| 23 | France | NIR / Social Security ¹ | 185126203450342 |
| 24 | Spain | DNI ¹ | 12345678Z |
| 25 | Spain | NIE | X1234567L |
| 26 | Netherlands | BSN ¹ | 123456789 |
| 27 | Poland | PESEL ¹ | 85121612345 |
| 32 | Belgium | National Number ¹ | 85121612345 |
| 33 | Sweden | Personnummer ¹ | 850101-1234 |
| 34 | Denmark | CPR Number ¹ | 010185-1234 |
| 35 | Norway | Fødselsnummer ¹ | 01018512345 |
| 36 | Finland | HETU ¹ | 010185-123A |
| 43 | Poland | PESEL (alt slot) ¹ | 90010112345 |
| 44 | Austria | Abgabenkontonummer ¹ | 123456789 |
| 45 | Switzerland | AHV Number ¹ | 756.1234.5678.90 |
| 46 | Czechia | Rodné číslo ¹ | 856121/1234 |
| 47 | Hungary | Tax ID ¹ | 8012345678 |
| 48 | Romania | CNP ¹ | 1850101123456 |
¹ Word-boundary protected: these patterns are wrapped with (^|[^0-9A-Za-z])(PATTERN)([^0-9A-Za-z]|$) at compile time so they do not fire when the digit sequence appears inside a longer alphanumeric token.
redactor/
├── data_redactor.gemspec
├── Gemfile
├── Rakefile
├── lib/
│   ├── data_redactor.rb          # Ruby entry point, loads the .so
│   └── data_redactor/
│       └── version.rb
├── ext/
│   └── data_redactor/
│       ├── extconf.rb            # Checks for C headers, generates Makefile
│       └── data_redactor.c       # C extension: regex compilation + redaction
└── spec/
    └── data_redactor_spec.rb     # RSpec tests (61 examples, one per pattern)
- Ruby >= 2.7
- A C compiler (gcc or clang)
- POSIX regex.h (standard on Linux and macOS)
bundle install
bundle exec rake compile
This runs extconf.rb via rake-compiler, which generates a Makefile and compiles data_redactor.c into a .so shared library placed under lib/data_redactor/.
bundle exec rake spec
Or compile and test in one step:
bundle exec rake
- At load time, Init_data_redactor compiles all 49 regex patterns once using regcomp (POSIX ERE) and stores them as static regex_t structs. Patterns marked as boundary-wrapped are expanded with wrap_boundary() before compilation.
- DataRedactor.redact(text) receives a Ruby String, converts it to a C char* via StringValueCStr, and runs each compiled pattern in sequence on a working buffer.
- For each pattern, replace_all_matches iterates using regexec, copies non-matching segments to a fresh output buffer, and inserts [REDACTED] in place of each match. For boundary-wrapped patterns, regexec is called with nmatch=4 and sub-match groups [1]/[3] identify the boundary characters so they are preserved verbatim.
- The output buffer is grown with realloc as needed. After all patterns are applied the result is returned as a Ruby String via rb_str_new_cstr. All intermediate malloc/strdup allocations are explicitly freed.
All C-side buffers are heap-allocated with malloc/strdup and freed before the function returns. The only Ruby-managed allocation is the final return value from rb_str_new_cstr. No Ruby objects are created mid-processing, so GC cannot collect anything out from under the C code.
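The copy-and-replace loop described above can be transliterated into Ruby to make the control flow easier to follow. This is an illustrative sketch, not the extension's code: the real loop is C built on regexec, the two regexes in the usage lines are simplified stand-ins rather than the gem's sources, and redact_sequentially is a hypothetical name.

```ruby
# Ruby transliteration of the loop; the real implementation is C using regexec.
def replace_all_matches(input, regex, placeholder = "[REDACTED]")
  output = +""  # fresh output buffer (malloc'd and grown with realloc in C)
  pos = 0
  while pos <= input.length && (m = regex.match(input, pos))
    output << input[pos...m.begin(0)]  # copy the non-matching segment
    output << placeholder
    if m.end(0) == m.begin(0)
      # Zero-length match: copy one character so the loop always advances.
      output << input[m.end(0)].to_s
      pos = m.end(0) + 1
    else
      pos = m.end(0)
    end
  end
  output << input[pos..].to_s  # copy the tail after the last match
end

# Runs each pattern in sequence on a working buffer.
def redact_sequentially(text, patterns)
  patterns.reduce(text) { |buffer, re| replace_all_matches(buffer, re) }
end

patterns = [/AKIA[0-9A-Z]{16}/, /[0-9]{1,3}(\.[0-9]{1,3}){3}/]
redact_sequentially("User AKIAIOSFODNN7EXAMPLE logged in from 192.168.1.1", patterns)
# => "User [REDACTED] logged in from [REDACTED]"
```

Ruby manages the buffers here, so the explicit free calls of the C version have no counterpart in the sketch.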
This project follows Semantic Versioning 2.0.0. Until 1.0.0, minor versions may introduce breaking changes; from 1.0.0 onward, breaking changes will only land in major versions. See CHANGELOG.md for the release history.
Released under the MIT License.
- Pattern ordering matters — patterns run sequentially. An early broad pattern (e.g. the 9-digit passport) may consume digits that a later pattern (e.g. credit card) depends on. Boundary wrapping mitigates this for pure-digit patterns.
- AWS Secret Key (pattern 1) — 40 consecutive base64 characters is a broad match. It can produce false positives in base64-encoded content such as embedded images or binary blobs.
- Duplicate digit patterns — several national ID formats share the same digit-length (11 digits: PESEL, Norwegian Fødselsnummer, Belgian National Number). They are kept as separate slots for clarity but the practical effect is that any 11-digit boundary-delimited number will be redacted.
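The ordering limitation is easy to reproduce outside the gem. The sketch below uses illustrative pattern sources (NINE_DIGITS, BOUNDED_NINE and CARD are simplified assumptions, not the gem's actual regexes) and a hypothetical apply helper that runs patterns one after another, the way the extension is described as doing.

```ruby
# Illustrative pattern sources only; not the gem's built-in regexes.
NINE_DIGITS  = "[0-9]{9}"
BOUNDED_NINE = "(^|[^0-9A-Za-z])([0-9]{9})([^0-9A-Za-z]|$)"
CARD         = "4[0-9]{15}"

# Applies each source in order. Boundary-wrapped sources have three groups,
# so the boundary characters in groups 1 and 3 are put back around the placeholder.
def apply(text, sources)
  sources.reduce(text) do |buffer, src|
    buffer.gsub(Regexp.new(src)) { $~.size == 4 ? "#{$1}[REDACTED]#{$3}" : "[REDACTED]" }
  end
end

text = "card 4111111111111111"

apply(text, [NINE_DIGITS, CARD])
# => "card [REDACTED]1111111"  (the 9-digit pattern ate the first half; 7 digits leak)

apply(text, [BOUNDED_NINE, CARD])
# => "card [REDACTED]"         (the wrapped pattern skips the 16-digit run)
```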