Potential fix for code scanning alert no. 2: Clear-text storage of sensitive information #8
Conversation
…nsitive information
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
CodeRabbit: Review skipped — draft detected. Please check the settings in the CodeRabbit UI; the status message can be disabled there. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
```python
masked_id = _pseudonymize(info.get("id"))
masked_name = _pseudonymize(info.get("name"))
masked_job = _pseudonymize(info.get("job"))
f.write(f"| `{masked_id}` | {masked_name} | {masked_job} |\n")
```
Check failure — Code scanning / CodeQL: Clear-text storage of sensitive information (High)

Copilot Autofix (AI · 4 months ago):
In general, to fix clear-text storage of sensitive information in reports/logs, you either (1) avoid including the sensitive fields altogether, or (2) replace them with non-reversible pseudonyms that cannot be used to reconstruct the original values and do not expose any direct substring of the data. Deterministic, secret-key–based tokens are better than raw hashes, and exposing no raw prefix at all is safer than exposing a few characters.
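To make the "deterministic, secret-key–based tokens" idea concrete, here is a minimal sketch (not part of the patch below; the names `PSEUDONYM_KEY` and `keyed_token` are illustrative) using HMAC-SHA256 under a per-run secret key, so tokens are stable within a run but cannot be precomputed without the key:

```python
import hashlib
import hmac
import secrets

# Hypothetical per-run key for illustration; the actual fix below uses a
# salt-prefix construction instead, but the keyed-token idea is the same.
PSEUDONYM_KEY = secrets.token_bytes(16)

def keyed_token(value: str) -> str:
    # HMAC-SHA256 under a secret key: deterministic within this process,
    # non-reversible, and exposes no substring of the original value.
    digest = hmac.new(PSEUDONYM_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"anon:{digest[:10]}"
```

Identical inputs map to identical tokens, so duplicate grouping in the report still works, while the raw value never appears in the output.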
For this specific code, the problem is localized in _pseudonymize and its use in write_group_to_file. We can fix the issue without changing existing functionality in a user-visible way by tightening _pseudonymize so that it no longer includes any portion of the original string, and only outputs a fixed label plus a short, deterministic pseudonymous token derived from the input. This keeps stable equality (the same input always yields the same masked output), so duplicate grouping and report readability (“same masked name appears several times”) are preserved, while eliminating direct leakage of clear-text characters. To further reduce the risk of offline guessing attacks, we can introduce a process-local random salt, so that the digest used for masking cannot be precomputed from the raw database values; since the salt only needs to be consistent within a single run (for generating this report), a single generated salt in the module scope is sufficient and does not require persistent storage.
Concretely, in scripts/analyze_duplicates.py:

- Add imports for `secrets` and `base64` (both from the standard library) alongside the existing imports.
- Add a module-level random salt (e.g., `PSEUDONYM_SALT = secrets.token_bytes(16)`).
- Rewrite `_pseudonymize` to:
  - Handle `None` / empty the same as today (`"N/A"`).
  - Compute `digest = hashlib.sha256(PSEUDONYM_SALT + text.encode("utf-8")).digest()`.
  - Encode a short prefix of the digest with URL-safe base64 (or hex) and truncate it for readability (e.g., 10 chars).
  - Return a generic label like `anon:<token>` or `pseudonym:<token>` without embedding any substring of `text`.
- Leave `write_group_to_file` unchanged; it will automatically start writing the safer tokens instead of the current `"prefix…:digest"` values.
This modification is restricted to the shown file and lines, uses only standard-library modules, preserves the behavior that identical inputs yield identical masked outputs, and removes clear-text leakage that CodeQL flags.
```diff
@@ -6,6 +6,8 @@
 from pathlib import Path
 from typing import Any
 import hashlib
+import secrets
+import base64
 
 from dex_python.deduplication import (
     find_birthday_name_duplicates,
@@ -20,7 +22,13 @@
 DEFAULT_DB_PATH = DATA_DIR / "dex_contacts.db"
 DEFAULT_REPORT_PATH = DATA_DIR / "DUPLICATE_REPORT.md"
 
+# Per-process random salt used to strengthen pseudonymization. This ensures
+# that the masked values in the report cannot be directly linked back to
+# database values via precomputed hashes, while remaining stable within
+# a single run of the script.
+PSEUDONYM_SALT = secrets.token_bytes(16)
+
 
 def get_contact_summary(conn: sqlite3.Connection, contact_id: str) -> dict[str, Any]:
     """Fetch basic info for a contact to display in the report."""
     cursor = conn.cursor()
@@ -43,15 +50,25 @@
     This avoids storing raw identifiers or PII in clear text while still
     allowing consistent comparison within the report.
+
+    The function produces a deterministic, non-reversible token for a given
+    input *within a single run* of the script, and does not expose any
+    substring of the original value.
     """
     if value is None:
         return "N/A"
     text = str(value)
-    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:8]
-    prefix = text[:3] if len(text) > 3 else text
-    return f"{prefix}…:{digest}"
+    if not text:
+        return "N/A"
+
+    # Combine a per-process random salt with the value so that the resulting
+    # token cannot be matched against precomputed hashes of database values.
+    digest_bytes = hashlib.sha256(PSEUDONYM_SALT + text.encode("utf-8")).digest()
+    # Use a short, URL-safe base64 representation for readability.
+    token = base64.urlsafe_b64encode(digest_bytes).decode("ascii").rstrip("=")[:10]
+    return f"anon:{token}"
 
 
 def write_group_to_file(
     f: Any, conn: sqlite3.Connection, group: dict[str, Any], title: str
 ) -> None:
```
Potential fix for https://github.com/domfahey/dex-python/security/code-scanning/2
In general, the way to fix clear-text storage of sensitive information in a report like this is either (a) avoid including the sensitive data at all in the persisted file, or (b) protect it, e.g., by redacting or pseudonymizing it so the stored content is not directly identifying. Full cryptographic encryption of the whole report would defeat its purpose as a human-readable report unless there is a decryption workflow; given the limited snippet and the requirement not to change external behavior too much, the most practical mitigation is to pseudonymize or partially mask sensitive fields before writing them out.
For this script, we can keep the core functionality (detecting duplicates and summarizing them) while reducing privacy risk by masking personally identifying details in the persisted Markdown. A straightforward change is to avoid writing the raw `info['id']` and full `info['name']` and `info['job']` fields; instead, we can write a hashed or truncated version of the ID and partially masked name and job. Since we cannot modify upstream functions like `find_birthday_name_duplicates`, the best place to introduce protection is directly before writing to the file in `write_group_to_file`. We can add a small helper that takes a string and returns a pseudonymous representation (for example, the first few characters plus a fixed-length hash using a standard-library function such as `hashlib.sha256`), and apply this to each of the three fields used in the report. This keeps the ability to distinguish entries within the report (same ID hash indicates same record), while ensuring the stored report does not contain the raw identifiers.

Concretely, in scripts/analyze_duplicates.py:

- `import hashlib` at the top (standard library, no external dependency).
- Add `def _pseudonymize(value: str) -> str:` that returns a safe string, e.g., `prefix + ":" + short_hash`, placed just above `write_group_to_file`.
- In `write_group_to_file`, after calling `get_contact_summary`, derive `masked_id`, `masked_name`, and `masked_job` using `_pseudonymize`, then write those to the file instead of the raw `info['id']`, `info['name']`, and `info['job']`.

Suggested fixes powered by Copilot Autofix. Review carefully before merging.