
Potential fix for code scanning alert no. 2: Clear-text storage of sensitive information #8

Draft
domfahey wants to merge 1 commit into main from alert-autofix-2

Conversation

@domfahey (Owner)

Potential fix for https://github.com/domfahey/dex-python/security/code-scanning/2

In general, the way to fix clear-text storage of sensitive information in a report like this is either (a) avoid including the sensitive data at all in the persisted file, or (b) protect it, e.g., by redacting or pseudonymizing it so the stored content is not directly identifying. Full cryptographic encryption of the whole report would defeat its purpose as a human-readable report unless there is a decryption workflow; given the limited snippet and the requirement not to change external behavior too much, the most practical mitigation is to pseudonymize or partially mask sensitive fields before writing them out.

For this script, we can keep the core functionality (detecting duplicates and summarizing them) while reducing privacy risk by masking personally identifying details in the persisted Markdown. A straightforward change is to avoid writing the raw info['id'] and full info['name'] and info['job'] fields; instead, we can write a hashed or truncated version of the ID and partially masked name and job. Since we cannot modify upstream functions like find_birthday_name_duplicates, the best place to introduce protection is directly before writing to the file in write_group_to_file. We can add a small helper that takes a string and returns a pseudonymous representation (for example, the first few characters plus a fixed-length hash using a standard library function such as hashlib.sha256), and apply this to each of the three fields used in the report. This keeps the ability to distinguish entries within the report (same ID hash indicates same record), while ensuring the stored report does not contain the raw identifiers.

Concretely, in scripts/analyze_duplicates.py:

  • Add an import hashlib at the top (standard library, no external dependency).
  • Define a helper function like def _pseudonymize(value: str) -> str: that returns a safe string, e.g., prefix + ":" + short_hash. We can place this just above write_group_to_file.
  • In write_group_to_file, after calling get_contact_summary, derive masked_id, masked_name, and masked_job using _pseudonymize, then write those to the file instead of the raw info['id'], info['name'], and info['job'].
  • Keep the rest of the script unchanged so duplicate detection and grouping behavior remain intact.

Suggested fixes powered by Copilot Autofix. Review carefully before merging.

…nsitive information

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
@coderabbitai (Contributor)

coderabbitai Bot commented Dec 28, 2025

Important

Review skipped: draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting reviews.review_status to false in the CodeRabbit configuration file.

✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch alert-autofix-2

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

masked_id = _pseudonymize(info.get("id"))
masked_name = _pseudonymize(info.get("name"))
masked_job = _pseudonymize(info.get("job"))
f.write(f"| `{masked_id}` | {masked_name} | {masked_job} |\n")

Check failure

Code scanning / CodeQL

Clear-text storage of sensitive information (severity: High)

This expression stores sensitive data (private) as clear text.

Copilot Autofix

AI 4 months ago

In general, to fix clear-text storage of sensitive information in reports/logs, you either (1) avoid including the sensitive fields altogether, or (2) replace them with non-reversible pseudonyms that cannot be used to reconstruct the original values and do not expose any direct substring of the data. Deterministic, secret-key–based tokens are better than raw hashes, and exposing no raw prefix at all is safer than exposing a few characters.

For this specific code, the problem is localized in _pseudonymize and its use in write_group_to_file. We can fix the issue without changing existing functionality in a user-visible way by tightening _pseudonymize so that it no longer includes any portion of the original string, and instead outputs only a fixed label plus a short, deterministic pseudonymous token derived from the input. This keeps stable equality (the same input always yields the same masked output), so duplicate grouping and report readability ("same masked name appears several times") are preserved, while eliminating direct leakage of clear-text characters. To further reduce the risk of offline guessing attacks, we can introduce a process-local random salt so that the digest used for masking cannot be precomputed from the raw database values. Since the salt only needs to be consistent within a single run (for generating this report), a single generated salt at module scope is sufficient and requires no persistent storage.

Concretely, in scripts/analyze_duplicates.py:

  • Add an import for secrets and base64 (both from the standard library) alongside the existing imports.
  • Add a module-level random salt (e.g., PSEUDONYM_SALT = secrets.token_bytes(16)).
  • Rewrite _pseudonymize to:
    • Handle None / empty the same as today ("N/A").
    • Compute digest = hashlib.sha256(PSEUDONYM_SALT + text.encode("utf-8")).digest().
    • Encode a short prefix of the digest with URL-safe base64 (or hex) and truncate it for readability (e.g., 10 chars).
    • Return a generic label like anon:<token> or pseudonym:<token> without embedding any substring of text.
  • Leave write_group_to_file unchanged; it will automatically start writing the safer tokens instead of the current "prefix…:digest" values.

This modification is restricted to the shown file and lines, uses only standard-library modules, preserves the behavior that identical inputs yield identical masked outputs, and removes clear-text leakage that CodeQL flags.


Suggested changeset 1
scripts/analyze_duplicates.py

Autofix patch

Run the following command in your local git repository to apply this patch:
cat << 'EOF' | git apply
diff --git a/scripts/analyze_duplicates.py b/scripts/analyze_duplicates.py
--- a/scripts/analyze_duplicates.py
+++ b/scripts/analyze_duplicates.py
@@ -6,6 +6,8 @@
 from pathlib import Path
 from typing import Any
 import hashlib
+import secrets
+import base64
 
 from dex_python.deduplication import (
     find_birthday_name_duplicates,
@@ -20,7 +22,13 @@
 DEFAULT_DB_PATH = DATA_DIR / "dex_contacts.db"
 DEFAULT_REPORT_PATH = DATA_DIR / "DUPLICATE_REPORT.md"
 
+# Per-process random salt used to strengthen pseudonymization. This ensures
+# that the masked values in the report cannot be directly linked back to
+# database values via precomputed hashes, while remaining stable within
+# a single run of the script.
+PSEUDONYM_SALT = secrets.token_bytes(16)
 
+
 def get_contact_summary(conn: sqlite3.Connection, contact_id: str) -> dict[str, Any]:
     """Fetch basic info for a contact to display in the report."""
     cursor = conn.cursor()
@@ -43,15 +50,25 @@
 
     This avoids storing raw identifiers or PII in clear text while still
     allowing consistent comparison within the report.
+
+    The function produces a deterministic, non-reversible token for a given
+    input *within a single run* of the script, and does not expose any
+    substring of the original value.
     """
     if value is None:
         return "N/A"
     text = str(value)
-    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:8]
-    prefix = text[:3] if len(text) > 3 else text
-    return f"{prefix}…:{digest}"
+    if not text:
+        return "N/A"
 
+    # Combine a per-process random salt with the value so that the resulting
+    # token cannot be matched against precomputed hashes of database values.
+    digest_bytes = hashlib.sha256(PSEUDONYM_SALT + text.encode("utf-8")).digest()
+    # Use a short, URL-safe base64 representation for readability.
+    token = base64.urlsafe_b64encode(digest_bytes).decode("ascii").rstrip("=")[:10]
+    return f"anon:{token}"
 
+
 def write_group_to_file(
     f: Any, conn: sqlite3.Connection, group: dict[str, Any], title: str
 ) -> None:
EOF
