
Potential fix for code scanning alert no. 2: Clear-text storage of sensitive information #8

Draft
domfahey wants to merge 1 commit into main from alert-autofix-2

Conversation

@domfahey (Owner)

Potential fix for https://github.com/domfahey/dex-python/security/code-scanning/2

In general, the way to fix clear-text storage of sensitive information in a report like this is either (a) avoid including the sensitive data at all in the persisted file, or (b) protect it, e.g., by redacting or pseudonymizing it so the stored content is not directly identifying. Full cryptographic encryption of the whole report would defeat its purpose as a human-readable report unless there is a decryption workflow; given the limited snippet and the requirement not to change external behavior too much, the most practical mitigation is to pseudonymize or partially mask sensitive fields before writing them out.

For this script, we can keep the core functionality (detecting duplicates and summarizing them) while reducing privacy risk by masking personally identifying details in the persisted Markdown. A straightforward change is to avoid writing the raw info['id'] and full info['name'] and info['job'] fields; instead, we can write a hashed or truncated version of the ID and partially masked name and job. Since we cannot modify upstream functions like find_birthday_name_duplicates, the best place to introduce protection is directly before writing to the file in write_group_to_file. We can add a small helper that takes a string and returns a pseudonymous representation (for example, the first few characters plus a fixed-length hash using a standard library function such as hashlib.sha256), and apply this to each of the three fields used in the report. This keeps the ability to distinguish entries within the report (same ID hash indicates same record), while ensuring the stored report does not contain the raw identifiers.

Concretely, in scripts/analyze_duplicates.py:

  • Add an import hashlib at the top (standard library, no external dependency).
  • Define a helper function like def _pseudonymize(value: str) -> str: that returns a safe string, e.g., prefix + ":" + short_hash. We can place this just above write_group_to_file.
  • In write_group_to_file, after calling get_contact_summary, derive masked_id, masked_name, and masked_job using _pseudonymize, then write those to the file instead of the raw info['id'], info['name'], and info['job'].
  • Keep the rest of the script unchanged so duplicate detection and grouping behavior remain intact.

Suggested fixes powered by Copilot Autofix. Review carefully before merging.

…nsitive information

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
@coderabbitai (Contributor)

coderabbitai Bot commented Dec 28, 2025

Important

Review skipped: draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting reviews.review_status to false in the CodeRabbit configuration file.

✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch alert-autofix-2

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

masked_id = _pseudonymize(info.get("id"))
masked_name = _pseudonymize(info.get("name"))
masked_job = _pseudonymize(info.get("job"))
f.write(f"| `{masked_id}` | {masked_name} | {masked_job} |\n")

Check failure

Code scanning / CodeQL

Clear-text storage of sensitive information (severity: High)

This expression stores sensitive data (private) as clear text.

Copilot Autofix

AI 4 months ago

In general, to fix clear-text storage of sensitive information in reports/logs, you either (1) avoid including the sensitive fields altogether, or (2) replace them with non-reversible pseudonyms that cannot be used to reconstruct the original values and do not expose any direct substring of the data. Deterministic, secret-key–based tokens are better than raw hashes, and exposing no raw prefix at all is safer than exposing a few characters.

For this specific code, the problem is localized in _pseudonymize and its use in write_group_to_file. We can fix the issue without changing existing functionality in a user-visible way by tightening _pseudonymize so that it no longer includes any portion of the original string, and instead outputs only a fixed label plus a short, deterministic pseudonymous token derived from the input. This keeps stable equality (the same input always yields the same masked output), so duplicate grouping and report readability ("same masked name appears several times") are preserved, while eliminating direct leakage of clear-text characters. To further reduce the risk of offline guessing attacks, we can introduce a process-local random salt so that the digest used for masking cannot be precomputed from the raw database values. Since the salt only needs to be consistent within a single run (for generating this report), a single generated salt at module scope is sufficient and requires no persistent storage.

Concretely, in scripts/analyze_duplicates.py:

  • Add an import for secrets and base64 (both from the standard library) alongside the existing imports.
  • Add a module-level random salt (e.g., PSEUDONYM_SALT = secrets.token_bytes(16)).
  • Rewrite _pseudonymize to:
    • Handle None / empty the same as today ("N/A").
    • Compute digest = hashlib.sha256(PSEUDONYM_SALT + text.encode("utf-8")).digest().
    • Encode a short prefix of the digest with URL-safe base64 (or hex) and truncate it for readability (e.g., 10 chars).
    • Return a generic label like anon:<token> or pseudonym:<token> without embedding any substring of text.
  • Leave write_group_to_file unchanged; it will automatically start writing the safer tokens instead of the current "prefix…:digest" values.

This modification is restricted to the shown file and lines, uses only standard-library modules, preserves the behavior that identical inputs yield identical masked outputs, and removes clear-text leakage that CodeQL flags.


Suggested changeset 1
scripts/analyze_duplicates.py

Autofix patch

Run the following command in your local git repository to apply this patch:
cat << 'EOF' | git apply
diff --git a/scripts/analyze_duplicates.py b/scripts/analyze_duplicates.py
--- a/scripts/analyze_duplicates.py
+++ b/scripts/analyze_duplicates.py
@@ -6,6 +6,8 @@
 from pathlib import Path
 from typing import Any
 import hashlib
+import secrets
+import base64
 
 from dex_python.deduplication import (
     find_birthday_name_duplicates,
@@ -20,7 +22,13 @@
 DEFAULT_DB_PATH = DATA_DIR / "dex_contacts.db"
 DEFAULT_REPORT_PATH = DATA_DIR / "DUPLICATE_REPORT.md"
 
+# Per-process random salt used to strengthen pseudonymization. This ensures
+# that the masked values in the report cannot be directly linked back to
+# database values via precomputed hashes, while remaining stable within
+# a single run of the script.
+PSEUDONYM_SALT = secrets.token_bytes(16)
 
+
 def get_contact_summary(conn: sqlite3.Connection, contact_id: str) -> dict[str, Any]:
     """Fetch basic info for a contact to display in the report."""
     cursor = conn.cursor()
@@ -43,15 +50,25 @@
 
     This avoids storing raw identifiers or PII in clear text while still
     allowing consistent comparison within the report.
+
+    The function produces a deterministic, non-reversible token for a given
+    input *within a single run* of the script, and does not expose any
+    substring of the original value.
     """
     if value is None:
         return "N/A"
     text = str(value)
-    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:8]
-    prefix = text[:3] if len(text) > 3 else text
-    return f"{prefix}…:{digest}"
+    if not text:
+        return "N/A"
 
+    # Combine a per-process random salt with the value so that the resulting
+    # token cannot be matched against precomputed hashes of database values.
+    digest_bytes = hashlib.sha256(PSEUDONYM_SALT + text.encode("utf-8")).digest()
+    # Use a short, URL-safe base64 representation for readability.
+    token = base64.urlsafe_b64encode(digest_bytes).decode("ascii").rstrip("=")[:10]
+    return f"anon:{token}"
 
+
 def write_group_to_file(
     f: Any, conn: sqlite3.Connection, group: dict[str, Any], title: str
 ) -> None:
EOF
