fix: remove PII from log messages #1237

Merged
prasmussen15 merged 2 commits into main from fix/remove-pii-from-logs
Feb 17, 2026

Conversation

@prasmussen15
Collaborator

Summary

  • Remove entity names, edge names/facts, and full LLM message content from log messages to prevent PII from leaking into logs
  • Replace with UUIDs, counts, and structural metadata only
  • Affects WARNING, ERROR, and DEBUG level logs across the ingestion pipeline

Files changed

  • graphiti_core/utils/maintenance/edge_operations.py — Remove entity names from 3 WARNING-level logs (source/target entity not found). Replace full edge objects and (name, uuid) tuples with UUID-only lists in 6 DEBUG logs.
  • graphiti_core/utils/maintenance/node_operations.py — Remove entity names from 2 WARNING logs (unknown entity summary, invalid duplicate name). Replace (name, uuid) tuples with UUID-only lists in 5 DEBUG logs. Remove entity names from dedup sampling logs.
  • graphiti_core/llm_client/client.py — Replace full message content dump in _get_failed_generation_log() (called at ERROR level by Gemini client) with message count and role metadata.
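
The DEBUG-log change listed above (UUID-only lists instead of `(name, uuid)` tuples) could look roughly like the following sketch. `Node` and `describe_nodes_for_log` are hypothetical stand-ins introduced for illustration, not the actual graphiti_core code, and the message wording is assumed.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class Node:
    """Hypothetical stand-in for an entity node; not the graphiti_core class."""

    uuid: str
    name: str  # may contain PII, so it must never reach the logs


def describe_nodes_for_log(nodes: list[Node]) -> str:
    """Return a log-safe description: a count plus UUIDs, never names."""
    uuids = [node.uuid for node in nodes]
    return f'{len(uuids)} node(s): {uuids}'


nodes = [Node(uuid='a1', name='John Smith'), Node(uuid='b2', name='Jane Doe')]
# Only the count and UUIDs are interpolated into the message.
logger.debug('Extracted %s', describe_nodes_for_log(nodes))
```

A UUID can still be joined back to the entity in the graph store by someone with database access, so the log stays useful for debugging without carrying the name itself.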

What was leaking

# BEFORE — entity names in WARNING logs
WARNING: Source entity name "John Smith" not found in nodes for edge WORKS_AT
WARNING: LLM returned summary for unknown entity: Jane Doe

# BEFORE — full LLM input content in ERROR logs
ERROR: Input messages: [{"role": "user", "content": "Extract entities from: John called Mary..."}]

# AFTER — only UUIDs and metadata
WARNING: Source entity not found in nodes for extracted edge
WARNING: LLM returned summary for unknown entity
ERROR: Input messages: 2 message(s), roles: ['system', 'user']
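
A minimal sketch of how the AFTER form of the `Input messages` line could be produced. `Message` and `summarize_messages` are hypothetical names chosen for this example; the real `_get_failed_generation_log()` may be structured differently.

```python
from dataclasses import dataclass


@dataclass
class Message:
    """Hypothetical message type with the two fields the log line needs."""

    role: str
    content: str  # user-supplied text; never written to the log


def summarize_messages(messages: list[Message]) -> str:
    """Describe the prompt by message count and roles only."""
    roles = [message.role for message in messages]
    return f'Input messages: {len(messages)} message(s), roles: {roles}'


messages = [
    Message(role='system', content='You extract entities.'),
    Message(role='user', content='Extract entities from: John called Mary...'),
]
print(summarize_messages(messages))
# prints: Input messages: 2 message(s), roles: ['system', 'user']
```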

Test plan

  • ruff check — 0 errors
  • pyright — 0 errors, 0 warnings
  • 282 unit tests pass (1 pre-existing failure unrelated to this change)

🤖 Generated with Claude Code

Remove entity names, edge facts, and LLM input/output content from log
messages to prevent personally identifiable information from leaking
into logs. Replace with UUIDs, counts, and structural metadata only.

Changes:
- edge_operations.py: Remove entity names from WARNING logs, replace
  full edge objects and name tuples with UUIDs in DEBUG logs
- node_operations.py: Remove entity names from WARNING and DEBUG logs,
  log only UUIDs and counts instead of (name, uuid) tuples
- llm_client/client.py: Replace full message content dump in
  _get_failed_generation_log with message count and role metadata

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
-     log += f'Raw output: {output[:2000]}... (truncated) ...{output[-2000:]}\n'
- else:
-     log += f'Raw output: {output}\n'
+ log += f'Raw output length: {len(output)} chars\n'
Contributor

Completely omitting the raw output content removes valuable debugging information. When LLM generation fails, knowing the actual malformed output (JSON parsing errors, unexpected format, etc.) is crucial for diagnosing issues.

Consider truncating the output instead of just showing the length:

Suggested change
- log += f'Raw output length: {len(output)} chars\n'
+ if output is not None:
+     # Truncate to avoid PII while preserving debug value
+     truncated = output[:500] + '...' if len(output) > 500 else output
+     log += f'Raw output (truncated): {truncated}\n'

This balances PII protection (500 chars is unlikely to contain substantial user data in a structured LLM response) with debuggability.
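
The truncation suggested above can be factored into a small helper so that every call site applies the same limit. This is a sketch only: `truncate_for_log` and the `'<none>'` placeholder are assumptions made for this example, not part of the PR.

```python
from typing import Optional


def truncate_for_log(text: Optional[str], limit: int = 500) -> str:
    """Return at most `limit` characters of `text`, marking any cut with '...'."""
    if text is None:
        return '<none>'  # hypothetical placeholder for a missing value
    if len(text) <= limit:
        return text
    return text[:limit] + '...'


raw_output = '{"entities": [' + 'x' * 600  # stand-in for malformed LLM output
log_line = f'Raw output (truncated): {truncate_for_log(raw_output)}\n'
```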

-     f'Source entity name "{source_name}" not found in nodes '
-     f'for edge {edge_data.relation_type}'
- )
+ logger.warning('Source entity not found in nodes for extracted edge')
Contributor

The warning now lacks enough context to diagnose the issue. When an LLM returns invalid entity names, operators need to know which edge relationship was affected to understand if there's a pattern (e.g., certain relation types consistently failing).

Consider keeping the relation type which is schema/metadata, not user PII:

Suggested change
- logger.warning('Source entity not found in nodes for extracted edge')
+ logger.warning(
+     f'Source entity not found in nodes for edge relation: {edge_data.relation_type}'
+ )

-     f'Target entity name "{target_name}" not found in nodes '
-     f'for edge {edge_data.relation_type}'
- )
+ logger.warning('Target entity not found in nodes for extracted edge')
Contributor

Same concern - the relation type is not PII and helps diagnose which edge extractions are failing.

  # Check if this entity type should be excluded
  if excluded_entity_types and entity_type_name in excluded_entity_types:
-     logger.debug(f'Excluding entity "{extracted_entity.name}" of type "{entity_type_name}"')
+     logger.debug(f'Excluding entity of type "{entity_type_name}"')
Contributor

The entity type is not PII - it's schema metadata (e.g., "Person", "Organization"). This information is valuable for debugging without exposing user data. The original line was fine:

Suggested change
logger.debug(f'Excluding entity of type "{entity_type_name}"')
logger.debug(f'Excluding entity of type "{entity_type_name}"')

Wait, looking again, the entity type is unchanged here. The original line also logged the entity name, which is correct to remove. This change looks appropriate.

  logger.warning(
-     'Invalid duplicate_name %r for extracted node %s; treating as no duplicate.',
-     duplicate_name,
+     'Invalid duplicate_name for extracted node %s; treating as no duplicate.',
Contributor

Removing duplicate_name from this warning makes it impossible to diagnose LLM deduplication issues. The duplicate name returned by the LLM is structural metadata about how entities were matched, not necessarily user PII.

If the LLM consistently returns invalid duplicate names, operators need to see what value was returned to identify patterns (e.g., case sensitivity issues, partial matches, etc.).

Consider at minimum logging whether a duplicate_name was provided:

Suggested change
- 'Invalid duplicate_name for extracted node %s; treating as no duplicate.',
+ logger.warning(
+     'Invalid duplicate_name for extracted node %s; treating as no duplicate. '
+     'duplicate_name was: %r',
+     extracted_node.uuid,
+     duplicate_name[:50] + '...' if len(duplicate_name) > 50 else duplicate_name,
+ )

Or if truly concerned about PII, at least log the length to distinguish empty vs invalid values.
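
The length-only fallback mentioned above might look like the sketch below, which separates a missing value, an empty string, and a non-empty value without logging any of the content. `describe_value_for_log` and its wording are hypothetical.

```python
from typing import Optional


def describe_value_for_log(value: Optional[str]) -> str:
    """Describe a possibly sensitive string by presence and length only."""
    if value is None:
        return 'missing'
    if value == '':
        return 'empty'
    return f'present, {len(value)} chars'


# The value itself is never interpolated, only its description.
message = 'Invalid duplicate_name (%s) for extracted node %s' % (
    describe_value_for_log('Jon Smith'),
    'node-uuid-123',  # placeholder UUID for the example
)
```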

      node.summary = truncated_summary
  else:
-     logger.warning(f'LLM returned summary for unknown entity: {summarized_entity.name}')
+     logger.warning('LLM returned summary for unknown entity')
Contributor

This warning now provides no actionable information. When an LLM returns a summary for an entity that doesn't exist, it indicates either:

  1. A bug in the summarization prompt/context
  2. An LLM hallucination pattern

Without knowing which entity name the LLM returned, there's no way to debug this. Consider logging the summarized_entity.name truncated:

Suggested change
- logger.warning('LLM returned summary for unknown entity')
+ logger.warning(
+     'LLM returned summary for unknown entity (first 30 chars): %.30s',
+     summarized_entity.name,
+ )
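
The `%.30s` in the suggestion above relies on printf-style precision: for `%s`, a precision of 30 caps the rendered string at 30 characters. Python's `logging` module formats messages with the same `%` operator, so the truncation happens when the record is rendered. A small check, using a made-up 40-character name:

```python
import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger('truncation_example')

name = 'A' * 40  # made-up value, longer than the 30-character cap
template = 'LLM returned summary for unknown entity (first 30 chars): %.30s'

# Same formatting the logging module applies when it renders the record.
formatted = template % name

# Arguments are passed separately, so formatting is deferred until emit time.
logger.warning(template, name)
```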

@claude
Contributor

claude bot commented Feb 17, 2026

Review Summary

The goal of removing PII from logs is important for security and compliance. However, this PR overcorrects in several places, removing debugging information that is not PII while making the logs nearly useless for diagnosing production issues.

Main Concerns

  1. Distinguishing PII from metadata: Entity names can contain PII, but relation types (e.g., WORKS_AT, LIVES_IN) and entity types (e.g., Person, Organization) are schema metadata, not user data. These should be preserved in logs.

  2. Debuggability regression: Several warnings now provide no actionable context:

    • "Source entity not found in nodes for extracted edge" - which edge? what relation?
    • "LLM returned summary for unknown entity" - what name did the LLM hallucinate?
    • "Invalid duplicate_name for extracted node" - what value did the LLM return?
  3. Raw output removal in _get_failed_generation_log: Completely removing the LLM's raw output means failed generation errors become nearly impossible to debug. The raw output typically contains structural/formatting issues, not user PII. Consider truncating rather than omitting entirely.

Recommendations

  • Preserve relation types and entity types in logs (these are not PII)
  • Truncate potentially sensitive values rather than removing them entirely
  • Consider structured logging with a separate "sensitive" field that can be filtered at the log collector level, rather than removing from source

See inline comments for specific suggestions.
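
The structured-logging recommendation above could be sketched with the standard `logging` module as follows. Everything named here is an assumption for illustration: the `sensitive` attribute, the `RedactSensitiveFilter` class, and the choice to redact in an in-process filter, which stands in for filtering at a log collector.

```python
import logging


class RedactSensitiveFilter(logging.Filter):
    """Scrub the hypothetical `sensitive` attribute before a record is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Records without the attribute get a neutral marker so the
        # formatter below can always resolve %(sensitive)s.
        record.sensitive = '[redacted]' if hasattr(record, 'sensitive') else '-'
        return True  # keep the record; only the field is scrubbed


logger = logging.getLogger('pii_example')
handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter('%(levelname)s: %(message)s sensitive=%(sensitive)s')
)
handler.addFilter(RedactSensitiveFilter())
logger.addHandler(handler)
logger.propagate = False  # avoid duplicate output through the root logger

# The message stays PII-free; the name travels in a separate field.
logger.warning(
    'LLM returned summary for unknown entity',
    extra={'sensitive': {'entity_name': 'Jane Doe'}},
)
```

With this layout the call site keeps the full context, and an environment that is allowed to see it can attach a handler without the filter.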

Address review feedback — the initial PII fix overcorrected by removing
non-PII debugging context:

- Restore relation types in edge WARNING logs (schema metadata, not PII)
- Restore truncated duplicate_name in dedup WARNING (needed for diagnosis)
- Restore truncated entity name (first 30 chars) in summary WARNING
- Restore truncated raw LLM output (first 500 chars) in failed generation
  ERROR logs — malformed output is structural, not user content

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- logger.warning(
-     f'Could not find nodes for edge: {edge_data.source_entity_name} -> {edge_data.target_entity_name}'
- )
+ logger.warning('Could not find source or target node for extracted edge')
Contributor

This warning still lacks the relation type context that was restored in the similar warnings above (lines 157, 164). Since edge_data is in scope here, consider including edge_data.relation_type for consistency:

Suggested change
- logger.warning('Could not find source or target node for extracted edge')
+ logger.warning('Could not find source or target node for edge relation: %s', edge_data.relation_type)

@prasmussen15 prasmussen15 merged commit 7985997 into main Feb 17, 2026
13 checks passed
@prasmussen15 prasmussen15 deleted the fix/remove-pii-from-logs branch February 17, 2026 19:58
@getzep getzep locked and limited conversation to collaborators Feb 17, 2026