Selective Sharing and Privacy Protection for File-based Queries and Knowledge Base #887

@ItouTerukazu

Issue Type

Feature Request

Description

CrewAI currently stores both general knowledge and individual query content in a shared SQLite database via EmbedChain. This includes content from file-based queries (e.g., PDF RAG tools). As a result, potentially sensitive information is unintentionally shared across all sessions. Sharing some knowledge can be beneficial, but users need more control over what is shared, especially for file-based queries.

We propose implementing a feature to allow selective sharing of knowledge while protecting the privacy of individual queries and file contents.

Proposed Solution

Implement a mechanism to separate the storage and handling of general knowledge, individual query content, and file-based query results, with user-controlled sharing options. This could be achieved through:

  1. Tiered Storage System:

    • Shared Knowledge Base: A persistent SQLite database for storing and sharing general knowledge across sessions.
    • Session-Specific Storage: Temporary storage (e.g., in-memory database or session-specific file) for individual queries and their immediate context.
    • File-Based Query Cache: Separate storage for file-based query results (e.g., PDF content embeddings).

  2. Selective Sharing Options:

    • Implement user controls to decide whether file-based query results should be:
      a) Kept private to the current session
      b) Shared temporarily for a specified duration
      c) Permanently added to the shared knowledge base

  3. Data Classification and Tagging:

    • Automatically classify incoming data as "general knowledge," "session-specific query," or "file-based query result."
    • Allow users to tag certain file-based queries or results for sharing or privacy protection.

  4. Query and Result Anonymization:

    • Implement an anonymization layer that removes or encrypts personally identifiable information before storing or processing queries and results.

  5. Automatic Cleanup and Retention Policies:

    • Implement configurable policies for automatic deletion of session-specific and file-based query data.
    • Allow users to set retention periods for shared file-based query results.
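
To make items 3 and 4 more concrete, here is a minimal sketch of how classification and redaction might run before anything is stored. It is an illustration only, not existing CrewAI or EmbedChain code: `DataClass`, `Record`, `classify`, `redact`, and `prepare` are hypothetical names, the classification rule is a naive heuristic, and the two regular expressions only catch simple email addresses and phone numbers. Real PII detection would need considerably more than this.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DataClass(Enum):
    """The three categories named in the proposal (hypothetical enum)."""
    GENERAL_KNOWLEDGE = "general_knowledge"
    SESSION_QUERY = "session_specific_query"
    FILE_QUERY_RESULT = "file_based_query_result"


# Deliberately simple patterns, assumed for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


@dataclass
class Record:
    text: str
    data_class: DataClass
    source_file: Optional[str] = None


def classify(source_file: Optional[str] = None,
             from_user_query: bool = False) -> DataClass:
    """Naive rule: anything derived from a file is a file-based result."""
    if source_file is not None:
        return DataClass.FILE_QUERY_RESULT
    if from_user_query:
        return DataClass.SESSION_QUERY
    return DataClass.GENERAL_KNOWLEDGE


def redact(text: str) -> str:
    """Replace matched emails and phone numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def prepare(text: str, source_file: Optional[str] = None,
            from_user_query: bool = False) -> Record:
    """Redact first, then attach the class label used for routing."""
    return Record(redact(text), classify(source_file, from_user_query),
                  source_file)
```

A call such as `prepare("Mail jane.doe@example.com", source_file="report.pdf")` would yield a record tagged as a file-based query result with the address replaced by a placeholder. User-applied tags could later override the automatic label.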

Example Use Case

When a user utilizes a PDF RAG tool to query a PDF file:

  1. The system processes and embeds the PDF content.
  2. The user is prompted with options:
    • Keep this PDF's content private to this session
    • Share this PDF's content for the next X hours/days
    • Add this PDF's content to the permanent shared knowledge base
  3. Based on the user's choice, the system stores and handles the PDF content and query results accordingly.
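
The flow above could be sketched roughly as follows, using only Python's standard `sqlite3` module. This is an assumption-laden illustration rather than a description of how CrewAI or EmbedChain store data today: `Sharing`, `TieredStore`, and the `chunks` table are hypothetical, plain text stands in for embeddings, and the session tier is modeled as an in-memory database that disappears when the session ends.

```python
import sqlite3
import time
from enum import Enum
from typing import List, Optional


class Sharing(Enum):
    PRIVATE = "private"      # keep private to this session
    TEMPORARY = "temporary"  # share for the next X hours/days
    PERMANENT = "permanent"  # add to the permanent shared knowledge base


SCHEMA = ("CREATE TABLE IF NOT EXISTS chunks "
          "(source TEXT, content TEXT, expires_at REAL)")


class TieredStore:
    """Hypothetical two-tier store: a shared database plus a session one."""

    def __init__(self, shared_path: str = ":memory:") -> None:
        # In practice shared_path would point at a persistent file.
        self.shared = sqlite3.connect(shared_path)
        self.session = sqlite3.connect(":memory:")
        for db in (self.shared, self.session):
            db.execute(SCHEMA)

    def add(self, source: str, content: str, sharing: Sharing,
            ttl_seconds: Optional[float] = None,
            now: Optional[float] = None) -> None:
        """Route content to a tier according to the user's choice."""
        now = time.time() if now is None else now
        if sharing is Sharing.PRIVATE:
            db, expires_at = self.session, None
        elif sharing is Sharing.TEMPORARY:
            if ttl_seconds is None:
                raise ValueError("temporary sharing needs ttl_seconds")
            db, expires_at = self.shared, now + ttl_seconds
        else:
            db, expires_at = self.shared, None
        db.execute("INSERT INTO chunks VALUES (?, ?, ?)",
                   (source, content, expires_at))
        db.commit()

    def purge_expired(self, now: Optional[float] = None) -> int:
        """Delete temporarily shared rows whose time is up; return the count."""
        now = time.time() if now is None else now
        cur = self.shared.execute(
            "DELETE FROM chunks "
            "WHERE expires_at IS NOT NULL AND expires_at <= ?", (now,))
        self.shared.commit()
        return cur.rowcount

    def shared_contents(self) -> List[str]:
        rows = self.shared.execute("SELECT content FROM chunks ORDER BY rowid")
        return [row[0] for row in rows]

    def session_contents(self) -> List[str]:
        rows = self.session.execute("SELECT content FROM chunks ORDER BY rowid")
        return [row[0] for row in rows]
```

In this sketch, expiry is enforced by calling `purge_expired()` periodically (for example at session start). A read path that also filters on `expires_at` would prevent stale rows from being served between purges.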

Benefits

  • Enhanced Privacy: Protect sensitive information in individual queries and file contents.
  • User Control: Give users granular control over what information is shared or persisted.
  • Flexible Knowledge Management: Allow beneficial knowledge sharing while respecting privacy preferences.
  • Improved Compliance: Better align with data protection regulations by giving users control over their data.

Potential Challenges

  • UI/UX Design: Create an intuitive interface for users to manage sharing preferences.
  • Implementation Complexity: Significant changes to the current data handling architecture may be required.
  • Performance Considerations: Ensure that the tiered storage system doesn't negatively impact system performance.
  • Data Lifecycle Management: Implement robust systems for managing data retention, sharing durations, and deletions.

Questions

  1. How can we design an intuitive UI/UX for users to control sharing preferences without overwhelming them with options?
  2. What's the best way to implement temporary sharing with automatic expiration of shared content?
  3. How should we handle scenarios where multiple users have different sharing preferences for the same or similar file-based queries?
  4. Are there existing privacy-preserving techniques in RAG systems that we could incorporate?

We appreciate your consideration of this feature request and look forward to discussing how we can enhance privacy protection and user control in CrewAI, particularly for file-based queries and knowledge sharing.
