feat: enhance message processing with document handling and summary generation#11
Conversation
…eneration - Introduced `generateSummary` function to create concise summaries for text. - Added `processDocument` method to handle various document types, including PDFs and text files. - Implemented centralized document processing based on MIME type. - Updated `processMessage` to integrate document processing and summary extraction. - Enhanced logging for document processing and error handling.
|
Note Other AI code review bot(s) detectedCodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review. WalkthroughThe update enhances the Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant Telegram
participant MessageManager
participant PDFService
participant Summarizer
User->>Telegram: Sends message with document
Telegram->>MessageManager: Receives message
MessageManager->>MessageManager: processMessage(message)
alt Document detected
MessageManager->>MessageManager: processDocument(message)
alt PDF Document
MessageManager->>PDFService: Convert PDF to text
PDFService-->>MessageManager: Text content
else Text/CSV/JSON Document
MessageManager->>MessageManager: Fetch and extract text
end
MessageManager->>Summarizer: generateSummary(text)
Summarizer-->>MessageManager: Title, description
MessageManager->>MessageManager: Build attachment with summary
end
MessageManager->>MessageManager: Compose processedContent, attachments
MessageManager-->>Telegram: Callback/event with enriched content
Estimated code review effort🎯 4 (Complex) | ⏱️ ~40 minutes Poem
Note 🔌 MCP (Model Context Protocol) integration is now available in Early Access!Pro users can now connect to remote MCP servers under the Integrations page to get reviews and chat conversations that understand additional development context. ✨ Finishing Touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
Bugbot found 3 bugsTo see them, have a team admin activate your membership in the Cursor dashboard. |
There was a problem hiding this comment.
Actionable comments posted: 5
🧹 Nitpick comments (6)
src/messageManager.ts (6)
169-179: Remove duplicate JSDoc commentsThere are two JSDoc comment blocks for the same method. Keep only one comprehensive documentation block.
Remove the duplicate JSDoc block at lines 169-175 and keep the more concise one at lines 176-179.
217-222: Consider making processors a class property for better performanceThe processors object is recreated on every call. Consider making it a class property to avoid repeated object creation.
Move the processors definition to a class property:
export class MessageManager { public bot: Telegraf<Context>; protected runtime: IAgentRuntime; + private readonly documentProcessors = { + "application/pdf": this.processPdfDocument.bind(this), + "text/plain": this.processTextDocument.bind(this), + "text/csv": this.processTextDocument.bind(this), + "application/json": this.processTextDocument.bind(this), + }; // ... rest of the class private getDocumentProcessor(mimeType?: string): ((document: Document, url: string) => Promise<{ description: string }>) | null { if (!mimeType) return null; - const processors = { - "application/pdf": this.processPdfDocument.bind(this), - "text/plain": this.processTextDocument.bind(this), - "text/csv": this.processTextDocument.bind(this), - "application/json": this.processTextDocument.bind(this), - }; - for (const [pattern, processor] of Object.entries(processors)) { + for (const [pattern, processor] of Object.entries(this.documentProcessors)) { if (mimeType.startsWith(pattern)) { return processor; } }Note: You'll need to initialize the bindings in the constructor if you go with this approach.
322-323: Remove duplicate logging statementThe same log message appears at the beginning and end of the method.
Remove the duplicate logging at line 408 since the one at line 322 already captures the initial state.
Also applies to: 408-409
671-671: Consider using empty string instead of space for fallbackUsing a single space
" "as fallback for empty text could cause issues with trimming operations downstream.- text: processedContent || " ", + text: processedContent || "",
30-75: Add unit tests for the generateSummary functionThis critical function handles AI model interactions and JSON parsing. It should have comprehensive unit tests covering success cases, parsing failures, and edge cases.
Would you like me to generate unit tests for the
generateSummaryfunction or open an issue to track this task?
180-300: Consider implementing caching for document processingDocuments might be processed multiple times if referenced in different messages. Consider implementing a simple cache to avoid redundant processing of the same documents.
You could use a Map or LRU cache with document file_id as the key and processed content as the value, with appropriate TTL to manage memory usage.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting
📒 Files selected for processing (1)
src/messageManager.ts(6 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Cursor Bugbot
- Introduced `DocumentProcessingResult` interface for better structure in document handling. - Updated `generateSummary` function to return a summary instead of a description. - Refactored `processDocument`, `processPdfDocument`, and `processTextDocument` methods to utilize the new structured result format. - Enhanced logging and error handling to improve clarity and maintainability.
- Introduced `cleanText` utility function to remove NULL characters from text. - Updated `MessageManager` to utilize `cleanText` for processing message content and attachments. - Ensured that both processed content and attachments are cleaned before further processing.
|
cursor review |
|
the improved document handling is great, but I think the summarization doesn't belong here, more of a pluing-knowledge thing in my book. an uploaded document (text-based) could be run through embeddings and saved to memories. And then other actions and subsystem could provide summaries or do searches. |
- Eliminated the `generateSummary` function and its usage in document processing methods. - Updated `MessageManager` to directly set titles and descriptions based on document metadata. - Simplified formatted descriptions for PDF and text documents, removing summary references.
- Modified `formattedDescription` in `MessageManager` to reflect successful text extraction for both PDF and text documents. - Removed references to errors in content extraction, enhancing clarity in document summaries.
You're right, I've deleted the part about summarization. |
There was a problem hiding this comment.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting
📒 Files selected for processing (1)
src/messageManager.ts(5 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
src/messageManager.ts (1)
src/utils.ts (1)
cleanText(222-225)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Cursor Bugbot
🔇 Additional comments (7)
src/messageManager.ts (7)
28-39: Well-structured interface for document processing resultsThe
DocumentProcessingResultinterface effectively addresses the previous fragile string parsing approach by providing structured data fields. This improves maintainability and reduces parsing errors.
10-27: LGTM - Import changes support new functionalityThe added imports are appropriate for the new document processing capabilities and text cleaning functionality.
110-111: Good logic improvement to exclude PDFs from image processingThe additional filter prevents PDF documents from being incorrectly processed as images, which improves processing accuracy.
287-382: Excellent refactoring that addresses previous review concernsThis method successfully:
- Eliminates redundant document processing by processing once and reusing results
- Uses structured data directly instead of fragile regex parsing
- Provides comprehensive error handling with fallback attachments
- Includes proper logging for debugging
- Consolidates message processing logic effectively
The implementation properly addresses all the issues identified in past reviews.
609-623: LGTM - Well-implemented refactoring with proper sanitizationThe refactoring effectively:
- Utilizes the new unified
processMessagemethod for cleaner code organization- Applies
cleanTextsanitization to prevent NULL character issues- Properly handles edge cases by skipping messages without meaningful content
- Maintains all existing functionality while improving code structure
651-652: Good data integrity practiceUsing cleaned content and attachments in the memory object ensures NULL characters don't cause issues in downstream processing.
657-661: LGTM - Reply functionality preserved correctlyThe reply-to functionality is properly maintained with appropriate UUID generation for message references.
There was a problem hiding this comment.
Actionable comments posted: 4
🔭 Outside diff range comments (2)
src/messageManager.ts (2)
100-101: Avoid logging full Telegram message objects (PII risk and noisy logs)Dumping the entire message leaks PII and drastically increases log volume. Log only essentials at debug level.
Apply:
- logger.info(`Telegram Message: ${JSON.stringify(message, null, 2)}`); + logger.debug( + { + messageId: (message as any).message_id, + chatId: (message as any).chat?.id, + hasPhoto: 'photo' in message, + hasDocument: 'document' in message, + }, + 'Telegram message received' + );
388-416: Async map not awaited → possible unhandled rejections; improve contentType mapping
- The async map is not awaited; failures will be unhandled and the function returns immediately, racing the send operations.
- The 'doc' prefix isn’t a valid MIME prefix; map on standard prefixes like application/ and text/.
- if (content.attachments && content.attachments.length > 0) { - content.attachments.map(async (attachment: Media) => { - const typeMap: { [key: string]: MediaType } = { - 'image/gif': MediaType.ANIMATION, - image: MediaType.PHOTO, - doc: MediaType.DOCUMENT, - video: MediaType.VIDEO, - audio: MediaType.AUDIO, - }; + if (content.attachments && content.attachments.length > 0) { + await Promise.all(content.attachments.map(async (attachment: Media) => { + const typeMap: { [key: string]: MediaType } = { + 'image/gif': MediaType.ANIMATION, + 'image/': MediaType.PHOTO, + 'video/': MediaType.VIDEO, + 'audio/': MediaType.AUDIO, + 'application/': MediaType.DOCUMENT, + 'text/': MediaType.DOCUMENT, + }; let mediaType: MediaType | undefined = undefined; - for (const prefix in typeMap) { + for (const prefix of Object.keys(typeMap)) { if (attachment.contentType?.startsWith(prefix)) { mediaType = typeMap[prefix]; break; } } if (!mediaType) { throw new Error( `Unsupported Telegram attachment content type: ${attachment.contentType}` ); } await this.sendMedia(ctx, attachment.url, mediaType, attachment.description); - }); + })); return []; } else {Optional: consider returning the sent media messages so the caller can create memories for attachments; this would require
sendMediato return the Telegram message object instead ofvoid.
🧹 Nitpick comments (1)
src/messageManager.ts (1)
198-210: Clarify fallback message when PDF service is unavailableThis is not a processing error; it’s a feature unavailability. Make the message neutral to avoid confusion.
- formattedDescription: `[PDF Document: ${document.file_name || 'Unknown Document'}\nSize: ${document.file_size || 0} bytes\nUnable to extract text content]`, + formattedDescription: `[PDF Document: ${document.file_name || 'Unknown Document'}\nSize: ${document.file_size || 0} bytes\nText extraction unavailable (PDF service not configured)]`,
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting
📒 Files selected for processing (1)
src/messageManager.ts(5 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Cursor Bugbot
🔇 Additional comments (6)
src/messageManager.ts (6)
106-106: Exclude PDFs from image-doc path — good guardThe additional check to skip PDFs in the image document branch is correct and prevents misclassification.
221-229: PDF success path messaging is now accurateThe formattedDescription correctly reflects successful extraction with character count. Good fix.
247-276: Text document success path messaging is accurateThe success message reflects extracted character count; error path is informative. Looks good.
369-371: Accurate post-processing attachments count loggingLogging after processing documents and images avoids the earlier "0 attachments" issue. Good.
577-588: Null character sanitization for content and attachments is correctCleaning both text and attachment fields proactively prevents downstream issues. Nice touch.
132-170: AI summary mismatches current code (summarization removed)The AI summary mentions summary generation, but the code no longer generates summaries or calls a generateSummary function. The refactor correctly focuses on document handling only.
Likely an incorrect or invalid review comment.
- Handling all text/* documents. - Removed commented-out code and unnecessary comments in `MessageManager` for improved readability. - Updated import statements in `utils.ts` to use type imports for better clarity and performance.
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (3)
src/messageManager.ts (3)
363-365: LGTM: Log after processing attachmentsThis corrects the earlier premature logging issue—count now reflects processed attachments.
124-131: Outdated JSDoc return type for processDocumentJSDoc still says it returns
{ description: string } | null, but the function returnsDocumentProcessingResult | null.Apply this diff:
- /** - * Process a document from a Telegram message to extract the document URL and description. - * Handles PDFs and other document types by converting them to text when possible. - * - * @param {Message} message - The Telegram message object containing the document. - * @returns {Promise<{ description: string } | null>} The description of the processed document or null if no document found. - */ + /** + * Process a document from a Telegram message and extract its content. + * Handles PDFs, text files, and other document types centrally. + * + * @param {Message} message - The Telegram message object containing the document. + * @returns {Promise<DocumentProcessingResult | null>} + */
306-310: Bound embedded document text to avoid memory/token bloatAppending full document text into processedContent can be huge. Truncate in the message body and keep the full text in the attachment.
Apply this diff:
- if (fullText) { - const documentContent = `\n\n--- DOCUMENT CONTENT ---\nTitle: ${title}\n\nFull Content:\n${fullText}\n--- END DOCUMENT ---\n\n`; - processedContent += documentContent; - } + if (fullText) { + const MAX_EMBED_CHARS = 20000; + const embeddedText = + fullText.length > MAX_EMBED_CHARS + ? `${fullText.slice(0, MAX_EMBED_CHARS)}... [truncated]` + : fullText; + const documentContent = `\n\n--- DOCUMENT CONTENT ---\nTitle: ${title}\n\nFull Content:\n${embeddedText}\n--- END DOCUMENT ---\n\n`; + processedContent += documentContent; + }
🧹 Nitpick comments (1)
src/messageManager.ts (1)
105-108: Simplify image document MIME checkThe negative PDF check is redundant because a MIME starting with image/ cannot start with application/pdf.
Apply this diff:
-} else if ('document' in message && message.document?.mime_type?.startsWith('image/') && !message.document?.mime_type?.startsWith('application/pdf')) { +} else if ('document' in message && message.document?.mime_type?.startsWith('image/')) {
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting
📒 Files selected for processing (2)
src/messageManager.ts(5 hunks)src/utils.ts(2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
src/messageManager.ts (1)
src/utils.ts (1)
cleanText(222-225)
🪛 Biome (2.1.2)
src/utils.ts
[error] 224-225: Unexpected control character in a regular expression.
Control characters are unusual and potentially incorrect inputs, so they are disallowed.
(lint/suspicious/noControlCharactersInRegex)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Cursor Bugbot
🔇 Additional comments (3)
src/utils.ts (1)
2-3: LGTM: Type-only imports reduce runtime footprintUsing type-only imports for Button and InlineKeyboardButton is correct and avoids bundling types at runtime.
src/messageManager.ts (2)
170-186: LGTM: MIME router with prefix matchingThe prefix-based routing covers text/* broadly and keeps application/json explicit. This is clear and extensible.
571-579: LGTM: Sanitize content and attachments via cleanTextGood defensive step to strip NULLs before persistence; prevents downstream formatting/parsing issues.
|
up |
processDocumentmethod to handle various document types, including PDFs and text files.processMessageto integrate document processing and summary extraction.Summary by CodeRabbit
New Features
Bug Fixes
Refactor