dmux (Owner) commented Jun 29, 2025

This pull request introduces significant enhancements to the lambda_rag_lite library, focusing on modularization, centralized configurations, and improved functionality for document processing, chunking, and encoding detection. The most important changes include the addition of centralized configuration classes, modular architecture improvements, and new specialized detectors for encoding and file types.

Modular Architecture Enhancements:

  • Centralized Configuration: Added multiple configuration classes (ChunkingConfig, TextProcessingConfig, LoaderConfig, EmbeddingConfig, VectorStoreConfig) to centralize settings and improve type safety. Includes presets for common use cases like small documents, academic papers, and code files. (lambda_rag_lite/config.py, R1-R235)
  • Specialized Modules: Introduced new modules for encoding detection (EncodingDetector) and file type detection (FileTypeDetector) to handle specific tasks more effectively. (lambda_rag_lite/detectors/encoding.py; lambda_rag_lite/detectors/__init__.py)
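
As an illustration of the centralized-configuration idea, here is a minimal sketch of a dataclass-based config with presets. Only the class name ChunkingConfig and the three preset use cases come from this PR; the field names, default values, the PRESETS dictionary, and the get_preset helper are assumptions made for the example and may not match lambda_rag_lite/config.py.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ChunkingConfig:
    """Illustrative chunking settings; field names and defaults are assumed."""

    chunk_size: int = 1000
    chunk_overlap: int = 200
    separator: str = "\n\n"

    def __post_init__(self) -> None:
        # Validate once at construction so every consumer sees a sane config.
        if self.chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        if not 0 <= self.chunk_overlap < self.chunk_size:
            raise ValueError("chunk_overlap must be in [0, chunk_size)")


# Hypothetical presets for the use cases named in the PR description.
PRESETS = {
    "small_documents": ChunkingConfig(chunk_size=500, chunk_overlap=50),
    "academic_papers": ChunkingConfig(chunk_size=1500, chunk_overlap=300),
    "code_files": ChunkingConfig(chunk_size=800, chunk_overlap=100, separator="\n"),
}


def get_preset(name: str, **overrides) -> ChunkingConfig:
    """Return a copy of a preset with optional field overrides (hypothetical helper)."""
    return replace(PRESETS[name], **overrides)
```

Because the dataclass is frozen, a call such as get_preset("code_files", chunk_size=400) returns a new, re-validated object and leaves the shared preset untouched.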

Code Quality and File Organization:

  • Constants Centralization: Added a constants.py file to centralize file extensions, programming languages, and MIME type mappings, reducing duplication across modules. (lambda_rag_lite/constants.py, R1-R197)
  • Updated __init__.py: Refactored imports in lambda_rag_lite/__init__.py to include new classes and maintain backward compatibility with legacy APIs. (lambda_rag_lite/__init__.py, R10-R68)
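
To show how centralized constants can remove duplication, the sketch below keeps one extension table and has a file-type lookup read from it. The mapping contents and the function names detect_file_type and detect_language are hypothetical; the actual constants.py and FileTypeDetector may be organized differently.

```python
from pathlib import Path
from typing import Optional

# Hypothetical shared constants; the real constants.py may use other names and entries.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".js": "javascript",
    ".rs": "rust",
}
TEXT_EXTENSIONS = frozenset({".txt", ".md", ".rst"})


def detect_language(path: str) -> Optional[str]:
    """Return the programming language for a path, or None if it is not known code."""
    return EXTENSION_TO_LANGUAGE.get(Path(path).suffix.lower())


def detect_file_type(path: str) -> str:
    """Classify a path as 'code', 'text', or 'unknown' using only the shared tables."""
    suffix = Path(path).suffix.lower()
    if suffix in EXTENSION_TO_LANGUAGE:
        return "code"
    if suffix in TEXT_EXTENSIONS:
        return "text"
    return "unknown"
```

With a single table, adding a new extension is a one-line change that every loader and detector picks up.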

Configuration and Quality Tooling:

  • Qlty Tooling Setup: Added .qlty configuration files (qlty.toml, .yamllint.yaml) and updated .gitignore to integrate code quality tools like yamllint, ruff, and markdownlint. (.qlty/configs/.yamllint.yaml; .qlty/qlty.toml; .qlty/.gitignore)

Documentation and Versioning:

  • Changelog Update: Documented major refactoring and improvements in the CHANGELOG.md for version 0.2.0, including migration guides and directory structure changes. (CHANGELOG.md, R8-R125)

dmux added 3 commits June 29, 2025 18:31
- Added DocumentMetadataManager for centralized metadata handling of documents.
- Introduced TextProcessor for advanced text processing with customizable configurations.
- Created chunking strategies for text segmentation, including Separator and Character strategies.
- Developed utility functions for text cleaning, keyword extraction, and text statistics.
- Updated vector store implementation to align with new features.
- Enhanced overall structure and modularity of the codebase.
- Bumped version to 0.2.0 for new features and improvements.
- Deleted `debug_chunking.py`, `debug_chunking_detailed.py`, `debug_sentence_break.py`, and `debug_text_processor.py` as they were used for testing and debugging purposes.
- These files contained various test cases for chunking text, processing sentences, and cleaning text, which are no longer needed in the repository.
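
The commit notes above mention Separator and Character chunking strategies. A minimal sketch of how such a Strategy-pattern split could look follows; the class names, method signature, and the rule for oversized pieces are assumptions for illustration, not the code in lambda_rag_lite/strategies/chunking.py, and overlap handling is omitted for brevity.

```python
from abc import ABC, abstractmethod


class ChunkingStrategy(ABC):
    """Common interface so callers can swap strategies without changing their code."""

    @abstractmethod
    def chunk(self, text: str, chunk_size: int) -> list[str]:
        ...


class CharacterChunkingStrategy(ChunkingStrategy):
    """Cut the text into fixed-size windows, ignoring its structure."""

    def chunk(self, text: str, chunk_size: int) -> list[str]:
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


class SeparatorChunkingStrategy(ChunkingStrategy):
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size."""

    def __init__(self, separator: str = "\n\n") -> None:
        self.separator = separator

    def chunk(self, text: str, chunk_size: int) -> list[str]:
        chunks: list[str] = []
        current = ""
        for piece in text.split(self.separator):
            candidate = piece if not current else current + self.separator + piece
            # A single piece longer than chunk_size is kept whole rather than cut.
            if len(candidate) <= chunk_size or not current:
                current = candidate
            else:
                chunks.append(current)
                current = piece
        if current:
            chunks.append(current)
        return chunks


def chunk_text(text: str, strategy: ChunkingStrategy, chunk_size: int) -> list[str]:
    """Entry point that depends only on the abstract interface."""
    return strategy.chunk(text, chunk_size)
```

For example, chunk_text("a\n\nb\n\nc", SeparatorChunkingStrategy(), 4) packs the first two paragraphs together and leaves the third in its own chunk.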
dmux self-assigned this Jun 29, 2025
dmux added the documentation and enhancement labels Jun 29, 2025
dmux requested a review from Copilot June 29, 2025 21:36
Copilot AI left a comment

Pull Request Overview

This PR refactors the library for better modularity, centralized configuration, and specialized detection components.

  • Introduces dataclass-based configuration and presets for chunking, processing, loading, embedding, and vector stores.
  • Adds EncodingDetector, FileTypeDetector, and a Strategy-pattern-based chunker.
  • Overhauls loaders to use factory and metadata manager patterns, and updates utils to wrap the new implementations.
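
For readers unfamiliar with encoding detection, one common approach is to check for a byte-order mark first and then try a short list of candidate encodings. The sketch below illustrates that approach only; the function name detect_encoding, the candidate list, and the latin-1 fallback are assumptions, and the PR's EncodingDetector may work differently.

```python
import codecs

# BOM checks run first. UTF-32 precedes UTF-16 because the UTF-32 LE BOM
# begins with the same two bytes as the UTF-16 LE BOM.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def detect_encoding(data: bytes, candidates: tuple = ("utf-8", "cp1252")) -> str:
    """Return the first encoding that plausibly decodes data (heuristic sketch)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    # latin-1 maps every byte to a code point, so it never fails to decode.
    return "latin-1"
```

Trial decoding is only a heuristic: several single-byte encodings accept the same bytes, so the order of candidates determines the answer for ambiguous input.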

Reviewed Changes

Copilot reviewed 23 out of 25 changed files in this pull request and generated 1 comment.

File                                          Description
pyproject.toml                                Bumped version to 0.2.0 and added Python 3.13 classifier.
lambda_rag_lite/config.py                     New dataclasses for centralized, type-safe configuration and presets.
lambda_rag_lite/constants.py                  Centralized file-extension and MIME mappings to remove duplication.
lambda_rag_lite/detectors/encoding.py         Added EncodingDetector for intelligent file-encoding detection.
lambda_rag_lite/detectors/file_type.py        Added FileTypeDetector for unified file-type inference.
lambda_rag_lite/strategies/chunking.py        Refactored text chunking into Strategy pattern with ChunkingConfig.
lambda_rag_lite/processors/text_processor.py  New TextProcessor using configurable processing and chunking.
lambda_rag_lite/utils.py                      Replaced inline helpers with wrappers around the new architecture.
lambda_rag_lite/loaders.py                    Updated loaders to use detectors, factory, and centralized metadata.
lambda_rag_lite/__init__.py                   Exposed new classes and maintained backward-compatible API exports.
Comments suppressed due to low confidence (2)

lambda_rag_lite/loaders.py:239

  • TextLoader uses an inconsistent metadata key 'extension' instead of 'file_extension' used elsewhere. Consolidate metadata schema, possibly via DocumentMetadataManager, for consistency across loaders.
            metadata = {
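
One way to act on this suggestion is to build metadata in a single place so every loader emits the same keys. The sketch below is hypothetical: the function name build_metadata and the key set are assumptions, and the only detail taken from the comment above is that 'file_extension' is the preferred key over 'extension'.

```python
from pathlib import Path


def build_metadata(path: str, **extra) -> dict:
    """Build loader metadata with one canonical schema (illustrative helper)."""
    p = Path(path)
    metadata = {
        "source": str(p),
        "file_name": p.name,
        "file_extension": p.suffix.lower(),
    }
    # Drop the legacy 'extension' key so only 'file_extension' is ever emitted.
    extra = {key: value for key, value in extra.items() if key != "extension"}
    metadata.update(extra)
    return metadata
```

A loader would then call build_metadata(path, loader="TextLoader") instead of assembling its own dictionary.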

lambda_rag_lite/__init__.py:24

  • [nitpick] Alias 'NewTextProcessor' may be confusing to consumers. Consider renaming to a more descriptive identifier or exposing it under its original class name 'TextProcessor' for clarity.
from .processors.text_processor import TextProcessor as NewTextProcessor

dmux merged commit 10d71b1 into main Jun 29, 2025
4 checks passed