A robust content extractor for LLM outputs that extracts and parses JSON, XML, HTML, and code blocks from raw strings.
- 🎯 Multiple Format Support: Extract JSON, XML, HTML, and code blocks
- 🛡️ Fault Tolerant:
  - Automatically handle Markdown code fences (```json ... ```)
  - Intelligently extract content embedded in text
  - Fix common LLM errors (e.g., trailing commas in JSON)
- 🏗️ Strategy Pattern: Easy to extend with custom extractors
- 📦 Simple API: Functional interface, ready to use
- 🧪 Well Tested: High test coverage for reliability
- 🔧 Type Safe: Full type annotation support
Install with pip:

```bash
pip install llm-content-extractor
```

Install with Poetry:

```bash
poetry add llm-content-extractor
```

````python
from llm_content_extractor import extract, ContentType

# Extract JSON
json_text = '''
Here's the data you requested:
```json
{
    "name": "Alice",
    "age": 30,
    "hobbies": ["reading", "coding"],
}
```
'''

result = extract(json_text, ContentType.JSON)
print(result)  # {'name': 'Alice', 'age': 30, 'hobbies': ['reading', 'coding']}
````

### JSON Extraction Examples
```python
from llm_content_extractor import extract, ContentType
# 1. JSON with Markdown fence
text1 = '```json\n{"status": "success"}\n```'
extract(text1, ContentType.JSON) # {'status': 'success'}
# 2. Plain JSON
text2 = '{"status": "success"}'
extract(text2, ContentType.JSON) # {'status': 'success'}
# 3. JSON embedded in text
text3 = 'The result is: {"status": "success"} - done!'
extract(text3, ContentType.JSON) # {'status': 'success'}
# 4. JSON with trailing commas (common LLM error)
text4 = '{"items": [1, 2, 3,],}'
extract(text4, ContentType.JSON) # {'items': [1, 2, 3]}
# 5. Using string content type
extract(text1, "json")  # Also works
```

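The JSON cases above (fenced, plain, embedded in text, trailing commas) can be approximated with the standard library alone. The following is a minimal sketch of that idea, not the library's actual implementation; `strip_fence`, `drop_trailing_commas`, and `extract_json_sketch` are hypothetical names chosen for illustration.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way so the example stays fence-safe

# Body of the first Markdown fence, with an optional language tag after the opener.
FENCE_RE = re.compile(FENCE + r"[A-Za-z0-9_+-]*[ \t]*\n(.*?)" + FENCE, re.DOTALL)
# A comma followed only by whitespace and a closing brace or bracket.
TRAILING_COMMA_RE = re.compile(r",\s*([}\]])")


def strip_fence(text: str) -> str:
    """Return the body of the first Markdown fence, or the text unchanged."""
    match = FENCE_RE.search(text)
    return match.group(1).strip() if match else text.strip()


def drop_trailing_commas(text: str) -> str:
    """Remove commas that directly precede a closing brace or bracket."""
    return TRAILING_COMMA_RE.sub(r"\1", text)


def extract_json_sketch(raw_text: str):
    """Hypothetical stand-in for JSON extraction: unfence, repair, then scan."""
    candidate = drop_trailing_commas(strip_fence(raw_text))
    decoder = json.JSONDecoder()
    # Try every position where a JSON object or array could begin.
    for index, char in enumerate(candidate):
        if char in "{[":
            try:
                value, _end = decoder.raw_decode(candidate, index)
                return value
            except json.JSONDecodeError:
                continue
    raise ValueError("no valid JSON found")
```

The trailing-comma pattern is deliberately naive: it would also rewrite a `,}` sequence that appears inside a JSON string value, so a production extractor would need a string-aware pass.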
### XML Extraction

Given text containing a fenced XML block:

````text
A response from the LLM:
```xml
<root>
    <item id="1">First</item>
    <item id="2">Second</item>
</root>
```
````

You can extract it with:

```python
from llm_content_extractor import extract, ContentType

# Assuming the text above is in a variable `xml_text`
result = extract(xml_text, ContentType.XML)
print(result)  # Returns cleaned XML string
```

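To make the XML step above more concrete, here is a rough sketch of unfencing a response and checking that the result is well-formed. It uses the standard library's `xml.etree.ElementTree` as a stand-in parser; `extract_xml_sketch` is a hypothetical name, and none of this is taken from the library's source.

```python
import re
import xml.etree.ElementTree as ET

FENCE = "`" * 3  # three backticks, built this way so the example stays fence-safe

# Body of the first fence, whether it is tagged "xml" (any case) or untagged.
XML_FENCE_RE = re.compile(
    FENCE + r"(?:xml)?[ \t]*\n(.*?)" + FENCE, re.DOTALL | re.IGNORECASE
)


def extract_xml_sketch(raw_text: str) -> str:
    """Hypothetical helper: return the fenced (or bare) XML if it parses."""
    match = XML_FENCE_RE.search(raw_text)
    candidate = (match.group(1) if match else raw_text).strip()
    try:
        ET.fromstring(candidate)  # parse only to confirm well-formedness
    except ET.ParseError as exc:
        raise ValueError(f"not well-formed XML: {exc}") from exc
    return candidate
```

Unlike the embedded-text handling described for this library, the sketch does not try to locate XML inside unfenced prose; it only accepts a fenced block or a string that is XML from start to end.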
### HTML Extraction

Given text containing a fenced HTML block:

````text
LLM says:
```html
<div class="container">
    <h1>Title</h1>
    <p>Content here</p>
</div>
```
````

You can extract it with:

```python
from llm_content_extractor import extract, ContentType

# Assuming the text above is in a variable `html_text`
result = extract(html_text, ContentType.HTML)
print(result)  # Returns cleaned HTML string
```

### Code Block Extraction

1. Extract language-specific code

Given a Python code block:

````text
```python
def greet(name):
    return f"Hello, {name}!"

print(greet("World"))
```
````

Extract it by specifying the language:

```python
from llm_content_extractor import extract, ContentType

# Assuming the text above is in a variable `python_code_text`
code = extract(python_code_text, ContentType.CODE, language='python')
print(code)
# Output:
# def greet(name):
#     return f"Hello, {name}!"
#
# print(greet("World"))
```

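As an illustration of how language-specific extraction like the example above might work, the sketch below scans a response for fenced blocks and returns the first one whose language tag matches. The names `find_blocks` and `first_block` are hypothetical, and the regex is an assumption about one reasonable approach, not the library's implementation.

```python
import re
from typing import Dict, List, Optional

FENCE = "`" * 3  # three backticks, built this way so the example stays fence-safe

# Group 1: optional language tag. Group 2: block body without the final newline.
BLOCK_RE = re.compile(
    FENCE + r"([A-Za-z0-9_+#-]*)[ \t]*\n(.*?)\n?" + FENCE, re.DOTALL
)


def find_blocks(text: str) -> List[Dict[str, str]]:
    """Return every fenced block as a dict with 'language' and 'code' keys."""
    return [
        {"language": match.group(1).lower(), "code": match.group(2)}
        for match in BLOCK_RE.finditer(text)
    ]


def first_block(text: str, language: Optional[str] = None) -> str:
    """Return the first block, or the first block tagged with `language`."""
    for block in find_blocks(text):
        if language is None or block["language"] == language.lower():
            return block["code"]
    raise ValueError("no matching code block found")
```

Passing no language returns the first fenced block of any kind, which corresponds to the generic case; passing one filters on the tag case-insensitively.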
2. Extract any code block

Given a generic code block (no language specified):

````text
```
const x = 42;
console.log(x);
```
````

Extract it without specifying a language:

```python
# Assuming the text above is in a variable `generic_code_text`
code = extract(generic_code_text, ContentType.CODE)
print(code)  # const x = 42;\nconsole.log(x);
```

```python
from llm_content_extractor import JSONExtractor, XMLExtractor

# Use extractor classes directly
json_extractor = JSONExtractor()
result = json_extractor.extract('{"key": "value"}')

xml_extractor = XMLExtractor()
result = xml_extractor.extract('<root><item>test</item></root>')
```

Create custom extractors by inheriting from the `ContentExtractor` base class:

```python
from llm_content_extractor.base import ContentExtractor
from llm_content_extractor import extract, ContentType, register_extractor
import json


class CustomJSONExtractor(ContentExtractor):
    def extract(self, raw_text: str):
        # Custom extraction logic
        cleaned = raw_text.strip()
        # ... your logic here
        return json.loads(cleaned)


# Register custom extractor
register_extractor(ContentType.JSON, CustomJSONExtractor)

# Use the custom extractor
text = '{"key": "value"}'  # example input
result = extract(text, ContentType.JSON)
```

```python
from llm_content_extractor import extract, ContentType, JSONExtractor

# Create a custom configured extractor
my_extractor = JSONExtractor(strict=True)

# Pass the extractor instance directly
raw_text = '{"key": "value"}'  # example input
result = extract(raw_text, ContentType.JSON, extractor=my_extractor)
```

LLM Content Extractor handles various common issues in LLM outputs:

```python
# ✅ Supports various fence formats
extract('```json\n{"a": 1}\n```', ContentType.JSON)
extract('```JSON\n{"a": 1}\n```', ContentType.JSON)  # Uppercase
extract('```\n{"a": 1}\n```', ContentType.JSON)  # No language identifier
```

```python
# ✅ Extract content from surrounding text
text = '''
Here is the configuration:
{"enabled": true, "timeout": 30}
This will set the timeout to 30 seconds.
'''
extract(text, ContentType.JSON)  # Successfully extracts
```

```python
# ✅ Automatically fix trailing commas
extract('{"items": [1, 2,],}', ContentType.JSON)  # {'items': [1, 2]}
extract('[{"id": 1,}, {"id": 2,}]', ContentType.JSON)  # [{'id': 1}, {'id': 2}]
```

```python
# ✅ Handle complex nested structures
nested = {
    "user": {
        "profile": {
            "name": "Alice",
            "contacts": ["email", "phone"]
        }
    }
}
# Fully supported
```

This project uses the Strategy Pattern:

```
ContentExtractor (Abstract Base Class)
├── JSONExtractor
├── XMLExtractor
├── HTMLExtractor
└── CodeBlockExtractor
```

This design provides:
- ✅ Easy to add new extractor types
- ✅ Single responsibility for each extractor
- ✅ Flexible replacement and extension of extraction logic
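
The Strategy Pattern described above can be sketched in a few lines: an abstract base class defines the `extract` contract, concrete strategies implement it, and a registry maps a content type to a strategy. Everything below (`Extractor`, `UpperCaseExtractor`, `WordListExtractor`, `register`, `run_extractor`) is a hypothetical stand-in for illustration and does not reproduce the library's classes.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Type


class Extractor(ABC):
    """Stand-in for an abstract extractor base class."""

    @abstractmethod
    def extract(self, raw_text: str) -> Any:
        """Turn raw text into the strategy's output."""


class UpperCaseExtractor(Extractor):
    """Toy strategy: trim the text and upper-case it."""

    def extract(self, raw_text: str) -> str:
        return raw_text.strip().upper()


class WordListExtractor(Extractor):
    """Toy strategy: split the text into whitespace-separated words."""

    def extract(self, raw_text: str) -> List[str]:
        return raw_text.split()


_REGISTRY: Dict[str, Type[Extractor]] = {}


def register(content_type: str, extractor_cls: Type[Extractor]) -> None:
    """Map a content type name to a strategy class."""
    if not (isinstance(extractor_cls, type) and issubclass(extractor_cls, Extractor)):
        raise TypeError("extractor_cls must be a subclass of Extractor")
    _REGISTRY[content_type.lower()] = extractor_cls


def run_extractor(raw_text: str, content_type: str) -> Any:
    """Look up the strategy for `content_type` and apply it."""
    try:
        extractor_cls = _REGISTRY[content_type.lower()]
    except KeyError:
        raise ValueError(f"unsupported content type: {content_type}") from None
    return extractor_cls().extract(raw_text)


register("upper", UpperCaseExtractor)
register("words", WordListExtractor)
```

Adding a new format in this sketch means writing one more subclass and one `register` call; the dispatch function never changes, which is the main benefit the pattern is meant to deliver.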

`extract()` is the main extraction function.

Parameters:

- `raw_text` (str): Raw string output from the LLM
- `content_type` (ContentType | str): Content type (JSON, XML, HTML, CODE)
- `language` (str, optional): For the CODE type, the programming language to extract
- `extractor` (ContentExtractor, optional): Custom extractor instance

Returns:

- JSON: `dict` or `list`
- XML/HTML/CODE: `str`

Raises:

- `ValueError`: If valid content cannot be extracted
- `TypeError`: If an invalid extractor is provided

```python
class ContentType(Enum):
    JSON = "json"
    XML = "xml"
    HTML = "html"
    CODE = "code"
```

`JSONExtractor(strict=False)`

- `strict`: If True, disable auto-fixing of errors like trailing commas

`XMLExtractor(validate=True, recover=True)`

- `validate`: If True and lxml is available, validate XML syntax
- `recover`: If True, attempt to recover from malformed XML

`HTMLExtractor(validate=False, clean=False)`

- `validate`: If True, validate HTML structure
- `clean`: If True, clean and normalize HTML

`CodeBlockExtractor(language="", strict=False)`

- `language`: Specific language to extract (e.g., 'python', 'javascript')
- `strict`: If True, only extract fenced code blocks

```bash
# Clone the repository
git clone https://github.com/aihes/llm-content-extractor.git
cd llm-content-extractor

# Install dependencies
poetry install

# Run tests
poetry run pytest

# Format code
poetry run black .

# Type checking
poetry run mypy llm_content_extractor
```

```bash
# Run all tests
poetry run pytest

# With coverage report
poetry run pytest --cov=llm_content_extractor --cov-report=html

# Run specific tests
poetry run pytest tests/test_json_extractor.py
```

See `docs/PUBLISHING.md` for detailed publishing instructions.

Quick steps:

```bash
# 1. Update version
poetry version patch

# 2. Build
poetry build

# 3. Publish
poetry publish
```

Contributions are welcome! Please follow these steps:

- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request

This project is licensed under the MIT License - see the LICENSE file for details.
LLM Content Extractor is particularly useful for:
- 🤖 LLM Application Development: Extract structured data from model outputs
- 🔄 Data Pipelines: Clean and standardize AI-generated content
- 🧪 Testing Tools: Validate LLM output formats
- 📊 Data Processing: Batch process LLM responses

Q: Why is my JSON extraction failing?
A: Ensure the text contains a valid JSON structure. This library tries multiple strategies, but it cannot recover JSON that is completely corrupted.

Q: Can I extract multiple code blocks?
A: The current version extracts the first matching code block. To extract multiple blocks, use the `extract_all_blocks()` method on `CodeBlockExtractor` or call the function multiple times.

Q: Is there support for other formats?
A: Yes! You can add support for new formats by inheriting from `ContentExtractor` and registering the new class with the system.

Q: How do I enable strict mode?
A: Use the extractor classes directly:

```python
from llm_content_extractor import JSONExtractor

extractor = JSONExtractor(strict=True)
result = extractor.extract(text)
```

```python
from llm_content_extractor.strategies import CodeBlockExtractor

extractor = CodeBlockExtractor()
code = "def hello(): return 'world'"
language = extractor.detect_language(code)  # Returns 'python'
```

```python
from llm_content_extractor.strategies import CodeBlockExtractor

extractor = CodeBlockExtractor()
blocks = extractor.extract_all_blocks(multi_code_text)
for block in blocks:
    print(f"{block['language']}: {block['code']}")
```

```python
from llm_content_extractor.strategies import XMLExtractor, HTMLExtractor

xml_extractor = XMLExtractor()
is_valid = xml_extractor.is_valid_xml(xml_string)

html_extractor = HTMLExtractor()
is_valid = html_extractor.is_valid_html(html_string)
```

- Architecture - Detailed architecture documentation
- Publishing Guide - How to publish to PyPI
- Examples - Usage examples
Thanks to all contributors and developers using this project!
- Report Issues: GitHub Issues
- Feature Requests: GitHub Discussions
If this project helps you, please consider giving it a ⭐️!