# FileAlchemy

Bidirectional file format conversion for AI workflows: convert files to markdown, and markdown back to other formats.

FileAlchemy is a file conversion library that transforms 17+ file formats into LLM-ready markdown and generates HTML, JSON, CSV, plain-text, and Office documents from markdown. It is aimed at AI workflows, content processing, and document automation.
- File → Markdown: Convert 17+ file formats to structured markdown
- Markdown → File: Generate HTML, JSON, CSV, Office documents from markdown
- Web Scraping: Extract clean content from any URL
- Bidirectional Workflows: Complete file processing pipelines
- AI-Optimized: Output designed for LLM consumption and generation
## Installation

```bash
pip install filealchemy
```

To use every converter and generator, also install the optional dependencies:

```bash
pip install PyPDF2 python-docx openpyxl python-pptx Pillow beautifulsoup4 lxml pandas requests readability-lxml
```

To work on FileAlchemy itself, install it from source:

```bash
# Clone the repository
git clone <repository-url>
cd filealchemy
# Install in development mode
pip install -e .
# Or install with development dependencies
pip install -e ".[dev]"
```
## Quick Start

### Convert Files to Markdown

```python
from filealchemy import FileAlchemyConverter

converter = FileAlchemyConverter()

# Convert any file
result = converter.convert("document.pdf")
print(result.content)

# Convert URL
result = converter.convert("https://example.com")
print(result.content)

# Auto-detect input type
result = converter.convert_auto("data.json")
print(result.content)
```

### Generate Files from Markdown

```python
markdown_content = """
# My Report
## Data Analysis
| Metric | Value |
|--------|-------|
| Sales | $1000 |
| Growth | 15% |
## Summary
Key findings from our analysis...
"""

# Generate multiple formats
converter.generate_file(markdown_content, "report.html", "text/html")
converter.generate_file(markdown_content, "report.docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document")
converter.generate_file(markdown_content, "data.json", "application/json")
converter.generate_file(markdown_content, "table.csv", "text/csv")
```
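
To give a feel for what table extraction to CSV involves, here is a minimal standalone sketch using only the standard library. It is not FileAlchemy's implementation; the function name `markdown_table_to_csv` is hypothetical, and the sketch assumes simple pipe tables without escaped `|` characters inside cells.

```python
import csv
import io


def markdown_table_to_csv(markdown: str) -> str:
    """Return the first pipe table found in `markdown` as CSV text."""
    rows = []
    for line in markdown.splitlines():
        stripped = line.strip()
        if not stripped.startswith("|"):
            if rows:
                break  # The first table has ended
            continue
        cells = [cell.strip() for cell in stripped.strip("|").split("|")]
        # Skip the delimiter row, e.g. |--------|-------|
        if all(cell and set(cell) <= set("-: ") for cell in cells):
            continue
        rows.append(cells)

    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


sample = """
# My Report
| Metric | Value |
|--------|-------|
| Sales | $1000 |
| Growth | 15% |
"""
print(markdown_table_to_csv(sample))
```

Only the first table is returned here; a real generator would also need to decide how to handle several tables or lists in one document.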
## Command Line Usage

```bash
# Convert file
filealchemy document.pdf -o output.md
# Convert URL
filealchemy https://example.com --url -o webpage.md
# Auto-detect input type
filealchemy data.json --auto -o data.md

# Generate HTML
filealchemy report.md --generate text/html -o report.html
# Generate Word document
filealchemy report.md --generate application/vnd.openxmlformats-officedocument.wordprocessingml.document -o report.docx
# Generate JSON
filealchemy report.md --generate application/json -o data.json

# List available formats
filealchemy --list-types          # Input formats
filealchemy --list-output-types   # Output formats
filealchemy --list-instructions   # Available instruction types
filealchemy --list-sample-data    # Available sample data types

# Show instructions and sample data
filealchemy --instructions docx   # Show DOCX generation instructions
filealchemy --sample-data json    # Show JSON sample data
```

## Supported Formats
### Input Formats (File → Markdown)
| Format | Extensions | Features | Dependencies |
|--------|------------|----------|--------------|
| **Text** | .txt, .md, .rst | Encoding detection, title extraction | None |
| **JSON** | .json | Structure analysis, pretty formatting | None |
| **PDF** | .pdf | Page extraction, metadata | PyPDF2 |
| **Word** | .docx | Headings, tables, formatting | python-docx |
| **Excel** | .xlsx | Multiple sheets, statistics | openpyxl, pandas |
| **PowerPoint** | .pptx | Slide content, notes | python-pptx |
| **Images** | .jpg, .png, .gif | EXIF data, dimensions | Pillow |
| **CSV** | .csv | Table formatting, statistics | pandas |
| **XML** | .xml | Structure analysis | beautifulsoup4 |
| **HTML** | .html | Semantic conversion | beautifulsoup4 |
| **URLs** | http/https | Content extraction, metadata | requests, readability-lxml |
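
As an illustration of how an input might be classified before conversion, the sketch below separates URLs from files and maps extensions to the format names in the table above. It is an assumption about the general approach, not the logic behind `convert_auto`; the names `EXTENSION_FORMATS` and `detect_input_format` are hypothetical.

```python
from pathlib import Path
from urllib.parse import urlparse

# Extension-to-format mapping copied from the Input Formats table.
EXTENSION_FORMATS = {
    ".txt": "Text", ".md": "Text", ".rst": "Text",
    ".json": "JSON", ".pdf": "PDF", ".docx": "Word",
    ".xlsx": "Excel", ".pptx": "PowerPoint",
    ".jpg": "Images", ".png": "Images", ".gif": "Images",
    ".csv": "CSV", ".xml": "XML", ".html": "HTML",
}


def detect_input_format(source: str) -> str:
    """Classify `source` as a URL or one of the file formats listed above."""
    if urlparse(source).scheme in ("http", "https"):
        return "URLs"
    suffix = Path(source).suffix.lower()
    return EXTENSION_FORMATS.get(suffix, "Unknown")


for candidate in ("https://example.com", "document.pdf", "archive.zip"):
    print(candidate, "->", detect_input_format(candidate))
```

Matching on the extension alone is the simplest option; inspecting file contents would be more robust for misnamed files.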
### Output Formats (Markdown → File)
| Format | MIME Type | Features |
|--------|-----------|----------|
| **HTML** | text/html | Full documents, CSS styling, responsive |
| **JSON** | application/json | Multiple structures (document/data/api) |
| **CSV** | text/csv | Table extraction, list conversion |
| **Text** | text/plain | Clean formatting, structure preservation |
| **Word** | application/vnd.openxmlformats-officedocument.wordprocessingml.document | Professional formatting, styles |
| **Excel** | application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | Multiple modes, table extraction |
| **PowerPoint** | application/vnd.openxmlformats-officedocument.presentationml.presentation | Slide generation, layouts |
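
The Office MIME types are long and easy to mistype, so a small lookup keyed by file extension can help when calling `generate_file`. This is a convenience sketch, not part of FileAlchemy: `OUTPUT_MIME_TYPES` and `mime_type_for` are hypothetical names, and the extensions (for example `.txt` for `text/plain`) are assumed, since the table lists MIME types only.

```python
from pathlib import Path

# MIME types copied from the Output Formats table; extensions are assumed.
OUTPUT_MIME_TYPES = {
    ".html": "text/html",
    ".json": "application/json",
    ".csv": "text/csv",
    ".txt": "text/plain",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}


def mime_type_for(output_path: str) -> str:
    """Return the MIME type for `output_path`, based on its extension."""
    suffix = Path(output_path).suffix.lower()
    try:
        return OUTPUT_MIME_TYPES[suffix]
    except KeyError:
        raise ValueError(f"No output format listed for {suffix!r}") from None


print(mime_type_for("report.docx"))
# Possible use, assuming a FileAlchemyConverter instance named `converter`:
# converter.generate_file(content, "report.docx", mime_type_for("report.docx"))
```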
## Advanced Usage
### HTML Generation with Styling
```python
# Generate styled HTML
converter.generate_file(
    markdown_content,
    "styled.html",
    "text/html",
    include_css=True,
    full_document=True,
)
```

### JSON Structure Types

```python
# Document structure
converter.generate_file(content, "doc.json", "application/json", structure_type="document")
# Data extraction
converter.generate_file(content, "data.json", "application/json", structure_type="data")
# API format
converter.generate_file(content, "api.json", "application/json", structure_type="api")
```

### Office Document Options

```python
# Word document with custom styling
converter.generate_file(
    content, "report.docx",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    font_name="Arial",
    font_size=12,
    line_spacing=1.2,
)

# Excel with specific mode
converter.generate_file(
    content, "data.xlsx",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    mode='tables',  # 'tables', 'outline', 'data', 'auto'
)

# PowerPoint with slide mode
converter.generate_file(
    content, "slides.pptx",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    mode='headers',  # 'headers', 'sections', 'auto'
)
```

### URL Conversion Options

```python
# Advanced URL conversion
result = converter.convert_url(
    "https://example.com",
    timeout=30,
    user_agent="FileAlchemy/1.0",
    include_links=True,
    include_images=True,
    clean_content=True,
    max_content_length=50000,
)
```

### Batch Processing

```python
from pathlib import Path

# Convert multiple files to markdown
input_dir = Path("documents")
output_dir = Path("markdown")
output_dir.mkdir(parents=True, exist_ok=True)  # Create the output directory if needed
for file_path in input_dir.glob("*"):
    if file_path.is_file():
        result = converter.convert(str(file_path))
        output_file = output_dir / f"{file_path.stem}.md"
        with open(output_file, 'w', encoding='utf-8') as f:
            f.write(result.content)

# Generate multiple formats from markdown
with open("source.md", 'r', encoding='utf-8') as f:
    content = f.read()
formats = [
    ("text/html", "output.html"),
    ("application/json", "output.json"),
    ("text/csv", "output.csv"),
    ("application/vnd.openxmlformats-officedocument.wordprocessingml.document", "output.docx"),
]
for mime_type, filename in formats:
    converter.generate_file(content, filename, mime_type)
```

## Architecture
### Project Structure
```
filealchemy/
├── __init__.py          # Main API exports
├── converter.py         # Core FileAlchemyConverter class
├── result.py            # ConversionResult data structure
├── cli.py               # Command-line interface
├── converters/          # File-to-markdown converters
│   ├── base.py          # Abstract base converter
│   ├── text.py          # Text/Markdown files
│   ├── json.py          # JSON structure analysis
│   ├── pdf.py           # PDF text extraction
│   ├── docx.py          # Word documents
│   ├── xlsx.py          # Excel spreadsheets
│   ├── pptx.py          # PowerPoint presentations
│   ├── image.py         # Image metadata
│   ├── csv.py           # CSV data analysis
│   ├── xml.py           # XML structure
│   ├── html.py          # HTML to markdown
│   └── url.py           # Web scraping
└── generators/          # Markdown-to-file generators
    ├── base.py          # Abstract base generator
    ├── text.py          # Plain text output
    ├── json.py          # JSON generation
    ├── html.py          # HTML with CSS
    ├── csv.py           # CSV from tables
    ├── docx.py          # Word documents
    ├── xlsx.py          # Excel spreadsheets
    └── pptx.py          # PowerPoint presentations
```
### Extensible Design
```python
from filealchemy import FileAlchemyConverter
from filealchemy.converters.base import BaseConverter
from filealchemy.generators.base import BaseGenerator
from filealchemy.result import ConversionResult


# Add custom converter
class MyConverter(BaseConverter):
    def convert(self, file_path, **kwargs):
        # Your conversion logic
        return ConversionResult(
            content="# Converted Content",
            title="My Document",
            metadata={"custom": "data"},
        )


# Register with main converter
converter = FileAlchemyConverter()
converter.converters['application/my-type'] = MyConverter()


# Add custom generator
class MyGenerator(BaseGenerator):
    def generate(self, markdown_content, output_path, **kwargs):
        # Your generation logic; this placeholder writes the markdown unchanged
        processed_content = markdown_content
        with open(output_path, 'w') as f:
            f.write(processed_content)
        return ConversionResult(content=f"Generated: {output_path}")


converter.generators['application/my-output'] = MyGenerator()
```
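
The registration pattern above amounts to a dictionary keyed by MIME type. The standalone sketch below shows how such a registry could dispatch a call to the registered handler. It does not import FileAlchemy: `Result`, `UppercaseConverter`, and `Registry` are hypothetical stand-ins, and the real `FileAlchemyConverter` may resolve handlers differently.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Result:
    """Stand-in for ConversionResult with the fields used above."""
    content: str
    title: str = ""
    metadata: dict = field(default_factory=dict)


class UppercaseConverter:
    """Hypothetical converter: returns the file's text in upper case."""

    def convert(self, file_path, **kwargs):
        with open(file_path, encoding="utf-8") as f:
            return Result(content=f.read().upper(), title=str(file_path))


class Registry:
    """Hypothetical registry that mirrors the `converters` dict."""

    def __init__(self):
        self.converters = {}

    def convert(self, file_path, mime_type, **kwargs):
        try:
            handler = self.converters[mime_type]
        except KeyError:
            raise ValueError(f"No converter registered for {mime_type!r}") from None
        return handler.convert(file_path, **kwargs)


registry = Registry()
registry.converters["application/my-type"] = UppercaseConverter()

with tempfile.TemporaryDirectory() as tmp:
    sample_file = Path(tmp) / "note.txt"
    sample_file.write_text("hello", encoding="utf-8")
    result = registry.convert(sample_file, "application/my-type")

print(result.content)
```

Keying handlers by MIME type keeps the core class unchanged when a new format is added, which is the main appeal of this design.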
## Testing

```bash
# Run all tests
python -m pytest
# Run with coverage
python -m pytest --cov=filealchemy --cov-report=html
# Run comprehensive test suite
python run_tests.py
```
## Demos

Explore FileAlchemy's capabilities:

```bash
# Bidirectional conversion demo
python demos/demo_bidirectional.py
# Office file generation demo
python demos/demo_office_generation.py
# URL scraping demo
python demos/url_demo.py
# Complete feature demonstration
python demos/complete_demo.py
```
## Use Cases

### AI Workflows

- Document Preprocessing: Convert files to LLM-ready markdown
- Content Generation: Generate professional documents from AI output
- Data Extraction: Extract structured data for AI training
- Workflow Automation: End-to-end document processing pipelines

### Business Automation

- Report Generation: Markdown → Professional Word/PowerPoint documents
- Data Analysis: Extract tables to Excel for analysis
- Content Management: Multi-format publishing from single source
- Documentation: Technical docs in multiple formats

### Web Content

- Content Scraping: Clean extraction from web pages
- Archive Creation: Convert web content to structured documents
- Research Automation: Batch processing of online sources
- Content Analysis: Extract and analyze web content
## Key Benefits

- LLM Optimized: Minimal markup, maximum content with structure preservation
- Professional Output: Office-compatible documents with clean formatting
- Fast Processing: Instant text/JSON, 1-3 seconds for Office documents
- Error Handling: Graceful degradation with detailed error messages
- Cross-Platform: Consistent results across all operating systems
## Contributing

```bash
# Clone and install in development mode
git clone <repository-url>
cd filealchemy
pip install -e ".[dev]"
# Run tests
python run_tests.py
```
## Requirements

- Core: Python 3.8+ (no external dependencies for basic functionality)
- Optional: PDF, Office, Image, and Web scraping libraries (auto-installed)
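
Since several converters rely on optional libraries, it can be useful to check which of them are importable in the current environment. The sketch below uses only the standard library; the package-to-module mapping reflects the usual import names for the packages in the install command, and `missing_packages` is a hypothetical helper, not a FileAlchemy function.

```python
from importlib.util import find_spec

# pip package name -> top-level module name it installs.
OPTIONAL_MODULES = {
    "PyPDF2": "PyPDF2",
    "python-docx": "docx",
    "openpyxl": "openpyxl",
    "python-pptx": "pptx",
    "Pillow": "PIL",
    "beautifulsoup4": "bs4",
    "lxml": "lxml",
    "pandas": "pandas",
    "requests": "requests",
    "readability-lxml": "readability",
}


def missing_packages(modules=OPTIONAL_MODULES):
    """Return the pip package names whose module cannot be found."""
    return sorted(pkg for pkg, module in modules.items() if find_spec(module) is None)


print("Missing optional packages:", missing_packages() or "none")
```

`find_spec` only locates a module without importing it, so the check stays cheap even for heavy libraries such as pandas.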
## Roadmap

- PDF Generation: Convert markdown to PDF documents
- Template System: Customizable output templates
- Plugin Architecture: Third-party format extensions
- Cloud Integration: Direct cloud storage support
- API Server: REST API for web-based conversion
## License

MIT License. See the LICENSE file for details.
## Documentation

- Usage Guide: Complete examples and tutorials
- API Reference: Detailed method documentation
- Format Guides: Specific instructions for each format
- Troubleshooting: Common issues and solutions

## Support

- GitHub Issues: Bug reports and feature requests
- Discussions: Community support and questions
- Contributing: Guidelines for contributors
- Examples: Sample code and use cases
FileAlchemy - Transform any content into any format with AI-optimized processing. Perfect for modern document workflows, content automation, and LLM integration.