# FileAlchemy

Bidirectional file format conversion for AI workflows: convert files to markdown, and generate HTML, JSON, CSV, plain text, and Office documents from markdown.

## 🚀 Overview

FileAlchemy is a file conversion library that turns common file formats into LLM-ready markdown and generates professional documents from markdown. It is aimed at AI workflows, content processing, and document automation.

### Key Capabilities

- **File → Markdown**: Convert 17+ file formats to structured markdown
- **Markdown → File**: Generate HTML, JSON, CSV, plain text, and Office documents from markdown
- **Web Scraping**: Extract clean content from any URL
- **Bidirectional Workflows**: Complete file processing pipelines
- **AI-Optimized**: Output designed for LLM consumption and generation

## 📦 Installation

### Basic Installation

```bash
pip install filealchemy
```

### Full Installation (All Features)

```bash
pip install filealchemy
pip install PyPDF2 python-docx openpyxl python-pptx Pillow beautifulsoup4 lxml pandas requests readability-lxml
```
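
Because the optional libraries are installed separately, it can help to check which of them are importable before converting a given format. The following is a minimal sketch using only the standard library. The helper name `missing_optional` and the `OPTIONAL_PACKAGES` mapping are illustrative and not part of FileAlchemy; the module names are the ones these PyPI packages are known to install.

```python
from importlib.util import find_spec

# pip package name -> top-level module name that the package installs
OPTIONAL_PACKAGES = {
    "PyPDF2": "PyPDF2",
    "python-docx": "docx",
    "openpyxl": "openpyxl",
    "python-pptx": "pptx",
    "Pillow": "PIL",
    "beautifulsoup4": "bs4",
    "lxml": "lxml",
    "pandas": "pandas",
    "requests": "requests",
    "readability-lxml": "readability",
}


def missing_optional(packages=OPTIONAL_PACKAGES):
    """Return the pip names whose module cannot be found in this environment."""
    return [
        pip_name
        for pip_name, module in packages.items()
        if find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_optional()
    if missing:
        # Print a ready-to-run install command for whatever is absent
        print("pip install " + " ".join(missing))
    else:
        print("All optional dependencies are available.")
```

Running the script prints either a `pip install` line for the packages that are absent or a confirmation that everything is present.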

### Development Installation

```bash
# Clone the repository
git clone <repository-url>
cd filealchemy

# Install in development mode
pip install -e .

# Or install with development dependencies
pip install -e ".[dev]"
```

## 🔄 Quick Start

### File to Markdown

```python
from filealchemy import FileAlchemyConverter

converter = FileAlchemyConverter()

# Convert any file
result = converter.convert("document.pdf")
print(result.content)

# Convert URL
result = converter.convert("https://example.com")
print(result.content)

# Auto-detect input type
result = converter.convert_auto("data.json")
print(result.content)
```
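
One plausible way to picture auto-detection is as a check on whether the input string parses as an HTTP(S) URL. This README does not say how `convert_auto` actually decides, so the sketch below is only an assumption, and `classify_input` is a hypothetical helper rather than a FileAlchemy function.

```python
from urllib.parse import urlparse


def classify_input(source: str) -> str:
    """Return "url" for http/https addresses and "file" for everything else."""
    parsed = urlparse(source)
    # Require both a web scheme and a host, so "http://" alone is not a URL
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    return "file"


print(classify_input("https://example.com"))
print(classify_input("data.json"))
```

Checking the scheme explicitly also keeps Windows paths such as `C:\docs\report.pdf` on the file branch, since their drive letter parses as a non-web scheme.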

### Markdown to File

```python
markdown_content = """
# My Report

## Data Analysis
| Metric | Value |
|--------|-------|
| Sales  | $1000 |
| Growth | 15%   |

## Summary
Key findings from our analysis...
"""

# Generate multiple formats (reuses the `converter` created above)
converter.generate_file(markdown_content, "report.html", "text/html")
converter.generate_file(markdown_content, "report.docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document")
converter.generate_file(markdown_content, "data.json", "application/json")
converter.generate_file(markdown_content, "table.csv", "text/csv")
```
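
To make the `text/csv` case concrete, the sketch below shows one way a pipe table like the one in the report could be pulled out of markdown and written as CSV. It uses only the standard library and is not FileAlchemy's implementation. `markdown_table_to_csv` is a hypothetical name, and the parsing deliberately ignores escaped pipes inside cells.

```python
import csv
import io
import re


def markdown_table_to_csv(markdown: str) -> str:
    """Convert the first pipe table found in `markdown` to CSV text."""
    rows = []
    for line in markdown.splitlines():
        stripped = line.strip()
        if not stripped.startswith("|"):
            if rows:
                break  # first non-table line after the table ends it
            continue
        cells = [cell.strip() for cell in stripped.strip("|").split("|")]
        # Skip the |---|---| separator row (optionally with alignment colons)
        if all(re.fullmatch(r":?-+:?", cell) for cell in cells):
            continue
        rows.append(cells)
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


sample = """
| Metric | Value |
|--------|-------|
| Sales  | $1000 |
| Growth | 15%   |
"""
print(markdown_table_to_csv(sample))
```

Using `csv.writer` rather than joining cells with commas means values that contain commas or quotes are escaped correctly.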

## 🖥️ Command Line Usage

### Convert to Markdown

```bash
# Convert file
filealchemy document.pdf -o output.md

# Convert URL
filealchemy https://example.com --url -o webpage.md

# Auto-detect input type
filealchemy data.json --auto -o data.md
```

### Generate from Markdown

```bash
# Generate HTML
filealchemy report.md --generate text/html -o report.html

# Generate Word document
filealchemy report.md --generate application/vnd.openxmlformats-officedocument.wordprocessingml.document -o report.docx

# Generate JSON
filealchemy report.md --generate application/json -o data.json

# List available formats
filealchemy --list-types          # Input formats
filealchemy --list-output-types   # Output formats
filealchemy --list-instructions   # Available instruction types
filealchemy --list-sample-data    # Available sample data types

# Show instructions and sample data
filealchemy --instructions docx   # Show DOCX generation instructions
filealchemy --sample-data json    # Show JSON sample data
```

## 📋 Supported Formats

### Input Formats (File → Markdown)
| Format | Extensions | Features | Dependencies |
|--------|------------|----------|--------------|
| **Text** | .txt, .md, .rst | Encoding detection, title extraction | None |
| **JSON** | .json | Structure analysis, pretty formatting | None |
| **PDF** | .pdf | Page extraction, metadata | PyPDF2 |
| **Word** | .docx | Headings, tables, formatting | python-docx |
| **Excel** | .xlsx | Multiple sheets, statistics | openpyxl, pandas |
| **PowerPoint** | .pptx | Slide content, notes | python-pptx |
| **Images** | .jpg, .png, .gif | EXIF data, dimensions | Pillow |
| **CSV** | .csv | Table formatting, statistics | pandas |
| **XML** | .xml | Structure analysis | beautifulsoup4 |
| **HTML** | .html | Semantic conversion | beautifulsoup4 |
| **URLs** | http/https | Content extraction, metadata | requests, readability-lxml |

### Output Formats (Markdown → File)
| Format | MIME Type | Features |
|--------|-----------|----------|
| **HTML** | text/html | Full documents, CSS styling, responsive |
| **JSON** | application/json | Multiple structures (document/data/api) |
| **CSV** | text/csv | Table extraction, list conversion |
| **Text** | text/plain | Clean formatting, structure preservation |
| **Word** | application/vnd.openxmlformats-officedocument.wordprocessingml.document | Professional formatting, styles |
| **Excel** | application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | Multiple modes, table extraction |
| **PowerPoint** | application/vnd.openxmlformats-officedocument.presentationml.presentation | Slide generation, layouts |
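
The MIME types for Office formats are long and easy to mistype. A small lookup keyed by file extension, built from the table above, can supply them. `OUTPUT_MIME_TYPES` and `mime_for` are illustrative names and not part of the FileAlchemy API, and the `.txt` extension for `text/plain` is an assumption based on common practice.

```python
from pathlib import Path

# Extension -> MIME type, taken from the output formats table
OUTPUT_MIME_TYPES = {
    ".html": "text/html",
    ".json": "application/json",
    ".csv": "text/csv",
    ".txt": "text/plain",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}


def mime_for(filename: str) -> str:
    """Look up the output MIME type for a filename, ignoring extension case."""
    suffix = Path(filename).suffix.lower()
    try:
        return OUTPUT_MIME_TYPES[suffix]
    except KeyError:
        raise ValueError(f"No output format registered for {suffix!r}") from None


print(mime_for("report.docx"))
```

With such a helper, a call could be written as `converter.generate_file(markdown_content, "report.docx", mime_for("report.docx"))`.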

## 🔧 Advanced Usage

### HTML Generation with Styling
```python
# Generate styled HTML
converter.generate_file(
    markdown_content,
    "styled.html",
    "text/html",
    include_css=True,
    full_document=True
)
```

### JSON with Different Structures

```python
# Document structure
converter.generate_file(content, "doc.json", "application/json", structure_type="document")

# Data extraction
converter.generate_file(content, "data.json", "application/json", structure_type="data")

# API format
converter.generate_file(content, "api.json", "application/json", structure_type="api")
```

### Office Documents with Custom Formatting

```python
# Word document with custom styling
converter.generate_file(
    content, "report.docx", 
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    font_name="Arial",
    font_size=12,
    line_spacing=1.2
)

# Excel with specific mode
converter.generate_file(
    content, "data.xlsx",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    mode='tables'  # 'tables', 'outline', 'data', 'auto'
)

# PowerPoint with slide mode
converter.generate_file(
    content, "slides.pptx",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation", 
    mode='headers'  # 'headers', 'sections', 'auto'
)
```

### Web Scraping Options

```python
# Advanced URL conversion
result = converter.convert_url(
    "https://example.com",
    timeout=30,
    user_agent="FileAlchemy/1.0",
    include_links=True,
    include_images=True,
    clean_content=True,
    max_content_length=50000
)
```

### Batch Processing

```python
from pathlib import Path

# Convert multiple files to markdown
input_dir = Path("documents")
output_dir = Path("markdown")

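# Create the output folder first so the writes below cannot fail on a missing directory
output_dir.mkdir(parents=True, exist_ok=True)

for file_path in input_dir.glob("*"):
    if file_path.is_file():
        result = converter.convert(str(file_path))
        output_file = output_dir / f"{file_path.stem}.md"
        # Write UTF-8 explicitly so output does not depend on the platform default
        with open(output_file, 'w', encoding='utf-8') as f:
            f.write(result.content)

# Generate multiple formats from markdown
with open("source.md", 'r', encoding='utf-8') as f:
    content = f.read()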

formats = [
    ("text/html", "output.html"),
    ("application/json", "output.json"),
    ("text/csv", "output.csv"),
    ("application/vnd.openxmlformats-officedocument.wordprocessingml.document", "output.docx")
]

for mime_type, filename in formats:
    converter.generate_file(content, filename, mime_type)
```

## 🏗️ Architecture

### Project Structure

```
filealchemy/
├── __init__.py          # Main API exports
├── converter.py         # Core FileAlchemyConverter class
├── result.py            # ConversionResult data structure
├── cli.py               # Command-line interface
├── converters/          # File-to-markdown converters
│   ├── base.py          # Abstract base converter
│   ├── text.py          # Text/Markdown files
│   ├── json.py          # JSON structure analysis
│   ├── pdf.py           # PDF text extraction
│   ├── docx.py          # Word documents
│   ├── xlsx.py          # Excel spreadsheets
│   ├── pptx.py          # PowerPoint presentations
│   ├── image.py         # Image metadata
│   ├── csv.py           # CSV data analysis
│   ├── xml.py           # XML structure
│   ├── html.py          # HTML to markdown
│   └── url.py           # Web scraping
└── generators/          # Markdown-to-file generators
    ├── base.py          # Abstract base generator
    ├── text.py          # Plain text output
    ├── json.py          # JSON generation
    ├── html.py          # HTML with CSS
    ├── csv.py           # CSV from tables
    ├── docx.py          # Word documents
    ├── xlsx.py          # Excel spreadsheets
    └── pptx.py          # PowerPoint presentations
```


### Extensible Design
```python
# Add custom converter
from filealchemy import FileAlchemyConverter
from filealchemy.converters.base import BaseConverter
from filealchemy.result import ConversionResult

class MyConverter(BaseConverter):
    def convert(self, file_path, **kwargs):
        # Your conversion logic
        return ConversionResult(
            content="# Converted Content",
            title="My Document",
            metadata={"custom": "data"}
        )

# Register with main converter
converter = FileAlchemyConverter()
converter.converters['application/my-type'] = MyConverter()

# Add custom generator
from filealchemy.generators.base import BaseGenerator

class MyGenerator(BaseGenerator):
    def generate(self, markdown_content, output_path, **kwargs):
        # Your generation logic; this placeholder passes the markdown through unchanged
        processed_content = markdown_content
        with open(output_path, 'w', encoding='utf-8') as f:
            f.write(processed_content)
        return ConversionResult(content=f"Generated: {output_path}")

converter.generators['application/my-output'] = MyGenerator()
```
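
The registration lines above suggest that the converter keeps dictionaries keyed by MIME type and dispatches on them. The stand-alone sketch below illustrates that pattern without importing FileAlchemy. `Result`, `Registry`, `UpperCaseConverter`, and the `application/x-shout` type are invented for the example, and the real classes may well differ.

```python
from dataclasses import dataclass, field


@dataclass
class Result:
    """Stand-in for a conversion result: markdown text plus metadata."""
    content: str
    metadata: dict = field(default_factory=dict)


class UpperCaseConverter:
    """Toy converter that upper-cases its input under a heading."""

    def convert(self, text: str, **kwargs) -> Result:
        return Result(content=f"# {text.upper()}", metadata={"chars": len(text)})


class Registry:
    """Maps MIME types to converter instances and dispatches to them."""

    def __init__(self):
        self.converters = {}

    def convert(self, text: str, mime_type: str, **kwargs) -> Result:
        try:
            converter = self.converters[mime_type]
        except KeyError:
            raise ValueError(f"Unsupported type: {mime_type}") from None
        return converter.convert(text, **kwargs)


registry = Registry()
# Registering a new type is a single dictionary assignment
registry.converters["application/x-shout"] = UpperCaseConverter()
result = registry.convert("hello", "application/x-shout")
print(result.content)
```

Keeping the registry a plain dictionary means a new format needs no changes to the dispatching code, only a new entry.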

## 🧪 Testing

```bash
# Run all tests
python -m pytest

# Run with coverage
python -m pytest --cov=filealchemy --cov-report=html

# Run comprehensive test suite
python run_tests.py
```

## 🎭 Demos

Explore FileAlchemy's capabilities:

```bash
# Bidirectional conversion demo
python demos/demo_bidirectional.py

# Office file generation demo
python demos/demo_office_generation.py

# URL scraping demo
python demos/url_demo.py

# Complete feature demonstration
python demos/complete_demo.py
```

## 🎯 Use Cases

### AI/LLM Workflows

- **Document Preprocessing**: Convert files to LLM-ready markdown
- **Content Generation**: Generate professional documents from AI output
- **Data Extraction**: Extract structured data for AI training
- **Workflow Automation**: End-to-end document processing pipelines

### Business Applications

- **Report Generation**: Markdown → Professional Word/PowerPoint documents
- **Data Analysis**: Extract tables to Excel for analysis
- **Content Management**: Multi-format publishing from single source
- **Documentation**: Technical docs in multiple formats

### Web Content Processing

- **Content Scraping**: Clean extraction from web pages
- **Archive Creation**: Convert web content to structured documents
- **Research Automation**: Batch processing of online sources
- **Content Analysis**: Extract and analyze web content

## ✨ Key Features

- **LLM Optimized**: Minimal markup, maximum content with structure preservation
- **Professional Output**: Office-compatible documents with clean formatting
- **Fast Processing**: Instant text/JSON, 1-3 seconds for Office documents
- **Error Handling**: Graceful degradation with detailed error messages
- **Cross-Platform**: Consistent results across all operating systems

## 🛠️ Development

### Quick Setup

```bash
# Clone and install in development mode
git clone <repository-url>
cd filealchemy
pip install -e ".[dev]"

# Run tests
python run_tests.py
```

### Dependencies

- **Core**: Python 3.8+ (no external dependencies for basic functionality)
- **Optional**: PDF, Office, image, and web scraping libraries (auto-installed)

## 🔮 Roadmap

- **PDF Generation**: Convert markdown to PDF documents
- **Template System**: Customizable output templates
- **Plugin Architecture**: Third-party format extensions
- **Cloud Integration**: Direct cloud storage support
- **API Server**: REST API for web-based conversion

## 📄 License

MIT License - see LICENSE file for details.

## 🤝 Support

### Documentation

- **Usage Guide**: Complete examples and tutorials
- **API Reference**: Detailed method documentation
- **Format Guides**: Specific instructions for each format
- **Troubleshooting**: Common issues and solutions

### Community

- **GitHub Issues**: Bug reports and feature requests
- **Discussions**: Community support and questions
- **Contributing**: Guidelines for contributors
- **Examples**: Sample code and use cases

FileAlchemy - Transform any content into any format with AI-optimized processing. Perfect for modern document workflows, content automation, and LLM integration. 🚀✨
