File conversion library that transforms any file format into LLM-ready markdown and generates professional documents from markdown. Perfect for AI workflows, content processing, and document automation.

chamiles/FileAlchemy

FileAlchemy

Complete bidirectional file format conversion for AI workflows - seamlessly convert files to markdown and back to any format.

🚀 Overview

FileAlchemy is a file conversion library that converts supported file formats and URLs into LLM-ready markdown, and generates HTML, JSON, CSV, plain-text, and Office documents from markdown. It targets AI workflows, content processing, and document automation.

Key Capabilities

  • File → Markdown: Convert 17+ file formats to structured markdown
  • Markdown → File: Generate HTML, JSON, CSV, and Office documents from markdown
  • Web Scraping: Extract clean content from any URL
  • Bidirectional Workflows: Complete file processing pipelines
  • AI-Optimized: Output designed for LLM consumption and generation

📦 Installation

Basic Installation

pip install filealchemy

Full Installation (All Features)

pip install filealchemy
pip install PyPDF2 python-docx openpyxl python-pptx Pillow beautifulsoup4 lxml pandas requests readability-lxml

Development Installation

# Clone the repository
git clone <repository-url>
cd filealchemy

# Install in development mode
pip install -e .

# Or install with development dependencies
pip install -e ".[dev]"

🔄 Quick Start

File to Markdown

from filealchemy import FileAlchemyConverter

converter = FileAlchemyConverter()

# Convert any file
result = converter.convert("document.pdf")
print(result.content)

# Convert URL
result = converter.convert("https://example.com")
print(result.content)

# Auto-detect input type
result = converter.convert_auto("data.json")
print(result.content)

Markdown to File

markdown_content = """
# My Report

## Data Analysis
| Metric | Value |
|--------|-------|
| Sales  | $1000 |
| Growth | 15%   |

## Summary
Key findings from our analysis...
"""

# Generate multiple formats
converter.generate_file(markdown_content, "report.html", "text/html")
converter.generate_file(markdown_content, "report.docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document")
converter.generate_file(markdown_content, "data.json", "application/json")
converter.generate_file(markdown_content, "table.csv", "text/csv")
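
The `text/csv` output is described as table extraction. As a rough illustration of what that involves (a standard-library sketch, not FileAlchemy's actual implementation; the helper names `markdown_table_rows` and `rows_to_csv` are hypothetical), a pipe table like the one in the report above can be reduced to CSV rows as follows:

```python
import csv
import io
from typing import List


def markdown_table_rows(markdown: str) -> List[List[str]]:
    """Collect the cells of every pipe-table row, skipping separator rows."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue  # not a table row
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        # A separator row such as |-----|:---:| holds only dashes and colons.
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    return rows


def rows_to_csv(rows: List[List[str]]) -> str:
    """Serialize rows with the csv module so quoting is handled correctly."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


sample = """
| Metric | Value |
|--------|-------|
| Sales  | $1000 |
| Growth | 15%   |
"""
print(rows_to_csv(markdown_table_rows(sample)))
```

This sketch treats all tables in the input as one continuous list of rows; a real generator would presumably handle multiple tables and escaped pipes separately.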

🖥️ Command Line Usage

Convert to Markdown

# Convert file
filealchemy document.pdf -o output.md

# Convert URL  
filealchemy https://example.com --url -o webpage.md

# Auto-detect input type
filealchemy data.json --auto -o data.md

Generate from Markdown

# Generate HTML
filealchemy report.md --generate text/html -o report.html

# Generate Word document
filealchemy report.md --generate application/vnd.openxmlformats-officedocument.wordprocessingml.document -o report.docx

# Generate JSON
filealchemy report.md --generate application/json -o data.json

# List available formats
filealchemy --list-types          # Input formats
filealchemy --list-output-types   # Output formats
filealchemy --list-instructions   # Available instruction types
filealchemy --list-sample-data    # Available sample data types

# Show instructions and sample data
filealchemy --instructions docx   # Show DOCX generation instructions
filealchemy --sample-data json    # Show JSON sample data
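
When the CLI is driven from a script, building the argument list in one place avoids shell-quoting mistakes. The sketch below uses only the flags shown above (`--url`, `--generate`, `-o`); the helper name `build_filealchemy_command` is hypothetical, and the exact flag order the CLI accepts is an assumption based on these examples.

```python
import shlex
from typing import List, Optional


def build_filealchemy_command(
    source: str,
    output: str,
    generate: Optional[str] = None,
    is_url: bool = False,
) -> List[str]:
    """Build an argv list that mirrors the command-line examples above."""
    command = ["filealchemy", source]
    if is_url:
        command.append("--url")
    if generate is not None:
        command += ["--generate", generate]
    command += ["-o", output]
    return command


command = build_filealchemy_command("report.md", "report.html", generate="text/html")
print(shlex.join(command))

# To actually run it (requires the filealchemy CLI on PATH):
# import subprocess
# subprocess.run(command, check=True)
```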
## 📋 Supported Formats

### Input Formats (File → Markdown)
| Format | Extensions | Features | Dependencies |
|--------|------------|----------|--------------|
| **Text** | .txt, .md, .rst | Encoding detection, title extraction | None |
| **JSON** | .json | Structure analysis, pretty formatting | None |
| **PDF** | .pdf | Page extraction, metadata | PyPDF2 |
| **Word** | .docx | Headings, tables, formatting | python-docx |
| **Excel** | .xlsx | Multiple sheets, statistics | openpyxl, pandas |
| **PowerPoint** | .pptx | Slide content, notes | python-pptx |
| **Images** | .jpg, .png, .gif | EXIF data, dimensions | Pillow |
| **CSV** | .csv | Table formatting, statistics | pandas |
| **XML** | .xml | Structure analysis | beautifulsoup4 |
| **HTML** | .html | Semantic conversion | beautifulsoup4 |
| **URLs** | http/https | Content extraction, metadata | requests, readability-lxml |
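
Because most input formats rely on an optional package, it can help to check which ones are importable before converting. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical, and the mapping pairs each pip package from the table with the module name it is imported under.

```python
from importlib.util import find_spec
from typing import Dict, List

# pip package name -> top-level module name it is imported under
OPTIONAL_DEPENDENCIES: Dict[str, str] = {
    "PyPDF2": "PyPDF2",
    "python-docx": "docx",
    "openpyxl": "openpyxl",
    "python-pptx": "pptx",
    "Pillow": "PIL",
    "beautifulsoup4": "bs4",
    "pandas": "pandas",
    "requests": "requests",
    "readability-lxml": "readability",
}


def missing_packages(dependencies: Dict[str, str]) -> List[str]:
    """Return the pip names whose modules cannot be found, without importing them."""
    return sorted(
        package
        for package, module in dependencies.items()
        if find_spec(module) is None
    )


missing = missing_packages(OPTIONAL_DEPENDENCIES)
if missing:
    print("Install with: pip install " + " ".join(missing))
else:
    print("All optional dependencies are available.")
```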

### Output Formats (Markdown → File)
| Format | MIME Type | Features |
|--------|-----------|----------|
| **HTML** | text/html | Full documents, CSS styling, responsive |
| **JSON** | application/json | Multiple structures (document/data/api) |
| **CSV** | text/csv | Table extraction, list conversion |
| **Text** | text/plain | Clean formatting, structure preservation |
| **Word** | application/vnd.openxmlformats-officedocument.wordprocessingml.document | Professional formatting, styles |
| **Excel** | application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | Multiple modes, table extraction |
| **PowerPoint** | application/vnd.openxmlformats-officedocument.presentationml.presentation | Slide generation, layouts |
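
The Office MIME types are long and easy to mistype. One option is a small lookup keyed by file extension, sketched below; the extension-to-format pairing and the helper name `mime_type_for` are assumptions for illustration and are not part of the FileAlchemy API.

```python
from pathlib import Path

# MIME types copied from the output format table above.
OUTPUT_MIME_TYPES = {
    ".html": "text/html",
    ".json": "application/json",
    ".csv": "text/csv",
    ".txt": "text/plain",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}


def mime_type_for(output_path: str) -> str:
    """Look up the output MIME type from the file extension (case-insensitive)."""
    suffix = Path(output_path).suffix.lower()
    try:
        return OUTPUT_MIME_TYPES[suffix]
    except KeyError:
        raise ValueError(f"No output format listed for {suffix!r}") from None


print(mime_type_for("report.docx"))
```

With a converter instance at hand, this could be used as `converter.generate_file(content, path, mime_type_for(path))`.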

## 🔧 Advanced Usage

### HTML Generation with Styling
```python
# Generate styled HTML
converter.generate_file(
    markdown_content,
    "styled.html",
    "text/html",
    include_css=True,
    full_document=True
)
```

JSON with Different Structures

# Document structure
converter.generate_file(content, "doc.json", "application/json", structure_type="document")

# Data extraction
converter.generate_file(content, "data.json", "application/json", structure_type="data")

# API format
converter.generate_file(content, "api.json", "application/json", structure_type="api")

Office Documents with Custom Formatting

# Word document with custom styling
converter.generate_file(
    content, "report.docx", 
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    font_name="Arial",
    font_size=12,
    line_spacing=1.2
)

# Excel with specific mode
converter.generate_file(
    content, "data.xlsx",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    mode='tables'  # 'tables', 'outline', 'data', 'auto'
)

# PowerPoint with slide mode
converter.generate_file(
    content, "slides.pptx",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation", 
    mode='headers'  # 'headers', 'sections', 'auto'
)

Web Scraping Options

# Advanced URL conversion
result = converter.convert_url(
    "https://example.com",
    timeout=30,
    user_agent="FileAlchemy/1.0",
    include_links=True,
    include_images=True,
    clean_content=True,
    max_content_length=50000
)

Batch Processing

from pathlib import Path

# Convert multiple files to markdown
input_dir = Path("documents")
output_dir = Path("markdown")
output_dir.mkdir(parents=True, exist_ok=True)  # create the output folder if needed

for file_path in input_dir.glob("*"):
    if file_path.is_file():
        result = converter.convert(str(file_path))
        output_file = output_dir / f"{file_path.stem}.md"
        with open(output_file, 'w', encoding='utf-8') as f:
            f.write(result.content)

# Generate multiple formats from markdown
with open("source.md", 'r') as f:
    content = f.read()

formats = [
    ("text/html", "output.html"),
    ("application/json", "output.json"),
    ("text/csv", "output.csv"),
    ("application/vnd.openxmlformats-officedocument.wordprocessingml.document", "output.docx")
]

for mime_type, filename in formats:
    converter.generate_file(content, filename, mime_type)
## 🏗️ Architecture

### Project Structure

    filealchemy/
    ├── __init__.py          # Main API exports
    ├── converter.py         # Core FileAlchemyConverter class
    ├── result.py            # ConversionResult data structure
    ├── cli.py               # Command-line interface
    ├── converters/          # File-to-markdown converters
    │   ├── base.py          # Abstract base converter
    │   ├── text.py          # Text/Markdown files
    │   ├── json.py          # JSON structure analysis
    │   ├── pdf.py           # PDF text extraction
    │   ├── docx.py          # Word documents
    │   ├── xlsx.py          # Excel spreadsheets
    │   ├── pptx.py          # PowerPoint presentations
    │   ├── image.py         # Image metadata
    │   ├── csv.py           # CSV data analysis
    │   ├── xml.py           # XML structure
    │   ├── html.py          # HTML to markdown
    │   └── url.py           # Web scraping
    └── generators/          # Markdown-to-file generators
        ├── base.py          # Abstract base generator
        ├── text.py          # Plain text output
        ├── json.py          # JSON generation
        ├── html.py          # HTML with CSS
        ├── csv.py           # CSV from tables
        ├── docx.py          # Word documents
        ├── xlsx.py          # Excel spreadsheets
        └── pptx.py          # PowerPoint presentations


### Extensible Design
```python
# Add custom converter
from filealchemy import FileAlchemyConverter
from filealchemy.converters.base import BaseConverter
from filealchemy.result import ConversionResult

class MyConverter(BaseConverter):
    def convert(self, file_path, **kwargs):
        # Your conversion logic
        return ConversionResult(
            content="# Converted Content",
            title="My Document",
            metadata={"custom": "data"}
        )

# Register with main converter
converter = FileAlchemyConverter()
converter.converters['application/my-type'] = MyConverter()

# Add custom generator
from filealchemy.generators.base import BaseGenerator

class MyGenerator(BaseGenerator):
    def generate(self, markdown_content, output_path, **kwargs):
        # Your generation logic; this placeholder writes the markdown unchanged
        processed_content = markdown_content
        with open(output_path, 'w', encoding='utf-8') as f:
            f.write(processed_content)
        return ConversionResult(content=f"Generated: {output_path}")

converter.generators['application/my-output'] = MyGenerator()
```
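
The registration lines above suggest that handlers live in dictionaries keyed by MIME type. The standalone sketch below illustrates that dispatch pattern with hypothetical stand-in classes (`Result`, `Registry`); it does not import FileAlchemy and makes no claim about how `FileAlchemyConverter` resolves handlers internally.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Result:
    """Stand-in for ConversionResult, reduced to two fields."""
    content: str
    metadata: Dict[str, str] = field(default_factory=dict)


class Registry:
    """Hypothetical stand-in showing MIME-keyed handler dispatch."""

    def __init__(self) -> None:
        self.converters: Dict[str, Callable[[str], Result]] = {}

    def convert(self, file_path: str, mime_type: str) -> Result:
        try:
            handler = self.converters[mime_type]
        except KeyError:
            raise ValueError(f"No converter registered for {mime_type}") from None
        return handler(file_path)


registry = Registry()
registry.converters["application/my-type"] = lambda path: Result(f"# Converted {path}")
print(registry.convert("notes.my", "application/my-type").content)
```

Keeping the registry a plain dictionary means a new format needs only one assignment, which matches the registration style shown above.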

🧪 Testing

# Run all tests
python -m pytest

# Run with coverage
python -m pytest --cov=filealchemy --cov-report=html

# Run comprehensive test suite
python run_tests.py

🎭 Demos

Explore FileAlchemy's capabilities:

# Bidirectional conversion demo
python demos/demo_bidirectional.py

# Office file generation demo  
python demos/demo_office_generation.py

# URL scraping demo
python demos/url_demo.py

# Complete feature demonstration
python demos/complete_demo.py

🎯 Use Cases

AI/LLM Workflows

  • Document Preprocessing: Convert files to LLM-ready markdown
  • Content Generation: Generate professional documents from AI output
  • Data Extraction: Extract structured data for AI training
  • Workflow Automation: End-to-end document processing pipelines

Business Applications

  • Report Generation: Markdown → professional Word/PowerPoint documents
  • Data Analysis: Extract tables to Excel for analysis
  • Content Management: Multi-format publishing from single source
  • Documentation: Technical docs in multiple formats

Web Content Processing

  • Content Scraping: Clean extraction from web pages
  • Archive Creation: Convert web content to structured documents
  • Research Automation: Batch processing of online sources
  • Content Analysis: Extract and analyze web content

✨ Key Features

  • LLM Optimized: Minimal markup, maximum content with structure preservation
  • Professional Output: Office-compatible documents with clean formatting
  • Fast Processing: Instant text/JSON, 1-3 seconds for Office documents
  • Error Handling: Graceful degradation with detailed error messages
  • Cross-Platform: Consistent results across all operating systems
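
For batch jobs, a caller can apply the same keep-going approach on its own side by recording failures instead of stopping at the first one. The sketch below is a caller-side pattern, not FileAlchemy's internal error handling; `convert_many` and `fake_convert` are hypothetical names, and `fake_convert` stands in for a real conversion call.

```python
from typing import Callable, Dict, Iterable, Tuple


def convert_many(
    paths: Iterable[str],
    convert: Callable[[str], str],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Convert each path, recording failures instead of aborting the batch."""
    converted: Dict[str, str] = {}
    failed: Dict[str, str] = {}
    for path in paths:
        try:
            converted[path] = convert(path)
        except Exception as error:  # keep going; the reason is reported afterwards
            failed[path] = f"{type(error).__name__}: {error}"
    return converted, failed


def fake_convert(path: str) -> str:
    """Stand-in converter used only for this demonstration."""
    if path.endswith(".bin"):
        raise ValueError("unsupported format")
    return f"# {path}"


converted, failed = convert_many(["a.txt", "b.bin"], fake_convert)
print(converted)
print(failed)
```

With the library installed, the stand-in could be replaced by something like `lambda p: converter.convert(p).content`, following the Quick Start example.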

🛠️ Development

Quick Setup

# Clone and install in development mode
git clone <repository-url>
cd filealchemy
pip install -e ".[dev]"

# Run tests
python run_tests.py

Dependencies

  • Core: Python 3.8+ (no external dependencies for basic functionality)
  • Optional: PDF, Office, Image, and Web scraping libraries (auto-installed)

🔮 Roadmap

  • PDF Generation - Convert markdown to PDF documents
  • Template System - Customizable output templates
  • Plugin Architecture - Third-party format extensions
  • Cloud Integration - Direct cloud storage support
  • API Server - REST API for web-based conversion

📄 License

MIT License - see LICENSE file for details.

🤝 Support

Documentation

  • Usage Guide: Complete examples and tutorials
  • API Reference: Detailed method documentation
  • Format Guides: Specific instructions for each format
  • Troubleshooting: Common issues and solutions

Community

  • GitHub Issues: Bug reports and feature requests
  • Discussions: Community support and questions
  • Contributing: Guidelines for contributors
  • Examples: Sample code and use cases

FileAlchemy - Transform any content into any format with AI-optimized processing. Perfect for modern document workflows, content automation, and LLM integration. 🚀✨
