Add support for parsing DOCX files #96

Open
hawkeyexl opened this issue Mar 17, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@hawkeyexl
Contributor

Lots of writers author in Word format. Internally, a DOCX file is a ZIP package of XML parts (Office Open XML) that, together with the relevant metadata, can be parsed. Because the underlying content is much more structured than something like Markdown, we'll have to approach the style config and parsing differently, but we should still be able to identify heading levels, bold text, and more.
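
To make the "heading levels and bolding" idea concrete, here is a rough sketch of reading both out of the main document part (`word/document.xml`) once it has been unzipped. The function name `parseParagraphs`, the result types, and the sample XML are all made up for illustration. It assumes the default English style IDs (`Heading1` through `Heading9`) and uses regular expressions only to stay dependency-free; a real implementation would likely want a proper XML parser, and this sketch does not handle nested paragraphs, self-closing `<w:p .../>` elements, or XML entity decoding.

```typescript
// One run of text inside a paragraph, with its bold flag.
interface ParsedRun {
  text: string;
  bold: boolean;
}

// Simplified view of a paragraph: heading level (if any) plus its runs.
interface ParsedParagraph {
  headingLevel: number | null; // 1-9 for "HeadingN" styles, null otherwise
  runs: ParsedRun[];
}

// Hypothetical helper: extracts paragraphs from the raw XML of word/document.xml.
function parseParagraphs(documentXml: string): ParsedParagraph[] {
  // Matches <w:p> or <w:p attr="..."> but not <w:pPr>, up to the first </w:p>.
  const paragraphRe = /<w:p(?:\s[^>]*)?>([\s\S]*?)<\/w:p>/g;
  const paragraphs: ParsedParagraph[] = [];
  let p: RegExpExecArray | null;

  while ((p = paragraphRe.exec(documentXml)) !== null) {
    const body = p[1];

    // Assumption: headings use the default style IDs "Heading1".."Heading9".
    const style = /<w:pStyle\s+w:val="Heading(\d)"\s*\/>/.exec(body);
    const headingLevel = style ? Number(style[1]) : null;

    // Matches <w:r> runs but not <w:rPr>, <w:rStyle>, or <w:rFonts>.
    const runRe = /<w:r(?:\s[^>]*)?>([\s\S]*?)<\/w:r>/g;
    const runs: ParsedRun[] = [];
    let r: RegExpExecArray | null;

    while ((r = runRe.exec(body)) !== null) {
      const runBody = r[1];

      // <w:b/> turns bold on; w:val="0", "false", or "off" turns it off.
      const boldTag = /<w:b(?:\s+w:val="([^"]*)")?\s*\/>/.exec(runBody);
      const boldVal = boldTag ? boldTag[1] : undefined;
      const bold =
        boldTag !== null &&
        boldVal !== "0" &&
        boldVal !== "false" &&
        boldVal !== "off";

      // Concatenate every <w:t> text node in the run.
      const textRe = /<w:t(?:\s[^>]*)?>([\s\S]*?)<\/w:t>/g;
      let text = "";
      let t: RegExpExecArray | null;
      while ((t = textRe.exec(runBody)) !== null) {
        text += t[1];
      }

      runs.push({ text, bold });
    }

    paragraphs.push({ headingLevel, runs });
  }

  return paragraphs;
}

// Illustrative sample, not taken from a real document.
const sampleXml =
  '<w:p><w:pPr><w:pStyle w:val="Heading2"/></w:pPr>' +
  "<w:r><w:t>Install</w:t></w:r></w:p>" +
  '<w:p><w:r><w:t xml:space="preserve">Click </w:t></w:r>' +
  "<w:r><w:rPr><w:b/></w:rPr><w:t>Save</w:t></w:r></w:p>";

const parsedSample = parseParagraphs(sampleXml);
console.log(JSON.stringify(parsedSample, null, 2));
```

The output of something like this could feed a DOCX-specific style config, since heading level and bold state come from explicit markup rather than from text patterns.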

@hawkeyexl added the enhancement label Mar 17, 2024
@hawkeyexl
Contributor Author

  1. Convert DOCX to HTML: https://github.com/mwilliamson/mammoth.js
  2. Convert HTML to Markdown: https://github.com/mixmark-io/turndown
  3. Parse Markdown with default style options.
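
The three steps above could be wired together roughly as follows. This is only a sketch: `parseDocx` and the type names are hypothetical, and the converters are passed in as parameters so the pipeline itself has no hard dependency on either library. The `{ value, messages }` result shape is meant to mirror what mammoth's `convertToHtml` resolves to.

```typescript
// Shape of the DOCX-to-HTML result (mirrors mammoth's convertToHtml result).
interface HtmlConversion {
  value: string; // the generated HTML
  messages: unknown[]; // warnings reported during conversion
}

// Step 1: DOCX file path -> HTML.
type DocxToHtml = (docxPath: string) => Promise<HtmlConversion>;
// Step 2: HTML -> Markdown.
type HtmlToMarkdown = (html: string) => string;
// Step 3: Markdown -> whatever the existing Markdown parser returns.
type MarkdownParser<T> = (markdown: string) => T;

// Hypothetical pipeline: runs the three steps in order.
async function parseDocx<T>(
  docxPath: string,
  toHtml: DocxToHtml,
  toMarkdown: HtmlToMarkdown,
  parseMarkdown: MarkdownParser<T>
): Promise<T> {
  const { value: html } = await toHtml(docxPath);
  const markdown = toMarkdown(html);
  return parseMarkdown(markdown);
}
```

With the real libraries installed, `toHtml` could be `(p) => mammoth.convertToHtml({ path: p })` and `toMarkdown` could be `(h) => new TurndownService({ headingStyle: "atx" }).turndown(h)`. Turndown emits setext-style headings by default, so the `atx` option may be needed if the default Markdown style options expect `#` headings.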
