From 70f94cf4bab8e81edf225e30d4f4e822566a213b Mon Sep 17 00:00:00 2001
From: Paul Cornell
Date: Thu, 15 May 2025 15:22:44 -0700
Subject: [PATCH 1/3] Updating supported file types for the Unstructured UI/API

---
 api-reference/supported-file-types.mdx        |  2 +-
 .../supported-file-types-platform.mdx         | 24 ++++++-------------
 2 files changed, 8 insertions(+), 18 deletions(-)

diff --git a/api-reference/supported-file-types.mdx b/api-reference/supported-file-types.mdx
index 740aba45..e3fd06d3 100644
--- a/api-reference/supported-file-types.mdx
+++ b/api-reference/supported-file-types.mdx
@@ -2,6 +2,6 @@
 title: Supported file types
 ---
 
-import SupportedFileTypes from '/snippets/general-shared-text/supported-file-types.mdx';
+import SupportedFileTypes from '/snippets/general-shared-text/supported-file-types-platform.mdx';
 
\ No newline at end of file
diff --git a/snippets/general-shared-text/supported-file-types-platform.mdx b/snippets/general-shared-text/supported-file-types-platform.mdx
index 7c7ed136..75b3ebfb 100644
--- a/snippets/general-shared-text/supported-file-types-platform.mdx
+++ b/snippets/general-shared-text/supported-file-types-platform.mdx
@@ -1,4 +1,4 @@
-Unstructured supports processing of the following file types:
+The Unstructured user interface (UI) and Unstructured API support processing of the following file types:
 
 By file extension:
 
@@ -8,10 +8,7 @@ By file extension:
 | `.bmp` |
 | `.csv` |
 | `.cwk` |
-| `.dbf` |
-| `.dif` |
 | `.doc` |
-| `.docm` |
 | `.docx` |
 | `.dot` |
 | `.dotm` |
@@ -19,8 +16,6 @@ By file extension:
 | `.epub` |
 | `.et` |
 | `.eth` |
-| `.fods` |
-| `.gif` |
 | `.heic` |
 | `.htm` |
 | `.html` |
@@ -29,8 +24,8 @@ By file extension:
 | `.jpg` |
 | `.md` |
 | `.mcw` |
+| `.msg` |
 | `.mw` |
-| `.odt` |
 | `.org` |
 | `.p7s` |
 | `.pages` |
@@ -56,10 +51,7 @@ By file extension:
 | `.uos1` |
 | `.uos2` |
 | `.web` |
-| `.webp` |
-| `.wk2` |
 | `.xls` |
-| `.xlsb` |
 | `.xlsm` |
 | `.xlsx` |
 | `.xlw` |
@@ -72,23 +64,21 @@ By file type:
 | --- | --- |
 | Apple | `.cwk`, `.mcw`, `.pages`
 | CSV | `.csv` |
-| Data interchange | `.dif` |
-| dBase | `.dbf` |
-| E-mail | `.eml`, `.p7s` |
+| E-mail | `.eml`, `.msg`, `.p7s` |
 | EPUB | `.epub` |
 | HTML | `.htm`, `.html` |
-| Image | `.bmp`, `.gif`, `.heic`, `.jpeg`, `.jpg`, `.png`, `.prn`, `.svg`, `.tiff`, `.webp` |
+| Image | `.bmp`, `.heic`, `.jpeg`, `.jpg`, `.png`, `.prn`, `.svg`, `.tiff` |
 | Markdown | `.md` |
 | Org Mode | `.org` |
-| Open Office | `.odt`, `.sgl` |
+| Open Office | `.sgl` |
 | Other | `.eth`, `.mw`, `.pbd`, `.sdp`, `.uof`, `.web` |
 | PDF | `.pdf` |
 | Plain text | `.txt` |
 | PowerPoint | `.pot`, `.potm`, `.ppt`, `.pptm`, `.pptx` |
 | reStructured Text | `.rst` |
 | Rich Text | `.rtf` |
-| Spreadsheet | `.et`, `.fods`, `.uos1`, `.uos2`, `.wk2`, `.xls`, `.xlsb`, `.xlsm`, `.xlsx`, `.xlw` |
+| Spreadsheet | `.et`, `.uos1`, `.uos2`, `.xls`, `.xlsm`, `.xlsx`, `.xlw` |
 | StarOffice | `.sxg` |
 | TSV | `.tsv` |
-| Word processing | `.abw`, `.doc`, `.docm`, `.docx`, `.dot`, `.dotm`, `.hwp`, `.zabw` |
+| Word processing | `.abw`, `.doc`, `.docx`, `.dot`, `.dotm`, `.hwp`, `.zabw` |
 | XML | `.xml` |

From e2f5c17cea6861ac86829bb64ea390eff0e7fe7e Mon Sep 17 00:00:00 2001
From: Paul Cornell
Date: Fri, 16 May 2025 09:18:30 -0700
Subject: [PATCH 2/3] Noted remaining file types recently tested

---
 .../supported-file-types-platform.mdx         | 20 +++++--------------
 1 file changed, 5 insertions(+), 15 deletions(-)

diff --git a/snippets/general-shared-text/supported-file-types-platform.mdx b/snippets/general-shared-text/supported-file-types-platform.mdx
index 75b3ebfb..7c31b4dc 100644
--- a/snippets/general-shared-text/supported-file-types-platform.mdx
+++ b/snippets/general-shared-text/supported-file-types-platform.mdx
@@ -11,7 +11,6 @@ By file extension:
 | `.doc` |
 | `.docx` |
 | `.dot` |
-| `.dotm` |
 | `.eml` |
 | `.epub` |
 | `.et` |
@@ -28,12 +27,10 @@ By file extension:
 | `.mw` |
 | `.org` |
 | `.p7s` |
-| `.pages` |
 | `.pbd` |
 | `.pdf` |
 | `.png` |
 | `.pot` |
-| `.potm` |
 | `.ppt` |
 | `.pptm` |
 | `.pptx` |
@@ -41,20 +38,14 @@ By file extension:
 | `.rst` |
 | `.rtf` |
 | `.sdp` |
-| `.sgl` |
 | `.svg` |
 | `.sxg` |
 | `.tiff` |
 | `.txt` |
 | `.tsv` |
-| `.uof` |
-| `.uos1` |
-| `.uos2` |
-| `.web` |
 | `.xls` |
 | `.xlsm` |
 | `.xlsx` |
-| `.xlw` |
 | `.xml` |
 | `.zabw` |
 
@@ -62,7 +53,7 @@ By file type:
 
 | Category | File types |
 | --- | --- |
-| Apple | `.cwk`, `.mcw`, `.pages`
+| Apple | `.cwk`, `.mcw`
 | CSV | `.csv` |
 | E-mail | `.eml`, `.msg`, `.p7s` |
 | EPUB | `.epub` |
@@ -70,15 +61,14 @@ By file type:
 | Image | `.bmp`, `.heic`, `.jpeg`, `.jpg`, `.png`, `.prn`, `.svg`, `.tiff` |
 | Markdown | `.md` |
 | Org Mode | `.org` |
-| Open Office | `.sgl` |
-| Other | `.eth`, `.mw`, `.pbd`, `.sdp`, `.uof`, `.web` |
+| Other | `.eth`, `.mw`, `.pbd`, `.sdp` |
 | PDF | `.pdf` |
 | Plain text | `.txt` |
-| PowerPoint | `.pot`, `.potm`, `.ppt`, `.pptm`, `.pptx` |
+| PowerPoint | `.pot`, `.ppt`, `.pptm`, `.pptx` |
 | reStructured Text | `.rst` |
 | Rich Text | `.rtf` |
-| Spreadsheet | `.et`, `.uos1`, `.uos2`, `.xls`, `.xlsm`, `.xlsx`, `.xlw` |
+| Spreadsheet | `.et`, `.xls`, `.xlsm`, `.xlsx` |
 | StarOffice | `.sxg` |
 | TSV | `.tsv` |
-| Word processing | `.abw`, `.doc`, `.docx`, `.dot`, `.dotm`, `.hwp`, `.zabw` |
+| Word processing | `.abw`, `.doc`, `.docx`, `.dot`, `.hwp`, `.zabw` |
 | XML | `.xml` |

From 15844d717ddfaf80ace14c3c95075bfcc012547b Mon Sep 17 00:00:00 2001
From: Paul Cornell
Date: Fri, 16 May 2025 09:24:58 -0700
Subject: [PATCH 3/3] Noted support for .dif files

---
 .../general-shared-text/supported-file-types-platform.mdx | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/snippets/general-shared-text/supported-file-types-platform.mdx b/snippets/general-shared-text/supported-file-types-platform.mdx
index 7c31b4dc..860c377e 100644
--- a/snippets/general-shared-text/supported-file-types-platform.mdx
+++ b/snippets/general-shared-text/supported-file-types-platform.mdx
@@ -8,6 +8,7 @@ By file extension:
 | `.bmp` |
 | `.csv` |
 | `.cwk` |
+| `.dif`[*](#notes) |
 | `.doc` |
 | `.docx` |
 | `.dot` |
@@ -61,7 +62,7 @@ By file type:
 | Image | `.bmp`, `.heic`, `.jpeg`, `.jpg`, `.png`, `.prn`, `.svg`, `.tiff` |
 | Markdown | `.md` |
 | Org Mode | `.org` |
-| Other | `.eth`, `.mw`, `.pbd`, `.sdp` |
+| Other | `.dif`[*](#notes), `.eth`, `.mw`, `.pbd`, `.sdp` |
 | PDF | `.pdf` |
 | Plain text | `.txt` |
 | PowerPoint | `.pot`, `.ppt`, `.pptm`, `.pptx` |
@@ -72,3 +73,8 @@ By file type:
 | TSV | `.tsv` |
 | Word processing | `.abw`, `.doc`, `.docx`, `.dot`, `.hwp`, `.zabw` |
 | XML | `.xml` |
+
+## Notes
+
+* For `.dif`, `\n` characters in `.dif` files are supported, but `\r\n` characters will raise the error
+  `UnsupportedFileFormatError: Partitioning is not supported for the FileType.UNK file type`.
\ No newline at end of file
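
Reviewer note on the `.dif` entry added in PATCH 3/3: since `\r\n` line endings trigger `UnsupportedFileFormatError`, one possible workaround is to rewrite the file with `\n` endings before sending it to Unstructured. The following is a minimal sketch only. It is not part of this patch and does not use any Unstructured API; the function names `normalize_line_endings` and `normalize_dif_file` are hypothetical, and it assumes that replacing every CRLF pair with LF is acceptable for the file's contents.

```python
from pathlib import Path


def normalize_line_endings(data: bytes) -> bytes:
    """Return a copy of `data` with every CRLF pair replaced by a single LF."""
    return data.replace(b"\r\n", b"\n")


def normalize_dif_file(src: Path, dst: Path) -> None:
    """Read `src` as raw bytes and write an LF-only copy to `dst`.

    Hypothetical helper: working on bytes avoids guessing the text encoding.
    """
    dst.write_bytes(normalize_line_endings(src.read_bytes()))
```

For example, `normalize_dif_file(Path("report.dif"), Path("report-lf.dif"))` would write an LF-only copy that could then be uploaded in place of the original; the file names here are placeholders.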