Merged
37 changes: 21 additions & 16 deletions snippets/concepts/document-elements.mdx
@@ -144,22 +144,27 @@ print(element.metadata.coordinates.system.height)

### Additional metadata fields by document type

| Field name | Applicable file types | Description |
|------------------------|-----------------------|-------------------------------------------------------------------------------------------------------------------------------------|
| `attached_to_filename` | MSG | The name of the file that the attached file is attached to. |
| `bcc_recipient` | EML | The related [email](#email) BCC recipient. |
| `cc_recipient` | EML | The related [email](#email) CC recipient. |
| `email_message_id` | EML | The related [email](#email) message ID. |
| `header_footer_type` | Word Doc | The pages that a header or footer applies to in a [Word document](#microsoft-word-files): `primary`, `even_only`, and `first_page`. |
| `link_urls` | HTML | The URL that is associated with a link in a document. |
| `link_texts` | HTML | The text that is associated with a link in a document. |
| `page_name` | XLSX | The related sheet's name in an [Excel file](#microsoft-excel-files). |
| `page_number` | DOCX, PDF, PPT, XLSX | The related file's page number. |
| `section` | EPUB | The book section title corresponding to a table of contents. |
| `sent_from` | EML | The related [email](#email) sender. |
| `sent_to` | EML | The related [email](#email) recipient. |
| `signature` | EML | The related [email](#email) signature. |
| `subject` | EML | The related [email](#email) subject. |
| Field name | Applicable file types | Description |
|------------------------|-----------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `attached_to_filename` | MSG | The name of the file that the attached file is attached to. |
| `bcc_recipient` | EML | The related [email](#email) BCC recipient. |
| `cc_recipient` | EML | The related [email](#email) CC recipient. |
| `email_message_id` | EML | The related [email](#email) message ID. |
| `header_footer_type` | Word Doc | The pages that a header or footer applies to in a [Word document](#microsoft-word-files): `primary`, `even_only`, and `first_page`. |
| `image_path` | PDF | The path to the image. This is useful when you want to extract the image and save it in a specified path instead of serializing the image within the processed data. |
| `image_mime_type` | PDF | The MIME type of the image. |
| `image_url` | HTML | The URL to the image. |
| `link_start_indexes` | HTML, PDF | A list of the index locations within the extracted content where the `links` can be found. |
| `link_texts` | HTML | A list of text strings that are associated with the `link_urls`. |
| `link_urls` | HTML | A list of URLs within the extracted content. |
| `links` | PDF | A list of links within the extracted content. |
| `page_name` | XLSX | The related sheet's name in an [Excel file](#microsoft-excel-files). |
| `page_number` | DOCX, PDF, PPT, XLSX | The related file's page number. |
| `section` | EPUB | The book section title corresponding to a table of contents. |
| `sent_from` | EML | The related [email](#email) sender. |
| `sent_to` | EML | The related [email](#email) recipient. |
| `signature` | EML | The related [email](#email) signature. |
| `subject` | EML | The related [email](#email) subject. |
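
The link-related fields in the table can be read together. The following is a minimal sketch, not the library's own API: it assumes that `link_texts`, `link_urls`, and `link_start_indexes` are parallel lists (one entry per link) and that each start index is a character offset into the element's text. The `metadata` dictionary, the sample `text`, and the `pair_links` helper are hypothetical and exist only for illustration.

```python
# Hypothetical metadata shaped like the link fields in the table above.
# Assumption: the three lists are parallel, one entry per link.
metadata = {
    "link_texts": ["docs", "source"],
    "link_urls": ["https://example.com/docs", "https://example.com/src"],
    "link_start_indexes": [8, 21],
}
text = "See the docs and the source for details."


def pair_links(text, metadata):
    """Pair each link's text, URL, and start index into one record."""
    texts = metadata.get("link_texts") or []
    urls = metadata.get("link_urls") or []
    starts = metadata.get("link_start_indexes") or []
    paired = []
    for link_text, url, start in zip(texts, urls, starts):
        # Check that the text found at the reported offset is the link text.
        found = text[start:start + len(link_text)] == link_text
        paired.append(
            {"text": link_text, "url": url, "start": start, "found": found}
        )
    return paired


for link in pair_links(text, metadata):
    print(link["start"], link["text"], link["url"], link["found"])
```

Using `zip` means that a missing or shorter list yields fewer pairs instead of raising an error, which is a convenient default when a field is absent for a given file type.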

Notes on additional metadata by document type:

8 changes: 6 additions & 2 deletions ui/document-elements.mdx
@@ -135,8 +135,12 @@ The `coordinates` metadata field contains:
| `cc_recipient` | EML | The related [email](#email) CC recipient. |
| `email_message_id` | EML | The related [email](#email) message ID. |
| `header_footer_type` | Word Doc | The pages that a header or footer applies to in a [Word document](#microsoft-word-files): `primary`, `even_only`, and `first_page`. |
| `link_urls` | HTML | The URL that is associated with a link in a document. |
| `link_texts` | HTML | The text that is associated with a link in a document. |
| `image_mime_type` | HTML, image, PDF | The MIME type of the image. |
| `image_url` | HTML | The URL to the image. |
| `link_start_indexes` | HTML, PDF | A list of the index locations within the extracted content where the `links` can be found. |
| `link_texts` | HTML | A list of text strings that are associated with the `link_urls`. |
| `link_urls` | HTML | A list of URLs within the extracted content. |
| `links` | PDF | A list of links within the extracted content. |
| `page_name` | XLSX | The related sheet's name in an [Excel file](#microsoft-excel-files). |
| `page_number` | DOCX, PDF, PPT, XLSX | The related file's page number. |
| `section` | EPUB | The book section title corresponding to a table of contents. |