diff --git a/snippets/concepts/document-elements.mdx b/snippets/concepts/document-elements.mdx
index fe3534b2..ca1e5904 100644
--- a/snippets/concepts/document-elements.mdx
+++ b/snippets/concepts/document-elements.mdx
@@ -144,22 +144,27 @@ print(element.metadata.coordinates.system.height)
 
 ### Additional metadata fields by document type
 
-| Field name | Applicable file types | Description |
-|------------------------|-----------------------|-------------------------------------------------------------------------------------------------------------------------------------|
-| `attached_to_filename` | MSG | The name of the file that the attached file is attached to. |
-| `bcc_recipient` | EML | The related [email](#email) BCC recipient. |
-| `cc_recipient` | EML | The related [email](#email) CC recipient. |
-| `email_message_id` | EML | The related [email](#email) message ID. |
-| `header_footer_type` | Word Doc | The pages that a header or footer applies to in a [Word document](#microsoft-word-files): `primary`, `even_only`, and `first_page`. |
-| `link_urls` | HTML | The URL that is associated with a link in a document. |
-| `link_texts` | HTML | The text that is associated with a link in a document. |
-| `page_name` | XLSX | The related sheet's name in an [Excel file](#microsoft-excel-files). |
-| `page_number` | DOCX, PDF, PPT, XLSX | The related file's page number. |
-| `section` | EPUB | The book section title corresponding to a table of contents. |
-| `sent_from` | EML | The related [email](#email) sender. |
-| `sent_to` | EML | The related [email](#email) recipient. |
-| `signature` | EML | The related [email](#email) signature. |
-| `subject` | EML | The related [email](#email) subject. |
+| Field name | Applicable file types | Description |
+|------------------------|-----------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `attached_to_filename` | MSG | The name of the file that the attached file is attached to. |
+| `bcc_recipient` | EML | The related [email](#email) BCC recipient. |
+| `cc_recipient` | EML | The related [email](#email) CC recipient. |
+| `email_message_id` | EML | The related [email](#email) message ID. |
+| `header_footer_type` | Word Doc | The pages that a header or footer applies to in a [Word document](#microsoft-word-files): `primary`, `even_only`, and `first_page`. |
+| `image_path` | PDF | The path to the image. This is useful when you want to extract the image and save it in a specified path instead of serializing the image within the processed data. |
+| `image_mime_type` | PDF | The MIME type of the image. |
+| `image_url` | HTML | The URL to the image. |
+| `link_start_indexes` | HTML, PDF | A list of the index locations within the extracted content where the `links` can be found. |
+| `link_texts` | HTML | A list of text strings that are associated with the `link_urls`. |
+| `link_urls` | HTML | A list of URLs within the extracted content. |
+| `links` | PDF | A list of links within the extracted content. |
+| `page_name` | XLSX | The related sheet's name in an [Excel file](#microsoft-excel-files). |
+| `page_number` | DOCX, PDF, PPT, XLSX | The related file's page number. |
+| `section` | EPUB | The book section title corresponding to a table of contents. |
+| `sent_from` | EML | The related [email](#email) sender. |
+| `sent_to` | EML | The related [email](#email) recipient. |
+| `signature` | EML | The related [email](#email) signature. |
+| `subject` | EML | The related [email](#email) subject. |
 
 Notes on additional metadata by document type:
 
diff --git a/ui/document-elements.mdx b/ui/document-elements.mdx
index feaf2e81..9a4544a1 100644
--- a/ui/document-elements.mdx
+++ b/ui/document-elements.mdx
@@ -135,8 +135,12 @@ The `coordinates` metadata field contains:
 | `cc_recipient` | EML | The related [email](#email) CC recipient. |
 | `email_message_id` | EML | The related [email](#email) message ID. |
 | `header_footer_type` | Word Doc | The pages that a header or footer applies to in a [Word document](#microsoft-word-files): `primary`, `even_only`, and `first_page`. |
-| `link_urls` | HTML | The URL that is associated with a link in a document. |
-| `link_texts` | HTML | The text that is associated with a link in a document. |
+| `image_mime_type` | HTML, image, PDF | The MIME type of the image. |
+| `image_url` | HTML | The URL to the image. |
+| `link_start_indexes` | HTML, PDF | A list of the index locations within the extracted content where the `links` can be found. |
+| `link_texts` | HTML | A list of text strings that are associated with the `link_urls`. |
+| `link_urls` | HTML | A list of URLs within the extracted content. |
+| `links` | PDF | A list of links within the extracted content. |
 | `page_name` | XLSX | The related sheet's name in an [Excel file](#microsoft-excel-files). |
 | `page_number` | DOCX, PDF, PPT, XLSX | The related file's page number. |
 | `section` | EPUB | The book section title corresponding to a table of contents. |
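
Reviewer note: the new rows describe `link_texts`, `link_urls`, and `link_start_indexes` as parallel lists, so a short illustration may help readers of the rendered page. The sketch below is not taken from the library. It works on a plain dictionary and assumes that the three lists are index-aligned and that each start index is a character offset into the element's text. The helper name `pair_links` and the sample data are hypothetical.

```python
def pair_links(text, metadata):
    """Combine the parallel link lists into one record per link.

    Assumption (not confirmed by the diff): the lists are index-aligned and
    each entry of `link_start_indexes` is a character offset into `text`.
    """
    texts = metadata.get("link_texts") or []
    urls = metadata.get("link_urls") or []
    starts = metadata.get("link_start_indexes") or []

    records = []
    for i, url in enumerate(urls):
        label = texts[i] if i < len(texts) else None
        start = starts[i] if i < len(starts) else None
        # Check whether the link text really sits at the reported offset.
        found = (
            label is not None
            and start is not None
            and text[start:start + len(label)] == label
        )
        records.append(
            {"text": label, "url": url, "start": start, "found_at_start": found}
        )
    return records


# Hypothetical sample data shaped like an HTML element's text and metadata.
sample_text = "See the docs and the API."
sample_metadata = {
    "link_texts": ["docs", "API"],
    "link_urls": ["https://example.com/docs", "https://example.com/api"],
    "link_start_indexes": [8, 21],
}

link_records = pair_links(sample_text, sample_metadata)
for record in link_records:
    print(record)
```

If the lists turn out not to be index-aligned for some file types, the `found_at_start` flag makes the mismatch visible instead of silently pairing the wrong text with a URL.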