File CDK unstructured parser: Improve file type detection #31997

Merged on Nov 2, 2023 (4 commits)

flash1293
Contributor

@flash1293 flash1293 commented Oct 31, 2023

The current version of the unstructured parser detects the file type based on the file name: if the file name doesn't have a proper extension (e.g. .pdf for PDF files), the file won't be processed.

The logic used for this is here: https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/file_utils/filetype.py#L225

This PR extends the logic in the following way:

  • Extend RemoteFile with mime type (as some sources like Google Drive can provide it)
  • Try to detect the file type using the mime type, if available
  • If this fails, try to detect the file type using the file name
  • If this fails, pass in the raw file to detect the file type by reading the first few bytes
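
As a rough illustration of the fallback order above (not the code in this PR), here is a minimal standard-library sketch. The function name `detect_file_type`, the three lookup tables, and the string labels are all hypothetical; the actual parser relies on the detection logic in unstructured and supports more formats.

```python
import os
from io import BytesIO
from typing import BinaryIO, Optional

# Hypothetical lookup tables, for illustration only.
MIME_TO_TYPE = {
    "application/pdf": "pdf",
    "text/markdown": "md",
}
EXT_TO_TYPE = {".pdf": "pdf", ".md": "md"}
# Leading "magic" bytes that identify a format from its content.
MAGIC_TO_TYPE = {b"%PDF-": "pdf"}


def detect_file_type(
    file_name: str, mime_type: Optional[str], file: BinaryIO
) -> Optional[str]:
    """Return a file type label, or None if no step succeeds."""
    # 1. Use the mime type if the source provided one and it is known.
    if mime_type and mime_type in MIME_TO_TYPE:
        return MIME_TO_TYPE[mime_type]

    # 2. Fall back to the file name extension.
    extension = os.path.splitext(file_name)[1].lower()
    if extension in EXT_TO_TYPE:
        return EXT_TO_TYPE[extension]

    # 3. Fall back to the first few bytes of the raw file.
    head = file.read(8)
    file.seek(0)  # rewind so a later parser can read from the start
    for magic, file_type in MAGIC_TO_TYPE.items():
        if head.startswith(magic):
            return file_type

    return None


# A PDF without an extension and without a mime type is still detected.
print(detect_file_type("report", None, BytesIO(b"%PDF-1.7 example")))
```

Each step runs only when the previous one fails, so a source that supplies a mime type never needs the file to be read just to determine its type.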

@flash1293 flash1293 requested a review from a team as a code owner October 31, 2023 11:16
@octavia-squidington-iii octavia-squidington-iii added the CDK Connector Development Kit label Oct 31, 2023
@flash1293
Contributor Author

@clnoll @aaronsteers As discussed on #31458

Contributor

@clnoll clnoll left a comment

LGTM!

@flash1293 flash1293 merged commit 66dd29f into master Nov 2, 2023
23 checks passed
@flash1293 flash1293 deleted the flash1293/file-cdk-detection branch November 2, 2023 11:19