feat: file info dataframe from filenames and file content #204

Merged: MthwRobinson merged 7 commits into main from feat/explore-file-list on Feb 8, 2023

Conversation

MthwRobinson

Summary

Adds functions for building a dataframe of file info from a list of filenames or a list of file contents. Supports creating this plot for our snazzy demo UI.

[image: plot in the demo UI]

Testing

Run the following Python code from the root of the repository, so that the relative path example-docs/ resolves:

import os
from unstructured.file_utils.exploration import get_file_info

filenames = [os.path.join("example-docs/", f) for f in os.listdir("example-docs/")]
get_file_info(filenames)

You should get a dataframe that looks like this:

                         filename  ...       filetype
0                  fake-html.html  ...  FileType.HTML
1                example-10k.html  ...  FileType.HTML
2                    factbook.xml  ...   FileType.XML
3           fake-email-header.eml  ...   FileType.UNK
4                       fake.docx  ...  FileType.DOCX
5   fake-email-image-embedded.eml  ...   FileType.EML
6                   fake-text.txt  ...   FileType.TXT
7    layout-parser-paper-fast.pdf  ...   FileType.PDF
8            email-with-image.eml  ...   FileType.EML
9           fake-power-point.pptx  ...  FileType.PPTX
10                 fake-email.txt  ...   FileType.TXT
11                      README.md  ...   FileType.TXT
12                   factbook.xsl  ...   FileType.XML
13                fake-excel.xlsx  ...  FileType.XLSX
14                 fake-email.eml  ...   FileType.EML
15        layout-parser-paper.pdf  ...   FileType.PDF
16      fake-email-attachment.eml  ...   FileType.EML
17                    example.jpg  ...   FileType.JPG

[18 rows x 5 columns]
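
For readers who want a feel for the shape of this output without installing the library, here is a rough stdlib-only sketch of the idea. It is not the implementation in this PR: `sketch_file_info` is a hypothetical name, every column other than `filename` and `filetype` is an assumption (the output above elides the middle columns), and `mimetypes` is swapped in as a stand-in that guesses from the file extension only.

```python
import mimetypes
import os
from typing import Dict, List, Optional, Union

Record = Dict[str, Optional[Union[str, int]]]


def sketch_file_info(filenames: List[str]) -> List[Record]:
    """Hypothetical stand-in for get_file_info: builds one record per file."""
    records: List[Record] = []
    for filename in filenames:
        # Split the absolute path into its directory and base name.
        path, name = os.path.split(os.path.abspath(filename))
        _, extension = os.path.splitext(name)
        # Assumption: guess the type from the extension alone.
        # Returns None when the extension is missing or unknown.
        mime_type, _ = mimetypes.guess_type(name)
        records.append(
            {
                "filename": name,
                "path": path,  # assumed column
                "filesize": os.path.getsize(filename),  # assumed column
                "extension": extension,  # assumed column
                "filetype": mime_type,
            }
        )
    return records
```

A list of records like this can be handed to `pandas.DataFrame(records)` to get a table with one row per file.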


@cragwolfe left a comment

Tested locally on a directory that included an empty file; works for me!

However, the following warning is emitted, which isn't very helpful because it doesn't say which file is unsupported:

MIME type was inode/x-empty. This file type is not currently supported in unstructured.

@MthwRobinson

Thanks, I added #208 to track updating that warning so it's more helpful.
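
One possible shape for a more helpful message, as a minimal sketch only. `unsupported_type_message` is a hypothetical helper, not a function in unstructured, and nothing here reflects what #208 actually changed. It assumes the caller knows the filename at the point where the warning is logged.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def unsupported_type_message(mime_type: str, filename: Optional[str] = None) -> str:
    """Hypothetical helper: build the warning text, naming the file when known."""
    # With no filename, this reproduces the message quoted in the review.
    location = f" for {filename}" if filename else ""
    return (
        f"MIME type{location} was {mime_type}. "
        "This file type is not currently supported in unstructured."
    )


# Example: log a warning that names the offending file.
logger.warning(unsupported_type_message("inode/x-empty", "example-docs/empty.txt"))
```

Keeping the filename optional means callers that only have file contents, with no name attached, still get the original message.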

@MthwRobinson enabled auto-merge (squash) February 8, 2023 20:39
@MthwRobinson merged commit 47ab808 into main Feb 8, 2023
@MthwRobinson deleted the feat/explore-file-list branch February 8, 2023 20:48