More content extraction #10

Open
jonnybazookatone opened this issue Apr 7, 2015 · 0 comments

Comments

@jonnybazookatone
Contributor

This is a general comment on content extraction. It should be possible to extract more meaningful, structured content from the files that are ingested. For example, extraction from PDF, OCR and TXT files does not differentiate between the parts of their content, unlike extraction from HTML and XML files. Once a schema is in place, it can be applied to all of the simple-text files that are extracted.
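
To make the schema idea concrete, here is a minimal sketch of what it could look like for plain-text input. Everything in it is an assumption for illustration: the `SCHEMA` mapping, the section names, the heading patterns and the `split_sections` helper are hypothetical and are not taken from this project's code.

```python
import re

# Hypothetical schema: section name -> pattern for the heading line that starts it.
# The names and patterns are illustrative assumptions only.
SCHEMA = {
    "abstract": re.compile(r"^\s*abstract\s*$", re.IGNORECASE),
    "acknowledgements": re.compile(r"^\s*acknowledge?ments?\s*$", re.IGNORECASE),
    "references": re.compile(r"^\s*(references|bibliography)\s*$", re.IGNORECASE),
}


def split_sections(text, schema=SCHEMA, default="fulltext"):
    """Assign each line to the section whose heading most recently matched.

    Lines that appear before any known heading go into the `default` section.
    Heading lines themselves are dropped from the output.
    """
    sections = {default: []}
    current = default
    for line in text.splitlines():
        for name, pattern in schema.items():
            if pattern.match(line):
                # A heading line switches the current section.
                current = name
                sections.setdefault(current, [])
                break
        else:
            # Not a heading: the line belongs to the current section.
            sections[current].append(line)
    return {name: "\n".join(lines).strip() for name, lines in sections.items()}
```

Because the same function only needs a string, it could in principle be run on the text pulled out of PDF, OCR and TXT files alike. Real documents would likely need more robust heading detection than exact-line matches.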

@jonnybazookatone jonnybazookatone self-assigned this Apr 23, 2015
@jonnybazookatone jonnybazookatone added this to the Smart extraction for all types milestone Apr 23, 2015
@jonnybazookatone jonnybazookatone removed their assignment Jul 20, 2016
@marblestation marblestation removed this from the Smart extraction for all types milestone Feb 20, 2018