Text file without extension #85

Closed
eon01 opened this Issue May 21, 2015 · 8 comments

eon01 commented May 21, 2015

A text file whose filename has no extension is treated as unsupported.

    raise exceptions.ExtensionNotSupported(ext)
textract.exceptions.ExtensionNotSupported: The filename extension  is not yet supported by
textract. Please suggest this filename extension here:

    https://github.com/deanmalmgren/textract/issues

Owner

deanmalmgren commented May 21, 2015

hmmm...that's an interesting use case. Did you know that the file some_file was a "plain text" file from the outset? If so, you can rename the file to some_file.txt and use textract in the usual way, textract.process("some_file.txt").

This isn't documented, but you could also do something like this if you'd rather not rename the file:

from textract.parsers.txt_parser import Parser

parser = Parser()
text = parser.process("path/to/file/without/an/extension")

If this is a more general problem, I could imagine a few ways of extending textract to address this issue. The first would be to allow users to specify an extension when they call textract with something like textract.process("some_file", extension="txt").

This requires a user to know the extension up front though, which is kind of the entire point of using textract in the first place. Another option would be to have a fallback method of determining the file type based on the output of the file unix command. If the filename extension exists on the filename, then we use that as the default. If no filename extension is detected, then we use file (or any other similar program) to get the mimetype of the file.
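The fallback order described above (filename extension first, then the `file` command) could be sketched roughly as follows. This is only an illustration, not textract code: `detect_mimetype` and `resolve_extension` are hypothetical names, and it assumes a Unix `file` binary that accepts `--brief --mime-type`.

```python
import mimetypes
import os
import subprocess


def detect_mimetype(path):
    """Ask the Unix `file` command for the MIME type of `path`.

    Returns None if `file` is not installed or exits with an error.
    """
    try:
        out = subprocess.check_output(["file", "--brief", "--mime-type", path])
    except (OSError, subprocess.CalledProcessError):
        return None
    return out.decode("utf-8", "replace").strip() or None


def resolve_extension(path, detect=detect_mimetype):
    """Prefer the filename extension; fall back to MIME detection.

    `detect` is injectable so the fallback can be swapped or stubbed.
    Returns an extension such as ".txt", or None if nothing was found.
    """
    _, ext = os.path.splitext(path)
    if ext:
        return ext.lower()
    mimetype = detect(path)
    if mimetype is None:
        return None
    # The stdlib picks one extension for the MIME type; it may not be
    # the one a parser expects.
    return mimetypes.guess_extension(mimetype)
```

Keeping the detector as a parameter means another tool could replace `file` without touching the fallback logic.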

Do any of these strike a chord with you @eon01? Can you share a bit more about your situation so others can chime in with their experience as well?

I'm admittedly hesitant to open up this can of worms, but I think this is a solvable problem that textract can certainly address if extensionless-filenames are pervasive.

eon01 commented Jun 8, 2015

Yes, I don't think that this should be an issue. But maybe the simplest way is to consider all "extension-less" files as text files.

Owner

deanmalmgren commented Jun 15, 2015

If the goal is to just consider all extension-less files as text files, would a satisfactory workaround be for you to just cp some_filename some_filename.txt?

I've been trying to get the mimetype detection to work this morning and this is turning out to be a bit trickier than I would have otherwise thought. For example, mp3 files have a mimetype of audio/mpeg, which doesn't nicely map to .mp3. Can you share a bit more about your use case and how this is coming up? Do you have a reason to believe that all filenames without an extension are text files?
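One way to see the mapping problem is to list every extension Python's standard `mimetypes` module associates with a given mimetype. This is a small sketch; `candidate_extensions` is a hypothetical helper, and the result depends on the Python version and on any system mime.types files, so no output is shown.

```python
import mimetypes


def candidate_extensions(mimetype):
    """Return every extension the stdlib associates with `mimetype`."""
    return sorted(mimetypes.guess_all_extensions(mimetype))


# A single mimetype can map to several extensions, so going from
# mimetype back to one parser-specific extension is ambiguous.
print(candidate_extensions("audio/mpeg"))
```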

Owner

deanmalmgren commented Jun 23, 2015

I shared what I was working on the other day in #89. As you can see from all the failing tests, it isn't totally obvious how to automatically detect the extension on a file from its mimetype for even a fraction of the files that textract supports.

If the ultimate answer is to assume that extensionless files are plain text, I'm a little torn on how to handle this. On the one hand, we could easily add this to the EXTENSION_SYNONYMS and everything would be handled in the same way with something like this:

EXTENSION_SYNONYMS = {
    '': 'txt',
    ...
}

On the other hand, this doesn't really feel like a terribly necessary addition if the issue can just be resolved by cat-ing the file or just open('some_filename', 'r').read() to get all the text. @eon01--can you share a bit more about your use case and how this came up? I want this to be useful, but I do not want to over-engineer a solution.
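For illustration, the synonym-table approach mentioned above might behave roughly like this. It is a sketch under assumptions rather than textract's actual implementation: only the `'': 'txt'` entry comes from this thread, while the `jpeg` entry and the `normalize_extension` helper are hypothetical.

```python
import os

# Only the '' -> 'txt' entry is taken from the thread; 'jpeg' is an
# illustrative placeholder for whatever other synonyms exist.
EXTENSION_SYNONYMS = {
    "": "txt",
    "jpeg": "jpg",
}


def normalize_extension(filename):
    """Return the lowercased extension without the dot, after synonyms."""
    _, ext = os.path.splitext(filename)
    ext = ext.lower().lstrip(".")
    # Unknown extensions pass through unchanged.
    return EXTENSION_SYNONYMS.get(ext, ext)
```

With this table, `normalize_extension("path/to/file_without_extension")` resolves to `"txt"`, so an extensionless file would be routed to the plain-text parser like any `.txt` file.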

eon01 commented Jun 23, 2015

My use case was very simple: I generally use extension-less files to write notes with gedit and similar editors.
I was testing textract with one of those files, that's it.

$ echo "Some notes .." > path/to/file_without_extension
$ cat path/to/file_without_extension
Some notes ..

python code

import textract
text = textract.process('path/to/file_without_extension')

output

 raise exceptions.ExtensionNotSupported(ext)
textract.exceptions.ExtensionNotSupported: The filename extension  is not yet supported by
textract. Please suggest this filename extension here:

    https://github.com/deanmalmgren/textract/issues

I don't think treating all extension-less files as ".txt" files is a good idea; relying on mime types should be enough if we want to keep things logical.

Owner

deanmalmgren commented Jun 24, 2015

OK, so going with the mimetype option seems like the way to go here unless anyone else has any other ideas for detecting file types dynamically.

I'll adapt the tests in #89 to test only mimetypes that have known (and supported) file extensions and aim to release this in the next week or two.

Owner

deanmalmgren commented Aug 29, 2015

I've been messing with different methods for guessing the mimetype and it basically doesn't work. It often guesses incorrectly (guessing .doc when it is an Excel file, for example) and will cause all kinds of trouble. I tend to agree that most files that do not have an extension are probably text files and, if not, it will be pretty obvious to end users that the textract output looks like 💩. Treating extensionless files as plain text is incorporated in master now and will be included in the next release, perhaps sometime this week.

Thanks again for starting the conversation @eon01 and for explaining your use case. Hopefully this is helpful for you.

eon01 commented Aug 29, 2015

Great @deanmalmgren, I agree with your last comment 👍! Thanks.
