cocoa-sniff helps you try out text encoding conversions with Cocoa.
It lets you quickly:
- list the encodings known to Cocoa, by their Cocoa names and nearest IANA char set name
- try to read a file using a given encoding, and see if this errors
- see what encoding Cocoa guesses for a file
- try a sequence of encodings, and use the first which succeeds without error
Basically this is a wrapper for the two methods
+[NSString stringWithContentsOfFile:encoding:error:] and
+[NSString stringWithContentsOfFile:usedEncoding:error:], and some methods for converting between IANA char set names and NSStringEncoding values.
What's the point of that? If you know the encoding of a text file, then there is no point. Read the file with that encoding and get on with your life. However, if you are obliged to process text files of unknown encodings -- for instance, files found on the internet or generated by non-technical users -- then your problem is harder, as there is no reliable way to deduce the encoding of a file only from the file itself. So you must guess.
One way to guess is to ask Cocoa to guess with the method
+[NSString stringWithContentsOfFile:usedEncoding:error:], but Cocoa's guesses are pretty poor. So you can use this utility to see how well Cocoa guesses for your files in particular.
Another way to guess is to use a trial encoding -- that is, to try an encoding and rely on Cocoa to produce an error if it can see that you guessed wrong. This is sometimes a bad strategy because of false positives -- for instance, it seems Cocoa will read almost anything with a "macintosh" encoding, incorrectly. But sometimes it's a pretty good strategy -- if something's not utf-8, Cocoa will probably produce an error when you try to read it as utf-8. So before guessing in this way, you will probably want to try a few trial encodings on your files and see which encodings succeed only when they should. This utility can help with that.
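The difference between a strict and a permissive trial encoding can also be seen outside Cocoa. The following is a minimal sketch using Python's built-in codecs rather than the NSString methods. It assumes Python's `cp1252` and `mac_roman` codecs are reasonable stand-ins for windows-1252 and macintosh; Cocoa's behaviour may differ in detail. The helper name `decodes_cleanly` is hypothetical.

```python
def decodes_cleanly(data: bytes, encoding: str) -> bool:
    """Return True if `data` decodes under `encoding` without an error."""
    try:
        data.decode(encoding)  # strict error handling is Python's default
        return True
    except UnicodeDecodeError:
        return False


# Curly-quoted text as saved in the classic Mac encoding
# (assumption: Python's mac_roman codec stands in for "macintosh").
mac_bytes = "\u201chello\u201d".encode("mac_roman")

print(decodes_cleanly(mac_bytes, "utf-8"))      # strict trial encoding
print(decodes_cleanly(mac_bytes, "cp1252"))     # a false positive is possible here
print(decodes_cleanly(mac_bytes, "mac_roman"))  # permissive trial encoding
```

A trial encoding is only informative when it can fail, which is why a strict encoding such as utf-8 is a more useful first guess than a permissive one.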
Last, you might want to try converting a file by using a sequence of trial encodings. That is, first try utf-8. If it fails, try windows-1252. If it fails, try macintosh. This utility can help with that.
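The fallback strategy just described can be sketched in a few lines. This is an illustration in Python, not cocoa-sniff's implementation; the function name `decode_with_fallbacks` is hypothetical, and the Python codec names `cp1252` and `mac_roman` are assumed to correspond to windows-1252 and macintosh.

```python
from typing import Iterable, Optional, Tuple


def decode_with_fallbacks(
    data: bytes, encodings: Iterable[str]
) -> Optional[Tuple[str, str]]:
    """Try each encoding in order.

    Returns (encoding, text) for the first encoding that decodes
    without error, or None if every encoding fails.
    """
    for encoding in encodings:
        try:
            return encoding, data.decode(encoding)
        except UnicodeDecodeError:
            continue  # this guess was wrong; fall through to the next one
    return None


# Assumed Python equivalents of utf-8, windows-1252, macintosh.
TRIAL_SEQUENCE = ["utf-8", "cp1252", "mac_roman"]

sample = "caf\u00e9".encode("cp1252")  # a lone 0xE9 byte is not valid utf-8
print(decode_with_fallbacks(sample, TRIAL_SEQUENCE))
```

The order matters: the strictest encoding goes first, and the encoding that accepts anything goes last as a catch-all.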
If you're receiving English-language text files, produced on Mac OS X or on Windows, from MSWord, Notepad, or TextEdit, and if the main extended ascii characters you're likely to encounter are typographer's quotes, en-dashes, em-dashes, bullet points, and perhaps the trademark or copyright sign, then a good sequence of trial encodings is probably SNIFF (Cocoa's own guess), windows-1252, macintosh. This is because it seems that SNIFF will usually detect utf-8 and utf-16 and fail on other encodings; windows-1252 will sometimes correctly fail on macintosh-encoded text; and macintosh will accept anything.
Check available encodings:
Check what encoding Cocoa guesses via the method +[NSString stringWithContentsOfFile:usedEncoding:error:]:
cocoa-sniff --encodings=SNIFF foo.txt
Try a sequence of encodings, and convert using the first one that succeeds. For instance, suppose you have a file foo.txt with an unknown encoding. What you would like to do is try to autodetect the encoding and, failing that, read it as windows-1252, and failing that, read it as macintosh. Then you would do:
cocoa-sniff --encodings=SNIFF,windows-1252,macintosh foo.txt
Convert a file using the above encoding strategy:
cocoa-sniff --convert --encodings=SNIFF,windows-1252,macintosh foo.txt > foo-converted.txt
This utility is just a front-end to part of Cocoa's text system, which you might be obliged to use in order to deploy on Apple's App Store. However, if you're running detections and conversions at build time, and can use any utilities available on a plain vanilla Xcode install, you might prefer these:
This built-in Mac OS X utility does conversion between text encodings, as well as various other text-oriented formats such as HTML, RTF, .webarchive, etc. In particular, you can use it to try reading a file with a given encoding and check for encoding errors via its return code. This would be great for finding a suitable encoding for reading an unknown file with Cocoa. However, there is one fly in the ointment. textutil seems to apply its own implicit conversion logic, so that it will accept files as, e.g., utf-8, that Cocoa itself will not accept as utf-8. So it seems you can't use it to diagnose what
+[NSString stringWithContentsOfFile:encoding:error:] will accept.
Here are some helpful incantations.
Convert FILE from macintosh to utf-8, outputting to stdout:
textutil -cat txt -stdout -encoding utf-8 -format txt -inputencoding macintosh FILE
Print the error code after attempting to convert FILE from utf-8:
textutil -cat txt -stdout -encoding utf-8 -format txt -inputencoding utf-8 FILE > /dev/null 2>&1; echo $?
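If you want to run the same check from a build script, the incantation above can be wrapped as follows. This is a sketch that assumes macOS with textutil on the PATH; the function names are hypothetical, and the flags are exactly those shown above.

```python
import subprocess
from typing import List


def textutil_command(path: str, input_encoding: str) -> List[str]:
    """Build the textutil invocation shown above for one input encoding."""
    return [
        "textutil", "-cat", "txt", "-stdout",
        "-encoding", "utf-8", "-format", "txt",
        "-inputencoding", input_encoding, path,
    ]


def textutil_exit_status(path: str, input_encoding: str) -> int:
    """Run textutil, discard its output, and return its exit status.

    Only works on macOS, where textutil is available.
    """
    result = subprocess.run(
        textutil_command(path, input_encoding),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode
```

For example, `textutil_exit_status("foo.txt", "utf-8")` returns the same value that `echo $?` prints above. The caveat about textutil's implicit conversion logic still applies, so a zero status does not prove that Cocoa would accept the file.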
- "file -I" will try to guess the encoding of a file. Seems to detect utf-8 and utf-16le reliably, but cannot distinguish windows-1252 and macintosh.
- An old piece of Mac OS X technology that lets you configure detailed encoding detection, where you specify the set of possible encodings, and it ranks their likelihood based on the number of encoding errors each generates. This is much more elaborate than what's used by
+[NSString stringWithContentsOfFile:usedEncoding:error:] but probably still less intelligent than Mozilla's charset detection algorithm. However, it has a painful interface, so I've never gotten it working and haven't seen anyone else who is using it either.
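The general idea of ranking candidate encodings by the number of errors each generates can be sketched as below. This illustrates the idea only and is not the algorithm used by any Apple API; the function name is hypothetical, and the codec names `cp1252` and `mac_roman` are assumed Python equivalents of windows-1252 and macintosh.

```python
from typing import List, Sequence, Tuple


def rank_by_decode_errors(
    data: bytes, candidates: Sequence[str]
) -> List[Tuple[str, int]]:
    """Rank candidate encodings by how many decode errors each produces.

    Errors are counted as U+FFFD replacement characters, so input that
    legitimately contains U+FFFD will be over-counted.
    """
    scored = []
    for encoding in candidates:
        text = data.decode(encoding, errors="replace")
        scored.append((encoding, text.count("\ufffd")))
    # sorted() is stable, so ties keep the order given in `candidates`.
    return sorted(scored, key=lambda pair: pair[1])


data = "\u201chi\u201d".encode("utf-8")
for encoding, errors in rank_by_decode_errors(data, ["utf-8", "cp1252", "mac_roman"]):
    print(encoding, errors)
```

Note that an encoding which accepts every byte always scores zero errors, so error counts alone cannot rule it out. A real detector needs additional cues.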
And if you can install anything you want, then you will probably be better served by one of these more generic utilities:
- This is a Python library that re-implements Mozilla's automatic encoding-detection logic, which is elaborate and uses a mixture of technical and linguistic cues. Should be awesome, but produced less than awesome results on some of my test samples.
- An encoding conversion utility. Part of the Single Unix Specification, so presumably better to rely on than the following.
- An encoding conversion utility
- Detects the encoding, but "tries to determine your language and preferred charset from locale settings", so it seems to need a lot of hinting.
- This utility seems to do everything. The "--known=pairs" option seems especially useful for deducing the charset given some knowledge of what a few characters are supposed to be. I suspect licensing means it can't be embedded in iOS apps.
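The idea of deducing a charset from a few characters you already know should appear can be sketched as follows. This is not how that option is implemented; it is a hypothetical Python illustration, with `cp1252` and `mac_roman` assumed to stand in for windows-1252 and macintosh.

```python
from typing import List, Sequence


def encodings_matching_known_text(
    data: bytes, expected: str, candidates: Sequence[str]
) -> List[str]:
    """Return the candidates under which `data` decodes cleanly
    and the decoded text contains the `expected` substring."""
    matches = []
    for encoding in candidates:
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            continue  # cannot be this encoding at all
        if expected in text:
            matches.append(encoding)
    return matches


# Bytes of unknown origin that we know should contain curly-quoted "quoted".
unknown = "\u201cquoted\u201d".encode("mac_roman")
print(encodings_matching_known_text(
    unknown, "\u201cquoted\u201d", ["utf-8", "cp1252", "mac_roman"]))
```

A hint like this separates encodings that both decode without error, such as windows-1252 and macintosh, which error checking alone cannot tell apart.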