Do not try to convert from ascii to utf8

ASCII is compatible with UTF-8, so no conversion is needed.
If the ASCII detection was bogus, we hope the file is really UTF-8
already, in which case we lose no characters. If it is something
else, we just lose the offending characters (as is already the case).

Basically, this change fixes the "misdetected UTF8" problem in a better way.

See #904 for further details
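
The compatibility argument in the message can be checked directly. The following is a minimal sketch, not code from this commit: the sample strings and variable names are made up for illustration. It shows that relabelling 7-bit ASCII bytes as UTF-8 changes nothing, and that bytes which are really UTF-8 survive intact when no conversion is run on them.

```ruby
# Illustrative only: sample strings and variable names are assumptions.

# Every 7-bit ASCII byte sequence is also a valid UTF-8 byte sequence.
ascii   = "plain text, 7-bit only".dup.force_encoding(Encoding::US_ASCII)
as_utf8 = ascii.dup.force_encoding(Encoding::UTF_8)

ascii_is_valid_utf8 = as_utf8.valid_encoding?
same_bytes          = ascii.bytes == as_utf8.bytes

# If the "ASCII" guess was wrong and the data is really UTF-8,
# leaving the bytes alone loses no characters.
really_utf8   = "caf\u00e9".dup.force_encoding(Encoding::BINARY)
relabelled    = really_utf8.dup.force_encoding(Encoding::UTF_8)
utf8_survives = relabelled.valid_encoding? && relabelled == "caf\u00e9"

puts ascii_is_valid_utf8, same_bytes, utf8_survives
```

Running an `iconv -f ASCII` pass over that second input with `//IGNORE` would be expected to drop the non-ASCII bytes instead, which is the loss the commit avoids.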
1 parent e987b4f commit 3d91e1152a392a5dcdbc51560ec5ab4a6d65eea4 @strk strk committed Oct 10, 2012
Showing with 13 additions and 16 deletions.
  1. +13 −16 lib/importer/lib/cartodb-importer/lib/utils.rb
29 lib/importer/lib/cartodb-importer/lib/utils.rb
@@ -78,22 +78,19 @@ def fix_encoding
           charset = nil
         end
       end
-      unless ['unknown-8bit','',nil,'binary'].include? charset
-        tf = Tempfile.new(@path)
-        `iconv -f #{charset} -t UTF-8//IGNORE #{@path} > #{tf.path}`
-        `mv -f #{tf.path} #{@path}`
-        tf.close!
-      else
-        lines = []
-        File.open(@path) do |f|
-          1000.times do
-            line = f.gets || break
-            lines << line
-          end
-        end
-        # detect encoding for sample
-        cd = CharDet.detect(lines.join)
-        # Only do non-UTF8 if we're quite sure. (May fail)
+
+      lines = []
+      File.open(@path) do |f|
+        1000.times do
+          line = f.gets || break
+          lines << line
+        end
+      end
+      # detect encoding for sample
+      cd = CharDet.detect(lines.join)
+      #puts "Chardet detected #{cd.encoding} with #{cd.confidence} confidence"
+      # Only do non-UTF8 if we're quite sure. (May fail)
+      unless cd.encoding.include? 'utf-8' or cd.encoding.include? 'ascii'
         if (cd.confidence > 0.6)
           tf = Tempfile.new(@path)
           `iconv -f #{cd.encoding} -t UTF-8//IGNORE #{@path} > #{tf.path}`
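
The rule introduced by the added lines (sample up to 1000 lines, skip conversion for UTF-8 or ASCII guesses, otherwise convert only above 0.6 confidence) can be sketched as a standalone helper. This is a hedged illustration, not the project's code: `sample_lines` and `convert_to_utf8?` are hypothetical names, the detector result is passed in as plain arguments instead of calling `CharDet.detect`, and the `nil` guard and case-insensitive comparison are assumptions that the diff itself does not make.

```ruby
# Hypothetical helpers for illustration; neither exists in the commit.

# Read at most `limit` lines from the file and join them into one sample,
# mirroring the 1000-line sampling loop in the diff.
def sample_lines(path, limit = 1000)
  File.foreach(path).first(limit).join
end

# Decide whether a detected encoding should be converted to UTF-8.
# `encoding` and `confidence` stand in for the detector's result.
def convert_to_utf8?(encoding, confidence, threshold = 0.6)
  return false if encoding.nil?   # assumption: nil means "no guess"
  name = encoding.downcase        # assumption: compare case-insensitively
  # UTF-8 needs no conversion, and ASCII is already valid UTF-8.
  return false if name.include?('utf-8') || name.include?('ascii')
  # Convert other encodings only when the detector is fairly sure.
  confidence > threshold
end

puts convert_to_utf8?('ISO-8859-1', 0.9)
puts convert_to_utf8?('ascii', 1.0)
```

Keeping the decision in a pure function like this makes the threshold and the skip list easy to test without shelling out to `iconv`.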
