Do not try to convert from ascii to utf8

ASCII is compatible with UTF8, so no conversion is needed.
If ASCII detection was bogus, we hope the file is really UTF8
already, in which case we lose NO characters. If it's something
else, we'll just lose the character (as is already the case).

Basically, this change fixes the "misdetected UTF8" problem in a better way.

See #904 for further details
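
The reasoning in the message can be checked with a minimal Ruby sketch. The helper name `valid_as_utf8?` and the sample strings are hypothetical and are not part of the commit. It only shows that pure-ASCII bytes, and bytes that are already UTF-8, are valid UTF-8 as they stand, while Latin-1 bytes are not.

```ruby
# Hypothetical helper (not part of the commit): returns true when the raw
# bytes can be relabelled as UTF-8 without any conversion.
def valid_as_utf8?(raw)
  # dup so the caller's string keeps its original encoding label
  raw.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

# Pure ASCII: every byte is below 0x80, so it is already valid UTF-8.
ascii_sample = "id,name\n1,Madrid\n".b

# Real UTF-8 that a detector might mislabel as ASCII: still valid, nothing lost.
utf8_sample = "M\u00E1laga".b

# Latin-1 bytes: 0xE1 followed by "l" is not a valid UTF-8 sequence.
latin1_sample = "M\xE1laga".b

puts valid_as_utf8?(ascii_sample)
puts valid_as_utf8?(utf8_sample)
puts valid_as_utf8?(latin1_sample)
```

The third case is the one where, as the message says, the offending character is lost either way.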
commit 3d91e1152a392a5dcdbc51560ec5ab4a6d65eea4 1 parent e987b4f
Sandro Santilli strk authored

Showing 1 changed file with 13 additions and 16 deletions.

lib/importer/lib/cartodb-importer/lib/utils.rb (13 additions, 16 deletions)
@@ -78,22 +78,19 @@ def fix_encoding
           charset = nil
         end
       end
-      unless ['unknown-8bit','',nil,'binary'].include? charset
-        tf = Tempfile.new(@path)
-        `iconv -f #{charset} -t UTF-8//IGNORE #{@path} > #{tf.path}`
-        `mv -f #{tf.path} #{@path}`
-        tf.close!
-      else
-        lines = []
-        File.open(@path) do |f|
-          1000.times do
-            line = f.gets || break
-            lines << line
-          end
-        end
-        # detect encoding for sample
-        cd = CharDet.detect(lines.join)
-        # Only do non-UTF8 if we're quite sure. (May fail)
+
+      lines = []
+      File.open(@path) do |f|
+        1000.times do
+          line = f.gets || break
+          lines << line
+        end
+      end
+      # detect encoding for sample
+      cd = CharDet.detect(lines.join)
+      #puts "Chardet detected #{cd.encoding} with #{cd.confidence} confidence"
+      # Only do non-UTF8 if we're quite sure. (May fail)
+      unless cd.encoding.include? 'utf-8' or cd.encoding.include? 'ascii'
         if (cd.confidence > 0.6)
           tf = Tempfile.new(@path)
           `iconv -f #{cd.encoding} -t UTF-8//IGNORE #{@path} > #{tf.path}`
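
The decision made by the new code can be isolated as a small predicate, sketched below. The method name `convert_to_utf8?` and the `threshold` parameter are hypothetical. The 0.6 cutoff and the 'utf-8' / 'ascii' checks come from the diff, while the case-insensitive match and the guard for a missing encoding name are assumptions added here, not behaviour of the commit.

```ruby
# Hypothetical predicate (not in the commit): should the file be run
# through iconv, given the detector's guess and its confidence?
def convert_to_utf8?(encoding, confidence, threshold = 0.6)
  # Assumption: compare case-insensitively; the commit uses a plain include?.
  name = encoding.to_s.downcase

  # Assumption: with no detected encoding there is nothing to convert from.
  return false if name.empty?

  # From the diff: UTF-8 and ASCII input is left untouched.
  return false if name.include?('utf-8') || name.include?('ascii')

  # From the diff: only convert when the detector is fairly sure.
  confidence > threshold
end

puts convert_to_utf8?('ISO-8859-1', 0.9)
puts convert_to_utf8?('ISO-8859-1', 0.5)
puts convert_to_utf8?('ascii', 1.0)
```

Keeping the check separate from the `iconv` call makes the skip-ASCII rule easy to test without touching any files.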
