Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

encoding issue #49

Closed
riffraff opened this issue Dec 24, 2014 · 6 comments
Closed

encoding issue #49

riffraff opened this issue Dec 24, 2014 · 6 comments

Comments

@riffraff
Copy link

with recent GeoLiteCity.dat files I think the text encoding has changed, i.e.

>>geoip.city("201.68.235.228")
=> #<struct GeoIP::City
 request="201.68.235.228",
 ip="201.68.235.228",
 country_code2="BR",
 country_code3="BRA",
 country_name="Brazil",
 continent_code="SA",
 region_name="27",
 city_name="Guaruj\xE1",
 postal_code="",
 latitude=-24.0,
 longitude=-46.266699999999986,
 dma_code=nil,
 area_code=nil,
 timezone="America/Sao_Paulo",
 real_region_name="Sao Paulo">

notice that city_name is badly encoded. Calling .encode('utf-8') on it correctly converts it.
in #6 the code was updated to force encoding of the strings to iso-8859-1, so that seems to cause this issue.
This is a problem because the string is in fact utf-8 but considered latin-1 and when concatenated with a string in utf8 it causes a runtime error. I can fix this in user code by re-encoding in utf-8 but it seems wasteful.

I don't know if there is a way to detect the encoding of the database file, but I think we might want to provide a flag to the db constructor or default to utf8 if necessary.

@oschwald
Copy link

The encoding of the database file has not changed. The strings are stored internally as ISO-8859-1.

@cjheath
Copy link
Owner

cjheath commented Dec 24, 2014

I think Gregory is right, which means I need to add conversions to all strings. That won't happen for a while, unless I get a solid PR from someone else.

@riffraff
Copy link
Author

Thanks for the replies and sorry for misinterpreting IRB's output.

If the db encoding is not changed than I think the problem is limited to changing https://github.com/cjheath/geoip/blob/master/lib/geoip.rb#L792, i.e. almost all fields are already in UTF-8 except for city_name (I believe that all other fields are in fact plain ASCII and city_name is the only one likely to yield characters outside of ASCII).

E.g. notice the difference between "Sao Paulo" as timezone, city_name and region_name.

geoip.city("200.220.180.2").each.map {|k| [k, k.encoding] if k.respond_to? :encoding }}
=> [["200.220.180.2", #<Encoding:UTF-8>],
 ["200.220.180.2", #<Encoding:UTF-8>],
 ["BR", #<Encoding:UTF-8>],
 ["BRA", #<Encoding:UTF-8>],
 ["Brazil", #<Encoding:UTF-8>],
 ["SA", #<Encoding:UTF-8>],
 ["27", #<Encoding:ASCII-8BIT>],
 ["S\xE3o Paulo", #<Encoding:ISO-8859-1>],
 ["", #<Encoding:ASCII-8BIT>],
 nil,
 nil,
 nil,
 nil,
 ["America/Sao_Paulo", #<Encoding:UTF-8>],
 ["Sao Paulo", #<Encoding:UTF-8>]]

(I am not sure why region_name and postal_code are ASCII-8BIT but since it's a subset of utf-8 anyway that doesn't cause issues)

So the only change would be

city.force_encoding('iso-8859-1') if city.respond_to?(:force_encoding)

to

if city.respond_to?(:force_encoding)
  city.force_encoding('iso-8859-1').encode('utf-8')
end

I can add a couple tests and make the change and send a pull request, but I couldn't find tests for the city database, are there any?

@epitron
Copy link

epitron commented Mar 10, 2015

I just ran into this issue today as well.

The one-line fix that the previous poster suggested looks good to me:

city.force_encoding('iso-8859-1').encode('utf-8') if city.respond_to?(:force_encoding)

@cjheath
Copy link
Owner

cjheath commented Mar 10, 2015

Apologies for the delay releasing this fix. I had introduced a breaking change and not had time to diagnose it. The gem is now released as 1.5.0

@cjheath cjheath closed this as completed Mar 10, 2015
@epitron
Copy link

epitron commented Mar 11, 2015

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants