Data is double-encoded #5

Closed
mjpieters opened this issue Jul 31, 2018 · 3 comments · Fixed by #28
mjpieters commented Jul 31, 2018

The data is double-encoded as UTF-8. For example, line 5 of the first file contains, in part (non-ASCII bytes are shown as \xhh escape sequences for readability):

#StandForOurAnthem\xc3\xb0\xc2\x9f\xc2\x87\xc2\xba\xc3\xb0\xc2\x9f\xc2\x87\xc2\xb8

Each two-byte sequence there is the UTF-8 encoding of a single code point whose value is itself one byte of the original UTF-8 data. Decoding those sequences once gives us:

#StandForOurAnthem\xf0\x9f\x87\xba\xf0\x9f\x87\xb8

which in turn can be decoded as UTF-8 to the text #StandForOurAnthem🇺🇸.
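
The two decoding steps above can be reproduced as a minimal Python sketch. The byte strings are copied from this comment; the variable names are only illustrative.

```python
# Sample from the issue: every byte of the original UTF-8 data was
# re-encoded as if it were a code point of its own.
double_encoded = (
    b"#StandForOurAnthem"
    b"\xc3\xb0\xc2\x9f\xc2\x87\xc2\xba\xc3\xb0\xc2\x9f\xc2\x87\xc2\xb8"
)

# First decode: yields text whose code points are all below U+0100.
intermediate = double_encoded.decode("utf-8")

# Latin-1 maps code points U+0000..U+00FF to the byte with the same value,
# so encoding with it recovers the original single-encoded bytes.
single_encoded = intermediate.encode("latin-1")

# Second decode: the recovered bytes are ordinary UTF-8.
text = single_encoded.decode("utf-8")

print(single_encoded)
print(text)
```

The double-encoded sample is 34 bytes long, while the recovered one is 26 bytes, which illustrates where the extra file size comes from.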

This double-encoding makes the files needlessly bigger and harder to work with.

mjpieters commented Jul 31, 2018

Workaround is to use iconv:

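for file in IRAhandle_tweets_*.csv; do
  printf 'Converting %s... ' "$file"
  # Decode the outer UTF-8 layer; Latin-1 output maps each code point back to one byte.
  iconv -f utf8 -t latin1 "$file" > "$file.corrected" &&
  mv -f "$file.corrected" "$file" &&
  echo "Done"
done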

This decodes the data once as UTF-8, then writes the result out as Latin-1, which maps Unicode code points U+0000 through U+00FF one-to-one onto bytes. That gives us single-encoded UTF-8 data again.
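
For anyone without iconv, roughly the same conversion can be sketched in Python. This is not the script that was run on the repository. The helper names `undo_double_encoding` and `fix_file` are hypothetical, and skipping files that cannot be mapped to Latin-1 is an added assumption, not something the iconv loop does.

```python
from pathlib import Path


def undo_double_encoding(data: bytes) -> bytes:
    """Decode as UTF-8, then re-encode as Latin-1 (code point N -> byte N)."""
    return data.decode("utf-8").encode("latin-1")


def fix_file(path: Path) -> bool:
    """Rewrite path in place.

    Returns False and leaves the file untouched when the content cannot be
    mapped to Latin-1, which suggests it was not double-encoded.
    """
    original = path.read_bytes()  # assumes the file fits in memory
    try:
        corrected = undo_double_encoding(original)
    except (UnicodeDecodeError, UnicodeEncodeError):
        return False
    tmp = path.with_name(path.name + ".corrected")
    tmp.write_bytes(corrected)
    tmp.replace(path)  # swap the corrected copy into place
    return True


if __name__ == "__main__":
    for csv_path in sorted(Path(".").glob("IRAhandle_tweets_*.csv")):
        print(f"Converting {csv_path}... ", end="", flush=True)
        print("Done" if fix_file(csv_path) else "Skipped")
```

The skip is only a partial safeguard: a correctly encoded file whose non-ASCII characters all fall below U+0100 would still be converted, so it is worth checking a sample of the output.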

This shaves about 10% off the total byte count, dropping it from 731MB to 656MB.

dmil closed this as completed in d944a52 on Jul 31, 2018
dmil commented Jul 31, 2018

Thank you for your suggestion @mjpieters. We have updated the data to remove the double encoding using the script you suggested.

EvanCarroll commented

Cool work, seems there is more to do though (if we can recover this) #20

dmil mentioned this issue on Aug 27, 2018