# Character Encodings
## Table of Contents <a id='TOC'></a>
- [Package Import](#package-import)
- [What are encodings?](#encodings)
- [Reading file with encoding problems](#data-import)
  - [Using `chardet` to figure out encoding](#chardet)
  - [My turn](#my-turn)
- [Saving files with UTF-8 encoding](#saving-files)
- [More Practice](#more-practice)

## Package Import <a id='package-import'></a>
[TOC](#TOC)

In [1]:
import pandas as pd
import numpy as np

import chardet

np.random.seed(0)

## What are encodings? <a id='encodings'></a>
[TOC](#TOC)

>UTF-8 is **the** standard text encoding. All Python code is in UTF-8 and, ideally, all your data should be as well. It's when things aren't in UTF-8 that you run into trouble.

In [3]:
# start with a string
before = "This is the euro symbol: €"

# check to see what datatype it is
type(before)

str

In [4]:
before

'This is the euro symbol: €'

In [5]:
# encode the string to UTF-8 bytes,
# replacing characters that raise errors
after = before.encode("utf-8", errors="replace")

# Now check the type
type(after)

bytes

In [6]:
after

b'This is the euro symbol: \xe2\x82\xac'

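The result is a `bytes` object. When displayed, bytes in the printable ASCII range appear as the characters they would represent in ASCII, and all other bytes appear as `\x` escapes, such as the three bytes `\xe2\x82\xac` that encode the euro symbol.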
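
To make the display rule concrete, here is a small sketch showing that a `bytes` object is a sequence of integers and that only values in the printable ASCII range are shown as characters. The names `tail` and `codes` are illustrative and not part of the original notebook.

```python
before = "This is the euro symbol: €"
after = before.encode("utf-8")

# One character can take several bytes: 26 characters become 28 bytes,
# because the euro sign alone needs three bytes in UTF-8.
n_chars, n_bytes = len(before), len(after)
print(n_chars, n_bytes)  # 26 28

# The last three bytes are the euro sign; none is printable ASCII,
# so the repr falls back to \x escapes.
tail = after[-3:]
print(tail)  # b'\xe2\x82\xac'

# Iterating over bytes yields plain integers in the range 0-255.
codes = list(tail)
print(codes)  # [226, 130, 172]

# Integers in the printable ASCII range are displayed as letters.
print(bytes([72, 105]))  # b'Hi'
```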

In [7]:
after.decode("utf-8")

'This is the euro symbol: €'

But this won't work with the incorrect encoding, such as trying to decode these bytes as ASCII.

>You can think of different encodings as different ways of recording music. You can record the same music on a CD, cassette tape or 8-track. While the music may sound more-or-less the same, you need to use the right equipment to play the music from each recording format. The correct decoder is like a cassette player or a CD player. If you try to play a cassette in a CD player, it just won't work.
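
As a rough illustration of the "wrong player" problem, the sketch below decodes the same UTF-8 bytes with two wrong codecs. The choice of `cp1252` as the second codec is an assumption made for the example. A strict codec fails loudly, while a permissive single-byte codec returns text that is silently wrong.

```python
utf8_bytes = "This is the euro symbol: €".encode("utf-8")

# ASCII cannot represent bytes above 127, so decoding raises an error.
try:
    utf8_bytes.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError as err:
    ascii_failed = True
    print(err)

# cp1252 accepts these bytes but reads each of the three euro-sign
# bytes as its own character, producing garbled text.
garbled = utf8_bytes.decode("cp1252")
print(garbled)  # This is the euro symbol: â‚¬
```

The second case is the more dangerous one in practice, because nothing signals that the text is wrong.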

In [8]:
# encode the original string to ASCII and
# replace the characters that raise errors
after = before.encode('ascii', errors='replace')

# inspect the resulting bytes
after

b'This is the euro symbol: ?'

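We lost the original euro symbol when we encoded the string to ASCII: `errors='replace'` swapped it for `?`, and no decoder can bring it back. This is why it is best to convert non-UTF-8 text to UTF-8 as soon as possible and keep it in that encoding, so that no symbols are lost.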
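
The loss can be checked with a round trip. The sketch below also tries two other built-in error handlers, `xmlcharrefreplace` and `backslashreplace`, which the notebook does not use; they are included only to contrast lossy and recoverable fallbacks.

```python
before = "This is the euro symbol: €"

# errors="replace" substitutes "?" and discards the original character.
replaced = before.encode("ascii", errors="replace")
round_trip = replaced.decode("ascii")
print(round_trip == before)  # False

# These handlers keep enough information to recover the character later.
as_xml = before.encode("ascii", errors="xmlcharrefreplace")
as_escape = before.encode("ascii", errors="backslashreplace")
print(as_xml)     # b'This is the euro symbol: &#8364;'
print(as_escape)  # b'This is the euro symbol: \\u20ac'

# Encoding to UTF-8 loses nothing, so the round trip is exact.
lossless = before.encode("utf-8").decode("utf-8")
print(lossless == before)  # True
```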

In [9]:
symbols = '$, #, 你好 and नमस्ते'

In [10]:
symbols

'$, #, 你好 and नमस्ते'

In [11]:
type(symbols)

str

In [12]:
symbols.encode('utf-8')

b'$, #, \xe4\xbd\xa0\xe5\xa5\xbd and \xe0\xa4\xa8\xe0\xa4\xae\xe0\xa4\xb8\xe0\xa5\x8d\xe0\xa4\xa4\xe0\xa5\x87'

In [14]:
symbols.encode('ascii', errors='replace')

b'$, #, ?? and ??????'
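
The number of question marks follows from how the text is stored: the ASCII fallback emits one `?` per code point, while UTF-8 spends a variable number of bytes on each code point. A small sketch, assuming the same `symbols` string as above (the names `word`, `question_marks`, and `utf8_length` are illustrative):

```python
symbols = '$, #, 你好 and नमस्ते'

# Show each non-space character with its code point and UTF-8 byte count.
for ch in symbols:
    if not ch.isspace():
        print(repr(ch), f"U+{ord(ch):04X}", len(ch.encode("utf-8")))

word = "नमस्ते"
# One "?" per code point, which is why the greeting collapses to a run of "?".
question_marks = word.encode("ascii", errors="replace").count(b"?")
# Each of these code points takes three bytes in UTF-8.
utf8_length = len(word.encode("utf-8"))
print(len(word), question_marks, utf8_length)
```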

## Reading file with encoding problems <a id='data-import'></a>
[TOC](#TOC)

Reading this file with pandas' default UTF-8 decoding raises an error:

In [2]:
kickstarter_2016 = pd.read_csv("data/ks-projects-201612.csv")

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x99 in position 11: invalid start byte

>We don't know what encoding it actually is though. One way to figure it out is to try and test a bunch of different character encodings and see if any of them work. A better way, though, is to use the chardet module to try and automatically guess what the right encoding is. It's not 100% guaranteed to be right, but it's usually faster than just trying to guess.
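
The trial-and-error approach mentioned above can be sketched as a loop over candidate encodings. This is a minimal illustration, not the notebook's method: the helper name `first_working_encoding`, the candidate list, and the sample bytes are all made up for the example.

```python
def first_working_encoding(raw, candidates):
    """Return (encoding, text) for the first candidate that decodes raw without error."""
    for encoding in candidates:
        try:
            return encoding, raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return None, None

# Hypothetical sample: Windows-1252 bytes containing "é" (0xe9) and "™" (0x99).
sample = "Café ™".encode("cp1252")

encoding, text = first_working_encoding(sample, ["ascii", "utf-8", "cp1252"])
print(encoding, text)  # cp1252 Café ™
```

A successful decode does not prove the guess is right. `latin-1`, for instance, accepts every possible byte, so it belongs at the end of any candidate list, and the decoded text still needs a visual check.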

### Using `chardet` to figure out encoding <a id='chardet'></a>
[TOC](#TOC)

In [15]:
# look at the first ten thousand bytes to guess the character encoding
with open("data/ks-projects-201612.csv", 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))

In [16]:
result

{'confidence': 0.73, 'encoding': 'Windows-1252', 'language': ''}
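
Since the result is only a guess with a confidence score, it can help to decide explicitly when to trust it. The sketch below works on a dict shaped like the output above. The helper `pick_encoding`, its threshold of 0.5, and the `utf-8` fallback are assumptions chosen for illustration, not part of `chardet`.

```python
def pick_encoding(result, min_confidence=0.5, fallback="utf-8"):
    """Choose an encoding name from a chardet-style result dict.

    Falls back when nothing was detected or the confidence is too low.
    """
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    if encoding is None or confidence < min_confidence:
        return fallback
    return encoding

# Same shape as the detection result shown above.
detected = {'confidence': 0.73, 'encoding': 'Windows-1252', 'language': ''}
print(pick_encoding(detected))                      # Windows-1252
print(pick_encoding(detected, min_confidence=0.9))  # utf-8

# A result with no encoding at all also falls back.
empty = {'confidence': 0.0, 'encoding': None, 'language': None}
print(pick_encoding(empty))                         # utf-8
```

The returned name can then be passed as `encoding=` to `pd.read_csv`, as the next cell does with the detected value directly.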

In [17]:
kickstarter_2016 = pd.read_csv("data/ks-projects-201612.csv", encoding='Windows-1252')


In [18]:
kickstarter_2016.head()

Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16
0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09 11:36:00,1000,2015-08-11 12:12:28,0,failed,0,GB,0,,,,
1,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26 00:20:50,45000,2013-01-12 00:20:50,220,failed,3,US,220,,,,
2,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16 04:24:11,5000,2012-03-17 03:24:11,1,failed,1,US,1,,,,
3,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29 01:00:00,19500,2015-07-04 08:35:03,1283,canceled,14,US,1283,,,,
4,1000014025,Monarch Espresso Bar,Restaurants,Food,USD,2016-04-01 13:38:27,50000,2016-02-26 13:38:27,52375,successful,224,US,52375,,,,


### My turn <a id='my-turn'></a>
[TOC](#TOC)

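The `UnicodeDecodeError` gets thrown on the second byte, so sampling the start of the file with `chardet` as before should work. The cell below reads the first 100,000 bytes rather than the first 10,000, which gives `chardet` a larger sample to judge from.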

In [24]:
with open("data/PoliceKillingsUS.csv", "rb") as f:
    result = chardet.detect(f.read(100000))

In [25]:
result

{'confidence': 0.73, 'encoding': 'Windows-1252', 'language': ''}
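
How many bytes to sample is a judgment call. One possible approach, sketched below under stated assumptions, is to retry with a larger prefix until the detector is reasonably confident. The helper `detect_with_growing_sample`, its default sizes, and the 0.7 threshold are hypothetical. The `detect` argument is any function that maps bytes to a chardet-style dict, such as `chardet.detect`.

```python
import os

def detect_with_growing_sample(path, detect, start=10_000, limit=1_000_000,
                               min_confidence=0.7):
    """Call detect() on ever larger prefixes of the file until it is confident.

    Returns the last result together with the number of bytes requested.
    """
    size = start
    file_size = os.path.getsize(path)
    while True:
        with open(path, "rb") as f:
            result = detect(f.read(size))
        confident = (result.get("confidence") or 0.0) >= min_confidence
        # Stop when confident, or when a larger sample is not possible or allowed.
        if confident or size >= min(limit, file_size):
            return result, size
        size *= 10
```

With `chardet` imported as in this notebook, a call would look like `detect_with_growing_sample("data/PoliceKillingsUS.csv", chardet.detect)`. A larger sample is not guaranteed to give a better guess, so the result still deserves a sanity check.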

In [26]:
police_killings = pd.read_csv("data/PoliceKillingsUS.csv", encoding='Windows-1252')

In [27]:
police_killings.head()

Unnamed: 0,id,name,date,manner_of_death,armed,age,gender,race,city,state,signs_of_mental_illness,threat_level,flee,body_camera
0,3,Tim Elliot,02/01/15,shot,gun,53.0,M,A,Shelton,WA,True,attack,Not fleeing,False
1,4,Lewis Lee Lembke,02/01/15,shot,gun,47.0,M,W,Aloha,OR,False,attack,Not fleeing,False
2,5,John Paul Quintero,03/01/15,shot and Tasered,unarmed,23.0,M,H,Wichita,KS,False,other,Not fleeing,False
3,8,Matthew Hoffman,04/01/15,shot,toy weapon,32.0,M,W,San Francisco,CA,True,attack,Not fleeing,False
4,9,Michael Rodriguez,04/01/15,shot,nail gun,39.0,M,H,Evans,CO,False,attack,Not fleeing,False


In [28]:
police_killings.shape

(2535, 14)

In [31]:
(police_killings.body_camera == True).sum()

271

In [33]:
(police_killings.armed == "gun").sum()

1398

In [34]:
police_killings.age.mean()

36.605370219690805

In [37]:
police_killings.age.std()

13.030773649714495

## Saving files with UTF-8 encoding <a id='saving-files'></a>
[TOC](#TOC)

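pandas writes CSV files in UTF-8 by default (`to_csv` uses `encoding='utf-8'` unless told otherwise), so saving a DataFrame that was read with the correct encoding produces a UTF-8 file. Python's built-in `open`, by contrast, may use a platform-dependent default, so pass `encoding='utf-8'` explicitly when writing text files by hand.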

In [38]:
kickstarter_2016.to_csv("ks-test.csv")

In [39]:
police_killings.to_csv("police-test.csv")
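
The same conversion can be done without pandas. The sketch below is a minimal standard-library version, assuming a small Windows-1252 file created on the spot in a temporary folder. The file names `legacy.csv` and `clean.csv` are made up for the example.

```python
import os
import tempfile

# Hypothetical stand-in for a legacy file: a tiny Windows-1252 CSV.
workdir = tempfile.mkdtemp()
legacy_path = os.path.join(workdir, "legacy.csv")
utf8_path = os.path.join(workdir, "clean.csv")
with open(legacy_path, "wb") as f:
    f.write("name,price\nCafé™,€5\n".encode("cp1252"))

# Decode with the source encoding, then write back out as UTF-8.
# newline="" keeps line endings exactly as they are in the source file.
with open(legacy_path, encoding="cp1252", newline="") as src:
    text = src.read()
with open(utf8_path, "w", encoding="utf-8", newline="") as dst:
    dst.write(text)

# The new file now decodes cleanly as UTF-8.
with open(utf8_path, encoding="utf-8") as f:
    print(f.read())
```

Naming the encoding on both the read and the write keeps the result independent of the machine's default settings.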

## More Practice <a id='more-practice'></a>
[TOC](#TOC)

In [40]:
file_guide = pd.read_csv("data/char-encoding/file_guide.csv")

In [43]:
_file = "data/char-encoding/harpers_ASCII.txt"
with open(_file, 'rb') as f:
    result = chardet.detect(f.read(10000))

In [44]:
result

{'confidence': 1.0, 'encoding': 'ascii', 'language': ''}

In [46]:
with open(_file, encoding='ascii') as f:
    lines = f.readlines(5000)

In [47]:
last_line = lines[-1]
last_line

'"In course you mean to be," interrupted Mr. Simmons, gravely; "but I\n'

In [48]:
last_line.encode("ascii")

b'"In course you mean to be," interrupted Mr. Simmons, gravely; "but I\n'
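
Note that `readlines(5000)` does not read 5,000 lines. The argument is a size hint: reading stops once the total length of the lines read so far passes roughly that many characters. So `lines[-1]` here is the last line of the first few thousand characters, not the last line of the file. A small sketch with an in-memory file (the text and the hint of 15 are arbitrary choices for illustration):

```python
import io

# Ten short lines, six characters each including the newline.
text = "".join(f"row {i}\n" for i in range(10))

all_lines = io.StringIO(text).readlines()
# With a hint, reading stops once the running total passes about 15 characters.
some_lines = io.StringIO(text).readlines(15)

print(len(all_lines), len(some_lines))
print(repr(some_lines[-1]))  # last line read, not the last line of the text
```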

Next, a file encoded as Windows-1251:

In [52]:
_file = "data/char-encoding/olaf_Windows-1251.txt"
with open(_file, 'rb') as f:
    result = chardet.detect(f.read(10000))

In [53]:
result

{'confidence': 0.99, 'encoding': 'windows-1251', 'language': 'Bulgarian'}

In [54]:
with open(_file, encoding='windows-1251') as f:
    lines = f.readlines(5000)

In [58]:
lines[-1]

' сметнат за самилюбец. Не е чудно и да е такъв, понеже е творец, и както всичките негови добри колеги гали в душата си звярът на отвращението към светите чувства на филистрите, към скрижалните заповеди на чичо Сноба. Бащино наследство е у него обичта му към всичко достойно за обич; а при друго едно наследство - светата злоба - нашият поет се е постарал да попритури и от себе си нещичко. Това, що изповядват обикновено хероите на словото, е и негова изповед: гневът е мерило на моята любов! И горд е той не на шега, но с шеги мисли, че добре прикрива гордостта си пред нищите духом - некадърниците да проумеят в това трагизма на тая гордост. А неговий смях и присмех над себе си е такъв добър коментар, какъвто не са дори самотността и тъмеящата понякога в погледа му меланхолия!\n'

In [57]:
lines[-1].encode('windows-1251')

b' \xf1\xec\xe5\xf2\xed\xe0\xf2 \xe7\xe0 \xf1\xe0\xec\xe8\xeb\xfe\xe1\xe5\xf6. \xcd\xe5 \xe5 \xf7\xf3\xe4\xed\xee \xe8 \xe4\xe0 \xe5 \xf2\xe0\xea\xfa\xe2, \xef\xee\xed\xe5\xe6\xe5 \xe5 \xf2\xe2\xee\xf0\xe5\xf6, \xe8 \xea\xe0\xea\xf2\xee \xe2\xf1\xe8\xf7\xea\xe8\xf2\xe5 \xed\xe5\xe3\xee\xe2\xe8 \xe4\xee\xe1\xf0\xe8 \xea\xee\xeb\xe5\xe3\xe8 \xe3\xe0\xeb\xe8 \xe2 \xe4\xf3\xf8\xe0\xf2\xe0 \xf1\xe8 \xe7\xe2\xff\xf0\xfa\xf2 \xed\xe0 \xee\xf2\xe2\xf0\xe0\xf9\xe5\xed\xe8\xe5\xf2\xee \xea\xfa\xec \xf1\xe2\xe5\xf2\xe8\xf2\xe5 \xf7\xf3\xe2\xf1\xf2\xe2\xe0 \xed\xe0 \xf4\xe8\xeb\xe8\xf1\xf2\xf0\xe8\xf2\xe5, \xea\xfa\xec \xf1\xea\xf0\xe8\xe6\xe0\xeb\xed\xe8\xf2\xe5 \xe7\xe0\xef\xee\xe2\xe5\xe4\xe8 \xed\xe0 \xf7\xe8\xf7\xee \xd1\xed\xee\xe1\xe0. \xc1\xe0\xf9\xe8\xed\xee \xed\xe0\xf1\xeb\xe5\xe4\xf1\xf2\xe2\xee \xe5 \xf3 \xed\xe5\xe3\xee \xee\xe1\xe8\xf7\xf2\xe0 \xec\xf3 \xea\xfa\xec \xe2\xf1\xe8\xf7\xea\xee \xe4\xee\xf1\xf2\xee\xe9\xed\xee \xe7\xe0 \xee\xe1\xe8\xf7; \xe0 \xef\xf0\xe8 \xe4\xf0\xf3\xe3

Next, a file encoded as ISO-8859-1:

In [59]:
_file = "data/char-encoding/portugal_ISO-8859-1.txt"
with open(_file, 'rb') as f:
    result = chardet.detect(f.read(10000))

In [60]:
result

{'confidence': 0.73, 'encoding': 'ISO-8859-1', 'language': ''}

In [61]:
with open(_file, encoding='ISO-8859-1') as f:
    lines = f.readlines(5000)

In [68]:
lines[-1]

'            _Os escritos que saê da mão fóra\n'

In [69]:
lines[-1].encode('ISO-8859-1')

b'            _Os escritos que sa\xea da m\xe3o f\xf3ra\n'

Next, a UTF-8 file:

In [70]:
_file = "data/char-encoding/shisei_UTF-8.txt"
with open(_file, 'rb') as f:
    result = chardet.detect(f.read(10000))

In [71]:
result

{'confidence': 1.0, 'encoding': 'UTF-8-SIG', 'language': ''}

In [72]:
with open(_file, encoding='UTF-8-SIG') as f:
    lines = f.readlines(5000)

In [73]:
lines[-1]

'\u3000丁度《ちやうど》四｜年目《ねんめ》の夏《なつ》のとあるゆふべ、深川《ふかがは》の料理屋《れうりや》平淸《ひらせい》の前《まへ》を通《とほ》りかかつた時《とき》、彼《かれ》はふと門口《かどぐち》に待《ま》つて居《ゐ》る駕籠《かご》の簾《すだれ》のかげから眞白《まつしろ》な女《をんな》の素足《すあし》のこぼれて居《ゐ》るのに氣がついた。銳《するど》い彼《かれ》の眼《め》には、人間《にんげん》の足《あし》はその顏《かほ》と同《おな》じやうに複雜《ふくざつ》な表情《へうじやう》を持《も》つて映《うつ》つた。その女《をんな》の足《あし》は、彼《かれ》に取《と》つては貴《たつと》き肉《にく》の寶玉《はうぎよく》であつた。拇指《おやゆび》から起《おこ》つて小指《こゆび》に終《をは》る繊細《せんさい》な五｜本《ほん》の指《ゆび》の整《とゝの》ひ方《かた》、繪《ゑ》の島《しま》の海邊《うみべ》で獲《と》れるうすべに色《いろ》の貝《かひ》にも劣《おと》らぬ爪《つめ》の色合《いろあひ》、珠《たま》のやうな踵《きびす》のまる味《み》、淸冽《せいれつ》な岩間《いはま》の水《みづ》が絕《た》えず足下《あしもと》を洗《あら》ふかと疑《うたが》はれる皮膚《ひふ》の潤澤《じゆんたく》。この足《あし》こそは、やがて男《をとこ》の生血《いきち》に肥《こ》え太《ふと》り、男《をとこ》のむくろを蹈《ふ》みつける足《あし》であつた。この足《あし》を持《も》つ女《をんな》こそは、彼《かれ》が永年《ながねん》たづねあぐむだ女《をんな》の中《なか》の女《をんな》であらうと思《おも》はれた。淸吉《せいきち》は躍《をど》りたつ胸《むね》をおさへて、其《そ》の人《ひと》の顏《かほ》が見《み》たさに駕籠《かご》の後《あと》を追《お》ひかけたが、二三｜町《ちやう》行《ゆ》くと、もう其《そ》の影《かげ》は見《み》えなかつた。\n'

In [74]:
lines[-1].encode("UTF-8-SIG")

b'\xef\xbb\xbf\xe3\x80\x80\xe4\xb8\x81\xe5\xba\xa6\xe3\x80\x8a\xe3\x81\xa1\xe3\x82\x84\xe3\x81\x86\xe3\x81\xa9\xe3\x80\x8b\xe5\x9b\x9b\xef\xbd\x9c\xe5\xb9\xb4\xe7\x9b\xae\xe3\x80\x8a\xe3\x81\xad\xe3\x82\x93\xe3\x82\x81\xe3\x80\x8b\xe3\x81\xae\xe5\xa4\x8f\xe3\x80\x8a\xe3\x81\xaa\xe3\x81\xa4\xe3\x80\x8b\xe3\x81\xae\xe3\x81\xa8\xe3\x81\x82\xe3\x82\x8b\xe3\x82\x86\xe3\x81\xb5\xe3\x81\xb9\xe3\x80\x81\xe6\xb7\xb1\xe5\xb7\x9d\xe3\x80\x8a\xe3\x81\xb5\xe3\x81\x8b\xe3\x81\x8c\xe3\x81\xaf\xe3\x80\x8b\xe3\x81\xae\xe6\x96\x99\xe7\x90\x86\xe5\xb1\x8b\xe3\x80\x8a\xe3\x82\x8c\xe3\x81\x86\xe3\x82\x8a\xe3\x82\x84\xe3\x80\x8b\xe5\xb9\xb3\xe6\xb7\xb8\xe3\x80\x8a\xe3\x81\xb2\xe3\x82\x89\xe3\x81\x9b\xe3\x81\x84\xe3\x80\x8b\xe3\x81\xae\xe5\x89\x8d\xe3\x80\x8a\xe3\x81\xbe\xe3\x81\xb8\xe3\x80\x8b\xe3\x82\x92\xe9\x80\x9a\xe3\x80\x8a\xe3\x81\xa8\xe3\x81\xbb\xe3\x80\x8b\xe3\x82\x8a\xe3\x81\x8b\xe3\x81\x8b\xe3\x81\xa4\xe3\x81\x9f\xe6\x99\x82\xe3\x80\x8a\xe3\x81\xa8\xe3\x81\x8d\xe3\x80\x8b\xe3\x80\x81\xe5\xbd\xbc\x
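
The leading `\xef\xbb\xbf` in the output above is the UTF-8 byte order mark (BOM). The `utf-8-sig` codec writes it at the start of whatever it encodes, even a single line taken from the middle of a file. A short sketch of the difference between `utf-8` and `utf-8-sig`, using a made-up three-letter string:

```python
import codecs

text = "abc"

# "utf-8-sig" writes a three-byte signature (the BOM) before the text.
with_bom = text.encode("utf-8-sig")
print(with_bom)                              # b'\xef\xbb\xbfabc'
print(with_bom.startswith(codecs.BOM_UTF8))  # True

# Decoding with plain "utf-8" keeps the BOM as an invisible U+FEFF character.
kept = with_bom.decode("utf-8")
print(repr(kept))                            # '\ufeffabc'

# Decoding with "utf-8-sig" strips it, and also accepts input without a BOM.
stripped = with_bom.decode("utf-8-sig")
no_bom = text.encode("utf-8").decode("utf-8-sig")
print(stripped == text, no_bom == text)      # True True
```

Reading with `utf-8-sig` is therefore a safe choice for UTF-8 files that may or may not start with a BOM.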

Finally, a Big5 file:

In [75]:
_file = "data/char-encoding/yan_BIG-5.txt"
with open(_file, 'rb') as f:
    result = chardet.detect(f.read(10000))

In [76]:
result

{'confidence': 0.99, 'encoding': 'Big5', 'language': 'Chinese'}

In [77]:
with open(_file, encoding='Big5') as f:
    lines = f.readlines(5000)

In [78]:
lines[-1]

'    《家語》曰：「君子不博，為其兼行惡道故也。」《論語》云：「不\n'

In [79]:
lines[-1].encode("Big5")

b'    \xa1m\xaea\xbby\xa1n\xa4\xea\xa1G\xa1u\xa7g\xa4l\xa4\xa3\xb3\xd5\xa1A\xac\xb0\xa8\xe4\xad\xdd\xa6\xe6\xb4c\xb9D\xacG\xa4]\xa1C\xa1v\xa1m\xbd\xd7\xbby\xa1n\xa4\xaa\xa1G\xa1u\xa4\xa3\n'
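
To tie the practice files back to the goal of working in UTF-8, the sketch below transcodes a short phrase from the line above. It is only an illustration: the variable names are made up, and the phrase is the first four characters of that line.

```python
phrase = "《家語》"

big5_bytes = phrase.encode("big5")
utf8_bytes = phrase.encode("utf-8")

# Same four characters, different byte sequences and lengths.
print(len(big5_bytes), big5_bytes)
print(len(utf8_bytes), utf8_bytes)

# Transcoding means decoding with the source codec, then encoding with the target.
transcoded = big5_bytes.decode("big5").encode("utf-8")
print(transcoded == utf8_bytes)  # True

# Decoding the Big5 bytes as UTF-8 fails rather than producing text.
try:
    big5_bytes.decode("utf-8")
    utf8_accepts_big5 = True
except UnicodeDecodeError:
    utf8_accepts_big5 = False
print(utf8_accepts_big5)  # False
```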

[TOC](#TOC)