# [Character Encodings](https://www.kaggle.com/code/alexisbcook/character-encodings)

In [1]:
import pandas as pd

import numpy as np
np.random.seed(0)

# Character encoding module
import charset_normalizer

## UTF-8 and ASCII encodings

You can convert a `str` object into a `bytes` object by specifying which encoding to use:

In [2]:
# Start with a string
before = 'This is the euro symbol: €'

# Encode the string to a different encoding, replacing characters that raise errors
# NOTE: "errors='replace'" will not be invoked here, because none of the characters will cause any error
after = before.encode('utf-8', errors='replace')

print('before_type:', type(before))
print('after_type:', type(after))

before_type: <class 'str'>
after_type: <class 'bytes'>


In [3]:
print(before)
print(after)

This is the euro symbol: €
b'This is the euro symbol: \xe2\x82\xac'


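The `bytes` object has a "b" in front of it. This is because `bytes` are printed as though they were characters encoded in `ASCII`. <br>
The euro symbol has no ASCII representation, so its three UTF-8 bytes are printed as hexadecimal escapes (`\xe2\x82\xac`). This is not mojibake: the data is intact and decodes back to the original string.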
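To make the character/byte distinction concrete, here is a small sketch (variable names are illustrative) that compares the length of the string with the length of its UTF-8 encoding and looks at the euro sign's bytes as integers:

```python
# Illustrative check: characters vs. bytes for the same text
text = 'This is the euro symbol: €'
encoded = text.encode('utf-8')

n_chars = len(text)               # number of Unicode characters
n_bytes = len(encoded)            # number of bytes after UTF-8 encoding
euro_bytes = list(encoded[-3:])   # the euro sign's bytes as integers

print(n_chars, n_bytes)   # 26 28
print(euro_bytes)         # [226, 130, 172]
```

All three values are above 127, which is why none of them can be shown as an ASCII character.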

In [4]:
# Convert it back, specifying which encoding it's in
back_again = after.decode('utf-8')

print(type(back_again))
print(back_again)

<class 'str'>
This is the euro symbol: €


When we try to decode the bytes with a different encoding, we can get an error, because that encoding has no mapping for some of the bytes we pass it. You need to tell Python the encoding that the byte string was actually written in.

In [5]:
# Trying to decode our bytes with ASCII encoding
print(after.decode('ascii'))

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 25: ordinal not in range(128)

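The error above tells us that the bytes aren't valid ASCII: byte `0xe2` (the first byte of the euro symbol, at position 25) lies outside the ASCII range of 0 to 127, so the data cannot be decoded as ASCII.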
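For contrast, here is a sketch of what `errors='replace'` does when the target encoding really cannot represent a character. No exception is raised, but the original character is lost for good:

```python
before = 'This is the euro symbol: €'

# Encoding to ASCII with errors='replace' swaps the euro sign for '?'
lossy = before.encode('ascii', errors='replace')
print(lossy)                  # b'This is the euro symbol: ?'

# Decoding cannot bring the original character back
print(lossy.decode('ascii'))  # This is the euro symbol: ?

# Decoding UTF-8 bytes as ASCII with errors='replace' gives one U+FFFD per bad byte
garbled = before.encode('utf-8').decode('ascii', errors='replace')
print(garbled.count('\ufffd'))  # 3
```

Silencing the error this way is rarely what you want for real data; it is usually better to find the right encoding.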

## Reading in files with encoding problems

Most files will be encoded with UTF-8. This is what `pd.read_csv` expects by default, so most of the time you won't run into problems. However, sometimes you'll get an error like this:

In [6]:
# Trying to read a file that is not UTF-8
kickstarter_2016 = pd.read_csv('ks-projects-201612.csv')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x99 in position 7955: invalid start byte

The error above tells us that this file isn't in UTF-8. We don't know what encoding it actually is though. <br>
One way to figure it out is to try and test a bunch of different character encodings and see if any of them work. A better way, though, is to use the `charset_normalizer` module to try and automatically guess what the right encoding is. It's not 100% guaranteed to be right, but it's usually faster than just trying to guess.

We'll look at just the first ten thousand bytes of this file. This is usually enough for a good guess about what the encoding is and is much faster than trying to look at the whole file. <br>
Another reason to just look at the first part of the file is that we can see by looking at the error message that the problem is in position 7955. So we probably only need to look at the first little bit of the file to figure out what's going on.

In [7]:
# Look at the first 10k bytes to guess character encoding
with open('ks-projects-201612.csv', 'rb') as rawdata:
    result = charset_normalizer.detect(rawdata.read(10000))

print(result)

{'encoding': 'windows-1250', 'language': 'English', 'confidence': 1.0}


In [8]:
with open('ks-projects-201612.csv', 'rb') as rawdata:
    result = charset_normalizer.detect(rawdata.read(3000000))

print(result)

{'encoding': 'windows-1250', 'language': 'English', 'confidence': 1.0}


In [9]:
with open('ks-projects-201612.csv', 'rb') as rawdata:
    result = charset_normalizer.detect(rawdata.read())

print(result)

{'encoding': 'mac_iceland', 'language': 'English', 'confidence': 0.9992}


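NOTE: The guess depends on how much of the file is read: `windows-1250` for the first 10,000 and 3,000,000 bytes, but `mac_iceland` for the whole file. The tutorial suggests a trial-and-error approach.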
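When guesses disagree, one option is to check a candidate against the whole file before handing it to pandas. The sketch below streams bytes through the standard library's `codecs.getincrementaldecoder`, so the file never has to fit in memory. The helper name `decodes_cleanly` and the synthetic byte string are assumptions made for illustration; they are not part of the tutorial.

```python
import codecs
import io

def decodes_cleanly(stream, encoding, chunk_size=64 * 1024):
    """Return True if every byte of a binary stream decodes with `encoding`."""
    decoder = codecs.getincrementaldecoder(encoding)()
    try:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                # Flush: raises if the data ended in the middle of a character
                decoder.decode(b'', final=True)
                return True
            decoder.decode(chunk)
    except UnicodeDecodeError:
        return False

# Synthetic stand-in for a file: plain ASCII, then one byte (0x83) far from the start
data = b'a' * 20000 + b'\x83'

prefix_ok = decodes_cleanly(io.BytesIO(data[:10000]), 'windows-1250')
full_1250 = decodes_cleanly(io.BytesIO(data), 'windows-1250')
full_1252 = decodes_cleanly(io.BytesIO(data), 'windows-1252')
print(prefix_ok, full_1250, full_1252)  # True False True
```

With a real file, the stream would come from `open(path, 'rb')`. A clean decode only rules an encoding in as a candidate; it does not prove the decoded text is what the author wrote.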

In [10]:
# Reading the file in mac_iceland encoding
kickstarter_2016 = pd.read_csv('ks-projects-201612.csv', encoding='mac_iceland')



In [11]:
kickstarter_2016.head()

Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16
0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09 11:36:00,1000,2015-08-11 12:12:28,0,failed,0,GB,0,,,,
1,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26 00:20:50,45000,2013-01-12 00:20:50,220,failed,3,US,220,,,,
2,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16 04:24:11,5000,2012-03-17 03:24:11,1,failed,1,US,1,,,,
3,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29 01:00:00,19500,2015-07-04 08:35:03,1283,canceled,14,US,1283,,,,
4,1000014025,Monarch Espresso Bar,Restaurants,Food,USD,2016-04-01 13:38:27,50000,2016-02-26 13:38:27,52375,successful,224,US,52375,,,,


## Saving files in UTF-8 encoding

In [12]:
# Saving file (will be saved as UTF-8 by default)
kickstarter_2016.to_csv('ks-projects-201801-utf8_from_mac_iceland.csv')
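The same re-encoding can be sketched without pandas, using only the built-in `open`. This is a minimal illustration under stated assumptions: the helper `transcode_to_utf8`, the file names, and the sample content are all hypothetical, and the source encoding must already be known.

```python
import os
import tempfile

def transcode_to_utf8(src_path, dst_path, src_encoding):
    """Copy a text file, decoding with src_encoding and writing UTF-8."""
    # newline='' keeps the original line endings untouched
    with open(src_path, 'r', encoding=src_encoding, newline='') as src, \
         open(dst_path, 'w', encoding='utf-8', newline='') as dst:
        for line in src:
            dst.write(line)

with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, 'legacy.csv')       # hypothetical file names
    dst_path = os.path.join(tmp, 'legacy-utf8.csv')
    with open(src_path, 'wb') as f:
        f.write('name\nCafé™\n'.encode('windows-1252'))

    transcode_to_utf8(src_path, dst_path, 'windows-1252')

    with open(dst_path, 'rb') as f:
        utf8_bytes = f.read()

print(utf8_bytes.decode('utf-8'))
```

Unlike `to_csv`, this copies the text line by line and does not parse or rewrite the CSV structure.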

## Trial-and-error with different encodings

In [13]:
# Reading the file in windows-1250 encoding
kickstarter_2016 = pd.read_csv('ks-projects-201612.csv', encoding='windows-1250')
kickstarter_2016.head()

UnicodeDecodeError: 'charmap' codec can't decode byte 0x83 in position 215710: character maps to <undefined>

In [14]:
# Reading the file in windows-1252 encoding
kickstarter_2016 = pd.read_csv('ks-projects-201612.csv', encoding='windows-1252')
kickstarter_2016.head()



Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16
0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09 11:36:00,1000,2015-08-11 12:12:28,0,failed,0,GB,0,,,,
1,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26 00:20:50,45000,2013-01-12 00:20:50,220,failed,3,US,220,,,,
2,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16 04:24:11,5000,2012-03-17 03:24:11,1,failed,1,US,1,,,,
3,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29 01:00:00,19500,2015-07-04 08:35:03,1283,canceled,14,US,1283,,,,
4,1000014025,Monarch Espresso Bar,Restaurants,Food,USD,2016-04-01 13:38:27,50000,2016-02-26 13:38:27,52375,successful,224,US,52375,,,,


In [15]:
kickstarter_2016.to_csv('ks-projects-201801-utf8_from_windows-1252.csv')
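Both `mac_iceland` and `windows-1252` read the file without an error, but that does not mean they produce the same text. The sketch below runs the trial-and-error loop on a few made-up bytes (not taken from the Kickstarter file) so the decoded results can be compared by eye:

```python
sample = b'Tea\x99 \x83'   # made-up bytes, not taken from the Kickstarter file
candidates = ['utf-8', 'windows-1250', 'windows-1252', 'mac_iceland']

results = {}
for enc in candidates:
    try:
        results[enc] = sample.decode(enc)
    except UnicodeDecodeError:
        results[enc] = None   # this encoding rejects the bytes

for enc, text in results.items():
    print(f'{enc:>13}: {text!r}')
```

Here `utf-8` and `windows-1250` reject the bytes, while the other two succeed with different characters. Choosing between encodings that both succeed means inspecting rows with non-ASCII characters and seeing which reading makes sense.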