Character Encoding
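- A character encoding is a set of rules for mapping raw bytes to the characters that make up human-readable text.
- UTF-8 is the de facto standard text encoding, and it is the default for Python 3's `str.encode()` and `bytes.decode()`.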

In [None]:
vor = "This is the dollar symbol: $"
type(vor)

str

In [2]:
nach = vor.encode("utf-8", errors="replace")
type(nach)

bytes

In [3]:
nach

b'This is the dollar symbol: $'

In [4]:
nach.decode("utf-8")

'This is the dollar symbol: $'
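The dollar sign is plain ASCII, so its UTF-8 bytes look identical to the text. A minimal sketch with a non-ASCII character makes the byte-to-character mapping more visible. The variable names are illustrative and not part of the original notebook.

```python
# Illustrative names; not part of the original notebook.
text = "\u20ac"  # the euro sign

utf8_bytes = text.encode("utf-8")      # three bytes in UTF-8
cp1252_bytes = text.encode("cp1252")   # a single byte in Windows-1252

print(utf8_bytes)
print(cp1252_bytes)
print(len(text), len(utf8_bytes), len(cp1252_bytes))

# Decoding with the same codec that produced the bytes recovers the text.
assert utf8_bytes.decode("utf-8") == text
assert cp1252_bytes.decode("cp1252") == text
```

The same character maps to different byte sequences under different encodings, which is why bytes cannot be interpreted without knowing which encoding produced them.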

In [10]:
before = "This is the euro symbol: €"

# encode to ASCII, replacing characters that ASCII cannot represent with "?"
after = before.encode("ascii", errors="replace")

# decode the ASCII bytes back to a str; the replaced character is not recovered
print(after.decode("ascii"))

This is the euro symbol: ?
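The `?` is permanent: once a character has been replaced during encoding, decoding cannot bring it back. The sketch below compares the standard `errors` handlers of `str.encode` and shows what happens when UTF-8 bytes are decoded with the wrong codec. The sample string is an assumption chosen for illustration.

```python
sample = "price: 5\u20ac"  # "price: 5" followed by the euro sign

# "strict" is the default handler: it raises instead of silently losing data.
try:
    sample.encode("ascii")
except UnicodeEncodeError as exc:
    print("strict:", exc.reason)

# "replace" substitutes "?", "ignore" drops the character,
# "backslashreplace" keeps a recoverable escape sequence.
replaced = sample.encode("ascii", errors="replace")
ignored = sample.encode("ascii", errors="ignore")
escaped = sample.encode("ascii", errors="backslashreplace")
print(replaced, ignored, escaped)

# Decoding UTF-8 bytes with the wrong codec raises no error here;
# it silently produces garbled text (mojibake).
garbled = sample.encode("utf-8").decode("cp1252")
print(garbled)
```

Of the three lossy handlers, only `backslashreplace` leaves enough information to reconstruct the original character later.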


It is best practice to convert all text to UTF-8 as early as possible and keep it in that encoding. To quickly determine the encoding of a file, we can use the `charset_normalizer` module, which guesses the most likely encoding from a sample of the file's raw bytes.

In [None]:
import pandas as pd
import numpy as np
import charset_normalizer

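# open in binary mode: detect() expects raw bytes, not an already-decoded str
with open("catalog.csv", "rb") as rawdata:
    # the first 10,000 bytes are usually enough for a reasonable guess
    result = charset_normalizer.detect(rawdata.read(10000))

print(result)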
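Once an encoding has been guessed, the advice to convert to UTF-8 and stay there can be sketched with the standard library alone. `convert_to_utf8` is a hypothetical helper, and the demo uses a throwaway file with a known encoding instead of the real `catalog.csv`. In practice, `source_encoding` would come from the `encoding` entry of the detection result, which is only a guess and is worth checking by eye.

```python
import tempfile
from pathlib import Path


def convert_to_utf8(src: Path, dst: Path, source_encoding: str) -> None:
    """Re-encode a text file as UTF-8 (hypothetical helper, not from the notebook)."""
    text = src.read_text(encoding=source_encoding)
    dst.write_text(text, encoding="utf-8")


# Demonstration on a temporary file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "legacy.csv"
    dst = Path(tmp) / "legacy_utf8.csv"

    # Write a small CSV in Windows-1252, as a legacy export might be.
    src.write_bytes("name,price\ncaf\u00e9,5\u20ac\n".encode("cp1252"))

    # The encoding is hard-coded here because we created the file ourselves.
    convert_to_utf8(src, dst, source_encoding="cp1252")

    converted = dst.read_text(encoding="utf-8")
    print(converted)
```

Writing to a separate destination file keeps the original bytes intact in case the guessed encoding turns out to be wrong.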